Reindex-Then-Adapt: Improving Large Language Models for Conversational Recommendation

Zhankui He

1,∗

Zhouhang Xie

Harald Steck

Dawen Liang

Rahul Jha

Nathan Kallus

2,3

Julian McAuley

UC San Diego

Netﬂix

Cornell University

∗

[email protected]

Abstract

Large language models (LLMs) are revolutionizing conversational recommender sys-

tems by adeptly indexing item content, understanding complex conversational contexts,

and generating relevant item titles. However, controlling the distribution of recom-

mended items remains a challenge. This leads to suboptimal performance due to the

failure to capture rapidly changing data distributions, such as item popularity, on tar-

geted conversational recommendation platforms. In conversational recommendation,

LLMs recommend items by generating the titles (as multiple tokens) autoregressively,

making it diﬃcult to obtain and control the recommendations over all items. Thus,

we propose a Reindex-Then-Adapt (RTA) framework, which converts multi-token item

titles into single tokens within LLMs, and then adjusts the probability distributions

over these single-token item titles accordingly. The RTA framework marries the benefits of both LLMs and traditional recommender systems (RecSys): understanding complex queries as LLMs do, while efficiently controlling the recommended item distributions in conversational recommendations as traditional RecSys do. Our framework

demonstrates improved accuracy metrics across three diﬀerent conversational recom-

mendation datasets and two adaptation settings.

1 Introduction

Conversational Recommender Systems (Christakopoulou et al., 2016; Li et al., 2018; He

et al., 2023) (CRS) are an emerging recommendation task aiming to suggest relevant and

personalized items via interactive dialogues between users and systems. Recently, Large Lan-

guage Models (Schulman et al., 2022; Brown et al., 2020b; He et al., 2023; Feng et al., 2023;

Kang et al., 2023) (LLMs) have demonstrated proﬁciency in understanding user intentions

within natural language conversational contexts and exhibited substantial domain-speciﬁc

knowledge (e.g., in movies). Consequently, LLMs oﬀer distinct advantages for CRS and

outperform existing non-LLM baselines (He et al., 2023; Wang et al., 2023b; Feng et al.,

2023). This has garnered signiﬁcant interest within the research community, positioning

LLMs as an indispensable component of CRS.

In this work, we ﬁrst provide preliminary analysis for LLMs as conversational recom-

menders. In detail, we view LLMs for conversational recommendations as Diﬀerentiable

arXiv:2405.12119v1 [cs.IR] 20 May 2024

Figure 1: Representative items (The Dark Knight and Black Panther) demonstrate popularity misalignments between the dataset (ReDIAL (Li et al., 2018)) and the LLM (Llama2-7b (Touvron et al., 2023)). This misalignment implies significant room for improvement in recommendation accuracy. Our Reindex-Then-Adapt (RTA) framework addresses this gap by aligning the distributions (e.g., item popularities), leading to substantial accuracy improvements.

Search Indexing (DSI) (Tay et al., 2022; Chen et al., 2023) models, then study LLMs’

abilities and limitations for item-indexing and item-recommendation tasks:

• Abilities: LLMs have indexed numerous popular movies, potentially adequate to

understand complex conversation contexts and address many movie conversational

recommendation scenarios.

• Limitations: LLMs exhibit misalignment with data distributions from target plat-

forms, as illustrated by item popularity in Figure 1. Moreover, data distributions like

item popularity evolve rapidly in practice, making adjusting LLMs more challenging.

We propose to overcome the aforementioned misalignment limitation by easily adjust-

ing LLMs towards changing target distributions. Figure 1 illustrates an example with an

LLM, Llama2-7b (Touvron et al., 2023), on the conversational recommendation dataset, ReDIAL (Li et al., 2018). Despite the promising conversational recommendations by LLMs (He

et al., 2023), Figure 1(a) points out a lack of alignment with the data distribution of the

target recommendation platform. For example, The Dark Knight is popular on ReDIAL (Li

et al., 2018) but not within the LLM, while Black Panther presents a contrasting scenario.

Our proposed approach alleviates this issue by adjusting the recommendation distributions

for all target items from LLMs. Figure 1(b) shows our approach results in more aligned item

popularity between LLMs and the target dataset or platform for recommended items such

as The Dark Knight and Black Panther. These alignments bring additional recommenda-

tion accuracy improvements, and may have broader beneﬁts, including controllability and

fairness.

To achieve this recommendation probability distribution adjustment, there exists a tech-

nical challenge. Unlike adjusting recommendation probability distributions over all target

Figure 2: Item indexing task using movie descriptions from Wikipedia (Auer et al., 2007) to prompt for movie titles. We test the MPT-7b, Mistral-7b, Llama2-7b and GPT-3.5-turbo models and group the accuracy by the number of occurrences of each movie in the ReDIAL CRS dataset (Li et al., 2018). We measure performance with HIT@5, i.e., whether the target movie appears in the top-5 movie list generated by the LLM, to reflect the movie knowledge stored in the parameters of the LLMs.

items via tweaking the logit vectors in traditional RecSys, obtaining such logit vectors from

LLMs is challenging due to their generative retrieval paradigm for CRS (He et al., 2023).

LLMs generate recommendations by auto-regressively producing multiple item titles (e.g.,

Top-10), represented by varying numbers of tokens. This process makes obtaining proba-

bility distributions over all recommended items computationally expensive, hindering sub-

sequent control or adjustment eﬀorts.

To overcome this challenge, we propose a Reindex-Then-Adapt (RTA) framework. Treating LLMs as DSI models for conversational recommendations, we first conduct a reindex step: for original LLMs, we convert already-indexed multi-token item titles (e.g., Edge of Tomorrow) into single tokens (e.g., |Edge of Tomorrow|). We then conduct an adapt step: for reindexed LLMs, the recommended item distributions can be obtained efficiently and then adjusted, e.g., via bias term adjustment or RecSys gating.

Based on the RTA framework, we investigate four reindexing modules and two adaptation

strategies across three CRS datasets. Our experimental results show improved recommen-

dation accuracy in CRS. For instance, we improve the recommendation accuracy for the

original Llama2-7b by 59.37% in terms of the Top-10 Hit Rate, surpassing all open-source

baselines. Additionally, our studies highlight the signiﬁcance of adjusting LLMs towards

target distributions in CRS and provide insights into scheduling conversational recommen-

dation modules with LLMs.

2 Preliminaries

2.1 Task Formulation

In CRS, a conversation is represented by $C = (u_t, s_t, \mathcal{I}_t)_{t=1}^{T}$ involving users $u_t \in \mathcal{U}$ and items $\mathcal{I}_t \subseteq \mathcal{I}$ with $T$ conversation turns. Each utterance $s_t$ comprises tokens $v_i$ from vocabulary $\mathcal{V}$. A conversation typically involves a seeker and a recommender. As formulated in CRS studies (Li et al., 2018; Chen et al., 2019; Zhou et al., 2020; Wang et al., 2022; He et al., 2023), our goal is to learn a recommender to generate a ranked list of items $\hat{\mathcal{I}}_k$ at turn $k$ that aligns with $\mathcal{I}_k$, based on the preceding context $(u_t, s_t, \mathcal{I}_t)_{t=1}^{k-1}$.

Table 1: Additional statistics for cold, warm and popular items in the ReDIAL (Li et al., 2018) dataset. #Items counts the total number of items in each group and #Occurrences sums the total occurrences of those items in conversations.

| ReDIAL | Cold – [0, 10) | Warm – [10, 100) | Pop. – [100, +inf) |
| --- | --- | --- | --- |
| #Items | 4,960 (78.97%) | 1,193 (18.99%) | 128 (2.04%) |
| #Occurrences | 12,523 (18.03%) | 33,304 (47.94%) | 23,647 (34.04%) |

2.2 Diﬀerentiable Search Index (DSI)

The transformer model has shown proﬁciency in retrieval tasks, encoding item information

within its parameters, which is a method termed Diﬀerentiable Search Index (DSI) (Tay

et al., 2022). DSI involves two key training tasks for pre-trained language models: Learn to

Index (L2I) and Learn to Retrieve (L2R), which can be used to train a model jointly or in a

sequential order. L2I focuses on mapping item content, such as movie description, to item

indices, exempliﬁed by linking a description of Edge of Tomorrow to its title:

L2I Example: “A 2014 American science ﬁction action ﬁlm starring Tom Cruise and Emily

Blunt with ...” → Edge of Tomorrow

L2R, on the other hand, maps queries to item indices, such as:

L2R Example: “I’m feeling bored today and looking for a sci-ﬁ action movie, preferably

starring Tom Cruise.” → Edge of Tomorrow

DSI models were originally proposed for text retrieval tasks (Tay et al., 2022), yet their formulation can be connected to LLMs used in CRS. Viewing the LLMs-as-CRS framework proposed by He et al. (2023) through the lens of DSI, we observe that:

• Item Indexing: LLMs index items by using the item titles (e.g., “Edge of Tomor-

row”) as the item identiﬁers via L2I.

• Item Recommendation: LLMs use conversational context as queries to generate

item indices via L2R.

Thus, LLMs inherently function as DSI models, by including a certain number of training

samples for L2I and L2R tasks in their pre-training corpus. Compared to common two-tower

models, DSI models require only a single model for item recommendations, by indexing item

information into its parameters (Tay et al., 2022).

2.3 Item Indexing: LLMs Show Suﬃcient Item Content Knowledge

According to (He et al., 2023), LLMs demonstrate superior knowledge in content and con-

text, particularly in the movie domain. This proﬁciency is attributed to the performance in

the “Learn to Index” (L2I) task, as viewed through the DSI (Tay et al., 2022). Therefore,

our primary concern is the extent to which item content has been indexed in LLMs through

the pre-training corpus.

Figure 3: Visualization of monthly relative item popularity in the Reddit-Movie (He et al., 2023) dataset, the only CRS dataset with long-range timestamps in the wild. Item popularities change rapidly over time.

2.3.1 Observation

We gathered 6,281 pairs of movie titles from ReDIAL (Li et al., 2018) and their descriptions from Wikipedia for the experiments in Figure 2, which assess LLMs' performance on the L2I task. We observe that:

• Good Content Knowledge for Popular Items: All LLMs have indexed a considerable amount of movie content for conversational recommendation tasks. Notably, for frequently mentioned movies, defined as those that occur at least 100 times in the ReDIAL dataset (Li et al., 2018), i.e., the occurrence range [100, +inf), all LLMs exhibit impressive content knowledge.

• Best LLMs: The proprietary model GPT-3.5-t (Schulman et al., 2022) outperforms

others. Among the open-sourced LLMs of similar size, Llama2 demonstrates the best

performance in the given task, as shown in Figure 2, making it the chosen base model

for our subsequent experiments.

2.3.2 Impact

Figure 2 shows that, in terms of item indexing capability as DSI models, LLMs without

speciﬁc ﬁne-tuning have already indexed numerous popular movies. Table 1 shows the

imbalance of item occurrences in conversational recommendations, where items labeled as

warm and pop. constitute about 20% in terms of item counts but contribute to over 80% of

occurrences. This suggests that zero-shot LLMs may be suﬃcient for handling many

movie conversational recommendation scenarios, because many are about warm or

popular movies. That said, fine-tuning LLMs to cover more cold items remains future work.
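The "about 20% of items, over 80% of occurrences" claim can be checked directly from the counts in Table 1. The short sketch below only re-derives those shares; the dictionary keys and the helper name `shares` are illustrative and not part of the original work.

```python
# Counts copied from Table 1 (ReDIAL); the bucket names are illustrative keys.
items = {"cold": 4960, "warm": 1193, "pop": 128}
occurrences = {"cold": 12523, "warm": 33304, "pop": 23647}


def shares(counts):
    """Return each bucket's fraction of the total count."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


item_share = shares(items)
occ_share = shares(occurrences)

# Warm and popular items together: a small share of the catalogue,
# but a large share of what is actually mentioned in conversations.
warm_pop_item_share = item_share["warm"] + item_share["pop"]
warm_pop_occ_share = occ_share["warm"] + occ_share["pop"]

print(f"warm+pop share of items:       {warm_pop_item_share:.2%}")
print(f"warm+pop share of occurrences: {warm_pop_occ_share:.2%}")
```

The item total recovered this way (6,281) matches the number of title-description pairs used in Section 2.3.1.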

Footnote (Wikipedia): https://www.wikipedia.org/

[Figure 4 diagram. Given a conversation context (e.g., "I like Back to the Future … I need a sci-fi movie with my family …"), the LLM generates a list of item titles (1. Edge of Tomorrow, 2. Terminator, 3. The Matrix, …). Reindex: L2R and L2I samples are used to map multi-token titles to single tokens such as |Edge_of_Tomorrow|, |Terminator| and |The_Matrix|, which yields a logit vector over items. Adapt: bias term adjustment or RecSys gating changes the item probabilities, e.g., from |Edge_of_Tomorrow| 0.23, |The_Martian| 0.12, |Terminator| 0.11, |The_Matrix| 0.07 to |The_Martian| 0.25, |Edge_of_Tomorrow| 0.22, |The_Matrix| 0.10, |Terminator| 0.08.]

Figure 4: Reindex-Then-Adapt (RTA) Framework. LLMs can generate a list of item titles as

the recommendations given the conversation contexts. To further improve the accuracy and

controllability, we conduct (1) reindex step: reindexing the item (e.g., movie) titles in LLMs

as single tokens to obtain the predicted logit vectors eﬃciently; (2) adapt step: adapting the

recommenders towards target data distributions eﬀectively with multiple options on the logit

vectors such as adjusting bias terms or combining RecSys models with Gating mechanism.

2.4 Item Recommendation: LLMs Show Severe Distribution Misalignment

2.4.1 Observation

In this section, we aim to show that even though LLMs can index items eﬀectively, the data

distributions LLMs ﬁtted in training do not match the target inference distributions for

CRS. We discuss this distribution misalignment from two perspectives, using item popularity

distribution as an example:

• Static Perspective: As depicted in Figure 1(a), LLMs reﬂect item popularities from

the large-scale training corpus to some extent, which often do not align with popular

items on the speciﬁc platform for CRS.

• Dynamic Perspective: Target data distributions, such as item popularity, undergo rapid changes over time due to factors like seasons and promotion strategies. For instance, Figure 3 shows that monthly relative item popularities on the Reddit-Movie (He et al., 2023) dataset change over time, which cannot be captured by a static LLM, even a fine-tuned one.

2.4.2 Impact

Our observations highlight the distribution misalignments between items on the target platform and those recommended by LLMs. This misalignment is considered from both static and dynamic perspectives, suggesting that: (1) despite LLMs exhibiting impressive performance in conversational recommendation (He et al., 2023), there exists room for improving recommendation accuracy by aligning with the distributions of target platforms; (2) due to the dynamic nature of recommendation platforms, target data distributions change rapidly, necessitating more efficient methods to adjust item recommendations from LLMs accordingly.

3 Framework

3.1 Overview

As discussed in Section 1, we argue that representing target items with varying token counts

in LLMs poses challenges for adjusting recommendation distributions over all target items.

To tackle this, we view LLMs as DSI models that have already indexed suﬃcient item con-

tent knowledge (see Section 2.3), and propose the Reindex-Then-Adapt (RTA) framework,

illustrated in Figure 4:

1. Reindex item indices with varying token numbers in LLMs into single-token item

indices using a mixture of data samples from the L2I corpus and/or L2R corpus. This

aims to remove the adapt step barrier. In contrast to the index step in the original

DSI models (Tay et al., 2022), the reindex step reuses the content of indexed items

from the LLMs, thereby facilitating the learning process for the new item indices.

2. Adapt logits from the reindexed LLMs, achieved by transforming the logit vectors

or by combining with other traditional RecSys using Gating mechanism (Hochreiter

and Schmidhuber, 1997; Chung et al., 2014; Gu et al., 2020). This adjustment aims

to eﬀectively align the recommendation probability distributions over items with the

target data distributions.

3.2 Reindex Step: Single-Token Items in LLMs

The key to the reindex step is to "squeeze" multi-token item embeddings into single-token item embeddings efficiently, while preserving the semantics of the original item embeddings in LLM generation.

3.2.1 Identify Item Indices.

Formally, for a sentence $s = (v_i)_{i=1}^{m}$ consisting of tokens $v_i \in \mathcal{V}$, we denote the tokens representing an item for CRS tasks as $(v_i)_{i=j}^{j+n}$, where $j$ indicates the starting position of the first token for the item in $s$, and $n$ is the number of tokens representing this item. Consequently, with the Embed layer, LLMs can retrieve the sequence of token embeddings for this item:

$$(\mathbf{v}_i)_{i=j}^{j+n} = \mathrm{Embed}\left( (v_i)_{i=j}^{j+n} \right), \tag{1}$$

where $\mathbf{v}_i \in \mathbb{R}^d$ is the token embedding, and $d$ is the embedding size. For example, we may look up the "Edge of Tomorrow" embeddings represented by "[14, 72, 98]" in the sentence $s$.

3.2.2 Aggregate Multi-Token Embeddings

In the proposed reindex step, we assume that the semantics from multiple (typically shorter

than 10) token embeddings can be aggregated into a new single token embedding with a

trainable aggregator, such as:

$$\tilde{\mathbf{v}} = \mathrm{Aggregator}\left( (\mathbf{v}_i)_{i=j}^{j+n} \right), \tag{2}$$

where the aggregated $\tilde{\mathbf{v}} \in \mathbb{R}^d$ serves as the new representation of the target item in LLMs

generation. Therefore, if all target items are represented by the “squeezed” single embed-

ding, scoring all the target items to obtain a logit vector from LLMs is eﬃcient. In general,

many existing model architectures can be used as Aggregator, such as RNN-based (Cho

et al., 2014), Transformer-based models (Vaswani et al., 2017), or even Weighted Pooling.

We discuss the details and the comparisons with pure new embeddings in Section 4.3.
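As a rough illustration of the simplest aggregator mentioned above (weighted pooling, described in Section 4.3.1 as position-wise weights followed by a linear projection), the following sketch pools toy token embeddings into one vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the class name `WeightedAggregator`, the near-identity initialization and the toy dimensions are all hypothetical, and no training is shown.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class WeightedAggregator:
    """Position-wise weighted pooling followed by a linear projection (illustrative)."""

    def __init__(self, dim, max_len, seed=0):
        rng = random.Random(seed)
        # One trainable score per token position, normalized with softmax at use.
        self.position_scores = [0.0] * max_len
        # Trainable d x d projection, initialized near the identity (an assumption).
        self.proj = [[(1.0 if r == c else 0.0) + rng.gauss(0.0, 0.01)
                      for c in range(dim)] for r in range(dim)]

    def __call__(self, token_embeddings):
        n = len(token_embeddings)
        dim = len(token_embeddings[0])
        weights = softmax(self.position_scores[:n])
        # Weighted sum of the token embeddings, one coordinate at a time.
        pooled = [sum(w * emb[k] for w, emb in zip(weights, token_embeddings))
                  for k in range(dim)]
        # Linear projection of the pooled vector.
        return [sum(self.proj[r][c] * pooled[c] for c in range(dim))
                for r in range(dim)]


# Three toy 4-dimensional token embeddings standing in for a three-token title.
title_tokens = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]]
aggregator = WeightedAggregator(dim=4, max_len=10)
item_embedding = aggregator(title_tokens)
print(item_embedding)
```

With untrained (all-zero) position scores the pooling is a plain average, so the output stays close to the mean of the token embeddings; training would move both the position scores and the projection.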

3.2.3 Learning Process

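The contrastive loss (Oord et al., 2018) is used for learning the aggregator to "squeeze" multi-token item embeddings and preserve the semantics for LLMs:

$$\mathcal{L}_{\text{reindex}} = -\frac{1}{|\mathcal{D}|} \sum_{(\mathbf{q}, \tilde{\mathbf{v}}) \in \mathcal{D}} \log \frac{\exp\left(\mathbf{q}^{\top} \tilde{\mathbf{v}}\right)}{\exp\left(\mathbf{q}^{\top} \tilde{\mathbf{v}}\right) + \sum_{\mathbf{n} \in \mathcal{N}} \exp\left(\mathbf{q}^{\top} \mathbf{n}\right)}, \tag{3}$$

where we loop over the training set $\mathcal{D}$ that consists of $(\mathbf{q}, \tilde{\mathbf{v}})$ pairs. Those pairs are collected from sentences containing target items. Here, $\mathbf{q} \in \mathbb{R}^d$ is the contextual embedding of the last position from an LLM, which is originally used to generate the first token of the originally indexed item; we now instead train the LLM to generate the aggregated item representation $\tilde{\mathbf{v}} \in \mathbb{R}^d$ described in Section 3.2.2. To achieve this reindex step, we also prepare negatives $\mathbf{n} \in \mathbb{R}^d$ from the negative representation set $\mathcal{N}$.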

Two types of corpora, as well as their mixture, are considered in the reindex step:

• L2R Data: In this training corpus, (query, target) pairs are used as L2R samples. In

CRS context, those are samples from the conversations.

• L2I Data: In this training corpus, (content, target) pairs are used as L2I samples.

In CRS context, those are samples from the item metadata like textual descriptions.

• Data Mixture: In this case, we mix both L2R and L2I samples into a unified corpus and use it to train our model jointly. We use this option and include details in Appendix A.2.
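To make the contrastive objective in Equation (3) concrete, here is a small numerical sketch. It assumes one shared negative set for all pairs and uses toy two-dimensional vectors; the function name `reindex_loss` and the data are hypothetical, and the actual training setup (batching, negative sampling, mixing L2R and L2I samples) is described in the paper's appendix rather than here.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def reindex_loss(pairs, negatives):
    """Mean contrastive loss over (q, v_tilde) pairs with a shared negative set."""
    total = 0.0
    for q, v_pos in pairs:
        # Score of the positive item first, followed by the negatives.
        scores = [dot(q, v_pos)] + [dot(q, n) for n in negatives]
        m = max(scores)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
        total += -(scores[0] - log_denominator)
    return total / len(pairs)


negatives = [[0.0, 1.0], [-1.0, 0.0]]
aligned = [([1.0, 0.0], [2.0, 0.0])]      # q points towards its item embedding
misaligned = [([1.0, 0.0], [-2.0, 0.0])]  # q points away from its item embedding

loss_aligned = reindex_loss(aligned, negatives)
loss_misaligned = reindex_loss(misaligned, negatives)
print(loss_aligned, loss_misaligned)
```

The loss is small when the contextual embedding scores its own aggregated item embedding above the negatives, which is the behaviour the aggregator is trained towards.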

3.3 Adapt Step: Item Probabilities Adjustment

After re-indexing, all the items are represented by single-token embeddings. This makes recommendation as easy as one-step decoding in LLMs, and also enables multiple efficient ways to adjust the recommended item distributions to adapt towards target platforms or specific data distributions. We introduce two types of adaptation methods in the following sections, one for item popularity adjustments, and another for combining with traditional recommender systems. To start with, we assume the logit vector $\mathbf{g} \in \mathbb{R}^{|\mathcal{I}|}$ has already been given by the LLMs, and the corresponding probability vector is $\mathbf{p} = \mathrm{softmax}(\mathbf{g})$.

3.3.1 Bias Term Adjustment

Inspired by (Zhao et al., 2021), a common way to adjust logits is an affine transformation, i.e.:

$$\mathbf{p} = \mathrm{softmax}\left(\mathbf{g}\mathbf{W} + \mathbf{b}\right), \tag{4}$$

where $\mathbf{W} \in \mathbb{R}^{|\mathcal{I}| \times |\mathcal{I}|}$ is a weight matrix and $\mathbf{b} \in \mathbb{R}^{|\mathcal{I}|}$ is the bias term. Similar to (Zhao et al., 2021), we restrict the matrix $\mathbf{W}$ to be diagonal to prevent the number of parameters from growing quadratically in the number of items. Therefore, in this special case, we are able to interpret $\mathbf{W}$ and $\mathbf{b}$ as multiplicative and additive bias terms towards the target data distributions, respectively.
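Because the weight matrix is diagonal, the adjustment in Equation (4) reduces to one multiplication and one addition per item. The sketch below illustrates this on three toy items; the variable names and values are made up for illustration, and the parameters are set by hand rather than learned.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def adjust_with_bias(logits, scale, bias):
    """Apply p = softmax(g * w + b), storing the diagonal of W as the vector `scale`."""
    return softmax([g * w + b for g, w, b in zip(logits, scale, bias)])


llm_logits = [2.0, 1.0, 0.0]  # toy LLM logits over three reindexed items
scale = [1.0, 1.0, 1.0]       # diagonal of W (multiplicative term)
bias = [0.0, 2.0, 0.0]        # additive term that boosts the second item

before = softmax(llm_logits)
after = adjust_with_bias(llm_logits, scale, bias)
print(before)
print(after)
```

In this toy case the additive term alone is enough to move the top recommendation from the first item to the second, which mirrors how a popularity shift on the target platform could be absorbed without touching the LLM.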

Table 2: Dataset Statistics. We update Reddit-Movie CRS dataset as Reddit-V1.5 accord-

ing to the raw data dump provided by (He et al., 2023) from 2012 to 2022. Speciﬁcally,

conversation turns with valid recommended items are denoted as R Turns.

| | INSPIRED (Hayati et al., 2020) | ReDIAL (Li et al., 2018) | Reddit-V1.5 (He et al., 2023) |
| --- | --- | --- | --- |
| #Conv. | 999 | 11,348 | 2,726,471 |
| #Turns | 21,124 | 139,557 | 5,063,007 |
| #R Turns | 1,950 | 30,322 | 1,787,050 |
| #Users | 999 | 764 | 520,913 |
| #Items | 1,472 | 6,281 | 68,285 |

3.3.2 Traditional RecSys Gating

Inspired by (He et al., 2023), we notice that LLMs excel at content/context knowledge, whereas traditional RecSys, whose output logit vector can be denoted as $\tilde{\mathbf{g}} \in \mathbb{R}^{|\mathcal{I}|}$, are good at collaborative knowledge instead. Motivated by this observation, combining those two types of models becomes easy after the re-indexing step:

$$\mathbf{p} = \mathrm{softmax}\left(\alpha \mathbf{g} + (1 - \alpha) \tilde{\mathbf{g}}\right), \tag{5}$$

where the coefficient $\alpha \in [0, 1]$ can be set in many different ways. For simplicity, we use a learnable scalar $\tilde{\alpha}$ with $\alpha = \mathrm{sigmoid}(\tilde{\alpha})$ in our experiments, but more options can be considered, such as predicting it with an MLP model to determine how much we should weight the responses from LLMs or traditional RecSys, e.g., $\alpha = \mathrm{sigmoid}(\mathrm{MLP}(\mathbf{q}))$, where $\mathbf{q} \in \mathbb{R}^d$ can be the contextual embedding from an LLM in Equation (3).
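The gating in Equation (5) can be sketched in a few lines. The example below uses the scalar-gate variant with a hand-picked raw gate value instead of a learned one; the function name `gate` and the toy logits are assumptions made for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(llm_logits, recsys_logits, alpha_raw):
    """Mix LLM and RecSys logits with alpha = sigmoid(alpha_raw), then normalize."""
    alpha = sigmoid(alpha_raw)
    mixed = [alpha * g + (1.0 - alpha) * r
             for g, r in zip(llm_logits, recsys_logits)]
    return softmax(mixed), alpha


llm_logits = [3.0, 0.0, 1.0]     # the LLM prefers item 0
recsys_logits = [0.0, 3.0, 1.0]  # the traditional RecSys prefers item 1

probs_llm_heavy, alpha_hi = gate(llm_logits, recsys_logits, alpha_raw=4.0)
probs_recsys_heavy, alpha_lo = gate(llm_logits, recsys_logits, alpha_raw=-4.0)
probs_balanced, alpha_mid = gate(llm_logits, recsys_logits, alpha_raw=0.0)
print(alpha_hi, probs_llm_heavy)
print(alpha_lo, probs_recsys_heavy)
```

Parameterizing the gate through a sigmoid keeps the coefficient inside [0, 1] without any explicit constraint during optimization.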

3.3.3 Learning Process

We use maximum likelihood estimation to derive the loss for the adapt step, in order to learn the parameters of the bias terms or the RecSys model. Note that the LLM parameters are not involved in this step, ensuring an efficient learning process:

$$\mathcal{L}_{\text{adapt}} = -\sum_{i=1}^{|\mathcal{D}^*|} \log p_{i,*}, \tag{6}$$

where the dataset $\mathcal{D}^*$ is collected from the target platform, such as ReDIAL (Li et al., 2018), and is typically small. Here, $p_{i,*}$ denotes the probability of the ground-truth item in the $i$-th data sample. Our purpose is to adapt the model towards the underlying data distributions of $\mathcal{D}^*$ through this learning process.
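The adapt step can be illustrated with a tiny example that learns only an additive bias on top of frozen logits by minimizing the negative log-likelihood of Equation (6). This is a sketch under simplifying assumptions: plain full-batch gradient descent with a step averaged over samples, toy logits, and hypothetical function names (`nll`, `fit_bias`); the optimizer and hyperparameters used in the paper are not reproduced here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, targets, bias):
    """Summed negative log-likelihood of the ground-truth items."""
    total = 0.0
    for g, t in zip(logit_rows, targets):
        p = softmax([gi + bi for gi, bi in zip(g, bias)])
        total -= math.log(p[t])
    return total


def fit_bias(logit_rows, targets, steps=200, lr=0.1):
    """Learn only the additive bias; the (LLM) logits stay frozen."""
    bias = [0.0] * len(logit_rows[0])
    for _ in range(steps):
        grad = [0.0] * len(bias)
        for g, t in zip(logit_rows, targets):
            p = softmax([gi + bi for gi, bi in zip(g, bias)])
            for k in range(len(bias)):
                # Gradient of -log p[t] with respect to bias[k].
                grad[k] += p[k] - (1.0 if k == t else 0.0)
        bias = [b - lr * gk / len(logit_rows) for b, gk in zip(bias, grad)]
    return bias


# Frozen toy logits favour item 0, while the target data mostly contains item 2.
logit_rows = [[1.0, 0.0, 0.0]] * 4
targets = [2, 2, 2, 1]

loss_before = nll(logit_rows, targets, [0.0, 0.0, 0.0])
fitted_bias = fit_bias(logit_rows, targets)
loss_after = nll(logit_rows, targets, fitted_bias)
print(loss_before, loss_after, fitted_bias)
```

Only as many parameters as there are items are updated here, which is why this step is cheap compared with fine-tuning the language model itself.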

4 Experiments

4.1 Experiment Setup

4.1.1 Datasets

Three conversational recommendation datasets (Hayati et al., 2020; Li et al., 2018; He

et al., 2023) are used in our experiments, where the statistics are summarized in Table 2:

Table 3: The main results for our models on conversational recommendation accuracy per-

formance, compared against (1) traditional recommendation models; (2) zero-shot large

language models (LLMs); (3) traditional conversational recommendation models; and (4)

zero-shot dense retrievers. The size of the reported LLMs used here is 7B. We denote the

model metrics with the best performance in bold. Llama2-R denotes the Llama2-7b model

after our reindex step. We also show the results after the adapt step with bias terms

(+Bias) or RecSys model combination with Gating mechanism (+RecSys).

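Values are mean ± standard error.

| Model | INSPIRED H@5 | INSPIRED N@5 | INSPIRED H@10 | INSPIRED N@10 | ReDIAL H@5 | ReDIAL N@5 | ReDIAL H@10 | ReDIAL N@10 | Reddit-V1.5 H@5 | Reddit-V1.5 N@5 | Reddit-V1.5 H@10 | Reddit-V1.5 N@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Popularity | .089 ± .020 | .065 ± .015 | .103 ± .021 | .070 ± .015 | .035 ± .003 | .025 ± .002 | .052 ± .003 | .030 ± .002 | .008 ± .001 | .004 ± .000 | .014 ± .001 | .006 ± .000 |
| FISM | .075 ± .018 | .045 ± .012 | .103 ± .021 | .054 ± .012 | .065 ± .004 | .040 ± .003 | .112 ± .005 | .054 ± .003 | .022 ± .001 | .012 ± .001 | .043 ± .001 | .019 ± .001 |
| SASRec | .061 ± .016 | .037 ± .010 | .103 ± .021 | .051 ± .011 | .068 ± .004 | .041 ± .002 | .116 ± .005 | .056 ± .003 | .022 ± .001 | .013 ± .001 | .039 ± .001 | .018 ± .001 |
| MPT | .075 ± .018 | .045 ± .011 | .099 ± .020 | .052 ± .012 | .072 ± .004 | .045 ± .003 | .116 ± .005 | .059 ± .003 | .026 ± .001 | .017 ± .001 | .040 ± .001 | .021 ± .001 |
| Mistral | .061 ± .016 | .040 ± .011 | .066 ± .017 | .041 ± .012 | .082 ± .004 | .056 ± .003 | .111 ± .005 | .065 ± .003 | .029 ± .001 | .020 ± .001 | .038 ± .001 | .023 ± .001 |
| Llama2 | .080 ± .019 | .050 ± .012 | .122 ± .022 | .064 ± .013 | .094 ± .004 | .059 ± .003 | .145 ± .005 | .075 ± .003 | .042 ± .001 | .027 ± .001 | .064 ± .001 | .034 ± .001 |
| ReDIAL | .060 ± .016 | .041 ± .012 | .106 ± .021 | .056 ± .012 | .067 ± .004 | .044 ± .003 | .106 ± .005 | .057 ± .003 | .029 ± .001 | .019 ± .001 | .044 ± .001 | .024 ± .001 |
| UniCRS | .091 ± .019 | .055 ± .011 | .132 ± .019 | .073 ± .014 | .085 ± .003 | .058 ± .003 | .112 ± .004 | .071 ± .003 | .028 ± .001 | .017 ± .001 | .040 ± .001 | .021 ± .001 |
| SBERT | .038 ± .013 | .026 ± .010 | .066 ± .017 | .036 ± .010 | .016 ± .002 | .010 ± .001 | .026 ± .002 | .013 ± .001 | .003 ± .000 | .002 ± .000 | .005 ± .000 | .002 ± .000 |
| Instructor | .052 ± .015 | .034 ± .011 | .085 ± .019 | .045 ± .011 | .025 ± .002 | .013 ± .001 | .043 ± .003 | .019 ± .001 | .009 ± .001 | .006 ± .000 | .017 ± .001 | .008 ± .000 |
| Llama2-R | .066 ± .017 | .041 ± .011 | .103 ± .021 | .053 ± .012 | .071 ± .004 | .042 ± .002 | .117 ± .005 | .057 ± .003 | .055 ± .001 | .035 ± .001 | .093 ± .002 | .047 ± .001 |
| +Bias | .103 ± .021 | .066 ± .014 | .164 ± .025 | .083 ± .015 | .083 ± .004 | .053 ± .003 | .123 ± .005 | .066 ± .003 | .059 ± .001 | .037 ± .001 | .096 ± .002 | .049 ± .001 |
| +RecSys | .089 ± .020 | .052 ± .013 | .164 ± .025 | .076 ± .013 | .094 ± .004 | .060 ± .003 | .146 ± .005 | .076 ± .003 | .061 ± .001 | .038 ± .001 | .101 ± .002 | .051 ± .001 |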

• INSPIRED (Hayati et al., 2020) and ReDIAL (Li et al., 2018): These two datasets consist of small-scale human-human conversations for movie recommendations with crowd-sourced annotations from MTurk. Due to their short collection time span, temporal patterns are unlikely to be observed. Nevertheless, considering their widespread use, we present our model results based on these datasets. In the following experiments, we randomly split the datasets into training, validation, and test sets using an 8:1:1 ratio.

• Reddit-V1.5 (He et al., 2023): This dataset comprises large-scale movie discussions on Reddit, which were collected and processed by (He et al., 2023). It reflects real movie conversational recommendations in the wild and includes timestamps spanning 10 years, which allows studying temporal patterns. For data splitting, we use the last two months (i.e., Nov. and Dec. 2022) as the validation and test sets, respectively, to approximate the real setting. Due to the large size of the dataset, we uniformly sample 20% of the conversation turns for validation (i.e., 11,241 samples) and testing (i.e., 13,816 samples).

4.1.2 Baselines

We consider four groups of baseline models for comparison. (1) Representative traditional item-based RecSys models: Popularity, FISM (Kabbur et al., 2013) and SASRec (Kang and McAuley, 2018). (2) Representative CRS models: ReDIAL (Li et al., 2018) and UniCRS (Wang et al., 2022), the latter of which uses a pre-trained language model. (3) Dense retrieval models, given the connections to document retrieval: SBERT (Reimers and Gurevych, 2019) and Instructor (Su et al., 2022). (4) Zero-shot open-sourced LLMs, following (He et al., 2023); we use the 7-billion-parameter versions due to the compute burden: MPT-7b (Team, 2023), Mistral-7b (Jiang et al., 2023) and Llama2-7b (Touvron et al., 2023). We also discuss the results from GPT-3.5-turbo (Schulman et al., 2022), which is a much larger proprietary model that can achieve state-of-the-art CRS performance even in a zero-shot

https://mturk.com

We only use item-based models since INSPIRED does not have historical user interactions.

setting (He et al., 2023). The details of baseline models are found in Appendix A.1.

4.1.3 Evaluation Metrics

We focus on recommendation accuracy using HIT@K (H@K) and NDCG@K (N@K), following (Li et al., 2018; Chen et al., 2019; Zhou et al., 2020; Wang et al., 2022). We report the means and the standard errors of the metrics with K ∈ {5, 10}. Please find the implementation details in Appendix A.2.
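For readers unfamiliar with these metrics, the sketch below shows one common way to compute HIT@K, NDCG@K and their mean and standard error. It assumes a single ground-truth item per evaluated turn (so the ideal DCG is 1) and uses the sample standard deviation for the standard error; both are assumptions of this sketch, and the paper's exact evaluation code may differ.

```python
import math


def hit_at_k(ranked, target, k):
    """1.0 if the ground-truth item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked, target, k):
    """NDCG@K with one relevant item: 1 / log2(rank + 1) if ranked in the top k."""
    if target in ranked[:k]:
        rank = ranked.index(target) + 1  # 1-based position in the list
        return 1.0 / math.log2(rank + 1)
    return 0.0


def mean_and_standard_error(values):
    """Sample mean and standard error of the mean."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Toy ranked lists and ground-truth items for four evaluation samples.
rankings = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"], ["a", "c", "b"]]
targets = ["a", "a", "d", "c"]

hits = [hit_at_k(r, t, 2) for r, t in zip(rankings, targets)]
ndcgs = [ndcg_at_k(r, t, 2) for r, t in zip(rankings, targets)]
hit_mean, hit_se = mean_and_standard_error(hits)
ndcg_mean, ndcg_se = mean_and_standard_error(ndcgs)
print(f"H@2 = {hit_mean:.3f} ± {hit_se:.3f}, N@2 = {ndcg_mean:.3f} ± {ndcg_se:.3f}")
```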

4.2 General CRS Performance

4.2.1 Baseline Performance.

Table 3 shows the recommendation accuracy of four groups of baselines on three conversational recommendation datasets. We make the following observations:

On Traditional RecSys. Conventional RecSys models effectively capture target popularity and item-item similarities, resulting in reasonable recommendation accuracies.

Interestingly, on INSPIRED, we ﬁnd that non-personalized popularity serves as a strong

baseline, because the limited size of the training set may restrict the ability to capture more

complex item-item relationships. The results of traditional recommendation system mod-

els also indicate the potential of improving the recommendation accuracy by aligning with

target data distributions.

On LLMs. LLMs with zero-shot prompting from (He et al., 2023) achieve impressive results, even attaining the best baseline results on the ReDIAL dataset. Additionally, the rank of

recommendation accuracy within the LLM group aligns with the performance from Figure 2.

Further details on the speciﬁc proprietary model GPT-3.5-t are discussed in Section 4.5.2.

On Other Baselines. We observe that zero-shot state-of-the-art dense retrievers are unable to achieve performance comparable to zero-shot LLMs; this may be due to two reasons: (1) Dense retrievers focus more on retrieving similar documents according to semantic similarities (e.g., similar contents), whereas LLMs show better understanding of conversation contexts; (2) We encode the textual movie titles rather than the movie descriptions for fair comparison, which may limit the dense retrievers’ performance.

As for traditional CRS models, since we follow the setting in (He et al., 2023) to remove

“repeated” items, many popular CRS models perform relatively weaker in the corrected

evaluation protocol.

4.2.2 Ours vs. Baselines.

We construct a small-sized aggregator on top of Llama2 as an example, then use this aggre-

gator to reindex multi-token movie titles into single-token movie titles as recommendation

candidates, namely Llama2-R.

On Recommendation Accuracy. Table 3 shows that, following the reindex and adapt steps, our model outperforms the baselines on the INSPIRED and Reddit-V1.5 datasets and achieves competitive, marginally best results on the ReDIAL dataset. Examining the reindex

step (Llama2-R) and adapt step (+Bias or +RecSys), we observe a potential performance

decrease in the reindex step due to the semantic gap from original token embeddings to

the new single token embeddings from the relatively small aggregator. However, our mod-

els compensate by capturing the target data distribution through bias terms or traditional

We use error bars in our ﬁgures and gray numbers in our tables for standard errors.

Figure 5: Diﬀerent methods to represent items in LLMs with single-token embeddings and

the related recommendation accuracy HIT@5 after the reindex step.

RecSys models. A more in-depth analysis of these adapt methods will be discussed in Sec-

tion 4.4.

On Efficiency and Flexibility. Notably, the aggregator-based methods are around 10× smaller than the corresponding out-of-vocabulary item embedding tables and approximately 233× smaller than the Llama2-7b base model, which highlights their space efficiency. Additionally, as all movie titles with varying numbers of tokens are “squeezed” into single tokens, our model can rank all items with a single decoding step, making it around 100× faster than generative retrieval from LLMs when recommending the top-20 items. Moreover, single tokens make it easy to obtain the recommendation item distribution, enhancing flexibility in controlling or further adjusting the recommendations.
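The single-step ranking described above amounts to scoring every reindexed item against one contextual embedding and sorting the scores. The following sketch shows this with toy two-dimensional vectors; the function name `rank_items`, the item tokens and the numbers are illustrative assumptions, and a real system would take the contextual embedding from the LLM's last position and the item embeddings from the trained aggregator.

```python
def rank_items(context_embedding, item_embeddings, top_k):
    """Score every reindexed item with one dot product and return the top-k titles."""
    scores = {
        title: sum(q * v for q, v in zip(context_embedding, embedding))
        for title, embedding in item_embeddings.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


# Toy single-token item embeddings (hypothetical values).
item_embeddings = {
    "|Edge_of_Tomorrow|": [0.9, 0.1],
    "|Terminator|": [0.6, 0.3],
    "|The_Matrix|": [0.2, 0.8],
}
context_embedding = [1.0, 0.0]  # stands in for the LLM's last-position embedding

top_items, logits = rank_items(context_embedding, item_embeddings, top_k=2)
print(top_items)
```

Because the full score dictionary is available after this one pass, it can be fed directly into the bias adjustment or gating of the adapt step, which is not possible when titles are generated token by token.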

4.3 Eﬀectiveness of the Reindex Step

4.3.1 Experiment Setup

We explore methods for representing item titles with single-token embeddings in LLMs,

investigating four approaches: (1) Embed: randomly initialized out-of-vocabulary (OOV)

embeddings. The remaining three approaches aggregate existing LLM token embeddings into a single-token embedding and are trained on samples from the three datasets: (2) Weighted:

learning position-wise attention weights to aggregate multi-token embeddings into a single

one, followed by a simple linear projection; (3) TRM: employing a single-layer transformer

to derive a contextual embedding from the output CLS token; (4) RNN: using a simple GRU

model to aggregate multiple token embeddings, with the last hidden state vectors serving

as the item representations.
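To make the Weighted aggregator concrete, the following is a minimal sketch under stated assumptions: one learnable scalar score per token position, a softmax over the positions a title actually uses, and a linear projection afterwards. The softmax normalization, the function name `weighted_aggregate`, and all values are assumptions made for illustration; the exact parameterization may differ.

```python
import math


def weighted_aggregate(token_embs, position_scores, proj, proj_bias):
    """Collapse a title's token embeddings into one vector.

    token_embs:      list of T vectors (one per title token), each of size d
    position_scores: learnable scalars, one per position (length >= T)
    proj, proj_bias: d_out x d matrix and d_out bias of the linear projection
    """
    num_tokens = len(token_embs)
    dim = len(token_embs[0])
    # Softmax over the positions used by this title (assumed normalization).
    used = position_scores[:num_tokens]
    m = max(used)
    exps = [math.exp(s - m) for s in used]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Position-weighted sum of the token embeddings.
    pooled = [
        sum(weights[t] * token_embs[t][j] for t in range(num_tokens))
        for j in range(dim)
    ]
    # Linear projection to the single-token item embedding.
    return [
        sum(row[j] * pooled[j] for j in range(dim)) + b
        for row, b in zip(proj, proj_bias)
    ]


# Toy example: a two-token title with 2-dimensional embeddings.
token_embs = [[1.0, 0.0], [0.0, 1.0]]
position_scores = [0.0, 0.0, 0.0]  # equal scores give uniform weights
identity = [[1.0, 0.0], [0.0, 1.0]]
item_vec = weighted_aggregate(token_embs, position_scores, identity, [0.0, 0.0])
```

With equal position scores and an identity projection, the result is simply the mean of the token embeddings; training would move the scores and projection away from this starting point.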

4.3.2 Embedding vs. Aggregator

The embedding-based method cannot be shared across diﬀerent datasets due to the practical

challenge in normalizing item titles. However, the aggregators are shared across diﬀerent

datasets, using the raw text of item titles as inputs. Figure 5 demonstrates that aggregators

are not only generalizable across diﬀerent datasets but also yield superior recommendation

accuracy. Interestingly, despite Reddit having a dominant share of training samples (96%)

as shown in Table 2, the trained aggregators with mixed data samples perform even better

than the dataset-speciﬁc new embeddings in Figure 5.

Table 4: Recommendation accuracy comparison among Continual-Training on Llama2-R (Cont.), and the detailed configurations of adding bias terms or RecSys gating. Standard errors are shown in parentheses.

                 INSPIRED                   ReDIAL                     Reddit-V1.5
Model            H@10         N@10          H@10         N@10          H@10         N@10
Llama2-R         .103 (.021)  .053 (.012)   .117 (.005)  .057 (.003)   .093 (.002)  .047 (.001)
Cont.            .146 (.024)  .081 (.015)   .124 (.004)  .067 (.003)   .093 (.001)  .047 (.001)
Bias Term Adjustment (+Bias)
w/ gW            .155 (.025)  .081 (.014)   .123 (.005)  .066 (.003)   .093 (.001)  .048 (.001)
w/ b             .103 (.021)  .053 (.012)   .118 (.005)  .057 (.003)   .096 (.002)  .049 (.001)
w/ gW + b        .164 (.025)  .083 (.004)   .123 (.005)  .066 (.003)   .096 (.001)  .049 (.001)
RecSys Model Gating (+RecSys)
+ FISM           .164 (.025)  .076 (.013)   .139 (.005)  .072 (.003)   .101 (.002)  .049 (.001)
+ SASRec         .136 (.023)  .071 (.014)   .146 (.005)  .076 (.003)   .101 (.002)  .051 (.001)
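For readers who want to reproduce the kind of numbers reported in Table 4, the sketch below shows one common way to compute H@K (hit rate), N@K (NDCG), and a standard error. It assumes a single ground-truth item per test case and the usual sample standard error; the function names are illustrative and the evaluation protocol used in the paper may differ in details.

```python
import math


def hit_at_k(ranked, target, k):
    """1.0 if the ground-truth item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked, target, k):
    """NDCG with one relevant item: 1 / log2(rank + 1) if ranked within k."""
    if target in ranked[:k]:
        rank = ranked.index(target) + 1  # 1-indexed position
        return 1.0 / math.log2(rank + 1)
    return 0.0


def mean_and_stderr(values):
    """Sample mean and standard error (sample std divided by sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Toy example: the ground-truth item is ranked second.
ranked = ["Get Out", "It", "Happy Death Day", "The Conjuring"]
hit = hit_at_k(ranked, "It", k=10)
ndcg = ndcg_at_k(ranked, "It", k=10)
mean_hit, se_hit = mean_and_stderr([1.0, 0.0, 1.0, 0.0])
```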

4.3.3 Diﬀerent Aggregators

Among the three aggregators, the Weighted method demonstrates competitive performance despite its simple architecture. This suggests that the existing token embeddings from the LLMs are effective enough, making the weighted sum with linear projection a reasonable approach to consolidating token embeddings. Additionally, TRM performs worse than RNN, possibly because (1) titles (e.g., movies) are typically short (fewer than 20 tokens), diminishing TRM's advantages over RNN in handling long dependencies; and (2) CLS tokens have difficulty representing a sentence, as noted in the literature (Choi et al., 2021).

4.4 Eﬀectiveness of the Adapt Step

4.4.1 Component Analysis

Table 4 shows that introducing bias terms after the reindex step is a simple yet effective strategy. We attribute this to the accuracy gained by correcting popularity misalignments, as illustrated in Figure 1. Additionally, we observe that on the small dataset, INSPIRED, +Bias (w/ gW + b) matches or outperforms +RecSys on both metrics. This is because the parameter space for learning is significantly reduced, changing from learning item-item relationships to learning point-wise item popularity, which can be effectively captured with a small number of training samples.

Meanwhile, Table 4 demonstrates that introducing traditional RecSys models is effective when a large number of training samples is available to adapt the recommendation distribution. On ReDIAL and Reddit-V1.5, this leads to improved recommendation accuracy compared to Cont. and +Bias. However, on the small dataset INSPIRED, using RecSys to learn item-item relationships tends to result in overfitting. This motivates choosing the adapt step case by case. For example, after collecting the most recent samples, bias-term adjustment (+Bias) is recommended if the number of new samples is limited. Otherwise, RecSys gating would be a good option.

(a) Context
SYSTEM: 'Tell me what kind of movies do you like?'
USER: 'I like scary movies'
SYSTEM: 'did you watch Avengers: Infinity War (2018)?'
USER: 'I did not but I want to'

(b) Recommendation
Llama2: 1. The Conjuring; 2. The Ring; 3. The Babadook; 4. Get Out
Llama2-R: 1. Get Out; 2. It; 3. Happy Death Day; 4. The Conjuring
Llama2-RTA: 1. It; 2. Oujia; 3. Wonder Woman; 4. Halloween II

(c) Generation
SYSTEM: 'Great! Based on your preference for scary movies, I recommend "It" (2017). It's a classic horror movie that has been terrifying audiences for decades. The movie follows a group of young outcasts who must face their fears and battle a monstrous entity that preys on their town every 27 years.'

Figure 6: An example with real results from Llama2, Llama2-R and Llama2-RTA (+SASRec), followed by a natural language response from Llama2 (detailed prompts can be found in Appendix B). This conversational context is from the ReDIAL dataset, with the ground-truth movie IT.

4.4.2 Impact of Bias Term Types

Both multiplicative and additive bias terms improve accuracy across diverse datasets, though

their impact varies. Speciﬁcally, multiplicative bias terms exhibit signiﬁcant improvement

on INSPIRED and ReDIAL datasets, whereas additive bias terms play a pivotal role on

Reddit-V1.5.
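To illustrate how the two bias types act on the item distribution, the sketch below assumes an element-wise scale (multiplicative) and an element-wise offset (additive) applied to the item logits produced after the reindex step. This parameterization, the function name `adapt_logits`, and the numbers are assumptions for illustration only; the precise form of the bias terms is the one defined in the method section.

```python
def adapt_logits(logits, scale=None, offset=None):
    """Apply an assumed per-item multiplicative scale and additive offset."""
    n = len(logits)
    scale = [1.0] * n if scale is None else scale
    offset = [0.0] * n if offset is None else offset
    return [w * z + b for z, w, b in zip(logits, scale, offset)]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits for three items from the reindexed LLM.
llm_logits = [2.0, 1.0, 0.5]

# Additive bias: boost item 1, e.g. because it is popular on the target platform.
additive = adapt_logits(llm_logits, offset=[0.0, 1.5, 0.0])

# Multiplicative bias: damp item 0, which the LLM over-recommends.
multiplicative = adapt_logits(llm_logits, scale=[0.4, 1.0, 1.0])
```

In both toy cases the top-ranked item changes from item 0 to item 1, even though only one parameter per item is learned, which is why such terms can be fit from few samples.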

4.4.3 Impact of RecSys Model Types

Our current focus is on "item-based" RecSys models without incorporating long-term user representations. In this context, both FISM and SASRec improve performance. Notably, FISM outperforms SASRec on the INSPIRED dataset, possibly because SASRec, a transformer-based model, is too complex for smaller datasets. Conversely, on larger datasets such as ReDIAL and Reddit-V1.5, SASRec demonstrates superior performance, suggesting that transformer-based RecSys models are advantageous with larger data sizes. In particular, on ReDIAL, which is characterized by longer conversation rounds, SASRec may bring additional benefits in capturing item-to-item sequential patterns within conversations.
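One simple way to picture RecSys gating is as a learned mixture of two score vectors over the same item set: the logits from the reindexed LLM and the scores from an item-based recommender such as FISM or SASRec. The sketch below assumes a single scalar gate passed through a sigmoid and a convex combination of the two; this form, the name `gate_scores`, and the numbers are assumptions for illustration and not necessarily the gating used in the paper.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_scores(llm_logits, recsys_scores, gate_param):
    """Mix LLM and RecSys scores with an assumed scalar sigmoid gate."""
    g = sigmoid(gate_param)
    return [(1.0 - g) * a + g * b for a, b in zip(llm_logits, recsys_scores)]


# Hypothetical scores for two candidate items.
llm_logits = [2.0, 1.0]
recsys_scores = [0.0, 4.0]

balanced = gate_scores(llm_logits, recsys_scores, gate_param=0.0)     # g = 0.5
llm_heavy = gate_scores(llm_logits, recsys_scores, gate_param=-20.0)  # g close to 0
```

A gate near zero leaves the LLM ranking essentially untouched, while a larger gate lets item-to-item patterns learned on the target platform reorder the candidates.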

4.5 Discussions

4.5.1 Conversational Recommendation Responses

Figure 6 illustrates the complete pipeline of generating results for conversational recommendation tasks. We discuss the two phases below.

On Recommendation. The outputs of the recommendation phase are items. In Figure 6, the three models (Llama2 and its variants under our framework) understand contexts, yielding high-quality recommendations for scary movies. Specifically, Llama2-RTA builds a connection between the superhero movie Avengers: Infinity War in the context and the candidate Wonder Woman, using item-to-item relationships modeled by the SASRec (Kang and McAuley, 2018) model. Meanwhile, we posit that although multiple recommended items align with the conversation context, failing to adjust for the popularity of items on the target platform (e.g., the movie IT being popular on ReDIAL) causes zero-shot LLMs to miss user interests.

Table 5: Recommendation accuracy comparison between our model, based on a 7B open-sourced LLM (Llama2), and the proprietary model ChatGPT (GPT-3.5-t). Standard errors are shown in parentheses.

                 INSPIRED                   ReDIAL                     Reddit-V1.5
Model            H@10         N@10          H@10         N@10          H@10         N@10
Ours             .164 (.025)  .083 (.004)   .131 (.005)  .068 (.003)   .102 (.002)  .052 (.001)
GPT-3.5-t        .150 (.024)  .089 (.016)   .163 (.006)  .089 (.003)   .104 (.002)  .055 (.001)

On Generation. The outputs of the generation phase are texts. In Figure 6, the

generation phase is accomplished by prompting the Llama2 model. It is noted that our

focus in this work is solely on the technical aspects of the recommendation phase. We treat

the generation phase as a separate task that can be completed either by existing LLMs or

adjusted based on user interface requirements. Still, we make some observations: (1) In

many cases, presenting the recommendation phase suﬃces for users. However, our RTA

framework, which introduces only a few additional parameters without changing the weights

of the original LLMs, eﬃciently enables the reuse of LLMs for further generating natural-

language responses as shown in Figure 6; (2) In conversational recommendations, there

is an ongoing debate about whether to perform the recommendation or generation phase

ﬁrst (Li et al., 2018; Zhou et al., 2020; Wang et al., 2022). Our example suggests that, if

the recommendation phase is frequently adjusted (a common scenario due to distribution

shift), it is advisable to perform the recommendation phase ﬁrst and then the generation

phase. Reversing the order may lead to text-item inconsistency issues (e.g., the generated

response is speciﬁcally tailored for recommended movie IT, leading to a mismatch with the

recommendation from Llama2).
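The recommend-then-generate ordering can be sketched as a small pipeline in which the generation prompt is built only after the item has been chosen, so the text cannot drift from the recommendation. Everything here is hypothetical: the prompt template, the `stub_ranker`, and the `stub_generator` are stand-ins for illustration, and the prompts actually used are the ones described in Appendix B.

```python
def build_generation_prompt(context_turns, recommended_title):
    """Hypothetical prompt template that names the already-chosen item."""
    dialogue = "\n".join(f"{role}: {text}" for role, text in context_turns)
    instruction = f'Recommend the movie "{recommended_title}" and briefly explain why.'
    return f"{dialogue}\nINSTRUCTION: {instruction}\nSYSTEM:"


def recommend_then_generate(context_turns, ranker, generator):
    """Rank items first, then condition the response on the top item."""
    ranked_titles = ranker(context_turns)
    prompt = build_generation_prompt(context_turns, ranked_titles[0])
    return ranked_titles, generator(prompt)


def stub_ranker(context_turns):
    """Stand-in for the (adapted) recommendation phase."""
    return ["It", "Get Out", "The Conjuring"]


def stub_generator(prompt):
    """Stand-in for an LLM call: echoes the instruction it was given."""
    lines = [line for line in prompt.splitlines() if line.startswith("INSTRUCTION:")]
    return "Sure! " + lines[0][len("INSTRUCTION: "):]


context = [("USER", "I like scary movies")]
ranked, response = recommend_then_generate(context, stub_ranker, stub_generator)
```

If the ranker is later re-adapted and its top item changes, the prompt changes with it, which avoids the text-item mismatch described above.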

4.5.2 Comparison with Proprietary Models

To deepen our understanding of the models, we adopt the setting of He et al. (2023) to query the proprietary model GPT-3.5-t (Schulman et al., 2022). As shown in Table 5, GPT-3.5-t remains a competitive model for conversational recommendations with zero-shot prompting. However, given that our approach is agnostic to the LLM architecture, it is reasonable to expect that recommendation accuracy could also be improved on top of GPT-3.5-t if its weights were accessible. A reasonable next step is to work on models similar to GPT-3.5-t, such as Llama2-70b. This could be pursued as future work if the required compute resources are available.

5 Related Work

5.1 Conversational Recommendation (CRS)

The objective of conversational recommender systems (CRS) is to elicit user preferences and deliver tailored recommendations through interactive dialogues. Historically, CRS implementations have ranged from template-driven systems (Christakopoulou et al., 2016; Lei et al., 2020b,a; He et al., 2022; Zhang et al., 2022) to critique-based approaches (Chen and Pu, 2012; Wu et al., 2019; Li et al., 2021). With the evolution of natural language processing, "deep" CRS models (Li et al., 2018; Chen et al., 2019; Wang et al., 2022) have been developed, enabling more natural-language interactions. Research indicates that the utility of CRS models is enhanced by incorporating diverse supplementary data, as in knowledge-enriched models (Chen et al., 2019; Zhou et al., 2020) utilizing external knowledge bases (Auer et al., 2007; Liu and Singh, 2004), review-centric models (Lu et al., 2021), and session/sequence-oriented models (Zou et al., 2022; Li et al., 2022b). UniCRS (Wang et al., 2022), which uses knowledge bases (Auer et al., 2007), is built on DialoGPT (Zhang et al., 2020), and employs prompt tuning (Brown et al., 2020b), represents a state-of-the-art CRS model on datasets like ReDIAL (Li et al., 2018) and INSPIRED (Hayati et al., 2020). Recently, an emerging topic is leveraging LLMs in CRS, with Friedman et al. (2023) and He et al. (2023) introducing a novel CRS pipeline, even in the zero-shot setting (He et al., 2023), and Wang et al. (2023b) focusing on advanced user simulation for LLM evaluation. Our research is the first to study the distribution misalignments in zero-shot LLMs for CRS and solutions for this issue to improve recommendation accuracy.

5.2 Large Language Models (LLMs)

Recent breakthroughs in natural language processing (NLP) have demonstrated that large language models (LLMs) possess a remarkable capacity for generalizing to unfamiliar tasks and areas (Chowdhery et al., 2022; Brown et al., 2020a; Wei et al., 2022) in zero-shot or few-shot settings. Studies have shown that scaling up LLMs can significantly enhance their performance and efficiency in downstream applications (Kaplan et al., 2020). In line with these developments, LLMs have been successfully applied to various downstream tasks such as question answering, numerical reasoning, code generation, and commonsense reasoning, often without requiring gradient updates (Zheng et al., 2023; Brown et al., 2020a; Li et al., 2022a; Kaplan et al., 2020). The recommendation field has recently begun integrating LLMs, either by adapting LLM architectures (Geng et al., 2022; Cui et al., 2022) or repurposing existing LLMs for recommendation purposes (Li et al., 2023; Wang et al., 2023a; Liu et al., 2023). Our study aligns with the research line of utilizing LLMs for conversational recommendations. We obtain improvements in recommendation accuracy by adjusting item recommendations within the proposed framework.

5.3 LLMs for Recommendation

There is growing interest in the academic community in harnessing LLMs for recommendation-related tasks. One research direction explores LLMs within conventional recommendation setups, which typically incorporate user feedback and item metadata (Kang et al., 2023; Hou et al., 2023; Yue et al., 2023; Dai et al., 2023; Bao et al., 2023; Harte et al., 2023; Sanner et al., 2023). This includes tasks such as rating prediction (Kang et al., 2023) and sequential recommendation (Harte et al., 2023; Yue et al., 2023; Hou et al., 2023). In such contexts, employing LLMs as recommenders has shown potential, particularly in scenarios with extreme data sparsity (Bao et al., 2023) or during the cold-start phase (Sanner et al., 2023). However, they often struggle to surpass simpler baseline methods, like non-personalized popularity-based models, in standard recommendation scenarios (Kang et al., 2023; Hou et al., 2023). Nevertheless, enhancing existing recommender systems with features generated by LLMs has yielded improved performance (Agrawal et al., 2023). Another significant research direction focuses on language-centric recommendation tasks (He et al., 2023; Acharya et al., 2023; Mysore et al., 2023; Feng et al., 2023; Friedman et al., 2023). These tasks include generating explanations for recommendations, narrative-based recommendations (Mysore et al., 2023), and conversational recommendations (He et al., 2023; Feng et al., 2023; Friedman et al., 2023). LLMs exhibit proficient performance in understanding intricate textual inputs, allowing for personalized recommendation outputs. Recent investigations in conversational recommendation demonstrate encouraging outcomes leveraging LLMs, even in zero-shot configurations. Our study employs existing LLMs with minimal additional parameters, implementing the Reindex-Then-Adapt framework. By reindexing item content within LLMs and adjusting recommendations to align with target data distributions, our framework enhances recommendation accuracy in CRS.

6 Conclusion

This study proposes a solution to mitigate distribution misalignments between zero-shot

large language models (LLMs) and target recommendation platforms for conversational

recommendations. We conceptualize LLMs as Diﬀerential Search Index (DSI) models and

introduce the Reindex-Then-Adapt (RTA) framework. The framework involves converting

multi-token item titles into single tokens within LLMs (reindex step) and subsequently

adjusting their probability distributions (adapt step). By combining the strengths of LLMs

and traditional RecSys, the RTA framework achieves improved recommendation accuracy

metrics across various conversational recommendation datasets and adaptation settings.

References

Arkadeep Acharya, Brijraj Singh, and Naoyuki Onoe. 2023. LLM Based Generation of Item-

Description for Recommendation System. In Proceedings of the 17th ACM Conference on

Recommender Systems. 1204–1207.

Saurabh Agrawal, John Trenkle, and Jaya Kawale. 2023. Beyond Labels: Leveraging Deep

Learning and LLMs for Content Metadata. In Proceedings of the 17th ACM Conference

on Recommender Systems. 1–1.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007. Proceedings. Springer, 722–735.

Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023.

Tallrec: An eﬀective and eﬃcient tuning framework to align large language model with

recommendation. arXiv preprint arXiv:2305.00447 (2023).

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla

Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.

2020b. Language models are few-shot learners. Advances in neural information processing

systems 33 (2020), 1877–1901.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla

Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini

Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya

Ramesh, Daniel Ziegler, Jeﬀrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric

Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner,

Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020a. Language

Models are Few-Shot Learners. In Advances in Neural Information Processing Systems,

H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran

Associates, Inc., 1877–1901.

Li Chen and Pearl Pu. 2012. Critiquing-based recommenders: survey and emerging trends.

User Modeling and User-Adapted Interaction 22 (2012), 125–150.

Qibin Chen, Junyang Lin, Yichang Zhang, Ming Ding, Yukuo Cen, Hongxia Yang, and

Jie Tang. 2019. Towards Knowledge-Based Recommender Dialog System. In Proceedings

of the 2019 Conference on Empirical Methods in Natural Language Processing and the

9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

1803–1813.

Xiaoyang Chen, Yanjiang Liu, Ben He, Le Sun, and Yingfei Sun. 2023. Understanding

Diﬀerential Search Index for Text Retrieval. arXiv preprint arXiv:2305.02073 (2023).

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. 103–111.

Hyunjin Choi, Judong Kim, Seongho Joe, and Youngjune Gwon. 2021. Evaluation of bert

and albert sentence embedding performance on downstream nlp tasks. In 2020 25th In-

ternational conference on pattern recognition (ICPR). IEEE, 5482–5487.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier García, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Díaz, Orhan Firat, Michele Catasta, Jason Wei, Kathleen S. Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling Language Modeling with Pathways. ArXiv abs/2204.02311 (2022).

Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. 2016. Towards conver-

sational recommender systems. In Proceedings of the 22nd ACM SIGKDD international

conference on knowledge discovery and data mining. 815–824.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empiri-

cal evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint

arXiv:1412.3555 (2014).

Zeyu Cui, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. M6-

Rec: Generative Pretrained Language Models are Open-Ended Recommender Systems.

arXiv:2205.08084 [cs.IR]

Sunhao Dai, Ninglu Shao, Haiyuan Zhao, Weijie Yu, Zihua Si, Chen Xu, Zhongxiang Sun,

Xiao Zhang, and Jun Xu. 2023. Uncovering ChatGPT’s Capabilities in Recommender

Systems. arXiv preprint arXiv:2305.02182 (2023).

Yue Feng, Shuchang Liu, Zhenghai Xue, Qingpeng Cai, Lantao Hu, Peng Jiang, Kun Gai,

and Fei Sun. 2023. A Large Language Model Enhanced Conversational Recommender

System. arXiv preprint arXiv:2308.06212 (2023).

Luke Friedman, Sameer Ahuja, David Allen, Terry Tan, Hakim Sidahmed, Changbo Long,

Jun Xie, Gabriel Schubiner, Ajay Patel, Harsh Lara, et al. 2023. Leveraging Large Lan-

guage Models in Conversational Recommender Systems. arXiv preprint arXiv:2305.07961

(2023).

Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Rec-

ommendation as Language Processing (RLP): A Uniﬁed Pretrain, Personalized Prompt

& Predict Paradigm (P5). In RecSys ’22: Sixteenth ACM Conference on Recommender

Systems, Seattle, WA, USA, September 18 - 23, 2022, Jennifer Golbeck, F. Maxwell

Harper, Vanessa Murdock, Michael D. Ekstrand, Bracha Shapira, Justin Basilico, Keld T.

Lundgaard, and Even Oldridge (Eds.). ACM, 299–315.

Albert Gu, Caglar Gulcehre, Thomas Paine, Matt Hoﬀman, and Razvan Pascanu. 2020. Im-

proving the gating mechanism of recurrent neural networks. In International Conference

on Machine Learning. PMLR, 3800–3809.

Jesse Harte, Wouter Zorgdrager, Panos Louridas, Asterios Katsifodimos, Dietmar Jannach,

and Marios Fragkoulis. 2023. Leveraging Large Language Models for Sequential Rec-

ommendation. In Proceedings of the 17th ACM Conference on Recommender Systems.

1096–1102.

Shirley Anugrah Hayati, Dongyeop Kang, Qingxiaoyang Zhu, Weiyan Shi, and Zhou Yu.

2020. INSPIRED: Toward Sociable Recommendation Dialog Systems. In Proceedings of

the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

8142–8152.

Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bod-

hisattwa Prasad Majumder, Nathan Kallus, and Julian McAuley. 2023. Large language

models as zero-shot conversational recommenders. In Proceedings of the 32nd ACM In-

ternational Conference on Information and Knowledge Management. 720–730.

Zhankui He, Handong Zhao, Tong Yu, Sungchul Kim, Fan Du, and Julian McAuley. 2022.

Bundle MCR: Towards Conversational Bundle Recommendation. In Proceedings of the

16th ACM Conference on Recommender Systems. 288–298.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.

Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and

Wayne Xin Zhao. 2023. Large language models are zero-shot rankers for recommender

systems. arXiv preprint arXiv:2305.08845 (2023).

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh

Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile

Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).

Santosh Kabbur, Xia Ning, and George Karypis. 2013. Fism: factored item similarity models

for top-n recommender systems. In Proceedings of the 19th ACM SIGKDD international

conference on Knowledge discovery and data mining. 659–667.

Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation.

In 2018 IEEE international conference on data mining (ICDM). IEEE, 197–206.

Wang-Cheng Kang, Jianmo Ni, Nikhil Mehta, Maheswaran Sathiamoorthy, Lichan Hong, Ed

Chi, and Derek Zhiyuan Cheng. 2023. Do LLMs Understand User Preferences? Evaluating

LLMs On User Rating Prediction. arXiv preprint arXiv:2305.06474 (2023).

Jared Kaplan, Sam McCandlish, T. J. Henighan, Tom B. Brown, Benjamin Chess, Rewon

Child, Scott Gray, Alec Radford, Jeﬀ Wu, and Dario Amodei. 2020. Scaling Laws for

Neural Language Models. ArXiv abs/2001.08361 (2020).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT. 4171–4186.

Wenqiang Lei, Xiangnan He, Yisong Miao, Qingyun Wu, Richang Hong, Min-Yen Kan,

and Tat-Seng Chua. 2020a. Estimation-action-reﬂection: Towards deep interaction be-

tween conversational and recommender systems. In Proceedings of the 13th International

Conference on Web Search and Data Mining. 304–312.

Wenqiang Lei, Gangyi Zhang, Xiangnan He, Yisong Miao, Xiang Wang, Liang Chen, and

Tat-Seng Chua. 2020b. Interactive path reasoning on graph for conversational recommen-

dation. In Proceedings of the 26th ACM SIGKDD international conference on knowledge

discovery & data mining. 2073–2083.

Jinming Li, Wentao Zhang, Tian Wang, Guanglei Xiong, Alan Lu, and Gerard Medioni.

2023. GPT4Rec: A Generative Framework for Personalized Recommendation and User

Interests Interpretation. arXiv:2304.03879 [cs.IR]

Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin,

and Chris Pal. 2018. Towards deep conversational recommendations. Advances in neural

information processing systems 31 (2018).

Shuyang Li, Bodhisattwa Prasad Majumder, and Julian McAuley. 2021. Self-Supervised

Bot Play for Conversational Recommendation with Justiﬁcations. arXiv preprint

arXiv:2112.05197 (2021).

Shuokai Li, Ruobing Xie, Yongchun Zhu, Xiang Ao, Fuzhen Zhuang, and Qing He. 2022b.

User-centric conversational recommendation with multi-aspect user modeling. In Proceed-

ings of the 45th International ACM SIGIR Conference on Research and Development in

Information Retrieval. 223–233.

Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel Jaymin Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022a. Competition-level code generation with AlphaCode. Science 378 (2022), 1092–1097.

Hugo Liu and Push Singh. 2004. ConceptNet—a practical commonsense reasoning tool-kit.

BT technology journal 22, 4 (2004), 211–226.

Junling Liu, Chao Liu, Renjie Lv, Kang Zhou, and Yan Zhang. 2023. Is ChatGPT a Good

Recommender? A Preliminary Study. arXiv:2304.10149 [cs.IR]

Yu Lu, Junwei Bao, Yan Song, Zichen Ma, Shuguang Cui, Youzheng Wu, and Xiaodong He.

2021. RevCore: Review-Augmented Conversational Recommendation. In Findings of the

Association for Computational Linguistics: ACL-IJCNLP 2021. 1161–1173.

Sheshera Mysore, Andrew McCallum, and Hamed Zamani. 2023. Large Language

Model Augmented Narrative Driven Recommendations. arXiv preprint arXiv:2306.02250

(2023).

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with

contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using

Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods

in Natural Language Processing and the 9th International Joint Conference on Natural

Language Processing (EMNLP-IJCNLP). 3982–3992.

Scott Sanner, Krisztian Balog, Filip Radlinski, Ben Wedin, and Lucas Dixon. 2023. Large

Language Models are Competitive Near Cold-start Recommenders for Language-and

Item-based Preferences. In Proceedings of the 17th ACM Conference on Recommender

Systems. 890–896.

John Schulman, Barret Zoph, C Kim, Jacob Hilton, Jacob Menick, Jiayi Weng, Juan Fe-

lipe Ceron Uribe, Liam Fedus, Luke Metz, Michael Pokorny, Rapha Gontijo Lopes, and

Sengjia Zhao. 2022. Chatgpt: Optimizing language models for dialogue. OpenAI (2022).

Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. Autorec: Au-

toencoders meet collaborative ﬁltering. In Proceedings of the 24th international conference

on World Wide Web. 111–112.

Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau

Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. 2022. One embedder, any task:

Instruction-ﬁnetuned text embeddings. arXiv preprint arXiv:2212.09741 (2022).

Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin,

Kai Hui, Zhe Zhao, Jai Gupta, et al. 2022. Transformer memory as a diﬀerentiable search

index. Advances in Neural Information Processing Systems 35 (2022), 21831–21843.

MosaicML NLP Team. 2023. Introducing MPT-7B: A New Standard for Open-Source, Com-

mercially Usable LLMs. www.mosaicml.com/blog/mpt-7b Accessed: 2023-05-05.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,

Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama

2: Open foundation and ﬁne-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,

Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural

information processing systems. 5998–6008.

Wenjie Wang, Xinyu Lin, Fuli Feng, Xiangnan He, and Tat-Seng Chua. 2023a.

Generative Recommendation: Towards Next-generation Recommender Paradigm.

arXiv:2304.03516 [cs.IR]

Xiaolei Wang, Xinyu Tang, Wayne Xin Zhao, Jingyuan Wang, and Ji-Rong Wen. 2023b.

Rethinking the Evaluation for Conversational Recommendation in the Era of Large Lan-

guage Models. arXiv preprint arXiv:2305.13112 (2023).

Xiaolei Wang, Kun Zhou, Ji-Rong Wen, and Wayne Xin Zhao. 2022. Towards Uniﬁed

Conversational Recommender Systems via Knowledge-Enhanced Prompt Learning. In

Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data

Mining. 1929–1937.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 24824–24837.

Ga Wu, Kai Luo, Scott Sanner, and Harold Soh. 2019. Deep language-based critiquing

for recommender systems. In Proceedings of the 13th ACM Conference on Recommender

Systems. 137–145.

Zhenrui Yue, Sara Rabhi, Gabriel de Souza Pereira Moreira, Dong Wang, and Even Oldridge.

2023. LlamaRec: Two-Stage Recommendation using Large Language Models for Ranking.

arXiv preprint arXiv:2311.02089 (2023).

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng

Gao, Jingjing Liu, and William B Dolan. 2020. DIALOGPT: Large-Scale Generative

Pre-training for Conversational Response Generation. In Proceedings of the 58th Annual

Meeting of the Association for Computational Linguistics: System Demonstrations. 270–

278.

Yiming Zhang, Lingfei Wu, Qi Shen, Yitong Pang, Zhihua Wei, Fangli Xu, Bo Long, and Jian

Pei. 2022. Multiple Choice Questions based Multi-Interest Policy Learning for Conversa-

tional Recommendation. In Proceedings of the ACM Web Conference 2022. 2153–2162.

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before

use: Improving few-shot performance of language models. In International Conference on

Machine Learning. PMLR, 12697–12706.

Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei

Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2023. CodeGeeX: A

Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X.

arXiv:2303.17568 [cs.LG]

Kun Zhou, Wayne Xin Zhao, Shuqing Bian, Yuanhang Zhou, Ji-Rong Wen, and Jingsong

Yu. 2020. Improving conversational recommender systems via knowledge graph based

semantic fusion. In Proceedings of the 26th ACM SIGKDD international conference on

knowledge discovery & data mining. 1006–1014.

Jie Zou, Evangelos Kanoulas, Pengjie Ren, Zhaochun Ren, Aixin Sun, and Cheng Long.

2022. Improving conversational recommender systems via transformer-based sequential

modelling. In Proceedings of the 45th International ACM SIGIR Conference on Research

and Development in Information Retrieval. 2319–2324.

A More Details of Experiments

A.1 Baseline Details

We consider four groups of baseline models for comparison. Firstly, we consider some

representative traditional item-based RecSys models:

• Popularity: This method is non-personalized and returns the top-K most popular

movies within the related datasets.

• FISM (Kabbur et al., 2013): A commonly used factored item similarity model for item-based collaborative filtering.

• SASRec (Kang and McAuley, 2018): A competitive self-attention-based sequential

recommender system.
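
As a minimal sketch of the Popularity baseline described above, the snippet below counts item occurrences in the training data and returns the K most frequent items. The function name `top_k_popular`, the `(user_id, item_id)` input layout, and the toy data are assumptions made for illustration; they are not taken from the paper's code.

```python
from collections import Counter


def top_k_popular(interactions, k=10):
    """Return the k most frequent item IDs among (user_id, item_id) pairs."""
    counts = Counter(item for _, item in interactions)
    # most_common sorts items by descending count.
    return [item for item, _ in counts.most_common(k)]


# Hypothetical toy training data, for illustration only.
train = [
    ("u1", "Inception"), ("u2", "Inception"), ("u3", "Inception"),
    ("u2", "Heat"), ("u3", "Heat"),
    ("u3", "Alien"),
]
top2 = top_k_popular(train, k=2)
```

Because the ranking ignores the conversation entirely, every query receives the same list, which is what makes this baseline non-personalized.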

Secondly, we consider some representative CRS models:

• ReDIAL (Li et al., 2018): This model was released along with the ReDIAL dataset and uses an auto-encoder-based recommender (Sedhain et al., 2015).

• UniCRS (Wang et al., 2022): This model uses a pre-trained language model, DialoGPT (Zhang et al., 2020), with prompt tuning to conduct the recommendation and conversation generation tasks, respectively. It is treated as a state-of-the-art CRS model before LLMs (He et al., 2023).

Thirdly, we consider some dense retrieval models given the connections to document

retrieval:

• SBERT (Reimers and Gurevych, 2019): A modification of the pretrained BERT (Kenton and Toutanova, 2019) network. It uses siamese and triplet network structures to generate semantically meaningful sentence embeddings.

• Instructor (Su et al., 2022): A text embedding model that has been fine-tuned for instructional purposes, which is considered state-of-the-art in dense retrieval tasks.

Lastly, following He et al. (2023), we consider several zero-shot open-source LLMs as baselines. We use the 7-billion-parameter versions because of the compute burden:

• MPT-7b (Team, 2023): A recently released open-source LLM from MosaicML, trained on 1T tokens.

• Mistral-7b (Jiang et al., 2023): A recently released open-source Large Language Model with impressive performance on multiple tasks, trained by the Mistral AI team.

• Llama2-7b (Touvron et al., 2023): A commonly used open-source Large Language Model with broad ecosystem support.

We also discuss the results from GPT-3.5-turbo (Schulman et al., 2022), which is a much larger proprietary model that can achieve state-of-the-art CRS performance even in a zero-shot setting (He et al., 2023).

In particular, the GPT-3.5-turbo API was called in January 2024 with a temperature setting of 0. We note this to aid the reproducibility of results obtained from GPT APIs, given that OpenAI updates them continuously over time.

A.2 Implementation Details

For zero-shot baselines, we configured the models following the links on the official Hugging Face model pages and ran inference on our datasets. Trainable baselines used the hyperparameters suggested by their authors, with a batch size of 256. The learning rate search space is {1e-3, 1e-4, 1e-5}, and the weight decay search space is {0, 1e-6, 1e-4, 1e-2, 1}. Baselines were trained for 200 epochs, and the best model was selected based on H@10 on the validation set. The reindex and adapt steps of our model followed the same hyperparameter setup. For L2R Data and L2I Data, we used the original data mixture without adjusting the sampling ratio; the initial data weights were approximately 98:2. We defer tuning the data mixture weights to future work: it may improve recommendation results, but it is not the primary focus of this paper. For the Reindex Step, the RNN is a bidirectional single-layer GRU (Chung et al., 2014) network whose embedding size matches that of Llama2-7b (i.e., 4096) and whose hidden size is 1024. For the Adapt Step, the FISM models use an embedding size of 64, and the SASRec models use an embedding size of 64, 2 self-attention layers, and 2 attention heads.
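
The search procedure described above can be sketched as a plain grid search over the stated learning rates and weight decays, keeping the configuration with the best validation H@10. This is only an illustrative sketch: `hit_at_k`, `select_best`, and `toy_validation_h10` are hypothetical names, and the toy scoring function is a placeholder that stands in for actually training a model for 200 epochs; it does not reproduce any result from the paper.

```python
from itertools import product

# Search spaces as stated in the text.
LEARNING_RATES = [1e-3, 1e-4, 1e-5]
WEIGHT_DECAYS = [0, 1e-6, 1e-4, 1e-2, 1]


def hit_at_k(ranked_items, target, k=10):
    """H@K for one example: 1.0 if the target is in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def select_best(train_and_validate):
    """Try every (learning rate, weight decay) pair and keep the best score.

    `train_and_validate` is a caller-supplied function (an assumption here)
    that trains one model and returns its validation H@10.
    """
    best_config, best_score = None, float("-inf")
    for lr, wd in product(LEARNING_RATES, WEIGHT_DECAYS):
        score = train_and_validate(lr=lr, weight_decay=wd)
        if score > best_score:
            best_config, best_score = (lr, wd), score
    return best_config, best_score


def toy_validation_h10(lr, weight_decay):
    # Placeholder score, NOT a real result: it simply peaks at
    # lr=1e-4 and weight_decay=1e-2 so the example is deterministic.
    return 1.0 - abs(lr - 1e-4) * 100 - abs(weight_decay - 1e-2)


grid = list(product(LEARNING_RATES, WEIGHT_DECAYS))
best_config, best_score = select_best(toy_validation_h10)
```

The grid has 3 × 5 = 15 configurations per trainable model, so the cost of the search scales linearly with the per-run training cost.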

B Details of Prompts for LLMs

B.1 Prompt(s) for Recommendation

For LLMs, we follow He et al. (2023) to define the recommendation prompts as follows. These prompts are used both to obtain the LLM baseline results and in our reindex and adapt steps.

For LLM baselines, the prompt follows He et al. (2023), together with the fuzzy matching method that converts generated recommendation lists into within-dataset item ID lists. In the prompt example, we omit the “conversation templates”, which we obtain from FastChat to ensure the zero-shot performance of LLM baselines.

Prompt for LLM Baselines: Pretend you are a movie recommender system. I will give you

a conversation between a user and you (a recommender system). Based on the conversation,

you reply me with 20 movie titles without extra sentences. Here is the conversation: {}

Here, “{}” is the placeholder for the conversational context, an example of which is shown in Figure 6.
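
To illustrate the post-processing step for LLM baselines, the sketch below parses a generated numbered list and maps each title to a within-dataset item ID by fuzzy string matching. It is a stand-in under stated assumptions: the exact matching method of He et al. (2023) is not specified here, so Python's standard `difflib` is used instead, and the helper names, the 0.8 similarity cutoff, and the toy catalog are all hypothetical.

```python
import difflib
import re


def parse_titles(generated_text):
    """Split a numbered list such as '1. Title' (one per line) into titles."""
    titles = []
    for line in generated_text.splitlines():
        # Drop a leading "1." or "1)" style list marker.
        line = re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        if line:
            titles.append(line)
    return titles


def titles_to_item_ids(titles, title_to_id, cutoff=0.8):
    """Map generated titles to item IDs; unmatched titles are dropped."""
    lowered = {title.lower(): item_id for title, item_id in title_to_id.items()}
    candidates = list(lowered)
    item_ids = []
    for title in titles:
        match = difflib.get_close_matches(title.lower(), candidates, n=1, cutoff=cutoff)
        if match and lowered[match[0]] not in item_ids:
            item_ids.append(lowered[match[0]])
    return item_ids


# Hypothetical catalog and LLM output, for illustration only.
catalog = {"Edge of Tomorrow": 0, "Inception": 1, "The Matrix": 2}
output = "1. Edge of Tomorow\n2. Inception\n3. Some Unknown Film"
item_ids = titles_to_item_ids(parse_titles(output), catalog)
```

In this toy case the misspelled first title still resolves to its catalog entry, while the third title has no sufficiently close match and is discarded as out-of-dataset.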

For our RTA framework, since the base model is Llama2-7b (Touvron et al., 2023), we specify the prompt so that it is clear how we “reindex” the generated items. The prompt ends exactly with “1. ”, after which the original next tokens would be those of a recommended item title, such as “1. Edge of Tomorrow”. Instead, we record the query embedding at the end of “1. ” and replace the embedding sequence for “Edge of Tomorrow” with the single-token item “|Edge of Tomorrow|” for reindexing. The concrete prompt for the reindex step is:

Prompt for Llama-RTA: <s> [INST] Pretend you are a movie recommender system. I will give you a conversation between a user and you (a recommender system). Based on the conversation, you reply me with 20 movie titles without extra sentences. Here is the conversation: {} [/INST] 1.
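
A minimal sketch of how this prompt could be assembled is given below. The template text is copied from the prompt above, with the beginning-of-sequence marker written in its usual Llama2 form `<s>`; the names `RTA_PROMPT` and `build_rta_prompt` and the example conversation are assumptions for illustration. The trailing space after “1.” follows the description that the prompt ends exactly with “1. ”.

```python
# Template copied from the text; "{}" is the conversational context.
RTA_PROMPT = (
    "<s> [INST] Pretend you are a movie recommender system. "
    "I will give you a conversation between a user and you (a recommender system). "
    "Based on the conversation, you reply me with 20 movie titles without extra sentences. "
    "Here is the conversation: {} [/INST] 1. "
)


def build_rta_prompt(conversation):
    """Fill the conversational context into the reindex-step prompt."""
    return RTA_PROMPT.format(conversation)


# Hypothetical conversation, for illustration only.
conversation = "User: I liked Inception. Any similar sci-fi movies?"
prompt = build_rta_prompt(conversation)
```

If the tokenizer in use already prepends a beginning-of-sequence token, the literal `<s>` in the template may need to be dropped to avoid duplicating it; this depends on the tokenizer configuration and is not specified in the text.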

https://github.com/lm-sys/FastChat/blob/1db84d0906196673db361eac50d5aa65180a0ffe/fastchat/conversation.py

B.2 Prompt(s) for Generation

The prompt used for the case studies in Figure 6 is defined below, where the first placeholder “{}” is for the conversational context and the second placeholder “{}” is for the item recommendation list.

Prompt for Llama-RTA: Pretend you are a movie recommender system. I will give you a

conversation between a user and you (a recommender system). Based on the conversation, you

reply me with 20 movie titles without extra sentences. Here is the conversation: {} . Please

respond the above conversations using the recommended items below, it is better if explaining

why they are recommended, but do not list them as bullets. Insert them into your responses:

{}
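
The two placeholders can be filled as in the following sketch. The template text is copied verbatim from the prompt above; the function name `build_generation_prompt`, the comma-separated formatting of the recommended items, and the example inputs are assumptions, since the text does not state how the recommendation list is serialized.

```python
# Template copied from the text; the first "{}" is the conversational
# context and the second "{}" is the item recommendation list.
GENERATION_PROMPT = (
    "Pretend you are a movie recommender system. I will give you a "
    "conversation between a user and you (a recommender system). "
    "Based on the conversation, you reply me with 20 movie titles "
    "without extra sentences. Here is the conversation: {} . Please "
    "respond the above conversations using the recommended items below, "
    "it is better if explaining why they are recommended, but do not "
    "list them as bullets. Insert them into your responses: {}"
)


def build_generation_prompt(conversation, recommended_titles):
    """Fill in the context and a comma-separated item list (an assumed format)."""
    return GENERATION_PROMPT.format(conversation, ", ".join(recommended_titles))


# Hypothetical inputs, for illustration only.
conversation = "User: I liked Inception. Any similar sci-fi movies?"
recommended = ["Edge of Tomorrow", "The Matrix"]
generation_prompt = build_generation_prompt(conversation, recommended)
```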

Note that we do not aim to demonstrate the optimal generation strategy. Rather, we provide an example of how the language model framework developed for our recommendation system can also be reused for generative tasks, in use cases where natural-language responses are required.