Reindex-Then-Adapt: Improving Large Language Models for Conversational Recommendation
Zhankui He (1), Zhouhang Xie (1), Harald Steck (2), Dawen Liang (2), Rahul Jha (2), Nathan Kallus (2,3), Julian McAuley (1)
(1) UC San Diego, (2) Netflix, (3) Cornell University
Abstract
Large language models (LLMs) are revolutionizing conversational recommender sys-
tems by adeptly indexing item content, understanding complex conversational contexts,
and generating relevant item titles. However, controlling the distribution of recom-
mended items remains a challenge. This leads to suboptimal performance due to the
failure to capture rapidly changing data distributions, such as item popularity, on tar-
geted conversational recommendation platforms. In conversational recommendation,
LLMs recommend items by generating the titles (as multiple tokens) autoregressively,
making it difficult to obtain and control the recommendations over all items. Thus,
we propose a Reindex-Then-Adapt (RTA) framework, which converts multi-token item
titles into single tokens within LLMs, and then adjusts the probability distributions
over these single-token item titles accordingly. The RTA framework marries the benefits of both LLMs and traditional recommender systems (RecSys): it understands complex queries as LLMs do, while efficiently controlling the recommended item distributions in conversational recommendations as traditional RecSys do. Our framework
demonstrates improved accuracy metrics across three different conversational recom-
mendation datasets and two adaptation settings.
1 Introduction
Conversational Recommender Systems (CRS) (Christakopoulou et al., 2016; Li et al., 2018; He et al., 2023) address an emerging recommendation task that aims to suggest relevant and personalized items via interactive dialogues between users and systems. Recently, Large Language Models (LLMs) (Schulman et al., 2022; Brown et al., 2020b; He et al., 2023; Feng et al., 2023; Kang et al., 2023) have demonstrated proficiency in understanding user intentions
within natural language conversational contexts and exhibited substantial domain-specific
knowledge (e.g., in movies). Consequently, LLMs offer distinct advantages for CRS and
outperform existing non-LLM baselines (He et al., 2023; Wang et al., 2023b; Feng et al.,
2023). This has garnered significant interest within the research community, positioning
LLMs as an indispensable component of CRS.
arXiv:2405.12119v1 [cs.IR] 20 May 2024
Figure 1: Representative items (The Dark Knight and Black Panther) demonstrate popularity misalignments between the dataset (ReDIAL (Li et al., 2018)) and the LLM (Llama2-7b (Touvron et al., 2023)). This misalignment implies significant room for improving recommendation accuracy. Our Reindex-Then-Adapt (RTA) framework addresses this gap by aligning the distributions (e.g., item popularities), leading to substantial accuracy improvements.
In this work, we first provide a preliminary analysis of LLMs as conversational recommenders. In detail, we view LLMs for conversational recommendation as Differentiable Search Index (DSI) (Tay et al., 2022; Chen et al., 2023) models, then study LLMs’ abilities and limitations on item-indexing and item-recommendation tasks:
Abilities: LLMs have indexed numerous popular movies, potentially adequate to
understand complex conversation contexts and address many movie conversational
recommendation scenarios.
Limitations: LLMs exhibit misalignment with data distributions from target plat-
forms, as illustrated by item popularity in Figure 1. Moreover, data distributions like
item popularity evolve rapidly in practice, making adjusting LLMs more challenging.
We propose to overcome the aforementioned misalignment limitation by easily adjusting LLMs towards changing target distributions. Figure 1 illustrates an example with an LLM, Llama2-7b (Touvron et al., 2023), on the conversational recommendation dataset ReDIAL (Li et al., 2018). Despite the promising conversational recommendations by LLMs (He
et al., 2023), Figure 1(a) points out a lack of alignment with the data distribution of the
target recommendation platform. For example, The Dark Knight is popular on ReDIAL (Li
et al., 2018) but not within the LLM, while Black Panther presents a contrasting scenario.
Our proposed approach alleviates this issue by adjusting the recommendation distributions
for all target items from LLMs. Figure 1(b) shows our approach results in more aligned item
popularity between LLMs and the target dataset or platform for recommended items such
as The Dark Knight and Black Panther. These alignments bring additional recommenda-
tion accuracy improvements, and may have broader benefits, including controllability and
fairness.
Figure 2: Item indexing task, using movie descriptions from Wikipedia (Auer et al., 2007) to prompt for movie titles. We test MPT-7b, Mistral-7b, Llama2-7b and GPT-3.5-turbo, and group the accuracy by the range of occurrences of the movies in the ReDIAL CRS dataset (Li et al., 2018). We measure performance with HIT@5, i.e., whether the target movie appears in the top-5 movie list generated by the LLM, to reflect the movie knowledge stored in the parameters of the LLMs.
Achieving this adjustment of the recommendation probability distribution poses a technical challenge. In traditional RecSys, recommendation probability distributions over all target items can be adjusted by tweaking the logit vectors. Obtaining such logit vectors from LLMs, however, is challenging due to their generative retrieval paradigm for CRS (He et al., 2023). LLMs generate recommendations by auto-regressively producing multiple item titles (e.g., Top-10), each represented by a varying number of tokens. This process makes obtaining probability distributions over all recommended items computationally expensive, hindering subsequent control or adjustment efforts.
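To make the cost argument concrete, the standard autoregressive factorization of a title's probability can be written out. The symbols below ($c$ for the conversation context, $n_j$ for the number of tokens in the title of item $j$, and $v^{(j)}_i$ for its $i$-th token) are introduced here purely for illustration and are not the paper's notation.

```latex
p(\text{title}_j \mid c) \;=\; \prod_{i=1}^{n_j} p\!\left(v^{(j)}_i \,\middle|\, c,\, v^{(j)}_1, \dots, v^{(j)}_{i-1}\right)
```

Under this factorization, a full distribution over items requires evaluating every title continuation, i.e., on the order of $\sum_j n_j$ token positions, whereas a single-token item index would need only one next-token distribution restricted to the item tokens.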
To overcome this challenge, we propose a Reindex-Then-Adapt (RTA) framework. Treating LLMs as DSI models for conversational recommendation, we first conduct a reindex step: for the original LLMs, we convert already-indexed multi-token item titles (e.g., Edge of Tomorrow) into single tokens (e.g., |Edge of Tomorrow|). We then conduct an adapt step: for the reindexed LLMs, the recommended item distributions can be obtained efficiently and then adjusted, e.g., via bias term adjustment or RecSys gating.
Based on the RTA framework, we investigate four reindexing modules and two adaptation
strategies across three CRS datasets. Our experimental results show improved recommen-
dation accuracy in CRS. For instance, we improve the recommendation accuracy for the
original Llama2-7b by 59.37% in terms of the Top-10 Hit Rate, surpassing all open-source
baselines. Additionally, our studies highlight the significance of adjusting LLMs towards
target distributions in CRS and provide insights into scheduling conversational recommen-
dation modules with LLMs.
2 Preliminaries
2.1 Task Formulation
In CRS, a conversation is represented by $C = (u_t, s_t, \mathcal{I}_t)_{t=1}^{T}$, involving users $u_t \in \mathcal{U}$ and items $\mathcal{I}_t$ from the item set $\mathcal{I}$, with $T$ conversation turns. Each utterance $s_t$ comprises tokens $v_i$ from vocabulary $\mathcal{V}$. A conversation typically involves a seeker and a recommender. As formulated in CRS studies (Li et al., 2018; Chen et al., 2019; Zhou et al., 2020; Wang et al., 2022; He et al., 2023), our goal is to learn a recommender to generate a ranked list of items $\hat{\mathcal{I}}_k$ at turn $k$ that aligns with $\mathcal{I}_k$, based on the preceding context $(u_t, s_t, \mathcal{I}_t)_{t=1}^{k-1}$.
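The formulation above can be sketched as a small data structure. This is a minimal illustration, not the authors' code: the `Turn` class, the `recommendation_samples` helper, and the speaker ids "seeker" and "recommender" are hypothetical names, and the sketch assumes that only recommender turns mentioning at least one item yield a training or evaluation sample.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Turn:
    """One conversation turn (u_t, s_t, I_t)."""
    user: str                                        # speaker id u_t
    utterance: str                                   # utterance text s_t
    items: List[str] = field(default_factory=list)   # mentioned items I_t


def recommendation_samples(
    conversation: List[Turn], recommender: str
) -> List[Tuple[List[Turn], List[str]]]:
    """Return (context, targets) pairs.

    For every turn k spoken by `recommender` that mentions items, the
    context is all preceding turns and the targets are the items I_k.
    """
    samples = []
    for k, turn in enumerate(conversation):
        if k > 0 and turn.user == recommender and turn.items:
            samples.append((conversation[:k], turn.items))
    return samples


conversation = [
    Turn("seeker", "I like Back to the Future. I need a sci-fi movie.",
         ["Back to the Future"]),
    Turn("recommender", "You could try Edge of Tomorrow.", ["Edge of Tomorrow"]),
    Turn("seeker", "Thanks, I will watch it."),
]
samples = recommendation_samples(conversation, recommender="recommender")
```

Here `samples` holds one pair: the first seeker turn as context and the recommended title as the target.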
Table 1: Additional statistics for cold, warm, and popular items in the ReDIAL dataset (Li et al., 2018). #Items counts the number of items in each group, and #Occurrences sums the occurrences of those items in conversations.
ReDIAL         Cold [0, 10)       Warm [10, 100)     Pop. [100, +inf)
#Items         4,960 (78.97%)     1,193 (18.99%)     128 (2.04%)
#Occurrences   12,523 (18.03%)    33,304 (47.94%)    23,647 (34.04%)
2.2 Differentiable Search Index (DSI)
The transformer model has shown proficiency in retrieval tasks, encoding item information
within its parameters, which is a method termed Differentiable Search Index (DSI) (Tay
et al., 2022). DSI involves two key training tasks for pre-trained language models: Learn to
Index (L2I) and Learn to Retrieve (L2R), which can be used to train a model jointly or in a
sequential order. L2I focuses on mapping item content, such as movie description, to item
indices, exemplified by linking a description of Edge of Tomorrow to its title:
L2I Example: “A 2014 American science fiction action film starring Tom Cruise and Emily Blunt with ...” → Edge of Tomorrow
L2R, on the other hand, maps queries to item indices, such as:
L2R Example: “I’m feeling bored today and looking for a sci-fi action movie, preferably starring Tom Cruise.” → Edge of Tomorrow
DSI models were originally proposed for text retrieval tasks (Tay et al., 2022), yet their formulation can be connected to LLMs used in CRS. Considering the LLMs-as-CRS framework proposed in (He et al., 2023) through the lens of DSI, we observe that:
Item Indexing: LLMs index items by using the item titles (e.g., “Edge of Tomor-
row”) as the item identifiers via L2I.
Item Recommendation: LLMs use conversational context as queries to generate
item indices via L2R.
Thus, LLMs inherently function as DSI models, because their pre-training corpus includes a certain number of training samples for the L2I and L2R tasks. Compared to common two-tower models, DSI models require only a single model for item recommendations, since item information is indexed into the model parameters (Tay et al., 2022).
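The two DSI training tasks can be sketched as sample constructors that share one target space. This is an illustrative sketch only: `DSISample`, `l2i_sample`, and `l2r_sample` are hypothetical names, and the item title is assumed to serve directly as the item index, as in the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DSISample:
    task: str     # "L2I" (content -> index) or "L2R" (query -> index)
    source: str   # item content for L2I, user query for L2R
    target: str   # item index; here the item title acts as the identifier


def l2i_sample(description: str, title: str) -> DSISample:
    """Learn to Index: map item content to its index."""
    return DSISample("L2I", description, title)


def l2r_sample(query: str, title: str) -> DSISample:
    """Learn to Retrieve: map a query to the index of a relevant item."""
    return DSISample("L2R", query, title)


corpus = [
    l2i_sample("A 2014 American science fiction action film starring "
               "Tom Cruise and Emily Blunt with ...", "Edge of Tomorrow"),
    l2r_sample("I'm feeling bored today and looking for a sci-fi action "
               "movie, preferably starring Tom Cruise.", "Edge of Tomorrow"),
]

# Both tasks predict item indices, so a single model can be trained on the
# two kinds of samples jointly or one after the other.
targets = {sample.target for sample in corpus}
```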
2.3 Item Indexing: LLMs Show Sufficient Item Content Knowledge
According to (He et al., 2023), LLMs demonstrate superior knowledge in content and con-
text, particularly in the movie domain. This proficiency is attributed to the performance in
the “Learn to Index” (L2I) task, as viewed through the DSI (Tay et al., 2022). Therefore,
our primary concern is the extent to which item content has been indexed in LLMs through
the pre-training corpus.
Figure 3: Visualization of monthly relative item popularity on the Reddit-Movie dataset (He et al., 2023), which is the only CRS dataset in the wild with long-range timestamps. Item popularities change rapidly over time.
2.3.1 Observation
We gathered 6,281 pairs of movie titles from ReDIAL (Li et al., 2018) and the related descriptions from Wikipedia¹ for the experiments in Figure 2 to assess LLMs’ performance on the L2I task, and observe that:
Good Content Knowledge for Popular Items: All LLMs have indexed a considerable amount of movie content for conversational recommendation tasks. Notably, for frequently mentioned movies, defined as movies whose number of occurrences in the ReDIAL dataset (Li et al., 2018) falls in [100, +inf), all LLMs exhibit impressive content knowledge.
Best LLMs: The proprietary model GPT-3.5-t (Schulman et al., 2022) outperforms
others. Among the open-sourced LLMs of similar size, Llama2 demonstrates the best
performance in the given task, as shown in Figure 2, making it the chosen base model
for our subsequent experiments.
2.3.2 Impact
Figure 2 shows that, in terms of item indexing capability as DSI models, LLMs without
specific fine-tuning have already indexed numerous popular movies. Table 1 shows the
imbalance of item occurrences in conversational recommendations, where items labeled as
warm and pop. constitute about 20% in terms of item counts but contribute to over 80% of
occurrences. This suggests that zero-shot LLMs may be sufficient for handling many
movie conversational recommendation scenarios, because many are about warm or
popular movies. That said, fine-tuning LLMs to cover more cold items remains future work.
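The imbalance claim can be checked directly from the counts in Table 1. The short script below only redoes that arithmetic; the dictionary layout and the `share` helper are illustrative names.

```python
# Item and occurrence counts per popularity bucket, copied from Table 1.
TABLE1 = {
    "cold": {"items": 4960, "occurrences": 12523},
    "warm": {"items": 1193, "occurrences": 33304},
    "pop": {"items": 128, "occurrences": 23647},
}


def share(buckets, column):
    """Fraction of `column` contributed by the given buckets."""
    total = sum(row[column] for row in TABLE1.values())
    part = sum(TABLE1[name][column] for name in buckets)
    return part / total


item_share = share(["warm", "pop"], "items")
occurrence_share = share(["warm", "pop"], "occurrences")
print(f"warm + pop share of items:       {item_share:.2%}")
print(f"warm + pop share of occurrences: {occurrence_share:.2%}")
```

Warm and popular items together account for roughly a fifth of the catalog but more than four fifths of the occurrences, which matches the statement above.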
¹ https://www.wikipedia.org/
[Figure 4 diagram. Input conversation: “I like Back to the Future … I need a sci-fi movie with my family …”. Reindex: multi-token titles in LLMs (e.g., “Edge of Tomorrow”) are mapped to single tokens (e.g., |Edge_of_Tomorrow|) using L2R and L2I samples. Adapt: the logit vector over item tokens is adjusted via Bias Term Adjustment or RecSys Gating.]
Figure 4: Reindex-Then-Adapt (RTA) Framework. LLMs can generate a list of item titles as
the recommendations given the conversation contexts. To further improve the accuracy and
controllability, we conduct (1) reindex step: reindexing the item (e.g., movie) titles in LLMs
as single tokens to obtain the predicted logit vectors efficiently; (2) adapt step: adapting the
recommenders towards target data distributions effectively with multiple options on the logit
vectors such as adjusting bias terms or combining RecSys models with Gating mechanism.
2.4 Item Recommendation: LLMs Show Severe Distribution Misalignment
2.4.1 Observation
In this section, we aim to show that even though LLMs can index items effectively, the data distributions that LLMs fit during training do not match the target inference distributions for CRS. We discuss this distribution misalignment from two perspectives, using item popularity distribution as an example:
Static Perspective: As depicted in Figure 1(a), LLMs reflect item popularities from
the large-scale training corpus to some extent, which often do not align with popular
items on the specific platform for CRS.
Dynamic Perspective: Target data distributions, such as item popularity, undergo rapid changes over time due to factors like seasons and promotion strategies. For instance, Figure 3 shows that monthly relative item popularities on the Reddit-Movie (He et al., 2023) dataset change over time, which cannot be captured by a static LLM, even a fine-tuned one.
2.4.2 Impact
Our observations highlight the distribution misalignments between items on the target platform and those recommended by LLMs. This misalignment is considered from both static and dynamic perspectives, suggesting that: (1) despite LLMs exhibiting impressive performance in conversational recommendation (He et al., 2023), there exists room for improving recommendation accuracy by aligning with the distributions of target platforms; (2) due to the dynamic nature of recommendation platforms, target data distributions change rapidly, necessitating more efficient methods to adjust item recommendations from LLMs accordingly.
3 Framework
3.1 Overview
As discussed in Section 1, we argue that representing target items with varying token counts
in LLMs poses challenges for adjusting recommendation distributions over all target items.
To tackle this, we view LLMs as DSI models that have already indexed sufficient item con-
tent knowledge (see Section 2.3), and propose the Reindex-Then-Adapt (RTA) framework,
illustrated in Figure 4:
1. Reindex item indices with varying token numbers in LLMs into single-token item
indices using a mixture of data samples from the L2I corpus and/or L2R corpus. This
aims to remove the adapt step barrier. In contrast to the index step in the original
DSI models (Tay et al., 2022), the reindex step reuses the content of indexed items
from the LLMs, thereby facilitating the learning process for the new item indices.
2. Adapt logits from the reindexed LLMs, by transforming the logit vectors or by combining them with traditional RecSys models through a gating mechanism (Hochreiter and Schmidhuber, 1997; Chung et al., 2014; Gu et al., 2020). This adjustment aims
to effectively align the recommendation probability distributions over items with the
target data distributions.
3.2 Reindex Step: Single-Token Items in LLMs
The key to the reindex step is to “squeeze” multi-token item embeddings into single-token item embeddings efficiently, while preserving the semantics of the original item embeddings in LLM generation.
3.2.1 Identify Item Indices.
Formally, for a sentence $s = (v_i)_{i=1}^{m}$ consisting of tokens $v_i \in \mathcal{V}$, we denote the tokens representing an item for CRS tasks as $(v_i)_{i=j}^{j+n}$, where $j$ indicates the starting position of the first token for the item in $s$, and $n$ is the number of tokens representing this item. Consequently, with the Embed layer, LLMs can retrieve the sequence of token embeddings for this item:
$$(\mathbf{v}_i)_{i=j}^{j+n} = \mathrm{Embed}\big((v_i)_{i=j}^{j+n}\big), \tag{1}$$
where $\mathbf{v} \in \mathbb{R}^{d}$ is the token embedding, and $d$ is the embedding size. For example, we may look up the “Edge of Tomorrow” embeddings for the tokens represented by “[14, 72, 98]” in the sentence $s$.
3.2.2 Aggregate Multi-Token Embeddings
In the proposed reindex step, we assume that the semantics from multiple (typically shorter
than 10) token embeddings can be aggregated into a new single token embedding with a
trainable aggregator, such as:
$$\tilde{\mathbf{v}} = \mathrm{Aggregator}\big((\mathbf{v}_i)_{i=j}^{j+n}\big), \tag{2}$$
where the aggregated $\tilde{\mathbf{v}} \in \mathbb{R}^{d}$ serves as the new representation of the target item in LLMs
generation. Therefore, if all target items are represented by the “squeezed” single embed-
ding, scoring all the target items to obtain a logit vector from LLMs is efficient. In general,
many existing model architectures can be used as Aggregator, such as RNN-based (Cho
et al., 2014), Transformer-based models (Vaswani et al., 2017), or even Weighted Pooling.
We discuss the details and the comparisons with pure new embeddings in Section 4.3.
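As one concrete instance of Equation (2), a weighted-pooling aggregator can be sketched in plain Python. This is an assumption-laden illustration rather than the authors' implementation: `weighted_pool`, `position_scores`, and `proj` are hypothetical names, the position scores and the projection matrix would be learned in practice, and plain lists stand in for tensors.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool(token_embs: List[List[float]],
                  position_scores: List[float],
                  proj: List[List[float]]) -> List[float]:
    """Aggregate n token embeddings (each of size d) into one d-dim vector.

    Position-wise scores are normalized with a softmax, the embeddings are
    averaged with those weights, and the result is passed through a linear
    projection `proj` (a d x d matrix given as a list of rows).
    """
    n, d = len(token_embs), len(token_embs[0])
    weights = softmax(position_scores[:n])
    pooled = [sum(w * emb[k] for w, emb in zip(weights, token_embs))
              for k in range(d)]
    return [sum(row[k] * pooled[k] for k in range(d)) for row in proj]


# Toy example: three 2-dimensional token embeddings for one item title.
token_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_scores = [0.0, 0.0, 0.0]          # equal weights -> plain mean
identity = [[1.0, 0.0], [0.0, 1.0]]       # projection that changes nothing
item_vector = weighted_pool(token_embs, uniform_scores, identity)
```

With uniform scores and an identity projection the aggregator reduces to mean pooling, which makes the sketch easy to sanity-check before any parameters are trained.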
3.2.3 Learning Process
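The contrastive loss (Oord et al., 2018) is used for learning the aggregator to “squeeze” multi-token item embeddings and preserve the semantics for LLMs:
$$\mathcal{L}_{\text{reindex}} = -\frac{1}{|\mathcal{D}|} \sum_{(\mathbf{q}, \tilde{\mathbf{v}}) \in \mathcal{D}} \log \left[ \frac{\exp(\mathbf{q}^{\top} \tilde{\mathbf{v}})}{\exp(\mathbf{q}^{\top} \tilde{\mathbf{v}}) + \sum_{\mathbf{n} \in \mathcal{N}} \exp(\mathbf{q}^{\top} \mathbf{n})} \right], \tag{3}$$
where we loop over the training set $\mathcal{D}$ that consists of $(\mathbf{q}, \tilde{\mathbf{v}})$ pairs. Those pairs are collected from sentences containing target items. Here, $\mathbf{q} \in \mathbb{R}^{d}$ is the contextual embedding of the last position from an LLM, which is originally used to generate the first token of the originally indexed item; we now aim to force the LLM to generate the aggregated item representation $\tilde{\mathbf{v}} \in \mathbb{R}^{d}$ described in Section 3.2.2 instead. To achieve this reindex step, we also prepare negatives $\mathbf{n} \in \mathbb{R}^{d}$ from the negative representation set $\mathcal{N}$.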
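The contrastive objective in Equation (3) can be sketched as follows. This is a minimal reference computation under stated assumptions, not the training code: `reindex_loss` and the batch layout (one list of negatives per sample) are hypothetical, and the loss is written as a negative log-likelihood so that lower is better.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def reindex_loss(batch: List[Tuple[Vector, Vector, List[Vector]]]) -> float:
    """Average contrastive loss over (q, v_tilde, negatives) triples.

    q is the contextual embedding, v_tilde the aggregated embedding of the
    target item, and negatives are embeddings of other items.
    """
    total = 0.0
    for q, v_pos, negatives in batch:
        logits = [dot(q, v_pos)] + [dot(q, n) for n in negatives]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total -= logits[0] - log_denominator
    return total / len(batch)


# Toy batch: the query matches the positive item and is orthogonal to the
# single negative item.
loss = reindex_loss([([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])])
```

The loss shrinks as the positive score grows relative to the negative scores, and it is exactly zero when a sample has no negatives.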
Two types of corpus, as well as their mixture, are considered in the reindex step:
L2R Data: In this training corpus, (query, target) pairs are used as L2R samples. In the CRS context, these are samples from the conversations.
L2I Data: In this training corpus, (content, target) pairs are used as L2I samples. In the CRS context, these are samples from item metadata such as textual descriptions.
Data Mixture: In this case, we mix both L2R and L2I samples into a unified corpus and use it to train our model jointly. We use this option and include details in Appendix A.2.
3.3 Adapt Step: Item Probabilities Adjustment
After re-indexing, all the items are represented by single-token embeddings. It makes recom-
mendation as easy as one-step decoding in LLMs, and also enables multiple efficient ways to
adjust the recommendation item distributions to adapt towards target platforms or specific
data distributions. We introduce two types of adaptation methods in the following sections,
one for item popularity adjustments, and another one for combining with the traditional
recommender systems. To start with, we assume the logit vector $\mathbf{g} \in \mathbb{R}^{|\mathcal{I}|}$ has already been given by the LLMs, and the corresponding probability vector is $\mathbf{p} = \mathrm{softmax}(\mathbf{g})$.
3.3.1 Bias Term Adjustment
Inspired by (Zhao et al., 2021), a common way to adjust logits is an affine transformation, i.e.:
$$\hat{\mathbf{p}} = \mathrm{softmax}(\mathbf{g}\mathbf{W} + \mathbf{b}), \tag{4}$$
where $\mathbf{W} \in \mathbb{R}^{|\mathcal{I}| \times |\mathcal{I}|}$ is a weight matrix and $\mathbf{b} \in \mathbb{R}^{|\mathcal{I}|}$ is the bias term. Similar to (Zhao et al., 2021), we restrict the matrix $\mathbf{W}$ to be diagonal to prevent the number of parameters from growing quadratically in the number of items. Therefore, in this special case, we can interpret $\mathbf{W}$ and $\mathbf{b}$ as multiplicative and additive bias terms towards the target data distributions, respectively.
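With a diagonal weight matrix, Equation (4) reduces to an element-wise scale and shift of the logits. The sketch below illustrates this on a toy logit vector; `bias_adjust` and `w_diag` are hypothetical names, and the numbers are made up for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bias_adjust(g: List[float], w_diag: List[float],
                b: List[float]) -> List[float]:
    """Compute softmax(g W + b) where W = diag(w_diag)."""
    return softmax([w * gi + bi for gi, w, bi in zip(g, w_diag, b)])


g = [2.0, 1.0, 0.5]                       # toy LLM logits for three items
unchanged = bias_adjust(g, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
boosted = bias_adjust(g, [1.0, 1.0, 1.0], [0.0, 0.0, 2.0])
```

Setting the multiplicative terms to one and the additive terms to zero leaves the LLM distribution untouched, while a positive additive term on the third item is enough to move it to the top of the ranking.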
Table 2: Dataset Statistics. We update the Reddit-Movie CRS dataset as Reddit-V1.5 according to the raw data dump provided by (He et al., 2023) from 2012 to 2022. Specifically, conversation turns with valid recommended items are denoted as R Turns.
            INSPIRED                ReDIAL              Reddit-V1.5
            (Hayati et al., 2020)   (Li et al., 2018)   (He et al., 2023)
#Conv.      999                     11,348              2,726,471
#Turns      21,124                  139,557             5,063,007
#R Turns    1,950                   30,322              1,787,050
#Users      999                     764                 520,913
#Items      1,472                   6,281               68,285
3.3.2 Traditional RecSys Gating
Inspired by (He et al., 2023), we notice that LLMs excel at content/context knowledge, whereas traditional RecSys, whose output logit vector can be denoted as $\tilde{\mathbf{g}} \in \mathbb{R}^{|\mathcal{I}|}$, are good at collaborative knowledge instead. Motivated by this observation, combining those two types of models becomes easy after the re-indexing step:
$$\hat{\mathbf{p}} = \mathrm{softmax}(\alpha \mathbf{g} + (1 - \alpha)\tilde{\mathbf{g}}), \tag{5}$$
where the coefficient $\alpha \in [0, 1]$ can be set in many different ways. For simplicity, we use a learnable scalar $\tilde{\alpha}$ with $\alpha = \mathrm{sigmoid}(\tilde{\alpha})$ in our experiments, but more options can be considered, such as predicting $\alpha$ with an MLP to determine how much weight to give to the responses from LLMs or traditional RecSys, e.g., $\alpha = \mathrm{sigmoid}(\mathrm{MLP}(\mathbf{q}))$, where $\mathbf{q} \in \mathbb{R}^{d}$ can be the contextual embedding from an LLM in Equation (3).
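The scalar gate of Equation (5) can be sketched as follows. The function name `gated_probs` and the toy logits are illustrative assumptions; only the scalar-gate variant is shown, not the MLP-predicted one.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_probs(g_llm: List[float], g_rec: List[float],
                alpha_raw: float) -> Tuple[float, List[float]]:
    """Mix LLM and RecSys logits with alpha = sigmoid(alpha_raw)."""
    alpha = sigmoid(alpha_raw)
    mixed = [alpha * a + (1.0 - alpha) * r for a, r in zip(g_llm, g_rec)]
    return alpha, softmax(mixed)


g_llm = [2.0, 0.0]   # toy logits from the reindexed LLM
g_rec = [0.0, 2.0]   # toy logits from a traditional RecSys model
alpha_even, p_even = gated_probs(g_llm, g_rec, alpha_raw=0.0)
alpha_llm, p_llm = gated_probs(g_llm, g_rec, alpha_raw=20.0)
```

An unconstrained parameter of zero weights both models equally, and a large positive value recovers the LLM distribution almost exactly, which is why a single learnable scalar is enough to interpolate between the two recommenders.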
3.3.3 Learning Process
We use maximum likelihood estimation to derive the loss for the adapt step, in order to learn the parameters of the bias terms or the RecSys model. Note that the LLM parameters are not involved in this step, ensuring an efficient learning process:
$$\mathcal{L}_{\text{adapt}} = -\frac{1}{|\mathcal{D}^{*}|} \sum_{i=1}^{|\mathcal{D}^{*}|} \log \hat{p}_{i,*}, \tag{6}$$
where the dataset $\mathcal{D}^{*}$ is collected from the target platform, such as ReDIAL (Li et al., 2018), and is typically small. Here, $\hat{p}_{i,*}$ denotes the probability of the ground-truth item in the $i$-th data sample. Our purpose is to adapt the model towards the underlying data distribution of $\mathcal{D}^{*}$ through this learning process.
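A minimal sketch of this learning process is given below for the additive bias alone. It is a simplification made for illustration: the multiplicative terms are fixed to one, plain gradient descent is used, and `adapt_loss`, `fit_bias`, the learning rate, and the toy data are all assumptions. The LLM logits are treated as fixed inputs, mirroring the fact that the LLM parameters are not updated.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adapt_loss(logit_rows: List[List[float]], targets: List[int],
               b: List[float]) -> float:
    """Average negative log-likelihood of the ground-truth items."""
    total = 0.0
    for g, y in zip(logit_rows, targets):
        p = softmax([gi + bi for gi, bi in zip(g, b)])
        total -= math.log(p[y])
    return total / len(targets)


def fit_bias(logit_rows: List[List[float]], targets: List[int],
             steps: int = 200, lr: float = 0.5) -> List[float]:
    """Learn an additive bias by gradient descent; LLM logits stay fixed."""
    b = [0.0] * len(logit_rows[0])
    for _ in range(steps):
        grad = [0.0] * len(b)
        for g, y in zip(logit_rows, targets):
            p = softmax([gi + bi for gi, bi in zip(g, b)])
            for j in range(len(b)):
                # gradient of -log p[y] w.r.t. b[j] is p[j] - 1[j == y]
                grad[j] += (p[j] - (1.0 if j == y else 0.0)) / len(targets)
        b = [bj - lr * gj for bj, gj in zip(b, grad)]
    return b


# The LLM prefers item 0, but the target platform mostly recommends item 1.
logit_rows = [[2.0, 1.0, 0.0]] * 4
targets = [1, 1, 1, 0]
loss_before = adapt_loss(logit_rows, targets, [0.0, 0.0, 0.0])
bias = fit_bias(logit_rows, targets)
loss_after = adapt_loss(logit_rows, targets, bias)
```

After fitting, the bias for item 1 exceeds that of item 0 and the loss drops, showing how a handful of parameters can shift the item distribution towards the target data.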
4 Experiments
4.1 Experiment Setup
4.1.1 Datasets
Three conversational recommendation datasets (Hayati et al., 2020; Li et al., 2018; He
et al., 2023) are used in our experiments, where the statistics are summarized in Table 2:
Table 3: The main results for our models on conversational recommendation accuracy per-
formance, compared against (1) traditional recommendation models; (2) zero-shot large
language models (LLMs); (3) traditional conversational recommendation models; and (4)
zero-shot dense retrievers. The size of the reported LLMs used here is 7B. We denote the
model metrics with the best performance in bold. Llama2-R denotes the Llama2-7b model
after our reindex step. We also show the results after the adapt step with bias terms
(+Bias) or RecSys model combination with Gating mechanism (+RecSys).
Each cell reports mean ± standard error.

INSPIRED
Model        H@5            N@5            H@10           N@10
Popularity   .089 ± .020    .065 ± .015    .103 ± .021    .070 ± .015
FISM         .075 ± .018    .045 ± .012    .103 ± .021    .054 ± .012
SASRec       .061 ± .016    .037 ± .010    .103 ± .021    .051 ± .011
MPT          .075 ± .018    .045 ± .011    .099 ± .020    .052 ± .012
Mistral      .061 ± .016    .040 ± .011    .066 ± .017    .041 ± .012
Llama2       .080 ± .019    .050 ± .012    .122 ± .022    .064 ± .013
ReDIAL       .060 ± .016    .041 ± .012    .106 ± .021    .056 ± .012
UniCRS       .091 ± .019    .055 ± .011    .132 ± .019    .073 ± .014
SBERT        .038 ± .013    .026 ± .010    .066 ± .017    .036 ± .010
Instructor   .052 ± .015    .034 ± .011    .085 ± .019    .045 ± .011
Llama2-R     .066 ± .017    .041 ± .011    .103 ± .021    .053 ± .012
+Bias        .103 ± .021    .066 ± .014    .164 ± .025    .083 ± .015
+RecSys      .089 ± .020    .052 ± .013    .164 ± .025    .076 ± .013

ReDIAL
Model        H@5            N@5            H@10           N@10
Popularity   .035 ± .003    .025 ± .002    .052 ± .003    .030 ± .002
FISM         .065 ± .004    .040 ± .003    .112 ± .005    .054 ± .003
SASRec       .068 ± .004    .041 ± .002    .116 ± .005    .056 ± .003
MPT          .072 ± .004    .045 ± .003    .116 ± .005    .059 ± .003
Mistral      .082 ± .004    .056 ± .003    .111 ± .005    .065 ± .003
Llama2       .094 ± .004    .059 ± .003    .145 ± .005    .075 ± .003
ReDIAL       .067 ± .004    .044 ± .003    .106 ± .005    .057 ± .003
UniCRS       .085 ± .003    .058 ± .003    .112 ± .004    .071 ± .003
SBERT        .016 ± .002    .010 ± .001    .026 ± .002    .013 ± .001
Instructor   .025 ± .002    .013 ± .001    .043 ± .003    .019 ± .001
Llama2-R     .071 ± .004    .042 ± .002    .117 ± .005    .057 ± .003
+Bias        .083 ± .004    .053 ± .003    .123 ± .005    .066 ± .003
+RecSys      .094 ± .004    .060 ± .003    .146 ± .005    .076 ± .003

Reddit-V1.5
Model        H@5            N@5            H@10           N@10
Popularity   .008 ± .001    .004 ± .000    .014 ± .001    .006 ± .000
FISM         .022 ± .001    .012 ± .001    .043 ± .001    .019 ± .001
SASRec       .022 ± .001    .013 ± .001    .039 ± .001    .018 ± .001
MPT          .026 ± .001    .017 ± .001    .040 ± .001    .021 ± .001
Mistral      .029 ± .001    .020 ± .001    .038 ± .001    .023 ± .001
Llama2       .042 ± .001    .027 ± .001    .064 ± .001    .034 ± .001
ReDIAL       .029 ± .001    .019 ± .001    .044 ± .001    .024 ± .001
UniCRS       .028 ± .001    .017 ± .001    .040 ± .001    .021 ± .001
SBERT        .003 ± .000    .002 ± .000    .005 ± .000    .002 ± .000
Instructor   .009 ± .001    .006 ± .000    .017 ± .001    .008 ± .000
Llama2-R     .055 ± .001    .035 ± .001    .093 ± .002    .047 ± .001
+Bias        .059 ± .001    .037 ± .001    .096 ± .002    .049 ± .001
+RecSys      .061 ± .001    .038 ± .001    .101 ± .002    .051 ± .001
INSPIRED (Hayati et al., 2020) and ReDIAL (Li et al., 2018): These two datasets consist of small-scale human-human conversations for movie recommendations with crowd-sourced annotations from MTurk². Due to their short collection time span, temporal patterns are unlikely to be observed. Nevertheless, considering their widespread use, we present our model results on these datasets. In the following experiments, we randomly split the datasets into training, validation, and test sets using an 8:1:1 ratio.
Reddit-V1.5 (He et al., 2023): This dataset comprises large-scale movie discussions on Reddit, which were collected and processed by (He et al., 2023). It reflects real movie conversational recommendations in the wild and includes timestamps spanning 10 years, which allows studying temporal patterns. For data splitting, we use the last two months (i.e., Nov. and Dec. in 2022) as the validation and test sets, respectively, to approximate the real setting. Due to the large size of the dataset, we uniformly sample 20% of the conversation turns for validation (i.e., 11,241 samples) and testing (i.e., 13,816 samples).
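The two splitting schemes described above can be sketched as follows. This is an illustration under assumptions, not the released preprocessing code: `random_split` and `temporal_split` are hypothetical helpers, months are assumed to be ISO-style strings such as "2022-11" so that string comparison follows time order, and the 20% subsampling of validation and test turns is omitted.

```python
import random
from typing import List, Sequence, Tuple


def random_split(samples: Sequence, ratios=(8, 1, 1), seed: int = 0):
    """Shuffle and split into train/validation/test by the given ratios."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_valid = len(items) * ratios[1] // total
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test


def temporal_split(samples: List[Tuple[str, str]],
                   valid_month: str, test_month: str):
    """Split (month, payload) pairs: earlier months train, then valid, test."""
    train = [s for s in samples if s[0] < valid_month]
    valid = [s for s in samples if s[0] == valid_month]
    test = [s for s in samples if s[0] == test_month]
    return train, valid, test


train, valid, test = random_split(range(100))
timed = [("2022-10", "a"), ("2022-11", "b"), ("2022-12", "c")]
t_train, t_valid, t_test = temporal_split(timed, "2022-11", "2022-12")
```

The temporal variant keeps every training sample strictly earlier than the validation and test months, which is what lets the evaluation approximate deployment on future data.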
4.1.2 Baselines
We consider four groups of baseline models for comparison. (1) We consider some representative traditional item-based³ RecSys models, including Popularity, FISM (Kabbur et al., 2013) and SASRec (Kang and McAuley, 2018). (2) We consider some representative CRS models: ReDIAL (Li et al., 2018) and UniCRS (Wang et al., 2022), the latter of which uses a pre-trained language model. (3) We consider some dense retrieval models given the connections to document retrieval: SBERT (Reimers and Gurevych, 2019) and Instructor (Su et al., 2022). (4) We consider some zero-shot open-sourced LLMs as baselines, as in (He et al., 2023), and use the 7-billion-parameter versions due to the compute burden: MPT-7b (Team, 2023), Mistral-7b (Jiang et al., 2023) and Llama2-7b (Touvron et al., 2023). We also discuss the results from GPT-3.5-turbo (Schulman et al., 2022), which is a much larger proprietary model that can achieve state-of-the-art CRS performance even in a zero-shot setting (He et al., 2023). The details of baseline models are found in Appendix A.1.
² https://mturk.com
³ We only use item-based models since INSPIRED does not have historical user interactions.
4.1.3 Evaluation Metrics
We focus on recommendation accuracy using HIT@K (H@K) and NDCG@K (N@K), following (Li et al., 2018; Chen et al., 2019; Zhou et al., 2020; Wang et al., 2022). We report the means and the standard errors⁴ of the metrics with K ∈ {5, 10}. Implementation details can be found in Appendix A.2.
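For reference, the two metrics and their summary statistics can be sketched as below. The sketch assumes a single ground-truth item per sample (so the ideal DCG is 1) and uses the sample standard deviation for the standard error; both choices, as well as the function names, are assumptions of this illustration rather than details confirmed by the paper.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def hit_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """1.0 if the ground-truth item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """NDCG@K with one relevant item: 1 / log2(rank + 1) if within top-k."""
    top = list(ranked[:k])
    if target not in top:
        return 0.0
    rank = top.index(target) + 1
    return 1.0 / math.log2(rank + 1)


def mean_and_stderr(values: List[float]) -> Tuple[float, float]:
    """Mean and standard error (sample std divided by sqrt(n))."""
    se = stdev(values) / math.sqrt(len(values)) if len(values) > 1 else 0.0
    return mean(values), se


ranked = ["Terminator", "Edge of Tomorrow", "The Matrix"]
hit = hit_at_k(ranked, "Edge of Tomorrow", k=5)
ndcg = ndcg_at_k(ranked, "Edge of Tomorrow", k=5)
avg, se = mean_and_stderr([1.0, 0.0, 1.0, 0.0])
```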
4.2 General CRS Performance
4.2.1 Baseline Performance.
Table 3 shows the recommendation accuracy of four groups of baselines on three conversa-
tional recommendation datasets. There are some observations:
On Traditional RecSys. Conventional recsys models effectively capture target popu-
larity and further item-item similarities, resulting in reasonable recommendation accuracies.
Interestingly, on INSPIRED, we find that non-personalized popularity serves as a strong
baseline, because the limited size of the training set may restrict the ability to capture more
complex item-item relationships. The results of traditional recommendation system mod-
els also indicate the potential of improving the recommendation accuracy by aligning with
target data distributions.
On LLMs. LLMs with zero-shot prompting from (He et al., 2023) achieve impressive results, surpassing even the best results of the other baseline groups on the ReDIAL dataset. Additionally, the ranking of recommendation accuracy within the LLM group aligns with the performance in Figure 2. Further details on the proprietary model GPT-3.5-t are discussed in Section 4.5.2.
On Other Baselines. We observe that zero-shot state-of-the-art dense retrievers do not achieve performance comparable to zero-shot LLMs; this may be due to two reasons: (1) dense retrievers focus more on retrieving similar documents according to semantic similarities (e.g., similar contents), whereas LLMs show better understanding of conversation contexts; (2) for a fair comparison, we encode the textual title of each movie rather than its description, which may limit the dense retrievers’ performance. As for traditional CRS models, since we follow the setting in (He et al., 2023) to remove “repeated” items, many popular CRS models perform relatively weaker under the corrected evaluation protocol.
4.2.2 Ours vs. Baselines.
We construct a small-sized aggregator on top of Llama2 as an example, then use this aggre-
gator to reindex multi-token movie titles into single-token movie titles as recommendation
candidates, namely Llama2-R.
On Recommendation Accuracy. Table 3 shows that, following the reindex and adapt steps, our model outperforms the baselines on the INSPIRED and Reddit-V1.5 datasets, while remaining competitive with the best results on the ReDIAL dataset. Examining the reindex step (Llama2-R) and the adapt step (+Bias or +RecSys) separately, we observe a potential performance decrease in the reindex step due to the semantic gap between the original token embeddings and the new single-token embeddings from the relatively small aggregator. However, our models compensate by capturing the target data distribution through bias terms or traditional RecSys models. A more in-depth analysis of these adapt methods is given in Section 4.4.
⁴ We use error bars in our figures and gray numbers in our tables for standard errors.
Figure 5: Different methods to represent items in LLMs with single-token embeddings and the related recommendation accuracy HIT@5 after the reindex step.
On Efficiency and Flexibility. Notably, the aggregator-based methods are around 10× smaller than the corresponding out-of-vocabulary item embedding tables and approximately 233× smaller than the Llama2-7b base model, which makes them space-efficient. Additionally, as all movie titles with varying numbers of tokens are “squeezed” into single tokens, our model can rank all items with a single decoding step, making it around 100× faster than generative retrieval from LLMs when recommending the top-20 items. Moreover, single tokens make it easy to obtain the recommendation item distribution, enhancing flexibility in controlling or further adjusting the recommendations.
4.3 Effectiveness of the Reindex Step
4.3.1 Experiment Setup
We explore methods for representing item titles with single-token embeddings in LLMs, investigating four approaches: (1) Embed: randomly initialized out-of-vocabulary (OOV) embeddings. The other three approaches aggregate existing LLM token embeddings into a single-token embedding and are trained on samples from the three datasets: (2) Weighted: learning position-wise attention weights to aggregate multi-token embeddings into a single one, followed by a simple linear projection; (3) TRM: employing a single-layer transformer to derive a contextual embedding from the output CLS token; (4) RNN: using a simple GRU model to aggregate multiple token embeddings, with the last hidden state vector serving as the item representation.
4.3.2 Embedding vs. Aggregator
The embedding-based method cannot be shared across different datasets due to the practical
challenge in normalizing item titles. However, the aggregators are shared across different
datasets, using the raw text of item titles as inputs. Figure 5 demonstrates that aggregators
are not only generalizable across different datasets but also yield superior recommendation
accuracy. Interestingly, despite Reddit having a dominant share of training samples (96%)
as shown in Table 2, the trained aggregators with mixed data samples perform even better
than the dataset-specific new embeddings in Figure 5.
Table 4: Recommendation accuracy comparison among Continual-Training on Llama2-R
(Cont.), and the detailed configurations of adding bias terms or RecSys gating.
                 INSPIRED                  ReDIAL                    Reddit-V1.5
                 H@10        N@10          H@10        N@10          H@10        N@10
Llama2-R         .103 .021   .053 .012     .117 .005   .057 .003     .093 .002   .047 .001
Cont.            .146 .024   .081 .015     .124 .004   .067 .003     .093 .001   .047 .001
Bias Term Adjustment (+Bias)
  w/ gW          .155 .025   .081 .014     .123 .005   .066 .003     .093 .001   .048 .001
  w/ b           .103 .021   .053 .012     .118 .005   .057 .003     .096 .002   .049 .001
  w/ gW + b      .164 .025   .083 .004     .123 .005   .066 .003     .096 .001   .049 .001
RecSys Model Gating (+RecSys)
  + FISM         .164 .025   .076 .013     .139 .005   .072 .003     .101 .002   .049 .001
  + SASRec       .136 .023   .071 .014     .146 .005   .076 .003     .101 .002   .051 .001
4.3.3 Different Aggregators
Among the three aggregators, the Weighted method demonstrates competitive performance
despite its simple architecture. This suggests that the existing token embeddings from the
LLMs are effective enough that a weighted sum with a linear projection is a reasonable
approach to consolidating token embeddings. Additionally, TRM performs worse than RNN,
possibly because (1) titles (e.g., of movies) are typically short (fewer than 20 tokens),
which diminishes TRM's advantage over RNN in handling long dependencies; and (2)
CLS tokens have difficulty representing a sentence, as noted in the literature (Choi et al.,
2021).
4.4 Effectiveness of the Adapt Step
4.4.1 Component Analysis
Table 4 shows that introducing bias terms after the reindex step is a simple yet effective
strategy: correcting the popularity misalignments illustrated in Figure 1 can by itself
improve recommendation accuracy. Additionally, we observe that on the small dataset,
INSPIRED, +Bias outperforms +RecSys. This is because the parameter space for learning
is significantly reduced, changing from learning item-item relationships to learning item
point-wise popularity, which can be effectively captured with a small number of training
samples.
Meanwhile, Table 4 demonstrates that introducing traditional RecSys models is effective
when a large number of training samples is available to adapt the recommendation
distribution. On ReDIAL and Reddit-V1.5, this leads to improved recommendation
accuracy compared to Cont. and +Bias. However, on the small dataset INSPIRED, using
RecSys to learn item-item relationships tends to result in overfitting. This motivates
choosing the adapt step case by case. For example, after collecting the most recent
samples, bias-term adjustment (+Bias) is recommended if the number of new samples is
limited. Otherwise, RecSys gating would be a good option.
(a) Context
SYSTEM: 'Tell me what kind of movies do you like?'
USER: 'I like scary movies'
SYSTEM: 'did you watch Avengers: Infinity War (2018)?'
USER: 'I did not but I want to'
(b) Recommendation
Llama2: 1. The Conjuring 2. The Ring 3. The Babadook 4. Get Out
Llama2-R: 1. Get Out 2. It 3. Happy Death Day 4. The Conjuring
Llama2-RTA: 1. It 2. Oujia 3. Wonder Woman 4. Halloween II
(c) Generation
SYSTEM: 'Great! Based on your preference for scary movies, I recommend "It" (2017).
It's a classic horror movie that has been terrifying audiences for decades. The movie
follows a group of young outcasts who must face their fears and battle a monstrous
entity that preys on their town every 27 years.'
Figure 6: An example with real results from Llama2, Llama2-R and Llama2-RTA (+SASRec),
followed by a natural language response from Llama2 (detailed prompts can be found
in Appendix B). This conversational context is from the ReDIAL dataset, with the
ground-truth movie IT.
4.4.2 Impact of Bias Term Types
Both multiplicative and additive bias terms improve accuracy across diverse datasets, though
their impact varies. Specifically, multiplicative bias terms exhibit significant improvement
on INSPIRED and ReDIAL datasets, whereas additive bias terms play a pivotal role on
Reddit-V1.5.
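To make the two bias types concrete, the sketch below applies a per-item multiplicative gain and a per-item additive bias to item logits. The form `gain * logit + bias` is an assumption made for illustration; this section does not state the exact parameterization behind "gW" and "b", and the function name and numbers are hypothetical.

```python
def adjust_logits(logits, gain=None, bias=None):
    """Apply an assumed per-item multiplicative gain and additive bias to logits."""
    n = len(logits)
    gain = gain if gain is not None else [1.0] * n   # no multiplicative change
    bias = bias if bias is not None else [0.0] * n   # no additive change
    return [g * z + b for z, g, b in zip(logits, gain, bias)]

def argmax(xs):
    """Index of the largest value."""
    return max(range(len(xs)), key=lambda i: xs[i])

# Hypothetical logits for three items from a reindexed LLM.
zero_shot = [2.0, 1.0, 0.0]
# An additive bias that boosts item 1, e.g. an item popular on the target platform.
adapted = adjust_logits(zero_shot, bias=[0.0, 2.0, 0.0])

top_before = argmax(zero_shot)   # item 0
top_after = argmax(adapted)      # item 1
```

Only one gain and one bias value per item are learned in this sketch, which is consistent with the observation that bias terms suit settings with few training samples.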
4.4.3 Impact of RecSys Model Types
Our current focus is on "item-based" RecSys models without incorporating long-term user
representations. In this context, FISM and SASRec exhibit enhanced performance. Notably,
FISM outperforms SASRec on the INSPIRED dataset, possibly because SASRec, a
transformer-based model, is too complex for smaller datasets. Conversely, on larger
datasets such as ReDIAL and Reddit-V1.5, SASRec demonstrates superior performance,
suggesting that transformer-based RecSys models are advantageous when more data is
available. Specifically, on ReDIAL, characterized by longer conversation rounds, SASRec
may bring additional benefits in capturing item-to-item sequential patterns within
conversations.
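One simple way to picture RecSys gating is a scalar gate that blends the LLM's item logits with scores from an item-based recommender. This is only a sketch under that assumption; the gating function actually used in the paper is not specified in this section, and the function name, gate value, and scores below are hypothetical.

```python
def gate_scores(llm_logits, recsys_scores, gate):
    """Blend LLM logits with RecSys scores using an assumed scalar gate in [0, 1]."""
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    # gate = 0 keeps the LLM ranking; gate = 1 uses only the RecSys scores.
    return [(1.0 - gate) * l + gate * r for l, r in zip(llm_logits, recsys_scores)]

# Hypothetical scores for three candidate movies.
items = ["The Conjuring", "It", "Wonder Woman"]
llm_logits = [2.0, 1.5, 0.2]       # content-driven scores from the LLM
recsys_scores = [0.1, 2.5, 2.0]    # item-to-item scores from the RecSys model

blended = gate_scores(llm_logits, recsys_scores, gate=0.5)
ranked = [items[i] for i in sorted(range(len(items)), key=lambda i: -blended[i])]
```

In practice the gate would be learned on target-platform data rather than fixed, so its influence can grow as more samples become available.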
4.5 Discussions
4.5.1 Conversational Recommendation Responses
Figure 6 illustrates the complete pipeline of generating results for conversational recommen-
dation tasks. Our discussions are below:
On Recommendation. The outputs of the recommendation phase are items. In Figure 6,
the three models (Llama2 and its variants under our framework) understand the context,
yielding high-quality recommendations for scary movies. Specifically, Llama2-RTA builds
a connection between the superhero movie Avengers: Infinity War in the context and the
candidate Wonder Woman, using item-to-item relationships modeled by the SASRec (Kang
and McAuley, 2018) model. Meanwhile, we posit that although multiple recommended items
align with the conversation context, failing to adjust for item popularity on the
target platform (e.g., the movie IT being popular on ReDIAL) causes zero-shot LLMs to
miss user interests.
Table 5: Recommendation accuracy comparison between our model, based on a 7B
open-sourced LLM (Llama2), and the proprietary model ChatGPT (GPT-3.5-t).
             INSPIRED                  ReDIAL                    Reddit-V1.5
Model        H@10        N@10          H@10        N@10          H@10        N@10
Ours         .164 .025   .083 .004     .131 .005   .068 .003     .102 .002   .052 .001
GPT-3.5-t    .150 .024   .089 .016     .163 .006   .089 .003     .104 .002   .055 .001
On Generation. The outputs of the generation phase are texts. In Figure 6, the
generation phase is accomplished by prompting the Llama2 model. It is noted that our
focus in this work is solely on the technical aspects of the recommendation phase. We treat
the generation phase as a separate task that can be completed either by existing LLMs or
adjusted based on user interface requirements. Still, we make some observations: (1) In
many cases, presenting the recommendation phase suffices for users. However, our RTA
framework, which introduces only a few additional parameters without changing the weights
of the original LLMs, efficiently enables the reuse of LLMs for further generating natural-
language responses as shown in Figure 6; (2) In conversational recommendations, there
is an ongoing debate about whether to perform the recommendation or generation phase
first (Li et al., 2018; Zhou et al., 2020; Wang et al., 2022). Our example suggests that, if
the recommendation phase is frequently adjusted (a common scenario due to distribution
shift), it is advisable to perform the recommendation phase first and then the generation
phase. Reversing the order may lead to text-item inconsistency issues (e.g., the generated
response is specifically tailored for recommended movie IT, leading to a mismatch with the
recommendation from Llama2).
4.5.2 Comparison with Proprietary Models
To deepen our understanding of the models, we adopt the setting in (He et al., 2023) to
query the proprietary model GPT-3.5-t (Schulman et al., 2022). As shown in Table 5,
GPT-3.5-t remains a competitive model for conversational recommendations with zero-shot
prompting. However, it is reasonable to guess that, given our LLM-architecture-agnostic
approach, improving recommendation accuracy based on GPT-3.5-t is possible if the weights
are accessible. A reasonable next step involves working on models similar to GPT-3.5-t, such
as Llama2-70b. This could be pursued as future work, if the required compute resources are
available.
5 Related Work
5.1 Conversational Recommendation (CRS)
The objective of conversational recommender systems (CRS) is to elicit user preferences and
deliver tailored recommendations through interactive dialogues. Historically, CRS imple-
mentations have ranged from some template-driven systems (Christakopoulou et al., 2016;
Lei et al., 2020b,a; He et al., 2022; Zhang et al., 2022) to critique-based approaches (Chen
and Pu, 2012; Wu et al., 2019; Li et al., 2021). With the evolution of natural language
processing, "deep" CRS models (Li et al., 2018; Chen et al., 2019; Wang et al., 2022)
have been developed, enabling more natural-language interactions. Research indicates the
utility of CRS models is enhanced by incorporating diverse supplementary data, such as
knowledge-enriched models (Chen et al., 2019; Zhou et al., 2020) utilizing external knowl-
edge bases (Auer et al., 2007; Liu and Singh, 2004), review-centric models (Lu et al., 2021),
and session/sequence-oriented models (Zou et al., 2022; Li et al., 2022b). UniCRS (Wang
et al., 2022), which uses knowledge bases (Auer et al., 2007), is built on DialoGPT (Zhang
et al., 2020), and employs prompt tuning (Brown et al., 2020b), represents a state-of-the-art
CRS model on datasets like ReDIAL (Li et al., 2018) and INSPIRED (Hayati et al., 2020).
Recently, an emerging topic is to leverage LLMs in CRS, with (Friedman et al., 2023; He
et al., 2023) introducing a novel CRS pipeline, even in the zero-shot setting (He et al., 2023),
and (Wang et al., 2023b) focusing on advanced user simulation for LLM evaluation. Our
research is the first to study the distribution misalignments in zero-shot LLMs for CRS and
solutions for this issue to improve recommendation accuracy.
5.2 Large Language Models (LLMs)
Recent breakthroughs in natural language processing (NLP) have demonstrated that large
language models (LLMs) possess a remarkable capacity for generalizing to unfamiliar tasks
and areas (Chowdhery et al., 2022; Brown et al., 2020a; Wei et al., 2022) in zero-shot
or few-shot settings. Studies have shown that scaling up LLMs can significantly enhance
their performance and efficiency in downstream applications (Kaplan et al., 2020). In line
with these developments, LLMs have been successfully applied to various downstream tasks
such as question answering, numerical reasoning, code generation, and commonsense rea-
soning, often without requiring gradient updates (Zheng et al., 2023; Brown et al., 2020a;
Li et al., 2022a; Kaplan et al., 2020). The recommendation field has recently begun inte-
grating LLMs, either by adapting LLM architectures (Geng et al., 2022; Cui et al., 2022)
or repurposing existing LLMs for recommendation purposes (Li et al., 2023; Wang et al.,
2023a; Liu et al., 2023). Our study aligns with the research line of utilizing LLMs for
conversational recommendations. We obtain improvements in recommendation accuracy by
adjusting item recommendations within the proposed framework.
5.3 LLMs for Recommendation
There is growing interest in the academic community in harnessing LLMs for recommendation-
related tasks. One research direction explores LLMs within conventional recommendation
setups, which typically incorporate user feedback and item metadata (Kang et al., 2023; Hou
et al., 2023; Yue et al., 2023; Dai et al., 2023; Bao et al., 2023; Harte et al., 2023; Sanner
et al., 2023). This includes tasks such as rating prediction (Kang et al., 2023) and sequential
recommendation (Harte et al., 2023; Yue et al., 2023; Hou et al., 2023). In such contexts, em-
ploying LLMs as recommenders has shown potential, particularly in scenarios with extreme
data sparsity (Bao et al., 2023) or during the cold-start phase (Sanner et al., 2023). However,
they often struggle to surpass simpler baseline methods, like non-personalized popularity-
based models, in standard recommendation scenarios (Kang et al., 2023; Hou et al., 2023).
Nevertheless, enhancing existing recommender systems with features generated by LLMs
has yielded improved performance (Agrawal et al., 2023). Another significant research di-
rection focuses on language-centric recommendation tasks (He et al., 2023; Acharya et al.,
2023; Mysore et al., 2023; Feng et al., 2023; Friedman et al., 2023). These tasks include gen-
erating explanations for recommendations, narrative-based recommendations (Mysore et al.,
2023), and conversational recommendations (He et al., 2023; Feng et al., 2023; Friedman
et al., 2023). LLMs exhibit proficient performance in understanding intricate textual inputs,
allowing for personalized recommendation outputs. Recent investigations in conversational
recommendation demonstrate encouraging outcomes leveraging LLMs, even in zero-shot
configurations. Our study employs existing LLMs with minimal additional parameters, im-
plementing the Reindex-Then-Adapt framework. Through the reindexing of item content
within LLMs and fine-tuning recommendations to align with target data distributions, our
framework enhances recommendation accuracy in CRS.
6 Conclusion
This study proposes a solution to mitigate distribution misalignments between zero-shot
large language models (LLMs) and target recommendation platforms for conversational
recommendations. We conceptualize LLMs as Differentiable Search Index (DSI) models and
introduce the Reindex-Then-Adapt (RTA) framework. The framework involves converting
multi-token item titles into single tokens within LLMs (reindex step) and subsequently
adjusting their probability distributions (adapt step). By combining the strengths of LLMs
and traditional RecSys, the RTA framework achieves improved recommendation accuracy
metrics across various conversational recommendation datasets and adaptation settings.
References
Arkadeep Acharya, Brijraj Singh, and Naoyuki Onoe. 2023. LLM Based Generation of Item-
Description for Recommendation System. In Proceedings of the 17th ACM Conference on
Recommender Systems. 1204–1207.
Saurabh Agrawal, John Trenkle, and Jaya Kawale. 2023. Beyond Labels: Leveraging Deep
Learning and LLMs for Content Metadata. In Proceedings of the 17th ACM Conference
on Recommender Systems. 1–1.
Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and
Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In The Semantic Web:
6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC
2007+ ASWC 2007, Busan, Korea, November 11-15, 2007. Proceedings. Springer, 722–
735.
Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023.
Tallrec: An effective and efficient tuning framework to align large language model with
recommendation. arXiv preprint arXiv:2305.00447 (2023).
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
2020b. Language models are few-shot learners. Advances in neural information processing
systems 33 (2020), 1877–1901.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya
Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner,
Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020a. Language
Models are Few-Shot Learners. In Advances in Neural Information Processing Systems,
H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran
Associates, Inc., 1877–1901.
Li Chen and Pearl Pu. 2012. Critiquing-based recommenders: survey and emerging trends.
User Modeling and User-Adapted Interaction 22 (2012), 125–150.
Qibin Chen, Junyang Lin, Yichang Zhang, Ming Ding, Yukuo Cen, Hongxia Yang, and
Jie Tang. 2019. Towards Knowledge-Based Recommender Dialog System. In Proceedings
of the 2019 Conference on Empirical Methods in Natural Language Processing and the
9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
1803–1813.
Xiaoyang Chen, Yanjiang Liu, Ben He, Le Sun, and Yingfei Sun. 2023. Understanding
Differential Search Index for Text Retrieval. arXiv preprint arXiv:2305.02073 (2023).
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On
the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Pro-
ceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical
Translation. 103–111.
Hyunjin Choi, Judong Kim, Seongho Joe, and Youngjune Gwon. 2021. Evaluation of bert
and albert sentence embedding performance on downstream nlp tasks. In 2020 25th In-
ternational conference on pattern recognition (ICPR). IEEE, 5482–5487.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra,
Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann,
Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker
Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben-
ton C. Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy
Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev,
Henryk Michalewski, Xavier García, Vedant Misra, Kevin Robinson, Liam Fedus, Denny
Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov,
Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanu-
malayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon
Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta,
Mark Díaz, Orhan Firat, Michele Catasta, Jason Wei, Kathleen S. Meier-Hellstern, Douglas
Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling Language Modeling
with Pathways. ArXiv abs/2204.02311 (2022).
Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. 2016. Towards conver-
sational recommender systems. In Proceedings of the 22nd ACM SIGKDD international
conference on knowledge discovery and data mining. 815–824.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empiri-
cal evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint
arXiv:1412.3555 (2014).
Zeyu Cui, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. M6-
Rec: Generative Pretrained Language Models are Open-Ended Recommender Systems.
arXiv:2205.08084 [cs.IR]
Sunhao Dai, Ninglu Shao, Haiyuan Zhao, Weijie Yu, Zihua Si, Chen Xu, Zhongxiang Sun,
Xiao Zhang, and Jun Xu. 2023. Uncovering ChatGPT’s Capabilities in Recommender
Systems. arXiv preprint arXiv:2305.02182 (2023).
Yue Feng, Shuchang Liu, Zhenghai Xue, Qingpeng Cai, Lantao Hu, Peng Jiang, Kun Gai,
and Fei Sun. 2023. A Large Language Model Enhanced Conversational Recommender
System. arXiv preprint arXiv:2308.06212 (2023).
Luke Friedman, Sameer Ahuja, David Allen, Terry Tan, Hakim Sidahmed, Changbo Long,
Jun Xie, Gabriel Schubiner, Ajay Patel, Harsh Lara, et al. 2023. Leveraging Large Lan-
guage Models in Conversational Recommender Systems. arXiv preprint arXiv:2305.07961
(2023).
Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Rec-
ommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt
& Predict Paradigm (P5). In RecSys ’22: Sixteenth ACM Conference on Recommender
Systems, Seattle, WA, USA, September 18 - 23, 2022, Jennifer Golbeck, F. Maxwell
Harper, Vanessa Murdock, Michael D. Ekstrand, Bracha Shapira, Justin Basilico, Keld T.
Lundgaard, and Even Oldridge (Eds.). ACM, 299–315.
Albert Gu, Caglar Gulcehre, Thomas Paine, Matt Hoffman, and Razvan Pascanu. 2020. Im-
proving the gating mechanism of recurrent neural networks. In International Conference
on Machine Learning. PMLR, 3800–3809.
Jesse Harte, Wouter Zorgdrager, Panos Louridas, Asterios Katsifodimos, Dietmar Jannach,
and Marios Fragkoulis. 2023. Leveraging Large Language Models for Sequential Rec-
ommendation. In Proceedings of the 17th ACM Conference on Recommender Systems.
1096–1102.
Shirley Anugrah Hayati, Dongyeop Kang, Qingxiaoyang Zhu, Weiyan Shi, and Zhou Yu.
2020. INSPIRED: Toward Sociable Recommendation Dialog Systems. In Proceedings of
the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
8142–8152.
Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bod-
hisattwa Prasad Majumder, Nathan Kallus, and Julian McAuley. 2023. Large language
models as zero-shot conversational recommenders. In Proceedings of the 32nd ACM In-
ternational Conference on Information and Knowledge Management. 720–730.
Zhankui He, Handong Zhao, Tong Yu, Sungchul Kim, Fan Du, and Julian McAuley. 2022.
Bundle MCR: Towards Conversational Bundle Recommendation. In Proceedings of the
16th ACM Conference on Recommender Systems. 288–298.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural
computation 9, 8 (1997), 1735–1780.
Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and
Wayne Xin Zhao. 2023. Large language models are zero-shot rankers for recommender
systems. arXiv preprint arXiv:2305.08845 (2023).
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh
Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile
Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
Santosh Kabbur, Xia Ning, and George Karypis. 2013. Fism: factored item similarity models
for top-n recommender systems. In Proceedings of the 19th ACM SIGKDD international
conference on Knowledge discovery and data mining. 659–667.
Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation.
In 2018 IEEE international conference on data mining (ICDM). IEEE, 197–206.
Wang-Cheng Kang, Jianmo Ni, Nikhil Mehta, Maheswaran Sathiamoorthy, Lichan Hong, Ed
Chi, and Derek Zhiyuan Cheng. 2023. Do LLMs Understand User Preferences? Evaluating
LLMs On User Rating Prediction. arXiv preprint arXiv:2305.06474 (2023).
Jared Kaplan, Sam McCandlish, T. J. Henighan, Tom B. Brown, Benjamin Chess, Rewon
Child, Scott Gray, Alec Radford, Jeff Wu, and Dario Amodei. 2020. Scaling Laws for
Neural Language Models. ArXiv abs/2001.08361 (2020).
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training
of deep bidirectional transformers for language understanding. In Proceedings of NAACL-
HLT, Vol. 1. 2.
Wenqiang Lei, Xiangnan He, Yisong Miao, Qingyun Wu, Richang Hong, Min-Yen Kan,
and Tat-Seng Chua. 2020a. Estimation-action-reflection: Towards deep interaction be-
tween conversational and recommender systems. In Proceedings of the 13th International
Conference on Web Search and Data Mining. 304–312.
Wenqiang Lei, Gangyi Zhang, Xiangnan He, Yisong Miao, Xiang Wang, Liang Chen, and
Tat-Seng Chua. 2020b. Interactive path reasoning on graph for conversational recommen-
dation. In Proceedings of the 26th ACM SIGKDD international conference on knowledge
discovery & data mining. 2073–2083.
Jinming Li, Wentao Zhang, Tian Wang, Guanglei Xiong, Alan Lu, and Gerard Medioni.
2023. GPT4Rec: A Generative Framework for Personalized Recommendation and User
Interests Interpretation. arXiv:2304.03879 [cs.IR]
Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin,
and Chris Pal. 2018. Towards deep conversational recommendations. Advances in neural
information processing systems 31 (2018).
Shuyang Li, Bodhisattwa Prasad Majumder, and Julian McAuley. 2021. Self-Supervised
Bot Play for Conversational Recommendation with Justifications. arXiv preprint
arXiv:2112.05197 (2021).
Shuokai Li, Ruobing Xie, Yongchun Zhu, Xiang Ao, Fuzhen Zhuang, and Qing He. 2022b.
User-centric conversational recommendation with multi-aspect user modeling. In Proceed-
ings of the 45th International ACM SIGIR Conference on Research and Development in
Information Retrieval. 223–233.
Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi
Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert,
Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen
Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel Jaymin
Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray
Kavukcuoglu, and Oriol Vinyals. 2022a. Competition-level code generation with
AlphaCode. Science 378 (2022), 1092–1097.
Hugo Liu and Push Singh. 2004. ConceptNet—a practical commonsense reasoning tool-kit.
BT technology journal 22, 4 (2004), 211–226.
Junling Liu, Chao Liu, Renjie Lv, Kang Zhou, and Yan Zhang. 2023. Is ChatGPT a Good
Recommender? A Preliminary Study. arXiv:2304.10149 [cs.IR]
Yu Lu, Junwei Bao, Yan Song, Zichen Ma, Shuguang Cui, Youzheng Wu, and Xiaodong He.
2021. RevCore: Review-Augmented Conversational Recommendation. In Findings of the
Association for Computational Linguistics: ACL-IJCNLP 2021. 1161–1173.
Sheshera Mysore, Andrew McCallum, and Hamed Zamani. 2023. Large Language
Model Augmented Narrative Driven Recommendations. arXiv preprint arXiv:2306.02250
(2023).
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with
contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using
Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP). 3982–3992.
Scott Sanner, Krisztian Balog, Filip Radlinski, Ben Wedin, and Lucas Dixon. 2023. Large
Language Models are Competitive Near Cold-start Recommenders for Language-and
Item-based Preferences. In Proceedings of the 17th ACM Conference on Recommender
Systems. 890–896.
John Schulman, Barret Zoph, C Kim, Jacob Hilton, Jacob Menick, Jiayi Weng, Juan Fe-
lipe Ceron Uribe, Liam Fedus, Luke Metz, Michael Pokorny, Rapha Gontijo Lopes, and
Sengjia Zhao. 2022. Chatgpt: Optimizing language models for dialogue. OpenAI (2022).
Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. Autorec: Au-
toencoders meet collaborative filtering. In Proceedings of the 24th international conference
on World Wide Web. 111–112.
Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau
Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. 2022. One embedder, any task:
Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741 (2022).
Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin,
Kai Hui, Zhe Zhao, Jai Gupta, et al. 2022. Transformer memory as a differentiable search
index. Advances in Neural Information Processing Systems 35 (2022), 21831–21843.
MosaicML NLP Team. 2023. Introducing MPT-7B: A New Standard for Open-Source, Com-
mercially Usable LLMs. www.mosaicml.com/blog/mpt-7b Accessed: 2023-05-05.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama
2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural
information processing systems. 5998–6008.
Wenjie Wang, Xinyu Lin, Fuli Feng, Xiangnan He, and Tat-Seng Chua. 2023a.
Generative Recommendation: Towards Next-generation Recommender Paradigm.
arXiv:2304.03516 [cs.IR]
Xiaolei Wang, Xinyu Tang, Wayne Xin Zhao, Jingyuan Wang, and Ji-Rong Wen. 2023b.
Rethinking the Evaluation for Conversational Recommendation in the Era of Large Lan-
guage Models. arXiv preprint arXiv:2305.13112 (2023).
Xiaolei Wang, Kun Zhou, Ji-Rong Wen, and Wayne Xin Zhao. 2022. Towards Unified
Conversational Recommender Systems via Knowledge-Enhanced Prompt Learning. In
Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data
Mining. 1929–1937.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed
Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reason-
ing in Large Language Models. In Advances in Neural Information Processing Systems,
S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35.
Curran Associates, Inc., 24824–24837.
Ga Wu, Kai Luo, Scott Sanner, and Harold Soh. 2019. Deep language-based critiquing
for recommender systems. In Proceedings of the 13th ACM Conference on Recommender
Systems. 137–145.
Zhenrui Yue, Sara Rabhi, Gabriel de Souza Pereira Moreira, Dong Wang, and Even Oldridge.
2023. LlamaRec: Two-Stage Recommendation using Large Language Models for Ranking.
arXiv preprint arXiv:2311.02089 (2023).
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng
Gao, Jingjing Liu, and William B Dolan. 2020. DIALOGPT: Large-Scale Generative
Pre-training for Conversational Response Generation. In Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics: System Demonstrations. 270–
278.
Yiming Zhang, Lingfei Wu, Qi Shen, Yitong Pang, Zhihua Wei, Fangli Xu, Bo Long, and Jian
Pei. 2022. Multiple Choice Questions based Multi-Interest Policy Learning for Conversa-
tional Recommendation. In Proceedings of the ACM Web Conference 2022. 2153–2162.
Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before
use: Improving few-shot performance of language models. In International Conference on
Machine Learning. PMLR, 12697–12706.
Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei
Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2023. CodeGeeX: A
Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X.
arXiv:2303.17568 [cs.LG]
Kun Zhou, Wayne Xin Zhao, Shuqing Bian, Yuanhang Zhou, Ji-Rong Wen, and Jingsong
Yu. 2020. Improving conversational recommender systems via knowledge graph based
semantic fusion. In Proceedings of the 26th ACM SIGKDD international conference on
knowledge discovery & data mining. 1006–1014.
Jie Zou, Evangelos Kanoulas, Pengjie Ren, Zhaochun Ren, Aixin Sun, and Cheng Long.
2022. Improving conversational recommender systems via transformer-based sequential
modelling. In Proceedings of the 45th International ACM SIGIR Conference on Research
and Development in Information Retrieval. 2319–2324.
A More Details of Experiments
A.1 Baseline Details
We consider four groups of baseline models for comparison. Firstly, we consider some
representative traditional item-based RecSys models:
Popularity: This method is non-personalized and returns the top-K most popular
movies within the related datasets.
FISM (Kabbur et al., 2013): A commonly used factored item similarity model for
item-based collaborative filtering.
SASRec (Kang and McAuley, 2018): A competitive self-attention-based sequential
recommender system.
Secondly, we consider some representative CRS models:
ReDIAL (Li et al., 2018): This model is released along with the ReDIAL dataset
with an auto-encoder (Sedhain et al., 2015)-based recommender.
UniCRS (Wang et al., 2022): This model uses a pre-trained language model, DialoGPT
(Zhang et al., 2020), with prompt tuning to conduct recommendation and
conversation generation tasks respectively. This model is treated as a state-of-the-art
CRS model before LLMs (He et al., 2023).
Thirdly, we consider some dense retrieval models given the connections to document
retrieval:
SBERT (Reimers and Gurevych, 2019): A modification of the pretrained BERT (Devlin
et al., 2019) network. It uses siamese and triplet network structures to
generate semantically meaningful sentence embeddings.
Instructor (Su et al., 2022): A text embedding model that has been fine-tuned for
instructional purposes, which is considered state-of-the-art in dense retrieval tasks.
Lastly, following He et al. (2023), we consider some zero-shot open-sourced LLMs as
baselines. We use the 7-billion-parameter versions due to compute constraints:
MPT-7b (Team, 2023): A recently released open-sourced LLM from MosaicML,
trained on 1T tokens.
Mistral-7b (Jiang et al., 2023): A recently released open-sourced Large Language
Model with impressive performance on multiple tasks, trained by the Mistral AI team.
Llama2-7b (Touvron et al., 2023): A commonly used open-sourced Large Language
Model with wide ecosystem support.
We also discuss the results from GPT-3.5-turbo (Schulman et al., 2022), which is a
much larger proprietary model that can achieve state-of-the-art CRS performance even in
a zero-shot setting (He et al., 2023).⁵
⁵ In particular, the GPT-3.5-turbo API was called in January 2024 with a temperature setting of 0. This
note aims to enhance the reproducibility of results from GPT APIs, considering the continuous updates
made by OpenAI over time.
A.2 Implementation Details
For zero-shot baselines, we configured the models following the links on their official
Hugging Face model pages and ran inference on our datasets. Trainable baselines used the
hyperparameters suggested by their authors, with a batch size of 256. The search space is
{1e-3, 1e-4, 1e-5} for the learning rate and {0, 1e-6, 1e-4, 1e-2, 1} for weight decay.
Baselines were trained for 200 epochs, and the best model was selected based on H@10 on
the validation set. The reindex and adapt steps of our model followed the same
hyperparameter setup. For L2R Data and L2I Data, we used the original data mixture
without adjusting the sampling ratio. The initial data weights were approximately 98:2;
tuning the data mixture weighting may improve recommendation results, but it is not the
primary focus of this paper and is deferred to future work. For the Reindex Step, the RNN
is a GRU (Chung et al., 2014) network whose embedding size matches that of Llama2-7b
(i.e., 4096) and whose hidden size is 1024. We use bidirectional single-layer GRU modules.
For the Adapt Step, the FISM models use embedding size 64, and the SASRec models use
embedding size 64, 2 self-attention layers, and 2 attention heads.
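To make the search procedure above concrete, the following is a minimal sketch of the grid over learning rate and weight decay, with model selection by validation H@10. Only the grid values, batch size, and epoch count come from the text; the function names and the `train_and_validate` callable are hypothetical placeholders for illustration, not our released code.

```python
from itertools import product

# Search space and fixed settings as reported in the text.
LEARNING_RATES = [1e-3, 1e-4, 1e-5]
WEIGHT_DECAYS = [0, 1e-6, 1e-4, 1e-2, 1]
BATCH_SIZE = 256
NUM_EPOCHS = 200


def hit_at_k(ranked_items, target_item, k=10):
    """Return 1.0 if the ground-truth item is among the top-k ranked items."""
    return 1.0 if target_item in ranked_items[:k] else 0.0


def mean_hit_at_k(rankings, targets, k=10):
    """Average H@k over (ranking, target) pairs; 0.0 for empty input."""
    hits = [hit_at_k(r, t, k) for r, t in zip(rankings, targets)]
    return sum(hits) / len(hits) if hits else 0.0


def build_grid():
    """Enumerate all 3 x 5 = 15 hyperparameter configurations."""
    return [
        {"lr": lr, "weight_decay": wd, "batch_size": BATCH_SIZE, "epochs": NUM_EPOCHS}
        for lr, wd in product(LEARNING_RATES, WEIGHT_DECAYS)
    ]


def select_best(configs, train_and_validate):
    """Pick the configuration with the highest validation H@10.

    `train_and_validate` is a hypothetical callable: config -> validation H@10.
    """
    scored = [(train_and_validate(cfg), cfg) for cfg in configs]
    best_score, best_cfg = max(scored, key=lambda pair: pair[0])
    return best_cfg, best_score
```

In use, `train_and_validate` would train one model per configuration and return `mean_hit_at_k` computed on the validation split.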
B Details of Prompts for LLMs
B.1 Prompt(s) for Recommendation
For LLMs, we follow He et al. (2023) in defining the recommendation prompts below,
which are used both to obtain the LLM baseline results and in our reindex and adapt
steps.
For LLM baselines, this prompt follows He et al. (2023), together with the fuzzy matching
method that converts generated recommendation lists into within-dataset item ID lists. In
the prompt example, we omit the “conversation templates”, which we obtain from
FastChat⁶ to ensure the zero-shot performance of LLM baselines.
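As an illustration of the fuzzy matching step, the sketch below parses a numbered list of generated titles and maps each one to a within-dataset item ID. It assumes Python's standard `difflib` similarity and a cutoff of 0.8. Both are illustrative choices, as are the function names, and they are not necessarily the exact matching procedure of He et al. (2023).

```python
import difflib
import re


def parse_numbered_list(generated_text):
    """Extract titles from lines such as '1. Edge of Tomorrow'."""
    titles = []
    for line in generated_text.splitlines():
        match = re.match(r"^\s*\d+[.)]\s*(.+?)\s*$", line)
        if match:
            titles.append(match.group(1))
    return titles


def match_titles_to_ids(generated_titles, catalog, cutoff=0.8):
    """Map generated titles to item IDs via approximate string matching.

    `catalog` maps in-dataset titles to item IDs. Titles with no match at or
    above `cutoff` (an assumed threshold) are dropped; duplicate IDs are kept once.
    """
    normalized = {title.lower().strip(): item_id for title, item_id in catalog.items()}
    keys = list(normalized)
    item_ids = []
    for raw_title in generated_titles:
        query = raw_title.lower().strip()
        hits = difflib.get_close_matches(query, keys, n=1, cutoff=cutoff)
        if hits:
            item_id = normalized[hits[0]]
            if item_id not in item_ids:
                item_ids.append(item_id)
    return item_ids
```

For example, a generated title with a small typo such as "Edge of Tomorow" would still resolve to the catalog entry for Edge of Tomorrow, while a title absent from the dataset is discarded.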
Prompt for LLM Baselines: Pretend you are a movie recommender system. I will give you
a conversation between a user and you (a recommender system). Based on the conversation,
you reply me with 20 movie titles without extra sentences. Here is the conversation: {}
Here, “{}” is the placeholder for the conversational context, which is exemplified by
Figure 6.
For our RTA framework, since the base model is Llama2-7b (Touvron et al., 2023),
we specify the prompt to make clear how we “reindex” the generated items. This
prompt ends exactly with “1. ”, after which the original next tokens are those of the
recommended item title, such as “1. Edge of Tomorrow”. Instead, we record the query
embedding ending with “1. ” and replace the embedding sequence for Edge of Tomorrow
with the single token |Edge of Tomorrow| for reindexing. The concrete prompt for the
reindex step is:
Prompt for Llama-RTA: <s> [INST] Pretend you are a movie recommender system. I will
give you a conversation between a user and you (a recommender system). Based on the
conversation, you reply me with 20 movie titles without extra sentences. Here is the conversation:
{} [/INST] 1.
⁶ https://github.com/lm-sys/FastChat/blob/1db84d0906196673db361eac50d5aa65180a0ffe/fastchat/conversation.py
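The two recommendation prompts above can be assembled as in the following minimal sketch. The instruction text and the Llama-2 `[INST]` wrapper are copied from the prompts in this section; the function names are hypothetical. The `<s>` marker is written literally to mirror the prompt text, although in practice a tokenizer may prepend it automatically.

```python
# Instruction text shared by the LLM baselines and Llama-RTA; "{}" is the
# placeholder for the conversational context.
INSTRUCTION = (
    "Pretend you are a movie recommender system. I will give you a conversation "
    "between a user and you (a recommender system). Based on the conversation, "
    "you reply me with 20 movie titles without extra sentences. "
    "Here is the conversation: {}"
)


def build_baseline_prompt(conversation):
    """Prompt for the zero-shot LLM baselines (conversation template omitted)."""
    return INSTRUCTION.format(conversation)


def build_rta_prompt(conversation):
    """Prompt for Llama-RTA, ending exactly with '1. '.

    The model state at this position serves as the query embedding, so the
    next position is where a single reindexed item token would be predicted.
    """
    return "<s> [INST] " + INSTRUCTION.format(conversation) + " [/INST] 1. "
```

Because the conversation is passed as an argument to `str.format`, any braces inside the conversation text are inserted verbatim and are not interpreted as placeholders.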
B.2 Prompt(s) for Generation
For the case studies in Figure 6, we define the generation prompt as below, where the
first placeholder {} is for the conversational context, and the second placeholder {} is
for the item recommendation list.
Prompt for Llama-RTA: Pretend you are a movie recommender system. I will give you a
conversation between a user and you (a recommender system). Based on the conversation, you
reply me with 20 movie titles without extra sentences. Here is the conversation: {} . Please
respond the above conversations using the recommended items below, it is better if explaining
why they are recommended, but do not list them as bullets. Insert them into your responses:
{}
Note that we do not aim to demonstrate the optimal generation strategy. Rather, we
provide an example of how the language model framework developed for our recommendation
system can also be reused for generative tasks, in use cases where natural-language
responses are required.
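As a small sketch of how the two placeholders in the generation prompt might be filled, the code below inserts the conversational context first and the recommendation list second. The template wording is copied from the prompt above; joining the recommended titles with commas is an assumption for illustration, as is the function name.

```python
# Generation prompt with two "{}" placeholders: the conversational context,
# then the recommended item list.
GENERATION_TEMPLATE = (
    "Pretend you are a movie recommender system. I will give you a conversation "
    "between a user and you (a recommender system). Based on the conversation, "
    "you reply me with 20 movie titles without extra sentences. "
    "Here is the conversation: {} . Please respond the above conversations using "
    "the recommended items below, it is better if explaining why they are "
    "recommended, but do not list them as bullets. Insert them into your "
    "responses: {}"
)


def build_generation_prompt(conversation, recommended_titles):
    """Fill the context and the recommendation list into the template.

    Assumption: the recommended titles are serialized as a comma-separated string.
    """
    items = ", ".join(recommended_titles)
    return GENERATION_TEMPLATE.format(conversation, items)
```

Here `recommended_titles` would be the top-ranked items produced by the recommendation step for the same conversation.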