#29: Transformers for Recommender Systems with Craig Macdonald and Sasha Petrov
Note: This transcript has been generated automatically using OpenAI's whisper and may contain inaccuracies or errors. We recommend listening to the audio for a better understanding of the content. Please feel free to reach out if you spot any corrections that need to be made. Thank you for your understanding.
Recommender systems are nearly everywhere.
Every time I open an app on my phone, or even when I look at my phone, it's making a recommendation choice about which app it thinks I'm about to use next.
This is the model that is dominating all other model architectures.
Look, it's replacing RNNs, it is replacing convolutional neural networks.
So it was clear that it is the future of deep learning-based everything including recommender systems.
Setting the problem right is more important than kind of tweaking the solution.
When you set the training objective, the loss functions properly, the model architecture itself doesn't matter that much.
This item masking of BERT4Rec is good.
You learn a very, very good representation, but it takes an awful long time to do that.
And at the end of the day, it's not actually that closely related to the end goal.
You're getting a good representation because you're hiding maybe the second item in a sequence and you're forcing it to recover that.
But that's not very good at helping it predict the last item in the sequence.
If you take out negative sampling from SASRec and apply the same loss function as BERT4Rec uses, essentially the difference in effectiveness disappears.
We were able to achieve the same effectiveness, but with retaining negative sampling.
LLMs as recommenders are not going to be everywhere.
They're just too big at the moment.
So we're calling it joint product quantization because we are learning to product quantize and do recommendation at the same time.
Hello and welcome to this new episode of RECSPERTS, recommender systems experts.
For today's episode, I have invited two experts from the academic, but also industrial side of things.
And we are going to talk about sequential recommender systems, a topic that many of you might have heard about before and applied to recommendations in your own domain.
So a huge and large topic that has been seeing increasing attention, with work from domains outside of recommender systems also being applied within recommender systems.
So to give you some brief keywords for this episode and get you hyped before welcoming my guests, we are going to talk about sequential recommendations or maybe also the topic of SASRec or BERT4Rec.
We are talking a lot about these transformer models and how to apply them to sequences to come up with recommendations.
And we look into replicability studies.
We look into a tutorial on transformers for sequential recommendations.
And you might already be guessing whom I have invited to join me on the session and talk us through all those exciting things.
And you might be right.
I have invited two researchers from the University of Glasgow.
And here with me today are Professor Craig Macdonald and Alexander Sasha Petrov.
Hello and welcome to the show.
Hi, Marcel.
It's great to be here.
Hi, Marcel.
I'm really pleased to join today's episode.
Thanks.
Let me give a brief introduction to welcome my guests to the show today.
So Professor Craig Macdonald is a full professor of information retrieval in the School of Computing Science at the University of Glasgow in Scotland, UK.
He's one of the lead developers of the Terrier information retrieval platform and has many papers published at ACM Transactions on Recommender Systems, at ECIR, ACL, RecSys, of course WWW, or SIGIR.
And I could also see that Craig is a sailor.
At least this is what the image on his bio site might tell us.
Sasha Petrov is a PhD student under the supervision of Craig and currently pursuing his PhD at the University of Glasgow.
And I guess, or he can tell us himself, that he is almost finished, or by the release of this episode has actually already finished, his PhD.
But he has also quite some extensive practical experience in recommender systems since he has been a senior software engineer and applied scientist working for Amazon for a couple of years before switching to the academic side.
And he has also plenty of experience in recommender systems, search engine ranking and personalized ads.
And of course also lots of papers, for example, at ACM RecSys or SIGIR.
With that, Craig, do you want to continue and tell us a bit more about yourself and your way into recommender systems?
Yeah, so as you said, really well, we do lots of research here.
We have a big information retrieval group in Glasgow.
It's been here for 40 odd years.
It was actually founded by Keith van Rijsbergen, who kind of wrote a fundamental book on information retrieval.
But my background is I came from search engine research, but we realized that increasingly there was more kind of crossovers that were possible between search engines and recommendations.
So when Sasha started talking to me about doing a PhD, I was quite excited.
I could see the value in his background and some interesting things that we could do to push forward in recommendation.
Given the many papers that both of you have authored, it has been quite a productive collaboration so far, hasn't it?
Yep, and that's resulting in a very, very big PhD thesis, which I'm having to read draft chapters of fairly regularly.
We should have the final draft soon.
And I have some examiners lined up for Sasha, but I can't tell them who they are.
Sasha, on your side of things, maybe what might be of particular interest for many listeners is why you switched sides and what made you switch from the, I mean, it's not that you weren't doing any practical things at all.
I guess we will learn today that on the research side, you are already thinking about a lot of practical problems, and we are going to dive into that a bit more throughout the episode.
But nevertheless, now being on the academic side: what made you join academia, and how has your journey been, as an applied scientist before and as a researcher nowadays?
Yeah, so basically throughout my applied career, I was always kind of gravitating towards the machine learning side of things.
And in particular in the domain of personalization.
So I've been working with personalization since 2012.
So kind of over 10 years already.
And I was applying things without knowing too deep how they work.
And I was always interested in going deeper and deeper in understanding and maybe even giving something back to the community.
And I think I went through different roles in industry, including like a software engineer and like a manager of software engineers and applied scientists.
And my last role was obviously in Amazon, where I also worked on one of the closely related areas to recommender systems.
And there was also a cool project of Amazon with University of Edinburgh, where some of Amazon employees co-supervised students from University of Edinburgh.
And I was happy enough to co-supervise several master students with some of the people in Amazon who actually worked on recommender systems.
And I really enjoyed doing this kind of more researchy type of things.
And this is when I decided that I probably want to go into academia and dive deeper into this.
But I want to say, actually, that it was a very hard decision to go from industry to academia, for a number of reasons.
But the most obvious of which is that I had to really reduce my spending a lot.
But no, I think it was a great decision and I really enjoyed it.
So now I really understand that I'm not just, say, applying transformers and BERT models; I really understand how they work, why they work that way, and how to make them applicable for the tasks that are of interest to me.
Yeah, yeah.
Which, I mean, is actually also an investment into your professional future, and actually makes your knowledge and skills much more valuable.
So it's maybe rather an investment than having to forgo some of the, let's say, big tech money for a while.
And how about research and also teaching in the UK: to which degree do you actually have teaching obligations, or many of the other things?
How is this, just out of personal curiosity?
As a PhD student, so we said we wanted to work on recommendation, some form of sequential recommendation, but there's no requirement for the PhD students to be too involved in teaching.
But yeah, no, Sasha has been involved as a tutor within labs and also given guest lectures in a course on recommender systems that I co-teach with one of my colleagues.
And I think it's great to bring in people from industry to give guest lectures to the students, so that they can see some of the challenges that are faced. We're telling them about some fundamental models, and we'll say, well, what really are the challenges of deploying that in a particular domain, trying to get them to think qualitatively about the problem and recommenders.
I feel it's quite tempting just to look, oh, how does NDCG change?
No, you need to eyeball the recommendations and understand if they're good and understand the constraints of the domain that we're operating in.
It's not just a case of one model fits all recommendation scenarios.
Craig, how did it actually come about that you went, more or less over time, into the recommender systems field?
I mean, traditionally search and recommendation are somewhat closely related, but where was the point when you became more interested in recommender systems and also found this an interesting opportunity for yourself?
I think that's a really good question actually.
At one point we felt that, for search, people could only see one outlet.
It was at a Google or the like; if you weren't having impact there, you were not doing the right kind of research.
So I felt that there were lots more applications of recommendation, so therefore lots more ways to have impact, lots more companies to talk to.
I think the landscape's moved on a lot in the last 10 years, and there are lots of different companies doing search on different domain-specific things, or RAG, et cetera.
I don't think it's as true as it was in the past, but I do feel that we're able to have great research and great impact in both search and recommendation.
Yeah, definitely makes sense.
And then especially with last year's RecSys in 2024, we have actually also seen quite some works that try to put both applications a bit closer together.
I guess there was that work by Spotify, or also the UniCoRn model proposed by Netflix, kind of bridging the gap from search to recommendation.
There was also a keynote by Google's Ed Chi.
Ed Chi, yeah.
Ed Chi, yeah.
So he gave the keynote at the large-scale RecSys workshop, where he was talking exactly on this topic, kind of saying that everything is essentially the same thing, and you can just have a universal ranker and give it different input: if a user is the input, then it becomes a recommendation.
If you give it a query as an input, it becomes search.
If you give both user and query, it becomes personalized search.
If you give, let's say content of a webpage, it becomes an advertisement.
So essentially the same thing, the same model can serve most of the tasks.
And there is definitely a trend on this in the industry.
Yeah, that's definitely a great way of putting this.
However, there was also a different conference last year, the European Conference on Information Retrieval, ECIR.
And as part of that conference, you both presented a tutorial and this was for me kind of the starting point for today's session.
And when I reached out to you, this tutorial was called Transformers for Sequential Recommendation.
And I still remember that I have at least already seen Craig somewhere at RecSys on stage and presenting their work.
And this tutorial, also paired with some personal interest, was something that made me aware that sequential recommendation is currently underserved on RECSPERTS.
And then I actually reached out and said, okay, let's talk about this topic.
And nowadays transformers are kind of everywhere.
Everybody talks about transformers, even though the paper 'Attention Is All You Need' is already a bit old.
And we have seen quite a quick adoption in recommender systems, with one of the, I hope I can say that, earliest applications of transformers in the domain of recommendations being the paper by Kang and Julian McAuley back in 2018.
So, Self-Attentive Sequential Recommendation.
So it's already seven years since then.
A lot of things have happened in the meanwhile.
Before we dive deeper into the specifics and also into your publications, could you describe the setting for sequential recommendations?
So why actually does it need sequential recommendations?
And what is sequential recommendation actually about?
So what does it mean?
Yeah, so sequential recommendation is about, you can imagine the situation where we're logging what the user's looking at or buying, and we're using that as input to the model at inference time.
So we're taking the recent history of interactions with the user, maybe all of the interactions that we have and say, this is what the user has been looking at.
Can we predict the next thing that the user is going to view, purchase, watch, et cetera?
So the task of sequential recommendation is really a step forward from what came before, which was a kind of collaborative filtering, where we just think about having a matrix of users and items, and we're trying to predict which of the blank cells the users might be interested in purchasing. It's a step forward because it takes the current context of the user into account.
We don't have to retrain the model every night because the user has looked at more items.
What the user has looked at recently is part of the input to the model at inference time.
And that allows us to take causal relationships into account.
If you've just bought a coffee maker, then maybe you need coffee.
If you've just bought a phone, then maybe you need a screen protector or a case or a charging cable or something.
So it really allows you to adapt to the context that the user is in, based on their recent interactions.
Okay, and which kind of traditional approaches or which earlier approaches have we taken?
I mean, you mentioned collaborative filtering, typical matrix factorization.
I mean, there are some time-aware extensions of matrix factorization, like timeSVD++, or such things.
So where have we observed, and with which models, possibly the first approaches to modeling the sequence of interactions in recommender systems?
So, I mean, I would highlight models like GRU4Rec.
That's an RNN-based model.
Caser was a kind of CNN-based model.
So people were already trying to say, how can we address this task taking techniques that have come from other areas?
So RNNs, yeah, but it was a language modeling type framework as well.
And Hidasi took that and said, okay, let me make an RNN-based model for recommendation.
There have been lots of implementations of GRU4Rec.
Hidasi gave a really good talk saying there are lots of implementations, but none of them are really as good in actual effectiveness as his original implementation.
So those were the kind of existing frameworks that were kind of really treating it as a sequential recommendation task.
I mean, there are other sequence tasks that existed in the literature.
You can even think of some kind of Markov chain model as just being a way of trying to predict what is the next item in a sequence.
But I think transformers have nice properties.
And clearly, when transformers started taking off, the SASRec paper by Kang and McAuley was one of the first attempts to look at that.
So why was that?
Well, sequential recommendation, you can think of it as completing a sequence of numerical IDs.
So if I've seen the IDs for some movies that you have watched, then we're trying to predict the ID of the next item that we think that the user should watch.
So we're gonna have a ranking of items in descending order.
These are the items that we think the user is most likely to watch.
So compared to having a language model, in the language model for text, we have IDs for tokens, for words if you like.
In recommendation, we can use the same kind of model architecture and the same kind of training infrastructure, but instead of having IDs for words, we've got IDs for the items that are in our catalog.
So then making a sequential recommendation is just completing a sequence.
We're saying, given that we've seen these numbers in this order, what is the missing ID that we think is most likely to go at the end of that sequence of IDs?
Yeah, there are two models that we studied quite a bit at the start of our couple of years of research.
So SASRec, that's a decoder-based model.
It uses causal attention, i.e. like GPT, which means the representation of an item can't be influenced by future interactions.
And we train it by saying, here is a sequence of items.
Can you correctly predict the last item?
The other model that we looked at is kind of the opposite.
It was based on the encoder architecture.
So that's BERT4Rec, right?
So BERT4Rec, it's really just the BERT language model architecture and training methodology applied to item IDs.
So there is what we'd say an item masking training objective.
So given a training dataset with lots of item IDs and sequences, it randomly selects a few to mask out to hide from the model, and the model is trained to recover those.
And then come inference time, we say, here is a sequence of items, and the last item is masked out.
Can you predict correctly what items should go there?
And so that was really the start of our research in this area.
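Note: to make the two training objectives concrete, here is a minimal Python sketch (an illustration, not the authors' code) of how training examples could be constructed under each objective; the item IDs and the reserved mask token are made up for the example.

```python
import random

MASK = 0  # hypothetical reserved token for masked positions

def sasrec_style_example(seq):
    """Sequence continuation (SASRec-style): at every position the
    target is the next item; causal attention prevents peeking ahead."""
    return seq[:-1], seq[1:]  # inputs, targets (shifted by one)

def bert4rec_style_example(seq, mask_prob=0.2):
    """Item masking (BERT4Rec-style): randomly hide items anywhere in
    the sequence and train the model to recover them."""
    inputs, targets = [], []
    for item in seq:
        if random.random() < mask_prob:
            inputs.append(MASK)
            targets.append(item)  # loss only at masked positions
        else:
            inputs.append(item)
            targets.append(None)
    return inputs, targets

seq = [12, 7, 93, 45, 3]  # a toy sequence of item IDs
print(sasrec_style_example(seq))
print(bert4rec_style_example(seq))
```

At inference time, the masking approach appends a mask token at the end of the sequence and asks the model to fill it, which is exactly what Craig describes above.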
If you go back, you mentioned these two kind of foundational papers: SASRec, which uses causal self-attention.
And then, on the other side, the encoder-like BERT4Rec model that uses bidirectional self-attention.
So kind of two opposing models used for kind of the same task.
The first, so the SASRec paper, appeared in 2018, the BERT4Rec paper one year later, in 2019, at CIKM.
And, actually not very surprisingly, BERT4Rec used in its evaluation a comparison against SASRec and claimed superiority over SASRec across different datasets.
Of course, our beloved and famous MovieLens, but also Steam and some Amazon Beauty dataset.
So this is basically kind of the, for me, the founding pillars for all of the work that you have done.
Because for me, it all starts with this systematic review and replicability study of BERT4Rec for sequential recommendation that you both published at RecSys 2022 in Seattle.
So, three years later after that work: how did your interest in this particular work come about, so that you said, okay, there are these two, let's call them maybe foundational models, even though this term is a bit overused nowadays?
How did it come that you looked into them and took that as a starting point for a replicability study, to investigate their effectiveness, comparing and questioning what had been reported before?
By the time of 2022, with transformers in general, even in language modeling, but also in recommender systems, it was clear that this is the model that is dominating all other model architectures.
Look, it's replacing RNNs, it is replacing convolutional neural networks.
So it was clear that it is the future of deep learning based everything, including recommender systems.
And those two papers were early works, and already by the time we started working on these types of models, they had gathered quite a few citations.
They had attracted a lot of attention from other researchers, and there already was a corpus of literature which we could look at, read, and use to compare results published across papers.
And there were quite a lot of papers published, and actually we found that the results reported in these papers were quite inconsistent with each other.
Quite frequently, we could take two papers that used the same dataset, even the same experimental setup.
And we saw that the results reported in those two papers were very different.
Because for me, it was the topic that I was planning to study during my PhD, it was very important to have some grounding and to be sure that, like, okay, am I actually using good baselines?
Am I using the right implementation, one that is able to achieve the same results that the researchers who published those papers originally achieved?
And because we observed these big inconsistencies, we decided there's definitely a need for a replicability study where we can understand what model is better out of those two.
What is the cause of these inconsistencies that we observe, and how can we ourselves avoid the traps that some other researchers may have been falling into, because of some kind of problems with these models that didn't allow them to achieve consistent results in those papers?
Yeah, I guess this is what you show in one of the tables, where you compared how many papers, 39 papers or such, or I'm not sure whether I get the number right.
Yeah, I think we actually went through all the papers that cited BERT4Rec by the time we wrote this paper.
So it was like 300-something papers.
And I chose all the papers that compared SASRec with BERT4Rec.
And then we also filtered down to only the papers that we deemed, let's say, credible, by which we meant that they at least were peer-reviewed.
And I think we ended up with something around 40 papers that compared these two models on a number of datasets.
Yeah.
So yeah, that was actually a big chunk of work of just going through all these papers.
Oh yeah, I can definitely believe that.
And especially doing all the filtering, deciding which papers are relevant and to be included in the study and which are not. Because, of course, in the end you always want your work to be criticized or critically reflected on, but you don't want to make it that easy to attack because you applied some dubious filtering or something like that.
But what you actually found is that in most of those cases, BERT4Rec was superior, but in some cases SASRec was, and there were also a couple of ties.
And this started some kind of a research or investigation into different implementations, I guess.
And there you also found that other people were running into a lot of underfitting. Or what was explaining the reasons for those inconsistencies, or what were your hypotheses back then?
Yeah, okay.
So indeed, we were thinking that these differences could be explained either by underfitting or maybe by some inconsistencies between the implementations that people used and the original implementations that the original authors of BERT4Rec, and SASRec also, published.
And we kind of found a mixture of both.
For example, even the original BERT4Rec code that was published by the original BERT4Rec authors was good enough to reproduce at least the results published in the BERT4Rec paper, but the configuration files that were published within the original BERT4Rec repository were using a configuration that trained quickly but massively underfitted.
So to get the results that they reported in the paper, you needed to increase the number of training steps like 20 times.
And I guess many people didn't do that.
And another problem was that many people used third-party implementations, and these third-party implementations didn't accurately reproduce the original BERT4Rec paper.
And I think one good consequence of our publication: there is, for example, a very popular library called RecBole.
After our publication, they found a couple of bugs in their code and fixed them.
And I hope now people who are using RecBole can get better results; they said, okay, we have now fixed this.
And now our effectiveness is much better than it was before.
And this is kind of a super popular library.
So all the researchers who used it were probably getting a weak baseline there.
I think this is not the first time that this happens and it definitely also won't be the last.
So it's good to have this kind of replicability studies.
And I mean, it's not only, 'only' is maybe the wrong word here, just trying to replicate somebody else's work and seeing whether it's replicable or how it compares across other works; it kind of sparked many more ideas that fueled further work, on the one side.
So from an academic point of view, or from an intellectually appealing thought; but on the other side, there's also a great need to make those models better, because in certain domains they have to deal with large item corpora, and there we also have to deal with certain problems.
So can we also touch a bit on that side, what motivated further work there?
Yeah, so I mean, I remember having these discussions with Sasha, and he goes, well, actually, you know, BERT4Rec, to converge, needs 16 times the amount of training that SASRec needs.
And I was like, how is this possible?
The architectures are the same.
I mean, it's like, well, they have different training objectives.
Okay, well, what is it that's different about it and let's talk about it?
And then what do we think that that's a reasonable training objective or is there some kind of compromise there?
Other things that we looked at: you know, Sasha would come to me and say, I can't train BERT4Rec on this dataset.
It either took too long or it had much bigger memory consumption.
So that kind of motivated some of the work.
We'd done some of this reproducibility work, and we were saying, well, okay, let's look in detail at how it is that SASRec is trained and does pretty well after one hour, and why BERT4Rec needs 16 hours.
So, the differences between them: apart from the architecture, and we'll come back, I think we can talk about the model architecture later, one of the big differences is actually the training task, the training objective.
So I think I mentioned earlier, in BERT4Rec we show a sequence and we randomly hide items, random parts of the sequence, while SASRec, it's more of a sequence continuation.
Here's a sequence, predict the thing that goes at the end.
So actually, I mean, this item masking of BERT4Rec is good, you learn a very, very good representation, but it takes an awful long time to do that.
And at the end of the day, it's not actually that closely related to the end goal.
You're getting a good representation because you're hiding maybe the second item in a sequence and you're forcing it to recover that, but that's not very good at helping it predict the last item in the sequence, okay?
While at least in SASRec, you're always trying to complete the end of the sequence.
At that point, I was also personally curious about one thing.
A couple of years ago, I had a master thesis student, and she was working on removing noise from sequences, also for sequential recommendations.
And this work, and that argument that BERT4Rec is, in that sense, and I guess you also mentioned it in the tutorial, less tied to the actual recommendation objective of predicting the next item in a sequence, actually reminded me of a different view on it, and I would be curious what your take is on that.
So I agree that BERT4Rec with that masking might not be that closely related to the recommendation goal.
However, is it more robust against noise in sequences possibly?
For example, I'm buying a coffee maker, then I buy maybe a new keyboard and afterwards I buy coffee.
So obviously like the keyboard is somewhat noise between the two items.
Does SASRec have more problems with that, and would BERT4Rec here be a better choice, or am I getting something wrong here?
I think our training datasets are going to be massive, but I think SASRec has always been shown things in the right order.
While with BERT4Rec, you're asking it to predict the items at different places in the sequence.
So I suspect it is getting a bit more of the robustness.
It's also making a better use of the training sequences that we have in the data, because it's being forced to predict multiple things in a given training sequence.
So it's a kind of augmentation, but it's being forced; you get more value for the same amount of training data.
But of course you're having to train for longer because it's having to make multiple predictions.
Yeah, okay.
You were already about to dive into one of the works that followed this, which is the paper called RSS: Effective and Efficient Training for Sequential Recommendation Using Recency Sampling.
Please continue the introduction.
Yeah, yeah.
In RSS, we're basically trying to find a new training objective that is somehow a compromise between SASRec's sequence continuation and BERT4Rec's item masking.
We wanted to be able to sample multiple positive items in a training sequence, but also be able to give preference towards items at the end of the sequence, because that's the ultimate goal for the recommendation task.
So recency sampling is just basically that.
We take each training sequence and we randomly select which items are going to be the targets, which we kind of imagine are the end of the sequence, but we prefer to take the real items at the end of the sequence.
So we've got some kind of prior over the position of each item in a sequence, and we're going to prefer to take items towards the end of the sequence as being our target.
So it prefers to sample the most recent items as being targets.
Compared to sequence continuation, it will show items out of order.
It does a little bit of that, and that helps with that robustness thing that you mentioned.
But I think it's a case of where we said, well, this item masking task really came from language, and I can see how it makes sense there.
It makes less sense to be randomly masking out in some situations the start of a sequence.
It doesn't really help us learn how to complete a sequence.
And what we found was this really, really helps with cutting down the training time.
So you could take this RSS training objective, you could combine that with a SASRec model, and you could train a really good SASRec model, as good as BERT4Rec, in just one hour, compared to like the 16 hours that BERT4Rec took.
So for us, it's a great example of just thinking more about the task and showing that if we do something that's a bit more tailored to recommendation, we can do much better.
So the kind of moniker I have here is that items are tokens until they're not tokens.
If we think a bit harder about the task, we can find ways to advance the model and to make models that are more effective or smaller or faster, et cetera.
Yeah, and I guess what you're alluding to, folks can see that on the very first page of this RSS paper, the paper that came up with recency sampling of sequences, short RSS.
And when I do this and refer to what somebody can see in the paper, I'm always thinking, okay, I should maybe change this into a, let's say, semi-video podcast, where I could at least show the papers and the corresponding images as part of the sequence.
So I will take a note there. But what I was saying is that on the very first page, you can actually see that comparison, where you have the training time in hours, I think given the same machine, on the X axis, and on the Y axis you take NDCG as kind of the metric that stands for the effectiveness of the model here.
And then of course the X axis, training time, represents the efficiency.
And then you put in there the SASRec model, BERT4Rec, and in the end the SASRec model that is enhanced with the RSS sampling technique, which then ends up in the top left corner, achieving an even better NDCG score with training time comparable to the vanilla SASRec model.
So this is something one could already take away: just change the sampling of the training data that is used, to basically make more of the data.
But in a fashion that prefers items that are later in the sequence, as you said, right?
With, I guess, an exponential function there that you were using.
Yeah, I think in an extended version of the paper, we looked at lots of different functions and when you tuned the hyper parameters, they ended up just the same as the exponential prior.
So it's one of those things, it didn't really matter.
It was, you could make a linear one, tune the parameters and they ended up just the same as the exponential one.
Probably not linear, but if you take something like X to the power of N, some of the nonlinearities, and you give it enough capacity, maybe you get the absolutely same shape, yes.
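Note: a minimal sketch (a reconstruction, not the authors' implementation) of recency sampling with an exponential prior over positions, as discussed above; the decay parameter and the toy sequence are made up for the example.

```python
import random

def recency_sample(seq, n_targets=2, alpha=0.8):
    """Sample target items from a training sequence, preferring recent
    positions: position i gets weight alpha ** (len(seq) - 1 - i), so
    the last item has weight 1 and earlier items decay exponentially."""
    n = len(seq)
    weights = [alpha ** (n - 1 - i) for i in range(n)]
    positions = set()
    while len(positions) < min(n_targets, n):
        positions.add(random.choices(range(n), weights=weights)[0])
    # the sampled items are treated as if they were the end of the sequence
    targets = [seq[i] for i in sorted(positions)]
    inputs = [item for i, item in enumerate(seq) if i not in positions]
    return inputs, targets

print(recency_sample([12, 7, 93, 45, 3]))
```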
Yeah, yeah.
All right, and this was back then at RecSys 2022.
Also, as I found, it was a best student paper award nominee, so still congrats for that.
So that means back then you were already tweaking a bit, looking into the way of sampling, and you showed that by tweaking the way you sample, you can already change the effectiveness of the model quite a lot.
It was the first paper in a row of papers that you published.
Is this the right time to actually hand over and talk a bit more about another study that you performed on sampling and how to do sampling smarter?
Yeah, so, but I actually think it's very important that we kind of distinguish a bit.
So we can sample a positive from a training sequence and that's the item that we're trying to predict.
There's another thing that happens at training time and that's like, are you scoring all of the other items in the catalog or are you just scoring some of them?
And we call that negative sampling.
There's a key distinguishing feature there.
And that is: indeed, BERT4Rec doesn't do any negative sampling.
SASRec did negative sampling.
It was another one of the differences between the models, but negative sampling has been around in recommendations for a long time; I think BPR was the first kind of negative sampling model.
And it's an important configuration that we knew that we have to do at training time.
Yeah, yeah.
It's good that you make that clear.
I'm sorry for being a bit sloppy with my terminology, but of course, it's a good point that you are bringing up there.
So all the science around how to do negative sampling properly in recommender systems could be perceived as its own subfield.
Maybe one day there will be a dedicated track at a RecSys that is only focusing on negative sampling, like different techniques and how many items, which technique to use and how to inform negative sampling, soft negatives, hard negatives, whatever.
Yeah, so here we talked about how to select the positive sample for composing the actual training sample. But in the other paper, one year later, gSASRec, I'm always thinking about how much that pun was intended or not, because you could also read it as Jesus, Jesus Rec.
And I was already thinking, okay, what if you combine it somewhat with the B of BERT, to say, like, B-Jesus Rec?
So this is something that you have to tell me, whether that pun was intended or not, but there was one colleague making me aware of that.
And it was RecSys 2023, so the paper is called gSASRec, or Jesus Rec.
You need to tell me how to pronounce it correctly, or how you want it to be pronounced.
Reducing Overconfidence in Sequential Recommendation Trained with Negative Sampling.
I guess, Sasha, you are going to provide some more context, motivation, and background for this, right?
Yeah.
With, of course, telling us about your intended or non-intended Jesus joke.
No, the Jesus joke was actually never intended, and now that I've learned you can use it this way, I like it.
But for BERT4Rec, we actually have a gBERT4Rec model, which, as you suggested, probably means we should put it the other way around, right?
Okay, no.
Yeah, actually with this paper, I think, the starting point... so in the paper the same information is organized a bit differently, and this information comes a bit later.
But actually, for me, the starting point for this paper was when I decided to understand better why BERT4Rec actually achieves better effectiveness than SASRec. Because it was, more or less consistently according to our replicability study, that BERT4Rec was outperforming SASRec, and we also found this with our implementation, which was also important. And the BERT4Rec authors attributed this better performance to their better representation learning, basically.
So kind of what you mentioned in the beginning, that they could look forward to build the representations of items that happened in the past, and according to their analysis, it was this that was giving the model better effectiveness.
However, I found, mostly out of curiosity, I guess, that when you control for negative sampling, the effectiveness difference between those two models disappears.
Sometimes, on some datasets, BERT4Rec works a bit better, especially when you need a bit more of the augmentation type of things.
On some other datasets, SASRec works better. But, as I said, if you take out negative sampling from SASRec and apply the same loss function as BERT4Rec uses, essentially the difference in effectiveness disappears.
So that made me curious why, what happens there, why this negative sampling that was used in SASRec made it that much worse.
And I started digging into it.
And we also mentioned already in this podcast that we weren't able to train BERT4Rec for exactly this reason, because it does not use negative sampling.
It is not applicable to larger datasets, because when you try to use vanilla BERT4Rec on some dataset with a few million items, then for every position, for every sequence in the batch, it will compute every score and compute the softmax.
And if you have a few million items in your catalog, doing that at every training step is just impossible.
So here we kind of get the dilemma that on one hand, we have BERT4Rec, which is a great model, which has great effectiveness, but you cannot apply it to big datasets.
On the other hand, you have SASRec, which is also a great model, but only when you get rid of the negative sampling that had been used originally.
So that was kind of my motivation.
And I actually knew that many companies have really large catalogs, right?
I had experience with Amazon and Amazon has like one of the biggest catalogs in the world probably.
And obviously there are some other companies, like YouTube and others, that have very big catalogs, and they all struggle with this problem: you cannot train these models with a full softmax because you have a large catalog, so you need to employ some techniques like negative sampling.
And as I said, my early experiments had shown that that is not a free lunch; it comes with this really big effectiveness degradation.
So I started looking into this and I actually started looking into the math, like what happens there, what probabilities do these models actually learn?
And I found that essentially when you do negative sampling, you change the distribution in your training data of positive samples and negative samples.
And essentially what happens is that your model learns to overestimate a probability of an item being positive.
And while like for ranking, it doesn't matter, right?
Because for ranking, you only need the binary, like who is this probability higher than that probability.
But in fact, if you think that where all the variance happens in this course with when you apply negative sampling, the variance happens somewhere in the middle of your catalog.
So kind of all the top items have the predicted probability of being positive, very close to one.
And the model actually doesn't distinguish between them.
And this is where this problem actually happens.
So this is why when you do negative sampling, you need to apply some kind of correction that tries to counter this effect of distribution shifting in your training data.
Having these high scores for lots of items is what we ended up calling overconfidence.
The model getting too many high scores.
Yeah.
Yeah, this is what I was gonna ask though: the overconfidence is about that.
And it's, I assume, hopefully correctly, due to the fact that you always sample only a certain small number of negatives per positive.
And this basically blows up the representation of positives in your data set.
And then finally you basically visit the positives just too often.
And this creates that overconfidence effect.
Yeah, also just imagine, all right, so the model just predicts a probability of an item being a positive, right?
And if, for example, for SAS rec, so SAS rec originally used just one negative per positive.
And for SAS rec, if an item is just popular, right?
It will be sampled as positive much, much more frequently than like any other item in the catalog.
So for SAS rec, for item being popular there is very strong indicators that these item will be included as a kind of top score item.
And even though this actually wasn't our intention, some other researchers at last year's RecSys, from, I think this was the German TV broadcaster, ZDF.
Do you have ZDF in Germany?
Yes.
Yes.
So they basically, they presented their work based on our work.
And they found that indeed, when you apply the correction that we presented in that paper, you end up recommending less popular items.
So we kind of reduce this popularity bias within recommender systems.
So this is where the root of this problem comes from.
So the model learns to overestimate, and it over-focuses on some simple things like the popularity of the item, and just makes recommendations based on these simple things, rather than trying to take into account more subtle, user-dependent features of items.
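Note: a small back-of-the-envelope illustration (made-up numbers, not from the paper) of the distribution shift described above: with one sampled negative per positive, positives make up half of the training pairs, even though a random catalog item is almost never a positive.

```python
catalog_size = 1_000_000     # hypothetical large catalog
negatives_per_positive = 1   # as in the original SASRec

# share of positives the model sees during training with sampling
positive_rate_training = 1 / (1 + negatives_per_positive)
# share of positives if every catalog item were scored (one true target)
positive_rate_full = 1 / catalog_size

print(positive_rate_training)  # 0.5
print(positive_rate_full)      # 1e-06
# A model fit to the sampled data therefore learns probabilities that
# are massively inflated (the overconfidence effect), unless the loss
# corrects for the sampling rate.
```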
Okay, I see.
So this was basically describing the problem that you ran into. Or, I guess if I'm correct, your first kind of spark for that idea was to substitute the loss function for SASRec, from that binary cross-entropy loss to the softmax loss, and thereby identify that the difference in effectiveness diminished or basically disappeared.
And then you looked further and found or came across that overconfidence aspect.
Yeah, yeah, yeah.
And we found that.
So when you correct this overconfidence, you can actually get a model that is as effective as BERT4Rec, or as SASRec without sampling, while retaining negative sampling.
And this is super important for large catalogs where you have millions of items, because otherwise you just cannot train a model.
So in the end, we proposed a training scheme where you choose more negatives than SASRec would normally choose, and you also use a correction term in the loss function that counters these negative effects of negative sampling, like overconfidence.
And we were able to achieve the same effectiveness, but while retaining negative sampling.
We have shown in the paper that you can just select, I think, 100 negatives or 200 negatives, and then apply this correction.
And then your effectiveness will be essentially the same as a full softmax over all items in the catalog.
All right, so you are already talking about the result, but maybe let's dive a bit more into the actual correction that you proposed to use there, which better explains the G in gSASRec.
Yeah, okay.
So basically, SASRec uses binary cross-entropy as a loss function.
And binary cross-entropy means that SASRec treats the problem as a binary classification problem.
So it basically predicts, for every item, whether this item is positive or negative.
And to convert the score of the model to a probability of an item being positive, SASRec uses the sigmoid.
And the sigmoid is a very common way of converting a score into a probability.
And the G comes from the generalized sigmoid.
So this was actually proposed in some prior papers, from people studying, I believe, biology or some other field.
And they said, okay, here is our sigmoid, but with additional parameters.
And one of these parameters was the power to which you raise the sigmoid.
And yeah, in our case, we took this generalized sigmoid function from somebody else's work and applied it to SASRec.
And it added one extra hyperparameter to the model, which in the paper we call beta, and which is the degree to which the model counters the overconfidence effect.
And technically, there is some heavy math in this paper that proves what happens if you set this parameter according to the sampling rate, yes, that is, the proportion of all negatives you take for every positive.
In the paper, we actually prove that, given enough training data and unlimited training time, your model will converge to predict the actual probability of the user interacting with the item.
Obviously that's not true in actual setups, because your training data is usually limited, but it gives a reasonable approximation.
So because we also wanted the training process to be efficient, because we don't want the model to be trained for unlimited time, we experimentally found a special value for that parameter, which controls essentially the shape of this sigmoid function, that works well in many cases.
And it essentially allows you to train a model using negative sampling and to achieve results that are close to optimal, while retaining sampling, and also within reasonable training time.
So it all boils down to: what you do is raise the sigmoid for the positive item to the power of beta, and choose that beta to be between zero and one.
And of course, in the case of one, it would just be the standard binary cross-entropy loss.
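Note: spelled out as a formula, a sketch reconstructed from the description above (with $s^{+}$ the score of the positive item, $s^{-}_{i}$ the scores of the $k$ sampled negatives, $\sigma$ the sigmoid, and $\beta$ the new hyperparameter), the gBCE loss for one training sample reads:

```latex
\mathcal{L}_{\mathrm{gBCE}} = -\log \sigma^{\beta}\!\left(s^{+}\right) \;-\; \sum_{i=1}^{k} \log\!\left(1 - \sigma\!\left(s^{-}_{i}\right)\right), \qquad 0 < \beta \le 1,
```

with $\beta = 1$ recovering the standard binary cross-entropy.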
And something that I found explains this very well was also that slide in your tutorial where you show the effect, on the gradients, of the calibration parameter t that feeds into that beta.
And this somehow made the whole idea click in my brain, like, oh, okay, now it starts to make sense. Because what you effectively do is smoothen the curvature of the sigmoid function a tiny bit, and thereby you don't have those steep gradients for the positive ones, but lower gradients, and also higher ones for the negative items.
Did I summarize it properly?
Yeah, yeah, yeah.
So I think this is a good summary of this work.
And indeed, this figure is actually included in the extended journal version of this paper, where you can actually see that if you use binary cross-entropy, then the gradients of the model just try to make the score higher and higher and higher.
Whereas when you choose the parameter to be too small, it will actually be trying to make the score very small.
And when you set it to our recommended parameters, there's a balance of these forces.
So it will try to make the positive score more positive and the negative score more negative, but the forces, the kind of gradient arrows attached to these scores, will more or less have the same magnitudes.
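Note: a minimal PyTorch sketch of the gBCE idea as described above (an illustration, not the authors' reference implementation): binary cross-entropy over the sampled negatives, with the sigmoid of the positive score raised to the power beta, i.e. its log-probability scaled by beta.

```python
import torch
import torch.nn.functional as F

def gbce_loss(pos_scores, neg_scores, beta):
    """pos_scores: (batch,) scores of the positive items.
    neg_scores: (batch, k) scores of k sampled negatives.
    beta in (0, 1]; beta = 1 recovers standard BCE."""
    # log(sigmoid(s)^beta) = beta * log(sigmoid(s))
    pos_term = beta * F.logsigmoid(pos_scores)
    # log(1 - sigmoid(s)) = log(sigmoid(-s))
    neg_term = F.logsigmoid(-neg_scores).sum(dim=-1)
    return -(pos_term + neg_term).mean()

# toy usage with ~a hundred negatives, as mentioned above
pos = torch.randn(4)
neg = torch.randn(4, 128)
print(gbce_loss(pos, neg, beta=0.7))
```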
All right, and then in the end, you already mentioned it, but I just want to highlight that with that method, you can still resort to using negative sampling and a binary cross-entropy loss, but achieve the same or comparable results as you saw with the softmax loss, for example as used in BERT4Rec.
Therefore, what I would also like to maybe wrap this paper up with is something that I saw towards the end of the experiment section, before you went into the actual results: going back to your initial replicability study, where the authors of BERT4Rec claimed that the superiority of BERT4Rec is due to the architecture.
So due to, like, mainly the bidirectional attention mechanism rather than the causal attention mechanism; but this is actually not the case, as you found.
Yeah, exactly.
Yeah, we found that in many cases, SASRec is actually better.
There are some cases where BERT4Rec is better, but it is more frequently the case that SASRec's architecture is better.
And the main thing is what you learn, rather than the architecture.
So, in general, I find that when you set the training objective and the loss function properly, the model architecture itself doesn't matter that much.
We can actually take even older models, like a GRU or an LSTM, and still get reasonable effectiveness if you properly choose a training mechanism for them.
So this is one of the most important things that I actually learned during my work with these models: that setting the problem right is more important than tweaking the solution and tweaking some bits of the architecture.
Yeah.
Maybe Craig, from your side, anything to add regarding this work?
I've never had a paper with so many proofs in it.
It was a lot of fun.
And we were very careful to make sure that every single step was shown.
So I think that's a really nice thing about the paper, that we spelled out exactly what the theory shows and how the theory is used.
So we looked back empirically, and we see that overconfidence is fixed and that it has a knock-on impact on the results.
And I think, I mean, thinking a little bit wider, binary cross-entropy is used in a lot of places.
So we have a little follow-up work where we took gBCE and we put it into a cross-encoder in IR, in a particular setting, and showed that it really helped there as well.
So there is potential for gBCE to be used in other situations, in other tasks, in other fields, outside recommender systems.
But we said, let's focus on recommender systems and go and fix other problems that we've seen.
Actually, adding to that, again referring to the journal version of this paper: in the journal version, we have a little appendix where we also show that the same approach, the same kind of gBCE loss, can also be used for other recommender systems models.
And there, I think, we experimented with good old matrix factorization.
And we also found that it is very effective there.
So if you employ the same training approach for the classic task, where you try to complete the matrix, it works incredibly well as well, with the same approach.
Yeah, yeah.
So I guess it's therefore not only noteworthy, but also not that surprising anymore, that this paper received the best paper award at RecSys 2023.
So congrats to that, but I guess this is well earned.
And I mean, beyond that award, what could be even more of an honor than if your work gets applied somewhere else by somebody else, and they basically tell you: hey, we tried what you proposed and it worked, and now things are better than they were before, yeah?
Yeah, we actually, yeah, we actually already saw some applications.
One of them I already mentioned to you, right?
So this is ZDF, the German broadcaster that published a paper at RecSys this year.
There were also a couple of commercial libraries out there already that included gBCE and gSASRec.
And some of them referred to my code as the kind of reference implementation of that.
And they are saying, we are using that in production, or, we are proposing this to our clients.
And we also have more evidence that people use this in production, through private communication, which unfortunately we cannot share with the public in the podcast.
But to me, it is actually one of the great things in academia. Because when, for example, I did some things at Amazon or in some other startups before, mostly, if you do the work, nobody beyond your team knows what you've done and that it is cool.
And here, this is actually, I think, a really, really good thing about academia: to hear that people are actually using this across the globe.
I want to turn our focus to another work that also deals with a lot of challenges that you face in industrial recommender systems.
And that might, as you have shown as well, also help not only with efficiency, but also with effectiveness: it is about the item embeddings themselves, which are learned as kind of the key model component of a transformer model.
And there's a work from quite recently, WSDM 2024, called RecJPQ: Training Large-Catalogue Sequential Recommenders.
And I guess this is up to Craig to elaborate a bit more.
So can you give us an introduction to that work?
Yeah, I can.
So again, this is another piece of work that we looked at where we said we really want to address this large scale catalog.
We've said that we can use language model transformer architectures, but we know language models tend to have like 30,000 tokens, 50,000 tokens, and that's about it.
While we could imagine situations where there might indeed be millions of items.
I mean, we talked about YouTube having 800 million items.
So if each of those items has an embedding that's 256 dimensions, and then you're doing back propagation, there's a lot of gradients that need to be updated at training time.
And yeah, for some models, we were just simply unable to train them at large scale.
And scale was one of the things that we came into this piece of research trying to address.
So here we were kind of inspired a bit by language models.
So we know that language models have these trained tokenizers where the frequent words get their own token, that's their own ID, and the infrequent words are broken down into sub words.
So this means that we don't have to keep a token ID in a language model for a word that we rarely ever see.
We break it up into these kind of more frequent sub IDs.
So we said, can we do a similar thing in recommendation?
Instead of having one embedding for each item, can we break up each item into several sub IDs?
Because if we can do that, then maybe we'll be able to train a model.
So we were kind of inspired by a piece of work in retrieval called joint product quantization.
And it was for like a dense retrieval, bi-encoder setting.
But actually, instead of explaining that work, I'll explain how RecJPQ works.
And it's kind of quite intuitive, it's simple to understand and it works really well.
And I hope that my listeners are already well familiar with product quantization.
Yeah, yeah, yeah.
I'll maybe do a kind of compare contrast later.
So in this, we want to tokenize the items.
And our goal here is always to be able to train the model.
So approaches like product quantization assume that you can train the model with its full embeddings, and then later on do some kind of post-processing to make your ANN-based approach and make these smaller lookups.
We said, what if you've got so many items, you can't even train the model?
What do we do then?
So we tokenize these items into smaller sub IDs.
Take a really simple model, something like a matrix factorization, a truncated SVD, like pure SVD.
And we say, well, okay, let's train that really quickly to get our kind of standard collaborative filtering models to get a really rough embedding for each item.
Let's say, I don't know, it's length four for each item or length eight.
Then let's take each of those numbers within that rough embedding and quantize that.
So let's say we want to quantize the first dimension into 256, okay?
And those quantized numbers are actually just then our sub IDs for that part of the item.
So each item then is going to have four sets of sub IDs, or if you like four sub IDs, each coming from a set of 256.
And because we've assigned them based on some first-pass model like our truncated SVD, similar items, because they are clicked on by the same users, will get similar sub-IDs.
So then each of those sub IDs gets its own embedding.
Then we can recover the full embedding for a given item by concatenating the embeddings of its assigned sub IDs.
So we can actually just then train a model that does item ranking, but instead of training the item embeddings, we're training the sub-item embeddings.
And there are far fewer sub-item embeddings compared to all of the item embeddings that we would have had in the normal model, okay?
So we're calling it joint product quantization because we are learning to product quantize and do recommendation at the same time.
It's a joint training in that respect, okay?
So it's like the product quantization that's done in libraries like FAISS, but that requires the full item embeddings to be trained first.
Here we don't have to: we just need some allocation of items to sub-IDs, and then we train the sub-ID embeddings.
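To make this concrete, here is a minimal sketch in Python of the kind of sub-ID assignment described above, on a toy interaction matrix; the names n_splits and codebook_size are illustrative rather than taken from the paper, and the binning scheme here is just one reasonable choice, not necessarily the paper's exact one.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Toy user-item interaction matrix (users x items); in practice this would be
# the real click/purchase matrix with potentially millions of items.
n_users, n_items = 1000, 5000
interactions = sparse_random(n_users, n_items, density=0.01, format="csr")

n_splits = 4          # how many sub-IDs each item gets (illustrative)
codebook_size = 256   # how many distinct sub-IDs exist per split (illustrative)

# First-pass "rough" embeddings: a truncated SVD of length n_splits per item.
_, _, vt = svds(interactions, k=n_splits)
rough_item_emb = vt.T                      # shape: (n_items, n_splits)

# Quantize each dimension independently into 256 bins; the bin index becomes
# the item's sub-ID for that split. Equal-frequency bins are used here so
# every bin holds roughly the same number of items.
sub_ids = np.empty((n_items, n_splits), dtype=np.int64)
for g in range(n_splits):
    edges = np.quantile(rough_item_emb[:, g],
                        np.linspace(0, 1, codebook_size + 1)[1:-1])
    sub_ids[:, g] = np.digitize(rough_item_emb[:, g], edges)

# Items clicked on by the same users land in nearby bins, so similar items
# get similar (or identical) sub-IDs, which is the property relied on here.
print(sub_ids[:3])
```

Only the sub-ID assignment comes from this first pass; the sub-ID embeddings themselves are trained from scratch inside the recommender, as discussed next.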
So what's the impact of that?
Well, actually, taking datasets with millions of items, we realized that we could train these shared sub-ID embeddings.
We no longer have a really big item embedding tensor that we have to manipulate.
We're just training these sub-IDs, and you end up with a model whose checkpoint size is like 50 times smaller.
And interestingly, it is at least as effective as, or can be more effective than, the original model.
And that was a really interesting observation: we've thrown out literally millions of parameters and it's more effective.
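As a back-of-the-envelope check of that claim, with assumed illustrative numbers rather than any paper's exact configuration:

```python
# Hypothetical catalogue and embedding sizes, chosen only for illustration.
n_items, emb_dim = 1_000_000, 128
n_splits, codebook_size = 4, 256

full_table = n_items * emb_dim                                    # 128,000,000 parameters
sub_id_tables = n_splits * codebook_size * (emb_dim // n_splits)  # 32,768 parameters
print(full_table // sub_id_tables)                                # ~3900x fewer
```

The overall checkpoint shrinks less dramatically (the roughly 50 times Craig mentions) because the transformer layers themselves are unchanged; it is only the item-embedding table that collapses.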
So in essence, this is a regularization.
We're forcing the sharing of parameters across items.
So you can imagine one of those sub-IDs might represent something latent; it might be something like the genre.
Another sub-ID might capture different preferences of users.
So we've encoded the information that we got from that initial SVD.
That's what has given us these rough sub-ID embeddings.
And that's what's trained in the model.
You mentioned that you create a very rough first item embedding to do the initial assignment, so that you have those codes for every item, which you then use as an initialization to satisfy your requirement that similar items be represented by similar codes.
And for this, you said you could use something like truncated SVD, or you also mentioned two other methods: one that uses BPR, and a randomized one.
So when you do that, you mentioned like, let's take just a four or eight dimensional embedding.
So let's stay with the four dimensional embedding.
And then each of these four elements will be represented by its own code.
And then we have that initial code assignment.
The embeddings that you then train in that joint training phase: can they actually be larger, or are they constrained to that initial dimension, that initial size?
I think what we're trying to do is make things comparable.
So if we had 256-dimensional embeddings for our baseline model, we would use a sub-ID embedding length of 256 divided by four.
But there are some lovely heat maps in the papers that show the impact of varying these, because there are two parameters: the length of the code and the embedding dimension.
So the overall embedding dimension, of course, stays consistent.
So the embeddings that you learned as part of the initialization: is it even correct to say they are fine-tuned, or are those cluster centroids then fine-tuned, or how do you put it correctly?
No, the rough embeddings that we got are just quantized.
So let's say the first dimension of the embedding went between zero and one.
We would maybe break that up into 256 bins, and the offset of a bin becomes the sub-ID for that particular centroid.
Okay.
In essence, it's a quantization of the rough embeddings; then you randomly initialize your sub-ID-level embeddings as you would normally, and train them.
Yeah, basically the dimensionality of the final item embedding is not constrained by the dimensionality of these rough embeddings.
It just has to be a multiple of it.
So if, for example, your original SVD embedding was, let's say, of length four, as Craig said, then your final item embedding will be four times the size of a sub-ID embedding.
For example, if your sub-ID embeddings are 32-dimensional, then your item embedding will be 128-dimensional.
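A minimal sketch of that reconstruction, assuming the four-times-32 split Sasha mentions; the table and function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_splits, codebook_size, sub_emb_dim = 4, 256, 32

# One randomly initialized embedding table per split; these small tables
# are the parameters that actually get trained.
tables = [rng.normal(size=(codebook_size, sub_emb_dim)) for _ in range(n_splits)]

def item_embedding(sub_ids):
    """Recover a full item embedding by concatenating its sub-ID embeddings."""
    return np.concatenate([tables[g][sub_ids[g]] for g in range(n_splits)])

emb = item_embedding([17, 201, 3, 88])  # hypothetical sub-IDs for one item
print(emb.shape)                        # (128,) = 4 splits x 32 dimensions each
```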
Okay, maybe another question related to the initialization by whichever method.
So how come, if you can't train a BERT4Rec model on such data, you can actually train an SVD on it to perform the initialization?
Yeah, because actually there are distributed implementations of SVD.
For instance, Apache Spark has one, and regardless, the model that we're using to assign these starting-point rough embeddings can be really simple.
You can train a matrix factorization very, very quickly.
It doesn't require the amount of resources needed to train a model with lots of backpropagation.
And remember, the embedding dimensions that we're training for this initial model are just really short.
We mentioned four, we mentioned eight.
That's much smaller than the real granularity that we need for a good sequential recommender like BERT4Rec at 256 dimensions or so.
All right.
Okay, and you say simple, and yet they are very effective, because you said it helps with assigning similar items similar codes.
However, in the results I had some difficulty finding that clearly reflected, because there were also some cases where you found the random method to perform better.
How do you explain that?
That was actually on another dataset: it was on Gowalla, I think, which was actually the largest one.
But the reason was that the Gowalla dataset has very, very many long-tail items, right?
So a lot of items in this dataset have very few interactions.
They are, if you wish, cold-start items.
And remember that we were saying in the beginning that RecJPQ has a strong regularization effect.
And here is the catch: the more randomly you assign the codes, the more an item has to share its codes with some other random items.
So it means that it cannot be too specific.
The representation of an item cannot be too specific.
And if this is what you need, if you are dealing with a long-tail distribution and you want to avoid overfitting, you don't want your item representations to be very specific.
So I guess our hypothesis there was that the long-tail distribution is what made the random assignment work better.
That said, I would also refer back a little bit to our previous work, the gSASRec work.
In our later paper, published at RecSys 2024, quite recently actually, we also did some analysis of RecJPQ-based models for another problem, which we can discuss later.
It was about inference, but there we used a gSASRec model with RecJPQ.
And for RecJPQ with gSASRec, it was again the case that SVD worked just fine.
So this is again to the point that SASRec just learned a slightly wrong thing.
And when you assigned these codes non-randomly, it was very easy for SASRec to overfit and recommend more popular items.
For example, one of the dimensions of the SVD may have been correlated with popular items; but when you optimize the model for the right thing, this difference actually disappears.
So this is again: if you choose the right training objective, you'll get better results.
All right, all right, cool.
So what one could take from this work is that much smaller models, using the compression technique you presented, are of course not only more efficient in terms of the storage they need, but they can actually also be better, as Craig already mentioned, in terms of the regularization effect that you suppose is going on there.
Yeah, in essence, your items are being forced to share representation.
So maybe this latent dimension, maybe the genre embeddings are being updated, and that's being shared across lots of items, et cetera.
So I think it's a really interesting observation.
We've been building these really big models, and we didn't need to go to that extent.
The other great advantage of RecJPQ that we've been looking into is that it can actually help with speed at inference time as well, because you no longer have to take your query sequence embedding, if you like, and multiply it by a really big embedding table.
You can do it at the sub-ID level and then just see which sub-IDs pop out: the items whose sub-IDs end up being scored highly are the ones that we need to retrieve.
So we did some experiments about that in the RecSys 2024 paper.
And you could see that, compared to the original models on a dataset like Gowalla with millions of items, we were going from 133 milliseconds down to just 13 milliseconds in terms of scoring time, right?
So there are benefits there from inference as well.
And that's really great because, you know, the transformer model is not so big.
You can actually do inference on a CPU.
And then if you've removed that great big matrix multiplication at the end where you do the item ranking, then you don't need the GPU there either.
You can do all of this on the CPU and that makes the models deployable at much, much less expense.
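A sketch of why the big matrix multiplication can be avoided: because an item embedding is a concatenation of its sub-ID embeddings, its dot product with the sequence embedding decomposes into per-split partial scores, which can be computed once against each small codebook and then gathered per item. This is a simplified illustration of the idea, assuming the same toy tables and sub_ids as in the sketches above, not the exact implementation from the RecSys 2024 paper:

```python
import numpy as np

def score_items(seq_emb, tables, sub_ids):
    """Score all items without a full (n_items x emb_dim) matrix multiplication.

    seq_emb: (emb_dim,) sequence embedding produced by the transformer.
    tables:  list of (codebook_size, sub_emb_dim) sub-ID embedding tables.
    sub_ids: (n_items, n_splits) sub-ID assignment per item.
    """
    n_splits = len(tables)
    chunks = np.split(seq_emb, n_splits)
    # Tiny matmuls: one per split, each against a small codebook.
    partial = [tables[g] @ chunks[g] for g in range(n_splits)]
    # An item's score is the sum of its sub-IDs' partial scores (cheap gathers).
    return sum(partial[g][sub_ids[:, g]] for g in range(n_splits))

# Example usage: scores = score_items(seq_emb, tables, sub_ids)
#                top10 = np.argpartition(-scores, 10)[:10]
```

The decomposition is exact: summing an item's per-split partial scores gives the same number as the full dot product with its concatenated embedding.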
Yeah.
Okay, so efficiency gains on all ends: not only for the model size itself, but, due to the decreased model size, we also see gains at inference, a reduction down to about ten percent of the original time, if I get it correctly.
So down to roughly 13 milliseconds from over a hundred.
That's quite astonishing, and something that will also put a smile on the faces of all the ML engineers that are listening to us here.
I mean, one cool thing about this work is that before it, people were usually thinking about items as atomic things, right?
So they are not splittable.
Well, the original idea was just to compress the item embeddings.
Then, with this approach, we realized: okay, now the item ID is not unsplittable anymore, right?
It has a structure inside it.
And you can do a lot of different things when you have a structure.
When we talk about atomic, it's actually a very nice analogy to physics: while people were thinking about atoms as non-splittable things, there were only so many things that you could do with different physical objects.
But when you start to understand, okay, there are electrons inside, and protons and neutrons inside, you can do a lot more.
You can do radio waves, et cetera.
So it's similar here.
And I think one interesting thing is that it's not only us finding that these sub-ID quantization ideas are useful.
For example, there's a very famous work by Google, published, I think, at NeurIPS 2023 or 2024.
I don't remember exactly.
It's a work called TIGER.
There they propose something called semantic IDs.
The idea was similar, and it was actually published at roughly the same time, but Google used content information to infer the sub-IDs, whereas we are using the collaborative information to initialize the sub-IDs.
They used, let's say, content embeddings, and from the content embeddings they went to the sub-IDs.
But yeah, it's quite interesting that their findings are echoing ours.
So they're also finding that you can generalize the recommendations better.
And by generalize, I mean that you can, for example, propagate the knowledge from the items that you know well to the less popular items, maybe some of the cold-start items, and start recommending them even without having to extensively learn from all the user interactions.
So this is definitely a very important direction.
And I think you and all your listeners will see a lot of works in this direction soon.
Because it seems to be quite important and quite big recently.
Yeah.
No, I guess you are right.
That's already a great overview and also a sub-summary, because there have seldom been episodes where I was going through so many papers. Not to say that my previous guests haven't also published a lot, but this time it was of particular interest to myself, and there's an interesting story behind these works, too.
So far, we have covered a bit more than half of the tutorial.
And of course, the listeners will find all the papers and the tutorial,
as well as recordings of it, in the show notes as always.
But in another part of the tutorial, you also started looking into more generative methods.
Not sure whether this is the right introduction for that paper, but there was also a paper that you published called GPTRec.
What was going on there?
There were actually a couple of papers, and I think you should think of them as joint work.
One was published at the Gen-IR workshop at SIGIR 2023, and one at a workshop at The Web Conference last year.
The main motivation there was that all the models we were talking about, BERT4Rec and SASRec, were optimized for ranking accuracy.
So they were all trained using a cross-entropy loss, meaning that they put the items with the highest probability of being interacted with at the top, which is great.
And this is definitely useful, but in many cases there are other properties of a good recommendation that you want to achieve.
Yeah, going back to the coffee example from the beginning of our conversation, you can think of like, I just bought a coffee machine.
What is the most likely next purchase for me?
Obviously it will be a coffee, right?
But I don't want to go to a website, let's say an e-commerce website, and have my recommendations dominated by coffee.
I want some diverse recommendations.
So kind of diversity of recommendations makes sense.
Or there are other metrics: for example, you want to buy an outfit, right?
And when you're thinking of an outfit, you want the items in your outfit to suit each other, right?
And this is something that classic recommender systems struggle to address, because they usually follow the so-called score-and-rank approach.
So you assign some score to every item, and then you rank the items according to that score.
But because you are assigning these scores independently of each other, you cannot optimize for metrics where your goal is actually interdependent, like diversity, or, let's say, the suitability of items to each other.
So this is a big problem in existing recommender models.
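In code, the score-and-rank approach looks roughly like this; note that each item's score is computed independently of every other item, so list-level properties like diversity are invisible to it (a schematic sketch, not any specific system):

```python
import numpy as np

def score_and_rank(seq_emb, item_embs, k=10):
    """Classic score-and-rank: score each item independently, then sort."""
    scores = item_embs @ seq_emb    # one score per item, no interdependence
    return np.argsort(-scores)[:k]  # top-k items by score
```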
And we were thinking: okay, why not look into language models again as a source of inspiration, and see what language models can do.
If you ask ChatGPT, okay, please recommend me something that is diverse enough, it can actually generate diverse lists.
And the difference is that it is doing multi-step generation: when it generates the next thing, it already knows what it has generated previously.
So this was our idea.
We decided, okay, why not?
Let's optimize a transformer-based model the same way.
When we put an item at some position, say i, we already know what we put at all the positions up to i minus one.
The only problem with that is that it is hard to optimize, because you usually don't have gold-standard data for this.
When you train models using supervised learning, you usually need something that the model can use as ground truth.
But for training models to generate a diverse set of items, for example, you don't have a golden set of diverse items that at the same time accurately reflects user interests.
Because if you had it, you wouldn't need to solve this problem.
You would already have your system; but you normally don't have such data.
So this is why we again went to language modeling and decided to look at what people are doing there.
And in language modeling, people are actually using reinforcement learning, and in particular reinforcement learning from human feedback. It would also be really cool to use human feedback here, where some human tells us whether or not a recommendation is good or bad.
But on the other hand, that's super expensive to do on an academic budget.
So we decided to solve a smaller problem, where at least we have a metric that can say whether a generated list of recommendations is good or not.
And by good we can say: okay, it, for example, has relevant items, but also has good coverage of interests, or something like that.
So it includes several components, like diversity, novelty, popularity, et cetera.
And then we came up with a reinforcement-learning-based approach, a kind of trial-and-error loop, where the model generates recommendations auto-regressively, so one by one.
Then we take our reward metric, the metric of our interest, which may be a combination of relevance and diversity.
We measure that, and we increase the chances of the model generating the same recommendation next time if the metric was good, or we decrease the chances if, according to our metric, the recommendation was bad.
And this way, we are actually able to optimize our model for almost any goal.
The only requirement is that the goal is measurable.
So you can take your input, you can take your recommendation list, and you can say, okay, how good is this recommendation?
And the good thing is that this can be applied at the level of the whole list rather than at the level of an individual item.
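A heavily simplified sketch of that trial-and-error loop, written as a REINFORCE-style update in PyTorch; the model signature and reward_fn are hypothetical stand-ins, since the actual training details in the papers differ:

```python
import torch

def reinforce_step(model, optimizer, seq, reward_fn, list_len=10):
    """One trial-and-error update: generate a list item by item, score the
    whole list with a measurable reward (e.g. relevance plus diversity),
    and nudge the model toward lists that scored well."""
    generated, log_probs = [], []
    for _ in range(list_len):
        logits = model(seq, generated)  # conditions on items placed so far
        dist = torch.distributions.Categorical(logits=logits)
        item = dist.sample()
        log_probs.append(dist.log_prob(item))
        generated.append(item.item())
    reward = reward_fn(seq, generated)  # list-level metric, the only requirement
    # Increase the probability of high-reward lists, decrease low-reward ones.
    loss = -reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return generated, reward
```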
Okay, okay.
There's a really interesting observation, actually, that came out of this work.
Lots of language models like ChatGPT have an auto-regressive head that can generate meaningful structured things, right?
But for the task that we started talking about, plain sequential recommendation, where we want to rank items in descending order of likelihood, a model with an auto-regressive head is not a good fit.
If you don't want a diverse recommendation set, there's no need to use the auto-regressive head.
You will not get better performance than with the normal score-and-rank approach that we've been talking about for most of the podcast.
I think that's really important, because the temptation is to go: well, language models are all using auto-regressive heads, so that's what we should use for ranking.
No, just make a sequence embedding and do item similarity.
That's your most effective way.
If you want to do some of these beyond-accuracy things, that's when the auto-regressive head is useful.
And that's what we've shown in this piece of work.
This is a lot of topics and research for the listeners to digest.
And many, many thanks to both of you for sharing that with the community.
Coming to my second last question, looking backwards to all of that work, looking maybe what's coming up in the future.
So Craig, Sascha, what do you think for recommender systems are some of the challenges or the biggest challenges that you would like to address or that you would like to see addressed by the community?
Yeah, okay, so I think we already said at the beginning of the conversation that there's a big trend of combining recommender systems with search systems.
And actually also with NLP-type things, because you can argue that ChatGPT is a recommender system for many applications.
So I think one big question for us as a community is how we are going to live in this world where there are these big universal models.
What is actually a recommender system?
What is the role of the recommender system, and how should it be married with these large language models that are coming around?
There are some early works that suggest potential uses of large language models; I think at the last RecSys every second paper was talking about large language models. But the problem is that while these models exist, and everybody understands that they should somehow be used for recommendation, to the best of my understanding there is no proof yet that these large language models can replace traditional recommender systems.
As for the future, I myself would like to find the sweet spot where all the research that has been done by this time in the RecSys, IR, and NLP communities can be joined together and we can use all these models.
Craig, what about you?
Any other hunch or a different perspective on that?
Look, one of the things we talked about was the motivations for doing research in recommender systems.
Recommender systems are nearly everywhere, right?
Every time I open an app on my phone or even when I look at my phone, it's making a recommendation choice about which app it thinks I'm about to use next.
There's lots of them that are ubiquitous and we need small models to be able to do that.
And so LLMs as recommenders are not going to be everywhere.
They're just too big at the moment, and we need to find ways to integrate what we already know how to do, which is ranking items, within the LLM setting.
I think the hacks of representing items, or indeed documents, with three words that kind of represent what they're about don't really feel that intuitive to me.
So how do we adapt an LLM to still retain its knowledge about what language is while introducing knowledge of items into the process as well?
So I think there's work going on there in different areas and I think it's exciting to be thinking about what to do next.
Cool, okay, great.
Then last question, which is quick.
Whom else do you want me to feature on RECSPERTS to hear more about their work?
That's an excellent question.
I actually liked quite a few of the people that you already had in previous episodes, and you've covered much of my wishlist already.
On the other hand, I personally think that I'd like to learn more from the people who really work on big industrial applications, maybe from Google, maybe from Amazon, maybe from, I don't know, Alibaba, who can tell us the real challenges of using these state-of-the-art models in industrial settings.
So I won't give you specific names here, but looking at the companies that present at RecSys, like Google or Amazon, and into their papers, would be really interesting.
That sounds good.
Thank you again for participating in this.
I hope that we covered many of the topics that you have been spending a lot of work and time and energy on, and that we were able to represent them properly.
Yeah, thank you for having us.
It is actually a great honor to be here.
So thank you, Marcel, for inviting us.
Yeah, thank you very much, Marcel.
I think it was really great fun and you had lots of insightful questions.
You got to the crux of the matter for us in many cases.
And for the listeners: if you have any questions, I think it's not too hard to find me and Craig on social networks like LinkedIn or Twitter.
So feel free to reach out if you have any questions.
Please reach out.
As always, I'm going to include your LinkedIn profiles in the show notes so people can find you even more easily, if they don't know you yet; but they will definitely know you afterwards, hopefully.
Yes, thanks again.
And with that, I would say have a nice rest of the day and hopefully see you at RecSys.
Indeed, hopefully.
Bye.
Thank you.
Thanks for having us.
Bye-bye.
Thank you so much for listening to this episode of RECSPERTS, recommender systems experts, the podcast that brings you the experts in recommender systems.
If you enjoy this podcast, please subscribe to it on your favorite podcast player and please share it with anybody you think might benefit from it.
If you have questions, or a recommendation for an interesting expert you want to have on my show, or any other suggestions, drop me a message on Twitter or send me an email.
Thank you again for listening and sharing and make sure not to miss the next episode because people who listen to this also listen to the next episode.
Goodbye.
Bye.
