#23: Generative Models for Recommender Systems with Yashar Deldjoo
Note: This transcript has been generated automatically using OpenAI's whisper and may contain inaccuracies or errors. We recommend listening to the audio for a better understanding of the content. Please feel free to reach out if you spot any corrections that need to be made. Thank you for your understanding.
It's just the beginning of a new era for recommender system research.
Some of these mainstream collaborative filtering models have reached a level of saturation on how much performance they can bring.
Generative models have the potential to improve the power of personalization.
Traditional recommender systems used to do this, for example, in a two-stage ranking phase.
Now with these generative models, you're able to align these modalities before modeling the distribution.
You can align the two into the same distribution space, build a generative model on that, and then sample it to generate what the user likes.
With generative models, you're basically able to create items that do not even exist in the catalog.
You can generate things that the user likes, and then go and generate it and build it, or find something that is close to that item.
Their ability to generate believable content at a faster rate holds the promise of greater opportunities, but of course, at the same time, greater risks that should be taken into account.
Learning the underlying data distribution enables new opportunities and new applications.
Generative AI is not only about LLMs.
LLMs brought huge recognition to it, but today we will talk more about other areas of generative AI.
Hello and welcome to this new episode of RECSPERTS, Recommender Systems Experts.
For today's episode, I have invited a guest from the research side of recommender systems, and we are talking about a topic that is gaining a lot of attention in the general AI community and is closely connected to recommender systems.
What we have observed in the past is that especially the evolution of LLMs, as part of the generative modeling landscape, has gained increasing attention, and therefore, to no surprise, recommender systems and generative modeling are moving closer together.
We have seen applications of more modern generative models on recommender systems.
And for this episode, I have invited Yashar Deldjoo and I'm very happy that he is joining today's episode on generative models for recommender systems.
Hello and welcome to the show.
Hi, Marcel.
Thank you very much for having me in this podcast.
I'm very happy to be talking to you in person. I've been listening to and following some of the episodes of your podcast on recommender systems, and I found them very interesting, with very good people.
That's nice to see and to hear.
And I guess also the people who have been part of previous episodes will be delighted to hear this.
And I guess it's also working the other way around.
So I guess many, many listeners will be delighted to listen to this episode and hear from you: about your research, how you actually joined the recommender systems field, and what you do as a researcher nowadays.
And therefore, as always, I would just start with a small introduction about you and then hand over to you so that you can introduce yourself with proper depth and let the listeners know whom they are listening to today.
So Yashar Deldjoo is an assistant professor at the Polytechnic University of Bari, and he has been a researcher in recommender systems for quite a long time already, with papers published at conferences like SIGIR, KDD and RecSys, of course, just to name a couple of them.
He has been researching fields like fairness, multimodal recommender systems and, first and foremost for today's episode, generative models for recommender systems.
He obtained his PhD in computer science from the Polytechnic University of Milan and holds degrees in electrical engineering and linguistics, which we definitely need to talk about.
He is also involved with the RecSys community, being a chair for the demos and late-breaking results track of this year's recommender systems conference, which actually takes place in the very same city, Bari.
So very much looking forward to that.
And again, thanks, Yashar, for joining me for this episode.
And please continue and introduce yourself to our listeners because I guess there are lots of interesting things about yourself, your work in recommender systems that I haven't yet mentioned.
Yeah, thank you very much, Marcel, for the kind introduction.
Yeah, basically, I come from a background in electrical engineering.
So I consider myself not a computer scientist originally, coming from a more mathematical, statistical and somehow, if you can call it that, theoretical background.
At the time I was a master's student, which I did in Sweden at Chalmers University of Technology, one of the best universities in Northern Europe and also in Europe.
So yeah, I was involved at that time on the topic of multimedia, multimedia signal processing.
And basically, my original interest was in that topic.
So basically, during my PhD, which I did at the Polytechnic University of Milan, which is the best technical school here in Italy, I was quite involved in developing different forms of multimodal recommender systems, in particular video recommender systems, with the goal of exploiting the visual content of the video signal to provide better content-oriented recommender systems.
So during my research at the Polytechnic in Milan, and a few years after that during a visiting period in Austria at JKU, I developed a number of video recommender systems exploiting these multimedia signals and produced a number of noteworthy research works at conferences like SIGIR.
And I've also published monographs, for example at ACM Computing Surveys, on the topic of multimedia recommender systems.
So the community originally knows me for those areas of multimedia and multimodal recommender systems.
Over time, my interest grew.
And I started to look at other aspects of recommender systems, in particular the broad field of trustworthy AI and its integration with recommender systems, which intensified quite a bit when I moved to Bari as an assistant professor here.
And ever since I started my work here at the Polytechnic University of Bari, I've been working on different aspects of trustworthy recommender systems.
So basically, I have noteworthy works in these areas of trustworthy recommender systems, for example at SIGIR on adversarial robustness, but also a lot of work on the topic of fairness, which I've been quite interested in.
On the topic of generative AI, a few years ago I wrote one of the first surveys on the topic, back then with GANs. Especially after the introduction of LLMs, my interest in this topic grew.
And it has been now quite some time that I've been focused on this topic.
Mostly my works have been focused on the domain of recommender systems, but I'm also doing work on general machine learning applications like healthcare.
We had a work last year at the ECAI conference on the topic of developing an LLM-based system for diagnosis in healthcare.
So this is like a very brief summary of works.
And yeah, currently we are writing this very interesting monograph on the topic of recommender systems with generative models.
We have a tutorial coming up in August at KDD.
So there's a lot of interesting things to talk about.
There's a summer school here at RecSys where we will give a lecture on this topic.
And I'm also doing a tutorial at ECAI this year on the trustworthiness aspects of LLMs specifically.
Yeah, as we have been saying, generative AI is not only about LLMs.
LLMs brought huge recognition to it.
But today we will talk more about other areas of generative AI, basically, for recommender systems.
That sounds great.
Also quite a long history of areas that you have been doing research in, taking, I would say, different angles and viewpoints on recommender systems so far.
So I guess this is always nice when it then gets to a seemingly new topic and you see how concepts from what you have been doing before, or in other areas, start to reappear, or somehow synthesize or manifest in different ways.
So that's I guess a very, very useful thing.
Before we go into more depth and towards our main topic for today, what was it initially that sparked your interest and created your motivation for recommender systems?
So why recommender systems, and why, as I assume, are you still so intrigued by recommender systems in your research?
Yeah, to be honest, at the time I was a master's student, I got to know about recommender systems through a colleague, a classmate.
At that time, there was a lot of areas of artificial intelligence that looked interesting.
But as soon as I started, I somehow fell in love with it, maybe more compared to general information retrieval.
One of the things that I found interesting about recommender systems was that, first of all, there are many more areas that are still unsolved in recommender systems, many areas for research, to be honest.
And the second thing is that, at that time, I was quite involved in more, I would say, objective areas of artificial intelligence, like computer vision.
And this, besides being interesting, has a huge community; in computer vision, and even in areas like NLP, there is a huge number of researchers working on it.
Whereas in RecSys, you have many more areas to work on, and more impactful industries, I would say, in some sense, because you're dealing with users, but also fewer researchers working in the community of recommender systems.
So the combination of this user aspect and the ample room for research was something that was quite fascinating for me.
And obviously, one of the main things is that what is done in RecSys is absolutely, I would say, practical.
It can be implemented at an industrial scale.
And the gap between research done in RecSys and what is done in industry, we may arguably say, is relatively small compared to other fields where I originally worked, like electrical engineering and so on.
So yeah, this has been a very motivating factor.
Yeah, especially that part about the breadth of the RecSys field also connects quite well to our topic for today, because for me, it seems like the work that has been done in other areas, or sparked there, or done there primarily, will sooner or later transfer into the RecSys field, like you might have seen with item2vec models that stemmed from word2vec, or with sequence models and transformers.
So all of the stuff finds itself again in recommender systems being applied there.
I remember a paper presented by Nvidia a couple of years ago, where they showed the timeline for transformers and when their adoption in recommender systems appeared, and how that actually evolved over time.
So that was pretty interesting to see.
And I guess this is how it relates to many things in the RecSys field.
You made a nice summary of that indeed.
This brings us to our main topic for today, which is generative models for recommender systems.
And I guess the genesis of this was actually that tutorial, which you just mentioned, that is going to take place in approximately one month at KDD.
And this will not only stay a tutorial; it has already also become a book that you are working on together with a couple of well-known folks from the RecSys space, like, for example, Julian McAuley or Francesco Ricci.
Can you maybe start elaborating first why you came up with generative models for recommender systems or where you felt the need for such a tutorial?
Yeah, absolutely.
I think we had a tutorial on the topic of GANs and adversarial learning.
So GANs are part of this landscape of generative models, as we know. That was at WSDM, at SIGIR, at RecSys, if I'm not wrong, just two or three years ago.
So obviously, generative models have been there in the RecSys community.
But notably, it's just similar to convolutional neural networks, or neural networks in general, which got hyped a few times during the last decade.
Maybe one of the milestones was when convolutional neural networks achieved remarkable performance.
Yeah, since last year, with the introduction of ChatGPT in particular and other large language models, it was clear that the research community is interested in this.
And this is not only the case for RecSys, but for many other areas as well.
It was always there: the research community is going to look at applications for their target domain, basically to work on that, and obviously to look at the risks and harms of these systems and how to evaluate these models.
The interest toward this was somehow driven by the interest of the real research going on.
And probably this would be one of the important research areas for the years coming ahead.
And obviously, the other thing is that, specifically with LLMs, or multimodal foundation models in general, one of the things about generative models, as we'll talk about today, is that it's not only about prediction.
They can do prediction, arguably in a better or in a different way, but there are also the generative parts, the generative tasks that they can achieve.
And these can introduce a whole lot of new areas for recommender system.
So ample opportunities, simply speaking, ample opportunities to use generative models was one of the main driving factors here.
So when you say ample opportunities, what we all have in mind is like using recommender models for providing top K recommendations and then doing better on that side, optimizing for some relevance goals.
So what are other objectives that the generative models are extending upon or supporting?
That's a very important question and interesting one.
I would say generative models can improve traditional recommender systems in three main ways.
First, there are tasks that are classical tasks of recommender systems, such as top-K recommendation, as you mentioned.
We can arguably say that some of these mainstream collaborative filtering models have reached a level of saturation on how much performance they can bring.
Generative models have the potential to improve the power of personalization.
Examples are VAEs and the many variations of variational autoencoder collaborative filtering models, which, simply speaking, have this generative idea of an encoder compressing a more complicated distribution into a simpler distribution, and a decoder mapping that back.
And then there are different variations of that, with different prior assumptions or for different tasks, sequential and so on and so forth, or even combined with adversarial training.
Simply speaking, VAEs, variational autoencoders, are one notable class of generative models that has achieved better performance compared to mainstream collaborative filtering models.
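To make the encoder/decoder idea a bit more concrete, here is a minimal sketch of a Mult-VAE-style forward pass over a user's interaction vector. The weights are random stand-ins for a trained model, and all dimensions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 6 catalog items, 2-dimensional latent space.
n_items, n_latent = 6, 2

# Untrained random weights stand in for a fitted encoder/decoder.
W_enc = rng.normal(scale=0.1, size=(n_items, 2 * n_latent))  # outputs [mu, logvar]
W_dec = rng.normal(scale=0.1, size=(n_latent, n_items))

def encode(x):
    """Map a binary interaction vector to the parameters of q(z|x)."""
    h = x @ W_enc
    return h[:n_latent], h[n_latent:]  # mu, logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps (the trick that keeps sampling differentiable)."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def decode(z):
    """Decode z into a multinomial distribution over all catalog items."""
    logits = z @ W_dec
    e = np.exp(logits - logits.max())
    return e / e.sum()

# A user who interacted with items 0 and 3.
x = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
mu, logvar = encode(x)
probs = decode(reparameterize(mu, logvar))

# Rank unseen items by decoded probability (mask out already-seen items).
scores = probs.copy()
scores[x > 0] = -np.inf
print("top recommendation:", int(np.argmax(scores)))
```

In a trained model, the encoder and decoder weights would be learned by maximizing the evidence lower bound, i.e. a reconstruction term plus a KL penalty on q(z|x); the sketch only shows the sampling-and-decoding path described above.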
As another example, we can talk about LLMs, when we talk about pre-trained models.
LLMs, basically with their internalized knowledge, have the potential to produce new ways of thinking about what a user likes, or about the user profile.
So this is the first area.
So, traditional recommender systems: I guess we definitely owe our audience a distinction between traditional and generative.
And I guess you will come to that in a moment.
But if we keep with that vague term for the moment, like traditional recommender systems: even, let's say, advanced modeling strategies for those traditional recommendation tasks have seen a level of saturation.
When we say saturation, what is the goal we are meeting there?
So are we just talking about, let's say, standard relevance metrics?
So for example, some precision at K or MRR offline, some conversion rate, or something like that online.
Or would you also claim that this saturation is appearing in other goals?
Like for example, content diversification, short versus long term satisfaction of users, fairness, and so on and so forth.
So, saturation only with regard to relevance, or also with regard to other objectives that we could aim for in RecSys?
Yeah, I was mostly intending top-K relevance, since that is the main goal of a recommender system.
Obviously, it doesn't mean that opportunities for research don't exist with classical discriminative models.
That was not what I originally intended to say; I wanted to say that it would probably be very difficult, if you wanted to write a new collaborative filtering model, to justify why it is really different.
We have many editions of collaborative filtering models already providing proper performance.
And there are no new, let's say, admittedly innovative ideas of how you're modeling the problem.
So the task of these discriminative models is quite clear.
You have historical user-item interactions, usually binary, implicit or explicit interactions.
So this is a standardized, let's say, narrow interaction signal that you have from the user.
Then you have this task of, for example, rating prediction or ranking prediction, for which you need task-specific optimization, right?
And for that also, you need a lot of data to do that task, right?
And this puts the user somehow out of control of what recommendations are being generated.
So the user has less control.
And then there is a whole lot of biases that exist.
For example, when the user is choosing these items, there's a whole lot of biases that exist there, like interaction biases, that could bias what users eventually see as recommendations.
So this is the ecosystem of recommender system from the input data generation to prediction steps.
And that is what you had.
Now, to give you an example with LLMs: with LLMs, you have the possibility to enrich this landscape.
So first of all, the user and the system can now talk to each other in a more natural way.
So suddenly you're moving from this narrow, standardized setting of ratings to more information about what the user likes and dislikes, right?
And then you have side information about users and items, which can be enriched, right?
Thanks to the medium of natural language that is there and is basically available.
So both in terms of what signals you receive from the user and how much you can enhance that information, and in what you can do in the prediction step, things are, let's say, improved.
There's an ample room for research on any of these parts.
Yeah, no, definitely makes sense.
And this is basically where, for example, LLMs or other generative models could help support also the, let's say, original basic task of recommender systems: providing relevant top-K recommendations.
Yeah, absolutely.
So this is where I stepped in, but I guess you were going to outline two additional aspects.
So do you like to continue?
Absolutely.
So the first aspect, let's call it classical tasks: top-K recommendation, as you said, or top-K ranking.
So these are tasks that classical discriminative recommender systems do, but we can claim that generative models do them more effectively.
Okay.
The second thing concerns what conventional discriminative models claim to do, and I want to emphasize the discriminative part for now, because generative models could also be seen as conventional.
There are some such models, but in particular the discriminative models make these claims.
So conventional models claim to do these things, but actually they are not implemented as efficiently, or as effectively, or at an industrial scale.
For example, classical models, when you think about conversational models, may still use a lot of rigid, template-based conversation, which is not interactive, which is not, let's say, multi-turn, and so on and so forth.
It's different from what users expect, right?
And you can imagine how, with LLMs, this level of interactivity and user engagement has increased.
The other example I can give is, for example, cross-domain recommender systems, which, according to one of Francesco Ricci's works, have never been implemented at an industrial scale.
You can see that, basically, with generative models it's easy to imagine that they can improve this.
They can increase the scope of what a recommender system can actually achieve.
So these are those areas which we can call claimed to be done, but not done so effectively.
Actually, that point that you quoted from Francesco Ricci is something that we maybe need to discuss in a follow-up episode with somebody from the cross-domain RecSys space, or with him himself, which I'm also definitely looking forward to, since I would have a slightly different opinion there, but that's definitely debatable.
Yeah, absolutely.
Yeah, in the end, the point is to highlight the gap between what people expect, what users expect, or what industries expect, and what is actually achieved.
So maybe there are systems that are implemented, but they are not as effective, for many reasons.
And if I may talk about the third one, which is the most exciting one, the third benefit of generative recommender systems: there are areas that are completely new to recommender systems.
And you can look at these areas from both the application scenario point of view and the learning point of view.
From a learning point of view, for example, the best example is in-context learning, few-shot and zero-shot learning, where, thanks to the generalization abilities of LLMs and their reasoning abilities, it's easy to imagine how these models can provide effective recommendations with a minimal amount of data.
So if you say with a minimal amount of data, then you definitely don't mean the training data, but the data that is provided at inference as context in the prompt, or something like that.
Yeah, absolutely.
Let's say, just similar to a content-based system, which is not really used anymore, let's say, in a recommender system now, we can see that with only the target user's information, you can still provide very effective recommendations to the users.
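As a rough, hypothetical sketch of that few-shot setting: the user's whole "profile" is just a few liked and disliked items, formatted into a prompt at inference time. All titles and the template wording below are invented for illustration, and the actual LLM call is left out:

```python
def build_fewshot_prompt(liked, disliked, candidates):
    """Assemble a recommendation prompt from a minimal user profile."""
    lines = ["You are a movie recommender."]
    lines.append("The user liked: " + ", ".join(liked) + ".")
    if disliked:
        lines.append("The user disliked: " + ", ".join(disliked) + ".")
    lines.append("Rank these candidates for the user: " + ", ".join(candidates) + ".")
    lines.append("Answer with the candidate titles only, best first.")
    return "\n".join(lines)

prompt = build_fewshot_prompt(
    liked=["Blade Runner", "Arrival"],
    disliked=["Grown Ups"],
    candidates=["Dune", "Paddington", "Ex Machina"],
)
print(prompt)
```

The resulting prompt would then be sent to an LLM; no fine-tuning on a large interaction matrix is needed, which is exactly the zero-/few-shot appeal described above.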
And another category, as I was mentioning, is on the application scenario.
So obviously on the learning part, we can talk about many more.
We can talk about the ability of generative models to create, for example on the data synthesis side, useful data to enrich recommender system models, right?
And applications of how they can improve, for example, the model itself, with better regularization, and so on and so forth.
But something that is interesting from an end user's or from a system designer's point of view is the new application areas that generative models introduce.
And we can see them as supportive tasks for recommender systems.
Obviously, the things on multimodal recommendation are quite interesting here to be mentioned.
For example, the ability of in-context visualization or virtual reality and virtual preview.
Or for example, you can imagine you have the Amazon application and you suddenly see a nice shoe.
And then you say, oh, how would this shoe look on me?
So then you point it at your foot, and it shows you, okay, this is how it looks on your feet.
And then there are some supporting suggestions and so on and so forth.
For example, you go to IKEA, you see some nice furniture, and you just want to see how it looks in your house.
So you just point your cell phone there, and it visualizes that thing in your home.
And that's something absolutely beautiful once you have it.
These are like supportive tasks for decision making for the user.
It's like the screening you do before the purchase happens.
So especially the furniture example, we are not only talking about, hey, there is some furniture I know the dimensions of and now please turn on your camera, scan your room so that we can tell you whether it fits or not.
But it's more about scanning your room.
We capture what your style is, how many items you prefer to have in your flat, or whatever we are talking about, so that the system can take this to enrich the user representation and better understand the user's intent when searching for new furniture, by providing this as context.
And this is then basically a totally different game.
So we are not taking, let's say, CNNs and different machine learning models to see whether something fits, but we rather want to understand, along with LLMs, whether it not only fits into the flat, but also whether it fits the user.
Is this how one could put it?
Yeah, absolutely.
You summarize it very nicely.
We can call it personalized style of recommendation, which is something absolutely nice for the fashion domain.
The idea, as you correctly mentioned, is to infer the user's taste, for example via their textual queries or visual queries.
And also you mentioned this idea of combination, which is actually something quite interesting in the fashion domain, in the sense that, for example, when we talk about outfit recommendation, like an outfit of clothes you want to wear or a couple of things that go together in a house, their visual appearance might be completely different.
So for example, a purple velvet sofa might go very well with certain shades of green, let's say.
So these are in completely different regions of the color space, but in terms of the style space, they match very well, right?
They could be in the same style space, but visually different things.
This is what we have in outfit recommendation, also in the fashion domain: a couple of items worn together in a similar style, and so on.
Yeah, and this is, for example, one of the things that can simply enhance the engagement of the user, both on the consumer side and the producer side. It could increase the user's aesthetic appreciation, let's say, leading them to like more items and explore more items, and it could be absolutely beneficial for producers and consumers.
You can imagine, for example, during the COVID era, when we were in a period of lockdowns at home, how much applications like Zalando's mobile application were perhaps arguably used more.
And people were interested in buying products or fashion clothing via the application, right?
I think generative models have a lot of roles to play here.
Consider that one of the things generative models can do here, since they learn from multimodal data, as we are going to talk about, is this: generally speaking, merging text and images together in a similar style space is not an easy task.
Traditional recommender systems used to do this, for example, in a two-stage ranking phase.
But now with these generative models, you're able to align these modalities before modeling the distribution.
So you can align the two into the same distribution space, build a generative model on that, and then sample it to generate what the user likes.
And generally speaking, it means, as you were mentioning, for example, that you like a sofa; you would say to the system, I'd like a nice sofa with this color and this brand, something like that.
And then these attributes are extracted from your query or from your utterance, right?
And they concern the style: you talk about style, color and brand.
And then we have this information from the visual content of the catalog of items.
Before you generate recommendations, these are aligned together in a shared space and then passed through a generative model.
And that is used to generate items on demand.
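The alignment step described here can be sketched like this: once text and images live in one shared embedding space, matching a textual query against catalog items reduces to nearest-neighbor search in that space. The vectors below are invented stand-ins for the output of a jointly trained multimodal encoder pair (e.g. a CLIP-style model):

```python
import numpy as np

# Made-up embeddings in a shared text-image space; in practice they would
# come from an image encoder trained jointly with a text encoder.
item_embeddings = {
    "purple velvet sofa": np.array([0.9, 0.1, 0.0]),
    "green armchair":     np.array([0.2, 0.8, 0.1]),
    "steel office chair": np.array([0.0, 0.1, 0.9]),
}

# Embedding of the textual query "a nice purple sofa"; because the two
# modalities are aligned, it lands near the matching item's image embedding.
query_embedding = np.array([0.85, 0.15, 0.05])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Retrieve the catalog item closest to the query in the shared space.
best = max(item_embeddings, key=lambda name: cosine(query_embedding, item_embeddings[name]))
print(best)  # -> purple velvet sofa
```

A generative model fitted on this shared space could then also sample new points, i.e. items that are not in the catalog yet, which is the on-demand generation discussed next.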
And this on-demand part is something quite nice, because one of the main interesting aspects of the third category is the fact that with generative models, you're basically able to create items that do not even exist in the catalog.
You can generate things that the user likes, and then go and generate it and build it, or find something that is close to that item, right?
So you can bridge the gap between what the user likes and what you have in your catalog.
Whereas in the classical models, the way this worked was somewhat different.
We had items and presented them to the user by rating and ranking, or scoring and ranking, in a way.
Okay.
And now these are done at the same time.
The items are, like, generated.
This is something quite nice, I would say, about the third category, the generative part.
Obviously, one last thing I may want to add here are other areas where generative models can really help.
And that is, for example, the power of persuasion.
Simply speaking, since generative models can reason well, explanation and good reasoning, when they are merged together, mean opportunities.
For example, a recommender system powered by an LLM can motivate the user with a proper explanation, or a multimodal explanation, of why this item is interesting for them.
If a user didn't like it, it can try to persuade the user.
For example, by bringing factual knowledge, factual information about what this thing that you didn't like can do for you, or why it could be useful, right?
And connecting it to other users.
So the reasoning ability of LLMs, and of generative models in general, and their ability to generate believable content at a faster rate, hold the promise of greater opportunities.
But of course, at the same time, greater risks that should be taken into account.
All right.
A couple of thoughts.
Maybe first, when you mentioned that we are somewhat, if I got this correctly, doing scoring and ranking at the very same time, and that this is how it would be different, I somehow can't really resonate with that idea.
I mean, even though we would be extending the modeling and generation capabilities we have at our hands by making additional or exclusive use of generative models, aren't we still facing a ranking problem?
So something which can be more or less diversified.
But what I want to show to the user is what's most relevant for them or maybe inspiring for them.
Just with the difference that now I'm trying to get more breadth of, let's say, information in terms of visual, textual, free-text data, and so on and so forth.
And these two don't need to be connected; they can also be independent of each other.
Also maybe engaging more in a conversation with the user by going back and forth, back and forth, and refining the list of recommendations.
But isn't it still a ranking problem in both of these scenarios?
Yes.
From a user perspective, in terms of what the user sees at the end: if you have two black boxes, one called a generative model and one called a discriminative model, at the end, both boxes are showing a ranked list of items, right?
Simply speaking, what the generative model box is perhaps able to show are items that, for example, might not exist in your catalog, right?
They are items that are generated based on your demands.
And simply speaking, those items could be manufactured later, or something close to them could be built.
An example I can give here is the idea of recommendation using generative retrieval, where classical recommender system models usually use an ID for the items in their catalog, which is an atomic ID.
It doesn't have any meaning, and it's just used to find the target item at the final stage of recommendation, and so on.
So it's just an identifier given to this.
Now, the idea of generative retrieval is that we can tokenize this ID, for example, based on the content information of the item, usually textual.
For example, when you talk about shirts, you can say that what color is this shirt?
So this would get a number, like 25.
Then what is the brand?
It would get another number, 3.
And what is the style?
9, blah, blah, blah.
And then you have a sequence of tokens, or we can call it a code word.
So each item in your catalog now has a code word.
Now, the items you have produced, you can see them as having a code word: a sequence of tokens together.
So just look how beautifully this idea of generative modeling works here: you can now look at this as a kind of autoregressive task, right?
Where you have a sequence of tokens together.
And the task here is that given this token, please suggest or recommend the next token.
And this could mean that even without surfacing the item itself, by looking at these semantic IDs, you can surface items that the user probably likes more, based on what they choose.
And that may or may not exist in your catalog.
So if it doesn't exist, you can go and build it.
Or you can simply suggest something that is quite closer to the user.
At least let her or him know about it, right?
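To make the semantic-ID idea concrete, here is a small Python sketch; the catalog, token values, and the generated code word are all hypothetical. An autoregressively generated code word may not match any catalog item, so we surface the closest existing one.

```python
# Each item is tokenized into a "code word" of semantic tokens
# (color, brand, style) instead of a single atomic ID.
catalog = {
    "shirt_a": (25, 3, 9),   # color=25, brand=3, style=9
    "shirt_b": (25, 3, 7),
    "shirt_c": (12, 8, 9),
}

def nearest_item(target_code, catalog):
    """Surface the catalog item whose semantic ID is closest to a
    (possibly non-existent) generated code word."""
    def dist(name):
        # Count how many semantic tokens differ.
        return sum(a != b for a, b in zip(catalog[name], target_code))
    return min(catalog, key=dist)

# Suppose a generative model produced the code word (25, 3, 8), which
# does not exist in the catalog: we could manufacture such an item,
# or recommend the closest existing one instead.
print(nearest_item((25, 3, 8), catalog))
```

In a real system, the code words would come from learned item tokenizers and the next-token generation from a trained sequence model; the nearest-neighbor fallback here just illustrates "find something closer to that item".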
So somehow we are now in the space of the IDs, right?
Semantic IDs instead of the content itself.
This is like one of the ways.
So we can say, regarding your example, the task at the end is the same, but the way this information is generated could be different in a quite nuanced manner, I would say.
Yeah.
If you want to tell these two apart, the discriminative versus the generative models: in the discriminative case, you are provided some features, some contextual information, and try to make a prediction about the target.
And the target in those cases is basically, let's say, our item space, and then I score those items and rank them for a user who might be part of that context.
And this is what, if I got it correctly, you turn around on generative models.
So you rather try to fit or generate the distribution of, let's say semantic IDs.
Is this how one could put it?
And this is a richer representation, which I then could use to fulfill, let's say, the standard or conventional use case of top-k recommendation, but which is also more useful for providing explanations, for sparking product development ideas, or for engaging in a conversation with the user.
And this is one of the examples of how generative models can work.
We can look at this question from the perspective of input, model and output, simply speaking, right?
The input of these models, generative models, when you compare it to the classical models, we can consider them more or less the same.
They're using the same inputs, right?
And the task output is the same, right?
So TopK recommendation.
Now, obviously, they have more abilities, for example, to enrich the input data, for example, by conversing with the user, so on and so forth.
What is mainly different here, if you want to compare classical models and the generative models, is the model itself, but also the output itself.
For the output, which is simpler to talk about: the output of generative models could be singular items, a ranked list of items, which also exists in classical models, or whole-page recommendation, right?
You may say that, okay, but in classical models, or classical discriminative models, we could also have this; that's correct.
These also existed in classical models.
But we can say that these are created more effectively with generative models.
Why effectively?
For example, classical models, when talking about rank list of items, somehow worked in a greedy manner.
In a way, you find an affinity score for each item in your list against the user's taste and then somehow sort by it.
With generative models, these two separate steps of scoring and ranking are not done anymore.
Okay, so to create a top list of recommendations to be presented to the user, a generative model can look at the interrelationships between items altogether, right?
By modeling the distribution of the underlying data.
So this entire list is generated.
The same applies for the whole page.
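For contrast, the classical greedy two-step mentioned here, score every item independently and then sort, can be sketched in a few lines of Python (the affinity scores are made up):

```python
# Classical score-then-sort ranking: each item gets an affinity score
# with the user's taste, independently of the other items in the list.
scores = {"item1": 0.7, "item2": 0.9, "item3": 0.4}

def top_k(scores, k):
    # Sorting by per-item score ignores interrelationships within the
    # final list, which is what list-level generative models capture.
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(top_k(scores, 2))  # ['item2', 'item1']
```

A list-level generative model would instead model a distribution over whole lists and sample from it, rather than ranking items one by one.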
So basically the core idea is on the generative part.
I have some data.
If I can learn the underlying distribution of this data, whatever data it is, user item interaction, text, images, music, whatever you want to call it, I can use it for many useful inferential purposes.
Learning the underlying distribution of the data enables different applications, including top-k recommendation generation, right?
The generation or creation of that top-k ranking list is done via sampling the distribution that you have learned.
So the way this information is learned, we can say is arguably different, but also the complexity of the output is now different.
Because, for example, when you're interested in some fashion products, by looking at the interconnection between your browsing and what exists in the catalog, a whole new page can suddenly be generated for you, with relevant products, content information, and so on, which is done in a more efficient and effective manner compared to what classical models can do.
So it's how the information is learned, right?
And the outputs: the complexity of the output, we can arguably say, has been improved.
I can also make another example of, for example, bundle recommendation.
Yeah, yeah, it's good that you're bringing bundle recommendations up.
Let's definitely go into that.
But as you just mentioned this, how is this then actually different from a sequential model for video recommendations that leverages a standard LSTM or even a GRU network, which also in the end outputs a distribution over items?
So it keeps a hidden state that summarizes the user interaction sequence, which can also be augmented with side information about the user and about the items the user interacted with, in that case the videos.
And then in the end it takes that hidden state to generate a probability distribution over items, which I'm going to use to sample a ranked list of items.
This is an interesting practical question.
So LSTMs, RNNs, or the ones using attention layers, they're part of, let's say, what we call ID-based recommender system DGMs, so deep generative modeling paradigms.
They fall under the broad umbrella of generative models, obviously, as you correctly mentioned.
What is different here is what you provide them as condition or context to the system and how the generative task is done.
So we can broadly categorize the generative models in terms of the modeling paradigm into four main modeling paradigms. The first one is based on the idea of a variational autoencoder.
You have this encoding and decoding stage.
So encoding is done from a more complex distribution into a simpler one, and then you generate from it.
This is one of the paradigms.
The second paradigm is the paradigm of autoregressive models, which is actually the foundation of the large language models that we have today, once applied to textual data.
So sequential models, as you were mentioning, fall under this modeling paradigm.
And the difference between GRUs, for example, and transformers is that you replace the recurrent layer in GRUs with the attention layer.
And then you get transformers, and with transformers, you have all these encoder-decoder models and so on.
So this is the second paradigm.
The third paradigm, obviously, is the GANs, or generative adversarial networks.
So these GANs also, we can consider them as latent space models.
The difference is that they are working with two different competing networks in an adversarial training framework.
So basically, we can see these as different ways of looking at and modeling this distribution.
Whether you look at it in an autoregressive fashion or in a latent space modeling fashion, and so on and so forth, at the end, it's the power of data that rules here, I would say.
And learning the underlying distribution of the data enables new opportunities and new applications.
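As a toy illustration of the autoregressive paradigm, independent of any particular architecture, a bigram model over interaction sequences already captures the "given these tokens, predict the next token" idea. The sequences below are invented; a GRU or transformer replaces the simple counting with learned layers.

```python
from collections import Counter, defaultdict

# Toy autoregressive next-item model: count item-to-item transitions
# in (hypothetical) user interaction sequences.
sequences = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["b", "c"],
]

transitions = defaultdict(Counter)
for seq in sequences:
    for prev, nxt in zip(seq, seq[1:]):
        transitions[prev][nxt] += 1

def recommend_next(item, k=2):
    """Rank candidate next items by estimated transition probability."""
    counts = transitions[item]
    total = sum(counts.values())
    return [(nxt, c / total) for nxt, c in counts.most_common(k)]

print(recommend_next("b"))
```

Here `recommend_next("b")` ranks "c" above "d", because "c" follows "b" in two of the three observed sequences.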
All right, that makes sense to me.
You have provided me with plenty of material regarding the book that you folks are about to write that extends on all of these.
And I guess so far, we have talked a lot about possible applications, about tasks.
You also mentioned bundle recommendations we will come to in a second.
What I would like to dive into a bit more is really the centerpiece of that work where you really say, hey, there are three different generic scenarios.
So the interaction-driven recommendations, text-driven recommendations, and multimodal recommendations.
And I would claim that it's increasingly becoming complex, but also more powerful.
So let's do the following, please: let's finish with the last use case, which is bundle recommendations, and then let's step into the recommendation scenarios.
Yeah, bundle recommendation is also one of the nice application areas in RecSys, and something generative models can hopefully do better.
And I would say it's a nice thing for e-commerce in general, in the sense that you can create a list of items that are not necessarily the same, from different categories.
And you can add a nice explanation and try to sell them together right to the user.
Yeah, generally applications of this type, whether it's bundle, whether it's outfit, whether it's a playlist, all the items are not necessarily homogeneous.
I think generative models can play a good role, both in knowing what to generate, so looking at the interrelationships between items, but also in coupling that with a good explanation.
You can imagine how useful this could be.
So for example, consider a list of items, or categories of items, you probably would like to buy, where from some of these categories you never buy; but now, looking at them in a bundle together, your appetite and your taste for trying new items might increase, right?
So it could be something quite interesting for producers, to sell more niche products here.
I think, certainly, when we talk about recommending a collective set of items to the user, generative models, we can arguably say, have more power, because at the end of the day they can learn high-dimensional distributions.
If we can claim that generative models are able to learn high-dimensional distributions, or that some day they will be able to learn them, for example over visual and textual data and so on and so forth, simply speaking, it means they could be more powerful than classical models.
Because when you learn a high-dimensional distribution, it simply means that you perfectly know, or you very well know, how, for example, a color would go well with a textual description, or how two images could go together.
And this doesn't happen unless you learn the underlying distribution of the data.
So at the end, we come back to the point that learning what the data says enables new applications.
And certainly we can imagine this in, for example, the multimodal area, the image area, where you have more data, more pixels, let's say.
Basically there is a semantic gap between the textual information and the visual information.
So learning this high-dimensional distribution would allow finding better matchings between items, which classical discriminative models arguably did differently.
To give you an example, many collaborative filtering models obviously use, or can use, a lot of side information.
For example, Visual Bayesian Personalized Ranking, which Julian McAuley has been a driving force behind, and there are also works on the temporal version of that.
They are simply good models for incorporating visual information.
And we also have many collaborative filtering models using textual information.
But to the best of my knowledge, we don't have mature works that can combine these textual and visual information together at the same time.
And this is what we do in real life, and we are deciding about an item to buy.
We look at the content; we look at, for example, when looking at movies, who made this movie, who is the director, who is the author; we look at the thumbnail; we look at all this visual information together.
For example, when you're doing hotel recommendation, you look at all these visual appearance things to do the final decision making.
And this is not done so effectively with classical models.
And usually the problem they have is that they do not scale up well.
When you talk about high dimensional, so yeah, we have some infrastructure issues there.
This is already touching a lot, I guess, on the multimodal recommendation side.
I think it makes sense to go back and introduce the very first section within the recommendation scenarios, which is interaction-driven recommendation.
Can you elaborate how generative models manifest in that scenario?
The way these three subfields, we can call them, of generative models have been separated in our work, or the way we look at it, is simply based on the input space.
What kind of underlying data they use, which we divided into UI data, that is user-item interaction data, textual data, and multimodal data.
So multimodal would be any combination of these together with visual information or audio.
Is it only about that the input would be different?
Because if we look at the models, then of course the model needs to anticipate the growing complexity of the input, so that in that very first section of interaction-driven recommendations we are not really using LLMs, but nevertheless more or less complicated models, one could argue.
But then later on, when it comes more to the processing of textual data or NLP based recommendations, we are exploiting them more and more.
Or how is that relationship between recommendation scenarios and models?
How tightly are those connected?
They are tightly connected.
This is an interesting point that was mentioned.
So I think input dominates models and models dominate output, simply speaking, right?
Generative models using UI data, or what we also call ID-based data, are simply models that enable collaborative filtering recommendation.
So they use the same information, collaborative filtering signal.
These are the simplest signals that you have available.
And they represent user history or user preference, in a sense.
The idea of generative models with UI data is that these are models that take this user-item interaction and can produce recommendations.
So we are in the same atmosphere of collaborative filtering recommendation, we can call it.
What is arguably different here is how this information is created or learned, and how the output looks.
Obviously, we have the top-k ranking setting here, but generative models also add other elements as the output, like creating a whole page for the users.
The input data that we have in UI data has led to the creation of at least, we can call it, five categories of DGMs, deep generative model paradigms.
We should differentiate between paradigms and architecture here.
Paradigms, we are talking about different components, different architectural choice.
For example, you can have an architectural choice like attention that you use in different modeling paradigms.
So modeling paradigms looks at the interconnection of different components altogether.
Of these paradigms, we have five main modeling paradigms in ID-based, or generative, recommendation as we mentioned.
Categories of VAEs, categories of autoregressive models, categories of GANs, categories of diffusion models, and other categories.
In these other categories, you can put flow-based networks and so on.
This is like a broad five main categories.
You can also condense or expand them.
For example, you can group VAEs with maybe GANs and even diffusion models and call them together latent space models.
So it depends how you want to look at this modeling paradigm, from which angle you're looking at it.
Basically, the main thing that they have in common is that they use the same data.
They all want to learn some kind of distribution to produce some kind of output.
On the input space, they are the same.
On what they're learning or how they're learning, they have differences.
So for example, the denoising model works by adding this noising step to the input signal, arriving at a signal that is noisy.
You can see this noisy signal as somehow a latent vector, a latent space.
And then you come back to the original space.
So in some sense, it's similar to what an autoencoder does, but it stays in the same dimension as the original input.
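A minimal numeric sketch of this noising-and-recovering idea follows; all numbers are hypothetical, and a real diffusion model learns the reverse step instead of using a fixed threshold.

```python
# Corrupt an implicit-feedback vector with a fixed "noise" vector
# (standing in for sampled Gaussian noise), then apply a crude
# hand-written reverse (denoising) step.
interactions = [1.0, 0.0, 1.0, 1.0, 0.0]    # user-item signal
noise        = [0.2, -0.1, -0.3, 0.1, 0.3]  # stand-in for sampled noise
noisy = [x + e for x, e in zip(interactions, noise)]   # forward noising
denoised = [1.0 if x > 0.5 else 0.0 for x in noisy]    # reverse step
print(denoised == interactions)  # the original signal is recovered
```

Note that the noisy vector has the same dimension as the input, which is the point the guest makes: unlike an encoder, the "latent" here is not compressed.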
On the way this is achieved, what you're learning, they have differences.
And this obviously also affects the outputs, what you want to generate.
So for example, you may want to impute the missing data.
You may want to generate top-k recommendation lists.
Or do something that is not directly about top-k recommendation, but useful for top-k recommendation.
For example, GANs can be used to find negative samples, which is an important step in personalized ranking.
So maybe we can summarize the differences between these models with a few WH question words: what you're learning, how you're learning, and maybe what you are producing.
These few question words could summarize the differences between the different modeling paradigms that we have in DGMs, in ID-based models.
All right.
Okay.
Then with those models, whatever they learn, I mean, on the input data side, we are pretty standard there because it's user item interaction data.
And that's it.
Maybe you have different nuances like clicks, buys, or something like that.
But in the end, I guess mostly talking about implicit signals, but maybe also a couple of explicit signals.
And we are learning on top of this, but have, let's say, greater modeling capabilities, which then, for example, enables us to also provide hard negative samples that are more useful.
If we take that second one, if we just talk about textual data: I mean, there are also different notions there, between, like, very structured textual data and free text.
Can you help our listeners a bit to lay out the landscape?
What can the input look like and how can LLMs be used there to enhance or provide recommendations?
Yeah, absolutely.
I think this is one of the most important questions when talking about NLP and LLMs. Basically, if you look at a recommender system as composed of three main players, the users, the LLM or the model that is actually providing the recommendation, and the items themselves, the LLM can help enrich three entities: the user preferences or the user profile, the user-system interaction, and the item description itself.
To give you an example with the user preferences: at the beginning of this conversation I mentioned that we may have reached a level of saturation in what recommender systems can achieve in terms of relevance.
Part of this, for example, could be linked to the fact that collaborative filtering, or many collaborative filtering signals, basically use these atomic ratings, a simple number or a simple binary feedback, that is supposed to say everything about the user.
Now, this feedback could come from a taste, but it could also come from mis-operations like a misclick, right?
Or it could reflect or represent a taste that has changed or will change over time.
So it's not clear whether what we are modeling is exactly the actual reflection of reality, right?
Now, one of the things that you can do, for example, with text, or a structured user profile we can call it, is that with LLMs you are able to effectively build a natural language user profile.
A natural language user profile is a textual description composed of several parts.
For example, let's talk about movie domain.
Let's say you have watched a number of movies.
And by looking at these movies, an LLM can extract the information that you are, let's say, a sci-fi thriller fan, right?
This is information about your long-term preference.
And then it can give more information about your recent interactions, what you have watched recently.
So this is your general taste.
This is what you have consumed recently.
And further contextual information could be added to it, like some examples you have seen.
So detailed instructions, right?
This is what you're inputting to your system, right?
What you're presenting to the system on the user's behalf reads like: I want to recommend you a movie to watch, as a friend would, right?
I've analyzed your interests.
I've understood what you want.
I've thought it through.
And then now I want something that probably would engage you better.
This, once enriched, for example, with good explanations and good reasoning, has a lot of potential to increase the user's interest in what you recommend.
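A hypothetical sketch of assembling such a natural-language user profile; no real LLM is called here, the function just composes the kind of prompt a system might send, and all names and wording are made up.

```python
def build_profile(long_term_genres, recent_titles):
    """Compose a natural-language user profile: long-term taste,
    recent interactions, and a friendly instruction."""
    return (
        f"Long-term taste: the user favors {', '.join(long_term_genres)} movies. "
        f"Recently watched: {', '.join(recent_titles)}. "
        "As a friend, recommend one movie with a short explanation."
    )

prompt = build_profile(["sci-fi", "thriller"], ["Arrival", "Blade Runner 2049"])
print(prompt)
```

In a real pipeline, an LLM would first extract the long-term and recent preferences from the interaction history, and the resulting profile text would condition the recommendation step.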
Simply speaking, because we haven't exhausted the input side, I think this is something that should not be forgotten.
Maybe a lot of attention has been paid to the algorithmic aspects of recommender systems over the years, which is absolutely important.
But now with LLMs, we have ample room for work on the data part.
We can improve and enrich the input information we are providing, even to our classical models, to LLMs; building natural language profiles, building even multimodal profiles; these represent better, more current information about the users.
Then we can of course talk about the middle part, which is the system, the recommender system part.
In that area, certainly, we have seen how much LLM-driven models can help.
Particularly speaking, LLMs can now allow multi-turn mixed initiative conversation.
For example, where classical models used rigid, rule-based templates for conversation, now it's easy to obtain user preferences and dynamically incorporate user feedback into the system, and so further adjust it.
The system can ask clarifying question to the user to make sure what is recommended is good.
You can see this interaction that has been powered by LLMs, this increased interactivity, as somehow a denoising step we are applying to the data.
The information we are obtaining from the user is more useful for the recommender system, simply speaking, because we are obtaining it by engaging in conversation with the user.
Think of all the biases we have in classical models: a dataset, let's say for example, has been obtained by asking users questions about their preferences, and so on.
The users may not have even watched those movies.
They were simply asked some questions, and those answers were collected.
And that has been used as a mainstream dataset for years.
Now we are creating data or interaction data that are more representative of user tastes and interests.
Related to conversational recommender systems, where LLMs might bring a totally new or different game:
I'm asking myself how much are users willing to engage in conversations with a system to elicit preferences to achieve a certain target that is, let's say, desirable or satisfying.
What is your take on this?
If we say, yeah, you could conduct conversations with such a system, how realistic is it that users are actually willing to do so?
Or what are the driving factors behind users willing or unwilling to do so?
It's an interesting question.
The way I want to look at it, or the part I'm interested in, despite the current limitations of LLMs and all the hype, the risks, privacy issues, and other aspects that LLMs might have, hallucinations and so on and so forth: I would like to look at it first from a positive perspective.
Imagine a hypothetical good, honest, truthful, factual LLM in front of us.
If we have these LLMs and we assume they're acting as they're supposed to, one of the things that LLMs can do, and I really appreciate it as an end user, is that they can reason.
They can reason well.
Once you combine rich natural language data with LLM reasoning abilities, you get basically possibility of providing personalized recommendation in diverse contexts.
Even with minimal data, you're able to have systems that can provide you personalization in different tasks and applications.
You can't imagine any of the classical discriminative models being this potent, this powerful, such that they can be used for multiple tasks with acceptable performance.
I'm assuming that the risk and the false positive and false negative and hallucinations have been addressed.
If you have that, I think everyone agrees on the positive aspects these systems have in terms of faster deployment.
They can be easily deployed.
They can be scaled up, in some sense.
The reasoning capability, I think, is something that has dramatically increased, along with the ability to explain what you are recommending.
Certainly, these two are useful products of LLM-driven recommendation that bridge the existing gap between the user and their willingness to buy or consume the products.
I'm not talking about now the items itself, how good or bad they are.
Simply speaking, what I want to say is that even if an item is not good or is something that you don't know if it's good or bad, with good reasoning, LLM can make you buy that product or you find it useful.
Obviously here, I think from a user perspective, one of the things, as you correctly mentioned, is how correct the presented information is, because one of the risks is that these systems are very believable.
This introduces obviously new risk because LLMs can tell you a lie very convincingly.
These systems can be used in the hands of good people but also bad people.
For market penetration, for all these types of things. The reason I was looking at that hypothetical scenario was for this, and we can talk about these risks.
I'm definitely joining you on the side of being positive and open to it.
I was actually thinking about another aspect which is just being lazy.
Sometimes, I would argue, and this is some of the motivation that many papers in the RecSys space posit, users want to feel understood.
They want to feel supported in their decision making and they don't want to put too much effort in something because they want the system to learn from the data, the traces it has.
I guess there are even two things.
As a user, I don't even want to conduct a three, five, or ten minute conversation, be it orally or in writing or by clicking something, to go back and forth with such a conversational system.
To help the system get my preferences, I basically want it the lazy way.
Take what you have; and sometimes you might even have the aspect that what you are eliciting is just what comes off the top of your head, and then you might be missing out on some very important aspects which, the other way around, could already be captured in the data the system or platform has collected about you.
Just in that very moment, it wasn't present to you but it is something that is part of your preference profile.
Let me try to rephrase this as a question.
How applicable is the assumption that users would like to engage in conversations with a recommender system to get what they want?
That's an interesting question.
I think we are touching here on the preference elicitation step, in some sense, what I would call the pre-consumption phase.
You have the actual consumption, when you click, when you buy, or when you rate a product; what makes you use that product, temporally speaking, is one step before that.
I think this is actually one of the areas where generative models can perform a very useful task, offer a very helpful capability.
Obviously, this is clearer when we talk about the multimodal part.
This is a starting point to talk about multimodal.
Feel free.
One of the driving factors, for example, when you're looking at places you want to go for travels, what do you do?
You open booking.com and you simply browse the images.
Based on the appearance of the images, the price factor, and some other things; the images play a key role in making the user consume a product.
Again, when you're looking at food domain, when you want to consume a pizza for this evening, what do you do?
You look at the browser, you look at the application, you look at the images.
These images could be generated, or could be presented differently.
The presentation of these images, like the style and layout in which they're presented, can have a huge impact on the user deciding to even click on this item.
The best example might be artwork personalization by Netflix. Or are we really talking about generating new artwork and not just selecting from existing ones?
Both of them.
Because when we're talking about generating, it could also exist in the catalog.
It's just that we are generating that based on how it's done.
Personalized artwork suggestion, all kind of personalized visual recommendation, I think, can have a nice effect on making the user closer to the interaction step.
One other thing here: I think there are papers, for example by Paolo Cremonesi, and also at RecSys 2017 and 2018, which showed that when you deploy systems in real time, a lot of things at the end come down to how you present this information to the user.
This impacts a lot the user's decision making to consume an item.
This is one of the abilities that I think generative models can have, by being able to create items at a faster rate, and items that align better with the user's stylistic preferences, if we can call it that.
Also thanks to the power of text, you can add other elements.
For example, you can add elements of sympathy, depending on the context of the user, to engage them, to persuade them to consume the product.
We can look at this like human-to-human conversation: if you're deciding to talk to a person as a friend, as company, there are expectations you have of them.
You need to be understood, and this is, as you mentioned, what LLMs can now do better.
They can understand you in a human-like manner.
That person hopefully can show you, can bring up good things that you may like.
These are things that, thanks specifically to LLMs and multimodal foundation models, we are now better able to do.
Of course, on the multimodal part, there is more to be done.
I think there are a lot of weaknesses currently.
We are still not able to generate very useful images that are complete, let's say, compared to text.
But yeah, essentially, we are in that direction.
That actually brings us to that last of those recommendation scenarios section.
We already touched a bit on this, so the multimodal recommendations, especially the appeal of textual descriptions, image descriptions, but also beyond.
Let's cover that last one before we talk and wrap up with the evaluation and risks and open problems.
I mean, you already made the case.
There are some prior models that people might have heard about.
For example, CLIP embeddings, which merge language with image data into embeddings that we can use.
What are the basics of multimodal generative models for recommendations?
We can look at it from a learning perspective and from an application perspective.
From a learning perspective, there are two points that come to my mind.
First of all, thanks to foundation models, multimodal foundation models similar to LLMs, we are somehow linking internet-scale data and information into the embeddings we're extracting for visual items.
Obviously, we also had this with convolutional neural networks, with this idea of pre-training and fine-tuning, but now the scale of the data is higher.
For representing your visual items, you can base those embeddings on information that is coming from internet-scale data.
Hopefully, you're then able to find better relationships between items and users.
This is from a pre-training and fine-tuning way of looking at the task, let's say.
From a learning perspective, we can also talk about the day when we can say we are able to capture the high-dimensional distribution effectively; we are already doing that quite well.
One of the things here that is, for example, quite well done, as you also mentioned, is the capability of generative models to handle this problem more effectively.
For example, to provide a recommendation of places to go, or visual items to recommend to the user based on a given textual query, a generative model may start by doing this alignment of visual and textual features, which means mapping the latent information from these two signals into a shared latent space before doing something with it.
First of all, these are aligned using, for example, ideas like contrastive learning, on a data space over which, again, a distribution is learned.
At the end of the day, everything we are talking about in all these types of models comes down to the power of data and the power of learning the distribution.
The distribution we're talking about is the joint distribution of the data, text and visual signals, which, if you can sample it, lets you build items.
You can generate items.
You can recommend items, either it's a single item, top-k recommendation list, or page.
In the end, that distribution you have learned is a powerful tool that you can use to serve the user via recommendation.
Obviously, on the learning side, this is where generative models play a role.
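To make the alignment idea discussed here concrete, below is a minimal, purely illustrative sketch of a CLIP-style contrastive loss in NumPy. The function name and toy setup are my own assumptions, not from any specific model mentioned in the conversation: matched text/image pairs are pulled together in a shared latent space, mismatched pairs are pushed apart.

```python
import numpy as np

def contrastive_alignment_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired text/image
    embeddings: matched pairs (row i of each matrix) are pulled together
    in the shared latent space, mismatched pairs are pushed apart."""
    # Normalize both modalities so the dot product is cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)

    logits = t @ v.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(len(logits))       # text i is paired with image i

    def cross_entropy(lg, y):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average both directions: text-to-image and image-to-text.
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
```

Once such a joint space is learned, the same embeddings can feed retrieval or generation, which is exactly the "sample from the distribution" view described above.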
On the multimodal side, I think the number of applications we can name is relatively higher than in other areas, thanks to the power of visual signals.
Maybe one of the interesting examples I can mention here is this idea of Fashion IQ, or multimodal fashion conversation.
For example, in a mobile application this can be very nice.
Imagine you want to choose an item to wear for a party tonight.
You start with selecting an item.
It's worn either on your body or on a given avatar there.
Then you start talking with the system in a conversation.
I'm like, can you change the color of the shoe?
Can you make it this color?
Can you try to combine this with that?
Multimodally, the fashion outfit that you are seeing changes accordingly.
You can imagine the costs the user saves by not having to go to that shop, plus the aspects of time and context, whether it is hot or cold, and how much this can help engage users and make them interested in buying certain products.
These are some nice things.
But generally speaking, another area I can mention here is in-context product visualization: when you want to visualize a certain product based on your context. For example, when you go to IKEA and want to buy furniture, you simply want to know, based on its style and the style of your house, how it would look in your house.
You can imagine how much this can save money and effort, and reduce return rates, which is an important issue in the fashion and textile industry.
A lot of items are returned, maybe due to size issues, or because users did not like them based on their style, and so on and so forth.
Generally speaking, there are maybe two more application areas that are also interesting to mention here.
When we talk about multimodal, this does not necessarily mean images, but anything beyond user-item interaction data.
One of them is planning.
Complex event planning, like a trip or a wedding, is composed of a couple of things to do, right?
A couple of steps, a couple of locations, travel, things that go together.
Classical models, I would say, lag significantly behind generative models in being able to do this planning.
Of course, we are not claiming that generative models are now doing it perfectly, or that they are flawless, but we can say that they are doing it, first of all, at a much faster rate.
Seemingly, they are doing okay at giving you, for example, a plan for a trip.
Hopefully, with some instruction, they are on a good track to do tasks that used to be very difficult, like package recommendation to the user, right?
I think we can put them in the multimodal scenario since they are using multimodal signals, and yeah, maybe the last thing I can mention is streaming services.
I worked a lot with classical models for video recommendation during my PhD, and I can say that video recommendation based on understanding content is in its infancy.
Simply speaking, our video recommendation models do not understand what user likes in those videos.
We are not matching users' taste with the actual contents.
Well, why?
Simply speaking, because videos, for example, are memory-intensive items, right?
They occupy a lot of space.
It's difficult to process them.
It's difficult to map them into a proper representation, to match it with the user.
So simply speaking, there might be a lot of movies like this.
For example, imagine you watched a movie last night, and for a day or two you are mentally in the environment of that movie; you want to watch something in that atmosphere to follow up.
But simply speaking, since we do not know what it is that you like in those videos, all that mise-en-scène, all those black-and-white scenes from World War II, something quite interesting there, we cannot follow up, right?
So multimodal foundation models, or multimodal VLMs if we can call them visual language models, or multimodal models generally speaking, hold the promise to increase our capability of content understanding, visual content understanding.
This is something that can be absolutely useful for recommending products that we previously could not recommend so effectively.
I see.
Because they can enrich the representation of my content, so that I have basically more candidates I could trace back to the response or behavior of the user. With a richer representation, and of course with more items consumed, it is easier to trace back what the driving factors were for the user to consume that very content, or, on the other side, with negative feedback, to stop consuming certain content.
Absolutely.
Everything boils down to content representation, or visual content representation here, which for item types like video is the key.
The better we can visually encode this information, the better we can surface these items.
Yeah.
Which again is also a reminder of what you said before: that for certain recommender models, taking this as "there is some input and the model is tasked to provide a highly relevant ranking", we are already doing very well on the algorithmic side, even though this qualitative judgment of "very well" might be disputable.
But I mean, it could be the case that the higher leverage is on the user and item representation side, where LLMs, or generative models overall, could help a lot.
Yes, absolutely.
So we can say we are going beyond user behavior and modeling the content itself, which means generating personalized content that can fulfill a user's request.
I guess this covers the different data sources, models and scenarios quite well, but those systems can also influence users.
They are more powerful.
They provide better reasoning capabilities.
So when we want to evaluate those systems, which are also, I would say, complex to build and costly to maintain, let's start off with the evaluation.
So how do we actually judge how effective those composed systems are in reaching their goals, which might be manifold?
Yeah, I think evaluation is the most important question here; it is the key.
I think we can look at the topic of evaluation from different perspectives.
The first obvious question is what are we evaluating?
That needs to be answered here, and it joins the topics of impact and harm.
Under impact, we can summarize everything that is good: relevance, discoverability, novelty, good explanations, all the good parts. And then there are the things about harm.
So, all kinds of social and ethical harms, which also need to be understood first, then evaluated, and, if needed, mitigated.
These are the three parts that we're talking about.
Once you answer what are you evaluating, then there is the question of what is the goal of this evaluation?
Is it performance and quality, efficiency, safety and ethical considerations, or user satisfaction, with the different metrics associated with each?
The other thing, which is RecSys-specific, I would say, specifically talking about harms, is: who mostly benefits from or is harmed by this system?
Because recommender systems work in a two-sided or multi-sided stakeholder setting, right?
It is good and important to know whether certain types of recommendations are harming certain groups of items or producers, and which ones more or less.
To start this discussion, you can imagine LLMs being trained on a lot of textual data or internet data.
So internet data simply is biased, right?
It could be biased against a lot of sensitive demographics: gender, certain skin colors, certain races, certain countries, certain ethnicities.
What are we doing?
We are bringing this information into our recommender system, right?
And even on the producer side, you can talk about brands.
You can imagine, when you talk about fashion brands, or, I don't know, look at cars: how much information there is about nice cars versus those emerging companies.
It's easy to imagine that some of our items or users in the system would be underrepresented or overrepresented.
This can cause fairness issues on the producer side.
So it is absolutely important to understand them first, because we are in the early stages, to evaluate them next, and then, finally, if needed, mitigate them.
So obviously I am now more focused and biased toward the harm parts, but maybe we can come back to your first question on general evaluation.
As you already said, goals are manifold.
Of course, one of them is fairness, which is a relatively contested topic, and there is definitely no universal definition of fairness.
Listeners can also go back to my talk with Michael Ekstrand, where we actually talked about fairness in recommender systems.
If we think about the two aspects: on the one side, how good the model is in terms of accuracy, or maybe not necessarily accuracy, but rather how good it is at achieving user engagement, which might be a bit fuzzier to pin down.
On the other side, what do the infrastructure and modeling costs actually look like with those models?
What is the trade-off in operating these systems, but also in maintaining and, let's say, training them, and how does this compare to the additional benefit that I get?
I think it's an important question of costs versus benefits.
So I think a few days ago I read a tweet from Yann LeCun saying that if you want to work on generative AI, don't work so much on LLMs, especially in a prompt-based fashion, since at the end of the day, if you want to do something useful, it needs infrastructure.
And that infrastructure is something you may not have as a student.
Also, there are questions of how useful that is at industrial scale.
So let's come back to your question.
First of all, about the metrics, about what we are evaluating.
Obviously, the first thing is that we have the classical, traditional objectives of relevance.
And here we can use all types of recall, nDCG, and ROC-type metrics.
But one of the things here, and I am going through the evaluation dimensions one by one, is that the output space of generative models is now broader and more complex.
You are not simply recommending an item now; you can generate text, you can generate images.
The quality of these needs to be evaluated properly.
Right. So one of the things here is the output complexity.
The other thing to mention when talking about evaluation is that with generative models in particular, we have what are basically open-ended tasks.
And by open-ended tasks I mean tasks that may not have clear ground truths or predefined objectives.
For example, let's say that with generative models you want to make sure the user has an interesting shopping experience, and you have dimensions like relevance and helpfulness.
It is a question how we can evaluate systems in these non-ground-truth settings.
The other aspect we can mention is that generative systems have a higher chance of being composed of several components.
So, for example, you can imagine you have a RAG-based system with a retrieval-augmented generator composed of different parts. How do you want to evaluate it?
Do you want to evaluate it end to end, from input to output, or do you want to evaluate each of the components separately? Because one of the things that is quite commonly discussed to increase the power or usefulness of generative models is to combine them in a positive way with classical models, either in a RAG-based fashion or via fine-tuning, where, for example, in a RAG-based system you have a smaller candidate set selected by classical recommenders and given to an LLM to do the final recommendation.
So now you have a system composed of two main modules, interconnected sequentially, and the question is how to evaluate these two together.
And finally, obviously, there is the potential for harms. Harm is absolutely important. We can put everything from false negatives, false positives, and hallucination in that harm category, together with the dimensions of fairness, security, privacy, and so on and so forth.
So yeah, maybe we can say that the four main pillars of evaluation are, generally speaking, performance and quality, which is shared with classical models, efficiency, safety and ethical considerations, and user satisfaction. Maybe all of these are shared with classical models, but it could be that some of them are more pronounced now with generative models.
For example, think about LLMs, where you use an API to build an LLM-based recommender system. Simply speaking, you take some information about the target user you want to recommend to, you feed this information via an API call to an LLM, like OpenAI's, and then you receive the response back. What you are gaining here is the power of the large language model: better understanding, better reasoning, and so on and so forth.
What is the cost? It is the cost of communication, right? The first thing that comes up is how much this costs you, and at which scale of data it is useful. There are also limitations on the tokens you can send. Imagine a user in, for example, the MovieLens or Last.fm dataset who has a huge amount of information, a huge number of items: how can you summarize all this in 4096 tokens and send it to the API? If you send a shorter version of these items, you are basically misrepresenting the user profile, right? So obviously, with API-based systems, this is one aspect.
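To make the two-stage setup described here concrete, below is a deliberately simplified Python sketch. The function names, the token-budget handling, and the mocked LLM call are all illustrative assumptions of mine, not a real API or any specific system from the conversation:

```python
# Stage 1: a classical recommender scores the catalog and keeps a small
# candidate set. The scores are assumed to be precomputed upstream.
def retrieve_candidates(user_history, catalog_scores, n_candidates=20):
    ranked = sorted(catalog_scores, key=catalog_scores.get, reverse=True)
    return [item for item in ranked if item not in user_history][:n_candidates]

# Stage 2a: fit the user profile plus candidates into the model's context
# window; if the history is too long, keep only the most recent items.
def build_prompt(user_history, candidates, token_budget=4096, tokens_per_item=8):
    max_items = token_budget // tokens_per_item
    keep = max(max_items - len(candidates), 1)
    history = user_history[-keep:]
    return (f"User recently interacted with: {', '.join(history)}. "
            f"Rank these candidates by likely interest: {', '.join(candidates)}.")

# Stage 2b: placeholder for the LLM API call; it simply keeps the
# retriever's order so the sketch runs end to end without a network call.
def llm_rerank(prompt, candidates):
    return candidates

history = ["item_12", "item_7", "item_33"]
scores = {"item_1": 0.9, "item_7": 0.8, "item_4": 0.7, "item_9": 0.2}
cands = retrieve_candidates(history, scores, n_candidates=2)
final = llm_rerank(build_prompt(history, cands), cands)
```

The truncation in `build_prompt` is exactly the misrepresentation risk discussed above: whatever does not fit in the token budget is silently dropped from the user profile.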
But generally speaking, on the topic of evaluation, we can say that computational demands, throughput, latency, and energy consumption are things that need to be understood. And these differ depending on whether you are using, let's say, a prompt-based LLM or you are doing fine-tuning. If you are talking about fine-tuning, you need infrastructure to load the model and build on it.
What the AI community generally suggests is that researchers in academia and students should try to use approaches where they can improve existing models, for example approaches in the direction of fine-tuning or building task-specific LLMs, and so on and so forth. This is what seniors in the community are suggesting about where research should go in the years ahead.
But this comes with the question of infrastructure, because a PhD student simply does not have the infrastructure for this; it needs a lot of GPUs and a lot of resources to be able to build even a smaller-scale model. And the smallest scale here, I would say, is in the billion-parameter range, which is by far more than any convolutional neural network we may have used before. So there are obviously challenges here.
I can also very briefly mention some ideas about generation metrics. We talked about traditional metrics. Since we are also generating text, conversation, and multimodal data, each of these needs to be evaluated. Text, for example as an explanation, can be evaluated properly.
For conversation, we have metrics that measure, for example, the interaction cycle, how long it takes to complete, task completion, how long it takes for the user to add something to the cart. And for multimodal content, we already have metrics designed to measure, for example, the diversity and quality of the generated output, and so on.
As an idea, you can also look at evaluation from the perspective of who is actually doing the evaluating. You can evaluate offline, you can have a human evaluator, or you could even have an LLM evaluating another LLM. For example, GPT-4, which is much stronger, can evaluate a recommendation generated by GPT-3.5 or GPT-3 or other models for its helpfulness, and so on and so forth. These are trends that are becoming more common. But generally speaking, factors related to harm, scalability, and fault tolerance are, I think, the key questions to be answered from an industrial perspective.
As you have now, together with your colleagues, compiled this whole survey and structure on generative recommender systems, there are lots of open questions and still further research needed. Are there maybe two or three specific areas where you would place your bet? I mean, you are a professor, so I guess there are also PhD students that you are supervising. So if I were a new PhD student in your research group, what would be your advice, or your bet, if I were to engage in this area? Where would you say: hey, take a look at this or start working on this, because these will become issues, or these are really significant open research questions? What would that be?
Yeah, I would like to take this question as an opportunity to introduce our workshop, ROEGEN, which is the workshop on risks, opportunities, and evaluation of generative models in recommendation.
And I think the first thing I would tell my PhD students is to think, on the macro scale, about whether they want to look at the positive aspects or the negative aspects; both are equally interesting, depending on his or her interests.
Because this is the first question that could be answered. Actually, today, before this call, I was talking to my wife, Atena, who is also working on machine learning, including recommendation tasks, and this is a discussion we were having about the positive and the negative side, the risks and the opportunities.
There is ample research to be done on the positive side, depending on the domain. Imagine, for example, the specific domain of fashion: there is a whole lot that could be done on fashion recommendation. For example, now with generative models you are able to propose ideas to better provide outfit recommendation or bundle recommendation to the user, right? And match it with proper explanations, multimodal explanations, and evaluate it properly. So this by itself is a dedicated research topic that could end up being useful for the fashion and textile industry.
On the negative side, there is also a lot of risk, because I think at an industrial level, if I take generative models and say, okay, here are my classical models, collaborative filtering, matrix factorization, I want to take them away and put a generative model box here, the first question I would ask is: can I really trust this? If I go away and come back from vacation, can I trust it? Could it collapse my entire company? I am just making a joke to make it interesting. There are a lot of risks: false negatives, false positives, privacy issues, and so on. There is room for work to be done.
So I think this is a macro-scale question that could be answered based on the interests of the particular student. I would have my own opinions, but both sides are there. Maybe I can also say a few words about the harm part, because I think we did not talk about it.
Basically, there is a whole lot of research that could be done on the topic of bias and fairness, as we mentioned: misrepresentation of certain categories and demographics, which we can imagine being amplified in LLM-driven or generative recommendation models. So we have old risks that could be exaggerated, but we also have risks that are completely newly introduced: new applications, new risks. Those risks need to be understood.
And one of the biggest risks, if I want to pick among all the risks that immediately affect users, is the persuasiveness of LLMs. Imagine you ask an LLM for important information. Imagine you want some drug recommendation, some health-related suggestion for symptoms that you have, and you use one of the latest models of a certain LLM. Given the way the information is presented, how factual it looks, let's say there are some errors in between: how likely is it that the user can detect those errors?
Now, obviously, the risk of this for music recommendation is not so high; at the end of the day, you don't like the music and you skip it. But in sensitive areas, this can have a huge price. So if I wanted to work on those risk or harm scenarios, I would look at tasks that impact human lives more. And reducing those risks for human lives could be seen as a positive application. And I think we can arguably say that for some of the fairness scenarios, I am just saying some of them, this could be more damaging. If you, for example, provide wrong factual information about a health-related suggestion, it could be much more immediate, much more impactful. And I think these are the areas that could be identified and understood.
Especially given the persuasive capabilities of LLMs, as you outlined.
Absolutely.
Wow. Okay. That was a really great overview of, and reasoning about, this exciting topic. And yeah, it is really good that you put this together, because it can serve not only as an introduction, but also as an overview to know where to start from and what I maybe need to learn first, since throughout the whole book that you wrote together with your colleagues, you can also find lots of references to all the papers that were published in that regard, so one can go back and dive into more detail there. So it provides a very good overview and structure to see the multiple facets of this research area within recommender systems.
So, exciting work, and maybe also a spark for further ideas for other people to work on. Taking this: when can we actually expect this work to appear? So far there are at least two points people can look at. One is the tutorial at KDD, which might be a bit too short-notice and maybe also too far away for some people. But then we are also looking at your workshop, the first installment of this kind of workshop on the risks, opportunities, and evaluation of generative models for recommendation.
Which will be co-located with RecSys this year. But maybe the most accessible way is already out there, which is the paper that is more of a very condensed version of this book. So when can we expect the book to appear, so that people can take it and make that deep dive?
Yeah, thank you, Marcel, for introducing these works to our audience. So, about our book: here I should acknowledge and thank all the great co-authors of this work. This manuscript would not have come about without putting all these minds together. Everybody has put in a lot of effort, from different industries and academia: Zhankui, Julian, Anton, Scott, Arnau Ramisa, René Vidal, and Francesco Ricci, if I can just name them quickly.
So our book is going to be released soon. I hope somewhere between July and August it will be released as an arXiv version. Currently we are doing the final editing. But a very good visual representation and summary of this is the KDD tutorial that we have in August. This tutorial is a reflection of the book, I would say, and presents the survey that we wrote in a much more extended manner. So people will be able to visually see the content that we cover in the book in much more detail, because the survey we wrote, which was a success in general, is a condensed version of what we really have in the book.
And about the ROEGEN workshop, which I am hosting with a couple of great colleagues from academia and industry: we are going to have good speakers, senior researchers from DeepMind, from Meta, and other known institutions, talking about their work and their understanding of generative models, LLMs, and so on, specifically for recommender systems.
The idea is that we have four speakers in the workshop, for four sessions, and as you mentioned, this is the first specialized workshop on generative models, and I think the only such workshop we have at RecSys. Hopefully this will be continued at RecSys but also at other venues. Generally, the idea is that we can invite papers on both the positive aspects and the negative or harm aspects of generative models, or papers on evaluation frameworks and evaluation metrics specific to these systems. So we hope we can attract good papers.
I think the audience is welcome to follow our book on generative models, which will come soon. With this book, you will get a proper understanding of what is, let's say, a jungle of algorithms, methods, and topics, and be able to at least place them in proper order. In our book, for example, we name between 50 and 100 generative models, but we try to do that properly, by putting each one under the right data modality it works on, and explaining the idea of one model with respect to another. I think this is one of the things we try to do differently from existing surveys, which sometimes look at this from a "who did what" perspective, without giving you a proper understanding of the core idea and the differences between one model and the next. So yeah, this book would be a very good starting point to follow up and build applications, for PhD students and also for industry practitioners, I would say.
Cool, that will definitely be some kind of race to see what will be out there earlier: this podcast episode or the book. Maybe they will appear at the very same or a similar time, but let's see.
So first and foremost, I am really, really grateful, and thank you very much for putting that together and sharing all of this with our listeners. As always, we will put any references that we discussed into the show notes. Apart from that, on RECSPERTS I have so far talked to many guests, and there are even more out there, and you have already mentioned a couple of people from the community. I am constantly seeking new guests for my podcast, and I really appreciate that there are listeners who reach out to me and propose new guests, which I am really grateful for. Who could you think of that you would like me to invite to the show and have here on RECSPERTS?
To be honest, it is hard to answer this question; there are so many good people. I would recommend in particular the senior people on our book.
I think most of them are proper candidates. So depending on the interests of specific episodes, like the ones you have covered before, and how much you want to diversify, those people could be chosen. I can certainly name all the great colleagues: Julian himself, Francesco Ricci, Scott Sanner. These have been driving forces behind this work, together with their great PhD students. Arnau Ramisa and René Vidal, also from Amazon, have been the driving force behind the multimodal chapters. And I think on the generative model part specifically, if you want to have a more dedicated discussion on these topics, these people would be a great choice.
I should also mention Mahesh, who has been working on the topic of evaluation, which could also make for a great episode. He has been a senior member at Google DeepMind before.
Generally speaking, I think that the topic of generative recommendation, recommendation with generative models, has the potential to be discussed more and more. It is just the beginning of a new era for recommender system research.
Maybe it will unfold the same way it did for deep learning for recommender systems: it started out with a workshop, and then it became such an integral part of recommender models that it no longer needed a dedicated workshop.
Indeed, I think we may see a paper called "Generative Models Are All You Need".
Like "Attention Is All You Need", so we should expect that.
That sounds great. Okay, then let's wrap it up there. That was quite a long but very insightful discussion, and again, thank you very much for sharing all of that with the community. I hope they will like it a lot. I am also looking forward to meeting you again in person this year at RecSys 2024 in Bari, which would not be that big of a journey for you, and also not such a big journey for folks from Europe at least, as it was last time. So I am really looking forward to that, and also to nice weather in the nice city of Bari.
Absolutely. I would like to thank you, Marcel, personally. I have followed some of your podcast episodes, and I found them really interesting. Thank you very much for putting in this effort to create these engaging and very interesting interviews, and yeah, I really look forward to meeting you, hopefully in person, in Bari. If you are here, I would be happy to host you, or to eat out with you at some point. Of course. Great, that sounds good, let's do that.
Have a wonderful rest of the day and see you soon. See you soon. Bye. Bye.
Thank you so much for listening to this episode of RECSPERTS, Recommender Systems Experts, the podcast that brings you the experts in recommender systems. If you enjoy this podcast, please subscribe to it on your favorite podcast player, and please share it with anybody you think might benefit from it. If you have questions, a recommendation for an interesting expert you want to have on my show, or any other suggestions, drop me a message on Twitter or send me an email. Thank you again for listening and sharing, and make sure not to miss the next episode, because people who listen to this also listen to the next episode. Goodbye.