Toward the Next Generation of Recommender Systems:A Survey of the State-of-the-Art and
Possible Extensions
Gediminas Adomavicius,Member,IEEE,and Alexander Tuzhilin,Member,IEEE Abstract—This paper prents an overview of the field of recommender systems and describes the current generation of
recommendation methods that are usually classified into the following three main categories:content-bad,collaborative,and hybrid recommendation approaches.This paper also describes various limitations of current recommendation methods and discuss
possible extensions that can improve recommendation capabilities and make recommender systems applicable to an even broader range of applications.The extensions include,among others,an improvement of understanding of urs and items,incorporation of the contextual information into the recommendation process,support for multcriteria ratings,and a provision of more flexible and less intrusive types of recommendations.
Index Terms—Recommender systems,collaborative filtering,rating estimation methods,extensions to recommender systems.
æ
1I NTRODUCTION
R ECOMMENDER systems have become an important rearch area since the appearance of the first papers on collaborative filtering in the mid-1990s[45],[86],[97]. There has been much work done both in the industry and academia on developing new approaches to recommender systems over the last decade.The interest in this area still remains high becau it constitutes a problem-rich rearch area and becau of the abundance of practical applications that help urs to deal with information overload and provide personalized recommendations, content,and rvices to them.Examples of such applica-tions include recommending books,CDs,and other products [61],movies by MovieLens [67],and news at VERSIFI Technologies()[14].Moreover,some of the vendors have incorporated recommendation capabilities into their commerce rvers[78].
However,despite all of the advances,the current generation of recommender systems still requires further improvements to make recommendation methods more effective and applicable to an even broader range of real-life applications,including recommending vacations,certain types of financial
rvices to investors,and products to purcha in a store made by a“smart”shopping cart[106]. The improvements include better methods for reprent-ing ur behavior and the information about the items to be recommended,more advanced recommendation modeling methods,incorporation of various contextual information into the recommendation process,utilization of multcriteria ratings,development of less intrusive and more flexible recommendation methods that also rely on the measures that more effectively determine performance of recommen-der systems.
In this paper,we describe various ways to extend the capabilities of recommender systems.However,before doing this,we first prent a comprehensive survey of the state-of-the-art in recommender systems in Section2.Then, we identify various limitations of the current generation of recommendation methods and discuss some initial ap-proaches to extending their capabilities in Section3.
2T HE S URVEY OF R ECOMMENDER S YSTEMS Although the roots of recommender systems can be traced back to the extensive work in cognitive science[87], approximation theory[81],information retrieval[89], forecasting theories[6],and also have links to management science[71]and to consumer choice modeling in marketing [60],recommender systems emerged as an independent rearch area in the mid-1990s when rearchers started focusing on recommendat
ion problems that explicitly rely on the ratings structure.In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been en by a ur.Intuitively,this estimation is usually bad on the ratings given by this ur to other items and on some other information that will be formally described below.Once we can estimate ratings for the yet unrated items,we can recommend to the ur the item(s)with the highest estimated rating(s).
More formally,the recommendation problem can be formulated as follows:Let C be the t of all urs and let S be the t of all possible items that can be recommended, such as books,movies,or restaurants.The space S of
.G.Adomavicius is with the Carlson School of Management,University of
Minnesota,32119th Avenue South,Minneapolis,MN55455.
E-mail:gedas@umn.edu.
. A.Tuzhilin is with the Stern School of Business,New York University,
实词44West4th Street,New York,NY10012.E-mail:u.edu.
Manuscript received8Mar.2004;revid14Oct.2004;accepted10Dec.
2004;published online20Apr.2005.
For information on obtaining reprints of this article,plea nd e-mail to:
tkde@computer,and reference IEEECS Log Number TKDE-0071-0304.
1041-4347/05/$20.00ß2005IEEE Published by the IEEE Computer Society
possible items can be very large,ranging in hundreds of thousands or even millions of items in some applications,such as recommending books or CDs.Similarly,the ur space can also be very large—millions in some cas.Let u be a utility function that measures the ufulness of item s to ur c ,i.e.,u :C ÂS !R ,where R is a totally ordered t (e.g.,nonnegative integers or real numbers within a certain range).Then,for each ur c 2C ,we want to choo such item s 02S that maximizes the ur’s utility.More formally:
8c 2C;
s 0c ¼arg max s 2S
u ðc;s Þ:
ð1Þ
verst
eur是什么意思In recommender systems,the utility of an item is usually reprented by a rating ,which indicates how a particular ur liked a particular ,John Doe gave the movie “Harry Potter”the rating of 7(out of 10).However,as indicated earlier,in general,utility can be an arbitrary function,including a profit function.Depending on the application,utility u can either be specified by the ur,as is often done for the ur-defined ratings,or is computed by the application,as can be the ca for a profit-bad utility function.
Each element of the ur space C can be defined with a profile that includes various ur characteristics,such as age,gender,income,marital status,etc.In the simplest ca,the profile can contain only a single (unique)element,such as Ur ID.Similarly,each element of the item space S is defined with a t of characteristics.For example,in a movie recommendation application,where S is a collection of movies,each movie can be reprented not only by its ID,but also by its title,genre,director,year of relea,leading actors,etc.
The central problem of recommender systems lies in that utility u is usually not defined on the whole
C ÂS space,but only on some subt of it.This means u needs to be extrapolated to the whole space C ÂS .In recommender systems,utility is typically reprented by ratings and is initially defined only on the items previously rated by the urs.For example,in a movie recommendation application (such as the one at MovieLens),urs initially rate some subt of movies that they have already en.An example of a ur-item rating matrix for a movie recommendation application is prented in Table 1,where ratings are specified on the scale of 1to 5.The “ ”symbol for some of the ratings in Table 1means that the urs have not rated the corresponding movies.Therefore,the recommendation engine should be able to estimate (predict)the ratings of the nonrated movie/ur combinations and issue appropriate recommendations bad on the predictions.
Extrapolations from known to unknown ratings are usually done by 1)specifying heuristics that define the utility function and empirically validating its performance
and 2)estimating the utility function that optimizes certain performance criterion,such as the mean square error.
Once the unknown ratings are estimated,actual recommendations of an item to a ur are made by lecting the highest rating among all the estimated ratings for that ur,according to (1).Alternatively,we can recommend the N best items to a ur or a t of urs to an item.
The new ratings of the not-yet-rated items can be estimated in many different ways using methods from machine learning,approximation theory,and various heuristics.Recommender systems are usually classified according to their approach to rating estimation and,in the next ction,we will prent such a classification that was propod in the literature and will provide a survey of different types of recommender systems.The commonly accepted formulation of the recommendation problem was first stated in [45],[86],[97]and this problem has been studied extensively since then.Moreover,recommender systems are usually classified into the following categories,bad on how recommendations are made [8]:
.
totallyContent-bad recommendations :The ur will be recommended items similar to the ones the ur preferred in the past;
.Collaborative recommendations :The ur will be
recommended items that people with similar tastes and preferences liked in the past;
.Hybrid approaches :The methods combine colla-borative and content-bad methods.
In addition to recommender systems that predict the absolute values of ratings that individual urs would give to the yet unen items (as discusd above),there has been work done on preference-bad filtering ,i.e.,predicting the relative preferences of urs [22],[35],[51],[52].For example,in a movie recommendation application,prefer-ence-bad filtering techniques would focus on predicting the correct relative order of the movies,rather than their individual ratings.However,this paper focus primarily on rating-bad recommenders since it constitutes the most popular approach to recommender systems.2.1Content-Bad Methods
In content-bad recommendation methods,the utility u ðc;s Þof item s for ur c is estimated bad on the utilities u ðc;s i Þassigned by ur c to items s i 2S that are “similar”to item s .For example,in a movie recommendation application,in order to recommend movies to ur c ,the content-bad recommender system tries to understand the commonalities among the movies ur c has rated highly in the past (specific actors,directors,genres,subject matter,
TABLE 1
A Fragment of a Rating Matrix for a Movie Recommender System
etc.).Then,only the movies that have a high degree of similarity to whatever the ur’s preferences are would be recommended.
The content-bad approach to recommendation has its roots in information retrieval[7],[89]and information filtering[10]rearch.Becau of the significant and early advancements made by the information retrieval and filtering communities and becau of the importance of veral text-bad applications,many current content-bad systems focus on recommending items containing textual information,such as documents,Web sites(URLs),and Unet news messages.The improvement over the tradi-tional information retrieval approaches comes from the u of ur profiles that contain information about urs’tastes, preferences,and needs.The profiling information can be elicited from urs ,through questionnaires, or implicitly—learned from their transactional behavior over time.机警
More formally,let ContentðsÞbe an item ,a t of attributes characterizing item s.It is usually computed by extracting a t of features from item s(its content)and is ud to determine the
appropriateness of the item for recommendation purpos.Since,as mentioned earlier, content-bad systems are designed mostly to recommend text-bad items,the content in the systems is usually described with keywords.For example,a content-bad component of the Fab system[8],which recommends Web pages to urs,reprents Web page content with the 100most important words.Similarly,the Syskill&Webert system[77]reprents documents with the128most informative words.The“importance”(or“informative-ness”)of word k j in document d j is determined with some weighting measure w ij that can be defined in veral different ways.
One of the best-known measures for specifying keyword weights in Information Retrieval is the term frequency/inver document frequency(TF-IDF)measure[89]that is defined as follows:Assume that N is the total number of documents that can be recommended to urs and that keyword k j appears in n i of them.Moreover,assume that f i;j is the number of times keyword k i appears in document d j.Then, T F i;j,the term frequency(or normalized frequency)of keyword k i in document d j,is defined as
T F i;j¼
f i;j
max z f z;j
;ð2Þ
where the maximum is computed over the frequencies f z;j of all keywords k z that appear in the document d j. However,keywords that appear in many documents are not uful in distinguishing between a relevant document and a nonrelevant one.Therefore,the measure of inver document frequencyðIDF iÞis often ud in combination with simple term frequencyðT F i;jÞ.The inver document frequency for keyword k i is usually defined as
IDF i¼log N
n i
:ð3Þ
Then,the TF-IDF weight for keyword k i in document d j is defined as
w i;j¼T F i;jÂIDF ið4Þand the content of document d j is defined as
Contentðd jÞ¼ðw1j;...w kjÞ:
As stated earlier,content-bad systems recommend items similar to tho that a ur liked in the past[56],[69], [77].In particular,various candidate items are compared with items previously rated by the ur and the best-matching item(s)are recommended.More formally,let ContentBadP rofileðcÞbe the profile of ur c containing tastes and preferences of this ur.The profiles are obtained by analyzing the content of the items previously en and rated by the ur and are usually constructed using keyword analysis techniques from information retrieval.For example,ContentBadP rofileðcÞcan be defined as a vector of weightsðw c1;...;w ckÞ,where each weight w ci denotes the importance of keyword k i to ur c and can be computed from individually rated content vectors using a variety of techniques.For example,some averaging approach,such as Rocchio algorithm[85],can be ud to compute ContentBadP rofileðcÞas an“average”vector from an individual content vectors[8],[56].On the other hand,[77]us a Bayesian classifier in order to estimate the probability that a document is liked.The Winnow algorithm[62]has also been shown to work well for this purpo,especially in the situations where there are many possible features[76].
In content-bad systems,the utility function uðc;sÞis usually defined as:
uðc;sÞ¼scoreðContentBadP rofileðcÞ;ContentðsÞÞ:ð5ÞUsing the above-mentioned information retrieval-bad paradigm of recommending Web pages,Web site URLs, or Unet news messages,b
oth ContentBadP rofileðcÞof ur c and ContentðsÞof document s can be reprented as TF-IDF vectors~w c and~w s of keyword weights.Moreover, utility function uðc;sÞis usually reprented in the information retrieval literature by some scoring heuristic defined in terms of vectors~w c and~w s,such as the cosine similarity measure[7],[89]:
uðc;sÞ¼cosð~w c;~w sÞ¼
~w cÁ~w s
jj~w c jj2Âjj~w s jj2
¼
P K
i¼1
w i;c w i;s
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
P K
i¼1
w2
i;c
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
P K
i¼1
w2
i;s
q;
ð6Þ
万圣节快乐 英文
where K is the total number of keywords in the system.
For example,if ur c reads many online articles on the topic of bioinformatics,then content-bad recommenda-tion techniques will be able to recommend other bioinfor-matics articles to ur c.This is the ca becau the articles will have more bioinformatics-related ,“genome,”“quencing,”“proteomics”)than articles on other topics and,therefore,ContentBadP rofileðcÞ,as defined by vector~w c,will reprent such terms k i with high weights w ic.Conquently,a recommender system using the cosine or a related similarity measure will assign higher utility uðc;sÞto tho articles s that have high-weighted bioinformatics terms in~w s and lower utility to the ones where bioinformatics terms are weighted less.
Besides the traditional heuristics that are bad mostly on information retrieval methods,other techniques for content-bad recommendation have also been ud,such as Bayesian classifiers[70],[77]and various machine learning techniques,including clustering,decision trees, and artificial neural networks[77].The techniques differ from information retrieval-bad approaches in that they calculate utility predictions bad not on a heuristic formula,such as a cosine similarity measure,but rather are bad on a model learned from the underlying data using statistical learning and machine learning techni-ques.For example,bad on a t of Web pages that were rated as“rele
vant”or“irrelevant”by the ur,[77]us the naive Bayesian classifier[31]to classify unrated Web pages.More specifically,the naive Bayesian classifier is ud to estimate the following probability that page p j belongs to a certain class C ,relevant or irrelevant) given the t of keywords k1;j;...;k n;j on that page:
PðC i j k1;j&...&k n;jÞ:ð7ÞMoreover,[77]us the assumption that keywords are independent and,therefore,the above probability is proportional to
PðC iÞ
Y
x
Pðk x;j j C iÞ:ð8Þ
While the keyword independence assumption does not necessarily apply in many applications,experimental results demonstrate that naı¨ve Bayesian classifiers still produce high classification accuracy[77].Furthermore,both Pðk x;j j C iÞand PðC iÞcan be estimated from the underlying training data.Therefore,for each page p j,the probability PðC i j k1;j&...&k n;jÞis computed
for each class C i and page p j is assigned to class C i having the highest probability[77].
While not explicitly dealing with providing recommen-dations,the text retrieval community has contributed veral techniques that are being ud in content-bad recommen-der systems.One example of such a technique would be the rearch on adaptive filtering[101],[112],which focus on becoming more accurate at identifying relevant documents incrementally by obrving the documents one-by-one in a continuous document stream.Another example would be the work on threshold tting[84],[111],which focus on determining the extent to which documents should match a given query in order to be relevant to the ur.Other text retrieval methods are described in[50]and can also be found in the proceedings of the Text Retrieval Conference (TREC)(v).
As was obrved in[8],[97],content-bad recommender systems have veral limitations that are described in the rest of this ction.
2.1.1Limited Content Analysis
Content-bad techniques are limited by the features that are explicitly associated with the objects that the systems recommend.Therefore,in order to have a sufficient t of features,the content mu
st either be in a form that can be pard automatically by a ,text)or the features should be assigned to items manually.While information retrieval techniques work well in extracting features from text documents,some other domains have an inherent problem with automatic feature extraction.For example,automatic feature extraction methods are much harder to apply to multimedia ,graphical images, audio streams,and video streams.Moreover,it is often not practical to assign attributes by hand due to limitations of resources[97].
Another problem with limited content analysis is that,if two different items are reprented by the same t of features,they are indistinguishable.Therefore,since text-bad documents are usually reprented by their most important keywords,content-bad systems cannot distin-guish between a well-written article and a badly written one,if they happen to u the same terms[97].
2.1.2Overspecialization
When the system can only recommend items that score highly against a ur’s profile,the ur is limited to being recommended items that are similar to tho already rated. For example,a person with no experience with Greek cuisine would never receive a recommendation for even the greatest Greek restaurant in town.This problem,which has also been studied in other domains,is often addres
d by introducing some randomness.For example,the u of genetic algorithms has been propod as a possible solution in the context of information filtering[98].In addition,the problem with overspecialization is not only that the content-bad systems cannot recommend items that are different from anything the ur has en before.In certain cas,items should not be recommended if they are too similar to something the ur has already en,such as a different news article describing the same event.Therefore, some content-bad recommender systems,such as Daily-Learner[13],filter out items not only if they are too different from the ur’s preferences,but also if they are too similar to something the ur has en before.Furthermore,Zhang et al.[112]provide a t of five redundancy measures to evaluate whether a document that is deemed to be relevant contains some novel information as well.In summary,the diversity of recommendations is often a desirable feature in recommender systems.Ideally,the ur should be pre-nted with a range of options and not with a homogeneous t of alternatives.For example,it is not necessarily a good idea to recommend all movies by Woody Allen to a ur who liked one of them.
2.1.3New Ur Problem
The ur has to rate a sufficient number of items before a content-bad recommender system can really understand the ur’s preferences and prent the ur with reliable recommendations.Theref
ore,a new ur,having very few ratings,would not be able to get accurate recommendations.
2.2Collaborative Methods
Unlike content-bad recommendation methods,collabora-tive recommender systems(or collaborative filtering systems) try to predict the utility of items for a particular ur bad on the items previously rated by other urs.More formally, the utility uðc;sÞof item s for ur c is estimated bad on the utilities uðc j;sÞassigned to item s by tho urs c j2C who are“similar”to ur c.For example,in a movie
recommendation application,in order to recommend movies to ur c ,the collaborative recommender system tries to find the “peers”of ur c ,i.e.,other urs that have similar tastes in movies (rate the same movies similarly).Then,only the movies that are most liked by the “peers”of ur c would be recommended.
There have been many collaborative systems developed in the academia and the industry.It can be argued that the Grundy system [87]was the first recommender system,which propod using stereotypes as a mechanism for building models of urs bad on a limited amount of information on each individual ur.Using stereotypes,the Grundy system would build individual ur models and
u them to recommend relevant books to each ur.Later on,the Tapestry system relied on each ur to identify like-minded urs manually [38].GroupLens [53],[86],Video Recommender [45],and Ringo [97]were the first systems to u collaborative filtering algorithms to automate prediction.Other examples of collaborative recommender systems include the book recommendation system ,the PHOAKS system that helps people find relevant information on the WWW [103],and the Jester system that recommends jokes [39].
According to [15],algorithms for collaborative recom-mendations can be grouped into two general class:memory-bad (or heuristic-bad )and model-bad .
Memory-bad algorithms [15],[27],[72],[86],[97]esntially are heuristics that make rating predictions bad on the entire collection of previously rated items by the urs.That is,the value of the unknown rating r c;s for ur c and item s is usually computed as an aggregate of the ratings of some other (usually,the N most similar)urs for the same item s :
r c;s ¼aggr c 02^C
r c 0;s ;
get off>fingerscrosd
ð9Þ
where ^C
denotes the t of N urs that are the most similar to ur c and who have rated item s (N can range anywhere from 1to the number of all urs).Some examples of the aggregation function are:
ða Þr c;s ¼1N X
edwardc 02^C r c 0;s ;ðb Þr c;s
conduct¼k X c 02^C
sim ðc;c 0ÞÂr c 0;s ;
ðc Þr c;s ¼"r
c þk X
c 02^C
sim ðc;c 0ÞÂðr c 0;s À"r
c 0Þ;ð10Þ
where multiplier k rves as a normalizing factor and is usually lected as k ¼1 P c 02^C
j sim ðc;c 0
Þj ,and where the average rating of ur c ,"r
c ,in (10c)is define
d as 1"r c ¼À1 j S c j ÁX s 2S c
r c;s
;where S c ¼f s 2S j r c;s ¼ g :ð11ÞIn the simplest ca,the aggregation can be a simple average,as defined by (10a).However,the most common aggregation approach is to u the weighted sum,shown in (10b).The similarity measure between urs c and c 0,
sim ðc;c 0Þ,is esntially a distance measure and is ud as a
,the more similar urs c and c 0are,the more weight rating r c 0;s will carry in the prediction of r c;s .Note that sim ðx;y Þis a heuristic artifact that is introduced in order to be able to di
fferentiate between levels of ur similarity (i.e.,to be able to find a t of “clost peers”or “nearest neighbors”for each ur)and,at the same time,simplify the rating estimation procedure.As shown in (10b),different recommendation applications can u their own ur similarity measure as long as the calculations are normalized using the normalizing factor k ,as shown above.The two most commonly ud similarity measures will be described below.One problem with using the weighted sum,as in (10b),is that it does not take into account the fact that different urs may u the rating scale differently.The adjusted weighted sum,shown in (10c),has been widely ud to address this limitation.In this approach,instead of using the absolute values of ratings,the weighted sum us their deviations from the average rating of the correspond-ing ur.Another way to overcome the differing us of the rating scale is to deploy preference-bad filtering [22],[35],[51],[52],which focus on predicting the relative prefer-ences of urs instead of absolute rating values,as was pointed out earlier in Section 2.
Various approaches have been ud to compute the similarity sim ðc;c 0Þbetween urs in collaborative recom-mender systems.In most of the approaches,the similarity between two urs is bad on their ratings of items that both urs have rated.The two most popular approaches are correlation and cosine-bad .To prent them,let S xy be the t of all items corated by both urs x
and y ,i.e.,S xy ¼f s 2S j r x;s ¼ &r y;s ¼ g .In collaborative recom-mender systems,S xy is ud mainly as an intermediate result for calculating the “nearest neighbors”of ur x and is often computed in a straightforward ,by computing the interction of ts S x and S y .However,some methods,such as the graph-theoretic approach to collaborative filtering [4],can determine the nearest neighbors of x without computing S xy for all urs y .In the correlation-bad approach,the Pearson correlation coefficient is ud to measure the similarity [86],[97]:
sim ðx;y Þ¼P
s 2S xy
ðr x;s À"r
x Þðr y;s À"r y ÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP s 2S xy
ðr x;s À"r x Þ2P s 2S xy
ðr y;s À"r
y Þ2r :ð12Þ
In the cosine-bad approach [15],[91],the two urs x and y are treated as two vectors in m -dimensional space,where m ¼j S xy j .Then,the similarity between two vectors can be measured by computing the cosine of the angle between them:
sim ðx;y Þ¼cos ð~x ;~y Þ¼~x Á~y
jj ~x jj 2Âjj ~y jj 2¼P
s 2S xy r x;s r y;s ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP s 2S xy
r 2x;s
r ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP s 2S xy
r
2y;s
r ;ð13Þ
where ~x Á~y denotes the dot-product between the vectors ~x
and ~y .Still another approach to measuring similarity between urs us the mean squared difference measure
1.We u the r c;s ¼ notation to indicate that item s has not been rated by ur c .