arXiv:cs/9902011v1 [cs.DL] 7 Feb 1999
Content-Based Book Recommending Using Learning for Text Categorization

Raymond J. Mooney
Department of Computer Sciences
Loriene Roy
Graduate School of Library and Information Science
University of Texas, Austin, TX 78712
Email: mooney@cs.utexas.edu, loriene@gslis.utexas.edu
ABSTRACT
Recommender systems improve access to relevant products and information by making personalized suggestions based on previous examples of a user's likes and dislikes. Most existing recommender systems use social filtering methods that base recommendations on other users' preferences. By contrast, content-based methods use information about an item itself to make suggestions. This approach has the advantage of being able to recommend previously unrated items to users with unique interests and to provide explanations for its recommendations. We describe a content-based book recommending system that utilizes information extraction and a machine-learning algorithm for text categorization. Initial experimental results demonstrate that this approach can produce accurate recommendations.
KEYWORDS: Recommender systems, information filtering, machine learning, text categorization
INTRODUCTION
There is a growing interest in recommender systems that suggest music, films, books, and other products and services to users based on examples of their likes and dislikes [19, 26, 11]. A number of successful startup companies like Firefly, Net Perceptions, and LikeMinds have formed to provide recommending technology. On-line book stores like Amazon and BarnesAndNoble have popular recommendation services, and many libraries have a long history of providing reader's advisory services [2, 21]. Such services are important since readers' preferences are often complex and not readily reduced to keywords or standard subject categories, but rather best illustrated by example. Digital libraries should be able to build on this tradition of assisting readers by providing cost-effective, informed, and personalized automated recommendations for their patrons.
Existing recommender systems almost exclusively utilize a form of computerized matchmaking called collaborative or social filtering. The system maintains a database of the preferences of individual users, finds other users whose known preferences correlate significantly with a given patron, and recommends to a person other items enjoyed by their matched patrons. This approach assumes that a given user's tastes are generally the same as another user of the system and that a sufficient number of user ratings are available. Items that have not been rated by a sufficient number of users cannot be effectively recommended. Unfortunately, statistics on library use indicate that most books are utilized by very few patrons [12]. Therefore, collaborative approaches naturally tend to recommend popular titles, perpetuating homogeneity in reading choices. Also, since significant information about other users is required to make recommendations, this approach raises concerns about privacy and access to proprietary customer data.
Learning individualized profiles from descriptions of examples (content-based recommending [3]), on the other hand, allows a system to uniquely characterize each patron without having to match their interests to someone else's. Items are recommended based on information about the item itself rather than on the preferences of other users. This also allows for the possibility of providing explanations that list content features that caused an item to be recommended, potentially giving readers confidence in the system's recommendations and insight into their own preferences. Finally, a content-based approach can allow users to provide initial subject information to aid the system.
Machine learning for text categorization has been applied to content-based recommending of web pages [25] and newsgroup messages [15]; however, to our knowledge it has not previously been applied to book recommending. We have been exploring content-based book recommending by applying automated text-categorization methods to semi-structured text extracted from the web. Our current prototype system, LIBRA (Learning Intelligent Book Recommending Agent), uses a database of book information extracted from web pages. Users provide 1-10 ratings for a selected set of training books; the system then learns a profile of the user using a Bayesian learning algorithm and produces a ranked list of the most recommended additional titles from the system's catalog.
As evidence for the promise of this approach, we present initial experimental results on several data sets of books randomly selected from particular genres such as mystery, science, literary fiction, and science fiction and rated by different users. We use standard experimental methodology from machine learning and present results for several evaluation metrics on independent test data, including rank correlation coefficient and average rating of top-ranked books.
The remainder of the paper is organized as follows. Section 2 provides an overview of the system, including the algorithm used to learn user profiles. Section 3 presents results of our initial experimental evaluation of the system. Section 4 discusses topics for further research, and Section 5 presents our conclusions on the advantages and promise of content-based book recommending.
SYSTEM DESCRIPTION
Extracting Information and Building a Database
First, an Amazon subject search is performed to obtain a list of book-description URLs of broadly relevant titles. LIBRA then downloads each of the pages and uses a simple pattern-based information-extraction system to extract data about each title. Information extraction (IE) is the task of locating specific pieces of information in a document, thereby obtaining useful structured data from unstructured text [16, 9]. Specifically, it involves finding a set of substrings from the document, called fillers, for each of a set of specified slots. When applied to web pages instead of natural language text, such an extractor is sometimes called a wrapper [14]. The current slots utilized by the recommender are: title, authors, synopses, published reviews, customer comments, related authors, related titles, and subject terms. Amazon produces the information about related authors and titles using collaborative methods; however, LIBRA simply treats them as additional content about the book. Only books that have at least one synopsis, review, or customer comment are retained as having adequate content information. A number of other slots are also extracted (publisher, date, ISBN, price, etc.) but are currently not used by the recommender. We have initially assembled databases for literary fiction (3,061 titles), science fiction (3,813 titles), mystery (7,285 titles), and science (6,177 titles).
Since the layout of Amazon's automatically generated pages is quite regular, a fairly simple extraction system is sufficient. LIBRA's extractor employs a simple pattern matcher that uses pre-filler, filler, and post-filler patterns for each slot, as described by [6]. In other applications, more sophisticated information-extraction methods and inductive learning of extraction rules might be useful [7].
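As a concrete illustration, the pre-filler/filler/post-filler scheme can be sketched with regular expressions. The slot markers and page snippet below are hypothetical, not Amazon's actual page layout:

```python
import re

# Hypothetical sketch of pre-filler/filler/post-filler slot extraction: the
# filler matched between the two context patterns becomes the slot value.
SLOT_PATTERNS = {
    "title":    (r"<b>Title:</b>\s*", r"[^<]+", r"</p>"),
    "authors":  (r"<b>Author:</b>\s*", r"[^<]+", r"</p>"),
    "synopses": (r"<b>Synopsis:</b>\s*", r"[^<]+", r"</p>"),
}

def extract_slots(page: str) -> dict:
    """Return a dict mapping each slot name to its list of extracted fillers."""
    slots = {}
    for name, (pre, filler, post) in SLOT_PATTERNS.items():
        pattern = re.compile(pre + "(" + filler + ")" + post)
        slots[name] = [m.group(1).strip() for m in pattern.finditer(page)]
    return slots

page = "<b>Title:</b> The Fabric of Reality</p><b>Author:</b> David Deutsch</p>"
print(extract_slots(page))
```

Because the extraction targets regularly generated pages, fixed context patterns like these are usually sufficient; hand-written rules would need revisiting whenever the page layout changes.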
The text in each slot is then processed into an unordered bag of words (tokens) and the examples represented as a vector of bags of words (one bag for each slot). A book's title and authors are also added to its own related-title and related-author slots, since a book is obviously "related" to itself, and this allows overlap in these slots with books listed as related to it. Some minor additions include the removal of a small list of stop-words and the preprocessing of author names into unique tokens of the form first-initial_last-name.

A user's profile is learned using a Bayesian text classifier, which computes the posterior probability of a category c_j given a document D as:

P(c_j|D) = (P(c_j) / P(D)) · ∏_{i=1}^{|D|} P(a_i|c_j)    (1)
where a_i is the i-th word in the document and |D| is the length of the document in words. Since for any given document the prior P(D) is a constant, this factor can be ignored if all that is desired is a ranking rather than a probability estimate. A ranking is produced by sorting documents by their odds ratio, P(c_1|D)/P(c_0|D), where c_1 represents the positive class and c_0 represents the negative class. An example is classified as positive if the odds are greater than 1, and negative otherwise.
In our case, since books are represented as a vector of "documents," d_m, one for each slot (where s_m denotes the m-th slot), the probability of each word given the category and the slot, P(w_k|c_j, s_m), must be estimated, and the posterior category probabilities for a book, B, are computed using:

P(c_j|B) = (P(c_j) / P(B)) · ∏_{m=1}^{S} ∏_{i=1}^{|d_m|} P(a_{mi}|c_j, s_m)    (2)

where S is the number of slots and a_{mi} is the i-th word in the m-th slot.
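To make the ranking concrete, here is a minimal sketch of per-slot multinomial naive Bayes. Laplace smoothing is an assumption here (the paper's exact parameter estimates may differ), and since P(B) cancels in the odds ratio, ranking only needs the log-odds:

```python
import math
from collections import Counter

# Minimal sketch of per-slot naive Bayes ranking: estimate P(w | class, slot)
# with Laplace smoothing and rank books by their positive/negative log-odds.

def train(books, labels, vocab):
    """books: list of {slot: [tokens]}; labels: 1 (positive) or 0 (negative)."""
    slots = {s for b in books for s in b}
    prior = {c: labels.count(c) / len(labels) for c in (0, 1)}
    counts = {(c, s): Counter() for c in (0, 1) for s in slots}
    for book, c in zip(books, labels):
        for s, words in book.items():
            counts[(c, s)].update(words)
    def p(w, c, s):  # Laplace-smoothed P(w | c, s)
        total = sum(counts[(c, s)].values())
        return (counts[(c, s)][w] + 1) / (total + len(vocab))
    return prior, p

def log_odds(book, prior, p):
    """log P(c1|B) - log P(c0|B); P(B) cancels, so this suffices for ranking."""
    score = math.log(prior[1]) - math.log(prior[0])
    for s, words in book.items():
        for w in words:
            score += math.log(p(w, 1, s)) - math.log(p(w, 0, s))
    return score

books = [{"words": ["space", "mars"]}, {"words": ["romance", "tears"]}]
labels = [1, 0]
vocab = {"space", "mars", "romance", "tears"}
prior, p = train(books, labels, vocab)
```

Sorting unrated books by `log_odds` descending yields the recommended ranking; a positive score corresponds to an odds ratio greater than 1.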
Slot             Word           Strength
WORDS            ZUBRIN         9.85
WORDS            SMOLIN         9.39
WORDS            TREFIL         8.77
WORDS            DOT            8.67
SUBJECTS         COMPARATIVE    8.39
AUTHOR           D_ZUBRIN       7.63
AUTHOR           R_MORAVEC      7.63
RELATED-AUTHORS  B_RADFORD      7.63
WORDS            LEE            7.57
WORDS            MORAVEC        7.57
WORDS            WAGNER         7.57
RELATED-TITLES   CONNECTIONIST  7.51
RELATED-TITLES   BELOW          7.51

Table 1: Sample Positive Profile Features
Learning a profile took an average of 11.5 seconds, and the system probabilistically categorized new test examples at an average rate of about 200 books per second. An optimized implementation could no doubt significantly improve performance even further.
A profile can be partially illustrated by listing the features most indicative of a positive or negative rating. Table 1 presents the top 20 features for a sample profile learned for recommending science books. Strength measures how much more likely a word in a slot is to appear in a positively rated book than a negatively rated one, computed as:
Strength(w_k, s_j) = log(P(w_k|c_1, s_j) / P(w_k|c_0, s_j))    (6)

Producing, Explaining, and Revising Recommendations
Once a profile is learned, it is used to predict the preferred ranking of the remaining books based on the posterior probability of a positive categorization, and the top-scoring recommendations are presented to the user.
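Equation (6) can be computed directly; given smoothed per-class word probabilities, a Table 1 style feature list is just a sort by this score. The probability values below are fabricated for illustration:

```python
import math

def strength(p_pos: float, p_neg: float) -> float:
    """Equation (6): log(P(w | c1, s) / P(w | c0, s))."""
    return math.log(p_pos / p_neg)

def top_features(probs, n=5):
    """probs maps (slot, word) -> (P(w|c1,s), P(w|c0,s)); rank by strength."""
    feats = [(slot, w, strength(pp, pn)) for (slot, w), (pp, pn) in probs.items()]
    return sorted(feats, key=lambda f: -f[2])[:n]

# Illustrative (made-up) smoothed probabilities:
probs = {
    ("WORDS", "ZUBRIN"): (0.020, 0.0001),
    ("WORDS", "TREFIL"): (0.010, 0.0001),
    ("SUBJECTS", "COMPARATIVE"): (0.005, 0.0001),
}
```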
The system also has a limited ability to "explain" its recommendations by listing the features that most contributed to an item's high rank. For example, given the profile illustrated above, LIBRA presented the explanation shown in Table 2. The strength of a cue in this case is multiplied by the number of times it appears in the description in order to fully indicate its influence on the ranking. The positiveness of a feature can in turn be explained by listing the user's training examples that most influenced its strength, as illustrated in Table 3, where "Count" gives the number of times the feature appeared in the description of the rated book.
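The explanation mechanism described above (strength times in-description frequency) can be sketched as follows; the feature strengths in the example are hypothetical:

```python
from collections import Counter

def explain(book_slots, strengths, n=3):
    """Rank a book's features by strength x in-description frequency.

    book_slots: {slot: [tokens]} for the recommended book;
    strengths:  {(slot, word): profile strength from the learned profile}.
    Returns the n features that contributed most to the book's rank.
    """
    contrib = Counter()
    for slot, words in book_slots.items():
        for word in words:
            if (slot, word) in strengths:
                contrib[(slot, word)] += strengths[(slot, word)]
    return contrib.most_common(n)
```

For example, a word of strength 2.0 occurring twice contributes 4.0 and would be listed ahead of a word of strength 3.0 occurring once.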
After reviewing the recommendations (and perhaps disrecommendations), the user may assign their own rating to examples they believe to be incorrectly ranked and retrain the system to produce improved recommendations. As with relevance feedback in information retrieval [27], this cycle can be repeated several times in order to produce the best results. Also, as new examples are provided, the system can track any change in a user's preferences and alter its recommendations based on the additional information.

[Table 2: explanation for recommending "The Fabric of Reality: The Science of Parallel Universes and Its Implications" by David Deutsch, listing the contributing features by Slot, Word, and Strength.]

Title                                          Rating  Count
The Life of the Cosmos                           10      15
Before the Beginning: Our Universe and Others     8       7
Unveiling the Edge of Time                       10       3
Black Holes: A Traveler's Guide                   9       3
The Inflationary Universe                         9       2

Table 3: Sample Feature Explanation
EXPERIMENTAL RESULTS
Methodology
Data Collection: Several data sets were assembled to evaluate LIBRA. The first two were based on the first 3,061 adequate-information titles (books with at least one abstract, review, or customer comment) returned for the subject search "literature fiction." Two separate sets were randomly selected from this dataset, one with 936 books and one with 935, and rated by two different users. These sets will be called LIT1 and LIT2, respectively. The remaining sets were based on all of the adequate-information Amazon titles for "mystery" (7,285 titles), "science" (6,177 titles), and "science fiction" (3,813 titles). From each of these sets, 500 titles were chosen at random and rated by a user (the same user rated both the science and science fiction books). These sets will be called MYST, SCI, and SF, respectively.

[Table 4: summary statistics for each data set, giving the number of examples, average rating, and percent rated positive. Table 5: the number of books in each 1-10 rating category for each data set.]
In order to present a quantitative picture of performance on a realistic sample, books to be rated were selected at random. However, this means that many books may not have been familiar to the user, in which case the user was asked to supply a rating based on reviewing the Amazon page describing the book. Table 4 presents some statistics about the data and Table 5 presents the number of books in each rating category. Note that overall the data sets have quite different rating distributions.
Performance Evaluation: To test the system, we performed 10-fold cross-validation, in which each data set is randomly split into 10 equal-size segments and results are averaged over 10 trials, each time leaving a separate segment out for independent testing and training the system on the remaining data [22]. In order to observe performance given varying amounts of training data, learning curves were generated by testing the system after training on increasing subsets of the overall training data. A number of metrics were used to measure performance on the novel test data, including:
• Classification accuracy (Acc): The percentage of examples correctly classified as positive or negative.
• Recall (Rec): The percentage of positive examples classified as positive.
• Precision (Pr): The percentage of examples classified as positive which are positive.
• Precision at Top 3 (Pr3): The percentage of the 3 top-ranked examples which are positive.
• Precision at Top 10 (Pr10): The percentage of the 10 top-ranked examples which are positive.
• F-Measure (F): A weighted average of precision and recall frequently used in information retrieval:

  F = (2 · Pr · Rec) / (Pr + Rec)
Data  N    Acc   Rec   Pr    Pr3   Pr10  F     Rt3   Rt10  r_s
LIT1  5    65.5  51.3  53.3  86.7  76.0  49.7  6.63  6.65  0.35
LIT1  20   73.9  65.1  63.6  86.7  81.0  63.4  7.40  7.32  0.64
LIT1  100  79.8  62.8  75.9  96.7  94.0  68.5  8.57  8.03  0.74
LIT2  5    59.0  57.6  52.4  70.0  74.0  53.3  6.80  6.82  0.31
LIT2  10   69.5  67.2  63.2  93.3  91.0  64.1  8.20  7.84  0.59
LIT2  40   78.0  78.5  71.2  96.7  94.0  74.4  8.77  8.22  0.72
LIT2  840  -     -     -     -     -     -     -     -     -
MYST  5    75.6  87.9  82.4  90.0  90.0  83.8  8.40  8.34  0.40
MYST  20   85.2  95.4  85.9  96.7  94.0  90.3  8.37  8.52  0.50
MYST  100  85.8  93.2  88.1  96.7  98.0  90.5  8.90  8.97  0.61
SCI   5    62.8  63.8  46.3  73.3  60.0  51.1  6.97  6.17  0.35
SCI   10   75.4  66.0  64.2  96.7  80.0  63.1  8.37  7.03  0.51
SCI   40   81.8  74.4  72.2  93.3  83.0  72.3  8.50  7.29  0.65
SCI   450  -     -     -     -     -     -     -     -     -
SF    5    64.6  49.0  28.9  53.3  36.0  31.5  5.83  4.72  0.15
SF    20   72.6  58.9  40.1  70.0  43.0  43.0  6.47  5.26  0.39
SF    100  79.2  82.2  49.1  90.0  63.0  60.6  7.70  6.26  0.61

Table 6: Summary of Results
• Rating of Top 3 (Rt3): The average user rating assigned to the 3 top-ranked examples.
• Rating of Top 10 (Rt10): The average user rating assigned to the 10 top-ranked examples.
• Rank Correlation (r_s): Spearman's rank correlation coefficient between the system's ranking and that imposed by the user's ratings (−1 ≤ r_s ≤ 1); ties are handled using the method recommended by [1].
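For concreteness, the ranking metrics above can be sketched as follows. The positive threshold of 6 on the 1-10 scale and the midrank treatment of ties are assumptions here (the paper defers its tie handling to [1]):

```python
# `ranked_ratings` is the user's 1-10 rating for each test book, in the
# system's ranked order (best first); a rating >= 6 is treated as "positive".

def precision_at(ranked_ratings, n, positive=6):
    top = ranked_ratings[:n]
    return 100.0 * sum(r >= positive for r in top) / len(top)

def rating_at(ranked_ratings, n):
    top = ranked_ratings[:n]
    return sum(top) / len(top)

def midranks(values):
    """Ranks 1..n with ties given the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based position of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Pearson correlation of midranks (handles tied ratings)."""
    rx, ry = midranks(xs), midranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```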
The top-3 and top-10 metrics are given since many users will be primarily interested in getting a few top-ranked recommendations. Rank correlation gives a good overall picture of how the system's continuous ranking of books agrees with the user's, without requiring that the system actually predict the numerical rating score assigned by the user. A correlation coefficient of 0.3 to 0.6 is generally considered "moderate" and above 0.6 is considered "strong."
Basic Results
The results are summarized in Table 6, where N represents the number of training examples utilized and results are shown for a number of representative points along the learning curve. Overall, the results are quite encouraging even when the system is given relatively small training sets. The SF data set is clearly the most difficult since there are very few highly rated books.
The "top n" metrics are perhaps the most relevant to many users. Consider precision at top 3, which is fairly consistently in the 90% range after only 20 training examples (the exceptions are LIT1 until 70 examples and SF until 450 examples). Therefore, LIBRA's top recommendations are highly likely to be viewed positively by the user. Note that the "% Positive" column in Table 4 gives the probability that a randomly chosen example from a given data set will be positively rated. Therefore, for every data set, the top-3 and top-10 recommendations are always substantially more likely than random to be rated positively, even after only 5 training examples.
[Figure 1: LIT1 Rank Correlation (correlation coefficient vs. number of training examples, LIBRA vs. LIBRA-NR).]

Considering the average rating of the top 3 recommendations, it is fairly consistently above an 8 after only 20 training examples (the exceptions again are LIT1 until 100 examples and SF). For every data set, the top-3 and top-10 recommendations are always rated substantially higher than a randomly selected example (cf. the average rating from Table 4).

Looking at the rank correlation, except for SF, there is at least a moderate correlation (r_s ≥ 0.3) after only 10 examples, and SF exhibits a moderate correlation after 40 examples. This becomes a strong correlation (r_s ≥ 0.6) for LIT1 after only 20 examples, for LIT2 after 40 examples, for SCI after 70 examples, for MYST after 300 examples, and for SF after 450 examples.
Results on the Role of Collaborative Content
Since collaborative and content-based approaches to recommending have somewhat complementary strengths and weaknesses, an interesting question that has already attracted some initial attention [3, 4] is whether they can be combined to produce even better results. Since LIBRA exploits content about related authors and titles that Amazon produces using collaborative methods, an interesting question is whether this collaborative content actually helps its performance. To examine this issue, we conducted an "ablation" study in which the slots for related authors and related titles were removed from LIBRA's representation of book content. The resulting system, called LIBRA-NR, was compared to the original one using the same 10-fold training and test sets. The statistical significance of any differences in performance between the two systems was evaluated using a 1-tailed paired t-test requiring a significance level of p < 0.05.
Overall, the results indicate that the use of collaborative content has a significant positive effect. Figures 1, 2, and 3 show sample learning curves for different important metrics for a few data sets. For the LIT1 rank-correlation results shown in Figure 1, there is a consistent, statistically significant difference in performance from 20 examples onward.

[Figure 2: MYST Precision at Top 10 (LIBRA vs. LIBRA-NR).]

[Figure 3: SF Average Rating of Top 3 (LIBRA vs. LIBRA-NR).]

For the MYST results on precision at top 10 shown in Figure 2, there is a consistent, statistically significant difference in performance from 40 examples onward. For the SF results on average rating of the top 3, there is a statistically significant difference at 10, 100, 150, 200, and 450 examples. The results shown are some of the most consistent differences for each of the metrics; however, all of the data sets demonstrate some significant advantage of using collaborative content according to one or more metrics. Therefore, information obtained from collaborative methods can be used to improve content-based recommending, even when the actual user data underlying the collaborative method is unavailable due to privacy or proprietary concerns.
FUTURE WORK
We are currently developing a web-based interface so that LIBRA can be experimentally evaluated in practical use with a larger body of users. We plan to conduct a study in which each user selects their own training examples, obtains recommendations, and provides final informed ratings after reading one or more selected books.