A Classic Example of Latent Semantic Analysis (LSA)

Updated: 2023-07-20 16:19:18

Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), literally means analyzing documents to find the underlying meaning or concepts of those documents. If each word only meant one concept, and each concept was only described by one word, then LSA would be easy, since there would be a simple mapping from words to concepts.
Unfortunately, this problem is difficult because English has different words that mean the same thing (synonyms), words with multiple meanings, and all sorts of ambiguities that obscure the concepts to the point where even people can have a hard time understanding.
For example, the word bank when used together with mortgage, loans, and rates probably means a financial institution. However, the word bank when used together with lures, casting, and fish probably means a stream or river bank.
How Latent Semantic Analysis Works
Latent Semantic Analysis arose from the problem of how to find relevant documents from search words. The fundamental difficulty arises when we compare words to find relevant documents, because what we really want to do is compare the meanings or concepts behind the words. LSA attempts to solve this problem by mapping both words and documents into a "concept" space and doing the comparison in this space.
Since authors have a wide choice of words available when they write, the concepts can be obscured by the different word choices of different authors. This essentially random choice of words introduces noise into the word-concept relationship. Latent Semantic Analysis filters out some of this noise and also attempts to find the smallest set of concepts that spans all the documents.
In order to make this difficult problem solvable, LSA introduces some dramatic simplifications.
1. Documents are represented as "bags of words", where the order of the words in a document is not important, only how many times each word appears in a document.
2. Concepts are represented as patterns of words that usually appear together in documents. For example "leash", "treat", and "obey" might usually appear in documents about dog training.
3. Words are assumed to have only one meaning. This is clearly not the case (banks could be river banks or financial banks) but it makes the problem tractable.
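The first simplification can be illustrated with a minimal sketch. The function name `bag_of_words`, the two sample sentences, and the tokenization rule (lowercase, letters and apostrophes only) are assumptions made for illustration; they are not part of the original tutorial.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, ignoring word order and case."""
    # Assumed tokenization: runs of letters/apostrophes, lowercased.
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Two hypothetical documents with the same words in a different order.
doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

print(bag_of_words(doc_a))
print(bag_of_words(doc_a) == bag_of_words(doc_b))
```

The two sentences mean different things, yet their bags are identical, which shows what information this representation throws away.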
To see a small example of LSA, take a look at the next section.

A Small Example
As a small example, I searched for books using the word “investing” and took the top 10 book titles that appeared. One of the titles was dropped because it had only one index word in common with the other titles. An index word is any word that:
appears in 2 or more titles, and
is not a very common word such as “and”, “the”, and so on (known as stop words). These words are not included because they do not contribute much (if any) meaning.
In this example we have removed the following stop words: “and”, “edition”, “for”, “in”, “little”, “of”, “the”, “to”.
Here are the 9 remaining titles. The index words (words that appear in 2 or more titles and are not stop words) are underlined.
1. The Neatest Little Guide to Stock Market Investing
2. Investing For Dummies, 4th Edition
3. The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
4. The Little Book of Value Investing
5. Value Investing: From Graham to Buffett and Beyond
6. Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
7. Investing in Real Estate, 5th Edition
8. Stock Investing For Dummies
9. Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss
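The index-word rule can be checked mechanically against these titles. The sketch below is one possible way to do it; the tokenization (lowercasing, and dropping apostrophes so that "Dad's" becomes "dads") and the names `tokenize`, `doc_freq`, and `matrix` are assumptions, not something the text specifies.

```python
import re
from collections import Counter

TITLES = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to "
    "Guarantee Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, "
    "That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: "
    "The Secrets of Finding Hidden Profits Most Investors Miss",
]
STOP_WORDS = {"and", "edition", "for", "in", "little", "of", "the", "to"}

def tokenize(title):
    # Assumed rule: lowercase, remove apostrophes, keep alphanumeric runs.
    return re.findall(r"[a-z0-9]+", title.lower().replace("'", ""))

# One bag of words (word -> count) per title.
bags = [Counter(tokenize(t)) for t in TITLES]

# Document frequency: in how many titles does each word appear?
doc_freq = Counter(word for bag in bags for word in bag)

# Index words: in 2 or more titles and not a stop word.
index_words = sorted(
    w for w, n in doc_freq.items() if n >= 2 and w not in STOP_WORDS
)

# Word-by-title count matrix: one row per index word, one column per title.
matrix = {w: [bag[w] for bag in bags] for w in index_words}

for w in index_words:
    print(f"{w:<10}", *matrix[w])
```

Under these tokenization assumptions the rule selects 11 index words. Note that "rich" occurs twice in title 6, so the matrix holds counts rather than simple presence flags.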
Once Latent Semantic Analysis has been run on this example, we can plot the index words and titles on an XY graph and identify clusters of titles. The 9 titles are plotted with blue circles and the 11 index words are plotted with red squares. Not only can we spot clusters of titles, but since index words can be plotted along with titles, we can label the clusters. For example, the blue cluster, containing titles T7 and T9, is about real estate. The green cluster, with titles T2, T4, T5, and T8, is about value investing, and finally the red cluster, with titles T1 and T3, is about the stock market. The T6 title is an outlier, off on its own.
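This section does not say how the plotted coordinates are obtained. LSA is commonly carried out with a singular value decomposition (SVD) of the word-by-title count matrix, so the sketch below assumes that approach and uses NumPy. The matrix is the count of each of the 11 index words in each of the 9 titles, with possessives normalized (an assumption). The choice of `k = 3` dimensions and the scaling of coordinates by the singular values are also assumptions, and which two dimensions the original graph uses is not stated here.

```python
import numpy as np

WORDS = ["book", "dads", "dummies", "estate", "guide", "investing",
         "market", "real", "rich", "stock", "value"]

# Rows: index words (in WORDS order). Columns: titles T1..T9.
A = np.array([
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # book
    [0, 0, 0, 0, 0, 1, 0, 0, 1],  # dads
    [0, 1, 0, 0, 0, 0, 0, 1, 0],  # dummies
    [0, 0, 0, 0, 0, 0, 1, 0, 1],  # estate
    [1, 0, 0, 0, 0, 1, 0, 0, 0],  # guide
    [1, 1, 1, 1, 1, 1, 1, 1, 1],  # investing
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # market
    [0, 0, 0, 0, 0, 0, 1, 0, 1],  # real
    [0, 0, 0, 0, 0, 2, 0, 0, 1],  # rich (appears twice in T6)
    [1, 0, 1, 0, 0, 0, 0, 1, 0],  # stock
    [0, 0, 0, 1, 1, 0, 0, 0, 0],  # value
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3  # assumed number of "concept" dimensions to keep
word_coords = U[:, :k] * s[:k]       # one row per index word
title_coords = Vt[:k, :].T * s[:k]   # one row per title

for word, coords in zip(WORDS, word_coords):
    print(f"{word:<10}", np.round(coords, 2))
for i, coords in enumerate(title_coords, start=1):
    print(f"T{i:<9}", np.round(coords, 2))
```

Any two of the kept columns can be used as X and Y for a scatter plot of words and titles together. The signs of singular vectors are arbitrary, so a plot made this way may be mirrored relative to another implementation without changing the clusters.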
