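Introduction to the Amazon Review Data Dataset
The Amazon Review Data dataset records user reviews of products on the Amazon website and is a classic dataset for recommender systems. It has been updated over time and, in chronological order, is available in three versions:
2013 version
2014 version
If that link redirects straight to the 2018 version, use the alternate address instead
2018 version
The Amazon dataset is split by product category into sub-datasets such as Books, Electronics, Movies and TV, and CDs and Vinyl. Each sub-dataset contains two kinds of information:
Taking the 2014 version as an example: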
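1. Product metadata
asin: product ID
title: product name
price: price
imUrl: URL of the product image
related: related products
salesRank: sales rank information
brand: brand
categories: list of categories the product belongs to
Official example: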
{
"asin": "0000031852",
"title": "Girls Ballet Tutu Zebra Hot Pink",
"price": 3.17,
"imUrl": "/images/I/51fAmVkTbyL._SY300_.jpg",
"related":
{
"also_bought": ["B00JHONN1S", "B002BZX8Z6"],
"also_viewed": ["B002BZX8Z6", "B00JHONN1S"],
"bought_together": ["B002BZX8Z6"]
},
"salesRank": {"Toys & Games": 211836},
"brand": "Coxlures",
"categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
}
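The nested fields in this record (related, salesRank, categories) do not map directly onto flat CSV columns. The following is a minimal sketch of one way to flatten them, assuming only the fields shown in the example above; the helper name flatten_meta and the output column names are hypothetical, not part of the dataset.

```python
# Sample record copied from the official example above.
record = {
    "asin": "0000031852",
    "title": "Girls Ballet Tutu Zebra Hot Pink",
    "price": 3.17,
    "related": {
        "also_bought": ["B00JHONN1S", "B002BZX8Z6"],
        "also_viewed": ["B002BZX8Z6", "B00JHONN1S"],
        "bought_together": ["B002BZX8Z6"],
    },
    "salesRank": {"Toys & Games": 211836},
    "brand": "Coxlures",
    "categories": [["Sports & Outdoors", "Other Sports", "Dance"]],
}


def flatten_meta(rec):
    """Flatten one metadata record into scalar columns (hypothetical helper)."""
    flat = {"asin": rec.get("asin"), "price": rec.get("price")}

    # categories is a list of category paths; keep the first path as one string.
    categories = rec.get("categories") or [[]]
    flat["category_path"] = " > ".join(categories[0])

    # salesRank maps a category name to a rank; take the first pair if present.
    sales_rank = rec.get("salesRank") or {}
    rank_category, rank = next(iter(sales_rank.items()), (None, None))
    flat["rank_category"] = rank_category
    flat["rank"] = rank

    # Reduce the related lists to simple counts.
    related = rec.get("related") or {}
    flat["n_also_bought"] = len(related.get("also_bought", []))
    return flat


flat = flatten_meta(record)
```

Fields that are missing from a record fall back to None, an empty string, or 0, so the helper does not raise on partial records.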
2. User rating records
reviewerID: user ID
asin: product ID
reviewerName: user name
helpful: helpfulness rating of the review, e.g. 2/3
reviewText: text of the review
overall: rating
summary: summary of the review
unixReviewTime: review time as a Unix timestamp
reviewTime: review time (raw string)
{
"reviewerID": "A2SUAM1J3GNN3B",
"asin": "0000013714",
"reviewerName": "J. McDonald",
"helpful": [2, 3],
"reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing the old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
"overall": 5.0,
"summary": "Heavenly Highway Hymns",
"unixReviewTime": 1252800000,
"reviewTime": "09 13, 2009"
}
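Two of these fields usually need a small transformation before analysis: helpful is a two-element list rather than a ratio, and unixReviewTime is a raw timestamp. The sketch below shows one way to derive both, using values from the example record; the helper name helpful_ratio is hypothetical, and interpreting the timestamp as UTC is an assumption.

```python
from datetime import datetime, timezone

# Values copied from the example record above.
review = {"helpful": [2, 3], "overall": 5.0, "unixReviewTime": 1252800000}


def helpful_ratio(helpful):
    """Return helpful votes / total votes, or None when nobody voted."""
    up_votes, total_votes = helpful
    return up_votes / total_votes if total_votes else None


ratio = helpful_ratio(review["helpful"])

# Interpret the timestamp as UTC (an assumption) and keep only the date part.
review_date = datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc).date()
```

For the example record, review_date is 2009-09-13, which agrees with the reviewTime string "09 13, 2009".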
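Reading the Amazon dataset:
The downloaded data are JSON files, which are awkward to work with directly, so this section shows how to convert them to CSV. The 2014 Amazon Electronics dataset is used as the example:
Reading the product metadata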
import ast
import pandas as pd

file_path = 'meta_Electronics.json'
useless_col = ['imUrl', 'salesRank', 'related', 'title', 'description']  # fields to drop

rows = {}
with open(file_path, 'r') as fin:
    for i, line in enumerate(fin):
        # Each line is a Python dict literal; ast.literal_eval parses it
        # like eval but without executing arbitrary code.
        d = ast.literal_eval(line)
        for s in useless_col:
            d.pop(s, None)  # drop the field if present
        rows[i] = d

df = pd.DataFrame.from_dict(rows, orient='index')
df.to_csv('meta_Electronics.csv', index=False)
Reading the user rating records
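import ast
import pandas as pd

file_path = 'Electronics_10.json'
useless_col = ['reviewerName', 'reviewText', 'unixReviewTime', 'summary']  # fields to drop

rows = {}
with open(file_path, 'r') as fin:
    for i, line in enumerate(fin):
        # One review per line; parse it without executing arbitrary code.
        d = ast.literal_eval(line)
        for s in useless_col:
            d.pop(s, None)  # drop the field if present
        rows[i] = d

df = pd.DataFrame.from_dict(rows, orient='index')
df.to_csv('Electronics_10.csv', index=False)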
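Both conversions above hold every record in memory before writing. For larger categories, a streaming variant may be more practical. The sketch below is one possible alternative, not the method used above: it uses the standard csv module instead of pandas, and it takes a list of columns to keep rather than a list to drop. The function name records_to_csv and its parameters are hypothetical.

```python
import ast
import csv


def records_to_csv(src_path, dst_path, keep_cols):
    """Stream one-record-per-line data into a CSV with only keep_cols.

    Returns the number of records written. Fields missing from a record
    are written as empty strings; fields not in keep_cols are ignored.
    """
    written = 0
    with open(src_path, "r", encoding="utf-8") as fin, \
            open(dst_path, "w", newline="", encoding="utf-8") as fout:
        writer = csv.DictWriter(
            fout, fieldnames=keep_cols, extrasaction="ignore", restval=""
        )
        writer.writeheader()
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            writer.writerow(ast.literal_eval(line))
            written += 1
    return written
```

A call such as records_to_csv('Electronics_10.json', 'Electronics_10.csv', ['reviewerID', 'asin', 'helpful', 'overall', 'reviewTime']) would keep the same columns as the rating conversion above, assuming the input file has the fields listed earlier.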