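Introduction to the Amazon Review Data Dataset
The Amazon Review Data dataset records user reviews of products on the Amazon website and is a classic dataset for recommender systems. Amazon keeps updating it, and by release date it comes in three versions: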
2013 version
2014 version
If the link redirects straight to the 2018 version, visit this address instead:
2018 version
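By product category, the Amazon dataset is split into sub-datasets such as Books, Electronics, Movies and TV, and CDs and Vinyl. Each sub-dataset contains two kinds of information.
Taking the 2014 version as an example: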
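1. Product metadata
asin: product ID
title: product name
price: price
imUrl: URL of the product image
related: related products
salesRank: sales rank information
brand: brand
categories: list of categories the product belongs to
Official example: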
{
"asin": "0000031852",
"title": "Girls Ballet Tutu Zebra Hot Pink",
"price": 3.17,
"imUrl": "/images/I/51fAmVkTbyL._SY300_.jpg",
"related":
{
"also_bought": ["B00JHONN1S", "B002BZX8Z6"],
"also_viewed": ["B002BZX8Z6", "B00JHONN1S"],
"bought_together": ["B002BZX8Z6"]
},
"salesRank": {"Toys & Games": 211836},
"brand": "Coxlures",
"categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
}
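To make the nesting of `related` and `categories` concrete, here is a minimal sketch that parses the sample record above and pulls out a few flat values. The helper name `summarize_product` and the choice of fields are assumptions made for illustration; they are not part of the dataset or of any official tooling.

```python
import json

# The official sample record shown above, as a JSON string.
record_text = '''
{
  "asin": "0000031852",
  "title": "Girls Ballet Tutu Zebra Hot Pink",
  "price": 3.17,
  "imUrl": "/images/I/51fAmVkTbyL._SY300_.jpg",
  "related": {
    "also_bought": ["B00JHONN1S", "B002BZX8Z6"],
    "also_viewed": ["B002BZX8Z6", "B00JHONN1S"],
    "bought_together": ["B002BZX8Z6"]
  },
  "salesRank": {"Toys & Games": 211836},
  "brand": "Coxlures",
  "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
}
'''


def summarize_product(record):
    """Flatten one metadata record into a small dict (hypothetical helper)."""
    related = record.get("related", {})
    # "categories" is a list of category paths; use the first path if present.
    categories = record.get("categories") or [[]]
    first_path = categories[0]
    return {
        "asin": record["asin"],
        "price": record.get("price"),
        "brand": record.get("brand"),
        "leaf_category": first_path[-1] if first_path else None,
        "num_also_bought": len(related.get("also_bought", [])),
    }


product = summarize_product(json.loads(record_text))
print(product)
```

Every field except `asin` is read with `.get`, on the assumption that a record may omit some fields, which is also why the conversion scripts later check `if s in d` before dropping a column.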
2. User rating records
reviewerID: user ID
asin: product ID
reviewerName: user name
helpful: helpfulness rating of the review, e.g. 2/3
reviewText: text of the review
overall: rating
summary: summary of the review
unixReviewTime: time of the review (Unix timestamp)
reviewTime: time of the review (raw date string)
Official example:
{
"reviewerID": "A2SUAM1J3GNN3B",
"asin": "0000013714",
"reviewerName": "J. McDonald",
"helpful": [2, 3],
"reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing the old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
"overall": 5.0,
"summary": "Heavenly Highway Hymns",
"unixReviewTime": 1252800000,
"reviewTime": "09 13, 2009"
}
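The `helpful`, `unixReviewTime` and `reviewTime` fields are easiest to understand by decoding them. The sketch below does so for the sample review above; the helper names `helpful_ratio` and `review_date` are hypothetical, and reading `[2, 3]` as "2 helpful votes out of 3" follows the "2/3" description in the field list.

```python
from datetime import datetime, timezone

# A subset of the sample review shown above.
review = {
    "reviewerID": "A2SUAM1J3GNN3B",
    "asin": "0000013714",
    "helpful": [2, 3],
    "overall": 5.0,
    "unixReviewTime": 1252800000,
    "reviewTime": "09 13, 2009",
}


def helpful_ratio(helpful):
    """Return helpful votes / total votes, or None when nobody voted."""
    up, total = helpful
    return up / total if total else None


def review_date(unix_time):
    """Convert a Unix timestamp in seconds to a UTC calendar date."""
    return datetime.fromtimestamp(unix_time, tz=timezone.utc).date()


ratio = helpful_ratio(review["helpful"])
day = review_date(review["unixReviewTime"])
# reviewTime carries the same date as text, in "month day, year" form.
parsed = datetime.strptime(review["reviewTime"], "%m %d, %Y").date()
print(ratio, day.isoformat(), parsed == day)
```

Because the two time fields encode the same date, keeping only one of them when converting to CSV loses no information.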
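Reading the Amazon dataset:
The downloaded data comes as JSON files with one record per line, which are awkward to work with directly. This section shows how to convert them to CSV, using the 2014 Amazon Electronics dataset as the example:
Reading the product metadata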
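import ast
import pandas as pd

file_path = 'meta_Electronics.json'
useless_col = ['imUrl', 'salesRank', 'related', 'title', 'description']  # fields we do not need

rows = {}
with open(file_path, 'r') as fin:
    for i, line in enumerate(fin):
        # Each line holds one record written as a dict literal;
        # ast.literal_eval parses it without executing arbitrary code.
        d = ast.literal_eval(line)
        for s in useless_col:
            if s in d:
                d.pop(s)
        rows[i] = d

df = pd.DataFrame.from_dict(rows, orient='index')
df.to_csv('meta_Electronics.csv', index=False)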
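Reading the user rating records
import ast
import pandas as pd

file_path = 'Electronics_10.json'
useless_col = ['reviewerName', 'reviewText', 'unixReviewTime', 'summary']  # fields we do not need

rows = {}
with open(file_path, 'r') as fin:
    for i, line in enumerate(fin):
        # One review per line; parse it safely instead of calling eval.
        d = ast.literal_eval(line)
        for s in useless_col:
            if s in d:
                d.pop(s)
        rows[i] = d

df = pd.DataFrame.from_dict(rows, orient='index')
df.to_csv('Electronics_10.csv', index=False)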
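The two conversions differ only in the file names and the list of dropped fields, so they can be folded into one function. The following is a minimal sketch under that assumption; `json_lines_to_csv` is a hypothetical name, and the demo runs on a one-line throwaway file built from the sample review rather than on the real dataset.

```python
import ast
import os
import tempfile

import pandas as pd


def json_lines_to_csv(json_path, csv_path, drop_cols):
    """Convert a one-record-per-line file to CSV, dropping unwanted fields."""
    rows = []
    with open(json_path, "r") as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = ast.literal_eval(line)
            rows.append({k: v for k, v in record.items() if k not in drop_cols})
    df = pd.DataFrame(rows)
    df.to_csv(csv_path, index=False)
    return df


# Demo on a tiny temporary file (not the real Electronics data).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.json")
    dst = os.path.join(tmp, "sample.csv")
    with open(src, "w") as f:
        f.write('{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
                '"overall": 5.0, "summary": "Heavenly Highway Hymns"}\n')
    df = json_lines_to_csv(src, dst, drop_cols={"summary"})
    # Read asin back as text so IDs with leading zeros stay intact.
    round_trip = pd.read_csv(dst, dtype={"asin": str})

print(round_trip)
```

When loading the resulting CSV, passing `dtype={"asin": str}` to `pd.read_csv` is worth considering: without it, an ID such as `0000013714` would be read as the integer 13714.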