Graph Dataset Compilation
Preface
A write-up to organize things for myself.
Main references:
the …-benchmark repository
3._benchmark
Node classification tasks
Cora, Citeseer,
Amazon's two subsets.
Co-author's two subsets.
PPI
Reddit
SBM (Symmetric Stochastic Block Model Mixture dataset)
Wiki-CS
Wikipedia
1. Citation networks (LBC Project)
The LBC project
covers three major datasets:
Cora
Citeseer
WebKB. This last one is a composite dataset.
The WebKB dataset consists of 877 scientific publications classified into one of five classes. The citation network consists of 1608 links.
Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1703 unique words. The README file in the dataset provides more details.
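Since Cora, Citeseer, and WebKB all ship in this {.content, .cites} layout, here is a minimal parsing sketch based only on the format described above (paths and the helper name are hypothetical):

```python
import numpy as np

def load_citation_dataset(content_path, cites_path):
    """Parse the {.content, .cites} pair described in the README above."""
    ids, features, labels = [], [], []
    with open(content_path) as f:
        for line in f:
            # <paper_id> <word_attribute_1> ... <word_attribute_N> <class_label>
            parts = line.strip().split()
            ids.append(parts[0])
            features.append([int(x) for x in parts[1:-1]])
            labels.append(parts[-1])
    id2idx = {pid: i for i, pid in enumerate(ids)}
    edges = []
    with open(cites_path) as f:
        for line in f:
            src, dst = line.strip().split()
            if src in id2idx and dst in id2idx:  # skip dangling citations
                edges.append((id2idx[src], id2idx[dst]))
    return np.array(features), labels, edges
```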
1.1 WebKB
Three subsets are commonly used:
Cornell, Texas, and Wisconsin.
The raw files come as a {.cites, .content} pair, with 5 classes.
They need preprocessing.
I don't know yet what the standard split is.
In Geom-GCN they are used as disassortative graphs.
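For quick experiments, torch_geometric already wraps these three subsets; a minimal sketch using PyG's WebKB class (the root path is an arbitrary cache directory):

```python
from torch_geometric.datasets import WebKB

# name can be 'Cornell', 'Texas' or 'Wisconsin'
dataset = WebKB(root='data/webkb', name='Cornell')
data = dataset[0]
print(data)                 # Data(x=..., edge_index=..., y=...)
print(dataset.num_classes)  # 5
```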
2. Amazon
Not the masked-warrior Amazon (kidding).
Also known as AmazonCoBuy, because an edge means the two product nodes were bought together.
Two subsets are commonly used: Computers and Photo.
What serves as today's GNN benchmark is a dataset that later researchers re-collected from it.
The re-collection comes from the paper behind the shchur/gnn-benchmark repo (see the npz links in the table below).
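A minimal loading sketch via torch_geometric's Amazon class, which downloads exactly these re-collected npz files (root path arbitrary):

```python
from torch_geometric.datasets import Amazon

computers = Amazon(root='data/amazon', name='Computers')
photo = Amazon(root='data/amazon', name='Photo')

data = computers[0]
print(data.num_nodes, data.num_edges)
# Note: no official train/val/test split ships with these files,
# so you have to sample your own.
```

The Coauthor class in the next section works the same way, with name='CS' or name='Physics'.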
3. Co-author
4. WikiCS
Paper: Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks
It contains 3 files, 120 MB after decompression:
data.json
metadata.json
statistics.json
The official release provides masks for 20 seeds.
print(np.array(train_masks).shape)
>(20, 11701)
The 11701 nodes are split as follows:
print(np.sum(train_masks)/20)
print(np.sum(val_masks)/20)
print(np.sum(stopping_masks)/20)
print(np.sum(test_mask))
>580 train
>1769 valid
>3505 stopping
>5847 test
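A self-contained version of the checks above; the key names in data.json (train_masks, val_masks, stopping_masks, test_mask) are assumptions based on the printed variable names:

```python
import json
import numpy as np

with open('data.json') as f:  # from the extracted Wiki-CS archive
    data = json.load(f)

train_masks = np.array(data['train_masks'])        # key names assumed
val_masks = np.array(data['val_masks'])
stopping_masks = np.array(data['stopping_masks'])
test_mask = np.array(data['test_mask'])

print(train_masks.shape)          # (20, 11701): 20 seeds x 11701 nodes
print(train_masks.sum() / 20)     # 580 train nodes per seed
print(val_masks.sum() / 20)       # 1769
print(stopping_masks.sum() / 20)  # 3505
print(test_mask.sum())            # 5847, shared by all seeds
```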
5. Reddit
6. Actor co-occurrence network (Actor)
Usually appears under the name Actor. It's quite old.
This dataset is the actor-only induced subgraph of the film-director-actor-writer network (Tang et al., 2009).
Jie Tang, Jimeng Sun, Chi Wang, and Zi Yang. Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 807-816. ACM, 2009.
7. Wikipedia network
Three subsets are commonly used: chameleon, squirrel, and crocodile.
The regression task is to predict a page's average monthly traffic.
As the original description puts it:
The data was collected from the English Wikipedia (December 2018). The datasets represent page-page networks on specific topics (chameleons, crocodiles and squirrels). Nodes represent articles and edges are mutual links between them. The edges csv files contain the edges - nodes are indexed from 0. The features json files contain the features of articles - each key is a page id, and node features are given as lists. The presence of a feature in the feature list means that an informative noun appeared in the text of the Wikipedia article. The target csv contains the node identifiers and the average monthly traffic between October 2017 and November 2018 for each page. For each page-page network we listed the number of nodes and edges with some other descriptive statistics.
The Geom-GCN classification version
Wikipedia itself has no class labels.
But in Geom-GCN, the traffic was binned into five categories:
the average monthly traffic of the web page is converted into five categories to predict.
So it became a classification task.
Moreover, Geom-GCN only processed chameleon and squirrel;
"crocodile" is not available.
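Geom-GCN's exact bin edges aren't given here, so as an illustration, one plausible way to turn traffic into five classes is quantile binning (file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical file/column names; the raw release has a target csv
# mapping each page id to its average monthly traffic.
df = pd.read_csv('chameleon_target.csv')
# Quantile binning gives five roughly balanced classes; Geom-GCN's
# actual bin edges may differ.
df['label'] = pd.qcut(df['target'], q=5, labels=False, duplicates='drop')
print(df['label'].value_counts())
```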
dgl has not included this dataset.
torch_geometric has it, with a switch controlling whether the classification version is used.
The raw format is edges.csv plus features.json.
After being processed into a classification task, the files become
"out1_…" and "out1_node_…".
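In torch_geometric this is the WikipediaNetwork class, where the geom_gcn_preprocess flag toggles the five-class version; consistent with the note above, crocodile is only available with the flag off:

```python
from torch_geometric.datasets import WikipediaNetwork

# geom_gcn_preprocess=True -> the five-class Geom-GCN version;
# only 'chameleon' and 'squirrel' are allowed then. 'crocodile'
# requires geom_gcn_preprocess=False (the raw regression data).
chameleon = WikipediaNetwork(root='data/wiki', name='chameleon',
                             geom_gcn_preprocess=True)
data = chameleon[0]
print(data.y[:10])  # integer class labels after binning
```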
| Dataset | Nodes | Edges | Features | Classes | Link |
| --- | --- | --- | --- | --- | --- |
| Wiki-CS | 11,701 | | | | /pmernyei/wiki-cs-dataset/raw/master/dataset |
| Amazon-Computers | 13,… | | | | /shchur/gnn-benchmark/raw/master/data/npz/amazon_electronics_computers.npz |
| Amazon-Photo | 7,… | | | | /shchur/gnn-benchmark/raw/master/data/npz/amazon_electronics_photo.npz |
| Coauthor-CS | 18,333 | 81,… | | | /shchur/gnn-benchmark/raw/master/data/npz/ms_academic_cs.npz |
| Coauthor-Physics | 34,493 | 247,… | | | /shchur/gnn-benchmark/raw/master/data/npz/ms_academic_phy.npz |
| Wikipedia-Chameleon | 2,277 | 36,101 | 2,325 | 5 | |
| Wikipedia-Squirrel | 5,201 | 217,073 | 2,089 | 5 | |
| | Chameleon | Crocodile | Squirrel |
| --- | --- | --- | --- |
| Nodes | 2,277 | 11,631 | 5,201 |
| Edges | 31,421 | 170,918 | 198,493 |
| Density | 0.012 | 0.003 | 0.015 |
| Transitivity | 0.314 | 0.026 | 0.348 |
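For reference, the Density and Transitivity rows can be recomputed with networkx from the raw edge list; file and column names here are assumptions based on the quoted description (one edges csv per topic, nodes indexed from 0):

```python
import networkx as nx
import pandas as pd

edges = pd.read_csv('chameleon_edges.csv')  # assumed file name
G = nx.from_pandas_edgelist(edges, source='id1', target='id2')  # assumed columns

print(nx.density(G))       # 2|E| / (|V| * (|V| - 1)) for undirected graphs
print(nx.transitivity(G))  # 3 * triangles / connected triples
```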
Notes
geom-gcn's new_data folder
contains:
WebKB's Cornell, Texas, and Wisconsin;
what should in theory be the regression-task Wikipedia, i.e.
chameleon and squirrel;
and film, which comes from who knows where; it may be the Actor co-occurrence dataset from section 6. (The paper never uses the name film, so I'd suggest just deleting it.)
And it uses its very own format on top of that...
A pitfall.
HAN
Heterogeneous graph summary
Heterogeneous graphs nearly drove me crazy, because several frameworks were scrambling for market share...
1. Raw version
1.1 DBLP four area
DBLP is a bibliography website of computer science. We use a commonly used subset in 4 areas with nodes representing authors, papers, terms and venues.
The downloaded file should be called DBLP_four_area.zip (1.99 MB).
This is the rawest form of the data; the zip contains nothing but txt files.
From the statement below you can see that
this dataset was annotated by Jing Gao and Yizhou Sun.
Jing Gao, Feng Liang, Wei Fan, Yizhou Sun, Jiawei Han. "Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models". Advances in Neural Information Processing Systems (NIPS), 22, 2009, 585-593.
The annotated dataset can be downloaded from one of the authors' homepages.
Note: it needs to be distinguished from another citation network dataset also called DBLP:
Shirui Pan, Jia Wu, Xingquan Zhu, Chengqi Zhang, and Yang Wang. Tri-party deep network representation. Network, 11(9):12, 2016.
That DBLP is a homogeneous dataset.
Because DBLP is a website, any dataset anyone crawls from it gets called DBLP!
1.2 IMDB
Public data from Kaggle. The raw format is movie-metadata.csv (1.49 MB).
2. HAN version
The HAN version is a legacy left over from when jihouye rushed to grab market share while the field was just taking off.
It uses 3 datasets: ACM, DBLP, and IMDB.
See the repo:
ACM3025.mat (252M)
DBLP4057_GAT_with_idx_tra200_val_800.zip (387.2M)
IMDB5k (545.5M)
Don't ask me why the naming is this inconsistent...
Even the unpacked file layout is bizarre...
Backup link
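Given the odd layout, one way to see what's actually inside the .mat file is to just dump its keys with scipy (pure inspection; no assumptions about the contents):

```python
from scipy.io import loadmat

mat = loadmat('ACM3025.mat')
for key, value in mat.items():
    if not key.startswith('__'):  # skip MATLAB header entries
        print(key, getattr(value, 'shape', type(value)))
```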
Important note! ACM, i.e. ACM3025, is a dataset that HAN itself sampled and labeled for the first time.
We extract papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM, and VLDB and divide the papers into three classes (Database, Wireless Communication, Data Mining). Then we construct a heterogeneous graph that comprises 3025 papers (P), 5835 authors (A) and 56 subjects (S). Paper features correspond to elements of a bag-of-words representation of keywords. We employ the meta-path set {PAP, PSP} to perform experiments. Here we label the papers according to the conference they published.
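For intuition, a meta-path adjacency like PAP can be built from bipartite incidence matrices: the papers-by-authors matrix times its transpose links papers sharing an author. A toy sketch with random sparse matrices standing in for the real incidence data:

```python
import numpy as np
import scipy.sparse as sp

# Toy random incidence matrices standing in for the real data:
# rows are papers, columns are authors / subjects.
PA = sp.random(3025, 5835, density=0.001, format='csr', dtype=np.float32)
PS = sp.random(3025, 56, density=0.02, format='csr', dtype=np.float32)

# An incidence matrix times its transpose gives the meta-path adjacency:
# PAP[i, j] > 0 iff papers i and j share at least one author.
PAP = PA @ PA.T
PSP = PS @ PS.T
print(PAP.shape, PSP.shape)  # (3025, 3025) (3025, 3025)
```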
DBLP is the MAGNN paper's re-sampling of the raw files; IMDB is likewise a re-sampling of the raw files.
We extract a subset of DBLP which contains 14328 papers (P), 4057 authors (A), 20 conferences (C), 8789 terms (T). The authors are divided into four areas: database, data mining, machine learning, information retrieval. Also, we label each author's research area according to the conferences they submitted. Author features are the elements of a bag-of-words representation of keywords. Here we employ the meta-path set {APA, APCPA, APTPA} to perform experiments.
Although the HAN paper writes "We extract", in practice what's used is what they extracted.
MAGNN's processing code:
After unzipping, there's a pile of .npz and .npy files, plus a folder 0.
The folder /0/ holds the post-processing for mini-batched training.
Here we extract a subset of IMDB which contains 4780 movies (M), 5841 actors (A) and 2269 directors (D). The movies are divided into three classes (Action, Comedy, Drama) according to their genre. Movie features correspond to elements of a bag-of-words representation of plots. We employ the meta-path set {MAM, MDM} to perform experiments.
For the processing itself, see the code (I haven't read it anyway).
Besides these, HAN also uses the FreeBase dataset.
FreeBase is a huge knowledge graph. We sample a subgraph of 8 genres of entities with about 1,000,000 edges following the procedure of a previous survey.
The FreeBase dataset comes from this paper:
Carl Yang, Yuxin Xiao, Yu Zhang, Yizhou Sun, and Jiawei Han. 2020. Heterogeneous Network Representation Learning: A Unified Framework with Survey and Benchmark. TKDE (2020).
FreeBase has been pretty much ignored by everyone since.