Graph Dataset Compilation
Preface
A write-up to organize things for myself.
Main references:
the …-benchmark repository
3._benchmark
Node classification tasks
Cora, Citeseer,
Amazon's two subsets.
Co-author's two subsets.
PPI
Reddit
SBM (Symmetric Stochastic Block Model Mixture dataset)
Wiki-CS
Wikipedia
1. Citation networks (LBC Project)
The LBC project
covers three major datasets:
Cora
Citeseer
WebKB. This last one is a composite dataset.
The WebKB dataset consists of 877 scientific publications classified into one of five classes. The citation network consists of 1608 links.
Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1703 unique words. The README file in the dataset provides more details.
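Since Cora, Citeseer, and WebKB all ship in this {.content, .cites} layout, here is a minimal parsing sketch based only on the format described above (paths and the helper name are hypothetical):

```python
import numpy as np

def load_citation_dataset(content_path, cites_path):
    """Parse the {.content, .cites} pair described in the README above."""
    ids, features, labels = [], [], []
    with open(content_path) as f:
        for line in f:
            # <paper_id> <word_attribute_1> ... <word_attribute_N> <class_label>
            parts = line.strip().split()
            ids.append(parts[0])
            features.append([int(x) for x in parts[1:-1]])
            labels.append(parts[-1])
    id2idx = {pid: i for i, pid in enumerate(ids)}
    edges = []
    with open(cites_path) as f:
        for line in f:
            src, dst = line.strip().split()
            if src in id2idx and dst in id2idx:  # skip dangling citations
                edges.append((id2idx[src], id2idx[dst]))
    return np.array(features), labels, edges
```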
1.1 WebKB
Three subsets are commonly used:
Cornell, Texas, and Wisconsin.
The raw files come as a {.cites, .content} pair, with 5 classes.
They need preprocessing.
I don't know yet what the standard split is.
In Geom-GCN they are used as disassortative graphs.
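For quick experiments, torch_geometric already wraps these three subsets; a minimal sketch using PyG's WebKB class (the root path is an arbitrary cache directory):

```python
from torch_geometric.datasets import WebKB

# name can be 'Cornell', 'Texas' or 'Wisconsin'
dataset = WebKB(root='data/webkb', name='Cornell')
data = dataset[0]
print(data)                 # Data(x=..., edge_index=..., y=...)
print(dataset.num_classes)  # 5
```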
2. Amazon
Not the masked-warrior Amazon (kidding).
Also known as AmazonCoBuy, because an edge means the two product nodes were bought together.
Two subsets are commonly used: Computers and Photo.
What serves as today's GNN benchmark is a dataset that later researchers re-collected from it.
The re-collection comes from the paper behind the shchur/gnn-benchmark repo (see the npz links in the table below).
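A minimal loading sketch via torch_geometric's Amazon class, which downloads exactly these re-collected npz files (root path arbitrary):

```python
from torch_geometric.datasets import Amazon

computers = Amazon(root='data/amazon', name='Computers')
photo = Amazon(root='data/amazon', name='Photo')

data = computers[0]
print(data.num_nodes, data.num_edges)
# Note: no official train/val/test split ships with these files,
# so you have to sample your own.
```

The Coauthor class in the next section works the same way, with name='CS' or name='Physics'.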
3. Co-author
4. WikiCS
Paper: Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks
It contains 3 files, 120 MB after decompression:
data.json
metadata.json
statistics.json
The official release provides masks for 20 seeds.
print(np.array(train_masks).shape)
>(20, 11701)
The 11701 nodes are split as follows:
print(np.sum(train_masks)/20)
print(np.sum(val_masks)/20)
print(np.sum(stopping_masks)/20)
print(np.sum(test_mask))
>580 train
>1769 valid
>3505 stopping
>5847 test
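A self-contained version of the checks above; the key names in data.json (train_masks, val_masks, stopping_masks, test_mask) are assumptions based on the printed variable names:

```python
import json
import numpy as np

with open('data.json') as f:  # from the extracted Wiki-CS archive
    data = json.load(f)

train_masks = np.array(data['train_masks'])        # key names assumed
val_masks = np.array(data['val_masks'])
stopping_masks = np.array(data['stopping_masks'])
test_mask = np.array(data['test_mask'])

print(train_masks.shape)          # (20, 11701): 20 seeds x 11701 nodes
print(train_masks.sum() / 20)     # 580 train nodes per seed
print(val_masks.sum() / 20)       # 1769
print(stopping_masks.sum() / 20)  # 3505
print(test_mask.sum())            # 5847, shared by all seeds
```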
5. Reddit
6. Actor co-occurrence network (Actor)
Usually appears under the name Actor. It's quite old.
This dataset is the actor-only induced subgraph of the film-director-actor-writer network (Tang et al., 2009).
Jie Tang, Jimeng Sun, Chi Wang, and Zi Yang. Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 807-816. ACM, 2009.
7. Wikipedia network
Three subsets are commonly used: chameleon, squirrel, and crocodile.
The regression task is to predict a page's average monthly traffic.
As the original description puts it:
The data was collected from the English Wikipedia (December 2018). The datasets represent page-page networks on specific topics (chameleons, crocodiles and squirrels). Nodes represent articles and edges are mutual links between them. The edges csv files contain the edges - nodes are indexed from 0. The features json files contain the features of articles - each key is a page id, and node features are given as lists. The presence of a feature in the feature list means that an informative noun appeared in the text of the Wikipedia article. The target csv contains the node identifiers and the average monthly traffic between October 2017 and November 2018 for each page. For each page-page network we listed the number of nodes and edges with some other descriptive statistics.
The Geom-GCN classification version
Wikipedia itself has no class labels.
But in Geom-GCN, the traffic was binned into five categories:
the average monthly traffic of the web page is converted into five categories to predict.
So it became a classification task.
Moreover, Geom-GCN only processed chameleon and squirrel;
"crocodile" is not available.
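Geom-GCN's exact bin edges aren't given here, so as an illustration, one plausible way to turn traffic into five classes is quantile binning (file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical file/column names; the raw release has a target csv
# mapping each page id to its average monthly traffic.
df = pd.read_csv('chameleon_target.csv')
# Quantile binning gives five roughly balanced classes; Geom-GCN's
# actual bin edges may differ.
df['label'] = pd.qcut(df['target'], q=5, labels=False, duplicates='drop')
print(df['label'].value_counts())
```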
dgl has not included this dataset.
torch_geometric has it, with a switch controlling whether the classification version is used.
The raw format is edges.csv plus features.json.
After being processed into a classification task, the files become
"out1_…" and "out1_node_…".
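In torch_geometric this is the WikipediaNetwork class, where the geom_gcn_preprocess flag toggles the five-class version; consistent with the note above, crocodile is only available with the flag off:

```python
from torch_geometric.datasets import WikipediaNetwork

# geom_gcn_preprocess=True -> the five-class Geom-GCN version;
# only 'chameleon' and 'squirrel' are allowed then. 'crocodile'
# requires geom_gcn_preprocess=False (the raw regression data).
chameleon = WikipediaNetwork(root='data/wiki', name='chameleon',
                             geom_gcn_preprocess=True)
data = chameleon[0]
print(data.y[:10])  # integer class labels after binning
```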
| Dataset | Nodes | Edges | Features | Classes | Link |
| --- | --- | --- | --- | --- | --- |
| Wiki-CS | 11,701 | | | | /pmernyei/wiki-cs-dataset/raw/master/dataset |
| Amazon-Computers | 13,… | | | | /shchur/gnn-benchmark/raw/master/data/npz/amazon_electronics_computers.npz |
| Amazon-Photo | 7,… | | | | /shchur/gnn-benchmark/raw/master/data/npz/amazon_electronics_photo.npz |
| Coauthor-CS | 18,333 | 81,… | | | /shchur/gnn-benchmark/raw/master/data/npz/ms_academic_cs.npz |
| Coauthor-Physics | 34,493 | 247,… | | | /shchur/gnn-benchmark/raw/master/data/npz/ms_academic_phy.npz |
| Wikipedia-Chameleon | 2,277 | 36,101 | 2,325 | 5 | |
| Wikipedia-Squirrel | 5,201 | 217,073 | 2,089 | 5 | |
| | Chameleon | Crocodile | Squirrel |
| --- | --- | --- | --- |
| Nodes | 2,277 | 11,631 | 5,201 |
| Edges | 31,421 | 170,918 | 198,493 |
| Density | 0.012 | 0.003 | 0.015 |
| Transitivity | 0.314 | 0.026 | 0.348 |
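For reference, the Density and Transitivity rows can be recomputed with networkx from the raw edge list; file and column names here are assumptions based on the quoted description (one edges csv per topic, nodes indexed from 0):

```python
import networkx as nx
import pandas as pd

edges = pd.read_csv('chameleon_edges.csv')  # assumed file name
G = nx.from_pandas_edgelist(edges, source='id1', target='id2')  # assumed columns

print(nx.density(G))       # 2|E| / (|V| * (|V| - 1)) for undirected graphs
print(nx.transitivity(G))  # 3 * triangles / connected triples
```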
Notes
geom-gcn's new_data folder
contains:
WebKB's Cornell, Texas, and Wisconsin;
what should in theory be the regression-task Wikipedia, i.e.
chameleon and squirrel;
and film, which comes from who knows where; it may be the Actor co-occurrence dataset from section 6. (The paper never uses the name film, so I'd suggest just deleting it.)
And it uses its very own format on top of that...
A pitfall.
HAN
Heterogeneous graph summary
Heterogeneous graphs nearly drove me crazy, because several frameworks were scrambling for market share...
1. Raw version
1.1 DBLP four area
DBLP is a bibliography website of computer science. We use a commonly used subset in 4 areas with nodes representing authors, papers, terms and venues.
The downloaded file should be called DBLP_four_area.zip (1.99 MB).
This is the rawest form of the data; the zip contains nothing but txt files.
From the statement below you can see that
this dataset was annotated by Jing Gao and Yizhou Sun.
Jing Gao, Feng Liang, Wei Fan, Yizhou Sun, Jiawei Han. "Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models". Advances in Neural Information Processing Systems (NIPS), 22, 2009, 585-593.
The annotated dataset can be downloaded from one of the authors' homepages.
Note: it needs to be distinguished from another citation network dataset also called DBLP:
Shirui Pan, Jia Wu, Xingquan Zhu, Chengqi Zhang, and Yang Wang. Tri-party deep network representation. Network, 11(9):12, 2016.
That DBLP is a homogeneous dataset.
Because DBLP is a website, any dataset anyone crawls from it gets called DBLP!
1.2 IMDB
Public data from Kaggle. The raw format is movie-metadata.csv (1.49 MB).
2. HAN version
The HAN version is a legacy left over from when jihouye rushed to grab market share while the field was just taking off.
It uses 3 datasets: ACM, DBLP, and IMDB.
See the repo:
ACM3025.mat (252M)
DBLP4057_GAT_with_idx_tra200_val_800.zip (387.2M)
IMDB5k (545.5M)
Don't ask me why the naming is this inconsistent...
Even the unpacked file layout is bizarre...
Backup link
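Given the odd layout, one way to see what's actually inside the .mat file is to just dump its keys with scipy (pure inspection; no assumptions about the contents):

```python
from scipy.io import loadmat

mat = loadmat('ACM3025.mat')
for key, value in mat.items():
    if not key.startswith('__'):  # skip MATLAB header entries
        print(key, getattr(value, 'shape', type(value)))
```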
Important note! ACM, i.e. ACM3025, is a dataset that HAN itself sampled and labeled for the first time.
We extract papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM, and VLDB and divide the papers into three classes (Database, Wireless Communication, Data Mining). Then we construct a heterogeneous graph that comprises 3025 papers (P), 5835 authors (A) and 56 subjects (S). Paper features correspond to elements of a bag-of-words representation of keywords. We employ the meta-path set {PAP, PSP} to perform experiments. Here we label the papers according to the conference they published.
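For intuition, a meta-path adjacency like PAP can be built from bipartite incidence matrices: the papers-by-authors matrix times its transpose links papers sharing an author. A toy sketch with random sparse matrices standing in for the real incidence data:

```python
import numpy as np
import scipy.sparse as sp

# Toy random incidence matrices standing in for the real data:
# rows are papers, columns are authors / subjects.
PA = sp.random(3025, 5835, density=0.001, format='csr', dtype=np.float32)
PS = sp.random(3025, 56, density=0.02, format='csr', dtype=np.float32)

# An incidence matrix times its transpose gives the meta-path adjacency:
# PAP[i, j] > 0 iff papers i and j share at least one author.
PAP = PA @ PA.T
PSP = PS @ PS.T
print(PAP.shape, PSP.shape)  # (3025, 3025) (3025, 3025)
```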
DBLP is the MAGNN paper's re-sampling of the raw files; IMDB is likewise a re-sampling of the raw files.
We extract a subset of DBLP which contains 14328 papers (P), 4057 authors (A), 20 conferences (C), 8789 terms (T). The authors are divided into four areas: database, data mining, machine learning, information retrieval. Also, we label each author's research area according to the conferences they submitted. Author features are the elements of a bag-of-words representation of keywords. Here we employ the meta-path set {APA, APCPA, APTPA} to perform experiments.
Although the HAN paper writes "We extract", in practice what's used is what they extracted.
MAGNN's processing code:
After unzipping, there's a pile of .npz and .npy files, plus a folder 0.
The folder /0/ holds the post-processing for mini-batched training.
Here we extract a subset of IMDB which contains 4780 movies (M), 5841 actors (A) and 2269 directors (D). The movies are divided into three classes (Action, Comedy, Drama) according to their genre. Movie features correspond to elements of a bag-of-words representation of plots. We employ the meta-path set {MAM, MDM} to perform experiments.
For the processing itself, see the code (I haven't read it anyway).
Besides these, HAN also uses the FreeBase dataset.
FreeBase is a huge knowledge graph. We sample a subgraph of 8 genres of entities with about 1,000,000 edges following the procedure of a previous survey.
The FreeBase dataset comes from this paper:
Carl Yang, Yuxin Xiao, Yu Zhang, Yizhou Sun, and Jiawei Han. 2020. Heterogeneous Network Representation Learning: A Unified Framework with Survey and Benchmark. TKDE (2020).
FreeBase has been pretty much ignored by everyone since.