Datasets Used in My Papers
Plain Graphs
| Name | #nodes | #edges | #labels | Type | URL |
|---|---|---|---|---|---|
| PPI | 3,890 | 76,584 | 50 | undirected | [raw] [raw] [preprocessed] |
| Blogcatalog3 | 10,312 | 333,983 | 39 | undirected | [raw] [raw] [preprocessed] |
| Flickr | 80,513 | 5,899,882 | 195 | undirected | [raw] [raw] [preprocessed] |
| Amazon | 334,863 | 925,872 | 100 | undirected | [raw] [preprocessed] |
| DBLP | 425,957 | 1,049,866 | 100 | undirected | [raw] [preprocessed] |
| Youtube | 1,138,499 | 2,990,443 | 47 | undirected | [raw] [preprocessed] |
| TWeibo | 2,320,895 | 50,655,143 | 100 | directed | [raw] [preprocessed] |
| Orkut | 3,072,441 | 117,185,084 | 100 | undirected | [raw] [preprocessed] |
| LiveJournal | 3,997,962 | 34,681,189 | 100 | undirected | [raw] [preprocessed] |
| In-2004 | 1,382,908 | 16,539,643 | - | directed | [raw] [preprocessed] |
| DBLP | 5,425,963 | 17,298,032 | - | undirected | [raw] [preprocessed] |
| Pokec | 1,632,803 | 30,622,564 | - | directed | [raw] [preprocessed] |
| LiveJournal | 4,847,571 | 68,475,391 | - | directed | [raw] [preprocessed] |
| IT-2004 | 41,291,594 | 1,135,718,909 | - | directed | [raw] [preprocessed] |
| | 41,652,230 | 1,468,365,182 | - | directed | [raw] [preprocessed] |
| Friendster-small | 7,944,949 | 447,219,610 | 100 | undirected | [raw] [raw] [preprocessed] |
| Friendster | 65,608,366 | 1,806,067,135 | 100 | undirected | [raw] [raw] [preprocessed] |
| OAG | 67,768,244 | 895,368,962 | 19 | undirected | [raw] [preprocessed] |
| UK-2007 | 105,896,555 | 3,738,733,648 | - | directed | [raw] [preprocessed] |
| UK-union | 133,633,040 | 5,475,109,924 | - | directed | [raw] [preprocessed] |
| ClueWeb12 | 978,408,098 | 42,574,107,469 | - | directed | [raw] |
| ClueWeb09 | 1,684,868,322 | 7,939,635,651 | - | directed | [raw] [preprocessed] |
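
The #nodes and #edges columns can be checked against a downloaded file. The following is a minimal sketch, assuming the preprocessed graph is a plain-text edge list with one whitespace-separated `source target` pair per line. The actual file layout is not specified on this page, and `graph_stats` is a hypothetical helper written for illustration.

```python
import io


def graph_stats(lines, directed=False):
    """Count distinct nodes and edges in an edge list.

    Assumes one whitespace-separated "source target" pair per line; lines
    that are blank or start with '#' are skipped. For undirected graphs,
    (u, v) and (v, u) are counted as a single edge.
    """
    nodes = set()
    edges = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        u, v = line.split()[:2]
        nodes.update((u, v))
        # Canonicalize the endpoint order so both directions collapse to one key.
        edges.add((u, v) if directed or u <= v else (v, u))
    return len(nodes), len(edges)


# Toy input standing in for a downloaded file (made-up data).
toy = io.StringIO("# comment\n0 1\n1 0\n1 2\n")
print(graph_stats(toy, directed=False))  # (3, 2)
print(graph_stats(io.StringIO("0 1\n1 0\n1 2\n"), directed=True))  # (3, 3)
```

Holding every edge in a Python set is only practical for the smaller graphs in the table; the billion-edge graphs would need a streaming or external-memory approach instead.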
Please cite our papers if you publish results based on our preprocessed datasets.
@article{yang13homogeneous,
title={Homogeneous Network Embedding for Massive Graphs via Reweighted Personalized PageRank},
author={Yang, Renchi and Shi, Jieming and Xiao, Xiaokui and Yang, Yin and Bhowmick, Sourav S},
journal={Proceedings of the VLDB Endowment},
volume={13},
number={5},
pages={670--683},
year={2020},
publisher={VLDB Endowment}
}
@article{shi13realtime,
title={Realtime Index-Free Single Source SimRank Processing on Web-Scale Graphs},
author={Shi, Jieming and Jin, Tianyuan and Yang, Renchi and Xiao, Xiaokui and Yang, Yin},
journal={Proceedings of the VLDB Endowment},
volume={13},
number={7},
pages={966--980},
year={2020},
publisher={VLDB Endowment}
}
Attributed Graphs
| Name | Type | #nodes | #edges | #attributes | #labels | URL |
|---|---|---|---|---|---|---|
| Wiki | directed | 2,405 | 17,981 | 4,973 | 19 | [raw] [preprocessed] |
| Cora | directed | 2,708 | 5,429 | 1,433 | 7 | [raw] [preprocessed] |
| Citeseer | directed | 3,312 | 4,660 | 3,703 | 6 | [raw] [preprocessed] |
| Pubmed | directed | 19,717 | 44,338 | 500 | 3 | [raw] [preprocessed] |
| BlogCatalog | undirected | 5,196 | 343,486 | 8,189 | 6 | [raw] [preprocessed] |
| PPI | undirected | 56,944 | 818,716 | 50 | 121 | [raw] [preprocessed] |
| Flickr | undirected | 7,575 | 479,476 | 12,047 | 9 | [raw] [preprocessed] |
| | undirected | 4,039 | 88,234 | 1,283 | 193 | [raw] [preprocessed] |
| ArXiv | undirected | 169,343 | 1,157,799 | 128 | 40 | [raw] [preprocessed] |
| | undirected | 232,965 | 11,606,919 | 602 | 41 | [raw] [preprocessed] |
| Yelp | undirected | 716,847 | 6,977,410 | 300 | 100 | [raw] [preprocessed] |
| | directed | 81,306 | 1,768,149 | 216,839 | 4,065 | [raw] [preprocessed] |
| Amazon2M | undirected | 2,449,029 | 61,859,140 | 100 | 47 | [raw] [raw] [preprocessed] |
| Google+ | directed | 107,614 | 13,673,453 | 15,907 | 468 | [raw] [preprocessed] |
| TWeibo | directed | 2,320,895 | 50,655,143 | 1,657 | 8 | [raw] [preprocessed] |
| MAG | directed | 59,249,719 | 978,147,253 | 2,000 | 100 | [raw] [preprocessed] |
| MAG-SC | directed | 10,541,560 | 265,219,994 | 2,784,240 | 8 | [raw] [preprocessed] |
Our datasets are also available in PyTorch Geometric. Node attributes can be loaded as a sparse matrix using the following code:
from scipy import sparse
# Load the sparse node-attribute matrix saved in SciPy's .npz format.
features = sparse.load_npz("attrs.npz")
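
To illustrate what the loaded object looks like, here is a small, self-contained sketch that writes a toy attribute matrix with `sparse.save_npz` and reads it back. The toy values and the temporary file path are made up for illustration, and it is an assumption that rows of the real `attrs.npz` correspond to nodes and columns to attributes.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Toy 3-node x 4-attribute matrix (made-up values, not from any dataset).
dense = np.array([[1.0, 0.0, 0.0, 2.0],
                  [0.0, 0.0, 3.0, 0.0],
                  [0.0, 4.0, 0.0, 0.0]])
toy = sparse.csr_matrix(dense)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "attrs.npz")
    sparse.save_npz(path, toy)
    features = sparse.load_npz(path)

# Assumed layout: rows are nodes, columns are attributes.
num_nodes, num_attrs = features.shape
print(num_nodes, num_attrs, features.nnz)  # 3 4 4

# Convert to CSR for fast row slicing, then read one node's attribute vector.
features = features.tocsr()
row = features[1].toarray().ravel()
print(row)  # [0. 0. 3. 0.]
```

Keeping the matrix sparse matters for the larger datasets, where a dense array of shape #nodes x #attributes may not fit in memory.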
Please cite our paper if you publish results based on our preprocessed datasets.
@article{yang2020scaling,
title={Scaling Attributed Network Embedding to Massive Graphs},
author={Yang, Renchi and Shi, Jieming and Xiao, Xiaokui and Yang, Yin and Liu, Juncheng and Bhowmick, Sourav S},
journal={Proceedings of the VLDB Endowment},
volume={14},
number={1},
pages={37--49},
year={2021},
publisher={VLDB Endowment}
}
Bipartite Graphs
| Name | \|U\| | \|V\| | \|E\| | URL |
|---|---|---|---|---|
| Avito | 27,736 | 16,589 | 67,029 | [raw] [preprocessed] |
| AOL | 4,811,647 | 1,632,788 | 10,741,954 | [raw] [preprocessed] |
| DBLP | 6,001 | 1,524 | 29,257 | [raw] [preprocessed] |
| MovieLens-1M | 6,040 | 3,706 | 1,000,210 | [raw] [preprocessed] |
| KDDCup2012 | 255,170 | 1,848,114 | 2,766,394 | [raw] [preprocessed] |
| Last.fm | 359,349 | 160,168 | 17,559,531 | [raw] [preprocessed] |
| Amazon-games | 826,767 | 50,210 | 1,324,754 | [raw] [preprocessed] |
| DBLP | 6,001 | 1,308 | 29,256 | [raw] [preprocessed] |
| Wikipedia | 15,000 | 3,214 | 64,095 | [raw] [preprocessed] |
| | 55,187 | 9,916 | 1,500,809 | [raw] [preprocessed] |
| Yelp | 31,668 | 38,048 | 1,561,406 | [raw] [preprocessed] |
| MovieLens-10M | 69,878 | 10,677 | 10,000,054 | [raw] [preprocessed] |
| Last.fm | 359,349 | 160,168 | 17,559,530 | [raw] [preprocessed] |
| MIND | 876,956 | 97,509 | 18,149,915 | [raw] [preprocessed] |
| Netflix | 480,189 | 17,770 | 100,480,507 | [raw] [preprocessed] |
| Orkut | 2,783,196 | 8,730,857 | 327,037,487 | [raw] [preprocessed] |
| MAG | 10,541,560 | 2,784,240 | 1,095,315,106 | [raw] [preprocessed] |
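
The three size columns in this table map directly onto a biadjacency matrix: one row per node in U, one column per node in V, and one nonzero entry per edge. The following is a minimal sketch with SciPy, using made-up toy edges. The index arrays and the assumption that node ids are already 0-based integers are illustrative and do not describe the preprocessed file format.

```python
import numpy as np
from scipy import sparse

# Toy edges (u, v) with u in U = {0, 1, 2} and v in V = {0, 1} (made-up data).
u_idx = np.array([0, 0, 1, 2])
v_idx = np.array([0, 1, 1, 0])
num_u, num_v = 3, 2

# Unweighted biadjacency matrix B with B[u, v] = 1 for every edge (u, v).
weights = np.ones(len(u_idx))
B = sparse.csr_matrix((weights, (u_idx, v_idx)), shape=(num_u, num_v))

print(B.shape)  # (3, 2)
print(B.nnz)    # 4

# Row sums give U-side degrees; column sums give V-side degrees.
deg_u = np.asarray(B.sum(axis=1)).ravel()
deg_v = np.asarray(B.sum(axis=0)).ravel()
print(deg_u.tolist())  # [2.0, 1.0, 1.0]
print(deg_v.tolist())  # [2.0, 2.0]
```

With this representation, the shape of `B` should match the U and V sizes listed for a dataset, and `B.nnz` should match its edge count, provided the file contains no duplicate edges.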
Please cite our papers if you publish results based on our preprocessed datasets.
@inproceedings{yang2022efficient,
title={Efficient and Effective Similarity Search over Bipartite Graphs},
author={Yang, Renchi},
booktitle={Proceedings of the ACM Web Conference 2022},
pages={308--318},
year={2022}
}
@inproceedings{yang2022scalable,
title={Scalable and Effective Bipartite Network Embedding},
author={Yang, Renchi and Shi, Jieming and Huang, Keke and Xiao, Xiaokui},
booktitle={Proceedings of the 2022 International Conference on Management of Data},
pages={1977--1991},
year={2022}
}
Dataset Repositories
| Name | Type | Collected by |
|---|---|---|
| SNAP | Graphs & Networks | Stanford |
| LAW | Graphs & Networks | UNIMI |
| BioSNAP | Biomedical Networks | Stanford |
| KONECT | Graphs & Networks | Jérôme Kunegis |
| AMiner | Academic Networks | AMiner |
| UCI Network Data Repository | Graphs & Networks | UCI Datalab |
| Network Repository | Graphs & Networks | - |
| Open Academic Graph | Academic Networks | Microsoft |
| Open Graph Benchmark | Graphs & Networks | Stanford |
| TUDataset | Graphs & Networks | Christopher Morris et al. |
| StreamingGraphs | Streaming Graphs | Yibo Yao |
| ARB | Graphs & Networks | Austin R. Benson |
| SuiteSparse Matrix Collection | Matrix/Graphs | TAMU |
| Web Data Commons | Hyperlink Graphs/Web Tables/RDFa | University of Mannheim |
| Yahoo Webscope Datasets | Graphs/Ratings/Languages/Advertising | Yahoo |
| UCI Machine Learning Repository | Multivariate/Text/Time-Series | UCI |
| Yelp Open Dataset | Businesses/Reviews/User Data | Yelp |
| Recommender Systems Datasets | Graphs/Interactions/Reviews/Ratings | UCSD |
| Microsoft News Dataset | User Behavior Logs | Microsoft |
| Search Query Logs | Query Logs | Jeff Huang |
| AOL DS | Query Logs | Ricardo Campos |
| AWS | - | Amazon |
| Kaggle Datasets | - | Kaggle |
| OpenML | - | OpenML |
| Datasets | - | - |
| Netzschleuder | - | - |