Datasets Used in My Papers

Plain Graphs

Name              #nodes         #edges          #labels  Type        URL
PPI               3,890          76,584          50       undirected  [raw] [raw] [preprocessed]
Blogcatalog3      10,312         333,983         39       undirected  [raw] [raw] [preprocessed]
Flickr            80,513         5,899,882       195      undirected  [raw] [raw] [preprocessed]
Youtube           1,138,499      2,990,443       47       undirected  [raw] [preprocessed]
TWeibo            2,320,895      50,655,143      100      directed    [raw] [preprocessed]
Orkut             3,072,441      117,185,084     100      undirected  [raw] [preprocessed]
In-2004           1,382,908      16,539,643      -        directed    [raw] [preprocessed]
DBLP              5,425,963      17,298,032      -        undirected  [raw] [preprocessed]
Pokec             1,632,803      30,622,564      -        directed    [raw] [preprocessed]
LiveJournal       4,847,571      68,475,391      -        directed    [raw] [preprocessed]
IT-2004           41,291,594     1,135,718,909   -        directed    [raw] [preprocessed]
Twitter           41,652,230     1,468,365,182   -        directed    [raw] [preprocessed]
Friendster-small  7,944,949      447,219,610     100      undirected  [raw] [raw] [preprocessed]
Friendster        65,608,366     1,806,067,135   100      undirected  [raw] [raw] [preprocessed]
OAG               67,768,244     895,368,962     19       undirected  [raw] [preprocessed]
UK-2007           105,896,555    3,738,733,648   -        directed    [raw] [preprocessed]
UK-union          133,633,040    5,475,109,924   -        directed    [raw] [preprocessed]
ClueWeb12         978,408,098    42,574,107,469  -        directed    [raw]
ClueWeb09         1,684,868,322  7,939,635,651   -        directed    [raw] [preprocessed]

Please cite our papers if you publish results based on our preprocessed datasets:

@article{yang13homogeneous,
  title={Homogeneous Network Embedding for Massive Graphs via Reweighted Personalized PageRank},
  author={Yang, Renchi and Shi, Jieming and Xiao, Xiaokui and Yang, Yin and Bhowmick, Sourav S},
  journal={Proceedings of the VLDB Endowment},
  volume={13},
  number={5},
  pages={670--683},
  year={2020},
  publisher={VLDB Endowment}
}

@article{shi13realtime,
  title={Realtime Index-Free Single Source SimRank Processing on Web-Scale Graphs},
  author={Shi, Jieming and Jin, Tianyuan and Yang, Renchi and Xiao, Xiaokui and Yang, Yin},
  journal={Proceedings of the VLDB Endowment},
  volume={13},
  number={7},
  pages={966--980},
  year={2020},
  publisher={VLDB Endowment}
}

Attributed Graphs

Name         Type        #nodes      #edges       #attributes  #labels  URL
Wiki         directed    2,405       17,981       4,973        19       [raw] [preprocessed]
Cora         directed    2,708       5,429        1,433        7        [raw] [preprocessed]
Citeseer     directed    3,312       4,660        3,703        6        [raw] [preprocessed]
Pubmed       directed    19,717      44,338       500          3        [raw] [preprocessed]
BlogCatalog  undirected  5,196       343,486      8,189        6        [raw] [preprocessed]
PPI          undirected  56,944      818,716      50           121      [raw] [preprocessed]
Flickr       undirected  7,575       479,476      12,047       9        [raw] [preprocessed]
Facebook     undirected  4,039       88,234       1,283        193      [raw] [preprocessed]
Twitter      directed    81,306      1,768,149    216,839      4,065    [raw] [preprocessed]
Google+      directed    107,614     13,673,453   15,907       468      [raw] [preprocessed]
TWeibo       directed    2,320,895   50,655,143   1,657        8        [raw] [preprocessed]
MAG          directed    59,249,719  978,147,253  2,000        100      [raw] [preprocessed]
MAG-SC       directed    10,541,560  265,219,994  2,784,240    8        [raw] [preprocessed]

Our datasets are also available in PyTorch Geometric. Node attributes can be loaded as a sparse matrix using the following code:

from scipy import sparse

# Load the node attributes as a SciPy sparse matrix
features = sparse.load_npz("attrs.npz")
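
Once loaded, the matrix can be inspected with the usual SciPy sparse operations. The following is a minimal sketch, not part of the released code. It assumes that attrs.npz stores one row per node and one column per attribute, which is not stated above. The helper name describe_attributes and the toy 3-by-4 matrix are hypothetical and serve only to keep the example runnable without downloading a dataset.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def describe_attributes(path):
    """Load a sparse attribute matrix and return it with basic statistics."""
    # CSR layout gives fast access to individual rows.
    features = sparse.load_npz(path).tocsr()
    # Assumption: rows are nodes and columns are attributes.
    num_nodes, num_attrs = features.shape
    stats = {
        "nodes": num_nodes,
        "attributes": num_attrs,
        "nonzeros": features.nnz,
        "density": features.nnz / (num_nodes * num_attrs),
    }
    return features, stats


# Hypothetical toy stand-in for attrs.npz: 3 nodes, 4 attributes.
toy = sparse.csr_matrix(
    np.array([[1, 0, 0, 2],
              [0, 0, 3, 0],
              [0, 4, 0, 0]], dtype=np.float32)
)

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "attrs.npz")
    sparse.save_npz(toy_path, toy)
    features, stats = describe_attributes(toy_path)

print(stats)

# Dense attribute vector of the first node.
row0 = features[0].toarray().ravel()
print(row0)
```

For the larger graphs listed above, it is usually preferable to keep the matrix sparse and convert only individual rows or small batches to dense arrays.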

Please cite our paper if you publish results based on our preprocessed datasets:

@article{yang2020scaling,
  title={Scaling Attributed Network Embedding to Massive Graphs},
  author={Yang, Renchi and Shi, Jieming and Xiao, Xiaokui and Yang, Yin and Liu, Juncheng and Bhowmick, Sourav S},
  journal={Proceedings of the VLDB Endowment},
  volume={14},
  number={1},
  pages={37--49},
  year={2021},
  publisher={VLDB Endowment}
}

Bipartite Graphs

Name           |U|         |V|        |E|            URL
Avito          27,736      16,589     67,029         [raw] [preprocessed]
AOL            4,811,647   1,632,788  10,741,954     [raw] [preprocessed]
DBLP           6,001       1,524      29,257         [raw] [preprocessed]
Movielens-1M   6,040       3,706      1,000,210      [raw] [preprocessed]
KDDCup2012     255,170     1,848,114  2,766,394      [raw] [preprocessed]
Last.fm        359,349     160,168    17,559,531     [raw] [preprocessed]
Amazon-games   826,767     50,210     1,324,754      [raw] [preprocessed]
DBLP           6,001       1,308      29,256         [raw] [preprocessed]
Wikipedia      15,000      3,214      64,095         [raw] [preprocessed]
Pinterest      55,187      9,916      1,500,809      [raw] [preprocessed]
Yelp           31,668      38,048     1,561,406      [raw] [preprocessed]
MovieLens-10M  69,878      10,677     10,000,054     [raw] [preprocessed]
Last.fm        359,349     160,168    17,559,530     [raw] [preprocessed]
MIND           876,956     97,509     18,149,915     [raw] [preprocessed]
Netflix        480,189     17,770     100,480,507    [raw] [preprocessed]
Orkut          2,783,196   8,730,857  327,037,487    [raw] [preprocessed]
MAG            10,541,560  2,784,240  1,095,315,106  [raw] [preprocessed]

Please cite our papers if you publish results based on our preprocessed datasets:

@inproceedings{yang2022efficient,
  title={Efficient and Effective Similarity Search over Bipartite Graphs},
  author={Yang, Renchi},
  booktitle={Proceedings of the ACM Web Conference 2022},
  pages={308--318},
  year={2022}
}

@inproceedings{yang2022scalable,
  title={Scalable and Effective Bipartite Network Embedding},
  author={Yang, Renchi and Shi, Jieming and Huang, Keke and Xiao, Xiaokui},
  booktitle={Proceedings of the 2022 International Conference on Management of Data},
  pages={1977--1991},
  year={2022}
}

Dataset Repositories

Name                             Type                                  Collected by
SNAP                             Graphs & Networks                     Stanford
LAW                              Graphs & Networks                     UNIMI
BioSNAP                          Biomedical Networks                   Stanford
KONECT                           Graphs & Networks                     Jérôme Kunegis
Aminer                           Academic Networks                     AMiner
UCI Network Data Repository      Graphs & Networks                     UCI Datalab
Network Repository               Graphs & Networks                     -
Open Academic Graph              Academic Networks                     Microsoft
Open Graph Benchmark             Graphs & Networks                     Stanford
TuDatasets                       Graphs & Networks                     Christopher Morris et al.
StreamingGraphs                  Streaming Graphs                      Yibo Yao
ARB                              Graphs & Networks                     Austin R. Benson
SuiteSparse Matrix Collection    Matrix/Graphs                         TAMU
Web Data Commons                 Hyperlink Graphs/Web Tables/RDFa      University of Mannheim
Yahoo Webscope Datasets          Graphs/Ratings/Languages/Advertising  Yahoo
UCI Machine Learning Repository  Multivariate/Text/Time-Series         UCI
Yelp Open Dataset                Businesses/Reviews/User Data          Yelp
Recommender Systems Datasets     Graphs/Interactions/Reviews/Ratings   UCSD
Microsoft News Dataset           User Behavior Logs                    Microsoft
Search Query Logs                Query Logs                            Jeff Huang
AOL DS                           Query Logs                            Ricardo Campos
AWS                              -                                     Amazon
Kaggle Datasets                  -                                     Kaggle
OpenML                           -                                     OpenML