This repository contains the code and data for the KDD 2023 paper Predicting Information Pathways Across Online Communities.
- Authors: Yiqiao Jin, Yeon-Chang Lee, Kartik Sharma, Meng Ye, Karan Sikka, Ajay Divakaran, Srijan Kumar
- Organizations: Georgia Institute of Technology, SRI International
 
If our code or data helps your research, please cite us:
@inproceedings{jin2023predicting,
  title        = {Predicting Information Pathways Across Online Communities},
  author       = {Jin, Yiqiao and Lee, Yeon-Chang and Sharma, Kartik and Ye, Meng and Sikka, Karan and Divakaran, Ajay and Kumar, Srijan},
  year         = 2023,
  booktitle    = {KDD},
}
The problem of community-level information pathway prediction (CLIPP) aims at predicting the transmission trajectory of content across online communities. A successful solution to CLIPP holds significance as it facilitates the distribution of valuable information to a larger audience and prevents the proliferation of misinformation. Notably, solving CLIPP is non-trivial as inter-community relationships and influence are unknown, information spread is multi-modal, and new content and new communities appear over time. In this work, we address CLIPP by collecting large-scale, multi-modal datasets to examine the diffusion of online YouTube videos on Reddit. We analyze these datasets to construct community influence graphs (CIGs) and develop a novel dynamic graph framework, INPAC (Information Pathway Across Online Communities), which incorporates CIGs to capture the temporal variability and multi-modal nature of video propagation across communities. Experimental results in both warm-start and cold-start scenarios show that INPAC outperforms seven baselines in CLIPP.
We constructed real-world, large-scale datasets covering 60 months of Reddit posts sharing YouTube videos, from January 2018 to December 2022. The datasets are available on 🤗 HuggingFace (Ahren09/reddit).
Install the datasets library:
pip install datasets
You can load the dataset using:
from datasets import load_dataset
dataset = load_dataset("Ahren09/reddit", "2018") 
where "2018" is the subset name. Replace it with "2019", ..., "2022" to load the other subsets.
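To work with all five years at once, the subsets can be loaded in a loop. The sketch below assumes the subset names are exactly the year strings "2018" through "2022", as described above. The helper names subset_names and load_all_subsets are illustrative and are not part of this repository.

```python
def subset_names(first_year=2018, last_year=2022):
    """Return the yearly subset names, e.g. "2018", ..., "2022"."""
    return [str(year) for year in range(first_year, last_year + 1)]


def load_all_subsets(repo_id="Ahren09/reddit"):
    """Load every yearly subset into a dict keyed by subset name.

    Requires network access and the `datasets` package.
    """
    # Imported lazily so that subset_names() works without `datasets` installed.
    from datasets import load_dataset

    return {name: load_dataset(repo_id, name) for name in subset_names()}
```

Calling load_all_subsets() downloads each subset on first use, so the first call may take a while.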
Install the dependencies with conda:
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install pyg -c pyg
conda install -c conda-forge tensorflow
NOTE: To avoid any import or path issues, it is recommended to use PyCharm.
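Before running the experiments, it can help to confirm that the installed packages are importable from the active environment. The following is a minimal sketch using only the standard library. The module list is an assumption based on the conda commands above (the pyg package is imported as torch_geometric), and check_dependencies is an illustrative helper, not a script shipped with this repository.

```python
import importlib.util

# Import names assumed to correspond to the packages installed above.
REQUIRED_MODULES = ("torch", "torchvision", "torchaudio", "torch_geometric", "tensorflow")


def check_dependencies(module_names=REQUIRED_MODULES):
    """Map each top-level module name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

This only checks that each module can be located. It does not verify versions or CUDA availability.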
For the large dataset, run
python main.py --dataset_name large --do_static_modeling --session_split_method session --delta_t_thres 4.13625 --do_val
For the small dataset, run
python main.py --dataset_name small --do_static_modeling --session_split_method session --delta_t_thres 4.13625 --do_val
- dataset_name: small for the 3-month Small dataset, large for the 54-month Large dataset.
- delta_t_thres: The precomputed threshold in Section 3.2. You can also run without specifying delta_t_thres and let the code compute it for you.
- c, mu, sigma: Hyperparameters in the equation $\delta t^{thres} = \mu - c \sigma$.
- resource: v for video. We will include more types of resources in the future, such as url.
- eval_neg_sampling_ratio: The number of negative items to sample for each positive interaction. This is for evaluation.
- eval_every: Evaluate the model every eval_every epochs.
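To make the threshold equation concrete, here is a minimal sketch of how $\mu - c \sigma$ could be evaluated and then used to split a sequence of time values into sessions. The splitting rule (a gap larger than the threshold starts a new session) is an assumption for illustration. The actual definition, units, and any transformation of the time gaps are given in Section 3.2 of the paper and may differ from this sketch. The function names and numeric values below are hypothetical.

```python
def delta_t_threshold(mu, sigma, c):
    """Evaluate delta_t_thres = mu - c * sigma."""
    return mu - c * sigma


def split_into_sessions(times, threshold):
    """Split sorted time values into sessions.

    Assumption: a gap strictly greater than `threshold` starts a new session.
    """
    sessions = []
    for t in times:
        if sessions and t - sessions[-1][-1] <= threshold:
            sessions[-1].append(t)  # close enough to the previous value
        else:
            sessions.append([t])  # first value, or gap too large
    return sessions


# Hypothetical values, chosen only to illustrate the arithmetic.
threshold = delta_t_threshold(mu=5.0, sigma=1.0, c=0.5)  # 5.0 - 0.5 * 1.0 = 4.5
sessions = split_into_sessions([0.0, 1.0, 3.0, 10.0, 12.0], threshold)
print(threshold, sessions)  # 4.5 [[0.0, 1.0, 3.0], [10.0, 12.0]]
```

With these values, the gap between 3.0 and 10.0 exceeds 4.5, so the sequence is split into two sessions.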
The data can be downloaded from Google Drive. Please put the entire data/ folder under INPAC.
The urls_df.pkl file contains the unfiltered data:
                                                 url           netloc post_id   timestamp       subreddit             author            v
0                       https://youtu.be/tmmpaOZ3nQg         youtu.be  eiazyl  1577836805  virtualreality          Zweetprot  tmmpaOZ3nQg
1        https://www.youtube.com/watch?v=LuAyGWqYza4  www.youtube.com  eib0a6  1577836845          FTMMen  00110100-00110010  LuAyGWqYza4
2        https://www.youtube.com/watch?v=d4hJA7IUaDs  www.youtube.com  eib0a6  1577836845          FTMMen  00110100-00110010  d4hJA7IUaDs
3  https://www.youtube.com/watch?v=5U_2V6yr-Nw&fe...  www.youtube.com  eib0a6  1577836845          FTMMen  00110100-00110010  5U_2V6yr-Nw
4                       https://youtu.be/tmmpaOZ3nQg         youtu.be  eib0em  1577836862         SteamVR          Zweetprot  tmmpaOZ3nQg
5                       https://youtu.be/mumHdNhclrM         youtu.be  eib0h6  1577836869  SmallYTChannel      thevinamazing  mumHdNhclrM
6                       https://youtu.be/tmmpaOZ3nQg         youtu.be  eib0nk  1577836892        VRGaming          Zweetprot  tmmpaOZ3nQg
7        https://www.youtube.com/watch?v=uxtqIvOP0rQ  www.youtube.com  eib0se  1577836909        ripplers            daNext1  uxtqIvOP0rQ
8                       https://youtu.be/tmmpaOZ3nQg         youtu.be  eib0ur  1577836917        HTC_Vive          Zweetprot  tmmpaOZ3nQg
9                       https://youtu.be/HE1Vy5lKuzw         youtu.be  eib0wn  1577836926      HelpMeFind            Sanojoj  HE1Vy5lKuzw
Each row represents a video shared in a Reddit post.
reddit_dataset.pkl along with the mappings.
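As an illustration of how the url, subreddit, and timestamp columns of the urls_df.pkl table relate to an information pathway, the sketch below extracts the video ID from the two URL shapes shown in the table and orders the subreddits for each video by posting time. It uses only the standard library and a few rows copied from the table. The helpers extract_video_id and build_pathways are hypothetical and do not reflect how the repository preprocesses its data (the table already provides the ID in column v).

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlparse


def extract_video_id(url):
    """Return the YouTube video ID for the two URL shapes shown in the table."""
    parsed = urlparse(url)
    if parsed.netloc == "youtu.be":
        return parsed.path.lstrip("/") or None
    if parsed.netloc.endswith("youtube.com"):
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


def build_pathways(rows):
    """Group (timestamp, subreddit, url) rows into per-video subreddit sequences."""
    by_video = defaultdict(list)
    for timestamp, subreddit, url in rows:
        video_id = extract_video_id(url)
        if video_id is not None:
            by_video[video_id].append((timestamp, subreddit))
    # Sort each video's posts by timestamp and keep only the subreddit names.
    return {vid: [sub for _, sub in sorted(posts)] for vid, posts in by_video.items()}


# A few rows copied from the table above.
rows = [
    (1577836805, "virtualreality", "https://youtu.be/tmmpaOZ3nQg"),
    (1577836845, "FTMMen", "https://www.youtube.com/watch?v=LuAyGWqYza4"),
    (1577836862, "SteamVR", "https://youtu.be/tmmpaOZ3nQg"),
    (1577836892, "VRGaming", "https://youtu.be/tmmpaOZ3nQg"),
    (1577836917, "HTC_Vive", "https://youtu.be/tmmpaOZ3nQg"),
]
pathways = build_pathways(rows)
print(pathways["tmmpaOZ3nQg"])  # ['virtualreality', 'SteamVR', 'VRGaming', 'HTC_Vive']
```

In this sample, the same video appears in four VR-related subreddits within about two minutes, which is the kind of cross-community trajectory that CLIPP aims to predict.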
If you have any questions, please contact the author Yiqiao Jin.
