Publication: https://doi.org/10.1101/2024.04.11.589004
Alternative splicing plays a pivotal role in various biological processes. In the context of cancer, aberrant splicing patterns can lead to disease progression and treatment resistance. Understanding the regulatory mechanisms underlying alternative splicing is crucial for elucidating disease mechanisms and identifying potential therapeutic targets. We present DeepRBP, a deep learning (DL) based framework to identify potential RNA-binding protein (RBP)-Gene regulation pairs for further in vitro validation. DeepRBP is composed of a DL model that predicts transcript abundance given RBP and gene expression data, coupled with an explainability module that computes informative RBP-Gene scores using DeepLIFT. We show that the proposed framework is able to identify known RBP-Gene regulations, demonstrating its applicability to identify new ones.
This project includes instructions to:
- Apply a trained DeepRBP model to your own data to calculate RBP-Gene pair scores (explainability module).
- Train your own DeepRBP predictor using TCGA or other data.
- Replicate the paper results using 3 POSTAR3 datasets and 6 real knockdown experiments.
 
To create a Conda environment using the provided environment.yml, follow these steps:
git clone https://github.com/ML4BM-Lab/DeepRBP.git
cd DeepRBP
conda env create -f environment.yml
conda activate DeepRBP
In this project, we have used several databases. On one hand, we have used a cohort that contains, among others, samples from TCGA and GTEx. The TCGA samples were used to train the DeepRBP predictor, which learns transcript abundances, and the GTEx samples were used to evaluate how well the predictive model generalizes.
On the other hand, the DeepRBP explainer has been validated using TCGA samples from a single tumor type to calculate a G x RBP matrix of scores, together with a binary matrix of the same shape indicating, for each pair, whether the POSTAR3 experiments provide evidence of regulation. The DeepRBP explainer has also been tested on real knockdown experiments. The instructions below describe how to download these data.
To download and process the TCGA and GTEx data used in this project, execute the following shell scripts:
data/input_create_model/raw/download_script.sh
data/input_create_model/processed/create_data.sh
In the create_data.sh bash script, set path_deepRBP to the path of your DeepRBP folder:
#!/bin/bash
SCRIPT_DIR="$PWD"
path_deepRBP="" # Change this path to your DeepRBP folder
PATH_DATA="$path_deepRBP/data/input_create_model"
echo "Ruta de PATH_DATA: $PATH_DATA"
python "$SCRIPT_DIR/create_data.py" --chunksize 5000 --select_genes 'cancer_genes' --path_data "$PATH_DATA"
With this data, you will be able to create a model from scratch, check how well it predicts transcript abundance, and run the explainability module. If you instead want to identify the RBPs that regulate a gene in your own experiment, the tutorial in ./data/data_real_kds/tutorial_download_kd_experiments explains, as a preliminary step, how to download the fastq files and run kallisto. It also explains how to use voom-limma for differential expression analysis between two conditions, as further validation of the scores obtained with DeepRBP.
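The create_data.sh script shown above ships with path_deepRBP empty, so running it unedited would resolve PATH_DATA to /data/input_create_model. As a minimal sketch, assuming you would rather launch the same step from Python, the hypothetical helper below (build_create_data_cmd is not part of DeepRBP) assembles the same command line and rejects an empty path. The flags and default values are copied from create_data.sh; the location of create_data.py next to create_data.sh is an assumption.

```python
from pathlib import Path


def build_create_data_cmd(path_deeprbp, chunksize=5000, select_genes="cancer_genes"):
    """Return the argument list equivalent to the python call in create_data.sh.

    Hypothetical helper for illustration; not part of the DeepRBP repository.
    """
    if not str(path_deeprbp).strip():
        raise ValueError("path_deeprbp is empty; set it to your DeepRBP folder")
    root = Path(path_deeprbp)
    path_data = root / "data" / "input_create_model"
    # Assumption: create_data.py sits next to create_data.sh in processed/.
    script = path_data / "processed" / "create_data.py"
    return [
        "python", str(script),
        "--chunksize", str(chunksize),
        "--select_genes", select_genes,
        "--path_data", str(path_data),
    ]


if __name__ == "__main__":
    # Example path only; replace it with the location of your own clone.
    print(" ".join(build_create_data_cmd("/home/user/DeepRBP")))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the path points at a real clone.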
After running the shell script, a folder named splitted_datasets will be generated in the directory ./data/input_create_model/processed. This folder contains processed samples from both the TCGA and GTEx datasets, divided by tumor type or tissue.
Within the TCGA dataset, the samples of each tumor type are split into 80% for training and 20% for testing. Each tumor type folder contains the two main inputs of the model and the output variable:
- .gn_expr_each_iso_tpm.csv: Gene expression in TPM for each transcript, with shape n_samples x n_transcripts.
- .RBPs_log2p_tpm.csv: RBP expression in log2(TPM+1), with shape n_samples x n_rbps.
- .trans_log2p_tpm.csv: Transcript expression in log2(TPM+1), with shape n_samples x n_transcripts.
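The RBP and transcript files store log2(TPM+1) values, while the per-transcript gene expression file stays in TPM. The sketch below uses only the standard library and invented numbers to show the transform, its inverse, and a check that the three matrices follow the shapes listed above. The function names are illustrative and not DeepRBP's API, and repeating the parent gene's TPM in every transcript column is one reading of "gene expression for each transcript", not something this README confirms.

```python
import math


def log2p(tpm_rows):
    """Apply log2(TPM + 1) element-wise to a list of rows."""
    return [[math.log2(v + 1.0) for v in row] for row in tpm_rows]


def inverse_log2p(log_rows):
    """Recover TPM from log2(TPM + 1)."""
    return [[2.0 ** v - 1.0 for v in row] for row in log_rows]


def shape(rows):
    """Return (n_rows, n_columns) of a list-of-lists matrix."""
    return (len(rows), len(rows[0]) if rows else 0)


def check_shapes(gene_tpm, rbp_log, trans_log):
    """Check the layout described above: samples in rows, features in columns."""
    n_samples, n_transcripts = shape(trans_log)
    if shape(gene_tpm) != (n_samples, n_transcripts):
        raise ValueError("gene expression must be n_samples x n_transcripts")
    if shape(rbp_log)[0] != n_samples:
        raise ValueError("RBP matrix must have one row per sample")


# Toy data: 2 samples, 3 transcripts, 2 RBPs (all values are invented).
trans_tpm = [[0.0, 1.0, 3.0], [7.0, 0.0, 15.0]]
# Assumed layout: the parent gene's TPM repeated in each transcript column.
gene_tpm = [[4.0, 4.0, 4.0], [22.0, 22.0, 22.0]]
rbp_tpm = [[1.0, 3.0], [0.0, 7.0]]

trans_log = log2p(trans_tpm)
rbp_log = log2p(rbp_tpm)
check_shapes(gene_tpm, rbp_log, trans_log)
print(trans_log)
```

Adding 1 before taking the logarithm keeps zero-expression entries at 0 rather than producing negative infinity.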
 
To download the POSTAR3 data with information on RBP binding sites in the genome from CLIP experiments, you need to go to the POSTAR3 website and request access to the data.
In the ./apply_the_model folder, you will find tutorials on how to use a trained DeepRBP predictor model for transcript prediction and explainability.
In particular, there is a tutorial that verifies the results of the DeepRBP explainability module on POSTAR3 data using TCGA samples and on a real knockdown experiment. Alternatively, you can run the provided .sh scripts to apply a trained model directly and calculate DeepLIFT scores for the experiments you are interested in.
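For intuition about what a DeepLIFT score is: DeepLIFT attributes the difference between a model's output on an input and its output on a reference input to the individual input features. For a purely linear model, the contribution of feature i reduces to w_i * (x_i - ref_i), and the contributions sum to the output difference. The toy below illustrates only that property, with invented weights and values. DeepRBP's predictor is a nonlinear network, and its actual scores come from the tutorials and scripts in ./apply_the_model, not from this code.

```python
def linear_model(weights, bias, x):
    """Output of a linear model: sum_i w_i * x_i + bias."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def deeplift_linear(weights, x, reference):
    """DeepLIFT contributions for a linear model: w_i * (x_i - ref_i)."""
    return [w * (v - r) for w, v, r in zip(weights, x, reference)]


# Invented numbers: three input features standing in for RBP expression values.
weights = [0.5, -2.0, 1.0]
bias = 0.25
x = [4.0, 1.0, 3.0]
reference = [2.0, 1.0, 0.0]

scores = deeplift_linear(weights, x, reference)
delta = linear_model(weights, bias, x) - linear_model(weights, bias, reference)

# Summation-to-delta: the contributions add up to the change in output.
print(scores, sum(scores), delta)
```

A feature whose value equals its reference (the second one here) receives a zero score regardless of its weight, which is why the choice of reference matters when interpreting the scores.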
In the ./model folder, you can create your own predictive model with the DeepRBP architecture, using the samples generated from TCGA (or any other samples you choose). A Jupyter notebook tutorial is provided to help you become familiar with the method. Alternatively, you can call the shell script main_predictor.sh to train the DL model according to your preferences.
