This project provides a workflow to prepare your own training datasets from raw files and convert them into tensors. Those tensors are the input for the machine learning architecture, which can be built using the TensorFlow or PyTorch framework.
Here we predict the full MS/MS spectrum as a set of (Mi, Ii) pairs, where Mi is the mass of a peak and Ii is its intensity. For the BiGRU model, the MS/MS spectrum is represented as the b/y ion series described in the original Prosit paper.
Two machine learning architectures are demonstrated.
- Transformer: Predicts full MSMS spectrum using transformer architecture
 - BiGRU: Predicts y-, b- ion intensities using BiGRU architecture
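To make the two representations concrete, the following is a minimal sketch of how the singly charged b/y ion series of an unmodified peptide can be computed and expressed as (Mi, Ii) pairs. It is an illustration only, not CoSpred's own annotation code; the function name `by_ion_series` and the placeholder intensities are assumptions made for this example.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # mass of H2O (Da)


def by_ion_series(peptide):
    """Return singly charged b- and y-ion m/z lists for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)           # N-terminal fragment
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


b_ions, y_ions = by_ion_series("PEPTIDE")
# A spectrum is then a set of (Mi, Ii) pairs. The intensities below are
# placeholders standing in for model output.
spectrum = [(mz, 1.0) for mz in sorted(b_ions + y_ions)]
print(len(spectrum), round(b_ions[0], 4), round(y_ions[0], 4))
```

The b/y representation only covers these backbone fragments, whereas the full-spectrum representation can hold any peak.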
 
To test the software and get a feel for its usage, we recommend starting with the Docker environment. All software dependencies are pre-installed in the Docker image, while the model weights and example data are provided in the capsule folder.
- Docker Community Edition (CE)
 - nvidia-container-runtime for code that leverages the GPU
 - NVIDIA Container Toolkit to use GPU in docker container.
 - [GLIBCXX] Updated GCC compiler providing GLIBCXX version >= 3.4.29.
 - Optional: Conda/Mamba installed
 
- Git clone the repo, download the pre-trained model pretrained_models.zip and example.zip from FigShare, create a data folder CoSpred/data, and store the two zip files there.
docker pull xuel12pfizer/cospred:v0.3
The provided Dockerfile will allow you to reproduce the workflow and results published by the author on your local machine by following the instructions below.
If there's any software requiring a license that needs to be run during the build stage, you'll need to make your license available.
In your terminal, navigate to the CoSpred folder with the example data in data that you've extracted before, and execute the following command:
docker build . --tag cospred_docker -f conf/Dockerfile
This step will recreate the environment (i.e., the Docker image) locally, fetching and installing any required dependencies in the process. If any external resources have become unavailable for any reason, the environment will fail to build. Note that in this example the Docker image name is cospred_docker, which will be referred to in the following sections.
Install all the packages given in conf/requirements_cospred_*.yml in a virtual environment.
conda env create -n cospred_cuda12_gpu_py39 -f conf/requirements_cospred_cuda12_gpu_py39.yml
The following is for the purpose of experiencing the end-to-end workflow. It assumes that the data example.zip and the model pretrained_models.zip downloaded from FigShare have been kept in the data folder. Execute the following commands to run the Docker container using the image built in the previous section, named cospred_docker, adjusting parameters as needed (e.g. for a machine without a GPU, remove the --gpus all flag; on systems that restrict shared memory for the GPU, adjust the --shm-size flag; or change anything else specific to your Docker environment).
docker run --platform linux/amd64 --rm --gpus all \
  --volume "$PWD/data":/data \
  --volume "$PWD/results":/results \
  cospred_docker bash "scripts/run_training.sh"

docker run --platform linux/amd64 --rm --gpus all \
  --volume "$PWD/data":/data \
  --volume "$PWD/results":/results \
  cospred_docker bash "scripts/run_prediction.sh"
- When finished, the final results will be stored in the results folder.
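If you launch the container from a script, the optional flags mentioned above (--gpus all and --shm-size) can be toggled programmatically. The sketch below only assembles the argument list from the flags shown in this README; the helper name `build_docker_cmd` and its parameters are hypothetical and not part of CoSpred.

```python
import shlex
from pathlib import Path


def build_docker_cmd(script, use_gpu=True, shm_size=None, image="cospred_docker"):
    """Assemble the `docker run` argument list for one workflow script."""
    cwd = Path.cwd()
    cmd = ["docker", "run", "--platform", "linux/amd64", "--rm"]
    if use_gpu:
        cmd += ["--gpus", "all"]              # drop on machines without a GPU
    if shm_size:
        cmd.append(f"--shm-size={shm_size}")  # e.g. "32g"
    cmd += ["--volume", f"{cwd / 'data'}:/data"]
    cmd += ["--volume", f"{cwd / 'results'}:/results"]
    cmd += [image, "bash", script]
    return cmd


# CPU-only variant of the training command; print it for inspection.
cmd = build_docker_cmd("scripts/run_training.sh", use_gpu=False)
print(shlex.join(cmd))
```

The printed command can be pasted into a terminal, or the list can be passed to `subprocess.run(cmd, check=True)`.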
For advanced usage, the following step-by-step guides give fine-grained control over the modular execution of CoSpred.
- OPTION 1: Docker with interactive mode
 
docker run --platform linux/amd64 --rm -it --gpus all \
  --volume "$PWD/data":/data \
  --volume "$PWD/results":/results \
  --shm-size=32g \
  cospred_docker
- OPTION 2: Virtual environment
 
conda activate cospred_cuda12_gpu_py39
Once the environment setup is done, navigate to the CoSpred working directory and move on to the following steps.
- The main configuration for file locations and preprocessing parameters can be found and modified in the params folder.
- For all the following modules, log files can be found under the prediction folder.
- For the BiGRU model, training parameters and model setup can be found in JSON files under the model_spectra folder. Edit model_construct.py to modify the architecture and generate a compatible model.json file. Two examples, model_byion.json and model_fullspectrum.json, are provided.
python model_construct.py
- For the Transformer-based model setup, users can directly modify cospred_model/model/transformerEncoder.py, as well as the train_transformer module in training_cospred.py.
- Create a PSM file with identified peptides from software such as Proteome Discoverer, corresponding to each raw file from the experiment.
 
- Convert raw files into mzML and MGF
 
MSConvert can be run using the GUI version of the software on a Windows computer, or via Docker on a Linux machine. We recommend running MSConvert through the Windows GUI. Once the *.mzML and *.mgf files have been generated, keep the two file types in the folders data/example/mzml and data/example/mgf respectively, together with the example_PSMs.txt and example_InputFiles.txt obtained from Proteome Discoverer in the folder data/example. Note that all file and folder names can be defined in params/constants_location.py.
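Before running the preprocessing scripts, it can help to confirm that the layout described above is in place. The following is a small sketch that checks for the folders and files named in this section; the helper `missing_inputs` is hypothetical, and the paths assume the default names rather than anything customized in params/constants_location.py.

```python
from pathlib import Path

# Entries expected directly under data/example, per the layout described above.
EXPECTED = ["mzml", "mgf", "example_PSMs.txt", "example_InputFiles.txt"]


def missing_inputs(example_dir):
    """Return the expected inputs that are absent under example_dir."""
    root = Path(example_dir)
    missing = [name for name in EXPECTED if not (root / name).exists()]
    # An existing but empty spectrum folder is also reported.
    for sub, pattern in (("mzml", "*.mzML"), ("mgf", "*.mgf")):
        folder = root / sub
        if folder.is_dir() and not any(folder.glob(pattern)):
            missing.append(f"{sub}/{pattern}")
    return missing


print(missing_inputs("data/example"))
```

An empty list means every expected input was found.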
- OPTION 1: The MGF file doesn't contain sequence information
- Split the dataset into train and test sets. (About 15 minutes for 300k spectra)
 
By default, 20% of the spectra are randomly selected for testing; this can be modified in the script. example_train.mgf and example_test.mgf will be generated from this step. rawfile2hdf_byion.py (preparing the dataset with b/y ion annotation) and rawfile2hdf_cospred.py (preparing the dataset for the full spectrum representation) are the scripts for this purpose. (2019 MacBook Pro, about 2 minutes for the example dataset)
python rawfile2hdf_cospred.py -w split
- OPTION 1.1: Pair database search results with MGF spectra, and annotate b and y ions for the MS/MS spectra
 
Pyteomics is used to parse annotations of y and b ions and their fragment charges from mzML and MGF, and to write annotated MGF files for downstream spectrum prediction/viewing applications. Note that to parse the input file correctly, you will likely need to adjust the regex routine (in the reformatMGF function within io_cospred.py) according to the specific MGF format you are using. (2019 MacBook Pro, about 1 hour for the example dataset)
python rawfile2hdf_byion.py -w train
python rawfile2hdf_byion.py -w test
- OPTION 1.2: Pair database search results with MGF spectra, and reformat to full MS/MS spectra using bins. (2019 MacBook Pro, about 2 minutes for the example dataset)
 
python rawfile2hdf_cospred.py -w train
python rawfile2hdf_cospred.py -w test
- OPTION 2: For an MGF file with assigned peptide sequences (e.g. example.mgf), reformat to full MS/MS spectra using bins. Note: you may need to reformat the MGF so that the peptide sequence representation is compatible with the downstream steps.
python mgf2hdf_cospred.py -w reformat       # Reformat the MGF for CoSpred workflow.
python mgf2hdf_cospred.py -w split_usi      # Split the dataset into train and test set.
python mgf2hdf_cospred.py -w train          # Convert training set into full spectrum bins
python mgf2hdf_cospred.py -w test           # Convert testing set into full spectrum bins
At the end, a few files will be generated. train.hdf5 and test.hdf5 are the input files for the following ML modules.
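To illustrate what "reformat to full MS/MS spectra using bins" means, here is a minimal, dependency-free sketch that reads peaks from MGF text and places them into a fixed-length vector. It is not the code used by the scripts above: the bin width, the m/z range, keeping the maximum intensity per bin, and base-peak normalization are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def parse_mgf(text):
    """Parse MGF text into a list of {'params': dict, 'peaks': [(mz, intensity)]}."""
    spectra, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None and line:
            if "=" in line and not line[0].isdigit():
                key, value = line.split("=", 1)      # header such as TITLE=...
                current["params"][key] = value
            else:
                mz, intensity = line.split()[:2]     # peak line: "m/z intensity"
                current["peaks"].append((float(mz), float(intensity)))
    return spectra


def bin_spectrum(peaks, max_mz=2000.0, bin_width=1.0):
    """Bin (m/z, intensity) pairs into a fixed-length, base-peak-normalized vector."""
    n_bins = int(max_mz / bin_width)
    bins = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            bins[idx] = max(bins[idx], intensity)    # keep the strongest peak per bin
    top = max(bins)
    return [v / top for v in bins] if top > 0 else bins


DEMO_MGF = """BEGIN IONS
TITLE=demo_scan_1
PEPMASS=400.68
CHARGE=2+
98.06 120.0
148.06 300.0
148.40 60.0
END IONS
"""

spectra = parse_mgf(DEMO_MGF)
vector = bin_spectrum(spectra[0]["peaks"])
print(spectra[0]["params"]["TITLE"], len(vector))
```

Every spectrum maps to a vector of the same length regardless of how many peaks it has, which is what allows the data to be stacked into tensors.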
training_cospred.py is the script for customized training. Workflows can be selected by arguments, including 1) -t: fine-tune / continue training an existing model; 2) -f: opt in to the full MS/MS spectrum model instead of b/y ions; 3) -c: chunk the input dataset (to prevent memory overflow with large datasets); 4) -b: opt in to the BiGRU model instead of the Transformer.
python training_cospred.py -b   # Training B/Y ion spectrum prediction using BiGRU architecture.
python training_cospred.py      # Training B/Y ion spectrum prediction using Transformer architecture.
python training_cospred.py -bf   # Training full spectrum prediction using BiGRU architecture. 
python training_cospred.py -f   # Training full spectrum prediction using Transformer architecture. 
python training_cospred.py -bft   # Fine-tuning full spectrum prediction using BiGRU architecture with pre-trained weights.
During training, model weight files are generated automatically at each epoch under the model_spectra folder. The files are named as follows:
- For B/Y ion, BiGRU model: prosit_byion_[YYYYMMDD]_[HHMMSS]_epoch[integer]_loss[numeric].hdf5
- For full spectrum, BiGRU model: prosit_full_[YYYYMMDD]_[HHMMSS]_epoch[integer]_loss[numeric].hdf5
- For B/Y ion, Transformer model: transformer_byion_[YYYYMMDD]_[HHMMSS]_epoch[integer]_loss[numeric].pt
- For full spectrum, Transformer model: transformer_full_[YYYYMMDD]_[HHMMSS]_epoch[integer]_loss[numeric].pt
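Because the loss is encoded in each file name, the checkpoint with the lowest loss can be picked out with a short script. The sketch below assumes the naming pattern listed above with plain decimal numbers for the epoch and loss fields; the exact number formatting written by the training script may differ, and the helper `best_checkpoint` and the example file names are hypothetical.

```python
import re

# Pattern mirroring the documented naming scheme (number formats are assumed).
CHECKPOINT_RE = re.compile(
    r"^(?P<model>prosit|transformer)_(?P<kind>byion|full)_"
    r"(?P<date>\d{8})_(?P<time>\d{6})_"
    r"epoch(?P<epoch>\d+)_loss(?P<loss>-?\d+(?:\.\d+)?)\.(?:hdf5|pt)$"
)


def best_checkpoint(filenames, model, kind):
    """Return the file name with the lowest loss for the given model and kind."""
    best = None
    for name in filenames:
        match = CHECKPOINT_RE.match(name)
        if not match or match["model"] != model or match["kind"] != kind:
            continue
        loss = float(match["loss"])
        if best is None or loss < best[0]:
            best = (loss, name)
    return best[1] if best else None


# Hypothetical file names for illustration only.
names = [
    "transformer_full_20240101_120000_epoch001_loss0.412.pt",
    "transformer_full_20240101_123000_epoch002_loss0.305.pt",
    "prosit_byion_20240101_120000_epoch001_loss0.250.hdf5",
    "notes.txt",
]
print(best_checkpoint(names, "transformer", "full"))
```

In practice the list of names could come from `os.listdir("model_spectra")`.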
To fine-tune the foundation model or re-train the model, the following scripts and parameters need to be modified. In this demo, we use a non-Unimod chemical modification, "Desthiobiotin", as an example.
- Ensure that the related information for the modification (DTBIA) is properly added to your constants.py file, as follows.
# add to alphabet
ALPHABET = {
    "C(DTBIA)": 26,  # Alphabet
}
# define the mass
MODIFICATION = {
    'DTBIA': 296.185,
}
# add to amino acid
AMINO_ACID["C(DTBIA)"] = AMINO_ACID["C"] + MODIFICATION["DTBIA"]
# define the chemical composition
MODIFICATION_COMPOSITION = {
    'C(DTBIA)': {'H': 24, 'C': 14, 'O': 3, 'N': 4},     # Chemical composition
}
# annotate the novel modification with proforma, so that pyteomics library can parse
VARMOD_PROFORMA = {
    'C(DTBIA)': 'C[+296.185]',
}
Keep the best model under the model_spectra folder as the trained model for the inference phase. Some pre-trained models can be downloaded from FigShare.
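When adding a new modification such as DTBIA to constants.py, it is easy for the declared mass and the chemical composition to drift apart. The following sketch recomputes the monoisotopic mass from the composition given above and compares it with the declared value. The atomic masses are standard monoisotopic values; the variable names and the 0.001 Da tolerance are assumptions for this example.

```python
# Monoisotopic atomic masses (Da).
ATOM_MASS = {"H": 1.00782503, "C": 12.0, "N": 14.0030740, "O": 15.9949146}

# Values copied from the DTBIA entries shown above.
declared_mass = 296.185
composition = {"H": 24, "C": 14, "O": 3, "N": 4}

# Sum the element masses weighted by their counts.
computed_mass = sum(ATOM_MASS[element] * count for element, count in composition.items())
difference = abs(computed_mass - declared_mass)

print(f"computed={computed_mass:.4f} declared={declared_mass:.4f} diff={difference:.4f}")
if difference > 0.001:  # tolerance chosen arbitrarily for this sketch
    raise ValueError("Modification mass and composition disagree")
```

The same check can be repeated for any other modification you add.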
With trained models, predict spectra for the peptide sequences in peptidelist_test.csv. All inference results, including metrics, plots, and the spectral library, will be stored under the prediction folder. Predicted spectra will be stored in speclib_prediction.msp.
Workflows can be selected by arguments, including 1) -f: opt in to the full MS/MS spectrum model instead of b/y ions; 2) -c: chunk the input dataset (to prevent memory overflow with large datasets); 3) -b: opt in to the BiGRU model instead of the Transformer. Some examples are given below.
python prediction.py -b   # Predict B/Y ion spectrum using BiGRU architecture.
python prediction.py      # Predict B/Y ion spectrum using Transformer architecture.
python prediction.py -bf   # Predict full spectrum using BiGRU architecture.
python prediction.py -f   # Predict full spectrum using Transformer architecture.
python prediction.py -fc   # Predict full spectrum using Transformer architecture, with chunking to accommodate a large peptide list.
Optionally, performance evaluation can be executed with the -e argument, as long as ground truth is provided, either a) test.hdf5 or b) example_PSMs.txt with test_usi.mgf, so that reference spectra for the peptides can be extracted from the database search result and the raw mass-spec data. Examples are given below.
python prediction.py -be   # Predict and evaluate B/Y ion spectrum using BiGRU architecture.
python prediction.py -e    # Predict and evaluate B/Y ion spectrum using Transformer architecture.
python prediction.py -bfe   # Predict and evaluate full spectrum using BiGRU architecture.
python prediction.py -fe   # Predict and evaluate full spectrum using Transformer architecture.
python prediction.py -fce   # Predict and evaluate full spectrum using Transformer architecture, with chunking to accommodate a large peptide list.
The outputs of prediction will be generated under the prediction folder, including the predicted spectral libraries speclib_prediction.msp and speclib_prediction.mgf, plots and metrics under prediction_library, and some other intermediate files for recording or diagnostic purposes.
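The single-letter flags used in the commands above are combined in the usual Unix style, so -bfe is equivalent to -b -f -e. The sketch below mirrors the documented flags with argparse to show which workflow a combination selects. It is not the parser used by prediction.py; the long option names and the `describe` helper are hypothetical.

```python
import argparse


def parse_flags(argv):
    """Parse the documented prediction flags from an explicit argument list."""
    parser = argparse.ArgumentParser(description="Sketch of the prediction flags")
    parser.add_argument("-f", "--full", action="store_true",
                        help="full MS/MS spectrum model instead of b/y ions")
    parser.add_argument("-c", "--chunk", action="store_true",
                        help="chunk the input dataset")
    parser.add_argument("-b", "--bigru", action="store_true",
                        help="BiGRU model instead of Transformer")
    parser.add_argument("-e", "--evaluate", action="store_true",
                        help="run performance evaluation")
    return parser.parse_args(argv)


def describe(args):
    """Summarize the workflow selected by the parsed flags."""
    parts = ["BiGRU" if args.bigru else "Transformer",
             "full spectrum" if args.full else "b/y ion"]
    if args.chunk:
        parts.append("chunked")
    if args.evaluate:
        parts.append("with evaluation")
    return ", ".join(parts)


for flags in (["-b"], [], ["-bfe"], ["-fce"]):
    print(" ".join(flags) or "(no flags)", "->", describe(parse_flags(flags)))
```

Passing the argument list explicitly keeps the sketch independent of the command line it is run from.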
Predicted spectra and mirror plots for visual evaluation can be generated separately by spectra_plot.py. By default, the required inputs are peptidelist_predict.csv (peptide list), test_reformatted.mgf (reference spectra), and speclib_prediction.mgf (predicted spectra). File names and locations can be defined in params/constants. Plots will be stored in the prediction/plot folder.
python spectra_plot.py