Benj utility scripts
R installation: Use devtools to install
devtools::install_github("kellislab/benj")Python installation: Use pip
pip install git+https://github.com/kellislab/benjEnvironments under conda/ can be used. Mamba is recommended for environment creation because of the number of packages
On Luria, you can put this script in ~/src/conda_init.sh, where you can run source ~/src/conda_init.sh to activate Conda,
and then you can run conda activate benj to activate the benj environment.
#!/usr/bin/env bash
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/etc/profile.d/conda.sh" ]; then
. "/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/etc/profile.d/conda.sh"
else
export PATH="/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/bin:$PATH"
fi
fi
unset __conda_setup
# <<< conda initialize <<<
The overall strategy is
- Load per-sample H5AD from cellranger H5, including velocyto loom. Barcodes are renamed to “Sample#barcode-1” to make multiomic ATAC+GEX analysis easier, as ArchR uses this format.
- Concatenate all H5AD, add metadata, and run scrublet to remove doublets.
- Integrate cell type iteratively
From a list of sample names and filtered_feature_bc_matrix.h5 files in a file called “items.txt”, you can enter a SLURM script as follows to build H5AD for each:
#!/usr/bin/env bash
#SBATCH -p kellis
#SBATCH --array=1-106%10
#SBATCH --nice=10000
cd /path/to/items/dir
export SAMPLE=$(< "items.txt" awk -v TASK=${SLURM_ARRAY_TASK_ID} 'NR == TASK { print $1 }')
export H5=$(< "items.txt" awk -v TASK=${SLURM_ARRAY_TASK_ID} 'NR == TASK { print $2 }')
export OUTPUT="${SAMPLE}.h5ad"
if [ -f "${OUTPUT}" ]; then
echo "File ${OUTPUT} already created."
else
module load lmod-conda
conda activate pybenj
`type -P time` h5ad_from_cellranger_rna.py --h5 "${H5}" --sample "${SAMPLE}" --output "${OUTPUT}"
fiNow is when sample metadata is included. For every sample in metadata index, the H5AD is read in from the directory.
aggregate_anndata.py --metadata /path/to/md.tsv -d /path/to/h5ad/dir/ -o concatenated.h5ad function integrate {
cls=${1:-overall}
shift
integrate_rna.py -i concatenated.h5ad \
-o "${cls}/${cls}.h5ad" \
-t "${cls}/${cls}_clust.tsv.gz" \
-f "${cls}" \
-b "batch" \
-l "${cls}_clust" \
--hvg 5000 \
--dotplot Mbp Pdgfra Aif1 Gfap Flt1 Pdgfrb Syt1 Slc17a7 Gad1 \
"$@"
}
integrate overall
integrate exc -a ./overall_clust/overall_clust.tsv.gz --subset overall_clust=C1,C2,C3 --plot overall_clust
integrate exc_L23 -a ./exc_clust/exc_clust.tsv.gz --subset exc_clust=C0,C2,C3 --plot overall_clust exc_clust
The ATAC integration pipeline is very similar to the RNA integration pipeline.
As a brief overview, this pipeline uses both ArchR and muon. ArchR is currently better at QC, peaks, and motifs, and muon is better at primary analysis (cell types, UMAPs, multiomics).
So, the same barcodes (ArchR format, “Sample#barcode-1”) is used to ensure no mixup.
The steps consist of:
library(ArchR)
addArchRGenome("hg38")
geneAnnotation = benj::createGeneAnnotationGFF("/path/to/cellranger/refdata/genes/genes.gtf", OrgDb=org.Hs.eg.db::org.Hs.eg.db, dataSource="cellranger", organism="Homo sapiens")
ArrowFiles=createArrowFiles(..., geneAnnotation=geneAnnotation)
proj = createArchRProject(ArrowFiles, outputDirectory="ArchR", geneAnnotation=geneAnnotation)
gzf = gzfile("ArchR_metadata.tsv.gz", "w")
write.table(as.data.frame(proj@cellColData), gzf, sep="\t")
close(gzf)
saveArchRProject(proj, "ArchR")If you don’t have a previous peak annotation, use a tile BED file.
Usually the cellranger-arc aggr pipeline provides a good peak BED that is calculated early in the process, so cellranger can be quit after the BED
#!/usr/bin/env bash
#SBATCH -p kellis
#SBATCH --array=1-106%10
#SBATCH --nice=10000
cd /home/benjames/data/SCORCH/1.Mash_BA9_NAc/ATAC
export SAMPLE=$(< "items.txt" awk -v TASK=${SLURM_ARRAY_TASK_ID} 'NR == TASK { print $1 }')
export FRAG=$(< "items.txt" awk -v TASK=${SLURM_ARRAY_TASK_ID} 'NR == TASK { print $2 }')
export META="/path/to/ArchR_metadata.tsv.gz"
export PEAKS="/path/to/atac_peak_annotation.tsv.gz"
export OUTPUT="${SAMPLE}.h5ad"
if [ -f "${OUTPUT}" ]; then
echo "File ${OUTPUT} already exists"
else
module load lmod-conda
conda activate pybenj
`type -P time` h5ad_from_archr_annotation.py --fragments "${FRAG}" --sample "${SAMPLE}" --cell-metadata "${META}" --peaks "${PEAKS}" --output "${OUTPUT}"
fiNow is when sample metadata is included. For every sample in metadata index, the H5AD is read in from the directory.
aggregate_anndata.py --metadata /path/to/md.tsv -d /path/to/h5ad/dir/ -o concatenated.h5adUsing ArchR style gene estimation, except use the peak set instead of tile matrix.
estimate_gene_accessibility -i concatenated.h5ad -o gacc.h5ad --gtf /path/to/cellranger/genes/genes.gtf.gzOr, if you have annotations already (or after annotations!) you can rank genes and plot:
estimate_gene_accessibility -i concatenated.h5ad -o gacc.h5ad --gtf /path/to/cellranger/genes/genes.gtf.gz --groupby CellTypeVery similar to RNA integration process. But, use sample level batch correction.
function integrate {
cls=${1:-overall}
shift
integrate_atac.py -i concatenated.h5ad \
-o "${cls}/${cls}.h5ad" \
-t "${cls}/${cls}_clust.tsv.gz" \
-f "${cls}" \
-b "Sample" \
-l "${cls}_clust" \
"$@"
}
integrate overall
integrate exc -a ./overall_clust/overall_clust.tsv.gz --subset overall_clust=C1,C2,C3 --plot overall_clust
integrate exc_L23 -a ./exc_clust/exc_clust.tsv.gz --subset exc_clust=C0,C2,C3 --plot overall_clust exc_clustFrom the integrated ATAC, load in the *_clust.tsv.gz files, and addGroupCoverages in ArchR, then call new peaks.
Then, you can iteratively improve the integration.
- Currently, you should process 1) RNA first, using RNA subtyping framework.
- Then, process ATAC alone as single-omic. But, in the integrate() function, add an annotation for the
Standardized single cell data environments allow for easy computation.
This repo works best with the following environment structure per-batch (e.g. from cellranger count):
├── aggr.csv
├── sample_list.txt
├── metrics_summary.tsv
├── ArrowFiles
│ ├── archr_metadata.tsv.gz
│ ├── Sample1.arrow
│ ├── Sample2.arrow
│ ├── Sample3.arrow
│ ├── Sample4.arrow
│ └── Sample5.arrow
├── H5AD
│ ├── cellbender
│ │ ├── Sample1.h5ad
│ │ ├── Sample2.h5ad
│ │ ├── Sample3.h5ad
│ │ ├── Sample4.h5ad
│ │ ├── Sample5.h5ad
│ ├── filtered
│ │ ├── Sample1.h5ad
│ │ ├── Sample2.h5ad
│ │ ├── Sample3.h5ad
│ │ ├── Sample4.h5ad
│ │ ├── Sample5.h5ad
│ └── raw
│ ├── Sample1.h5ad
│ ├── Sample2.h5ad
│ ├── Sample3.h5ad
│ ├── Sample4.h5ad
│ └── Sample5.h5ad
├── Sample1
│ ├── _cmdline
│ ├── _filelist
│ ├── _finalstate
│ ├── _invocation
│ ├── _jobmode
│ ├── _log
│ ├── _mrosource
│ ├── outs
...
├── Sample2
│ ├── _cmdline
│ ├── _filelist
│ ├── _finalstate
│ ├── _invocation
│ ├── _jobmode
│ ├── _log
│ ├── _mrosource
│ ├── outs
...
The script check_batch.sh allows a user to check if a certain batch matches what is expected from this structure to ensure uniform quality.
aggr.csv is generated by benj::make_aggregation_table() or something similar, where sample name is the 1st column, the atac_fragments are contained in another column if ATAC is included.
Other columns should be metadata to include in H5AD per sample, such as tissue or post-mortem interval, or case/control status.
List of samples. Ideally used for cellranger-count, but used to keep track of the directories output from count. May not necessarily be the rownames in aggr.csv if you want to rename after counting.
Summary of metrics per sample, such as # of reads, # of estimated cells. Computed by:
cellranger_metrics_summary -i ./*/outs/*summary.csv -o metrics_summary.tsvand can be used, e.g. extracting expected number of cells for cellbender.
H5AD files should be computed using h5ad_from_cellranger_rna.sh, for each of filtered_feature_bc_matrix, raw_feature_bc_matrix, and cellbender_filtered.
Note that barcodes will use ArchR format, in the case of multiome data, to allow for shared barcodes to trivially overlap.
You can rename the sample if necessary. Note that velocyto counts will be included by default if exists, and cellranger counts will be aggregated into the cellranger directory if those exist.
For example, if you have the sample names used in cellranger count in sample_list.txt and you have the new names as the rownames in aggr.csv, you can use;
paste <(cat sample_list.txt) <(tail -n+2 <aggr.csv | cut -d , -f 1) | awk -F, '{ print "h5ad_from_cellranger_rna.sh \"$1\" \"$2\"" }' | xargs -P8 sh -cOr, if they are the same:
< sample_list.txt xargs -P8 h5ad_from_cellranger_rna.shSpeaking of cellbender, if computed, you should place the H5 files into the outs/ subdirectory per counts directory, just like filtered_feature_bc_matrix.h5.
Using velocyto is easy now. If you’re in a conda env in the batch directory, you can use:
velocyto_slurmgen.sh -b /path/to/batch/directory | sbatch -p my_queue_nameto generate velocyto counts.
For ATAC, ArchR metadata is later used for metadata, and arrow files are needed regardless for secondary analysis. To compute Arrow files for each fragment file, and compute metadata,
mkdir ArrowFiles && pushd ArrowFiles
archr_from_fragments.R -a ../aggr.csv --tss 1 -g /path/to/ArchR_refdata-cellranger-arc-GRCh38-2020-A-2.0.0.rds
popdI currently have the file ~/src/conda_init.sh on Luria as:
#!/usr/bin/env bash
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/etc/profile.d/conda.sh" ]; then
. "/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/etc/profile.d/conda.sh"
else
export PATH="/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/bin:$PATH"
fi
fi
unset __conda_setup
# <<< conda initialize <<<which should allow anyone with group access to use the benj/pybenj/rbenj environments as updated.
In ~/.bashrc of BOTH the host and all clients/desktops you want to connect from, set
export JUPYTER_LURIA_PREFIX=110or some number above 80. This will help if anyone else is using the port number on Luria.
The simple SSH config here ~/.ssh/config should keep things simple.
Please use an SSH key to make things easy.
Host luria.mit.edu
HostName 10.159.3.125
User your_kerb
PubkeyAcceptedKeyTypes +ssh-rsa
Match exec "echo %h | grep -qE '^b[0-9]+$'"
HostName %h
ProxyJump luria.mit.edu
User your_kerb
If you are on GNU/Linux that uses SystemD, you could use the following per-user service file, ~/.config/systemd/user/jupyter_luria@.service:
[Unit]
Description=jupyter_luria.sh "%I"
After=network.target
[Service]
ExecStart=%h/.local/bin/jupyter_luria.sh %i
Restart=on-failure
RestartSec=5s
Environment="JUPYTER_LURIA_PREFIX=95"
[Install]
WantedBy=default.target
assuming jupyter_luria.sh is in ~/.local/bin/ (you can find this out with type -P jupyter_luria.sh),
and the JUPYTER_LURIA_PREFIX is the same as on the Luria server.
Then, systemctl --user daemon-reload and systemctl --user start jupyter_luria@b3 should connect you to b3.
Then, to launch a Jupyter job on Luria,
sbatch jupyter_luria.sh benjto launch a jupyter lab in the environment named “benj”.
Look at the hostname of the just-launched job via
squeue -u ${USER}for example, b3, and connect on your laptop/desktop with:
jupyter_luria.sh b3Then, you should go in your web browser and connect to https://localhost:XXXX/ where XXXX is the port output by the script.
If you have trouble connecting, try running jupyter lab password in a compute node to use password-based login instead of token-based.
To use, in your ~/.bash_profile, put
module use /path/to/this/repo/modulesand re-login to view changes.
Conda integration is at modules/lmod-conda
To change the default Conda root directory, replace ~/data/miniconda3 with your conda root directory.
- https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/downloads/latest
- https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest
- https://support.10xgenomics.com/single-cell-atac/software/downloads/latest
Download from https://genome.ucsc.edu/cgi-bin/hgTables