A GPU-accelerated tool for large-scale scRNA-seq pipeline.
Highlights • Why ScaleSC • Installation • Tutorial • API Reference
- Fast scRNA-seq pipeline including QC, Normalization, Batch-effect Removal, Dimension Reduction in a similar syntax as
scanpyandrapids-singlecell. - Scale to dataset with more than 10M cells on a single GPU. (A100 80G)
- Chunk the data to avoid the
int32limitation incupyx.scipy.sparseused byrapids-singlecellthat disables the computing for moderate-size datasets (~1.3M) without Multi-GPU support. - Reconcile output at each step to
scanpyto reproduce the same results as on CPU end. - Improvement on
harmonypywhich allows dataset with more than 10M cells and more than 1000 samples to be run on a single GPU. - Speedup and optimize
NSForestalgorithm using GPU for better maker gene identification. - Merge clusters according to the gene expression of detected markers by
NSForest.
Scanpy |
ScaleSC |
Rapids-singlecell |
|
|---|---|---|---|
| GPU Support | ❌ | ✅ | ✅ |
int32 Issue in Sparse |
✅ | ✅ | ❌ |
| Upper Limit of #cell | 5M | ~20M | ~1M |
| Upper Limit of #sample | <100 | >1000 | <100 |
Requirements:
- RAPIDS from Nvidia
- rapids-singlecell, an alternative of scanpy that employs GPU for acceleration.
- Conda, version >=22.11 is strongly encouraged, because conda-libmamba-solver is set as default, which significantly speeds up solving dependencies.
- pip, a python package installer.
Environment Setup:
-
Install RAPIDS through Conda,
conda create -n ScaleSC -c rapidsai -c conda-forge -c nvidia rapids=25.02 python=3.12 'cuda-version>=12.0,<=12.8'Users have the flexibility to install it according to their systems by using this online selector. We highly recommend installing**RAPIDS**>=24.12, it solves a bug related to the Leiden algorithm, which results in too many clusters. -
Activate conda env,
conda activate ScaleSC -
Install rapids-singlecell using pip,
pip install rapids-singlecell -
Install ScaleSC,
- Pull ScaleSC from GitHub
git clone https://github.com/interactivereport/ScaleSC.git - Enter the folder and install ScaleSC
cd scalesc
pip install .
- Pull ScaleSC from GitHub
-
Check env:
python -c "import scalesc; print(scalesc.__version__)"== 0.1.0python -c "import cupy; print(cupy.__version__)">= 13.3.0python -c "import cuml; print(cuml.__version__)">= 24.10python -c "import cupy; print(cupy.cuda.is_available())"= Truepython -c "import xgboost; print(xgboost.__version__)>= 2.1.1, optionally for marker annotation
- See this tutorial for details.
Please cite ScaleSC, and Scanpy, Rapids-singlecell, NSForest, AnnData according to their instructions respectively.
- 2/26/2025:
- adding a parameter
thresholdin functionadata_cluster_mergeto support cluster merging at various scales according to the user's specification.thresholdis between 0 and 1. Set to 0 by default. - Updating a few more examples of cluster merging in the tutorial.
- future work: adding support for loading from large
.h5adfiles.
- adding a parameter
ScaleSC integrated pipeline in a scanpy-like style.
It will automatically load the dataset in chunks, see scalesc.util.AnnDataBatchReader for details, and all methods in this class manipulate this chunked data.
Args:
data_dir(str): Data folder of the dataset.max_cell_batch(int): Maximum number of cells in a single batch.Default: 100000.preload_on_cpu(bool): If load the entire chunked data on CPU. Default:Truepreload_on_gpu(bool): If the entire chunked data is on GPU,preload_on_cpuwill be overwritten toTruewhen this is set toTrue. The default isTrue.save_raw_counts(bool): If saveadata_Xto disk after QC filtering.Default: False.save_norm_counts(bool): If saveadata_Xdata to disk after normalization.Default: False.save_after_each_step(bool): If saveadata(without .X) to disk after each step.Default: False.output_dir(str): Output folder. Default: './results'.gpus(list): List of GPU ids,[0]is set if this is None. Default: None.
__init__(
data_dir,
max_cell_batch=100000.0,
preload_on_cpu=True,
preload_on_gpu=True,
save_raw_counts=False,
save_norm_counts=False,
save_after_each_step=False,
output_dir='results',
gpus=None
)AnnData: An AnnData object that is used to store all intermediate results without the count matrix.
Note: This is always on the CPU.
AnnData: An AnnData object that is used to store all intermediate results, including the count matrix. Internally, all chunks should be merged on CPU to avoid high GPU consumption; make sure to invoke to_CPU() before calling this object.
calculate_qc_metrics()Calculate quality control metrics.
clear()Clean the memory
filter_cells(min_count=0, max_count=None, qc_var='n_genes_by_counts', qc=False)Filter genes based on the number of a QC metric.
Args:
min_count(int): Minimum number of counts required for a cell to pass filtering.max_count(int): Maximum number of counts required for a cell to pass filtering.qc_var(str='n_genes_by_counts'): Feature in QC metrics that is used to filter cells.qc(bool=False): Callcalculate_qc_metricsbefore filtering.
filter_genes(min_count=0, max_count=None, qc_var='n_cells_by_counts', qc=False)Filter genes based on the number of a QC metric.
Args:
min_count(int): Minimum number of counts required for a gene to pass filtering.max_count(int): Maximum number of counts required for a gene to pass filtering.qc_var(str='n_cells_by_counts'): Feature in QC metrics that is used to filter genes.qc(bool=False): Callcalculate_qc_metricsbefore filtering.
filter_genes_and_cells(
min_counts_per_gene=0,
min_counts_per_cell=0,
max_counts_per_gene=None,
max_counts_per_cell=None,
qc_var_gene='n_cells_by_counts',
qc_var_cell='n_genes_by_counts',
qc=False
)Filter genes based on the number of a QC metric.
Note:
This is an efficient way to perform regular filtering on genes and cells without repeatedly iterating over chunks.
Args:
min_counts_per_gene(int): Minimum number of counts required for a gene to pass filtering.max_counts_per_gene(int): Maximum number of counts required for a gene to pass filtering.qc_var_gene(str='n_cells_by_counts'): Feature in QC metrics that is used to filter genes.min_counts_per_cell(int): Minimum number of counts required for a cell to pass filtering.max_counts_per_cell(int): Maximum number of counts required for a cell to pass filtering.qc_var_cell(str='n_genes_by_counts'): Feature in QC metrics that is used to filter cells.qc(bool=False): Callcalculate_qc_metricsbefore filtering.
harmony(sample_col_name, n_init=10, max_iter_harmony=20)Use Harmony to integrate different experiments.
Note:
This modified harmony function can easily scale up to 15M cells with 50 pcs on GPU (A100 80G). Result after harmony is stored into
adata.obsm['X_pca_harmony'].
Args:
sample_col_name(str): Column of sample ID.n_init(int=10): Number of times the k-means algorithm is run with different centroid seeds.max_iter_harmony(int=20): Maximum iteration number of harmony.
highly_variable_genes(n_top_genes=4000, method='seurat_v3')Annotate highly variable genes.
Note:
Only
seurat_v3is implemented. The raw count matrix is expected as input forseurat_v3. HVGs are set toTrueinadata.var['highly_variable'].
Args:
n_top_genes(int=4000): Number of highly-variable genes to keep.method(str='seurat_v3'): Choose the flavor for identifying highly variable genes.
leiden(resolution=0.5, random_state=42)Performs Leiden clustering using rapids-singlecell.
Args:
resolution(float=0.5): A parameter value controlling the coarseness of the clustering. (called gamma in the modularity formula). Higher values lead to more clusters.random_state(int=42): Random seed.
neighbors(n_neighbors=20, n_pcs=50, use_rep='X_pac_harmony', algorithm='cagra')Compute a neighborhood graph of observations using rapids-singlecell.
Args:
n_neighbors(int=20): The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation.n_pcs(int=50): Use this many PCs.use_rep(str='X_pca_harmony'): Use the indicated representation.algorithm(str='cagra'): The query algorithm to use.
normalize_log1p(target_sum=10000.0)Normalize counts per cell, then log1p.
Note:
If
save_raw_countsorsave_norm_countsis set, writeadata_Xto disk here automatically.
Args:
target_sum(int=1e4): If None, after normalization, each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization.
normalize_log1p_pca(
target_sum=10000.0,
n_components=50,
hvg_var='highly_variable'
)An alternative for calling normalize_log1p and pca together.
Note:
Used when
preload_on_cpuisFalse.
pca(n_components=50, hvg_var='highly_variable')Principal component analysis.
Computes PCA coordinates, loadings, and variance decomposition. Uses the implementation of scikit-learn.
Note:
Flip the directions according to the largest values in loadings. Results will match up with scanpy perfectly. Calculated PCA matrix is stored in
adata.obsm['X_pca'].
Args:
n_components(int=50): Number of principal components to compute.hvg_var(str='highly_variable'): Use highly variable genes only.
save(data_name=None)Save adata to disk.
Note:
Save to '
output_dir/data_name.h5ad'.
Args:
data_name(str): IfNone, set asdata_dir.
savex(name, data_name=None)Save adata to disk in chunks.
Note:
Each chunk will be saved individually in a subfolder under
output_dir. Save to 'output_dir/name/data_name_i.h5ad'.
Args:
name(str): Subfolder name.data_name(str): IfNone, set asdata_dir.
to_CPU()Move all chunks to the CPU.
to_GPU()Move all chunks to the GPU.
umap(random_state=42)Embed the neighborhood graph using rapids-singlecell.
Args:
random_state(int=42): Random seed.
Chunked dataloader for extremely large single-cell dataset. Return a data chunk each time for further processing.
__init__(
data_dir,
preload_on_cpu=True,
preload_on_gpu=False,
gpus=None,
max_cell_batch=100000,
max_gpu_memory_usage=48.0,
return_anndata=True
)batch_to_CPU()batch_to_GPU()batchify(axis='cell')Return a data generator if preload_on_cpu is set as True.
clear()get_merged_adata_with_X()gpu_wrapper(generator)read(fname)set_cells_filter(filter, update=True)Update the cells filter and apply it to data chunks if update is set to True, otherwise, update the filter only.
set_genes_filter(filter, update=True)Update genes filter and apply on data chunks if update set to True, otherwise, update filter only.
Note:
Genes filter can be set sequentially; a new filter should always be compatible with the previously filtered data.
update_by_cells_filter(filter)update_by_genes_filter(filter)This file was automatically generated via lazydocs.


