Training scDiffusion-X

Clone the scDiffusion-X repository to your local machine:

git clone https://github.com/EperLuo/scDiffusion-X.git
cd scDiffusion-X

Set up the conda environment (see the Installation section).

Data preparation

The training data for scDiffusion-X should contain two modalities and be saved in the muon h5mu format. For scRNA-seq data, the raw count expression matrix should be stored in mdata['rna'].X. For scATAC-seq data, the binary chromatin accessibility (openness) matrix should be stored in mdata['atac'].X. Meta-information such as cell type can be placed in mdata['rna'].obs['cell_type']. See the example data on figshare for details.
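
A minimal sketch of assembling such a file with the anndata and mudata packages, assuming dense cells-by-features NumPy inputs. The helper names binarize_peaks and build_mudata, the cell/gene/peak naming scheme, and the float32 dtype are illustrative assumptions and not part of scDiffusion-X; only the 'rna' and 'atac' modality keys and the obs['cell_type'] column come from the description above.

```python
import numpy as np


def binarize_peaks(atac_counts):
    """Return a 0/1 matrix that is 1 wherever a peak has at least one read."""
    return (np.asarray(atac_counts) > 0).astype(np.float32)


def build_mudata(rna_counts, atac_counts, cell_types, out_path):
    """Write paired RNA/ATAC matrices (cells x features) to an h5mu file.

    Hypothetical helper: the third-party packages are imported lazily so
    that binarize_peaks() stays usable without them.
    """
    import anndata as ad
    import mudata as md
    import pandas as pd

    rna_x = np.asarray(rna_counts, dtype=np.float32)  # raw counts, not normalized
    atac_x = binarize_peaks(atac_counts)

    # Assumed naming scheme; paired cells share the same IDs in both modalities.
    cell_ids = [f"cell_{i}" for i in range(rna_x.shape[0])]
    rna = ad.AnnData(
        X=rna_x,
        obs=pd.DataFrame({"cell_type": list(cell_types)}, index=cell_ids),
        var=pd.DataFrame(index=[f"gene_{j}" for j in range(rna_x.shape[1])]),
    )
    atac = ad.AnnData(
        X=atac_x,
        obs=pd.DataFrame(index=cell_ids),
        var=pd.DataFrame(index=[f"peak_{j}" for j in range(atac_x.shape[1])]),
    )

    mdata = md.MuData({"rna": rna, "atac": atac})
    mdata.write(out_path)  # out_path is expected to end in .h5mu
    return mdata
```

In practice the gene and peak names would come from your own data rather than the placeholder names used here.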

Train the Autoencoder

After organizing the data, you can start to train the multi-modal Autoencoder.

cd script/training_autoencoder
bash train_autoencoder_multimodal.sbatch

Adjust the data path to your local path. The dataset config files are in script/training_autoencoder/configs/dataset; see the comments in openproblem.yaml for details. The autoencoder config files are in script/training_autoencoder/configs/encoder; see the comments in encoder_multimodal.yaml for details. Checkpoints are saved in script/training_autoencoder/outputs/checkpoints and log files in script/training_autoencoder/outputs/logs.

There are three autoencoder sizes: encoder_multimodal_small, encoder_multimodal, and encoder_multimodal_large. We recommend encoder_multimodal (corresponding to encoder_multimodal.yaml) for most datasets. If the dataset has more than 50,000 genes and 200,000 peaks, we recommend the larger autoencoder in encoder_multimodal_large.yaml. If it has fewer than 5,000 genes and 15,000 peaks, we recommend the smaller autoencoder in encoder_multimodal_small.yaml. The norm_type field in the encoder config YAML controls the normalization type. For data generation tasks we recommend batch_norm, and for translation tasks we recommend layer_norm, since it generalizes better to out-of-distribution (OOD) data.
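
The sizing rule can be sketched as a small helper. The function names pick_encoder_config and pick_norm_type and the task strings "generation" and "translation" are hypothetical. The sketch also assumes that both the gene and the peak threshold must be met before switching away from the default, which is one reading of the recommendation above.

```python
def pick_encoder_config(n_genes, n_peaks):
    """Suggest an encoder config file name from the feature counts."""
    if n_genes > 50_000 and n_peaks > 200_000:
        return "encoder_multimodal_large.yaml"
    if n_genes < 5_000 and n_peaks < 15_000:
        return "encoder_multimodal_small.yaml"
    # Default recommended for most datasets.
    return "encoder_multimodal.yaml"


def pick_norm_type(task):
    """Suggest a norm_type value for a task ("generation" or "translation")."""
    norm_by_task = {"generation": "batch_norm", "translation": "layer_norm"}
    if task not in norm_by_task:
        raise ValueError(f"unknown task: {task!r}")
    return norm_by_task[task]


print(pick_encoder_config(2_500, 5_542))  # prints encoder_multimodal_small.yaml
print(pick_norm_type("translation"))      # prints layer_norm
```

Datasets that cross only one of the two thresholds fall through to the default here; treat that as a starting point rather than a rule.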

Note that the smallest dataset used in our experiments has 5,000 cells with 2,500 genes and 5,542 peaks, and the model performs very well at this scale. For datasets with even fewer cells, we recommend using a smaller model by adjusting the model hyperparameters yourself.

Train the diffusion backbone

cd script/training_diffusion
sh ssh_scripts/multimodal_train.sh

Again, adjust the data path and output path to your own, and set ae_path and encoder_config in the sh file to the autoencoder you trained in the autoencoder training step. The rna_dim and atac_dim parameters are the dimensions of the latent representations; change them to match the autoencoder you used (refer to the encoder's config file). When training with a condition (such as cell type), set num_class to the number of unique labels. Training is unconditional when num_class is not set.

Also, change the devices and NUM_GPUS parameters to match your hardware. The total batch size is the number of GPUs multiplied by batch_size.
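
As a quick check before editing the sh file, the two derived values mentioned above (the number of classes and the total batch size) can be computed as follows. The helper names and the example values are hypothetical; in practice the labels would come from mdata['rna'].obs['cell_type'].

```python
def count_classes(labels):
    """Number of unique condition labels, i.e. the value to use for num_class."""
    return len(set(labels))


def total_batch_size(num_gpus, batch_size):
    """Total batch size across all GPUs."""
    return num_gpus * batch_size


example_labels = ["B cell", "T cell", "B cell", "Monocyte"]
print(count_classes(example_labels))  # prints 3
print(total_batch_size(4, 128))       # prints 512
```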

Pretrained model

We provide a model pretrained on the miniatlas dataset (Wu J, et al. EpiFoundation: A Foundation Model for Single-Cell ATAC-seq via Peak-to-Gene Alignment). This dataset contains more than 130,000 scATAC-seq profiles with paired scRNA-seq, across 57 cell types. The pretrained model weights and the training data are available at: https://figshare.com/s/14610d6d67160366aba2.

The complete list of cell types: ['Acinar cell', 'Adipocyte', 'Alpha cell', 'Amacrine cell', 'Astrocyte', 'B cell', 'Beta cell', 'Bipolar cell', 'CD4 T', 'Capillary EC', 'Cardiomyocyte', 'Colonocyte', 'Cone cell', 'Delta cell', 'Endocardial cell', 'Endocrine cell', 'Endothelial cell', 'Enterocyte', 'Epithelial cell', 'Erythroblast', 'Excitatory neuron', 'Fibroblast', 'Fibroblasts', 'Glia', 'Goblet cell', 'Hepatocyte', 'Horizontal cell', 'Inhibitory neuron', 'Leyding cell', 'Luminal cell', 'Macrophage', 'Mast cell', 'Mesothelial cell', 'Microfold cell', 'Microglia', 'Monocyte', 'Myofibroblast', 'Neuron', 'Nk cell', 'Oligodendrocyte', 'Oligodendrocyte progenitor cell', 'PP cell', 'Paneth cell', 'Pericyte', 'Plasma cell', 'Podocyte', 'Proerythroblast', 'Renal epithelial cell - Loop of Henle', 'Renal epithelial cell - distal tubules', 'Renal epithelial cell - proximal tubules', 'Rod cell', 'Smooth muscle cell', 'T cell', 'TUBA1A ductal cell', 'Tuft cell', 'Type A intercalated cell', 'Type B intercalated cell']

When generating a new dataset, the cell type index follows the order of the list above, e.g. 0 for Acinar cell and 1 for Adipocyte.
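
A minimal sketch of looking up the index for a given cell type. The names CELL_TYPES and CELL_TYPE_TO_INDEX are hypothetical; the labels are copied verbatim from the list above, including their original spelling. The list happens to match Python's default string sort order, but whether the training code derives the indices that way is an assumption.

```python
# Labels copied verbatim from the list above; position in the list = type index.
CELL_TYPES = [
    "Acinar cell", "Adipocyte", "Alpha cell", "Amacrine cell", "Astrocyte",
    "B cell", "Beta cell", "Bipolar cell", "CD4 T", "Capillary EC",
    "Cardiomyocyte", "Colonocyte", "Cone cell", "Delta cell",
    "Endocardial cell", "Endocrine cell", "Endothelial cell", "Enterocyte",
    "Epithelial cell", "Erythroblast", "Excitatory neuron", "Fibroblast",
    "Fibroblasts", "Glia", "Goblet cell", "Hepatocyte", "Horizontal cell",
    "Inhibitory neuron", "Leyding cell", "Luminal cell", "Macrophage",
    "Mast cell", "Mesothelial cell", "Microfold cell", "Microglia",
    "Monocyte", "Myofibroblast", "Neuron", "Nk cell", "Oligodendrocyte",
    "Oligodendrocyte progenitor cell", "PP cell", "Paneth cell", "Pericyte",
    "Plasma cell", "Podocyte", "Proerythroblast",
    "Renal epithelial cell - Loop of Henle",
    "Renal epithelial cell - distal tubules",
    "Renal epithelial cell - proximal tubules",
    "Rod cell", "Smooth muscle cell", "T cell", "TUBA1A ductal cell",
    "Tuft cell", "Type A intercalated cell", "Type B intercalated cell",
]

# Map each cell type name to its index in the list.
CELL_TYPE_TO_INDEX = {name: idx for idx, name in enumerate(CELL_TYPES)}

print(CELL_TYPE_TO_INDEX["Acinar cell"])  # prints 0
print(CELL_TYPE_TO_INDEX["Adipocyte"])    # prints 1
```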