Quick start

This is a simplified version of the scdiffusionX tutorials.

Step 1: Train the Autoencoder

cd script/training_autoencoder
bash train_autoencoder_multimodal.sbatch

Adjust the data path to your local path. The dataset config file is in script/training_autoencoder/configs/dataset; see the comments in openproblem.yaml for details. The autoencoder config file is in script/training_autoencoder/configs/encoder; see the comments in encoder_multimodal.yaml for details. Checkpoints are saved in script/training_autoencoder/outputs/checkpoints and log files in script/training_autoencoder/outputs/logs.

We recommend encoder_multimodal for most datasets. If the data has more than 50,000 genes and 200,000 peaks, we recommend the larger autoencoder in encoder_multimodal_large. If it has fewer than 5,000 genes and 15,000 peaks, we recommend the smaller autoencoder in encoder_multimodal_small. The norm_type field in the encoder config YAML controls the normalization type. For data generation tasks we recommend batch_norm; for translation tasks we recommend layer_norm, since it generalizes better to out-of-distribution (OOD) data.
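The sizing and normalization recommendations above can be sketched as a small helper. This is only an illustration: the function names pick_encoder_config and pick_norm_type and the task strings are hypothetical and not part of the repository. It also assumes that "and" means both thresholds must hold, so mixed cases fall back to the default config.

```python
def pick_encoder_config(n_genes: int, n_peaks: int) -> str:
    """Return the name of the recommended encoder config (hypothetical helper).

    Assumption: both the gene and the peak threshold must be met
    before switching away from the default config.
    """
    if n_genes > 50_000 and n_peaks > 200_000:
        return "encoder_multimodal_large"
    if n_genes < 5_000 and n_peaks < 15_000:
        return "encoder_multimodal_small"
    return "encoder_multimodal"


def pick_norm_type(task: str) -> str:
    """Return the recommended norm_type for a task (hypothetical helper).

    The task names "generation" and "translation" are labels chosen
    for this sketch only.
    """
    recommended = {
        "generation": "batch_norm",   # data generation task
        "translation": "layer_norm",  # better generalization to OOD data
    }
    if task not in recommended:
        raise ValueError(f"unknown task: {task!r}")
    return recommended[task]


if __name__ == "__main__":
    # Example: a dataset with 20,000 genes and 100,000 peaks.
    print(pick_encoder_config(20_000, 100_000))
    print(pick_norm_type("translation"))
```

The returned name corresponds to a YAML file in script/training_autoencoder/configs/encoder; norm_type still has to be set inside that file by hand.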

Step 2: Train the Diffusion Backbone

cd script/training_diffusion
sh ssh_scripts/multimodal_train.sh

Again, adjust the data path and output path to your own, and set ae_path and encoder_config to the autoencoder you trained in step 1. When training with a condition (such as cell type), set num_class to the number of unique labels. Training is unconditional when num_class is not set.
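As a minimal sketch of how the value for num_class could be derived, the snippet below counts the unique labels in a list. The helper name count_classes and the example labels are made up for illustration. In practice the labels would come from the condition column of your own dataset.

```python
from typing import Hashable, Iterable


def count_classes(labels: Iterable[Hashable]) -> int:
    """Count unique condition labels (hypothetical helper).

    The result is the value to use for num_class when training
    with a condition such as cell type.
    """
    return len(set(labels))


if __name__ == "__main__":
    # Made-up cell type labels, one per cell.
    cell_types = ["B cell", "T cell", "B cell", "Monocyte", "T cell"]
    print(count_classes(cell_types))
```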

Step 3: Generate new data

cd script/training_diffusion
sh ssh_scripts/multimodal_sample.sh

Set MULTIMODAL_MODEL_PATH to the checkpoint path from step 2, and DATA_DIR to your local data path.

The experimental results in the paper can be reproduced with evaluate_script/inference_multi_diff.ipynb

TODO: More details about the hyperparameters and about conditional and unconditional training

Function: Modality translation

For this task, we recommend layer_norm instead of batch_norm, since it is better suited to OOD data. If the condition labels of your source modality do not overlap with those in the training data (for example, with an external dataset), you can train the model unconditionally. In that case, use a clustering method such as Leiden to obtain cluster labels and use them as the covariate_keys for the encoder (to get the size factor).

cd script/training_diffusion
sh ssh_scripts/multimodal_train_translation.sh
sh ssh_scripts/multimodal_translation.sh

You need to change the file paths in both bash files to your local paths. GEN_MODE is the target modality (either “rna” or “atac” for the current model). The training logic of multimodal_train_translation.sh is the same as that of multimodal_train.sh; only the dataset and other hyperparameters differ.

The experimental results in the paper can be reproduced with evaluate_script/translation_multi_diff.ipynb

Function: Gene-Peak regulatory analysis

You need to complete step 1 and step 2 first. The detailed implementation can be found in evaluate_script/regulatory_multi_diff.ipynb