Quick start
This is a simplified version of the scdiffusionX tutorials.
Step 1: Train the Autoencoder
cd script/training_autoencoder
bash train_autoencoder_multimodal.sbatch
Adjust the data path to your local path. The dataset config file is in script/training_autoencoder/configs/dataset; see the comments in openproblem.yaml for details. The autoencoder config file is in script/training_autoencoder/configs/encoder; see the comments in encoder_multimodal.yaml for details. Checkpoints are saved in script/training_autoencoder/outputs/checkpoints and log files in script/training_autoencoder/outputs/logs.
We recommend encoder_multimodal for most datasets. If the dataset has more than 50,000 genes and 200,000 peaks, we recommend the larger autoencoder in encoder_multimodal_large. If it has fewer than 5,000 genes and 15,000 peaks, we recommend the smaller autoencoder in encoder_multimodal_small. The norm_type field in the encoder config yaml controls the normalization type. For data generation tasks we recommend batch_norm; for translation tasks we recommend layer_norm, since it generalizes better to out-of-distribution (OOD) data.
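The recommendations above can be written as a small lookup. This is only a sketch: the function names are hypothetical and not part of the repository, and it assumes that both thresholds (genes and peaks) must be crossed before switching away from the default config. Mixed cases fall back to encoder_multimodal.

```python
def pick_encoder_config(n_genes: int, n_peaks: int) -> str:
    """Return the encoder config name suggested by the feature counts.

    Assumption: the large/small configs apply only when BOTH the gene
    and the peak thresholds are crossed; everything else uses the default.
    """
    if n_genes > 50_000 and n_peaks > 200_000:
        return "encoder_multimodal_large"
    if n_genes < 5_000 and n_peaks < 15_000:
        return "encoder_multimodal_small"
    return "encoder_multimodal"


def pick_norm_type(task: str) -> str:
    """Return the recommended norm_type for "generation" or "translation"."""
    recommended = {"generation": "batch_norm", "translation": "layer_norm"}
    if task not in recommended:
        raise ValueError(f"unknown task: {task!r}")
    return recommended[task]


# Hypothetical dataset with 20,000 genes and 100,000 peaks.
print(pick_encoder_config(n_genes=20_000, n_peaks=100_000))  # encoder_multimodal
print(pick_norm_type("translation"))  # layer_norm
```

The returned names correspond to the yaml files in script/training_autoencoder/configs/encoder; the norm_type value still has to be set in the chosen yaml by hand.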
Step 2: Train the Diffusion Backbone
cd script/training_diffusion
sh ssh_scripts/multimodal_train.sh
Again, adjust the data path and output path to your own, and set ae_path and encoder_config to the autoencoder you trained in step 1. When training with a condition (such as cell type), set num_class to the number of unique labels. If num_class is not set, training is unconditional.
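To find the value for num_class, count the distinct labels in your condition column. The sketch below uses a plain Python list as a stand-in for that column; the helper name and the example labels are made up for illustration. With an AnnData object, the labels would typically come from a column of adata.obs, whose name depends on your dataset.

```python
from collections import Counter


def count_classes(labels):
    """Return (number of unique labels, per-label cell counts)."""
    counts = Counter(labels)
    return len(counts), counts


# Hypothetical cell type annotations, one entry per cell.
cell_types = ["B cell", "T cell", "T cell", "Monocyte", "B cell"]

num_class, per_label = count_classes(cell_types)
print(num_class)  # 3
print(per_label["T cell"])  # 2
```

The per-label counts are not needed by the training script, but they make it easy to spot very rare labels before training with a condition.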
Step 3: Generate new data
cd script/training_diffusion
sh ssh_scripts/multimodal_sample.sh
Set MULTIMODAL_MODEL_PATH to the checkpoint path from step 2 and DATA_DIR to your local data path.
The experimental results in the paper can be reproduced with evaluate_script/inference_multi_diff.ipynb
TODO: More details about the hyperparameters and about conditional and unconditional training
Function: Modality translation
For this task, we recommend layer_norm instead of batch_norm, since it is better suited to OOD data. If the condition labels of your source modality do not overlap with those in the training data (for example, when the source comes from an external dataset), you can train the model unconditionally. In that case, use a clustering method such as Leiden to obtain cluster labels, and use them as the covariate_keys for the encoder (to get the size factor).
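One way to decide between conditional and unconditional training is to compare the label sets directly. The following is a minimal sketch under two assumptions: the helper is hypothetical and not part of the repository, and "overlap" is read strictly, so any source label that is missing from the training data (or a source with no labels at all) leads to unconditional training.

```python
def choose_training_mode(train_labels, source_labels):
    """Return (mode, unseen), where unseen lists source labels absent from training."""
    train = set(train_labels)
    source = set(source_labels)
    unseen = sorted(source - train)
    # Assumption: no labels at all, or any unseen label, means unconditional.
    if not source or unseen:
        return "unconditional", unseen
    return "conditional", unseen


# Hypothetical label sets for the training data and an external source dataset.
mode, unseen = choose_training_mode(["B cell", "T cell"], ["T cell", "NK cell"])
print(mode)    # unconditional
print(unseen)  # ['NK cell']
```

When the result is "unconditional", the cluster labels for covariate_keys still have to be computed separately, for example with a Leiden implementation such as scanpy's sc.tl.leiden.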
cd script/training_diffusion
sh ssh_scripts/multimodal_train_translation.sh
sh ssh_scripts/multimodal_translation.sh
You need to change the file paths in both bash files to your local paths. GEN_MODE is the target modality (either "rna" or "atac" for the current model). The training logic of multimodal_train_translation.sh is the same as that of multimodal_train.sh; only the dataset and other hyperparameters differ.
The experimental results in the paper can be reproduced with evaluate_script/translation_multi_diff.ipynb
Function: Gene-Peak regulatory analysis
You need to complete step 1 and step 2 first. The detailed implementation can be found in evaluate_script/regulatory_multi_diff.ipynb