# Quick start This is the simplified version of scdiffusionX tutorials. **Step1: Train the Autoencoder** ``` cd script/training_autoencoder bash train_autoencoder_multimodal.sbatch ``` Adjust the data path to your local path. The dataset config file is in script/training_autoencoder/configs/dataset, see the comments in openproblem.yaml for details. The checkpoint will be saved in script/training_autoencoder/outputs/checkpoints and the log file will be saved in script/training_autoencoder/outputs/logs. The autoencoder config file is in script/training_autoencoder/configs/encoder, see the comments in encoder_multimodal.yaml for details. We recommand to use encoder_multimodal for most of dataset. If the genes and peaks are more than 50,000 and 200,000, we recommand a larger autoencoder in encoder_multimodal_large. If the genes and peaks are less than 5,000 and 15,000, we recommand a smaller autoencoder in encoder_multimodal_small. The `norm_type` in the encoder config yaml control the normalization type. For data generation task, we recommend batch_norm, and for translation task, we recommend layer_norm since it has better generalization for OOD data. **Step2: Train the Diffusion Backbone** ``` cd script/training_diffusion sh ssh_scripts/multimodal_train.sh ``` Again, adjust the data path and output path to your own, and also change the ae_path&encoder_config to the autoencoder you tarined in step 1. When training with condition (like the cell type condition), set the `num_class` to the number of unique labels. The training is unconditional when the `num_class` is not set. **Step3: Generate new data** ``` cd script/training_diffusion sh ssh_scripts/multimodal_sample.sh ``` Change the MULTIMODAL_MODEL_PATH to the checkpoint path in step 2, and the DATA_DIR to your local data path. The experiments results in the paper can be reproduce through `evaluate_script/inference_multi_diff.ipynb` TODO: More details about the hyperpara, conditional and unconditional **Founction: Modality translation** For this task, we recommend you use `layer_norm` instead of `batch_norm` since it fit more for the OOD data. And if your source modality doesn't have a condition label overlap with the training data (like a external dataset), you can use unconditional training to train the model. If so, use a clustering method like leiden to get the cluster label as the covariate_keys for encoder (to get the size factor). ``` cd script/training_diffusion sh ssh_scripts/multimodal_train_translation.sh sh ssh_scripts/multimodal_translation.sh ``` You need to change the file path in both bash file to your local path. The `GEN_MODE` is the target modality (either "rna" or "atac" for current model). The training logic is the same for the multimodal_train_translation.sh and multimodal_train.sh except the dataset and other hyperparameters. The experiments results in the paper can be reproduce through `evaluate_script/translation_multi_diff.ipynb` **Founction: Gene-Peak regulatory analysis** You need to first complete the step1 and step2. The detail implement can be found in ``evaluate_script/regulatory_multi_diff.ipynb``