Inputs of MIDAS#

This section explains how to prepare and load input data for MIDAS. The tool requires two main components as input: multi-modal data and masks. Data for each modality is stored separately, while the mask indicates the presence or absence of features.

Inputs Overview#

Data#

MIDAS supports multi-modal data inputs, including both paired and unpaired data. By default, it supports the following modalities:

  • RNA: RNA counts, integer values.

  • ADT: Protein counts, integer values.

  • ATAC: ATAC peaks, integer values (binarized during training).

For custom modalities, refer to the Advanced Development Instructions in the Tutorials/Advanced section to configure additional data types.

Mask#

Masks indicate the presence (1) or absence (0) of features in each modality for every batch.

  • Mask Format:

    Each batch and modality should have a corresponding CSV mask file. The CSV file must have the following structure:

    • Shape: 1 x m (1 row and m columns, where m is the number of features).

    • Includes: A header and an index column.

  • Default Behavior:

    If no mask file is provided, MIDAS assumes that all features are present.

To specify mask files, provide the paths in the format below:

mask_config = [
    {'rna': 'batch_1_rna_mask.csv', 'adt': 'batch_1_adt_mask.csv'},
    {'rna': 'batch_2_rna_mask.csv', 'adt': 'batch_2_adt_mask.csv'},
    {'rna': 'batch_3_rna_mask.csv', 'adt': 'batch_3_adt_mask.csv'}
]

Initial Setup#

Begin by importing the necessary modules and loading default configurations:

from scmidas.model import MIDAS
from scmidas.config import load_config

# Settings for the model, such as the layer dimensions.
configs = load_config()

Approach 1: Loading Data from Single-level Directories#

Follow the steps below to prepare and load input data into MIDAS.

Step 1: Configuring Data#

MIDAS supports two main formats for loading data. Choose the one that best suits your dataset’s size and structure.

Option 1: CSV per Modality and Batch#

  • Description:

    Data for each modality and batch is stored in separate CSV files. Each file represents a cell x feature matrix, where:

    • Rows: Cells

    • Columns: Features

    • The file includes a header and an index column.

  • Example Configuration:
    # Data for each modality and batch
    data_config = [
        {'rna': 'batch_1_rna.csv', 'adt': 'batch_1_adt.csv', 'atac': 'batch_1_atac.csv'},
        {'rna': 'batch_2_rna.csv', 'adt': 'batch_2_adt.csv', 'atac': 'batch_2_atac.csv'},
        {'rna': 'batch_3_rna.csv', 'adt': 'batch_3_adt.csv', 'atac': 'batch_3_atac.csv'}
    ]
    
  • Use Case:

    This format is suitable when datasets fit into memory, as it avoids re-fetching data.

Option 2: CSV per Cell#

  • Description:

    Data for each cell is stored in individual CSV files. Each file contains a 1 x feature vector without a header or index column.

  • Example Configuration:
    # Directory paths for each modality and batch
    data_config = [
        {'rna': 'batch_1_rna_dir/', 'adt': 'batch_1_adt_dir/', 'atac': 'batch_1_atac_dir/'},
        {'rna': 'batch_2_rna_dir/', 'adt': 'batch_2_adt_dir/', 'atac': 'batch_2_atac_dir/'},
        {'rna': 'batch_3_rna_dir/', 'adt': 'batch_3_adt_dir/', 'atac': 'batch_3_atac_dir/'}
    ]
    
  • Use Case:

    This format is ideal for large datasets that cannot fit into memory, as it allows loading data one sample at a time.

Tip

Both Option 1 and Option 2 can be combined for flexible data handling.

Step 2: Define Data Dimensions#

Specify the dimensions for each modality. Example:

# Dimensions per modality.
# In this example, the ATAC data is split into chunks during training
# based on the specified dimensionality.
dims_x = {
    'rna': [200],    # RNA data is represented as a cell x 200 matrix.
    'adt': [200],    # ADT data is represented as a cell x 100 matrix.
    'atac': [100, 200, 300, ..., 200]  # ATAC data is split into multiple chunks with varying dimensions
}

Note

For modalities with more than one dimension (e.g., ATAC), data will be split into chunks based on the specified dimensions. This is useful for high-dimensional data like ATAC-seq, where splitting occurs based on chromosomes.

Step 3: Specify Transformation Rules#

For certain modalities, you may want to apply transformations. In this example, we binarize the ATAC data and leave RNA and ADT data unchanged:

transform = {'atac': 'binarize'}  # Binarize ATAC data, leave RNA and ADT unchanged

Step 4: Combine Configuration#

Integrate the configurations, data paths, and transformations to set up MIDAS:

# Configure MIDAS with the data
datasets, dims_s, s_joint, combs = MIDAS.configure_data_from_csv(data_config, mask_config, transform)
model = MIDAS.configure_data(configs, datasets, dims_x, dims_s, s_joint, combs)

Approach 2: Loading Data from Multi-level Directories#

In addition to the previously mentioned method of loading data from single-level directories (where each path corresponds to an independent directory containing CSV files), MIDAS also supports loading data directly from a well-organized, multi-level directory structure. The required directory format is as follows:

./dataset_path/
    batch_0/
        mask/
            rna.csv
            adt.csv
        vec/
            rna/
                0000.csv
                0001.csv
                ...
            adt/
                0000.csv
                0001.csv
                ...
            atac/
                0000.csv
                0001.csv
                ...
    batch_1/
        ...
    feat/
        feat_dims.toml
  • mask: Contains mask files for each modality.

  • vec: Contains cell-specific data files for each modality.

  • feat/feat_dims.toml: Specifies feature dimensions for each modality. Example:

rna = [200]
adt = [100]
atac = [100, 200, 300, ..., 200]

To load data from this structure, use the configure_data_from_dir() function:

# Load dataset using the directory structure
model = MIDAS.configure_data_from_dir(configs, dataset_path, transform)