scmidas.data


class scmidas.data.BasicModDataset

Bases: Dataset

Base class for modality-specific datasets. Subclasses such as MatDataset and VecDataset implement the actual data access.

__getitem__(idx: int) → None

Retrieve the data item at the specified index (not implemented in base class).

Parameters:

idx – int The index of the data item.

Returns:

None

__len__() → int

Return the number of samples in the dataset.

Returns:

int: Number of samples (default is 0 for base class).
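The base-class contract described above can be sketched without any dependencies. This is only an illustration: `BaseModSketch` and `ListModSketch` are hypothetical names, and the real class derives from PyTorch's `Dataset`, which is left out here so the example runs on its own.

```python
class BaseModSketch:
    """Hypothetical stand-in for a base modality dataset (no torch dependency)."""

    def __len__(self) -> int:
        # The base class holds no data, so it reports zero samples.
        return 0

    def __getitem__(self, idx: int) -> None:
        # Item access is left to subclasses; the base class returns None.
        return None


class ListModSketch(BaseModSketch):
    """Hypothetical subclass backed by an in-memory list of rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int):
        return self.rows[idx]


base = BaseModSketch()
data = ListModSketch([[1.0, 2.0], [3.0, 4.0]])
print(len(base), base[0])
print(len(data), data[1])
```

A subclass only needs to override the two methods; everything that consumes the dataset can rely on `len()` and integer indexing.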

class scmidas.data.MatDataset(csv_file: str)

Bases: BasicModDataset

Dataset for matrix-based data.

Parameters:

csv_file – str Path to the CSV or compressed CSV file.

__getitem__(idx: int) → ndarray

Retrieve the matrix row at the specified index.

Parameters:

idx – int The index of the matrix row.

Returns:

np.ndarray: The matrix row as a NumPy array.

__len__() → int

Return the number of rows in the matrix dataset.

Returns:

int: Number of rows in the dataset.
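Row-wise access to a CSV matrix might look roughly like the sketch below. It is not the library's implementation: `CsvRowsSketch` is a hypothetical name, the file is assumed to have one header row and purely numeric cells, compressed files are not handled, and rows come back as plain lists of floats rather than NumPy arrays to keep the example dependency-free.

```python
import csv
import os
import tempfile


class CsvRowsSketch:
    """Hypothetical row-wise reader; assumes a header row and numeric cells."""

    def __init__(self, csv_file: str):
        with open(csv_file, newline="") as fh:
            reader = csv.reader(fh)
            next(reader)  # skip the header row (an assumption about the layout)
            self.rows = [[float(v) for v in row] for row in reader]

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int):
        return self.rows[idx]


# Write a tiny two-row matrix to a temporary CSV file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rna.csv")
    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerows([["g1", "g2", "g3"], [1, 0, 2], [0, 5, 1]])
    mat = CsvRowsSketch(path)

print(len(mat), mat[1])
```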

class scmidas.data.MultiBatchSampler(data_source: Any | None = None, shuffle: bool = True, batch_size: int = 1, n_max: int = 10000)

Bases: Sampler

Custom sampler for multi-batch sampling across multiple datasets.

Parameters:

data_source – Any A dataset or a concatenated dataset (e.g., ConcatDataset) containing multiple sub-datasets.
shuffle – bool, optional Whether to shuffle the samples within each dataset, default is True.
batch_size – int, optional Number of samples per batch, default is 1.
n_max – int, optional Maximum number of samples to draw from each dataset, default is 10000.

__iter__() → Iterator[int]

Iterate over the dataset indices in a multi-batch sampling manner.

Returns:

Iterator[int]: An iterator over sampled indices.

__len__() → int

Calculate the total number of samples across all sub-datasets.

Returns:

int: The total number of samples.
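One plausible reading of "multi-batch sampling" is that every batch is drawn from a single sub-dataset of a concatenated dataset. The sketch below illustrates that idea under stated assumptions; `multi_batch_indices` and its `sizes` and `seed` arguments are hypothetical, incomplete trailing batches are simply dropped, and the real sampler may balance or order the sub-datasets differently.

```python
import random


def multi_batch_indices(sizes, batch_size=1, n_max=10000, shuffle=True, seed=0):
    """Yield global indices so that every batch comes from one sub-dataset.

    `sizes` lists the length of each sub-dataset in concatenation order.
    """
    rng = random.Random(seed)
    batches, offset = [], 0
    for size in sizes:
        local = list(range(size))
        if shuffle:
            rng.shuffle(local)
        local = local[:n_max]  # cap the number of samples per sub-dataset
        # Keep only full batches so the flat index stream stays batch-aligned.
        for start in range(0, len(local) - batch_size + 1, batch_size):
            # Shift local indices into the concatenated dataset's index space.
            batches.append([offset + i for i in local[start:start + batch_size]])
        offset += size
    if shuffle:
        rng.shuffle(batches)  # interleave batches from different sub-datasets
    for batch in batches:
        yield from batch


indices = list(multi_batch_indices([4, 6], batch_size=2, n_max=4))
print(indices)
```

Because the stream is batch-aligned, a data loader that groups consecutive indices into batches of the same `batch_size` never mixes sub-datasets within a batch.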

class scmidas.data.MultiModalDataset(mod_dict: Dict[str, str], mod_id_dict: Dict[str, int], file_type: Dict[str, str], mask_path: Dict[str, str] | None = None, transform: Dict[str, str] | None = None)

Bases: Dataset

A dataset class for handling multi-modal data with optional masking and transformations.

Parameters:

mod_dict – Dict[str, str] A dictionary mapping modality names to their respective file paths.
mod_id_dict – Dict[str, int] A dictionary mapping modality names to their unique identifiers.
file_type – Dict[str, str] A dictionary mapping modality names to their file types (e.g., ‘vec’, ‘mat’).
mask_path – Optional[Dict[str, str]], optional A dictionary mapping modality names to their mask file paths, default is None.
transform – Optional[Dict[str, str]], optional A dictionary specifying transformations to apply to each modality, default is None.

__getitem__(idx: int) → Dict[str, Dict[str, Any]]

Retrieves the data at the specified index across all modalities.

Parameters:

idx – int The index of the sample to retrieve.

Returns:

Dict[str, Dict[str, Any]]: A dictionary containing the following keys:

- ‘x’: Modality data at the given index, with optional transformations applied.
- ‘s’: Modality IDs.
- ‘e’: Masking information, if available.

__len__() → int

Returns the size of the dataset.

Returns:

int: The number of samples in the dataset.
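The nested return structure can be illustrated with a small helper. This is a sketch, not the library's code: `assemble_item` is a hypothetical function, and keying 's' and 'e' by modality name is an assumption drawn from the `Dict[str, Dict[str, Any]]` return type.

```python
from typing import Any, Dict, Optional


def assemble_item(
    data: Dict[str, Any],
    mod_id_dict: Dict[str, int],
    masks: Optional[Dict[str, Any]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Group per-modality values under the keys 'x', 's' and (optionally) 'e'."""
    item: Dict[str, Dict[str, Any]] = {
        "x": dict(data),
        "s": {mod: mod_id_dict[mod] for mod in data},
    }
    if masks is not None:
        # Only modalities that actually have a mask appear under 'e'.
        item["e"] = {mod: masks[mod] for mod in data if mod in masks}
    return item


item = assemble_item(
    data={"rna": [1, 0, 2], "adt": [3, 4]},
    mod_id_dict={"rna": 0, "adt": 1},
    masks={"rna": [1, 1, 0]},
)
print(item["s"])
print(item["e"])
```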

class scmidas.data.MyDistributedSampler(dataset: Dataset, num_replicas: int | None = None, rank: int | None = None, shuffle: bool = True, seed: int = 0, batch_size: int = 256, n_max: int = 10000)

Bases: DistributedSampler

A custom distributed sampler for datasets split across multiple replicas.

Parameters:

dataset – Dataset The dataset to sample from.
num_replicas – int, optional Number of replicas in the distributed setup, default is determined by torch.distributed.
rank – int, optional The rank of the current process, default is determined by torch.distributed.
shuffle – bool, optional Whether to shuffle the data, default is True.
seed – int, optional Random seed for shuffling, default is 0.
batch_size – int, optional Number of samples per batch, default is 256.
n_max – int, optional Maximum number of samples per dataset, default is 10000.

__iter__() → Iterator[_T_co]

Iterate over the distributed dataset, ensuring balanced sampling across replicas.

Returns:

Iterator: Iterator over indices for the current replica.

__len__() → int

Calculate the number of samples in the sampler.

Returns:

int: Number of samples across all datasets.
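The general technique behind a distributed sampler, as used by PyTorch's `DistributedSampler`, is to shuffle with a shared seed, pad the index list so it divides evenly, and give each replica a strided slice. The sketch below shows only that partitioning step; `replica_indices` and its `epoch` argument are hypothetical, and the `batch_size` and `n_max` handling specific to MyDistributedSampler is not modelled.

```python
import random


def replica_indices(n_samples, num_replicas, rank, shuffle=True, seed=0, epoch=0):
    """Return the indices assigned to one replica."""
    indices = list(range(n_samples))
    if shuffle:
        # Every replica seeds the same generator, so all see the same order.
        random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating leading indices so the total divides evenly.
    per_replica = -(-n_samples // num_replicas)  # ceiling division
    total = per_replica * num_replicas
    indices += indices[: total - n_samples]
    # Each replica takes every num_replicas-th index, starting at its rank.
    return indices[rank:total:num_replicas]


parts = [replica_indices(10, num_replicas=4, rank=r) for r in range(4)]
for r, part in enumerate(parts):
    print(r, part)
```

Every replica receives the same number of indices, and together the replicas cover the whole dataset, with a few indices repeated as padding.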

class scmidas.data.VecDataset(path: str)

Bases: BasicModDataset

Dataset for vector-based data.

Parameters:

path – str Directory containing vector-based data files.

__getitem__(idx: int) → ndarray

Retrieve the vector data at the specified index.

Parameters:

idx – int The index of the vector file.

Returns:

np.ndarray: The vector data as a NumPy array.

__len__() → int

Return the number of files in the vector dataset.

Returns:

int: Number of vector files in the dataset.
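A directory-backed dataset with one file per sample could be sketched as follows. The on-disk format is an assumption: `VecDirSketch` is a hypothetical class that expects one single-row numeric CSV file per sample and orders samples by sorted file name; vectors are returned as lists of floats rather than NumPy arrays to avoid extra dependencies.

```python
import csv
import os
import tempfile


class VecDirSketch:
    """Hypothetical reader; assumes one single-row numeric CSV file per sample."""

    def __init__(self, path: str):
        self.path = path
        # Sort the file names so the index-to-file mapping is deterministic.
        self.files = sorted(os.listdir(path))

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int):
        # Files are read lazily, one per requested index.
        with open(os.path.join(self.path, self.files[idx]), newline="") as fh:
            row = next(csv.reader(fh))
        return [float(v) for v in row]


with tempfile.TemporaryDirectory() as tmp:
    for name, values in [("0000.csv", [1, 2, 3]), ("0001.csv", [4, 5, 6])]:
        with open(os.path.join(tmp, name), "w", newline="") as fh:
            csv.writer(fh).writerow(values)
    vec = VecDirSketch(tmp)
    n_files, second = len(vec), vec[1]

print(n_files, second)
```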

scmidas.data.download_data(name: str, des: str = './')

Downloads the specified dataset and extracts it.

Parameters:

name – str Name of the dataset to download (e.g., ‘teadog_mosaic_4k’).
des – str, optional Destination directory in which to save the dataset, default is ‘./’ (the current directory).

scmidas.data.download_file(url: str, dest_path: Path)

Helper function to download a file from a URL with progress display.

Parameters:

url – str URL of the file to download.
dest_path – Path Local path where the downloaded file is saved.

scmidas.data.unzip_file(zip_path: Path, extract_to: Path)

Helper function to unzip a file.

Parameters:

zip_path – Path Path to the zip file.
extract_to – Path Directory into which the archive is extracted.
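The download-then-extract pattern these helpers describe can be sketched with the standard library alone. `fetch` and `extract` are hypothetical names, not the library's functions, and the progress display is a bare byte counter. The demonstration uses a local `file://` URL so it runs offline.

```python
import tempfile
import zipfile
from pathlib import Path
from urllib.request import urlopen


def fetch(url: str, dest_path: Path, chunk_size: int = 8192) -> None:
    """Stream `url` to `dest_path`, printing a simple progress line."""
    with urlopen(url) as response, open(dest_path, "wb") as out:
        total = int(response.headers.get("Content-Length") or 0)
        done = 0
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            done += len(chunk)
            print(f"\r{done}/{total or '?'} bytes", end="")
    print()


def extract(zip_path: Path, extract_to: Path) -> None:
    """Unpack every member of the archive into `extract_to`."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_to)


# Offline demonstration: "download" a local archive through a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    with zipfile.ZipFile(root / "src.zip", "w") as archive:
        archive.writestr("demo/readme.txt", "hello")
    fetch((root / "src.zip").as_uri(), root / "copy.zip")
    extract(root / "copy.zip", root / "out")
    text = (root / "out" / "demo" / "readme.txt").read_text()

print(text)
```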
