SIDISH package

Submodules

SIDISH.DEEP_COX module

class SIDISH.DEEP_COX.DEEPCOX(X_train, Y_train, weights, hidden, encoder, device, batch_size, seed, lr=0.000001, dropout=0)[source]

Bases: object

get_test_ci(test_loader)[source]
get_test_loss(test_loader)[source]
get_train_ci()[source]
get_train_loss()[source]
train(epochs)[source]
SIDISH.DEEP_COX.loss_DeepCox(pred, events, durations, weight=None, train=True)[source]

Compute the negative log-likelihood for the Deep Cox model in Phase 2 of SIDISH.

Parameters:
  • pred (torch.Tensor) – Predicted risk scores.

  • events (torch.Tensor) – Event indicators (1 if the event occurred, 0 if censored).

  • durations (torch.Tensor) – Time durations.

  • weight (torch.Tensor, optional) – Patient weights. Defaults to None.

  • train (bool, optional) – Whether the model is in training mode. Defaults to True.

Returns:

torch.Tensor – Negative log-likelihood.

Notes

This method is based on the implementation from DeepSurv: https://github.com/jaredleekatzman/DeepSurv/blob/master/deepsurv/deep_surv.py
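As a rough illustration of the quantity this function computes, the sketch below evaluates a Cox negative log partial likelihood in the DeepSurv style. It is not the SIDISH implementation: NumPy stands in for torch, the function name is hypothetical, and both the placement of the per-patient weight and the averaging over observed events are assumptions.

```python
import numpy as np

def cox_neg_log_partial_likelihood(pred, events, durations, weight=None):
    """Sketch of a Cox negative log partial likelihood (tied times get no special handling)."""
    pred = np.asarray(pred, dtype=float).ravel()
    events = np.asarray(events, dtype=float).ravel()
    durations = np.asarray(durations, dtype=float).ravel()
    if weight is None:
        weight = np.ones_like(pred)
    weight = np.asarray(weight, dtype=float).ravel()

    # Sort by descending duration so the risk set of sample i is samples 0..i.
    order = np.argsort(-durations, kind="stable")
    pred, events, weight = pred[order], events[order], weight[order]

    # Log of the cumulative sum of exp(pred) over each risk set, shifted for stability.
    shift = pred.max()
    log_risk = np.log(np.cumsum(np.exp(pred - shift))) + shift

    # Only uncensored samples contribute; weighting here is an assumption.
    per_sample = (pred - log_risk) * events * weight
    return -per_sample.sum() / max(events.sum(), 1.0)
```

With two uncensored patients and equal risk scores, only the patient with the shorter duration has a risk set of size two, so the loss reduces to log(2) / 2.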

SIDISH.DEEP_COX_ARCHITECTURE module

class SIDISH.DEEP_COX_ARCHITECTURE.DEEPCOX_ARCHITECTURE(hidden, encoder, dropout)[source]

Bases: Module

Deep Cox architecture used in SIDISH for survival prediction. This network integrates pretrained encoder representations (from the VAE) with a Cox proportional hazards regression layer for modeling survival risk.

forward(x)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

SIDISH.SIDISH module

class SIDISH.SIDISH.SIDISH(adata, bulk, device='cpu', seed=1234, use_spatial_graph=False, k_neighbors=None)[source]

Bases: object

SIDISH (Semi-Supervised Iterative Deep Learning for Identifying High-Risk Cells).

This framework integrates single-cell and bulk RNA-seq data to identify High-Risk cancer cells and potential biomarkers.

Parameters:
  • adata (AnnData) – Single-cell RNA-seq data.

  • bulk (pd.DataFrame) – Bulk RNA-seq data.

  • use_spatial_graph (bool, optional) – Whether to use spatial graph information (default=False).

  • k_neighbors (int, optional) – Number of neighbors to use for constructing the spatial graph (default=None, per the signature).

  • device (str) – Computation device (‘cpu’ or ‘cuda’).

  • seed (int, optional) – Random seed for reproducibility (default=1234).

analyze_perturbation_effects()[source]
annotateCells(test_adata, percentile_cells, mode, perturbation=False)[source]
getEmbedding_adata()[source]

Extracts latent representations from the trained VAE.

Return type:

anndata._core.anndata.AnnData

Returns:

AnnData – Updated AnnData object with embeddings stored in obsm[‘latent’].

get_MarkerGenes(logfc_threshold=1.5, pval_threshold=0.05, method='wilcoxon', group='h')[source]

Identifies marker genes for the specified group using the selected statistical method.

Parameters:
  • logfc_threshold (float) – Log fold change threshold for filtering significant genes.

  • pval_threshold (float) – P-value threshold for statistical significance.

  • method (str) – Method for ranking genes (‘wilcoxon’, ‘t-test’, ‘logreg’).

  • group (str) – The group to compare against others (default is ‘h’).

Returns:

  • upregulated_genes (list) – List of upregulated marker genes.

  • downregulated_genes (list) – List of downregulated marker genes.
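The thresholding described by logfc_threshold and pval_threshold can be pictured with the small sketch below. The function name and the (gene, log fold change, p-value) tuple layout are hypothetical, and whether the package filters on raw or adjusted p-values is not stated here, so treat this as an assumption-laden illustration only.

```python
def split_marker_genes(stats, logfc_threshold=1.5, pval_threshold=0.05):
    """Split ranked genes into up- and downregulated lists.

    stats: iterable of (gene, log_fold_change, p_value) tuples (hypothetical layout).
    """
    upregulated, downregulated = [], []
    for gene, logfc, pval in stats:
        if pval >= pval_threshold:
            continue  # not significant at the requested level
        if logfc >= logfc_threshold:
            upregulated.append(gene)
        elif logfc <= -logfc_threshold:
            downregulated.append(gene)
    return upregulated, downregulated
```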

get_embedding(n_neighbors=30, resolution=None, celltype=True)[source]
get_percentille(percentile)[source]
init_Phase1(epochs, i_epochs, latent_size, layer_dims, batch_size, optimizer, lr, lr_3, dropout, type='Normal')[source]

Initializes Phase 1: training a Variational Autoencoder (VAE) on single-cell RNA-seq data.

Parameters:
  • epochs (int) – Number of epochs for initial VAE training.

  • i_epochs (int) – Number of iterations for retraining VAE.

  • latent_size (int) – Latent dimension size.

  • layer_dims (list) – List of hidden layer dimensions.

  • batch_size (int) – Batch size.

  • optimizer (str) – Optimizer for VAE training.

  • lr (float) – Learning rate.

  • lr_3 (float) – Learning rate for later iterations.

  • dropout (float) – Dropout rate.

  • type (str, optional) – Specifies dense or normal representation (default=‘Normal’).

Return type:

None

Returns:

None

init_Phase2(epochs, hidden, lr, dropout, test_size, batch_size_bulk)[source]

Initializes Phase 2: training a Deep Cox model for survival analysis using bulk RNA-seq data.

Parameters:
  • epochs (int) – Number of training epochs for Deep Cox model.

  • hidden (int) – Number of neurons in the hidden layer.

  • lr (float) – Learning rate for Deep Cox model.

  • dropout (float) – Dropout rate for training.

  • test_size (float) – Proportion of dataset allocated to the test split.

  • batch_size_bulk (int) – Number of samples per batch for bulk data.

Return type:

None

Returns:

None

plotUMAP(resolution, figure_size=(8, 6), fontsize=12, cell_size=20)[source]

Performs UMAP dimensionality reduction and Leiden clustering on the latent space.

Parameters:
  • resolution (float) – The resolution parameter for Leiden clustering.

  • figure_size (tuple, optional) – Size of the generated UMAP plot (default=(8, 6)).

  • fontsize (int, optional) – Font size for labels and legends (default=12).

  • cell_size (int, optional) – Size of points in the scatter plot (default=20).

Return type:

None

Returns:

None

plot_CellType_UMAP(size=10, resolution=None, celltype=True)[source]
plot_HighRisk_UMAP(size=10, resolution=None, celltype=True)[source]
plot_KM(penalizer=0.1, data_name='DATA', high_risk_label='High-Risk', background_label='Background', colors=('pink', 'grey'), fontsize=12)[source]

Plot Kaplan-Meier survival curves for High-Risk and background patient groups.

Parameters:
  • penalizer (float) – Penalizer for CoxPHFitter regularization.

  • data_name (str) – Title label for the dataset.

  • high_risk_label (str) – Label for the High-Risk group.

  • background_label (str) – Label for the background group.

  • colors (tuple) – Colors for the survival plots (High-Risk, background).

  • fontsize (int) – Font size for plot labels and legends.
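For readers unfamiliar with the curves this method draws, the following is a minimal sketch of the Kaplan-Meier product-limit estimator in plain Python. It is independent of SIDISH and of any survival library the package may use internally; the function name is hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same time point.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set.
        at_risk -= removed
    return curve
```

Computing one such curve for the High-Risk group and one for the background group gives the two step functions that a Kaplan-Meier plot compares.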

plot_double_Perturbation_Heatmap(percentage_double_dict, top_n=20)[source]
plot_perturbation_UMAP_default(genes_of_interest, resolution=None, celltype=True, threshold=0.8)[source]

Generates UMAP visualizations for specified genes after in-silico perturbation.

Parameters:
  • adata – AnnData object with latent embeddings.

  • sidish – SIDISH object for annotation and processing.

  • ppi_df – DataFrame containing the PPI network data.

  • genes_of_interest (list) – List of genes to visualize.

  • output_path – Filepath for saving the generated UMAP plot.

  • seed – Random seed for reproducibility. Default is 42.

plot_perturbation_UMAP_differential(genes_of_interest, resolution=None, celltype=True, threshold=0.8)[source]

Generates UMAP visualizations for specified genes after in-silico perturbation.

Parameters:
  • adata – AnnData object with latent embeddings.

  • sidish – SIDISH object for annotation and processing.

  • ppi_df – DataFrame containing the PPI network data.

  • genes_of_interest (list) – List of genes to visualize.

  • output_path – Filepath for saving the generated UMAP plot.

  • seed – Random seed for reproducibility. Default is 42.

plot_top_perturbed_genes(gene_data, top_n=20)[source]

Plots a barplot of the top N genes with the highest percentage reduction in High-Risk cells after in-silico perturbation.

Parameters:
  • gene_data (dict) – Dictionary of gene perturbation effects.

  • top_n (int) – Number of top genes to display. Default is 20.

reload(path, num_workers=0)[source]
run_Perturbation(n_jobs=4)[source]
Return type:

tuple

run_double_Perturbation(genes, top_n=20, threshold=0.8)[source]
run_double_Perturbation_score(genes, top_n=20, threshold=0.8)[source]
set_adata()[source]
train(iterations, percentile, steepness, path, num_workers=0, show=True, distribution_fit='default')[source]

Trains the SIDISH framework iteratively, refining the identification of High-Risk cells.

This function iteratively updates High-Risk cell classifications by integrating single-cell and bulk RNA-seq data. Each iteration includes:
  • Training the VAE model on single-cell data.

  • Training the Deep Cox model on bulk RNA-seq survival data.

  • Updating weight matrices to improve High-Risk cell identification.

Parameters:
  • iterations (int) – Number of training iterations.

  • percentile (float) – Threshold percentile for defining High-Risk cells.

  • steepness (float) – Scaling factor for updating weights.

  • path (str) – Directory for saving model checkpoints.

  • num_workers (int, optional) – Number of parallel workers (default=0).

  • show (bool, optional) – If True, displays training progress (default=True).

Return type:

anndata._core.anndata.AnnData

Returns:

sc.AnnData – Updated AnnData object containing the refined High-Risk cell classifications.

SIDISH.SIDISH.map_event_column(val)[source]

Strictly maps survival event status to binary integers (0 or 1).

This function enforces a strict schema for event data to prevent silent errors during survival analysis. It accepts case-insensitive string labels or numeric binary values.

Parameters:

val (str, int, or float) – The value representing the event status. Accepted values:
  • Strings: Dead (maps to 1), Alive (maps to 0).

  • Numbers: 1 and 0.

Returns:

int – Returns 1 if the event occurred (Dead) and 0 if censored (Alive).

Raises:

ValueError – If val is anything other than the accepted string or numeric inputs (e.g., ‘censored’, ‘unknown’, 2, NaN).
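The strict schema described above can be sketched as follows. This is a hypothetical re-implementation written from the docstring alone (the name map_event_strict and the error messages are made up), not the package's source.

```python
import math

def map_event_strict(val):
    """Map 'Dead'/'Alive' (any case) or 1/0 to an int; raise ValueError otherwise."""
    if isinstance(val, str):
        label = val.lower()
        if label == "dead":
            return 1
        if label == "alive":
            return 0
        raise ValueError(f"Unrecognised event label: {val!r}")
    if isinstance(val, (int, float)):
        # NaN compares unequal to everything, so reject it explicitly.
        if isinstance(val, float) and math.isnan(val):
            raise ValueError("Event status is NaN")
        if val == 1:
            return 1
        if val == 0:
            return 0
    raise ValueError(f"Unrecognised event value: {val!r}")
```

Failing loudly on labels such as ‘censored’ forces the caller to clean the status column up front instead of letting an ambiguous value be coerced silently.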

SIDISH.SIDISH.plot_umap(ax, umap_combined, palette, percentage_change_)[source]

Scatter UMAP of High-Risk/Background status after perturbation with a custom palette and an extra legend line indicating the perturbation percentage change.

SIDISH.SIDISH.plot_umap_differential(ax, umap_combined)[source]

Scatter UMAP colored by continuous risk delta after perturbation.

SIDISH.SIDISH.preprocess(adata, bulk, survival_df, patient_id, celltype_name, processed=True, n_genes_by_counts=5000, pct_counts_mt=10, batch_correction=False, survival_='Overall_survival_days', status='Sample_Status')[source]

Harmonize scRNA-seq (AnnData) and bulk tables:
  • QC (if raw), HVG selection, intersection of genes across modalities

  • Optionally apply Harmony (neighbors/umap/visual check) + ComBat

  • Merge survival metadata -> bulk with columns: duration, event

Returns:

(AnnData, pd.DataFrame) – (sc object restricted to intersecting genes, bulk with survival columns)

SIDISH.SIDISH.process_Data(X, Y, test_size, batch_size, seed)[source]

Splits bulk RNA-seq data into training and testing datasets, converts them to tensors, and creates DataLoaders.

Parameters:
  • X (np.ndarray) – Bulk gene expression data.

  • Y (np.ndarray) – Survival data: [survival days, event, weight].

  • test_size (float) – Proportion of dataset allocated to the test split.

  • batch_size (int) – Number of patients per batch.

  • seed (int) – Random seed for reproducibility.

Return type:

tuple

Returns:

tuple

  • torch.Tensor: X_train (Training feature matrix)

  • torch.Tensor: X_test (Testing feature matrix)

  • torch.Tensor: y_train (Training labels)

  • torch.Tensor: y_test (Testing labels)
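The seeded train/test split at the heart of this function might look like the sketch below. It covers only the splitting step: NumPy arrays stand in for tensors, the DataLoader construction is omitted, and the function name and the use of a shuffled index are assumptions rather than the package's actual code.

```python
import numpy as np

def split_bulk(X, Y, test_size, seed):
    """Shuffle patients with a fixed seed and split features and labels consistently."""
    X = np.asarray(X)
    Y = np.asarray(Y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # The same indices are applied to X and Y so each row keeps its survival label.
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]
```

The return order mirrors the tuple documented above (X_train, X_test, y_train, y_test).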

SIDISH.Utils module

class SIDISH.Utils.Utils[source]

Bases: object

annotateCells(model, percentile_cells, device, percentile, mode, perturbation=False)[source]
getWeightMatrix(seed, steepness=100, type='Normal')[source]
getWeightVector(adata, model, percentile, device, distribution='default', dist=None)[source]
get_threshold(model, percentile, device)[source]
SIDISH.Utils.create_pytorch_geometric_data(adata, spatial_key='spatial', method='knn', k=5)[source]

Create a PyTorch Geometric Data object from AnnData with spatial information

Parameters:
  • adata (AnnData) – AnnData object containing gene expression and spatial data.

  • spatial_key (str) – Key in adata.obsm where spatial coordinates are stored.

  • method (str) – Method to create the graph, either ‘knn’ or ‘radius’.

  • k (int) – Number of nearest neighbors for KNN graph (default: 5).

Returns:

data (torch_geometric.data.Data) – PyTorch Geometric Data object containing the graph.

SIDISH.Utils.create_spatial_graph(spatial_coords, method='knn', k=5, radius=None, include_self=False)[source]

Create a graph from spatial coordinates with limited neighborhood size

Parameters:
  • spatial_coords (numpy.ndarray) – Array of shape [n_cells, n_dims] containing spatial coordinates.

  • method (str) – Method to create the graph, either ‘knn’ or ‘radius’.

  • k (int) – Number of nearest neighbors for KNN graph (default: 5).

  • radius (float) – Radius for radius graph.

  • include_self (bool) – Whether to include self-loops.

Returns:

  • edge_index (torch.LongTensor) – Edge indices in COO format [2, num_edges].

  • edge_weight (torch.FloatTensor) – Edge weights based on distance [num_edges].
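To make the COO output concrete, here is a minimal sketch of the ‘knn’ case using brute-force distances. NumPy arrays stand in for the torch tensors, the function name is hypothetical, and using the raw Euclidean distance as the edge weight is an assumption, since the exact distance-to-weight transform is not documented here.

```python
import numpy as np

def knn_edges(coords, k=5, include_self=False):
    """Build a directed kNN graph in COO format from spatial coordinates."""
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    # Pairwise Euclidean distances, shape [n, n].
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    if not include_self:
        np.fill_diagonal(dist, np.inf)  # a cell is never its own neighbor
    k = min(k, n if include_self else n - 1)
    neighbors = np.argsort(dist, axis=1, kind="stable")[:, :k]
    source = np.repeat(np.arange(n), k)
    target = neighbors.ravel()
    edge_index = np.stack([source, target]).astype(np.int64)  # shape [2, n * k]
    edge_weight = dist[source, target]                        # shape [n * k]
    return edge_index, edge_weight
```

Each cell contributes exactly k outgoing edges, which is what keeps the neighborhood size bounded regardless of local cell density.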

SIDISH.Utils.extractFeature(adata, type='Normal')[source]
SIDISH.Utils.get_spatial_graph_from_adata(adata, spatial_key='spatial', method='knn', k=5, radius=None, include_self=False)[source]

Create a graph from spatial coordinates stored in AnnData object

Parameters:
  • adata (AnnData) – AnnData object containing spatial coordinates.

  • spatial_key (str) – Key in adata.obsm where spatial coordinates are stored.

  • method (str) – Method to create the graph, either ‘knn’ or ‘radius’.

  • k (int) – Number of nearest neighbors for KNN graph (default: 5).

  • radius (float) – Radius for radius graph.

  • include_self (bool) – Whether to include self-loops.

Returns:

  • edge_index (torch.LongTensor) – Edge indices in COO format [2, num_edges].

  • edge_weight (torch.FloatTensor) – Edge weights based on distance [num_edges].

SIDISH.Utils.r_squared(sample, dist, params)[source]
SIDISH.Utils.sigmoid(x, a=100, b=0)[source]

SIDISH.VAE module

class SIDISH.VAE.VAE(epochs, adata, z_dim, layer_dims, lr, dropout, device, seed, gcn_dims=None)[source]

Bases: object

getBaseEmbedding(clustering=True)[source]

Get embeddings using only the base encoder (for Cox regression compatibility)

getBaseEncoder()[source]
getEmbedding(clustering=True)[source]

Get latent embeddings for the entire dataset

getLoss()[source]
initialize(adata, W=None, batch_size=1024, type='Normal', num_workers=8, spatial_graph=None, num_neighbors=5)[source]
train()[source]

SIDISH.VAE_ARCHITECTURE module

class SIDISH.VAE_ARCHITECTURE.ARCHITECTURE(input_dim, z_dim, layer_dims, seed, dropout=0.5, gcn_dims=None, use_cuda=False)[source]

Bases: Module

forward(x, edge_index=None, edge_weight=None)[source]

get_base_latent_representation(x)[source]
get_latent_representation(x, edge_index=None, edge_weight=None)[source]
kl_d(mu, logvar)[source]
loss_function(x, w, mu_decoder, dropout_logits, mu_encoder, logvar)[source]
reconstruction_loss(x, mu, dropout_logits, w)[source]

  • x – Input data.

  • mu – Output of the decoder.

  • dropout_logits – Dropout logits of the ZINB distribution.

  • w – Weights for each sample in x (same shape as x).

reparameterize(mu, logvar)[source]

std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
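The two statements above are the first half of the usual reparameterization trick; the latent sample is then formed as mu + eps * std. The sketch below shows the full step with NumPy standing in for torch. The explicit rng argument is an addition for reproducibility and is not part of the documented signature.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Draw z ~ N(mu, exp(logvar)) as a deterministic function of mu, logvar and noise."""
    std = np.exp(0.5 * logvar)            # logvar is log(sigma^2), so this is sigma
    eps = rng.standard_normal(std.shape)  # noise independent of the parameters
    return mu + eps * std
```

Because the randomness lives entirely in eps, gradients can flow through mu and logvar when the same expression is written with torch tensors.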

class SIDISH.VAE_ARCHITECTURE.Decoder(input_dim, z_dim, layer_dims, dropout=0)[source]

Bases: Module

forward(z)[source]

class SIDISH.VAE_ARCHITECTURE.Encoder(input_dim, z_dim, layer_dims, dropout=0)[source]

Bases: Module

forward(x)[source]

class SIDISH.VAE_ARCHITECTURE.GraphConvLayer(in_features, out_features)[source]

Bases: Module

forward(x, edge_index, edge_weight=None)[source]

  • x – Node features [N, in_features].

  • edge_index – Graph connectivity in COO format [2, E].

  • edge_weight – Edge weights [E].

class SIDISH.VAE_ARCHITECTURE.SpatialEncoder(input_dim, z_dim, layer_dims, dropout=0, gcn_dims=None)[source]

Bases: Module

forward(x, edge_index=None, edge_weight=None)[source]

class SIDISH.VAE_ARCHITECTURE.SpatialPreEncoder(input_dim, layer_dims, dropout=0)[source]

Bases: Module

forward(x)[source]

SIDISH.gene_perturbation_utils module

class SIDISH.gene_perturbation_utils.GenePerturbationUtils[source]

Bases: object

Utility functions for gene perturbation tasks.

static adjust_expression(adata, genename, network_df)[source]

Modify gene expression for perturbation by knocking out the target gene and its co-expressed genes.

Parameters:
  • adata (AnnData) – An AnnData object containing gene expression data.

  • genename (str) – The gene to knock out.

  • network_df (pd.DataFrame) – A DataFrame containing the PPI network information.

Returns:

AnnData – A new AnnData object with the expression of the target gene and its co-expressed partners set to zero.

static get_cogenes(adata, network_df, genename)[source]

Retrieve co-expressed genes (neighbors) from the PPI network for a given gene.

Parameters:
  • adata (AnnData) – An AnnData object containing gene expression data.

  • network_df (pd.DataFrame) – A DataFrame containing the PPI network information.

  • genename (str) – The gene for which to find co-expressed partners.

Returns:

list – A list of co-expressed gene names (excluding the input gene).

static knockout_gene(adata, gene)[source]

Knock out a gene’s expression by setting its values to zero in the expression matrix.

Parameters:
  • adata (AnnData) – An AnnData object containing gene expression data.

  • gene (str) – Name of the gene to knock out.

Returns:

lil_matrix – A copy of the expression matrix with the specified gene’s expression set to zero.
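The knockout itself amounts to zeroing one column of a copy of the expression matrix. The sketch below shows this on a dense NumPy array with a plain list of gene names; the documented method works on an AnnData object and returns a sparse lil_matrix, so the function name and input types here are simplifying assumptions.

```python
import numpy as np

def knockout_column(expression, gene_names, gene):
    """Return a copy of a cells-by-genes matrix with one gene's column set to zero."""
    expression = np.asarray(expression)
    knocked = expression.copy()  # leave the original matrix untouched
    knocked[:, gene_names.index(gene)] = 0
    return knocked
```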

SIDISH.in_silico_perturbation module

class SIDISH.in_silico_perturbation.InSilicoPerturbation(adata)[source]

Bases: object

Handles single-cell in-silico perturbation experiments.

adata

AnnData – The original gene expression data.

sidish

Object – An object providing cell annotation functionality.

genes

list – A list of gene names from the AnnData object.

ppi_handler

PPINetworkHandler – An instance of PPINetworkHandler for managing the PPI network.

process_gene(adata, gene)[source]

Process a single gene for perturbation by knocking it out along with its network neighbors.

Parameters:

gene (str) – The gene to knock out.

Returns:

AnnData – A new AnnData object representing the perturbed state with the gene (and its neighbors) knocked out.

run_parallel_processing(adata, n_jobs=4)[source]

Run gene perturbation processing in parallel.

Parameters:

n_jobs (int, optional) – Number of parallel jobs (default=4).

Side Effects:

Sets the ‘optimized_results’ attribute with the list of perturbed AnnData objects.

setup_ppi_network(threshold=0.8)[source]

Initialize the PPI network.

Parameters:
  • hippie_path (str) – Path to the HIPPIE file.

  • string_path (str) – Path to the STRING file.

  • info_path (str) – Path to the gene mapping info file.

  • threshold (float, optional) – Threshold value for filtering interactions (default=0.8).

Returns:

pd.DataFrame – The loaded and processed PPI network.

SIDISH.ppi_network_handler module

class SIDISH.ppi_network_handler.PPINetworkHandler(adata)[source]

Bases: object

Handles PPI network construction and neighbor retrieval using fixed file paths.

get_neighbors(target_gene)[source]

Retrieve direct and indirect neighbors of a target gene in the PPI network.

Parameters:

target_gene (str) – The gene for which to retrieve neighbors.

Returns:

tuple

A tuple containing:
  • list of direct neighbors

  • list of indirect neighbors
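One plausible reading of "direct and indirect neighbors" is one-hop and two-hop neighbors in an undirected graph. The sketch below implements that reading on a plain edge list; the function name, the undirected treatment, and the two-hop definition of "indirect" are all assumptions, not confirmed behavior of PPINetworkHandler.

```python
def direct_and_indirect_neighbors(edges, target_gene):
    """Return (direct, indirect) neighbor lists from an iterable of (source, target) pairs."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    direct = adjacency.get(target_gene, set())
    indirect = set()
    for gene in direct:
        indirect |= adjacency[gene]
    # Keep only genes reachable in two hops that are not the target or a direct neighbor.
    indirect -= direct | {target_gene}
    return sorted(direct), sorted(indirect)
```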

load_network(threshold=0.8)[source]

Load and process the PPI network from fixed files (integrating interactions from Hippie and STRING files).

Fixed files used:
  • Hippie file: located at SIDISH/PPI/hippie_current.txt

  • STRING links file: located at SIDISH/PPI/9606.protein.links.v11.5.txt

  • STRING info file: located at SIDISH/PPI/9606.protein.info.v11.5.txt

The method performs the following steps:
  1. Builds a gene mapping from STRING info (only including genes present in the AnnData object).

  2. Processes the Hippie file to extract interactions if the score is >= threshold.

  3. Processes the STRING links file to extract interactions if the score is >= threshold * 1000.

  4. Merges the interactions from both sources into one DataFrame.

  5. Constructs a merged network dictionary (with normalized scores) and saves it as a NumPy file.

  6. Returns the merged interactions as a pandas DataFrame.

Parameters:

threshold (float, optional) – Threshold for filtering interactions (default=0.8).

Returns:

pd.DataFrame – A DataFrame containing the merged PPI network with columns: “Source”, “Target”, and “Weight”.
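The filtering and merging steps listed above can be sketched on in-memory rows as follows. This is an illustration only: file parsing, the STRING identifier-to-gene mapping, and the saved NumPy dictionary are omitted, the function name and row layout are hypothetical, and dividing STRING scores by 1000 is an assumed form of the "normalized scores" mentioned in step 5.

```python
def merge_ppi(hippie_rows, string_rows, genes, threshold=0.8):
    """Filter and merge interactions into (Source, Target, Weight) tuples.

    hippie_rows: (source, target, score) with scores in [0, 1].
    string_rows: (source, target, score) with scores in [0, 1000].
    genes: set of gene names present in the single-cell data.
    """
    merged = []
    for source, target, score in hippie_rows:
        if source in genes and target in genes and score >= threshold:
            merged.append((source, target, float(score)))
    for source, target, score in string_rows:
        # STRING scores are on a 0-1000 scale, hence the scaled cut-off.
        if source in genes and target in genes and score >= threshold * 1000:
            merged.append((source, target, score / 1000.0))
    return merged
```

Putting both sources on a 0 to 1 scale lets a single threshold value be applied consistently to HIPPIE and STRING interactions.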

Module contents