SIDISH package
Submodules
SIDISH.DEEP_COX module
- class SIDISH.DEEP_COX.DEEPCOX(X_train, Y_train, weights, hidden, encoder, device, batch_size, seed, lr=0.000001, dropout=0)[source]
Bases:
object
- SIDISH.DEEP_COX.loss_DeepCox(pred, events, durations, weight=None, train=True)[source]
Compute the negative log-likelihood for the Deep Cox model in Phase 2 of SIDISH.
- Parameters:
pred (torch.Tensor) – Predicted risk scores.
events (torch.Tensor) – Event indicators (1 if the event occurred, 0 if censored).
durations (torch.Tensor) – Time durations.
weight (torch.Tensor, optional) – Patient weights. Defaults to None.
train (bool, optional) – Whether the model is in training mode. Defaults to True.
- Returns:
torch.Tensor – Negative log-likelihood.
Notes
This method is based on the implementation from DeepSurv: https://github.com/jaredleekatzman/DeepSurv/blob/master/deepsurv/deep_surv.py
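To make the quantity concrete, here is a minimal pure-Python sketch of a DeepSurv-style Cox negative log partial likelihood. It is not the SIDISH implementation. The function name `cox_neg_log_partial_likelihood`, the use of plain lists instead of torch.Tensor, the way `weight` scales each event term, and the normalization by the number of observed events are all assumptions made for illustration.

```python
import math


def cox_neg_log_partial_likelihood(pred, events, durations, weight=None):
    """Illustrative Cox negative log partial likelihood (hypothetical helper).

    pred, events, durations and weight are equal-length sequences of numbers.
    Ties in durations are not treated specially.
    """
    n = len(pred)
    if weight is None:
        weight = [1.0] * n  # assumption: unweighted patients when no weights are given

    # Visit patients from longest to shortest duration so that the running sum
    # of exp(pred) covers the risk set of the current patient.
    order = sorted(range(n), key=lambda i: durations[i], reverse=True)

    running_hazard = 0.0
    log_likelihood = 0.0
    n_events = 0
    for i in order:
        running_hazard += math.exp(pred[i])
        if events[i] == 1:
            # Only observed events contribute a term.
            log_likelihood += weight[i] * (pred[i] - math.log(running_hazard))
            n_events += 1

    if n_events == 0:
        return 0.0  # no observed events, so there is nothing to penalize
    return -log_likelihood / n_events
```

With three patients that all share a risk score of 0 and all experience the event, the loss reduces to log(6) / 3. A production version would typically use a log-sum-exp formulation to avoid overflow for large risk scores.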
SIDISH.DEEP_COX_ARCHITECTURE module
- class SIDISH.DEEP_COX_ARCHITECTURE.DEEPCOX_ARCHITECTURE(hidden, encoder, dropout)[source]
Bases:
Module
Deep Cox architecture used in SIDISH for survival prediction. This network integrates pretrained encoder representations (from the VAE) with a Cox proportional hazards regression layer for modeling survival risk.
- forward(x)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
SIDISH.SIDISH module
- class SIDISH.SIDISH.SIDISH(adata, bulk, device='cpu', seed=1234, use_spatial_graph=False, k_neighbors=None)[source]
Bases:
object
SIDISH (Semi-Supervised Iterative Deep Learning for Identifying High-Risk Cells).
This framework integrates single-cell and bulk RNA-seq data to identify High-Risk cancer cells and potential biomarkers.
- Parameters:
adata (AnnData) – Single-cell RNA-seq data.
bulk (pd.DataFrame) – Bulk RNA-seq data.
use_spatial_graph (bool, optional) – Whether to use spatial graph information (default=False).
k_neighbors (int, optional) – Number of neighbors to use for constructing the spatial graph (default=None).
device (str, optional) – Computation device (‘cpu’ or ‘cuda’; default=‘cpu’).
seed (int, optional) – Random seed for reproducibility (default=1234).
- getEmbedding_adata()[source]
Extracts latent representations from the trained VAE.
- Returns:
AnnData – Updated AnnData object with embeddings stored in obsm[‘latent’].
- get_MarkerGenes(logfc_threshold=1.5, pval_threshold=0.05, method='wilcoxon', group='h')[source]
Identifies marker genes for the specified group using different statistical methods.
- Returns:
upregulated_genes (list) – List of upregulated marker genes.
downregulated_genes (list) – List of downregulated marker genes.
- init_Phase1(epochs, i_epochs, latent_size, layer_dims, batch_size, optimizer, lr, lr_3, dropout, type='Normal')[source]
Initializes Phase 1: training a Variational Autoencoder (VAE) on single-cell RNA-seq data.
- Parameters:
epochs (int) – Number of epochs for initial VAE training.
i_epochs (int) – Number of iterations for retraining VAE.
latent_size (int) – Latent dimension size.
layer_dims (list) – List of hidden layer dimensions.
batch_size (int) – Batch size.
optimizer (str) – Optimizer for VAE training.
lr (float) – Learning rate.
lr_3 (float) – Learning rate for later iterations.
dropout (float) – Dropout rate.
type (str, optional) – Specifies dense or normal representation (default=“Normal”).
- Returns:
None
- init_Phase2(epochs, hidden, lr, dropout, test_size, batch_size_bulk)[source]
Initializes Phase 2: training a Deep Cox model for survival analysis using bulk RNA-seq data.
- Parameters:
epochs (int) – Number of training epochs for Deep Cox model.
hidden (int) – Number of neurons in the hidden layer.
lr (float) – Learning rate for Deep Cox model.
dropout (float) – Dropout rate for training.
test_size (float) – Proportion of dataset allocated to the test split.
batch_size_bulk (int) – Number of samples per batch for bulk data.
- Returns:
None
- plotUMAP(resolution, figure_size=(8, 6), fontsize=12, cell_size=20)[source]
Performs UMAP dimensionality reduction and Leiden clustering on the latent space.
- Parameters:
resolution (float) – The resolution parameter for Leiden clustering.
figure_size (tuple, optional) – Size of the generated UMAP plot (default=(8, 6)).
fontsize (int, optional) – Font size for labels and legends (default=12).
cell_size (int, optional) – Size of points in the scatter plot (default=20).
- Returns:
None
- plot_KM(penalizer=0.1, data_name='DATA', high_risk_label='High-Risk', background_label='Background', colors=('pink', 'grey'), fontsize=12)[source]
Plot Kaplan-Meier survival curves for High-Risk and background patient groups.
- Parameters:
penalizer (float) – Penalizer for CoxPHFitter regularization.
data_name (str) – Title label for the dataset.
high_risk_label (str) – Label for the High-Risk group.
background_label (str) – Label for the background group.
colors (tuple) – Colors for the survival plots (High-Risk, background).
fontsize (int) – Font size for plot labels and legends.
- plot_perturbation_UMAP_default(genes_of_interest, resolution=None, celltype=True, threshold=0.8)[source]
Generates UMAP visualizations for specified genes after in-silico perturbation.
- Parameters:
adata (AnnData) – AnnData object with latent embeddings.
sidish – SIDISH object for annotation and processing.
ppi_df (pd.DataFrame) – DataFrame containing the PPI network data.
genes_of_interest (list) – List of genes to visualize.
output_path – File path for saving the generated UMAP plot.
seed – Random seed for reproducibility (default=42).
- plot_perturbation_UMAP_differential(genes_of_interest, resolution=None, celltype=True, threshold=0.8)[source]
Generates UMAP visualizations for specified genes after in-silico perturbation.
- Parameters:
adata (AnnData) – AnnData object with latent embeddings.
sidish – SIDISH object for annotation and processing.
ppi_df (pd.DataFrame) – DataFrame containing the PPI network data.
genes_of_interest (list) – List of genes to visualize.
output_path – File path for saving the generated UMAP plot.
seed – Random seed for reproducibility (default=42).
- plot_top_perturbed_genes(gene_data, top_n=20)[source]
Plots a barplot of the top N genes with the highest percentage reduction in High-Risk cells after in-silico perturbation.
- Parameters:
gene_data (dict) – Dictionary of gene perturbation effects.
top_n (int, optional) – Number of top genes to display (default=20).
- train(iterations, percentile, steepness, path, num_workers=0, show=True, distribution_fit='default')[source]
Trains the SIDISH framework iteratively, refining the identification of High-Risk cells.
This function iteratively updates High-Risk cell classifications by integrating single-cell and bulk RNA-seq data. Each iteration includes:
Training the VAE model on single-cell data.
Training the Deep Cox model on bulk RNA-seq survival data.
Updating weight matrices to improve High-Risk cell identification.
- Parameters:
iterations (int) – Number of training iterations.
percentile (float) – Threshold percentile for defining High-Risk cells.
steepness (float) – Scaling factor for updating weights.
path (str) – Directory for saving model checkpoints.
num_workers (int, optional) – Number of parallel workers (default=0).
show (bool, optional) – If True, displays training progress (default=True).
- Returns:
sc.AnnData – Updated AnnData object containing the refined High-Risk cell classifications.
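The role of a percentile threshold can be illustrated with a small pure-Python sketch. How SIDISH actually derives per-cell risk scores and applies `percentile` is not described here, so everything in this block is an assumption: the helper names `percentile_cutoff` and `flag_high_risk` are hypothetical, `percentile` is taken to be a fraction in (0, 1], the nearest-rank definition of a percentile is used, and cells strictly above the cutoff are flagged.

```python
import math


def percentile_cutoff(scores, percentile):
    """Nearest-rank percentile of `scores` (hypothetical helper).

    Returns the smallest score such that at least `percentile` of all
    scores are less than or equal to it.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 < percentile <= 1:
        raise ValueError("percentile must lie in (0, 1]")
    ordered = sorted(scores)
    rank = math.ceil(percentile * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


def flag_high_risk(scores, percentile):
    """Flag every score strictly above the percentile cutoff (assumed rule)."""
    cutoff = percentile_cutoff(scores, percentile)
    return [score > cutoff for score in scores]
```

For example, with scores [0.2, 0.8, 0.4, 0.6] and a percentile of 0.5, the cutoff is 0.4 and the cells scoring 0.8 and 0.6 are flagged.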
- SIDISH.SIDISH.map_event_column(val)[source]
Strictly maps survival event status to binary integers (0 or 1).
This function enforces a strict schema for event data to prevent silent errors during survival analysis. It accepts case-insensitive string labels or numeric binary values.
- Parameters:
val (str, int, or float) – The value representing the event status. Accepted values:
Strings: Dead (maps to 1), Alive (maps to 0).
Numbers: 1 and 0.
- Returns:
int – Returns 1 if the event occurred (Dead) and 0 if censored (Alive).
- Raises:
ValueError – If val is anything other than the accepted string or numeric inputs (e.g., ‘censored’, ‘unknown’, 2, NaN).
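The contract described above can be sketched as follows. This is a hypothetical re-implementation written only from the documented behavior, not the SIDISH source; the name `map_event_sketch` and the error message are assumptions.

```python
def map_event_sketch(val):
    """Map an event status to 1 (event occurred) or 0 (censored).

    Hypothetical illustration of the documented contract.
    """
    labels = {"dead": 1, "alive": 0}
    if isinstance(val, str):
        key = val.lower()  # string labels are matched case-insensitively
        if key in labels:
            return labels[key]
    elif isinstance(val, (int, float)) and val in (0, 1):
        # NaN compares unequal to both 0 and 1, so it falls through to the
        # error below. Python booleans are ints and are accepted by this sketch.
        return int(val)
    raise ValueError(f"Unsupported event value: {val!r}")
```

Rejecting unknown labels with an exception, rather than mapping them to 0, keeps a mislabeled status column from being silently treated as censored data.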
- SIDISH.SIDISH.plot_umap(ax, umap_combined, palette, percentage_change_)[source]
Scatter UMAP of High-Risk/Background status after perturbation with a custom palette and an extra legend line indicating the perturbation percentage change.
- SIDISH.SIDISH.plot_umap_differential(ax, umap_combined)[source]
Scatter UMAP colored by continuous risk delta after perturbation.
- SIDISH.SIDISH.preprocess(adata, bulk, survival_df, patient_id, celltype_name, processed=True, n_genes_by_counts=5000, pct_counts_mt=10, batch_correction=False, survival_='Overall_survival_days', status='Sample_Status')[source]
Harmonize scRNA-seq (AnnData) and bulk tables:
QC (if raw), HVG selection, and intersection of genes across modalities.
Optionally apply Harmony (neighbors/UMAP/visual check) and ComBat.
Merge survival metadata into bulk, producing the columns duration and event.
- Returns:
(AnnData, pd.DataFrame) – (sc object restricted to intersecting genes, bulk with survival columns)
- SIDISH.SIDISH.process_Data(X, Y, test_size, batch_size, seed)[source]
Splits bulk RNA-seq data into training and testing datasets, converts them to tensors, and creates DataLoaders.
- Returns:
tuple –
torch.Tensor: X_train (Training feature matrix)
torch.Tensor: X_test (Testing feature matrix)
torch.Tensor: y_train (Training labels)
torch.Tensor: y_test (Testing labels)
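The seeded split step can be illustrated with a small standard-library sketch. It only shows how `test_size` and `seed` can determine a reproducible partition of sample indices; the helper name `split_indices` and the rounding rule are assumptions, and the tensor conversion and DataLoader creation performed by process_Data are not reproduced here.

```python
import math
import random


def split_indices(n_samples, test_size, seed):
    """Return (train_idx, test_idx) for a reproducible random split.

    Hypothetical helper: test_size is the fraction of samples assigned to
    the test split, rounded up to a whole number of samples.
    """
    if not 0 < test_size < 1:
        raise ValueError("test_size must lie strictly between 0 and 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a private RNG keeps global state untouched
    n_test = math.ceil(n_samples * test_size)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx
```

Calling the helper twice with the same seed yields the same partition, which is the property the `seed` argument is meant to provide.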
SIDISH.Utils module
- SIDISH.Utils.create_pytorch_geometric_data(adata, spatial_key='spatial', method='knn', k=5)[source]
Create a PyTorch Geometric Data object from AnnData with spatial information
Parameters:
- adata : AnnData
AnnData object containing gene expression and spatial data
- spatial_key : str
Key in adata.obsm where spatial coordinates are stored
- method : str
Method to create the graph, either ‘knn’ or ‘radius’
- k : int
Number of nearest neighbors for KNN graph (default: 5)
Returns:
- data : torch_geometric.data.Data
PyTorch Geometric Data object containing the graph
- SIDISH.Utils.create_spatial_graph(spatial_coords, method='knn', k=5, radius=None, include_self=False)[source]
Create a graph from spatial coordinates with limited neighborhood size
Parameters:
- spatial_coords : numpy.ndarray
Array of shape [n_cells, n_dims] containing spatial coordinates
- method : str
Method to create the graph, either ‘knn’ or ‘radius’
- k : int
Number of nearest neighbors for KNN graph (default: 5)
- radius : float
Radius for radius graph
- include_self : bool
Whether to include self-loops
Returns:
- edge_index : torch.LongTensor
Edge indices in COO format [2, num_edges]
- edge_weight : torch.FloatTensor
Edge weights based on distance [num_edges]
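The COO output layout for the ‘knn’ method can be illustrated with a brute-force pure-Python sketch. This is not the SIDISH implementation: the name `knn_graph_sketch` is hypothetical, plain lists stand in for the torch tensors, and using the raw Euclidean distance as the edge weight is an assumption, since the exact distance-to-weight transform is not documented here.

```python
import math


def knn_graph_sketch(spatial_coords, k=5, include_self=False):
    """Build a k-nearest-neighbor graph in COO layout (hypothetical helper).

    Returns (edge_index, edge_weight), where edge_index is a pair of lists
    [sources, targets] and edge_weight holds one distance per edge.
    """
    sources, targets, weights = [], [], []
    for i, point in enumerate(spatial_coords):
        # Distance from cell i to every candidate neighbor.
        candidates = [
            (math.dist(point, other), j)
            for j, other in enumerate(spatial_coords)
            if include_self or j != i
        ]
        # Keep the k closest candidates; ties are broken by cell index.
        for distance, j in sorted(candidates)[:k]:
            sources.append(i)
            targets.append(j)
            weights.append(distance)
    return [sources, targets], weights
```

Each cell contributes at most k outgoing edges, so the neighborhood size stays bounded regardless of how dense the tissue is. The brute-force search is quadratic in the number of cells and is only meant to show the data layout.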
- SIDISH.Utils.get_spatial_graph_from_adata(adata, spatial_key='spatial', method='knn', k=5, radius=None, include_self=False)[source]
Create a graph from spatial coordinates stored in AnnData object
Parameters:
- adata : AnnData
AnnData object containing spatial coordinates
- spatial_key : str
Key in adata.obsm where spatial coordinates are stored
- method : str
Method to create the graph, either ‘knn’ or ‘radius’
- k : int
Number of nearest neighbors for KNN graph (default: 5)
- radius : float
Radius for radius graph
- include_self : bool
Whether to include self-loops
Returns:
- edge_index : torch.LongTensor
Edge indices in COO format [2, num_edges]
- edge_weight : torch.FloatTensor
Edge weights based on distance [num_edges]
SIDISH.VAE module
- class SIDISH.VAE.VAE(epochs, adata, z_dim, layer_dims, lr, dropout, device, seed, gcn_dims=None)[source]
Bases:
object
- getBaseEmbedding(clustering=True)[source]
Get embeddings using only the base encoder (for Cox regression compatibility)
SIDISH.VAE_ARCHITECTURE module
- class SIDISH.VAE_ARCHITECTURE.ARCHITECTURE(input_dim, z_dim, layer_dims, seed, dropout=0.5, gcn_dims=None, use_cuda=False)[source]
Bases:
Module
- forward(x, edge_index=None, edge_weight=None)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class SIDISH.VAE_ARCHITECTURE.Decoder(input_dim, z_dim, layer_dims, dropout=0)[source]
Bases:
Module
- forward(z)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class SIDISH.VAE_ARCHITECTURE.Encoder(input_dim, z_dim, layer_dims, dropout=0)[source]
Bases:
Module
- forward(x)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class SIDISH.VAE_ARCHITECTURE.SpatialEncoder(input_dim, z_dim, layer_dims, dropout=0, gcn_dims=None)[source]
Bases:
Module
- forward(x, edge_index=None, edge_weight=None)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class SIDISH.VAE_ARCHITECTURE.SpatialPreEncoder(input_dim, layer_dims, dropout=0)[source]
Bases:
Module
- forward(x)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
SIDISH.gene_perturbation_utils module
- class SIDISH.gene_perturbation_utils.GenePerturbationUtils[source]
Bases:
object
Utility functions for gene perturbation tasks.
- static adjust_expression(adata, genename, network_df)[source]
Modify gene expression for perturbation by knocking out the target gene and its co-expressed genes.
- Parameters:
adata (AnnData) – An AnnData object containing gene expression data.
genename (str) – The gene to knock out.
network_df (pd.DataFrame) – A DataFrame containing the PPI network information.
- Returns:
AnnData – A new AnnData object with the expression of the target gene and its co-expressed partners set to zero.
- static get_cogenes(adata, network_df, genename)[source]
Retrieve co-expressed genes (neighbors) from the PPI network for a given gene.
- Parameters:
adata (AnnData) – An AnnData object containing gene expression data.
network_df (pd.DataFrame) – A DataFrame containing the PPI network information.
genename (str) – The gene for which to find co-expressed partners.
- Returns:
list – A list of co-expressed gene names (excluding the input gene).
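The neighbor lookup can be sketched with plain Python data structures. This is an illustration under stated assumptions, not the SIDISH code: the name `cogenes_sketch` is hypothetical, the network is passed as (source, target, weight) tuples rather than a DataFrame, edges are treated as undirected, and partners are restricted to genes present in the expression data.

```python
def cogenes_sketch(gene_names, edges, genename):
    """Return the network partners of `genename` (hypothetical helper).

    gene_names: genes available in the expression data.
    edges: iterable of (source, target, weight) tuples.
    """
    present = set(gene_names)
    partners = set()
    for source, target, _weight in edges:
        # Treat each edge as undirected: either endpoint may be the query gene.
        if source == genename and target in present:
            partners.add(target)
        elif target == genename and source in present:
            partners.add(source)
    partners.discard(genename)  # the query gene itself is never reported
    return sorted(partners)
```

Filtering against `gene_names` avoids returning partners that could not be knocked out later because they are absent from the expression matrix.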
- static knockout_gene(adata, gene)[source]
Knock out a gene’s expression by setting its values to zero in the expression matrix.
- Parameters:
adata (AnnData) – An AnnData object containing gene expression data.
gene (str) – Name of the gene to knock out.
- Returns:
lil_matrix – A copy of the expression matrix with the specified gene’s expression set to zero.
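The knockout operation amounts to zeroing one or more columns of a cells-by-genes matrix while leaving the original untouched. The sketch below shows this with nested lists in place of a sparse matrix; the name `knockout_sketch` and the list-based representation are assumptions made for illustration.

```python
def knockout_sketch(matrix, gene_names, genes_to_zero):
    """Return a copy of `matrix` with the given gene columns set to zero.

    Hypothetical helper: matrix is a list of rows (cells), with one entry
    per gene in gene_names.
    """
    # Raises ValueError if a requested gene is not in gene_names.
    columns = [gene_names.index(gene) for gene in genes_to_zero]
    perturbed = [list(row) for row in matrix]  # copy so the input stays intact
    for row in perturbed:
        for col in columns:
            row[col] = 0.0
    return perturbed
```

Passing a target gene together with its network partners in `genes_to_zero` mirrors the combined knockout described for adjust_expression.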
SIDISH.in_silico_perturbation module
- class SIDISH.in_silico_perturbation.InSilicoPerturbation(adata)[source]
Bases:
object
Handles single-cell in-silico perturbation experiments.
- adata
AnnData – The original gene expression data.
- sidish
Object – An object providing cell annotation functionality.
- genes
list – A list of gene names from the AnnData object.
- ppi_handler
PPINetworkHandler – An instance of PPINetworkHandler for managing the PPI network.
- process_gene(adata, gene)[source]
Process a single gene for perturbation by knocking it out along with its network neighbors.
- Parameters:
gene (str) – The gene to knock out.
- Returns:
AnnData – A new AnnData object representing the perturbed state with the gene (and its neighbors) knocked out.
- run_parallel_processing(adata, n_jobs=4)[source]
Run gene perturbation processing in parallel.
- Parameters:
n_jobs (int, optional) – Number of parallel jobs (default=4).
- Side Effects:
Sets the ‘optimized_results’ attribute with the list of perturbed AnnData objects.
- setup_ppi_network(threshold=0.8)[source]
Initialize the PPI network.
- Parameters:
hippie_path (str) – Path to the HIPPIE file.
string_path (str) – Path to the STRING file.
info_path (str) – Path to the gene mapping info file.
threshold (float, optional) – Threshold value for filtering interactions (default=0.8).
- Returns:
pd.DataFrame – The loaded and processed PPI network.
SIDISH.ppi_network_handler module
- class SIDISH.ppi_network_handler.PPINetworkHandler(adata)[source]
Bases:
object
Handles PPI network construction and neighbor retrieval using fixed file paths.
- get_neighbors(target_gene)[source]
Retrieve direct and indirect neighbors of a target gene in the PPI network.
- Parameters:
target_gene (str) – The gene for which to retrieve neighbors.
- Returns:
tuple – A tuple containing:
list of direct neighbors
list of indirect neighbors
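The distinction between direct and indirect neighbors can be sketched on an adjacency dictionary. How SIDISH defines "indirect" is not stated here, so this sketch assumes it means genes exactly two hops away (neighbors of neighbors that are not themselves direct neighbors). The name `neighbors_sketch` and the dictionary format are likewise hypothetical.

```python
def neighbors_sketch(network, target_gene):
    """Return (direct, indirect) neighbor lists for `target_gene`.

    Hypothetical helper: network maps each gene to an iterable of its
    interaction partners.
    """
    direct = set(network.get(target_gene, ()))
    direct.discard(target_gene)  # ignore self-loops

    indirect = set()
    for gene in direct:
        indirect.update(network.get(gene, ()))
    # Keep only genes that are two hops away and not already direct neighbors.
    indirect -= direct
    indirect.discard(target_gene)
    return sorted(direct), sorted(indirect)
```

A gene missing from the network simply yields two empty lists in this sketch.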
- load_network(threshold=0.8)[source]
Load and process the PPI network from fixed files (integrating interactions from Hippie and STRING files).
- Fixed files used:
Hippie file: located at SIDISH/PPI/hippie_current.txt
STRING links file: located at SIDISH/PPI/9606.protein.links.v11.5.txt
STRING info file: located at SIDISH/PPI/9606.protein.info.v11.5.txt
- The method performs the following steps:
Builds a gene mapping from STRING info (only including genes present in the AnnData object).
Processes the Hippie file to extract interactions if the score is >= threshold.
Processes the STRING links file to extract interactions if the score is >= threshold * 1000.
Merges the interactions from both sources into one DataFrame.
Constructs a merged network dictionary (with normalized scores) and saves it as a NumPy file.
Returns the merged interactions as a pandas DataFrame.
- Parameters:
threshold (float, optional) – Threshold for filtering interactions (default=0.8).
- Returns:
pd.DataFrame – A DataFrame containing the merged PPI network with columns: “Source”, “Target”, and “Weight”.
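The filtering and merging steps listed above can be sketched on in-memory tuples. This is an illustration only: the name `merge_interactions_sketch` is hypothetical, file parsing and gene-ID mapping are omitted, and keeping the larger normalized score when both sources report the same pair is an assumption about the merge rule.

```python
def merge_interactions_sketch(hippie, string_links, threshold=0.8):
    """Filter and merge two interaction lists (hypothetical helper).

    hippie: (source, target, score) tuples with scores in [0, 1].
    string_links: (source, target, score) tuples with scores in [0, 1000].
    Returns a list of dicts with the keys "Source", "Target" and "Weight".
    """
    merged = {}
    for source, target, score in hippie:
        if score >= threshold:
            merged[(source, target)] = score
    for source, target, score in string_links:
        if score >= threshold * 1000:
            normalized = score / 1000  # bring STRING scores onto the [0, 1] scale
            key = (source, target)
            # Assumed merge rule: keep the stronger evidence for a shared pair.
            merged[key] = max(merged.get(key, 0.0), normalized)
    return [
        {"Source": source, "Target": target, "Weight": weight}
        for (source, target), weight in sorted(merged.items())
    ]
```

Scaling the STRING cutoff by 1000 rather than rescaling every score first applies the same effective threshold to both sources.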