Preprocess Functions

preprocess_functions.DESEQ(count_mtx, datMeta, condition, n_genes, train_index=None, fit_type='parametric')[source]

Conducts differential expression analysis using the DESeq2 algorithm.

Parameters:
  • count_mtx (pandas.DataFrame) – Count data for different genes.

  • datMeta (pandas.DataFrame) – Metadata for the samples in count_mtx.

  • condition (str) – Column in datMeta to use for condition separation.

  • n_genes (int) – Number of top genes to extract from the differential expression result.

  • train_index (list, optional) – Indexes for training samples if splitting is required.

  • fit_type (str, optional) – Fit type used for the variance-stabilizing transformation (VST).

Returns:
  • DeseqDataSet – An object containing the results and configuration of the DESeq2 run.

  • numpy.ndarray – Variance-stabilized transformed counts.

  • list – Top genes identified in the differential expression analysis.

Return type:

tuple

class preprocess_functions.ElasticNet(num_features, num_classes, alpha, lam)[source]

A PyTorch module that implements a logistic regression model with Elastic Net regularization.

Parameters:
  • num_features (int) – Number of features in the input dataset.

  • num_classes (int) – Number of classes in the output prediction.

  • alpha (float) – Mixing parameter for L1 (Lasso) and L2 (Ridge) regularization.

  • lam (float) – Overall regularization strength.

linear

Linear transformation layer.

Type:

nn.Linear

accuracy(logits, y)[source]

Calculates the accuracy of the model’s predictions.

Parameters:
  • logits (torch.Tensor) – The logits as predicted by the model.

  • y (torch.Tensor) – The true labels.

Returns:

The calculated accuracy.

Return type:

float
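As an illustration of the accuracy computation described above, here is a minimal NumPy sketch. It assumes the predicted class is the index of the largest logit in each row; the name accuracy_sketch is hypothetical and the method itself operates on torch.Tensor inputs rather than arrays.

```python
import numpy as np

def accuracy_sketch(logits, y):
    """Fraction of samples whose highest-scoring class equals the true label."""
    preds = np.argmax(logits, axis=1)  # predicted class index per sample
    return float(np.mean(preds == y))

# Three samples, two classes (illustrative values only)
logits = np.array([[2.0, 0.1], [0.3, 1.5], [0.9, 0.2]])
y = np.array([0, 1, 1])
acc = accuracy_sketch(logits, y)
```

In this toy input, two of the three predictions match the labels.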

calculate_loss(logits, y)[source]

Calculates the cross-entropy loss combined with the Elastic Net regularization penalty for the model.

Parameters:
  • logits (torch.Tensor) – The logits as predicted by the model.

  • y (torch.Tensor) – The true labels.

Returns:

The calculated loss value.

Return type:

torch.Tensor
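One common way to combine cross-entropy with an Elastic Net penalty is sketched below in NumPy. The exact penalty form is an assumption: here alpha weights the L1 term, (1 - alpha) weights the squared L2 term, and lam scales the sum. The function name and the explicit weights argument are hypothetical; the class presumably reads the weights from its linear layer.

```python
import numpy as np

def elastic_net_loss_sketch(logits, y, weights, alpha, lam):
    """Mean cross-entropy plus lam * (alpha * L1 + (1 - alpha) * L2^2)."""
    # Numerically stable log-softmax over the class dimension
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the true class, averaged over samples
    ce = -log_probs[np.arange(len(y)), y].mean()
    l1 = np.abs(weights).sum()      # Lasso term
    l2 = np.square(weights).sum()   # Ridge term (squared L2 norm)
    return ce + lam * (alpha * l1 + (1.0 - alpha) * l2)

logits = np.zeros((2, 2))
y = np.array([0, 1])
weights = np.array([[1.0, -2.0], [0.0, 3.0]])
loss = elastic_net_loss_sketch(logits, y, weights, alpha=0.5, lam=0.1)
```

With alpha = 1 the penalty reduces to pure L1 (Lasso), and with alpha = 0 to pure squared L2 (Ridge).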

forward(X)[source]

Forward pass of the neural network model that makes predictions.

Parameters:

X (torch.Tensor) – Tensor containing input features.

Returns:

Tensor containing the output logits.

Return type:

torch.Tensor

preprocess_functions.SNF(networks, K=15, t=10)[source]

Performs Similarity Network Fusion over multiple networks.

Parameters:
  • networks (list of pd.DataFrame) – The individual networks to fuse, represented as similarity or distance matrices.

  • K (int) – Number of nearest neighbors to retain in the diffusion process.

  • t (int) – Number of iterations for the fusion process.

Returns:

A fused network represented as a similarity matrix.

Return type:

pd.DataFrame

preprocess_functions.abs_bicorr(data, mat_means=True)[source]

Calculates the absolute bicorrelation matrix for the given data.

Parameters:
  • data (pd.DataFrame) – Data for which to compute the bicorrelation.

  • mat_means (bool) – If True, subtract the mean from each column before computing the correlation.

Returns:

Bicorrelation matrix.

Return type:

pd.DataFrame

preprocess_functions.check_wall_names(wall)[source]

Checks whether all matrices in a list share the same row and column names.

Parameters:

wall (list of pd.DataFrame) – List of matrices to check.

Returns:

Returns True if all matrices have consistent names, False otherwise.

Return type:

bool

preprocess_functions.convert_dataframe_to_numpy(input_data)[source]

Converts a pandas DataFrame to a numpy array. If the input is not a DataFrame, returns it as is.

Parameters:

input_data (pd.DataFrame or any) – Data to be converted to numpy array.

Returns:

The resulting numpy array from conversion or the original input if conversion isn’t applicable.

Return type:

np.array or original data type

preprocess_functions.cosine_corr(data, mat_means=True)[source]

Computes cosine correlations for the given data, treated as vectors.

Parameters:
  • data (pd.DataFrame) – Data for which to compute cosine correlations.

  • mat_means (bool) – If True, normalizes the data before computing correlation.

Returns:

Cosine correlation matrix.

Return type:

pd.DataFrame
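For reference, cosine similarity between row vectors can be sketched as follows. This is an illustration only: it assumes rows are the vectors being compared, it omits the mat_means option, and the name cosine_similarity_sketch is hypothetical.

```python
import numpy as np

def cosine_similarity_sketch(data):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    data = np.asarray(data, dtype=float)
    norms = np.linalg.norm(data, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros instead of dividing by 0
    unit = data / norms      # scale each row to unit length
    return unit @ unit.T     # dot products of unit vectors are cosines

sim = cosine_similarity_sketch([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
```

Rows 0 and 2 point in the same direction, so their similarity is 1; row 1 is orthogonal to both, so its similarity with them is 0.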

preprocess_functions.create_similarity_matrix(mat, method='euclidean')[source]

Creates a similarity matrix from the given data matrix using specified methods.

Parameters:
  • mat (pd.DataFrame) – The matrix from which to calculate similarities (e.g., gene expression levels).

  • method (str) – The method to use for calculating similarities. Supported methods are ‘bicorr’, ‘pearson’, and ‘euclidean’.

Returns:

A DataFrame representing the similarity matrix.

Return type:

pd.DataFrame

preprocess_functions.custom_cpm(counts, lib_size)[source]

Computes Counts Per Million (CPM) normalization on count data.

Parameters:
  • counts (np.array) – An array of raw gene counts.

  • lib_size (float or np.array) – The total counts in each library (sample).

Returns:

Normalized counts expressed as counts per million.

Return type:

np.array
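CPM normalization divides each count by its sample's library size and multiplies by one million. A minimal sketch, assuming a genes-by-samples layout so that lib_size broadcasts across columns (the name cpm_sketch is hypothetical):

```python
import numpy as np

def cpm_sketch(counts, lib_size):
    """Counts per million: counts / library size * 1e6."""
    counts = np.asarray(counts, dtype=float)      # genes x samples
    lib_size = np.asarray(lib_size, dtype=float)  # scalar or one value per sample
    return counts / lib_size * 1e6

counts = np.array([[10, 20], [90, 180]])
lib_size = counts.sum(axis=0)  # total counts per sample: [100, 200]
cpm = cpm_sketch(counts, lib_size)
```

After normalization the two samples have identical values, because the second sample is exactly the first sequenced at twice the depth.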

preprocess_functions.data_preprocess(count_mtx, datMeta, gene_exp=False)[source]

Processes count matrix data by removing genes with zero expression across all samples. Optionally filters genes based on expression levels and calculates similarity matrices.

Parameters:
  • count_mtx (pd.DataFrame) – A DataFrame containing the gene count data.

  • datMeta (pd.Series or pd.DataFrame) – Metadata associated with the samples in count_mtx.

  • gene_exp (bool) – If true, performs additional gene filtering and similarity matrix calculations.

Returns:
  • pd.DataFrame – The processed count matrix.

  • pd.Series or pd.DataFrame – The corresponding processed metadata.

Return type:

tuple

preprocess_functions.dominateset(xx, KK=20)[source]

Extracts a dominant set from a similarity matrix by setting all but the top KK connections per row to zero and re-normalizing the rows.

Parameters:
  • xx (np.array or pd.DataFrame) – The input similarity or distance matrix.

  • KK (int) – Number of top values to keep in each row of the matrix.

Returns:

The extracted dominant set matrix with top KK neighbors per row.

Return type:

np.array
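The dominant-set step can be illustrated with the NumPy sketch below. It assumes larger values mean stronger connections and that re-normalization makes each row sum to 1; the name dominant_set_sketch is hypothetical and tie-breaking may differ from the actual function.

```python
import numpy as np

def dominant_set_sketch(xx, KK=20):
    """Keep the KK largest entries in each row, zero the rest, rescale rows to sum to 1."""
    xx = np.asarray(xx, dtype=float)
    out = np.zeros_like(xx)
    # Column indices of the KK largest values in each row
    top = np.argsort(-xx, axis=1)[:, :KK]
    rows = np.arange(xx.shape[0])[:, None]
    out[rows, top] = xx[rows, top]
    row_sums = out.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid dividing an all-zero row by 0
    return out / row_sums

sim = np.array([[0.5, 0.3, 0.2],
                [0.1, 0.6, 0.3],
                [0.2, 0.1, 0.7]])
dom = dominant_set_sketch(sim, KK=2)
```

Each row of dom keeps only its two strongest connections, rescaled so the row sums to 1.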

preprocess_functions.elastic_net(count_mtx, datMeta, train_index=None, val_index=None, l1_ratio=1, num_epochs=1000, lam=0.01, device='cuda')[source]

Trains an Elastic Net model given count data and metadata.

Parameters:
  • count_mtx (pandas.DataFrame) – Matrix containing gene expression or count data.

  • datMeta (pandas.Series or DataFrame) – Metadata corresponding to count_mtx samples.

  • train_index (list, optional) – Indexes for training samples.

  • val_index (list, optional) – Indexes for validation samples.

  • l1_ratio (float, optional) – The balance between L1 and L2 regularization.

  • num_epochs (int, optional) – Number of training epochs.

  • lam (float, optional) – Regularization strength.

  • device (str, optional) – Device to run the training on (‘cuda’ or ‘cpu’).

Returns:
  • list – Extracted features based on weight importance.

  • ElasticNet – Trained ElasticNet model.

Return type:

tuple

preprocess_functions.filter_genes(y, design=None, group=None, lib_size=None, min_count=10, min_total_count=15, large_n=10, min_prop=0.7)[source]

Filters genes based on several criteria including minimum count thresholds and proportions.

Parameters:
  • y (np.array) – Expression data for the genes.

  • design (np.array, optional) – Design matrix for the samples if available.

  • group (np.array, optional) – Group information for samples.

  • lib_size (np.array, optional) – Library sizes for the samples.

  • min_count (int) – Minimum count threshold for including a gene.

  • min_total_count (int) – Minimum total count across all samples for a gene.

  • large_n (int) – Number of samples per group above which a group is considered ‘large’.

  • min_prop (float) – Minimum proportion of samples in the smallest group that must express a gene; applied when the group is ‘large’.

Returns:

Boolean array indicating which genes to keep.

Return type:

np.array
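The parameter names and defaults resemble edgeR's filterByExpr, so the sketch below follows that procedure. This is an assumption about the intended logic, not a copy of the implementation: the design argument is ignored, genes are assumed to be rows, and the name filter_genes_sketch is hypothetical.

```python
import numpy as np

def filter_genes_sketch(y, group=None, lib_size=None, min_count=10,
                        min_total_count=15, large_n=10, min_prop=0.7):
    """Boolean mask of genes to keep, in the style of edgeR's filterByExpr."""
    y = np.asarray(y, dtype=float)  # genes x samples
    if lib_size is None:
        lib_size = y.sum(axis=0)    # default: total counts per sample
    lib_size = np.asarray(lib_size, dtype=float)

    # Size of the smallest group; treat all samples as one group if none is given
    if group is None:
        min_sample_size = y.shape[1]
    else:
        _, sizes = np.unique(np.asarray(group), return_counts=True)
        min_sample_size = sizes.min()
    # For large groups, require only a proportion of the samples beyond large_n
    if min_sample_size > large_n:
        min_sample_size = large_n + (min_sample_size - large_n) * min_prop

    cpm = y / lib_size * 1e6
    cpm_cutoff = min_count / np.median(lib_size) * 1e6
    tol = 1e-14
    keep_cpm = (cpm >= cpm_cutoff).sum(axis=1) >= min_sample_size - tol
    keep_total = y.sum(axis=1) >= min_total_count - tol
    return keep_cpm & keep_total

y = np.array([[100, 120, 90, 110],   # expressed in every sample
              [0, 1, 0, 2],          # barely expressed
              [50, 0, 60, 0]])       # expressed in two samples
keep = filter_genes_sketch(y, group=["a", "a", "b", "b"])
```

A gene is kept when enough samples exceed the CPM cutoff derived from min_count and its total count reaches min_total_count.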

preprocess_functions.gen_new_graph(model, h, meta, pnet=False)[source]

Generates a new graph from learned features using a provided model, handling multi-modal data and integrating them.

Parameters:
  • model (nn.Module) – The trained model which contains the learned parameters.

  • h (torch.Tensor) – Tensor containing features of the data.

  • meta (pd.DataFrame or pd.Series) – Metadata associated with the features.

  • pnet (bool) – Flag indicating whether or not pathway network transformations have been used.

Returns:

A graph object representing the new graph generated from the features.

Return type:

nx.Graph

preprocess_functions.get_k_neighbors(matrix, k, corr=True)[source]

Finds k-nearest neighbors for each row in the given matrix.

Parameters:
  • matrix (pd.DataFrame) – The matrix from which neighbors are to be found.

  • k (int) – The number of neighbors to find for each row.

  • corr (bool) – Indicates whether to use correlation rather than distance for finding neighbors.

Returns:

A dictionary where keys are indices (or node names) and values are lists of k-nearest neighbors’ indices.

Return type:

dict
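A minimal sketch of the neighbor lookup, assuming a square similarity (corr=True) or distance (corr=False) matrix and excluding each row's own index from its neighbor list. Whether the actual function excludes self-matches is not stated here, and the name k_neighbors_sketch is hypothetical.

```python
import numpy as np

def k_neighbors_sketch(matrix, k, corr=True):
    """Map each row index to the indices of its k nearest neighbors."""
    matrix = np.asarray(matrix, dtype=float)
    neighbors = {}
    for i, row in enumerate(matrix):
        # Highest similarity first when corr=True, smallest distance first otherwise
        order = np.argsort(-row if corr else row, kind="stable")
        neighbors[i] = [int(j) for j in order if j != i][:k]
    return neighbors

sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.5],
                [0.1, 0.5, 1.0]])
nn_sim = k_neighbors_sketch(sim, k=1, corr=True)

dist = np.array([[0.0, 1.0, 5.0],
                 [1.0, 0.0, 2.0],
                 [5.0, 2.0, 0.0]])
nn_dist = k_neighbors_sketch(dist, k=1, corr=False)
```

Both toy matrices encode the same geometry, so the two calls yield the same neighbor lists.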

preprocess_functions.knn_graph_generation(datExpr, datMeta, knn=20, method='euclidean', extracted_feats=None, **args)[source]

Generates a k-nearest neighbor graph based on the specified data and method of similarity.

Parameters:
  • datExpr (pd.DataFrame) – DataFrame containing expression data or other numerical data.

  • datMeta (pd.DataFrame or pd.Series) – Metadata for the nodes in the graph.

  • knn (int) – Number of nearest neighbors to connect to each node.

  • method (str) – Method used for calculating similarity or distance (‘euclidean’, ‘bicorr’, ‘pearson’, ‘cosine’).

  • extracted_feats ([type]) – Specific features extracted from the data to use for graph construction.

  • **args – Additional arguments for customizing the node visualization (e.g., node_colour, node_size).

Returns:

A NetworkX graph object representing the k-nearest neighbors graph.

Return type:

nx.Graph

preprocess_functions.normalize(x)[source]

Normalizes a square matrix in place by scaling each row by its sum excluding the diagonal value.

Parameters:

x (np.array) – The square matrix to normalize.

Returns:

The normalized matrix with diagonal set to 0.5.

Return type:

np.array
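The sketch below follows the normalization used in the reference SNF algorithm: each row is divided by twice its off-diagonal sum and the diagonal is then set to 0.5, so every row sums to 1. The factor of two is an assumption taken from that convention, the sketch works on a copy rather than in place, and the name normalize_sketch is hypothetical.

```python
import numpy as np

def normalize_sketch(x):
    """Scale rows by twice their off-diagonal sum, then set the diagonal to 0.5."""
    x = np.array(x, dtype=float)                 # copy; the original is left untouched
    row_sum_mdiag = x.sum(axis=1) - np.diag(x)   # row totals minus diagonal entries
    row_sum_mdiag[row_sum_mdiag == 0] = 1.0      # guard against division by zero
    x = x / (2.0 * row_sum_mdiag[:, None])
    np.fill_diagonal(x, 0.5)
    return x

norm = normalize_sketch([[1, 2, 2],
                         [1, 5, 3],
                         [4, 4, 9]])
```

The off-diagonal entries of each row sum to 0.5, which together with the 0.5 diagonal gives a row total of 1.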

preprocess_functions.pearson_corr(data, mat_means=True)[source]

Computes the Pearson correlation matrix for the given data.

Parameters:
  • data (pd.DataFrame) – Data for which to compute the Pearson correlation.

  • mat_means (bool) – Normalizes data by its mean if set to True.

Returns:

Pearson correlation matrix.

Return type:

pd.DataFrame
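For illustration, the Pearson correlation between rows can be computed by centering each row and normalizing the resulting cross-products. The sketch assumes rows are the variables being correlated and that no row is constant; the name pearson_sketch is hypothetical.

```python
import numpy as np

def pearson_sketch(data):
    """Pairwise Pearson correlation between the rows of a 2-D array."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=1, keepdims=True)  # subtract each row's mean
    cov = centered @ centered.T                          # unnormalized covariances
    std = np.sqrt(np.diag(cov))                          # unnormalized standard deviations
    return cov / np.outer(std, std)

data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 4.0, 6.0],
                 [3.0, 2.0, 1.0]])
corr = pearson_sketch(data)
```

The result agrees with np.corrcoef(data), which treats rows as variables by default.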

preprocess_functions.plot_knn_network(data, K, labels, node_colours='skyblue', node_size=300)[source]

Plots a k-nearest neighbors network using NetworkX.

Parameters:
  • data (pd.DataFrame) – The similarity or distance matrix used to determine neighbors.

  • K (int) – The number of nearest neighbors for network connections.

  • labels (pd.Series) – Labels or categories for the nodes used in plotting.

  • node_colours (str or list) – Color or list of colors for the nodes.

  • node_size (int) – Size of the nodes in the plot.

Returns:

A NetworkX graph object that has been plotted.

Return type:

nx.Graph