Preprocess Functions

preprocess_functions.DESEQ(count_mtx, datMeta, condition, n_genes, train_index=None, fit_type='parametric')

Conducts differential expression analysis using the DESeq2 algorithm.

Parameters:

count_mtx (pandas.DataFrame) – Count data for different genes.
datMeta (pandas.DataFrame) – Metadata for the samples in count_mtx.
condition (str) – Column in datMeta to use for condition separation.
n_genes (int) – Number of top genes to extract from the differential expression result.
train_index (list, optional) – Indexes for training samples if splitting is required.
fit_type (str, optional) – Statistical fitting type for VST transformation.

Returns:

DeseqDataSet – An object containing the results and configuration of the DESeq2 run.
numpy.ndarray – Variance-stabilized transformed counts.
list – Top genes identified in the differential expression analysis.

Return type:

tuple of (DeseqDataSet, numpy.ndarray, list)

class preprocess_functions.ElasticNet(num_features, num_classes, alpha, lam)

A PyTorch module that implements a logistic regression model with Elastic Net regularization.

Parameters:

num_features (int) – Number of features in the input dataset.
num_classes (int) – Number of classes in the output prediction.
alpha (float) – Mixing parameter for L1 (Lasso) and L2 (Ridge) regularization.
lam (float) – Overall regularization strength.

linear

Linear transformation layer.

Type: nn.Linear

accuracy(logits, y)

Calculates the accuracy of the model’s predictions.

Parameters:

logits (torch.Tensor) – The logits as predicted by the model.
y (torch.Tensor) – The true labels.

Returns:

The calculated accuracy.

Return type:

float
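
As a rough illustration of how accuracy is commonly derived from logits, the sketch below uses NumPy instead of PyTorch and assumes the predicted class is the arg-max of each row of logits. The helper name `accuracy_from_logits` is hypothetical and is not part of the module.

```python
import numpy as np

def accuracy_from_logits(logits, y):
    """Fraction of rows whose arg-max logit equals the true label (assumed rule)."""
    logits = np.asarray(logits, dtype=float)
    y = np.asarray(y)
    predicted = logits.argmax(axis=1)  # most likely class per sample
    return float((predicted == y).mean())

# Four samples, two classes; the third sample is misclassified.
logits = np.array([[2.0, 0.1], [0.3, 1.5], [0.9, 0.2], [0.1, 0.4]])
labels = np.array([0, 1, 1, 1])
acc = accuracy_from_logits(logits, labels)
```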

calculate_loss(logits, y)

Calculates the combined cross-entropy and regularized loss for the model.

Parameters:

logits (torch.Tensor) – The logits as predicted by the model.
y (torch.Tensor) – The true labels.

Returns:

The calculated loss value.

Return type:

torch.Tensor
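
A combined loss of this kind is usually the cross-entropy plus a weighted mix of L1 and L2 penalties on the weights. The NumPy sketch below assumes the penalty has the form lam * (alpha * ||W||_1 + (1 - alpha) * ||W||_2^2); the exact weighting used by the module is not stated here, and the function names are hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, y):
    """Mean cross-entropy of integer labels y under softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    y = np.asarray(y)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(y)), y].mean())

def elastic_net_penalty(weights, alpha, lam):
    """Assumed form: lam * (alpha * L1 + (1 - alpha) * squared L2)."""
    weights = np.asarray(weights, dtype=float)
    l1 = np.abs(weights).sum()
    l2 = (weights ** 2).sum()
    return float(lam * (alpha * l1 + (1.0 - alpha) * l2))

weights = np.array([[1.0, -2.0], [0.0, 0.5]])  # stand-in for linear.weight
logits = np.array([[0.0, 0.0], [0.0, 0.0]])    # uninformative predictions
labels = np.array([0, 1])

ce = softmax_cross_entropy(logits, labels)
penalty = elastic_net_penalty(weights, alpha=0.5, lam=0.1)
total_loss = ce + penalty
```

With alpha = 1 the penalty reduces to pure L1 (Lasso), and with alpha = 0 to pure L2 (Ridge).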

forward(X)

Forward pass of the neural network model that makes predictions.

Parameters: X (torch.Tensor) – Tensor containing input features.
Returns: Tensor containing the output logits.
Return type: torch.Tensor

preprocess_functions.SNF(networks, K=15, t=10)

Performs Similarity Network Fusion over multiple networks.

Parameters:

networks (list of pd.DataFrames) – The individual networks to fuse, represented as similarity or distance matrices.
K (int) – Number of nearest neighbors to retain in the diffusion process.
t (int) – Number of iterations for the fusion process.

Returns:

A fused network represented as a similarity matrix.

Return type:

pd.DataFrame
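
For orientation, the following NumPy sketch follows the commonly published Similarity Network Fusion procedure: normalize each network, build a sparse K-nearest-neighbor kernel for each, then repeatedly diffuse each network through the average of the others for t iterations. It is an assumption that this module follows the same steps; all helper names are hypothetical, and plain arrays stand in for the DataFrames.

```python
import numpy as np

def _normalize(w):
    """Scale rows so off-diagonal entries sum to 0.5, then set the diagonal to 0.5."""
    w = np.array(w, dtype=float)
    off_diag_sum = w.sum(axis=1) - np.diag(w)
    off_diag_sum[off_diag_sum == 0] = 1.0  # avoid division by zero
    w = w / (2.0 * off_diag_sum[:, None])
    np.fill_diagonal(w, 0.5)
    return w

def _symmetrize(w):
    return (w + w.T) / 2.0

def _dominate_set(w, k):
    """Keep the k largest entries per row, zero the rest, rescale rows to sum to 1."""
    top = np.argsort(-w, axis=1)[:, :k]
    rows = np.arange(w.shape[0])[:, None]
    kept = np.zeros_like(w)
    kept[rows, top] = w[rows, top]
    return kept / kept.sum(axis=1, keepdims=True)

def snf_sketch(networks, K=15, t=10):
    """Fuse two or more non-negative similarity matrices of equal shape."""
    full = [_symmetrize(_normalize(w)) for w in networks]
    local = [_dominate_set(w, K) for w in full]  # sparse neighbor kernels
    m = len(full)
    for _ in range(t):
        updated = []
        for j in range(m):
            # Average of every network except the j-th one.
            others = sum(full[i] for i in range(m) if i != j) / (m - 1)
            updated.append(local[j] @ others @ local[j].T)
        full = [_symmetrize(_normalize(w)) for w in updated]
    return _symmetrize(_normalize(sum(full) / m))

# Two toy similarity matrices built from the same six random points.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))
dist = np.linalg.norm(points[:, None] - points[None, :], axis=2)
fused = snf_sketch([np.exp(-dist ** 2), np.exp(-dist)], K=3, t=5)
```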

preprocess_functions.abs_bicorr(data, mat_means=True)

Calculates the absolute bicorrelation matrix for the given data.

Parameters:

data (pd.DataFrame) – Data for which to compute the bicorrelation.
mat_means (bool) – If True, subtract the mean from each column before computing the correlation.

Returns:

Bicorrelation matrix.

Return type:

pd.DataFrame

preprocess_functions.check_wall_names(wall)

Checks whether all matrices in a list share the same row and column names.

Parameters: wall (list of pd.DataFrame) – List of matrices to check.
Returns: True if all matrices have consistent names, False otherwise.
Return type: bool

preprocess_functions.convert_dataframe_to_numpy(input_data)

Converts a pandas DataFrame to a numpy array. If the input is not a DataFrame, returns it as is.

Parameters: input_data (pd.DataFrame or any) – Data to be converted to a numpy array.
Returns: The resulting numpy array, or the original input if conversion isn’t applicable.
Return type: np.array or original data type

preprocess_functions.cosine_corr(data, mat_means=True)

Computes cosine correlations for the given data, treated as vectors.

Parameters:

data (pd.DataFrame) – Data for which to compute cosine correlations.
mat_means (bool) – If True, normalizes the data before computing correlation.

Returns:

Cosine correlation matrix.

Return type:

pd.DataFrame
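
The relationship between cosine and Pearson correlation can be shown in a few lines: cosine similarity computed on mean-centered vectors equals the Pearson correlation. The sketch below assumes that each row is one vector and that the normalization step means mean-centering; neither assumption is confirmed by this reference, and the function name is hypothetical.

```python
import numpy as np

def cosine_similarity_rows(data, center=True):
    """Pairwise cosine similarity between rows, optionally after mean-centering."""
    x = np.asarray(data, dtype=float)
    if center:
        x = x - x.mean(axis=1, keepdims=True)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-length rows
    return unit @ unit.T

data = np.array([[1.0, 2.0, 3.0, 5.0],
                 [2.0, 4.0, 6.0, 9.0],
                 [5.0, 3.0, 2.0, 1.0]])

centered = cosine_similarity_rows(data, center=True)   # matches np.corrcoef(data)
raw = cosine_similarity_rows(data, center=False)       # plain cosine similarity
```

Without centering, all-positive vectors always have a positive cosine similarity, even when their trends are opposite, which is why the centering option changes the result noticeably.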

preprocess_functions.create_similarity_matrix(mat, method='euclidean')

Creates a similarity matrix from the given data matrix using specified methods.

Parameters:

mat (pd.DataFrame) – The matrix from which to calculate similarities (e.g., gene expression levels).
method (str) – The method to use for calculating similarities. Supported methods are ‘bicorr’, ‘pearson’, and ‘euclidean’.

Returns:

A DataFrame representing the similarity matrix.

Return type:

pd.DataFrame

preprocess_functions.custom_cpm(counts, lib_size)

Computes Counts Per Million (CPM) normalization on count data.

Parameters:

counts (np.array) – An array of raw gene counts.
lib_size (float or np.array) – The total counts in each library (sample).

Returns:

Normalized counts expressed as counts per million.

Return type:

np.array
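
CPM normalization divides each count by its sample's library size and multiplies by one million. The sketch below assumes a genes-by-samples layout, with one library size per column; the function name `cpm_sketch` is hypothetical.

```python
import numpy as np

def cpm_sketch(counts, lib_size):
    """Counts per million: counts / library size * 1e6 (genes x samples assumed)."""
    counts = np.asarray(counts, dtype=float)
    lib_size = np.asarray(lib_size, dtype=float)
    return counts / lib_size * 1e6  # lib_size broadcasts across columns

counts = np.array([[10, 200],
                   [90, 300],
                   [0, 500]])        # 3 genes, 2 samples
lib_size = counts.sum(axis=0)       # total counts per sample: [100, 1000]
cpm = cpm_sketch(counts, lib_size)
```

When the library size is the column total, each sample's CPM values sum to one million.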

preprocess_functions.data_preprocess(count_mtx, datMeta, gene_exp=False)

Processes count matrix data by removing genes with zero expression across all samples. Optionally filters genes based on expression levels and calculates similarity matrices.

Parameters:

count_mtx (pd.DataFrame) – A DataFrame containing the gene count data.
datMeta (pd.Series or pd.DataFrame) – Metadata associated with the samples in count_mtx.
gene_exp (bool) – If True, performs additional gene filtering and similarity matrix calculations.

Returns:

pd.DataFrame – The processed count matrix.
pd.Series or pd.DataFrame – The corresponding processed metadata.

Return type:

tuple of (pd.DataFrame, pd.Series or pd.DataFrame)

preprocess_functions.dominateset(xx, KK=20)

Extracts a dominant set from a similarity matrix by setting all but the top KK connections in each row to zero and then re-normalizing the rows.

Parameters:

xx (np.array or pd.DataFrame) – The input similarity or distance matrix.
KK (int) – Number of top values to keep in each row of the matrix.

Returns:

The extracted dominant set matrix with top KK neighbors per row.

Return type:

np.array
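
The description above can be sketched in NumPy as follows. It assumes larger values mean stronger connections and that re-normalizing means dividing each row by its remaining sum; the function name `dominate_set_sketch` is hypothetical.

```python
import numpy as np

def dominate_set_sketch(xx, KK=20):
    """Keep the KK largest entries in each row, zero the rest, rescale rows to sum to 1."""
    xx = np.array(xx, dtype=float)
    top = np.argsort(-xx, axis=1)[:, :KK]     # column indices of the KK largest values
    rows = np.arange(xx.shape[0])[:, None]
    kept = np.zeros_like(xx)
    kept[rows, top] = xx[rows, top]
    return kept / kept.sum(axis=1, keepdims=True)

sim = np.array([[0.5, 0.3, 0.1, 0.1],
                [0.3, 0.5, 0.15, 0.05],
                [0.1, 0.15, 0.5, 0.25],
                [0.1, 0.05, 0.25, 0.5]])
sparse = dominate_set_sketch(sim, KK=2)
```

Note that the result is generally not symmetric, because each row selects its own neighbors independently.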

preprocess_functions.elastic_net(count_mtx, datMeta, train_index=None, val_index=None, l1_ratio=1, num_epochs=1000, lam=0.01, device='cuda')

Trains an Elastic Net model given count data and metadata.

Parameters:

count_mtx (pandas.DataFrame) – Matrix containing gene expression or count data.
datMeta (pandas.Series or DataFrame) – Metadata corresponding to count_mtx samples.
train_index (list, optional) – Indexes for training samples.
val_index (list, optional) – Indexes for validation samples.
l1_ratio (float, optional) – The balance between L1 and L2 regularization.
num_epochs (int, optional) – Number of training epochs.
lam (float, optional) – Regularization strength.
device (str, optional) – Device to run the training on (‘cuda’ or ‘cpu’).

Returns:

list – Extracted features based on weight importance.
ElasticNet – Trained ElasticNet model.

Return type:

tuple of (list, ElasticNet)

preprocess_functions.filter_genes(y, design=None, group=None, lib_size=None, min_count=10, min_total_count=15, large_n=10, min_prop=0.7)

Filters genes based on several criteria including minimum count thresholds and proportions.

Parameters:

y (np.array) – Expression data for the genes.
design (np.array, optional) – Design matrix for the samples if available.
group (np.array, optional) – Group information for samples.
lib_size (np.array, optional) – Library sizes for the samples.
min_count (int) – Minimum count threshold for including a gene.
min_total_count (int) – Minimum total count across all samples for a gene.
large_n (int) – Cutoff for considering a sample ‘large’.
min_prop (float) – Minimum proportion used in calculations for large sample consideration.

Returns:

Boolean array indicating which genes to keep.

Return type:

np.array
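
The parameter names and defaults resemble edgeR's `filterByExpr`, so the sketch below follows that procedure: a gene is kept if its CPM reaches a cutoff in enough samples and its total count is high enough. Whether this module implements exactly the same rule is an assumption, and the sketch handles only the `group` case with a genes-by-samples layout. The function name is hypothetical.

```python
import numpy as np

def filter_genes_sketch(y, group, min_count=10, min_total_count=15,
                        large_n=10, min_prop=0.7):
    """Boolean mask of genes to keep, modeled on edgeR's filterByExpr (assumed)."""
    y = np.asarray(y, dtype=float)                  # genes x samples
    lib_size = y.sum(axis=0)                        # total counts per sample
    _, group_sizes = np.unique(group, return_counts=True)
    min_sample_size = group_sizes.min()             # size of the smallest group
    if min_sample_size > large_n:
        # For large groups, require only a proportion of the samples beyond large_n.
        min_sample_size = large_n + (min_sample_size - large_n) * min_prop
    cpm_cutoff = min_count / np.median(lib_size) * 1e6
    cpm = y / lib_size * 1e6
    tol = 1e-14
    keep_cpm = (cpm >= cpm_cutoff).sum(axis=1) >= min_sample_size - tol
    keep_total = y.sum(axis=1) >= min_total_count - tol
    return keep_cpm & keep_total

counts = np.array([[500, 480, 520, 500],   # expressed everywhere
                   [1, 0, 2, 1],           # barely expressed
                   [499, 520, 478, 499],   # expressed everywhere
                   [30, 30, 0, 0]])        # expressed in group "a" only
keep = filter_genes_sketch(counts, group=["a", "a", "b", "b"])
```

The fourth gene is kept because it passes the CPM cutoff in two samples, which equals the size of the smallest group.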

preprocess_functions.gen_new_graph(model, h, meta, pnet=False)

Generates a new graph from learned features using a provided model, handling multi-modal data and integrating them.

Parameters:

model (nn.Module) – The trained model which contains the learned parameters.
h (torch.Tensor) – Tensor containing features of the data.
meta (pd.DataFrame or pd.Series) – Metadata associated with the features.
pnet (bool) – Flag indicating whether or not pathway network transformations have been used.

Returns:

A graph object representing the new graph generated from the features.

Return type:

nx.Graph

preprocess_functions.get_k_neighbors(matrix, k, corr=True)

Finds k-nearest neighbors for each row in the given matrix.

Parameters:

matrix (pd.DataFrame) – The matrix from which neighbors are to be found.
k (int) – The number of neighbors to find for each row.
corr (bool) – Indicates whether to use correlation rather than distance for finding neighbors.

Returns:

A dictionary where keys are indices (or node names) and values are lists of k-nearest neighbors’ indices.

Return type:

dict
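
A neighbor lookup of this shape can be sketched as below. It assumes that with corr=True the largest values are the nearest neighbors, that with corr=False the smallest values are, and that a node is never its own neighbor. The function name and the `names` argument are hypothetical stand-ins for the DataFrame index.

```python
import numpy as np

def k_neighbors_sketch(matrix, k, corr=True, names=None):
    """Map each row name to the names of its k nearest neighbors."""
    m = np.asarray(matrix, dtype=float)
    names = list(range(m.shape[0])) if names is None else list(names)
    neighbors = {}
    for i, name in enumerate(names):
        # Similarities are ranked descending, distances ascending.
        scores = -m[i] if corr else m[i]
        order = [j for j in np.argsort(scores, kind="stable") if j != i]
        neighbors[name] = [names[j] for j in order[:k]]
    return neighbors

sim = np.array([[1.0, 0.9, 0.2, 0.4],
                [0.9, 1.0, 0.3, 0.1],
                [0.2, 0.3, 1.0, 0.8],
                [0.4, 0.1, 0.8, 1.0]])
nbrs = k_neighbors_sketch(sim, k=2, names=["a", "b", "c", "d"])
```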

preprocess_functions.knn_graph_generation(datExpr, datMeta, knn=20, method='euclidean', extracted_feats=None, **args)

Generates a k-nearest neighbor graph based on the specified data and method of similarity.

Parameters:

datExpr (pd.DataFrame) – DataFrame containing expression data or other numerical data.
datMeta (pd.DataFrame or pd.Series) – Metadata for the nodes in the graph.
knn (int) – Number of nearest neighbors to connect to each node.
method (str) – Method used for calculating similarity or distance (‘euclidean’, ‘bicorr’, ‘pearson’, ‘cosine’).
extracted_feats ([type]) – Specific features extracted from the data to use for graph construction.
**args – Additional arguments for customizing the node visualization (e.g., node_colour, node_size).

Returns:

A NetworkX graph object representing the k-nearest neighbors graph.

Return type:

nx.Graph

preprocess_functions.normalize(x)

Normalizes a square matrix in place by scaling each row by its total minus the diagonal value.

Parameters: x (np.array) – The square matrix to normalize.
Returns: The normalized matrix with diagonal set to 0.5.
Return type: np.array
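
A minimal NumPy sketch of this normalization is shown below. It assumes the factor of 2 used in the standard Similarity Network Fusion formulation, so that the off-diagonal entries of each row sum to 0.5 and, with the diagonal fixed at 0.5, every row sums to 1. Unlike the documented function, the sketch works on a copy. The name `normalize_sketch` is hypothetical.

```python
import numpy as np

def normalize_sketch(x):
    """Divide each row by twice its off-diagonal sum, then set the diagonal to 0.5."""
    x = np.array(x, dtype=float)               # copy, so the input is left untouched
    off_diag = x.sum(axis=1) - np.diag(x)      # row total minus the diagonal value
    off_diag[off_diag == 0] = 1.0              # guard against division by zero
    x = x / (2.0 * off_diag[:, None])
    np.fill_diagonal(x, 0.5)
    return x

w = np.array([[1.0, 2.0, 2.0],
              [2.0, 1.0, 6.0],
              [2.0, 6.0, 1.0]])
p = normalize_sketch(w)
```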

preprocess_functions.pearson_corr(data, mat_means=True)

Computes the Pearson correlation matrix for the given data.

Parameters:

data (pd.DataFrame) – Data for which to compute the Pearson correlation.
mat_means (bool) – Normalizes data by its mean if set to True.

Returns:

Pearson correlation matrix.

Return type:

pd.DataFrame

preprocess_functions.plot_knn_network(data, K, labels, node_colours='skyblue', node_size=300)

Plots a k-nearest neighbors network using NetworkX.

Parameters:

data (pd.DataFrame) – The similarity or distance matrix used to determine neighbors.
K (int) – The number of nearest neighbors for network connections.
labels (pd.Series) – Labels or categories for the nodes used in plotting.
node_colours (str or list) – Color or list of colors for the nodes.
node_size (int) – Size of the nodes in the plot.

Returns:

A NetworkX graph object that has been plotted.

Return type:

nx.Graph