Preprocess Functions
- preprocess_functions.DESEQ(count_mtx, datMeta, condition, n_genes, train_index=None, fit_type='parametric')
Conducts differential expression analysis using the DESeq2 algorithm.
- Parameters:
count_mtx (pandas.DataFrame) – Count data for different genes.
datMeta (pandas.DataFrame) – Metadata for the samples in count_mtx.
condition (str) – Column in datMeta to use for condition separation.
n_genes (int) – Number of top genes to extract from the differential expression result.
train_index (list, optional) – Indexes for training samples if splitting is required.
fit_type (str, optional) – Fit type used for the variance-stabilizing transformation (VST).
- Returns:
A tuple of three items: a DeseqDataSet object containing the results and configuration of the DESeq2 run, a numpy.ndarray of variance-stabilized transformed counts, and a list of the top genes identified in the differential expression analysis.
- Return type:
tuple (DeseqDataSet, numpy.ndarray, list)
- class preprocess_functions.ElasticNet(num_features, num_classes, alpha, lam)
A PyTorch module that implements a logistic regression model with Elastic Net regularization.
- Parameters:
num_features (int) – Number of features in the input dataset.
num_classes (int) – Number of classes in the output prediction.
alpha (float) – Mixing parameter for L1 (Lasso) and L2 (Ridge) regularization.
lam (float) – Overall regularization strength.
- linear
Linear transformation layer.
- Type:
nn.Linear
- accuracy(logits, y)
Calculates the accuracy of the model’s predictions.
- Parameters:
logits (torch.Tensor) – The logits as predicted by the model.
y (torch.Tensor) – The true labels.
- Returns:
The calculated accuracy.
- Return type:
float
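As a rough illustration of the two ideas documented for this class, the sketch below computes an elastic-net penalty and a classification accuracy with NumPy. The penalty form `lam * (alpha * L1 + (1 - alpha) * L2)` is a common convention and is an assumption here, not necessarily the formula used inside `ElasticNet`. The helper names `elastic_net_penalty` and `accuracy_from_logits` are hypothetical, and the documented method works on torch.Tensor inputs rather than NumPy arrays.

```python
import numpy as np

def elastic_net_penalty(weights, alpha, lam):
    """Assumed convention: lam * (alpha * ||W||_1 + (1 - alpha) * ||W||_2^2)."""
    l1 = np.abs(weights).sum()
    l2 = np.square(weights).sum()
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

def accuracy_from_logits(logits, y):
    """Fraction of samples whose highest-scoring class equals the true label."""
    predictions = np.argmax(logits, axis=1)
    return float(np.mean(predictions == y))

# Toy weight matrix, logits (3 samples, 2 classes), and true labels.
weights = np.array([[1.0, -2.0], [0.5, 0.0]])
logits = np.array([[2.0, 0.1], [0.3, 1.5], [0.9, 0.2]])
labels = np.array([0, 1, 1])

penalty = elastic_net_penalty(weights, alpha=0.5, lam=0.1)
acc = accuracy_from_logits(logits, labels)
```

With `alpha=1` the assumed penalty reduces to a pure L1 (Lasso) term, and with `alpha=0` to a pure L2 (Ridge) term.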
- preprocess_functions.SNF(networks, K=15, t=10)
Performs Similarity Network Fusion over multiple networks.
- Parameters:
networks (list of pd.DataFrames) – The individual networks to fuse, represented as similarity or distance matrices.
K (int) – Number of nearest neighbors to retain in the diffusion process.
t (int) – Number of iterations for the fusion process.
- Returns:
A fused network represented as a similarity matrix.
- Return type:
pd.DataFrame
- preprocess_functions.abs_bicorr(data, mat_means=True)
Calculates the absolute bicorrelation matrix for the given data.
- Parameters:
data (pd.DataFrame) – Data for which to compute the bicorrelation.
mat_means (bool) – If True, subtract the mean from each column before computing the correlation.
- Returns:
Bicorrelation matrix.
- Return type:
pd.DataFrame
- preprocess_functions.check_wall_names(wall)
Checks whether all matrices in a list share the same row and column names.
- Parameters:
wall (list of pd.DataFrame) – List of matrices to check.
- Returns:
True if all matrices have the same row and column names, False otherwise.
- Return type:
bool
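One way such a check could be written with pandas is sketched below. The helper name `names_are_consistent` is hypothetical, and the sketch assumes that label order matters as well as label content; the library function may be more or less strict.

```python
import pandas as pd

def names_are_consistent(wall):
    """Return True if every DataFrame shares the first one's row and column labels."""
    first = wall[0]
    return all(
        df.index.equals(first.index) and df.columns.equals(first.columns)
        for df in wall[1:]
    )

names = ["s1", "s2"]
a = pd.DataFrame([[1.0, 0.2], [0.2, 1.0]], index=names, columns=names)
b = pd.DataFrame([[1.0, 0.7], [0.7, 1.0]], index=names, columns=names)
c = pd.DataFrame([[1.0, 0.7], [0.7, 1.0]], index=["s1", "s3"], columns=["s1", "s3"])

same = names_are_consistent([a, b])        # labels match
different = names_are_consistent([a, c])   # "s3" replaces "s2"
```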
- preprocess_functions.convert_dataframe_to_numpy(input_data)
Converts a pandas DataFrame to a numpy array. If the input is not a DataFrame, returns it as is.
- Parameters:
input_data (pd.DataFrame or any) – Data to be converted to numpy array.
- Returns:
The resulting numpy array from conversion or the original input if conversion isn’t applicable.
- Return type:
np.array or original data type
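The behaviour described above can be sketched in a few lines. The helper name `to_numpy_if_dataframe` is hypothetical and this is not necessarily how the library implements it.

```python
import numpy as np
import pandas as pd

def to_numpy_if_dataframe(input_data):
    """Convert a DataFrame to a NumPy array; pass anything else through unchanged."""
    if isinstance(input_data, pd.DataFrame):
        return input_data.to_numpy()
    return input_data

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
plain_list = [1, 2, 3]

converted = to_numpy_if_dataframe(frame)
untouched = to_numpy_if_dataframe(plain_list)
```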
- preprocess_functions.cosine_corr(data, mat_means=True)
Computes cosine correlations for the given data, treated as vectors.
- Parameters:
data (pd.DataFrame) – Data for which to compute cosine correlations.
mat_means (bool) – If True, normalizes the data before computing correlation.
- Returns:
Cosine correlation matrix.
- Return type:
pd.DataFrame
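For orientation, a cosine similarity matrix between the columns of a DataFrame could be computed as in the sketch below. Several points are assumptions: that similarity is taken between columns rather than rows, that the `center` flag (standing in for `mat_means`) means subtracting each column's mean, and the helper name `cosine_similarity_columns` itself.

```python
import numpy as np
import pandas as pd

def cosine_similarity_columns(data, center=True):
    """Cosine similarity between columns; centering is an assumed reading of mat_means."""
    values = data.to_numpy(dtype=float)
    if center:
        values = values - values.mean(axis=0)
    # Scale every column to unit length (a zero-norm column would produce NaN).
    unit = values / np.linalg.norm(values, axis=0)
    return pd.DataFrame(unit.T @ unit, index=data.columns, columns=data.columns)

data = pd.DataFrame({"g1": [1, 2, 3], "g2": [2, 4, 6], "g3": [3, 2, 1]})
sim = cosine_similarity_columns(data)
```

When the columns are mean-centered first, cosine similarity coincides with the Pearson correlation, which is why `g1` and `g3` come out perfectly anti-correlated in this toy input.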
- preprocess_functions.create_similarity_matrix(mat, method='euclidean')
Creates a similarity matrix from the given data matrix using specified methods.
- Parameters:
mat (pd.DataFrame) – The matrix from which to calculate similarities (e.g., gene expression levels).
method (str) – The method to use for calculating similarities. Supported methods are ‘bicorr’, ‘pearson’, and ‘euclidean’.
- Returns:
A DataFrame representing the similarity matrix.
- Return type:
pd.DataFrame
- preprocess_functions.custom_cpm(counts, lib_size)
Computes Counts Per Million (CPM) normalization on count data.
- Parameters:
counts (np.array) – An array of raw gene counts.
lib_size (float or np.array) – The total counts in each library (sample).
- Returns:
Normalized counts expressed as counts per million.
- Return type:
np.array
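CPM normalization divides each count by its sample's library size and multiplies by one million. The sketch below shows this with NumPy, assuming a genes-by-samples layout so that `lib_size` broadcasts across columns; the helper name `counts_per_million` is hypothetical.

```python
import numpy as np

def counts_per_million(counts, lib_size):
    """Scale raw counts so that each sample (column) is expressed per million reads."""
    return np.asarray(counts, dtype=float) / np.asarray(lib_size, dtype=float) * 1e6

# Rows are genes, columns are samples (assumed orientation).
counts = np.array([[10, 20], [90, 180]])
lib_size = counts.sum(axis=0)          # total counts per sample

cpm = counts_per_million(counts, lib_size)
```

When the library size is the column total, as here, every column of the result sums to one million.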
- preprocess_functions.data_preprocess(count_mtx, datMeta, gene_exp=False)
Processes count matrix data by removing genes with zero expression across all samples. Optionally filters genes based on expression levels and calculates similarity matrices.
- Parameters:
count_mtx (pd.DataFrame) – A DataFrame containing the gene count data.
datMeta (pd.Series or pd.DataFrame) – Metadata associated with the samples in count_mtx.
gene_exp (bool) – If True, performs additional gene filtering and similarity matrix calculations.
- Returns:
A tuple of two items: the processed count matrix (pd.DataFrame) and the corresponding processed metadata (pd.Series or pd.DataFrame).
- Return type:
tuple
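The first step described here, dropping genes that are zero in every sample, can be sketched with pandas as follows. The sketch assumes samples in rows and genes in columns, which may differ from the layout the library expects.

```python
import pandas as pd

# Toy count matrix: rows are samples, columns are genes (assumed orientation).
count_mtx = pd.DataFrame(
    {"geneA": [5, 0, 3], "geneB": [0, 0, 0], "geneC": [1, 2, 0]},
    index=["s1", "s2", "s3"],
)

# Keep only genes with a non-zero count in at least one sample.
expressed = (count_mtx != 0).any(axis=0)
filtered = count_mtx.loc[:, expressed]
```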
- preprocess_functions.dominateset(xx, KK=20)
Extracts a dominant set from a similarity matrix by setting all but the top KK connections per row to zero and then re-normalizing the rows.
- Parameters:
xx (np.array or pd.DataFrame) – The input similarity or distance matrix.
KK (int) – Number of top values to keep in each row of the matrix.
- Returns:
The extracted dominant set matrix with top KK neighbors per row.
- Return type:
np.array
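A minimal NumPy sketch of this operation is shown below. It assumes that "top" means the KK largest values in each row and that re-normalization divides each row by its sum; the helper name `keep_top_k_per_row` is hypothetical.

```python
import numpy as np

def keep_top_k_per_row(xx, KK=20):
    """Zero all but the KK largest entries in each row, then make rows sum to 1."""
    xx = np.asarray(xx, dtype=float)
    result = np.zeros_like(xx)
    # Column indices of the KK largest entries in every row.
    top_idx = np.argsort(xx, axis=1)[:, -KK:]
    rows = np.arange(xx.shape[0])[:, None]
    result[rows, top_idx] = xx[rows, top_idx]
    # Assumes every row keeps at least one non-zero entry.
    return result / result.sum(axis=1, keepdims=True)

sim = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.7],
    [0.1, 0.0, 0.7, 1.0],
])
dom = keep_top_k_per_row(sim, KK=2)
```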
- preprocess_functions.elastic_net(count_mtx, datMeta, train_index=None, val_index=None, l1_ratio=1, num_epochs=1000, lam=0.01, device='cuda')
Trains an Elastic Net model given count data and metadata.
- Parameters:
count_mtx (pandas.DataFrame) – Matrix containing gene expression or count data.
datMeta (pandas.Series or DataFrame) – Metadata corresponding to count_mtx samples.
train_index (list, optional) – Indexes for training samples.
val_index (list, optional) – Indexes for validation samples.
l1_ratio (float, optional) – The balance between L1 and L2 regularization.
num_epochs (int, optional) – Number of training epochs.
lam (float, optional) – Regularization strength.
device (str, optional) – Device to run the training on (‘cuda’ or ‘cpu’).
- Returns:
A tuple of two items: a list of features extracted based on weight importance, and the trained ElasticNet model.
- Return type:
tuple (list, ElasticNet)
- preprocess_functions.filter_genes(y, design=None, group=None, lib_size=None, min_count=10, min_total_count=15, large_n=10, min_prop=0.7)
Filters genes based on several criteria including minimum count thresholds and proportions.
- Parameters:
y (np.array) – Expression data for the genes.
design (np.array, optional) – Design matrix for the samples if available.
group (np.array, optional) – Group information for samples.
lib_size (np.array, optional) – Library sizes for the samples.
min_count (int) – Minimum count threshold for including a gene.
min_total_count (int) – Minimum total count across all samples for a gene.
large_n (int) – Cutoff for considering a sample ‘large’.
min_prop (float) – Minimum proportion used in calculations for large sample consideration.
- Returns:
Boolean array indicating which genes to keep.
- Return type:
np.array
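The parameter names and defaults resemble edgeR's `filterByExpr`, so the sketch below follows that procedure as an assumption; the library's own logic may differ. It assumes genes in rows and samples in columns, ignores the `design` argument, and uses the hypothetical helper name `filter_by_expression`.

```python
import numpy as np

def filter_by_expression(y, group=None, lib_size=None, min_count=10,
                         min_total_count=15, large_n=10, min_prop=0.7):
    """Keep genes with enough CPM in enough samples and enough total counts."""
    y = np.asarray(y, dtype=float)             # rows: genes, columns: samples (assumed)
    lib_size = y.sum(axis=0) if lib_size is None else np.asarray(lib_size, dtype=float)

    # Smallest group size; without groups, all samples count as one group.
    if group is None:
        min_sample_size = y.shape[1]
    else:
        _, group_sizes = np.unique(group, return_counts=True)
        min_sample_size = group_sizes.min()
    # Beyond large_n samples, only a proportion of the extra samples is required.
    if min_sample_size > large_n:
        min_sample_size = large_n + (min_sample_size - large_n) * min_prop

    # Translate min_count into a CPM cutoff using the median library size.
    cpm_cutoff = min_count / np.median(lib_size) * 1e6
    cpm = y / lib_size * 1e6
    tol = 1e-14
    keep_cpm = (cpm >= cpm_cutoff).sum(axis=1) >= min_sample_size - tol
    keep_total = y.sum(axis=1) >= min_total_count - tol
    return keep_cpm & keep_total

y = np.array([
    [100, 120, 90, 110],   # expressed in all samples
    [0, 1, 0, 2],          # barely expressed
    [50, 0, 60, 0],        # expressed in two of four samples
])
keep = filter_by_expression(y, group=["a", "a", "b", "b"])
```

With two samples per group, a gene needs to pass the CPM cutoff in at least two samples, so the third gene is retained while the second is dropped.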
- preprocess_functions.gen_new_graph(model, h, meta, pnet=False)
Generates a new graph from learned features using a provided model, handling multi-modal data and integrating them.
- Parameters:
model (nn.Module) – The trained model which contains the learned parameters.
h (torch.Tensor) – Tensor containing features of the data.
meta (pd.DataFrame or pd.Series) – Metadata associated with the features.
pnet (bool) – Flag indicating whether or not pathway network transformations have been used.
- Returns:
A graph object representing the new graph generated from the features.
- Return type:
nx.Graph
- preprocess_functions.get_k_neighbors(matrix, k, corr=True)
Finds k-nearest neighbors for each row in the given matrix.
- Parameters:
matrix (pd.DataFrame) – The matrix from which neighbors are to be found.
k (int) – The number of neighbors to find for each row.
corr (bool) – Indicates whether to use correlation rather than distance for finding neighbors.
- Returns:
A dictionary where keys are indices (or node names) and values are lists of k-nearest neighbors’ indices.
- Return type:
dict
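A neighbor lookup of this kind could be sketched with pandas as below. The sketch assumes a square matrix whose row and column labels match, that a node is never its own neighbor, and that `corr=True` means larger values are closer; the helper name `k_nearest` is hypothetical.

```python
import pandas as pd

def k_nearest(matrix, k, corr=True):
    """Map each row label to the labels of its k closest other nodes."""
    neighbors = {}
    for node in matrix.index:
        row = matrix.loc[node].drop(node)            # exclude the node itself
        # Correlation: largest first. Distance: smallest first.
        ranked = row.sort_values(ascending=not corr)
        neighbors[node] = list(ranked.index[:k])
    return neighbors

names = ["a", "b", "c", "d"]
sim = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.4],
     [0.9, 1.0, 0.3, 0.1],
     [0.2, 0.3, 1.0, 0.8],
     [0.4, 0.1, 0.8, 1.0]],
    index=names, columns=names,
)
neighbors = k_nearest(sim, k=2)
```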
- preprocess_functions.knn_graph_generation(datExpr, datMeta, knn=20, method='euclidean', extracted_feats=None, **args)
Generates a k-nearest neighbor graph based on the specified data and method of similarity.
- Parameters:
datExpr (pd.DataFrame) – DataFrame containing expression data or other numerical data.
datMeta (pd.DataFrame or pd.Series) – Metadata for the nodes in the graph.
knn (int) – Number of nearest neighbors to connect to each node.
method (str) – Method used for calculating similarity or distance (‘euclidean’, ‘bicorr’, ‘pearson’, ‘cosine’).
extracted_feats (optional) – Specific features extracted from the data to use for graph construction. Defaults to None.
**args – Additional arguments for customizing the node visualization (e.g., node_colour, node_size).
- Returns:
A NetworkX graph object representing the k-nearest neighbors graph.
- Return type:
nx.Graph
- preprocess_functions.normalize(x)
Normalizes a square matrix in place by scaling each row by its row sum excluding the diagonal entry.
- Parameters:
x (np.array) – The square matrix to normalize.
- Returns:
The normalized matrix with diagonal set to 0.5.
- Return type:
np.array
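The sketch below follows the normalization used in the reference SNF implementation, where off-diagonal entries are divided by twice the off-diagonal row sum so that each row sums to 1 once the diagonal is set to 0.5. The factor of two is an assumption about this function, the sketch returns a copy rather than working in place, and the helper name `snf_style_normalize` is hypothetical.

```python
import numpy as np

def snf_style_normalize(x):
    """Scale rows by their off-diagonal sum and fix the diagonal at 0.5."""
    x = np.array(x, dtype=float)                   # copy, unlike the in-place original
    row_sum_mdiag = x.sum(axis=1) - np.diag(x)     # row total minus diagonal value
    row_sum_mdiag[row_sum_mdiag == 0] = 1.0        # avoid division by zero
    x = x / (2.0 * row_sum_mdiag[:, None])
    np.fill_diagonal(x, 0.5)
    return x

mat = np.array([
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.25],
    [0.5, 0.25, 1.0],
])
normalized = snf_style_normalize(mat)
```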
- preprocess_functions.pearson_corr(data, mat_means=True)
Computes the Pearson correlation matrix for the given data.
- Parameters:
data (pd.DataFrame) – Data for which to compute the Pearson correlation.
mat_means (bool) – If True, normalizes the data by its mean before computing the correlation.
- Returns:
Pearson correlation matrix.
- Return type:
pd.DataFrame
- preprocess_functions.plot_knn_network(data, K, labels, node_colours='skyblue', node_size=300)
Plots a k-nearest neighbors network using NetworkX.
- Parameters:
data (pd.DataFrame) – The similarity or distance matrix used to determine neighbors.
K (int) – The number of nearest neighbors for network connections.
labels (pd.Series) – Labels or categories for the nodes used in plotting.
node_colours (str or list) – Color or list of colors for the nodes.
node_size (int) – Size of the nodes in the plot.
- Returns:
A NetworkX graph object that has been plotted.
- Return type:
nx.Graph