# scMKL

Single-cell analysis using Multiple Kernel Learning (scMKL) is a binary classification algorithm that uses prior information to group features, enhancing classification and aiding the interpretation of distinguishing features in multi-omic data sets.
## Installation

### Conda install

Conda is the recommended method to install scMKL:

```bash
conda create -n scMKL python=3.12 -c conda-forge ivango17::scmkl
```
### Pip install

First, create a virtual environment with python>=3.11.1,<3.13.

Then, install scMKL with:

```bash
# activate your new env with python>=3.11.1 and <3.13
pip install scmkl
```
If wheels do not build correctly, ensure `gcc` and `g++` are installed and up to date. They can be installed with `sudo apt install gcc` and `sudo apt install g++`.
## Requirements

scMKL takes advantage of AnnData objects and can be implemented with just four pieces of data:

1) scRNA and/or scATAC matrices (can be a `scipy.sparse` matrix)
2) An array of cell labels
3) An array of feature names (e.g., gene symbols for RNA or peaks for ATAC)
4) A grouping dictionary, e.g. `{'group_1': [feature_5, feature_16], 'group_2': [feature_1, feature_4, feature_9]}`
For more information on formatting/creating the grouping dictionaries, see our examples for creating an RNA grouping or an ATAC grouping; a minimal sketch follows below.
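As an illustration, an RNA grouping dictionary might look like the following (the gene set names and gene symbols here are placeholders, not data shipped with scMKL):

```python
import numpy as np

# Hypothetical grouping: keys are group names, values are feature arrays
group_dict = {
    'HALLMARK_ESTROGEN_RESPONSE_EARLY': np.array(['GREB1', 'CA12', 'MYB']),
    'HALLMARK_HYPOXIA': np.array(['VEGFA', 'PGK1', 'LDHA']),
}
```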
For implementing scMKL, see the example matching your use case in examples. A condensed sketch of the workflow is shown below.
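This sketch is assembled from the function examples documented later on this page; the input variables (`data_mat`, `gene_names`, `cell_labels`, `group_dict`) are placeholders for your own data:

```python
import numpy as np
import scmkl

# Package data, labels, features, and groupings into one AnnData object
adata = scmkl.create_adata(X = data_mat,
                           feature_names = gene_names,
                           cell_labels = cell_labels,
                           group_dict = group_dict)

adata = scmkl.estimate_sigma(adata)   # one kernel width per group
adata = scmkl.calculate_z(adata)      # random Fourier feature matrices

# Train and evaluate models across a range of sparsity penalties
results = scmkl.run(adata, alpha_list = np.array([0.05, 0.1, 0.5]))
print(results['Metrics'])
```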
## Links

- Repo: https://github.com/ohsu-cedar-comp-hub/scMKL
- PyPI: https://pypi.org/project/scmkl/
- Anaconda: https://anaconda.org/ivango17/scmkl
- API: https://ohsu-cedar-comp-hub.github.io/scMKL/
## Citation

If you use scMKL in your research, please cite using:

To be determined

Our Shiny for Python application for viewing data produced from this work can be found here: scMKL_analysis
## scMKL Documentation

The package's `__init__.py` defines the public API:

```python
"""
.. include:: ../README.md

----------------------------

## **scMKL Documentation**
"""


__all__ = ['calculate_z',
           'create_adata',
           'dataframes',
           'estimate_sigma',
           'get_atac_groupings',
           'multimodal_processing',
           'one_v_rest',
           'optimize_alpha',
           'optimize_sparsity',
           'plotting',
           'run',
           'test',
           'test_import',
           'tfidf_normalize',
           'train_model'
           ]

from scmkl._checks import *
from scmkl.calculate_z import *
from scmkl.create_adata import *
from scmkl.dataframes import *
from scmkl.estimate_sigma import *
from scmkl.get_atac_groupings import *
from scmkl.multimodal_processing import *
from scmkl.one_v_rest import *
from scmkl.optimize_alpha import *
from scmkl.optimize_sparsity import *
from scmkl.plotting import *
from scmkl.run import *
from scmkl.test import *
from scmkl.tfidf_normalize import *
from scmkl.train_model import *
```
### calculate_z

```python
def calculate_z(adata, n_features = 5000) -> ad.AnnData:
    '''
    Function to calculate Z matrix.

    Parameters
    ----------
    **adata** : *AnnData*
        > created by `create_adata()` with `'sigma'`,
        `'train_indices'`, and `'test_indices'` in `adata.uns.keys()`.
        The `'sigma'` key can be added by running `estimate_sigma()`
        on adata.

    **n_features** : *int*
        > Number of random features to use when calculating Z; used
        for scalability.

    Returns
    -------
    **adata** : *AnnData*
        > adata with Z matrices accessible with `adata.uns['Z_train']`
        and `adata.uns['Z_test']`.

    Examples
    --------
    >>> adata = scmkl.estimate_sigma(adata)
    >>> adata = scmkl.calculate_z(adata)
    >>> adata.uns.keys()
    dict_keys(['Z_train', 'Z_test', 'sigma', 'train_indices',
    'test_indices'])
    '''
    assert np.all(adata.uns['sigma'] > 0), 'Sigma must be positive'

    # Number of groupings taken from group_dict
    n_pathway = len(adata.uns['group_dict'].keys())
    D = adata.uns['D']

    # Capturing training and testing indices
    train_idx = np.array(adata.uns['train_indices'], dtype = np.int_)
    test_idx = np.array(adata.uns['test_indices'], dtype = np.int_)

    # Create arrays to store the concatenated group Zs
    # Each group of features will have a corresponding entry in each array
    n_cols = 2 * adata.uns['D'] * n_pathway

    Z_train = np.zeros((train_idx.shape[0], n_cols))
    Z_test = np.zeros((test_idx.shape[0], n_cols))

    # Loop over each of the groups, creating Z for each
    for m, group_features in enumerate(adata.uns['group_dict'].values()):

        # Extract features from the mth group
        num_group_features = len(group_features)

        # Sample up to n_features features - important for scalability
        # when using large groupings. Will use all features if the
        # grouping contains fewer than n_features
        number_features = np.min([n_features, num_group_features])
        group_array = np.array(list(group_features))
        group_features = adata.uns['seed_obj'].choice(group_array,
                                                      number_features,
                                                      replace = False)

        # Create data arrays containing only features within this group
        X_train = adata[adata.uns['train_indices'], :][:, group_features].X
        X_test = adata[adata.uns['test_indices'], :][:, group_features].X

        # Data filtering and transformation according to given data_type
        # Will remove low-variance (< 1e-5) features regardless of data_type
        # If given data_type is 'counts', will log scale and z-score the data
        X_train, X_test = _process_data(X_train = X_train, X_test = X_test,
                                        scale_data = adata.uns['scale_data'],
                                        return_dense = True)

        # Extract pre-calculated sigma used for approximating the kernel
        adjusted_sigma = adata.uns['sigma'][m]

        # Calculate the approximate kernel according to the chosen kernel
        # function. Distribution data comes from the Fourier transform of
        # the kernel function
        if adata.uns['kernel_type'].lower() == 'gaussian':

            gamma = 1 / (2 * adjusted_sigma ** 2)
            sigma_p = 0.5 * np.sqrt(2 * gamma)

            W = adata.uns['seed_obj'].normal(0, sigma_p, X_train.shape[1] * D)
            W = W.reshape((X_train.shape[1], D))

        elif adata.uns['kernel_type'].lower() == 'laplacian':

            gamma = 1 / (2 * adjusted_sigma)

            W = adata.uns['seed_obj'].standard_cauchy(X_train.shape[1] * D)
            W = gamma * W.reshape((X_train.shape[1], D))

        elif adata.uns['kernel_type'].lower() == 'cauchy':

            gamma = 1 / (2 * adjusted_sigma ** 2)
            b = 0.5 * np.sqrt(gamma)

            W = adata.uns['seed_obj'].laplace(0, b, X_train.shape[1] * D)
            W = W.reshape((X_train.shape[1], D))

        train_projection = np.matmul(X_train, W)
        test_projection = np.matmul(X_test, W)

        # Store group Z in the whole-Z object.
        # Preserves order to be able to extract meaningful groups
        x_idx = np.arange(m * 2 * D, (m + 1) * 2 * D)
        sq_i_d = np.sqrt(1 / D)

        Z_train[0:, x_idx] = sq_i_d * np.hstack((np.cos(train_projection),
                                                 np.sin(train_projection)))
        Z_test[0:, x_idx] = sq_i_d * np.hstack((np.cos(test_projection),
                                                np.sin(test_projection)))

    adata.uns['Z_train'] = Z_train
    adata.uns['Z_test'] = Z_test

    return adata
```
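For reference, this is the random Fourier features construction (Rahimi & Recht): for each group, `W` is drawn from the Fourier transform of the chosen kernel (normal for Gaussian, standard Cauchy for Laplacian, Laplace for Cauchy, matching the branches above), and

$$
z(x) = \sqrt{\tfrac{1}{D}}\left[\cos(x^{\top}W)\ \|\ \sin(x^{\top}W)\right],
\qquad
k(x, x') \approx z(x)^{\top} z(x'),
$$

so each group contributes a block of $2D$ columns to `Z_train` and `Z_test`.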
### create_adata

```python
def create_adata(X, feature_names: np.ndarray, cell_labels: np.ndarray,
                 group_dict: dict, scale_data: bool = True,
                 split_data : np.ndarray | None = None, D : int | None = None,
                 remove_features = True, train_ratio = 0.8,
                 distance_metric = 'euclidean', kernel_type = 'Gaussian',
                 random_state : int = 1, allow_multiclass : bool = False,
                 class_threshold : str | int = 'median'):
    '''
    Function to create an AnnData object to carry all relevant
    information going forward.

    Parameters
    ----------
    **X** : *scipy.sparse.csc_matrix* | *np.ndarray* | *pd.DataFrame*
        > A data matrix of cells by features (sparse array
        recommended for large datasets).

    **feature_names** : *np.ndarray*
        > Array of feature names corresponding with the features
        in X.

    **cell_labels** : *np.ndarray*
        > A numpy array of cell phenotypes corresponding with
        the cells in X.

    **group_dict** : *dict*
        > Dictionary containing feature grouping information.
        - Example: {geneset: np.array([gene_1, gene_2, ...,
        gene_n])}

    **scale_data** : *bool*
        > If `True`, data matrix is log transformed and standard
        scaled.

    **split_data** : *None* | *np.ndarray*
        > If *None*, data will be split stratified by cell labels.
        Else, is an array of a precalculated train/test split
        corresponding to samples. Can include labels for the entire
        dataset to benchmark performance, or for only the training
        data to classify unknown cell types.
        - Example: np.array(['train', 'test', ..., 'train'])

    **D** : *int*
        > Number of Random Fourier Features used to calculate Z.
        Should be a positive integer. Higher values of D will
        increase classification accuracy at the cost of computation
        time. If set to `None`, will be calculated from the number
        of samples.

    **remove_features** : *bool*
        > If `True`, will remove features from X and feature_names
        not in group_dict and remove features from groupings not in
        feature_names.

    **train_ratio** : *float*
        > Ratio of the number of training samples to the entire data
        set. Note: if a threshold is applied, the ratio of training
        samples may decrease depending on class balance and the
        `class_threshold` parameter if `allow_multiclass = True`.

    **distance_metric** : *str*
        > The pairwise distance metric used to estimate sigma. Must
        be one of the options used in scipy.spatial.distance.cdist.

    **kernel_type** : *str*
        > The approximated kernel function used to calculate Zs.
        Must be one of `'Gaussian'`, `'Laplacian'`, or `'Cauchy'`.

    **random_state** : *int*
        > Integer random_state used to set the seed for
        reproducibility.

    **allow_multiclass** : *bool*
        > If `False`, will ensure that cell labels are binary.

    **class_threshold** : *str* | *int*
        > Maximum number of samples allowed in the training data for
        each cell class. If `'median'`, the median number of cells
        per cell class will be the threshold for the number of
        samples per class.

    Returns
    -------
    **adata** : *AnnData*
        > *AnnData* with the following attributes and keys:

        > `adata.X` : the data matrix.

        > `adata.var_names` : the feature names corresponding to
        `adata.X`.

        > `adata.obs['labels']` : cell classes/phenotypes from
        `cell_labels`.

        > `adata.uns['train_indices']` : indices for training data.

        > `adata.uns['test_indices']` : indices for testing data.

        > `adata.uns['group_dict']` : grouping information.

        > `adata.uns['seed_obj']` : seed object with seed equal to
        100 * `random_state`.

        > `adata.uns['D']` : number of dimensions to run scMKL with.

        > `adata.uns['scale_data']` : *bool* for whether or not the
        data is log transformed and scaled.

        > `adata.uns['distance_metric']` : distance metric as given.

        > `adata.uns['kernel_type']` : kernel function as given.

    Examples
    --------
    >>> data_mat = scipy.sparse.load_npz('MCF7_RNA_matrix.npz')
    >>> gene_names = np.load('MCF7_gene_names.pkl', allow_pickle = True)
    >>> group_dict = np.load('hallmark_genesets.pkl',
    ...                      allow_pickle = True)
    >>>
    >>> adata = scmkl.create_adata(X = data_mat,
    ...                            feature_names = gene_names,
    ...                            group_dict = group_dict)
    >>> adata
    AnnData object with n_obs × n_vars = 1000 × 4341
    obs: 'labels'
    uns: 'group_dict', 'seed_obj', 'scale_data', 'D', 'kernel_type',
    'distance_metric', 'train_indices', 'test_indices'
    '''

    assert X.shape[1] == len(feature_names), ("Different number of features "
                                              "in X than feature names")

    if not allow_multiclass:
        assert len(np.unique(cell_labels)) == 2, ("cell_labels must contain "
                                                  "2 classes")
    if D is not None:
        assert isinstance(D, int) and D > 0, 'D must be a positive integer'

    kernel_options = ['gaussian', 'laplacian', 'cauchy']
    assert kernel_type.lower() in kernel_options, ("Given kernel type not "
                                                   "implemented. Gaussian, "
                                                   "Laplacian, and Cauchy "
                                                   "are the acceptable "
                                                   "types.")

    X, feature_names, group_dict = _filter_features(X,
                                                    feature_names,
                                                    group_dict,
                                                    remove_features)

    # Create adata object and add column names
    adata = ad.AnnData(X)
    adata.var_names = feature_names

    # Add metadata to adata object
    adata.uns['group_dict'] = group_dict
    adata.uns['seed_obj'] = np.random.default_rng(100 * random_state)
    adata.uns['scale_data'] = scale_data
    adata.uns['D'] = D if D is not None else calculate_d(adata.shape[0])
    adata.uns['kernel_type'] = kernel_type
    adata.uns['distance_metric'] = distance_metric

    if split_data is None:
        assert X.shape[0] == len(cell_labels), ("Different number of cells "
                                                "than labels")
        adata.obs['labels'] = cell_labels

        if allow_multiclass == False:
            split = _binary_split(cell_labels,
                                  seed_obj = adata.uns['seed_obj'],
                                  train_ratio = train_ratio)
            train_indices, test_indices = split

        elif allow_multiclass == True:
            split = _multi_class_split(cell_labels,
                                       seed_obj = adata.uns['seed_obj'],
                                       class_threshold = class_threshold,
                                       train_ratio = train_ratio)
            train_indices, test_indices = split

        adata.uns['labeled_test'] = True

    else:
        x_eq_labs = X.shape[0] == len(cell_labels)
        # Labels may instead be provided for the training samples only
        train_eq_labs = np.sum(split_data == 'train') == len(cell_labels)
        assert x_eq_labs or train_eq_labs, ("Must give labels for all cells "
                                            "or only for training cells")

        train_indices = np.where(split_data == 'train')[0]
        test_indices = np.where(split_data == 'test')[0]

        if len(cell_labels) == len(train_indices):

            padded_cell_labels = np.zeros((X.shape[0])).astype('object')
            padded_cell_labels[train_indices] = cell_labels
            padded_cell_labels[test_indices] = 'padded_test_label'

            adata.obs['labels'] = padded_cell_labels
            adata.uns['labeled_test'] = False

        elif len(cell_labels) == len(split_data):
            adata.obs['labels'] = cell_labels
            adata.uns['labeled_test'] = True

    adata.uns['train_indices'] = train_indices
    adata.uns['test_indices'] = test_indices

    if not scale_data:
        print("WARNING: Data will not be log transformed and scaled. "
              "To change this behavior, set scale_data to True.")

    return adata
```
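When a train/test split has already been decided, it can be passed through `split_data`; a minimal sketch, assuming labels are available for the training cells only (array sizes and variable names here are illustrative):

```python
import numpy as np
import scmkl

# Hypothetical 1,000-cell dataset where only the first 800 cells are labeled
split_data = np.array(['train'] * 800 + ['test'] * 200)

adata = scmkl.create_adata(X = data_mat,               # cells x features
                           feature_names = gene_names,
                           cell_labels = train_labels, # 800 training labels
                           group_dict = group_dict,
                           split_data = split_data)

# Unlabeled test cells are padded internally and flagged via
# adata.uns['labeled_test'] == False
```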
### estimate_sigma

```python
def estimate_sigma(adata, n_features = 5000):
    '''
    Calculate kernel widths to inform the distribution for the
    projection of Fourier features. Calculates one sigma per group of
    features.

    Parameters
    ----------
    **adata** : *AnnData*
        > Created by `create_adata`.

    **n_features** : *int*
        > Number of random features to include when estimating sigma.
        Will be scaled for the whole pathway set according to a
        heuristic. Used for scalability.

    Returns
    -------
    **adata** : *AnnData*
        > Key added `adata.uns['sigma']`.

    Examples
    --------
    >>> adata = scmkl.estimate_sigma(adata)
    >>> adata.uns['sigma']
    array([10.4640895 , 10.82011454,  6.16769438,  9.86156855, ...])
    '''

    sigma_list = []

    # Loop over every group in group_dict
    for group_features in adata.uns['group_dict'].values():

        # Select only features in that group and downsample for scalability
        num_group_features = len(group_features)
        group_array = np.array(list(group_features))
        n_feats = min([n_features, num_group_features])
        group_features = adata.uns['seed_obj'].choice(group_array, n_feats,
                                                      replace = False)

        # Use only the training data to estimate sigma
        X_train = adata[adata.uns['train_indices'], group_features].X
        X_train = _process_data(X_train = X_train,
                                scale_data = adata.uns['scale_data'],
                                return_dense = True)

        # Sample cells for scalability
        sample_idx = np.arange(X_train.shape[0])
        n_samples = np.min((2000, X_train.shape[0]))
        distance_indices = adata.uns['seed_obj'].choice(sample_idx, n_samples)

        # Calculate the distance matrix with the specified metric
        sigma = scipy.spatial.distance.cdist(X_train[distance_indices, :],
                                             X_train[distance_indices, :],
                                             adata.uns['distance_metric'])
        sigma = np.mean(sigma)

        # sigma = 0 is numerically unusable in later steps
        # Using such a small sigma will result in a wide distribution, and
        # typically a non-predictive Z
        if sigma == 0:
            sigma += 1e-5

        if n_features < num_group_features:
            # Heuristic we calculated to account for fewer features used in
            # the distance calculation
            sigma = sigma * num_group_features / n_features

        sigma_list.append(sigma)

    adata.uns['sigma'] = np.array(sigma_list)

    return adata
```
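The estimated widths are stored per group, in the same order as `adata.uns['group_dict']`; a quick way to inspect them (a usage sketch, assuming `adata` was built with `scmkl.create_adata()`):

```python
import scmkl

adata = scmkl.estimate_sigma(adata)

# Pair each group name with its estimated kernel width
for group, sigma in zip(adata.uns['group_dict'], adata.uns['sigma']):
    print(f'{group}: {sigma:.3f}')
```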
### get_atac_groupings

```python
def get_atac_groupings(gene_anno : pd.DataFrame, gene_sets : dict,
                       feature_names : np.ndarray | pd.Series | list | set,
                       len_up : int = 5000, len_down : int = 5000) -> dict:
    '''
    Creates a peak set where keys are gene set names from `gene_sets`
    and values are arrays of features pulled from `feature_names`.
    Features are added to each peak set given overlap between regions
    in the single-cell data matrix and inferred gene promoter regions
    in `gene_anno`.

    Parameters
    ----------
    **gene_anno** : *pd.DataFrame*
        > Gene annotations in GTF format as a pd.DataFrame with columns
        ['chr', 'start', 'end', 'strand', 'gene_name'].

    **gene_sets** : *dict*
        > Gene set names as keys and an iterable object of gene names
        as values.

    **feature_names** : *np.ndarray* | *pd.Series* | *list* | *set*
        > Feature names corresponding to a single-cell ATAC data
        matrix.

    **len_up** : *int*
        > Number of base pairs upstream of each gene's start used when
        inferring its promoter region.

    **len_down** : *int*
        > Number of base pairs downstream of each gene's start used
        when inferring its promoter region.

    Returns
    -------
    **atac_grouping** : *dict*
        > Keys are the names from `gene_sets` and values are a list of
        regions from `feature_names` that overlap with promoter
        regions respective to genes in `gene_sets` (i.e., if an ATAC
        feature in `feature_names` overlaps with a promoter region for
        a gene in a gene set from `gene_sets`, that region will be
        added to the new dictionary under the respective gene set
        name).

    Examples
    --------
    >>> # Reading in a gene set and the peak names from a scATAC dataset
    >>> gene_sets = np.load("data/RNA_hallmark_groupings.pkl",
    ...                     allow_pickle = True)
    >>> peaks = np.load("data/MCF7_ATAC_feature_names.npy",
    ...                 allow_pickle = True)
    >>>
    >>> # Reading in GTF file
    >>> gtf_path = "data/hg38_subset_protein_coding.annotation.gtf"
    >>> gtf = pd.read_csv(gtf_path, sep = "\t", header = None,
    ...                   skip_blank_lines = True, comment = "#")
    >>>
    >>> # Naming columns
    >>> gtf.columns = ['chr', 'source', 'feature', 'start', 'end',
    ...                'score', 'strand', 'frame', 'attribute']
    >>>
    >>> # Subsetting to only protein coding genes
    >>> prot_rows = gtf['attribute'].str.contains('protein_coding')
    >>> gtf = gtf[prot_rows]
    >>> gtf = gtf[gtf['feature'] == 'gene']
    >>>
    >>> # Capturing gene name from attributes column
    >>> gtf['gene_name'] = [re.findall(r'(?<=gene_name ")[A-Za-z0-9]+',
    ...                                attr)[0]
    ...                     for attr in gtf['attribute']]
    >>>
    >>> atac_grouping = scmkl.get_atac_groupings(gene_anno = gtf,
    ...                                          gene_sets = gene_sets,
    ...                                          feature_names = peaks)
    >>>
    >>> atac_grouping.keys()
    dict_keys(['HALLMARK_TNFA_SIGNALING_VIA_NFKB', ...])
    '''
    # Getting a unique set of gene names from gene_sets
    all_genes = {gene for group in gene_sets.keys()
                 for gene in gene_sets[group]}

    # Filtering out NaN values
    all_genes = [gene for gene in all_genes if type(gene) != float]

    # Filtering out annotations for genes not present in gene_sets
    gene_anno = gene_anno[np.isin(gene_anno['gene_name'], all_genes)]

    # Adjusting start and end columns to represent promoter regions
    gene_anno = _adjust_regions(gene_anno = gene_anno,
                                len_up = len_up, len_down = len_down)

    # Creating a dictionary from assay features where [chr] : (start, end)
    feature_dict = _create_feature_dict(feature_names)

    # Creating data structures from gene_anno for comparing regions
    peak_gene_dict, ga_regions = _create_region_dicts(gene_anno)

    # Capturing the separator type used in the assay
    chr_sep = ':' if ':' in feature_names[0] else '-'

    atac_groupings = _compare_regions(feature_dict = feature_dict,
                                      ga_regions = ga_regions,
                                      peak_gene_dict = peak_gene_dict,
                                      gene_sets = gene_sets,
                                      chr_sep = chr_sep)

    return atac_groupings
```
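The returned peak grouping plugs directly into `scmkl.create_adata()` as the `group_dict` argument; a sketch continuing the example above (the `atac_mat` and `cell_labels` variables are placeholders):

```python
import scmkl

# ATAC counts are typically TFIDF normalized downstream rather than
# log scaled, hence scale_data = False (as in the multimodal example below)
atac_adata = scmkl.create_adata(X = atac_mat,        # cells x peaks
                                feature_names = peaks,
                                cell_labels = cell_labels,
                                scale_data = False,
                                group_dict = atac_grouping)
```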
### multimodal_processing

```python
def multimodal_processing(adatas : list, names : list, tfidf : list):
    '''
    Combines and processes a list of adata objects.

    Parameters
    ----------
    **adatas** : *list[AnnData]*
        > List of AnnData objects where each object is a different
        modality for the same cells.

    **names** : *list[str]*
        > List of string names for each modality respective to each
        object in `adatas`.

    **tfidf** : *list[bool]*
        > List where if element i is `True`, adata[i] will be TFIDF
        normalized.

    Returns
    -------
    **adata** : *AnnData*
        > Concatenated from objects from `adatas` with Z matrices
        calculated.

    Examples
    --------
    >>> rna_adata = scmkl.create_adata(X = mcf7_rna_mat,
    ...                                feature_names = gene_names,
    ...                                scale_data = True,
    ...                                cell_labels = cell_labels,
    ...                                group_dict = rna_grouping)
    >>>
    >>> atac_adata = scmkl.create_adata(X = mcf7_atac_mat,
    ...                                 feature_names = peak_names,
    ...                                 scale_data = False,
    ...                                 cell_labels = cell_labels,
    ...                                 group_dict = atac_grouping)
    >>>
    >>> adatas = [rna_adata, atac_adata]
    >>> mod_names = ['rna', 'atac']
    >>> adata = scmkl.multimodal_processing(adatas = adatas,
    ...                                     names = mod_names,
    ...                                     tfidf = [False, True])
    >>>
    >>> adata
    AnnData object with n_obs × n_vars = 1000 × 12676
    obs: 'labels'
    var: 'labels'
    uns: 'D', 'kernel_type', 'distance_metric', 'train_indices',
    'test_indices', 'Z_train', 'Z_test', 'group_dict', 'seed_obj'
    '''
    import warnings
    warnings.filterwarnings('ignore')

    # All objects must contain the same number of cells
    assert len({adata.shape[0] for adata in adatas}) == 1, ("Different "
                                                            "number of cells "
                                                            "present in each "
                                                            "object")

    # True if all train indices match
    same_train = np.all([np.array_equal(adatas[0].uns['train_indices'],
                                        adatas[i].uns['train_indices'])
                         for i in range(1, len(adatas))])

    # True if all test indices match
    same_test = np.all([np.array_equal(adatas[0].uns['test_indices'],
                                       adatas[i].uns['test_indices'])
                        for i in range(1, len(adatas))])

    assert same_train, 'Different train indices'
    assert same_test, 'Different test indices'

    # Creates a boolean array for each modality of cells with non-empty rows
    non_empty_rows = [np.array(_sparse_var(adata.X, axis = 1) != 0).ravel()
                      for adata in adatas]

    # Returns a 1d array where sample feature sums
    # across all modalities are more than 0
    non_empty_rows = np.logical_and(*non_empty_rows).squeeze()

    # Initializing final train/test split array
    train_test = np.repeat('train', adatas[0].shape[0])
    train_test[adatas[0].uns['test_indices']] = 'test'

    # Capturing train/test split with empty rows filtered out
    train_test = train_test[non_empty_rows]
    train_indices = np.where(train_test == 'train')[0]
    test_indices = np.where(train_test == 'test')[0]

    # Adding train/test split arrays to the AnnData objects
    # and filtering out empty samples
    for i, adata in enumerate(adatas):
        adatas[i].uns['train_indices'] = train_indices
        adatas[i].uns['test_indices'] = test_indices
        adatas[i] = adata[non_empty_rows, :]
        # TFIDF normalizing if the corresponding element in tfidf is True
        if tfidf[i]:
            adatas[i] = tfidf_normalize(adatas[i])

        if 'Z_train' not in adatas[i].uns.keys():
            # AnnData update must point at the object in the list
            print(f'Estimating Sigma for {names[i]}', flush = True)
            adatas[i] = estimate_sigma(adatas[i], n_features = 200)
            print(f'Calculating Z for {names[i]}', flush = True)
            adatas[i] = calculate_z(adatas[i], n_features = 5000)

    if 'labels' in adatas[0].obs:
        all_labels = [adata.obs['labels'] for adata in adatas]
        # Ensuring cell labels for each AnnData object are the same
        for i in range(1, len(all_labels)):
            same_labels = np.all(all_labels[0] == all_labels[i])
            assert same_labels, (f"Cell labels between AnnData object in "
                                 f"position 0 and position {i} in adatas do "
                                 "not match")

    adata = _combine_modalities(adatas = adatas,
                                names = names,
                                combination = 'concatenate')

    del adatas
    gc.collect()

    return adata
```
### one_v_rest

```python
def one_v_rest(adatas : list, names : list, alpha_list : np.ndarray,
               tfidf : list) -> dict:
    '''
    For each cell class, creates model(s) comparing that class to all
    others. Then, predicts on the testing data using `scmkl.run()`.
    Only labels present in both training and testing data will be run.

    Parameters
    ----------
    **adatas** : *list[AnnData]*
        > List of AnnData objects created by create_adata()
        where each AnnData is one modality and composed of both
        training and testing samples. Requires that `'train_indices'`
        and `'test_indices'` are the same across all AnnDatas.

    **names** : *list[str]*
        > List of string variables that describe each modality
        respective to adatas for labeling.

    **alpha_list** : *np.ndarray* | *float*
        > An array of alpha values to create each model with.

    **tfidf** : *list[bool]*
        > List where if element i is `True`, adata[i] will be TFIDF
        normalized.

    Returns
    -------
    **results** : *dict*
        > Contains keys for each cell class with results from cell
        class versus all other samples. See `scmkl.run()` for further
        details.

    Examples
    --------
    >>> adata = scmkl.create_adata(X = data_mat,
    ...                            feature_names = gene_names,
    ...                            group_dict = group_dict)
    >>>
    >>> results = scmkl.one_v_rest(adatas = [adata], names = ['rna'],
    ...                            alpha_list = np.array([0.05, 0.1]),
    ...                            tfidf = [False])
    >>>
    >>> results.keys()
    dict_keys(['B cells', 'Monocytes', 'Dendritic cells', ...])
    '''
    # Formatting checks ensuring all adata elements are
    # AnnData objects and train/test indices are all the same
    _check_adatas(adatas, check_obs = True, check_uns = True)

    # Extracting train and test indices
    train_indices = adatas[0].uns['train_indices']
    test_indices = adatas[0].uns['test_indices']

    # Checking and capturing cell labels
    uniq_labels = _eval_labels(cell_labels = adatas[0].obs['labels'],
                               train_indices = train_indices,
                               test_indices = test_indices)

    # Calculating Z matrices; method depends on whether there are
    # multiple adatas (modalities)
    if len(adatas) == 1:
        adata = estimate_sigma(adatas[0], n_features = 200)
        adata = calculate_z(adata, n_features = 5000)
    else:
        adata = multimodal_processing(adatas = adatas,
                                      names = names,
                                      tfidf = tfidf)

    del adatas
    gc.collect()

    # Initializing for capturing model outputs
    results = {}

    # Capturing cell labels before overwriting
    cell_labels = np.array(adata.obs['labels'])

    for label in uniq_labels:
        print(f"Comparing {label} to other types", flush = True)
        cur_labels = cell_labels.copy()
        cur_labels[cell_labels != label] = 'other'

        # Replacing cell labels for current cell type vs rest
        adata.obs['labels'] = cur_labels

        # Running scMKL
        results[label] = run(adata, alpha_list, return_probs = True)

    # Getting final predictions
    alpha = np.min(alpha_list)
    prob_table, pred_class, low_conf = _prob_table(results, alpha)

    results['Probability_table'] = prob_table
    results['Predicted_class'] = pred_class
    results['Truth_labels'] = cell_labels[adata.uns['test_indices']]
    results['Low_confidence'] = low_conf

    return results
```
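A sketch of consuming the multiclass output, using only the keys that `one_v_rest()` sets above:

```python
import numpy as np

# Final class calls versus ground truth on the test set
truth = np.asarray(results['Truth_labels'])
preds = np.asarray(results['Predicted_class'])
print('Accuracy:', np.mean(truth == preds))

# Per-class probabilities and cells flagged as low confidence
print(results['Probability_table'])
print(results['Low_confidence'])
```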
### optimize_alpha

```python
def optimize_alpha(adata, group_size = None, tfidf = False,
                   alpha_array = np.round(np.linspace(1.9, 0.1, 10), 2),
                   k = 4):
    '''
    Iteratively train a group lasso model and update alpha to find the
    parameter yielding the best-performing sparsity. This function
    currently only works for binary experiments.

    Parameters
    ----------
    **adata** : *AnnData* | *list[AnnData]*
        > `AnnData`(s) with `'Z_train'` and `'Z_test'` in
        `adata.uns.keys()`.

    **group_size** : *None* | *int*
        > Argument describing how the features are grouped. If *None*,
        `2 * adata.uns['D']` will be used.
        For more information see celer documentation.

    **tfidf** : *bool*
        > If `True`, TFIDF normalization will be run at each fold.

    **alpha_array** : *np.ndarray*
        > Array of all alpha values to be tested.

    **k** : *int*
        > Number of folds to perform cross validation over.

    Returns
    -------
    **alpha_star** : *float*
        > The best performing alpha value from cross validation on
        training data.

    Examples
    --------
    >>> alpha_star = scmkl.optimize_alpha(adata)
    >>> alpha_star
    0.1

    See Also
    --------
    celer.GroupLasso :
    https://mathurinm.github.io/celer/generated/celer.GroupLasso.html
    '''

    assert isinstance(k, int) and k > 0, ("Must be a positive integer number "
                                          "of folds")

    import warnings
    warnings.filterwarnings('ignore')

    if group_size is None:
        group_size = adata.uns['D'] * 2

    if type(adata) == list:
        alpha_star = _multimodal_optimize_alpha(adatas = adata,
                                                group_size = group_size,
                                                tfidf = tfidf,
                                                alpha_array = alpha_array)
        return alpha_star

    y = adata.obs['labels'].iloc[adata.uns['train_indices']].to_numpy()

    # Splits the labels evenly between folds
    pos_indices = np.where(y == np.unique(y)[0])[0]
    neg_indices = np.setdiff1d(np.arange(len(y)), pos_indices)

    pos_annotations = np.arange(len(pos_indices)) % k
    neg_annotations = np.arange(len(neg_indices)) % k

    auc_array = np.zeros((len(alpha_array), k))

    gc.collect()

    for fold in np.arange(k):

        cv_adata = adata[adata.uns['train_indices'], :]

        # Create CV train/test indices
        # Capturing train indices for both classes
        ftrain_pos = pos_indices[np.where(pos_annotations != fold)[0]]
        ftrain_neg = neg_indices[np.where(neg_annotations != fold)[0]]

        # Capturing test indices for both classes
        ftest_pos = pos_indices[np.where(pos_annotations == fold)[0]]
        ftest_neg = neg_indices[np.where(neg_annotations == fold)[0]]

        # Creating class-balanced train/test split for fold
        fold_train = np.concatenate((ftrain_pos, ftrain_neg))
        fold_test = np.concatenate((ftest_pos, ftest_neg))

        cv_adata.uns['train_indices'] = fold_train
        cv_adata.uns['test_indices'] = fold_test

        if tfidf:
            cv_adata = tfidf_normalize(cv_adata, binarize = True)

        cv_adata = estimate_sigma(cv_adata, n_features = 200)
        cv_adata = calculate_z(cv_adata, n_features = 5000)

        gc.collect()

        for i, alpha in enumerate(alpha_array):

            cv_adata = train_model(cv_adata, group_size, alpha = alpha)
            auc_array[i, fold] = _calculate_auroc(cv_adata)

            gc.collect()

        del cv_adata
        gc.collect()

    # Take AUROC mean across the k folds to find alpha yielding highest AUROC
    alpha_star = alpha_array[np.argmax(np.mean(auc_array, axis = 1))]
    gc.collect()

    return alpha_star
```
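Once chosen, `alpha_star` can be handed straight to `scmkl.run()`; a minimal usage sketch:

```python
import numpy as np
import scmkl

alpha_star = scmkl.optimize_alpha(adata, k = 4)

# Train and evaluate a single model at the cross-validated alpha
results = scmkl.run(adata, alpha_list = np.array([alpha_star]))
print(results['Metrics'][alpha_star])
```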
### optimize_sparsity

```python
def optimize_sparsity(adata, group_size, starting_alpha = 1.9,
                      increment = 0.2, target = 1, n_iter = 10):
    '''
    Iteratively train a group lasso model and update alpha to find the
    parameter yielding the desired sparsity.

    Parameters
    ----------
    **adata** : *AnnData*
        > `AnnData` with `'Z_train'` and `'Z_test'` in
        `adata.uns.keys()`.

    **group_size** : *int*
        > Argument describing how the features are grouped. Should be
        `2 * D`. For more information see celer documentation.

    **starting_alpha** : *float*
        > The alpha value to start the search at.

    **increment** : *float*
        > Amount to adjust alpha by between iterations.

    **target** : *int*
        > The desired number of groups selected by the model.

    **n_iter** : *int*
        > The maximum number of iterations to run.

    Returns
    -------
    **sparsity_dict** : *dict*
        > Tested alphas as keys and the number of selected pathways as
        the values.

    **alpha** : *float*
        > The alpha value yielding the number of selected groups
        closest to the target.

    Examples
    --------
    >>> sparsity_dict, alpha = scmkl.optimize_sparsity(adata, (2 * D),
    ...                                                target = 1)
    >>>
    >>> alpha
    0.01

    See Also
    --------
    celer.GroupLasso :
    https://mathurinm.github.io/celer/generated/celer.GroupLasso.html
    '''
    assert increment > 0 and increment < starting_alpha, ("Choose a positive "
                                                          "increment less "
                                                          "than alpha")
    assert target > 0 and isinstance(target, int), ("Choose an integer "
                                                    "target number of groups "
                                                    "that is greater than 0")
    assert n_iter > 0 and isinstance(n_iter, int), ("Choose an integer "
                                                    "number of iterations "
                                                    "that is greater than 0")

    sparsity_dict = {}
    alpha = starting_alpha

    for _ in np.arange(n_iter):
        adata = train_model(adata, group_size, alpha)
        num_selected = len(find_selected_groups(adata))

        sparsity_dict[np.round(alpha, 4)] = num_selected

        if num_selected < target:
            # Decreasing alpha will increase the number of selected pathways
            if alpha - increment in sparsity_dict.keys():
                # Make increment smaller so the model can't go back and forth
                # between alpha values
                increment /= 2
            # Ensures that alpha will never be negative
            alpha = np.max([alpha - increment, 1e-1])

        elif num_selected > target:
            if alpha + increment in sparsity_dict.keys():
                increment /= 2

            alpha += increment

        elif num_selected == target:
            break

    # Find the alpha that minimizes the difference between target and
    # observed number of selected groups
    spar_idx = np.argmin([np.abs(selected - target)
                          for selected in sparsity_dict.values()])
    optimal_alpha = list(sparsity_dict.keys())[spar_idx]
    return sparsity_dict, optimal_alpha
```
### run

```python
def run(adata : ad.AnnData, alpha_list : np.ndarray,
        metrics : list | None = None, return_probs = False) -> dict:
    '''
    Wrapper function for training and testing with multiple alpha
    values. Returns metrics, predictions, group weights, and resource
    usage.

    Parameters
    ----------
    **adata** : *AnnData*
        > A processed *AnnData* with `'Z_train'`, `'Z_test'`, and
        `'group_dict'` keys in `adata.uns`.

    **alpha_list** : *np.ndarray*
        > `alpha` values to create models using. Alpha refers to the
        penalty parameter in Group Lasso. Larger alphas force group
        weights to shrink towards 0 while smaller alphas apply a
        lesser penalty to kernel weights.

    **metrics** : *list[str]*
        > Which metrics should be calculated on predictions. Options
        are ['AUROC', 'F1-Score', 'Accuracy', 'Precision', 'Recall'].
        When set to `None`, all metrics are calculated.

    **return_probs** : *bool*
        > If `True`, predicted class probabilities are also captured
        and returned under `results['Probabilities']`.

    Returns
    -------
    **results** : *dict*
        > With keys and values:

        > `'Metrics'` : a nested dictionary as `[alpha][metric]` = value.

        > `'Selected_groups'` : a dictionary as `[alpha]` = array of
        groups with nonzero weights.

        > `'Norms'` : a dictionary as `[alpha]` = array of kernel weights
        for each group, order respective to 'Group_names'.

        > `'Predictions'` : a dictionary as `[alpha]` = predicted class
        respective to 'Observations' for that `alpha`.

        > `'Observed'` : an array of ground truth cell labels from the
        test set.

        > `'Test_indices'` : indices of samples respective to adata
        used in the test set.

        > `'Group_names'` : an array of group names respective to each
        array in 'Norms'.

        > `'Models'` : a dictionary where `[alpha]` = Celer Group Lasso
        object for that `alpha`.

        > `'Train_time'` : a dictionary as `[alpha]` = training time in
        seconds.

        > `'RAM_usage'` : memory usage after training models for each
        `alpha`.

        > `'Probabilities'` : a dictionary as `[alpha]` = predicted
        class probabilities; populated when `return_probs = True`.

    Examples
    --------
    >>> results = scmkl.run(adata = adata,
    ...                     alpha_list = np.array([0.05, 0.1, 0.5]))
    >>> results
    dict_keys(['Metrics', 'Selected_groups', 'Norms', 'Predictions',
    ...        'Observed', 'Test_indices', 'Group_names', 'Models',
    ...        'Train_time', 'RAM_usage', 'Probabilities'])
    >>>
    >>> # List of alpha values
    >>> results['Metrics'].keys()
    dict_keys([0.05, 0.1, 0.5])
    >>>
    >>> results['Metrics'][0.05]
    {'AUROC': 0.9859,
    'Accuracy': 0.945,
    'F1-Score': 0.9452736318407959,
    'Precision': 0.9405940594059405,
    'Recall': 0.95}
    '''
    if metrics is None:
        metrics = ['AUROC', 'F1-Score', 'Accuracy', 'Precision', 'Recall']

    # Initializing variables to capture metrics
    group_names = list(adata.uns['group_dict'].keys())
    preds = {}
    group_norms = {}
    mets_dict = {}
    selected_groups = {}
    train_time = {}
    models = {}
    probs = {}

    D = adata.uns['D']

    # Generating models for each alpha and outputs
    for alpha in alpha_list:

        print(f'  Evaluating model. Alpha: {alpha}', flush = True)

        train_start = time.time()

        adata = train_model(adata, group_size = 2 * D, alpha = alpha)

        if return_probs:
            alpha_res = predict(adata,
                                metrics = metrics,
                                return_probs = return_probs)
            preds[alpha], mets_dict[alpha], probs[alpha] = alpha_res

        else:
            alpha_res = predict(adata,
                                metrics = metrics,
                                return_probs = return_probs)
            preds[alpha], mets_dict[alpha] = alpha_res

        selected_groups[alpha] = find_selected_groups(adata)

        kernel_weights = adata.uns['model'].coef_
        # Norm is taken over the full 2 * D block of weights for group i
        group_norms[alpha] = [
            np.linalg.norm(kernel_weights[i * 2 * D : (i + 1) * 2 * D])
            for i in np.arange(len(group_names))
            ]

        models[alpha] = adata.uns['model']

        train_end = time.time()
        train_time[alpha] = train_end - train_start

    # Combining results into one object
    results = {}
    results['Metrics'] = mets_dict
    results['Selected_groups'] = selected_groups
    results['Norms'] = group_norms
    results['Predictions'] = preds
    results['Observed'] = adata.obs['labels'].iloc[adata.uns['test_indices']]
    results['Test_indices'] = adata.uns['test_indices']
    results['Group_names'] = group_names
    results['Models'] = models
    results['Train_time'] = train_time
    results['RAM_usage'] = f'{tracemalloc.get_traced_memory()[1] / 1e9} GB'
    results['Probabilities'] = probs

    return results
```
### tfidf_normalize

```python
def tfidf_normalize(adata, binarize = False):
    '''
    Function to TFIDF normalize the data in an adata object. Requires
    that all rows have a positive sum (no rows that are entirely 0).

    Parameters
    ----------
    **adata** : *AnnData*
        > `adata.X` to be normalized. If `'train_indices'` and
        `'test_indices'` are in `adata.uns.keys()`, normalization will
        be done separately for the training and testing data.
        Otherwise, it will be calculated on the entire dataset.

    **binarize** : *bool*
        > If `True`, all values in `adata.X` greater than 1 will
        become 1.

    Returns
    -------
    **adata** : *AnnData*
        > adata with adata.X TFIDF normalized. Will now have the train
        data stacked on test data, and the indices will be adjusted
        accordingly.

    Examples
    --------
    >>> adata = scmkl.create_adata(X = data_mat,
    ...                            feature_names = gene_names,
    ...                            group_dict = group_dict)
    >>>
    >>> adata = scmkl.tfidf_normalize(adata)
    '''
    X = adata.X.copy()
    row_sums = np.sum(X, axis = 1)
    assert np.all(row_sums > 0), "TFIDF requires all row sums be positive"

    if binarize:
        X[X > 0] = 1

    if 'train_indices' in adata.uns_keys():

        train_indices = adata.uns['train_indices'].copy()
        test_indices = adata.uns['test_indices'].copy()

        # Calculate the train TFIDF matrix on just the training data so it
        # is not biased by testing data
        tfidf_train = _tfidf(X[train_indices, :], mode = 'normalize')

        # Calculate the test TFIDF by calculating it on the train and test
        # data together, then indexing the test data
        tfidf_test = _tfidf(X, mode = 'normalize')[test_indices, :]

        # Impossible to add rows back to their original location, so we
        # need to stack the matrices to maintain train/test
        if scipy.sparse.issparse(X):
            tfidf_norm = scipy.sparse.vstack((tfidf_train, tfidf_test))
        else:
            tfidf_norm = np.vstack((tfidf_train, tfidf_test))

        # This reassignment is necessary; without it, the values will be
        # saved as 0s in adata
        adata.uns['train_indices'] = train_indices
        adata.uns['test_indices'] = test_indices

        combined_indices = np.concatenate((train_indices, test_indices))

        # AnnData indexes by "rownames", not position, so we need to rename
        # the rows to properly index
        adata_index = adata.obs_names[combined_indices].astype(int)
        tfidf_norm = tfidf_norm[np.argsort(adata_index), :]

    else:

        tfidf_norm = _tfidf(X, mode = 'normalize')

    adata.X = tfidf_norm.copy()

    return adata
```
### train_model

```python
def train_model(adata, group_size = 1, alpha = 0.9):
    '''
    Fit a group lasso model to the provided data.

    Parameters
    ----------
    **adata** : *AnnData*
        > Has `'Z_train'` and `'Z_test'` keys in `adata.uns`.

    **group_size** : *int*
        > Argument describing how the features are grouped. Should be
        `2 * D`. For more information see celer documentation.

    **alpha** : *float*
        > Group Lasso regularization coefficient. alpha is a floating
        point value controlling model solution sparsity. Must be a
        positive float. The smaller the value, the more feature groups
        will be selected in the trained model.

    Returns
    -------
    **adata** : *AnnData*
        > Trained model accessible with `adata.uns['model']`.

    Examples
    --------
    >>> adata = scmkl.estimate_sigma(adata)
    >>> adata = scmkl.calculate_z(adata)
    >>> metrics = ['AUROC', 'F1-Score', 'Accuracy', 'Precision',
    ...            'Recall']
    >>> d = scmkl.calculate_d(adata.shape[0])
    >>> group_size = 2 * d
    >>> adata = scmkl.train_model(adata, group_size)
    >>>
    >>> 'model' in adata.uns.keys()
    True

    See Also
    --------
    celer documentation :
    https://mathurinm.github.io/celer/generated/celer.GroupLasso.html
    '''
    assert alpha > 0, 'Alpha must be positive'

    y_train = adata.obs['labels'].iloc[adata.uns['train_indices']]
    X_train = adata.uns['Z_train']

    cell_labels = np.unique(y_train)

    # This is a regression algorithm. We need to make the labels
    # 'continuous' for classification, but they will remain binary.
    # Casts training labels to an array of -1, 1
    train_labels = np.ones(y_train.shape)
    train_labels[y_train == cell_labels[1]] = -1

    # Alphamax is a calculation to regularize the effect of alpha across
    # different data sets
    alphamax = np.max(np.abs(X_train.T.dot(train_labels)))
    alphamax /= X_train.shape[0]
    alphamax *= alpha

    # Instantiate celer Group Lasso regression model object
    model = celer.GroupLasso(groups = group_size, alpha = alphamax)

    # Fit model using training data
    model.fit(X_train, train_labels.ravel())

    adata.uns['model'] = model
    return adata
```
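The `alphamax` rescaling above mirrors the standard lasso convention for the smallest penalty that zeroes every coefficient: with $n$ training cells, feature matrix $Z$ (here `Z_train`), and labels $y \in \{-1, 1\}$, the penalty actually passed to celer is

$$
\alpha_{\text{used}} = \alpha \cdot \frac{\max_j \lvert Z_{\cdot j}^{\top} y \rvert}{n},
$$

so the user-facing `alpha` behaves as a fraction of a data-dependent maximum, keeping sparsity roughly comparable across data sets.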