scmkl
Single-cell analysis using Multiple Kernel Learning (scMKL) is a binary classification algorithm that uses prior information to group features, enhancing classification and helping identify the features that distinguish classes in multi-omic data sets.
Installation
Conda install
Conda is the recommended method to install scMKL:
conda create -n scMKL python=3.12 -c conda-forge ivango17::scmkl
Pip install
First, create a virtual environment with python>=3.11.1,<3.13.
Then, install scMKL with:
# activate your new env with python>=3.11.1 and <3.13
pip install scmkl
If wheels do not build correctly, ensure `gcc` and `g++` are installed and up to date. They can be installed with `sudo apt install gcc` and `sudo apt install g++`.
Requirements
scMKL takes advantage of AnnData objects and can be implemented with just four pieces of data (a minimal sketch assembling them follows this list):
1) scRNA and/or scATAC matrices (can be scipy.sparse matrices)
2) An array of cell labels
3) An array of feature names (e.g., gene symbols for RNA or peaks for ATAC)
4) A grouping dictionary in the form {'group_1' : [feature_5, feature_16], 'group_2' : [feature_1, feature_4, feature_9]}
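Below is a minimal, self-contained sketch of these four inputs using tiny made-up data; all names and sizes here are illustrative only:

```python
import numpy as np
import scipy.sparse
import scmkl

# 1) A cells-by-features matrix (sparse recommended for real data)
X = scipy.sparse.random(100, 6, density = 0.3, format = 'csc')

# 2) Binary cell labels, one per cell
cell_labels = np.repeat(['case', 'control'], 50)

# 3) Feature names matching the columns of X
feature_names = np.array(['g1', 'g2', 'g3', 'g4', 'g5', 'g6'])

# 4) A grouping dictionary mapping group names to member features
group_dict = {'group_1': np.array(['g1', 'g3', 'g4']),
              'group_2': np.array(['g2', 'g5', 'g6'])}

adata = scmkl.create_adata(X = X, feature_names = feature_names,
                           cell_labels = cell_labels,
                           group_dict = group_dict)
```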
For more information on formatting/creating the grouping dictionaries, see our example for creating an RNA grouping or ATAC grouping.
For implementing scMKL, see the example for your use case in examples.
Links
Repo: https://github.com/ohsu-cedar-comp-hub/scMKL
PyPI: https://pypi.org/project/scmkl/
Anaconda: https://anaconda.org/ivango17/scmkl
API: https://ohsu-cedar-comp-hub.github.io/scMKL/
Citation
If you use scMKL in your research, please cite using:
To be determined
Our Shiny for Python application for viewing data produced from this work can be found here: scMKL_analysis
scMKL Documentation
1""" 2.. include:: ../README.md 3 4---------------------------- 5 6## **scMKL Documentation** 7""" 8 9 10__all__ = ['calculate_z', 11 'create_adata', 12 'dataframes', 13 'estimate_sigma', 14 'get_atac_groupings', 15 'multimodal_processing', 16 'one_v_rest', 17 'optimize_alpha', 18 'optimize_sparsity', 19 'plotting', 20 'run', 21 'test', 22 'tfidf_normalize', 23 'train_model' 24 ] 25 26from scmkl._checks import * 27from scmkl.calculate_z import * 28from scmkl.create_adata import * 29from scmkl.dataframes import * 30from scmkl.estimate_sigma import * 31from scmkl.get_atac_groupings import * 32from scmkl.multimodal_processing import * 33from scmkl.one_v_rest import * 34from scmkl.optimize_alpha import * 35from scmkl.optimize_sparsity import * 36from scmkl.plotting import * 37from scmkl.run import * 38from scmkl.test import * 39from scmkl.tfidf_normalize import * 40from scmkl.train_model import *
```python
import numpy as np
import scipy
import anndata as ad
from sklearn.decomposition import TruncatedSVD, PCA

# _tfidf_train_test and _process_data are private helpers defined in
# scmkl's data-processing modules.


def calculate_z(adata, n_features = 5000) -> ad.AnnData:
    '''
    Function to calculate Z matrix.

    Parameters
    ----------
    **adata** : *AnnData*
        > created by `create_adata()` with `adata.uns.keys()`
        `'sigma'`, `'train_indices'`, and `'test_indices'`.
        `'sigma'` key can be added by running `estimate_sigma()` on
        adata.

    **n_features** : *int*
        > Number of random features to use when calculating Z; used
        for scalability.

    Returns
    -------
    **adata** : *AnnData*
        > adata with Z matrices accessible with `adata.uns['Z_train']`
        and `adata.uns['Z_test']`.

    Examples
    --------
    >>> adata = scmkl.estimate_sigma(adata)
    >>> adata = scmkl.calculate_z(adata)
    >>> adata.uns.keys()
    dict_keys(['Z_train', 'Z_test', 'sigma', 'train_indices',
    'test_indices'])
    '''
    assert np.all(adata.uns['sigma'] > 0), 'Sigma must be positive'

    # Number of groupings taken from group_dict
    n_pathway = len(adata.uns['group_dict'].keys())
    D = adata.uns['D']

    # Capturing training and testing indices
    train_idx = np.array(adata.uns['train_indices'], dtype = np.int_)
    test_idx = np.array(adata.uns['test_indices'], dtype = np.int_)

    # Create arrays to store concatenated group Z
    # Each group of features will have a corresponding entry in each array
    n_cols = 2 * adata.uns['D'] * n_pathway

    Z_train = np.zeros((train_idx.shape[0], n_cols))
    Z_test = np.zeros((test_idx.shape[0], n_cols))

    # Loop over each of the groups and create Z for each
    for m, group_features in enumerate(adata.uns['group_dict'].values()):

        # Extract features from mth group
        num_group_features = len(group_features)

        # Sample up to n_features features; important for scalability if
        # using large groupings
        # Will use all features if the grouping contains fewer than n_features
        number_features = np.min([n_features, num_group_features])
        group_array = np.array(list(group_features))
        group_features = adata.uns['seed_obj'].choice(group_array,
                                                      number_features,
                                                      replace = False)

        # Create data arrays containing only features within this group
        X_train = adata[adata.uns['train_indices'], :][:, group_features].X
        X_test = adata[adata.uns['test_indices'], :][:, group_features].X

        if adata.uns['tfidf']:
            X_train, X_test = _tfidf_train_test(X_train, X_test)

        # Data filtering and transformation according to given data_type
        # Will remove low variance (< 1e5) features regardless of data_type
        # If given data_type is 'counts' will log scale and z-score the data
        X_train, X_test = _process_data(X_train = X_train, X_test = X_test,
                                        scale_data = adata.uns['scale_data'],
                                        return_dense = True)

        if adata.uns['reduction'].lower() == 'svd':

            SVD_func = TruncatedSVD(n_components = np.min([50, X_train.shape[1]]),
                                    random_state = 1)

            # Remove first component as it corresponds with sequencing depth
            X_train = SVD_func.fit_transform(scipy.sparse.csr_array(X_train))[:, 1:]
            X_test = SVD_func.transform(scipy.sparse.csr_array(X_test))[:, 1:]

        elif adata.uns['reduction'].lower() == 'pca':

            PCA_func = PCA(n_components = np.min([50, X_train.shape[1]]),
                           random_state = 1)

            X_train = PCA_func.fit_transform(np.asarray(X_train))
            X_test = PCA_func.transform(np.asarray(X_test))

        elif adata.uns['reduction'].lower() == 'linear':

            # Project train and test with the same sparse 0/1 matrix; the
            # original drew a separate random matrix for the test split,
            # which would place train and test in different spaces
            proj = adata.uns['seed_obj'].choice([0, 1], p = [0.02, 0.98],
                                                size = X_train.shape[1] * 50)
            proj = proj.reshape((X_train.shape[1], 50))

            X_train = X_train @ proj
            X_test = X_test @ proj

        if scipy.sparse.issparse(X_train):
            X_train = X_train.toarray().astype(np.float16)
            X_test = X_test.toarray().astype(np.float16)

        # Extract pre-calculated sigma used for approximating kernel
        adjusted_sigma = adata.uns['sigma'][m]

        # Calculate approximate kernel according to chosen kernel function
        # Distribution data comes from the Fourier transform of the kernel
        if adata.uns['kernel_type'].lower() == 'gaussian':

            gamma = 1 / (2 * adjusted_sigma ** 2)
            sigma_p = 0.5 * np.sqrt(2 * gamma)

            W = adata.uns['seed_obj'].normal(0, sigma_p, X_train.shape[1] * D)
            W = W.reshape((X_train.shape[1], D))

        elif adata.uns['kernel_type'].lower() == 'laplacian':

            gamma = 1 / (2 * adjusted_sigma)

            W = adata.uns['seed_obj'].standard_cauchy(X_train.shape[1] * D)
            W = gamma * W.reshape((X_train.shape[1], D))

        elif adata.uns['kernel_type'].lower() == 'cauchy':

            gamma = 1 / (2 * adjusted_sigma ** 2)
            b = 0.5 * np.sqrt(gamma)

            W = adata.uns['seed_obj'].laplace(0, b, X_train.shape[1] * D)
            W = W.reshape((X_train.shape[1], D))

        train_projection = np.matmul(X_train, W)
        test_projection = np.matmul(X_test, W)

        # Store group Z in whole-Z object
        # Preserves order to be able to extract meaningful groups
        x_idx = np.arange(m * 2 * D, (m + 1) * 2 * D)
        sq_i_d = np.sqrt(1 / D)

        Z_train[:, x_idx] = sq_i_d * np.hstack((np.cos(train_projection),
                                                np.sin(train_projection)))
        Z_test[:, x_idx] = sq_i_d * np.hstack((np.cos(test_projection),
                                               np.sin(test_projection)))

    adata.uns['Z_train'] = Z_train
    adata.uns['Z_test'] = Z_test

    return adata
```
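Each group's Z block follows the random Fourier feature construction of Rahimi and Recht: a projection matrix $W$ is drawn from the Fourier transform of the chosen kernel (normal for Gaussian, Cauchy for Laplacian, Laplace for Cauchy), and the feature map is

$$Z = \sqrt{\tfrac{1}{D}}\,\big[\cos(XW)\;\;\sin(XW)\big],$$

so that $ZZ^{\top}$ approximates the group's kernel matrix without ever forming it.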
```python
import numpy as np
import anndata as ad

# _filter_features, _binary_split, _multi_class_split, and calculate_d are
# helpers defined elsewhere in scmkl.


def create_adata(X, feature_names: np.ndarray, cell_labels: np.ndarray,
                 group_dict: dict, scale_data: bool = True,
                 split_data : np.ndarray | None = None, D : int | None = None,
                 remove_features = True, train_ratio = 0.8,
                 distance_metric = 'euclidean', kernel_type = 'Gaussian',
                 random_state : int = 1, allow_multiclass : bool = False,
                 class_threshold : str | int = 'median',
                 reduction: str | None = None, tfidf: bool = False):
    '''
    Function to create an AnnData object to carry all relevant
    information going forward.

    Parameters
    ----------
    **X** : *scipy.sparse.csc_matrix* | *np.ndarray* | *pd.DataFrame*
        > A data matrix of cells by features (sparse array
        recommended for large datasets).

    **feature_names** : *np.ndarray*
        > array of feature names corresponding with the features
        in X.

    **cell_labels** : *np.ndarray*
        > A numpy array of cell phenotypes corresponding with
        the cells in X.

    **group_dict** : *dict*
        > Dictionary containing feature grouping information.
        - Example: {geneset: np.array([gene_1, gene_2, ..., gene_n])}

    **scale_data** : *bool*
        > If `True`, data matrix is log transformed and standard
        scaled.

    **split_data** : *None* | *np.ndarray*
        > If *None*, data will be split stratified by cell labels.
        Else, is an array of precalculated train/test split
        corresponding to samples. Can include labels for the entire
        dataset to benchmark performance, or for only the training
        data to classify unknown cell types.
        - Example: np.array(['train', 'test', ..., 'train'])

    **D** : *int*
        > Number of Random Fourier Features used to calculate Z.
        Should be a positive integer. Higher values of D will
        increase classification accuracy at the cost of computation
        time. If set to `None`, will be calculated given the number
        of samples.

    **remove_features** : *bool*
        > If `True`, will remove features from X and feature_names
        not in group_dict and remove features from groupings not in
        feature_names.

    **train_ratio** : *float*
        > Ratio of the number of training samples to the entire data
        set. Note: if a threshold is applied, the ratio of training
        samples may decrease depending on class balance and the
        `class_threshold` parameter if `allow_multiclass = True`.

    **distance_metric** : *str*
        > The pairwise distance metric used to estimate sigma. Must
        be one of the options used in scipy.spatial.distance.cdist.

    **kernel_type** : *str*
        > The approximated kernel function used to calculate Zs.
        Must be one of `'Gaussian'`, `'Laplacian'`, or `'Cauchy'`.

    **random_state** : *int*
        > Integer random_state used to set the seed for
        reproducibility.

    **allow_multiclass** : *bool*
        > If `False`, will ensure that cell labels are binary.

    **class_threshold** : *str* | *int*
        > Number of samples allowed in the training data for each cell
        class in the training data. If `'median'`, the median number of
        cells per cell class will be the threshold for number of
        samples per class.

    **reduction** : *str* | *None*
        > Which dimension reduction technique to perform on features
        within a group. 'svd' will run sklearn.decomposition.TruncatedSVD,
        'pca' will run sklearn.decomposition.PCA, and 'linear' will
        multiply by a sparse 0/1 projection matrix down to 50 dimensions.

    **tfidf** : *bool*
        > Whether to calculate TFIDF transformation on peaks within
        groupings.

    Returns
    -------
    **adata** : *AnnData*
        > *AnnData* with the following attributes and keys:

        > `adata.X` : the data matrix.

        > `adata.var_names` : the feature names corresponding to
        `adata.X`.

        > `adata.obs['labels']` : cell classes/phenotypes from
        `cell_labels`.

        > `adata.uns['train_indices']` : Indices for training data.

        > `adata.uns['test_indices']` : Indices for testing data.

        > `adata.uns['group_dict']` : Grouping information.

        > `adata.uns['seed_obj']` : Seed object with seed equal to
        100 * `random_state`.

        > `adata.uns['D']` : Number of dimensions to run scMKL with.

        > `adata.uns['scale_data']` : *bool* for whether or not data is
        log transformed and scaled.

        > `adata.uns['distance_metric']` : Distance metric as given.

        > `adata.uns['kernel_type']` : Kernel function as given.

        > `adata.uns['reduction']` : Dimension reduction technique as
        given.

        > `adata.uns['tfidf']` : *bool* for whether to calculate tfidf
        per grouping.

    Examples
    --------
    >>> data_mat = scipy.sparse.load_npz('MCF7_RNA_matrix.npz')
    >>> gene_names = np.load('MCF7_gene_names.pkl', allow_pickle = True)
    >>> group_dict = np.load('hallmark_genesets.pkl',
    ...                      allow_pickle = True)
    >>>
    >>> adata = scmkl.create_adata(X = data_mat,
    ...                            feature_names = gene_names,
    ...                            group_dict = group_dict)
    >>> adata
    AnnData object with n_obs × n_vars = 1000 × 4341
        obs: 'labels'
        uns: 'group_dict', 'seed_obj', 'scale_data', 'D', 'kernel_type',
        'distance_metric', 'train_indices', 'test_indices'
    '''

    assert X.shape[1] == len(feature_names), ("Different number of features "
                                              "in X than feature names")

    if not allow_multiclass:
        assert len(np.unique(cell_labels)) == 2, ("cell_labels must contain "
                                                  "2 classes")
    if D is not None:
        assert isinstance(D, int) and D > 0, 'D must be a positive integer'

    kernel_options = ['gaussian', 'laplacian', 'cauchy']
    assert kernel_type.lower() in kernel_options, ("Given kernel type not "
                                                   "implemented. Gaussian, "
                                                   "Laplacian, and Cauchy "
                                                   "are the acceptable "
                                                   "types.")

    X, feature_names, group_dict = _filter_features(X,
                                                    feature_names,
                                                    group_dict,
                                                    remove_features)

    # Create adata object and add column names
    adata = ad.AnnData(X)
    adata.var_names = feature_names

    # Add metadata to adata object
    adata.uns['group_dict'] = group_dict
    adata.uns['seed_obj'] = np.random.default_rng(100 * random_state)
    adata.uns['scale_data'] = scale_data
    adata.uns['D'] = D if D is not None else calculate_d(adata.shape[0])
    adata.uns['kernel_type'] = kernel_type
    adata.uns['distance_metric'] = distance_metric
    adata.uns['reduction'] = reduction if isinstance(reduction, str) else 'None'
    adata.uns['tfidf'] = tfidf

    if split_data is None:
        assert X.shape[0] == len(cell_labels), ("Different number of cells "
                                                "than labels")
        adata.obs['labels'] = cell_labels

        if not allow_multiclass:
            split = _binary_split(cell_labels,
                                  seed_obj = adata.uns['seed_obj'],
                                  train_ratio = train_ratio)
            train_indices, test_indices = split

        else:
            split = _multi_class_split(cell_labels,
                                       seed_obj = adata.uns['seed_obj'],
                                       class_threshold = class_threshold,
                                       train_ratio = train_ratio)
            train_indices, test_indices = split

        adata.uns['labeled_test'] = True

    else:
        x_eq_labs = X.shape[0] == len(cell_labels)
        # The original computed the same comparison twice; the intended
        # check is that labels cover either all cells or only train cells
        train_eq_labs = np.sum(split_data == 'train') == len(cell_labels)
        assert x_eq_labs or train_eq_labs, ("Must give labels for all cells "
                                            "or only for training cells")

        train_indices = np.where(split_data == 'train')[0]
        test_indices = np.where(split_data == 'test')[0]

        if len(cell_labels) == len(train_indices):

            padded_cell_labels = np.zeros((X.shape[0])).astype('object')
            padded_cell_labels[train_indices] = cell_labels
            padded_cell_labels[test_indices] = 'padded_test_label'

            adata.obs['labels'] = padded_cell_labels
            adata.uns['labeled_test'] = False

        elif len(cell_labels) == len(split_data):
            adata.obs['labels'] = cell_labels
            adata.uns['labeled_test'] = True

    adata.uns['train_indices'] = train_indices
    adata.uns['test_indices'] = test_indices

    if not scale_data:
        print("WARNING: Data will not be log transformed and scaled. "
              "To change this behavior, set scale_data to True")

    return adata
```
```python
import numpy as np
import scipy
from sklearn.decomposition import TruncatedSVD, PCA

# _process_data and _tfidf are private helpers defined in scmkl's
# data-processing modules.


def estimate_sigma(adata, n_features = 5000):
    '''
    Calculate kernel widths to inform distribution for projection of
    Fourier Features. Calculates one sigma per group of features.

    Parameters
    ----------
    **adata** : *AnnData*
        > Created by `create_adata`.

    **n_features** : *int*
        > Number of random features to include when estimating sigma.
        Will be scaled for the whole pathway set according to a
        heuristic. Used for scalability.

    Returns
    -------
    **adata** : *AnnData*
        > Key added `adata.uns['sigma']`.

    Examples
    --------
    >>> adata = scmkl.estimate_sigma(adata)
    >>> adata.uns['sigma']
    array([10.4640895 , 10.82011454,  6.16769438,  9.86156855, ...])
    '''
    sigma_list = []

    # Loop over every group in group_dict
    for m, group_features in enumerate(adata.uns['group_dict'].values()):

        # Select only features in that group and downsample for scalability
        num_group_features = len(group_features)
        group_array = np.array(list(group_features))
        n_feats = min([n_features, num_group_features])
        group_features = adata.uns['seed_obj'].choice(group_array, n_feats,
                                                      replace = False)

        # Use the train data to estimate sigma
        X_train = adata[adata.uns['train_indices'], group_features].X

        # Sample cells for scalability
        sample_idx = np.arange(X_train.shape[0])
        n_samples = np.min((2000, X_train.shape[0]))
        distance_indices = adata.uns['seed_obj'].choice(sample_idx, n_samples,
                                                        replace = False)

        X_train = _process_data(X_train = X_train,
                                scale_data = adata.uns['scale_data'],
                                return_dense = True)

        if adata.uns['tfidf']:
            X_train = _tfidf(X_train, mode = 'normalize')

        if adata.uns['reduction'].lower() == 'svd':

            SVD_func = TruncatedSVD(n_components = np.min([50, X_train.shape[1]]),
                                    random_state = 1)
            X_train = SVD_func.fit_transform(
                scipy.sparse.csr_array(X_train[distance_indices, :]))[:, 1:]

        elif adata.uns['reduction'].lower() == 'pca':

            PCA_func = PCA(n_components = np.min([50, X_train.shape[1]]),
                           random_state = 1)
            X_train = PCA_func.fit_transform(
                np.asarray(X_train[distance_indices, :]))

        elif adata.uns['reduction'].lower() == 'linear':

            # Project with a sparse 0/1 matrix as in calculate_z(); the
            # original passed the probabilities as the `size` argument and
            # reshaped to the wrong dimensions
            proj = adata.uns['seed_obj'].choice([0, 1], p = [0.02, 0.98],
                                                size = X_train.shape[1] * 50)
            proj = proj.reshape((X_train.shape[1], 50))
            X_train = X_train[distance_indices, :] @ proj

        else:
            X_train = X_train[distance_indices, :]

        if scipy.sparse.issparse(X_train):
            X_train = X_train.toarray()

        # Calculate distance matrix with specified metric
        sigma = scipy.spatial.distance.cdist(X_train,
                                             X_train,
                                             adata.uns['distance_metric'])
        sigma = np.mean(sigma)

        # sigma = 0 is numerically unusable in later steps
        # Using such a small sigma will result in a wide distribution, and
        # typically a non-predictive Z
        if sigma == 0:
            sigma += 1e-5

        if n_features < num_group_features:
            # Heuristic we calculated to account for fewer features used in
            # distance calculation
            sigma = sigma * num_group_features / n_features

        sigma_list.append(sigma)

    adata.uns['sigma'] = np.array(sigma_list)

    return adata
```
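Concretely, each group's width is the mean pairwise distance over the sampled cells, rescaled by the downsampling heuristic when the group was subsampled:

$$\sigma_m = \operatorname{mean}_{i,j}\, d(x_i, x_j) \cdot \frac{n_{\mathrm{group}}}{n_{\mathrm{features}}} \quad (\text{rescaling applied only when } n_{\mathrm{features}} < n_{\mathrm{group}}),$$

where $d$ is the configured `distance_metric`.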
```python
import numpy as np
import pandas as pd

# _adjust_regions, _create_feature_dict, _create_region_dicts, and
# _compare_regions are private helpers defined in this module.


def get_atac_groupings(gene_anno : pd.DataFrame, gene_sets : dict,
                       feature_names : np.ndarray | pd.Series | list | set,
                       len_up : int = 5000, len_down : int = 5000) -> dict:
    '''
    Creates a peak set where keys are gene set names from `gene_sets`
    and values are arrays of features pulled from `feature_names`.
    Features are added to each peak set given overlap between regions
    in the single-cell data matrix and inferred gene promoter regions
    in `gene_anno`.

    Parameters
    ----------
    **gene_anno** : *pd.DataFrame*
        > Gene annotations in GTF format as a pd.DataFrame with columns
        ['chr', 'start', 'end', 'strand', 'gene_name'].

    **gene_sets** : *dict*
        > Gene set names as keys and an iterable object of gene names
        as values.

    **feature_names** : *np.ndarray* | *pd.Series* | *list* | *set*
        > Feature names corresponding to a single-cell ATAC data
        matrix.

    **len_up** : *int*
        > Number of base pairs upstream of the gene start to include
        when inferring promoter regions.

    **len_down** : *int*
        > Number of base pairs downstream of the gene start to include
        when inferring promoter regions.

    Returns
    -------
    **atac_grouping** : *dict*
        > Keys are the names from `gene_sets` and values are a list of
        regions from `feature_names` that overlap with promoter regions
        respective to genes in `gene_sets` (i.e., if an ATAC feature in
        `feature_names` overlaps with a promoter region from a gene in
        a gene set from `gene_sets`, that region will be added to the
        new dictionary under the respective gene set name).

    Examples
    --------
    >>> # Reading in a gene set and the peak names from scATAC dataset
    >>> gene_sets = np.load("data/RNA_hallmark_groupings.pkl",
    ...                     allow_pickle = True)
    >>> peaks = np.load("data/MCF7_ATAC_feature_names.npy",
    ...                 allow_pickle = True)
    >>>
    >>> # Reading in GTF file
    >>> gtf_path = "data/hg38_subset_protein_coding.annotation.gtf"
    >>> gtf = pd.read_csv(gtf_path, sep = "\t", header = None,
    ...                   skip_blank_lines = True, comment = "#")
    >>>
    >>> # Naming columns
    >>> gtf.columns = ['chr', 'source', 'feature', 'start', 'end',
    ...                'score', 'strand', 'frame', 'attribute']
    >>>
    >>> # Subsetting to only protein coding genes
    >>> prot_rows = gtf['attribute'].str.contains('protein_coding')
    >>> gtf = gtf[prot_rows]
    >>> gtf = gtf[gtf['feature'] == 'gene']
    >>>
    >>> # Capturing gene name from attributes column
    >>> gtf['gene_name'] = [re.findall(r'(?<=gene_name ")[A-Za-z0-9]+',
    ...                                attr)[0]
    ...                     for attr in gtf['attribute']]
    >>>
    >>> atac_grouping = scmkl.get_atac_groupings(gene_anno = gtf,
    ...                                          gene_sets = gene_sets,
    ...                                          feature_names = peaks)
    >>>
    >>> atac_grouping.keys()
    dict_keys(['HALLMARK_TNFA_SIGNALING_VIA_NFKB', ...])
    '''
    # Getting a unique set of gene names from gene_sets
    all_genes = {gene for group in gene_sets.keys()
                 for gene in gene_sets[group]}

    # Filtering out NaN values
    all_genes = [gene for gene in all_genes if type(gene) != float]

    # Filtering out annotations for genes not present in gene_sets
    gene_anno = gene_anno[np.isin(gene_anno['gene_name'], all_genes)]

    # Adjusting start and end columns to represent promoter regions
    gene_anno = _adjust_regions(gene_anno = gene_anno,
                                len_up = len_up, len_down = len_down)

    # Creating a dictionary from assay features where [chr] : (start, end)
    feature_dict = _create_feature_dict(feature_names)

    # Creating data structures from gene_anno for comparing regions
    peak_gene_dict, ga_regions = _create_region_dicts(gene_anno)

    # Capturing the separator type used in assay
    chr_sep = ':' if ':' in feature_names[0] else '-'

    atac_groupings = _compare_regions(feature_dict = feature_dict,
                                      ga_regions = ga_regions,
                                      peak_gene_dict = peak_gene_dict,
                                      gene_sets = gene_sets,
                                      chr_sep = chr_sep)

    return atac_groupings
```
```python
import gc
import warnings

import numpy as np

# _sparse_var and _combine_modalities are private helpers; estimate_sigma,
# calculate_z, and tfidf_normalize are the public scmkl functions
# documented on this page.


def multimodal_processing(adatas : list, names : list, tfidf : list):
    '''
    Combines and processes a list of adata objects.

    Parameters
    ----------
    **adatas** : *list[AnnData]*
        > List of AnnData objects where each object is a different
        modality for the same cells.

    **names** : *list[str]*
        > List of string names for each modality respective to each
        object in `adatas`.

    **tfidf** : *list[bool]*
        > List where if element i is `True`, adatas[i] will be TFIDF
        normalized.

    Returns
    -------
    **adata** : *AnnData*
        > Concatenated from objects from `adatas` with Z matrices
        calculated.

    Examples
    --------
    >>> rna_adata = scmkl.create_adata(X = mcf7_rna_mat,
    ...                                feature_names = gene_names,
    ...                                scale_data = True,
    ...                                cell_labels = cell_labels,
    ...                                group_dict = rna_grouping)
    >>>
    >>> atac_adata = scmkl.create_adata(X = mcf7_atac_mat,
    ...                                 feature_names = peak_names,
    ...                                 scale_data = False,
    ...                                 cell_labels = cell_labels,
    ...                                 group_dict = atac_grouping)
    >>>
    >>> adatas = [rna_adata, atac_adata]
    >>> mod_names = ['rna', 'atac']
    >>> adata = scmkl.multimodal_processing(adatas = adatas,
    ...                                     names = mod_names,
    ...                                     tfidf = [False, True])
    >>>
    >>> adata
    AnnData object with n_obs × n_vars = 1000 × 12676
        obs: 'labels'
        var: 'labels'
        uns: 'D', 'kernel_type', 'distance_metric', 'train_indices',
        'test_indices', 'Z_train', 'Z_test', 'group_dict', 'seed_obj'
    '''
    warnings.filterwarnings('ignore')

    # All modalities must contain the same number of cells; the original
    # assert only checked that each object was non-empty
    assert all([adata.shape[0] == adatas[0].shape[0] for adata in adatas]), \
        "Different number of cells present in each object"

    # True if all train indices match
    same_train = np.all([np.array_equal(adatas[0].uns['train_indices'],
                                        adatas[i].uns['train_indices'])
                         for i in range(1, len(adatas))])

    # True if all test indices match
    same_test = np.all([np.array_equal(adatas[0].uns['test_indices'],
                                       adatas[i].uns['test_indices'])
                        for i in range(1, len(adatas))])

    assert same_train, 'Different train indices'
    assert same_test, 'Different test indices'

    # Creates a boolean array for each modality of cells with non-empty rows
    non_empty_rows = [np.array(_sparse_var(adata.X, axis = 1) != 0).ravel()
                      for adata in adatas]

    # Returns a 1d array where sample feature sums across all modalities
    # are more than 0 (reduce handles two or more modalities)
    non_empty_rows = np.logical_and.reduce(non_empty_rows)

    # Initializing final train test split array
    train_test = np.repeat('train', adatas[0].shape[0])
    train_test[adatas[0].uns['test_indices']] = 'test'

    # Capturing train test split with empty rows filtered out
    train_test = train_test[non_empty_rows]
    train_indices = np.where(train_test == 'train')[0]
    test_indices = np.where(train_test == 'test')[0]

    # Adding train test split arrays to AnnData objects
    # and filtering out empty samples
    for i, adata in enumerate(adatas):
        adatas[i].uns['train_indices'] = train_indices
        adatas[i].uns['test_indices'] = test_indices
        adatas[i] = adata[non_empty_rows, :]

        # tfidf normalizing if corresponding element in tfidf is True;
        # each step operates on the filtered object adatas[i] (the
        # original passed the unfiltered loop variable)
        if tfidf[i]:
            adatas[i] = tfidf_normalize(adatas[i])

        print(f'Estimating Sigma for {names[i]}', flush = True)
        adatas[i] = estimate_sigma(adatas[i], n_features = 200)
        print(f'Calculating Z for {names[i]}', flush = True)
        adatas[i] = calculate_z(adatas[i], n_features = 5000)

    if 'labels' in adatas[0].obs:
        all_labels = [adata.obs['labels'] for adata in adatas]
        # Ensuring cell labels for each AnnData object are the same
        for i in range(1, len(all_labels)):
            same_labels = np.all(all_labels[0] == all_labels[i])
            assert same_labels, (f"Cell labels between AnnData object in "
                                 f"position 0 and position {i} in adatas do "
                                 "not match")

    adata = _combine_modalities(adatas = adatas,
                                names = names,
                                combination = 'concatenate')

    del adatas
    gc.collect()

    return adata
```
```python
import gc

import numpy as np

# _check_adatas, _eval_labels, and _prob_table are private helpers;
# estimate_sigma, calculate_z, multimodal_processing, and run are the
# public scmkl functions documented on this page.


def one_v_rest(adatas : list, names : list, alpha_list : np.ndarray,
               tfidf : list) -> dict:
    '''
    For each cell class, creates model(s) comparing that class to all
    others. Then, predicts on the testing data using `scmkl.run()`.
    Only labels present in both training and testing data will be run.

    Parameters
    ----------
    **adatas** : *list[AnnData]*
        > List of AnnData objects created by create_adata()
        where each AnnData is one modality and composed of both
        training and testing samples. Requires that `'train_indices'`
        and `'test_indices'` are the same across all AnnDatas.

    **names** : *list[str]*
        > List of string variables that describe each modality
        respective to adatas for labeling.

    **alpha_list** : *np.ndarray* | *float*
        > An array of alpha values to create each model with.

    **tfidf** : *list[bool]*
        > List where if element i is `True`, adatas[i] will be TFIDF
        normalized.

    Returns
    -------
    **results** : *dict*
        > Contains keys for each cell class with results from cell
        class versus all other samples. See `scmkl.run()` for further
        details.

    Examples
    --------
    >>> adata = scmkl.create_adata(X = data_mat,
    ...                            feature_names = gene_names,
    ...                            group_dict = group_dict)
    >>>
    >>> results = scmkl.one_v_rest(adatas = [adata], names = ['rna'],
    ...                            alpha_list = np.array([0.05, 0.1]),
    ...                            tfidf = [False])
    >>>
    >>> results.keys()
    dict_keys(['B cells', 'Monocytes', 'Dendritic cells', ...])
    '''
    # Formatting checks ensuring all adata elements are
    # AnnData objects and train/test indices are all the same
    _check_adatas(adatas, check_obs = True, check_uns = True)

    # Extracting train and test indices
    train_indices = adatas[0].uns['train_indices']
    test_indices = adatas[0].uns['test_indices']

    # Checking and capturing cell labels
    uniq_labels = _eval_labels(cell_labels = adatas[0].obs['labels'],
                               train_indices = train_indices,
                               test_indices = test_indices)

    # Calculating Z matrices, method depends on whether there are multiple
    # adatas (modalities); multimodal_processing() computes sigma and Z
    # for each modality before concatenating
    if len(adatas) == 1:
        adata = estimate_sigma(adatas[0], n_features = 200)
        adata = calculate_z(adata, n_features = 5000)
    else:
        adata = multimodal_processing(adatas = adatas,
                                      names = names,
                                      tfidf = tfidf)

    del adatas
    gc.collect()

    # Initializing for capturing model outputs
    results = {}

    # Capturing cell labels before overwriting
    cell_labels = np.array(adata.obs['labels'])

    for label in uniq_labels:
        print(f"Comparing {label} to other types", flush = True)
        cur_labels = cell_labels.copy()
        cur_labels[cell_labels != label] = 'other'

        # Replacing cell labels for current cell type vs rest
        adata.obs['labels'] = cur_labels

        # Running scMKL
        results[label] = run(adata, alpha_list, return_probs = True)

    # Getting final predictions
    alpha = np.min(alpha_list)
    prob_table, pred_class, low_conf = _prob_table(results, alpha)

    results['Probability_table'] = prob_table
    results['Predicted_class'] = pred_class
    results['Truth_labels'] = cell_labels[adata.uns['test_indices']]
    results['Low_confidence'] = low_conf

    return results
```
```python
import gc

import numpy as np

# _multimodal_optimize_alpha and predict are scmkl internals;
# estimate_sigma, calculate_z, tfidf_normalize, and train_model are
# documented on this page.


def optimize_alpha(adata, group_size = None, tfidf = False,
                   alpha_array = np.round(np.linspace(1.9, 0.1, 10), 2),
                   k = 4, metric = 'AUROC'):
    '''
    Iteratively train a grouplasso model and update alpha to find the
    parameter yielding the best performing sparsity. This function
    currently only works for binary experiments.

    Parameters
    ----------
    **adata** : *AnnData* | *list[AnnData]*
        > `AnnData`(s) with `'Z_train'` and `'Z_test'` in
        `adata.uns.keys()`.

    **group_size** : *None* | *int*
        > Argument describing how the features are grouped. If *None*,
        `2 * adata.uns['D']` will be used.
        For more information see
        [celer documentation](https://mathurinm.github.io/celer/generated/celer.GroupLasso.html).

    **tfidf** : *bool*
        > If `True`, TFIDF normalization will be run at each fold.

    **alpha_array** : *np.ndarray*
        > Array of all alpha values to be tested.

    **k** : *int*
        > Number of folds to perform cross validation over.

    **metric** : *str*
        > Which metric to use to optimize alpha. Options
        are `'AUROC'`, `'Accuracy'`, `'F1-Score'`, `'Precision'`, and
        `'Recall'`.

    Returns
    -------
    **alpha_star** : *float*
        > The best performing alpha value from cross validation on
        training data.

    Examples
    --------
    >>> alpha_star = scmkl.optimize_alpha(adata)
    >>> alpha_star
    0.1
    '''
    assert isinstance(k, int) and k > 0, \
        'Must be a positive integer number of folds'

    import warnings
    warnings.filterwarnings('ignore')

    if group_size is None:
        # Handle list input here as well; the original read adata.uns
        # directly, which fails when adata is a list of AnnData objects
        d = adata[0].uns['D'] if isinstance(adata, list) else adata.uns['D']
        group_size = 2 * d

    if isinstance(adata, list):
        alpha_star = _multimodal_optimize_alpha(adatas = adata,
                                                group_size = group_size,
                                                tfidf = tfidf,
                                                alpha_array = alpha_array,
                                                metric = metric)
        return alpha_star

    y = adata.obs['labels'].iloc[adata.uns['train_indices']].to_numpy()

    # Splits the labels evenly between folds
    positive_indices = np.where(y == np.unique(y)[0])[0]
    negative_indices = np.setdiff1d(np.arange(len(y)), positive_indices)

    positive_annotations = np.arange(len(positive_indices)) % k
    negative_annotations = np.arange(len(negative_indices)) % k

    metric_array = np.zeros((len(alpha_array), k))

    gc.collect()

    for fold in np.arange(k):

        cv_adata = adata[adata.uns['train_indices'], :]

        # Create CV train/test indices
        fold_train = np.concatenate(
            (positive_indices[np.where(positive_annotations != fold)[0]],
             negative_indices[np.where(negative_annotations != fold)[0]]))
        fold_test = np.concatenate(
            (positive_indices[np.where(positive_annotations == fold)[0]],
             negative_indices[np.where(negative_annotations == fold)[0]]))

        cv_adata.uns['train_indices'] = fold_train
        cv_adata.uns['test_indices'] = fold_test

        if tfidf:
            cv_adata = tfidf_normalize(cv_adata, binarize = True)

        cv_adata = estimate_sigma(cv_adata, n_features = 200)
        cv_adata = calculate_z(cv_adata, n_features = 5000)

        gc.collect()

        for i, alpha in enumerate(alpha_array):

            cv_adata = train_model(cv_adata, group_size, alpha = alpha)
            _, metrics = predict(cv_adata, metrics = [metric])
            metric_array[i, fold] = metrics[metric]

            gc.collect()

        del cv_adata
        gc.collect()

    # Take the mean of the chosen metric across the k folds to find the
    # alpha yielding the highest performance
    alpha_star = alpha_array[np.argmax(np.mean(metric_array, axis = 1))]
    gc.collect()

    return alpha_star
```
```python
import numpy as np

# train_model is documented on this page; find_selected_groups is a
# scmkl helper.


def optimize_sparsity(adata, group_size, starting_alpha = 1.9,
                      increment = 0.2, target = 1, n_iter = 10):
    '''
    Iteratively train a grouplasso model and update alpha to find the
    parameter yielding the desired sparsity.

    Parameters
    ----------
    **adata** : *AnnData*
        > `AnnData` with `'Z_train'` and `'Z_test'` in
        `adata.uns.keys()`.

    **group_size** : *int*
        > Argument describing how the features are grouped. Should be
        `2 * D`. For more information see celer documentation.

    **starting_alpha** : *float*
        > The alpha value to start the search at.

    **increment** : *float*
        > Amount to adjust alpha by between iterations.

    **target** : *int*
        > The desired number of groups selected by the model.

    **n_iter** : *int*
        > The maximum number of iterations to run.

    Returns
    -------
    **sparsity_dict** : *dict*
        > Tested alphas as keys and the number of selected pathways as
        the values.

    **alpha** : *float*
        > The alpha value yielding the number of selected groups
        closest to the target.

    Examples
    --------
    >>> sparsity_dict, alpha = scmkl.optimize_sparsity(adata, (2 * D),
    ...                                                target = 1)
    >>>
    >>> alpha
    0.01

    See Also
    --------
    celer.GroupLasso :
    https://mathurinm.github.io/celer/generated/celer.GroupLasso.html
    '''
    assert increment > 0 and increment < starting_alpha, ("Choose a positive "
                                                          "increment less "
                                                          "than alpha")
    assert target > 0 and isinstance(target, int), ("Choose an integer "
                                                    "target number of groups "
                                                    "that is greater than 0")
    assert n_iter > 0 and isinstance(n_iter, int), ("Choose an integer "
                                                    "number of iterations "
                                                    "that is greater than 0")

    sparsity_dict = {}
    alpha = starting_alpha

    for _ in np.arange(n_iter):
        adata = train_model(adata, group_size, alpha)
        num_selected = len(find_selected_groups(adata))

        sparsity_dict[np.round(alpha, 4)] = num_selected

        if num_selected < target:
            # Decreasing alpha will increase the number of selected pathways
            if alpha - increment in sparsity_dict.keys():
                # Make increment smaller so the model can't go back and
                # forth between alpha values
                increment /= 2
            # Ensures that alpha will never be negative
            alpha = np.max([alpha - increment, 1e-3])

        elif num_selected > target:
            if alpha + increment in sparsity_dict.keys():
                increment /= 2

            alpha += increment

        elif num_selected == target:
            break

    # Find the alpha that minimizes the difference between target and
    # observed number of selected groups
    spar_idx = np.argmin([np.abs(selected - target)
                          for selected in sparsity_dict.values()])
    optimal_alpha = list(sparsity_dict.keys())[spar_idx]

    return sparsity_dict, optimal_alpha
```
```python
import time
import tracemalloc

import anndata as ad
import numpy as np

# predict and find_selected_groups are scmkl helpers; train_model is
# documented on this page.


def run(adata : ad.AnnData, alpha_list : np.ndarray,
        metrics : list | None = None, return_probs = False) -> dict:
    '''
    Wrapper function for training and testing with multiple alpha
    values. Returns metrics, predictions, group weights, and resource
    usage.

    Parameters
    ----------
    **adata** : *AnnData*
        > A processed *AnnData* with `'Z_train'`, `'Z_test'`, and
        `'group_dict'` keys in `adata.uns`.

    **alpha_list** : *np.ndarray*
        > `alpha` values to create models using. Alpha refers to the
        penalty parameter in Group Lasso. Larger alphas force group
        weights to shrink towards 0 while smaller alphas apply a lesser
        penalty to kernel weights.

    **metrics** : *list[str]*
        > What metrics should be calculated on predictions. Options are
        ['AUROC', 'F1-Score', 'Accuracy', 'Precision', 'Recall']. When
        set to `None`, all metrics are calculated.

    **return_probs** : *bool*
        > If `True`, class probabilities from `predict()` are captured
        in `results['Probabilities']`.

    Returns
    -------
    **results** : *dict*
        > With keys and values:

        > `'Metrics'` : a nested dictionary as `[alpha][metric]` = value.

        > `'Selected_groups'` : a dictionary as `[alpha]` = array of
        groups with nonzero weights.

        > `'Norms'` : a dictionary as `[alpha]` = array of kernel
        weights for each group, order respective to 'Group_names'.

        > `'Predictions'` : a dictionary as `[alpha]` = predicted class
        respective to 'Observed' for that `alpha`.

        > `'Observed'` : an array of ground truth cell labels from the
        test set.

        > `'Test_indices'` : indices of samples respective to adata
        used in the test set.

        > `'Group_names'` : an array of group names respective to each
        array in 'Norms'.

        > `'Models'` : a dictionary where `[alpha]` = Celer Group Lasso
        object for that `alpha`.

        > `'RAM_usage'` : memory usage after training models for each
        `alpha`.

        > `'Probabilities'` : a dictionary as `[alpha]` = class
        probabilities for test samples; only populated when
        `return_probs = True`.

    Examples
    --------
    >>> results = scmkl.run(adata = adata,
    ...                     alpha_list = np.array([0.05, 0.1, 0.5]))
    >>> results
    dict_keys(['Metrics', 'Selected_groups', 'Norms', 'Predictions',
    ...        'Observed', 'Test_indices', 'Group_names', 'Models',
    ...        'Train_time', 'RAM_usage'])
    >>>
    >>> # List of alpha values
    >>> results['Metrics'].keys()
    dict_keys([0.05, 0.1, 0.5])
    >>>
    >>> results['Metrics'][0.05]
    {'AUROC': 0.9859,
     'Accuracy': 0.945,
     'F1-Score': 0.9452736318407959,
     'Precision': 0.9405940594059405,
     'Recall': 0.95}
    '''
    if metrics is None:
        metrics = ['AUROC', 'F1-Score', 'Accuracy', 'Precision', 'Recall']

    # Initializing variables to capture metrics
    group_names = list(adata.uns['group_dict'].keys())
    preds = {}
    group_norms = {}
    mets_dict = {}
    selected_groups = {}
    train_time = {}
    models = {}
    probs = {}

    D = adata.uns['D']

    # Generating models for each alpha and outputs
    for alpha in alpha_list:

        print(f'  Evaluating model. Alpha: {alpha}', flush = True)

        train_start = time.time()

        adata = train_model(adata, group_size = 2 * D, alpha = alpha)

        if return_probs:
            alpha_res = predict(adata,
                                metrics = metrics,
                                return_probs = return_probs)
            preds[alpha], mets_dict[alpha], probs[alpha] = alpha_res

        else:
            alpha_res = predict(adata,
                                metrics = metrics,
                                return_probs = return_probs)
            preds[alpha], mets_dict[alpha] = alpha_res

        selected_groups[alpha] = find_selected_groups(adata)

        kernel_weights = adata.uns['model'].coef_
        # Norm over each group's full 2 * D block of weights (the
        # original slice dropped each block's last weight)
        group_norms[alpha] = [
            np.linalg.norm(kernel_weights[i * 2 * D : (i + 1) * 2 * D])
            for i in np.arange(len(group_names))
        ]

        models[alpha] = adata.uns['model']

        train_end = time.time()
        train_time[alpha] = train_end - train_start

    # Combining results into one object
    results = {}
    results['Metrics'] = mets_dict
    results['Selected_groups'] = selected_groups
    results['Norms'] = group_norms
    results['Predictions'] = preds
    results['Observed'] = adata.obs['labels'].iloc[adata.uns['test_indices']]
    results['Test_indices'] = adata.uns['test_indices']
    results['Group_names'] = group_names
    results['Models'] = models
    results['Train_time'] = train_time
    results['RAM_usage'] = f'{tracemalloc.get_traced_memory()[1] / 1e9} GB'
    results['Probabilities'] = probs

    return results
```
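Since `'Norms'` is ordered respective to `'Group_names'`, ranking the most informative groups for a given alpha takes only a few lines. A small sketch, assuming `results` was produced by `scmkl.run()` as in the example above:

```python
import numpy as np

alpha = 0.05  # any alpha present in results['Metrics'].keys()
group_weights = dict(zip(results['Group_names'], results['Norms'][alpha]))

# Groups sorted by kernel weight norm, most informative first
ranked = sorted(group_weights, key = group_weights.get, reverse = True)
print(ranked[:5])
```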
```python
import numpy as np
import scipy

# _tfidf is a private helper defined in this module.


def tfidf_normalize(adata, binarize = False):
    '''
    Function to TFIDF normalize the data in an adata object. All rows
    (cells) must have at least one nonzero value.

    Parameters
    ----------
    **adata** : *AnnData*
        > `adata.X` to be normalized. If `'train_indices'` and
        `'test_indices'` in `adata.uns.keys()`, normalization will be
        done separately for the training and testing data. Otherwise,
        it will be calculated on the entire dataset.

    **binarize** : *bool*
        > If `True`, all values in `adata.X` greater than 1 will become
        1.

    Returns
    -------
    **adata** : *AnnData*
        > adata with adata.X TFIDF normalized. Will now have the train
        data stacked on test data, and the indices will be adjusted
        accordingly.

    Examples
    --------
    >>> adata = scmkl.create_adata(X = data_mat,
    ...                            feature_names = gene_names,
    ...                            group_dict = group_dict)
    >>>
    >>> adata = scmkl.tfidf_normalize(adata)
    '''
    X = adata.X.copy()
    row_sums = np.sum(X, axis = 1)
    assert np.all(row_sums > 0), "TFIDF requires all row sums be positive"

    if binarize:
        X[X > 0] = 1

    if 'train_indices' in adata.uns_keys():

        train_indices = adata.uns['train_indices'].copy()
        test_indices = adata.uns['test_indices'].copy()

        # Calculate the train TFIDF matrix on just the training data so
        # it is not biased by testing data
        tfidf_train = _tfidf(X[train_indices, :], mode = 'normalize')

        # Calculate the test TFIDF by calculating it on the train and
        # test data and indexing the test data
        tfidf_test = _tfidf(X, mode = 'normalize')[test_indices, :]

        # Impossible to add rows back to original location so we need to
        # stack the matrices to maintain train/test
        if scipy.sparse.issparse(X):
            tfidf_norm = scipy.sparse.vstack((tfidf_train, tfidf_test))
        else:
            tfidf_norm = np.vstack((tfidf_train, tfidf_test))

        # Reassignment is necessary; without it, the values would be
        # saved as 0s in adata
        adata.uns['train_indices'] = train_indices
        adata.uns['test_indices'] = test_indices

        combined_indices = np.concatenate((train_indices, test_indices))

        # AnnData indexes by "rownames" not position, so we need to
        # rename the rows to properly index
        adata_index = adata.obs_names[combined_indices].astype(int)
        tfidf_norm = tfidf_norm[np.argsort(adata_index), :]

    else:

        tfidf_norm = _tfidf(X, mode = 'normalize')

    adata.X = tfidf_norm.copy()

    return adata
```
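The exact weighting lives in the private `_tfidf()` helper and is not shown here; for orientation, a common single-cell ATAC TF-IDF normalization takes the form

$$\mathrm{TFIDF}_{ij} = \frac{x_{ij}}{\sum_{j'} x_{ij'}} \cdot \log\!\Big(1 + \frac{N}{\sum_{i'} \mathbf{1}[x_{i'j} > 0]}\Big),$$

i.e., a per-cell term frequency scaled by each peak's inverse document frequency over the $N$ cells used in the fit (training cells only for the train matrix, as the code above notes).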
```python
import celer
import numpy as np


def train_model(adata, group_size = 1, alpha = 0.9):
    '''
    Fit a grouplasso model to the provided data.

    Parameters
    ----------
    **adata** : *AnnData*
        > Has `'Z_train'` and `'Z_test'` keys in `adata.uns`.

    **group_size** : *int*
        > Argument describing how the features are grouped. Should be
        `2 * D`. For more information see celer documentation.

    **alpha** : *float*
        > Group Lasso regularization coefficient. alpha is a floating
        point value controlling model solution sparsity. Must be a
        positive float. The smaller the value, the more feature groups
        will be selected in the trained model.

    Returns
    -------
    **adata** : *AnnData*
        > Trained model accessible with `adata.uns['model']`.

    Examples
    --------
    >>> adata = scmkl.estimate_sigma(adata)
    >>> adata = scmkl.calculate_z(adata)
    >>> metrics = ['AUROC', 'F1-Score', 'Accuracy', 'Precision',
    ...            'Recall']
    >>> d = scmkl.calculate_d(adata.shape[0])
    >>> group_size = 2 * d
    >>> adata = scmkl.train_model(adata, group_size)
    >>>
    >>> 'model' in adata.uns.keys()
    True

    See Also
    --------
    celer documentation :
    https://mathurinm.github.io/celer/generated/celer.GroupLasso.html
    '''
    assert alpha > 0, 'Alpha must be positive'

    y_train = adata.obs['labels'].iloc[adata.uns['train_indices']]
    X_train = adata.uns['Z_train']

    cell_labels = np.unique(y_train)

    # This is a regression algorithm. We need to make the labels
    # 'continuous' for classification, but they will remain binary.
    # Casts training labels to an array of -1, 1
    train_labels = np.ones(y_train.shape)
    train_labels[y_train == cell_labels[1]] = -1

    # Alphamax is a calculation to regularize the effect of alpha across
    # different data sets
    alphamax = np.max(np.abs(X_train.T.dot(train_labels)))
    alphamax /= X_train.shape[0]
    alphamax *= alpha

    # Instantiate celer Group Lasso regression model object
    model = celer.GroupLasso(groups = group_size, alpha = alphamax)

    # Fit model using training data
    model.fit(X_train, train_labels.ravel())

    adata.uns['model'] = model

    return adata
```
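The `alphamax` rescaling above makes the user-facing `alpha` comparable across data sets. The penalty actually passed to celer is

$$\alpha_{\mathrm{celer}} = \alpha \cdot \frac{\max\big(\lvert Z_{\mathrm{train}}^{\top} y \rvert\big)}{n_{\mathrm{train}}},$$

where $y$ holds the ±1 training labels; the scaling factor mirrors the classic lasso critical penalty at which all coefficients shrink to zero, so `alpha` acts as a fraction of that critical value.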