scmkl



Single-cell analysis using Multiple Kernel Learning (scMKL) is a binary classification algorithm that uses prior information to group features, both enhancing classification and aiding interpretation of the features that distinguish classes in multi-omic data sets.

Installation

Conda install

Conda is the recommended method to install scMKL:

conda create -n scMKL python=3.12 -c conda-forge ivango17::scmkl

Pip install

First, create a virtual environment with python>=3.11.1,<3.13.

Then, install scMKL with:

# activate your new env with python>=3.11.1 and <3.13
pip install scmkl

If wheels do not build correctly, ensure gcc and g++ are installed and up to date. On Debian/Ubuntu they can be installed with sudo apt install gcc g++.

Requirements

scMKL takes advantage of AnnData objects and can be implemented with just four pieces of data:

1) scRNA and/or scATAC matrices (can be scipy.sparse matrix)

2) An array of cell labels

3) An array of feature names (e.g., gene symbols for RNA or peaks for ATAC)

4) A grouping dictionary in the form {'group_1' : [feature_5, feature_16], 'group_2' : [feature_1, feature_4, feature_9]}

For more information on formatting/creating the grouping dictionaries, see our example for creating an RNA grouping or ATAC grouping.

For implementing scMKL, see our examples for your use case in examples.
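To make the requirements concrete, below is a minimal end-to-end sketch with toy data (the matrix, labels, feature names, and groupings are synthetic stand-ins for the four inputs above; see the linked examples for real workflows):

import numpy as np
import scmkl

# Toy stand-ins for the four required pieces of data
rng = np.random.default_rng(0)
X = rng.poisson(1.0, (200, 60)).astype(float)           # 1) cells x features matrix
labels = np.repeat(['case', 'control'], 100)            # 2) cell labels
features = np.array([f'gene_{i}' for i in range(60)])   # 3) feature names
groups = {'group_1': set(features[:30]),                # 4) grouping dictionary
          'group_2': set(features[25:])}

adata = scmkl.create_adata(X = X, feature_names = features,
                           cell_labels = labels, group_dict = groups)
adata = scmkl.estimate_sigma(adata)
adata = scmkl.calculate_z(adata)
results = scmkl.run(adata, alpha_list = np.array([0.1, 0.05]))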

Repo: https://github.com/ohsu-cedar-comp-hub/scMKL

PyPI: https://pypi.org/project/scmkl/

Anaconda: https://anaconda.org/ivango17/scmkl

API: https://ohsu-cedar-comp-hub.github.io/scMKL/

Citation

If you use scMKL in your research, please cite using:

To be determined

Our Shiny for Python application for viewing data produced from this work can be found here: scMKL_analysis


scMKL Documentation

 1"""
 2.. include:: ../README.md
 3
 4----------------------------
 5
 6## **scMKL Documentation**
 7"""
 8
 9
10__all__ = ['calculate_z', 
11           'create_adata', 
12           'dataframes',
13           'estimate_sigma', 
14           'get_atac_groupings',
15           'multimodal_processing', 
16           'one_v_rest', 
17           'optimize_alpha', 
18           'optimize_sparsity',
19           'plotting',
20           'run',
21           'test',
22           'tfidf_normalize',
23           'train_model'
24           ]
25
26from scmkl._checks import *
27from scmkl.calculate_z import *
28from scmkl.create_adata import *
29from scmkl.dataframes import *
30from scmkl.estimate_sigma import *
31from scmkl.get_atac_groupings import *
32from scmkl.multimodal_processing import *
33from scmkl.one_v_rest import *
34from scmkl.optimize_alpha import *
35from scmkl.optimize_sparsity import *
36from scmkl.plotting import *
37from scmkl.run import *
38from scmkl.test import *
39from scmkl.tfidf_normalize import *
40from scmkl.train_model import *
def calculate_z(adata, n_features = 5000) -> ad.AnnData:
    '''
    Function to calculate the Z matrix.

    Parameters
    ----------
    **adata** : *AnnData*
        > created by `create_adata()` with `'sigma'`, 
        `'train_indices'`, and `'test_indices'` in `adata.uns.keys()`. 
        The `'sigma'` key can be added by running `estimate_sigma()` 
        on adata. 

    **n_features** : *int* 
        > Number of random features to use when calculating Z; used 
        for scalability.

    Returns
    -------
    **adata** : *AnnData*
        > adata with Z matrices accessible with `adata.uns['Z_train']` 
        and `adata.uns['Z_test']`.

    Examples
    --------
    >>> adata = scmkl.estimate_sigma(adata)
    >>> adata = scmkl.calculate_z(adata)
    >>> adata.uns.keys()
    dict_keys(['Z_train', 'Z_test', 'sigma', 'train_indices', 
    'test_indices'])
    '''
    assert np.all(adata.uns['sigma'] > 0), 'Sigma must be positive'

    # Number of groupings taken from group_dict
    n_pathway = len(adata.uns['group_dict'].keys())
    D = adata.uns['D']

    # Capturing training and testing indices
    train_idx = np.array(adata.uns['train_indices'], dtype = np.int_)
    test_idx = np.array(adata.uns['test_indices'], dtype = np.int_)

    # Create arrays to store concatenated group Z
    # Each group of features will have a corresponding entry in each array
    n_cols = 2 * adata.uns['D'] * n_pathway

    Z_train = np.zeros((train_idx.shape[0], n_cols))
    Z_test = np.zeros((test_idx.shape[0], n_cols))


    # Loop over each of the groups, creating Z for each
    for m, group_features in enumerate(adata.uns['group_dict'].values()):

        # Extract features from mth group
        num_group_features = len(group_features)

        # Sample up to n_features features; important for scalability if 
        # using large groupings
        # Will use all features if the grouping contains fewer than n_features
        number_features = np.min([n_features, num_group_features])
        group_array = np.array(list(group_features))
        group_features = adata.uns['seed_obj'].choice(group_array, 
                                                      number_features, 
                                                      replace = False) 

        # Create data arrays containing only features within this group
        X_train = adata[adata.uns['train_indices'],:][:, group_features].X
        X_test = adata[adata.uns['test_indices'],:][:, group_features].X

        if adata.uns['tfidf']:
            X_train, X_test = _tfidf_train_test(X_train, X_test)

        # Data filtering and transformation according to given data_type
        # Will remove low-variance (< 1e5) features regardless of data_type
        # If given data_type is 'counts', will log scale and z-score the data
        X_train, X_test = _process_data(X_train = X_train, X_test = X_test, 
                                        scale_data = adata.uns['scale_data'], 
                                        return_dense = True)          

        if adata.uns['reduction'].lower() == 'svd':

            SVD_func = TruncatedSVD(n_components = np.min([50, X_train.shape[1]]), 
                                    random_state = 1)

            # Remove first component as it corresponds with sequencing depth
            X_train = SVD_func.fit_transform(scipy.sparse.csr_array(X_train))[:, 1:]
            X_test = SVD_func.transform(scipy.sparse.csr_array(X_test))[:, 1:]

        elif adata.uns['reduction'].lower() == 'pca':
            PCA_func = PCA(n_components = np.min([50, X_train.shape[1]]), 
                           random_state = 1)

            X_train = PCA_func.fit_transform(np.asarray(X_train))
            X_test = PCA_func.transform(np.asarray(X_test))

        elif adata.uns['reduction'].lower() == 'linear':

            # Project train and test with the same sparse random binary 
            # matrix down to 50 dimensions
            proj = adata.uns['seed_obj'].choice([0, 1], p = [0.02, 0.98], 
                                                size = X_train.shape[1] * 50)
            proj = proj.reshape((X_train.shape[1], 50))
            X_train = X_train @ proj
            X_test = X_test @ proj

        if scipy.sparse.issparse(X_train):
            X_train = X_train.toarray().astype(np.float16)
            X_test = X_test.toarray().astype(np.float16)

        # Extract pre-calculated sigma used for approximating kernel
        adjusted_sigma = adata.uns['sigma'][m]

        # Calculates approximate kernel according to chosen kernel function
        # Distribution data comes from Fourier Transform of kernel function
        if adata.uns['kernel_type'].lower() == 'gaussian':

            gamma = 1 / (2 * adjusted_sigma ** 2)
            sigma_p = 0.5 * np.sqrt(2 * gamma)

            W = adata.uns['seed_obj'].normal(0, sigma_p, X_train.shape[1] * D)
            W = W.reshape((X_train.shape[1], D))

        elif adata.uns['kernel_type'].lower() == 'laplacian':

            gamma = 1 / (2 * adjusted_sigma)

            W = adata.uns['seed_obj'].standard_cauchy(X_train.shape[1] * D)
            W = gamma * W.reshape((X_train.shape[1], D))

        elif adata.uns['kernel_type'].lower() == 'cauchy':

            gamma = 1 / (2 * adjusted_sigma ** 2)
            b = 0.5 * np.sqrt(gamma)

            W = adata.uns['seed_obj'].laplace(0, b, X_train.shape[1] * D)
            W = W.reshape((X_train.shape[1], D))


        train_projection = np.matmul(X_train, W)
        test_projection = np.matmul(X_test, W)

        # Store group Z in whole-Z object. 
        # Preserves order to be able to extract meaningful groups
        x_idx = np.arange(m * 2 * D, (m + 1) * 2 * D)
        sq_i_d = np.sqrt(1 / D)

        Z_train[0:, x_idx] = sq_i_d * np.hstack((np.cos(train_projection), 
                                                 np.sin(train_projection)))
        Z_test[0:, x_idx] = sq_i_d * np.hstack((np.cos(test_projection), 
                                                np.sin(test_projection)))

    adata.uns['Z_train'] = Z_train
    adata.uns['Z_test'] = Z_test

    return adata
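For intuition, the cos/sin projections above are standard Random Fourier Features: with `W` drawn as in the Gaussian branch, `Z @ Z.T` converges to the exact kernel matrix as `D` grows. A self-contained sketch with toy data (NumPy/SciPy only, not scMKL internals):

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size = (100, 20))     # toy data: 100 cells x 20 features
D, sigma = 2000, 2.0                 # number of Fourier features, kernel width

# Same draw as the Gaussian branch above
gamma = 1 / (2 * sigma ** 2)
sigma_p = 0.5 * np.sqrt(2 * gamma)
W = rng.normal(0, sigma_p, size = (20, D))

projection = X @ W
Z = np.sqrt(1 / D) * np.hstack((np.cos(projection), np.sin(projection)))

# Z @ Z.T approximates the Gaussian kernel exp(-sigma_p**2 * ||x - y||**2 / 2)
K_exact = np.exp(-sigma_p ** 2 * cdist(X, X, 'sqeuclidean') / 2)
print(np.abs(Z @ Z.T - K_exact).max())   # approximation error shrinks with D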

def create_adata(X, feature_names: np.ndarray, cell_labels: np.ndarray, 
                 group_dict: dict, scale_data: bool = True, 
                 split_data : np.ndarray | None = None, D : int | None = None, 
                 remove_features = True, train_ratio = 0.8,
                 distance_metric = 'euclidean', kernel_type = 'Gaussian', 
                 random_state : int = 1, allow_multiclass : bool = False, 
                 class_threshold : str | int = 'median',
                 reduction: str | None = None, tfidf: bool = False):
    '''
    Function to create an AnnData object to carry all relevant 
    information going forward.

    Parameters
    ----------
    **X** : *scipy.sparse.csc_matrix* | *np.ndarray* | 
            *pd.DataFrame*
        > A data matrix of cells by features (sparse array 
        recommended for large datasets).

    **feature_names** : *np.ndarray*
        > array of feature names corresponding with the features 
        in X.

    **cell_labels** : *np.ndarray*
        > A numpy array of cell phenotypes corresponding with 
        the cells in X.

    **group_dict** : *dict* 
        > Dictionary containing feature grouping information.
            - Example: {geneset: np.array([gene_1, gene_2, ..., 
                        gene_n])}

    **scale_data** : *bool*  
        > If `True`, data matrix is log transformed and standard 
        scaled. 

    **split_data** : *None* | *np.ndarray*
        > If *None*, data will be split stratified by cell labels. 
        Else, is an array of a precalculated train/test split 
        corresponding to samples. Can include labels for the entire
        dataset to benchmark performance, or for only the training
        data to classify unknown cell types.
            - Example: np.array(['train', 'test', ..., 'train'])

    **D** : *int* 
        > Number of Random Fourier Features used to calculate Z. 
        Should be a positive integer. Higher values of D will 
        increase classification accuracy at the cost of computation 
        time. If set to `None`, will be calculated given the number 
        of samples. 

    **remove_features** : *bool* 
        > If `True`, will remove features from X and feature_names
        not in group_dict and remove features from groupings not in
        feature_names.

    **train_ratio** : *float*
        > Ratio of the number of training samples to the entire data 
        set. Note: if a threshold is applied, the training ratio may 
        decrease depending on class balance and the `class_threshold`
        parameter if `allow_multiclass = True`.

    **distance_metric** : *str* 
        > The pairwise distance metric used to estimate sigma. Must
        be one of the options used in scipy.spatial.distance.cdist.

    **kernel_type** : *str*
        > The approximated kernel function used to calculate Zs.
        Must be one of `'Gaussian'`, `'Laplacian'`, or `'Cauchy'`.

    **random_state** : *int*
        > Integer random_state used to set the seed for 
        reproducibility.

    **allow_multiclass** : *bool*
        > If `False`, will ensure that cell labels are binary.

    **class_threshold** : *str* | *int*
        > Maximum number of training samples allowed for each cell
        class. If `'median'`, the median number of cells per cell 
        class will be the threshold for the number of samples per 
        class.

    **reduction** : *str* | *None*
        > Which dimension reduction technique to perform on features
        within a group. `'svd'` will run 
        sklearn.decomposition.TruncatedSVD; `'linear'` will multiply 
        by a sparse random binary matrix down to 50 dimensions.

    **tfidf** : *bool*
        > Whether to calculate the TFIDF transformation on peaks 
        within groupings.

    Returns
    -------
    **adata** : *AnnData*
    > *AnnData* with the following attributes and keys:

    > `adata.X` : the data matrix.

    > `adata.var_names` : the feature names corresponding to
    `adata.X`.

    > `adata.obs['labels']` : cell classes/phenotypes from 
    `cell_labels`.

    > `adata.uns['train_indices']` : Indices for training data. 

    > `adata.uns['test_indices']` : Indices for testing data.

    > `adata.uns['group_dict']` : Grouping information.

    > `adata.uns['seed_obj']` : Seed object with seed equal to
    100 * `random_state`.

    > `adata.uns['D']` : Number of dimensions scMKL will run with.

    > `adata.uns['scale_data']` : *bool* for whether or not data is 
    log transformed and scaled.

    > `adata.uns['distance_metric']` : Distance metric as given.

    > `adata.uns['kernel_type']` : Kernel function as given.

    > `adata.uns['reduction']` : Dimension reduction technique as 
    given.

    > `adata.uns['tfidf']` : *bool* for whether to calculate tfidf 
    per grouping.

    Examples
    --------
    >>> data_mat = scipy.sparse.load_npz('MCF7_RNA_matrix.npz')
    >>> gene_names = np.load('MCF7_gene_names.pkl', allow_pickle = True)
    >>> group_dict = np.load('hallmark_genesets.pkl', 
    >>>                      allow_pickle = True)
    >>> 
    >>> adata = scmkl.create_adata(X = data_mat, 
    ...                            feature_names = gene_names, 
    ...                            group_dict = group_dict)
    >>> adata
    AnnData object with n_obs × n_vars = 1000 × 4341
    obs: 'labels'
    uns: 'group_dict', 'seed_obj', 'scale_data', 'D', 'kernel_type', 
    'distance_metric', 'train_indices', 'test_indices'
    '''

    assert X.shape[1] == len(feature_names), ("Different number of features "
                                              "in X than feature names")

    if not allow_multiclass:
        assert len(np.unique(cell_labels)) == 2, ("cell_labels must contain "
                                                  "2 classes")
    if D is not None:    
        assert isinstance(D, int) and D > 0, 'D must be a positive integer'

    kernel_options = ['gaussian', 'laplacian', 'cauchy']
    assert kernel_type.lower() in kernel_options, ("Given kernel type not "
                                                   "implemented. Gaussian, "
                                                   "Laplacian, and Cauchy "
                                                   "are the acceptable "
                                                   "types.")

    X, feature_names, group_dict = _filter_features(X, 
                                                    feature_names, 
                                                    group_dict, 
                                                    remove_features)

    # Create adata object and add column names
    adata = ad.AnnData(X)
    adata.var_names = feature_names

    # Add metadata to adata object
    adata.uns['group_dict'] = group_dict
    adata.uns['seed_obj'] = np.random.default_rng(100 * random_state)
    adata.uns['scale_data'] = scale_data
    adata.uns['D'] = D if D is not None else calculate_d(adata.shape[0])
    adata.uns['kernel_type'] = kernel_type
    adata.uns['distance_metric'] = distance_metric
    adata.uns['reduction'] = reduction if isinstance(reduction, str) else 'None'
    adata.uns['tfidf'] = tfidf

    if split_data is None:
        assert X.shape[0] == len(cell_labels), ("Different number of cells "
                                                "than labels")
        adata.obs['labels'] = cell_labels

        if not allow_multiclass:
            split = _binary_split(cell_labels, 
                                  seed_obj = adata.uns['seed_obj'],
                                  train_ratio = train_ratio)
            train_indices, test_indices = split

        else:
            split = _multi_class_split(cell_labels, 
                                       seed_obj = adata.uns['seed_obj'], 
                                       class_threshold = class_threshold,
                                       train_ratio = train_ratio)
            train_indices, test_indices = split

        adata.uns['labeled_test'] = True

    else:
        # Labels must cover either all cells or only the training cells
        x_eq_labs = X.shape[0] == len(cell_labels)
        train_eq_labs = np.sum(split_data == 'train') == len(cell_labels)
        assert x_eq_labs or train_eq_labs, ("Must give labels for all cells "
                                            "or only for training cells")

        train_indices = np.where(split_data == 'train')[0]
        test_indices = np.where(split_data == 'test')[0]

        if len(cell_labels) == len(train_indices):

            padded_cell_labels = np.zeros((X.shape[0])).astype('object')
            padded_cell_labels[train_indices] = cell_labels
            padded_cell_labels[test_indices] = 'padded_test_label'

            adata.obs['labels'] = padded_cell_labels
            adata.uns['labeled_test'] = False

        elif len(cell_labels) == len(split_data):
            adata.obs['labels'] = cell_labels
            adata.uns['labeled_test'] = True

    adata.uns['train_indices'] = train_indices
    adata.uns['test_indices'] = test_indices

    if not scale_data:
        print("WARNING: Data will not be log transformed and scaled. "
              "To change this behavior, set scale_data to True")

    return adata
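One behavior worth illustrating is passing `split_data` with labels for only the training cells; the test labels are then padded internally and `adata.uns['labeled_test']` is set to `False`, as in the code above. A minimal sketch with toy data:

import numpy as np
import scmkl

rng = np.random.default_rng(0)
X = rng.poisson(1.0, (1000, 300)).astype(float)
features = np.array([f'gene_{i}' for i in range(300)])
groups = {'set_a': set(features[:150]), 'set_b': set(features[120:])}

# Precalculated split: first 800 cells are labeled training cells,
# last 200 are unlabeled cells to classify
split = np.array(['train'] * 800 + ['test'] * 200)
train_labels = rng.choice(['case', 'control'], 800)

adata = scmkl.create_adata(X = X, feature_names = features,
                           cell_labels = train_labels, group_dict = groups,
                           split_data = split)
print(adata.uns['labeled_test'])   # False: test labels were padded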

def estimate_sigma(adata, n_features = 5000):
    '''
    Calculate kernel widths to inform the distribution for projection 
    of Fourier Features. Calculates one sigma per group of features.

    Parameters
    ----------
    **adata** : *AnnData* 
        > Created by `create_adata`.

    **n_features** : *int*  
        > Number of random features to include when estimating sigma. 
        Will be scaled for the whole pathway set according to a 
        heuristic. Used for scalability.

    Returns
    -------
    **adata** : *AnnData*
        > Key added `adata.uns['sigma']`.

    Examples
    --------
    >>> adata = scmkl.estimate_sigma(adata)
    >>> adata.uns['sigma']
    array([10.4640895 , 10.82011454,  6.16769438,  9.86156855, ...])
    '''
    sigma_list = []

    # Loop over every group in group_dict
    for m, group_features in enumerate(adata.uns['group_dict'].values()):

        # Select only features in that group and downsample for scalability
        num_group_features = len(group_features)
        group_array = np.array(list(group_features))
        n_feats = min([n_features, num_group_features])
        group_features = adata.uns['seed_obj'].choice(group_array, n_feats, 
                                                      replace = False) 

        # Use only the train data to estimate sigma
        X_train = adata[adata.uns['train_indices'], group_features].X

        # Sample cells for scalability
        sample_idx = np.arange(X_train.shape[0])
        n_samples = np.min((2000, X_train.shape[0]))
        distance_indices = adata.uns['seed_obj'].choice(sample_idx, n_samples, 
                                                        replace = False)

        X_train = _process_data(X_train = X_train, 
                                scale_data = adata.uns['scale_data'], 
                                return_dense = True)

        if adata.uns['tfidf']:
            X_train = _tfidf(X_train, mode = 'normalize')

        if adata.uns['reduction'].lower() == 'svd':

            SVD_func = TruncatedSVD(n_components = np.min([50, X_train.shape[1]]), 
                                    random_state = 1)
            X_train = SVD_func.fit_transform(
                scipy.sparse.csr_array(X_train[distance_indices, :]))[:, 1:]

        elif adata.uns['reduction'].lower() == 'pca':
            PCA_func = PCA(n_components = np.min([50, X_train.shape[1]]), 
                           random_state = 1)
            X_train = PCA_func.fit_transform(
                np.asarray(X_train[distance_indices, :]))

        elif adata.uns['reduction'].lower() == 'linear':

            # Project with a sparse random binary matrix, as in calculate_z()
            proj = adata.uns['seed_obj'].choice([0, 1], p = [0.02, 0.98], 
                                                size = X_train.shape[1] * 50)
            proj = proj.reshape((X_train.shape[1], 50))
            X_train = X_train[distance_indices] @ proj

        else:
            X_train = X_train[distance_indices, :]

        if scipy.sparse.issparse(X_train):
            X_train = X_train.toarray()

        # Calculate distance matrix with the specified metric
        sigma = scipy.spatial.distance.cdist(X_train, 
                                             X_train, 
                                             adata.uns['distance_metric'])
        sigma = np.mean(sigma)

        # sigma = 0 is numerically unusable in later steps
        # Using such a small sigma will result in a wide distribution, and 
        # typically a non-predictive Z
        if sigma == 0:
            sigma += 1e-5

        if n_features < num_group_features:
            # Heuristic we calculated to account for fewer features used in 
            # the distance calculation
            sigma = sigma * num_group_features / n_features 

        sigma_list.append(sigma)

    adata.uns['sigma'] = np.array(sigma_list)

    return adata
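Numerically, each sigma is the mean pairwise distance over the (sub)sampled training cells for a group, rescaled when the group was downsampled. A sketch of the arithmetic with toy data:

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X_group = rng.normal(size = (500, 120))   # one group: 500 cells x 120 features

sigma = cdist(X_group, X_group, 'euclidean').mean()

# Had only 100 of the 120 features been sampled (n_features < group size),
# the heuristic above would scale the estimate up accordingly
sigma_rescaled = sigma * 120 / 100
print(sigma, sigma_rescaled)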

def get_atac_groupings(gene_anno : pd.DataFrame, gene_sets : dict, 
                       feature_names : np.ndarray | pd.Series | list | set,
                       len_up : int = 5000, len_down : int = 5000) -> dict:
    '''
    Creates a peak set where keys are gene set names from `gene_sets` 
    and values are arrays of features pulled from `feature_names`. 
    Features are added to each peak set given overlap between regions 
    in the single-cell data matrix and inferred gene promoter regions 
    in `gene_anno`.

    Parameters
    ----------
    **gene_anno** : *pd.DataFrame*
        > Gene annotations in GTF format as a pd.DataFrame with columns
        ['chr', 'start', 'end', 'strand', 'gene_name'].

    **gene_sets** : *dict*
        > Gene set names as keys and an iterable object of gene names
        as values.

    **feature_names** : *np.ndarray* | *pd.Series* | *list* | *set*
        > Feature names corresponding to a single-cell ATAC data 
        matrix.

    **len_up** : *int*
        > Number of bases upstream of the gene start to include in 
        the inferred promoter region.

    **len_down** : *int*
        > Number of bases downstream of the gene start to include in 
        the inferred promoter region.

    Returns
    -------
    **atac_grouping** : *dict*
        > Keys are the names from `gene_sets` and values
        are a list of regions from `feature_names` that overlap with 
        promoter regions respective to genes in `gene_sets` (i.e., if 
        an ATAC feature in `feature_names` overlaps with the promoter 
        region of a gene in a gene set from `gene_sets`, that region 
        will be added to the new dictionary under the respective gene 
        set name).

    Examples
    --------
    >>> # Reading in a gene set and the peak names from scATAC dataset
    >>> gene_sets = np.load("data/RNA_hallmark_groupings.pkl", 
    ...                     allow_pickle = True)
    >>> peaks = np.load("data/MCF7_ATAC_feature_names.npy", 
    ...                 allow_pickle = True)
    >>> 
    >>> # Reading in GTF file
    >>> gtf_path = "data/hg38_subset_protein_coding.annotation.gtf"
    >>> gtf = pd.read_csv(gtf_path, sep = "\t", header = None, 
    ...                   skip_blank_lines = True, comment = "#")
    >>> 
    >>> # Naming columns
    >>> gtf.columns = ['chr', 'source', 'feature', 'start', 'end', 
    ...                'score', 'strand', 'frame', 'attribute']
    >>>
    >>> # Subsetting to only protein coding genes
    >>> prot_rows = gtf['attribute'].str.contains('protein_coding')
    >>> gtf = gtf[prot_rows]
    >>> gtf = gtf[gtf['feature'] == 'gene']
    >>> 
    >>> # Capturing gene name from attributes column
    >>> gtf['gene_name'] = [re.findall(r'(?<=gene_name ")[A-Za-z0-9]+', 
    ...                                attr)[0] 
    ...                     for attr in gtf['attribute']]
    >>>
    >>> atac_grouping = scmkl.get_atac_groupings(gene_anno = gtf,
    ...                                         gene_sets = gene_sets,
    ...                                         feature_names = peaks)
    >>>
    >>> atac_grouping.keys()
    dict_keys(['HALLMARK_TNFA_SIGNALING_VIA_NFKB', ...])
    '''
    # Getting a unique set of gene names from gene_sets
    all_genes = {gene for group in gene_sets.keys()
                 for gene in gene_sets[group]}

    # Filtering out NaN values
    all_genes = [gene for gene in all_genes if type(gene) != float]

    # Filtering out annotations for genes not present in gene_sets
    gene_anno = gene_anno[np.isin(gene_anno['gene_name'], all_genes)]

    # Adjusting start and end columns to represent promoter regions
    gene_anno = _adjust_regions(gene_anno = gene_anno, 
                                len_up = len_up, len_down = len_down)

    # Creating a dictionary from assay features where [chr] : (start, end)
    feature_dict = _create_feature_dict(feature_names)

    # Creating data structures from gene_anno for comparing regions
    peak_gene_dict, ga_regions = _create_region_dicts(gene_anno)

    # Capturing the separator type used in assay
    chr_sep = ':' if ':' in feature_names[0] else '-'

    atac_groupings = _compare_regions(feature_dict = feature_dict,
                                      ga_regions = ga_regions,
                                      peak_gene_dict = peak_gene_dict,
                                      gene_sets = gene_sets,
                                      chr_sep = chr_sep)

    return atac_groupings
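The core test behind `_compare_regions` (whose internals are not shown here) reduces to interval overlap between a peak and an inferred promoter window on the same chromosome; a minimal illustration:

def overlaps(peak: tuple, promoter: tuple) -> bool:
    '''Two same-chromosome intervals overlap iff each starts before 
    the other ends.'''
    return peak[0] < promoter[1] and promoter[0] < peak[1]

# A 5 kb up/downstream window around a gene start at position 6,000
print(overlaps((1000, 1500), (1000, 11000)))     # True
print(overlaps((12000, 12500), (1000, 11000)))   # False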

def multimodal_processing(adatas : list, names : list, tfidf : list):
    '''
    Combines and processes a list of adata objects.

    Parameters
    ----------
    **adatas** : *list[AnnData]* 
        > List of AnnData objects where each object is a different 
        modality for the same cells.

    **names** : *list[str]*
        > List of string names for each modality respective to each 
        object in `adatas`.

    **tfidf** : *list[bool]* 
        > List where if element i is `True`, adatas[i] will be TFIDF 
        normalized.

    Returns
    -------
    **adata** : *AnnData* 
        > Concatenated from objects from `adatas` with Z matrices 
        calculated.

    Examples
    --------
    >>> rna_adata = scmkl.create_adata(X = mcf7_rna_mat, 
    ...                                feature_names = gene_names, 
    ...                                scale_data = True, 
    ...                                cell_labels = cell_labels, 
    ...                                group_dict = rna_grouping)
    >>>
    >>> atac_adata = scmkl.create_adata(X = mcf7_atac_mat, 
    ...                                 feature_names = peak_names, 
    ...                                 scale_data = False, 
    ...                                 cell_labels = cell_labels, 
    ...                                 group_dict = atac_grouping)
    >>>
    >>> adatas = [rna_adata, atac_adata]
    >>> mod_names = ['rna', 'atac']
    >>> adata = scmkl.multimodal_processing(adatas = adatas, 
    ...                                     names = mod_names,
    ...                                     tfidf = [False, True])
    >>>
    >>> adata
    AnnData object with n_obs × n_vars = 1000 × 12676
    obs: 'labels'
    var: 'labels'
    uns: 'D', 'kernel_type', 'distance_metric', 'train_indices',  
    'test_indices', 'Z_train', 'Z_test', 'group_dict', 'seed_obj'
    '''
    import warnings 
    warnings.filterwarnings('ignore')

    # All objects must have the same number of cells
    assert len({adata.shape[0] for adata in adatas}) == 1, ("Different "
                                                            "number of cells "
                                                            "present in each "
                                                            "object")

    # True if all train indices match
    same_train = np.all([np.array_equal(adatas[0].uns['train_indices'], 
                                        adatas[i].uns['train_indices']) 
                         for i in range(1, len(adatas))])

    # True if all test indices match
    same_test = np.all([np.array_equal(adatas[0].uns['test_indices'], 
                                       adatas[i].uns['test_indices']) 
                        for i in range(1, len(adatas))])

    assert same_train, 'Different train indices'
    assert same_test, 'Different test indices'

    # Creates a boolean array for each modality of cells with non-empty rows
    non_empty_rows = [np.array(_sparse_var(adata.X, axis = 1) != 0).ravel() 
                      for adata in adatas]

    # Returns a 1d array that is True where a cell's feature sums
    # across all modalities are more than 0
    non_empty_rows = np.logical_and.reduce(non_empty_rows).squeeze()

    # Initializing final train test split array
    train_test = np.repeat('train', adatas[0].shape[0])
    train_test[adatas[0].uns['test_indices']] = 'test'

    # Capturing train test split with empty rows filtered out
    train_test = train_test[non_empty_rows]
    train_indices = np.where(train_test == 'train')[0]
    test_indices = np.where(train_test == 'test')[0]

    # Filtering out empty samples and adding train test split 
    # arrays to the AnnData objects
    for i in range(len(adatas)):
        adatas[i] = adatas[i][non_empty_rows, :]
        adatas[i].uns['train_indices'] = train_indices
        adatas[i].uns['test_indices'] = test_indices

        # tfidf normalizing if corresponding element in tfidf is True
        if tfidf[i]:
            adatas[i] = tfidf_normalize(adatas[i])

        print(f'Estimating Sigma for {names[i]}', flush = True)
        adatas[i] = estimate_sigma(adatas[i], n_features = 200)
        print(f'Calculating Z for {names[i]}', flush = True)
        adatas[i] = calculate_z(adatas[i], n_features = 5000)

    if 'labels' in adatas[0].obs:
        all_labels = [adata.obs['labels'] for adata in adatas]
        # Ensuring cell labels for each AnnData object are the same
        for i in range(1, len(all_labels)):
            same_labels = np.all(all_labels[0] == all_labels[i])
            assert same_labels, (f"Cell labels between AnnData object in "
                                 f"position 0 and position {i} in adatas do "
                                 "not match")

    adata = _combine_modalities(adatas = adatas,
                                names = names,
                                combination = 'concatenate')

    del adatas
    gc.collect()

    return adata
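The empty-row filter above keeps only cells that are non-empty in every modality; a toy illustration of the mask logic:

import numpy as np

rna_nonzero  = np.array([True, True, False, True])
atac_nonzero = np.array([True, False, True, True])

keep = np.logical_and.reduce([rna_nonzero, atac_nonzero])
print(keep)   # [ True False False  True]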

def one_v_rest(adatas : list, names : list, alpha_list : np.ndarray, 
               tfidf : list) -> dict:
    '''
    For each cell class, creates model(s) comparing that class to all 
    others. Then, predicts on the testing data using `scmkl.run()`.
    Only labels present in both training and testing will be run.

    Parameters
    ----------
    **adatas** : *list[AnnData]* 
        > List of AnnData objects created by create_adata()
        where each AnnData is one modality and composed of both 
        training and testing samples. Requires that `'train_indices'`
        and `'test_indices'` are the same across all AnnDatas.

    **names** : *list[str]* 
        > List of string variables that describe each modality
        respective to adatas for labeling.

    **alpha_list** : *np.ndarray* | *float*
        > An array of alpha values to create each model with.

    **tfidf** : *list[bool]* 
        > List where if element i is `True`, adatas[i] will be TFIDF 
        normalized.

    Returns
    -------
    **results** : *dict*
    > Contains keys for each cell class with results from cell class
    versus all other samples. See `scmkl.run()` for further details.

    Examples
    --------
    >>> adata = scmkl.create_adata(X = data_mat, 
    ...                            feature_names = gene_names, 
    ...                            group_dict = group_dict)
    >>>
    >>> results = scmkl.one_v_rest(adatas = [adata], names = ['rna'],
    ...                           alpha_list = np.array([0.05, 0.1]),
    ...                           tfidf = [False])
    >>>
    >>> results.keys()
    dict_keys(['B cells', 'Monocytes', 'Dendritic cells', ...])
    '''
    # Formatting checks ensuring all adata elements are 
    # AnnData objects and train/test indices are all the same
    _check_adatas(adatas, check_obs = True, check_uns = True)

    # Extracting train and test indices
    train_indices = adatas[0].uns['train_indices']
    test_indices = adatas[0].uns['test_indices']

    # Checking and capturing cell labels
    uniq_labels = _eval_labels(cell_labels = adatas[0].obs['labels'], 
                               train_indices = train_indices,
                               test_indices = test_indices)


    # Calculating Z matrices, method depends on whether there are multiple 
    # adatas (modalities)
    if len(adatas) == 1:
        adata = estimate_sigma(adatas[0], n_features = 200)
        adata = calculate_z(adata, n_features = 5000)
    else:
        adata = multimodal_processing(adatas = adatas, 
                                      names = names, 
                                      tfidf = tfidf)

    del adatas
    gc.collect()

    # Initializing for capturing model outputs
    results = {}

    # Capturing cell labels before overwriting
    cell_labels = np.array(adata.obs['labels'])

    for label in uniq_labels:
        print(f"Comparing {label} to other types", flush = True)
        cur_labels = cell_labels.copy()
        cur_labels[cell_labels != label] = 'other'

        # Replacing cell labels for current cell type vs rest
        adata.obs['labels'] = cur_labels

        # Running scMKL
        results[label] = run(adata, alpha_list, return_probs = True)

    # Getting final predictions
    alpha = np.min(alpha_list)
    prob_table, pred_class, low_conf = _prob_table(results, alpha)

    results['Probability_table'] = prob_table
    results['Predicted_class'] = pred_class
    results['Truth_labels'] = cell_labels[adata.uns['test_indices']]
    results['Low_confidence'] = low_conf

    return results
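The final readout assembles per-class probabilities into a table and assigns each test cell the class with the highest probability, with `'Low_confidence'` flagging ambiguous cells. An illustrative sketch of that step (not the internal `_prob_table` code):

import numpy as np
import pandas as pd

# Hypothetical per-class probabilities for three test cells
prob_table = pd.DataFrame({'B cells':   [0.91, 0.12, 0.40],
                           'Monocytes': [0.05, 0.88, 0.45]})

pred_class = prob_table.columns[np.argmax(prob_table.to_numpy(), axis = 1)]
print(list(pred_class))   # ['B cells', 'Monocytes', 'Monocytes']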

def optimize_alpha(adata, group_size = None, tfidf = False, 
                   alpha_array = np.round(np.linspace(1.9, 0.1, 10), 2), 
                   k = 4, metric = 'AUROC'):
    '''
    Iteratively train a grouplasso model and update alpha to find the 
    parameter yielding the best performing sparsity. This function 
    currently only works for binary experiments.

    Parameters
    ----------
    **adata** : *AnnData* | *list[AnnData]*
        > `AnnData`(s) with `'Z_train'` and `'Z_test'` in 
        `adata.uns.keys()`.

    **group_size** : *None* | *int*
        > Argument describing how the features are grouped. If *None*, 
        `2 * adata.uns['D']` will be used. 
        For more information see 
        [celer documentation](https://mathurinm.github.io/celer/generated/celer.GroupLasso.html).

    **tfidf** : *bool* 
        > If `True`, TFIDF normalization will be run at each fold.

    **alpha_array** : *np.ndarray*
        > Array of all alpha values to be tested.

    **k** : *int*
        > Number of folds to perform cross validation over.

    **metric** : *str*
        > Which metric to use to optimize alpha. Options
        are `'AUROC'`, `'Accuracy'`, `'F1-Score'`, `'Precision'`, and 
        `'Recall'`.

    Returns
    -------
    **alpha_star** : *float*
        > The best performing alpha value from cross validation on 
        training data.

    Examples
    --------
    >>> alpha_star = scmkl.optimize_alpha(adata)
    >>> alpha_star
    0.1
    '''
    assert isinstance(k, int) and k > 0, ("Must be a positive integer "
                                          "number of folds")

    import warnings 
    warnings.filterwarnings('ignore')

    # The multimodal case is handled separately; group_size is derived from 
    # the first object so that list input does not fail below
    if isinstance(adata, list):
        if group_size is None:
            group_size = adata[0].uns['D'] * 2
        alpha_star = _multimodal_optimize_alpha(adatas = adata, 
                                                group_size = group_size, 
                                                tfidf = tfidf, 
                                                alpha_array = alpha_array, 
                                                metric = metric)
        return alpha_star

    if group_size is None:
        group_size = adata.uns['D'] * 2

    y = adata.obs['labels'].iloc[adata.uns['train_indices']].to_numpy()

    # Splits the labels evenly between folds
    positive_indices = np.where(y == np.unique(y)[0])[0]
    negative_indices = np.setdiff1d(np.arange(len(y)), positive_indices)

    positive_annotations = np.arange(len(positive_indices)) % k
    negative_annotations = np.arange(len(negative_indices)) % k

    metric_array = np.zeros((len(alpha_array), k))

    gc.collect()

    for fold in np.arange(k):

        cv_adata = adata[adata.uns['train_indices'], :]

        # Create CV train/test indices
        fold_train = np.concatenate(
            (positive_indices[np.where(positive_annotations != fold)[0]], 
             negative_indices[np.where(negative_annotations != fold)[0]]))
        fold_test = np.concatenate(
            (positive_indices[np.where(positive_annotations == fold)[0]], 
             negative_indices[np.where(negative_annotations == fold)[0]]))

        cv_adata.uns['train_indices'] = fold_train
        cv_adata.uns['test_indices'] = fold_test

        if tfidf:
            cv_adata = tfidf_normalize(cv_adata, binarize = True)

        cv_adata = estimate_sigma(cv_adata, n_features = 200)
        cv_adata = calculate_z(cv_adata, n_features = 5000)

        gc.collect()

        for i, alpha in enumerate(alpha_array):

            cv_adata = train_model(cv_adata, group_size, alpha = alpha)
            _, metrics = predict(cv_adata, metrics = [metric])
            metric_array[i, fold] = metrics[metric]

            gc.collect()

        del cv_adata
        gc.collect()

    # Take the metric's mean across the k folds to find the alpha yielding 
    # the highest mean performance
    alpha_star = alpha_array[np.argmax(np.mean(metric_array, axis = 1))]
    gc.collect()

    return alpha_star
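The stratified folds above come from a simple modulo assignment within each class, which keeps every fold's class balance close to that of the full training set. For example:

import numpy as np

pos_idx = np.arange(7)        # indices of one class in the training set
k = 4
fold_assignment = np.arange(len(pos_idx)) % k
print(fold_assignment)                  # [0 1 2 3 0 1 2]
print(pos_idx[fold_assignment != 0])    # CV-train indices for fold 0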

def optimize_sparsity( adata, group_size, starting_alpha=1.9, increment=0.2, target=1, n_iter=10):
def optimize_sparsity(adata, group_size, starting_alpha = 1.9, 
                      increment = 0.2, target = 1, n_iter = 10):
    '''
    Iteratively train a grouplasso model and update alpha to find the 
    parameter yielding the desired sparsity.
    
    Parameters
    ----------
    **adata** : *AnnData*
        > `AnnData` with `'Z_train'` and `'Z_test'` in 
        `adata.uns.keys()`.

    **group_size** : *int* 
        > Argument describing how the features are grouped. Should be
        `2 * D`. For more information see celer documentation.

    **starting_alpha** : *float*
        > The alpha value to start the search at.
    
    **increment** : *float* 
        > Amount to adjust alpha by between iterations.
    
    **target** : *int*
        > The desired number of groups selected by the model.

    **n_iter** : *int*
        > The maximum number of iterations to run.
            
    Returns
    -------
    **sparsity_dict** : *dict*
        > Tested alpha values as keys and the number of selected 
        pathways as the values.
        
    **alpha** : *float*
        > The alpha value yielding the number of selected groups 
        closest to the target.

    Examples
    --------
    >>> sparsity_dict, alpha = scmkl.optimize_sparsity(adata, (2 * D), 
    ...                                                target = 1)
    >>>
    >>> alpha
    0.01

    See Also
    --------
    celer.GroupLasso : 
    https://mathurinm.github.io/celer/generated/celer.GroupLasso.html
    '''
    assert increment > 0 and increment < starting_alpha, ("Choose a positive "
                                                          "increment less "
                                                          "than alpha")
    assert target > 0 and isinstance(target, int), ("Choose an integer "
                                                    "target number of groups "
                                                    "that is greater than 0")
    assert n_iter > 0 and isinstance(n_iter, int), ("Choose an integer "
                                                    "number of iterations "
                                                    "that is greater than 0")

    sparsity_dict = {}
    alpha = starting_alpha

    for _ in np.arange(n_iter):
        adata = train_model(adata, group_size, alpha)
        num_selected = len(find_selected_groups(adata))

        sparsity_dict[np.round(alpha, 4)] = num_selected

        if num_selected < target:
            # Decreasing alpha will increase the number of selected pathways
            if alpha - increment in sparsity_dict.keys():
                # Make the increment smaller so the search cannot bounce 
                # back and forth between the same alpha values
                increment /= 2
            # Ensures that alpha will never be negative
            alpha = np.max([alpha - increment, 1e-3])

        elif num_selected > target:
            if alpha + increment in sparsity_dict.keys():
                increment /= 2

            alpha += increment
        elif num_selected == target:
            break

    # Find the alpha that minimizes the difference between target and 
    # observed number of selected groups
    spar_idx = np.argmin([np.abs(selected - target) 
                          for selected in sparsity_dict.values()])
    optimal_alpha = list(sparsity_dict.keys())[spar_idx]
    return sparsity_dict, optimal_alpha
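
The returned `sparsity_dict` makes the whole search auditable after the fact. Below is a minimal end-to-end sketch, assuming `adata` was built with `scmkl.create_adata()` and that `adata.uns['D']` stores the number of random features, as `run()` below assumes; the loop variable names are illustrative:

>>> adata = scmkl.estimate_sigma(adata)
>>> adata = scmkl.calculate_z(adata)
>>> D = adata.uns['D']
>>>
>>> # Search for the alpha selecting exactly three groups
>>> sparsity_dict, alpha = scmkl.optimize_sparsity(adata, 
...                                                group_size = 2 * D, 
...                                                target = 3)
>>>
>>> # Audit every alpha tried and how many groups it selected
>>> for a, n_selected in sparsity_dict.items():
...     print(f'alpha = {a}: {n_selected} groups')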

def run(adata: anndata._core.anndata.AnnData, alpha_list: numpy.ndarray, metrics: list | None = None, return_probs=False) -> dict:
def run(adata : ad.AnnData, alpha_list : np.ndarray, 
        metrics : list | None = None, return_probs = False) -> dict:
    '''
    Wrapper function for training and testing with multiple alpha 
    values. Returns metrics, predictions, group weights, and resource 
    usage.

    Parameters
    ----------
    **adata** : *AnnData* 
        > A processed *AnnData* with `'Z_train'`, `'Z_test'`, and 
        `'group_dict'` keys in `adata.uns`.
    
    **alpha_list** : *np.ndarray* 
        > `alpha` values to create models using. Alpha refers to the 
        penalty parameter in Group Lasso. Larger alphas force group 
        weights to shrink towards 0 while smaller alphas apply a lesser 
        penalty to kernel weights.

    **metrics** : *list[str]*
        > Which metrics should be calculated on predictions. Options 
        are ['AUROC', 'F1-Score', 'Accuracy', 'Precision', 'Recall']. 
        When set to `None`, all metrics are calculated.

    **return_probs** : *bool*
        > If `True`, predicted class probabilities for the test set 
        are captured under `'Probabilities'` in the results.
    
    Returns
    -------
    **results** : *dict*
    > With keys and values: 

    > `'Metrics'` : a nested dictionary as `[alpha][metric]` = value.
    
    > `'Selected_groups'` : a dictionary as `[alpha]` = array of 
        groups with nonzero weights.

    > `'Norms'` : a dictionary as `[alpha]` = array of kernel weights
        for each group, order respective to 'Group_names'.

    > `'Predictions'` : a dictionary as `[alpha]` = predicted class
        respective to 'Observed' for that `alpha`.

    > `'Observed'` : an array of ground truth cell labels from the
        test set.

    > `'Test_indices'` : indices of samples respective to adata 
        used in the test set.

    > `'Group_names'` : an array of group names respective to each
        array in 'Norms'.

    > `'Models'` : a dictionary where `[alpha]` = Celer Group Lasso
        object for that `alpha`.

    > `'Train_time'` : a dictionary as `[alpha]` = seconds spent 
        training the model for that `alpha`.

    > `'RAM_usage'` : memory usage after training models for each 
        `alpha`.

    > `'Probabilities'` : a dictionary as `[alpha]` = class 
        probabilities for test predictions; populated when 
        `return_probs = True`.

    Examples
    --------
    >>> results = scmkl.run(adata = adata, 
    ...                     alpha_list = np.array([0.05, 0.1, 0.5]))
    >>> results.keys()
    dict_keys(['Metrics', 'Selected_groups', 'Norms', 'Predictions', 
    'Observed', 'Test_indices', 'Group_names', 'Models', 'Train_time', 
    'RAM_usage', 'Probabilities'])
    >>>
    >>> # List of alpha values
    >>> results['Metrics'].keys()
    dict_keys([0.05, 0.1, 0.5])
    >>>
    >>> results['Metrics'][0.05]
    {'AUROC': 0.9859,
    'Accuracy': 0.945,
    'F1-Score': 0.9452736318407959,
    'Precision': 0.9405940594059405,
    'Recall': 0.95}
    '''
    if metrics is None:
        metrics = ['AUROC', 'F1-Score', 'Accuracy', 'Precision', 'Recall']

    # Initializing variables to capture metrics
    group_names = list(adata.uns['group_dict'].keys())
    preds = {}
    group_norms = {}
    mets_dict = {}
    selected_groups = {}
    train_time = {}
    models = {}
    probs = {}

    D = adata.uns['D']

    # Generating models for each alpha and outputs
    for alpha in alpha_list:

        print(f'  Evaluating model. Alpha: {alpha}', flush = True)

        train_start = time.time()

        adata = train_model(adata, group_size = 2 * D, alpha = alpha)

        if return_probs:
            alpha_res = predict(adata, 
                                metrics = metrics,
                                return_probs = return_probs)
            preds[alpha], mets_dict[alpha], probs[alpha] = alpha_res

        else:
            alpha_res = predict(adata, 
                                metrics = metrics,
                                return_probs = return_probs)
            preds[alpha], mets_dict[alpha] = alpha_res

        selected_groups[alpha] = find_selected_groups(adata)

        # Each group occupies a consecutive block of 2 * D kernel weights
        kernel_weights = adata.uns['model'].coef_
        group_norms[alpha] = [
            np.linalg.norm(kernel_weights[i * 2 * D : (i + 1) * 2 * D])
            for i in np.arange(len(group_names))
            ]

        models[alpha] = adata.uns['model']

        train_end = time.time()
        train_time[alpha] = train_end - train_start

    # Combining results into one object
    results = {}
    results['Metrics'] = mets_dict
    results['Selected_groups'] = selected_groups
    results['Norms'] = group_norms
    results['Predictions'] = preds
    results['Observed'] = adata.obs['labels'].iloc[adata.uns['test_indices']]
    results['Test_indices'] = adata.uns['test_indices']
    results['Group_names'] = group_names
    results['Models'] = models
    results['Train_time'] = train_time
    results['RAM_usage'] = f'{tracemalloc.get_traced_memory()[1] / 1e9} GB'
    results['Probabilities'] = probs

    return results
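
Every per-model entry in `results` is keyed by `alpha`, so comparing models across the sweep is a plain dictionary traversal. A short sketch of one way to tabulate the output, assuming `pandas` is available and using only the keys documented above; `metrics_df` and `best_alpha` are illustrative names:

>>> import pandas as pd
>>>
>>> results = scmkl.run(adata, alpha_list = np.array([0.05, 0.1, 0.5]))
>>>
>>> # One row per alpha, one column per metric
>>> metrics_df = pd.DataFrame(results['Metrics']).T
>>>
>>> # Pair each group with its kernel weight norm for the best alpha
>>> best_alpha = metrics_df['AUROC'].idxmax()
>>> norms = dict(zip(results['Group_names'], 
...                  results['Norms'][best_alpha]))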

def tfidf_normalize(adata, binarize=False):
def tfidf_normalize(adata, binarize = False):
    '''
    Function to TFIDF normalize the data in an adata object. Every row 
    of `adata.X` must contain at least one nonzero value; an assertion 
    error is raised otherwise.

    Parameters
    ----------
    **adata** : *AnnData* 
        > `adata.X` to be normalized. If `'train_indices'` and 
        `'test_indices'` in `'adata.uns.keys()'`, normalization will be
        done separately for the training and testing data. Otherwise, 
        it will calculate it on the entire dataset.

    **binarize** : *bool* 
        > If `True`, all values in `adata.X` greater than 0 will become 
        1.

    Returns
    -------
    **adata** : *AnnData* 
        > adata with `adata.X` TFIDF normalized. The train and test 
        data are normalized separately, stacked, and then restored to 
        their original row order, so `'train_indices'` and 
        `'test_indices'` remain valid.

    Examples
    --------
    >>> adata = scmkl.create_adata(X = data_mat, 
    ...                            feature_names = gene_names, 
    ...                            group_dict = group_dict)
    >>> 
    >>> adata = scmkl.tfidf_normalize(adata)
    '''
    X = adata.X.copy()
    row_sums = np.sum(X, axis = 1)
    assert np.all(row_sums > 0), "TFIDF requires all row sums be positive"

    if binarize:
        X[X > 0] = 1

    if 'train_indices' in adata.uns_keys():

        train_indices = adata.uns['train_indices'].copy()
        test_indices = adata.uns['test_indices'].copy()

        # Calculate the train TFIDF matrix on just the training data so it is 
        # not biased by testing data
        tfidf_train = _tfidf(X[train_indices,:], mode = 'normalize')

        # Calculate the test TFIDF by calculating it on the train and test 
        # data and indexing the test data
        tfidf_test = _tfidf(X, mode = 'normalize')[test_indices,:]

        # Impossible to add rows back to original location so we need to 
        # stack the matrices to maintain train/test
        if scipy.sparse.issparse(X):
            tfidf_norm = scipy.sparse.vstack((tfidf_train, tfidf_test))
        else:
            tfidf_norm = np.vstack((tfidf_train, tfidf_test))

        # Explicitly reassign the copied indices; without this, the values 
        # can be written back to adata as 0s
        adata.uns['train_indices'] = train_indices
        adata.uns['test_indices'] = test_indices

        combined_indices = np.concatenate((train_indices, test_indices))

        # AnnData indexes by "rownames" not position, so the rows are 
        # reordered back to their original positions using the names
        adata_index = adata.obs_names[combined_indices].astype(int)
        tfidf_norm = tfidf_norm[np.argsort(adata_index),:]

    else:

        tfidf_norm = _tfidf(X, mode = 'normalize')

    adata.X = tfidf_norm.copy()

    return adata
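
`_tfidf()` is internal to scMKL, so its exact scaling is not reproduced here, but the transform it names is standard: scale each count by its cell's total (term frequency) and upweight features observed in few cells (inverse document frequency). A rough standalone sketch for intuition only, not scMKL's implementation; `tfidf_sketch` is a hypothetical helper:

>>> import numpy as np
>>>
>>> def tfidf_sketch(X):
...     # Term frequency: each count relative to its cell's total counts
...     tf = X / X.sum(axis = 1, keepdims = True)
...     # Inverse document frequency: features seen in fewer cells 
...     # get larger weights
...     idf = np.log(1 + X.shape[0] / (1 + (X > 0).sum(axis = 0)))
...     return tf * idf
>>>
>>> X = np.array([[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]])
>>> tfidf_norm = tfidf_sketch(X)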

def train_model(adata, group_size=1, alpha=0.9):
def train_model(adata, group_size = 1, alpha = 0.9):
    '''
    Fit a grouplasso model to the provided data.

    Parameters
    ----------
    **adata** : *AnnData* 
        > Has `'Z_train'` and `'Z_test'` keys in `adata.uns`.

    **group_size** : *int* 
        > Argument describing how the features are grouped. Should be
        `2 * D`. For more information see celer documentation. 
            
    **alpha** : *float*
        > Group Lasso regularization coefficient. alpha is a floating 
        point value controlling model solution sparsity. Must be a 
        positive float. The smaller the value, the more feature groups 
        will be selected in the trained model.
    
    Returns
    -------
    **adata** : *AnnData* 
        > Trained model accessible with `adata.uns['model']`.

    Examples
    --------
    >>> adata = scmkl.estimate_sigma(adata)
    >>> adata = scmkl.calculate_z(adata)
    >>> d = scmkl.calculate_d(adata.shape[0])
    >>> group_size = 2 * d
    >>> adata = scmkl.train_model(adata, group_size)
    >>>
    >>> 'model' in adata.uns.keys()
    True

    See Also
    --------
    celer documentation :
    https://mathurinm.github.io/celer/generated/celer.GroupLasso.html
    '''
    assert alpha > 0, 'Alpha must be positive'

    y_train = adata.obs['labels'].iloc[adata.uns['train_indices']]
    X_train = adata.uns['Z_train']

    cell_labels = np.unique(y_train)

    # Group Lasso is a regression algorithm, so the binary labels are cast 
    # to 'continuous' -1/1 targets for classification
    train_labels = np.ones(y_train.shape)
    train_labels[y_train == cell_labels[1]] = -1

    # Alphamax is a calculation to regularize the effect of alpha across 
    # different data sets
    alphamax = np.max(np.abs(X_train.T.dot(train_labels)))
    alphamax /= X_train.shape[0] 
    alphamax *= alpha

    # Instantiate celer Group Lasso Regression Model Object
    model = celer.GroupLasso(groups = group_size, alpha = alphamax)

    # Fit model using training data
    model.fit(X_train, train_labels.ravel())

    adata.uns['model'] = model
    return adata
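
Because the fitted model lands in `adata.uns['model']`, the learned sparsity pattern can be read off directly after training. A short sketch, assuming the same consecutive `2 * D`-weight group layout that `run()` uses; `norms` and `selected` are illustrative names:

>>> D = adata.uns['D']
>>> adata = scmkl.train_model(adata, group_size = 2 * D)
>>> model = adata.uns['model']
>>> group_names = list(adata.uns['group_dict'].keys())
>>>
>>> # L2 norm of each group's block of 2 * D kernel weights; a zero 
>>> # norm means the group was dropped by the penalty
>>> norms = [np.linalg.norm(model.coef_[i * 2 * D : (i + 1) * 2 * D]) 
...          for i in range(len(group_names))]
>>> selected = [g for g, n in zip(group_names, norms) if n > 0]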
