scmkl



Single-cell analysis using Multiple Kernel Learning (scMKL) is a binary classification algorithm that uses prior information to group features, enhancing classification performance and aiding interpretation of the features that distinguish classes in multi-omic data sets.

Installation

Conda install

Conda is the recommended method to install scMKL:

conda create -n scMKL python=3.12 -c conda-forge ivango17::scmkl

Pip install

First, create a virtual environment with python>=3.11.1,<3.13.

Then, install scMKL with:

# activate your new env with python>=3.11.1 and <3.13
pip install scmkl

If wheels do not build correctly, ensure gcc and g++ are installed and up to date. On Debian/Ubuntu they can be installed with sudo apt install gcc g++.
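To confirm the environment resolves, a quick import check:

python -c "import scmkl"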

Requirements

scMKL takes advantage of AnnData objects and can be implemented with just four pieces of data:

1) scRNA and/or scATAC matrices (can be a scipy.sparse matrix)

2) An array of cell labels

3) An array of feature names (e.g., gene symbols for RNA or peaks for ATAC)

4) A grouping dictionary mapping group names to feature lists, e.g. {'group_1' : [feature_5, feature_16], 'group_2' : [feature_1, feature_4, feature_9]} (see the sketch below)
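As a minimal sketch (all arrays below are illustrative placeholders, not real data), these four pieces plug directly into scmkl.create_adata():

```python
import numpy as np
import scipy.sparse
import scmkl

# 1) cells x features count matrix (sparse recommended)
X = scipy.sparse.random(100, 3, density = 0.5, format = 'csc')

# 2) one label per cell (binary by default)
cell_labels = np.array(['tumor', 'normal'] * 50)

# 3) feature names respective to the columns of X
feature_names = np.array(['GENE_A', 'GENE_B', 'GENE_C'])

# 4) grouping dictionary mapping group names to feature arrays
group_dict = {'group_1': np.array(['GENE_A', 'GENE_B']),
              'group_2': np.array(['GENE_B', 'GENE_C'])}

adata = scmkl.create_adata(X = X, feature_names = feature_names,
                           cell_labels = cell_labels,
                           group_dict = group_dict)
```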

For more information on formatting/creating the grouping dictionaries, see our example for creating an RNA grouping or ATAC grouping.

For implementing scMKL, see our examples for your use case in examples.

Repo: https://github.com/ohsu-cedar-comp-hub/scMKL

PyPI: https://pypi.org/project/scmkl/

Anaconda: https://anaconda.org/ivango17/scmkl

API: https://ohsu-cedar-comp-hub.github.io/scMKL/

Citation

If you use scMKL in your research, please cite using:

To be determined

Our Shiny for Python application for viewing data produced from this work can be found here: scMKL_analysis


scMKL Documentation

"""
.. include:: ../README.md

----------------------------

## **scMKL Documentation**
"""


__all__ = ['calculate_z', 
           'create_adata', 
           'dataframes',
           'estimate_sigma', 
           'get_atac_groupings',
           'multimodal_processing', 
           'one_v_rest', 
           'optimize_alpha', 
           'optimize_sparsity',
           'plotting',
           'run',
           'test', 
           'test_import',
           'tfidf_normalize',
           'train_model'
           ]

from scmkl._checks import *
from scmkl.calculate_z import *
from scmkl.create_adata import *
from scmkl.dataframes import *
from scmkl.estimate_sigma import *
from scmkl.get_atac_groupings import *
from scmkl.multimodal_processing import *
from scmkl.one_v_rest import *
from scmkl.optimize_alpha import *
from scmkl.optimize_sparsity import *
from scmkl.plotting import *
from scmkl.run import *
from scmkl.test import *
from scmkl.tfidf_normalize import *
from scmkl.train_model import *
def calculate_z(adata, n_features=5000) -> anndata._core.anndata.AnnData:
def calculate_z(adata, n_features = 5000) -> ad.AnnData:
    '''
    Function to calculate Z matrix.

    Parameters
    ----------
    **adata** : *AnnData*
        > Created by `create_adata()` with `'sigma'`, 
        `'train_indices'`, and `'test_indices'` in `adata.uns.keys()`. 
        The `'sigma'` key can be added by running `estimate_sigma()` 
        on adata. 

    **n_features** : *int* 
        > Number of random features to use when calculating Z; used 
        for scalability.

    Returns
    -------
    **adata** : *AnnData*
        > adata with Z matrices accessible with `adata.uns['Z_train']` 
        and `adata.uns['Z_test']`.

    Examples
    --------
    >>> adata = scmkl.estimate_sigma(adata)
    >>> adata = scmkl.calculate_z(adata)
    >>> adata.uns.keys()
    dict_keys(['Z_train', 'Z_test', 'sigma', 'train_indices', 
    'test_indices'])
    '''
    assert np.all(adata.uns['sigma'] > 0), 'Sigma must be positive'

    # Number of groupings taken from group_dict
    n_pathway = len(adata.uns['group_dict'].keys())
    D = adata.uns['D']

    # Capturing training and testing indices
    train_idx = np.array(adata.uns['train_indices'], dtype = np.int_)
    test_idx = np.array(adata.uns['test_indices'], dtype = np.int_)

    # Create arrays to store concatenated group Z
    # Each group of features will have a corresponding entry in each array
    n_cols = 2 * adata.uns['D'] * n_pathway

    Z_train = np.zeros((train_idx.shape[0], n_cols))
    Z_test = np.zeros((test_idx.shape[0], n_cols))

    # Loop over each of the groups and create Z for each
    for m, group_features in enumerate(adata.uns['group_dict'].values()):

        # Extract features from mth group
        num_group_features = len(group_features)

        # Sample up to n_features features - important for scalability if 
        # using large groupings
        # Will use all features if the grouping contains fewer than n_features
        number_features = np.min([n_features, num_group_features])
        group_array = np.array(list(group_features))
        group_features = adata.uns['seed_obj'].choice(group_array, 
                                                      number_features, 
                                                      replace = False) 

        # Create data arrays containing only features within this group
        X_train = adata[adata.uns['train_indices'],:][:, group_features].X
        X_test = adata[adata.uns['test_indices'],:][:, group_features].X

        # Data filtering and transformation according to given data_type
        # Will remove low-variance (< 1e-5) features regardless of data_type
        # If given data_type is 'counts', will log scale and z-score the data
        X_train, X_test = _process_data(X_train = X_train, X_test = X_test, 
                                        scale_data = adata.uns['scale_data'], 
                                        return_dense = True)

        # Extract pre-calculated sigma used for approximating kernel
        adjusted_sigma = adata.uns['sigma'][m]

        # Calculate approximate kernel according to chosen kernel function
        # Distribution data comes from the Fourier transform of the kernel
        if adata.uns['kernel_type'].lower() == 'gaussian':

            gamma = 1/(2*adjusted_sigma**2)
            sigma_p = 0.5*np.sqrt(2*gamma)

            W = adata.uns['seed_obj'].normal(0, sigma_p, X_train.shape[1] * D)
            W = W.reshape((X_train.shape[1]), D)

        elif adata.uns['kernel_type'].lower() == 'laplacian':

            gamma = 1 / (2 * adjusted_sigma)

            W = adata.uns['seed_obj'].standard_cauchy(X_train.shape[1] * D)
            W = gamma * W.reshape((X_train.shape[1], D))

        elif adata.uns['kernel_type'].lower() == 'cauchy':

            gamma = 1 / (2 * adjusted_sigma ** 2)
            b = 0.5 * np.sqrt(gamma)

            W = adata.uns['seed_obj'].laplace(0, b, X_train.shape[1] * D)
            W = W.reshape((X_train.shape[1], D))

        train_projection = np.matmul(X_train, W)
        test_projection = np.matmul(X_test, W)

        # Store group Z in whole-Z object. 
        # Preserves order to be able to extract meaningful groups
        x_idx = np.arange(m * 2 * D, (m + 1) * 2 * D)
        sq_i_d = np.sqrt(1/D)

        Z_train[0:, x_idx] = sq_i_d * np.hstack((np.cos(train_projection), 
                                                 np.sin(train_projection)))
        Z_test[0:, x_idx] = sq_i_d * np.hstack((np.cos(test_projection), 
                                                np.sin(test_projection)))

    adata.uns['Z_train'] = Z_train
    adata.uns['Z_test'] = Z_test

    return adata

Function to calculate Z matrix.

Parameters

adata : AnnData

Created by create_adata() with 'sigma', 'train_indices', and 'test_indices' in adata.uns.keys(). The 'sigma' key can be added by running estimate_sigma() on adata.

n_features : int

Number of random features to use when calculating Z; used for scalability.

Returns

adata : AnnData

adata with Z matrices accessible with adata.uns['Z_train'] and adata.uns['Z_test'].

Examples

>>> adata = scmkl.estimate_sigma(adata)
>>> adata = scmkl.calculate_z(adata)
>>> adata.uns.keys()
dict_keys(['Z_train', 'Z_test', 'sigma', 'train_indices', 
'test_indices'])
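For reference, this is the random Fourier feature construction: for each group, the projection matrix W is drawn from the Fourier transform of the chosen kernel (normal draws for a Gaussian kernel, standard Cauchy for Laplacian, Laplace for Cauchy), and each cell x is embedded as

$$Z(x) = \sqrt{\tfrac{1}{D}}\,\big[\cos(xW)\;\;\sin(xW)\big], \qquad k(x, x') \approx Z(x)\,Z(x')^\top,$$

so downstream models operate on an explicit 2D-dimensional feature map per group instead of a full kernel matrix.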
def create_adata( X, feature_names: numpy.ndarray, cell_labels: numpy.ndarray, group_dict: dict, scale_data: bool = True, split_data: numpy.ndarray | None = None, D: int | None = None, remove_features=True, train_ratio=0.8, distance_metric='euclidean', kernel_type='Gaussian', random_state: int = 1, allow_multiclass: bool = False, class_threshold: str | int = 'median'):
def create_adata(X, feature_names: np.ndarray, cell_labels: np.ndarray, 
                 group_dict: dict, scale_data: bool = True, 
                 split_data : np.ndarray | None = None, D : int | None = None, 
                 remove_features = True, train_ratio = 0.8,
                 distance_metric = 'euclidean', kernel_type = 'Gaussian', 
                 random_state : int = 1, allow_multiclass : bool = False, 
                 class_threshold : str | int = 'median'):
    '''
    Function to create an AnnData object to carry all relevant 
    information going forward.

    Parameters
    ----------
    **X** : *scipy.sparse.csc_matrix* | *np.ndarray* | 
            *pd.DataFrame*
        > A data matrix of cells by features (sparse array 
        recommended for large datasets).

    **feature_names** : *np.ndarray*
        > An array of feature names corresponding with the features 
        in X.

    **cell_labels** : *np.ndarray*
        > A numpy array of cell phenotypes corresponding with 
        the cells in X.

    **group_dict** : *dict* 
        > Dictionary containing feature grouping information.
            - Example: {geneset: np.array([gene_1, gene_2, ..., 
                        gene_n])}

    **scale_data** : *bool*  
        > If `True`, data matrix is log transformed and standard 
        scaled. 
        
    **split_data** : *None* | *np.ndarray*
        > If *None*, data will be split stratified by cell labels. 
        Else, is an array of precalculated train/test split 
        corresponding to samples. Can include labels for the entire
        dataset to benchmark performance or for only the training
        data to classify unknown cell types.
            - Example: np.array(['train', 'test', ..., 'train'])

    **D** : *int* 
        > Number of Random Fourier Features used to calculate Z. 
        Should be a positive integer. Higher values of D will 
        increase classification accuracy at the cost of computation 
        time. If set to `None`, will be calculated given the number 
        of samples. 
    
    **remove_features** : *bool* 
        > If `True`, will remove features from X and feature_names
        not in group_dict and remove features from groupings not in
        feature_names.

    **train_ratio** : *float*
        > Ratio of the number of training samples to the entire data 
        set. Note: if a threshold is applied, the ratio of training 
        samples may decrease depending on class balance and the 
        `class_threshold` parameter if `allow_multiclass = True`.

    **distance_metric** : *str* 
        > The pairwise distance metric used to estimate sigma. Must
        be one of the options used in scipy.spatial.distance.cdist.

    **kernel_type** : *str*
        > The approximated kernel function used to calculate Zs.
        Must be one of `'Gaussian'`, `'Laplacian'`, or `'Cauchy'`.

    **random_state** : *int*
        > Integer random_state used to set the seed for 
        reproducibility.

    **allow_multiclass** : *bool*
        > If `False`, will ensure that cell labels are binary.

    **class_threshold** : *str* | *int*
        > Maximum number of training samples per cell class. If 
        `'median'`, the median number of cells per cell class is 
        used as the threshold.

    Returns
    -------
    **adata** : *AnnData*
    > *AnnData* with the following attributes and keys:

    > `adata.X` : the data matrix.
    
    > `adata.var_names` : the feature names corresponding to
    `adata.X`.

    > `adata.obs['labels']` : cell classes/phenotypes from 
    `cell_labels`.

    > `adata.uns['train_indices']` : Indices for training data. 

    > `adata.uns['test_indices']` : Indices for testing data.

    > `adata.uns['group_dict']` : Grouping information.

    > `adata.uns['seed_obj']` : Seed object with seed equal to
    100 * `random_state`.

    > `adata.uns['D']` : Number of dimensions used by scMKL.

    > `adata.uns['scale_data']` : *bool* for whether or not data is log
    transformed and scaled.

    > `adata.uns['distance_metric']` : Distance metric as given.
    
    > `adata.uns['kernel_type']` : Kernel function as given.

    Examples
    --------
    >>> data_mat = scipy.sparse.load_npz('MCF7_RNA_matrix.npz')
    >>> gene_names = np.load('MCF7_gene_names.pkl', allow_pickle = True)
    >>> group_dict = np.load('hallmark_genesets.pkl', 
    >>>                      allow_pickle = True)
    >>> 
    >>> adata = scmkl.create_adata(X = data_mat, 
    ...                            feature_names = gene_names, 
    ...                            group_dict = group_dict)
    >>> adata
    AnnData object with n_obs × n_vars = 1000 × 4341
    obs: 'labels'
    uns: 'group_dict', 'seed_obj', 'scale_data', 'D', 'kernel_type', 
    'distance_metric', 'train_indices', 'test_indices'
    '''

    assert X.shape[1] == len(feature_names), ("Different number of features "
                                              "in X than feature names")
    
    if not allow_multiclass:
        assert len(np.unique(cell_labels)) == 2, ("cell_labels must contain "
                                                  "2 classes")
    if D is not None:    
        assert isinstance(D, int) and D > 0, 'D must be a positive integer'

    kernel_options = ['gaussian', 'laplacian', 'cauchy']
    assert kernel_type.lower() in kernel_options, ("Given kernel type not "
                                                   "implemented. Gaussian, "
                                                   "Laplacian, and Cauchy "
                                                   "are the acceptable "
                                                   "types.")

    X, feature_names, group_dict = _filter_features(X, 
                                                    feature_names, 
                                                    group_dict, 
                                                    remove_features)

    # Create adata object and add column names
    adata = ad.AnnData(X)
    adata.var_names = feature_names

    # Add metadata to adata object
    adata.uns['group_dict'] = group_dict
    adata.uns['seed_obj'] = np.random.default_rng(100 * random_state)
    adata.uns['scale_data'] = scale_data
    adata.uns['D'] = D if D is not None else calculate_d(adata.shape[0])
    adata.uns['kernel_type'] = kernel_type
    adata.uns['distance_metric'] = distance_metric

    if split_data is None:
        assert X.shape[0] == len(cell_labels), ("Different number of cells "
                                                "than labels")
        adata.obs['labels'] = cell_labels

        if not allow_multiclass:
            split = _binary_split(cell_labels, 
                                  seed_obj = adata.uns['seed_obj'],
                                  train_ratio = train_ratio)
            train_indices, test_indices = split

        else:
            split = _multi_class_split(cell_labels, 
                                       seed_obj = adata.uns['seed_obj'], 
                                       class_threshold = class_threshold,
                                       train_ratio = train_ratio)
            train_indices, test_indices = split

        adata.uns['labeled_test'] = True

    else:
        # Labels must cover either every cell or only the training cells
        x_eq_labs = X.shape[0] == len(cell_labels)
        train_eq_labs = np.sum(split_data == 'train') == len(cell_labels)
        assert x_eq_labs or train_eq_labs, ("Must give labels for all cells "
                                            "or only for training cells")
        
        train_indices = np.where(split_data == 'train')[0]
        test_indices = np.where(split_data == 'test')[0]

        if len(cell_labels) == len(train_indices):

            padded_cell_labels = np.zeros((X.shape[0])).astype('object')
            padded_cell_labels[train_indices] = cell_labels
            padded_cell_labels[test_indices] = 'padded_test_label'

            adata.obs['labels'] = padded_cell_labels
            adata.uns['labeled_test'] = False

        elif len(cell_labels) == len(split_data):
            adata.obs['labels'] = cell_labels
            adata.uns['labeled_test'] = True

    adata.uns['train_indices'] = train_indices
    adata.uns['test_indices'] = test_indices

    if not scale_data:
        print("WARNING: Data will not be log transformed and scaled. "
              "To change this behavior, set scale_data to True")

    return adata

Function to create an AnnData object to carry all relevant information going forward.

Parameters

X : scipy.sparse.csc_matrix | np.ndarray | pd.DataFrame

A data matrix of cells by features (sparse array recommended for large datasets).

feature_names : np.ndarray

array of feature names corresponding with the features in X.

cell_labels : np.ndarray

A numpy array of cell phenotypes corresponding with the cells in X.

group_dict : dict

Dictionary containing feature grouping information. - Example: {geneset: np.array([gene_1, gene_2, ..., gene_n])}

scale_data : bool

If True, data matrix is log transformed and standard scaled.

split_data : None | np.ndarray

If None, data will be split stratified by cell labels. Else, is an array of precalculated train/test split corresponding to samples. Can include labels for entire dataset to benchmark performance or for only training data to classify unknown cell types. - Example: np.array(['train', 'test', ..., 'train'])

D : int

Number of Random Fourier Features used to calculate Z. Should be a positive integer. Higher values of D will increase classification accuracy at the cost of computation time. If set to None, will be calculated given number of samples.

remove_features : bool

If True, will remove features from X and feature_names not in group_dict and remove features from groupings not in feature_names.

train_ratio : float

Ratio of the number of training samples to the entire data set. Note: if a threshold is applied, the ratio of training samples may decrease depending on class balance and the class_threshold parameter if allow_multiclass = True.

distance_metric : str

The pairwise distance metric used to estimate sigma. Must be one of the options used in scipy.spatial.distance.cdist.

kernel_type : str

The approximated kernel function used to calculate Zs. Must be one of 'Gaussian', 'Laplacian', or 'Cauchy'.

random_state : int

Integer random_state used to set the seed for reproducibility.

allow_multiclass : bool

If False, will ensure that cell labels are binary.

class_threshold : str | int

Maximum number of training samples per cell class. If 'median', the median number of cells per cell class is used as the threshold.

Returns

adata : AnnData

AnnData with the following attributes and keys:

adata.X : the data matrix.

adata.var_names : the feature names corresponding to adata.X.

adata.obs['labels'] : cell classes/phenotypes from cell_labels.

adata.uns['train_indices'] : Indices for training data.

adata.uns['test_indices'] : Indices for testing data.

adata.uns['group_dict'] : Grouping information.

adata.uns['seed_obj'] : Seed object with seed equal to 100 * random_state.

adata.uns['D'] : Number of dimensions used by scMKL.

adata.uns['scale_data'] : bool for whether or not data is log transformed and scaled.

adata.uns['distance_metric'] : Distance metric as given.

adata.uns['kernel_type'] : Kernel function as given.

Examples

>>> data_mat = scipy.sparse.load_npz('MCF7_RNA_matrix.npz')
>>> gene_names = np.load('MCF7_gene_names.pkl', allow_pickle = True)
>>> group_dict = np.load('hallmark_genesets.pkl', 
>>>                      allow_pickle = True)
>>> 
>>> adata = scmkl.create_adata(X = data_mat, 
...                            feature_names = gene_names, 
...                            group_dict = group_dict)
>>> adata
AnnData object with n_obs × n_vars = 1000 × 4341
obs: 'labels'
uns: 'group_dict', 'seed_obj', 'scale_data', 'D', 'kernel_type', 
'distance_metric', 'train_indices', 'test_indices'
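A short sketch of the precomputed-split path described above, where labels are supplied only for the training cells (names illustrative):

```python
import numpy as np

# 800 labeled training cells followed by 200 unlabeled cells to classify
split_data = np.array(['train'] * 800 + ['test'] * 200)
train_labels = np.array(['tumor', 'normal'] * 400)  # training cells only

adata = scmkl.create_adata(X = X, feature_names = feature_names,
                           cell_labels = train_labels,
                           group_dict = group_dict,
                           split_data = split_data)
# Test cells get the internal placeholder label and
# adata.uns['labeled_test'] is set to False
```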
def estimate_sigma(adata, n_features=5000):
def estimate_sigma(adata, n_features = 5000):
    '''
    Calculate kernel widths to inform the distribution for projection 
    of Fourier Features. Calculates one sigma per group of features.

    Parameters
    ----------
    **adata** : *AnnData* 
        > Created by `create_adata`.
    
    **n_features** : *int*  
        > Number of random features to include when estimating sigma. 
        Will be scaled for the whole pathway set according to a 
        heuristic. Used for scalability.
    
    Returns
    -------
    **adata** : *AnnData*
        > Key added `adata.uns['sigma']`.

    Examples
    --------
    >>> adata = scmkl.estimate_sigma(adata)
    >>> adata.uns['sigma']
    array([10.4640895 , 10.82011454,  6.16769438,  9.86156855, ...])
    '''
    sigma_list = []

    # Loop over every group in group_dict
    for group_features in adata.uns['group_dict'].values():

        # Select only features in that group and downsample for scalability
        num_group_features = len(group_features)
        group_array = np.array(list(group_features))
        n_feats = min([n_features, num_group_features])
        group_features = adata.uns['seed_obj'].choice(group_array, n_feats, 
                                                      replace = False) 

        # Use only the training data to estimate sigma
        X_train = adata[adata.uns['train_indices'], group_features].X
        X_train = _process_data(X_train = X_train, 
                                scale_data = adata.uns['scale_data'], 
                                return_dense = True)
        
        # Sample cells for scalability
        sample_idx = np.arange(X_train.shape[0])
        n_samples = np.min((2000, X_train.shape[0]))
        distance_indices = adata.uns['seed_obj'].choice(sample_idx, n_samples)

        # Calculate distance matrix with specified metric
        sigma = scipy.spatial.distance.cdist(X_train[distance_indices,:], 
                                             X_train[distance_indices,:], 
                                             adata.uns['distance_metric'])
        sigma = np.mean(sigma)

        # sigma = 0 is numerically unusable in later steps
        # Using such a small sigma will result in a wide distribution, and 
        # typically a non-predictive Z
        if sigma == 0:
            sigma += 1e-5

        if n_features < num_group_features:
            # Heuristic we calculated to account for fewer features used in 
            # the distance calculation
            sigma = sigma * num_group_features / n_features 

        sigma_list.append(sigma)
    
    adata.uns['sigma'] = np.array(sigma_list)
        
    return adata

Calculate kernel widths to inform the distribution for projection of Fourier Features. Calculates one sigma per group of features.

Parameters

adata : AnnData

Created by create_adata.

n_features : int

Number of random features to include when estimating sigma. Will be scaled for the whole pathway set according to a heuristic. Used for scalability.

Returns

adata : AnnData

Key added adata.uns['sigma'].

Examples

>>> adata = scmkl.estimate_sigma(adata)
>>> adata.uns['sigma']
array([10.4640895 , 10.82011454,  6.16769438,  9.86156855, ...])
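The subsampling heuristic near the end of the loop rescales each group's estimate back to its full feature count:

$$\sigma_g = \bar{d}_g \cdot \frac{|G_g|}{n_\text{features}},$$

applied only when the group has more than n_features features, where $\bar{d}_g$ is the mean pairwise distance over at most 2000 sampled training cells and $|G_g|$ is the total number of features in group $g$.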
def get_atac_groupings( gene_anno: pandas.core.frame.DataFrame, gene_sets: dict, feature_names: numpy.ndarray | pandas.core.series.Series | list | set, len_up: int = 5000, len_down: int = 5000) -> dict:
def get_atac_groupings(gene_anno : pd.DataFrame, gene_sets : dict, 
                       feature_names : np.ndarray | pd.Series | list | set,
                       len_up : int = 5000, len_down : int = 5000) -> dict:
    '''
    Creates a peak set where keys are gene set names from `gene_sets` 
    and values are arrays of features pulled from `feature_names`. 
    Features are added to each peak set given overlap between regions 
    in the single-cell data matrix and inferred gene promoter regions 
    in `gene_anno`.

    Parameters
    ----------
    **gene_anno** : *pd.DataFrame*
        > Gene annotations in GTF format as a pd.DataFrame with columns
        ['chr', 'start', 'end', 'strand', 'gene_name'].

    **gene_sets** : *dict*
        > Gene set names as keys and an iterable object of gene names
        as values.

    **feature_names** : *np.ndarray* | *pd.Series* | *list* | *set*
        > Feature names corresponding to a single-cell ATAC data 
        matrix.

    **len_up** : *int*
        > Number of base pairs upstream used when inferring promoter 
        regions from the gene annotations.

    **len_down** : *int*
        > Number of base pairs downstream used when inferring promoter 
        regions from the gene annotations.

    Returns
    -------
    **atac_grouping** : *dict*
        > Keys are the names from `gene_sets` and values
        are a list of regions from `feature_names` that overlap with 
        promoter regions respective to genes in `gene_sets` (i.e., if 
        an ATAC feature in `feature_names` overlaps with a promoter 
        region from a gene in a gene set from `gene_sets`, that region 
        will be added to the new dictionary under the respective gene 
        set name).

    Examples
    --------
    >>> # Reading in a gene set and the peak names from scATAC dataset
    >>> gene_sets = np.load("data/RNA_hallmark_groupings.pkl", 
    ...                     allow_pickle = True)
    >>> peaks = np.load("data/MCF7_ATAC_feature_names.npy", 
    ...                 allow_pickle = True)
    >>> 
    >>> # Reading in GTF file
    >>> gtf_path = "data/hg38_subset_protein_coding.annotation.gtf"
    >>> gtf = pd.read_csv(gtf_path, sep = "\t", header = None, 
    ...                   skip_blank_lines = True, comment = "#")
    >>> 
    >>> # Naming columns
    >>> gtf.columns = ['chr', 'source', 'feature', 'start', 'end', 
    ...                'score', 'strand', 'frame', 'attribute']
    >>>
    >>> # Subsetting to only protein coding genes
    >>> prot_rows = gtf['attribute'].str.contains('protein_coding')
    >>> gtf = gtf[prot_rows]
    >>> gtf = gtf[gtf['feature'] == 'gene']
    >>> 
    >>> # Capturing gene name from attributes column
    >>> gtf['gene_name'] = [re.findall(r'(?<=gene_name ")[A-Za-z0-9]+', 
    ...                                attr)[0] 
    ...                     for attr in gtf['attribute']]
    >>>
    >>> atac_grouping = scmkl.get_atac_groupings(gene_anno = gtf,
    ...                                          gene_sets = gene_sets,
    ...                                          feature_names = peaks)
    >>>
    >>> atac_grouping.keys()
    dict_keys(['HALLMARK_TNFA_SIGNALING_VIA_NFKB', ...])
    '''
    # Getting a unique set of gene names from gene_sets
    all_genes = {gene for group in gene_sets.keys()
                 for gene in gene_sets[group]}
    
    # Filtering out NaN values
    all_genes = [gene for gene in all_genes if not isinstance(gene, float)]

    # Filtering out annotations for genes not present in gene_sets
    gene_anno = gene_anno[np.isin(gene_anno['gene_name'], all_genes)]

    # Adjusting start and end columns to represent promoter regions
    gene_anno = _adjust_regions(gene_anno = gene_anno, 
                                len_up = len_up, len_down = len_down)

    # Creating a dictionary from assay features where [chr] : (start, end)
    feature_dict = _create_feature_dict(feature_names)

    # Creating data structures from gene_anno for comparing regions
    peak_gene_dict, ga_regions = _create_region_dicts(gene_anno)

    # Capturing the separator type used in the assay
    chr_sep = ':' if ':' in feature_names[0] else '-'

    atac_groupings = _compare_regions(feature_dict = feature_dict,
                                      ga_regions = ga_regions,
                                      peak_gene_dict = peak_gene_dict,
                                      gene_sets = gene_sets,
                                      chr_sep = chr_sep)
    
    return atac_groupings

Creates a peak set where keys are gene set names from gene_sets and values are arrays of features pulled from feature_names. Features are added to each peak set given overlap between regions in the single-cell data matrix and inferred gene promoter regions in gene_anno.

Parameters

gene_anno : pd.DataFrame

Gene annotations in GTF format as a pd.DataFrame with columns ['chr', 'start', 'end', 'strand', 'gene_name'].

gene_sets : dict

Gene set names as keys and an iterable object of gene names as values.

feature_names : np.ndarray | pd.Series | list | set

Feature names corresponding to a single-cell ATAC data matrix.

len_up : int

Number of base pairs upstream used when inferring promoter regions from the gene annotations.

len_down : int

Number of base pairs downstream used when inferring promoter regions from the gene annotations.

Returns

atac_grouping : dict

Keys are the names from gene_sets and values are a list of regions from feature_names that overlap with promoter regions respective to genes in gene_sets (i.e., if an ATAC feature in feature_names overlaps with a promoter region from a gene in a gene set from gene_sets, that region will be added to the new dictionary under the respective gene set name).

Examples

>>> # Reading in a gene set and the peak names from scATAC dataset
>>> gene_sets = np.load("data/RNA_hallmark_groupings.pkl", 
...                     allow_pickle = True)
>>> peaks = np.load("data/MCF7_ATAC_feature_names.npy", 
...                 allow_pickle = True)
>>> 
>>> # Reading in GTF file
>>> gtf_path = "data/hg38_subset_protein_coding.annotation.gtf"
>>> gtf = pd.read_csv(gtf_path, sep = "\t", header = None, 
...                   skip_blank_lines = True, comment = "#")
>>> 
>>> # Naming columns
>>> gtf.columns = ['chr', 'source', 'feature', 'start', 'end', 
...                'score', 'strand', 'frame', 'attribute']
>>>
>>> # Subsetting to only protein coding genes
>>> prot_rows = gtf['attribute'].str.contains('protein_coding')
>>> gtf = gtf[prot_rows]
>>> gtf = gtf[gtf['feature'] == 'gene']
>>> 
>>> # Capturing gene name from attributes column
>>> gtf['gene_name'] = [re.findall(r'(?<=gene_name ")[A-Za-z0-9]+', 
...                                attr)[0] 
...                     for attr in gtf['attribute']]
>>>
>>> atac_grouping = scmkl.get_atac_groupings(gene_anno = gtf,
...                                         gene_sets = gene_sets,
...                                         feature_names = peaks)
>>>
>>> atac_grouping.keys()
dict_keys(['HALLMARK_TNFA_SIGNALING_VIA_NFKB', ...])
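Note that the separator between chromosome and start position is detected from the first feature name, so both common peak-name formats are accepted (illustrative):

```python
import numpy as np

peaks = np.array(['chr1:100000-100500', 'chr1:200000-200500'])  # 'chr:start-end'
# or
peaks = np.array(['chr1-100000-100500', 'chr1-200000-200500'])  # 'chr-start-end'
```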
def multimodal_processing(adatas: list, names: list, tfidf: list):
def multimodal_processing(adatas : list, names : list, tfidf : list):
    '''
    Combines and processes a list of adata objects.

    Parameters
    ----------
    **adatas** : *list[AnnData]* 
        > List of AnnData objects where each object is a different 
        modality for the same cells.

    **names** : *list[str]*
        > List of string names for each modality respective to each 
        object in `adatas`.
    
    **tfidf** : *list[bool]* 
        > List where if element i is `True`, adata[i] will be TFIDF 
        normalized.

    Returns
    -------
    **adata** : *AnnData* 
        > Concatenation of the objects in `adatas` with Z matrices 
        calculated.

    Examples
    --------
    >>> rna_adata = scmkl.create_adata(X = mcf7_rna_mat, 
    ...                                feature_names = gene_names, 
    ...                                scale_data = True, 
    ...                                cell_labels = cell_labels, 
    ...                                group_dict = rna_grouping)
    >>>
    >>> atac_adata = scmkl.create_adata(X = mcf7_atac_mat, 
    ...                                 feature_names = peak_names, 
    ...                                 scale_data = False, 
    ...                                 cell_labels = cell_labels, 
    ...                                 group_dict = atac_grouping)
    >>>
    >>> adatas = [rna_adata, atac_adata]
    >>> mod_names = ['rna', 'atac']
    >>> adata = scmkl.multimodal_processing(adatas = adatas, 
    ...                                     names = mod_names,
    ...                                     tfidf = [False, True])
    >>>
    >>> adata
    AnnData object with n_obs × n_vars = 1000 × 12676
    obs: 'labels'
    var: 'labels'
    uns: 'D', 'kernel_type', 'distance_metric', 'train_indices',  
    'test_indices', 'Z_train', 'Z_test', 'group_dict', 'seed_obj'
    '''
    import warnings 
    warnings.filterwarnings('ignore')

    # All objects must have the same number of cells
    same_cells = all(adata.shape[0] == adatas[0].shape[0] 
                     for adata in adatas)
    assert same_cells, ("Different number of cells present in "
                        "each object")
    
    # True if all train indices match
    same_train = np.all([np.array_equal(adatas[0].uns['train_indices'], 
                                        adatas[i].uns['train_indices']) 
                         for i in range(1, len(adatas))])

    # True if all test indices match
    same_test = np.all([np.array_equal(adatas[0].uns['test_indices'], 
                                       adatas[i].uns['test_indices']) 
                        for i in range(1, len(adatas))])

    assert same_train, 'Different train indices'
    assert same_test, 'Different test indices'

    # Creates a boolean array for each modality of cells with non-empty rows
    non_empty_rows = [np.array(_sparse_var(adata.X, axis = 1) != 0).ravel() 
                      for adata in adatas]

    # 1D boolean mask of cells that are non-empty in every modality
    non_empty_rows = np.logical_and.reduce(non_empty_rows)

    # Initializing final train test split array
    train_test = np.repeat('train', adatas[0].shape[0])
    train_test[adatas[0].uns['test_indices']] = 'test'

    # Capturing train test split with empty rows filtered out
    train_test = train_test[non_empty_rows]
    train_indices = np.where(train_test == 'train')[0]
    test_indices = np.where(train_test == 'test')[0]

    # Adding train test split arrays to AnnData objects 
    # and filtering out empty samples
    for i, adata in enumerate(adatas):
        adatas[i].uns['train_indices'] = train_indices
        adatas[i].uns['test_indices'] = test_indices
        adatas[i] = adata[non_empty_rows, :]
        # tfidf normalizing if corresponding element in tfidf is True
        if tfidf[i]:
            adatas[i] = tfidf_normalize(adatas[i])

        if 'Z_train' not in adatas[i].uns.keys():
            # AnnData updates must point at the object in the list
            print(f'Estimating Sigma for {names[i]}', flush = True)
            adatas[i] = estimate_sigma(adatas[i], n_features = 200)
            print(f'Calculating Z for {names[i]}', flush = True)
            adatas[i] = calculate_z(adatas[i], n_features = 5000)

    if 'labels' in adatas[0].obs:
        all_labels = [adata.obs['labels'] for adata in adatas]
        # Ensuring cell labels for each AnnData object are the same
        for i in range(1, len(all_labels)):
            same_labels = np.all(all_labels[0] == all_labels[i])
            assert same_labels, (f"Cell labels between AnnData object in "
                                 f"position 0 and position {i} in adatas do "
                                 "not match")

    adata = _combine_modalities(adatas = adatas,
                                names = names,
                                combination = 'concatenate')

    del adatas
    gc.collect()

    return adata

Combines and processes a list of adata objects.

Parameters

adatas : list[AnnData]

List of AnnData objects where each object is a different modality for the same cells.

names : list[str]

List of string names for each modality respective to each object in adatas.

tfidf : list[bool]

List where if element i is True, adata[i] will be TFIDF normalized.

Returns

adata : AnnData

Concatenation of the objects in adatas with Z matrices calculated.

Examples

>>> rna_adata = scmkl.create_adata(X = mcf7_rna_mat, 
...                                feature_names = gene_names, 
...                                scale_data = True, 
...                                cell_labels = cell_labels, 
...                                group_dict = rna_grouping)
>>>
>>> atac_adata = scmkl.create_adata(X = mcf7_atac_mat, 
...                                 feature_names = peak_names, 
...                                 scale_data = False, 
...                                 cell_labels = cell_labels, 
...                                 group_dict = atac_grouping)
>>>
>>> adatas = [rna_adata, atac_adata]
>>> mod_names = ['rna', 'atac']
>>> adata = scmkl.multimodal_processing(adatas = adatas, 
...                                     names = mod_names,
...                                     tfidf = [False, True])
>>>
>>> adata
AnnData object with n_obs × n_vars = 1000 × 12676
obs: 'labels'
var: 'labels'
uns: 'D', 'kernel_type', 'distance_metric', 'train_indices',  
'test_indices', 'Z_train', 'Z_test', 'group_dict', 'seed_obj'
def one_v_rest( adatas: list, names: list, alpha_list: numpy.ndarray, tfidf: list) -> dict:
def one_v_rest(adatas : list, names : list, alpha_list : np.ndarray, 
               tfidf : list) -> dict:
    '''
    For each cell class, creates model(s) comparing that class to all 
    others. Then, predicts on the held-out testing data using 
    `scmkl.run()`. Only labels present in both training and testing 
    will be run.

    Parameters
    ----------
    **adatas** : *list[AnnData]* 
        > List of AnnData objects created by create_adata()
        where each AnnData is one modality and composed of both 
        training and testing samples. Requires that `'train_indices'`
        and `'test_indices'` are the same across all AnnDatas.

    **names** : *list[str]* 
        > List of string variables that describe each modality
        respective to adatas for labeling.
        
    **alpha_list** : *np.ndarray* | *float*
        > An array of alpha values to create each model with.

    **tfidf** : *list[bool]* 
        > List where if element i is `True`, adata[i] will be TFIDF 
        normalized.

    Returns
    -------
    **results** : *dict*
    > Contains keys for each cell class with results from cell class
    versus all other samples. See `scmkl.run()` for further details.

    Examples
    --------
    >>> adata = scmkl.create_adata(X = data_mat, 
    ...                            feature_names = gene_names, 
    ...                            group_dict = group_dict)
    >>>
    >>> results = scmkl.one_v_rest(adatas = [adata], names = ['rna'],
    ...                           alpha_list = np.array([0.05, 0.1]),
    ...                           tfidf = [False])
    >>>
    >>> results.keys()
    dict_keys(['B cells', 'Monocytes', 'Dendritic cells', ...])
    '''
    # Formatting checks ensuring all adata elements are 
    # AnnData objects and train/test indices are all the same
    _check_adatas(adatas, check_obs = True, check_uns = True)

    # Extracting train and test indices
    train_indices = adatas[0].uns['train_indices']
    test_indices = adatas[0].uns['test_indices']

    # Checking and capturing cell labels
    uniq_labels = _eval_labels(cell_labels = adatas[0].obs['labels'], 
                               train_indices = train_indices,
                               test_indices = test_indices)

    # Calculating Z matrices, method depends on whether there are multiple 
    # adatas (modalities)
    if len(adatas) == 1:
        adata = estimate_sigma(adatas[0], n_features = 200)
        adata = calculate_z(adata, n_features = 5000)
    else:
        adata = multimodal_processing(adatas = adatas, 
                                      names = names, 
                                      tfidf = tfidf)

    del adatas
    gc.collect()

    # Initializing for capturing model outputs
    results = {}

    # Capturing cell labels before overwriting
    cell_labels = np.array(adata.obs['labels'])

    for label in uniq_labels:
        print(f"Comparing {label} to other types", flush = True)
        cur_labels = cell_labels.copy()
        cur_labels[cell_labels != label] = 'other'
        
        # Replacing cell labels for current cell type vs rest
        adata.obs['labels'] = cur_labels

        # Running scMKL
        results[label] = run(adata, alpha_list, return_probs = True)

    # Getting final predictions
    alpha = np.min(alpha_list)
    prob_table, pred_class, low_conf = _prob_table(results, alpha)

    results['Probability_table'] = prob_table
    results['Predicted_class'] = pred_class
    results['Truth_labels'] = cell_labels[adata.uns['test_indices']]
    results['Low_confidence'] = low_conf

    return results

For each cell class, creates model(s) comparing that class to all others. Then, predicts on the held-out testing data using scmkl.run. Only labels present in both training and testing will be run.

Parameters

adatas : list[AnnData]

List of AnnData objects created by create_adata() where each AnnData is one modality and composed of both training and testing samples. Requires that 'train_indices' and 'test_indices' are the same across all AnnDatas.

names : list[str]

List of string variables that describe each modality respective to adatas for labeling.

alpha_list : np.ndarray | float

An array of alpha values to create each model with.

tfidf : list[bool]

List where if element i is True, adata[i] will be TFIDF normalized.

Returns

results : dict

Contains keys for each cell class with results from cell class versus all other samples. See scmkl.run for further details.

Examples

>>> adata = scmkl.create_adata(X = data_mat, 
...                            feature_names = gene_names, 
...                            group_dict = group_dict)
>>>
>>> results = scmkl.one_v_rest(adatas = [adata], names = ['rna'],
...                           alpha_list = np.array([0.05, 0.1]),
...                           tfidf = [False])
>>>
>>> results.keys()
dict_keys(['B cells', 'Monocytes', 'Dendritic cells', ...])
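A brief sketch of consuming the one-vs-rest output (keys as documented above; Low_confidence is assumed to be a boolean mask):

```python
import numpy as np

preds = np.array(results['Predicted_class'])
truth = np.array(results['Truth_labels'])

print(f"Accuracy: {np.mean(preds == truth):.3f}")
print(f"Low-confidence calls: {np.sum(results['Low_confidence'])}")
```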
def optimize_alpha( adata, group_size=None, tfidf=False, alpha_array=array([1.9, 1.7, 1.5, 1.3, 1.1, 0.9, 0.7, 0.5, 0.3, 0.1]), k=4):
def optimize_alpha(adata, group_size = None, tfidf = False, 
                   alpha_array = np.round(np.linspace(1.9, 0.1, 10), 2), k = 4):
    '''
    Iteratively train a grouplasso model and update alpha to find the 
    parameter yielding the best performing sparsity. This function 
    currently only works for binary experiments.

    Parameters
    ----------
    **adata** : *AnnData* | *list[AnnData]*
        > `AnnData`(s) with `'Z_train'` and `'Z_test'` in 
        `adata.uns.keys()`.

    **group_size** : *None* | *int*
        > Argument describing how the features are grouped. If *None*, 
        `2 * adata.uns['D']` will be used. 
        For more information see the celer documentation.

    **tfidf** : *bool* 
        > If `True`, TFIDF normalization will be run at each fold.
    
    **alpha_array** : *np.ndarray*
        > Array of all alpha values to be tested.

    **k** : *int*
        > Number of folds to perform cross validation over.
            
    Returns
    -------
    **alpha_star** : *float*
        > The best performing alpha value from cross validation on 
        training data.

    Examples
    --------
    >>> alpha_star = scmkl.optimize_alpha(adata)
    >>> alpha_star
    0.1

    See Also
    --------
    celer.GroupLasso : 
    https://mathurinm.github.io/celer/generated/celer.GroupLasso.html
    '''
    assert isinstance(k, int) and k > 0, ("Must be a positive integer number "
                                          "of folds")

    import warnings 
    warnings.filterwarnings('ignore')

    if group_size is None:
        group_size = adata.uns['D'] * 2

    if isinstance(adata, list):
        alpha_star = _multimodal_optimize_alpha(adatas = adata, 
                                                group_size = group_size, 
                                                tfidf = tfidf, 
                                                alpha_array = alpha_array)
        return alpha_star

    y = adata.obs['labels'].iloc[adata.uns['train_indices']].to_numpy()
    
    # Splits the labels evenly between folds
    pos_indices = np.where(y == np.unique(y)[0])[0]
    neg_indices = np.setdiff1d(np.arange(len(y)), pos_indices)

    pos_annotations = np.arange(len(pos_indices)) % k
    neg_annotations = np.arange(len(neg_indices)) % k

    auc_array = np.zeros((len(alpha_array), k))

    gc.collect()

    for fold in np.arange(k):

        cv_adata = adata[adata.uns['train_indices'],:]

        # Create CV train/test indices
        # Capturing train indices for both classes
        ftrain_pos = pos_indices[np.where(pos_annotations != fold)[0]]
        ftrain_neg = neg_indices[np.where(neg_annotations != fold)[0]]

        # Capturing test indices for both classes
        ftest_pos = pos_indices[np.where(pos_annotations == fold)[0]]
        ftest_neg = neg_indices[np.where(neg_annotations == fold)[0]]

        # Creating class-balanced train/test split for fold
        fold_train = np.concatenate((ftrain_pos, ftrain_neg))
        fold_test = np.concatenate((ftest_pos, ftest_neg))

        cv_adata.uns['train_indices'] = fold_train
        cv_adata.uns['test_indices'] = fold_test

        if tfidf:
            cv_adata = tfidf_normalize(cv_adata, binarize = True)

        cv_adata = estimate_sigma(cv_adata, n_features = 200)
        cv_adata = calculate_z(cv_adata, n_features = 5000)

        gc.collect()

        for i, alpha in enumerate(alpha_array):

            cv_adata = train_model(cv_adata, group_size, alpha = alpha)
            auc_array[i, fold] = _calculate_auroc(cv_adata)

            gc.collect()

        del cv_adata
        gc.collect()
        
    # Take AUROC mean across the k folds to find alpha yielding highest AUROC
    alpha_star = alpha_array[np.argmax(np.mean(auc_array, axis = 1))]
    gc.collect()

    return alpha_star

Iteratively train a grouplasso model and update alpha to find the parameter yielding the best performing sparsity. This function currently only works for binary experiments.

Parameters

adata : AnnData | list[AnnData]

AnnData(s) with 'Z_train' and 'Z_test' in adata.uns.keys().

group_size : None | int

Argument describing how the features are grouped. If None, 2 * adata.uns['D'] will be used. For more information see the celer documentation.

tfidf : bool

If True, TFIDF normalization will be run at each fold.

alpha_array : np.ndarray

Array of all alpha values to be tested.

k : int

Number of folds to perform cross validation over.

Returns

alpha_star : float

The best performing alpha value from cross validation on training data.

Examples

>>> alpha_star = scmkl.optimize_alpha(adata)
>>> alpha_star
0.1

See Also

celer.GroupLasso : https://mathurinm.github.io/celer/generated/celer.GroupLasso.html
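In practice, the returned alpha_star is passed straight to scmkl.run() (sketch; the alpha grid shown is the function's default):

```python
import numpy as np

alpha_star = scmkl.optimize_alpha(adata,
                                  alpha_array = np.round(np.linspace(1.9, 0.1, 10), 2),
                                  k = 4)
results = scmkl.run(adata, alpha_list = np.array([alpha_star]))
```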

def optimize_sparsity( adata, group_size, starting_alpha=1.9, increment=0.2, target=1, n_iter=10):
def optimize_sparsity(adata, group_size, starting_alpha = 1.9, 
                      increment = 0.2, target = 1, n_iter = 10):
    '''
    Iteratively train a grouplasso model and update alpha to find the 
    parameter yielding the desired sparsity.
    
    Parameters
    ----------
    **adata** : *AnnData*
        > `AnnData` with `'Z_train'` and `'Z_test'` in 
        `adata.uns.keys()`.

    **group_size** : *int* 
        > Argument describing how the features are grouped. Should be
        `2 * D`. For more information see celer documentation.

    **starting_alpha** : *float*
        > The alpha value to start the search at.
    
    **increment** : *float* 
        > Amount to adjust alpha by between iterations.
    
    **target** : *int*
        > The desired number of groups selected by the model.

    **n_iter** : *int*
        > The maximum number of iterations to run.
            
    Returns
    -------
    **sparsity_dict** : *dict*
        > Tested alpha as keys and the number of selected pathways as 
        the values.
        
    **alpha** : *float*
        > The alpha value yielding the number of selected groups 
        closest to the target.

    Examples
    --------
    >>> sparsity_dict, alpha = scmkl.optimize_sparsity(adata, (2 * D), 
    ...                                                target = 1)
    >>>
    >>> alpha
    0.01

    See Also
    --------
    celer.GroupLasso : 
    https://mathurinm.github.io/celer/generated/celer.GroupLasso.html
    '''
    assert increment > 0 and increment < starting_alpha, ("Choose a positive "
                                                          "increment less "
                                                          "than alpha")
    assert target > 0 and isinstance(target, int), ("Choose an integer "
                                                    "target number of groups "
                                                    "that is greater than 0")
    assert n_iter > 0 and isinstance(n_iter, int), ("Choose an integer "
                                                    "number of iterations "
                                                    "that is greater than 0")

    sparsity_dict = {}
    alpha = starting_alpha

    for _ in np.arange(n_iter):
        adata = train_model(adata, group_size, alpha)
        num_selected = len(find_selected_groups(adata))

        sparsity_dict[np.round(alpha, 4)] = num_selected

        if num_selected < target:
            # Decreasing alpha will increase the number of selected pathways
            if alpha - increment in sparsity_dict.keys():
                # Make increment smaller so the model can't go back and forth 
                # between alpha values
                increment /= 2
            # Floors alpha at 0.1 so it can never reach zero or go negative
            alpha = np.max([alpha - increment, 1e-1]) 

        elif num_selected > target:
            if alpha + increment in sparsity_dict.keys():
                increment /= 2

            alpha += increment
        elif num_selected == target:
            break

    # Find the alpha that minimizes the difference between target and observed
    # number of selected groups
    spar_idx = np.argmin([np.abs(selected - target) 
                          for selected in sparsity_dict.values()])
    optimal_alpha = list(sparsity_dict.keys())[spar_idx]
    return sparsity_dict, optimal_alpha

Iteratively train a grouplasso model and update alpha to find the parameter yielding the desired sparsity.

Parameters

adata : AnnData

AnnData with 'Z_train' and 'Z_test' in adata.uns.keys().

group_size : int

Argument describing how the features are grouped. Should be 2 * D. For more information see celer documentation.

starting_alpha : float

The alpha value to start the search at.

increment : float

Amount to adjust alpha by between iterations.

target : int

The desired number of groups selected by the model.

n_iter : int

The maximum number of iterations to run.

Returns

sparsity_dict : dict

Tested alpha as keys and the number of selected pathways as the values.

alpha : float

The alpha value yielding the number of selected groups closest to the target.

Examples

>>> sparsity_dict, alpha = scmkl.optimize_sparsity(adata, (2 * D), 
...                                                target = 1)
>>>
>>> alpha
0.01

See Also

celer.GroupLasso : https://mathurinm.github.io/celer/generated/celer.GroupLasso.html
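The returned dictionary maps each tested alpha to the number of groups it selected, which is useful for sanity-checking the search path (output values illustrative):

```python
sparsity_dict, alpha = scmkl.optimize_sparsity(adata,
                                               group_size = 2 * adata.uns['D'],
                                               target = 5)
print(sparsity_dict)  # e.g. {1.9: 0, 1.7: 1, 1.5: 3, 1.3: 5}
print(alpha)          # alpha whose selection count is closest to target
```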

def run( adata: anndata._core.anndata.AnnData, alpha_list: numpy.ndarray, metrics: list | None = None, return_probs=False) -> dict:
def run(adata : ad.AnnData, alpha_list : np.ndarray, 
        metrics : list | None = None, return_probs = False) -> dict:
    '''
    Wrapper function for training and testing with multiple alpha 
    values. Returns metrics, predictions, group weights, and resource 
    usage.

    Parameters
    ----------
    **adata** : *AnnData* 
        > A processed *AnnData* with `'Z_train'`, `'Z_test'`, and 
        `'group_dict'` keys in `adata.uns`.
    
    **alpha_list** : *np.ndarray* 
        > `alpha` values to create models using. Alpha refers to the 
        penalty parameter in Group Lasso. Larger alphas force group 
        weights to shrink towards 0 while smaller alphas apply a lesser 
        penalty to kernel weights.

    **metrics** : *list[str]*
        > What metrics should be calculated on predictions. Options are 
        ['AUROC', 'F1-Score', 'Accuracy', 'Precision', 'Recall']. When 
        set to `None`, all metrics are calculated.

    **return_probs** : *bool*
        > If `True`, class probabilities for the test samples are also 
        captured and returned under the `'Probabilities'` key.
    
    Returns
    -------
    **results** : *dict*
    > With keys and values: 

    > `'Metrics'` : a nested dictionary as `[alpha][metric]` = value.
    
    > `'Selected_groups'` : a dictionary as `[alpha]` = array of 
        groups with nonzero weights.

    > `'Norms'` : a dictionary as `[alpha]` = array of kernel weights
        for each group, order respective to 'Group_names'.

    > `'Predictions'` : a dictionary as `[alpha]` = predicted class
        respective to 'Observations' for that `alpha`.

    > `'Observed'` : an array of ground truth cell labels from the
        test set.

    > `'Test_indices'` : indices of samples respective to adata 
        used in the test set.

    > `'Group_names'` : an array of group names respective to each
        array in 'Norms'.

    > `'Models'` : a dictionary where `[alpha]` = Celer Group Lasso
        object for that `alpha`.

    > `'RAM_usage'` : memory usage after training models for each 
        `alpha`.

    Examples
    --------
    >>> results = scmkl.run(adata = adata, 
    ...                     alpha_list = np.array([0.05, 0.1, 0.5]))
    >>> results
    dict_keys(['Metrics', 'Selected_groups', 'Norms', 'Predictions', 
    ...        'Observed', 'Test_indices', 'Group_names', 'Models', 
    ...        'Train_time', 'RAM_usage'])
    >>>
    >>> # List of alpha values
    >>> results['Metrics'].keys()
    dict_keys([0.05, 0.1, 0.5])
    >>>
    >>> results['Metrics'][0.05]
    {'AUROC': 0.9859,
    'Accuracy': 0.945,
    'F1-Score': 0.9452736318407959,
    'Precision': 0.9405940594059405,
    'Recall': 0.95}
    '''
    if metrics is None:
        metrics = ['AUROC', 'F1-Score', 'Accuracy', 'Precision', 'Recall']

    # Initializing variables to capture metrics
    group_names = list(adata.uns['group_dict'].keys())
    preds = {}
    group_norms = {}
    mets_dict = {}
    selected_groups = {}
    train_time = {}
    models = {}
    probs = {}

    D = adata.uns['D']

    # Generating models for each alpha and outputs
    for alpha in alpha_list:
        
        print(f'  Evaluating model. Alpha: {alpha}', flush = True)

        train_start = time.time()

        adata = train_model(adata, group_size = 2 * D, alpha = alpha)

        if return_probs:
            alpha_res = predict(adata, 
                                metrics = metrics,
                                return_probs = return_probs)
            preds[alpha], mets_dict[alpha], probs[alpha] = alpha_res

        else:
            alpha_res = predict(adata, 
                                metrics = metrics,
                                return_probs = return_probs)
            preds[alpha], mets_dict[alpha] = alpha_res

        selected_groups[alpha] = find_selected_groups(adata)

        kernel_weights = adata.uns['model'].coef_
        # Each group owns a contiguous block of 2 * D coefficients;
        # slice ends are exclusive, so no "- 1" is needed here
        group_norms[alpha] = [
            np.linalg.norm(kernel_weights[i * 2 * D : (i + 1) * 2 * D])
            for i in np.arange(len(group_names))
            ]
        
        models[alpha] = adata.uns['model']
        
        train_end = time.time()
        train_time[alpha] = train_end - train_start

    # Combining results into one object
    results = {}
    results['Metrics'] = mets_dict
    results['Selected_groups'] = selected_groups
    results['Norms'] = group_norms
    results['Predictions'] = preds
    results['Observed'] = adata.obs['labels'].iloc[adata.uns['test_indices']]
    results['Test_indices'] = adata.uns['test_indices']
    results['Group_names'] = group_names
    results['Models'] = models
    results['Train_time'] = train_time
    results['RAM_usage'] = f'{tracemalloc.get_traced_memory()[1] / 1e9} GB'
    results['Probabilities'] = probs

    return results

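Because `run()` keys every output by alpha, selecting a final model from `results` is direct. A small illustrative sketch, continuing from the `results` object in the example above (the choice of AUROC as the selection metric is just for illustration):

# Choose the alpha whose model achieved the best test-set AUROC
best_alpha = max(results['Metrics'],
                 key = lambda a: results['Metrics'][a]['AUROC'])

# Inspect the groups that model kept (nonzero kernel weights)
print(best_alpha, results['Selected_groups'][best_alpha])
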
test_import
def tfidf_normalize(adata, binarize=False):
def tfidf_normalize(adata, binarize = False):
    '''
    Function to TFIDF normalize the data in an adata object. All rows 
    of `adata.X` must have a positive sum; entirely-zero rows trigger 
    an assertion error.

    Parameters
    ----------
    **adata** : *AnnData* 
        > `adata.X` to be normalized. If `'train_indices'` and 
        `'test_indices'` are in `adata.uns.keys()`, normalization will 
        be done separately for the training and testing data. 
        Otherwise, it will be calculated on the entire dataset.

    **binarize** : *bool* 
        > If `True`, all values in `adata.X` greater than 0 will become 
        1.

    Returns
    -------
    **adata** : *AnnData* 
        > adata with adata.X TFIDF normalized. The train and test data 
        are normalized separately, stacked, and then restored to their 
        original row order.

    Examples
    --------
    >>> adata = scmkl.create_adata(X = data_mat, 
    ...                            feature_names = gene_names, 
    ...                            group_dict = group_dict)
    >>> 
    >>> adata = scmkl.tfidf_normalize(adata)
    '''
    X = adata.X.copy()
    row_sums = np.sum(X, axis = 1)
    assert np.all(row_sums > 0), "TFIDF requires all row sums be positive"

    if binarize:
        X[X > 0] = 1

    if 'train_indices' in adata.uns_keys():

        train_indices = adata.uns['train_indices'].copy()
        test_indices = adata.uns['test_indices'].copy()

        # Calculate the train TFIDF matrix on just the training data so it is 
        # not biased by testing data
        tfidf_train = _tfidf(X[train_indices,:], mode = 'normalize')

        # Calculate the test TFIDF by calculating it on the train and test 
        # data and index the test data
        tfidf_test = _tfidf(X, mode = 'normalize')[test_indices,:]

        # Impossible to add rows back to original location so we need to 
        # stack the matrices to maintain train/test
        if scipy.sparse.issparse(X):
            tfidf_norm = scipy.sparse.vstack((tfidf_train, tfidf_test))
        else:
            tfidf_norm = np.vstack((tfidf_train, tfidf_test))

        # Reassign the indices explicitly; without this, the values can 
        # be saved as 0s in adata
        adata.uns['train_indices'] = train_indices
        adata.uns['test_indices'] = test_indices

        combined_indices = np.concatenate((train_indices, test_indices))

        # AnnData indexes by "rownames" not position, so the rows are 
        # renamed to properly index
        adata_index = adata.obs_names[combined_indices].astype(int)
        tfidf_norm = tfidf_norm[np.argsort(adata_index),:]

    else:

        tfidf_norm = _tfidf(X, mode = 'normalize')

    adata.X = tfidf_norm.copy()

    return adata

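For scATAC data, where peak counts are effectively binary, a typical call binarizes before normalizing. A short sketch, assuming `adata` was built from a peak-by-cell matrix with `scmkl.create_adata()` so that train/test indices already exist:

# Binarize accessibility, then TFIDF normalize train and test separately
adata = scmkl.tfidf_normalize(adata, binarize = True)

# Downstream steps consume the normalized adata.X as usual
adata = scmkl.estimate_sigma(adata)
adata = scmkl.calculate_z(adata)
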
def train_model(adata, group_size=1, alpha=0.9):
def train_model(adata, group_size = 1, alpha = 0.9):
    '''
    Fit a grouplasso model to the provided data.

    Parameters
    ----------
    **adata** : *AnnData* 
        > Has `'Z_train'` and `'Z_test'` keys in `adata.uns`.

    **group_size** : *int* 
        > Argument describing how the features are grouped. Should be
        `2 * D`. For more information see celer documentation. 
            
    **alpha** : *float*
        > Group Lasso regularization coefficient. alpha is a floating 
        point value controlling model solution sparsity. Must be a 
        positive float. The smaller the value, the more feature groups 
        will be selected in the trained model.
    
    Returns
    -------
    **adata** : *AnnData* 
        > Trained model accessible with `adata.uns['model']`.

    Examples
    --------
    >>> adata = scmkl.estimate_sigma(adata)
    >>> adata = scmkl.calculate_z(adata)
    >>> d = scmkl.calculate_d(adata.shape[0])
    >>> group_size = 2 * d
    >>> adata = scmkl.train_model(adata, group_size)
    >>>
    >>> 'model' in adata.uns.keys()
    True

    See Also
    --------
    celer documentation :
    https://mathurinm.github.io/celer/generated/celer.GroupLasso.html
    '''
    assert alpha > 0, 'Alpha must be positive'

    y_train = adata.obs['labels'].iloc[adata.uns['train_indices']]
    X_train = adata.uns['Z_train']

    cell_labels = np.unique(y_train)

    # This is a regression algorithm. We need to make the labels 'continuous' 
    # for classification, but they will remain binary. Casts training labels 
    # to an array of -1, 1
    train_labels = np.ones(y_train.shape)
    train_labels[y_train == cell_labels[1]] = -1

    # alpha is scaled by alpha_max (the largest absolute inner product 
    # between a feature and the labels, divided by the number of samples) 
    # to regularize the effect of alpha across different data sets
    alphamax = np.max(np.abs(X_train.T.dot(train_labels)))
    alphamax /= X_train.shape[0] 
    alphamax *= alpha

    # Instantiate celer Group Lasso regression model object
    model = celer.GroupLasso(groups = group_size, alpha = alphamax)

    # Fit model using training data
    model.fit(X_train, train_labels.ravel())

    adata.uns['model'] = model
    return adata

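
To make the -1/1 label encoding concrete, the sketch below uses the fitted model directly. It is a minimal sketch, assuming `adata.uns['Z_test']` and `adata.uns['D']` exist (as in `run()`'s source above); it relies on the standard scikit-learn-style `predict()` that `celer.GroupLasso` exposes, and scmkl's own `predict()` wrapper may differ in details such as the decision threshold.

import numpy as np

adata = scmkl.train_model(adata, group_size = 2 * adata.uns['D'], alpha = 0.1)
model = adata.uns['model']

# Training encoded cell_labels[0] as +1 and cell_labels[1] as -1, so the
# sign of the continuous prediction recovers the class
cell_labels = np.unique(adata.obs['labels'])
scores = model.predict(adata.uns['Z_test'])
predicted = np.where(scores > 0, cell_labels[0], cell_labels[1])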