scmkl



Single-cell analysis using Multiple Kernel Learning (scMKL) is a binary classification algorithm that uses prior information to group features, enhancing classification and aiding interpretation of the distinguishing features in multi-omic data sets.

Installation

Conda install

Conda is the recommended method to install scMKL:

conda create -n scMKL python=3.12 
conda activate scMKL
conda install -c conda-forge -c bioconda ivango17::scmkl

Ensure bioconda and conda-forge are in your available conda channels.

Pip install

First, create a virtual environment with python>=3.11.1,<3.13.

Then, install scMKL with:

# activate your new env with python>=3.11.1 and <3.13
pip install scmkl

If wheels do not build correctly, ensure gcc and g++ are installed and up to date; on Debian/Ubuntu they can be installed with sudo apt install gcc g++.

Requirements

scMKL takes advantage of AnnData objects and can be run with just four pieces of data:

1) scRNA and/or scATAC matrices (can be scipy.sparse matrix)

2) An array of cell labels

3) An array of feature names (e.g., gene symbols for RNA or peaks for ATAC)

4) A grouping dictionary, e.g. {'group_1': ['feature_5', 'feature_16'], 'group_2': ['feature_1', 'feature_4', 'feature_9']}

For implementing scMKL and learning how to get meaningful feature groupings, see the examples for your use case in the examples directory; a minimal sketch of assembling these inputs follows.
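For orientation, here is a minimal, hedged sketch of assembling the four inputs for scmkl.create_adata() (documented further down this page); the file names are hypothetical placeholders.

import numpy as np
import scipy.sparse
import scmkl

# 1) cells-by-features matrix (sparse recommended); hypothetical file name
X = scipy.sparse.load_npz('rna_counts.npz')

# 2) one label per cell and 3) one feature name per column; hypothetical files
cell_labels = np.load('cell_labels.npy', allow_pickle=True)
feature_names = np.load('gene_names.npy', allow_pickle=True)

# 4) grouping dictionary mapping group names to member features
group_dict = {'group_1': np.array(['feature_5', 'feature_16']),
              'group_2': np.array(['feature_1', 'feature_4', 'feature_9'])}

adata = scmkl.create_adata(X=X, feature_names=feature_names,
                           cell_labels=cell_labels, group_dict=group_dict)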

Repo: https://github.com/ohsu-cedar-comp-hub/scMKL

PyPI: https://pypi.org/project/scmkl/

Anaconda: https://anaconda.org/ivango17/scmkl

API: https://ohsu-cedar-comp-hub.github.io/scMKL/

Publication

If you use scMKL in your research, please cite using:

Kupp, S., VanGordon, I., Gönen, M., Esener, S., Eksi, S., Ak, C. Interpretable and integrative analysis of single-cell multiomics with scMKL. Commun Biol 8, 1160 (2025). https://doi.org/10.1038/s42003-025-08533-7

Our Shiny for Python application for viewing data produced from this work can be found here: scMKL_analysis

Issues

Please report bugs by opening an issue in the GitHub repository.

Examples

Here are helpful examples for running scMKL and for preparing the data it requires.

File                             Description
getting_gene_groupings.ipynb     Different ways to get gene sets in a usable format with scMKL
getting_region_groupings.ipynb   How to group genomic regions using a gene set library and a GTF file
RNA_analysis.ipynb               Running scMKL using only single-cell RNA data
ATAC_analysis.ipynb              Running scMKL using only single-cell ATAC data
multimodal_analysis.ipynb        Running scMKL using single-cell multi-omic data (RNA + ATAC)

scMKL Documentation

 1"""
 2.. include:: ../README.md
 3.. include:: ../example/README.md
 4
 5----------------------------
 6
 7## **scMKL Documentation**
 8"""
 9
10__version__ = '0.4.3a0'
11
12__all__ = ['calculate_z', 
13           'calculate_d',
14           'create_adata', 
15           'data_processing',
16           'dataframes',
17           'estimate_sigma', 
18           'extract_results',
19           'find_candidates',
20           'format_adata',
21           'format_group_names',
22           'get_folds',
23           'get_gene_groupings',
24           'get_metrics',
25           'get_region_groupings',
26           'get_selection',
27           'get_summary', 
28           'get_weights', 
29           'groups_per_alpha',
30           'group_umap',
31           'multimodal_processing', 
32           'one_v_rest', 
33           'optimize_alpha', 
34           'optimize_sparsity',
35           'parse_metrics',
36           'parse_weights',
37           'plotting',
38           'plot_metric',
39           'plot_conf_mat',
40           'prepare_fold',
41           'projections',
42           'process_fold',
43           'read_files',
44           'read_gtf',
45           'run',
46           'sort_groups',
47           'test',
48           'tfidf_normalize',
49           'train_model',
50           'weights_barplot',
51           'weights_dotplot',
52           'weights_heatmap'
53           ]
54
55from scmkl._checks import *
56from scmkl.calculate_z import *
57from scmkl.create_adata import *
58from scmkl.data_processing import *
59from scmkl.dataframes import *
60from scmkl.estimate_sigma import *
61from scmkl.get_gene_groupings import *
62from scmkl.get_region_groupings import *
63from scmkl.multimodal_processing import *
64from scmkl.one_v_rest import *
65from scmkl.optimize_alpha import *
66from scmkl.optimize_sparsity import *
67from scmkl.plotting import *
68from scmkl.projections import *
69from scmkl.run import *
70from scmkl.test import *
71from scmkl.tfidf_normalize import *
72from scmkl.train_model import *
73from scmkl.projections import *
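For orientation, a hedged sketch of the core workflow these exports support, using only functions documented on this page; downstream calls such as scmkl.run, scmkl.train_model, and scmkl.optimize_alpha appear in __all__ but their signatures are not shown here.

import scmkl

# Given an `adata` built by scmkl.create_adata() or scmkl.format_adata():
adata = scmkl.estimate_sigma(adata)  # one kernel width per feature group
adata = scmkl.calculate_z(adata)     # Z_train / Z_test random Fourier features
# Training and evaluation then proceed via scmkl.train_model / scmkl.run.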
def calculate_z(adata, n_features=5000, batches=10, 
                batch_size=100) -> ad.AnnData:
    """
    Function to calculate Z matrices for all groups in both training 
    and testing data.

    Parameters
    ----------
    adata : ad.AnnData
        Created by `scmkl.create_adata()` with `adata.uns.keys()`: 
        `'train_indices'` and `'test_indices'`. 

    n_features : int
        Number of random features to use when calculating Z; used for 
        scalability.

    batches : int
        The number of batches to use for the distance calculation.
        This will average the result of `batches` distance calculations
        of `batch_size` randomly sampled cells. More batches will converge
        to population distance values at the cost of scalability.

    batch_size : int
        The number of cells to include per batch for distance
        calculations. Higher batch size will converge to population
        distance values at the cost of scalability.
        If `batches*batch_size > num_training_cells`,
        `batch_size` will be reduced to 
        `int(num_training_cells / batches)`.

    Returns
    -------
    adata : ad.AnnData
        `adata` with Z matrices accessible with `adata.uns['Z_train']` 
        and `adata.uns['Z_test']`.

    Examples
    --------
    >>> adata = scmkl.estimate_sigma(adata)
    >>> adata = scmkl.calculate_z(adata)
    >>> adata.uns.keys()
    dict_keys(['Z_train', 'Z_test', 'sigma', 'train_indices', 
    'test_indices'])
    """
    # Number of groupings taken from group_dict
    n_pathway = len(adata.uns['group_dict'].keys())
    D = adata.uns['D']

    sq_i_d = np.sqrt(1/D)

    # Capturing training and testing sizes
    train_len = len(adata.uns['train_indices'])
    test_len = len(adata.uns['test_indices'])

    if batch_size * batches > train_len:
        old_batch_size = batch_size
        batch_size = int(train_len/batches)
        print("Specified batch size required too many cells for "
              "independent batches. Reduced batch size from "
              f"{old_batch_size} to {batch_size}")

    if 'sigma' not in adata.uns.keys():
        n_samples = np.min((2000, train_len))
        sample_range = np.arange(n_samples)
        batch_idx = get_batches(sample_range, adata.uns['seed_obj'], 
                                batches=batches, batch_size=batch_size)
        sigma_indices = sample_cells(adata.uns['train_indices'], n_samples, 
                                     adata.uns['seed_obj'])

    # Create arrays to store concatenated group Zs
    # Each group of features will have a corresponding entry in each array
    n_cols = 2*adata.uns['D']*n_pathway
    Z_train = np.zeros((train_len, n_cols), dtype=np.float16)
    Z_test = np.zeros((test_len, n_cols), dtype=np.float16)


    # Setting kernel function 
    match adata.uns['kernel_type'].lower():
        case 'gaussian':
            proj_func = gaussian_trans
        case 'laplacian':
            proj_func = laplacian_trans
        case 'cauchy':
            proj_func = cauchy_trans


    # Loop over each of the groups and create Z for each
    sigma_list = list()
    for m, group_features in enumerate(adata.uns['group_dict'].values()):

        n_group_features = len(group_features)

        X_train, X_test = get_group_mat(adata, n_features, group_features, 
                                        n_group_features, process_test=True)
        
        if adata.uns['tfidf']:
            X_train, X_test = tfidf_train_test(X_train, X_test)

        # Data filtering, and transformation according to given data_type
        # Will remove low variance (< 1e5) features regardless of data_type
        # If scale/transform data depending on .uns values
        X_train, X_test = process_data(X_train=X_train, X_test=X_test, 
                                       scale_data=adata.uns['scale_data'], 
                                       transform_data=adata.uns['transform_data'],
                                       return_dense=True)    

        # Getting sigma
        if 'sigma' in adata.uns.keys():
            sigma = adata.uns['sigma'][m]
        else:
            sigma = est_group_sigma(adata, X_train, n_group_features, 
                                    n_features, batch_idx=batch_idx)
            sigma_list.append(sigma)
            
        assert sigma > 0, "Sigma must be more than 0"
        train_projection, test_projection = calc_groupz(X_train, X_test, 
                                                        adata, D, sigma, 
                                                        proj_func)

        # Store group Z in whole-Z object
        # Preserves order to be able to extract meaningful groups
        cos_idx, sin_idx = get_z_indices(m, D)

        Z_train[0:, cos_idx] = np.cos(train_projection, dtype=np.float16)
        Z_train[0:, sin_idx] = np.sin(train_projection, dtype=np.float16)

        Z_test[0:, cos_idx] = np.cos(test_projection, dtype=np.float16)
        Z_test[0:, sin_idx] = np.sin(test_projection, dtype=np.float16)

    adata.uns['Z_train'] = Z_train*sq_i_d
    adata.uns['Z_test'] = Z_test*sq_i_d

    if 'sigma' not in adata.uns.keys():
        adata.uns['sigma'] = np.array(sigma_list)

    check_for_nan(adata)
    check_for_inf(adata)

    return adata

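For intuition, calculate_z implements the random Fourier feature approximation of a shift-invariant kernel: each group's data is projected onto random frequencies, passed through cos and sin, and scaled by sqrt(1/D). A self-contained NumPy sketch for a Gaussian kernel follows; the exact frequency sampling inside scMKL's gaussian_trans/laplacian_trans/cauchy_trans helpers may be parameterized differently.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # toy cells-by-features matrix
D = 50                          # number of random Fourier features
sigma = 5.0                     # kernel width (scMKL estimates this per group)

# For a Gaussian kernel, frequencies are drawn from a normal distribution
W = rng.normal(scale=1/sigma, size=(20, D))
proj = X @ W

# Concatenated cos/sin features scaled by sqrt(1/D), as in calculate_z
Z = np.sqrt(1/D) * np.hstack([np.cos(proj), np.sin(proj)])

# Z @ Z.T approximates the exact Gaussian kernel matrix on X
K_approx = Z @ Z.T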
def calculate_d(num_samples: int):
    """
    This function calculates the optimal number of dimensions for 
    performance. See https://doi.org/10.48550/arXiv.1806.09178 for more
    information.

    Parameters
    ----------
    num_samples : int
        The number of samples in the data set including both training
        and testing sets.

    Returns
    -------
    d : int
        The optimal number of dimensions to run scMKL with the given 
        data set.

    Examples
    --------
    >>> raw_counts = scipy.sparse.load_npz('MCF7_counts.npz')
    >>>
    >>> num_cells = raw_counts.shape[0]
    >>> d = scmkl.calculate_d(num_cells)
    >>> d
    161
    """
    d = int(np.sqrt(num_samples)*np.log(np.log(num_samples)))

    return int(np.max([d, 100]))

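The heuristic above is d = max(100, floor(sqrt(n) * ln(ln(n)))). A quick sanity check with a hypothetical cell count:

import numpy as np

n = 10_000                               # hypothetical number of cells
d = int(np.sqrt(n) * np.log(np.log(n)))  # 100 * ln(ln(10000)) ~ 222
print(max(d, 100))                       # 222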
def create_adata(X: scipy.sparse._csc.csc_matrix | np.ndarray | pd.DataFrame, 
                 feature_names: np.ndarray, cell_labels: np.ndarray, 
                 group_dict: dict, obs_names: None | np.ndarray=None, 
                 scale_data: bool=True, transform_data: bool=False, 
                 split_data: np.ndarray | None=None, D: int | None=None, 
                 remove_features: bool=True, train_ratio: float=0.8, 
                 distance_metric: str='euclidean', kernel_type: str='Gaussian', 
                 random_state: int=1, allow_multiclass: bool = False, 
                 class_threshold: str | int | None = None,
                 reduction: str | None = None, tfidf: bool = False, 
                 other_factor: float=1.5):
    """
    Function to create an AnnData object to carry all relevant 
    information going forward.

    Parameters
    ----------
    X : scipy.sparse.csc_matrix | np.ndarray | pd.DataFrame
        A data matrix of cells by features (sparse array 
        recommended for large datasets).

    feature_names : np.ndarray
        Array of feature names corresponding with the features 
        in `X`.

    cell_labels : np.ndarray
        A numpy array of cell phenotypes corresponding with 
        the cells in `X`.

    group_dict : dict 
        Dictionary containing feature grouping information (i.e. 
        `{geneset1: np.array([gene_1, gene_2, ..., gene_n]), geneset2: 
        np.array([...]), ...}`).

    obs_names : None | np.ndarray
        The cell names corresponding to `X` to be assigned to the 
        output object's `.obs_names` attribute.

    scale_data : bool  
        If `True`, data matrix is log transformed and standard 
        scaled. Default is `True`.

    transform_data : bool
        If `True`, data will be log1p transformed (recommended for 
        counts data). Default is `False`.   
    
    split_data : None | np.ndarray
        If `None`, data will be split stratified by cell labels. 
        Else, is an array of a precalculated train/test split 
        corresponding to samples. Can include labels for the entire
        dataset to benchmark performance, or for only the training
        data to classify unknown cell types (i.e. `np.array(['train', 
        'test', ..., 'train'])`).

    D : int 
        Number of Random Fourier Features used to calculate Z. 
        Should be a positive integer. Higher values of D will 
        increase classification accuracy at the cost of computation 
        time. If set to `None`, will be calculated given the number 
        of samples. 
    
    remove_features : bool
        If `True`, will remove features from `X` and `feature_names` 
        not in `group_dict` and remove features from groupings not in 
        `feature_names`.

    train_ratio : float
        Ratio of the number of training samples to the entire data 
        set. Note: if a threshold is applied, the ratio of training 
        samples may decrease depending on class balance and the 
        `class_threshold` parameter if `allow_multiclass = True`.

    distance_metric : str
        The pairwise distance metric used to estimate sigma. Must
        be one of the options used in `scipy.spatial.distance.cdist`.

    kernel_type : str
        The approximated kernel function used to calculate Zs.
        Must be one of `'Gaussian'`, `'Laplacian'`, or `'Cauchy'`.

    random_state : int
        Integer random_state used to set the seed for 
        reproducibility.

    allow_multiclass : bool
        If `False`, will ensure that cell labels are binary.

    class_threshold : str | int
        Number of samples allowed in the training data for each cell
        class in the training data. If `'median'`, the median number 
        of cells per cell class will be the threshold for number of 
        samples per class.

    reduction : str | None
        Choose which dimension reduction technique to perform on 
        features within a group. `'svd'` will run 
        `sklearn.decomposition.TruncatedSVD`, `'linear'` will multiply 
        by an array of 1s down to 50 dimensions, and `'pca'` will 
        replace each group's values with 50 PCs from principal 
        component analysis.
        
    tfidf : bool
        Whether to calculate TF-IDF transformation on peaks within 
        groupings.

    other_factor : float
        Factor passed to `get_optimal_d()` when `D` is calculated 
        from the number of samples.
        
    Returns
    -------
    adata : ad.AnnData
        AnnData with the following attributes and keys:

        `adata.X` (array_like):
            Data matrix.
    
        `adata.var_names` (array_like): 
            Feature names corresponding to `adata.X`.

        `adata.obs['labels']` (array_like):
            Cell classes/phenotypes from `cell_labels`.

        `adata.uns['train_indices']` (array_like):
            Indices for training data. 

        `adata.uns['test_indices']` (array_like):
            Indices for testing data.

        `adata.uns['group_dict']` (dict):
            Grouping information.

        `adata.uns['seed_obj']` (np.random._generator.Generator): 
            Seed object with seed equal to 100 * `random_state`.

        `adata.uns['D']` (int):
            Number of dimensions to run scMKL with.

        `adata.uns['scale_data']` (bool):
            Whether or not data is scaled.

        `adata.uns['transform_data']` (bool):
            Whether or not data is log1p transformed.

        `adata.uns['distance_metric']` (str): 
            Distance metric as given.
    
        `adata.uns['kernel_type']` (str): 
            Kernel function as given.

        `adata.uns['reduction']` (str): 
            Dimension reduction technique as given.

        `adata.uns['tfidf']` (bool): 
            Whether to calculate TF-IDF per grouping.

    Examples
    --------
    >>> data_mat = scipy.sparse.load_npz('MCF7_RNA_matrix.npz')
    >>> gene_names = np.load('MCF7_gene_names.pkl', allow_pickle = True)
    >>> group_dict = np.load('hallmark_genesets.pkl', 
    >>>                      allow_pickle = True)
    >>> 
    >>> adata = scmkl.create_adata(X = data_mat, 
    ...                            feature_names = gene_names, 
    ...                            group_dict = group_dict)
    >>> adata
    AnnData object with n_obs × n_vars = 1000 × 4341
    obs: 'labels'
    uns: 'group_dict', 'seed_obj', 'scale_data', 'D', 'kernel_type', 
    'distance_metric', 'train_indices', 'test_indices'
    """

    assert X.shape[1] == len(feature_names), ("Different number of features "
                                              "in X than feature names.")
    
    if not allow_multiclass:
        assert len(np.unique(cell_labels)) == 2, ("cell_labels must contain "
                                                  "2 classes.")

    kernel_options = ['gaussian', 'laplacian', 'cauchy']
    assert kernel_type.lower() in kernel_options, ("Given kernel type not "
                                                   "implemented. Gaussian, "
                                                   "Laplacian, and Cauchy "
                                                   "are the acceptable "
                                                   "types.")

    # Create adata object and add column names
    adata = ad.AnnData(X)
    adata.var_names = feature_names

    if isinstance(obs_names, (np.ndarray)):
        adata.obs_names = obs_names

    filtered_feature_names, group_dict = filter_features(feature_names, 
                                                         group_dict)
    
    # Ensuring that there are common features between feature_names and groups
    overlap_err = ("No common features between feature names and grouping "
                   "dict. Check grouping.")
    assert len(filtered_feature_names) > 0, overlap_err

    if remove_features:
        warnings.filterwarnings('ignore', 
                                category = ad.ImplicitModificationWarning)
        adata = adata[:, filtered_feature_names]
    
    gc.collect()

    # Add metadata to adata object
    adata.uns['group_dict'] = group_dict
    adata.uns['seed_obj'] = np.random.default_rng(100*random_state)
    adata.uns['scale_data'] = scale_data
    adata.uns['transform_data'] = transform_data
    adata.uns['kernel_type'] = kernel_type
    adata.uns['distance_metric'] = distance_metric
    adata.uns['reduction'] = reduction if isinstance(reduction, str) else 'None'
    adata.uns['tfidf'] = tfidf

    if (split_data is None):
        assert X.shape[0] == len(cell_labels), ("Different number of cells "
                                                "than labels")
        adata.obs['labels'] = cell_labels

        if (allow_multiclass == False):
            split = binary_split(cell_labels, 
                                 seed_obj = adata.uns['seed_obj'],
                                 train_ratio = train_ratio)
            train_indices, test_indices = split

        elif (allow_multiclass == True):
            split = multi_class_split(cell_labels, 
                                      seed_obj = adata.uns['seed_obj'], 
                                      class_threshold = class_threshold,
                                      train_ratio = train_ratio)
            train_indices, test_indices = split

        adata.uns['labeled_test'] = True

    else:
        sd_err_message = "`split_data` argument must be of type np.ndarray"
        assert isinstance(split_data, np.ndarray), sd_err_message
        x_eq_labs = X.shape[0] == len(cell_labels)
        # Labels may instead be given for the training cells only
        train_eq_labs = np.sum(split_data == 'train') == len(cell_labels)
        assert x_eq_labs or train_eq_labs, ("Must give labels for all cells "
                                            "or only for training cells")
        
        train_indices = np.where(split_data == 'train')[0]
        test_indices = np.where(split_data == 'test')[0]

        if len(cell_labels) == len(train_indices):

            padded_cell_labels = np.zeros((X.shape[0])).astype('object')
            padded_cell_labels[train_indices] = cell_labels
            padded_cell_labels[test_indices] = 'padded_test_label'

            adata.obs['labels'] = padded_cell_labels
            adata.uns['labeled_test'] = False

        elif len(cell_labels) == len(split_data):
            adata.obs['labels'] = cell_labels
            adata.uns['labeled_test'] = True

    # Ensuring all train samples are first in adata object followed by test
    sort_idx, train_indices, test_indices = sort_samples(train_indices, 
                                                         test_indices)

    adata = adata[sort_idx]

    if not isinstance(obs_names, (np.ndarray)):
        adata.obs = adata.obs.reset_index(drop=True)
        adata.obs.index = adata.obs.index.astype('O')

    adata.uns['train_indices'] = train_indices
    adata.uns['test_indices'] = test_indices

    adata.uns['D'] = get_optimal_d(adata, D, allow_multiclass, other_factor)

    if not scale_data:
        print("WARNING: Data will not be scaled. "
              "To change this behavior, set scale_data to True")
    if not transform_data:
        print("WARNING: Data will not be transformed. "
              "To change this behavior, set transform_data to True")

    return adata

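A hedged sketch of the split_data option described in the docstring: labels are supplied only for the training cells, and the code above pads the test cells with an internal placeholder label. X, feature_names, and group_dict are as in the earlier sketch; the shapes and label values are hypothetical.

import numpy as np
import scmkl

# 800 cells with known labels followed by 200 cells to classify
split = np.array(['train'] * 800 + ['test'] * 200)
train_labels = np.array(['tumor', 'normal'] * 400)  # one label per 'train' cell

adata = scmkl.create_adata(X=X, feature_names=feature_names,
                           cell_labels=train_labels, group_dict=group_dict,
                           split_data=split)
adata.uns['labeled_test']  # False: test labels are internal placeholders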
def estimate_sigma(adata: ad.AnnData,
                   n_features: int = 5000,
                   batches: int = 10, 
                   batch_size: int = 100) -> ad.AnnData:
    """
    Calculate kernel widths to inform distribution for projection of 
    Fourier Features. Calculates one sigma per group of features.

    Parameters
    ----------
    adata : ad.AnnData
        Created by `create_adata`.
    
    n_features : int
        Number of random features to include when estimating sigma. 
        Will be scaled for the whole pathway set according to a 
        heuristic. Used for scalability.

    batches : int
        The number of batches to use for the distance calculation.
        This will average the result of `batches` distance calculations
        of `batch_size` randomly sampled cells. More batches will converge
        to population distance values at the cost of scalability.

    batch_size : int
        The number of cells to include per batch for distance
        calculations. Higher batch size will converge to population
        distance values at the cost of scalability.
        If `batches` * `batch_size` > # training cells,
        `batch_size` will be reduced to `int(num training cells / 
        batches)`.
        
    Returns
    -------
    adata : ad.AnnData
        Key added `adata.uns['sigma']` with grouping kernel widths.

    Examples
    --------
    >>> adata = scmkl.estimate_sigma(adata)
    >>> adata.uns['sigma']
    array([10.4640895 , 10.82011454,  6.16769438,  9.86156855, ...])
    """
    assert batch_size <= len(adata.uns['train_indices']), ("Batch size must "
                                                           "be smaller than "
                                                           "the training "
                                                           "set.")

    if batch_size * batches > len(adata.uns['train_indices']):
        old_batch_size = batch_size
        batch_size = int(len(adata.uns['train_indices'])/batches)
        print("Specified batch size required too many cells for "
              "independent batches. Reduced batch size from "
              f"{old_batch_size} to {batch_size}", flush=True)

    if batch_size > 2000:
        print("Warning: Batch sizes over 2000 may "
              "result in long run-time.", flush=True)
        
    # Getting subsample indices
    sample_size = np.min((2000, adata.uns['train_indices'].shape[0]))
    indices = sample_cells(adata.uns['train_indices'], 
                           sample_size=sample_size, 
                           seed_obj=adata.uns['seed_obj'])

    # Getting batch indices
    sample_range = np.arange(sample_size)
    batch_idx = get_batches(sample_range, adata.uns['seed_obj'], 
                            batches, batch_size)

    # Loop over every group in group_dict
    sigma_array = np.zeros((len(adata.uns['group_dict'])))
    for m, group_features in enumerate(adata.uns['group_dict'].values()):

        n_group_features = len(group_features)

        # Filtering to only features in grouping using filtered view of adata
        X_train = get_group_mat(adata[indices], n_features=n_features, 
                                group_features=group_features, 
                                n_group_features=n_group_features)
        
        if adata.uns['tfidf']:
            X_train = tfidf(X_train, mode='normalize')

        # Data filtering, and transformation according to given data_type
        # Will remove low variance (< 1e5) features regardless of data_type
        # If scale_data will log scale and z-score the data
        X_train = process_data(X_train=X_train,
                               scale_data=adata.uns['scale_data'], 
                               transform_data=adata.uns['transform_data'],
                               return_dense=True)    

        # Estimating sigma
        sigma = est_group_sigma(adata, X_train, n_group_features, 
                                n_features, batch_idx=batch_idx)

        sigma_array[m] = sigma
    
    adata.uns['sigma'] = sigma_array
        
    return adata

def extract_results(results: dict, metric: str):
    """
    Summarizes a results dictionary into a `pd.DataFrame` with one 
    row per alpha: the given metric, the number of selected groups, 
    and the top-weighted group.
    """
    summary = {'Alpha' : list(),
               metric : list(),
               'Number of Selected Groups' : list(),
               'Top Group' : list()}
    
    alpha_list = list(results['Metrics'].keys())

    # Creating summary DataFrame for each model
    for alpha in alpha_list:
        cur_alpha_rows = results['Norms'][alpha]
        top_weight_rows = np.max(results['Norms'][alpha])
        top_group_index = np.where(cur_alpha_rows == top_weight_rows)
        num_selected = len(results['Selected_groups'][alpha])
        top_group_name = np.array(results['Group_names'])[top_group_index]
        
        if 0 == num_selected:
            top_group_name = ["No groups selected"]

        summary['Alpha'].append(alpha)
        summary[metric].append(results['Metrics'][alpha][metric])
        summary['Number of Selected Groups'].append(num_selected)
        summary['Top Group'].append(*top_group_name)
    
    return pd.DataFrame(summary)
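The expected shape of the results dictionary can be read off the code above; a minimal, hedged example with one alpha value and three hypothetical groups:

import numpy as np
import scmkl

results = {
    'Metrics': {0.1: {'AUROC': 0.91}},
    'Norms': {0.1: np.array([0.0, 2.3, 0.7])},  # one weight norm per group
    'Selected_groups': {0.1: ['Hypoxia', 'E2F Targets']},
    'Group_names': ['Apoptosis', 'Hypoxia', 'E2F Targets'],
}

summary = scmkl.extract_results(results, metric='AUROC')
# One row per alpha with columns:
# 'Alpha', 'AUROC', 'Number of Selected Groups', 'Top Group'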
def find_candidates(organism: str='human', key_terms: str | list='', 
                    blacklist: str | list | bool=False):
    """
    Given `organism` and `key_terms`, searches for gene set 
    libraries that could fit the dataset/classification task. 
    `blacklist` filters out groups whose names contain undesired 
    terms.

    Parameters
    ----------
    organism : str
        The species the gene grouping is for. Options are 
        `{'Human', 'Mouse', 'Yeast', 'Fly', 'Fish', 'Worm'}`.

    key_terms : str | list
        The types of cells or other specifiers the gene set is for 
        (example: 'CD4 T').

    blacklist : str | list | bool
        Term(s) undesired in group names. Ignored unless provided.

    Returns
    -------
    libraries : pd.DataFrame
        A table of gene set library names that could serve for the 
        dataset/classification task.        

    Examples
    --------
    >>> scmkl.find_candidates('human', key_terms=' b ')
                              Library  No. Gene Sets  No. Key Type Matching
    0                    Azimuth_2023           1241                      9
    1         Azimuth_Cell_Types_2021            341                      9
    2   Cancer_Cell_Line_Encyclopedia            967                      0
    3                 CellMarker_2024           1134                     21
    """
    check_organism(organism)

    if organism.lower() in global_lib_orgs:
        glo = global_lib_orgs.copy()
        glo.remove(organism)
        other_org = glo[0]
        libs = human_genesets
        libs = [name for name in libs if not other_org in name.lower()]
    else:
        libs = gp.get_library_name(organism)
        other_org = ''

    libs = {name : gp.get_library(name, organism)
            for name in libs}
    
    libs_df, _ = check_libs(libs, key_terms, blacklist, other_org)

    return libs_df

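A hedged sketch chaining this search with scmkl.get_gene_groupings() (documented below) to go from a candidate library to a usable grouping dict; the library choice and gene names are illustrative.

import scmkl

# Survey libraries whose gene set names mention B cells
candidates = scmkl.find_candidates('human', key_terms='b cell')
print(candidates)  # choose a library name from the returned table

# Hypothetical dataset features; groups overlapping fewer than
# min_overlap (default 2) of these are dropped
dataset_feats = ['CD19', 'MS4A1', 'CD79A', 'BANK1', 'IGHM']

group_dict = scmkl.get_gene_groupings('Azimuth_2023', organism='human',
                                      key_terms='b cell',
                                      genes=dataset_feats)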
def format_adata(adata: ad.AnnData | str, cell_labels: np.ndarray | str, 
                 group_dict: dict | str, use_raw: bool=False, 
                 scale_data: bool=True, transform_data: bool=False, 
                 split_data: np.ndarray | None=None, D: int | None=None, 
                 remove_features: bool=True, train_ratio: float=0.8, 
                 distance_metric: str='euclidean', kernel_type: str='Gaussian', 
                 random_state: int=1, allow_multiclass: bool = False, 
                 class_threshold: str | int | None=None, 
                 reduction: str | None = None, tfidf: bool = False):
    """
    Function to format an `ad.AnnData` object to carry all relevant 
    information going forward. `adata.obs_names` will be retained.

    **NOTE: Information not needed for running `scmkl` will be 
    removed.**

    Parameters
    ----------
    adata : ad.AnnData
        Object with data for `scmkl` to be applied to. The only 
        requirement is that `.var_names` is correct and the data 
        matrix is in `adata.X` or `adata.raw.X`. An h5ad file can be 
        provided as a `str` and it will be read in.

    cell_labels : np.ndarray | str
        If type `str`, the labels for `scmkl` to learn are captured 
        from `adata.obs['cell_labels']`. Else, a `np.ndarray` of cell 
        phenotypes corresponding with the cells in `adata.X`.

    group_dict : dict | str
        Dictionary containing feature grouping information (i.e. 
        `{geneset1: np.array([gene_1, gene_2, ..., gene_n]), geneset2: 
        np.array([...]), ...}`). A pickle file can be provided as a 
        `str` and it will be read in.

    use_raw : bool
        If `False`, will use `adata.X` to create new `adata`. Else, 
        will use `adata.raw.X`.

    scale_data : bool  
        If `True`, data matrix is log transformed and standard 
        scaled. 

    transform_data : bool
        If `True`, data will be log1p transformed (recommended for 
        counts data). Default is `False`. 
        
    split_data : None | np.ndarray
        If `None`, data will be split stratified by cell labels. 
        Else, is an array of a precalculated train/test split 
        corresponding to samples. Can include labels for the entire
        dataset to benchmark performance, or for only the training
        data to classify unknown cell types (i.e. `np.array(['train', 
        'test', ..., 'train'])`).

    D : int 
        Number of Random Fourier Features used to calculate Z. 
        Should be a positive integer. Higher values of D will 
        increase classification accuracy at the cost of computation 
        time. If set to `None`, will be calculated given the number 
        of samples. 
    
    remove_features : bool
        If `True`, will remove features from `X` and `feature_names` 
        not in `group_dict` and remove features from groupings not in 
        `feature_names`.

    train_ratio : float
        Ratio of the number of training samples to the entire data 
        set. Note: if a threshold is applied, the ratio of training 
        samples may decrease depending on class balance and the 
        `class_threshold` parameter if `allow_multiclass = True`.

    distance_metric : str
        The pairwise distance metric used to estimate sigma. Must
        be one of the options used in `scipy.spatial.distance.cdist`.

    kernel_type : str
        The approximated kernel function used to calculate Zs.
        Must be one of `'Gaussian'`, `'Laplacian'`, or `'Cauchy'`.

    random_state : int
        Integer random_state used to set the seed for 
        reproducibility.

    allow_multiclass : bool
        If `False`, will ensure that cell labels are binary.

    class_threshold : str | int
        Number of samples allowed in the training data for each cell
        class in the training data. If `'median'`, the median number 
        of cells per cell class will be the threshold for number of 
        samples per class.

    reduction : str | None
        Choose which dimension reduction technique to perform on 
        features within a group. `'svd'` will run 
        `sklearn.decomposition.TruncatedSVD`, `'linear'` will multiply 
        by an array of 1s down to 50 dimensions.
        
    tfidf : bool
        Whether to calculate TF-IDF transformation on peaks within 
        groupings.
        
    Returns
    -------
    adata : ad.AnnData
        AnnData with the following attributes and keys:

        `adata.X` (array_like):
            Data matrix.
    
        `adata.var_names` (array_like): 
            Feature names corresponding to `adata.X`.

        `adata.obs['labels']` (array_like):
            Cell classes/phenotypes from `cell_labels`.

        `adata.uns['train_indices']` (array_like):
            Indices for training data. 

        `adata.uns['test_indices']` (array_like):
            Indices for testing data.

        `adata.uns['group_dict']` (dict):
            Grouping information.

        `adata.uns['seed_obj']` (np.random._generator.Generator): 
            Seed object with seed equal to 100 * `random_state`.

        `adata.uns['D']` (int):
            Number of dimensions to run scMKL with.

        `adata.uns['scale_data']` (bool):
            Whether or not data is scaled.

        `adata.uns['transform_data']` (bool):
            Whether or not data is log1p transformed.

        `adata.uns['distance_metric']` (str): 
            Distance metric as given.
    
        `adata.uns['kernel_type']` (str): 
            Kernel function as given.

        `adata.uns['reduction']` (str): 
            Dimension reduction technique as given.

        `adata.uns['tfidf']` (bool): 
            Whether to calculate TF-IDF per grouping.

    Examples
    --------
    >>> adata = ad.read_h5ad('MCF7_rna.h5ad')
    >>> group_dict = np.load('hallmark_genesets.pkl', 
    >>>                      allow_pickle = True)
    >>> 
    >>> # The labels in adata.obs we want to learn are 'celltypes'
    >>> adata = scmkl.format_adata(adata, 'celltypes', 
    ...                            group_dict)
    >>> adata
    AnnData object with n_obs × n_vars = 1000 × 4341
    obs: 'labels'
    uns: 'group_dict', 'seed_obj', 'scale_data', 'D', 'kernel_type', 
    'distance_metric', 'train_indices', 'test_indices'
    """
    if str == type(adata):
        adata = ad.read_h5ad(adata)

    if str == type(group_dict):
        group_dict = np.load(group_dict, allow_pickle=True)
        
    if str == type(cell_labels):
        err_msg = f"{cell_labels} is not in `adata.obs`"
        assert cell_labels in adata.obs.keys(), err_msg
        cell_labels = adata.obs[cell_labels].to_numpy()
    
    if use_raw:
        assert adata.raw, "`adata.raw` is empty, set `use_raw` to `False`"
        X = adata.raw.X
    else:
        X = adata.X

    adata = create_adata(X, adata.var_names.to_numpy().copy(), cell_labels, 
                         group_dict, adata.obs_names.to_numpy().copy(), 
                         scale_data, transform_data, split_data, D, 
                         remove_features, train_ratio, distance_metric, 
                         kernel_type, random_state, allow_multiclass, 
                         class_threshold, reduction, tfidf)

    return adata

def format_group_names(group_names: list | pd.Series | np.ndarray, 
                       rm_words: list=list()):
    """
    Takes an ArrayLike object of group names and formats them.

    Parameters
    ----------
    group_names : array_like
        An array of group names to format.

    rm_words : list
        Words to remove from all group names.

    Returns
    -------
    new_group_names : list
        Formatted version of the input group names.

    Examples
    --------
    >>> groups = ['HALLMARK_E2F_TARGETS', 'HALLMARK_HYPOXIA']
    >>> new_groups = scmkl.format_group_names(groups)
    >>> new_groups
    ['Hallmark E2F Targets', 'Hallmark Hypoxia']
    """
    new_group_names = list()
    rm_words = [word.lower() for word in rm_words]

    for name in group_names:
        new_name = list()
        for word in re.split(r'_|\s', name):
            if word.isalpha() and (len(word) > 3):
                word = word.capitalize()
            if word.lower() not in rm_words:
                new_name.append(word)
        new_name = ' '.join(new_name)
        new_group_names.append(new_name)

    return new_group_names

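A quick follow-up on rm_words, derived from the code above (words are removed case-insensitively after splitting on underscores and whitespace):

>>> groups = ['HALLMARK_E2F_TARGETS', 'HALLMARK_HYPOXIA']
>>> scmkl.format_group_names(groups, rm_words=['hallmark'])
['E2F Targets', 'Hypoxia']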
def get_folds(adata: ad.AnnData, k: int):
    """
    Given the cell labels and a number of folds, returns the training 
    and testing indices for each of the k folds of cross validation.

    Parameters
    ----------
    adata : ad.AnnData
        `AnnData` object containing `'labels'` column in `.obs` and 
        `'train_indices'` in `.uns`.

    k : int
        The number of folds to perform cross validation over.

    Returns
    -------
    folds : dict
        A dictionary with keys being [0, k) and values being a tuple 
        whose first element is the training sample indices and whose 
        second is the testing sample indices for that fold.

    Examples
    --------
    >>> adata = scmkl.create_adata(...)
    >>> folds = scmkl.get_folds(adata, k=4)
    >>>
    >>> train_fold_0, test_fold_0 = folds[0]
    >>>
    >>> train_fold_0
    array([  0,   3,   5,   6,   8,   9,  10,  11,  12,  13])
    >>> test_fold_0
    array([  1,   2,  4,  7])
    """
    y = adata.obs['labels'][adata.uns['train_indices']].copy()

    # Creating dummy x prevents holding more views in memory
    x = y.copy()

    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=100)

    folds = dict()
    for fold, (fold_train, fold_test) in enumerate(skf.split(x, y)):
        folds[fold] = (fold_train, fold_test)

    return folds

With labels of samples for cross validation and number of folds, returns the indices and label for each k-folds.

Parameters
  • adata (ad.AnnData): AnnData object containing 'labels' column in .obs and 'train_indices' in .uns.
  • k (int): The number of folds to perform cross validation over.
Returns
  • folds (dict): A dictionary with keys being [0, k) and values being a tuple whose first element is training sample indices and whose second is testing sample indices.
Examples
>>> adata = scmkl.create_adata(...)
>>> folds = scmkl.get_folds(adata, k=4)
>>>
>>> train_fold_0, test_fold_0 = folds[0]
>>>
>>> train_fold_0
array([  0,   3,   5,   6,   8,   9,  10,  11,  12,  13])
>>> test_fold_0
array([  1,   2,  4,  7])
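Note that the fold indices are positions within adata.uns['train_indices'] (StratifiedKFold is run on the training labels only), so mapping them back to absolute sample indices looks like this sketch:
>>> train_idx = adata.uns['train_indices']
>>> for fold, (tr, te) in folds.items():
...     # map fold positions back to absolute sample indices
...     fold_train = train_idx[tr]
...     fold_test = train_idx[te]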
def get_gene_groupings( lib_name: str, organism: str = 'human', key_terms: str | list = '', blacklist: str | list | bool = False, min_overlap: int = 2, genes: list | tuple | pandas.core.series.Series | numpy.ndarray | set = []):
262def get_gene_groupings(lib_name: str, organism: str='human', key_terms: str | list='', 
263                       blacklist: str | list | bool=False, min_overlap: int=2,
264                      genes: list | tuple | pd.Series | np.ndarray | set=[]):
265    """
266    Takes a gene set library name and filters to groups containing 
267    element(s) in `key_terms`. If `genes` is provided, only groups 
268    with at least `min_overlap` genes present in `genes` are kept. 
269    Groups whose names contain a term in `blacklist` are removed. 
270    Resulting groups meet all of the aforementioned criteria.
271
272    Parameters
273    ----------
274    lib_name : str
275        The desired library name.
276
277    organism : str
278        The species the gene grouping is for. Options are 
279        `{'Human', 'Mouse', 'Yeast', 'Fly', 'Fish', 'Worm'}`.
280
281    key_terms : str | list
282        The types of cells or other specifiers the gene set is for 
283        (example: 'CD4 T').
284
285    genes : array_like
286        A vector of genes from the reference/query datasets. If not 
287        assigned, function will not filter groups based on feature 
288        overlap.
289
290    min_overlap : int
291        The minimum number of genes that must be present in a group 
292        for it to be kept. If `genes` is not given, ignored.
293
294    Returns
295    -------
296    lib : dict
297        The filtered library as a `dict` where keys are group names 
298        and values are features.
299
300    Examples
301    --------
302    >>> dataset_feats = [
303    ...    'FUCA1', 'CLIC4', 'STMN1', 'SYF2', 'TAS1R1', 
304    ...    'NOL9', 'TAS1R3', 'SLC2A5', 'THAP3', 'IGHM', 
305    ...    'MARCKS', 'BANK1', 'TNFRSF13B', 'IGKC', 'IGHD', 
306    ...    'LINC01857', 'CD24', 'CD37', 'IGHD', 'RALGPS2'
307    ...    ]
308    >>> rna_grouping = scmkl.get_gene_groupings(
309    ...   'Azimuth_2023', key_terms=[' b ', 'b cell', 'b '], 
310    ...   genes=dataset_feats)
311    >>>
312    >>> rna_grouping.keys()
313    dict_keys(['PBMC-L1-B Cell', 'PBMC-L2-Intermediate B Cell', ...])
314    """
315    check_organism(organism)
316    
317    lib = gp.get_library(lib_name, organism)
318
319    if organism.lower() in global_lib_orgs:
320        glo = global_lib_orgs.copy()
321        glo.remove(organism)
322        other_org = glo[0]
323    else:
324        other_org = ''
325
326    group_names = list(lib.keys())
327    res = check_groups(group_names, key_terms, blacklist, other_org)
328    del res['num_groups']
329
330    # Finding groups where group name matches key_terms
331    g_summary = pd.DataFrame(res)
332
333    if key_terms:
334        kept = g_summary['key_terms_in']
335        kept_groups = g_summary['name'][kept].to_numpy()
336        g_summary = g_summary[kept]
337    else:
338        print("Not filtering with `key_terms` parameter.")
339        kept_groups = g_summary['name'].to_numpy()
340
341    if blacklist:
342        kept = ~g_summary['blacklist_in']
343        kept_groups = g_summary['name'][kept].to_numpy()
344    else:
345        print("Not filtering with `blacklist` parameter.")
346    
347    # Filtering library
348    lib = {group : lib[group] for group in kept_groups}
349
350    if 0 < len(genes):
351        del_groups = list()
352        genes = list(set(genes))  # set() copies; tuples lack .copy()
353        for group, features in lib.items():
354            overlap = np.isin(features, genes)
355            overlap = np.sum(overlap)
356            if overlap < min_overlap:
358                del_groups.append(group)
359
360        # Removing genes without enough overlap
361        for group in del_groups:
362            print(f'Removing {group} from grouping.')
363            del lib[group]
364
365    else:
366        print("Not checking overlap between group and dataset features.")
367
368    return lib

Takes a gene set library name and filters to groups containing element(s) in key_terms. If genes is provided, only groups with at least min_overlap genes present in genes are kept. Groups whose names contain a term in blacklist are removed. Resulting groups meet all of the aforementioned criteria.

Parameters
  • lib_name (str): The desired library name.
  • organism (str): The species the gene grouping is for. Options are {'Human', 'Mouse', 'Yeast', 'Fly', 'Fish', 'Worm'}.
  • key_terms (str | list): The types of cells or other specifiers the gene set is for (example: 'CD4 T').
  • blacklist (str | list | bool): Term(s) that, if present in a group name, cause that group to be removed. If False, no blacklist filtering is applied.
  • genes (array_like): A vector of genes from the reference/query datasets. If not assigned, function will not filter groups based on feature overlap.
  • min_overlap (int): The minimum number of genes that must be present in a group for it to be kept. If genes is not given, ignored.
Returns
  • lib (dict): The filtered library as a dict where keys are group names and values are features.
Examples
>>> dataset_feats = [
...    'FUCA1', 'CLIC4', 'STMN1', 'SYF2', 'TAS1R1', 
...    'NOL9', 'TAS1R3', 'SLC2A5', 'THAP3', 'IGHM', 
...    'MARCKS', 'BANK1', 'TNFRSF13B', 'IGKC', 'IGHD', 
...    'LINC01857', 'CD24', 'CD37', 'IGHD', 'RALGPS2'
...    ]
>>> rna_grouping = scmkl.get_gene_groupings(
...   'Azimuth_2023', key_terms=[' b ', 'b cell', 'b '], 
...   genes=dataset_feats)
>>>
>>> rna_grouping.keys()
dict_keys(['PBMC-L1-B Cell', 'PBMC-L2-Intermediate B Cell', ...])
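The blacklist parameter can likewise prune groups whose names contain unwanted terms; a sketch with hypothetical terms:
>>> # hypothetical: keep B cell sets but drop plasmablast-related ones
>>> rna_grouping = scmkl.get_gene_groupings(
...     'Azimuth_2023', key_terms=['b cell'],
...     blacklist=['plasmablast'], genes=dataset_feats)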
def get_metrics(results: dict, include_as: bool = False) -> pandas.core.frame.DataFrame:
398def get_metrics(results: dict, include_as: bool=False) -> pd.DataFrame:
399    """
400    Takes either a single scMKL result or a dictionary where each 
401    entry corresponds to one result. Returns a dataframe with cols 
402    ['Alpha', 'Metric', 'Value']. If `include_as == True`, another 
403    col of booleans will be added to indicate whether or not the run 
404    respective to that alpha was chosen as optimal via CV. If 
405    multiple results are given, another column named 'Key' will be 
406    added with each result's key in the input dictionary (only 
407    applicable with multiple results).
408
409    Parameters
410    ----------
411    results : dict
412        A dictionary with the results of a single run from 
413        `scmkl.run()`, or a dictionary of multiple such results 
414        keyed by run name (for example, as returned by 
415        `scmkl.read_files()`). 
418
419    include_as : bool
420        When `True`, will add a bool col to output pd.DataFrame 
421        where rows with alphas corresponding to alpha_star will be 
422        `True`.
423
424    Returns
425    -------
426    df : pd.DataFrame
427        A pd.DataFrame containing all of the metrics present from 
428        the runs input.
429
430    Examples
431    --------
432    >>> # For a single file
433    >>> results = scmkl.run(adata)
434    >>> metrics = scmkl.get_metrics(results = results)
435
436    >>> # For multiple runs saved in a dict
437    >>> output_dir = 'scMKL_outputs/'
438    >>> rfiles = scmkl.read_files(output_dir)
439    >>> metrics = scmkl.get_metrics(results=rfiles)
440    """
441    # Checking which data is being worked with 
442    is_mult, is_many = _parse_result_type(results)
443
444    # Initiating col list with minimal columns
445    cols = ['Alpha', 'Metric', 'Value']
446
447    if include_as:
448        cols.append('Alpha Star')
449    if is_mult:
450        cols.append('Class')
451
452    if is_many:
453        cols.append('Key')
454        df = pd.DataFrame(columns = cols)
455        for key, result in results.items():
456            cur_df = parse_metrics(results = result, key = key, 
457                                   include_as = include_as)
458            df = pd.concat([df, cur_df.copy()])
459
460    else:
461        df = parse_metrics(results = results, include_as = include_as)
462
463    return df

Takes either a single scMKL result or a dictionary where each entry corresponds to one result. Returns a dataframe with cols ['Alpha', 'Metric', 'Value']. If include_as == True, another col of booleans will be added to indicate whether or not the run respective to that alpha was chosen as optimal via CV. If multiple results are given, another column named 'Key' will be added with each result's key in the input dictionary.

Parameters
  • results (dict): A dictionary with the results of a single run from scmkl.run, or a dictionary of multiple such results keyed by run name (for example, as returned by scmkl.read_files).
  • include_as (bool): When True, will add a bool col to output pd.DataFrame where rows with alphas corresponding to alpha_star will be True.
Returns
  • df (pd.DataFrame): A pd.DataFrame containing all of the metrics present from the runs input.
Examples
>>> # For a single file
>>> results = scmkl.run(adata)
>>> metrics = scmkl.get_metrics(results = results)
>>> # For multiple runs saved in a dict
>>> output_dir = 'scMKL_outputs/'
>>> rfiles = scmkl.read_files(output_dir)
>>> metrics = scmkl.get_metrics(results=rfiles)
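Because the output is a tidy dataframe, standard pandas operations apply downstream; for example, a sketch averaging AUROC per alpha across runs:
>>> auroc = metrics[metrics['Metric'] == 'AUROC']
>>> auroc.groupby('Alpha')['Value'].mean()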
def get_region_groupings( gene_anno: pandas.core.frame.DataFrame, gene_sets: dict, feature_names: numpy.ndarray | pandas.core.series.Series | list | set, len_up: int = 5000, len_down: int = 5000) -> dict:
290def get_region_groupings(gene_anno : pd.DataFrame, gene_sets : dict, 
291                       feature_names : np.ndarray | pd.Series | list | set,
292                       len_up : int = 5000, len_down : int = 5000) -> dict:
293    """
294    Creates a peak set where keys are gene set names from `gene_sets` 
295    and values are arrays of features pulled from `feature_names`. 
296    Features are added to each peak set given overlap between regions 
297    in single-cell data matrix and inferred gene promoter regions in 
298    `gene_anno`.
299
300    Parameters
301    ----------
302    gene_anno : pd.DataFrame
303        Gene annotations in GTF format as a pd.DataFrame with columns
304        `['chr', 'start', 'end', 'strand', 'gene_name']`.
305
306    gene_sets : dict
307        Gene set names as keys and an iterable object of gene names
308        as values.
309
310    feature_names : array_like | set
311        Feature names corresponding to a single-cell epigenetic data 
312        matrix.
313
314    Returns
315    -------
316    epi_grouping : dict
317        Keys are the names from `gene_sets` and values
318        are a list of regions from `feature_names` that overlap with 
319        promoter regions respective to genes in `gene_sets` (i.e., if 
320        a region in `feature_names` overlaps with a promoter region of a 
321        gene in a gene set from `gene_sets`, that region will be added 
322        to the new dictionary under the respective gene set name).
323
324    Examples
325    --------
326    >>> # Reading in a gene set and the peak names from dataset
327    >>> gene_sets = np.load("data/RNA_hallmark_groupings.pkl", 
328    ...                     allow_pickle = True)
329    >>> peaks = np.load("data/MCF7_region_names.npy", 
330    ...                 allow_pickle = True)
331    >>> 
332    >>> # Reading in GTF file
333    >>> gtf_path = "data/hg38_subset_protein_coding.annotation.gtf"
334    >>> gtf = scmkl.read_gtf(gtf_path, filter_to_coding=True)
335    >>>
336    >>> region_grouping = scmkl.get_region_groupings(gene_anno = gtf,
337    ...                                         gene_sets = gene_sets,
338    ...                                         feature_names = peaks)
339    >>>
340    >>> region_grouping.keys()
341    dict_keys(['HALLMARK_TNFA_SIGNALING_VIA_NFKB', ...])
342    """
343    # Getting a unique set of gene names from gene_sets
344    all_genes = {gene for group in gene_sets.keys()
345                 for gene in gene_sets[group]}
346    
347    # Filtering out NaN values
348    all_genes = [gene for gene in all_genes if type(gene) != float]
349
350    # Filtering out annotations for genes not present in gene_sets
351    gene_anno = gene_anno[np.isin(gene_anno['gene_name'], all_genes)]
352
353    # Adjusting start and end columns to represent promoter regions
354    gene_anno = adjust_regions(gene_anno = gene_anno, 
355                                len_up = len_up, len_down = len_down)
356
357    # Creating a dictionary from assay features where [chr] : (start, end)
358    feature_dict = create_feature_dict(feature_names)
359
360    # Creating data structures from gene_anno for comparing regions
361    peak_gene_dict, ga_regions = create_region_dicts(gene_anno)
362
363    # Capturing the separator type used in assay
364    chr_sep = ':' if ':' in feature_names[0] else '-'
365
366    epi_groupings = compare_regions(feature_dict = feature_dict,
367                                     ga_regions = ga_regions,
368                                     peak_gene_dict = peak_gene_dict,
369                                     gene_sets = gene_sets,
370                                     chr_sep = chr_sep)
371    
372    return epi_groupings

Creates a peak set where keys are gene set names from gene_sets and values are arrays of features pulled from feature_names. Features are added to each peak set given overlap between regions in single-cell data matrix and inferred gene promoter regions in gene_anno.

Parameters
  • gene_anno (pd.DataFrame): Gene annotations in GTF format as a pd.DataFrame with columns ['chr', 'start', 'end', 'strand', 'gene_name'].
  • gene_sets (dict): Gene set names as keys and an iterable object of gene names as values.
  • feature_names (array_like | set): Feature names corresponding to a single-cell epigenetic data matrix.
  • len_up (int): The number of bases upstream of a gene's annotated start used when inferring its promoter region (default 5000).
  • len_down (int): The number of bases downstream of a gene's annotated start used when inferring its promoter region (default 5000).
Returns
  • epi_grouping (dict): Keys are the names from gene_sets and values are a list of regions from feature_names that overlap with promoter regions respective to genes in gene_sets (i.e., if a region in feature_names overlaps with a promoter region of a gene in a gene set from gene_sets, that region will be added to the new dictionary under the respective gene set name).
Examples
>>> # Reading in a gene set and the peak names from dataset
>>> gene_sets = np.load("data/RNA_hallmark_groupings.pkl", 
...                     allow_pickle = True)
>>> peaks = np.load("data/MCF7_region_names.npy", 
...                 allow_pickle = True)
>>> 
>>> # Reading in GTF file
>>> gtf_path = "data/hg38_subset_protein_coding.annotation.gtf"
>>> gtf = scmkl.read_gtf(gtf_path, filter_to_coding=True)
>>>
>>> region_grouping = scmkl.get_region_groupings(gene_anno = gtf,
...                                         gene_sets = gene_sets,
...                                         feature_names = peaks)
>>>
>>> region_grouping.keys()
dict_keys(['HALLMARK_TNFA_SIGNALING_VIA_NFKB', ...])
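The resulting grouping can then serve as the group_dict for an epigenetic modality; a minimal sketch, assuming atac_mat and cell_labels (both hypothetical here) are already loaded:
>>> atac_adata = scmkl.create_adata(X=atac_mat, 
...                                 feature_names=peaks, 
...                                 scale_data=False, 
...                                 cell_labels=cell_labels, 
...                                 group_dict=region_grouping)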
def get_selection( weights_df: pandas.core.frame.DataFrame, order_groups: bool = False) -> pandas.core.frame.DataFrame:
531def get_selection(weights_df: pd.DataFrame, 
532                  order_groups: bool=False) -> pd.DataFrame:
533    """
534    This function takes a pd.DataFrame created by 
535    `scmkl.get_weights()` and returns a selection table. Selection 
536    refers to how many times a group had a nonzero group weight. To 
537    calculate this, a col is added indicating whether the group was 
538    selected. Then, the dataframe is grouped by alpha and group. 
539    Selection can then be summed returning a dataframe with cols 
540    `['Alpha', 'Group', 'Selection']`. If the input is the result of 
541    multiclass run(s), a `'Class'` column must be present and will be 
542    in the resulting df as well.
543
544    Parameters
545    ----------
546    weights_df : pd.DataFrame
547        A dataframe output by `scmkl.get_weights()` with cols
548        `['Alpha', 'Group', 'Kernel Weight']`. If the input is the 
549        result of multiclass run(s), a `'Class'` column must be present.
550
551    order_groups : bool
552        If `True`, the `'Group'` col of the output dataframe will be 
553        made into a `pd.Categorical` col ordered by number of times 
554        each group was selected in descending order.
555
556    Returns
557    -------
558    df : pd.DataFrame
559        A dataframe with cols `['Alpha', 'Group', 'Selection']`, plus 
560        a `'Class'` column if the input is a multiclass result.
561
562    Example
563    -------
564    >>> # For a single file
565    >>> results = scmkl.run(adata)
566    >>> weights = scmkl.get_weights(results = results)
567    >>> selection = scmkl.get_selection(weights)
568    
569    >>> # For multiple runs saved in a dict
570    >>> output_dir = 'scMKL_outputs/'
571    >>> rfiles = scmkl.read_files(output_dir)
572    >>> weights = scmkl.get_weights(rfiles=rfiles)
573    >>> selection = scmkl.get_selection(weights)
574    """
575    # Adding col indicating whether or not groups have nonzero weight
576    selection = weights_df['Kernel Weight'].apply(lambda x: x > 0)
577    weights_df['Selection'] = selection
578
579    # Summing selection across replications to get selection
580    is_mult = 'Class' in weights_df.columns
581    if is_mult:
582        df = weights_df.groupby(['Alpha', 'Group', 'Class'])['Selection'].sum()
583    else:
584        df = weights_df.groupby(['Alpha', 'Group'])['Selection'].sum()
585    df = df.reset_index()
586
587    # Getting group order
588    if order_groups and not is_mult:
589        order = df.groupby('Group')['Selection'].sum()
590        order = order.reset_index().sort_values(by = 'Selection', 
591                                                ascending = False)
592        order = order['Group']
593        df['Group'] = pd.Categorical(df['Group'], categories = order)
594
595
596    return df

This function takes a pd.DataFrame created by scmkl.get_weights() and returns a selection table. Selection refers to how many times a group had a nonzero group weight. To calculate this, a col is added indicating whether the group was selected. Then, the dataframe is grouped by alpha and group. Selection can then be summed, returning a dataframe with cols ['Alpha', 'Group', 'Selection']. If the input is the result of multiclass run(s), a 'Class' column must be present and will be in the resulting df as well.

Parameters
  • weights_df (pd.DataFrame): A dataframe output by scmkl.get_weights() with cols ['Alpha', 'Group', 'Kernel Weight']. If the input is the result of multiclass run(s), a 'Class' column must be present as well.
  • order_groups (bool): If True, the 'Group' col of the output dataframe will be made into a pd.Categorical col ordered by number of times each group was selected in descending order.
Returns
  • df (pd.DataFrame): A dataframe with cols ['Alpha', 'Group', 'Selection'], plus a 'Class' column if the input is a multiclass result.
Example
>>> # For a single file
>>> results = scmkl.run(adata)
>>> weights = scmkl.get_weights(results = results)
>>> selection = scmkl.get_selection(weights)
>>> # For multiple runs saved in a dict
>>> output_dir = 'scMKL_outputs/'
>>> rfiles = scmkl.read_files(output_dir)
>>> weights = scmkl.get_weights(rfiles=rfiles)
>>> selection = scmkl.get_selection(weights)
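Since the output is a plain dataframe, the most frequently selected groups at a given alpha can be pulled with ordinary pandas filtering; a sketch (the 0.05 alpha is illustrative):
>>> chosen = selection[(selection['Alpha'] == 0.05)
...                    & (selection['Selection'] > 0)]
>>> chosen.sort_values('Selection', ascending=False).head()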
def get_summary(results: dict, metric: str = 'AUROC'):
294def get_summary(results: dict, metric: str='AUROC'):
295    """
296    Takes the results from `scmkl.run()` and generates a summary 
297    dataframe per model containing columns for alpha, area under 
298    the ROC curve, number of groups with nonzero weights, and the 
299    highest weighted group.
300
301    Parameters
302    ----------
303    results : dict
304        A dictionary of results from scMKL generated from 
305        `scmkl.run()`.
306
307    metric : str
308        Which metric to include in the summary. Default is AUROC. 
309        Options include `'AUROC'`, `'Recall'`, `'Precision'`, 
310        `'Accuracy'`, and `'F1-Score'`.
311
312    Returns
313    -------
314    summary_df : pd.DataFrame
315        A table with columns: `['Alpha', 'AUROC', 
316        'Number of Selected Groups', 'Top Group']`.
317    
318    Examples
319    --------
320    >>> results = scmkl.run(adata, alpha_list)
321    >>> summary_df = scmkl.get_summary(results)
322    ...
323    >>> summary_df.head()
324        Alpha   AUROC  Number of Selected Groups 
325    0   2.20  0.8600                          3   
326    1   1.96  0.9123                          4   
327    2   1.72  0.9357                          5   
328    3   1.48  0.9524                          7   
329    4   1.24  0.9666                          9   
330        Top Group
331    0   RNA-HALLMARK_E2F_TARGETS
332    1   RNA-HALLMARK_ESTROGEN_RESPONSE_EARLY
333    2   RNA-HALLMARK_ESTROGEN_RESPONSE_EARLY
334    3   RNA-HALLMARK_ESTROGEN_RESPONSE_EARLY
335    4   RNA-HALLMARK_ESTROGEN_RESPONSE_EARLY
336    """
337    is_multi, is_many = _parse_result_type(results)
338    assert not is_many, "This function only supports single results"
339    
340    if is_multi:
341        summaries = list()
342        for ct in results['Classes']:
343            data = extract_results(results[ct], metric)
344            data['Class'] = [ct]*len(data)
345            summaries.append(data.copy())
346        summary = pd.concat(summaries)
347
348    else:
349        summary = extract_results(results, metric)
350
351    return summary

Takes the results from scmkl.run and generates a summary dataframe per model containing columns for alpha, area under the ROC curve, number of groups with nonzero weights, and the highest weighted group.

Parameters
  • results (dict): A dictionary of results from scMKL generated from scmkl.run.
  • metric (str): Which metric to include in the summary. Default is AUROC. Options include 'AUROC', 'Recall', 'Precision', 'Accuracy', and 'F1-Score'.
Returns
  • summary_df (pd.DataFrame): A table with columns: ['Alpha', 'AUROC', 'Number of Selected Groups', 'Top Group'].
Examples
>>> results = scmkl.run(adata, alpha_list)
>>> summary_df = scmkl.get_summary(results)
...
>>> summary_df.head()
    Alpha   AUROC  Number of Selected Groups 
0   2.20  0.8600                          3   
1   1.96  0.9123                          4   
2   1.72  0.9357                          5   
3   1.48  0.9524                          7   
4   1.24  0.9666                          9   
    Top Group
0   RNA-HALLMARK_E2F_TARGETS
1   RNA-HALLMARK_ESTROGEN_RESPONSE_EARLY
2   RNA-HALLMARK_ESTROGEN_RESPONSE_EARLY
3   RNA-HALLMARK_ESTROGEN_RESPONSE_EARLY
4   RNA-HALLMARK_ESTROGEN_RESPONSE_EARLY
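A common follow-up is pulling the row with the best metric value; a sketch assuming the default AUROC metric:
>>> best = summary_df.loc[summary_df['AUROC'].idxmax()]
>>> best['Alpha'], best['Top Group']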
def get_weights(results: dict, include_as: bool = False) -> pandas.core.frame.DataFrame:
466def get_weights(results : dict, include_as : bool = False) -> pd.DataFrame:
467    """
468    Takes either a single scMKL result or dictionary of results and 
469    returns a pd.DataFrame with cols ['Alpha', 'Group', 
470    'Kernel Weight']. If `include_as == True`, a fourth col will be 
471    added to indicate whether or not the run respective to that alpha 
472    was chosen as optimal via cross validation.
473
474    Parameters
475    ----------
476    results : dict
477        A dictionary with the results of a single run from 
478        `scmkl.run()`, or a dictionary of multiple such results 
479        keyed by run name (e.g., as returned by 
480        `scmkl.read_files()`). 
483
484    include_as : bool
485        When `True`, will add a bool col to output pd.DataFrame 
486        where rows with alphas corresponding to alpha_star will be 
487        `True`.
488
489    Returns
490    -------
491    df : pd.DataFrame
492        A pd.DataFrame containing all of the groups from each alpha 
493        and their corresponding kernel weights.
494
495    Examples
496    --------
497    >>> # For a single file
498    >>> results = scmkl.run(adata)
499    >>> weights = scmkl.get_weights(results = results)
500    
501    >>> # For multiple runs saved in a dict
502    >>> output_dir = 'scMKL_outputs/'
503    >>> rfiles = scmkl.read_files(output_dir)
504    >>> weights = scmkl.get_weights(results=rfiles)
505    """
506    # Checking which data is being worked with 
507    is_mult, is_many = _parse_result_type(results)
508
509    # Initiating col list with minimal columns
510    cols = ['Alpha', 'Group', 'Kernel Weight']
511
512    if include_as:
513        cols.append('Alpha Star')
514    if is_mult:
515        cols.append('Class')
516
517    if is_many:
518        cols.append('Key')
519        df = pd.DataFrame(columns = cols)
520        for key, result in results.items():
521            cur_df = parse_weights(results = result, key = key, 
522                                     include_as = include_as)
523            df = pd.concat([df, cur_df.copy()])
524            
525    else:
526        df = parse_weights(results = results, include_as = include_as)
527
528    return df

Takes either a single scMKL result or dictionary of results and returns a pd.DataFrame with cols ['Alpha', 'Group', 'Kernel Weight']. If include_as == True, a fourth col will be added to indicate whether or not the run respective to that alpha was chosen as optimal via cross validation.

Parameters
  • results (dict): A dictionary with the results of a single run from scmkl.run, or a dictionary of multiple such results keyed by run name (e.g., as returned by scmkl.read_files).
  • include_as (bool): When True, will add a bool col to output pd.DataFrame where rows with alphas corresponding to alpha_star will be True.
Returns
  • df (pd.DataFrame): A pd.DataFrame containing all of the groups from each alpha and their corresponding kernel weights.
Examples
>>> # For a single file
>>> results = scmkl.run(adata)
>>> weights = scmkl.get_weights(results = results)
>>> # For multiple runs saved in a dict
>>> output_dir = 'scMKL_outputs/'
>>> rfiles = scmkl.read_files(output_dir)
>>> weights = scmkl.get_weights(results=rfiles)
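With include_as=True, the boolean 'Alpha Star' column makes it easy to restrict the table to the CV-chosen run; a sketch:
>>> weights = scmkl.get_weights(results=results, include_as=True)
>>> star_weights = weights[weights['Alpha Star']]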
def groups_per_alpha(selection_df: pandas.core.frame.DataFrame) -> dict:
599def groups_per_alpha(selection_df: pd.DataFrame) -> dict:
600    """
601    This function takes a pd.DataFrame from `scmkl.get_selection()` 
602    generated from multiple scMKL results and returns a dictionary 
603    with keys being alphas from the input dataframe and values being 
604    the mean number of selected groups for a given alpha across 
605    results. 
606
607    Parameters
608    ----------
609    selection_df : pd.DataFrame
610        A dataframe output by `scmkl.get_selection()` with cols 
611        `['Alpha', 'Group', 'Selection']`.
612    
613    Returns
614    -------
615    mean_groups : dict
616        A dictionary with alphas as keys and the mean number of 
617        selected groups for that alpha as values.
618
619    Examples
620    --------
621    >>> weights = scmkl.get_weights(rfiles)
622    >>> selection = scmkl.get_selection(weights)
623    >>> mean_groups = scmkl.groups_per_alpha(selection)
624    >>> mean_groups = {alpha : np.round(num_selected, 1)
625    ...                for alpha, num_selected in mean_groups.items()}
626    >>>
627    >>> print(mean_groups)
628    {0.05 : 50.0, 0.2 : 24.7, 1.1 : 5.3}
629    """
630    mean_groups = {}
631    for alpha in np.unique(selection_df['Alpha']):
632
633        # Capturing rows for given alpha
634        rows = selection_df['Alpha'] == alpha
635
636        # Adding mean number of groups for alpha
637        mean_groups[alpha] = np.mean(selection_df[rows]['Selection'])
638
639    return mean_groups

This function takes a pd.DataFrame from scmkl.get_selection() generated from multiple scMKL results and returns a dictionary with keys being alphas from the input dataframe and values being the mean number of selected groups for a given alpha across results.

Parameters
  • selection_df (pd.DataFrame): A dataframe output by scmkl.get_selection() with cols ['Alpha', 'Group', 'Selection'].
Returns
  • mean_groups (dict): A dictionary with alphas as keys and the mean number of selected groups for that alpha as values.
Examples
>>> weights = scmkl.get_weights(rfiles)
>>> selection = scmkl.get_selection(weights)
>>> mean_groups = scmkl.groups_per_alpha(selection)
>>> mean_groups = {alpha : np.round(num_selected, 1)
...                for alpha, num_selected in mean_groups.items()}
>>>
>>> print(mean_groups)
{0.05 : 50.0, 0.2 : 24.7, 1.1 : 5.3}
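The returned dict can drive alpha selection directly; a sketch picking the largest (sparsest) alpha that still retains a hypothetical target of at least 10 groups on average:
>>> target = 10
>>> viable = [a for a, n in mean_groups.items() if n >= target]
>>> chosen_alpha = max(viable)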
def group_umap( adata: anndata._core.anndata.AnnData, g_name: str | list, is_binary: bool = False, labels: None | numpy.ndarray | list = None, title: str = '', save: str = ''):
613def group_umap(adata: ad.AnnData, g_name: str | list, is_binary: bool=False, 
614               labels: None | np.ndarray | list=None, title: str='', 
615               save: str=''):
616    """
617    Uses a scmkl formatted `ad.AnnData` object to show sample 
618    separation using scMKL-discovered groupings.
619
620    Parameters
621    ----------
622    adata : ad.AnnData
623        A scmkl formatted `ad.AnnData` object with `'group_dict'` in 
624        `.uns`.
625
626    g_name : str | list
627        The group(s) whose features should be used to filter `adata`. 
628        If a list is given, features from multiple groups will be used.
629    
630    is_binary : bool
631        If `True`, data will be processed using `muon` which includes 
632        TF-IDF normalization and LSI.
633
634    labels : None | np.ndarray | list
635        If `None`, labels in `adata.obs['labels']` will be used to 
636        color umap points. Else, provided labels will be used to color 
637        points.
638
639    title : str
640        The title of the plot.
641
642    save : str
643        If provided, plot will be saved using `scanpy`'s `save` 
644        argument. Should be the desired file name. Output will be 
645        `<cwd>/figures/<save>`.
646
647    Returns
648    -------
649    None
650
651    Examples
652    --------
653    >>> adata_fp = 'data/_pbmc_rna.h5ad'
654    >>> group_fp = 'data/_RNA_azimuth_pbmc_groupings.pkl'
655    >>> adata = scmkl.format_adata(adata_fp, 'celltypes', group_fp, 
656    ...                            allow_multiclass=True)
657    >>> scmkl.group_umap(adata, 'CD16+ Monocyte Markers')
658
659    ![group_umap](../tests/figures/umap_group_rna.png)
660    """
661    if isinstance(g_name, list):
662        feats = {feature 
663                 for group in g_name 
664                 for feature in adata.uns['group_dict'][group]}
665        feats = np.array(list(feats))
666    else:
667        feats = np.array(list(adata.uns['group_dict'][g_name]))
668
669    if labels is not None:
670        assert len(labels) == adata.shape[0], "`labels` do not match `adata`"
671        adata.obs['labels'] = labels
672
673    var_names = adata.var_names.to_numpy()
674
675    col_filter = np.isin(var_names, feats)
676    adata = adata[:, col_filter].copy()
677
678    if not is_binary:
679        sc.pp.normalize_total(adata)
680        sc.pp.log1p(adata)
681        sc.tl.pca(adata)
682        sc.pp.neighbors(adata)
683        sc.tl.umap(adata, random_state=1)
684
685    else:
686        ac.pp.tfidf(adata, scale_factor=1e4)
687        sc.pp.normalize_total(adata)
688        sc.pp.log1p(adata)
689        ac.tl.lsi(adata)
690        sc.pp.scale(adata)
691        sc.tl.pca(adata)
692        sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)
693        sc.tl.umap(adata, random_state=1)
694
695    if save:
696        sc.pl.umap(adata, title=title, color='labels', save=save, show=False)
697
698    else:
699        sc.pl.umap(adata, title=title, color='labels')
700
701    return None

Uses a scmkl formatted ad.AnnData object to show sample separation using scMKL-discovered groupings.

Parameters
  • adata (ad.AnnData): A scmkl formatted ad.AnnData object with 'group_dict' in .uns.
  • g_name (str | list): The group(s) whose features should be used to filter adata. If a list is given, features from multiple groups will be used.
  • is_binary (bool): If True, data will be processed using muon which includes TF-IDF normalization and LSI.
  • labels (None | np.ndarray | list): If None, labels in adata.obs['labels'] will be used to color umap points. Else, provided labels will be used to color points.
  • title (str): The title of the plot.
  • save (str): If provided, plot will be saved using scanpy's save argument. Should be the desired file name. Output will be <cwd>/figures/<save>.
Returns
  • None
Examples
>>> adata_fp = 'data/_pbmc_rna.h5ad'
>>> group_fp = 'data/_RNA_azimuth_pbmc_groupings.pkl'
>>> adata = scmkl.format_adata(adata_fp, 'celltypes', group_fp, 
...                            allow_multiclass=True)
>>> scmkl.group_umap(adata, 'CD16+ Monocyte Markers')

[image: group_umap]
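Multiple groups can also be combined into a single embedding, and the figure written to <cwd>/figures/; a sketch with hypothetical group names:
>>> scmkl.group_umap(adata, ['B Cell Markers', 'T Cell Markers'],
...                  title='Lymphocyte markers',
...                  save='lymphocyte_umap.png')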

def multimodal_processing( adatas: list[anndata._core.anndata.AnnData], names: list[str], tfidf: list[bool], combination: str = 'concatenate', batches: int = 10, batch_size: int = 100, verbose: bool = True) -> anndata._core.anndata.AnnData:
101def multimodal_processing(adatas : list[ad.AnnData], names : list[str], 
102                          tfidf: list[bool], combination: str='concatenate', 
103                          batches: int=10, batch_size: int=100, 
104                          verbose: bool=True) -> ad.AnnData:
105    """
106    Combines and processes a list of `ad.AnnData` objects.
107
108    Parameters
109    ----------
110    adatas : list[ad.AnnData]
111        List of `ad.AnnData` objects where each object is a different 
112        modality. Annotations must match between objects (i.e. same 
113        sample order).
114
115    names : list[str]
116        List of string names for each modality respective to each 
117        object in adatas.
118            
119    combination: str
120        How to combine the matrices, either `'sum'` or `'concatenate'`.
121    
122    tfidf : list[bool]
123        If element `i` is `True`, `adata[i]` will be TF-IDF normalized.
124
125    batches : int
126        The number of batches to use for the distance calculation.
127        This will average the result of `batches` distance calculations
128        of `batch_size` randomly sampled cells. More batches will converge
129        to population distance values at the cost of scalability.
130
131    batch_size : int
132        The number of cells to include per batch for distance
133        calculations. Higher batch size will converge to population
134        distance values at the cost of scalability.
135        If `batches*batch_size > num_training_cells`, `batch_size` 
136        will be reduced to `int(num_training_cells / batches)`.
137
138    Returns
139    -------
140    adata : ad.AnnData
141        Combined from the objects in `adatas` with Z matrices 
142        calculated.
143
144    Examples
145    --------
146    >>> rna_adata = scmkl.create_adata(X = mcf7_rna_mat, 
147    ...                                feature_names=gene_names, 
148    ...                                scale_data=True, 
149    ...                                transform_data=True,
150    ...                                cell_labels=cell_labels, 
151    ...                                 group_dict=rna_grouping)
152    >>>
153    >>> atac_adata = scmkl.create_adata(X=mcf7_atac_mat, 
154    ...                                 feature_names=peak_names, 
155    ...                                 scale_data=False, 
156    ...                                 cell_labels=cell_labels, 
157    ...                                 group_dict=atac_grouping)
158    >>>
159    >>> adatas = [rna_adata, atac_adata]
160    >>> mod_names = ['rna', 'atac']
161    >>> adata = scmkl.multimodal_processing(adatas = adatas, 
162    ...                                     names = mod_names,
163    ...                                     tfidf = [False, True])
164    >>>
165    >>> adata
166    AnnData object with n_obs Ă— n_vars = 1000 Ă— 12676
167    obs: 'labels'
168    var: 'labels'
169    uns: 'D', 'kernel_type', 'distance_metric', 'train_indices',  
170    'test_indices', 'Z_train', 'Z_test', 'group_dict', 'seed_obj'
171    """
172    import warnings 
173    warnings.filterwarnings('ignore')
174
175    diff_num_warn = "Different number of cells present in each object."
176    assert len({adata.shape[0] for adata in adatas}) == 1, diff_num_warn
177    
178    # True if all train indices match
179    same_train = np.all([np.array_equal(adatas[0].uns['train_indices'], 
180                                        adatas[i].uns['train_indices']) 
181                         for i in range(1, len(adatas))])
182
183    # True if all test indices match
184    same_test = np.all([np.array_equal(adatas[0].uns['test_indices'], 
185                                       adatas[i].uns['test_indices']) 
186                        for i in range(1, len(adatas))])
187
188    assert same_train, "Different train indices"
189    assert same_test, "Different test indices"
190
191    # Creates a boolean array for each modality of cells with non-empty rows
192    non_empty_rows = [np.array(sparse_var(adata.X, axis = 1) != 0).ravel() 
193                      for adata in adatas]
194    non_empty_rows = np.transpose(non_empty_rows)
195
196    # Returns a 1D array where sample feature sums non-0 across all modalities
197    non_empty_rows = np.array([np.all(non_empty_rows[i])
198                              for i in range(non_empty_rows.shape[0])])
199
200    # Initializing final train test split array
201    train_test = np.repeat('train', adatas[0].shape[0])
202    train_test[adatas[0].uns['test_indices']] = 'test'
203
204    # Capturing train test split with empty rows filtered out
205    train_test = train_test[non_empty_rows]
206    train_indices = np.where(train_test == 'train')[0]
207    test_indices = np.where(train_test == 'test')[0]
208
209    # Adding train test split arrays to AnnData objects 
210    # and filtering out empty samples
211    for i, adata in enumerate(adatas):
212        adatas[i].uns['train_indices'] = train_indices
213        adatas[i].uns['test_indices'] = test_indices
214        adatas[i] = adata[non_empty_rows, :]
215
216        # tfidf normalizing if corresponding element in tfidf is True
217        if tfidf[i]:
218            adatas[i] = tfidf_normalize(adatas[i])
219        
220        if verbose:
221            print(f"Estimating sigma and calculating Z for {names[i]}", 
222                  flush = True)
223        adatas[i] = calculate_z(adatas[i], n_features = 5000, 
224                                batches=batches, batch_size=batch_size)
225
226    if 'labels' in adatas[0].obs:
227        all_labels = [adata.obs['labels'] for adata in adatas]
228        # Ensuring cell labels for each AnnData object are the same
229        uneq_labs_warn = ("Cell labels between AnnData object in position 0 "
230                          "and position {} in adatas do not match")
231        for i in range(1, len(all_labels)):
232            same_labels = np.all(all_labels[0] == all_labels[i])
233            assert same_labels, uneq_labs_warn.format(i)
234
235    adata = combine_modalities(adatas=adatas,
236                                names=names,
237                                combination=combination)
238
239    del adatas
240    gc.collect()
241
242    return adata    

Combines and processes a list of ad.AnnData objects.

Parameters
  • adatas (list[ad.AnnData]): List of ad.AnnData objects where each object is a different modality. Annotations must match between objects (i.e. same sample order).
  • names (list[str]): List of string names for each modality respective to each object in adatas.
  • combination (str): How to combine the matrices, either 'sum' or 'concatenate'.
  • tfidf (list[bool]): If element i is True, adata[i] will be TF-IDF normalized.
  • batches (int): The number of batches to use for the distance calculation. This will average the result of batches distance calculations of batch_size randomly sampled cells. More batches will converge to population distance values at the cost of scalability.
  • batch_size (int): The number of cells to include per batch for distance calculations. Higher batch size will converge to population distance values at the cost of scalability. If batches*batch_size > num_training_cells, batch_size will be reduced to int(num_training_cells / batches).
Returns
  • adata (ad.AnnData): Combined from the objects in adatas with Z matrices calculated.
Examples
>>> rna_adata = scmkl.create_adata(X = mcf7_rna_mat, 
...                                feature_names=gene_names, 
...                                scale_data=True, 
...                                transform_data=True,
...                                cell_labels=cell_labels, 
...                                 group_dict=rna_grouping)
>>>
>>> atac_adata = scmkl.create_adata(X=mcf7_atac_mat, 
...                                 feature_names=peak_names, 
...                                 scale_data=False, 
...                                 cell_labels=cell_labels, 
...                                 group_dict=atac_grouping)
>>>
>>> adatas = [rna_adata, atac_adata]
>>> mod_names = ['rna', 'atac']
>>> adata = scmkl.multimodal_processing(adatas = adatas, 
...                                     names = mod_names,
...                                     tfidf = [False, True])
>>>
>>> adata
AnnData object with n_obs Ă— n_vars = 1000 Ă— 12676
obs: 'labels'
var: 'labels'
uns: 'D', 'kernel_type', 'distance_metric', 'train_indices',  
'test_indices', 'Z_train', 'Z_test', 'group_dict', 'seed_obj'
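The combined object is ready for model fitting; a sketch passing it straight to scmkl.run with a small alpha grid (assuming numpy is imported as np):
>>> alpha_list = np.array([0.05, 0.1, 0.5])
>>> results = scmkl.run(adata, alpha_list)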
def one_v_rest( adatas: list | anndata._core.anndata.AnnData, names: list, alpha_params: numpy.ndarray, tfidf: list = None, batches: int = 10, batch_size: int = 100, train_dict: dict = None, force_balance: bool = False, other_factor: float = 1.0) -> dict:
232def one_v_rest(adatas : list | ad.AnnData, names : list, 
233               alpha_params : np.ndarray, tfidf : list=None, batches: int=10, 
234               batch_size: int=100, train_dict: dict=None, 
235               force_balance: bool=False, other_factor: float=1.0)-> dict:
236    """
237    For each cell class, creates model(s) comparing that class to all 
238    others. Then, predicts on the testing data using `scmkl.run()`.
239    Only labels present in both training and testing data are run.
240
241    Parameters
242    ----------
243    adatas : list[AnnData]
244        List of `ad.AnnData` objects created by `create_adata()` 
245        where each `ad.AnnData` is one modality and composed of both 
246        training and testing samples. Requires that `'train_indices'`
247        and `'test_indices'` are the same between all `ad.AnnData`s.
248
249    names : list[str]
250        String variables that describe each modality respective to 
251        `adatas` for labeling.
252        
253    alpha_params : np.ndarray | float | dict
254        If a `dict`, expects keys to correspond to each unique label 
255        with a float alpha as the value (ideally the output of 
256        `scmkl.optimize_alpha()`). Else, an array of alpha values to 
257        create each model with, or a float to run with a single alpha.
258
259    tfidf : list[bool]
260        If element `i` is `True`, `adatas[i]` will be TF-IDF 
261        normalized. If `None`, no views will be TF-IDF normalized.
262
263    batches : int
264        The number of batches to use for the distance calculation. 
265        This will average the result of `batches` distance calculations 
266        of `batch_size` randomly sampled cells. More batches will 
267        converge to population distance values at the cost of 
268        scalability.
269
270    batch_size : int
271        The number of cells to include per batch for distance
272        calculations. Higher batch size will converge to population
273        distance values at the cost of scalability.
274        If `batches*batch_size > num_training_cells`,
275        `batch_size` will be reduced to 
276        `int(num_training_cells / batches)`.
277
278    force_balance : bool
279        If `True`, training sets will be balanced to reduce class label 
280        imbalance. Defaults to `False`.
281
282    other_factor : float
283        The ratio of cells to sample for the other class for each 
284        model. For example, if classifying B cells with 100 B cells in 
285        training, if `other_factor=1`, 100 cells that are not B cells 
286        will be trained on with the B cells.
287
288    Returns
289    -------
290    results : dict
291        Contains keys for each cell class with results from cell class
292        versus all other samples. See `scmkl.run()` for further details. 
293        Will also include a probabilities table with the predictions 
294        from each model.
295
296    Examples
297    --------
298    >>> adata = scmkl.create_adata(X = data_mat, 
299    ...                            feature_names = gene_names, 
300    ...                            group_dict = group_dict)
301    >>>
302    >>> results = scmkl.one_v_rest(adatas = [adata], names = ['rna'],
303    ...                           alpha_params = np.array([0.05, 0.1]),
304    ...                           tfidf = [False])
305    >>>
306    >>> results.keys()
307    dict_keys(['B cells', 'Monocytes', 'Dendritic cells', ...])
308    """
309    if isinstance(adatas, ad.AnnData):
310        adatas = [adatas]
311    if isinstance(tfidf, type(None)):
312        tfidf = len(adatas)*[False]
313
314    _check_adatas(adatas, check_obs=True, check_uns=True)
315
316    # Want to retain all original train indices
317    train_indices = adatas[0].uns['train_indices'].copy()
318    test_indices = adatas[0].uns['test_indices'].copy()
319
320    uniq_labels = _eval_labels(cell_labels = adatas[0].obs['labels'], 
321                               train_indices = train_indices,
322                               test_indices = test_indices)
323
324    if (len(adatas) == 1) and ('Z_train' not in adatas[0].uns.keys()):
325        adata = calculate_z(adatas[0], n_features = 5000, 
326                            batches=batches, batch_size=batch_size)
327    elif len(adatas) > 1:
328        adata = multimodal_processing(adatas=adatas, 
329                                      names=names, 
330                                      tfidf=tfidf,
331                                      batches=batches,
332                                      batch_size=batch_size)
333    else:
334        adata = adatas[0].copy()
335
336    # Preventing multiple copies of adata(s) in memory
337    del adatas
338    gc.collect()
339
340    # Need obj for capturing results
341    results = dict()
342
343    # Capturing cell labels to regenerate at each comparison
344    cell_labels = np.array(adata.obs['labels'].copy())
345
346    # Capturing perfect train/test splits for each class
347    if train_dict:
348        train_idx = train_dict
349    else:
350        if force_balance:
351            train_idx = get_class_train(adata.uns['train_indices'], 
352                                        cell_labels, 
353                                        adata.uns['seed_obj'],
354                                        other_factor)
355    tracemalloc.start()
356    for label in uniq_labels:
357
358        print(f"Comparing {label} to other types", flush = True)
359        cur_labels = cell_labels.copy()
360        cur_labels[cell_labels != label] = 'other'
361
362        # Need cur_label vs rest to run model
363        adata.obs['labels'] = cur_labels
364
365        if force_balance or train_dict:
366            adata.uns['train_indices'] = train_idx[label]
367
368        # Will only run scMKL with tuned alphas
369        if isinstance(alpha_params, dict):
370            alpha_list = np.array([alpha_params[label]])
371        elif isinstance(alpha_params, float):
372            alpha_list = np.array([alpha_params])
373        else:
374            alpha_list = alpha_params
375        
376        # Running scMKL
377        results[label] = run(adata, alpha_list, return_probs=True)
378        gc.collect()
379
380    # Getting final predictions
381    if isinstance(alpha_params, dict):
382        alpha = alpha_params
383    else:
384        alpha = np.min(alpha_params)
385
386    prob_table, pred_class, low_conf = get_prob_table(results, alpha)
387    macro_f1 = f1_score(cell_labels[adata.uns['test_indices']], 
388                        pred_class, average='macro')
389
390    model_summary = per_model_summary(results, uniq_labels, alpha)
391
392    # Global adata obj will be permanently changed if not reset
393    adata.obs['labels'] = cell_labels
394    adata.uns['train_indices'] = train_indices
395
396    # Need to document vars, probs, and stats
397    results['Per_model_summary'] = model_summary
398    results['Classes'] = uniq_labels
399    results['Probability_table'] = prob_table
400    results['Predicted_class'] = pred_class
401    results['Truth_labels'] = cell_labels[adata.uns['test_indices']]
402    results['Low_confidence'] = low_conf
403    results['Macro_F1-Score'] = macro_f1
404
405    if force_balance or train_dict:
406        results['Training_indices'] = train_idx
407
408    return results

For each cell class, creates model(s) comparing that class to all others. Then, predicts on the testing data using scmkl.run. Only labels present in both training and testing data are run.

Parameters
  • adatas (list[AnnData]): List of ad.AnnData objects created by create_adata() where each ad.AnnData is one modality and composed of both training and testing samples. Requires that 'train_indices' and 'test_indices' are the same between all ad.AnnDatas.
  • names (list[str]): String variables that describe each modality respective to adatas for labeling.
  • alpha_params (np.ndarray | float | dict): If a dict, expects keys to correspond to each unique label with a float alpha as the value (ideally the output of scmkl.optimize_alpha). Else, an array of alpha values to create each model with, or a float to run with a single alpha.
  • tfidf (list[bool]): If element i is True, adatas[i] will be TF-IDF normalized. If None, no views will be TF-IDF normalized.
  • batches (int): The number of batches to use for the distance calculation. This will average the result of batches distance calculations of batch_size randomly sampled cells. More batches will converge to population distance values at the cost of scalability.
  • batch_size (int): The number of cells to include per batch for distance calculations. Higher batch size will converge to population distance values at the cost of scalability. If batches*batch_size > num_training_cells, batch_size will be reduced to int(num_training_cells / batches).
  • force_balance (bool): If True, training sets will be balanced to reduce class label imbalance. Defaults to False.
  • other_factor (float): The ratio of cells to sample for the other class for each model. For example, if classifying B cells with 100 B cells in training, if other_factor=1, 100 cells that are not B cells will be trained on with the B cells.
  • train_dict (dict): A dict where each key is a class label and values are the training indices to use for that class; when given, these indices override force_balance sampling.
Returns
  • results (dict): Contains keys for each cell class with results from cell class versus all other samples. See scmkl.run for further details. Will also include a probabilities table with the predictions from each model.
Examples
>>> adata = scmkl.create_adata(X = data_mat, 
...                            feature_names = gene_names, 
...                            group_dict = group_dict)
>>>
>>> results = scmkl.one_v_rest(adatas = [adata], names = ['rna'],
...                           alpha_params = np.array([0.05, 0.1]),
...                           tfidf = [False])
>>>
>>> results.keys()
dict_keys(['B cells', 'Monocytes', 'Dendritic cells', ...])
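Beyond the per-class results, the top-level keys added at the end of the function hold the combined predictions; a sketch inspecting them (values depend on the data):
>>> preds = results['Predicted_class']
>>> probs = results['Probability_table']
>>> macro_f1 = results['Macro_F1-Score']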
def optimize_alpha( adata: anndata._core.anndata.AnnData | list[anndata._core.anndata.AnnData], group_size: int | None = None, tfidf: None | list[bool] = None, alpha_array: numpy.ndarray = array([1.9 , 1.69, 1.49, 1.28, 1.08, 0.87, 0.67, 0.46, 0.26, 0.05]), k: int = 4, metric: str = 'AUROC', early_stopping: bool = False, batches: int = 10, batch_size: int = 100, combination: str = 'concatenate', force_balance: bool = True, train_dict: dict = None):
442def optimize_alpha(adata: ad.AnnData | list[ad.AnnData], 
443                   group_size: int | None=None, tfidf: None | list[bool]=None, 
444                   alpha_array: np.ndarray=default_alphas, k: int=4, 
445                   metric: str='AUROC', early_stopping: bool=False,
446                   batches: int=10, batch_size: int=100, 
447                   combination: str='concatenate', force_balance: bool=True,
448                   train_dict: dict=None):
449    """
450    K-fold cross validation for optimizing alpha hyperparameter using 
451    training indices. 
452
453    Parameters
454    ----------
455    adata : ad.AnnData | list[ad.AnnData]
456        A single `ad.AnnData` or a list of `ad.AnnData` objects with 
457        `'Z_train'` and `'Z_test'` in `adata.uns.keys()`.
458
459    group_size : None | int
460        Argument describing how the features are grouped. If `None`, 
461        `2 * adata.uns['D']` will be used. For more information see 
462        [celer documentation](https://mathurinm.github.io/celer/
463        generated/celer.GroupLasso.html).
464
465    tfidf : list[bool]
466        If `False`, no data will be TF-IDF transformed. If 
467        `type(adata) is list` and TF-IDF transformation is desired for 
468        all or some of the data, a bool list corresponding to `adata` 
469        must be provided. To simply TF-IDF transform `adata` when 
470        `type(adata) is ad.AnnData`, use `True`.
471    
472    alpha_array : np.ndarray
473        Array of all alpha values to be tested.
474
475    k : int
476        Number of folds to perform cross validation over.
477            
478    metric : str
479        Which metric to use to optimize alpha. Options are `'AUROC'`, 
480        `'Accuracy'`, `'F1-Score'`, `'Precision'`, and `'Recall'`.
481
482    batches : int
483        The number of batches to use for the distance calculation.
484        This will average the result of `batches` distance calculations
485        of `batch_size` randomly sampled cells. More batches will converge
486        to population distance values at the cost of scalability.
487
488    batch_size : int
489        The number of cells to include per batch for distance
490        calculations. Higher batch size will converge to population
491        distance values at the cost of scalability. If 
492        `batches*batch_size > num_training_cells`, `batch_size` will be 
493        reduced to `int(num_training_cells/batches)`.
494
495    force_balance: bool
496        If `True`, training sets will be balanced to reduce class label 
497        imbalance for each iteration. Defaults to `True`.
498
506    combination: str
507        How should multiple views of data be combined. For more details 
508        see ad.concat.
509
510    train_dict: dict
511        A `dict` where each key is a class label and values are the 
512        indices to be trained with for that class for class balance. 
513        All values must be present in each adata.uns['train_indices'].
514
515    Returns
516    -------
517    alpha_star : float | dict
518        If number of classes is more than 2, a dictionary with keys 
519        being class labels and values being the best performing alpha 
520        parameter for that class as a float. Else, a float for 
521        comparing the two classes.
522    """
523    # Need single-view runs to be iterable
524    if isinstance(adata, ad.AnnData):
525        adata = [adata.copy()]
526
527    if isinstance(tfidf, type(None)):
528        tfidf = len(adata)*[False]
529
530    is_multi = 2 < len(set(adata[0].obs['labels']))
531    
532    if isinstance(group_size, type(None)):
533        group_size = 2*adata[0].uns['D']
534
535
536    if is_multi:
537        alpha_star = multi_optimize_alpha(adata, group_size, tfidf, 
538                                          alpha_array, k, metric, 
539                                          early_stopping, batches, 
540                                          batch_size, force_balance, 
541                                          combination, train_dict)
542        
543    else:
544        alpha_star = bin_optimize_alpha(adata, group_size, tfidf, 
545                                        alpha_array, k, metric, 
546                                        early_stopping, batches, 
547                                        batch_size, combination)
548        
549    gc.collect()
550
551    return alpha_star

K-fold cross validation for optimizing the alpha hyperparameter using training indices.

Parameters
  • adata (list[ad.AnnData]): List of ad.AnnData(s) with 'Z_train' and 'Z_test' in adata.uns.keys().
  • group_size (None | int): Argument describing how the features are grouped. If None, 2 * adata.uns['D'] will be used. For more information see celer documentation.
  • tfidf (list[bool]): If False, no data will be TF-IDF transformed. If type(adata) is list and TF-IDF transformation is desired for all or some of the data, a bool list corresponding to adata must be provided. To simply TF-IDF transform adata when type(adata) is ad.AnnData, use True.
  • alpha_array (np.ndarray): Array of all alpha values to be tested.
  • k (int): Number of folds to perform cross validation over.
  • metric (str): Which metric to use to optimize alpha. Options are 'AUROC', 'Accuracy', 'F1-Score', 'Precision', and 'Recall'.
  • batches (int): The number of batches to use for the distance calculation. This will average the result of batches distance calculations of batch_size randomly sampled cells. More batches will converge to population distance values at the cost of scalability.
  • batch_size (int): The number of cells to include per batch for distance calculations. Higher batch size will converge to population distance values at the cost of scalability. If batches*batch_size > num_training_cells, batch_size will be reduced to int(num_training_cells/batches).
  • force_balance (bool): If True, training sets will be balanced to reduce class label imbalance for each iteration. Defaults to False.
  • other_factor (float): The ratio of cells to sample from the other class for each model. For example, when classifying B cells with 100 B cells in training and other_factor=1, 100 non-B cells will be included in training alongside the B cells. This is done for each fold of each class if force_balance is True.
  • combination (str): How multiple views of data should be combined. For more details see ad.concat.
  • train_dict (dict): A dict where each key is a class label and the values are the indices to train on for that class to maintain class balance. All values must be present in each adata.uns['train_indices'].
Returns
  • alpha_star (float | dict): If the number of classes is more than 2, a dictionary with class labels as keys and the best performing alpha parameter for each class as values. Otherwise, a single float for the binary comparison.
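
A minimal usage sketch, assuming adata was prepared with scmkl and already holds 'Z_train' and 'Z_test'; the alpha grid and returned value below are illustrative:

>>> import numpy as np
>>> alpha_array = np.array([0.05, 0.1, 0.5, 1.0])
>>> alpha_star = scmkl.optimize_alpha(adata, alpha_array=alpha_array,
...                                   k=4, metric='AUROC')
>>> alpha_star
0.1
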
def optimize_sparsity( adata: anndata._core.anndata.AnnData, group_size: int | None = None, starting_alpha=1.9, increment=0.2, target=1, n_iter=10):
  9def optimize_sparsity(adata: ad.AnnData, group_size: int | None=None, starting_alpha = 1.9, 
 10                      increment = 0.2, target = 1, n_iter = 10):
 11    """
 12    Iteratively train a group lasso model and update alpha to find the 
 13    parameter yielding the desired sparsity.
 14    
 15    Parameters
 16    ----------
 17    adata : ad.AnnData
 18        `ad.AnnData` with `'Z_train'` and `'Z_test'` in 
 19        `adata.uns.keys()`.
 20
 21    group_size : None | int
 22        Argument describing how the features are grouped. If `None`, 
 23        `2 * adata.uns['D']` will be used. For more information see 
 24        [celer documentation](https://mathurinm.github.io/celer/
 25        generated/celer.GroupLasso.html).
 26
 27    starting_alpha : float
 28        The alpha value to start the search at.
 29    
 30    increment : float 
 31        Amount to adjust alpha by between iterations.
 32    
 33    target : int
 34        The desired number of groups selected by the model.
 35
 36    n_iter : int
 37        The maximum number of iterations to run.
 38            
 39    Returns
 40    -------
 41    sparsity_dict : dict
 42        Tested alpha as keys and the number of selected groups as 
 43        the values.
 44        
 45    alpha : float
 46        The alpha value yielding the number of selected groups closest 
 47        to the target.
 48
 49    Examples
 50    --------
 51    >>> sparsity_dict, alpha = scmkl.optimize_sparsity(adata, 
 52    ...                                                target = 1)
 53    >>>
 54    >>> alpha
 55    0.01
 56
 57    See Also
 58    --------
 59    celer.GroupLasso : https://mathurinm.github.io/celer/
 60    """
 61    assert increment > 0 and increment < starting_alpha, ("Choose a positive "
 62                                                          "increment less "
 63                                                          "than alpha")
 64    assert target > 0 and isinstance(target, int), ("Choose an integer "
 65                                                    "target number of groups "
 66                                                     "that is greater than 0")
 67    assert n_iter > 0 and isinstance(n_iter, int), ("Choose an integer "
 68                                                    "number of iterations "
 69                                                    "that is greater than 0")
 70
 71    if group_size is None:
 72        group_size = adata.uns['D']*2
 73
 74    sparsity_dict = {}
 75    alpha = starting_alpha
 76
 77    for _ in np.arange(n_iter):
 78        adata = train_model(adata, group_size, alpha)
 79        num_selected = len(find_selected_groups(adata))
 80
 81        sparsity_dict[np.round(alpha, 4)] = num_selected
 82
 83        if num_selected < target:
 84            #Decreasing alpha will increase the number of selected pathways
 85            if alpha - increment in sparsity_dict.keys():
 86                # Make increment smaller so the model can't go back and forth 
 87                # between alpha values
 88                increment/=2
 89            # Ensures that alpha will never be negative
 90            alpha = np.max([alpha - increment, 1e-3]) 
 91
 92        elif num_selected > target:
 93            if alpha + increment in sparsity_dict.keys():
 94                increment/=2
 95
 96            alpha += increment
 97        elif num_selected == target:
 98            break
 99
100    # Find the alpha that minimizes the difference between target and observed
101    # number of selected groups
102    spar_idx = np.argmin([np.abs(selected - target) 
103                          for selected in sparsity_dict.values()])
104    optimal_alpha = list(sparsity_dict.keys())[spar_idx]
105    
106    return sparsity_dict, optimal_alpha

Iteratively train a group lasso model and update alpha to find the parameter yielding the desired sparsity.

Parameters
  • adata (ad.AnnData): ad.AnnData with 'Z_train' and 'Z_test' in adata.uns.keys().
  • group_size (None | int): Argument describing how the features are grouped. If None, 2 * adata.uns['D'] will be used. For more information see celer documentation.
  • starting_alpha (float): The alpha value to start the search at.
  • increment (float): Amount to adjust alpha by between iterations.
  • target (int): The desired number of groups selected by the model.
  • n_iter (int): The maximum number of iterations to run.
Returns
  • sparsity_dict (dict): Tested alpha as keys and the number of selected groups as the values.
  • alpha (float): The alpha value yielding the number of selected groups closest to the target.
Examples
>>> sparsity_dict, alpha = scmkl.optimize_sparsity(adata, 
...                                                target = 1)
>>>
>>> alpha
0.01
See Also

celer.GroupLasso: https://mathurinm.github.io/celer/

def parse_metrics( results: dict, key: str | None = None, include_as: bool = False) -> pandas.core.frame.DataFrame:
131def parse_metrics(results: dict, key: str | None=None, 
132                   include_as: bool=False) -> pd.DataFrame:
133    """
134    This function returns a pd.DataFrame for a single scMKL result 
135    with performance metrics.
136
137    Parameters
138    ----------
139    results : dict
140        A result dictionary from `scmkl.run()`.
141    
142    key : str
143        If specified, will add a key column to the output dataframe 
144        where each element is `key`.
145
146    include_as : bool
147        If `True`, will add a column indicating which models used 
148        the optimal alpha.
149
150    Returns
151    -------
152    df : pd.DataFrame
153        A dataframe with columns `['Alpha', 'Metric', 'Value']`. 
154        `'Key'` col only added if `key` is not `None`.
155    """
156    df = {
157        'Alpha' : list(),
158        'Metric' : list(),
159        'Value' : list()
160    }
161
162    # Check if is a multiclass result
163    is_mult, _ = _parse_result_type(results)
164
165    if is_mult:
166        df['Class'] = list()
167
168    # Ensuring results is a scMKL result and checking multiclass
169    if 'Metrics' in results.keys():
170        for alpha in results['Metrics'].keys():
171            for metric, value in results['Metrics'][alpha].items():
172                df['Alpha'].append(alpha)
173                df['Metric'].append(metric)
174                df['Value'].append(value)
175
176    elif 'Classes' in results.keys():
177        for ct in results['Classes']:
178            for alpha in results[ct]['Metrics'].keys():
179                for metric, value in results[ct]['Metrics'][alpha].items():
180                    df['Alpha'].append(alpha)
181                    df['Metric'].append(metric)
182                    df['Value'].append(value)
183                    df['Class'].append(ct)
184
185    else:
186        print(f"{key} is not a scMKL result and will be ignored.")
187            
188    df = pd.DataFrame(df)
189    
190    if include_as:
191        assert 'Alpha_star' in results.keys(), "'Alpha_star' not in results"
192        df['Alpha Star'] = df['Alpha'] == results['Alpha_star']
193
194    if key is not None:
195        df['Key'] = [key] * df.shape[0]
196
197    return df        

This function returns a pd.DataFrame for a single scMKL result with performance metrics.

Parameters
  • results (dict): A result dictionary from scmkl.run.
  • key (str): If specified, will add a key column to the output dataframe where each element is key.
  • include_as (bool): If True, will add a column indicating which models used the optimal alpha.
Returns
  • df (pd.DataFrame): A dataframe with columns ['Alpha', 'Metric', 'Value']. 'Key' col only added if key is not None.
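
For example, assuming results holds a binary scMKL result from scmkl.run() (the resulting columns follow the description above):

>>> df = scmkl.parse_metrics(results, key='rna', include_as=True)
>>> df.columns
Index(['Alpha', 'Metric', 'Value', 'Alpha Star', 'Key'], dtype='object')
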
def parse_weights( results: dict, include_as: bool = False, key: None | str = None) -> pandas.core.frame.DataFrame:
200def parse_weights(results: dict, include_as: bool=False, 
201                   key: None | str=None) -> pd.DataFrame:
202    """
203    This function returns a pd.DataFrame for a single scMKL result 
204    with group weights.
205
206    Parameters
207    ----------
208    results : dict
209        A result dictionary from `scmkl.run()`.
210    
211    key : str
212        If specified, will add a key column to the output dataframe 
213        where each element is `key`.
214
215    include_as : bool
216        If `True`, will add a column indicating which models used 
217        the optimal alpha.
218
219    Returns
220    -------
221    df : pd.DataFrame
222        A dataframe with columns `['Alpha', 'Group', 
223        'Kernel Weight']`. `'Key'` col only added if `key` is not 
224        `None`.
225    """
226    df = {
227        'Alpha' : list(),
228        'Group' : list(),
229        'Kernel Weight' : list()
230    }
231
232    # Check if is a multiclass result
233    is_mult, _ = _parse_result_type(results)
234
235    if is_mult:
236        df['Class'] = list()
237
238    # Ensuring results is a scMKL result and checking multiclass
239    if 'Norms' in results.keys():
240        for alpha in results['Norms'].keys():
241            df['Alpha'].extend([alpha]*len(results['Norms'][alpha]))
242            df['Group'].extend(results['Group_names'])
243            df['Kernel Weight'].extend(results['Norms'][alpha])
244
245    elif 'Classes' in results.keys():
246        for ct in results['Classes']:
247            for alpha in results[ct]['Norms'].keys():
248                df['Alpha'].extend([alpha] * len(results[ct]['Norms'][alpha]))
249                df['Group'].extend(results[ct]['Group_names'])
250                df['Kernel Weight'].extend(results[ct]['Norms'][alpha])
251                df['Class'].extend([ct]*len(results[ct]['Norms'][alpha]))
252
253    df = pd.DataFrame(df)
254    
255    if include_as:
256        df['Alpha Star'] = df['Alpha'] == results['Alpha_star'] 
257
258    if key is not None:
259        df['Key'] = [key] * df.shape[0]
260
261    return df

This function returns a pd.DataFrame for a single scMKL result with group weights.

Parameters
  • results (dict): A result dictionary from scmkl.run.
  • key (str): If specified, will add a key column to the output dataframe where each element is key.
  • include_as (bool): If True, will add a column indicating which models used the optimal alpha.
Returns
  • df (pd.DataFrame): A dataframe with columns ['Alpha', 'Group', 'Kernel Weight']. 'Key' col only added if key is not None.
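
For example, assuming results holds a binary scMKL result from scmkl.run():

>>> df = scmkl.parse_weights(results, include_as=True, key='rna')
>>> df.columns
Index(['Alpha', 'Group', 'Kernel Weight', 'Alpha Star', 'Key'], dtype='object')
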
def plot_metric( summary_df: pandas.core.frame.DataFrame, alpha_star=None, x_axis: str = 'Alpha', color='red'):
196def plot_metric(summary_df : pd.DataFrame, alpha_star = None, 
197                x_axis: str='Alpha', color = 'red'):
198    """
199    Takes a data frame of model metrics and, optionally, alpha star, 
200    and creates a scatter plot of the given metric against alpha 
201    values. For multiclass results, `alpha_star` is not shown and 
202    points are colored by class.
203    
204    Parameters
205    ----------
206    summary_df : pd.DataFrame
207        Dataframe created by `scmkl.get_summary()`.
208
209    alpha_star : None | float
210        If not `None`, a label will be added marking `alpha_star`, the 
211        model parameter found to be optimal by cross validation on the 
212        training data. Can be calculated with 
213        `scmkl.optimize_alpha()`. Ignored if `summary_df` is from a 
214        multiclass result.
215
216    x_axis : str
217        Must be either `'Alpha'` or `'Number of Selected Groups'`. Is 
218        the variable that will be plotted on the x-axis.
219
220    color : str
221        Color to make points on plot.
222
223    Returns
224    -------
225    metric_plot : plotnine.ggplot.ggplot
226        A plot with alpha values on x-axis and metric on y-axis.
227
228    Examples
229    --------
230    >>> results = scmkl.run(adata, alpha_list)
231    >>> summary_df = scmkl.get_summary(results)
232    >>> metric_plot = scmkl.plot_metric(summary_df)
233    >>>
234    >>> metric_plot.save('scMKL_performance.png')
235
236    ![metric_plot](../tests/figures/plot_metric_binary.png)
237    """
238    # Capturing metric from summary_df
239    metric_options = ['AUROC', 'Accuracy', 'F1-Score', 'Precision', 'Recall']
240    metric = np.intersect1d(metric_options, summary_df.columns)[0]
241
242    x_axis_ticks = np.unique(summary_df[x_axis])
243    summary_df['Alpha Star'] = summary_df['Alpha'] == alpha_star
244
245    if 'Class' in summary_df.columns:
246        color_lab = 'Class'
247    else:
248        c_dict = {
249            True : 'gold',
250            False : color
251            }
252        color_lab = 'Alpha Star'
253
254    if np.min(summary_df[metric]) < 0.5:
255        min_y = 0
256    else:
257        min_y = 0.6
258
259    plot = (ggplot(summary_df, aes(x = x_axis, y = metric, color=color_lab)) 
260            + geom_point(size=4)
261            + theme_classic()
262            + ylim(min_y, 1)
263            + scale_x_reverse(breaks=x_axis_ticks)
264            + theme(
265                axis_text=element_text(weight='bold', size=10),
266                axis_title=element_text(weight='bold', size=12),
267                legend_title=element_text(weight='bold', size=14),
268                legend_text=element_text(weight='bold', size=12)
269            )
270            )
271    
272    if 'Class' not in summary_df:
273        plot += scale_color_manual(c_dict)
274        
275    return plot

Takes a data frame of model metrics and, optionally, alpha star, and creates a scatter plot of the given metric against alpha values. For multiclass results, alpha_star is not shown and points are colored by class.

Parameters
  • summary_df (pd.DataFrame): Dataframe created by scmkl.get_summary().
  • alpha_star (None | float): If not None, a label will be added marking alpha_star, the model parameter found to be optimal by cross validation on the training data. Can be calculated with scmkl.optimize_alpha. Ignored if summary_df is from a multiclass result.
  • x_axis (str): Must be either 'Alpha' or 'Number of Selected Groups'. Is the variable that will be plotted on the x-axis.
  • color (str): Color to make points on plot.
Returns
  • metric_plot (plotnine.ggplot.ggplot): A plot with alpha values on x-axis and metric on y-axis.
Examples
>>> results = scmkl.run(adata, alpha_list)
>>> summary_df = scmkl.get_summary(results)
>>> metric_plot = scmkl.plot_metric(summary_df)
>>>
>>> metric_plot.save('scMKL_performance.png')

[figure: metric_plot example output]

def plot_conf_mat( results, title='', cmap=None, normalize=True, alpha=None, save=None) -> None:
 80def plot_conf_mat(results, title = '', cmap = None, normalize = True,
 81                          alpha = None, save = None) -> None:
 82    """
 83    Creates a confusion matrix from the output of scMKL.
 84
 85    Parameters
 86    ----------
 87    results : dict
 88        The output from either scmkl.run() or scmkl.one_v_rest()
 89        containing results from scMKL.
 90
 91    title : str
 92        The text to display at the top of the matrix.
 93
 94    cmap : matplotlib.colors.LinearSegmentedColormap
 95        The gradient of the values displayed from `matplotlib.pyplot`.
 96        If `None`, `'Purples'` is used; see the matplotlib colormap 
 97        reference for more information. 
 98
 99    normalize : bool
100        If `False`, plot the raw numbers. If `True`, plot the 
101        proportions.
102
103    alpha : None | float
104        Alpha that the matrix should be created for. If `results` is 
105        from `scmkl.one_v_rest()`, this is ignored. If `None`, the 
106        smallest alpha will be used.
107
108    save : None | str
109        File path to save plot. If `None`, plot is not saved.
110
111    Returns
112    -------
113    None
114    
115    Examples
116    --------
117    >>> # Running scmkl and capturing results
118    >>> results = scmkl.run(adata = adata, alpha_list = alpha_list)
119    >>> 
120    >>> from matplotlib.pyplot import get_cmap
121    >>> 
122    >>> scmkl.plot_conf_mat(results, title = '', cmap = get_cmap('Blues'))
123
124    ![conf_mat](../tests/figures/plot_conf_mat_binary.png)
125
126    Citation
127    --------
128    http://scikit-learn.org/stable/auto_examples/model_selection/
129    plot_confusion_matrix.html
130    """
131    # Determining type of results
132    if ('Observed' in results.keys()) and ('Metrics' in results.keys()):
133        multi_class = False
134        names = np.unique(results['Observed'])
135    else:
136        multi_class = True
137        names = np.unique(results['Truth_labels'])
138
139    if multi_class:
140        cm = metrics.confusion_matrix(y_true = results['Truth_labels'], 
141                              y_pred = results['Predicted_class'], 
142                              labels = names)
143    else:
144        min_alpha = np.min(list(results['Metrics'].keys()))
145    alpha = alpha if alpha is not None else min_alpha
146        cm = metrics.confusion_matrix(y_true = results['Observed'],
147                              y_pred = results['Predictions'][alpha],
148                              labels = names)
149
150    accuracy = np.trace(cm) / float(np.sum(cm))
151    misclass = 1 - accuracy
152
153    if cmap is None:
154        cmap = plt.get_cmap('Purples')
155
156    if normalize:
157        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
158
159    plt.figure(figsize=(8, 6))
160    plt.imshow(cm, interpolation='nearest', cmap=cmap)
161    plt.title(title)
162    plt.colorbar()
163
164    tick_marks = np.arange(len(names))
165    plt.xticks(tick_marks, names, rotation=45)
166    plt.yticks(tick_marks, names)
167
168
169    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
170    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
171        if normalize:
172            plt.text(j, i, "{:0.4f}".format(cm[i, j]),
173                     horizontalalignment="center",
174                     color="white" if cm[i, j] > thresh else "black")
175        else:
176            plt.text(j, i, "{:,}".format(cm[i, j]),
177                     horizontalalignment="center",
178                     color="white" if cm[i, j] > thresh else "black")
179
180    acc_label = 'Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'
181    acc_label = acc_label.format(accuracy, misclass)
182
183    plt.tight_layout()
184    plt.ylabel('True label')
185    plt.xlabel(acc_label)
186
187    if save is not None:
188        plt.savefig(save)
189        plt.clf()
190    else:
191        plt.show()
192
193    return None

Creates a confusion matrix from the output of scMKL.

Parameters
  • results (dict): The output from either scmkl.run or scmkl.one_v_rest containing results from scMKL.
  • title (str): The text to display at the top of the matrix.
  • cmap (matplotlib.colors.LinearSegmentedColormap): The gradient of the values displayed from matplotlib.pyplot. If None, 'Purples' is used; see the matplotlib colormap reference for more information.
  • normalize (bool): If False, plot the raw numbers. If True, plot the proportions.
  • alpha (None | float): Alpha that the matrix should be created for. If results is from scmkl.one_v_rest(), this is ignored. If None, the smallest alpha will be used.
  • save (None | str): File path to save plot. If None, plot is not saved.
Returns
  • None
Examples
>>> # Running scmkl and capturing results
>>> results = scmkl.run(adata = adata, alpha_list = alpha_list)
>>> 
>>> from matplotlib.pyplot import get_cmap
>>> 
>>> scmkl.plot_conf_mat(results, title = '', cmap = get_cmap('Blues'))

[figure: confusion matrix example output]

Citation

http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

def prepare_fold(fold_adata, sort_idcs, fold_train, fold_test):
 94def prepare_fold(fold_adata, sort_idcs, fold_train, fold_test):
 95    """
 96    Reorders adata samples, reassigns each `adata.uns['train_indices' & 
 97    'test_indices']`, and removes sigmas if present.
 98
 99    Parameters
100    ----------
101    fold_adata : list[ad.AnnData]
102        A `list` of `AnnData` objects to be reordered based on fold 
103        indices.
104
105    sort_idcs : np.ndarray
106        The indices that will sort `AnnData`s as all train samples then 
107        all test.
108
109    fold_train : np.ndarray
110        The indices of training samples for the respective fold.
111
112    fold_test : np.ndarray
113        The indices of testing samples for the respective fold.
114
115    Returns
116    -------
117    fold_adata : list[ad.AnnData]
118        The `list` of `AnnData`s with each `AnnData` sorted as all 
119        training samples then all testing samples. 
120        `adata.uns['train_indices' & 'test_indices']` are also updated 
121        to reflect new sample positions.
122
123    Examples
124    --------
125    >>> adata = [scmkl.create_adata(...)]
126    >>> folds = scmkl.get_folds(adata[0], k=4)
127    >>> sort_idcs, fold_train, fold_test = sort_samples(fold_train, 
128    ...                                                 fold_test)
129    >>> fold_adata = scmkl.prepare_fold(adata, sort_idcs, 
130    ...                                 fold_train, fold_test)
131    """
132    for i in range(len(fold_adata)):
133        fold_adata[i] = fold_adata[i][sort_idcs]
134        fold_adata[i].uns['train_indices'] = fold_train
135        fold_adata[i].uns['test_indices'] = fold_test
136    
137    # Need to recalculate sigmas for new fold train
138    for i in range(len(fold_adata)):
139        if 'sigma' in fold_adata[i].uns_keys():
140            del fold_adata[i].uns['sigma']
141
142    return fold_adata

Reorders adata samples, reassigns each adata.uns['train_indices' & 'test_indices'], and removes sigmas if present.

Parameters
  • fold_adata (list[ad.AnnData]): A list of AnnData objects to be reordered based on fold indices.
  • sort_idcs (np.ndarray): The indices that will sort AnnDatas as all train samples then all test.
  • fold_train (np.ndarray): The indices of training samples for the respective fold.
  • fold_test (np.ndarray): The indices of testing samples for the respective fold.
Returns
  • fold_adata (list[ad.AnnData]): The list of AnnDatas with each AnnData sorted as all training samples then all testing samples. adata.uns['train_indices' & 'test_indices'] are also updated to reflect new sample positions.
Examples
>>> adata = [scmkl.create_adata(...)]
>>> folds = scmkl.get_folds(adata[0], k=4)
>>> sort_idcs, fold_train, fold_test = sort_samples(fold_train, 
...                                                 fold_test)
>>> fold_adata = scmkl.prepare_fold(adata, sort_idcs, 
...                                 fold_train, fold_test)
def process_fold(fold_adata, names, tfidf, combination, batches, batch_size):
145def process_fold(fold_adata, names, tfidf, combination, batches, batch_size):
146    """
147    Combines adata if needed, estimates sigmas, and calculates kernels 
148    for model training and evaluation.
149
150    Parameters
151    ----------
152    fold_adata : list[ad.AnnData]
153        A `list` of `AnnData` objects to combine and calculate kernels 
154        for.
155
156    names : list
157        A list of names respective to each `AnnData` in `fold_adata` 
158        for verbose outputs.
159
160    tfidf : list[bool]
161        A boolean list indicating whether or not to perform TF-IDF 
162        transformation for each adata respective to `fold_adata`.
163
164    combination : str
165        The method of combining `AnnData` objects passed to 
166        `ad.concatenate()`. Ignored if `len(fold_adata) == 1`.
167
168    batches : int
169        The number of batches for kernel width (sigma) estimation.
170
171    batch_size : int
172        The number of samples to include in each batch of kernel width 
173        (sigma) estimation.
174
175    Examples
176    --------
177    >>> adata = [scmkl.create_adata(...)]
178    >>> tfidf = [False]
179    >>> names = ['adata1']
180    >>> folds = scmkl.get_folds(adata[0], k=4)
181    >>> sort_idcs, fold_train, fold_test = sort_samples(fold_train, 
182    ...                                                 fold_test)
183    >>> fold_adata = scmkl.prepare_fold(adata, sort_idcs, 
184    ...                                 fold_train, fold_test)
185    >>> fold_adata = scmkl.process_fold(fold_adata, names, tfidf)
186    """
187    if 1 < len(fold_adata):
188        fold_adata = multimodal_processing(fold_adata, names, tfidf, 
189                                           combination, batches, 
190                                           batch_size, False)
191    else:
192        fold_adata = fold_adata[0]
193        if tfidf[0]:
194            fold_adata = tfidf_normalize(fold_adata, binarize=True)
195        fold_adata = calculate_z(fold_adata, n_features= 5000, 
196                                 batches=batches, batch_size=batch_size)
197        
198    return fold_adata

Combines adata if needed, estimates sigmas, and calculates kernels for model training and evaluation.

Parameters
  • fold_adata (list[ad.AnnData]): A list of AnnData objects to combine and calculate kernels for.
  • names (list): A list of names respective to each AnnData in fold_adata for verbose outputs.
  • tfidf (list[bool]): A boolean list indicating whether or not to perform TF-IDF transformation for each adata respective to fold_adata.
  • combination (str): The method of combining AnnData objects passed to ad.concatenate(). Ignored if len(fold_adata) == 1.
  • batches (int): The number of batches for kernel width (sigma) estimation.
  • batch_size (int): The number of samples to include in each batch of kernel width (sigma) estimation.
Examples
>>> adata = [scmkl.create_adata(...)]
>>> tfidf = [False]
>>> names = ['adata1']
>>> folds = scmkl.get_folds(adata[0], k=4)
>>> sort_idcs, fold_train, fold_test = sort_samples(fold_train, 
...                                                 fold_test)
>>> fold_adata = scmkl.prepare_fold(adata, sort_idcs, 
...                                 fold_train, fold_test)
>>> fold_adata = scmkl.process_fold(fold_adata, names, tfidf)
def read_files(dir: str, pattern: str | None = None) -> dict:
354def read_files(dir: str, pattern: str | None=None) -> dict:
355    """
356    This function takes a directory of scMKL results as pickle files 
357    and returns a dictionary with the file names as keys and the data 
358    from the respective files as the values.
359
360    Parameters
361    ----------
362    dir : str
363        A string specifying the file path for the output scMKL runs.
364
365    pattern : str
366        A regex string for filtering down to desired files. If 
367        `None`, all files in the directory with the pickle file 
368        extension will be added to the dictionary.
369
370    Returns
371    -------
372    results : dict
373        A dictionary with the file names as keys and data as values.
374
375    Examples
376    --------
377    >>> filepath = 'scMKL_results/rna+atac/'
378    ...
379    >>> all_results = scmkl.read_files(filepath)
380    >>> all_results.keys()
381    dict_keys(['Rep_1.pkl', 'Rep_2.pkl', 'Rep_3.pkl', ...])
382    """
383    # Reading all pickle files if pattern is None
384    if pattern is None:
385        data = {file : np.load(f'{dir}/{file}', allow_pickle = True)
386                 for file in os.listdir(dir) if '.pkl' in file}
387    
388    # Reading only files matching pattern if not None
389    else:
390        pattern = re.compile(pattern)  # compile the user-supplied regex
391        data = {file : np.load(f'{dir}/{file}', allow_pickle = True)
392                 for file in os.listdir(dir) 
393                 if re.fullmatch(pattern, file) is not None}
394        
395    return data

This function takes a directory of scMKL results as pickle files and returns a dictionary with the file names as keys and the data from the respective files as the values.

Parameters
  • dir (str): A string specifying the file path for the output scMKL runs.
  • pattern (str): A regex string for filtering down to desired files. If None, all files in the directory with the pickle file extension will be added to the dictionary.
Returns
  • results (dict): A dictionary with the file names as keys and data as values.
Examples
>>> filepath = 'scMKL_results/rna+atac/'
...
>>> all_results = scmkl.read_files(filepath)
>>> all_results.keys()
dict_keys(['Rep_1.pkl', 'Rep_2.pkl', 'Rep_3.pkl', ...])
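
The pattern argument is matched as a regular expression against each file name with re.fullmatch; for instance, to read only replicate files (the file names here are hypothetical):

>>> rep_results = scmkl.read_files(filepath, pattern=r'Rep_\d+\.pkl')
>>> rep_results.keys()
dict_keys(['Rep_1.pkl', 'Rep_2.pkl', 'Rep_3.pkl'])
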
def read_gtf(path: str, filter_to_coding: bool = False):
642def read_gtf(path: str, filter_to_coding: bool=False):
643    """
644    Reads and formats a gtf file. Adds colnames: `['chr', 'source', 
645    'feature', 'start', 'end', 'score', 'strand', 'frame', 
646    'attribute']`. If `filter_to_coding == True`, `'gene_name'` col 
647    will be added with gene_name from attribute col if gene is protein 
648    coding. If `'gene_name'` is not in attribute for that row, 
649    `'gene_id'` will be used. 
650
651    Parameters
652    ----------
653    path : str
654        The file path to the gtf file to be read in. If the file is 
655        gzipped, file name must end with .gz.
656
657    filter_to_coding : bool
658        If `True`, will filter rows in gtf data frame to only 
659        protein coding genes. Will add column `'gene_name'` containing 
660        the gene name for each row. If gene name is missing in GTF 
661        gene_id will be used.
662
663    Returns
664    -------
665    df : pd.DataFrame
666        A pandas dataframe of the input gtf file.
667
668    Examples
669    --------
670    >>> import scmkl
671    >>>
672    >>> file = 'data/hg38_subset_protein_coding.annotation.gtf'
673    >>> gtf = scmkl.read_gtf(file)
674    >>>
675    >>> gtf.head()
676            chr  source     feature  start    end score strand frame  \
677    0  chr1  HAVANA        gene  11869  14409     .      +     .   
678    1  chr1  HAVANA  transcript  11869  14409     .      +     .   
679    2  chr1  HAVANA        exon  11869  12227     .      +     .   
680    3  chr1  HAVANA        exon  12613  12721     .      +     .   
681    4  chr1  HAVANA        exon  13221  14409     .      +     .   
682                                              attribute  
683    0  gene_id "ENSG00000223972.5"; gene_type "transc...  
684    1  gene_id "ENSG00000223972.5"; transcript_id "EN...  
685    2  gene_id "ENSG00000223972.5"; transcript_id "EN...  
686    3  gene_id "ENSG00000223972.5"; transcript_id "EN...  
687    4  gene_id "ENSG00000223972.5"; transcript_id "EN...  
688    """
689    df = pd.read_csv(path, sep='\t', comment='#', 
690                     skip_blank_lines=True, header=None)
691    
692    df.columns = ['chr', 'source', 'feature', 'start', 'end', 
693                  'score', 'strand', 'frame', 'attribute']
694    
695    if filter_to_coding:
696        prot_rows = df['attribute'].str.contains('protein_coding')
697        df = df[prot_rows]
698        df = df[df['feature'] == 'gene']
699
700        # Capturing and adding gene name to df
701        gene_names = list()
702
703        for attr in df['attribute']:
704            gname = re.findall(r'(?<=gene_name ")[A-Za-z0-9\-]+', attr)
705
706            if gname:
707                gene_names.append(gname[0])
708            else:
709                gid = re.findall(r'(?<=gene_id ")[A-Za-z0-9\-]+', attr)
710                
711                if gid:
712                    gene_names.append(gid[0])
713                else:
714                    gene_names.append('NA')
715
716        df['gene_name'] = gene_names
717        df = df[df['gene_name'] != 'NA']
718    
719    return df

Reads and formats a gtf file. Adds colnames: ['chr', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attribute']. If filter_to_coding == True, 'gene_name' col will be added with gene_name from attribute col if gene is protein coding. If 'gene_name' is not in attribute for that row, 'gene_id' will be used.

Parameters
  • path (str): The file path to the gtf file to be read in. If the file is gzipped, file name must end with .gz.
  • filter_to_coding (bool): If True, will filter rows in gtf data frame to only protein coding genes. Will add column 'gene_name' containing the gene name for each row. If gene name is missing in GTF gene_id will be used.
Returns
  • df (pd.DataFrame): A pandas dataframe of the input gtf file.
Examples
>>> import scmkl
>>>
>>> file = 'data/hg38_subset_protein_coding.annotation.gtf'
>>> gtf = scmkl.read_gtf(file)
>>>
>>> gtf.head()
        chr  source     feature  start    end score strand frame  \
0  chr1  HAVANA        gene  11869  14409     .      +     .   
1  chr1  HAVANA  transcript  11869  14409     .      +     .   
2  chr1  HAVANA        exon  11869  12227     .      +     .   
3  chr1  HAVANA        exon  12613  12721     .      +     .   
4  chr1  HAVANA        exon  13221  14409     .      +     .   
                                           attribute
0  gene_id "ENSG00000223972.5"; gene_type "transc...
1  gene_id "ENSG00000223972.5"; transcript_id "EN...
2  gene_id "ENSG00000223972.5"; transcript_id "EN...
3  gene_id "ENSG00000223972.5"; transcript_id "EN...
4  gene_id "ENSG00000223972.5"; transcript_id "EN...
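
To restrict the annotation to protein-coding genes, which also adds the 'gene_name' column described above:

>>> coding_gtf = scmkl.read_gtf(file, filter_to_coding=True)
>>> 'gene_name' in coding_gtf.columns
True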
def run( adata: anndata._core.anndata.AnnData, alpha_list: numpy.ndarray, metrics: list | None = None, return_probs: bool = False) -> dict:
 11def run(adata: ad.AnnData, alpha_list: np.ndarray, 
 12        metrics: list | None = None, 
 13        return_probs: bool=False) -> dict:
 14    """
 15    Wrapper function for training and testing with multiple alpha values.
 16    Returns metrics, predictions, group weights, and resource usage.
 17
 18    Parameters
 19    ----------
 20    adata : ad.AnnData 
 21        A processed `ad.AnnData` with `'Z_train'`, `'Z_test'`, and 
 22        `'group_dict'` keys in `adata.uns`.
 23    
 24    alpha_list : np.ndarray 
 25        Sparsity values to create models with. Alpha refers to the 
 26        penalty parameter in Group Lasso. Larger alphas force group 
 27        weights to shrink towards zero while smaller alphas apply a 
 28        lesser penalty to kernel weights. Values too large will result 
 29        in models that weight all groups as zero.
 30
 31    metrics : list[str]
 32        Metrics that should be calculated on predictions. Options are 
 33        `['AUROC', 'F1-Score', 'Accuracy', 'Precision', 'Recall']`. 
 34        When set to `None`, all metrics are calculated.
 35    
 36    Returns
 37    -------
 38    results : dict
 39        Results with keys and values: 
 40
 41        `'Metrics'` (dict): 
 42        A nested dictionary as `[alpha][metric] = value`.
 43
 44        `'Group_names'` (np.ndarray): 
 45        Array of group names used in model(s).
 46    
 47        `'Selected_groups'` (dict): 
 48        A nested dictionary as `[alpha] = np.array([nonzero_groups])`.
 49        Nonzero groups are groups that had a kernel weight above zero.
 50
 51        `'Norms'` (dict): 
 52        A nested dictionary as `[alpha] = np.array([kernel_weights])`.
 53        Order of `kernel_weights` is respective to `'Group_names'` 
 54        values.
 55
 56        `'Observed'` (np.ndarray): 
 57        An array of ground truth cell labels from the test set.
 58
 59        `'Predictions'` (dict): 
 60        A nested dictionary as `[alpha] = predicted_class` respective 
 61        to `'Observed'` for `alpha`.
 62
 63        `'Test_indices'` (np.array): 
 64        Indices of samples respective to adata used in the test 
 65        set.
 66
 67        `'Models'` (dict): 
 68        A nested dictionary where `[alpha] = celer.GroupLasso` object 
 69        for `alpha`.
 70
 71        `'RAM_usage'` (str): 
 72        Peak memory usage in GB after training models across 
 73        all alphas.
 74
 75    Examples
 76    --------
 77    >>> results = scmkl.run(adata = adata, 
 78    ...                     alpha_list = np.array([0.05, 0.1, 0.5]))
 79    >>> results.keys()
 80    dict_keys(['Metrics', 'Selected_groups', 'Norms', 'Predictions', 
 81    ...        'Observed', 'Test_indices', 'Group_names', 'Models', 
 82    ...        'Train_time', 'RAM_usage'])
 83    >>>
 84    >>> # alpha values
 85    >>> results['Metrics'].keys()
 86    dict_keys([0.05, 0.1, 0.5])
 87    >>>
 88    >>> results['Metrics'][0.05]
 89    {'AUROC': 0.9859,
 90    'Accuracy': 0.945,
 91    'F1-Score': 0.9452736318407959,
 92    'Precision': 0.9405940594059405,
 93    'Recall': 0.95}
 94    """
 95    if metrics is None:
 96        metrics = ['AUROC', 'F1-Score','Accuracy', 'Precision', 'Recall']
 97
 98    # Initializing variables to capture metrics
 99    group_names = list(adata.uns['group_dict'].keys())
100    preds = {}
101    group_norms = {}
102    mets_dict = {}
103    selected_groups = {}
104    train_time = {}
105    models = {}
106    probs = {}
107
108    D = adata.uns['D']
109
110    # Generating models for each alpha and outputs
111    tracemalloc.start()
112    for alpha in alpha_list:
113        
114        print(f'  Evaluating model. Alpha: {alpha}', flush = True)
115
116        train_start = time.time()
117
118        adata = train_model(adata, group_size= 2*D, alpha = alpha)
119        if return_probs:
120            alpha_res = predict(adata, 
121                                metrics = metrics,
122                                return_probs = return_probs)
123            preds[alpha], mets_dict[alpha], probs[alpha] = alpha_res
124
125        else:
126            alpha_res = predict(adata, 
127                                metrics = metrics,
128                                return_probs = return_probs)
129            preds[alpha], mets_dict[alpha] = alpha_res
130
131        selected_groups[alpha] = find_selected_groups(adata)
132
133        kernel_weights = adata.uns['model'].coef_
134        group_norms[alpha] = [
135            np.linalg.norm(kernel_weights[i * 2 * D : (i + 1) * 2 * D - 1])
136            for i in np.arange(len(group_names))
137            ]
138        
139        models[alpha] = adata.uns['model']
140        
141        train_end = time.time()
142        train_time[alpha] = train_end - train_start
143
144    # Combining results into one object
145    results = {}
146    results['Metrics'] = mets_dict
147    results['Selected_groups'] = selected_groups
148    results['Norms'] = group_norms
149    results['Predictions'] = preds
150    results['Observed'] = adata.obs['labels'].iloc[adata.uns['test_indices']]
151    results['Test_indices'] = adata.uns['test_indices']
152    results['Group_names']= group_names
153    results['Models'] = models
154    results['Train_time'] = train_time
155    results['RAM_usage'] = f'{tracemalloc.get_traced_memory()[1]/1e9} GB'
156    results['Probabilities'] = probs
157
158    return results

Wrapper function for training and testing with multiple alpha values. Returns metrics, predictions, group weights, and resource usage.

Parameters
  • adata (ad.AnnData): A processed ad.AnnData with 'Z_train', 'Z_test', and 'group_dict' keys in adata.uns.
  • alpha_list (np.ndarray): Sparsity values to create models with. Alpha refers to the penalty parameter in Group Lasso. Larger alphas force group weights to shrink towards zero while smaller alphas apply a lesser penalty to kernel weights. Values too large will result in models that weight all groups as zero.
  • metrics (list[str]): Metrics that should be calculated on predictions. Options are ['AUROC', 'F1-Score', 'Accuracy', 'Precision', 'Recall']. When set to None, all metrics are calculated.
  • return_probs (bool): If True, prediction probabilities for each alpha are also captured and returned under the 'Probabilities' key in the results.
Returns
  • results (dict): Results with keys and values:

    'Metrics' (dict): A nested dictionary as [alpha][metric] = value.

    'Group_names' (np.ndarray): Array of group names used in model(s).

    'Selected_groups' (dict): A nested dictionary as [alpha] = np.array([nonzero_groups]). Nonzero groups are groups that had a kernel weight above zero.

    'Norms' (dict): A nested dictionary as [alpha] = np.array([kernel_weights]). Order of kernel_weights is respective to 'Group_names' values.

    'Observed' (np.ndarray): An array of ground truth cell labels from the test set.

    'Predictions' (dict): A nested dictionary as [alpha] = predicted_class respective to 'Observed' for alpha.

    'Test_indices' (np.array): Indices of samples respective to adata used in the test set.

    'Models' (dict): A nested dictionary where [alpha] = celer.GroupLasso object for alpha.

    'RAM_usage' (str): Peak memory usage in GB after training models across all alphas.

    'Probabilities' (dict): Only populated when return_probs=True; a nested dictionary as [alpha] = prediction probabilities for the test samples.

Examples
>>> results = scmkl.run(adata = adata, 
...                     alpha_list = np.array([0.05, 0.1, 0.5]))
>>> results.keys()
dict_keys(['Metrics', 'Selected_groups', 'Norms', 'Predictions', 
...        'Observed', 'Test_indices', 'Group_names', 'Models', 
...        'Train_time', 'RAM_usage'])
>>>
>>> # alpha values
>>> results['Metrics'].keys()
dict_keys([0.05, 0.1, 0.5])
>>>
>>> results['Metrics'][0.05]
{'AUROC': 0.9859,
'Accuracy': 0.945,
'F1-Score': 0.9452736318407959,
'Precision': 0.9405940594059405,
'Recall': 0.95}
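
A sketch of the return_probs option, which additionally captures prediction probabilities for each alpha under the 'Probabilities' key (same adata and alpha values as above):

>>> results = scmkl.run(adata, np.array([0.05, 0.1, 0.5]),
...                     return_probs=True)
>>> results['Probabilities'].keys()
dict_keys([0.05, 0.1, 0.5])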
def sort_groups( df: pandas.core.frame.DataFrame, group_col: str = 'Group', norm_col: str = 'Kernel Weight'):
49def sort_groups(df: pd.DataFrame, group_col: str='Group', 
50                norm_col: str='Kernel Weight'):
51    """
52    Takes a dataframe with `group_col` and returns a sorted group list 
53    with groups in descending order by their weights. Assumes there is 
54    one instance of each group.
55
56    Parameters
57    ----------
58    df : pd.DataFrame
59        A dataframe with `group_col` and `norm_col` to be sorted by.
60
61    group_col : str
62        The column containing the group names.
63
64    norm_col : str
65        The column containing the kernel weights.
66
67    Returns
68    -------
69    group_order : list
70        A list of groups in descending order according to their kernel 
71        weights.
72
73    Examples
74    --------
75    >>> result = scmkl.run(adata, alpha_list)
76    >>> weights = scmkl.get_weights(result)
77    >>> group_order = scmkl.sort_groups(weights, 'Group', 
78    ...                                 'Kernel Weight')
79    >>>
80    >>> group_order
81    ['HALLMARK_ESTROGEN_RESPONSE_EARLY', 'HALLM...', ...]
82    """
83    df = df.copy()
84    df = df.sort_values(norm_col, ascending=False)
85    group_order = list(df[group_col])
86
87    return group_order

Takes a dataframe with group_col and returns a sorted group list with groups in descending order by their weights. Assumes there is one instance of each group.

Parameters
  • df (pd.DataFrame): A dataframe with group_col and norm_col to be sorted by.
  • group_col (str): The column containing the group names.
  • norm_col (str): The column containing the kernel weights.
Returns
  • group_order (list): A list of groups in descending order according to their kernel weights.
Examples
>>> result = scmkl.run(adata, alpha_list)
>>> weights = scmkl.get_weights(result)
>>> group_order = scmkl.sort_groups(weights, 'Group', 
...                                 'Kernel Weight')
>>>
>>> group_order
['HALLMARK_ESTROGEN_RESPONSE_EARLY', 'HALLM...', ...]
def tfidf_normalize(adata: anndata._core.anndata.AnnData, binarize: bool = False):
 77def tfidf_normalize(adata: ad.AnnData, binarize: bool=False):
 78    """
 79    Function to TF-IDF normalize the data in an adata object. All 
 80    rows of `adata.X` must have a positive sum; an assertion error is 
 81    raised otherwise.
 82
 83    Parameters
 84    ----------
 85    adata : ad.AnnData
 86        `ad.AnnData` with `.X` to be normalized. If `'train_indices'` 
 87        and `'test_indices'` are in `adata.uns.keys()`, normalization 
 88        will be done separately for the training and testing data. 
 89        Otherwise, it will be calculated on the entire dataset.
 90
 91    binarize : bool 
 92        If `True`, all values in `adata.X` greater than 1 will become 
 93        1.
 94
 95    Returns
 96    -------
 97    adata : ad.AnnData 
 98        `adata` with `adata.X` TF-IDF normalized. Will now have the 
 99        train data stacked on test data, and the indices will be 
100        adjusted accordingly.
101
102    Examples
103    --------
104    >>> adata = scmkl.create_adata(X = data_mat, 
105    ...                            feature_names = gene_names, 
106    ...                            group_dict = group_dict)
107    >>> 
108    >>> adata = scmkl.tfidf_normalize(adata)
109    """
110    X = adata.X.copy()
111    row_sums = np.sum(X, axis = 1)
112    assert np.all(row_sums > 0), "TFIDF requires all row sums be positive"
113
114    if binarize:
115        X[X > 0] = 1
116
117    if 'train_indices' in adata.uns_keys():
118
119        train_indices = adata.uns['train_indices'].copy()
120        test_indices = adata.uns['test_indices'].copy()
121
122        # Calculate the train TFIDF matrix on just the training data so it is 
123        # not biased by testing data
124        tfidf_train = tfidf(X[train_indices,:], mode = 'normalize')
125
126        # Calculate the test TFIDF by calculating it on the train and test 
127        # data and index the test data
128        tfidf_test = tfidf(X, mode = 'normalize')[test_indices,:]
129
130        # Impossible to add rows back to original location so we need to 
131        # stack the matrices to maintain train/test
132        if scipy.sparse.issparse(X):
133            tfidf_norm = scipy.sparse.vstack((tfidf_train, tfidf_test))
134        else:
135            tfidf_norm = np.vstack((tfidf_train, tfidf_test))
136
137        # I'm not sure why this reassignment is necessary, but without, the 
138        # values will be saved as 0s in adata
139        adata.uns['train_indices'] = train_indices
140        adata.uns['test_indices'] = test_indices
141
142        combined_indices = np.concatenate((train_indices, test_indices))
143
144        # Anndata indexes by "rownames" not position so we need to rename the 
145        # rows to properly index
146        adata_index = adata.obs_names[combined_indices].astype(int)
147        tfidf_norm = tfidf_norm[np.argsort(adata_index),:]
148
149    else:
150
151        tfidf_norm = tfidf(X, mode = 'normalize')
152
153    adata.X = tfidf_norm.copy()
154
155    return adata

Function to TF-IDF normalize the data in an adata object. All rows of adata.X must have a positive sum; an assertion error is raised otherwise.

Parameters
  • adata (ad.AnnData): ad.AnnData with .X to be normalized. If 'train_indices' and 'test_indices' are in adata.uns.keys(), normalization will be done separately for the training and testing data. Otherwise, it will be calculated on the entire dataset.
  • binarize (bool): If True, all values in adata.X greater than 1 will become 1.
Returns
  • adata (ad.AnnData): adata with adata.X TF-IDF normalized. Will now have the train data stacked on test data, and the indices will be adjusted accordingly.
Examples
>>> adata = scmkl.create_adata(X = data_mat, 
...                            feature_names = gene_names, 
...                            group_dict = group_dict)
>>> 
>>> adata = scmkl.tfidf_normalize(adata)
def train_model( adata: anndata._core.anndata.AnnData, group_size: int | None = None, alpha: float = 0.9):
 7def train_model(adata: ad.AnnData, group_size: int | None=None, alpha:float=0.9):
 8    """
 9    Fit a group lasso model to the provided data.
10
11    Parameters
12    ----------
13    adata : ad.AnnData 
14        Has `'Z_train'` and `'Z_test'` keys in `.uns.keys()`.
15
16    group_size : None | int
17        Argument describing how the features are grouped. If `None`, 
18        `2 * adata.uns['D']` will be used. For more information see 
19        [celer documentation](https://mathurinm.github.io/celer/
20        generated/celer.GroupLasso.html).
21            
22    alpha : float
23        Group Lasso regularization coefficient; a floating point 
24        value controlling model solution sparsity. Must be a positive 
25        float. The smaller the value, the more feature groups will be 
26        selected in the trained model.
27    
28    Returns
29    -------
30    adata : ad.AnnData 
31        Trained model accessible with `adata.uns['model']`.
32
33    Examples
34    --------
35    >>> adata = scmkl.estimate_sigma(adata)
36    >>> adata = scmkl.calculate_z(adata)
37    >>> metrics = ['AUROC', 'F1-Score', 'Accuracy', 'Precision', 
38    ...            'Recall']
39    >>> d = scmkl.calculate_d(adata.shape[0])
40    >>> group_size = 2 * d
41    >>> adata = scmkl.train_model(adata, group_size)
42    >>>
43    >>> 'model' in adata.uns.keys()
44    True
45
46    See Also
47    --------
48    celer :
49        https://mathurinm.github.io/celer/generated/celer.GroupLasso.html
50    """
51    assert alpha > 0, 'Alpha must be positive'
52
53    if group_size is None:
54        group_size = 2*adata.uns['D']
55
56    y_train = adata.obs['labels'].iloc[adata.uns['train_indices']]
57    X_train = adata.uns['Z_train'][adata.uns['train_indices']]
58
59    cell_labels = np.unique(y_train)
60
61    # This is a regression algorithm. We need to make the labels 'continuous' 
62    # for classification, but they will remain binary. Casts training labels 
63    # to array of -1,1
64    train_labels = np.ones(y_train.shape)
65    train_labels[y_train == cell_labels[1]] = -1
66
67    # Alphamax is a calculation to regularize the effect of alpha across 
68    # different data sets
69    alphamax = np.max(np.abs(X_train.T.dot(train_labels)))
70    alphamax /= X_train.shape[0] 
71    alphamax *= alpha
72
73    # Instantiate celer Group Lasso Regression Model Object
74    model = celer.GroupLasso(groups=group_size, alpha=alphamax)
75
76    # Fit model using training data
77    model.fit(X_train, train_labels.ravel())
78
79    adata.uns['model'] = model
80    
81    return adata

Fit a group lasso model to the provided data.

Parameters
  • adata (ad.AnnData): Has 'Z_train' and 'Z_test' keys in .uns.keys().
  • group_size (None | int): Argument describing how the features are grouped. If None, 2 * adata.uns['D'] will be used. For more information see celer documentation.
  • alpha (float): Group Lasso regularization coefficient; a floating point value controlling model solution sparsity. Must be a positive float. The smaller the value, the more feature groups will be selected in the trained model.
Returns
  • adata (ad.AnnData): Trained model accessible with adata.uns['model'].
Examples
>>> adata = scmkl.estimate_sigma(adata)
>>> adata = scmkl.calculate_z(adata)
>>> metrics = ['AUROC', 'F1-Score', 'Accuracy', 'Precision', 
...            'Recall']
>>> d = scmkl.calculate_d(adata.shape[0])
>>> group_size = 2 * d
>>> adata = scmkl.train_model(adata, group_size)
>>>
>>> 'model' in adata.uns.keys()
True
See Also

celer: https://mathurinm.github.io/celer/generated/celer.GroupLasso.html

def weights_barplot( result, n_groups: int = 1, alpha: None | float = None, color: str = 'red'):
def weights_barplot(result, n_groups: int=1, alpha: None | float=None, 
                    color: str='red'):
    """
    Plots the top `n_groups` weighted groups for each cell class. Works 
    for a single scmkl result (either multiclass or binary).

    Parameters
    ----------
    result : dict
        The output of `scmkl.run()`.

    n_groups : int
        The number of top groups to plot for each cell class.

    alpha : None | float
        The alpha parameter to create the figure for. If `None`, the 
        smallest alpha is used.

    color : str
        The color the bars should be.

    Returns
    -------
    plot : plotnine.ggplot.ggplot
        A barplot of weights.

    Examples
    --------
    >>> result = scmkl.run(adata, alpha_list)
    >>> plot = scmkl.weights_barplot(result)

    ![weights_barplot](../tests/figures/weights_barplot_binary.png)
    """
    is_multi, is_many = _parse_result_type(result)
    assert not is_many, "This function only supports single results"

    alpha = _get_alpha(alpha, result, is_multi)
    df = get_weights(result)

    # Subsetting to only alpha and filtering to top groups
    df = df[df['Alpha'] == alpha]
    if is_multi:
        for ct in result['Classes']:
            cor_ct = df['Class'] == ct
            other_ct = ~cor_ct

            # Finding the top weighted groups for this class
            temp_df = df.copy()
            temp_df = temp_df[cor_ct]
            temp_df = temp_df.sort_values('Kernel Weight', ascending=False)
            top_groups = temp_df['Group'].iloc[0:n_groups].to_numpy()

            # Keeping rows for other classes plus this class's top groups
            cor_groups = np.isin(df['Group'], top_groups)
            filter_array = np.logical_and(cor_ct, cor_groups)
            filter_array = np.logical_or(filter_array, other_ct)

            df = df[filter_array]

        df['Group'] = format_group_names(df['Group'], rm_words=['Markers'])

    else:
        df = df.sort_values('Kernel Weight', ascending=False)
        df = df.iloc[0:n_groups]
        df['Group'] = format_group_names(df['Group'], rm_words=['Markers'])
        g_order = sort_groups(df)[::-1]
        df['Group'] = pd.Categorical(df['Group'], g_order)

    plot = (ggplot(df)
            + theme_classic()
            + coord_flip()
            + labs(y=f'Kernel Weight (λ = {alpha})')
            + theme(
                axis_text=element_text(weight='bold', size=10),
                axis_title=element_text(weight='bold', size=12),
                axis_title_y=element_blank()
            )
            )

    if is_multi:
        # Three facet columns; grow figure height with the number of classes
        height = 3 * ceil(len(set(df['Class'])) / 3)
        plot += geom_bar(aes(x='Group', y='Kernel Weight'), 
                         stat='identity', fill=color)
        plot += facet_wrap('Class', scales='free', ncol=3)
        plot += theme(figure_size=(15, height))
    else:
        plot += geom_bar(aes(x='Group', y='Kernel Weight'), 
                         stat='identity', fill=color)
        plot += theme(figure_size=(7, 9))

    return plot

Plots the top n_groups weighted groups for each cell class. Works for a single scmkl result (either multiclass or binary).

Parameters
  • result (dict): The output of scmkl.run.
  • n_groups (int): The number of top groups to plot for each cell class.
  • alpha (None | float): The alpha parameter to create the figure for. If None, the smallest alpha is used.
  • color (str): The color the bars should be.
Returns
  • plot (plotnine.ggplot.ggplot): A barplot of weights.
Examples
>>> result = scmkl.run(adata, alpha_list)
>>> plot = scmkl.weights_barplot(result)

[Figure: example weights_barplot output, tests/figures/weights_barplot_binary.png]
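
A slightly fuller usage sketch; the alpha values, group count, and color below are illustrative choices rather than defaults:

>>> import numpy as np
>>> result = scmkl.run(adata, np.array([0.05, 0.1, 0.5]))
>>> plot = scmkl.weights_barplot(result, n_groups=3, color='steelblue')
>>> plot.save('weights_barplot.png')  # plotnine plots can be saved directly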

def weights_dotplot(result, n_groups: None | int = None, class_lab: str | None = None, low: str = 'white', high: str = 'red', alpha: float | None = None, scale_weights: bool = False):
def weights_dotplot(result, n_groups: None | int=None, 
                    class_lab: str | None=None, low: str='white', 
                    high: str='red', alpha: float | None=None, 
                    scale_weights: bool=False):
    """
    Plots a dotplot of kernel weights with groups on the y-axis and 
    alpha on the x-axis for a binary result. For a multiclass result, 
    one alpha is used per class and the x-axis is class.

    Parameters
    ----------
    result : dict
        The output of `scmkl.run()`.

    n_groups : None | int
        The number of top groups to plot. If `None`, all groups are 
        plotted. Not recommended for multiclass results.

    class_lab : str | None
        For multiclass results, if `not None`, will only plot group 
        weights for `class_lab`.

    low : str
        The color for low kernel weight.

    high : str
        The color for high kernel weight.

    alpha : None | float
        The alpha parameter to create the figure for. If `None`, the 
        smallest alpha is used.

    scale_weights : bool
        If `True`, the kernel weights will be scaled for each group 
        within each class. Ignored if result is from a binary 
        classification.

    Returns
    -------
    plot : plotnine.ggplot.ggplot
        A dotplot of weights.

    Examples
    --------
    >>> result = scmkl.run(adata, alpha_list)
    >>> plot = scmkl.weights_dotplot(result)

    ![weights_dotplot](../tests/figures/weights_dotplot_binary.png)
    """
    is_multi, is_many = _parse_result_type(result)
    assert not is_many, "This function only supports single results"

    if type(class_lab) is str:
        result = result[class_lab]

    df = get_weights(result)
    df['Group'] = format_group_names(df['Group'], ['Markers'])

    # Ordering groups by their summed kernel weights
    sum_df = df.groupby('Group')['Kernel Weight'].sum()
    sum_df = sum_df.reset_index()
    order = sort_groups(sum_df)[::-1]
    df['Group'] = pd.Categorical(df['Group'], categories=order)

    if type(n_groups) is int:
        sum_df = sum_df.sort_values(by='Kernel Weight', ascending=False)
        top_groups = sum_df.iloc[0:n_groups]['Group'].to_numpy()
        df = df[np.isin(df['Group'], top_groups)]
    else:
        n_groups = len(set(df['Group']))

    df['Alpha'] = pd.Categorical(df['Alpha'], np.unique(df['Alpha']))

    # Sizing the figure to the number of groups shown
    if n_groups < 25:
        fig_size = (7, 6)
    else:
        fig_size = (7, 8)

    if 'Class' in df.columns:
        alpha = _get_alpha(alpha, result, is_multi)
        df = df[df['Alpha'] == alpha]
        x_lab = 'Class'
    else:
        x_lab = 'Alpha'

    # Scaling only applies when a 'Class' column is present (multiclass)
    if scale_weights and 'Class' in df.columns:
        # Scaling each class's weights by that class's maximum weight
        max_norms = dict()
        for ct in set(df['Class']):
            g_rows = df['Class'] == ct
            max_norms[ct] = np.max(df[g_rows]['Kernel Weight'])

        scale_cols = ['Class', 'Kernel Weight']
        new = df[scale_cols].apply(lambda x: x[1] / max_norms[x[0]], axis=1)
        df['Kernel Weight'] = new

        l_title = 'Scaled\nKernel Weight'
    else:
        l_title = 'Kernel Weight'

    plot = (ggplot(df, aes(x=x_lab, y='Group', fill='Kernel Weight', 
                           color='Kernel Weight'))
            + geom_point(size=5)
            + scale_fill_gradient(high=high, low=low)
            + scale_color_gradient(high=high, low=low)
            + theme_classic()
            + theme(
                figure_size=fig_size,
                axis_text=element_text(weight='bold', size=10),
                axis_text_x=element_text(rotation=90),
                axis_title=element_text(weight='bold', size=12),
                axis_title_y=element_blank(),
                legend_title=element_text(text=l_title, weight='bold', size=12),
                legend_text=element_text(weight='bold', size=10)
            ))

    return plot

Plots a dotplot of kernel weights with groups on the y-axis and alpha on the x-axis for a binary result. For a multiclass result, one alpha is used per class and the x-axis is class.

Parameters
  • result (dict): The output of scmkl.run.
  • n_groups (None | int): The number of top groups to plot. If None, all groups are plotted. Not recommended for multiclass results.
  • class_lab (str | None): For multiclass results, if not None, will only plot group weights for class_lab.
  • low (str): The color for low kernel weight.
  • high (str): The color for high kernel weight.
  • alpha (None | float): The alpha parameter to create the figure for. If None, the smallest alpha is used.
  • scale_weights (bool): If True, the kernel weights will be scaled for each group within each class. Ignored if result is from a binary classification.
Returns
  • plot (plotnine.ggplot.ggplot): A dotplot of weights.
Examples
>>> result = scmkl.run(adata, alpha_list)
>>> plot = scmkl.weights_dotplot(result)

[Figure: example weights_dotplot output, tests/figures/weights_dotplot_binary.png]
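
A usage sketch for a multiclass result; 'B cells' stands in for one of the labels in result['Classes'] and is purely hypothetical:

>>> # one column of dots per class, weights scaled within each class
>>> plot = scmkl.weights_dotplot(result, scale_weights=True, high='darkred')
>>> # or restrict the plot to a single class
>>> plot = scmkl.weights_dotplot(result, class_lab='B cells')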

def weights_heatmap(result, n_groups: None | int = None, class_lab: str | None = None, low: str = 'white', high: str = 'red', alpha: float | None = None, scale_weights: bool = False):
def weights_heatmap(result, n_groups: None | int=None, 
                    class_lab: str | None=None, low: str='white', 
                    high: str='red', alpha: float | None=None,
                    scale_weights: bool=False):
    """
    Plots a heatmap of kernel weights with groups on the y-axis and 
    alpha on the x-axis for a binary result. For a multiclass result, 
    one alpha is used per class and the x-axis is class.

    Parameters
    ----------
    result : dict
        The output of `scmkl.run()`.

    n_groups : None | int
        The number of top groups to plot. If `None`, all groups are 
        plotted. Not recommended for multiclass results.

    class_lab : str | None
        For multiclass results, if `not None`, will only plot group 
        weights for `class_lab`.

    low : str
        The color for low kernel weight.

    high : str
        The color for high kernel weight.

    alpha : None | float
        The alpha parameter to create the figure for. If `None`, the 
        smallest alpha is used.

    scale_weights : bool
        If `True`, the kernel weights will be scaled for each group 
        within each class. Ignored if result is from a binary 
        classification.

    Returns
    -------
    plot : plotnine.ggplot.ggplot
        A heatmap of weights.

    Examples
    --------
    >>> result = scmkl.run(adata, alpha_list)
    >>> plot = scmkl.weights_heatmap(result)

    ![weights_heatmap](../tests/figures/weights_heatmap_binary.png)
    """
    is_multi, is_many = _parse_result_type(result)
    assert not is_many, "This function only supports single results"

    if type(class_lab) is str:
        result = result[class_lab]

    df = get_weights(result)
    df['Group'] = format_group_names(df['Group'], ['Markers'])

    # Ordering groups by their summed kernel weights
    sum_df = df.groupby('Group')['Kernel Weight'].sum()
    sum_df = sum_df.reset_index()
    order = sort_groups(sum_df)[::-1]
    df['Group'] = pd.Categorical(df['Group'], categories=order)

    if type(n_groups) is int:
        sum_df = sum_df.sort_values(by='Kernel Weight', ascending=False)
        top_groups = sum_df.iloc[0:n_groups]['Group'].to_numpy()
        df = df[np.isin(df['Group'], top_groups)]
    else:
        n_groups = len(set(df['Group']))

    df['Alpha'] = pd.Categorical(df['Alpha'], np.unique(df['Alpha']))

    # Sizing the figure to the number of groups shown
    if n_groups < 25:
        fig_size = (7, 6)
    else:
        fig_size = (7, 8)

    if 'Class' in df.columns:
        alpha = _get_alpha(alpha, result, is_multi)
        df = df[df['Alpha'] == alpha]
        x_lab = 'Class'
    else:
        x_lab = 'Alpha'

    # Scaling only applies when a 'Class' column is present (multiclass)
    if scale_weights and 'Class' in df.columns:
        # Scaling each class's weights by that class's maximum weight
        max_norms = dict()
        for ct in set(df['Class']):
            g_rows = df['Class'] == ct
            max_norms[ct] = np.max(df[g_rows]['Kernel Weight'])

        scale_cols = ['Class', 'Kernel Weight']
        new = df[scale_cols].apply(lambda x: x[1] / max_norms[x[0]], axis=1)
        df['Kernel Weight'] = new

        l_title = 'Scaled\nKernel Weight'
    else:
        l_title = 'Kernel Weight'

    plot = (ggplot(df, aes(x=x_lab, y='Group', fill='Kernel Weight'))
            + geom_tile(color='black')
            + scale_fill_gradient(high=high, low=low)
            + theme_classic()
            + theme(
                figure_size=fig_size,
                axis_text=element_text(weight='bold', size=10),
                axis_text_x=element_text(rotation=90),
                axis_title=element_text(weight='bold', size=12),
                axis_title_y=element_blank(),
                legend_title=element_text(text=l_title, weight='bold', size=12),
                legend_text=element_text(weight='bold', size=10)
            ))

    return plot

Plots a heatmap of kernel weights with groups on the y-axis and alpha on the x-axis for a binary result. For a multiclass result, one alpha is used per class and the x-axis is class.

Parameters
  • result (dict): The output of scmkl.run.
  • n_groups (None | int): The number of top groups to plot. If None, all groups are plotted. Not recommended for multiclass results.
  • class_lab (str | None): For multiclass results, if not None, will only plot group weights for class_lab.
  • low (str): The color for low kernel weight.
  • high (str): The color for high kernel weight.
  • alpha (None | float): The alpha parameter to create the figure for. If None, the smallest alpha is used.
  • scale_weights (bool): If True, the kernel weights will be scaled for each group within each class. Ignored if result is from a binary classification.
Returns
  • plot (plotnine.ggplot.ggplot): A heatmap of weights.
Examples
>>> result = scmkl.run(adata, alpha_list)
>>> plot = scmkl.weights_heatmap(result)

[Figure: example weights_heatmap output, tests/figures/weights_heatmap_binary.png]
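
A final sketch for the heatmap, keeping the top 20 groups (an arbitrary cutoff) and saving to disk:

>>> plot = scmkl.weights_heatmap(result, n_groups=20, high='navy')
>>> plot.save('weights_heatmap.pdf')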