ALLCools.sandbox.scanorama

Allows choosing a different distance metric for the nearest-neighbor search.

Modified from Scanorama; see the original license at https://github.com/brianhie/scanorama/blob/master/LICENSE

Module Contents

ALPHA = 0.1[source]
APPROX = True[source]
BATCH_SIZE = 5000[source]
DIMRED = 100[source]
HVG[source]
KNN = 20[source]
N_ITER = 500[source]
PERPLEXITY = 1200[source]
SIGMA = 15[source]
VERBOSE = 2[source]
batch_correct_pc(adata, batch_series, correct=False, n_components=30, sigma=25, alpha=0.1, knn=30, metric='angular', **scanorama_kws)[source]

Batch-corrected PCA based on Scanorama integration.

Parameters
  • adata – a single AnnData object containing cells from all batches

  • batch_series – per-cell series used to split adata into batches

  • correct – if True, adata.X is corrected in place; otherwise only the corrected PCs are added to adata.obsm['X_pca']

  • n_components – number of components in PCA

  • sigma – Correction smoothing parameter on Gaussian kernel.

  • alpha – Alignment score minimum cutoff.

  • knn – Number of nearest neighbors to use for matching.

  • metric – Metric to use in calculating KNN

  • scanorama_kws – other parameters passed to the integration function

Returns

The input adata, with the corrected PCs stored in adata.obsm['X_pca'] (and adata.X corrected in place when correct=True).

Return type

adata
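
A minimal usage sketch (not part of the original documentation); the toy AnnData, batch labels, and expected shapes below are invented for illustration:

    import anndata
    import numpy as np
    import pandas as pd
    from ALLCools.sandbox.scanorama import batch_correct_pc

    # toy data: 300 cells x 200 features coming from two hypothetical batches
    adata = anndata.AnnData(np.random.rand(300, 200))
    batch_series = pd.Series(["batch1"] * 150 + ["batch2"] * 150, index=adata.obs_names)

    # correct=False leaves adata.X untouched; only the corrected PCs are stored
    adata = batch_correct_pc(adata, batch_series=batch_series, correct=False,
                             n_components=30, metric="angular")
    print(adata.obsm["X_pca"].shape)  # (300, 30) expected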

correct(datasets_full, genes_list, return_dimred=False, batch_size=BATCH_SIZE, verbose=VERBOSE, ds_names=None, dimred=DIMRED, approx=APPROX, sigma=SIGMA, alpha=ALPHA, knn=KNN, return_dense=False, hvg=None, union=False, geosketch=False, geosketch_max=20000, seed=0, metric='manhattan')[source]

Integrate and batch correct a list of data sets.

Parameters
  • datasets_full (list of scipy.sparse.csr_matrix or of numpy.ndarray) – Data sets to integrate and correct.

  • genes_list (list of list of string) – List of genes for each data set.

  • return_dimred (bool, optional (default: False)) – In addition to returning batch corrected matrices, also returns integrated low-dimensional embeddings.

  • batch_size (int, optional (default: 5000)) – The batch size used in the alignment vector computation. Useful when correcting very large (>100k samples) data sets. Set to a large value that runs within available memory.

  • verbose (bool or int, optional (default: 2)) – When True or not equal to 0, prints logging output.

  • ds_names (list of string, optional) – When verbose=True, reports data set names in logging output.

  • dimred (int, optional (default: 100)) – Dimensionality of integrated embedding.

  • approx (bool, optional (default: True)) – Use approximate nearest neighbors, greatly speeds up matching runtime.

  • sigma (float, optional (default: 15)) – Correction smoothing parameter on Gaussian kernel.

  • alpha (float, optional (default: 0.10)) – Alignment score minimum cutoff.

  • knn (int, optional (default: 20)) – Number of nearest neighbors to use for matching.

  • return_dense (bool, optional (default: False)) – Return numpy.ndarray matrices instead of scipy.sparse.csr_matrix.

  • hvg (int, optional (default: None)) – Use this number of top highly variable genes based on dispersion.

  • seed (int, optional (default: 0)) – Random seed to use.

Returns

  • corrected, genes – By default (return_dimred=False), returns a two-tuple containing a list of scipy.sparse.csr_matrix each with batch corrected values, and a single list of genes containing the intersection of inputted genes.

  • integrated, corrected, genes – When return_dimred=True, returns a three-tuple containing a list of numpy.ndarray with integrated low dimensional embeddings, a list of scipy.sparse.csr_matrix each with batch corrected values, and a single list of genes containing the intersection of inputted genes.
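
A sketch of calling correct() on two toy data sets (not from the original documentation); the matrices, gene names, and expected shapes are invented for illustration:

    import numpy as np
    from scipy.sparse import csr_matrix
    from ALLCools.sandbox.scanorama import correct

    # two hypothetical data sets sharing 150 of their 200 genes
    genes_a = [f"gene{i}" for i in range(200)]
    genes_b = [f"gene{i}" for i in range(50, 250)]
    ds_a = csr_matrix(np.random.rand(200, len(genes_a)))
    ds_b = csr_matrix(np.random.rand(150, len(genes_b)))

    # default call: batch corrected matrices plus the gene intersection
    corrected, genes = correct([ds_a, ds_b], [genes_a, genes_b])

    # return_dimred=True additionally returns the integrated embeddings first
    integrated, corrected, genes = correct(
        [ds_a, ds_b], [genes_a, genes_b], return_dimred=True, dimred=50)
    print(integrated[0].shape)  # (200, 50) expected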

integrate(datasets_full, genes_list, batch_size=BATCH_SIZE, verbose=VERBOSE, ds_names=None, dimred=DIMRED, approx=APPROX, sigma=SIGMA, alpha=ALPHA, knn=KNN, geosketch=False, geosketch_max=20000, n_iter=1, union=False, hvg=None, seed=0, metric='manhattan')[source]

Integrate a list of data sets.

Parameters
  • datasets_full (list of scipy.sparse.csr_matrix or of numpy.ndarray) – Data sets to integrate and correct.

  • genes_list (list of list of string) – List of genes for each data set.

  • batch_size (int, optional (default: 5000)) – The batch size used in the alignment vector computation. Useful when correcting very large (>100k samples) data sets. Set to a large value that runs within available memory.

  • verbose (bool or int, optional (default: 2)) – When True or not equal to 0, prints logging output.

  • ds_names (list of string, optional) – When verbose=True, reports data set names in logging output.

  • dimred (int, optional (default: 100)) – Dimensionality of integrated embedding.

  • approx (bool, optional (default: True)) – Use approximate nearest neighbors, greatly speeds up matching runtime.

  • sigma (float, optional (default: 15)) – Correction smoothing parameter on Gaussian kernel.

  • alpha (float, optional (default: 0.10)) – Alignment score minimum cutoff.

  • knn (int, optional (default: 20)) – Number of nearest neighbors to use for matching.

  • hvg (int, optional (default: None)) – Use this number of top highly variable genes based on dispersion.

  • seed (int, optional (default: 0)) – Random seed to use.

Returns

Returns a two-tuple containing a list of numpy.ndarray with integrated low dimensional embeddings and a single list of genes containing the intersection of inputted genes.

Return type

integrated, genes
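
A sketch of integrate() alone (not from the original documentation); the input construction mirrors the correct() example above and is invented for illustration:

    import numpy as np
    from scipy.sparse import csr_matrix
    from ALLCools.sandbox.scanorama import integrate

    genes_a = [f"gene{i}" for i in range(200)]
    genes_b = [f"gene{i}" for i in range(50, 250)]
    ds_a = csr_matrix(np.random.rand(200, len(genes_a)))
    ds_b = csr_matrix(np.random.rand(150, len(genes_b)))

    # only the low-dimensional embeddings and the shared gene list are returned
    integrated, genes = integrate([ds_a, ds_b], [genes_a, genes_b], dimred=50)
    print([emb.shape for emb in integrated])  # [(200, 50), (150, 50)] expected
    print(len(genes))                         # 150 shared genes expected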

correct_scanpy(adatas, **kwargs)[source]

Batch correct a list of scanpy.api.AnnData.

Parameters
  • adatas (list of scanpy.api.AnnData) – Data sets to integrate and/or correct.

  • kwargs (dict) – See documentation for the correct() method for a full list of parameters to use for batch correction.

Returns

  • corrected – By default (return_dimred=False), returns a list of new scanpy.api.AnnData.

  • integrated, corrected – When return_dimred=True, returns a two-tuple containing a list of np.ndarray with integrated low-dimensional embeddings and a list of new scanpy.api.AnnData.
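
A sketch of correct_scanpy() on two toy AnnData objects (not from the original documentation); the gene names and matrix sizes are invented for illustration:

    import anndata
    import numpy as np
    import pandas as pd
    from ALLCools.sandbox.scanorama import correct_scanpy

    adata_a = anndata.AnnData(np.random.rand(200, 200),
                              var=pd.DataFrame(index=[f"gene{i}" for i in range(200)]))
    adata_b = anndata.AnnData(np.random.rand(150, 200),
                              var=pd.DataFrame(index=[f"gene{i}" for i in range(50, 250)]))

    # default: a list of new, batch corrected AnnData objects
    corrected = correct_scanpy([adata_a, adata_b])

    # return_dimred=True also returns the integrated embeddings
    integrated, corrected = correct_scanpy([adata_a, adata_b],
                                           return_dimred=True, dimred=50)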

integrate_scanpy(adatas, **kwargs)[source]

Integrate a list of scanpy.api.AnnData.

Parameters
  • adatas (list of scanpy.api.AnnData) – Data sets to integrate.

  • kwargs (dict) – See documentation for the integrate() method for a full list of parameters to use for batch correction.

Returns

Returns a list of np.ndarray with integrated low-dimensional embeddings.

Return type

integrated
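
A sketch of integrate_scanpy() (not from the original documentation); the toy AnnData objects and expected shapes are invented for illustration:

    import anndata
    import numpy as np
    import pandas as pd
    from ALLCools.sandbox.scanorama import integrate_scanpy

    adata_a = anndata.AnnData(np.random.rand(200, 200),
                              var=pd.DataFrame(index=[f"gene{i}" for i in range(200)]))
    adata_b = anndata.AnnData(np.random.rand(150, 200),
                              var=pd.DataFrame(index=[f"gene{i}" for i in range(50, 250)]))

    integrated = integrate_scanpy([adata_a, adata_b], dimred=50, metric="angular")
    print([emb.shape for emb in integrated])  # [(200, 50), (150, 50)] expected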

merge_datasets(datasets, genes, ds_names=None, verbose=True, union=False)[source]
check_datasets(datasets_full)[source]
reduce_dimensionality(X, dim_red_k=100)[source]
dimensionality_reduce(datasets, dimred=DIMRED)[source]
dispersion(X)[source]
process_data(datasets, genes, hvg=HVG, dimred=DIMRED, verbose=True)[source]
nn(ds1, ds2, knn=KNN, metric_p=2)[source]
nn_approx(ds1, ds2, knn=KNN, metric='manhattan', n_trees=10)[source]
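
The knn/metric/n_trees parameters of nn_approx mirror Scanorama's Annoy-based approximate nearest-neighbor matching; the helper below is an illustrative re-implementation of that idea, not the module's own code, and the function name is hypothetical:

    import numpy as np
    from annoy import AnnoyIndex

    def approx_matches(ds1, ds2, knn=20, metric="manhattan", n_trees=10):
        """Return (i, j) pairs matching each row of ds1 to its knn rows in ds2."""
        index = AnnoyIndex(ds2.shape[1], metric)
        for j, row in enumerate(ds2):
            index.add_item(j, row)
        index.build(n_trees)
        matches = set()
        for i, row in enumerate(ds1):
            for j in index.get_nns_by_vector(row, knn):
                matches.add((i, j))
        return matches

    matches = approx_matches(np.random.rand(100, 30), np.random.rand(80, 30), knn=5)
    print(len(matches))  # up to 100 * 5 pairs
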
fill_table(table, i, curr_ds, datasets, base_ds=0, knn=KNN, approx=APPROX, metric='manhattan')[source]
gs_idxs[source]
find_alignments_table(datasets, knn=KNN, approx=APPROX, verbose=VERBOSE, prenormalized=False, geosketch=False, geosketch_max=20000, metric='manhattan')[source]
find_alignments(datasets, knn=KNN, approx=APPROX, verbose=VERBOSE, alpha=ALPHA, prenormalized=False, geosketch=False, geosketch_max=20000, metric='manhattan')[source]
connect(datasets, knn=KNN, approx=APPROX, alpha=ALPHA, verbose=VERBOSE, metric='manhattan')[source]
handle_zeros_in_scale(scale, copy=True)[source]

Makes sure that whenever scale is zero, it is handled correctly; this happens in most scalers when there are constant features. Adapted from sklearn.preprocessing.data.
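
A minimal sketch of the behavior described above (mirroring sklearn's private helper); the function name here is illustrative:

    import numpy as np

    def _handle_zeros_in_scale(scale, copy=True):
        # a zero scale factor (constant feature) is replaced by 1 so that
        # dividing by it leaves the feature unchanged instead of producing NaN/inf
        if np.isscalar(scale):
            return 1.0 if scale == 0.0 else scale
        scale = scale.copy() if copy else scale
        scale[scale == 0.0] = 1.0
        return scale

    print(_handle_zeros_in_scale(np.array([2.0, 0.0, 5.0])))  # [2. 1. 5.]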

batch_bias(curr_ds, match_ds, bias, batch_size=None, sigma=SIGMA)[source]
transform(curr_ds, curr_ref, ds_ind, ref_ind, sigma=SIGMA, cn=False, batch_size=None)[source]
assemble(datasets, verbose=VERBOSE, knn=KNN, sigma=SIGMA, approx=APPROX, alpha=ALPHA, expr_datasets=None, ds_names=None, batch_size=None, geosketch=False, geosketch_max=20000, alignments=None, matches=None, metric='manhattan')[source]
interpret_alignments(datasets, expr_datasets, genes, verbose=VERBOSE, knn=KNN, approx=APPROX, alpha=ALPHA, n_permutations=None, metric='manhattan')[source]