ALLCools.sandbox.scanorama

Allows choosing a different distance metric for the nearest-neighbor search.

Modified from Scanorama; see the original license at https://github.com/brianhie/scanorama/blob/master/LICENSE

Module Contents

ALPHA = 0.1[source]
APPROX = True[source]
BATCH_SIZE = 5000[source]
DIMRED = 100[source]
HVG[source]
KNN = 20[source]
N_ITER = 500[source]
PERPLEXITY = 1200[source]
SIGMA = 15[source]
VERBOSE = 2[source]
batch_correct_pc(adata, batch_series, correct=False, n_components=30, sigma=25, alpha=0.1, knn=30, metric='angular', **scanorama_kws)[source]

Batch-corrected PCA based on Scanorama integration.

Parameters
  • adata – a single AnnData object containing cells from all batches

  • batch_series – per-cell series used to split adata into batches

  • correct – if True, adata.X is corrected in place; otherwise only the corrected PCs are added to adata.obsm['X_pca']

  • n_components – number of components in PCA

  • sigma – Correction smoothing parameter on Gaussian kernel.

  • alpha – Alignment score minimum cutoff.

  • knn – Number of nearest neighbors to use for matching.

  • metric – Metric to use in calculating KNN

  • scanorama_kws – other parameters passed to the integration function

Returns

The input adata, with the corrected PCs stored in adata.obsm['X_pca'] (and adata.X corrected in place when correct=True).

Return type

adata
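
A minimal usage sketch (not part of the original documentation); the toy AnnData, batch labels, and expected shapes below are invented for illustration:

    import anndata
    import numpy as np
    import pandas as pd
    from ALLCools.sandbox.scanorama import batch_correct_pc

    # toy data: 300 cells x 200 features coming from two hypothetical batches
    adata = anndata.AnnData(np.random.rand(300, 200))
    batch_series = pd.Series(["batch1"] * 150 + ["batch2"] * 150, index=adata.obs_names)

    # correct=False leaves adata.X untouched; only the corrected PCs are stored
    adata = batch_correct_pc(adata, batch_series=batch_series, correct=False,
                             n_components=30, metric="angular")
    print(adata.obsm["X_pca"].shape)  # (300, 30) expected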

correct(datasets_full, genes_list, return_dimred=False, batch_size=BATCH_SIZE, verbose=VERBOSE, ds_names=None, dimred=DIMRED, approx=APPROX, sigma=SIGMA, alpha=ALPHA, knn=KNN, return_dense=False, hvg=None, union=False, geosketch=False, geosketch_max=20000, seed=0, metric='manhattan')[source]

Integrate and batch correct a list of data sets.

Parameters
  • datasets_full (list of scipy.sparse.csr_matrix or of numpy.ndarray) – Data sets to integrate and correct.

  • genes_list (list of list of string) – List of genes for each data set.

  • return_dimred (bool, optional (default: False)) – In addition to returning batch corrected matrices, also returns integrated low-dimensional embeddings.

  • batch_size (int, optional (default: 5000)) – The batch size used in the alignment vector computation. Useful when correcting very large (>100k samples) data sets. Set to a large value that runs within available memory.

  • verbose (bool or int, optional (default: 2)) – When True or not equal to 0, prints logging output.

  • ds_names (list of string, optional) – When verbose=True, reports data set names in logging output.

  • dimred (int, optional (default: 100)) – Dimensionality of integrated embedding.

  • approx (bool, optional (default: True)) – Use approximate nearest neighbors, greatly speeds up matching runtime.

  • sigma (float, optional (default: 15)) – Correction smoothing parameter on Gaussian kernel.

  • alpha (float, optional (default: 0.10)) – Alignment score minimum cutoff.

  • knn (int, optional (default: 20)) – Number of nearest neighbors to use for matching.

  • return_dense (bool, optional (default: False)) – Return numpy.ndarray matrices instead of scipy.sparse.csr_matrix.

  • hvg (int, optional (default: None)) – Use this number of top highly variable genes based on dispersion.

  • seed (int, optional (default: 0)) – Random seed to use.

Returns

  • corrected, genes – By default (return_dimred=False), returns a two-tuple containing a list of scipy.sparse.csr_matrix each with batch corrected values, and a single list of genes containing the intersection of inputted genes.

  • integrated, corrected, genes – When return_dimred=True, returns a three-tuple containing a list of numpy.ndarray with integrated low dimensional embeddings, a list of scipy.sparse.csr_matrix each with batch corrected values, and a single list of genes containing the intersection of inputted genes.
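
A sketch of calling correct() on two toy data sets (not from the original documentation); the matrices, gene names, and expected shapes are invented for illustration:

    import numpy as np
    from scipy.sparse import csr_matrix
    from ALLCools.sandbox.scanorama import correct

    # two hypothetical data sets sharing 150 of their 200 genes
    genes_a = [f"gene{i}" for i in range(200)]
    genes_b = [f"gene{i}" for i in range(50, 250)]
    ds_a = csr_matrix(np.random.rand(200, len(genes_a)))
    ds_b = csr_matrix(np.random.rand(150, len(genes_b)))

    # default call: batch corrected matrices plus the gene intersection
    corrected, genes = correct([ds_a, ds_b], [genes_a, genes_b])

    # return_dimred=True additionally returns the integrated embeddings first
    integrated, corrected, genes = correct(
        [ds_a, ds_b], [genes_a, genes_b], return_dimred=True, dimred=50)
    print(integrated[0].shape)  # (200, 50) expected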

integrate(datasets_full, genes_list, batch_size=BATCH_SIZE, verbose=VERBOSE, ds_names=None, dimred=DIMRED, approx=APPROX, sigma=SIGMA, alpha=ALPHA, knn=KNN, geosketch=False, geosketch_max=20000, n_iter=1, union=False, hvg=None, seed=0, metric='manhattan')[source]

Integrate a list of data sets.

Parameters
  • datasets_full (list of scipy.sparse.csr_matrix or of numpy.ndarray) – Data sets to integrate and correct.

  • genes_list (list of list of string) – List of genes for each data set.

  • batch_size (int, optional (default: 5000)) – The batch size used in the alignment vector computation. Useful when correcting very large (>100k samples) data sets. Set to a large value that runs within available memory.

  • verbose (bool or int, optional (default: 2)) – When True or not equal to 0, prints logging output.

  • ds_names (list of string, optional) – When verbose=True, reports data set names in logging output.

  • dimred (int, optional (default: 100)) – Dimensionality of integrated embedding.

  • approx (bool, optional (default: True)) – Use approximate nearest neighbors, greatly speeds up matching runtime.

  • sigma (float, optional (default: 15)) – Correction smoothing parameter on Gaussian kernel.

  • alpha (float, optional (default: 0.10)) – Alignment score minimum cutoff.

  • knn (int, optional (default: 20)) – Number of nearest neighbors to use for matching.

  • hvg (int, optional (default: None)) – Use this number of top highly variable genes based on dispersion.

  • seed (int, optional (default: 0)) – Random seed to use.

Returns

Returns a two-tuple containing a list of numpy.ndarray with integrated low dimensional embeddings and a single list of genes containing the intersection of inputted genes.

Return type

integrated, genes
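
A sketch of integrate() alone (not from the original documentation); the input construction mirrors the correct() example above and is invented for illustration:

    import numpy as np
    from scipy.sparse import csr_matrix
    from ALLCools.sandbox.scanorama import integrate

    genes_a = [f"gene{i}" for i in range(200)]
    genes_b = [f"gene{i}" for i in range(50, 250)]
    ds_a = csr_matrix(np.random.rand(200, len(genes_a)))
    ds_b = csr_matrix(np.random.rand(150, len(genes_b)))

    # only the low-dimensional embeddings and the shared gene list are returned
    integrated, genes = integrate([ds_a, ds_b], [genes_a, genes_b], dimred=50)
    print([emb.shape for emb in integrated])  # [(200, 50), (150, 50)] expected
    print(len(genes))                         # 150 shared genes expected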

correct_scanpy(adatas, **kwargs)[source]

Batch correct a list of scanpy.api.AnnData.

Parameters
  • adatas (list of scanpy.api.AnnData) – Data sets to integrate and/or correct.

  • kwargs (dict) – See documentation for the correct() method for a full list of parameters to use for batch correction.

Returns

  • corrected – By default (return_dimred=False), returns a list of new scanpy.api.AnnData.

  • integrated, corrected – When return_dimred=True, returns a two-tuple containing a list of np.ndarray with integrated low-dimensional embeddings and a list of new scanpy.api.AnnData.
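
A sketch of correct_scanpy() on two toy AnnData objects (not from the original documentation); the gene names and matrix sizes are invented for illustration:

    import anndata
    import numpy as np
    import pandas as pd
    from ALLCools.sandbox.scanorama import correct_scanpy

    adata_a = anndata.AnnData(np.random.rand(200, 200),
                              var=pd.DataFrame(index=[f"gene{i}" for i in range(200)]))
    adata_b = anndata.AnnData(np.random.rand(150, 200),
                              var=pd.DataFrame(index=[f"gene{i}" for i in range(50, 250)]))

    # default: a list of new, batch corrected AnnData objects
    corrected = correct_scanpy([adata_a, adata_b])

    # return_dimred=True also returns the integrated embeddings
    integrated, corrected = correct_scanpy([adata_a, adata_b],
                                           return_dimred=True, dimred=50)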

integrate_scanpy(adatas, **kwargs)[source]

Integrate a list of scanpy.api.AnnData.

Parameters
  • adatas (list of scanpy.api.AnnData) – Data sets to integrate.

  • kwargs (dict) – See documentation for the integrate() method for a full list of parameters to use for batch correction.

Returns

Returns a list of np.ndarray with integrated low-dimensional embeddings.

Return type

integrated
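
A sketch of integrate_scanpy() (not from the original documentation); the toy AnnData objects and expected shapes are invented for illustration:

    import anndata
    import numpy as np
    import pandas as pd
    from ALLCools.sandbox.scanorama import integrate_scanpy

    adata_a = anndata.AnnData(np.random.rand(200, 200),
                              var=pd.DataFrame(index=[f"gene{i}" for i in range(200)]))
    adata_b = anndata.AnnData(np.random.rand(150, 200),
                              var=pd.DataFrame(index=[f"gene{i}" for i in range(50, 250)]))

    integrated = integrate_scanpy([adata_a, adata_b], dimred=50, metric="angular")
    print([emb.shape for emb in integrated])  # [(200, 50), (150, 50)] expected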

merge_datasets(datasets, genes, ds_names=None, verbose=True, union=False)[source]
check_datasets(datasets_full)[source]
reduce_dimensionality(X, dim_red_k=100)[source]
dimensionality_reduce(datasets, dimred=DIMRED)[source]
dispersion(X)[source]
process_data(datasets, genes, hvg=HVG, dimred=DIMRED, verbose=True)[source]
nn(ds1, ds2, knn=KNN, metric_p=2)[source]
nn_approx(ds1, ds2, knn=KNN, metric='manhattan', n_trees=10)[source]
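
The knn/metric/n_trees parameters of nn_approx mirror Scanorama's Annoy-based approximate nearest-neighbor matching; the helper below is an illustrative re-implementation of that idea, not the module's own code, and the function name is hypothetical:

    import numpy as np
    from annoy import AnnoyIndex

    def approx_matches(ds1, ds2, knn=20, metric="manhattan", n_trees=10):
        """Return (i, j) pairs matching each row of ds1 to its knn rows in ds2."""
        index = AnnoyIndex(ds2.shape[1], metric)
        for j, row in enumerate(ds2):
            index.add_item(j, row)
        index.build(n_trees)
        matches = set()
        for i, row in enumerate(ds1):
            for j in index.get_nns_by_vector(row, knn):
                matches.add((i, j))
        return matches

    matches = approx_matches(np.random.rand(100, 30), np.random.rand(80, 30), knn=5)
    print(len(matches))  # up to 100 * 5 pairs
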
fill_table(table, i, curr_ds, datasets, base_ds=0, knn=KNN, approx=APPROX, metric='manhattan')[source]
gs_idxs[source]
find_alignments_table(datasets, knn=KNN, approx=APPROX, verbose=VERBOSE, prenormalized=False, geosketch=False, geosketch_max=20000, metric='manhattan')[source]
find_alignments(datasets, knn=KNN, approx=APPROX, verbose=VERBOSE, alpha=ALPHA, prenormalized=False, geosketch=False, geosketch_max=20000, metric='manhattan')[source]
connect(datasets, knn=KNN, approx=APPROX, alpha=ALPHA, verbose=VERBOSE, metric='manhattan')[source]
handle_zeros_in_scale(scale, copy=True)[source]

Makes sure that whenever scale is zero, it is handled correctly; this happens in most scalers when there are constant features. Adapted from sklearn.preprocessing.data.
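
A minimal sketch of the behavior described above (mirroring sklearn's private helper); the function name here is illustrative:

    import numpy as np

    def _handle_zeros_in_scale(scale, copy=True):
        # a zero scale factor (constant feature) is replaced by 1 so that
        # dividing by it leaves the feature unchanged instead of producing NaN/inf
        if np.isscalar(scale):
            return 1.0 if scale == 0.0 else scale
        scale = scale.copy() if copy else scale
        scale[scale == 0.0] = 1.0
        return scale

    print(_handle_zeros_in_scale(np.array([2.0, 0.0, 5.0])))  # [2. 1. 5.]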

batch_bias(curr_ds, match_ds, bias, batch_size=None, sigma=SIGMA)[source]
transform(curr_ds, curr_ref, ds_ind, ref_ind, sigma=SIGMA, cn=False, batch_size=None)[source]
assemble(datasets, verbose=VERBOSE, knn=KNN, sigma=SIGMA, approx=APPROX, alpha=ALPHA, expr_datasets=None, ds_names=None, batch_size=None, geosketch=False, geosketch_max=20000, alignments=None, matches=None, metric='manhattan')[source]
interpret_alignments(datasets, expr_datasets, genes, verbose=VERBOSE, knn=KNN, approx=APPROX, alpha=ALPHA, n_permutations=None, metric='manhattan')[source]