ALLCools.sandbox.scanorama
Allows choosing a different distance metric.
Modified from Scanorama, see LICENSE https://github.com/brianhie/scanorama/blob/master/LICENSE
Module Contents
- batch_correct_pc(adata, batch_series, correct=False, n_components=30, sigma=25, alpha=0.1, knn=30, metric='angular', **scanorama_kws)
Compute batch-corrected PCs based on integration.
- Parameters
adata – a single adata containing the cells from all batches
batch_series – batch labels used to split adata into per-batch data sets
correct – if True, adata.X is corrected in place; otherwise only the corrected PCs are added to adata.obsm['X_pca']
n_components – number of components in PCA
sigma – Correction smoothing parameter on Gaussian kernel.
alpha – Alignment score minimum cutoff.
knn – Number of nearest neighbors to use for matching.
metric – Metric to use in calculating KNN
scanorama_kws – Other parameters passed to the integration function
- Returns
- Return type
adata
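The metric argument decides how distances between cells are measured during the nearest-neighbor search. As a rough illustration of how the two default values that appear on this page ('angular' and 'manhattan') differ, the sketch below computes both in plain Python. It assumes the common convention in which angular distance is the Euclidean distance between unit-normalized vectors; the helper names and toy vectors are hypothetical, and this is not the module's own implementation.

```python
import math

def manhattan(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))

def angular(u, v):
    """Euclidean distance between the unit-normalized vectors,
    which equals sqrt(2 * (1 - cosine_similarity))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp to guard against tiny floating-point overshoot.
    cos = max(-1.0, min(1.0, cos))
    return math.sqrt(2.0 * (1.0 - cos))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the magnitude
c = [3.0, 2.0, 1.0]   # different direction, same magnitude as a

print(manhattan(a, b), angular(a, b))
print(manhattan(a, c), angular(a, c))
```

Manhattan distance grows with magnitude, while angular distance depends only on direction, so two cells with proportional profiles stay close under 'angular' even when their totals differ.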
- correct(datasets_full, genes_list, return_dimred=False, batch_size=BATCH_SIZE, verbose=VERBOSE, ds_names=None, dimred=DIMRED, approx=APPROX, sigma=SIGMA, alpha=ALPHA, knn=KNN, return_dense=False, hvg=None, union=False, geosketch=False, geosketch_max=20000, seed=0, metric='manhattan')
Integrate and batch correct a list of data sets.
- Parameters
datasets_full (list of scipy.sparse.csr_matrix or of numpy.ndarray) – Data sets to integrate and correct.
genes_list (list of list of string) – List of genes for each data set.
return_dimred (bool, optional (default: False)) – In addition to returning batch corrected matrices, also return integrated low-dimensional embeddings.
batch_size (int, optional (default: 5000)) – The batch size used in the alignment vector computation. Useful when correcting very large (>100k samples) data sets. Set to a large value that still fits within available memory.
verbose (bool or int, optional (default: 2)) – When True or not equal to 0, prints logging output.
ds_names (list of string, optional) – When verbose=True, reports data set names in logging output.
dimred (int, optional (default: 100)) – Dimensionality of integrated embedding.
approx (bool, optional (default: True)) – Use approximate nearest neighbors, which greatly speeds up matching.
sigma (float, optional (default: 15)) – Correction smoothing parameter on Gaussian kernel.
alpha (float, optional (default: 0.10)) – Alignment score minimum cutoff.
knn (int, optional (default: 20)) – Number of nearest neighbors to use for matching.
return_dense (bool, optional (default: False)) – Return numpy.ndarray matrices instead of scipy.sparse.csr_matrix.
hvg (int, optional (default: None)) – Use this number of top highly variable genes based on dispersion.
seed (int, optional (default: 0)) – Random seed to use.
- Returns
corrected, genes – By default (return_dimred=False), returns a two-tuple containing a list of scipy.sparse.csr_matrix each with batch corrected values, and a single list of genes containing the intersection of the input genes.
integrated, corrected, genes – When return_dimred=True, returns a three-tuple containing a list of numpy.ndarray with integrated low-dimensional embeddings, a list of scipy.sparse.csr_matrix each with batch corrected values, and a single list of genes containing the intersection of the input genes.
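The returned gene list is described as the intersection of the input gene lists. A minimal sketch of that idea in plain Python follows; the helper names and the toy gene lists are made up for illustration, and the sorted output order is an assumption rather than something this page states.

```python
def intersect_genes(genes_list):
    """Return the sorted genes shared by every data set.

    Hypothetical helper for illustration only; not part of the module.
    """
    shared = set(genes_list[0])
    for genes in genes_list[1:]:
        shared &= set(genes)
    return sorted(shared)

def column_indices(genes, shared):
    """Positions of the shared genes within one data set's gene list."""
    position = {gene: idx for idx, gene in enumerate(genes)}
    return [position[gene] for gene in shared]

# Toy gene lists for two data sets (example data only).
genes_list = [
    ["Gad1", "Slc17a7", "Snap25", "Pvalb"],
    ["Snap25", "Gad1", "Sst"],
]
shared = intersect_genes(genes_list)
# Column order to apply to each matrix so all data sets line up on `shared`.
columns = [column_indices(genes, shared) for genes in genes_list]
print(shared)
print(columns)
```

Each data set's columns would then be reordered with its index list so that every matrix shares the same gene axis before matching.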
- integrate(datasets_full, genes_list, batch_size=BATCH_SIZE, verbose=VERBOSE, ds_names=None, dimred=DIMRED, approx=APPROX, sigma=SIGMA, alpha=ALPHA, knn=KNN, geosketch=False, geosketch_max=20000, n_iter=1, union=False, hvg=None, seed=0, metric='manhattan')
Integrate a list of data sets.
- Parameters
datasets_full (list of scipy.sparse.csr_matrix or of numpy.ndarray) – Data sets to integrate and correct.
genes_list (list of list of string) – List of genes for each data set.
batch_size (int, optional (default: 5000)) – The batch size used in the alignment vector computation. Useful when correcting very large (>100k samples) data sets. Set to a large value that still fits within available memory.
verbose (bool or int, optional (default: 2)) – When True or not equal to 0, prints logging output.
ds_names (list of string, optional) – When verbose=True, reports data set names in logging output.
dimred (int, optional (default: 100)) – Dimensionality of integrated embedding.
approx (bool, optional (default: True)) – Use approximate nearest neighbors, which greatly speeds up matching.
sigma (float, optional (default: 15)) – Correction smoothing parameter on Gaussian kernel.
alpha (float, optional (default: 0.10)) – Alignment score minimum cutoff.
knn (int, optional (default: 20)) – Number of nearest neighbors to use for matching.
hvg (int, optional (default: None)) – Use this number of top highly variable genes based on dispersion.
seed (int, optional (default: 0)) – Random seed to use.
- Returns
Returns a two-tuple containing a list of numpy.ndarray with integrated low-dimensional embeddings and a single list of genes containing the intersection of the input genes.
- Return type
integrated, genes
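The sigma parameter is described as a smoothing parameter on a Gaussian kernel. One way to picture this is a Gaussian-weighted average of per-match correction vectors, sketched below in plain Python. The function names, the gamma argument, and the toy coordinates are all hypothetical; this page does not state how sigma maps onto the kernel width, so treat the sketch as an illustration of kernel smoothing in general rather than the module's exact computation.

```python
import math

def gaussian_weights(cell, anchors, gamma):
    """Unnormalized Gaussian (RBF) weights exp(-gamma * squared distance)."""
    weights = []
    for anchor in anchors:
        sq_dist = sum((a - b) ** 2 for a, b in zip(cell, anchor))
        weights.append(math.exp(-gamma * sq_dist))
    return weights

def smoothed_correction(cell, anchors, corrections, gamma):
    """Weighted average of the anchors' correction vectors for one cell."""
    weights = gaussian_weights(cell, anchors, gamma)
    total = sum(weights)
    dims = len(corrections[0])
    return [
        sum(w * corr[d] for w, corr in zip(weights, corrections)) / total
        for d in range(dims)
    ]

anchors = [[0.0, 0.0], [4.0, 0.0]]       # matched cells in this data set
corrections = [[1.0, 0.0], [0.0, 1.0]]   # shift suggested by each match
cell = [1.0, 0.0]                        # unmatched cell, nearer anchor 0

# A narrow kernel follows the nearest anchor; a wide kernel blends both.
narrow = smoothed_correction(cell, anchors, corrections, gamma=1.0)
wide = smoothed_correction(cell, anchors, corrections, gamma=0.01)
print(narrow)
print(wide)
```

With the narrow kernel the cell takes almost all of its correction from the nearest anchor, while the wide kernel averages the two anchors nearly equally, which is the trade-off a smoothing parameter controls.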
- correct_scanpy(adatas, **kwargs)
Batch correct a list of scanpy.api.AnnData.
- Parameters
adatas (list of scanpy.api.AnnData) – Data sets to integrate and/or correct.
kwargs (dict) – See documentation for the correct() method for a full list of parameters to use for batch correction.
- Returns
corrected – By default (return_dimred=False), returns a list of new scanpy.api.AnnData.
integrated, corrected – When return_dimred=True, returns a two-tuple containing a list of np.ndarray with integrated low-dimensional embeddings and a list of new scanpy.api.AnnData.
- integrate_scanpy(adatas, **kwargs)
Integrate a list of scanpy.api.AnnData.
- Parameters
adatas (list of scanpy.api.AnnData) – Data sets to integrate.
kwargs (dict) – See documentation for the integrate() method for a full list of parameters to use for integration.
- Returns
Returns a list of np.ndarray with integrated low-dimensional embeddings.
- Return type
integrated
- fill_table(table, i, curr_ds, datasets, base_ds=0, knn=KNN, approx=APPROX, metric='manhattan')
- find_alignments_table(datasets, knn=KNN, approx=APPROX, verbose=VERBOSE, prenormalized=False, geosketch=False, geosketch_max=20000, metric='manhattan')
- find_alignments(datasets, knn=KNN, approx=APPROX, verbose=VERBOSE, alpha=ALPHA, prenormalized=False, geosketch=False, geosketch_max=20000, metric='manhattan')
- connect(datasets, knn=KNN, approx=APPROX, alpha=ALPHA, verbose=VERBOSE, metric='manhattan')
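The alignment helpers above take alpha, documented earlier on this page as the alignment score minimum cutoff. The sketch below shows one plausible reading of that cutoff. It assumes the score of a data set pair is the fraction of cells that take part in at least one nearest-neighbor match (the larger of the two data sets' fractions) and that pairs must score strictly above alpha; both are assumptions, and the function names and toy matches are hypothetical.

```python
def alignment_score(matches, n_cells_i, n_cells_j):
    """Fraction of cells taking part in at least one match,
    taking the larger fraction of the two data sets."""
    cells_i = {i for i, _ in matches}
    cells_j = {j for _, j in matches}
    return max(len(cells_i) / n_cells_i, len(cells_j) / n_cells_j)

def keep_alignments(scores, alpha=0.1):
    """Keep data set pairs scoring above alpha, strongest first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [pair for pair, score in ranked if score > alpha]

# Toy matches: (cell index in data set 0, cell index in data set 1).
matches_01 = [(0, 1), (0, 2), (3, 2), (4, 4)]
scores = {
    (0, 1): alignment_score(matches_01, n_cells_i=10, n_cells_j=5),
    (0, 2): 0.05,   # made-up weak alignment
    (1, 2): 0.30,   # made-up moderate alignment
}
print(keep_alignments(scores, alpha=0.1))
```

Raising alpha discards weakly overlapping pairs, so data sets that share few cell types are less likely to be forced together.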
- handle_zeros_in_scale(scale, copy=True)
Makes sure that whenever scale is zero, we handle it correctly. This happens in most scalers when we have constant features. Adapted from sklearn.preprocessing.data
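To show why zero scales need special handling, here is a plain-Python sketch of the usual approach of replacing zero scale factors with 1.0 so that dividing by them leaves a constant feature unchanged instead of producing a division error. The function name handle_zeros and the list-based interface are hypothetical stand-ins; the module's function presumably operates on arrays.

```python
def handle_zeros(scale, copy=True):
    """Replace zero scale factors with 1.0 so division is a no-op.

    Plain-list illustration; hypothetical name, not the module's code.
    """
    # A single number: return 1.0 if it is zero, otherwise leave it alone.
    if isinstance(scale, (int, float)):
        return 1.0 if scale == 0.0 else scale
    # A sequence: optionally work on a copy so the caller's list is untouched.
    if copy:
        scale = list(scale)
    for idx, value in enumerate(scale):
        if value == 0.0:
            scale[idx] = 1.0
    return scale

# Three features; the middle one is constant, so its standard deviation is 0.
columns = [[1.0, 2.0, 3.0], [5.0, 5.0, 5.0], [0.0, 2.0, 4.0]]
means = [sum(col) / len(col) for col in columns]
stds = [
    (sum((x - m) ** 2 for x in col) / len(col)) ** 0.5
    for col, m in zip(columns, means)
]
safe = handle_zeros(stds)
scaled = [[(x - m) / s for x in col] for col, m, s in zip(columns, means, safe)]
print(safe[1], scaled[1])
```

Because copy defaults to True in this sketch, the original list of standard deviations keeps its zero entry and only the returned list is modified.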