ALLCools.clustering
Contents
ALLCools.clustering
¶
Basic Cellular Analysis Functions
Subpackages¶
Submodules¶
Package Contents¶
- class ConsensusClustering(model=None, n_neighbors=25, metric='euclidean', min_cluster_size=10, leiden_repeats=200, leiden_resolution=1, target_accuracy=0.95, consensus_rate=0.7, random_state=0, train_frac=0.5, train_max_n=500, max_iter=50, n_jobs=- 1)¶
- add_data(self, x)¶
- fit_predict(self, x, leiden_kwds=None)¶
- compute_neighbors(self)¶
Calculate KNN graph
- multi_leiden_clustering(self, partition_type=None, partition_kwargs=None, use_weights=True, n_iterations=- 1)¶
Modified from scanpy, perform Leiden clustering multiple times with different random states
- _summarize_multi_leiden(self)¶
Summarize the multi_leiden results, generate a raw cluster version simply based on the hamming distance between cells and split cluster with cutoff (consensus_rate)
- _create_model(self, n_estimators=1000)¶
Init default model
- supervise_learning(self)¶
Perform supervised learning and cluster merge process
- final_evaluation(self)¶
Final evaluation of the model and assign outliers
- save(self, output_path)¶
Save the model
- plot_leiden_cases(self, coord_data, coord_base='umap', plot_size=3, dpi=300, plot_n_cases=4, s=3)¶
Show some leiden runs with biggest different as measured by ARI
- plot_before_after(self, coord_data, coord_base='umap', plot_size=3, dpi=300)¶
Plot the raw clusters from multi-leiden and final clusters after merge
- plot_steps(self, coord_data, coord_base='umap', plot_size=3, dpi=300)¶
Plot the supervised learning and merge steps
- plot_merge_process(self, plot_size=3)¶
Plot the change of accuracy during merge
- select_confusion_pairs(true_label, predicted_label, ratio_cutoff=0.001)[source]¶
Select cluster pairs that are confusing (ratio_cutoff) between true and predicted labels
- Parameters
true_label (true cell labels) –
predicted_label (predicted cell labels) –
ratio_cutoff (ratio of clusters cutoff to define confusion) –
- Returns
list of cluster pair tuples
- Return type
confused_pairs
- tsne(adata, obsm='X_pca', metric: Union[str, Callable] = 'euclidean', exaggeration: float = - 1, perplexity: int = 30, n_jobs: int = - 1)[source]¶
Calculating T-SNE embedding with the openTSNE package [Poličar et al., 2019] and parameter optimization strategy described in [Kobak and Berens, 2019].
- Parameters
adata – adata object with principle components or equivalent matrix stored in .obsm
obsm – name of the matrix in .obsm that can be used as T-SNE input
metric – Any metric allowed by PyNNDescent (default: ‘euclidean’)
exaggeration – The exaggeration to use for the embedding
perplexity – The perplexity to use for the embedding
n_jobs – Number of CPUs to use
- Returns
- Return type
T-SNE embedding will be stored at adata.obsm[“X_tsne”]
- balanced_pca(adata: anndata.AnnData, groups: str = 'pre_clusters', max_cell_prop=0.1, n_comps=200, scale=False)[source]¶
Given a categorical variable (e.g., a pre-clustering label), perform balanced PCA by downsample cells in the large categories to make the overall population more balanced, so the PCs are expected to represent more variance among small categories.
- Parameters
adata – adata after preprocessing and feature selection steps
groups – the name of the categorical variable in adata.obsm
max_cell_prop – any single category with cells > n_cell * max_cell_prop will be downsampled to this number.
n_comps – Number of components in PCA
scale – whether to scale the input matrix before PCA
- Returns
- Return type
adata with PC information stored in obsm, varm and uns like the
scanpy.tl.pca()
do.
- significant_pc_test(adata: anndata.AnnData, p_cutoff=0.1, update=True, obsm='X_pca', downsample=50000)[source]¶
Perform two-sample Kolmogorov-Smirnov test for goodness of fit on two adjacent PCs, select top PCs based on the p_cutoff. Top PCs have significantly different distributions, while later PCs only capturing random noise will have larger p-values. An idea from [Zeisel et al., 2018].
- Parameters
adata – adata with PC matrix calculated and stored in adata.obsm
p_cutoff – the p-value cutoff to select top PCs
update – Whether modify adata.obsm and only keep significant PCs
obsm – name of the PC matrix in adata.obsm
downsample – If the dataset is too large, downsample the cells before testing.
- Returns
number of PCs selected
- Return type
n_components
- log_scale(adata, method='standard', with_mean=False, with_std=True, max_value=10, scaler=None)[source]¶
Perform log transform and then scale the cell-by-feature matrix
- Parameters
adata – adata with normalized, unscaled cell-by-feature matrix
method – the type of scaler to use: ‘standard’ for
sklearn.preprocessing.StandardScaler
; ‘robust’ forsklearn.preprocessing.RobustScaler
.with_mean – Whether scale with mean center
with_std – Whether scale the std
max_value – Whether clip large values after scale
scaler – A fitted sklearn scaler, if provided, will only use it to transform the adata.
- Returns
- Return type
adata.X is scaled in place, the fitted scaler object will be return if the scaler parameter is None.
- get_pc_centers(adata: anndata.AnnData, group: str, outlier_label=None, obsm='X_pca')[source]¶
Get the cluster centroids of the PC matrix
- Parameters
adata – adata with cluster labels in obs and PC in obsm
group – the name of cluster labels in adata.obs
outlier_label – if there are outlier labels in obs, will exclude outliers before calculating centroids.
obsm – the key of PC matrix in obsm
- Returns
a dataframe for cluster centroids by PC
- Return type
pc_center
- class ReproduciblePCA(scaler, mc_type, adata=None, pca_obj=None, pc_loading=None, var_names=None, max_value=10)[source]¶
- mcds_to_adata(self, mcds)¶
Get adata from MCDS with only selected features
- Parameters
mcds (Input raw count MCDS object) –
- Returns
Adata with per-cell normalized mC fraction and selected features
- Return type
adata
- scale(self, adata)¶
Perform {func}`log_scale <ALLCools.clustering.balanced_pca.log_scale>` with fitted scaler
- Parameters
adata – Adata with per-cell normalized mC fraction and selected features
- Returns
- Return type
adata.X is transformed in place
- pc_transform(self, adata)¶
calculate the PC from adata.X and PC loading, store PCs in adata.obsm[“X_pca”] and loadings in adata.varm[“PCs”]
- Parameters
adata – Adata with log_scale transformed mC fraction and selected features
- Returns
- Return type
PC information stored in adata.obsm and varm
- mcds_to_adata_with_pc(self, mcds)¶
From raw count MCDS to adata object with PCs using fitted scaler and PC loadings. Steps include select features, calculate per-cell normalized mC fractions, log_scale transform the data with fitted scaler, and finally add PC matrix.
- Parameters
mcds – Raw count MCDS
- Returns
anndata object with per-cell normalized, log scale transformed matrix in .X and PCs in adata.obsm[“X_pca”] and PC loadings in adata.varm[“PCs”]. The scale and PC are done with fitted model.
- Return type
adata
- dump(self, path)¶
Save the ReproduciblePCA to path
- class Dendrogram(nboot=1000, method_dist='correlation', method_hclust='average', n_jobs=- 1)[source]¶
- fit(self, data)¶
- Parameters
data – The data is in obs-by-var form, row is obs.
- save(self, output_path)¶
- class PairwiseDMG(max_cell_per_group=1000, top_n=10000, adj_p_cutoff=0.001, delta_rate_cutoff=0.3, auroc_cutoff=0.9, random_state=0, n_jobs=1, verbose=True)[source]¶
- fit_predict(self, x, groups, var_dim, obs_dim='cell', outlier='Outlier', cleanup=True, selected_pairs: List[tuple] = None)¶
provide data and perform the pairwise DMG
- Parameters
x – 2D cell-by-feature xarray.DataArray
groups – cluster labels
obs_dim – name of the cell dim
var_dim – name of the feature dim
outlier – name of the outlier group, if provided, will ignore this label
cleanup – Whether to delete the group adata file
selected_pairs – By default, pairwise DMG will calculate all possible pairs between all the groups, which might be very time consuming if the group number is large. With this parameter, you may provide a list of cluster pairs
- _save_cluster_adata(self)¶
Save each group into separate adata, this way reduce the memory during parallel
- _pairwise_dmg(self)¶
pairwise DMG runner, result save to self.dmg_table
- _cleanup(self)¶
Delete group adata files
- aggregate_pairwise_dmg(self, adata, groupby, obsm='X_pca')¶
Aggregate pairwise DMG results for each cluster, rank DMG for the cluster by the sum of AUROC * cluster_pair_similarity This way, the DMGs having large AUROC between similar clusters get more weights
- Parameters
adata –
groupby –
obsm –
- one_vs_rest_dmg(cell_meta, group, mcds=None, mcds_paths=None, obs_dim='cell', var_dim='gene', mc_type='CHN', top_n=1000, adj_p_cutoff=0.01, fc_cutoff=0.8, auroc_cutoff=0.8, max_cluster_cells=2000, max_other_fold=5, cpu=1, verbose=True)[source]¶
Calculating cluster marker genes using one-vs-rest strategy.
- Parameters
cell_meta – cell metadata containing cluster labels
group – the name of the cluster label column
mcds – cell-by-gene MCDS object for calculating DMG. Provide either mcds_paths or mcds.
mcds_paths – cell-by-gene MCDS paths for calculating DMG. Provide either mcds_paths or mcds.
obs_dim – dimension name of the cells
var_dim – dimension name of the features
mc_type – value to select methylation type in the mc_type dimension
top_n – report top N DMGs
adj_p_cutoff – adjusted P value cutoff to report significant DMG
fc_cutoff – mC fraction fold change cutoff to report significant DMG
auroc_cutoff – AUROC cutoff to report significant DMG
max_cluster_cells – The maximum number of cells from a group, downsample large group to this number
max_other_fold – The fold of other cell numbers comparing
cpu – number of cpus
- Returns
pandas Dataframe of the one-vs-rest DMGs
- Return type
dmg_table
- cluster_enriched_features(adata: anndata.AnnData, cluster_col: str, top_n=200, alpha=0.05, stat_plot=True, method='mc')[source]¶
Calculate top Cluster Enriched Features (CEF) from per-cell normalized dataset. An post-clustering feature selection step adapted from [Zeisel et al., 2018, La Manno et al., 2021] and their great [cytograph2](https://github.com/linnarsson-lab/cytograph2) package. For details about CEF calculation, read the methods of [Zeisel et al., 2018]. Note that in original paper, they look for cluster-specific highly expressed genes as CEFs; for methylation, we are looking for hypo-methylation as CEFs, so the score and test is reversed.
- Parameters
adata – adata containing per-cell normalized values. For methylation fraction, the value need to be 1-centered (1 means cell’s average methylation), like those produced by
ALLCools.mcds.mcds.MCDS.add_mc_frac()
with normalize_per_cell=True. For RNA and ATAC, you can use per cell normalized counts. Do not log transform the data before running this functioncluster_col – The name of categorical variable in adata.obs
top_n – Select top N CEFs for each cluster
alpha – FDR corrected q-value cutoff
stat_plot – Whether making some summary plots for the CEF calculation
method – “mc” for methylation CEF (look for hypo-methylation), “rna” and “atac” for the RNA and ATAC or any count based data (use the original cytograph algorithm, look for higher value)
- Returns
Modify adata inplace, adding a dictionary in adata.uns called f”{cluster_col}_feature_enrichment”
The dictionary contains “qvals” (np.ndarray cluster-by-feature enrichment score q-value) and
”cluster_order” (cluster order of the “qvals”)
- filter_regions(adata, hypo_percent=0.5)[source]¶
Filter regions based on % of cells having non-zero scores.
- Parameters
adata –
hypo_percent – min % of cells that are non-zero in this region.
- lsi(adata, scale_factor=100000, n_components=100, algorithm='arpack', obsm='X_pca', random_state=0, fit_size=None)[source]¶
Run TF-IDF on the binarized adata.X, followed by TruncatedSVD and then scale the components by svd.singular_values_
- Parameters
adata –
scale_factor –
n_components –
algorithm –
obsm –
random_state –
fit_size – Ratio or absolute int value, use to downsample when fitting the SVD to speed up run time.
- binarize_matrix(adata, cutoff=0.95)[source]¶
Binarize adata.X with adata.X > cutoff
- Parameters
adata – AnnData object whose X is survival function matrix
cutoff – Cutoff to binarize the survival function
- Returns
- Return type
- remove_black_list_region(adata, black_list_path, f=0.2)[source]¶
Remove regions overlap (bedtools intersect -f {f}) with regions in the black_list_path
- Parameters
adata –
black_list_path – Path to the black list bed file
f – Fraction of overlap when calling bedtools intersect
- Returns
- Return type
- class ClusterMerge(merge_criterion, stop_criterion=None, stop_clusters=- 1, n_cells=200, metric='euclidean', method='average', label_concat_str='::')[source]¶
- _construct_tree(self)¶
- static _traverse(node, call_back)¶
- _merge_pair(self, pair, concat_str='::')¶
- fit_predict(self, data_for_tree, cell_to_type, gene_mcds)¶