ALLCools.clustering

Basic Cellular Analysis Functions

Package Contents

class ConsensusClustering(model=None, n_neighbors=25, metric='euclidean', min_cluster_size=10, leiden_repeats=200, leiden_resolution=1, target_accuracy=0.95, consensus_rate=0.7, random_state=0, train_frac=0.5, train_max_n=500, max_iter=50, n_jobs=- 1)
add_data(self, x)
fit_predict(self, x, leiden_kwds=None)
compute_neighbors(self)

Calculate KNN graph

multi_leiden_clustering(self, partition_type=None, partition_kwargs=None, use_weights=True, n_iterations=- 1)

Modified from scanpy, perform Leiden clustering multiple times with different random states

_summarize_multi_leiden(self)

Summarize the multi_leiden results, generate a raw cluster version simply based on the hamming distance between cells and split cluster with cutoff (consensus_rate)

_create_model(self, n_estimators=1000)

Init default model

supervise_learning(self)

Perform supervised learning and cluster merge process

final_evaluation(self)

Final evaluation of the model and assign outliers

save(self, output_path)

Save the model

plot_leiden_cases(self, coord_data, coord_base='umap', plot_size=3, dpi=300, plot_n_cases=4, s=3)

Show some leiden runs with biggest different as measured by ARI

plot_before_after(self, coord_data, coord_base='umap', plot_size=3, dpi=300)

Plot the raw clusters from multi-leiden and final clusters after merge

plot_steps(self, coord_data, coord_base='umap', plot_size=3, dpi=300)

Plot the supervised learning and merge steps

plot_merge_process(self, plot_size=3)

Plot the change of accuracy during merge

select_confusion_pairs(true_label, predicted_label, ratio_cutoff=0.001)[source]

Select cluster pairs that are confusing (ratio_cutoff) between true and predicted labels

Parameters
  • true_label (true cell labels) –

  • predicted_label (predicted cell labels) –

  • ratio_cutoff (ratio of clusters cutoff to define confusion) –

Returns

list of cluster pair tuples

Return type

confused_pairs

tsne(adata, obsm='X_pca', metric: Union[str, Callable] = 'euclidean', exaggeration: float = - 1, perplexity: int = 30, n_jobs: int = - 1)[source]

Calculating T-SNE embedding with the openTSNE package [Poličar et al., 2019] and parameter optimization strategy described in [Kobak and Berens, 2019].

Parameters
  • adata – adata object with principle components or equivalent matrix stored in .obsm

  • obsm – name of the matrix in .obsm that can be used as T-SNE input

  • metric – Any metric allowed by PyNNDescent (default: ‘euclidean’)

  • exaggeration – The exaggeration to use for the embedding

  • perplexity – The perplexity to use for the embedding

  • n_jobs – Number of CPUs to use

Returns

Return type

T-SNE embedding will be stored at adata.obsm[“X_tsne”]

balanced_pca(adata: anndata.AnnData, groups: str = 'pre_clusters', max_cell_prop=0.1, n_comps=200, scale=False)[source]

Given a categorical variable (e.g., a pre-clustering label), perform balanced PCA by downsample cells in the large categories to make the overall population more balanced, so the PCs are expected to represent more variance among small categories.

Parameters
  • adata – adata after preprocessing and feature selection steps

  • groups – the name of the categorical variable in adata.obsm

  • max_cell_prop – any single category with cells > n_cell * max_cell_prop will be downsampled to this number.

  • n_comps – Number of components in PCA

  • scale – whether to scale the input matrix before PCA

Returns

Return type

adata with PC information stored in obsm, varm and uns like the scanpy.tl.pca() do.

significant_pc_test(adata: anndata.AnnData, p_cutoff=0.1, update=True, obsm='X_pca', downsample=50000)[source]

Perform two-sample Kolmogorov-Smirnov test for goodness of fit on two adjacent PCs, select top PCs based on the p_cutoff. Top PCs have significantly different distributions, while later PCs only capturing random noise will have larger p-values. An idea from [Zeisel et al., 2018].

Parameters
  • adata – adata with PC matrix calculated and stored in adata.obsm

  • p_cutoff – the p-value cutoff to select top PCs

  • update – Whether modify adata.obsm and only keep significant PCs

  • obsm – name of the PC matrix in adata.obsm

  • downsample – If the dataset is too large, downsample the cells before testing.

Returns

number of PCs selected

Return type

n_components

log_scale(adata, method='standard', with_mean=False, with_std=True, max_value=10, scaler=None)[source]

Perform log transform and then scale the cell-by-feature matrix

Parameters
  • adata – adata with normalized, unscaled cell-by-feature matrix

  • method – the type of scaler to use: ‘standard’ for sklearn.preprocessing.StandardScaler; ‘robust’ for sklearn.preprocessing.RobustScaler.

  • with_mean – Whether scale with mean center

  • with_std – Whether scale the std

  • max_value – Whether clip large values after scale

  • scaler – A fitted sklearn scaler, if provided, will only use it to transform the adata.

Returns

Return type

adata.X is scaled in place, the fitted scaler object will be return if the scaler parameter is None.

get_pc_centers(adata: anndata.AnnData, group: str, outlier_label=None, obsm='X_pca')[source]

Get the cluster centroids of the PC matrix

Parameters
  • adata – adata with cluster labels in obs and PC in obsm

  • group – the name of cluster labels in adata.obs

  • outlier_label – if there are outlier labels in obs, will exclude outliers before calculating centroids.

  • obsm – the key of PC matrix in obsm

Returns

a dataframe for cluster centroids by PC

Return type

pc_center

class ReproduciblePCA(scaler, mc_type, adata=None, pca_obj=None, pc_loading=None, var_names=None, max_value=10)[source]
mcds_to_adata(self, mcds)

Get adata from MCDS with only selected features

Parameters

mcds (Input raw count MCDS object) –

Returns

Adata with per-cell normalized mC fraction and selected features

Return type

adata

scale(self, adata)

Perform {func}`log_scale <ALLCools.clustering.balanced_pca.log_scale>` with fitted scaler

Parameters

adata – Adata with per-cell normalized mC fraction and selected features

Returns

Return type

adata.X is transformed in place

pc_transform(self, adata)

calculate the PC from adata.X and PC loading, store PCs in adata.obsm[“X_pca”] and loadings in adata.varm[“PCs”]

Parameters

adata – Adata with log_scale transformed mC fraction and selected features

Returns

Return type

PC information stored in adata.obsm and varm

mcds_to_adata_with_pc(self, mcds)

From raw count MCDS to adata object with PCs using fitted scaler and PC loadings. Steps include select features, calculate per-cell normalized mC fractions, log_scale transform the data with fitted scaler, and finally add PC matrix.

Parameters

mcds – Raw count MCDS

Returns

anndata object with per-cell normalized, log scale transformed matrix in .X and PCs in adata.obsm[“X_pca”] and PC loadings in adata.varm[“PCs”]. The scale and PC are done with fitted model.

Return type

adata

dump(self, path)

Save the ReproduciblePCA to path

class Dendrogram(nboot=1000, method_dist='correlation', method_hclust='average', n_jobs=- 1)[source]
fit(self, data)
Parameters

data – The data is in obs-by-var form, row is obs.

save(self, output_path)
class PairwiseDMG(max_cell_per_group=1000, top_n=10000, adj_p_cutoff=0.001, delta_rate_cutoff=0.3, auroc_cutoff=0.9, random_state=0, n_jobs=1, verbose=True)[source]
fit_predict(self, x, groups, var_dim, obs_dim='cell', outlier='Outlier', cleanup=True, selected_pairs: List[tuple] = None)

provide data and perform the pairwise DMG

Parameters
  • x – 2D cell-by-feature xarray.DataArray

  • groups – cluster labels

  • obs_dim – name of the cell dim

  • var_dim – name of the feature dim

  • outlier – name of the outlier group, if provided, will ignore this label

  • cleanup – Whether to delete the group adata file

  • selected_pairs – By default, pairwise DMG will calculate all possible pairs between all the groups, which might be very time consuming if the group number is large. With this parameter, you may provide a list of cluster pairs

_save_cluster_adata(self)

Save each group into separate adata, this way reduce the memory during parallel

_pairwise_dmg(self)

pairwise DMG runner, result save to self.dmg_table

_cleanup(self)

Delete group adata files

aggregate_pairwise_dmg(self, adata, groupby, obsm='X_pca')

Aggregate pairwise DMG results for each cluster, rank DMG for the cluster by the sum of AUROC * cluster_pair_similarity This way, the DMGs having large AUROC between similar clusters get more weights

Parameters
  • adata

  • groupby

  • obsm

one_vs_rest_dmg(cell_meta, group, mcds=None, mcds_paths=None, obs_dim='cell', var_dim='gene', mc_type='CHN', top_n=1000, adj_p_cutoff=0.01, fc_cutoff=0.8, auroc_cutoff=0.8, max_cluster_cells=2000, max_other_fold=5, cpu=1, verbose=True)[source]

Calculating cluster marker genes using one-vs-rest strategy.

Parameters
  • cell_meta – cell metadata containing cluster labels

  • group – the name of the cluster label column

  • mcds – cell-by-gene MCDS object for calculating DMG. Provide either mcds_paths or mcds.

  • mcds_paths – cell-by-gene MCDS paths for calculating DMG. Provide either mcds_paths or mcds.

  • obs_dim – dimension name of the cells

  • var_dim – dimension name of the features

  • mc_type – value to select methylation type in the mc_type dimension

  • top_n – report top N DMGs

  • adj_p_cutoff – adjusted P value cutoff to report significant DMG

  • fc_cutoff – mC fraction fold change cutoff to report significant DMG

  • auroc_cutoff – AUROC cutoff to report significant DMG

  • max_cluster_cells – The maximum number of cells from a group, downsample large group to this number

  • max_other_fold – The fold of other cell numbers comparing

  • cpu – number of cpus

Returns

pandas Dataframe of the one-vs-rest DMGs

Return type

dmg_table

cluster_enriched_features(adata: anndata.AnnData, cluster_col: str, top_n=200, alpha=0.05, stat_plot=True, method='mc')[source]

Calculate top Cluster Enriched Features (CEF) from per-cell normalized dataset. An post-clustering feature selection step adapted from [Zeisel et al., 2018, La Manno et al., 2021] and their great [cytograph2](https://github.com/linnarsson-lab/cytograph2) package. For details about CEF calculation, read the methods of [Zeisel et al., 2018]. Note that in original paper, they look for cluster-specific highly expressed genes as CEFs; for methylation, we are looking for hypo-methylation as CEFs, so the score and test is reversed.

Parameters
  • adata – adata containing per-cell normalized values. For methylation fraction, the value need to be 1-centered (1 means cell’s average methylation), like those produced by ALLCools.mcds.mcds.MCDS.add_mc_frac() with normalize_per_cell=True. For RNA and ATAC, you can use per cell normalized counts. Do not log transform the data before running this function

  • cluster_col – The name of categorical variable in adata.obs

  • top_n – Select top N CEFs for each cluster

  • alpha – FDR corrected q-value cutoff

  • stat_plot – Whether making some summary plots for the CEF calculation

  • method – “mc” for methylation CEF (look for hypo-methylation), “rna” and “atac” for the RNA and ATAC or any count based data (use the original cytograph algorithm, look for higher value)

Returns

  • Modify adata inplace, adding a dictionary in adata.uns called f”{cluster_col}_feature_enrichment”

  • The dictionary contains “qvals” (np.ndarray cluster-by-feature enrichment score q-value) and

  • ”cluster_order” (cluster order of the “qvals”)

filter_regions(adata, hypo_percent=0.5)[source]

Filter regions based on % of cells having non-zero scores.

Parameters
  • adata

  • hypo_percent – min % of cells that are non-zero in this region.

lsi(adata, scale_factor=100000, n_components=100, algorithm='arpack', obsm='X_pca', random_state=0, fit_size=None)[source]

Run TF-IDF on the binarized adata.X, followed by TruncatedSVD and then scale the components by svd.singular_values_

Parameters
  • adata

  • scale_factor

  • n_components

  • algorithm

  • obsm

  • random_state

  • fit_size – Ratio or absolute int value, use to downsample when fitting the SVD to speed up run time.

binarize_matrix(adata, cutoff=0.95)[source]

Binarize adata.X with adata.X > cutoff

Parameters
  • adata – AnnData object whose X is survival function matrix

  • cutoff – Cutoff to binarize the survival function

Returns

Return type

None

remove_black_list_region(adata, black_list_path, f=0.2)[source]

Remove regions overlap (bedtools intersect -f {f}) with regions in the black_list_path

Parameters
  • adata

  • black_list_path – Path to the black list bed file

  • f – Fraction of overlap when calling bedtools intersect

Returns

Return type

None

class ClusterMerge(merge_criterion, stop_criterion=None, stop_clusters=- 1, n_cells=200, metric='euclidean', method='average', label_concat_str='::')[source]
_construct_tree(self)
static _traverse(node, call_back)
_merge_pair(self, pair, concat_str='::')
fit_predict(self, data_for_tree, cell_to_type, gene_mcds)
class PairwiseDMGCriterion(max_cell_per_group=100, top_n_markers=5, adj_p_cutoff=0.001, delta_rate_cutoff=0.3, auroc_cutoff=0.85, use_modality='either', random_state=0, n_jobs=10, verbose=False)[source]
predict(self, pair_labels, pair_cells, pair_cell_types, pair_mcds, da_name='gene_da_frac')