ALLCools.sandbox.sccaf
Module Contents
- color_long = ['#e6194b', '#3cb44b', '#ffe119', '#0082c8', '#f58231', '#911eb4', '#46f0f0', '#f032e6',...
- run_BayesianGaussianMixture(Y, K)
Cluster cells with a Bayesian Gaussian mixture model.
- Y: the expression matrix
- K: number of clusters
- Returns
clusters assigned to each cell.
- bhattacharyya_distance(repr1, repr2)
Calculates Bhattacharyya distance (https://en.wikipedia.org/wiki/Bhattacharyya_distance).
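As a minimal sketch of the quantity involved, assuming the two inputs are discrete probability distributions over the same bins (the source does not state this), the Bhattacharyya distance is the negative log of the Bhattacharyya coefficient. The name `bhattacharyya_distance_sketch` is hypothetical and is not the module's function.

```python
import math

def bhattacharyya_distance_sketch(p, q):
    """Bhattacharyya distance between two discrete distributions p and q."""
    # Bhattacharyya coefficient: sum over bins of sqrt(p_i * q_i).
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    if bc == 0.0:
        # No overlap at all between the distributions: infinite distance.
        return math.inf
    return -math.log(bc)

same = bhattacharyya_distance_sketch([0.5, 0.5], [0.5, 0.5])
apart = bhattacharyya_distance_sketch([1.0, 0.0], [0.0, 1.0])
partial = bhattacharyya_distance_sketch([0.5, 0.5], [1.0, 0.0])
```

Identical distributions give a distance of zero, and distributions with no shared support give an infinite distance.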
- normalize_confmat1(cmat, mod='1')
Normalize the confusion matrix by the number of cells in each class: x(i,j) = max(cmat(i,j)/diagonal(i), cmat(j,i)/diagonal(j)). The confusion rate between i and j is the larger of two ratios: the rate at which i is confused as j, and the rate at which j is confused as i.
- cmat: the confusion matrix
- Returns
the normalized confusion matrix
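The formula can be sketched with NumPy as follows. This follows the expression as written in the description, assumes a square matrix with a nonzero diagonal, and zeroes the diagonal of the result (an assumption). It does not model the `mod` argument, and `r1_normalize_sketch` is a hypothetical name.

```python
import numpy as np

def r1_normalize_sketch(cmat):
    """x(i,j) = max(cmat(i,j)/diagonal(i), cmat(j,i)/diagonal(j)) for i != j."""
    cmat = np.asarray(cmat, dtype=float)
    diag = np.diag(cmat)  # assumed nonzero for every class
    # ratio[i, j] = cmat[i, j] / cmat[i, i]
    ratio = cmat / diag[:, None]
    # Keep the larger of the two directions, which makes the result symmetric.
    xmat = np.maximum(ratio, ratio.T)
    np.fill_diagonal(xmat, 0.0)  # assumption: self-confusion is not reported
    return xmat

cmat = [[80, 20, 0],
        [5, 45, 0],
        [0, 0, 50]]
r1 = r1_normalize_sketch(cmat)
```

Here class 0 loses 20 of its cells to class 1 (20/80) while class 1 loses 5 to class 0 (5/45), so the larger ratio is kept for the pair.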
- normalize_confmat2(cmat)
Normalize the confusion matrix by the total number of cells: x(i,j) = (cmat(i,j) + cmat(j,i)) / N, where N is the total number of cells analyzed. The confusion rate between i and j is the number of cells of i confused as j plus the number of cells of j confused as i, divided by the total number of cells.
- cmat: the confusion matrix
- Returns
the normalized confusion matrix
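A minimal NumPy sketch of this normalization, assuming N is the sum of all entries of the confusion matrix and that the diagonal of the result is set to zero (both are assumptions). The name `r2_normalize_sketch` is hypothetical.

```python
import numpy as np

def r2_normalize_sketch(cmat):
    """x(i,j) = (cmat(i,j) + cmat(j,i)) / N for i != j."""
    cmat = np.asarray(cmat, dtype=float)
    total = cmat.sum()  # N: total number of cells in the matrix
    xmat = (cmat + cmat.T) / total
    np.fill_diagonal(xmat, 0.0)  # assumption: self-confusion is not reported
    return xmat

r2 = r2_normalize_sketch([[80, 20, 0],
                          [5, 45, 0],
                          [0, 0, 50]])
```

With 200 cells in total, the 25 cells confused between classes 0 and 1 give a rate of 0.125 for that pair.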
- cluster_adjmat(xmat, resolution=1, cutoff=0.1)
Cluster the groups based on the adjacency matrix. The cutoff is used to discretize the matrix, which then defines the adjacency graph. The graph is clustered using Louvain clustering with a resolution value. As the adjacency matrix is binary, the default resolution value is set to 1.
- xmat: numpy.array or sparse matrix
the reference matrix/normalized confusion matrix
- cutoff: float optional (default: 0.1)
threshold used to binarize the reference matrix
- resolution: float optional (default: 1.0)
resolution parameter for louvain clustering
- Returns
new group names.
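The binarize-then-group idea can be illustrated without a graph library. The sketch below thresholds the matrix and then merges groups by connected components using union-find. Connected components are a simpler stand-in chosen for illustration, not Louvain clustering, and can merge more aggressively than Louvain would. The strict `>` comparison and the name `merge_groups_sketch` are assumptions.

```python
def merge_groups_sketch(xmat, cutoff=0.1):
    """Threshold xmat, then label the connected components of the binary graph."""
    n = len(xmat)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if xmat[i][j] > cutoff:  # edge in the binarized graph
                parent[find(i)] = find(j)

    # Relabel roots as consecutive integers in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

xmat = [[0.0, 0.3, 0.0],
        [0.3, 0.0, 0.05],
        [0.0, 0.05, 0.0]]
groups = merge_groups_sketch(xmat, cutoff=0.1)
```

Groups 0 and 1 exceed the cutoff and are merged, while group 2 stays on its own.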
- msample(x, n, frac)
Sample the matrix by number or by fraction. If sampling by fraction would select more than n vectors, sample n vectors instead. Otherwise, sample by fraction.
- x: the matrix to be split
- n: number of vectors to be sampled
- frac: fraction of the total matrix to be sampled
- Returns
sampled selection.
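The number-or-fraction rule amounts to taking the smaller of the two sample sizes. A minimal sketch on a plain Python list follows. The real function operates on a matrix, and `msample_sketch` and its `seed` argument are hypothetical.

```python
import random

def msample_sketch(items, n, frac, seed=0):
    """Sample by fraction, capped at n items."""
    rng = random.Random(seed)
    by_fraction = int(len(items) * frac)
    k = min(n, by_fraction)  # the fraction applies only while it stays below n
    return rng.sample(items, k)

large = msample_sketch(list(range(1000)), n=100, frac=0.5)
small = msample_sketch(list(range(50)), n=100, frac=0.5)
```

For 1000 items the fraction would select 500, so the cap of 100 applies. For 50 items the fraction selects 25, which is below the cap.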
- train_test_split_per_type(X, y, n=100, frac=0.8)
This function works like train_test_split, but can split the data within each class either by number of cells or by fraction.
- X: numpy.array or sparse matrix
the feature matrix
- y: list of string/int
the class assignments
- n: int optional (default: 100)
maximum number sampled in each label
- frac: float optional (default: 0.8)
Fraction of the data included in the training set. 0.5 means half of the data is used for training, provided that half of the data is fewer than the maximum number of cells (n).
- Returns
X_train, X_test, y_train, y_test
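A sketch of a per-label split under the parameters described above: each label contributes at most n training cells, or frac of its cells if that is fewer. It returns index lists rather than the four arrays, and `split_per_type_sketch` and `seed` are hypothetical names.

```python
import random
from collections import defaultdict

def split_per_type_sketch(y, n=100, frac=0.8, seed=0):
    """Return (train_idx, test_idx) with at most n training cells per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(y):
        by_label[label].append(idx)

    train_idx = []
    for idxs in by_label.values():
        k = min(n, int(len(idxs) * frac))  # cap the fraction at n cells
        train_idx.extend(rng.sample(idxs, k))

    train_set = set(train_idx)
    test_idx = [i for i in range(len(y)) if i not in train_set]
    return sorted(train_idx), test_idx

y = ["A"] * 500 + ["B"] * 20
train_idx, test_idx = split_per_type_sketch(y, n=100, frac=0.8)
```

The large cluster "A" is capped at 100 training cells, while the small cluster "B" contributes 80% of its 20 cells.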
- SCCAF_assessment(*args, **kwargs)
Assessment of clustering reliability using self-projection. It is the same as the self_projection function.
- self_projection(X, cell_types, classifier='LR', penalty='l1', sparsity=0.5, fraction=0.5, solver='liblinear', n=0, cv=5, whole=False, n_jobs=None)
This is the core function for running self-projection.
- X: numpy.array or sparse matrix
the expression matrix, e.g. ad.raw.X.
- cell_types: list of String/int
the cell clustering assignment
- classifier: String optional (default: 'LR')
a machine learning model: "LR" (Logistic Regression), "RF" (Random Forest), "GNB" (Gaussian Naive Bayes), "SVM" (Support Vector Machine) or "DT" (Decision Tree).
- penalty: String optional (default: 'l1')
the regularization mode of the logistic regression. Use 'l1' or 'l2'.
- sparsity: float optional (default: 0.5)
The sparsity parameter (C in sklearn.linear_model.LogisticRegression) for the logistic regression model.
- fraction: float optional (default: 0.5)
Fraction of data included in the training set. 0.5 means use half of the data for training, if half of the data is fewer than maximum number of cells (n).
- n: int optional (default: 0)
Maximum number of cells included in the training set for each cluster of cells. If n is 0, only fraction is used to split the dataset.
- cv: int optional (default: 5)
fold for cross-validation on the training set. 0 means no cross-validation.
- whole: bool optional (default: False)
whether to measure the performance on the whole dataset (training and test sets combined).
- n_jobs: int optional (default: None)
number of threads to use with the different classifiers. None means unlimited.
- Returns
y_prob, y_pred, y_test, clf
y_prob (matrix of float) – prediction probability
y_pred (list of string/int) – predicted clustering of the test set
y_test (list of string/int) – real clustering of the test set
clf – the classifier model
- make_unique(dup_list)
Make a name list unique by adding the suffix "_%d". This function is similar to the make.unique function in R.
- dup_list: a list
- Returns
a unique list with the same length as the input.
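A small sketch of suffix-based de-duplication. It assumes the first occurrence of a name is kept unchanged and that later occurrences are numbered from 1, which the description above does not specify. `make_unique_sketch` is a hypothetical name.

```python
from collections import Counter

def make_unique_sketch(names):
    """Append "_<count>" to the second and later occurrences of each name."""
    seen = Counter()
    out = []
    for name in names:
        # First occurrence keeps its name; repeats get a numeric suffix.
        out.append(f"{name}_{seen[name]}" if seen[name] else name)
        seen[name] += 1
    return out

result = make_unique_sketch(["a", "b", "a", "a", "b"])
```

This simple scheme does not guard against an input that already contains a suffixed name such as "a_1".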
- confusion_matrix(y_test, y_pred, clf, labels=None)
Get the confusion matrix based on the test set.
- y_test, y_pred, clf: same as in self_projection
- Returns
the confusion matrix
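For orientation, a confusion matrix can be built from the true and predicted labels by counting pairs. The sketch below assumes rows are true clusters and columns are predicted clusters, which the description does not state. It takes the label order explicitly instead of a classifier, and `confusion_counts_sketch` is a hypothetical name.

```python
import numpy as np

def confusion_counts_sketch(y_true, y_pred, labels):
    """Count cells per (true label, predicted label) pair."""
    index = {label: k for k, label in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[index[t], index[p]] += 1  # row: true label, column: prediction
    return counts

y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]
counts = confusion_counts_sketch(y_true, y_pred, labels=["A", "B", "C"])
```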
- per_cluster_accuracy(mtx, ad=None, clstr_name='louvain')
Measure the accuracy of each cluster and store it in a metadata slot, so that the reliability of each cluster can be visualized.
- mtx: pandas.DataFrame
the confusion matrix
- ad: AnnData
anndata object
- clstr_name: String
the name of the clustering
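One way to read a per-cluster accuracy off a confusion matrix is to divide each diagonal entry by its row total. The sketch below assumes rows hold the true clusters and works on a plain array instead of a DataFrame or AnnData object. `per_cluster_accuracy_sketch` is a hypothetical name.

```python
import numpy as np

def per_cluster_accuracy_sketch(cmat):
    """Fraction of each cluster's cells that were predicted correctly."""
    cmat = np.asarray(cmat, dtype=float)
    # Correct predictions sit on the diagonal; row sums are cluster sizes.
    return np.diag(cmat) / cmat.sum(axis=1)

acc = per_cluster_accuracy_sketch([[80, 20, 0],
                                   [5, 45, 0],
                                   [0, 0, 50]])
```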
- get_topmarkers(clf, names, topn=10)
Get the top weighted features from the logistic regression model.
- clf: the logistic regression classifier
- names: list of Strings
the names of the features (the gene names).
- topn: int
number of top weighted features to be returned.
- Returns
list of markers for each cluster.
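The selection of top weighted features can be sketched on a bare coefficient matrix. It assumes one row of weights per cluster and one column per feature, as in the `coef_` attribute of a fitted multiclass scikit-learn logistic regression. Ranking by the largest positive weight is an assumption, and `top_markers_sketch` is a hypothetical name.

```python
import numpy as np

def top_markers_sketch(coef, names, topn=2):
    """For each cluster, list the names of the topn highest-weighted features."""
    coef = np.asarray(coef)
    names = np.asarray(names)
    markers = []
    for weights in coef:
        top = np.argsort(weights)[::-1][:topn]  # indices in descending weight
        markers.append(names[top].tolist())
    return markers

coef = [[0.1, 2.0, -1.0, 0.5],
        [1.5, -0.2, 0.0, 3.0]]
markers = top_markers_sketch(coef, ["g1", "g2", "g3", "g4"], topn=2)
```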
- eu_distance(X, gp1, gp2, cell)
Measure the Euclidean distance between two groups of cells and a third group.
- X: np.array or sparse matrix
the total expression matrix
- gp1: bool list
group1 of cells
- gp2: bool list
group2 of cells
- cell: bool list
group3 of cells, the group to be compared with gp1 and gp2.
- Returns
the average distance difference.
- Return type
float
- get_distance_matrix(X, clusters, labels=None, metric='euclidean')
Get the mean distance matrix between all clusters.
- X: np.array or sparse matrix
the total expression matrix
- clusters: string list
the assignment of the clusters
- labels: string list
the unique labels of the clusters
- metric: string (optional, default: euclidean)
distance metrics, see (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html)
- Returns
the all-cluster to all-cluster distance matrix.
- Return type
np.array
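One plausible reading of a mean distance matrix between clusters is the pairwise distance between cluster centroids. The sketch below takes that reading as an assumption, supports only dense input and the Euclidean metric, and uses the hypothetical name `centroid_distance_sketch`.

```python
import numpy as np

def centroid_distance_sketch(X, clusters, labels=None):
    """Euclidean distances between the mean profiles of the clusters."""
    X = np.asarray(X, dtype=float)
    clusters = np.asarray(clusters)
    if labels is None:
        labels = np.unique(clusters)
    # One centroid (mean row) per cluster label.
    centers = np.array([X[clusters == lab].mean(axis=0) for lab in labels])
    # Broadcast to all pairs of centroids and take the L2 norm.
    diff = centers[:, None, :] - centers[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

X = [[0, 0], [0, 2], [3, 4], [3, 6]]
clusters = ["a", "a", "b", "b"]
dmat = centroid_distance_sketch(X, clusters)
```

The centroids here are (0, 1) and (3, 5), so the off-diagonal distance is 5.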
- SCCAF_optimize_all(adata, start_groups=None, min_acc=0.9, r1_norm_cutoff=0.5, r2_norm_cutoff=0.05, R1norm_step=0.01, R2norm_step=0.001, min_iter=3, max_iter=10, *args, **kwargs)
- adata: AnnData
The AnnData object of the expression profile.
- min_acc: float optional (default: 0.9)
The minimum self-projection accuracy to be optimized for. e.g., 0.9 means the clustering optimization (merging process) will not stop until the self-projection accuracy is above 90%.
- r1_norm_cutoff: float optional (default: 0.5)
The starting cutoff for the R1norm of the confusion matrix. e.g., 0.5 means 0.5 is used as the cutoff to discretize the confusion matrix after R1norm. The discretized matrix is used to construct the connection graph for clustering optimization.
- r2_norm_cutoff: float optional (default: 0.05)
The starting cutoff for the R2norm of the confusion matrix.
- R1norm_step: float optional (default: 0.01)
The step by which the R1norm cutoff is reduced. Each round of optimization calls the function SCCAF_optimize. The next round of optimization starts from a new cutoff for the R1norm of the confusion matrix: the minimum R1norm value in the previous round minus the R1norm_step value.
- R2norm_step: float optional (default: 0.001)
The step by which the R2norm cutoff is reduced.
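The cutoff schedule described for the R1norm can be sketched as a loop. The sketch assumes a hypothetical callback `run_round(cutoff)` that returns the accuracy of a round and the minimum R1norm value observed in it. It ignores the R2norm schedule and min_iter, and it is an illustration of the update rule, not the function's implementation.

```python
def optimize_all_sketch(run_round, r1_cutoff=0.5, r1_step=0.01,
                        min_acc=0.9, max_iter=10):
    """Repeat rounds, lowering the R1norm cutoff until accuracy exceeds min_acc."""
    history = []
    for _ in range(max_iter):
        acc, min_r1 = run_round(r1_cutoff)  # hypothetical single round
        history.append((r1_cutoff, acc))
        if acc > min_acc:
            break
        # Next cutoff: minimum R1norm of the previous round minus the step.
        r1_cutoff = min_r1 - r1_step
    return history

# Scripted (accuracy, minimum R1norm) results standing in for real rounds.
scripted = iter([(0.70, 0.40), (0.85, 0.30), (0.93, 0.20)])
history = optimize_all_sketch(lambda cutoff: next(scripted))
```

With these scripted results the cutoff moves from 0.5 to about 0.39 and then to about 0.29, and the loop stops once the accuracy passes 0.9.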
- SCCAF_optimize(ad, prefix='L1', use='raw', use_projection=False, R1norm_only=False, R2norm_only=False, dist_only=False, dist_not=True, plot=True, basis='umap', plot_dist=False, plot_cmat=False, mod='1', low_res=None, c_iter=3, n_iter=10, n_jobs=None, start_iter=0, sparsity=0.5, n=100, fraction=0.5, r1_norm_cutoff=0.1, r2_norm_cutoff=1, dist_cutoff=8, classifier='LR', mplotlib_backend=None, min_acc=1)
This is a self-projection confusion matrix directed cluster optimization function.
- ad: AnnData
The AnnData object of the expression profile.
- prefix: String, optional (default: 'L1')
The name of the optimization, used as a prefix. e.g., with prefix = 'L1', the first round of the optimization starts from the clustering in 'L1_Round0'. An over-clustered state must therefore be assigned as the starting point, e.g., ad.obs['L1_Round0'] = ad.obs['louvain']
- use: String, optional (default: 'raw')
Which features to use to train the classifier. Three choices: 'raw' uses all the features; 'hvg' uses the highly variable genes in the anndata object ad.var_names slot; 'pca' uses the PCA data in the anndata object ad.obsm['X_pca'] slot.
- R1norm_only: bool optional (default: False)
Whether to use only the confusion matrix (R1norm) for clustering optimization.
- R2norm_only: bool optional (default: False)
Whether to use only the confusion matrix (R2norm) for clustering optimization.
- dist_only: bool optional (default: False)
Whether to use only the distance matrix for clustering optimization.
- dist_not: bool optional (default: True)
Whether to exclude the distance matrix from clustering optimization.
- plot: bool optional (default: True)
Whether to plot the self-projection results, ROC curves and confusion matrices during the optimization.
- plot_tsne: bool optional (default: False)
Whether to plot the self-projection results as tSNE. If False, the results are plotted as UMAP.
- plot_dist: bool optional (default: False)
Whether to make a scatter plot of the distance against the confusion rate for each cluster.
- plot_cmat: bool optional (default: False)
Whether to plot the confusion matrix.
- mod: string optional (default: ‘1’)
two directions of normalization of confusion matrix for R1norm.
- c_iter: int optional (default: 3)
Number of iterations of sampling for the confusion matrix. The minimum value of confusion rate in all the iterations is used as the confusion rate between two clusters.
- n_iter: int optional (default: 10)
Maximum number of iterations (rounds) for the clustering optimization.
- start_iter: int optional (default: 0)
The start round of the optimization. e.g., with start_iter = 3, the optimization will start from ad.obs['%s_Round3' % prefix].
- sparsity: float optional (default: 0.5)
The sparsity parameter (C in sklearn.linear_model.LogisticRegression) for the logistic regression model.
- n: int optional (default: 100)
Maximum number of cells included in the training set for each cluster of cells.
- n_jobs: int optional (default: None)
Number of jobs/threads to use. None means unlimited.
- fraction: float optional (default: 0.5)
Fraction of the data included in the training set. 0.5 means half of the data is used for training, provided that half of the data is fewer than the maximum number of cells (n).
- r1_norm_cutoff: float optional (default: 0.1)
The cutoff for the confusion rate (R1norm) between two clusters. 0.1 means at most 10% of one cluster may be confused as another cluster.
- r2_norm_cutoff: float optional (default: 1.0)
The cutoff for the confusion rate (R2norm) between two clusters. 1.0 means the confusion between any two clusters should not exceed 1% of the total number of cells.
- dist_cutoff: float optional (default: 8.0)
The cutoff for the Euclidean distance between two clusters of cells. 8.0 means the Euclidean distance between two cell types should be greater than 8.0.
- low_res: str optional
the clustering boundary for under-clustering. Run louvain/leiden clustering at a low resolution and pass the key of that clustering as the under-clustering boundary.
- classifier: String optional (default: 'LR')
a machine learning model: "LR" (Logistic Regression), "RF" (Random Forest), "GNB" (Gaussian Naive Bayes), "SVM" (Support Vector Machine) or "DT" (Decision Tree).
- mplotlib_backend: matplotlib.backends.backend_pdf optional
MatPlotLib multi-page backend object instance, previously initialised (currently the only type supported is PdfPages).
- min_acc: float
the minimum total accuracy to be achieved. Once the accuracy is above this threshold, the optimization stops.
- Returns
The modified anndata object, with a slot "%s_result"%prefix assigned as the clustering optimization results.
- Return type
AnnData
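The c_iter parameter above states that the minimum confusion rate across sampling iterations is used for each pair of clusters. That reduction can be sketched as an element-wise minimum over the normalized confusion matrices from each iteration. `min_confusion_sketch` is a hypothetical name and the matrices below are made-up values for illustration.

```python
import numpy as np

def min_confusion_sketch(matrices):
    """Element-wise minimum over confusion-rate matrices from several iterations."""
    stacked = [np.asarray(m, dtype=float) for m in matrices]
    return np.minimum.reduce(stacked)

iteration_1 = [[0.0, 0.30], [0.30, 0.0]]
iteration_2 = [[0.0, 0.12], [0.12, 0.0]]
iteration_3 = [[0.0, 0.20], [0.20, 0.0]]
combined = min_confusion_sketch([iteration_1, iteration_2, iteration_3])
```

Taking the minimum means a pair of clusters is treated as confused only if it is confused in every sampling iteration.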