ALLCools.sandbox.balanced_knn
Contents
ALLCools.sandbox.balanced_knn
¶
Module Contents¶
- jensen_shannon_divergence(pk: numpy.ndarray, qk: numpy.ndarray) float [source]¶
- jensen_shannon_distance(pk: numpy.ndarray, qk: numpy.ndarray) float [source]¶
- Remarks:
pk and qk must already be normalized so that np.sum(pk) == np.sum(qk) == 1
- balance_knn_loop(dsi: numpy.ndarray, dist: numpy.ndarray, lsi: numpy.ndarray, maxl: int, k: int) Tuple [source]¶
Fast and greedy algorythm to balance a K-NN graph so that no node is the NN to more than maxl other nodes :param dsi: distance sorted indexes (as returned by sklearn NN) :type dsi: np.ndarray (samples, K) :param dist: the actual distance corresponding to the sorted indexes :type dist: np.ndarray (samples, K) :param lsi: degree of connectivity (l) sorted indexes :type lsi: np.ndarray (samples,) :param maxl: max degree of connectivity (from others to self) allowed :type maxl: int :param k: number of neighbours in the final graph :type k: int :param return_distance: wether to return distance :type return_distance: bool
- Returns
dsi_new (np.ndarray (samples, k+1)) – indexes of the NN, first column is the sample itself
dist_new (np.ndarray (samples, k+1)) – distances to the NN
l (np.ndarray (samples)) – l[i] is the number of connections from other samples to the sample i
- knn_balance(dsi: numpy.ndarray, dist: numpy.ndarray = None, maxl: int = 200, k: int = 60) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Balance a K-NN graph so that no node is the NN to more than maxl other nodes :param dsi: distance sorted indexes (as returned by sklearn NN) :type dsi: np.ndarray (samples, K) :param dist: the actual distance corresponding to the sorted indexes :type dist: np.ndarray (samples, K) :param maxl: max degree of connectivity allowed :type maxl: int :param k: number of neighbours in the final graph :type k: int
- Returns
dist_new (np.ndarray (samples, k+1)) – distances to the NN
dsi_new (np.ndarray (samples, k+1)) – indexes of the NN, first column is the sample itself
l (np.ndarray (samples)) – l[i] is the number of connections from other samples to the sample i
- class NearestNeighbors(k: int = 50, sight_k: int = 100, maxl: int = 200, mode: str = 'distance', metric: str = 'euclidean', minkowski_p: int = 20, n_jobs: int = - 1)[source]¶
Greedy algorithm to balance a K-nearest neighbour graph It has an API similar to scikit-learn :param k: the number of neighbours in the final graph :type k: int (default=50) :param sight_k: the number of neighbours in the initialization graph
It correspondent to the farthest neighbour that a sample is allowed to connect to when no closest neighbours are allowed. If sight_k is reached then the matrix is filled with the sample itself
- Parameters
maxl (int (default=200)) – max degree of connectivity allowed. Avoids the presence of hubs in the graph, it is the maximum number of neighbours that are allowed to contact a node before the node is blocked
mode (str (default="connectivity")) – decide wich kind of utput “distance” or “connectivity”
n_jobs (int (default=4)) – parallelization of the standard KNN search preformed at initialization
- fit(self, data: numpy.ndarray, sight_k: int = None) Any [source]¶
Fits the model data: np.ndarray (samples, features)
np
- sight_k: int
the farthest point that a node is allowed to connect to when its closest neighbours are not allowed
- kneighbors(self, X: numpy.ndarray = None, maxl: int = None, mode: str = 'distance') Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
- kneighbors_graph(self, X: numpy.ndarray = None, maxl: int = None, mode: str = 'distance') scipy.sparse.csr_matrix [source]¶
Retrun the K-neighbors graph as a sparse csr matrix :param X: The query point or points.
If not provided, neighbors of each indexed point are returned. In this case, the query point is not considered its own neighbor.
- Parameters
maxl (int) – max degree of connectivity allowed
mode ("distance" or "connectivity") – Decides the kind of output
- Returns
neighbor_graph (scipy.sparse.csr_matrix) – The values are either distances or connectivity dependig of the mode parameter
NOTE (The diagonal will be zero even though the value 0 is actually stored)