ALLCools.mcds.mcds
Module Contents
- class MCDS(dataset, obs_dim=None, var_dim=None)
Bases: xarray.Dataset
MCDS Class
- classmethod open(cls, mcds_paths, obs_dim='cell', use_obs=None, var_dim=None, chunks='auto', split_large_chunks=True, obj_to_str=True, engine=None)
Take one or more MCDS file paths and create a single MCDS concatenated along obs_dim.
- Parameters
mcds_paths – A single MCDS path, an MCDS path pattern with wildcards, or a list of MCDS paths
obs_dim – Dimension name of observations, default is 'cell'
use_obs – Subset the MCDS by a list of observation IDs
var_dim – Which var_dim dataset to use; needed when the MCDS has multiple var_dim datasets stored in the same directory
chunks – If not None, xarray loads the data in chunks as dask arrays. 'auto' lets xarray determine the chunks automatically; for more options, see the chunks parameter of xarray.open_dataset. If None, xarray does not use dask, which is not desired in most cases.
split_large_chunks – Whether to split large chunks (the dask config option array.slicing.split_large_chunks)
obj_to_str – Whether to convert object coordinates to string data type
engine – xarray engine used to store the MCDS. If multiple MCDS are provided, they must all use the same engine.
- Returns
- Return type
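The three accepted forms of mcds_paths (single path, wildcard pattern, list of paths) can be pictured with a small sketch. The helper expand_mcds_paths below is hypothetical and not part of ALLCools; it only assumes that wildcard patterns are expanded with glob-style matching before the datasets are concatenated.

```python
import glob
from pathlib import Path


def expand_mcds_paths(mcds_paths):
    """Hypothetical helper: turn a path, a wildcard pattern, or a list into a flat list."""
    if isinstance(mcds_paths, (str, Path)):
        mcds_paths = [mcds_paths]
    expanded = []
    for path in mcds_paths:
        path = str(path)
        if any(ch in path for ch in "*?["):
            # Assumption: patterns are expanded glob-style, in sorted order.
            expanded.extend(sorted(glob.glob(path)))
        else:
            expanded.append(path)
    return expanded
```

A plain path or a list passes through unchanged, while a pattern such as "data/*.mcds" is replaced by the paths it matches.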
- add_mc_frac(self, var_dim=None, da=None, normalize_per_cell=True, clip_norm_value=10, da_suffix='frac')
Add a posterior mC rate data array for a given feature type (var_dim).
- Parameters
var_dim – Name of the feature type
da – Name of the data array to use; if None, f'{var_dim}_da' is used
normalize_per_cell – If True, normalize the mC rate data array per cell
clip_norm_value – Values in the normalized mC rate data array larger than this are reset to this value
da_suffix – Name suffix appended to the calculated mC rate data array
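The roles of normalize_per_cell and clip_norm_value can be illustrated with a simplified sketch. It assumes a plain mc/cov ratio divided by the cell's mean ratio, which is not the posterior estimate the method computes; normalize_and_clip is a hypothetical name used only for illustration.

```python
def normalize_and_clip(mc, cov, clip_norm_value=10):
    """Sketch: per-cell normalized mC ratio, capped at clip_norm_value.

    mc and cov are lists of rows (one row per cell, one column per feature).
    Assumption: uses the raw mc/cov ratio, not a posterior estimate.
    """
    result = []
    for mc_row, cov_row in zip(mc, cov):
        # Raw ratio per feature; features without coverage are set to 0.
        frac = [m / c if c > 0 else 0.0 for m, c in zip(mc_row, cov_row)]
        cell_mean = sum(frac) / len(frac)
        if cell_mean == 0:
            result.append(frac)
            continue
        # Divide by the cell mean, then reset larger values to the cap.
        result.append([min(f / cell_mean, clip_norm_value) for f in frac])
    return result
```

With one methylated feature out of four, the normalized value is 4, so a cap of 3 resets it to 3 while the remaining features stay at 0.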
- add_feature_cov_mean(self, obs_dim=None, var_dim=None, plot=True)
Add the mean coverage (cov) of each feature across obs_dim.
- Parameters
var_dim – Name of var dimension
obs_dim – Name of obs dimension
plot – If true, plot the distribution of feature cov mean
- Returns
- Return type
- filter_feature_by_cov_mean(self, var_dim=None, min_cov=0, max_cov=999999)
Filter the MCDS by feature cov mean. add_feature_cov_mean() must be called before this function.
- Parameters
var_dim – Name of var dimension
min_cov – Minimum cov cutoff
max_cov – Maximum cov cutoff
- Returns
- Return type
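The two-step order (compute the mean first, then filter) can be sketched on plain lists. Both function names are hypothetical, and the strict inequalities at the cutoffs are an assumption; the documentation above does not say whether the bounds are inclusive.

```python
def feature_cov_mean(cov):
    """cov: rows are cells, columns are features. Returns the mean per feature."""
    n_cells = len(cov)
    return [sum(column) / n_cells for column in zip(*cov)]


def select_by_cov_mean(cov_mean, min_cov=0, max_cov=999999):
    """Return indices of features with min_cov < mean < max_cov (assumed strict)."""
    return [i for i, value in enumerate(cov_mean) if min_cov < value < max_cov]


cov = [
    [2, 40, 5000],  # cell 1
    [4, 60, 7000],  # cell 2
]
means = feature_cov_mean(cov)
kept = select_by_cov_mean(means, min_cov=10, max_cov=1000)
```

Here only the middle feature (mean coverage 50) survives; the first is too sparse and the third is suspiciously deep.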
- get_feature_bed(self, var_dim=None)
Get a BED-format data frame of the var_dim.
- Parameters
var_dim – Name of var_dim
- Returns
- Return type
pd.DataFrame
- remove_black_list_region(self, black_list_path, var_dim=None, f=0.2)
Remove regions that overlap (bedtools intersect -f {f}) with regions in the black_list_path.
- Parameters
var_dim – Name of var_dim
black_list_path – Path to the black list bed file
f – Fraction of overlap when calling bedtools intersect
- Returns
- Return type
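In bedtools intersect, -f is the minimum overlap required, expressed as a fraction of the A interval. The sketch below reproduces that rule on plain tuples without calling bedtools. It assumes the feature regions play the role of A and that the comparison is inclusive; overlaps_black_list is a hypothetical name.

```python
def overlaps_black_list(region, black_list, f=0.2):
    """region and black_list entries are (chrom, start, end), BED-style half-open.

    Returns True if a single black list interval covers at least fraction f
    of the region (assumption: the region is the "A" interval).
    """
    chrom, start, end = region
    length = end - start
    if length <= 0:
        return False
    for b_chrom, b_start, b_end in black_list:
        if b_chrom != chrom:
            continue
        overlap = min(end, b_end) - max(start, b_start)
        if overlap > 0 and overlap / length >= f:
            return True
    return False
```

A 100 bp region overlapped by 10 bp is kept at f=0.2, while one overlapped by 50 bp is removed.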
- remove_chromosome(self, exclude_chromosome, var_dim=None)
Remove regions located on the specified chromosome.
- Parameters
var_dim – Name of var_dim
exclude_chromosome – Chromosome to remove
- Returns
- Return type
MCDS (xr.Dataset)
- calculate_hvf_svr(self, mc_type=None, var_dim=None, obs_dim=None, n_top_feature=5000, da_suffix='frac', plot=True)
- calculate_hvf(self, mc_type=None, var_dim=None, obs_dim=None, min_disp=0.5, max_disp=None, min_mean=0, max_mean=5, n_top_feature=5000, bin_min_features=5, mean_binsize=0.05, cov_binsize=100, da_suffix='frac', plot=True)
Calculate normalized dispersion to select highly variable features (HVFs).
- Parameters
mc_type – Type of mC to calculate
var_dim – Name of variable
obs_dim – Name of observation, default is cell
min_disp – minimum dispersion for a feature to be considered
max_disp – maximum dispersion for a feature to be considered
min_mean – minimum mean for a feature to be considered
max_mean – maximum mean for a feature to be considered
n_top_feature – Number of top features to use as highly variable features. If set, all cutoffs are ignored and HVFs are selected by order of normalized dispersion.
bin_min_features – Minimum number of features for a bin to be kept as a separate bin; if below this number, the bin is merged into its closest bin.
mean_binsize – bin size to separate features across mean
cov_binsize – bin size to separate features across coverage
plot – If true, will plot mean, coverage and normalized dispersion scatter plots.
- Returns
- Return type
pd.DataFrame
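The idea of a normalized dispersion can be sketched as follows: compute each feature's dispersion, group features into bins, and z-score the dispersion within each bin so that features are compared only against peers with a similar mean. This is a simplified illustration, not the method's implementation: it assumes dispersion is variance divided by mean, bins on the mean only (the method also bins on coverage), and skips the merging of small bins. Both function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalized_dispersion(feature_mean, feature_var, mean_binsize=0.05):
    """Z-score each feature's dispersion (var / mean, assumed) within its mean bin."""
    bins = defaultdict(list)
    dispersion = {}
    for fid, mu in feature_mean.items():
        dispersion[fid] = feature_var[fid] / mu if mu > 0 else 0.0
        bins[int(mu // mean_binsize)].append(fid)
    normalized = {}
    for members in bins.values():
        values = [dispersion[fid] for fid in members]
        center, spread = mean(values), pstdev(values)
        for fid in members:
            # A bin with no spread (e.g. one feature) gets a neutral score of 0.
            normalized[fid] = (dispersion[fid] - center) / spread if spread > 0 else 0.0
    return normalized


def top_features(normalized, n_top_feature):
    """Rank features by normalized dispersion and keep the top N."""
    return sorted(normalized, key=normalized.get, reverse=True)[:n_top_feature]
```

Ranking by the normalized value rather than the raw dispersion is what makes a fixed n_top_feature meaningful across features with very different means.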
- get_adata(self, mc_type=None, obs_dim=None, var_dim=None, da_suffix='frac', select_hvf=True, split_large_chunks=True)
Get an anndata object from the MCDS mC rate matrix.
- Parameters
mc_type – mC rate type
var_dim – Name of variable
da_suffix – Suffix of mC rate matrix
obs_dim – Name of observation
select_hvf – Whether to select HVFs; if True, mcds.coords['{var_dim}_{mc_type}_feature_select'] is used to select HVFs
split_large_chunks – Whether to split large chunks (the dask config option array.slicing.split_large_chunks)
- Returns
- Return type
anndata.AnnData
- merge_cluster(self, cluster_col, obs_dim=None, add_mc_frac=True, add_overall_mc=True, overall_mc_da='chrom100k_da')
- write_dataset(self, output_path, mode='w-', obs_dim=None, var_dims: Union[str, list] = None, use_obs=None, chunk_size=1000)
Write the MCDS into an on-disk zarr dataset. Data arrays for each var_dim are saved in separate sub-directories of output_path.
- Parameters
output_path – Path of the zarr dataset
mode – 'w-' means write to output_path and fail if the path exists; 'w' means write to output_path and overwrite if the var_dim sub-directory exists
obs_dim – dimension name of observations
var_dims – dimension name, or a list of dimension names of variables
use_obs – Select and order observations when writing
chunk_size – Size of the chunks used to load and write; set this as large as possible based on available memory
- Returns
- Return type
output_path
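The interplay of use_obs and chunk_size can be pictured as iterating over the selected observations in fixed-size batches, so that only one batch is held in memory at a time. The generator below is a hypothetical sketch of that pattern, not the method's implementation.

```python
def iter_obs_chunks(use_obs, chunk_size=1000):
    """Yield consecutive slices of use_obs, preserving the given order."""
    for start in range(0, len(use_obs), chunk_size):
        yield use_obs[start:start + chunk_size]


# Five observations written in batches of two; the last batch is shorter.
batches = list(iter_obs_chunks(["c3", "c1", "c5", "c2", "c4"], chunk_size=2))
```

Because the batches follow the order of use_obs, the same list both selects and orders the observations; a larger chunk_size means fewer batches at the cost of more memory per batch.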