`ALLCools.mcds.mcds`

Module Contents

make_obs_df_var_df(use_data, obs_dim, var_dim)

class MCDS(dataset, obs_dim=None, var_dim=None)

Bases: xarray.Dataset

MCDS Class

__slots__ = []

property var_dim(self)

property obs_dim(self)

property obs_names(self)

property var_names(self)

_verify_dim(self, dim, mode)

classmethod open(cls, mcds_paths, obs_dim='cell', use_obs=None, var_dim=None, chunks='auto', split_large_chunks=True, obj_to_str=True, engine=None)

Take one or more MCDS file paths and create a single MCDS concatenated along obs_dim.

Parameters

mcds_paths – A single MCDS path, an MCDS path pattern containing wildcards, or a list of MCDS paths
obs_dim – Dimension name of observations; default is 'cell'
use_obs – Subset the MCDS by a list of observation IDs.
var_dim – Which var_dim dataset to use; needed when the MCDS has multiple var_dim datasets stored in the same directory
chunks – If not None, xarray will use chunks to load data as dask.array. "auto" means xarray will determine chunks automatically. For more options, see the chunks parameter in the xarray.open_dataset documentation. If None, xarray will not use dask, which is not desired in most cases.
split_large_chunks – Whether to split large chunks; sets array.slicing.split_large_chunks in the dask config
obj_to_str – Whether to convert object coordinates to the string data type
engine – xarray engine used to store the MCDS; if multiple MCDS are provided, they must all use the same engine

Returns

Return type

MCDS
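The three accepted forms of mcds_paths (single path, wildcard pattern, list) can be pictured with a small stand-alone helper. This is only a sketch: expand_mcds_paths is a hypothetical name, and this page does not say how open() resolves patterns internally; the helper simply relies on the standard glob module.

```python
import glob
from pathlib import Path


def expand_mcds_paths(mcds_paths):
    """Hypothetical helper (not part of ALLCools): normalize a single path,
    a wildcard pattern, or a list of either into a sorted list of paths."""
    if isinstance(mcds_paths, (str, Path)):
        mcds_paths = [mcds_paths]
    expanded = []
    for item in mcds_paths:
        item = str(item)
        if any(ch in item for ch in "*?["):
            # Wildcard pattern: let glob find the matching paths
            expanded.extend(glob.glob(item))
        else:
            # Plain path: keep as given
            expanded.append(item)
    # De-duplicate and sort so the concatenation order is reproducible
    return sorted(set(expanded))
```

Whatever the input form, the result is a flat list of paths, which is the shape needed before concatenating datasets along obs_dim.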

add_mc_frac(self, var_dim=None, da=None, normalize_per_cell=True, clip_norm_value=10, da_suffix='frac')

Add a posterior mC rate data array for a given feature type (var_dim).

Parameters

var_dim – Name of the feature type
da – If None, f'{var_dim}_da' is used
normalize_per_cell – If True, normalize the mC rate data array per cell
clip_norm_value – Values in the normalized mC rate data array larger than this are reset to this value
da_suffix – Name suffix appended to the calculated mC rate data array
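To make normalize_per_cell and clip_norm_value concrete, here is a minimal single-cell sketch. It assumes a beta-binomial model with a method-of-moments prior, which is one common way to obtain a posterior rate; this page does not confirm that add_mc_frac uses exactly this estimator, and posterior_mc_frac is a hypothetical name.

```python
from statistics import mean, pvariance


def posterior_mc_frac(mc, cov, normalize_per_cell=True, clip_norm_value=10):
    """Sketch for ONE cell: mc and cov are per-feature count lists.
    Assumed model, not taken from the ALLCools source."""
    # Raw fractions, using covered features only
    raw = [m / c for m, c in zip(mc, cov) if c > 0]
    mu, var = mean(raw), pvariance(raw)
    # Method-of-moments beta prior; assumes 0 < var < mu * (1 - mu)
    scale = mu * (1 - mu) / var - 1
    alpha, beta = mu * scale, (1 - mu) * scale
    # Posterior mean per feature: low-coverage features shrink to the prior
    post = [(m + alpha) / (c + alpha + beta) for m, c in zip(mc, cov)]
    if normalize_per_cell:
        # Divide by the cell's prior mean, then cap large values
        prior_mean = alpha / (alpha + beta)
        post = [min(p / prior_mean, clip_norm_value) for p in post]
    return post
```

Under these assumptions a feature with zero coverage ends up at exactly the cell's prior mean, so its normalized value is 1, and no normalized value exceeds clip_norm_value.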

add_mc_rate(self, *args, **kwargs)

add_feature_cov_mean(self, obs_dim=None, var_dim=None, plot=True)

Add feature cov mean across obs_dim.

Parameters

var_dim – Name of var dimension
obs_dim – Name of obs dimension
plot – If true, plot the distribution of feature cov mean

Returns

Return type

None

add_cell_metadata(self, metadata, obs_dim=None)

filter_feature_by_cov_mean(self, var_dim=None, min_cov=0, max_cov=999999)

Filter the MCDS by feature cov mean. add_feature_cov_mean() must be called before this function.

Parameters

var_dim – Name of var dimension
min_cov – Minimum cov cutoff
max_cov – Maximum cov cutoff

Returns

Return type

MCDS
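The two-step logic (compute the mean coverage of each feature across cells, then keep features within the cutoffs) can be sketched in plain Python. The function names are hypothetical, and the use of strict inequalities is an assumption, since this page does not state whether the bounds are inclusive.

```python
def feature_cov_mean(cov_matrix):
    """cov_matrix: list of rows, one row of per-feature coverage per cell.
    Returns the mean coverage of each feature across cells."""
    n_cells = len(cov_matrix)
    return [sum(col) / n_cells for col in zip(*cov_matrix)]


def keep_features(cov_mean, min_cov=0, max_cov=999999):
    """Return the indices of features whose mean coverage passes the cutoffs.
    Assumption: strict bounds (min_cov < mean < max_cov)."""
    return [i for i, m in enumerate(cov_mean) if min_cov < m < max_cov]
```

For example, with two cells and three features, `keep_features(feature_cov_mean([[10, 0, 500], [30, 2, 700]]), min_cov=5, max_cov=100)` keeps only the first feature.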

get_feature_bed(self, var_dim=None)

Get a bed format data frame of the var_dim

Parameters: var_dim – Name of var_dim
Returns
Return type: pd.DataFrame

remove_black_list_region(self, black_list_path, var_dim=None, f=0.2)

Remove regions that overlap (bedtools intersect -f {f}) with regions in the black_list_path

Parameters

var_dim – Name of var_dim
black_list_path – Path to the black list bed file
f – Fraction of overlap when calling bedtools intersect

Returns

Return type

MCDS
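The meaning of f can be illustrated without bedtools. In bedtools intersect, -f is the minimum overlap required as a fraction of the A interval. The sketch below assumes the var_dim regions play the role of A (this page does not say so explicitly); overlaps_black_list is a hypothetical name.

```python
def overlaps_black_list(region, black_list, f=0.2):
    """region and black_list entries are (chrom, start, end) tuples in BED
    coordinates (0-based start, exclusive end). Returns True if any single
    black-list interval covers at least a fraction f of the region."""
    chrom, start, end = region
    length = end - start
    for b_chrom, b_start, b_end in black_list:
        if b_chrom != chrom:
            continue
        overlap = min(end, b_end) - max(start, b_start)
        # The fraction is taken per black-list interval, relative to the region
        if overlap > 0 and overlap / length >= f:
            return True
    return False
```

With the default f=0.2, a 100 bp region that shares 10 bp with a black-list interval would be kept, while one that shares 30 bp would be removed.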

remove_chromosome(self, exclude_chromosome, var_dim=None)

Remove regions located on the specified chromosome

Parameters

var_dim – Name of var_dim
exclude_chromosome – Chromosome to remove

Returns

Return type

MCDS (xr.Dataset)

calculate_hvf_svr(self, mc_type=None, var_dim=None, obs_dim=None, n_top_feature=5000, da_suffix='frac', plot=True)

calculate_hvf(self, mc_type=None, var_dim=None, obs_dim=None, min_disp=0.5, max_disp=None, min_mean=0, max_mean=5, n_top_feature=5000, bin_min_features=5, mean_binsize=0.05, cov_binsize=100, da_suffix='frac', plot=True)

Calculate normalized dispersion to select highly variable features.

Parameters

mc_type – Type of mC to calculate
var_dim – Name of variable
obs_dim – Name of observation, default is cell
min_disp – minimum dispersion for a feature to be considered
max_disp – maximum dispersion for a feature to be considered
min_mean – minimum mean for a feature to be considered
max_mean – maximum mean for a feature to be considered
n_top_feature – Top N features to use as highly variable features. If set, all cutoffs are ignored and HVFs are selected by rank of normalized dispersion.
bin_min_features – Minimum number of features for a bin to be kept as a separate bin; if below this number, the bin is merged into its closest bin.
mean_binsize – bin size to separate features across mean
cov_binsize – bin size to separate features across coverage
plot – If true, will plot mean, coverage and normalized dispersion scatter plots.

Returns

Return type

pd.DataFrame
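The interaction between n_top_feature and the cutoff parameters can be sketched as follows. This covers only the final selection step, taking the per-feature mean and normalized dispersion as given; the binning by mean and coverage is not shown. select_hvf is a hypothetical name and the strict inequalities are an assumption.

```python
def select_hvf(stats, n_top_feature=None, min_disp=0.5, max_disp=None,
               min_mean=0, max_mean=5):
    """stats: dict mapping feature id -> (mean, normalized dispersion).
    Returns the set of selected feature ids."""
    if n_top_feature is not None:
        # Cutoffs are ignored: rank by normalized dispersion, take the top N
        ranked = sorted(stats, key=lambda k: stats[k][1], reverse=True)
        return set(ranked[:n_top_feature])
    selected = set()
    for feature, (mean, disp) in stats.items():
        ok_mean = min_mean < mean < max_mean
        ok_disp = disp > min_disp and (max_disp is None or disp < max_disp)
        if ok_mean and ok_disp:
            selected.add(feature)
    return selected
```

Note that in this sketch a feature with a mean outside [min_mean, max_mean] can still be selected when n_top_feature is set, which matches the statement that the cutoffs are ignored in that case.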

get_score_adata(self, mc_type, quant_type, obs_dim=None, var_dim=None, sparse=True)

get_adata(self, mc_type=None, obs_dim=None, var_dim=None, da_suffix='frac', select_hvf=True, split_large_chunks=True)

Get anndata from the MCDS mC rate matrix.

Parameters

mc_type – mC rate type
var_dim – Name of variable
da_suffix – Suffix of the mC rate matrix
obs_dim – Name of observation
select_hvf – Whether to select HVFs; if True, mcds.coords['{var_dim}_{mc_type}_feature_select'] is used to select HVFs
split_large_chunks – Whether to split large chunks; sets array.slicing.split_large_chunks in the dask config

Returns
Return type: anndata.AnnData
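The methods documented on this page are typically chained. The following is a hedged sketch of one possible order, using only the signatures listed here. preprocess_mcds is a hypothetical wrapper; the values "CHN", "chrM", 500 and 3000 are illustrative placeholders, and it is assumed that add_feature_cov_mean, add_mc_frac and calculate_hvf update the MCDS in place while the filter and remove methods return a new MCDS (as their listed return types suggest).

```python
def preprocess_mcds(mcds_paths, black_list_path, var_dim="chrom100k",
                    mc_type="CHN", exclude_chromosome="chrM"):
    """Hypothetical wrapper; calling it requires ALLCools and real MCDS files."""
    # Deferred import so that defining this function does not need ALLCools
    from ALLCools.mcds.mcds import MCDS

    mcds = MCDS.open(mcds_paths, obs_dim="cell", var_dim=var_dim)
    # filter_feature_by_cov_mean requires add_feature_cov_mean to run first
    mcds.add_feature_cov_mean(var_dim=var_dim, plot=False)
    mcds = mcds.filter_feature_by_cov_mean(var_dim=var_dim,
                                           min_cov=500, max_cov=3000)
    mcds = mcds.remove_black_list_region(black_list_path, var_dim=var_dim, f=0.2)
    mcds = mcds.remove_chromosome(exclude_chromosome, var_dim=var_dim)
    # Assumed to add the '{var_dim}_da_frac' style array in place
    mcds.add_mc_frac(var_dim=var_dim, normalize_per_cell=True, clip_norm_value=10)
    mcds.calculate_hvf(mc_type=mc_type, var_dim=var_dim,
                       n_top_feature=5000, plot=False)
    # select_hvf=True relies on the feature selection stored by calculate_hvf
    return mcds.get_adata(mc_type=mc_type, var_dim=var_dim, select_hvf=True)
```

The coverage cutoffs in particular depend on the dataset and feature size, so they should be chosen after inspecting the distribution plotted by add_feature_cov_mean.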

merge_cluster(self, cluster_col, obs_dim=None, add_mc_frac=True, add_overall_mc=True, overall_mc_da='chrom100k_da')

to_region_ds(self, region_dim=None)

write_dataset(self, output_path, mode='w-', obs_dim=None, var_dims: Union[str, list] = None, use_obs=None, chunk_size=1000)

Write the MCDS to an on-disk zarr dataset. Data arrays for each var_dim are saved in separate sub-directories of output_path.

Parameters

output_path – Path of the zarr dataset
mode – 'w-' means write to output_path and fail if the path exists; 'w' means write to output_path and overwrite the var_dim sub-directory if it exists
obs_dim – Dimension name of observations
var_dims – Dimension name, or a list of dimension names, of variables
use_obs – Select and order observations when writing.
chunk_size – Chunk size used for loading and writing; set this as large as available memory allows.

Returns

Return type

output_path
