ALLCools._extract_allc

Module Contents

_merge_cg_strand(in_path, out_path)

Merge the two strands of each CG site. This runs after the context extraction step in extract_allc and is applied only to CG output, so the context does not need to be checked again.
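As an illustration of what merging CG strands can mean, here is a minimal in-memory sketch. It is not the ALLCools implementation. It assumes rows of (chrom, pos, strand, context, mc, cov) sorted by chromosome and position, that every row is already in CG context, and that the minus-strand cytosine of a CpG sits one base after the plus-strand cytosine. The function name merge_cg_strand_rows and the choice to report lone minus-strand sites at pos - 1 are hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical row layout: chrom, pos, strand, context, mc, cov
Row = Tuple[str, int, str, str, int, int]


def merge_cg_strand_rows(rows: Iterable[Row]) -> Iterator[Row]:
    """Sum mc and cov of paired +/- CG rows; report each site on the + strand."""
    pending: Optional[Row] = None  # last + strand row still waiting for its partner
    for chrom, pos, strand, context, mc, cov in rows:
        if strand == "+":
            if pending is not None:
                yield pending  # previous + row had no - partner
            pending = (chrom, pos, "+", context, mc, cov)
            continue
        # Minus-strand row: pair it with a + row located one base upstream.
        if pending is not None and pending[0] == chrom and pending[1] == pos - 1:
            yield (chrom, pending[1], "+", pending[3], pending[4] + mc, pending[5] + cov)
            pending = None
        else:
            if pending is not None:
                yield pending
                pending = None
            # Assumption: a lone - row is shifted to the + strand coordinate.
            yield (chrom, pos - 1, "+", context, mc, cov)
    if pending is not None:
        yield pending


rows = [
    ("chr1", 10, "+", "CGA", 1, 2),
    ("chr1", 11, "-", "CGT", 3, 4),
    ("chr1", 20, "-", "CGG", 1, 1),
    ("chr1", 30, "+", "CGC", 2, 5),
]
merged = list(merge_cg_strand_rows(rows))
```

Because the input is already restricted to CG sites, the sketch never inspects the context column, which matches the note above.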

_check_strandness_parameter(strandness) -> str
_check_out_format_parameter(out_format, binarize=False) -> Tuple[str, Callable[[list], str]]
_merge_gz_files(file_list, output_path)

Merge the small chunk files generated by _extract_allc_parallel into one output file, then remove the chunk files.
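A merge-then-delete step like this can be sketched with the standard library alone. The sketch below is not the ALLCools code. It relies on the fact that concatenated gzip members form a valid gzip stream, and the name merge_gz_chunks is hypothetical.

```python
import gzip
import os
import shutil
import tempfile
from typing import List


def merge_gz_chunks(file_list: List[str], output_path: str) -> None:
    """Concatenate gzip chunk files in the given order, then delete the chunks."""
    with open(output_path, "wb") as out:
        for path in file_list:
            with open(path, "rb") as chunk:
                # Raw byte copy: each chunk stays its own gzip member.
                shutil.copyfileobj(chunk, out)
    # Remove chunks only after the merged file has been written and closed.
    for path in file_list:
        os.remove(path)


# Small demonstration with two temporary chunk files.
tmp_dir = tempfile.mkdtemp()
chunk_paths = []
for i, text in enumerate(["chr1\t1\n", "chr1\t2\n"]):
    chunk_path = os.path.join(tmp_dir, f"chunk{i}.gz")
    with gzip.open(chunk_path, "wt") as handle:
        handle.write(text)
    chunk_paths.append(chunk_path)

merged_path = os.path.join(tmp_dir, "merged.gz")
merge_gz_chunks(chunk_paths, merged_path)
with gzip.open(merged_path, "rt") as handle:
    merged_text = handle.read()
```

The order of file_list determines the order of records in the merged file, so the caller has to pass the chunks in genome order.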

_extract_allc_parallel(allc_path, output_prefix, mc_contexts, strandness, output_format, chrom_size_path, cov_cutoff, cpu, chunk_size=100000000, tabix=True)

Run extract_allc in parallel at the region level, then merge the region chunk files into the final output in order. Inputs and outputs are the same as for extract_allc, but many small intermediate files are generated while running. Do not use this on small files.
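To make the region-level split concrete, the sketch below cuts each chromosome into windows of at most chunk_size bases and returns them in order. It is an illustration only, not the ALLCools implementation. The function name chunk_regions, the dict input, and the 1-based inclusive "chrom:start-end" region strings are assumptions.

```python
from typing import Dict, List


def chunk_regions(chrom_sizes: Dict[str, int], chunk_size: int = 100_000_000) -> List[str]:
    """Split each chromosome into consecutive regions of at most chunk_size bases."""
    regions = []
    for chrom, size in chrom_sizes.items():  # dict order sets the chromosome order
        for start in range(0, size, chunk_size):
            end = min(start + chunk_size, size)
            # Assumed convention: 1-based, inclusive coordinates.
            regions.append(f"{chrom}:{start + 1}-{end}")
    return regions


example_regions = chunk_regions({"chr1": 250, "chr2": 100}, chunk_size=100)
```

Each region can then be handed to a separate worker, and the per-region outputs merged in the same order as the returned list.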

extract_allc(allc_path: str, output_prefix: str, mc_contexts: Union[str, list], chrom_size_path: str, strandness: str = 'both', output_format: str = 'allc', region: str = None, cov_cutoff: int = 9999, tabix: bool = True, cpu=1, binarize=False)

Extract records by strand and context from a single ALLC file and save them in one of several output formats.

Parameters
  • allc_path – {allc_path_doc}

  • output_prefix – Path prefix of the output ALLC file.

  • mc_contexts – {mc_contexts_doc}

  • strandness – {strandness_doc}

  • output_format – Output format of the extracted information. Possible values are:
      1. allc: keep the ALLC format
      2. bed5: 5-column BED format (chrom, pos, pos, mc, cov)

  • chrom_size_path – {chrom_size_path_doc} If chrom_size_path is provided, ALLC records are extracted in its chromosome order. It is ignored if region is provided.

  • region – {region_doc}

  • cov_cutoff – {cov_cutoff_doc}

  • tabix – Whether to generate a tabix index when the output format is ALLC. Only _extract_allc_parallel should set this to False.

  • cpu – {cpu_basic_doc} This function parallelizes at the region level and generates many small files if cpu > 1. Do not use cpu > 1 for single-cell region counts; for single-cell data, parallelizing at the cell level is better.

  • binarize – {binarize_doc}

Returns

Return type

A list of output file paths, not including index files.
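The bed5 output format described under output_format (chrom, pos, pos, mc, cov) can be illustrated with a small line converter. This is a sketch, not the formatter ALLCools uses. It assumes tab-separated ALLC lines whose first six columns are chrom, pos, strand, context, mc, cov, and the name allc_line_to_bed5 is hypothetical.

```python
def allc_line_to_bed5(line: str) -> str:
    """Convert one tab-separated ALLC line to a bed5 line: chrom, pos, pos, mc, cov."""
    # Assumed column order: chrom, pos, strand, context, mc, cov, then any extras.
    chrom, pos, _strand, _context, mc, cov, *_ = line.rstrip("\n").split("\t")
    # The position is written twice, as in the bed5 description above.
    return "\t".join([chrom, pos, pos, mc, cov]) + "\n"


bed5_line = allc_line_to_bed5("chr1\t3001\t+\tCGA\t1\t2\t1\n")
```

Strand and context are dropped in this format, so any filtering on them has to happen before the conversion.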