ALLCools.utilities

Module Contents

IUPAC_TABLE
COMPLIMENT_BASE
reverse_complement(seq)
get_allc_chroms(allc_path)
parse_mc_pattern(pattern: str) → set

Parse an mC context pattern.
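
As a minimal sketch of what parsing an mC context pattern might involve, the example below assumes the pattern is written with IUPAC degenerate base codes and is expanded into the set of concrete contexts it matches. The names IUPAC_SKETCH and expand_mc_pattern are hypothetical and used only for illustration; the module's own IUPAC_TABLE and implementation may differ.

```python
import itertools

# Hypothetical IUPAC table for illustration only; the module's IUPAC_TABLE may differ.
IUPAC_SKETCH = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_mc_pattern(pattern: str) -> set:
    """Expand a degenerate pattern such as 'CHN' into concrete contexts."""
    choices = []
    for base in pattern.upper():
        if base not in IUPAC_SKETCH:
            raise KeyError(f"Base {base} is not in the IUPAC table.")
        choices.append(IUPAC_SKETCH[base])
    # Cartesian product over the allowed bases at each position
    return {"".join(combo) for combo in itertools.product(*choices)}
```

Under these assumptions, expand_mc_pattern("CGN") gives the four contexts CGA, CGC, CGG and CGT.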

parse_chrom_size(path, remove_chr_list=None)

Supports a simple UCSC chrom size file or the .fai format (whose 1st and 2nd columns are the same as in a chrom size file).

Returns a chrom:length dict.
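
The behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: read_chrom_sizes is a hypothetical name, the file is assumed to be tab-separated, and remove_chr_list is assumed to be a list of chromosome names to skip.

```python
def read_chrom_sizes(path, remove_chr_list=None):
    """Read the first two tab-separated columns into a {chrom: length} dict."""
    # Assumption: remove_chr_list holds chromosome names to leave out.
    skip = set(remove_chr_list or [])
    sizes = {}
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            # Extra .fai columns (offset, line width, ...) are ignored.
            chrom, length = line.rstrip("\n").split("\t")[:2]
            if chrom not in skip:
                sizes[chrom] = int(length)
    return sizes
```

Because only the first two columns are read, the same function handles both a two-column chrom size file and a five-column .fai file.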

chrom_dict_to_id_index(chrom_dict, bin_size)
get_bin_id(chrom, chrom_index_dict, bin_start, bin_size) → int
genome_region_chunks(chrom_size_path: str, bin_length: int = 10000000, combine_small: bool = True) → List[str]

Split the whole genome into bins, where each bin is {bin_length} bp. Used for tabix region queries.

Parameters
  • chrom_size_path – Path of the UCSC genome size file

  • bin_length – Length of each bin, in bp

  • combine_small – Whether to combine small regions into one record

Returns

list of records in tabix query format

Return type

List[str]
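
A minimal sketch of the binning idea, assuming each record is a tabix region string of the form chrom:start-end with 1-based, inclusive coordinates. The function name region_chunks is hypothetical, it takes an already parsed chrom:length dict rather than a file path, and it does not reproduce the combine_small behaviour.

```python
def region_chunks(chrom_sizes, bin_length=10_000_000):
    """Return tabix-style 'chrom:start-end' strings covering each chromosome."""
    records = []
    for chrom, length in chrom_sizes.items():
        for start in range(0, length, bin_length):
            # Clip the last bin to the chromosome end
            end = min(start + bin_length, length)
            # Assumption: 1-based, inclusive coordinates in the region string
            records.append(f"{chrom}:{start + 1}-{end}")
    return records
```

Each returned string could then be passed as the region argument of a tabix query against an indexed file.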

parse_file_paths(input_file_paths: Union[str, list]) → list
get_md5(file_path)
check_tbi_chroms(file_path, genome_dict, same_order=False)
generate_chrom_bin_bed_dataframe(chrom_size_path: str, window_size: int, step_size: int = None) → pandas.DataFrame

Generate a BED-format dataframe based on a UCSC chrom size file and window_size. The returned dataframe contains 3 columns: chrom, start, end. The index is a 0-based, continuous bin index.
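
To illustrate the layout of the result, the sketch below builds the rows with the standard library only. It is an assumption-laden illustration, not the module's implementation: chrom_bin_rows is a hypothetical name, step_size is assumed to default to window_size (non-overlapping windows), and the last window of each chromosome is assumed to be clipped to the chromosome length.

```python
def chrom_bin_rows(chrom_sizes, window_size, step_size=None):
    """Return (chrom, start, end) tuples; list position is the 0-based bin index."""
    # Assumption: without a step size, windows do not overlap.
    step = window_size if step_size is None else step_size
    rows = []
    for chrom, length in chrom_sizes.items():
        for start in range(0, length, step):
            # BED intervals are 0-based, half-open: [start, end)
            rows.append((chrom, start, min(start + window_size, length)))
    return rows
```

Passing these rows to pandas.DataFrame with columns chrom, start and end would give a table with the default 0-based integer index.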

profile_allc(allc_path, drop_n=True, n_rows=1000000, output_path=None)

Generate summary statistics for one ALLC file. 1e8 rows finish in about 5 min.

Parameters
  • allc_path – {allc_path_doc}

  • drop_n – Whether to drop contexts that contain N, such as CCN. These are usually very rare and should be dropped.

  • n_rows – Number of rows to calculate the profile from. The default is usually sufficient for a fairly precise estimate.

  • output_path – Path of the output file. If None, the profile is saved next to the input ALLC file.

is_gz_file(filepath)

Check whether a file is a gzip file; bgzip files also return True. Adapted from: https://stackoverflow.com/questions/3703276/how-to-tell-if-a-file-is-gzip-compressed
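
One common way to perform this check is to compare the first two bytes of the file with the gzip magic number 0x1f 0x8b. The sketch below shows that approach; looks_like_gzip is a hypothetical name, and whether the module uses exactly this test is an assumption.

```python
def looks_like_gzip(filepath):
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(filepath, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"
```

bgzip output is a series of valid gzip blocks, so it begins with the same two bytes and passes the same check.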

tabix_allc(allc_path, reindex=False)

A simple wrapper around the tabix command to index one ALLC file.

Parameters
  • allc_path – {allc_path_doc}

  • reindex – If True, force regeneration of the ALLC index.
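
A wrapper of this kind might look like the sketch below. It assumes the ALLC file is bgzipped with the chromosome in column 1 and the position in column 2, and that the tabix executable is on PATH; build_tabix_command and index_allc_sketch are hypothetical names, and the exact options the module passes to tabix are an assumption.

```python
import os
import subprocess


def build_tabix_command(allc_path, reindex=False):
    """Build the tabix command line for a bgzipped ALLC file."""
    # Assumption: column 1 is the chromosome, column 2 is the position.
    cmd = ["tabix", "-s", "1", "-b", "2", "-e", "2"]
    if reindex:
        cmd.append("-f")  # overwrite an existing index
    cmd.append(allc_path)
    return cmd


def index_allc_sketch(allc_path, reindex=False):
    """Run tabix unless a .tbi index already exists and reindex is False."""
    if os.path.exists(allc_path + ".tbi") and not reindex:
        return
    subprocess.run(build_tabix_command(allc_path, reindex), check=True)
```

Separating command construction from execution keeps the option handling easy to inspect without having tabix installed.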

standardize_allc(allc_path, chrom_size_path, compress_level=5, remove_additional_chrom=False)
Standardize one ALLC file by checking:
  1. No header in the ALLC file;

  2. Chromosome names in the ALLC file must be the same as those in the chrom_size_path file, including “chr”;

  3. Output file will be bgzipped with a .tbi index;

  4. Remove additional chromosomes (remove_additional_chrom=True) or raise KeyError if an unknown chromosome is found (default).

Parameters
  • allc_path – {allc_path_doc}

  • chrom_size_path – {chrom_size_path_doc}

  • compress_level – {compress_level_doc}

  • remove_additional_chrom – {remove_additional_chrom_doc}
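
The chromosome check in item 4 can be sketched as below. This is an illustration only: filter_allc_lines is a hypothetical helper that works on tab-separated text lines already in memory, whereas the real function operates on files and also handles headers, compression and indexing.

```python
def filter_allc_lines(lines, chrom_sizes, remove_additional_chrom=False):
    """Keep lines whose chromosome is known; drop or reject the others."""
    kept = []
    for line in lines:
        chrom = line.split("\t", 1)[0]  # chromosome assumed to be column 1
        if chrom not in chrom_sizes:
            if remove_additional_chrom:
                continue  # silently drop lines on unknown chromosomes
            raise KeyError(f"{chrom} not found in the chrom size file")
        kept.append(line)
    return kept
```

With the default setting, the first unknown chromosome stops processing with a KeyError, which matches the default behaviour described in the list above.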

_transfer_bin_size(bin_size: int) → str

Get a proper string representation of a large bin_size.

parse_dtype(dtype)
binary_count(mc, cov)