allcools generate-dataset

Tutorial

Example 1

This example covers the full case. Five region sets (chrom100k, chrom5k, geneslop2k, promoter, CGI) are quantified with the count quantifier in their listed mC contexts (e.g. CGN, CHN, CAN) to generate the mcds files, and two of them (promoter, CGI) are additionally quantified with a hypo-score in the CGN context to generate the mcad files within the dataset.

Command

allcools generate-dataset  \
--allc_table allc_table.tsv \
--output_path Farlik2016CSC.mcds \
--chrom_size_path hg38.main.nochrM.chrom.sizes \
--obs_dim cell  \
--cpu 20 \
--chunk_size 50 \
--regions chrom100k 100000 \
--regions chrom5k 5000 \
--regions geneslop2k hg38_genecode_v28.geneslop2k.bed.gz \
--regions promoter promoter_slop2k.sorted.bed.gz \
--regions CGI hg38_CGI.sorted.bed.gz \
--quantifiers chrom100k count CGN,CHN,CAN \
--quantifiers chrom5k count CGN \
--quantifiers geneslop2k count CGN \
--quantifiers promoter count CGN \
--quantifiers CGI count CGN \
--quantifiers promoter hypo-score CGN cutoff=0.9 \
--quantifiers CGI hypo-score CGN cutoff=0.9

Command Breakdown

--allc_table allc_table.tsv

Specify the path of the ALLC table file here. The table looks like this:

sample1    absolute_allc_path_1
sample2    absolute_allc_path_2
sample3    absolute_allc_path_3

The first column is the cell name (e.g. GSM1922083_A01) and the second column is the path of that cell's ALLC file. The two columns must be separated by a tab, and the table has no header line.
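The table described above can be assembled by hand, but for many cells a short script is less error-prone. The following is a minimal sketch, not part of ALLCools: the function name `write_allc_table`, the `.allc.tsv.gz` suffix, and the convention that the cell name equals the file name minus that suffix are all assumptions made for illustration.

```python
import csv
from pathlib import Path


def write_allc_table(allc_dir, table_path, suffix=".allc.tsv.gz"):
    """Write a header-less, tab-separated table: cell name, absolute ALLC path."""
    rows = []
    for allc_path in sorted(Path(allc_dir).glob(f"*{suffix}")):
        # Assumption: the cell name is the file name with the suffix removed.
        cell_name = allc_path.name[: -len(suffix)]
        rows.append((cell_name, str(allc_path.resolve())))
    with open(table_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return rows
```

Calling `write_allc_table("allc_files", "allc_table.tsv")` on a hypothetical `allc_files` directory would produce a file that can be passed to `--allc_table`.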

--output_path Farlik2016CSC.mcds

Specify here the absolute path of the output MCDS directory.

--obs_dim cell 

Name of the observation dimension. Use “cell” here, which is also the default.

--cpu 20

Specify the number of processes to run in parallel. The default is 1.

--chunk_size 50

Set the chunk size used to split the ALLC table when the dataset is generated in parallel. The default is 10; this example sets it to 50.
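To get a feel for what a chunk size means for a given number of cells, the sketch below splits a list of cell names into consecutive chunks. It is only an illustration: how ALLCools orders and schedules its chunks internally is not described here, and `split_into_chunks` is a hypothetical helper.

```python
def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    return [items[i : i + chunk_size] for i in range(0, len(items), chunk_size)]


# Hypothetical dataset of 120 cells, chunked with --chunk_size 50.
cells = [f"cell{i}" for i in range(120)]
chunks = split_into_chunks(cells, 50)
print([len(chunk) for chunk in chunks])  # [50, 50, 20]
```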

# Format: --regions REGION_NAME REGION_DEFINITION

--regions chrom100k 100000
--regions chrom5k 5000
--regions geneslop2k hg38_genecode_v28.geneslop2k.bed.gz
--regions promoter promoter_slop2k.sorted.bed.gz
--regions CGI hg38_CGI.sorted.bed.gz

One command can generate one MCDS containing multiple region sets. Define each region set with the “--regions” parameter, giving the region name followed by the region definition. This example defines five region sets: chrom100k, chrom5k, geneslop2k, promoter and CGI. For chrom100k and chrom5k, the definition is the bin size (100000 and 5000 respectively). For geneslop2k, promoter and CGI, the definition is the absolute path of a bed file.
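The bin size directly determines how many regions a fixed-size region set contains, which is useful to know before quantifying small bins. The sketch below estimates the number of bins per chromosome from the text of a UCSC chrom.sizes file (two whitespace-separated columns: name and length). It assumes the last, partial bin of each chromosome is kept; `count_bins` and the chromosome lengths in the example are hypothetical.

```python
def count_bins(chrom_sizes_text, bin_size):
    """Return {chrom: number of fixed-size bins} from UCSC chrom.sizes text."""
    counts = {}
    for line in chrom_sizes_text.splitlines():
        if not line.strip():
            continue
        chrom, length = line.split()[:2]
        # Ceiling division: assumes a trailing partial bin still counts.
        counts[chrom] = -(-int(length) // bin_size)
    return counts


# Hypothetical chromosome lengths, for illustration only.
example_sizes = "chrA\t250000\nchrB\t100000\n"
print(count_bins(example_sizes, 100000))  # {'chrA': 3, 'chrB': 1}
print(count_bins(example_sizes, 5000))    # {'chrA': 50, 'chrB': 20}
```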

# Format: --quantifiers REGION_NAME QUANT_TYPE MC_CONTEXTS

--quantifiers chrom100k count CGN,CHN,CAN
--quantifiers chrom5k count CGN
--quantifiers geneslop2k count CGN
--quantifiers promoter count CGN
--quantifiers CGI count CGN

Each region set must have one or more “--quantifiers” entries that specify how its data array is quantified. Each entry here names the mC contexts to quantify for that region set when generating the MCDS files. The “count” quantifier simply sums the base calls within each region.

--quantifiers promoter hypo-score CGN cutoff=0.9

In addition to the count quantifier, one can also specify hypo-score or hyper-score for any region set. The cutoff parameter that follows the mC context is used to filter out small values and reduce file size.
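Because every quantifier must refer to a region set defined by --regions and use one of the three quantifier types, a quick pre-flight check can catch typos before a long run starts. The sketch below is a hypothetical helper, not ALLCools code: `check_quantifier` takes the values following one --quantifiers flag as a list and applies the rules stated in this tutorial and in the command help.

```python
QUANT_TYPES = {"count", "hypo-score", "hyper-score"}
IUPAC_BASES = set("ACGTRYSWKMBDHVN")


def check_quantifier(spec, region_names):
    """Validate the values of one --quantifiers flag and return them parsed."""
    if len(spec) < 3:
        raise ValueError(f"expected REGION_NAME QUANT_TYPE MC_CONTEXTS, got {spec}")
    region, quant_type, contexts, *options = spec
    if region not in region_names:
        raise ValueError(f"region {region!r} is not defined by --regions")
    if quant_type not in QUANT_TYPES:
        raise ValueError(f"unknown quant_type {quant_type!r}")
    context_list = contexts.split(",")
    for context in context_list:
        if not context or not set(context) <= IUPAC_BASES:
            raise ValueError(f"{context!r} is not an IUPAC mC context")
    params = {}
    for option in options:
        # Optional parameters are written as key=value, e.g. cutoff=0.9.
        key, sep, value = option.partition("=")
        if not sep:
            raise ValueError(f"optional parameter {option!r} must be key=value")
        params[key] = value
    return {"region": region, "quant_type": quant_type,
            "contexts": context_list, "params": params}


regions = {"promoter", "CGI"}
print(check_quantifier(["promoter", "hypo-score", "CGN", "cutoff=0.9"], regions))
```

A misspelled region name or quantifier type raises a ValueError with a message naming the offending value.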

Example 2

This example shows a simpler case with three region sets (chrom100k, chrom5k, geneslop2k). Two of them (chrom100k, geneslop2k) are quantified with count in the CGN and CHN contexts to generate the mcds files, while chrom5k is quantified only with a hypo-score in the CGN context.

allcools generate-dataset \
--allc_table Luo2017Science_Human_allc_table.tsv \
--output_path Luo2017Science_Human.mcds \
--chrom_size_path hg19.main.nochrM.chrom.sizes \
--obs_dim cell \
--cpu 20 \
--chunk_size 50 \
--regions chrom100k 100000 \
--regions chrom5k 5000 \
--regions geneslop2k gencode.v28lift37.annotation.gene.bed.gz \
--quantifiers chrom100k count CGN,CHN \
--quantifiers geneslop2k count CGN,CHN \
--quantifiers chrom5k hypo-score CGN cutoff=0.9

Example 3

This example is for a NOMe-treated dataset, where the hyper-score in the GCHN context denotes chromatin accessibility.

allcools generate-dataset \
--allc_table allc_table.tsv \
--output_path Li2018NCB.mcds \
--chrom_size_path hg19.main.nochrM.chrom.sizes \
--obs_dim cell \
--cpu 20 \
--chunk_size 50 \
--regions chrom100k 100000 \
--regions chrom5k 5000 \
--regions geneslop2k gencode.v28lift37.annotation.gene.bed.gz \
--regions promoter promoter_slop2k.sorted.bed.gz \
--regions CGI hg19_CGI.sorted.bed.gz \
--quantifiers chrom100k count GCHN,WCGN \
--quantifiers chrom5k count GCHN,WCGN \
--quantifiers geneslop2k count GCHN,WCGN \
--quantifiers promoter count GCHN,WCGN \
--quantifiers CGI count GCHN,WCGN \
--quantifiers chrom5k hypo-score WCGN cutoff=0.9 \
--quantifiers promoter hypo-score WCGN cutoff=0.9 \
--quantifiers CGI hypo-score WCGN cutoff=0.9 \
--quantifiers chrom5k hyper-score GCHN cutoff=0.9 \
--quantifiers promoter hyper-score GCHN cutoff=0.9 \
--quantifiers CGI hyper-score GCHN cutoff=0.9
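The contexts GCHN and WCGN used above are IUPAC patterns, where H stands for A, C or T, W for A or T, and N for any base. The sketch below expands such a pattern into the concrete sequences it matches, which makes it easy to see that the two contexts do not overlap. The helper name `expand_context` is hypothetical; the code table is the standard IUPAC nucleotide alphabet.

```python
from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_context(pattern):
    """List every concrete sequence matched by an IUPAC context pattern."""
    return ["".join(bases) for bases in product(*(IUPAC[code] for code in pattern))]


print(expand_context("CGN"))        # ['CGA', 'CGC', 'CGG', 'CGT']
print(len(expand_context("WCGN")))  # 8
print(len(expand_context("GCHN")))  # 12
# The two NOMe contexts share no concrete sequence.
print(set(expand_context("WCGN")) & set(expand_context("GCHN")))  # set()
```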

Command Doc

$ allcools generate-dataset -h
usage: allcools generate-dataset [-h] --allc_table ALLC_TABLE --output_path
                                 OUTPUT_PATH --chrom_size_path CHROM_SIZE_PATH
                                 [--obs_dim OBS_DIM] [--cpu CPU]
                                 [--chunk_size CHUNK_SIZE] --regions REGIONS
                                 REGIONS --quantifiers QUANTIFIERS
                                 [QUANTIFIERS ...]

optional arguments:
  -h, --help            show this help message and exit
  --obs_dim OBS_DIM     Name of the observation dimension. (default: cell)
  --cpu CPU             Number of processes to use in parallel. (default: 1)
  --chunk_size CHUNK_SIZE
                        Chunk allc_table with chunk_size when generate dataset
                        in parallel (default: 10)

required arguments:
  --allc_table ALLC_TABLE
                        Contain all the ALLC file information in two tab-
                        separated columns: 1. file_uid, 2. file_path. No
                        header (default: None)
  --output_path OUTPUT_PATH
                        Output path of the MCDS dataset (default: None)
  --chrom_size_path CHROM_SIZE_PATH
                        Path to UCSC chrom size file. This can be generated
                        from the genome fasta or downloaded via UCSC
                        fetchChromSizes tools. All ALLCools functions will
                        refer to this file whenever possible to check for
                        chromosome names and lengths, so it is crucial to use
                        a chrom size file consistent to the reference fasta
                        file ever since mapping. ALLCools functions will not
                        change or infer chromosome names. (default: None)
  --regions REGIONS REGIONS
                        Definition of genomic regions in the form of "--
                        regions {region_name} {region_definition}". This
                        parameter can be specified multiple times, to allow
                        quantification of multiple region sets in the same
                        MCDS dataset. Several cases are allowed: 1) a integer
                        number means fix-sized genomic bins, region bed and
                        region id will be generated automatically based on the
                        chrom_size_path parameter (e.g., "--regions chrom100k
                        100000"); 2) a path to a three-column bed file, in
                        this case, a forth column containing region id in the
                        form of {region_name}_{i} will be added automatically
                        (e.g., "--regions gene /path/to/gene_bed_no_id.bed",
                        where the bed file only has chrom, start, end
                        columns); 3) a path to a four-column bed file, in this
                        case, the forth column will be treated as region id
                        and the region ids must be UNIQUE. (e.g., "--regions
                        gene /path/to/gene_bed_with_id.bed", where the bed
                        file has chrom, start, end, id columns). (default:
                        None)
  --quantifiers QUANTIFIERS [QUANTIFIERS ...]
                        Definition of genome region quantifiers in the form of
                        "--quantifiers {region_name} {quant_type}
                        {mc_contexts} {optional_parameter}". The region_name
                        determines which region set this quantifier applies
                        to, region_name must be defined by "--regions"
                        parameter. The quant_type specify which quantifiers,
                        it must be in ["count", "hypo-score", "hyper-score"].
                        The mc_contexts specify a comma separated mC context
                        list, it must be the same size as the ALLC table, and
                        uses IUPAC base abbreviation. --quantifiers parameter
                        can be specified multiple times, to allow different
                        quantification for different region sets, or multiple
                        quantification for the same region set. Some examples:
                        1) To quantify raw counts of a region set in mCG and
                        mCH context: "--quantifiers gene count CGN,CHN" 2) To
                        quantify the mCG hypo-methylation score of chrom 5Kb
                        bins: "--quantifiers chrom5k hypo-score CGN
                        cutoff=0.9", by default, cutoff=0.9, so the last part
                        is optional. 3) To ALSO quantify the mCG raw counts of
                        chrom 5Kb bins in the same MCDS, just specify another
                        quantifiers in the same command: "--quantifiers
                        chrom5k count CGN", note the count matrix of chrom5k
                        will be large. Its not usually needed, but you have
                        the option if needed. (default: None)
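The help text above requires the region ids in a four-column bed file to be unique. A minimal sketch for checking this before running the command is shown below. It assumes a tab-separated bed file, optionally gzip-compressed, with the id in the fourth column; `duplicated_region_ids` is a hypothetical helper and not part of ALLCools.

```python
import gzip
from collections import Counter


def duplicated_region_ids(bed_path):
    """Return the region ids (fourth column) that occur more than once."""
    opener = gzip.open if str(bed_path).endswith(".gz") else open
    counts = Counter()
    with opener(bed_path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Three-column bed lines carry no id and are skipped.
            if len(fields) >= 4:
                counts[fields[3]] += 1
    return sorted(region_id for region_id, n in counts.items() if n > 1)
```

An empty list means every id is unique; any returned ids should be renamed or de-duplicated before the file is passed to `--regions`.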