allcools generate-dataset
Tutorial
Example 1
This example covers the most comprehensive case. Five region sets defined with “--regions” (chrom100k, chrom5k, geneslop2k, promoter, CGI) are quantified with count in their respective contexts (e.g. CGN, CHN, CAN) to generate the mcds files, while two of those region sets (promoter, CGI) are additionally given hypo-score quantifiers in the CGN context to generate the mcad files within the dataset.
Command
allcools generate-dataset \
--allc_table allc_table.tsv \
--output_path Farlik2016CSC.mcds \
--chrom_size_path hg38.main.nochrM.chrom.sizes \
--obs_dim cell \
--cpu 20 \
--chunk_size 50 \
--regions chrom100k 100000 \
--regions chrom5k 5000 \
--regions geneslop2k hg38_genecode_v28.geneslop2k.bed.gz \
--regions promoter promoter_slop2k.sorted.bed.gz \
--regions CGI hg38_CGI.sorted.bed.gz \
--quantifiers chrom100k count CGN,CHN,CAN \
--quantifiers chrom5k count CGN \
--quantifiers geneslop2k count CGN \
--quantifiers promoter count CGN \
--quantifiers CGI count CGN \
--quantifiers promoter hypo-score CGN cutoff=0.9 \
--quantifiers CGI hypo-score CGN cutoff=0.9
Command Breakdown
--allc_table allc_table.tsv
Specify the path of the ALLC table file here. The table has no header; an example looks like this:
sample1 absolute_allc_path_1
sample2 absolute_allc_path_2
sample3 absolute_allc_path_3
The first column indicates the cell name (e.g. GSM1922083_A01) whereas the second column indicates the allc file path of the cell. Make sure the two parts are separated by a tab.
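The table described above can be assembled by hand, but a short script avoids tab and path mistakes. The following is a minimal sketch, not part of ALLCools: the function name `write_allc_table`, the `.allc.tsv.gz` suffix and the rule "cell name = file name without the suffix" are assumptions made for illustration.

```python
import csv
from pathlib import Path


def write_allc_table(allc_dir, table_path, suffix=".allc.tsv.gz"):
    """Write a headerless, tab-separated table: cell name, absolute ALLC path."""
    rows = []
    for allc_path in sorted(Path(allc_dir).glob(f"*{suffix}")):
        # Assumption: the cell name is the file name without the suffix.
        cell_name = allc_path.name[: -len(suffix)]
        rows.append((cell_name, str(allc_path.resolve())))

    # The first column acts as an identifier, so refuse duplicated names.
    names = [name for name, _ in rows]
    if len(names) != len(set(names)):
        raise ValueError("cell names must be unique")

    with open(table_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return rows
```

Calling `write_allc_table("allc_files/", "allc_table.tsv")` on a hypothetical directory would produce a file suitable for `--allc_table`, provided the naming assumption matches your files.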
--output_path Farlik2016CSC.mcds
Specify here the absolute path of the output MCDS directory.
--obs_dim cell
Name the observation dimension “cell”, which is also the default.
--cpu 20
Specify the number of processes to run in parallel. The default is 1.
--chunk_size 50
Set the number of ALLC table rows processed per parallel chunk. The default size is 10; this example uses 50.
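To make the effect of the chunk size concrete, the sketch below splits a list of table rows into chunks. It only illustrates the idea of chunking; `chunk_rows` is a hypothetical helper and is not how ALLCools is implemented internally.

```python
def chunk_rows(rows, chunk_size=50):
    """Split a list of ALLC table rows into consecutive chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
```

With 120 cells and a chunk size of 50, this yields three chunks of 50, 50 and 20 rows.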
# Format: --regions REGION_NAME REGION_DEFINITION
--regions chrom100k 100000
--regions chrom5k 5000
--regions geneslop2k hg38_genecode_v28.geneslop2k.bed.gz
--regions promoter promoter_slop2k.sorted.bed.gz
--regions CGI hg38_CGI.sorted.bed.gz
One command can generate a single MCDS containing multiple region sets. Define each region set with the “--regions” parameter: the region name followed by the region definition. In this example, the region sets are chrom100k, chrom5k, geneslop2k, promoter and CGI. For chrom100k and chrom5k, the definition is the bin size (100000 and 5000, respectively). For geneslop2k, promoter and CGI, the definition is the absolute path of the BED file.
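When a region definition is an integer, the bins are derived from the chrom size file. The sketch below shows one way such fixed-size bins could be computed. The helper names, the `{region_name}_{i}` id format for bins and the truncation of the last bin at the chromosome end are assumptions for illustration, not a description of ALLCools internals.

```python
def read_chrom_sizes(path):
    """Read a UCSC-style chrom size file: chromosome name, then length."""
    sizes = {}
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            chrom, size = line.split()[:2]
            sizes[chrom] = int(size)
    return sizes


def make_fixed_bins(chrom_sizes, bin_size, region_name):
    """Return (chrom, start, end, region_id) tuples covering each chromosome."""
    bins = []
    index = 0
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, bin_size):
            # Assumption: the last bin is truncated at the chromosome end.
            end = min(start + bin_size, size)
            bins.append((chrom, start, end, f"{region_name}_{index}"))
            index += 1
    return bins
```

For example, `make_fixed_bins(read_chrom_sizes("hg38.main.nochrM.chrom.sizes"), 100000, "chrom100k")` would mirror the `--regions chrom100k 100000` definition under these assumptions.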
# Format: --quantifiers REGION_NAME QUANT_TYPE MC_CONTEXTS
--quantifiers chrom100k count CGN,CHN,CAN
--quantifiers chrom5k count CGN
--quantifiers geneslop2k count CGN
--quantifiers promoter count CGN
--quantifiers CGI count CGN
Each region set must have one or more “--quantifiers” entries that specify how its data array is quantified. Each entry names the region set, the quantifier type and the mC contexts needed for that feature in the MCDS. Use the “count” quantifier to simply sum the base calls of each feature.
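The mC contexts use IUPAC base abbreviations, so CGN stands for CGA, CGC, CGG and CGT, and H in CHN stands for A, C or T. The sketch below illustrates how a count quantifier could sum methylated and total base calls for one region and one context. The simplified `(context, mc, cov)` site records and the function names are assumptions for illustration only.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def context_matches(pattern, context):
    """Return True if a concrete context (e.g. CAG) fits a pattern (e.g. CHN)."""
    if len(pattern) != len(context):
        return False
    return all(base in IUPAC[code] for code, base in zip(pattern, context))


def count_region(sites, pattern):
    """Sum mc and cov over the sites of one region that match the pattern.

    `sites` is an iterable of (context, mc, cov) tuples, a simplified
    stand-in for the ALLC records that fall inside the region.
    """
    mc_total = 0
    cov_total = 0
    for context, mc, cov in sites:
        if context_matches(pattern, context):
            mc_total += mc
            cov_total += cov
    return mc_total, cov_total
```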
--quantifiers promoter hypo-score CGN cutoff=0.9
In addition to the count quantifier, you can also specify a hypo-score or hyper-score quantifier for any region set. The optional cutoff parameter that follows the mC context is used to filter out small values and reduce file size; it defaults to cutoff=0.9.
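The cutoff only pays off if the filtered values are not stored. The sketch below shows the idea on one cell's list of region scores: values under the cutoff are dropped and the rest are kept as a sparse mapping. How the scores themselves are computed is not covered here, and both the helper name `apply_cutoff` and the choice to keep values equal to the cutoff are assumptions.

```python
def apply_cutoff(scores, cutoff=0.9):
    """Keep only scores at or above the cutoff as a sparse {index: score} map."""
    # Assumption: a score exactly equal to the cutoff is kept.
    return {i: score for i, score in enumerate(scores) if score >= cutoff}
```

With most regions scoring below 0.9, the sparse mapping holds far fewer entries than the dense list, which is where the reduction in file size comes from.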
Example 2
This example shows a simpler case. Three region sets (chrom100k, chrom5k, geneslop2k) are defined; chrom100k and geneslop2k are quantified with count in the CGN and CHN contexts, while chrom5k is quantified with hypo-score in the CGN context.
allcools generate-dataset \
--allc_table Luo2017Science_Human_allc_table.tsv \
--output_path Luo2017Science_Human.mcds \
--chrom_size_path hg19.main.nochrM.chrom.sizes \
--obs_dim cell \
--cpu 20 \
--chunk_size 50 \
--regions chrom100k 100000 \
--regions chrom5k 5000 \
--regions geneslop2k gencode.v28lift37.annotation.gene.bed.gz \
--quantifiers chrom100k count CGN,CHN \
--quantifiers geneslop2k count CGN,CHN \
--quantifiers chrom5k hypo-score CGN cutoff=0.9
Example 3
This example is for a NOMe-treated dataset, in which the hyper-score in the GCHN context denotes chromatin accessibility information.
allcools generate-dataset \
--allc_table allc_table.tsv \
--output_path Li2018NCB.mcds \
--chrom_size_path hg19.main.nochrM.chrom.sizes \
--obs_dim cell \
--cpu 20 \
--chunk_size 50 \
--regions chrom100k 100000 \
--regions chrom5k 5000 \
--regions geneslop2k gencode.v28lift37.annotation.gene.bed.gz \
--regions promoter promoter_slop2k.sorted.bed.gz \
--regions CGI hg19_CGI.sorted.bed.gz \
--quantifiers chrom100k count GCHN,WCGN \
--quantifiers chrom5k count GCHN,WCGN \
--quantifiers geneslop2k count GCHN,WCGN \
--quantifiers promoter count GCHN,WCGN \
--quantifiers CGI count GCHN,WCGN \
--quantifiers chrom5k hypo-score WCGN cutoff=0.9 \
--quantifiers promoter hypo-score WCGN cutoff=0.9 \
--quantifiers CGI hypo-score WCGN cutoff=0.9 \
--quantifiers chrom5k hyper-score GCHN cutoff=0.9 \
--quantifiers promoter hyper-score GCHN cutoff=0.9 \
--quantifiers CGI hyper-score GCHN cutoff=0.9
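Commands with many repeated “--regions” and “--quantifiers” entries, like the one above, can also be assembled programmatically. The following is a minimal sketch that rebuilds the Example 3 argument list using only the flags shown on this page; the helper names are hypothetical and the script does not run ALLCools itself.

```python
import shlex


def build_generate_dataset_args(allc_table, output_path, chrom_size_path,
                                regions, quantifiers,
                                obs_dim="cell", cpu=1, chunk_size=10):
    """Return the argument list for one `allcools generate-dataset` call."""
    args = [
        "allcools", "generate-dataset",
        "--allc_table", allc_table,
        "--output_path", output_path,
        "--chrom_size_path", chrom_size_path,
        "--obs_dim", obs_dim,
        "--cpu", str(cpu),
        "--chunk_size", str(chunk_size),
    ]
    # One --regions entry per region set: name followed by its definition.
    for name, definition in regions.items():
        args += ["--regions", name, str(definition)]
    # One --quantifiers entry per (name, type, contexts, optional parameters).
    for name, quant_type, contexts, *options in quantifiers:
        args += ["--quantifiers", name, quant_type, ",".join(contexts), *options]
    return args


def as_shell_command(args):
    """Join the argument list into a single shell-quoted command line."""
    return shlex.join(args)


def example3_args():
    """Rebuild the argument list of Example 3."""
    regions = {
        "chrom100k": 100000,
        "chrom5k": 5000,
        "geneslop2k": "gencode.v28lift37.annotation.gene.bed.gz",
        "promoter": "promoter_slop2k.sorted.bed.gz",
        "CGI": "hg19_CGI.sorted.bed.gz",
    }
    scored = ("chrom5k", "promoter", "CGI")
    quantifiers = [(name, "count", ("GCHN", "WCGN")) for name in regions]
    quantifiers += [(name, "hypo-score", ("WCGN",), "cutoff=0.9") for name in scored]
    quantifiers += [(name, "hyper-score", ("GCHN",), "cutoff=0.9") for name in scored]
    return build_generate_dataset_args(
        "allc_table.tsv", "Li2018NCB.mcds", "hg19.main.nochrM.chrom.sizes",
        regions, quantifiers, cpu=20, chunk_size=50,
    )
```

The returned list can be passed to `subprocess.run`, or printed with `as_shell_command` for inspection before running it in a shell.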
Command Doc
$ allcools generate-dataset -h
usage: allcools generate-dataset [-h] --allc_table ALLC_TABLE --output_path
OUTPUT_PATH --chrom_size_path CHROM_SIZE_PATH
[--obs_dim OBS_DIM] [--cpu CPU]
[--chunk_size CHUNK_SIZE] --regions REGIONS
REGIONS --quantifiers QUANTIFIERS
[QUANTIFIERS ...]
optional arguments:
-h, --help show this help message and exit
--obs_dim OBS_DIM Name of the observation dimension. (default: cell)
--cpu CPU Number of processes to use in parallel. (default: 1)
--chunk_size CHUNK_SIZE
Chunk allc_table with chunk_size when generate dataset
in parallel (default: 10)
required arguments:
--allc_table ALLC_TABLE
Contain all the ALLC file information in two tab-
separated columns: 1. file_uid, 2. file_path. No
header (default: None)
--output_path OUTPUT_PATH
Output path of the MCDS dataset (default: None)
--chrom_size_path CHROM_SIZE_PATH
Path to UCSC chrom size file. This can be generated
from the genome fasta or downloaded via UCSC
fetchChromSizes tools. All ALLCools functions will
refer to this file whenever possible to check for
chromosome names and lengths, so it is crucial to use
a chrom size file consistent to the reference fasta
file ever since mapping. ALLCools functions will not
change or infer chromosome names. (default: None)
--regions REGIONS REGIONS
Definition of genomic regions in the form of "--
regions {region_name} {region_definition}". This
parameter can be specified multiple times, to allow
quantification of multiple region sets in the same
MCDS dataset. Several cases are allowed: 1) a integer
number means fix-sized genomic bins, region bed and
region id will be generated automatically based on the
chrom_size_path parameter (e.g., "--regions chrom100k
100000"); 2) a path to a three-column bed file, in
this case, a forth column containing region id in the
form of {region_name}_{i} will be added automatically
(e.g., "--regions gene /path/to/gene_bed_no_id.bed",
where the bed file only has chrom, start, end
columns); 3) a path to a four-column bed file, in this
case, the forth column will be treated as region id
and the region ids must be UNIQUE. (e.g., "--regions
gene /path/to/gene_bed_with_id.bed", where the bed
file has chrom, start, end, id columns). (default:
None)
--quantifiers QUANTIFIERS [QUANTIFIERS ...]
Definition of genome region quantifiers in the form of
"--quantifiers {region_name} {quant_type}
{mc_contexts} {optional_parameter}". The region_name
determines which region set this quantifier applies
to, region_name must be defined by "--regions"
parameter. The quant_type specify which quantifiers,
it must be in ["count", "hypo-score", "hyper-score"].
The mc_contexts specify a comma separated mC context
list, it must be the same size as the ALLC table, and
uses IUPAC base abbreviation. --quantifiers parameter
can be specified multiple times, to allow different
quantification for different region sets, or multiple
quantification for the same region set. Some examples:
1) To quantify raw counts of a region set in mCG and
mCH context: "--quantifiers gene count CGN,CHN" 2) To
quantify the mCG hypo-methylation score of chrom 5Kb
bins: "--quantifiers chrom5k hypo-score CGN
cutoff=0.9", by default, cutoff=0.9, so the last part
is optional. 3) To ALSO quantify the mCG raw counts of
chrom 5Kb bins in the same MCDS, just specify another
quantifiers in the same command: "--quantifiers
chrom5k count CGN", note the count matrix of chrom5k
will be large. Its not usually needed, but you have
the option if needed. (default: None)
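The constraints stated in the help text and in the tutorial can be checked before launching a long run. The sketch below verifies that every quantifier refers to a defined region set, uses one of the three listed quantifier types, and that every region set has at least one quantifier. `validate_quantifiers` is a hypothetical helper written for illustration; it is not part of ALLCools.

```python
VALID_QUANT_TYPES = ("count", "hypo-score", "hyper-score")


def validate_quantifiers(region_names, quantifiers):
    """Return a list of human-readable problems; an empty list means OK.

    `quantifiers` holds tuples of (region_name, quant_type, mc_contexts, ...),
    mirroring the order of values given to --quantifiers.
    """
    errors = []
    for quantifier in quantifiers:
        if len(quantifier) < 3:
            errors.append(f"incomplete quantifier: {quantifier!r}")
            continue
        name, quant_type = quantifier[0], quantifier[1]
        if name not in region_names:
            errors.append(f"region set {name!r} is not defined by --regions")
        if quant_type not in VALID_QUANT_TYPES:
            errors.append(f"unknown quant_type {quant_type!r} for {name!r}")

    # Every region set needs at least one quantifier.
    quantified = {q[0] for q in quantifiers if q}
    for name in region_names:
        if name not in quantified:
            errors.append(f"region set {name!r} has no quantifier")
    return errors
```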