This function counts methylated and unmethylated DNA bases taking into the account average methylation level of the entire sequence read.

generateCytosineReport(
  bam,
  report.file = NULL,
  threshold.reads = TRUE,
  threshold.context = c("CG", "CHG", "CHH", "CxG", "CX"),
  min.context.sites = 2,
  min.context.beta = 0.5,
  max.outofcontext.beta = 0.1,
  report.context = threshold.context,
  ...,
  gzip = FALSE,
  verbose = TRUE
)

Arguments

bam

BAM file location string OR preprocessed output of preprocessBam function. Read more about BAM file requirements and BAM preprocessing at preprocessBam.

report.file

file location string to write the cytosine report. If NULL (the default) then report is returned as a data.table object.

threshold.reads

boolean defining if sequence reads (read pairs) should be thresholded before counting methylated cytosines (default: TRUE). Disabling thresholding makes the report virtually indistinguishable from the ones generated by other software, such as Bismark or Illumina DRAGEN Bio IT Platform. Thresholding is not recommended for long-read sequencing data.

threshold.context

string defining cytosine methylation context used for thresholding the reads:

  • "CG" (the default) --- within-the-context: CpG cytosines (called as zZ), out-of-context: all the other cytosines (hHxX)

  • "CHG" --- within-the-context: CHG cytosines (xX), out-of-context: hHzZ

  • "CHH" --- within-the-context: CHH cytosines (hH), out-of-context: xXzZ

  • "CxG" --- within-the-context: CG and CHG cytosines (zZxX), out-of-context: CHH cytosines (hH)

  • "CX" --- all cytosines are considered within-the-context, this effectively results in no thresholding

This option has no effect when read thresholding is disabled.

min.context.sites

non-negative integer for minimum number of cytosines within the `threshold.context` (default: 2). Reads containing fewer within-the-context cytosines are considered completely unmethylated (all C are counted as T). This option has no effect when read thresholding is disabled.

min.context.beta

real number in the range [0;1] (default: 0.5). Reads with average beta value for within-the-context cytosines below this threshold are considered completely unmethylated (all C are counted as T). This option has no effect when read thresholding is disabled.

max.outofcontext.beta

real number in the range [0;1] (default: 0.1). Reads with average beta value for out-of-context cytosines above this threshold are considered completely unmethylated (all C are counted as T). This option has no effect when read thresholding is disabled.

report.context

string defining cytosine methylation context to report (default: value of `threshold.context`).

...

other parameters to pass to the preprocessBam function. Options have no effect if preprocessed BAM data was supplied as an input.

gzip

boolean to compress the report (default: FALSE).

verbose

boolean to report progress and timings (default: TRUE).

Value

data.table object containing cytosine report in Bismark-like format or NULL if report.file was specified. The report columns are:

  • rname --- reference sequence name (as in BAM)

  • strand --- strand

  • pos --- cytosine position

  • context --- methylation context

  • meth --- number of methylated cytosines

  • unmeth --- number of unmethylated cytosines

Details

The function reports cytosine methylation information using BAM file or data as an input. In contrast to the other currently available software, reads (for paired-end sequencing alignment files - read pairs as a single entity) can be thresholded by their average methylation level before counting methylated bases, effectively resulting in hypermethylated variant epiallele frequency (VEF) being reported instead of beta value. The function's logic is explained below.

Let's suppose we have a BAM file with four reads, all mapped to the "+" strand of chromosome 1, positions 1-16. Assuming the default values for the thresholding parameters (threshold.reads = TRUE, threshold.context = "CG", min.context.sites = 2, min.context.beta = 0.5, max.outofcontext.beta = 0.1), the input and results will look as following:

methylation stringthresholdexplainedmethylation reported
...Z..x+.h..x..h.belowmin.context.sites < 2 (only one zZ base)all cytosines unmethylated
...Z..z.h..x..h.abovepass all criteriaonly C4 (Z at position 4) is methylated
...Z..z.h..X..h.belowmax.outofcontext.beta > 0.1 (1XH / 3xXhH = 0.33)all cytosines unmethylated
...Z..z.h..z-.h.belowmin.context.beta < 0.5 (1Z / 3zZ = 0.33)all cytosines unmethylated

Only the second read will satisfy all of the thresholding criteria, leading to the following CX report (given that all reads map to chr1:+:1-16):

rnamestrandposcontextmethunmeth
chr1+4CG13
chr1+7CG03
chr1+9CHH04
chr1+12CHG03
chr1+15CHH04

With the thresholding disabled (threshold.reads = FALSE) all methylated bases will retain their status, so the CX report will be similar to the reports produced by other methylation callers (such as Bismark or Illumina DRAGEN Bio IT Platform):

rnamestrandposcontextmethunmeth
chr1+4CG40
chr1+7CG03
chr1+9CHH04
chr1+12CHG12
chr1+15CHH04

Other notes:

To produce conventional cytosine reports without thresholding by within-context methylation level though minimally affected by incomplete cytosine conversion, run this method with the following parameters: `threshold.reads=TRUE`, `threshold.context="CG"`, `min.context.sites=0`, `min.context.beta=0`, `max.outofcontext.beta=0.1`. All cytosines within reads (read pairs) having more than 10 cytosines methylated, will be effectively treated as unmethylated ones.

Methylation string bases in unknown context ("uU") are simply ignored, which, to the best of our knowledge, is consistent with the behaviour of other tools.

In order to mitigate the effect of sequencing errors (leading to rare variations in the methylation context, as in reads 1 and 4 above), the context present in more than 50% of the reads is assumed to be correct, while all bases at the same position but having other methylation context are simply ignored. This allows reports to be prepared without using the reference genome sequence.

The downside of not using the reference genome sequence is the inability to determine the actual sequence of triplet for every base in the cytosine report. Therefore this sequence is not reported, and this won't change until such information will be considered as worth adding.

Please also note, that read thresholding by an average methylation level (as explained above) makes little sense for long-read sequencing alignments, as such reads can cover multiple regions with very different DNA methylation properties.

See also

`values` vignette for a comparison and visualisation of epialleleR output values for various input files. `epialleleR` vignette for the description of usage and sample data.

preprocessBam for preloading BAM data, generateBedReport for genomic region-based statistics, generateVcfReport for evaluating epiallele-SNV associations, extractPatterns for exploring methylation patterns, generateBedEcdf for analysing the distribution of per-read beta values.

Examples

  capture.bam <- system.file("extdata", "capture.bam", package="epialleleR")
  
  # CpG report with thresholding
  cg.report <- generateCytosineReport(capture.bam)
#> Checking BAM file: 
#> short-read, paired-end, name-sorted alignment detected
#> Reading paired-end BAM file 
#> [0.012s]
#> Thresholding reads 
#> [0.001s]
#> Preparing cytosine report 
#> [0.013s]
  
  # CX report without thresholding
  cx.report <- generateCytosineReport(capture.bam, threshold.reads=FALSE,
               report.context="CX")
#> Checking BAM file: 
#> short-read, paired-end, name-sorted alignment detected
#> Reading paired-end BAM file 
#> [0.013s]
#> Preparing cytosine report 
#> [0.015s]