Skip to contents

This function reports base frequencies at particular genomic positions and tests their association with the methylation status of the sequencing reads.

Usage

generateVcfReport(
  bam,
  vcf,
  vcf.style = NULL,
  bed = NULL,
  report.file = NULL,
  zero.based.bed = FALSE,
  threshold.reads = TRUE,
  threshold.context = c("CG", "CHG", "CHH", "CxG", "CX"),
  min.context.sites = 2,
  min.context.beta = 0.5,
  max.outofcontext.beta = 0.1,
  ...,
  gzip = FALSE,
  verbose = TRUE
)

Arguments

bam

BAM file location string OR preprocessed output of preprocessBam function. Read more about BAM file requirements and BAM preprocessing at preprocessBam.

vcf

Variant Call Format (VCF) file location string OR a VCF object returned by readVcf function. If VCF object is supplied, the style of its seqlevels must match the style of seqlevels of the BAM file/object used.

vcf.style

string for the seqlevels style of the VCF file, if different from BED file/object. Only has effect when `vcf` parameter points to the VCF file location and `bed` is not NULL. Possible values:

  • NULL (the default) – seqlevels in BED file/object and VCF file are the same

  • "NCBI", "UCSC", ... – valid parameters of seqlevelsStyle function

bed

Browser Extensible Data (BED) file location string OR object of class GRanges holding genomic coordinates for regions of interest. It is used to include only the specific genomic ranges when the VCF file is loaded. This option has no effect when VCF object is supplied as a `vcf` parameter. The style of seqlevels of BED file/object must match the style of seqlevels of the BAM file/object used.

report.file

file location string to write the VCF report. If NULL (the default) then report is returned as a data.table object.

zero.based.bed

boolean defining if BED coordinates are zero based (default: FALSE).

threshold.reads

boolean defining if sequence reads should be thresholded before counting bases in reference and variant epialleles (default: TRUE). Disabling thresholding is possible but makes no sense in the context of this function, because all the reads will be assigned to the variant epiallele, which will result in Fisher's Exact test p-value of 1 (in columns `FEp+` and `FEP-`). As thresholding is not recommended for long-read sequencing data, this function is not recommended for such data either.

threshold.context

string defining cytosine methylation context used for thresholding the reads:

  • "CG" (the default) – within-the-context: CpG cytosines (called as zZ), out-of-context: all the other cytosines (hHxX)

  • "CHG" – within-the-context: CHG cytosines (xX), out-of-context: hHzZ

  • "CHH" – within-the-context: CHH cytosines (hH), out-of-context: xXzZ

  • "CxG" – within-the-context: CG and CHG cytosines (zZxX), out-of-context: CHH cytosines (hH)

  • "CX" – all cytosines are considered within-the-context, this effectively results in no thresholding

This option has no effect when read thresholding is disabled.

min.context.sites

non-negative integer for minimum number of cytosines within the `threshold.context` (default: 2). Reads containing fewer within-the-context cytosines are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled.

min.context.beta

real number in the range [0;1] (default: 0.5). Reads with average beta value for within-the-context cytosines below this threshold are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled.

max.outofcontext.beta

real number in the range [0;1] (default: 0.1). Reads with average beta value for out-of-context cytosines above this threshold are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled.

...

other parameters to pass to the preprocessBam function. Options have no effect if preprocessed BAM data was supplied as an input.

gzip

boolean to compress the report (default: FALSE).

verbose

boolean to report progress and timings (default: TRUE).

Value

data.table object containing VCF report or NULL if report.file was specified. The report columns are:

  • name – variation identifier (e.g. "rs123456789")

  • seqnames – reference sequence name

  • range – genomic coordinates of the variation

  • REF – base at the reference allele

  • ALT – base at the alternative allele

  • [M|U][+|-][Ref|Alt] – number of Reference or Alternative bases that were found at this particular position within Methylated (above threshold) or Unmethylated (below threshold) reads that were mapped to "+" (forward) or "-" (reverse) DNA strand. NA values mean that it is not possible to determine the number of bases due to the bisulfite conversion-related limitations (C->T variants on "+" and G->A on "-" strands)

  • SumRef – sum of all Reference base counts

  • SumAlt – sum of all Alternative base counts

  • FEp+ – Fisher Exact test p-value for association of a variation with methylation status of the reads that map to the "+" (forward) DNA strand. Calculated using following contingency table:

    M+RefM+Alt
    U+RefU+Alt
  • FEp- – Fisher Exact test p-value for association of a variation with methylation status of the reads that map to the "-" (reverse) DNA strand. Calculated using following contingency table:

    M-RefM-Alt
    U-RefU-Alt

Details

Using BAM reads and sequence variation information as an input, `generateVcfReport` function thresholds the reads (for paired-end sequencing alignment files - read pairs as a single entity) according to supplied parameters and calculates the occurrence of Reference and Alternative bases within reads, taking into the account DNA strand the read mapped to and average methylation level (epiallele status) of the read.

The information on sequence variation can be supplied as a Variant Call Format (VCF) file location or an object of class VCF, returned by the readVcf function call. As whole-genome VCF files can be extremely large, it is strongly advised to use only relevant subset of their data, prefiltering the VCF object manually before calling `generateVcfReport` or specifying `bed` parameter when `vcf` points to the location of such large VCF file. Please note that all the BAM, BED and VCF files must use the same style for seqlevels (i.e. chromosome names).

After counting, function checks if certain bases occur more often within reads belonging to certain epialleles using Fisher Exact test (HTSlib's own implementation) and reports separate p-values for reads mapped to "+" (forward) and "-" (reverse) DNA strands.

Please note that the final report currently includes only the VCF entries with single-base REF and ALT alleles. Also, the default (`min.baseq=0`) output of `generateVcfReport` is equivalent to the one of `samtools mplieup -Q 0 ...`, and therefore may result in false SNVs caused by misalignments. Remember to increase `min.baseq` (`samtools mplieup -Q` default value is 13) to obtain higher-quality results.

Read thresholding by an average methylation level used in this function makes little sense for long-read sequencing alignments, as such reads can cover multiple regions with very different DNA methylation properties. Instead, please use extractPatterns, limiting pattern output to the region of interest only.

See also

preprocessBam for preloading BAM data, generateCytosineReport for methylation statistics at the level of individual cytosines, generateBedReport for genomic region-based statistics, extractPatterns for exploring methylation patterns and plotPatterns for pretty plotting of its output, generateBedEcdf for analysing the distribution of per-read beta values, and `epialleleR` vignettes for the description of usage and sample data.

GRanges class for working with genomic ranges, readVcf function for loading VCF data, seqlevelsStyle function for getting or setting the seqlevels style.

Examples

  capture.bam <- system.file("extdata", "capture.bam", package="epialleleR")
  capture.bed <- system.file("extdata", "capture.bed", package="epialleleR")
  capture.vcf <- system.file("extdata", "capture.vcf.gz",
                             package="epialleleR")
  
  # VCF report
  vcf.report <- generateVcfReport(bam=capture.bam, bed=capture.bed,
                                  vcf=capture.vcf)
#> Loading required namespace: VariantAnnotation
#> Reading BED file 
#> [0.027s]
#> Reading VCF file 
#> [4.910s]
#> Checking BAM file: 
#> short-read, paired-end, name-sorted alignment detected
#> Reading paired-end BAM file 
#> [0.012s]
#> Thresholding reads 
#> [0.001s]
#> Extracting base frequences 
#> [0.108s]