This function reports base frequencies at particular genomic positions and tests their association with the methylation status of the sequencing reads.
Usage
generateVcfReport(
bam,
vcf,
vcf.style = NULL,
bed = NULL,
report.file = NULL,
zero.based.bed = FALSE,
threshold.reads = TRUE,
threshold.context = c("CG", "CHG", "CHH", "CxG", "CX"),
min.context.sites = 2,
min.context.beta = 0.5,
max.outofcontext.beta = 0.1,
...,
gzip = FALSE,
verbose = TRUE
)
Arguments
- bam
BAM file location string OR preprocessed output of
preprocessBam
function. Read more about BAM file requirements and BAM preprocessing atpreprocessBam
.- vcf
Variant Call Format (VCF) file location string OR a VCF object returned by
readVcf
function. If VCF object is supplied, the style of its seqlevels must match the style of seqlevels of the BAM file/object used.- vcf.style
string for the seqlevels style of the VCF file, if different from BED file/object. Only has effect when `vcf` parameter points to the VCF file location and `bed` is not NULL. Possible values:
NULL (the default) – seqlevels in BED file/object and VCF file are the same
"NCBI", "UCSC", ... – valid parameters of
seqlevelsStyle
function
- bed
Browser Extensible Data (BED) file location string OR object of class
GRanges
holding genomic coordinates for regions of interest. It is used to include only the specific genomic ranges when the VCF file is loaded. This option has no effect when VCF object is supplied as a `vcf` parameter. The style of seqlevels of BED file/object must match the style of seqlevels of the BAM file/object used.- report.file
file location string to write the VCF report. If NULL (the default) then report is returned as a
data.table
object.- zero.based.bed
boolean defining if BED coordinates are zero based (default: FALSE).
- threshold.reads
boolean defining if sequence reads should be thresholded before counting bases in reference and variant epialleles (default: TRUE). Disabling thresholding is possible but makes no sense in the context of this function, because all the reads will be assigned to the variant epiallele, which will result in Fisher's Exact test p-value of 1 (in columns `FEp+` and `FEP-`). As thresholding is not recommended for long-read sequencing data, this function is not recommended for such data either.
- threshold.context
string defining cytosine methylation context used for thresholding the reads:
"CG" (the default) – within-the-context: CpG cytosines (called as zZ), out-of-context: all the other cytosines (hHxX)
"CHG" – within-the-context: CHG cytosines (xX), out-of-context: hHzZ
"CHH" – within-the-context: CHH cytosines (hH), out-of-context: xXzZ
"CxG" – within-the-context: CG and CHG cytosines (zZxX), out-of-context: CHH cytosines (hH)
"CX" – all cytosines are considered within-the-context, this effectively results in no thresholding
This option has no effect when read thresholding is disabled.
- min.context.sites
non-negative integer for minimum number of cytosines within the `threshold.context` (default: 2). Reads containing fewer within-the-context cytosines are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled.
- min.context.beta
real number in the range [0;1] (default: 0.5). Reads with average beta value for within-the-context cytosines below this threshold are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled.
- max.outofcontext.beta
real number in the range [0;1] (default: 0.1). Reads with average beta value for out-of-context cytosines above this threshold are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled.
- ...
other parameters to pass to the
preprocessBam
function. Options have no effect if preprocessed BAM data was supplied as an input.- gzip
boolean to compress the report (default: FALSE).
- verbose
boolean to report progress and timings (default: TRUE).
Value
data.table
object containing VCF report or
NULL if report.file was specified. The report columns are:
name – variation identifier (e.g. "rs123456789")
seqnames – reference sequence name
range – genomic coordinates of the variation
REF – base at the reference allele
ALT – base at the alternative allele
[M|U][+|-][Ref|Alt] – number of Reference or Alternative bases that were found at this particular position within Methylated (above threshold) or Unmethylated (below threshold) reads that were mapped to "+" (forward) or "-" (reverse) DNA strand. NA values mean that it is not possible to determine the number of bases due to the bisulfite conversion-related limitations (C->T variants on "+" and G->A on "-" strands)
SumRef – sum of all Reference base counts
SumAlt – sum of all Alternative base counts
FEp+ – Fisher Exact test p-value for association of a variation with methylation status of the reads that map to the "+" (forward) DNA strand. Calculated using following contingency table:
M+Ref M+Alt U+Ref U+Alt FEp- – Fisher Exact test p-value for association of a variation with methylation status of the reads that map to the "-" (reverse) DNA strand. Calculated using following contingency table:
M-Ref M-Alt U-Ref U-Alt
Details
Using BAM reads and sequence variation information as an input, `generateVcfReport` function thresholds the reads (for paired-end sequencing alignment files - read pairs as a single entity) according to supplied parameters and calculates the occurrence of Reference and Alternative bases within reads, taking into the account DNA strand the read mapped to and average methylation level (epiallele status) of the read.
The information on sequence variation can be supplied as a Variant Call
Format (VCF) file location or an object of class VCF, returned by the
readVcf
function call. As whole-genome VCF
files can be extremely large, it is strongly advised to use only relevant
subset of their data, prefiltering the VCF object manually before calling
`generateVcfReport` or specifying `bed` parameter when `vcf` points to the
location of such large VCF file. Please note that all the BAM, BED and VCF
files must use the same style for seqlevels (i.e. chromosome names).
After counting, function checks if certain bases occur more often within reads belonging to certain epialleles using Fisher Exact test (HTSlib's own implementation) and reports separate p-values for reads mapped to "+" (forward) and "-" (reverse) DNA strands.
Please note that the final report currently includes only the VCF entries with single-base REF and ALT alleles. Also, the default (`min.baseq=0`) output of `generateVcfReport` is equivalent to the one of `samtools mplieup -Q 0 ...`, and therefore may result in false SNVs caused by misalignments. Remember to increase `min.baseq` (`samtools mplieup -Q` default value is 13) to obtain higher-quality results.
Read thresholding by an average methylation level used in this function
makes little sense for long-read sequencing alignments,
as such reads can cover multiple regions with very different DNA methylation
properties. Instead, please use extractPatterns
,
limiting pattern output to the region of interest only.
See also
preprocessBam
for preloading BAM data,
generateCytosineReport
for methylation statistics at the level
of individual cytosines, generateBedReport
for genomic
region-based statistics, extractPatterns
for exploring
methylation patterns and plotPatterns
for pretty plotting
of its output, generateBedEcdf
for analysing the
distribution of per-read beta values, and `epialleleR` vignettes for the
description of usage and sample data.
GRanges
class for working with genomic ranges,
readVcf
function for loading VCF data,
seqlevelsStyle
function for getting or setting
the seqlevels style.
Examples
capture.bam <- system.file("extdata", "capture.bam", package="epialleleR")
capture.bed <- system.file("extdata", "capture.bed", package="epialleleR")
capture.vcf <- system.file("extdata", "capture.vcf.gz",
package="epialleleR")
# VCF report
vcf.report <- generateVcfReport(bam=capture.bam, bed=capture.bed,
vcf=capture.vcf)
#> Loading required namespace: VariantAnnotation
#> Reading BED file
#> [0.027s]
#> Reading VCF file
#> [4.910s]
#> Checking BAM file:
#> short-read, paired-end, name-sorted alignment detected
#> Reading paired-end BAM file
#> [0.012s]
#> Thresholding reads
#> [0.001s]
#> Extracting base frequences
#> [0.108s]