This function reads and preprocesses (optionally `bgzip`ped) FASTA file with reference sequences.
Details
The function loads and preprocesses reference (genomic) sequences, saving time when methylation calling needs to be performed on multiple BAM files. Currently, reading the data is done by means of HTSlib, therefore it is possible to speed up the loading by means of HTSlib decompression threads when FASTA file is compressed by `bgzip`.
This function is also called internally when file location is supplied
as an input for callMethylation
method.
`preprocessGenome` checks if index file is present, and if not, creates it automatically. It is possible and recommended to use compressed FASTA file as an input, but the file must be compressed by `bgzip` (part of samtools/HTSlib). When FASTA file is compressed, faster loading can be achieved using (typically one) additional HTSlib decompression thread.
During loading, both lowercase and uppercase ACGTN symbols are allowed and correctly recognised, however all the other symbols (e.g., extended IUPAC symbols, MRSVWYHKDB) within sequences are converted to N.
Please also note that for the purpose of methylation calling, the very same
reference genome must be used for both alignment (when BAM is produced) and
calling cytosine methylation by callMethylation
method.
See also
callMethylation
for methylation calling,
and `epialleleR` vignettes for the description of usage and sample data.
Block compression/decompression utility bgzip.
Examples
genome.file <- system.file("extdata", "test", "reference.fasta.gz", package="epialleleR")
genome.data <- preprocessGenome(genome.file)
#> Reading reference genome file
#> [0.001s]