VARIANTS CALLING OPTIMIZATION: PARAMETER SWEEP OF THE GATK BEST PRACTICES PIPELINE
1,2Azza Ahmed, 3Gloria Rendon, 3,4Liudmila Sergeevna Mainzer, 4Victor Jongeneel, 1Faisal M. Fadlelmola
1Centre for Bioinformatics and Systems Biology, Faculty of Science, University of Khartoum, Sudan
2Department of Electrical and Electronic Engineering, Faculty of Engineering, University of Khartoum, Sudan
3Institute for Genomic Biology, University of Illinois Urbana-Champaign, USA
4National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, USA
The advent of massively parallel sequencing technologies (Next Generation Sequencing, NGS) has changed the landscape of human genetics. Whole Exome Sequencing (WES) is the branch of NGS that focuses on the exonic regions of eukaryotic genomes. Exomes are of interest because they help us understand high-penetrance allelic variation and its relationship to phenotype.
Variant calling is the study of differences in genomic sequence between samples of interest and the reference genome, with the purpose of aiding the understanding of disease (or phenotype) mechanisms and ultimately designing optimal treatment targets (i.e. personalized medicine). Typically, this involves many wet lab assays and procedures for preparing the biological samples, followed by intensive computational processing with many software tools. Errors can creep into the analysis at any of these stages.
When variant calling is carried out in large cohorts, errors are exacerbated, and the variants observed at the level of the individual are lost in joint genotyping. Besides experimental wet lab errors, the called variants are subject to biases arising from the choice of software, the configuration of the analysis pipeline, and the individual parameters of each tool used. This talk, also intended as the pilot phase of a collaboration with Mayo Clinic in Florida, USA, provides insights, from a mathematical point of view and supported by experimental results, into the effect of parameter configurations in a variant calling pipeline that follows the GATK best practices, with the objective of identifying optimization targets in such a setup. Computational challenges of running the pipeline in this context are also highlighted.