An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome

Antonio Ribeiro; Agnieszka Golicz; Christine Anne Hackett; Iain Milne; Gordon Stephen; David Marshall; Andrew J. Flavell; Micha Bayer

doi:10.1186/s12859-015-0801-z

Corresponding author: Antonio Ribeiro

Research output: Contribution to journal › Article › peer-review

30 Citations (Scopus)

Abstract

Background: Single Nucleotide Polymorphisms (SNPs) are widely used molecular markers, and their use has increased massively since the inception of Next Generation Sequencing (NGS) technologies, which allow detection of large numbers of SNPs at low cost. However, both NGS data and their analysis are error-prone, which can lead to the generation of false positive (FP) SNPs. We explored the relationship between FP SNPs and seven factors involved in mapping-based variant calling - quality of the reference sequence, read length, choice of mapper and variant caller, mapping stringency and filtering of SNPs by read mapping quality and read depth. This resulted in 576 possible factor level combinations. We used error- and variant-free simulated reads to ensure that every SNP found was indeed a false positive.

Results: The variation in the number of FP SNPs generated ranged from 0 to 36,621 for the 120 million base pairs (Mbp) genome. All of the experimental factors tested had statistically significant effects on the number of FP SNPs generated and there was a considerable amount of interaction between the different factors. Using a fragmented reference sequence led to a dramatic increase in the number of FP SNPs generated, as did relaxed read mapping and a lack of SNP filtering. The choice of reference assembler, mapper and variant caller also significantly affected the outcome. The effect of read length was more complex and suggests a possible interaction between mapping specificity and the potential for contributing more false positives as read length increases.

Conclusions: The choice of tools and parameters involved in variant calling can have a dramatic effect on the number of FP SNPs produced, with particularly poor combinations of software and/or parameter settings yielding tens of thousands in this experiment. Between-factor interactions make simple recommendations difficult for a SNP discovery pipeline but the quality of the reference sequence is clearly of paramount importance. Our findings are also a stark reminder that it can be unwise to use the relaxed mismatch settings provided as defaults by some read mappers when reads are being mapped to a relatively unfinished reference sequence from e.g. a non-model organism in its early stages of genomic exploration.
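As an illustration of the full factorial design described in the abstract, the sketch below enumerates every combination of seven factors. The abstract names the factors but not their levels, so every level name and level count here is a hypothetical placeholder, chosen only so that the product of the counts matches the 576 combinations reported. The helper `fp_per_mbp` is likewise an illustrative name; it simply divides the reported worst-case FP count by the 120 Mbp genome size.

```python
from itertools import product

# Hypothetical factor levels: the abstract lists the seven factors but not
# their levels. These placeholders are assumptions, picked so that
# 4 * 3 * 3 * 2 * 2 * 2 * 2 == 576, the number of combinations reported.
FACTORS = {
    "reference": ["ref_A", "ref_B", "ref_C", "ref_D"],
    "read_length": ["len_1", "len_2", "len_3"],
    "mapper": ["mapper_1", "mapper_2", "mapper_3"],
    "variant_caller": ["caller_1", "caller_2"],
    "mapping_stringency": ["strict", "relaxed"],
    "mapq_filter": [False, True],
    "depth_filter": [False, True],
}


def factor_combinations(factors):
    """Yield one dict per factor-level combination (full factorial design)."""
    names = list(factors)
    for levels in product(*(factors[name] for name in names)):
        yield dict(zip(names, levels))


def fp_per_mbp(fp_count, genome_mbp):
    """Return the number of false positive SNPs per million base pairs."""
    return fp_count / genome_mbp


combos = list(factor_combinations(FACTORS))
print(len(combos))                        # 576
print(round(fp_per_mbp(36_621, 120), 1))  # 305.2
```

Under these assumptions, the worst combination reported (36,621 FP SNPs) corresponds to roughly 305 false positives per Mbp, while the best combinations produced none.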

Original language	English
Article number	382
Number of pages	16
Journal	BMC Bioinformatics
Volume	16
Issue number	1
DOIs	https://doi.org/10.1186/s12859-015-0801-z
Publication status	Published - 11 Nov 2015

Bibliographical note

Funding Information:
This work was supported in part by the Rural & Environment Science & Analytical Services Division of the Scottish Government through an International Studentship from the James Hutton Institute, and in part by the University of Dundee, Scotland, UK.

Keywords

False positive
Mapping stringency
Misassembly
NGS
Read length
Read mismapping
SNP

Access to Document

10.1186/s12859-015-0801-z

Licence: CC BY

Cite this

@article{dc4e9a7064a2487d948ee54f6a3a39f3,
title = "An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome",
keywords = "False positive, Mapping stringency, Misassembly, NGS, Read length, Read mismapping, SNP",
author = "Antonio Ribeiro and Agnieszka Golicz and Hackett, {Christine Anne} and Iain Milne and Gordon Stephen and David Marshall and Flavell, {Andrew J.} and Micha Bayer",
note = "Funding Information: This work was supported in part by the Rural \& Environment Science \& Analytical Services Division of the Scottish Government through an International Studentship from the James Hutton Institute, and in part by the University of Dundee, Scotland, UK.",
year = "2015",
month = nov,
day = "11",
doi = "10.1186/s12859-015-0801-z",
language = "English",
volume = "16",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",
}

TY  - JOUR
T1  - An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome
AU  - Ribeiro, Antonio
AU  - Golicz, Agnieszka
AU  - Hackett, Christine Anne
AU  - Milne, Iain
AU  - Stephen, Gordon
AU  - Marshall, David
AU  - Flavell, Andrew J.
AU  - Bayer, Micha
N1  - Funding Information: This work was supported in part by the Rural & Environment Science & Analytical Services Division of the Scottish Government through an International Studentship from the James Hutton Institute, and in part by the University of Dundee, Scotland, UK.
PY  - 2015/11/11
Y1  - 2015/11/11
KW  - False positive
KW  - Mapping stringency
KW  - Misassembly
KW  - NGS
KW  - Read length
KW  - Read mismapping
KW  - SNP
UR  - http://www.scopus.com/inward/record.url?scp=84947864117&partnerID=8YFLogxK
U2  - 10.1186/s12859-015-0801-z
DO  - 10.1186/s12859-015-0801-z
M3  - Article
C2  - 26558718
AN  - SCOPUS:84947864117
SN  - 1471-2105
VL  - 16
JO  - BMC Bioinformatics
JF  - BMC Bioinformatics
IS  - 1
M1  - 382
ER  -
