De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads

Rhys A. Farrer; Eric Kemen; Jonathan D. G. Jones; David J. Studholme

doi:10.1111/j.1574-6968.2008.01441.x

Rhys A. Farrer, Eric Kemen, Jonathan D. G. Jones, David J. Studholme (Corresponding Author)

Medical Sciences

The Sainsbury Laboratory

Research output: Contribution to journal › Article › peer-review

75 Citations (Scopus)

Abstract

Illumina's Genome Analyzer generates ultra-short sequence reads, typically 36 nucleotides in length, and is primarily intended for resequencing. We tested the potential of this technology for de novo sequence assembly on the 6 Mbp genome of Pseudomonas syringae pv. syringae B728a with several freely available assembly software packages. Using an unpaired data set, velvet assembled >96% of the genome into contigs with an N50 length of 8289 nucleotides and an error rate of 0.33%. EDENA generated smaller contigs (N50 was 4192 nucleotides) and comparable error rates. SSAKE and VCAKE yielded shorter contigs with very high error rates. Assembly of paired-end sequence data carrying 400 bp inserts produced longer contigs (N50 up to 15 628 nucleotides), but with increased error rates (0.5%). Contig length and error rate were very sensitive to the choice of parameter values. Noncoding RNA genes were poorly resolved in de novo assemblies, while >90% of the protein-coding genes were assembled with 100% accuracy over their full length. This study demonstrates that, in practice, de novo assembly of 36-nucleotide reads can generate reasonably accurate assemblies from about 40 x deep sequence data sets. These draft assemblies are useful for exploring an organism's proteomic potential, at a very economic low cost.
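The abstract reports contig N50 lengths and a sequencing depth of about 40 x. As a rough illustration only, the sketch below computes N50 under its usual definition (the contig length at which the length-sorted contigs first cover half of the assembled bases) and estimates how many 36-nucleotide reads a 6 Mbp genome needs at 40 x depth. The abstract does not say how the authors computed these figures, so the function names `n50` and `reads_needed` and the toy contig lengths are assumptions made for this example, not values from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a set of contigs.

    Usual definition (assumed here): sort contigs from longest to shortest and
    return the length of the contig at which the running total first reaches
    half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


def reads_needed(genome_size, read_length, depth):
    """Approximate read count for a mean depth of coverage.

    depth = (number of reads * read length) / genome size, solved for reads.
    """
    return round(depth * genome_size / read_length)


# Hypothetical contig lengths, chosen only to make the arithmetic easy to follow.
toy_contigs = [100, 80, 50, 30, 20, 20]   # total 300, half is 150
print(n50(toy_contigs))                   # 100 + 80 = 180 >= 150, so N50 is 80

# Figures taken from the abstract: 6 Mbp genome, 36-nucleotide reads, ~40 x depth.
print(reads_needed(6_000_000, 36, 40))
```

With the abstract's figures the second call gives roughly 6.7 million reads; this is a back-of-envelope estimate and ignores unusable or low-quality reads.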

Original language	English
Pages (from-to)	103-111
Number of pages	9
Journal	FEMS Microbiology Letters
Volume	291
Issue number	1
DOIs	https://doi.org/10.1111/j.1574-6968.2008.01441.x
Publication status	Published - 1 Feb 2009

Bibliographical note

D.J.S. and J.D.G.J. are supported by the Gatsby Charitable Foundation. R.F. was supported by a scholarship funded by the UK Biotechnology & Biological Sciences Research Council's competitive strategic grant to the John Innes Centre. E.K. is supported by a DFG research fellowship (KE 1509/1-1). The authors are grateful to Jodie Pike and Michael Burrell for excellent technical support, and to Sophien Kamoun, Daniel Zerbino, Daniel MacLean and Naveed Ishaque for useful discussions and comments on the manuscript.

Keywords

Chromosome Mapping/methods
Computational Biology
Genome, Bacterial
Pseudomonas syringae/genetics
Sequence Analysis, DNA
Software

Access to Document

10.1111/j.1574-6968.2008.01441.x

Licence: Unspecified

Cite this

@article{5865e9c1c276460ab91fc97a4fbcc292,

title = "De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads",

abstract = "Illumina's Genome Analyzer generates ultra-short sequence reads, typically 36 nucleotides in length, and is primarily intended for resequencing. We tested the potential of this technology for de novo sequence assembly on the 6 Mbp genome of Pseudomonas syringae pv. syringae B728a with several freely available assembly software packages. Using an unpaired data set, velvet assembled >96% of the genome into contigs with an N50 length of 8289 nucleotides and an error rate of 0.33%. EDENA generated smaller contigs (N50 was 4192 nucleotides) and comparable error rates. SSAKE and VCAKE yielded shorter contigs with very high error rates. Assembly of paired-end sequence data carrying 400 bp inserts produced longer contigs (N50 up to 15 628 nucleotides), but with increased error rates (0.5%). Contig length and error rate were very sensitive to the choice of parameter values. Noncoding RNA genes were poorly resolved in de novo assemblies, while >90% of the protein-coding genes were assembled with 100% accuracy over their full length. This study demonstrates that, in practice, de novo assembly of 36-nucleotide reads can generate reasonably accurate assemblies from about 40 x deep sequence data sets. These draft assemblies are useful for exploring an organism's proteomic potential, at a very economic low cost.",

keywords = "Chromosome Mapping/methods, Computational Biology, Genome, Bacterial, Pseudomonas syringae/genetics, Sequence Analysis, DNA, Software",

author = "Farrer, {Rhys A.} and Kemen, Eric and Jones, {Jonathan D. G.} and Studholme, {David J.}",

note = "D.J.S. and J.D.G.J. are supported by the Gatsby Charitable Foundation. R.F. was supported by a scholarship funded by the UK Biotechnology \& Biological Sciences Research Council's competitive strategic grant to the John Innes Centre. E.K. is supported by a DFG research fellowship (KE 1509/1-1). The authors are grateful to Jodie Pike and Michael Burrell for excellent technical support, and to Sophien Kamoun, Daniel Zerbino, Daniel MacLean and Naveed Ishaque for useful discussions and comments on the manuscript.",

year = "2009",

month = feb,

day = "1",

doi = "10.1111/j.1574-6968.2008.01441.x",

language = "English",

volume = "291",

pages = "103--111",

journal = "FEMS Microbiology Letters",

issn = "0378-1097",

publisher = "Oxford University Press",

number = "1",

}

TY - JOUR

T1 - De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads

AU - Farrer, Rhys A.

AU - Kemen, Eric

AU - Jones, Jonathan D. G.

AU - Studholme, David J.

N1 - D.J.S. and J.D.G.J. are supported by the Gatsby Charitable Foundation. R.F. was supported by a scholarship funded by the UK Biotechnology & Biological Sciences Research Council's competitive strategic grant to the John Innes Centre. E.K. is supported by a DFG research fellowship (KE 1509/1-1). The authors are grateful to Jodie Pike and Michael Burrell for excellent technical support, and to Sophien Kamoun, Daniel Zerbino, Daniel MacLean and Naveed Ishaque for useful discussions and comments on the manuscript.

PY - 2009/2/1

Y1 - 2009/2/1

N2 - Illumina's Genome Analyzer generates ultra-short sequence reads, typically 36 nucleotides in length, and is primarily intended for resequencing. We tested the potential of this technology for de novo sequence assembly on the 6 Mbp genome of Pseudomonas syringae pv. syringae B728a with several freely available assembly software packages. Using an unpaired data set, velvet assembled >96% of the genome into contigs with an N50 length of 8289 nucleotides and an error rate of 0.33%. EDENA generated smaller contigs (N50 was 4192 nucleotides) and comparable error rates. SSAKE and VCAKE yielded shorter contigs with very high error rates. Assembly of paired-end sequence data carrying 400 bp inserts produced longer contigs (N50 up to 15 628 nucleotides), but with increased error rates (0.5%). Contig length and error rate were very sensitive to the choice of parameter values. Noncoding RNA genes were poorly resolved in de novo assemblies, while >90% of the protein-coding genes were assembled with 100% accuracy over their full length. This study demonstrates that, in practice, de novo assembly of 36-nucleotide reads can generate reasonably accurate assemblies from about 40 x deep sequence data sets. These draft assemblies are useful for exploring an organism's proteomic potential, at a very economic low cost.

AB - Illumina's Genome Analyzer generates ultra-short sequence reads, typically 36 nucleotides in length, and is primarily intended for resequencing. We tested the potential of this technology for de novo sequence assembly on the 6 Mbp genome of Pseudomonas syringae pv. syringae B728a with several freely available assembly software packages. Using an unpaired data set, velvet assembled >96% of the genome into contigs with an N50 length of 8289 nucleotides and an error rate of 0.33%. EDENA generated smaller contigs (N50 was 4192 nucleotides) and comparable error rates. SSAKE and VCAKE yielded shorter contigs with very high error rates. Assembly of paired-end sequence data carrying 400 bp inserts produced longer contigs (N50 up to 15 628 nucleotides), but with increased error rates (0.5%). Contig length and error rate were very sensitive to the choice of parameter values. Noncoding RNA genes were poorly resolved in de novo assemblies, while >90% of the protein-coding genes were assembled with 100% accuracy over their full length. This study demonstrates that, in practice, de novo assembly of 36-nucleotide reads can generate reasonably accurate assemblies from about 40 x deep sequence data sets. These draft assemblies are useful for exploring an organism's proteomic potential, at a very economic low cost.

KW - Chromosome Mapping/methods

KW - Computational Biology

KW - Genome, Bacterial

KW - Pseudomonas syringae/genetics

KW - Sequence Analysis, DNA

KW - Software

U2 - 10.1111/j.1574-6968.2008.01441.x

DO - 10.1111/j.1574-6968.2008.01441.x

M3 - Article

C2 - 19077061

SN - 0378-1097

VL - 291

SP - 103

EP - 111

JO - FEMS Microbiology Letters

JF - FEMS Microbiology Letters

IS - 1

ER -
