Compatibility and interoperability evaluation of Next Generation Sequencing Technologies

The purpose of this review is to inform researchers about the compatibility and interoperability of the various Next Generation Sequencing (NGS) technologies, and about the issues they are most likely to face when they try to compare, use, or analyze sequencing results produced by machines from different vendors.

Background

The ongoing revolution in sequencing technology has led to the production of sequencing machines with significantly lower costs and higher throughput than the technology of just 3 years ago. These NGS technologies greatly increase sequencing throughput by laying out millions of short DNA fragments on a single chip and sequencing all these fragments in parallel.

Next Generation Sequencing (NGS) technologies are today commonly called short-read sequencing technologies. Their advantage over the older capillary technology is the cost per experiment, which is significantly lower and should continue to fall as new players join the market. Their disadvantage is that the reads produced are usually very short: 30 - 70 base calls. The only new technology comparable to the older capillary technology in read length is 454 Roche, which can deliver reads of up to 500 bases, but it is also the most expensive of the new platforms, at more than twice the cost of the others.

Each NGS system necessarily produces results containing data specific to its technology, while investigators need to process, annotate and correlate data regardless of which manufacturer's instrument produced it. No current NGS system produces results that are compatible with any other current system.

Currently most sequencing centers utilize short reads by combining them with low-coverage Sanger-based reads for assembly and finishing, and do not take advantage of the peculiarities of the platform-dependent error models.

As many researchers have noted, it is important to be able to compare and combine the results of experiments run on different types of machines. A hybrid assembly not only improves the size, quality and reliability of scaffolds, but also dramatically reduces the cost of assembled genomes. It is also important to apply various analysis tools to the raw data in order to try to extend the length of the produced reads, simply because every base call added to an existing read can make a big difference when the data is assembled. Like a hybrid assembly, a comparative assembly can also be used to improve the results of an experiment. Comparative assembly refers to the assembly of a genome using the sequence of a close relative as a reference, and is frequently referred to as "templated assembly" or "re-sequencing".
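
The idea of comparative ("templated") assembly can be illustrated with a toy sketch in Python. It places each short read at the reference position with the fewest mismatches and then takes a majority vote per position. The function names, the `max_mismatches` parameter and the sample sequences are hypothetical, chosen only for illustration; real comparative assemblers such as AMOScmp or MAQ use indexed alignment and quality-aware models rather than this brute-force scan.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def place_read(reference, read, max_mismatches=1):
    """Return the start position with the fewest mismatches, or None."""
    best_pos, best_dist = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        dist = hamming(reference[pos:pos + len(read)], read)
        if dist < best_dist:  # strict '<' keeps the leftmost best hit
            best_pos, best_dist = pos, dist
    return best_pos

def comparative_consensus(reference, reads, max_mismatches=1):
    """Build a consensus for the sequenced genome, using the reference as a template."""
    votes = [Counter() for _ in reference]  # one vote counter per position
    for read in reads:
        pos = place_read(reference, read, max_mismatches)
        if pos is None:
            continue  # read is too divergent to place on this reference
        for offset, base in enumerate(read):
            votes[pos + offset][base] += 1
    # Positions that no read covers are reported as 'N'.
    return "".join(v.most_common(1)[0][0] if v else "N" for v in votes)

# Hypothetical reads from a relative that differs from the reference at one base.
reference = "ACGTTGCAAGCT"
reads = ["ACGTTT", "GTTTCA", "TCAAGC", "CAAGCT"]
consensus = comparative_consensus(reference, reads)
```

In this sketch the consensus follows the reads rather than the reference wherever they disagree, which is the point of using the relative only as a template for placement.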

Today many new assemblers for NGS technologies are being introduced. A short list follows.

  • ABBA (Assembly Boosted By Amino Acid) - A short-read assembler that uses a reference amino acid sequence to guide the assembly of a gene. Amino acid sequence is more conserved than DNA which allows a more distant relative to be used as the reference.
  • AMOScmp - A comparative assembler that can assemble a set of shotgun reads from an organism by mapping them to the finished sequence of a related organism.
  • AMOScmp-shortReads - A comparative assembler for short reads (Illumina, 454).
  • ABySS (Assembly By Short Sequences) - A parallel assembler for short-read sequence data.
  • ALLPATHS - De novo assembly of whole-genome shotgun microreads.
  • SHARCGS - Short Read assembly algorithm for de novo genomic sequencing.
  • SRMP - A program for aligning short reads against reference sequences.
  • MAQ - Mapping and Assembly with Qualities. It builds assembly by mapping short reads to reference sequences.

These new tools are being developed because existing tools are scarce and cannot efficiently assemble the vast amounts of data generated by large-scale sequencing projects. Unfortunately, not all of these new tools employ the vendor-specific error model; they have to incorporate their own algorithms instead, and lose valuable information in the process.

The technologies being evaluated are:

454 Roche, Illumina Solexa, AB SOLiD, and Complete Genomics, Inc.

454 Roche

The company currently offers the following data sequencing products:

The GS FLX Titanium is capable of producing 400 - 600 million bases per run, and the GS FLX Standard is capable of producing 100 million bases per run. Both instruments use the company's proprietary binary data storage format, SFF (Standard Flowgram Format). The company provides a variety of tools and software for working with this format with every machine purchase.

Although NCBI adopted this format for submissions to the NCBI Trace Archive and the NCBI Short Read Archive, the format is technology oriented and cannot be used without specialized software. For this reason NCBI does not provide data in this format for download. Researchers are therefore left with only one option: to work with the stripped form of the data in FASTQ format, which rules out any further analysis of the raw sequencing data. On the positive side, the SFF format provides quality scores for the sequenced data in the industry-adopted Phred-like form.
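
To make concrete what the stripped FASTQ form contains, here is a minimal Python sketch of a reader for four-line FASTQ records (identifier, base calls, a "+" separator, and a quality string). It assumes Phred scores stored as ASCII characters with an offset of 33, the Sanger convention; the offset, the function names and the sample record are assumptions made for illustration and are not stated in this review.

```python
import io

def read_fastq(handle, offset=33):
    """Yield (read_id, sequence, phred_scores) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record: " + header)
        # Each quality character encodes one Phred score.
        yield header[1:], sequence, [ord(c) - offset for c in quality]

def error_probability(phred):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)

# Hypothetical single-record input.
sample = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(sample))
```

Nothing beyond the base calls and one score per base survives in such a record, which is why flowgram-level analysis is no longer possible once SFF data has been reduced to FASTQ.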

Illumina Solexa Sequencer

The company currently offers the following data sequencing products:

The Genome Analyzer is capable of producing 300 million reads per run. The data output from the machine is text based and usually consists of several file types per lane. Each type of data (sequences, probabilities, signals, noise, etc.) is stored in a vendor-specific form that is not compatible with any other vendor's format. This again leaves researchers with the stripped FASTQ form that NCBI generates when they download the data from the archive; no other option for analyzing the raw sequencing data is available. The quality scores provided by the manufacturer are also in a vendor-specific form and do not comply with the industry-adopted Phred-like representation.
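
The difference between the two quality scales can be shown with a short Python sketch. It assumes the vendor scores are on the Solexa log-odds scale, Q_solexa = -10 log10(p / (1 - p)), stored as ASCII characters with an offset of 64 as in early pipeline versions, whereas Phred scores are defined as Q_phred = -10 log10(p). The function names and the sample quality string are hypothetical.

```python
import math

def solexa_to_phred(q_solexa):
    """Convert a Solexa log-odds score to the Phred scale.

    Eliminating p from the two definitions gives
    Q_phred = 10 * log10(10^(Q_solexa / 10) + 1).
    """
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)

def decode_solexa_quality(quality, offset=64):
    """Decode an ASCII quality string (assumed offset 64) to rounded Phred scores."""
    return [round(solexa_to_phred(ord(c) - offset)) for c in quality]

# Hypothetical quality string: Solexa scores 40, 0 and -5.
phred_scores = decode_solexa_quality("h@;")
```

The two scales nearly coincide for high-quality calls and diverge for poor ones: Solexa scores can be negative, while Phred scores cannot. A tool that reads one scale as the other therefore misjudges exactly the bases that matter most for error modelling.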

AB SOLiD

The company currently offers the following data sequencing products:

The system is capable of producing 400 million reads per run. The data output from the machine is text based and usually consists of two types of files: a sequence file in a vendor-specific color-based format, and a score file that holds Phred-based quality scores but is itself in a proprietary format. The original raw sequencing data is not provided. This again leaves researchers with the stripped FASTQ form that NCBI generates when they download the data from the archive; no other option for analyzing the raw sequencing data is available.
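
The color-based sequence representation can be illustrated with a small Python sketch. It assumes the commonly described two-base encoding, in which the bases A, C, G, T are numbered 0 to 3 and each color is the XOR of the codes of two adjacent bases, and it assumes reads written as a known primer base followed by color digits (for example "T0123"). The function name and the sample reads are hypothetical.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3

def decode_colorspace(read):
    """Translate a color-space read (primer base + color digits) to base space."""
    current = BASES.index(read[0])  # code of the known primer base
    decoded = []
    for i, color in enumerate(read[1:]):
        if color not in "0123":
            # A missing color call makes every later base undeterminable.
            decoded.append("N" * (len(read) - 1 - i))
            break
        current ^= int(color)  # next base code = previous code XOR color
        decoded.append(BASES[current])
    return "".join(decoded)

base_read = decode_colorspace("T0123")
```

Because each base is derived from the one before it, a single wrong color call changes every base after it when the read is translated naively. This is one reason tools written for base-space data cannot simply consume color-space files.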

Complete Genomics, Inc.

Complete Genomics currently does not offer any instruments for purchase. The company provides only sequencing services and uses entirely proprietary approaches to work with the data, so the whole data process is hidden. According to our sources, the system is capable of generating amounts of data comparable to Illumina or AB SOLiD, but the formats of the generated data are closed. A researcher can either develop his or her own tools to analyze the data or use the stripped FASTQ files generated by NCBI.

Conclusions

The NCBI Short Read Archive was created to store the primary data generated by NGS technology and to provide researchers worldwide with a variety of tools and services, including raw data delivery. To date, however, there is no established industry standard for storing and exchanging raw NGS sequencing data. The only candidate is the format developed by NCBI, which is capable of storing any vendor-provided data but has not yet been accepted as an industry-wide standard.

NCBI is therefore unable to provide raw data to researchers, which technically defeats the purpose of the archive. The only form of data available for download is FASTQ, which consists of base calls or color calls plus quality scores. It can be used by current assemblers, but it leaves no room for any further data analysis, and it is being phased out.

The review above shows that current Next Generation Sequencing technologies are not compatible in their data storage formats, and that cross-platform data analysis is not possible without creating new tools that handle each vendor's data individually. The other option, which today's needs dictate, is to create and adopt a commonly recognized format for NGS data. Such a format would allow the creation of software tools capable of working with any vendor's machines, which would be highly beneficial for the industry and for the biological community worldwide.

References

  • Mihai Pop and Steven L. Salzberg. Bioinformatics challenges of new sequencing technology. Trends in Genetics. 2008. 24(3): 142-149
  • Mardis ER. The impact of next generation sequencing technology on genetics. Trends Genet 2008;24:133-141. [PubMed: 18262675]
  • Heng Li, Jue Ruan and Richard Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008. 18. 1851-1858
  • Josephine A. Reinhardt, David A. Baltrus, Marc T. Nishimura, William R. Jeck, Corbin D. Jones and Jeffery L. Dangl. De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae. Genome Res. 2009. 19: 294-305
  • Pop M, Phillippy A, Delcher AL, Salzberg SL. Comparative genome assembly. Brief Bioinform. 2004 Sep 5(3):237-4
  • Mark J. Chaisson and Pavel A. Pevzner. Short read fragment assembly of bacterial genomes. Genome Res. 2008. 18: 324-330
  • Goldberg, S.M., Johnson, J., Busam, D., Feldblyum, T., Ferriera, S., Friedman, R., Halpern, A., Khouri, H., Kravitz, S.A., Lauro, F.M., et al. A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes. Proc. Natl. Acad. Sci. 2006. 103: 11240-11245
  • Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 2008 May;18(5):810-20. Epub 2008 Mar 13.
  • Dohm JC, Lottaz C, Borodina T, Himmelbauer H. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res. 2007 Nov;17(11):1697-706. Epub 2007 Oct 1.