Standardizing data exchange and storage solutions for Next Generation Sequencing Technologies

The purpose of this article is to review existing data exchange and storage solutions that may be suitable for Next Generation Sequencing (NGS) technologies. As mentioned in the Compatibility and interoperability evaluation of NGS Technologies article, the industry does not currently have a reputable and reliable standard for storing and exchanging the sequencing data generated by NGS instruments.

Background

Back in 2006, 454 Life Sciences pioneered Next Generation Sequencing (NGS) technology with the introduction of the 454 sequencer, and other promising sequencing systems were on the way.

At the beginning of 2007, the biological community decided that a new data exchange standard was due. Initial discussions started that year, with a common desire to complete the specification for the new standard within a short period of time. The Short Read Format (SRF) was aimed at standardizing data exchange between sequencing centers and NCBI, and NCBI initially endorsed the new standard. It was also expected that vendors of new NGS sequencing machines would use the format for their output data, so that the data could be submitted to NCBI without any modification. Unfortunately, due to a lack of engineering support, the new format was still in the development stage when new NGS machines, this time from Illumina, started to rush the market.

Overall, the SRF format was not thoroughly thought out, and it turned out that it did not generalize storage. It was a simple container holding a set of vendor-specific data, and processing that data required implementing a vendor-specific parsing module. No data virtualization was provided by the format.

NCBI, on the other hand, having an experienced and knowledgeable engineering crew of employees and consultants, arrived at a smart solution that supports any type of NGS data. The solution was initially developed for the NCBI Trace Archive and has proven to be compact, reliable, and flexible as a data storage and data transfer solution. Later, with some modifications, it was adopted by the NCBI Short Read Archive. This same solution can be used for the exchange of data between sequencing centers as well as for data downloads from the Archive. Furthermore, this approach might have quite a beneficial impact on the industry because it constitutes a well-developed and well-thought-out implementation. The solution is described in greater detail in the Short Read Archive article.

The main idea of this solution is to store and provide the data on demand, with the ability to distribute any part of it across multiple disk volumes or across the network when necessary.

To contain or not to contain

The computer industry has developed a vast number of data containers: tar, cpio, zip, rpm, hdf, etc. Some serve a particular purpose and some are generic. Perhaps the most recognizable generic format is USTAR, or simply 'tar', which was initially developed to transfer files onto tape devices and is still widely used due to its compatibility across computer systems.
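
To illustrate the container model, here is a minimal sketch using Python's standard tarfile module; the archive name and member path are hypothetical. Because tar keeps no central index, locating a single member means walking the member headers from the beginning of the archive, so the whole file must already have been transferred (or be streamable from its start):

    import tarfile

    # Hypothetical archive and member names, for illustration only.
    with tarfile.open("run42.tar") as archive:
        # tar has no central index: getmember() walks the member headers
        # sequentially from the start of the archive, so the entire file
        # must be present locally or streamable from its beginning.
        member = archive.getmember("run42/reads/lane_3.dat")
        data = archive.extractfile(member).read()

    print(len(data), "bytes actually wanted, out of the whole archive")

The same limitation applies to cpio and USTAR streams in general; zip at least carries a central directory, but it still assumes a single monolithic file.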

The major problem with the containers is that they are simply containers, i.e.:

  • the data must be transferred entirely from point A to point B;
  • while parts of the data within the container are valuable to some users, to others those parts are just junk, a network overhead, and a waste of disk space;
  • the design of the containers is borrowed from the 1980s, and its usage is not effective in modern computing topologies;
  • an end user needs only the data essential for his or her research.

Containers are therefore an ancient technology, and using them in modern computing would be the wrong decision.

Having the data distributed across the network, with only small, necessary parts being transferred, is what modern technology dictates today. A solution with the data distributed across multiple data centers is therefore preferable, as it is capable of delivering the desired amount of data to the end user from different locations and can provide faster access to the details. The BitTorrent model is a good example. This is especially true for NGS. The amount of data generated by modern NGS technologies is massive, and the industry is working hard to make these amounts even greater. As of May 2009, the NCBI Trace Archive occupies approximately 100 TB of space, and the data accumulation trend has been exponential for the last few years. Although NCBI will most likely find the resources to deal with the increasing amounts of data, the ability to serve the data to end users may become a huge burden, considering that the number of users will be growing as well. This means that in the near future there will be delays in data transfers from the Archives.

With all of this in mind, and taking into account that new NGS technologies may establish new data types, NCBI has developed a solution that can store virtually any type of data across multiple volumes as well as across the network. Incidentally, this solution can also be applied to network data transfers. It deliberately breaks the data into small chunks and stores individual data types separately. Therefore, if an end user requests a particular type of data, such as a set of base call reads for a particular region, the actual data transfer is minimized to the requested subset only.
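
The following sketch illustrates that idea in Python. It is not NCBI's actual implementation; the class name, chunk size, and column names are all hypothetical. Each data type is kept as its own series of compressed chunks, so serving a slice of base calls never touches the quality data, and any individual chunk could live on a different volume or host:

    import zlib

    CHUNK = 65536  # records per chunk; an arbitrary, illustrative value

    class ChunkStore:
        """Each data type (column) is stored as its own series of
        compressed chunks, so a request only touches the chunks and
        columns it actually needs."""

        def __init__(self):
            self.columns = {}  # column name -> list of compressed chunks

        def add_column(self, name, records):
            chunks = self.columns[name] = []
            for i in range(0, len(records), CHUNK):
                blob = "\n".join(records[i:i + CHUNK]).encode()
                chunks.append(zlib.compress(blob))

        def fetch(self, name, first, last):
            """Return records [first, last) of one column, decompressing
            only the chunks that overlap the requested range."""
            out = []
            chunks = self.columns[name]
            for idx in range(first // CHUNK, (last - 1) // CHUNK + 1):
                if idx >= len(chunks):
                    break
                base = idx * CHUNK
                records = zlib.decompress(chunks[idx]).decode().split("\n")
                out.extend(records[max(first - base, 0):last - base])
            return out

    # Base calls and qualities live in separate columns: slicing one
    # column never transfers or decompresses the other.
    store = ChunkStore()
    store.add_column("basecall", ["ACGT"] * 200000)
    store.add_column("quality", ["IIII"] * 200000)
    reads = store.fetch("basecall", 130000, 130010)

The same layout serves network transfers equally well: the request for records 130,000 to 130,010 of one column above maps to exactly one compressed chunk, which is all that needs to cross the wire.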

Conclusions

NCBI has agreed to make the solution publicly available. It can be used as is or modified to a user's needs, although such modifications may not be incorporated into the final NCBI toolkit. It is our understanding that, for now, there is no other technically suitable and workable solution available on the market.

References