Enhancing data transfers for Next Generation Sequencing: giving non-privileged users speed-optimized access to network-stored databases
The purpose of this article is to review existing data transfer protocols and solutions that might be suitable for Next Generation Sequencing (NGS) technologies.
Previous generations of sequencing technology, such as Sanger sequencing, included gel and capillary sequencers from a single vendor, ABI. They were characterized by high cost and low yield, generating a manageable amount and rate of data (from 2 to 40 runs per day, representing between 100 Kb and 2 Mb) and requiring a large staff to operate. The systems were available to a small number of well-funded sequencing centers. Since the data came from a single source, it was generated in a common representation and was simple to exchange.
Next Generation Sequencing technologies have dramatically changed the situation. Several vendors have emerged with systems offering much higher production rates and lower operating costs, bringing the cost per Mb down 1000-fold. The lower barrier to entry has increased the number of potential researchers in inverse proportion, attracting centers with less substantial budgets.
As mentioned above, the new technologies can deliver tremendous amounts of data in a much shorter period of time, and an efficient solution for data exchange between researchers will soon define the overall success of a study.
Background
With optical networks limited only by the speed of light, 10GbE switches, and multiple hyperthreaded processors inside our computers, it is nevertheless a verifiable fact that today's applications have trouble delivering the needed throughput, even as network bandwidth grows and hardware increases in capacity and power. We are particularly interested in improving bulk transfers over high-bandwidth, high-latency networks because of our involvement in storage and in the transfer of data for cutting-edge scientific applications. It is important to remember, however, that aggressive tools such as the ones described in this review are intended for high-speed dedicated links, or for links over which quality of service is available.
Protocols
Only two transport protocols in IP networks are currently suitable for data transfers: TCP and UDP. In this article we review a subset of application protocols built on these two that can be applied to data exchange for NGS technologies.
The following network transfer application protocols will be reviewed: FTP, SFTP, HTTP, BT, Slurpie, SABUL, and TSUNAMI. FTP, SFTP, HTTP, BT, and Slurpie use the TCP/IP stack, while SABUL and TSUNAMI use the UDP/IP stack. The NETBLT/IP protocol is reviewed for historical reasons, since it was originally designed for bulk data transfers.
FTP
The FTP protocol was one of the first efforts to create a standard means of exchanging files over a TCP network; FTP has been around since the 1970s. It was designed with as much flexibility as possible, so it could be used over networks other than TCP, and was engineered to be capable of exchanging files with a broad variety of machines. The protocol remains the most widely used for file transfers across networks. The base specification is RFC 959, dated October 1985. There are additional RFCs relating to FTP, but even as of this writing (July 2009) most of the newer additions are not in widespread use. The purpose of this section is to provide general information about how the protocol works without getting into too many technical details; RFC 959 should be consulted for the details.
The protocol can be thought of as interactive, because clients and servers actually have a conversation where they authenticate themselves and negotiate file transfers. In addition, the protocol specifies that the client and server do not exchange data on the conversation channel. Instead, clients and servers negotiate how to send data files on separate connections, with one connection for each data transfer. Note that a directory listing is considered a file transfer.
The protocol has built-in support for different types of data transfers. The two mandated types are ASCII for text (specified by the client sending "TYPE A" to the server), and "image" for binary data (specified by "TYPE I"). Binary transfers can be used for any type of raw data that requires no translation. Client programs should use binary transfers unless they know that the file in question is text.
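As an illustration, the whole exchange (control connection, login, TYPE selection, and a separate data connection per transfer) can be driven from Python's standard ftplib. This is a minimal sketch only; the host, credentials, and file path are placeholders, not taken from any real archive.

    from ftplib import FTP

    # Connect on the control channel (port 21) and authenticate.
    ftp = FTP("ftp.example.org")              # placeholder host
    ftp.login("anonymous", "guest@")          # credentials travel in clear text

    # Request binary ("image") mode, then fetch the file; RETR opens a
    # separate data connection just for this transfer.
    ftp.voidcmd("TYPE I")
    with open("reads.fastq", "wb") as out:
        ftp.retrbinary("RETR /pub/reads.fastq", out.write)
    ftp.quit()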
It is important to note that the base specification, as implemented by the vast majority of the world's FTP servers, does not have any special handling for encrypted communication of any kind. When clients log in to FTP servers, they send clear-text user names and passwords. This means that anyone with a packet sniffer between the client and server could surreptitiously steal passwords, as well as the contents of the data transfers themselves. There have been proposals to make the FTP protocol more secure, but these have not seen widespread adoption.
SFTP (SSH File Transfer Protocol)
This protocol provides secure file transfer (and, more generally, file system access). It is designed so that it could be used to implement a secure remote file system service as well as a secure file transfer service. Like FTP, SFTP utilizes the TCP/IP stack for file transfers and commands. The protocol assumes that it runs over a secure channel, that the server has already authenticated the client, and that the identity of the client user is available to the protocol.
In general, this protocol follows a simple request-response model. Each request and response contains a sequence number and multiple requests may be pending simultaneously. There are a relatively large number of different request messages, but a small number of possible response messages.
The SFTP protocol supports resource URLs starting with sftp:// and containing a hostname, an optional port number and a resource path. For example, sftp://sftp.valexllc.com/public/myfile.txt. The resource URL can be specified either as the request target resource name or using the URL protocol configuration parameter.
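For comparison with the FTP example above, here is a minimal sketch of an SFTP download using the third-party paramiko library (the choice of library, the credentials, and the password-based login are assumptions; the protocol itself does not prescribe them). The sequence numbering of requests and responses is handled internally by the library.

    import paramiko

    # Open the secure channel; SFTP assumes the SSH layer has already
    # authenticated the client by the time file requests are issued.
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect("sftp.valexllc.com", username="user", password="secret")

    # Each get() is translated into numbered request/response messages.
    sftp = client.open_sftp()
    sftp.get("/public/myfile.txt", "myfile.txt")
    sftp.close()
    client.close()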
HTTP
HTTP is the de facto standard for transferring World Wide Web documents, although it is designed to be extensible to almost any document format. HTTP version 1.1 is documented in RFC 2616 (which obsoleted RFC 2068); version 1.0 (deprecated) is documented in RFC 1945. HTTP operates over TCP connections, usually to port 80, though this can be overridden and another port used. After a successful connection, the client transmits a request message to the server, which sends a reply message back. HTTP messages are human-readable, and an HTTP server can be operated manually with a command such as 'telnet servername 80'.
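The telnet exercise above can also be reproduced programmatically; the sketch below hand-writes an HTTP/1.1 GET over a plain TCP socket (the host name is a placeholder).

    import socket

    # Connect to port 80 and send a minimal, human-readable request.
    s = socket.create_connection(("www.example.org", 80))   # placeholder host
    s.sendall(b"GET / HTTP/1.1\r\n"
              b"Host: www.example.org\r\n"
              b"Connection: close\r\n"
              b"\r\n")

    # Read the reply until the server closes the connection.
    reply = b""
    while chunk := s.recv(4096):
        reply += chunk
    s.close()
    print(reply.decode("iso-8859-1", "replace")[:500])  # status line and headers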
HTTP is an application-level protocol for distributed, collaborative, hypermedia information systems. It is a generic, stateless protocol which can be used for many tasks beyond hypertext, such as name servers and distributed object management systems, through extension of its request methods, error codes, and headers. A feature of HTTP is the typing and negotiation of data representation, allowing systems to be built independently of the data being transferred.
Practical information systems require more functionality than simple retrieval, including search, front-end update, and annotation. HTTP allows an open-ended set of methods and headers that indicate the purpose of a request. It builds on the discipline of reference provided by the URI, as a location (URL) or a name (URN), for indicating the resource to which a method is to be applied. Messages are passed in a format similar to that used by Internet mail, as defined by MIME.
HTTP is also used as a generic protocol for communication between user agents and proxies/gateways to other Internet systems, including those supported by SMTP, NNTP, and FTP. In this way, HTTP allows basic hypermedia access to resources available from diverse applications.
BT (BitTorrent)
BT is a protocol for distributing files. It identifies content by URL and is designed to integrate seamlessly with the web. Its advantage over plain HTTP is that when multiple downloads of the same file happen concurrently, the downloaders upload to each other, making it possible for the file source to support very large numbers of downloaders with only a modest increase in its load. The protocol was initially based on TCP packets, although there appears to be no inherent need for this; the newer uTP protocol is based on UDP packets with application-level congestion control.
The BT protocol allows users to receive large amounts of data without putting the level of strain on their computers that would be needed for standard Internet hosting. A standard host's servers can easily be brought to a halt if extreme levels of simultaneous data flow are reached. The protocol works as an alternative data distribution method that makes even small computers with low bandwidth capable of participating in large data transfers.
A novel feature of BT is connection choking. Peer A will stop sending blocks to peer B (this is called 'choking' the connection) until peer B sends A a block, or a timeout occurs. Choking encourages cooperation and implicitly rate-limits the data going out of a loaded peer. It is assumed that a BT client was started a priori on the web server, and that this client stays in the system indefinitely, serving the file. The web server itself serves a file with a '.torrent' extension, which contains both a set of hashes for the file's contents and a URL for the tracker.
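The choking rule can be captured in a few lines. The sketch below is a toy model of the behavior just described, not BitTorrent's actual implementation; the 60-second timeout is an illustrative value.

    import time

    class Connection:
        CHOKE_TIMEOUT = 60.0          # illustrative, not from the BT spec

        def __init__(self):
            self.choked_since = None  # None means the peer is unchoked

        def choke(self):
            self.choked_since = time.monotonic()

        def on_block_received(self):
            self.choked_since = None  # reciprocation unchokes immediately

        def can_send(self):
            if self.choked_since is None:
                return True
            if time.monotonic() - self.choked_since > self.CHOKE_TIMEOUT:
                self.choked_since = None   # the timeout also ends the choke
                return True
            return False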
Slurpie: A Cooperative Bulk Data Transfer Protocol
Slurpie is a peer-to-peer protocol for bulk data transfers. It is specifically designed to reduce client download times for large, popular files and to reduce the load on servers that serve these files. Slurpie employs a novel adaptive downloading strategy to increase client performance and a randomized 'backoff' strategy to precisely control load on the server. Slurpie is similar to the BT protocol; its authors claim that users are not required to persist in the system after they finish downloading their file, although they admit that such departures decrease overall system performance. Compared to Slurpie, BitTorrent does not adapt to varying bandwidth conditions or scale its number of neighbors as the group size increases.
Slurpie is also designed for bulk data transfer and downloads blocks in a random order, while a number of multicast protocols are optimized for streaming. Compared to Slurpie, most multicast protocols are much more careful about creating a topology that approximates a shortest-path tree (or some other good topological property). The Slurpie topology is essentially ad hoc, and data transfer links are added and kept only for transferring a few blocks. Slurpie provides complete reliability, while reliable multicast, for the most part, still has many difficult open research issues.
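The backoff idea can be sketched as a randomized gate in front of the server: each peer goes to the origin server only with a probability chosen so that, in expectation, a bounded number of peers hit it at once. The exact Slurpie rule is given in the paper; the parameters below (k and the retry delay) are illustrative assumptions.

    import random
    import time

    def should_hit_server(n_peers, k=4):
        # In expectation, roughly k of the n_peers contact the server.
        return random.random() < k / max(n_peers, k)

    def get_block(n_peers, fetch_from_peer, fetch_from_server):
        while True:
            if should_hit_server(n_peers):
                return fetch_from_server()
            block = fetch_from_peer()              # prefer the peer mesh
            if block is not None:
                return block
            time.sleep(random.uniform(0.1, 1.0))   # randomized retry delay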
NETBLT (NETwork BLock Transfer)
NETBLT is a transport-level protocol intended for the rapid transfer of a large quantity of data between computers. It provides a transfer that is reliable and flow-controlled, and it is designed to provide maximum throughput over a wide variety of networks. Although NETBLT currently runs on top of IP, it should be able to operate on top of any datagram protocol similar in function to IP.
The protocol works by opening a connection between two "clients" (the "sender" and the "receiver"), transferring the data in a series of large data aggregates called "buffers", and then closing the connection. Because the amount of data to be transferred can be very large, the client is not required to provide all the data to the protocol module at once. Instead, the data is provided by the client in buffers. The NETBLT layer transfers each buffer as a sequence of packets; since each buffer is composed of a large number of packets, the per-buffer interaction between NETBLT and its client is far more efficient than a per-packet interaction would be.
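The buffer/packet decomposition looks roughly like the following sketch: one client call per buffer, many packets per call. The sizes are illustrative assumptions, not values from RFC 998.

    BUFFER_SIZE = 1 << 20   # 1 MiB buffer handed over by the client (assumed)
    PACKET_SIZE = 1432      # assumed data bytes per packet

    def packets_for_buffer(buffer_data, buffer_number):
        """Yield (buffer number, packet number, payload) for one buffer."""
        for offset in range(0, len(buffer_data), PACKET_SIZE):
            yield (buffer_number,
                   offset // PACKET_SIZE,
                   buffer_data[offset:offset + PACKET_SIZE])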
The NETBLT protocol is described in RFC 998 and was proposed in the 1985-87 timeframe as a transport-level protocol; it was assigned the official protocol number 30. NETBLT is included in this background because at least two of the newer UDP protocols have acknowledged their debt to its design, and the design of a third is clearly based on NETBLT ideas. However, resistance to new kernel-level protocols, plus the lengthy approval process, seems to have influenced the authors of the new UDP protocols to implement their designs at the application level.
NETBLT was designed specifically for high-bandwidth, high-latency networks, including satellite channels. NETBLT differs from TCP in that it uses a rate-based flow control scheme rather than TCP's window-based flow control. The rate control parameters are negotiated during connection initialization and periodically throughout the connection. The sender uses timers rather than ACKs to maintain the negotiated rate. Since the overhead of timing mechanisms on a per-packet basis can lower performance, NETBLT's rate control consists of a burst size and a burst rate, with burst_size/burst_rate equal to the average transmission time per packet. Both size and rate should be based on a combination of the capacities of the end points as well as those of the intermediate gateways and networks. NETBLT separates error control from flow control so that losses and retransmissions do not affect the flow rate. NETBLT uses a system of timers to ensure reliable delivery of control messages, and both sender and receiver send and receive control messages.
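The timer-driven transmission can be sketched as follows: packets go out in bursts of burst_size, one burst every 1/burst_rate seconds, with the timer rather than returning ACKs maintaining the negotiated rate. The parameter values are illustrative.

    import time

    def paced_send(packets, send, burst_size=8, burst_rate=100.0):
        """Send one burst of burst_size packets every 1/burst_rate seconds."""
        interval = 1.0 / burst_rate
        for i in range(0, len(packets), burst_size):
            start = time.monotonic()
            for pkt in packets[i:i + burst_size]:
                send(pkt)
            # Sleep out the rest of the burst interval; the timer, not an
            # ACK clock, is what holds the sender to the negotiated rate.
            remaining = interval - (time.monotonic() - start)
            if remaining > 0:
                time.sleep(remaining)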
SABUL (Simple Available Bandwidth Utilization Library)
The SABUL protocol has demonstrated efficiency and fairness in both experimental and practical applications. SABUL is a lightweight, reliable, application-level protocol. It uses UDP to transfer data and TCP to feed back control messages. The protocol uses rate-based congestion control that tunes the inter-packet time; this algorithm is proven to be TCP-friendly. In addition, the protocol also specifies transparent memory-copy avoidance.
SABUL uses two connections: the control connection over TCP and the data connection over UDP. A SABUL connection is uni-directional: data can only be sent from one side to the other. The two sides are accordingly called the sender and the receiver. The sender initializes the connection, waits for the receiver to connect to it, and then constructs the control connection; the data connection is built up following a successful control connection. Data flows from the sender to the receiver only, and control information flows from the receiver to the sender only, so the control information is also called feedback. The sender manages the application buffer and is responsible for its transmission, retransmission, and release according to the feedback from the receiver; it maintains a queue recording the lost packets. The receiver reorders the packets according to sequence number and puts them into its own buffer or the application buffer. A flag array is kept for packet reordering and loss detection, and the sequence numbers of lost packets are kept in a loss list. The implementation is similar in concept to TSUNAMI's in that both keep the delay between blocks/packets between an upper and a lower limit.
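Structurally, the two connections look like the sketch below: the sender listens, the receiver connects over TCP to establish the feedback channel, and sequence-numbered data then flows over UDP. The port numbers, packet layout, and feedback handling are assumptions; the real SABUL wire format differs.

    import socket
    import struct

    DATA_PORT, CTRL_PORT = 9000, 9001   # assumed port numbers

    def sender(payloads):
        # The sender waits for the receiver to establish the TCP
        # control connection, as described above.
        srv = socket.socket()
        srv.bind(("", CTRL_PORT))
        srv.listen(1)
        ctrl, (receiver_host, _) = srv.accept()

        # Sequence-numbered data packets go out over UDP.
        data = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        for seq, payload in enumerate(payloads):
            data.sendto(struct.pack("!I", seq) + payload,
                        (receiver_host, DATA_PORT))
        # Loss lists read back from `ctrl` would drive retransmissions here.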
TSUNAMI
TSUNAMI is designed for faster transfer of large files over high-speed networks than appears possible with standard implementations of TCP. Tsunami is an application-level protocol that features rate control via adjustment of inter-packet delay rather than a sliding-window mechanism. Data blocks are transferred via UDP and control data are transferred via TCP.
During a file transfer, the client has two running threads. The network thread handles all network communication, maintains the retransmission queue, and places blocks that are ready for disk storage into a ring buffer. The disk thread simply moves blocks from the ring buffer to the destination file on disk. The server creates a single thread in response to each client connection that handles all disk and network activity. The client initiates a Tsunami session by connecting to the TCP port of the server. Upon connection, the server sends a small block of random data to the client. The client then XORs this random data with a shared secret, calculates an MD5 checksum, and transmits the result to the server. The server performs the same operation on the random data and verifies that the results are identical; this establishes client authentication. After exchanging protocol revision codes, the client sends the name of the requested file to the server. If the server indicates that the file is available, the client sends its desired block size, target transfer rate, error threshold, and inter-packet delay scaling factors. The server responds with the length of the file, the agreed-upon block size, the number of blocks, and a timestamp. The client then creates a UDP port and transmits the port number.
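The authentication step is simple enough to sketch directly: XOR the server's random challenge with the shared secret and MD5 the result. The exact byte layout should be taken from the Tsunami sources; the secret below is a placeholder.

    import hashlib
    import os

    def auth_response(challenge: bytes, secret: bytes) -> bytes:
        # XOR the challenge with the (repeated) shared secret, then hash.
        mixed = bytes(b ^ secret[i % len(secret)]
                      for i, b in enumerate(challenge))
        return hashlib.md5(mixed).digest()

    # Server side: issue a challenge, then compare against its own result.
    challenge = os.urandom(64)
    secret = b"shared-secret"                       # placeholder value
    assert auth_response(challenge, secret) == auth_response(challenge, secret)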
TSUNAMI gives the user the ability to initialize many parameters, including the UDP buffer size, the tolerated error rate, the sending rate, and slowdown/speedup factors, with the 'set' command. If the user does not set the sending rate, however, it starts out at 1000 Mbps with a default tolerated loss rate of 7.9%. Since a block of file data (default 32768 bytes) is read and handed to UDP/IP, the rate control is actually implemented per block rather than per packet. The receiver uses the combination of the number of packets received (a multiple of 50) and a timed interval (>350 ms) since the last update to determine when to send a REQUEST_ERROR_RATE packet containing a smoothed error rate. If the error rate is greater than the maximum tolerated rate, the sending rate is decreased; if it is less, the sending rate is increased.
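The feedback loop reduces to a small adjustment rule: back off when the smoothed error rate exceeds the tolerated maximum, speed up otherwise. The slowdown/speedup factors below stand in for the protocol's configurable parameters and are illustrative values.

    def adjust_rate(rate_mbps, smoothed_error, max_error=0.079,
                    slowdown=0.9, speedup=1.05, ceiling=1000.0):
        # Called on each REQUEST_ERROR_RATE report from the receiver.
        if smoothed_error > max_error:
            rate_mbps *= slowdown     # losing too much: decrease the rate
        else:
            rate_mbps *= speedup      # under the threshold: increase it
        return min(rate_mbps, ceiling)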
Conclusions
References
- OSI Connectionless Transport Services on top of UDP
- File Transfer Protocol (FTP)
- Hypertext Transfer Protocol (HTTP/1.0)
- Hypertext Transfer Protocol (HTTP/1.1)
- Uniform Resource Identifiers (URI)
- Uniform Resource Locators (URL)
- URN Syntax
- The Secure Shell (SSH) Transport Layer Protocol
- Uniform Resource Names (URN) Namespaces
- NETBLT: A Bulk Data Transfer Protocol
- On Testing the NETBLT Protocol over Divers Networks
- NETBLT: A High Throughput Transport Protocol
- A Reliable Transport Protocol for the Tactical Internet
- SABUL: A High Performance Data Transfer Protocol
- Rate Based Congestion Control over High Bandwidth/Delay Links
- Performance analysis of high-performance file transfer systems for Grid applications
- Theoretical and Experimental Analysis of the SABUL Congestion Control Algorithm
- Tsunami: A High-Speed Rate-Control Protocol for File Transfer
- Slurpie: A Cooperative Bulk Data Transfer Protocol
- Wikipedia