RELIABLE ARRAY OF INDEPENDENT NODES

The Internet is changing the way that people manage and access information. In the last five years, the amount of traffic on the Internet has been growing at an exponential rate. The World Wide Web has evolved from a hobbyists' toy to become one of the dominating media of our society. Ecommerce has grown past adolescence and multimedia content has come of age. Communication, computation and storage are converging to reshape the lives of everyone. Looking forward, this growth will continue for some time. The question is: what can we do to scale the Internet infrastructure to meet this growth?

There are four trends in the current growth of the Internet:

1. Internet clients are becoming more numerous and varied. In addition to the ever-increasing number of PCs in offices and homes, there are new types of clients, such as mobile data units, (cell phones, PDAs, etc.) and home Internet appliances (set-top boxes, game consoles, etc.) In the next five years, these new types of Internet devices will pervade the Internet landscape.

2. To support these new clients, new types of networks are being designed and implemented. Examples are wireless data networks, broadband networks and voice-over-IP networks. Technologies are being developed to connect these new networks with the existing Internet backbone.

3. The content delivered over the Internet is evolving, partly because of the emergence of the new clients and new networks. There will be a growing presence of multimedia content, such as video, voice, music and gaming streams. The growth in content adds not only to the volume of the traffic, but also to the computation complexity in transporting and processing the traffic, thus accelerating the convergence between communication and computation.

4. New Internet applications emerge, both on the server side and the client side. As the Internet penetrates deeper and deeper into everyone's life, the demand for security, reliability, convenience and performance sky-rockets. With the popularity of cars comes the invention of traffic lights and stop signs, the gas station and the drive-thru. As Internet makes its way into daily lives, the demand will grow for firewalls and VPNs, intrusion detection and virus scanning, server load balancing and content management, quality of service and billing/reporting applications. The list goes on, and will keep expanding.

1.2 Reliability and Performance

The primary function of the Internet is for information to flow from where it is stored, traditionally known as a server, to where it is requested, commonly referred to as a client. The Internet is the network that interconnects all clients and servers to allow information to flow in an orderly way. While people become more and more dependent on this network, they demand that it become faster and more reliable. As a result, reliability and performance are becoming key challenges in many parts of the Internet infrastructure.

The communication path between a client and a server can be viewed as a chain. Each device along the path between the client and the server is a link in the chain. For example, for a user to receive a HTML page from yahoo.com, he or she would issue a request, which travels from the user's client, through a number of routers and firewalls and other devices to reach the Yahoo web server, before the data will return along the same or a similar chain. The strength of this chain, both in terms of throughput and reliability, will determine the user experience of the Internet. So, how do we make this chain stronger?

A chain is only as strong as its weakest link, and the longer the chain, the weaker it is overall. To increase reliability and performance, one should look for ways to reduce the number of links in the chain, and make each remaining link more robust. The weak links in the Internet infrastructure are single points of failure and performance bottlenecks. Single points of failure are devices that have no inherent redundancy or backup. Bottlenecks are devices that do not have enough processing power to handle the amount of traffic they receive. Rainfinity 's RAIN technology was invented to eliminate single points of failure and performance bottlenecks in the Internet infrastructure. In the chain of links analogy, it is equivalent to strengthening one link without adding additional links. In some cases, it may even allow several links to be consolidated into one.

The key to reliability is redundancy. If one device fails, there must be a second device ready and able to take its place. If the second device fails, there must be a third, and so on. The key to performance is processing power. To increase capacity and speed, the customer has the choice of using a bigger, faster processor, or by dividing the task among several processors working in concert. Using a single processor limits scalability to the state of the art in processors, so that performance can only be what Moore's Law will allow. Multiple processors working in a cluster provide a more flexible and scalable architecture. Capacity can be added or subtracted at will; the overall performance to price ratio is higher; and combined with intelligent fail-over protocols; such a cluster enables higher reliability.

A clustering approach must allow multiple machines to work together as if they were a single system. The key challenge here is that all the machines in the cluster need to have consensus on the exact state of the cluster, and make collective decisions without conflicts. To address the issue of reliability, a cluster must also allow healthy machines within the cluster to automatically and transparently take over for any failed nodes. To address the issue of performance, all healthy nodes in the cluster must be actively processing in parallel, and each additional node must add processing power to the group, not detract from it. Creating such a clustering solution for the Internet infrastructure is an extremely difficult task.

Rainfinity delivers clustering solutions that allow Internet applications to run on a reliable, scalable cluster of computing nodes so that they do not become single points of failures or performance bottlenecks. The Rainfinity software manages load balancing and fail-over. It scales horizontally Without introducing additional hardware layers. Furthermore, the Rainfinity solutions can coexist with multiple Internet applications on the same physical layers. This reduces the number of links in this Internet chain, and therefore improves the overall reliability and performance of the Internet.

2. RAIN Technology

2.1 Origin

RAIN technology originated in a research project at the California Institute of Technology (Caltech), in collaboration with NASA's Jet Propulsion Laboratory and the Defense Advanced Research Projects Agency (DARPA). The name of the original research project was RAIN, which stands for Reliable Array of Independent Nodes. The goal of the RAIN project was to identify key software building blocks for creating reliable distributed applications using off-the-shelf hardware. The focus of the research was on high-performance, fault-tolerant and portable clustering technology for space-borne computing.

Led by Caltech professor Shuki Bruck, the RAIN research team in 1998 formed a company called Rainfinity. Rainfinity, located in Mountain View, Calif., is already shipping its first commercial software package derived from the RAIN technology, and company officials plan to release several other Internet-oriented applications.

The RAIN project was started four years ago at Caltech to create an alternative to the expensive, special-purpose computer systems used in space missions. The Caltech researchers wanted to put together a highly reliable and available computer system by distributing processing across many low-cost commercial hardware and software components.

To tie these components together, the researchers created RAIN software, which has three components:

• A component that stores data across distributed processors and retrieves it even if some of the processors fail.

• A communications component that creates a redundant network between multiple processors and supports a single, uniform way of connecting to any of the processors.

• A computing component that automatically recovers and restarts applications if a processor fails.

The RAIN software was delivered to JPL's Center for Integrated Space Microsystems, where it has been running for several months. The RAIN software would be useful on the space station or space shuttle, where astronauts each have their own laptops that could be strung together to behave as a single, reliable system for certain applications.

After delivering the RAIN software to JPL, Bruck and four of his colleagues secured a patent on it and created Rainfinity to develop commercial applications. Under an agreement with Caltech, Rainfinity has exclusive rights to the RAIN patent and software. In return, Caltech has an equity position in the company.

Rainfinity is shipping its first product, Rainwall, which is software that runs on a cluster of PCs or workstations and creates a distributed Internet gateway for hosting applications such as a firewall. When Rainwall detects a hardware or software failure, it automatically shifts traffic to a healthy gateway without disruption of service.

Two important assumptions were made, and these two assumptions reflect the differentiations between RAIN and a number of existing solutions both in the industry and in academia:

1. The most general share-nothing model is assumed. There is no shared storage accessible from all computing nodes. The only way for the computing nodes to share state is to communicate via a network. This differentiates RAIN technology from existing back-end server clustering solutions such as SUNcluster, HP MC Serviceguard or Microsoft Cluster Server.

2. The distributed application is not an isolated system. The distributed protocols interact closely with existing networking protocols so that a RAIN cluster is able to interact with the environment. Specifically, technological modules were created to handle high-volume network-based transactions. This differentiates it from traditional distributed computing projects such as Beowulf.

In short, the RAIN project intended to marry distributed computing with networking protocols. It became obvious that RAIN technology was well-suited for Internet applications. During the RAIN project, key components were built to fulfill this vision. A patent was filed and granted for the RAIN technology. Rainfinity was spun off from Caltech in 1998, and the company has exclusive intellectual property rights to the RAIN technology

2.2 Architecture

The RAIN technology incorporates a number of unique innovations as its core modules:

Reliable transport ensures the reliable communication between the nodes in the cluster. This transport has a built-in acknowledgement scheme that ensures reliable packet delivery. It transparently uses all available network links to reach the destination. When it fails to do so, it alerts the upper layer, therefore functioning as a failure detector. This module is portable to different computer platforms, operating systems and networking environments.

Consistent global state sharing protocol provides consistent group membership, optimized information distribution and distributed group-decision making for a RAIN cluster. This module is at the core of a RAIN cluster. It enables efficient group communication among the computing nodes, and ensures that they operate together without conflict. Always On IP maintains pools of "always-available" virtual IPs. These virtual IPs are logical addresses that can move from one node to another for load sharing or fail-over. Usually a pool of virtual IPs is created for each subnet that the RAIN cluster is connected to. A pool can consist of one or more virtual IPs. Always On IP guarantees that all virtual IP addresses representing the cluster are available as long as at least one node in the cluster is operational. In other words, when a physical node fails in the cluster, its virtual IP will be taken over by another healthy node in the cluster.

Local and global fault monitors monitor, on a continuous or event-driven basis, the critical resources within and around the cluster: network connections, Rainfinity or other applications residing on the nodes, remote nodes or applications. It is an integral part of the RAIN technology, guaranteeing the healthy operation of the cluster.

Secure and central management offers a browser-based management GUI for centralized monitoring and configuration of all nodes in the RAIN clusters. The central management GUI connects to any node in the cluster to obtain a single-system view of the entire cluster. It actively monitors the status, and can send operation and configuration commands to the entire cluster.

2.3 Features of RAIN

Features of RAIN system include scalability, dynamic reconfigurability, and high availability. Through software implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. In addition to reliability, the RAIN architecture permits efficient use of network resources, such as multiple data paths and redundant storage, with graceful degradation in the presence of faults.

The RAIN project incorporates many novel features in an attempt to deal with faults in nodes, networks, and data storage.

• Communication: Since the network is frequently a single point of failure, RAIN provides fault tolerance in the network via the following mechanisms:

�� Bundled interfaces: Nodes are permitted to have multiple interface cards. This not only adds fault tolerance to the network but also gives improved bandwidth.

�� Link monitoring: To correctly use multiple paths between nodes in the presence of faults, we have developed a link-state monitoring protocol that provides a consistent history of the link state at each endpoint.

�� Fault-tolerant interconnect topologies: Network partitioning is always a problem when a cluster of computers must act as a whole. Network topologies are designed to resist partitioning as network elements fails.

• Group membership: A fundamental part of fault management is identifying which nodes are healthy and participating in the cluster.

• Data storage: Fault tolerance in data storage over multiple disks is achieved through redundant storage schemes. Novel error-correcting codes have been

developed for this purpose. These are array codes that encode and decode using simple XOR operations. Traditional RAID codes generally only allow mirroring or parity as options. Array codes exhibit optimality in the storage requirements as well as in the number of update operations needed. Although some of the original motivations for these codes come from traditional RAID systems , these schemes apply equally well to partitioning data over disks on distinct nodes or even partitioning data over remote geographic locations.

2.4 Communication

The RAIN project addresses fault tolerance in the network with fault-tolerant interconnect topologies and with bundled network interfaces.

2.4.1 Fault-tolerant Interconnect Topologies

We were faced with the question of how to connect computing nodes to switching networks to maximize the network’s resistance to partitioning. Many distributed computing algorithms face trouble when presented with a large set of nodes that have become partitioned from the others. A network that is resistant to partitioning should loose only some constant number of nodes given that we do not exceed some number of failures. After additional failures we may see partitioning of the set of compute nodes, ie, some fraction of the total number of compute nodes may be lost. By carefully choosing how we connect our compute nodes to the switches, we can maximize a system’s ability to resist partitioning in the presence of faults.

2.4.1.1 The Problem

We look at the following problem. Given n switches of degree d_s,connected in a ring, what is the best way to connect n compute nodes d_cto the switches to minimize the possibility of partitioning the computing nodes when failures occur? The following figure illustrates

this problem.

2.4.1.2 A Naïve Approach

At first glance above shown figure may seem the solution to our problem. In this construction we simply connect the compute nodes to the nearest switches in a regular fashion. If we are using this approach we are relying entirely on fault tolerance in the switching network.

A ring is 1-fault tolerant for connectivity, so we can lose one switch without upset.

A second switch failure can partition the switches and the computing nodes as s

This prompts the study of whether we can use the multiple connections of the compute nodes to make the compute nodes more resistant partitioning. In other words we want a construction where the connectivity of the nodes is maintained even after the switch network has become partitioned.

2.4.1.3 Diameter Construction d_c=2

The intuitive driving idea behind this construction is to connect the compute nodes to the switching network in the most non-local way possible. i.e. connect a compute node to switches that are maximally distant from each other. This idea can be applied to arbitrary compute nodes, degree d_cwhere each connection for a node is as far as possible from its neighbors.

We call this solution Diameter solution because maximally distant switches in a ring are on opposite sides of the ring. So a compute node of degree2connected between them forms a diameter and hence this name. .

Diameter construction for n is (a) odd, (b) even.

In the diameter solution, we actually use the switches that are one less than the diameter apart to permit n compute nodes to be connected to n switches with each compute node connected to a unique pair of switches.

Diameter construction with compute nodes of degree d_c=2 connected to a ring of n switches of degree d_s= 4 can tolerate 3 faults of any kind (switch, link, or node) without partitioning the network. This construction is optimal in the sense that no construction connecting n computing nodes of degree d_c=2 to a ring of switches of degree d_s= 4 can tolerate an arbitrary 4 faults without partitioning the nodes into sets of non-constant size.

2.4.2 Consistent – History Protocol for Link Failures

When we bundle interfaces together on a machine and allows links and network adapters to fail, we must monitor available paths in the network for proper functioning. A ping protocol guarantees each side of the channel sees the same history. Each side is limited to how much it may lead or lag the other side of the channel, giving the protocol-bounded slack. This notion of identical history can be useful in the development of applications using this connectivity information.

Our main contributions are: (1) a simple, stable protocol for monitoring connectivity that maintains a consistent history with bounded slack and (2) proofs that this protocol exhibits correctness, bounded slack and stability.

2.4.2.1 Precise problem definition

We now present all requirements of the protocol:

• Correctness: The protocol will eventually correctly reflect the true state of the channel. If the channel ceases to perform bi-directional communication, both sides should eventually mark the channel as Down. If the channel resumes bi-directional communications, both sides should eventually mark the channel as Up.

• Bounded Slack: The protocol will ensure a maximum slack of N exists between the two sides. Neither side will be allowed to lag or lead the other by more than N transitions.

• Stability: Each read channel event will cost at most some bounded number of observable state transitions, preferably one at each endpoint.

2.4.2.2 The Protocol

This protocol uses reliable message passing to ensure that nodes on opposite ends of some faulty channel see the same history of link failure and recovery. At first it may seem odd to discuss monitoring the status of a link using reliable messages. However, it makes the description and proof of the protocol easier, preventing us from essentially re-providing sliding window protocols in a different form. For, actual implementations there are no reasons to build the protocol on an existing reliable communication layer. The protocol can be easily implemented on top of the ping messages with only a sequence number and acknowledgement number as data (in other words we can easily map reliable messaging on top of ping messages).

The protocol consists of two parts:

�� First we send and receive tokens using reliable messaging. Tokens are conserved, neither lost nor duplicated. Tokens are sent whenever a side sees an observable channel state transition. The observable channel state is whether the link is seen as Up or Down. The token passing part of the protocol is essentially the protocol. Its job is to ensure that a consistent history is maintained.

�� Second, we send and receive ping messages using reliable messaging. The sole purpose of the pings is to detect when the link can be considered as Up or Down. This part of the protocol would not necessarily have to be implemented with pings, but could be done using other hints from the underlying system.

The token passing part of the protocol maintains the consistent history between the sides and the pings give information on the current channel state. The token passing protocol can be seen as a filter that takes in raw information about the channel and produces channel information guaranteed to be consistent at both ends of the channel.

2.5 Group membership

Tolerating faults in an asynchronous distributed system is a challenging task. Reliable group Membership service ensures that processes in a group maintain a consistent view of the global membership.

In order for a distributed application to work correctly in the presence of faults, a certain level of problems in an asynchronous distributed system such as consensus, group membership, commit and atomic broadcast that have been extensively studied by researchers. In the RAIN system, the group membership protocol is the critical building block. It is a difficult task especially when change in membership occurs, either due to failures or voluntary joins and withdrawals.

In fact under the classical asynchronous environment, the group membership problem has been proven impossible to solve in the presence of any failures. The underlying reason for the impossibility is that according to the classical definition of asynchronous environment, processes in the system share no common clock and there is no bound on the message delay. Under this definition it is impossible to implement a reliable fault detector, for no fault detector can distinguish between a crashed mode and a very slow mode. Since the establishment of this theoretic result researchers have been striving to circumvent this impossibility. Theorists have modified the specification while practitioners have built a number of real systems that achieve a level of reliability in their particular environment.

2.5.1 Novel Features

The group membership in the RAIN system differs from that of other systems in several respects:

Firstly, it is based exclusively on unicast messages, a practical model given the nature of the Internet. With this model the total ordering of packets is not relevant. Compared to broadcast messages unicast messages are more efficient in terms of CPU overhead.

Secondly, the protocol does not require the system to freeze during reconfiguration. We do make the assumption the mean time to failure of a system is greater than the convergence time of the protocol. With this Assumption the RAIN system tolerates node and link failures, both permanent and transient.

In general it is not possible to distinguish a slow node from a dead node in an asynchronous environment. It is inevitable for a group membership protocol to exclude a live node, if it is slow, from the membership. Our protocol allows such a node to rejoin the cluster automatically.

The key to this fault management service is a token-based group membership protocol. The protocol consists of two mechanisms, a token mechanism and a 911 mechanism. The two mechanisms are described in detail in the next two sections.

2.5.2 Token Mechanism

The nodes in the membership are ordered in a logical ring. A token is a message that is being passed at a regular interval from one node to next node in the ring. The reliable packet communication layer is used for the transmission of the token, and guarantees that the token will eventually reach the destination. The token carries the authoritative knowledge of the membership when a node receives a token; it updates its local membership information according to the token.

The token is also used for failure detection. There are two variants for failure detection protocol in this token mechanism. The aggressive detection protocol achieves fast detection time but is more prone to incorrect decisions viz, it may temporarily exclude a node only in the presence of link failures. The conservative detection protocol excludes a node only when its communication has failed from all nodes in the connected component. The conservative failure detection protocol has slower detection time than the other detection protocol

2.5.2.1 Aggressive Failure Detection

When the aggressive failure detection protocol is used, after a node fails to send a token to the next node, the former node immediately decides that the latter node has failed or disconnected, and removes information and passes the token to the next live node in the ring. This protocol does not guarantee that all nodes in the connected component are included in the membership at all times. If a node looses a connection to part of the system because of link failure, it could be excluded from the membership. The excluded node will automatically rejoin the system, however, via the 911 mechanism,which will describe in the next section. For eg., for the situation in fig(b), the link between A and B is broken. After node A fails to send the token to node B, the aggressive failure detection protocol excludes node B from the membership. The ring changes from ABCD to ACD until node B rejoins the membership when the 911 mechanism is activated.

2.5.2.2 Conservative Failure Detection

In comparison when conservative failure detection protocol is used, partially disconnected nodes will not be excluded. When a node detects that another node is not responding, the former node does not remove the latter node from the membership instead it changes the order of the ring. In fig (c), after node A fails to send the token to node B, it changes the order of the ring from ABCD to ACBD. Node A then sends the token to node C, and C to node B. in the case when a node is indeed broken, all the nodes in the connected component fail to send the token to this node. When a node fails to send a token to another node twice in a row, it removes that node from the membership.

2.5.2.3 Uniqueness of Tokens

The token mechanism is the basic component of the membership protocol. It guarantees that there exists no more than one token in the system at any time. This single token detects the failures, records the membership and updates all live nodes as it travels around the ring. After a failed node is determined, all live nodes in the membership are unambiguously informed within one round of token travel. Group membership consensus is therefore achieved.

2.5.3 911 Mechanism

Having described the token mechanism, few questions remain. What if a node fails when it processes the token and consequently the token is lost? Is it possible to add a new node to the system? How does the system recover from the transient failures? All of these questions can be answered by the 911 mechanism.

2.5.3.1 Token Regeneration

To deal with the token loss problem, a time out has been set on each node in the membership. If a node does not receive a token for a certain period of time, it enters the STARVING mode. The node suspects that the token has been lost and sends out a 911 message to the next node in the ring. The 911 message is a request for a right to regenerate the token, and is to be provided by all the live nodes in the membership. It is imperative to allow one and only one node to regenerate the token when a token regeneration is needed. To guarantee this mutual exclusivity, we utilize the sequence number on the token.

Every time a token is being passed from one node to another, the sequence number on it is increased by one. The primary function of the sequence number is to allow the receiving node to discard the out of sequence tokens. The sequence number also plays an important role in the token regeneration mechanism. Each node makes a local copy of the token every time that the node receives it. When a node needs to send a 911 message to request the regeneration of token, it adds this message to the sequence number that is on its last local copy of the token. This sequence number will be compared to all the sequence numbers on the local copies of the token on the other live nodes. The 911 requests will be denied by any node, which possesses a more recent copy of the token. In the event that the token is lost, every live node sends out a 911 request after its STARVING timeout expires. Only the node with the latest copy of the token will receive the right to regenerate the token.

2.5.3.2 Dynamic Scalability

The 911 message is not only used as a token regeneration request, but also as a request to join the group. When a new node wishes to participate in the membership, it sends a 911 message to any node in the cluster. The receiving

node notices that the originating node of this 911 is not a member of the distributed system, and therefore, treats it as a join request. The next time that it receives the token, it adds the new node to the membership, and sends the token to the new node. The new node becomes a part of the system.

2.5.3.3 Link Failures and Transient Failures

The unification of the token regeneration request and the join request facilitates the treatment of the link failures in the aggressive failure detection protocol. Using the example in fig (b), node B has been removed from the membership because of the failure between A and B. node B does not receive the token for a while and it enters the STARVING mode and sends out a 911 message to node C. node C notices that node B is not a part of the membership and therefore treats the 911 as a join request. The ring is changed to ABCD and node B joins the membership.

Transient failures are treated with the same mechanism. When a transient failure occurs a node is removed from the membership. After the node recovers it sends out a 911 message. The 911 message is treated as a join request and the node is added back into the cluster. In the same fashion, wrong decisions made in a local failure detector can be corrected, guaranteeing that all non-faulty nodes in the primary connected component eventually stay in the primary membership.

Putting together the token and 911 mechanisms, we have a reliable group membership protocol. Using this protocol it is easy to build the fault management service. It is also possible to attach to the token application dependant synchronization information.

2.6. Data Storage

Much research has been done in increasing reliability by increasing data redundancy. The RAIN system provides a distributed storage system based on a class of error control codes called array codes.

2.6.1 Array Codes

Array codes are a class of error control codes that a particularly well suited to be used as erasure-correcting codes. Erasure-correcting codes are a mathematical means of representing data so that lost information can be recovered. With an (n,k) erasure correcting code, we represent k symbols of the original data with n symbols of the encoded data. (n-k is called the amount of redundancy or parity). With an m erasure-correcting code, the original data can be recovered even if m symbols of the encoded data are lost. A code is said to be Maximum Distance Seperable (MDS) if m=n-k. an MDS code is optimal in terms of amount of redundancy versus the erasure recovering capability. The Reed-Solomon code is an example of an MDS code.

The complexity of computations needed to construct the encoded data and to recover the original data is an important consideration for practical systems. Array codes are ideal in this respect. The only operations needed for encoding and decoding are simple binary exclusive-OR (XOR) operations, which can be easily implemented in hardware and/or software.

2.6.2 Distributed Store/Retrieve Operations

The complexity of computations needed to construct the encoded our distributed store and retrieve operations are a straightforward application of MDS array codes to distributed storage. Suppose that we have n nodes. For a store operation, we encode a block of data of size d into n symbols, each of size d/k using an (n, k) MDS array code. We store one symbol per node. For retrieve operation, we collect the symbols from any k nodes and decode them to obtain the original data.

This data storage schemes has several attractive features. Firstly, it provides reliability. The original data can be recovered with up to n-k node failures. Secondly, it provides dynamic reconfigurability and hot-swapping of components. We can dynamically remove and replace up to n-k nodes. In addition, the flexibility to choose any k out of n nodes permits load balancing. We can select the k nodes with the smallest load, or in the case of a wide-area network, the k nodes that are geographically closest.

3. Advantages of RAIN

RAIN technology is the most scalable software cluster technology for the Internet marketplace today. There is no limit on the size of a RAIN cluster. Within a RAIN cluster, there is no master-slave relationship or primary-secondary pairing. All nodes are active and can participate in load balancing. Any node can

fail-over to any node. A RAIN cluster can tolerate multiple node failures, as long as at least one node is healthy. It employs highly efficient consistent state sharing decision-making protocols, so that the entire cluster can function as one system. A RAIN cluster is a true distributed computing system that is resilient to faults. It behaves well in the presence of node, link and application failures, as well as transient failures. When there are failures in the system, a RAIN cluster gracefully degrades its performance to exclude the failed node, but maintains the overall functionality. New nodes can be added into the cluster “on the fly” to participate in load sharing, without taking down the cluster. With RAIN, online maintenance without downtime is possible. Part of the cluster can be taken down for maintenance, while the other part maintains the functionality. RAIN also allows online addition of new nodes for the growth of the cluster to provide higher performance and higher levels of fault tolerance. It is very simple to deploy and manage a RAIN cluster. RAIN technology addresses the scalability problem on the layer where it is happening, without the need to create additional layers in the front. One element in the RAIN architecture is the management module, which allows the user to monitor and configure the entire cluster by connecting to any one of the nodes. The consistent state-sharing module will help propagate the configuration throughout the cluster.

This software-only technology is open and highly portable. It works with a variety of hardware and software environments. Currently it has been ported to Solaris, NT and Linux. It supports a heterogeneous environment as well, where the cluster can consist of nodes of different operating systems with different configurations. There is no distance limitation to RAIN technology. It supports clusters of geographically distributed nodes. It can work with many different Internet applications.

With RAIN technology at the core, Rainfinity has created a family of Internet Reliability Software solutions that address the availability and

performance requirements of the Internet Infrastructure. Each solution is focused on critical elements or functions of the Internet Infrastructure, such as firewalls, web servers, and traffic management. They bring the unlimited scalability and built-in reliability that mission-critical Internet environment require.

4. Applications

We present several applications implemented on RAIN platform based on the fault management, communication and data storage building blocks described in the preceding sections: a video sever (RAIN Video), a web server (SNOW), and a distributed checkpointing system (RAINCheck).

4.1 High Availability Video Server

There has been considerable research in the areas of fault-tolerant internet and multimedia servers. Examples are the SunSCALR project at SUN Microsystems.

The RAIN Video system

For our RAINVideo application, a collection of videos are written and encoded to all n nodes in the system with distributed store operations. Each node runs a client application that attempts to display a video, as well as a server application that supplies encoded video data. For each block of video data a client performs a distributed retrieve operation to obtain encoded symbols from k of the servers. It then decodes the block of video data and displays it. If we break

network connections or take down nodes, some of the servers may no longer be accessible. However the videos continue to run without interruption provided that each client can access at least k servers. Snapshots of the demo are shown in figure. There are 10 computers each with two Myrinet network interfaces, and four 8-way Myrinet network switches.

4.2 High Availability Web Server

SNOW stands for Strong Network Of Web Servers. It is a proof-of-concept project that demonstrates the features of the RAIN system. The goal is to develop a highly available Fault-Tolerant Distributed Web Server Cluster that minimizes the risk of down time for mission critical Internet and intranet applications.

The SNOW project uses several key building blocks of the RAIN technology. Firstly, the reliable communication layer is used to handle all of the messages, which passes between the servers in the SNOW system. Secondly, the token-based fault management module is used to establish the set of servers participating in the cluster. In addition, the token protocol is used to guarantee that when SNOW receives a request, one and only one server will reply to the client. The latest information about the http queue is attached to the token. Thirdly, the distributed storage module can be used to store the actual data for the web server.

A client node displaying a video in the RAIN Video System

SNOW also uses the distributed state sharing mechanism enabled by the RAIN system. The state information of the web servers, namely, the queue of http requests is shared reliably and consistently among the SNOW nodes. High availability and performance are achieved without external load balancing devices. The SNOW system is also readily scalable. In contrast the commercially available Microsoft Wolfpack is only available for up to two nodes per cluster.

4.3 Distributed Checkpointing Mechanism

We have implemented a checkpoint and rollback/recovery mechanism on the RAIN platform based on the distributed store and retrieve operations. The scheme runs in conjunction with a leader election protocol. This protocol ensures that there is a unique node designated as leader in every connected set of nodes. As each job executes, a checkpoint of the state is taken periodically. The state is encoded and written to all accessible nodes with a distributed store operation. If a node fails or becomes inaccessible, the leader assigns the node’s job to other nodes. The encoded symbols for the state of each job are read from k nodes with a distributed read operation. The state of each job is then decoded and execution is resumed from the last checkpoint. As long as connected component of k nodes survives, all jobs execute to completion

5. Future Scope

�� Development of API’s for using the various building blocks. We should standardize the packaging of the various components to make them more practical for use by outside groups.

�� The implementation of a real distributed file system using the partitioning scheme developed here. In addition to making the building blocks more accessible to others, it would help in accessing the performance benefits and penalties from partitioning data in such a manner.

�� The Group Communication Protocols are being extended to address more challenging scenarios. For example, we are currently working on the hierarchical design that extends the scalability of the protocol.

9. Conclusion

The goal of the RAIN project has been to build a test-bed for various building blocks that address fault management, communication and storage in a distributed environment. The creation of such building blocks is important for the development of a fully functional distributed computing system. One of the fundamental driving ideas behind this work has been to consolidate assumptions required to get around the “difficult” parts of distributed computing into several basic building blocks. We feel the ability to provide basic, probably correct services are essential to building a real fault-tolerant system. In other words difficult proofs should be confined to a few basic components of the system. Components of the system built on top of these reliable components should then be easier to develop and easier to establish as correct in their own right. Building blocks that we consider important and that are discussed in this paper are those providing reliable communication, group membership and reliable storage.

Simply put, RAIN allows for the grouping of an unlimited number of nodes, which can then function as one single giant node, sharing load or taking over if one or more of the nodes ceases to function correctly. The RAIN technology incorporates many important unique innovations in its core elements, which deliver important advantages:

• unlimited scalability

• high performance

• built-in reliability

• simple deployment and management

• flexibility of software for integration in a variety of hardware and software environments

Engineering Seminar Topics :: Seminar Paper

RELIABLE ARRAY OF INDEPENDENT NODES

No comments:

Post a Comment

Seminar Topics