- 1970s: Timesharing (one computer with many users)
- 1980s: Personal computing (one computer per user)
- 1990s: Parallel computing (many computers per user)
Until about 1980, computers were huge, expensive, and located in computer centers. The development of high-speed LANs and cheap, powerful microprocessors has led to the creation of computing systems composed of large numbers of CPUs connected by a high-speed LAN. Some systems have many processors per user, either in the form of a parallel computer or a large collection of CPUs shared by a small user community. Such systems are usually called parallel or distributed computer systems. The operating systems that manage these new systems are called Distributed Operating Systems. These systems are here to stay for the following reasons:
ADVANTAGES OVER CENTRALIZED SYSTEMS
- Economics — Cheap microprocessors offer better price/performance than mainframes
- Speed — A distributed system may have more total computing power than a mainframe
- Inherent Distribution — Some applications involve spatially separated machines
- Fault Tolerance — If one machine crashes, the system as a whole can still survive
- Incremental Growth — Computing power can be added in small increments
ADVANTAGES OVER INDEPENDENT PCs
- Data Sharing — Allows many users access to a common database, etc.
- Device Sharing — Allows many users to share expensive peripherals such as colour printers
- Communication — Faster person-to-person communication, e.g. electronic mail
- Flexibility — Spreads the workload over all the machines to enhance overall performance
WHAT IS AMOEBA?
Amoeba is a new general-purpose distributed operating system designed for an environment consisting of a large number of computers. It is an ongoing research project. It was developed by a group under the direction of Prof. Andrew S. Tanenbaum at the Vrije Universiteit (VU) in Amsterdam (The Netherlands). It should be thought of as a platform for doing research and development in distributed and parallel systems, languages, protocols and applications.
WHY AMOEBA?
The main reasons why Amoeba is considered superior to other systems are as follows:
- Distribution — Connecting together many machines
- Parallelism — Allowing individual jobs to use multiple CPUs easily
- Transparency — Having the collection of computers act like a single system
- Performance — Achieving all of the above in an efficient manner
- Flexibility — Supporting both distributed and parallel computing
Amoeba is a distributed system, in which multiple machines can be connected together. These machines need not all be of the same kind. The machines can be spread around a building on a LAN. Amoeba uses the high performance FLIP network protocol for LAN communication. If an Amoeba machine has more than one network interface it will automatically act as a FLIP router between the various networks and thus connect the various LANs together.
Amoeba is also a parallel system. This means that a single job or program can use multiple processors to gain speed. For example, a branch and bound problem such as the
Traveling Salesman Problem can use tens or even hundreds of CPUs, if available, all working together to solve the problem more quickly. Large ‘‘back end’’ multiprocessors, for example, can be harnessed this way as big ‘‘compute engines.’’
Another key goal is transparency. The user need not know the number or the location of the CPUs, nor the place where the files are stored. Similarly, issues like file replication are handled largely automatically, without manual intervention by the users. Put in different terms, a user does not log into a specific machine, but into the system as a whole. There is no concept of a ‘‘home machine.’’ Once logged in, the user does not have to give special remote login commands to take advantage of multiple processors or do special remote mount operations to access distant files. To the user, the whole system looks like a single conventional timesharing system.
Performance and reliability are always key issues in operating systems, so substantial effort has gone into dealing with them. In particular, the basic communication mechanism has been optimized to allow messages to be sent and replies received with a minimum of delay, and to allow large blocks of data to be shipped from machine to machine at high bandwidth. These building blocks serve as the basis for implementing high performance subsystems and applications on Amoeba. Fault tolerance is also high in Amoeba, since multiple copies of files are kept on multiple servers.
Amoeba is intended for both ‘‘distributed’’ computing (multiple independent users working on different projects) and ‘‘parallel’’ computing (e.g., one user using 50 CPUs to play chess in parallel). Amoeba provides the necessary mechanism for doing both distributed and parallel applications, but the policy is entirely determined by user-level programs.
SYSTEM ARCHITECTURE
Since distributed and parallel computing is different from personal computing, it is worthwhile first describing the kind of hardware configuration for which Amoeba was designed.
A typical Amoeba system consists of four functional classes of machines:
i. Workstations
ii. Processor Pool
iii. Specialized Servers
iv. Gateway
First, each user has a workstation for running the user interface, the X window system.
This workstation can be a typical engineering workstation, or a specialized X terminal.
It is entirely dedicated to running the user interface, and does not have to do other computing. In other words, it can be a dumb terminal.
Second, there exists a pool of processors that are dynamically allocated to users as required. These processors can be part of a multiprocessor or multicomputer, be a collection of single-board computers or be a group of workstations allocated for this purpose. Usually, each pool processor has several megabytes of private memory, that is, pool processors need not have any shared memory (but it is not forbidden). Communication is performed by sending packets over the LAN. All the heavy computing happens in the processor pool.
Third, there are specialized servers, such as file servers and directory servers that run all the time. They may run on processor pool processors, or on dedicated hardware, as desired.
Finally, there are gateways to other Amoeba systems that can only be accessed over wide area networks. In the context of a project sponsored by the European Community, we built a distributed Amoeba system that spanned several countries. The role of the gateway is to protect the local machines from the idiosyncrasies of the protocols that must be used over the wide area links. At present only Ethernet is supported, but ports to other LANs are possible.
Amoeba Microkernel
The microkernel has four primary functions:
1. Manage processes and threads.
2. Provide low-level memory management support.
3. Support communication.
4. Handle low-level I/O.
First, like most operating systems, Amoeba supports the concept of a process. In addition,
Amoeba also supports multiple threads of control within a single address space. A process with one thread is essentially the same as a process in UNIX. Such a process has a single address space, a set of registers, a program counter, and a stack. In contrast, although a process with multiple threads still has a single address space shared by all threads, each thread logically has its own registers, its own program counter, and its own stack. In effect, a collection of threads in a process is similar to a collection of independent processes in UNIX, with the one exception that they all share a single common address space. A typical use for multiple threads might be in a file server, in which every incoming request is assigned to a separate thread to work on. That thread might begin processing the request, then block waiting for the disk, then continue work. By splitting the server up into multiple threads, each thread can be purely sequential, even if it has to block waiting for I/O. Nevertheless, all the threads can, for example, have access to a single shared software cache. Threads can synchronize using semaphores or mutexes to prevent two threads from accessing the shared cache simultaneously.
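To make this file-server pattern concrete, the sketch below runs one worker thread per incoming request and protects the shared cache with a single lock. It uses POSIX threads purely for illustration; Amoeba's own thread and mutex library uses different call names, and the request structure, cache_lookup and the reply step are hypothetical.

    #include <pthread.h>
    #include <stdlib.h>

    struct request { int object; /* ... parameters of the incoming request ... */ };

    /* One mutex protects the shared software cache used by all worker threads. */
    static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *cache_lookup(int object) { return NULL; }   /* placeholder */

    static void handle_request(struct request *req)
    {
        void *data;

        pthread_mutex_lock(&cache_lock);      /* only one thread touches the cache */
        data = cache_lookup(req->object);
        pthread_mutex_unlock(&cache_lock);

        if (data == NULL) {
            /* This thread may now block waiting for the disk;
               the other threads in the server keep running. */
        }
        /* ... build and send the reply ... */
    }

    /* Each incoming request is handed to a fresh worker thread. */
    static void *worker(void *arg)
    {
        handle_request((struct request *) arg);
        free(arg);
        return NULL;
    }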
The second task of the kernel is to provide low-level memory management. Threads can allocate and deallocate blocks of memory, called segments. These segments can be read and written, and can be mapped into and out of the address space of the process to which the calling thread belongs. A process must have at least one segment, but it may have many more of them. Segments can be used for text, data, stack, or any other purpose the process desires. The operating system does not enforce any particular pattern on segment usage. Normally, users do not think in terms of segments, but this facility could be used by libraries or language run-time systems.
The third job of the kernel is to handle interprocess communication. Two forms of communication are provided: point-to-point communication and group communication.
Point-to-point communication is based on the model of a client sending a message to a server, then blocking until the server has sent a reply back. This request/reply exchange is the basis on which almost everything else is built. The request/reply is usually packaged in library routine so the remote call looks like a local procedure call. This mechanism is generally known as remote procedure call (RPC).
Group communication is the other form of communication. It allows a message to be sent from one source to multiple destinations. Software protocols provide reliable, fault tolerant group communication to user processes even with lost messages and other errors. Both the point-to-point message system and the group communication make use of a specialized protocol called FLIP. This protocol is a network layer protocol, and has been specifically designed to meet the needs of distributed computing. It deals with both unicasting and multicasting on complex internetworks.
The fourth function of the kernel is to manage low-level I/O. For each I/O device attached to a machine, there is a device driver in the kernel. The driver manages all I/O for the device. Drivers are linked with the kernel, and cannot be loaded dynamically.
PROTECTION MECHANISM
Protection mechanisms are the specific operating system mechanisms used to safeguard information in the computer, whether against unauthorized use of files or snooping on messages. The kinds of mechanisms commonly employed are Protection Domains, Access Control Lists, the Access Matrix, and Capabilities.
OBJECTS and CAPABILITIES
The basic unifying concept underlying all the Amoeba servers and the services they provide is the object. An object is an encapsulated piece of data upon which certain well-defined operations may be performed. It is in essence, an abstract data type. Objects are passive. They do not contain processes or methods or other active entities that "do" things. Instead, each object is managed by a server process.
To perform an operation on an object, a client does an RPC with the server, specifying the object, the operation to be performed, and optionally, any parameters needed.
Objects are named and protected by special tickets called capabilities.
To create an object, a client does an RPC with the appropriate server specifying what it wants. The server then creates the object and returns a 128-bit capability to the client. On subsequent operations, the client must present the capability to identify the object.
When a client wants to perform an operation on an object, it calls a stub procedure that builds a message containing the object’s capability, and then traps to the kernel. The kernel extracts the Server Port field from the capability and looks it up in its cache to locate the machine on which the server resides. If there is no cache entry, or that entry is no longer valid, the kernel locates the server by broadcasting.
The rest of the information in the capability is ignored by the kernels and passed to the server for its own use. The Object field is used by the server to identify the specific object in question. For example, a file server might manage thousands of files, with the object number being used to tell it which one is being operated on. In a sense, the Object field is analogous to the i-node number in UNIX.
The Rights field is a bit map telling which of the allowed operations the holder of a capability may perform. For example, although a particular object may support reading and writing, a specific capability may be constructed with all the rights bits except READ turned off.
The Check Field is used for validating the capability. Capabilities are manipulated directly by user processes. Without some form of protection, there would be no way to prevent user processes from forging capabilities.
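Putting the four fields together, a capability can be pictured as the 16-byte structure sketched below. The 48/24/8/48-bit split follows the published Amoeba design; the C rendering itself should be treated as illustrative.

    /* Illustrative layout of the 128-bit Amoeba capability. */
    struct capability {
        unsigned char server_port[6];  /* 48-bit port of the server managing the object   */
        unsigned char object[3];       /* 24-bit object number, interpreted by the server  */
        unsigned char rights;          /*  8-bit rights bit map                            */
        unsigned char check[6];        /* 48-bit check field that protects against forgery */
    };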
OBJECT PROTECTION
The basic algorithm is as follows.
When an object is created, the server picks a random Check Field and stores it both in the new capability and inside its own tables. All the rights bits in a new capability are initially on, and it is this owner capability that is returned to the client. When the capability is sent back to the server in a request to perform an operation, the Check Field is verified.
To create a restricted capability, a client can pass a capability back to the server, along with a bit mask for the new rights. The server takes the original Check Field from its tables, EXCLUSIVE ORs it with the new rights (which must be a subset of the rights in the capability), and then runs the result through a one-way function. Such a function, y = f(x), has the property that given x it is easy to find y, but given only y, finding x requires an exhaustive search of all possible x values. The server then creates a new capability, with the same value in the Object field, but the new rights bits in the Rights field and the output of the one-way function in the Check Field. The client may give this to another process, if it wishes.
When the capability comes back to the server, the server sees from the Rights field that it is not an owner capability because at least one bit is turned off. The server then fetches the original random number from its tables, EXCLUSIVE ORs it with the Rights field, and runs the result through the one-way function. If the result agrees with the Check Field, the capability is accepted as valid.
It should be obvious from this algorithm that a user who tries to add rights that he does not have will simply invalidate the capability. Capabilities are used throughout Amoeba for both naming of all objects and for protecting them. This single mechanism leads to an easy-to-understand and easy-to-implement naming and protection scheme. It also is fully location transparent. To perform an operation on an object, it is not necessary to know where the object resides. In fact, even if this knowledge were available, there would be no way to use it.
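The restriction and validation steps can be summarized in a few lines of C. This is only a sketch: the check values are shown as plain integers rather than 48-bit fields, and one_way() stands in for whatever cryptographic one-way function the server uses.

    #define ALL_RIGHTS 0xff

    extern unsigned long one_way(unsigned long x);   /* assumed one-way function */

    struct cap { unsigned long object; unsigned char rights; unsigned long check; };
    struct object_entry { unsigned long random_check; /* per-object server state */ };

    /* The owner hands back its capability plus a rights mask; the server
       builds the restricted capability. */
    struct cap restrict_cap(struct object_entry *obj, struct cap owner,
                            unsigned char new_rights)
    {
        struct cap c = owner;
        c.rights = new_rights;                               /* must be a subset */
        c.check  = one_way(obj->random_check ^ new_rights);  /* XOR, then f()    */
        return c;
    }

    /* Validation of a capability presented with a request. */
    int cap_valid(struct object_entry *obj, struct cap c)
    {
        if (c.rights == ALL_RIGHTS)                          /* owner capability */
            return c.check == obj->random_check;
        return c.check == one_way(obj->random_check ^ c.rights);
    }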
Note that Amoeba does not use access control lists for authentication. The protection scheme used requires almost no administrative overhead. However, in an insecure environment, cryptography may be required to keep capabilities from being accidentally disclosed.
Example
We show two examples of how capabilities are used in Amoeba. In (a), a group of three segments has been created, each of which has its own capability. Using these capabilities, the process creating the segments can read and write the segments. Given a collection of memory segments, a process can go to the process server and ask for a process to be constructed from them, as shown in (b). This results in a process capability, through which the new process can be run, stopped, inspected, and so on. This mechanism for process creation is much more location transparent and efficient in a distributed system than the UNIX fork system call.
PROCESS MANAGEMENT IN AMOEBA
A process in Amoeba is basically an address space and a collection of threads that run in it. In this section we will explain how processes and threads work, and how they are implemented.
Processes
A process is an object in Amoeba. When a process is created, the parent process is given a capability for the child process, just as with any other newly created object. Using this capability, the child can be suspended, restarted, or destroyed.
Process creation in Amoeba is different from UNIX. The UNIX model of creating a child process by cloning the parent is inappropriate in a distributed system due to the potential overhead of first creating a copy somewhere (FORK) and almost immediately afterwards replacing the copy with a new program (EXEC). Instead, in Amoeba it is possible to create a new process on a specific processor with the intended memory image starting right at the beginning. The children can, in turn, create their own children, leading to a tree of processes.
Process management is handled at three different levels in Amoeba.
At the lowest level are the process servers, which are kernel threads running on every machine. To create a process on a given machine, another process does an RPC with that machine’s process server, providing it with the necessary information.
At the next level up we have a set of library procedures that provide a more convenient interface for user programs. They do their job by calling the low-level interface procedures.
At the highest level, the run server can be invoked to choose a machine and start the process there. The run server keeps track of the load on the various processors and chooses the most favorable machine based on CPU load and memory usage.
PROCESS DESCRIPTOR
Some of the process management calls use a data structure called a process descriptor to provide information about a process to be run. It is used both for new processes and those that have run for a while and been suspended (e.g., by a debugger). One field in the process descriptor tells which CPU architecture the process can run on. In heterogeneous systems, this field is essential to make sure 386 binaries are not run on SPARCs, and so on. Another field contains a capability for communicating the exit status to the owner. When the process terminates or is stunned (see below), RPCs will be done using this capability to report the event. It also contains descriptors for all the process’ segments, which collectively define its address space.
Finally, the process descriptor also contains a descriptor for each thread in the process.
The content of a thread descriptor is architecture dependent, but as a bare minimum, it contains the thread’s program counter and stack pointer. It may also contain additional information necessary to run the thread, including other registers, the thread’s state, and various flags.
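Gathering the fields just described, a process descriptor might be rendered in C roughly as follows; the limits and exact field types are illustrative, and the capability structure is the one sketched earlier in the protection section.

    #define MAX_SEGMENTS 16          /* illustrative limits, not Amoeba's own */
    #define MAX_THREADS  16

    struct segment_descriptor {
        struct capability cap;       /* capability for the segment's contents   */
        unsigned long virt_addr;     /* where it is mapped in the address space */
        unsigned long length;
    };

    struct thread_descriptor {       /* architecture dependent in reality */
        unsigned long pc;            /* program counter                   */
        unsigned long sp;            /* stack pointer                     */
        unsigned long state;         /* run state and flags               */
        /* other registers ... */
    };

    struct process_descriptor {
        int architecture;                    /* which CPU type the binary runs on  */
        struct capability owner_cap;         /* where exit or stun events are sent */
        int nsegments;
        struct segment_descriptor seg[MAX_SEGMENTS];
        int nthreads;
        struct thread_descriptor thread[MAX_THREADS];
    };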
EXAMPLE
Consider a process descriptor and the corresponding process in which the process has one text segment, one data segment shared by all threads, three segments that are each private to one thread, and three stack segments.
The low-level process interface to the process management system consists of several procedures; only three of these will concern us here. The first one, exec, is the most important. It has two input parameters, the capability for a process server and a process descriptor. Its function is to do an RPC with the specified process server asking it to run the process. If the call is successful, a capability for the new process is returned to the caller.
A second important procedure is getload. It returns information about the CPU speed, current load, and amount of memory free at the moment. It is used by the run server to determine the best place to execute a new process.
A third major procedure is stun. A process’ parent can suspend it by stunning it. More commonly, the parent can give the process’ capability to a debugger, which can stun it and later restart it for interactive debugging purposes. Two kinds of stuns are supported: normal and emergency. They differ with respect to what happens if the process is blocked on one or more RPCs at the time it is stunned. With a normal stun, the process sends a message to the server it is currently waiting for saying, in effect: ‘‘I have been stunned. Finish your work instantly and send me a reply.’’ If the server is also blocked, waiting for another server, the message is propagated further, all the way down the line to the end, where it generates an interrupt. If the server at the end of the line catches the interrupt, it replies with a special error message. In this way, all the pending RPCs are terminated quickly in a clean way, with all of the servers finishing properly. The nesting structure is not violated, and no ‘‘long jumps’’ are needed. Processes that do not want to be interrupted can have their wish by simply not enabling handlers (the default is to ignore stuns). Then, the client process stays alive until it receives the reply from the server process.
An emergency stun stops the process instantly. It sends messages to servers that are currently working for the stunned process, but does not wait for the replies. The computations being done by the servers become orphans. When the servers finally finish and send replies, these replies are discarded.
THREADS
Amoeba supports a simple threads model. When a process starts up, it has at least one thread and possibly more. The number of threads is dynamic. During execution, the process can create additional threads, and existing threads can terminate. When a new thread is created, the parameters to the call specify the procedure to run and the size of the initial stack.
Although all threads in a process share the same program text and global data, each thread has its own stack, its own stack pointer, and its own copy of the machine registers. In addition, if a thread wants to create and use variables that are global to all its procedures but invisible to other threads, library procedures are provided for that purpose. These variables are managed by the thread itself; the kernel does not intervene.
Three methods are provided for thread synchronization: signals, mutexes, and semaphores.
Signals are asynchronous interrupts sent from one thread to another thread in the same process. They are conceptually similar to UNIX signals, except that they are between threads rather than between processes. Signals can be raised, caught, or ignored. Asynchronous interrupts between processes use the stun mechanism.
The second form of interthread communication is the mutex. A mutex is like a binary semaphore. It can be in one of two states, locked or unlocked. Trying to lock an unlocked mutex causes it to become locked. The calling thread continues. Trying to lock a mutex that is already locked causes the calling thread to block until another thread unlocks the mutex. If more than one thread is waiting on a mutex, when it is unlocked, exactly one thread is released. In addition to the calls to lock and unlock mutexes, there is also a call that tries to lock a mutex, but if it is unable to do so within a specified interval, it times out and returns an error code.
The third way threads can synchronize is by counting semaphores. These are slower than mutexes, but there are times when they are needed. They work in the usual way, except that here too an additional call is provided to allow a DOWN operation to time out if it is unable to succeed within a specified interval.
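As a small illustration of the timeout variants, the sketch below locks a mutex with a bounded wait before entering a critical section. The call names mu_trylock and mu_unlock follow the style of the Amoeba thread library, but the exact signatures and the interval units are assumptions here.

    typedef struct mutex mutex;      /* opaque library type */

    extern int  mu_trylock(mutex *m, long timeout);  /* assumed: 0 on success, <0 on timeout */
    extern void mu_unlock(mutex *m);

    int with_cache_locked(mutex *cache_lock, void (*work)(void))
    {
        if (mu_trylock(cache_lock, 1000 /* illustrative interval */) < 0)
            return -1;               /* could not obtain the lock in time     */
        work();                      /* critical section on the shared cache  */
        mu_unlock(cache_lock);
        return 0;
    }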
All threads are managed by the kernel. The advantage of this design is that when a thread does an RPC, the kernel can block that thread and schedule another one in the same process if one is ready. Thread scheduling is done using priorities, with kernel threads having higher priority than user threads. Thread scheduling can be set up to be either pre-emptive or run-to-completion (i.e., threads continue to run until they block), as the process wishes. Within a user process, threads do not have priorities, and run nonpreemptively.
MEMORY MANAGEMENT IN AMOEBA
Amoeba also has a simple memory model. A process can have any number of segments, and they can be located wherever the process wants in its virtual address space. Segments are not swapped or paged, so a process must be entirely memory resident to run. Since the hardware MMU is used, a segment can be located anywhere within the virtual address space. Each segment is stored contiguously in physical memory.
Although this design is perhaps somewhat unusual these days, it was done for three reasons: performance, simplicity, and economics. Having a process entirely in memory all the time makes RPC go faster. When a large block of data must be sent, the system knows that all of the data is contiguous not only in virtual memory, but also in physical memory. This knowledge saves having to check if all the pages containing the buffer happen to be around at the moment, and eliminates having to wait for them if they are not. Similarly, on input, the buffer is always in memory, so the incoming data can be placed there simply and without page faults. This design was one of the factors that allowed Amoeba to achieve high transfer rates for large RPCs.
The second reason for the design is simplicity. Not having paging or swapping makes the system considerably simpler and makes the kernel smaller and more manageable. However, it is the third reason that makes the first two feasible. Memory is becoming so cheap that within a few years, all Amoeba machines will probably have tens of megabytes of it. Such large memories remove the traditional reason for paging and swapping, namely, fitting large programs into small machines. Programs that do not fit in physical memory cannot be run on Amoeba.
SEGMENTS
Processes have several calls available to them for managing segments. Most important among these is the ability to create, destroy, read, and write segments. When a segment is created, the caller gets back a capability for it. This capability is used for all the other calls involving the segment. Because segments can be read and written, it is possible to use them to construct a main memory file server. To start, the server creates a segment as large as it can, determining the maximum size by asking the kernel. This segment will be used as a simulated disk. The server then formats the segment as a file system, putting in whatever data structures it needs to keep track of files. After that, it is open for business, accepting and processing requests from clients.
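The main-memory file server idea can be sketched as follows. The segment calls shown (seg_max_size, seg_create) are hypothetical stand-ins for the real Amoeba segment interface, which differs in names and detail.

    extern long  seg_max_size(void);                  /* ask the kernel for the limit */
    extern char *seg_create(long bytes, struct capability *cap);

    struct ramdisk {
        struct capability cap;    /* capability returned for the segment        */
        char *base;               /* segment mapped into the server's own space */
        long  size;
    };

    int ramdisk_init(struct ramdisk *rd)
    {
        rd->size = seg_max_size();                    /* largest segment obtainable */
        rd->base = seg_create(rd->size, &rd->cap);
        if (rd->base == NULL)
            return -1;
        /* Format the segment as a file system: superblock, free list, file table. */
        return 0;
    }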
MAPPED SEGMENTS
Virtual address spaces in Amoeba are constructed by mapping segments into them.
When a process is started, it must have at least one segment. Once it is running, a process can create additional segments and map them into its address space at any unused virtual address.
A process can also unmap segments. Furthermore, a process can specify a range of virtual addresses and request that the range be unmapped, after which those addresses are no longer legal. When a segment or a range of addresses is unmapped, a capability is returned, so the segment may still be accessed, or even mapped back in again later, possibly at a different virtual address (on the same processor).
A segment may be mapped into the address space of two or more processes at the same time. This allows processes to operate on shared memory. For example, two processes can map the screen buffer or other hardware devices into their respective address spaces. Also, cooperating processes can share a buffer. Segments cannot be shared over a network.
COMMUNICATION IN AMOEBA
Amoeba supports two forms of communication: point-to-point message passing and group communication. At the lowest level, an RPC consists of a request message sent by a client to a server followed by a reply message from the server back to the client. Group communication uses hardware broadcasting or multicasting if it is available; otherwise it transparently simulates it with individual messages. In this section we will describe both RPC and group communication, and then discuss the underlying FLIP protocol that is used to support them.
POINT-TO-POINT COMMUNICATION
All point-to-point communication in Amoeba consists of a client sending a message to a server followed by the server sending a reply back to the client. It is not possible for a client to just send a message and then go do something else. The primitive that sends the request automatically blocks the caller until the reply comes back, thus forcing a certain amount of structure on programs. Separate send and receive primitives can be thought of as the distributed system’s answer to the goto statement: parallel spaghetti programming.
Each standard server defines a procedural interface that clients can call. These library routines are stubs that pack the parameters into messages and invoke the kernel primitives to actually send the message. During message transmission, the stub, and hence the calling thread, is blocked. When the reply comes back, the stub returns the status and results to the client. Although the kernel-level primitives are closely related to the message passing, the use of stubs makes this mechanism look like RPC to the programmer, so we will refer to the basic communication primitives as RPC, rather than the slightly more precise ‘‘request/reply message exchange.’’ Stubs can either be hand written or generated by a stub compiler.
In order for a client thread to do an RPC with a server thread, the client must know the server’s address. Addressing is done by allowing any thread to choose a random 48-bit number, called a port, to be used as the address for messages sent to it. Different threads in a process may use different ports if they so desire. All messages are addressed from a sender to a destination port. A port is nothing more than a kind of logical thread address. There is no data structure and no storage associated with a port. It is similar to an IP address or an Ethernet address in that respect, except that it is not tied to any particular physical location. The first field in each capability gives the port of the server that manages the object.
When a request message is transmitted over the network, it contains a header and (optionally) a data buffer. The header is a fixed 32-byte structure. The Signature field is currently not in use, but is reserved for authentication purposes. The Private part is normally used to hold the rightmost three fields of the capability. Most servers support multiple operations on their objects, such as reading, writing, and destroying. The Command field is conventionally used on requests to indicate which operation is needed. On replies it tells whether the operation was successful or not, and if not, it gives the reason for failure. The last three fields hold parameters, if any. For example, when reading a segment or file, they can be used to indicate the offset within the object to begin reading at, and the number of bytes to read.
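A C rendering of this header is sketched below. The field widths follow the published Amoeba layout (32 bytes with 32-bit longs), but the exact names and the use of the parameter fields vary by server, so the structure should be taken as illustrative.

    /* Illustrative fixed-size request/reply header. */
    struct amoeba_header {
        unsigned char port[6];        /* destination port, taken from the capability   */
        unsigned char signature[6];   /* reserved for authentication, currently unused */
        unsigned char object[3];      /* private part: object number ...               */
        unsigned char rights;         /* ... rights bit map ...                        */
        unsigned char check[6];       /* ... and check field of the capability         */
        unsigned short command;       /* operation code on requests, status on replies */
        long offset;                  /* parameter 1, e.g. offset to start reading at  */
        unsigned short size;          /* parameter 2, e.g. number of bytes to read     */
        unsigned short extra;         /* parameter 3                                   */
    };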
Group Communication in Amoeba
RPC is not the only form of communication supported by Amoeba. It also supports group communication. A group in Amoeba consists of one or more processes that are cooperating to carry out some task or provide some service. Processes can be members of several groups at the same time. Groups are closed, meaning that only members may broadcast to the group. The usual way for a client to access a service provided by a group is to do an RPC with one of its members. That member then uses group communication within the group, if necessary, to determine who will do what.
Group Communication Primitives
Let us now look at how Amoeba implements group communication. Amoeba works best on LANs that support either multicasting or broadcasting (or like Ethernet, both). For simplicity, we will just refer to broadcasting, although in fact the implementation uses multicasting when it can to avoid disturbing machines that are not interested in the message being sent. It is assumed that the hardware broadcast is good, but not perfect. In practice, lost packets are rare, but receiver overruns do happen occasionally. Since these errors can occur, the protocol has been designed to deal with them.
The key idea that forms the basis of the implementation of group communication is reliable broadcasting. By this we mean that when a user process broadcasts a message, the user-supplied message is correctly delivered to all members of the group, even though the hardware may lose packets. For simplicity, we will assume that each message fits into a single packet. For the moment, we will assume that processors do not crash. We will consider the case of unreliable processors afterwards. The description given below is just an outline.
The hardware of all the machines is normally identical, and they all run exactly the same kernel. However, when the application starts up, one of the machines is elected as sequencer (like a committee electing a chairman). If the sequencer machine subsequently crashes, the remaining members elect a new one. Many election algorithms are known, such as choosing the process with the highest network address.
One sequence of events that can be used to achieve reliable broadcasting can be summarized as follows.
1. The thread traps to the kernel.
2. The thread, now in kernel mode, adds a protocol header and sends the message to the sequencer using a point-to-point message.
3. When the sequencer gets the message, it allocates the next available sequence number, puts the sequence number in the protocol header, and broadcasts the message (and sequence number).
4. When the sending kernel sees the broadcast message, it unblocks the calling process to let it continue execution.
Let us now consider these steps in more detail. When an application process executes a broadcast primitive, such as SendToGroup, a trap to its kernel occurs. The calling thread switches to kernel mode and builds a message containing a kernel-supplied header and the application-supplied data. The header contains the message type (Request for Broadcast in this case), a unique message identifier (used to detect duplicates), the number of the next broadcast expected by the kernel and some other information.
The kernel sends the message to the sequencer using a normal point-to-point message, and simultaneously starts a timer. If the broadcast comes back before the timer runs out (normal case), the sending kernel stops the timer and returns control to the caller. In practice, this case happens well over 99% of the time, because LANs are highly reliable.
On the other hand, if the broadcast has not come back before the timer expires, the kernel assumes that either the message or the broadcast has been lost. Either way, it retransmits the message. If the original message was lost, no harm has been done, and the second (or subsequent) attempt will trigger the broadcast in the usual way. If the message got to the sequencer and was broadcast, but the sender missed the broadcast, the sequencer will detect the retransmission as a duplicate (from the message identifier) and just tell the sender that everything is all right. The message is not broadcast a second time.
A third possibility is that a broadcast comes back before the timer runs out, but it is the wrong broadcast. This situation arises when two processes attempt to broadcast simultaneously. One of them, A, gets to the sequencer first, and its message is broadcast. A sees the broadcast and unblocks its application program. However its competitor, B, sees A’s broadcast and realizes that it has failed to go first. Nevertheless, B knows that its message probably got to the sequencer (since lost messages are rare) where it will be queued, and broadcast next. Thus B accepts A’s broadcast and continues to wait for its own broadcast to come back or its timer to expire.
Now consider what happens at the sequencer when a Request for Broadcast arrives there. First a check is made to see if the message is a retransmission, and if so, the sender is informed that the broadcast has already been done, as mentioned above. If the message is new (normal case), the next sequence number is assigned to it, and the sequencer counter is incremented by one. The message and its identifier are then stored in a history buffer, and the message is then broadcast. The message is also passed to the application running on the sequencer’s machine (because the broadcast does not interrupt itself).
Finally, let us consider what happens when a kernel receives a broadcast. First, the sequence number is compared to the sequence number of the most recently received broadcast. If the new one is 1 higher (normal case), no broadcasts have been missed, so the message is passed up to the application program, assuming that it is waiting. If it is not waiting, it is buffered until the program calls ReceiveFromGroup. Now suppose instead that the newly received broadcast has sequence number 25, while the previous one had number 23. The kernel is alerted to the fact that it has missed number 24, so it sends a point-to-point message to the sequencer asking for a private retransmission of the missing message. The sequencer fetches the missing message from its history buffer and sends it. When it arrives, the receiving kernel processes 24 and 25, passing them to the application program in numerical order. Thus the only effect of a lost message is a minor time delay. All application programs see all broadcasts in the same order, even if some messages are lost. The reliable broadcast protocol is illustrated in Fig. 8. Here the application program running on machine A passes a message, M, to its kernel for broadcasting. The kernel sends the message to the sequencer, where it is assigned sequence number 25. The message (containing the sequence number 25) is now broadcast to all machines and is also passed to the application running on the sequencer itself. This broadcast message is denoted by M25 in the figure.
The M25 message arrives at machines B and C. At machine B the kernel sees that it has already processed all broadcasts up to and including 24, so it immediately passes M25 up to the application program. At C, however, the last message to arrive was 23 (24 must have been lost), so M25 is buffered in the kernel, and a point-to-point message requesting 24 is sent to the sequencer. Only after the reply has come back and been given to the application program will M25 be passed upwards as well.
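The receiver-side logic just described amounts to a sequence-number check, sketched below; deliver(), buffer_out_of_order() and ask_sequencer_for() are placeholders for the real kernel routines.

    extern void deliver(void *msg);
    extern void buffer_out_of_order(unsigned int seq, void *msg);
    extern void ask_sequencer_for(unsigned int seq);

    static unsigned int last_delivered;   /* highest broadcast passed up so far */

    void broadcast_arrived(unsigned int seq, void *msg)
    {
        unsigned int missing;

        if (seq == last_delivered + 1) {            /* normal case            */
            deliver(msg);
            last_delivered = seq;
        } else if (seq > last_delivered + 1) {      /* a gap: something lost  */
            buffer_out_of_order(seq, msg);
            for (missing = last_delivered + 1; missing < seq; missing++)
                ask_sequencer_for(missing);         /* private retransmission */
        }
        /* seq <= last_delivered: a duplicate, so it is simply ignored. */
    }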
Now let us look at the management of the history buffer. Unless something is done to prevent it, the history buffer will quickly fill up. However, if the sequencer knows that all machines have correctly received broadcasts, say, 0 through 23, it can delete these from its history buffer. Several mechanisms are provided to allow the sequencer to discover this information. The basic one is that each Request for Broadcast message sent to the sequencer carries a piggybacked acknowledgement, k, meaning that all broadcasts up to and including k have been correctly received, so the next one expected is k + 1. This way, the sequencer can maintain a piggyback table, indexed by machine number, telling for each machine which broadcast was the last one received. Whenever the history buffer begins to fill up, the sequencer can make a pass through this table to find the smallest value. It can then safely discard all messages up to and including this value.
If a machine happens to be silent for a long period of time, the sequencer will not know what its status is. To inform the sequencer, each machine is required to send a short acknowledgement message when it has sent no broadcast messages for a certain period of time. Furthermore, the sequencer can broadcast a Request for Status message, which asks all other machines to send it a message giving the number of the highest broadcast received in sequence. In this way, the sequencer can update its piggyback table and then truncate its history buffer. Although in practice Request for Status messages are rare, they do occur, and thus raise the mean number of messages required for a reliable broadcast slightly above 2, even when there are no lost messages. The effect increases slightly as the number of machines grows.
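The truncation step itself is straightforward: the sequencer scans its piggyback table for the smallest acknowledged sequence number and discards everything up to it. A sketch, with history_discard_upto() and the table size as placeholders:

    #define NMACHINES 32                          /* illustrative */

    static unsigned int piggyback[NMACHINES];     /* last broadcast acked per machine */
    extern void history_discard_upto(unsigned int seq);

    void trim_history(void)
    {
        unsigned int lowest = piggyback[0];
        int i;

        for (i = 1; i < NMACHINES; i++)
            if (piggyback[i] < lowest)
                lowest = piggyback[i];

        /* Every machine has these messages, so no retransmission can ask for them. */
        history_discard_upto(lowest);
    }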
There is a subtle design point concerning this protocol that should be clarified. There are two ways to do the broadcast. In method 1 (described above), the user sends a point-to-point message to the sequencer, which then broadcasts it. In method 2, the user broadcasts the message, including a unique identifier. When the sequencer sees this, it broadcasts a special Accept message containing the unique identifier and its newly assigned sequence number. A broadcast is only ‘‘official’’ when the Accept message has been sent. The two methods are compared in Fig. 9.
Fig. 9. Two methods for doing reliable broadcasting.
These protocols are logically equivalent, but they have different performance characteristics.
In method 1, each message appears in full on the network twice: once to the sequencer and once from the sequencer. Thus a message of length m bytes consumes 2m bytes worth of network bandwidth. However, only the second of these is broadcast, so each user machine is only interrupted once (for the second message).
In method 2, the full message only appears once on the network, plus a very short Accept message from the sequencer, so only half the bandwidth is consumed. On the other hand, every machine is interrupted twice, once for the message and once for the Accept. Thus method 1 wastes bandwidth to reduce interrupts compared to method 2. Depending on the average message size, one may be preferable to the other. In summary, this protocol allows reliable broadcasting to be done on an unreliable network in just over two messages per reliable broadcast. Each broadcast is indivisible, and all applications receive all messages in the same order, no matter how many are lost. The worst that can happen is that a short delay is introduced when a message is lost, which rarely happens.
If two processes attempt to broadcast at the same time, one of them will get to the sequencer first and win. The other will see a broadcast from its competitor coming back from the sequencer, and will realize that its request has been queued and will appear shortly, so it simply waits.
The Fast Local Internet Protocol (FLIP)
Amoeba uses a custom protocol called FLIP (Fast Local Internet Protocol) for actual message transmission. This protocol supports both RPC and group communication and is below them in the protocol hierarchy. In OSI terms, FLIP is a network layer protocol, whereas RPC is more of a connectionless transport or session protocol (the exact location is arguable, since OSI was designed for connection-oriented networks). Conceptually, FLIP can be replaced by another network layer protocol, such as IP, although doing so would cause some of Amoeba’s transparency to be lost. Although FLIP was designed in the context of Amoeba, it is intended to be useful in other operating systems as well.
Protocol Requirements for Distributed Systems
Before getting into the details of FLIP, it is useful to understand something about why it was designed. After all, there are plenty of existing protocols, so the invention of a new one clearly has to be justified. The principal requirements that a protocol for a distributed system should meet are discussed below.
First, the protocol must support both RPC and group communication efficiently. If the underlying network has hardware multicast or broadcast, as Ethernet does, for example, the protocol should use it for group communication. On the other hand, if the network does not have either of these features, group communication must still work exactly the same way, even though the implementation will have to be different.
A characteristic that is increasingly important is support for process migration. A process should be able to move from one machine to another, even to one in a different network, with nobody noticing. Protocols such as OSI, X.25, and TCP/IP that use machine addresses to identify processes make migration difficult, because a process cannot take its address with it when it moves.
Security is also an issue. Although Amoeba's get-ports and put-ports already provide security (a server listens on a secret get-port, while clients send to the corresponding public put-port, derived from it by a one-way function), a security mechanism should also be present in the packet protocol so it can be used with operating systems that do not have Amoeba-type cryptographically secure addresses.
Another point on which most existing protocols score badly is network management. It should not be necessary to have elaborate configuration tables telling which network is connected to which other network. Furthermore, if the configuration changes, due to routers (gateways) going down or coming back up, the protocol should adapt to the new configuration automatically.
Finally, the protocol should work on both local and wide-area networks. In particular, the same protocol should be usable on both.
The FLIP protocol and its associated architecture were designed to meet all these requirements, although when used on wide-area networks, it is best suited to a modest number of sites. Consider a typical FLIP configuration of five machines, two on an Ethernet and four on a token ring. Each machine has one user process, A through E. One of the machines is connected to both networks, and as such automatically functions as a router. Routers may also run clients and servers, just like other nodes.
The kernel contains two layers. The top layer handles calls from user processes for RPC or group communication services. The bottom layer handles the FLIP protocol. For example, when a client calls trans, it traps to the kernel. The RPC layer examines the header and buffer, builds a message from them, and passes the message down to the FLIP layer for transmission. All low-level communication in Amoeba is based on FLIP addresses. Each process has one or more FLIP addresses: 64-bit random numbers chosen by the system when the process is created. If the process ever migrates, it takes its FLIP address with it. If the network is ever reconfigured, so that all machines are assigned new (hardware) network numbers or network addresses, the FLIP addresses still remain unchanged. It is the fact that a FLIP address uniquely identifies a process (or a group of processes), not a machine, that makes communication in Amoeba insensitive to changes in network topology and network addressing.
A FLIP address is really two addresses, a public-address and a private-address, related by Public-address = DES(private-address) where DES is the Data Encryption Standard. To compute the public-address from the private one, the private-address is used as a DES key to encrypt a 64-bit block of 0s. Given a public-address, finding the corresponding private address is computationally infeasible. Servers listen to private-addresses, but clients send to public-addresses, analogous to the way put-ports and get-ports work, but at a lower level.
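The derivation of a public-address can be illustrated with a few lines of C. OpenSSL's (deprecated) DES routines are used here only for illustration; Amoeba's FLIP layer naturally carries its own implementation.

    #include <openssl/des.h>
    #include <string.h>

    /* public-address = DES(private-address): encrypt a 64-bit block of zeros
       using the private-address as the DES key. */
    void flip_public_from_private(const unsigned char private_addr[8],
                                  unsigned char public_addr[8])
    {
        DES_cblock key, in, out;
        DES_key_schedule schedule;

        memcpy(key, private_addr, 8);
        memset(in, 0, 8);

        DES_set_key_unchecked(&key, &schedule);
        DES_ecb_encrypt(&in, &out, &schedule, DES_ENCRYPT);

        memcpy(public_addr, out, 8);
    }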
ADVANTAGES
FLIP has been designed to work not only with Amoeba, but also with other operating systems. A version for UNIX also exists, although for technical reasons it differs slightly from the Amoeba version. The security provided by the private-address, public-address scheme also works for UNIX to UNIX communication using FLIP, independent of Amoeba. Furthermore, FLIP has been designed so that it can be built in hardware, for example, as part of the network interface chip. For this reason, a precise interface with the layer above it has been specified. The interface between the FLIP layer and the layer above it (which we will call the RPC layer) has nine primitives, seven for outgoing traffic and two for incoming traffic. Each one has a library procedure that invokes it.
Operation of the FLIP Layer
Packets passed by the RPC layer or group communication layer to the FLIP layer are addressed by FLIP addresses, so the FLIP layer must be able to convert these addresses to network addresses for actual transmission. In order to perform this function, the FLIP layer maintains the routing table. Currently this table is maintained in software, but future chip designers could implement it in hardware.
Whenever an incoming packet arrives at any machine, it is first handled by the FLIP layer, which extracts from it the FLIP address and network address of the sender. The number of hops the packet has made is also recorded. Since the hop count is only incremented when a packet is forwarded by a router, the hop count tells how many routers the packet has passed through. The hop count is therefore a crude measure of how far away the source is. (Actually, things are slightly better than this, as slow networks count for multiple hops, with the weight a function of the network speed.) If the FLIP address is not presently in the routing table, it is entered. This entry can later be used to send packets to that FLIP address, since its network number and address are now known.
An additional bit present in each packet tells whether the path the packet has followed so far is entirely over trusted networks. It is managed by the routers. If the packet has gone through one or more untrusted networks, packets to the source address should be encrypted if absolute security is desired. With trusted networks, encryption is not needed.
The last field of each routing table entry gives the age of the routing table entry. It is reset to 0 whenever a packet is received from the corresponding FLIP address. Periodically, all the ages are incremented. This field allows the FLIP layer to find a suitable table entry to purge if the table fills up (large numbers indicate that there has been no traffic for a long time).
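Collecting the fields mentioned above, one routing-table entry can be pictured roughly as follows; the real kernel structure differs in detail.

    struct flip_route {
        unsigned char flip_addr[8];   /* 64-bit FLIP address of the process or group */
        int network;                  /* which attached network the entry refers to  */
        unsigned char net_addr[6];    /* network (e.g. Ethernet) address to send to  */
        int hop_count;                /* rough distance to the source                */
        int trusted;                  /* 1 if the path crossed only trusted networks */
        int age;                      /* reset on traffic, incremented periodically  */
    };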
Locating Put-Ports
To see how FLIP works in the context of Amoeba, let us consider a simple example. A is a client and B is a server. With FLIP, any machine having connections to two or more networks is automatically a router, so the fact that B happens to be running on a router machine is irrelevant. When B is created, the RPC layer picks a new random FLIP address for it and registers it with the FLIP layer. After starting, B initializes itself and then does a get_request on its get-port, which causes a trap to the kernel. The RPC layer computes the put-port from the get-port and makes a note that a process is listening to that port. It then blocks until a request comes in. Later, A does a trans on the put-port. Its RPC layer looks in its tables to see if it knows the FLIP address of the server process that listens to the put-port. Since it does not, the RPC layer sends a special broadcast packet to find it. This packet has a maximum hop count of 1 to make sure that the broadcast is confined to its own network. (When a router sees a packet whose current hop count is already equal to its maximum hop count, the packet is discarded instead of being forwarded.) If the broadcast fails, the sending RPC layer times out and tries again with a maximum hop count of 2, and so on, until it locates the server.
When the broadcast packet arrives at B’s machine, the RPC layer there sends back a reply announcing its get-port. This packet, like all incoming packets, causes A’s FLIP layer to make an entry for that FLIP address before passing the reply packet up to the RPC layer. The RPC layer now makes an entry in its own tables mapping the put-port onto the FLIP address. Then it sends the request to the server. Since the FLIP layer now has an entry for the server’s FLIP address, it can build a packet containing the proper network address and send it without further ado. Subsequent requests to the server’s put-port use the RPC layer’s cache to find the FLIP address and the FLIP layer’s routing table to find the network address. Thus broadcasting is only used the very first time a server is contacted. After that, the kernel tables provide the necessary information.
To summarize, locating a put-port requires two mappings:
1. From the put-port to the FLIP address (done by the RPC layer).
2. From the FLIP address to the network address (done by the FLIP layer).
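Schematically, the two mappings combine as in the sketch below, reusing the routing-table entry pictured earlier; the lookup and broadcast routines are placeholders, not real Amoeba kernel calls.

    extern int  port_cache_lookup(const void *put_port, unsigned char flip_addr[8]);
    extern int  routing_table_lookup(const unsigned char flip_addr[8], struct flip_route *r);
    extern void locate_by_broadcast(const void *put_port);

    int resolve_destination(const void *put_port, struct flip_route *route)
    {
        unsigned char flip_addr[8];

        if (!port_cache_lookup(put_port, flip_addr)) {   /* mapping 1: port -> FLIP address */
            locate_by_broadcast(put_port);               /* hop count 1, then 2, ...        */
            return -1;                                   /* retried once the server replies */
        }
        if (!routing_table_lookup(flip_addr, route))     /* mapping 2: FLIP -> network addr */
            return -1;
        return 0;
    }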
The advantage of this scheme over having just a single (port, network address) cache is that it permits servers to migrate to new machines or have their machines be wheeled over to new networks and plugged in without requiring any manual reconfiguration, as say, TCP/IP does. There is a strong analogy here with a person moving and being assigned the same telephone number at the new residence as he had at the old one. (For the record, Amoeba does not currently support process migration, but it could be added later.) The advantage over having clients and servers use FLIP addresses directly is that FLIP addresses are temporary, whereas ports may be valid for a long time. If a server crashes, it will pick a new FLIP address when it reboots. Attempts to use the old FLIP address will time out, allowing the RPC layer to indicate failure to the client. This mechanism is how at-most-once semantics are guaranteed. The client, however, can just try again with the same put-port if it wishes, since that is not necessarily invalidated by server crashes.
FLIP over Wide-Area Networks
FLIP also works transparently over wide-area networks. In Fig. 14 we have three local area networks connected by a wide-area network. Suppose the client A wants to do an RPC with the server E. A’s RPC layer first tries to locate the put-port using a maximum hop count of 1. When that fails, it tries again with a maximum hop count of 2. This time, C forwards the broadcast packet to all the routers that are connected to the wide-area network, namely, D and G. Effectively, C simulates broadcast over the wide-area network by sending individual messages to all the other routers. When this broadcast fails to turn up the server, a third broadcast is sent, this time with a maximum hop count of 3. This one succeeds. The reply contains E’s network address and FLIP address, which are then entered into A’s routing table. From this point on, communication between A and E happens using normal point-to-point communication. No more broadcasts are needed.
Communication over the wide-area network is encapsulated in whatever protocol the wide-area network requires. For example, on a TCP/IP network, C might have open connections to D and G all the time. Alternatively, the implementation might decide to close any connection not used for a certain length of time.
Although this method does not scale to large networks, we expect that for modest numbers it may be usable, based on our initial experiments with an internetwork of five networks on two continents. In practice, few servers move between sites, so that once a server has been located by broadcasting, subsequent requests will use the cached entries. Using this method, a modest number of machines all over the world can work together in a totally transparent way. An RPC to a thread in the caller’s address space and an RPC to a thread halfway around the world are done in exactly the same way.
Group communication also uses FLIP. When a message is sent to multiple destinations,
FLIP uses the hardware multicast or broadcast on those networks where it is available. On those that do not have it, broadcast is simulated by sending individual messages, just as we saw on the wide-area network. The choice of mechanism is made by the FLIP layer, with the same user semantics in all cases.
RUNNING AMOEBA
The main services of Amoeba are implemented in the servers, which therefore form an integral part of the Amoeba distributed operating system.
THE BULLET SERVER
Like all operating systems, Amoeba has a file system. However, unlike most other ones, the choice of file system is not dictated by the operating system. The file system runs as a collection of server processes. Any user who does not like the standard ones is free to write his own. The microkernel does not know, or care, which one is the "real" file system. In fact, different users may use different and incompatible file systems at the same time, if they so desire. In this section we will describe an experimental file server called the bullet server, which has a number of interesting properties.
The bullet server was designed to be very fast (hence the name). It was also designed to run on future machines having large primary memory, rather than low-end machines where memory is very tight. The organization is quite different from most conventional file servers. In particular, files are immutable. Once a file has been created, it cannot subsequently be changed. It can be deleted, and a new file created in its place, but the new file has a different capability than the old one. This fact simplifies automatic replication, as will be seen. In effect, there are only two major operations on files: CREATE and READ.
Because files cannot be modified after their creation, the size of a file is always known at creation time. This allows files to be stored contiguously on the disk, and also in the in-core cache. By storing files contiguously, they can be read into memory in a single disk operation, and they can be sent to users in a single RPC reply message. These simplifications lead to high performance.
The bullet server maintains a table with one entry per file, analogous to the UNIX i-node table. When a client process wants to read a file, it sends the capability for the file to the bullet server. The server extracts the object number from the capability and uses it as an index into the in-core i-node table to locate the entry for the file. The entry contains the random number used in the Check Field as well as some accounting information and two pointers: one giving the disk address of the file and one giving the cache address (if the file is in the cache). This design leads in principle to a simple implementation and high performance. It is well suited to optical juke boxes and other write-once media, and can be used as a base for more sophisticated storage systems.
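The table lookup and check-field test can be sketched as follows; the field names are illustrative rather than the actual bullet server data structures.

#include <stdio.h>

typedef struct { unsigned object; unsigned check; } capability;

struct inode {
    unsigned check;        /* random number the capability's Check Field must match */
    long     disk_addr;    /* where the contiguous file lives on disk */
    long     cache_addr;   /* where it lives in the in-core cache, or -1 if not cached */
};

static struct inode itable[] = {
    { 0x1234, 4096, -1 },
    { 0xbeef, 8192, 256 },
};

static const struct inode *lookup(capability cap)
{
    if (cap.object >= sizeof itable / sizeof itable[0])
        return NULL;                              /* no such object */
    const struct inode *ip = &itable[cap.object]; /* object number indexes the table */
    return ip->check == cap.check ? ip : NULL;    /* forged capabilities are rejected */
}

int main(void)
{
    capability good = { 1, 0xbeef }, bad = { 1, 0x0000 };
    printf("good: %s\n", lookup(good) ? "found" : "rejected");
    printf("bad:  %s\n", lookup(bad)  ? "found" : "rejected");
    return 0;
}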
THE DIRECTORY SERVER
Another interesting server is the directory server. Its primary function is to provide a mapping from human-readable (ASCII) names to capabilities. Users can create one or more directories, each of which contains multiple (name, capability-set) pairs. Operations are provided to create and delete directories, add and delete (name, capability-set) pairs, and look up names in directories. Unlike bullet files, directories are not immutable. Entries can be added to existing directories and entries can be deleted from existing directories.
The layout of an example directory with six entries is shown in Fig. 8. This directory has one row for each of the six file names stored in it. The directory also has three columns, each one representing a different protection domain. For example, the first column might store capabilities for the owner (with all the rights bits on), the second might store capabilities for members of the owner’s group (with some of the rights bits turned off), and the third might store capabilities for everyone else (with only the read bit turned on). When the owner of a directory gives away a capability for it, the capability is really a capability for a single column, not for the directory as a whole. When giving a directory capability to an unrelated person, the owner could give a capability for the third column, which contains only the highly restricted capabilities. The recipient of this capability would have no access to the more powerful capabilities in the first two columns. In this manner, it is possible to approximate the UNIX protection system, as well as devise other ones for specific needs.
Another important aspect of the directory server is that the entry for a given name in a given column may contain more than one capability. In principle it is a capability set, that is, a group of capabilities for replicas of the file. Because files are immutable, when a file is created, it is possible to install the newly generated capability in a directory, with the understanding that in due time, a specific number of replicas will be automatically generated and added to the entry. Thus an entry in a directory consists of a set of capabilities, all for the same file, and normally located on different bullet or other servers.
When a user presents the directory server with a capability for a (column of a) directory, along with an ASCII name, the server returns the capability set corresponding to that name and column. The user can then try any one of the servers to access the file. If that one is down, it can try one of the others. In this way, an extremely high availability can be achieved. The capability-set mechanism can be made transparent for users by hiding it in library procedures. For example, when a file is opened, the open procedure could fetch and store the capability set internally. Subsequently, the read procedure could keep trying capabilities until it found a functioning server. The key to the whole idea is that files are immutable, so that the replication mechanism is not subject to race conditions and it does not matter which capability is used, since the files cannot change.
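The library-level fallback might look roughly like the sketch below, with invented names (try_read(), read_any_replica()) in place of the real stubs: the capability set is simply tried in order until one replica's server answers.

#include <stdbool.h>
#include <stdio.h>

typedef struct { unsigned object, check; bool server_up; } capability;  /* simplified */

/* Hypothetical stub: the RPC fails if the server holding that replica is down. */
static bool try_read(capability cap, char *buf, size_t bufsize)
{
    if (!cap.server_up)
        return false;
    snprintf(buf, bufsize, "contents of object %u", cap.object);
    return true;
}

static bool read_any_replica(const capability *set, int n, char *buf, size_t bufsize)
{
    for (int i = 0; i < n; i++)
        if (try_read(set[i], buf, bufsize))     /* first live server wins; replicas are identical */
            return true;
    return false;                               /* every replica unreachable */
}

int main(void)
{
    capability set[3] = { {7, 1, false}, {7, 2, false}, {7, 3, true} };  /* first two servers down */
    char buf[64];
    if (read_any_replica(set, 3, buf, sizeof buf))
        printf("%s\n", buf);
    return 0;
}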
THE REPLICATION SERVER
Objects managed by the directory server can be replicated automatically by the replication server. It runs continuously in the background, scanning the directory system periodically. Whenever it finds a directory entry that is supposed to contain n capabilities but contains fewer, it arranges for additional copies to be made. This works best for immutable files, since they cannot change while replication is in progress.
It also runs the aging and garbage collection mechanism used by the bullet and other servers.
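One pass of such a scan could look like the following sketch; the entry layout and make_copy() are illustrative only.

#include <stdio.h>

struct entry {
    const char *name;
    int want;      /* replicas the entry is supposed to have */
    int have;      /* replicas currently present in the capability set */
};

static void make_copy(struct entry *e)
{
    e->have++;     /* a real server would copy the immutable file to another bullet server */
    printf("replicated %s (%d/%d)\n", e->name, e->have, e->want);
}

int main(void)
{
    struct entry dir[] = { { "paper.tex", 3, 3 }, { "results.dat", 3, 1 } };
    /* One pass of the periodic scan: top up every under-replicated entry. */
    for (int i = 0; i < 2; i++)
        while (dir[i].have < dir[i].want)
            make_copy(&dir[i]);
    return 0;
}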
THE RUN SERVER
The run server decides on which machine, and hence which architecture, a process should run. It manages a pool of processors, sorted by architecture. A program may be compiled for several architectures, so looking its name up yields a directory containing an executable for each architecture available. The run server examines the appropriate pools and determines which machines have enough memory to run the program. Using GETLOAD calls it keeps an approximate picture of the memory and CPU usage of each of its pool processors, and each candidate processor estimates how much compute power it can spare for the new process (based on its processor speed and the number of threads already running). The run server then chooses the processor with the most available processing capacity.
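The selection itself amounts to a filter-and-maximize loop, sketched below with illustrative field names: machines without enough free memory are skipped, and of the remainder the one reporting the most spare compute power wins.

#include <stdio.h>

struct pool_proc {
    const char *name;
    long free_mem;      /* bytes reported by the GETLOAD-style call */
    double spare_power; /* estimated spare compute power, higher is better */
};

static const struct pool_proc *choose(const struct pool_proc *pool, int n, long mem_needed)
{
    const struct pool_proc *best = NULL;
    for (int i = 0; i < n; i++) {
        if (pool[i].free_mem < mem_needed)
            continue;                                   /* cannot hold the program */
        if (!best || pool[i].spare_power > best->spare_power)
            best = &pool[i];
    }
    return best;                                        /* NULL if no machine qualifies */
}

int main(void)
{
    struct pool_proc sparc_pool[] = {
        { "pool-1",  8L << 20, 0.4 },
        { "pool-2", 32L << 20, 0.9 },
        { "pool-3", 16L << 20, 0.2 },
    };
    const struct pool_proc *p = choose(sparc_pool, 3, 12L << 20);
    printf("run on %s\n", p ? p->name : "(nowhere)");
    return 0;
}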
THE TCP/IP SERVER
To establish a connection across non-Amoeba networks, an Amoeba process does an RPC with the TCP/IP server, giving it a TCP/IP address. The caller is blocked until the connection has been established or refused. In reply, the server provides a capability for using that connection. Subsequent RPCs can send and receive packets from the remote machine without the Amoeba process having to know that TCP/IP is being used.
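A hypothetical client-side sequence is sketched below; tcpip_connect() and tcpip_send() are invented names, not the actual server stubs, but they show the blocking connect followed by capability-based use of the connection.

#include <stdbool.h>
#include <stdio.h>

typedef struct { unsigned object, check; } capability;

/* Hypothetical stub: blocks until the remote side accepts or refuses the connection. */
static bool tcpip_connect(const char *host, int port, capability *conn)
{
    printf("connecting to %s:%d ...\n", host, port);
    conn->object = 42;
    conn->check  = 0x77;                         /* pretend the server accepted */
    return true;
}

/* Hypothetical stub: later RPCs carry data over the connection the capability names. */
static void tcpip_send(capability conn, const char *data)
{
    printf("send on connection %u: %s\n", conn.object, data);
}

int main(void)
{
    capability conn;
    if (!tcpip_connect("non-amoeba-host.example.org", 25, &conn))
        return 1;                                /* connection refused */
    tcpip_send(conn, "HELO amoeba");
    return 0;
}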
THE BOOT SERVER
As a final example of an Amoeba server, let us consider the boot server. The boot server provides a degree of fault tolerance by checking that all servers that are supposed to be running are in fact running, and taking corrective action when they are not. A process that is interested in surviving crashes can register with the boot server; the two agree on how often the boot server should poll, what it should send, and what reply it should get. As long as the server responds correctly, the boot server takes no further action.
However, if the server should fail to respond after a specified number of attempts, the boot server declares it dead, and arranges to allocate a new pool processor on which a new copy is started. In this manner, critical services are automatically rebooted if they should ever fail. The boot server could itself be replicated, to guard against its own failure (although this is not done at present).
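The polling logic can be sketched as below, with illustrative names and a registration record standing in for the agreed polling contract: after MAX_ATTEMPTS consecutive missed replies the server is declared dead and a fresh copy is started on a newly allocated pool processor.

#include <stdbool.h>
#include <stdio.h>

#define MAX_ATTEMPTS 3

struct registration {
    const char *name;
    int  missed;                /* consecutive probes that went unanswered */
    bool alive;                 /* stands in for the probe RPC getting the agreed reply */
};

static void restart_on_new_processor(struct registration *r)
{
    printf("boot server: %s declared dead, starting a new copy\n", r->name);
    r->missed = 0;
    r->alive  = true;
}

static void poll_once(struct registration *r)
{
    if (r->alive) {
        r->missed = 0;          /* correct reply: nothing to do */
        return;
    }
    if (++r->missed >= MAX_ATTEMPTS)
        restart_on_new_processor(r);
}

int main(void)
{
    struct registration dirsvr = { "directory server", 0, false };
    for (int round = 0; round < MAX_ATTEMPTS; round++)
        poll_once(&dirsvr);     /* the third missed probe triggers the restart */
    return 0;
}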
OTHER SERVERS
- I/O Servers
- Time Servers
- Random Number Servers
- Mail Servers
CONCLUSION
The Amoeba project has clearly demonstrated that it is possible to build an efficient, high performance distributed operating system on current hardware. The object-based nature of the system and the use of capabilities provide a unifying theme that holds the various pieces together. Because the kernel is kept as small as possible, most of the key features are implemented as user processes, which means that the system can evolve gradually as needs change and more is learned about distributed computing. Amoeba has been operating satisfactorily for several years now, both locally and to a limited extent over a wide-area network. Its design is clean and its performance is excellent. By and large, the results have satisfied everyone involved. Nevertheless, no operating system is ever finished, and more work is still being done on Amoeba.
FUTURE DEVELOPMENTS
On the hardware side, the question of what types of special-purpose hardware could be used to advantage by MISD is being considered. DMA controllers could reduce communication costs as well as provide support for operations, while object-based processors (with separate address spaces, tagged data, etc.) could speed up object invocation.
On the software side, we plan to investigate various optimizations. MISD increases the opportunities for optimization by exposing cache coherency to software control and by using a hierarchical architecture. Software-controlled caching will allow intermediate replication strategies to be explored, where operations on some shared data are broadcast and operations on other shared data are performed locally or remotely. Pipelining (i.e., allowing more than one outstanding operation) and optimistic execution (i.e., performing operations immediately and then recovering if necessary) are other potential enhancements whose implementations may benefit from a software-based scheme. The advantage of a hierarchical architecture is that, by treating each subnet in the hierarchy as a single entity, the cost of otherwise intractable optimizations (or their approximations) may become less prohibitive.
Bibliography
REFERENCE: Andrew S. Tanenbaum, Distributed Operating Systems, Pearson Education Asia.
INTERNET: http://www.cs.vu.nl (Vrije Universiteit, Amsterdam, The Netherlands).