An important component of 712 is critical reading of the assigned papers and coming to class ready to discuss them. To help you in this process, we require that you hand in a short review of each paper at the beginning of the class during which we will discuss the paper.
Your reviews should:
We do not want a book report or a repeat of the paper's abstract. Rather, we want your considered opinions about the key points indicated above. Of course, if you have an insight that doesn't fit the above format, please include it as well.
We want the reviews to be short, between 1/4 and 1/2 a page. Reviews must be typed.
All reviews will be counted, and a random sample of the reviews will be graded. Do not skip class or come late just to finish a review -- we expect that everyone will miss a few, though skipping more than 10% will have a negative impact on your grade.
Date: 09/18
From: Nitin Parab
Soft Updates: A Solution to the Metadata Update Problem in File Systems. Gregory Ganger et al.

Take home message: 1. Understand the requirements of the system well. Do not merge two related requirements into one. If one component of the system has requirement set {A} and another has requirement set {B}, then do not support {A} AND {B} for the entire system. Rather, satisfy {A} for the first component and satisfy {B} for the other component. 2. Sometimes you can solve a problem by looking at it from a different level of abstraction of the system. (If you can't solve a layout problem in a building because you are looking at it as a system made up of m floors, look at it as a system made up of n apartments. Maybe you'll get the solution now.)

Metadata update problem: 1. First view: Various metadata and data modifications must be made in such a way that the system has a consistent view, so that it can recover in the presence of unpredictable failures. Solution from this viewpoint: propagate changes to stable storage in a specific order, synchronously. 2. Second view: The application (one component of the system) should always see the most recent copy of the metadata blocks, and the disk (another component) should always see copies that are consistent with its other contents (albeit slightly older copies). Solution from this viewpoint: each component of the system needs a consistent view, but two different components may have different views. Delay metadata updates to disk. When writing a block, if there are any dependencies that require other blocks to be written before the metadata in the current block can go to disk, then those parts of the metadata in the current block are rolled back to an earlier state (one that is consistent with the current on-disk state). During the disk write, the block is locked to prevent applications from seeing the rolled-back state. When the disk write completes, any undone updates in the source memory block are restored before the block is unlocked.

Cyclic dependency problem: Most metadata blocks contain many pointers (many inodes, for example), so cyclic dependencies occur frequently if dependencies are recorded only at the block level. However, if you track dependencies on a per-pointer basis instead of a per-block basis, dependency cycles do not occur, because independent sequences of updates remain independent and no single sequence is cyclic (see the paper for details).

What more could have been done? A measurement of the overhead introduced by the system (like the syncer task), and a study of the performance difference between No-Order and Soft Updates.

The Design and Implementation of a Log-Structured File System.

Take home message: 1. Sometimes it is possible to completely eliminate a problem rather than finding an optimized solution for it. One way to improve FS performance is to find a rotationally optimal block placement policy so as to improve (synchronous) write performance. Another solution is to do away with synchronous writes entirely. 2. There is no such thing as a free lunch. In traditional UNIX (FFS) the problem was "how do you force the disk into a rotationally optimal distribution?" The problem is hard, as it requires one to predict file usage patterns. The log-structured FS eliminates this problem but introduces a new one: how do you force the disk into a "bimodal segment distribution" (and, of course, the overhead of the cleaner process)? That distribution depends on the stability of the data in a segment, but stability cannot be predicted without knowing file access patterns. Fortunately the latter problem is much easier and matters only at high disk utilization. 3. Identify the problem correctly. What is the problem with current FS designs? First, they spread data blocks (and metadata blocks) across the disk, causing too many small accesses. Second, they write synchronously to disk (in order to take care of the metadata update problem). The log-structured FS clubs metadata and all data together into long segments (the log), resulting in single long sequential disk writes.

Question: Would an FFS with support for "clustering" of data blocks (large sequential writes) and soft updates (no synchronous writes, asynchronous metadata updates) give better performance? I believe that a systems paper that reports a totally new policy (design and implementation) should have an abstract philosophical (almost religious) discussion. The paper should compare the two approaches at an abstract philosophical level (filtering out all implementation details). After all, each operating system is a different religion!
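To make the rollback idea concrete, here is a minimal Python sketch of per-pointer dependency tracking at block-write time, in the spirit of soft updates. The Block and PointerUpdate classes and the write_block function are invented for illustration; they are not the paper's actual data structures.

    # Minimal sketch: roll back only the pointers whose dependencies are not yet
    # on disk, write the block, then restore the new values in memory.

    class PointerUpdate:
        def __init__(self, slot, old_value, new_value, depends_on=None):
            self.slot = slot                # which pointer inside the block
            self.old_value = old_value      # on-disk-consistent value to roll back to
            self.new_value = new_value
            self.depends_on = depends_on    # block that must reach disk first, or None

    class Block:
        def __init__(self, nslots):
            self.slots = [0] * nslots
            self.pending = []               # per-pointer dependency records
            self.on_disk = False

    def write_block(block, disk):
        """Write one block, temporarily undoing updates whose dependencies
        are not yet on disk (the in-memory copy is 'locked' during the write)."""
        rolled_back = []
        for upd in block.pending:
            if upd.depends_on is not None and not upd.depends_on.on_disk:
                block.slots[upd.slot] = upd.old_value   # roll back this pointer only
                rolled_back.append(upd)
        disk.append(list(block.slots))                   # simulated disk write
        block.on_disk = True
        for upd in rolled_back:
            block.slots[upd.slot] = upd.new_value        # restore before unlocking
        # satisfied updates are done; rolled-back ones stay pending for a later write
        block.pending = [u for u in block.pending if u in rolled_back]

    # Usage: an inode block whose new pointer depends on a not-yet-written data block.
    disk = []
    data_blk = Block(1)
    inode_blk = Block(4)
    inode_blk.slots[2] = 99
    inode_blk.pending.append(PointerUpdate(2, 0, 99, depends_on=data_blk))
    write_block(inode_blk, disk)      # disk sees the rolled-back (old) pointer value
    print(disk[-1])                   # [0, 0, 0, 0] -- consistent with on-disk state

The point is simply that the on-disk image stays self-consistent while the in-memory copy keeps the newest values.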
Date: 09/20
From: Sagar Jyoti Chaki
Andrew File System:
-------------------
The following points about the paper are noteworthy:
* The main concept behind AFS is caching of files by the client, in order to avoid remote opens and reduce the latencies of file operations, since most of these latencies are due to the network.
* AFS is capable of supporting a few hundred thousand users simultaneously. It has achieved such a high degree of scalability by ensuring that the bulk of the work is done on the client side and the server is used only for tasks essential for maintaining the integrity and consistency of the file system. Moreover, a single server process is used to serve all clients, and a concept of LWPs, which is very similar to today's threads, is used.
* Following Amdahl's law, the paper stresses how the implementation of AFS has strived to make the common case faster. An independently optimized protocol is used to transfer data in bulk. A special callback mechanism, which significantly reduces network traffic, is used for cache validation. A flat file naming scheme is used for faster translation.
The major drawback of the paper is that only a single synthetic benchmark was used to judge its performance and compare it with other similar systems. This does not guarantee similar performance in real-life scenarios. The paper highlights the process of using experimental results to modify a prototype and convert it into a more advanced product that can perform admirably in real-life situations.

CODA:
-----
* Any mobile application must be geared to cope with low bandwidth, high latency and expensive connectivity. It must be able to adapt to orders-of-magnitude variance in connectivity in terms of bandwidth, latency and cost. It must be able to do so with minimal human assistance. Coda does well on all these counts. The design of Coda is guided by the following principles:
* Do not punish strongly connected clients.
* Do not make life worse than when disconnected.
* Do it in the background if you can.
* When in doubt, seek user advice.
* A major mechanism used by Coda is disconnected operation, where, in the presence of negligible or no connectivity, the client behaves as if network connectivity has been lost. However, this mechanism has significant limitations too.
* Again, as with AFS, Coda employs several specialized techniques to improve performance in slow networks. Both RPC2 and SFTP have been tuned for such situations. Cache coherence granularity has been raised, and a multilevel coherence detection scheme is employed to speed up this process. An asynchronous update propagation scheme called trickle reintegration is used for propagating updates in the background without unnecessarily blocking other processes. Finally, user assistance is sought for cache misses in situations where fetching the unavailable file(s) would take longer than a certain threshold.
The major problem with the paper is that no real-life experiments were carried out to verify the performance of the system. Instead, trace simulation was used. It is unclear how variance in network connectivity was simulated. The system should have been subjected to a more robust examination. Once again this paper illustrates how newer application domains might necessitate appropriate modifications in the design of a system. Such modifications should not only be based on the characteristics of the environment, measured directly or indirectly; their usefulness should also be validated through rigorous experiments.
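As a rough illustration of the callback mechanism described above, here is a toy Python sketch. The Server and Client classes and their method names are invented for this sketch; they are not the Vice/Venus interface.

    # Callback-based cache validation in the spirit of AFS: the server promises
    # to notify the client before the cached copy becomes stale, so repeated
    # opens generate no validation traffic.

    class Server:
        def __init__(self):
            self.files = {}        # name -> contents
            self.callbacks = {}    # name -> set of clients holding a callback promise

        def fetch(self, client, name):
            self.callbacks.setdefault(name, set()).add(client)   # record the promise
            return self.files[name]

        def store(self, name, contents):
            self.files[name] = contents
            for client in self.callbacks.pop(name, set()):
                client.break_callback(name)                       # invalidate remote caches

    class Client:
        def __init__(self, server):
            self.server = server
            self.cache = {}        # name -> (contents, callback_valid)

        def open(self, name):
            entry = self.cache.get(name)
            if entry and entry[1]:                 # callback still valid: no server traffic
                return entry[0]
            contents = self.server.fetch(self, name)
            self.cache[name] = (contents, True)
            return contents

        def break_callback(self, name):
            if name in self.cache:
                self.cache[name] = (self.cache[name][0], False)

    # Usage: the second open hits the cache; a remote store breaks the callback.
    srv = Server(); srv.files["/afs/paper.txt"] = "v1"
    c = Client(srv)
    print(c.open("/afs/paper.txt"))   # fetch + callback
    print(c.open("/afs/paper.txt"))   # served from cache, no validation message
    srv.store("/afs/paper.txt", "v2") # server breaks the callback
    print(c.open("/afs/paper.txt"))   # re-fetch, sees "v2"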
Date: 09/25
From: Pedro Vaz Artigas
Serverless Network File Systems

The paper presents a novel way to implement a distributed file system: the components of the server may be located on any machine connected to the network. The motivation for this solution is that, assuming network speed is faster than local disk access, disk accesses can be performed in parallel on a number of systems in the network, resulting in better performance even when compared to what can be obtained from a local disk. Caches also play a significant role, as the machines in a network have an aggregate cache that is much larger than what can be cached by a single machine; therefore cooperative caching can also lead to better performance than that of the local disk. Strategies are presented to ensure cache coherence. Also, writes are postponed, local file systems use logging to speed up writes, and cache coherence ensures that clients get the most up-to-date value in spite of write-back caching. The architecture of the system is based on four entities. Clients generate access requests. Managers are responsible for controlling the metadata and for cache coherence for the files they manage. Storage servers physically store data on disk; data is organized in segments as in LFS, and a segment may span multiple servers to improve performance. Stripe groups are sets of servers that cooperate in order to store data; each segment is stored in a stripe group, and a storage server may be part of more than one stripe group. Cleaners are responsible for segment cleaning, as in LFS. The only drawback to this file system architecture is that the assumption that networks are switched and have much higher performance than a local disk does not really hold. Fast local disks (10 MB/s) are more common these days than a fast switched network with similar bandwidth (switched Fast Ethernet, about 100 Mbit/s, i.e. about 12 MB/s), and results were presented for an 80 MB/s network and 3 MB/s disks. So it is not clear that this architecture would provide the best performance in today's environments.

A Cost-Effective, High-Bandwidth Storage Architecture

The paper presents a new storage architecture whose main advantage is providing scalability at a much lower price than current server-based architectures. The idea is to enable disks to be connected directly to the network, avoiding the cost of file servers utilized for storing data and avoiding delays and bottlenecks due to the fact that the server has to store-and-forward the data. Also, I/O bandwidth may be scaled more easily by attaching new disks directly to the network, assuming that networks are switched or have adequate bandwidth to enable utilization of the full disk bandwidth. This strategy requires disks, called NASDs, that are capable of implementing protocols more elaborate than the ones they currently employ; also, more file-system-related tasks, such as disk layout, are left to the disk. Given the amount of circuitry that may be added to the ASICs responsible for controlling disks if newer process technologies are employed, this requirement may be fulfilled. File servers are still required in order to enforce access control, global naming, concurrency control and cache coherence. Disks provide all the data, organized in objects, and verify that a client has the privilege to access each piece of data. The file server is responsible for issuing a ticket to each client, and the disk is able to verify that tickets are valid without any server intervention. This security model is based on the capabilities model. Given that disks may supply data directly to the client, the file server bottleneck for data requests is removed. In order to show that the model presented is adequate and may be utilized to implement a distributed file system, as current applications are not NASD-aware, the authors present an NFS implementation based on their NASD disk prototype. Also, a benchmark for a NASD-aware application is presented; it utilizes a parallel file system implemented on top of the NASD disks, and that application scales better under the NASD model than under a similar conventional configuration with an NFS distributed file system.
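A hedged sketch of how such capability tickets can be checked by the drive without contacting the server, using a keyed MAC over the object id and rights. The ticket layout and the names below are assumptions for illustration only, not NASD's actual wire format.

    # Capability-style access in the spirit of NASD: the file manager issues a
    # ticket (object id, rights, MAC keyed with a secret it shares with the
    # drive); the drive checks the MAC locally, with no server round trip.
    import hmac, hashlib

    SHARED_KEY = b"manager-and-drive-secret"   # provisioned out of band

    def issue_ticket(object_id, rights):
        msg = f"{object_id}:{rights}".encode()
        mac = hmac.new(SHARED_KEY, msg, hashlib.sha256).hexdigest()
        return (object_id, rights, mac)

    def drive_check(ticket, requested_op):
        object_id, rights, mac = ticket
        msg = f"{object_id}:{rights}".encode()
        expected = hmac.new(SHARED_KEY, msg, hashlib.sha256).hexdigest()
        return hmac.compare_digest(mac, expected) and requested_op in rights

    # Usage: the client presents the ticket directly to the drive.
    t = issue_ticket(42, "read")
    print(drive_check(t, "read"))        # True  -- allowed without contacting the manager
    print(drive_check(t, "write"))       # False -- rights not granted
    forged = (42, "read,write", t[2])
    print(drive_check(forged, "write"))  # False -- MAC does not cover the forged rights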
Date: 10/02
From: Nat Lanza
Cluster I/O with River: Making the Fast Case Common

The River paper describes an interesting flow model for I/O-intensive computation in clusters, and a programming environment to take advantage of it. The main technical advancements described are their distributed queues, which allow producers and consumers of data to work at different rates while balancing the workload, and the graduated declustering method of using mirrored data, which helps insulate the system from performance perturbations on some nodes. Their model is also interesting, and involves individual I/O modules tied together by flows, potentially allowing complex applications to be cobbled together from a number of small and perhaps standard modules. Unfortunately, the paper does not spend much time discussing the programming environment for River. The authors mention a graphical environment, but then spend the rest of the paper discussing the performance and applications of their distributed queues and graduated declustering.

Implementing Remote Procedure Calls

This paper describes an experimental RPC system for the Cedar environment, and focuses on the binding system and the custom transport protocol. The binding system is interesting in that it incorporates a database to allow clients to look up servers and servers to register their presence; instead of simply requiring clients to know where the servers are, it provides a relatively secure discovery mechanism. The protocol is also interesting; the authors rejected the then-current PUP byte stream protocols, reasoning that a large setup/teardown time would ruin performance. Another interesting detail was their emphasis on keeping the RPC interface as close to the native procedure call interface as possible -- they avoided RPC timeouts because local procedure calls do not have timeouts. Unfortunately, while the paper claims that RPC will make building distributed applications much simpler, it doesn't provide much evidence for that. The authors do admit that they are still in the early stages of their work, but it would be good to see more examples of this, as well as some more discussion of their stub generator.
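A very rough toy model of the distributed-queue idea: consumers that pull at different rates automatically receive proportionally different shares of the work. The round-based rate model below is invented for this sketch and is not River's implementation.

    # Producers push records into a shared pool; each consumer pulls at its own
    # rate, so a perturbed (slower) consumer simply ends up with less of the work.
    from collections import deque

    def run(records, consumer_rates):
        queue = deque(records)
        done = {i: [] for i in range(len(consumer_rates))}
        while queue:
            for i, rate in enumerate(consumer_rates):
                for _ in range(rate):            # 'rate' records per round
                    if not queue:
                        break
                    done[i].append(queue.popleft())
        return {i: len(v) for i, v in done.items()}

    # Consumer 1 is perturbed (half speed); it naturally receives half the load
    # of the others, instead of stalling the whole job.
    print(run(list(range(300)), consumer_rates=[2, 1, 2]))   # {0: 120, 1: 60, 2: 120}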
Date: 10/04
From: Sagar Chaki
CS 15-712: Advanced OS & Distributed Systems Paper Review - due 10.04.00 Sagar Chaki

ABACUS:
-------
* Optimal partitioning of application and file-system functionality within a cluster of clients is very hard. The solution depends on cluster and workload characteristics in a complicated manner. Many of these characteristics cannot be predicted before running the application, and some of them (like the availability of a server) may change dynamically during execution. Hence universal solutions do not exist, and any attempt at such a solution will lead to sub-optimal and often catastrophic performance.
* The ABACUS framework strives to achieve an optimal solution for a restricted class of applications, viz. those that manipulate large data sets. It dynamically alters function placement on the basis of black-box monitoring of function characteristics like inter-object communication and per-object resource requirements.
* Experiments indicate that ABACUS achieves a 2-10X performance improvement and successfully adapts to both long-term and short-term runtime variations.
The success of ABACUS relies critically on the effectiveness of the analytical model used. It is unclear how robust the model is, how quickly it adapts to variations, and whether the model itself is static or can be modified dynamically. It is also unclear how accurate the estimates of the black-box monitoring system are. The benchmarks used to evaluate ABACUS seem to be synthetic. The ABACUS system tries to avoid unnecessary and expensive generalization. In a situation where there is no clear "best-for-all-situations" solution, ABACUS addresses the problem in a specific domain and does an extremely good job of it.

Emerald:
--------
* Process migration is a well-studied technique for achieving optimal performance in distributed systems. Emerald is an object-oriented language and system that aims to provide efficient object mobility.
* Emerald claims the following unique features in its support for mobility:
* It is object based - the unit of distribution and mobility is an object. This enables Emerald to provide mobility at all levels of granularity.
* Emerald has language support for mobility.
* The Emerald language allows the user to supply additional information about any object with its definition. For example, an object may be attached to other objects. Such information is used by the compiler to generate smarter code.
The major problem with Emerald is that there seems to be no support for coping with variations in system characteristics or breaches of security. For example, the paper assumes all nodes to be similar (even at the assembly language level) and trustworthy. The experiments carried out do not gauge Emerald's performance in heterogeneous or malicious environments. Moreover, using it effectively requires mastering a new language. It is also unclear how difficult it would be to port existing distributed applications to Emerald. Emerald tries to extend the object-oriented paradigm to a distributed setting. It provides yet another example of how techniques that have been successful in one domain can be applied to others.
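A back-of-the-envelope sketch of the kind of placement decision ABACUS's black-box monitoring is meant to drive. The two-term cost model, the function name and the numbers are assumptions for illustration, not the paper's analytic model.

    # Decide where a data-intensive function should run, given the bytes it
    # consumes and produces (as the monitoring layer would measure them).

    def best_placement(bytes_in, bytes_out, net_bw_bytes_per_s):
        # Running the function at the client ships the full input over the network;
        # running it near the storage ships only the (possibly reduced) output.
        cost_at_client = bytes_in / net_bw_bytes_per_s
        cost_at_storage = bytes_out / net_bw_bytes_per_s
        return "client" if cost_at_client < cost_at_storage else "storage"

    # A filter that reduces 1 GB of input to 10 MB is better pushed to the storage side.
    print(best_placement(1_000_000_000, 10_000_000, 10e6))   # 'storage'
    # A function that expands its input is better left at the client.
    print(best_placement(10_000_000, 1_000_000_000, 10e6))   # 'client'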
Date: 10/09
From: Nitin Parab
15-712 Nitin Parab Monday 9th Oct.

ExoKernel & SPIN: Extensible, Application-Specific Operating Systems

What's wrong with current systems:
1. High-level abstractions: Current operating systems virtualize resources with high-level abstractions like processes, files, etc. These interfaces tend to be generic and introduce a semantic gap between application requirements and the low-level hardware resources. The OS should export low-level primitives which are flexible and versatile enough to support any high-level programming abstraction and to be efficiently mapped onto any hardware.
2. Resource management policies executed oblivious to the application: Low-level events like blocking for device I/O, page faults, processor allocation, etc. are not made available to the application. Thus applications cannot implement their own resource management policies. Applications know their requirements far better than the OS does, hence application-controlled resource management results in improved performance. The operating system should only enforce protection boundaries between resources and allocate resources among the applications. Management and policy functionality like resource scheduling should be left up to the applications.

Requirements of application-specific operating systems (and how they are satisfied in Exokernel and SPIN):
1. Securely expose hardware: In Exokernel the operating system is a small kernel which exports a low-level interface to the application level. These primitives export resources like the CPU, disk, memory, TLBs, etc. SPIN allows application code to be linked into the kernel. The kernel provides a set of interfaces to core system services. SPIN uses language support to authenticate modules and prevent malicious access to resources. It also implements capabilities, once again using language support. Exokernel uses architectural support (protection mechanisms) to securely export a low-level interface (system calls). It also uses self-authenticating capabilities to grant resource access, and it may cache these capabilities. It also uses a mechanism for downloading application code, like SPIN.
2. Expose events: SPIN delivers events to the application by calling (a procedure call) the application-specific safe handler extensions in the kernel. In Exokernel the events are delivered to the application's OS library. It also supports application-specific handlers being uploaded into the kernel.
3. Expose allocations: SPIN exports an interface which allows safe kernel modules to request resources with attributes (like a physical page with particular attributes). Exokernel exports a similar interface to the application layer.
4. Expose names: Exokernel exports the native physical name of a resource, like the address of the physical page. It appears that SPIN does not do so.
5. Expose revocation: SPIN exposes revocation by delivering a revocation event just like any other event. Exokernel also delivers revocation events to the application layer like other events.

Analysis:
1. Exokernel and SPIN provide similar features; the main difference is the implementation mechanism. SPIN uses language support to securely extend the kernel. Exokernel securely exports hardware resources to the application level using hardware support (the system call interface).
2. SPIN's event delivery mechanism is very efficient, just a procedure call. Exokernel's event delivery mechanism is expensive since it has to cross a protection domain. Thus applications with fine-grained events will not perform well on Exokernel.
Exokernel addresses this problem by supporting secure code downloading, just like SPIN, for fine-grained events like network packet processing.
3. Both of the application examples given in the SPIN paper are applications fully implemented inside the kernel. These applications do not cross the kernel-to-user boundary, so they do not incur the cost of protection domain crossings (which Exokernel applications will incur). Hence these are not the right applications with which to evaluate performance. Not all applications can be fully written in the kernel, e.g. a web server that supports scripts. The Exokernel approach thus sounds more generic.
4. At first glance it may appear that Exokernel has problems securely exposing hardware resources like the network and disk. However, note that Exokernel requires hardware virtualization of resources, as with virtual memory. Current network and disk controllers require software (OS) virtualization. However, recent developments like protocol offloading (silicon TCP) and active object-based disks do provide virtualization at the device level. Thus it looks like an Exokernel OS will be able to securely expose all hardware without even support for code downloading.
5. The SPIN paper does not discuss the effects on global system performance.
6. The Exokernel paper should have had a discussion of the architectural requirements for Exokernel operating systems.
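A schematic sketch of the "expose events" idea discussed above, with the kernel supplying only the mechanism (delivering a revocation event) and the application-level library OS supplying the policy. All class and method names are invented; this is neither Exokernel's nor SPIN's actual interface.

    # Mechanism in the kernel, policy in the application: the library OS
    # registers a handler, and the kernel calls it when pages must be revoked.

    class LibraryOS:
        def __init__(self):
            self.pages = ["app-page-%d" % i for i in range(4)]

        def on_revocation(self, how_many):
            # Application-specific policy: give up the pages it values least.
            victims = self.pages[:how_many]
            self.pages = self.pages[how_many:]
            return victims

    class ExoKernelSketch:
        def __init__(self):
            self.handlers = {}

        def register(self, event, handler):
            self.handlers[event] = handler        # library OS installs its policy

        def revoke_pages(self, how_many):
            # The kernel only delivers the event; the application chooses the victims.
            return self.handlers["revoke"](how_many)

    libos = LibraryOS()
    kernel = ExoKernelSketch()
    kernel.register("revoke", libos.on_revocation)
    print(kernel.revoke_pages(2))   # the application chose which pages to give back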
Date: 10/11
From: Nitin Parab
15-712 Nitin Parab Wednesday 11th October.

File System Development with Stackable Layers. Heidemann et al.

Composable OS kernel: One way of building a composable OS kernel is to support a stackable layered design (e.g., the STREAMS subsystem in UNIXes); another is a symmetric interface, syntactically identical above and below. Thus new layers can easily be added or removed at any level in the stack. These stacks can be made extensible by having layers support new operations; existing layers adapt automatically by forwarding any operation they do not recognize to the layer below (see the sketch at the end of this review). Layers often execute within a single protection domain (address space independence). Being symmetric, stackable layers support layer substitution (it is easy to substitute TCP with UDP in STREAMS). One can have non-linear stacking, for example TCP and UDP layers on top of IP in STREAMS. Reusable services that occur in the middle of a large layer can be extracted out as a whole new (semantics-free) layer, with two cooperating layers around it. This extracted layer can then be used by multiple other layers/services. The architecture also supports stacking layers at user level.

Analysis: 1. The paper should have discussed the implementation and its layer interface in more detail. 2. Sections 2 and 3, which talk about stackable layers, could have been omitted, as the issues pointed out are well known and tried and tested in systems like the STREAMS subsystem. 3. Porting from BSD to SunOS 4.x is not a good argument for portability, as these systems are very similar to each other. 4. The paper fails to show how a file system (say UFS) can be divided into layers. It talks about UFS (or LFS) as being one whole layer.

===================

Scripting: Higher-Level Programming for the 21st Century. John K. Ousterhout.

Composable systems: The basic requirement of composable systems is the "glue logic". One way of having glue logic is to define interfaces and dynamically load modules that will interact with other modules through these interfaces (e.g., STREAMS stacks or stacks in a stackable FS). In strongly typed languages these interfaces require objects of specific types, and the compiler prevents any other type of object from being used with the interface. Thus the strongly typed nature of systems programming languages discourages reuse and the building of composable systems. A typeless language makes it much easier to hook components together. Classic examples are filter programs in UNIX shells. Scripting languages are glue languages or system integration languages. Scripting and system programming are symbiotic: system programming is used to make cool components, and scripting languages are used to glue these components together to build cool applications. Scripting languages provide a platform for very rapid development of applications of a gluing nature.

Concerns regarding scripting languages: 1. Their typeless nature makes them unsafe: strong typing allows errors to be detected at compile time; in contrast, scripting languages do error checking at runtime. 2. Performance: scripting languages are interpreted and hence are not as efficient as systems languages. However, scripting languages are used to glue components together, and hence performance depends more on the components, which are written in systems languages.

Scripting is getting popular: 1. The rise of GUIs. GUIs are fundamentally gluing applications. 2. The growth of the Internet. Again, the Internet is a gluing tool. 3. The popularity of component frameworks. Without a good scripting language to manipulate the components, much of the power of a component framework is lost. 4. Improvements in scripting technology. 5. Changes in the programmer community. Casual programmers want a tool that is easy to learn.
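The sketch referred to above: a minimal Python illustration of a symmetric stackable interface in which a layer forwards any operation it does not recognize to the layer below. The layer names are made up, and this is not the paper's vnode interface.

    # Symmetric stacking with automatic bypass: a layer handles the operations
    # it knows about and forwards everything else to the layer below unchanged.

    class Layer:
        def __init__(self, below=None):
            self.below = below

        def __getattr__(self, op):
            # Default 'bypass': forward unrecognized operations down the stack.
            return getattr(self.below, op)

    class UFSLayer(Layer):
        def read(self, path):
            return f"data from {path}"

    class CompressionLayer(Layer):
        def read(self, path):
            return "decompress(" + self.below.read(path) + ")"
        # No 'stat' method here: stat calls fall through to the layer below.

    class StatLayer(Layer):
        def read(self, path):
            return self.below.read(path)
        def stat(self, path):
            return {"path": path, "size": 4096}

    stack = CompressionLayer(StatLayer(UFSLayer()))
    print(stack.read("/etc/motd"))   # handled by compression, then the layers below
    print(stack.stat("/etc/motd"))   # CompressionLayer forwards stat automatically

Adding a new operation to a lower layer thus requires no change to the layers stacked above it, which is the extensibility argument the review summarizes.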
Date: 10/14
From: Ken Tew
Ken Tew 15-712 Advanced OS and Distributed Systems Lecture 9

"Reflections on Trusting Trust", "Why Cryptosystems Fail", "Crisis and Aftermath"

These three papers give motivation and some guidelines on how to design and maintain security for a system. Note that designing the system is just the first step in security. Other important (and often overlooked) security issues are educating users and maintaining the system.

Designing and verifying system security. When designing system security, the first step is to consider what is being protected against. A threat model is an outline of what you are trying to protect a system from: what types of attacks, and from whom. Typical questions to ask in making threat models: What are you trying to protect against? Are you simply trying to prevent data from being corrupted or altered? Are you trying to prevent denial of service? Are you trying to prevent unauthorized reading of the data? Who are you trying to protect against? Are you trying to prevent misbehavior by privileged users? How many privileged users would be required to conspire in order to break the security? Are you trying to prevent outside access?

Independent verification. The people designing a security system should not be the ones to verify that it is secure. If a designer has overlooked some security hole when designing the system, he or she is less likely to think of it while testing the system. Additionally, someone whose job is to verify the system will be more likely to look for back-doors the designer may have put into the system.

Educating users about the importance of following security procedures. Once you have a security system in place, it is worthless if no one will use it. It is important that system users be educated about what the security procedures are, as well as why it is important to follow them. This also means procedures must be made simple enough that users can follow them. Additionally, it may mean showing management that the cost of not implementing security can be greater than the cost of implementing it.

Maintaining the system once it is in place. Monitoring usage: it is important to keep an eye on how the system is being used. Irregularities may indicate a break-in or attempted break-in. Also, monitoring may be able to pinpoint weaknesses in the system before someone can exploit a security hole. Updating the system: simply because a system is reasonably secure today doesn't mean that the same system will still be secure in the future. Systems need to be updated to a) patch any security holes that have been discovered and b) update cryptographic keys that may no longer be valid. Catching and punishing security violators: it is impossible to make a usable system 100% secure, therefore deterring people from violating system security is also important. Punishment may range from reprimanding employees for being lax in following security procedures to criminal convictions.
Date: 11/01
From: Zia Syed
The Byzantine Generals Problem

The basic idea in this paper is to show how a fault-tolerant distributed system can be created. The problem is modeled in terms of an army in which some of the generals are traitors. The generals have to agree on a common decision based on the views of the others, and a few traitors should not be able to mislead the loyal generals into making a wrong move. The decision can be as simple as deciding that either they all attack or they all retreat. The problem becomes difficult because the loyal generals do not know which generals are the traitors, and the traitor generals can give false and malicious information to the loyal generals. They show that m traitors can defeat any solution involving fewer than 3m + 1 generals, but with at least 3m + 1 generals a solution exists. This problem can also be mapped onto the problem of many processes trying to agree on something in a distributed manner while some processes are trying to cheat and mislead the others, for example when sharing resources. However, everyone must know the decision function and must use the same decision function, e.g. the majority function or the median of an ordered set of values. One solution that they present assumes that the generals have unforgeable signatures which can be verified, so that forgeries can be detected. Their algorithm achieves the aim of all the loyal generals agreeing on a decision by receiving the same set of messages. A nice feature of this is that the algorithm works correctly regardless of the number of traitors. However, it has a huge overhead of an approximately exponential number of messages. Another variation of this problem is when majority voting is used for critical decisions. In this case two correct processors will reach the same decision if both receive the same input, so the issue becomes reliable transfer of data, ensuring that everyone receives the correct input. In this case a faulty communication line is indistinguishable from a general sending a false message, so this situation is the same as the original problem, and the system works correctly as long as fewer than 1/3 of the participants (generals or lines) are faulty. If we do not have unforgeable signatures, then an alternative is a communication link between every pair of generals. Another issue is the absence of a message, which requires a timeout mechanism; this in turn requires synchronized clocks and an upper bound on the maximum time for a message to get through, which is another problem of its own.

----

Time, Clocks, and the Ordering of Events in a Distributed System

This paper gives a distributed algorithm for synchronizing a system of logical clocks, which can be used to totally order events. The main problem being addressed is that in a distributed system, a collection of processes communicate with one another by exchanging messages, and messages take a non-negligible time to reach the other process. Thus the notion of 'before', i.e. which event occurred earlier, is hard to establish, and different processes may have a different relative ordering of events. We want this ordering to be the same, or at least partially the same, so that the notion of what happened across all processes is consistent. The events occurring in a single process have an a priori total ordering. They define a relation -> which basically means that a -> b implies that a causally affects b and therefore must occur before b. Using logical clocks, which are just counters and may not have any physical meaning of time, they define the ordering so that if an event a occurs before an event b, then the time of a comes before the time of b. A simple implementation of this logical clock C is that within any process, the clock is simply incremented between events occurring in the process. Every process then timestamps the messages it sends, and the receiving process updates its clock to be greater than both the timestamp and its current time. This ensures that the receiving time of the message comes after the sending time of the message, and gives us a partial ordering of events. To break ties we can use an arbitrary ordering of the processes, and this gives us a total ordering. A central server that allocates resources according to the order in which messages are received can violate causality; by defining a total ordering of all events, this problem is overcome, since the total ordering specified by the algorithm makes sure that allocation and release are done in the right order. This is achieved by sending timestamped messages to all processes, which ACK back with their own timestamps. A process can only execute its command once its request precedes all other pending requests and it has received messages from every other process with timestamps later than its own request. The problem with this is that a process must know about all the other processes, and the failure of any single process stalls the algorithm. Certain external events may also be handled by modeling them into the system and handling them in the same way. When using physical clocks, from the rates of the clocks we can derive upper bounds on the synchronization error in terms of the minimum delay for a message to be received by the other end. For this they require that the clocks are never set back.
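A minimal sketch of the logical clock rules summarized above: increment the local counter between events, stamp outgoing messages, and on receipt advance the local clock past the message's timestamp; ties in the partial order are broken by process id to get a total order. The Process class is invented for illustration.

    class Process:
        def __init__(self, pid):
            self.pid = pid
            self.clock = 0

        def local_event(self):
            self.clock += 1
            return (self.clock, self.pid)

        def send(self):
            self.clock += 1
            return self.clock                     # timestamp carried by the message

        def receive(self, msg_timestamp):
            # Advance past both the message timestamp and the local clock.
            self.clock = max(self.clock, msg_timestamp) + 1
            return (self.clock, self.pid)

    p, q = Process(1), Process(2)
    a = p.local_event()          # (1, 1)
    ts = p.send()                # message stamped 2
    b = q.receive(ts)            # (3, 2): the receive is ordered after the send
    print(a, b, sorted([a, b]))  # (clock, pid) tuples sort into a total order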
Date: 11/06
From: Sagar Chaki
Cost of Quality in Internet-style Networks:
-------------------------------------------
* Quality of Service (QoS) is not a major issue in traditional circuit-switched telephony networks, which have been designed to satisfy the human ear and do so extremely well. But with the proliferation of packet-switched networks and of applications with diverse requirements that rely critically on such networks providing certain service guarantees, QoS is rapidly becoming an important part of agreements between service providers and their customers. Broadly, the QoS of a wide-area network measures how well it does its job - how quickly and reliably it transfers data from source to destination. Technically, QoS refers to a collection of performance metrics, primarily the following five: availability, throughput, packet loss, latency and jitter.
* Different data streams have different QoS requirements. At network aggregation points, these data streams are combined for transport over a common infrastructure. A mechanism is needed to label flows with their priorities and to recognize these labels and act on them. Unfortunately, such mechanisms are absent in the popular TCP/IP suite used for the vast majority of Internet communication. The IETF has proposed several methods to rectify this situation. Notable among them are integrated services (IntServ), differentiated services (DiffServ) and multiprotocol label switching (MPLS).
* The common open policy service (COPS) is a tool for assuring the QoS of a network. It is extremely adaptable to the customer's requirements. The requirements and rules for resource allocation, known as policies, are decided in advance. This allows services to be specified unequivocally and the resources required to provide them to be allocated. The key elements of a policy-based traffic management system are policy creation and storage, interpretation, and enforcement.
The paper discusses several interesting and promising schemes but provides very little evidence of their effectiveness. The techniques suggested are basically attempts to incorporate ATM-like service guarantees into TCP/IP. The driving principles behind the design of ATM have been extremely successful on small and reasonably secure LANs. It remains to be seen whether they remain effective on a network as large, as distributed and as insecure as the Internet. The paper shows how an existing infrastructure can be suitably modified to satisfy unforeseen requirements. Internet-style networks, which are undoubtedly the most prevalent, can be made to assure QoS by introducing appropriate mechanisms. The alternative is to develop a totally new network infrastructure, which would be extremely expensive, if not impossible.

Generalized Rate Monotonic Scheduling:
--------------------------------------
* The generalized rate monotonic scheduling (GRMS) theory provides an analytic and engineering basis for the development, modification and maintenance of real-time systems. It guarantees that all tasks will meet their deadlines if the total system utilization of these tasks lies below a known bound and the tasks are scheduled using appropriate algorithms.
* A set of n independent tasks scheduled by the rate monotonic algorithm will always meet their deadlines, for all task start times, if the total utilization is no greater than n(2^(1/n) - 1). This bound is pessimistic, since the worst-case task set is contrived and highly unlikely to occur in practice. For a set of independent periodic tasks, if a task meets its first deadline when all the higher-priority tasks are started at the same time, then it can meet all of its future deadlines for any task start times. The scheduling of aperiodic tasks can also be treated within the rate-monotonic framework for scheduling periodic tasks, using the concepts of sporadic servers and budgets.
* Synchronization between tasks can lead to unbounded priority inversion. This problem can be avoided using a simple priority inheritance protocol in which a low-priority task that blocks a high-priority task temporarily inherits the priority of the blocked task. However, this scheme can still lead to long periods of blocking and to deadlock. These issues are solved using a different priority inheritance scheme called the priority ceiling protocol.
The paper does not cover scenarios where the completion time of a task is probabilistic. The framework discussed in the paper is only effective when worst-case completion times are known. In the face of uncertainty, certain weaker guarantees might still be provided. The paper proves that formalism can be used to obtain guarantees even in system design and development. The GRMS framework assures that all tasks will meet their deadlines if certain criteria are satisfied. It is powerful enough to handle aperiodic tasks as well as tasks that depend on each other.
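A quick numerical check of the utilization bound quoted above; the task parameters below are made up, and note that the bound is a sufficient condition only (a task set that exceeds it may still be schedulable).

    # Liu-and-Layland bound: n independent periodic tasks scheduled
    # rate-monotonically always meet their deadlines if total utilization
    # does not exceed n*(2**(1/n) - 1).

    def rm_bound(n):
        return n * (2 ** (1.0 / n) - 1)

    def schedulable_by_bound(tasks):
        """tasks: list of (computation_time, period) pairs."""
        utilization = sum(c / p for c, p in tasks)
        return utilization <= rm_bound(len(tasks)), utilization

    for n in (1, 2, 3, 10):
        print(n, round(rm_bound(n), 3))  # 1.0, 0.828, 0.78, 0.718 (limit is ln 2 ~ 0.693)

    tasks = [(1, 4), (1, 4), (1, 4)]     # total utilization = 0.75
    print(schedulable_by_bound(tasks))   # (True, 0.75) -- below the 3-task bound of ~0.78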
Date: 11/06
From: Zia Syed
Practical Byzantine Fault Tolerance
-----------------------------------
The paper presents BFS, a Byzantine-fault-tolerant NFS service, and the theory underlying it. The authors argue that previous Byzantine-fault-tolerant systems have not been practical: they either make a synchrony assumption or exist only to prove theoretical results. The synchrony assumption is dangerous in the presence of malicious attacks, because attackers can delay communication from non-faulty nodes (a denial-of-service attack) until they are tagged as faulty. The system presented in the paper provides practical Byzantine fault tolerance that doesn't assume synchrony and is fast enough to be used in practice. It provides two properties, safety and liveness, which means that if no more than (n-1)/3 replicas are faulty, then clients eventually receive replies to their requests and delay(t) does not grow faster than t indefinitely, where delay(t) is the time between the moment t when a message is sent for the first time and the moment when it is received by its destination. These two properties are provided by the state machine replication algorithm. The service is modeled as a state machine that is replicated across the different nodes of a distributed system. The replicas move through a succession of configurations called views. In each view one replica is the primary and the others are backups. View changes are carried out when it appears that the primary has failed (hence assuring the liveness property). A client sends a request to the replica it believes to be the primary. The primary multicasts the request to all backups. The backups send their results directly to the client. The client waits for f+1 replies from different replicas with valid authenticators and the same result; this is the result of the operation. f is the maximum number of replicas that may be faulty. The system doesn't use digital signatures for message authentication, because verification is costly and the algorithm requires it many times; the system uses message authentication codes (MACs) instead. It is able to provide the same level of security as digital signatures by taking advantage of the invariants imposed on the system. The system also uses two other optimizations for faster response time: reducing the size and number of messages, and incremental checkpoint management. The paper presents a practical example of such a service, BFS, and shows that it is just 3% slower than a non-replicated version and works well. The system assumes the faults occurring in the replicas to be independent, targeting basically non-deterministic software errors, which are the most problematic and persistent and the hardest to detect. Independence of faults can be encouraged by having the replicas and operating systems implemented by different teams (N-version programming); it also requires each replica to have a different administrator and root password. All these assumptions seem reasonable except having different OS implementations, which might not always be feasible.

Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
---------------------------------------------------------------------------------
The paper presents a state machine model for tolerating faults in a client-server setting. Using a single server is the simplest approach to client-server computing but doesn't provide any fault tolerance. If fault tolerance is required, then a number of servers that fail independently are needed.
The paper argues that most fault tolerance protocols can be modeled using the state-machine approach. The server can be modeled as a state machine. A state machine consists of a number of variables, which encode its state, and commands, which transform its state. The state machine is deterministic. Clients request service using the commands and thereby transform the state of the server. There is a causal ordering on the processing of requests. The author then presents two representative classes of failures, Byzantine and fail-stop. To quantify fault tolerance, the paper defines a t fault-tolerant system to be one that can meet its specification provided that no more than t components fail during some interval of interest. The paper argues that measuring a system's fault tolerance in terms of t is more meaningful from the system design point of view than MTBF, which is more user-oriented. A t fault-tolerant state machine is implemented by replicating the state machine and running a replica on each of the processors in a distributed system. The key to implementing a t fault-tolerant state machine is to ensure replica coordination: all non-faulty replicas receive and process all the requests (Agreement), and all non-faulty replicas process the requests in the same relative order (Order). Knowledge of the commands can allow the replica coordination requirement to be weakened and hence allows cheaper protocols to be used for replica management. The Agreement requirement can be satisfied using a Byzantine agreement protocol (Lamport). Order can be implemented using Lamport's scheme of logical clocks or synchronized real-time clocks. In order to implement a t fault-tolerant system that can tolerate faulty output devices, one needs to know whether the output of the whole replicated state machine ensemble is to be used within or outside the system. If the output is used outside the system, e.g. by a device, then tolerating the failure of a single voter is not enough; the system must also be able to tolerate the failure of the output device. This is done by replicating the output device and the voter. Each voter combines the outputs of all state machine replicas and produces a signal that drives its associated output device. If the output is used inside the system, i.e. by a client, then the client can itself combine the outputs of the state machine replicas. Here the voter is considered to be part of the client. If the voter is faulty, then the client itself is faulty, so we don't need to worry about incorrect output; it is the client's own business now. A fault-tolerant system also needs to tolerate a faulty client that may corrupt the state machine replicas and thereby cause erroneous output to be generated in the future in response to requests from non-faulty clients. This situation can be handled by replicating the clients themselves. This approach requires a change in the state machine replicas, because each state machine would need to get a request from each client replica. It would also require the state machine implementation to know application-specific information about the clients for cases where each client replica may send a slightly different input value (e.g. a different sensor value, because each replica read the sensor at a slightly different time). The final value can be computed using a median or some more complex algorithm, such as the fault-tolerant intersection algorithm [Marzullo 1989]. If it is not possible to replicate a client, then defensive programming techniques can be used to protect the state machine replicas from getting corrupted. In such techniques the state machine is formulated so that a faulty client cannot corrupt its state. These techniques are also very application-specific and require a mathematically rigorous model of the state machine, so that only valid sequences of commands may be executed by a client.
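A small sketch of the client-side voting rule common to both papers above: with at most f faulty replicas, a value reported by f+1 different replicas must have come from at least one non-faulty replica. The reply format and the accept_result helper are invented for this sketch.

    # With n = 3f+1 replicas, a client accepts a result once f+1 replicas with
    # valid authenticators report the same value.
    from collections import Counter

    def accept_result(replies, f):
        """replies: list of (replica_id, value) pairs, at most one per replica."""
        counts = Counter(value for _, value in replies)
        for value, votes in counts.items():
            if votes >= f + 1:
                return value          # backed by at least one non-faulty replica
        return None                   # keep waiting (or retransmit the request)

    f = 1                                    # tolerate one Byzantine replica out of 4
    replies = [(0, "ok:v7"), (1, "ok:v7"), (2, "bogus"), (3, "ok:v7")]
    print(accept_result(replies, f))         # 'ok:v7'
    print(accept_result([(2, "bogus")], f))  # None -- not enough matching replies yet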