712 Topics and Readings (Fall 2000)

For each class meeting, readings are assigned. Usually, the readings will consist of two or three hard-core technical papers. You are expected to read these papers thoroughly and summarize them BEFORE arriving at class. For each class meeting, we identify the topic and papers below; for each, we also try to identify good sources for background reading and for further investigation.

Electronic versions are linked where available (access will be denied for IP addresses outside of CMU); see the course secretary for paper copies. We will try to provide hard copies of the assigned readings at least a week in advance.

(NOTE: THIS SCHEDULE IS HIGHLY PRELIMINARY, IN TERMS OF LECTURE CONTENTS, LECTURE ORDER, AND ESPECIALLY DATES. ALSO, GUEST LECTURES ARE STILL BEING SET UP. ALL IS SUBJECT TO CHANGE.)

September 13 : Welcome to 712

Because of the late start to 712, we will start quickly. This first meeting will be more than just organizational in nature. In it, we will discuss how the class is going to work and what will (and won't) be covered --- see the 712 overview for a recap of the general information. In addition, we will dive into the material. This will include very rapidly recapping stuff you should already know (e.g., stuff covered in 15-412), discussing what defines operating systems and distributed systems and what makes them continue to be interesting after all these years, and overviewing how the various topics in the course fit together.

The papers listed above are fun and insight-filled, talking generally about the construction and history of computer systems. We strongly encourage you to read them, but are not requiring the standard written summary for them. The Lampson paper, in particular, is something that you should read now, read again in a few weeks, and then put into the pile of papers that you re-read every year or so. A book that should go in this same category is The Mythical Man-Month: Essays on Software Engineering (by F. Brooks, Addison-Wesley Publishers).

September 18 : File Systems and Disk Management

The next three classes will use file systems as a concrete example for the various general problems introduced on day one (and covered throughout the course. This first one will focus on local file systems, including such issues as performance enhancement techniques (caching, prefetching, layout), naming and metadata, integrity maintenance for crash recovery, use of multiple devices for performance, and use of multiple devices for reliability/availability. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe two file system implementation techniques specifically focused on tuning performance in response to particular technological advances. Each describes some of the trends leading to their designs, and also discusses some of the additional aspects listed above. The first responds to large caches by optimizing for writes; note that, like most big ideas, the merits of LFS have been the source of heated debate. The second responds to the expense of synchronous I/O operations by integrity-maintaining write caching for metadata; it also requires evaluation relative to other options (e.g., write-ahead logging).

This class topic will move rapidly through a variety of file system and disk management issues, since you should have a solid base already (from your undergraduate OS course). Good sources for more background on how disks and disk systems work are An Introduction to Disk Drive Modeling (by C. Ruemmler and J. Wilkes, in IEEE Computer magazine, March 1994) and RAID: High-Performance, Reliable Secondary Storage (by P. Chen, et. al, in ACM Computing Surveys, June 1994). For more background in file systems, we suggest chapters 6, 7 and 8 of The Design and Implementation of the 4.4BSD Operating System (by M.K. McKusick, K. Bostic, M. Karels and J. Quarterman, Addison-Wesley publishers), Practical File Systems Design with the Be File System (by D. Giampaolo, Morgan Kaufmann Publishers). There's not much in the way of good background material about file system caching. There are general rules like, "most new data is over-written or deleted within 30 seconds" and "bigger caches tend to capture more reads". There are also lots of papers that make specific points about file caches. Some good ones are: Disk Cache--Miss Ratio Analysis and Design Considerations by A. Smith in ACM TOCS (Aug. 1985), Analysis of the Periodic Update Write Policy for Disk Cache Systems by S. Carson and S. Setia in IEEE TOSE (Jan. 1992), and A Trace-Driven Analysis of the UNIX 4.2 BSD File System by J. Ousterhout, et al., in the ACM SOSP Conference of 1985.

September 20 : Distributed File Systems

The second class in the file systems series will focus on distributed file systems, which are file systems whose functionality is split between client machines and file servers that are connected via some kind of network. Topics touched on will include client/server organization, how it works (RPC), how it differs from the local file system case, cache coherence, concurrency control, data consistency, performance enhancement, scalability of clients supported. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe distributed file system designs. The first is a landmark paper that describes AFS and how its design enhances scalability. The second explores how a distributed file system might want to change in response to particular changes to the underlying hardware and usage characteristics -- specifically, it looks at what happens when clients become mobile computers and the client/server connections become slow and intermittent.

For additional background on distributed file systems, a couple of good sources are Chapter 5 of Distributed Operating Systems (by A. Tanenbaum, Prentice Hall Publishers) and Chapter 9 of The Design and Implementation of the 4.4BSD Operating System (by M.K. McKusick, K. Bostic, M. Karels and J. Quarterman, Addison-Wesley publishers). For further reading, we suggest many of the papers referenced in the papers you will read for the next class meeting (NASD, xFS).

September 25 : Distributed Storage Services

The third class in the file systems series will focus on distributed storage services, which include file systems whose server-side functionality is distributed among several file servers. Topics touched on will include partitioning and coordination, naming and resource discovery, reliability/availability, security, data consistency, dealing with scale. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe distributed storage service designs. The first describes Network-Attached Secure Disks (NASD), which is an architecture designed to cost-effectively deliver scalable storage bandwidth via a particular partitioning of functionality among clients, servers, and network-attached storage devices. The second describes an ambitious cluster storage system design (xFS) that is meant to avoid having any central point of failure or performance bottleneck.

There isn't a particularly good source of background reading on this topic, but the assigned readings do identify a number of other system designs. We suggest following up on papers referenced therein as a start toward digging deeper.

October 2 : Communication Models

Having plowed through a concrete set of advanced OS and basic distributed systems examples, we now start looking at aspects in a more general light. Our first such aspect is the communication mechanism among components of a distributed system. The focus is not on the particular network protocol, but on how the communication is viewed by the parts and how it all hangs together. Some particular topics for today include remote procedure call (RPC), implementing RPC, data streams, load balancing, and work stealing. As always, along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today highlight particular communication models. The first is a classic paper discussing implementation details involved with RPCs, which are the foundation of most distributed systems. The second is a more recent paper that describes a (possibly) more appropriate mechanism for data-intensive cluster applications.

For both background reading and further reading, we suggest looking at general distributed systems books, such as Tanenbaum's Distributed Operating Systems, Coulouris, et al.,'s Distributed Systems, or Mullender's Distributed Systems.

October 4 : Function Placement

An important topic in distributed systems in function placement, including both deciding what to run where and instantiating these decisions. Issues involved with deciding what should run where include inter-function communication, parallelism, load balancing, and security. Instantiating these decisions involves communication issues discussed in previous class meetings. In some cases, it also includes moving functionality from one place to another, which introduces the topic of mobile code and such issues as execution environment heterogeneity, virtual machine environments, re-binding, and encapsulation/isolation for protection. In examining these issues, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe systems with particular approaches to making and instantiating function placement decisions. The first describes Emerald, a system that realizes fine-grained mobility by encapsulating data and threads into mobile objects. The second describes Abacus, a system that dynamically adjusts function plaement decisions based on black-box monitoring of runtime performance.

There is not a lot of background literature about function placement (check out the papers' Related Work sections to dig further), but there is a huge amount of literature on mobile code systems. One interesting place to look is at UMBC's Agent Resources website. Another is the Object Management Group's website.

October 9 : OS Structure and Extensibility

The structure of operating systems has been the topic of much research and debate, since it has such a large impact on complexity, performance, flexibility, robustness, security, etc. This class meeting will look at various options and their trade-offs, including monolithic (e.g., linux, Windows), microkernels (e.g., Mach), virtual machine monitors (e.g., VM/370), and recent research systems. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe two recently researched operating system structures focused on enabling safe extensibility. The first describes Spin, which uses a type-safe programming language (Modula-3) to enable code to be safel loaded into its kernel. The second describes exokernels, which minimize kernel functionality to that required for protection, pushing all high-level abstractions into application libraries wher it can be replaced by application writers at will. It is worth noting that not everyone buys into the value of extensible systems -- a fun rebuttal is given in Extensible Systems are Leading OS Researchers Astray (by P. Druschel, V. Pai, W. Zwaenepoel, IEEE Workshop on Hot Topics in Operating Systems (HotOS-6), May 1997, pp. 38-42).

Background material for this topic is just basic operating systems, though specifically relevant discussion can be found in Tanenbaum's Modern Operating Systems. For additional details on the research systems described in the papers, see the Spin and exokernel project pages.

October 11 : Composable Systems

As computer systems become more and more complex, it becomes important to be able to compose them of substantial pre-existing components. Far from new, composability has appeared in many aspects of system construction in many different fashions. This class meeting will look at various forms of composability that have arisen in systems, such as module switches (e.g., vnode interfaces), stackable layers (e.g., in file systems and networks), pipelined applications (e.g., the UNIX programming model), tweakable interfaces, and glue-logic via scripting. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe two specific examples of composability in systems. The first describes stackable file systems, which allow functionality to easil be added and subtracted incrementally. The second discusses the importance of scriptin languages (e.g., Perl and Tcl) in the construction and extension of systems.

I don't have a lot to offer in terms of suggestions for background reading at this time. The related work sections of the assigned papers provide some suggestions for further reading.

October 16 : Security Overview

Computer system security is a topic of growing importance (and great confusion). This class meeting will explore the basics of computer sstem security as it relates to operating systems and distributed systems. Specific topics will include what security is really all about, some examples of real problems, the role of cryptography, assurrance, trust, and blame. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today discuss interesting aspects of computer system security. The first is a brief lecture given by Ken Thompson when accepting his Turing Award; it makes an interesting point about trust and security. The second discusses why cryptography does not solve the security problem, by drawing on real-world examples, and some consequences of this fact. The third describes the notorious Internet Worm and some lessons regarding the real problems in computer security; sadly the lessons haven't changed much 12 years later.

There is a huge and growing literature on security. One good overview is "The Protection of Information in Computer Systems" by J. Saltzer and D. Schroeder in Proceedings of the IEEE (September 1975). It is a bit dated (and therefore doesn't include some recent problems/solutions), but is quite comprehensive in the fundamentals. A couple of good overview books are Computer Security (by D. Gollmann, Wiley Publishing) and Security in Computing (by C. Pfleeger, Prentice Hall Publishers).

October 18 : Exam #1

In-class exam.

October 30 : Concurrency, Threads, Transactions

One of the most basic (and yet persistently complex) aspects of both operating systems and distributed systems is dealing with concurrent threads of control. While your undergraduate OS course undoubtedly spent a lot of time talking about monitors and semaphores, there is a lot more to concurrency than that. This class meeting will discuss a number of such topics, including locking, avoiding deadlock, optimistic concurrency control, avoiding livelock, leases, and transactions. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today differ in what they offer. The first describes experiences with thread programming, offering a number of insights from serious usage. The second describes an optimistic approach to concurrency control, which allows locks to be completely avoided.

For additional background on thread programming, we suggest An Introduction to Programming with Threads (A. Birrell, DEC Technical Report TR-35, DEC/SRC, January 1989; the instructors will be happy to provide you with a copy), Principles of Concurrent and Distributed Programming (by M. Ben-Ari, Prentice Hall Publishers) or Programming with POSIX Threads (by D. Butenhof, Addison-Wesley Publishers).

November 1 : Event Ordering and Multi-Party Consensus

The nature of distributed systems is such that autonomous systems don't see the same things, don't see things at the same time, and can't even easily agree on what time it is. These problems make it difficult for each to reason about what the others are up to and makes it difficult for systems to come to consensus. This class meeting will discuss various issues of event ordering and coming to consensus. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today discuss different aspects of this general area of problem. The first discusses the problems of event ordering and the distributed clock synchronization. The second discusses the problem of coming to consensus when some systems misbehave.

Much of the theoretical side of distributed systems is more or less focuses on these two problems. Thus, many good distributed systems books address them thoroughly, including those on the reserved list.

November 6 : QoS and Multimedia Support (Rajkumar)

Many modern applications, such a video playback and games, require some degree of real-time guarantees. For many embedded environments, these guarantees are critical to correct operation. This class meeting will discuss important aspects of providing for quality of service and how it can be done. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today are still to be determined.

There is a huge literature and effort in real-time systems, including annual conferences and a technical committee within the IEEE.

November 8 : Impact of Mobile Computing (Satya)

Mobility has become a fact of life in modern computing, and it changes a number of aspects of both operating systems and distributed systems. This class meeting will discuss some of them. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

No papers are assigned for today.

Mobility and its impact on systems is the topic of much ongoing research (and a lot of figure-it-out-as-we-go practice). One good source of "what's hot" is the ACM Special Interest Group on Mobile Computing.

November 13 : Group Communication and State Machines

As we've discussed, one reason for replacing centralized services with distributed systems is fault tolerance. Dealing with both clean (fail-stop) failures and misbehaving systems (e.g., compromised systems) can be done by having multiple systems doing exactly the same thing in lockstep. This class meeting will look at approaches and technologies for accomplishing such function replication. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe this method in general and in a particular application, distributed file service. The first provides a tutorial-like overview of replicated state machines. The second describes a fault-tolerant NFS replacement that employs some well-tuned algorithms.

The first paper actually provides some pretty good background discussion, and we encourage those looking for further reading to follow up the Related Work sections.

November 15 : Fault Tolerance

Fault-tolerance is a broad and important topic in operating systems and distributed systems. This class meeting will discuss a range of issues and approaches, complementing previous discussions. Some topics to be discussed include data redundancy, failover, checkpointing, state machines, N-versioning, fault containment, isolation, and failstop versus byzantine failures. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe fault tolerance strategies for particular systems. The first describes fault management in Quicksilver, which relied heavily on transactions and their recoverability properties. The second describes Hive, a cellular operating system architecture for large multiprocessor systems.

Fault-tolerance is another of the large areas of research and practice. For example, see the IEEE Technical Committee on Fault-Tolerant Computing. For some intersting and scary stories of insufficiently fault tolerant systems and their consequences, see Neumann's Computer Related Risks or Peterson's Fatal Defect.

November 20 : Exam #2

In-class exam.

November 27 : Project Status Presentations

In class presentations, by the students doing them, of projects and their midpoint status.

November 29 : Cluster-Based Computing

An alternative to having large, expensive, multi-processor computer systems is to treat a networked set of PCs or workstations as a parallel computer. Many have promoted and explored this approach, and we will discuss some of the trade-offs and issues involved. In particular, many trade-offs revolve around the performance and usability aspects of the most important design consideration of this approach -- how to program such a system. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today focus on particular approaches to this problem. The first describes SoftFLASH, yet another distributed shared memory system, and analyzes its performance on the kinds of parallel applications relevant to such computation. The second describes Shasta, which focuses on getting SMP applications to work on clusters without any programmer effort or attention.

Cluster computing is a fairly old topic that has received a lot of attention in recent years. A good, accessible explanation of why is given in A Case for Networks of Workstations: NOW (by T. Anderson, D. Culler, D. Patterson and the NOW team, in IEEE Micro, February 1995).

December 4 : Distributed OSes

An alternate to treating each system as separate and exposing that separateness to users is to have a distributed operating system that hides distribution under a single-system image. This class meeting will look at examples of how this has been done and the issues and challenges involved. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe three distributed operating system designs. The first describes the V system, a precursor to many distributed OSes. The second compares and contrasts the Sprite and Amoeba distributed systems.

Much of the course represents background for the general topic of distributed OSs, and Tanenbaum's Distributed Operating Systems textbook is a useful resource. For more information on these past and some recent projects in this space, see the history of V's group, the Sprite web site, the Amoeba web site, the Beowulf web site, and the Condor web site.

December 6 : Futures

As with so many aspects of computer systems, changes in technology and applications changes the issues faced in operating systems and distributed systems. In this class meeting we'll discuss some of these cool new developments and some of their consequences.

The papers assigned for today describe visions for future computing. The first describes ubiquitous computing.

For additional material on ubiquitous computing, see Mark Weiser's Ubiquitous Computing web pages.

December 11 : Project Poster Session

The final day of classes will be a poster session, with one poster for each group's class project. This will allow you all to show off your great work to each other and other interested CMU people.