712 Topics and Readings (Fall 2000)
For each class meeting, readings are assigned. Usually, the readings
will consist of two or three hard-core technical papers. You are
expected to read these papers thoroughly and summarize them
BEFORE
arriving at class.
For each class meeting, we identify the topic and papers below; for
each, we also try to identify good sources for background reading and
for further investigation.
Electronic versions are linked where available (access will be denied
for IP addresses outside of CMU); see the course secretary for
paper copies. We will try to provide hard copies of the assigned readings
at least a week in advance.
(NOTE: THIS SCHEDULE IS HIGHLY PRELIMINARY, IN TERMS OF LECTURE CONTENTS,
LECTURE ORDER, AND ESPECIALLY DATES. ALSO, GUEST LECTURES ARE STILL
BEING SET UP. ALL IS SUBJECT TO CHANGE.)
September 13 : Welcome to 712
- 0. Hints for Computer System Design
B. Lampson, ACM Symposium on Operating Systems Principles,
Dec. 1983, pp 33-48.
- 0. "Worse is Better", and excerpt from
LISP: good news, bad news, how to win BIG
R. Gabriel, AI Expert, vol. 6, no. 6, June 1991, pp. 31-39.
Because of the late start to 712, we will start quickly.
This first meeting will be more than just organizational in nature.
In it, we will discuss how the class is going to work and what will
(and won't) be covered --- see the 712 overview
for a recap of the general information.
In addition, we will dive into the material. This will include
very rapidly recapping stuff you should already know (e.g., stuff
covered in 15-412), discussing what defines operating systems and
distributed systems and what makes them continue to be interesting
after all these years, and overviewing how the various topics in the
course fit together.
The papers listed above are fun and insight-filled, talking
generally about the construction and history of computer systems.
We strongly encourage you to read them, but are not requiring the
standard written summary for them. The Lampson paper, in particular,
is something that you should read now, read again in a few weeks, and
then put into the pile of papers that you re-read every year or so.
A book that should go in this same category is
The Mythical Man-Month: Essays on Software Engineering
(by F. Brooks, Addison-Wesley Publishers).
September 18 : File Systems and Disk Management
- 1. The Design and Implementation of a Log-Structured File System
M. Rosenblum, J. Ousterhout, ACM Transactions on Computer Systems,
vol. 10, no. 1, February 1992, pp. 26-52.
- 2. Soft Updates: A Solution to the Metadata Update Problem in File Systems
G. Ganger, M.K. McKusick, C. Soules, Y. Patt, ACM Transactions on
Computer Systems, vol. 18, no. 2, to appear.
The next three classes will use file systems as a concrete example
for the various general problems introduced on day one (and covered
throughout the course.
This first one will focus on local file systems, including such issues
as performance enhancement techniques (caching, prefetching, layout),
naming and metadata, integrity maintenance for crash recovery,
use of multiple devices for performance, and use of multiple devices
for reliability/availability.
Along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today describe two file system implementation
techniques specifically focused on tuning performance in response to
particular technological advances.
Each describes some of the trends leading to their designs,
and also discusses some of the additional aspects listed above.
The first responds to large caches by optimizing for writes; note
that, like most big ideas, the merits of LFS have been the source
of heated debate.
The second responds to the expense of synchronous I/O operations
by integrity-maintaining write caching for metadata; it also requires
evaluation relative to other options (e.g., write-ahead logging).
This class topic will move rapidly through a variety of file system
and disk management issues, since you should have a solid base
already (from your undergraduate OS course).
Good sources for more background on how disks and disk systems
work are An
Introduction to Disk Drive Modeling (by C. Ruemmler and
J. Wilkes, in IEEE Computer magazine, March 1994)
and RAID: High-Performance,
Reliable Secondary Storage (by P. Chen, et. al, in ACM
Computing Surveys, June 1994).
For more background in file systems, we suggest chapters 6, 7 and 8 of
The Design and Implementation of the 4.4BSD Operating System
(by M.K. McKusick, K. Bostic, M. Karels and J. Quarterman,
Addison-Wesley publishers),
Practical File Systems Design with the Be File System
(by D. Giampaolo, Morgan Kaufmann Publishers).
There's not much in the way of good background material about file
system caching. There are general rules like, "most new data is over-written
or deleted within 30 seconds" and "bigger caches tend to capture more reads".
There are also lots of papers that make specific points about file caches.
Some good ones are:
Disk Cache--Miss Ratio Analysis and Design
Considerations by A. Smith in ACM TOCS (Aug. 1985),
Analysis of the Periodic Update Write Policy for Disk Cache
Systems by S. Carson and S. Setia in IEEE TOSE (Jan. 1992),
and A Trace-Driven Analysis of the UNIX 4.2 BSD File System
by J. Ousterhout, et al., in the ACM SOSP Conference of 1985.
September 20 : Distributed File Systems
- 3. Scale and Performance in a Distributed File System
J. Howard, M. Kazar, et. al, ACM Transactions on Computer Systems,
vol. 6, no. 1, February 1988, pp. 51-81.
- 4. Exploiting Weak Connectivity for Mobile File Access
L. Mummert, M. Ebling, M. Satyanarayanan, ACM Symposium on
Operating Systems Principle, December 1995, pp. 143-155.
The second class in the file systems series will focus on
distributed file systems, which are file systems whose functionality
is split between client machines and file servers that are connected
via some kind of network.
Topics touched on will include
client/server organization, how it works (RPC), how it differs from
the local file system case, cache coherence, concurrency control,
data consistency, performance enhancement, scalability of clients supported.
Along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today describe distributed file system designs.
The first is a landmark paper that describes AFS and how its design
enhances scalability.
The second explores how a distributed file system might want to change
in response to particular changes to the underlying hardware and
usage characteristics -- specifically, it looks at what happens when
clients become mobile computers and the client/server connections
become slow and intermittent.
For additional background on distributed file systems, a couple of
good sources are Chapter 5 of Distributed Operating Systems
(by A. Tanenbaum, Prentice Hall Publishers) and Chapter 9 of
The Design and Implementation of the 4.4BSD Operating System
(by M.K. McKusick, K. Bostic, M. Karels and J. Quarterman,
Addison-Wesley publishers).
For further reading, we suggest many of the papers referenced in
the papers you will read for the next class meeting (NASD, xFS).
September 25 : Distributed Storage Services
- 5. A Cost-Effective, High-Bandwidth Storage Architecture
G. Gibson, D. Nagle, et. al, 8th Conf. on Architectural Support
for Programming Languages and Operating Systems, October 1998,
pp. 92-103
- 6. Serverless Network File Systems
T. Anderson, M. Dahlin, et. al, ACM Transactions on Computer
Systems, vol. 14, no. 1, February 1996, pp. 41-79.
The third class in the file systems series will focus on
distributed storage services, which include file systems whose server-side
functionality is distributed among several file servers.
Topics touched on will include
partitioning and coordination, naming and resource discovery,
reliability/availability, security, data consistency, dealing with scale.
Along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today describe distributed storage service designs.
The first describes Network-Attached Secure Disks (NASD), which is an
architecture designed to cost-effectively deliver scalable storage bandwidth
via a particular partitioning of functionality among clients, servers, and
network-attached storage devices.
The second describes an ambitious cluster storage system design (xFS)
that is meant to avoid having any central point of failure or performance
bottleneck.
There isn't a particularly good source of background reading on this
topic, but the assigned readings do identify a number of other system
designs. We suggest following up on papers referenced therein as
a start toward digging deeper.
October 2 : Communication Models
- 7. Implementing Remote Procedure Call
A. Birrell, B. Nelson, ACM Transactions on Computer Systems,
vol. 2, no. 1, February 1984, pp. 39-59.
- 8. Cluster I/O with River: Making the Fast Case Common
R. Arpaci-Dusseau, E. Anderson, et. al, Workshop on Input/Output
for Parallel and Distributed Systems (IOPADS), May 1999, pp. 10-22.
Having plowed through a concrete set of advanced OS and basic distributed
systems examples, we now start looking at aspects in a more general light.
Our first such aspect is the communication mechanism among components
of a distributed system. The focus is not on the particular network
protocol, but on how the communication is viewed by the parts and how
it all hangs together. Some particular topics for today include
remote procedure call (RPC), implementing RPC, data streams, load
balancing, and work stealing.
As always, along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today highlight particular communication models.
The first is a classic paper discussing implementation details involved
with RPCs, which are the foundation of most distributed systems.
The second is a more recent paper that describes a (possibly) more
appropriate mechanism for data-intensive cluster applications.
For both background reading and further reading, we suggest looking
at general distributed systems books, such as Tanenbaum's
Distributed Operating Systems, Coulouris, et al.,'s
Distributed Systems, or Mullender's
Distributed Systems.
October 4 : Function Placement
- 9. Fine-Grained Mobility in the Emerald System
E. Jul, H. Levy, N. Hutchinson, A. Black, ACM Transactions on
Computer Systems, vol. 6, no. 1, February 1988, pp. 109-133.
- 10. Dynamic Function Placement for Data-Intensive Cluster Computing
K. Amiri, D. Petrou, G. Ganger, G. Gibson, Usenix Annual Technical
Conference, June 2000, pp. 307-322.
An important topic in distributed systems in function placement,
including both deciding what to run where and instantiating these decisions.
Issues involved with deciding what should run where include
inter-function communication, parallelism, load balancing, and security.
Instantiating these decisions involves communication issues discussed
in previous class meetings.
In some cases, it also includes moving functionality from one place to
another, which introduces the topic of mobile code and such issues
as execution environment heterogeneity, virtual machine environments,
re-binding, and encapsulation/isolation for protection.
In examining these issues, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today describe systems with particular approaches
to making and instantiating function placement decisions.
The first describes Emerald, a system that realizes fine-grained mobility
by encapsulating data and threads into mobile objects.
The second describes Abacus, a system that dynamically adjusts function
plaement decisions based on black-box monitoring of runtime performance.
There is not a lot of background literature about function placement
(check out the papers' Related Work sections to dig further), but
there is a huge amount of literature on mobile code systems. One
interesting place to look is at
UMBC's Agent Resources website. Another is the
Object Management Group's website.
October 9 : OS Structure and Extensibility
- 11. Extensibility, Safety, and Performance in the SPIN Operating System
B. Bershad, S. Savage, et. al, ACM Symposium on Operating Systems
Principles, December 1995, pp. 267-284.
- 12. Application Performance and Flexibility on Exokernel Systems
M.F. Kaashoek, D. Engler, et. al, ACM Symposium on Operating Systems
Principles, December 1997, pp. 52-65.
The structure of operating systems has been the topic of much research
and debate, since it has such a large impact on complexity, performance,
flexibility, robustness, security, etc.
This class meeting will look at various options and their trade-offs,
including monolithic (e.g., linux, Windows), microkernels (e.g., Mach),
virtual machine monitors (e.g., VM/370), and recent research systems.
Along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today describe two recently researched operating
system structures focused on enabling safe extensibility.
The first describes Spin, which uses a type-safe programming language
(Modula-3) to enable code to be safel loaded into its kernel.
The second describes exokernels, which minimize kernel functionality to
that required for protection, pushing all high-level abstractions
into application libraries wher it can be replaced by application
writers at will.
It is worth noting that not everyone buys into the value of extensible
systems -- a fun rebuttal is given in
Extensible Systems are Leading OS Researchers Astray
(by P. Druschel, V. Pai, W. Zwaenepoel, IEEE Workshop on Hot Topics in
Operating Systems (HotOS-6), May 1997, pp. 38-42).
Background material for this topic is just basic operating systems,
though specifically relevant discussion can be found in Tanenbaum's
Modern Operating Systems.
For additional details on the research systems described in the papers,
see the
Spin
and exokernel project pages.
October 11 : Composable Systems
- 13. File-System Development with Stackable Layers
J. Heidemann, G. Popek, ACM Tranactions on Computer Systems,
vol. 12, no. 1, February 1994, pp. 58-89.
- 14. Scripting: Higher-Level Programming for the 21st Century
J. Ousterhout, IEEE Computer, vol. 31, no. 3, March 1998, pp. 23-30.
As computer systems become more and more complex, it becomes important
to be able to compose them of substantial pre-existing components.
Far from new, composability has appeared in many aspects of system
construction in many different fashions.
This class meeting will look at various forms of composability that
have arisen in systems, such as module switches (e.g., vnode interfaces),
stackable layers (e.g., in file systems and networks), pipelined
applications (e.g., the UNIX programming model), tweakable interfaces,
and glue-logic via scripting.
Along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today describe two specific examples of composability
in systems.
The first describes stackable file systems, which allow functionality to
easil be added and subtracted incrementally.
The second discusses the importance of scriptin languages (e.g., Perl
and Tcl) in the construction and extension of systems.
I don't have a lot to offer in terms of suggestions for background
reading at this time.
The related work sections of the assigned papers provide some suggestions
for further reading.
October 16 : Security Overview
- 15. Reflections on Trusting Trust
K. Thompson, Communications of the ACM, vol. 27, no. 8,
August 1984, pp. 761-763.
- 16. Why Cryptosystems Fail
R. Anderson, Communications of the ACM, vol. 37, no. 11,
November 1994, pp. 32-40.
- 17. Crisis and Aftermath
E. Spafford, Communications of the ACM, vol. 32, no. 6,
June 1989, pp. 678-687.
Computer system security is a topic of growing importance (and great
confusion). This class meeting will explore the basics of computer
sstem security as it relates to operating systems and distributed
systems.
Specific topics will include what security is really all about, some
examples of real problems, the role of cryptography, assurrance, trust,
and blame.
Along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today discuss interesting aspects of computer system
security.
The first is a brief lecture given by Ken Thompson when accepting his
Turing Award; it makes an interesting point about trust and security.
The second discusses why cryptography does not solve the security
problem, by drawing on real-world examples, and some consequences of
this fact.
The third describes the notorious Internet Worm and some lessons
regarding the real problems in computer security; sadly the lessons
haven't changed much 12 years later.
There is a huge and growing literature on security.
One good overview is "The Protection of Information in Computer
Systems" by J. Saltzer and D. Schroeder in Proceedings of the IEEE
(September 1975). It is a bit dated (and therefore doesn't include some
recent problems/solutions), but is quite comprehensive in the fundamentals.
A couple of good overview books are
Computer Security (by D. Gollmann, Wiley Publishing) and
Security in Computing (by C. Pfleeger, Prentice Hall Publishers).
October 18 : Exam #1
In-class exam.
October 30 : Concurrency, Threads, Transactions
- 18. Using Threads in Interactive Systems: A Case Study
C. Hauser, C. Jacobi, et. al, ACM Symposium on Operating
Systems Principles, December 1993, pp. 94-105.
- 19. On Optimistic Methods for Concurrency Control
H.T. Kung, J. Robinson, ACM Transactions on Database Systems,
vol. 6, no. 2, June 1981, pp. 213-226.
One of the most basic (and yet persistently complex) aspects of
both operating systems and distributed systems is dealing with
concurrent threads of control. While your undergraduate OS course
undoubtedly spent a lot of time talking about monitors and semaphores,
there is a lot more to concurrency than that. This class meeting
will discuss a number of such topics, including locking, avoiding
deadlock, optimistic concurrency control, avoiding livelock, leases,
and transactions.
Along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today differ in what they offer.
The first describes experiences with thread programming, offering a
number of insights from serious usage.
The second describes an optimistic approach to concurrency control,
which allows locks to be completely avoided.
For additional background on thread programming, we suggest
An Introduction to Programming with Threads
(A. Birrell, DEC Technical Report TR-35, DEC/SRC, January 1989;
the instructors will be happy to provide you with a copy),
Principles of Concurrent and Distributed Programming
(by M. Ben-Ari, Prentice Hall Publishers) or
Programming with POSIX Threads
(by D. Butenhof, Addison-Wesley Publishers).
November 1 : Event Ordering and Multi-Party Consensus
- 20. Time, Clocks, and the Ordering of Events in a Distributed System
L. Lamport, Communications of the ACM, vol. 21, no. 7, July 1978,
pp. 558-565.
- 21. The Byzantine General's Problem
L. Lamport, R. Shostak, M. Pease, ACM Transactions on Programming
Languages and Systems, vol. 4, no. 3, July 1982, pp. 382-401.
The nature of distributed systems is such that autonomous systems don't
see the same things, don't see things at the same time, and can't
even easily agree on what time it is. These problems make it difficult
for each to reason about what the others are up to and makes it difficult
for systems to come to consensus. This class meeting will discuss
various issues of event ordering and coming to consensus.
Along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today discuss different aspects of this
general area of problem.
The first discusses the problems of event ordering and the distributed
clock synchronization.
The second discusses the problem of coming to consensus when some systems
misbehave.
Much of the theoretical side of distributed systems is more or less
focuses on these two problems. Thus, many good distributed systems
books address them thoroughly, including those on the reserved list.
November 6 : QoS and Multimedia Support (Rajkumar)
- 22. paper TBD
Citation info.
- 23. paper TBD
Citation info.
Many modern applications, such a video playback and games, require
some degree of real-time guarantees. For many embedded environments,
these guarantees are critical to correct operation.
This class meeting will discuss important aspects of providing for
quality of service and how it can be done.
Along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today are still to be determined.
There is a huge literature and effort in real-time systems, including
annual conferences and a
technical committee within the IEEE.
November 8 : Impact of Mobile Computing (Satya)
Mobility has become a fact of life in modern computing, and it changes
a number of aspects of both operating systems and distributed systems.
This class meeting will discuss some of them.
Along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
No papers are assigned for today.
Mobility and its impact on systems is the topic of much ongoing research
(and a lot of figure-it-out-as-we-go practice).
One good source of "what's hot" is the
ACM Special Interest Group on Mobile Computing.
November 13 : Group Communication and State Machines
As we've discussed, one reason for replacing centralized services with
distributed systems is fault tolerance. Dealing with both clean
(fail-stop) failures and misbehaving systems (e.g., compromised
systems) can be done by having multiple systems doing exactly the
same thing in lockstep. This class meeting will look at approaches
and technologies for accomplishing such function replication.
Along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today describe this method in general and in
a particular application, distributed file service.
The first provides a tutorial-like overview of replicated state machines.
The second describes a fault-tolerant NFS replacement that employs
some well-tuned algorithms.
The first paper actually provides some pretty good background discussion,
and we encourage those looking for further reading to follow up the
Related Work sections.
November 15 : Fault Tolerance
- 26. Recovery Management in Quicksilver
R. Haskin, Y. Malachi, W. Sawdon, G. Chan, ACM Transactions on
Computer Systems, vol. 6, no. 1, February 1988, pp. 82-108.
- 27. Hive: Fault Containment for Shared-Memory Multiprocessors
J. Chapin, M. Rosenblum, et. al, ACM Symposium on Operating
Systems Principles, December 1995, pp. 12-25.
Fault-tolerance is a broad and important topic in operating systems
and distributed systems. This class meeting will discuss a range
of issues and approaches, complementing previous discussions.
Some topics to be discussed include
data redundancy, failover, checkpointing, state machines, N-versioning,
fault containment, isolation, and failstop versus byzantine failures.
Along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today describe fault tolerance strategies for
particular systems.
The first describes fault management in Quicksilver, which relied
heavily on transactions and their recoverability properties.
The second describes Hive, a cellular operating system architecture
for large multiprocessor systems.
Fault-tolerance is another of the large areas of research and
practice. For example, see the
IEEE Technical Committee on Fault-Tolerant Computing.
For some intersting and scary stories of insufficiently fault tolerant
systems and their consequences, see Neumann's Computer Related Risks
or Peterson's Fatal Defect.
November 20 : Exam #2
In-class exam.
November 27 : Project Status Presentations
In class presentations, by the students doing them, of projects
and their midpoint status.
November 29 : Cluster-Based Computing
An alternative to having large, expensive, multi-processor computer
systems is to treat a networked set of PCs or workstations as a parallel
computer. Many have promoted and explored this approach, and we
will discuss some of the trade-offs and issues involved.
In particular, many trade-offs revolve around the
performance and usability aspects of the most important design consideration
of this approach -- how to program such a system.
Along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today focus on particular approaches to this problem.
The first describes SoftFLASH, yet another distributed shared memory
system, and analyzes its performance on the kinds of parallel applications
relevant to such computation.
The second describes Shasta, which focuses on getting SMP applications
to work on clusters without any programmer effort or attention.
Cluster computing is a fairly old topic that has received a lot of
attention in recent years.
A good, accessible explanation of why is given in A Case for Networks of Workstations:
NOW
(by T. Anderson, D. Culler, D. Patterson and the NOW team,
in IEEE Micro, February 1995).
December 4 : Distributed OSes
- 30. The V Distributed System
D. Cheriton, Communications of the ACM, vol. 31, no. 3,
March 1988, pp. 314-333.
- 31. A Comparison of Two Distributed Systems: Amoeba and Sprite
F. Douglis, M.F. Kaashoek, A. Tanenbaum, J. Ousterhout, ACM Computing
Surveys, vol. 4, no. 4, Fall 1991, pp. 353-384.
An alternate to treating each system as separate and exposing that
separateness to users is to have a distributed operating system that
hides distribution under a single-system image.
This class meeting will look at examples of how this has been done
and the issues and challenges involved.
Along the way, we will look at several case studies and discuss
(yes, you too) their pros and cons.
The papers assigned for today describe three distributed operating system
designs.
The first describes the V system, a precursor to many distributed OSes.
The second compares and contrasts the Sprite and Amoeba distributed systems.
Much of the course represents background for the general topic of
distributed OSs, and Tanenbaum's Distributed Operating Systems
textbook is a useful resource.
For more information on these past and some recent projects in this
space, see
the
history of V's group,
the Sprite web site,
the Amoeba web site,
the Beowulf web site, and
the Condor web site.
December 6 : Futures
- 32. The computer for the 21st century
Mark Weiser, Scientific American, vol. 265, no. 3, September 1991, pp. 94-104.
As with so many aspects of computer systems, changes in technology and
applications changes the issues faced in operating systems and
distributed systems. In this class meeting we'll discuss some of
these cool new developments and some of their consequences.
The papers assigned for today describe visions for future computing.
The first describes ubiquitous computing.
For additional material on ubiquitous computing, see
Mark Weiser's Ubiquitous Computing web pages.
December 11 : Project Poster Session
The final day of classes will be a poster session, with one poster
for each group's class project.
This will allow you all to show off your great work to each other
and other interested CMU people.