712 Paper Review Guidelines (Fall 2000)

An important component of 712 is critical reading of the assigned papers and coming to class ready to discuss them. To help you in this process, we require that you hand in a short review of each paper at the beginning of the class during which we will discuss the paper.

Your reviews should:

We do not want a book report or a repeat of the paper's abstract. Rather, we want your considered opinions about the key points indicated above. Of course, if you have an insight that doesn't fit the above format, please include it as well.

We want the reviews to be short, between 1/4 and 1/2 a page. Reviews must be typed.

All reviews will be counted, and a random sample of the reviews will be graded. Do not skip class or come late just to finish a review -- we expect that everyone will miss a few, though skipping more than 10% will have a negative impact on your grade.

Selected reviews

Here we're posting a selection of your reviews so you can see how your fellow students thought about the readings. These reviews will also be good refresher material before exams.

Date: 09/18

From: Nitin Parab

Soft Updates: A Solution to the Metadata Update Problem in File
Systems.  Greg Ganger et al.

Take Home Message:
1.      Understand the requirements of the system well. Do not merge two
related requirements into one. If one component of the system has
requirement set {A} and another has requirement set {B}, then do not
impose {A} AND {B} on the entire system. Rather, satisfy {A} for the first
component and {B} for the other.
2.      Sometimes you can solve a problem by looking at it from a different
level of abstraction of the system. (If you can't solve a layout problem in
a building because you are looking at it as a system made up of m floors,
look at it as a system made up of n apartments. Maybe you'll get the
solution then.)

Metadata update problem:
1.      First view: Various metadata and data modifications must be made
in such a way that the system has a consistent view and can
recover in the presence of unpredictable failures.
Solution from this viewpoint: Propagate changes to stable storage in a
specific order, synchronously.
2.      Second View: The application (one component of the system) should
always see the most recent copy of the metadata blocks and the disk
(another component) should always see copies that are consistent with its
other contents (albeit slightly older copies).
Solution from this viewpoint: Each component of the system needs to have a
consistent view but two different components may have different views.
Delay metadata updates to disk. When writing a block, if there are any
dependencies that require other blocks to be written before the metadata
in the current block can be written to the disk, then those parts of the
metadata in the current block are rolled back to an earlier state (which
is consistent with respect to the current on disk state). During the disk
write, the block is locked to prevent the applications from seeing the
rolled back state. When the disk write completes, any undone updates in
the source memory block are restored before the block is unlocked.
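
To make the rollback idea concrete, here is a toy sketch in Python of what a
write path following this description might look like. The structures and
names (Dep, prerequisite_on_disk, etc.) are hypothetical illustrations, not
the actual soft updates implementation.

    # Toy sketch of soft-updates-style rollback around a block write
    # (hypothetical structures; not the actual kernel code).
    import threading

    class Dep:
        """An update in this block that depends on another block reaching disk first."""
        def __init__(self, offset, old_bytes, new_bytes, prerequisite_on_disk):
            self.offset, self.old, self.new = offset, old_bytes, new_bytes
            self.prerequisite_on_disk = prerequisite_on_disk   # callable -> bool

    def write_block(data, deps, disk_write, lock):
        with lock:                          # applications must not see rolled-back state
            undone = []
            for d in deps:
                if not d.prerequisite_on_disk():
                    data[d.offset:d.offset + len(d.old)] = d.old   # roll back to the
                    undone.append(d)                               # on-disk-consistent value
            disk_write(bytes(data))         # what reaches the disk is self-consistent
            for d in undone:                # roll forward: restore the newer in-memory value
                data[d.offset:d.offset + len(d.new)] = d.new

    buf = bytearray(b"new-inode-pointer")
    dep = Dep(0, b"old", b"new", prerequisite_on_disk=lambda: False)
    write_block(buf, [dep], disk_write=lambda b: None, lock=threading.Lock())
    print(buf)      # the in-memory copy shows the new value again after the write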

Cyclic dependency problem:
Most metadata blocks contain many pointers (many inodes for example) hence
cyclic dependencies occur frequently if dependencies are recorded only at
the block level. However, if you track dependencies on a per pointer basis
instead of per block basis. dependency cycles do not occur because
independent sequences of updates remain independent and no single sequence
is cyclic(see paper for details).

What more could have been done?
A measurement of the overhead introduced by the system (e.g. the syncer
task), and a study of the performance difference between No-Order and Soft Updates.



The Design and Implementation of a Log-Structured File System.

Take home message:
1.      Sometimes it is possible to completely eliminate a problem rather
than finding an optimized solution for it. One way to improve FS
performance is to find a rotationally optimal block placement policy so as
to improve (synchronous) write performance. Another solution is to do away
with synchronous writes entirely.
2.      There is no such thing as a free lunch. In traditional UNIX (FFS) the
problem was "how do you force the disk into a rotationally optimal
distribution". The problem is hard, as it requires one to predict file
usage patterns. A log-structured FS eliminates this problem but introduces a
new one: how do you force the disk into a "bimodal segment distribution"
(plus, of course, the overhead of the cleaner process). This depends on the
stability of the data in the segments, but stability cannot be predicted
without knowing file access patterns. Fortunately the latter problem is much
easier and only matters at high disk utilization.
3.      Identify the problem correctly. What is the problem with the
current FS design? First, it spreads data blocks (and metadata blocks)
across the disk, causing too many small accesses. Second, it writes
synchronously to disk (in order to take care of the metadata update
problem). A log-structured FS clubs metadata and all data together into long
segments (the log), resulting in single long sequential disk writes. Question:
would an FFS with support for "clustering" of data blocks (large sequential
writes) and soft updates (no synchronous writes, asynchronous metadata
updates) give better performance?

I believe that a systems paper that reports a totally new policy (design
and implementation) should have an abstract, philosophical (almost
religious) discussion. The paper should compare the two approaches at an
abstract philosophical level (filtering out all implementation details).
After all, each operating system is a different religion!

Date: 09/20

From: Sagar Jyoti Chaki

Andrew File System:
------------------
The following points about the paper are noteworthy:

* The main concept behind AFS is caching of files by the client, in
order to avoid remote open and reduce latencies of file operations,
since most of these latencies are due to the network.

* AFS is capable of supporting a few hundred thousand users
simultaneously. It has achieved such a high degree of scalability by
ensuring that the bulk of the work is done on the client side and the
server is used only for tasks essential for maintaining the integrity
and consistency of the file system. Moreover a single server process
is used to serve all clients, and a concept of LWP, which is very
similar to today's threads, is used.

* Following Amdahl's law, the paper stresses how the implementation of
AFS has strived to make the common case faster. An independently
optimized protocol is used to transfer data in bulk. A special
callback mechanism, which significantly reduces network traffic, is
used for cache validation. A flat file naming scheme is used for
faster translation.
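
A minimal sketch of the callback idea (hypothetical classes and names; the
real Vice/Venus protocol is much richer): the server promises to notify the
client when a cached file changes, so the client can reuse its cached copy
without contacting the server on every open.

    # Sketch of callback-based cache validation (assumed interfaces, not AFS code).

    class Client:
        def __init__(self, server):
            self.server = server
            self.cache = {}        # path -> data
            self.callback = {}     # path -> True while the server's promise holds

        def open(self, path):
            if self.callback.get(path):            # callback still valid: no server traffic
                return self.cache[path]
            data = self.server.fetch(path, self)   # fetch whole file, register callback
            self.cache[path] = data
            self.callback[path] = True
            return data

        def break_callback(self, path):            # called by the server when the file changes
            self.callback[path] = False

    class Server:
        def __init__(self, files):
            self.files = files
            self.watchers = {}     # path -> set of clients holding callbacks

        def fetch(self, path, client):
            self.watchers.setdefault(path, set()).add(client)
            return self.files[path]

        def store(self, path, data):
            self.files[path] = data
            for c in self.watchers.pop(path, set()):
                c.break_callback(path)             # invalidate, rather than being polled

    srv = Server({"/doc": "v1"})
    cli = Client(srv)
    cli.open("/doc"); srv.store("/doc", "v2")      # the store breaks the callback
    print(cli.callback["/doc"])                    # False: the next open refetches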

The major drawback of the paper is that only a single synthetic
benchmark was used to judge its performance and compare it with other
similar systems. This does not guarantee similar performance in
real-life scenarios.

The paper highlights the process of using experimental results to
modify a prototype and convert it to a more advanced product that can
perform admirably in real-life situations.

CODA:
----
* Any mobile application must be geared to cope with low bandwidth,
high latency and expensive connectivity. It must be able to adapt to
orders of magnitude of variance in connectivity in terms of
bandwidth, latency and cost. It must be able to do so with minimal
human assistance. Coda does well on all these counts.  The design of
Coda is guided by the following principles:

        * Do not punish strongly connected clients.
        * Do not make life worse than when disconnected.
        * Do it in the background if you can.
        * When in doubt, seek user advice.

* A major mechanism used by Coda is disconnected operation where, in
the presence of negligible or no connectivity, the client behaves as
if the network connectivity has been lost. However this mechanism has
significant limitations too.

* Again, as with AFS, Coda employs several specialized techniques to
improve performance in slow networks. Both RPC2 and SFTP have been
tuned for these situations. Cache coherence granularity has been raised and
a multilevel coherence detection scheme is employed to speed up this
process. An asynchronous update propagation scheme called trickle
reintegration is used for propagating updates in the background
without unnecessarily blocking other processes. Finally,
user assistance is sought for cache misses in situations where
fetching the unavailable file(s) would take longer than a certain
threshold.

The major problem with the paper is that no real-life experiments were
carried out to verify the performance of the system. Instead trace
simulation was used. It is unclear how variance in network
connectivity was simulated. The system should have been subjected to
a more robust examination.

Once again this paper illustrates how newer application domains might
necessitate appropriate modifications in the design of any
system. Such modifications should not only be based on the
characteristics of the environment, measured directly or indirectly;
their usefulness should also be validated through rigorous experiments.

Date: 09/25

From: Pedro Vaz Artigas

Serverless Network File Systems

The paper presents a novel way to implement a distributed file system:
each component of the server may be located on any machine that
is connected to the network. The motivation for this solution is that,
assuming that the network is faster than local disk accesses, disk
accesses could be performed in parallel on a number of systems in the
network, and that would result in better performance even when compared
to what can be obtained from a local disk. Caches also play a significant
role, as the machines in a network have an aggregate cache that is much
larger than what can be cached by a single machine; therefore
cooperative caching can also lead to better performance than that of the
local disk. Strategies are presented to ensure cache coherency. Also,
writes are postponed, local file systems use logging to speed up
writes, and cache coherency ensures that clients get the most up-to-date
value in spite of write-back caching.
The architecture of the system is based on four entities. Clients
generate access requests. Managers are responsible for the control
of the metadata and for cache coherence for the files they manage.
Storage servers are responsible for physically storing data on disk; data
is organized in segments as in LFS, and a segment may span multiple
servers to improve performance. Stripe groups are sets of servers that
cooperate in order to store data; each segment is stored in a stripe
group, and storage servers may be part of more than one stripe group.
Cleaners are responsible for segment cleaning as in LFS.
The only drawback to this file system architecture is that the
assumption that networks are switched and have much higher performance
than that of a local disk does not really hold. Fast local disks
(10MB/s) are more common these days than a fast switched network with
similar bandwidth (switched fast Ethernet, about 100 Mbit/s, is only
about 12MB/s), and results were presented for an 80MB/s network and 3MB/s
disks. So it is not clear that this architecture would provide the best
performance in today's environments.

A Cost-Effective, High-Bandwidth Storage Architecture

The paper presents a new storage architecture whose main advantage is
providing scalability at a much lower price than current server-based
architectures. The idea is to enable disks to be connected directly to
the network, avoiding the cost of file servers used for storing data
and avoiding the delays and bottlenecks due to the fact that the server has
to store-and-forward the data. Also, I/O bandwidth may be scaled more
easily by attaching new disks directly to the network, assuming that
networks are switched or have adequate bandwidth to enable the
utilization of the full disk bandwidth.
This strategy requires disks, called NASDs, that are capable of
implementing more elaborate protocols than the ones they currently
employ; more file-system-related tasks, such as disk layout, are also
left to the disk. Given the amount of circuitry that may be added to the
ASICs responsible for controlling disks as newer process technologies are
employed, this requirement may be fulfilled.
File servers are still required in order to enforce access control,
global naming, concurrency control and cache coherency. Disks provide
all the data, organized in objects, and verify that a client has the
privilege to access a given piece of data. The file server is responsible
for issuing a ticket to each client, and the disk is able to verify that
tickets are valid without any server intervention. This security model
is based on capabilities. Given that disks may supply data
directly to the client, the file-server bottleneck for data requests is
removed.
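
A minimal sketch of the capability idea (assumed scheme and field names; the
real NASD protocol is richer): the file manager and the drive share a secret,
the manager hands the client a ticket, and the drive can check the ticket
without contacting the manager.

    # Sketch of capability-style access checks (assumed scheme, not the NASD spec).
    import hmac, hashlib

    SHARED_SECRET = b"drive-and-manager-secret"    # installed out of band

    def issue_ticket(obj_id, rights):              # run by the file manager
        msg = f"{obj_id}:{rights}".encode()
        tag = hmac.new(SHARED_SECRET, msg, hashlib.sha256).hexdigest()
        return (obj_id, rights, tag)

    def drive_check(ticket, requested_op):         # run by the drive, no server round trip
        obj_id, rights, tag = ticket
        msg = f"{obj_id}:{rights}".encode()
        expected = hmac.new(SHARED_SECRET, msg, hashlib.sha256).hexdigest()
        return hmac.compare_digest(tag, expected) and requested_op in rights

    ticket = issue_ticket(obj_id=42, rights="read")
    assert drive_check(ticket, "read")
    assert not drive_check(ticket, "write")
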
In order to show that the model presented is adequate and may be
utilized to implement a distributed file system even though current
applications are not NASD-aware, the authors present an NFS implementation
based on their NASD disk prototype. Also, a benchmark for a NASD-aware
application is presented, utilizing a parallel file system implemented
on top of the NASD disks; that application scales better under the NASD
model than when utilizing a similar conventional configuration with an
NFS distributed file system.

Date: 10/02

From: Nat Lanza

Cluster I/O with River: Making the Fast Case Common

  The River paper describes an interesting flow model for
  I/O-intensive computation in clusters, and a programming environment
  to take advantage of it. The main technical advancements described
  are their distributed queues, which allow producers and consumers of
  data to work at different rates while balancing the workload, and
  the graduated declustering method of using mirrored data, which
  helps insulate the system from performance perturbations on some
  nodes. Their model is also interesting, and involves individual I/O
  modules tied together by flows, potentially allowing complex
  applications to be cobbled together from a number of small and
  perhaps standard modules.
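
  As a toy illustration of the distributed-queue idea (a sketch only, using
  standard-library stand-ins rather than River's actual implementation),
  consumers that drain a shared queue at different rates automatically end
  up doing proportionally different shares of the work:

    # Sketch: a shared queue naturally load-balances consumers of different speeds.
    import queue, threading, time, collections

    q = queue.Queue()
    done = collections.Counter()

    def consumer(name, delay):
        while True:
            item = q.get()
            if item is None:
                break
            time.sleep(delay)          # pretend this consumer is slower or faster
            done[name] += 1

    threads = [threading.Thread(target=consumer, args=("fast", 0.001)),
               threading.Thread(target=consumer, args=("slow", 0.004))]
    for t in threads: t.start()
    for i in range(200): q.put(i)      # producer
    for _ in threads: q.put(None)      # shutdown markers
    for t in threads: t.join()
    print(done)                        # the fast consumer handles most of the items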
  
  Unfortunately, the paper does not spend much time discussing the
  programming environment for River. The authors mention a graphical
  environment, but then spend the rest of the paper discussing the
  performance and applications of their distributed queues and graduated
  declustering.

Implementing Remote Procedure Calls
  
  This paper describes an experimental RPC system for the Cedar
  environment, and focuses on the binding system and the custom
  transport protocol. The binding system is interesting in that it
  incorporates a database to allow clients to look up servers and
  servers to register their presence; instead of simply requiring
  clients to know where the servers are, it provides a relatively
  secure discovery mechanism. The protocol is also interesting; the
  authors rejected the then-current PUP byte stream protocols,
  reasoning that a large setup/teardown time would ruin
  performance. Another interesting detail was their emphasis on
  keeping the RPC interface as close to the native procedure call
  interface as possible -- they avoided RPC timeouts because local
  procedure calls do not have timeouts. Unfortunately, while the paper
  claims that RPC will make building distributed applications much
  simpler, it doesn't provide much evidence for that. The authors do
  admit that they are still in the early stages of their work, but it
  would be good to see more examples of this, as well as some more
  mention of their stub generator.

Date: 10/04

From: Sagar Chaki

CS 15-712 : Advanced OS & Distributed Systems
Paper Review -  due 10.04.00
Sagar Chaki

ABACUS:
------
* Optimal partitioning of application and file-system functionality
  between the clients and servers of a cluster is very hard. The solution
  depends on cluster and workload characteristics in a complicated manner.
  Many of these characteristics cannot be predicted before running the
  application, and some of them (like the availability of a server) may
  change dynamically during execution. Hence universal solutions do
  not exist, and any attempt at such a solution will lead to
  sub-optimal and often catastrophic performance.

* The ABACUS framework strives to achieve an optimal solution for a
  restricted class of applications viz. those that manipulate large
  data sets. It dynamically alters function placement on the basis of
  blackbox monitoring of function characteristics like inter-object
  communication and per-object resource requirement.


* Experiments indicate that ABACUS achieves 2-10X performance
  improvement and successfully adapts to both long-term and short-term
  runtime variations.

The success of ABACUS relies critically on the effectiveness of the
analytical model used. It is unclear how robust the model is, how
quickly it adapts to variations, and whether the model itself is
static or can be modified dynamically. Also it is unclear how accurate
the estimates of the blackbox monitoring system are. The benchmarks
used to evaluate ABACUS seem to be synthetic.

The ABACUS system tries to avoid unnecessary and expensive
generalization. In a situation where there is no clear
"best-for-all-situations" solution, ABACUS tries to address the
problem in a specific domain and does an extremely good job of it.

Emerald:
-------

* Process migration is a well-studied technique for achieving optimum
  performance in distributed systems. Emerald is an object-oriented
  language and system that aims to provide efficient object mobility.

* Emerald claims the following unique features in its support for mobility:

        * It is object-based - the unit of distribution and mobility
          is an object. This enables Emerald to provide mobility at
          all levels of granularity.

        * Emerald has language support for mobility.

* The Emerald language allows the user to supply additional
  information about any object with its definition. For example an
  object may be attached to other objects. Such information is used by
  the compiler to generate smarter code.

The major problem with Emerald is that there seems to be no support
for coping with variations in the system characteristics or with breaches
of security. For example, the paper assumes all nodes to be similar (even
at the assembly language level) and trustworthy. The experiments
carried out do not gauge Emerald's performance in heterogeneous and
malicious environments. Moreover, using it effectively requires
mastering a new language. It is also unclear how difficult it would be
to port existing distributed applications to Emerald.

Emerald tries to extend the object-oriented paradigm to a
distributed setting. It provides yet another example of how techniques
that have been successful in one domain can be applied to others.

Date: 10/09

From: Nitin Parab

15-712		Nitin Parab		Monday 9th Oct.

ExoKernel      &     SPIN.

Extensible Application Specific Operating Systems
What's wrong with current systems:
1.	High Level Abstractions: Current operating systems virtualize
resources with high-level abstractions like processes, files, etc. These
interfaces tend to be generic and introduce a semantic gap between
application requirements and the low-level hardware resources. The OS
should export low-level primitives which are flexible and versatile enough
to support any high-level programming abstraction and which can be
efficiently mapped onto any hardware.
2.	Resource Management Policies executed oblivious to the
application: Low-level events like blocking for device I/O, page faults,
processor allocation, etc. are not made available to the application. Thus
applications cannot implement their own resource management policies.
Applications know their requirements far better than the OS does. Hence
application-controlled resource management results in improved performance.
The operating system should only enforce protection boundaries between
resources and allocate resources among the applications. Management and
policy functionality like resource scheduling should be left up to the
applications.

Requirements of Application Specific Operating Systems (and how they are
satisfied in Exokernel and Spin):
1.	Securely Expose Hardware: In Exokernel the operating system is a
small kernel which exports a low-level interface to the application
level. These primitives export resources like the CPU, disk, memory, TLBs, etc.
Spin allows application code to be linked into the kernel. The kernel
provides a set of interfaces to core system services.
Spin uses language support to authenticate modules and prevent malicious
access to resources. It also implements capabilities, once again using
language support. Exokernel uses architectural support (protection
mechanisms) to securely export a low-level interface (system calls). It also
uses self-authenticating capabilities to grant resource access. It may
cache these capabilities. It also uses a mechanism for downloading
application code, like Spin.
2.	Expose Events: Spin delivers events to the application by calling
(via procedure call) the application-specific safe handler extensions in
the kernel. In Exokernel the events are delivered to the application's OS
library. It also supports application-specific handlers being downloaded
into the kernel.
3.	Expose Allocations: Spin exports an interface which allows safe
kernel modules to request resources with attributes (like a physical page
with particular attributes). Exokernel exports a similar interface to the
application layer.
4.	Expose Names: Exokernel exports the native physical name of a
resource, like the address of a physical page. It appears that Spin does not
do so.
5.	Expose Revocations: Spin exposes revocation by delivering a
revocation event just like any other event. Exokernel also delivers
revocation events to the application layer like other events.

Analysis :
1.	Exokernel and Spin provide similar features. The main difference
is the implementation mechanism. Spin uses language support to securely
extend the kernel. Exokernel securely exports hardware resources to the
application level using hardware support (the system call interface).
2.	Spin's event delivery mechanism is very efficient, just a
procedure call. Exokernel's event delivery mechanism is expensive since it
has to cross a protection domain boundary. Thus applications with fine-grain
events will not perform well on Exokernel. Exokernel addresses this problem
by supporting secure code downloading, just like Spin, for fine-grain events
like network packet processing.
3.	Both of the application examples given in the Spin paper are
fully implemented inside the kernel. These applications do not cross the
kernel-to-user boundary. They do not incur the cost of protection domain
crossing (which Exokernel applications will incur). Hence these are not
the right applications to evaluate performance. Not all applications can
be fully written in the kernel, e.g. a web server that supports
scripts. The Exokernel approach thus sounds more generic.
4.	At first glance it may appear that Exokernel has problems in
securely exposing hardware resources like the network and disk. However,
note that Exokernel requires hardware virtualization of resources, as with
virtual memory. Current network and disk controllers require software
(OS) virtualization. However, recent developments like protocol offloading
(silicon TCP) and active object-based disks do provide virtualization at
the device level. Thus it looks like an Exokernel OS will be able to securely
expose all hardware even without support for code downloading.
5.	The Spin paper does not discuss the effects on global system
performance.
6.	The Exokernel paper should have included a discussion of the
architectural requirements for Exokernel operating systems.

Date: 10/11

From: Nitin Parab

15-712		Nitin    Parab		Wednesday 11th October.


FS development with stackable layers.
Heidemann et al.

Composable OS kernel:
One way of building a composable OS kernel is to support a stackable, layered
design (e.g. the STREAMS subsystem in UNIX); another is a symmetric interface,
syntactically identical above and below. With such an interface, new layers
can easily be removed or added at any level in the stack. These stacks can be
made extensible by having layers support new operations; an existing layer
adapts automatically by forwarding any operation it does not recognize. Layers
often execute within a single protection domain (address space
independence). Being symmetric, stackable layers support layer
substitution (it is easy to substitute TCP with UDP in STREAMS). One can have
non-linear stacking, for example TCP and UDP layers on top of IP in
STREAMS.
Reusable services that occur in the middle of a large layer can be
extracted out as a whole new (semantics-free) layer with two cooperating
layers around it. This extracted layer can then be used by multiple other
layers/services. The architecture also supports stacking layers at user
level.

Analysis
1.	The paper should have discussed the implementation and its layer
interface in more detail.
2.	Sections 2 and 3, which talk about stackable layers, could have been
omitted, as the issues they point out are well known and have been tried
and tested in systems like the STREAMS subsystem.
3.	Porting from BSD to SunOS 4.x is not a good argument for
portability, as these systems are very similar to each other.
4.	The paper fails to show how a file system (say UFS) can be divided
into layers. It talks about UFS (or LFS) as being one whole layer.

===================
Scripting: Higher-Level Programming for the 21st Century.
John K. Ousterhout.

Composable systems:
A basic requirement of composable systems is "glue logic". One way of
providing glue logic is to define interfaces and dynamically load modules
that will interact with other modules through these interfaces (e.g.
STREAMS stacks or stacks in stackable FSes). In strongly typed languages
these interfaces require objects of specific types, and the compiler
prevents any other type of object from being used with the interface.
Thus the strongly typed nature of systems programming languages
discourages reuse and the building of composable systems. A typeless
language makes it much easier to hook together components. Classic examples
are filter programs in UNIX shells.

Scripting languages are glue languages or system integration languages.
Scripting and system programming are symbiotic. System programming is used
to make cool components, and scripting languages are used to glue together
these components to build cool applications. Scripting languages provide a
platform for very rapid development of applications of a gluing nature.

Concerns regarding Scripting Languages: 
1.	Typeless nature makes them unsafe: Strong typing allows errors to
be detected at compile time. In contrast scripting languages do error
checking at runtime. 
2.	Performance: Scripting languages are interpreted and hence are not
as efficient as system languages. However, scripting languages are used to
glue together components and hence performance depends more on the
components, which are written in systems languages.

Scripting is getting popular: 
1.	Rise of GUIs. GUIs are fundamentally gluing applications.
2.	Growth of the Internet. Again, the Internet is a gluing tool.
3.	Popularity of component frameworks. Without a good scripting
language to manipulate the components, much of the power of a component
framework is lost.
4.	Improvements in scripting technology.
5.	Change in the programmer community. Casual programmers want a tool
that is easy to learn.

Date: 10/16

From: Ken Tew

Ken Tew
15-712 Advanced OS and Distributed Systems
Lecture 9

"Reflections on Trusting Trust"
"Why Cryptosystems Fail"
"Crisis and Aftermath"

These three papers give motivation and some guidelines on how to design and 
maintain security for a system.  Note that designing the system is just the 
first step in security.  Other important (and often overlooked) security issues
are educating users and maintaining the system. 


Designing and Verifying system security.  

 When designing system security, the first step is to consider what is being 
 protected against.  A threat model is an outline of what you are trying to 
 protect a system from: what types of attacks, and from whom?  
 Typical questions to ask in making threat models:

  what are you trying to protect against?  
    Are you simply trying to prevent data from being corrupted or altered? 
    Are you trying to prevent denial of service? 
    Are you trying to prevent unauthorized reading of the data? 
  Who are you trying to protect against?
    Are you trying to prevent misbehavior from privileged users?
    How many privileged users would be required to conspire in order to 
     break the security?
    Are you trying to prevent outside access?

 Independent verification.
   The people designing a security system should not be the ones to verify 
   that it is secure.  If a designer has overlooked some security hole in
   designing the system, he or she is less likely to think of it while testing 
   the system.  Additionally, someone whose job is to verify the system will 
   be more likely to look for back-doors the designer may have put into the 
   system.  


Educating users as to the importance of following security procedures.

  Once you have a security system in place, it is worthless if no one will use
  it.  It is important that system users be educated about what the security 
  procedures are, as well as why it is important to follow them.  This also 
  means procedures must be made simple enough that users can follow them.
  Additionally, it may mean showing management that the cost of not 
  implementing security can be greater than the cost of implementing it.


Maintaining system once it is in place.  

  Monitoring usage.  
    It is important to keep an eye on how the system is being used.  
    Irregularities may indicate a break-in or an attempted break-in.  Also, 
    monitoring may be able to pinpoint weaknesses in the system before someone
    can exploit a security hole.

  Updating the system.
    Simply because a system is reasonably secure today doesn't mean that the same 
    system will still be secure in the future.  Systems need to be updated to:
    a) patch any security holes that have been discovered
    b) update cryptographic keys that may no longer be valid
 

Catching and punishing security violators.
   It is impossible to make a usable system 100% secure.  Therefore, deterring
   people from violating system security is also important.  Punishment may 
   range from reprimanding employees for being lax in following security 
   procedures to criminal convictions.  

Date: 10/30

From: Sagar Chaki

Optimistic Concurrency Control:
------------------------------

* The paper proposes a scheme that could potentially allow more
  concurrency in accesses to a database by avoiding the use of
  locks. However unrestricted concurrent access might lead to a loss
  of integrity. The scheme prevents this by validating any updates to
  the database before they are actually made.

* The basic idea behind the success of the scheme is that locks may be
  necessary only in the worst case. Getting rid of locks would let us
  avoid all lock related overheads and achieve better throughput. The
  penalty to be paid is the need to validate all updates to make sure
  they do not compromise the integrity of the database. But an overall
  gain is expected if such occurrences are rare.

* Each transaction is partitioned into three distinct phases: a read
  phase, a validation phase and a possible write phase. The
  correctness of the concurrent execution of multiple transactions is
  ensured by verifying serial equivalence between them. This is done
  with the aid of transaction numbers (which enforce an order between
  the transactions) and certain serialization conditions between
  transactions with respect to their transaction numbers.
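
A minimal sketch of the validation step (made-up data structures; this is a
simplified form of the paper's validation conditions, assuming writes are
applied only after successful validation): a transaction records its read and
write sets during the read phase, and at validation it is compared against
transactions that committed after it started.

    # Sketch of backward validation in optimistic concurrency control
    # (assumed model, not the paper's exact conditions).

    class Txn:
        def __init__(self, start_tn):
            self.start_tn = start_tn           # highest committed txn number at start
            self.read_set, self.write_set = set(), set()

    committed = []      # (transaction number, write set) of committed transactions
    next_tn = 0

    def validate_and_commit(txn):
        """Serial equivalence check: nothing we read may have been overwritten
        by a transaction that committed after we started."""
        global next_tn
        for tn, wset in committed:
            if tn > txn.start_tn and wset & txn.read_set:
                return False                   # conflict: abort and restart
        next_tn += 1
        committed.append((next_tn, frozenset(txn.write_set)))
        # write phase: apply txn.write_set to the database here
        return True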
	
The biggest problem with the scheme is its limited scope. It can only
be applied in situations where both of the following conditions hold:
(a) the number of nodes in the graph is very large compared to the
total number of nodes involved in all the running transactions at a
given time, and (b) the probability of modifying a congested node is
small.
	
The paper takes the bold stance that in some scenarios, worst case
situations might arise only very rarely. In such cases it makes more
sense to use a more relaxed scheme to achieve higher throughput,
making sure that the worst case situations are appropriately detected
and handled.

Threads in Interactive Systems:
------------------------------

* The paper is an impressive case study of the usage of multi-threading in
  modern large-scale research and commercial systems. The authors
  scrutinize two such systems (Cedar and GVX) in minute detail using
  analysis of macroscopic thread statistics, analysis of the
  microsecond spacing between thread events, and reading the
  implementation code. They mention common paradigms, pitfalls and
  issues in thread usage and their implementation.

* An interesting outcome is that threads can be classified into three
  general categories. The eternal threads repeatedly wait on a
  condition variable and then run briefly before waiting again. The
  worker threads are forked to perform an activity like formatting a
  document. Finally transient threads are short-lived and are forked
  by some long-lived thread. They run for a relatively short while and
  then exit.

* The paper mentions several distinct paradigms for efficient thread
  usage, viz. defer work, pumps, slack processes, sleepers and
  one-shots, deadlock avoiders, task rejuvenation, serializers,
  concurrency exploiters and encapsulated forks. It also mentions the
  lack of efficient mechanisms for recovering from fork failures and
  the presence of timeouts and pauses with ridiculous values.
	
The paper focuses on extremely large systems developed in the same
place. As such its results are probably not very representative of the
general scenario. The thread model considered is also not universally
used, e.g. there is no mention of the semaphore construct, which is widely
used for achieving mutual exclusion.
	
The article is an extremely valuable source of paradigms, pitfalls and
important issues related to multi-threading. It would be a very useful
guide for anyone trying to develop large concurrent systems.

Date: 11/01

From: Zia Syed

The Byzantine Generals Problem

The basic idea in this paper is to show how a fault tolerant
distributed system can be created. The problem is modeled in terms of
an army in which some of the generals are traitors. The generals have
to agree on a common decision based on the views of the others, and
a few traitors should not be able to mislead the loyal generals
into making a wrong move. This decision can be as simple as:
either they all attack or they all retreat. The problem becomes
difficult when the loyal generals do not know which ones are the
traitor generals and the traitor generals can give false and malicious
information to the loyal generals.

They show that with oral messages, m traitors can defeat any solution
among fewer than 3*m + 1 generals, but with at least that many generals a
solution does exist. This problem can also be mapped into a problem of many
processes trying to agree on something in a distributed manner, where
some processes are trying to cheat and mislead the others, as in sharing
resources etc.  However everyone must know the decision function and they
must use the same decision function,
e.g. the majority function or the median of an ordered set of values.

One solution that they present assumes that the generals have unforgeable
signatures which can be verified, so that forgeries can be detected. Their
algorithm achieves the aim of all the loyal generals agreeing on a
decision by ensuring that they all act on the same set of messages. A nice
feature of this is that, regardless of the number of traitors, this
algorithm works correctly. However it has a huge overhead of an
approximately exponential number of messages.

Another variation of this problem arises when majority voting is
used to reach a decision. In this case two correct processors will
reach the same decision if both receive the same input. This is an
issue of reliable transfer of data, so that everyone receives the
correct input. In this case a faulty communication line is the same as a
general sending a false message, and the receiver cannot distinguish
the difference. So this situation is the same as the original problem,
and the system can work correctly as long as fewer than 1/3 of the
generals and communication lines are faulty. If we do not have unforgeable
signatures then we need a direct communication link between every pair of
generals. Another issue is the absence of a message, which requires a
timeout mechanism, which in turn requires synchronized clocks and an upper
bound on the maximum time for a message to get through, which is another
problem of its own.

----                                                                         
Time, Clocks, and the Ordering of Events in a Distributed System
          
This paper gives a distributed algorithm for synchronizing a system of
logical clocks, which can be used to totally order the events.
                  
The main problem being addressed is that in a distributed
system, where a collection of processes communicate with one another by
exchanging messages, messages do not take a negligible time to reach the
other process. Thus the notion of 'before', i.e. which event occurred
earlier, is hard to establish, and different processes could have different
relative orderings of events. We want this ordering to be the same, or
at least partially the same, so that the notion of things happening across
all processes is consistent.

The events occurring in a single process have an a priori total
ordering.  They define a relation -> which basically means that a -> b
implies that a can causally affect b and therefore must occur before
b. Using logical clocks, which are just counters and may not have any
physical meaning as time, they require that if an event a occurs
before b, then the time of a comes before the time of b. A simple
implementation of this logical clock C is that within any process
the time is just incremented between events occurring in the
process. Then every process timestamps the messages it sends, and the
receiving process updates its time to be greater than both the timestamp
and the current time at this process. This ensures that the receiving
time of the message is after the sending time of the message. This
gives us a partial ordering of events. To break ties we can
arbitrarily assign an ordering of the processes, and this gives us a total
ordering.
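
A minimal sketch of such a logical clock (hypothetical class, not code from
the paper): each process increments its counter on local events, stamps
outgoing messages, and on receipt advances its clock past the timestamp;
ties are broken by process id to get a total order.

    # Sketch of Lamport logical clocks (assumed interface).

    class LamportClock:
        def __init__(self, pid):
            self.pid, self.time = pid, 0

        def tick(self):                    # local event
            self.time += 1
            return self.time

        def send(self):                    # timestamp attached to an outgoing message
            return self.tick()

        def receive(self, msg_time):       # advance past the sender's timestamp
            self.time = max(self.time, msg_time) + 1
            return self.time

        def stamp(self):                   # (time, pid) pairs give a total order
            return (self.time, self.pid)

    a, b = LamportClock("A"), LamportClock("B")
    t = a.send()        # A sends a message at time 1
    b.receive(t)        # B's clock jumps to 2, so the send is ordered before the receive
    assert a.stamp() < b.stamp()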

Consider using a central server to allocate resources in the order in
which request messages are received; this can grant requests in the wrong
order. By defining a total ordering of all events, this problem is
overcome, since the total ordering specified by the algorithm will make
sure that allocation and release are done in the right order. This is
achieved by sending timestamped messages to all processes, which ACK back
with their own timestamps. A process can only execute its command when its
request precedes, in timestamp order, every other request it has received,
and it has received a message with a later timestamp from every other
process. The problem with this is that a process must know about all other
processes, and the failure of any process stalls the algorithm.
                                          
Certain external events may also be handled by modeling them within the
system and handling them in the same way. For physical clocks, using
bounds on the clocks' running rates, we can derive upper bounds on the
synchronization error in terms of the minimum delay for a message to
be received by the other end.  For this they require that the clocks
are never set back.


Date: 11/06

From: Sagar Chaki

Cost of Quality in Internet-style networks:
------------------------------------------

* Quality of Service (QoS) is not a major issue with traditional
circuit-switched telephony networks, which have been designed to
satisfy the human ear and do so extremely well. But with the
proliferation of packet-switched networks, and of applications with
diverse requirements that rely critically on such networks providing
certain service guarantees, QoS is rapidly becoming an important part
of agreements between service providers and their customers. Broadly,
the QoS of a wide-area network measures how well it does its job - how
quickly and reliably it transfers data from source to
destination. Technically, QoS refers to a collection of performance
metrics, primarily the following five: availability, throughput,
packet loss, latency and jitter.

* Different data streams have different QoS requirements.  At network
aggregation points, these data streams are combined for transport over
a common infrastructure. A mechanism is needed to label flows with
their priorities and to recognize these labels and act on
them. Unfortunately such mechanisms are absent in the popular TCP/IP
suite used for the vast majority of internet communication. The IETF
has proposed several methods to rectify this situation. Notable among
them are integrated service (IntServ), differentiated service
(DiffServ) and multiprotocol label switching (MPLS).

* The common open policy service (COPS) is a tool for assuring the QoS
of a network. It is extremely adaptable to the customer's
requirements. The requirements and rules for resource allocation,
known as policies, are decided in advance. This allows services to be
specified unequivocally and the resources required to provide the
service to be allocated. The key elements of a policy-based traffic
management system are policy creation and storage, interpretation, and
enforcement.

The paper discusses several interesting and promising schemes but
provides very little evidence of their effectiveness. The techniques
suggested are basically attempts to incorporate ATM-like service
guarantees into TCP/IP. The driving principles behind the design of
ATM have been extremely successful on small and reasonably secure
LANs. It remains to be seen whether they still remain effective on a
network as large, as distributed and as insecure as the internet.

The paper shows how an existing infrastructure can be suitably
modified to satisfy unforeseen requirements. Internet-style networks,
which are undoubtedly the most prevalent, can be made to assure QoS by
introducing appropriate mechanisms. The alternative is to develop a totally
new network infrastructure, which would be extremely expensive, if not
impossible.

Generalized Rate Monotonic Scheduling:
-------------------------------------

* The generalized rate monotonic scheduling (GRMS) theory provides an
analytic and engineering basis for the development, modification and
maintenance of real-time systems.  It guarantees that all tasks will
meet their deadlines if the total system utilization of these tasks
lies below a known bound, and these tasks are scheduled using
appropriate algorithms.

* A set of n independent tasks scheduled by the rate monotonic
algorithm will always meet their deadlines, for all task start times,
if the total utilization is less than n(2^(1/n)-1) (a small numeric
sketch of this test appears after this list). This bound is too
pessimistic, since the worst-case task set is contrived and highly unlikely
to occur in practice. For a set of independent periodic tasks, if a
task meets its first deadline when all the higher priority tasks are
started at the same time, then it can meet all its future deadlines
with any task start times. The scheduling of aperiodic tasks can also
be treated within the rate-monotonic framework of scheduling periodic
tasks using the concept of sporadic servers and budgets.

* Synchronization between tasks can lead to unbounded priority
inversion problems. This problem can be avoided using a simple
priority inheritance protocol in which a low priority task that blocks
a high priority task temporarily inherits the priority of the blocked
task. However this scheme can still lead to large periods of blocking
and deadlock. These issues are solved using a different priority
inheritance scheme called the priority ceiling protocol.
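
As referenced above, here is a minimal numeric sketch of the utilization-bound
test (the formula n(2^(1/n)-1) is from the paper; the task parameters below
are made up):

    # Sketch of the rate monotonic utilization-bound schedulability test.

    def rm_bound(n):
        return n * (2 ** (1.0 / n) - 1)    # ~0.828 for n=2, tends to ln 2 ~ 0.693

    def schedulable(tasks):                # tasks: list of (execution_time, period)
        u = sum(c / t for c, t in tasks)
        return u <= rm_bound(len(tasks))

    # Made-up example: three tasks with utilizations 0.25 + 0.20 + 0.20 = 0.65
    print(rm_bound(3))                               # about 0.780
    print(schedulable([(1, 4), (2, 10), (4, 20)]))   # True: 0.65 <= 0.780

And a minimal sketch of basic priority inheritance (hypothetical task and lock
objects; the priority ceiling protocol described above adds further rules on
when a lock may be acquired): while a low-priority task holds a lock that a
high-priority task wants, it temporarily runs at the blocked task's priority.

    # Sketch of basic priority inheritance on a single lock (assumed model).

    class Task:
        def __init__(self, name, priority):
            self.name, self.base, self.priority = name, priority, priority

    class Lock:
        def __init__(self):
            self.owner = None

        def acquire(self, task):
            if self.owner is None:
                self.owner = task
                return True
            # Blocked: the owner inherits the higher priority to bound the inversion.
            self.owner.priority = max(self.owner.priority, task.priority)
            return False

        def release(self):
            self.owner.priority = self.owner.base  # drop back to the base priority
            self.owner = None

    low, high = Task("low", priority=1), Task("high", priority=10)
    m = Lock()
    m.acquire(low)              # low-priority task gets the lock first
    m.acquire(high)             # high-priority task blocks ...
    print(low.priority)         # ... and low now runs at priority 10
    m.release()
    print(low.priority)         # back to 1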

The paper does not cover scenarios where the completion time of a task
is probabilistic. The framework discussed in the paper is only
effective when the worst case completion times are known. In the face
of uncertainty, certain weaker guarantees might still be provided.

The paper proves that formalism can be used to obtain guarantees even
in system design and development. The GRMS framework assures that all
tasks will meet their deadlines if certain criteria are satisfied. It
is powerful enough to handle aperiodic tasks as well as tasks that
might depend on each other.

Date: 11/13

From: Zia Syed

Practical Byzantine Fault Tolerance
-----------------------------------

The paper presents BFS, a Byzantine-fault-tolerant NFS service, and the
theory underlying it. The authors argue that previous Byzantine-fault-tolerant
systems have not been practical so far. Current systems either make a
synchrony assumption or exist only to prove theoretical
results. The synchrony assumption is dangerous in the presence of
malicious attacks, because attackers can delay the communication from
non-faulty nodes (a denial of service attack) until they are tagged
as faulty.
The system presented in the paper provides practical Byzantine fault
tolerance that doesn't assume synchrony and is fast enough to be used in
practice. It provides two properties, safety and liveness: if no more
than (n-1)/3 replicas are faulty, then clients eventually receive replies
to their requests, provided that delay(t) does not grow
faster than t indefinitely, where delay(t) is the time between the
moment t when a message is sent for the first time and the moment when
it is received by its destination.

These two properties are provided by a state machine replication
algorithm. The service is modeled as a state machine that is
replicated across the different nodes in a distributed system. The
replicas move through a succession of configurations called views. In
each view one replica is the primary and the others are backups. View
changes are carried out when it appears that the primary has failed (hence
assuring the liveness property). A client sends a request to the replica
that it believes to be the primary. The primary multicasts the request to
all backups. The replicas send their results directly to the client. The
client waits for f+1 replies from different replicas with valid
authenticators and the same result. This is the result of the operation.
f is the maximum number of nodes that may be faulty.
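
A minimal sketch of the client-side rule just described (hypothetical reply
format; the full protocol also has pre-prepare/prepare/commit phases on the
replica side): the client accepts a result once f+1 replicas with valid
authenticators agree on it, since at most f of them can be faulty.

    # Sketch of a BFT client accepting a result after f+1 matching valid replies.
    import collections

    def accept_result(replies, f, verify):
        # replies: iterable of (replica_id, result, authenticator)
        votes = collections.defaultdict(set)
        for replica, result, auth in replies:
            if verify(replica, result, auth):       # drop replies with bad authenticators
                votes[result].add(replica)
                if len(votes[result]) >= f + 1:     # f+1 distinct replicas agree:
                    return result                   # at least one of them is correct
        return None                                 # not enough agreement yet

    # Made-up example with f = 1 (so n = 3f + 1 = 4 replicas):
    ok = lambda replica, result, auth: auth == "valid"
    replies = [(0, "A", "valid"), (1, "B", "valid"), (2, "A", "valid")]
    print(accept_result(replies, f=1, verify=ok))   # "A"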

The system doesn't use digital signatures for message
authentication, because verification is costly and the algorithm
presented requires many verifications. The system uses
message authentication codes (MACs) instead. The system is able to
provide the same level of security as digital signatures by taking
advantage of the invariants that they have imposed on the system. The
system also uses two other optimizations for faster response time,
i.e. reducing the size and number of the messages, and incremental
checkpoint management. The paper presents a practical example of such a
service, BFS, and shows that it is just 3% slower than the non-replicated
version and works well.
 
The system assumes the faults occurring in the replicas to be
independent, i.e. it basically targets non-deterministic software errors,
which are the most problematic and persistent and the hardest to detect.
Independence of faults can be encouraged by having the replicas
and OS implemented by different teams (N-version programming). It also
requires each replica to have a different administrator and root password.
All these assumptions seem reasonable except having different OS
implementations, which might not always be feasible.

Implementing Fault-Tolerant Services Using the State Machine
Approach: A Tutorial
-------------------------------------------------------------

The paper presents a state machine model for tolerating faults in a
client-server setting. Using a single server in client-server
computing might be the simplest approach, but it doesn't provide any
fault tolerance. If fault tolerance is required, then a number of
servers that fail independently are needed.
The paper says that most fault tolerance protocols can be
modeled using the state-machine approach. The server can be modeled as a
state machine. A state machine consists of a number of variables,
which encode its state, and commands, which transform its state. The
state machine is deterministic. The clients request the service using
the commands and thereby transform the state of the server. There is a
causal ordering in the processing of requests.

Then the author presents two representative classes of failures,
i.e. Byzantine and fail-stop. To quantify fault tolerance, the paper
defines a t fault-tolerant system to be one that can
meet its specification provided that no more than t components fail
during some interval of interest. The paper argues that measuring the
system's fault tolerance in terms of t is more meaningful from the
system design point of view than MTBF, which is more user-oriented. A
t fault-tolerant state machine is obtained by replicating the state
machine and running a replica on each of several processors in a
distributed system.

The key to implementing a t fault-tolerant state machine is to ensure
Replica Coordination: all non-faulty replicas receive and process
all the requests (Agreement), and all non-faulty replicas process the
requests in the same relative order (Order). Knowledge of the commands
can allow the replica coordination requirements to be weakened and hence
allows cheaper protocols to be used for replica
management. The Agreement requirement can be satisfied by using a
Byzantine agreement protocol (Lamport). Order can be implemented
by using Lamport's scheme of logical clocks or synchronized real-time
clocks.

In order to implement a t fault-tolerant system that can tolerate
faulty output devices, one needs to know whether the output of the whole
replicated state machine ensemble is to be used within or outside the
system. If the output is used outside the system, e.g. sent to a device,
then a single voter is not enough, since its failure cannot be
tolerated. The system must also be able to tolerate the failure of the
output device. This is done by replicating the output device and voter.
Each voter combines the output of all state machine replicas and produces
a signal that drives the associated output device. If the output is used
inside the system, i.e. by a client, then the client can itself combine
the outputs of the state machine replicas. Here the voter is considered
to be part of the client. If the voter is faulty, then the client is
effectively faulty itself, so we don't need to worry about
incorrect output; it is the client's own business now.

A t fault-tolerant system also needs to tolerate a faulty client that
may corrupt the state machine replicas and cause the generation of
erroneous output in the future in response to requests from non-faulty
clients. This situation can be handled by replicating the clients
themselves. This approach requires changes in the state machine
replicas, because each state machine would need to get a request from
each client replica. It also requires the state machine
implementation to know application-specific information about
the clients for the cases where each client replica may send a
slightly different input value (e.g. a different sensor
value, because each replica read the value at a slightly different
time). The final value can be computed by using a median or some more
complex algorithm such as the fault-tolerant intersection algorithm
[Marzullo 1989].  If it is not possible to replicate a client, then
defensive programming techniques can be used to protect the state machine
replicas from getting corrupted. In such techniques the state machine
is formulated in a way that a faulty client cannot corrupt
its state. These techniques are also very application-specific and
require a mathematically rigorous model of the state machine so that only
valid sequences of commands may be executed by a client.

Date: 11/15

From: Sagar Chaki

Quicksilver:
-----------

* Quicksilver is structured as a lean kernel above which system
  services are implemented as processes (servers) communicating with
  other requesting processes (clients) using an IPC mechanism. Since
  essential services like the file system are external to the kernel,
  Quicksilver has to cope with a much more complicated set of failure
  modes.

* The recovery mechanism in Quicksilver is based on atomic
  transactions. The recovery manager contains three primary
  components:
	
	* Transaction manager: Manages commit coordination by
          communicating with servers at its own node and with
          transaction managers at other nodes.

	* Log manager: Serves as a common recovery log both for the
          transaction manager's commit log and server's recovery data.

	* Deadlock detector: Detects global deadlocks and resolves
          them by aborting offending transactions.

* The IPC is a request-response protocol structured according to the
  client-server model. Several guarantees are made regarding the
  reliability of the IPC: requests are not lost or duplicated, data is
  transferred reliably, and a particular client's requests are queued
  to the service in the sequence they are issued. The kernel handles
  IPC between a client and a server on the same node. Requests made to
  remote servers are forwarded to the communication manager on that
  node.

Quicksilver requires that every service be modeled as a
transaction. It is unclear how this can be done efficiently.

The client-server architecture has been extremely successful in the
domain of standalone applications. Quicksilver tries to extend this
paradigm to an entire programming environment. Often this leads to
successful and versatile systems.

Hive:
----

* Hive is an OS designed for large-scale distributed shared memory
  multiprocessors. It is structured as an internal distributed system
  of independent kernels called cells. This allows it to be reliable
  by isolating cells from faults in other cells. It also
  allows scalability, since few kernel resources are shared by
  processes on different cells, so one could potentially achieve more
  parallelism by adding more cells.

* Hive addresses three ways in which faults can be propagated:

	* Message exchange: This occurs mostly via RPCs. It is avoided
          by sanity checks on data received from other cells and
          timeouts.
	
	* Remote read: Since cells can read each other's internal data
          structures, a correct cell might read corrupt data belonging
          to some other cell. This is avoided by using a careful
          reference protocol that checks for various possible error
          conditions, in addition to sanity checks.

	* Remote write: A faulty cell might also issue wild writes to
          another cell's memory. Hive avoids this firstly by
          prohibiting cells from writing each other's data structures. All
          user-level pages used by local processes are also
          protected. Shared user-level pages are given as much
          protection as possible. Cells monitor each other and use a
          distributed consensus algorithm for fault detection. The
          pages writable by faulty cells are discarded.

* The intercell resource allocation is handled by a user-level process
  called Wax, which maintains a global picture of the system. Wax is
  multi-threaded and can execute on multiple cells. This allows Hive
  to do good global allocation without creating a central performance
  bottleneck.
	
The biggest problem with Hive is probably its reliance on specific
hardware support. It would be impossible to implement it efficiently
on top of hardware that does not provide some of the specific support,
like the memory firewall, that FLASH provides.
	
The Hive architecture was built on top of a multiprocessor that has
still not reached the production stage. It is important to develop
systems assuming hardware support that does not exist at present
but is very likely to be available in the near future.