18-849 Reading List Fall 2008
- Shaw, M., Writing good software engineering research papers: minitutorial;
ICSE, 2003, pp. 726 - 736. (ACM |
local)
Required:
Note: Read Wallace & Kuhn before reading Sullivan &
Chillarege.
- D. Wallace and D. R. Kuhn, "Failure modes in medical device software:
an analysis of 15 years of recall data", International Journal of
Reliability, Quality and Safety Engineering (IJRQSE), vol. 8 no. 4, Dec 2001,
pp 351-371 (Web
| local) / 20
small pages.
Analysis of FDA data for non-lethal software recalls.
- M. Sullivan, R. Chillarege, (IBM Watson), "Software Defects and their
Impact on System Availability - A Study of Field Failures in Operating
Systems," FTCS-21, 1991. (Citeseer |
local) / 8 pages.
The seminal paper for Orthogonal Defect Classification (ODC).
- A. Avizienis, J.-C. Laprie, B. Randell and C. Landwehr, Basic concepts
and taxonomy of dependable and secure computing, IEEE Transactions on
Dependable and Secure Computing, v. 1, n. 1, January 2004. (IEEE |
local) / 23 pages.
This is an update to the paper that was required reading for 18-649; read it
to brush up on terminology and re-orient yourself to the big picture.
Supplemental:
- A. Avizienis, J.-C. Laprie and B. Randell, Fundamental Concepts of
Dependability, Research Report N01145, LAAS-CNRS, April 2001. (Citeseer |
local) / 21 pages.
- Butler, R., A
Primer on Architectural Level Fault Tolerance, NASA/TM-2008-215108,
Langley Research Center, Hampton VA., Feb. 2008. (local)
- Siewiorek, Chillarege, & Kalbarczyk, "Reflections on Industry
Trends and Experimental Research in Dependability," IEEE TDSC, Vol. 1, No.
2., April 2004. (local)
- R. Chillarege, "ODC for process measurement, analysis, and
control," Fourth International Conference on Software Quality, ASQC
Software Division, Oct 3-5, 1994 McLean, VA. (Web |
local)
See especially the
chapter
on comparing defect types to phase of development process.
- Ram Chillarege's ODC home
page.
Required:
- ESA, "Ariane 501 - Presentation of Inquiry Board report," press
release N° 33-1996, (WWW |
local) / 2 pages.
Summary of Ariane 501 board of inquiry report. For full report see
supplemental reading below.
- F. Cristian, "Understanding fault-tolerant distributed systems,"
Communications of the ACM, Vol. 34 No. 2, February 1991, pp. 56 - 78 (ACM |
local) / 23 pages
- Gray, J., "A census of Tandem system availability between 1985 and 1990," IEEE Trans.
Reliability, 39(4), 409-418, Oct 1990. (IEEE |
local) / 10 pages.
- Weinstock, C.B., "SIFT: System Design and Implementation,"
Fault-Tolerant Computing 1995, Highlights from Twenty-Five Years.,
Twenty-Fifth International Symposium on (originally FTCS 1980), (IEEE |
local) / 3 pages
Supplemental:
- Alfred Spector, David Gifford; The space shuttle primary computer system,
Communications of the ACM, September 1984, Volume 27, Issue 9 (ACM |
local) / 28 pages
- N. Leveson and C. Turner, (U. Washington; U.C. Irvine) "An
Investigation of the Therac-25 Accidents," IEEE Computer, Vol. 26,
No. 7, July 1993, pp.18-41. (IEEE |
local) / 24 pages.
Classic case study -- this would be required reading except it is covered by
the pre-req course.
- Gage, D.; McCormick, J., "We Did Nothing Wrong: why software quality
matters," Baseline magazine, March 4, 2004. (Web |
local)
And all these years after the Therac 25, radiation machine software is still
killing people!
- Murphy, B.; "Automating software failure reporting," ACM Queue;
vol. 2 no 8; pp. 42-48. (ACM |
local)
- Bartlett, J., "A NonStop kernel," SOSP, 1981, pp. 22-29. (ACM
| local)
- Bartlett, W.; Spainhower, L., "Commercial fault tolerance: a tale of
two systems," IEEE TDSC, vol. 1, no. 1, Jan 2004, pp. 88-96. (local)
- Gene D. Carlow, Architecture of the space shuttle primary avionics software
system, Communications of the ACM, September 1984, Volume 27, Issue 9 (ACM |
local) / 11 pages
- Garman, "The bug heard 'round the world," ACM Sigsoft software
engr. notices 6(5), pp. 3-10, oct 81 (local)
- J. Gray, Why do computers stop and what can be done about it?,
in Proc. 5th Symp. on Reliability in Distributed Software and Database Systems,
(Los Angeles, CA, USA), pp.3-12, IEEE Computer Society Press, January 1986. (Tech. report |
local TR |
local) / 9 pages.
- L. Hatton, "Software failures-follies and fallacies," IEE
Review, Volume: 43 Issue: 2, 20 March 1997, pp. 49-52, (IEEE |
local) / 4 pages.
Thoughts about Ariane 5 and other failures -- why can't we get this stuff
right?
- Hennebert, C. & Guiho, G., "SACEM: a fault tolerant system for
train speed control", 1993. (local)
- Hopkins, A.L., Jr.; Smith, T.B., III; Lala, J.H.; FTMP - A highly
reliable fault-tolerant multiprocessor for aircraft, Proc. IEEE, Oct 1978,
1221-1239 (IEEE |
local)
- Hoyme, K.; Driscoll, K.; "Safebus," Digital Avionics Systems
Conference, 1992. Proceedings., IEEE/AIAA 11th , 5-8 Oct 1992 Page(s): 68 -73
(IEEE |
local)
- Kuhn, D., "Sources of failure in the public switched telephone
network," IEEE Computer, April 1997, pp. 31-36. (local)
- I. Lee and R. K. Iyer, "Faults, Symptoms, and Software Fault Tolerance
in the Tandem GUARDIAN90 Operating System", IEEE 1993, pp. 20-29. (IEEE
| local) / 10 pages.
- J. Lions, Ariane 501 Inquiry Board Report, July 1996. (WWW |
local) / 60 pages.
- Maisel, W.; Sweeney, M.; Stevenson, W.; Ellison, K.; Epstein, L.,
"Recalls and safety alerts involving pacemakers and implantable
cardioverter-defibrillator generators," JAMA, vol 286, No. 7, August 15,
2001, pp. 793-799. (local)
- B. Nuseibeh, "Ariane 5: Who Dunnit?" IEEE Software, Vol.
14 No. 3, May-June 1997, pp. 15 -16 (IEEE |
local) / 2 pages.
- Powell, D.; Bonn, G.; Seaton, D.; Verissimo, P.; Waeselynck, F., "The
Delta-4 Approach to Dependability in Open Distributed Computing Systems,"
Fault-Tolerant Computing, 1995, Highlights from Twenty-Five Years.,
Twenty-Fifth International Symposium (originally FTCS 1988) (IEEE |
local) / 6 pages.
- D. Powell (LAAS-CNRS), "Distributed Fault Tolerance Lessons Learnt
from Delta-4", Workshop on Fault-Tolerant Architectures, 1994. (Citeseer |
local) / 16 pages.
Case study of distributed fault tolerance implemented in software with
mostly-off-the-shelf hardware.
- Rennels, D.A.; Architectures for fault-tolerant spacecraft computers; Proc.
IEEE, Oct 1978; Page(s): 1255- 1268 (IEEE |
local)
- S. Shrivastava, "Lessons Learned from Building and Using the Arjuna
Distributed Programming System," 1995. (Citeseer |
local) / 15 pages.
- S. Webber, J. Beirne, "The Stratus Architecture," FTCS 21, 1991.
(IEEE
| local) / 7 pages.
- Wensley, J.H.; Lamport, L.; Goldberg, J.; Green, M.W.; Levitt, K.N.;
Melliar-Smith, P.M.; Shostak, R.E.; Weinstock, C.B.; SIFT: Design and analysis
of a fault-tolerant computer for aircraft control; Proc. IEEE, Oct 1978;
Page(s): 1240- 1255 (IEEE |
local)
Required:
- A. Reibman & M. Veeraraghavan, (Bell Labs) "Reliability modeling:
an overview for system designers," Computer, Vol. 24, No. 4, April
1991, pp. 49-57. (IEEE |
local) / 9 pages.
- Dugan, "Dependability modeling for fault-tolerant software" (Ch
5) In: Lyu, Ed., Software Fault Tolerance, Wiley & Sons, 1995. (local) / 15 pages.
- Mitra, S.; Seifert, N.; Zhang, M.; Shi, Q.; Kim, K.S.; "Robust system
design with built-in soft-error resilience" Computer, Volume 38, Issue 2,
Feb. 2005 Page(s):43 - 52 (IEEE |
local)
- Schlichting & Schneider, "Fail-stop processors: an approach to
designing fault-tolerant computing systems," ACM Trans. Comp. Sys., v 1,
pp 222-238, Aug. 1983 (citeseer |
local) / 21 pages.
Supplemental:
- Abraham, J., & Siewiorek, D., "An algorithm for the accurate
reliability evaluation of triple modular redundancy networks," IEEE Trans.
Computers, July 1974 (local)
- P. Agrawal, "Fault-Tolerance in Microprocessor Systems without
Dedicated Redundancy," IEEE Transactions on Computers, Vol. 37, no. 3,
March 1988. (IEEE |
local) / 5
pages.
- Balkovich et al., "VAXcluster availability modeling", Digital
Technical Journal, 1987. (local)
- D. Barbara, H. Garcia-Molina, "The Reliability of Voting
Mechanisms," IEEE Transactions on Computers, Vol. C-36, No. 10, October
1987. (local)
- D. Bossen & M. Hsiao, (IBM) "ED/FI: A Technique for Improving
Computer System RAS," Fault-Tolerant Computing, 1995, Highlights from
Twenty-Five Years., Twenty-Fifth International Symposium, (originally FTCS
1981). (IEEE |
local) / 6 pages.
- W. Bouricius, W. Carter, D. Jessep, P. Schneider, & A. Wadia, (IBM)
"Reliability modeling for fault tolerant computers,"
Fault-Tolerant Computing, 1995, Highlights from Twenty-Five Years.,
Twenty-Fifth International Symposium, (originally FTCS 1971) (IEEE |
local) / 4 pages.
The math is pretty difficult to follow in this one.
- Cullyer, W.J.; "Implementing high integrity systems: the VIPER
microprocessor" Computer Assurance, 1988. COMPASS '88 , 27 Jun-1 Jul 1988
Page(s): 56 -66 (IEEE |
local)
- Geist, Reliability estimation of fault-tolerant systems: tools and
techniques, IEEE Computer, 23(7), July 1990. (IEEE |
local) / 10 pages.
- Malhis, L.M.; Sanders, W.H.; Schlichting, R.D., "Numerical evaluation
of a group-oriented multicast protocol using stochastic activity
networks," Petri Nets and Performance Models, 1995, pp. 63 -72.
(IEEE |
local) / 10
pages.
- Nelson, Fault tolerant computing: fundamental concepts, IEEE Computer,
23(7), July 1990. (IEEE |
local) / 7 pages.
- J. von Neumann, (1956) "Probabilistic Logic and the Synthesis of
Reliable Organisms from Unreliable Components." In: A. H. Taub, editor.
John von Neumann: Collected Works, volume V: Design of Computers, Theory of
Automata and Numerical Analysis. Pergamon Press, 1961. (local)
This is a seminal paper for hardware fault tolerance.
- Rai, S. et al., "Two recursive algorithms for computing the
reliability of k-out-of-n systems," IEEE Trans. Reliability, June 1987.
(local)
- Rennels, D., "Fault-tolerant computing -- concepts and examples",
IEEE Trans. Computers, Dec. 1984 (local)
- Siewiorek, Fault tolerance in commercial computers, IEEE Computer, 23(7),
July 1990. (IEEE |
local) / 12 pages.
- Singh, Fault tolerant system intro, IEEE Computer, 23(7), July 1990. (IEEE |
local) / 3 pages.
- Sahner, R. & Trivedi, K., "Reliability modeling using
SHARPE," IEEE Trans. Reliability, June 1987. (local)
- Wang, N.J.; Patel, S.J., "ReStore: Symptom-Based Soft Error Detection
in Microprocessors," IEEE Trans. Dependable and Secure Computing,
July-Sept. 2006 Volume: 3, Issue: 3 On page(s): 188- 201 (IEEE |
local)
Pending:
- Derr, prediction of wiring harness reliability, SAE 870055, (in SP-696,
Feb. 1987). (local)
- Davis & Johri, reliability analysis of mechanical components, SAE
870052, (in SP-696, Feb. 1987) (local)
- Binroth, Coit, Denson and Hammer. "Development Of Reliability
Prediction Models For Electronic Components In Automotive Applications",
SAE Paper 840486. (local)
Required:
- Maffeis, S., "Adding Group Communication and Fault-Tolerance to
CORBA," Proc. USENIX Conf. on Object-Oriented Technologies, June 1995. (Citeseer |
local) / 12 pages
- Pascal A. Felber, Benoit Garbinato & Rachid Guerraoui, "The Design
of a CORBA Group Communication Service," (long version of paper in:
Proceedings of the 15th Symposium on Reliable Distributed Systems (SRDS-15)),
1996 (Citeseer |
local) / 12 pages
- Narasimhan, P.; Moser, L.E.; Melliar-Smith, P.M.; "Lessons Learned in
Building a Fault-Tolerant CORBA system," Dependable Systems and Networks,
2002, pp. 39-44. (IEEE |
local) / 6 pages.
- P. Felber; P. Narasimhan; "Experiences, strategies, and challenges in
building fault-tolerant CORBA systems," IEEE Transactions on
Computers, vol. 53, no. 5, May 2004, pp. 497-511 (web |
local)
Supplemental:
- These are recommended, but not required reading:
- P. Narasimhan, L.E. Moser, P.M. Melliar-Smith, "Exploiting the
Internet Inter-ORB Protocol Interface to Provide CORBA with Fault
Tolerance," Proceedings of the 3rd USENIX Conference on Object-Oriented
Technologies and Systems (COOTS), 1997. (Citeseer |
local)
- OMG, FT-CORBA standard, version 3, July 2002 (Web)
- Merlin, P.M.; Randell, B., State restoration in distributed systems,
Fault-Tolerant Computing, 1995, Highlights from Twenty-Five Years.,
Twenty-Fifth International Symposium on Page(s): 207 (Originally FTCS 1978) (IEEE |
local)
- Chandy and Lamport, Distributed Snapshots: Determining the Global States of
a Distributed System, ACM TOCS, pp. 63-75, Feb. 1985. (ACM |
local) / 13 pages.
Required:
- Leslie Lamport. Time, Clocks, and the Ordering of Events in a Distributed
System, Communications of the ACM, Vol. 21, No. 7 (July 1978), pp. 558-565. (ACM |
local) / 8 pages.
- Kopetz, H., & Ochsenreiter, W., "Clock synchronization in
distributed real time systems," IEEE Trans. Computers, August 1987. (local) / 8 pages
- Mills, D.L. "On the chronology and metrology of computer network
timescales and their application to the Network Time Protocol;" ACM
Computer Communications Review, 21, 5 (October 1991), 8-17. (Web |
local) / 9 pages
- Select one of:
- Temporal composability, Kopetz, H.; Obermaisser, R.; Computing &
Control Engineering Journal , Volume: 13 Issue: 4 , Aug 2002 Page(s): 156 -162
(IEEE |
local) / 7 pages.
- Raynal, M., Singhal, M., Logical time: capturing causality in distributed
systems, Computer 29(2):49-56, IEEE, February 1996. (IEEE |
local) / 8 pages.
Supplemental:
- Kenneth P. Birman. A Response to Cheriton and Skeen's Criticism of Causal
and Totally Ordered Communication. Technical report, Cornell University,
October 1993. (Citeseer |
local)
- Fault-tolerant clock synchronization in distributed systems Butler, R.W.;
Ramanathan, P.; Shin, K.G.; Computer , Volume: 23 Issue: 10 , Oct 1990 Page(s):
33 -42 (IEEE
| local)
- David Cheriton and Dale Skeen, Understanding the Limitations of Causally
and Totally Ordered Communication, Proc. of the Symposium on Operating System
Principles (SOSP), December 1993. (ACM |
local)
- Cristian, F., "Probabilistic Clock Synchronization,"
Distributed Computing, No. 3, 1989, pp. 146-158. (local)
- D. A. Jefferson. "Virtual Time". ACM Transactions on Programming
Languages and Systems, Vol. 7, No. 3, pp. 404-425, July 1985. (ACM |
local)
- Robert H. B. Netzer and Jian Xu, "Necessary and Sufficient Conditions
for Consistent Global Snapshots," IEEE Trans. on PADS., Vol. 6, No. 2,
February 1995. (IEEE |
local)
- Mills, D.L., and P.-H. Kamp. "The nanokernel;" Proc. Precision
Time and Time Interval (PTTI) Applications and Planning Meeting; (Reston VA,
November 2000), 423-430. (Web
| local)
- D. L. Palumbo, "The Derivation and Experimental Verification of Clock
Synchronization Theory," IEEE Transactions on Computers, Vol. 43, No. 6,
June 1994. (IEEE |
local)
- Raynal, M.; Singhal, M.; Mastering agreement problems in distributed
systems, IEEE Software , Volume: 18 Issue: 4 , Jul/Aug 2001 Page(s): 40 -47 (IEEE |
local)
- Shin, J. & Ramanathan, P., "Clock synchronization of a large
multiprocessor system in the presence of malicious faults," IEEE Trans.
Computers, Jan. 1987. (local)
- Francisco J. Torres-Rojas; Mustaque Ahamad; and Michel Raynal. "Timed
consistency for shared distributed objects," ACM PODC, May, 1999. (IEEE
| local)
- Synchronization of fault-tolerant clocks in the presence of malicious
failures Vasanthavada, N.; Marinos, P.N., IEEE Trans. Computers, April 1988.
Page(s): 440-448 (IEEE |
local)
Required:
- Randell, The evolution of the recovery block concept (ch 1) In: Lyu, Ed.,
Software Fault Tolerance, Wiley & Sons, 1995. (local) / 21 book
pages.
- Xu, J., Randell, B., Roll-forward error recovery in embedded real-time
systems, Proceedings. 1996 International Conference on Parallel and Distributed
Systems (Citeseer |
local) / 8 pages.
- Chung-Chi Jim Li; Fuchs, W.K., "CATCH - Compiler-Assisted Techniques
for Checkpointing," Fault-Tolerant Computing, 1995, Highlights from
Twenty-Five Years., Twenty-Fifth International Symposium on (From FTCS
1990), (IEEE |
local)
Supplemental:
- Campbell & Randell, 1986 error recovery in asynchronous systems, IEEE
Trans SW Eng. SE-12, 8, pp. 811-826. (local) / 16 pages.
- K. M. Chandy, "A survey of analytic models of rollback and recovery
strategies," Computer, vol. 8 no. 5, May 1975, pp. 40-47. (local)
- Chiu, J.-F.; Ge-Ming Chiu; "Placing forced checkpoints in distributed
real-time embedded systems," Computing & Control Engineering Journal ,
Volume: 13 Issue: 4 , Aug 2002 Page(s): 197 -205, (IEEE |
local) / 9 pages.
- E. N. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson, "A survey of
rollback-recovery protocols in message-passing systems," ACM Computing
Surveys, v. 34, n. 3, Sept 2002, pp. 375-408. (ACM |
local)
- Kim, The distributed recovery block scheme (ch 8) In: Lyu, Ed., Software
Fault Tolerance, Wiley & Sons, 1995. (local)
- Koo, R. and Toueg, S., Checkpointing and rollback-recovery for distributed
systems, Trans. Software Engineering, SE-13(1):23-31, IEEE, 1987. (local) / 9 pages.
- Krishna, C.M.; Singh, A.D.; Reliability of checkpointed real-time systems
using time redundancy; IEEE Transactions on Reliability, Volume: 42, Issue: 3,
Sep 1993, Page(s): 427-435 (IEEE |
local) / 8 pages.
- Leu, P. and Bhargava, B., A model for concurrent checkpointing and recovery
using transactions, Proc. 9th Intl. Conf. Distr. Comp. Sys, 423-430, IEEE, 1989.
(IEEE |
local) / 8 pages.
- D. K. Pradhan, N. H. Vaidya, "Roll-Forward and Rollback Recovery:
Performance-Reliability Trade-Off", FTCS 24, 1994. (Citeseer |
local)
- Pradhan, D.K.; Vaidya, N.H.; Roll-forward checkpointing scheme: a novel
fault-tolerant architecture, Computers, IEEE Transactions on , Volume: 43
Issue: 10 , Oct 1994 Page(s): 1163 -1174 (IEEE |
local)
- B. Randell. System structures for software fault-tolerance. IEEE Trans.
Software Eng., 1, 2(June 1975), 220-232. (local)
- Strom, R.E. and Yemini, S., Optimistic recovery in distributed systems,
Trans. Computer Systems, 3(3):204-226, ACM, August 1985. (ACM |
local) / 23 pages.
Required:
- Anderson, T.; Barrett, P.A.; Halliwell, D.N.; Moulding, M.R.B., "An
Evaluation of Software Fault Tolerance in a Practical System,"
Fault-Tolerant Computing, 1995, Highlights from Twenty-Five Years.,
Twenty-Fifth International Symposium, p. 130 (Originally FTCS 1985) (IEEE |
local) / 6 pages.
- Levendel, Y., "The cost effectiveness of telecommunication service
dependability" (ch 12) In: Lyu, Ed., Software Fault Tolerance, Wiley &
Sons, 1995 (local) / 36 book
pages.
- Select one of below:
- Wilken, K.; Shen, J.P.; "Continuous signature monitoring: efficient
concurrent-detection of processor control errors;" IEEE Trans. Computer
Aided Design, 9(6) 629-641, June 1990. (IEEE |
local) / 13 pages.
- Vaidyanathan, K.; Trivedi, K.S.; "A comprehensive model for software
rejuvenation;" IEEE Trans. Dependable and Secure Computing, Volume 2,
Issue 2, April-June 2005 Page(s):124 - 137 (IEEE |
local)
Other High-Level Discussions
- Littlewood, B. & Strigini, L., "Software Reliability &
Dependability: a roadmap," Proceedings of the conference on the future
of software engineering, May 2000. (ACM |
local)
- Bev Littlewood, "Limits to Dependability Assurance--A Controversy
Revisited," pp. 6, 29th International Conference on Software
Engineering (ICSE'07 Companion), 2007. (ACM)
Supplemental:
- Arlat, J.; Kanoun, K.; Laprie, J., "Dependability evaluation of
software fault-tolerance," Fault-Tolerant Computing, 1995, Highlights from
Twenty-Five Years., Twenty-Fifth International Symposium on Page(s): 194 (IEEE |
local)
- S. Garg, A. van Moorsel, K. Vaidyanathan and K. S. Trivedi., "A
Methodology for Detection and Estimation of Software Aging," Int'l. Symp.
on Software Reliability Engineering, ISSRE 1998, November 1998. (IEEE |
local) / 10 pages.
- J. R. Horgan and A. P. Mathur, "Perils of software reliability
modeling," Technical Report, SERC-TR-160-P, 1995, Software Engineering
Research Center, Purdue University, W. Lafayette, IN. (Citeseer |
local)
- G. F. Sullivan, D. S. Wilson, G. M. Masson, "Certification of
Computational Results," IEEE Trans. on Computers, Vol. 44, No. 7, July
1995. (IEEE
| local)
- Y. M. Wang, Y. Huang, and W. K. Fuchs, "Progressive retry for software
error recovery in distributed systems," in Proc. IEEE Fault-Tolerant Computing
Symposium (FTCS-23), pp. 138-144, June 1993. (IEEE |
local)
- Huang, software fault tolerance in the application layer (ch 10) In: Lyu,
Ed., Software Fault Tolerance, Wiley & Sons, 1995 (local)
- Iyer, software fault tolerance in computer operating systems (ch 11) In:
Lyu, Ed., Software Fault Tolerance, Wiley & Sons, 1995 (local)
- D. J. Taylor, J. P. Black, "Principles of Data Structure Error
Correction," IEEE Trans. on Computers, Vol. C-31, No. 7, July 1982. (local)
- D. J. Taylor, D. E. Morgan, J. P. Black, "Redundancy in Data
Structures: Improving Software Fault-Tolerance," IEEE Trans. on Software
Engineering, V. SE-6, No. 6, November 1980. (local)
See also: Exception handling; Fault
Injection
Required:
- J. Goodenough, "Exception Handling: Issues and Proposed
Notation," Communications of the ACM, vol. 18(12), pp. 683-696, 1975. (ACM |
local) / 14 pages
- Vo, Kiem-Pheng; Wang, Yi-Min; Chung, P.Emerald; Huang, Yennun, "Xept:
A Software Instrumentation Method For Exception Handling." Eighth
International Symposium on Software Reliability Engineering, November 1997, p.
60-69. (Citeseer |
IEEE |
local) / 10 pages
- Romanovsky, A., An exception handling framework for N-version
programming in object-oriented systems, Proceedings Third IEEE
International Symposium on Object-Oriented Real-Time Distributed Computing,
2000 (IEEE |
local)
Supplemental:
- Buhr, P.A.; Mok, W.Y.R.; " Advanced exception handling
mechanisms," IEEE Trans. Software Engineering, Sep. 2000, vol 26
no. 9, pp. 820-836. (IEEE |
local)
- Cristian, "Exception Handling" (Citeseer |
local)
- Cristian, F., Exception Handling and Software Fault Tolerance,
Fault-Tolerant Computing, 1995, Highlights from Twenty-Five Years.,
Twenty-Fifth International Symposium on Page(s): 120 (IEEE |
local)
- A. Garcia, C. Rubira, A. Romanovsky, J. Xu, "A comparative study of
exception handling mechanisms for building dependable object-oriented
software," Journal of Systems and Software Volume 59, Issue 2, 15 November
2001, Pages 197-222 (local)
- I. Hill, "Faults in functions, in ALGOL and FORTRAN," The
Computer Journal, 14(3): 315-316, August 1971. (Web |
local)
- P.A. Lee, "Exception Handling in C Programs," Software Practice
and Experience, Vol. 13, 1983. (local)
- P. M. Melliar-Smith, B. Randell, "Software reliability: The role of
programmed exception handling", Proceedings of an ACM conference on
Language design for reliable software, 1977, Raleigh, North Carolina,
ACM Press, New York, NY, USA, Pages: 95 - 100 (ACM |
local)
- Romanovsky, Alexander; Xu, Jie; Randell, Brian, "Exception Handling
in Object-Oriented Real-Time Distributed Systems." First International
Symposium on Object-Oriented Real-Time Distributed Computing (ISORC '98), April
1998, p. 32-42. (Citeseer |
IEEE |
local) / 12 pages
- Jie Xu; Romanovsky, A.; Randell, B. , "Concurrent exception handling
and resolution in distributed object systems," IEEE Trans. Parallel and
Distributed Systems, vol. 11, no. 10., Oct 2000, pp. 1019 - 1032 (IEEE | local)
Other sources:
- Garcia, A.F., Beder, D.M., Rubira, C.M.F., An exception handling
software architecture for developing fault-tolerant software, 5th
International Symposium on High Assurance System Engineering, 2000
- Hagen, C., Alonso, G., Flexible Exception Handling in the OPERA
Process Support System, 18th International Conference on Distributed
Computing Systems, 1998
Required:
Part 1:
- Lamport, L., Shostak, R., and Pease, M., The Byzantine Generals Problem,
Trans. Prog. Lang. and Sys. 4(3):382-401, ACM, July 1982. (ACM |
local) / 20 pages.
- J. Wylie, "Byzantine Generals OM algorithm explained," Feb. 2003.
(local) / 3 pages
This was written by an 18-849 student to try to help explain the ideas. Note
that the value labels (e.g., 140V) are an attempt to show the path taken by a
message, not the actual value of the message itself.
- K. Driscoll, B. Hall, H. Sivencrona, P. Zumsteg. Byzantine Fault Tolerance,
From Theory to Reality. Proc. 22nd International Conference on Computer Safety,
Reliability and Security (SAFECOMP03), pp.235-248, Edinburgh, Scotland, UK,
October 2003. (Citeseer |
local) / 18 pages
- K. Driscoll, "Portrait of a Byzantine Assassin" (slides) (local)
Part 2:
- M. Azadmanesh and R. Kieckhafer. Exploiting Omissive Faults in Synchronous
Approximate Agreement. IEEE Transactions on Computers, Vol. 49, No. 10, Oct.
2000, p. 1031-42. (IEEE |
local) / 12 pages.
- Lamport, Leslie, and Melliar-Smith, P.M. "Synchronizing Clocks in the
Presence of Faults." Journal of the ACM, vol 32, no 1, January 1985, p.
53-78. (ACM |
local) / 27 pages.
- R. Kieckhafer, C. J. Walter, A. M. Finn, P. M. Thambidurai, "The MAFT
Architecture for Distributed Fault Tolerance," IEEE Trans. on Computers,
Vol. 37, No. 4, April 1988. (IEEE |
local) / 8 pages.
Highly Recommended:
- E. Latronico, Reliability Validation of Group Membership Services for
X-by-Wire Protocols, Ph.D. Dissertation, Carnegie Mellon University, May
2005 (local)
The selected pages give an overview and literature tour of Byzantine fault
model research and agreement research.
Supplemental:
- M. Barborak, M. Malek, A. Dahbura, "The Consensus Problem in Fault
Tolerant Computing," ACM Computing Surveys, vol. 25, No. 2, June 1993. (ACM |
local)
- Christian Cachin, Klaus Kursawe, Frank Petzold, Victor Shoup, Secure and
Efficient Asynchronous Broadcast Protocols, Lecture Notes in Computer Science,
Volume 2139, Jan 2001, Page 524 (local)
Improved Byzantine algorithm
- Miguel Castro and Barbara Liskov. Practical Byzantine Fault-Tolerance and
Proactive Recovery. ACM Transactions on Computer Systems. Volume 20, Issue 4,
Nov. 2002, Pages 398-461. (ACM |
local)
Improved Byzantine algorithm
- K. Birman and T. Joseph. Reliable communication in the presence of
failures. ACM Trans. Computer Systems, 5(1):47-76, 1987. (ACM |
local)
- Cristian, F.; Aghili, H.; Strong, R.; Volev, D.; "Atomic broadcast:
from simple message diffusion to Byzantine agreement," Fault-Tolerant
Computing, 1995, Highlights from Twenty-Five Years., Twenty-Fifth International
Symposium on Page(s): 431 (Originally FTCS 1985). (IEEE |
local)
- Frison, S.G.; Wensley, J.H., "Interactive consistency and its impact
on the design in TMR systems," Fault-Tolerant Computing, 1995, Highlights
from Twenty-Five Years., Twenty-Fifth International Symposium on Page(s): 425
(Originally FTCS 1982). (IEEE |
local) / 6 pages.
- James Kistler and M. Satyanarayanan. Disconnected Operation in the Coda
File System, ACM Trans. on Computer Systems 10(1), February 1992, pp. 3-25. (Citeseer |
local)
- M. Fischer and N. Lynch. "A Lower Bound for the Time to Assure
Interactive Consistency", Information Processing Letters, 14(4), pp.
183-186, 1982. (Citeseer |
local)
- Dolev, D.; Lynch, N.; Pinter, S.; Stark, E.; Weihl, W; "Reaching
approximate agreement in the presence of faults," Journal of the ACM, Vol.
33, No. 3, July 1986, pp. 499-516. (ACM |
local)
- P. R. Lorczak, A. K. Caglayan, D. E. Eckhardt, "A Theoretical
Investigation of Generalized Voters for Redundant Systems," FTCS 19, 1989.
(IEEE |
local)
- Martin, J.-P.; Alvisi, L., "Fast Byzantine Consensus," IEEE
Trans. Dependable and Secure Computing, July-Sept. 2006 Volume: 3, Issue: 3 pp.
202- 215. (IEEE |
local)
- Pradhan, D.K.; Vaidya, N.H.; Degradable Byzantine agreement; Computers,
IEEE Transactions on , Volume: 44 Issue: 1 , Jan 1995 Page(s): 146 -150 (IEEE |
local)
- Pease, M., R. Shostak, L. Lamport. Reaching Agreement in the Presence of
Faults. JACM 27, 2 (April 1980). (ACM |
local)
Frames the Byzantine Generals question
- J. Rufino and P. Veríssimo and G. Arroz and C. Almeida and L.
Rodrigues; Fault-Tolerant Broadcasts in CAN; FTCS 98; pg 150 (IEEE |
local)
- K. G. Shin, J. W. Dolter, "Alternative Majority-Voting Methods for
Real-Time Computing Systems," IEEE Transactions on Reliability, V. 38,
No. 1, April 1989. (IEEE |
local)
Required:
- Cristian, F., Agreeing on who is present and who is absent in a synchronous
distributed system ; Fault-Tolerant Computing, 1988. FTCS-18, Digest of
Papers., Eighteenth International Symposium on , 27-30 Jun 1988 Page(s): 206
-211 (IEEE |
local) / 6 pages
- Poledna, S., "Fault tolerance in safety critical automotive
applications: cost of agreement as a limiting factor ", Fault-Tolerant
Computing, 1995. FTCS-25. Digest of Papers., Twenty-Fifth International
Symposium on , 27-30 Jun 1995 Page(s): 73 -82 (IEEE |
local) / 10 pages
- Latronico, E. & Koopman, P., "Design Time Reliability Analysis of
Distributed Fault Tolerance Algorithms," DSN05, June 2005 (web
| local) / 10
pages.
Recommended:
- Galleni, A.; David Powell, Consensus and Membership in Synchronous and
Asynchronous Distributed Systems (1996) (Citeseer |
local) / 45 book pages
This looks like a good overview of this and related areas -- but too long
for our course.
Supplemental:
- Birman, K.; The Process Group Approach to Reliable Distributed Computing,
Communications of the ACM 36(12), December 1993, pp. 37-53. (ACM |
local)
- Chockler, G.; Idit Keidar, and Roman Vitenberg. Group communication
specifications: A comprehensive study. ACM Computing Surveys, 33(4):427-469,
December 2001. (Citeseer |
local)
- Cristian, F. "Reaching Agreement on Processor Group Membership in
Synchronous Distributed Systems, Distributed Computing (1991) (Citeseer |
local)
- Davidson, S.B., Garcia-Molina, H., Skeen, D., Consistency in Partitioned
Networks: a survey, Computing Surveys 17(3):341-370, ACM, September 1985. (ACM |
local)
- Fetzer, C., "A Highly Available Local Leader Election Service,"
IEEE Trans. SW Engineering, Sept-Oct 1999, pp. 603-618 (Web |
local)
- Garcia-Molina, H., Elections in a distributed computer system, Trans.
Computers C-31(2):48-59, IEEE, 1982. (local) / 7 pages + appendix.
- Lampson, B.. How to Build a Highly Available System Using Consensus, 1996,
pp 1-17. (Citeseer
| local) / 17 pages.
- Meyer, F.J.; Pradhan, D.K.; Consensus with dual failure modes; Parallel and
Distributed Systems, IEEE Transactions on , Volume: 2 Issue: 2 , Apr 1991
Page(s): 214 -222 (IEEE |
local)
- Parker, D.S., et al, Detection of mutual inconsistency in distributed
systems, Trans. Software Engineering 9(3):240-246, IEEE, 1983 (local)
- Poledna, S.; Tolerating sensor timing faults in highly responsive hard
real-time systems ; Computers, IEEE Transactions on , Volume: 44 Issue: 2 , Feb
1995 Page(s): 181 -191 (IEEE |
local)
- Rushby, J., "Reconfiguration and transient recovery in state machine
architectures ," FTCS, June 1996, Page(s):6 - 15 (IEEE |
local)
- W. Steiner & H. Kopetz, "The startup problem in fault-tolerant
time-triggered communication," DSN 2006, 10 pages. (local)
- H. Kopetz, G. Grünsteidl and J. Reisinger, Fault-Tolerant Membership Service
in a Synchronous Distributed Real-Time System, in Dependable Computing for
Critical Applications, pp. 411- 429, Springer-Verlag, Vienna, Austria, (1991)
Required:
- Either:
- Turek, J.; Shasha, D. The Many Faces of Consensus in Distributed Systems.
IEEE Computer, Vol. 25, Iss. 6, Jun. 1992, pages 8-17. (IEEE |
local) / 10 pages
OR
- M. Raynal. A Short Introduction to Failure Detectors for Asynchronous
Distributed Systems. ACM SIGACT News, Vol. 36, Iss. 1, COLUMN: Distributed
Computing, Mar. 2005, pages 53-70 (ACM |
local) / 18 pages
- Fischer, M; Nancy Lynch, and Michael Paterson, "Impossibility of
Distributed Consensus with One Faulty Process," Journal of the ACM, vol
32, no 2, 1985, pp. 374-382. (ACM |
Web |
local) / 9
pages
Supplemental:
- Aguilera, M.K.; C. Delporte-Gallet, H. Fauconnier, S. Toueg. On
Implementing Omega with Weak Synchrony Assumptions. PODC 2003. pages 306-314
(ACM
| local)
Presents efficient algorithms for implementing Omega (the leader-election
failure detector) along with the minimal (partial) synchrony assumptions
required.
- Chandra T.D. and Toueg S., Unreliable failure detectors for reliable
distributed systems. Journal of the ACM , 43(2), pp:225--267, (March 1996). (ACM |
local)
- Delporte-Gallet, C.; H. Fauconnier, and R. Guerraoui. A realistic look at
failure detectors. In Proceedings of the IEEE International Conference on
Dependable Systems and Networks (DSN'2002), pages 345-352, Washington D.C.,
June 2002. (IEEE |
local)
This paper proves that you need group membership (the perfect failure
detector P) for most interesting problems, including Consensus with any number
of crash faults. It also shows that, if you only consider realistic failure
detectors that cannot guess the future, you don't have as many classes of
failure detectors as Chandra and Toueg originally believed: e.g., P and S (the
"strong" failure detector) are the same.
- Dwork, Cynthia; Nancy Lynch, and Larry Stockmeyer, Consensus in the
presence of partial synchrony, JACM 35(2), Apr. 1988. (ACM |
local)
This paper presents partial synchrony, which is a model in between synchrony
and asynchrony. It shows that Consensus can be solved if there are hard bounds
that are unknown or if there are known eventual bounds on delays. Failure
detectors are another way to express (and encapsulate) these kinds of
properties.
- L. Lamport, "The part-time parliament," ACM Transactions on
Computer Systems, Vol. 16, No. 2, May 1998, pp. 133-169. (Citeseer |
local) / 33 pages.
Paxos. If anything, read just the introduction. Paxos solves Consensus by
assuming that a form of leader election is possible. Later papers have argued
that this is really a failure detector (called Omega). Paxos guarantees that
safety is never violated (it is indulgent) and it requires a majority of
correct processes to achieve liveness.
- Mostefaoui, A., M. Raynal. Low Cost Consensus-Based Atomic Broadcast.
Pacific Rim International Symposium on Dependable Computing, Los Angeles CA,
2000. pages 45-52. (IEEE |
local)
This is an example of a thrifty protocol as mentioned on page 66 of the
Raynal paper. Atomic Broadcast can be implemented by running Consensus on
ordered batches of messages. If enough nodes propose the same set of messages
to compose the current batch (and you have some deterministic function that can
order the batch) you can bypass Consensus and deliver the messages. Consensus
is needed only when, due to asynchrony or process crashes, the nodes a priori
disagree on the batch composition.
- Vassos Hadzilacos and Sam Toueg. A Modular Approach to Fault-Tolerant
Broadcasts and Related Problems. Technical Report TR94-1425, Department of
Computer Science, Cornell University, Ithaca NY, May 1994. (Citeseer |
local)
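The fast path described in the Mostefaoui & Raynal annotation above can be
sketched as follows. This is a minimal illustration of the idea only, not the
protocol from the paper: the quorum threshold, the use of sorting as the
deterministic ordering function, and the run_consensus callback are all
assumptions made for the example.

```python
from collections import Counter
from typing import Callable, FrozenSet, List, Sequence

Batch = FrozenSet[str]  # a proposed batch is a set of message ids


def decide_batch(
    proposals: Sequence[Batch],
    quorum: int,
    run_consensus: Callable[[Sequence[Batch]], Batch],
) -> List[str]:
    """Return the delivery order for one batch of messages.

    quorum and run_consensus are hypothetical parameters: quorum is how many
    identical proposals are treated as "enough", and run_consensus stands in
    for whatever Consensus module the system provides.
    """
    if not proposals:
        return []
    # Find the most frequently proposed batch and how many nodes proposed it.
    batch, votes = Counter(proposals).most_common(1)[0]
    if votes < quorum:
        # Nodes disagree on the batch composition: fall back to Consensus.
        batch = run_consensus(proposals)
    # Deterministic ordering function: every node sorts the agreed set alike.
    return sorted(batch)
```

When every node proposes the same set, the callback is never invoked; it runs
only when asynchrony or crashes leave the proposals in disagreement.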
Required:
- Maxion, R.A.; Olszewski, R.T.; "Eliminating exception handling errors
with dependability cases: a comparative, empirical study", IEEE
Transactions on Software Engineering, Volume: 26 Issue: 9 , Sep 2000 Page(s):
888 -906 (IEEE
| local) / 19 pages
Supplemental:
- Robust software - no more excuses De Vale, J.; Koopman, P. Dependable
Systems and Networks, 2002. Proceedings. International Conference on , 2002
Page(s): 145 -154 (IEEE |
local)
- Maxion, Roy A.; Olszewski, Robert T., "Improving Software Robustness
With Dependability Cases." Twenty-Eighth Annual International Symposium on
Fault-Tolerant Computing, June 1998, p. 346-355. (IEEE |
Citeseer |
local) (conference
version of the 2000 journal paper)
Required:
- Nelson, J. "Incremental avionics upgrades for legacy aircraft";
Digital Avionics Systems Conference, 1997. 16th DASC., AIAA/IEEE , Volume: 1 ,
26-30 Oct 1997 Page(s): 3.2 -15-23 vol.1, (IEEE |
local) / 9 pages.
- Sha, L.; Rajkumar, R.; Gagliardi, M.; "Evolving dependable real-time
systems," Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE
, Volume: 1 , 3-10 Feb 1996 Page(s): 335 -346 vol.1, (IEEE |
Citeseer |
local) / 12 pages
- Arlat, J.; Jarboui, T.; Kanoun, K.; Powell, D.; "Dependability
assessment of GUARDS instances," Computer Performance and Dependability
Symposium, 2000. IPDS 2000. Proceedings. IEEE International , 2000 Page(s): 147
-156 (IEEE |
local) / 10 pages
(Note: GUARDS is mostly about dependable upgrade, but I haven't found any
good short papers that concentrate on that aspect.)
Supplemental:
- Bloom, T.; Day, M. Reconfiguration in Argus. International Workshop on
Configurable Distributed Systems, 25-27 Mar 1992, pages: 176-187. (IEEE |
local)
One of the earliest papers on dynamic upgrades and fault-tolerance (they did
the work in the early 1980s, but they didn't publish much back then). Argus is
an operating system and a programming language that provides crash-recovery and
maintains consistency after restart, which is a useful feature for implementing
a live upgrade mechanism. This is a design overview as well as a "lessons
learned" paper (excellent comparisons in Related Work section).
- Cook, J.E.; Dage, J.A.; Highly reliable upgrading of components ; Software
Engineering, 1999. Proceedings of the 1999 International Conference on , 1999
Page(s): 203 -212 (IEEE |
local)
- Kramer, J.; J. Magee, and A. Young. "Towards unifying fault and change
management." In IEEE Workshop on Future Trends of Dist. Computing Systems
in the '90s, 1990. (Citeseer |
local)
Faults, as well as live upgrades, may have a disruptive effect on the
functionality of a distributed system, and the techniques to mitigate these
problems can be combined in a unified framework. For instance, a change
management system that totally separates the functional application concerns
from the configuration management concerns (such as Kramer and Magee's Conic),
can provide a good basis for implementing fault recovery. This is a position
paper rather than an in-depth study.
- Lyu, J.; Youngjin Kim; Yongsub Kim; Inhwan Lee; "A procedure-based
dynamic software update"; Dependable Systems and Networks, 2001.
Proceedings. The International Conference on , 2001 Page(s): 271 -280 (IEEE |
local)
- Moser, L.E.; P. M. Melliar-Smith, P. Narasimhan, L. Tewksbury and V.
Kalogeraki, "Eternal: Fault Tolerance and Live Upgrades for Distributed
Object Systems", Proceedings of the DISCEX Information Survivability
Conference, Hilton Head, SC (January 2000), pp. 184-196. (Citeseer |
local)
An infrastructure built for fault-tolerance can provide a good basis for
live upgrades because of the inherent redundancy. A fault-tolerant CORBA system
designed following the interception approach (see the FT Middleware lecture)
provides all the ingredients needed for dynamic change management of CORBA
objects: interceptor (indirection layer needed when switching to a new
version), replication mechanisms (for incrementally upgrading some replicas
while others provide service), state extraction/restoration mechanisms (for
maintaining consistency between versions).
- Powell, D.; Arlat, J.; Beus-Dukic, L.; Bondavalli, A.; Coppola, P.;
Fantechi, A.; Jenn, E.; Rabejac, C.; Wellings, A.; GUARDS: a generic upgradable
architecture for real-time dependable systems ; Parallel and Distributed
Systems, IEEE Transactions on , Volume: 10 Issue: 6 , Jun 1999 Page(s): 580
-599 (IEEE
| local)
- Romanovsky, A.; Smith, I.; "Dependable on-line upgrading of
distributed systems," Computer Software and Applications Conference, 2002.
Proceedings. 26th Annual International , 2002 Page(s): 975 -976 (IEEE |
local) /
Online
workshop proceedings
- Segal, M.E.; Frieder, O.; "On-the-fly program modification: systems
for dynamic updating." Software, IEEE, Vol.10, Iss.2, Mar 1993,
Pages:53-65. (IEEE |
local)
Excellent reference for live upgrades (although not very concerned with
dependability). Presents a thorough survey and a qualitative comparison of all
the previous approaches and it details a pretty well designed dynamic change
management system (built by the authors). A must-read for anyone interested in
live upgrades.
- Sha, L., Ragunathan Rajkumar, Michael Gagliardi, A Software Architecture
for Dependable and Evolvable Industrial Computing Systems,
CMU/SEI-95-TR-005, 1995. (Web
| local)
- Sha, L., "Dependable system upgrade," Real-Time Systems
Symposium, 1998. Proceedings., The 19th IEEE , 2-4 Dec 1998 Page(s): 440 -448
(IEEE |
local) / 9 pages.
- Tai, A.T.; Alkalai, L.; Chau, S.N.; Sanders, W.H.; Tso, K.S.;
"Low-cost error containment and recovery for onboard guarded software
upgrading and beyond"; Computers, IEEE Transactions on , Volume: 51 Issue:
2 , Feb 2002 Page(s): 121 -137 (IEEE |
local)
Required:
- Liming Chen; Avizienis, A., N-version programming: a fault-tolerance
approach to reliability of software operation, Fault-Tolerant Computing, 1995,
Highlights from Twenty-Five Years., Twenty-Fifth International Symposium on
Page(s): 113 (originally FTCS 1978) (IEEE |
local) / 7 pages.
- Knight, Leveson & St. Jean "A large scale experiment in N-version
programming", FTCS15, 1985, 135-139 (local) / 5 pages.
- Avizienis, A.; Lyu, M.R.; Schutz, W., In search of effective diversity: a
six-language study of fault-tolerant flight control software, FTCS 1988. (IEEE |
local) / 8 pages.
- Knight, J. & Leveson, N., "A reply to the criticisms of the Knight
& Leveson experiment," ACM SIGSOFT Software Engineering Notes, vol.
15, no. 1, pg. 24, Jan 1990. (ACM |
Web |
local) / 13 pages
Recommended:
- Littlewood & Rushby, "Reasoning about the reliability of diverse
two-channel systems in which one channel is 'possibly perfect'," IEEE
Trans. SW Engr., 38(5), Sept/Oct 2012, pp. 1178-1194. (local) / 19 pages.
Supplemental:
- Ammann, P.E.; Knight, J.C., Data diversity: an approach to software fault
tolerance Page(s): 418-425 (IEEE |
local)
- Anh Nguyen-Tuong, David Evans, John Knight, Benjamin Cox, and Jack
Davidson, "Security through Redundant Data Diversity," DSN 2008 (local).
- Avizienis, A., "The N-version approach to fault tolerant
software," IEEE Trans. Software Engineering, SE-11(12), December 1985, pp.
1491-1501. (local)
- Avizienis, the methodology of n-version programming (ch 2) In: Lyu, Ed.,
Software Fault Tolerance, Wiley & Sons, 1995 (local)
- Bishop, software fault tolerance by design diversity (ch 9) In: Lyu, Ed.,
Software Fault Tolerance, Wiley & Sons, 1995 (local)
- Brilliant, S.S., Knight, J.C., Leveson, N.G., "Analysis of Faults in
an N-Version Software Experiment", IEEE Transactions on Software
Engineering, 16(2): 238-47, Feb. 1990. (IEEE |
local)
- Brilliant, Knight, Leveson, "The consistent comparison problem in
N-version software", IEEE Trans. SW Eng, 15(11) 1481-1485, Nov 89. (IEEE |
local)
- J. DeVale and P. Koopman, "Comparing the Robustness of POSIX Operating
Systems," FTCS-29, 1999. (Web |
local)
- Eckhardt, D.E.; Caglayan, A.K.; Knight, J.C.; Lee, L.D.; McAllister, D.F.;
Vouk, M.A.; Kelly, J.P.J.; "An experimental evaluation of software
redundancy as a strategy for improving reliability," Software Engineering,
IEEE Transactions on , Volume: 17 Issue: 7 , Jul 1991 Page(s): 692 -702. (IEEE |
local)
- Kelly, J.P.J.; Eckhardt, D.E., Jr.; Vouk, M.A.; McAllister, D.F.;
Caglayan, A.; "A large scale second generation experiment in multi-version
software: description and early results," FTCS-18, 27-30 Jun 1988 Page(s):
9 -14, (IEEE
| local)
- J. C. Knight and N. G. Leveson, "An Experimental Evaluation of the
Assumption of Independence in Multi-version Programming", IEEE
Transactions on Software Engineering, Vol. SE-12, No. 1 (January 1986), pp.
96-109. (Citeseer |
local)
- Nancy Leveson, Stephen Cha, John Knight, and Timothy Shimeall, "The
Use of Self Checks and Voting in Software Error Detections: An Empirical
Study," IEEE Trans. on Software Engineering, Vol. SE-16, No. 4, April,
1990. (IEEE
| local)
- Littlewood, B.; Miller, D.R., A conceptual model of multi-version software,
Fault-Tolerant Computing, 1995, Highlights from Twenty-Five Years.,
Twenty-Fifth International Symposium on Page(s): 188 (originally FTCS 1987) (IEEE |
local)
- Bev Littlewood, Peter Popov and Lorenzo Strigini, "N-version design
Versus one Good," Fastabs at DSN 2000. (Citeseer |
local)
- Timothy Shimeall and Nancy Leveson, "An Empirical Comparison of Software
Fault Tolerance and Fault Elimination," IEEE Trans. on Software
Engineering, Vol. SE-17, No. 2, February 1991, pp. 173-183 (IEEE |
local)
- Townend, P.; Jie Xu; Munro, M.; Building dependable software for critical
applications: multi-version software versus one good version; Object-Oriented
Real-Time Dependable Systems, 2001. Proceedings. Sixth International Workshop
on 8-10 Jan. 2001 Page(s):103 - 110 (IEEE |
local)
Other sources:
- Chillarege, 1995, challenges facing software fault-tolerance, IBMC 20281
- D. E. Eckhardt & L.D. Lee, "Fundamental differences in the
reliability of N-modular redundancy and N-version programming," Journal of
Systems and Software, 8(4): 313-318, Sept. 1988.
Required:
- Segall, Z.; Vrsalovic, D.; Siewiorek, D.; Yaskin, D.; Kownacki, J.; Barton,
J.; Dancey, R.; Robinson, A.; Lin, T.; "FIAT-fault injection based
automated testing environment," FTCS, 1988. (IEEE |
local) / 6 pages.
- Mei-Chen Hsueh, Timothy K. Tsai, Ravishankar K. Iyer, "Fault
Injection Techniques and Tools," IEEE Computer, April 1997. (Citeseer |
local) / 8 pages
- Madeira, H.; Some, R.R.; Moreira, F.; Costa, D.; Rennels, D.;
"Experimental evaluation of a COTS system for space applications,"
DSN 2002 (IEEE |
local) / 6 pages
- A. Ademaj, H. Sivencrona, G. Bauer, J. Torin, Evaluation of Fault Handling
of the Time-Triggered Architecture with Bus and Star Topology, Proc.
International Conference on Dependable Systems and Networks (DSN 2003), San
Francisco, CA, June 22nd - 25th, USA 2003. (IEEE |
local) / 10 pages
Supplemental:
- Duraes, J.; Madeira, H.; "Definition of software fault emulation
operators: a field data study," Dependable Systems and Networks 2003, pp.
105 - 114 / 10 pages (IEEE
| local)
- Karthik Pattabiraman, Nithin Nakka, Zbigniew Kalbarczyk, Ravishankar Iyer,
"SymPLFIED: Symbolic Program-Level Fault-Injection and Error-Detection
Framework," DSN 2008 (web|
local) / 10 pages.
- Aidemark, J.; Vinter, J.; Folkesson, P.; Karlsson, J.; "Experimental
evaluation of time-redundant execution for a brake-by-wire application,"
DSN 2002, (IEEE |
local) / 6 pages
- Arlat, J.; Crouzet, Y.; Laprie, J., "Fault injection for dependability
validation of fault-tolerant computing systems " FTCS 1989. (IEEE |
local)
- Arlat, J., Yves Crouzet, Johan Karlsson, Peter Folkesson, Günther
Leber, "Evaluation of the MARS Architecture by means of Three Physical
Fault Injection Techniques," ETDS 1995, Extended Abstract (Citeseer |
local)
- Barton, J., Czeck, E., Segall, Z., Siewiorek, D., Fault injection
experiments using FIAT, IEEE Transactions on Computers, 39(4):
575-582 (IEEE |
local)
- Carreira, J.; Madeira, H.; Silva, J.G., Xception: a technique for the
experimental evaluation of dependability in modern computers, IEEE
Transactions on Software Engineering, vol.24, no.2, Feb 1998, p. 125-36 (IEEE |
Citeseer |
local)
- Chillarege, R.; Bowen, N.S.; "Understanding large system failures -- a
fault injection experiment," FTCS 1989, pp. 356-363 (IEEE |
local)
- Christmansson, J.; Chillarege, R.; "Generation of an error set that
emulates software faults based on field data", FTCS 1996. (IEEE |
local)
- Han, S., Shin, & Rosenberg, "DOCTOR: An IntegrateD SOftware Fault
InjeCTiOn EnviRonment for Distributed Real-time Systems," ICPDS, 1995
(Citeseer |
local)
- Jenn, E.; Arlat, J.; Rimen, M.; Ohlsson, J.; Karlsson, J.; "Fault
injection into VHDL models: the MEFISTO tool," Fault-Tolerant Computing,
1994. FTCS-24. Digest of Papers., Twenty-Fourth International Symposium on ,
15-17 Jun 1994 Page(s): 66 -75 (IEEE |
local)
- G. A. Kanawati, N. A. Kanawati and J. A. Abraham, "FERRARI: A Flexible
Software-Based Fault and Error Injection System," IEEE Transactions on
Computers, vol. 44, no. 2, February 1995, pp. 248-260. (IEEE |
local)
- Karlsson, J.; Liden, P.; Dahlgren, P.; Johansson, R.; Gunneflo, U., Using
heavy-ion radiation to validate fault-handling mechanisms; Micro, IEEE ,
Volume: 14 Issue: 1 , Feb 1994 Page(s): 8 -23 (IEEE |
local)
- Koopman, P., What's Wrong With Fault Injection As A Benchmarking
Tool? DSN Workshop on Dependability Benchmarking, 2002. (Web |
local)
- Madeira, H.; Costa, D.; Vieira, M.; "On the emulation of software
faults by software fault injection," DSN 2000 (IEEE |
local)
- Rodriguez, M.; Albinet, A.; Arlat, J.; "MAFALDA-RT: a tool for
dependability assessment of real-time systems", DSN 2002 (IEEE |
local)
- Frédéric Salles, Jean Arlat, Jean-Charles Fabre, "Can
We Rely on COTS Microkernels for Building Fault-Tolerant Systems?", 1997
(Citeseer |
local)
- Stott, D., Neil A. Speirs, Jun Xu, Saurabh Bagchi, Keith Whisnant, Zbigniew
Kalbarczyk, Ravishankar K. Iyer, "Fault Injection Based Assessment of
Fail-Silence Provided by Process Duplication versus Internal Error
Detection", FTCS, 2000. (Citeseer |
local)
- Voas, J., "Fault injection for the masses," IEEE Computer, Dec.
1997 (IEEE |
local)
Required:
- Yeh, Y.C.; "Design considerations in Boeing 777 fly-by-wire
computers", HASE 1998. (IEEE |
local)
NOTE: this is a "how we did it" case study paper, without too much
on "why". So don't waste breath on saying they didn't talk about
"why". Just read the paper to get a feel for all the stuff that has
to go into a real x-by-wire system.
- McWha, J., "Development of the 777 Flight Control System," AIAA
2003-5767, 2003, pp. 2817-2821 (local).
Supplemental:
- Norris, G.; "Boeing's seventh wonder"; IEEE Spectrum , Volume: 32
Issue: 10 , Oct 1995 Page(s): 20 -23 (IEEE |
local)
- Briere, D., Travers, P., "Airbus A320/A330/A340 electrical flight
controls: a family of fault-tolerant systems," IEEE, 1993, (local) (discusses
multi-version redundancy of software and hardware)
- Buus, H.; McLees, R.; Orgun, M.; Pasztor, E.; Schultz, L.; "777 flight
controls validation process," Aerospace and Electronic Systems, IEEE
Transactions on , Volume: 33 Issue: 2 , Apr 1997 Page(s): 656 -666 (IEEE |
local)
- Driscoll, K.; Hoyme, K.; "The Airplane Information Management System:
an integrated real-time flight-deck control system"; Real-Time Systems
Symposium, 1992 , 2-4 Dec 1992 Page(s): 267 -270 (IEEE |
local)
- Gries, M.J.; "Systems engineering for the 777 Autopilot Flight
Director System," Digital Avionics Systems Conference, 1995., 14th DASC ,
5-9 Nov 1995 Page(s): 403 -409 (IEEE |
local)
- Hess, R.; "Computing platform architectures for robust operation in
the presence of lightning and other electromagnetic threats"; Digital
Avionics Systems Conference, 1997. 16th DASC., AIAA/IEEE , Volume: 1 , 26-30
Oct 1997 Page(s): 4.3 -9-16 vol.1 (IEEE |
local)
- Hoyme, K.; Driscoll, K.; "SAFEbus"; Digital Avionics Systems
Conference, 1992. Proceedings., IEEE/AIAA 11th , 5-8 Oct 1992 Page(s): 68 -73
(IEEE |
local)
- Ramohalli, G.; "The Honeywell on-board diagnostic and maintenance
system for the Boeing 777"; Digital Avionics Systems Conference, 1992.
Proceedings., IEEE/AIAA 11th , 5-8 Oct 1992 Page(s): 485 -490 (IEEE |
local)
- Yeh, Y.C.; "Triple-triple redundant 777 primary flight computer,"
Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE , Volume: 1 ,
3-10 Feb 1996 Page(s): 293 -307 vol.1 (IEEE |
local)
Required:
- Meyer, J.F., "On evaluating the performability of degradable computing
systems," FTCS 1978, (IEEE |
local)
- Bodson, M.; Lehoczky, J.; Rajkumar, R.; Sha, L.; Smith, M.; Soh, D.;
Stephan, J.; Control reconfiguration in the presence of software failures;
Decision and Control, 1993., Proceedings of the 32nd IEEE Conference on , 15-17
Dec 1993 Page(s): 2284 -2289 vol.3 (IEEE |
local)
- Shelton, C. & Koopman, P., "Improving System Dependability with
Alternative Functionality," DSN04, June 2004, Page(s):295 - 304 (IEEE |
local)
- Strunk, Elisabeth A., John C. Knight, and M. Anthony Aiello, Assured
Reconfiguration of Fail-Stop Systems, DSN 2005: The International Conference on
Dependable Systems and Networks, Yokohama, Japan (June 2005) (web
| local)
Supplemental:
- Adlemo, A.; Andreasson, S.-A.; "Improved availability in manufacturing
systems through graceful degradation: case study of a machining cell,"
Robotics and Automation, 1995. Proceedings., 1995 IEEE International Conference
on , Volume: 2 , 21-27 May 1995 Page(s): 1744 -1750 vol.2 (IEEE |
local)
- Burns, A.; Punnekkat, S.; Strigini, L.; Wright, D.R.; Probabilistic
scheduling guarantees for fault-tolerant real-time systems; Dependable Computing
for Critical Applications 7, 1999 Page(s): 361 -378 (Citeseer |
local) / 18 pages
- Herlihy & Wing, 1991, "specifying graceful degradation", IEEE
Trans. Parallel & Distr. Sys. 2(1), Jan 1991 (IEEE | local)
- Jeffrey O. Kephart; Research challenges of autonomic computing; Proceedings
of the 27th International Conference on Software Engineering, St. Louis, MO,
2005, Pages: 15 - 22 (ACM
| local)
- Knight, J. & Sullivan, K., "On the definition of
survivability", 2000. (Citeseer |
local)
- Losq, J., "Effects of failures on gracefully degradable systems,"
7th Annual International Conference on Fault-Tolerant Computing, Los Angeles,
CA, USA; 28-30 June 1977, p. 29-34. (local)
- Subhasish Mitra; Huang, W.-J.; Saxena, N.R.; Yu, S.-Y.; McCluskey, E.J.;
Reconfigurable architecture for autonomous self-repair; Design & Test of
Computers, IEEE Volume 21, Issue 3, May-June 2004 Page(s):228 - 240 (IEEE |
local)
- Ying-Wah Ng, Avizienis, A., "A reliability model for gracefully
degrading and repairable fault-tolerant systems," 7th Annual International
Conference on Fault-Tolerant Computing, Los Angeles, CA, USA; 28-30 June 1977,
p. 22-8 (local)
- S. Poledna, "Tolerating Sensor Timing Faults in Highly Responsive Hard
Real-Time Systems," IEEE Trans. on Computers, Vol. 44, No. 2, February
1995. (IEEE
| local)
- Ramanathan, P.; Graceful degradation in real-time control applications
using (m, k)-firm guarantee; Fault-Tolerant Computing, 1997. FTCS-27. Digest of
Papers., Twenty-Seventh Annual International Symposium on , 24-27 Jun 1997
Page(s): 132 -141 (IEEE |
local)
ALSO: see papers on "self-healing systems" although as of the last
update this area was not quite mature enough to have generated a paper for
inclusion in this list. A more recent term being used is "autonomic
systems" although those are mostly in the IT space and not in the embedded
system area.
Required:
- Dawson, S.; Jahanian, F.; Mitton, T.; Teck-Lee Tung; "Testing of
fault-tolerant and real-time distributed systems via protocol fault
injection", FTCS 1996 (IEEE |
local)
- Koopman, P.; DeVale, J., The exception handling effectiveness of
POSIX operating systems, IEEE Transactions on Software Engineering, Sept.
2000, vol.26, no.9 p. 837-48 (IEEE |
local)
- DeVale, J. & Koopman, P., "Robust software - no more
excuses," International Conference on Dependable Systems and Networks
(DSN), Washington DC, July 2002. (Web
| local) / 10 pages.
- George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, Armando
Fox, "Microreboot -- A Technique for Cheap Recovery," Proceedings of
the 6th Symposium on Operating Systems Design and Implementation (OSDI), San
Francisco, CA, December 2004. (web
| local)
Supplemental:
- Dingman, C.P.; Marshall, J.; Siewiorek, D.P.; Measuring robustness of
a fault tolerant aerospace system, 25th International Symposium on
Fault-Tolerant Computing, June 1995. pp. 522-7 (IEEE |
local) / 6 pages.
- Carrette, G., CRASHME: Random input testing, (no formal
publication available)
http://people.delphiforums.com/gjc/crashme.html
accessed February 28, 2003. (local)
- DeVale, J., Koopman, P., Guttendorf, D., The Ballista Software
Robustness Testing Service, 16th International Conference on Testing
Computer Software, 1999. pp. 33-42. (Web |
local)
- Koopman, P., DeVale, K. & DeVale, J., "Interface robustness
testing: experiences and lessons learned from the Ballista Project," In:
Kanoun, K. & Spainhower, L., Eds., Dependability Benchmarking for Computer
Systems, IEEE Press, 2008, pp. 201-226. (Web
| local)
- Madeira, Henrique; Diamantino Costa; Marco Vieira; "On the Emulation
of Software Faults by Software Fault Injection," 2000 (IEEE |
Citeseer |
local)
- Miller, B., Fredriksen, L., So, B., An empirical study of the
reliability of operating system utilities, Communications of the ACM,
33(12):32-44, December 1990 (ACM |
Citeseer |
local)
- Miller, B., Koski, D., Lee, C., Maganty, V., Murthy, R., Natarajan, A.
& Steidl, J., Fuzz Revisited: A Re-examination of the Reliability of
UNIX Utilities and Services, Computer Science Technical Report 1268,
Univ. of Wisconsin-Madison, May 1998. (Citeseer |
local)
- Mukherjee, A., Siewiorek, D.P., Measuring software dependability by
robustness benchmarking, IEEE Transactions on Software Engineering,
Volume: 23 Issue: 6 , Jun 1997 Page(s): 366 -378 (IEEE |
local)
- Shelton, C. & Koopman, P., "Robustness Testing of the Microsoft
Win32 API," International Conference on Dependable Systems and Networks (DSN),
New York City, June 26-28 2000. (Web
| local)
- Siewiorek, D., Hudak, J., Suh, B. & Segall, Z., Development of a
benchmark to measure system robustness, 23rd International Symposium on
Fault-Tolerant Computing, June 1993. pp. 88-97 (IEEE |
local)
- Martin Süßkraut and Christof Fetzer, Robustness and Security
Hardening of COTS Software Libraries, The 37th Annual IEEE/IFIP International
Conference on Dependable Systems and Networks (DSN2007) (Web)
Builds on Ballista to create an automated robustness wrapper approach.
- Vo, K-P., Wang, Y-M., Chung, P. & Huang, Y., Xept: a software
instrumentation method for exception handling, The Eighth International
Symposium on Software Reliability Engineering, Albuquerque, NM, USA; 2-5 Nov.
1997, pp. 60-69 (IEEE |
local) / 10 pages.
- Joe W. Duran and Simeon C. Ntafos, "An Evaluation of Random
Testing," IEEE Transactions on Software Engineering, Vol. SE-10, Number 4,
July 1984, pp. 438-442. (local)
(Conference paper version at:
ACM)
This is a good reference for the strengths and weaknesses of random as
opposed to intentionally designed testing.
Other sources:
- Hastings, R.; Joyce, B., Purify: fast detection of memory leaks and
access errors, Proceedings of the Winter 1992 USENIX Conference.
Required:
- Stankovic, 1988, misconceptions about real-time computing, IEEE Computer,
21(10) oct 88 pp. 10-19 (IEEE |
local)
- Sunondo Ghosh, Rami Melhem and Daniel Mosse, "Fault-Tolerant
Scheduling on a Hard Real-Time Multiprocessor System", IPPS, 1994. (Citeseer |
local)
- Kaiser, J.; Livani, M.A.; "Invocation of real-time objects in a CAN
bus-system," Object-Oriented Real-Time Distributed Computing, 1998. (ISORC
98) Proceedings. 1998 First International Symposium on , 20-22 Apr 1998
Page(s): 298 -307 (IEEE |
local)
Other High-Level Summaries:
- Kopetz, H., "Software Engineering for Real-Time: a roadmap,"
Proceedings of the Conference on the Future of Software
Engineering, May 2000. (ACM |
local)
Supplemental:
- Cheng, Stankovic & Ramamritham, "Scheduling algorithms for hard
real-time systems: a brief survey," 1988. (local)
- Ghosh, Melhem, Mossé; Enhancing Real-Time Schedules to Tolerate
Transient Faults, Real-Time Systems Symposium, 1995. Proceedings., 16th IEEE ,
5-7 Dec 1995 Page(s): 120 -129. (IEEE |
local)
- Sunondo Ghosh, Rami Melhem, Daniel Mossé, Joydeep Sen Sarma,
"Fault-Tolerant Rate-Monotonic Scheduling", Journal of Real-Time
Systems, vol. 15, no. 2, September 1998. (Citeseer |
local)
- Nagarajan Kandasamy, John P. Hayes, and Brian T. Murray; "Tolerating
Transient Faults in Statically Scheduled Safety-Critical Embedded
Systems"; Reliable Distributed Systems, 1999. Proceedings of the 18th IEEE
Symposium on , 1999 Page(s): 212 -221. (Citeseer |
IEEE |
local)
- Krishna, C. & Shin, K., "On scheduling tasks with a quick recovery
from failure," IEEE Trans. Computers, May 1986. (local)
- Lehoczky, J.P.; Rajkumar, R.; Sha, L.; Priority inheritance protocols: an
approach to real-time synchronization; Computers, IEEE Transactions on ,
Volume: 39 Issue: 9 , Sep 1990 Page(s): 1175 -1185 (IEEE |
local)
- Lonn, H.; Axelsson, J.; "A comparison of fixed-priority and static
cyclic scheduling for distributed automotive control applications,"
Real-Time Systems, 1999. Proceedings of the 11th Euromicro Conference on , 1999
Page(s): 142 -149 (IEEE |
local)
- Lu, Chenyang; Gang Tao; Son, S.H.; Stankovic, J.A.; The case for feedback
control real-time scheduling; Real-Time Systems, 1999. Proceedings of the 11th
Euromicro Conference on , 1999 Page(s): 11 -20 (IEEE |
local)
- Muppala, J.K.; Trivedi, K.S.; Woolet, S.P.; Real-time systems performance
in the presence of failures; Computer , Volume: 24 Issue: 5 , May 1991 Page(s):
37 -47 (IEEE
| local)
- Ramamritham, K.; Stankovic, J.A.; The Spring kernel: a new paradigm for
real-time systems; IEEE Software , Volume: 8 Issue: 3 , May 1991 Page(s): 62
-72 (IEEE |
local)
- Ramanathan, P.; Shin, K.G.; Real-time computing: a new discipline of
computer science and engineering; Proceedings of the IEEE , Volume: 82 Issue: 1
, Jan 1994 Page(s): 6 -24 (IEEE |
local)
- Minsoo Ryu, Seongsoo Hong, End-To-End Design Of Distributed Real-Time
Systems (1997) (Citeseer |
local)
- Salkind, L., Unix for Real-Time Control: problems and solutions, TR
400, NYU, September 1988. (local)
- Sha, Lui; Ragunathan Rajkumar & Shirish Sathaye (1994). Generalized
Rate-Monotonic Scheduling Theory: A Framework for Developing Real-Time Systems.
In Proceedings of the IEEE, Vol. 82, No. 1, Jan 1994, pp. 68-82. (IEEE |
local)
- Shin, K.G.; HARTS: a distributed real-time architecture; Computer ,
Volume: 24 Issue: 5 , May 1991 Page(s): 25 -35 (IEEE |
local)
- Shin, K.G.; Zuberi, K.M.; EMERALDS: a microkernel for embedded real-time
systems; Real-Time Technology and Applications Symposium, 1996. Proceedings.,
1996 IEEE , 10-12 Jun 1996 Page(s): 241 -249 (IEEE |
local)
- John A. Stankovic; "Real-Time and Embedded Systems"; ACM
Computing Surveys (CSUR) March 1996. (Citeseer |
local)
- H. Tokuda, T. Nakajima & P. Rao, "Real-time Mach: towards a
predictable real-time system", Proc. Usenix Mach Workshop, October
1990, pp. 1-10. (Citeseer |
local)
- Lei Zhou; Rundensteiner, E.A.; Shin, K.G.; "Rate-monotonic scheduling
in the presence of timing unpredictability" Real-Time Technology and
Applications Symposium, 1998. Proceedings. Fourth IEEE , 3-5 Jun 1998 Page(s):
22 -27 (IEEE |
local)
Other Sources:
- Locke, 1992, "software architecture for hard real-time applications:
cyclic executives vs. priority executives," real-time systems 4(1):37-53,
March 1992
Required:
- Fagan, M., "Advances in software inspections," IEEE Trans.
Software Engineering, SE-12, July 1986, pp. 744-751. (local) / 8 pages.
- Umansky, Studs and Duds, The Washington Monthly, December 2001 (Web
| local) / 6 easy pages.
- Buus, H.; McLees, R.; Orgun, M.; Pasztor, E.; Schultz, L.; "777 flight
controls validation process," Aerospace and Electronic Systems, IEEE
Transactions on , Volume: 33 Issue: 2 , Apr 1997 Page(s): 656 -666 (IEEE |
local) / 11 pages.
Supplemental:
- A. F. Ackerman, "Software inspections and the cost effective
production of reliable software," in M. Dorfman & R. Thayer (Eds.),
Software Engineering, IEEE Computer Society, 1997, pp. 116-130. (local)
- Cousot, P.; Cousot, R., "Verification of Embedded Software: Problems
and Perspectives," Proceedings of the First International Workshop on
Embedded Software, Lecture Notes in Computer Science, Vol. 2211, 2001,
pp. 97-113. (Citeseer |
local)
- Crary, K.; Harper, R.; Lee, P.; Pfenning, F.; "Automated techniques
for provably safe mobile code"; DARPA Information Survivability Conference
and Exposition, 2000. DISCEX '00. Proceedings , Volume: 1 , 2000 Page(s): 406
-419 vol.1 (IEEE
| local)
- Fagan, M., "Design and code inspections to reduce errors in program
development," IBM Systems Journal, 15(3), 1976, pp. 182-211. (local)
- R. Fujii & D. Wallace, "Software verification and validation"
in M. Dorfman & R. Thayer (Eds.), Software Engineering, IEEE
Computer Society, 1997, pp. 116-130. (local)
- Goddard, "Validating the safety of embedded real-time control systems
using FMEA," Proc. annual reliability and maintainability symp., 1993, pp.
227-230 (IEEE |
local)
(Talks about software FMEA)
- Musa, J. "Operational Profiles in Software-Reliability
Engineering." IEEE Software, March 1993. (IEEE |
local)
- Myers, G., "A controlled experiment in program testing and code
walkthroughs/inspections," CACM, September 1978. (local)
- J. Palmer, "Traceability," in: M. Dorfman & R. Thayer (Eds.),
Software Engineering, 1997, pp. 266-276. (local)
- Weinberg, G. & Freedman, D., "Reviews, walkthroughs, and
Inspections," IEEE Trans. on Software Engineering, Vol. SE-10(1), January
1984, pp. 68-72. (local)
- Whittaker, J., "What is software testing? And Why is it so
hard?", IEEE Software, Jan/Feb 2000. (IEEE |
local)
- Hong Zhu, Patrick A.V. Hall, and John H.R. May, "Software Unit Test
Coverage and Adequacy", ACM Computing Surveys (CSUR) December 1997, pages
366-427. (ACM |
local)
Supplemental Formal Methods papers:
- Anthony Hall, Seven Myths of Formal Methods, IEEE Software, September 1990
pp. 11-19 (IEEE |
local)
- Bowen, J.P.; Hinchey, M.G.; "Seven more myths of formal methods,"
IEEE Software , Volume: 12 Issue: 4 , Jul 1995 Page(s): 34 -41 (IEEE |
local)
- Edmund M. Clarke, Jeannette M. Wing "Formal Methods: State of the Art
and Future Directions," ACM Computing Surveys, 1996, (Citeseer |
local)
- Gerhart, Craigen & Ralston, Experience with formal methods in critical
systems, IEEE Software, Jan 1994, pp. 21-39 (IEEE |
local)
- Ostroff, J., "Formal methods for the specification and design of
real-time safety critical systems," Journal of Systems and Software, April
1992, pp. 33-60. (local)
- John Rushby, "Formal Methods for Dependable Real-Time Systems,"
International Symposium on Real-Time Embedded Processing for Space
Applications, 1992 (Citeseer |
local)
- Wai Wong, "Formal Verification Of VIPER's ALU", 1993. (Citeseer |
local)
- Xu, J.; Randell, B.; Romanovsky, A.; Stroud, R.J.; Zorzo, A.F.; Canver, E.;
von Henke, F.; Rigorous development of an embedded fault-tolerant system based
on coordinated atomic actions; Computers, IEEE Transactions on , Volume: 51
Issue: 2 , Feb 2002 Page(s): 164 -179 (IEEE |
local)
Other sources:
- P. M. Melliar-Smith, R. L. Schwartz, "Formal Specification and
Mechanical Verification of SIFT: A Fault-Tolerant Flight Control System,"
IEEE Trans. on Computers, Vol. C-31,No. 7, July 1982.
- H.D. Mills, M. Dyer, and R.C. Linger, "Cleanroom Software
Engineering," IEEE Software, Sept. 1987, pp. 19-24
- Deck, M.D, and J. A. Whittaker, "Lessons Learned from Fifteen Years of
Cleanroom Testing," Software Testing, Analysis, and Review (STAR) '97, San
Jose, CA, May 5-9, 1997
Required:
- Rubenstein, E. & Mason, J., "An analysis of Three Mile
Island", IEEE Spectrum, November 1979, pp. 32-43 (local) / 14 pages.
- Sugarman, R., "Analysis and Assessment: Nuclear Power and the Public
Risk", IEEE Spectrum, November 1979, pp. 58-79 (local) / 22 pages.
- Lombardo, T., "Institutional constraints: the decision-makers: a
cacophony of voices", IEEE Spectrum, November 1979, pp. 81-95 (local) / 15 pages. (Includes:
Christiansen, D., "TMI and the Press" sidebar and other material.)
Recommended:
Required:
- Rasmussen, J., "The definition of human error and a taxonomy for
technical system design," In: Rasmussen, J., Duncan, K., Leplat, J. (eds)
New Technology and Human Error, John Wiley & Sons, 1987. (local) / 8 pages.
- Rasmussen, J.; "Human factors in the high-risk systems"; Human
Factors and Power Plants, 1988., Conference Record for 1988 IEEE Fourth
Conference on , 5-9 Jun 1988 Page(s): 43 -48 (IEEE |
local) / 7 pages.
- Nancy G. Leveson, L. Denise Pinnel, Sean David Sandys, Shuichi Koga, Jon
Damon Reese, "Analyzing Software Specifications for Mode Confusion
Potential," Workshop on Human Error and System Development, Glasgow, March
1997. (Web |
local) / 16 pages
Supplemental:
- Nancy G. Leveson and Clark S. Turner. An Investigation of the Therac-25
Accidents. IEEE Computer, Vol. 26, No. 7, July 1993, pp.18-41. (Web |
IEEE |
local)
- Boweler, Y.; Cullen, I.; Hutchinson, E.; "Enhancing the safety of
future systems"; Human Interfaces in Control Rooms, Cockpits and Command
Centres, 1999. International Conference on , 21-23 Jun 1999 Page(s): 179 -183
(IEEE |
local)
- Brown, M.L.; "Software systems safety and human errors", Computer
Assurance, 1988. COMPASS '88 , 27 Jun-1 Jul 1988 Page(s): 19 -28 (IEEE |
local) / 10 pages.
- Burns, A., "The HCI component of dependable real-time systems."
Software Engineering Journal, July 1991, vol. 6, no. 4, pp. 168-174. (local)
- Neumann, Peter G. ; "The human element"; Communications of the
ACM November 1991 Volume 34 Issue 11 (ACM |
local)
- Nielsen, J., Usability Engineering, Morgan Kaufmann, San Francisco,
1994. (descriptive web
page)
- Rasmussen, J., "Human error mechanisms in complex work
environments" Reliability Engineering & System Safety 22, no. 1-4,
(1988) : 155-67 (local)
- William B. Rouse, "Human-Computer Interaction in the Control of
Dynamic Systems", ACM Computing Surveys (CSUR) Volume 13 , Issue 1 (March
1981), (ACM |
local)
Other Reading:
- Reason & Maddox, Human factors guide for aviation maintenance
http://www.galazyatl.com/hfg/c14s00.htm
- Nagel, D.C. (1988). Human error in aviation operations. In Wiener, E.L.,
and Nagel, D.C. (Eds.) Human factors in aviation (Chapter 9). San Diego, CA:
Academic Press.
Required:
- Leveson, N.G., Software safety: why, what, and how ACM Computing Surveys
(CSUR) June 1986 Volume 18 Issue 2 (ACM |
local) / 39 book pages.
- Leveson, N.G. "High-pressure steam engines and computer
software," Computer , Volume: 27 Issue: 10 , Oct. 1994 Page(s): 65 -73 (Web |
IEEE
| local)
Other High-Level Summaries:
- Lutz, R., "Software Engineering for Safety: a roadmap,"
Proceedings of the Conference on the Future of Software
Engineering, May 2000. (ACM |
local)
Supplemental:
- Leveson, N., "A systems-theoretic approach to safety in
software-intensive systems," IEEE TDSC, vol. 1 no. 1, pp. 66-86 (local)
- Addy, E.A.; A case study on isolation of safety-critical software, Proc.
6th Conf. Computer Assurance, 1991, NIST/IEEE, pp. 75-83 (IEEE |
local)
- Dalcher, D.; "Lessons for the future: safety critical systems;"
Engineering of Computer-Based Systems, 1999. Proceedings. ECBS '99. IEEE
Conference and Workshop on , 7-12 Mar 1999 Page(s): 281 -293 (IEEE |
local)
- de Lemos, R.; Saeed, A.; Anderson, T.; "Analyzing safety requirements
for process-control systems;" IEEE Software , Volume: 12 Issue: 3 , May
1995 Page(s): 42 -53 (IEEE |
local)
- Hansen, Kirsten M.; Ravn, Anders P.; Stavridou, Victoria; From Safety
Analysis to Software Requirements, IEEE Transactions on Software Engineering,
Vol. 24, No. 7, July 1998 (IEEE |
local)
- Herrmann, D.S.; "A methodology for evaluating, comparing, and
selecting software safety and reliability standards," COMPASS '95.
'Systems Integrity, Software Safety and Process Security', 25-29 Jun 1995
Page(s): 223 -232 (IEEE |
local)
- Knight, J.C.; Safety critical systems: challenges and directions; Software
Engineering, 2002. ICSE 2002. Proceedings of the 24th International Conference
on , 2002 Page(s): 547 -550 (IEEE |
local)
- Leveson, N., "System safety in computer-controlled automotive
systems", SAE Congress, March 2000. (Web |
local)
- Leveson, Software safety in embedded computer systems, CACM 34(2), 1991, p.
34-46 (ACM |
local)
- Parnas, David L.; A. John van Schouwen; Shu Po Kwan; Evaluation of
safety-critical software; Communications of the ACM June 1990 Volume 33 Issue 6
(ACM |
local)
- Wallace, D.R.; Kuhn, D.R.; Ippolito, L.M.; "An analysis of selected
software safety standards", IEEE Aerospace and Electronics Systems
Magazine , Volume: 7 Issue: 8 , Aug 1992 Page(s): 3 -14 (IEEE |
local)
- Weiss, K.A.; Leveson, N.; Lundqvist, K.; Farid, N.; Stringfellow, M. An
analysis of causation in aerospace accidents Digital Avionics Systems, 2001.
DASC. 20th Conference , Volume: 1 , 2001 Page(s): 4A3/1 -4A3/12 vol. (IEEE |
local)
- McDermid, "Education and training for safety-critical systems
practitioners." In Wichmann (ed), Software in Safety-Related Systems, pp.
177-207, Chichester: Wiley, 1992
Required:
- Anderson, R.J., "Why Cryptosystems Fail," CACM, 37(11), ACM,
November 1994. (ACM |
local)
- Kocher, P.; R Lee, G McGraw, A Raghunathan, S Ravi; Security as a New
Dimension in Embedded System Design - Proc. of 41st Design Automation
Conference (DAC 2004) (ACM |
local)
- Koopman, P., Morris, J. & Narasimhan, P., "Challenges in Deeply
Networked System Survivability," NATO Advanced Research Workshop on
Security and Embedded Systems, August 2005 (Web
| local)
- Paar, C., Weimerskirch, A., "Embedded Security in a pervasive
world," Information Security Technical Report, vol. 12, issue 3, 2007. (ACM |
local)
Supplemental:
- Ravi, S., Raghunathan, A., Kocher, P., Hattangady, S., "Security in
Embedded Systems: Design Challenges," ACM Trans. Embedded Computing
Systems, 3(3), Aug 2004, 461-491. (ACM |
local)
- Peter Bergstrom, Kevin Driscoll, John Kimball, "Making Home Automation
Communications Secure," Computer, Oct 2001, pp. 50-56 / 7 pages. (IEEE |
local)
- Devanbu, P. & Stubblebine, S., "Software engineering for security:
a roadmap," Proceedings of the Conference on the Future of Software
Engineering, May 2000. (ACM |
local)
- Dobson & Randell, "Building reliable secure computing systems out
of unreliable insecure components," Proc. 1986 Symp. Security &
Privacy, IEEE, 1986, pp. 187-193 (2001 IEEE
reprint | local) Also,
the introduction to the reprint with more context (IEEE |
local).
- Duri, S., Marco Gruteser, Xuan Liu, Paul Moskowitz, Ronald Perez, Moninder
Singh, Jung-Mu Tang; Framework for security and privacy in automotive
telematics; International Workshop on Mobile Commerce ; 2002 (ACM |
local)
Discusses data privacy for a usage-based auto insurance scenario.
- Kornecki, A.; Zalewski, J., "Safety and security in industrial
control," Proceedings of the Sixth Annual Workshop on Cyber Security and
Information Intelligence Research (poster session) (ACM |
local)
- Lampson, B., et al, Authentication in Distributed Systems: Theory and
Practice, Proc. 13th SOSP, ACM, October 1991 (ACM |
local)
- Landwehr, A taxonomy of computer security flaws, with examples; ACM
Computing Surveys 26(3), Sept. 1994 (ACM |
local)
- Larson, U.; Nilsson, D., "Securing vehicles against cyber
attacks," CSIIRW vol 288, 2008. (local |
ACM)
- NCS TIB 04-1, Supervisory Control and Data Acquisition (SCADA) Systems,
October 2004 (Web |
local) A lengthy report describing
SCADA systems overall and attack scenarios.
- Oman, P.; Edmund O. Schweitzer III, Deborah Frincke; Concerns About
Intrusions Into Remotely Accessible Substation Controllers And Scada Systems
Proc. 27th Annual Western Protective Relay Conferences. Oct 2000 (Citeseer |
local)
- Parno, Bryan; Adrian Perrig; Challenges in Securing Vehicular Networks; To
Appear in Proceedings of the Fourth Workshop on Hot Topics in Networks
(HotNets-IV), November 14-15, 2005, College Park, MD. (Web |
local)
- Schneier, B.; A Shostack; Breaking up is hard to do: Modeling security
threats for smart cards; - USENIX Symposium on Smart Cards, 1999 (Web
| local)
- Stankovic, J.A.; Wood, A.D.; "Denial of service in sensor
networks"; Computer , Volume: 35 Issue: 10 , Oct 2002 Page(s): 54 -62 (IEEE |
local)
- Summers, R. C. "An overview of computer security," IBM Systems
Journal, vol. 23, no. 4, 1984. (local)
- Thompson, K., "Reflections on trusting trust", CACM, 27(8),
August 1984. (ACM
| local)
- Wargo, C. & Dhas, C., "Security Considerations for the e-Enabled
Aircraft", Aerospace Conference 2003. (local)
- Weis, SA; SE Sarma, RL Rivest, DW Engels; Security and Privacy Aspects of
Low-Cost Radio Frequency Identification Systems - First International
Conference on Security in Pervasive Computing, 2003 (web |
local) Cryptographic functions
proposed for RFID use
- Wolf, M.; A Weimerskirch, C Paar; Security in automotive bus systems;
Workshop on Embedded IT-Security in Cars (escar), 2004 (Web |
local) Describes the
shortcomings of various automotive protocols from a security standpoint.
Required:
- Czerny, B.J.; D'Ambrosio, J.G.; Murray, B.T.; "Providing convincing
evidence of safety in X-by-wire automotive systems;" High Assurance
Systems Engineering, 2000, Fifth IEEE International Symposium on. HASE 2000 ,
2000 Page(s): 189 -192 (IEEE |
local) / 4 pages.
- Pilkington, S.D.J.; Lee, A.R.; "The development of safety cases for
mass transit signalling and control projects-Jubilee Line case study,"
Developments in Mass Transit Systems, 20-23 Apr 1998 Page(s): 254 -259 (IEEE
| local) / 6 pages
- Kelly, T., "A systematic approach to safety case management," SAE
04AE-149, 2003. (Web |
local) / 10
pages
- Despotou, G., et al., "Extending the safety case to address
dependability," 22nd International System Safety Conference, 2004. (Web |
local) / 10 pages
You can see a rail "safety case" at:
http://www.tfl.gov.uk/tube/company/reports/safety-case.asp
Supplemental:
- Bell & Reinert, "Risk and system integrity concepts for
safety-related control systems," Microprocessors & Microsystems, 1993,
17(1), 3-15 (IEEE |
local)
- Betts, A.E.; Welbourne, D.; "Software safety assessment and the
Sizewell B applications", Electrical and Control Aspects of the Sizewell B
PWR, 1992., International Conference on , 14-15 Sep 1992 Page(s): 204 -207 (IEEE |
local)
- Bishop, P. & Bloomfield, R., "A methodology for safety case
development,"
- Cooper, L., "Assessing risk from the stakeholder perspective,"
IEEE Aerospace Conference, March 2003. (local)
- Jesty, Peter H ; Keith M Hobley (University of Leeds), Richard Evans (Rover
Group Ltd), Ian Kendall (Jaguar Cars Ltd), "Safety Analysis of
Vehicle-Based Systems", Proceedings of the 8th Safety-critical Systems
Symposium, 2000. (Web |
local) / 21 pages
- Kelly, T., "Arguing Safety A Systematic Approach to Managing
Safety Cases," Ph.D. Thesis, University of York, 1998. (Citeseer |
local)
- Lane, M.; "Predicting the reliability and safety of commercial
software in advanced avionic systems"; Digital Avionics Systems
Conferences, 2000. Proceedings. DASC. The 19th , Volume: 1 , 2000 Page(s):
4E4/1 -4E4/8 vol.1 (IEEE |
local)
- Perera, J.S., "Risk management for the international space
station," IEEE Aerospace Conference, March 2003. (local)
- Rivett, R. "Is there a Role for Third Party Software Assessment in the
Automotive Industry?", Proceedings of the 5th Safety-critical Systems
Symposium, 1997. (Web |
local)
- Rodriguez-Dapena, P.; "Software safety certification: a multidomain
problem" IEEE Software, Volume: 16 Issue: 4 , Jul/Aug 1999 Page(s): 31 -38
(IEEE |
local)
- Rushby, Understanding and evaluating assurance cases, SRI Tech. Report
SRI-CSL-16-01, July 2015. (Web)
- Weinstock, C.; Goodenough, J.; Hudak, J., "Dependability Cases,"
CMU/SEI-2004-TN-016. (Web
| local)
- S P Wilson, T P Kelly, J A McDermid, "Safety Case Development:
Current Practice, Future Prospects", Proceedings of 1st ENCRESS/12th CSR
Workshop, September 1995, Springer-Verlag. (Citeseer |
local) / 22 pages
Risk Management Tools
- Cornford, S.L.; Feather, M.S.; Hicks, K.A.; DDP-a tool for life-cycle risk
management; Aerospace Conference, 2001, IEEE Proceedings. , Volume: 1 , 2001
Page(s): 1/441 -1/451 vol.1 (IEEE |
local)
- Probabilistic Risk Assessment Procedures Guide for NASA Managers and
Practitioners, New Version 1.1 of November 12, 2002 (Web |
local)
- NPG 8715.3
NASA Safety Manual (local)
- NASA risk management web site
Web
Required:
- Schinzinger. Technology hazards and the engineer. IEEE Technology and
Society Magazine, June 1986, pp. 12-16. (local) / 5 pages
- Redmill, F.; "Some dimensions of risk not often considered by
engineers" Computing & Control Engineering Journal , Volume: 13 Issue:
6 , Dec 2002 Page(s): 268 -272 (IEEE |
local) / 5 pages
- Davis, "Safety critical systems - legal liabilities," Computing
& control, 1994, 5(1), 13-17 (IEEE |
local) / 5 pages
- John C. Knight , Nancy G. Leveson; "Licensing software engineers:
Should software engineers be licensed?" Communications of the ACM November
2002 Volume 45 Issue 11 (ACM |
local) / 4 pages.
Supplemental:
- Jonathan Bowen, "The ethics of safety-critical systems", Comm.
ACM, Volume 43, No. 4 (Apr. 2000), Pages 91 - 97. (ACM |
local)
- Gibbs, W., "Software's chronic crisis," Scientific American,
Sept. 1994, pp. 86-95. (local)
- Gotterbarn, "How the new software engineering code of ethics affects
you," IEEE Software, Nov/Dec 1999. (IEEE | local)
- Herkert, J.R.; "Ethical risk assessment: valuing public
perceptions"; IEEE Technology and Society Magazine , Volume: 13 Issue: 1 ,
Spring 1994 Page(s): 4 -10 (IEEE |
local)
- Kahn, Shulamit. Economic Estimates of the Value of Life. IEEE Technology
and Society Magazine, June 1986, pp. 24-31. (local)
- McFarland, "Ethics and the safety of computer systems," IEEE
Computer, February 1991. (IEEE |
local).
Other References:
- Perrow, C., Normal Accidents, Princeton University Press, 1999.
- Wiener, Lauren. Digital Woes: why we should not depend on software.
Reading, Mass.: Addison-Wesley Pub. Co., 1993. ISBN 0201626098.
- Birsch, D. and J.H. Fielder. The Ford Pinto Case: A Study in Applied
Ethics, Business, and Technology. Albany, NY: State University of New York
Press. 1994.
- Royal Society, Risk: Analysis, Perception, and Management, London:
Royal Society, 1992
- Barnett, "Doctrine of manifest danger"; ASME DE-55, Reliability,
Stress Analysis, Failure Prevention, 1993
- D. Okrent, "Risk Perception Versus Risk Analysis," Reliability
Engineering & System Safety, Volume 59, Number 1
- Wichmann, "Legal liability for software in safety-related
systems," in: Wichmann (ed), Software in Safety-Related Systems,
Chichester: Wiley, 1992.
- Wilde, G. J. S. "The theory of risk homeostasis: Implications for
safety and health." Risk Analysis, 2:209-225, 1982.
http://www.badsoftware.com/ has
several papers that talk about UCITA, which is an attempt to regulate software
that will have an effect on embedded system software.
Required:
- Kopetz, H.; Merker, W., "The Architecture of MARS", FTCS 1985, p.
50. (IEEE |
local)
- Kopetz, H.; Grunsteidl, G.; "TTP-a protocol for fault-tolerant
real-time systems"; Computer , Volume: 27 Issue: 1 , Jan 1994 Page(s): 14
-23 (IEEE |
local)
- Hermann Kopetz, Günther Bauer "The Time-Triggered
Architecture," Proceedings of the IEEE, Jan 2003 (IEEE |
local)
Supplemental:
- W. Steiner & H. Kopetz, "The startup problem in fault-tolerant
time-triggered communication," DSN 2006, 10 pages. (local)
- Damm, A.; Kopetz, H.; Koza, C.; Mulazzani, M.; Schwabl, W.; Senft, C.;
Zainlinger, R.; "Distributed fault-tolerant real-time systems: the Mars
approach;" Micro, IEEE , Volume: 9 Issue: 1 , Feb 1989 Page(s): 25 -40 (IEEE |
local)
- Maier, Bauer, Stoger & Poledna, "Time-triggered architecture: a
consistent computing platform," IEEE Micro, July-August 2002. (IEEE |
local)
- Poledna, S.; Burns, A.; Wellings, A.; Barrett, P.; "Replica
determinism and flexible scheduling in hard real-time dependable systems;"
Computers, IEEE Transactions on , Volume: 49 Issue: 2 , Feb 2000 Page(s): 100
-111 (IEEE
| local)
Required:
- BART (San Francisco Bay Area Rapid Transit District), System Safety
specification, 1981. (local) / 6
pages
- Butler & Finelli, "The infeasibility of experimental
quantification of life-critical software reliability," IEEE Trans. SW
Engr. 19(1):3-12, Jan 1993. (IEEE |
local) / 10 pages
- Myers, W. "Can software for the Strategic Defense Initiative ever be
error-free?", Computer 19, no. 11, (Nov. 1986): 61-67 (local) / 7 pages
Perrow paper:
http://www3.interscience.wiley.com/journal/119973206/abstract
Supplemental:
- Alger, L.S.; Harper, R.E.; Lala, J.H.; "A design approach for
ultrareliable real-time systems;" Computer , Volume: 24 Issue: 5 , May 1991
Page(s): 12 -22 (IEEE |
local)
- Brooks, F., "No Silver Bullet: essence and accidents of software
engineering," IEEE Computer, 20(4): 10-19. (local)
- Lala, J.H.; Harper, R.E.; "Architectural principles for
safety-critical real-time applications," Proceedings of the IEEE, Volume
82, Issue 1, Jan. 1994, pp. 25-40 (IEEE |
local)
- Littlewood & Strigini, "Validation of ultrahigh dependability for
software-based systems." CACM, pp. 69-80, Nov. 1993 (ACM |
local) / 12 pages
- Roger S. Rivett; "Emerging Software Best Practice and how to be
Compliant", Proceedings of the 6th International EAEC Congress July 97.
(Web |
local)
- Rushby, John, "Formal Methods and the Certification of Critical
Systems," SRI-CSL Technical Report, November 1993. (Citeseer |
local)
- Rushby, "Critical system properties: survey and taxonomy" (web
version), 1994 (Citeseer |
local)
- Saltzer, J.H., Reed, D.P., Clark, D.D, End-to-End Arguments in System
Design, Transactions on Computer Systems 2(4):277-288, ACM, November 1984. (ACM |
local)
- Suri, N., Walter, C. & Hugue, M., "Introduction", Advances
in ultra-dependable distributed systems, IEEE Press, 1995 (local)
- Siewiorek, Daniel P., Hsiao, M. Y., Rennels, David, Gray, James, Williams,
Thomas, Ultradependable Architectures, Annual Review of Computer Science, 1990
Miscellaneous
Error Coding
- M. Paulitsch, J. Morris, B. Hall, K. Driscoll, E. Latronico, and P.
Koopman, "Coverage and the Use of Cyclic Redundancy Codes in
Ultra-Dependable Systems," DSN 2005 (local)
- Ray, J., & Koopman, P. "Efficient High Hamming Distance CRCs for
Embedded Applications," DSN06, June 2006 (web |
local)
- Maxino, T., & Koopman, P. "The Effectiveness of Checksums for
Embedded Control Networks," IEEE Trans. on Dependable and Secure
Computing, in press.