To: "'Philip Koopman'" <koopman@cmu.edu>
Subject: Automated Robustness testing research
From: Greg Bergsma <gbergsma@qnx.com>
Date: Thu, 12 Mar 1998 13:56:50 -0500

Dear Mr. Koopman,

As you are aware, we at QNX Software Systems Ltd (QSSL) are following your research into comparing operating systems with robustness benchmarks. Coming up with an appropriate benchmark suite poses a significant challenge, and we commend you and your colleagues for taking such an active role. We are particularly interested in the opportunity for us to improve the robustness of QNX as a result of your research.

We would like to take this opportunity to explain the architectural response we have taken to failures that you have identified as "abort" failures. We believe our approach to handling corrupt pointers has significant merit as an alternative to the approach you propose in your research.

Corrupt pointers are the nemesis of all C programmers. They indicate that something wrong has happened, something unexpected. It is the role of the OS to ensure corrupt pointers in one process cannot affect other processes in the system or the OS kernel itself. Taking advantage of the MMU is crucial for any OS that is to be considered for mission-critical applications.

Finding the code which resulted in the corrupt pointer can be very difficult, particularly in large programs. The OS must provide the maximum amount of feedback to the programmer to allow him to track down the problem.

Although returning EFAULT from C library routines provides definite feedback that an invalid pointer is being used, we knew it was possible to provide additional feedback - feedback that could be crucial to finding the source of the fault. If a library routine returns EFAULT, then it is up to the calling program to determine how it should recover from the error. Having a stray pointer in many cases indicates some other inherent problem with the program, a problem that the programmer may not be able to identify in any recovery code that would follow the function call. In fact, recovery code can only handle situations where you think you MAY have had a problem - the problem itself could be quite different from what you coded for. The recovery code may be totally inappropriate for the condition that occurred. You may do further damage by continuing to use data areas and variables that may have contributed to the stray pointer in the first place. In most cases, the only course of action would be to restart the process.

We also recognized that coding for the EFAULT error was not part of most C programmers' programming paradigm. If you look at the source for X and BSD TCP/IP and a whole host of related source code, you find a negligible amount of code handling the EFAULT condition. Thus we decided it was essential to terminate processes with a memory violation (even in a C library function) - provided we also gave the developer recovery mechanisms and maximum feedback on the fault.
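
For illustration, the EFAULT style of checking discussed above might look like the following in C. This is a minimal sketch, not QNX library code: checked_write and the write_status codes are hypothetical names introduced here, and whether a bad buffer produces EFAULT or a fault signal depends on the platform.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical result codes, used only in this illustration. */
enum write_status { WRITE_OK, WRITE_BAD_POINTER, WRITE_OTHER_ERROR };

/* Write len bytes from buf to fd and classify the outcome. */
enum write_status checked_write(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);

    if (n >= 0)
        return WRITE_OK;            /* a short write counts as success here */

    if (errno == EFAULT) {
        /* All we learn is that buf was rejected; nothing here says
           which earlier statement corrupted the pointer. */
        fprintf(stderr, "write: %s\n", strerror(errno));
        return WRITE_BAD_POINTER;
    }
    return WRITE_OTHER_ERROR;       /* EBADF, EIO, and so on */
}
```

Every call site that might receive a stray pointer would need a check of this kind, and the WRITE_BAD_POINTER branch still has to guess at a sensible recovery.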

So the mechanisms we provide developers are:

1. The ability for the OS to notify an overseer process (we call this a "software watchdog" - written by the developer) that can intelligently recover from the fault (i.e., restart the process and/or any related processes).

2. Process dump capabilities whereby the process that is being terminated is first dumped to disk (all code and data), before it is terminated. We then provide post-mortem dump capability in our debugger that allows the developer to view the state of the process at the exact instruction where the violation occurred (including variables, stack trace, function trace, C and assembly source trees). The dump can be analyzed off-line, while the software watchdog can keep the system running.
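
The overseer idea in point 1 can be sketched with ordinary POSIX calls. This is only an approximation under stated assumptions: it uses fork and waitpid rather than any QNX-specific notification mechanism, and run_once, supervise, healthy_task and faulty_task are hypothetical names made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run task() in a child process. Returns the signal number that killed
   the child, 0 if it exited on its own, or -1 if fork/waitpid failed. */
int run_once(int (*task)(void))
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(task());              /* child: run the task, then exit */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Restart task() each time a signal kills it, at most max_restarts
   times. Returns the number of restarts performed, or -1 on failure. */
int supervise(int (*task)(void), int max_restarts)
{
    int restarts = 0;

    for (;;) {
        int sig = run_once(task);
        if (sig < 0)
            return -1;
        if (sig == 0)
            return restarts;        /* clean exit: nothing to recover */
        fprintf(stderr, "watchdog: child killed by signal %d\n", sig);
        if (restarts == max_restarts)
            return restarts;        /* give up after the restart budget */
        restarts++;
    }
}

/* Demo tasks: one exits cleanly, one simulates a memory violation. */
int healthy_task(void) { return 0; }
int faulty_task(void)  { raise(SIGSEGV); return 1; }
```

A real watchdog would also decide which related processes to restart and would leave dump analysis to be done off-line; the sketch only shows the detect-and-restart loop.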

This is MUCH more accurate than checking for EFAULT, pointing the programmer to the precise instruction which caused the fault. However, if the program wants to deal with memory access violations itself, it can also catch the SIGSEGV signal and have its own signal handler execute. Again, this is more accurate than simply checking for an EFAULT return value, and doesn't require modifying potentially huge amounts of source code.
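
A program that prefers to catch SIGSEGV itself could install a handler along the following lines. This is a hedged sketch using the standard POSIX sigaction interface; install_segv_handler, on_segv and the exit code 3 are hypothetical choices for the example, not part of any QNX API.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Runs in signal context, so it uses only async-signal-safe calls.
   info->si_addr holds the faulting address if the program wants it. */
static void on_segv(int signo, siginfo_t *info, void *context)
{
    static const char msg[] = "fatal: invalid memory access\n";
    ssize_t n = write(STDERR_FILENO, msg, sizeof msg - 1);

    (void)n;
    (void)signo;
    (void)info;
    (void)context;
    _exit(3);                       /* arbitrary code chosen for this sketch */
}

/* Install the handler once at start-up. Returns 0 on success, -1 on error. */
int install_segv_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;       /* request the three-argument handler */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

One call at start-up covers every pointer dereference in the program, which is the contrast with adding an EFAULT check after each library call.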

With QNX adopting such a philosophy, it comes as no surprise that you have identified more Abort failures with QNX than with the other OSes evaluated in your research.

Our approach provides enormous advantages in the amount of feedback we provide developers to assist them in getting to the root of these sorts of programming errors. We feel that the identification of so many Abort failures, and the way that subsequently affects our rating in your report, is not representative of the level of robustness we deliver - rather it represents a difference in philosophy. Had we no recovery and feedback mechanisms, then we would agree with the rating. We have no issue with the other categories of failure you identify.

One last issue you might wish to consider reflects directly on the architecture of the OS and how drivers can have an effect on system reliability and robustness. In the embedded and real-time arenas, many systems require the use of custom hardware. For applications to access that hardware, drivers need to be written. These drivers vary in complexity, and in most operating systems end up running as part of the kernel - in kernel space. What would happen if a driver started using a corrupt pointer? It doesn't run with any memory protection, so it has free rein to clobber any part of memory - even the kernel (possibly causing a kernel fault). Tracking these problems can be a nightmare. However, an OS with a microkernel architecture like QNX solves this problem - drivers run in user space with full memory protection. In other OSes, any attempt by such a driver to use a bad pointer would crash the OS. Under QNX, it becomes a routine bug to be analyzed as easily as any other application-level programming error.

We look forward to your response to these issues and will continue to eagerly monitor your research. We would like to pursue a conference call with you to discuss these issues - please nominate a time that would be convenient for you.

Sincerely,
Greg Bergsma - Senior Technology Analyst (greg@qnx.com)
Dan Hildebrand - Senior Architect (danh@qnx.com)

________________________________________________________
QNX Software Systems Ltd     | Phone: +1 (613) 591 0931
175 Terence Matthews Crescent| Fax:   +1 (613) 591 3579
Kanata, Ontario, Canada      | Email: gbergsma@qnx.com
K2M1W8                       | WWW:   http://www.qnx.com
________________________________________________________