Undefined Behavior Review: tis-interpreter vs. RV-Match

Posted on August 4th, 2017 by Brandon Moore

John Regher's blog is a great source for anyone concerned about undefined behavior in C programs. The latest installment Undefined Behavior in 2017, written jointly with Pascal Cuoq, reviews the state of popular tools for detecting important categories of undefined behavior, and describes a new tool tis-interpreter that they have been working with.

At Runtime Verification we are big proponents of dynamic program analysis and rigorous error detection, so we were excited to hear of another tool following a similar approach to our own undefined-behavior checker, RV-Match(which is the commercial continuation of the academic kcc tool). We tested the most recent available version of tis-interpreter against RV-Match. With roots in the Frama-C verification project, we would hope tis-interpreter is faithful to the C standard - or become concerned that Frama-C could "prove" wrong code correct. We found that tis-interpreter detects many fewer errors than RV-Match, and has missing or incorrect implementations of several language features and almost all library functions. On the positive side, their examples revealed a few errors in RV-Match, which we have now fixed. We will update this post for improvements in tis-interpreter.

tis-interpreter Evaluation

Unfortunately the public tis-interpreter repository does not include a test suite, so our comparison was based mostly on our tests, plus the four examples provided with tis-interpreter. We compiled tis-interpreter from the most recent code in the repository at the time of writing, commit 33132ce from Dec 14, 2016. Our tests are all available in our public C semantics repository. The tests for supporting valid code are in the tests/unit-pass directory, and tests for error detection in examples/c/error-codes/.

The two most important concerns for an error-detecting C interpreter are the quality of error detection and the difficulty of using the tool on existing code. Comprehensive and reliable error detection is the whole point of such a tool, and the tool must work on your code to get any benefit from it. Usability problems in the form of missing support for language features and library functions are particularly concerning, as they required code changes for testing that cast doubt on the applicability of the results to the original code.

Compatibility

We first ran our conformance tests, as checking that programs without undefined behavior run correctly is easier than interpreting and comparing error reports. These tests are found in our tests/unit-pass directory. The results were not particularly favorable. Out of these 1203 tests which we claimed did not have undefined behavior (including all error-free programs from gcc's own tests), tis-interpreter was unable to run 124 of them, or a little over ten percent.

15 of these programs caused tis-interpreter to crash with some kind of error. This appears to be the result of 3 different defects in tis-interpreter, one involving calling fopen on a device, one involving improper implementation of stdarg.h macros, and one involving use of volatile-qualified variables.

Another 65 programs failed because tis-interpreter has little to no support for the standard library. This is a serious problem because essentially all software written in C in the desktop environment makes significant use of the standard library. Among the unimplemented functionality are file I/O, threading, mathematical operations, and exit hooks. This is however only a small subset of the total missing standard library functionality. It is our understanding that in order to use tis-interpreter you must first modify your program in order to stub nearly all standard library support, which seems to us to be a huge investment with any complex application software. However, the tis-interpreter homepage gives several examples of bugs found in significant libraries. We curious about the modifications needed to test those libraries and reproduce those results.

Another 22 tests failed because tis-interpreter appears to have an incorrect semantics of the C language. Some of the failing tests include those involving anonymous structs as struct fields, compound literals, constant expressions, flexible array members, floating point arithmetic, integer promotions, the long double type, scanf %n, tentative definitions, incomplete array types, va_arg, and volatile-qualified variables. This is a particularly serious problem as dynamic tools ought to be capable of having no false alarms, and worse, tis-intepreter is unable to execute past the point it detects a violation, which will almost-certainly render real applications unanalyzable due to these false positives. Reporting multiple undefined behaviors occurring during the same execution was a feature that was repeatedly asked for by our customers, and it was rather non-trivial to support.

Finally, another 22 tests appear to fail because entire features of C appear to be missing or incorrectly supported in tis-interpreter. Any programs including keywords introduced in C11 failed with syntax errors, indicating that even the grammar of their parser has not been updated to support C11. Similarly, code which introduced #pragmas in certain locations also triggered syntax errors, which is a defect compared to the ISO standard's definition for the behavior of pragmas. Even some C99 functionality is missing, namely variable-length arrays and universal character names were also not supported correctly. It is not clear what version of the C standard tis-interpreter aims to support.

On the positive side, tis-interpreter found 2 errors that RV-Match had missed in these programs. Both false negatives have been fixed in the latest version of RV-Match on our website. In one case we were expecting an incorrect argument type for a certain printf conversion specifier. In the other example the error was using va_start in a varargs function whose last fixed argument was declared with type float which isn't compatible with the "default argument promotion" of its type (which promotes float to double).

The examples from tis-interpreter also demonstrated several more errors in our implementation and analysis of printf. Dynamic precision specifiers as in %.*d were not parsed at all, precision was ignored on string conversions such as %.4s, and also formatting a string did not report an error when accessing uninitialized memory. These problems have also been fixed.

In contrast RV-Match supports C11 except for _Generic and threading support in thread.h/stdatomic.h, and also supports some popular GCC extensions. The entire C standard library is available, some portions fully analyzed according to the specification in the standard, some with a C implementation run under analysis, and others calling the native implementation. RV-Match also allows calls to and from any native code from interpreted code, even exporting function pointers to interpreted code, and linking with native libraries.

Another important feature for supporting real code is that the main RV-Match command line tool is designed to be usable as a C compiler, and supports many GCC options. This allows many programs to be tested not just without modifying code but also without even modifying the build system. For example, the current gzip-1.8 source code can be built to be run under analysis simply by running CC=kcc ./configure; make.

Error Detection

To evaluate the error detection power of tis-interpreter we ran our tests for 122 of the errors we detect, leaving out only those errors for which GCC already detects as a compiler error or warning. These tests are in our repository at examples/c/error-codes/ (leave out the violates-constraints subdirectory). Each test consists of a pair of similar programs with and without the target error.

We found that tis-interpreter apparently makes no attempt to detect or report reliance on unspecified or implementation-defined behavior. They also do not support programs relying on POSIX features, let alone detect violations of the POSIX standard. We also have a few optional "lint-style" warnings that customers have requested to be able to detect legal behavior they wish to avoid, such as unsigned overflow. This leaves 106 undefined behaviors that RV-Match detects within the scope of tis-interpreter's apparent goals. Of these 106 undefined behaviors, tis-interpreter failed to detect any instance of 38 of them, or over one third. Included in these errors are some serious errors that can trigger crashes, incorrect behavior, or lack of portability in real C programs.

Running the tis-interpreter examples we found one false negative in RV-Match, where our model of printf did not report an error with a string conversion read uninitialized bytes. We also found errors that tis-interpreter missed, where the program attempts to reinterpret bytes as 64-bit integers by casting a pointer into a uint8_t array to a uint64_t*. This possibly causes undefined behavior by producing an insufficiently aligned pointer (depending how the array is aligned) and definitely produces undefined behavior by being incompatible with the type of the array.

Conclusion of Our Evaluation

As a further measure, we evaluated our performance relative to tis-interpreter, and here their behavior shows promise. On the benchmark example they used, they were 4x faster than our tool. Some speed difference is expected merely because tis-interpreter attempts to detect fewer errors. Another factor is that our tool is generated automatically from a formal semantics of the C11 standard, rather than being a hand-crafted interpreter. Given these factors stemming from a rigorous foundation, the speed difference doesn't seem like enough compensation (it's a different story with tools like ASan which have several orders of magnitude less overhead than RV-Match, and are thus well worth using in their domains).

We also found tis-interpreter difficult to use. The need to invoke the interpreter on the correct source files rather than reusing any existing build system is unfortunate. That difficulty might have been overcome, but the lack of standard library support ultimately led us to give up on our attempts to evaluate tis-interpreter on the third party ITC benchmark, because we simply could not get the code to even "compile" without heavy modification. Publicly available code to reproduce their use of tis-interpreter on libraries such as SQLite or OpenSSL would be helpful. On the ITC Benchmark, RV-Match got a score of 92 while Grammatech's CodeSonar, which performed best among static analysis tools, got 82. A comparison of our results on the ITC benchmark with those of Valgrind, MSan, ASan, TSan, and UBSan can be found here. It would be great if the authors of tis-interpreter could run the ITC benchmark and report their scores.

Altogether, we believe that RV-Match continues to stand out as the best-in-class dynamic analysis tool for analyzing undefined behavior in C applications. Details of our evaluation are available on request.

Undefined Behavior Review

We would also like to add a couple suggestions to John and Pascal's review of available tools for various categories of undefined behavior.

Memory Safety: Internal Overflow

For detecting memory safety violations, tools such as ASan are good at detecting many common kinds of buffer overflows, but there is a specific category of internal overflow which are not detected by any tool other than RV-Match. In particular, tis-interpreter does not track memory carefully enough to detect them.

An internal overflow happens when a pointer into a struct field - usually an array - is offset enough to fall into another field, but still remains within the containing struct. Simply tracking what memory is allocated cannot detect this. Here is an example with a struct that might be used for a linked list of buffers.

#include <stdlib.h>
struct node {
    char buf[120];
    struct node* next
};
void init_node(struct node* n) {
    n->next = NULL;
    memset(&n->buf, 0x6A, 128);
}

The initialization is supposed to set the next pointer to NULL and fill the buffer with a distinctive value, but it writes too many bytes and corrupts the pointer. A simple main method demonstrates the problem:

#include <stdio.h>
int main(void) {
  struct node n;
  init_node(&n);
  if (n.next == NULL) {
    return 0;
  } else {
    printf("Pointer not initialized to NULL\n");
    return 1;
  }
}

Compiled by clang -O0 -fsanitize=address bounds.c, this reports no memory error, and prints only

Pointer not initialized to NULL

tis-interpreter keeps only the overall bounds of the struct rather than faithfully implementing the C memory model, and corrupts the pointer with no complaint:

$ ../tis-interpreter.sh bounds.c

[value] Analyzing a complete application starting at main
[value] Computing initial state
[value] Initial state computed

Pointer not initialized to NULL

[value] done for function main

RV-Match actually tracks which (sub)object a pointer refers to, and reports the undefined behavior. The first report is from merely computing a pointer offset too far past the end of the array, and the second from writing through it.

$ kcc bounds.c
$ ./a.out
 A pointer (or array subscript) outside the bounds of an object.
   at memset(string.c:31:13)
   by init_node(bounds.c:8:3)
   by main(bounds.c:14:3)
  Undefined behavior (UB-CEA1).
   see C11 section 6.5.6:8 http://rvdoc.org/C11/6.5.6
   see C11 section J.2:1 item 46 http://rvdoc.org/C11/J.2
   see CERT-C section ARR30-C http://rvdoc.org/CERT-C/ARR30-C
   see CERT-C section ARR37-C http://rvdoc.org/CERT-C/ARR37-C
   see CERT-C section STR31-C http://rvdoc.org/CERT-C/STR31-C
 Dereferencing a pointer past the end of an array.
   at memset(string.c:31:13)
   by init_node(bounds.c:8:3)
   by main(bounds.c:14:3)
  Undefined behavior (UB-CER4).
   see C11 section 6.5.6:8 http://rvdoc.org/C11/6.5.6
   see C11 section J.2:1 items 47 and 49 http://rvdoc.org/C11/J.2
   see CERT-C section ARR30-C http://rvdoc.org/CERT-C/ARR30-C
   see CERT-C section ARR37-C http://rvdoc.org/CERT-C/ARR37-C
   see CERT-C section STR31-C http://rvdoc.org/CERT-C/STR31-C
Pointer not initialized to NULL

Note that a char* pointer cast from the pointer to the overall structure could be used to write all those bytes. The program is undefined specifically because the pointer passed to memset was the address of the field. There would be no problem if the init_node function was defined as

init_node(struct node* n) {
    n->next = NULL;
    memset(n, 0x6A, 128);
}

Data Races

As mentioned on John's blog, data races are an important category of undefined behavior, and the main difficulty of using the race detection tool TSan is that some races are extremely rare to trigger, rather than any specific weakness in the tool. We have also found that data races are a particularly difficult sort of undefined behavior, and are currently developing advanced data race detection separately from RV-Match in our RV-Predict tool. The biggest difference from TSan is that RV-Predict uses a much more powerful analysis than TSan, which can sometimes deduce that a data race is definitely possible from observing one run of an instrumented program even if that race did not trigger on that run. This heavier analysis means that programs currently run approximately 10x slower with RV-Predict compared to TSan, but also allows RV-Predict to reliably find (on nearly every run) some data races that TSan can only find by running the same program hundreds or thousands of times. More details will be in an upcoming blog post.