14% of SV-COMP's "Correct Programs" are Undefined!

Posted on September 18th, 2016 by Grigore Roșu
Posted in RV-Match

TACAS 2016

Last April (2016), I gave a tutorial on K at ETAPS'16 in Eindhoven, Netherlands, where I also demonstrated RV-Match.  During the week that I spent there, I heard several friends and colleagues who were involved with the Competition on Software Verification, SV-COMP, that some of the benchmark's correct programs appear to be undefined.  What?  So some of the assumed-correct C programs that are used to evaluate the best program verifiers in the world are actually wrong programs?  That immediately required for action, so I informed the RV-Match team, which took the challenge (special thanks to Manasvi Saxena, then an employee of RV, now a grad student at UIUC).  All their experiments are public and reproducible, you can find them on Github.

RV team's findings were almost unbelievable: in only a few hours of playing, they found that 14% of the supposedly correct programs were undefined! Recall that an undefined program is allowed, by virtue of the ISO C11 standard, to behave in any possible way when compiled and executed: it may or may not yield a compilation error (none of the SV-COMP programs had compilation errors), and then when executed may return 0, 1, 1000, etc., it can read and mail your password file, can erase your disk, etc., etc., it can literally do anything; see some funny examples of undefined programs in RV-Match's demo.  Therefore, 14% of the programs that the state-of-the-art program verifiers can prove correct, are in fact incorrect programs.  Just think about it ... Can we really trust any program verifiers for C these days?

The morale here is that a program verifier for a language, be it C or any other language, is only as good as the language model that it defines or encodes internally.  Defining C rigorously is a highly nontrivial task, but there is no other way around!  Program verifiers must define it if they want to claim they verify C and not some other toy-ish C-like language.  By allowing undefined programs in the SV-COMP benchmark the program verifiers are encouraged to take shortcuts wrt the correctness of their C model, which will ultimately lead to lack of trust in them.  The SV-COMP organizers should take undefinedness of their benchmark programs very seriously, if they expect their competition to be taken seriously.  In my view, they should have an undefinedness checking preliminary section, which each tool for C should be required to pass first, to make sure that the tool indeed speaks C.

RV-Match

In the meanwhile, a simple solution is for the participating teams and organizers to simply run their benchmarks with RV-Match.  It is free for academia and research, after all.  And at our knowledge, it is the only tool of its kind that is based on a formal (operational) C semantics, so the participating teams can consult the semantics for answers if not sure about how C should behave in certain tricky cases.

 

Note (added after the article above was published): I was informed that the SV-COMP organizers do actually seriously consider eliminating the undefined behavior from their benchmark.  Also, that in the meanwhile the participating verifiers that check for undefined behavior have to switch this feature off in order to not experience a disadvantage (and that was criticized by the participants and needs to be addressed).  There also seems to be some confusion wrt what undefined behavior is, in that some participants think that it is about non-determinism: that is, undefined programs allow some non-deterministic behaviors that violate some property, so tools can simply disallow those non-deterministic paths/behaviors.  Note that undefined behavior is very tricky, to say the least, and has nothing to do with non-determinism.  Consider, for example, this basic undefined program:

int main() { int x = 0; return (x = 1) + (x = 2); }

Believe it or not, when compiled with gcc this program returns 4 ('echo $?' to see the returned value):

$ gcc 1-unsequenced-side-effect.c
$ ./a.out ; echo $?
4

Now consider the following apparently equivalent program, at least in terms of non-determinism, which performs the assignment in a separate function:

int asgn(int *x, int v) {   return (*x = v); } int main() {   int x = 0;   return asgn(&x,1) + asgn(&x,2); }

This program is well-defined.  It has absolutely no problem whatsoever, albeit one could say that it allows the same number of paths as the previous program.  Why is this one defined?  Well, it is one of the particularities of C, of its ~200 different types of undefined behavior it allows by design (see the ISO C11 standard).