Testing K Framework using RV-Predict

Posted on August 30th, 2015 by Yilong Li

This blog post is the beginning of a series of blog posts recording our experience of hunting data races in complex real-world applications using RV-Predict. To start with, we will choose the K framework as our first test subject.

Why the K framework? Well, there are a number of reasons. First of all, it's a very cool project! Check out its website for a quick introduction. It is also the foundation of our RV-Match product, which aims to provide strong correctness guarantees through formal verification. Therefore, it's very important that the underlying K Framework is free from bugs including data races. Finally, the K framework is complex. It is written in a combination of Java and Scala and makes use of the Java 8 features extensively. It also takes advantage of many 3rd-party libraries such as Guice, Guava, Nailgun, etc. As a practical race detector, RV-Predict must be able to handle all these aspects gracefully.

Understanding the Story

In a nutshell, the K framework distribution comes with a collection of tools or commands for defining and executing programming language semantics: kompile, krun, kast, ktest, etc. All these commands can be run either normally or in a so-called kserver mode. The kserver mode is designed to overcome the high startup overhead of JVM. Under the hood, it is using Nailgun to share a single JVM among multiple calls of those K commands. Besides JVM, some internal data structures are also shared among multiple calls of K commands to avoid the duplicate work of building them everytime. With the kserver, multiple requests are fully parallelized within a single JVM. On the bright side, this is extremely handy in testing because the running time of the entire test suite is reduced by almost an order of magnitude. However, introducing new concurrency into the framework also imposes new challenges to the K developers because most code in the framework was not designed to be used in such a concurrent context in the first place. Without the help of RV-Predict, the initial transition to kserver was slow and tedious: many developer hours were wasted in tracking down data races introduced when parallelizing the code. Even until kserver has now been in use for a few months, there are still quite some data races lurking around causing the Jenkins build to fail intermittently.

Preparation

We are using K 3.5 revision 3626e0c to perform the evaluation. First off, clone the K repository from GitHub and follow the instructions in README to build the K framework from source:

git clone https://github.com/kframework/k.git 3626e0c
cd k
mvn package

Then add ~/k/k-distribution/target/release/k/bin/ to your PATH variable. Finally, replace java with rv-predict in k-distribution/target/release/k/bin/kserver (or kserver.bat if you are using Windows) and specify your favorite RV-Predict options, so that we can start RV-Predict along with the kserver mode later. In this experiment, we are using --stacks to compute stack traces in the final race reports and --window 2000 to set the prediction window size. The detailed explanation of RV-Predict options can be found in the RV-Predict documentation. Now the modified kserver script looks like this:

#!/usr/bin/env bash
source "$(dirname "$0")/../lib/setenv"
ulimit -s `ulimit -H -s`
if [ -z "$K_OPTS" ];
  then export K_OPTS="-Xms64m -Xmx4G -Xss32m -XX:+TieredCompilation"
fi
if "$(dirname "$0")/../lib/checkJava"; then
  rv-predict --window 2000 --stacks -Djava.awt.headless=true -Djansi.force=true $K_OPTS -ea -cp "$(dirname "$0")/../lib/java/*" org.kframework.main.Main -kserver "$@"
else
  exit 1
fi

Running Test Programs

For demonstration purpose, we are going to run RV-Predict against the test programs of the kool-typed-static semantics that comes with the tutorial. First of all, open up another terminal and enter kserver. Pay attention to the log directory printed after [RV-Predict] Log directory:. Data races will be written to the result.txt file in the log directory as they are detected, so you can examine the content of this file anytime to get some quick results before the test finishes. Then in the original terminal:

cd k-distribution/target/release/k/tutorial/2_languages/2_kool/2_typed/2_static
ktest kool-typed-static.k tests/config.xml --timeout 3600000
stop-kserver

It takes 15'10'' on my laptop to finish ktest with RV-Predict. Without RV-Predict, the test finishes in 0'50'', so the runtime overhead of RV-Predict is about 18.2x. The relatively high overhead comparing to some of the fastest dynamic race detectors such as FastTrack and ThreadSanitizer can be attributed to the sophisticated analysis performed by RV-Predict under the hood. In fact, that's where the "Predict" part in RV-Predict comes from: RV-Predict can detect, or predict, real data races that are not observed in the actual execution! Pretty cool, huh? OK, end of this small digression. Right now, you should see the race reports printed in the terminal running kserver. As you can see, it is very easy to integrate RV-Predict into the existing test workflow of the K framework.

Findings

Here is a complete list of data races in the K framework that are found by RV-Predict. As of today, there are 19 data races in the list. RV-Predict also found 5 data races in the underlying Nailgun. They are reported to Nailgun developers here. In fact, RV-Predict found a lot more data races that happen inside 3rd-party libraries (e.g., Spoofax, SDF, etc) and code automatically generated from them. Since this part of the code base are going to be removed in the upcoming K 4.0, we decide that it is not worth the effort to report them separately.

Out of the 24 (= 19 + 5) data races reported by RV-Predict, 1 of them is a false positive, 2 (issue #1602 & issue #1541) are benign racy single-check for lazy-initialization, and the rest are harmful. The false positive is due to our heuristics used to identify thread-safe Collection rather than our core algorithm. This, as well as the 2 benign races, can be easily suppressed using the --suppress option. We categorize the 21 harmful data races based on our previous post on popular kinds of Java data races.

Simple races:
#1665, #1606, #1599, #1598, #1580, #1568, #1546, #1544, #1543, #1540, #1538, #63, #65, #66

Using non-thread-safe class without sync.:
#1658, #1601, #1597, #1547, #1545, #62

Broken spinning loop:
#64

Data races that stem from the use of non-thread-safe class without synchronization are always harmful because they can result in corrupted data silently and, thus, wrong result silently. Among the simple races, two of them, issue #1580 and issue #1606, are particularly harmful and directly responsible for the intermittent failures of the Jenkins builds. Issue #1580 is a data race on a variable counter field responsible for assigning unique IDs to distinct anonymous variables in the rewrite rules used to define the programming language semantics. If the data of this counter is corrupted, distinct anonymous variables can get identical ID and change the meaning of the rewrite rules completely. The fix for issue #1580 is quite simple: just declare the counter as an atomic variable. Issue #1606 is another race on the field of a mutable AST term, CellCollection$Cell.content, which the rewrite rule is manipulating. Obviously, this can lead to incorrect rewrite result but the fix is not so straightforward on first sight. We cannot simply declare the field content as volatile or atomic to eliminate the race; this only hides the real problem: mutable terms are not supposed to be shared among multiple threads/rewriters and, thus, should always be copied before rewriting! However, class KAbstractRewriteMachine fails to follow this principle and generates instructions that are incorrect in the context of concurrency. It turns out that this bug is introduced in the process of parallelizing the K framework using Nailgun.

Summary

Our evaluation has proven to be very effective: numerous data races that are hidden for months are finally fixed, while many of them would be really hard, if not impossible, to identify by manual code inspection. In the process, we also learn a few facts about developing concurrent applications. First of all, it is best for people to use RV-Predict from early stages in their project and continuously, so that they fix the races as they occur in their code. If you only use it at the end and see tens of races, you feel demoralized as a developer to spend hours and hours to fix all of them, so you end up only fixing those that gave you non-deterministic bugs in the past. As an extra-additional benefit, using RV-Predict continuously as you develop and test your software enables you learn and understand concurrency in your software better than before. For example, most of the races found in our evaluation are introduced in the process of parallelizing the K framework. Very often, developers do not have a clear view from the very start regarding which parts of the code may become shared between multiple threads. Look at issue #1658 for another example. This time the data race manifested in a pending pull request as non-deterministic ConcurrentModificationExceptions. Thanks to RV-Predict, the root cause is identified immediately and hours of debugging are saved. Finally, races are often in off-the-shelf components (e.g., Nailgun, Spoofax, and SDF) that people use in their applications, even in ones that have been used for some time and by many. Fixing bugs in those is actually even more important, because of that.

In the following posts in this series, we will continue our journey of hunting data races in some of the Java big-name applications (e.g., Eclipse, Tomcat, Hadoop, etc) using RV-Predict. Stay tuned!