Testing ERC-20 Tokens Part 2: Advancing Benchmarking with Mutation Testing

Posted on January 29th, 2024 by ERCx

Posted in ERCx, Smart Contracts, Tooling


Abstract

Runtime Verification and Certora partnered to compare and evaluate a set of testing tools for DeFi applications. In Part 1 of this blog post, we introduced the testing tools. In Part 2, we focus on the bug-detection capabilities of these tools. Specifically, we evaluate and compare them against each other through mutation testing with Gambit.

What is ERC-20?

ERC-20 is a standard that enables interoperability across different tokens. It defines a set of functions and events that a token must implement to be considered an ERC-20 token. The six core functions are described below:

The totalSupply() function returns the total number of tokens in circulation.
The balanceOf() function allows users to check the balance of a specific account.
The transfer() function enables the transfer of tokens from one account to another.
The transferFrom() function allows a third party to transfer tokens on behalf of an account that has approved it to do so.
The approve() function allows an account to approve another account to spend a certain number of tokens.
The allowance() function checks the amount of tokens an approved account can spend on behalf of another account.
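To make the six functions concrete, here is a minimal in-memory sketch in Python. It is an illustration of the semantics described above, not an ERC-20 implementation: the class name ToyToken and the explicit caller argument (which stands in for msg.sender) are assumptions of this sketch, and events are omitted.

```python
class ToyToken:
    """Illustrative in-memory model of the six core ERC-20 functions.

    Assumptions of this sketch: `caller` stands in for msg.sender, values
    are non-negative integers, and events are not modeled.
    """

    def __init__(self, initial_holder: str, supply: int) -> None:
        self._total_supply = supply
        self._balances = {initial_holder: supply}
        self._allowances = {}  # (owner, spender) -> remaining allowance

    def total_supply(self) -> int:
        return self._total_supply

    def balance_of(self, account: str) -> int:
        return self._balances.get(account, 0)

    def allowance(self, owner: str, spender: str) -> int:
        return self._allowances.get((owner, spender), 0)

    def approve(self, caller: str, spender: str, value: int) -> bool:
        self._allowances[(caller, spender)] = value
        return True

    def transfer(self, caller: str, to: str, value: int) -> bool:
        self._move(caller, to, value)
        return True

    def transfer_from(self, caller: str, owner: str, to: str, value: int) -> bool:
        remaining = self.allowance(owner, caller)
        if remaining < value:
            raise ValueError("insufficient allowance")
        # Move first: if this raises, the allowance is left untouched.
        self._move(owner, to, value)
        self._allowances[(owner, caller)] = remaining - value
        return True

    def _move(self, sender: str, to: str, value: int) -> None:
        balance = self.balance_of(sender)
        if balance < value:
            raise ValueError("insufficient balance")
        self._balances[sender] = balance - value
        # Read the recipient balance after the debit so self-transfers work.
        self._balances[to] = self.balance_of(to) + value


token = ToyToken("alice", 100)
token.transfer("alice", "bob", 30)                  # alice pays bob directly
token.approve("alice", "carol", 50)                 # alice lets carol spend 50
token.transfer_from("carol", "alice", "dave", 20)   # carol spends 20 of it
```

After these calls, alice holds 50 tokens, bob 30, and dave 20, carol's remaining allowance is 30, and the total supply is unchanged.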

The ERCx Test Suite for ERC-20

ERCx is a framework for evaluating ERC tokens using parameterized fuzz testing (see Part 1 of this blog post). At the time of writing, the ERCx test suite for ERC-20 offers 157 tests, making it the most complete test suite available. For more details about ERCx, please refer to the ERCx website.

While the ERCx team thoroughly reviewed and exercised the test suite on numerous tokens, we needed an objective, systematic, and automated approach to 1) the qualitative evaluation of the test suite and 2) the discovery of possibilities for improvement.

Quality Assurance using Mutation

Mutation testing is a well-studied technique for evaluating and improving the efficacy of a test suite by measuring how many artificially introduced faults it can detect. Studies show that these artificial faults, although typically simpler than real ones, are in practice a good indicator of a test suite's ability to detect real faults ¹ ² ³. A mutation testing tool generates faulty versions of a program, called “mutants”, and the test suite is run on each of them. To a first approximation, a mutant that passes all the tests exposes a gap in the test suite and hints at a test case to add. The percentage of mutants caught can then serve as an efficacy score for the test suite.

Not every mutant is useful. A mutant can be redundant (e.g., semantically equivalent to another mutant) or equivalent to the original program. Techniques such as program equivalence checking or trivial compiler equivalence can detect such mutants. A good mutant generator strives to produce few of them, since they carry no meaningful information about the efficacy of a test suite.
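The idea can be sketched on a toy example in Python. Everything below is hypothetical and chosen for illustration: a one-line "program", four hand-written mutants (one of them equivalent to the original), and two small test suites of different strength.

```python
from typing import Callable, Dict, List, Tuple

Program = Callable[[int, int], bool]


def transfer_ok(balance: int, value: int) -> bool:
    """Original program: a transfer is allowed iff the balance covers it."""
    return balance >= value


# Hand-written mutants of transfer_ok (hypothetical, for illustration).
MUTANTS: Dict[str, Program] = {
    "ge_to_gt": lambda balance, value: balance > value,
    "ge_to_le": lambda balance, value: balance <= value,
    "always_true": lambda balance, value: True,
    # Equivalent mutant: behaves exactly like the original on all inputs.
    "not_lt": lambda balance, value: not (balance < value),
}


def weak_suite(program: Program) -> bool:
    """Passes iff the program accepts one ordinary transfer."""
    return program(10, 3) is True


def strong_suite(program: Program) -> bool:
    """Also checks the boundary case and an overdraft."""
    return (
        program(10, 3) is True
        and program(10, 10) is True
        and program(3, 10) is False
    )


def run(suite: Callable[[Program], bool],
        mutants: Dict[str, Program]) -> Tuple[float, List[str]]:
    """Return the mutation score and the names of the surviving mutants."""
    survivors = [name for name, mutant in mutants.items() if suite(mutant)]
    score = (len(mutants) - len(survivors)) / len(mutants)
    return score, survivors


weak_score, weak_survivors = run(weak_suite, MUTANTS)
strong_score, strong_survivors = run(strong_suite, MUTANTS)
```

Here the weak suite kills only one of the four mutants, while the strong suite kills three. The mutant that survives the strong suite is the equivalent one, which no test can kill; this is why such mutants are worth filtering out before computing a score.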

Gambit is a mutation generator for Solidity developed by Certora. You can read more about Gambit in this blog post. In brief, Gambit traverses the abstract syntax tree (AST) of a Solidity program to identify program points that can be mutated. Gambit uses an extensible set of mutation operators to mutate each eligible program point. Gambit offers a declarative configuration language for customizing the types of mutation operators to apply, the functions and contracts to mutate, and the number of mutations. The mutants generated by Gambit can be used to evaluate and improve test suites and formal specifications.
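As an analogy only, the traverse-then-mutate idea can be sketched with Python's standard ast module on a Python snippet. This is not Gambit's implementation and does not reflect its operator set: the single operator below (swapping a comparison operator) and the sample function can_spend are assumptions made for illustration.

```python
import ast
import copy
from typing import List

# Sample program to mutate (hypothetical).
SOURCE = """
def can_spend(balance, value):
    if value == 0:
        return True
    return balance >= value
"""

# One illustrative mutation operator: swap a comparison operator.
REPLACEMENTS = {ast.GtE: ast.Gt, ast.Lt: ast.LtE, ast.Eq: ast.NotEq}


def generate_mutants(source: str) -> List[str]:
    """Return one mutated source per eligible comparison (needs Python 3.9+)."""
    tree = ast.parse(source)
    # Step 1: traverse the AST and record the positions of mutation points.
    points = [
        index
        for index, node in enumerate(ast.walk(tree))
        if isinstance(node, ast.Compare) and type(node.ops[0]) in REPLACEMENTS
    ]
    # Step 2: for each point, mutate a fresh copy so mutants stay independent.
    mutants = []
    for point in points:
        mutated = copy.deepcopy(tree)
        # The copy has the same shape, so the traversal order is the same.
        node = list(ast.walk(mutated))[point]
        node.ops[0] = REPLACEMENTS[type(node.ops[0])]()
        mutants.append(ast.unparse(mutated))
    return mutants


mutants = generate_mutants(SOURCE)
```

Each generated mutant differs from the original in exactly one comparison, which keeps the link between a surviving mutant and the missing test easy to see.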

Certora has built an integration of Gambit with the Certora Prover to help assess the quality of formal specifications. In this work, we did a similar integration with the ERC-20 test suite in ERCx to evaluate the efficacy of ERCx tests.

Our Method

In this section, we report on the generic method we followed to assess the efficacy of the ERCx test suite. The method proceeds in several steps and is illustrated in the figure above.

1. Smart Contract Selection
We begin by selecting a well-established and verified ERC-20 smart contract as our baseline. This contract adheres to the ERC-20 standard and serves as the foundation for generating mutants. The Solidity code for this smart contract is considered reliable and thoroughly tested.

2. Mutation Generation with Gambit
Utilizing Gambit, we systematically introduce faults into the selected ERC-20 smart contract. Gambit incorporates an extensible set of mutation operators to diversify the types of mutations introduced.

3. Optimizing the set of Mutants
Before subjecting mutants to testing tools, we optimize the set of mutants. This step involves refining the generated mutants to ensure relevance and effectiveness in evaluating the bug-detection capabilities of the testing tools. Redundant or semantically equivalent mutants that do not contribute meaningfully to the evaluation are removed.

4. Testing Tool Evaluation
With the optimized mutants at hand, we run the testing tools introduced in Part 1 of this blog series on each of them.

5. Detection Analysis
For each mutant, we carefully analyze the results produced by the testing tools. An undetected mutant implies a potential flaw in the tool's bug-detection capabilities. The goal is to identify the number of undetected mutants for each testing tool, serving as a quantitative metric for their effectiveness.

6. Test Suite Completion
For each undetected mutant, we examine the underlying mutation, specifically the syntactic change to the original contract and the function it affects. Using that information, we develop a test case that detects the mutant.

7. Overall Assessment
By aggregating the undetected mutants across all testing tools, we obtain a comprehensive view of their relative performance. A higher count of undetected mutants indicates weaker bug-detection capability, while a lower count indicates a more robust and reliable tool.

To improve a test suite, steps 4, 5, and 6 can be repeated until all mutants are detected, and repeated again whenever the base program changes or the scope of the test suite grows.

To evaluate the comparative efficacy of testing tools when subjected to mutation testing, we run the above steps, excluding step 6.

We note that the method is generic: it depends neither on the set of evaluated tools nor on the standard.
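The detection analysis and overall assessment (steps 5 and 7) can be sketched in a few lines of Python. The tool names, mutant identifiers, and test names below are placeholders, not results from this study; the only rule taken from the method is that a mutant counts as detected when at least one test fails on it.

```python
from typing import Dict, List

# results[tool][mutant_id] lists the tests that failed on that mutant.
# All names and data here are hypothetical placeholders.
Results = Dict[str, Dict[str, List[str]]]

RESULTS: Results = {
    "tool_a": {"m1": ["test_transfer"], "m2": [], "m3": ["test_approve"]},
    "tool_b": {"m1": [], "m2": [], "m3": ["test_abi"]},
}


def undetected(results: Results) -> Dict[str, List[str]]:
    """Step 5: per tool, the mutants on which no test failed."""
    return {
        tool: sorted(m for m, failing in per_mutant.items() if not failing)
        for tool, per_mutant in results.items()
    }


def ranking(results: Results) -> List[str]:
    """Step 7: tools ordered from fewest to most undetected mutants."""
    missed = undetected(results)
    return sorted(missed, key=lambda tool: len(missed[tool]))


def missed_by_all(results: Results) -> List[str]:
    """Mutants that no tool detects: natural candidates for new tests."""
    missed = undetected(results)
    common = set.intersection(*(set(ids) for ids in missed.values()))
    return sorted(common)


per_tool = undetected(RESULTS)
order = ranking(RESULTS)
gaps = missed_by_all(RESULTS)
```

Because the functions only consume a table of failing tests per mutant, the same sketch applies unchanged to other tools or other standards.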

Comparing Test Suites Using Mutation Testing

To conduct our experimental evaluation, we selected the ERC-20 smart contract from the OpenZeppelin library, a well-tested and widely recognized implementation of the ERC-20 standard. This choice ensures a robust baseline with a contract that has passed tests from testing tools, including ERCx, the OpenZeppelin test suite, and Slither.

We utilized Gambit, configured with the following parameters:

{
    "filename": "ERC20.sol",
    "solc": "solc8.20",
    "contract": "ERC20",
    "functions": [
        "name",
        "symbol",
        "decimals",
        "totalSupply",
        "balanceOf",
        "transfer",
        "_transfer",
        "_spendAllowance",
        "_update",
        "allowance",
        "approve",
        "_approve",
        "transferFrom"
    ]
}

In this configuration, mutants were generated in every public function and internal function used in public functions. Notably, we excluded mutation of the _mint and _burn functions since they are not invoked in public functions, making the mutants undetectable in our testing process. After optimization, we obtained a set of 48 mutants, having removed 8 mutants from the _update function due to their undetectable nature.

Below is an example of a function and its mutated version:

Original Function:

function _update(address from, address to, uint256 value) internal virtual {
    if (from == address(0)) {
        _totalSupply += value;
    } else {
        uint256 fromBalance = _balances[from];
        if (fromBalance < value) {
            revert ERC20InsufficientBalance(from, fromBalance, value);
        }
        unchecked {
            _balances[from] = fromBalance - value;
        }
    }
    ...
}

Mutated Function:

function _update(address from, address to, uint256 value) internal virtual {
    // MUTATION HERE
    if (false) {
        _totalSupply += value;
    } else {
        uint256 fromBalance = _balances[from];
        if (fromBalance < value) {
            revert ERC20InsufficientBalance(from, fromBalance, value);
        }
        unchecked {
            _balances[from] = fromBalance - value;
        }
    }
    ...
}

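Here, the mutation replaces the condition on the `from` address with false, so the branch that handles the zero address as sender is never taken. However, the only path from a public function to _update goes through the internal function _transfer, which already rejects the zero address as sender. The two versions therefore behave identically on every reachable call, and this mutant can never be detected.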
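This argument can be checked on a small Python model. The sketch below is a simplification under stated assumptions: ZERO stands in for address(0), the helper _credit stands in for the part of _update elided in the listing, and transfer models only the zero-sender check of _transfer. It compares the original and mutated _update over a small grid of calls, once through transfer and once called directly.

```python
from itertools import product

ZERO = "0x0"  # stands in for address(0)


class Revert(Exception):
    """Models a Solidity revert."""


def _debit(state, account, value):
    balance = state["balances"].get(account, 0)
    if balance < value:
        raise Revert("insufficient balance")
    state["balances"][account] = balance - value


def _credit(state, account, value):
    # Assumption: stands in for the part of _update elided in the listing.
    state["balances"][account] = state["balances"].get(account, 0) + value


def update_original(state, sender, to, value):
    if sender == ZERO:
        state["total_supply"] += value
    else:
        _debit(state, sender, value)
    _credit(state, to, value)


def update_mutant(state, sender, to, value):
    if False:  # MUTATION: the condition is replaced by false
        state["total_supply"] += value
    else:
        _debit(state, sender, value)
    _credit(state, to, value)


def transfer(state, update, sender, to, value):
    """Models _transfer: rejects the zero sender, then delegates to update."""
    if sender == ZERO:
        raise Revert("invalid sender")
    update(state, sender, to, value)


def outcome(call, *args):
    """Run a call on a fresh state; return 'revert' or the final state."""
    state = {"total_supply": 10, "balances": {"alice": 10}}
    try:
        call(state, *args)
    except Revert:
        return "revert"
    # Drop zero balances: a missing entry and a zero entry mean the same.
    nonzero = {k: v for k, v in state["balances"].items() if v}
    return (state["total_supply"], nonzero)


CALLS = list(product([ZERO, "alice", "bob"], ["alice", "bob"], [0, 5, 10, 11]))

# Calls on which the two versions disagree when reached through transfer.
differ_via_transfer = [
    c for c in CALLS
    if outcome(transfer, update_original, *c) != outcome(transfer, update_mutant, *c)
]
# Calls on which they disagree when _update is invoked directly.
differ_direct = [
    c for c in CALLS
    if outcome(update_original, *c) != outcome(update_mutant, *c)
]
```

In this model the two versions never disagree when reached through transfer. They disagree only on direct calls whose sender is the zero address, which is exactly the case that the guard in transfer makes unreachable.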

We subjected our optimized set of 48 mutants to testing using ERCx, the OpenZeppelin test suite, and Slither. Each testing tool was employed to detect the presence of faults in the mutated ERC-20 contract.

Experimental Results

The table below summarizes our findings.

Evaluation Criteria

We used the following criteria for our comparison:

Number of tests: the number of test cases in the test suite.
Parametric tests: the ability of the test cases to be parametric; this influences the number of scenarios under which a contract is tested.
Detected mutants: the number of detected mutants; a mutant is considered detected if at least one test fails.
ERC-20 compatibility: the ability of the test suite to test different ERC-20s (and in particular, already deployed contracts).
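To illustrate why the "Parametric tests" criterion matters, here is a small Python sketch. It is not taken from any of the evaluated test suites: the faulty variant, the fixed test, and the exhaustive grid (used here in place of fuzzing) are all assumptions made for illustration.

```python
from itertools import product
from typing import Callable, Dict

Balances = Dict[str, int]
Transfer = Callable[[Balances, str, str, int], Balances]

ACCOUNTS = ["alice", "bob"]


def transfer(balances: Balances, sender: str, to: str, value: int) -> Balances:
    """Reference behaviour: debit the sender, then credit the recipient."""
    if balances[sender] < value:
        raise ValueError("insufficient balance")
    new = dict(balances)
    new[sender] -= value
    new[to] += value
    return new


def transfer_mutant(balances: Balances, sender: str, to: str, value: int) -> Balances:
    """Hypothetical faulty variant: both updates use the stale balances."""
    if balances[sender] < value:
        raise ValueError("insufficient balance")
    new = dict(balances)
    new[sender] = balances[sender] - value
    new[to] = balances[to] + value  # overwrites the debit when sender == to
    return new


def fixed_test(impl: Transfer) -> bool:
    """Non-parametric test: a single hard-coded scenario."""
    return impl({"alice": 10, "bob": 0}, "alice", "bob", 4) == {"alice": 6, "bob": 4}


def parametric_test(impl: Transfer) -> bool:
    """Parametric test: the total supply is preserved for every input."""
    for sender, to, value in product(ACCOUNTS, ACCOUNTS, range(11)):
        before = {"alice": 10, "bob": 10}
        after = impl(before, sender, to, value)
        if sum(after.values()) != sum(before.values()):
            return False
    return True


fixed_detects = not fixed_test(transfer_mutant)
parametric_detects = not parametric_test(transfer_mutant)
```

The fixed test passes on the faulty variant because it never exercises a self-transfer, whereas the parametric test covers that scenario and fails on it. Both tests pass on the reference behaviour.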

Results

We observe that ERCx detects every mutant. OpenZeppelin’s test suite also detects most of them, but its tests are not parametric, which implies lower coverage of the execution paths of the tested tokens. Slither’s test suite detects none of the mutants, because the tool is limited to testing the ABI of smart contracts: it only checks that the function signatures are present and correct. Since the mutants generated by Gambit do not modify signatures, none of them is detected.

There are many ways to measure the efficacy of a test suite; the mutation score (the percentage of mutants killed) is one such metric. It is important to remember that, while such metrics help assess and improve test suites (by adding missing tests that catch the live mutants), a 100% mutation score does not guarantee that the test suite will catch all bugs. For example, mutation tools like Gambit implement a set of important mutation operators that introduce certain types of artificial bugs. While this set is extensible, it will not produce every mutation that reflects a real bug. Mutation tools typically operate under a fault model, that is, the assumption that the errors a programmer could introduce are subsumed by the supported mutation operators. We encourage users to exercise caution when relying on such metrics. Finally, we note that mutants are generated from a base contract, which should therefore be fault-free and recognized as a standard.

Conclusion

Mutation testing is a well-studied topic, with decades of research suggesting best practices for scalable and efficient adoption in real-world scenarios. While many tools for applying mutation testing to smart contract testing have been proposed, prior work in this area has not comprehensively studied large, real-world test suites to understand how they compare. This work reviewed several well-known test suites for ERC-20 smart contracts, guided by their ability to catch mutants. We hope that the findings in this work will help developers design better test suites and pave the way for systematically evaluating DeFi test suites.

Discussion and Prospects

Mutants can guide the development of test suites and formal specifications. In this blog post, we showed how this is already valuable for ERC-20 tokens. We will continue to apply mutation testing to other standards and to use different base contracts to generate mutants. We also welcome comments and suggestions from the rest of the community on improving test suites for smart contracts.

The community would benefit from a publicly available set of mutants for common standards like ERC-20, ERC-4626, etc. (either generated by tools like Gambit or manually written). Stay tuned!

1. J. H. Andrews, L. C. Briand, and Y. Labiche, "Is mutation an appropriate tool for testing experiments?," Proceedings of the 27th International Conference on Software Engineering (ICSE 2005), St. Louis, MO, USA, 2005, pp. 402-411, doi: 10.1109/ICSE.2005.1553583.
2. R. Just, D. Jalali, L. Inozemtseva, M. D. Ernst, R. Holmes, and G. Fraser, "Are mutants a valid substitute for real faults in software testing?," Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014), New York, NY, USA, 2014, pp. 654-665, doi: 10.1145/2635868.2635929.
3. G. Petrović, M. Ivanković, G. Fraser, and R. Just, "Please fix this mutant: How do developers resolve mutants surfaced during code review?," 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Melbourne, Australia, 2023, pp. 150-161, doi: 10.1109/ICSE-SEIP58684.2023.00019.