Last February, I went to #OffensiveCon20 and, as you might expect, it was awesome. The talks were great, but the real gem was the CodeQL workshop held on the second day of the event. That session inspired us to start researching the potential of CodeQL and how it can be used for variant analysis. During that research, we found 7 new vulnerabilities in FFmpeg, a popular open-source audio and video streaming and conversion framework that is widely used in software including Google Chrome, VLC Player and more.
In this blog we’ll review CodeQL and its key features as well as show how I used it to find vulnerable calls to memcpy in several open-source projects.
TL;DR
Hunting bugs in open-source projects became a lot easier and more fun with CodeQL. We wrote a custom CodeQL query that locates potentially vulnerable memcpy calls and found 7 new vulnerabilities in FFmpeg.
CodeQL: A Very (very) Short Introduction
CodeQL is a framework developed by Semmle and is free to use on open-source projects. It lets a researcher perform variant analysis to find security vulnerabilities by querying code databases generated using CodeQL. CodeQL supports many languages such as C/C++, C#, Java, JavaScript, Python, and Golang.
Once we have generated a code database, we can run premade queries developed by Semmle and the community, or write and run custom queries of our own.
Generating A Code Database
In order to review examples of what we can do using CodeQL, we first need a code database. Obtaining a code database can be done by downloading a code database generated by Semmle from here, or by generating a code database on our own.
Let’s dive in by creating a small code database from the code below:
codeql database create ~/semmle/databases/example.db --command="clang code.c -o example" --language=cpp
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int size1 = 5;
    int size2 = size1 + 5;
    char *dst = malloc(size1); // allocation size is 5 bytes
    char *src = malloc(size2); // allocation size is 10 bytes

    if (strcmp(argv[1], "change") == 0)
        size1 = 15;

    memcpy(dst, src, size1);
}
Figure 1 – code.c
We will use the database generated from this simple C program in the next three examples.
The Query Structure
CodeQL’s syntax is very similar to SQL, and is composed of these main parts:
- Imports – At the beginning of the query we denote which CodeQL libraries we wish to import. For example, to use the basic features of CodeQL for C/C++, we import cpp.
- from – We must define the CodeQL variables and their types. Each CodeQL variable represents an object from the CodeQL library, e.g., Function, FunctionCall, VariableAccess, Variable and Expr.
- where – Once we’ve defined CodeQL variables, we can then construct the predicates to be applied to them. Although this part is optional, it is also the core of the query.
- select – Under this clause, we set how the output is going to look. We can bind CodeQL variables and present them in different ways, usually in a table.
Variable
A variable is a name that holds one or more values. When referring to a variable in CodeQL, we mean the declaration of that particular variable. The same rules of variable declaration apply as in any procedural programming language: one declaration of a given variable per scope.
Using a simple query, we can extract all the variable declarations from the code above:
import cpp

from Variable var
select var, var.getLocation() as location
Figure 2 – Extracting variable declarations with CodeQL
Running that query against the code database we created will produce the following results:
Figure 3 – Variable declarations in vscode
The results are displayed in VS Code because we are using the CodeQL extension for VS Code. We highly recommend setting up a CodeQL workspace for VS Code, even though it is possible to use only the CodeQL CLI tool.
Variable Access
We can also easily find all the accesses to a variable. The query below will present all the access locations to all the variables. By adding a condition to the query, we could look for an access to a specific variable.
import cpp

from VariableAccess var_access
select var_access, var_access.getLocation() as access_location
Figure 4 – Extracting the access locations to all the variables
And the results:
Figure 5 – Variable accesses
We can see that the left column shows the name of the variable we access and the right column the exact location of the access.
Do keep in mind the difference between variable and variable access; while a variable represents the declaration of that variable, variable access represents every access to that variable. Therefore, for example, we see that the variable size1 appears multiple times in the left column since it is accessed several times.
Local Taint Tracking
One of the many reasons CodeQL is such a powerful tool is local/global taint tracking.
CodeQL builds a graph that represents the flow of data through the code. Using that graph, it can track the flow of any variable and tell which variable affected another, and where.
In the code above (figure 1), size1 = 5 and size2 = size1 + 5, meaning size1 tainted size2. Using this logic, we can see which variable affects other expressions in our code:
import cpp
import semmle.code.cpp.dataflow.TaintTracking

from Variable source_var, VariableAccess source_access, Expr sink
where
  // Find all the variable accesses that affect any expression in our code
  TaintTracking::localExprTaint(source_access, sink) and
  // Linking the access to the variable itself
  source_access.getTarget() = source_var
select source_var, source_var.getLocation() as source_location, sink,
  sink.getLocation() as sink_location
Figure 6 – A simple taint tracking example
And the results:
Figure 7 – Taint tracking for variable accesses
Under the column source_var we see all the variable accesses that affect the expression under column sink. For example, in row 7 the access to size1 (size1 = 5) affects the call to memcpy in line 13.
So now we see which variable affects which and where, exactly.
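To make the idea of "which variable affects which" more concrete, here is a toy model in C. It is only a sketch of the concept (reachability over "is used to compute" edges), not how CodeQL implements taint tracking. The node names and the edges are our own simplification of figure 1, and every type and function name in it is hypothetical.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_NODES 8

/* Toy flow graph: edge[a][b] means "the value of a is used to compute b". */
typedef struct {
    bool edge[MAX_NODES][MAX_NODES];
} FlowGraph;

static void flow_init(FlowGraph *g) { memset(g, 0, sizeof *g); }

static void flow_add(FlowGraph *g, int from, int to) { g->edge[from][to] = true; }

/* Depth-first search: does taint that starts at `source` reach `sink`? */
static bool flow_taints_from(const FlowGraph *g, int source, int sink, bool *seen)
{
    if (source == sink)
        return true;
    seen[source] = true;
    for (int next = 0; next < MAX_NODES; next++) {
        if (g->edge[source][next] && !seen[next] &&
            flow_taints_from(g, next, sink, seen))
            return true;
    }
    return false;
}

static bool flow_taints(const FlowGraph *g, int source, int sink)
{
    bool seen[MAX_NODES] = { false };
    return flow_taints_from(g, source, sink, seen);
}

/* A simplified (assumed) set of nodes for the program in figure 1. */
enum { SIZE1, SIZE2, MALLOC_SRC_ARG, MEMCPY_LEN_ARG };

static FlowGraph figure1_graph(void)
{
    FlowGraph g;
    flow_init(&g);
    flow_add(&g, SIZE1, SIZE2);          /* size2 = size1 + 5      */
    flow_add(&g, SIZE2, MALLOC_SRC_ARG); /* malloc(size2)          */
    flow_add(&g, SIZE1, MEMCPY_LEN_ARG); /* memcpy(dst, src, size1) */
    return g;
}
```

In this model, flow_taints(&g, SIZE1, MALLOC_SRC_ARG) holds because size1 reaches the argument through size2, while flow_taints(&g, SIZE2, SIZE1) does not, since flow only goes forward.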
Vulnerable Memcpy
After learning and experimenting a bit with CodeQL, our goal was to write a new query that will find heap-based write buffer overflows caused by memcpy.
The following section describes our thought process, which eventually led us to write a single query that achieves our goal. Since that query is a bit long and complex, we decided to review each of its logical parts separately to make it easier to understand.
Finding the Calls to Memcpy
Obviously in order to find a vulnerable memcpy call, we first need to find all the calls to memcpy.
In addition to finding all the memcpy calls, we extract the size of that memcpy (the 3rd argument) which will help us distinguish safe from unsafe memcpy calls.
We will create a new code database from the following code:
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int dst_size = 5;
    int src_size = 10;
    char *dst = malloc(dst_size);
    char *src = malloc(src_size);

    memcpy(dst, src, src_size);
}
Figure 8 – An example of a simple, unsafe call to memcpy
After creating the database, we will use the following query:
import cpp
import Utils

from FunctionCall memcpy, Expr size
where
  memcpy.getTarget().hasName("memcpy") and
  // Example: memcpy(dst, src, a * b) => size = a || size = b
  size = memcpy.getArgument(2).getAChild*() and
  // Example: memcpy(dst, src, a * 5) => size = a || size = 5, the number 5 is not a variable!
  if exists(VariableAccess va | size = va)
  then isNumber(size.(VariableAccess).getTarget())
  else size = size
select memcpy.getLocation() as location, size as value
Figure 9 – Extracting the calls to memcpy
And the results:
Figure 10 – A single call to memcpy, as expected
The “location” column shows the exact location of a memcpy call and the corresponding cell under the “value” column shows all the variables/values that affect how many bytes to copy. As expected, the results show a single memcpy call with src_size bytes to copy.
From an Allocation Function to Memcpy 1st Argument
Given a call to memcpy, we need to find where the memory block provided to it as the destination argument (1st argument) was allocated. The allocation functions we consider are:
- malloc
- calloc
- realloc
- Custom methods that wrap the above
To find the allocation, we can use taint tracking, the same way we showed earlier:
import cpp
import Utils
import semmle.code.cpp.dataflow.TaintTracking

// CallAllocationExpr represents a call to malloc/calloc/realloc/etc./custom wrappers
from CallAllocationExpr alloc, Expr memcpy_dst, FunctionCall memcpy, Expr size
where
  // Setting memcpy, memcpy_dst
  memcpy.getTarget().hasName("memcpy") and
  memcpy_dst = memcpy.getArgument(0) and
  // Every allocation that flows to the 1st argument in a memcpy call
  TaintTracking::localExprTaint(alloc, memcpy_dst) and
  // malloc(a * b) => size = a || size = b
  size = alloc.getAnArgument().getAChild*() and
  // malloc(a * 5) => size = a || size = 5, the number 5 is not a variable!
  if exists(VariableAccess va | size = va)
  then isNumber(size.(VariableAccess).getTarget())
  else size = size
select memcpy.getLocation() as memcpy_location, alloc.getLocation() as allocation_location, size
Figure 11 – Finding the allocation function that allocated the memory block provided as the destination argument in a memcpy
And the results:
Figure 12 – Finding the destination buffer size in memcpy
- “memcpy_location” – the location of the memcpy.
- “allocation_location” – the location of the allocation function call that allocated the memory for the destination argument of the matched memcpy.
- “size” – the variable/value that affected the size of the matched allocation function.
Notice that we copy src_size bytes from src to dst even though dst points to an allocation of only dst_size bytes.
Since src_size and dst_size are different variables, this puts this memcpy call on our list of potentially vulnerable call sites – and it is a great way of surfacing these weaknesses at a fairly low level of effort!
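For contrast with figure 8, here is a minimal sketch of a copy whose length is checked against the destination allocation before memcpy runs. The helper name copy_checked and its return convention are our own assumptions for illustration; it is not part of the query or of any project we analyzed.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy n bytes from src to dst only if they fit
 * in the dst_size bytes that were allocated for dst.
 * Returns 0 on success and -1 if the copy was refused.
 */
static int copy_checked(void *dst, size_t dst_size, const void *src, size_t n)
{
    if (dst == NULL || src == NULL || n > dst_size)
        return -1; /* would write past the end of dst */
    memcpy(dst, src, n);
    return 0;
}
```

With the sizes from figure 8 (a 5-byte destination and a 10-byte copy), copy_checked refuses the copy, whereas a 5-byte copy into the same destination succeeds.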
Things Are Getting Complicated – Affecting Variables
The following code shows a slightly more complicated scenario, in which the allocation size is affected by the memcpy length variable.
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int memcpy_size = 5;
    int src_size = 10;
    int dst_size = memcpy_size * src_size;
    char *dst = malloc(dst_size);
    char *src = malloc(src_size);

    memcpy(dst, src, memcpy_size);
}
Figure 13 – Simply allocating and copying
Here, “dst” points to an allocation of dst_size bytes, and we copy memcpy_size bytes from “src” to “dst”.
However, this memcpy call is not vulnerable: dst_size is derived from memcpy_size and src_size (5 * 10 = 50), so in this case dst_size > memcpy_size.
The query below uses taint tracking to find all the variables that dst_size is derived from. A similar query can find all the variables that affect the value of memcpy_size.
import cpp
import Utils
import semmle.code.cpp.dataflow.TaintTracking

// CallAllocationExpr represents a call to malloc/calloc/realloc/etc./custom wrappers
from
  CallAllocationExpr alloc, Expr memcpy_dst, FunctionCall memcpy, Expr size,
  VariableAccess affecting_size
where
  // Setting memcpy, memcpy_dst
  memcpy.getTarget().hasName("memcpy") and
  memcpy_dst = memcpy.getArgument(0) and
  // Every allocation that flows to the 1st argument in a memcpy call
  TaintTracking::localExprTaint(alloc, memcpy_dst) and
  // malloc(a * b) => size = a || size = b
  size = alloc.getAnArgument().getAChild*() and
  // malloc(a * 5) => size = a || size = 5, the number 5 is not a variable!
  (
    if exists(VariableAccess va | size = va)
    then isNumber(size.(VariableAccess).getTarget())
    else size = size
  ) and
  // Setting affecting_size -> all the variables that affect the size of the allocation
  TaintTracking::localExprTaint(affecting_size, size)
select memcpy.getLocation() as memcpy_location, alloc.getLocation() as allocation_location, size,
  affecting_size
Figure 14 – Extracting the variables that affect the allocation size
The query above is almost identical to the previous query (figure 11) with a small addition of a 2nd taint tracking check.
And of course, the results:
Figure 15 – Affecting variables on the destination buffer in a memcpy
As expected, we see that dst_size is derived from memcpy_size, src_size, and dst_size.
At this point, we might assume that the memcpy from the snippet above is not vulnerable, but we shouldn’t. Consider the following variation:
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int memcpy_size = 5;
    int src_size = 10;
    int dst_size = memcpy_size * src_size;
    char *dst = malloc(dst_size);
    char *src = malloc(src_size);

    // Changing memcpy_size to be bigger than dst_size
    memcpy_size = 100;
    memcpy(dst, src, memcpy_size);
}
Figure 16 – Why we need Global Value Numbering
Using the previous query to identify which variables affect dst_size is not enough; we need memcpy_size to have the same value at both the definition of dst_size and at the call to memcpy.
To do so, we will use a CodeQL feature named “Global Value Numbering” (GVN). According to Semmle: “The global value numbering library provides a mechanism for identifying expressions that compute the same value at runtime.” Value numbering is extremely powerful since it allows us to determine whether two expressions are known to be equal. We can now quickly infer which calls to memcpy are safe, and which should be further analyzed.
However, we cannot use this mechanism to determine whether one expression is bigger or smaller than another; it only tells us whether two expressions are known to compute the same value. Let’s edit our query and add some global value numbering:
import cpp
import Utils
import semmle.code.cpp.dataflow.TaintTracking
import semmle.code.cpp.valuenumbering.GlobalValueNumbering

// CallAllocationExpr represents a call to malloc/calloc/realloc/etc./custom wrappers
from
  CallAllocationExpr alloc, Expr memcpy_dst, FunctionCall memcpy, VariableAccess size,
  VariableAccess affecting_size
where
  // Setting memcpy, memcpy_dst
  memcpy.getTarget().hasName("memcpy") and
  memcpy_dst = memcpy.getArgument(0) and
  // Every allocation that flows to the 1st argument in a memcpy call
  TaintTracking::localExprTaint(alloc, memcpy_dst) and
  // malloc(a * b) => size = a || size = b
  size = alloc.getAnArgument().getAChild*() and
  // size should be a variable and a number.
  isNumber(size.getTarget()) and
  // Get the variables that affect the size of the malloc
  TaintTracking::localExprTaint(affecting_size, size) and
  exists(
    // memcpy_length is the variable that affects how many bytes to copy in a memcpy
    VariableAccess memcpy_length |
    memcpy_length = memcpy.getArgument(2).getAChild*() and
    (
      (not memcpy_length.getTarget() = affecting_size.getTarget())
      or
      // Since it's an or expression, this check only matters when both accesses target the same variable
      globalValueNumber(memcpy_length) = globalValueNumber(affecting_size)
    )
  )
select memcpy.getLocation() as memcpy_location, alloc.getLocation() as allocation_location, size,
  affecting_size
Figure 17 – Using “GVN” in a query
We can see that this query is the same as the previous one (figure 14) with a small (but significant) addition. Now, we make sure that if a variable affects both the size of the allocation and the size value in the memcpy call, that variable keeps the same value.
Executing this query against the non-vulnerable code (figure 13) will present these results:
Figure 18 – The awesomeness of GVN (part 1)
But, executing our query against the vulnerable code (figure 16, where we changed memcpy_size right before the memcpy) will produce the following results:
Figure 19 – The awesomeness of GVN (part 2)
As you can see, memcpy_size is no longer reported as affecting dst_size, since memcpy_size does not keep its value up to the call to memcpy.
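The difference between figures 13 and 16 can be illustrated with a toy value-numbering table in C. This is only a sketch of the general idea (identical expressions over identical inputs get the same number, and a reassignment gives a variable a new number). It is not CodeQL's GlobalValueNumbering implementation, and all names in it are hypothetical.

```c
#include <stdbool.h>

#define MAX_EXPRS 32
#define MAX_VARS 8

typedef enum { OP_CONST, OP_MUL } Op;

/* An expression is identified by its operator and operands. */
typedef struct { Op op; long lhs; long rhs; } ExprKey;

typedef struct {
    ExprKey exprs[MAX_EXPRS]; /* value number i names exprs[i]            */
    int expr_count;
    int var_vn[MAX_VARS];     /* current value number of each variable    */
} ValueTable;

static void vt_init(ValueTable *t)
{
    t->expr_count = 0;
    for (int i = 0; i < MAX_VARS; i++)
        t->var_vn[i] = -1; /* -1 means "not assigned yet" */
}

/* Reuse the number of an identical expression, or hand out a new one. */
static int vt_number(ValueTable *t, Op op, long lhs, long rhs)
{
    for (int i = 0; i < t->expr_count; i++) {
        const ExprKey *e = &t->exprs[i];
        if (e->op == op && e->lhs == lhs && e->rhs == rhs)
            return i;
    }
    if (t->expr_count == MAX_EXPRS)
        return -1; /* table full */
    t->exprs[t->expr_count] = (ExprKey){ op, lhs, rhs };
    return t->expr_count++;
}

static void vt_assign_const(ValueTable *t, int var, long value)
{
    t->var_vn[var] = vt_number(t, OP_CONST, value, 0);
}

static void vt_assign_mul(ValueTable *t, int var, int a, int b)
{
    t->var_vn[var] = vt_number(t, OP_MUL, t->var_vn[a], t->var_vn[b]);
}

enum { MEMCPY_SIZE, SRC_SIZE, DST_SIZE };

/*
 * Models figures 13 and 16: returns true when memcpy_size has the same
 * value number where dst_size is computed and where memcpy is called.
 */
static bool length_unchanged(bool reassign_before_copy)
{
    ValueTable t;
    vt_init(&t);
    vt_assign_const(&t, MEMCPY_SIZE, 5);
    vt_assign_const(&t, SRC_SIZE, 10);
    int at_alloc = t.var_vn[MEMCPY_SIZE];
    vt_assign_mul(&t, DST_SIZE, MEMCPY_SIZE, SRC_SIZE);
    if (reassign_before_copy)
        vt_assign_const(&t, MEMCPY_SIZE, 100); /* figure 16 */
    int at_copy = t.var_vn[MEMCPY_SIZE];
    return at_alloc == at_copy;
}
```

Calling length_unchanged(false) models figure 13, where the two accesses share a value number; length_unchanged(true) models figure 16, where the reassignment gives memcpy_size a different number before the copy.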
Guarding the Memcpy
In CodeQL, guards allow us to identify conditions that control the execution of other parts in our program. For instance:
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int size = 5;
    int memcpy_size = 10;
    char *dst = malloc(size);
    char *src = malloc(size);

    if (memcpy_size < size) {
        memcpy(dst, src, memcpy_size);
        return 0;
    } else {
        return -1;
    }
}
Figure 20 – Guarding correctly a memcpy
In this example, the if statement is guarding the memcpy. If the condition holds at runtime, the call to memcpy will occur; otherwise, it will be skipped.
Our previous queries would flag this memcpy as vulnerable, but that is not the case!
Not only does this guard control the memcpy, it also defends the memcpy correctly. The following (insufficiently safe) example explains what “correctly” means:
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int size = 5;
    int memcpy_size = 10;
    char *dst = malloc(size);
    char *src = malloc(size);

    if (memcpy_size > 0) {
        memcpy(dst, src, memcpy_size);
        return 0;
    } else {
        return -1;
    }
}
Figure 21 – Insufficient guard example
Even though this memcpy is guarded as well, the guard itself is checking a condition that is irrelevant when determining whether this memcpy is vulnerable or not.
Since this guard is only checking the lower bound of memcpy_size, memcpy_size could be bigger than size, and this will cause an overflow.
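The difference between the two guards can be written down as plain predicates. The sketch below is our own illustration with hypothetical function names; it only restates the conditions from figures 20 and 21 and checks whether a guard lets an overflowing copy through.

```c
#include <stdbool.h>

/* Mirrors figure 21: only the lower bound of the length is checked. */
static bool lower_bound_guard(long memcpy_size)
{
    return memcpy_size > 0;
}

/* Mirrors figure 20: the length is checked against the allocation size. */
static bool upper_bound_guard(long memcpy_size, long size)
{
    return memcpy_size < size;
}

/* A copy overflows when it writes more bytes than were allocated. */
static bool would_overflow(long memcpy_size, long size)
{
    return memcpy_size > size;
}

/* True when the guard passed and the copy it allows overflows. */
static bool guard_admits_overflow(bool guard_passed, long memcpy_size, long size)
{
    return guard_passed && would_overflow(memcpy_size, size);
}
```

With memcpy_size = 10 and size = 5, the lower-bound guard passes and the copy overflows, while the upper-bound guard rejects the copy. No value of memcpy_size can both pass the upper-bound guard and overflow.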
The following query is a snippet from the final query (GitHub link at the end of the blog):
predicate isMemcpyNotGuardedEnough(FunctionCall memcpy) {
  exists(BasicBlock bb, GuardCondition gc, Expr left, Expr right |
    bb = memcpy.getBasicBlock() and
    // No guards at all - easy scenario
    (
      not gc.controls(bb, _)
    )
    // Guard exists - check the type of the guard and where the length variable is
    or
    (
      gc.controls(bb, _) and
      (
        (
          // Condition has the form x < y and must be true in order for memcpy to execute
          // Make sure the check doesn't check the maximum value of the length variable
          exists(boolean enterBlock |
            gc.ensuresLt(left, right, _, bb, enterBlock) and
            if enterBlock = true
            then not lengthVariable.getTarget() = left.getAChild*().(VariableAccess).getTarget()
            else not lengthVariable.getTarget() = right.getAChild*().(VariableAccess).getTarget()
          )
        )
        or
        (
          // The guard has the form x == y and must hold in order for memcpy to execute
          gc.ensuresEq(left, right, _, bb, true) and
          // In that case, make sure the length variable is not checked
          // If it is being checked, it means length must have a specific value => well guarded
          not (
            lengthVariable.getTarget() = gc.getAChild*().(VariableAccess).getTarget()
          )
        )
      )
    )
  )
}
Figure 22 – We make sure that if a memcpy is guarded, then it is guarded correctly
The logic here is simple: determine if the guard is checking for an upper bound of the variables that make up the 3rd argument (the length) in a memcpy.
With this, we can determine if a memcpy call is guarded correctly or not. We can now filter out calls to memcpy that might have been vulnerable, but the guard prevents the bug from occurring.
Sanity Check
As mentioned earlier, we reviewed each logical part from the finalized query (GitHub link below).
Eventually, we wrote a single query that contains all the logical steps we reviewed in this blog.
To validate the query, we must prove that the query works as expected. To do so, we have to accomplish two things:
- Prove we find vulnerable calls to memcpy
- Clear the safe calls to memcpy – narrowing down the number of calls a researcher must review
To check the credibility of that query, we created code databases with existing vulnerabilities that are caused by an unsafe call to memcpy, for example:
- CVE-2020-12284 in FFmpeg
- CVE-2016-9453 in LibTiff
To eliminate false positives, we tested the query against different versions of several projects and cleared 90%-99% of the calls to memcpy:
- ImageMagick
- GraphicsMagick
- Linux/Torvalds
- Many more
Finding X Vulnerabilities
Once we deemed the query reliable, we decided to run it against an updated (at the time) version of FFmpeg. I chose FFmpeg for two reasons:
- Familiarity with the code and how to debug it
- New code is shipped to the library daily
After creating a new code database from FFmpeg, executing the query produced the following results:
Figure 23 – Vulnerable memcpy calls
It was extremely easy to pull up these 7 new calls to memcpy that did not appear during the sanity check procedure. Digging deeper showed that those calls are, in fact, vulnerable. An attacker can control the address of both source and destination in those calls to memcpy and cause a read/write heap-based overflow that could lead to RCE.
Conclusions
Besides being a potent tool, CodeQL is relatively easy to learn and use for vulnerability research. The community is growing, and the language is evolving significantly with each passing day. Yet, mastering all of its capabilities might take time. If you have an idea for a query, I would highly recommend working in as organized a manner as possible:
- Define the bug you’re looking for. What are you looking to find? What are you not looking to find?
- Break down the query. Identify the logical components you need in order to find the bug you’re looking for
- Make sure each logical part returns the results you expect before writing a single complex query
- Enjoy the process and learn from your mistakes
Finally, CodeQL is relatively new and still evolving. Keep in mind that bugs might exist, so stay sharp and understand everything you write.
Future Work
Right now, the query checks the flow from standard allocation functions to the standard memcpy call.
If the code you’re analyzing has wrappers to standard allocation functions, you might need to add small modifications in the query. Those modifications will occur in the Config.qll and Utils.qll files. By creating a new CodeQL class that will represent those wrapper functions, we could solve that issue, and no adjustments would be necessary.
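As an illustration of what such a wrapper can look like, here is a hypothetical allocation wrapper in C. The name my_alloc_array and its behavior are assumptions made for this example and are not taken from any of the projects mentioned above. A query that only knows about malloc, calloc and realloc would not treat a call to this function as an allocation unless the wrapper is modeled.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical project-specific wrapper around malloc.
 * Allocates room for `count` elements of `elem_size` bytes each and
 * returns NULL if the total size would overflow size_t.
 */
static void *my_alloc_array(size_t count, size_t elem_size)
{
    if (elem_size != 0 && count > SIZE_MAX / elem_size)
        return NULL; /* count * elem_size would wrap around */
    return malloc(count * elem_size);
}
```

From the caller's point of view, the allocation size is now spread over two arguments, which is one reason a wrapper may need its own modeling in the query.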
Finally, we will continue to perform variant analysis by writing new queries to find even more complex and unique vulnerabilities.
Links & References
- Learn CodeQL – https://codeql.github.com/docs/codeql-for-visual-studio-code/
- The query – https://github.com/assafsion/DangerousMemcpy