This talk will explore program analysis on compiled code, where source is not available. Many static program analysis tools, such as LLVM passes, depend on the ability to compile source to bytecode, and cannot operate on binaries. A solution to this problem will be explained and demonstrated using the new Intermediate Language (IL) in Binary Ninja. Binary Ninja IL will be described, providing a basic understanding of how to write analyses using it.
This talk will describe and release a tool in Binary Ninja IL for automated discovery of a simple memory corruption vulnerability and demonstrate it on a CTF binary. The concepts of variable analysis, abstract interpretation, and integer range analysis will be discussed in the context of vulnerability discovery.
--- Sophia D'Antoine
Sophia D’Antoine is a security engineer at Trail of Bits in NYC and a graduate of Rensselaer Polytechnic Institute. She is a regular speaker at security conferences around the world, including RECon, HITB, and CanSecWest. Her present work includes techniques for automated software exploitation and software obfuscation using LLVM. She spends too much time playing CTF and going to noise concerts.
4. Fuzzing…
Current state of the art.
[Diagram: Binary / Source Code / Problem. Labels: reading/scripting disassembly; reading code; analysis of bitcode; static analysis with Bindead, REIL, BAP; dynamic instrumentation; static and dynamic analysis; compilers; source code analyzer; McSema.]
6. Problems.
Binary / Source Code / Problem
● Lack of robust tooling options
● Reading code continues to be useful
● Increase in compiler strength and LLVM tooling (lots of cool projects in this area!)
● Most tools lack semantic reasoning
● Decompilers widely used but difficult to automatically reason over
● Majority of program analysis frameworks are hard to use: they lack usable interfaces for plugging in your own analysis
● No really good options to lift binaries to interactive, workable IL frameworks
9. Binja: Tree Based Structure
● Binary Ninja IL is organized into expressions: LowLevelILInstruction
● LLIL instructions are infinite length tree-based instructions
● Infix notation. Destination operand is the left hand operand (e.g. x86 ``mov eax, 0`` vs. LLIL ``eax = 0``)
● Side effect free
● Recursive descent analysis
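To make the tree-based, infix idea concrete, here is a minimal sketch in plain Python. The `Expr` class, the operation names, and `render` are hypothetical stand-ins invented for illustration; they are not Binary Ninja classes and do not mirror its real instruction set.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical toy type: NOT a Binary Ninja class.
@dataclass(frozen=True)
class Expr:
    op: str                                        # "SET_REG", "ADD", "REG" or "CONST"
    operands: Tuple[Union["Expr", str, int], ...]  # children can be whole sub-trees


def render(e: Expr) -> str:
    """Recursively print an expression tree in infix form, destination on the left."""
    if e.op == "REG":
        return str(e.operands[0])
    if e.op == "CONST":
        return str(e.operands[0])
    if e.op == "SET_REG":
        dest, src = e.operands
        return f"{dest} = {render(src)}"
    if e.op == "ADD":
        lhs, rhs = e.operands
        return f"({render(lhs)} + {render(rhs)})"
    raise ValueError(f"unknown op {e.op}")


# x86 `mov eax, 0` as a tree
mov = Expr("SET_REG", ("eax", Expr("CONST", (0,))))
# x86 `lea eax, [ebx + 4]`: the source operand is itself a sub-tree
lea = Expr("SET_REG", ("eax", Expr("ADD", (Expr("REG", ("ebx",)), Expr("CONST", (4,))))))

print(render(mov))
print(render(lea))
```

Because operands can nest to any depth, one instruction can carry a whole expression, which is what makes walking the tree recursively a natural fit.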
10. Binja: Tree Based Structure
● Symbolic analysis (abstract interpretation) to find bounds of a jump table
● Determine function ends, aborts, etc. using disassembly and their own IL.
19. Binja API
● Python, C and C++ API (idiomatic!)
● Missing some analysis features that are built into LLVM (e.g. integrated CFG traversal, Uses, SSA, reg/var distinction)
● Branches: basic block / function edges (outgoing)
● Get the register states, some naive range analysis
● api.binary.ninja/search.html
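Since CFG traversal is not integrated, it has to be built by hand from the outgoing edges. The sketch below assumes that a function object exposes `basic_blocks`, that each block exposes `start` and `outgoing_edges`, and that each edge exposes `target`, as in Binary Ninja's Python API; check those names against the API docs for your version. The `SimpleNamespace` objects are stand-ins so the example runs without Binary Ninja installed.

```python
from types import SimpleNamespace


def successor_map(func):
    """Map each basic block's start address to its successors' start addresses.

    Assumes func.basic_blocks, block.start, block.outgoing_edges and
    edge.target behave as in Binary Ninja's Python API (an assumption).
    """
    return {
        block.start: sorted(edge.target.start for edge in block.outgoing_edges)
        for block in func.basic_blocks
    }


def reachable(succ, entry):
    """Hand-rolled depth-first traversal over the successor map."""
    seen, stack = set(), [entry]
    while stack:
        addr = stack.pop()
        if addr in seen:
            continue
        seen.add(addr)
        stack.extend(succ.get(addr, []))
    return seen


# Stand-in objects: a diamond-shaped CFG plus one unreachable block.
blocks = {a: SimpleNamespace(start=a, outgoing_edges=[])
          for a in (0x1000, 0x1010, 0x1020, 0x1030, 0x2000)}
for src, dst in [(0x1000, 0x1010), (0x1000, 0x1020), (0x1010, 0x1030), (0x1020, 0x1030)]:
    blocks[src].outgoing_edges.append(SimpleNamespace(target=blocks[dst]))
fake_func = SimpleNamespace(basic_blocks=list(blocks.values()))

succ = successor_map(fake_func)
print(sorted(hex(a) for a in reachable(succ, 0x1000)))
```

With a real binary loaded, the same two helpers would be called on a function taken from the open view instead of `fake_func`.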
20. Symbolic Execution
● Very accurate
● Takes time, data, and memory; often not feasible
● IDEA! Reason only about what we care about
● Apply complex data to abstract domains!
● Domains: type, sign, range, color, etc.
21. Practical(Academia) & Program Analysis
● Sets of concrete values are abstracted imprecisely
● Galois Connection formalizes Concrete <-> Abstract
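A small sketch of the Concrete <-> Abstract mapping, using integer intervals as the abstract domain. The names `alpha` (abstraction) and `gamma` (concretization) follow the usual textbook convention; the interval domain is chosen here only for illustration.

```python
from typing import FrozenSet, Optional, Tuple

# None stands for bottom, i.e. the empty set of values.
Interval = Optional[Tuple[int, int]]


def alpha(concrete: FrozenSet[int]) -> Interval:
    """Abstraction: the smallest interval covering the concrete set."""
    return (min(concrete), max(concrete)) if concrete else None


def gamma(abstract: Interval) -> FrozenSet[int]:
    """Concretization: every integer the interval stands for."""
    if abstract is None:
        return frozenset()
    lo, hi = abstract
    return frozenset(range(lo, hi + 1))


concrete = frozenset({1, 5})
abstract = alpha(concrete)   # the interval (1, 5)
widened = gamma(abstract)    # now also contains 2, 3 and 4

print(abstract, sorted(widened))
```

The round trip shows the imprecision: `gamma(alpha(S))` always contains `S` but may contain more, while `alpha(gamma(a))` gives back the same interval `a`.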
25. Practical(Academia) & Program Analysis
● x's value is imprecise
● Compilers perform imprecise abstraction
int x;
int[] a = new int[10];
a[2 * x] = 3;
1. Add precision - i.e. declare abstract value [0, 9]
2. Symbolically execute with abstract domain/values
● Requires control-flow analysis
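The array example can be checked with a tiny range analysis. This sketch reads [0, 9] as the valid index range of the 10-element array, which is one interpretation of the slide; the `Interval` class and `index_is_safe` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: int
    hi: int

    def scale(self, k: int) -> "Interval":
        """Abstract version of multiplying by the constant k."""
        a, b = self.lo * k, self.hi * k
        return Interval(min(a, b), max(a, b))

    def within(self, other: "Interval") -> bool:
        """True if every value in self also lies in other."""
        return other.lo <= self.lo and self.hi <= other.hi


def index_is_safe(x: Interval, array_len: int) -> bool:
    """Check a[2 * x] against the valid index range [0, array_len - 1]."""
    return x.scale(2).within(Interval(0, array_len - 1))


unknown_int = Interval(-2**31, 2**31 - 1)   # nothing known about x
print(index_is_safe(Interval(0, 4), 10))    # index range is [0, 8]
print(index_is_safe(Interval(0, 9), 10))    # index range is [0, 18]
print(index_is_safe(unknown_int, 10))
```

Only the first case can be proven in bounds; the other two are flagged, which is the behavior you want from a checker that looks for memory corruption.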
26. Abstract Domains & Sign Analysis
int a, b, c;
a = 42;
b = 87;
if (input) {
    c = a + b;
} else {
    c = a - b;
}
● Map variables to an abstract value
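A sketch of sign analysis applied to the program above. The five abstract values follow the sign lattice from the Static Program Analysis book cited in the resources; the helper names (`sign_of`, `join`, `add`, `sub`) and the string encoding are assumptions made for this example.

```python
POS, NEG, ZERO, TOP, BOT = "+", "-", "0", "?", "bot"   # "bot" is the empty set


def sign_of(n: int) -> str:
    """Abstract a concrete integer to its sign."""
    return POS if n > 0 else NEG if n < 0 else ZERO


def join(a: str, b: str) -> str:
    """Least upper bound in the sign lattice (used where branches merge)."""
    if a == b:
        return a
    if a == BOT:
        return b
    if b == BOT:
        return a
    return TOP


def negate(a: str) -> str:
    return {POS: NEG, NEG: POS}.get(a, a)


def add(a: str, b: str) -> str:
    """Abstract addition on signs."""
    if BOT in (a, b):
        return BOT
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    if a == b and a != TOP:
        return a          # (+) + (+) is +, (-) + (-) is -
    return TOP            # mixed signs: the result could be anything


def sub(a: str, b: str) -> str:
    return add(a, negate(b))


state = {"a": sign_of(42), "b": sign_of(87)}
then_c = add(state["a"], state["b"])   # c = a + b
else_c = sub(state["a"], state["b"])   # c = a - b
state["c"] = join(then_c, else_c)      # merge after the if/else
print(state)
```

The analysis never looks at `input`: both branches are evaluated abstractly and joined, so `c` ends up as `?` even though `a` and `b` are known to be positive.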
27. Abstract Domains & Sign Analysis
● Binary Ninja plugin
● Path sensitive - construct lattices of abstract values
● Under-approximate
● One abstract state per CFG node
● Avoid loss in precision for fractions.
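To illustrate "one abstract state per CFG node", here is a generic worklist sketch. It is not the plugin's implementation: it is the classic flow-sensitive fixpoint that merges states where paths meet, whereas the slide describes a path-sensitive, under-approximating design. The CFG, the transfer functions, and the set-of-signs encoding are all assumptions made for this example.

```python
from collections import deque


def join(s1, s2):
    """Pointwise union of two abstract states (variable -> set of signs)."""
    return {v: s1.get(v, frozenset()) | s2.get(v, frozenset())
            for v in s1.keys() | s2.keys()}


def analyze(cfg, transfer, entry):
    """Worklist fixpoint keeping exactly one incoming state per CFG node."""
    state_in = {node: {} for node in cfg}
    work = deque([entry])
    while work:
        node = work.popleft()
        out = transfer[node](state_in[node])
        for succ in cfg[node]:
            merged = join(state_in[succ], out)
            if merged != state_in[succ]:      # only revisit when something changed
                state_in[succ] = merged
                work.append(succ)
    return state_in


pos = frozenset({"+"})
any_sign = frozenset({"+", "-", "0"})

# Diamond CFG for: a = 42; b = 87; if (input) c = a + b; else c = a - b;
cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
transfer = {
    "entry": lambda s: {**s, "a": pos, "b": pos},
    "then": lambda s: {**s, "c": pos},        # (+) + (+)
    "else": lambda s: {**s, "c": any_sign},   # (+) - (+)
    "exit": lambda s: s,
}

result = analyze(cfg, transfer, "entry")
print(sorted(result["exit"]["c"]))
```

Keeping sets of signs instead of a single `?` loses less information at the merge point, at the cost of a larger lattice.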
31. Conclusion
● Thanks!
○ Vector35
○ Trail of Bits
○ Ryan Stortz (@withzombies)
● Resources
○ binary.ninja/
○ github.com/quend/abstractanalysis
○ santos.cs.ksu.edu/schmidt/Escuela03/WSSA/talk1p.pdf
○ Static Program Analysis Book! cs.au.dk/~amoeller/spa/spa.pdf
remember: prune this before analysing
32. Agenda
1) IDA isn’t perfect
2) Binary Ninja IL
3) Practical(Academia) and program analysis
a) Abstract Interpretation
4) Binary Ninja plugin demo
5) Conclusion
Editor's Notes
This talk isn’t about a new fantastical analysis platform. This talk isn’t about how one tool is better than another. This talk isn’t about a new silver bullet.
This talk is about making simple and advanced static analysis techniques easy and available to everyone...
Joern - source code analyzer
There’s a lot available, but we all know we’re going to ignore all of them and go straight for IDA. Why? Because IDA is interactive and tweakable and customizable.
Let’s face it...IDA isn’t perfect.
I’m sure most of you have taken a shot at doing some automated analysis in IDA. Maybe you wanted to identify all the dynamically bounded memcpys. IDA has a Python API, how hard could it be?
Okay, let’s start by getting all the cross references to memcpy. Easy enough in the IDA API, we just iterate over the xrefs.
Now we need to see whether the size parameter of memcpy is constant. So we look up the calling convention of our architecture and find the 3rd parameter. Our architecture is x86-32, so that means we need to model the stack. So now we jump back to the top of the basic block and start implementing instructions. Let’s start by implementing the pushes...oh wait, then we need to do the moves...but now we need to remember that ESP *and* EBP are stack pointers...etc. etc.
That’s a lot of work for such a simple analysis. There has to be a better way.
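To give a feel for that work, here is a deliberately tiny sketch of the hand-rolled stack model. It is plain Python over made-up instruction tuples, not IDA or Binary Ninja API calls; it assumes cdecl (arguments pushed right to left) and only understands `mov`, `push` and `call`.

```python
def memcpy_size_args(instructions):
    """Return the modelled size argument for each memcpy call.

    Each result is an int when the size is a known constant, else None.
    Instructions are hypothetical tuples such as ("mov", "eax", 0x40).
    """
    regs, stack, results = {}, [], []
    for ins in instructions:
        op = ins[0]
        if op == "mov":
            _, dst, src = ins
            regs[dst] = src if isinstance(src, int) else regs.get(src)
        elif op == "push":
            src = ins[1]
            stack.append(src if isinstance(src, int) else regs.get(src))
        elif op == "call" and ins[1] == "memcpy":
            # cdecl pushes right to left: dest is on top, then src, then size
            results.append(stack[-3] if len(stack) >= 3 else None)
    return results


constant_size = [
    ("mov", "eax", 0x40),
    ("push", "eax"),      # size
    ("push", "esi"),      # src
    ("push", "edi"),      # dest
    ("call", "memcpy"),
]
dynamic_size = [
    ("push", "ecx"),      # size comes from a register we know nothing about
    ("push", "esi"),
    ("push", "edi"),
    ("call", "memcpy"),
]

print(memcpy_size_args(constant_size))
print(memcpy_size_args(dynamic_size))
```

Even this toy ignores stores through ESP or EBP, stack cleanup after the call, and every other instruction, which is exactly the point: doing it properly by hand is a lot of code.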
Most tools cannot reason about semantics. McSema is not really that great.
``class LowLevelILInstruction`` Low Level Intermediate Language Instructions are infinite length tree-based
instructions. Tree-based instructions use infix notation with the left hand operand being the destination operand.
Infix notation is thus more natural to read than other notations (e.g. x86 ``mov eax, 0`` vs. LLIL ``eax = 0``).
Memcpy example with binary ninja here
https://gist.github.com/withzombies/75d12d8fa1237213beb7e82acbfc3b40
http://santos.cs.ksu.edu/schmidt/Escuela03/WSSA/talk1p.pdf
In one sense, every analysis based on abstract interpretation is a “predicate abstraction.” But the “logic” is weak: it supports conjunction (⊓) but not necessarily disjunction (⊔).
https://cs.au.dk/~amoeller/spa/spa.pdf
Here, the analysis could conclude that a and b are positive numbers in all possible executions at the end of the program. The sign of c is either positive or negative depending on the concrete execution, so the analysis must report ? for that variable. Altogether we have an abstract domain consisting of the five abstract values {+, -, 0, ?, ⊥}, which we can organize as follows with the least precise information at the top and the most precise information at the bottom: ? at the top; +, 0, and - in the middle; ⊥ at the bottom. The ordering reflects the fact that ⊥ represents the empty set of integer values and ? represents the set of all integer values. This abstract domain is an example of a lattice. We continue the development of the sign analysis in Section 5.2, but we first need the mathematical foundation in place.