Decompiler Design - Reverse Engineering Tools



Reverse Engineering Tools

Even during development, programmers regularly use reverse engineering tools to verify the correctness of their programs. The most commonly used of these is the debugger.
A debugger allows step-by-step execution of the program and inspection of the program's variables, which makes it one of the most useful reverse engineering tools. Whether a debugger works at the level of the original source code or of the machine code depends on whether debugging information is available to it. Typically the compiler generates debugging information when given a switch (such as -g for gcc), and the assembler and linker pass it through into the binary, where the debugger can find it. Various standards define how this information is encoded, and we'll consider them when we talk about object file formats, since the information is very useful to other reverse engineering tools as well.

While a debugger can do run-time or "dynamic" analysis of the target program, the other tools discussed here can only do "static" analysis: they examine the binary file without running it.

Disassembling is one of the features of debuggers, but it can also be performed by a stand-alone tool, a disassembler. A disassembler converts a sequence of bytes that will potentially be executed by the processor into a series of text lines that represent the operations performed by those bytes. This is a crude process, and it is prone to errors, because it assumes that the byte sequence being disassembled represents processor instructions ("code"). If the compiler has put some data in the text section, the disassembler will try to convert that data into instructions. Even worse, the instruction stream may become de-synchronized: many instructions span more than one byte, so a wrongly decoded instruction can swallow the first bytes of the real instruction that follows it. We will consider this problem again when we talk about identifying code and data areas.
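The de-synchronization problem can be illustrated with a minimal sketch. The decoder below understands only four 32-bit x86 opcodes, and the byte sequence is made up for this example (the names decode_one, linear_sweep and text_section are hypothetical); a real disassembler handles the full instruction set, but the failure mode is the same.

```python
import struct

# Toy decoder: an illustrative subset of 32-bit x86, not a real disassembler.
ONE_BYTE = {0x55: "push ebp", 0x90: "nop", 0xC3: "ret"}

def decode_one(code, off):
    """Return (length, text) for the instruction starting at offset `off`."""
    op = code[off]
    if op in ONE_BYTE:
        return 1, ONE_BYTE[op]
    if op == 0xB8 and off + 5 <= len(code):
        # mov eax, imm32: one opcode byte followed by a 4-byte little-endian value
        (imm,) = struct.unpack_from("<I", code, off + 1)
        return 5, f"mov eax, 0x{imm:x}"
    return 1, f".byte 0x{op:02x}"  # byte this toy decoder does not understand

def linear_sweep(code, start=0):
    """Decode instructions back to back, assuming every byte is code."""
    out = []
    off = start
    while off < len(code):
        length, text = decode_one(code, off)
        out.append((off, text))
        off += length
    return out

# Made-up text section: two small functions separated by one byte of data.
text_section = bytes([
    0x55, 0xB8, 0x01, 0x00, 0x00, 0x00, 0xC3,  # offset 0: push ebp; mov eax, 1; ret
    0xB8,                                      # offset 7: data that looks like an opcode
    0x55, 0xB8, 0x02, 0x00, 0x00, 0x00, 0xC3,  # offset 8: push ebp; mov eax, 2; ret
])

for off, text in linear_sweep(text_section):
    print(f"{off:2d}: {text}")
```

Sweeping from offset 0, the data byte at offset 7 is decoded as "mov eax, 0x2b855" and swallows the first four bytes of the second function, so the real "push ebp" and "mov eax, 0x2" never appear in the listing. Calling linear_sweep(text_section, 8), that is, starting at the known entry point of the second function, decodes it correctly.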

Smarter disassemblers are able to provide more information about a specific instruction, in particular whether the instruction references a global variable or calls a global function. Remember that the processor knows nothing about functions and variables; it only knows about memory addresses. Therefore, a smart disassembler should be able to combine information from different parts of the binary file and present it in a meaningful way.
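As a rough sketch of what combining information means in practice, the snippet below decodes an x86 "call rel32" instruction, computes the absolute target address, and looks that address up in a symbol table. The table contents, the addresses and the function name annotate_call are assumptions made up for this example; a real tool would read the symbols from the binary file.

```python
import struct

# Hypothetical symbol table, standing in for the one stored in the binary.
SYMBOLS = {0x1000: "main", 0x1020: "helper", 0x2000: "counter"}

def annotate_call(code, addr):
    """Decode a `call rel32` (opcode 0xE8) located at `addr` and name its target."""
    if len(code) < 5 or code[0] != 0xE8:
        raise ValueError("not a call rel32 instruction")
    # The operand is a signed 32-bit offset relative to the next instruction.
    (rel,) = struct.unpack_from("<i", code, 1)
    target = (addr + 5 + rel) & 0xFFFFFFFF
    name = SYMBOLS.get(target)
    suffix = f" <{name}>" if name else ""
    return f"{addr:x}: call 0x{target:x}{suffix}"

print(annotate_call(bytes([0xE8, 0x16, 0x00, 0x00, 0x00]), 0x1005))
# prints: 1005: call 0x1020 <helper>
```

The processor only ever sees the offset 0x16; the name "helper" comes entirely from the symbol table, which is why the annotation disappears when a binary has been stripped of its symbols.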
"objdump" is a Linux program that has most of these features.

Technically speaking, "objdump" is not a disassembler; it's an object dumper (hence the name). An object dumper is very useful in that it can show the full content of compiled binary files, not just the code area. In fact, object files can hold a lot more information than what the processor will see. They usually record where in memory each area of the file will be loaded, which dynamically linked libraries are needed to run the program, and the code necessary to link and load those libraries during execution. Object files may also have a set of names for addresses (global symbols), and in some cases even information on local variables and types.
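To give a flavor of the non-code information an object dumper reads, here is a minimal sketch that parses the fixed-size header of a 64-bit little-endian ELF file, the object format commonly used on Linux. ELF is an assumption of this example (other systems use other formats), the function name parse_elf64_header is made up, and the sample header is synthetic rather than taken from a real program.

```python
import struct

# Layout of the 64-byte ELF64 file header, little-endian, no padding.
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")

def parse_elf64_header(data):
    """Return a few fields of an ELF64 little-endian file header as a dict."""
    fields = ELF64_HEADER.unpack_from(data, 0)
    ident = fields[0]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 2 or ident[5] != 1:
        raise ValueError("this sketch handles only 64-bit little-endian ELF")
    return {
        "type": fields[1],      # 2 = executable, 3 = shared object
        "machine": fields[2],   # 62 = x86-64
        "entry": fields[4],     # address where execution starts
        "phoff": fields[5],     # file offset of the program header table
        "shoff": fields[6],     # file offset of the section header table
        "phnum": fields[10],    # number of program headers
        "shnum": fields[12],    # number of section headers
    }

# Synthetic header built only for this example.
ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)
sample = ELF64_HEADER.pack(ident, 2, 62, 1, 0x401000, 64, 0, 0, 64, 56, 1, 64, 0, 0)

header = parse_elf64_header(sample)
print(hex(header["entry"]), header["machine"], header["phnum"])
# prints: 0x401000 62 1
```

To try it on a real file, read the first 64 bytes of a 64-bit ELF binary and pass them to parse_elf64_header. The program and section header tables located through phoff and shoff are where the load addresses, library dependencies and symbol names mentioned above are found.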

Even with the wealth of information provided by an object dumper, the information is still at the machine level. The object dumper shows what the processor and operating system operate on. Only in rare cases can you see the source of your program mixed with the disassembly output, and in those cases you don't need any other tool to analyze the program, since you already have the sources.

If you want to have a higher-level view of the program, you need to convert the machine view into a programmer's view. That is, you need to raise the level of abstraction, introducing concepts that were used during the development phase, and that have been lost during the compilation and linking phase.
This is what decompilers do: they try to reconstruct the information that is partially lost when creating the binary program. This document outlines how this is done. Since this is a highly technical subject, the reader is expected to be familiar with at least one assembly language and processor architecture. Most of the algorithms are processor independent, so it doesn't matter which processor you know. The x86 is one of the most widely available processors, so most examples will use this CPU, but this shouldn't prevent you from following the content, except for some corner cases.
