Decompiler Design - Introduction



Forward Engineering Tools (compilers, assemblers, linkers)

Forward engineering tools move a program from a human-centric level of abstraction towards a machine-centric one. For most programmers, the main interface to the machine is the compilation environment. It takes as input one or more source files in a high-level language such as C or Java, plus supporting files such as resource files and libraries, and converts them into an executable for a particular execution environment, say Linux or Windows. (We are not considering higher-level program representations such as UML, although they too can be the target of decompilation.) The conversion is carried out in several steps, each performed by a separate program:

  • Each high-level language source file is compiled into assembly by a compiler for that high-level language.
  • Each assembly language file, whether created by a compiler or written directly by the programmer, is converted into a relocatable object file by an assembler. The assembler is not concerned with which language the high-level source was written in, only with which processor will execute the binary code. Information can already be lost by this step, since the assembler may never see much of what matters to the programmer, such as local variable names and types.
  • The linker combines each relocatable object file with the libraries that support the target execution environment. The linker may not care about the processor that will execute the program, only about the information the target operating system needs in order to load it. It may also remove from the generated binary any information it considers unnecessary for execution.
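
The three steps above can be sketched as the commands a build script would issue. This is a minimal illustration, not a definitive build procedure: it assumes a Unix-style toolchain in which `cc -S` emits assembly, `as` assembles, and the `cc` driver invokes the linker. The helper name `build_steps` and the file names are hypothetical, and the script only constructs the commands; it does not run them.

```python
from pathlib import PurePosixPath


def build_steps(sources, output):
    """Return the compile, assemble and link commands for a list of C files.

    Hypothetical helper: it only builds argument lists, it never runs them.
    """
    commands = []
    objects = []
    for src in sources:
        stem = PurePosixPath(src).with_suffix("")
        asm = f"{stem}.s"
        obj = f"{stem}.o"
        # Step 1: the compiler translates high-level source into assembly.
        commands.append(["cc", "-S", src, "-o", asm])
        # Step 2: the assembler turns assembly into a relocatable object file.
        commands.append(["as", asm, "-o", obj])
        objects.append(obj)
    # Step 3: the linker (reached here through the cc driver) combines the
    # object files with the system libraries into a single executable.
    commands.append(["cc", *objects, "-o", output])
    return commands


if __name__ == "__main__":
    for cmd in build_steps(["main.c", "util.c"], "app"):
        print(" ".join(cmd))
```

Each command consumes only the output of the previous one, which is why anything a stage leaves out of its output is invisible to every later stage.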

As one can see, each tool drops from its output some information that was vital to the programmer when writing the program, because that information is not needed for the final execution of the program.

Moreover, the programmer may explicitly instruct each tool to generate or remove valuable information. When using a compiler, the user may decide to:

  • generate additional information to improve the debuggability of the code (the -g command line option of Unix compilers is used for this purpose)
  • generate code that is harder for humans to understand but executes more efficiently, that is, optimized code, through the -O1, -O2 or higher command line options. The transformations an optimizing compiler applies to the generated instructions make the final code less readable, even when debugging information is present in the final file and the original source is available and inspected through a debugger. Debugging optimized code is a worthy area of research in its own right and will not be considered in this document, although many of the techniques described here could be applied to a 'de-optimizing debugger'.

The user's choices with the other tools also affect what a decompiler has to work with, for example when the linker is instructed to remove all symbolic information from the binary file.
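
The choices described above map onto a handful of command line options. The sketch below is an illustration only: `-g` and the `-O` levels come from the text, while `-s` (strip symbols at link time) is an assumption that holds for GCC- and Clang-style drivers. The helper `build_command` and its one-line notes are hypothetical.

```python
def build_command(source, output, debug=False, opt_level=0, strip=False):
    """Build one cc command line plus notes on what it means for a decompiler.

    Hypothetical helper: the flags assume a GCC- or Clang-style driver.
    """
    if opt_level not in (0, 1, 2, 3):
        raise ValueError("opt_level must be 0, 1, 2 or 3")
    cmd = ["cc"]
    notes = []
    if debug:
        # -g asks the compiler to emit debugging information.
        cmd.append("-g")
        notes.append("debug info requested: names and types may be recoverable")
    if opt_level > 0:
        # -O1, -O2, ... enable increasingly aggressive optimization.
        cmd.append(f"-O{opt_level}")
        notes.append("optimized: code no longer mirrors the source closely")
    if strip:
        # -s asks the linker to drop symbolic information (assumed flag).
        cmd.append("-s")
        notes.append("stripped: symbolic information removed from the binary")
    cmd += [source, "-o", output]
    return cmd, notes


if __name__ == "__main__":
    cmd, notes = build_command("main.c", "app", opt_level=2, strip=True)
    print(" ".join(cmd))
    for note in notes:
        print(" -", note)
```

An optimized, stripped build is typically the hardest input for a decompiler, while an unoptimized build with debugging information is the easiest.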

Once the binary has been produced, any tool that we can use to understand the program can be considered a reverse engineering tool.



by Mary Orban