Decompiler Design - Object File Formats
Before you can start decompiling a file you need to be able to read it.
There are three possible types of file formats:

- structured formats (such as ELF or COFF)
- tagged stream formats
- raw binary images
Each type of format has its strengths and weaknesses.
The first two formats carry metadata that can help the decompiler identify useful information. The third type, on the other hand, carries very little, so the missing information must be supplied to the decompiler by the user.
Each file format requires its own File Format Reader.
The first step after opening the file is to identify its type. Structured and tagged stream files start with a well-defined byte sequence that helps identify them. Here are the identifying byte sequences for some common object formats:

Format   Offset   Signature
ELF      0        7F 45 4C 46
COFF     0        2-byte machine type (*)
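As a sketch, this detection step can be a simple comparison of the leading bytes against a table of known signatures. The COFF machine-type values below are PE/COFF constants, shown here as a small illustrative subset, not an exhaustive table:

```python
import struct

ELF_MAGIC = b"\x7fELF"
# A few PE/COFF machine-type values (illustrative subset).
COFF_MACHINE_TYPES = {0x014C: "i386", 0x8664: "x86-64", 0x01C0: "ARM"}

def identify_format(data: bytes) -> str:
    """Guess the object file format from its leading bytes."""
    if data[:4] == ELF_MAGIC:
        return "ELF"
    if len(data) >= 2:
        (machine,) = struct.unpack("<H", data[:2])
        if machine in COFF_MACHINE_TYPES:
            return "COFF (%s)" % COFF_MACHINE_TYPES[machine]
    return "unknown"
```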
(*) One characteristic of COFF is that the first 2 bytes identify both the format as COFF and the target processor. Unfortunately, there is no standard for these 2 bytes; moreover, for processors that support both big-endian and little-endian operation, the same 2 bytes may appear in either order, making it difficult to identify a COFF file with absolute certainty.
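One way to cope with the byte-order ambiguity is to interpret the machine-type word in both orders and report every candidate that matches a known value. The machine table below is purely illustrative; the function name is an assumption, not part of any standard API:

```python
import struct

# Illustrative machine-type values only; a real table would be far larger.
KNOWN_MACHINES = {0x0160: "MIPS big-endian", 0x0162: "MIPS little-endian"}

def coff_machine_candidates(first_two: bytes):
    """Interpret the first 2 bytes in both byte orders; return all matches."""
    le = struct.unpack("<H", first_two)[0]
    be = struct.unpack(">H", first_two)[0]
    hits = []
    for value, order in ((le, "LE"), (be, "BE")):
        if value in KNOWN_MACHINES:
            hits.append((order, KNOWN_MACHINES[value]))
    return hits
```

If both interpretations match (or neither does), the reader cannot decide on its own and must fall back on user input.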
We'll see that even COFF files for the same target processor may have different data structures, because different compilers chose not to follow the standard (typically by using 32-bit fields where the original definition specified 16-bit fields).
If none of the above sequences is detected, the file may be a raw image or in an unknown file format. In these cases, the user must intervene manually to supply the information the decompiler needs.
Since we'll be dealing with machine instructions, the decompiler must identify the target CPU, that is, the CPU able to execute the instructions in the input file. Decompilation does not require actually executing those instructions, so the target CPU can differ from the CPU running the decompiler (the host CPU). In other words, decompilers should be cross tools, able to accept binary files generated for different processor architectures.
Selecting the correct CPU sometimes determines the data types that the target program uses. This is not always the case, however, since the original program may have been compiled under a variety of models. The following structured file formats provide architectural information:
Format   Offset   Field
ELF      0x12     2-byte machine type
COFF     0        see previous table
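For ELF, reading the machine type means decoding the 2-byte e_machine field at offset 0x12 with the byte order declared by the EI_DATA byte at offset 5. A minimal sketch (the e_machine values listed are from the ELF specification, shown as a small subset):

```python
import struct

# A few e_machine values from the ELF specification (illustrative subset).
ELF_MACHINES = {0x03: "x86", 0x28: "ARM", 0x3E: "x86-64", 0xF3: "RISC-V"}

def elf_target_cpu(header: bytes) -> str:
    """Read e_machine from an ELF header, honoring EI_DATA (byte 5)."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    fmt = "<H" if header[5] == 1 else ">H"  # 1 = little-endian, 2 = big-endian
    (machine,) = struct.unpack(fmt, header[0x12:0x14])
    return ELF_MACHINES.get(machine, "unknown (0x%X)" % machine)
```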
Raw formats, on the other hand, provide only a minimal amount of information (sometimes none at all). In these cases, a number of heuristics can be applied to infer the type of the file: the decompiler can use a database of common code sequences to identify the target CPU. This is a long shot, but it sometimes succeeds, if only in giving a suggestion to the user. If no match is found, the user must provide the target CPU (via a project file or via the UI) before proceeding with the object file analysis.
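Such a heuristic can be sketched as a scan for well-known function prologues. The two x86 patterns below are real instruction encodings, but a practical database would hold many sequences per architecture and weight them more carefully:

```python
# Heuristic CPU identification: count occurrences of well-known function
# prologue byte sequences in the raw image.
PROLOGUES = {
    "x86 (32-bit)": [b"\x55\x89\xe5"],      # push ebp; mov ebp, esp
    "x86-64":       [b"\x55\x48\x89\xe5"],  # push rbp; mov rbp, rsp
}

def guess_cpu(image: bytes):
    """Return the architecture with the most prologue hits, or None."""
    scores = {cpu: sum(image.count(p) for p in patterns)
              for cpu, patterns in PROLOGUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A result of None is itself useful: it tells the front end to ask the user rather than proceed on a bad guess.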
Structured object formats carry the code and data areas that are executed when the program runs, support areas used by the operating system when loading the file into memory (whose content is not actually executed by the CPU), and areas used by other tools such as debuggers.
The ELF and COFF formats are based on the concept of sections. A section is an area in the file that has homogeneous information, such as all code, or all data, or all symbols etc.
The decompiler reads the section table and uses it to convert file offsets to addresses and vice versa. This mapping also lets the user inspect the file by offset or by address (for example, when viewing a hex dump of the file's contents).
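The offset/address conversion is a straightforward lookup over section records. The field names below are assumptions chosen for the sketch, not a fixed interface:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Section:
    name: str
    file_offset: int  # position of the section's bytes in the file
    address: int      # virtual address where the section is loaded
    size: int

def offset_to_address(sections: List[Section], offset: int) -> Optional[int]:
    for s in sections:
        if s.file_offset <= offset < s.file_offset + s.size:
            return s.address + (offset - s.file_offset)
    return None  # offset falls outside every section

def address_to_offset(sections: List[Section], address: int) -> Optional[int]:
    for s in sections:
        if s.address <= address < s.address + s.size:
            return s.file_offset + (address - s.address)
    return None  # address is not mapped from the file
```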
Note, however, that a section marked as executable will not necessarily contain only machine instructions. Other read-only data can also be placed in an executable section, such as read-only ("const") strings and floating-point constants. The compiler or the linker can also add extra code that was not generated directly from the compiled source, for example virtual function tables, C++ exception-handling support (try/catch/throw), and the Global Offset Table (GOT) and Procedure Linkage Table (PLT) that support dynamic linking of shared libraries (DLLs).
It's therefore important that the decompiler identify data placed in the code section, so that it does not disassemble a data area as code. Should this happen, many of the subsequent analyses would operate on incorrect data, possibly invalidating the entire decompilation. At the very least, every file format should provide the offset of the first instruction executed after the file is loaded into memory.
Informational areas can help the decompilation process tremendously, because every piece of information beyond the executed code and data is information the decompiler will not have to recover through its own analysis.
Non-stripped executable files provide various levels of symbolic information.
The output of the object file format loader is a set of tables that allow the subsequent stages of the decompiler to be independent of any particular object file format.
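The shape of those tables can be sketched as a small set of format-neutral records. All the names below are assumptions for illustration; the text does not fix a particular interface:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Region:
    address: int      # load address of this chunk of the image
    data: bytes       # raw bytes copied out of the file
    executable: bool  # True if the region may contain instructions

@dataclass
class LoadedImage:
    """Format-neutral result handed to the later decompiler stages."""
    target_cpu: str
    entry_point: int  # address of the first executed instruction
    regions: List[Region] = field(default_factory=list)
    symbols: Dict[int, str] = field(default_factory=dict)  # address -> name
```

Whether the input was ELF, COFF, or a raw image plus user-supplied parameters, the later stages only ever see a LoadedImage-like structure.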