RecStudio Decompiler Design - Object File Formats

The first two file formats carry information that helps the decompiler identify the contents of the file. The third type, on the other hand, carries very little, so the user must supply the missing information to the decompiler.

Each file format requires its own File Format Reader.

Step 1: identify the file type

The first step after opening the file is to identify its type. Structured and tagged stream files start with a well-defined byte sequence that helps identify them. Here are some of the byte sequences for common object formats:

Format	File Offset	Content
ELF	0	7F 45 4C 46
COFF	0	2-byte machine type (*)
IEEE
OMF

(*) One characteristic of COFF is that the first 2 bytes identify both the format as COFF and the target processor. Unfortunately, there is no standard for these 2 bytes. Moreover, for processors that support both big-endian and little-endian modes, the same 2 bytes may appear in either order, which makes it difficult to identify a file as COFF with absolute certainty.
We'll see that even COFF files for the same target processor may have different data structures, because some compilers did not follow the standard (typically they used 32-bit fields in place of the original 16-bit fields).

If none of the above sequences is detected, the file may be a raw image or an unknown file format. In these cases, the user must intervene manually to specify the information the decompiler needs.
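
The detection step described above can be sketched as follows. This is only an illustration, not RecStudio's actual code: the function name identify_format is hypothetical, and the COFF table holds just two machine values taken from the Microsoft PE/COFF numbering, whereas a real reader would need a much larger table.

```python
import struct

ELF_MAGIC = b"\x7fELF"  # bytes 7F 45 4C 46 at file offset 0

# Illustrative subset of COFF machine values (Microsoft PE/COFF numbering).
# Other toolchains use other values, so this table is an assumption.
COFF_MACHINES = {
    0x014C: "i386",
    0x8664: "x86-64",
}


def identify_format(header: bytes) -> str:
    """Return a best-guess format name for the first bytes of a file."""
    if header[:4] == ELF_MAGIC:
        return "ELF"
    if len(header) >= 2:
        # The 2-byte machine field may be stored in either byte order,
        # so try both and report a match as a guess only.
        little = struct.unpack_from("<H", header)[0]
        big = struct.unpack_from(">H", header)[0]
        if little in COFF_MACHINES or big in COFF_MACHINES:
            return "COFF (probable)"
    # Raw image or unknown format: the user has to supply the details.
    return "unknown"
```

The COFF result is deliberately labeled "probable", since a 2-byte match can easily occur by chance in a raw image.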

Step 2: identify the processor type

Since we'll be dealing with machine instructions, the decompiler must identify the target CPU, that is, the CPU able to execute the instructions in the input file. Decompilation does not require actual execution of the instructions, so the target CPU can be different from the one that is executing the decompiler (the host CPU). In other words, decompilers should be cross tools, able to accept binary files generated for different processor architectures.

Selecting the correct CPU sometimes defines the data types that the target program is going to use. This however is not always true, since the original program may have been compiled in a variety of models. The following structured file formats provide architectural information:

Format	File Offset	Content
ELF	0x12	2-byte machine type
COFF	0	see previous table
IEEE
OMF
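
As an illustration of the ELF row in the table above, the following sketch reads the 2-byte machine type at offset 0x12. It assumes the byte order is taken from the EI_DATA byte at offset 5 of the ELF header (1 for little endian, 2 for big endian). The function name elf_machine is hypothetical, and the table lists only a handful of e_machine values.

```python
import struct

# A few e_machine values defined by the ELF specification.
ELF_MACHINES = {
    3: "x86",
    8: "MIPS",
    20: "PowerPC",
    40: "ARM",
    62: "x86-64",
    183: "AArch64",
}


def elf_machine(header: bytes) -> str:
    """Decode the machine type from the first 0x14 bytes of an ELF file."""
    if len(header) < 0x14 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    ei_data = header[5]  # EI_DATA: 1 = little endian, 2 = big endian
    if ei_data == 1:
        order = "<"
    elif ei_data == 2:
        order = ">"
    else:
        raise ValueError("invalid EI_DATA byte")
    (machine,) = struct.unpack_from(order + "H", header, 0x12)
    return ELF_MACHINES.get(machine, f"unknown ({machine})")
```

Unlike COFF, the byte order is stated explicitly in the header, so the machine field can be decoded without guessing.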

Raw formats, on the other hand, provide only a minimal amount of information (sometimes none at all). In these cases, a number of heuristics can be applied to infer the type of the file. For example, the decompiler can use a database of common code sequences to identify the target CPU. This can be a long shot, but it is sometimes successful, if only as a suggestion to the user. If no match is found, the user must provide the target CPU (via a project file or via the UI) before the object file analysis can proceed.
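
A minimal sketch of such a heuristic is shown below. The signature database, the guess_cpu name and the min_hits threshold are all assumptions made for illustration. The byte patterns are well-known instruction encodings (function prologues and returns), but a usable database would need many more entries per CPU.

```python
# Hypothetical signature database: byte sequences that commonly appear in
# compiled code for each CPU.
SIGNATURES = {
    # push ebp; mov ebp, esp (two equivalent encodings)
    "x86": [bytes.fromhex("5589e5"), bytes.fromhex("558bec")],
    # push rbp; mov rbp, rsp
    "x86-64": [bytes.fromhex("554889e5")],
    # jr ra
    "MIPS (big endian)": [bytes.fromhex("03e00008")],
    # blr
    "PowerPC (big endian)": [bytes.fromhex("4e800020")],
}


def guess_cpu(image: bytes, min_hits: int = 2):
    """Return the CPU whose signatures occur most often, or None."""
    scores = {
        cpu: sum(image.count(pattern) for pattern in patterns)
        for cpu, patterns in SIGNATURES.items()
    }
    best = max(scores, key=scores.get)
    # Too few hits means the result is not worth suggesting to the user.
    return best if scores[best] >= min_hits else None
```

The result is best treated as a suggestion that the user can confirm or override, since short byte patterns also occur by chance in data.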

Step 3: identify code, data and information areas

The structured object formats carry the code and data areas that are used when the program is run. They also carry support areas that the operating system uses when loading the file into memory (but whose content is not actually executed by the CPU), as well as areas that are used by other tools such as a debugger.

The ELF and COFF formats are based on the concept of sections. A section is an area in the file that holds homogeneous information, such as all code, all data, or all symbols.

The decompiler reads the section table and uses it to convert file offsets to addresses and vice-versa.

This is also used to allow the user to inspect the file by offset or by address (for example when doing a hex dump of the file's content).
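
The conversion between file offsets and addresses can be sketched as follows. The Section record and the two function names are hypothetical. The sketch also assumes that every section occupies exactly size bytes in the file, which does not hold for sections that have no file content (such as uninitialized data).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Section:
    name: str
    file_offset: int  # where the section starts in the file
    address: int      # where the section is placed in memory
    size: int         # length in bytes


def offset_to_address(sections: List[Section], offset: int) -> Optional[int]:
    """Map a file offset to a memory address, or None if unmapped."""
    for s in sections:
        if s.file_offset <= offset < s.file_offset + s.size:
            return s.address + (offset - s.file_offset)
    return None


def address_to_offset(sections: List[Section], address: int) -> Optional[int]:
    """Map a memory address back to a file offset, or None if unmapped."""
    for s in sections:
        if s.address <= address < s.address + s.size:
            return s.file_offset + (address - s.address)
    return None
```

For example, with a code section at file offset 0x1000 loaded at address 0x401000, the file offset 0x1010 corresponds to the address 0x401010, and a hex dump can show either value.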

Note, however, that a section marked as executable does not necessarily contain only machine instructions. Read-only data, such as read-only ("const") strings and floating-point constants, can also be placed in an executable section. The compiler or the linker can also add extra code and tables that were not directly generated from the compiled source code. Examples are virtual function tables, exception handling (try/catch/throw) support in C++, and the Global Offset Table (GOT) and Procedure Linkage Table (PLT) used to support dynamic linking of DLLs.

It's therefore important that the decompiler identifies data that was placed in the code section, so that it does not disassemble a data area as if it were code. Should this happen, many of the subsequent analyses will work on incorrect data, possibly invalidating the entire decompilation process. All file formats should provide at the very least the offset of the first instruction executed after the file is loaded in memory.

Informational areas can help the decompilation process tremendously, because every piece of additional information beyond the executed code and data is something the decompiler will not have to guess through its own analysis.

Non-stripped executable files provide various levels of symbolic information:

Addresses of global labels: these could be function entry points and global data variables. Note, however, that the size of such objects is usually not provided. That is, we may know where a function starts but not where it ends. Since the entry points of static functions may not be stored in the file, it is not reliable to simply assume that a function ends where the next known function begins.
Names of imported (dynamically linked) libraries, and addresses of the libraries' entry points or of the trampoline code generated to access those libraries. If the target program is a DLL itself, an export table is stored in the binary file. The export table provides the entry points of the functions exported by the DLL.
Relocation table: if one is found in the file, the decompiler can use the information therein to infer which instructions operate on addresses as opposed to numeric constants. This is very important when trying to identify the purpose of an assembly instruction. This also means that the decompiler must virtually link the target file to some fictitious memory address, which can be totally different from the actual address that will be used by the operating system, especially when decompiling a relocatable file (.o, .obj or .dll).
Debugging information: if the file was compiled with debugging information (-g on Unix systems), a lot more can be found, such as the list of source files and line numbers that were used to build the target program, the types of the variables, and the names of global, module-static and function-local variables. This is the best case, so a decompiler must take advantage of this wealth of information.
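
The use of the relocation table mentioned in the list above can be sketched as follows. Everything here is an assumption made for illustration: relocations are reduced to a set of file offsets of the fields they patch, the stored value is taken to be relative to a base of zero, and LINK_BASE is an arbitrary fictitious load address. Real relocation entries also carry a type and often a symbol reference.

```python
from typing import Set, Tuple

# Fictitious address at which the decompiler "virtually links" the file.
LINK_BASE = 0x10000000


def describe_operand(operand_offset: int, value: int,
                     relocations: Set[int]) -> Tuple[str, int]:
    """Classify an instruction operand stored at operand_offset.

    An operand covered by a relocation entry is an address and is rebased
    to the virtual link address; any other operand is a plain constant.
    """
    if operand_offset in relocations:
        return ("address", LINK_BASE + value)
    return ("constant", value)
```

Two instructions that load the same immediate value 0x40 are thus told apart: the one whose operand field appears in the relocation table refers to a memory location, while the other one loads the number 64.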

The output of the object file format loader is a set of tables that allows the following stages of the decompiler to be independent of any particular object file format.
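
One possible shape for these tables is sketched below. The class and field names are hypothetical and are not taken from RecStudio. The point is only that every File Format Reader fills in the same structure, so later stages never need to know whether the input was ELF, COFF or a raw image.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class SectionInfo:
    name: str
    file_offset: int
    address: int
    size: int
    executable: bool


@dataclass
class LoadedFile:
    """Format-independent result of a File Format Reader (hypothetical)."""
    cpu: Optional[str] = None            # None until detected or user-supplied
    entry_point: Optional[int] = None    # address of the first instruction
    sections: List[SectionInfo] = field(default_factory=list)
    symbols: Dict[int, str] = field(default_factory=dict)   # address -> name
    imports: Dict[int, str] = field(default_factory=dict)   # address -> name
    relocations: Set[int] = field(default_factory=set)      # patched offsets

    def name_at(self, address: int) -> Optional[str]:
        """Look up a symbolic name for an address, if any is known."""
        return self.symbols.get(address) or self.imports.get(address)
```

Fields that a given format cannot provide simply stay empty, which tells the later stages what still has to be inferred by analysis or asked of the user.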
