Prev : Decompiler architecture
Object File Formats
Before you can start decompiling a file you need to be able to read it.
There are three possible types of file formats:
- Structured formats (COFF, ELF)
- Tagged stream formats (OMF, IEEE)
- Raw files (DOS, ROM images, S-records, etc.)
Each type of format has its strengths and weaknesses.
The first two kinds of format carry information that helps the decompiler
identify useful features of the file. The third kind, on the other hand, gives
little information, so the user must supply it to the decompiler.
Each file format requires its own File Format Reader.
Step 1: identify the file type
The first step after opening the file is to identify its type. Structured and
tagged stream files start with a well-defined byte sequence that helps identify
them. Here are some of the byte sequences for common object formats:
Format | File Offset | Content                 |
-------|-------------|-------------------------|
ELF    | 0           | 7F 45 4C 46             |
COFF   | 0           | 2-byte machine type (*) |
IEEE   |             |                         |
OMF    |             |                         |
(*) One characteristic of COFF is that the first two bytes identify both the
format as COFF and the target processor. Unfortunately, there is no
standard for these two bytes, and for processors that support both
big endian and little endian modes the same two bytes may appear in either order,
making it difficult to identify the file as a COFF file with absolute certainty.
We'll see that even COFF files for the same target processor may have different
data structures, because different compilers chose not to follow the standard
(typically, they used 32-bit fields in place of the original
16-bit field definitions).
If none of the above sequences is detected, the file may be a raw image or an
unknown file format. In these cases, the user must manually supply the
information the decompiler needs.
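The detection logic above can be sketched in a few lines. The magic table here is illustrative and deliberately tiny; a real loader would recognize many more formats, and the two COFF machine types shown are only examples:

```python
import struct

# A minimal, illustrative magic table; a real loader checks many more formats.
MAGICS = {
    b"\x7fELF": "ELF",
}

# A couple of well-known COFF machine types (i386, x86-64). The list is
# intentionally incomplete: there is no single standard for these 2 bytes.
COFF_MACHINES = {0x014C, 0x8664}

def identify_format(data: bytes) -> str:
    """Return a best guess of the object file format, or 'raw'."""
    for magic, name in MAGICS.items():
        if data.startswith(magic):
            return name
    # COFF has no fixed magic: the first 2 bytes are a machine type,
    # so a match here is only a hint, never a certainty.
    if len(data) >= 2:
        machine = struct.unpack_from("<H", data)[0]
        if machine in COFF_MACHINES:
            return "COFF (probable)"
    return "raw"
```

Note how COFF can only ever be reported as "probable", for exactly the reason given in the footnote above.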
Step 2: identify the processor type
Since we'll be dealing with machine instructions, the decompiler must identify
the target CPU, that is, the CPU able to execute the instructions in the input
file. Decompilation does not require actually executing those instructions, so
the target CPU can differ from the one running the decompiler
(the host CPU). In other words, decompilers should be cross tools, able to accept
binary files generated for different processor architectures.
Selecting the correct CPU sometimes determines the data types that the target
program uses. This is not always true, however, since the original
program may have been compiled under a variety of memory models.
The following structured file formats provide architectural information:
Format | File Offset | Content             |
-------|-------------|---------------------|
ELF    | 0x12        | 2-byte machine type |
COFF   | 0           | see previous table  |
IEEE   |             |                     |
OMF    |             |                     |
Raw formats, on the other hand, provide only a minimal amount of information
(sometimes there is no information at all). In these cases, a number of
heuristics can be applied to infer the type of the file. The decompiler can use
a database of common code sequences to identify the target CPU. This can be a
long shot, but it is sometimes successful, if nothing else as a suggestion
to the user. If no match is found, the user must specify (via a project file or
via the UI) the target CPU to use before proceeding with the object
file analysis.
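For ELF, for instance, the machine type can be read straight out of the header. The sketch below assumes a well-formed header and maps a handful of EM_* values from the ELF specification to processor names:

```python
import struct

# A few EM_* machine values from the ELF specification (not exhaustive).
ELF_MACHINES = {3: "x86", 8: "MIPS", 20: "PowerPC",
                40: "ARM", 62: "x86-64", 183: "AArch64"}

def elf_machine(header: bytes) -> str:
    """Read e_machine (2 bytes at offset 0x12) from an ELF header."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 5 (EI_DATA) tells us the byte order of multi-byte header fields:
    # 1 = little endian, 2 = big endian.
    fmt = "<H" if header[5] == 1 else ">H"
    machine, = struct.unpack_from(fmt, header, 0x12)
    return ELF_MACHINES.get(machine, f"unknown ({machine:#x})")
```

Note that the header itself declares its own endianness (EI_DATA), so the reader must consult that byte before decoding any other field.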
Step 3: identify code, data and information areas
The structured object formats carry the code and data areas that make up the
running program, as well as support areas used by the operating system when
loading the file into memory (but whose content is not actually executed by the
CPU), and informational areas used by other tools such as debuggers.
The ELF and COFF formats are based on the concept of sections. A section is an
area in the file that has homogeneous information, such as all code, or all
data, or all symbols etc.
The decompiler reads the section table and uses it to convert file offsets to
addresses and vice-versa.
This is also used to allow the user to inspect the file by offset or by address
(for example when doing a hex dump of the file's content).
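The offset/address translation in both directions can be sketched like this (the Section record is a made-up minimal structure, not any particular format's):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Section:
    name: str
    file_offset: int  # where the section's bytes start in the file
    address: int      # where they will be placed in memory
    size: int

def offset_to_address(sections: List[Section], offset: int) -> Optional[int]:
    """Translate a file offset to a load address, or None if unmapped."""
    for s in sections:
        if s.file_offset <= offset < s.file_offset + s.size:
            return s.address + (offset - s.file_offset)
    return None

def address_to_offset(sections: List[Section], address: int) -> Optional[int]:
    """The inverse mapping, used e.g. when hex-dumping by address."""
    for s in sections:
        if s.address <= address < s.address + s.size:
            return s.file_offset + (address - s.address)
    return None
```

Returning None for unmapped offsets matters: some file ranges (headers, symbol tables) never appear in memory, and some memory ranges (zero-filled .bss) have no bytes in the file.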
Note, however, that a section marked as executable will not necessarily contain
only machine instructions. Other types of read-only data can also be placed in
an executable section, such as read-only ("const") strings and floating-point
constants. The compiler or the linker can also add extra code that was not
directly generated from the compiled source code. Examples of such extra code are
virtual function tables, C++ exception-handling support (try/catch/throw), and the
Global Offset Table (GOT) and Procedure Linkage Table (PLT) that support
dynamic linking of DLLs.
It's therefore important that the decompiler identifies data placed in
the code section, so that it does not disassemble a data area as if it were
code. Should this happen, many of the subsequent analyses would operate on
incorrect data, possibly invalidating the entire decompilation process. All
file formats should provide, at the very least, the offset of the first
instruction executed after the file is loaded in memory.
Informational areas can help the decompilation process tremendously, because
every piece of additional information beyond the executed code and data is a
piece of information that the decompiler will not have to recover through its
own analysis.
Non-stripped executable files provide various levels of symbolic information:
- addresses of global labels: these could be
function entry points and global data variables.
Note, however, that most of the time the size of such objects is not provided.
That is, we may know where a function starts but not where it ends.
Since the entry points of static functions may not be stored in the file, it is
not safe to simply assume that a function ends at the next
function's entry point.
- names of imported (dynamically linked) libraries, and
addresses of library entry points or of trampoline code generated to
access those libraries. If the target program is itself a DLL, an
export table is stored in the binary file; it
provides the entry points of the functions exported by the DLL.
- if a relocation table is found in the file, the
decompiler can use the information in it to infer which instructions
operate on addresses as opposed to numeric constants. This is very important
when trying to identify the purpose of an assembly instruction. It also means
that the decompiler must virtually link the target file at some fictitious
memory address, which can be totally different from the actual address that
will be used by the operating system, especially when decompiling a
relocatable file (.o, .obj or .dll).
- if the file was compiled with debugging information (-g
on Unix systems), much more information can be found, such as the list of
source files and line numbers used to build the target program, the
types of the variables, and the names of global, module-static and
function-local variables. This is the best case, and a decompiler must take
advantage of this wealth of information.
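The relocation-table idea in the list above amounts to a simple classifier over immediate operand fields. The sketch below is a toy model, with hypothetical offsets; a real decompiler would key relocations to the exact byte range of each operand:

```python
def classify_immediates(immediates, reloc_offsets):
    """Split immediate operand fields into addresses and plain constants.

    immediates:    list of (file_offset, value) pairs for immediate fields
    reloc_offsets: set of file offsets covered by relocation entries
    """
    addresses, constants = [], []
    for off, val in immediates:
        # A field the linker relocates necessarily holds an address.
        (addresses if off in reloc_offsets else constants).append((off, val))
    return addresses, constants
```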
The output of the object file format loader is a set of tables that allows the
following stages of the decompiler to be independent of any particular object
file format.
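Such format-independent output might look like the following set of tables (the field names are illustrative, not taken from any real decompiler):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Symbol:
    name: str
    address: int

@dataclass
class LoadedImage:
    """Format-independent view of the input file produced by the loader."""
    cpu: str          # target processor name
    entry_point: int  # address of the first executed instruction
    # (name, address, size, flags) per section
    sections: List[Tuple[str, int, int, str]] = field(default_factory=list)
    symbols: List[Symbol] = field(default_factory=list)
    # file offsets of fields covered by relocation entries
    reloc_offsets: List[int] = field(default_factory=list)
```

Every later stage (disassembly, data-flow analysis, and so on) reads only this structure, so adding support for a new object format means writing one new reader, not touching the analyses.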
Next: Identifying Code and Data