RecStudio Decompiler Design - Code vs. Data 2

Prev: Switch Statements

## Code vs. Data 2

Remember that by separating code from data we achieve the following goals:

• we avoid disassembling data areas, which would confuse the loop detection and the variable lifetime algorithms;
• we detect the destinations of branches and function calls. Knowing the relationships between branches, we can compute loops, which in turn allow us to compute variable lifetimes. Knowing the destinations of function calls allows us to compute the number of parameters each function receives and the type of value it returns, and also lets us record which registers are saved across the call and which registers are trashed.
• we can rank functions based on their distance from the entry point and based on their complexity. Analyzing simpler (leaf) functions first allows us to perform more accurate analysis of the more complex functions.
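
The third goal above can be sketched as a post-order walk of the call graph: callees are ranked before their callers, so leaf functions come first. The following is a minimal, hypothetical sketch using a toy adjacency matrix; a real decompiler would build the graph from the recovered call instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: analyze callees before their callers by walking the
 * call graph in depth-first post-order.  Leaf functions (no callees) are
 * therefore ranked first.  The graph here is a toy adjacency matrix. */
#define NFUNCS 4

/* calls[i][j] != 0 means function i calls function j.
 * 0 = entry point, which calls 1 and 2; 1 calls 3; 2 and 3 are leaves. */
static const int calls[NFUNCS][NFUNCS] = {
    {0, 1, 1, 0},
    {0, 0, 0, 1},
    {0, 0, 0, 0},
    {0, 0, 0, 0},
};

static void visit(int f, int *seen, int *order, int *n) {
    if (seen[f]) return;
    seen[f] = 1;
    for (int j = 0; j < NFUNCS; j++)
        if (calls[f][j]) visit(j, seen, order, n);
    order[(*n)++] = f;              /* post-order: callees come first */
}

/* Fill 'order' with an analysis order; returns the number of functions. */
int rank_functions(int entry, int order[NFUNCS]) {
    int seen[NFUNCS] = {0};
    int n = 0;
    visit(entry, seen, order, &n);
    return n;
}
```

With the toy graph above, the entry point is always analyzed last, after all the functions it transitively calls.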

As we have seen in the previous page, switch statements pose a serious problem to the task of separating the code regions from the data regions.

The switch problem is a particular case of the more generic problem of how to handle code that is referenced through pointers. While in the switch case the pointer was automatically generated by the compiler and had a fairly limited scope (that of the switch statement basic block, and possibly that of the block's predecessors), there are other cases involving pointers to code that have a broader scope.

In this page we show some of these cases.

### Explicit Function Pointers

Languages such as C and C++ allow the use of a pointer to a function. For example:

```
C code                   Asm code
---------------          --------------------
int (*ptr)();

ptr = addr1;             L1:   R1 = addr1
                               [ptr] = R1
...                            ...
(*ptr)();                L2:   R2 = [ptr]
                               CALL [R2]
```

How does the decompiler know which function is called at L2?
Moreover, how do we know that ptr is a pointer to code? From the instructions at L1, there is no indication that addr1 is a function. Sure, we could assume that if addr1 is in the text section it is the entry point of a function, but there's no guarantee of that. Only if some other part of the code contains an explicit call to addr1 can we be sure that it is a function.
If both L1 and L2 are in the same function, we could apply type analysis to ptr and discover that it is used as a function pointer by the CALL instruction, and thus infer that addr1 is a function. However, if L1 and L2 are in different functions, then it becomes very difficult to know that ptr was assigned an address and later used in a CALL instruction.
Moreover, remember that we do the code walking during the early phases of decompilation, and at this stage we have not performed any data flow analysis, let alone any type analysis.

We could apply some heuristics to addr1 to see if it starts with the typical instruction sequence that saves the frame pointer and allocates the local stack. Still, if the heuristic fails, we won't be able to identify addr1 as a function, which means that its code, along with the code of any other function it calls, will remain unreachable to the decompiler.
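
Such a heuristic can be as simple as matching the first bytes at the candidate address against a known frame-setup idiom. The sketch below is hypothetical and uses the classic 32-bit x86 sequence "push ebp; mov ebp, esp" (55 8B EC) as its only pattern; a real decompiler would keep a table of patterns per target and compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prologue heuristic: treat an address as a likely function
 * entry if its first bytes match a frame-setup idiom.  The pattern here is
 * the 32-bit x86 sequence "push ebp; mov ebp, esp" (55 8B EC). */
int looks_like_function(const unsigned char *code, size_t len) {
    static const unsigned char prologue[] = {0x55, 0x8B, 0xEC};
    if (len < sizeof prologue)
        return 0;
    for (size_t i = 0; i < sizeof prologue; i++)
        if (code[i] != prologue[i])
            return 0;
    return 1;
}
```

Note that this is exactly the kind of heuristic that fails on optimized code: a function compiled with frame-pointer omission never matches the pattern, which is why a failed match cannot prove that the address is data.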

One special case of the above is a form of constant folding applied to CALL instructions, typically performed by the Microsoft compiler and the Green Hills compilers.
When this optimization is performed, for example to reduce the size of the generated code, a register is assigned the address of the function and is later used in the CALL instruction, without the address ever being stored to a memory location.
This case is a bit easier to deal with, since a simple constant propagation algorithm should be able to relink the CALL instruction with its destination address. For example:

```
C code               Asm code                Decompiled code
---------------      -----------------       ----------------

func1();             L1:   R1 = addr1
                           CALL [R1]         addr1();
...                        ...
func1();                   CALL [R1]         addr1();
...                        ...
func2();                   CALL func2        func2();
...                        ...
func1();                   CALL [R1]         addr1(); // ?
------------------------------------------------------------

SymbolTable.defineFunction(addr1)
```

In this case we can see that there are no other assignments to R1, and that all statements are reachable from the same function, so we can convert the CALL [R1] instructions into CALL addr1 (as long as we know that func2 does not change the value of R1!).
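
The propagation just described can be sketched as a single linear pass over a toy intermediate representation. Everything below (the IR shape, the field names, the "preserves registers" flag) is an illustrative assumption, not RecStudio's actual data structures; the point is that a direct call must kill the tracked constant unless the callee is known to preserve the register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the constant propagation described above, on a toy
 * three-opcode IR.  LOAD_CONST sets a register to a code address; CALL_IND
 * calls through a register; CALL_DIR calls a named function and kills the
 * tracked constant unless the callee is known to preserve the register. */
enum op { LOAD_CONST, CALL_IND, CALL_DIR };

struct insn {
    enum op op;
    int reg;            /* register written (LOAD_CONST) or read (CALL_IND) */
    unsigned long imm;  /* constant address, or callee for CALL_DIR */
    int preserves_regs; /* CALL_DIR only: callee saves all registers */
    unsigned long dest; /* filled in: resolved destination, 0 if unknown */
};

/* Resolve indirect calls through 'reg'; returns how many were resolved. */
int propagate(struct insn *code, size_t n, int reg) {
    unsigned long cur = 0;   /* 0 means "value of reg unknown" */
    int resolved = 0;
    for (size_t i = 0; i < n; i++) {
        switch (code[i].op) {
        case LOAD_CONST:
            if (code[i].reg == reg) cur = code[i].imm;
            break;
        case CALL_IND:
            if (code[i].reg == reg && cur != 0) {
                code[i].dest = cur;   /* CALL [R] becomes CALL addr */
                resolved++;
            }
            break;
        case CALL_DIR:
            if (!code[i].preserves_regs) cur = 0;  /* callee may trash reg */
            break;
        }
    }
    return resolved;
}
```

Applied to the listing above, the pass rewrites every CALL [R1] into CALL addr1 only if the intervening CALL func2 is known not to modify R1; otherwise the final call stays indirect, which is exactly the "// ?" case.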

In case there are multiple possible values for the destination, we can still detect that each value is a function address, even though we will not always be able to replace the CALL instructions. In the following example var is a local variable and all statements are in the same function:

```
C code               Asm code                Decompiled code
---------------      -----------------       ----------------

void (*var)();

var = addr1;         L1:   R1 = addr1        var = addr1;
(*var)();                  CALL [R1]         addr1();
...                        ...
(*var)();                  CALL [R1]         addr1();
...                        ...
if (expr)                                    if (expr)
   var = addr2;            R1 = addr2           var = addr2;
...                        ...
(*var)();            L2:   CALL [R1]         (*var)();
------------------------------------------------------------

SymbolTable.defineFunction(addr1)
SymbolTable.defineFunction(addr2)
```

In this case, we can use use-def chains to identify all possible values of R1 at L2, and define a function for each of the addresses (in this case addr1 and addr2). However, we won't be able to convert the CALL [R1] instruction into a direct call, and we won't be able to eliminate the local variable var, since its value is not constant at L2 (it depends on whether the if() statement was true).
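
This step can be sketched as follows: the use-def chain hands us the set of definitions of R1 that reach the CALL at L2, every distinct address in that set is defined as a function, and the call is converted to a direct call only when a single value reaches it. The symbol-table shape and function names below are illustrative assumptions, not RecStudio's actual API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: given the definitions of R1 that reach an indirect
 * CALL (as reported by a use-def chain), record each distinct address as a
 * function entry.  When more than one value reaches the call, the call
 * stays indirect, but every address still gets defined. */
#define MAX_FUNCS 16

struct symtab {
    unsigned long funcs[MAX_FUNCS];
    size_t nfuncs;
};

static void define_function(struct symtab *st, unsigned long addr) {
    for (size_t i = 0; i < st->nfuncs; i++)
        if (st->funcs[i] == addr) return;      /* already known */
    if (st->nfuncs < MAX_FUNCS)
        st->funcs[st->nfuncs++] = addr;
}

/* Returns the resolved destination if exactly one value reaches the call,
 * or 0 if the call must remain indirect. */
unsigned long resolve_call(struct symtab *st,
                           const unsigned long *defs, size_t n) {
    if (n == 0) return 0;
    unsigned long only = defs[0];
    for (size_t i = 0; i < n; i++) {
        define_function(st, defs[i]);
        if (defs[i] != only) only = 0;   /* multiple values: stay indirect */
    }
    return only;
}
```

In the example above, both addr1 and addr2 reach L2, so both get defined as functions while the CALL [R1] at L2 remains indirect.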

### Implicit function pointers

Languages such as C++ and Java have the concept of a "virtual" method. Virtual methods are nothing more than function pointers associated with the class or struct that defines them. When a method is marked as "virtual", the compiler will always use an indirect call to access the method, taking the address from a special table (the "virtual table") associated with the instance of that class (the this pointer).
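
The lowering described above can be illustrated in plain C: each object carries a pointer to a per-class table of function pointers, and the call site loads the method address from that table instead of naming the function directly. The exact layout varies between compilers; the single-slot table below is only the general shape, not any particular ABI.

```c
#include <assert.h>

/* Hypothetical sketch of what a virtual call lowers to: the object holds a
 * vtable pointer, the vtable holds one function pointer per virtual method,
 * and the call site performs an indirect call through the chosen slot. */
struct object;
typedef int (*method_t)(struct object *self);

struct vtable { method_t get_value; };   /* one slot per virtual method */
struct object { const struct vtable *vptr; };

static int base_get_value(struct object *self)    { (void)self; return 1; }
static int derived_get_value(struct object *self) { (void)self; return 2; }

static const struct vtable base_vtbl    = { base_get_value };
static const struct vtable derived_vtbl = { derived_get_value };

/* The indirect call the decompiler sees: load vptr, index the slot, CALL. */
int call_get_value(struct object *obj) {
    return obj->vptr->get_value(obj);
}
```

From the decompiler's point of view, call_get_value contains exactly the problematic pattern: a CALL through a pointer loaded from a table whose contents are only known if the vtables themselves have been identified as data referencing code.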

Different compilers use different methods to retrieve the method pointer from the virtual table of each class. These methods are not officially documented, since they are not part of the Application Binary Interface of an architecture, which implies that it's not generally possible to mix libraries compiled with one C++ compiler with code that was compiled with another C++ compiler.

This is a major problem for a decompiler: not only must the decompiler infer that the code originally came from a C++ source rather than a C source, it must also infer which compiler was used and apply that compiler's access patterns to identify where the virtual table for each class is located and where the virtual method pointers are.

Next: Call Optimizations