Decompiler Design - Call Optimizations

 



Call Optimizations

The theme emerging from the previous two pages is that analyses limited to a single basic block may not be sufficient to recover information about a statement in that block.

Here we show some other cases that can hamper even further the decompiler's ability to analyze the instruction stream.

These cases typically occur in optimized code, especially when optimization is not limited to a single function but extends to a whole source module (compilation unit).

Tail merging (cross-jumping)

This optimization is performed at the join point of two or more basic blocks when the code at the end of each predecessor is identical. In this case, a compiler may choose to merge the identical code and move the join point to the first non-identical instruction.

Consider the following example:

C code                  Non-optimized             Optimized
-------------           -------------------       ---------------
if(cond)                    CMP R1, cond              CMP R1, cond
                            JNE Lelse                 JNE Lelse
   x[i] = func1(1);         PUSH 1                    PUSH 1
                            CALL func1                CALL func1
                            SP += 4                   JMP Ljoin
                            R1 = i
                            x[R1*4] = R0
                            JMP L1
else                    Lelse:                    Lelse:
   x[i] = func2(2);         PUSH 2                    PUSH 2
                            CALL func2                CALL func2
                                                  Ljoin:
                            SP += 4                   SP += 4
                            R1 = i                    R1 = i
                            x[R1*4] = R0              x[R1*4] = R0
                        L1:

This optimization has clear advantages, such as smaller code and a higher cache hit ratio. Obviously it is not always possible for the compiler to perform it, since several conditions have to be met. Still, this optimization is frequently seen in assembly code.

Unfortunately, this affects the decompiler's algorithm that detects parameter passing, because the decompiler now has to cross a block boundary to see whether the instruction after the CALL is one that deallocates parameters (the SP += 4 instruction).
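For illustration, below is a minimal sketch in C of a purge-detection routine that follows an unconditional jump into the join block before giving up. The IR types and field names (insn_t, sp_delta, and so on) are hypothetical; a real decompiler would use its own intermediate representation.

#include <stddef.h>

/* Hypothetical IR: just enough to express the problem. */
typedef struct insn insn_t;
struct insn {
    enum { I_SP_ADD, I_JMP, I_OTHER } kind;
    int     sp_delta;   /* for I_SP_ADD: bytes added to SP */
    insn_t *next;       /* next instruction in the same block */
    insn_t *target;     /* for I_JMP: first instruction at the target */
};

/* Return the number of bytes purged from the stack after `call`,
 * or 0 if no purge instruction is found.  A purely local scan
 * would stop at the block boundary; this one follows an
 * unconditional JMP into the join block, so it still finds the
 * SP += 4 that tail merging moved there. */
static int purged_bytes_after(const insn_t *call)
{
    const insn_t *p = call->next;
    int hops = 0;
    while (p != NULL && hops < 4) {     /* small safety bound */
        if (p->kind == I_SP_ADD)
            return p->sp_delta;         /* found the purge */
        if (p->kind != I_JMP)
            return 0;                   /* unrelated instruction */
        p = p->target;                  /* cross the block boundary */
        hops++;
    }
    return 0;
}

The following optimization is another one that affects the computation of parameter passing and stack frame allocation.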

Stack deallocation coalescing

When there are several function calls in a basic block, the compiler may choose to leave the parameters on the stack between the call instructions and deallocate them all at once at the end of the sequence.

Consider the following example:

C code                  Non-optimized             Optimized
-------------           -------------------       ---------------
   a = func1(1);        PUSH 1                    PUSH 1
                        CALL func1                CALL func1
                        SP += 4
                        a = R0                    a = R0
   b = func2(2,3);      PUSH 3                    PUSH 3
                        PUSH 2                    PUSH 2
                        CALL func2                CALL func2
                        SP += 8
                        b = R0                    b = R0
   func3(func4(4));     PUSH 4                    PUSH 4
                        CALL func4                CALL func4
                        SP += 4
                        PUSH R0                   PUSH R0
                        CALL func3                CALL func3
                        SP += 4                   SP += 20

As you can see, the optimized sequence has fewer instructions, at the price of using more stack space. This is normally not a big problem on Unix or Windows systems. However, it may be a problem on embedded systems, where stack space may be limited.
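To recover the individual purge sizes from a coalesced adjustment such as SP += 20, a decompiler can check that the purge equals the sum of the argument sizes of the preceding calls, when those sizes are known from prototypes or from prior analysis of the callees. Below is a minimal C sketch of that consistency check; all names are hypothetical.

#include <stdio.h>

/* Hypothetical summary of one CALL inside the basic block. */
typedef struct {
    const char *callee;
    int arg_bytes;      /* outgoing argument bytes, from the callee's
                         * prototype or from prior analysis of it */
} call_t;

/* Split a coalesced purge among the calls that precede it.
 * Returns 0 on success, -1 if the purge does not match the sum
 * of the callees' argument sizes (deeper analysis is needed). */
static int distribute_purge(const call_t *calls, int ncalls, int purge)
{
    int total = 0;
    for (int i = 0; i < ncalls; i++)
        total += calls[i].arg_bytes;
    if (total != purge)
        return -1;
    for (int i = 0; i < ncalls; i++)
        printf("%s purges %d bytes\n", calls[i].callee, calls[i].arg_bytes);
    return 0;
}

int main(void)
{
    /* The calls from the example above, ending with SP += 20. */
    const call_t calls[] = {
        { "func1", 4 }, { "func2", 8 }, { "func4", 4 }, { "func3", 4 },
    };
    return distribute_purge(calls, 4, 20) != 0;
}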

A variation of the above scheme, typically used by GCC, is to pre-allocate the stack space and use only indirect move instructions to pass the parameters, as in the following example:

C code                  Asm code
-------------           -------------------
func()
{                       SP -= 8+locals_size
   a = func1(1);        SP[0] = 1
                        CALL func1
                        a = R0
   b = func2(2,3);      SP[4] = 3
                        SP[0] = 2
                        CALL func2
                        b = R0
   func3(func4(4));     SP[0] = 4
                        CALL func4
                        SP[0] = R0
                        CALL func3
Note that the code uses the smallest stack size needed (8 bytes) and has the smallest code size, since there is no need to deallocate the parameters after each call. The restriction is that none of the called functions can use the (...) prototype, i.e. the compiler must know exactly how many parameters are expected by each function.

When code like this is generated, it is very difficult for the decompiler to compute how many parameters are expected by each function, because there is no indication in the instruction stream at the call point of how many bytes are passed. In fact, each SP[x] = y instruction could just be assigning a value to a local variable allocated on the stack. Therefore the decompiler must rely on the prototype of the called function (if available), or on the data accesses gathered when analyzing each called function, to know whether a particular location SP[x] is a local variable or a passed parameter.
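Here is a minimal C sketch of the decision the decompiler has to make for each such store, assuming earlier passes have recorded, when possible, how many bytes of stack arguments each callee consumes; all names are hypothetical.

#include <stddef.h>

/* Hypothetical per-function knowledge from earlier passes. */
typedef struct {
    int arg_bytes;      /* bytes of stack arguments the function reads,
                         * or -1 if unknown (no prototype, callee not
                         * yet analyzed) */
} func_info_t;

typedef enum { SLOT_ARGUMENT, SLOT_LOCAL, SLOT_UNKNOWN } slot_kind_t;

/* Classify a store SP[offset] = y that precedes a CALL.  Stores
 * below the callee's argument-area boundary are outgoing arguments;
 * everything above it is (probably) a local variable.  Without
 * information about the callee the decision must be deferred. */
static slot_kind_t classify_store(int offset, const func_info_t *callee)
{
    if (callee == NULL || callee->arg_bytes < 0)
        return SLOT_UNKNOWN;
    return offset < callee->arg_bytes ? SLOT_ARGUMENT : SLOT_LOCAL;
}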

Note that the same limitation is present when passing parameters in registers.

Call-return optimization (tail call)

When a function call is the last statement in a function, a compiler may choose to generate a jump instead of a call followed by a return. For example:

C code              Non-optimized          Optimized
-------------       -----------------      ---------------
func1()
{                   PUSH FP                PUSH FP
                    FP = SP                FP = SP
     ...
     func2();       CALL func2             SP = FP
     return;                               POP FP
}                   SP = FP                JMP func2
                    POP FP
                    RET

This optimization allows the compiler to generate more efficient code, since a CALL followed by a RET is equivalent to a JMP instruction: the JMP does not use the stack to save a return address, and the processor executes only one RET instruction instead of two, further speeding up execution.

This optimization may confuse a decompiler, because now we effectively have a function with two entry points. One entry point is func1 and the other is func2; both entry points lead to a single exit point, since they share the final RET instruction.

Luckily, this case is easier to handle than the previous ones: once the frame setup instructions are eliminated by the standard stack processing algorithm, the control flow graph will still be legal, at the price of duplicating the code of func2() in the body of func1(). Even the algorithm that discovers function entry points will still work, since func2() will probably be called from other parts of the program and thus be recognized as a function entry point and entered as such in the symbol table.

A particularly smart decompiler can check whether a JMP instruction leads to a function entry point and handle the stack frame deallocation instructions preceding it as a special case.
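A minimal C sketch of that check follows; is_function_entry() stands in for a symbol table lookup and is stubbed out here, and the other names are hypothetical as well.

#include <stdbool.h>

/* Hypothetical summary of the last instructions of a function. */
typedef struct {
    unsigned long jmp_target;   /* destination of the final JMP */
    bool          has_epilogue; /* SP = FP; POP FP seen just before it */
} tail_jmp_t;

/* Stand-in for a symbol table query; a real decompiler would look
 * the address up among the already discovered function entries. */
static bool is_function_entry(unsigned long addr)
{
    (void)addr;
    return true;                /* stub */
}

/* True when the final JMP should be decompiled as
 * "func2(); return;" rather than as intra-function control flow:
 * the stack frame is already torn down and the target is a known
 * function entry point. */
static bool is_call_return_optimized(const tail_jmp_t *t)
{
    return t->has_epilogue && is_function_entry(t->jmp_target);
}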

