|
Prev: Code vs. Data 2
Call Optimizations
The theme that is arising from the previous two pages shows that
analyses limited to a basic block may not be sufficient to
retrieve the information about a statement in the block.
Here we show some other cases that could potentially hamper
even more the ability of the decompiler to analyze the instruction
stream.
These cases typically occur in optimized code, especially
when the optimization spans one function and is extended to a
whole source module (compilation unit).
Tail call optimization
This optimization is performed on the join point of one or more
basic blocks when the code at the end of each predecessor is the
same. In this case, a compiler may chose to merge the identical
code, and move the join point to the first non-identical instruction.
Consider the following example:
C code Non-optimized Optimized
------------- ------------------- ---------------
if(cond) CMP R1, cond CMP R1, cond
JNE Lelse JNE Lelse
x[i] = func1(1); PUSH 1 PUSH 1
CALL func1 CALL func1
SP += 4 JMP Ljoin
R1 = i
x[R1*4] = R0
JMP L1
else Lelse: Lelse:
x[i] = func2(4); PUSH 2 PUSH 2
CALL func2 CALL func2
Ljoin:
SP += 4 SP += 4
R1 = i R1 = i
x[R1*4] = R0 x[R1*4] = R0
L1:
This optimization has clear advantages, such as smaller code and higher
cache hit ratio. Obviously it's not always possible for the compiler
to perform it, since there are several conditions that have to be
met. Still, it's frequent to see this optimization in assembly code.
Unfortunately this affects the decompiler's algorithm that detects
parameter passing, because now the decompiler has to cross a block
boundary to see if the instruction after the CALL is one that
deallocates parameters (the SP += 4 instruction). The following
optimization is another one that affects the computation of
parameter passing and stack frame allocation.
Stack deallocation coalescing
When there are several function calls in a basic block, the
compiler may chose to leave the parameter on the stack between
the call instructions, and only deallocate them at the end of
the sequence.
Consider the following example:
C code Non-optimized Optimized
------------- ------------------- ---------------
a = func1(1); PUSH 1 PUSH 1
CALL func1 CALL func1
SP += 4
a = R0 a = R0
b = func2(2,3); PUSH 3 PUSH 3
PUSH 2 PUSH 2
CALL func2 CALL func2
SP += 8
b = R0 b = R0
func3(func4(4)); PUSH 4 PUSH 4
CALL func4 CALL func4
SP += 4
PUSH R0 PUSH R0
CALL func3 CALL func3
SP += 4 SP += 20
As you can see the optimized sequence has less instructions,
at the price of using more stack space. This is normally not
a big problem on Unix or Windows systems. However it may be
a problem on embedded systems, where the stack space may
be limited.
A variation of the above scheme, typically used by GCC,
is to pre-allocate the stack space and use only indirect
move instructions to pass the parameters, as in the following
example:
C code Asm code
------------- -------------------
func()
{ SP -= 8+locals_size
a = func1(1); SP[0] = 1
CALL func1
a = R0
b = func2(2,3); SP[4] = 3
SP[0] = 2
CALL func2
b = R0
func3(func4(4)); SP[0] = 4
CALL func4
SP[0] = R0
CALL func3
Note that the code uses the smalles stack size needed (8 bytes)
and uses the smalles code size, since there is no need to
de-allocate the parameters after each call. The restriction
is that none of the functions can use the (...) prototype,
i.e. the compiler must know exactly how many parameters are
expected by each function.
When code like this is generated it's very difficult for
the decompiler to compute how many parameters are expected
by each function, because there is no indication in the
instruction stream at the call point of how many bytes
are passed. In fact, each SP[x] = y instruction could just
be assigning a value to a local variable allocated on the
stack. Therefore the decompiler must rely on the prototype
of the called function (if available), or on the data accesses
gathered when analyzing each called function, to know
whether a particular location SP[x] is a local variable
or is a passed parameter.
Note that the same limitation is present when
passing parameters in registers.
Call-return optimization
When a function call is the last statement in a function,
a compiler may chose to generate a jump instead of a call followed
by a return. For example:
C code Non-optimized Optimized
------------- ----------------- ---------------
func1()
{ PUSH FP PUSH FP
FP = SP FP = SP
...
func2(); CALL func2 SP = FP
return; POP FP
} SP = FP JMP func2
POP FP
RET
This optimization allows the compiler to generate more
efficient code, since a CALL+RET is equivalent to a JMP
instruction; the JMP instruction does not use the stack
to save the return address to the called function and also
the processor will execute only one RET instruction instead
of 2, further speeding up execution.
This optimization may confuse a decompiler because now
we effectively have a function with 2 entry points.
One entry point is func1, and the other is func2; both
entry points will lead to a single exit point, since
they share the final RET instruction.
Luckily this case is easier to handle than the previous
ones, since once the frame set up instruction are
eliminated by the standard stack processing algorithm,
the control flow graph will still be legal, at the
price of duplicating the code of func2() in the body
of func1(). Even the algorithm that discovers function
entry points will still work, since func2() will probably
be called by other portions of the program and thus
be recognized as a function entry point, and entered as
such in the symbol table.
A particularly smart decompiler can check whether a JMP
instruction goes to a function entry point and handle
the stack frame deallocation instructions preceding it
as a specific case.
|