Decompiler Design - Expression Propagation
Prev: Data Flow Analysis
Having available liveness information for each instruction helps tremendously in compressing multiple simple expressions into more complex ones. This is especially true in RISC chips, where every single instruction operates on some register, but it is also true for CISC chips with few registers.
We can now compress the various sequences of instructions without worrying about removing a reference to a register that may be used in a subsequent instruction. For example we'll be able to distinguish the safe case:
R1 = var_1
R2 = R1 + 1 var_2 = var_1 + 1
var_2 = R2
Expression propagation also improves code structuring, because it creates legal expressions out of statements, and expressions can be joined through the && and || operators:
MOV R1, var_1
CMP R1,#10 != 10
JEQ L1 &&
MOV R1, var_2 var_2 if(var_1 != 10 && var_2 != 20)
CMP R1,#20 != 20)
JNE L2 goto L2; goto L2;
L1: ... L1: ...
Several complex expressions can be built by scanning the list of instructions and saving for each assignment to a register the right-hand-side of the assignment. Later, when the register is used, we can replace it with the right-hand-side we saved earlier.
Here's an algorithm that uses this approach:
for each block in CFG
b.savedRegs = new BitSet(processor.getNumberOfRegs)
for each block in CFG
for each instruction i in block
if i.op == Assign && isRegister(i.dest)
saveDefinition(block, i) // remember rhs
replaceUses(block, i.dest) // e.g. for indexed addr. modes
saveDefinition(Block b, Node i)
b.savedRegs[i.dest.regnum] = i
replaceUses(Block b, Node n)
if n == null
n.op1 = replaceUses(n.op1)
n.op2 = replaceUses(n.op2)
if n.op == Register && b.savedRegs[n.regnum] != null
reg = n.regnum
b.savedRegs[reg] = 0 // keep assignment to reg
n = b.savedRegs[reg].op1 // get rhs of original assignment
b.savedRegs[reg] = 0 // remove original assignment to reg
This is still a simplistic algorithm, which only works on a basic block at a time. Enhancements to the algorithm would be to allow the saved expressions to propagate to other basic blocks. However, we need to be careful not to allow propagation across instructions that may change any value that's in the saved array. This is a common problem, referred in compiler implementation literature as the alias detection problem.
Consider the following code:
R1 = var_1
*R2 = 0 *R2 = 0
var_2 = R1 var_2 = var_1 ?
In this code, we cannot move the reference to var_1 to the final statement unless we know for sure that the *R2 = 0 assignment will not change the var_1 location. Since R2 contains a pointer, we have to make sure that R2 doesn't contain the value &var_1, otherwise the resulting decompiled program would not behave the same as the input binary code.
This problem is very difficult to solve; even an advanced compiler cannot always decide whether a variable "has been aliased" by a pointer. In this case, we have to be conservative, thus avoiding moving var_1 to the last statement.
A lot of work can be put in deciding whether it is safe to keep the definition of R1 or whether to remove it. The conservative approach is to invalidate the savedRegs entry for R1 (and actually for all other registers) whenever there is an assignment through an indirect pointer. It is always safe to keep references to constant values, including fixed address references, which by definition cannot be aliased.
Another case where it is difficult to decide what to remember and what to forget is when there are function calls in one of the statements:
R1 = local_1
call f f()
var_2 = R1 var_2 = local_1 ?
In this case it's okay to keep local_1 in R1 across the call to f(), but only under certain conditions:
It's very difficult, although not impossible, to keep track of which address is saved to global variables, and which is used by another function. This case is very rare, though, so it should be safe to assume that a function does not change aliased locations in global memory (for example we can analyze the function and decide that it does not write any global memory location).
The second condition is somewhat easier to compute, although even in this case there are common cases where it is not known which registers are changed by the function. There are 2 ways to compute register use by a function:
Next: The Stack