Decompiler Design - Creating Statements

 

Prev: Loop Analysis


Creating Statements

With the information we have gathered in the previous two steps (basic blocks and loops) we now have all we need to convert control flow instructions into high-level statements.

This process is called "structuring" in the literature, because we re-create the original program structure from an unstructured sequence of instructions.

Control flow instructions (branches, goto, labels) are converted into the following set of C statements:
do { break; continue; } while;
while() { break; continue; }
for() { break; continue; }
if() { }       if() { } else { }
return
goto label

The final goal is to remove all goto and all labels.

We use a proven technique called "interval analysis". The general approach is to replace a number of basic blocks which are identified to have a known pattern into a single basic block, that is into a statement that has only one entry and only one exit. We'll do this by moving statements from the containing scope (initially there is only one scope, the entire procedure), into the scope(s) of the control statement, effectively creating a tree of basic blocks.

Structuring loops

We start by structuring loop statements. The reason is that we want to convert goto instructions into break and continue statements before we attempt to convert conditional branches into if-else statements.

Having the list of loops in a procedure, we first identify the most nested loop (a loop that has no other loops inside). We then replace the header block of each loop with a do-while statement, and we move all blocks belonging to the loop into the do-while statement.

sort loopSet from innermost loop to outermost loop
for each loop in loopSet {
    doWhile = new Statement(DoWhile)
    doWhile->expr = new Expr(loop->blocks.last->lastCompareInstr)
    doWhile->blocks = loop->blocks
    doWhile->breakBlock = loop->blocks->postDominator
    doWhile->continueBlock = loop->blocks.last

    if doWhile->continueBlock->onlyStatement != If or
       doWhile->continueBlock->onlyStatement.destination != header
        doWhile->continueBlock = NULL

    loop->header->lastStatement = doWhile
    StructureBreakContinue(doWhile, doWhile->continueBlock, doWhile->breakBlock)
}

StructureBreakContinue(Statement stmt, Block contBlock, Block breakBlock)
{
    switch stmt.type {

    case Goto:
        if stmt.destination == contBlock
            stmt.type = Continue
        else if stmt.destination == breakBlock
            stmt.type = Break
        break

    case If:
        if stmt.elseBlocks
            for each statement in stmt.elseBlocks
                StructureBreakContinue(statement, contBlock, breakBlock)
       
        if stmt.trueBlocks
            for each statement in stmt.trueBlocks
                StructureBreakContinue(statement, contBlock, breakBlock)
    }
}

The above algorithm converts the following sequence:

   ....                             ....
Lheader:
                                    do {
   ....                                 .... 
   if(expr)                             if(expr)
      goto Lcontinue                        continue;
   ....                                 ....
   if(expr)                             if(expr)
      goto Lbreak                           break;
   ...                                  ....
Lcontinue:
   if(expr)                         } while(expr);
      goto Lheader
Lbreak:
   ...

As you can see, 2 goto statements and 2 labels have been removed. Note how we need to visit innermost loops first, and how we only descend the if-else statements but not do-while statements that we already discovered. This is because a break or continue statement only applies to the closest loop, not to any outer loop that may be active at that statement. goto statements that jump out of the innermost loop will remain as gotos in the final output.

More loop structuring

Now that we have identified the basic loop statements we can improve the output by trying to eliminate even more gotos and labels, and also by creating more familiar loop constructs. After all, the do-while loop is not used as often as the more common while and for loops. These can now be created by checking a few conditions and rewriting the do-while loops that we just generated.

   if(expr)                         while(expr) {
      goto Lbreak;
   do {
      ....                              ....
   } while(expr);                   }
Lbreak:
   ....                             ....

Note that the break and continue points have remained the same during the conversion from do-while to while.

We can now go further and convert while loops into for loops by checking if there is an initialization and an increment statement for the same variable:

   v = init_expr
   while(expr comparing v) {       for(v = init_expr; expr comparing v; v = incr) {
      ....
Lcontinue:
      v = incr // e.g. v += 1
   }                               }
   ....

Note that for loops have a different continue point than do-while and while loops, therefore the initial conversion of if-goto to if-continue statements would have failed. But now that we know the continue point for the loop, we can call StructureBreakContinue for the statements inside the for loop, so that gotos to the continue point are converted into continue statements. 

Structuring if statements

Now that we have converted loop statements we can try to eliminating the remaining goto statements. These should be part of if or if-else statements, so we'll try to identify basic block configurations that allow us to convert if-goto sequences into if or if-else sequences.

The basic idea is the same as we used in structuring loops: we remove all blocks that belong to the if-goto sequence and make them children of the containing if-else statement.

StructureIfElse(Block block)

   if block.succ.size != 2
       return false
   trueBlock = block.succ[0]
   falseBlock = block.succ[1]
   if trueBlock.succ.size != 1 or falseBlock.succ.size != 1
       return false
   if falseBlock.succ[0] != trueBlock.succ[0]
       return false

   ifStmt = new Statement(If)
   ifStmt.expr = NegateCondition(block.lastStmt)
   ifStmt.trueBlocks = trueBlock
   ifStmt.falseBlocks = falseBlock
   removeLastGoto(trueBlock, trueBlock.succ[0])
   removeLastGoto(falseBlock, falseBlock.succ[0])

   return true 

The above algorithm converts patterns like the following into if-else statements.

   if(expr)                     if(!expr) {
      goto Lfalse
   ....                             ....
   goto Lelse                   }
Lfalse:                         else {
   ....                             ....
Lelse:                          }
   ....                         .... 

Note that we have to negate the condition to enter the true block (alternatively we can swap the true and false blocks and keep the same condition). This pattern will eliminate 2 gotos and possibly 2 labels.

A variant of the previous algorithm has no else portions. That is:

StructureIfs(Block block)

   if block.succ.size != 2
       return false
   trueBlock = block.succ[0]
   falseBlock = block.succ[1]
   if trueBlock.succ.size == 1 and trueBlock.succ[0] == falseBlock
       ifStmt = new Statement(If)
       ifStmt.expr = NegateCondition(block.lastStmt)
       ifStmt.trueBlocks = trueBlock
       ifStmt.falseBlocks = NULL
       removeLastGoto(trueBlock, falseBlock)
 

The above variant takes care of the following case:

   if(expr)                     if(!expr) {
      goto Lfalse
   ....                             ....
Lfalse:                         }

More if structuring

Not all if-goto pairs will be converted by the above algorithms. Consider the following pattern:

    if(expr1)
        goto Lfalse
    if(expr2)
        goto Lfalse
    ....
Lfalse:
    ....

The above sequence is indicative of a single if statement with two conditions. That must be converted into a single expression concatenating the 2 conditions with an && operator:

    if(!expr1 && !expr2) {
        ....
    }

There is no limit to the number of expressions that can be collapsed into the same if statement, so recognizing the above pattern should be repeated as long as we can collapse more and more expressions. Similarly, we can recognize sequences of expressions that jump to the true block, and concatenate them using the || operator:

    if(expr1)                   if(expr1  ||
        goto Ltrue
    if(expr2)                      !expr2) {
        goto Lfalse
Ltrue:
    ....                            ....
Lfalse:                         }
    ....

Limitations

The above algorithms and patterns work fairly well for specific instruction sequences. However their fairly simplistic approach fails miserably in presence of even the slightest disturbance in the instruction sequence. In particular the presence of complex expressions prevents the collapsing of if sequences. Consider the following sequence:

    MOV  R1, var_1              R1 = var_1;
    CMP  R1,#10                 if(R1 == 10 ||
    JEQ  L1
    CMP  R1,#20                    R1 == 20) {
    JNE  L2
L1: ....
L2:                             }

The sequence maps nicely into the if-or pattern. However, having a more complex expression sequence will break the mapping:

    MOV  R1, var_1               R1 = var_1;     // basic block 1
    CMP  R1,#10                  if(R1 == 10)
    JEQ  L1                          goto L1;
    MOV  R1, var_2               R1 = var_2;     // basic block 2
    CMP  R1,#20                  if(R1 != 20)
    JNE  L2                          goto L2;
L1: ....                     L1: ...
L2:                          L2:

This is because the second basic block has multiple statements as opposed to a single if statement. Since an expression cannot contain statements, we cannot collapse the two compare instructions into a single if expression.

In order to be able to reduce the number of statements, thus opening the door for further statement structuring, we need to collapse simple expressions into more complex ones.

Data flow analysis helps in reducing the non-control-flow statements, and opens the door to better statement structuring.


Next: Data flow analysis