RecStudio Decompiler Design - Switch Statements

The switch statement creates special problems to a decompiler, because usually a compiler has a few options to choose from when generating code for switch statements. Some of these sequences can be mapped to sequences of statements that we already considered, and thus will be handled automatically by the statement structuring algorithm in the Intermediate section.

An example of one of these sequences is:

   switch (expr) {         MOV R1, expr         R1 = expr
   case 1:                 CMP R1, #1           if (R1 == 1) {
                           JNE L_100
        stmt1;                ...                   ...
        break;             JMP L_break          } else
   case 100:         L_100:
                           CMP R1, #100         if (R1 == 100) {
                           JNE L_1000
        stmt2;                ...                   ...
        break;             JMP L_break          } else
   case 1000:        L_1000:
                           CMP R1, #1000        if (R1 == 1000) {
                           JNE L_default
        stmt3;                ...                   ...
        break;             JMP L_break          } else {
   default:          L_default:
        stmt3;                ...                   ...
        break;
   }                 L_break:                   }

This switch statement maps directly into a sequence of if-else statements. The main characteristic of this kind of statement is the re-use of the same value in each if statement, and the same destination for each block. Expression propagation cannot remove the R1 register because it is used multiple times, but statement structuring can check if the expression has no side-effects (i.e. uses a simple value, with no memory changes and no function calls) and it can check if all if blocks have the same successor. If these conditions are met, then the statetement structuring algorithm can collapse all if-else blocks into a switch statement, thus removing the R1 register.

Unfortunately this only works if the case values are very sparse. If the case values are consecutive, like 1, 2, 3, 4, etc., most compilers generate a jump table to the various case labels, and use an indirect jump to one of these labels. Here's an example of such code:

   switch (expr) {         MOV R1, expr             R1 = expr
                           CMP R1, #1               if (R1 < 1)
                           JLT L_default                goto L_default;
                           CMP R1, #3               if (R1 > 3)
                           JGT L_default                goto L_default;
                           SUB R1, #1
                           SHL R1, #2
                           MOV R2, jmpTable[R1]     R2 = JmpTable[(R1 - 1)]
                           JMP [R2]                 goto *R2;   // ?

   case 1:            L_1:                       L_1?
        stmt1;                ...                       ...
        break;             JMP L_break              goto L_break;
   case 2:            L_2:                       L_2?
        stmt2;                ...                       ...
        break;             JMP L_break              goto L_break;
   case 3:            L_3:                       L_3?
        stmt3;                ...                       ...
        break;             JMP L_break              goto L_break;
   default:           L_default:                 L_default:
        stmt3;                ...                       ...
        break;
   }                  L_break:                   L_break?

Here we have a number of problems, all caused by the indirect jump JMP [R2]. This instruction cannot be mapped directly to a high-level language, hence the ? next to the goto *R2; statement. This statement is illegal and has to be converted to something else in order to generate recompilable code. Unfortunately, the only way to convert that statement into a legal statement, is to recognize the previous instructions as part of a switch statement. This is not so obvious, because we need to analyze more than one block. We could still leave the 2 if statements that check the lower and upper boundaries of the case values table, and only analyze the access to the jump table.

It's worth noting that the above code is only one of the possible code sequences that compilers generate for a switch statement. There are several variations that conspire against the implementation of a generic algorithm for switch statements. Some of these variations include having a jump table of offsets instead of straight addresses, or a table of value-address pairs which can be scanned by a loop.

These variations suggest that detection of switch statement be implemented in a processor- or compiler-dependent module, where the different code sequences can be better detected.

Detecting switch statements is also very important to improve code/data separation. In fact, if you remember, the code/data detection algorithm relied on the fact that calls and jumps can be followed to their destinations. But here we cannot be sure of the destination of the indirect jump, because the destination depends on the value of a computed expression. Surely we know where the jump table is (when a jump table is used) but we won't know how many values there are in the table unless we analyze the if statements that check the expression boundaries.

Note something very important here: we cannot recognize the if statements if we cannot construct the basic blocks for the procedure. But we cannot construct all the basic blocks of the procedure unless we recognize the if statements that check for the case value boundaries!

How do we solve this problem?

Next: Code vs. Data 2