|
Prev: Advanced Topics
Decompiling switch statements
The switch statement creates special problems to a decompiler, because usually a
compiler has a few options to choose from when generating code for switch
statements. Some of these sequences can be mapped to sequences of statements
that we already considered, and thus will be handled automatically by the
statement structuring algorithm in the Intermediate section.
An example of one of these sequences is:
switch (expr) { MOV R1,
expr R1 = expr
case 1:
CMP R1, #1 if (R1 ==
1) {
JNE L_100
stmt1;
...
...
break;
JMP L_break } else
case 100: L_100:
CMP R1, #100 if (R1 == 100) {
JNE L_1000
stmt2;
...
...
break;
JMP L_break } else
case 1000: L_1000:
CMP R1, #1000 if (R1 == 1000) {
JNE L_default
stmt3;
...
...
break;
JMP L_break } else {
default:
L_default:
stmt3;
...
...
break;
}
L_break:
}
This switch statement maps directly into a sequence of if-else statements. The
main characteristic of this kind of statement is the re-use of the same value in
each if statement, and the same destination for each block. Expression
propagation cannot remove the R1 register because it is used multiple times, but
statement structuring can check if the expression has no side-effects (i.e. uses
a simple value, with no memory changes and no function calls) and it can check
if all if blocks have the same successor. If these conditions are met, then the
statetement structuring algorithm can collapse all if-else blocks into a switch
statement, thus removing the R1 register.
Unfortunately this only works if the case values are very sparse. If the case
values are consecutive, like 1, 2, 3, 4, etc., most compilers generate a jump
table to the various case labels, and use an indirect jump to one of these
labels. Here's an example of such code:
switch (expr) { MOV R1,
expr R1 = expr
CMP R1, #1 if (R1
< 1)
JLT L_default
goto L_default;
CMP R1, #3
if (R1 > 3)
JGT L_default
goto L_default;
SUB R1, #1
SHL R1, #2
MOV R2, jmpTable[R1] R2 = JmpTable[(R1 - 1)]
JMP [R2]
goto *R2; // ?
case 1:
L_1:
L_1?
stmt1;
... ...
break;
JMP L_break
goto L_break;
case 2:
L_2:
L_2?
stmt2;
...
...
break;
JMP L_break
goto L_break;
case 3: L_3:
L_3?
stmt3;
...
...
break;
JMP L_break
goto L_break;
default: L_default:
L_default:
stmt3;
...
...
break;
} L_break:
L_break?
Here we have a number of problems, all caused by the indirect jump JMP [R2].
This instruction cannot be mapped directly to a high-level language, hence the ?
next to the goto *R2; statement. This statement is illegal and has to be
converted to something else in order to generate recompilable code.
Unfortunately, the only way to convert that statement into a legal statement, is
to recognize the previous instructions as part of a switch statement. This is
not so obvious, because we need to analyze more than one block. We could still
leave the 2 if statements that check the lower and upper boundaries of the case
values table, and only analyze the access to the jump table.
It's worth noting that the above code is only one of the possible code sequences
that compilers generate for a switch statement. There are several variations
that conspire against the implementation of a generic algorithm for switch
statements. Some of these variations include having a jump table of offsets
instead of straight addresses, or a table of value-address pairs which can be
scanned by a loop.
These variations suggest that detection of switch statement be implemented in a
processor- or compiler-dependent module, where the different code sequences can
be better detected.
Detecting switch statements is also very important to improve code/data
separation. In fact, if you remember, the code/data
detection algorithm relied on the fact that calls and jumps can be followed
to their destinations. But here we cannot be sure of the destination of the
indirect jump, because the destination depends on the value of a computed
expression. Surely we know where the jump table is (when a jump table is used)
but we won't know how many values there are in the table unless we analyze the
if statements that check the expression boundaries.
Note something very important here: we cannot recognize the if
statements if we
cannot construct the basic blocks for the procedure. But we cannot construct
all
the basic blocks of the procedure unless we recognize the if
statements that
check for the case value boundaries!
How do we solve this problem?
|