Decompiler Design - Simple Decompiler Problems

 

Prev: Simple Decompiler


Problems with the Simplest Decompiler

Let's consider the problems that are introduced by the simple decompiler described before, and try to find a possible solution. We start with the initial instruction we decompiled:

0x004011E5:  or byte [ebp-0x3], 0x8  ->  L4011E5:    ebp[-0x3] |= 0x8;
  • EBP is a processor register, so it is not known to a generic compiler; therefore, to have our program compiled, we need to declare EBP. In this case, we can declare EBP to be a pointer to some byte, that is:
        unsigned char *ebp;
        ebp[-0x3] |= 0x8;
    
  • EBP is mostly used to point to the frame of the local procedure, so the above statement is probably trying to access a local variable that was declared of byte size. However, there may be other local variables that are of different sizes, and they will all be accessed through EBP (at different offsets from EBP), so declaring EBP as an unsigned char pointer won't work for those other variables. We can solve this problem by forcing the size of each access through the register, based on the size of the original instruction:
        char *bp;
        *(char *)(ebp-0x3) |= 0x8;
    

    Accessing a different variable will use a different cast:

        *(short *)(ebp-0x8) += 10;
    
  • Even though usually EBP contains an address, there are cases where the compiler might use it to hold a numeric constant (in frame-less functions EBP is used just like any other register). If EBP is used in an arithmetic operation other than + or -, the compiler will complain if we declare it char *. So we need to declare it as an int. This is not a problem when using the cast approach above, because the adding and subtracting from an integer is essentially the same as adding and subtracting from a char *.
  • One instruction type that is not easily translated using the above method is for those instructions that set a register with an address for later use. These instructions may be indistinguishable from instructions that load a numeric constant for some later calculation:
    8D 3C 85 DC 68 40 00       lea edi, [eax*4+0x4068DC] // A
    8D 3C 85 04 00 00 00       lea edi, [eax*4+4]        // B
    
    The translation for these two instructions is quite different:
       edi = &var_4068DC[eax*4];    // A
       edi = 4 + eax * 4;           // B
    

    Why the difference? Because in one case some memory area will be accessed, thus we need to take the address of that memory area and index it (in fact, we should probably remove the multiplication, too, since that's how array of integers are typically accessed - so the instruction should become: edi = &var_4068DC[eax];). In the second case, some numeric computation is performed and the compiler chose to use one lea instruction instead of the lengthier sequence mul and add. Solving this problem is not easy.

These are just some example. As you have seen, control flow instructions (jumps, calls) don't translate very well, and using the mechanical translation method described above quickly creates unreadable generated C code that may become bloated when recompiled to machine code after the translation.

We clearly need a better solution.


Next: Standard Decompiler Architecture