RecStudio Decompiler Design - Simple Decompiler

As we hinted when talking about disassemblers, the simplest way to decompile a program is to present as much high-level information as possible for each instruction. Let's say that the disassembler converts a sequence of bytes into the following instruction:

0x004011E5:   80 4D FD 08          or byte [ebp-0x3], 0x8

We can translate this instruction into C by simply considering the effect that the instruction has on the program. A possible translation of this instruction is:

           ebp[-0x3] |= 0x8;

A simple filter on the output of the disassembler can do the job:

      Objdump -d file.exe | translate > file.c

Where translate would do something like this:

     while(gets(instr)) {
          p = &instr[FIRST_OPCODE_COL];  // skip address, bytes
          p = isolate_word(p, opcode);   // get operation
          p = isolate_argument(p, dest); // get destination operand
          if(*p == ',')
              p = isolate_argument(p + 1, source);// get source operand

          if(!strcmp(opcode, "or"))
              printf("%s  |=  %s\n", dest, source);
           . . . // other operators
      }

Most of the arithmetic instructions can be translated this way. Some effort can be dedicated to pretty-printing the arguments, so that [ebp-0x3] can become ebp[-3]. If a memory location is involved in the instruction, we can simply record its use in a table and emit the variable declaration in a .h file. Jump instructions are special, because you need to define the label that corresponds to the target of the jump, but this is easy to do: just emit the address of each instruction as a label as you translate them:

0x004011D6:   EB 0D                jmp L4011E5
. . .
0x004011E5:   80 4D FD 08          or byte [ebp-0x3], 0x8


L4011D6:  goto L4011E5
. . .
L4011E5:  ebp[-0x3] |= 0x8;

You can experiment with other cases, and most of them can be translated this way. Here's some example:

0x00401383:   80 FB 49           cmp bl, 0x49
0x00401386:   74 2E              jz L4013B6

L401383:    flags = bl - 0x49;
L401386:    if ( flags == 0 )  goto L4013B6;


0x00401357:   E8 B6 06 00 00     call L401A12
0x0040135C:   . . .

L401357:      esp -= 4;          // make space for return location
              *esp = L40135C;    // save return location
              goto L401A12;      // go to ("call") the function
L40135C:      . . .

Now we have a slight problem. C does not allow computed returns, so how are we going to translate RET and expect to go back to L40135C? One way to do it is by recording the addresses of all instructions that follow a call instruction, and use a switch() statement to go to that label:

              // Translating RET

              _dest = *esp;
              esp += 4;
              switch(_dest) {
              case 0x40135C:    goto L40135C;
              . . .
              }

This approach is very simple and shows the basics of decompilation: that you don't need to recreate the original program's source; you only need to recreate the original program's behavior.

However, this approach has several shortcomings, as you may imagine. We consider them in the next chapter.

Next: Problems with the Simplest Decompiler