REC User's Manual

A step-by-step approach to decompilation using REC



The REC decompiler can be used in several different ways, depending on the type of the executable file that must be decompiled.

In this example, we'll consider a raw binary file, about which not much is known. But even if you have a file in a recognized format, sometimes you can get much better results by following the approach described here.

The first step is to tell REC how to load the file. If the file is in one of the recognized formats (COFF, Windows PE, ELF, AOUT), REC will automatically determine which portions of the file contain code and which contain data. On the other hand, if the file format is not recognized, REC will print the following message and exit:

      file.exe: file error or wrong format file
    

To force REC to accept the file, we must use a command file (.cmd). Thus, assuming we want to load a file named file.exe, we should first write the file.cmd file containing the following commands:

      #!wrec
      file: file.exe
      cpu: mips
      region: 0x0 0x2000 0 data
    

We assume here that we know that the file contains MIPS R3000 instructions. Other possible values for the cpu: command are: i386, 68000 and ppc. We also assume that the file is 0x2000 bytes long. Then we should be able to load the file by typing the following command from a shell prompt:

      rec +interactive file.cmd
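
The first-pass command file can also be generated rather than typed by hand. The following is a minimal Python sketch, assuming the four-line layout shown above; the helper name `initial_cmd` is hypothetical and not part of REC. It only formats text, so the size passed in must be the real length of the file.

```python
def initial_cmd(file_name, size, cpu="mips"):
    """Build a first-pass .cmd file that maps the whole file as one data region.

    The layout copies the example in the text; 'size' is the file length in bytes.
    """
    lines = [
        "#!wrec",
        "file: %s" % file_name,
        "cpu: %s" % cpu,
        # One region covering the whole file: start address 0, end address = size,
        # file offset 0, type data.
        "region: 0x0 %#x 0 data" % size,
    ]
    return "\n".join(lines) + "\n"
```

In practice the size could come from `os.path.getsize("file.exe")`, and the returned string would be written to file.cmd.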
    

REC will then read the file and show the main menu. At this stage, the only menu item that makes sense is the Hexadecimal Dump, activated by the letter 'D', which shows the content of the file in hex.

From a visual inspection of the hexadecimal dump, we should be able to isolate areas that look like code and areas that look like data. For example, byte sequences that are too regular (e.g. all zeros) are almost certainly data. Sequences that look like addresses, i.e. whose pattern repeats every 4 bytes (on a 32-bit architecture), are also likely to be data. Everything else can be assumed to be code.

It is also very important to know where the entry point is, i.e. the first instruction executed when the program is loaded. With this information, we can create a region list and specify the regions in the .cmd file. For example, a revised .cmd file could be:

      #!wrec
      file: file.exe
      cpu: mips
      region: 0x80000400 0x80001600 0x400 text
      region: 0x80001600 0x80002000 0x1600 data
    

In this example, we assume that the first 0x400 bytes are the header of the file, that the code starts at file offset 0x400 and is loaded at address 0x80000400, and that the data portion starts at offset 0x1600 and is loaded at address 0x80001600. Note that we left out the bytes between offsets 0 and 0x3FF. That portion of the file will be ignored, since it contains information that is only useful to the operating system when loading the file. Similarly, we could leave out any portion of the file that contains relocation or symbolic information.
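
To double-check a region list against the hex dump, it helps to translate a load address back to a file offset. The sketch below is an illustration in Python, not REC's own parser. It assumes each region: line carries a start address, an end address, a file offset and a type, in that order, and that the end address is exclusive (the example regions share 0x80001600 as a boundary). The names `Region`, `parse_region` and `file_offset` are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int   # first load address of the region
    end: int     # address just past the region (assumed exclusive)
    offset: int  # file offset of the first byte
    kind: str    # "text" or "data"


def parse_region(line):
    """Parse one 'region: START END OFFSET TYPE' line from a .cmd file."""
    keyword, start, end, offset, kind = line.split()
    if keyword != "region:":
        raise ValueError("not a region line: %r" % line)
    # int(text, 0) accepts both the 0x-prefixed and the plain "0" spellings.
    return Region(int(start, 0), int(end, 0), int(offset, 0), kind)


def file_offset(regions, address):
    """Return the file offset holding 'address', or None if no region maps it."""
    for region in regions:
        if region.start <= address < region.end:
            return region.offset + (address - region.start)
    return None
```

With the two regions from the revised file.cmd, address 0x80001000 resolves to file offset 0x1000, while any address below 0x80000400 resolves to nothing, which matches the header being left out.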

Remember that this process is automated for those object formats that are recognized by REC.
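
For an unrecognized format, the visual inspection of the hex dump described earlier can be roughed out in a script before the regions are written by hand. The Python sketch below applies only the two heuristics from the text: blocks made of a single repeated byte, and blocks whose 32-bit words mostly look like addresses. It is a crude aid, not what REC does internally. The little-endian byte order, the 16-byte block size, the assumed address range and the 75% threshold are all assumptions, and the names `classify_block` and `survey` are hypothetical.

```python
import struct


def classify_block(block, addr_lo=0x80000000, addr_hi=0x80002000):
    """Guess 'data' or 'code' for one block of bytes.

    addr_lo/addr_hi bound the addresses the program is assumed to use.
    """
    # Heuristic 1: a block that is one byte repeated (e.g. all zeros) is data.
    if len(set(block)) <= 1:
        return "data"
    count = len(block) // 4
    if count == 0:
        return "code"
    # Heuristic 2: read little-endian 32-bit words and count those that fall
    # inside the assumed address range; mostly pointers means a table, i.e. data.
    words = struct.unpack("<%dI" % count, block[:4 * count])
    pointers = sum(1 for word in words if addr_lo <= word < addr_hi)
    if pointers * 4 >= count * 3:
        return "data"
    # Everything else is assumed to be code, as in the text.
    return "code"


def survey(image, block_size=16):
    """Return (file offset, guess) for each block of the file image."""
    return [
        (offset, classify_block(image[offset:offset + block_size]))
        for offset in range(0, len(image), block_size)
    ]
```

Adjacent blocks with the same guess suggest candidate region boundaries, which still need to be confirmed in the hex dump and the disassembly.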

Once we have partitioned our file into code, data and auxiliary information, we can quit REC and restart it with the new file.cmd file. At this point, since REC knows which portions of the file contain code, it will try to locate the entry point of each procedure in the program. To do this, each text region is disassembled, and a list of labels, branches and call instructions is constructed. This list can be viewed using the 'l' and 'b' main menu commands. If we see jumps or labels outside of the memory regions that we specified, it is likely that those branches came from REC interpreting as code a region that was actually data. We can then modify the region list until all branch targets fall within the boundaries of the specified memory regions.
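
The same sanity check can be sketched outside REC for the simplest case. The Python example below decodes only the MIPS J and JAL instructions (opcodes 2 and 3) from a text region and reports targets that fall outside every known region. It assumes a little-endian image and ignores conditional branches and register jumps, so it is a partial illustration rather than a substitute for REC's label list. The names `jump_targets` and `stray_jumps` are hypothetical.

```python
import struct


def jump_targets(code, base):
    """Yield (pc, target) for every J/JAL word in a little-endian MIPS region.

    'base' is the load address of the first byte of 'code'.
    """
    for index in range(0, len(code) - 3, 4):
        (word,) = struct.unpack_from("<I", code, index)
        if (word >> 26) in (2, 3):  # opcode 2 = J, opcode 3 = JAL
            pc = base + index
            # The target keeps the top 4 bits of the delay-slot address and
            # takes the low 28 bits from the 26-bit instruction index.
            target = ((pc + 4) & 0xF0000000) | ((word & 0x03FFFFFF) << 2)
            yield pc, target


def stray_jumps(code, base, regions):
    """Return the (pc, target) pairs whose target lies outside all regions.

    'regions' is a list of (start, end) address pairs, end exclusive.
    """
    return [
        (pc, target)
        for pc, target in jump_targets(code, base)
        if not any(start <= target < end for start, end in regions)
    ]
```

A cluster of stray jumps coming from one stretch of addresses is a hint that the stretch is data and should be moved out of the text region.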

Another useful command is 'p : show procedures'. A list of the procedures found will be shown. The list (as well as any other list) can be traversed using the 'k' and 'j' keys (up and down one line, respectively) or the Ctrl-B and Ctrl-F keys (up and down one page, respectively).

We can now disassemble the procedure under the cursor by typing the 'd' key. This will show the disassembled code between the start and end address of the procedure. If the disassembler output looks reasonable, we can exit the disassembler screen by typing ESC, and try to decompile the procedure by typing 'x'. This will show a C-like representation of the disassembled procedure.

Repeating this process for every procedure can be very tedious. It may be faster to simply disassemble the entire program and then examine the disassembly listing. To do this, exit REC by repeatedly typing ESC. When at the shell prompt, invoke REC with the following command:

      rec +disasmonly file.cmd
    

This command will produce the file named file.dis, which is the disassembly of all text regions. Using this file, we can further determine which region has code and which region has data.

We should continue to refine our .cmd file until we are sure that all data portions have been identified. Then we can ask REC to decompile the entire executable file, with the following command:

      rec -interactive file.cmd
    

This tells REC to produce a file named file.rec with the C representation of all the procedures. Note that a list of regions appears at the end of the file. This list can be longer than the one provided in the .cmd file, because REC may have determined that some portions of a text region do not contain code, for example because they hold a jump table or a read-only string.

Viewing the C-like file alongside the disassembly file, we can try to understand what each function does. When we have determined that a function, for example, opens a file (because it calls some system function whose behavior we know), we can provide a symbolic name and type for the function and each of its parameters. We can do this in two ways. First, we can specify in the .cmd file that a certain address range belongs to the procedure. For example, we can add the following line to the .cmd file:

      symbol: 0x80001000, 0x80001100 T OpenFile(char *filename)
    

This tells REC to replace all references to address 0x80001000 with the symbol name OpenFile. Moreover, it tells REC that the OpenFile function has one parameter of type char *. When we restart REC with the improved .cmd file, the output file will be more readable and more accurate.

Another way to provide the type information is to write a symbol-only file. This is an ELF, AOUT or COFF file compiled with GNU's gcc that contains symbolic information about the program being decompiled. For example, we could have achieved the same result with the following lines in the .cmd file:

      types: file.o
      symbol: 0x80001000, 0x80001100 T OpenFile()
    

With these commands, REC will read the symbolic information from file.o, and apply it to the functions specified by the symbol: commands. In our example, file.o might have been produced by compiling the following file.c source with gcc -g -c file.c:

      /* file.c - symbolic info for file.exe */

      int OpenFile(char *file_name)
      {
      }
    

Only the symbolic information is used. The code produced by gcc is ignored. This can also be used to provide information about C library functions, whose behavior is specified by the C language. You will find pre-compiled object files for several C library interfaces on the download page.

This second method is preferred over the first one for two reasons: first, it allows more complex data types to be specified, such as structures and unions; second, the type files can be used with many different executables. You don't have to change all .cmd files when you find what a new piece of code does. You can simply provide the name of the function, and REC will find its parameters from the object files.

REC provides another feature that can be used to further speed up the improvement of the decompiled output. This feature is most beneficial when we know that a set of executable files has been compiled with the same compiler, and linked with the same C run-time libraries. In this case, we can ask REC to automatically determine where in the text regions these library functions are. To do so, first we must recognize where each function starts, and we must tell REC the first few bytes of each function. For example, if we know that the open() and the lseek() functions always start with the same bytes, we can write the following signature file, file.sig:

      open(char *name, int mode) size: 16
         A0 00 0A 24 08 00 40 01
         00 00 09 24 00 00 00 00
         ;
      lseek() size: 16
         A0 00 0A 24 08 00 40 01 01 00 09 24 00 00 00 00
         ;
    

Then, we can add the pattern: command to our .cmd file:

      types: file.o
      pattern: file.sig
      symbol: 0x80001000, 0x80001100 T OpenFile()
    

Now, in addition to treating references to address 0x80001000 as calls to the OpenFile() function, REC will also find where in the executable file the open() and lseek() functions are, and treat references to those locations symbolically.
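
To preview which signatures would match before running REC, the search can be imitated with a short script. The Python sketch below is a guess at the semantics, not REC's actual matcher. It assumes each entry is a header line, followed by lines of hex bytes, followed by a line holding only ';'. It keeps just the function name from the header and ignores the size: field and any parameter list. The names `parse_signatures` and `find_signatures` are hypothetical.

```python
def parse_signatures(text):
    """Parse signature entries into a dict mapping function name to bytes."""
    signatures = {}
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == ";":
            # End of entry: join the collected hex lines into one byte string.
            signatures[name] = bytes.fromhex(" ".join(chunks))
            name, chunks = None, []
        elif name is None:
            # Header line, e.g. "open(char *name, int mode) size: 16".
            name = line.split("(")[0]
        else:
            chunks.append(line)
    return signatures


def find_signatures(image, signatures, base=0):
    """Return sorted (address, name) pairs for every match in 'image'.

    'base' is the load address of the first byte of 'image'.
    """
    hits = []
    for name, pattern in signatures.items():
        position = image.find(pattern)
        while position != -1:
            hits.append((base + position, name))
            position = image.find(pattern, position + 1)
    return sorted(hits)
```

Note that the two example signatures differ in a single byte, so short patterns like these can only be trusted when the first bytes of each library function are known to be distinctive.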

Note again that type information can be provided in the pattern file, although it is better to provide symbolic information via the types: command.

Many other options are available. Some are only meaningful when debugging REC. Others may improve the quality of the output. The User's Manual provides a list of all major options.



Copyright © 1997 - 2007 Backer Street Software -- All rights reserved.