Decompiler Design - Type Analysis

 

Prev: Saved Registers


Recovering Types

We have so far recovered functions, statements, and variables. The missing part to recover a high-level program is the types of variables and functions.

Normally we think of types associated with variables; however functions have their types too, that is their declaration. In reality, though, a function's declaration is made up of the list of function parameters, which are variables, and the return type, which we can consider a special variable implicitly declared by the compiler and used as the argument of the return statement.

Thus, for all practical purposes, we can consider types associated only with variables.

But how do we find out what type a variable is?

This is a very difficult problem to solve, and in reality there is not much hope to recover the original type of each variable. However, as long as we preserve the behavior of the program, we can just assign to each variable a type that is compatible with the operations performed on that variable.

In fact, certain operations can only be performed on certain types of variables. However, other operations can be performed on many types of variables. Let's see some examples to make this point clearer:

Sources of Type Information

Some operations are guaranteed to be performed only on certain types. Let's start with the most precise operaration:

   CMP.B  [var1], #123
   JNC    Label               if (var1 >=unsigned 123) goto Label;

In this example, we have one memory location, var1. The type of that individual memory location can be determined by the CMP+JNC sequence to be "unsigned char", because the JNC operation tests the result of an unsigned comparison. Had the branch condition be JLT, then we could have determined that the memory location was a "signed char".

Unfortunately, the number of operations that yield such precise information about the types of its operands is very limited. If the condition was JEQ or JNE, then we would not have been able to determine the signedness of the memory location. We would only be able to know that the variable is 8-bits wide. In other words:

   JNC, JC     =>   unsigned                >=unsigned  <unsigned
   JGE, JLT    =>   signed                  >=signed    <signed
   JEQ, JNE    =>   signed or unsigned      ==          !=

Other operations have the same characteristic. AND, OR and XOR operations can only be performed on integral types, and they always yield an integral result, but we don't know whether the operands are signed or unsigned:

   MOV.B  R1, [var1]
   AND.B  R1, [var2]    // same for OR.B, XOR.B
   MOV.B  [var3], R1

In this case, we know that the three variables are 8-bits wide, but we don't know whether they are signed or unsigned. However, if later on, one of the variables is determined to be unsigned (for example because it is used as the operand of a CMP+JNC), then we can assume that all 3 variables are unsigned.

Determining variable types is thus an exercise in data propagation from the source of the information (the CMP+JNC expression) to all other uses of the same variable, and from those variables to all other uses of those variables and so on.

Type Propagation

Note that this works equally well if the var3 = var1 & var2 statement is before the >=unsigned statement, since types are not affected by the flow of execution.

One could assume that arithmetic operations would also be performed on integral types. Unfortunately this is not correct, because different arithmetic operations can also operate on pointers! Consider the following:

   ADD  [var1], #4    // also applies to  SUB  var1, #4

     int  var1;       var1 += 4;
     int  *var1;      var1 = &var1[1];
     short *var1;     var1 = &var1[2];
     char *var1;      var1 = &var1[4];

Thus, just looking at the ADD instruction we cannot decide which of the 4 types to use for "var1".

Multiply instructions are more precise but their result cannot be propagated too far. Consider the following:

   R1 = [var1]
   R2 = R1 * 4
   [var3] = [var2] + R2

     int var1;    // this is sure
     int var2, var3;      var3 = var2 + var1 * 4;
     int *var2, *var3;    var3 = &var2[var1];

In this example, we know var1 is an integer, but we cannot decide if var2 and var3 are also integer, or if they are pointers to objects of size 4.

Division has the same problem:

   R1 = [var1] - [var2]
   R2 = R1 / 4
   [var3] = R2

     int var1, var2;      var3 = (var1 - var2) / 4;
     int *var1, *var2;    var3 = (var1 - var2);
     int var3;   // this is sure

Again, here all we can determine is that var3 is an integer, but we cannot know if var1 and var2 are also integers, or if they are pointers to objects of size 4.

Finally, most accesses to memory allow us to determine that a variable is a pointer, but not the type of the object being accessed through the pointer:

   R1 = var2[var3]
   var1 = R1

       ? var1;
       ? *var2;  int var3;      var1 = var2[var3];
       int var2; ? *var3;       var1 = var3[var2];

While this may seem odd, the truth is that "var2[var3]" is really  "t = var2 + var3; t[0]", so without a multiplication of either var2 or var3, we cannot decide which is the pointer and which is the index.

Other sources of type information

Given the above description of sources of type information, it seems that assigning types to variables is a lost battle. Indeed, there's not much a decompiler can do beyond applying the rules described in the previous chapter.

If the program being decompiled uses some C run-time library or operating system function, then we can use a data base of function declarations to identify the parameters being passed to the function and the type of the return type. Consider the following:

    PUSH  var1                             ===>    char *var1;
    CALL  strlen     var2 = strlen(var1)
    SP += 4
    var2 = R0                              ===>    size_t var2;

By using the data base of function prototypes, we can determine that var1 is a char *, and that var2 is an int (actually a size_t, which is usually mapped to int).

We can then propagate the types we've discovered from the function call to the other uses of var1 and var2.

This approach is even more powerful when the function receives a structure as an argument. For example:

    R1 = &var2                           ===>  struct stat var2;
    PUSH R1
    PUSH var1                            ===>  int var1; // a filedesc.
    CALL stat       stat(var1, &var2)
    var3 = R1[4]    var3 = var2.st_ino;  ===>  ino_t var3;

Detecting calls to system functions is therefore very important not only to know how many parameters are passed to each function, but also to determine their types.


Next: Advanced Topics