|
Prev: Code vs. Data
Basic Blocks
Now that we are starting to make sense of the unstructured byte sequence in the binary file,
the next problem we need to solve before we can start decompiling the code is to be
sure that a sequence of instructions is actually executed in the sequence we see it
in the file.
Consider the following sequence:
L1: MOV EAX, $2
L2: MUL EAX, ECX
L3: MOV DWORD [0x402000], EAX
Since we know from the processor's user's manual
that the MUL instruction requires one of the operands as well as the result
to be in EAX, we might be tempted to translate the above sequence
into the following statement:
var_402000 = ecx * 2;
That would be wrong for two reasons:
-
there may be a jump from some other part of the program to L2.
If that happens, we cannot be sure that EAX will have the value 2 in it!
-
The EAX register may be used again after we have saved the value of
the MUL instruction into memory.
For example, a bigger sequence could be the following one:
L0: MOV EAX, $3
CMP EBX, $0
JNE L2
L1: MOV EAX, $2
L2: MUL EAX, ECX
L3: MOV DWORD [0x402000], EAX
MOV DWORD [0x402010], EAX
...
The correct way to translate the above sequence would be:
eax = (ebx == 0) ? 2 : 3;
var_402010 = var_402000 = eax * ecx;
But how do we know that execution could skip L1 and go directly to L2?
And more importantly, how do we know that there are no other ways that execution can go to L2,
maybe from some other part of the code?
These problems can be solved by using a well known data structure
called basic block. This data structure
is one of the most important one in the entire decompiler, since a lot of code
will rely on the information stored in it for many analyses.
A lot of information exists in the literature on what basic blocks are. We've already
given an informal definition. Here's a more precise one:
"a basic block is a sequence of instructions; the first instruction is the only way
for execution to enter a basic block; the last instruction is the only way for execution
to exit the basic block."
A consequence of this definition is that every jump destination starts a new basic block,
and every jump instruction (including return instructions) ends a basic block.
For a decompiler it's not clear whether a call instruction should end a basic
block. Depending on the sophistication of the algorithms, one can consider a call
instruction to end a basic block, or one can just ignore call instructions
for the time being.
It's however clear that the destination of a call instruction always starts
a basic block.
Given that we already have computed the list of labels and jumps when we scanned for
code and data, and that we have the entry point of most functions in our procedure table,
we can build the basic blocks for a procedure by using the following algorithm:
current_address = current_procedure->entry_address
workList.push = current_address
while workList not empty
current_address = workList.pop
next:
block = new Block at current_address
blockList += block
label = find label at address higher than current_address
branch = find branch at address higher than current_address
if label->address < branch->address_after_branch
block->length = label->address - block->address
current_address = label->address
goto next
end if
block->length = branch->address_after_branch - block->address
if branch is return // returns have no known destination
continue
workList.push = branch->destination
if branch is conditional
current_address = branch->address_after_branch
goto next
end if
// else will resume from branch destination
end while
This algorithm will record in the blockList set which sequences of instructions
are executed in sequence. But while we are at, why don't we also record where execution
goes from block to block? After all, it may be useful to know that in our example
after we've set EAX to 3 we're going to use it in the MUL instruction, instead than on
some other instruction. To do this, we record 2 additional pieces of information for each
block:
-
The list of blocks that jump to the entry instruction of the current block: that is
the list of predecessor blocks (pred for short);
-
and the list of blocks where we jump to when we're finished executing this block: that
is the list of successor blocks (succ for short).
Here's a modified version of the basic block construction algorithm
that records this additional information:
Block *get_block_at(Address start_address)
{
Block *block = find in BlockList a block for start_address
if block == not found
block = new Block at start_address
blockList += block
return block
}
CreateBasicBlocks(current_procedure)
{
current_address = current_procedure->entry_address
block = get_block_at(current_address)
current_procedure->entry_block = block
workList.push = current_address
while workList not empty
current_address = workList.pop
block = get_block_at(current_address)
next:
label = find label at address higher than current_address
branch = find branch at address higher than current_address
if label->address < branch->address_after_branch
block->length = label->address - block->address
current_address = label->address
next_block = get_block_at(current_address)
next_block->preds += block
block->succs += next_block
block = next_block
goto next
end if
block->length = branch->address_after_branch - block->address
if branch is return // returns have no known destination
continue
end if
next_block = get_block_at(branch->destination)
next_block->preds += block
block->succs += next_block
workList.push = branch->destination
if branch is not conditional
continue // will resume from branch destination
end if
next_block = get_block_at(branch->address_after_branch)
next_block->preds += block
block->succs += next_block
block = next_block
current_address = block->address
goto next
end while
sort blockList by block's start address
}
Notes:
-
this algorithm is still incomplete: it does not
take into account indirect jumps, as it only assumes that a jump
can have 2 destinations: the true case and the false case.
Indirect jumps, on the other hand, can have many destinations.
We'll extend the algorithm in the advanced section to add all
the known destinations to the list of successors for a block that
ends with an indirect jump.
-
the algorithm only uses jumps and returns. It should ignore
function calls (that is, if call instructions were recorded
in the list of branches, they have to be skipped when looking
for the next branch).
While this algorithm is more complicated, we'll see in the next
section that the successors and predecessors information will become
very useful to find out more about our procedure.
|