Decompiler Design - Basic Blocks |
|
|
Prev: Code vs. Data
Now that we are starting to make sense of the unstructured byte sequence in the binary file, the next problem we need to solve before we can start decompiling the code is to be sure that a sequence of instructions is actually executed in the sequence we see it in the file.
Consider the following sequence:
L1: MOV EAX, $2 L2: MUL EAX, ECX L3: MOV DWORD [0x402000], EAXSince we know from the processor's user's manual that the MUL instruction requires one of the operands as well as the result to be in EAX, we might be tempted to translate the above sequence into the following statement:
var_402000 = ecx * 2;
That would be wrong for two reasons:
L0: MOV EAX, $3
CMP EBX, $0
JNE L2
L1: MOV EAX, $2
L2: MUL EAX, ECX
L3: MOV DWORD [0x402000], EAX
MOV DWORD [0x402010], EAX
...
The correct way to translate the above sequence would be:
eax = (ebx == 0) ? 2 : 3;
var_402010 = var_402000 = eax * ecx;
But how do we know that execution could skip L1 and go directly to L2? And more importantly, how do we know that there are no other ways that execution can go to L2, maybe from some other part of the code?
These problems can be solved by using a well known data structure called basic block. This data structure is one of the most important one in the entire decompiler, since a lot of code will rely on the information stored in it for many analyses.
A lot of information exists in the literature on what basic blocks are. We've already
given an informal definition. Here's a more precise one:
A consequence of this definition is that every jump destination starts a new basic block, and every jump instruction (including return instructions) ends a basic block.
For a decompiler it's not clear whether a call instruction should end a basic
block. Depending on the sophistication of the algorithms, one can consider a call
instruction to end a basic block, or one can just ignore call instructions
for the time being.
It's however clear that the
Given that we already have computed the list of labels and jumps when we scanned for code and data, and that we have the entry point of most functions in our procedure table, we can build the basic blocks for a procedure by using the following algorithm:
current_address = current_procedure->entry_address
workList.push = current_address
while workList not empty
current_address = workList.pop
next:
block = new Block at current_address
blockList += block
label = find label at address higher than current_address
branch = find branch at address higher than current_address
if label->address < branch->address_after_branch
block->length = label->address - block->address
current_address = label->address
goto next
end if
block->length = branch->address_after_branch - block->address
if branch is return // returns have no known destination
continue
workList.push = branch->destination
if branch is conditional
current_address = branch->address_after_branch
goto next
end if
// else will resume from branch destination
end while
This algorithm will record in the blockList set which sequences of instructions are executed in sequence. But while we are at, why don't we also record where execution goes from block to block? After all, it may be useful to know that in our example after we've set EAX to 3 we're going to use it in the MUL instruction, instead than on some other instruction. To do this, we record 2 additional pieces of information for each block:
Block *get_block_at(Address start_address)
{
Block *block = find in BlockList a block for start_address
if block == not found
block = new Block at start_address
blockList += block
return block
}
CreateBasicBlocks(current_procedure)
{
current_address = current_procedure->entry_address
block = get_block_at(current_address)
current_procedure->entry_block = block
workList.push = current_address
while workList not empty
current_address = workList.pop
block = get_block_at(current_address)
next:
label = find label at address higher than current_address
branch = find branch at address higher than current_address
if label->address < branch->address_after_branch
block->length = label->address - block->address
current_address = label->address
next_block = get_block_at(current_address)
next_block->preds += block
block->succs += next_block
block = next_block
goto next
end if
block->length = branch->address_after_branch - block->address
if branch is return // returns have no known destination
continue
end if
next_block = get_block_at(branch->destination)
next_block->preds += block
block->succs += next_block
workList.push = branch->destination
if branch is not conditional
continue // will resume from branch destination
end if
next_block = get_block_at(branch->address_after_branch)
next_block->preds += block
block->succs += next_block
block = next_block
current_address = block->address
goto next
end while
sort blockList by block's start address
}
Notes:
Indirect jumps, on the other hand, can have many destinations. We'll extend the algorithm in the advanced section to add all the known destinations to the list of successors for a block that ends with an indirect jump.
While this algorithm is more complicated, we'll see in the next section that the successors and predecessors information will become very useful to find out more about our procedure.
Next: Control Flow Graph