The subject of this exercise is programming in the machine language of the x86 architecture and connecting machine code with a high-level programming language.
Become familiar with the basic properties of the x86 machine language, particularly with the available addressing modes and registers [1], [2].
Become familiar with the ways (conventions) of passing parameters to subroutines [3], in particular with the cdecl convention for 32-bit operating systems, which will be used in this exercise.
Subroutines in x86 assembly language will be called from a program written in C++.
Instruction execution can be traced in a debugger (for example gdb).
Unfortunately, popular C/C++ compilers do not use the same x86 assembly syntax: gcc (GNU Compiler Collection) uses AT&T syntax by default, while MSVC (Microsoft Visual C++) uses Intel syntax. Since we are going to use Intel syntax in this exercise, and gcc offers suitable compile-time options (gcc: options -S -masm=intel), this difference is not going to be a problem. The only remaining problem is how calls to functions written in x86 assembly are specified within C++ code. For this reason, we provide short instructions for both popular compilers, MSVC and gcc (it is up to the student to choose which compiler to use in this exercise).
The easiest way to write an x86 routine with gcc is to write it in a separate file with the extension .s.
A file that defines the x86 routine subroutine_asm has the following basic structure:
```asm
// this is a comment (like in C++)
//
// syntax directive (we use Intel syntax):
.intel_syntax noprefix
// we want the subroutine name (subroutine_asm)
// to be visible from the C++ code, so we
// declare it global like this:
.global subroutine_asm
// and here the label itself is defined:
subroutine_asm:
// ... assembly code
```
Routines defined in this manner are called in the same way as ordinary C/C++ routines (as will be shown a little later). For now, just assume that the main program is in the file main.cpp, while the assembly routine is in subroutine.s.
Compiling and linking can then be done as follows (note that on 64-bit UNIX systems you may need to modify this, as explained in section 7):
$ g++ -g -o main subroutine.s main.cpp
Program tracing (with gdb) can then be started with the command:
$ gdb main
For this exercise, we need only a small subset of the capabilities of gdb described in [4], namely the commands break, run, next, step, print, and info registers. How to use these commands is explained in the gdb documentation [5].
The easiest way to write an assembly routine in MSVC is to write a so-called naked function, whose body consists entirely of inline assembly:
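```cpp
// __declspec(naked) tells the compiler not to generate any code
// before or after the function body (no prologue, no epilogue),
// hence the name "naked"; the assembly code must therefore set up
// its own stack frame and end with ret. Parameters are passed
// using the cdecl convention, the MSVC default for free functions.
// (MSVC supports __asm blocks only when compiling for 32-bit x86.)
__declspec(naked) int function_asm(int i) {
    __asm {
        // ... assembly code
    }
}
```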
Assembly subroutines are called in the same way as ordinary C routines, as will be explained in more detail later. Files containing assembly subroutines are compiled in the standard way: add the subroutine source file to the Visual Studio console project (the project type we use to build console C++ applications) and start the build (Build Solution).
Program tracing can be started from the integrated development environment (by clicking Start Debugging). Useful tracing actions are Toggle Breakpoint, Step Over, and Step Into; useful windows are Watch and Registers.
A basic introduction to x86 assembly programming can be found in the x86 assembly guide (it is advisable to skim through it).
The standard way to access parameters and local variables in a subroutine is through the register ebp (base pointer). To make this possible, within the cdecl convention, assembly subroutines have the following structure (more on this topic is available in [3]):
```asm
/* cdecl prologue: */
push ebp        /* store ebp on the stack */
mov ebp, esp    /* copy esp to ebp */
/* allocate 4 bytes for local variables (or more if needed) */
sub esp, 4
/* local variables are "below" ebp, e.g. at [ebp-4] */
/* (the stack grows toward lower addresses) */

/* main subroutine functionality */
...
/* return value is in eax */

/* release local variables: */
add esp, 4
/* cdecl epilogue: */
pop ebp         /* 'add esp, 4' + 'pop ebp' can also be written as 'leave' */
ret             /* return from the subroutine */
```
We will write a C/C++ subroutine that computes (a+b)*c for given integer values a, b, c and returns the result. This subroutine in C (subroutine_c) looks like this:
int subroutine_c(int a, int b, int c) { return (a + b) * c; }
The body of the corresponding subroutine written in x86 assembly (subroutine_asm) is the following:
```asm
/* [ebp] stores the previous value of ebp */
/* [ebp+4] is the return address (the saved eip) */
mov eax, [ebp+12]   /* b */
add eax, [ebp+8]    /* a */
imul eax, [ebp+16]  /* c */
```
The subroutine returns the result in register eax.
The previous code snippet contains the main functionality of the function subroutine_asm, but to write a complete function we have to surround it with the standard prologue and epilogue shown in the previous chapter (the snippet goes in place of the three dots). Also, since this example has no local variables, we can omit the instructions sub esp, 4 and add esp, 4.
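The offsets 8, 12 and 16 used in subroutine_asm follow from the order in which values are pushed onto the stack. The following C++ sketch is only a model: it simulates a 32-bit stack in a byte array instead of running machine code, and the names SimStack and run_subroutine_model are hypothetical. The caller pushes c, b and a (right to left), call pushes the return address, and the prologue pushes ebp and copies esp into ebp; reading the array at ebp+8, ebp+12 and ebp+16 then yields a, b and c.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical model of a 32-bit x86 stack: 256 bytes of "memory",
// addressed like real memory, where esp moves toward lower addresses.
struct SimStack {
    std::array<std::uint8_t, 256> mem{};
    std::uint32_t esp = 256;  // empty stack: esp points just past the top
    std::uint32_t ebp = 0;    // caller's ebp (arbitrary value here)

    void push(std::uint32_t value) {       // push: esp -= 4, then [esp] = value
        esp -= 4;
        std::memcpy(&mem[esp], &value, 4);
    }
    std::uint32_t pop() {                  // pop: value = [esp], then esp += 4
        std::uint32_t value;
        std::memcpy(&value, &mem[esp], 4);
        esp += 4;
        return value;
    }
    std::uint32_t load(std::uint32_t address) const {  // DWORD PTR [address]
        std::uint32_t value;
        std::memcpy(&value, &mem[address], 4);
        return value;
    }
};

// Mimics subroutine_asm(a, b, c) called with the cdecl convention.
std::int32_t run_subroutine_model(std::int32_t a, std::int32_t b, std::int32_t c) {
    SimStack s;
    // Caller: arguments are pushed right to left, then 'call' pushes
    // the return address (a placeholder value in this model).
    s.push(static_cast<std::uint32_t>(c));
    s.push(static_cast<std::uint32_t>(b));
    s.push(static_cast<std::uint32_t>(a));
    s.push(0x12345678u);

    // Callee prologue.
    s.push(s.ebp);   // push ebp
    s.ebp = s.esp;   // mov ebp, esp

    // Body: unsigned 32-bit arithmetic gives the same low 32 bits as add/imul.
    std::uint32_t eax = s.load(s.ebp + 12);  // mov eax, [ebp+12]   (b)
    eax += s.load(s.ebp + 8);                // add eax, [ebp+8]    (a)
    eax *= s.load(s.ebp + 16);               // imul eax, [ebp+16]  (c)

    // Epilogue and return.
    s.ebp = s.pop();  // pop ebp
    s.pop();          // ret: pops the return address
    // cdecl: the caller would now do 'add esp, 12' to remove the arguments.
    return static_cast<std::int32_t>(eax);
}
```

For example, run_subroutine_model(3, 5, 6) goes through the same steps as subroutine_asm(3, 5, 6) and returns 48.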
Note that in this simple subroutine the prologue and epilogue are not really necessary; the function can be rewritten as follows:
```asm
sub_asm_noebp:
    /* [esp] is the return address */
    mov eax, [esp+8]    /* b */
    add eax, [esp+4]    /* a */
    imul eax, [esp+12]  /* c */
    ret                 /* return from the subroutine */
```
Here, all the referencing is done through register esp (not ebp as before).
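The relation between the two versions can be checked with a little address arithmetic. Let s be a symbol introduced only for this derivation: the value of esp at the first instruction of the subroutine, i.e. right after call.

$$
\begin{aligned}
&[s] = \text{return address},\quad [s+4] = a,\quad [s+8] = b,\quad [s+12] = c,\\
&\text{after } \mathtt{push\ ebp};\ \mathtt{mov\ ebp,\ esp}:\quad \mathtt{ebp} = s - 4,\\
&\Rightarrow\quad \mathtt{ebp}+8 = s+4,\quad \mathtt{ebp}+12 = s+8,\quad \mathtt{ebp}+16 = s+12.
\end{aligned}
$$

Every ebp-relative offset in subroutine_asm is therefore 4 larger than the corresponding esp-relative offset in sub_asm_noebp, because the prologue pushed one 4-byte value (the old ebp). The esp-relative offsets stay valid only as long as esp does not change inside the subroutine; every push or sub esp shifts them, whereas ebp stays fixed.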
However, in general we will use a prologue and epilogue in our assembly subroutines, since they will not be this simple; using them helps keep the code structured and maintainable. Note, finally, that most compilers also generate a prologue and epilogue when translating C/C++ code to assembly (at least when optimizations are disabled).
Please note that calling conventions require some registers (e.g. EBX) to be preserved across subroutine calls; in cdecl these are ebx, esi, edi and ebp. Such registers are called callee-saved in the documentation.
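On 64-bit UNIX systems, the simplest solution is to add the -m32 flag to the g++ invocation, so that the program is built as 32-bit code and the cdecl convention described above applies (this may require the 32-bit versions of the compiler support libraries to be installed). Alternatively, the subroutine can be written for the 64-bit System V calling convention, in which the first six integer or pointer arguments are passed in registers rdi, rsi, rdx, rcx, r8, and r9, and the return value is placed in rax.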
Calls to assembly subroutines are transparent: they look exactly like calls to C/C++ routines. This means that subroutine_asm and subroutine_c are called in exactly the same way. If the subroutine declaration is not visible where the subroutine is called from the C/C++ program (e.g. the subroutine is defined in a separate file), an appropriate prototype must be provided (we just mention it; this is usual practice in C/C++ programs).
An assembly subroutine written in pure assembly (in a separate file with the extension .s, gcc) is translated into object code in which the symbol name is exactly the label defined in the file; the assembler adds no name decoration. If we want to call such a subroutine from C++, we need to prefix its prototype with extern "C" to prevent C++ name mangling: the C++ compiler normally encodes the parameter types into the symbol name (for example, g++ on Linux refers to subroutine_asm(int,int,int) as _Z14subroutine_asmiii). In addition, on some platforms the C naming convention itself decorates names: for cdecl on 32-bit Windows, an underscore is added as a prefix.
In our example, the prototype looks like this:
extern "C" int subroutine_asm(int,int,int);
If we want to use gcc on Windows (32-bit targets), we need to tell the compiler not to add the underscore prefix to the assembly function name when compiling the main program. This is achieved with the asm() keyword (a gcc extension) in the external subroutine prototype declaration:
extern "C" int subroutine_asm(int,int,int) asm("subroutine_asm");
If we don't do that, we will get a link error, because the linker will not be able to resolve the reference to the symbol subroutine_asm.
There is also another way to achieve the same thing: in the assembly code, we can add a second global label whose name starts with an underscore. It looks like this:
```asm
.global subroutine_asm
.global _subroutine_asm
subroutine_asm:
_subroutine_asm:
// ... assembly code
```
Implement the subroutines subroutine_asm and subroutine_c, which were discussed in the previous sections, and test subroutine_c and subroutine_asm along with the main (test) function:
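```cpp
#include <iostream>

extern "C" int subroutine_asm(int, int, int);  // defined in subroutine.s
int subroutine_c(int a, int b, int c);          // the C++ version shown above

int main() {
    std::cout << "ASM: " << subroutine_asm(3, 5, 6) << std::endl;  // both lines
    std::cout << "C++: " << subroutine_c(3, 5, 6) << std::endl;    // print 48
}
```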
Look at the assembly code that the compiler generates for subroutine_c. Do this for different optimization levels (in gcc, optimization levels are specified with the options -O0, -O1, -O2, -O3). To obtain the assembly output, use: MSVC: Project Properties -> C/C++ -> Output Files -> Assembler Output; gcc: options -S -masm=intel.
void vector_add_c(float const* a, float const* b, int count, float *r);
where a and b are the input vectors, count is their length, and r receives the element-wise sum of a and b.
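A minimal sketch of the C version is a single loop over the count elements; it can also serve as a reference for checking the results of the assembly versions.

```cpp
// Sketch of vector_add_c: r[i] = a[i] + b[i] for i = 0 .. count-1.
// The x87 and SSE assembly versions should produce the same results.
void vector_add_c(float const* a, float const* b, int count, float* r) {
    for (int i = 0; i < count; ++i) {
        r[i] = a[i] + b[i];
    }
}
```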
The x87 implementation uses the instructions fld, which loads a vector element onto the x87 stack, fadd, which adds a float value (for example, a vector element in memory) to the value at the top of the x87 stack, and fstp, which stores the result of the previous instruction into the corresponding element of r and pops it off the x87 stack. More on the x87 instruction set can be found in [8,9].
fld DWORD PTR [eax+ecx*4]
where eax is the base address of the vector (for vector a, the address of element a[0]) and register ecx is the index of the element (like the index i in the notation a[i]). The index is multiplied by 4 because the processor computes a memory offset in bytes, and each single-precision floating-point number occupies 4 bytes.
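The effective address in fld DWORD PTR [eax+ecx*4] is computed as base + index*scale. The hypothetical helper element_address below sketches the same computation in C++, so that it can be compared with the address of a[i] computed by the compiler; it assumes 4-byte floats, which the static_assert checks.

```cpp
#include <cstddef>
#include <cstdint>

// The scale factor 4 in [eax+ecx*4] is the size of one single-precision float.
static_assert(sizeof(float) == 4, "expected 4-byte single-precision floats");

// Hypothetical helper: computes the address that fld DWORD PTR [eax+ecx*4]
// reads, with eax = base address of the vector and ecx = element index.
std::uintptr_t element_address(float const* base, std::size_t index) {
    std::uintptr_t eax = reinterpret_cast<std::uintptr_t>(base);  // &a[0]
    std::uintptr_t ecx = index;                                    // i in a[i]
    return eax + ecx * 4;                                          // base + index*scale
}
```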
Register allocation: a -> eax, b -> ebx, index -> ecx, count -> edx, r -> edi.
Registers ebx and edi must be returned to the main function unchanged (they are callee-saved), so we push them onto the stack in the subroutine prologue and pop them off the stack in the subroutine epilogue.
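The SSE implementation uses instructions such as movq or movaps, and addps (note that movaps requires 16-byte aligned memory operands, while movups also works with unaligned data). More information is available in [7,8,9]. Take care of the case when the vector length is not divisible by 4: for the remaining 3, 2, or 1 elements, use x87 instructions.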
If you have difficulties with the SSE instruction set, you can also fall back to compiler intrinsic functions, which translate directly to SSE instructions. For Intel architecture processors, a list of these intrinsics can be found at
SSE intrinsics
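As an illustration of the intrinsics route, a minimal sketch of an SSE vector addition might look as follows (the name vector_add_sse is hypothetical). It uses _mm_loadu_ps, _mm_add_ps and _mm_storeu_ps from <xmmintrin.h>, which correspond to the movups and addps instructions, and finishes the last 0 to 3 elements with a plain scalar loop (in the assembly version, these are the elements handled with x87 instructions). With gcc and -m32, SSE may have to be enabled explicitly (e.g. with -msse).

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Sketch: adds four floats per iteration with SSE, then handles the
// remaining count % 4 elements one by one.
void vector_add_sse(float const* a, float const* b, int count, float* r) {
    int i = 0;
    for (; i + 4 <= count; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);           // unaligned load (movups)
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(r + i, _mm_add_ps(va, vb));  // addps, then unaligned store
    }
    for (; i < count; ++i) {                       // tail: 3, 2, 1 or 0 elements
        r[i] = a[i] + b[i];
    }
}
```

Using unaligned loads and stores avoids the 16-byte alignment requirement of movaps; with aligned arrays, _mm_load_ps and _mm_store_ps could be used instead.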
Compare the execution times of the implementations (for example, using the function clock()).
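Execution times can be measured with a small harness such as the one below. It is a sketch: time_vector_add and VectorAddFn are hypothetical names, std::clock() measures processor time rather than wall-clock time, and a single call on a short vector is usually too fast to measure, so the call is repeated.

```cpp
#include <cstddef>
#include <ctime>
#include <vector>

// Signature shared by vector_add_c and its assembly counterparts.
using VectorAddFn = void (*)(float const*, float const*, int, float*);

// Hypothetical helper: returns the processor time (in seconds) spent on
// `repeat` calls of f on vectors of length `count` (count must be >= 0).
double time_vector_add(VectorAddFn f, int count, int repeat) {
    std::vector<float> a(static_cast<std::size_t>(count), 1.0f);
    std::vector<float> b(static_cast<std::size_t>(count), 2.0f);
    std::vector<float> r(static_cast<std::size_t>(count), 0.0f);

    std::clock_t start = std::clock();
    for (int k = 0; k < repeat; ++k) {
        f(a.data(), b.data(), count, r.data());
    }
    std::clock_t end = std::clock();

    // Store one result in a volatile variable so that the work
    // cannot be treated as unused and optimized away.
    static volatile float sink = 0.0f;
    if (count > 0) {
        sink = r[static_cast<std::size_t>(count) - 1];
    }
    return static_cast<double>(end - start) / CLOCKS_PER_SEC;
}
```

The same call, e.g. time_vector_add(vector_add_c, 1000000, 100), can then be repeated with the x87 and SSE versions; the comparison is only meaningful if all versions are built with the same compiler settings.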
If you enable optimization at compile time, the compiler may translate your plain C implementation into SSE code (auto-vectorization). You can use this to your advantage to see how an SSE function can be written: generate the assembly code of your C function with the -S flag (gcc). With -m32, gcc may additionally need an option such as -msse2 before it uses SSE instructions.
[1] Wikipedia: x86 architecture
[2] Wikipedia: x86 assembly language
[3] Wikipedia: x86 calling conventions
[6] x86 Instruction Set Reference
[7] Wikipedia: Streaming SIMD Extensions
[8] x86 Instruction Set Reference
Last change: 15th October 2012.