Computer Architecture 2

Computer Architecture 2 - lab 2:
Instruction set architecture x86

The subject of this exercise is programming in the assembly language of the x86 architecture and interfacing assembly routines with code written in a high-level programming language.

1. Introduction

Become familiar with the basic properties of the x86 architecture machine language, and particularly with the available addressing modes and registers [1], [2]. Become familiar with the ways (conventions) of passing parameters to subroutines [3], and in particular with the cdecl convention for 32-bit operating systems, which will be used in this exercise.

2. The programming environment

Subroutines in x86 assembly language will be called from a program written in C++. Instruction execution can be traced in a debugger (for example gdb). Unfortunately, x86 assembly syntax is not the same across popular C/C++ compilers: there are two syntaxes, AT&T syntax used by gcc (GNU Compiler Collection) and Intel syntax used by MSVC (the Microsoft compiler). Since we are going to use Intel syntax in this exercise and gcc can produce it when generating assembly (gcc: options -S -masm=intel), this part of the difference is not going to be a problem. The only problem that remains is how functions written in x86 assembly are specified and called within C++ code. For this reason we provide short instructions for both popular compilers, MSVC and gcc (it is up to the student to choose which compiler to use in this exercise).

3. Work with gcc

The easiest way to write an x86 routine in gcc is to write it in a separate file with the extension .s. The file that defines the x86 routine subroutine_asm has the following basic structure:

  // this is a comment (like in C++)
  //
  // syntax selection (we use Intel):
  .intel_syntax noprefix

  // we want the subroutine name (subroutine_asm)
  // to be visible from the C++ code, so we
  // specify its name like this:
  .global subroutine_asm

  // and here we repeat the same label again:
  subroutine_asm:

  // ... assembly code

Routines defined in this manner are called in the same way as ordinary routines in C/C++ (this will be shown a little later). For now, just assume that the main program is in the file main.cpp, while the assembly routine is located in subroutine.s. Then compiling and linking can be done as follows (note that you might want to modify this command on 64-bit UNIX systems, as explained in section 7):

 $ g++ -g -o main subroutine.s main.cpp

Program tracing can now be initiated by the command (using gdb):

 $ gdb main

For this exercise, we need only a small subset of the capabilities of gdb described in [4]: the commands break, run, next, step, print, and info registers. The way to use these commands is explained in the gdb documentation [5].

4. Work with MSVC

The easiest way to write an assembly routine in MSVC is to write the function body as a so-called naked function, as follows:

  // the directive __declspec(naked) tells the compiler
  // not to generate any code before or after the function body
  // (no prologue or epilogue, thus the name naked);
  // parameters are passed using the cdecl convention,
  // which is the default for 32-bit MSVC
  int __declspec(naked) function_asm(int i){

    __asm{
      // ... assembly code
    }
  }

Subroutines written in assembly are called in the same way as ordinary routines in C, as will be explained in more detail later. Files containing assembly subroutines are compiled in the standard way: add the subroutine source file to the Visual Studio console project (the type of project used to build console C++ applications) and start the build (Build Solution).

Program tracing can be initiated through the integrated development environment (by clicking on Start Debugging). Useful tracing actions are Toggle Breakpoint, Step Over and Step Into. Useful windows are Watch and Registers.

5. Assembly subroutine structure

A basic introduction to x86 assembly programming can be found at x86 assembly guide (it is advisable to skim through it). The standard way to access parameters and local variables in a subroutine is through the register ebp (base pointer). To make this possible, within the cdecl convention, assembly subroutines have the following structure (more on the topic is available in [3]):

                     /* cdecl prologue: */
  push  ebp          /* save ebp on the stack */
  mov   ebp, esp     /* copy esp to ebp */

                     /* allocate 4 bytes for the local variables */
                     /* (or more if needed) */
  sub   esp, 4       /* local variables are "under" ebp (the stack */
                     /* grows toward lower addresses) */


                     /* main subroutine functionality */
  ...
                     /* return value is in eax */


                     /* release local variables: */
  add   esp, 4

                     /* cdecl epilogue: */
  pop   ebp          /* instead of 'add esp, 4' and 'pop ebp' */
                     /* we can also write 'leave' */
  ret                /* return from the subroutine */

6. An example

We will write a C/C++ subroutine that computes the expression (a+b)*c for given integer values a, b, c and returns the result. This subroutine in C (subroutine_c) looks like this:

  int subroutine_c(int a, int b, int c) {
    return (a + b) * c;
  }

The body of the corresponding subroutine written in x86 assembly (subroutine_asm) is the following:

                      /* [ebp] stores the previous value of ebp */
                      /* [ebp+4] is the return address (the saved eip) */
  mov   eax, [ebp+12] /* b */
  add   eax, [ebp+8]  /* a */
  imul  eax, [ebp+16] /* c */

The subroutine returns the result in register eax. The previous code snippet presents the main functionality of the function subroutine_asm, but to write the complete function we have to wrap it in the standard prologue and epilogue shown in the previous section (we can paste the snippet at the place denoted by the three dots). Also, since we don't have local variables in this example, we can omit the instructions sub esp, 4 and add esp, 4.

Note that in this simple subroutine the prologue and epilogue are not really necessary. That is, the function can be rewritten as follows:

  sub_asm_noebp:
                        /* [esp] return address */
    mov   eax, [esp+8]  /* b */
    add   eax, [esp+4]  /* a */
    imul  eax, [esp+12] /* c */
    ret                 /* return from subroutine */

Here, we do all the referencing by register esp (not ebp as before). However, in general, we will use prologue and epilogue in our assembly subroutines since they will not be so simple. Using prologue and epilogue may help us to keep our code structured and maintainable. Note finally that most compilers use prologue and epilogue while translating our C/C++ code to assembly code.

Please note that calling conventions require that some registers (e.g. ebx) be preserved across subroutine calls: a subroutine that modifies such a register must save it first and restore it before returning. Such registers are denoted as callee-saved in the documentation.

7. Assembly subroutines under 64-bit Linux, FreeBSD and OS X systems

Instructions from sections 3, 5 and 6 are not directly applicable to 64-bit UNIX systems, since cdecl is not the default calling convention there. This problem can be solved in two ways (either way works):

compile the program for a 32-bit platform by supplying the -m32 flag to the g++ invocation
- note that you might have to install packages with 32-bit libraries;
pass parameters under the default calling convention (System V AMD64 ABI):
- pass the first six arguments which are integers or pointers through registers rdi, rsi, rdx, rcx, r8, and r9;
- pass the return value through register rax.

8. x86 assembly subroutine call

Calls of assembly subroutines are transparent: they look exactly the same as calls of C/C++ routines. That means subroutines subroutine_asm and subroutine_c are called in exactly the same way. If the subroutine declaration is not visible at the point where it is called from the C/C++ program (e.g. the subroutine is defined in a separate file), then it is necessary to provide an appropriate prototype (this is the usual practice in C/C++ programs).

Assembly subroutines written in pure assembly (in a separate file with the extension .s, gcc) are translated into object code that follows the platform's application binary interface (ABI) for the C language. If we want to call such a subroutine from C++, we need to prefix the subroutine prototype with extern "C" to prevent C++ name mangling (the C++ compiler otherwise encodes the parameter types into the symbol name). On some platforms the compiler additionally decorates C symbol names according to the calling convention; for the cdecl convention on 32-bit Windows, the decoration is an underscore prefix. In our example the prototype looks like this:

extern "C" int subroutine_asm(int,int,int);

If we want to use gcc on Windows, we need to tell the compiler not to prefix the assembly function name with an underscore when compiling the main function. This is achieved with the asm() keyword in the external subroutine prototype declaration:

extern "C" int subroutine_asm(int,int,int) asm("subroutine_asm");

If we don't do that, we will get a link error because the linker will not be able to resolve the reference to the subroutine. There is also another way to achieve the same result. In the assembly code we can add a second global label whose name carries the underscore prefix. It looks like this:

  .global subroutine_asm
  .global _subroutine_asm

  subroutine_asm:
  _subroutine_asm:
  ...

9. Exercises

Test and analyse the example with the functions subroutine_asm and subroutine_c which was discussed in the previous sections:
- compile and test the program which consists of subroutines subroutine_c and subroutine_asm along with the main (test) function:
```
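    #include <iostream>
    extern "C" int subroutine_asm(int, int, int);  // assembly version
    int subroutine_c(int, int, int);               // C version from section 6

    int main(){
        std::cout <<"ASM: " <<subroutine_asm(3,5,6) <<std::endl;
        std::cout <<"C++: " <<subroutine_c(3,5,6) <<std::endl;
    }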
```
- See what assembly code is generated by the compiler for the function subroutine_c. Do this for different optimization levels (in gcc optimization levels are specified by option -O0, -O1, -O2, -O3). (MSVC: Project properties -> C/C++ -> Output files -> Assembler output; gcc: options -S -masm=intel)
Using the x86 assembly instruction set reference [6], write an assembly subroutine which does the following:
- stores number 42 to register eax.
- stores number 0x42 to register ebx.
- stores 0x0fff to the upper 16 bits of register edx (use instructions push, mov, pop).
- sets lower 8 bits of register edx to the number 0xdd.
- Trace execution of the created subroutine to check if it runs correctly.
Write two subroutines (one in C, the other in assembly) that sum all integer values in the half-open interval [0, n), that is 0 + 1 + ... + (n-1), where n is specified as a subroutine parameter. Instructions:
- write a function that tests the functionality of both subroutines (for any n, both subroutines should return the same value, that is, sum_asm(n) == sum_c(n))
- we need 3 local variables (loop counter, sum and limit). We can use registers ecx, eax, and edx for them, so we don't need to allocate any space for local variables on the stack.
- the loop counter and the sum should be initialized to 0. The limit (parameter n) can be loaded from the stack (it is at [esp+4] on entry to the subroutine, or at [ebp+8] after the standard prologue).
- n can also be equal to 0, in which case the loop must perform 0 iterations
- the loop counter and the sum can be increased with the instruction add (for the loop counter we can also use the instruction inc)
- the conditional jump is performed with the instructions cmp (compare) and jl (jump if less): the jump is taken if the preceding compare determined that the loop counter is still less than the limit
- the sum (return value) is returned in register eax. Since we already use eax for summing, we can return to the main (or test) function right after the loop counter reaches the limit (the standard epilogue is assumed).
- the solution (sum_asm) can be written in fewer than 15 lines of code
Write three subroutines (the first in standard C, the second using instructions from the x87 instruction set (for floating point arithmetic), and the third using instructions from the Streaming SIMD Extensions (SSE) instruction set) that add two single precision float vectors of the same length, element by element. Instructions:
- The function prototype (in C) is the following: void vector_add_c(float const* a, float const* b, int count, float *r); where a and b are the input vectors, count is their length, and r receives the sum of a and b.
- Write a test function in C that calls all three implementations and checks if they return the same vector r for the specified vectors a and b.
- The x87 assembly subroutine uses instructions from the x87 instruction set. We will need the instruction fld, which loads a vector element onto the x87 stack, fadd, which adds two float values that have previously been put on the x87 stack, and fstp, which stores the sum produced by the previous instruction to the corresponding element of r. More on the x87 instruction set can be found in [8,9].
  - To access an element of the vector we can address it by its index, like this: fld DWORD PTR [eax+ecx*4], where eax is the base address of the vector (if we are looking at vector a, it is the address of the element a[0]) and register ecx is the index of the element (it is like index i in the notation a[i]). We multiply the index by 4 because the expression computes a memory offset in bytes and each single precision floating point number occupies 4 bytes.
  - We can do it with 5 variables. One logical setup is the following: a -> eax, b -> ebx, index -> ecx, count -> edx, r -> edi
  - registers ebx and edi should be returned to the main function unchanged, so we need to push them onto the stack in the subroutine prologue and pop them off the stack in the subroutine epilogue.
  - one possible solution (for the x87 subroutine) has 23 lines of assembly code
- The equivalent SSE subroutine will use vector instructions from the SSE instruction set. Those instructions use vector registers (xmm0 - xmm7) that are 128 bits long, which means they can hold 4 elements of a float array at once. Useful instructions are movups or movaps (the latter requires 16-byte aligned addresses) and addps. More info is available in [7,8,9]. Consider the case when the vector length is not divisible by 4: in that case, use x87 instructions for the remaining 3, 2 or 1 elements. If you face difficulties with the SSE instruction set, you can also fall back to compiler intrinsic functions, which translate directly to SSE instructions. For Intel architecture processors, a list of such functions can be found at SSE intrinsics
- Measure the running time of each of the three implementations (for the timing, you can use the function clock()). If you specify optimization flags at compile time, it is very likely that the compiler will translate your standard C implementation into an SSE implementation! Use that to your advantage to see how an SSE function can be written (generate the assembly code of your C function by specifying the -S flag, gcc).
- BONUS Create a function that computes saxpy. Learn about Duff's device and how it can be used to reduce the number of loop iterations. Compare it with an ordinary C function that solves the saxpy problem. For more reference you can also use a similar exercise from ETH Zurich. You can also try an optimized linear algebra package like BLAS. Analyse the performance gains of each implementation.

References

[1] Wikipedia: x86 architecture
[2] Wikipedia: x86 assembly language
[3] Wikipedia: x86 calling conventions
[4] Wikipedia: GNU Debugger
[5] Using GNU's GDB Debugger
[6] x86 Instruction Set Reference
[7] Wikipedia: Streaming SIMD Extensions
[8] x86 Instruction Set Reference
[9] x86 Instruction Set Reference
[10] x86 instruction listings
Last change: 15th October 2012.