Computer Architecture 2

Computer Architecture 2 - lab 3
Influence of computer architecture on software performance

This laboratory exercise considers the influence of the architectural properties of a computer on the execution of programs written in higher-level languages. If possible, the exercise should be carried out on computers with different architectures, and the results should be compared and discussed. As a minimal recommended solution, you can use the PC/Pentium and SUN/UltraSPARC (pinus) architectures.

1. Byte ordering (endianness)

Write a computer program which detects the byte ordering convention of the computer on which it runs. Determine the convention employed by the test architectures.

Hint: analyze individual bytes of a suitably crafted computer word by employing an adequate pointer.

Remark: 64-bit SPARC computers can be configured to employ either of the two main byte ordering conventions. However, your task is to discover which of the two is preferred by the compiler.

2. Cache memories

The goal of this laboratory exercise is to analyze the influence of cache parameters on performance. The exercise consists of the following tasks:

Determine the following parameters of the caches (L1, L2, and L3 if present) of the considered computer: total capacity, line width in bytes, and associativity. On modern x86 processors this can be accomplished by executing the CPUID instruction directly [1, 2] or by invoking tools which employ that instruction [3, 4]. For other processors, this information can be found on the web pages of the manufacturer.

In the case of the pinus computer, the computer type can be determined with the commands uname, fpversion, or prtconf. Cache parameters can be found on the web page [5].

On computers running Linux, this information is available in the virtual directories /proc/ (file cpuinfo) and /sys/devices/system/cpu/ (e.g. file cpu0/cache/index2/size).
```
cat /proc/cpuinfo
```
```
cat /sys/devices/system/cpu/cpu0/cache/index2/size
```
Another option is to employ the programs hwinfo, lshw, lscpu, or dmidecode, which are able to display more detailed information about the system.
Write a computer program to show the memory access performance for data in:
- the main storage (or L3),
- the L2 cache,
- the L1 cache.
Let's introduce the following notation:
- s1 ... capacity of L1 cache
- b1 ... line width of L1 cache
- s2, b2 ... analogously for cache L2
Your program should measure average bandwidth of the byte access during many rounds of execution of the following three subroutines:
- subroutine A: increment all bytes of a memory buffer containing s1 bytes in the order of increasing memory addresses;
- subroutine B: increment each b1-th byte of a memory buffer containing 2*s1 bytes in the order of increasing memory addresses;
- subroutine C: increment each b2-th byte of a memory buffer containing 2*s2 bytes in the order of increasing memory addresses.
Subroutines B and C employ a memory buffer which is exactly twice as large as the capacity of the analyzed cache (L1 or L2). With such a buffer, consecutive rounds cannot find the data in the analyzed cache, while the buffer is still smaller than the capacity of the next level of the memory hierarchy (if we are testing L1, the buffer is smaller than L2).
Each of these three subroutines has to be invoked many times in a loop. Subroutine A generates L1 cache misses relatively seldom (at most once in b1 accesses). Subroutine B generates an L1 cache miss in each memory access, but the majority of these accesses will hit in L2. Subroutine C generates an L2 cache miss in each access.
For each subroutine, your program should determine the average time of a byte access, as well as the achieved bandwidth in MB/s. Based on the obtained data, estimate the ratios of the latencies corresponding to neighbouring levels of the memory hierarchy (t(L2)/t(L1), t(RAM)/t(L2)).
Instructions:
- before carrying out the measurements, initialize all bytes of the memory buffers to random values in order to be sure that the whole buffer is in cache L2 (for subroutine A), and to make the optimizing compiler think that we are actually doing something useful;
- perform the measurements by using the clock function (<time.h>);
- in order to make the measurements reliable, repeat the experiment enough times that the measured time interval is comparable to one second;
- after the measurement (when the loop is completed) sum all bytes in the buffer and write out the result in order to make the optimizing compiler think that we really care about the result;
- use full compiler optimization (gcc: -O3);
- BONUS: Check whether our assumptions about cache misses and hits are correct. On the pinus computer this can be accomplished by employing internal cache performance counters (man cputrack for more information).
- BONUS2: Write a program for determining cache parameters on an x86 architecture.
Many programs implement 2D matrices as linear buffers. If the buffer address is given in buf, i and j denote the row and column indices, and rows and cols denote the matrix dimensions, then the element at (i,j) can be accessed as buf[i*cols+j]. Often the same operation needs to be applied to all matrix elements, as in matrix multiplication. Your task is to find out experimentally whether it is better to loop first over rows and then over columns, or vice versa, in the case of large matrices which cannot fit into the L2 cache. The obtained results should be commented on and discussed.
Instructions:
- Create an array of N * M integers (int), where M is chosen such that M * sizeof(int) is equal to the capacity of the L2 cache.
- Qualify all pointers to the array with the keyword volatile in order to prevent the compiler from changing the order of array accesses.
- Initialize all array elements to random values and measure the duration of many calls of a subroutine which determines the sum of all array elements.
- Measure the duration of an alternative subroutine in which the order of loops over rows and columns is exchanged and write out the ratio of the achieved performances.
- As before, use optimization, and make the compiler believe that we care about the results.
- BONUS: Repeat all this in the context of write accesses (e.g. matrix initialization) and read-write accesses (e.g. matrix addition). Discuss the achieved results.

3. Influence of the data type on program performance

The goal of this exercise is to analyze the influence of (i) built-in data types and (ii) elementary operations on program performance.

Write a simple subroutine for incrementing all elements of a large integer matrix and measure the execution time.
Determine the change in the execution time if the increment is replaced with multiplication.
Determine the change in the execution time if the data type is changed from int to char, short, float, or double.
Print out all required times in a tabular form.

Discuss the results.

References

[1] Wikipedia: CPUID

[2] Intel Processor Identification and the CPUID Instruction

[3] x86info

[4] System Information Viewer

[5] Sun Fire V880 CINT2000 Result

[6] L1 memory cache on Intel x86 processors

Last change: 14th January 2013.