This laboratory exercise considers the influence of the architectural properties of a computer on the execution of programs written in higher-level languages. If possible, the exercise should be worked out on computers with different architectures, and the results should be compared and discussed. As a minimal recommended solution, you can use the PC/Pentium and SUN/Ultrasparc (pinus) architectures.
Write a computer program which detects the byte ordering convention employed by the computer executing the program. Determine the convention employed by the test architectures.
Hint: analyze individual bytes of a suitably crafted computer word by employing an adequate pointer.
Remark: 64-bit SPARC computers can be configured to employ either of the two main byte ordering conventions. However, your task is to discover which of the two is preferred by the compiler.
The goal of this laboratory exercise is to analyze the influence of cache parameters on performance. The exercise consists of the following tasks:
Determine the following cache (L1, L2 and, where present, L3) parameters of the considered computer: total capacity, line width in bytes, and associativity. On modern x86 processors this can be accomplished by directly executing the CPUID instruction [1, 2] or by invoking tools which employ that instruction [3, 4]. For other processors, this information can be retrieved from the web pages of the manufacturer.
In the case of the pinus computer, the computer type can be recovered by employing the commands uname, fpversion, or prtconf. Cache parameters can be found on the web pages [5].
On computers running Linux, this information can be retrieved from the virtual directories /proc/ (file cpuinfo) and /sys/devices/system/cpu/ (e.g. file cpu0/cache/index2/size).
cat /proc/cpuinfo
cat /sys/devices/system/cpu/cpu0/cache/index2/size
The other option is to employ the programs hwinfo, lshw, lscpu or dmidecode, which are able to display more detailed information about the system.
Write a computer program to show the memory access performance for data in:
Let's introduce the following notation:
s1 ... capacity of L1 cache
b1 ... line width of L1 cache
s2, b2 ... analogously for cache L2
Your program should measure average bandwidth of the byte access during many rounds of execution of the following three subroutines:
A: accesses every byte of a memory buffer containing s1 bytes, in the order of increasing memory addresses;
B: accesses every b1-th byte of a memory buffer containing 2*s1 bytes, in the order of increasing memory addresses;
C: accesses every b2-th byte of a memory buffer containing 2*s2 bytes, in the order of increasing memory addresses.
Subroutines B and C employ a memory buffer which is exactly twice as large as the capacity of the analyzed cache (L1 or L2). By using such memory buffers we ensure that the buffer does not fit into the analyzed cache, while still being smaller than the capacity of the memory at the next level of the memory hierarchy (if we are testing L1, the buffer is smaller than L2).
Each of these three subroutines has to be invoked many times in a loop.
Subroutine A generates L1 cache misses relatively seldom (once in b1 accesses). Subroutine B generates an L1 cache miss in each memory access, but the majority of these accesses should hit in L2. Subroutine C generates an L2 cache miss in each access.
For each subroutine, your program should determine the average time of a byte access, as well as the achieved bandwidth in MB/s. Based on the obtained data, estimate the ratios of the latencies corresponding to neighbouring levels of the memory hierarchy (t(L2)/t(L1), t(RAM)/t(L2)).
Instructions:
clock function (<time.h>);
-O3)
cputrack for more information).
Many programs implement 2D matrices by linear buffers. If the buffer address is given in buf, i and j denote row and column indices, and rows and cols denote matrix dimensions, then the element at (i,j) can be accessed by buf[i*cols+j].
Often the same operation needs to be applied to all matrix elements, such as in matrix multiplication. Your task is to find out experimentally whether it is better to loop first over rows and then over columns, or vice versa, in the case of large matrices which cannot fit into the L2 cache. The obtained results should be commented on and discussed.
Instructions
N * M integers (int), where M is chosen such that M * sizeof(int) is equal to the capacity of the L2 cache.
volatile in order to prevent the compiler from modifying the order of array accesses.
The goal of this exercise is to analyze the influence of (i) built-in data types and (ii) elementary operations on program performance.
int into char, short, float or double.
[1] Wikipedia: CPUID
[2] Intel Processor Identification and the CPUID Instruction
[3] x86info
Last change: 14th January 2013.