we run a simple benchmark accessing a constant number (4,096) of addresses with an increasing stride, i.e., distance, between the accessed addresses.
We implemented this benchmark by iterating through an array, chasing a pointer. The array is filled with structs, so that following the pointers of the elements creates a cycle through the complete array. Structs are data structures that allow the creation of user-defined aggregate data types grouping multiple individual variables together. Each struct consists of a pointer and an additional data attribute realizing the padding in memory, so that following the pointer-chained list results in memory accesses with the desired stride.
In the case of a sequential array, the pointer of element i points to element i + 1, and the pointer of the last element references the first element so that the cycle is closed. In the case of a random array, the pointer of each element points to a random element of the array, while ensuring that every element is referenced exactly once.
Figure 8.1 outlines the created sequential and random array.
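The following C sketch shows one way such a benchmark could be implemented. The struct layout follows the description above, but all names (element_t, PAD, link_sequential, link_random, chase) and the Fisher-Yates shuffle used for the random layout are our own illustration, not taken from the original benchmark. Timing the chase call while varying PAD (and hence the stride) yields curves like those in Fig. 8.2.

```c
#include <stdio.h>
#include <stdlib.h>

enum { N = 4096 };          /* constant number of accessed addresses      */
#define PAD 56              /* padding bytes; stride = sizeof(element_t)  */
                            /* 8-byte pointer + 56 bytes = 64-byte stride */
typedef struct element {
    struct element *next;   /* pointer forming the cycle          */
    char padding[PAD];      /* realizes the stride in memory      */
} element_t;

/* sequential layout: element i points to i + 1, last closes the cycle */
static void link_sequential(element_t *a) {
    for (size_t i = 0; i < N; i++)
        a[i].next = &a[(i + 1) % N];
}

/* random layout: one cycle in shuffled order, so every element
 * is referenced exactly once */
static void link_random(element_t *a) {
    size_t perm[N];
    for (size_t i = 0; i < N; i++) perm[i] = i;
    for (size_t i = N - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        a[perm[i]].next = &a[perm[(i + 1) % N]];
}

/* follow the pointer chain once through all N elements; time this */
static element_t *chase(element_t *p) {
    for (size_t i = 0; i < N; i++)
        p = p->next;
    return p;
}

int main(void) {
    element_t *a = malloc(N * sizeof *a);
    link_sequential(a);                 /* or link_random(a)          */
    printf("%p\n", (void *)chase(a));   /* print to keep the loop live */
    free(a);
    return 0;
}
```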
If the assumption holds and random memory access costs are constant, then the size of the padding in the array and the array layout (sequential or random) should make no difference when iterating over the array. Figure 8.2 shows the result of iterating through a list of 4,096 elements while following the pointers inside the elements and increasing the padding between the elements. As we can clearly see, the access costs are not constant and increase with an increasing stride. We also see multiple points of discontinuity in the curves; e.g., the random access times increase steeply up to a stride of 64 bytes and then continue increasing with a smaller slope.
Figure 8.3 indicates that an increasing number of cache misses causes the increase in access times. The first point of discontinuity in Fig. 8.2 corresponds almost exactly to the size of the cache lines of the test system. The steep increase is due to the fact that, with a stride smaller than 64 bytes, multiple list elements are located on one cache line, and the overhead of loading one line is amortized over these elements. For example, with a 16-byte stride, four elements share one 64-byte cache line, so only every fourth access has to load a new line.
Fig. 8.1 Sequential versus random array layout (each element consists of a pointer and padding; the pointers form a cycle through the array)
Fig. 8.2 Cycles for cache accesses with increasing stride (CPU cycles per element versus stride in bytes, 2^0 to 2^18, for sequential and random access; the cache line size and the page size are marked)
For strides greater than 64 bytes, we would expect one cache miss for every single list element and no further increase in access times. However, as the stride gets larger, the array is spread over multiple pages in memory and more TLB misses occur, as the virtual addresses on the new pages have to be translated into physical addresses. The number of TLB misses increases up to the page size of 4 KB and then stays at its worst case of one miss per element. With strides greater than the page size, the TLB misses can induce additional cache misses when translating a virtual to a physical address. These cache misses are due to accesses to the paging structures, which reside in main memory [BCR10, SS95].
To summarize, the performance of main memory accesses can differ greatly depending on the access patterns. In order to improve application performance, main memory access should be optimized to exploit the caches.
8.1.2 The Size Experiment
In a second experiment, we access a constant number of addresses in main memory with a constant stride of 64 bytes and vary the size of the working set, i.e., the accessed area in memory. A run with n memory accesses and a working set size of s bytes iterates n / (s/64) = 64n/s times through the array, which is created as described earlier for the stride experiment in Sect. 8.1.1.
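To make the iteration count concrete, the following fragment continues the sketch above (the function name and parameters are our own illustration; it assumes sizeof(element_t) == 64, i.e., PAD = 56 with 8-byte pointers). Timing this function for varying s produces curves like those in Fig. 8.4.

```c
/* perform n memory accesses over a working set of s bytes: one pass
 * through the cyclic array touches s/64 elements, so n accesses need
 * n / (s/64) = 64n/s passes */
static element_t *size_experiment(element_t *a, size_t n, size_t s) {
    size_t elements = s / 64;          /* 64-byte elements in the set */
    size_t passes   = n / elements;    /* = 64n/s full iterations     */
    element_t *p = a;
    for (size_t pass = 0; pass < passes; pass++)
        for (size_t i = 0; i < elements; i++)
            p = p->next;               /* each pass touches s bytes   */
    return p;
}
```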
Figure 8.4a shows that the access costs differ by up to a factor of 100, depending on the working set size. The points of discontinuity correlate with the sizes of the caches in the system. As long as the working set is smaller than the L1 cache, only the first iteration results in cache misses, and all other accesses can be answered out of the cache. As the working set size increases, the accesses in one iteration start to evict the earlier accessed addresses, resulting in cache misses in the next iteration.
Fig. 8.3 Cache misses for cache accesses with increasing stride. (a) Sequential access. (b) Random access
Figure 8.4b shows the individual cache misses with increasing working set sizes. Up to working sets of 32 KB, the L1 cache misses rise to one per element; the L2 cache misses reach their plateau at the L2 cache size of 256 KB, and the L3 cache misses at 12 MB.
As we can see, the larger the accessed area in main memory, the more capacity cache misses occur, resulting in poorer application performance. Therefore, it is advisable to process data in cache-friendly chunks if possible, as the sketch below illustrates.
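As a purely illustrative sketch (the function, the CHUNK constant, and the 32 KB block size are our assumptions, not from the text), a computation that needs two passes over the data can perform both passes block by block, so that the second pass still finds the block resident in cache:

```c
#include <stddef.h>

#define CHUNK (32 * 1024 / sizeof(double))   /* roughly L1-sized block */

/* two passes over the data, done chunk by chunk: pass 1 loads the
 * block's cache lines, pass 2 re-reads them while they are still hot */
double process(const double *data, size_t len) {
    double acc = 0.0;
    for (size_t start = 0; start < len; start += CHUNK) {
        size_t end = start + CHUNK < len ? start + CHUNK : len;
        double local = 0.0;
        for (size_t i = start; i < end; i++)   /* pass 1: warms cache */
            local += data[i];
        for (size_t i = start; i < end; i++)   /* pass 2: hits cache  */
            local += data[i] * data[i];
        acc += local;
    }
    return acc;
}
```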
Fig. 8.4 Cycles and cache misses for cache accesses with increasing working sets (CPU cycles per element versus the size of the accessed area, 16 KB to 256 MB, for sequential and random access; the L1, L2, and L3 cache sizes are marked). (a) Sequential access. (b) Random access

8.2 Row and Columnar Layouts

Let us consider a simple example to illustrate the two mentioned approaches for representing a relational table in memory. For simplicity, we assume that all values are stored as strings directly in memory and that we do not need to store any additional data. As an example, let us look at the simple world population table:
Id   Name          Country     City
1    Paul Smith    Australia   Sydney
2    Lena Jones    USA         Washington
3    Marc Winter   Germany     Berlin
As discussed above, the database must transform its two-dimensional table into a one-dimensional series of bytes for the operating system to write to memory. The classical and obvious approach is a row- or record-based layout. In this case, all attributes of a tuple are stored consecutively and sequentially in memory. In other words, the data is stored tuple-wise. Considering our example table, the data