Fig. 4.3 Conceptual view of the memory hierarchy
To fully utilize the speed of flash-based storage, the interfaces and drivers have to be adapted accordingly.
On top of flash is the main memory, which is directly accessible. The next levels are the CPU caches (L3, L2, and L1) with different characteristics. Finally, the top level of the memory hierarchy is formed by the registers of the CPU, where the actual calculations take place.
When accessing data from a disk, there are usually four layers between the accessed disk and the registers of the CPU which only transport information. In the end, every operation takes place inside the CPU, and therefore the data has to be in the registers.
Table 4.1 shows some of the latencies that come into play in the memory hierarchy. Latency is the time delay experienced by the system from the moment data is requested from a storage medium until it is available in a CPU register. The L1 cache latency is 0.5 ns. In contrast, a main memory reference takes 100 ns and a simple disk seek takes 10 ms.
In the end, there is nothing special about "in-memory" computing: all computing ever done was in memory, because it can only take place in the CPU.
Table 4.1 Latency numbers

Action                                      Time in nanoseconds (ns)    Time
L1 cache reference (cached data word)       0.5
Branch mispredict                           5
L2 cache reference                          7
Mutex lock / unlock                         25
Main memory reference                       100                         0.1 µs
Send 2,000 byte over 1 Gb/s network         20,000                      20 µs
SSD random read                             150,000                     150 µs
Read 1 MB sequentially from memory          250,000                     250 µs
Disk seek                                   10,000,000                  10 ms
Send packet CA to Netherlands to CA         150,000,000                 150 ms
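
To put these orders of magnitude into perspective, the following small C program computes how many cache or main memory references fit into a single SSD read or disk seek. It is only a back-of-the-envelope sketch; the constants are copied directly from Table 4.1.

  #include <stdio.h>

  int main(void) {
      /* Latency numbers taken from Table 4.1, in nanoseconds */
      const double l1_cache_ns  = 0.5;
      const double main_mem_ns  = 100.0;
      const double ssd_read_ns  = 150000.0;
      const double disk_seek_ns = 10000000.0;

      printf("L1 references per disk seek:      %.0f\n", disk_seek_ns / l1_cache_ns);
      printf("Memory references per disk seek:  %.0f\n", disk_seek_ns / main_mem_ns);
      printf("Memory references per SSD read:   %.0f\n", ssd_read_ns / main_mem_ns);
      return 0;
  }

A single disk seek thus costs as much as roughly 100,000 main memory references, which illustrates why avoiding disk access pays off so dramatically.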
Assuming a bandwidth-bound application, the performance is determined by how fast the data can be transferred through the hierarchy to the CPU. In order to estimate the runtime of an algorithm, it is possible to roughly estimate the amount of data which has to be transferred to the CPU. A very simple operation that a CPU can do is a comparison, for example filtering on an attribute. Let us assume a calculation speed of 2 MB per millisecond for this operation using one core, so one core of a CPU can digest 2 MB per millisecond. This number scales with the number of cores: ten cores can scan 20 GB per second, and ten nodes with ten cores each already reach 200 GB per second.
Considering a large multi-node system like that, having ten nodes and 40 CPUs per node where the data is distributed across the nodes, it is hard to write an algorithm which needs more than one second, even on large amounts of data. The previously mentioned 200 GB are highly compressed data, so they correspond to a much larger amount of plain character data. To sum this up, the number to remember is 2 MB per millisecond per core. If an algorithm shows a completely different result, it is worth looking into it, as there is probably something going wrong. This could be an issue in the SQL statement, such as an overly complicated join or a nested loop.
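
The following sketch turns this rule of thumb into a small calculation. The figure of 2 MB per millisecond per core is the one given above; the node count, core count, and data volume are merely hypothetical example values.

  #include <stdio.h>

  int main(void) {
      /* Rule of thumb from the text: one core scans 2 MB per millisecond */
      const double mb_per_ms_per_core = 2.0;

      /* Hypothetical example configuration */
      const int nodes          = 10;
      const int cores_per_node = 10;
      const double data_mb     = 200.0 * 1024;   /* 200 GB of compressed data */

      double total_cores = (double)nodes * cores_per_node;
      double scan_ms = data_mb / (mb_per_ms_per_core * total_cores);

      printf("Estimated scan time: %.0f ms\n", scan_ms);
      return 0;
  }

With ten nodes of ten cores each, the estimate arrives at roughly one second for 200 GB of compressed data, in line with the numbers discussed above.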
4.7 Non-Uniform Memory Architecture
As the development in modern computer systems goes from multi-core to many-core systems and the amount of main memory continues to increase, the central memory interface in Uniform Memory Architecture (UMA) systems becomes a bottleneck and introduces heavy challenges in hardware design to connect all cores and memory. Non-Uniform Memory Architectures (NUMA) attempt to solve this problem by introducing local memory locations which are cheap to access for local processors.
Fig. 4.4 (a) Shared FSB, (b) Intel QuickPath Interconnect [Int09]
Figure 4.4 pictures an overview of a UMA and a NUMA system. In a UMA
system, every processor observes the same speed when accessing an arbitrary memory address, as the complete memory is accessed through a central memory interface, as shown in Fig. 4.4a. In contrast, in NUMA systems every processor has its primarily used local memory as well as remote memory supplied by the other processors. This setup is shown in Fig. 4.4b. From a processor's point of view, these different kinds of memory introduce different memory access times for local memory (adjacent slots) and remote memory that is adjacent to other processing units.
Additionally, systems can be classified into cache-coherent NUMA (ccNUMA) and non-cache-coherent NUMA systems. ccNUMA systems provide each CPU cache with the same view of the complete memory and enforce coherency by a hardware-implemented protocol. Non-cache-coherent NUMA systems require software layers to handle memory conflicts accordingly. Although non-ccNUMA hardware is easier and cheaper to build, most of today's available standard hardware provides ccNUMA, since non-ccNUMA hardware is more difficult to program.
To fully exploit the potential of NUMA, applications have to be made aware of the different memory latencies and should primarily load data from the memory slots locally attached to a processor. Memory-bound applications may suffer a degradation of up to 25 % of their maximal performance if remote memory is accessed instead of local memory.
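
On Linux, one common way to make this locality explicit is the libnuma library. The following sketch, which assumes a Linux system with libnuma installed and is linked with -lnuma, allocates a buffer on the NUMA node of the calling thread so that subsequent scans over it hit local memory. It only illustrates the idea of preferring local allocations and is not a complete NUMA-aware memory manager.

  #include <stdio.h>
  #include <stdlib.h>
  #include <numa.h>    /* libnuma, link with -lnuma */

  int main(void) {
      if (numa_available() < 0) {
          fprintf(stderr, "NUMA is not available on this system\n");
          return EXIT_FAILURE;
      }

      /* Allocate 64 MB on the node the calling thread currently runs on,
         so that subsequent scans over the buffer access local memory. */
      size_t size = 64UL * 1024 * 1024;
      void *local_buf = numa_alloc_local(size);
      if (local_buf == NULL) {
          fprintf(stderr, "local allocation failed\n");
          return EXIT_FAILURE;
      }

      printf("allocated %zu bytes of node-local memory\n", size);
      numa_free(local_buf, size);
      return 0;
  }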
By introducing NUMA, the central bottleneck of the FSB can be avoided and
memory bandwidth can be increased. Benchmark results have shown that a
throughput of more than 72 GB per second is possible on an Intel XEON 7560
(Nehalem EX) system with four processors [Fuj10].
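
To get a rough feeling for such bandwidth numbers on a given machine, a very simplified measurement is to time a sequential read over a large buffer, as in the following sketch (assuming a POSIX system). Published benchmarks such as the one cited above use many threads pinned to different sockets, so this single-core sketch only yields a lower bound.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  int main(void) {
      /* Sequentially read a 1 GiB buffer with a single core
         and report the achieved bandwidth. */
      size_t n = 1UL << 27;                 /* 2^27 doubles = 1 GiB */
      double *buf = malloc(n * sizeof(double));
      if (buf == NULL) return EXIT_FAILURE;
      for (size_t i = 0; i < n; i++) buf[i] = 1.0;

      struct timespec start, end;
      clock_gettime(CLOCK_MONOTONIC, &start);
      volatile double sum = 0.0;            /* volatile keeps the loop from being optimized away */
      for (size_t i = 0; i < n; i++) sum += buf[i];
      clock_gettime(CLOCK_MONOTONIC, &end);

      double secs = (end.tv_sec - start.tv_sec) +
                    (end.tv_nsec - start.tv_nsec) / 1e9;
      printf("single-core sequential read: %.1f GB/s (checksum %.0f)\n",
             n * sizeof(double) / 1e9 / secs, sum);

      free(buf);
      return 0;
  }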
4.8 Scaling Main Memory Systems
An example system that consists of multiple nodes can be seen in Fig. 4.5. Each node has eight CPUs with eight cores each, so one node provides 64 cores, and there are four nodes. Each node has a terabyte of RAM and SSDs for persistence. Everything below DRAM is used for logging, archiving, and for the emergency reconstruction of data, that is, for reloading the data after the power supply was turned off.
The networks which connect the nodes are continuously increasing in speed. In the example shown in Fig. 4.5, a 10 Gb Ethernet network connects the four nodes. Computers with 40 Gb InfiniBand are already on the market, and switch manufacturers are talking about 100 Gb switches which even contain logic that allows smart switching. This is another place where optimization can take place, on a low level and very effectively for applications. It can be leveraged to improve joins, where calculations often go across multiple nodes.
Fig. 4.5 A system consisting of multiple blades
4.9 Remote Direct Memory Access
Shared memory is another interesting way to directly access memory across multiple nodes. The nodes are connected via an InfiniBand network and create a shared memory region. The main idea is to automatically access data which resides on a different node without explicitly shipping it, that is, to access it directly instead of transferring it and processing it on the other side. Research in this direction has been done at Stanford University in cooperation with the HPI using a RAM cluster. It is very promising, as it could basically offer direct access to a seemingly unlimited amount of memory from a program's perspective.
4.10 Self Test Questions
1. Speed per Core
What is the speed of a single core when processing a simple scan operation
(under optimal conditions)?
(a) 2 GB/ms/core
(b) 2 MB/ms/core