10 Delete
Next, we scan through the attribute vectors and find the appropriate positions,
which means we look up the recordIDs for these values. In our example, there is
only one tuple with that combination of first and last name.
When the two values are finally deleted from the attribute vectors, all subsequent
tuples are moved up so that the sequence stays free of gaps and occupies a
contiguous memory area. This implementation of the delete operation is therefore
very expensive in terms of performance. Chapter 26, later in the course, presents
the insert-only approach as a better alternative for implementing deletion in
typical enterprise use cases. That approach is logical rather than physical in nature.
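The two steps above, locating the matching recordIDs and then closing the gaps, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each column is modelled as a list-based dictionary plus attribute vector, the names `Column`, `matching_record_ids`, and `physical_delete` as well as the sample data are hypothetical, and the dictionaries are left untouched.

```python
from dataclasses import dataclass


@dataclass
class Column:
    dictionary: list        # sorted distinct values; list index = valueID
    attribute_vector: list  # one valueID per tuple; list index = recordID


def matching_record_ids(column, value):
    """Scan the attribute vector for all recordIDs that hold `value`."""
    if value not in column.dictionary:
        return set()
    value_id = column.dictionary.index(value)
    return {rid for rid, vid in enumerate(column.attribute_vector) if vid == value_id}


def physical_delete(columns, predicates):
    """Physically delete all tuples matching every predicate (column -> value)."""
    hits = set(range(len(next(iter(columns.values())).attribute_vector)))
    for name, value in predicates.items():
        hits &= matching_record_ids(columns[name], value)
    # Delete from the back so that the remaining recordIDs stay valid.
    for rid in sorted(hits, reverse=True):
        for column in columns.values():
            del column.attribute_vector[rid]  # moves all subsequent entries up
    return sorted(hits)


# Hypothetical sample data: (John, Smith), (Jane, Smith), (John, Doe), (Mary, Doe)
columns = {
    "fname": Column(["Jane", "John", "Mary"], [1, 0, 1, 2]),
    "lname": Column(["Doe", "Smith"], [1, 1, 0, 0]),
}
deleted = physical_delete(columns, {"fname": "John", "lname": "Smith"})
print(deleted)                            # recordIDs that were removed
print(columns["fname"].attribute_vector)  # remaining first-name valueIDs
```

Each `del` shifts the tail of the list in every column, which mirrors why the physical delete becomes expensive on large tables.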
10.2 Self Test Questions
1. Delete Implementations
Which two possible delete implementations are mentioned in the course?
(a) White box and black box delete
(b) Physical and logical delete
(c) Shifted and liquid delete
(d) Column and row deletes
2. Arrays to Scan for Specific Query with Dictionary Encoding
When applying a delete with two predicates, e.g. firstname = 'John' AND
lastname = 'Smith', how many logical blocks in the IMDB are looked at while
determining which tuples to delete (all columns are dictionary encoded)?
(a) 1
(b) 2
(c) 4
(d) 8
3. Fast Delete Execution
Assume a physical delete implementation and the following two SQL statements on
our world population table:
(A) DELETE FROM world_population WHERE country = 'China';
(B) DELETE FROM world_population WHERE country = 'Ireland';
Which query will execute faster? Please only consider the concepts learned so far.
(a) Equal execution time
(b) A
(c) Depends on the ordering of the dictionary
(d) B
Reference
[Pla09] H. Plattner, in A common database approach for OLTP and OLAP using an in-memory
column database, ed. by U. Çetintemel, S. Zdonik, D. Kossmann. SIGMOD Conference
(ACM, New York, 2009), pp. 1–2
Chapter 11
Insert
This chapter outlines what happens when a new tuple is inserted into a table
(execution of an INSERT statement). Compared to a row-based database, an insert in
a column store is a bit more complicated. In a row-oriented database, the new
tuple is simply appended to the end of the table, i.e., the tuple is stored as one
piece. SanssouciDB stores the data physically in column orientation. A
detailed description of the differences between row store and column store is given
in Chap. 8. Adding a new tuple to the database therefore means adding a new entry
to every column of the table. Internally, every column consists of a
dictionary and an attribute vector (see Chap. 6). Adding a new entry to a column
means checking the dictionary and adding a new value if necessary. Afterwards, the
valueID of the dictionary entry is appended to the attribute vector of the
column. Since the dictionary is sorted, adding a new entry to a column results in
one of three scenarios:
1. Without a new dictionary entry
2. With a new dictionary entry, without resorting the dictionary
3. With a new dictionary entry, with resorting the dictionary
In this chapter, we will give a step by step explanation of the three different
scenarios.
11.1 Example
In this example, we insert the data of a new person into the world_population table
(see Fig. 11.1) that we used before. The example outlines what happens for the
column lname, representing the last name of a person, and fname, representing the
first name of a person.
H. Plattner, A Course in In-Memory Data Management,
DOI: 10.1007/978-3-642-36524-9_11, © Springer-Verlag Berlin Heidelberg 2013
Fig. 11.1 Example database table named world_population
11.1.1 INSERT without New Dictionary Entry
To demonstrate an insert without a new dictionary entry, we look at the insert of
the last name attribute into the lname column of our world_population table. The
attribute vector and dictionary of the lname column are initially filled as
displayed in Fig. 11.2.
To add the string Schulze to the column, we need to look up whether it already
exists in the dictionary. Since there is another person named Sophie Schulze
(recordID four of the world_population table) in the database, the dictionary for
the lname column already contains an entry with the string Schulze. As one can see
from Fig. 11.3, the dictionary position of Schulze is 3.
Since Schulze is at position 3 of the dictionary, we append 3 to the end of the
attribute vector (see Fig. 11.4).
11.1.2 INSERT with New Dictionary Entry
When inserting the first name, the first name dictionary is scanned for the string
Karen. As shown in Fig. 11.5, this name is not yet present in the dictionary.
Therefore, the name is appended to the end of the first name dictionary (see
Fig. 11.6).
As outlined in Chap. 6, the dictionary needs to be kept sorted. After appending
Karen to the end of the dictionary, the dictionary therefore needs to be resorted.
As shown in Fig. 11.7, a new dictionary is created in sorted order. In the new
dictionary, most of the valueIDs have changed. For instance, the valueID for
Michael changes from 3 to 4.
Fig. 11.2 Initial status of the lname column
Fig. 11.3 Position of the string Schulze in the dictionary of the lname column
Fig. 11.4 Appending dictionary position of Schulze to the end of the attribute vector
Fig. 11.5 Dictionary for first name column
Fig. 11.6 Addition of Karen to fname dictionary
Fig. 11.7 Resorting the fname dictionary
Based on the changed valueIDs of the new first name dictionary, all valueIDs of
the first name attribute vector need to be updated as well. Figure 11.8 shows the
changes to the attribute vector. For instance, at position 1, the valueID for
Michael changes from 3 to 4.
Fig. 11.8 Rebuilding the fname attribute vector
Fig. 11.9 Appending the valueID representing Karen to the attribute vector
If the newly added dictionary value sorts after all existing entries, i.e., it is
inserted at the end of the dictionary, those two steps are omitted. The dictionary
does not need to be resorted and therefore the attribute vector does not need to be
rebuilt.
Finally, the valueID 2, representing the dictionary position of the string Karen,
is appended to the attribute vector (see Fig. 11.9).
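The three insert scenarios can be sketched as follows. This is a minimal Python illustration, not SanssouciDB's implementation: the class `Column` and its `insert` method are hypothetical, and instead of appending the value and then resorting into a new dictionary, the sketch places the new value directly at its sorted position, which yields the same final dictionary and valueIDs. The table content is an assumption chosen to be consistent with the valueIDs quoted in the text.

```python
import bisect


class Column:
    """Dictionary-encoded column: sorted dictionary plus attribute vector."""

    def __init__(self, values):
        self.dictionary = sorted(set(values))
        self.attribute_vector = [self.dictionary.index(v) for v in values]

    def insert(self, value):
        """Append `value` to the column and report which scenario applied."""
        pos = bisect.bisect_left(self.dictionary, value)
        if pos < len(self.dictionary) and self.dictionary[pos] == value:
            scenario = "no new dictionary entry"
        elif pos == len(self.dictionary):
            self.dictionary.append(value)
            scenario = "new entry, no resorting"
        else:
            self.dictionary.insert(pos, value)
            # All valueIDs at or after the insert position shift by one,
            # so the whole attribute vector has to be rewritten.
            self.attribute_vector = [
                vid + 1 if vid >= pos else vid for vid in self.attribute_vector
            ]
            scenario = "new entry, with resorting"
        self.attribute_vector.append(pos)
        return scenario


# Assumed table content, chosen to match the valueIDs quoted in the text.
lname = Column(["Albrecht", "Berg", "Schulze", "Meyer", "Schulze"])
fname = Column(["Martin", "Michael", "Hanna", "Anton", "Sophie"])

lname_scenario = lname.insert("Schulze")
fname_scenario = fname.insert("Karen")
print(lname_scenario, lname.attribute_vector)
print(fname_scenario, fname.dictionary, fname.attribute_vector)
```

Only the third branch touches every entry of the attribute vector; the other two branches cost a dictionary lookup plus one append.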
11.2 Performance Considerations
In the world_population example, there are about 8 billion people
and 5 million unique first names. Every new dictionary entry may cause
overhead for resorting the dictionary and reorganizing the respective
attribute vector. Triggering resorting and reorganization on every single insert
would incur a performance penalty that compromises the overall performance
of the system. Therefore, an additional insert layer is added, the
differential buffer. Chapter 25 explains in detail how write performance is kept
at a high level using periodic merges of the differential buffer and the main store.
How vulnerable a column is to reorganization depends heavily on its column
cardinality (the number of distinct values in its dictionary). While the dictionary
still has only a few entries, it is very likely that a new insert forces a
reorganization of the column. However, especially for attributes of low column
cardinality, e.g., gender or country, the likelihood of reorganization decreases
over time, since most of the possible values for the respective column have already
been inserted into the dictionary. In real-world applications, the dictionary
changes only occasionally once it has reached a certain size. The additional steps
required for new unique dictionary entries then occur less frequently, and
expensive reorganizations become correspondingly rare.
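The saturation effect can be made concrete with a small sketch. The hypothetical function `reorganization_points` below replays a stream of inserts into one dictionary-encoded column (without a differential buffer) and records at which inserts the attribute vector would have to be rewritten, i.e., whenever a new value sorts before an existing dictionary entry. The input stream is made-up sample data.

```python
import bisect


def reorganization_points(values):
    """Return the insert positions that force an attribute vector rewrite."""
    dictionary = []  # kept sorted at all times
    points = []
    for i, value in enumerate(values):
        pos = bisect.bisect_left(dictionary, value)
        if pos < len(dictionary) and dictionary[pos] == value:
            continue  # value already known: no dictionary change
        if pos < len(dictionary):
            points.append(i)  # new value sorts before existing entries
        dictionary.insert(pos, value)
    return points


# Made-up stream for a low-cardinality column such as country.
stream = ["IT", "DE", "IT", "FR", "DE", "IT", "FR", "FR", "DE", "IT"]
points = reorganization_points(stream)
print(points)
```

In this stream, all rewrites happen among the first few inserts; once every country code is in the dictionary, further inserts only append to the attribute vector.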
11.3 Self Test Questions
1. Access Order of Structures During Insert
When doing an insert, what entity is accessed first?
(a) The attribute vector
(b) The dictionary
(c) No access of either entity is needed for an insert
(d) Both are accessed in parallel in order to speed up the process.
2. New Value in Dictionary
Given the following entities:
Old dictionary: ape, dog, elephant, giraffe
Old attribute vector: 0, 3, 0, 1, 2, 3, 3
Value to be inserted: lamb
What value is the lamb mapped to in the new attribute vector?
(a) 1
(b) 2
(c) 3
(d) 4
3. Insert Performance Variation Over Time
Why might real world productive column stores experience faster insert performance over time?
(a) Because the dictionary reaches a state of saturation and, thus, rewrites of
the attribute vector become less likely.
(b) Because the hardware will run faster after some run-in time.
(c) Because the column is already loaded into main-memory and does not have
to be loaded from disk.
(d) An increase in insert performance should not be expected.
4. Resorting Dictionaries of Columns
Consider a dictionary encoded column store (without a differential buffer) and
the following SQL statements on an initially empty table:
INSERT INTO students VALUES('Daniel', 'Bones', 'USA');
INSERT INTO students VALUES('Brad', 'Davis', 'USA');
INSERT INTO students VALUES('Hans', 'Pohlmann', 'GER');
INSERT INTO students VALUES('Martin', 'Moore', 'USA');
How many complete attribute vector rewrites are necessary?
(a) 2
(b) 3
(c) 4
(d) 5
5. Insert Performance
Which of the following use cases will have the worst insert performance when
all values will be dictionary encoded?
(a) A city resident database that stores the names of all the people from that
city
(b) A database for vehicle maintenance data which stores failures, error codes
and conducted repairs
(c) A password database that stores the password hashes
(d) An inventory database of a company storing the furniture for each room.
Chapter 12
Update
The UPDATE statement is part of SQL's data manipulation language (DML) and is used
for changing one or more tuples in a table. It has the following general form:
Listing 12.1: Update syntax
The optional WHERE condition restricts the update to tuples that match the
given condition. If no WHERE condition is specified, then all tuples in the table
are updated. Logically, i.e., in relational algebra, an UPDATE statement is
equivalent to a DELETE statement followed by an INSERT statement.
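The logical equivalence of UPDATE and DELETE followed by INSERT can be checked with a short experiment. The sketch below uses Python's built-in `sqlite3` module purely as a convenient SQL engine; SQLite is not an in-memory column store, and the `city` column, the helper names, and the sample rows are assumptions made for illustration.

```python
import sqlite3


def make_table():
    """Create a small in-memory world_population table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE world_population (fname TEXT, lname TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO world_population VALUES (?, ?, ?)",
        [("Jane", "Doe", "Berlin"), ("John", "Smith", "Berlin")],
    )
    return conn


def rows(conn):
    """Return the table content in a deterministic order."""
    query = "SELECT fname, lname, city FROM world_population"
    return sorted(conn.execute(query).fetchall())


# Variant 1: a single UPDATE restricted by a WHERE condition.
updated = make_table()
updated.execute(
    "UPDATE world_population SET city = 'Hamburg' "
    "WHERE fname = 'Jane' AND lname = 'Doe'"
)

# Variant 2: DELETE the old tuple, then INSERT the changed one.
replaced = make_table()
replaced.execute("DELETE FROM world_population WHERE fname = 'Jane' AND lname = 'Doe'")
replaced.execute("INSERT INTO world_population VALUES ('Jane', 'Doe', 'Hamburg')")

same_result = rows(updated) == rows(replaced)

# Without a WHERE condition, every tuple in the table is updated.
unrestricted = make_table()
updated_all_count = unrestricted.execute(
    "UPDATE world_population SET city = 'Hamburg'"
).rowcount

print(same_result, updated_all_count)
```

Both variants leave the table with the same set of tuples, although the physical work performed by a storage engine may differ considerably.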
12.1 Update Types
Three different types of updates can be found in a typical enterprise application
[Pla09]:
• Aggregate update: The attributes are accumulated values as part of materialized
views. From our experience in enterprise systems, typically between 1 and 5
materialized aggregates are maintained for each accounting line item.
• Status update: Binary change of a status variable, typically with timestamps.
• Value update: The value of an attribute changes by replacement.
12.1.1 Aggregate Updates
Most of the updates taking place in financial applications apply to complete
records, containing e.g. account number, legal organization, year, etc. The system
contains aggregates for these records, e.g., by account, by project, or by region.
Directly reading these aggregates is faster than computing them on the fly.
H. Plattner, A Course in In-Memory Data Management,
DOI: 10.1007/978-3-642-36524-9_12, © Springer-Verlag Berlin Heidelberg 2013
12.1.2 Status Updates
Status variables (e.g. unpaid, paid) typically use a small, predefined set of
values and thus create no problem when performing an in-place update, since the
column cardinality does not change. It is advisable not to allow compression of
sequences (e.g. run-length encoding) in the columns of status fields. If
automatic recording of status changes is desirable for the application, we can also
use the insert-only approach, which will be discussed in Chap. 26, for these
changes. If the status variable has only two states, a null value and a
timestamp can be used as values to indicate whether the status has been set. Thus,
an in-place update is fully transparent even for temporal queries.
12.1.3 Value Updates
Since the change of an attribute in an enterprise application in most cases has to be
recorded (log of changes), the insert-only approach seems to be the appropriate
answer. On average only 5 % of the tuples of a financial accounting system are
actually changed over a longer period of time [KKG+11]. The extra load for the
differential buffer (the write-optimized store in a column store database, which
handles updates and inserts) and the extra consumption of main memory are
acceptable. With insert-only, we also capture the change history including time
and origin of the change.
Despite the fact that typical enterprise systems are not update-intensive, by
using insert-only and by not maintaining totals, we can even further reduce the
number of updates, which also reduces locking issues.
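The idea of recording value changes instead of overwriting them can be illustrated with a small sketch. It is not the insert-only implementation presented in Chap. 26; the names `Version` and `InsertOnlyAttribute`, the record key, and the sample values are hypothetical. Every change is appended as a new version carrying the time and origin of the change, and the current value is simply the most recent version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    key: str         # identifies the tuple the value belongs to
    value: str       # attribute value valid from `changed_at` on
    changed_at: str  # time of the change (ISO 8601 string)
    origin: str      # who or what triggered the change


class InsertOnlyAttribute:
    """Append-only store for one attribute; nothing is updated in place."""

    def __init__(self):
        self.versions = []

    def set(self, key, value, changed_at, origin):
        """Record a new value by appending a version."""
        self.versions.append(Version(key, value, changed_at, origin))

    def current(self, key):
        """Return the most recently inserted value for `key`, if any."""
        for version in reversed(self.versions):
            if version.key == key:
                return version.value
        return None

    def history(self, key):
        """Return all versions for `key` in insertion order."""
        return [v for v in self.versions if v.key == key]


city = InsertOnlyAttribute()
city.set("person-1", "Berlin", "2012-01-01T09:00:00", "initial load")
city.set("person-1", "Potsdam", "2012-06-01T14:30:00", "address change form")

current_city = city.current("person-1")
change_log = city.history("person-1")
print(current_city, len(change_log))
```

The old value stays available for temporal queries, at the price of additional memory for every change.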
12.2 Update Example
Consider the world_population table. Michael Berg moves from Berlin to Potsdam.
The following query should therefore be executed:
Listing 12.2: Michael Berg moves from Berlin to Potsdam