1…Example of Physical Delete


10 Delete

Next, we scan through the attribute vectors to find the matching positions, that is, we look up the recordIDs of the tuples that contain these values. In our example, there is only one tuple with that combination of first and last name.

When the two values are finally deleted from the attribute vectors, all subsequent tuples have to be moved up so that the sequence has no gaps and the data stays in a contiguous memory area. This implementation of the delete operation is therefore very expensive in terms of performance. Chapter 26, later in the course, presents the insert-only approach, a logical form of deletion, as a better alternative for typical enterprise use cases.
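The steps of a physical delete can be sketched in a few lines of Python. This is a minimal illustration, not SanssouciDB's implementation: the names `value_id` and `physical_delete`, the representation of a column as a pair of Python lists, and the sample data are all assumptions made for this example. The dictionaries themselves are left untouched here.

```python
from bisect import bisect_left


def value_id(dictionary, value):
    """Binary-search a sorted dictionary; return the valueID or None."""
    pos = bisect_left(dictionary, value)
    if pos < len(dictionary) and dictionary[pos] == value:
        return pos
    return None


def physical_delete(columns, predicates):
    """Physically delete all tuples that match every (column, value) predicate.

    columns:    {name: (sorted dictionary, attribute vector of valueIDs)}
    predicates: {name: value}
    Returns the recordIDs (positions) that were removed.
    """
    # Step 1: translate each predicate value into its valueID.
    wanted = {}
    for name, value in predicates.items():
        vid = value_id(columns[name][0], value)
        if vid is None:
            return []  # value not in the dictionary, so no tuple can match
        wanted[name] = vid

    # Step 2: scan the attribute vectors for positions matching all predicates.
    length = len(next(iter(columns.values()))[1])
    hits = [i for i in range(length)
            if all(columns[name][1][i] == vid for name, vid in wanted.items())]

    # Step 3: remove the entries from every column. Deleting from a list
    # moves all subsequent entries up, which is the expensive part.
    for _, vector in columns.values():
        for i in reversed(hits):
            del vector[i]
    return hits


# Hypothetical sample data: four tuples, two columns.
columns = {
    "fname": (["Jane", "John", "Mary"], [1, 0, 1, 2]),
    "lname": (["Doe", "Smith"], [1, 1, 0, 0]),
}
removed = physical_delete(columns, {"fname": "John", "lname": "Smith"})
```

Only the first tuple is John Smith, so one position is removed from both attribute vectors and every later entry shifts by one. With n tuples, each deletion costs on the order of n moves per column, which is why the text calls this alternative expensive.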

10.2 Self Test Questions

1. Delete Implementations

Which two possible delete implementations are mentioned in the course?

(a) White box and black box delete
(b) Physical and logical delete
(c) Shifted and liquid delete
(d) Column and row deletes

2. Arrays to Scan for Specific Query with Dictionary Encoding

When applying a delete with two predicates, e.g., firstname = 'John' AND lastname = 'Smith', how many logical blocks in the IMDB are looked at while determining which tuples to delete (all columns are dictionary encoded)?

(a) 1
(b) 2
(c) 4
(d) 8

3. Fast Delete Execution

Assume a physical delete implementation and the following two SQL statements on our world_population table:

(A) DELETE FROM world_population WHERE country = 'China';
(B) DELETE FROM world_population WHERE country = 'Ireland';

Which query will execute faster? Please only consider the concepts learned so far.

(a) Equal execution time
(b) A
(c) Depends on the ordering of the dictionary
(d) B

Reference

[Pla09] H. Plattner, A common database approach for OLTP and OLAP using an in-memory column database, in SIGMOD Conference, ed. by U. Çetintemel, S. Zdonik, D. Kossmann (ACM, New York, 2009), pp. 1–2

Chapter 11

Insert

This chapter outlines what happens when a new tuple is inserted into a table (execution of an INSERT statement). Compared to a row-based database, an insert in a column store is a bit more complicated. In a row-oriented database, the new tuple is simply appended to the end of the table, i.e., the tuple is stored as one piece. SanssouciDB stores its data physically in column orientation; a detailed description of the differences between row store and column store is given in Chap. 8. Adding a new tuple to the database therefore means adding a new entry to every column of the table. Internally, every column consists of a dictionary and an attribute vector (see Chap. 6). Adding a new entry to a column means checking the dictionary and adding the new value if it is not yet present. Afterwards, the valueID of the dictionary entry is appended to the attribute vector of the column. Since the dictionary is sorted, adding a new entry to a column results in one of three scenarios:

1. Without a new dictionary entry

2. With a new dictionary entry, without resorting the dictionary

3. With a new dictionary entry, with resorting the dictionary

In this chapter, we give a step-by-step explanation of the three scenarios.

11.1 Example

In this example, we insert the data of a new person into the world_population table (see Fig. 11.1) that we used before. The example outlines what happens for the column lname, representing the last name of a person, and the column fname, representing the first name of a person.

H. Plattner, A Course in In-Memory Data Management, DOI: 10.1007/978-3-642-36524-9_11, © Springer-Verlag Berlin Heidelberg 2013


Fig. 11.1 Example database table named world_population

11.1.1 INSERT without New Dictionary Entry

To demonstrate an insert that requires no new dictionary entry, we look at the insert of the last name into the lname column of our world_population table. The attribute vector and dictionary of the lname column are initially filled as displayed in Fig. 11.2.

To add the string Schulze to the column, we need to look up whether it already exists in the dictionary. Since there is another person named Sophie Schulze (recordID 4 of the world_population table) in the database, the dictionary for the lname column already contains an entry with the string Schulze. As one can see from Fig. 11.3, the dictionary position of Schulze is 3.

Since Schulze is at position 3 of the dictionary, we append 3 to the end of the attribute vector (see Fig. 11.4).
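As a rough sketch of this first scenario, the lookup and append could look as follows in Python. The function name `append_existing` is an assumption for this example, and since Figs. 11.2 to 11.4 are not reproduced here, the dictionary contents and the attribute vector are hypothetical; they are only chosen so that Schulze sits at position 3.

```python
from bisect import bisect_left


def append_existing(dictionary, attribute_vector, value):
    """Append the valueID of a value that is already in the sorted dictionary."""
    pos = bisect_left(dictionary, value)  # binary search in the sorted dictionary
    if pos == len(dictionary) or dictionary[pos] != value:
        raise KeyError(f"{value!r} is not in the dictionary")
    attribute_vector.append(pos)  # the dictionary itself stays unchanged
    return pos


# Hypothetical contents of the lname column (not taken from Fig. 11.2).
lname_dictionary = ["Albrecht", "Berg", "Meyer", "Schulze"]
lname_vector = [1, 2, 0, 3]

position = append_existing(lname_dictionary, lname_vector, "Schulze")
```

The insert touches only the end of the attribute vector, which makes this the cheapest of the three scenarios.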

11.1.2 INSERT with New Dictionary Entry

When inserting the first name, the first name dictionary is scanned for the string Karen. As shown in Fig. 11.5, this name is not yet present in the dictionary. Therefore, the name is appended to the end of the first name dictionary (see Fig. 11.6).

As outlined in Chap. 6, the dictionary needs to be kept sorted. After Karen has been appended, the dictionary therefore needs to be resorted: as shown in Fig. 11.7, a new dictionary is created in sorted order. In the new dictionary, most of the valueIDs have changed. For instance, the valueID for Michael changes from 3 to 4.


Fig. 11.2 Initial status of the lname column

Fig. 11.3 Position of the string Schulze in the dictionary of the lname column

Fig. 11.4 Appending dictionary position of Schulze to the end of the attribute vector


Fig. 11.5 Dictionary for first name column

Fig. 11.6 Addition of Karen to fname dictionary

Fig. 11.7 Resorting the fname dictionary

Based on the changed valueIDs of the new first name dictionary, all valueIDs of the first name attribute vector need to be updated as well. Figure 11.8 shows the changes to the attribute vector. For instance, at position 1, the valueID for Michael is changed from 3 to 4.


Fig. 11.8 Rebuilding the fname attribute vector

Fig. 11.9 Appending the valueID representing Karen to the attribute vector

If the newly added value sorts after all existing entries and is therefore placed at the end of the dictionary, these two steps are omitted: the dictionary does not need to be resorted, and therefore the attribute vector does not need to be rebuilt.

Finally, the valueID 2, representing the dictionary position of the string Karen, is appended to the attribute vector (see Fig. 11.9).
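The following Python sketch covers all three insert scenarios in one function. It is an illustration under stated assumptions: the name `insert_value` and the sample data are hypothetical (the figures are not reproduced here), and instead of appending and then resorting into a new dictionary as described above, the sketch places the new value directly at its sorted position, which yields the same dictionary and the same valueIDs.

```python
from bisect import bisect_left


def insert_value(dictionary, attribute_vector, value):
    """Insert a value into a dictionary-encoded column.

    Returns (valueID of the value, whether the attribute vector was rewritten).
    """
    pos = bisect_left(dictionary, value)
    found = pos < len(dictionary) and dictionary[pos] == value
    rewritten = False
    if not found:
        if pos < len(dictionary):
            # Scenario 3: the new entry lands in the middle of the dictionary.
            # Every valueID at or after pos moves up by one, so the whole
            # attribute vector has to be rewritten.
            attribute_vector[:] = [v + 1 if v >= pos else v
                                   for v in attribute_vector]
            rewritten = True
        # Scenario 2 (pos == len(dictionary)) only extends the dictionary.
        dictionary.insert(pos, value)
    # All scenarios end by appending the valueID to the attribute vector.
    attribute_vector.append(pos)
    return pos, rewritten


# Hypothetical fname column in which Michael has valueID 3.
fname_dictionary = ["Andrew", "Hanna", "Martin", "Michael", "Sophie"]
fname_vector = [3, 0, 4, 1, 2]

karen = insert_value(fname_dictionary, fname_vector, "Karen")
zoe = insert_value(fname_dictionary, fname_vector, "Zoe")
```

With this sample data, Karen receives valueID 2 and forces a rewrite in which Michael moves from 3 to 4, while Zoe sorts after all existing entries and is added without touching the existing valueIDs.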

11.2 Performance Considerations

In the world_population example, there are about 8 billion people and 5 million unique first names. Every new dictionary entry may cause overhead for resorting the dictionary and reorganizing the respective attribute vector. Triggering resorting and reorganization at every single insert would lead to a performance penalty that compromises the overall performance of the system. Therefore, an additional insert layer needs to be added, the differential buffer. Chapter 25 explains in detail how write performance is kept at a high level using periodic merges of the differential buffer and the main store.


The vulnerability of a column to reorganization depends heavily on the column cardinality (the number of distinct values in the dictionary). While the dictionary has only a few entries, it is very likely that a new insert forces the column to be reorganized. However, especially for attributes of low column cardinality, e.g., gender or country, the likelihood of reorganization decreases over time, since most of the possible values for the respective column have already been inserted into the dictionary. In real-world applications, the dictionary changes only occasionally after it has reached a certain size. The additional steps required for new unique dictionary entries occur less frequently, and expensive reorganization therefore becomes rarer.
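To make the saturation effect concrete, the small Python sketch below records at which inserts a reorganization would be triggered when values arrive one by one in an initially empty column without a differential buffer. The function name `rewrite_positions` and the sample stream are assumptions for this example; a reorganization is counted whenever a new value has to be placed before an existing dictionary entry.

```python
from bisect import bisect_left


def rewrite_positions(values):
    """Return the indices of the inserts that force an attribute vector rewrite."""
    dictionary = []
    positions = []
    for i, value in enumerate(values):
        pos = bisect_left(dictionary, value)
        if pos < len(dictionary) and dictionary[pos] == value:
            continue  # value already known: no dictionary change at all
        if pos < len(dictionary):
            positions.append(i)  # new entry before existing ones: rewrite
        dictionary.insert(pos, value)
    return positions


# Hypothetical stream of values for a low-cardinality country column.
stream = ["USA", "GER", "USA", "FRA", "GER", "USA", "FRA", "GER"]
early_rewrites = rewrite_positions(stream)
```

In this stream, rewrites happen only at the second and fourth insert; once all three countries are in the dictionary, further inserts never reorganize the column.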

11.3 Self Test Questions

1. Access Order of Structures During Insert

When doing an insert, what entity is accessed first?

(a) The attribute vector
(b) The dictionary
(c) No access of either entity is needed for an insert
(d) Both are accessed in parallel in order to speed up the process

2. New Value in Dictionary

Given the following entities:

Old dictionary: ape, dog, elephant, giraffe
Old attribute vector: 0, 3, 0, 1, 2, 3, 3
Value to be inserted: lamb

What value is the lamb mapped to in the new attribute vector?

(a) 1
(b) 2
(c) 3
(d) 4

3. Insert Performance Variation Over Time

Why might real-world production column stores experience faster insert performance over time?

(a) Because the dictionary reaches a state of saturation and, thus, rewrites of the attribute vector become less likely.
(b) Because the hardware will run faster after some run-in time.
(c) Because the column is already loaded into main memory and does not have to be loaded from disk.
(d) An increase in insert performance should not be expected.


4. Resorting Dictionaries of Columns

Consider a dictionary encoded column store (without a differential buffer) and the following SQL statements on an initially empty table:

INSERT INTO students VALUES('Daniel', 'Bones', 'USA');
INSERT INTO students VALUES('Brad', 'Davis', 'USA');
INSERT INTO students VALUES('Hans', 'Pohlmann', 'GER');
INSERT INTO students VALUES('Martin', 'Moore', 'USA');

How many complete attribute vector rewrites are necessary?

(a) 2
(b) 3
(c) 4
(d) 5

5. Insert Performance

Which of the following use cases will have the worst insert performance when all values are dictionary encoded?

(a) A city resident database that stores the names of all the people from that city
(b) A database for vehicle maintenance data, which stores failures, error codes, and conducted repairs
(c) A password database that stores the password hashes
(d) An inventory database of a company storing the furniture for each room

Chapter 12

Update

The UPDATE statement is part of SQL's data manipulation language (DML) and is used to change one or more tuples in a table. It has the following general form:

Listing 12.1: Update syntax

The optional WHERE condition restricts the update to tuples that match the given condition. If no WHERE condition is specified, all tuples in the table are updated. Logically, i.e., in relational algebra, an UPDATE statement is equivalent to a DELETE statement followed by an INSERT statement.

12.1 Update Types

Three different types of updates can be found in a typical enterprise application [Pla09]:

• Aggregate update: The attributes are accumulated values as part of materialized views. From our experience in enterprise systems, typically between 1 and 5 materialized aggregates are maintained for each accounting line item.
• Status update: Binary change of a status variable, typically with timestamps.
• Value update: The value of an attribute changes by replacement.

12.1.1 Aggregate Updates

Most of the updates taking place in financial applications apply to complete records, containing, e.g., account number, legal organization, year, etc. The system contains aggregates for these records, e.g., by account, by project, or by region. Directly reading these aggregates is faster than computing them on the fly.

H. Plattner, A Course in In-Memory Data Management, DOI: 10.1007/978-3-642-36524-9_12, © Springer-Verlag Berlin Heidelberg 2013

12.1.2 Status Updates

Status variables (e.g., unpaid, paid) typically use a predefined set of a small number of values and thus create no problem when performing an in-place update, since the column cardinality does not change. It is advisable not to allow compression of sequences (e.g., run-length encoding) for status columns. If automatic recording of status changes is desirable for the application, we can also use the insert-only approach, which is discussed in Chap. 26, for these changes. If the status variable has only two states, a null value and a timestamp can be used as values to record whether and when the status has been set. Thus, an in-place update is fully transparent even for temporal queries.
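Both variants of a status update can be sketched briefly in Python. The function names, the list-based column layout, and the sample data are assumptions for this illustration. The first function overwrites a valueID in place, which works because the new status is already in the dictionary; the second stores a two-state status as either None or the time at which it was set.

```python
from bisect import bisect_left
from datetime import datetime


def update_status(dictionary, attribute_vector, record_id, new_status):
    """In-place status update: only one valueID in the attribute vector changes."""
    pos = bisect_left(dictionary, new_status)
    if pos == len(dictionary) or dictionary[pos] != new_status:
        raise ValueError(f"{new_status!r} is not a predefined status")
    attribute_vector[record_id] = pos  # dictionary and cardinality unchanged


def set_status_timestamp(column, record_id, when):
    """Two-state status stored as None (not set) or a timestamp (set)."""
    if column[record_id] is None:
        column[record_id] = when  # keeps the time of the status change


# Hypothetical status column with the predefined values paid and unpaid.
status_dictionary = ["paid", "unpaid"]
status_vector = [1, 1, 0]
update_status(status_dictionary, status_vector, 0, "paid")

# Hypothetical paid_at column: None means the item is still unpaid.
paid_at = [None, None, datetime(2013, 1, 15, 10, 0)]
set_status_timestamp(paid_at, 0, datetime(2013, 2, 1, 9, 30))
```

In the timestamp variant, a temporal query such as "was this item paid on a given day" can be answered from the column alone, because the update preserves when the state changed.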

12.1.3 Value Updates

Since a change of an attribute in an enterprise application has to be recorded in most cases (log of changes), the insert-only approach seems to be the appropriate answer. On average, only 5% of the tuples of a financial accounting system are actually changed over a longer period of time [KKG+11]. The extra load on the differential buffer (the write-optimized store of a column store database, which handles updates and inserts) and the extra consumption of main memory are acceptable. With insert-only, we also capture the change history, including the time and origin of each change.

Although typical enterprise systems are not update-intensive, using insert-only and not maintaining totals reduces the number of updates even further, which also reduces locking issues.

12.2 Update Example

Consider the world_population table. Michael Berg moves from Berlin to Potsdam, so the following query should be executed:

Listing 12.2: Michael Berg moves from Berlin to Potsdam
