B+ Tree and Database Operations: Problems and Solutions

Problem 1: B+ Tree Operations

Consider the following B+ tree. The structure is described first, followed by the key values in each node.

n = 8. (Therefore internal nodes other than the root must have between 4 and 8 children and leaves must have between 4 and 7 key values.)

Structure:

  • The root is node A.
  • Node A has two children: B and C.
  • B has four children: D, E, F, G.
  • C has eight children: H, I, J, K, L, M, N, O.

Key values:

  • A: 40
  • B: 10, 18, 32
  • C: 48, 62, 74, 82, 100, 200, 300
  • D: 2, 4, 6, 8
  • E: 10, 12, 14, 16
  • F: 18, 20, 22, 24, 26, 28, 30
  • G: 32, 34, 36, 38
  • H: 40, 42, 44, 46
  • I: 48, 50, 52, 54, 56, 58, 60
  • J: 62, 64, 66, 68, 70, 72
  • K: 74, 76, 78, 80
  • L: 82, 84, 86, 88
  • M, N, O: details not needed for this problem

What nodes are visited and, in each, what key comparisons are made when doing each of the following searches:

Find record with key value 66

ANSWER:

  • Visit A, compare with 40
  • Visit C, compare with 48, 62, 74
  • Visit J, compare with 62, 64, 66

Find record with key value 18

ANSWER:

  • Visit A, compare with 40
  • Visit B, compare with 10, 18, (maybe 32, depending on details of how = is handled)
  • Visit F, compare with 18

Find all records with key values between 15 and 35

ANSWER:

  • Visit A, compare with 40
  • Visit B, compare with 10, 18
  • Visit E, compare with 12, 14, 16.
  • Follow pointer before 16 to get record with 16;
  • Follow pointer to leaf node F
  • Compare with 18, 20, …, 30, and fetch each of the corresponding data records;
  • Follow pointer to leaf node G
  • Compare with 32, 34, 36 and get records for 32 and 34.

Describe the tree after each of the following operations.

DO EACH OF THESE SEPARATELY, STARTING WITH THE ORIGINAL TREE DESCRIBED ABOVE:

a. Insert 15

ANSWER:

Node E now has 10, 12, 14, 15, 16; no other changes

b. Insert 27

ANSWER:

Insert in node F, which becomes overfull. Split node F

  • Node F now has: 18, 20, 22, 24
  • New node F’: 26, 27, 28, 30
  • Add key and pointer to parent B:
  • B now has: 10, 18, 26, 32, where the pointer before 32 points to F’

c. Insert 49

ANSWER:

Insert in node I, which becomes overfull. Split node I

  • Node I now has: 48, 49, 50, 52
  • New node I’: 54, 56, 58, 60
  • Want to add key 54 and pointer to I’ to parent, C:
  • C is full, so split it:
  • C: pH, 48, pI, 54, pI’, 62, pJ, 74, pK
  • C’: pL, 100, pM, 200, pN, 300, pO
  • Insert 82 and pointer to C’ into A
  • A now has: pB, 40, pC, 82, pC’
  • (Note that searching for values >= 82 will now take path from root to C’ to leaf).

d. Delete 28

ANSWER:

Just delete it from F, yielding

F: 18, 20, 22, 24, 26, 30

e. Delete 36

ANSWER:

G becomes underfull. NOTE we cannot merge it with H, because H is not a sibling. Instead, redistribute with F and adjust key in parent:

  • F: 18, 20, 22, 24, 26
  • G: 28, 30, 32, 34, 38
  • B: 10, 18, 28

(NOTE: This solution divides the key values evenly between the nodes; it’s also fine to just borrow one key value from the neighbor, as in the textbook’s algorithm.)

g. Delete 16

ANSWER:

  • E becomes underfull. Since its neighbor/sibling D only has 4 key values, we can merge:
  • Delete node E and change D to
  • D: 2, 4, 6, 8, 10, 12, 14
  • Delete (10, pE) from parent B. Now it only has 3 children, so it’s underfull.
  • Redistribute with C and adjust A, yielding
  • B: pD, 18, pF, 32, pG, 40, pH, 48, pI
  • C: pJ, 74, pK, 82, pL, 100, pM, 200, pN, 300, pO
  • A: 62

Problem 2: B+ Tree Depth

Consider a B+ index tree with n=10.

Suppose there are 9,000 records in the file being indexed, each with a different value of the search key.

a) What is the smallest possible depth of the B+ tree? Describe the “shape” of the tree. You don’t need to show the values in the nodes.

Note: The depth (aka height) of a tree is the length of the longest path from the root to a leaf. A tree that only has a root has depth zero; A tree in which all the leaves are children of the root has depth one, etc.

Each internal node has 9 keys and 10 children, the maximum for n=10.

  • Root (level 0): 9 keys,
  • Level 1: 10 nodes each with 9 keys [total of 90 keys]
  • Level 2: 100 nodes, each with 9 keys [total of 900 keys]
  • Level 3: 1000 nodes, each with 9 keys [total of 9000 keys]

This is enough key values, so level 3 is the leaves. Tree has height/depth of 3.

b) What is the largest possible depth of the B+ tree? Describe the “shape” of the tree. You don’t need to show the values in the nodes.

  • Root has two children and other internal nodes have 5 children, the minimum for n=10.
  • Root: 1 key, 2 children
  • Level 1: 2 nodes, each with 4 keys, 5 children [total 8 keys]
  • Level 2: 10 nodes, each with 4 keys, 5 children [total 40 keys]
  • Level 3: 50 nodes, each with 4 keys, 5 children [total 200 keys]
  • Level 4: 250 nodes, each with 4 keys, 5 children [total 1000 keys]
  • Level 5: 1250 nodes, each with between 4 and 9 keys [total 5000 – 10,250 keys]

Make level 5 the leaf nodes; many of the leaves are fairly full.

Note going to level 6 doesn’t work, because the leaves would have to be underfull

  • Level 6: 5750 nodes, each with 4 keys [total > 20000 keys]

So the maximum height/depth is 5.

Problem 3: Block Nested Loop Join

Consider execution of the query select * from instructor natural join department.

Suppose (unrealistically) that each block holds at most four records of instructor or department and that the output block can hold up to three records of the result.

Using department as the outer relation,

Illustrate the execution of the block nested loop join algorithm on the sample data that was posted with HW 1 (assuming that it’s stored in the order given by the insert statements).

Show what’s in memory at the end of each iteration of the second loop and show the records of the result that have been flushed to the disk.

Assume there are only three blocks of memory available (enough for one block of instructor, one block of department, and one output block).

Note the points where the output block fills up and is flushed to the disk.

You do not have to show the entire execution. Just show the results up to and including the step where block 1 of department is compared to block 1 of instructor.

B0 of department:

  • (‘Biology’, ‘Watson’, ‘90000’);
  • (‘Comp. Sci.’, ‘Taylor’, ‘100000’);
  • (‘Elec. Eng.’, ‘Taylor’, ‘85000’);
  • (‘Finance’, ‘Painter’, ‘120000’);

B1 of department:

  • (‘History’, ‘Painter’, ‘50000’);
  • (‘Music’, ‘Packard’, ‘80000’);
  • (‘Physics’, ‘Watson’, ‘70000’);

B0 of instructor:

  • (‘10101’, ‘Srinivasan’, ‘Comp. Sci.’, ‘65000’);
  • (‘12121’, ‘Wu’, ‘Finance’, ‘90000’);
  • (‘15151’, ‘Mozart’, ‘Music’, ‘40000’);
  • (‘22222’, ‘Einstein’, ‘Physics’, ‘95000’);

B1 of instructor:

  • (‘32343’, ‘El Said’, ‘History’, ‘60000’);
  • (‘33456’, ‘Gold’, ‘Physics’, ‘87000’);
  • (‘45565’, ‘Katz’, ‘Comp. Sci.’, ‘75000’);
  • (‘58583’, ‘Califieri’, ‘History’, ‘62000’);

B2 of instructor:

  • (‘76543’, ‘Singh’, ‘Finance’, ‘80000’);
  • (‘76766’, ‘Crick’, ‘Biology’, ‘72000’);
  • (‘83821’, ‘Brandt’, ‘Comp. Sci.’, ‘92000’);
  • (‘98345’, ‘Kim’, ‘Elec. Eng.’, ‘80000’);

a. First time end of second nested loop is reached (after executing the two inner loops) the following are in memory:

  • Block 0 of department:
  • Block 0 of instructor:
  • Output block: matches found in inner loop minus anything flushed to the disk:
  • (10101, Srinivasan, Comp. Sci, 65000, Taylor, 10000)
  • (12121, Wu, Finance, 90000, Painter, 120000)
  • Disk:

b. Second time end of second loop is reached:

  • Block 0 of department:
  • Block 1 of instructor:
  • Output block: fills up then flushes to disk
  • (10101, Srinivasan, Comp. Sci, 65000, Taylor, 10000)
  • (12121, Wu, Finance, 90000, Painter, 120000)
  • (45565, Katz, Comp. Sci., 75000, Taylor, 10000);
  • Disk:
  • (10101, Srinivasan, Comp. Sci, 65000, Taylor, 10000)
  • (12121, Wu, Finance, 90000, Painter, 120000)
  • (45565, Katz, Comp. Sci., 75000, Taylor, 10000);

c. Third time

  • Block 0 of department
  • Block 2 of instructor
  • Output block (fills up, flushes, then one more match):
  • (98345, Kim, Elec. Eng., 80000Taylor, 85000)
  • Disk:
  • (10101, Srinivasan, Comp. Sci, 65000, Taylor, 10000)
  • (12121, Wu, Finance, 90000, Painter, 120000)
  • (45565, Katz, Comp. Sci., 75000, Taylor, 10000);
  • Added in this step:
  • (76543, Singh, Finance, 80000, Painter, 120000)
  • (76766, Crick, Biology, 72000,Biology, Watson, 90000)
  • (83821, Brandt, Comp. Sci., 92000,Taylor, 100000)

d. Fourth time

  • Block 1 of department
  • Block 0 of instructor
  • Output block fills up and flushes
  • (98345, Kim, Elec. Eng., 80000Taylor, 85000)
  • (15151, Mozart, Music, 40000, Packard, 80000);
  • (22222, Einstein, Physics, 95000, Watson, 70000)
  • Disk:
  • (10101, Srinivasan, Comp. Sci, 65000, Taylor, 10000)
  • (12121, Wu, Finance, 90000, Painter, 120000)
  • (45565, Katz, Comp. Sci., 75000, Taylor, 10000);
  • (76543, Singh, Finance, 80000, Painter, 120000)
  • (76766, Crick, Biology, 72000,Biology, Watson, 90000)
  • (83821, Brandt, Comp. Sci., 92000,Taylor, 100000)
  • Added in this step:
  • (98345, Kim, Elec. Eng., 80000Taylor, 85000)
  • (15151, Mozart, Music, 40000, Packard, 80000);
  • (22222, Einstein, Physics, 95000, Watson, 70000)

e. Fifth time

  • Block 1 of department
  • Block 1 of instructor
  • Output block fills up and flushes
  • (32343, El Said, History, 60000,Painter, 50000)
  • (33456, Gold, Physics, 87000, Watson, 70000)
  • (58583, Califieri, History, 62000,Painter, 50000)
  • Disk:
  • (10101, Srinivasan, Comp. Sci, 65000, Taylor, 10000)
  • (12121, Wu, Finance, 90000, Painter, 120000)
  • (45565, Katz, Comp. Sci., 75000, Taylor, 10000);
  • (76543, Singh, Finance, 80000, Painter, 120000)
  • (76766, Crick, Biology, 72000,Biology, Watson, 90000)
  • (83821, Brandt, Comp. Sci., 92000,Taylor, 100000)
  • (98345, Kim, Elec. Eng., 80000Taylor, 85000)
  • (15151, Mozart, Music, 40000, Packard, 80000);
  • (22222, Einstein, Physics, 95000, Watson, 70000)
  • Added in this step:
  • (32343, El Said, History, 60000,Painter, 50000)
  • (33456, Gold, Physics, 87000, Watson, 70000)
  • (58583, Califieri, History, 62000,Painter, 50000)

Problem 4: Database Performance Calculations

  1. We want the largest n that allows n pointers and (n-1) keys per block so that each node will have large fan-out and fit in a block. Key (attribute a) is 8 bytes, pointers are 8 bytes, so we have 16 bytes per (pointer, key) pair. 212 bytes per block/ 24 bytes per pair = 28 pairs per block, so let n = 28 = 256
  2. In worst case, root has 2 children and other nodes are half full (27 = 128 pointers). There are 220 records (each with a different value of a). Use the formula or just note that height 1 has at least 2 * 27 = 28 key values in leaves; height 2 has at least 2 * 27 *27 = 215 key values in leaves; height 3 has at least 2 * 27 *27*27 = 222 key values in leaves; So worst case height is 2.
  3. 212 bytes per block / 25 bytes per record = 27 records per block. 220 records / 27 records per block = 213 blocks.
  4. Records are same size, but there are 225 of them => 218 blocks.
  5. tS = 8ms + 2ms = 10ms. to compute transfer time, 4 KB per block / 100 MB per sec = 4 KB per block/ 100 KB per ms = 0.04 ms/block
  6. Using index. height= 2. assume root is already in memory. So 2 block transfers and seeks needed to get to leaf node and one more to get to the data block. 3*(tT + tS). Answer: 3*(10.04 ms) = 30.12 ms [ Or, if root is not in memory, 4*(10.04ms) = 40.16 ms]
  7. Linear scan: one seek + 213 block transfers 10 ms + 213 blocks *(0.04ms/block) = 10 + 8192*0.04 ms = 338 ms For the join algorithms, this solution is assuming minimal amount of memory (enough for two input blocks and one output block) is available. This is very pessimistic estimate.
  8. Block nested loop join Choose smaller relation as outer relation. In this case, that’s r. # block transfers = br * bs + br = 213*218 + 213 = 231 + 213 ~= 231 # seeks = 2*br = 214 Note that time is dominated by the huge transfer time, 231 transfers. total transfer time = 2 * 1,073,741,824 * 0.04 ms = 85,899 sec = about 24 hours Also need 214 * 10 ms for seeks = 164 sec = about 3 minutes
  9. Indexed nested loop join. Both relations have indices, but since s is larger, use it as the inner relation. Calculation similar to that for the index on r shows that height 2 is not enough in the worst case, but height 3 is. Again, we’ll assume root is in memory, so we find leaf, then find data with 4 block transfers and seeks. This takes 4*(10.04 ms) = 40.16 ms Need to do this for each record in r: 220 * 40.2 ms = 42,153 sec = 11 hours Also need a block transfer and a seek for each block of r: 213* 10.04 ms = about 82 sec So in this case, using the index is better than the block-nested loop join, but still very time consuming. (Note however that we’ve assumed extremely limited memory, which isn’t realistic for systems dealing with very larger relations like this. EXERCISE: Think about how larger memory would affect the answers to parts 8 and 9.
  10. Merge-join or hash join might perform better. Since we’ve assumed extremely limited memory for the other algorithms, we should make the same assumption here for comparison. This means there will only be enough memory for a very small hash table, so hashing will perform very poorly. It may be worth the overhead of sorting the relations in order to perform merge sort, since the sort will take time O( b log b) for relations with b blocks, and the joining will then be linear.