Friday, 23 November 2012

Advanced Data Structures using ‘C’ - Splay Trees (Self-adjusting Search Trees)

Advanced Data Structures using ‘C’ Unit 5

Unit 5 Splay Trees (Self-adjusting Search Trees) Structure 5.0 Introduction Objectives 5.1 Splay Trees Self Assessment Questions 5.2 Access lemma 5.3 Balance Theorem 5.4 Strong Access Lemma 5.5 Static Optimality Theorem 5.6 Static Finger Theorem 5.7 Other Theorems 5.7.1 Working Set Theorem 5.7.2 Sequential Access Theorem 5.7.3 Dynamic Finger Theorem 5.7.4 Dynamic Optimality Conjecture 5.7.4.1 Insertion 5.7.4.2 Join 5.7.4.3 Deletion Self Assessment Questions 5.8 Summary 5.9 Terminal Questions 5.0 Introduction This Unit just describes the bottom-up splaying algorithm, the proof of the access lemma, and a few applications.
Every time a node is accessed in a splay tree, it is moved to the root of the tree. The amortized cost of the operation is O(log n). Just moving the element to the root by rotating it up the tree does not have this property. In
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 121
Splay trees do movement is done in a very special way that guarantees this amortized bound. Objective: At the end of this unit the reader must be able to answer
 What are splaying Tree and splaying algorithm
 What is balance theorem
 What is static optimality theorem
 How the set theorem works
 To define and compare theorems like dynamic finger theorem, sequential access theorem and Dynamic optimality conjecture
 How to have insertions, join and deletion in a tree
5.1 Splay Trees We shall describe the algorithm by giving three rewrite rules in the form of pictures. In these pictures, x is the node that was accessed (that will eventually be at the root of the tree). By looking at the local structure of the tree defined by x, x's parent, and x's grandparent we decide which of the following three rules to follow. We continue to apply the rules until x is at the root of the tree: y x Zig (terminal case): / ====> \ x y z z / / x Zig-zag: y ====> x ====> / \ \ / y z x y z x / y \ Zig-zig: y ====> / \ ====> y / x z \ x z
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 122
Notes 1) Each rule has a mirror image variant, which covers all the cases. 2) The zig-zig rule is the one that distinguishes splaying from just rotating x to the root of the tree. 3) Top-down splaying is much more efficient in practice. Here are some examples: 6 0 3 / \ / \ 5 5 / \ / / \ / \ 4 splay(0) 3 6 splay(3) 0 5 / =======> / \ =======> \ / \ 3 1 4 1 4 6 / \ \ 2 2 2 / 1 / 0 To analyze the performance of splaying, we start by assuming that each node x has a weight w(x) > 0. These weights can be chosen arbitrarily. For each assignment of weights we will be able to derive a bound on the cost of a sequence of accesses. We can choose the assignment that gives the best bound. By giving the frequently accessed elements a high weight, we will be able to get tighter bounds on the running time. Note that the weights are only used in the analysis, and do not change the algorithm at all. (A commonly used case is to assign all the weights to be 1.) Before we can state our performance lemma, we need to define two more quantities. The size of a node x (denoted s(x)) is the total weight of all the nodes in the subtree rooted at x. The rank of a node x (denoted r(x)) is the floor(log_2) of the size of x. Restating these:
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 123
s(x) = Sum (over y in the subtree rooted at x) of w(y) r(x) = floor(log(s(x))) For each node x, we'll keep r(x) tokens on that node. (Alternatively, the potential function will just be the sums of the ranks of all the nodes in the tree.) Here's an example to illustrate this: Here's a tree, labeled with sizes on the left and ranks on the right. 9 o 3 o / / 8 o 3 o / \ / \ 6 o o 1 2 o o 0 / \ / \ 2 o o 3 1 o o 1 / / / / 1 o o 2 0 o o 1 \ \ o 1 o 0 Notes about this potential: 1) Doing a rotation between a pair of nodes x and y only effects the ranks of the nodes x and y, and no other nodes in the tree. Furthermore, if y was the root before the rotation, then the rank of y before the rotation equals the rank of x after the rotation. 2) Assuming all the weights are 1, the potential of a balanced tree is O(n), and the potential of a long chain (most unbalanced tree) is O(n log n). 3) In the banker's view of amortized analysis, we can think of having r(x) tokens on node x. Self Assessment Questions 1. What is a Splay Tree? Explain its properties.
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 124
5.2 Access lemma The number of splaying steps done when splaying node x in a tree with root t is at most 3(r(t)-r(x))+1. Proof: As we do the work, we must pay one token for each splay step we do. Furthermore we must make sure that there are always r(x) tokens on node x as we do our restructuring. We are going to allocate 3(r(t) - r(x)) +1 tokens to do the splay. Our job in the proof is to show that this is enough. First we need the following observation about ranks, called the Rank Rule. Suppose that two siblings have the same rank, r. Then the parent has rank at least r+1. This is because if the rank is r, then the size is at least 2^r. If both siblings have size at least 2^r, then the total size is at least 2^(r+1) and we conclude that the rank is at least r+1. We can represent this with the following diagram: x / \ Then x >= r+1 r r Conversely, suppose we find a situation where a node and its parent have the same rank r, then the other sibling of the node must have rank < r. So if we have three nodes configured as follows, with these ranks: r / \ Then x < r x r
Now we can go back to proving the lemma. The approach we take is to show that the 3(r'(x) - r(x)) tokens are sufficient to pay for the a zig-zag or a zig-zig steps. And that 3(r'(x) - r(x)) +1 is sufficient to pay for the zig step.
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 125
(Here r'() represents the rank function after the step, and r() represents the rank function before the step.) When we sum these costs to compute the amortized cost for the entire splay operation, they telescope to: 3(r(t) - r(x)) +1. Note that the +1 comes from the zig step, which can happen only once. Here's an example, the labels are ranks: 2 o 2 x / / \ 2 o 0 o o 2 / / 2 o splay(x) 2 o / ==========> / \ 2 x 1 o o 0 / \ / 0 o o 1 0 o / o 0 ----------------- ---------------- Total: 9 7 We allocated: 3(2-2)+1 = 1
extra tokens from restructuring: 9-7 = 2 3 There are 2 splay steps. So 3 > 2, and we have enough. It remains to show these bounds for the individual steps. There are three cases, one for each of the types of splay steps. Zig: r o o r / ==> \ a o o b <= r
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 126
The actual cost is 1, so we spend one token on this. We take the tokens on a and augment them with another r-a and put them on b. Thus the total number of tokens needed is 1+r-a. This is at most 1+3(r-a). Zig-zag: We'll split it into 2 cases: Case 1: The rank does not increase between the starting node and ending node of the step. r o o r / / \ r o ===> a o o b \ r o By the Rank Rule, one of a or b must be < r, so one token is released from the data structure. we use this to pay for the work. Thus, our allocation of 3(r-r) = 0 is sufficient to pay for this. Case 2: (The rank does increase) r o o r / / \ b o ===> c o o d \ a o The tokens on c can be supplied from those on b. (There are enough there cause b >= c.). Note that r-a > 0. So: use r-a (which is at least 1) to pay for the work use r-a to augment the tokens on a to make enough for d Summing these two gives: 2(r-a). But 2(r-a) <= 3(r-a), which completes the zig-zag case. Zig-zig: Again, we split into two cases.
Case 1: The rank does not increase between the starting node and the ending node of the step.
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 127
r o r o o r / / \ \ r o =====> r o o d ====> o c <= r / \ r o ( d<r by rank rule) o d < r As in the zig-zag case, we use the token gained because d < r to pay for the step. Thus we have 3(r-r)=0 tokens and we need 0. Case 2: The rank increases during the step. r o o r / \ b o ============> o c <= r / \ a o o d <= r use r-a (which is at least 1) to pay for the work use r-a to boost the tokens on a to cover those needed for d use r-a to boost the tokens on b to cover those needed for c Summing these gives 3(r-a), which completes the analysis of the zig-zig case. This completes the proof of the access lemma. 5.3 Balance Theorem A sequence of m splays in a tree of n nodes takes time O(m log(n) + n log(n)). Proof: We apply the access lemma with all the weights equal to 1. For a given splay, r(t) <= log(n), and r(x) >= 0. So the amortized cost of the splay is at most: 3 log(n) + 1
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 128
We now switching to the world of potentials (potential = total tokens in the tree). To bound the cost of the sequence we add this amount for each splay, then add the initial minus the final potential. The initial potential is at most n log(n), and the final potential is at least 0. This gives a bound of: m (3 log(n) + 1) + n log(n) = O(m log(n) + n log(n)) Now let us see how the access lemma can be applied to prove more advanced properties of splay trees. We take advantage of the fact that we can assign arbitrary positive values to the weights of the nodes. In the previous section, we proved the following: Access lemma: The number of splaying steps done when splaying node x in a tree with root t is at most 3(r(t)-r(x))+1. Actually, something a bit stronger is true. By defining rank to be the log of the weight of a node (instead of the floor of the log), we can prove a stronger lemma: 5.4 Strong Access Lemma The number of rotations done when splaying node x in a tree with root t is at most 3(r(t)-r(x))+1. Here r(t) and r(x) denote the continuous definition of rank. The bound in this lemma is approximately a factor of two tighter than the previous one. Throughout these notes, we'll use the strong access lemma and the continuous definition of rank. The main advantage here is simply that it means we can stop writing floors, because in these notes we're only trying to prove big-oh bounds.
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 129
Let's rewrite the access lemma a bit. Let w(x) be the weight of a node x. r(x) is log(w(x) + total weight of other descendants of x). Since the weight of the descendants of x is a positive number we have: r(x) >= log(w(x)) So the amortized cost of a splay is bounded by 3(r(t) - log(w(x))) + 1. The theorems below concern bounds on the running times of sequences of accesses in splay trees. (The set of nodes in these trees is static.) We'll use the rotation bound of the strong access lemma as our measure of the running time of a sequence of operations. Recall the following consequence of the definition of amortized cost: total real cost = total amortized cost + initial potential - final potential To bound the cost of a sequence, we'll bound the total amortized cost (using the access lemma) and bound the total decrease in potential that could result from the sequence. NOTATION: There will be n items in the tree, numbered 1...n. The letter i will indicate an index into this list of items. We'll be considering a sequence of m accesses, and the index j will refer to one of these accesses. Let i[j] refer to the index of the jth access. 5.5 Static Optimality Theorem Let q(i) be the number of times item i is accessed. Let m = Sum q(i) be the number of accesses. Let us assume that q(i) > 0 for each i. Then the total cost of the sequence is: m O(m + Sum log ------- ) j q(i[j])
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 130
PROOF: Choose w(i) (the weight of item i) to be q(i)/m. The total weight of all items is 1. So the amortized cost of an access to element i is at most: 3(r(root) - log(w(i))) + 1 = 3 (- log(q(i)/m)) +1 = 3 log(m/q(i))+1 Summing this, we see that the amortized cost satisfies the bound stated in the theorem. It remains to bound the decrease in potential. Each item contributes its rank to the potential. We have the following bounds on ranks: 0 >= r(i) = log(size(i)) >= log(w(i)) Rewriting this: 0 <= -r(i) <= log(1/w(i)) The the maximum amount that a node's potential can decrease is just log (1/w(i)), which is log(m/q(i)). Summing this we see: total potential decrease <= Sum log(m/q(i)) i Since each element is accessed at least once, the terms in this summation are a subset of the terms in the bound we're trying to prove in the theorem. So this additional cost can be absorbed into the big-oh. NOTE 1: The reason we need to assume that each element is accessed at least once is to guard against the possibility that we start out with a very bad tree which contains millions of elements near the top of the tree that are never accessed, and we are forced to incur a cost to get by them. NOTE 2: This is called the static optimality theorem because the static tree that is constructed to be optimal for this specific access sequence will incur a cost of O(log(m/q(i))) to access the item i.
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 131
5.6 Static Finger Theorem Let f be a specific element called the finger, then the following is a bound on the cost of splaying a sequence: O(m + n log(n) + Sum log (|f - i[j]| + 1)) j NOTE: |f-i| is the distance in the symmetric ordering of the items between the finger and item i. PROOF: Assign the weights w(i) as follows: 1
w(i) = 2 (|f-i| + 1) Since Sum 1/n^2 (n from 1 to infinity) is a constant (pi^2/6) we see that the sum of all the weights is bounded by a constant (at most twice this). Let's look at the amortized cost of an access. Recall: 3(r(t) - log(w(x))) + 1. The term r(t), meaning the rank of the root, is just bounded by a constant, and sums to O(m). The 1 term also sums to O(m). The -log(w(x)) term can be written: 1
-log ( ) 2 (|f-i| + 1) = 2 log (|f-i|+1) Combining these estimates shows that the amortized cost is bounded by the summation stated in the theorem. It remains to bound the decrease in potential. What bound can we put on the rank of an item i?
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 132
2 pi^2 1 log ( –––– ) >= r(i) >= 2 log (–––––) 6 |f-i|+1 Negating this we get: 2 pi^2 - log ( ––––– ) <= -r(i) <= 2 log (|f-i|+1) 6 So the potential change of a node is bounded by O(log n). This contributes O(n log n) to the total real cost. 5.7 Other Theorems 5.7.1 Working Set Theorem Let t(j) be the number of accesses of different items that occurred between access j and the previous access of the same item. The the cost of the sequence is bounded by: O(m + n log(n) + Sum log (t(j) + 1)) j NOTE: this is proven by using a weight assignment very similar to the one used in the proof of the static finger theorem. Imagine that we're maintaining a move-to-front list (the accessed element moves to the front). When accessing an item the number of elements it has to pass when it moves to the front is t(j). If an item is in position p, (positions numbered 1, 2, 3...) in the list, then it has a weight of 1/p^2. The following theorems do not follow from the access lemma. 5.7.2 Sequential Access Theorem The cost of the access sequence that accesses each of the n items in the tree in symmetric (left-to-right) order is O(n).
This is generalized by the following theorem, the proof of which is very difficult:
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 133
5.7.3 Dynamic Finger Theorem Using the notation of the static finger theorem, a bound on the splaying cost is: O(m + n + Sum log (|i[j-1] - i[j]| + 1)) j The two theorems above are special cases of the following conjecture: 5.7.4 Dynamic Optimality Conjecture Consider the following class of binary-tree based algorithms for processing a sequence of requests: on each access we pay the distance from the root to the accessed node. Arbitrary rotations are also allowed, at a cost of 1 per rotation. For a sequence Sigma, let C(T,Sigma) be the cost of the optimal algorithm in this class for processing a sequence Sigma starting from tree T. Then the cost of splaying the sequence (starting from any tree) is O(C(T,Sigma) + n). What about dynamic operations that change the set of elements in the tree? Here's how we can do Insertion, Join, and Deletion. 5.7.4.1 Insertion Say we want to insert a new item e. Proceed like ordinary binary tree insertion, but stop short of linking the new item e into the tree. Let the node you were about to link e to be called p. We splay p to the root. Now construct a tree according to one of the following pictures: e e / \ / \ p p / \ / \ depending on if the item in e is before or after p.
Let's analyze this when all the weights are 1. The amortized cost of the splay is O(log n) according to the access lemma. Then we have to pay the
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 134
additional amortized cost which equals the potential of the new root. This is log(n+1). So the amortized cost of the whole thing is O(log(n)). An alternative insertion algorithm is to do ordinary binary tree insertion, and then simply splay the item to the root. This also has efficient amortized cost, but the analysis is a tiny bit more complicated. 5.7.4.2 Join Here we have two trees, say, A and B. We want to put them together with all the items in A to the left of all the items in B. This is called the "Join" operation. This is done as follows. Splay the rightmost node of A. Now make B the right child of this node. The amortized cost of the operation is easily seen to be O(log n) where n is the number of nodes in the tree after joining. (Just assign uniform weights to all the nodes, as we did in the Insertion analysis.) 5.7.4.3 Deletion To delete a node, x, simply splay it to the root and then Join its left and right sub-trees together as described above. The amortized cost is easily seen to be O(log n), where n is the size before the deletion. Self Assessment Questions 1. State and prove Access Lemma. 2. What is the use of Balance Theorem? 3. tate and prove Static Optimality Theorem. 5.8 Summary
Self Adjusting search trees or Splay trees guarantees the amortized cost of the operation for O(log(n)). The balance theorem has a sequence of m splays in a tree of n nodes and takes time O(m log(n) + n log(n)). Dynamic operations like insertion, join and deletion of item in a tree were discussed.
Advanced Data Structures using ‘C’ Unit 5
Sikkim Manipal University Page No.: 135
The amortized cost of join and deletion operation is O(log(n)) while it is O(log(n+1)) for the insertion operation. 5.9 Terminal Questions 1. What do you mean by the Splay Trees? Discuss. 2. How a Splay Tree differs from a Binary Tree? Justify your answer with appropriate example. 3. Show the result of inserting 3, 1, 4, 5, 2, 9, 6, 8 into a
a) bottom-up splay tree
b) top-down splay tree.
4. Show the result of deleting 3 from the splay tree shown in Exercise 1 for top-down splay tree. 5. Prove that if all nodes in a splay tree are accessed in sequential order, the resulting tree consists of a chain of left-children.

No comments:

Post a Comment