__be the coordinates of the next move as defined by the rules of chess; IF (0 <= u) & (u < n) & (0 <= v) & (v < n) & (h[u,v] = 0) THEN h[u,v] := i; IF i < n*n THEN Try(i+1, u, v, q1); IF ~q1 THEN h[u,v] := 0 ELSE q1 := TRUE END END END UNTIL q1 OR no more candidates; q := q1 END Try Just one more refinement step will lead us to a program expressed fully in terms of our basic programming notation. We should note that so far the program was developed completely independently of the laws governing the jumps of the knight. This delaying of considerations of particularities of the problem was quite deliberate. But now is the time to take them into account. Given a starting coordinate pair x,y there are eight potential candidates u,v of the destination. They are numbered 1 to 8 in Fig. 3.8. A simple method of obtaining u,v from x,y is by addition of the coordinate differences stored in either an array of difference pairs or in two arrays of single differences. Let these arrays be denoted by dx and dy, appropriately initialized. Then an index k may be used to number the next candidate . The details are shown in Program 3.3. The recursive procedure is initiated by a call with the coordinates x0, y0 of that field as parameters from which the tour is to start. This field must be given the value 1; all others are to be marked free. dx = (2, 1, -1, -2, -2, -1, 1, 2) dy = (1, 2, 2, 1, -1, -2, -2, -1) h[x0, y0] := 1; try(2, x0, y0, q) 96 Fig. 3.8. The 8 possible moves of a knight One further detail must not be overlooked: A variable h uv does exist only if both u and v lie within the index range 0 ... n-1. Consequently, the Boolean expression, substituted for acceptable in the general schema, is valid only if its first four constituent terms are true. It is therefore relevant that the term h uv = 0 appears last. We assume a global n×n matrix h representing the result, the constant n (and nsqr = n 2), and arrays dx and dy representig the possible moves of a knight (see Fig. 3.8): PROCEDURE Try(i, x, y: INTEGER; VAR q: BOOLEAN); VAR k, u, v: INTEGER; q1: BOOLEAN; BEGIN k := 0; REPEAT k := k+1; q1 := FALSE; u := x + dx[k]; v := y + dy[k]; IF (0 <= u) & (u < n) & (0 <= v) & (v < n) & (h[u,v] = 0) THEN h[u,v] := i; IF i < Nsqr THEN Try(i+1, u, v, q1); IF ~q1 THEN h[u,v] := 0 END ELSE q1 := TRUE END END UNTIL q1 OR (k = 8); q := q1 END Try; PROCEDURE Clear; VAR i, j: INTEGER; BEGIN FOR i := 0 TO n-1 DO FOR j := 0 TO n-1 DO h[i,j] := 0 END END END Clear; PROCEDURE KnightsTour(i, j: INTEGER; VAR done: BOOLEAN); BEGIN Clear; h[i, j] := 1; Try(2, i, j, done); END KnightsTour. Table 3.1 indicates solutions obtained with initial positions <3,3>, <2,4> for n = 5 and <1,1> for n = 6. 23 10 15 4 25 23 4 9 14 25 16 5 24 9 14 10 15 24 1 8 11 22 1 18 3 5 22 3 18 13 6 17 20 13 8 16 11 20 7 2 21 12 7 2 19 21 6 17 12 19 1 16 7 26 11 14 34 25 12 15 6 27 17 2 33 8 13 10 x 6 7 8 1 2 3 5 4 y 97 32 35 24 21 28 5 23 18 3 30 9 20 36 31 22 19 4 29 Table 3.1 Three Knights' Tours. What abstractions can now be made from this example? Which pattern does it exhibit that is typical for this kind of problem-solving algorithms? What does it teach us? The characteristic feature is that steps toward the total solution are attempted and recorded that may later be taken back and erased in the recordings when it is discovered that the step does not possibly lead to the total solution, that the step leads into a dead-end street. This action is called backtracking . The general pattern below is derived from TryNextMove , assuming that the number of potential candidates in each step is finite. PROCEDURE Try; BEGIN intialize selection of candidates; REPEAT select next; IF acceptable THEN record it; IF solution incomplete THEN Try; IF not successful THEN cancel recording END END END UNTIL successful OR no more candidates END Try Actual programs may, of course, assume various derivative forms of this schema. A frequently encountered pattern uses an explicit level parameter indicating the depth of recursion and allowing for a simple termination condition. If, moreover, at each step the number of candidates to be investigated is fixed, say m, then the derived schema shown below applies; it is to be invoked by the statement Try(1). PROCEDURE Try(i: INTEGER); VAR k: INTEGER; BEGIN k := 0; REPEAT k := k+1; select k-th candidate; IF acceptable THEN record it; IF i < n THEN Try(i+1); IF not successful THEN cancel recording END END END UNTIL successful OR (k = m) END Try The remainder of this chapter is devoted to the treatment of three more examples. They display various incarnations of the abstract schema and are included as further illustrations of the appropriate use of recursion. 3.5. The Eight Queens Problem The problem of the eight queens is a well-known example of the use of trial-and-error methods and of backtracking algorithms. It was investigated by C .F. Gauss in 1850, but he did not completely solve it. This should not surprise anyone. After all, the characteristic property of these problems is that they defy analytic solution. Instead, they require large amounts of exacting labor, patience, and accuracy. Such algorithms have therefore gained relevance almost exclusively through the automatic computer, which possesses these properties to a much higher degree than people, and even geniuses, do. The eight queens poblem is stated as follows (see also [3-4]): Eight queens are to be placed on a chess board in such a way that no queen checks against any other queen. Using the last schema of Sect. 3.4 as a template, we readily obtain the following crude version of a solution: 98 PROCEDURE Try(i: INTEGER); BEGIN initialize selection of positions for i-th queen: REPEAT make next selection; IF safe THEN SetQueen; IF i < 8 THEN Try(i+1); IF not successful THEN RemoveQueen END END END UNTIL successful OR no more positions END Try In order to proceed, it is necessary to make some commitments concerning the data representation. Since we know from the rules of chess that a queen checks all other figures lying in either the same column, row, or diagonal on the board, we infer that each column may contain one and only one queen, and that the choice of a position for the i th queen may be restricted to the i th column. The parameter i therefore becomes the column index, and the selection process for positions ranges over the eight possible values for a row index j. There remains the question of representing the eight queens on the board. An obvious choice would again be a square matrix to represent the board, but a little inspection reveals that such a representation would lead to fairly cumbersome operations for checking the availability of positions. This is highly undesirable since it is the most frequently executed operation. We should therefore choose a data representation which makes checking as simple as possible. The best recipe is to represent as directly as possible that information which is truly relevant and most often used. In our case this is not the position of the queens, but whether or not a queen has already been placed along each row and diagonals. (We already know that exactly one is placed in each column k for 0 ≤ k < i). This leads to the following choice of variables: VAR x: ARRAY 8 OF INTEGER; a: ARRAY 8 OF BOOLEAN; b, c: ARRAY 15 OF BOOLEAN where xi denotes the position of the queen in the i th column; aj means "no queen lies in the j th row"; bk means "no queen occupies the k th /-diagonal; ck means "no queen sits on the k th \-diagonal. We note that in a /-diagonal all fields have the same sums of their coordinates i and j, and that in a \- diagonal the coordinate differences i-j are constant. The appropriate solution is shown in the following program Queens . Given these data, the statement SetQueen is elaborated to x[i] := j; a[j] := FALSE; b[i+j] := FALSE; c[i-j+7] := FALSE the statement RemoveQueen is refined into a[j] := TRUE; b[i+j] := TRUE; c[i-j+7] := TRUE and the condition safe is fulfilled if the field__

*lies in in a row and in diagonals which are still free. Hence, it can be expressed by the logical expression a[j] & b[i+j] & c[i-j+7] This completes the development of this algorithm, that is shown in full as Program 3.4. The computed solution is x = (1, 5, 8, 6, 3, 7, 2, 4) and is shown in Fig. 3.9. 99 Fig. 3.9. A solution to the Eight Queens problem PROCEDURE Try(i: INTEGER; VAR q: BOOLEAN); VAR j: INTEGER; BEGIN j := 0; REPEAT q := FALSE; IF a[j] & b[i+j] & c[i-j+7] THEN x[i] := j; a[j] := FALSE; b[i+j] := FALSE; c[i-j+7] := FALSE; IF i < 7 THEN Try(i+1, q); IF ~q THEN a[j] := TRUE; b[i+j] := TRUE; c[i-j+7] := TRUE END ELSE q := TRUE END END ; INC(j) UNTIL q OR (j = 8) END Try; PROCEDURE Queens; VAR i: INTEGER; (*uses global writer W*) BEGIN FOR i := 0 TO 7 DO a[i] := TRUE END ; FOR i := 0 TO 14 DO b[i] := TRUE; c[i] := TRUE END ; Try(0,q); FOR i := 0 TO 7 DO Texts.WriteInt(W, x[i], 4) END ; Texts.WriteLn(W) END Queens. Before we abandon the context of the chess board, the eight queens example is to serve as an illustration of an important extension of the trial-and-error algorithm. The extension is -- in general terms -- to find not only one, but all solutions to a posed problem. The extension is easily accommodated. We are to recall the fact that the generation of candidates must progress in a systematic manner that guarantees no candidate is generated more than once. This property of the algorithm corresponds to a search of the candidate tree in a systematic fashion in which every node is visited exactly once. It allows -- once a solution is found and duly recorded -- merely to proceed to the next candidate delivered by the systematic selection process. The general schema is as follows: PROCEDURE Try(i: INTEGER); VAR k: INTEGER; 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 100 BEGIN FOR k := 0 TO n-1 DO select k th candidate; IF acceptable THEN record it; IF i < n THEN Try(i+1) ELSE output solution END ; cancel recording END END END Try Note that because of the simplification of the termination condition of the selection process to the single term k = n, the repeat statement is appropriately replaced by a for statement. It comes as a surprise that the search for all possible solutions is realized by a simpler program than the search for a single solution. The extended algorithm to determine all 92 solutions of the eight queens problem is shown in Program 3.5. Actually, there are only 12 significantly differing solutions; our program does not recognize symmetries. The 12 solutions generated first are listed in Table 3.2. The numbers n to the right indicate the frequency of execution of the test for safe fields. Its average over all 92 solutions is 161. PROCEDURE write; VAR k: INTEGER; BEGIN (*global writer W*) FOR k := 0 TO 7 DO Texts.WriteInt(W, x[k], 4) END ; Texts.WriteLn(W) END write; PROCEDURE Try(i: INTEGER); VAR j: INTEGER; BEGIN FOR j := 1 TO 8 DO IF a[j] & b[i+j] & c[i-j+7] THEN x[i] := j; a[j] := FALSE; b[i+j] := FALSE; c[i-j+7] := FALSE; IF i < 7 THEN Try(i+1) ELSE write END ; a[j] := TRUE; b[i+j] := TRUE; c[i-j+7] := TRUE END END END Try; PROCEDURE AllQueens; VAR i: INTEGER; BEGIN FOR i := 0 TO 7 DO a[i] := TRUE END ; FOR i := 0 TO 14 DO b[i] := TRUE; c[i] := TRUE END ; Try(0) END AllQueens. x1 x2 x3 x4 x5 x6 x7 x8 n 1 5 8 6 3 7 2 4 876 1 6 8 3 7 4 2 5 264 1 7 4 6 8 2 5 3 200 1 7 5 8 2 4 6 3 136 2 4 6 8 3 1 7 5 504 2 5 7 1 3 8 6 4 400 2 5 7 4 1 8 6 3 072 2 6 1 7 4 8 3 5 280 2 6 8 3 1 4 7 5 240 2 7 3 6 8 5 1 4 264 101 2 7 5 8 1 4 6 3 160 2 8 6 1 3 5 7 4 336 Table 3.2 Twelve Solutions to the Eight Queens Problem. 3.6. The Stable Marriage Problem Assume that two disjoint sets A and B of equal size n are given. Find a set of n pairs such that a in A and b in B satisfy some constrains. Many different criteria for such pairs exist; one of them is the rule called the stable marriage rule . Assume that A is a set of men and B is a set of women. Each man and each women has stated distinct preferences for their possible partners. If the n couples are chosen such that there exists a man and a woman who are not married, but who would both prefer each other to their actual marriage partners, then the assignment is unstable. If no such pair exists, it is called stable. This situation characterizes many related problems in which assignments have to be made according to preferences such as, for example, the choice of a school by students, the choice of recruits by different branches of the armed services, etc. The example of marriages is particularly intuitive; note, however, that the stated list of preferences is invariant and does not change after a particular assignment has been made. This assumption simplifies the problem, but it also represents a grave distortion of reality (called abstraction). One way to search for a solution is to try pairing off members of the two sets one after the other until the two sets are exhausted. Setting out to find all stable assignments, we can readily sketch a solution by using the program schema of AllQueens as a template. Let Try(m) denote the algorithm to find a partner for man m, and let this search proceed in the order of the man's list of stated preferences. The first version based on these assumptions is: PROCEDURE Try(m: man); VAR r: rank; BEGIN FOR r := 0 TO n-1 DO pick the r th preference of man m; IF acceptable THEN record the marriage; IF m is not last man THEN Try(successor(m)) ELSE record the stable set END ; cancel the marriage END END END Try The initial data are represented by two matrices that indicate the men's and women's preferences. VAR wmr: ARRAY n, n OF woman mwr: ARRAY n, n OF man Accordingly, wmr m denotes the preference list of man m, i.e., wmr m,r is the woman who occupies the r th rank in the list of man m. Similarly, mwr w is the preference list of woman w, and mwr w,r is her r th choice. A sample data set is shown in Table 3.3. r = 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 m = 1 7 2 6 5 1 3 8 4 w = 1 4 6 2 5 8 1 3 7 2 4 3 2 6 8 1 7 5 2 8 5 3 1 6 7 4 2 3 3 2 4 1 8 5 7 6 3 6 8 1 2 3 4 7 5 4 3 8 4 2 5 6 7 1 4 3 2 4 7 6 8 5 1 5 8 3 4 5 6 1 7 2 5 6 3 1 4 5 7 2 8 6 8 7 5 2 4 3 1 6 6 2 1 3 8 7 4 6 5 102 7 2 4 6 3 1 7 5 8 7 3 5 7 2 4 1 8 6 8 6 1 4 2 7 5 3 8 8 7 2 8 4 5 6 3 1 Table 3.3 Sample Input Data for wmr and mwr The result is represented by an array of women x, such that x m denotes the partner of man m. In order to maintain symmetry between men and women, an additional array y is introduced, such that y w denotes the partner of woman w. VAR x, y: ARRAY n OF INTEGER Actually, y is redundant, since it represents information that is already present through the existence of x. In fact, the relations x[y[w]] = w, y[x[m]] = m hold for all m and w who are married. Thus, the value y w could be determined by a simple search of x; the array y, however, clearly improves the efficiency of the algorithm. The information represented by x and y is needed to determine stability of a proposed set of marriages. Since this set is constructed stepwise by marrying individuals and testing stability after each proposed marriage, x and y are needed even before all their components are defined. In order to keep track of defined components, we may introduce Boolean arrays singlem, singlew: ARRAY n OF BOOLEAN with the meaning that singlem m implies that xm is defined, and singlew w implies that yw is defined. An inspection of the proposed algorithm, however, quickly reveals that the marital status of a man is determined by the value m through the relation ~singlem[k] = k < m This suggests that the array singlem be omitted; accordingly, we will simplify the name singlew to single. These conventions lead to the refinement shown by the following procedure Try . The predicate acceptable can be refined into the conjunction of single and stable , where stable is a function to be still further elaborated. PROCEDURE Try(m: man); VAR r: rank; w: woman; BEGIN FOR r := 0 TO n-1 DO w := wmr[m,r]; IF single[w] & stable THEN x[m] := w; y[w] := m; single[w] := FALSE; IF m < n THEN Try(m+1) ELSE record set END ; single[w] := TRUE END END END Try At this point, the strong similarity of this solution with procedure AllQueens is still noticeable. The crucial task is now the refinement of the algorithm to determine stability. Unfortunately, it is not possible to represent stability by such a simple expression as the safety of a queen's position. The first detail that should be kept in mind is that stability follows by definition from comparisons of ranks. The ranks of men or women, however, are nowhere explicitly available in our collection of data established so far. Surely, the rank of woman w in the mind of man m can be computed, but only by a costly search of w in wmrm. Since the computation of stability is a very frequent operation, it is advisable to make this information more directly accessible. To this end, we introduce the two matrices rmw: ARRAY man, woman OF rank; rwm: ARRAY woman, man OF rank 103 such that rmw m,w denotes woman w's rank in the preference list of man m, and rwm w,m denotes the rank of man m in the list of w. It is plain that the values of these auxiliary arrays are constant and can initially be determined from the values of wmr and mwr. The process of determining the predicate stable now proceeds strictly according to its original definition. Recall that we are trying the feasibility of marrying m and w, where w = wmr m,r , i.e., w is man m's r th choice. Being optimistic, we first presume that stability still prevails, and then we set out to find possible sources of trouble. Where could they be hidden? There are two symmetrical possibilities: 1. There might be a women pw, preferred to w by m, who herself prefers m over her husband. 2. There might be a man pm, preferred to m by w, who himself prefers w over his wife. Pursuing trouble source 1, we compare ranks rwm pw,m and rwm pw,y[pw] for all women preferred to w by m, i.e. for all pw = wmr m,i such that i < r. We happen to know that all these candidate women are already married because, were anyone of them still single, m would have picked her beforehand. The described process can be formulated by a simple linear search; s denotes stability. s := TRUE; i := 0; WHILE (i < r) & s DO pw := wmr[m,i]; INC(i); IF ~single[pw] THEN s := rwm[pw,m] > rwm[pw, y[pw]] END END Hunting for trouble source 2, we must investigate all candidates pm who are preferred by w to their current assignation m, i.e., all preferred men pm = mwr w,i such that i < rwm w,m . In analogy to tracing trouble source 1, comparison between ranks rmw pm,w and rmw pm,x[pm] is necessary. We must be careful, however, to omit comparisons involving x pm where pm is still single. The necessary safeguard is a test pm < m, since we know that all men preceding m are already married. The complete algorithm is shown in module Marriage . Table 3.4 specifies the nine computed stable solutions from input data wmr and mwr given in Table 3.3. PROCEDURE write; (*global writer W*) VAR m: man; rm, rw: INTEGER; BEGIN rm := 0; rw := 0; FOR m := 0 TO n-1 DO Texts.WriteInt(W, x[m], 4); rm := rmw[m, x[m]] + rm; rw := rwm[x[m], m] + rw END ; Texts.WriteInt(W, rm, 8); Texts.WriteInt(W, rw, 4); Texts.WriteLn(W) END write; PROCEDURE stable(m, w, r: INTEGER): BOOLEAN; VAR pm, pw, rank, i, lim: INTEGER; S: BOOLEAN; BEGIN S := TRUE; i := 0; WHILE (i < r) & S DO pw := wmr[m,i]; INC(i); IF ~single[pw] THEN S := rwm[pw,m] > rwm[pw, y[pw]] END END ; i := 0; lim := rwm[w,m]; WHILE (i < lim) & S DO pm := mwr[w,i]; INC(i); IF pm < m THEN S := rmw[pm,w] > rmw[pm, x[pm]] END END ; RETURN S END stable; PROCEDURE Try(m: INTEGER); VAR w, r: INTEGER; 104 BEGIN FOR r := 0 TO n-1 DO w := wmr[m,r]; IF single[w] & stable(m,w,r) THEN x[m] := w; y[w] := m; single[w] := FALSE; IF m < n-1 THEN Try(m+1) ELSE write END ; single[w] := TRUE END END END Try; PROCEDURE FindStableMarriages(VAR S: Texts.Scanner); VAR m, w, r: INTEGER; BEGIN FOR m := 0 TO n-1 DO FOR r := 0 TO n-1 DO Texts.Scan(S); wmr[m,r] := S.i; rmw[m, wmr[m,r]] := r END END ; FOR w := 0 TO n-1 DO single[w] := TRUE; FOR r := 0 TO n-1 DO Texts.Scan(S); mwr[w,r] := S.i; rwm[w, mwr[w,r]] := r END END ; Try(0) END FindStableMarriages END Marriage This algorithm is based on a straightforward backtracking scheme. Its efficiency primarily depends on the sophistication of the solution tree pruning scheme. A somewhat faster, but more complex and less transparent algorithm has been presented by McVitie and Wilson [3-1 and 3-2], who also have extended it to the case of sets (of men and women) of unequal size. Algorithms of the kind of the last two examples, which generate all possible solutions to a problem (given certain constraints), are often used to select one or several of the solutions that are optimal in some sense. In the present example, one might, for instance, be interested in the solution that on the average best satisfies the men, or the women, or everyone. Notice that Table 3.4 indicates the sums of the ranks of all women in the preference lists of their husbands, and the sums of the ranks of all men in the preference lists of their wives. These are the values rm = S m: 1 ≤ m ≤ n: rmw m,x[m] rw = S m: 1 ≤ m ≤ n: rwm x[m],m x1 x2 x3 x4 x5 x6 x7 x8 rm rw c 1 7 4 3 8 1 5 2 6 16 32 21 2 2 4 3 8 1 5 7 6 22 27 449 3 2 4 3 1 7 5 8 6 31 20 59 4 6 4 3 8 1 5 7 2 26 22 62 5 6 4 3 1 7 5 8 2 35 15 47 6 6 3 4 8 1 5 7 2 29 20 143 7 6 3 4 1 7 5 8 2 38 13 47 8 3 6 4 8 1 5 7 2 34 18 758 9 3 6 4 1 7 5 8 2 43 11 34 c = number of evaluations of stability. Solution 1 = male optimal solution; solution 9 = female optimal solution. Table 3.4 Result of the Stable Marriage Problem. The solution with the least value rm is called the male-optimal stable solution; the one with the smallest rw is the female-optimal stable solution. It lies in the nature of the chosen search strategy that good 105 solutions from the men's point of view are generated first and the good solutions from the women's perspective appear toward the end. In this sense, the algorithm is based toward the male population. This can quickly be changed by systematically interchanging the role of men and women, i.e., by merely interchanging mwr with wmr and interchanging rmw with rwm. We refrain from extending this program further and leave the incorporation of a search for an optimal solution to the next and last example of a backtracking algorithm. 3.7. The Optimal Selection Problem The last example of a backtracking algorithm is a logical extension of the previous two examples represented by the general schema. First we were using the principle of backtracking to find a single solution to a given problem. This was exemplified by the knight's tour and the eight queens. Then we tackled the goal of finding all solutions to a given problem; the examples were those of the eight queens and the stable marriages. Now we wish to find an optimal solution. To this end, it is necessary to generate all possible solutions, and in the course of generating them to retain the one that is optimal in some specific sense. Assuming that optimality is defined in terms of some positive valued function f(s), the algorithm is derived from the general schema of Try by replacing the statement print solution by the statement IF f(solution) > f(optimum) THEN optimum := solution END The variable optimum records the best solution so far encountered. Naturally, it has to be properly initialized; morever, it is customary to record to value f(optimum) by another variable in order to avoid its frequent recomputation. An example of the general problem of finding an optimal solution to a given problem follows: We choose the important and frequently encountered problem of finding an optimal selection out of a given set of objects subject to constraints. Selections that constitute acceptable solutions are gradually built up by investigating individual objects from the base set. A procedure Try describes the process of investigating the suitability of one individual object, and it is called recursively (to investigate the next object) until all objects have been considered. We note that the consideration of each object (called candidates in previous examples) has two possible outcomes, namely, either the inclusion of the investigated object in the current selection or its exclusion. This makes the use of a repeat or for statement inappropriate; instead, the two cases may as well be explicitly written out. This is shown, assuming that the objects are numbered 1, 2, ... , n. PROCEDURE Try(i: INTEGER); BEGIN IF inclusion is acceptable THEN include i th object; IF i < n THEN Try(i+1) ELSE check optimality END ; eliminate i th object END ; IF exclusion is acceptable THEN IF i < n THEN Try(i+1) ELSE check optimality END END END Try From this pattern it is evident that there are 2 n possible sets; clearly, appropriate acceptability criteria must be employed to reduce the number of investigated candidates very drastically. In order to elucidate this process, let us choose a concrete example for a selection problem: Let each of the n objects a 0, ... , an-1 be characterized by its weight and its value. Let the optimal set be the one with the largest sum of the values of its components, and let the constraint be a limit on the sum of their weight. This is a problem well known to all travellers who pack suitcases by selecting from n items in such a way that their total value is optimal and that their total weight does not exceed a specific allowance. We are now in a position to decide upon the representation of the given facts in terms of global variables. The choices are easily derived from the foregoing developments: 106 TYPE object = RECORD weight, value: INTEGER END ; VAR obj: ARRAY n OF object; limw, totv, maxv: INTEGER; s, opts: SET The variables limw and totv denote the weight limit and the total value of all n objects. These two values are actually constant during the entire selection process. s represents the current selection of objects in which each object is represented by its name (index). opts is the optimal selection so far encountered, and maxv is its value. Which are now the criteria for acceptability of an object for the current selection? If we consider inclusion , then an object is selectable, if it fits into the weight allowance. If it does not fit, we may stop trying to add further objects to the current selection. If, however, we consider exclusion , then the criterion for acceptability, i.e., for the continuation of building up the current selection, is that the total value which is still achievable after this exclusion is not less than the value of the optimum so far encountered. For, if it is less, continuation of the search, although it may produce some solution, will not yield the optimal solution. Hence any further search on the current path is fruitless. From these two conditions we determine the relevant quantities to be computed for each step in the selection process: 1. The total weight tw of the selection s so far made. 2. The still achievable value av of the current selection s. These two entities are appropriately represented as parameters of the procedure Try. The condition inclusion is acceptable can now be formulated as tw + a[i].weight < limw and the subsequent check for optimality as IF av > maxv THEN (*new optimum, record it*) opts := s; maxv := av END The last assignment is based on the reasoning that the achievable value is the achieved value, once all n objects have been dealt with. The condition exclusion is acceptable is expressed by av - a[i].value > maxv Since it is used again thereafter, the value av - a[i].value is given the name av1 in order to circumvent its reevaluation. The entire procedure is now composed of the discussed parts with the addition of appropriate initialization statements for the global variables. The ease of expressing inclusion and exclusion from the set s by use of set operators is noteworthy. The results opts and maxv of Selection with weight allowances ranging from 10 to 120 are listed in Table 3.5. TYPE Object = RECORD value, weight: INTEGER END ; VAR obj: ARRAY n OF Object; limw, totv, maxv: INTEGER; s, opts: SET; PROCEDURE Try(i, tw, av: INTEGER); VAR av1: INTEGER; BEGIN (*try inclusion*) IF tw + obj[i].weight <= limw THEN s := s + {i}; IF i < n THEN Try(i+1, tw + obj[i].weight, av) ELSIF av > maxv THEN maxv := av; opts := s END ; s := s - {i} END ; (*try exclusion*) 107 IF av > maxv + obj[i].value THEN IF i < n THEN Try(i+1, tw, av - obj[i].value) ELSE maxv := av - obj[i].value; opts := s END END END Try; PROCEDURE Selection(n, Weightinc, WeightLimit: INTEGER); VAR i: INTEGER; BEGIN limw := 0; REPEAT limw := limw + WeightInc; maxv := 0; s := {}; opts := {}; Try(0, 0, totv); UNTIL limw >= WeightLimit END Selection. Weight 10 11 12 13 14 15 16 17 18 19 Tot Value 18 20 17 19 25 21 27 23 25 24 10 * 18 20 * 27 30 * * 52 40 * * * 70 50 * * * * 84 60 * * * * * 99 70 * * * * * 115 80 * * * * * * 130 90 * * * * * * 139 100 * * * * * * * 157 110 * * * * * * * * 172 120 * * * * * * * * 183 Table 3.5 Sample Output from Optimal Selection Program. This backtracking scheme with a limitation factor curtailing the growth of the potential search tree is also known as branch and bound algorithm. Exercises 3.1 (Towers of Hanoi). Given are three rods and n disks of different sizes. The disks can be stacked up on the rods, thereby forming towers. Let the n disks initially be placed on rod A in the order of decreasing size, as shown in Fig. 3.10 for n = 3. The task is to move the n disks from rod A to rod C such that they are ordered in the original way. This has to be achieved under the constraints that 1. In each step exactly one disk is moved from one rod to another rod. 2. A disk may never be placed on top of a smaller disk. 3. Rod B may be used as an auxiliary store. Find an algorithm that performs this task. Note that a tower may conveniently be considered as consisting of the single disk at the top, and the tower consisting of the remaining disks. Describe the algorithm as a recursive program. 3 2 1 A B C 108 Fig. 3.10. The towers of Hanoi 3.2. Write a procedure that generates all n! permutations of n elements a 1, ... , a n in situ, i.e., without the aid of another array. Upon generating the next permutation, a parametric procedure Q is to be called which may, for instance, output the generated permutation. Hint: Consider the task of generating all permutations of the elements a 1, ... , a m as consisting of the m subtasks of generating all permutations of a 1, ... , a m-1 followed by a m, where in the i th subtask the two elements a i and a m had initially been interchanged. 3.3. Deduce the recursion scheme of Fig. 3.11 which is a superposition of the four curves W 1, W 2, W 3, W4. The structure is similar to that of the Sierpinski curves (3.21) and (3.22). From the recursion pattern, derive a recursive program that draws these curves. Fig. 3.11. Curves W 1 – W 4 3.4. Only 12 of the 92 solutions computed by the Eight Queens algorithm are essentially different. The other ones can be derived by reflections about axes or the center point. Devise a program that determines the 12 principal solutions. Note that, for example, the search in column 1 may be restricted to positions 1-4. 3.5 Change the Stable Marriage Program so that it determines the optimal solution (male or female). It therefore becomes a branch and bound program of the type represented by Program 3.7. 3.6 A certain railway company serves n stations S 0, ... , S n-1. It intends to improve its customer information service by computerized information terminals. A customer types in his departure station SA and his destination SD, and he is supposed to be (immediately) given the schedule of the train connections with minimum total time of the journey. Devise a program to compute the desired information. Assume that the timetable (which is your data bank) is provided in a suitable data structure containing departure (= arrival) times of all available trains. Naturally, not all stations are connected by direct lines (see also Exercise 1.6). 3.7 The Ackermann Function A is defined for all non-negative integer arguments m and n as follows: A(0, n) = n + 1 A(m, 0) = A(m-1, 1) (m > 0) A(m, n) = A(m-1, A(m, n-1)) (m, n > 0) Design a program that computes A(m,n) without the use of recursion. As a guideline, use Program 2.11, the non-recusive version of Quicksort. Devise a set of rules for the transformation of recursive into iterative programs in general. References 3-1. D.G. McVitie and L.B. Wilson. The Stable Marriage Problem. Comm. ACM, 14 , No. 7 (1971), 486- 92. 3-2. -------. Stable Marriage Assignment for Unequal Sets. Bit, 10, (1970), 295-309. 3-3. Space Filling Curves, or How to Waste Time on a Plotter. Software - Practice and Experience, 1, No. 4 (1971), 403-40. 3-4. N. Wirth. Program Development by Stepwise Refinement. Comm. ACM, 14, No. 4 (1971), 221-27. 109 4 Dynamic Information Structures 4.1. Recursive Data Types In Chap. 2 the array, record, and set structures were introduced as fundamental data structures. They are called fundamental because they constitute the building blocks out of which more complex structures are formed, and because in practice they do occur most frequently. The purpose of defining a data type, and of thereafter specifying that certain variables be of that type, is that the range of values assumed by these variables, and therefore their storage pattern, is fixed once and for all. Hence, variables declared in this way are said to be static . However, there are many problems which involve far more complicated information structures. The characteristic of these problems is that not only the values but also the structures of variables change during the computation. They are therefore called dynamic structures. Naturally, the components of such structures are -- at some level of resolution -- static, i.e., of one of the fundamental data types. This chapter is devoted to the construction, analysis, and management of dynamic information structures. It is noteworthy that there exist some close analogies between the methods used for structuring algorithms and those for structuring data. As with all analogies, there remain some differences, but a comparison of structuring methods for programs and data is nevertheless illuminating. The elementary, unstructured statement is the assignment of an expression's value to a variable. Its corresponding member in the family of data structures is the scalar, unstructured type. These two are the atomic building blocks for composite statements and data types. The simplest structures, obtained through enumeration or sequencing, are the compound statement and the record structure. They both consist of a finite (usually small) number of explicitly enumerated components, which may themselves all be different from each other. If all components are identical, they need not be written out individually: we use the for statement and the array structure to indicate replication by a known, finite factor. A choice among two or more elements is expressed by the conditional or the case statement and by extensions of record types, respectively. And finally, a repetiton by an initially unknown (and potentially infinite) factor is expressed by the while and repeat statements. The corresponding data structure is the sequence (file), the simplest kind which allows the construction of types of infinite cardinality. The question arises whether or not there exists a data structure that corresponds in a similar way to the procedure statement. Naturally, the most interesting and novel property of procedures in this respect is recursion. Values of such a recursive data type would contain one or more components belonging to the same type as itself, in analogy to a procedure containing one or more calls to itself. Like procedures, data type definitions might be directly or indirectly recursive. A simple example of an object that would most appropriately be represented as a recursively defined type is the arithmetic expression found in programming languages. Recursion is used to reflect the possibility of nesting, i.e., of using parenthesized subexpressions as operands in expressions. Hence, let an expression here be defined informally as follows: An expression consists of a term, followed by an operator, followed by a term. (The two terms constitute the operands of the operator.) A term is either a variable -- represented by an identifier -- or an expression enclosed in parentheses. A data type whose values represent such expressions can easily be described by using the tools already available with the addition of recursion: TYPE expression = RECORD op: INTEGER; opd1, opd2: term END TYPE term = RECORD IF t: BOOLEAN THEN id: Name ELSE subex: expression END END 110 Hence, every variable of type term consists of two components, namely, the tagfield t and, if t is true, the field id , or of the field subex otherwise. Consider now, for example, the following four expressions: 1. x + y 2. x - (y * z) 3. (x + y) * (z - w) 4. (x/(y + z)) * w These expressions may be visualized by the patterns in Fig. 4.1, which exhibit their nested, recursive structure, and they determine the layout or mapping of these expressions onto a store. Fig. 4.1. Storage patterns for recursive record structures A second example of a recursive information structure is the family pedigree: Let a pedigree be defined by (the name of) a person and the two pedigrees of the parents. This definition leads inevitably to an infinite structure. Real pedigrees are bounded because at some level of ancestry information is missing. Assume that this can be taken into account by again using a conditional structure: TYPE ped = RECORD IF known: BOOLEAN THEN name: Name; father, mother: ped END END Note that every variable of type ped has at least one component, namely, the tagfield called known . If its value is TRUE, then there are three more fields; otherwise there is none. A particular value is shown here in the forms of a nested expression and of a diagram that may suggest a possible storage pattern (see Fig. 4.2). (T, Ted, (T, Fred, (T, Adam, (F), (F)), (F)), (T, Mary, (F), (T, Eva, (F), (F))) The important role of the variant facility becomes clear; it is the only means by which a recursive data structure can be bounded, and it is therefore an inevitable companion of every recursive definition. The analogy between program and data structuring concepts is particularly pronounced in this case. A conditional (or selective) statement must necessarily be part of every recursive procedure in order that execution of the procedure can terminate. In practice, dynamic structures involve references or pointers to its elements, and the concept of an alternative (to terminate the recursion) is implied in the pointer, as shown in the next paragraph. + x T y T 1. * + F x T 3. y T - F z T w T - x T 2. * F y T z T * / T 4. F x T F + T T y z w 111 Fig. 4.2. An example of a recursive data structure 4.2. Pointers The characteristic property of recursive structures which clearly distinguishes them from the fundamental structures (arrays, records, sets) is their ability to vary in size. Hence, it is impossible to assign a fixed amount of storage to a recursively defined structure, and as a consequence a compiler cannot associate specific addresses to the components of such variables. The technique most commonly used to master this problem involves dynamic allocation of storage, i.e., allocation of store to individual components at the time when they come into existence during program execution, instead of at translation time. The compiler then allocates a fixed amount of storage to hold the address of the dynamically allocated component instead of the component itself. For instance, the pedigree illustrated in Fig. 4.2 would be represented by individual -- quite possibly noncontiguous -- records, one for each person. These persons are then linked by their addresses assigned to the respective father and mother fields. Graphically, this situation is best expressed by the use of arrows or pointers (Fig. 4.3). T Ted T Fred T Adam F F F T Mary T Eva F F F 112 Fig. 4.3. Data structure linked by pointers It must be emphasized that the use of pointers to implement recursive structures is merely a technique. The programmer need not be aware of their existence. Storage may be allocated automatically the first time a new component is referenced. However, if the technique of using references or pointers is made explicit, more general data structures can be constructed than those definable by purely recursive data definiton. In particular, it is then possible to define potentially infinite or circular (graph) structures and to dictate that certain structures are shared. It has therefore become common in advanced programming languages to make possible the explicit manipulation of references to data in additon to the data themeselves. This implies that a clear notational distinction must exist between data and references to data and that consequently data types must be introduced whose values are pointers (references) to other data. The notation we use for this purpose is the following: TYPE T = POINTER TO T0 This type declaration expresses that values of type T are pointers to data of type T0. It is fundamentally important that the type of elements pointed to is evident from the declaration of T. We say that T is bound to T0. This binding distinguishes pointers in higher-level languages from addresses in assembly codes, and it is a most important facility to increase security in programming through redundancy of the underlying notation. Values of pointer types are generated whenever a data item is dynamically allocated. We will adhere to the convention that such an occasion be explicitly mentioned at all times. This is in contrast to the situation in which the first time that an item is mentioned it is automatically allocated. For this purpose, we introduce a procedure New. Given a pointer variable p of type T, the statement New(p) effectively allocates a variable of type T0 and assigns the pointer referencing this new variable to p (see Fig. 4.4). The pointer value itself can now be referred to as p (i.e., as the value of the pointer variable p). In contrast, the variable which is referenced by p is denoted by p^. The referenced structures are typically records. If the referenced record has, for example, a field x, then it is denoted by p^.x. Because it is clear that not the pointer p has any fields, but only the referenced record p^, we allow the abbreviated notation p.x in place of p^.x. Ted T Fred T Mary T Adam T F Eva T F F F F F 113 Fig. 4.4. Dynamic allocation of variable p^ It was mentioned above that a variant component is essential in every recursive type to ensure finite instances. The example of the family predigree is of a pattern that exhibits a most frequently occurring constellation, namely, the case in which one of the two cases features no further components. This is expressed by the following declaration schema: TYPE T = RECORD IF nonterminal: BOOLEAN THEN S(T) END END S(T) denotes a sequence of field definitions which includes one or more fields of type T, thereby ensuring recursivity. All structures of a type patterned after this schema exhibit a tree (or list) structure similar to that shown in Fig. 4.3. Its peculiar property is that it contains pointers to data components with a tag field only, i.e., without further relevant information. The implementation technique using pointers suggests an easy way of saving storage space by letting the tag information be included in the pointer value itself. The common solution is to extend the range of values of all pointer types by a single value that is pointing to no element at all. We denote this value by the special symbol NIL, and we postulate that the value NIL can be assumed by all pointer typed variables. This extension of the range of pointer values explains why finite structures may be generated without the explicit presence of variants (conditions) in their (recursive) declaration. The new formulations of the explicitly recursive data types declared above are reformulated using pointers as shown below. Note that the field known has vanished, since ~p.known is now expressed as p = NIL. The renaming of the type ped to person reflects the difference in the viewpoint brought about by the introduction of explicit pointer values. Instead of first considering the given structure in its entirety and then investigating its substructure and its components, attention is focused on the components in the first place, and their interrelationship (represented by pointers) is not evident from any fixed declaration. TYPE term = POINTER TO TermDescriptor; TYPE exp = POINTER TO ExpDescriptor; TYPE ExpDescriptor = RECORD op: INTEGER; opd1, opd2: term END ; TYPE TermDescriptor = RECORD id: ARRAY 32 OF CHAR END TYPE Person = POINTER TO RECORD name: ARRAY 32 OF CHAR; father, mother: Person END Note: The type Person points to records of an anonymous type (PersonDescriptor). The data structure representing the pedigree shown in Figs. 4.2 and 4.3 is again shown in Fig. 4.5 in which pointers to unknown persons are denoted by NIL. The resulting improvement in storage economy is obvious. p: POINTER TO T p↑: T 114 Fig. 4.5. Data structure with NIL pointers Again referring to Fig. 4.5, assume that Fred and Mary are siblings, i.e., have the same father and mother. This situation is easily expressed by replacing the two NIL values in the respective fields of the two records. An implementation that hides the concept of pointers or uses a different technique of storage handling would force the programmer to represent the ancestor records of Adam and Eve twice. Although in accessing their data for inspection it does not matter whether the two fathers (and the two mothers) are duplicated or represented by a single record, the difference is essential when selective updating is permitted. Treating pointers as explicit data items instead of as hidden implementation aids allows the programmer to express clearly where storage sharing is intended and where it is not. A further consequence of the explicitness of pointers is that it is possible to define and manipulate cyclic data structures. This additional flexibility yields, of course, not only increased power but also requires increased care by the programmer, because the manipulation of cyclic data structures may easily lead to nonterminating processes. This phenomenon of power and flexibility being intimately coupled with the danger of misuse is well known in programming, and it particularly recalls the GOTO statement. Indeed, if the analogy between program structures and data structures is to be extended, the purely recursive data structure could well be placed at the level corresponding with the procedure, whereas the introduction of pointers is comparable to the use of GOTO statements. For, as the GOTO statement allows the construction of any kind of program pattern (including loops), so do pointers allow for the composition of any kind of data structure (including rings). The parallel development of corresponding program and data structures is shown in condensed form in Table 4.1. Construction Pattern Program Statement Data Type Atomic element Assignment Scalar type Enumeration Compound statement Record type Repetition (known factor) For statement Array type Choice Conditional statement Type union (Variant record) Repetition While or repeat statement Sequence type Recursion Procedure statement Recursive data type General graph GO TO statement Structure linked by pointers Table 4.1 Correspondences of Program and Data Structures. In Chap. 3, we have seen that iteration is a special case of recursion, and that a call of a recursive procedure P defined according to the following schema: PROCEDURE P; BEGIN IF B THEN P0; P END END where P0 is a statement not involving P, is equivalent to and replaceable by the iterative statement WHILE B DO P0 END Ted T Fred T Mary T Adam T NIL Eva T NIL NIL NIL NIL NIL 115 The analogies outlined in Table 4.1 reveal that a similar relationship holds between recursive data types and the sequence. In fact, a recursive type defined according to the schema TYPE T = RECORD IF b: BOOLEAN THEN t0: T0; t: T END END where T0 is a type not involving T, is equivalent and replaceable by a sequence of T0s. The remainder of this chapter is devoted to the generation and manipulation of data structures whose components are linked by explicit pointers. Structures with specific simple patterns are emphasized in particular; recipes for handling more complex structures may be derived from those for manipulating basic formations. These are the linear list or chained sequence -- the simplest case -- and trees. Our preoccupation with these building blocks of data structuring does not imply that more involved structures do not occur in practice. In fact, the following story appeared in a Zürich newspaper in July 1922 and is a proof that irregularity may even occur in cases which usually serve as examples for regular structures, such as (family) trees. The story tells of a man who laments the misery of his life in the following words: I married a widow who had a grown-up daughter. My father, who visited us quite often, fell in love with my step-daughter and married her. Hence, my father became my son-in-law, and my step-daughter became my mother. Some months later, my wife gave birth to a son, who became the brother-in-law of my father as well as my uncle. The wife of my father, that is my stepdaughter, also had a son. Thereby, I got a brother and at the same time a grandson. My wife is my grandmother, since she is my mother's mother. Hence, I am my wife's husband and at the same time her step-grandson; in other words, I am my own grandfather. 4.3. Linear Lists 4.3.1. Basic Operations The simplest way to interrelate or link a set of elements is to line them up in a single list or queue. For, in this case, only a single link is needed for each element to refer to its successor. Assume that types Node and NodeDesc are defined as shown below. Every variable of type NodeDesc consists of three components, namely, an identifying key, the pointer to its successor, and possibly further associated information. For our further discussion, only key and next will be relevant. TYPE Node = POINTER TO NodeDesc; TYPE NodeDesc = RECORD key: INTEGER; next: Ptr; data: ... END ; VAR p, q: Node (*pointer variables*) A list of nodes, with a pointer to its first component being assigned to a variable p, is illustrated in Fig. 4.6. Probably the simplest operation to be performed with a list as shown in Fig. 4.6 is the insertion of an element at its head. First, an element of type NodeDesc is allocated, its reference (pointer) being assigned to an auxiliary pointer variable, say q. Thereafter, a simple reassignment of pointers completes the operation. Note that the order of these three statements is essential. NEW(q); q.next := p; p := q p 1 2 3 4 NIL 116 Fig. 4.6. Example of a linked list The operation of inserting an element at the head of a list immediately suggests how such a list can be generated: starting with the empty list, a heading element is added repeatedly. The process of list generation is expressed in by the following piece of program; here the number of elements to be linked is n. p := NIL; (*start with empty list*) WHILE n > 0 DO NEW(q); q.next := p; p := q; q.key := n; DEC(n) END This is the simplest way of forming a list. However, the resulting order of elements is the inverse of the order of their insertion. In some applications this is undesirable, and consequently, new elements must be appended at the end instead of the head of the list. Although the end can easily be determined by a scan of the list, this naive approach involves an effort that may as well be saved by using a second pointer, say q, always designating the last element. This method is, for example, applied in Program 4.4, which generates cross-references to a given text. Its disadvantage is that the first element inserted has to be treated differently from all later ones. The explicit availability of pointers makes certain operations very simple which are otherwise cumbersome; among the elementary list operations are those of inserting and deleting elements (selective updating of a list), and, of course, the traversal of a list. We first investigate list insertion . Assume that an element designated by a pointer (variable) q is to be inserted in a list after the element designated by the pointer p. The necessary pointer assignments are expressed as follows, and their effect is visualized by Fig. 4.7. q.next := p.next; p.next := q Fig. 4.7. Insertion after p^ If insertion before instead of after the designated element p^ is desired, the unidirectional link chain seems to cause a problem, because it does not provide any kind of path to an element's predecessors. However, a simple trick solves our dilemma. It is illustrated in Fig. 4.8. Assume that the key of the new element is 8. NEW(q); q^ := p^; p.key := k; p.next := q q p q 117 Fig. 4.8. Insertion before p^ The trick evidently consists of actually inserting a new component after p^ and thereafter interchanging the values of the new element and p^. Next, we consider the process of list deletio n. Deleting the successor of a p^ is straightforward. This is shown here in combination with the reinsertion of the deleted element at the head of another list (designated by q). Figure 4.9 illustrates the situation and shows that it constitutes a cyclic exchange of three pointers. r := p.next; p.next := r.next; r.next := q; q := r Fig. 4.9. Deletion and re-insertion The removal of a designated element itself (instead of its successor) is more difficult, because we encounter the same problem as with insertion: tracing backward to the denoted element's predecessor is impossible. But deleting the successor after moving its value forward is a relatively obvious and simple solution. It can be applied whenever p^ has a successor, i.e., is not the last element on the list. However, it must be assured that there exist no other variables pointing to the now deleted element. We now turn to the fundamental operation of list traversal . Let us assume that an operation P(x) has to be performed for every element of the list whose first element is p^. This task is expressible as follows: WHILE list designated by p is not empty DO perform operation P; proceed to the successor END In detail, this operation is descibed by the following statement: WHILE p # NIL DO p q q q 13 p 8 27 21 13 8 21 27 118 P(p); p := p.next END It follows from the definitions of the while statement and of the linking structure that P is applied to all elements of the list and to no other ones. A very frequent operation performed is list searching for an element with a given key x. Unlike for arrays, the search must here be purely sequential. The search terminates either if an element is found or if the end of the list is reached. This is reflected by a logical conjunction consisting of two terms. Again, we assume that the head of the list is designated by a pointer p. WHILE (p # NIL) & (p.key # x) DO p := p.next END p = NIL implies that p^ does not exist, and hence that the expression p.key # x is undefined. The order of the two terms is therefore essential. 4.3.2. Ordered Lists and Reorganizing Lists The given linear list search strongly resembles the search routines for scanning an array or a sequence. In fact, a sequence is precisely a linear list for which the technique of linkage to the successor is left unspecified or implicit. Since the primitive sequence operators do not allow insertion of new elements (except at the end) or deletion (except removal of all elements), the choice of representation is left wide open to the implementor, and he may well use sequential allocation, leaving successive components in contiguous storage areas. Linear lists with explicit pointers provide more flexibility, and therefore they should be used whenever this additional flexibility is needed. To exemplify, we will now consider a problem that will occur throughout this chapter in order to illustate alternative solutions and techniques. It is the problem of reading a text, collecting all its words, and counting the frequency of their occurrence. It is called the construction of a concordance or the generation of a cross-reference list . An obvious solution is to construct a list of words found in the text. The list is scanned for each word. If the word is found, its frequency count is incremented; otherwise the word is added to the list. We shall simply call this process search, although it may actually also include an insertion. In order to be able to concentrate our attention on the essential part of list handling, we assume that the words have already been extracted from the text under investigation, have been encoded as integers, and are available in the from of an input sequence. The formulation of the procedure called search follows in a straightforward manner. The variable root refers to the head of the list in which new words are inserted accordingly. The complete algorithm is listed below; it includes a routine for tabulating the constructed cross-reference list. The tabulation process is an example in which an action is executed once for each element of the list. TYPE Word = POINTER TO RECORD key, count: INTEGER; next: Word END ; PROCEDURE search(x: INTEGER; VAR root: Word); VAR w: Word; BEGIN w := root; WHILE (w # NIL) & (w.key # x) DO w := w.next END ; (* (w = NIL) OR (w.key = x) *) IF w = NIL THEN (*new entry*) w := root; NEW(root); root.key := x; root.count := 1; root.next := w ELSE INC(w.count) END END search; PROCEDURE PrintList(w: Word); BEGIN (*uses global writer W *) WHILE w # NIL DO 119 Texts.WriteInt(W, w.key, 8); Texts.WriteInt(W, w.count, 8); Texts.WriteLn(W); w := w.next END END PrintList; The linear scan algorithm resembles the search procedure for arrays, and reminds us of a simple technique used to simplify the loop termination condition: the use of a sentinel. A sentinel may as well be used in list search; it is represented by a dummy element at the end of the list. The new procedure is listed below. We must assume that a global variable sentinel is added and that the initialization of root := NIL is replaced by the statements NEW(sentinel); root := sentinel which generate the element to be used as sentinel. PROCEDURE search(x: INTEGER; VAR root: Word); VAR w: Word; BEGIN w := root; sentinel.key := x; WHILE w.key # x DO w := w.next END ; IF w = sentinel THEN (*new entry*) w := root; NEW(root); root.key := x; root.count := 1; root.next := w ELSE INC(w.count) END END search Obviously, the power and flexibility of the linked list are ill used in this example, and the linear scan of the entire list can only be accepted in cases in which the number of elements is limited. An easy improvement, however, is readily at hand: the ordered list search. If the list is ordered (say by increasing keys), then the search may be terminated at the latest upon encountering the first key that is larger than the new one. Ordering of the list is achieved by inserting new elements at the appropriate place instead of at the head. In effect, ordering is practically obtained free of charge. This is because of the ease by which insertion in a linked list is achieved, i.e., by making full use of its flexibility. It is a possibility not provided by the array and sequence structures. (Note, however, that even in ordered lists no equivalent to the binary search of arrays is available). Fig. 4.10. Insertion in ordered list Ordered list search is a typical example of the situation, where an element must be inserted ahead of a given item, here in front of the first one whose key is too large. The technique shown here, however, differs from the one used shown earlier. Instead of copying values, two pointers are carried along in the list traversal; w2 lags one step behind w1 and thus identifies the proper insertion place when w1 has found too large a key. The general insertion step is shown in Fig. 4.10. The pointer to the new element (w3) is to be assigned to w2^.next, except when the list is still empty. For reasons of simplicity and effectiveness, we prefer to avoid this distinction by using a conditional statement. The only way to avoid 1 5 w2 7 w3 5 w1 12 NIL 120 this is to introduce a dummy element at the list head. The initializing statement root := NIL is accordingly replaced by NEW(root); root.next := NIL Referring to Fig. 4.10, we determine the condition under which the scan continues to proceed to the next element; it consists of two factors, namely, (w1 # NIL) & (w1.key < x) The resulting search procedure is:. PROCEDURE search(x: INTEGER); VAR root: Word); VAR w1, w2, w3: Word; BEGIN (*w2 # NIL*) w2 := root; w1 := w2.next; WHILE (w1 # NIL) & (w1.key < x) DO w2 := w1; w1 := w2.next END ; (* (w1 = NIL) OR (w1.key >= x) *) IF (w1 = NIL) OR (w1.key > x) THEN (*new entry*) NEW(w3); w2.next := w3; w3.key := x; w3.count := 1; w3.next := w1 ELSE INC(w1.count) END END search In order to speed up the search, the continuation condition of the while statement can once again be simplified by using a sentinel. This requires the initial presence of a dummy header as well as a sentinel at the tail. It is now high time to ask what gain can be expected from ordered list search. Remembering that the additional complexity incurred is small, one should not expect an overwhelming improvement. Assume that all words in the text occur with equal frequency. In this case the gain through lexicographical ordering is indeed also nil, once all words are listed, because the position of a word does not matter if only the total of all access steps is significant and if all words have the same frequency of occurrence. However, a gain is obtained whenever a new word is to be inserted. Instead of first scanning the entire list, on the average only half the list is scanned. Hence, ordered list insertion pays off only if a concordance is to be generated with many distinct words compared to their frequency of occurrence. The preceding examples are therefore suitable primarily as programming exercises rather than for practical applications. The arrangement of data in a linked list is recommended when the number of elements is relatively small (< 50), varies, and, moreover, when no information is given about their frequencies of access. A typical example is the symbol table in compilers of programming languages. Each declaration causes the addition of a new symbol, and upon exit from its scope of validity, it is deleted from the list. The use of simple linked lists is appropriate for applications with relatively short programs. Even in this case a considerable improvement in access method can be achieved by a very simple technique which is mentioned here again primarily because it constitutes a pretty example for demonstrating the flexibilities of the linked list structure. A characteristic property of programs is that occurrences of the same identifier are very often clustered, that is, one occurrence is often followed by one or more reoccurrences of the same word. This information is an invitation to reorganize the list after each access by moving the word that was found to the top of the list, thereby minimizing the length of the search path the next time it is sought. This method of access is called list search with reordering, or -- somewhat pompously -- self-organizing list search. In presenting the corresponding algorithm in the form of a procedure, we take advantage of our experience made so far and introduce a sentinel right from the start. In fact, a sentinel not only speeds up the search, but in this case it also simplifies the program. The list must initially not be empty, but contains the sentinel element already. The initialization statements are 121 NEW(sentinel); root := sentinel Note that the main difference between the new algorithm and the straight list search is the action of reordering when an element has been found. It is then detached or deleted from its old position and inserted at the top. This deletion again requires the use of two chasing pointers, such that the predecessor w2 of an identified element w1 is still locatable. This, in turn, calls for the special treatment of the first element (i.e., the empty list). To conceive the linking process, we refer to Fig. 4.11. It shows the two pointers when w1 was identified as the desired element. The configuration after correct reordering is represented in Fig. 4.12, and the complete new search procedure is listed below. Fig. 4.11. List before re-ordering Fig. 4.12. List after re-ordering PROCEDURE search(x: INTEGER; VAR root: Word); VAR w1, w2: Word; BEGIN w1 := root; sentinel.key := x; IF w1 = sentinel THEN (*first element*) NEW(root); root.key := x; reoot.count := 1; root.next := sentinel ELSIF w1.key = x THEN INC(w1.count) ELSE (*search*) REPEAT w2 := w1; w1 := w2.next UNTIL w1.key = x; IF w1 = sentinel THEN (*new entry*) w2 := root; NEW(root); root.key := x; root.count := 1; root.next := w2 ELSE (*found, now reorder*) INC(w1^.count); w2.next := w1.next; w1.next := root; root := w1 X1 3 U2 2 A0 8 G5 6 NIL root sentinel w2 w1 X1 3 U2 2 A0 7 G5 6 NIL root sentinel w2 w1 122 END END END search The improvement in this search method strongly depends on the degree of clustering in the input data. For a given factor of clustering, the improvement will be more pronounced for large lists. To provide an idea of how much gain can be expected, an empirical measurement was made by applying the above cross- reference program to a short and a relatively long text and then comparing the methods of linear list ordering and of list reorganization. The measured data are condensed into Table 4.2. Unfortunately, the improvement is greatest when a different data organization is needed anyway. We will return to this example in Sect. 4.4. Test 1 Test 2 Number of distinct keys 53 582 Number of occurrences of keys 315 14341 Time for search with ordering 6207 3200622 Time for search with reordering 4529 681584 Improvement factor 1.37 4.70 Table 4.2 Comparsion of List Search Methods. 4.3.3. An Application: Partial Ordering (Topological Sorting) An appropriate example of the use of a flexible, dynamic data structure is the process of topological sorting. This is a sorting process of items over which a partial ordering is defined, i.e., where an ordering is given over some pairs of items but not between all of them. The following are examples of partial orderings: 1. In a dictionary or glossary, words are defined in terms of other words. If a word v is defined in terms of a word w, we denote this by v 〈 w. Topological sorting of the words in a dictionary means arranging them in an order such that there will be no forward references. 2. A task (e.g., an engineering project) is broken up into subtasks. Completion of certain subtasks must usually precede the execution of other subtasks. If a subtask v must precede a subtask w, we write v 〈 w. Topological sorting means their arrangement in an order such that upon initiation of each subtask all its prerequisite subtasks have been completed. 3. In a university curriculum, certain courses must be taken before others since they rely on the material presented in their prerequisites. If a course v is a prerequisite for course w, we write v 〈 w. Topological sorting means arranging the courses in such an order that no course lists a later course as prerequisite. 4. In a program, some procedures may contain calls of other procedures. If a procedure v is called by a procedure w, we write v 〈 w. Topological sorting implies the arrangement of procedure declarations in such a way that there are no forward references. In general, a partial ordering of a set S is a relation between the elements of S. It is denoted by the symbol “〈”, verbalized by precedes , and satisfies the following three properties (axioms) for any distinct elements x, y, z of S: 1. if x 〈 y and y 〈 z, then x 〈 z (transitivity) 2. if x 〈 y, then not y 〈 x (asymmetry) 3. not z 〈 z (irreflexivity) For evident reasons, we will assume that the sets S to be topologically sorted by an algorithm are finite. Hence, a partial ordering can be illustrated by drawing a diagram or graph in which the vertices denote the elements of S and the directed edges represent ordering relationships. An example is shown in Fig. 4.13. 123 Fig. 4.13. Partially ordered set The problem of topological sorting is to embed the partial order in a linear order. Graphically, this implies the arrangement of the vertices of the graph in a row, such that all arrows point to the right, as shown in Fig. 4.14. Properties (1) and (2) of partial orderings ensure that the graph contains no loops. This is exactly the prerequisite condition under which such an embedding in a linear order is possible. Fig. 4.14. Linear arrangement of the partially ordered set of Fig. 4.13. How do we proceed to find one of the possible linear orderings? The recipe is quite simple. We start by choosing any item that is not preceded by another item (there must be at least one; otherwise a loop would exist). This object is placed at the head of the resulting list and removed from the set S. The remaining set is still partially ordered, and so the same algorithm can be applied again until the set is empty. In order to describe this algorithm more rigorously, we must settle on a data structure and representation of S and its ordering. The choice of this representation is determined by the operations to be performed, particularly the operation of selecting elements with zero predecessors. Every item should therefore be represented by three characteristics: its identification key, its set of successors, and a count of its predecessors. Since the number n of elements in S is not given a priori, the set is conveniently organized as a linked list. Consequently, an additional entry in the description of each item contains the link to the next item in the list. We will assume that the keys are integers (but not necessarily the consecutive integers from 1 to n). Analogously, the set of each item's successors is conveniently represented as a linked list. Each element of the successor list is described by an identification and a link to the next item on this list. If we call the descriptors of the main list, in which each item of S occurs exactly once, leaders, and the descriptors of elements on the successor chains trailers, we obtain the following declarations of data types: TYPE Leader = POINTER TO LeaderDesc; Trailer = POINTER TO TrailerDesc; LeaderDesc = RECORD key, count: INTEGER; trail: Trailer; next: Leader END; TrailerDesc = RECORD id: Leader; next: Trailer END 7 9 1 2 4 6 3 5 8 10 1 2 3 6 4 5 7 8 9 10 124 Assume that the set S and its ordering relations are initially represented as a sequence of pairs of keys in the input file. The input data for the example in Fig. 4.13 are shown below, in which the symbols 〈 are added for the sake of clarity, symbolizing partial order: 1 〈 2 2 〈 4 4 〈 6 2 〈 10 4 〈 8 6 〈 3 1 〈 3 3 〈 5 5 〈 8 7 〈 5 7 〈 9 9 〈 4 9 〈 10 The first part of the topological sort program must read the input and transform the data into a list structure. This is performed by successively reading a pair of keys x and y (x 〈 y). Let us denote the pointers to their representations on the linked list of leaders by p and q. These records must be located by a list search and, if not yet present, be inserted in the list. This task is perfomed by a function procedure called find . Subsequently, a new entry is added in the list of trailers of x, along with an identification of y; the count of predecessors of y is incremented by 1. This algorithm is called input phase . Figure 4.15 illustrates the data structure generated during processing the given input data. The function find(w) yields the pointer to the list element with key w. In the following poece of program we make use of text scanning , a feature of the Oberon system’s text concept. Instead of considering a text (file) as a sequence of characters, a text is considered as a sequence of tokens , which are identifiers, numbers, strings, and special characters (such as +, *, <, etc. The procedure Texts.Scan(S) scans the text, reading the next token. The scanner S plays the role of a text rider. (*input phase*) NEW(head); tail := head; z := 0; Texts.Scan(S); WHILE S.class = Texts.Int DO x := S.i; Texts.Scan(S); y := S.i; p := find(x); q := find(y); NEW(t); t.id := q; t.next := p.trail; p.trail := t; INC(q.count); Texts.Scan(S) END Fig. 4.15. List structure generated by TopSort program After the data structure of Fig. 4.15 has been constructed in this input phase, the actual process of topological sorting can be taken up as described above. But since it consists of repeatedly selecting an element with a zero count of predecessors, it seems sensible to first gather all such elements in a linked chain. Since we note that the original chain of leaders will afterwards no longer be needed, the same field called next may be used again to link the zero predecessor leaders. This operation of replacing one chain by another chain occurs frequently in list processing. It is expressed in detail here, and for reasons of convenience it constructs the new chain in reverse order. head 1 0 key count next trail id next id next 2 1 4 2 6 1 10 2 8 2 3 2 5 2 7 0 9 1 tail 125 (*search for leaders without predecessors*) p := head; head := NIL; WHILE p # tail DO q := p; p := q.next; IF q.count = 0 THEN (*insert q^ in new chain*) q.next := head; head := q END END Referring to Fig. 4.15, we see that the next chain of leaders is replaced by the one of Fig. 4.16 in which the pointers not depicted are left unchanged. Fig. 4.16. List of Leaders with zero count After all this preparatory establishing of a convenient representation of the partially ordered set S, we can finally proceed to the actual task of topological sorting, i.e., of generating the output sequence. In a first rough version it can be described as follows: q := head; WHILE q # NIL DO (*output this element, then delete it*) Texts.WriteInt(W, q.key, 8); DEC(n); t := q.trail; q := q.next; decrement the predecessor count of all its successors on trailer list t; if any count becomes 0, insert this element in the leader list q END The statement that is to be still further refined constitutes one more scan of a list. In each step, the auxiliary variable p designates the leader element whose count has to be decremented and tested. WHILE t # NIL DO p := t.id; DEC(p.count); IF p.count = 0 THEN (*insert p^ in leader list*) p.next := q; q := p END ; t := t.next END This completes the program for topological sorting. Note that a counter n was introduced to count the leaders generated in the input phase. This count is decremented each time a leader element is output in the output phase. It should therefore return to zero at the end of the program. Its failure to return to zero is an indication that there are elements left in the structure when none is without predecessor. In this case the set S is evidently not partially ordered. The output phase programmed above is an example of a process that maintains a list that pulsates, i.e., in which elements are inserted and removed in an unpredictable order. It is therefore an example of a process which utilizes the full flexibility offered by the explicitly linked list. VAR head, tail: Leader; n: INTEGER; PROCEDURE find(w: INTEGER): Leader; VAR h: Leader; BEGIN h := head; tail.key := w; (*sentinel*) 1 0 NIL 7 0 head 126 WHILE h.key # w DO h := h.next END ; IF h = tail THEN NEW(tail); INC(n); h.count := 0; h.trail := NIL; h.next := tail END ; RETURN h END find; PROCEDURE TopSort(VAR R: Texts.Reader); VAR p, q: Leader; t: Trailer; (*uses global writer W*) x, y, n: INTEGER; BEGIN (*initialize list of leaders with a dummy acting as sentinel*) NEW(head); tail := head; n := 0; Texts.ReadInt(R, x); (*input phase*) WHILE ~R.eot DO Texts.WriteInt(W, x, 8); Texts.ReadInt(R, y); Texts.WriteInt(W, y, 8); Texts.WriteLn(W); p := find(x); q := find(y); NEW(t); t.id := q; t.next := p.trail; p.trail := t; INC(q.count); Texts.ReadInt(R, x); END ; (*search for leaders without predecessors*) p := head; head := NIL; WHILE p # tail DO q := p; p := q.next; IF q.count = 0 THEN (*insert q in new chain*) q.next := head; head := q END END ; (*output phase*) q := head; WHILE q # NIL DO Texts.WriteLn(W); Texts.WriteInt(W, q.key, 8); DEC(n); t := q.trail; q := q.next; WHILE t # NIL DO p := t.id; DEC(p.count); IF p.count = 0 THEN (*insert p in leader list*) p.next := q; q := p END ; t := t.next END END ; IF n # 0 THEN Texts.WriteString(W, "This set is not partially ordered") END ; Texts.WriteLn(W) END TopSort. 4.4 Tree Structures 4.4.1. Basic Concepts and Definitions We have seen that sequences and lists may conveniently be defined in the following way: A sequence (list) with base type T is either 1. The empty sequence (list). 2. The concatenation (chain) of a T and a sequence with base type T. Hereby recursion is used as an aid in defining a structuring principle, namely, sequencing or iteration. Sequences and iterations are so common that they are usually considered as fundamental patterns of structure and behaviour. But it should be kept in mind that they can be defined in terms of recursion, whereas the reverse is not true, for recursion may be effectively and elegantly used to define much more sophisticated structures. Trees are a well-known example. Let a tree structure be defined as follows: A tree structure with base type T is either 127 1. The empty structure. 2. A node of type T with a finite number of associated disjoint tree structures of base type T, called subtrees. From the similarity of the recursive definitions of sequences and tree structures it is evident that the sequence (list) is a tree structure in which each node has at most one subtree. The list is therefore also called a degenerate tree. There are several ways to represent a tree structure. For example, a tree structure with its base type T ranging over the letters is shown in various ways in Fig. 4.17. These representations all show the same structure and are therefore equivalent. It is the graph structure that explicitly illustrates the branching relationships which, for obvious reasons, led to the generally used name tree. Strangely enough, it is customary to depict trees upside down, or -- if one prefers to express this fact differently -- to show the roots of trees. The latter formulation, however, is misleading, since the top node (A) is commonly called the root. 128 Fig. 4.17. Representation of tree structure by (a) nested sets, (b) nested parentheses, (c) indented text, and (d) graph An ordered tree is a tree in which the branches of each node are ordered. Hence the two ordered trees in Fig. 4.18 are distinct, different objects. A node y that is directly below node x is called a (direct) descendant of x; if x is at level i, then y is said to be at level i+1. Conversely, node x is said to be the (direct) ancestor of y. The root of a tree is defined to be at level 0. The maximum level of any element of a tree is said to be its depth or height . Fig. 4.18. Two distinct trees A B C A C B K E L J I D B O F P M N H G C A a) (A(B(D(I),E(J,K,L)),C(F(O),G(M,N),H(P)))) b) A B D I E J K L C F O G M N H P c) A B C D E I J K L F G H O M N P d) 129 If an element has no descendants, it is called a terminal node or a leaf ; and an element that is not terminal is an interior node. The number of (direct) descendants of an interior node is called its degree . The maximum degree over all nodes is the degree of the tree. The number of branches or edges that have to be traversed in order to proceed from the root to a node x is called the path length of x. The root has path length 0, its direct descendants have path length 1, etc. In general, a node at level i has path length i. The path length of a tree is defined as the sum of the path lengths of all its components. It is also called its internal path length . The internal path length of the tree shown in Fig. 4.17, for instance, is 36. Evidently, the average path length is Pint = ( Si: 1 ≤ i ≤ n: n i×i) / n where n i is the number of nodes at level i. In order to define what is called the external path length , we extend the tree by a special node wherever a subtree was missing in the original tree. In doing so, we assume that all nodes are to have the same degree, namely the degree of the tree. Extending the tree in this way therefore amounts to filling up empty branches, whereby the special nodes, of course, have no further descendants. The tree of Fig. 4.17 extended with special nodes is shown in Fig. 4.19 in which the special nodes are represented by squares. The external path length is now defined as the sum of the path lengths over all special nodes. If the number of special nodes at level i is m i, then the average external path length is Pext = ( Si: 1 ≤ i ≤ m i×i) / m Fig. 4.19. Ternary tree extended with special nodes In the tree shown in Fig. 4.19 the external path length is 120. The number of special nodes m to be added in a tree of degree d directly depends on the number n of original nodes. Note that every node has exactly one edge pointing to it. Thus, there are m+n edges in the extended tree. On the other hand, d edges are emanating from each original node, none from the special nodes. Therefore, there exist d*n + 1 edges, the 1 resulting from the edge pointing to the root. The two results yield the following equation between the number m of special nodes and n of original nodes: d×n + 1 = m+n, or m = (d-1)×n + 1 The maximum number of nodes in a tree of a given height h is reached if all nodes have d subtrees, except those at level h, all of which have none. For a tree of degree d, level 0 then contains 1 node (namely, the root), level 1 contains its d descendants, level 2 contains the d 2 descendants of the d nodes at level 2, etc. This yields Nd(h) = Si: 0 ≤ i < h: d i as the maximum number of nodes for a tree with height h and degree d. For d = 2, we obtain N2(h) = 2 h - 1 A B C D E F G H I K J L O M N P 130 Of particular importance are the ordered trees of degree 2. They are called binary trees. We define an ordered binary tree as a finite set of elements (nodes) which either is empty or consists of a root (node) with two disjoint binary trees called the left and the right subtree of the root. In the following sections we shall exclusively deal with binary trees, and we therefore shall use the word tree to mean ordered binary tree . Trees with degree greater than 2 are called multiway trees and are discussed in Sect. 7 of this chapter. Familiar examples of binary trees are the family tree (pedigree) with a person's father and mother as descendants (!), the history of a tennis tournament with each game being a node denoted by its winner and the two previous games of the combatants as its descendants, or an arithmetic expression with dyadic operators, with each operator denoting a branch node with its operands as subtrees (see Fig. 4.20). Fig. 4.20. Tree representation of expression (a + b/c) * (d – e*f) We now turn to the problem of representation of trees. It is plain that the illustration of such recursive structures in terms of branching structures immediately suggests the use of our pointer facility. There is evidently no use in declaring variables with a fixed tree structure; instead, we define the nodes as variables with a fixed structure, i.e., of a fixed type, in which the degree of the tree determines the number of pointer components referring to the node's subtrees. Evidently, the reference to the empty tree is denoted by NIL. Hence, the tree of Fig. 4.20 consists of components of a type defined as follows and may then be constructed as shown in Fig. 4.21. TYPE Node = POINTER TO NodeDesc; TYPE NodeDesc = RECORD op: CHAR; left, right: Node END Fig. 4.21. Tree of Fig. 4.21 represented as linked data structure e NIL NIL f NIL NIL * d NIL NIL - b NIL NIL c NIL NIL / a NIL NIL + * root * e f - d / b c + a * 131 Before investigating how trees might be used advantageously and how to perform operations on trees, we give an example of how a tree may be constructed by a program. Assume that a tree is to be generated containing nodes with the values of the nodes being n numbers read from an input file. In order to make the problem more challenging, let the task be the construction of a tree with n nodes and minimal height. In order to obtain a minimal height for a given number of nodes, one has to allocate the maximum possible number of nodes of all levels except the lowest one. This can clearly be achieved by distributing incoming nodes equally to the left and right at each node. This implies that we structure the tree for given n as shown in Fig. 4.22, for n = 1, ... , 7. Fig. 4.22. Perfectly balanced trees The rule of equal distribution under a known number n of nodes is best formulated recursively: 1. Use one node for the root. 2. Generate the left subtree with nl = n DIV 2 nodes in this way. 3. Generate the right subtree with nr = n - nl - 1 nodes in this way. The rule is expressed as a recursive procedure which reads the input file and constructs the perfectly balanced tree. We note the following definition: A tree is perfectly balanced, if for each node the numbers of nodes in its left and right subtrees differ by at most 1. TYPE Node = POINTER TO RECORD key: INTEGER; left, right: Node END ; VAR R: Texts.Reader; W: Texts.Writer; root: Node; PROCEDURE tree(n: INTEGER): Node; VAR new: Node; x, nl, nr: INTEGER; BEGIN (*construct perfectly balanced tree with n nodes*) IF n = 0 THEN new := NIL ELSE nl := n DIV 2; nr := n-nl-1; NEW(new); Texts.ReadInt(R, new.key); new.key := x; new.left := tree(nl); new.right := tree(nr) END ; RETURN new END tree; PROCEDURE PrintTree(t: Node; h: INTEGER); VAR i: INTEGER; n=1 n=2 n=3 n=4 n=5 n=7 n=6 132 BEGIN (*print tree t with indentation h*) IF t # NIL THEN PrintTree(t.left, h+1); FOR i := 1 TO h DO Texts.Write(W, TAB) END ; Texts.WriteInt(W, key, 6); Texts.WriteLn(W); PrintTree(t.right, h+1) END END PrintTree; Assume, for example, the following input data for a tree with 21 nodes: 8 9 11 15 19 20 21 7 3 2 1 5 6 4 13 14 10 12 17 16 18 The call root := tree(21) reads the input dara while constructing the perfectly balanced tree shown in Fig. 4.23. Note the simplicity and transparency of this program that is obtained through the use of recursive procedures. It is obvious that recursive algorithms are particularly suitable when a program is to manipulate information whose structure is itself defined recursively. This is again manifested in the procedure which prints the resulting tree: The empty tree results in no printing, the subtree at level L in first printing its own left subtree, then the node, properly indented by preceding it with L tabs, and finally in printing its right subtree. Fig. 4.23. Tree generated by preceding program 4.4.2. Basic Operations on Binary Trees There are many tasks that may have to be perfomed on a tree structure; a common one is that of executing a given operation P on each element of the tree. P is then understood to be a parameter of the more general task of visting all nodes or, as it is usually called, of tree traversal. If we consider the task as a single sequential process, then the individual nodes are visited in some specific order and may be considered as being laid out in a linear arrangement. In fact, the description of many algorithms is considerably facilitated if we can talk about processing the next element in the tree based in an underlying order. There are three principal orderings that emerge naturally from the structure of trees. Like the tree structure itself, they are conveniently expressed in recursive terms. Referring to the binary tree in Fig. 4.24 in which R denotes the root and A and B denote the left and right subtrees, the three orderings are 1. Preorder: R, A, B (visit root before the subtrees) 2. Inorder: A, R, B 3. Postorder: A, B, R (visit root after the subtrees) 16 10 13 2 21 17 18 4 14 3 1 6 12 8 9 5 11 7 15 20 19 133 Fig. 4.24. Binary tree Traversing the tree of Fig. 4.20 and recording the characters seen at the nodes in the sequence of encounter, we obtain the following orderings: 1. Preorder: * + a / b c - d * e f 2. Inorder: a + b / c * d - e * f 3. Postorder: a b c / + d e f * - * We recognize the three forms of expressions: preorder traversal of the expression tree yields prefix notation; postorder traversal generates postfix notation; and inorder traversal yields conventional infix notation, although without the parentheses necessary to clarify operator precedences. Let us now formulate the three methods of traversal by three concrete programs with the explicit parameter t denoting the tree to be operated upon and with the implicit parameter P denoting the operation to be performed on each node. Assume the following definitions: TYPE Node = POINTER TO RECORD ... left, right: Node END The three methods are now readily formulated as recursive procedures; they demonstrate again the fact that operations on recursively defined data structures are most conveniently defined as recursive algorithms. PROCEDURE preorder(t: Node); BEGIN IF t # NIL THEN P(t); preorder(t.left); preorder(t.right) END END preorder PROCEDURE inorder(t: Node); BEGIN IF t # NIL THEN inorder(t.left); P(t); inorder(t.right) END END inorder PROCEDURE postorder(t: Node); BEGIN IF t # NIL THEN postorder(t.left); postorder(t.right); P(t) END END postorder Note that the pointer t is passed as a value parameter. This expresses the fact that the relevant entity is the reference to the considered subtree and not the variable whose value is the pointer, and which could be changed in case t were passed as a variable parameter. An example of a tree traversal routine is that of printing a tree, with appropriate indentation indicating each node's level. R A B 134 Binary trees are frequently used to represent a set of data whose elements are to be retrievable through a unique key. If a tree is organized in such a way that for each node ti, all keys in the left subtree of ti are less than the key of ti, and those in the right subtree are greater than the key of ti, then this tree is called a search tree. In a search tree it is possible to locate an arbitrary key by starting at the root and proceeding along a search path switching to a node's left or right subtree by a decision based on inspection of that node's key only. As we have seen, n elements may be organized in a binary tree of a height as little as log n. Therefore, a search among n items may be performed with as few as log n comparsions if the tree is perfectly balanced. Obviously, the tree is a much more suitable form for organizing such a set of data than the linear list used in the previous section. As this search follows a single path from the root to the desired node, it can readily be programmed by iteration: PROCEDURE locate(x: INTEGER; t: Node): Node; BEGIN WHILE (t # NIL) & (t.key # x) DO IF t.key < x THEN t := t.right ELSE t := t.left END END ; RETURN t END locate The function locate(x, t) yields the value NIL, if no key with value x is found in the tree with root t. As in the case of searching a list, the complexity of the termination condition suggests that a better solution may exist, namely the use of a sentinel. This technique is equally applicable in the case of a tree. The use of pointers makes it possible for all branches of the tree to terminate with the same sentinel. The resulting structure is no longer a tree, but rather a tree with all leaves tied down by strings to a single anchor point (Fig. 4.25). The sentinel may be considered as a common, shared representative of all external nodes by which the original tree was extended (see Fig. 4.19): PROCEDURE locate(x: INTEGER; t: Ptr): Ptr; BEGIN s.key := x; (*sentinel*) WHILE t.key # x DO IF t.key < x THEN t := t.right ELSE t := t.left END END ; RETURN t END locate Note that in this case locate(x, t) yields the value s instead of NIL, i.e., the pointer to the sentinel, if no key with value x is found in the tree with root t. Fig. 4.25. Search tree with sentinel 6 5 3 1 2 4 t s 135 4.4.3. Tree Search and Insertion The full power of the dynamic allocation technique with access through pointers is hardly displayed by those examples in which a given set of data is built, and thereafter kept unchanged. More suitable examples are those applications in which the structure of the tree itself varies, i.e., grows and/or shrinks during the execution of the program. This is also the case in which other data representations, such as the array, fail and in which the tree with elements linked by pointers emerges as the most appropriate solution. We shall first consider only the case of a steadily growing but never shrinking tree. A typical example is the concordance problem which was already investigated in connection with linked lists. It is now to be revisited. In this problem a sequence of words is given, and the number of occurrences of each word has to be determined. This means that, starting with an empty tree, each word is searched in the tree. If it is found, its occurrence count is incremented; otherwise it is inserted as a new word (with a count initialized to 1). We call the underlying task tree search with insertion . The following data type definitions are assumed: TYPE Node = POINTER TO RECORD key, count: INTEGER; left, right: Node END Finding the search path is again straightforward. However, if it leads to a dead end (i.e., to an empty subtree designated by a pointer value NIL), then the given word must be inserted in the tree at the place of the empty subtree. Consider, for example, the binary tree shown in Fig. 4.26 and the insertion of the name Paul . The result is shown in dotted lines in the same picture. Fig. 4.26. Insertion in ordered binary tree The search process is formulated as a recursive procedure. Note that its parameter p is a variable parameter and not a value parameter. This is essential because in the case of insertion a new pointer value must be assigned to the variable which previously held the value NIL. Using the input sequence of 21 numbers that had been used above to construct the tree of Fig. 4.23, the search and insertion procedure yields the binary tree shown in Fig. 4.27, with a call search(k, root) for each key k, where root is a variable of type Node . ER 2 GEORGE 1 NORMA 2 ANN NIL NIL 5 MARY NIL NIL 3 PAUL NIL NIL 1 WALTER NIL NIL 4 136 Fig. 4.27. Search tree generated by preceding program PROCEDURE PrintTree(t: Node; h: INTEGER); VAR i: INTEGER; BEGIN (*print tree t with indentation h*) IF t # NIL THEN PrintTree(t.left, h+1); FOR i := 1 TO h DO Texts.Write(W, TAB) END ; Texts.WriteInt(W, t.key, 6); Texts.WriteLn(W); PrintTree(t.right, h+1) END END PrintTree; PROCEDURE search(x: INTEGER; VAR p: Node); BEGIN IF p = NIL THEN (*x not in tree; insert*) NEW(p); p.key := x; p.count := 1; p.left := NIL; p.right := NIL ELSIF x < p.key THEN search(x, p.left) ELSIF x > p.key THEN search(x, p.right) ELSE INC(p.count) END END search; The use of a sentinel again simplifies the task somewhat. Clearly, at the start of the program the variable root must be initialized by the pointer to the sentinel instead of the value NIL, and before each search the specified value x must be assigned to the key field of the sentinel. PROCEDURE search(x: INTEGER; VAR p: Node); BEGIN IF x < p.key THEN search(x, p.left) ELSIF x > p^key THEN search(x, p.right) ELSIF p # s THEN INC(p.count) ELSE (*insert*) NEW(p); p.key := x; p.left := s; p.right := s; p.count := 1 END END Although the purpose of this algorithm is searching, it can be used for sorting as well. In fact, it resembles the sorting by insertion method quite strongly, and because of the use of a tree structure instead of an array, the need for relocation of the components above the insertion point vanishes. Tree sorting can be 8 7 9 3 2 5 1 4 6 11 10 15 12 13 14 17 16 18 20 21 19 137 programmed to be almost as efficient as the best array sorting methods known. But a few precautions must be taken. After encountering a match, the new element must also be inserted. If the case x = p.key is handled identically to the case x > p.key, then the algorithm represents a stable sorting method, i.e., items with identical keys turn up in the same sequence when scanning the tree in normal order as when they were inserted. In general, there are better ways to sort, but in applications in which searching and sorting are both needed, the tree search and insertion algorithm is strongly recommended. It is, in fact, very often applied in compilers and in data banks to organize the objects to be stored and retrieved. An appropriate example is the construction of a cross-reference index for a given text, an example that we had already used to illustrate list generation. Our task is to construct a program that (while reading a text and printing it after supplying consecutive line numbers) collects all words of this text, thereby retaining the numbers of the lines in which each word occurred. When this scan is terminated, a table is to be generated containing all collected words in alphabetical order with lists of their occurrences. Obviously, the search tree (also called a lexicographic tree ) is a most suitable candidate for representing the words encountered in the text. Each node now not only contains a word as key value, but it is also the head of a list of line numbers. We shall call each recording of an occurrence an item. Hence, we encounter both trees and linear lists in this example. The program consists of two main parts (see Program 4.5), namely, the scanning phase and the table printing phase. The latter is a straightforward application of a tree traversal routine in which visiting each node implies the printing of the key value (word) and the scanning of its associated list of line numbers (items). The following are further clarifications regarding the Cross Reference Generator listed below. Table 4.4 shows the results of processing the text of the preceding procedure search . 1. A word is considered as any sequence of letters and digits starting with a letter. 2. Since words may be of widely different lengths, the actual characters are stored in an array buffer, and the tree nodes contain the index of the key's first character. 3. It is desirable that the line numbers be printed in ascending order in the cross-reference index. Therefore, the item lists must be generated in the same order as they are scanned upon printing. This requirement suggests the use of two pointers in each word node, one referring to the first, and one referring to the last item on the list. We assume the existence of global writer W, and a variable representing the current line number in the text. CONST WordLen = 32; TYPE Word = ARRAY WordLen OF CHAR; Item = POINTER TO RECORD lno: INTEGER; next: Item END ; Node = POINTER TO RECORD key: Word; first, last: Item; (*list*) left, right: Node (*tree*) END ; VAR line: INTEGER; PROCEDURE search(VAR w: Node; VAR a: Word); VAR q: Item; BEGIN IF w = NIL THEN (*word not in tree; new entry, insert*) NEW(w); NEW(q); q.lno := line; COPY(a, w.key); w.first := q; w.last := q; w.left := NIL; w.right := NIL ELSIF w.key < a THEN search(w.right, a) ELSIF w.key > a THEN search(w.left, a) 138 ELSE (*old entry*) NEW(q); q.lno := line; w.last.next := q; w.last := q END END search; PROCEDURE Tabulate(w: Node); VAR m: INTEGER; item: Item; BEGIN IF w # NIL THEN Tabulate(w.left); Texts.WriteString(W, w.key); item := w.first; m := 0; REPEAT IF m = 10 THEN Texts.WriteLn(W); Texts.Write(W, TAB); m := 0; END ; INC(m); Texts.WriteInt(W, item.lno, 6); item := item.next UNTIL item = NIL; Texts.WriteLn(W); Tabulate(w.right) END END Tabulate; PROCEDURE CrossRef(VAR R: Texts.Reader); VAR root: Node; (*uses global writer W*) i: INTEGER; ch: CHAR; w: Word; BEGIN root := NIL; line := 0; Texts.WriteInt(W, 0, 6); Texts.Write(W, TAB); Texts.Read(R, ch); WHILE ~R.eot DO IF ch = 0DX THEN (*line end*) Texts.WriteLn(W); INC(line); Texts.WriteInt(W, line, 6); Texts.Write(W, 9X); Texts.Read(R, ch) ELSIF ("A" <= ch) & (ch <= "Z") OR ("a" <= ch) & (ch <= "z") THEN i := 0; REPEAT IF i < WordLen-1 THEN w[i] := ch; INC(i) END ; Texts.Write(W, ch); Texts.Read(R, ch) UNTIL (i = WordLen-1) OR ~(("A" <= ch) & (ch <= "Z")) & ~(("a" <= ch) & (ch <= "z")) & ~(("0" <= ch) & (ch <= "9")); w[i] := 0X; (*string terminator*) search(root, w) ELSE Texts.Write(W, ch); Texts.Read(R, ch) END ; END ; Texts.WriteLn(W); Texts.WriteLn(W); Tabulate(root) END CrossRef; 0 PROCEDURE search(x: INTEGER; VAR p: Node); 1 BEGIN 2 IF x < p.key THEN search(x, p.left) 3 ELSIF x > p^key THEN search(x, p.right) 4 ELSIF p # s THEN INC(p.count) 5 ELSE (*insert*) NEW(p); 6 p.key := x; p.left := s; p.right := s; p.count := 1 7 END 8 END BEGIN 1 ELSE 5 ELSIF 3 4 139 END 7 8 IF 2 INC 4 INTEGER 0 NEW 5 Node 0 PROCEDURE 0 THEN 2 3 4 VAR 0 count 4 6 insert 5 key 2 3 6 left 2 6 p 0 2 2 3 3 4 4 5 6 6 6 6 right 3 6 s 4 6 6 search 0 2 3 x 0 2 2 3 3 6 Table 4.4 Sample output of cross reference generator 4.4.4. Tree Deletion We now turn to the inverse problem of insertion: deletion. Our task is to define an algorithm for deleting, i.e., removing the node with key x in a tree with ordered keys. Unfortunately, removal of an element is not generally as simple as insertion. It is straightforward if the element to be deleted is a terminal node or one with a single descendant. The difficulty lies in removing an element with two descendants, for we cannot point in two directions with a single pointer. In this situation, the deleted element is to be replaced by either the rightmost element of its left subtree or by the leftmost node of its right subtree, both of which have at most one descendant. The details are shown in the recursive procedure delete . This procedure distinguishes among three cases: 1. There is no component with a key equal to x. 2. The component with key x has at most one descendant. 3. The component with key x has two descendants. PROCEDURE delete(x: INTEGER; VAR p: Node); VAR q: Node; PROCEDURE del (VAR r: Node); BEGIN IF r.right # NIL THEN del(r.right) ELSE q.key := r.key; q.count := r.count; q := r; r := r.left END END del; BEGIN (*delete*) IF p = NIL THEN (*word is not in tree*) ELSIF x < p.key THEN delete(x, p.left) ELSIF x > p.key THEN delete(x, p.right) ELSE (*delete p^*) q := p; IF q.right = NIL THEN p := q.left ELSIF q.left = NIL THEN p := q.right ELSE del(q.left) END 140 END END delete The auxiliary, recursive procedure del is activated in case 3 only. It descends along the rightmost branch of the left subtree of the element q^ to be deleted, and then it replaces the relevant information (key and count) in q^ by the corresponding values of the rightmost component r^ of that left subtree, whereafter t^ may be disposed. We note that we do not mention a procedure that would be the inverse of NEW, indicating that storage is no longer needed and therefore disposable and reusable. It is generally assumed that a computer system recognizes a disposable variable through the circumstance that no other variables are pointing to it, and that it therefore can no longer be referenced. Such a system is called a garbage collector . It is not a feature of a programming language, but rather of its implementations. In order to illustrate the functioning of procedure delete , we refer to Fig. 4.28. Consider the tree (a); then delete successively the nodes with keys 13, 15, 5, 10. The resulting trees are shown in Fig. 4.28 (b-e). Fig. 4.28. Tree deletion 4.4.5. Analysis of Tree Search and Insertion It is a natural reaction to be suspicious of the algorithm of tree search and insertion. At least one should retain some skepticism until having been given a few more details about its behaviour. What worries many programmers at first is the peculiar fact that generally we do not know how the tree will grow; we have no idea about the shape that it will assume. We can only guess that it will most probably not be the perfectly balanced tree. Since the average number of comparisons needed to locate a key in a perfectly balanced tree with n nodes is approximately log n, the number of comparisons in a tree generated by this algorithm will be greater. But how much greater? First of all, it is easy to find the worst case. Assume that all keys arrive in already strictly ascending (or descending) order. Then each key is appended immediately to the right (left) of its predecessor, and the resulting tree becomes completely degenerate, i.e., it turns out to be a linear list. The average search effort is then n/2 comparisons. This worst case evidently leads to a very poor performance of the search algorithm, and it seems to fully justify our skepticism. The remaining question is, of course, how likely this case will be. More precisely, we should like to know the length an of the search path averaged over all n keys and averaged over all n! trees that are generated from the n! permutations of the original n distinct keys. This problem of algorithmic analysis turns out to be fairly straightforward, and it is presented here as a typical example of analyzing an algorithm as well as for the practical importance of its result. 10 5 15 3 8 13 18 a) 10 5 15 3 8 18 b) 10 5 18 3 8 c) 10 3 18 8 d) 10 3 18 e) 141 Given are n distinct keys with values 1, 2, ... , n. Assume that they arrive in a random order. The probability of the first key -- which notably becomes the root node -- having the value i is 1/n. Its left subtree will eventually contain i-1 nodes, and its right subtree n-i nodes (see Fig. 4.29). Let the average path length in the left subtree be denoted by a i-1, and the one in the right subtree is a n-i, again assuming that all possible permutations of the remaining n-1 keys are equally likely. The average path length in a tree with n nodes is the sum of the products of each node's level and its probability of access. If all nodes are assumed to be searched with equal likelihood, then an = ( Si: 1 ≤ i ≤ n: p i) / n where p i is the path length of node i. Fig. 4.29. Weight distribution of branches In the tree in Fig. 4.29 we divide the nodes into three classes: 1. The i-1 nodes in the left subtree have an average path length a i-1 2. The root has a path length of 0. 3. The n-i nodes in the right subtree have an average path length a n-i Hence, the equation above can be expressed as a sum of two terms 1) and 3) an (i) = ((i-1) * a i-1 + (n-i) * a n-i) / n The desired quantity an is the average of a n (i) over all i = 1 ... n, i.e., over all trees with the key 1, 2, ... , n at the root. an = ( Si: 1 ≤ i ≤ n: (i-1) a i-1 + (n-i) a n-i) / n 2 = 2 * ( Si: 1 ≤ i ≤ n: (i-1) a i-1) / n 2 = 2 * ( Si: 1 ≤ i < n: i * a i) / n 2 This equation is a recurrence relation of the form a n = f 1(a 1, a 2, ... , a n-1). From this we derive a simpler recurrence relation of the form a n = f 2(a n-1). We derive directly (1) by splitting off the last term, and (2) by substituting n-1 for n: (1) a n = 2*(n-1 * a n-1 /n 2 + 2 * ( Si: 1 ≤ i < n: i * a i) / n 2 (2) a n-1 = 2 * ( Si: 1 ≤ i < n-1: i * a i) / (n-1) 2 Multiplying (2) by (n-1) 2/n 2 yields (3) 2 * ( Si: 1 ≤ i < n-1: i * a i) / n 2 = a n-1 ø2 * (n-1) 2/n 2 and substituting the right part of (3) in (1), we find an = 2 * (n-1) * a n-1 / n 2 + a n-1 * (n-1) 2 / n 2 = a n-1 * (n-1) 2 / n 2 It turns out that a n can be expressed in non-recursive, closed form in terms of the harmonic sum Hn = 1 + 1/2 + 1/3 + … + 1/n an = 2 * (Hn * (n+1)/n – 1) From Euler's formula (using Euler's constant g = 0.577...) i i-1 n-i 142 Hn = g + ln n + 1/12n 2 + ... we derive, for large n, the approximate value an = 2 * (ln n + g - 1) Since the average path length in the perfectly balanced tree is approximately an' = log n - 1 we obtain, neglecting the constant terms which become insignificant for large n, lim (a n/a n’) = 2 * ln(n) / log(n) = 2 ln(2) = 1.386... What does this result teach us? It tells us that by taking the pains of always constructing a perfectly balanced tree instead of the random tree, we could -- always provided that all keys are looked up with equal probability -- expect an average improvement in the search path length of at most 39%. Emphasis is to be put on the word average, for the improvement may of course be very much greater in the unhappy case in which the generated tree had completely degenerated into a list, which, however, is very unlikely to occur. In this connection it is noteworthy that the expected average path length of the random tree grows also strictly logarithmically with the number of its nodes, even though the worst case path length grows linearly. The figure of 39% imposes a limit on the amount of additional effort that may be spent profitably on any kind of reorganization of the tree's structure upon insertion of keys. Naturally, the ratio between the frequencies of access (retrieval) of nodes (information) and of insertion (update) significantly influences the payoff limits of any such undertaking. The higher this ratio, the higher is the payoff of a reorganization procedure. The 39% figure is low enough that in most applications improvements of the straight tree insertion algorithm do not pay off unless the number of nodes and the access vs. insertion ratio are large. 4.5. Balanced Trees From the preceding discussion it is clear that an insertion procedure that always restores the trees' structure to perfect balance has hardly any chance of being profitable, because the restoration of perfect balance after a random insertion is a fairly intricate operation. Possible improvements lie in the formulation of less strict definitions of balance. Such imperfect balance criteria should lead to simpler tree reorganization procedures at the cost of only a slight deterioration of average search performance. One such definition of balance has been postulated by Adelson-Velskii and Landis [4-1]. The balance criterion is the following: A tree is balanced if and only if for every node the heights of its two subtrees differ by at most 1. Trees satisfying this condition are often called AVL-trees (after their inventors). We shall simply call them balanced trees because this balance criterion appears a most suitable one. (Note that all perfectly balanced trees are also AVL-balanced.) The definition is not only simple, but it also leads to a manageable rebalancing procedure and an average search path length practically identical to that of the perfectly balanced tree. The following operations can be performed on balanced trees in O(log n) units of time, even in the worst case: 1. Locate a node with a given key. 2. Insert a node with a given key. 3. Delete the node with a given key. These statements are direct consequences of a theorem proved by Adelson-Velskii and Landis, which guarantees that a balanced tree will never be more than 45% higher than its perfectly balanced counterpart, no matter how many nodes there are. If we denote the height of a balanced tree with n nodes by h b(n), then log(n+1) < h b(n) < 1.4404*log(n+2) - 0.328 143 The optimum is of course reached if the tree is perfectly balanced for n = 2k-1. But which is the structure of the worst AVL-balanced tree? In order to find the maximum height h of all balanced trees with n nodes, let us consider a fixed height h and try to construct the balanced tree with the minimum number of nodes. This strategy is recommended because, as in the case of the minimal height, the value can be attained only for certain specific values of n. Let this tree of height h be denoted by Th. Clearly, T0 is the empty tree, and T1 is the tree with a single node. In order to construct the tree Th for h > 1, we will provide the root with two subtrees which again have a minimal number of nodes. Hence, the subtrees are also T's. Evidently, one subtree must have height h-1, and the other is then allowed to have a height of one less, i.e. h-2. Figure 4.30 shows the trees with height 2, 3, and 4. Since their composition principle very strongly resembles that of Fibonacci numbers, they are called Fibonacci-trees (see Fig. 4.30). They are defined as follows: 1. The empty tree is the Fibonacci-tree of height 0. 2. A single node is the Fibonacci-tree of height 1. 3. If T h-1 and T h-2 are Fibonacci-trees of heights h-1 and h-2, then Th =* is a Fibonacci-tree. 4. No other trees are Fibonacci-trees. Fig. 4.30. Fibonacci-trees of height 2, 3, and 4 The number of nodes of T h is defined by the following simple recurrence relation: N0 = 0, N 1 = 1 Nh = N h-1 + 1 + N h-2 The N i are those numbers of nodes for which the worst case (upper limit of h) can be attained, and they are called Leonardo numbers. 4.5.1. Balanced Tree Insertion Let us now consider what may happen when a new node is inserted in a balanced tree. Given a root r with the left and right subtrees L and R, three cases must be distinguished. Assume that the new node is inserted in L causing its height to increase by 1: 1. h L = h R: L and R become of unequal height, but the balance criterion is not violated. 2. h L < h R: L and R obtain equal height, i.e., the balance has even been improved. 3. h L > h R: the balance criterion is violated, and the tree must be restructured. Consider the tree in Fig. 4.31. Nodes with keys 9 and 11 may be inserted without rebalancing; the tree with root 10 will become one-sided (case 1); the one with root 8 will improve its balance (case 2). Insertion of nodes 1, 3, 5, or 7, however, requires subsequent rebalancing. 1 2 1 T2 3 2 T3 4 1 3 2 T4 4 5 7 6 144 Fig. 4.31. Balanced tree Some careful scrutiny of the situation reveals that there are only two essentially different constellations needing individual treatment. The remaining ones can be derived by symmetry considerations from those two. Case 1 is characterized by inserting keys 1 or 3 in the tree of Fig. 4.31, case 2 by inserting nodes 5 or 7. The two cases are generalized in Fig. 4.32 in which rectangular boxes denote subtrees, and the height added by the insertion is indicated by crosses. Simple transformations of the two structures restore the desired balance. Their result is shown in Fig. 4.33; note that the only movements allowed are those occurring in the vertical direction, whereas the relative horizontal positions of the shown nodes and subtrees must remain unchanged. Fig. 4.32. Imbalance resulting from insertion Fig. 4.33. Restoring the balance An algorithm for insertion and rebalancing critically depends on the way information about the tree's balance is stored. An extreme solution lies in keeping balance information entirely implicit in the tree structure itself. In this case, however, a node's balance factor must be rediscovered each time it is affected B A case 1 B A case 2 C B A case 1 B A case 2 C 2 8 4 10 6 145 by an insertion, resulting in an excessively high overhead. The other extreme is to attribute an explicitly stored balance factor to every node. The definition of the type Node is then extended into TYPE Node = POINTER TO RECORD key, count, bal: INTEGER; (*bal = -1, 0, +1*) left, right: Node END We shall subsequently interpret a node's balance factor as the height of its right subtree minus the height of its left subtree, and we shall base the resulting algorithm on this node type. The process of node insertion consists essentially of the following three consecutive parts: 1. Follow the search path until it is verified that the key is not already in the tree. 2. Insert the new node and determine the resulting balance factor. 3. Retreat along the search path and check the balance factor at each node. Rebalance if necessary. Although this method involves some redundant checking (once balance is established, it need not be checked on that node's ancestors), we shall first adhere to this evidently correct schema because it can be implemented through a pure extension of the already established search and insertion procedures. This procedure describes the search operation needed at each single node, and because of its recursive formulation it can easily accommodate an additional operation on the way back along the search path. At each step, information must be passed as to whether or not the height of the subtree (in which the insertion had been performed) had increased. We therefore extend the procedure's parameter list by the Boolean h with the meaning the subtree height has increased. Clearly, h must denote a variable parameter since it is used to transmit a result. Assume now that the process is returning to a node p^ from the left branch (see Fig. 4.32), with the indication that it has increased its height. We now must distinguish between the three conditions involving the subtree heights prior to insertion: 1. h L < h R, p.bal = +1, the previous imbalance at p has been equilibrated. 2. h L = h R, p.bal = 0, the weight is now slanted to the left. 3. h L > h R, p.bal = -1, rebalancing is necessary. In the third case, inspection of the balance factor of the root of the left subtree (say, p1.bal) determines whether case 1 or case 2 of Fig. 4.32 is present. If that node has also a higher left than right subtree, then we have to deal with case 1, otherwise with case 2. (Convince yourself that a left subtree with a balance factor equal to 0 at its root cannot occur in this case.) The rebalancing operations necessary are entirely expressed as sequences of pointer reassignments. In fact, pointers are cyclically exchanged, resulting in either a single or a double rotation of the two or three nodes involved. In addition to pointer rotation, the respective node balance factors have to be updated. The details are shown in the search, insertion, and rebalancing procedures. 146 Fig. 4.34. Insertions in balanced tree The working principle is shown by Fig. 4.34. Consider the binary tree (a) which consists of two nodes only. Insertion of key 7 first results in an unbalanced tree (i.e., a linear list). Its balancing involves a RR single rotation, resulting in the perfectly balanced tree (b). Further insertion of nodes 2 and 1 result in an imbalance of the subtree with root 4. This subtree is balanced by an LL single rotation (d). The subsequent insertion of key 3 immediately offsets the balance criterion at the root node 5. Balance is thereafter reestablished by the more complicated LR double rotation; the outcome is tree (e). The only candidate for losing balance after a next insertion is node 5. Indeed, insertion of node 6 must invoke the fourth case of rebalancing outlined below, the RL double rotation. The final tree is shown in Fig.4.34 (f). PROCEDURE search(x: INTEGER; VAR p: Node; VAR h: BOOLEAN); VAR p1, p2: Node; (*~h*) BEGIN IF p = NIL THEN (*insert*) NEW(p); h := TRUE; p.key := x; p.count := 1; p.left := NIL; p.right := NIL; p.bal := 0 ELSIF p.key > x THEN search(x, p.left, h); IF h THEN (*left branch has grown*) IF p.bal = 1 THEN p.bal := 0; h := FALSE ELSIF p.bal = 0 THEN p.bal := -1 ELSE (*bal = -1, rebalance*) p1 := p.left; IF p1.bal = -1 THEN (*single LL rotation*) p.left := p1.right; p1.right := p; p.bal := 0; p := p1 ELSE (*double LR rotation*) p2 := p1.right; p1.right := p2.left; p2.left := p1; p.left := p2.right; p2.right := p; IF p2.bal = -1 THEN p.bal := 1 ELSE p.bal := 0 END ; IF p2.bal = +1 THEN p1.bal := -1 ELSE p1.bal := 0 END ; p := p2 END ; p.bal := 0; h := FALSE END END ELSIF p.key < x THEN search(x, p.right, h); 4 2 6 1 3 5 7 f) 4 2 5 1 3 7 e) 5 2 7 1 4 d) 5 4 7 2 c) 5 4 7 b) 4 5 a) 147 IF h THEN (*right branch has grown*) IF p.bal = -1 THEN p.bal := 0; h := FALSE ELSIF p.bal = 0 THEN p.bal := 1 ELSE (*bal = +1, rebalance*) p1 := p.right; IF p1.bal = 1 THEN (*single RR rotation*) p.right := p1.left; p1.left := p; p.bal := 0; p := p1 ELSE (*double RL rotation*) p2 := p1.left; p1.left := p2.right; p2.right := p1; p.right := p2.left; p2.left := p; IF p2.bal = +1 THEN p.bal := -1 ELSE p.bal := 0 END ; IF p2.bal = -1 THEN p1.bal := 1 ELSE p1.bal := 0 END ; p := p2 END ; p.bal := 0; h := FALSE END END ELSE INC(p.count) END END search Two particularly interesting questions concerning the performance of the balanced tree insertion algorithm are the following: 1. If all n! permutations of n keys occur with equal probability, what is the expected height of the constructed balanced tree? 2. What is the probability that an insertion requires rebalancing? Mathematical analysis of this complicated algorithm is still an open problem. Empirical tests support the conjecture that the expected height of the balanced tree thus generated is h = log(n)+c, where c is a small constant (c ≈ 0.25). This means that in practice the AVL-balanced tree behaves as well as the perfectly balanced tree, although it is much simpler to maintain. Empirical evidence also suggests that, on the average, rebalancing is necessary once for approximately every two insertions. Here single and double rotations are equally probable. The example of Fig. 4.34 has evidently been carefully chosen to demonstrate as many rotations as possible in a minimum number of insertions. The complexity of the balancing operations suggests that balanced trees should be used only if information retrievals are considerably more frequent than insertions. This is particularly true because the nodes of such search trees are usually implemented as densely packed records in order to economize storage. The speed of access and of updating the balance factors -- each requiring two bits only -- is therefore often a decisive factor to the efficiency of the rebalancing operation. Empirical evaluations show that balanced trees lose much of their appeal if tight record packing is mandatory. It is indeed difficult to beat the straightforward, simple tree insertion algorithm. 4.5.2. Balanced Tree Deletion Our experience with tree deletion suggests that in the case of balanced trees deletion will also be more complicated than insertion. This is indeed true, although the rebalancing operation remains essentially the same as for insertion. In particular, rebalancing consists again of either single or a double rotations of nodes. The basis for balanced tree deletion is the ordinary tree deletion algorithm. The easy cases are terminal nodes and nodes with only a single descendant. If the node to be deleted has two subtrees, we will again replace it by the rightmost node of its left subtree. As in the case of insertion, a Boolean variable parameter h is added with the meaning “the height of the subtree has been reduced”. Rebalancing has to be considered only when h is true. h is made true upon finding and deleting a node, or if rebalancing itself reduces the height of a subtree. We now introduce the two (symmetric) balancing operations in the form of procedures, because they have to be invoked from more than one point in the deletion algorithm. Note that balanceL is applied when the left, balanceR after the right branch had been reduced in height. 148 Fig. 4.35. Deletions in balanced tree The operation of the procedure is illustrated in Fig. 4.35. Given the balanced tree (a), successive deletion of the nodes with keys 4, 8, 6, 5, 2, 1, and 7 results in the trees (b) ... (h). Deletion of key 4 is simple in itself, because it represents a terminal node. However, it results in an unbalanced node 3. Its rebalancing operation invoves an LL single rotation. Rebalancing becomes again necessary after the deletion of node 6. This time the right subtree of the root (7) is rebalanced by an RR single rotation. Deletion of node 2, although in itself straightforward since it has only a single descendant, calls for a complicated RL double rotation. The fourth case, an LR double rotation, is finally invoked after the removal of node 7, which at first was replaced by the rightmost element of its left subtree, i.e., by the node with key 3. PROCEDURE balanceL(VAR p: Node; VAR h: BOOLEAN); VAR p1, p2: Node; BEGIN (*h; left branch has shrunk*) IF p.bal = -1 THEN p.bal := 0 ELSIF p.bal = 0 THEN p.bal := 1; h := FALSE 1 9 8 7 10 6 9 11 a) 5 38 2 4 1 8 7 10 6 9 11 b) 5 2 1 3 7 6 10 9 11 c) 5 2 1 3 10 7 11 d) 5 2 1 3 9 10 7 11 e) 3 2 1 10 9 11 f) 7 3 10 9 11 g) 7 3 11 9 h) 10 3 149 ELSE (*bal = 1, rebalance*) p1 := p.right; IF p1.bal >= 0 THEN (*single RR rotation*) p.right := p1.left; p1.left := p; IF p1.bal = 0 THEN p.bal := 1; p1.bal := -1; h := FALSE ELSE p.bal := 0; p1.bal := 0 END ; p := p1 ELSE (*double RL rotation*) p2 := p1.left; p1.left := p2.right; p2.right := p1; p.right := p2.left; p2.left := p; IF p2.bal = +1 THEN p.bal := -1 ELSE p.bal := 0 END ; IF p2.bal = -1 THEN p1.bal := 1 ELSE p1.bal := 0 END ; p := p2; p2.bal := 0 END END END balanceL; PROCEDURE balanceR(VAR p: Node; VAR h: BOOLEAN); VAR p1, p2: Node; BEGIN (*h; right branch has shrunk*) IF p.bal = 1 THEN p.bal := 0 ELSIF p.bal = 0 THEN p.bal := -1; h := FALSE ELSE (*bal = -1, rebalance*) p1 := p.left; IF p1.bal <= 0 THEN (*single LL rotation*) p.left := p1.right; p1.right := p; IF p1.bal = 0 THEN p.bal := -1; p1.bal := 1; h := FALSE ELSE p.bal := 0; p1.bal := 0 END ; p := p1 ELSE (*double LR rotation*) p2 := p1.right; b2 := p2.bal; p1.right := p2.left; p2.left := p1; p.left := p2.right; p2.right := p; IF p2.bal = -1 THEN p.bal := 1 ELSE p.bal := 0 END ; IF p2.bal = +1 THEN p1.bal := -1 ELSE p1.bal := 0 END ; p := p2; p2.bal := 0 END END END balanceR; PROCEDURE delete(x: INTEGER; VAR p: Node; VAR h: BOOLEAN); VAR q: Node; PROCEDURE del(VAR r: Node; VAR h: BOOLEAN); BEGIN (*~h*) IF r.right # NIL THEN del(r.right, h); IF h THEN balanceR(r, h) END ELSE q.key := r.key; q.count := r.count; q := r; r := r.left; h := TRUE END END del; BEGIN (*~h*) IF p = NIL THEN (*key not in tree*) ELSIF p.key > x THEN delete(x, p.left, h); 150 IF h THEN balanceL(p, h) END ELSIF p.key < x THEN delete(x, p.right, h); IF h THEN balanceR(p, h) END ELSE (*delete p^*) q := p; IF q.right = NIL THEN p := q.left; h := TRUE ELSIF q.left = NIL THEN p := q.right; h := TRUE ELSE del(q.left, h); IF h THEN balanceL(p, h) END END END END delete Fortunately, deletion of an element in a balanced tree can also be performed with -- in the worst case -- O(log n) operations. An essential difference between the behaviour of the insertion and deletion procedures must not be overlooked, however. Whereas insertion of a single key may result in at most one rotation (of two or three nodes), deletion may require a rotation at every node along the search path. Consider, for instance, deletion of the rightmost node of a Fibonacci-tree. In this case the deletion of any single node leads to a reduction of the height of the tree; in addition, deletion of its rightmost node requires the maximum number of rotations. This therefore represents the worst choice of node in the worst case of a balanced tree, a rather unlucky combination of chances. How probable are rotations, then, in general? The surprising result of empirical tests is that whereas one rotation is invoked for approximately every two insertions, one is required for every five deletions only. Deletion in balanced trees is therefore about as easy -- or as complicated -- as insertion. 4.6. Optimal Search Trees So far our consideration of organizing search trees has been based on the assumption that the frequency of access is equal for all nodes, that is, that all keys are equally probable to occur as a search argument. This is probably the best assumption if one has no idea of access distribution. However, there are cases (they are the exception rather than the rule) in which information about the probabilities of access to individual keys is available. These cases usually have the characteristic that the keys always remain the same, i.e., the search tree is subjected neither to insertion nor deletion, but retains a constant structure. A typical example is the scanner of a compiler which determines for each word (identifier) whether or not it is a keyword (reserved word). Statistical measurements over hundreds of compiled programs may in this case yield accurate information on the relative frequencies of occurrence, and thereby of access, of individual keys. Assume that in a search tree the probability with which node i is accessed is Pr {x = k i} = p i, (Si: 1 ≤ i ≤ n : p i) = 1 We now wish to organize the search tree in a way that the total number of search steps -- counted over sufficiently many trials -- becomes minimal. For this purpose the definition of path length is modified by (1) attributing a certain weight to each node and by (2) assuming the root to be at level 1 (instead of 0), because it accounts for the first comparison along the search path. Nodes that are frequently accessed become heavy nodes; those that are rarely visited become light nodes. The (internal) weighted path length is then the sum of all paths from the root to each node weighted by that node's probability of access. P = Si: 1 ≤ i ≤ n : p i*h i hi is the level of node i. The goal is now to minimize the weighted path length for a given probability distribution. As an example, consider the set of keys 1, 2, 3, with probabilities of access p 1 = 1/7, p 2 = 2/7, and p 3 = 4/7. These three keys can be arranged in five different ways as search trees (see Fig. 4.36). 151 Fig. 4.36. The search trees with 3 nodes The weighted path lengths of trees (a) to (e) are computed according to their definition as P(a) = 11/7, P(b) = 12/7, P(c) = 12/7, P(d) = 15/7, P(e) = 17/7 Hence, in this example, not the perfectly balanced tree (c), but the degenerate tree (a) turns out to be optimal. The example of the compiler scanner immediately suggests that this problem should be viewed under a slightly more general condition: words occurring in the source text are not always keywords; as a matter of fact, their being keywords is rather the exception. Finding that a given word k is not a key in the search tree can be considered as an access to a hypothetical "special node" inserted between the next lower and next higher key (see Fig. 4.19) with an associated external path length. If the probability qi of a search argument x lying between the two keys ki and ki+1 is also known, this information may considerably change the structure of the optimal search tree. Hence, we generalize the problem by also considering unsuccessful searches. The overall average weighted path length is now P = ( Si: 1 ≤ i ≤ n : p i*h i) + ( Si: 1 ≤ i ≤ m : q i*h' i) where (Si: 1 ≤ i ≤ n : p i) + ( Si: 1 ≤ i ≤m : q i) = 1. and where, h i is the level of the (internal) node i and h' j is the level of the external node j. The average weighted path length may be called the cost of the search tree, since it represents a measure for the expected amount of effort to be spent for searching. The search tree that requires the minimal cost among all trees with a given set of keys k i and probabilities p i and q i is called the optimal tree . 3 2 a) 1 3 1 b) 2 2 1 c) 3 1 2 d) 3 1 3 e) 2 152 Fig. 4.37. Search tree with associated access frequencies For finding the optimal tree, there is no need to require that the p's and q's sum up to 1. In fact, these probabilities are commonly determined by experiments in which the accesses to nodes are counted. Instead of using the probabilities pi and qj, we will subsequently use such frequency counts and denote them by ai = number of times the search argument x equals k i bj = number of times the search argument x lies between k j and k j+1 By convention, b 0 is the number of times that x is less than k 1, and b n is the frequency of x being greater than k n (see Fig. 4.37). We will subsequently use P to denote the accumulated weighted path length instead of the average path length: P = ( Si: 1 ≤ i ≤ n : a i*h i) + ( Si: 1 ≤ i ≤ m : b i*h' i) Thus, apart from avoiding the computation of the probabilities from measured frequency counts, we gain the further advantage of being able to use integers instead of fractions in our search for the optimal tree. Considering the fact that the number of possible configurations of n nodes grows exponentially with n, the task of finding the optimum seems rather hopeless for large n. Optimal trees, however, have one significant property that helps to find them: all their subtrees are optimal too. For instance, if the tree in Fig. 4.37 is optimal, then the subtree with keys k 3 and k 4 is also optimal as shown. This property suggests an algorithm that systematically finds larger and larger trees, starting with individual nodes as smallest possible subtrees. The tree thus grows from the leaves to the root, which is, since we are used to drawing trees upside-down, the bottom-up direction [4-6]. The equation that is the key to this algorithm is derived as follows: Let P be the weighted path length of a tree, and let P L and P R be those of the left and right subtrees of its root. Clearly, P is the sum of P L and P R, and the number of times a search travels on the leg to the root, which is simply the total number W of search trials. We call W the weight of the tree. Its average path length is then P/W. P = P L + W + P R W = ( Si: 1 ≤ i ≤ n : a i) + ( Si: 1 ≤ i ≤ m : b i) These considerations show the need for a denotation of the weights and the path lengths of any subtree consisting of a number of adjacent keys. Let T ij be the optimal subtree consisting of nodes with keys k i+1 , ki+2 , ... , k j. Then let wij denote the weight and let p ij denote the path length of T ij . Clearly P = p 0,n and W = w0,n . These quantities are defined by the following recurrence relations: k2|a 2 k1|a 1 k4|a 4 k3|a 3 b1 b0 b3 b2 b4 153 wii = b i (0 ≤ i ≤ n) wij = w i, j-1 + a j + b j (0 ≤ i < j ≤ n) pii = w ii (0 ≤ i ≤ n) pij = w ij + MIN k: i < k ≤ j : (p i,k-1 + p kj ) (0 ≤ i < k < j ≤ n) The last equation follows immediately from the definitions of P and of optimality. Since there are approximately n 2/2 values p ij , and because its definition calls for a choice among all cases such that 0 < j-i ≤ n, the minimization operation will involve approximately n 3/6 operations. Knuth pointed out that a factor n can be saved by the following consideration, which alone makes this algorithm usable for practical purposes. Let r ij be a value of k which achieves the minimum for p ij. It is possible to limit the search for r ij to a much smaller interval, i.e., to reduce the number of the j-i evaluation steps. The key is the observation that if we have found the root rij of the optimal subtree T ij , then neither extending the tree by adding a node at the right, nor shrinking the tree by removing its leftmost node ever can cause the optimal root to move to the left. This is expressed by the relation ri,j-1 ≤ r ij ≤ r i+1,j which limits the search for possible solutions for r ij to the range r i,j-1 ... r i+1,j . This results in a total number of elementary steps in the order of n 2. We are now ready to construct the optimization algorithm in detail. We recall the following definitions, which are based on optimal trees T ij consisting of nodes with keys k i+1 ... k j. 1. a i: the frequency of a search for k i. 2. b j: the frequency of a search argument x between k j and k j+1 . 3. w ij : the weight of T ij . 4. p ij : the weighted path length of T ij . 5. r ij : the index of the root of T ij . We declare the following arrays: a: ARRAY n+1 OF INTEGER; (*a[0] not used*) b: ARRAY n+1 OF INTEGER; p,w,r: ARRAY n+1, n+1 OF INTEGER; Assume that the weights w ij have been computed from a and b in a straightforward way. Now consider w as the argument of the procedure OptTree to be developed and consider r as its result, because r describes the tree structure completely. p may be considered an intermediate result. Starting out by considering the smallest possible subtrees, namely those consisting of no nodes at all, we proceed to larger and larger trees. Let us denote the width j-i of the subtree Tij by h. Then we can trivially determine the values pii for all trees with h = 0 according to the definition of p ij . FOR i := 0 TO n DO p[i,i] := b[i] END In the case h = 1 we deal with trees consisting of a single node, which plainly is also the root (see Fig. 4.38). FOR i := 0 TO n-1 DO j := i+1; p[i,j] := w[i,j] + p[i,i] + p[j,j]; r[i,j] := j END 154 Fig. 4.38. Optimal search tree with single node Note that i denotes the left index limit and j the right index limit in the considered tree T ij . For the cases h > 1 we use a repetitive statement with h ranging from 2 to n, the case h = n spanning the entire tree T 0,n . In each case the minimal path length p ij and the associated root index r ij are determined by a simple repetitive statement with an index k ranging over the interval given for r ij . FOR h := 2 TO n DO FOR i := 0 TO n-h DO j := i+h; find k and min = MIN k: i < k < j : (p i,k-1 + p kj) such that r i,j-1 < k < r i+1,j ; p[i,j] := min + w[i,j]; r[i,j] := k END END The details of the refinement of the statement in italics can be found in Program 4.6. The average path length of T 0,n is now given by the quotient p 0,n /w 0,n , and its root is the node with index r 0,n . Let us now describe the structure of the program to be designed. Its two main components are the procedures to find the optimal search tree, given a weight distribution w, and to display the tree given the indices r. First, the counts a and b and the keys are read from an input source. The keys are actually not involved in the computation of the tree structure; they are merely used in the subsequent display of the tree. After printing the frequency statistics, the program proceeds to compute the path length of the perfectly balanced tree, in passing also determining the roots of its subtrees. Thereafter, the average weighted path length is printed and the tree is displayed. In the third part, procedure OptTree is activated in order to compute the optimal search tree; thereafter, the tree is displayed. And finally, the same procedures are used to compute and display the optimal tree considering the key frequencies only, ignoring the frequencies of non-keys. To summarize, the following are the global constants and variables: CONST N = 100; (*max no. of keywords*) WordLen = 16; (*max keyword length*) VAR key: ARRAY N+1, WordLen OF CHAR; a, b: ARRAY N+1 OF INTEGER; p, w, r: ARRAY N+1, N+1 OF INTEGER; PROCEDURE BalTree(i, j: INTEGER): INTEGER; VAR k: INTEGER; BEGIN k := (i+j+1) DIV 2; r[i, j] := k; IF i >= j THEN RETURN 0 ELSE RETURN BalTree(i, k-1) + BalTree(k, j) + w[i, j] END END BalTree; kj|a j bj-1 bj wj-1, j -1 wj-1, j 155 PROCEDURE ComputeOptTree(n: INTEGER); VAR x, min, tmp: INTEGER; i, j, k, h, m: INTEGER; BEGIN (*argument: W, results: p, r*) FOR i := 0 TO n DO p[i, i] := 0 END ; FOR i := 0 TO n-1 DO j := i+1; p[i, j] := w[i, j]; r[i, j] := j END ; FOR h := 2 TO n DO FOR i := 0 TO n-h DO j := i+h; m := r[i, j-1]; min := p[i, m-1] + p[m, j]; FOR k := m+1 TO r[i+1, j] DO tmp := p[i, k-1]; x := p[k, j] + tmp; IF x < min THEN m := k; min := x END END ; p[i, j] := min + w[i, j]; r[i, j] := m END END END ComputeOptTree; PROCEDURE WriteTree(i, j, level: INTEGER); VAR k: INTEGER; (*uses global writer W*) BEGIN IF i < j THEN WriteTree(i, r[i, j]-1, level+1); FOR k := 1 TO level DO Texts.Write(W, TAB) END ; Texts.WriteString(W, key[r[i, j]]); Texts.WriteLn(W); WriteTree(r[i, j], j, level+1) END END WriteTree; PROCEDURE Find(VAR S: Texts.Scanner); VAR i, j, n: INTEGER; (*uses global writer W*) BEGIN Texts.Scan(S); b[0] := SHORT(S.i); n := 0; Texts.Scan(S); (*input a, key, b*) WHILE S.class = Texts.Int DO INC(n); a[n] := SHORT(S.i); Texts.Scan(S); COPY(S.s, key[n]); Texts.Scan(S); b[n] := SHORT(S.i); Texts.Scan(S) END ; (*compute w from a and b*) FOR i := 0 TO n DO w[i, i] := b[i]; FOR j := i+1 TO n DO w[i, j] := w[i, j-1] + a[j] + b[j] END END ; Texts.WriteString(W, "Total weight = "); Texts.WriteInt(W, w[0, n], 6); Texts.WriteLn(W); Texts.WriteString(W, "Pathlength of balanced tree = "); Texts.WriteInt(W, BalTree(0, n), 6); Texts.WriteLn(W); WriteTree(0, n, 0); Texts.WriteLn(W); ComputeOptTree(n); Texts.WriteString(W, "Pathlength of optimal tree = "); Texts.WriteInt(W, p[0, n], 6); Texts.WriteLn(W); WriteTree(0, n, 0); Texts.WriteLn(W); FOR i := 0 TO n DO w[i, i] := 0; 156 FOR j := i+1 TO n DO w[i, j] := w[i, j-1] + a[j] END END ; ComputeOptTree(n); Texts.WriteString(W, "optimal tree not considering b"); Texts.WriteLn(W); WriteTree(0, n, 0); Texts.WriteLn(W) END Find; As an example, let us consider the following input data of a tree with 3 keys: 20 1 Albert 10 2 Ernst 1 5 Peter 1 b0 = 20 a1 = 1 key 1 = Albert b1 = 10 a2 = 2 key 2 = Ernst b2 = 1 a3 = 4 key 3 = Peter b3 = 1 The results of procedure Find are shown in Fig. 4.40 and demonstrate that the structures obtained for the three cases may differ significantly. The total weight is 40, the pathlength of the balanced tree is 78, and that of the optimal tree is 66. Fig. 4.40. The 3 trees generated by the Optimal Tree procedure (NEW FIGURE!) It is evident from this algorithm that the effort to determine the optimal structure is of the order of n 2; also, the amount of required storage is of the order n 2. This is unacceptable if n is very large. Algorithms with greater efficiency are therefore highly desirable. One of them is the algorithm developed by Hu and Tucker [4-5] which requires only O(n) storage and O(n*log(n)) computations. However, it considers only the case in which the key frequencies are zero, i.e., where only the unsuccessful search trials are registered. Another algorithm, also requiring O(n) storage elements and O(n*log(n)) computations was described by Walker and Gotlieb [4-7]. Instead of trying to find the optimum, this algorithm merely promises to yield a nearly optimal tree. It can therefore be based on heuristic principles. The basic idea is the following. Consider the nodes (genuine and special nodes) being distributed on a linear scale, weighted by their frequencies (or probabilities) of access. Then find the node which is closest to the center of gravity. This node is called the centroid , and its index is ( Si: 1 ≤ i ≤ n : i*a i) + ( Si: 1 ≤ i ≤ m : i*b i) / W rounded to the nearest integer. If all nodes have equal weight, then the root of the desired optimal tree evidently coincides with the centroid Otherwise -- so the reasoning goes -- it will in most cases be in the close neighborhood of the centroid. A limited search is then used to find the local optimum, whereafter this procedure is applied to the resulting two subtrees. The likelihood of the root lying very close to the centroid grows with the size n of the tree. As soon as the subtrees have reached a manageable size, their optimum can be determined by the above exact algorithm. 4.7. B-Trees So far, we have restricted our discussion to trees in which every node has at most two descendants, i.e., to binary trees. This is entirely satisfactory if, for instance, we wish to represent family relationships with a preference to the pedigree view, in which every person is associated with his parents. After all, no one has more than two parents. But what about someone who prefers the posterity view? He has to cope with the balanced tree Ernst Albert Peter optimal tree Albert Ernst Peter not considering key misses Peter Ernst Albert 157 fact that some people have more than two children, and his trees will contain nodes with many branches. For lack of a better term, we shall call them multiway trees. Of course, there is nothing special about such structures, and we have already encountered all the programming and data definition facilities to cope with such situations. If, for instance, an absolute upper limit on the number of children is given (which is admittedly a somewhat futuristic assumption), then one may represent the children as an array component of the record representing a person. If the number of children varies strongly among different persons, however, this may result in a poor utilization of available storage. In this case it will be much more appropriate to arrange the offspring as a linear list, with a pointer to the youngest (or eldest) offspring assigned to the parent. A possible type definition for this case is the following, and a possible data structure is shown in Fig. 4.43. TYPE Person = POINTER TO RECORD name: alfa; sibling, offspring: Person END Fig. 4.43. Multiway tree represented as binary tree We now realize that by tilting this picture by 45 degrees it will look like a perfect binary tree. But this view is misleading because functionally the two references have entirely different meanings. One usually dosen't treat a sibling as an offspring and get away unpunished, and hence one should not do so even in constructing data definitions. This example could also be easily extended into an even more complicated data structure by introducing more components in each person's record, thus being able to represent further family relationships. A likely candidate that cannot generally be derived from the sibling and offspring references is that of husband and wife, or even the inverse relationship of father and mother. Such a structure quickly grows into a complex relational data bank, and it may be possible to map serveral trees into it. The algorithms operating on such structures are intimately tied to their data definitions, and it does not make sense to specify any general rules or widely applicable techniques. However, there is a very practical area of application of multiway trees which is of general interest. This is the construction and maintenance of large-scale search trees in which insertions and deletions are necessary, but in which the primary store of a computer is not large enough or is too costly to be used for long-time storage. Assume, then, that the nodes of a tree are to be stored on a secondary storage medium such as a disk store. Dynamic data structures introduced in this chapter are particularly suitable for incorporation of secondary storage media. The principal innovation is merely that pointers are represented by disk store addresses instead of main store addresses. Using a binary tree for a data set of, say, a million items, requires on the average approximately log 10 6 (i.e. about 20) search steps. Since each step now involves a disk access (with inherent latency time), a storage organization using fewer accesses will be highly desirable. The multiway tree is a perfect solution to this problem. If an item located on a secondary store is accessed, an entire group of items may also be accessed without much additional cost. This suggests that a tree be subdivided into subtrees, and that the subtrees are represented as units that are accessed all together. We shall call these subtrees pages . Figure 4.44 shows a binary tree subdivided into pages, each page consisting of 7 nodes. JOHN ALBERT MARY ROBERT PETER CAROL CHRIS TINA PAUL GEORGE PAMELA 158 Fig. 4.44. Binary tree subdivided into pages The saving in the number of disk accesses -- each page access now involves a disk access -- can be considerable. Assume that we choose to place 100 nodes on a page (this is a reasonable figure); then the million item search tree will on the average require only log 100 (10 6) (i.e. about 3) page accesses instead of 20. But, of course, if the tree is left to grow at random, then the worst case may still be as large as 104. It is plain that a scheme for controlled growth is almost mandatory in the case of multiway trees. 4.7.1. Multiway B-Trees If one is looking for a controlled growth criterion, the one requiring a perfect balance is quickly eliminated because it involves too much balancing overhead. The rules must clearly be somewhat relaxed. A very sensible criterion was postulated by R. Bayer and E.M. McCreight [4.2] in 1970: every page (except one) contains between n and 2n nodes for a given constant n. Hence, in a tree with N items and a maximum page size of 2n nodes per page, the worst case requires logn N page accesses; and page accesses clearly dominate the entire search effort. Moreover, the important factor of store utilization is at least 50% since pages are always at least half full. With all these advantages, the scheme involves comparatively simple algorithms for search, insertion, and deletion. We will subsequently study them in detail. The underlying data structures are called B-trees, and have the following characteristics; n is said to be the order of the B-tree. 1. Every page contains at most 2n items (keys.) 2. Every page, except the root page, contains at least n items. 3. Every page is either a leaf page, i.e. has no descendants, or it has m+1 descendants, where m is its number of keys on this page. 4. All leaf pages appear at the same level. Fig. 4.45. B-tree of order 2 Figure 4.45 shows a B-tree of order 2 with 3 levels. All pages contain 2, 3, or 4 items; the exception is the root which is allowed to contain a single item only. All leaf pages appear at level 3. The keys appear in increasing order from left to right if the B-tree is squeezed into a single level by inserting the descendants in between the keys of their ancestor page. This arrangement represents a natural extension of binary 25 10 20 2 5 7 8 13 14 15 18 22 24 26 27 28 32 35 38 41 42 45 46 30 40 159 search trees, and it determines the method of searching an item with given key. Consider a page of the form shown in Fig. 4.46 and a given search argument x. Assuming that the page has been moved into the primary store, we may use conventional search methods among the keys k 1 ... k m. If m is sufficiently large, one may use binary search; if it is rather small, an ordinary sequential search will do. (Note that the time required for a search in main store is probably negligible compared to the time it takes to move the page from secondary into primary store.) If the search is unsuccessful, we are in one of the following situations: 1. k i < x < k i+1 , for 1 < i < m The search continues on page p i^ 2. k m < x The search continues on page p m^. 3. x < k 1 The search continues on page p 0^. Fig. 4.46. B-tree page with m keys If in some case the designated pointer is NIL, i.e., if there is no descendant page, then there is no item with key x in the whole tree, and the search is terminated. Surprisingly, insertion in a B-tree is comparatively simple too. If an item is to be inserted in a page with m < 2n items, the insertion process remains constrained to that page. It is only insertion into an already full page that has consequences upon the tree structure and may cause the allocation of new pages. To understand what happens in this case, refer to Fig. 4.47, which illustrates the insertion of key 22 in a B- tree of order 2. It proceeds in the following steps: 1. Key 22 is found to be missing; insertion in page C is impossible because C is already full. 2. Page C is split into two pages (i.e., a new page D is allocated). 3. The 2n+1 keys are equally distributed onto C and D, and the middle key is moved up one level into the ancestor page A. Fig. 4.47. Insertion of key 22 in B-tree This very elegant scheme preserves all the characteristic properties of B-trees. In particular, the split pages contain exactly n items. Of course, the insertion of an item in the ancestor page may again cause that page to overflow, thereby causing the splitting to propagate. In the extreme case it may propagate up to the root. This is, in fact, the only way that the B-tree may increase its height. The B-tree has thus a strange manner of growing: it grows from its leaves upward to the root. We shall now develop a detailed program from these sketchy descriptions. It is already apparent that a recursive formulation will be most convenient because of the property of the splitting process to propagate back along the search path. The general structure of the program will therefore be similar to balanced tree insertion, although the details are different. First of all, a definition of the page structure has to be formulated. We choose to represent the items in the form of an array. TYPE Page = POINTER TO PageDescriptor; Item = RECORD key: INTEGER; p: Page; 20 7 10 15 18 26 30 35 40 A B C 20 30 7 10 15 18 35 40 A B D 22 26 C p0 k 1 p 1 k 2 p 2 . . . p m-1 k m p m 160 count: INTEGER (*data*) END ; PageDescriptor = RECORD m: INTEGER; (* 0 .. 2n *) p0: Page; e: ARRAY 2*n OF Item END Again, the item component count stands for all kinds of other information that may be associated with each item, but it plays no role in the actual search process. Note that each page offers space for 2n items. The field m indicates the actual number of items on the page. As m ≥ n (except for the root page), a storage utilization of a least 50% is guaranteed. The algorithm of B-tree search and insertion is formulated below as a procedure called search . Its main structure is straightforward and similar to that for the balanced binary tree search, with the exception that the branching decision is not a binary choice. Instead, the “within-page search” is represented as a binary search on the array e of elements. The insertion algorithm is formulated as a separate procedure merely for clarity. It is activated after search has indicated that an item is to be passed up on the tree (in the direction toward the root). This fact is indicated by the Boolean result parameter h; it assumes a similar role as in the algorithm for balanced tree insertion, where h indicates that the subtree had grown. If h is true, the second result parameter, u, represents the item being passed up. Note that insertions start in hypothetical pages, namely, the "special nodes" of Fig. 4.19; the new item is immediately handed up via the parameter u to the leaf page for actual insertion. The scheme is sketched here: PROCEDURE search(x: INTEGER; a: Page; VAR h: BOOLEAN; VAR u: Item); BEGIN IF a = NIL THEN (*x not in tree, insert*) Assign x to item u, set h to TRUE, indicating that an item u is passed up in the tree ELSE binary search for x in array a.e; IF found THEN process data ELSE search(x, descendant, h, u); IF h THEN (*an item was passed up*) IF no. of items on page a^ < 2n THEN insert u on page a^ and set h to FALSE ELSE split page and pass middle item up END END END END END search If the paramerter h is true after the call of search in the main program, a split of the root page is requested. Since the root page plays an exceptional role, this process has to be programmed separately. It consists merely of the allocation of a new root page and the insertion of the single item given by the paramerter u. As a consequence, the new root page contains a single item only. The details can be gathered from Program 4.7, and Fig. 4.48 shows the result of using Program 4.7 to construct a B-tree with the following insertion sequence of keys: 20; 40 10 30 15; 35 7 26 18 22; 5; 42 13 46 27 8 32; 38 24 45 25; The semicolons designate the positions of the snapshots taken upon each page allocation. Insertion of the last key causes two splits and the allocation of three new pages. 161 Fig. 4.48. Growth of B-tree of order 2 Since each activation of search implies one page transfer to main store, k = log n(N) recursive calls are necessary at most, if the tree contains N items. Hence, we must be capable of accommodating k pages in main store. This is one limiting factor on the page size 2n. In fact, we need to accommodate even more than k pages, because insertion may cause page splitting to occur. A corollary is that the root page is best allocated permanently in the primary store, because each query proceeds necessarily through the root page. Another positive quality of the B-tree organization is its suitability and economy in the case of purely sequential updating of the entire data base. Every page is fetched into primary store exactly once. Deletion of items from a B-tree is fairly straight-forward in principle, but it is complicated in the details. We may distinguish two different circumstances: 1. The item to be deleted is on a leaf page; here its removal algorithm is plain and simple. 2. The item is not on a leaf page; it must be replaced by one of the two lexicographically adjacent items, which happen to be on leaf pages and can easily be deleted. In case 2 finding the adjacent key is analogous to finding the one used in binary tree deletion. We descend along the rightmost pointers down to the leaf page P, replace the item to be deleted by the rightmost item on P, and then reduce the size of P by 1. In any case, reduction of size must be followed by a check of the number of items m on the reduced page, because, if m < n, the primary characteristic of B-trees would be violated. Some additional action has to be taken; this underflow condition is indicated by the Boolean variable parameter h. The only recourse is to borrow or annect an item from one of the neighboring pages, say from Q. Since this involves fetching page Q into main store -- a relatively costly operation -- one is tempted to make the best of this undesirable situation and to annect more than a single item at once. The usual strategy is to distribute the items on pages P and Q evenly on both pages. This is called page balancing . 25 10 20 5 7 8 13 15 18 22 24 26 27 32 35 38 42 45 46 30 40 f) 10 20 30 40 5 7 8 13 15 18 22 26 27 32 35 42 46 e) 10 20 30 5 7 15 18 22 26 35 40 d) 20 30 7 10 15 18 22 26 35 40 c) 20 10 15 30 40 b) 20 a) 162 Of course, it may happen that there is no item left to be annected since Q has already reached its minimal size n. In this case the total number of items on pages P and Q is 2n-1; we may merge the two pages into one, adding the middle item from the ancestor page of P and Q, and then entirely dispose of page Q. This is exactly the inverse process of page splitting. The process may be visualized by considering the deletion of key 22 in Fig. 4.47. Once again, the removal of the middle key in the ancestor page may cause its size to drop below the permissible limit n, thereby requiring that further special action (either balancing or merging) be undertaken at the next level. In the extreme case page merging may propagate all the way up to the root. If the root is reduced to size 0, it is itself deleted, thereby causing a reduction in the height of the B-tree. This is, in fact, the only way that a B-tree may shrink in height. Figure 4.49 shows the gradual decay of the B-tree of Fig. 4.48 upon the sequential deletion of the keys 25 45 24; 38 32; 8 27 46 13 42; 5 22 18 26; 7 35 15; The semicolons again mark the places where the snapshots are taken, namely where pages are being eliminated. The similarity of its structure to that of balanced tree deletion is particularly noteworthy. Fig. 4.49. Decay of B-tree of order 2 TYPE Page = POINTER TO PageRec; Entry = RECORD key: INTEGER; p: Page END ; PageRec = RECORD m: INTEGER; (*no. of entries on page*) p0: Page; e: ARRAY 2*N OF Entry END ; VAR root: Page; W: Texts.Writer; 25 10 20 5 7 8 13 15 18 22 24 26 27 32 35 38 42 45 46 30 40 a) 10 22 30 40 5 7 8 1 3 15 18 20 26 27 32 35 38 42 46 b) 10 22 30 5 7 8 13 15 18 20 26 27 35 40 42 46 c) 10 22 5 7 15 18 20 26 30 35 40 d) 15 7 10 20 30 35 40 e) 10 20 30 40 f) 163 PROCEDURE search(x: INTEGER; VAR p: Page; VAR k: INTEGER); VAR i, L, R: INTEGER; found: BOOLEAN; a: Page; BEGIN a := root; found := FALSE; WHILE (a # NIL) & ~found DO L := 0; R := a.m; (*binary search*) WHILE L < R DO i := (L+R) DIV 2; IF x <= a.e[i].key THEN R := i ELSE L := i+1 END END ; IF (R < a.m) & (a.e[R].key = x) THEN found := TRUE ELSIF R = 0 THEN a := a.p0 ELSE a := a.e[R-1].p END END ; p := a; k := R END search; PROCEDURE insert(x: INTEGER; a: Page; VAR h: BOOLEAN; VAR v: Entry); (*a # NIL. Search key x in B-tree with root a; insert new item with key x. If an entry is to be passed up, assign it to v. h := "tree has become higher"*) VAR i, L, R: INTEGER; b: Page; u: Entry; BEGIN (* ~h *) IF a = NIL THEN v.key := x; v.p := NIL; h := TRUE ELSE L := 0; R := a.m; (*binary search*) WHILE L < R DO i := (L+R) DIV 2; IF x <= a.e[i].key THEN R := i ELSE L := i+1 END END ; IF (R < a.m) & (a.e[R].key = x) THEN (*found, do nothing*) ELSE (*item not on this page*) IF R = 0 THEN b := a.p0 ELSE b := a.e[R-1].p END ; insert(x, b, h, u); IF h THEN (*insert u to the left of a.e[R]*) IF a.m < 2*N THEN h := FALSE; FOR i := a.m TO R+1 BY -1 DO a.e[i] := a.e[i-1] END ; a.e[R] := u; INC(a.m) ELSE NEW(b); (*overflow; split a into a,b and assign the middle entry to v*) IF R < N THEN (*insert in left page a*) v := a.e[N-1]; FOR i := N-1 TO R+1 BY -1 DO a.e[i] := a.e[i-1] END ; a.e[R] := u; FOR i := 0 TO N-1 DO b.e[i] := a.e[i+N] END ELSE (*insert in right page b*) DEC(R, N); IF R = 0 THEN v := u ELSE v := a.e[N]; FOR i := 0 TO R-2 DO b.e[i] := a.e[i+N+1] END ; b.e[R-1] := u END ; FOR i := R TO N-1 DO b.e[i] := a.e[i+N] END END ; a.m := N; b.m := N; b.p0 := v.p; v.p := b END END 164 END END END insert; PROCEDURE underflow(c, a: Page; s: INTEGER; VAR h: BOOLEAN); (*a = underflowing page, c = ancestor page, s = index of deleted entry in c*) VAR b: Page; i, k: INTEGER; BEGIN (*h & (a.m = N-1) & (c.e[s-1].p = a) *) IF s < c.m THEN (*b := page to the right of a*) b := c.e[s].p; k := (b.m-N+1) DIV 2; (*k = nof items available on page b*) a.e[N-1] := c.e[s]; a.e[N-1].p := b.p0; IF k > 0 THEN (*balance by moving k-1 items from b to a*) FOR i := 0 TO k-2 DO a.e[i+N] := b.e[i] END ; c.e[s] := b.e[k-1]; b.p0 := c.e[s].p; c.e[s].p := b; DEC(b.m, k); FOR i := 0 TO b.m-1 DO b.e[i] := b.e[i+k] END ; a.m := N-1+k; h := FALSE ELSE (*merge pages a and b, discard b*) FOR i := 0 TO N-1 DO a.e[i+N] := b.e[i] END ; DEC(c.m); FOR i := s TO c.m-1 DO c.e[i] := c.e[i+1] END ; a.m := 2*N; h := c.m < N END ELSE (*b := page to the left of a*) DEC(s); IF s = 0 THEN b := c.p0 ELSE b := c.e[s-1].p END ; k := (b.m-N+1) DIV 2; (*k = nof items available on page b*) IF k > 0 THEN FOR i := N-2 TO 0 BY -1 DO a.e[i+k] := a.e[i] END ; a.e[k-1] := c.e[s]; a.e[k-1].p := a.p0; (*move k-1 items from b to a, one to c*) DEC(b.m, k); FOR i := k-2 TO 0 BY -1 DO a.e[i] := b.e[i+b.m+1] END ; c.e[s] := b.e[b.m]; a.p0 := c.e[s].p; c.e[s].p := a; a.m := N-1+k; h := FALSE ELSE (*merge pages a and b, discard a*) c.e[s].p := a.p0; b.e[N] := c.e[s]; FOR i := 0 TO N-2 DO b.e[i+N+1] := a.e[i] END ; b.m := 2*N; DEC(c.m); h := c.m < N END END END underflow; PROCEDURE delete(x: INTEGER; a: Page; VAR h: BOOLEAN); (*search and delete key x in B-tree a; if a page underflow arises, balance with adjacent page or merge; h := "page a is undersize"*) VAR i, L, R: INTEGER; q: Page; PROCEDURE del(p: Page; VAR h: BOOLEAN); VAR k: INTEGER; q: Page; (*global a, R*) BEGIN k := p.m-1; q := p.e[k].p; IF q # NIL THEN del(q, h); IF h THEN underflow(p, q, p.m, h) END ELSE p.e[k].p := a.e[R].p; a.e[R] := p.e[k]; DEC(p.m); h := p.m < N END END del; 165 BEGIN IF a # NIL THEN L := 0; R := a.m; (*binary search*) WHILE L < R DO i := (L+R) DIV 2; IF x <= a.e[i].key THEN R := i ELSE L := i+1 END END ; IF R = 0 THEN q := a.p0 ELSE q := a.e[R-1].p END ; IF (R < a.m) & (a.e[R].key = x) THEN (*found*) IF q = NIL THEN (*a is leaf page*) DEC(a.m); h := a.m < N; FOR i := R TO a.m-1 DO a.e[i] := a.e[i+1] END ELSE del(q, h); IF h THEN underflow(a, q, R, h) END END ELSE delete(x, q, h); IF h THEN underflow(a, q, R, h) END END END END delete; PROCEDURE ShowTree(VAR W: Texts.Writer; p: Page; level: INTEGER); VAR i: INTEGER; BEGIN IF p # NIL THEN FOR i := 1 TO level DO Texts.Write(W, 9X) END ; FOR i := 0 TO p.m-1 DO Texts.WriteInt(W, p.e[i].key, 4) END ; Texts.WriteLn(W); IF p.m > 0 THEN ShowTree(p.p0, level+1) END ; FOR i := 0 TO p.m-1 DO ShowTree(p.e[i].p, level+1) END END END ShowTree; Extensive analysis of B-tree performance has been undertaken and is reported in the referenced article (Bayer and McCreight). In particular, it includes a treatment of the question of optimal page size, which strongly depends on the characteristics of the storage and computing system available. Variations of the B-tree scheme are discussed in Knuth, Vol. 3, pp. 476-479. The one notable observation is that page splitting should be delayed in the same way that page merging is delayed, by first attempting to balance neighboring pages. Apart from this, the suggested improvements seem to yield marginal gains. A comprehensive survey of B-trees may be found in [4-8]. 4.7.2. Binary B-Trees The species of B-trees that seems to be least interesting is the first order B-tree (n = 1). But sometimes it is worthwhile to pay attention to the exceptional case. It is plain, however, that first-order B-trees are not useful in representing large, ordered, indexed data sets invoving secondary stores; approximately 50% of all pages will contain a single item only. Therefore, we shall forget secondary stores and again consider the problem of search trees involving a one-level store only. A binary B-tree (BB-tree) consists of nodes (pages) with either one or two items. Hence, a page contains either two or three pointers to descendants; this suggested the term 2-3 tree . According to the definition of B-trees, all leaf pages appear at the same level, and all non-leaf pages of BB-trees have either two or three descendants (including the root). Since we now are dealing with primary store only, an optimal economy of storage space is mandatory, and the representation of the items inside a node in the form of an array appears unsuitable. An alternative is the dynamic, linked allocation; that is, inside each node there exists a linked list of items of length 1 or 2. Since each node has at most three descendants and thus needs to harbor only up to three pointers, one is tempted to combine the pointers for descendants and pointers in 166 the item list as shown in Fig. 4.50. The B-tree node thereby loses its actual identity, and the items assume the role of nodes in a regular binary tree. It remains necessary, however, to distinguish between pointers to descendants (vertical) and pointers to siblings on the same page (horizontal). Since only the pointers to the right may be horizontal, a single bit is sufficient to record this distiction. We therefore introduce the Boolean field h with the meaning horizontal . The definition of a tree node based on this representation is given below. It was suggested and investigated by R. Bayer [4-3] in 1971 and represents a search tree organization guaranteeing p = 2*log(N) as maximum path length. TYPE Node = POINTER TO RECORD key: INTEGER; ........... left, right: Node; h: BOOLEAN (*right branch horizontal*) END Fig. 4.50. Representation of BB-tree nodes Considering the problem of key insertion, one must distinguish four possible situations that arise from growth of the left or right subtrees. The four cases are illustrated in Fig. 4.51. Remember that B-trees have the characteristic of growing from the bottom toward the root and that the property of all leafs being at the same level must be maintained. The simplest case (1) is when the right subtree of a node A grows and when A is the only key on its (hypothetical) page. Then, the descendant B merely becomes the sibling of A, i.e., the vertical pointer becomes a horizontal pointer. This simple raising of the right arm is not possible if A already has a sibling. Then we would obtain a page with 3 nodes, and we have to split it (case 2). Its middle node B is passed up to the next higher level. Now assume that the left subtree of a node B has grown in height. If B is again alone on a page (case 3), i.e., its right pointer refers to a descendant, then the left subtree (A) is allowed to become B's sibling. (A simple rotation of pointers is necessary since the left pointer cannot be horizontal). If, however, B already has a sibling, the raising of A yields a page with three members, requiring a split. This split is realized in a very straightforward manner: C becomes a descendant of B, which is raised to the next higher level (case 4). a b c a b 167 Fig. 4.51. Node insertion in BB-tree It should be noted that upon searching a key, it makes no effective difference whether we proceed along a horizontal or a vertical pointer. It therefore appears artificial to worry about a left pointer in case 3 becoming horizontal, although its page still contains not more than two members. Indeed, the insertion algorithm reveals a strange asymmetry in handling the growth of left and right subtrees, and it lets the BB-tree organization appear rather artificial. There is no proof of strangeness of this organization; yet a healthy intuition tells us that something is fishy, and that we should remove this asymmetry. It leads to the notion of the symmetric binary B-tree (SBB-tree) which was also investigated by Bayer [4-4] in 1972. On the average it leads to slightly more efficient search trees, but the algorithms for insertion and deletion are also slightly more complex. Furthermore, each node now requires two bits (Boolean variable lh and rh) to indicate the nature of its two pointers. Since we will restrict our detail considerations to the problem of insertion, we have once again to distinguish among four cases of grown subtrees. They are illustrated in Fig. 4.52, which makes the gained A A B 1. B a b c a b c A A B 2. C a c d a b c B b C d B b c A a C d A B 3. a c c a B b A b A B a b c A B 4. a c c a B b A b C d C d B b c A a C d 168 symmetry evident. Note that whenever a subtree of node A without siblings grows, the root of the subtree becomes the sibling of A. This case need not be considered any further. Fig. 4.52. Insertion in SBB-trees The four cases considered in Fig. 4.52 all reflect the occurrence of a page overflow and the subsequent page split. They are labelled according to the directions of the horizontal pointers linking the three siblings in the middle figures. The initial situation is shown in the left column; the middle column illustrates the fact that the lower node has been raised as its subtree has grown; the figures in the right column show the result of node rearrangement. It is advisable to stick no longer to the notion of pages out of which this organization had developed, for we are only interested in bounding the maximum path length to 2*log(N). For this we need only ensure that two horizontal pointers may never occur in succession on any search path. However, there is no reason to forbid nodes with horizontal pointers to the left and right, i.e. to treat the left and right sides differently. We therefore define the symmetric binary B-tree as a tree that has the following properties: 1. Every node contains one key and at most two (pointers to) subtrees. (LL) A B C A B C B A C (LR) B A C A B C B A C (RR) C A B A B C B A C (RL) B A C A B C B A C 169 2. Every pointer is either horizontal or vertical. There are no two consecutive horizontal pointers on any search path. 3. All terminal nodes (nodes without descendants) appear at the same (terminal) level. From this definition it follows that the longest search path is no longer than twice the height of the tree. Since no SBB-tree with N nodes can have a height larger than log(N), it follows immediately that 2*log(N) is an upper bound on the search path length. In order to visualize how these trees grow, we refer to Fig. 4.53. The lines represent snapshots taken during the insertion of the following sequences of keys, where every semicolon marks a snapshot. (1) 1 2; 3; 4 5 6; 7; (2) 5 4; 3; 1 2 7 6; (3) 6 2; 4; 1 7 3 5; (4) 4 2 6; 1 7; 3 5; Fig. 4.53. Insertion of keys 1 to 7 These pictures make the third property of B-trees particularly obvious: all terminal nodes appear on the same level. One is therefore inclined to compare these structures with garden hedges that have been recently trimmed with hedge scissors. The algorithm for the construction of SBB-trees is show below. It is based on a definition of the type Node with the two components lh and rh indicating whether or not the left and right pointers are horizontal. TYPE Node = RECORD key, count: INTEGER; left, right: Node; lh, rh: BOOLEAN END 7 1 3 3 1. 1 2 1 2 3 1 2 4 5 6 2 4 6 5 7 3 2. 4 5 3 4 5 1 2 4 5 6 7 3 3. 2 6 2 4 6 1 2 4 5 6 7 4. 2 4 1 2 4 6 6 7 1 2 6 5 3 4 170 The recursive procedure search again follows the pattern of the basic binary tree insertion algorithm. A third parameter h is added; it indicates whether or not the subtree with root p has changed, and it corresponds directly to the parameter h of the B-tree search program. We must note, however, the consequence of representing pages as linked lists: a page is traversed by either one or two calls of the search procedure. We must distinguish between the case of a subtree (indicated by a vertical pointer) that has grown and a sibling node (indicated by a horizontal pointer) that has obtained another sibling and hence requires a page split. The problem is easily solved by introducing a three-valued h with the following meanings: 1. h = 0: the subtree p requires no changes of the tree structure. 2. h = 1: node p has obtained a sibling. 3. h = 2: the subtree p has increased in height. PROCEDURE search(VAR p: Node; x: INTEGER; VAR h: INTEGER); VAR q, r: Node; BEGIN (*h=0*) IF p = NIL THEN (*insert new node*) NEW(p); p.key := x; p.L := NIL; p.R := NIL; p.lh := FALSE; p.rh := FALSE; h := 2 ELSIF x < p.key THEN search(p.L, x, h); IF h > 0 THEN (*left branch has grown or received sibling*) q := p.L; IF p.lh THEN h := 2; p.lh := FALSE; IF q.lh THEN (*LL*) p.L := q.R; q.lh := FALSE; q.R := p; p := q ELSE (*q.rh, LR*) r := q.R; q.R := r.L; q.rh := FALSE; r.L := p.L; p.L := r.R; r.R := p; p := r END ELSE DEC(h); IF h = 1 THEN p.lh := TRUE END END END ELSIF x > p.key THEN search(p.R, x, h); IF h > 0 THEN (*right branch has grown or received sibling*) q := p.R; IF p.rh THEN h := 2; p.rh := FALSE; IF q.rh THEN (*RR*) p.R := q.L; q.rh := FALSE; q.L := p; p := q ELSE (*q.lh, RL*) r := q.L; q.L := r.R; q.lh := FALSE; r.R := p.R; p.R := r.L; r.L := p; p := r END ELSE DEC(h); IF h = 1 THEN p.rh := TRUE END END END END END search; Note that the actions to be taken for node rearrangement very strongly resemble those developed in the AVL-balanced tree search algorithm. It is evident that all four cases can be implemented by simple pointer rotations: single rotations in the LL and RR cases, double rotations in the LR and RL cases. In fact, procedure search appears here slightly simpler than in the AVL case. Clearly, the SBB-tree scheme emerges as an alternative to the AVL-balancing criterion. A performance comparison is therefore both possible and desirable. 171 We refrain from involved mathematical analysis and concentrate on some basic differences. It can be proven that the AVL-balanced trees are a subset of the SBB-trees. Hence, the class of the latter is larger. It follows that their path length is on the average larger than in the AVL case. Note in this connection the worst-case tree (4) in Fig. 4.53. On the other hand, node rearrangement is called for less frequently. The balanced tree is therefore preferred in those applications in which key retrievals are much more frequent than insertions (or deletions); if this quotient is moderate, the SBB-tree scheme may be preferred. It is very difficult to say where the borderline lies. It strongly depends not only on the quotient between the frequencies of retrieval and structural change, but also on the characteristics of an implementation. This is particularly the case if the node records have a densely packed representation, and if therefore access to fields involves part-word selection. The SBB-tree has later found a rebirth under the name of red-black tree . The difference is that whereas in the case of the symmetric, binary B-tree every node contains two h-fields indicating whether the emanating pointers are horizontal, every node of the red-black tree contains a single h-field, indicating whether the incoming pointer is horizontal. The name stems from the idea to color nodes with incoming down-pointer black, and those with incoming horizontal pointer red. No two red nodes can immedaitely follow each other on any path. Therefore, like in the cases of the BB- and SBB-trees, every search path is at most twice as long as the height of the tree. There exists a canonical mapping from binary B-trees to red-black trees. 4.8. Priority Search Trees Trees, and in particular binary trees, constitute very effective organisations for data that can be ordered on a linear scale. The preceding chapters have exposed the most frequently used ingenious schemes for efficient searching and maintenance (insertion, deletion). Trees, however, do not seem to be helpful in problems where the data are located not in a one-dimensional, but in a multi-dimensional space. In fact, efficient searching in multi-dimensional spaces is still one of the more elusive problems in computer science, the case of two dimensions being of particular importance to many practical applications. Upon closer inspection of the subject, trees might still be applied usefully at least in the two-dimensional case. After all, we draw trees on paper in a two-dimensional space. Let us therefore briefly review the characteristics of the two major kinds of trees so far encountered. 1. A search tree is governed by the invariants p.left ≠ NIL implies p.left.x < p.x p.right ≠ NIL implies p.x < p.right.x holding for all nodes p with key x. It is apparent that only the horizontal position of nodes is at all constrained by the invariant, and that the vertical positions of nodes can be arbitrarily chosen such that access times in searching, (i.e. path lengths) are minimized. 2. A heap, also called priority tree , is governed by the invariants p.left ≠ NIL implies p.y ≤ p.left.y p.right ≠ NIL implies p.y ≤ p.right.y holding for all nodes p with key y. Here evidently only the vertical positions are constrained by the invariants. It seems straightforward to combine these two conditions in a definition of a tree organization in a two- dimensional space, with each node having two keys x and y, which can be regarded as coordinates of the node. Such a tree represents a point set in a plane, i.e. in a two-dimensional Cartesian space; it is therefore called Cartesian tree [4-9]. We prefer the term priority search tree , because it exhibits that this structure emerged from a combination of the priority tree and the search tree. It is characterized by the following invariants holding for each node p: p.left ≠ NIL implies (p.left.x < p.x) & (p.y ≤ p.left.y) p.right ≠ NIL implies (p.x < p.right.x) & (p.y ≤ p.right.y) It should come as no big surprise, however, that the search properties of such trees are not particularly wonderful. After all, a considerable degree of freedom in positioning nodes has been taken away and is no longer available for choosing arrangements yielding short path lengths. Indeed, no logarithmic bounds 172 on efforts involved in searching, inserting, or deleting elements can be assured. Although this had already been the case for the ordinary, unbalanced search tree, the chances for good average behaviour are slim. Even worse, maintenance operations can become rather unwieldy. Consider, for example, the tree of Fig. 4.54 (a). Insertion of a new node C whose coordinates force it to be inserted above and between A and B requires a considerable effort transforming (a) into (b). McCreight discovered a scheme, similar to balancing, that, at the expense of a more complicated insertion and deletion operation, guarantees logarithmic time bounds for these operations. He calls that structure a priority search tree [4-10]; in terms of our classification, however, it should be called a balanced priority search tree. We refrain from discussing that structure, because the scheme is very intricate and in practice hardly used. By considering a somewhat more restricted, but in practice no less relevant problem, McCreight arrived at yet another tree structure, which shall be presented here in detail. Instead of assuming that the search space be unbounded, he considered the data space to be delimited by a rectangle with two sides open. We denote the limiting values of the x-coordinate by x min and x max . In the scheme of the (unbalanced) priority search tree outlined above, each node p divides the plane into two parts along the line x = p.x. All nodes of the left subtree lie to its left, all those in the right subtree to its right. For the efficiency of searching this choice may be bad. Fortunately, we may choose the dividing line differently. Let us associate with each node p an interval [p.L .. p.R), ranging over all x values including p.L up to but excluding p.R. This shall be the interval within which the x-value of the node may lie. Then we postulate that the left descendant (if any) must lie within the left half, the right descendant within the right half of this interval. Hence, the dividing line is not p.x, but (p.L+p.R)/2. For each descendant the interval is halved, thus limiting the height of the tree to log(xmax-xmin). This result holds only if no two nodes have the same x-value, a condition which, however, is guaranteed by the invariant (4.90). If we deal with integer coordinates, this limit is at most equal to the wordlength of the computer used. Effectively, the search proceeds like a bisection or radix search, and therefore these trees are called radix priority search trees [4-10]. They feature logarithmic bounds on the number of operations required for searching, inserting, and deleting an element, and are governed by the following invariants for each node p: p.left ≠ NIL implies (p.L ≤ p.left.x < p.M) & (p.y ≤ p.left.y) p.right ≠ NIL implies (p.M ≤ p.right.x < p.R) & (p.y ≤ p.right.y) where p.M = (p.L + p.R) DIV 2 p.left.L = p.L p.left.R = p.M p.right.L = p.M p.right.R = p.R for all node p, and root.L = x min, root.R = x max . A decisive advantage of the radix scheme is that maintenance operations (preserving the invariants under insertion and deletion) are confined to a single spine of the tree, because the dividing lines have fixed values of x irrespective of the x-values of the inserted nodes. Typical operations on priority search trees are insertion, deletion, finding an element with the least (largest) value of x (or y) larger (smaller) than a given limit, and enumerating the points lying within a given rectangle. Given below are procedures for inserting and enumerating. They are based on the following type declarations: TYPE Node = POINTER TO RECORD x, y: INTEGER; left, right: Node END Notice that the attributes x L and x R need not be recorded in the nodes themselves. They are rather computed during each search. This, however, requires two additional parameters of the recursive procedure insert. Their values for the first call (with p = root) are xmin and xmax respectively. Apart from this, a search proceeds similarly to that of a regular search tree. If an empty node is encountered, the 173 element is inserted. If the node to be inserted has a y-value smaller than the one being inspected, the new node is exchanged with the inspected node. Finally, the node is inserted in the left subtree, if its x-value is less than the middle value of the interval, or the right subtree otherwise. PROCEDURE insert(VAR p: Node; X, Y, xL, xR: INTEGER); VAR xm, t: INTEGER; BEGIN IF p = NIL THEN (*not in tree, insert*) NEW(p); p.x := X; p.y := Y; p.left := NIL; p.right := NIL ELSIF p.x = X THEN (*found; don't insert*) ELSE IF p.y > Y THEN t := p.x; p.x := X; X := t; t := p.y; p.y := Y; Y := t END ; xm := (xL + xR) DIV 2; IF X < xm THEN insert(p.left, X, Y, xL, xm) ELSE insert(p.right, X, Y, xm, xR) END END END insert The task of enumerating all points x,y lying in a given rectangle, i.e. satisfying x0 ≤ x < x1 and y ≤ y1 is accomplished by the following procedure enumerate. It calls a procedure report(x,y) for each point found. Note that one side of the rectangle lies on the x-axis, i.e. the lower bound for y is 0. This guarantees that enumeration requires at most O(log(N) + s) operations, where N is the cardinality of the search space in x and s is the number of nodes enumerated. PROCEDURE enumerate(p: Ptr; x0, x1, y, xL, xR: INTEGER); VAR xm: INTEGER; BEGIN IF p # NIL THEN IF (p.y <= y) & (x0 <= p.x) & (p.x < x1) THEN report(p.x, p.y) END ; xm := (xL + xR) DIV 2; IF x0 < xm THEN enumerate(p.left, x0, x1, y, xL, xm) END ; IF xm < x1 THEN enumerate(p.right, x0, x1, y, xm, xR) END END END enumerate Exercises 4.1. Let us introduce the notion of a recursive type, to be declared as RECTYPE T = T0 and denoting the set of values defined by the type T0 enlarged by the single value NONE. The definition of the type person , for example, could then be simplified to RECTYPE person = RECORD name: Name; father, mother: person END Which is the storage pattern of the recursive structure corresponding to Fig. 4.2? Presumably, an implementation of such a feature would be based on a dynamic storage allocation scheme, and the fields named father and mother in the above example would contain pointers generated automatically but hidden from the programmer. What are the difficulties encountered in the realization of such a feature? 174 4.2. Define the data structure described in the last paragraph of Section 4.2 in terms of records and pointers. Is it also possible to represent this family constellation in terms of recursive types as proposed in the preceding exercise? 4.3. Assume that a first-in-first-out (fifo) queue Q with elements of type T0 is implemented as a linked list. Define a module with a suitable data structure, procedures to insert and extract an element from Q, and a function to test whether or not the queue is empty. The procedures should contain their own mechanism for an economical reuse of storage. 4.4. Assume that the records of a linked list contain a key field of type INTEGER. Write a program to sort the list in order of increasing value of the keys. Then construct a procedure to invert the list. 4.5. Circular lists (see Fig. 4.55) are usually set up with a so-called list header. What is the reason for introducing such a header? Write procedures for the insertion, deletion, and search of an element identified by a given key. Do this once assuming the existence of a header, once without header. Fig. 4.55. Circular list 4.6. A bidirectional list is a list of elements that are linked in both ways. (See Fig. 4.56) Both links are originating from a header. Analogous to the preceding exercise, construct a module with procedures for searching, inserting, and deleting elements. Fig. 4.56. Bidirectional list 4.7. Does the given program for topological sorting work correctly if a certain pair occurs more than once in the input? 4.8. The message "This set is not partially ordered" in the program for topological sorting is not very helpful in many cases. Extend the program so that it outputs a sequence of elements that form a loop, if there exists one. 4.9. Write a program that reads a program text, identifies all procedure definitions and calls, and tries to establish a topological ordering among the subroutines. Let P ← Q mean that P is called by Q. 4.10. Draw the tree constructed by the program shown for constructing the perfectly balanced tree, if the input consists of the natural numbers 1, 2, 3, ... , n. 4.11. Which are the sequences of nodes encountered when traversing the tree of Fig. 4.23 in preorder, inorder, and postorder? 4.12. Find a composition rule for the sequence of n numbers which, if applied to the program for straight search and insertion, yields a perfectly balanced tree. 4.13. Consider the following two orders for traversing binary trees: 175 a1. Traverse the right subtree. a2. Visit the root. a3. Traverse the left subtree. b1. Visit the root. b2. Traverse the right subtree. b3. Traverse the left subtree. Are there any simple relationships between the sequences of nodes encountered following these orders and those generated by the three orders defined in the text? 4.14. Define a data structure to represent n-ary trees. Then write a procedure that traverses the n-ary tree and generates a binary tree containing the same elements. Assume that the key stored in an element occupies k words and that each pointer occupies one word of storage. What is the gain in storage when using a binary tree versus an n-ary tree? 4.15. Assume that a tree is built upon the following definition of a recursive data structure (see Exercise 4.1). Formulate a procedure to find an element with a given key x and to perform an operation P on this element. RECTYPE Tree = RECORD x: INTEGER; left, right: Tree END 4.16. In a certain file system a directory of all files is organized as an ordered binary tree. Each node denotes a file and specifies the file name and, among other things the date of its last access, encoded as an integer. Write a program that traverses the tree and deletes all files whose last access was before a certain date. 4.17. In a tree structure the frequency of access of each element is measured empirically by attributing to each node an access count. At certain intervals of time, the tree organization is updated by traversing the tree and generating a new tree by using the program of straight search and insertion, and inserting the keys in the order of decreasing frequency count. Write a program that performs this reorganization. Is the average path length of this tree equal to, worse, or much worse than that of an optimal tree? 4.18. The method of analyzing the tree insertion algorithm described in Sect. 4.5 can also be used to compute the expected numbers C n of comparisons and M n of moves (exchanges) which are performed by Quicksort, sorting n elements of an array, assuming that all n! permutations of the n keys 1, 2, ... , n are equally likely. Find the analogy and determine C n and M n. 4.19. Draw the balanced tree with 12 nodes which has the maximum height of all 12-node balanced trees. In which sequence do the nodes have to be inserted so that the AVL-insertion procedure generates this tree? 4.20. Find a sequence of n insertion keys so that the procedure for insertion in an AVL-tree performs each of the four rebalancing acts (LL, LR, RR, RL) at least once. What is the minimal length n for such a sequence? 4.21. Find a balanced tree with keys 1 ... n and a permutation of these keys so that, when applied to the deletion procedure for AVL-trees, it performs each of the four rebalancing routines at least once. What is the sequence with minimal length n? 4.22. What is the average path length of the Fibonacci-tree T n? 4.23. Write a program that generates a nearly optimal tree according to the algorithm based on the selection of a centroid as root. 4.24. Assume that the keys 1, 2, 3, ... are inserted into an empty B-tree of order 2. Which keys cause page splits to occur? Which keys cause the height of the tree to increase? If the keys are deleted in the same order, which keys cause pages to be merged (and disposed) and which keys cause the height to decrease? Answer the question for (a) a deletion scheme using balancing, and (b) a scheme whithout balancing (upon underflow, a single item only is fetched from a neighboring page). 176 4.25. Write a program for the search, insertion, and deletion of keys in a binary B-tree. Use the node type and the insertion scheme shown above for the binary B-tree. 4.26. Find a sequence of insertion keys which, starting from the empty symmetric binary B-tree, causes the shown procedure to perform all four rebalancing acts (LL, LR, RR, RL) at least once. What is the shortest such sequence? 4.27. Write a procedure for the deletion of elements in a symmetric binary B-tree. Then find a tree and a short sequence of deletions causing all four rebalancing situations to occur at least once. 4.28. Formulate a data structure and procedures for the insertion and deletion of an element in a priority search tree. The procedures must maintain the specified invariants. Compare their performance with that of the radix priority search tree. 4.29. Design a module with the following procedures operating on radix priority search trees: -- insert a point with coordinates x, y. -- enumerate all points within a specified rectangle. -- find the point with the least x-coordinate in a specified rectangle. -- find the point with the largest y-coordinate within a specified rectangle. -- enumerate all points lying within two (intersecting) rectangles. References 4-1. G.M. Adelson-Velskii and E.M. Landis. Doklady Akademia Nauk SSSR, 146 , (1962), 263-66; English translation in Soviet Math, 3, 1259-63. 4-2. R. Bayer and E.M. McCreight. Organization and Maintenance of Large Ordered Indexes. Acta Informatica, 1 , No. 3 (1972), 173-89. 4-3. -----, Binary B-trees for Virtual memory. Proc. 1971 ACM SIGFIDET Workshop, San Diego, Nov. 1971, pp. 219-35. 4-4. -----, Symmetric Binary B-trees: Data Structure and Maintenance Algorithms. Acta Informatica, 1, No. 4 (1972), 290-306. 4-5. T.C. Hu and A.C. Tucker. SIAM J. Applied Math, 21, No. 4 (1971) 514-32. 4-6. D. E. Knuth. Optimum Binary Search Trees. Acta Informatica, 1, No. 1 (1971), 14-25. 4-7. W.A. Walker and C.C. Gotlieb. A Top-down Algorithm for Constructing Nearly Optimal Lexicographic Trees. in Graph Theory and Computing (New York: Academic Press, 1972), pp. 303-23. 4-8. D. Comer. The ubiquitous B-Tree. ACM Comp. Surveys, 11 , 2 (June 1979), 121-137. 4-9. J. Vuillemin. A unifying look at data structures. Comm. ACM, 23, 4 (April 1980), 229-239. 4-10. E.M. McCreight. Priority search trees. SIAM J. of Comp. (May 1985) 177 5 Key Transformations (Hashing) 5.1. Introduction The principal question discussed in Chap. 4 at length is the following: Given a set of items characterized by a key (upon which an ordering relation is defined), how is the set to be organized so that retrieval of an item with a given key involves as little effort as possible? Clearly, in a computer store each item is ultimately accessed by specifying a storage address. Hence, the stated problem is essentially one of finding an appropriate mapping H of keys (K) into addresses (A): H: K → A In Chap. 4 this mapping was implemented in the form of various list and tree search algorithms based on different underlying data organizations. Here we present yet another approach that is basically simple and very efficient in many cases. The fact that it also has some disadvantages is discussed subsequently. The data organization used in this technique is the array structure. H is therefore a mapping transforming keys into array indices, which is the reason for the term key transformation that is generally used for this technique. It should be noted that we shall not need to rely on any dynamic allocation procedures; the array is one of the fundamental, static structures. The method of key transformations is often used in problem areas where tree structures are comparable competitors. The fundamental difficulty in using a key transformation is that the set of possible key values is much larger than the set of available store addresses (array indices). Take for example names consisting of up to 16 letters as keys identifying individuals in a set of a thousand persons. Hence, there are 26 16 possible keys which are to be mapped onto 10 3 possible indices. The function H is therefore obviously a many-to-one function. Given a key k, the first step in a retrieval (search) operation is to compute its associated index h = H(k), and the second -- evidently necessary -- step is to verify whether or not the item with the key k is indeed identified by h in the array (table) T, i.e., to check whether T[H(k)].key = k. We are immediately confronted with two questions: 1. What kind of function H should be used? 2. How do we cope with the situation that H does not yield the location of the desired item? The answer to the second question is that some method must be used to yield an alternative location, say index h', and, if this is still not the location of the wanted item, yet a third index h", and so on. The case in which a key other than the desired one is at the identified location is called a collision ; the task of generating alternative indices is termed collision handling. In the following we shall discuss the choice of a transformation function and methods of collision handling. 5.2. Choice of a Hash Function A prerequisite of a good transformation function is that it distributes the keys as evenly as possible over the range of index values. Apart from satisfying this requirement, the distribution is not bound to any pattern, and it is actually desirable that it give the impression of being entirely random. This property has given this method the somewhat unscientific name hashing, i.e., chopping the argument up, or making a mess. H is called the hash function . Clearly, it should be efficiently computable, i.e., be composed of very few basic arithmetic operations. Assume that a transfer function ORD(k) is avilable and denotes the ordinal number of the key k in the set of all possible keys. Assume, furthermore, that the array indices i range over the intergers 0 .. N-1, where N is the size of the array. Then an obvious choice is H(k) = ORD(k) MOD N It has the property that the key values are spread evenly over the index range, and it is therefore the basis of most key transformations. It is also very efficiently computable, if N is a power of 2. But it is exactly this case that must be avoided, if the keys are sequences of letters. The assumption that all keys are equally likely is in this case mistaken. In fact, words that differ by only a few characters then most likely map onto 178 identical indices, thus effectively causing a most uneven distribution. It is therefore particularly recommended to let N be a prime number [5-2]. This has the conseqeunce that a full division operation is needed that cannot be replaced by a mere masking of binary digits, but this is no serious drawback on most modern computers that feature a built-in division instruction. Often, hash funtions are used which consist of applying logical operations such as the exclusive or to some parts of the key represented as a sequence of binary digits. These operations may be faster than division on some computers, but they sometimes fail spectacularly to distribute the keys evenly over the range of indices. We therefore refrain from discussing such methods in further detail. 5.3. Collision Handling If an entry in the table corresponding to a given key turns out not to be the desired item, then a collision is present, i.e., two items have keys mapping onto the same index. A second probe is necessary, one based on an index obtained in a deterministic manner from the given key. There exist several methods of generating secondary indices. An obvious one is to link all entries with identical primary index H(k) together in a linked list. This is called direct chaining. The elements of this list may be in the primary table or not; in the latter case, storage in which they are allocated is usually called an overflow area . This method has the disadvantage that secondary lists must be maintained, and that each entry must provide space for a pointer (or index) to its list of collided items. An alternative solution for resolving collisions is to dispense with links entirely and instead simply look at other entries in the same table until the item is found or an open position is encountered, in which case one may assume that the specified key is not present in the table. This method is called open addressing [5-3]. Naturally, the sequence of indices of secondary probes must always be the same for a given key. The algorithm for a table lookup can then be sketched as follows: h := H(k); i := 0; REPEAT IF T[h].key = k THEN item found ELSIF T[h].key = free THEN item is not in table ELSE (*collision*) i := i+1; h := H(k) + G(i) END UNTIL found or not in table (or table full) Various functions for resolving collisions have been proposed in the literature. A survey of the topic by Morris in 1968 [4-8] stimulated considerable activities in this field. The simplest method is to try for the next location -- considering the table to be circular -- until either the item with the specified key is found or an empty location is encountered. Hence, G(i) = i; the indices hi used for probing in this case are h0 = H(k) hi = (h i-1 + i) MOD N, i = 1 ... N-1 This method is called linear probing and has the disadvantage that entries have a tendency to cluster around the primary keys (keys that had not collided upon insertion). Ideally, of course, a function G should be chosen that again spreads the keys uniformly over the remaining set of locations. In practice, however, this tends to be too costly, and methods that offer a compromise by being simple to compute and still superior to the linear function are preferred. One of them consists of using a quadratic function such that the sequence of indices for probing is h0 = H(k) hi = (h 0 + i 2) MOD N i > 0 Note that computation of the next index need not involve the operation of squaring, if we use the following recurrence relations for h i = i 2 and d i = 2i + 1. hi+1 = h i + d i di+1 = d i + 2 i > 0 179 with h 0 = 0 and d 0 = 1. This is called quadratic probing , and it essentially avoids primary clustering, although practically no additional computations are required. A very slight disadvantage is that in probing not all table entries are searched, that is, upon insertion one may not encounter a free slot although there are some left. In fact, in quadratic probing at least half the table is visited if its size N is a prime number. This assertion can be derived from the following deliberation. If the i th and the j th probes coincide upon the same table entry, we can express this by the equation i2 MOD N = j 2 MOD N (i 2 - j 2) ≡ 0 (modulo N) Splitting the differences up into two factors, we obtain (i + j)(i - j) ≡ 0 (modulo N) and since i ≠ j, we realize that either i or j have to be at least N/2 in order to yield i+j = c*N, with c being an integer. In practice, the drawback is of no importance, since having to perform N/2 secondary probes and collision evasions is extremely rare and occurs only if the table is already almost full. As an application of the scatter storage technique, the cross-reference generator procedure shown in Sect. 4.4.3 is rewritten. The principal differences lie in the procedure search and in the replacement of the pointer type Node by the global hash table of words T. The hash function H is the modulus of the table size; quadratic probing was chosen for collision handling. Note that it is essential for good performance that the table size be a prime number. Although the method of key transformation is most effective in this case -- actually more efficient than tree organizations -- it also has a disadvantage. After having scanned the text and collected the words, we presumably wish to tabulate these words in alphabetical order. This is straightforward when using a tree organization, because its very basis is the ordered search tree. It is not, however, when key transformations are used. The full significance of the word hashing becomes apparent. Not only would the table printout have to be preceded by a sort process (which is omitted here), but it even turns out to be advantageous to keep track of inserted keys by linking them together explicitly in a list. Hence, the superior performance of the hashing method considering the process of retrieval only is partly offset by additional operations required to complete the full task of generating an ordered cross-reference index. CONST P = 997; (*prime, table size*) WordLen = 32; (*max length of keys*) Noc = 16; (*max no. of items per word*) TYPE Word = ARRAY WordLen OF CHAR; Table = POINTER TO ARRAY P OF RECORD key: Word; n: INTEGER; lno: ARRAY Noc OF INTEGER END; VAR line: INTEGER; PROCEDURE search(T: Table; VAR a: Word); VAR i, d: INTEGER; h: LONGINT; found: BOOLEAN; BEGIN (*compute hash index h for a; uses global variable line*) i := 0; h := 0; WHILE a[i] > 0X DO h := (256*h + ORD(a[i])) MOD P; INC(i) END ; d := 1; found := FALSE; REPEAT IF T[h].key = a THEN (*match*) found := TRUE; T[h].lno[T[h].n] := line; IF T[h].n < Noc THEN INC(T[h].n) END ELSIF T[h].key[0] = " " THEN (*new entry*) found := TRUE; COPY(a, T[h].key); T[h].lno[0] := line; T[h].n := 1 ELSE (*collision*) h := h+d; d := d+2; IF h >= P THEN h := h-P END ; IF d = P THEN Texts.WriteString(W, " Table overflow"); HALT(88) END 180 END UNTIL found END search; PROCEDURE Tabulate(T: Table); (*uses global writer W*) VAR i, k: INTEGER; BEGIN FOR k := 0 TO P-1 DO IF T[k].key[0] # " " THEN Texts.WriteString(W, T[k].key); Texts.Write(W, TAB); FOR i := 0 TO T[k].n -1 DO Texts.WriteInt(W, T[k].lno[i], 4) END ; Texts.WriteLn(W) END END END Tabulate; PROCEDURE CrossRef(VAR R: Texts.Reader); VAR i: INTEGER; ch: CHAR; w: Word; H: Table; BEGIN NEW(H); FOR i := 0 TO P-1 DO H[i].key[0] := " " END ; (*allocate and clear hash table*) line := 0; Texts.WriteInt(W, 0, 6); Texts.Write(W, TAB); Texts.Read(R, ch); WHILE ~R.eot DO IF ch = 0DX THEN (*line end*) Texts.WriteLn(W); INC(line); Texts.WriteInt(W, line, 6); Texts.Write(W, 9X); Texts.Read(R, ch) ELSIF ("A" <= ch) & (ch <= "Z") OR ("a" <= ch) & (ch <= "z") THEN i := 0; REPEAT IF i < WordLen-1 THEN w[i] := ch; INC(i) END ; Texts.Write(W, ch); Texts.Read(R, ch) UNTIL (i = WordLen-1) OR ~(("A" <= ch) & (ch <= "Z")) & ~(("a" <= ch) & (ch <= "z")) & ~(("0" <= ch) & (ch <= "9")); w[i] := 0X; (*string terminator*) search(H, w) ELSE Texts.Write(W, ch); Texts.Read(R, ch) END ; Texts.WriteLn(W); Texts.WriteLn(W); Tabulate(H) END END CrossRef 5.4. Analysis of Key Transformation Insertion and retrieval by key transformation has evidently a miserable worst-case performance. After all, it is entirely possible that a search argument may be such that the probes hit exactly all occupied locations, missing consistently the desired (or free) ones. Actually, considerable confidence in the correctness of the laws of probability theory is needed by anyone using the hash technique. What we wish to be assured of is that on the average the number of probes is small. The following probabilistic argument reveals that it is even very small. Let us once again assume that all possible keys are equally likely and that the hash function H distributes them uniformly over the range of table indices. Assume, then, that a key has to be inserted in a table of size n which already contains k items. The probability of hitting a free location the first time is then (n-k)/n. This is also the probability p1 that a single comparison only is needed. The probability that excatly one second probe is needed is equal to the probability of a collision in the first try times the probability of hitting a free location the next time. In general, we obtain the probability pi of an insertion requiring exactly i probes as 181 p1 = (n-k)/n p2 = (k/n) × (n-k)/(n-1) p3 = (k/n) × (k-1)/(n-1) × (n-k)/(n-2) ………. pi = (k/n) × (k-1)/(n-1) × (k-2)/(n-2) × … × (n-k)/(n-(i-1)) The expected number E of probes required upon insertion of the k+1st key is therefore Ek+1 = Si: 1 ≤ i ≤ k+1 : i×p i = 1 × (n-k)/n + 2 × (k/n) × (n-k)/(n-1) + ... + (k+1) * (k/n) × (k-1)/(n-1) × (k-2)/(n-2) × … × 1/(n-(k-1)) = (n+1) / (n-(k-1)) Since the number of probes required to insert an item is identical with the number of probes needed to retrieve it, the result can be used to compute the average number E of probes needed to access a random key in a table. Let the table size again be denoted by n, and let m be the number of keys actually in the table. Then E = (Sk: 1 ≤ k ≤ m: Ek) / m = (n+1) × (Sk: 1 ≤ k ≤ m: 1/(n-k+2))/m = (n+1) × (H n+1 - H n-m+1 ) where H is the harmonic function. H can be approximated as H n = ln(n) + g, where g is Euler's constant. If, moreover, we substitute a for m/(n+1), we obtain E = (ln(n+1) - ln(n-m+1))/a = ln((n+1)/(n-m+1))/a = -ln(1-a)/a a is approximately the quotient of occupied and available locations, called the load factor ; a = 0 implies an empty table, a = n/(n+1) ≈ 1 a full table. The expected number E of probes to retrieve or insert a randomly chosen key is listed in Table 5.1 as a function of the load factor. The numerical results are indeed surprising, and they explain the exceptionally good performance of the key transformation method. Even if a table is 90% full, on the average only 2.56 probes are necessary to either locate the key or to find an empty location. Note in particular that this figure does not depend on the absolute number of keys present, but only on the load factor. a E 0.1 1.05 0.25 1.15 0.5 1.39 0.75 1.85 0.9 2.56 0.95 3.15 0.99 4.66 Table 4.6 Expected number of probes as a function of the load factor. The above analysis was based on the use of a collision handling method that spreads the keys uniformly over the remaining locations. Methods used in practice yield slightly worse performance. Detailed analysis for linear probing yields an expected number of probes as E = (1 - a/2) / (1 - a) Some numerical values for E(a) are listed in Table 4.7 [5-4]. The results obtained even for the poorest method of collision handling are so good that there is a temptation to regard key transformation (hashing) as the panacea for everything. This is particularly so because its performance is superior even to the most sophisticated tree organization discussed, at least on the basis of comparison steps needed for retrieval and insertion. It is therefore important to point out explicitly some of the drawbacks of hashing, even if they are obvious upon unbiased consideration. 182 a E 0.1 1.06 0.25 1.17 0.5 1.50 0.75 2.50 0.9 5.50 0.95 10.50 Table 4.7 Expected number of probes for linear probing. Certainly the major disadvantage over techniques using dynamic allocation is that the size of the table is fixed and cannot be adjusted to actual demand. A fairly good a priori estimate of the number of data items to be classified is therefore mandatory if either poor storage utilization or poor performance (or even table overflow) is to be avoided. Even if the number of items is exactly known -- an extremely rare case -- the desire for good performance dictates to dimension the table slightly (say 10%) too large. The second major deficiency of scatter storage techniques becomes evident if keys are not only to be inserted and retrieved, but if they are also to be deleted. Deletion of entries in a hash table is extremely cumbersome unless direct chaining in a separate overflow area is used. It is thus fair to say that tree organizations are still attractive, and actually to be preferred, if the volume of data is largely unknown, is strongly variable, and at times even decreases. Exercises 5.1. If the amount of information associated with each key is relatively large (compared to the key itself), this information should not be stored in the hash table. Explain why and propose a scheme for representing such a set of data. 5.2. Consider the proposal to solve the clustering problem by constructing overflow trees instead of overflow lists, i.e., of organizing those keys that collided as tree structures. Hence, each entry of the scatter (hash) table can be considered as the root of a (possibly empty) tree. Compare the expected performance of this tree hashing method with that of open addressing. 5.3. Devise a scheme that performs insertions and deletions in a hash table using quadratic increments for collision resolution. Compare this scheme experimentally with the straight binary tree organization by applying random sequences of keys for insertion and deletion. 5.4. The primary drawback of the hash table technique is that the size of the table has to be fixed at a time when the actual number of entries is not known. Assume that your computer system incorporates a dynamic storage allocation mechanism that allows to obtain storage at any time. Hence, when the hash table H is full (or nearly full), a larger table H' is generated, and all keys in H are transferred to H', whereafter the store for H can be returned to the storage administration. This is called rehashing. Write a program that performs a rehash of a table H of size n. 5.5. Very often keys are not integers but sequences of letters. These words may greatly vary in length, and therefore they cannot conveniently and economically be stored in key fields of fixed size. Write a program that operates with a hash table and variable length keys. References 5-1. W.D. Maurer. An Improved Hash Code for Scatter Storage. Comm. ACM, 11 , No. 1 (1968), 35-38. 5-2. R. Morris. Scatter Storage Techniques. Comm. ACM, 11, No. 1 (1968), 38-43. 5-3. W.W. Peterson. Addressing for Random-access Storage. IBM J. Res. & Dev., 1 , (1957), 130-46. 5-4. G. Schay and W. Spruth. Analysis of a File Addressing Method. Comm. ACM, 5 , No. 8 459-62. 183