Advanced Data Structures using ‘C’ Unit 6
Unit 6 File Structures Structure 6.1 Introduction 6.2 Logical or Physical Organization and Data Independence 6.3 A language for describing file structures 6.4 Basic terminology 6.5 Sequential files 6.6 Inverted files
6.7 Index-sequential files
Self Assessment Questions 6.8 Multi-lists 6.9 Cellular multi-lists
6.10 Ring structures
6.11 Threaded lists
Self Assessment Questions 6.12 Trees 6.13 Scatter storage or hash addressing 6.14 Clustered files
6.15 B - Tree File Organization
6.16 B- Tree Index Files
6.17 Dynamic Hashing
6.17.1 How does it work? Self Assessment Questions
6.18 Summary
6.19 Terminal Questions 6.1 Introduction
This unit is mainly concerned with the way in which file structures are used in document retrieval. Most surveys of file structures address themselves to
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 137
applications in data management which is reflected in the terminology used to describe the basic concepts. Objectives: At the end of this unit the reader must be able to answer and understand
Structures that are used in the document retrieval
Logical, physical and organizational data independencies and the methods used.
The different terminologies used in file structure description
What are sequential files, inverted files and index sequential files? Give its advantages.
Multi lists, threaded list and cellular list used
The usage of clustered files and dynamic hashing methods
6.2 Logical or Physical Organization and Data Independence There is one important distinction that must be made at the outset when discussing file structures. And that is the difference between the logical and physical organisation of the data. On the whole a file structure will specify the logical structure of the data, which is the relationship that will exist between data items independently of the way in which these relationships may actually be realised within any computer. It is this logical aspect that we will concentrate on. The physical organisation is much more concerned with optimising the use of the storage medium when a particular logical structure is stored on, or in it. Typically for every unit of physical store there will be a number of units of the logical structure (probably records) to be stored in it. For example, if we were to store a tree structure on a magnetic disk, the physical organisation would be concerned with the best way of packing the nodes of the tree on the disk given the access characteristics of the disk.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 138
The work on data bases has been very much concerned with a concept called data independence. The aim of this work is to enable programs to be written independently of the logical structure of the data they would interact with. The independence takes the following form, should the file structure overnight be changed from an inverted to a serial file the program should remain unaffected. This independence is achieved by interposing a data model between the user and the data base. The user sees the data model rather than the data base, and all his programs communicate with the model. The user therefore has no interest in the structure of the file. There is a school of thought that says that says that applications in library automation and information retrieval should follow this path as well. And so it should. Unfortunately, there is still much debate about what a good data model should look like. Furthermore, operational implementations of some of the more advanced theoretical systems do not exist yet. So any suggestion that an IR system might be implemented through a data base package should still seem premature. Also, the scale of the problems in IR is such that efficient implementation of the application still demands close scrutiny of the file structure to be used.
Nevertheless, it is worth taking seriously the trend away from user knowledge of file structures, a trend that has been stimulated considerably by attempts to construct a theory of data. There are a number of proposals for dealing with data at an abstract level. The best known of these by now is the one put forward by Codd, which has become known as the relational model. In it data are described by n-tuples of attribute values. More formally if the data is described by relations, a relation on a set of domains D1, . . . , Dn can be represented by a set of ordered n-tuples each of the form (d1, . . . , dn) where di [[proper subset]] Di. As it is rather difficult to cope with general relations, various levels (three in fact) of normalisation have been introduced restricting the kind of relations allowed.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 139
A second approach is the hierarchical approach. It is used in many existing data base systems. This approach works as one might expect: data is represented in the form of hierarchies. Although it is more restrictive than the relational approach it often seems to be the natural way to proceed. It can be argued that in many applications a hierarchic structure is a good approximation to the natural structure in the data, and that the resulting loss in precision of representation is worth the gain in efficiency and simplicity of representation. The third approach is the network approach associated with the proposals by the Data Base Task Group of CODASYL. Here data items are linked into a network in which any given link between two items exists because it satisfies some condition on the attributes of those items, for example, they share an attribute. It is more general than the hierarchic approach in the sense that a node can have any number of immediate superiors. It is also equivalent to the relational approach in descriptive power. The whole field of data base structures is still very much in a state of flux. The advantages and disadvantages of each approach are discussed very thoroughly in Date, who also gives excellent annotated citations to the current literature. There is also a recent Computing Survey which reviews the current state of the art. There have been some very early proponents of the relational approach in IR, as early as 1967 Maron and Levien discussed the design and implementation of an IR system via relations, be it binary ones. 6.3 A language for describing file structures
Like all subjects in computer science the terminology of file structures has evolved higgledy-piggledy without much concern for consistency, ambiguity, or whether it was possible to make the kind of distinctions that were
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 140
important. It was only much later that the need for a well-defined, unambiguous language to describe file structures became apparent. In particular, there arose a need to communicate ideas about file structures without getting bogged down by hardware considerations. This section will present a formal description of file structures. The framework described is important for the understanding of any file structure. The terminology is based on that introduced by Hsiao and Harary. 6.4 Basic terminology Given a set of 'attributes' A and a set of 'values' V, then a record R is a subset of the cartesian product A x V in which each attribute has one and only one value. Thus R is a set of ordered pairs of the form (an attribute, its value). For example, the record for a document which has been processed by an automatic content analysis algorithm would be R = {(K1, x1), (K2, x2), . . . (Km, xm)} The Ki 's are keywords functioning as attributes and the value xi can be thought of as a numerical weight. Frequently documents are simply characterised by the absence or presence of keywords, in which case we write R = {Kt1, Kt2, . . . , Kti} where Kti is present if xti = 1 and is absent otherwise. Records are collected into logical units called files. They enable one to refer to a set of records by name, the file name. The records within a file are often organised according to relationships between the records. This logical organisation has become known as a file structure (or data structure).
It is difficult in describing file structures to keep the logical features separate from the physical ones. The latter are characteristics forced upon us by the
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 141
recording media (e.g. tape, disk). Some features can be defined abstractly
(with little gain) but are more easily understood when illustrated concretely.
One such feature is a field. In any implementation of a record, the attribute
values are usually positional, that is the identity of an attribute is given by
the position of its attribute value within the record. Therefore the data within
a record is registered sequentially and has a definite beginning and end.
The record is said to be divided into fields and the nth field carries the nth
attribute value. Pictorially we have an example of a record with associated
fields in Figure 6.1.
Figure 6.1: An example of a record with associated field.
The fields are not necessarily constant in length. To find the value of the
attribute K4, we first find the address of the record R (which is actually the
address of the start of the record) and read the data in the 4th field.
In the same picture I have also shown some fields labelled Pi. They are
addresses of other records, and are commonly called pointers. Now we
have extended the definition of a record to a set of attribute-value pairs and
pointers. Each pointer is usually associated with a particular attribute-value
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 142
pair. For example, (see Figure 6.2) pointers could be used to link all records for which the value x1 (of attribute K1) is a, similarly for x2 equal to b, etc. Figure 6.2: A demonstration of the use of pointers to link record To indicate that a record is the last record pointed to in a list of records we use the null pointer. The pointer associated with attribute K in record R will be called a K-pointer. An attribute (keyword) that is used in this way to organise a file is called a key. To unify the discussion of file structures we need some further concepts. Following Hsiao and Harary again, we define a list L of records with respect to a keyword K, or more briefly a K-list as a set of records containing K such that: (1) the K-pointers are distinct; (2) each non-null K-pointer in L gives the address of a record within L; (3) there is a unique record in L not pointed to by any record containing K; it is called the beginning of the list; and (4) there is a unique record in L containing the null K-pointer; it is the end of the list. From our previous example: K1-list : R1, R2, R5 K2-list : R2, R4
K4-list : R1, R2, R3
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 143
Finally, we need the definition of a directory of a file. Let F be a file whose records contain just m different keywords K1, K2, . . . , Km. Let ni be the number of records containing the keyword Ki, and hi be the number of Ki-lists in F. Furthermore, we denote by aij the beginning address of the jth Ki-list. Then the directory is the set of sequences (Ki, ni, hi, ai1, ai2, . . . aihi) i = 1, 2, . . . m We are now in a position to give a unified treatment of sequential files, inverted files, index-sequential files and multi-list files. 6.5 Sequential files A sequential file is the most primitive of all file structures. It has no directory and no linking pointers. The records are generally organised in lexicographic order on the value of some key. In other words, a particular attribute is chosen whose value will determine the order of the records. Sometimes when the attribute value is constant for a large number of records a second key is chosen to give an order when the first key fails to discriminate. The implementation of this file structure requires the use of a sorting routine. Its main advantages are: (1) it is easy to implement; (2) it provides fast access to the next record using lexicographic order. Its disadvantages: (1) it is difficult to update - inserting a new record may require moving a large proportion of the file; (2) random access is extremely slow. Sometimes a file is considered to be sequentially organised despite the fact that it is not ordered according to any key. Perhaps the date of acquisition is considered to be the key value, the newest entries are added to the end of the file and therefore pose no difficulty to updating.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 144
6.6 Inverted files An inverted file is a file structure in which every list contains only one record. Remember that a list is defined with respect to a keyword K, so every K-list contains only one record. This implies that the directory will be such that ni = hi for all i, that is, the number of records containing Ki will equal the number of Ki-lists. So the directory will have an address for each record containing Ki . For document retrieval this means that given a keyword we can immediately locate the addresses of all the documents containing that keyword. For the previous example let us assume that a non-black entry in the field corresponding to an attribute indicates the presence of a keyword and a black entry its absence. Then the directory will point to the file in the way shown in Figure 6.3. The definition of an inverted files does not require that the addresses in the directory are in any order. However, to facilitate operations such as conjunction ('and') and disjunction ('or') on any two inverted lists, the addresses are normally kept in record number order. This means that 'and' and 'or' operations can be performed with one pass through both lists. The penalty we pay is of course that the inverted file becomes slower to update.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 145
Fig. 6.3: An inverted file
6.7 Index-sequential files
An index-sequential file is an inverted file in which for every keyword Ki , we
have ni = hi = 1 and a11 <a21 . . . <am1. This situation can only arise if
each record has just one unique keyword, or one unique attribute-value. In
practice therefore, this set of records may be order sequentially by a key.
Each key value appears in the directory with the associated address of its
record. An obvious interpretation of a key of this kind would be the record
number. In our example none of the attributes would do the job except the
record number. Diagrammatically the index-sequential file would therefore
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 146
appear as shown in Figure 6.4. I have deliberately written Ri instead of Ki to emphasise the nature of the key. Fig. 6.4: An index sequential file
In the literature an index-sequential file is usually thought of as a sequential file with a hierarchy of indices. This does not contradict the previous definition, it merely describes the way in which the directory is implemented. It is not surprising therefore that the indexes ('index' = 'directory' here) are often oriented to the characteristics of the storage medium. For example (see Figure 6.5) there might be three levels of indexing: track, cylinder and master. Each entry in the track index will contain enough information to locate the start of the track, and the key of the last record in the track which is also normally the highest value on that track. There is a track index for
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 147
each cylinder. Each entry in the cylinder index gives the last record on each cylinder and the address of the track index for that cylinder. If the cylinder index itself is stored on tracks, then the master index will give the highest key referenced for each track of the cylinder index and the starting address of that track. Self Assessment Question 1. Discuss the different terminologies used in file structure. 2. What are sequential files and inverted files? 3. Write short note on index-sequential file. 4. Give the advantages of sequential files. 5. Compare sequential file, inverted file sand index sequential file.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 148
Fig. 6.5: As example of an implementation of an index sequential file
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 149
No mention has been made of the possibility of overflow during an updating process. Normally provision is made in the directory to administer an overflow area. This of course increases the number of book-keeping entries in each entry of the index. 6.8 Multi-lists A multi-list is really only a slightly modified inverted file. There is one list per keyword, i.e. hi = 1. The records containing a particular keyword Ki are chained together to form the Ki-list and the start of the Ki-list is given in the directory, as illustrated in Figure 6.4. Since there is no K3-list, the field reserved for its pointer could well have been omitted. So could any blank pointer field, so long as no ambiguity arises as to which pointer belongs to which keyword. One way of ensuring this, particularly if the data values (attribute-values) are fixed format, is to have the pointer not pointing to the beginning of the record but pointing to the location of the next pointer in the chain. The multi-list is designed to overcome the difficulties of updating an inverted file. The addresses in the directory of an inverted file are normally kept in record-number order. But, when the time comes to add a new record to the file, this sequence must be maintained, and inserting the new address can be expensive. No such problem arises with the multi-list, we update the appropriate K-lists by simply chaining in the new record. The penalty we pay for this is of course the increase in search time. This is in fact typical of many of the file structures. Inherent in their design is a trade-off between search time and update time. 6.9 Cellular multi-lists
A further modification of the multi-list is inspired by the fact that many storage media are divided into pages, which can be retrieved one at a time.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 150
A K-list may cross several page boundaries which means that several pages may have to be accessed to retrieve one record. A modified multi-list structure which avoids this is called a cellular multi-list. The K-lists are limited so that they will not cross the page (cell) boundaries. At this point the full power of the notation introduced before comes into play. The directory for a cellular multi-list will be the set of sequences (Ki, ni, hi, ai1, . . . aihi) i = 1, 2, . . . , m where the hi have been picked to ensure that a Ki-list does not cross a page boundary. In an implementation, just as in the implementation of an index-sequential file, further information will be stored with each address to enable the right page to be located for each key value.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 151
Fig. 6.6: A multi-list 6.10 Ring structures A ring is simply a linear list that closes upon itself. In terms of the definition of a K-list, the beginning and end of the list are the same record. This data-structure is particularly useful to show classification of data.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 152
Fig. 6.7: Adendrogran Let us suppose that a set of documents {Dl, D2, D3, D4, D5, D6, D7, D8} has been classified into four groups, that is {(Dl, D2), (D3, D4), (D5, D6), (D7, D8)} Furthermore these have themselves been classified into two groups, {((Dl, D2), (D3, D4)), ((D5, D6), (D7, D8))} The dendrogram for this structure would be that given in Figure 6.7. To represent this in storage by means of ring structures is now a simple matter (see Figure 6.8).
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 153
Fig. 6.8: An implementation of dendorgram vis ring structure The Di indicates a description (representation) of a document. Notice how the rings at a lower level are contained in those at a higher level. The field marked Ci normally contains some identifying information with respect to the ring it subsumes. For example, C1 in some way identifies the class of documents {D1, D2}. Were we to group documents according to the keywords they shared, then for each keyword we would have a group of documents, namely, those which had that keyword in common. Ci would then be the field containing the keyword uniting that particular group. The rings would of course overlap (Figure 6.9), as in this example: D1 = {K1, K2} D2 = {K2, K3} D3 = {K1, K4}
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 154
Fig. 6.9: Two overlapping rings The usefulness of this kind of structure will become more apparent when we discuss searching of classifications. If each ring has associated with it a record which contains identifying information for its members, then, a search strategy searching a structure such as this will first look at Ci (or Ki in the second example) to determine whether to proceed or abandon the search. 6.11 Threaded lists In this section an elementary knowledge of list processing will be assumed. A simple list representation of the classification ((D1, D2), (D3, D4)), ((D5, D6), (D7, D8)) is given in Figure 6.10. Each sublist in this structure has associated with it a record containing only two pointers. (We can assume that Di is really a pointer to document Di.) The function of the pointers should be clear from the diagram. The main thing to note, however, is that the record associated with a list does not contain any identifying information.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 155
Fig. 6.10: A list structure implementation of hierarchic classification A modification of the implementation of a list structure like this which makes it resemble a set of ring structures is to make the right hand pointer of the last element of a sublist point back to the head of the sublist. Each sublist has become effectively a ring structure. We now have what is commonly called a threaded list (see Figure 6.11). The representation I have given is a slight oversimplification in that we need to flag which elements are data elements (giving access to the documents Di) and which elements are just pointer elements. The major advantage associated with a threaded list is that it can be traversed without the aid of a stack. Normally when traversing a conventional list structure the return addresses are stacked, whereas in the threaded list they have been incorporated in the data structure. One disadvantage associated with the use of list and ring structures for representing classifications is that they can only be entered at the 'top'. An additional index giving entry to the structure at each of the data elements increases the update speed considerably.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 156
Fig. 6.11: A threaded list implementation of a hierarchic classification Another modification of the simple list representation has been studied extensively by Stanfel and Patt. The individual elements (or cells) of the list structure are modified to incorporate one extra field, so that instead of each element looking like this where the Pis are pointers and S is a symbol. Otherwise no essential change has been made to the simple representation. This structure has become known as the Doubly Chained Tree. Its properties have mainly been investigated for storing variable length keys, where each key is made up by selecting symbols from a finite (usually small) alphabet. For example, let {A,B,C} be the set of key symbols and let R1, R2, R3, R4, R5 be five records to be stored. Let us assign keys made of the 3 symbols, to the record as follows: AAA R1
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 157
AB R2 AC R3 BB R4 BC R5 An example of a doubly chained tree containing the keys and giving access to the records is given in Figure 6.12. The topmost element contains no symbol, it merely functions as the start of the structure. Given an arbitrary key its presence or absence is detected by matching it against keys in the structure. Matching proceeds level by level, once a matching symbol has been found at one level, the P1 pointer is followed to the set of alternative symbols at the next level down. The matching will terminate either: (1) when the key is exhausted, that is, no more key symbols are left to match; or (2) when no matching symbol is found at the current level. Fig. 6.12: An example of a doubly chained trees
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 158
For case (1) we have: (a) the key is present if the P1 pointer in the same cell as the last matching symbol now points to a record; (b) P1 points to a further symbol, that is, the key 'falls short' and is therefore not in the structure. For case (2), we also have that the key is not in the structure, but now there is a mismatch. Stanfel and Patt have concentrated on generating search trees with minimum expected search time, and preserving this property despite updating. For the detailed mathematics demonstrating that this is possible the reader is referred to their cited work.
Self Assessment Question
1. What are multi lists and cellular multi lists?
2. Give the advantage of multi lists.
3. What are threaded lists?
4. Write short note on:
a. Multi-list
b. Cellular multi-list
c. Ring structures
6.12 Trees
Although computer scientists have adopted trees as file structures, their properties were originally investigated by mathematicians. In fact, a substantial part of the Theory of Graphs is devoted to the study of trees. There are numerous definitions of trees. I have chosen a particularly simple one from Berge. If we think of a graph as a set of nodes (or points or vertices) and a set of lines (or edges) such that each line connects exactly
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 159
two nodes, then a tree is defined to be a finite connected graph with no cycles, and possessing at least two nodes. To define a cycle we first define a chain. We represent the line uk joining two nodes x and y by uk = [x,y]. A chain is a sequence of lines, in which each line uk has one node in common with the preceding line uk-1, and the other vertex in common with the succeeding line uk+1. An example of a chain is [a,x1], [x1,x2], [x2,x3], [x3,b]. A cycle is a finite chain which begins at a node and terminates at the same node (i.e. in the example a = b). Berge gives the following theorem showing many equivalent characterizations of trees. Theorem. Let H be a graph with at least n nodes, where n > 1; any one of the following equivalent properties characterises a tree. (1) H is connected and does not possess any cycles. (2) H contains no cycles and has n - 1 lines. (3) H is connected and has n - l lines. (4) H is connected but loses this property if any line is deleted. (5) Every pair of nodes is connected by one and only one chain.
One thing to be noticed in the discussion so far is that no mention has been made of a direction associated with a line. In most applications in computer science (and IR) one node is singled out as special. This node is normally called the root of the tree, and every other node in the tree can only be reached by starting at the root and proceeding along a chain of lines until the node sought is reached. Implicitly therefore, a direction is associated with each line. In fact, when one comes to represent a tree inside a computer by a list structure, often the addresses are stored in a way which allows movement in only one direction. It is convenient to think of a tree as a directed graph with a reserved node as the root of the tree. Of course, if one has a root then each path (directed chain) starting at the root will eventually
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 160
terminate at a particular node from which no further branches will emerge. These nodes are called the terminal nodes of the tree. By now it is perhaps apparent that when we were talking about ring structures and threaded lists in some of our examples we were really demonstrating how to implement a tree structure. The dendrogram in Figure 6.7 can easily be represented as a tree (Figure 6.13). The documents are stored at the terminal nodes and each node represents a class (cluster) of documents. A search for a particular set of documents would be initiated at the root and would proceed along the arrows until the required class was found. Fig. 6.13: A tree representation of a dendrogran Another example of a tree structure is the directory associated with an index-sequential file. It was described as a hierarchy of indexes, but could equally well have been described as a tree structure.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 161
The use of tree structures in computer science dates back to the early 1950s when it was realized that the so-called binary search could readily be represented by a binary tree. A binary tree is one in which each node (except the terminal nodes) has exactly two branches leaving it. A binary search is an efficient method for detecting the presence or absence of a key value among a set of keys. It presupposes that the keys have been sorted. It proceeds by successive division of the set, at each division discarding half the current set as not containing the sought key. When the set contains N sorted keys the search time is of order log2N. Furthermore, after some thought one can see how this process can be simply represented by a binary tree. Unfortunately, in many applications one wants the ability to insert a key which has been found to be absent. If the keys are stored sequentially then the time taken by the insertion operation may be of order N. If one, however, stores the keys in a binary tree this lengthy insert time may be overcome, both search and insert time will be of order log2N. The keys are stored at the nodes, at each node a left branch will lead to 'smaller' keys, a right branch will lead to 'greater' keys. A search terminating on a terminal node will indicate that the key is not present and will need to be inserted. The structure of the tree as it grows is largely dependent on the order in which new keys are presented. Search time may become unnecessarily long because of the lop-sidedness of the tree. Fortunately, it can be shown that random insertions do not change the expected log2N time dependence of the tree search. Nevertheless, methods are available to prevent the possibility of degenerate trees. These are trees in which the keys are stored in such a way that the expected search time is far from optimal. For example, if the keys were to arrive for insertion already ordered then the tree to be built would simply be as shown in Figure 6.14.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 162
It would take us too far a field for me to explain the techniques for avoiding degenerate trees. Essentially, the binary tree is maintained in such a way that at any node the subtree on the left branch has approximately as many levels as the subtree on the right branch. Hence the name balanced tree for such a tree. The search paths in a balanced tree will never be more than 45 per cent longer than the optimum. The expected search and insert times are still of order log N.. Fig. 6.14: An example of a degenerate tree
So far we have assumed that each key was equally likely as a search argument. If one has data giving the probability that the search argument is Ki (a key already in the tree), and the probability that the search argument lies between Ki and Ki+1, then again techniques are known for reordering the tree to optimize the expected search time. Essentially one makes sure that the more frequently accessed keys have the shortest search paths from the root. One well-known technique used when only the second set of probabilities is known, and the others assigned the value zero, is the Hu-Tucker algorithm. Again the interested reader may consult Knuth.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 163
At this point it is probably a good idea to point out that these efficiency considerations are largely irrelevant when it comes to representing a document classification by a tree structure. The situation in document retrieval is different in the following aspects: (1) we do not have a useful linear ordering on the documents; (2) a search request normally does not seek the absence or presence of a document. In fact, what we do have is that documents are more or less similar to each other, and a request seeks documents which in some way best match the request. A tree structure representing a document classification is therefore chosen so that similar documents may be close together. Therefore to rearrange a tree structure to satisfy some 'balancedness' criterion is out of the question. The search efficiency is achieved by bringing together documents which are likely to be required together. This is not to say that the above efficiency considerations are unimportant in the general context of IR. Many operations, such as the searching of a dictionary, and using a suffix stripping algorithm can be made very efficient by appropriately structuring the binary tree. The discussion so far has been limited to binary trees. In many applications this two-way split is inappropriate. The natural way to represent document classifications is by a general tree structure, where there is no restriction on the number of branches leaving a node. Another example is the directory of an index sequential file which is normally represented by an m-way tree, where m is the number of branches leaving a node.
Finally, more comments are in order about the manipulation of tree structures in mass storage devices. Up to now we have assumed that to follow a set of pointers poses no particular problems with regard to retrieval
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 164
speed. Unfortunately, present random access devices are sufficiently slow for it to be impossible to allow an access for, say, each node in a tree. There are ways of partitioning trees in such a way that the number of disk accesses during a tree search can be reduced. Essentially, it involves storing a number of nodes together in one 'page' of disk storage. During a disk access this page is brought into fast memory, is then searched, and the next page to be accessed is determined. 6.13 Scatter storage or hash addressing One file structure which does not relate very well to the ones mentioned before is known as Scatter Storage. The technique by which the file structure is implemented is often called Hash Addressing. Its underlying principle is appealingly simple. Given that we may access the data through a number of keys Ki, then the address of the data in store is located through a key transformation function f which when applied to Ki evaluates to give the address of the associated data. We are assuming here that with each key is associated only one data item. Also for convenience we will assume that each record (data and key) fits into one location, whose address is in the image space of f. The addresses given by the application of f to the keys Ki are called the hash addresses and f is called a hashing function. Ideally f should be such that it spreads the hash addresses uniformly over the available storage. Of course this would be achieved if the function were one-to-one. Unfortunately this cannot be so because the range of possible key values is usually considerably larger than the range of the available storage addresses. Therefore, given any hashing function we have to contend with the fact that two distinct keys Ki and Kj are likely to map to the same address f(Ki) (=f(Kj)). Before I explain some of the ways of dealing with this I shall give a few examples of hashing functions.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 165
Let us assume that the available storage is of size 2[m] then three simple transformations are as follows: (1) if Ki is the key, then take the square of its binary representation and select m bits from the middle of the result; (2) cut the binary representation of Ki into pieces each of m bits and add these together. Now select the m least significant bits of the sum as the hash address; (3) divide the integer corresponding to Ki by the length of the available store 2 and use the remainder as the hash address. Each of these methods has disadvantages. For example, the last one may given the same address rather frequently if there are patterns in the keys. As mentioned before there is the problem of collisions, that is, when two distinct keys hash to the same address. The first point to be made about this problem is that it destroys some of the simplicity of hashing. Initially it may have been thought that the key need not be stored with the data at the hash address. Unfortunately this is not so. No matter what method we use to resolve collisions we still need to store the key with the data so that at search time when a key is hashed we can distinguish its data from the data associated with keys which have hashed to the same address.
There are a number of strategies for dealing with collisions. Essentially they fall into two classes, those which use pointers to link together collided keys and those which do not. Let us first look at the ones which do not use pointers. These have a mechanism for searching the store, starting at the address where the collision occurred, for an empty storage location if a record needs to be inserted, or, for a matching key value at retrieval time. The simplest of these advances from the hash address each time moving along a fixed number of locations, say s, until an empty location or the matching key value is found. The collision strategy thus traces out a well
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 166
defined sequence of locations. This method of dealing with collisions is called the linear method. The tendency with this method is to store collided records as closely to the initial hash address as possible. This leads to an undesirable effect called primary clustering. In this context all this means is that the records tend to concentrate in groups or bunch-up. It destroys the uniform nature of the hashing function. To be more precise, it is desirable that hash addresses are equally likely, however, the first empty location at the end of a collision sequence increases in likelihood in proportion to the number of records in the collision sequence. To see this one needs only to realize that a key hashed to any location in the sequence will have its record stored at the end of the sequence. Therefore big groups of records tend to grow even bigger. This phenomenon is aggravated by a small step size s when seeking an empty location. Sometimes s = 1 is used in which case the collision strategy is known as the open addressing technique. Primary clustering is also worse when the hash table (available storage) is relatively full. Variations in the linear method which avoid primary clustering involve making the step size a variable. One way is to set s equal to ai + bi on the ith step. Another is to invoke a random number generator which calculates the step size afresh each time. These last two collision handling methods are called the quadratic and random method respectively. Although they avoid primary clustering they are nevertheless subject to secondary clustering, which is caused by keys hashing to the same address and following the same sequence in search of an empty location.
The second class of collision handling methods involves extra storage space which is used to chain together collided records. When a collision occurs at a hash address it may be because it is the head of a chain of records which have all hashed to that address, or it may be that a record is stored there which belongs to a chain starting at some other address. In
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 167
both cases a free location is needed which in the first case is simply linked in and stores the new record, in the second case the intermediate chain element is moved to the free location and the new record is stored at its own hash address thus starting a new chain (a one-element chain so far). A variation on this method is to use a two-level store. At the first level we have a hash table, at the second level we have a bump table which contains all the collided records. At a hash address in the hash table we will find either, a record if no collisions have taken place at that address, or, a pointer to a chain of records which collided at that address. This latter chaining method has the advantage that records need never be moved once they have been entered in the bump table. The storage overhead is larger since records are put in the bump table before the hash table is full. For both classes of collision strategies one needs to be careful about deletions. For the linear, quadratic etc. collision handling strategies we must ensure that when we delete a record at an address we do not make records which collided at that address unreachable. Similarly with the chaining method we must ensure that a deleted record does not leave a gap in the chain, that is, after deletion the chain must be reconnected. The advantages of hashing are several. Firstly it is simple. Secondly its insertion and search strategies are identical. Insertion is merely a failed search. If Ki is the hashed key, then if a search of the collision sequence fails to turn up a match in Ki, its record is simply inserted at the end of the sequence at the next free location. Thirdly, the search time is independent of the number of keys to be inserted.
The application of hashing in IR has tended to be in the area of table construction and look-up procedures. An obvious application is when constructing the set of conflation classes during text processing. During a retrieval operation, a query will first be converted into a list of class names.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 168
To do this each significant word needs to be looked up in a dictionary which gives the name of the class to which it belongs. Clearly there is a case for hashing. We simply apply the hashing function to the word and find the name of the conflation class to which it belongs at the hash address. 6.14 Clustered files
It is now common practice to refer to a file processed by a clustering algorithm as a clustered file, and to refer to the resulting structure as a file structure. For example Salton lists a clustered file as an alternative organisation to inverted, serial, chained files, etc. Although it may be convenient terminologically, it does disguise the real status of cluster methods. Cluster methods (or automatic classification methods) are more profitably discussed at the level of abstraction at which relations are discussed in connection with data bases, that is, in a thoroughly data independent way. In other words, selecting an appropriate cluster method and implementing it are two separate problems. Unfortunately not all users of clustering techniques see it this way, and so the current scene is rather confused. One factor contributing to the confusion is that clustering techniques have been used at a very low level of implementation of system software, for example, to reduce the number of page exceptions in a virtual memory. Therefore, those who use clustering merely to increase retrieval efficiency (in terms of storage and speed) will tend to see a classification structure as a file structure, whereas those who see clustering as a means of discovering (or summarizing) some inherent structure in the data will look upon the same structure as a description of the data. Of course, this description may be used to achieve more efficient retrieval (and in IR more effective retrieval in terms of say precision and recall). Furthermore, if one looks carefully at some of the implementations of cluster methods one
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 169
discovers that the classificatory system is represented inside the computer by one of the more conventional file structures.
6.15 B -Tree File Organization
The B -tree structure is used not only as an index but also as an organizer for records into a file. In a B -tree file organization, the leaf nodes of the tree store records instead of storing pointers to records. Since records are usually larger than pointers, the maximum number of records that can be stored in a leaf node is less than the maximum number of pointers in a non-leaf node. However, the leaf node are still required to be at least half full. Insertion and deletion from a B -tree file organization are handled in the same way as that in a B -tree index. When a B -tree is used for file organization, space utilization is particularly important. We can improve the space utilization by involving more sibling nodes in redistribution during splits and merges. In general, if m nodes are involved in redistribution, each node can be guaranteed to contain at least [(m – 1) n / m] entries. However, the cost of update becomes higher as more siblings are involved in redistribution.
6.16 B-Tree Index Files
B-tree indices are similar to B -tree indices. Difference is that B-tree eliminates the redundant storage of search key values. A corresponding B-tree of Figure 6.15 A allows search key values to appear only once. Thus we can store the index in less space.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 170
Figure 6.15: Leaf and nonleaf node of a B-tree
1. Advantages:
o Lack of redundant storage (but only marginally different).
o Some searches are faster (key may be in non-leaf node).
2. Disadvantages:
o Leaf and non-leaf nodes are of different size (complicates storage)
o Deletion may occur in a non-leaf node (more complicated)
Generally, the structural simplicity of B -tree is preferred.
6.17 Dynamic Hashing
As the database grows over time, we have three options:
o Choose hash function based on current file size. Get performance degradation as file grows.
o Choose hash function based on anticipated file size. Space is wasted initially.
o Periodically re-organize hash structure as file grows. Requires selecting new hash function, recomputing all addresses and generating new bucket assignments. Costly, and shuts down database.
Some hashing techniques allow the hash function to be modified dynamically to accommodate the growth or shrinking of the database. These are called dynamic hash functions. Extendable hashing is one form of dynamic hashing. Extendable hashing splits and coalesces buckets as
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 171
database size changes. This imposes some performance overhead, but space efficiency is maintained. As reorganization is on one bucket at a time, overhead is acceptably low. 6.17.1 How does it work? Figure 6.16: General extendable hash structure. 1. We choose a hash function that is uniform and random that generates values over a relatively large range.
o Range is b-bit binary integers (typically b=32).
o is over 4 billion, so we don't generate that many buckets!
o Instead we create buckets on demand, and do not use all b bits of the hash initially.
o At any point we use i bits where .
o The i bits are used as an offset into a table of bucket addresses.
o Value of i grows and shrinks with the database.
o Figure 6.16 shows an extendable hash structure.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 172
o Note that the i appearing over the bucket address table tells how many bits are required to determine the correct bucket.
o It may be the case that several entries point to the same bucket.
o All such entries will have a common hash prefix, but the length of this prefix may be less than i.
o So we give each bucket an integer giving the length of the common hash prefix.
o This is shown in Figure - B as .
o Number of bucket entries pointing to bucket j is then .
2. To find the bucket containing search key value :
o Compute .
o Take the first i high order bits of .
o Look at the corresponding table entry for this i-bit string.
o Follow the bucket pointer in the table entry.
3. We now look at insertions in an extendable hashing scheme.
o Follow the same procedure for lookup, ending up in some bucket j.
o If there is room in the bucket, insert information and insert record in the file.
o If the bucket is full, we must split the bucket, and redistribute the records.
o If bucket is split we may need to increase the number of bits we use in the hash.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 173
4. Two cases exist:
i. If , then only one entry in the bucket address table points to bucket j.
o Then we need to increase the size of the bucket address table so that we can include pointers to the two buckets that result from splitting bucket j.
o We increment i by one, thus considering more of the hash, and doubling the size of the bucket address table.
o Each entry is replaced by two entries, each containing original value.
o Now two entries in bucket address table point to bucket j.
o We allocate a new bucket z, and set the second pointer to point to z.
o Set and to i.
o Rehash all records in bucket j which are put in either j or z.
o Now insert new record.
o It is remotely possible, but unlikely, that the new hash will still put all of the records in one bucket.
o If so, split again and increment i again.
ii. If , then more than one entry in the bucket address table points to bucket j.
o Then we can split bucket j without increasing the size of the bucket address table (why?).
o Note that all entries that point to bucket j correspond to hash prefixes that have the same value on the leftmost bits.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 174
o We allocate a new bucket z, and set and to the original value plus 1.
o Now adjust entries in the bucket address table that previously pointed to bucket j.
o Leave the first half pointing to bucket j, and make the rest point to bucket z.
o Rehash each record in bucket j as before.
o Reattempt new insert.
5. Note that in both cases we only need to rehash records in bucket j.
6. Deletion of records is similar. Buckets may have to be coalesced, and bucket address table may have to be halved.
Advantages:
o Extendable hashing provides performance that does not degrade as the file grows.
o Minimal space overhead - no buckets need be reserved for future use. Bucket address table only contains one pointer for each hash value of current prefix length.
Disadvantages:
o Extra level of indirection in the bucket address table
o Added complexity
Self Assessment Question
1. What is hash addressing and scatter storage?
2. Give the advantage of B tree index files.
3. Define tree and give examples.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 175
4. Write short note on:
a. Clustered files
b. B+ tree file organization
c. B tree index file
5. What is dynamic hashing and give its advantage and disadvantages?
6.18 Summary Logical structure of the data is the relationship that will exist between data items independently. Data independency aims at the programs which are written independently of the logical structure of the data. Network approach hierarchical approaches are different methods used towards this. Sequential file is the most primitive file structure with no directory and no linking pointers. An inverted file is a file structure in which every list contains only one record. Index sequential file and multi lists are special inverted files. Hash addressing is the technique by which the file structures are implemented. 6.19 Terminal Questions 1. Discuss the techniques for allowing a hash file to expand and shrink dynamically. What are the advantages and disadvantages of each? 2. What are mixed files used for? What are other types of primary file organizations?
3. A parts file with Part# as hash key includes records with the following Part# values: 2369, 3760, 4692, 4871, 5659, 1821, 1074, 7115, 1620, 2426, 3943, 4750, 6975, 4981, 9208. The file uses eight buckets, numbered 0 to 7. Each bucket is one disk block and holds two records. Load these records into the file in the given order, using the hash function h(K) = K mod 8. Calculate the average number of block accesses for a random retrieval on Part#.
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 176
4. Suppose that we have a hash file of fixed-length records, and suppose that over-flow is handled by chaining. Outline algorithms for insertion, deletion, and modification of a file record. State any assumptions you make. 5. What is the order p of a B-tree? Describe the structure of B-tree nodes? How does a B-tree differ from a B+-tree? Why is a B+-tree usually preferred as an access structure to a data file?
Advanced Data Structures using ‘C’ Unit 6
Sikkim Manipal University Page No.: 177
References
1. HSIAO D. and HARARY F., 'A formal system for information retrieval from files', Communications of the ACM, 13, 67-7 3 (1970)
2. ROBERTS D. C., 'File organization techniques', Advances in Computers, 12, 115-174 (1972)
3. BERTZISS A. T., Data Structures: Theory and Practice, Academic Press, London and New York (1971)
4. DODD G. G., 'Elements of data management systems', Computing Surveys, 1, 117-133 (1969)
5. CLIMENSON W. D., 'File organization and search techniques', Annual Review of Information Science and Technology, 1, 107-135 (1966)
6. SEVERANCE D. G., 'A parametric model of alternative file structures', Information Systems, 1, 51-55 (1975)
7. Van RIJSBERGEN C. J., 'File organization in library automation and information retrieval', Journal of Documentation, 32, 294-317 ( 1976)
No comments:
Post a Comment