File Organization

6.1 Files: Concept, Need, Primitive Operations

A file is a collection of related records. The size of a file is limited by the capacity of memory and of the storage medium.

A file has two important characteristics:

1. File Activity

2. File Volatility

File activity specifies the percentage of a file's records that are actually processed in a single run.

File volatility describes how often records are added, changed, or deleted. For a highly volatile file, a disk-based design is more efficient than a tape-based one.

File Organization

File organization ensures that records are available for processing. The aim is to choose an efficient organization for each base relation.

For example, if we want to retrieve employee records in alphabetical order of name, sorting the file by employee name is a good file organization. However, if we want to retrieve all employees whose marks are in a certain range, a file ordered by employee name would not be a good file organization.

Types of File Organization

There are three ways of organizing a file:

1. Sequential access file organization

2. Direct access file organization

3. Indexed sequential access file organization

1. Sequential access file organization

Storing records in sorted order in contiguous blocks of a file on tape or disk is called sequential access file organization.

In sequential access file organization, all records are stored in sequential order, arranged in ascending or descending order of a key field.

A search of a sequential file starts from the beginning of the file, and new records can be added at the end of the file.

In a sequential file, it is not possible to add a record in the middle of the file without rewriting the file.

Advantages of sequential file

It is simple to program and easy to design.

A sequential file makes the best use of storage space.

Disadvantages of sequential file

Processing a sequential file is time consuming.

It has high data redundancy.

Random searching is not possible.

2. Direct access file organization

Direct access file organization is also known as random access or relative file organization.

In a direct access file, all records are stored on a direct access storage device (DASD), such as a hard disk. The records are placed randomly throughout the file.

The records do not need to be in sequence because they are updated directly and rewritten back to the same location.

This file organization is useful for immediate access to large amounts of information. It is used for accessing large databases.

It is also called hashing, because the location of a record is typically computed from its key.

Advantages of direct access file organization

Direct access file helps in online transaction processing system (OLTP) like online railway reservation system.

In a direct access file, sorting of the records is not required.

It accesses the desired records immediately.

It updates several files quickly.

It has better control over record allocation.

Disadvantages of direct access file organization

Direct access file does not provide backup facility.

It is expensive.

It uses storage space less efficiently than a sequential file.

3. Indexed sequential access file organization

Indexed sequential access file combines both sequential file and direct access file organization.

In an indexed sequential access file, records are stored on a direct access device such as a magnetic disk and are located by a primary key.

Such a file can have multiple keys, which may be alphanumeric. The key by which the records are ordered is called the primary key.

The data can be accessed either sequentially or randomly using the index. The index is stored in a file and read into memory when the file is opened.

Advantages of Indexed sequential access file organization

In an indexed sequential access file, both sequential and random access are possible.

It accesses the records very fast if the index table is properly organized.

The records can be inserted in the middle of the file.

It provides quick access for sequential and direct processing.

It reduces the degree of the sequential search.

Disadvantages of Indexed sequential access file organization

Indexed sequential access file requires unique keys and periodic reorganization.

Searching the index adds time to each data access or retrieval.

It requires more storage space.

It is expensive because it requires special software.

It is less efficient in the use of storage space as compared to other file organizations.

6.2 Sequential File Organization – Concept and Primitive Operations

Sequential file organization is one of the simplest methods of file organization. Records are stored one after the other in a sequential manner. This can be achieved in two ways:

In the first method, records are stored one after the other in the order in which they are inserted into the table. This is called the pile file method. When a new record is inserted, it is placed at the end of the file. To modify or delete a record, the memory blocks are searched for it. Once it is found, it is marked for deletion and the new version of the record is entered.

Inserting a new record:

Here R1, R2, R3, etc. are the records. Each contains all the attributes of a row. For example, a student record has the student's id, name, address, course, date of birth, and so on. R1, R2, R3, etc. can each be considered one full set of attributes.

In the second method, records are kept sorted (in ascending or descending order) each time they are inserted into the system. This is called the sorted file method. Records may be sorted on the primary key or on any other column. A new record is first inserted at the end of the file, and the file is then sorted on the key value so that the record ends up in its correct position. An update changes the record and then sorts the file again to put the updated record in the right place. The same applies to deletion.

Inserting a new record:
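The two insertion methods can be compared with a minimal Python sketch. It assumes the file is held in memory as a list of (key, data) tuples; the function names `pile_insert` and `sorted_insert` are hypothetical and only mirror the description above.

```python
def pile_insert(records, record):
    """Pile file method: the new record is placed at the end of the file."""
    records.append(record)


def sorted_insert(records, record):
    """Sorted file method: append at the end, then sort on the key value."""
    records.append(record)
    records.sort(key=lambda rec: rec[0])  # rec[0] is the key field


pile = []
ordered = []
for rec in [(3, "R3"), (1, "R1"), (2, "R2")]:
    pile_insert(pile, rec)
    sorted_insert(ordered, rec)

print(pile)     # records in arrival order
print(ordered)  # records in ascending key order
```

Re-sorting after every insertion is what makes the sorted file method costly on large files, which is the disadvantage noted for this organization.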

Advantages of Sequential File Organization

The design is very simple compared with other file organizations. Little effort is involved in storing the data.

When there are large volumes of data, this method is fast and efficient. It is helpful when most of the records have to be accessed, as in calculating student grades or generating salary slips, where all the records are used in the calculation.

This method is good in case of report generation or statistical calculations.

These files can be stored on magnetic tapes, which are comparatively cheap.

Disadvantages of Sequential File Organization

The sorted file method always involves the effort of sorting the records. Each time an insert, update, or delete transaction is performed, the file is sorted. Hence locating the record, inserting, updating, or deleting it, and then sorting the file always takes time and may make the system slow.

6.3 Direct Access File – Concepts and Primitive Operations

When a file is used, its information is read into computer memory, and there are several ways to access this information. Some systems provide only one access method for files. Other systems, such as those of IBM, support many access methods, and choosing the right one for a particular application is a major design problem.

There are three ways to access a file in a computer system: sequential access, direct access, and the index sequential method.

1. Sequential Access –

It is the simplest access method. Information in the file is processed in order, one record after the other. This mode of access is by far the most common; for example, editors and compilers usually access files in this fashion.

Reads and writes make up the bulk of the operations on a file. A read operation (read next) reads the next portion of the file and automatically advances a file pointer, which keeps track of the I/O location. Similarly, a write operation (write next) appends to the end of the file and advances the pointer to the end of the newly written material.

Key points:

Data is accessed one record right after another, in order.

A read command moves the pointer ahead by one record.

A write command allocates space and moves the pointer to the end of the file.

Such a method is reasonable for tape.
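The read next and write next operations can be illustrated with a small in-memory sketch in Python. The class `SequentialFile` and its `reset` method are assumptions made for illustration (the notes do not name them); the list stands in for the tape or disk file.

```python
class SequentialFile:
    """In-memory stand-in for a sequential file of records (hypothetical)."""

    def __init__(self):
        self.records = []
        self.pointer = 0  # position of the next record to be read

    def read_next(self):
        """Return the record at the pointer and advance it; None at end of file."""
        if self.pointer >= len(self.records):
            return None
        record = self.records[self.pointer]
        self.pointer += 1
        return record

    def write_next(self, record):
        """Append at the end of the file and move the pointer past the new record."""
        self.records.append(record)
        self.pointer = len(self.records)

    def reset(self):
        """Rewind to the beginning of the file (assumed helper)."""
        self.pointer = 0


f = SequentialFile()
for name in ["alpha", "beta", "gamma"]:
    f.write_next(name)

f.reset()
first = f.read_next()
second = f.read_next()
```

To reach the third record, the first two must be read first, which is why this access pattern suits tape.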

2. Direct Access –

Another method is direct access, also known as relative access. The file is made up of fixed-length logical records that allow programs to read and write records rapidly, in no particular order. Direct access is based on the disk model of a file, since a disk allows random access to any file block. For direct access, the file is viewed as a numbered sequence of blocks or records. Thus, we may read block 14, then read block 59, and then write block 17. There is no restriction on the order of reading and writing for a direct access file.

A block number provided by the user to the operating system is normally a relative block number; the first relative block of the file is 0, the next is 1, and so on.
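As a rough sketch of direct access by relative block number, the Python code below computes a byte offset as block number times block size and seeks straight to it. The block size of 16 bytes, the 64-block file, and the helper names `read_block` and `write_block` are assumptions for illustration; an in-memory `io.BytesIO` object stands in for a disk file.

```python
import io

BLOCK_SIZE = 16  # assumed fixed block length in bytes


def write_block(f, block_no, data):
    """Write one fixed-length block at relative block number block_no."""
    f.seek(block_no * BLOCK_SIZE)
    f.write(data.ljust(BLOCK_SIZE, b" ")[:BLOCK_SIZE])


def read_block(f, block_no):
    """Read the block stored at relative block number block_no."""
    f.seek(block_no * BLOCK_SIZE)
    return f.read(BLOCK_SIZE)


# A file of 64 blank blocks held in memory.
f = io.BytesIO(b" " * (64 * BLOCK_SIZE))
write_block(f, 14, b"record-14")
write_block(f, 59, b"record-59")

# Blocks can be read and written in any order.
a = read_block(f, 14).rstrip()
b = read_block(f, 59).rstrip()
write_block(f, 17, b"record-17")
c = read_block(f, 17).rstrip()
```

No block between 14 and 59 is touched when moving from one to the other; only the offset changes.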

3. Index sequential method –

It is another method of accessing a file, built on top of the sequential access method. This method constructs an index for the file. The index, like an index in the back of a book, contains pointers to the various blocks. To find a record in the file, we first search the index and then use the pointer to access the file directly.

Key points:

It is built on top of Sequential access.

It positions the file pointer by using the index.
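The index-then-block lookup can be sketched in Python as follows. This is one possible arrangement, assuming the index stores the first key of each block and that records inside a block are sorted by key; the sample data and the function name `find` are hypothetical.

```python
import bisect

# Data blocks, each holding records sorted by key (hypothetical sample data).
blocks = [
    [(101, "Asha"), (105, "Bala")],
    [(110, "Chen"), (118, "Dina")],
    [(120, "Eli"), (131, "Fay")],
]

# The index holds the first key of each block; its position acts as the block pointer.
index_keys = [block[0][0] for block in blocks]


def find(key):
    """Search the index first, then read only the block it points to."""
    pos = bisect.bisect_right(index_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every key in the file
    for k, value in blocks[pos]:
        if k == key:
            return value
    return None
```

For example, `find(118)` consults the index, selects the second block, and scans only that block instead of the whole file.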

6.4 Indexed Sequential File Organization – Concept, Types of Indices, Structure of Index Sequential File

The indexed sequential access method (ISAM) is an advanced sequential file organization. In this method, records are stored in the file using the primary key. An index entry is generated for each primary key and mapped to the record. This index contains the address of the record in the file.

Figure: DBMS indexed sequential access method.

If a record has to be retrieved based on its index value, the address of its data block is fetched from the index and the record is retrieved from that address in memory.

Pros of ISAM:

Because each index entry holds the address of the record's data block, searching for a record in a huge database is quick and easy.

This method supports range retrieval and partial retrieval of records. Since the index is based on the primary key values, we can retrieve the data for a given range of values. In the same way, a partial value can also be searched easily; for example, student names starting with 'JA' can be found easily.
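Range retrieval and partial (prefix) retrieval both rely on the index being ordered by primary key. The Python sketch below assumes a dense index of (name, record address) pairs kept in sorted order; the sample names, addresses, and function names are hypothetical.

```python
import bisect

# Dense index sorted by primary key: (name, record address).
index = [("ADAM", 40), ("JACK", 0), ("JANE", 20), ("JOHN", 60), ("MARY", 80)]
keys = [k for k, _ in index]


def range_lookup(low, high):
    """Return the addresses of all records with low <= key <= high."""
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return [addr for _, addr in index[start:end]]


def prefix_lookup(prefix):
    """Return the addresses of all records whose key starts with prefix."""
    start = bisect.bisect_left(keys, prefix)
    result = []
    for k, addr in index[start:]:
        if not k.startswith(prefix):
            break  # keys are sorted, so no later key can match
        result.append(addr)
    return result
```

Here `prefix_lookup("JA")` returns the addresses for JACK and JANE, and `range_lookup("JACK", "JOHN")` returns those for JACK, JANE, and JOHN, in both cases without scanning the data file.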

Cons of ISAM:

This method requires extra space on the disk to store the index.

When new records are inserted, the file has to be reconstructed to maintain the sequence.

When a record is deleted, the space it used needs to be released. Otherwise, the performance of the database will degrade.

6.5 Linked Organization - Multi List Files, Coral Rings, Inverted Files and Cellular Partitions

This section covers the following file organizations:

1. Sequential

2. Random

3. Linked organization

4. Inverted files

5. Cellular partitions

1. Sequential Organization

In sequential organization, the records are placed sequentially onto the storage medium, i.e. they occupy consecutive locations. In the case of tape, this means placing records adjacent to each other.

In addition, the physical sequence of records is ordered on some key, called the primary key.

Sequential organization is also possible on a DASD such as a disk. Even though disk storage is really two dimensional (cylinder x surface), it may be mapped down into one-dimensional memory.

If the disk has c cylinders and s surfaces, one possibility is to view disk memory as the following sequence of tracks.

Using the notation t_ij to represent the j-th track of the i-th surface, the sequence is t_11, t_21, t_31, ..., t_s1, t_12, t_22, ..., t_s2, etc.

Figure: interpreting disk memory as sequential memory.

This sequential interpretation is particularly efficient for batched update and retrieval, as the tracks are accessed in order: all tracks on cylinder 1, followed by all tracks on cylinder 2, and so on. As a result, the read/write heads are moved one cylinder at a time, and this movement is needed only once for every s tracks.
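The track ordering described here can be written as a small Python sketch. It assumes surfaces are numbered 1 to s and cylinders 1 to c, as in the t_ij notation, and that positions in the sequence are counted from 0; the function names are hypothetical.

```python
def track_position(i, j, s):
    """0-based position of track t_ij in the sequence.

    i is the surface number (1..s) and j is the track (cylinder) number.
    """
    return (j - 1) * s + (i - 1)


def track_order(c, s):
    """List the tracks (i, j) in sequential order for c cylinders and s surfaces."""
    return [(i, j) for j in range(1, c + 1) for i in range(1, s + 1)]


order = track_order(2, 3)  # 2 cylinders, 3 surfaces
```

With 3 surfaces, `order` lists all three tracks of cylinder 1 before any track of cylinder 2, so the heads move only once every 3 tracks.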

Its main advantages are:

It is easy to implement;
It provides fast access to the next record using lexicographic order.

Its disadvantages:

It is difficult to update - inserting a new record may require moving a large proportion of the file;
Random access is extremely slow.

2. Random File organization

Records are stored at random locations on the disk. This randomization could be achieved by any of several techniques: direct addressing, directory lookup, hashing.

Direct addressing: in direct addressing with equal-size records, the available disk space is divided into nodes large enough to hold a record. The numeric value of the primary key is used to determine the node in which a particular record is stored.

Directory lookup: the index is not of the direct access type but is a dense index maintained using a structure suitable for index operations. Retrieving a record involves searching the index for the record address and then accessing the record itself. The storage management scheme depends on whether fixed-size or variable-size nodes are being used. This technique requires more accesses for retrieval and update, since index searching generally requires more than one access. In both direct addressing and directory lookup, some provision must be made to handle collisions.

Hashing: the available file space is divided into buckets and slots. Some space may have to be set aside for an overflow area if chaining is used to handle overflows. When variable-size records are present, the number of slots per bucket is only a rough indicator of the number of records a bucket can hold. The actual number varies dynamically with the size of the records in a particular bucket.

Random organization on the primary key using any of the above three techniques overcomes the difficulties of sequential organization. Insertions and deletions become easy. But batch processing of queries becomes inefficient, as records are not maintained in order of primary key. Handling range queries also becomes very inefficient, except in the case of directory lookup.
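The hashing technique, with buckets, slots, and an overflow area handled by chaining, can be sketched in Python as follows. The bucket count, the slot count, the use of the division method (key modulo number of buckets) as the hash function, and all names are assumptions chosen for illustration.

```python
NUM_BUCKETS = 4       # assumed number of buckets
SLOTS_PER_BUCKET = 2  # assumed number of record slots per bucket

buckets = [[] for _ in range(NUM_BUCKETS)]
overflow = [[] for _ in range(NUM_BUCKETS)]  # one overflow chain per bucket


def bucket_of(key):
    """Hash function: division method on an integer key (an assumption)."""
    return key % NUM_BUCKETS


def insert(key, data):
    """Store the record in its home bucket, or in the overflow chain if it is full."""
    b = bucket_of(key)
    target = buckets[b] if len(buckets[b]) < SLOTS_PER_BUCKET else overflow[b]
    target.append((key, data))


def lookup(key):
    """Search the home bucket first, then its overflow chain."""
    b = bucket_of(key)
    for k, data in buckets[b] + overflow[b]:
        if k == key:
            return data
    return None


for key in [8, 12, 16, 5]:
    insert(key, f"rec{key}")
```

Keys 8, 12, and 16 all hash to bucket 0, so the third one lands in the overflow chain. Because neighbouring keys are scattered across buckets, a range query has no choice but to probe every possible key or scan every bucket.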

3. Linked organization

Linked organizations differ from sequential organizations essentially in that the logical sequence of records is generally different from the physical sequence.

In a sequential organization, if the i-th record is placed at location l_i, then the (i+1)-th record is placed at l_i + c, where c is the length of the i-th record or some fixed constant.

In a linked organization, the next logical record is obtained by following the link value from the present record. Linking in order of increasing primary key eases insertion and deletion.

Searching for a particular record is difficult since no index is available, so only sequential search is possible.

We can facilitate searching by maintaining indexes corresponding to ranges of employee numbers, e.g. 501-700 and 701-900. All records in the same range are linked together in a list.

We can generalize this idea to secondary keys as well. We simply set up an index for each key and allow records to be on more than one list. This leads to the multilist structure for file representation.
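A multilist can be sketched in Python with one link field per key inside each record and an index that stores only the address of the first record on each list. The record layout, the field names `dept` and `loc`, and the helper names are hypothetical; list positions stand in for record addresses.

```python
# Each record carries one link field per indexed key; -1 marks the end of a list.
records = [
    {"emp": 510, "dept": "prog", "loc": "NY", "links": {}},
    {"emp": 620, "dept": "test", "loc": "NY", "links": {}},
    {"emp": 750, "dept": "prog", "loc": "LA", "links": {}},
    {"emp": 800, "dept": "prog", "loc": "NY", "links": {}},
]


def build_index(field):
    """Link records with equal field values; return {value: address of first record}."""
    index = {}
    # Walk backwards so each record links to the next later record with the same value.
    for addr in range(len(records) - 1, -1, -1):
        value = records[addr][field]
        records[addr]["links"][field] = index.get(value, -1)
        index[value] = addr
    return index


def follow(index, field, value):
    """Traverse one list by following link values from record to record."""
    addr = index.get(value, -1)
    result = []
    while addr != -1:
        result.append(addr)
        addr = records[addr]["links"][field]
    return result


dept_index = build_index("dept")
loc_index = build_index("loc")
```

Record 0 is on both the `prog` list and the `NY` list, which is the point of the multilist: one copy of each record, threaded onto one list per key.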

4. Inverted files

Inverted files are similar to multilists. In a multilist, records with the same key value are linked together, with the link information kept in the individual records. In an inverted file, the link information is kept in the index itself.

We assume that the index for every key is dense. Since the index entries are of variable length, index maintenance becomes more complex than for multilists. The benefit is that Boolean queries require only one access per record satisfying the query. Queries of the type k1=xx and k2=yy can be handled by intersecting the two lists.

Retrieval works in two steps. In the first step, the indexes are processed to obtain a list of records satisfying the query; in the second, these records are retrieved using the list. The number of disk accesses needed is equal to the number of records being retrieved plus the number needed to process the indexes.

Inverted files represent one extreme of file organization in which only the index structures are important. The records themselves can be stored in any way.

Inverted files may also result in space saving compared with other file structures when record retrieval doesn’t require retrieval of key fields. In this case key fields may be deleted from the records unlike multilist structures.
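The two-step retrieval on an inverted file can be sketched in Python as follows. The sample records, the addresses 100 to 400, and the function names are hypothetical; a dictionary from address to record stands in for the data file.

```python
# Data file: record address -> record (hypothetical sample data).
records = {
    100: {"name": "Asha", "dept": "prog", "city": "Pune"},
    200: {"name": "Bala", "dept": "test", "city": "Pune"},
    300: {"name": "Chen", "dept": "prog", "city": "Delhi"},
    400: {"name": "Dina", "dept": "prog", "city": "Pune"},
}


def build_inverted_index(field):
    """Map each value of the field to the list of addresses of records holding it."""
    index = {}
    for addr, rec in records.items():
        index.setdefault(rec[field], []).append(addr)
    return index


dept_index = build_inverted_index("dept")
city_index = build_inverted_index("city")


def query_and(dept, city):
    """Answer dept=... and city=... by intersecting two address lists."""
    # Step 1: process the indexes only; no record is accessed yet.
    addrs = sorted(set(dept_index.get(dept, [])) & set(city_index.get(city, [])))
    # Step 2: one access per record that satisfies the query.
    return [records[a]["name"] for a in addrs]
```

For `query_and("prog", "Pune")` only the records at addresses 100 and 400 are fetched; the record at 300 is ruled out from the indexes alone.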

5. Cellular partitions

To reduce the file search times, the storage media may be divided into cells. A cell may be an entire disk pack or it may simply be a cylinder. Lists are localized to lie within a cell.

Thus, if we had a multilist organization in which the list for key1=prog included records on several different cylinders, we could break the list into several smaller lists, each of which includes only those records on the same cylinder. The index entry for prog will now contain several entries of the type (addr, length), where addr is a pointer to the start of a list of records with key1=prog and length is the number of records on the list. By doing this, all records in the same cell may be accessed without moving the read/write heads.
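Splitting one list into per-cell lists can be sketched in Python as follows. The sketch assumes that a cell is a cylinder holding a fixed number of record addresses and that addresses map to cylinders by integer division; the constant, the sample addresses, and the function names are hypothetical.

```python
RECORDS_PER_CYLINDER = 4  # assumed cell size: one cylinder holds 4 record addresses


def cylinder_of(addr):
    """Cell (cylinder) that a record address falls in."""
    return addr // RECORDS_PER_CYLINDER


def cellular_index(addresses):
    """Split one key's list into per-cell lists and return (addr, length) entries."""
    cells = {}
    for addr in sorted(addresses):
        cells.setdefault(cylinder_of(addr), []).append(addr)
    return [(lst[0], len(lst)) for _, lst in sorted(cells.items())]


# Addresses of all records with key1=prog, spread over several cylinders.
prog_list = [1, 2, 9, 10, 11, 17]
entries = cellular_index(prog_list)
```

Here the single `prog` list becomes three entries, one per cylinder, so each sublist can be read without moving the heads.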
