Data Structures and File Organization Techniques

Evolution of File Structure Design

The evolution of file structure design is intrinsically linked to the development of storage devices and data processing techniques. As technology progressed, so did the need for efficient and flexible ways to organize and access data.

  1. In 1962, Adelson-Velsky and Landis introduced AVL trees for data held in memory.
  2. AVL trees, however, did not transfer well to files: they work best when each tree node holds a single record, whereas efficient file structures need nodes that group dozens or hundreds of records into one disk block.
  3. In the 1970s came the idea of B-Trees, which require O(log_k N) access time, where N is the number of entries in the file and k is the number of entries indexed in a single block of the B-Tree structure. B-Trees can guarantee that one can find one file entry among millions of others with only 3 or 4 trips to the disk (a worked height calculation follows this list).
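
To make the height claim concrete, here is a minimal C sketch that estimates the number of levels (and hence disk accesses) for assumed values of N and k; both figures are illustrative, not taken from any particular file or drive.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double N = 1000000.0;   /* entries in the file (assumed figure)       */
        double k = 512.0;       /* entries indexed per B-Tree block (assumed) */

        /* log_k N = ln N / ln k approximates the number of levels,
           i.e. the number of disk accesses needed to reach a leaf. */
        double levels = ceil(log(N) / log(k));
        printf("about %.0f disk accesses\n", levels);   /* prints 3 */
        return 0;
    }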

Entry-Sequenced Files

The records in an entry-sequenced file are stored in the order in which they are inserted into the file; new records are always appended to the end of the file. Entry-sequenced files are useful for keeping time-sequenced event records, such as log files or audit trails. The records in entry-sequenced files can have fixed-length or variable-length fields. When you update an existing variable-length record in place, remember that the new record cannot be larger than the space occupied by the original record.
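
As an illustration of appending, the following minimal C sketch writes one fixed-length record to the end of a file; the record layout and the file name events.log are assumptions made for this example.

    #include <stdio.h>
    #include <string.h>

    /* A fixed-length event record; the field sizes are assumed for the sketch. */
    struct log_record {
        char timestamp[20];
        char message[60];
    };

    int main(void) {
        struct log_record rec;
        memset(&rec, 0, sizeof rec);
        strcpy(rec.timestamp, "2024-05-01 12:00:00");
        strcpy(rec.message, "user login");

        /* Mode "ab" appends: new records always go at the end of the file. */
        FILE *fp = fopen("events.log", "ab");
        if (!fp) { perror("events.log"); return 1; }
        fwrite(&rec, sizeof rec, 1, fp);
        fclose(fp);
        return 0;
    }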

Hashing Files on CD-ROM

Hashing a file on a CD-ROM means generating a unique, fixed-length string of characters (a “hash”) that represents the entire content of a file stored on a CD. This allows you to verify if the file has been altered or corrupted during burning or storage by comparing its hash to a previously recorded value, essentially acting as a digital fingerprint for the file on the CD.
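
The following minimal C sketch illustrates fingerprinting a file's contents with the simple, non-cryptographic FNV-1a hash; the file name image.iso is a placeholder, and a real integrity check would normally use a cryptographic hash such as SHA-256.

    #include <stdio.h>
    #include <stdint.h>

    /* 32-bit FNV-1a: hash every byte of the file into one fixed-length value. */
    static uint32_t fnv1a_file(FILE *fp) {
        uint32_t hash = 2166136261u;        /* FNV offset basis */
        int c;
        while ((c = fgetc(fp)) != EOF) {
            hash ^= (uint32_t)(unsigned char)c;
            hash *= 16777619u;              /* FNV prime */
        }
        return hash;
    }

    int main(void) {
        FILE *fp = fopen("image.iso", "rb");    /* hypothetical file name */
        if (!fp) { perror("image.iso"); return 1; }
        printf("hash: %08x\n", fnv1a_file(fp));
        fclose(fp);
        return 0;
    }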

Dynamic Memory

Dynamic memory allocation is a fundamental concept in data structures and programming. It allows programs to allocate memory at runtime, providing flexibility and efficiency when working with data structures of varying sizes. In most programming languages, including C++, memory can be classified into two categories: stack memory and heap memory.
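
A minimal sketch of heap allocation, written in C to match the rest of these notes (C++ would typically use new/delete or standard containers); the array size n stands in for a value that would only be known at runtime.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int n = 5;                                  /* in practice, known only at runtime */
        int *values = malloc(n * sizeof *values);   /* allocated on the heap */
        if (!values) return 1;

        for (int i = 0; i < n; i++)
            values[i] = i * i;
        printf("last value: %d\n", values[n - 1]);

        free(values);        /* heap memory must be released explicitly */
        return 0;
    }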

Structure in C

The structure in C is a user-defined data type that can be used to group items of possibly different types into a single type. The struct keyword is used to define the structure in the C programming language. The items in the structure are called its members, and they can be of any valid data type. Additionally, the values of a structure are stored in contiguous memory locations.

Example:

  
    struct structure_name {
      data_type member_name1;
      data_type member_name2;
    };
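
    /* A concrete instance of the template above, using hypothetical
       member names, showing declaration, initialization, and access: */

    #include <stdio.h>

    struct point {
        int x;
        int y;
    };

    int main(void) {
        struct point p = { 3, 4 };    /* members are laid out contiguously */
        printf("x=%d y=%d\n", p.x, p.y);
        return 0;
    }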
  

Magnetic Tape

Magnetic tape is one of the oldest technologies for electronic data storage. While tape has largely been displaced as a primary and backup storage medium, it remains well suited to archiving because of its high capacity, low cost, and durability. If the tape is part of a library, robotically selecting and loading the right cartridge into a tape drive adds latency, but in an archive this latency is not an issue. With tape archiving, there is no online copy for quick retrieval; everything is vaulted for the long term.

CD-ROM Strengths and Weaknesses

  1. Seek Performance: Very poor. A seek can take on the order of hundreds of milliseconds, far slower than a magnetic disk, so designs must minimize seeking.
  2. Data Transfer Rate: Moderate; faster than a floppy disk but well below a magnetic hard disk.
  3. Storage Capacity: Good. The large capacity lets us build indexes and other support structures that help overcome some of the limitations of CD-ROM’s poor seek performance.
  4. Read-Only Access: The disc cannot be changed after it is mastered, so the file organization can be optimized for a fixed, known set of files.
  5. Building the disc is a one-time, offline process with no user waiting for a quick response, so slow, thorough optimization of the file organization is affordable.

B-Trees, B+ Trees, and Simple B+ Trees

  1. B and B+ Trees are not the only tools useful for file structure design. Simple indexes are useful when they can be held fully in memory, and hashing can provide much faster access than B and B+ Trees.
  2. B-Trees with Associated Information: These are B-Trees that store the record contents together with the keys in the tree nodes.
    • Strengths: Can save space, since keys and records are kept together and do not need to be duplicated in a separate data file.
    • Weaknesses: Works only when the record information is small enough to be kept inside the B-Tree nodes; otherwise few keys fit per node and too much seeking is involved in retrieving the record information.
  3. Simple Prefix B+ Trees: The separators in the index set are shortened prefixes rather than full keys from the sequence set, so the index set, and hence the whole tree, is even smaller (a sketch of separator selection follows this list).
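
A minimal C sketch of how a shortest separator might be chosen between the last key of one sequence-set block and the first key of the next; the key values are made up for the example.

    #include <stdio.h>

    /* Copy into out the shortest prefix of 'next' that still separates it
       from 'prev', i.e. a string s with prev < s <= next (keys are assumed
       distinct and prev < next). */
    static void shortest_separator(const char *prev, const char *next, char *out) {
        int i = 0;
        while (prev[i] != '\0' && prev[i] == next[i]) {
            out[i] = next[i];
            i++;
        }
        out[i] = next[i];      /* first character that distinguishes the keys */
        out[i + 1] = '\0';
    }

    int main(void) {
        char sep[16];
        shortest_separator("CATHERINE", "CHRIS", sep);
        printf("separator: %s\n", sep);    /* prints CH */
        return 0;
    }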

B+ Tree Worst Case

In a B+ tree, the worst-case search occurs when the target key sits in the last position examined within a leaf node, requiring traversal through the maximum possible number of levels in the tree. Even so, the search remains logarithmic: the high fan-out of each node means a single node read brings in many keys and child pointers at once, so the tree stays shallow and far fewer I/O operations are needed than in a binary search tree.

Metadata

Metadata is data that describes and contextualizes other data. It provides information about the content, format, structure, and other characteristics of data and can be used to improve the organization, discoverability, and accessibility of data. Metadata can be stored in various forms, such as text, XML, or RDF, and can be organized using metadata standards and schemas. There are many metadata standards that have been developed to facilitate the creation and management of metadata, such as Dublin Core, schema.org, and the Metadata Encoding and Transmission Standard (METS).

Inverted List

An inverted list refers to a data structure used in information retrieval systems, particularly search engines, in which each search term (keyword) is associated with a list of the documents that contain it. It essentially maps each word to the documents where it appears, allowing fast full-text searches: look up the word, then follow its document list. Each per-term list is often called a posting list, and the structure as a whole is known as an inverted index.
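
A minimal C sketch of an inverted list: each term maps to the IDs of the documents that contain it. The terms and document IDs are made-up example data, and a real system would keep the index in files and use a faster lookup structure than a linear scan.

    #include <stdio.h>
    #include <string.h>

    /* One posting: a term and the IDs of the documents containing it. */
    struct posting {
        const char *term;
        int doc_ids[4];
        int doc_count;
    };

    /* A tiny hard-coded inverted index (hypothetical example data). */
    static struct posting index_[] = {
        { "btree",   {1, 3, 7}, 3 },
        { "hashing", {2, 3},    2 },
        { "tape",    {5},       1 },
    };

    static const struct posting *lookup(const char *term) {
        for (size_t i = 0; i < sizeof index_ / sizeof index_[0]; i++)
            if (strcmp(index_[i].term, term) == 0)
                return &index_[i];
        return NULL;
    }

    int main(void) {
        const struct posting *p = lookup("btree");
        if (p) {
            printf("\"%s\" appears in documents:", p->term);
            for (int i = 0; i < p->doc_count; i++)
                printf(" %d", p->doc_ids[i]);
            printf("\n");
        }
        return 0;
    }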

Reduced Redundancy

Reduced redundancy refers to the process of minimizing or eliminating duplicate or unnecessary data in a database or other data storage system. This is typically done to improve the efficiency of data storage and reduce the risk of data inconsistencies or errors. For example, redundant data may occur when the same information is stored in multiple tables or fields. This can lead to data inconsistencies if the same information is updated in one place but not in others. By reducing redundancy, databases can be designed to store only one copy of each piece of information, which reduces the risk of inconsistencies and saves storage space.

Run-length encoding: Run-length encoding (RLE) is a data compression technique that reduces redundancy by replacing runs of repeated data values with a single value and a count. This makes RLE an example of redundancy reduction because it compresses data by identifying and removing redundant information.
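
A minimal C sketch of run-length encoding that prints each run as a count followed by the repeated value; the input string is just example data.

    #include <stdio.h>
    #include <string.h>

    /* Encode src as (count, value) pairs, e.g. "AAAABBB" becomes 4A3B. */
    static void rle_encode(const char *src) {
        size_t n = strlen(src);
        for (size_t i = 0; i < n; ) {
            size_t run = 1;
            while (i + run < n && src[i + run] == src[i])
                run++;
            printf("%zu%c", run, src[i]);
            i += run;
        }
        printf("\n");
    }

    int main(void) {
        rle_encode("AAAABBBCCD");    /* prints 4A3B2C1D */
        return 0;
    }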

Data Compression Techniques

  1. Lossless Data Compression: Lossless data compression guarantees that the decompressed data is identical to the original data. It works best for text and data files where precision matters.
  2. Lossy Data Compression: Lossy data compression sacrifices some accuracy of the input data in exchange for a better compression ratio. It is usually applied to multimedia files, where some loss of detail can be tolerated.

Add Simple Index to Sequence Set

  1. Each of the blocks we created for our Sequence Set contains a range of records that might contain the record we are seeking.
  2. We can construct a simple single-level index for these blocks (a minimal sketch follows this list).
  3. The combination of this kind of index with the sequence set of blocks provides complete indexed sequential access. This method works well as long as the entire index can be held in memory.
  4. If the entire index cannot be held in memory, then we can use a B+ Tree, which is a B-Tree index plus a sequence set that holds the records.
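
A minimal C sketch of such a single-level index held in memory: each entry records the highest key in a sequence-set block, and a binary search picks the block whose range could contain the search key. The keys and block numbers are made-up example values.

    #include <stdio.h>
    #include <string.h>

    /* One index entry: the highest key in a sequence-set block and the
       block's address (a block number here). */
    struct index_entry {
        char highest_key[12];
        int  block_number;
    };

    /* A tiny in-memory index over four hypothetical blocks. */
    static struct index_entry idx[] = {
        { "BERNE",  1 },
        { "CAGE",   2 },
        { "DUTTON", 3 },
        { "FOLK",   4 },
    };

    /* Binary search: return the block whose range could contain 'key'. */
    static int find_block(const char *key) {
        int lo = 0, hi = (int)(sizeof idx / sizeof idx[0]) - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (strcmp(key, idx[mid].highest_key) <= 0)
                hi = mid;
            else
                lo = mid + 1;
        }
        return idx[lo].block_number;
    }

    int main(void) {
        printf("search \"COLE\" in block %d\n", find_block("COLE"));   /* block 3 */
        return 0;
    }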

Disk Access

Disk access refers to the process of reading or writing data to a storage device, typically a hard disk drive (HDD) or a solid-state drive (SSD). This involves the physical movement of the disk’s read/write head to locate the specific sector on the disk platter where the data is stored or needs to be written.

  1. Seek Time: The time taken by the read/write head to move to the correct track on the disk platter.
  2. Rotational Latency: The time taken for the disk to rotate until the desired sector is under the read/write head.
  3. Transfer Time: The time taken to transfer the actual data between the disk and the system’s memory.
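
To show how the three components combine, here is a minimal C sketch that estimates one access time from assumed drive parameters; the seek time, rotational speed, transfer rate, and request size are illustrative values, not figures for any particular drive.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical drive parameters (illustrative only). */
        double avg_seek_ms      = 9.0;       /* average seek time          */
        double rpm              = 7200.0;    /* rotational speed           */
        double transfer_mb_per_s = 150.0;    /* sustained MB per second    */
        double request_kb       = 8.0;       /* size of one request        */

        /* Average rotational latency is half a revolution. */
        double rotational_ms = 0.5 * (60000.0 / rpm);                     /* ~4.17 ms */
        double transfer_ms   = (request_kb / 1024.0) / transfer_mb_per_s * 1000.0;

        printf("estimated access time: %.2f ms\n",
               avg_seek_ms + rotational_ms + transfer_ms);
        return 0;
    }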

Organization of Data on Nine-Track Tapes

  1. On a tape, the logical position of a byte within a file corresponds directly to its physical position relative to the start of the file.
  2. The surface of the tape can be seen as a set of nine parallel tracks, each of which is a sequence of bits. Taken together, the bits in a one-bit-wide slice across the tracks correspond to 1 byte plus a parity bit.
  3. Such a one-bit-wide slice of tape is called a frame; one frame holds one byte.

Difference Between Internal and External Fragmentation

  1. Internal Fragmentation:
    1. In internal fragmentation, fixed-sized memory blocks are assigned to processes.
    2. Internal fragmentation happens when a process needs less space than the fixed-sized block it is given, so the unused space inside the block is wasted.
  2. External Fragmentation:
    1. In external fragmentation, variable-sized memory blocks are assigned to processes.
    2. External fragmentation happens when processes are loaded and removed, leaving free space scattered in pieces that are individually too small to satisfy new requests.

Reading Variable-Length Records

  1. Determine Record Length: Read each record’s length field (for example, a length prefix stored immediately before the record data) to learn how many bytes make up the record.
  2. Iterative Reading: Use a loop to read records from the file one after another until end-of-file (a minimal sketch follows this list).
  3. Variable-Length Record Handling: Ensure that the reading mechanism can accommodate variable-length records.
  4. Error Handling: Implement robust error handling to detect and handle potential issues like end-of-file, read errors, or invalid record formats.
  5. Data Extraction: Once a record is read, extract the desired fields based on the record’s structure and the predefined data formats.
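
A minimal C sketch, assuming each record is stored with a 2-byte length prefix (a layout chosen for this example, not required by the steps above): it reads the length, then the record bytes, and stops on end-of-file or a truncated record. The file name records.dat is hypothetical.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    int main(void) {
        FILE *fp = fopen("records.dat", "rb");
        if (!fp) { perror("records.dat"); return 1; }

        uint16_t len;   /* assumes the file was written with the same byte order */
        while (fread(&len, sizeof len, 1, fp) == 1) {
            char *buf = malloc(len + 1);
            if (!buf || fread(buf, 1, len, fp) != len) {   /* truncated or bad record */
                fprintf(stderr, "bad or incomplete record\n");
                free(buf);
                break;
            }
            buf[len] = '\0';
            printf("record (%u bytes): %s\n", (unsigned)len, buf);
            free(buf);
        }
        fclose(fp);
        return 0;
    }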

Cost of Disk Access

  1. Seek Time: The time taken by the disk’s read/write head to move from its current position to the desired track. This is the most significant cost in disk access, especially for random access patterns.
  2. Rotational Latency: The time taken for the disk to rotate until the desired sector is positioned under the read/write head. On average this is half a revolution, so it is determined by the disk’s rotational speed.
  3. Transfer Time: The time taken to transfer the actual data from the disk to the memory or vice versa. This cost is proportional to the amount of data being transferred.