This book provides the conceptual tools to build file structures that can be quickly and efficiently accessed. It teaches good design judgement through an approach that puts the "hands-on" work of constructing and running programs at the center of the learning process. This best-selling book has been thoroughly updated. It includes timely coverage of file structures in a UNIX environment, in addition to a new and substantial appendix on CD-ROM. All former programs in C and Pascal have been updated to ANSI C and Turbo Pascal 6.0.
Author(s): Michael J. Folk, Bill Zoellick
Edition: 2
Publisher: Addison-Wesley
Year: 1991
Language: English
Pages: 608
1 Introduction to File Structures
1.1 The Heart of File Structure Design
1.2 A Short History of File Structure Design
1.3 A Conceptual Toolkit: File Structure Literacy
Summary • Key Terms
2 Fundamental File Processing Operations
2.1 Physical Files and Logical Files
2.2 Opening Files
2.3 Closing Files
2.4 Reading and Writing
2.4.1 Read and Write Functions
2.4.2 A Program to Display the Contents of a File
2.4.3 Detecting End-of-File
2.5 Seeking
2.5.1 Seeking in C
2.5.2 Seeking in Pascal
2.6 Special Characters in Files
2.7 The UNIX Directory Structure
2.8 Physical and Logical Files in UNIX
2.8.1 Physical Devices as UNIX Files
2.8.2 The Console, the Keyboard, and Standard Error
2.8.3 1/0 Redirection and Pipes
2.9 File-related Header Files
2.10 UNIX Filesystem Commands
Summary Key Terms • Exercises
Further Readings
3 Secondary Storage and System Software
3.1 Disks
3.1.1 The Organization of Disks
3.1.2 Estimating Capacities and Space Needs
3.1.3 Organizing Tracks by Sector
3.1.4 Organizing Tracks by Block
3.1.5 Nondata Overhead
3.1.6 The Cost of a Disk Access
3.1.7 Effect of Block Size on Performance: A UNIX Example
3.1.8 Disk as Bottleneck
3.2 Magnetic Tape
3.2.1 Organization of Data on Tapes
3.2.2 Estimating Tape Length Requirements
3.2.3 Estimating Data Transmission Times
3.2.4 Tape Applications
3.3 Disk Versus Tape
3.4 Storage as a Hierarchy
3.5 A Journey of a Byte
3.5.1 The File Manager
3.5.2 The 1/0 Buffer
3.5.3 The Byte Leaves RAM: The 1/0 Processor and Disk Controller
3.6 Buffer Management
3.6.1 Buffer Bottlenecks
3.6.2 Buffering Strategies
3.7 1/0 in UNIX
3.7.1 The Kernel
3.7.2 Linking File Names to Files
3.7.3 Normal Files, Special Files, and Sockets
3.7.4 Block 1/0
3.7.5 Device Drivers
3.7.6 The Kernel and Filesystems
3.7.7 Magnetic Tape and UNIX
Summary • Key Terms • Exercises
Further Readings
4 Fundamental File Structure Concepts
4.1 Field and Record Organization
4.1.1 A Stream File
4.1.2 Field Structures
4.1.3 Reading a Stream of Fields
4.1.4 Record Structures
4.1.5 A Record Structure That Uses a Length Indicator
4.1.6 Mixing Numbers and Characters: Use of a File Dump
4.2 Record Access
4.2.1 Record Keys
4.2.2 A Sequential Search
4.2.3 UNIX Tools for Sequential Processing 1
4.2.4 Direct Access
4.3 More about Record Structures
4.3.1 Choosing a Record Structure and Record Length
4.3.2 Header Records
4.4 File Access and File Organization
4.5 Beyond Record Structures
4.5.1 Abstract Data Models
4.5.2 Headers and Self-Describing Files
4.5.3 Metadata
4.5.4 Color Raster Images
4.5.5 Mixing Object Types in One File
4.5.6 Object-oriented File Access
4.5.7 Extensibility
4.6 Portability and Standardization
4.6.1 Factors Affecting Portability
4.6.2 Achieving Portability
Summary • Key Terms • Exercises
Further Readings
C Programs
Pascal Programs
5 Organizing Files for Performance
5.1 Data Compression
5.1.1 Using a Different Notation
5.1.2 Suppressing Repeating Sequences
5.1.3 Assigning Variable-length Codes
5.1.4 Irreversible Compression Techniques
5.1.5 Compression in UNIX
5.2 Reclaiming Space in Files
5.2.1 Record Deletion and Storage Compaction
5.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically
5.2.3 Deleting Variable-length Records
5.2.4 Storage Fragmentation
5.2.5 Placement Strategies
5.3 Finding Things Quickly: An Introduction to Internal Sorting and Binary Searching
5.3.l Finding Things in Simple Field and Record Files
5.3.2 Search by Guessing: Binary Search
5.3.3 Binary Search versus Sequential Search
5.3.4 Sorting a Disk File in RAM
5.3.5 The Limitations of Binary Searching and Internal Sorting
5.4 Keysorting
5.4.1 Description of the Method
5.4.2 Limitations of the Keysort Method
5.4.3 Another Solution: Why Bother to Write the File Back?
5.4.4 Pinned Records
Summary • Key Terms • Exercises
Further Readings
6 Indexing
6.1 What Is an Index?
6.2 A Simple Index with an Entry-Sequenced File
6.3 Basic Operations on an Indexed, Entry-Sequenced File
6.4 Indexes That Are Too Large to Hold in Memory
6.5 Indexing to Provide Access by Multiple Keys
6.6 Retrieval Using Combinations of Secondary Keys
6.7 Improving the Secondary Index Structure: Inverted Lists
6.7.1 A First Attempt at a Solution
6.7.2 A Better Solution: Linking the List of References
6.8 Selective Indexes
6.9 Binding
Summary • Key Terms • Exercises
Further Readings
7 Cosequential Processing and the Sorting of Large Files
7.1 A Model for Implementing Cosequential Processes
7.1.1 Matching Names in Two Lists
7.1.2 Merging Two Lists
7.1.3 Summary of the Cosequential Processing Model
7.2 Application of the Model to a General Ledger Program
7.2.1 The Problem
7.2.2 Application of the Model to the Ledger Program
7.3 Extension of the Model to Include Multiway Merging
7.3.1 A K-way Merge Algorithm
7.3.2 A Selection Tree for Merging Large Numbers of Lists
7.4 A Second Look at Sorting in RAM
7.4.1 Overlapping Processing and 1/0: Heapsort
7.4.2 Building the Heap while Reading in the File
7.4.3 Sorting while Writing out to the File
7.5 Merging as a Way of Sorting Large Files on Disk
7.5.1 How Much Time Does a Merge Sort Take?
7.5.2 Sorting a File That Is Ten Times Larger
7.5.3 The Cost of Increasing the File Size
7.5.4 Hardware-based Improvements
7.5.5 Decreasing the Number of Seeks Using Multiple-step Merges
7.5.6 Increasing Run Lengths Using Replacement Selection
7.5.7 Replacement Selection Plus Multistep Merging
7.5.8 Using Two Disk Drives with Replacement Selection
7.5.9 More Drives? More Processors?
7.5.10 Effects of Multiprogramming
7.5.11 A Conceptual Toolkit for External Sorting
7.6 Sorting Files on Tape
7.6.1 The Balanced Merge
7.6.2 The K-way Balanced Merge
7.6.3 Multiphase Merges
7.6.4 Tapes versus Disks for External Sorting
7.7 Sort-Merge Packages
7.8 Sorting and Cosequential Processing in UNIX
7.8.1 Sorting and Merging in UNIX
7.8.2 Cosequential Processing Utilities in UNIX
Summary • Key Terms • Exercises
Further Readings
8 B-Trees and Other Tree-structured File Organizations
8.1 Introduction: The Invention of the B-Tree
8.2 Statement of the Problem
8.3 Binary Search Trees as a Solution
8.4 AVL Trees
8.5 Paged Binary Trees
8.6 The Problem with the Top-down Construction of Paged Trees
8.7 B-Trees: Working up from the Bottom
8.8 Splitting and Promoting
8.9 Algorithms for B-Tree Searching and Insertion
8.10 B-Tree Nomenclature
8.11 Formal Definition of B-Tree Properties
8.12 Worst-case Search Depth
8.13 Deletion, Redistribution, and Concatenation
8.13.1 Redistribution
8.14 Redistribution during Insertion: A Way to Improve Storage Utilization
8.15 B* Trees
8.16 Buffering of Pages: Virtual B-Trees
8.16.1 LRU Replacement
8.16.2 Replacement Based on Page Height
8.16.3 Importance of Virtual B-Trees
8.17 Placement of Information Associated with the Key
8.18 Variable-length Records and Keys
Summary • Key Terms • Exercises
Further Readings
C Programs to Insert Keys into a B-Tree
Pascal Programs to Insert Keys into a B-Tree
9 The B+ Tree Family and Indexed Sequential File Access
9.1 Indexed Sequential Access
9.2 Maintaining a Sequence Set
9.2.1 The Use of Blocks
9.2.2 Choice of Block Size
9.3 Adding a Simple Index to the Sequence Set
9.4 The Content of the Index: Separators Instead of Keys
9.5 The Simple Prefix B+ Tree
9.6 Simple Prefix B + Tree Maintenance
9.6.1 Changes Localized to Single Blocks in the Sequence Set
9.6.2 Changes Involving Multiple Blocks in the Sequence Set
9.7 Index Set Block Size
9.8 Internal Structure of Index Set Blocks: A Variable-order B-Tree
9.9 Loading a Simple Prefix B+ Tree
9.10 B+ Trees
9.11 B-Trees, B+ Trees, and Simple Prefix B+ Trees in Perspective
Summary • Key Terms • Exercises
Further Readings
10 Hashing
10.1 Introduction
10.1.1 What is Hashing?
10.1.2 Collisions
10.2 A Simple Hashing Algorithm
10.3 Hashing Functions and Record Distributions
10.3.1 Distributing Records among Addresses
10.3.2 Some Other Hashing Methods
10.3.3 Predicting the Distribution of Records
10.3.4 Predicting Collisions for a Full File
10.4 How Much Extra Memory Should Be Used?
10.4.1 Packing Density
10.4.2 Predicting Collisions for Different Packing Densities
10.5 Collision Resolution by Progressive Overflow
10.5.1 How Progressive Overflow Works
10.5.2 Search Length
10.6 Storing More Than One Record per Address: Buckets
10.6.1 Effects of Buckets on Performance
10.6.2 Implementation Issues
10.7 Making Deletions
10.7.1 Tombstones for Handling Deletions
10.7.2 Implications of Tombstones for Insertions
10.7.3 Effects of Deletions and Additions on Performance
10.8 Other Collision Resolution Techniques
10.8.1 Double Hashing
10.8.2 Chained Progressive Overflow
10.8.3 Chaining with a Separate Overflow Area
10.8.4 Scatter Tables: Indexing Revisited
10.9 Patterns of Record Access
Summary • Key Terms • Exercises
Further Readings
11 Extendible Hashing
11.1 Introduction
11.2 How Extendible Hashing Works
11.2.1 Tries
11.2.2 Turning the Trie into a Directory
11.2.3 Splitting to Handle Overflow
11.3 Implementation
11.3.1 Creating the Addresses
11.3.2 Implementing the Top-level Operations
11.3.3 Bucket and Directory Operations
11.3.4 Implementation Summary
11.4 Deletion
11.4.1 Overview of the Deletion Process
11.4.2 A Procedure for Finding Buddy Buckets
11.4.3 Collapsing the Directory
11.4.4 Implementing the Deletion Operations
11.4.5 Summary of the Deletion Operation
11.5 Extendible Hashing Performance
11.5.1 Space Utilization for Buckets
11.5.2 Space Utilization for the Directory
11.6 Alternative Approaches
11.6.1 Dynamic Hashing
11.6.2 Linear Hashing
11.6.3 Approaches to Controlling Splitting
Summary • Key Terms • Exercises
Further Readings
Appendix A: File Structures on CD-ROM
A.1 Using this Appendix
A.2 Introduction to CD-ROM
A.2.1 A Short History of CD-ROM
A.2.2 CD-ROM as a File Structure Problem
A.3 Physical Organization of CD-ROM
A.3.1 Reading Pits and Lands
A.3.2 CLV Instead of CAV
A.3.3 Addressing
A.3.4 Structure of a Sector
A.4 CD-ROM Strengths and Weaknesses
A.4.1 Seek Performance
A.4.2 Data Transfer Rate
A.4.3 Storage Capacity
A.4.4 Read-Only Access
A.4.5 Asymmetric Writing and Reading
A.5 Tree Structures on CD-ROM
A.5.1 Design Exercises
A.5.2 Block Size
A.5.3 Special Loading Procedures and Other Considerations
A.5.4 Virtual Trees and Buffering Blocks
A.5.5 Trees as Secondary Indexes on CD-ROM
A.6 Hashed Files on CD-ROM
A.6.1 Design Exercises
A.6.2 Bucket Size
A.6.3 How the Size of CD-ROM Helps
A.6.4 Advantages of CD-ROM's Read-Only Status
A.7 The CD-ROM File System
A.7.1 The Problem
A.7.2 Design Exercise
A.7.3 A Hybrid Design
Summary
Appendix B: ASCII Table
Appendix C: String Functions in Pascal: tools.prc
Functions and Procedures Used to Operate on strng
Appendix D: Comparing Disk Drives
Bibliography
Index