Storage Systems: Organization, Performance, Coding, Reliability, and Their Data Processing

Storage Systems: Organization, Performance, Coding, Reliability and Their Data Processing was motivated by the 1988 Redundant Array of Inexpensive/Independent Disks (RAID) proposal to replace large form factor mainframe disks with an array of commodity disks. Disk loads are balanced by striping data into strips, with one strip per disk, and storage reliability is enhanced via replication or erasure coding, which at best dedicates k strips per stripe to tolerate k disk failures. Flash memories have brought about a paradigm shift, with Solid State Drives (SSDs) replacing Hard Disk Drives (HDDs) for high performance applications. RAID and flash have led to the emergence of new storage companies, namely EMC, NetApp, SanDisk, and Pure Storage, and a multibillion-dollar storage market. Key new conferences and publications are reviewed in this book. The goal of the book is to expose students, researchers, and IT professionals to the more important developments in storage systems, while covering the evolution of storage technologies, traditional and novel databases, and novel sources of data. We describe several prototypes: FAWN at CMU, RAMCloud at Stanford, and LightStore at MIT; Oracle's Exadata, AWS' Aurora, Alibaba's PolarDB, and Fungible Data Center; and the author's paper designs for cloud storage, namely heterogeneous disk arrays and hierarchical RAID.

Author(s): Alexander Thomasian
Edition: 1
Publisher: Morgan Kaufmann
Year: 2021

Language: English
Pages: 746
Tags: storage; replication; RAID

Front Cover
Storage Systems
Copyright
Contents
About the author
Preface
Acknowledgments
Abbreviations and acronyms
1 Introduction
1.1 Computer systems after WW II
1.2 High level programming languages - Fortran
1.2.1 A Programming Language - APL
1.2.2 COmmon Business Oriented Language - COBOL
1.2.3 IBM's PL/I programming language
1.2.4 Some early computer companies
1.3 Effect of data representation on storage space requirements
1.4 Basic computer arithmetic
1.5 Author's experience with IBM computers in 1970s
1.5.1 IBM computers at Univ. of Tehran and IBM World Trade Corp. in Tehran, Iran
1.5.2 My experiences with IBM computers at Tehran Regional Electric Company
1.5.3 Customer billing at TREC utility
1.5.4 My experience with IBM computers at UCLA
1.6 IBM's System 360 and its successors
1.6.1 US lawsuits against IBM and AT&T
1.6.2 Amdahl Corp. and plug compatible computers
1.6.3 Radio Corporation of America - RCA
1.6.4 Electronic Data Systems - EDS and Perot Systems
1.7 The IBM S/360 computer family
1.8 Operating systems associated with IBM mainframes
1.9 Early computer companies possibly competing with IBM
1.9.1 Burroughs + UNIVAC = UNISYS
1.10 My experience at Burroughs Corp.
1.10.1 NCR - National Cash Register Corp.
1.10.2 Control Data Corporation
1.10.3 Honeywell Corp.
1.10.4 Hewlett-Packard - HP Corp.
1.10.5 Digital Equipment Corp - DEC
1.11 Computer company revenue rankings
1.12 Computer structures book
1.13 Computer family architectures - CFA
1.14 Virtual memory and page replacement algorithms
1.15 Memory space fragmentation and dynamic storage allocation
1.15.1 Page replacement algorithms
1.15.2 Simplified analysis of a paging system
1.16 Analysis of thrashing in 2-phase locking - 2PL systems
1.17 CPU caches
1.18 Multiprogrammed computer systems
1.19 Timesharing systems
1.20 Mean response with FCFS and processor-sharing scheduling
1.21 Analysis of open and closed queueing network models
1.22 Bottleneck analysis and balanced job bounds
1.23 Performance analyses of I/O subsystems
1.24 Vector supercomputers
1.25 Parallel computers
1.25.1 The ILLIAC IV computer
1.25.2 Thinking Machines Connection Machine
1.25.3 Kendall Square Research's KSR-1
1.25.4 Goodyear Massively Parallel Processor - MPP
1.25.5 MasPar
1.25.6 NCUBE
1.25.7 Meiko
1.25.8 SUPRENUM
1.25.9 Parsytec
1.25.10 Intel Personal SuperComputer - iPSC
1.25.11 IBM's BlueGene supercomputer
1.25.12 Tesla Dojo supercomputer for AI training
1.26 The future of supercomputing
1.27 Microprocessor CPUs, GPUs, FPGAs, and ASICs
1.28 RISC-V and other microprocessors
1.29 The IBM PC and its compatibles
1.29.1 Experience with IBM workstations
1.30 Storage studies by Alan Jay Smith at Berkeley
1.31 Prefetching
1.32 Database buffers
1.33 Checkpointing in processing large jobs
1.34 Computer related rules of thumb
1.34.1 Amdahl rules in developing S/360 computers
1.34.2 Amdahl's law in the era of multicore
1.34.3 Amazon optimal configurations for x86-based EC2 instances
1.34.4 Kung's law
1.34.5 Brooks' law
1.34.6 Patterson et al.'s roofline bound
1.34.7 Gray's rules of thumb
1.34.8 Jim Gray's five minute rule
1.34.9 Moore's law
1.34.10 Wright's law
1.34.11 Dennard's law
1.34.12 Huang's law for Graphics Processing Units - GPUs
1.34.13 Grosch's law
1.34.14 Kryder's law
1.34.15 Subsecond response times
1.35 Conclusions and summary
2 Storage technologies and their data
2.1 Evolution of recording material
2.2 Advertising and e-commerce
2.3 Computer storage technologies
2.3.1 Punched cards - Hollerith and IBM
2.3.2 Punched paper tapes
2.3.3 Handwriting recognition
2.3.4 Delay line memories
2.3.5 Core memories
2.3.6 Semiconductor memories
2.3.7 Redundant Array of Independent Memories - RAIM
2.3.8 Magnetic Random Access Memory - MRAM
2.3.9 Magnetic tapes and tape libraries
2.3.10 An analytical model for a tape library
2.3.11 Summary of a recent article on magnetic tapes
2.3.12 Origins of Hard Disk Drives - HDDs
2.3.13 HDD manufacturers
2.3.14 Storage technologies expected to replace disk drives
2.3.15 Magnetic bubble memories
2.3.16 Charge-Coupled Devices - CCDs
2.3.17 Micro-Electro-Mechanical Systems - MEMS
2.3.18 IBM Zurich millipede
2.3.19 Phase Change Memory - PCM
2.3.20 Flash memories
2.3.21 Companies producing flash memories
2.3.22 Elevating commodity storage with the SALSA host translation layer
2.3.23 Flash SSD versus magnetic HDD pricing
2.3.24 Pure Storage design of Purity
2.3.25 Intel/Micron 3D_XPoint Optane Memory
2.3.26 Processing In Memory - PIM
2.3.27 Universal memory technology - UltraRAM
2.3.28 Racetrack memory
2.3.29 Optical storage
2.3.30 Holographic memory
2.3.31 M-DISC and storage longevity
2.3.32 Persistent and NonVolatile Memory - NVM
2.3.33 Glass as a new storage medium
2.3.34 DNA based archival storage system
2.4 Reliability studies of DRAM, HDDs, & flash SSDs
2.4.1 Flash SSD reliability at Facebook, Google & NetApp
2.5 Storage Networking Industry Association - SNIA
2.5.1 Solid state storage performance
2.5.2 Persistent Memory Forum
2.5.3 Computational storage
2.6 Big data and its sources
2.7 Sources of storage content
2.8 Ranking and description of media companies
2.9 Sources of news: newspapers, radio and TV stations
2.9.1 Newspapers in US and worldwide
2.9.2 TV networks in US
2.10 Text editing and formatting languages
2.11 Online books sources
2.12 Free book download web sites
2.13 Data, image, audio and video compression
2.13.1 Data compression
2.13.2 Huffman coding/encoding
2.13.3 Lempel-Ziv - LZ encoding
2.13.4 Arithmetic coding
2.13.5 Miscellaneous topics on data compression
2.13.6 Uniform Resource Locator - URL shortener
2.13.7 Image compression
2.13.8 Video/audio compression
2.14 Main memory data compression
2.15 Data deduplication in storage systems
2.15.1 Data deduplication at Microsoft
2.15.2 The Venti prototype at Bell Labs/Lucent
2.15.3 Data Domain deduplication
2.15.4 Datrium
2.15.5 Summary of a major survey on data deduplication
2.16 Up and coming data deduplication companies
2.17 Storage research at IBM's Almaden Research Center in 1990s
2.18 Cleversafe and its information dispersal technology
2.19 Recent developments at IBM Research at ARC
2.20 Storage research at Hewlett-Packard - HP
2.21 Primary storage vendors and enterprise companies in 2020
2.22 All-flash upstart storage companies
2.23 Hyperconverged infrastructure for storage systems
2.24 Top enterprise storage backup players
2.25 Data storage companies: up and coming storage vendors
2.26 Parallel file systems
2.27 Cloud storage
2.27.1 Cloud computing price models
2.27.2 Storage as a service in cloud computing
2.27.3 Cloud storage elasticity and its benchmarking
2.28 Jai Menon's predictions on the future of clouds
2.29 Cloud storage companies
2.30 Distributed systems research related to clouds
2.30.1 OceanStore
2.30.2 Inktomi and CAP theorem
2.30.3 Replicated data
2.30.4 Sky Computing
2.31 Data encryption
2.31.1 Data encryption for cloud storage
2.32 Conclusions - predictions about storage systems
2.32.1 Resurgence in shared storage, but Fibre-Channel fades
3 Disk drive data placement and scheduling
3.1 The organization of Hard Disk Drives - HDDs
3.2 Internal organization of files in UNIX
3.3 Review of disk arm scheduling
3.3.1 Implementation of SATF
3.3.2 Disk performance studies by Windsor Hsu and Alan Jay Smith at IBM ARC
3.3.3 Linux support for disk scheduling
3.4 Disk scheduling for mixed workloads
3.5 Real time disk scheduling for multimedia
3.6 Storage virtualization
3.7 File placement on disk
3.7.1 Anticipatory disk arm placement
3.8 Disks with Shingled Magnetic Recording - SMR
3.9 Review of analyses of disk scheduling methods
3.10 Analytic studies of disk storage
3.11 Analysis of a zoned disk with the FCFS scheduling
3.11.1 Disk service time in zoned disks with FCFS scheduling
3.12 Performance analysis of the SCAN policy
3.13 Analysis of the SATF policy
3.13.1 Preliminary investigation of SATF
3.13.2 First method for the analysis of SATF
3.13.3 Second method for the analysis of SATF
3.14 Conclusions
4 Mirrored & hybrid arrays
4.1 Introduction to mirrored and hybrid disk arrays
4.2 Mirrored and hybrid disk array organizations
4.2.1 Basic Mirroring - BM
4.2.2 Group Rotate Declustering - GRD
4.2.3 Interleaved Declustering - ID
4.2.4 Chained Declustering - CD
4.2.5 Dual striping
4.2.6 Logical volume and automatic storage management and GPFS
4.2.7 LSI Logic RAID
4.2.8 Adaptive disk arrays
4.2.9 SSPiRAL (Survivable Storage using Parity in Redundant Array Layout)
4.2.10 B-code
4.2.11 Weaver codes
4.2.12 Robust, Efficient, Scalable, Autonomous, Reliable - RESAR
4.2.13 Multiway placement
4.2.14 Classification of mirrored and hybrid disk arrays
4.3 Routing read requests in mirrored disks
4.4 Shortening the tail for response times
4.5 Improving write performance in mirrored disks
4.6 Disks with multiple R/W heads on a single and multiple arms
4.7 Seek distances in single and mirrored disks
4.8 Mirrored disk performance in normal, degraded, rebuild modes
4.8.1 Queueing theory for disk performance evaluation
4.8.2 RAID1 performance in normal mode
4.8.3 Uniform vs round-robin routing of read requests in mirrored disks
4.8.4 Completion times of multiple requests in RAID1 and RAID5
4.8.5 Operation in degraded mode analysis
4.8.6 Rebuild processing in RAID1
4.9 Protecting against rare event failures in archival systems
4.10 RAIDP: ReplicAtion with IntraDisk Parity for cost effective storage of warm data
4.11 Remote mirroring for disaster recovery
4.12 RAID reliability analysis
4.12.1 Reliability expressions for mirrored and hybrid disk arrays
4.12.2 Reliability analysis of variants of RAID1 and RAID(4+k) w/o repair
4.12.3 Reliability analysis of variants of RAID1 and RAID(4+k) w/o repair
4.12.4 Performability analysis
4.12.5 Shortcut Reliability Analysis - SRA
4.12.6 Transient solutions of Continuous Time Markov Chains - CTMCs
4.12.7 RAID reliability analysis with repair
4.12.8 Solving linear Ordinary Differential Equations - ODEs
4.12.9 Simplified for RAID5 rebuild reliability analysis
4.12.10 Enhanced reliability modeling of storage systems
4.12.11 Metrics for reliability estimation in storage systems
4.12.12 Expected Annual Fraction of Data Loss - EAFDL
4.12.13 Taking into account controller failures
4.13 Storage reliability research at IBM's Zurich Research Lab
4.13.1 Analysis and simulation for reliability estimation
4.13.2 Automated Reliability Interactive Estimation System - ARIES
4.13.3 System AVailability Estimator - SAVE project at IBM Research
4.14 Conclusions
5 Redundant Arrays of Independent Disks - RAID
5.1 Redundant Arrays of Inexpensive Disks
5.1.1 RAID prototypes at Berkeley and CMU
5.2 Early RAID products
5.2.1 Tandem Corporation
5.2.2 Teradata Corp.
5.2.3 National Cash Register Corp. - NCR
5.2.4 EMC Corp.
5.2.5 SUN Microsystems
5.2.6 SUN ZFS file system and volume manager
5.2.7 International Business Machines - IBM Corp.
5.2.8 Storage Technology Corp. - STK
5.2.9 Western Digital Corp. - WDC
5.2.10 Network Appliance - NetApp Corp.
5.2.11 HP storage products
5.3 RAID classification and motivation
5.4 RAID0 and striping
5.5 RAID2
5.6 RAID3
5.7 RAID4
5.8 RAID5
5.8.1 RAID0 and RAID5 stripe unit size
5.8.2 Parity striping and load balancing
5.8.3 Load balancing in disk arrays without striping
5.9 RAID5 performance analysis in normal mode
5.10 RAID(4+k) disk arrays in normal and degraded mode
5.11 Rebuild processing in disk arrays
5.11.1 Variations of rebuild processing in RAID5
5.12 Vacationing server model for rebuild processing
5.12.1 Effect of rebuild processing on external request response time
5.12.2 Rebuild unit sizes in RAID5
5.12.3 Permanent Customer Model - PCM and comparison with VSM
5.13 RAID5 sparing configurations for rebuild
5.13.1 Rebuild by distributed sparing
5.13.2 Rebuild by restriping
5.13.3 Rebuild with parity sparing
5.14 IntraDisk Redundancy - IDR for higher reliability rebuild
5.15 Disk scrubbing for higher reliability rebuild processing
5.15.1 Disk scrubbing versus intradisk redundancy
5.16 Predictive Failure Analysis - PFA
5.17 Undetected disk errors and Silent Data Corruption - SDC
5.18 Clustered RAID5 layouts
5.18.1 Balanced Incomplete Block Designs - BIBDs
5.18.2 PRIME data layout
5.18.3 Relatively Prime - RELPR data layout
5.18.4 Thorp permutation for load balancing in clustered RAID5
5.18.5 Clustered RAID with Nearly Random Permutations - NRP
5.19 Clustered RAID designs by Walter Burkhard et al. at UCSD
5.19.1 Almost Complete Address Translation - ACATS
5.19.2 DATUM for tolerating multiple disk failures in disk arrays
5.19.3 Permutation Development Data Layout - PDDL
5.19.4 Balancing loads by shifted parity group placements
5.20 Log-structured file systems and arrays
5.21 RAID6
5.21.1 Supplementary Parity Augmentation - SPA
5.22 Reed-Solomon coding for higher reliability
5.22.1 RS coding summary and finite field arithmetic
5.23 Parity based MDS codes
5.24 RDP arrays and their optimal recovery
5.25 EVENODD defined and efficient rebuild of a single disk
5.26 Blaum-Roth - BR code
5.27 X-code disk arrays and rebuild mode with one and two disk failures
5.27.1 Minimizing reconstruction overhead in MDS RAID
5.27.2 Binary MDS array codes with optimal repair
5.28 The RM2 disk array
5.29 RAID7
5.30 Erasure coding for distributed storage
5.30.1 Pyramid codes
5.30.2 HDFS-Xorbas
5.30.3 Ceph erasure coding with LR code
5.30.4 Hitchhiker erasure code
5.30.5 Hadoop Adaptively-Coded Distributed File System
5.30.6 HashTag Erasure Code - HTEC
5.31 ReGenerating codes
5.31.1 Pentagon and heptagon codes
5.31.2 Minimum Storage Regeneration - MSR codes
5.32 Protection schemes for flash memories
5.32.1 Flash protection schemes
5.32.2 Differential RAID for flash memories
5.32.3 Partial MDS - PMDS coding for SSDs
5.33 Conclusions
5.33.1 Seagate launches self-healing memory
6 Coding for multiple disk failures
6.1 Introduction
6.2 2-Dimensional string layouts
6.3 Simple data entanglement layouts with high reliability
6.4 Reed-Solomon codes
6.5 A family of MDS block array codes with two parities
6.6 Codes for correcting two erasures with independent parities
6.7 Row-Diagonal Parity - RDP codes
6.8 Short write operations
6.9 Additional reading
7 Saving power in disks, flash memories, and servers
7.1 Introduction to power consumption in computer systems
7.2 Saving battery power in laptop computers
7.3 Varying spindown threshold based on user behavior
7.4 Exploiting idleness in storage systems
7.5 Making enterprise computers greener by protecting them better
7.6 Policy optimization for dynamic power management
7.7 Managing energy and server resources in hosting centers
7.8 Interplay of energy and performance for RAID running OLTP
7.9 Dynamic speed control for server disk power management
7.10 Approaches to conserve disk energy in network servers
7.11 Energy efficiency through burstiness
7.12 Dempsey: a tool for modeling hard disk power consumption
7.13 MAID - Massive Arrays of Idle Disks alternative to tape storage
7.13.1 The Copan MAID storage system
7.13.2 Nexsan AutoMAID
7.14 Self-tuning power aware storage cache replacement algorithm
7.15 Popular Data Concentration - PDC
7.16 Disk layout optimization for reducing energy consumption
7.17 Managing server energy and operational costs in hosting centers
7.17.1 Improving energy savings while meeting performance goals in RAID
7.18 Performance directed energy management for main memory and disks
7.19 Exploiting redundancy to conserve energy in storage systems
7.20 Thermal disk drive design: challenges and possible solutions
7.21 PARAID: the gear-shifting Power-Aware RAID
7.22 DiskGroup: energy efficient disk layout for RAID1 systems
7.22.1 Power provisioning for a warehouse-sized computer
7.23 Pergamum: replacing tape with disk-based archival storage
7.24 Energy efficient RAID - ERAID
7.25 Power reduction via write-offloading
7.26 Redundant Arrays of Hybrid Disks - RAHD
7.27 Achieving power-efficient, erasure-coded storage
7.28 Effect of energy-saving schemes on disk reliability
7.29 Mathematical model of disk reliability versus load and temperature
7.30 Sample-Replicate-Consolidate mapping - SRCMap
7.31 Power Proportional Distributed File Systems - PPDFS
7.32 Dynamic locality improvement to increase effective storage performance
7.33 Disk data reorganization for reducing energy consumption
7.34 File assignment with minimal variance of service time
7.35 Striping-based Energy Aware - SEA placement
7.36 PEARL: Performance, Energy, and ReLiability balanced dynamic data distribution
7.37 Power proportionality for data center storage
7.38 Economic evaluation of energy saving with reliability constraint
7.39 Dynamic server provisioning for data center power management
7.40 Modeling the energy costs of I/O workloads
7.41 Energy proportionality is required in addition to energy efficiency
7.42 SSD design tradeoffs from energy perspective
7.43 Green AI
7.44 Conclusions
8 Database parallelism, big data and analytics, deep learning
8.1 Stonebraker's classification of computer systems
8.2 Comparison of systems from the viewpoint of CPU performance
8.3 High performance network and channel-based interconnects for storage
8.3.1 Intel Compute eXpress Link - CXL
8.4 Concurrency and coherency control in data sharing systems
8.5 Combined shared disk and shared nothing systems
8.6 Parallel systems at IBM Research
8.7 Interconnection networks in IBM's BlueGene/L
8.8 Data allocation and transaction routing in multicomputers
8.9 Data allocation with distributed relational databases
8.10 Review of multicomputer Data Base Machines - DBMs
8.10.1 Tandem Corporation
8.10.2 Teradata Corporation
8.10.3 Stratus Technologies
8.10.4 Dell Corporation
8.10.5 Inspur Systems
8.11 Benchmarking in various forms
8.11.1 Transaction Processing Performance Council - TPC
8.11.2 Storage Performance Council - SPC
8.11.3 Standard Performance Evaluation Corporation - SPEC
8.11.4 Benchmarking distributed databases
8.11.5 Mixed workload benchmarks: OLTP and OLAP
8.11.6 Sort benchmarks
8.12 Data Base Machines - DBMs and backend processors
8.12.1 List of advantages and disadvantages
8.13 Head-per-track disks
8.13.1 Rotating Associative Processor for Information Dissemination - RAPID
8.13.2 Context Addressable Cellular System for Non-Numeric Processing - CASSM
8.13.3 Rotating Associative Relational Store - RARES
8.13.4 The Context-Addressed Segment-Sequential Storage - CASSS
8.13.5 Relational Associative Processor - RAP
8.13.6 Number of rotations required to read qualifying tuples from RAP
8.14 Active disks projects
8.14.1 Active disk project at CMU/PDL
8.14.2 Active disk project at UCSB/Maryland
8.14.3 Intelligent disks - IDisks
8.14.4 SmartSTOR project at IBM Almaden
8.14.5 Smart disk cluster project at Northwestern U.
8.14.6 Integrating server, storage and database stack at IBM Almaden
8.14.7 Active disks summary
8.14.8 Active storage revisited
8.15 Multidimensional indices on disk, DRAM, and flash
8.15.1 Survey of multi-dimensional indices
8.15.2 Combining dimensionality reduction with clustering
8.15.3 Dimensionality reduction via SVD, PCA, KLT
8.15.4 Clustering methods for dimensionality reduction
8.15.5 CSVD to build an index for k-NN queries
8.15.6 Nearest-neighbors query processing for multiple clusters
8.15.7 Exact versus approximate k-NN queries
8.15.8 Indices suited for dimensionality reduced data
8.15.9 Ordered Partition - OP tree index
8.15.10 Stepwise Dimensionality Increasing SDI-tree index
8.16 Implementing indices in flash memories
8.17 Redesign of relational databases by Stonebraker et al.
8.18 Parallel Data Base Machines - DBMs
8.18.1 Almaden Research Backend Relational Engine - ARBRE
8.18.2 IBM's DB2 Parallel Edition
8.19 Google File System, Bigtable, and Spanner
8.20 Microsoft Azure
8.20.1 Amazon DynamoDB - ADDB
8.20.2 Aurora database for processing continual inputs from sensors
8.21 IBM and other cloud service providers
8.22 Distributed databases in cloud computing
8.23 SpringFS bridging agility and performance in elastic distributed storage
8.24 Snowflake cloud based data warehousing with SQL support
8.24.1 Storage architecture and provisioning in Snowflake
8.24.2 Resource sharing in Snowflake
8.25 Review of peer-to-peer computing
8.26 Fast Array of Wimpy Nodes - FAWN
8.27 RAMCloud project at Stanford
8.28 How flash changes the design of database storage engines
8.29 Hybrid Transaction Analytic Processing - HTAP
8.30 Intelligent page store for concurrent txn and query processing
8.31 Oracle Exadata database machine
8.32 Oracle in memory option or Database in Main Memory - DBIM
8.33 MemSQL/SingleStore
8.34 Amazon Aurora
8.35 Transaction processing in the cloud
8.36 RAPID and Oracle AutoML: a fast and predictive AutoML pipeline
8.37 Benchmarking automatic ML frameworks
8.38 Alibaba's X-engine
8.39 RocksDB with ultrafast data access
8.40 LightStore project at MIT
8.41 PinK: high-speed in-storage key-value store with bounded tails
8.42 BlueDBM: an appliance for big data analytics
8.43 WiSer highly available HTAP DBMS for IoT applications
8.44 Raven RDBMS at Microsoft provides ML
8.45 Machine Learning data platform - MLdp
8.46 Databricks
8.47 Fungible - a new storage architecture for big data
8.48 Network requirements for resource disaggregation
8.49 Deep learning and associated hardware
8.50 GPU accelerated database systems
8.51 Graphics Processing Unit - GPU solutions
8.52 Field Programmable Gate Array - FPGA solutions
8.53 Multichip modules
8.54 Unified solutions
8.55 Power consumption in FPGAs and ASICs
8.56 Hybrid approaches to acceleration
8.57 Application Specific Integrated Circuit - ASIC
8.58 TensorFlow and Tensor Processing Units - TPUs
8.59 Increasing computational challenges
8.60 Quantum Neural Nets - QNNs
8.61 Data acceleration examples
8.61.1 The Q100 data processing unit
8.61.2 Many-core architecture for in-memory data processing at Oracle
8.62 Cerebras wafer size chips vs GPUs
8.62.1 The deep learning workload
8.62.2 Understanding performance
8.62.3 More silicon area means more space for compute cores
8.62.4 More silicon area means more on chip memory
8.62.5 Fast communication Swarm fabric
8.62.6 Co-designed with software for maximum utilization and usability
8.63 Conclusions
9 Structured, unstructured, and diverse databases
9.1 Categories of file systems
9.2 Mainframe count-key-data disk organizations
9.3 Hierarchical and network Data Base Management Systems - DBMSs
9.3.1 Information Management System - IMS
9.3.2 Transaction Processing Facility - TPF
9.3.3 Network databases
9.4 Relational data model
9.4.1 Methods to reduce the level of lock contention in OLTP
9.4.2 Informix
9.4.3 Berkeley's INGRES relational database
9.4.4 Oracle's relational databases
9.5 Ranking methodology for database engines
9.6 Overall ranking of all database types
9.7 Relational database management systems
9.8 Object relational databases
9.9 Data mining
9.10 Data warehousing and OLAP
9.10.1 Multi-dimensional OLAP - MOLAP
9.10.2 Relational OLAP - ROLAP
9.10.3 Hybrid OLAP - HOLAP
9.11 Distinct schools of thought in data warehouse design
9.11.1 The Inmon method
9.11.2 The Kimball method
9.11.3 Deciding factors for Kimball vs Inmon approaches
9.11.4 Conclusions on data warehousing
9.12 Data lakes
9.12.1 Major data lake types
9.13 Open source big data projects
9.14 Semi-structured data and its model
9.14.1 Pros and cons of semi-structured data format
9.15 Big data technology and the five Vs
9.16 Hadoop technology ecosphere
9.16.1 Four steps in the workflow of an analytical application
9.17 Distributed batch vs inline processing
9.18 NoSQL/non-relational databases
9.19 Key-value stores
9.20 Document stores
9.21 Time-series databases
9.22 Kubernetes and other containers
9.23 Graph databases
9.23.1 Categories of graph models
9.24 Object-oriented databases
9.25 Search engines for text
9.26 Web search engines
9.27 Resource Description Framework - RDF
9.28 Wide column stores
9.29 Multivalue databases
9.30 Native XML databases
9.31 Realtime stream processing
9.32 Event stores
9.33 Streaming analytics
9.33.1 Aurora database to manage data streams
9.34 Trill: a high-performance incremental query processor for diverse analytics
9.35 Summary of Forrester Wave™ streaming analytics, Q3 2019
9.36 Content stores
9.37 Multimodel databases
9.38 Main memory databases
9.39 Distributed file systems and object storage
9.40 Enterprise backup and recovery software solutions
9.41 Analytics and Business Intelligence - ABI platforms
9.41.1 Crunchbase finds business information about companies
9.42 Blockchain, Bitcoin, Ethereum
10 Heterogeneous Disk Arrays - HDAs
10.1 Introduction to RAID
10.2 Data allocation in a Heterogeneous Disk Array - HDA
10.2.1 Balancing disk allocations
10.2.2 Allocation methods considered in this study
10.3 Analytic justification for HDA
10.3.1 Improved response time due to HDA
10.3.2 The effect of disk failures on response time
10.3.3 Shortcut reliability analysis of HDA
10.4 HDA data allocation experiment setup
10.4.1 Virtual array allocation requests
10.4.2 Estimating virtual array widths in normal mode
10.4.3 Dealing with load increase in degraded mode
10.5 Data allocation experiments
10.5.1 Description of the experiment
10.5.2 Assumptions and parameter settings
10.5.3 Comparison of allocation methods
10.5.4 Sensitivity analysis
10.5.4.1 Sensitivity to β in Min-F1 and Min-F2
10.5.4.2 The effect of ρmax and vmax on allocations
10.5.5 Clustered RAID5
10.6 Rebuild processing in HDA
10.7 RAID+ data layout based on Latin squares
10.8 Related work
10.9 Using utility functions to provision storage systems
10.10 Conclusions
11 Hierarchical RAID - HRAID
11.1 Introduction to HRAID
11.2 Intranode & internode coding in HRAID
11.3 Concurrency control in HRAID
11.4 RAID IOPS with no disk failures
11.5 RAID IOPS with disk failures
11.6 HRAID response times
11.7 HRAID2/2 performance
11.8 RAID and HRAID reliability
11.9 Shortcut reliability analysis of HRAID
11.10 Simulation to estimate the MTTDL
11.11 Multistep recovery in HRAID
11.12 Related work
11.13 Collective Intelligent Bricks - CIB or Icecube project at IBM
11.13.1 Reliability analysis of Icecube
11.14 Conclusions
12 Conclusions
Appendix
A.1 Books on topics related to storage
A.1.1 Computer organization, architecture and fault-tolerance
A.1.2 Operating systems
A.1.3 Storage systems
A.1.4 Coding for storage systems
A.1.5 Database systems and machine learning
A.1.6 High frequency trading
A.1.7 Performance and reliability analysis
A.1.8 Computer arithmetic and logic design
A.1.9 Algorithms and data structures
A.1.10 Peer to peer computing
A.1.11 Computer history via interviews
A.2 ACM, IEEE, USENIX, and their publications
A.3 Journals, conferences, and workshops dealing with storage systems
A.3.1 Workshops and conferences related to data base machines
A.4 Web sites for trade publications
A.4.1 A few interesting blogs
A.5 Storage research in industry
A.6 Storage research at universities
A.7 Funding agencies, national labs, and research institutes
Bibliography
Index
Back Cover