Data Warehousing and Analytics: Fueling the Data Engine

This textbook covers all central activities of data warehousing and analytics, including transformation, preparation, aggregation, integration, and analysis. It follows the full journey of data from operational/transactional databases to data warehouses and data analytics, and discusses the role that data warehousing plays in the data processing lifecycle. It also explains in detail how data warehouses may be used by data engines, such as BI tools and analytics algorithms, to produce reports, dashboards, patterns, and other useful information and knowledge.

The book is divided into six parts, ranging from the basics of data warehouse design (Part I - Star Schema, Part II - Snowflake and Bridge Tables, Part III - Advanced Dimensions, and Part IV - Multi-Fact and Multi-Input) to more advanced data warehousing concepts (Part V - Data Warehousing Granularity and Evolution) and data analytics (Part VI - OLAP, Business Intelligence, and Data Analytics).

This textbook approaches data warehousing from the case study angle. Each chapter presents one or more case studies of varying difficulty to explain the concepts thoroughly, so learning is incremental. In addition, every chapter has a section on further readings, which gives pointers and references to research papers related to the chapter. These features make the book well suited to introductory courses on data warehousing and data analytics, as well as to self-study by professionals. The book is accompanied by a web page that includes all the datasets and code used, as well as slides and solutions to exercises.



Author(s): David Taniar, Wenny Rahayu
Series: Data-Centric Systems and Applications
Publisher: Springer
Year: 2022

Language: English
Pages: 642
City: Cham

Foreword
Preface
Acknowledgements
Contents
1 Introduction
1.1 Operational Databases
1.2 Data Warehouses
1.3 Building Data Warehouses
1.4 Using Data Warehouses
1.5 The Big Picture
1.6 Fueling the Data Engine
1.7 Organisation of the Book
1.8 Summary
1.9 Exercises
1.10 Further Readings
References
Part I Star Schema
2 Simple Star Schemas
2.1 Notations and Processes
2.1.1 Star Schema Notation
2.1.2 E/R Diagram Notation
2.1.3 Transformation Process
2.2 First Case Study: A College Star Schema
2.3 Another Simple Case Study: A Sales Star Schema
2.4 Two-Column Table Methodology
2.4.1 One-Fact Measure
2.4.2 Multiple Fact Measures
2.5 Summary
2.6 Exercises
2.7 Further Readings
References
3 Creating Facts and Dimensions: More Complex Processes
3.1 Use of count Function
3.2 Average in the Fact
3.3 Outer Join
3.4 Creating Temporary Dimension Tables
3.5 Creating Temporary Tables in the Operational Database
3.6 Summary
3.7 Exercises
3.8 Further Readings
References
Part II Snowflake and Bridge Tables
4 Hierarchies
4.1 Hierarchy vs. Non-hierarchy
4.2 Hierarchy Versus Multiple Independent Dimensions
4.2.1 Separate vs. Combined Dimension
4.2.2 Combined Dimension vs. Hierarchy
4.3 Linked Dimensions
4.4 Hierarchy Design Considerations
4.5 Summary
4.6 Exercises
4.7 Further Readings
References
5 Bridge Tables
5.1 A Product Sales Case Study
5.2 A Truck Delivery Case Study
5.2.1 Solution Model 1: Using a Bridge Table
5.2.2 Solution Model 2: Add a Weight Factor Attribute
5.2.3 Solution Model 3: A List Aggregate Version
5.3 Summary
5.4 Exercises
5.5 Further Readings
References
6 Temporal Data Warehousing
6.1 A Bookshop Case Study
6.2 Implementation of Temporal Data Warehousing
6.3 Temporal Attributes and Temporal Dimensions
6.3.1 Temporal Attributes
6.3.2 Temporal Dimensions
6.3.3 Another Temporal Dimension
6.4 Slowly Changing Dimensions
6.4.1 SCD Type 0 and Type 1
6.4.2 SCD Type 2
6.4.3 SCD Type 3
6.4.4 SCD Type 4
6.4.5 SCD Type 6
6.4.6 Implementation of SCD in SQL
6.4.7 Creating the Fact Tables
6.5 Summary
6.6 Exercises
6.7 Further Readings
References
Part III Advanced Dimension
7 Determinant Dimensions
7.1 Introducing a Determinant Dimension: Petrol Station Case Study
7.1.1 Petrol Station Star Schema Version 1
7.1.2 Petrol Station Star Schema Version 2
7.2 Determinant vs. Non-determinant Dimensions: The Olympic Games Case Study
7.2.1 Star Schema Version 1 (Without Medal Type Dimension)
7.2.2 Star Schema Version 2 (With Medal Type Dimension)
7.2.3 Determinant or Non-Determinant Dimensions
7.2.4 Version 1 (Without Medal Type Dimension) vs. Version 2 (With Medal Type Dimension)
7.2.5 Technical Challenges
7.3 Determinant Dimensions vs. Pivoted Fact Table: The PTE Academic Test Case Study
7.3.1 A Determinant Dimension Version
7.3.2 A Non-determinant Dimension Version or the Pivoted Fact Table Version
7.4 Non-type as a Determinant Dimension: University Enrolment Case Study
7.5 Multiple Relationship Between a Dimension and the Fact: Private Taxi Case Study
7.6 Summary
7.7 Exercises
7.8 Further Readings
References
8 Junk Dimensions
8.1 A Real-Estate Case Study
8.2 Option 1: The Non-junk Dimension Version
8.3 Option 2: The Junk Dimension Version
8.4 Non-junk Dimension Versus Junk Dimension
8.4.1 Simple Join Queries
8.4.2 Nested Queries
8.5 Is Combined Dimension a Junk Dimension?
8.6 Summary
8.7 Exercises
8.8 Further Readings
References
9 Dimension Keys
9.1 Surrogate Keys
9.1.1 An Example
9.2 Dimension-Less Keys
9.3 Summary
9.4 Exercises
9.5 Further Readings
References
10 One-Attribute Dimensions
10.1 Move It to the Fact
10.1.1 Column-Based Solution in the Fact
10.1.2 Row-Based Solution in the Fact
10.2 Keep It in the Dimension
10.2.1 Combine All One-Attribute Dimensions
10.2.2 Combine with Other Normal Dimensions
10.2.3 Determinant Dimension with One-Attribute Only
10.2.4 One-Attribute Dimension with Bridge
10.3 Summary
10.4 Exercises
10.5 Further Readings
References
Part IV Multi-Fact and Multi-Input
11 Multi-Fact Star Schemas
11.1 Different Subject Multi-fact: The Book Sales Case Study
11.1.1 Implementation in SQL
11.1.2 Multi-Fact with Pivot Table
11.2 Multi-Fact or Single Fact with Multiple Fact Measures: A Private Taxi Company Case Study
11.3 To Combine or Not to Combine
11.3.1 A Determinant Dimension Solution: Flight Charter Case Study
11.3.2 A Non-determinant Dimension Solution: Bachelor/Master Final Projects Case Study
11.3.3 Mutually Exclusive Star Schemas: Lecturer/Tutor Taking Tutorials Case Study
11.4 Different Granularity Multi-Fact: The Car Service Case Study
11.5 Summary
11.6 Exercises
11.7 Further Readings
References
12 Slicing a Fact
12.1 Vertical Slice
12.2 Horizontal Slice
12.3 Vertical or Horizontal Slice?
12.4 Determinant Dimension
12.5 Summary
12.6 Exercises
12.7 Further Readings
References
13 Multi-Input Operational Databases
13.1 Vertical Stacking: University Student Clubs Case Study
13.1.1 Student Orchestra Club
13.1.2 Business and Commerce Students' Society
13.1.3 Japanese Club
13.1.4 Building an Integrated Data Warehouse
13.2 Horizontal Stacking: Real-Estate Property Case Study
13.3 Summary
13.4 Exercises
13.5 Further Readings
References
Part V Data Warehousing Granularity and Evolution
14 Data Warehousing Granularity and Levels of Aggregation
14.1 Levels of Aggregation
14.2 Facts Without Fact Measures
14.3 Star Schemas with No Aggregation
14.4 Understanding the Relationship Between Transactions and Fact Measures
14.5 Levels of Aggregations, Hierarchy and Multi-Fact
14.6 Summary
14.7 Exercises
14.8 Further Readings
References
15 Designing Lowest-Level Star Schemas
15.1 Median House Price
15.2 Other Statistical Functions
15.3 Querying Level-0 or a Higher-Level Star Schema
15.4 Summary
15.5 Exercises
15.6 Further Readings
References
16 Levels of Aggregation: Adding and Removing Dimensions
16.1 Adding New Dimensions
16.1.1 Adding New Dimensions Does Not Lower Down the Level of Aggregation
16.1.2 Adding New Dimensions May Result in a Double Counting in the Fact Measure
16.1.3 The Final Star Schemas
16.1.4 Summary
16.2 Removing Dimensions
16.2.1 An Employee Case Study
16.2.2 Removing a Determinant Dimension
16.2.3 Summary
16.3 Exercises
16.4 Further Readings
References
17 Levels of Aggregation and Bridge Tables
17.1 Bridge Table: Truck Delivery Case Study
17.1.1 Combining Trips: TripGroupList
17.1.2 Combining Trips: StoreGroupList
17.1.3 Summary
17.2 Bridge Table: Product Sales Case Study
17.3 Summary
17.4 Exercises
17.5 Further Readings
References
18 Active Data Warehousing
18.1 Passive vs. Active Data Warehousing
18.2 Incremental Updates
18.2.1 Automatic Updates of Data Warehouse
18.2.1.1 Level-0
18.2.1.2 Level-1
18.2.1.3 Level-2
18.2.2 Expiry Date
18.2.2.1 Level-0
18.2.2.2 Level-1
18.2.2.3 Level-2
18.2.3 Data Warehouse Rules Changed
18.2.3.1 Level-0
18.2.3.2 Level-1
18.2.3.3 Level-2
18.3 Data Warehousing Schema Evolution
18.3.1 Changes Propagating to the Next Levels
18.3.1.1 Level-0
18.3.1.2 Level-1
18.3.1.3 Level-2
18.3.2 Changes Not Affecting the Next Levels
18.3.3 Inserting New Star Schema
18.3.4 Deleting Star Schema
18.4 Operational Database Evolution
18.4.1 Changes in the Table Structure
18.4.2 Changes in the E/R Schema
18.4.3 Changes in the Operational Database
18.5 Summary
18.6 Exercises
18.7 Further Readings
References
Part VI OLAP, Business Intelligence, and Data Analytics
19 Online Analytical Processing (OLAP)
19.1 Sales Data Warehousing
19.2 Basic Aggregate Functions
19.2.1 count Function
19.2.2 sum Function
19.2.3 avg, max and min Functions
19.2.4 group by Clause
19.3 Cube and Rollup
19.3.1 Cube
19.3.2 Rollup
19.3.3 Rollup vs. Cube
19.3.4 Partial Cube and Partial Rollup
19.3.5 grouping and decode Functions
19.4 Ranking
19.4.1 Rank
19.4.2 Top-N and Top-Percent Ranking
19.4.3 Partition
19.5 Cumulative and Moving Aggregate
19.5.1 Cumulative Aggregate
19.5.2 Moving Aggregate
19.6 Business Intelligence Reporting
19.6.1 Cumulative and Moving Aggregate
19.6.2 Ratio
19.6.3 Ranking
19.6.4 A More Complete Report
19.7 Summary
19.8 Exercises
19.9 Further Readings
References
20 Pre- and Post-Data Warehousing
20.1 Pre-Data Warehousing: Exploring Dirty Data
20.1.1 Duplication Problems
20.1.1.1 Data Duplication Between Records
20.1.1.2 Data Duplication Between Attributes
20.1.1.3 Duplication Between Tables
20.1.2 Relationship Problems
20.1.3 Inconsistent Values
20.1.3.1 Inconsistent Values at a Record Level
20.1.3.2 Inconsistent Values Between Attributes
20.1.4 Incorrect Values
20.1.4.1 Incorrect Value Problem at an Attribute Level
20.1.4.2 Incorrect Value Problem Between Records
20.1.4.3 Incorrect Value Problem Between Tables
20.1.5 Null Value Problems
20.1.5.1 Null Value Problems at an Attribute Level
20.1.5.2 Null Value Problems Between Records
20.1.5.3 Null Value Problems Between Attributes
20.1.6 Summary
20.2 Post-Data Warehousing: Exploring the Extended Fact Table
20.2.1 Extended Fact Table
20.2.2 A Typical Data Science Project
20.2.3 Explore Individual Attributes
20.2.3.1 Basic Statistics
20.2.3.2 Count Distribution: Histogram
20.2.3.3 Value Distribution: Boxplots
20.2.4 Search Records
20.2.5 Explore Multiple Attributes
20.3 Summary
20.4 Exercises
20.5 Further Readings
References
21 Data Analytics for Data Warehousing
21.1 Traditional Data Mining Techniques vs. Data Analytics for Data Warehousing
21.1.1 Traditional Data Mining Techniques
21.1.2 Data Analytics Requirements in Data Warehousing
21.2 Statistical Method: Regression
21.2.1 Simple Linear Regression
21.2.2 Polynomial Regression
21.2.3 Rolling Windows vs. Regression
21.2.4 Non-Time-Series Regression
21.3 Clustering Analysis
21.3.1 Centroid-Based Clustering
21.3.2 Density-Based Clustering
21.4 Classification Using Regression Trees
21.4.1 Selecting the Root Node
21.4.2 Level 1: Processing the Left Sub-Tree
21.4.3 Level 1: Processing the Right Sub-Tree
21.4.4 Level 2: Finalising the Regression Tree
21.5 Data Warehousing: The Middle Man
21.6 Summary
21.7 Exercises
21.8 Further Readings
References