Pentaho 3.2 Data Integration: Beginner's Guide

This document was uploaded by one of our users. The uploader already confirmed that they had the permission to publish it. If you are author/publisher or own the copyright of this documents, please report to us by using this DMCA report form.

Simply click on the Download Book button.

Yes, Book downloads on Ebookily are 100% Free.

Sometimes the book is free on Amazon As well, so go ahead and hit "Search on Amazon"

As part of Packt's Beginner's Guide, this book focuses on teaching by example. The book walks you through every aspect of PDI, giving step-by-step instructions in a friendly style, allowing you to learn in front of your computer, playing with the tool. The extensive use of drawings and screenshots make the process of learning PDI easy. Throughout the book numerous tips and helpful hints are provided that you will not find anywhere else.The book provides short, practical examples and also builds from scratch a small datamart intended to reinforce the learned concepts and to teach you the basics of data warehousing. This book is for software developers, database administrators, IT students, and everyone involved or interested in developing ETL solutions, or, more generally, doing any kind of data manipulation. If you have never used PDI before, this will be a perfect book to start with.You will find this book is a good starting point if you are a database administrator, data warehouse designer, architect, or any person who is responsible for data warehouse projects and need to load data into them. You don't need to have any prior data warehouse or database experience to read this book. Fundamental database and data warehouse technical terms and concepts are explained in easy-to-understand language.

Author(s): Maria Carina Roldan
Year: 2010

Language: English
Pages: 492

Cover......Page 1
Copyright......Page 3
Credits......Page 4
Foreword......Page 5
The Kettle Project......Page 7
About the Author......Page 10
About the Reviewers......Page 11
Table of Contents......Page 14
Preface......Page 24
Pentaho Data Integration and Pentaho BI Suite......Page 30
Pentaho Data Integration......Page 32
Loading data warehouses or data marts......Page 34
Data cleansing......Page 35
Integrating PDI using Pentaho BI......Page 36
Time for action – installing PDI......Page 37
Time for action – starting and customizing Spoon......Page 38
Setting preferences in the Options window......Page 41
Storing transformations and jobs in a repository......Page 42
Time for action – creating a hello world transformation......Page 43
Directing the Kettle engine with transformations......Page 48
Exploring the Spoon interface......Page 49
transformation......Page 50
Time for action – installing MySQL on Windows......Page 52
Time for action – installing MySQL on Ubuntu......Page 55
Summary......Page 57
Reading data from files......Page 58
Time for action – reading results of football matches from files......Page 59
Input steps......Page 64
Text file input step......Page 65
Text file input step and regular expressions......Page 66
Regular expressions......Page 67
Grids......Page 69
Time for action – sending the results of matches to a plain file......Page 70
Output files......Page 72
Rowset......Page 73
Streams......Page 74
Getting system information......Page 75
Time for action – updating a file with news about examinations......Page 76
Getting information by using Get System Info step......Page 80
Date fields......Page 81
Numeric fields......Page 82
a terminal window......Page 83
about countries......Page 85
What is XML......Page 90
XPath......Page 91
Configuring the Get data from XML step......Page 92
How and when you can use variables......Page 93
Summary......Page 95
Basic calculations......Page 96
Calculator step......Page 97
Adding or modifying fields by using different PDI steps......Page 105
The Calculator step......Page 106
Formula step......Page 107
Calculations on groups of rows......Page 111
grouping data......Page 112
Group by step......Page 117
Time for action – counting frequent words by filtering......Page 120
Filtering rows using the Filter rows step......Page 126
Time for action – finding out which language people speak......Page 128
The Stream lookup step......Page 132
Summary......Page 135
Splitting streams......Page 136
a dataset......Page 137
Copying rows......Page 142
Distributing rows......Page 143
Time for action – assigning tasks by distributing......Page 144
Splitting the stream based on conditions......Page 148
Filter rows step......Page 149
PDI steps for splitting the stream based on conditions......Page 151
Switch/ Case step......Page 152
Merging streams......Page 154
Time for action – gathering progress and merging all together......Page 155
PDI options for merging streams......Page 157
Append Stream......Page 160
Summary......Page 162
Doing simple tasks with the JavaScript step......Page 164
Time for action – calculating scores with JavaScript......Page 165
Using the JavaScript language in PDI......Page 170
Inserting JavaScript code using the Modified Java Script Value step......Page 171
Modifying fields......Page 173
Testing your code......Page 174
Time for action – testing the calculation of averages......Page 175
Testing the script using the Test script button......Page 176
Time for action – calculating flexible scores by using variables......Page 177
Using named parameters......Page 181
Using transformation predefined constants......Page 182
JavaScript......Page 185
Looking at previous rows......Page 187
Avoiding coding by using purpose-built steps......Page 188
Summary......Page 190
Converting rows to columns......Page 192
rows to columns......Page 193
Converting row data to column data by using the Row denormalizer step......Page 196
Aggregating data with a Row denormalizer step......Page 199
by country......Page 200
Using Row denormalizer for aggregating data......Page 201
the dataset......Page 203
Modifying the dataset with a Row Normalizer step......Page 205
Summarizing the PDI steps that operate on sets of rows......Page 207
Generating a custom time dimension dataset by using Kettle variables......Page 209
Time for action – creating the time dimension dataset......Page 210
Getting variables......Page 214
starting date......Page 215
Using the Get Variables step......Page 216
Summary......Page 217
Capturing errors......Page 218
of a film......Page 219
Using PDI error handling functionality......Page 223
Aborting a transformation......Page 224
Time for action – aborting when there are too many errors......Page 225
Time for action – treating errors that may appear......Page 226
Treating rows coming to the error stream......Page 228
Time for action – validating genres with a Regex Evaluation step......Page 229
Validating data......Page 231
Time for action – checking films file with the Data Validator......Page 232
Defining simple validation rules using the Data Validator......Page 234
Cleansing data......Page 236
Summary......Page 238
Introducing the Steel Wheels sample database......Page 240
database......Page 242
Connecting with Relational Database Management Systems......Page 245
Exploring the Steel Wheels database......Page 246
Time for action – exploring the sample database......Page 247
A brief word about SQL......Page 248
Exploring any configured database with the PDI Database explorer......Page 251
Time for action – getting data about shipped orders......Page 252
Getting data from the database with the Table input step......Page 254
Using the SELECT statement for generating a new dataset......Page 255
parameters......Page 257
Making flexible queries by using Kettle variables......Page 259
variables......Page 260
Time for action – loading a table with a list of manufacturers......Page 262
Inserting new data into a database table with the Table output step......Page 268
existent ones......Page 269
Time for action – testing the update of existing products......Page 272
Inserting or updating data with the Insert/Update step......Page 274
Time for action – deleting data about discontinued items......Page 279
Deleting records of a database table with the Delete step......Page 282
Summary......Page 283
Time for action – populating the Jigsaw database......Page 284
Exploring the Jigsaw database model......Page 287
of products to buy......Page 289
Looking up values in a database with the Database lookup step......Page 291
suggested products to buy......Page 293
Joining data from the database to the stream data by using a Database join step......Page 295
Introducing dimensional modeling......Page 298
Combination lookup/update step......Page 299
region dimension......Page 302
Describing data with dimensions......Page 304
Loading Type I SCD with a Combination lookup/update step......Page 305
Dimension lookup/update step......Page 309
of product changes......Page 311
Keeping an entire history of data with a Type II slowly changing dimension......Page 312
Loading Type II SCDs with the Dimension lookup/update step......Page 314
Summary......Page 319
Introducing PDI jobs......Page 320
Time for action – creating a simple hello world job......Page 321
Executing processes with PDI jobs......Page 328
Using Spoon to design and run jobs......Page 329
Using the transformation job entry......Page 330
arguments and parameters......Page 332
Running jobs from a terminal window......Page 335
window......Page 336
Using named parameters and command-line arguments in transformations......Page 337
fixed arguments and parameters......Page 338
Deciding between the use of a command-line argument and a named parameter......Page 340
administrator if something is wrong......Page 341
Changing the flow of execution on the basis of conditions......Page 347
Creating and using a file results list......Page 349
Summary......Page 350
Enhancing your processes with the use of variables......Page 352
by setting a variable with the name of the file......Page 353
Setting variables inside a transformation......Page 358
Time for action – generating files with top scores......Page 360
subtransformation......Page 364
Creating and using subtransformations......Page 368
copying and getting rows......Page 371
Transferring data between transformations by using the copy /get rows mechanism......Page 375
nesting jobs......Page 377
Running a job inside another job with a job entry......Page 378
Understanding the scope of variables......Page 379
Iterating jobs and transformations......Page 380
transformation for every input row......Page 381
Executing for each row......Page 384
Summary......Page 389
Exploring the sales datamart......Page 390
Loading the dimensions......Page 393
Time for action – loading dimensions for the sales datamart......Page 394
Extending the sales datamart model......Page 399
dimensions......Page 401
Getting the information from the source with SQL queries......Page 407
Obtaining the surrogate key for a Type I SCD......Page 411
Obtaining the surrogate key for a Type II SCD......Page 412
Obtaining the surrogate key for the Time dimension......Page 414
obtained from the command line......Page 417
Time for action – loading the sales star......Page 419
Time for action – automating the loading of the sales datamart......Page 422
Summary......Page 426
PDI best practices......Page 428
Extending Kettle with plugins......Page 431
Overcoming real world risks with some remote execution......Page 433
Scaling out to overcome bigger risks......Page 434
PDI as a process action......Page 435
PDI as a datasource......Page 436
More about the Pentaho suite......Page 437
PDI Enterprise Edition and Kettle Developer Support......Page 438
Summary......Page 439
Appendix A: Working with Repositories......Page 440
Time for action – creating a PDI repository......Page 441
Creating repositories to store your transformations and jobs......Page 443
Time for action – logging into a repository......Page 444
Defining repository user accounts......Page 445
Creating transformations and jobs in repository folders......Page 446
Examining and modifying the contents of a repository with the Repository explorer......Page 447
Migrating from a file-based system to a repository-based system and vice-versa......Page 449
Summary......Page 450
Running transformations and jobs stored in files......Page 452
Running transformations and jobs from a repository......Page 453
Specifying command line options......Page 454
Providing options when running Pan and Kitchen......Page 455
Variables......Page 456
Appendix C: Quick Reference: Steps and Job Entries......Page 458
Transformation steps......Page 459
Job entries......Page 463
General shortcuts......Page 466
Designing transformations and jobs......Page 467
Repositories......Page 468
Experiencing the mouse-over assistance......Page 470
Using the mouse-over assistance toolbar......Page 471
Experiencing the job drill-down feature......Page 472
Summary......Page 473
PDI basics......Page 474
splitting a stream......Page 475
PDI error handling......Page 476
loading slowly changing dimensions......Page 477
deciding the scope of variables......Page 478
remote execution and clustering......Page 479
Index......Page 480