Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration

Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration

Paperback

By (author) Matt Casters, By (author) Roland Bouman, By (author) Jos van Dongen

$35.32
List price $53.41
You save $18.09 33% off

Free delivery worldwide
Available
Dispatched in 4 business days
When will my order arrive?

  • Publisher: John Wiley & Sons Ltd
  • Format: Paperback | 720 pages
  • Dimensions: 188mm x 229mm x 46mm | 1,066g
  • Publication date: 1 October 2010
  • Publication City/Country: Chichester
  • ISBN 10: 0470635177
  • ISBN 13: 9780470635179
  • Edition statement: New ed.
  • Illustrations note: Illustrations
  • Sales rank: 458,207

Product description

A complete guide to Pentaho Kettle, the Pentaho Data lntegration toolset for ETL This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutions-before progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution. Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data) Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace Explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model, as created by the Kimball Group, to explore the entire ETL lifecycle, including all aspects of data warehousing with Kettle Goes beyond routine tasks to explore how to extend Kettle and scale Kettle solutions using a distributed "cloud" Get the most out of Pentaho Kettle and your data warehousing with this detailed guide-from simple single table data migration to complex multisystem clustered data integration tasks.

Other books in this category

Showing items 1 to 11 of 11
Categories:

Author information

Matt Casters is Founder of Kettle and works as Chief Data Integration at Pentaho, where he leads Kettle software development. Roland Bouman is an application developer focusing on open source web technology, databases, and business intelligence. Jos van Dongen is an independent business intelligence consultant and well-known author, analyst, and presenter.

Back cover copy

The ultimate resource on building and deploying data integration solutions with Kettle Kettle is a scaleable and extensible open source ETL and data integration tool that lets you extract data from databases, flat and XML files, web services, ERP systems, and OLAP cubes. It provides over 120 built-in transformation steps to validate, cleanse, and conform data, as well as numerous options to load data into data warehouses and many other targets. Kettle is a comprehensive, low-cost alternative to traditional data integration tools like Informatica PowerCenter, IBM InfoSphere DataStage, and BusinessObjects Data Integrator.This book explains in detail how to use Kettle to create, test, and deploy your own ETL and data integration solutions. You'll learn to use Kettle's programs to create transformations and jobs, use version control, audit data, and schedule your ETL solution. Then you'll progress to more advanced concepts such as clustering and cloud computing, real-time data integration, loading a Data Vault model, and extending Kettle by building your own plugins. In addition, you'll find hands-on examples and case studies that show exactly how to put Kettle's features into practice. Explore the components of the Kettle ETL toolset Discover how to install and configure Kettle and connect it to various data sources and targetsDesign and build every aspect of an ETL solution using KettleLearn how to load a data warehouse with KettleUnderstand the steps for deploying and scheduling ETL solutionsGain the skills to integrate Kettle with third-party productsLearn to extend Kettle and build your own pluginsUse clustering and cloud computing to scale and improve the performance of your Kettle ETL solutionsFind out how to use Kettle for real-time data integration

Table of contents

Introduction xxxi Part I Getting Started 1 Chapter 1 ETL Primer 3 OLTP versus Data Warehousing 3 What Is ETL? 5 The Evolution of ETL Solutions 5 ETL Building Blocks 7 ETL, ELT, and EII 8 ELT 9 EII: Virtual Data Integration 10 Data Integration Challenges 11 Methodology: Agile BI 12 ETL Design 14 Data Acquisition 14 Beware of Spreadsheets 15 Design for Failure 15 Change Data Capture 16 Data Quality 16 Data Profiling 16 Data Validation 17 ETL Tool Requirements 17 Connectivity 17 Platform Independence 18 Scalability 18 Design Flexibility 19 Reuse 19 Extensibility 19 Data Transformations 20 Testing and Debugging 21 Lineage and Impact Analysis 21 Logging and Auditing 22 Summary 22 Chapter 2 Kettle Concepts 23 Design Principles 23 The Building Blocks of Kettle Design 25 Transformations 25 Steps 26 Transformation Hops 26 Parallelism 27 Rows of Data 27 Data Conversion 29 Jobs 30 Job Entries 31 Job Hops 31 Multiple Paths and Backtracking 32 Parallel Execution 33 Job Entry Results 34 Transformation or Job Metadata 36 Database Connections 37 Special Options 38 The Power of the Relational Database 39 Connections and Transactions 39 Database Clustering 40 Tools and Utilities 41 Repositories 41 Virtual File Systems 42 Parameters and Variables 43 Defining Variables 43 Named Parameters 44 Using Variables 44 Visual Programming 45 Getting Started 46 Creating New Steps 47 Putting It All Together 49 Summary 51 Chapter 3 Installation and Configuration 53 Kettle Software Overview 53 Integrated Development Environment: Spoon 55 Command-Line Launchers: Kitchen and Pan 57 Job Server: Carte 57 Encr.bat and encr.sh 58 Installation 58 Java Environment 58 Installing Java Manually 58 Using Your Linux Package Management System 59 Installing Kettle 59 Versions and Releases 59 Archive Names and Formats 60 Downloading and Uncompressing 60 Running Kettle Programs 61 Creating a Shortcut Icon or Launcher for Spoon 62 Configuration 63 Configuration Files and the .kettle Directory 63 The Kettle Shell Scripts 69 General Structure of the Startup Scripts 70 Adding an Entry to the Classpath 70 Changing the Maximum Heap Size 71 Managing JDBC Drivers 72 Summary 72 Chapter 4 An Example ETL Solution--Sakila 73 Sakila 73 The Sakila Sample Database 74 DVD Rental Business Process 74 Sakila Database Schema Diagram 75 Sakila Database Subject Areas 75 General Design Considerations 77 Installing the Sakila Sample Database 77 The Rental Star Schema 78 Rental Star Schema Diagram 78 Rental Fact Table 79 Dimension Tables 79 Keys and Change Data Capture 80 Installing the Rental Star Schema 81 Prerequisites and Some Basic Spoon Skills 81 Setting Up the ETL Solution 82 Creating Database Accounts 82 Working with Spoon 82 Opening Transformation and Job Files 82 Opening the Step's Configuration Dialog 83 Examining Streams 83 Running Jobs and Transformations 83 The Sample ETL Solution 84 Static, Generated Dimensions 84 Loading the dim-date Dimension Table 84 Loading the dim-time Dimension Table 86 Recurring Load 87 The load-rentals Job 88 The load-dim-staff Transformation 91 Database Connections 91 The load-dim-customer Transformation 95 The load-dim-store Transformation 98 The fetch-address Subtransformation 99 The load-dim-actor Transformation 101 The load-dim-film Transformation 102 The load-fact-rental Transformation 107 Summary 109 Part II ETL 111 Chapter 5 ETL Subsystems 113 Introduction to the 34 Subsystems 114 Extraction 114 Subsystems 1--3: Data Profiling, Change Data Capture, and Extraction 115 Cleaning and Conforming Data 116 Subsystem 4: Data Cleaning and Quality Screen Handler System 116 Subsystem 5: Error Event Handler 117 Subsystem 6: Audit Dimension Assembler 117 Subsystem 7: Deduplication System 117 Subsystem 8: Data Conformer 118 Data Delivery 118 Subsystem 9: Slowly Changing Dimension Processor 118 Subsystem 10: Surrogate Key Creation System 119 Subsystem 11: Hierarchy Dimension Builder 119 Subsystem 12: Special Dimension Builder 120 Subsystem 13: Fact Table Loader 121 Subsystem 14: Surrogate Key Pipeline 121 Subsystem 15: Multi-Valued Dimension Bridge Table Builder 121 Subsystem 16: Late-Arriving Data Handler 122 Subsystem 17: Dimension Manager System 122 Subsystem 18: Fact Table Provider System 122 Subsystem 19: Aggregate Builder 123 Subsystem 20: Multidimensional (OLAP) Cube Builder 123 Subsystem 21: Data Integration Manager 123 Managing the ETL Environment 123 Summary 126 Chapter 6 Data Extraction 127 Kettle Data Extraction Overview 128 File-Based Extraction 128 Working with Text Files 128 Working with XML files 133 Special File Types 134 Database-Based Extraction 134 Web-Based Extraction 137 Text-Based Web Extraction 137 HTTP Client 137 Using SOAP 138 Stream-Based and Real-Time Extraction 138 Working with ERP and CRM Systems 138 ERP Challenges 139 Kettle ERP Plugins 140 Working with SAP Data 140 ERP and CDC Issues 146 Data Profiling 146 Using eobjects.org DataCleaner 147 Adding Profile Tasks 149 Adding Database Connections 149 Doing an Initial Profile 151 Working with Regular Expressions 151 Profiling and Exploring Results 152 Validating and Comparing Data 153 Using a Dictionary for Column Dependency Checks 153 Alternative Solutions 154 Text Profiling with Kettle 154 CDC: Change Data Capture 154 Source Data--Based CDC 155 Trigger-Based CDC 157 Snapshot-Based CDC 158 Log-Based CDC 162 Which CDC Alternative Should You Choose? 163 Delivering Data 164 Summary 164 Chapter 7 Cleansing and Conforming 167 Data Cleansing 168 Data-Cleansing Steps 169 Using Reference Tables 172 Conforming Data Using Lookup Tables 172 Conforming Data Using Reference Tables 175 Data Validation 179 Applying Validation Rules 180 Validating Dependency Constraints 183 Error Handling 183 Handling Process Errors 184 Transformation Errors 186 Handling Data (Validation) Errors 187 Auditing Data and Process Quality 191 Deduplicating Data 192 Handling Exact Duplicates 193 The Problem of Non-Exact Duplicates 194 Building Deduplication Transforms 195 Step 1: Fuzzy Match 197 Step 2: Select Suspects 198 Step 3: Lookup Validation Value 198 Step 4: Filter Duplicates 199 Scripting 200 Formula 201 JavaScript 202 User-Defined Java Expressions 202 Regular Expressions 203 Summary 205 Chapter 8 Handling Dimension Tables 207 Managing Keys 208 Managing Business Keys 209 Keys in the Source System 209 Keys in the Data Warehouse 209 Business Keys 209 Storing Business Keys 210 Looking Up Keys with Kettle 210 Generating Surrogate Keys 210 The "Add sequence" Step 211 Working with auto-increment or IDENTITY Columns 217 Keys for Slowly Changing Dimensions 217 Loading Dimension Tables 218 Snowflaked Dimension Tables 218 Top-Down Level-Wise Loading 219 Sakila Snowflake Example 219 Sample Transformation 221 Database Lookup Configuration 222 Sample Job 225 Star Schema Dimension Tables 226 Denormalization 226 Denormalizing to 1NF with the "Database lookup" Step 226 Change Data Capture 227 Slowly Changing Dimensions 228 Types of Slowly Changing Dimensions 228 Type 1 Slowly Changing Dimensions 229 The Insert / Update Step 229 Type 2 Slowly Changing Dimensions 232 The "Dimension lookup / update" Step 232 Other Types of Slowly Changing Dimensions 237 Type 3 Slowly Changing Dimensions 237 Hybrid Slowly Changing Dimensions 238 More Dimensions 239 Generated Dimensions 239 Date and Time Dimensions 239 Generated Mini-Dimensions 239 Junk Dimensions 241 Recursive Hierarchies 242 Summary 243 Chapter 9 Loading Fact Tables 245 Loading in Bulk 246 STDIN and FIFO 247 Kettle Bulk Loaders 248 MySQL Bulk Loading 249 LucidDB Bulk Loader 249 Oracle Bulk Loader 249 PostgreSQL Bulk Loader 250 Table Output Step 250 General Bulk Load Considerations 250 Dimension Lookups 251 Maintaining Referential Integrity 251 The Surrogate Key Pipeline 252 Using In-Memory Lookups 253 Stream Lookups 253 Late-Arriving Data 255 Late-Arriving Facts 256 Late-Arriving Dimensions 256 Fact Table Handling 260 Periodic and Accumulating Snapshots 260 Introducing State-Oriented Fact Tables 261 Loading Periodic Snapshots 263 Loading Accumulating Snapshots 264 Loading State-Oriented Fact Tables 265 Loading Aggregate Tables 266 Summary 267 Chapter 10 Working with OLAP Data 269 OLAP Benefits and Challenges 270 OLAP Storage Types 272 Positioning OLAP 272 Kettle OLAP Options 273 Working with Mondrian 274 Working with XML/A Servers 277 Working with Palo 282 Setting Up the Palo Connection 283 Palo Architecture 284 Reading Palo Data 285 Writing Palo Data 289 Summary 291 Part III Management and Deployment 293 Chapter 11 ETL Development Lifecycle 295 Solution Design 295 Best and Bad Practices 296 Data Mapping 297 Naming and Commentary Conventions 298 Common Pitfalls 299 ETL Flow Design 300 Reusability and Maintainability 300 Agile Development 301 Testing and Debugging 306 Test Activities 307 ETL Testing 308 Test Data Requirements 308 Testing for Completeness 309 Testing Data Transformations 311 Test Automation and Continuous Integration 311 Upgrade Tests 312 Debugging 312 Documenting the Solution 315 Why Isn't There Any Documentation? 316 Myth 1: My Software Is Self-Explanatory 316 Myth 2: Documentation Is Always Outdated 316 Myth 3: Who Reads Documentation Anyway? 317 Kettle Documentation Features 317 Generating Documentation 319 Summary 320 Chapter 12 Scheduling and Monitoring 321 Scheduling 321 Operating System--Level Scheduling 322 Executing Kettle Jobs and Transformations from the Command Line 322 UNIX-Based Systems: cron 326 Windows: The at utility and the Task Scheduler 327 Using Pentaho's Built-in Scheduler 327 Creating an Action Sequence to Run Kettle Jobs and Transformations 328 Kettle Transformations in Action Sequences 329 Creating and Maintaining Schedules with the Administration Console 330 Attaching an Action Sequence to a Schedule 333 Monitoring 333 Logging 333 Inspecting the Log 333 Logging Levels 335 Writing Custom Messages to the Log 336 E-mail Notifications 336 Configuring the Mail Job Entry 337 Summary 340 Chapter 13 Versioning and Migration 341 Version Control Systems 341 File-Based Version Control Systems 342 Organization 342 Leading File-Based VCSs 343 Content Management Systems 344 Kettle Metadata 344 Kettle XML Metadata 345 Transformation XML 345 Job XML 346 Global Replace 347 Kettle Repository Metadata 348 The Kettle Database Repository Type 348 The Kettle File Repository Type 349 The Kettle Enterprise Repository Type 350 Managing Repositories 350 Exporting and Importing Repositories 350 Upgrading Your Repository 351 Version Migration System 352 Managing XML Files 352 Managing Repositories 352 Parameterizing Your Solution 353 Summary 356 Chapter 14 Lineage and Auditing 357 Batch-Level Lineage Extraction 358 Lineage 359 Lineage Information 359 Impact Analysis Information 361 Logging and Operational Metadata 363 Logging Basics 363 Logging Architecture 364 Setting a Maximum Buffer Size 365 Setting a Maximum Log Line Age 365 Log Channels 366 Log Text Capturing in a Job 366 Logging Tables 367 Transformation Logging Tables 367 Job Logging Tables 373 Summary 374 Part IV Performance and Scalability 375 Chapter 15 Performance Tuning 377 Transformation Performance: Finding the Weakest Link 377 Finding Bottlenecks by Simplifying 379 Finding Bottlenecks by Measuring 380 Copying Rows of Data 382 Improving Transformation Performance 384 Improving Performance in Reading Text Files 384 Using Lazy Conversion for Reading Text Files 385 Single-File Parallel Reading 385 Multi-File Parallel Reading 386 Configuring the NIO Block Size 386 Changing Disks and Reading Text Files 386 Improving Performance in Writing Text Files 387 Using Lazy Conversion for Writing Text Files 387 Parallel Files Writing 387 Changing Disks and Writing Text Files 387 Improving Database Performance 388 Avoiding Dynamic SQL 388 Handling Roundtrips 388 Handling Relational Databases 390 Sorting Data 392 Sorting on the Database 393 Sorting in Parallel 393 Reducing CPU Usage 394 Optimizing the Use of JavaScript 394 Launching Multiple Copies of a Step 396 Selecting and Removing Values 397 Managing Thread Priorities 397 Adding Static Data to Rows of Data 397 Limiting the Number of Step Copies 398 Avoiding Excessive Logging 398 Improving Job Performance 399 Loops in Jobs 399 Database Connection Pools 400 Summary 401 Chapter 16 Parallelization, Clustering, and Partitioning 403 Multi-Threading 403 Row Distribution 404 Row Merging 405 Row Redistribution 406 Data Pipelining 407 Consequences of Multi-Threading 408 Database Connections 408 Order of Execution 409 Parallel Execution in a Job 411 Using Carte as a Slave Server 411 The Configuration File 411 Defining Slave Servers 412 Remote Execution 413 Monitoring Slave Servers 413 Carte Security 414 Services 414 Clustering Transformations 417 Defining a Cluster Schema 417 Designing Clustered Transformations 418 Execution and Monitoring 420 Metadata Transformations 421 Rules 422 Data Pipelining 425 Partitioning 425 Defining a Partitioning Schema 425 Objectives of Partitioning 427 Implementing Partitioning 428 Internal Variables 428 Database Partitions 429 Partitioning in a Clustered Transformation 430 Summary 430 Chapter 17 Dynamic Clustering in the Cloud 433 Dynamic Clustering 433 Setting Up a Dynamic Cluster 434 Using the Dynamic Cluster 436 Cloud Computing 437 EC2 438 Getting Started with EC2 438 Costs 438 Customizing an AMI 439 Packaging a New AMI 442 Terminating an AMI 442 Running a Master 442 Running the Slaves 443 Using the EC2 Cluster 444 Monitoring 445 The Lightweight Principle and Persistence Options 446 Summary 447 Chapter 18 Real-Time Data Integration 449 Introduction to Real-Time ETL 449 Real-Time Challenges 450 Requirements 451 Transformation Streaming 452 A Practical Example of Transformation Streaming 454 Debugging 457 Third-Party Software and Real-Time Integration 458 Java Message Service 459 Creating a JMS Connection and Session 459 Consuming Messages 460 Producing Messages 460 Closing Shop 460 Summary 461 Part V Advanced Topics 463 Chapter 19 Data Vault Management 465 Introduction to Data Vault Modeling 466 Do You Need a Data Vault? 466 Data Vault Building Blocks 467 Hubs 467 Links 468 Satellites 469 Data Vault Characteristics 471 Building a Data Vault 471 Transforming Sakila to the Data Vault Model 472 Sakila Hubs 472 Sakila Links 473 Sakila Satellites 474 Loading the Data Vault: A Sample ETL Solution 477 Installing the Sakila Data Vault 477 Setting Up the ETL Solution 477 Creating a Database Account 477 The Sample ETL Data Vault Solution 478 Sample Hub: hub-actor 478 Sample Link: link-customer-store 480 Sample Satellite: sat-actor 483 Loading the Data Vault Tables 485 Updating a Data Mart from a Data Vault 486 The Sample ETL Solution 486 The dim-actor Transformation 486 The dim-customer Transformation 488 The dim-film Transformation 492 The dim-film-actor-bridge Transformation 492 The fact-rental Transformation 493 Loading the Star Schema Tables 495 Summary 495 Chapter 20 Handling Complex Data Formats 497 Non-Relational and Non-Tabular Data Formats 498 Non-Relational Tabular Formats 498 Handling Multi-Valued Attributes 498 Using the Split Field to Rows Step 499 Handling Repeating Groups 500 Using the Row Normaliser Step 500 Semi- and Unstructured Data 501 Kettle Regular Expression Example 503 Configuring the Regex Evaluation Step 504 Verifying the Match 507 Key/Value Pairs 508 Kettle Key/Value Pairs Example 509 Text File Input 509 Regex Evaluation 510 Grouping Lines into Records 511 Denormaliser: Turning Rows into Columns 512 Summary 513 Chapter 21 Web Services 515 Web Pages and Web Services 515 Kettle Web Features 516 General HTTP Steps 516 Simple Object Access Protocol 517 Really Simple Syndication 517 Apache Virtual File System Integration 517 Data Formats 517 XML 518 Kettle Steps for Working with XML 518 Kettle Job Entries for XML 519 HTML 520 JavaScript Object Notation 520 Syntax 521 JSON, Kettle, and ETL/DI 522 XML Examples 523 Example XML Document 523 XML Document Structure 523 Mapping to the Sakila Sample Database 524 Extracting Data from XML 525 Overall Design: The import-xml-into-db Transformation 526 Using the XSD Validator Step 528 Using the "Get Data from XML" Step 530 Generating XML Documents 537 Overall Design: The export-xml-from-db Transformation 537 Generating XML with the Add XML Step 538 Using the XML Join Step 541 SOAP Examples 544 Using the "Web services lookup" Step 544 Configuring the "Web services lookup" Step 544 Accessing SOAP Services Directly 546 JSON Example 549 The Freebase Project 549 Freebase Versus Wikipedia 549 Freebase Web Services 550 The Freebase Read Service 550 The Metaweb Query Language 551 Extracting Freebase Data with Kettle 553 Generate Rows 554 Issuing a Freebase Read Request 555 Processing the Freebase Result Envelope 556 Filtering Out the Original Row 557 Storing to File 558 RSS 558 RSS Structure 558 Channel 558 Item 559 RSS Support in Kettle 560 RSS Input 561 RSS Output 562 Summary 567 Chapter 22 Kettle Integration 569 The Kettle API 569 The LGPL License 569 The Kettle Java API 570 Source Code 570 Building Kettle 571 Building javadoc 571 Libraries and the Class Path 571 Executing Existing Transformations and Jobs 571 Executing a Transformation 572 Executing a Job 573 Embedding Kettle 574 Pentaho Reporting 574 Putting Data into a Transformation 576 Dynamic Transformations 580 Dynamic Template 583 Dynamic Jobs 584 Executing Dynamic ETL in Kettle 586 Result 587 Replacing Metadata 588 Direct Changes with the API 589 Using a Shared Objects File 589 OEM Versions and Forks 590 Creating an OEM Version of PDI 590 Forking Kettle 591 Summary 592 Chapter 23 Extending Kettle 593 Plugin Architecture Overview 593 Plugin Types 594 Architecture 595 Prerequisites 596 Kettle API Documentation 596 Libraries 596 Integrated Development Environment 596 Eclipse Project Setup 597 Examples 598 Transformation Step Plugins 599 StepMetaInterface 599 Value Metadata 605 Row Metadata 606 StepDataInterface 607 StepDialogInterface 607 Eclipse SWT 607 Form Layout 607 Kettle UI Elements 609 Hello World Example Dialog 609 StepInterface 614 Reading Rows from Specific Steps 616 Writing Rows to Specific Steps 616 Writing Rows to Error Handling 617 Identifying a Step Copy 617 Result Feedback 618 Variable Substitution 618 Apache VFS 619 Step Plugin Deployment 619 The User-Defined Java Class Step 620 Passing Metadata 620 Accessing Input and Fields 620 Snippets 620 Example 620 Job Entry Plugins 621 JobEntryInterface 622 JobEntryDialogInterface 624 Partitioning Method Plugins 624 Partitioner 625 Repository Type Plugins 626 Database Type Plugins 627 Summary 628 Appendix A The Kettle Ecosystem 629 Kettle Development and Versions 629 The Pentaho Community Wiki 631 Using the Forums 631 Jira 632 ##pentaho 633 Appendix B Kettle Enterprise Edition Features 635 Appendix C Built-in Variables and Properties Reference 637 Internal Variables 637 Kettle Variables 640 Variables for Configuring VFS 641 Noteworthy JRE Variables 642 Index 643