Thursday, February 23, 2023

Let's understand ETL Life Cycle and Best Practices

Extract, transform, load (ETL) is the process of extracting data from various sources, transforming it, and loading it into a data warehouse or another centralized data repository.

ETL serves as the building block for data analytics and machine learning workstreams. In accordance with a set of business rules, ETL cleans and organizes data to satisfy specific business intelligence needs, such as monthly reporting, but it can also handle more advanced analytics that enhance back-end operations or end-user experiences. An organization frequently employs ETL to:

  • Extract data from legacy systems.
  • Cleanse the data to improve data quality and establish consistency.
  • Load the data into the target database.
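
Taken together, extract, transform, and load can be as small as the sketch below. This is only a minimal illustrative pass using the Python standard library; the input file `orders.csv`, its column names, and the SQLite target `warehouse.db` are assumptions made for the example, not references to any particular system.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a legacy CSV export."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean values and enforce consistency."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().title(),  # normalize names
            "amount": round(float(row["amount"]), 2),     # consistent precision
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: fill the target database with the cleaned data."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, customer TEXT, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (:order_id, :customer, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))
```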

The following execution steps make up a typical real-world ETL cycle (a minimal orchestration sketch follows the list):

1. Cycle initiation

2. Build reference data

3. Extract (from sources)

4. Validate

5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)

6. Stage (load into staging tables, if used)

7. Audit (report, for example, on adherence to corporate policies; in the event of failure, the audit trail also aids in diagnosis and repair)

8. Publish (to target tables)

9. Archive

10. Clean up
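
A minimal orchestration skeleton for this cycle might look like the following. It is a sketch, not any specific scheduler's API: the phase functions are placeholders, and the run-id format and logging are assumptions made for illustration.

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl_cycle")

# Placeholder phase functions; a real pipeline would put extract/validate/
# transform/stage/publish logic behind these names.
def build_reference_data(ctx): ctx["ref"] = {}
def extract(ctx): ctx["raw"] = []
def validate(ctx): ctx["valid"], ctx["rejects"] = ctx["raw"], []
def transform(ctx): ctx["transformed"] = ctx["valid"]
def stage(ctx): ctx["staged"] = ctx["transformed"]
def audit(ctx): log.info("rows staged=%d rejects=%d", len(ctx["staged"]), len(ctx["rejects"]))
def publish(ctx): pass
def archive(ctx): pass
def clean_up(ctx): pass

PHASES = [build_reference_data, extract, validate, transform,
          stage, audit, publish, archive, clean_up]

def run_cycle():
    # Cycle initiation: assign a run id so later steps (audit, archive) stay traceable.
    ctx = {"run_id": datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")}
    for phase in PHASES:
        log.info("run %s: starting %s", ctx["run_id"], phase.__name__)
        try:
            phase(ctx)
        except Exception:
            # On failure, the audit trail aids diagnosis and repair.
            log.exception("run %s: phase %s failed", ctx["run_id"], phase.__name__)
            raise

if __name__ == "__main__":
    run_cycle()
```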

Best practices

A 4-tiered approach to ETL architecture design

  • Functional layer: Core ETL processing, i.e. extract, transform, and load.
  • Operational management layer: Job-stream definition and management, parameters, scheduling, monitoring, communication and alerting.
  • Audit, Balance and Control (ABC) layer: Job-execution statistics, balancing and controls, reject and error handling, codes management (see the sketch after this list).
  • Utility layer: Common components supporting all other layers.
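
As an illustration of what the ABC layer records, the sketch below writes one job-execution statistics row per job and performs a simple balancing check. The control database `etl_control.db` and the `job_audit` table layout are assumptions for the example; real ABC schemas are usually richer (balancing totals, reject reasons, source/target checksums, and so on).

```python
import sqlite3
from datetime import datetime, timezone

# Assumed control-table layout for the example.
DDL = """CREATE TABLE IF NOT EXISTS job_audit (
    run_id TEXT, job_id TEXT, started_at TEXT, finished_at TEXT,
    rows_read INTEGER, rows_loaded INTEGER, rows_rejected INTEGER, status TEXT)"""

def record_job_stats(run_id, job_id, started_at, rows_read, rows_loaded,
                     rows_rejected, status, db_path="etl_control.db"):
    """Write one audit/balance record for a finished job."""
    con = sqlite3.connect(db_path)
    con.execute(DDL)
    con.execute(
        "INSERT INTO job_audit VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (run_id, job_id, started_at, datetime.now(timezone.utc).isoformat(),
         rows_read, rows_loaded, rows_rejected, status),
    )
    con.commit()
    con.close()

def balanced(rows_read, rows_loaded, rows_rejected):
    """Balancing control: everything read must be either loaded or rejected."""
    return rows_read == rows_loaded + rows_rejected
```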

Use file-based ETL processing where possible (a sketch of this style follows the list below).

  • Storage is less expensive.
  • Intermediate files serve several purposes:
      • Used for testing and debugging.
      • Used for restart and recovery processing.
      • Used to calculate control statistics.
  • Reduces dependencies and enables modular programming.
  • Allows for flexibility in job execution and scheduling.
  • Better performance if coded properly, and can take advantage of parallel processing capabilities when the need arises.
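
In the file-based style, each step reads the previous step's intermediate file and writes its own, so any step can be inspected, restarted, or tested in isolation. The sketch below is illustrative only: the `stage/` directory, file names, and newline-delimited JSON format are assumptions made for the example.

```python
import json
from pathlib import Path

STAGE_DIR = Path("stage")            # assumed location for intermediate files
STAGE_DIR.mkdir(exist_ok=True)

def write_stage(name, rows):
    """Persist a step's output as newline-delimited JSON (restart, debugging, testing)."""
    path = STAGE_DIR / f"{name}.jsonl"
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path

def read_stage(name):
    """Re-read an intermediate file, e.g. to restart processing from a failed step."""
    with open(STAGE_DIR / f"{name}.jsonl") as f:
        return [json.loads(line) for line in f]

# Example: the transform step consumes 'extracted.jsonl' and produces 'transformed.jsonl';
# control statistics fall out of simple row counts on the intermediate files.
write_stage("extracted", [{"order_id": 1, "amount": "12.5"}, {"order_id": 2, "amount": "7"}])
extracted = read_stage("extracted")
transformed = [dict(r, amount=round(float(r["amount"]), 2)) for r in extracted]
write_stage("transformed", transformed)
print(f"rows read={len(extracted)} rows written={len(transformed)}")
```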

Use data-driven methods and minimize custom ETL coding.

  • Parameter-driven jobs, functions, and job-control.
  • Code definitions and mappings stored in the database.
  • Consider data-driven tables to support more complex code mappings and business-rule application (see the mapping sketch after this list).
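
In practice this often means keeping code mappings as data rather than as if/else branches, so a new code becomes a data change instead of a code change. In the sketch below, the mapping is a hypothetical in-memory dict standing in for a database table; the column and code values are invented for illustration.

```python
# Stand-in for a code-mapping table held in the database
# (e.g. source_code -> target_code rows for one mapping domain).
COUNTRY_CODE_MAP = {"UK": "GB", "USA": "US", "HOL": "NL"}

def apply_code_mapping(rows, column, mapping, default="UNKNOWN"):
    """Replace source codes with target codes; unmapped codes get a default
    (or could be routed to a rejects file by the ABC layer)."""
    out = []
    for row in rows:
        mapped = dict(row)
        mapped[column] = mapping.get(row[column], default)
        out.append(mapped)
    return out

rows = [{"customer": "Acme", "country": "UK"}, {"customer": "Globex", "country": "USA"}]
print(apply_code_mapping(rows, "country", COUNTRY_CODE_MAP))
# [{'customer': 'Acme', 'country': 'GB'}, {'customer': 'Globex', 'country': 'US'}]
```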

Qualities of a good ETL architecture design:

  • Performant
  • Scalable
  • Migratable
  • Reliable (run-id, ETL-id, job-id...)
  • Operable (completion codes for phases, re-running from checkpoints, etc.; see the sketch after this list)
  • Auditable (in two dimensions: business requirements and technical troubleshooting)
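
Operability in particular benefits from explicit completion codes and checkpoints. The fragment below sketches re-running a cycle from the last completed phase; the per-run checkpoint file and phase names are assumptions made for illustration.

```python
import json
from pathlib import Path

PHASES = ["extract", "validate", "transform", "stage", "publish"]

def run_with_checkpoints(run_id, phase_funcs):
    """Execute phases in order, skipping any that already completed in a prior attempt."""
    path = Path(f"checkpoint_{run_id}.json")      # assumed naming convention
    done = set(json.loads(path.read_text())) if path.exists() else set()
    for name in PHASES:
        if name in done:
            continue                              # completed in a previous run
        phase_funcs[name]()                       # may raise; earlier progress is preserved
        done.add(name)
        path.write_text(json.dumps(sorted(done)))

# Usage: run_with_checkpoints("20230223_01", {name: (lambda: None) for name in PHASES})
```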
