IBM Data Engineering: ETL and Data Pipelines with Shell, Airflow and Kafka
This course archive summarizes core concepts in data engineering, covering ETL and ELT patterns, batch and streaming pipelines, workflow orchestration with Apache Airflow, and real-time event streaming using Apache Kafka. The notes focus on architectural principles, operational trade-offs, and tooling ecosystems commonly used in modern data platforms.
Module 1: Data Processing Techniques
The course starts by establishing the foundation: how raw data is moved, reshaped, and made usable.
The ETL (Extract–Transform–Load) model is introduced as a structured pipeline where data is cleaned and prepared before reaching the target system. This approach emphasizes consistency and control, making it well suited for structured, on-premises, or tightly governed environments.
In contrast, ELT (Extract–Load–Transform) reverses the order. Raw data is loaded first and transformed later, typically inside cloud data platforms. This enables greater flexibility, faster experimentation, and full access to unmodified data, at the cost of higher storage and compute usage.
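To make the ordering difference concrete, here is a minimal sketch using an in-memory SQLite database as a stand-in target system; the table names and rows are invented for illustration and are not from the course.

```python
import sqlite3

# Toy input: one row has a missing amount that needs cleaning.
raw_rows = [("ord-1", "100.0"), ("ord-2", None), ("ord-3", "250.5")]

def run_etl(conn):
    # ETL: transform in application code first, then load only prepared rows.
    cleaned = [(oid, float(amt)) for oid, amt in raw_rows if amt is not None]
    conn.execute("CREATE TABLE etl_sales (order_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO etl_sales VALUES (?, ?)", cleaned)

def run_elt(conn):
    # ELT: load the raw data as-is, then transform inside the target with SQL.
    conn.execute("CREATE TABLE raw_sales (order_id TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)
    conn.execute(
        "CREATE TABLE elt_sales AS "
        "SELECT order_id, CAST(amount AS REAL) AS amount "
        "FROM raw_sales WHERE amount IS NOT NULL"
    )

conn = sqlite3.connect(":memory:")
run_etl(conn)   # only clean rows ever reach the target
run_elt(conn)   # raw rows land first; cleaning happens inside the target
```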
The module also surveys:
Data extraction techniques (APIs, web scraping, OCR, IoT, sensors)
Common transformation operations (cleaning, normalization, aggregation, anonymization)
Loading strategies (full vs incremental, batch vs streaming, push vs pull, parallel loading); a small incremental-load sketch follows this list
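To make the full vs incremental distinction concrete, the sketch below drives an incremental load with a high-watermark timestamp; the SQLite stand-ins, schema, and data are invented for illustration.

```python
import sqlite3

# Stand-in source and target databases with the same (invented) schema.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id TEXT, amount REAL, updated_at TEXT)")
target.execute("CREATE TABLE orders (order_id TEXT, amount REAL, updated_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("ord-1", 100.0, "2024-01-01"), ("ord-2", 250.5, "2024-01-02")],
)

def incremental_load():
    # The newest timestamp already in the target acts as the watermark.
    watermark = target.execute("SELECT MAX(updated_at) FROM orders").fetchone()[0]
    # Pull only source rows newer than the watermark; a full load would copy everything.
    query = "SELECT * FROM orders" + (" WHERE updated_at > ?" if watermark else "")
    rows = source.execute(query, (watermark,) if watermark else ()).fetchall()
    target.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    target.commit()

incremental_load()   # first run copies both rows
incremental_load()   # second run finds nothing newer and copies nothing
```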
Key takeaway: ETL and ELT solve the same problem but optimize for different constraints. Modern platforms increasingly favor ELT, but ETL remains highly relevant.
Module 2: ETL & Data Pipelines: Tools and Techniques
This module zooms out from ETL mechanics to the broader concept of data pipelines.
ETL is revisited from an operational perspective: staging areas, batching strategies, parallelization, and workflow orchestration. Shell scripting is introduced as a lightweight but powerful way to automate simple pipelines, especially when paired with scheduling tools like cron.
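For instance, a small standalone pipeline script can be dropped into the crontab. The sketch below uses Python for consistency with the rest of these notes (the course itself demonstrates this with shell scripts); the crontab entry, file paths, and column names are illustrative assumptions.

```python
#!/usr/bin/env python3
# Hypothetical schedule: run nightly at 02:00 via cron, e.g.
#   0 2 * * *  /usr/bin/python3 /opt/pipelines/daily_etl.py >> /var/log/daily_etl.log 2>&1
import csv
from datetime import date

def main():
    # Extract: read the latest export (invented path and columns).
    with open("/tmp/sales_export.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: keep only completed orders.
    completed = [r for r in rows if r.get("status") == "completed"]

    # Load: write a dated file for the downstream system to pick up.
    out_path = f"/tmp/sales_clean_{date.today():%Y%m%d}.csv"
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["order_id", "amount", "status"], extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(completed)

if __name__ == "__main__":
    main()
```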
From there, the focus shifts to pipeline behavior and performance:
Latency vs throughput
Bottlenecks and load balancing
Monitoring, logging, and operational reliability
A key distinction is made between:
Batch pipelines: accuracy-focused, scheduled, higher latency
Streaming pipelines: low latency, event-driven, fault-tolerant
The module concludes with a survey of tooling:
Code-first orchestration (Apache Airflow)
Dataframe-based processing (Pandas, Dask, Spark), sketched briefly after this list
Managed and no-code ETL/ELT platforms (Talend, AWS Glue, Panoply)
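As a taste of the dataframe-based style mentioned above, a few Pandas calls cover extract, clean, aggregate, and load; the file name and columns are invented for illustration.

```python
import pandas as pd

df = pd.read_csv("orders.csv")                                    # extract (hypothetical file)
df = df.dropna(subset=["amount"])                                 # clean: drop missing amounts
df["amount"] = df["amount"].astype(float)                         # normalize the type
summary = df.groupby("country", as_index=False)["amount"].sum()   # aggregate per country
summary.to_csv("orders_by_country.csv", index=False)              # load the result
```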
Key takeaway: pipelines are systems, not scripts; orchestration, monitoring, and scalability matter as much as transformation logic.
Module 3: Building Data Pipelines Using Apache Airflow
This module dives deep into Apache Airflow as a workflow orchestration platform.
Airflow models pipelines as Directed Acyclic Graphs (DAGs), where each task is a node and dependencies define execution order. Pipelines are written entirely in Python, making them versionable, testable, and collaborative.
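A minimal DAG along these lines might look as follows; this is a sketch in Airflow 2.x style, and the DAG id, schedule, and commands are illustrative.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform():
    print("transforming extracted data")   # placeholder transformation step

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load = BashOperator(task_id="load", bash_command="echo loading")

    # The dependency chain defines the DAG edges: extract -> transform -> load.
    extract >> transform_task >> load
```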
Key concepts covered include:
Airflow’s architecture (scheduler, executor, workers, metadata DB, web UI)
Task lifecycle states and retries
Operators (Python, Bash, branching, triggers, sensors), with retries, a sensor, and branching sketched after this list
DAG structure and code organization
Logging, metrics, and production monitoring
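To show retries, sensors, and branching together (as referenced in the operators item above), here is a hedged sketch; it assumes Airflow 2.x, the default fs_default filesystem connection for FileSensor, and an invented landing-file path.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import BranchPythonOperator
from airflow.sensors.filesystem import FileSensor

def choose_path():
    # Return the task_id to run next; here the branch is a simple weekday check.
    return "full_load" if datetime.now().weekday() == 0 else "incremental_load"

with DAG(
    dag_id="sensor_branch_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    # Retries apply to every task in this DAG via default_args.
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/tmp/landing/input.csv",   # invented path; resolved via fs_default
        poke_interval=60,
    )
    branch = BranchPythonOperator(task_id="branch", python_callable=choose_path)
    full_load = BashOperator(task_id="full_load", bash_command="echo full load")
    incremental_load = BashOperator(task_id="incremental_load", bash_command="echo incremental load")

    wait_for_file >> branch >> [full_load, incremental_load]
```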
Airflow is positioned clearly as a workflow manager, not a data processing engine; its strength lies in coordinating tasks, not in transforming large datasets itself.
Key takeaway: Airflow turns pipelines into maintainable software systems, enabling complex dependencies, observability, and team collaboration.
Module 4: Building Streaming Pipelines Using Apache Kafka
The final module introduces event-driven architectures and real-time pipelines using Apache Kafka.
Data is framed as events: immutable records describing something that happened, flowing continuously through event streams. Kafka acts as a distributed backbone that decouples producers and consumers while ensuring durability, scalability, and replayability.
Core Kafka concepts include:
Brokers, topics, partitions, and replication
Producers and consumers with offset tracking
Fault tolerance and parallel consumption
Cluster coordination (ZooKeeper → KRaft)
The module also covers stream processing, where events are not just transported but transformed in motion. Kafka Streams is introduced as a way to build processing topologies that read from one topic, process records, and publish results to another.
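Kafka Streams itself is a JVM library; as a rough Python analogue of the same read-process-write pattern, the sketch below assumes the kafka-python package, a broker on localhost:9092, and invented topic names.

```python
from kafka import KafkaConsumer, KafkaProducer

# Read from one topic, transform each event in motion, publish to another.
consumer = KafkaConsumer(
    "raw-events",                        # input topic (invented name)
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    group_id="uppercase-processor",
)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for record in consumer:                              # events arrive continuously
    transformed = record.value.upper()               # trivial byte-level transformation
    producer.send("processed-events", transformed)   # output topic (invented name)
```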
Hands-on labs demonstrate:
Running Kafka locally
Creating topics
Publishing and consuming events via CLI
Inspecting logs
Using Kafka from Python (admin, producer, consumer APIs), as sketched below
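A condensed sketch of that Python workflow, assuming the kafka-python package and a broker on localhost:9092; the topic name is illustrative.

```python
from kafka import KafkaConsumer, KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

# Admin API: create a topic with one partition and no replication.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="demo-events", num_partitions=1, replication_factor=1)])

# Producer API: publish a few events as bytes.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("demo-events", f"event {i}".encode("utf-8"))
producer.flush()

# Consumer API: read the events back from the earliest offset.
consumer = KafkaConsumer(
    "demo-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating once no new messages arrive
)
for message in consumer:
    print(message.offset, message.value.decode("utf-8"))
```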
Key takeaway: Kafka enables scalable, real-time pipelines where data becomes a continuous flow rather than periodic batches.
Final Thoughts
Taken together, these modules form a coherent mental model of modern data engineering:
ETL and ELT define how data is shaped
Pipelines define how data moves
Airflow defines when and in what order things happen
Kafka defines how data flows in real time
The recurring theme is decoupling: separating storage from processing, producers from consumers, and pipelines from transformations to build systems that scale, evolve, and remain observable.
Course Certificate: View on Coursera
All notes and opinions are personal interpretations of the IBM ETL and Data Pipelines with Shell, Airflow and Kafka course on Coursera.

