IBM Python Project for Data Engineering:
Web Scraping and ETL Automation Project – Top 10 Global Banks
As part of IBM’s Python Project for Data Engineering, I was tasked with building a full ETL pipeline to extract and process financial data for the world’s largest banks. The project involved real-world web scraping, data transformation using exchange rates, and saving outputs to both CSV and SQLite database formats, all automated with progress logging.
Overview:
Project Goal
To automate the creation of a quarterly market capitalization report for the top 10 largest banks in the world.
Scrape data from an archived Wikipedia page using requests + BeautifulSoup
Extract a specific table from the HTML (by inspecting its position in <tbody>)
Filter and clean relevant rows (skip rows with missing data or no hyperlinks)
Convert market capitalization from USD to GBP, EUR, and INR using an external exchange rate CSV
Store the result in a CSV file and an SQLite database
Log each phase of the ETL process into a code_log.txt file
Tools & Stack
Python: Core programming logic
BeautifulSoup4: HTML parsing & web scraping
Pandas: Data cleaning and transformation
NumPy: Vectorized currency conversion
SQLite3: Local database storage
CSV: Flat file output
Datetime: Timestamped logging for process traceability
Highlights
Learned to inspect web structure to target specific <tbody> sections and handle multiple <a> tags within a single cell.
Dealt with real-world scraping errors like SSL certificate warnings and how to bypass them safely for non-sensitive archival content (see the sketch after this list).
Practiced modular programming: each ETL phase was written as a self-contained function with docstrings and error tolerance.
Understood how to log process checkpoints for traceability — an essential practice in real ETL pipelines.
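A minimal sketch of one way to handle those SSL warnings when fetching an archived, non-sensitive page with requests; the snapshot URL and the verify=False approach are assumptions for illustration, not a recommendation for sensitive endpoints:

import requests
import urllib3

# Illustrative archive snapshot URL; the exact snapshot used in the project may differ.
url = ("https://web.archive.org/web/20230908091635/"
       "https://en.wikipedia.org/wiki/List_of_largest_banks")

# Silence the InsecureRequestWarning that urllib3 raises when verification is off.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Disabling certificate verification is acceptable here only because the content
# is archival and non-sensitive.
response = requests.get(url, verify=False, timeout=30)
html_text = response.text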
ETL Workflow:
Task 2: Extract – Web Scraping with BeautifulSoup
The table under “By Market Capitalization” was found at index 0 in the <tbody> list. I filtered only valid rows (those containing hyperlinks and a valid market-cap figure) and extracted both the bank name and the USD market cap.
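A minimal extraction sketch, assuming html_text holds the page source (as in the request sketch above); the column positions and filtering rules shown here are my reading of the description, not the graded solution:

import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, "html.parser")
table_body = soup.find_all("tbody")[0]            # target table is the first <tbody>
records = []

for row in table_body.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 3:
        continue                                   # skip header or malformed rows
    links = cells[1].find_all("a")
    if not links:
        continue                                   # skip rows without a hyperlinked bank name
    market_cap = cells[2].get_text(strip=True)
    if not market_cap or market_cap == "—":
        continue                                   # skip rows with missing market-cap data
    records.append({"Name": links[-1].get_text(strip=True),   # last <a> holds the name
                    "MC_USD_Billion": float(market_cap)})

df = pd.DataFrame(records)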
Task 3: Transform – Exchange Rate Conversion
Using a CSV of exchange rates, I multiplied the USD values by the appropriate rates for GBP, EUR, and INR and rounded the results to 2 decimal places. Vectorized NumPy operations ensured performance and precision.
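A minimal transform sketch, continuing from the DataFrame built above; the exchange-rate file name and its Currency/Rate column names are assumptions, while the MC_*_Billion column names follow the queries shown later:

import numpy as np
import pandas as pd

rates = pd.read_csv("exchange_rate.csv")                    # hypothetical file name
rate_map = rates.set_index("Currency")["Rate"].to_dict()    # e.g. {"GBP": 0.8, ...}

usd = df["MC_USD_Billion"].to_numpy()                       # vectorized NumPy conversion
df["MC_GBP_Billion"] = np.round(usd * rate_map["GBP"], 2)
df["MC_EUR_Billion"] = np.round(usd * rate_map["EUR"], 2)
df["MC_INR_Billion"] = np.round(usd * rate_map["INR"], 2)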
Task 4 & 5: Load to CSV and Database
The final DataFrame was saved in two formats:
CSV using pandas.to_csv
SQLite using to_sql(..., if_exists='replace') (see the sketch below)
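A minimal load sketch for both targets, using the file, database, and table names mentioned in this write-up and the df built above:

import sqlite3

# Flat-file output
df.to_csv("Largest_banks_data.csv", index=False)

# SQLite output: replace the table on every quarterly run
conn = sqlite3.connect("Banks.db")
df.to_sql("Largest_banks", conn, if_exists="replace", index=False)
conn.close()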
Task 6: SQL Query Execution
Sample queries run:
SELECT * FROM Largest_banks;
SELECT AVG(MC_GBP_Billion) FROM Largest_banks;
SELECT Name FROM Largest_banks LIMIT 5;
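A minimal sketch for executing the queries above against the freshly loaded database and printing the results:

import sqlite3
import pandas as pd

conn = sqlite3.connect("Banks.db")
for query in ("SELECT * FROM Largest_banks",
              "SELECT AVG(MC_GBP_Billion) FROM Largest_banks",
              "SELECT Name FROM Largest_banks LIMIT 5"):
    print(query)
    print(pd.read_sql(query, conn), end="\n\n")   # run each query and show its result
conn.close()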
Task 7: Log Review
Final inspection of code_log.txt confirms that all steps were logged correctly.
Final Output Sample
(Full dataset exported to Largest_banks_data.csv and stored in the Largest_banks table of the Banks.db SQLite database)
Reflections
This was a solid capstone project for cementing real-world ETL skills. It required breaking down unstructured HTML, designing a pipeline architecture, and understanding how to preserve reproducibility through logs and clean storage. In future versions, I’d like to refactor it into a CLI tool or deploy it as a microservice for quarterly automation.
Task 1: Logging Function
A timestamped log_progress() function tracks every stage of the ETL pipeline in a .txt file, which is useful for future debugging or audits.
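A minimal sketch of such a logger; the timestamp format is my own choice, not necessarily the one used in the project:

from datetime import datetime

def log_progress(message, log_file="code_log.txt"):
    """Append a timestamped entry for the current ETL stage to the log file."""
    timestamp = datetime.now().strftime("%Y-%b-%d %H:%M:%S")
    with open(log_file, "a") as f:
        f.write(f"{timestamp} : {message}\n")

# Example checkpoint between pipeline phases:
log_progress("Data extraction complete. Initiating Transformation process")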