IBM Python Project for Data Engineering:

Web Scraping and ETL Automation Project – Top 10 Global Banks

As part of IBM’s Python Project for Data Engineering, I built a full ETL pipeline to extract and process financial data for the world’s largest banks. The project involved real-world web scraping, currency conversion using exchange rates, and saving outputs to both a CSV file and an SQLite database, all automated with progress logging.


Overview:

Project Goal

To automate the creation of a quarterly market capitalization report for the top 10 largest banks in the world.

  • Scrape data from an archived Wikipedia page using requests + BeautifulSoup

  • Extract a specific table from the HTML (by inspecting its position in <tbody>)

  • Filter and clean relevant rows (skip rows with missing data or no hyperlinks)

  • Convert market capitalization from USD to GBP, EUR, and INR using an external exchange rate CSV

  • Store the result in a CSV file and an SQLite database

  • Log each phase of the ETL process into a code_log.txt file


Tools & Stack

  • Python: Core programming logic

  • BeautifulSoup4: HTML parsing & web scraping

  • Pandas: Data cleaning and transformation

  • NumPy: Vectorized currency conversion

  • SQLite3: Local database storage

  • CSV: Flat file output

  • Datetime: Timestamped logging for process traceability


Highlights

  • Learned to inspect web structure to target specific <tbody> sections and handle multiple <a> tags within a single cell.

  • Dealt with real-world scraping errors such as SSL certificate failures, and learned how to bypass certificate verification safely for non-sensitive archival content (see the sketch after this list).

  • Practiced modular programming: each ETL phase was written as a self-contained function with docstrings and error tolerance.

  • Understood how to log process checkpoints for traceability — an essential practice in real ETL pipelines.
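As an illustration of that SSL workaround, here is a minimal sketch assuming the requests library; the archive URL below is a placeholder, not the project’s actual target:

import requests
import urllib3

# Suppress the InsecureRequestWarning that requests emits when verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Placeholder archive URL; substitute the actual archived Wikipedia page
url = "https://web.archive.org/web/.../List_of_largest_banks"

# verify=False skips certificate validation; acceptable here only because
# the page is public, read-only archival content with nothing sensitive in transit
html = requests.get(url, verify=False, timeout=30).text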


ETL Workflow:

Task 1: Logging Function

A timestamped log_progress() function tracks every stage of the ETL pipeline in a .txt file, useful for future debugging or audits.

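A minimal sketch of such a helper; the exact signature and message strings are assumptions rather than the graded solution:

from datetime import datetime

def log_progress(message, log_file="code_log.txt"):
    # Append a timestamped checkpoint line so each ETL phase is traceable
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_file, "a") as f:
        f.write(f"{timestamp} : {message}\n")

log_progress("Preliminaries complete. Initiating ETL process")
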
Task 2: Extract – Web Scraping with BeautifulSoup

The table under “By Market Capitalization” was found at index 0 of the page’s <tbody> elements. I filtered for valid rows only (those containing hyperlinks and a parsable market-cap value) and extracted both the bank name and its USD market cap, as sketched below.
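A condensed sketch of the extract step; the cell indices and column names are assumptions inferred from the description above, not the exact course code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

def extract(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    rows = soup.find_all("tbody")[0].find_all("tr")
    records = []
    for row in rows:
        cells = row.find_all("td")
        # Skip header rows and any row whose name cell carries no hyperlink
        if cells and cells[1].find_all("a"):
            name = cells[1].find_all("a")[-1].get_text(strip=True)
            mc_usd = float(cells[2].get_text(strip=True))
            records.append({"Name": name, "MC_USD_Billion": mc_usd})
    return pd.DataFrame(records)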

Task 3: Transform – Exchange Rate Conversion

Using a CSV of exchange rates, I multiplied the USD values by the appropriate rates for GBP, EUR, and INR, rounding the results to two decimal places. Vectorized NumPy operations kept the conversion fast and precise; a sketch follows.
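A sketch of the transform step, assuming the rate file has Currency and Rate columns (the real file layout may differ):

import numpy as np
import pandas as pd

def transform(df, exchange_rate_csv="exchange_rate.csv"):
    # Build a Currency -> Rate lookup from the external CSV
    rate = pd.read_csv(exchange_rate_csv).set_index("Currency")["Rate"]
    for currency in ("GBP", "EUR", "INR"):
        # One vectorized multiply per currency, rounded to 2 decimal places
        df[f"MC_{currency}_Billion"] = np.round(df["MC_USD_Billion"] * rate[currency], 2)
    return df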

Task 4 & 5: Load to CSV and Database

The final DataFrame was saved in two formats, as sketched after this list:

  • CSV using pandas.to_csv

  • SQLite using to_sql(..., if_exists='replace')
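A sketch of the load step; the file and table names come from the project itself, while the function signature is an assumption:

import sqlite3

def load(df, csv_path="Largest_banks_data.csv", db_path="Banks.db"):
    df.to_csv(csv_path, index=False)  # flat-file output
    with sqlite3.connect(db_path) as conn:
        # if_exists='replace' rebuilds the table on each quarterly run
        df.to_sql("Largest_banks", conn, if_exists="replace", index=False)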

Task 6: SQL Query Execution

Sample queries run:

SELECT * FROM Largest_banks;

SELECT AVG(MC_GBP_Billion) FROM Largest_banks;

SELECT Name FROM Largest_banks LIMIT 5;
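These can be run from Python with a small helper; a sketch, using pandas purely for readable output:

import pandas as pd
import sqlite3

def run_query(statement, db_path="Banks.db"):
    with sqlite3.connect(db_path) as conn:
        # Print the statement and its result set for quick verification
        print(statement)
        print(pd.read_sql(statement, conn))

run_query("SELECT AVG(MC_GBP_Billion) FROM Largest_banks")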

Task 7: Log Review

A final inspection of code_log.txt confirmed that all steps were logged correctly.
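Verifying the log amounts to printing the file, for example:

with open("code_log.txt") as f:
    print(f.read())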

Final Output Sample

(Full dataset exported to Largest_banks_data.csv and stored in the Largest_banks table of the Banks.db SQLite database.)

Reflections

This was a solid capstone project for cementing real-world ETL skills. It required breaking down unstructured HTML, designing a pipeline architecture, and understanding how to preserve reproducibility through logs and clean storage. In future versions, I’d like to refactor it into a CLI tool or deploy it as a microservice for quarterly automation.



Next: SQL for Data Analysis Course