📈 Introduction:
Remember when you built your first end-to-end data pipeline in the mini-project? Now, it's time to take it to the next level by applying industry best practices to automate and optimize your pipeline using Terraform, CI/CD, and automated tests. This advanced project will not only enhance your technical skills but also make your data pipeline more robust, scalable, and maintainable, mirroring the practices used by top tech companies.
🎯 Project Definition:
Build and enhance an ETL data pipeline that not only extracts hourly weather information for a selected location but also automates and optimizes the deployment process. Use Terraform to manage your GCP infrastructure, implement CI/CD pipelines for smooth, automated deployments, and add automated tests to verify the correctness and reliability of your pipeline. Store the retrieved data in BigQuery and schedule the pipeline to run automatically every hour.
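To give a sense of the core extract-and-load path you will be automating, here is a minimal sketch. The API URL, project, dataset, and table names below are placeholders, not part of the project spec; substitute the weather API and GCP identifiers you actually choose.

```python
import requests
import pandas as pd
from google.cloud import bigquery

# Hypothetical endpoint and identifiers -- replace with your chosen weather API
# and your own GCP project, dataset, and table.
API_URL = "https://api.example.com/v1/weather"
TABLE_ID = "my-project.weather_dataset.hourly_weather"


def extract(lat: float, lon: float) -> pd.DataFrame:
    """Fetch the latest hourly observation and return it as a one-row DataFrame."""
    response = requests.get(API_URL, params={"lat": lat, "lon": lon}, timeout=30)
    response.raise_for_status()
    return pd.json_normalize(response.json())


def load(df: pd.DataFrame) -> None:
    """Append the DataFrame to the BigQuery table."""
    client = bigquery.Client()
    job = client.load_table_from_dataframe(df, TABLE_ID)
    job.result()  # Wait for the load job to finish before exiting.


if __name__ == "__main__":
    load(extract(lat=52.52, lon=13.41))
```

In the full project, this script would be deployed on Google Cloud (with its infrastructure defined in Terraform), triggered hourly by a scheduler, and shipped through a CI/CD pipeline rather than run by hand.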
🦸 Who is this project for?
- Advanced Data Enthusiasts aiming to elevate their skills in Data Engineering, Cloud Engineering, and DevOps.
- Experienced Data / Cloud Professionals looking to integrate infrastructure as code, CI/CD pipelines, and automated tests into their workflows.
🎯 Learning Objectives:
By the end of this project, you'll be able to:
- Extract data from any public API using Python libraries like `requests`.
- Manipulate data efficiently with Pandas to transform raw data into actionable insights.
- Leverage GCP’s robust data warehousing tools like BigQuery to store and manage large datasets.
- Deploy and automate production-ready data pipelines using scalable Google Cloud services.
- Use Terraform to manage and deploy your GCP infrastructure as code.
- Implement CI/CD pipelines to automate testing and deployment processes.
- Write and run automated tests to ensure the reliability and correctness of your data pipeline (see the test sketch after this list).
- Implement logging and monitoring to keep track of your pipeline’s performance and health.
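As an illustration of the automated-testing objective, here is a minimal pytest-style sketch. The `transform_weather` function and its column names are hypothetical stand-ins for whatever transformation your pipeline performs; adapt the function and the assertion to your own logic.

```python
import pandas as pd
import pytest


def transform_weather(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform step: convert temperature from Kelvin to Celsius."""
    out = raw.copy()
    out["temp_c"] = out["temp_k"] - 273.15
    return out


def test_transform_weather_converts_kelvin_to_celsius():
    raw = pd.DataFrame({"temp_k": [273.15, 300.0]})
    result = transform_weather(raw)
    assert result["temp_c"].tolist() == pytest.approx([0.0, 26.85])
```

Tests like this are exactly what your CI/CD pipeline will run on every push, so a failing transformation is caught before it ever reaches your deployed pipeline.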