A data platform for oil & gas.
A large oil & gas client, a team of developers, and one lakehouse to bring the data together.
What it is
Together with a team of developers, I built and managed a data platform for a large oil & gas client on Microsoft Fabric. The architecture followed the medallion pattern (bronze, silver, gold) on a lakehouse foundation, fed by metadata-driven data pipelines with change data capture and Dataflow Gen2, and transformed with PySpark and Spark SQL.
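As a rough illustration of what "metadata-driven with change data capture" can mean, here is a minimal sketch in plain Python rather than PySpark, so it runs anywhere. Everything in it is an assumption made for illustration: the `TableConfig` schema, the `I`/`U`/`D` operation codes, the table names, and the dictionary standing in for the lakehouse. It is not the client's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableConfig:
    """One metadata row describing how to load a table (hypothetical schema)."""
    name: str
    key: str              # business key used to match change records
    layer: str = "silver"


def apply_changes(target: dict, changes: list, key: str) -> dict:
    """Apply CDC records (op is 'I', 'U' or 'D') to a table keyed by `key`."""
    result = dict(target)
    for change in changes:
        row_key = change["row"][key]
        if change["op"] == "D":
            result.pop(row_key, None)        # delete if present
        else:
            result[row_key] = change["row"]  # insert and update are both upserts
    return result


def run_pipeline(configs, lakehouse: dict, change_feed: dict) -> dict:
    """One generic loop serves every table; only the metadata differs."""
    for cfg in configs:
        table_id = f"{cfg.layer}.{cfg.name}"
        lakehouse[table_id] = apply_changes(
            lakehouse.get(table_id, {}), change_feed.get(cfg.name, []), cfg.key
        )
    return lakehouse


# Hypothetical metadata and change feed, for illustration only.
configs = [TableConfig("wells", "well_id"), TableConfig("sensors", "sensor_id")]
lakehouse = {"silver.wells": {1: {"well_id": 1, "status": "active"}}}
change_feed = {
    "wells": [
        {"op": "U", "row": {"well_id": 1, "status": "shut-in"}},
        {"op": "I", "row": {"well_id": 2, "status": "active"}},
    ],
    "sensors": [{"op": "I", "row": {"sensor_id": 10, "unit": "bar"}}],
}
result = run_pipeline(configs, lakehouse, change_feed)
```

The point of the pattern is that onboarding a new table means adding a metadata row, not writing a new pipeline.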
My personal contribution was re-architecting how the pipelines run. The platform originally processed its loads sequentially; I introduced a parallel execution pattern that cut processing time to less than a fifth of what it had been. I also slimmed down bloated tables, reducing row counts for faster, cheaper ingestion.
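The shift from sequential to parallel execution can be sketched as follows. This is a generic illustration using Python's standard library, assuming the table loads are independent of one another; `load_table` and the table names are hypothetical stand-ins, and the orchestration mechanism actually used on the platform is not described here.

```python
from concurrent.futures import ThreadPoolExecutor


def load_table(name: str) -> tuple:
    """Stand-in for one table load; returns the table name and a row count."""
    rows_loaded = len(name)  # placeholder for real ingestion work
    return name, rows_loaded


def run_sequential(tables: list) -> dict:
    """Original pattern: each load waits for the previous one to finish."""
    return dict(load_table(t) for t in tables)


def run_parallel(tables: list, max_workers: int = 4) -> dict:
    """Parallel pattern: independent loads run concurrently, up to max_workers."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `tables`.
        return dict(pool.map(load_table, tables))


tables = ["wells", "sensors", "production", "maintenance"]  # hypothetical names
sequential_result = run_sequential(tables)
parallel_result = run_parallel(tables)
```

Both functions produce the same result; only the scheduling differs. Loads that depend on each other would still need to run in order, so a pattern like this applies to the independent part of the workload.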
The work
- Co-built and managed the client's Fabric data platform end to end with a team of developers
- Implemented medallion architecture on a lakehouse with metadata-driven pipelines
- Change data capture and Dataflow Gen2 for ingestion; PySpark and Spark SQL for transformation
- Re-architected pipeline orchestration from sequential to parallel, making processing more than 5x faster
- Slimmed down bloated tables, reducing row counts for faster, cheaper ingestion
- Worked in an Agile team with git-based workflows and CI/CD
What it taught me
Sequential pipelines are a default, not a law. The biggest performance wins came from questioning how the work was scheduled, not how it was written.