Data Platform #
The data platform landscape evolves rapidly, and much of the information in this space has a short shelf life. As a result, I primarily document my thoughts and findings in the Blog section of this site. Below are links to articles and experiments that I believe remain relevant.
Data Ingestion #
Data ingestion remains one of the most persistent challenges in building and scaling data platforms. One approach I’ve explored is leveraging DuckDB, which offers a powerful way to process data seamlessly both locally and in the cloud. Its speed and the flexibility of its plugin framework make it an excellent tool for streamlining ingestion workflows. Read more about my experiment in Streamlined Data Ingestion with dbt and DuckDB.
Storage #
Storage is often considered inexpensive compared to compute. However, as data accumulates over time, efficient management can unlock significant cost savings.
In this post, I outline a straightforward method to identify unused and potentially obsolete data at the table level in BigQuery.
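The general idea can be sketched in a few lines of Python. This is a minimal illustration, not the method from the post: it assumes per-table metadata (size and last observed read) has already been collected somehow, and the `TableStats` record, the `find_unused_tables` helper, the table names, and the 90-day threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class TableStats:
    """Hypothetical per-table record; how it gets populated is out of scope."""
    table_id: str
    size_bytes: int
    last_read: Optional[datetime]  # None means no read was ever observed


def find_unused_tables(tables, now, max_idle_days=90):
    """Return tables with no observed read within max_idle_days, largest first."""
    cutoff = now - timedelta(days=max_idle_days)
    unused = [t for t in tables if t.last_read is None or t.last_read < cutoff]
    return sorted(unused, key=lambda t: t.size_bytes, reverse=True)


# Made-up sample data for illustration only.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
tables = [
    TableStats("sales.orders", 5_000_000, datetime(2024, 5, 30, tzinfo=timezone.utc)),
    TableStats("tmp.export_2021", 80_000_000, datetime(2021, 3, 1, tzinfo=timezone.utc)),
    TableStats("staging.never_read", 1_000, None),
]

for table in find_unused_tables(tables, now):
    print(table.table_id, table.size_bytes)
```

Sorting by size puts the candidates with the largest potential savings first; whether a flagged table is truly obsolete still needs a human decision.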
Also GCP-specific: while billing reports provide total bucket costs, identifying large blobs and folders can be time-consuming, given the mostly flat hierarchy of object storage. To unlock further savings, I developed a Python utility that summarizes and analyzes stored data, making it easier to pinpoint large files and folders across GCS buckets. More details can be found in this post.
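To illustrate the kind of aggregation involved, here is a minimal sketch. It is not the utility from the post: it assumes the object listing is already available as `(object_name, size_bytes)` pairs, and the `summarize_by_prefix` helper and the sample object names are hypothetical.

```python
from collections import defaultdict


def summarize_by_prefix(blobs, depth=1):
    """Sum object sizes per "folder", i.e. the first `depth` path segments.

    `blobs` is an iterable of (object_name, size_bytes) pairs.
    Object storage is flat, so folders are inferred from "/" in the names.
    """
    totals = defaultdict(int)
    for name, size in blobs:
        parts = name.split("/")
        # The last segment is the object itself; everything before it is "folders".
        folders = parts[:-1][:depth]
        prefix = "/".join(folders) + "/" if folders else "(root)"
        totals[prefix] += size
    # Largest prefixes first, since those are the savings candidates.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up listing for illustration only.
blobs = [
    ("raw/2024/01/events.parquet", 700),
    ("raw/2023/12/events.parquet", 300),
    ("exports/report.csv", 50),
    ("README.txt", 5),
]

for prefix, total in summarize_by_prefix(blobs, depth=2):
    print(prefix, total)
```

Increasing `depth` drills further into a prefix that turned out to be large at a shallower level.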
Orchestration #
As of 2024, I see two dominant approaches to orchestration:
- Data or Asset-Centric – The primary focus is the generated output (data or assets).
- Process-Centric – The primary focus is the tasks and workflows driving the process.
The data-centric approach offers significant advantages by emphasizing data lineage over procedural steps. This shift allows for greater transparency and traceability, ensuring that dependencies and transformations are well understood.
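A toy sketch of why lineage comes for free in this model: when each asset declares what it is derived from, both the execution order and the upstream lineage fall out of the declarations. The asset names and helper functions below are hypothetical and not tied to any particular orchestrator.

```python
from graphlib import TopologicalSorter

# Each asset names the assets it is derived from; all names are made up.
ASSET_DEPENDENCIES = {
    "raw_orders": set(),
    "raw_customers": set(),
    "clean_orders": {"raw_orders"},
    "orders_by_customer": {"clean_orders", "raw_customers"},
}


def materialization_order(dependencies):
    """Derive a valid build order purely from declared data dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def upstream_lineage(asset, dependencies):
    """Collect every asset that `asset` transitively depends on."""
    seen = set()
    stack = [asset]
    while stack:
        for parent in dependencies[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(materialization_order(ASSET_DEPENDENCIES))
print(sorted(upstream_lineage("orders_by_customer", ASSET_DEPENDENCIES)))
```

No task ordering is written by hand here; it is inferred from the data dependencies, which is the core of the asset-centric view.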
A key benefit of data-centric DAGs is that they inherently promote atomic and idempotent processing. In contrast, ensuring these characteristics in process-centric DAGs often requires additional engineering effort and forethought.
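As an illustration of the extra care a hand-written task needs, here is a minimal sketch of an idempotent, atomic partition write on a local filesystem. The `write_partition` helper, the `date=` directory layout, and the JSON format are assumptions made for this example.

```python
import json
import os
import tempfile
from pathlib import Path


def write_partition(base_dir, partition_date, rows):
    """Overwrite one date partition.

    Idempotent: re-running for the same date replaces the file instead of
    appending to it, so retries do not duplicate data.
    Atomic: the data is written to a temporary file first and then swapped
    into place, so readers never see a half-written file.
    """
    target = Path(base_dir) / f"date={partition_date}" / "data.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temporary file lives in the target directory so the final rename
    # stays on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(rows, tmp)
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the temporary file if writing or renaming failed.
        os.unlink(tmp_name)
        raise
    return target


with tempfile.TemporaryDirectory() as base:
    rows = [{"order_id": 1, "amount": 10}]
    write_partition(base, "2024-01-01", rows)
    path = write_partition(base, "2024-01-01", rows)  # safe to re-run
    print(json.loads(path.read_text()) == rows)
```

The same overwrite-by-partition pattern applies to warehouse tables and object storage, although the mechanism for the atomic swap differs per system.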
While the data-centric approach has clear benefits, the maturity of existing process-centric solutions and the significant effort required to migrate a well-designed collection of DAGs can make a full transition hard to justify. I see the best chance of success in a gradual transition.
I have written about Running a Multi-Tenant Airflow Cluster on Medium.
Transformation #
I currently don’t have much to say about data transformation. BigQuery (followed by Snowflake) has established SQL for massively parallel processing. dbt makes it easy to structure code and apply software engineering best practices, and DuckDB now allows highly efficient processing of small data, locally and in the cloud alike.
Solutions to make incremental processing easy are actively being worked on. Let’s see when DuckDB supports writing to Iceberg.