# Streamlined Data Ingestion with dbt and DuckDB
Efficient file processing is crucial in data engineering. I recently ran a small experiment that integrates dbt (data build tool) with DuckDB via the dbt-duckdb adapter, enhanced by an Excel plugin. This combination provides a simple framework for processing local and remote files and turns out to be surprisingly powerful for ingestion tasks.
## Why Combine dbt and DuckDB?
By leveraging dbt and DuckDB together, we can ensure:
- Simplicity: dbt is well established and easy to use. The learning curve remains relatively flat.
- Fast Local Processing: DuckDB is a powerful local analytical database that offers very simple interfaces for working with local and remote files.
- Adherence to Software Engineering Best Practices: Maintainability and scalability are prioritized.
- Duty Segregation: Software engineers manage plugin functionalities, allowing users to focus solely on processing files.
Let’s assume we have a local `.xlsx` or `.csv` file that we need to push into `s3`. The file may need to undergo minor transformations, and we also want to maintain an audit table with the file size, load timestamp, original file hash, and a few other metadata attributes.
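To make the audit-table idea concrete, here is a minimal sketch of how such metadata could be collected for a file before loading. It uses only the Python standard library; the function and field names are illustrative, not taken from the demo repo:

```python
import hashlib
import os
from datetime import datetime, timezone

def audit_record(path: str) -> dict:
    """Collect basic audit metadata for a source file.

    Returns the file name, size, a SHA-256 content hash, and a load
    timestamp -- the kind of attributes an ingestion audit table stores.
    """
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "file_name": os.path.basename(path),
        "file_size_bytes": os.path.getsize(path),
        "file_sha256": digest,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    # Example: write a tiny file and audit it.
    with open("example.csv", "w") as f:
        f.write("id,name\n1,Alice\n")
    print(audit_record("example.csv"))
```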
Intuition told me a task like this requires a data ingestion framework. So far, however, I have not found an ingestion system that does not lack at least one of the attributes below:
- it is flexible with regard to preprocessing the data
- it is simple to use for those not formally educated in software engineering
- it allows deep customization where needed
- it allows applying software engineering best practices
- it bridges the gap from a local system to a remote system
Now, with DuckDB and dbt, including the plugin system of the dbt-duckdb adapter, we have a system with all the attributes I wish I had found combined in the tools I assessed previously.
## Examples Using the Excel Plugin
For the full example, feel free to explore this little demo repo: https://github.com/nicohein/demo-dbt-duckdb-excel-plugin.
### Plugin Registration in profiles.yml

To register the plugins in your `profiles.yml`:
```yaml
default:
  outputs:
    dev:
      type: duckdb
      plugins:
        - module: plugins.excel
          alias: custom_excel
        - module: excel
  target: dev
```
### Import Data with source.yml

To import data using the Excel plugin, configure your `source.yml`:
```yaml
version: 2
sources:
  - name: excel_source_via_plugin
    meta:
      external_location: "data/source/{name}.xlsx"
      plugin: excel
    tables:
      - name: raw_customers
```
### Export Data with schema.yml

To export data using the custom Excel plugin, configure your `schema.yml`:
```yaml
version: 2
models:
  - name: orders
    config:
      materialized: external
      # Location where the `external` materialization writes and from where the plugin reads
      location: 'data/tmp/orders.parquet'
      plugin: custom_excel
      overrides:
        # The location to which the plugin writes
        file: 'data/destination/orders.xlsx'
```
## Custom Plugins
DuckDB itself can read data from Excel, CSV, and many of the more established formats in the data engineering world; it is easy to set up locally and offers convenient interfaces.
The plugin system for the external materialization of dbt-duckdb allows implementing custom `load` and `store` functions that can read from anywhere Python (or, more conveniently, pandas) can read, and write to anywhere Python can write.
For the scope of this experiment I only customized the Excel plugin. It addresses a limitation of the default plugin by allowing the output file to be set at the model level instead of the target level.
Plugins remain completely transparent to the everyday dbt user, who can configure all necessary variables in the source and model YAML blocks of the familiar dbt `.yml` files. This also means an engineer can focus on adding the desired features to the `load` and `store` functions of a custom plugin, such as maintaining an audit table on the fly or sending logs to a remote system.
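To illustrate the shape of such a plugin, here is a schematic, dependency-free sketch. A real dbt-duckdb plugin subclasses the adapter's plugin base class and exchanges DataFrames (e.g. via `pandas.read_excel`); this standalone version uses CSV files and plain dicts purely to show the `load`/`store` contract and an audit entry maintained on the fly. All names here are illustrative:

```python
import csv
import hashlib
import os
from datetime import datetime, timezone

class ExcelLikePlugin:
    """Schematic stand-in for a dbt-duckdb style plugin.

    load():  read the configured source file into rows
             (a real plugin would return a DataFrame).
    store(): write the materialized output and append an audit entry.
    """

    def __init__(self):
        self.audit_log = []  # stand-in for a persistent audit table

    def load(self, source_config: dict) -> list:
        path = source_config["external_location"]
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def store(self, target_config: dict, rows: list) -> None:
        path = target_config["file"]
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
        # Maintain the audit table on the fly, as described above.
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        self.audit_log.append({
            "file": path,
            "size_bytes": os.path.getsize(path),
            "sha256": digest,
            "stored_at": datetime.now(timezone.utc).isoformat(),
        })

if __name__ == "__main__":
    with open("raw_customers.csv", "w") as f:
        f.write("id,name\n1,Alice\n2,Bob\n")
    plugin = ExcelLikePlugin()
    rows = plugin.load({"external_location": "raw_customers.csv"})
    plugin.store({"file": "customers_out.csv"}, rows)
```

The point of the shape is the separation of duties from the introduction: the engineer maintains `load`/`store` (including the audit logic), while the dbt user only touches the YAML configuration shown earlier.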
## Conclusion
By combining dbt with DuckDB and enhancing it with a custom Excel plugin, this project demonstrates a scalable and maintainable approach to file processing. Given the flexibility and extensibility of the stack, it also makes for a powerful ingestion framework: it is flexible, simple to use, allows deep customization, bridges the gap from a local system to remote systems, and lets us apply software engineering best practices.
For details visit the demo repo: https://github.com/nicohein/demo-dbt-duckdb-excel-plugin.