At Ripple, we are moving towards building complex business models out of raw data. To do this successfully, we need to automate our historically manual processes. Even with a digital-first approach, many of our internal processes were done by hand, making them great candidates to be automated.
A prime example of this was how we managed our data transformation workflows. Our data analysts used to schedule transformation queries on BigQuery and test the transformed data by hand. We did not have a single tool to automate the building, compiling, testing and documenting of SQL models, so we had no way to scale the process.
Ripple’s Data Engineering team found this data transformation process to be a great test case for implementing the Data Build Tool (DBT).
Why DBT?
Using the open source version of DBT, the Data Engineering team aimed to equip Ripple’s analysts with a tool that cleans data and reduces the amount of time spent manually performing checks.
DBT checks all the boxes to help us manage our data workflows:
- Centralizes the secure management and configuration of target databases
- Organizes compute resources to build, compile and test models across environments
- Schedules and executes the data models efficiently
- Defines and manages lineage and dependencies by creating references
- Flexibly integrates with our existing CI/CD pipeline
- Easily deploys SQL models
- Fits into existing infrastructure without a lot of additional tooling
- Serves documentation and creates data visibility for customers and internal users
With DBT, an analyst can easily create a finance dashboard that refreshes every hour: they write a SQL query as a DBT model, and DBT compiles the code, runs the model at a set frequency and executes the associated tests.
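As a minimal sketch of what such a model looks like (the file, source and column names here are hypothetical, and the raw source would need to be declared in a .yml file), the whole model is a single SELECT saved as a .sql file that DBT compiles, materializes and tests on every run:
-- models/finance/hourly_payment_volume.sql (hypothetical example)
{{ config(materialized = "table") }}

select
    timestamp_trunc(timestamp, hour) as payment_hour,
    app_name,
    count(*) as payment_count
from {{ source('data_lake', 'payment_trail_events') }}
group by payment_hour, app_name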
DBT Implementation Basics
Setting up the DBT Project
We used GitLab to set up our DBT project. A DBT project contains:
- .sql files that define the models and tests (a test example follows below)
- .yml files that define the configuration. For example, dbt_project.yml configures and defines the DBT project.
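For instance, a singular test is just a standalone .sql file under the project's tests/ directory; DBT runs the query and fails the test if any rows come back. A hedged sketch, with hypothetical model and column names:
-- tests/assert_p_id_is_unique.sql (hypothetical example)
-- dbt test fails this test if the query returns any rows
select
    p_id,
    count(*) as occurrences
from {{ ref('deduped_payments') }}
group by p_id
having count(*) > 1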
Connection and Environments
DBT connects to the data warehouse, BigQuery, to run data transformation queries. This allows us to maintain separate production and development environments by using targets. To preserve data integrity while developing and testing models, each run should use data within a single environment.
For example, we created dev, staging and prod DBT workflows inside dbt-prod-gcp-project to separate the underlying infrastructure from the data. This enables our analysts to focus on data curation and modelling rather than infrastructure.
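Inside a model, the active target is also available in the Jinja context, which is one way to keep a run's data inside its own environment. A hypothetical sketch (the source and column names are assumptions):
-- Hypothetical sketch: branch on the active target at compile time
select *
from {{ source('data_lake', 'payment_trail_events') }}
{% if target.name != 'prod' %}
-- dev and staging runs only scan the most recent day of data
where date(timestamp) >= date_sub(current_date, interval 1 day)
{% endif %}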
SQL Models
A model is a single .sql file. Each model either:
- Contains a single select statement that transforms raw data into a dataset that is ready for analytics, or
- Acts as an intermediate step in such a transformation, as sketched below
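Here is a hedged sketch of the second case, with hypothetical model and value names. An intermediate model selects from another model through ref(), which resolves to the right dataset for the active target and also records the dependency for DBT's lineage graph:
-- models/intermediate/int_settled_payments.sql (hypothetical example)
{{ config(materialized = "view") }}

select
    p_id,
    app_name,
    event_name,
    timestamp
from {{ ref('stg_payment_trail_events') }}
where event_name = 'payment_settled'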
We will now introduce an example model that we built using DBT.
Model Example: Payment Deduplication
Ripple supports customer payments across hundreds of financial institutions around the world. Internally, we capture the raw application events in real time in our data lake on BigQuery. This pipeline is written in Apache Beam (Java) and scheduled on Cloud Composer.
Our first use case with DBT is to cleanse and filter duplicates in these raw events. We created a model that uses insert_overwrite to replace payments that have been updated in the past seven days, based on our business rules. In SQL, this can be done with a MERGE statement that runs hourly to refresh and update the records.
The steps are as follows:
- Choose the right materialization strategy (table/view/incremental)
- Include partitioning and clustering based on the downstream filters and usage
- Adopt the right incremental strategy (merge or insert_overwrite)
DBT-Style Incremental Runs
1. DBT’s is_incremental() macro lets us run the model either incrementally or as a full refresh; on a full refresh (or the very first build), the macro returns false and the partition filter below is skipped:
{% if is_incremental() %}
-- this filter will only be applied on an incremental run
where date(p.timestamp) in ({{ partitions_to_replace | join(',') }})
{% endif %}
2. We self-reference the target table using {{ this }} (e.g., where updated_at >= (select max(updated_at) from {{ this }})):
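-- {{ this }} resolves to this model's existing target table in BigQuery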
deduped_payment_trail_events as (
SELECT
p_id,
ripplenet_payment_id,
odl_payment_id,
ripplenet_quote_id,
uuid,
timestamp,
correlations,
data,
schema_type,
categories,
event_name,
app_name,
insert_ts
FROM {{ this }}
)
3. Partitions can be refreshed as static, relative, or var-based filters, with the help of statement calls (dynamic filters using DBT datetime macros). By specifying the partition filter in incremental runs, the amount of data scanned for the transformation is cut down significantly, so incremental builds are much faster than a --full-refresh run. We use static partition filters, as shown below, to select the payments for the past seven days.
{% set partitions_to_replace = [
'current_date',
'date_sub(current_date, interval 7 day)'
] %}
{{
config(
materialized = "incremental",
partition_by = {
"field": "timestamp",
"data_type": "timestamp",
},
cluster_by=["app_name", "timestamp"],
incremental_strategy = "insert_overwrite",
partitions = partitions_to_replace
)
}}
DBT handles the reconciliation of the newly transformed records with the existing database table, leveraging BigQuery’s atomic MERGE operation to ensure that the model’s unique_key is in fact unique; if it isn’t, BigQuery returns an error. For this example, we use p_id as our unique key.
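-- raw payment events (p) are left-joined to the deduplicated rows already in the target (d) on the unique key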
FROM payment_trail_events as p
LEFT JOIN deduped_payment_trail_events as d
USING(p_id)
Scheduling
Once the models are built, they are deployed and scheduled using KubernetesPodOperator on Cloud Composer (Google’s managed version of Airflow).
Conclusion
Since adopting DBT, our Data Scientists have found it easier to deploy their existing SQL models and run tests. The goal is for them to spend less time manually ensuring data quality by simply writing and scheduling tests with DBT. Scheduling the models on Composer also makes it easier to maintain the workflows within our existing infrastructure. As a result, we look forward to migrating most of our transformation workflows to DBT.
Check out our Engineering page to learn more about our team and current open roles!