Data Transformations Using the Data Build Tool

At Ripple, we are moving towards building complex business models out of raw data. To do this successfully, we need to automate our historically manual processes. Even with a digital-first approach, many of our internal processes were done by hand, making them great candidates for automation.

A prime example of this was the process of managing our data transformation workflows. Our data analysts used to schedule queries on BigQuery for transformation workflows and test the transformed data manually. We did not have a single tool that would automate the building, compiling, testing and documenting of SQL models, so we had no way to scale the process.

Ripple’s Data Engineering team found this data transformation process to be a great test case for implementing the Data Build Tool (DBT).

Why DBT?

Using the open source version of DBT, the Data Engineering team aimed to equip Ripple’s analysts with a tool that cleans data and reduces the amount of time spent manually performing checks.

DBT checks all the boxes to help us manage our data workflows:

  • Centralizes the secure management and configuration of target databases
  • Organizes compute resources to build, compile and test models across environments
  • Schedules and executes the data models efficiently
  • Defines and manages lineage and dependencies by creating references
  • Flexibly integrates with our existing CI/CD pipeline
  • Easily deploys SQL models
  • Fits into existing infrastructure without a lot of additional tooling
  • Serves documentation and creates data visibility for customers and internal users

With DBT, an analyst can easily create a finance dashboard that refreshes every hour. They write a SQL query and add it to DBT as a model; DBT then compiles the code, runs the model at a set frequency and runs the tests.
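For illustration, such a model could be as simple as the following (a hedged sketch with hypothetical model, table and column names, not our production code):

-- models/finance/hourly_payment_summary.sql (hypothetical model)
{{ config(materialized = "table") }}

SELECT
    date(payment_ts) AS payment_day,
    currency,
    COUNT(*) AS payment_count,
    SUM(amount) AS total_amount
FROM {{ ref('stg_payments') }}   -- 'stg_payments' would be an upstream staging model
GROUP BY payment_day, currency

Running dbt run compiles the Jinja, executes the query against BigQuery and materializes the result as a table, and dbt test runs any tests defined for the model.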

DBT Implementation Basics

Setting up the DBT Project

We used GitLab to set up our DBT project. A DBT project contains:

  1. .sql files that define the models and tests (a test example is sketched below).
  2. .yml files that define the configuration. For example, dbt_project.yml holds the project-level configuration.
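Tests, for instance, can be written as standalone .sql files that select the rows violating an assumption; the test passes when the query returns no rows. A minimal sketch, assuming a hypothetical payments model and column names:

-- tests/assert_payment_amounts_positive.sql (hypothetical test)
-- Fails if any payment has a non-positive amount
SELECT
    p_id,
    amount
FROM {{ ref('payments') }}
WHERE amount <= 0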

Connection and Environments

DBT connects to the data warehouse, BigQuery, to run data transformation queries. This allows us to maintain separate production and development environments by using targets. To preserve data integrity while developing and testing models, each run should only read and write data within a single environment.

For example, we created dev, staging and prod DBT workflows inside dbt-prod-gcp-project to separate the underlying infrastructure from the data. This enables our analysts to focus on data curation and modelling rather than infrastructure.
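Within a model, the active target is exposed through the target variable, so a model can adapt its behavior to the environment it runs in. A minimal sketch (the upstream model name and the dev-only date filter are hypothetical):

SELECT *
FROM {{ ref('stg_payments') }}
{% if target.name == 'dev' %}
-- keep development runs cheap by scanning only recent data
WHERE date(timestamp) >= date_sub(current_date, interval 3 day)
{% endif %}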

SQL Models

A model is a single .sql file. Each model either:

  1. Contains a single select statement that transforms raw data into a dataset that is ready for analytics, or
  2. Acts as an intermediate step in such a transformation (see the sketch below)
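Models select from each other with the ref() macro, which is also how DBT derives lineage and build order. As a hedged illustration (model names hypothetical), a downstream model might read from an intermediate one like this:

-- models/marts/payment_event_summary.sql (hypothetical)
SELECT
    app_name,
    COUNT(*) AS event_count
FROM {{ ref('int_cleaned_events') }}   -- 'int_cleaned_events' is itself a model acting as an intermediate step
GROUP BY app_name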

We will now introduce an example model that we built using DBT.

Model Example: Payment Deduplication

Ripple supports customer payments across hundreds of financial institutions around the world. Internally, we capture the raw application events in real time in our data lake on BigQuery. This ingestion pipeline is written in Apache Beam (Java) and scheduled on Cloud Composer.

Our first use case with DBT is to cleanse and filter duplicates in these raw events. We created a model that uses insert_overwrite to refresh payments that have been updated in the past seven days, based on our business rules. In SQL, this can be done with a MERGE statement that runs hourly to refresh and update the records.
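As a rough illustration of that pattern (table and column names are hypothetical, and the statement DBT actually generates is more involved), such a MERGE has the following shape:

MERGE INTO analytics.payments AS t            -- hypothetical target table
USING (
    -- new or updated events from the last seven days, assumed deduplicated per p_id
    SELECT p_id, status, updated_at
    FROM raw_events.payment_trail_events      -- hypothetical raw events table
    WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
) AS s
ON t.p_id = s.p_id
WHEN MATCHED THEN
    UPDATE SET status = s.status, updated_at = s.updated_at
WHEN NOT MATCHED THEN
    INSERT (p_id, status, updated_at) VALUES (s.p_id, s.status, s.updated_at)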

The steps are as follows:

  1. Choose the right materialization strategy (table/view/incremental)
  2. Include partitioning and clustering based on the downstream filters and usage
  3. Adopt the right incremental strategy (MERGE or insert_overwrite)

DBT-Style Incremental Runs

  1. DBT’s macro is_incremental() lets us run the model incrementally or as a full refresh:
{% if is_incremental() %}
-- this filter will only be applied on an incremental run
where date(p.timestamp) in ({{ partitions_to_replace | join(',') }})
{% endif %}

  2. We self-reference the target table with {{ this }} (for example, where updated_at >= (select max(updated_at) from {{ this }})):

deduped_payment_trail_events as (
    SELECT
        p_id,
        ripplenet_payment_id,
        odl_payment_id,
        ripplenet_quote_id,
        uuid,
        timestamp,
        correlations,
        data,
        schema_type,
        categories,
        event_name,
        app_name,
        insert_ts
    FROM {{ this }}
)
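The snippet above only reads the current contents of the target table; the deduplication of the incoming raw events themselves is not shown. One common BigQuery pattern for that step (a sketch, not necessarily the exact business rules we apply) keeps only the latest event per p_id:

WITH ranked_events AS (
    SELECT
        *,
        row_number() OVER (PARTITION BY p_id ORDER BY insert_ts DESC) AS rn
    FROM {{ source('raw', 'payment_trail_events') }}   -- hypothetical source definition
)
SELECT * EXCEPT (rn)
FROM ranked_events
WHERE rn = 1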

  3. Partitions can be refreshed with static, relative or var-based filters, with the help of statement calls (dynamic filters using DBT datetime macros). By specifying the partition filter in the incremental runs, the amount of data scanned for the transformation is cut down significantly, so incremental model builds are far faster than a --full-refresh run. We use static partition filters, as shown below, to select the payments from the past 7 days (a var-based variant is sketched after the configuration).

{% set partitions_to_replace = [
    'current_date',
    'date_sub(current_date, interval 7 day)'
] %}
{{
    config(
        materialized = "incremental",
        partition_by = {
            "field": "timestamp",
            "data_type": "timestamp"
        },
        cluster_by = ["app_name", "timestamp"],
        incremental_strategy = "insert_overwrite",
        partitions = partitions_to_replace
    )
}}
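For comparison, the same window could be driven by a DBT variable rather than a hard-coded value, so the lookback can be changed at run time (a sketch; lookback_days is a hypothetical variable name):

{% set lookback_days = var('lookback_days', 7) %}
{% set partitions_to_replace = [
    'current_date',
    'date_sub(current_date, interval ' ~ lookback_days ~ ' day)'
] %}

The default of 7 can then be overridden on the command line, for example with dbt run --vars '{lookback_days: 14}'.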

DBT handles the reconciliation of the newly transformed records with the existing database table, leveraging BigQuery’s atomic MERGE operation and ensuring that the model’s unique_key is in fact unique. If it isn’t, BigQuery returns an error. For this example, we use p_id as our unique key.

FROM payment_trail_events as p
LEFT JOIN deduped_payment_trail_events as d
USING(p_id)
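For reference, when the MERGE incremental strategy is used instead of insert_overwrite, the unique key is declared directly in the model config (an illustrative sketch rather than our exact production configuration):

{{
    config(
        materialized = "incremental",
        incremental_strategy = "merge",
        unique_key = "p_id",
        partition_by = {
            "field": "timestamp",
            "data_type": "timestamp"
        }
    )
}}

With this strategy, DBT generates a MERGE keyed on p_id that updates matching rows and inserts new ones.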

Scheduling

Once the models are built, they are deployed and scheduled using KubernetesPodOperator on Cloud Composer (Google’s managed version of Airflow).

Conclusion

Since we started adopting DBT, our Data Scientists have found it easier to deploy their existing SQL models and run tests. The goal is for them to spend less time manually ensuring data quality by simply writing and scheduling tests with DBT. Scheduling the models on Composer also makes it easier to maintain the workflows within our existing infrastructure. As a result, we look forward to migrating most of our transformation workflows to DBT.

Check out our Engineering page to learn more about our team and current open roles!
