Managing one cloud infrastructure isn’t easy. Dealing with multiple clouds is even more challenging. And securely retrieving encrypted secrets from a cloud-based security solution, in a workflow that spans applications and services on both AWS and GCP, is an immense challenge.
For Ripple's Data Engineering team to solve this problem, we not only had to sort through all of those intricate architectural details, we also had to consider our existing infrastructure and long-term goals while adopting best practices. But we figured it out.
We discovered this challenge while setting up new ways to monitor our own systems, specifically our hosted applications. For part of the solution, we wanted to stream data from Cloud Dataflow to BigQuery. Both are managed services on the Google Cloud Platform (GCP), so this should not have been too complicated. The wrinkle we encountered was that the source data comes from Amazon Kinesis, which lives, of course, on Amazon Web Services (AWS), and the Kinesis credentials were stored in HashiCorp Vault, another hosted service. This meant we’d have to authenticate a Dataflow job or user against Vault to read the Kinesis credentials, even though the integrations among Dataflow, Kinesis, and Vault are primitive.
An Instance of Complication
We use Vault because it’s a secure solution that both of our cloud infrastructures can access.
Vault authenticates applications by identifying either the user or the instance running the application. For example, if an application runs on a dedicated machine instance, Vault assigns an AppRole for that machine. If the application is on a VM, Vault assigns an identity to the application and uses metadata from that instance to fetch the credentials.
We wanted Dataflow to fetch credentials for Kinesis from Vault. The hitch was that because Dataflow is a serverless service operating at distributed scale, its instance profile is dynamic, so instance-based authentication wasn’t a possibility in our case. The expectation is that, in a streaming Dataflow job, the workers need to dynamically authenticate with Vault to fetch Kinesis credentials, especially when the credentials are temporary, like AWS Security Token Service (STS) keys.
Approaching a Solution
We tried an approach that leveraged Cloud Identity and Access Management (IAM), another GCP service. Vault has multiple authentication (auth) mechanisms to support various services across multiple clouds. You can use IAM to manage the identities of serverless applications like Dataflow and Lambda, where the machine or instance profile is unknown.
We created a named role on Vault for the IAM Service Account on GCP. This allowed us to authenticate our application and read credentials from Vault.
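For context, binding a Vault named role to a GCP service account is a one-time configuration step on the Vault side. The snippet below is a rough sketch of the equivalent HTTP API call, assuming the GCP auth method is already enabled at auth/gcp; the role name, bound service account, and policy are placeholders rather than our actual values, and in practice this step is usually done with the Vault CLI or Terraform.

```python
import os

import requests

# Placeholder values; VAULT_TOKEN must be allowed to manage auth/gcp roles.
VAULT_ADDR = "https://vault.example.com:8200"
VAULT_TOKEN = os.environ["VAULT_TOKEN"]

# Create a named role of type "iam" bound to a GCP service account. Any
# caller presenting a JWT signed by that account can log in under this role
# and receives the attached policies.
resp = requests.post(
    f"{VAULT_ADDR}/v1/auth/gcp/role/dataflow-kinesis",
    headers={"X-Vault-Token": VAULT_TOKEN},
    json={
        "type": "iam",
        "bound_service_accounts": [
            "composer-worker@my-project.iam.gserviceaccount.com"
        ],
        "policies": ["kinesis-read"],
    },
)
resp.raise_for_status()
```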
As we did this, we observed behavior that was slightly different from our expectations. The authentication was not happening as part of the Dataflow job. Rather, it was happening at graph-construction time, before the Dataflow job was submitted, which was too soon for us to use as part of our streaming job. If we had been using a batch-based approach with static keys, this approach would have worked, since the delay would not have mattered. But to work for streaming, we needed dynamic authentication at the Dataflow worker level. This constraint also meant that we would need to source the keys for the IAM account externally, since they would not be dynamically fetched from the instance metadata as we expected.
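To make that distinction concrete, here is a tiny, generic Beam pipeline (not our actual job). Anything at module level runs on the machine that constructs and submits the graph; only the function handed to Map runs on the workers. A Vault call placed in the outer code therefore executes once, before the job is submitted.

```python
import datetime

import apache_beam as beam

# Executes on the machine that constructs and submits the pipeline graph;
# this is where our Vault authentication was happening, before the job ran.
submit_time = datetime.datetime.utcnow().isoformat()


def tag_with_worker_time(element):
    # Executes on the workers, potentially long after submission.
    return element, submit_time, datetime.datetime.utcnow().isoformat()


with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([1, 2, 3])
        | beam.Map(tag_with_worker_time)
        | beam.Map(print)
    )
```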
Here’s what we ended up doing:
We deployed the Dataflow streaming job on Cloud Composer and used the Composer worker’s IAM account for the IAM-based authentication. Since the auth happens before the Dataflow job is submitted, Composer handles the auth piece by sourcing IAM credentials from its instance metadata, with no need to handle them manually.
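As a rough sketch, here is what that flow can look like from a Composer worker, assuming Vault’s GCP auth method is mounted at auth/gcp and the Kinesis credentials sit at a KV v2 path; the Vault address, role name, service account, and secret path below are placeholders, not our actual configuration. The worker picks up its IAM identity from the metadata server via google-auth, asks the IAM Credentials API to sign a short-lived JWT (the service account needs the Service Account Token Creator role on itself for that call), exchanges the JWT for a Vault token, and reads the secret.

```python
import json
import time

import google.auth
import googleapiclient.discovery
import requests

# Illustrative placeholders, not Ripple's actual configuration.
VAULT_ADDR = "https://vault.example.com:8200"
VAULT_ROLE = "dataflow-kinesis"
SA_EMAIL = "composer-worker@my-project.iam.gserviceaccount.com"


def fetch_kinesis_keys_from_vault():
    # On a Composer worker, default credentials come from the instance
    # metadata server, so no key files need to be handled manually.
    credentials, _ = google.auth.default()
    iam = googleapiclient.discovery.build(
        "iamcredentials", "v1", credentials=credentials
    )

    # Sign a short-lived JWT whose audience matches the Vault named role.
    payload = json.dumps({
        "sub": SA_EMAIL,
        "aud": f"vault/{VAULT_ROLE}",
        "exp": int(time.time()) + 900,
    })
    signed_jwt = (
        iam.projects()
        .serviceAccounts()
        .signJwt(
            name=f"projects/-/serviceAccounts/{SA_EMAIL}",
            body={"payload": payload},
        )
        .execute()["signedJwt"]
    )

    # Exchange the signed JWT for a Vault token via the GCP auth method.
    login = requests.post(
        f"{VAULT_ADDR}/v1/auth/gcp/login",
        json={"role": VAULT_ROLE, "jwt": signed_jwt},
    )
    login.raise_for_status()
    vault_token = login.json()["auth"]["client_token"]

    # Read the Kinesis credentials from a KV v2 secret (path is illustrative).
    secret = requests.get(
        f"{VAULT_ADDR}/v1/secret/data/kinesis",
        headers={"X-Vault-Token": vault_token},
    )
    secret.raise_for_status()
    return secret.json()["data"]["data"]


# Composer then passes the returned keys to the Dataflow job, for example
# as pipeline options, when it launches the job.
kinesis_keys = fetch_kinesis_keys_from_vault()
```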
Next Steps
We realized that any auth mechanism that happens at graph-construction time is probably not the right approach for streaming. Because of the delay introduced by fetching credentials externally, Dataflow most likely wouldn’t be working with truly real-time data.
The newest version of the Apache Beam SDK will give us the ability to authenticate at run time through a feature called Splittable DoFn. This experimental feature is being released first in the Python SDK, then in the Java SDK.
This feature will let us produce the root PCollection (the input for the Dataflow job, which in our case is the Kinesis records) from the source as part of the Dataflow job itself. That will solve the streaming problem and give us the ability to dynamically authenticate and fetch both static keys and AWS Security Token Service (STS) keys from Vault.
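We can’t show the Splittable DoFn source itself yet, but the shape of worker-level authentication is easy to sketch. The hypothetical DoFn below fetches keys from Vault in setup(), which Dataflow calls on each worker at run time rather than at graph-construction time. It reuses the fetch_kinesis_keys_from_vault helper sketched above, the stream name and secret field names are placeholders, and the one-shot read in process() merely stands in for a real Splittable DoFn source with checkpointing.

```python
import apache_beam as beam


class ReadKinesisWithVaultAuth(beam.DoFn):
    """Illustration only: authenticate against Vault on the worker."""

    def setup(self):
        # setup() runs on the Dataflow worker, so the STS keys are fetched
        # at run time and can be short-lived.
        import boto3  # imported here so it is resolved on the worker

        # fetch_kinesis_keys_from_vault is the hypothetical helper sketched
        # earlier; the field names depend on how the secret was written.
        keys = fetch_kinesis_keys_from_vault()
        self._kinesis = boto3.client(
            "kinesis",
            aws_access_key_id=keys["access_key"],
            aws_secret_access_key=keys["secret_key"],
            aws_session_token=keys["session_token"],
        )

    def process(self, shard_id):
        # Simplified one-shot read; a real streaming source built as a
        # Splittable DoFn would track the shard iterator as restartable work.
        iterator = self._kinesis.get_shard_iterator(
            StreamName="example-stream",  # placeholder stream name
            ShardId=shard_id,
            ShardIteratorType="LATEST",
        )["ShardIterator"]
        for record in self._kinesis.get_records(ShardIterator=iterator)["Records"]:
            yield record["Data"]
```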
Our engineering team is solving these challenges and more every day. If you’re interested in being part of the team, we’re hiring.
Image credit: Photo by Joseph Barrientos on Unsplash