Enhancing an application to send email is a relatively trivial matter—it can take an engineer as little as five minutes to modify an application to connect to an email server and specify a message to send. With a little bit more work, templates can be added to support sending emails with different content to distinct groups of people, as well as inserting images and attachments. Where it gets challenging, however, is in dealing with the small problems that inevitably occur, such as emails not arriving, or protecting against and recovering from major disasters that can strike at any moment, like data loss.
At Ripple, our goal is to not only build efficient and reliable systems, but also to ensure they are robust and secure, as well as fully compliant with legal requirements in all the jurisdictions within which we operate. However trivial sending emails may seem—or, more generally, sending notifications in any form—doing it correctly requires a much bigger time investment than the five minutes mentioned above. One of our core tenets is setting high standards to ensure customer trust.
In this post we present Hermes, our notification service, which can be used by all Ripple services that have a need to programmatically send emails or other kinds of notifications, such as Slack alerts.
We often use Hermes to send customers and exchanges emails containing the receipt for each transaction we make. For context, these receipts must be sent to remain in compliance with a 1996 FinCEN Bank Secrecy Act rule [31 CFR 103.33(g)], also known as the Travel Rule, which stipulates that, for certain transactions between financial institutions, identifying information must be exchanged to help detect money laundering. It is vital for both Ripple and our customers that these notifications are accurate and on time.
The main functional requirement for Hermes was to store all notifications persistently for audit purposes, with any sensitive data, such as customer and financial data, encrypted. This goes beyond encrypting data at rest: even engineers who have access to Hermes for troubleshooting should not be able to see any sensitive data, a level of privacy we prioritize for our customers.
The non-functional requirements focused on the integration of Hermes into our existing Kubernetes platform that runs on AWS, the need to collect logs, metrics, and traces for the purpose of monitoring and alerting, setting appropriate access controls and authentication, and so on.
Hermes persistently stores each notification request that arrives in an Amazon Aurora database on RDS (Relational Database Service), running the PostgreSQL-compatible engine with a failover replica. The service has been granted permission to access only one specific database within the data store, and it authenticates with a username and password that are stored in HashiCorp Vault and rotated every few hours.
Each notification request can indicate which elements of its content are sensitive. Those elements are encrypted using Vault's Transit Engine before they are stored in the database, with encryption keys that are also rotated regularly. The sensitive values are decrypted only just before the message is handed off to the SendGrid API for final delivery to the intended recipients.
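To make the idea of per-element encryption concrete, here is a minimal sketch in Go. It is not Hermes code: the Field type, the Encrypter interface, and the protect function are hypothetical names chosen for illustration, and a local AES-GCM cipher from the standard library stands in for the remote call to Vault's Transit Engine.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"errors"
	"fmt"
)

// Field is a hypothetical content element; Sensitive marks it for encryption.
type Field struct {
	Name      string
	Value     string
	Sensitive bool
}

// Encrypter abstracts the encryption backend (an assumption of this sketch).
type Encrypter interface {
	Encrypt(plaintext string) (string, error)
	Decrypt(ciphertext string) (string, error)
}

// localAEAD is a local AES-GCM stand-in for a remote encryption service.
type localAEAD struct{ aead cipher.AEAD }

func newLocalAEAD(key []byte) (*localAEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	return &localAEAD{aead: aead}, nil
}

// Encrypt prepends a fresh random nonce to the sealed data and base64-encodes it.
func (l *localAEAD) Encrypt(plaintext string) (string, error) {
	nonce := make([]byte, l.aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	sealed := l.aead.Seal(nonce, nonce, []byte(plaintext), nil)
	return base64.StdEncoding.EncodeToString(sealed), nil
}

// Decrypt reverses Encrypt: it splits off the nonce and opens the rest.
func (l *localAEAD) Decrypt(ciphertext string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(ciphertext)
	if err != nil {
		return "", err
	}
	n := l.aead.NonceSize()
	if len(raw) < n {
		return "", errors.New("ciphertext too short")
	}
	plain, err := l.aead.Open(nil, raw[:n], raw[n:], nil)
	if err != nil {
		return "", err
	}
	return string(plain), nil
}

// protect returns a copy of fields in which only sensitive values are encrypted.
func protect(e Encrypter, fields []Field) ([]Field, error) {
	out := make([]Field, len(fields))
	for i, f := range fields {
		out[i] = f
		if f.Sensitive {
			ct, err := e.Encrypt(f.Value)
			if err != nil {
				return nil, err
			}
			out[i].Value = ct
		}
	}
	return out, nil
}

func main() {
	// All-zero demo key for illustration only; real keys would never be hard-coded.
	enc, err := newLocalAEAD(make([]byte, 32))
	if err != nil {
		panic(err)
	}
	stored, err := protect(enc, []Field{
		{Name: "subject", Value: "Your receipt"},
		{Name: "account", Value: "12345", Sensitive: true},
	})
	if err != nil {
		panic(err)
	}
	for _, f := range stored {
		fmt.Printf("%s = %s\n", f.Name, f.Value)
	}
}
```

Hiding the backend behind an interface means the rest of the code never handles key material, which is also what lets a managed service rotate keys without changes to callers.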
To support different usage patterns, Hermes offers both synchronous ("blocking") and asynchronous ("non-blocking") delivery of notifications. This means that an app can choose to wait until it gets confirmation from Hermes that the message was successfully sent, or to return immediately after handing off the message and to check back later on whether or not it was actually sent.
The service is written in Go, an easy-to-learn yet powerful programming language for building fast and scalable systems. Hermes supports both gRPC and JSON requests sent over multiplexed HTTP/2, so other teams can choose the communication method that works best for them. We use Bazel to build and test the code, as well as to package it up into a Docker image. A Helm chart defines how the notification service should be deployed to our Kubernetes cluster.
Life of an Asynchronous Request
As pictures can say more than a thousand words, in the figure below we visualize the flow of an asynchronous notification request. The numbers along the arrows depict the order of execution.
JSON and gRPC requests are treated equally by Hermes, with the only difference that the former must be first unmarshaled into a protobuf and the response marshaled back into JSON. We next encrypt the entire request—just in case we ever need it again—as well as the individual elements that have been marked as sensitive. Only after a request has been encrypted does it get stored in RDS where it is marked as "pending". A unique identifier we assign to each request is then returned to the caller.
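The ordering described above (encrypt first, then store as "pending", then return an identifier) can be sketched roughly as follows. This is an illustration under stated assumptions rather than the actual implementation: Store, record, Submit, and newID are hypothetical names, an in-memory map stands in for the database, and xorPlaceholder is deliberately not real encryption.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
)

// record is a hypothetical database row.
type record struct {
	Payload []byte // encrypted request; plaintext never reaches the store
	Status  string
}

// Store is an in-memory stand-in for the relational database.
type Store struct {
	mu   sync.Mutex
	rows map[string]record
}

func NewStore() *Store { return &Store{rows: make(map[string]record)} }

// newID returns a random 128-bit identifier encoded as 32 hex characters.
func newID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// Submit encrypts the request first, then stores it as "pending" and returns
// the identifier the caller can use to look it up later.
func (s *Store) Submit(request []byte, encrypt func([]byte) ([]byte, error)) (string, error) {
	ciphertext, err := encrypt(request)
	if err != nil {
		return "", fmt.Errorf("encrypt request: %w", err)
	}
	id, err := newID()
	if err != nil {
		return "", fmt.Errorf("generate id: %w", err)
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rows[id] = record{Payload: ciphertext, Status: "pending"}
	return id, nil
}

// Status reports the stored status of a request, if it exists.
func (s *Store) Status(id string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.rows[id]
	return r.Status, ok
}

// xorPlaceholder is NOT encryption; it only marks where the real call would go.
func xorPlaceholder(b []byte) ([]byte, error) {
	out := make([]byte, len(b))
	for i, c := range b {
		out[i] = c ^ 0x5a
	}
	return out, nil
}

func main() {
	s := NewStore()
	id, err := s.Submit([]byte(`{"to":"ops@example.com"}`), xorPlaceholder)
	if err != nil {
		panic(err)
	}
	status, _ := s.Status(id)
	fmt.Println(id, status)
}
```

Because encryption happens before the write, a failure in the encryption step means nothing is stored at all, so plaintext cannot end up in the database by accident.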
A separate worker process will then pick up the pending notification for further handling. The preprocessing step involves validating the data and downloading remote content such as attachments; this only needs to happen once. After decrypting the sensitive fields, the postprocessing step involves performing any template substitution before creating the final email message to hand off to SendGrid.
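As an illustration of the template substitution step, the sketch below uses Go's standard text/template package. The post does not say which template engine Hermes uses, so this choice, the render function, and the field names are all assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// render substitutes values into a message template. The missingkey=error
// option makes a missing value fail loudly instead of rendering silently.
func render(tmpl string, data map[string]string) (string, error) {
	t, err := template.New("email").Option("missingkey=error").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	body, err := render(
		"Dear {{.Name}}, your receipt for {{.Amount}} is attached.",
		map[string]string{"Name": "Ada", "Amount": "10 XRP"}, // hypothetical decrypted values
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

Running substitution only after decryption, as the text describes, keeps the rendered message with its sensitive values in memory only and out of the database. For HTML email bodies, html/template offers the same API with contextual escaping.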
If a temporary failure occurs, the reason is stored in the database and the request is marked as "pending" again, together with the earliest time at which it may be retried. If a permanent failure occurs, the request will not be retried unless the caller explicitly requests it at a later point in time.
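The failure handling described above could be sketched as follows. The Request type, the "failed" status name, and the exponential backoff schedule are assumptions made for this example; the post only states that a retried request carries a timestamp before which it will not be picked up again.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Sentinel errors used to classify failures (hypothetical).
var (
	ErrTemporary = errors.New("temporary failure")
	ErrPermanent = errors.New("permanent failure")
)

// Request holds the retry bookkeeping for one notification.
type Request struct {
	Status    string
	Attempts  int
	LastError string
	NotBefore time.Time // earliest time at which a retry may happen
}

// backoff doubles the delay with each attempt and caps it at max.
func backoff(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 1; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

// recordFailure stores the reason and decides whether the request is retried.
// Anything not wrapped in ErrPermanent is treated as temporary.
func recordFailure(r *Request, err error, now time.Time) {
	r.Attempts++
	r.LastError = err.Error()
	if errors.Is(err, ErrPermanent) {
		r.Status = "failed" // not retried unless the caller asks again
		return
	}
	r.Status = "pending"
	r.NotBefore = now.Add(backoff(r.Attempts, time.Second, time.Minute))
}

// Ready reports whether a worker may pick the request up at the given time.
func (r *Request) Ready(now time.Time) bool {
	return r.Status == "pending" && !now.Before(r.NotBefore)
}

func main() {
	now := time.Now()
	r := &Request{Status: "processing"}

	recordFailure(r, fmt.Errorf("provider timeout: %w", ErrTemporary), now)
	fmt.Println(r.Status, r.NotBefore.Sub(now), r.Ready(now))

	recordFailure(r, fmt.Errorf("invalid recipient: %w", ErrPermanent), now)
	fmt.Println(r.Status, r.LastError)
}
```

Storing the retry time on the request itself, rather than sleeping inside the worker, keeps workers free to handle other notifications in the meantime.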
Asynchronous vs. Synchronous Requests
The main difference between the two kinds of request is how the request reaches a worker. An asynchronous request is only inserted into the database as "pending"; a worker process that polls the database must then pick it up, mark it as "processing", and deliver it. A synchronous request is marked as "processing" by a worker right away and delivered.
A caller that sends an asynchronous request must periodically poll Hermes for delivery status updates. For a synchronous request, the response is not returned until the sequence of steps on the right-hand side of the figure above has completed, so the caller knows as soon as the call returns whether the notification was successfully sent.
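A small sketch may help contrast the two paths. Everything here is hypothetical: the Queue type and its methods are invented for illustration, an in-memory map replaces the database, and the "sent" and "failed" status names are assumptions, since the post only names "pending" and "processing".

```go
package main

import (
	"fmt"
	"sync"
)

// Queue is an in-memory stand-in for the requests table.
type Queue struct {
	mu     sync.Mutex
	nextID int
	status map[int]string
}

func NewQueue() *Queue { return &Queue{status: make(map[int]string)} }

// insert stores a new request with the given initial status.
func (q *Queue) insert(status string) int {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.nextID++
	q.status[q.nextID] = status
	return q.nextID
}

// finish records the outcome of a delivery attempt.
func (q *Queue) finish(id int, err error) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if err != nil {
		q.status[id] = "failed"
		return
	}
	q.status[id] = "sent"
}

// SubmitAsync only records the request as pending and returns immediately.
func (q *Queue) SubmitAsync() int { return q.insert("pending") }

// SubmitSync marks the request as processing right away and delivers it
// before returning, so the caller learns the outcome from the return value.
func (q *Queue) SubmitSync(deliver func(id int) error) (int, error) {
	id := q.insert("processing")
	err := deliver(id)
	q.finish(id, err)
	return id, err
}

// ClaimPending is what a polling worker would call: under the lock, it flips
// one pending request to processing so no other worker can claim it too.
func (q *Queue) ClaimPending() (int, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for id, st := range q.status {
		if st == "pending" {
			q.status[id] = "processing"
			return id, true
		}
	}
	return 0, false
}

// Status returns the current status of a request.
func (q *Queue) Status(id int) string {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.status[id]
}

func main() {
	q := NewQueue()
	deliver := func(id int) error { return nil } // stand-in for the real hand-off

	syncID, _ := q.SubmitSync(deliver)
	fmt.Println("sync:", q.Status(syncID))

	asyncID := q.SubmitAsync()
	fmt.Println("async before worker:", q.Status(asyncID))
	if id, ok := q.ClaimPending(); ok {
		q.finish(id, deliver(id))
	}
	fmt.Println("async after worker:", q.Status(asyncID))
}
```

With a real relational database, the claim step would need an equivalent guarantee that two workers cannot pick up the same row, for example through row locking; the mutex plays that role in this sketch.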
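From the caller's side, waiting on an asynchronous request might look like the sketch below. The pollStatus helper, the check callback, and the "sent" and "failed" status names are assumptions for illustration; they are not the actual Hermes client API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrStillPending is returned when the attempts run out before delivery finishes.
var ErrStillPending = errors.New("notification still pending")

// pollStatus calls check until it reports a terminal status or the attempts
// run out, sleeping for interval between calls.
func pollStatus(check func() (string, error), interval time.Duration, maxAttempts int) (string, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		status, err := check()
		if err != nil {
			return "", err
		}
		if status == "sent" || status == "failed" {
			return status, nil
		}
		if attempt < maxAttempts {
			time.Sleep(interval)
		}
	}
	return "", ErrStillPending
}

func main() {
	calls := 0
	// fakeCheck stands in for a status request to the notification service.
	fakeCheck := func() (string, error) {
		calls++
		if calls < 3 {
			return "pending", nil
		}
		return "sent", nil
	}
	status, err := pollStatus(fakeCheck, 10*time.Millisecond, 5)
	fmt.Println(status, err, calls)
}
```

A production client would likely also accept a context.Context for cancellation, and would choose an interval long enough that many callers polling at once do not add significant load.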
Asynchronous requests can reduce the load on the service by smoothing out bursty traffic, but callers that poll for status updates too frequently can in turn overwhelm it. It is always a challenge to find the right balance; to proactively identify issues, we use Grafana to monitor our metrics and send us alerts when the service shows signs of strain.
Hermes is our company-wide solution for sending application-driven notifications, especially if they must be retained for legal/compliance reasons. The service makes it easy for any Ripple application to send notifications, as all it needs to do is specify the recipients and the message to send, while Hermes takes care of the rest.
Several business processes already use the service to automate emails that used to be sent out manually, a mundane and error-prone task. Hermes has enabled these processes to be not only leaner and more auditable, but also more accountable.
Upcoming improvements to the service will consist of scheduling notifications to be delivered in the future, supporting the deduplication of repeated notifications, and combining multiple notifications into one; for instance, rather than sending an email for each individual transaction, Hermes could be instructed to add all transaction receipts that occurred within a certain time period to a single email.