Recently, we needed to upgrade our Postgres database. This required a period of downtime during which we needed to prevent writes to the database. Easy enough, maintenance time exists for a reason. The problem? Our app receives hundreds of webhooks from dozens of sources every minute, each containing important data that we don’t want to lose. This left us with several challenges to solve:
We ended up using a “soft” maintenance mode; instead of blocking incoming requests we continued to receive them and processed them differently, including serving a generic "We are undergoing maintenance" page & stopping database write operations. This let us receive the requests and re-send them in their entirety to a separate server that was running code designed to receive these webhooks and save them. Here is the before & after of our architecture:
So how did we get here…?
As mentioned, our application receives a lot of webhooks from a lot of sources, often outside of our control. Updating the webhook URL at every one of those sources would be impossible to do atomically; it would be a manual & labour-intensive process, prone to human error & difficult to roll back quickly. It was better to preserve the URL that receives webhooks than to risk firing webhooks into the void and losing customer data.
So the problems to solve…
Let’s tackle them one at a time.
At first we investigated off-the-shelf solutions to solve this for us - and we found one, Hookdeck. This gives you a new URL to which you can redirect webhooks, where it will save them & let you replay them or forward them to a different server. It would sit in our system like so:
However, using this would require us to list it as a data processor under GDPR, a 45-day legal process, which made it untenable for our deadline. That also ruled out other off-the-shelf solutions, leaving us to build something ourselves.
Bootstrapping a whole new application with a datastore felt like overkill; we just needed a temporary, production-like environment to which we could deploy some code and store data during the maintenance window.
We also had a staging environment sat right there, running a recent version of our full Ruby on Rails application, ready to be taken over…
We modified the code on our staging server & introduced a new model called a QueuedWebhook, and... well, it didn’t do much; it had:
{% c-block language="sql" %}
CREATE TABLE queued_webhooks (
  id SERIAL PRIMARY KEY,
  -- Used to rebuild the request
  body JSON,
  headers JSON,
  params JSONB,
  path VARCHAR,
  -- Debugging utilities
  processed_at TIMESTAMP,
  retry_count INTEGER DEFAULT 0,
  error_message VARCHAR
);
{% c-block-end %}
We updated the controllers that received webhooks accordingly - instead of processing webhooks automatically, we extracted the information we needed & saved them to the DB.
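Conceptually, each of those controllers boiled down to pulling out exactly the fields the table above stores. A rough, framework-free sketch of that extraction (the helper name and the Rack-style env hash are illustrative, not our exact code):

{% c-block language="ruby" %}
require "json"
require "stringio"

# Given a Rack-style env hash, build the attributes the
# queued_webhooks table stores: path, raw body, headers & params.
def queued_webhook_attributes(env)
  body = env["rack.input"].read
  params = begin
    JSON.parse(body)
  rescue JSON::ParserError
    {}
  end

  {
    path: env["PATH_INFO"],
    body: body,
    # HTTP_* keys in a Rack env are the incoming request headers
    headers: env.select { |key, _| key.start_with?("HTTP_") },
    params: params
  }
end
{% c-block-end %}

In the real controllers a hash like this fed straight into `QueuedWebhook.create!`, and the action responded with a plain 200 so senders would mark the delivery as successful.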
This put puzzle piece #1, “how to store the webhooks”, in place. We launched it to our staging environment; next, we had to actually get our webhooks there.
💡 Did you know? If you redirect traffic with a 301 (permanent redirect) or 302 (temporary redirect), clients may convert the request method to GET. You can use their equivalents, 307 & 308, to preserve the HTTP verb.
Our application is managed by Cloudflare, which lets us manage traffic in interesting ways; for example they have a concept called “Redirect Rules”, which will seamlessly intercept traffic heading to server A and redirect it to server B. We implemented this, and our architecture was updated accordingly:
However, these only support 301 or 302 status codes and, as we found out a little too late, this butchered our HTTP verbs.
💡 Did you also know? Some webhook senders, GitHub included, will not follow redirects at all.
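You can see this behaviour without any webhook provider involved: Ruby’s own Net::HTTP, like several webhook senders, will not follow a redirect unless you write code to do so. A throwaway demo (the redirect target is made up):

{% c-block language="ruby" %}
require "socket"
require "net/http"

# A tiny local server that answers every request with a 307 redirect.
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

responder = Thread.new do
  client = server.accept
  # Read the request headers, then redirect without touching the body
  while (line = client.gets) && line != "\r\n"; end
  client.write("HTTP/1.1 307 Temporary Redirect\r\n" \
               "Location: https://staging.example.com/webhooks/github\r\n" \
               "Content-Length: 0\r\n\r\n")
  client.close
end

# Net::HTTP hands the 307 straight back instead of following it
response = Net::HTTP.post(URI("http://127.0.0.1:#{port}/webhooks/github"),
                          '{"action":"opened"}')
responder.join
server.close

puts response.code # "307" - the webhook body never reached the new location
{% c-block-end %}

A sender that behaves like this simply records the 3xx (or treats it as a failure) and the payload never arrives at the redirect target.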
Then we added a full Cloudflare Worker to redirect traffic with a 307, only to find that some webhook senders treated this as an error. Since we could not rely on senders to consistently follow redirects, we needed to adjust our approach: instead of redirecting requests, we needed to receive them, and then resend them exactly as they were.
… Sounds familiar, no?
We reused pretty much everything we’d written for staging, with one difference - instead of creating QueuedWebhooks, we just instantiated them and fired them straight away; in Ruby this was simply the difference between .create and .new.forward_webhook. This version of the code lived on production, so the webhook lifecycle became:
A neat solution that helped us move fast and get the upgrade done!
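For illustration, the .new.forward_webhook path might look roughly like this. The class below is a stripped-down stand-in for the real ActiveRecord model, and the forwarding host env var is an assumption, not our exact setup:

{% c-block language="ruby" %}
require "net/http"
require "uri"

class QueuedWebhook
  # Hypothetical env var naming the server that stores webhooks for us
  FORWARD_HOST = ENV.fetch("WEBHOOK_FORWARD_HOST", "https://staging.example.com")

  attr_reader :path, :body, :headers

  def initialize(path:, body:, headers: {})
    @path = path
    @body = body
    @headers = headers
  end

  # Rebuild the request exactly as it arrived: same path, headers & body.
  def build_request
    request = Net::HTTP::Post.new(URI.join(FORWARD_HOST, path))
    headers.each { |name, value| request[name] = value }
    request.body = body
    request
  end

  # Re-send the webhook, in its entirety, to the alternate server.
  def forward_webhook
    request = build_request
    uri = request.uri
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request(request)
    end
  end
end
{% c-block-end %}

Production called `QueuedWebhook.new(...).forward_webhook`, while the staging server kept calling `.create` and saving the row.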
After the maintenance was complete, we discussed whether it would be useful to have this functionality always available in production. If for whatever reason the production database became unable to perform writes (e.g. due to hardware failure or overload, or more maintenance), we could instantly forward webhooks to an alternate server that could store and eventually send back the webhooks once production could perform writes again. This would be a more predictable solution than failing the webhooks and requiring the webhook senders to retry them using their own backoff logic. We envisioned a solution that would take advantage of Rails middleware to control what happens to webhooks, based on feature flags and environment variables:
{% c-block language="ruby" %}
# frozen_string_literal: true

module Middleware
  class QueueWebhooksMiddleware < Middleware::Base
    # Fill this in as needed to allow all request headers
    # needed for your webhooks
    VALID_HEADERS = %w[].freeze

    def call(env)
      request = Rack::Request.new(env)

      if request.path =~ %r{^/webhooks/}
        http_headers = permit_headers(request, VALID_HEADERS)
        payload = request.body.read
        request.body.rewind # leave the body readable for the app downstream

        queued_webhook = QueuedWebhook.new(
          path: request.path,
          payload: payload,
          headers: http_headers
        )

        if Flipper.enabled?(:DANGER__receive_webhooks)
          queued_webhook.save!
        elsif Flipper.enabled?(:DANGER__forward_webhooks)
          queued_webhook.forward_webhook
        end
      end

      @app.call(env)
    end

    private

    # Filter request.headers to include only the allowed headers
    def permit_headers(request, allowed_headers)
      # ...
    end
  end
end
{% c-block-end %}
This would combine the code running on both the production & staging servers, and allow us to manage the state of different servers using feature flags and environment variables, instead of needing to deploy changes.
Two feature flags would control each server’s behaviour:
- DANGER__forward_webhooks: forward incoming webhooks, unsaved, to the alternate server
- DANGER__receive_webhooks: save incoming webhooks to the database
The URL to forward webhooks to would be set in env vars, and the queued webhooks could be relayed back to the main server using an admin UI or just the Rails console.
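The console half of that could be as simple as a loop over the unprocessed rows. A stand-in sketch (the Struct here replaces the real ActiveRecord model, and forward_webhook is stubbed; in the real app it re-sends the stored request to production as above):

{% c-block language="ruby" %}
QueuedWebhook = Struct.new(:path, :processed_at, :retry_count, :error_message,
                           keyword_init: true) do
  def forward_webhook
    # Net::HTTP call, as in the forwarding sketch
  end
end

# Replay everything not yet processed, recording failures for a later retry.
def replay_queued_webhooks(hooks, now: Time.now)
  hooks.select { |hook| hook.processed_at.nil? }.each do |hook|
    hook.forward_webhook
    hook.processed_at = now
  rescue StandardError => e
    hook.retry_count = (hook.retry_count || 0) + 1
    hook.error_message = e.message
  end
end
{% c-block-end %}

The processed_at, retry_count & error_message columns from the original table exist exactly for this: anything that fails stays visible and can be replayed again once the cause is fixed.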
I’m happy to report our database upgrade went well 🥰 But it also sparked a lot of conversation and ideas for a future system we could use should we need to put our app into maintenance again. If you have solved similar problems in different ways - or if you end up using an approach like ours - I’d love to hear about your experiences.