Jadd Kabbani - Engineering Manager
Since I joined hyperexponential just under 2 years ago, we have grown from about 25 people to over 100, with much of that growth happening within the Technology team. Alongside personnel growth, we’ve seen significant increases in our product’s usage, both in the number of customers and in how much they use their environments.
A common problem in rapidly growing companies like ours is that, eventually, parts of your existing codebase reach their scalability limits. One of our services in particular was hitting the upper bound of its scalability, and it was causing a lot of instability in our system.
🚢 The Export Service
On the face of it, this service is deceptively simple, and it matches the standard offering of most SaaS products. The purpose of the Export Service is to provide analytics to our customers, so they can see the current state of the system and “export” it into their own platforms. As our larger customers increased their usage of the platform over time, we started seeing a lot of stability issues coming from the Export Service. This not only made the analytics unavailable, but also threatened the performance of the platform itself.
The root cause of these stability issues can be tied back to 3 fundamental design decisions made when the service was first created.
🌐 Design Choice 1: Synchronous REST APIs
The Export Service’s database is populated via a two-way REST integration between Export and the main product, hx Renew. At a high level, whenever a change is made within Renew, the backend fires a message to Export over HTTP. This message lets Export know that “something in the system has changed”. Periodically (every 10 seconds or so), Export checks whether Renew has notified it of a change, and if so kicks off the “import process”.
The import process consists of the Export Service making several HTTP calls to Renew. For each entity type modelled in Export’s database, a REST call is made to Renew asking for the current state of all items that have been updated since the last time the import process ran.
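To make that concrete, here’s a rough sketch of what such a polling import loop might look like in Python. The endpoint paths, entity names and helper functions are all illustrative, not the real API:

```python
import time
from datetime import datetime, timezone

import requests

# Illustrative values only: the real service has its own endpoints and entity list.
RENEW_BASE_URL = "https://renew.example.internal/api"
ENTITY_TYPES = ["policies", "policy_options", "models", "users"]
POLL_INTERVAL_SECONDS = 10


def upsert_into_export_db(entity_type: str, rows: list[dict]) -> None:
    ...  # placeholder: write the rows into Export's own database


def run_import_loop(last_synced_at: datetime) -> None:
    while True:
        # Has Renew told us that something changed since the last import?
        pending = requests.get(f"{RENEW_BASE_URL}/export/changes-pending", timeout=10).json()
        if pending.get("changed"):
            for entity_type in ENTITY_TYPES:
                # One REST call per entity type, asking for everything updated
                # since the last successful import run.
                response = requests.get(
                    f"{RENEW_BASE_URL}/{entity_type}",
                    params={"updated_since": last_synced_at.isoformat()},
                    timeout=30,
                )
                response.raise_for_status()
                upsert_into_export_db(entity_type, response.json())
            last_synced_at = datetime.now(timezone.utc)
        time.sleep(POLL_INTERVAL_SECONDS)
```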
If not much has changed, this works fine: Renew can easily query its own database and send the response back over HTTP. We start running into problems when a lot of data has changed in a short space of time, either because our customers have performed some large automated operation or because of something like an internal schema change on our side. If the changes are too numerous, Renew starts putting pressure on its own database in order to fulfil Export’s requests. Under extreme load, the database reads and serialisation can take so long that Export’s HTTP requests start to time out. The timed-out requests need to be retried by Export, so it will repeatedly (with a rudimentary rate limit) call Renew to try and get the data it needs.
This repeated load on Renew’s database can lead to degraded performance and even unresponsiveness of core systems, producing top priority production incidents.
🃏 Design Choice 2: All-or-Nothing Updates
The Export database is a relational PostgreSQL database, with strongly consistent relationships between entities. Because of this strong consistency, all updates from Renew need to be applied atomically to the Export database. If a lot of changes have happened since the last update, they are all committed to the Export database in the same transaction. If any of the updates fail for any reason, all the others must fail too and the database is not updated. Combined with the synchronous REST calls used to fetch the updates, this design choice makes the import process very fragile. If a call to Renew fails, for example because of a timeout, then any progress the import process has already made is discarded. The import then has to start again from the beginning, meaning that once the system gets into this state it is nearly impossible for it to recover by itself, as the data load will never decrease.
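In code, the all-or-nothing behaviour boils down to wrapping the whole import in a single transaction. The sketch below assumes a psycopg-style connection and uses illustrative placeholders for the Renew client and the upsert logic; one failed call or write rolls the entire import back:

```python
from datetime import datetime


def upsert_row(cur, entity_type: str, row: dict) -> None:
    ...  # placeholder: INSERT ... ON CONFLICT DO UPDATE into the Export schema


def run_full_import(conn, renew_client, last_synced_at: datetime) -> None:
    # One transaction for the entire import: if any HTTP call or any write
    # fails, everything rolls back and the next attempt starts from scratch.
    with conn:  # psycopg-style: commit on success, rollback on any exception
        with conn.cursor() as cur:
            for entity_type in ["policies", "policy_options", "models", "users"]:
                rows = renew_client.fetch_updated(entity_type, since=last_synced_at)
                for row in rows:
                    upsert_row(cur, entity_type, row)
```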
⚡ Design Choice 3: Direct Database Access
The usual way of accessing the analytic data provided by the Export Service is via our REST APIs. However, another (discouraged but supported) access mechanism is direct read-only access to the database using your favourite ODBC driver. Because customers can access the database directly, we need to be careful about how our schema evolves over time. Usually the database schema is purely a technical consideration, but in this case it is also a product consideration, as any breaking change we make to the schema could break our customers’ integrations. Furthermore, we only have limited observability around this access mechanism, so we weren’t even sure how many of our customers were accessing the database in this way.
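From the customer’s side, that direct access looks roughly like the snippet below. The connection details and table name are placeholders; in practice customers generate their own read-only credentials in the product and can use whichever ODBC-capable tooling they prefer.

```python
import pyodbc

# Placeholder connection details, for illustration only.
conn = pyodbc.connect(
    "DRIVER={PostgreSQL Unicode};"
    "SERVER=export.example-customer-env.com;PORT=5432;"
    "DATABASE=export;UID=readonly_user;PWD=********"
)

cursor = conn.cursor()
cursor.execute("SELECT * FROM policies")  # illustrative table name
for row in cursor.fetchall():
    print(row)
```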
⏰ Time for Change
In hindsight it’s easy to point out poor design decisions, but the reality of working in the high-pressure environment of an early-stage start-up, where speed of delivery means everything, is that it’s easy for tech debt like the Export Service to sneak in. Both the engineering decisions (synchronous, all-or-nothing updates) and the product decision (direct DB access) were made early in the product’s life, to allow us to deliver quickly to our budding customer base. It’s only years later, when product usage has increased significantly, that these scalability issues really become apparent.
The Export Service has been on our radar as a significant piece of technical debt for a long time. We’ve known that its design is flawed, and over the past year incidents relating to it have been on the rise. The problem was always that the 3 decisions highlighted above are inter-related: it’s hard to tackle one of them without tackling all of them. The biggest obstacle was that our customers had direct access to the database, so any fundamental change we made was likely to break integrations built on top of it, and no amount of clever abstraction could help. If we could relax the constraints on direct database access, or better yet get rid of it entirely, we could build out a new solution with proper scalability in mind.
📬 Message Based Prototype
The usual way of handling an analytics service like the Export Service is to populate its database using asynchronous messages. As changes are made in Renew, fire-and-forget messages can be sent by the service to let other services know about them. Modern cloud providers like AWS all support various types of message queues and streams, so the barrier to entry for this kind of technology is very low. It seems simple enough: refactor Renew to fire a message off to AWS’s SQS service whenever something updates, then refactor the Export Service to read from SQS instead of making HTTP calls to Renew.
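On the Renew side, the publishing half of such a design might look something like this minimal boto3 sketch; the queue URL and message shape are made up for illustration (the version field is explained below):

```python
import json

import boto3

# Illustrative queue URL and message shape.
QUEUE_URL = "https://sqs.eu-west-2.amazonaws.com/123456789012/export-events"
sqs = boto3.client("sqs")


def publish_entity_changed(entity_type: str, entity_id: str, payload: dict, version: int) -> None:
    """Fire-and-forget notification that an entity changed in Renew."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(
            {
                "entity_type": entity_type,
                "entity_id": entity_id,
                "version": version,  # incremented on every change (see the version counter below)
                "payload": payload,
            }
        ),
    )
```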
Problems start to appear though once you consider some of the limitations of SQS, and messages in general. SQS only guarantees “at least once” delivery, and makes no guarantees about the ordering. This means the same message can be delivered to Export multiple times, and the order that messages are received is not necessarily the order they are sent. If messages are received out of order, it’s possible for older messages to overwrite data that has changed in newer messages, resulting in data loss.
To get around this, any change in the database increments a version counter, which is included in every message. When a message is read, the version stored in the database is checked against the one in the message, and the database is only updated if the incoming version is higher. Similarly, duplicate messages can be safely discarded if we already have that version in the database.
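A toy version of the consuming side, with an in-memory dict standing in for the Export database, shows the idea:

```python
# In-memory stand-in for the Export database, keyed by (entity_type, entity_id).
store: dict[tuple[str, str], dict] = {}


def apply_message(message: dict) -> None:
    """Apply an update only if its version is newer than what we already hold."""
    key = (message["entity_type"], message["entity_id"])
    existing = store.get(key)
    if existing is not None and message["version"] <= existing["version"]:
        # Duplicate or out-of-order delivery: safe to discard.
        return
    store[key] = {"version": message["version"], "payload": message["payload"]}
```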
That solves SQS’s limitations, but there’s still a problem with our strongly consistent database. The PostgreSQL database we use has a lot of foreign-key relationships: Policies are related to Policy Options, Models to Model Versions, Users to Teams, and so on. What would happen in the message-based system when a new Policy and its initial Policy Option are created at the same time?
- Policy is created in Renew
- Policy Option is created in Renew
- “Policy Created” message is pushed to SQS
- “Policy Option Created” message is pushed to SQS
- Export pulls down the “Policy Option Created” message first (remember, SQS makes no ordering guarantees)
- Export tries to write the new Policy Option to its database
- A Foreign Key Constraint Violation occurs, because the Policy Option points to a Policy which does not yet exist in the database
This is because our database is a relational, strongly-consistent database. Entities have relations between each other, letting us associate e.g. Policy Options with their Policies. The strongly-consistent part means that the database is guaranteed to be in a self-consistent state at all times: we are always 100% sure what the current state of the data looks like, all relations are valid and all entities are present. Strongly-consistent databases are very useful for performing business functions, as you don’t have to worry about missing or inconsistent data. But for analytic use cases, the added consistency is rarely needed, and trying to enforce it can make the architecture more complicated than necessary.
The alternative is an eventually consistent database. In these databases, the data may not be fully consistent at any given point in time, but eventually it will all be there. This is great if you mostly care about aggregations of data, or if you don’t mind waiting a little longer for the full picture to arrive. It should be noted that “eventually” here usually means a couple of seconds at most. By moving our database from strong consistency to eventual consistency, we can accept updates incrementally via messages, instead of the all-or-nothing approach outlined above. Now it doesn’t matter if a Policy Option points to a Policy that doesn’t yet exist, because we’re confident the data will arrive eventually.
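Continuing the toy apply_message sketch from earlier, the out-of-order scenario above now just works, because there is no foreign-key constraint to violate (entity ids and fields are made up):

```python
# The Policy Option arrives before its parent Policy.
apply_message({
    "entity_type": "policy_option", "entity_id": "opt-1", "version": 1,
    "payload": {"policy_id": "pol-1", "name": "Option A"},
})

# Nothing blows up: the row is simply stored, and the Policy it refers to
# turns up a moment later.
apply_message({
    "entity_type": "policy", "entity_id": "pol-1", "version": 1,
    "payload": {"name": "Policy 1"},
})
```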
Over the course of a week, we created a proof-of-concept for a message-based Export Service, which was able to synchronise with Renew via messages for 2 entity types in our database.
The proof-of-concept was a massive success, in that it demonstrated the change would not be as difficult as we had originally thought. However, moving the database from strong to eventual consistency did represent a paradigm shift that could negatively impact any customers using direct database access. If their integrations relied on strongly consistent relations, they might start getting errors that were not there before.
🤺 Challenging Assumptions
A long time ago, one of our customers asked for direct access to the Export database, and we obliged by making it a full feature within the product. Customers are able to generate read-only credentials for the database and use them with any tools they like. As we lacked observability in this area, we had assumed that some customers (but not many) continued to use this feature.
To make the proof-of-concept a reality, we would first need to make sure our customers wouldn’t be unduly impacted. We started by analysing the limited logging we had around this area, which showed almost no activity going through this route. We then reached out to our customers to ask how they were using the direct access feature.
To our surprise, we discovered that none of our customers used the feature at all, which meant fixing this technical debt would be a lot easier than we had first envisioned. If no one used direct access, we wouldn’t even need the message-based solution at all.
💋 Keep It Simple
Armed with the knowledge that no one was actually accessing the database directly, we went back to the drawing board for a simpler, more effective solution. Renew is the source of truth for all entities in the system, with the Export Service providing a read-only mirror of the data in Renew’s database (with some minor schema changes). This means all the data we need is already available in Renew… so why not just get it from there?
With direct access out of the picture, Export data can only be accessed via the REST API. That means we could simply reimplement the Export APIs within Renew, with those endpoints reading directly from the source-of-truth database, bypassing the need for a secondary service or database entirely. And because Renew and Export are written in the same language and web framework, moving the API code from Export into Renew was mostly a direct cut and paste. All of the complicated synchronisation code could be removed, along with the vast majority of the Export Service (it still serves a small function that is out of scope for this blog post). The result is one of the finest merge requests I’ve seen in my career:
95 files changed, and almost 4000 lines of code removed 🥲
🧗‍♂️ Scalability…?
So what about scalability? If analytics have moved to the transaction database, won’t the additional load cause problems?
Well, we thought so too initially, which was partly the motivation for moving to a message-based solution instead of trying to kill off the Export Service entirely. But Renew is a specialist product, with a small number of highly skilled users. While we build our systems for scalability, we’re not Meta or Amazon: the scale of our analytic data is orders of magnitude smaller than even that of a medium-sized B2C product.
We performed some load testing and are confident that the transaction database can handle any additional load from the Export endpoints. It was mostly the resource-hungry HTTP calls that were causing the performance degradation, coupled with large reads on the database (for data which might never even get queried!).
The prototype wasn’t all wasted effort, though: it provided a great learning opportunity for one of our talented mid-level engineers (Neel Chotai), and it got us thinking deeply about the assumptions we make about how our product is used, and how those assumptions should inform architectural decisions.
We may re-implement something like the Export Service in the future, if a more comprehensive set of analytics features is needed, and if we do, we now have the experience to build it in a scalable and extensible way.
We also learned a lot about our deployment and release process, which we’re actively working to streamline and automate further. It also acted as a reminder that scalability issues don’t always come from high volumes of data; they can also come from bottlenecks within your own system.