
The Great Database Migration

Product & Engineering
Angelina Quach

Software Engineer

June 27, 2024

At the end of Q2 2024, we migrated Shepherd’s pricing engine database with zero downtime. We moved from SQLite to Postgres, and streamlined the ETL process into a one-click solution.

Shepherd has underwritten and priced thousands of complex risks since our inception, spanning from Primary Casualty to Excess Casualty products. Much of the pricing magic happens under the hood, where our actuaries and engineers collaborate to codify complex actuarial models, factors¹, and Shepherd’s secret sauce into our in-house pricing engine. This engine, which we’ve lovingly dubbed “Alchemist”, received a much-needed makeover this quarter.

History

Since Shepherd launched in 2021, Alchemist has lived through two transformations. The first version converted an Excel file into generated JSON with correct typing and collocated data, leveraging auto-generated functions to search through the JSON for the relevant factors. This method proved successful for our first few models and quickly got our pricing needs from zero to one, but it ran into performance issues once our factors data started growing: new product lines required new factors, and each model had to be strictly versioned with its data. The issues included lookup performance, the JSON memory footprint, and the size of the generated types slowing down the TypeScript compiler. To address them while keeping the data collocated, we migrated the JSON and codegen to use a SQLite file instead.

The second version of the Excel-based code generation wiped and recreated a SQLite database on every run. This file was stored in our GitHub repo with Git LFS so it could be atomically versioned, and it was deployed in our CI/CD pipeline alongside our application code.

Why Google Sheets? Google Sheets unlocks the ability for anyone in our company to easily visualize, manipulate, and collaborate on Shepherd’s ratings data without any engineering lift. It also reduces onboarding friction as our team continues to scale, since most users are trained on Excel.

SQLite is a tried-and-true database, billed as the most used database engine in the world. It is incredibly lightweight (<1 MB), open source (free!), and reliable. Since it requires no additional infrastructure or maintenance, it frees up engineering bandwidth for other topline goals. It also serves queries incredibly fast, being a self-contained, embedded database stored in a single binary file: 79.15% of our pricing operations averaged a response time of 1 second or less. As a benchmark, Sentry considers high-performing APIs to have average response times between 0.1 and 1 second. While we have some gnarly, complicated deals, underwriters expect real-time pricing feedback to help them build intuition around our models, so our users were quite delighted with this performance!

Pricing API latency distribution over a 74 day window, before the great migration on June 5th. 79.15% of pricing operations averaged 1 second or less, 90.2% averaged 2 seconds or less.

The Impetus for Change

While SQLite served our team dutifully through a year of astonishing growth and empowered us to build some amazing products in that time, it became increasingly clear that the scale of the business would outgrow the benefits of our SQLite solution: our gross written premium (GWP) has doubled year-over-year, and we have an ambitious plan to introduce two new insurance products in 2024.

Furthermore, bundling an 80MB+ SQLite file into our codebase slowed down the entire GitHub repository and kept us from considering more robust hosting platforms. Inserting or updating data became a wearisome manual process, and without a GUI to surface the data to stakeholders, pricing data became a black box. It was time to consider a new solution.

Key Questions Driving the Migration

To ensure the migration would meet our needs, we framed our goals around these critical questions:

  1. Separation of Application and Database

    • How can we eliminate the risk of a single point of failure?

    • How can we enable appropriate horizontal and vertical scaling based on resources?

  2. Serverless Database Architecture

    • How can we leverage auto-scaling and smart provisioning?

    • How can we eliminate server management while ensuring managed security, compliance, and automated disaster recovery?

  3. SQL Compatibility

    • How can we maintain compatibility with our existing Alchemist engine, which uses Knex as its SQL query builder?

  4. Performance

    • How can we ensure there are no performance regressions, and ideally even improve performance? When we launched Shepherd, we promised to respond to every submission with an indication within 24 hours, which ties directly to the performance of our platform.

  5. Developer Experience

    • How can we improve the developer experience, particularly with respect to data handling and operations?

  6. Data Transparency

    • How can we demystify data changes and make them more transparent to stakeholders?

With these goals in mind, we got to work!

Steps

Step 1: Choosing the Right Database

Our team has always appreciated Neon for their innovative adoption of the database branching paradigm. With Neon, you can quickly and cost-effectively branch your data into writeable copies that you can test against in isolation without any impact to main (production) data. These branches can also be used to instantly back up a database, carry out migrations, and execute immediate data recovery.

This made Neon the obvious choice for our next database solution.

Step 2: Code Changes

Migrating from an embedded, in-process database to a serverless architecture required us to convert synchronous functions to asynchronous ones. Luckily, one of the best decisions we made early on was to abstract our data mapping and retrieval logic behind auto-generated types and functions. We mapped each factor set to a table (we have nearly 400 tables!) and layered on a flexible query pattern. This meant that any code changes only needed to happen at the codegen level and would propagate to wherever the function was called - Voila! ✨

In the snippet below, a developer can choose to pass in any number of query clauses, and even specify whether they want the query to execute independently or as part of a transaction.

// auto-generated code
const getTerritoriesData = async ({
  filter = {},
  knexTransaction,
}: {
  filter?: Record<string, any> | Knex.QueryCallback;
  knexTransaction?: Knex.Transaction<any, any[]> | null;
}) => {
  const result = await (knexTransaction ?? knexInstance)
    .select('*')
    .from('territories')
    .where(filter);

  return result as unknown as Array<territoriesType>;
};
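
For illustration, a call site might look like the following (the filter values and the transaction wrapper here are hypothetical examples rather than code from our platform):

// Standalone query with a simple filter:
const akTerritories = await getTerritoriesData({
  filter: { state: 'AK' },
});

// Or as part of a transaction spanning multiple queries:
await knexInstance.transaction(async (trx) => {
  const territory = await getTerritoriesData({
    filter: { state: 'AK', zipCode: '99501' },
    knexTransaction: trx,
  });
  // ...additional reads within the same transaction
});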

Step 3: Deployment Changes

Separating a database from an application also means separating deployment infrastructure. Rather than reinvent the wheel, we reused the functionality that previously inserted data into SQLite: we now write those SQL statements to SQL files, logically separated by factor set.

DROP TABLE IF EXISTS "territories";

CREATE TABLE "territories" ("zipCode" varchar(255), "state" varchar(255), "territory" varchar(255));
CREATE INDEX "stateZipCodeHelpers" ON "territories" ("state", "zipCode");
CREATE INDEX "territoriesByStateAndZipCode" ON "territories" ("state", "zipCode");

INSERT INTO "territories" ("territory", "state", "zipCode") VALUES ('101', 'AK', '99501');
INSERT INTO "territories" ("territory", "state", "zipCode") VALUES ('101', 'AK', '99502');
INSERT INTO "territories" ("territory", "state", "zipCode") VALUES ('101', 'AK', '99503');
...

These SQL files go through a pull request and are committed to GitHub. Surfacing the data, schema, and operations through a PR boosts our confidence in releasing database changes — no more binary file black box! The changes are also atomically versioned by GitHub, so we can easily track the step changes in our database over time.

To maintain the spirit of CI/CD, we leverage the pre-deploy feature of Render (our backend service hosting platform) to execute all SQL scripts against the corresponding Neon database before the service starts. If the pre-deploy step fails, all downstream operations are skipped and the service continues running its most recent successful release, with zero downtime — an inherent guard rail that protects our application from any database failures!
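
The pre-deploy script itself isn't shown here, but conceptually it just needs to run every generated SQL file against the Neon database. Here is a minimal sketch, assuming a DATABASE_URL environment variable and a factors-sql/ directory of generated files (both names are illustrative):

// apply-factor-sql.ts - hypothetical sketch of a pre-deploy step
import { promises as fs } from 'fs';
import path from 'path';
import knex from 'knex';

const knexInstance = knex({
  client: 'pg',
  connection: process.env.DATABASE_URL, // Neon connection string
});

const applyFactorSql = async (dir = 'factors-sql') => {
  const files = (await fs.readdir(dir)).filter((f) => f.endsWith('.sql')).sort();
  for (const file of files) {
    const sql = await fs.readFile(path.join(dir, file), 'utf8');
    // Each file drops and recreates one factor set's table, so apply it atomically.
    await knexInstance.transaction(async (trx) => {
      await trx.raw(sql);
    });
  }
  await knexInstance.destroy();
};

applyFactorSql().catch((err) => {
  console.error(err);
  process.exit(1); // a non-zero exit fails the pre-deploy step and halts the release
});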

Step 4: Developer Experience

The developers’ pain point was manually transforming actuarial Google Sheets into codified types and data-retrieval functions, then ensuring that the data was inserted correctly into SQLite. This process was tedious, error-prone, and took time away from more meaningful engineering work.

… so we reduced the process to a one-click solution. We built a dashboard in our underwriting platform where actuaries can request factor changes. With a single click, an asynchronous job kicks off, automating the steps of Google Sheet extraction > parsing > codegen > SQL script generation > PR creation. The only engineering involvement now required is a PR review; once the PR is merged, Render takes care of the data deployment. Previously, the average time to complete this workflow was 22 hours and 34 minutes. Now it takes 41 seconds — a 1980x improvement!

We leverage Inngest to integrate reusable, event-driven functions into our system. Using the Inngest SDK, we define these functions within our existing codebase, enabling us to deploy our application anywhere without additional infrastructure. By specifying which events should trigger which functions, Inngest automates the entire process while providing features like automatic retries, concurrency controls, and extended runtimes.
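
As a rough sketch of what this looks like with the Inngest SDK (the app id, event name, step names, and helper functions below are hypothetical placeholders, not Shepherd's actual pipeline code):

import { Inngest } from 'inngest';

const inngest = new Inngest({ id: 'underwriting-platform' }); // hypothetical app id

// Placeholder steps for illustration only; the real implementations live elsewhere.
const extractSheet = async (sheetId: string): Promise<string[][]> => [];
const parseFactors = async (rows: string[][]): Promise<Record<string, string>[]> => [];
const generateCode = async (factors: Record<string, string>[]): Promise<void> => {};
const generateSql = async (factors: Record<string, string>[]): Promise<string[]> => [];
const openPullRequest = async (sqlFiles: string[]): Promise<void> => {};

// Hypothetical sketch of the factor-update pipeline as an Inngest function.
export const updateFactors = inngest.createFunction(
  { id: 'update-pricing-factors' },      // hypothetical function id
  { event: 'factors/update.requested' }, // hypothetical event name
  async ({ event, step }) => {
    const sheetId = (event.data as { sheetId: string }).sheetId;
    // Each step.run is checkpointed and retried independently on failure.
    const rows = await step.run('extract-google-sheet', () => extractSheet(sheetId));
    const factors = await step.run('parse-factors', () => parseFactors(rows));
    await step.run('generate-code', () => generateCode(factors));
    const sqlFiles = await step.run('generate-sql', () => generateSql(factors));
    await step.run('open-pull-request', () => openPullRequest(sqlFiles));
  },
);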

Roadblocks

After standing up the Neon cluster, our pricing engine experienced a 723.66% increase in latency. By any standard, this is unacceptable performance. Our team was, quite literally, shooketh.

Many war rooms later, we adopted the following techniques to reduce our latency:

Ensure the database is in the same region as the application server

When servers are closer together, there is reduced latency because the physical distance the data must travel is shorter. With shorter routes, data packets experience fewer delays and disruptions, enabling higher throughput. Fewer network hops typically result in more stable and reliable connections as well.

Caching

Whether through a caching service or application-level caching, reusing frequently accessed data and computations improves performance. We opted to use a memoization library called micro-memoize, since it consistently boasted the best performance for most use cases (a minimal usage sketch follows the list below). A couple of notes to consider:

  • cache hydration: Every service request checks to see if the cache has data. If it does not, then the server must perform a database request and save the response in the cache. This flow can take some time, and will be the case every time the cache is invalidated or the server is restarted (for application-level caches).

    • We evaluated the first-request latency (~3.15s) and the subsequent request latencies (~1.02s), and decided that a fast-follow solution to hydrate the cache was acceptable to unblock ourselves and harness the overall performance gains.

  • cache invalidation: Cached data can become stale and irrelevant, so balancing the cache data’s TTL with delivering accurate responses is critical. Whenever the cache is invalidated, the first request afterwards will suffer a performance hit.

    • This does not apply to our use case, as Alchemist is effectively a read-only database to store insurance factors. The cache gets invalidated with every deployment; since our pricing database gets deployed as part of our CI/CD pipeline, up-to-date data will always be available in the cache.

  • server memory usage: An application-level cache increases the application server’s memory consumption, which can compromise costs and performance. Be careful to tune your services accordingly.

    • After some experimentation, we set our cache size to 500 per memoized function. Our services also have auto-scaling configured.
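
As promised above, here is a minimal sketch of the memoization itself, wrapping the generated getTerritoriesData function from earlier. The 500-entry cache size matches the setting mentioned above; the deep-equality comparator is our own illustrative choice so that equivalent filter objects share a cache entry:

import memoize from 'micro-memoize';
import { isEqual } from 'lodash';

// Wrap the auto-generated retrieval function in a memoized version.
const getTerritoriesDataCached = memoize(getTerritoriesData, {
  isPromise: true, // cache the resolved value of the returned promise
  maxSize: 500,    // 500 entries per memoized function, as noted above
  isEqual,         // deep-compare arguments so equivalent filters hit the cache
});

// The first call for a given filter hits the database and hydrates the cache;
// repeat calls with an equivalent filter are served from memory.
const territories = await getTerritoriesDataCached({ filter: { state: 'AK' } });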

Parallelize Functions

If multiple asynchronous functions are executed independently of each other, parallelizing them can significantly improve latency.

// total runtime: 1.354 seconds
const getDataFromTables = async () => {
  const table1Data = await db.table1.getData({}); // 201ms
  const table2Data = await db.table2.getData({}); // 323ms
  const table3Data = await db.table3.getData({}); // 181ms
  const table4Data = await db.table4.getData({}); // 273ms
  const table5Data = await db.table5.getData({}); // 376ms
};

The code above contains 5 asynchronous functions executing sequentially. As a result, getDataFromTables takes over a second to complete! In the snippet below, where each call executes concurrently, the runtime is only as long as the longest-running function. By parallelizing requests, we see a 72.23% latency reduction.

// total runtime: 376ms
const getDataFromTables = async () => {
  const [
    table1Data,
    table2Data,
    table3Data,
    table4Data,
    table5Data,
  ] = await Promise.all([
    db.table1.getData({}),
    db.table2.getData({}),
    db.table3.getData({}),
    db.table4.getData({}),
    db.table5.getData({}),
  ]);
};

Connection Pooling

A connection pool maintains a fixed number of open database connections, which it then uses to distribute and execute client requests. This reduces the resource cost of acquiring a new connection for every request, increases concurrency, and reduces contention for connections. You can also configure the number of connections to maintain in the pool and how frequently it is refreshed.

According to Neon, connection pooling can provide more than a 10x boost in throughput and a similar reduction in application response times.
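
For reference, here is roughly what an explicit pool configuration looks like at the Knex level; the pool sizes are illustrative defaults rather than our production values, and Neon also offers pooled connection strings that can be dropped in as the connection setting.

import knex from 'knex';

// Illustrative Knex setup with an explicit connection pool.
const knexInstance = knex({
  client: 'pg',
  connection: process.env.DATABASE_URL, // e.g. a Neon connection string
  pool: {
    min: 2,                   // keep a couple of connections warm
    max: 10,                  // cap concurrent connections to the database
    idleTimeoutMillis: 30000, // recycle connections idle for more than 30 seconds
  },
});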

We used 3 scenarios with a considerable number of pricing needs as part of our performance benchmarks

Ultimately, we were able to get 62.63% of our pricing operations to average 1 second or less, and 79.01% to average 2 seconds or less.

Closing Thoughts

Overall, this migration proved to be a massive success. Shepherd is now better equipped to handle a higher volume of business and priced risks. We no longer need to worry about application/database coupling, and our engineering team’s velocity continues to grow.

If these technical challenges excite you as much as they excite us, check out our careers page!

Footnotes

¹ “Factors” refer to data we obtain from ISO that are used as scalars in our pricing models.
