Recently I’ve been doing some performance optimisation in our app at work and one source of issues has been n + 1 queries. They’re easy to find, if you’re looking for them, and often easy to fix: just change the query from SELECT ... WHERE id = ? to SELECT ... WHERE id IN ?. However, can we automate for this and include checks for it in CI? Shift it left, so to speak, and not rely on humans being “more careful”.

First, what is an n + 1 query? It is when you fetch a collection of records (1), then, for each of those records, fetch their child records (n). It’s commonly cited as a problem with ORMs (though modern ones have techniques for avoiding it), but you can also do it accidentally if you’re using raw SQL, or a SQL builder DSL. I think it’s easier to think of them as unbatched queries. The n queries to get the child records could be batched into a single query, hence unbatched. At least to me that language is clearer, and it hints at the solution: batch the query. Technically, unbatched queries are a superset of n + 1 queries, for example, you can could just do n unbatched queries, without fetching the parent. However, the solution is the same for all of them.

Performing many unbatched queries, compared to a single batched query, has worse performance and leads to more load on the database. This is because all queries have some overhead, no matter how small they are. The database still has to use a socket connection, plan the query and it has to read from disk. By fetching data in batches, we pay this overhead once.

Make it measurable

My current process for finding unbatched querie is mostly heuristic based. For a slow request, look at its trace, if there are multiple similarly shaped¹ SELECT queries relatively close to one another then, it is likely an unbatched query. To be able to catch this programmatically, we need to define it in a measurable way. My two ideas for defining it are:

For a given piece of work², the number of queries with the same shape¹ should not exceed some bar.
For a piece of batched work, the query count should not increase with the units of batched work (e.g. two requests that fetch 10 and 100 objects should make the same, or similar, number of queries).

Ideas

Query count by shape

This is really pinpointing the heuristic I use when manually finding unbatched queries. Basically, for a given unit of work (e.g. a request), count the number of queries by shape. If it is over a certain threshold then we may have an unbatched query. Note, it’s important to scope the count to just a unit of work and not by time at all because then general performance can change the results of the test. Performance in different environments and under different loads can vary wildly.

The big issue I see with this idea is what rules we define for resetting the count in the middle of a unit of work. For example, imagine the following scenario:

SELECT * FROM foo WHERE id = 1;
UPDATE foo SET x = 1 WHERE id = 1;
SELECT * FROM foo WHERE id = 1;

We should not classify this as an unbatched query because there was an update to the record in the middle of the unit of work so refetching is important if we wanted to return the updated value to the caller. However, if the UPDATE updated an unrelated record then maybe this is an unbatched query.

What I do like about this method is that can be applied during queries or as an after the fact analysis of logs or tracing. Doing the analysis during queries is great for tests because you get instant feedback.

Ratio of the number of fetched objects to the number of queries

This measures exactly what we’re interested in and is likely technically simpler to implement due to really not having weird edge cases like the query count by shape idea does. I think this strategy works particularly well as a regression test. If you fix an unbatched query then you could add a test that checks number of queries is consistent over changes in the number of objects fetched.

The disadvantage is that you had to opt into it, it’s not something that you can’t really just switch it on. It also requires people to remember to write this check into tests. Also, as this strategy hinges on tests, you can’t apply it in prod.

Existing approaches

I did a bit of an investigation into what already exists for catching unbatched queries.

Bullet

GitHub: https://github.com/flyerhzm/bullet

Bullet is a Ruby gem that plugs into ActiveRecord and notifies you when your app performs n + 1 queries. It does this by checking if a the child relation has been loaded yet (i.e. eager loading). This is a very ORM specific thing, and bullet specifically is intended to be used with just ActiveRecord (or mongoid). If you were using Sequel models I suspect it might be able to be extended to work there but the approach doesn’t map to raw SQL or SQL DSLs.

Prosopite

GitHub: https://github.com/charkost/prosopite

This measures via the first definition from make it measurable. In their documentation they refer to query shape as its footprint. Query footprints are grouped by callsite.

Like bullet, prosopite is specifically for ActiveRecord.

go-sql-n-plus-one-finder

GitHub: https://github.com/rrreeeyyy/go-sql-n-plus-one-finder

This package is an implementation of prosopite in Go.

N + 1 Control

GitHub: https://github.com/palkan/n_plus_one_control

This tool is interesting in that it takes a different approach from the previously mentioned libraries/gems. It hooks into minitest or rspec (so Ruby specific) and detects if the number of queries increases as you increases the number of inputs (i.e. measures option 2).

Telerik

Link: https://docs.telerik.com/data-access/developers-guide/profiling-and-tuning/profiler-and-tuning-advisor/data-access-profiler-n-plus-one-problem

A discontinued tool that provided, amongst other things, profiling for applications. It included detection of n + 1 (i.e. unbatched) queries. This product looks to have been .NET specific.

Takeaway

My takeaway is that there is some tooling in this area but it’s not particularly extensive. I’m aware of many places checking for unbatched queries specifically, though maybe larger tech companies have built their own thing internally. My suspicion is that this is handled like other performance regressions: if a performance regression is noticed (e.g. SLO, alerting, etc.) it is investigated and the unbatched query fixed.

I think the least flakey strategy is to write tests that check the number queries made doesn’t change over fetching different number of objects. However, this requires you to write tests specifically for this which is not a solution I like.

The most pragmatic solution, I think, is to count the number of queries of a particular shape made on a given request path. If this is more than a given threshold then notify (e.g. fail the test, create an issue). This solution will require some heuristics. For example, reset the counter if there is a write on the request path because we may read the result back.

The shape of a query is a parameterised query without the parameters bound. For example, SELECT * FROM foo WHERE id = 1 has the same shape as SELECT * FROM foo WHERE id = 1 because they both have the shape SELECT * FROM foo WHERE id = ?. However, SELECT * FROM bar WHERE id = 1 has a different shape because it queries from a different table.

work is a discrete piece of work initiated by a user. In a web context, this is a request but unbatched queries are a problem for many types of apps, not just web.

Ideas for catching unbatched queries