Why do we talk about P99 and P95 request latency?

What does Pn, also known as the nth percentile, mean? It is the value that n% of the values in a distribution are less than or equal to. In this post, the distribution we’ll be talking about is request latency.

When we talk about a value being the P99 (or P95, or P<whatever>), that means that 99% of the time, requests finish in this time or less. More concretely, if the P99 request latency for your site is 1s, then 99% of requests take 1s or less.
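To make the definition concrete, here is a minimal sketch of computing a percentile from a list of latencies. It uses the nearest-rank method, which is only one of several common ways to compute percentiles; monitoring tools may interpolate or approximate instead. The `percentile` function and the sample latencies are made up for illustration.

```python
import math


def percentile(samples, n):
    """Return the nth percentile (0 < n <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < n <= 100:
        raise ValueError("n must be in the range (0, 100]")
    ordered = sorted(samples)
    # Smallest rank such that at least n% of the samples are <= the value at that rank.
    rank = math.ceil(n * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [120, 95, 110, 105, 130, 2500, 100, 115, 98, 102]

print("P50:", percentile(latencies_ms, 50))  # 105
print("P90:", percentile(latencies_ms, 90))  # 130
print("P99:", percentile(latencies_ms, 99))  # 2500
```

Notice that the single 2500ms request doesn't move the P50 at all, but it is exactly what the P99 reports.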

Let’s look at two diagrams, one for P50 and the other for P99, to see what this looks like. Latency is on the x-axis and request count on the y-axis. The P50 value is the point on the x-axis where, if we colour in everything under the curve to the left of it, we colour in 50% of the area under the curve. The same goes for P99 or any other percentile.

Why do we talk about P99 or P95? The median, P50, is good enough for housing prices, so why isn’t it good enough for request latency or other timing-related metrics? The key is scale: your site is likely receiving more requests per day (or maybe even per hour or second) than there are houses for sale. So, if we talk about P50 request latency, any given request has only a 50% chance of finishing within that time - there will be a lot of customers for whom the site takes longer to load. Google have a paper that dives into this called “The Tail at Scale”.¹ One of the key takeaways is that when there are a lot of events, even low-probability outcomes happen a lot in absolute terms.
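A quick back-of-the-envelope sketch shows what “a lot in absolute terms” can look like. The traffic figure of 10 million requests per day is an assumption picked purely for illustration, and `requests_slower_than` is a hypothetical helper.

```python
def requests_slower_than(percentile, total_requests):
    """Upper bound on how many requests are slower than the given percentile."""
    # By definition, up to (100 - percentile)% of requests take longer than that value.
    return round(total_requests * (100 - percentile) / 100)


# Assumed traffic volume, for illustration only.
requests_per_day = 10_000_000

for p in (50, 95, 99):
    slower = requests_slower_than(p, requests_per_day)
    print(f"Up to {slower:,} requests per day are slower than the P{p}")
```

Even at P99, that is up to 100,000 slow requests every day at this volume.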

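Another factor is that the probability of multiple requests all being less than or equal to some percentile becomes vanishingly small as the number of requests grows. This is due to the way that probabilities combine: if the requests are independent of each other, the probability that all of them finish within the percentile is the product of the individual probabilities. Say your P50 request latency is 200ms and it takes 5 requests, all running in parallel, to load the page. What is the probability that the page will load in 200ms?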

0.5⁵ ≈ 0.031 = 3.1%

Only 3.1% of page loads will happen in 200ms or less! That means only a very small proportion of your customers will have this experience. What if we look at the P95 request latency instead?

0.95⁵ ≈ 0.774 = 77.4%

When we talk about the P95, we’re now talking about the experience of most of our customers - 77.4% is much more representative than 3.1%. As more requests (or more generally, events) are included, the probability of them all being under our chosen percentile gets smaller and smaller because of that combination of probabilities.
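The arithmetic above is easy to play with in code. This is a small sketch that assumes each request’s latency is independent of the others, which real systems only approximate (requests often share a slow backend or network path). The function name `prob_all_within` is made up for this example.

```python
def prob_all_within(percentile, num_requests):
    """Probability that every one of num_requests finishes within the given percentile.

    Assumes the request latencies are independent of each other.
    """
    return (percentile / 100) ** num_requests


# Reproduce the two calculations from the text: 5 parallel requests.
print(f"P50, 5 requests: {prob_all_within(50, 5):.1%}")  # 3.1%
print(f"P95, 5 requests: {prob_all_within(95, 5):.1%}")  # 77.4%

# See how the probability shrinks as a page needs more requests.
for n in (1, 5, 10, 20):
    row = ", ".join(f"P{p}: {prob_all_within(p, n):.1%}" for p in (50, 95, 99))
    print(f"{n:>2} requests -> {row}")
```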

So which percentile should we pick? As with a lot of things, it depends. You could pick a high percentile (e.g. P99), but that can be very expensive. By definition, you’re including rarer events than you would with, say, P95, and improving your P99 latency rather than your P95 latency may cost more than the value it provides. I think a good rule of thumb is that the more requests to your API it takes to fulfill a useful function (e.g. load the page), the higher the percentile you should choose to optimise for, because it’ll be more representative of your customers’ experience. So not P50.
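One way to put a number on that rule of thumb is to run the earlier calculation in reverse: pick the share of page loads you want to be fast and work out which per-request percentile that implies. This is only a rough sketch under the same independence assumption as before, and `required_percentile` is a hypothetical helper, not an established formula for choosing targets.

```python
def required_percentile(target_fraction, num_requests):
    """Per-request percentile needed so that target_fraction of page loads
    have every request finish within that percentile.

    Assumes independent request latencies.
    """
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be between 0 and 1")
    if num_requests < 1:
        raise ValueError("num_requests must be at least 1")
    # Inverse of p ** n = target, i.e. p = target ** (1 / n).
    return 100 * target_fraction ** (1 / num_requests)


# How high must the per-request percentile be for 95% of page loads?
for n in (1, 5, 20):
    print(f"{n:>2} requests per page -> about P{required_percentile(0.95, n):.2f}")
```

With 5 requests per page, covering 95% of page loads already means looking at roughly the P99 of individual requests.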

¹ https://research.google/pubs/pub40801/