Calculating Percentiles in SQL

SQL Tutorial
Friendly tips to help you learn SQL
`select 💝 from wagon_team;`

This tip builds on the analysis of a 🍌banana store🍌 introduced in the Summarizing Data in SQL tip.

How long do people wait for their tasty banana orders? Using basic SQL, we can compute the average wait time, but if the distribution is skewed (as many internet-driven, and perhaps banana-driven, distributions are), the average alone may not give us a complete picture of how long most people are waiting. In addition to computing the average, we might (and should) ask: what are the 25th, 50th, and 75th percentiles of wait time, and how do those numbers vary from day to day?
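To see why the average alone can mislead, here is a small sketch in Python with made-up wait times (the numbers are purely illustrative, not taken from the banana store data). One slow order drags the mean well above what a typical customer experiences, while the median barely notices it.

```python
from statistics import mean, median

# Hypothetical wait times in minutes; a single 45-minute order skews the data.
wait_times = [2, 3, 3, 4, 5, 6, 45]

avg_wait = mean(wait_times)       # 68 / 7, roughly 9.7 minutes
median_wait = median(wait_times)  # the middle (4th) value: 4 minutes

print(f"average: {avg_wait:.1f}, median: {median_wait}")
```

Six of the seven customers here waited less than the average, which is exactly the situation percentiles are meant to expose.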

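Many databases (including Postgres 9.4+, Redshift, and SQL Server) have built-in percentile functions. Here's an example using the function percentile_cont, which computes the percentiles of wait time, split (pun intended!) by day. The exact form depends on the database: Postgres implements percentile_cont as an ordered-set aggregate that is used with GROUP BY and no OVER clause, SQL Server requires the window form with OVER (PARTITION BY ...), and Redshift accepts both.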

    -- Aggregate form (Postgres 9.4+, Redshift): GROUP BY defines the groups,
    -- so no OVER (PARTITION BY ...) clause is needed. SQL Server instead requires
    -- the window form with OVER (PARTITION BY date) and no GROUP BY.
    SELECT
      date,
      percentile_cont(0.25) WITHIN GROUP (ORDER BY wait_time ASC) AS percentile_25,
      percentile_cont(0.50) WITHIN GROUP (ORDER BY wait_time ASC) AS percentile_50,
      percentile_cont(0.75) WITHIN GROUP (ORDER BY wait_time ASC) AS percentile_75,
      avg(wait_time) AS avg  -- for comparison
    FROM banana_sales
    GROUP BY date
    ORDER BY date;
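As a rough illustration of what percentile_cont computes, the sketch below reimplements it in Python for a single group. It assumes the usual definition of a continuous percentile (linear interpolation between the two nearest ordered values, which is how Postgres documents percentile_cont); the Python function and the sample wait times are made up for this example.

```python
import math

def percentile_cont(values, fraction):
    """Continuous percentile via linear interpolation between neighbours."""
    if not values:
        return None  # SQL aggregates return NULL for an empty group
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    # Zero-based fractional position of the requested percentile.
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    # Interpolate between the two surrounding values.
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

waits = [10, 20, 30, 40]  # hypothetical wait times for one day
quartiles = [percentile_cont(waits, f) for f in (0.25, 0.50, 0.75)]
print(quartiles)  # [17.5, 25.0, 32.5]
```

Because the result is interpolated, it need not be one of the observed wait times. Postgres also offers percentile_disc, which returns an actual value from the data instead.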

The structure of percentile_cont is similar to that of other aggregate and window functions: we specify how to order the data and how to group it, and the database does the rest. If we wanted to add more dimensions to our query (e.g. time of day), we'd add them to the GROUP BY clause (and to the PARTITION BY clause, in databases that use the window form).

If our database doesn't support percentile_cont (sorry MySQL, Postgres < 9.4), the query is more complicated, but fear not, it's still possible! The challenge is to order the rows by increasing wait time (per date, of course) and then pick out the middle value (for the median). In MySQL versions without window functions (before 8.0), we can use user-defined variables to keep track of the order, and in Postgres, we can use the ROW_NUMBER function. Here's the Postgres version:

    SELECT
      t1.date,
      t1.wait_time AS median
    FROM (
      -- number the rows within each date by increasing wait time
      SELECT
        date,
        wait_time,
        ROW_NUMBER() OVER (PARTITION BY date ORDER BY wait_time) AS row_num
      FROM banana_sales
    ) t1
    JOIN (
      -- count the rows per date
      SELECT date, count(*) AS total
      FROM banana_sales
      GROUP BY date
    ) t2 ON t1.date = t2.date
    -- for simplicity, when the list has an even length we just choose one value
    -- (the lower of the two middle values) instead of averaging them
    WHERE t1.row_num = CASE
      WHEN t2.total % 2 = 0 THEN t2.total / 2
      ELSE (t2.total + 1) / 2
    END;
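The same ROW_NUMBER pattern extends from the median to other percentiles. The sketch below is one possible generalization, not part of the original tip: it picks the row at rank ceil(pct / 100 * total) (the nearest-rank method) and runs the query from Python against an in-memory SQLite database, so it can be tried without a server. It assumes a SQLite build with window functions (3.25 or later); the sample rows and the helper name percentile_by_day are made up for illustration.

```python
import sqlite3

# Nearest-rank percentile per day using only ROW_NUMBER and COUNT.
# :pct is an integer from 1 to 100; pct = 50 picks the same row as the
# median query above. The +99 makes the integer division round up.
QUERY = """
SELECT t1.date, t1.wait_time AS percentile_value
FROM (
  SELECT date, wait_time,
         ROW_NUMBER() OVER (PARTITION BY date ORDER BY wait_time) AS row_num
  FROM banana_sales
) t1
JOIN (
  SELECT date, count(*) AS total
  FROM banana_sales
  GROUP BY date
) t2 ON t1.date = t2.date
WHERE t1.row_num = (t2.total * :pct + 99) / 100
ORDER BY t1.date;
"""

def percentile_by_day(conn, pct):
    """Return (date, value) rows for the given integer percentile (1-100)."""
    return conn.execute(QUERY, {"pct": pct}).fetchall()

# Hypothetical sample data: one day with an odd row count, one with an even count.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE banana_sales (date TEXT, wait_time INTEGER)")
conn.executemany(
    "INSERT INTO banana_sales VALUES (?, ?)",
    [("2015-01-01", w) for w in (5, 1, 9, 3, 7)]
    + [("2015-01-02", w) for w in (4, 2, 8, 6)],
)

medians = percentile_by_day(conn, 50)
p75 = percentile_by_day(conn, 75)
print(medians)  # [('2015-01-01', 5), ('2015-01-02', 4)]
print(p75)      # [('2015-01-01', 7), ('2015-01-02', 6)]
```

Unlike percentile_cont, this always returns a value that actually occurs in the data, so on days with few orders the two approaches can give noticeably different answers.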
