Saturday, November 3, 2007

Percentile with SQL

In summarizing some of my data, I need to use the median instead of mean. Then I start to look for implementation of median with just SQL statement.
In the process of exploring features for median calculation, a generic solution to solve for percentile seems to be more flexible.
Using the CASE statement from SQL Server, it is straight forward to find out the percentile value for a given value. To demonstrate,

declare @test table(id int, anum int)

declare @idx int
set @idx = 1
while (@idx <=10) begin
insert into
@test values (@idx, @idx)
set @idx = @idx+1
end

select sum(case when anum <=3 then 1 else 0 end)*1.0/count(*) as percentile from @test

However, the implementation has one big drawback: when there are only a few data value to consider, the above implementation for percentile will unlikely to give the answer useful. For example, if there are only two values to consider: 1 and 10, then a value of 8 will get a percentile value of 0.5. Therefore, the actual percentile interpretation has to consider the underlying data model. If the data is considered to be linear, an interpretation between the minimum and maximum value has to be considered.

No comments: