'Calculating the mode/median/most frequent observation in categorical variables in SQL impala
I would like to calculate the mode/median or better, most frequent observation of a categorical variable within my query. E.g, if the variable has the following string values: dog, dog, dog, cat, cat and I want to get dog since its 3 vs 2. Is there any function that does that? I tried APPX_MEDIAN() but it only returns the first 10 characters as median and I do not want that. Also, I would like to get the most frequent observation with respect to date if there is a tie-break.
Thank you!
Solution 1:[1]
the most frequent observation is mode
and you can calculate it like this.
Single value mode can be calculated like this on a value column. Get the count and pick up row with max count.
select count(*),value from mytable group by value order by 1 desc limit 1
now, in case you have multiple modes, you need to join back to the main table to find all matches.
select orig.value from
(select count(*) c, value v from mytable) orig
join (select count(*) cmode from mytable group by value order by 1 desc limit 1) cmode
ON orig.c= cmode.cmode
This will get all count of values and then match them based on count. Now, if one value of count matches to max count, you will get 1 row, if you have two value counts matches to max count, you will get 2 rows and so on.
Calculation of median
is little tricky - and it will give you middle value. And its not most frequent one.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Koushik Roy |