'Calculating the mode/median/most frequent observation in categorical variables in SQL impala

I would like to calculate the mode/median or better, most frequent observation of a categorical variable within my query. E.g, if the variable has the following string values: dog, dog, dog, cat, cat and I want to get dog since its 3 vs 2. Is there any function that does that? I tried APPX_MEDIAN() but it only returns the first 10 characters as median and I do not want that. Also, I would like to get the most frequent observation with respect to date if there is a tie-break.

Thank you!



Solution 1:[1]

the most frequent observation is mode and you can calculate it like this.

Single value mode can be calculated like this on a value column. Get the count and pick up row with max count.

select count(*),value from mytable group by value order by 1 desc limit 1

now, in case you have multiple modes, you need to join back to the main table to find all matches.

select orig.value from
(select count(*) c, value v from mytable) orig
join (select count(*) cmode from mytable group by value order by 1 desc limit 1) cmode
ON orig.c= cmode.cmode

This will get all count of values and then match them based on count. Now, if one value of count matches to max count, you will get 1 row, if you have two value counts matches to max count, you will get 2 rows and so on.

Calculation of median is little tricky - and it will give you middle value. And its not most frequent one.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Koushik Roy