'How to calculate median of an unsorted frequency table in O(n)?
I have a dataset comprised of n unsorted tuples representing numbers (let's say specific color codes) and their frequency (number of times of appearance).
I want to find an exact median of the numbers with worst case complexity of O(n).
For example:
dataset: {(5000, 8000), (25000, 4000), (9, 9000)} median: 5000
dataset: {(7000, 4), (23000, 400), (3000, 9000), (2500, 12000), (19000, 350), (500, 9000)....} median: ?
Failed attempts so far:
- "Decompress" the list (so that it looks like this:
{7000, 7000, 7000, 7000, 23000, 23000...}
) and then sort it. Problem is - it takes Ω(nlogn), and probably more since the frequencies can be very large and doesn't have any upper bound. - Try to use QuickSelect over the data. To ensure O(n) time complexity we must guaranty good pivot selection. To do so I thought about Median of medians (supposedly O(n)) with the data - but I couldn't figure out how to do that without decompressing, thus making it potentially more than O(n).
Is there a way to manipulate the tuple list so that it wouldn't be decompressed and still use median of medians or another way to find the median?
End note: I don't want to assume anything about the dataset - amount of tuples, confined range of numbers/frequencies, etc).
Solution 1:[1]
Use quickselect on the values, and only pay attention to the frequencies in determining which half to keep.
Your ideal pivot is one that splits the list values in half. Because that will cut the work of the next pass in half. Where that split happens to in the whole dataset doesn't particularly. Because your goal is to get it down to the one value you want, and then you're done.
This means that for median of medians you can ignore frequencies entirely in selecting a pivot. Then pay attention to frequencies when deciding which side of the pivot to keep. And ignore frequencies again while selecting the next pivot.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | btilly |