Iterate twice through values in a Hadoop Reducer

I have read in a couple of places that the only way to iterate twice through the values in a Reducer is to cache those values.

However, in that case there is a limitation: all the values must fit in main memory.

What if you need to iterate twice, but you don't have the luxury of caching the values in memory?

Is there some kind of workaround?

There may already be answers to this problem, but I'm new to Hadoop, so I'm hoping a solution has been found since those questions were asked.


To be more concrete with my question, here is what I need to do:

  • The Reducer receives a certain number of points (for example, points in 3D space with x, y, z coordinates)
  • One random point among them should be selected - let's call it firstPoint
  • The Reducer should then find the point farthest from firstPoint; to do that it needs to iterate through all the values - this way we get secondPoint
  • After that, the Reducer should find the point farthest from secondPoint, so it needs to iterate through the dataset again - this way we get thirdPoint
  • The distance from thirdPoint to all other points then needs to be calculated

The distances from secondPoint to all other points, and from thirdPoint to all other points, need to be saved so that additional steps can be performed.

Buffering these distances is not a problem, since each distance is just a double. The points themselves, however, may lie in n-dimensional space, so each point has n coordinates and caching all of them could take up too much space.

My original question was how to iterate twice, but the question is really more general: how can you iterate multiple times through the values, in order to perform the steps above?
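Assuming each pass over the reducer's values can be expressed as one iteration over an `Iterable`, the steps above can be sketched in plain Java. The class and method names here are my own for illustration, not part of the Hadoop API; a real reducer would need to make the values re-iterable first (for example by caching them), since Hadoop's value iterator can only be consumed once.

```java
import java.util.List;
import java.util.Random;

public class FarthestPoints {
    // Squared Euclidean distance between two n-dimensional points.
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // One full pass over the points: return the point farthest from `from`.
    // In a reducer, each call corresponds to one iteration over the values.
    static double[] farthestFrom(Iterable<double[]> points, double[] from) {
        double best = -1;
        double[] bestPoint = null;
        for (double[] p : points) {
            double d = dist2(p, from);
            if (d > best) { best = d; bestPoint = p; }
        }
        return bestPoint;
    }

    public static void main(String[] args) {
        List<double[]> points = List.of(
            new double[]{0, 0, 0},
            new double[]{1, 1, 1},
            new double[]{5, 5, 5});
        Random rnd = new Random();
        double[] first = points.get(rnd.nextInt(points.size())); // firstPoint
        double[] second = farthestFrom(points, first);           // pass 1
        double[] third = farthestFrom(points, second);           // pass 2
        // pass 3: distances from thirdPoint to all other points
        for (double[] p : points) {
            System.out.println(Math.sqrt(dist2(third, p)));
        }
    }
}
```

The key point is that `farthestFrom` is called twice (and the final distance computation adds a third pass), which is exactly why a single-pass iterator is not enough here.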



Solution 1:[1]

It might not work for every case, but you could try running more reducers, so that each one processes a small enough amount of data that you can then cache the values in memory.
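Once the per-reducer data is small enough, the caching itself is straightforward: copy each value out of the iterator into a list, then iterate over the list as many times as needed. A minimal sketch (the helper below is mine, not Hadoop API; note that Hadoop reuses the value object between iterations, so each element must be copied):

```java
import java.util.ArrayList;
import java.util.List;

public class ValueCache {
    // Copy values out of a single-pass Iterable into a re-iterable list.
    // Each element is cloned because a reducer framework may reuse the
    // same backing object for every value it hands out.
    static List<double[]> cache(Iterable<double[]> values) {
        List<double[]> cached = new ArrayList<>();
        for (double[] v : values) {
            cached.add(v.clone()); // defensive copy
        }
        return cached;
    }
}
```

With more reducers, each `reduce()` call sees fewer values per key group, so the list built by a helper like this stays small enough to fit in memory.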

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Jeremy Beard