What runs first: the partitioner or the combiner?

Between the partitioner and the combiner, which runs first?

I was under the impression that the partitioner runs first, then the combiner, and then the keys are redirected to different reducers. But that last step also sounds like the partitioner's job, so I'm confused. Please help me understand.



Solution 1:[1]

The direct answer to your question is => COMBINER

Details: The Combiner can be viewed as a mini-reducer in the map phase. It performs a local reduce on the mapper results before they are distributed further. Once the Combiner has executed, its output is passed on to the Reducer for further work.

whereas

The Partitioner comes into the picture when we are working with more than one reducer. The Partitioner decides which reducer is responsible for a particular key. It takes the Mapper result (or the Combiner result, if a Combiner is used) and sends it to the responsible Reducer based on the key.

For a better understanding, you can refer to Figure 4.6 ("Combiner step inserted into the MapReduce data flow") from the Yahoo Developer Tutorial on Hadoop.
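This division of labor can be sketched in a few lines of plain Python (a simulation, not Hadoop code; all function names here are invented for the example): the combiner does a local reduce on one mapper's output, and the partitioner then routes each key to a reducer.

```python
from collections import Counter

def mapper(line):
    # Emit (word, 1) for every word, as in the classic word count.
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    # Local reduce on a single mapper's output: sum counts per key.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

def partition(key, num_reducers):
    # Decide which reducer owns this key (hash partitioning).
    return hash(key) % num_reducers

mapped = mapper("the quick brown fox the lazy dog the end")
combined = combiner(mapped)  # ("the", 3) instead of three ("the", 1) pairs
routed = {r: [] for r in range(2)}
for key, value in combined:
    routed[partition(key, 2)].append((key, value))
```

The point of combining before shipping is visible in `combined`: three `("the", 1)` pairs collapse into one `("the", 3)` pair, so less data crosses the network.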

Solution 2:[2]

Partition comes first.

According to "Hadoop: The Definitive Guide", the output of the Mapper is first written to a memory buffer, then spilled to a local directory when the buffer is about to overflow. The spilled data is divided according to the Partitioner, and within each partition the result is sorted and then combined, if a Combiner is given.

You can verify this by simply modifying the word-count MR program. My result for the input "the quick brown fox jumped over a lazy dog" is:


Word, Step, Time
fox, Mapper, **********754
fox, Partitioner, **********754
fox, Combiner, **********850
fox, Reducer, **********904


Obviously, the Combiner runs after the Partitioner.
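The spill path described in the book (buffer, then partition, then per-partition sort, then combine) can be roughly simulated in Python; the helper names here are invented for illustration and this is only a sketch of the order of operations:

```python
from collections import defaultdict
from itertools import groupby

def spill(buffer, num_reducers):
    # 1. Partition the buffered (key, value) pairs per reducer.
    partitions = defaultdict(list)
    for key, value in buffer:
        partitions[hash(key) % num_reducers].append((key, value))
    # 2. Sort each partition by key, then 3. run the combiner on it.
    spilled = {}
    for p, pairs in partitions.items():
        pairs.sort(key=lambda kv: kv[0])
        spilled[p] = [(k, sum(v for _, v in group))
                      for k, group in groupby(pairs, key=lambda kv: kv[0])]
    return spilled

buffer = [(w, 1) for w in "the quick brown fox jumped over the lazy dog".split()]
result = spill(buffer, num_reducers=2)
```

Note that the combine step only sees pairs that are already grouped into one partition and sorted, which matches the ordering observed in the timestamps above.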

Solution 3:[3]

Partitioner runs before Combiner; see the MapReduce Comprehensive Diagram.

You can have custom partition logic, and after the mapper results are partitioned, the partitions are sorted and the Combiner is applied to each sorted partition.

See the Hadoop MapReduce Comprehensive Description.

I checked it by running a word-count program with a custom Combiner and Partitioner and timestamp logging:

Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682580 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682582 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682583 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682583 : world : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682584 : world : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682585 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682585 : world : 1
18/04/23 14:41:22 INFO mapred.LocalJobRunner: 
18/04/23 14:41:22 INFO mapred.MapTask: Starting flush of map output
18/04/23 14:41:22 INFO mapred.MapTask: Spilling map output
18/04/23 14:41:22 INFO mapred.MapTask: bufstart = 0; bufend = 107; bufvoid = 104857600
18/04/23 14:41:22 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214368(104857472); length = 29/6553600
Apr 23, 2018 2:41:22 PM mapreduce.WordCountCombiner reduce
INFO: Combiner: 1524483682614 : hello 
Apr 23, 2018 2:41:22 PM mapreduce.WordCountCombiner reduce
INFO: Combiner: 1524483682615 : world
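For reference, Hadoop's stock HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; a custom partitioner like the one logged above overrides `getPartition` with its own logic. Here is that default formula rendered in Python (using a stand-in hash value, since Python's `hash` differs from Java's `hashCode`):

```python
INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE

def get_partition(key_hash, num_reduce_tasks):
    # Mask off the sign bit, as HashPartitioner does, so the
    # partition index is always non-negative.
    return (key_hash & INT_MAX) % num_reduce_tasks

# A negative hashCode still maps into [0, num_reduce_tasks):
assert 0 <= get_partition(-123456789, 3) < 3
```

The masking matters because `hashCode()` can be negative in Java, and a negative partition index would be invalid.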

Solution 4:[4]

Combiner runs before Partitioner.

The Combiner runs after map to reduce the item count of the map output, so it decreases the network load. Reduce runs after the Partitioner.

Solution 5:[5]

The Combiner is a map-side reducer: everything the Combiner does, the Reducer could also perform. The main use of the Combiner is to tune/optimize performance. After the Combiner optimizes the output, the Partitioner separates it and helps produce multiple outputs. The Combiner is optional, but highly recommended for large files.

The Partitioner divides the data according to the number of reducers and, depending on the requirements, divides the output. For instance: to get separate male and female outputs, you use a partitioner to split the data into 2 outputs.

First the Combiner runs, then the Partitioner; both run on the map side only, not on the reducer side.
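That male/female split can be sketched as a two-way partitioner in plain Python (names and record layout are invented for the example):

```python
def gender_partition(record, num_reducers=2):
    # Route by the gender field: partition 0 -> male, 1 -> female.
    return 0 if record["gender"] == "M" else 1

records = [
    {"gender": "M", "salary": 10},
    {"gender": "F", "salary": 12},
    {"gender": "M", "salary": 8},
]
outputs = {0: [], 1: []}
for rec in records:
    outputs[gender_partition(rec)].append(rec)
# outputs[0] holds the male records, outputs[1] the female records.
```

Each partition then feeds one reducer, so the job produces one output file per gender.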

Solution 6:[6]

In "Hadoop: The Definitive Guide", 3rd edition, page 209, we have these words:

Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.

Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record, there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file. The configuration property io.sort.factor controls the maximum number of streams to merge at once; the default is 10.

If there are at least three spill files (set by the min.num.spills.for.combine property), the combiner is run again before the output file is written. Recall that combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spills, the potential reduction in map output size is not worth the overhead of invoking the combiner, so it is not run again for this map output.

So the combiner is also run while merging the spill files.

So it seems the answer is:

Map -> Partitioner -> Sort -> Combiner -> Spill -> Combiner(if spills>=3) -> Merge.

However, the Apache Tutorial says:

The Mapper outputs are sorted and then partitioned per Reducer.

This differs from the description in The Definitive Guide. The answer here seems to be:

Map -> Sort -> Combiner -> Partitioner -> Spill -> Combiner(if spills>=3) -> Merge.

Which one is correct? I lean toward the latter one, from the Apache Tutorial, but I am not quite sure.
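Either way, the quoted passage relies on the property that a combiner may be run repeatedly without changing the final result. That property is easy to check in Python for a sum-style combiner (a sketch; the function name is invented here):

```python
from collections import Counter

def combine(pairs):
    # Sum values per key. Summing is associative and commutative,
    # so it can be applied any number of times without changing
    # the final result.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

pairs = [("hello", 1), ("world", 1), ("hello", 1), ("hello", 1)]
once = combine(pairs)
twice = combine(combine(pairs))  # e.g. a second pass over merged spills
assert sorted(once) == sorted(twice)
```

An averaging combiner would fail this check, which is why averages must be carried through the shuffle as (sum, count) pairs instead.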

Solution 7:[7]

The Combiner does not change the key/value pair types of the map task's output. It combines values that share the same key and emits the same key with the combined list of values.

The Partitioner takes the input from the map (or the combiner, if one exists) and segments the data, deciding which reducer each key is sent to.

So: Map --> Combine --> Partition --> Reduce.

Solution 8:[8]

Consider the following case,

Data: X = [gender(range:M/F), age (range:18-60 yrs), salary(range:3-50 LPA)];
Task: Y = (find combined total salary for each gender(M/F) and for each generation(age based genX, genY, genZ))

Then,

  1. MAPPER

    The Mapper's output will be

    mapperOutput = [key:<gender>, value:<age, salary>]

    Extends: Mapper class

  2. PARTITIONER

    Logic for partitioning:

    public int getGen(String dob) {
        // Map a date of birth to a generation / partition number:
        //   1965-1980 -> genX (0)
        //   1981-1994 -> genY (1)
        //   1995-now  -> genZ (2)
        int year = Integer.parseInt(dob.substring(0, 4)); // assumes "yyyy-..." format
        if (year <= 1980) return 0;
        if (year <= 1994) return 1;
        return 2;
    }
    

    Partitioner will create 3 partitions on the input data and each partition's output will be

    partitionerOutput1 = [key:<gender>, value:<age, salary>]
    partitionerOutput2 = [key:<gender>, value:<age, salary>]
    partitionerOutput3 = [key:<gender>, value:<age, salary>]
    

    Code wise: it will make numReduceTasks = 3

    Extends: Partitioner class

  3. COMBINER

    It is also known as mini-reducer

    It can only be used for

    - Commutative tasks : A + B = B + A
    - Associative tasks : (A + B) + C = A + (B + C)
    

    Since our task is a sum (like the word-count task), we can use it. But note that the same can't be said for an average.

    On various input spills for each partition, combiner ops are performed to reduce the spillage on disk.

    Combiner's output will be

    combinerOutput1 = [key:<gender>, value:<total_salary_for_InputSpills>] 
    combinerOutput2 = [key:<gender>, value:<total_salary_for_InputSpills>] 
    combinerOutput3 = [key:<gender>, value:<total_salary_for_InputSpills>] 
    

    Note:

  • we may not keep the age (it's a choice now)

  • the input spills are combined to decrease the count/load on each partition (say, if earlier there were 20 spills for each partition, afterwards there will only be 10 spills for each partition)

    Extends: Reducer class

  4. REDUCER

    After the crunching done above by the combiner, the reducer finally completes the task by reducing the input spills from 10 to 1 (for each partition).

    Reducer's output will be

    reducerOutput1 = [key:<gender>, value:<total_salary_for_partition>] 
    reducerOutput2 = [key:<gender>, value:<total_salary_for_partition>] 
    reducerOutput3 = [key:<gender>, value:<total_salary_for_partition>] 
    

    To confirm this, open HDFS and check the output folder:

    . outputFolder
    |-------------part-r-00000
    |-------------part-r-00001
    |-------------part-r-00002
    

    Each of which will have

    Male, totalSalaryForGen<?>
    Female, totalSalaryForGen<?>
    

    Extends: Reducer class

Summary: if the combiner were to precede the partitioner, it would have combined the salaries of employees irrespective of their generation, and that would have distorted the task's goal.
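Solution 8's flow can be simulated end to end in plain Python (a sketch, not Hadoop code; the data, cutoff years, and all names are invented for the example):

```python
from collections import defaultdict

def generation_partition(year_of_birth):
    # genX -> 0, genY -> 1, genZ -> 2
    if year_of_birth <= 1980:
        return 0
    if year_of_birth <= 1994:
        return 1
    return 2

# Mapper output: (gender, (year_of_birth, salary))
mapped = [
    ("M", (1970, 10)), ("F", (1985, 20)), ("M", (1990, 15)),
    ("F", (2000, 12)), ("M", (1999, 9)),
]

# Partition by generation, so each reducer sees one generation only.
partitions = defaultdict(list)
for gender, (year, salary) in mapped:
    partitions[generation_partition(year)].append((gender, salary))

# Combine/reduce within each partition: total salary per gender.
output = {}
for part, pairs in partitions.items():
    totals = defaultdict(int)
    for gender, salary in pairs:
        totals[gender] += salary
    output[part] = dict(totals)
# Had we combined before partitioning, salaries of different
# generations would have been merged under one gender key.
```

Each entry of `output` corresponds to one `part-r-*` file: per-gender salary totals for a single generation.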

Solution 9:[9]

Mapper -> Combiner -> Partitioner -> Reducer

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Community
Solution 2 Mike Song
Solution 3 Vladimir
Solution 4 michaeltang
Solution 5 Venu A Positive
Solution 6
Solution 7 user3423890
Solution 8
Solution 9