Why does Hadoop choose MapReduce as its computing engine?
I know MapReduce (MR) is one of the three core frameworks of Hadoop, and I am familiar with its mapper-shuffle-reducer process.
My question has two parts:
1) What makes MR uniquely suited to Hadoop, and why not another computing model?
2) How does the computing part of other languages (e.g. shell, Python) work here? Is their computing procedure similar to MR?
Solution 1:[1]
"Divide and conquer" is a very powerful approach for handling datasets. MapReduce provides a way to read large amounts of data, but distributes the workload in an scalable fashion. Often, even unstructured data has a way to separate out individual "records" from a raw file, and Hadoop (or its compatible filesystems like S3) offer ways to store large quantities of data in raw formats rather than in a database. Hadoop is just a process managing a bunch of disks and files. YARN (or Mesos, or other scheduler) is what allows you to deploy and run individual "units of work" that go gather and process data.
MapReduce is not the only execution framework, though. Tez and Spark interact with the InputFormat and OutputFormat APIs of MapReduce to read and write data. Both offer optimizations over MapReduce: tasks are defined as a computation graph (a DAG) rather than as "map this input, reduce that result, map the output of that reducer to another reducer...". Even higher-level languages like Hive or Pig are easier to work with than pure MapReduce functions. All of these are extensions to Hadoop, though, and deserve to be standalone projects rather than part of Hadoop core. A sketch of the DAG style is shown below.
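As an illustration of the DAG style, here is a hedged PySpark sketch of the same word count. The chained transformations only build a plan (the DAG); nothing executes until an action such as collect() runs. The HDFS input path and application name are placeholders.

```python
from pyspark.sql import SparkSession

# Placeholders: app name and input path are assumptions for this sketch.
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("hdfs:///tmp/input.txt")      # read records from HDFS
      .flatMap(lambda line: line.split())     # "map": emit words
      .map(lambda word: (word, 1))            # key-value pairs
      .reduceByKey(lambda a, b: a + b)        # shuffle + reduce by key
)

print(counts.collect())                       # action: triggers the whole DAG
spark.stop()
```

Because the whole pipeline is known up front, the engine can fuse stages and avoid writing intermediate results to disk between every map and reduce, which is the main optimization over classic MapReduce.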
I would argue most Hadoop users are not concerned about why Hadoop uses MapReduce; they just accept that its API is how you process files on HDFS, even if that is abstracted behind other libraries like Spark.
Shell, Python, Ruby, JavaScript, etc. can all be run via the Hadoop Streaming API, as long as all YARN nodes have the required runtime dependencies. Streaming operates similarly to MapReduce: data is processed by a mapper, keys are shuffled into a reducer, and processed again. One limitation is that I do not believe custom data formats can be processed this way; for example, you might not be able to read Parquet data in shell using Hadoop Streaming. Data must exist as newline-delimited records, read via standard input and written to standard output. A sketch follows below.
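For illustration, here is a hedged Hadoop Streaming sketch in Python. The file names mapper.py and reducer.py are hypothetical, and the exact streaming jar path and input/output locations vary by distribution, so treat this as a shape of the approach rather than a ready-to-run job.

```python
#!/usr/bin/env python3
# mapper.py (hypothetical name) -- reads newline-delimited text from stdin
# and emits tab-separated "word<TAB>1" pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py (hypothetical name) -- stdin arrives sorted by key (the shuffle),
# so counts can be accumulated per word and flushed when the key changes.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The job would then be submitted with the Hadoop Streaming jar from your distribution, passing the scripts via -files and wiring them up with -mapper, -reducer, -input, and -output. Note how neither script knows anything about HDFS or the cluster; they only see standard input and standard output, which is exactly the constraint described above.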
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |