'How does the CAP Theorem apply on HDFS?
I just started reading about Hadoop and came across the CAP Theorem. Can you please throw some light on which two components of CAP would be applicable to a HDFS system?
Solution 1:[1]
Argument for Consistency
The document very clearly says: "The consistency model of a Hadoop FileSystem is one-copy-update-semantics; that of a traditional local POSIX filesystem."
(One-copy update semantics means the file contents seen by all of the processes accessing or updating a given file would see as if only a single copy of the file existed.)
Moving forward, the document says:
- "Create. Once the close() operation on an output stream writing a newly created file has completed, in-cluster operations querying the file metadata and contents MUST immediately see the file and its data."
- "Update. Once the close() operation on an output stream writing a newly created file has completed, in-cluster operations querying the file metadata and contents MUST immediately see the new data.
- "Delete. once a delete() operation on a path other than “/” has completed successfully, it MUST NOT be visible or accessible. Specifically, listStatus(), open() ,rename() and append() operations MUST fail."
The above mentioned characteristics point towards the presence of "Consistency" in the HDFS.
Source: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/filesystem/introduction.html
Argument for Partition Tolerance
HDFS provides High Availability for both Name Nodes and Data Nodes.
Argument for Lack of Availability
It is very clearly mentioned in the documentation(under the section "Operations and failures"):
"The time to complete an operation is undefined and may depend on the implementation and on the state of the system."
This indicates that the "Availability" in the context of CAP is missing in HDFS.
Source: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/filesystem/introduction.html
Given the above mentioned arguments, I believe HDFS supports "Consistency and Partition Tolerance" and not "Availability" in the context of CAP theorem.
Solution 2:[2]
- C – Consistency (All nodes see the data in homogeneous form i.e. every node has the same knowledge of data at any instant of time)
- A – Availability (A guarantee that every request receives a response which may be processed or failed)
- P – Partition Tolerance (The system continues to operate even if a message is lost or part of the system fails)
Talking about Hadoop , it supports the Availability and Partition Tolerance property. The Consistency property is not supported because only namenode has the information of where the replicas are placed. This information is not available with each and every node of the cluster.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Shariq Ehsan |
Solution 2 | Priyanka Arivalagan |