How can Parquet columns be skipped when reading from HDFS?
We all know Parquet is column-oriented, so we can read only the columns we need and reduce I/O.
But what if the Parquet file is stored in HDFS? Do we have to download the entire file first and then apply the column filter locally?
For example, if we use Spark to read a Parquet column from HDFS/Hive:
spark.sql("select name from wide_table")
Do we still have to download the entire Parquet file? Or is there a way to filter the columns before the network transfer happens?
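Here is a minimal sketch of the scenario, assuming a hypothetical HDFS path `hdfs:///warehouse/wide_table` and a column called `name`; only the column selection matters here:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("column-pruning-demo").getOrCreate()

# Read the Parquet data directly from HDFS (the path is illustrative only).
df = spark.read.parquet("hdfs:///warehouse/wide_table")

# Selecting a single column lets Spark prune the read schema down to "name",
# so the Parquet reader only needs that column's chunks rather than the whole row.
df.select("name").show(5)
```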
Solution 1:
Spark SQL's "predicate pushdown" feature tries to use your column filters to reduce the amount of information that Spark processes. Technically the entire HDFS block is still read into memory, but smart logic is applied so that only the relevant results are returned. This is normally called out in the physical plan; you can check whether the feature is being used by calling .explain() on your query. (Not all versions of HDFS support this.)
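A short sketch of checking this with .explain(), assuming the Hive table `wide_table` from the question exists; the exact plan text varies by Spark version:

```python
from pyspark.sql import SparkSession

# Hive support is enabled because the question reads a Hive-backed table;
# "wide_table" is the example table name from the question and may not exist
# in your environment.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

query = spark.sql("select name from wide_table")

# Print the physical plan. In the Parquet scan node, a pruned ReadSchema such
# as struct<name:string> shows that only the "name" column will be read, and
# any PushedFilters entry shows predicate pushdown being applied.
query.explain()
```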
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Matt Andruff |