Category "hdfs"

How parquet columns can be skipped when reading from hdfs?

We all know parquet is column-oriented so we can get only columns we desired and reduce IO. But what if the parquet file is stored in HDFS, should we download t

impala/hive show file format

How can I have impala or hive return the file format of the underlying files on HDFS for a table? I tried: SHOW FILES database.table_name This ilst the files,

S3DistCP - Split source in multiples jobs

I have to do copy of an S3 to HDFS of an cluster EMR. I'm trying to smaller the execution time of my job. Looking in the logs the map input of the job is 1_000_

DistCP - Even simple copies result in CRC Exceptions

I'm running into an issue using distcp to copy files - every copy fails with an IO Exception (Checksum mismatch), even if performing a simple copy within the cl

HDFS Date partition directory loop

I have a HDFS Directory as below. /user/staging/app_name/2022_05_06 Under such a directory I have around 1000 part files. I want to loop each of the part file

Structured Streaming to Save JSON to HDFS

My Structured Spark Streaming program is to read JSON data from Kafka and write to HDFS in JSON format. I am able to save JSON to HDFS but it saves the JSON st

How does the CAP Theorem apply on HDFS?

I just started reading about Hadoop and came across the CAP Theorem. Can you please throw some light on which two components of CAP would be applicable to a HDF

hdfs: command not found

I am using Centos7 and Hadoop 3.2.1. I have created a new user in Linux. I copied the .bash_profile file from the master user to my new user. But when I try run

Hadoop configuration object not pointing to hdfs file system

I am trying to create small Spark program in Java. I am creating a Hadoop configuration object as show below: Configuration conf = new Configuration(false); con

Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?

Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS? If I use Spark standalone cluster manager and have my data distributed in HDFS c

Spark-submit not working when application jar is in hdfs

I'm trying to run a spark application using bin/spark-submit. When I reference my application jar inside my local filesystem, it works. However, when I copied m