My data is stored in S3 (Parquet format) under different paths, and I'm using spark.read.parquet(pathes:_*) to read all the paths into one DataFrame. Un
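A minimal sketch of the multi-path read described above, assuming an existing SparkSession named spark; the bucket and prefixes are placeholders, not taken from the question.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("multi-path-read").getOrCreate()

// Each entry is one S3 prefix holding Parquet files (placeholder locations).
val pathes: Seq[String] = Seq(
  "s3a://my-bucket/events/2023/01/",
  "s3a://my-bucket/events/2023/02/"
)

// Passing the paths as varargs reads them all into a single DataFrame.
val df: DataFrame = spark.read.parquet(pathes: _*)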
When I try to write the DataFrame to S3 as Parquet, I always get an error like the one below. In the S3 bucket, an empty folder is generated automatically every time, b
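The error text above is cut off, so this only sketches the write call itself, assuming the DataFrame df from the previous read; the output prefix and save mode are placeholders.

// Write the DataFrame back to S3 as Parquet under a placeholder prefix.
df.write
  .mode("overwrite")
  .parquet("s3a://my-bucket/output/events_parquet/")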
I have a unit test for Databricks code, and I want to run it locally on Windows. Unfortunately, when I run pytest with PyCharm, it throws the following exception: Exc
I've got a Dataproc cluster configured this way: { "worker_config": { "num_instances": 20 }, "secondary_worker_config": { "
I have installed Hadoop on my MacBook M1 2020 with macOS Monterey 12.3.1. I am able to successfully use hadoop and hdfs commands on my laptop. I started using h
I am trying to set up distributed HBase on 3 nodes. I have already set up Hadoop, YARN, ZooKeeper, and now HBase, but when I launch the hbase shell and run the simples
I am trying to get started with Spark. I have Hadoop (3.3.1) and Spark (3.2.2) in my library. I have set SPARK_HOME, PATH, HADOOP_HOME, and LD_LIBRARY_PATH to thei
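A minimal smoke test, assuming Spark 3.2.x is on the classpath (spark-shell or an sbt project); the object and app names are placeholders. If the count prints, the environment variables above are at least being picked up.

import org.apache.spark.sql.SparkSession

object SparkSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("smoke-test")
      .master("local[*]")   // local mode, no cluster needed
      .getOrCreate()

    // A trivial job to confirm Spark starts and runs.
    println(spark.range(1, 100).count())
    spark.stop()
  }
}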
I am trying to extract the password from a jceks file in HDFS. import org.apache.hadoop.security.alias.CredentialProviderFactory val conf = new org.apache.hadoo
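A hedged completion of the snippet above; the jceks location and alias name are placeholders. Configuration.getPassword resolves the alias through the provider set in hadoop.security.credential.provider.path.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.alias.CredentialProviderFactory

val conf = new Configuration()
conf.set(
  CredentialProviderFactory.CREDENTIAL_PROVIDER_PATH,  // "hadoop.security.credential.provider.path"
  "jceks://hdfs/user/me/creds.jceks"                    // placeholder jceks location
)

// Returns the stored secret as a char array (null if the alias is missing).
val passwordChars: Array[Char] = conf.getPassword("my.password.alias")
val password: String = if (passwordChars != null) new String(passwordChars) else ""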
I'm trying to import MongoDB data into Hive. The JAR versions that I have used are: ADD JAR /root/HDL/mongo-java-driver-3.4.2.jar; ADD JAR /root/HDL/mongo-hado
I have Yarn (package manager) already installed on my machine, but I now have to install Apache Hadoop. When I tried doing that with brew install hadoop, I got
I am trying to export HBase table data (size: 23 TB) to S3, using HBase Export and passing the S3 credentials via a jceks path. Command: hbase org.apache.hadoop
I am running a SQL query in Hive, and it gives different results with CBO enabled and disabled. The results are wrong when CBO is enabled (set hive.cbo.enable=
Why is Spark faster than Hadoop MapReduce? As per my understanding, if Spark is faster due to in-memory processing, then Hadoop also loads data into RAM, then i
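A small sketch of where in-memory processing pays off: the same intermediate result is reused across several actions. With persist(), Spark keeps it in memory after the first pass, whereas a chain of MapReduce jobs writes intermediate output to disk and re-reads it. The SparkSession spark and the input path are assumptions.

import org.apache.spark.storage.StorageLevel

val errors = spark.read.textFile("hdfs:///data/logs/")  // placeholder path
  .filter(_.contains("ERROR"))
  .persist(StorageLevel.MEMORY_AND_DISK)

// Both actions reuse the cached result instead of rescanning the input.
val errorCount = errors.count()
val sampleLines = errors.take(10)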
I am trying to create an external table using Hive with Hadoop, but somehow it failed. These are the errors I get when I try to run my queries: 02:23:29.516 [Hive
I'm new to Hadoop. I have to use MapReduce with WordCount, and I am getting some errors. I am running a 50 GB MapReduce job on a single server (8 GB, 8 cores). It
I am getting an error while installing Spark on Google Colab. It says: tar: spark-2.2.1-bin-hadoop2.7.tgz: Cannot open: No such file or directory tar: Error
I have been working with Hive and found something peculiar. Basically, when using DOUBLE as the datatype for a column, we need not have any precision specified (
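A small sketch of the contrast, run through spark.sql only for convenience and assuming a Hive-enabled SparkSession; the table name is a placeholder. DOUBLE is declared without precision, while DECIMAL takes explicit precision and scale.

spark.sql(
  """CREATE TABLE IF NOT EXISTS readings (
    |  sensor_id   STRING,
    |  raw_value   DOUBLE,          -- no (precision, scale) parameters here
    |  calibrated  DECIMAL(10, 4)   -- precision and scale are explicit
    |)""".stripMargin)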
I have real-time time-series sensor data. My primary goal is to keep the raw data, and I need to do this so that the cost of storage is minimal. My scenario is like th
I am using CentOS 7 and Hadoop 3.2.1. I have created a new user in Linux and copied the .bash_profile file from the master user to my new user. But when I try run
Ideally, when we run an incremental import without a merge-key, it will create a new file with the appended data set, but if we use a merge-key, then it will create a whole new data