Category "bigdata"

I'm trying to use either a SUMIF or a CASE clause to sum up the values in a data set

I have the total amount expected to be saved, the total amount saved, the principal amount expected to be saved, and the principal amount saved; now I'm trying to …
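A minimal pandas sketch of the kind of conditional (SUMIF / CASE-style) aggregation being described; the column names, account labels, and values below are assumptions for illustration, not from the question:

```python
import pandas as pd

# Hypothetical frame; column names and values are invented for illustration.
df = pd.DataFrame({
    "account":            ["A", "A", "B", "B"],
    "expected_total":     [100.0, 150.0, 200.0, 50.0],
    "saved_total":        [100.0, 120.0, 180.0, 50.0],
    "expected_principal": [80.0, 120.0, 160.0, 40.0],
    "saved_principal":    [80.0, 100.0, 150.0, 40.0],
})

# SUMIF-style conditional sum: total saved only where the savings target was met.
met_target = df.loc[df["saved_total"] >= df["expected_total"], "saved_total"].sum()

# CASE-style equivalent: derive a flag column first, then aggregate per account.
df["on_track"] = (df["saved_total"] >= df["expected_total"]).astype(int)
per_account = df.groupby("account")[["saved_total", "saved_principal"]].sum()

print(met_target)
print(per_account)
```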

How to match the unique ids that I created in df1 to df2 based on two column values?

I have two dataframes, and I am struggling to match the unique ids that I created in df1 to df2 based on the 'name' and 'version' values. I need to add a column to …
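A hedged sketch of one common way to do this in pandas: merge df1's id column onto df2 on the two key columns. The frame contents and the 'unique_id' column name are invented; only 'name' and 'version' come from the question:

```python
import pandas as pd

# Hypothetical frames for illustration.
df1 = pd.DataFrame({
    "unique_id": [101, 102, 103],
    "name":      ["alpha", "beta", "gamma"],
    "version":   ["1.0", "2.1", "3.0"],
})
df2 = pd.DataFrame({
    "name":    ["beta", "alpha", "delta"],
    "version": ["2.1", "1.0", "0.9"],
})

# Left merge on the two key columns; rows of df2 with no match get NaN ids.
df2 = df2.merge(df1[["name", "version", "unique_id"]],
                on=["name", "version"], how="left")
print(df2)
```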

HBase Shell - org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet

I am trying to set up distributed HBase on 3 nodes. I have already set up Hadoop, YARN, ZooKeeper and now HBase, but when I launch hbase shell and run the simplest …

Data caching with ClickHouse

Intro: I have ClickHouse as a data warehouse (tables with billions of rows). Users interact with the DWH through my application backend, which generates SQL queries to …
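A rough sketch of result caching in the application backend, assuming the clickhouse-driver Python client and a simple TTL cache keyed on the SQL text; the connection details and TTL value are placeholders, and this is application-side caching rather than anything built into ClickHouse:

```python
import time
from clickhouse_driver import Client  # assumes the clickhouse-driver package

client = Client(host="localhost")  # connection details are placeholders

_cache = {}          # query text -> (timestamp, rows)
TTL_SECONDS = 300    # serve cached rows for 5 minutes

def run_query(sql: str):
    """Return cached rows for identical SQL while the TTL has not expired."""
    now = time.time()
    hit = _cache.get(sql)
    if hit is not None and now - hit[0] < TTL_SECONDS:
        return hit[1]
    rows = client.execute(sql)
    _cache[sql] = (now, rows)
    return rows
```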

py4j.protocol.Py4JJavaError: An error occurred while calling o63.save. : java.lang.NoClassDefFoundError: org/apache/spark/Logging

I am new to Spark and to the big-data component HBase. I am trying to write Python code in PySpark that connects to HBase and reads data from it. I'm using the following …
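For context, org.apache.spark.Logging was removed in Spark 2.0, so this NoClassDefFoundError usually indicates an HBase connector jar that was built against Spark 1.x. Below is a minimal PySpark read sketch assuming the Hortonworks shc connector; the table name, column family, and catalog contents are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hbase-read").getOrCreate()

# Hypothetical shc catalog; table, column family, and column names are assumptions.
catalog = """{
  "table": {"namespace": "default", "name": "my_table"},
  "rowkey": "key",
  "columns": {
    "key":   {"cf": "rowkey", "col": "key",   "type": "string"},
    "value": {"cf": "cf1",    "col": "value", "type": "string"}
  }
}"""

# Requires a connector jar compatible with the running Spark version.
df = (spark.read
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .options(catalog=catalog)
      .load())
df.show()
```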

Fastest way in numpy to get distance of product of n pairs in array

I have N points, for example A = [2, 3], B = [3, 4], C = [3, 3], ... and they're in an array like so: arr = np.array([[2, 3], [3, 4], [3, 3]]). I need …
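A short numpy/scipy sketch of computing all pairwise Euclidean distances for such an array, either with scipy's pdist or with pure broadcasting:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

arr = np.array([[2, 3], [3, 4], [3, 3]])

# All pairwise Euclidean distances in condensed form (one entry per pair).
d_condensed = pdist(arr)                 # length N*(N-1)/2

# The same distances with pure numpy broadcasting, as an N x N matrix.
diff = arr[:, None, :] - arr[None, :, :]
d_matrix = np.sqrt((diff ** 2).sum(-1))

print(d_condensed)
print(squareform(d_condensed))           # matches d_matrix
```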

sqoop merge-key creating multiple part files instead of one, which defeats the purpose of using merge-key

Ideally, when we run an incremental import without merge-key, it creates a new file with the appended data set, but if we use merge-key it creates a whole new data …

Spark DataFrame is Untyped vs DataFrame has schema?

I am a beginner to Spark. While reading about DataFrames, I have very often found the two statements below: 1) DataFrame is untyped; 2) DataFrame has a schema …
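The two statements are not contradictory: a DataFrame carries a runtime schema (column names and types), but, unlike a typed Dataset, column references are only checked when the query is analyzed at run time. A small PySpark illustration, with invented data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# The DataFrame has a schema at runtime...
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.printSchema()                 # id: long, name: string

# ...but it is "untyped" in the sense that a bad column reference only
# fails when the plan is analyzed at run time, not at compile time.
try:
    df.select("nonexistent")
except Exception as e:
    print(type(e).__name__)      # AnalysisException
```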

How to improve my tables and queries for Big Data applications?

I created an API in Symfony which produces more than 1 million entries per day into one of the MySQL tables. The table structure is defined this way: After s…

Apache Spark - Is it possible to use a Dependency Injection mechanism?

Is there any possibility of using a framework for enabling / using dependency injection in a Spark application? Is it possible to use Guice, for instance? If so, …

Azure Data Explorer: update a record

I am new to Azure Data Explorer and I am wondering how you can update a record in Azure Data Explorer using the Microsoft .NET SDK in C#. The Microsoft documentation …

How to find items in a collection which are not in another collection with MongoDB

I want to query my MongoDB to perform a non-match between 2 collections. Here is my structure: CollectionA: _id, name, firstname, website_account_key, emai…
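A hedged pymongo sketch of the usual anti-join pattern: $lookup from CollectionA into CollectionB, then keep only documents whose joined array is empty. The connection string, database name, and the join field on CollectionB are assumptions:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # connection string is a placeholder
db = client["mydb"]                                # database name is an assumption

# Documents of CollectionA whose website_account_key has no counterpart in CollectionB.
pipeline = [
    {"$lookup": {
        "from": "CollectionB",
        "localField": "website_account_key",
        "foreignField": "website_account_key",   # join field on B is an assumption
        "as": "matches",
    }},
    {"$match": {"matches": {"$size": 0}}},        # keep only the non-matching documents
    {"$project": {"matches": 0}},
]
for doc in db["CollectionA"].aggregate(pipeline):
    print(doc)
```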

Sklearn-GMM on large datasets

I have a large data set (I can't fit the entire data in memory). I want to fit a GMM to this data set. Can I use GMM.fit() (sklearn.mixture.GMM) repeatedly on mini-batches …
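sklearn's Gaussian mixture has no partial_fit, but GaussianMixture (the replacement for the old sklearn.mixture.GMM) accepts warm_start=True, so each fit() call starts EM from the previous parameters. A rough mini-batch sketch along those lines, with synthetic data standing in for chunks read from disk; this is only an approximation of true streaming EM:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# warm_start=True makes each fit() resume from the parameters of the last fit();
# max_iter limits how many EM iterations are spent on each mini-batch.
gmm = GaussianMixture(n_components=3, warm_start=True, max_iter=10)

rng = np.random.default_rng(0)
for _ in range(20):                      # stream of mini-batches (synthetic here)
    batch = rng.normal(size=(1_000, 5))  # replace with chunks loaded from disk
    gmm.fit(batch)

print(gmm.means_.shape)                  # (3, 5)
```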