Category "bigdata"

How can I speed up a foreach that searches and then updates more than 100k records?

I have a console app for a background job. The app works like this: get the location data from the database, from a table we can call table A (about 100k rows), and place it into a variable…
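The general fix, regardless of language, is to stop scanning the whole list on every update and index the rows by their key once. A minimal Python sketch of the technique (the question is about a C# console app; the row and key names here are assumptions for illustration):

```python
# Build a dictionary index once, then every lookup/update is O(1)
# instead of an O(n) scan inside the loop (O(n^2) overall).

rows_a = [
    {"location_id": 1, "value": 10},
    {"location_id": 2, "value": 20},
    # ... ~100k rows loaded from table A
]

by_location = {row["location_id"]: row for row in rows_a}  # O(n), once

updates = [{"location_id": 2, "value": 99}]

for upd in updates:
    row = by_location.get(upd["location_id"])  # constant-time lookup
    if row is not None:
        row["value"] = upd["value"]
```

The same idea in C# is a Dictionary keyed on the search column; batching the final database writes usually helps as much as the in-memory lookup.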

Recalculate historical data using Apache Beam

I have an Apache Beam streaming project that calculates data and writes it to the database. What is the best way to reprocess all historical records after a bug fix?
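A common approach is to keep the business logic in one reusable transform and run it in a separate, one-off batch pipeline over an export of the historical records. A sketch, assuming the history can be read from files (the paths and the recalculate function are hypothetical; the real job would write to its existing database sink instead of text files):

```python
import apache_beam as beam

def recalculate(record):
    # Stand-in for the same logic the streaming pipeline applies.
    return record

with beam.Pipeline() as p:
    (p
     | "ReadHistory" >> beam.io.ReadFromText("gs://my-bucket/history/*.json")
     | "Recalculate" >> beam.Map(recalculate)
     | "WriteBack" >> beam.io.WriteToText("gs://my-bucket/recomputed/out"))
```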

I'm trying to use either a SUMIF or a CASE clause to sum up the values in a data set

I have the total amount expected to be saved, the total amount saved, the principal amount expected to be saved, and the principal amount saved. Now I'm trying to…
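In SQL the usual shape is SUM(CASE WHEN ... THEN ... ELSE 0 END). A sketch of the same conditional sum in pandas, with column names that are assumptions based on the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "type":     ["principal", "principal", "interest"],
    "expected": [100.0, 200.0, 50.0],
    "saved":    [90.0, 200.0, 40.0],
})

# SUMIF-style: sum "saved" only where type == "principal".
principal_saved = df.loc[df["type"] == "principal", "saved"].sum()

# CASE-style equivalent, mirroring
# SUM(CASE WHEN type = 'principal' THEN saved ELSE 0 END):
principal_saved_case = np.where(df["type"] == "principal", df["saved"], 0).sum()

print(principal_saved, principal_saved_case)  # 290.0 290.0
```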

How to match the unique ids that I created in df1 to df2 based on two column values?

I have two dataframes, and I am struggling to match the unique ids that I created in df1 to df2 based on 'name' and 'version' values. I need to add a column to…
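A left merge on the two key columns is the usual way to carry an id across. A minimal sketch, assuming the id column in df1 is called 'uid':

```python
import pandas as pd

df1 = pd.DataFrame({"uid": [1, 2], "name": ["a", "b"], "version": [1, 2]})
df2 = pd.DataFrame({"name": ["b", "a", "a"], "version": [2, 1, 3]})

# Left-join df2 against df1 on (name, version); rows in df2 with no
# matching pair get NaN in the new uid column.
df2 = df2.merge(df1[["uid", "name", "version"]],
                on=["name", "version"], how="left")
print(df2)
```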

HBase Shell - org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet

I am trying to set up distributed HBase on 3 nodes. I have already set up Hadoop, YARN, and ZooKeeper, and now HBase, but when I launch the HBase shell and run the simplest…

Data caching with ClickHouse

Intro: I have ClickHouse as a data warehouse (tables with billions of rows). Users interact with the DWH through my application backend, which generates SQL queries to…
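One pattern for this setup is an application-side result cache keyed on a hash of the generated SQL, with a TTL so repeated dashboard queries skip ClickHouse entirely. A sketch; run_clickhouse_query is a placeholder for however the backend actually executes queries (e.g. clickhouse-driver):

```python
import hashlib
import time

def run_clickhouse_query(sql):
    # Placeholder for the real executor (e.g. Client.execute from
    # clickhouse-driver); returns dummy data here.
    return []

CACHE = {}          # query hash -> (timestamp, result)
TTL_SECONDS = 300

def cached_query(sql):
    key = hashlib.sha256(sql.encode()).hexdigest()
    hit = CACHE.get(key)
    if hit is not None and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]                    # serve from cache
    result = run_clickhouse_query(sql)
    CACHE[key] = (time.time(), result)
    return result
```

In production this dictionary would typically be Redis or similar, so the cache is shared across backend instances.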

py4j.protocol.Py4JJavaError: An error occurred while calling o63.save. : java.lang.NoClassDefFoundError: org/apache/spark/Logging

I am new to Spark and to the big data component HBase. I am trying to write Python code in PySpark that connects to HBase and reads data from it. I'm using the following…
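org.apache.spark.Logging was removed in Spark 2.0, so this error usually means the HBase connector jar was compiled against Spark 1.x. The fix is to load a connector built for the running Spark version. A sketch; the shc-core coordinates below are one option, an assumption rather than a confirmed fit for every setup:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hbase-read")
         # Pull a connector compiled for Spark 2.x instead of the 1.x jar.
         .config("spark.jars.packages",
                 "com.hortonworks:shc-core:1.1.1-2.1-s_2.11")
         .config("spark.jars.repositories",
                 "https://repo.hortonworks.com/content/groups/public")
         .getOrCreate())
```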

Fastest way in numpy to get the distance for every pair of n points in an array

I have N points, for example: A = [2, 3], B = [3, 4], C = [3, 3], … They're stored in an array like so: arr = np.array([[2, 3], [3, 4], [3, 3]]). I need…
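Broadcasting avoids any Python-level loop: expand the array to (N, 1, 2) and (1, N, 2), subtract, and take the norm over the last axis. A sketch using the array from the question:

```python
import numpy as np

arr = np.array([[2, 3], [3, 4], [3, 3]])

# (N, 1, 2) - (1, N, 2) -> (N, N, 2); the norm over the last axis
# yields the full N x N Euclidean distance matrix.
dists = np.linalg.norm(arr[:, None, :] - arr[None, :, :], axis=-1)

# If only the unique pairs are needed, take the upper triangle:
i, j = np.triu_indices(len(arr), k=1)
pair_dists = dists[i, j]   # distances for (A,B), (A,C), (B,C)
```

For very large N, scipy.spatial.distance.pdist computes the same condensed result with less memory.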

Sqoop merge-key creates multiple part files instead of one, which defeats the purpose of using merge-key

Ideally, when we run an incremental import without merge-key, it creates a new file with the appended data set, but if we use merge-key it creates a whole new data…

Spark: DataFrame is untyped vs. DataFrame has a schema?

I am a beginner to Spark. While reading about DataFrames, I have very often found the two statements below: 1) DataFrame is untyped; 2) DataFrame has a schema…
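Both statements are true and not contradictory: a DataFrame always carries a runtime schema, but (unlike Scala's Dataset[T]) column references are not checked at compile time, which is what "untyped" refers to. A small PySpark illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("untyped-demo").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "label"])

# "Has a schema": names and types are tracked and inspectable at runtime.
df.printSchema()   # id: long, label: string

# "Untyped": a bad column name is only caught when the plan is analyzed
# at runtime; a typed Dataset[T] in Scala would reject it at compile time.
try:
    df.select("no_such_column").show()
except Exception as e:
    print(type(e).__name__)   # AnalysisException
```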

How to improve my tables and queries for Big Data applications?

I created an API with Symfony which produces more than 1 million entries per day into one of the MySQL tables. The table structure is defined this way: … After s…
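Two generic levers for a write-heavy table like this are batching the inserts and indexing exactly what the hot queries filter on. A sketch of the insert side (sqlite3 stands in for MySQL here purely so the snippet runs; the pattern is identical with mysql-connector-python or PyMySQL, and the column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events (
                    id INTEGER PRIMARY KEY,
                    created_at TEXT,
                    payload TEXT)""")
# Index the column the hot queries filter on (assumed: created_at).
conn.execute("CREATE INDEX idx_events_created ON events (created_at)")

rows = [("2024-01-01", "x")] * 1000
# One batched statement instead of 1000 individual round trips.
conn.executemany(
    "INSERT INTO events (created_at, payload) VALUES (?, ?)", rows)
conn.commit()
```

On MySQL specifically, date-based partitioning of such a table is the usual next step once indexes alone stop helping.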

Apache Spark - Is it possible to use a dependency injection mechanism?

Is there any possibility of using a framework to enable dependency injection in a Spark application? Is it possible to use Guice, for instance? If so, …
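Containers like Guice work on the driver side, but the practical obstacle is that injected objects travel inside serialized closures to the executors. The common workaround, whatever the framework, is to construct non-serializable dependencies inside mapPartitions so each executor builds its own instance. A PySpark sketch of that pattern (Enricher is a hypothetical dependency; this illustrates the principle rather than Guice itself):

```python
from pyspark.sql import SparkSession

class Enricher:
    """Hypothetical non-serializable dependency (e.g. a DB client)."""
    def enrich(self, x):
        return x * 2

def process_partition(rows):
    enricher = Enricher()   # built per partition, never shipped over the wire
    for r in rows:
        yield enricher.enrich(r)

spark = SparkSession.builder.appName("di-demo").getOrCreate()
out = spark.sparkContext.parallelize(range(10)).mapPartitions(process_partition)
print(out.collect())
```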

Azure Data Explorer: update a record

I am new to Azure Data Explorer and I am wondering how you can update a record in Azure Data Explorer using the Microsoft .NET SDK in C#. The Microsoft documentation…
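Kusto tables are append-only, so there is no UPDATE statement; the usual workaround is a soft delete of the old record followed by re-ingesting the corrected one. A sketch using the Python SDK rather than .NET (the management commands are the same either way; the cluster, database, table, and predicate are assumptions):

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.kusto.windows.net")
client = KustoClient(kcsb)

# Soft-delete the stale record (.delete is a management command).
client.execute_mgmt(
    "MyDatabase",
    ".delete table MyTable records <| MyTable | where Id == 'abc'")

# Re-ingest the corrected row (inline ingestion is fine for one-offs).
client.execute_mgmt(
    "MyDatabase",
    ".ingest inline into table MyTable <| abc,corrected-value")
```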

How to find items in one collection which are not in another collection with MongoDB

I want to query my MongoDB to perform a non-match between 2 collections. Here is my structure: CollectionA: _id, name, firstname, website_account_key, email…
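The standard aggregation trick is a $lookup into the other collection followed by keeping only documents whose match array came back empty. A pymongo sketch, assuming website_account_key is the shared key:

```python
from pymongo import MongoClient

client = MongoClient()         # assumes a local mongod
db = client["mydb"]            # database name is an assumption

pipeline = [
    {"$lookup": {
        "from": "CollectionB",
        "localField": "website_account_key",
        "foreignField": "website_account_key",
        "as": "matches",
    }},
    {"$match": {"matches": {"$size": 0}}},   # keep only the non-matches
]
for doc in db["CollectionA"].aggregate(pipeline):
    print(doc["_id"])
```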

Sklearn-GMM on large datasets

I have a large dataset (I can't fit the entire data in memory). I want to fit a GMM on this dataset. Can I use GMM.fit() (sklearn.mixture.GMM) repeatedly on mini-batches…
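sklearn.mixture.GMM has been removed; the current class is GaussianMixture, and its fit() re-estimates from scratch on every call, so naively calling it per chunk does not accumulate. With warm_start=True each fit() at least starts from the previous parameters, which is a common approximation for out-of-core data (not true online EM). A sketch with synthetic batches standing in for chunks read from disk:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# warm_start=True reuses the last fit's parameters as initialization
# for the next call, so successive mini-batches refine the model.
gmm = GaussianMixture(n_components=3, warm_start=True, max_iter=10)

rng = np.random.default_rng(0)
for _ in range(20):                      # stream of mini-batches
    batch = rng.normal(size=(1000, 2))   # stand-in for a chunk off disk
    gmm.fit(batch)                       # continues from previous params
```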