Category "bigdata"

I'm trying to use either a SUMIF or a CASE clause to sum up the values in a data set

I have the total amount expected to be saved, the total amount saved, the principal amount expected to be saved, and the principal amount saved; now I'm trying to …
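A minimal pandas sketch of the kind of conditional (SUMIF / CASE-style) aggregation being described; the column names, account labels, and values below are assumptions for illustration, not from the question:

```python
import pandas as pd

# Hypothetical frame; column names and values are invented for illustration.
df = pd.DataFrame({
    "account":            ["A", "A", "B", "B"],
    "expected_total":     [100.0, 150.0, 200.0, 50.0],
    "saved_total":        [100.0, 120.0, 180.0, 50.0],
    "expected_principal": [80.0, 120.0, 160.0, 40.0],
    "saved_principal":    [80.0, 100.0, 150.0, 40.0],
})

# SUMIF-style conditional sum: total saved only where the savings target was met.
met_target = df.loc[df["saved_total"] >= df["expected_total"], "saved_total"].sum()

# CASE-style equivalent: derive a flag column first, then aggregate per account.
df["on_track"] = (df["saved_total"] >= df["expected_total"]).astype(int)
per_account = df.groupby("account")[["saved_total", "saved_principal"]].sum()

print(met_target)
print(per_account)
```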

How to match the unique ids that I created in df1 to df2 based on two column values?

I have two dataframes, and I am struggling to match the unique ids that I created in df1 to df2 based on the 'name' and 'version' values. I need to add a column to …
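A hedged sketch of one common way to do this in pandas: merge df1's id column onto df2 on the two key columns. The frame contents and the 'unique_id' column name are invented; only 'name' and 'version' come from the question:

```python
import pandas as pd

# Hypothetical frames for illustration.
df1 = pd.DataFrame({
    "unique_id": [101, 102, 103],
    "name":      ["alpha", "beta", "gamma"],
    "version":   ["1.0", "2.1", "3.0"],
})
df2 = pd.DataFrame({
    "name":    ["beta", "alpha", "delta"],
    "version": ["2.1", "1.0", "0.9"],
})

# Left merge on the two key columns; rows of df2 with no match get NaN ids.
df2 = df2.merge(df1[["name", "version", "unique_id"]],
                on=["name", "version"], how="left")
print(df2)
```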

HBase Shell - org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet

I am trying to set up distributed HBase on 3 nodes. I have already set up Hadoop, YARN, ZooKeeper and now HBase, but when I launch hbase shell and run the simplest …

Data caching with ClickHouse

Intro: I have ClickHouse as a data warehouse (tables with billions of rows). Users interact with the DWH through my application backend, which generates SQL queries to …
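A rough sketch of result caching in the application backend, assuming the clickhouse-driver Python client and a simple TTL cache keyed on the SQL text; the connection details and TTL value are placeholders, and this is application-side caching rather than anything built into ClickHouse:

```python
import time
from clickhouse_driver import Client  # assumes the clickhouse-driver package

client = Client(host="localhost")  # connection details are placeholders

_cache = {}          # query text -> (timestamp, rows)
TTL_SECONDS = 300    # serve cached rows for 5 minutes

def run_query(sql: str):
    """Return cached rows for identical SQL while the TTL has not expired."""
    now = time.time()
    hit = _cache.get(sql)
    if hit is not None and now - hit[0] < TTL_SECONDS:
        return hit[1]
    rows = client.execute(sql)
    _cache[sql] = (now, rows)
    return rows
```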

py4j.protocol.Py4JJavaError: An error occurred while calling o63.save. : java.lang.NoClassDefFoundError: org/apache/spark/Logging

I am new to Spark and to the big-data component HBase. I am trying to write Python code in PySpark that connects to HBase and reads data from it. I'm using the following …
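For context, org.apache.spark.Logging was removed in Spark 2.0, so this NoClassDefFoundError usually indicates an HBase connector jar that was built against Spark 1.x. Below is a minimal PySpark read sketch assuming the Hortonworks shc connector; the table name, column family, and catalog contents are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hbase-read").getOrCreate()

# Hypothetical shc catalog; table, column family, and column names are assumptions.
catalog = """{
  "table": {"namespace": "default", "name": "my_table"},
  "rowkey": "key",
  "columns": {
    "key":   {"cf": "rowkey", "col": "key",   "type": "string"},
    "value": {"cf": "cf1",    "col": "value", "type": "string"}
  }
}"""

# Requires a connector jar compatible with the running Spark version.
df = (spark.read
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .options(catalog=catalog)
      .load())
df.show()
```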

Fastest way in numpy to get distance of product of n pairs in array

I have N points, for example A = [2, 3], B = [3, 4], C = [3, 3], ... and they're in an array like so: arr = np.array([[2, 3], [3, 4], [3, 3]]). I need …
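A short numpy/scipy sketch of computing all pairwise Euclidean distances for such an array, either with scipy's pdist or with pure broadcasting:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

arr = np.array([[2, 3], [3, 4], [3, 3]])

# All pairwise Euclidean distances in condensed form (one entry per pair).
d_condensed = pdist(arr)                 # length N*(N-1)/2

# The same distances with pure numpy broadcasting, as an N x N matrix.
diff = arr[:, None, :] - arr[None, :, :]
d_matrix = np.sqrt((diff ** 2).sum(-1))

print(d_condensed)
print(squareform(d_condensed))           # matches d_matrix
```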

sqoop merge-key creating multiple part files instead of one, which defeats the purpose of using merge-key

Ideally, when we run an incremental import without merge-key, it creates a new file with the appended data set, but if we use merge-key it creates a whole new data …

Spark DataFrame is Untyped vs DataFrame has schema?

I am a beginner to Spark. While reading about DataFrames, I have very often found the two statements below: 1) DataFrame is untyped; 2) DataFrame has a schema …
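The two statements are not contradictory: a DataFrame carries a runtime schema (column names and types), but, unlike a typed Dataset, column references are only checked when the query is analyzed at run time. A small PySpark illustration, with invented data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# The DataFrame has a schema at runtime...
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.printSchema()                 # id: long, name: string

# ...but it is "untyped" in the sense that a bad column reference only
# fails when the plan is analyzed at run time, not at compile time.
try:
    df.select("nonexistent")
except Exception as e:
    print(type(e).__name__)      # AnalysisException
```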

How to improve my tables and queries for Big Data applications?

I created an API in Symfony which produces more than 1 million entries per day into one of the MySQL tables. The table structure is defined this way: After s…

Apache Spark - Is it possible to use a Dependency Injection mechanism?

Is there any possibility of using a framework for enabling / using dependency injection in a Spark application? Is it possible to use Guice, for instance? If so, …

Azure Data Explorer: update a record

I am new to Azure Data Explorer and I am wondering how you can update a record in Azure Data Explorer using the Microsoft .NET SDK in C#. The Microsoft documentation …

How to find items in a collection which are not in another collection with MongoDB

I want to query my MongoDB to perform a non-match between 2 collections. Here is my structure: CollectionA: _id, name, firstname, website_account_key, emai…
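A hedged pymongo sketch of the usual anti-join pattern: $lookup from CollectionA into CollectionB, then keep only documents whose joined array is empty. The connection string, database name, and the join field on CollectionB are assumptions:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # connection string is a placeholder
db = client["mydb"]                                # database name is an assumption

# Documents of CollectionA whose website_account_key has no counterpart in CollectionB.
pipeline = [
    {"$lookup": {
        "from": "CollectionB",
        "localField": "website_account_key",
        "foreignField": "website_account_key",   # join field on B is an assumption
        "as": "matches",
    }},
    {"$match": {"matches": {"$size": 0}}},        # keep only the non-matching documents
    {"$project": {"matches": 0}},
]
for doc in db["CollectionA"].aggregate(pipeline):
    print(doc)
```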

Sklearn-GMM on large datasets

I have a large data set (I can't fit the entire data in memory). I want to fit a GMM to this data set. Can I use GMM.fit() (sklearn.mixture.GMM) repeatedly on mini-batches …
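sklearn's Gaussian mixture has no partial_fit, but GaussianMixture (the replacement for the old sklearn.mixture.GMM) accepts warm_start=True, so each fit() call starts EM from the previous parameters. A rough mini-batch sketch along those lines, with synthetic data standing in for chunks read from disk; this is only an approximation of true streaming EM:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# warm_start=True makes each fit() resume from the parameters of the last fit();
# max_iter limits how many EM iterations are spent on each mini-batch.
gmm = GaussianMixture(n_components=3, warm_start=True, max_iter=10)

rng = np.random.default_rng(0)
for _ in range(20):                      # stream of mini-batches (synthetic here)
    batch = rng.normal(size=(1_000, 5))  # replace with chunks loaded from disk
    gmm.fit(batch)

print(gmm.means_.shape)                  # (3, 5)
```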