How/Where can I write time series data? As Parquet format to Hadoop, or to HBase or Cassandra?

I have real-time time series sensor data. My primary goal is to keep the raw data, and I need to do this in a way that keeps storage cost minimal.

My scenario is like this:

All sensors produce time series data, and I must save this raw time series data for batch analysis. The Parquet format is great for reducing storage cost. But does it make sense to write each incoming time series data point as a Parquet file?

On the other hand, I want to process each incoming data point in real time. For the real-time scenario I can use Kafka. But can HBase or Cassandra be used for both batch and real-time analysis instead of Kafka?

If I use Cassandra, how can I do batch analysis?



Solution 1:[1]

But can HBase or Cassandra be used for both batch and real-time analysis instead of Kafka?

Think of Kafka as a pipe into these stores; it is not a replacement you would use "instead of" them. HBase and Cassandra are stores, and you need to "batch" the data out of them... You would use Kafka Streams (or Spark, Flink, or my personal favorite, NiFi) for the actual (near) real-time processing before the data lands in these systems.

I would suggest using Kafka rather than sending point-to-point metrics into Hadoop (or related tools). I would also encourage using something built for this kind of data, such as TimescaleDB, CrateDB, or InfluxDB, or maybe Prometheus with some modification to the infrastructure... You could use Kafka for ingestion into both Hadoop and these other tools that are better tuned to store such datasets (which is the benefit of "buffering" the data in Kafka first).
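As a concrete illustration of the "pipe" idea, here is a minimal sketch of a producer that buffers raw sensor readings in Kafka so that HDFS, Cassandra, or a time series database can consume them downstream. The broker address, topic name, and message layout are assumptions for the example, not anything prescribed by this answer.

```python
# Minimal sketch (assumptions: a local Kafka broker and a topic named
# "sensor-readings"; the message layout is illustrative only).
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_reading(sensor_id: str, value: float) -> None:
    """Buffer one raw sensor reading in Kafka instead of writing it point-to-point."""
    reading = {"sensor_id": sensor_id, "value": value, "ts": time.time()}
    producer.send("sensor-readings", value=reading)

publish_reading("sensor-42", 21.7)
producer.flush()  # block until the reading has actually reached the broker
```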

does it make sense to write each incoming time series data point as a Parquet file?

Sure, if you want to store lots of data for large batch analysis. But if you window your stream into hourly data points and compute sums and averages, for example, do you really need to store each and every data point?
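To make that trade-off concrete, here is a minimal sketch of hourly windowing: instead of persisting every raw point, a consumer keeps only running sums and averages per sensor and hour. The topic name and message layout carry over from the producer sketch above and remain assumptions; late or out-of-order records are not handled.

```python
# Minimal sketch of hourly windowing in a plain consumer (assumptions: JSON
# readings on the "sensor-readings" topic; state is kept in memory only and
# late or out-of-order records are ignored).
import json
from collections import defaultdict
from datetime import datetime, timezone

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

windows = defaultdict(lambda: [0.0, 0])  # (sensor_id, hour) -> [sum, count]

for message in consumer:
    reading = message.value
    hour = datetime.fromtimestamp(reading["ts"], tz=timezone.utc).strftime("%Y-%m-%d %H:00")
    acc = windows[(reading["sensor_id"], hour)]
    acc[0] += reading["value"]
    acc[1] += 1
    print(reading["sensor_id"], hour, "avg =", acc[0] / acc[1])
```

A dedicated stream processor (Kafka Streams, Flink, Spark) would give you the same windowing with fault tolerance and proper handling of late data.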

If I use Cassandra, how can I do batch analysis?

Well, I would hope the same way you currently do it. Schedule a query against the database? Hope all the data is there (no late-arriving records)?
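For illustration, a scheduled batch query against Cassandra could look like the sketch below, using the DataStax Python driver. The keyspace, table, and schema are assumptions, and something like cron or Airflow would trigger the script on a schedule.

```python
# Minimal sketch of a scheduled batch query against Cassandra (assumptions:
# a keyspace "sensors" with a table "raw_readings" whose partition key is
# (sensor_id, day); names and schema are illustrative only).
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("sensors")

rows = session.execute(
    "SELECT sensor_id, AVG(value) AS avg_value "
    "FROM raw_readings WHERE sensor_id = %s AND day = %s",
    ("sensor-42", "2024-01-01"),
)
for row in rows:
    print(row.sensor_id, row.avg_value)

cluster.shutdown()
```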

Solution 2:[2]

I have real-time time series sensor data. My primary goal is to keep the raw data, and I need to do this in a way that keeps storage cost minimal.

If your requirement is storing the raw data, you can write it to HDFS in compressed form. Using the Parquet format might not be feasible here; formats can change. If you have the incoming data in Kafka, you can use Kafka Connect to write to HDFS in batches from a topic.
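For example, landing a topic in HDFS with Kafka Connect amounts to posting a sink-connector configuration to the Connect REST API. The sketch below assumes the Confluent HDFS sink connector is installed and that Connect listens on localhost:8083; the topic name, HDFS URL, and flush size are placeholders.

```python
# Minimal sketch of registering an HDFS sink via the Kafka Connect REST API
# (assumptions: Connect on localhost:8083 with the Confluent HDFS sink
# connector installed; topic, HDFS URL, and flush size are illustrative).
import json

import requests  # pip install requests

connector = {
    "name": "sensor-hdfs-sink",
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "tasks.max": "1",
        "topics": "sensor-readings",
        "hdfs.url": "hdfs://namenode:8020",
        "flush.size": "10000",  # one output file per 10,000 records, not per record
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```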

All sensors produce time series data, and I must save this raw time series data for batch analysis. The Parquet format is great for reducing storage cost. But does it make sense to write each incoming time series data point as a Parquet file?

I am not sure if I understand correctly, but it does not make any sense to store each data point in a separate Parquet file (see the sketch after this list):

  1. The Parquet format has overhead compared to raw data.
  2. Parquet is specifically designed for table-like data with many rows, so that filtering on that data is fast (with local access).
  3. Batch processing frameworks and filesystems are, most of the time, really unhappy about lots of small files.
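To avoid the small-files problem, the usual approach is to buffer many data points and write each batch as one Parquet file. Here is a minimal sketch with pyarrow; the batch size, file naming, and reading layout are assumptions for illustration.

```python
# Minimal sketch of batching data points before writing Parquet (assumptions:
# readings are dicts as in the earlier sketches; one file is written per
# batch of 10,000 records instead of one file per record).
import pyarrow as pa
import pyarrow.parquet as pq

BATCH_SIZE = 10_000
buffer = []

def add_reading(reading: dict) -> None:
    """Accumulate readings and flush whole batches as single Parquet files."""
    buffer.append(reading)
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush() -> None:
    if not buffer:
        return
    table = pa.Table.from_pylist(buffer)  # many rows -> one columnar table
    pq.write_table(table, f"readings-{int(buffer[0]['ts'])}.parquet", compression="snappy")
    buffer.clear()
```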

On the other hand, I want to process each incoming data point in real time. For the real-time scenario I can use Kafka. But can HBase or Cassandra be used for both batch and real-time analysis instead of Kafka?

Depending on your use case, it might be enough for batch processing to use Hive or Spark SQL on the raw data. Maybe a Kafka Streams processor is enough for your real-time requirements.
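As an example of the batch side, here is a minimal Spark SQL sketch over raw files already landed in HDFS; the path, column names, and aggregation are assumptions chosen only to show the pattern.

```python
# Minimal sketch of batch analysis with Spark SQL over files in HDFS
# (assumptions: data was landed as Parquet under /data/sensor-readings;
# path and column names are illustrative only).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sensor-batch").getOrCreate()

readings = spark.read.parquet("hdfs://namenode:8020/data/sensor-readings")
readings.createOrReplaceTempView("readings")

hourly = spark.sql("""
    SELECT sensor_id,
           date_trunc('hour', CAST(ts AS TIMESTAMP)) AS hour,
           avg(value) AS avg_value
    FROM readings
    GROUP BY sensor_id, date_trunc('hour', CAST(ts AS TIMESTAMP))
""")
hourly.show()
spark.stop()
```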

So many options. It all depends on the use cases...

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Christoph Bauer