Kafka vs StreamSets
I have been reading articles about Kafka and StreamSets, and my understanding is:
Kafka acts as a broker between producer systems and subscribers. Producers push data into the Kafka cluster, and subscribers pull data from it.
StreamSets is a technology for moving data from a source to a destination through a pipeline.
My questions are below; please help clarify:
What is the fundamental difference between Kafka and StreamSets? Is it that Kafka doesn't move data but StreamSets does?
If Kafka doesn't move data, what is it used for? If it does move data like ETL solutions, how is it different from SSIS, Informatica, etc.?
How is StreamSets different from SSIS, Informatica, etc.?
Solution 1:[1]
In StreamSets, most of the time we create "data pipelines". Think of a pipeline as an application that consists of multiple steps (tasks): the first step reads data from a database, Kafka, or any of a number of other data sources; the second step modifies the data; a third step might run a script; and so on. Finally, the pipeline saves the transformed data to a destination, which could be a database or cloud storage. Kafka and StreamSets can therefore work together, since StreamSets can read data from and write data to Kafka.
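To make the "pipeline of steps" idea concrete, here is a minimal sketch in plain Python. It is not StreamSets code (StreamSets pipelines are built in its web UI); the stage names `origin`, `add_full_name`, `drop_inactive` and `destination`, and the sample rows, are made up for illustration.

```python
import json

def origin(rows):
    """Origin step: read records one at a time from some source."""
    for row in rows:
        yield dict(row)  # copy so later steps do not modify the source

def add_full_name(records):
    """Processor step: derive a new field from existing ones."""
    for rec in records:
        rec["full_name"] = f"{rec['first']} {rec['last']}"
        yield rec

def drop_inactive(records):
    """Processor step: filter out records that should not be written."""
    for rec in records:
        if rec.get("active", True):
            yield rec

def destination(records, sink):
    """Destination step: write each record to the sink, return the count."""
    count = 0
    for rec in records:
        sink.append(json.dumps(rec, sort_keys=True))
        count += 1
    return count

def run_pipeline(rows, sink):
    # Steps are chained: each one consumes the output of the previous one.
    return destination(drop_inactive(add_full_name(origin(rows))), sink)

# Hypothetical input standing in for a database table.
source_rows = [
    {"first": "Ada", "last": "Lovelace", "active": True},
    {"first": "Alan", "last": "Turing", "active": False},
]
sink = []  # stands in for a database or cloud storage target
written = run_pipeline(source_rows, sink)
print(written, sink)
```

Note that nothing is stored between the steps: records flow from the origin to the destination and the pipeline itself keeps no copy.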
I think of Kafka as a place where data from multiple sources is collected and kept available to consumers for a certain time. For example, a producer (such as a Kafka Connect source connector) can poll a database table periodically and store the changes in a "topic", and another can call a web service periodically and store that data in a second topic. These topics are then available to consumers: a developer can write an application that reads from the first topic and does something with the data. Kafka keeps track of what each consumer has read by using offsets, and it offers replication and other options. This removes the need to write custom code to integrate multiple sources and destinations; instead, you configure that part.
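The topic-and-offset idea can be sketched with a toy in-memory log. This is only a model of the concept, not the Kafka client API: `ToyTopic`, `produce`, `poll` and the group names are hypothetical, and a real topic is partitioned, replicated and persisted on the brokers.

```python
from collections import defaultdict

class ToyTopic:
    """In-memory stand-in for a single-partition topic (not the Kafka API)."""

    def __init__(self):
        self._log = []                      # append-only list of records
        self._committed = defaultdict(int)  # consumer group -> next offset to read

    def produce(self, value):
        """Append a record and return the offset it was stored at."""
        self._log.append(value)
        return len(self._log) - 1

    def poll(self, group, max_records=10):
        """Return up to max_records unread records for a group and advance its offset."""
        start = self._committed[group]
        batch = self._log[start:start + max_records]
        self._committed[group] = start + len(batch)
        return batch

orders = ToyTopic()
for change in ["row 1 inserted", "row 2 inserted", "row 1 updated"]:
    orders.produce(change)

# Each consumer group has its own offset, so they read independently.
first_batch = orders.poll("billing-app", max_records=2)
second_batch = orders.poll("billing-app", max_records=2)
audit_batch = orders.poll("audit-app")

print(first_batch)   # the first two changes
print(second_batch)  # the remaining change
print(audit_batch)   # all three changes, read from offset 0
```

Reading does not remove anything from the log, which is why a second application can start later and still see the same records.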
StreamSets can read from and write to Kafka. StreamSets does not store the data in its own system while Kafka stores the data for a configurable period of time.
- SSIS is similar to StreamSets in that it is used to create pipelines/packages that consist of multiple tasks. Each task can take the data/result from the previous tasks and do something with it. Both StreamSets and SSIS can connect to many kinds of data sources and destinations.
My personal view on how StreamSets and SSIS are different is:
- StreamSets is web based, while SSIS needs Visual Studio. The StreamSets GUI is easier to use and does not require special software to be installed for each developer.
- Deploying StreamSets pipelines to production with source control was easier than SSIS packages.
- SSIS is a Microsoft product, so it integrates very well with other Microsoft products. StreamSets can be installed on any platform, which makes it ideal for the AWS cloud.
- If you want to write SSIS script tasks, you have to use C#/.NET. StreamSets script tasks can be written in Jython and JavaScript.
- SSIS is older and has tons of documentation online.
Solution 2:[2]
StreamSets is a graphical tool that contains components that allow for data movement, which happen to include Kafka producers and consumers, but you're not required to use them.
They're complementary. By using Kafka, you can allow for back-pressure in streaming systems, or have non-StreamSets producers/consumers interacting with other Kafka topics. No, Kafka doesn't move the data (except for internal replication); the clients that interact with the brokers do.
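A rough sketch of that last point, in plain Python rather than a real Kafka client: the broker below only stores and serves records, while the consumer decides when and how much to pull. `Broker`, `PullingConsumer` and `max_records` are illustrative names, not library API. Because the consumer pulls at its own pace, a fast producer just increases the consumer's lag instead of overwhelming it.

```python
class Broker:
    """Passive store: it appends and serves records but never pushes them."""

    def __init__(self):
        self.log = []

    def append(self, value):
        self.log.append(value)

    def fetch(self, offset, max_records):
        return self.log[offset:offset + max_records]

class PullingConsumer:
    """Client that moves the data by pulling batches at its own pace."""

    def __init__(self, broker, max_records):
        self.broker = broker
        self.max_records = max_records
        self.position = 0  # next offset to read

    def poll(self):
        batch = self.broker.fetch(self.position, self.max_records)
        self.position += len(batch)
        return batch

    def lag(self):
        # Records written to the log that this consumer has not read yet.
        return len(self.broker.log) - self.position

broker = Broker()
for i in range(10):          # a fast producer writes 10 events up front
    broker.append(f"event-{i}")

consumer = PullingConsumer(broker, max_records=4)
processed = []
lags = []
while consumer.lag() > 0:    # the slow consumer catches up in small batches
    processed.extend(consumer.poll())
    lags.append(consumer.lag())

print(lags)  # [6, 2, 0]
```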
I've not used Informatica or SSIS, but I'm sure that if you contacted someone at StreamSets, they could explain how they compare.
Solution 3:[3]
Thanks to all. I'd like to share some thoughts on how to distinguish Kafka from StreamSets when both are used in the same cluster.
"We use the reliability of Kafka and the simplicity of StreamSets."
- StreamSets removes the coding overhead of writing producers and consumers.
- A StreamSets pipeline is used to go from one source to one destination.
- Kafka takes data from multiple sources to multiple destinations (publish-subscribe model).
- StreamSets addresses the data drift problem.
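On the data drift point: "data drift" here means incoming records changing shape over time (fields added, removed or missing). The sketch below only illustrates the problem and one tolerant way to handle it in plain Python. It is not how StreamSets implements drift handling, and `EXPECTED_FIELDS`, `normalize` and the sample records are assumptions made up for the example.

```python
# Fields the destination expects, with a default for each.
EXPECTED_FIELDS = {"id": None, "name": "", "email": ""}

def normalize(record):
    """Split a record into known fields (with defaults) and unexpected extras."""
    known = {field: record.get(field, default)
             for field, default in EXPECTED_FIELDS.items()}
    extras = {key: value for key, value in record.items()
              if key not in EXPECTED_FIELDS}
    return known, extras

incoming = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Alan"},                              # email is missing
    {"id": 3, "name": "Grace", "email": "grace@example.com",
     "phone": "555-0100"},                                  # new field appeared
]

results = [normalize(record) for record in incoming]
for known, extras in results:
    print(known, extras)
```

A hand-written consumer that indexes `record["email"]` directly would fail on the second record; handling that kind of change without editing code is the overhead the bullet is referring to.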
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Anurag Sharma |
Solution 2 | |
Solution 3 | Pavan Gomladu |