How to merge Kinesis data streams into one for Kinesis Data Analytics?
I have multiple AWS Kinesis data streams/Firehose delivery streams carrying structured data in CSV format. I need to perform analytics on that data with Kinesis Data Analytics, but it reads from only one stream. How can I merge multiple streams into one? The streams can exist in different regions.
Problem: How to merge Kinesis data streams into one for Kinesis data analytics?
Solution 1:[1]
I don't know of any "off the shelf" AWS product that does this, but it's pretty simple if you don't mind writing a little bit of code.
- Create a Kinesis stream that will be the "merged stream" (the events from all your source streams will go here).
- Create a Lambda function in the programming language of your choice and set its triggers to the Kinesis streams you want to merge.
- Code the Lambda to write every event it receives to the stream created in step 1.
The resulting Kinesis stream will contain the merged data, and you can use it as the input to Kinesis Data Analytics.
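The fan-in Lambda from the steps above can be sketched as follows. This is a minimal illustration, not the answer author's code: the stream name is a placeholder, and in a real Lambda you would pass in `boto3.client("kinesis")` as the client.

```python
import base64

def _to_put_records(event):
    """Convert a Kinesis trigger event into PutRecords entries.

    Lambda delivers Kinesis records base64-encoded under
    event["Records"][i]["kinesis"]["data"]; we decode them and keep the
    original partition key so ordering per key is preserved downstream.
    """
    return [
        {
            "Data": base64.b64decode(r["kinesis"]["data"]),
            "PartitionKey": r["kinesis"]["partitionKey"],
        }
        for r in event["Records"]
    ]

def make_handler(kinesis_client, merged_stream):
    """Build the Lambda handler; in real use, kinesis_client would be
    boto3.client("kinesis") and merged_stream the stream from step 1."""
    def handler(event, context=None):
        records = _to_put_records(event)
        # PutRecords accepts at most 500 records per call; a default
        # Kinesis trigger batch stays well under that limit.
        kinesis_client.put_records(StreamName=merged_stream, Records=records)
        return len(records)
    return handler
```

Because each source stream triggers the same function, all events funnel into the one merged stream regardless of which stream they arrived on.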
Solution 2:[2]
This is a late answer, but added for completeness.
You can also do it with Kinesis Data Analytics for Apache Flink: https://docs.aws.amazon.com/kinesisanalytics/latest/java/how-it-works.html. It is a managed Apache Flink service from AWS, if you don't mind writing a bit of code in Java or Python.
You can use a Studio notebook if you are still exploring the streaming data, i.e. in the development phase: https://docs.aws.amazon.com/kinesisanalytics/latest/java/how-notebook.html
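As a sketch of that approach: in a Flink application, merging streams is a `UNION ALL` over two Kinesis-backed tables. Everything below is hypothetical (stream names, region, columns are placeholders, not from the answer), and a real Kinesis Data Analytics for Apache Flink application also needs the Flink Kinesis connector on its classpath.

```python
# Hypothetical PyFlink Table API sketch: merge two Kinesis streams.
# All stream names, the region, and the columns are placeholders.

SOURCE_A_DDL = """
CREATE TABLE source_a (col1 STRING, col2 STRING) WITH (
  'connector' = 'kinesis',
  'stream' = 'source-stream-a',
  'aws.region' = 'us-east-1',
  'format' = 'csv')
"""

# Second source: same schema, different stream (could be another region).
SOURCE_B_DDL = (
    SOURCE_A_DDL
    .replace("source_a", "source_b")
    .replace("source-stream-a", "source-stream-b")
)

SINK_DDL = """
CREATE TABLE merged (col1 STRING, col2 STRING) WITH (
  'connector' = 'kinesis',
  'stream' = 'merged-stream',
  'aws.region' = 'us-east-1',
  'format' = 'csv')
"""

# The actual merge: interleave both sources into the sink.
MERGE_SQL = """
INSERT INTO merged
SELECT col1, col2 FROM source_a
UNION ALL
SELECT col1, col2 FROM source_b
"""

def run_job():
    # Runs only where apache-flink is installed (e.g. inside a Kinesis
    # Data Analytics for Apache Flink application), so the import is local.
    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    for ddl in (SOURCE_A_DDL, SOURCE_B_DDL, SINK_DDL):
        t_env.execute_sql(ddl)
    t_env.execute_sql(MERGE_SQL)
```

Because Flink itself consumes from multiple sources, this avoids the extra Lambda and merged-stream hop of Solution 1; the union happens inside the analytics job.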
Disclaimer: I work for the Amazon Kinesis team
Solution 3:[3]
I recently implemented a solution that joins multiple sets of streaming data, and I faced the same issue you describe in your question.
Indeed, a KDA in-application takes only one stream as its input data source, so this limitation makes it necessary to standardize the schema of the data flowing into KDA when you are dealing with multiple streams. To work around this, a small Python snippet inside a Lambda can flatten and standardize any event by converting its entire payload to a JSON-encoded string. The Lambda then sends the flattened events to a single Kinesis Data Stream. (The original answer includes a diagram illustrating this process.)
Note that after this stage both JSON events have the same schema and no nested fields, yet all information is preserved. In addition, the ssn field is lifted into the header so it can be used as the join key later on.
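The flattening step described above can be sketched like this. It is an illustrative guess at the technique, not the author's code: the `ssn` location inside the nested event and the output field names are assumptions.

```python
import json

def flatten(event: dict) -> dict:
    """Standardize an arbitrary event to one flat schema.

    The join key (ssn) is lifted into a top-level header field, and the
    whole original payload is preserved as a JSON-encoded string, so
    every source stream ends up with an identical, nesting-free schema.
    """
    # Assumed locations for ssn: either top-level or under a "person" object.
    ssn = event.get("ssn") or event.get("person", {}).get("ssn")
    return {
        "ssn": ssn,                    # header field used as the join key
        "payload": json.dumps(event),  # full event preserved, no data loss
    }
```

Inside KDA, the events can then be joined on `ssn`, and the original structure recovered by parsing `payload` back into JSON.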
I wrote a detailed explanation of this solution here: https://medium.com/@guilhermeepassos/joining-and-enriching-multiple-sets-of-streaming-data-with-kinesis-data-analytics-24b4088b5846
I hope this helps!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | alstonp |
| Solution 2 | adimg |
| Solution 3 | pass0s |