I'm using the Apache Beam Go SDK and having a hard time getting a PCollection into the correct format for grouping/combining by key. I have multiple records per key
I built a pipeline which reads from Confluent Kafka, processes the records, and then uses side outputs to split them into rejected and approved PCollections, th
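For reference, the multi-output (side output) pattern in the Python SDK looks roughly like the sketch below; the record fields and the approve/reject rule are made-up assumptions, not from the original question:

```python
import apache_beam as beam
from apache_beam import pvalue

class SplitApprovedRejected(beam.DoFn):
    """Routes each record to the 'approved' main output or the 'rejected' tagged output."""
    def process(self, record):
        # 'amount' and the positive-value check are hypothetical validation criteria.
        if record.get("amount", 0) > 0:
            yield record                                    # main output: approved
        else:
            yield pvalue.TaggedOutput("rejected", record)   # side output: rejected

with beam.Pipeline() as p:
    records = p | beam.Create([{"id": 1, "amount": 10}, {"id": 2, "amount": -5}])
    results = records | beam.ParDo(SplitApprovedRejected()).with_outputs(
        "rejected", main="approved")
    approved, rejected = results.approved, results.rejected
    approved | "LogApproved" >> beam.Map(print)
    rejected | "LogRejected" >> beam.Map(lambda r: print("rejected:", r))
```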
I tried to run working Apache Beam code with the Spark runner using the command spark-submit --class org.apache.beam.examples.WordCount --master spark://localhost:404
First I want to say that I am totally new to the Beam world. I'm working on an Apache Beam-focused task and my main data source is a Kinesis stream. In there, when I
I am building a streaming pipeline using Apache Beam (Python SDK version 2.37.0) and Google Dataflow to write some data I receive via Pub/Sub to BigQuery. I proc
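A minimal sketch of that Pub/Sub-to-BigQuery streaming path in the Python SDK, assuming JSON message payloads; the subscription, table, and schema values are placeholders:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # or pass --streaming on the command line

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")   # placeholder
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            table="my-project:my_dataset.my_table",    # placeholder
            schema="col1:STRING,col2:FLOAT",           # placeholder schema
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```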
I have two very large PCollections, one of KV&lt;Long, XYZ&gt; and one of KV&lt;Long, ABC&gt;. I need to create a PCollection&lt;KV&lt;XYZ, ABC&gt;&gt; which I
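The question uses Java KV types, but in either SDK the usual way to build that pairing is a CoGroupByKey join on the shared Long key. A hedged Python sketch of the shape, with made-up values standing in for XYZ and ABC:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    xyz = p | "XYZ" >> beam.Create([(1, "xyz-a"), (2, "xyz-b")])   # stands in for KV<Long, XYZ>
    abc = p | "ABC" >> beam.Create([(1, "abc-a"), (2, "abc-b")])   # stands in for KV<Long, ABC>

    def cross_join(element):
        _, grouped = element
        # Emit one (xyz, abc) pair for every combination that shared the Long key.
        for x in grouped["xyz"]:
            for a in grouped["abc"]:
                yield (x, a)

    joined = (
        {"xyz": xyz, "abc": abc}
        | "JoinOnLong" >> beam.CoGroupByKey()
        | "ToXyzAbcPairs" >> beam.FlatMap(cross_join)
    )
    joined | beam.Map(print)   # e.g. ('xyz-a', 'abc-a')
```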
I'm just starting out and I need some help. I have a custom model that I created using Apache Beam, building a pipeline that takes the data from a CSV file in a folder
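As a starting point, here is a minimal sketch of reading every CSV file in a folder with the Python SDK; the path, the single header line, and the simple line-based parsing are assumptions:

```python
import csv
import io
import apache_beam as beam

def parse_csv_line(line):
    # Assumes one record per line with no embedded newlines inside quoted fields.
    return next(csv.reader(io.StringIO(line)))

with beam.Pipeline() as p:
    rows = (
        p
        | "ReadCSVs" >> beam.io.ReadFromText("/path/to/folder/*.csv",  # placeholder glob
                                             skip_header_lines=1)
        | "ParseLines" >> beam.Map(parse_csv_line)
    )
    rows | beam.Map(print)
```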
We are using the Go SDK for building pipelines. I think Apache Beam already supports AWS S3 for Python and Java. Is there a plan to add it for the Go SDK?
I am creating a Dataflow pipeline in Python in which I need to use FileIO because I want to access and keep track of the filenames processed. Everything is working
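For context, the fileio pattern that keeps each filename alongside its contents looks roughly like this; the file pattern is a placeholder:

```python
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    (
        p
        | "Match" >> fileio.MatchFiles("gs://my-bucket/input/*.csv")   # placeholder pattern
        | "Read" >> fileio.ReadMatches()
        # Each element is a ReadableFile; its metadata carries the filename.
        | "PairWithName" >> beam.Map(
            lambda readable_file: (readable_file.metadata.path,
                                   readable_file.read_utf8()))
        | "Log" >> beam.MapTuple(lambda path, _contents: print("processed:", path))
    )
```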
Currently, I'm doing a streaming process with Dataflow for moving uploaded blobs from GCS into BigQuery. However, I found that there were several Pub/Sub messages
I am building a data pipeline from Pub/Sub to Beam (Direct/Dataflow runner) to BigQuery. Today we started to run into issues where the Beam BigQuery IO connector st
So my use case is that the elements in my PCollection should be put into windows of different lengths (which are specified in the Row itself), but the following
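One way to get per-element window lengths is a custom WindowFn. This is a hedged sketch, assuming each element is a dict with a hypothetical 'window_seconds' field and already carries an event timestamp:

```python
import apache_beam as beam
from apache_beam.coders import coders
from apache_beam.transforms.window import IntervalWindow, NonMergingWindowFn

class PerElementFixedWindows(NonMergingWindowFn):
    """Assigns each element to an interval window whose length comes from the element itself."""
    def assign(self, context):
        # 'window_seconds' is an assumed field name; context gives the element and its timestamp.
        length = context.element["window_seconds"]
        start = context.timestamp
        return [IntervalWindow(start, start + length)]

    def get_window_coder(self):
        return coders.IntervalWindowCoder()

# Usage sketch: elements must be timestamped before windowing, e.g.
# windowed = timestamped_pcoll | beam.WindowInto(PerElementFixedWindows())
```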
I am trying to convert JSON data {"col1":"sample-val-1", "col2":1.0} {"col1":"sample-val-2", "col2":2.0} {"col1":"sample-val-3", "col2":3.0} {"col1":"sample-val
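The target format is cut off in the excerpt, so as a generic starting point, here is a sketch that parses newline-delimited JSON like the sample above into schema'd beam.Row elements; the input path and the Row conversion are assumptions:

```python
import json
import apache_beam as beam

with beam.Pipeline() as p:
    parsed = (
        p
        | "ReadJsonLines" >> beam.io.ReadFromText("/path/to/input.jsonl")   # placeholder path
        | "Parse" >> beam.Map(json.loads)
        # Turn each dict into a schema'd Row so downstream SQL/DataFrame transforms can use it.
        | "ToRows" >> beam.Map(lambda d: beam.Row(col1=str(d["col1"]), col2=float(d["col2"])))
    )
    parsed | beam.Map(print)
```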
We have a long-running pipeline and we would like to add the timestamp to the filenames as close to the pipeline's end time as possible. The solution we have co
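One possible approach (a sketch, not a drop-in solution) is fileio.WriteToFiles with a custom file_naming callback, since that callback runs when the sink finalizes each file, which is close to the end of the write stage; the output path and name format below are assumptions:

```python
from datetime import datetime, timezone
import apache_beam as beam
from apache_beam.io import fileio

def finalize_time_naming(window, pane, shard_index, total_shards, compression, destination):
    # The timestamp is taken when the file is named during finalization, so it lands
    # near (but not exactly at) the moment the pipeline finishes writing.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    return f"output-{stamp}-shard-{shard_index}.txt"

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["a", "b", "c"])
        | fileio.WriteToFiles(
            path="/tmp/out",                          # placeholder output directory
            sink=lambda dest: fileio.TextSink(),      # plain text, one element per line
            file_naming=finalize_time_naming)
    )
```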
Is there any way to write unstructured data to a BigQuery table using the Apache Beam Dataflow BigQuery IO API (i.e. without providing a schema upfront)?
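One hedged option with the Python SDK: WriteToBigQuery can be told to let BigQuery autodetect the schema, which works with the file-loads write method. The table name and sample data below are placeholders:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([{"col1": "a", "col2": 1.0}, {"col1": "b", "col2": 2.0}])
        | beam.io.WriteToBigQuery(
            table="my-project:my_dataset.my_table",             # placeholder
            schema="SCHEMA_AUTODETECT",                          # let BigQuery infer the schema
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,    # autodetect requires file loads
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```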
Is it possible to route late data into another output instead of dropping it? I see that Flink supports this - getting late data as a side output.
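Beam does not expose a built-in late-data side output the way Flink does, but one commonly suggested workaround is to keep late firings via allowed lateness and then tag panes whose timing is LATE after the grouping. A hedged Python sketch, with the window size, lateness, and trigger chosen arbitrarily:

```python
import apache_beam as beam
from apache_beam import pvalue
from apache_beam.transforms import trigger, window
from apache_beam.utils.windowed_value import PaneInfoTiming

class SplitLatePanes(beam.DoFn):
    """Tags results produced by late pane firings instead of letting them be dropped."""
    def process(self, element, pane_info=beam.DoFn.PaneInfoParam):
        if pane_info.timing == PaneInfoTiming.LATE:
            yield pvalue.TaggedOutput("late", element)
        else:
            yield element

# Hypothetical usage on a keyed, timestamped PCollection `keyed`:
# results = (
#     keyed
#     | beam.WindowInto(
#         window.FixedWindows(60),
#         trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),
#         allowed_lateness=600,
#         accumulation_mode=trigger.AccumulationMode.DISCARDING)
#     | beam.GroupByKey()
#     | beam.ParDo(SplitLatePanes()).with_outputs("late", main="on_time")
# )
# on_time, late = results.on_time, results.late
```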
The Apache Beam pipeline (Python) I'm currently working on contains a transformation which runs a Docker container. While that works well during local testing w
I've made a simple pipeline in Python to read from Kafka; the thing is that the Kafka cluster is on Confluent Cloud and I am having some trouble connecting
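For comparison, connecting ReadFromKafka (a cross-language transform, so a Java runtime must be available for the expansion service) to Confluent Cloud usually means supplying SASL_SSL consumer properties; the bootstrap server, API key/secret, and topic below are placeholders:

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | ReadFromKafka(
            consumer_config={
                "bootstrap.servers": "pkc-xxxxx.us-central1.gcp.confluent.cloud:9092",  # placeholder
                "security.protocol": "SASL_SSL",
                "sasl.mechanism": "PLAIN",
                "sasl.jaas.config": (
                    'org.apache.kafka.common.security.plain.PlainLoginModule required '
                    'username="<API_KEY>" password="<API_SECRET>";'),   # placeholders
                "auto.offset.reset": "earliest",
            },
            topics=["my-topic"])            # placeholder topic
        | beam.Map(print)                   # each element is a (key, value) pair of bytes
    )
```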
I have an Apache Beam streaming project that calculates data and writes it to the database. What is the best way to reprocess all historical records after a bug
Issue Summary: Hi, I am using Avro version 1.11.0 for parsing an Avro file and decoding it. We have a custom requirement, so I am not able to use ReadFromAvro.
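One hedged way to keep custom decoding while staying inside Beam is to match the files with fileio and decode them with the avro library in a DoFn; the file pattern is a placeholder, and this assumes the opened file handle is seekable (true for local files):

```python
import apache_beam as beam
from apache_beam.io import fileio
from avro.datafile import DataFileReader
from avro.io import DatumReader

class DecodeAvroRecords(beam.DoFn):
    """Opens each matched Avro file and yields its records one by one."""
    def process(self, readable_file):
        # readable_file.open() returns a file-like object the avro reader can consume.
        with readable_file.open() as f:
            reader = DataFileReader(f, DatumReader())
            for record in reader:
                yield record

with beam.Pipeline() as p:
    (
        p
        | fileio.MatchFiles("/path/to/*.avro")   # placeholder pattern
        | fileio.ReadMatches()
        | beam.ParDo(DecodeAvroRecords())
        | beam.Map(print)
    )
```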