'Correct way to define an apache beam pipepline

I am new to Beam and struggling to find many good guides and resources to learn best practices.

One thing I have noticed is there are two ways pipelines are defined:

with beam.Pipeline() as p:
# pipeline code in here

Or

p = beam.Pipeline()
# pipeline code in here
result = p.run()
result.wait_until_finish()

Are there specific situations in which each method is preferred?



Solution 1:[1]

From code snippets, I see the main difference is if you care about pipeline result or not. If you want to use PipelineResult to monitor pipeline status or or cancel your pipeline by your code, you can go to the second style.

Solution 2:[2]

I think functional wise they are equivalent since the __exit__ function for pipeline context manager is executing the same code. https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py#L426

Solution 3:[3]

As pointed out by Yichi Zhang, Pipeline.__exit__ set .result, so you can do:

with beam.Pipeline() as p:
  ...

result = p.result

The contextmanager version is cleaner as it can correctly cleanup when error are raised inside the contextmanager.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Rui Wang
Solution 2 Yichi Zhang
Solution 3 Conchylicultor