'Correct way to define an apache beam pipepline
I am new to Beam and struggling to find many good guides and resources to learn best practices.
One thing I have noticed is there are two ways pipelines are defined:
with beam.Pipeline() as p:
# pipeline code in here
Or
p = beam.Pipeline()
# pipeline code in here
result = p.run()
result.wait_until_finish()
Are there specific situations in which each method is preferred?
Solution 1:[1]
From code snippets, I see the main difference is if you care about pipeline result or not. If you want to use PipelineResult to monitor pipeline status or or cancel your pipeline by your code, you can go to the second style.
Solution 2:[2]
I think functional wise they are equivalent since the __exit__
function for pipeline context manager is executing the same code.
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py#L426
Solution 3:[3]
As pointed out by Yichi Zhang, Pipeline.__exit__
set .result
, so you can do:
with beam.Pipeline() as p:
...
result = p.result
The contextmanager version is cleaner as it can correctly cleanup when error are raised inside the contextmanager.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Rui Wang |
Solution 2 | Yichi Zhang |
Solution 3 | Conchylicultor |