I recently ran a batch processing job on Cloud Dataflow, and the pipeline stopped with an IO error ("IOError: No space left on device").
Expanding the disk on the worker nodes solved the problem, but the amount of data being processed is not large, so I would not expect the disk to be exhausted.
Therefore, I would like to understand how Dataflow works internally so that I can make sense of the incident.
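For context, this is roughly how I applied the disk workaround. It is a minimal sketch assuming the pipeline is launched with the Apache Beam Python SDK; the project ID, region, bucket paths, and disk size are placeholders, and `disk_size_gb` is the standard Dataflow worker option for the boot-disk size:

```python
# Minimal sketch of the workaround (placeholder names throughout).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder project ID
    region="us-central1",                # placeholder region
    temp_location="gs://my-bucket/tmp",  # placeholder staging bucket
    disk_size_gb=100,  # worker boot-disk size; raising this avoided the IOError
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/out"))
```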
My questions are as follows.
- What is the architecture of Cloud Dataflow? Pointers to documentation describing it would also be appreciated.
- What happens between submitting a Dataflow job and the job actually starting to run?
My guess is that pipelines and jobs are managed on a managed Kubernetes cluster while the jobs themselves execute on VM instances in the user's project, since the Dataflow logs include kubelet and docker logs.
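This guess comes from browsing the worker log streams. Here is a hedged sketch of how I pulled them, using the google-cloud-logging Python client; the project ID and job ID below are placeholders, and the named streams `dataflow.googleapis.com/kubelet` and `dataflow.googleapis.com/docker` are the ones I saw:

```python
# Sketch of inspecting Dataflow worker logs (placeholder project and job IDs).
from google.cloud import logging

client = logging.Client(project="my-project")  # placeholder project ID

# Dataflow workers emit several named log streams, including
# dataflow.googleapis.com/kubelet and dataflow.googleapis.com/docker.
log_filter = (
    'resource.type="dataflow_step" '
    'resource.labels.job_id="2021-05-06_00_00_00-1234567890" '  # placeholder job ID
    'logName:("kubelet" OR "docker")'
)

for entry in client.list_entries(filter_=log_filter, max_results=20):
    print(entry.timestamp, entry.payload)
```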
Any information would be appreciated.
Source: https://stackoverflow.com/questions/67409934/how-cloud-dataflow-works-and-how-dataflow-job-is-managed (asked May 6, 2021)