Welcome to the OGeek Q&A Community for programmers and developers: Open, Learning and Share

Apache Beam - Google Dataflow Pipeline with Instance-Local Cache + External REST API Calls

We want to build a Cloud Dataflow streaming pipeline that ingests events from Pub/Sub and performs multiple ETL-like operations on each individual event. One of these operations is that each event carries a device-id which needs to be transformed to a different value (let's call it mapped-id); the mapping from device-id to mapped-id is provided by an external service over a REST API. The same device-id may appear in multiple events, so these device-id->mapped-id mappings can be cached and re-used. Since the pipeline may handle as many as 3M events per second at peak, calls to the REST API need to be reduced as much as possible, and optimized when they are actually needed.
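In a Beam/Dataflow pipeline such a cache would typically live as a per-DoFn-instance or per-process attribute initialized in `setup()`; the sketch below illustrates just the caching logic in plain Python, with the REST lookup stubbed out. `fetch_mapping` and the cache size are assumptions for illustration, not part of any real API.

```python
from functools import lru_cache

# Stub for the external REST lookup; in the real pipeline this would be an
# HTTP call to the mapping service (hypothetical endpoint and payload).
def fetch_mapping(device_id):
    return "mapped-" + device_id

@lru_cache(maxsize=1_000_000)  # size is an assumption; tune to worker memory
def lookup_mapped_id(device_id):
    # Only cache misses reach the external service; repeated device-ids
    # are served from worker-local memory.
    return fetch_mapping(device_id)

events = [{"device_id": "d1"}, {"device_id": "d2"}, {"device_id": "d1"}]
enriched = [dict(e, mapped_id=lookup_mapped_id(e["device_id"])) for e in events]
```

With this shape, the third event (a repeat of `d1`) is answered from the cache, so only two REST calls are made for three events.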

With this setup in mind, I have the following questions.

  • To optimize the REST API calls, does Dataflow provide any built-in optimizations such as connection pooling? If we decide to use our own mechanism, are there any restrictions or limitations we need to keep in mind?

  • We are looking at some in-memory cache options for locally caching the mappings, some of which are backed by local files as well. Are there any restrictions on how much memory (as a fraction of the total instance memory) these caches can use without affecting regular Dataflow operations on the workers? If we use a file-backed cache, is there a path on each worker that the application itself can safely use to build these files?

  • The number of unique device-ids could be in the order of many millions, so not all of the mappings can be stored on each instance. To utilize the local cache better, we need some affinity between a device-id and the worker where it is processed. I can do a group-by on the device-id before the stage where this transform happens. If I do that, is there any guarantee that the same device-id will more or less be processed by the same worker? If there is a reasonable guarantee, then I would not have to hit the external REST API other than for the first call, which should be fine. Or is there a better way to ensure such affinity between the ids and the workers?
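To illustrate the affinity idea in the last question: keying elements by device-id (as a `GroupByKey` or a stateful DoFn would) deterministically routes all occurrences of the same id to the same key partition. A minimal plain-Python sketch of that deterministic routing, where `NUM_SHARDS` and `shard_for` are hypothetical names standing in for Beam's internal key partitioning:

```python
import hashlib

NUM_SHARDS = 8  # stands in for the number of key partitions; an assumption

def shard_for(device_id):
    # A deterministic hash means the same device-id always maps to the same
    # shard, which is the per-key affinity that grouping by device-id gives.
    digest = hashlib.sha1(device_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

events = ["d1", "d2", "d1", "d3", "d2"]
keyed = [(shard_for(d), d) for d in events]  # repeats of an id share a key
```

Note that this only shows determinism at the key level; whether a given key partition stays pinned to one physical worker across autoscaling or rebalancing is a separate question for the runner.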

Thanks


1 Reply

Waiting for answers
