apache beam - GroupIntoBatches for non-KV elements

Question

Welcome To Ask or Share your Answers For Others

apache beam - GroupIntoBatches for non-KV elements

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

apache beam - GroupIntoBatches for non-KV elements

According to the Apache Beam 2.0.0 SDK Documentation GroupIntoBatches works only with KV collections.

My dataset contains only values and there's no need for introducing keys. However, to make use of GroupIntoBatches I had to implement “fake” keys with an empty string as a key:

static class FakeKVFn extends DoFn<String, KV<String, String>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(KV.of("", c.element()));
  }
}

So the overall pipeline looks like the following:

public static void main(String[] args) {
  PipelineOptions options = PipelineOptionsFactory.create();
  Pipeline p = Pipeline.create(options);

  long batchSize = 100L;

  p.apply("ReadLines", TextIO.read().from("./input.txt"))
      .apply("FakeKV", ParDo.of(new FakeKVFn()))
      .apply(GroupIntoBatches.<String, String>ofSize(batchSize))
      .setCoder(KvCoder.of(StringUtf8Coder.of(), IterableCoder.of(StringUtf8Coder.of())))
      .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          c.output(callWebService(c.element().getValue()));
        }
      }))
      .apply("WriteResults", TextIO.write().to("./output/"));

  p.run().waitUntilFinish();
}

Is there any way to group into batches without introducing “fake” keys?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:35:31+0000

It is required to provide KV inputs to GroupIntoBatches because the transform is implemented using state and timers, which are per key-and-window.

For each key+window pair, state and timers necessarily execute serially (or observably so). You have to manually express the available parallelism by providing keys (and windows, though no runner that I know of parallelizes over windows today). The two most common approaches are:

Use some natural key like a user ID
Choose some fixed number of shards and key randomly. This can be harder to tune. You have to have enough shards to get enough parallelism, but each shard needs to include enough data that GroupIntoBatches is actually useful.

Adding one dummy key to all elements as in your snippet will cause the transform to not execute in parallel at all. This is similar to the discussion at Stateful indexing causes ParDo to be run single-threaded on Dataflow Runner.

Categories

apache beam - GroupIntoBatches for non-KV elements

apache beam - GroupIntoBatches for non-KV elements

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags