Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
637 views
in Technique[技术] by (71.8m points)

spring-batch - Can a Spring Batch partitioning master read the database and pass data to workers?

I am new to Spring Batch and am trying to design a new application that has to read 20 million records from a database and process them.

I don't think we can do this with a single job and a single step running sequentially on one thread.

I was thinking we could use partitioning, where a step is divided into a master and multiple workers (each worker is a thread that does its own processing and can run in parallel).

We have to read an existing table that has 20 million records and process them, but the table has no auto-generated sequence number; its primary key is something like a 10-digit employer number.

I checked a few partitioning samples in which a range is passed to each worker and the worker processes that range, for example worker1 from 1 to 100 and worker2 from 101 to 200. In my case that is not going to work, because we don't have a sequence number to pass as a range to each worker.

With partitioning, can the master read the data from the database (say 1000 records at a time) and pass it to each worker instead of sending a range?

Or would you suggest a better approach for this scenario?

  Asked by Ethan Lee (translated from SO)

1 Reply

0 votes
by (71.8m points)

In principle, any query that returns its result rows in a deterministic order can be partitioned like the examples you mentioned, by means of the OFFSET and LIMIT options.
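
As a rough illustration of the OFFSET/LIMIT idea, the sketch below splits a row count into contiguous slices in plain Java. The class `OffsetPartitions`, the table `employer`, and the column `employer_number` are hypothetical names, and the `LIMIT ... OFFSET ...` syntax is an assumption that varies by database. In a real job the same arithmetic would typically live inside a Spring Batch `Partitioner`, which puts each partition's values into an `ExecutionContext` rather than a plain `Map`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: splits totalRows into at most gridSize contiguous slices.
class OffsetPartitions {

    // Returns partition name -> {offset, limit}; the last slice may be shorter.
    static Map<String, long[]> split(long totalRows, int gridSize) {
        if (totalRows < 0 || gridSize <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and gridSize > 0");
        }
        Map<String, long[]> partitions = new LinkedHashMap<>();
        long pageSize = (totalRows + gridSize - 1) / gridSize; // ceiling division
        int index = 0;
        for (long offset = 0; offset < totalRows; offset += pageSize) {
            long limit = Math.min(pageSize, totalRows - offset);
            partitions.put("partition" + index++, new long[] {offset, limit});
        }
        return partitions;
    }

    // Query a worker could run for its slice (assumed table and column names).
    // Ordering by the primary key keeps the slices stable between workers.
    static String queryFor(long[] slice) {
        return "SELECT * FROM employer ORDER BY employer_number LIMIT "
                + slice[1] + " OFFSET " + slice[0];
    }

    public static void main(String[] args) {
        split(20_000_000L, 8).forEach((name, slice) ->
                System.out.println(name + " -> " + queryFor(slice)));
    }
}
```

Each worker only needs its own offset and limit, so the values passed from the master stay tiny regardless of how many rows a partition covers.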

The ORDER BY may considerably increase the query execution time, although if you order by the table's primary key the effect should be less noticeable, because the primary key index is already ordered.

So I would give this approach a try first, as it is the most elegant IMHO.

Note, however, that you might run into other problems processing a huge result set straight from a JdbcCursorItemReader, because some RDBMSs (like MySQL) won't be happy with the rate at which you'd be fetching rows interleaved with processing.

So depending on the complexity of your processing I would recommend validating the design in that regard early on.

Unfortunately it is not possible to retrieve a partition's entire set of table rows and pass it as a parameter to the worker step as you suggested, because the parameter must not serialize to more than a kilobyte (or something in that order of magnitude).

An alternative would be to retrieve each partition's data, store it somewhere (in a map entry in memory if size allows, or in a file), and pass a reference to that resource as a parameter to the worker step, which then reads and processes it.
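
A minimal sketch of the file-based variant, in plain Java and under stated assumptions: the class `StagedPartitions`, its methods, and the one-key-per-line file format are all hypothetical, and the wiring into a Spring Batch master and worker step is left out. The point it illustrates is that only a short path string travels to the worker, not the rows themselves.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the master stages each partition's keys in a file
// and hands the worker only the file path as its parameter.
class StagedPartitions {

    // Creates a temporary directory to hold the staged partition files.
    static Path newStagingDir() {
        try {
            return Files.createTempDirectory("partitions");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Master side: write keys in slices of partitionSize, one key per line.
    // Returns partition name -> file path (the reference passed to the worker).
    static Map<String, String> stage(List<String> keys, int partitionSize, Path dir) {
        if (partitionSize <= 0) {
            throw new IllegalArgumentException("partitionSize must be > 0");
        }
        Map<String, String> references = new LinkedHashMap<>();
        int index = 0;
        for (int start = 0; start < keys.size(); start += partitionSize) {
            int end = Math.min(start + partitionSize, keys.size());
            Path file = dir.resolve("partition" + index + ".txt");
            try {
                Files.write(file, keys.subList(start, end), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            references.put("partition" + index, file.toString());
            index++;
        }
        return references;
    }

    // Worker side: resolve the reference and read the staged keys back.
    static List<String> load(String reference) {
        try {
            return Files.readAllLines(Paths.get(reference), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("1000000001", "1000000002", "1000000003");
        Map<String, String> refs = stage(keys, 2, newStagingDir());
        refs.forEach((name, ref) -> System.out.println(name + " -> " + load(ref)));
    }
}
```

With threads as workers a shared temporary directory is enough; workers on other machines would need the files on storage they can all reach.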

