hadoop - Sqoop - Data splitting

Question

Welcome To Ask or Share your Answers For Others

hadoop - Sqoop - Data splitting

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

hadoop - Sqoop - Data splitting

Sqoop able to import data from multiple tables using --query clause but not clear whether it is able to import below query.

Select deptid, avg(salary) from emp group by deptid

Another question is

sqoop import --connect jdbc:mysql://myserver:1202/ --username=u1 --password=p1 --query 'Select * from emp where empid< 1123 and $CONDITIONS' --split-by empid --target-dir /uname/emp/salary

$CONDITIONS and split-by are used to perform parallel processing or we can say efficiently importing data. Former split the rows based on condition and later use min and max logic on primary key. What is the difference between these two ($CONDITIONS, split-by). If we use both in same sqoop statement, which clause would got the priority?

Thanks....

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:50:17+0000

There is some gap in your understanding.

First of all, the degree of parallelism is controlled by -m <n> or --num-mappers <n>. By default value of --num-mappers is 4.

Second, --split-by <column-name>, will split your task on the basis of column-name.

Third, $CONDITIONS, it is used internally by sqoop to achieve this splitting task.

Example, You fired a query:

sqoop import --connect jdbc:mysql://myserver:1202/ --username u1 --password p1 --query 'select * from emp where $CONDITIONS' --split-by empId --target-dir /temp/emp -m 4

Say, my empId is uniformly distributed from 1- 100.

Now, sqoop will take --split-by column and find its max and min value using query:

SELECT MIN(empId), MAX(empId) FROM (Select * From emp WHERE (1 = 1) ) t1

See it replaced $CONDITIONS with (1 = 1).

In our case, min, max values are 1 and 100.

As number of mappers are 4, sqoop will divide my query in 4 parts.

Creating input split with lower bound 'empId >= 1' and upper bound 'empId < 25'

Creating input split with lower bound 'empId >= 25' and upper bound 'empId < 50'

Creating input split with lower bound 'empId >= 50' and upper bound 'empId < 75'

Creating input split with lower bound 'empId >= 75' and upper bound 'empId <= 100'

Now $CONDITIONS will again come into the picture. It is replaced by above range queries.

First mapper will fire query like this:

Select * From emp WHERE empId >= 25' AND 'empId < 50

and so on for other 3 mappers.

Results from all the mappers is aggregated and written to a final HDFS directory.

Regarding your query :

select deptid, avg(salary) from emp group by deptid

you will specify

--query 'select deptid, avg(salary) from emp group by deptid where $CONDITIONS'

It will be first converted to

select deptid, avg(salary) from emp group by deptid where (1 = 0)

to fetch column metadata.

I believe this query won't run in RDBMS. Try above query(having Where (1 = 0)) directly in Mysql.

So you will not be able to use this query to fetch data using Sqoop.

Sqoop is used for simpler SQL queries.

Categories

hadoop - Sqoop - Data splitting

hadoop - Sqoop - Data splitting

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags