Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
519 views
in Technique[技术] by (71.8m points)

sql server - TABLESAMPLE returns wrong number of rows?

I've just discovered the TABLESAMPLE clause but surprisingly it doesn't return the number of rows i've specified.

The table that i've used has ~14M rows and i wanted an arbitrary sample of 10000 rows.

select * from tabData TABLESAMPLE(10000 ROWS)

I get not 10000 but a different number everytime i execute it(between 8000 and 14000).

What's going on here, have i misunderstood the intended purpose of TABLESAMPLE?

Edit:

David's link explains it pretty well.

This returns always 10000 roughly random rows in an efficient way:

select TOP 10000 * from tabData TABLESAMPLE(20000 ROWS);

and the REPEATABLE option helps to get always the same (unless data has changed)

select TOP 10000 * from tabData TABLESAMPLE(10000 ROWS) REPEATABLE(100);

Since i wanted to know if it's more expensive to use TABLESAMPLE with a large number of rows to ensure(?) that i get the correct row-number, i've measured it;

1.loop (20 times):

select TOP 10000 * from tabData TABLESAMPLE(10000 ROWS);

(9938 row(s) affected)
(10000 row(s) affected)
(9383 row(s) affected)
(9526 row(s) affected)
(10000 row(s) affected)
(9545 row(s) affected)
(9560 row(s) affected)
(9673 row(s) affected)
(9608 row(s) affected)
(9476 row(s) affected)
(9766 row(s) affected)
(10000 row(s) affected)
(9500 row(s) affected)
(9941 row(s) affected)
(9769 row(s) affected)
(9547 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(9478 row(s) affected)
First batch(only 10000 rows) completed in: 14 seconds!

2.loop (20 times):

select TOP 10000 * from tabData TABLESAMPLE(10000000 ROWS);

(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
Second batch(max rows) completed in: 13 seconds!

3.loop: counterscheck with 100% random rows using ORDER BY NEWID():

select TOP 10000 * from tabData ORDER BY NEWID();

(10000 row(s) affected)

Cancelled after one execution that lasted 23 minutes

Conclusion:

So suprisingly the approach with an exact TOP clause and a large number in TABLESAMPLE is not slower. Hence it's a very efficient alternative to ORDER BY NEWID() if it doesn't matter that the rows are not random per row but per page level(Each 8K page for the table is given a random value).

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

See the article here. You need to add a top clause and/or use the repeatable option to get the number of rows you want.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...