I've just discovered the TABLESAMPLE
clause, but surprisingly it doesn't return the number of rows I specified.
The table I used has ~14M rows, and I wanted an arbitrary sample of 10000 rows.
select * from tabData TABLESAMPLE(10000 ROWS)
I don't get 10000 rows but a different number every time I execute it (between 8000 and 14000).
What's going on here? Have I misunderstood the intended purpose of TABLESAMPLE?
Edit:
David's link explains it pretty well.
This returns 10000 roughly random rows virtually every time, in an efficient way:
select TOP 10000 * from tabData TABLESAMPLE(20000 ROWS);
and the REPEATABLE
option helps to get the same rows every time (unless the data has changed):
select TOP 10000 * from tabData TABLESAMPLE(10000 ROWS) REPEATABLE(100);
Since I wanted to know whether it's more expensive to use TABLESAMPLE
with a large row count to ensure (as far as possible) that I get the correct number of rows, I measured it:
1st loop (20 times):
select TOP 10000 * from tabData TABLESAMPLE(10000 ROWS);
(9938 row(s) affected)
(10000 row(s) affected)
(9383 row(s) affected)
(9526 row(s) affected)
(10000 row(s) affected)
(9545 row(s) affected)
(9560 row(s) affected)
(9673 row(s) affected)
(9608 row(s) affected)
(9476 row(s) affected)
(9766 row(s) affected)
(10000 row(s) affected)
(9500 row(s) affected)
(9941 row(s) affected)
(9769 row(s) affected)
(9547 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(9478 row(s) affected)
First batch (only 10000 rows in TABLESAMPLE) completed in 14 seconds!
2nd loop (20 times):
select TOP 10000 * from tabData TABLESAMPLE(10000000 ROWS);
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
Second batch (10000000 rows in TABLESAMPLE) completed in 13 seconds!
3rd loop: cross-check with 100% random rows using ORDER BY NEWID():
select TOP 10000 * from tabData ORDER BY NEWID();
(10000 row(s) affected)
Cancelled after one execution that lasted 23 minutes.
Conclusion:
Surprisingly, the approach with an exact TOP
clause and a large number in TABLESAMPLE
is not slower. Hence it's a very efficient alternative to ORDER BY NEWID()
if it doesn't matter that the sample is random at the page level rather than the row level (each 8K page of the table is given a random value, and whole pages are included or excluded based on it).
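If per-row randomness does matter, a middle ground (a sketch under the assumptions of this post: the same tabData table, and a 2x oversampling factor chosen as a judgment call) is to let TABLESAMPLE cheaply cut the table down to a small page-level sample first, and then shuffle only that small sample per row:

```sql
-- Oversample at the page level (cheap on 14M rows), then shuffle
-- only the ~20000 sampled rows with NEWID() (cheap on a small set)
-- and keep 10000 of them.
select TOP 10000 *
from tabData TABLESAMPLE(20000 ROWS)
ORDER BY NEWID();
```

Note that the result still comes only from the randomly chosen pages, so it is not a uniformly random sample of the whole table; within the oversampled set, however, the pick is random per row rather than per page.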