How do I take an efficient simple random sample in SQL? The database in question is running MySQL; my table is at least 200,000 rows, and I want a simple random sample of about 10,000.
The "obvious" answer is to:
SELECT * FROM table ORDER BY RAND() LIMIT 10000
For large tables, that's too slow: it calls RAND()
for every row (which already puts it at O(n)), and sorts them, making it O(n lg n) at best. Is there a way to do this faster than O(n)?
Note: As Andrew Mao points out in the comments, If you're using this approach on SQL Server, you should use the T-SQL function NEWID()
, because RAND() may return the same value for all rows.
EDIT: 5 YEARS LATER
I ran into this problem again with a bigger table, and ended up using a version of @ignorant's solution, with two tweaks:
- Sample the rows to 2-5x my desired sample size, to cheaply
ORDER BY RAND()
- Save the result of
RAND()
to an indexed column on every insert/update. (If your data set isn't very update-heavy, you may need to find another way to keep this column fresh.)
To take a 1000-item sample of a table, I count the rows and sample the result down to, on average, 10,000 rows with the the frozen_rand column:
SELECT COUNT(*) FROM table; -- Use this to determine rand_low and rand_high
SELECT *
FROM table
WHERE frozen_rand BETWEEN %(rand_low)s AND %(rand_high)s
ORDER BY RAND() LIMIT 1000
(My actual implementation involves more work to make sure I don't undersample, and to manually wrap rand_high around, but the basic idea is "randomly cut your N down to a few thousand.")
While this makes some sacrifices, it allows me to sample the database down using an index scan, until it's small enough to ORDER BY RAND()
again.
Question&Answers:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…