Two solutions presented here. Both of these proposed solutions are mysql-only and can be used by any programming language as the consumer. PHP would be wildly too slow for this, but it could be the consumer of it.
Faster Solution: I can bring 1000 random rows from a table of 19 million rows in about 2 tenths of a second with more advanced programming techniques.
Slower Solution: It takes about 15 seconds with non-power programming techniques.
By the way both use the data generation seen HERE that I wrote. So that is my little schema. I use that, continue with TWO more self-inserts seen over there, until I have 19M rows. So I am not going to show that again. But to get those 19M rows, go see that, and do 2 more of those inserts, and you have 19M rows.
Slower version first
First, the slower method.
select id,thing from ratings order by rand() limit 1000;
That returns 1000 rows in 15 seconds.
For anyone new to mysql, don't even read the following.
Faster solution
This is a little more complicated to describe. The gist of it is that you pre-compute your random numbers and generate an in clause
ending of random numbers, separated by commas, and wrapped with a pair of parentheses.
It will look like (1,2,3,4)
but it will have 1000 numbers in it.
And you store them, and use them once. Like a one time pad for cryptography. Ok, not a great analogy, but you get the point I hope.
Think of it as an ending for an in
clause, and stored in a TEXT column (like a blob).
Why in the world would one want to do this? Because RNG (random number generators) are prohibitively slow. But to generate them with a few machines may be able to crank out thousands relatively quickly. By the way (and you will see this in the structure of my so called appendices, I capture how long it takes to generate one row. About 1 second with mysql. But C#, PHP, Java, anything can put that together. The point is not how you put it together, rather, that you have it when you want it.
This strategy, the long and short of it is, when this is combined with fetching a row that has not been used as a random list, marking it as used, and issuing a call such as
select id,thing from ratings where id in (a,b,c,d,e, ... )
and the in clause has 1000 numbers in it, the results are available in less than half a second. Effective employing the mysql CBO (cost based optimizer) than treats it like a join on a PK index.
I leave this in summary form, because it is a bit complicated in practice, but includes the following particles potentially
- a table holding the precomputed random numbers (Appendix A)
- a mysql create event strategy (Appendix B)
- a stored procedure that employees a Prepared Statement (Appendix C)
- a mysql-only stored proc to demonstrate RNG
in
clause for kicks (Appendix D)
Appendix A
A table holding the precomputed random numbers
create table randomsToUse
( -- create a table of 1000 random numbers to use
-- format will be like a long "(a,b,c,d,e, ...)" string
-- pre-computed random numbers, fetched upon needed for use
id int auto_increment primary key,
used int not null, -- 0 = not used yet, 1= used
dtStartCreate datetime not null, -- next two lines to eyeball time spent generating this row
dtEndCreate datetime not null,
dtUsed datetime null, -- when was it used
txtInString text not null -- here is your in clause ending like (a,b,c,d,e, ... )
-- this may only have about 5000 rows and garbage cleaned
-- so maybe choose one or two more indexes, such as composites
);
Appendix B
In the interest of not turning this into a book, see my answer HERE for a mechanism for running a recurring mysql Event. It will drive the maintenance of the table seen in Appendix A using techniques seen in Appendix D and other thoughts you want to dream up. Such as re-use of rows, archiving, deleting, whatever.
Appendix C
stored procedure to simply get me 1000 random rows.
DROP PROCEDURE if exists showARandomChunk;
DELIMITER $$
CREATE PROCEDURE showARandomChunk
(
)
BEGIN
DECLARE i int;
DECLARE txtInClause text;
-- select now() into dtBegin;
select id,txtInString into i,txtInClause from randomsToUse where used=0 order by id limit 1;
-- select txtInClause as sOut; -- used for debugging
-- if I run this following statement, it is 19.9 seconds on my Dell laptop
-- with 19M rows
-- select * from ratings order by rand() limit 1000; -- 19 seconds
-- however, if I run the following "Prepared Statement", if takes 2 tenths of a second
-- for 1000 rows
set @s1=concat("select * from ratings where id in ",txtInClause);
PREPARE stmt1 FROM @s1;
EXECUTE stmt1; -- execute the puppy and give me 1000 rows
DEALLOCATE PREPARE stmt1;
END
$$
DELIMITER ;
Appendix D
Can be intertwined with Appendix B concept. However you want to do it. But it leaves you with something to see how mysql could do it all by itself on the RNG side of things. By the way, for parameters 1 and 2 being 1000 and 19M respectively, it takes 800 ms on my machine.
This routine could be written in any language as mentioned in the beginning.
drop procedure if exists createARandomInString;
DELIMITER $$
create procedure createARandomInString
( nHowMany int, -- how many numbers to you want
nMaxNum int -- max of any one number
)
BEGIN
DECLARE dtBegin datetime;
DECLARE dtEnd datetime;
DECLARE i int;
DECLARE txtInClause text;
select now() into dtBegin;
set i=1;
set txtInClause="(";
WHILE i<nHowMany DO
set txtInClause=concat(txtInClause,floor(rand()*nMaxNum)+1,", "); -- extra space good due to viewing in text editor
set i=i+1;
END WHILE;
set txtInClause=concat(txtInClause,floor(rand()*nMaxNum)+1,")");
-- select txtInClause as myOutput; -- used for debugging
select now() into dtEnd;
-- insert a row, that has not been used yet
insert randomsToUse(used,dtStartCreate,dtEndCreate,dtUsed,txtInString) values
(0,dtBegin,dtEnd,null,txtInClause);
END
$$
DELIMITER ;
How to call the above stored proc:
call createARandomInString(1000,18000000);
That generates and saves 1 row, of 1000 numbers wrapped as described above. Big numbers, 1 to 18M
As a quick illustration, if one were to modify the stored proc, un-rem the line near the bottom that says "used for debugging", and have that as the last line, in the stored proc that runs, and run this:
call createARandomInString(4,18000000);
... to generate 4 random numbers up to 18M, the results might look like
+-------------------------------------+
| myOutput |
+-------------------------------------+
| (2857561,5076608,16810360,14821977) |
+-------------------------------------+
Appendix E
Reality check. These are somewhat advanced techniques and I can't tutor anyone on them. But I wanted to share them anyway. But I can't teach it. Over and out.