Simple Random Samples from a SQL Database

- Last updated: 2014-12-30


How do I take an efficient simple random sample in SQL? The database in question is running MySQL; my table is at least 200,000 rows, and I want a simple random sample of about 10,000.

The "obvious" answer is to:

    SELECT * FROM table ORDER BY RAND() LIMIT 10000

For large tables, that's too slow: it calls RAND() for every row (which already puts it at O(n)), and sorts them, making it O(n lg n) at best. Is there a way to do this faster than O(n)?

Note: As Andrew Mao points out in the comments, if you're using this approach on SQL Server, you should use the T-SQL function NEWID(), because RAND() may return the same value for all rows.


I ran into this problem again with a bigger table, and ended up using a version of @ignorant's solution, with two tweaks:

  • Sample the rows down to 2-5x my desired sample size, to cheaply ORDER BY RAND()
  • Save the result of RAND() to an indexed column on every insert/update. (If your data set isn't very update-heavy, you may need to find another way to keep this column fresh.)

To take a 1000-item sample of a table, I count the rows and sample the result down to, on average, 10,000 rows with the frozen_rand column:

SELECT COUNT(*) FROM table; -- Use this to determine rand_low and rand_high

  SELECT *
    FROM table
   WHERE frozen_rand BETWEEN %(rand_low)s AND %(rand_high)s

(My actual implementation involves more work to make sure I don't undersample, and to manually wrap rand_high around, but the basic idea is "randomly cut your N down to a few thousand.")

While this makes some sacrifices, it allows me to sample the database down using an index scan, until it's small enough to ORDER BY RAND() again.
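As a minimal Python sketch of the windowing step described above (the function and variable names here are mine, assuming frozen_rand is uniformly distributed on [0, 1)):

```python
import random

def rand_window(total_rows, target_rows):
    """Pick a random [low, high] slice of [0, 1) whose width yields,
    on average, target_rows matches against a uniform frozen_rand column."""
    width = target_rows / total_rows
    low = random.uniform(0, 1 - width)
    return low, low + width
```

The bounds returned here would be bound to `%(rand_low)s` and `%(rand_high)s` in the query above; as the answer notes, real code also has to handle undersampling and wrap-around.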

Andrew Mao's reply: That doesn't even work in SQL Server, because RAND() returns the same value on every subsequent call.

ojrac's reply: Good point -- I'll add a note that SQL Server users should use ORDER BY NEWID() instead.

Andrew Mao's reply: It's still terribly inefficient, because it has to sort all the data. A random sampling technique for some percentage is better, but even after reading a bunch of posts on here, I haven't found an acceptable solution that is sufficiently random.

ojrac's reply: If you read the question, I'm asking specifically because ORDER BY RAND() is O(n lg n).

Josh Greifer's reply: muposat's answer below is great if you're not too obsessed with the statistical randomness of RAND().


There's a very interesting discussion of this type of issue here: http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

I think that with absolutely no assumptions about the table, your O(n lg n) solution is the best. Though actually, with a good optimizer or a slightly different technique, the query you list may be a bit better, O(m*n) where m is the number of random rows desired, as it wouldn't necessarily have to sort the whole large array; it could just search for the smallest m times. But for the sort of numbers you posted, m is bigger than lg n anyway.

Three assumptions we might try out:

  1. there is a unique, indexed, primary key in the table

  2. the number of random rows you want to select (m) is much smaller than the number of rows in the table (n)

  3. the unique primary key is an integer that ranges from 1 to n with no gaps

With only assumptions 1 and 2 I think this can be done in O(n), though you'll need to write a whole index to the table to match assumption 3, so it's not necessarily a fast O(n). If we can ADDITIONALLY assume something else nice about the table, we can do the task in O(m log m). Assumption 3 would be an easy, nice additional property to work with. With a nice random number generator that guaranteed no duplicates when generating m numbers in a row, an O(m) solution would be possible.

Given the three assumptions, the basic idea is to generate m unique random numbers between 1 and n, and then select the rows with those keys from the table. I don't have MySQL or anything in front of me right now, so in slightly pseudocode form this would look something like:

create table RandomKeys (RandomKey int)
create table RandomKeysAttempt (RandomKey int)

-- generate m random keys between 1 and n
for i = 1 to m
  insert RandomKeysAttempt select rand()*n + 1

-- eliminate duplicates
insert RandomKeys select distinct RandomKey from RandomKeysAttempt

-- as long as we don't have enough, keep generating new keys,
-- with luck (and m much less than n), this won't be necessary
while count(RandomKeys) < m
  NextAttempt = rand()*n + 1
  if not exists (select * from RandomKeys where RandomKey = NextAttempt)
    insert RandomKeys select NextAttempt

-- get our random rows
select *
from RandomKeys r
join table t ON r.RandomKey = t.UniqueKey

If you were really concerned about efficiency, you might consider doing the random key generation in some sort of procedural language and inserting the results in the database, as almost anything other than SQL would probably be better at the sort of looping and random number generation required.
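As a sketch of that suggestion, Python's standard library can generate the m distinct keys in one call (the function name here is mine, not from the answer):

```python
import random

def random_unique_keys(m, n):
    # random.sample draws without replacement, so this returns m distinct
    # integers from 1..n with no dedupe/retry loop needed.
    return random.sample(range(1, n + 1), m)
```

The resulting keys would then be inserted into RandomKeys (or formatted into an IN clause) and joined against the table's unique primary key.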

Sam Saffron's reply: I would recommend adding a unique index on the random key selection and perhaps ignoring duplicates on the insert; then you can get rid of the distinct stuff and the join will be faster.

ojrac's reply: I think the random number algorithm could use some tweaks -- either a UNIQUE constraint as mentioned, or just generate 2*m numbers, and SELECT DISTINCT, ORDER BY id (first-come-first-serve, so this reduces to the UNIQUE constraint) LIMIT m. I like it.

user12861's reply: As to adding a unique index to the random key selection and then ignoring duplicates on insert, I thought this might get you back to O(m^2) behavior instead of O(m lg m) for a sort. I'm not sure how efficient the server is at maintaining the index when inserting random rows one at a time.

user12861's reply: As to suggestions to generate 2*m numbers or something, I wanted an algorithm guaranteed to work no matter what. There's always the (slim) chance that your 2*m random numbers will have more than m duplicates, so you won't have enough for your query.

ojrac's reply: As long as you pay attention to the birthday paradox, you can easily generate a quantity of random numbers with an astronomically low chance of <m unique values. But, at worst, you could always generate another m keys until you've got enough unique ones. ;)
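The oversample-and-top-up idea from this exchange can be sketched in Python (the function name is mine); the loop guarantees exactly m distinct keys even in the unlikely duplicate-heavy case:

```python
import random

def sample_keys(m, n):
    """Return exactly m distinct keys in 1..n."""
    keys = set()
    # Generate in batches until we have m distinct keys; with m << n,
    # the birthday paradox makes more than a pass or two unlikely.
    while len(keys) < m:
        keys.update(random.randrange(1, n + 1) for _ in range(m - len(keys)))
    return keys
```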


I think the fastest solution is

select * from table where rand() <= .3

Here is why I think this should do the job.

  • It generates a random number between 0 and 1 for each row.
  • It keeps a row only if its number falls between 0 and .3 (30%).

This assumes that rand() is generating numbers in a uniform distribution. It is the quickest way to do this.
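A quick Python model of this filter (bernoulli_sample is a name I made up), showing that one uniform draw and one comparison per row yields roughly the requested fraction:

```python
import random

def bernoulli_sample(rows, p):
    # One rand() per row, no sort: keep each row independently
    # with probability p.
    return [r for r in rows if random.random() < p]
```

Note the result size is only approximately p * n, which is exactly the objection raised in the replies below.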

I saw that someone had recommended that solution and they got shot down without proof. Here is what I would say to that:

  • It is O(n), but no sorting is required, so it is faster than the O(n lg n) approach
  • MySQL is very capable of generating random numbers for each row. Try this:

    select rand() from INFORMATION_SCHEMA.TABLES limit 10;

Since the database in question is mySQL, this is the right solution.

user12861's reply: First, you have the problem that this doesn't really answer the question, since it returns a semi-random number of results -- close to the desired number, but not necessarily exactly that number -- instead of a precise desired number of results.


user12861's reply: Next, as to efficiency, yours is O(n), where n is the number of rows in the table. That's not nearly as good as O(m log m), where m is the number of results you want, and m << n. You could still be right that it would be faster in practice, because as you say generating rand()s and comparing them to a constant COULD be very fast. You'd have to test it to find out. With smaller tables you may win. With huge tables and a much smaller number of desired results, I doubt it.

ojrac's reply: While @user12861 is right about this not getting the exact right number, it's a good way to cut the data set down to the right rough size.


Reply: How does the database service the following query -- SELECT * FROM table ORDER BY RAND() LIMIT 10000? It has to first create a random number for each row (same as the solution I described), then order it. Sorts are expensive! This is why this solution WILL be slower than the one I described, as no sorts are required. You can add a limit to the solution I described and it will not give you more than that number of rows. As someone correctly pointed out, it won't give you an EXACT sample size, but with random samples, EXACT is most often not a strict requirement.


Reply: Is there a way to specify a minimum number of rows?

Apparently some versions of SQL have a TABLESAMPLE command, but it's not in all SQL implementations (notably, Redshift). http://technet.microsoft.com/en-us/library/ms189108(v=sql.105).aspx


ojrac's reply: Very cool! It looks like it's not implemented by PostgreSQL or MySQL/MariaDB either, but it's a great answer if you're on a SQL implementation that supports it.


Faster Than ORDER BY RAND()

I tested this method to be much faster than ORDER BY RAND(): it runs in O(n) time, and does so impressively fast.

From http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx:

Non-MSSQL version -- I did not test this

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= RAND()

MSSQL version:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

This will select ~1% of records. So if you need an exact number of records (or an exact percentage) to be selected, estimate your percentage with some safety margin, then randomly pluck the excess records from the resulting set, using the more expensive ORDER BY RAND() method.
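The "pluck excess records" step can be modeled in Python (trim_to_exact is a hypothetical helper standing in for a final ORDER BY RAND() LIMIT k on the already-small result set):

```python
import random

def trim_to_exact(oversampled, k):
    # The cheap filter returned slightly more than k rows; draw exactly
    # k of them uniformly, without replacement.
    return random.sample(oversampled, k)
```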

Even Faster

I was able to improve upon this method even further because I had a well-known indexed column value range.

For example, if you have an indexed column with uniformly distributed integers [0..max], you can use that to randomly select N small intervals. Do this dynamically in your program to get a different set for each query run. This subset selection will be O(N), which can many orders of magnitude smaller than your full data set.
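A sketch of the interval-picking step in Python (random_intervals and its parameters are my invention, assuming dense, uniformly distributed ids in [0, max]); each returned pair would become an indexed BETWEEN predicate:

```python
import random

def random_intervals(max_id, n_intervals, width):
    # Pick n_intervals random [start, start + width) id ranges; total
    # rows touched is ~n_intervals * width, independent of table size.
    starts = random.sample(range(0, max_id - width), n_intervals)
    return sorted((s, s + width) for s in starts)
```

Regenerating the intervals on every query run gives a different subset each time.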

In my test, I reduced the time needed to get 20 sample records (out of 20 million) from 3 minutes using ORDER BY RAND() down to 0.0 seconds!


Starting with the observation that we can retrieve the ids of a table (e.g., 5 of them) based on a set:

select *
from table_name
where _id in (4, 1, 2, 5, 3)

we can come to the result that if we could generate the string "(4, 1, 2, 5, 3)", then we would have a more efficient way than RAND().

For example, in Java:

ArrayList<Integer> indices = new ArrayList<Integer>(rowsCount);
for (int i = 0; i < rowsCount; i++) indices.add(i + 1);  // ids 1..rowsCount
Collections.shuffle(indices);  // random permutation of the ids
// take the first sampleSize ids and format them as "(4, 1, 2, 5, 3)"
String inClause = indices.subList(0, sampleSize).toString().replace('[', '(').replace(']', ')');

If the ids have gaps, then the initial ArrayList indices is the result of an SQL query on the ids.
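For illustration, the same IN-clause construction in Python (random_in_clause is my name; ids stands for the list of ids fetched from the table):

```python
import random

def random_in_clause(ids, k):
    # Choose k distinct ids and format them as "(4, 1, 2, 5, 3)"
    # for use in a WHERE _id IN (...) query.
    return "(" + ", ".join(str(i) for i in random.sample(ids, k)) + ")"
```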


Just use

WHERE RAND() < 0.1 

to get 10% of the records or

WHERE RAND() < 0.01 

to get 1% of the records, etc.


Reply: That will call RAND() for every row, making it O(n). The poster was looking for something better than that.

Andrew Mao's reply: Not only that, but RAND() returns the same value for subsequent calls (at least on MSSQL), meaning you will get either the whole table or none of it with that probability.


I want to point out that all of these solutions appear to sample without replacement. Selecting the top K rows from a random sort or joining to a table that contains unique keys in random order will yield a random sample generated without replacement.

If you want your sample to be independent, you'll need to sample with replacement. See Question 25451034 for one example of how to do this using a JOIN in a manner similar to user12861's solution. The solution is written for T-SQL, but the concept works in any SQL db.
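A minimal Python sketch of the distinction (the function name is mine): sampling with replacement draws each row independently, so the same row can appear more than once:

```python
import random

def sample_with_replacement(ids, k):
    # Each draw is independent and uniform over ids; duplicates are
    # allowed, unlike top-k-of-a-random-sort approaches.
    return [random.choice(ids) for _ in range(k)]
```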


Maybe you could do

SELECT * FROM table LIMIT 10000 OFFSET FLOOR(RAND() * 190000)

ojrac's reply: It looks like that would select a random slice of my data; I'm looking for something a little more complicated -- 10,000 randomly-distributed rows.


Reply: Then your only option, if you want to do it in the database, is ORDER BY rand().