Selecting n random rows across all levels of a factor within a dataframe

- Last updated: 2016-02-24
Question:


From these questions ("Random sample of rows from subset of an R dataframe" and "Random rows in dataframe in R") I can easily see how to randomly sample n rows from a df, or n rows that originate from a specific level of a factor within a df.

Here are some sample data:

df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=10)

df[sample(nrow(df), 3), ] #samples 3 random rows from df, without replacement.

To sample, e.g., just 3 random rows from the 'pink' color, using library(kimisc):

library(kimisc)
sample.rows(subset(df, color == "pink"), 3)

or writing a custom function:

sample.df <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]
sample.df(subset(df, color == "pink"), 3)

However, what I am trying to do is create a new df that contains 3 (or n) random rows from each level of the factor, i.e. the new df would have 12 rows (3 from blue, 3 from red, 3 from yellow, 3 from pink). It's obviously possible to run this several times, create a new df for each color, and then bind them together. However, I am trying to work out a simpler solution for when there are many, many levels that I need to do this across.

Solution:

You can assign a random ID to each element that has a particular factor level using ave. Then you can select all random IDs in a certain range.

rndid <- with(df, ave(X1, color, FUN=function(x) {sample.int(length(x))}))
df[rndid<=3,]

This has the advantage of preserving the original row order and row names, if that's something you are interested in. Plus you can re-use the rndid vector to create subsets of different lengths fairly easily.
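
As a rough illustration of that reuse, the sketch below draws the random IDs once and then takes both a 3-per-color and a 5-per-color subset from the same vector. It ranks row positions rather than X1 (an assumption made here so the example does not depend on any data column), and the set.seed() call is only there to make the run repeatable.

```r
set.seed(42)  # assumption: fixed seed, only for a repeatable illustration

df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Within each color, give every row a random rank from 1 to the group size
rndid <- ave(seq_len(nrow(df)), df$color,
             FUN = function(x) sample.int(length(x)))

small <- df[rndid <= 3, ]  # 3 random rows per color
large <- df[rndid <= 5, ]  # 5 random rows per color, from the same draw

table(small$color)
table(large$color)
```

Because both subsets come from the same rndid vector, every row in small also appears in large.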

Comment: Both this suggestion and the other answer work very well. May I just check two things about the above code? 1) The variable X1: does it matter which variable from the df is chosen here? (It doesn't seem to.) 2) In the situation where the number of observations varies across factor levels, and I ask for more rows per factor level than some levels contain, will this solution still work? I.e. if I ask for 11 rows per color, it will return 10. This may be useful in my real data, where the number of rows per factor level does vary.

Comment: @jalapic 1) You are correct that it doesn't really matter which variable you pass as the first parameter. Passing a numeric vector helped to keep the result numeric. 2) If you ask for 10 rows (rndid<=10) and a group only has 3, all three rows for that group will be returned; no missing values will be introduced, nor will sampling be done with replacement. So you may wind up with unbalanced groups.

Comment: Thank you. I don't mind the unbalanced groups in this context, so that works perfectly.
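
The unbalanced case discussed in these comments can be checked with a small sketch. The data frame ub below is hypothetical (10 blue rows, 3 pink rows) and is not part of the original question; the fixed seed is likewise only for repeatability.

```r
set.seed(1)  # assumption: fixed seed for repeatability

# Hypothetical unbalanced data: 10 blue rows but only 3 pink rows
ub <- data.frame(X1 = rnorm(13),
                 color = rep(c("blue", "pink"), times = c(10, 3)))

# Random rank within each color, as in the answer above
rndid <- ave(seq_len(nrow(ub)), ub$color,
             FUN = function(x) sample.int(length(x)))

# Ask for 5 rows per color
picked <- ub[rndid <= 5, ]
table(picked$color)  # blue contributes 5 rows, pink only its 3
```

No NA rows appear and nothing is sampled twice, since the filter can only keep rows that exist.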

Solution:

In versions of dplyr 0.3 and later, this works just fine:

df %>% group_by(color) %>% sample_n(size = 3)
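
For readers on much newer dplyr releases than this answer covers: from dplyr 1.0.0 onward, the documentation lists sample_n() as superseded by slice_sample(). The following is a minimal sketch of the same per-group sample under that assumption; it needs dplyr 1.0.0 or later, and the seed is only for repeatability.

```r
library(dplyr)  # assumption: dplyr >= 1.0.0, which provides slice_sample()

set.seed(42)
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

sampled <- df %>%
  group_by(color) %>%
  slice_sample(n = 3) %>%  # 3 random rows within each color
  ungroup()

count(sampled, color)
```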

Old versions of dplyr (version <= 0.2)

I set out to answer this using dplyr, assuming that this would work:

df %.% group_by(color) %.% sample_n(size = 3)

But it turns out that in 0.2 the sample_n.grouped_df S3 method exists but isn't registered in the NAMESPACE file, so it's never dispatched. Instead, I had to do this:

df %.% group_by(color) %.% dplyr:::sample_n.grouped_df(size = 3)
Source: local data frame [12 x 3]
Groups: color

            X1         X2  color
8   0.66152710 -0.7767473   blue
1  -0.70293752 -0.2372700   blue
2  -0.46691793 -0.4382669   blue
32 -0.47547565 -1.0179842   pink
31 -0.15254540 -0.6149726   pink
39  0.08135292 -0.2141423   pink
15  0.47721644 -1.5033192    red
16  1.26160230  1.1202527    red
12 -2.18431919  0.2370912    red
24  0.10493757  1.4065835 yellow
21 -0.03950873 -1.1582658 yellow
28 -2.15872261 -1.5499822 yellow

Presumably this will be fixed in a future update.

Comment: What version of dplyr are you using? Is it trunk?

Comment: I tried both 0.2 on CRAN and then installed from GitHub; same thing.

Comment: @joran In dplyr 0.3 this works like a charm. It's my favorite way of solving the above problem now.

Solution:

I would consider my stratified function, which is presently hosted as a GitHub Gist.

Get it with:

library(devtools)  ## To download "stratified"
source_gist("https://gist.github.com/mrdwab/6424112")

And use it with:

stratified(df, "color", 3)

There are several different features that are convenient for stratified sampling. For instance, you can also take a sample sort of "on the fly".

stratified(df, "color", 3, select = list(color = c("blue", "red")))

To give you a sense of what the function does, here are the arguments to stratified:

  • df: The input data.frame
  • group: A character vector of the column or columns that make up the "strata".
  • size: The desired sample size.
    • If size is a value less than 1, a proportionate sample is taken from each stratum.
    • If size is a single integer of 1 or more, that number of samples is taken from each stratum.
    • If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
  • select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
  • replace: For sampling with replacement.
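
To make the three forms of size concrete, here is a small base-R sketch. It is not the stratified function from the Gist: stratified_sketch is a hypothetical helper written for this illustration, it handles only a single grouping column, and capping at the stratum size and rounding the proportion are assumptions of the sketch, not documented behavior of the original.

```r
# Hypothetical helper; NOT the stratified() function from the Gist
stratified_sketch <- function(df, group, size) {
  pieces <- split(df, df[[group]], drop = TRUE)
  sampled <- lapply(names(pieces), function(level) {
    sub <- pieces[[level]]
    n <- if (length(size) > 1) {
      size[[level]]            # named vector: one count per stratum
    } else if (size < 1) {
      round(nrow(sub) * size)  # proportion of each stratum (assumed rounding)
    } else {
      size                     # the same count for every stratum
    }
    n <- min(n, nrow(sub))     # assumption: never ask for more rows than exist
    sub[sample.int(nrow(sub), n), , drop = FALSE]
  })
  do.call(rbind, sampled)
}

set.seed(42)
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

fixed    <- stratified_sketch(df, "color", 3)    # 3 rows per color
prop     <- stratified_sketch(df, "color", 0.2)  # 20% of each color
per_name <- stratified_sketch(df, "color",
                              c(blue = 1, red = 2, yellow = 3, pink = 4))
```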

Comment: This is a really neat function, very useful.

Solution:

Here's a solution. We split the data.frame into color groups. From each such group, we sample 3 rows. As a result, we obtain a list of data.frames.

df2 <- lapply(split(df, df$color),
   function(subdf) subdf[sample(1:nrow(subdf), 3),]
)

Then the list of data.frames should be combined into a single data.frame:

do.call('rbind', df2)
##                    X1          X2  color
## blue.3    -1.22677188  1.25648082   blue
## blue.4    -0.54516686 -1.94342967   blue
## blue.1     0.44647071  0.16283326   blue
## pink.40    0.23520296 -0.40411906   pink
## pink.34    0.02033939 -0.32321309   pink
## pink.33   -1.01790533 -1.22618575   pink
## red.16     1.86545895  1.11691250    red
## red.11     1.35748078 -0.36044728    red
## red.13    -0.02425645  0.85335279    red
## yellow.21  1.96728782 -1.81388110 yellow
## yellow.25 -0.48084967  0.07865186 yellow
## yellow.24 -0.07056236 -0.28514125 yellow
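
The two steps above can be wrapped into one function. The sketch below is one possible way to do that, with two additions that are assumptions rather than part of the answer: the name sample_by_level is made up for this example, and the min() guard is there so that a level with fewer than n rows returns all of its rows instead of raising an error from sample().

```r
# Hypothetical wrapper around the split / sample / rbind steps
sample_by_level <- function(df, by, n) {
  groups <- split(df, df[[by]], drop = TRUE)
  picked <- lapply(groups, function(subdf) {
    k <- min(n, nrow(subdf))  # assumption: cap at the size of the group
    subdf[sample.int(nrow(subdf), k), , drop = FALSE]
  })
  do.call(rbind, picked)
}

set.seed(42)
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

res <- sample_by_level(df, "color", 3)   # 3 rows from every color
all_rows <- sample_by_level(df, "color", 11)  # more than any group holds
```

As in the output above, the row names of the result are prefixed with the level name (for example blue.3), so the original row order and names are not preserved.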