Stratified sampling with random forests in R

I read the following in the randomForest documentation:

strata: A (factor) variable that is used for stratified sampling.

sampsize: Size(s) of sample to draw. For classification, if sampsize
is a vector of the length the number of strata, then sampling
is stratified by strata, and the elements of sampsize
indicate the numbers to be drawn from the strata.

For reference, the function's interface is:

 randomForest(x, y=NULL,  xtest=NULL, ytest=NULL, ntree=500,
              mtry=if (!is.null(y) && !is.factor(y))
              max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
              replace=TRUE, classwt=NULL, cutoff, strata,
              sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
              nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
              maxnodes = NULL,
              importance=FALSE, localImp=FALSE, nPerm=1,
              proximity, oob.prox=proximity,
              norm.votes=TRUE, do.trace=FALSE,
              keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
              keep.inbag=FALSE, ...)

My question is: how does one use strata and sampsize? Here is a minimal working example on which I would like to test these parameters:

library(randomForest)
iris = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", sep = ",", header = FALSE)
names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width", "iris.type")

model = randomForest(iris.type ~ sepal.length + sepal.width, data = iris)

> model
500 samples
  6 predictors
  2 classes: 'Y0', 'Y1' 

No pre-processing
Resampling: Bootstrap (7 reps) 

Summary of sample sizes: 477, 477, 477, 477, 477, 477, ... 

Resampling results across tuning parameters:

  mtry  ROC    Sens  Spec  ROC SD  Sens SD  Spec SD
  2     0.763  1     0     0.156   0        0      
  4     0.782  1     0     0.231   0        0      
  6     0.847  1     0     0.173   0        0      

ROC was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 6.

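I am asking about these parameters because I would like RF to use bootstrap samples that respect the proportion of positives to negatives in my data.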
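To make the goal concrete, the per-class draw sizes I have in mind could be computed roughly like this. This is only a sketch: it uses R's built-in iris data (an assumption, so the class column is Species rather than iris.type), and the total of 30 draws per tree (n_draw) is a hypothetical value chosen for illustration.

```r
# Built-in iris data (assumption: used here instead of the UCI download above)
data(iris)

# Share of each class in the full data set
class_prop <- prop.table(table(iris$Species))

# Hypothetical total number of rows to draw for each tree
n_draw <- 30

# Per-class draw sizes that keep the original class ratio.
# Note: with uneven class shares, rounding can make the sizes
# sum to slightly more or less than n_draw.
per_class <- as.integer(round(n_draw * class_prop))
names(per_class) <- names(class_prop)

print(per_class)
```

A vector like per_class, with one entry per class in factor-level order, is the shape that the documentation quoted above describes for sampsize.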

Another thread started discussing this topic, but it was resolved without explaining how to use these parameters.

Best answer
Wouldn't it just be something like this:

model = randomForest(iris.type ~ sepal.length + sepal.width, 
                     data = iris, 
                     sampsize=c(10,10,10), strata=iris$iris.type)

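I did try ..., strata = iris.type and ..., strata = 'iris.type', but apparently the code is not written to interpret that value in the environment of the 'data' argument, so the full vector iris$iris.type has to be passed. I used the outcome variable because it is the only factor variable in this data set, but I don't think it has to be the outcome variable. In fact, I think it definitely should not be the outcome variable. This particular model is expected to produce useless output and is only meant to test the syntax.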
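One way to check that a call of this shape really draws the requested number of rows from each stratum is to keep the in-bag matrix and count per stratum. The following is a minimal sketch rather than a definitive recipe. It assumes the built-in iris data (so the columns are Sepal.Length, Sepal.Width and Species instead of the names used in the question), and it sets replace = FALSE and ntree = 50 purely to make the counts easy to read.

```r
library(randomForest)

data(iris)    # built-in copy (assumption); the class column is Species
set.seed(42)  # arbitrary seed, only for reproducibility

model <- randomForest(Species ~ Sepal.Length + Sepal.Width,
                      data = iris,
                      ntree = 50,
                      strata = iris$Species,     # full vector, not a bare column name
                      sampsize = c(10, 10, 10),  # one size per stratum, in factor-level order
                      replace = FALSE,           # each row is in-bag at most once per tree
                      keep.inbag = TRUE)         # store which rows each tree used

# model$inbag has one row per observation and one column per tree.
# Summing it within each stratum gives the in-bag count per stratum and tree.
per_tree <- rowsum(model$inbag, group = iris$Species)

# Show the counts for the first five trees
print(per_tree[, 1:5])
```

If the stratified draw behaves as the documentation describes, every entry of per_tree should equal the corresponding element of sampsize.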
