I'm comparing different resampling methods in caret, using only lm. Across multiple datasets and seeds, I'm seeing much better model performance for k-fold CV, which makes me worry that I'm not pulling the correct information from the fit object. I want to know with certainty how to recover holdout model performance when using repeatedcv. How do you recover holdout-fold model performance using lm with caret?
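For reference, this is how I am reading the holdout metrics off the train object; a minimal sketch, assuming the per-fold holdout scores live in `fit$resample` and the averaged scores in `fit$results` (the column names below are what caret produces for regression models):

```r
library(caret)

set.seed(1)
fit <- train(Sepal.Width ~ ., method = "lm", data = iris,
             trControl = trainControl(method = "repeatedcv",
                                      number = 10, repeats = 10))

head(fit$resample)            # one row per holdout fold: RMSE, Rsquared, MAE, Resample
fit$results                   # the same metrics averaged over all 100 folds
mean(fit$resample$Rsquared)   # matches fit$results$Rsquared
```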
In the example below, both boot and LOOCV produce worse model performance on the iris dataset. Given that LOOCV uses more training data in each fit, this doesn't make sense to me:
fit <- train(Sepal.Width ~ ., method = "lm", data = iris,
             trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10))
fit

fit <- train(Sepal.Width ~ ., method = "lm", data = iris,
             trControl = trainControl(method = "LOOCV"))
fit

fit <- train(Sepal.Width ~ ., method = "lm", data = iris,
             trControl = trainControl(method = "boot", number = 1000))
fit
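One thing I checked while comparing these: under LOOCV each holdout fold contains a single observation, so a per-fold R^2 cannot be computed. A sketch of LOOCV done by hand with lm, scoring the pooled out-of-sample predictions with `caret::postResample` (my assumption being that caret also scores LOOCV on the pooled predictions, which is why the two should agree):

```r
library(caret)

# Leave-one-out by hand: refit lm 150 times, predict the held-out row each time
preds <- sapply(seq_len(nrow(iris)), function(i) {
  m <- lm(Sepal.Width ~ ., data = iris[-i, ])
  predict(m, newdata = iris[i, ])
})

# Score all pooled holdout predictions at once
postResample(pred = preds, obs = iris$Sepal.Width)
```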
Later, I ran a manual k-fold CV (non-repeated). It consistently gives worse performance than caret's k-fold but similar to LOOCV and boot. I didn't set seeds, but if you re-run it a few times, R^2 is consistently lower with the manual method. It is unclear why caret is different.
# create folds
library(psych)   # for corr.test()
library(dplyr)   # for select()

iris <- iris[sample(nrow(iris)), ]
folds <- cut(seq(1, nrow(iris)), breaks = 10, labels = FALSE)
results <- data.frame(matrix(NA, nrow = 0, ncol = 1))  # store results

# Perform 10-fold cross-validation
for (i in 1:10) {
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData  <- iris[testIndexes, ]
  trainData <- iris[-testIndexes, ]
  print(nrow(trainData))
  print(nrow(testData))
  OLS <- lm(Sepal.Width ~ ., data = trainData)
  Predicted <- as.data.frame(predict(OLS, newdata = testData))
  results <- rbind(results,
                   corr.test(cbind(dplyr::select(testData, Sepal.Width), Predicted))$r[2, 1])
}
mean(results[, 1])^2
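One candidate explanation I am considering (an assumption, not something I have confirmed in caret's source): the manual loop averages the per-fold correlations and then squares the mean, while caret squares the correlation within each fold and then averages. A base-R sketch contrasting the two aggregation orders:

```r
# 10-fold CV computing one predicted-vs-observed correlation per fold
set.seed(42)
shuffled <- iris[sample(nrow(iris)), ]
folds <- cut(seq_len(nrow(shuffled)), breaks = 10, labels = FALSE)

r_per_fold <- sapply(1:10, function(i) {
  test_i  <- shuffled[folds == i, ]
  train_i <- shuffled[folds != i, ]
  m <- lm(Sepal.Width ~ ., data = train_i)
  cor(predict(m, newdata = test_i), test_i$Sepal.Width)
})

mean(r_per_fold^2)   # square per fold, then average (what I believe caret does)
mean(r_per_fold)^2   # average first, then square (what my manual loop does)
```

By Jensen's inequality, mean(r^2) >= (mean r)^2 whenever the fold correlations vary, so the first number will always be at least as large, which would fit the direction of the gap I'm seeing.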
https://stackoverflow.com/questions/65815873/holdout-results-using-lm-in-caret January 21, 2021 at 02:57AM