Friday, January 15, 2021

How to keep a variable in fit$model for lm() in R that I'm *not* using within the lm call itself?

I want to be able to index my model after having fit the model. Say I have

df <- data.frame(a = c(1, 2, 3),
                 b = c(2, 3, 1000),
                 country = c("Malawi", "USA", "UK"))

Then, I run:

fit<-lm(a~b,data=df)  

My resulting fit$model no longer has the "country" variable, so it becomes hard to do things like

  • run a regression and then remove certain countries as robustness tests.
  • run a regression and then find out which countries were outliers.

I know there are 'hacks' around this, like using row indices, but I frequently find myself further subsetting the original dataset, and I am wary of losing track of which row index refers to which observation.

For example, in the data above, I can see that the UK is an outlier.

So, I have two options:

lm(a~b,data=fit$model[-3,])
lm(a~b,data=df[df$country!="UK",])

The second option is much clearer to me, but because summary statistics and diagnostics in R (such as Cook's distance) only give me the row index, I end up having to use the first option much more than I would like. This becomes especially tedious in large panel datasets, where I'm trying to test robustness to outliers or high-leverage observations and would also like to know which countries (or other variables) those observations belong to.
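To make the row-index problem concrete, here is a minimal sketch of one way to map Cook's distance back to country names. It assumes that df keeps its original row names, and that the diagnostics are named by those row names, which is how lm() labels them for a data frame like this one. The names influence_by_country and worst_country are made up for illustration.

```r
df <- data.frame(a = c(1, 2, 3),
                 b = c(2, 3, 1000),
                 country = c("Malawi", "USA", "UK"))
fit <- lm(a ~ b, data = df)

# Cook's distances come back as a vector named by the row names
# of the observations that were actually used in the fit
cd <- cooks.distance(fit)

# Look the labels up in the original data by row NAME, not by position
# (assumption: df still has the row names it had when the model was fit)
influence_by_country <- data.frame(
  country = df[names(cd), "country"],
  cooks_d = unname(cd)
)

# Hypothetical helper value: the country with the largest Cook's distance
worst_country <- influence_by_country$country[which.max(influence_by_country$cooks_d)]
print(influence_by_country)
```

Because the lookup goes through row names rather than positions, it should keep working after df has been subset, as long as the row names themselves are not reset.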

Ideally, I'd like an option to do something like

lm(a~b,data=fit$model[fit$model$country!="UK",])  
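One possible way to get close to this, sketched here rather than offered as a definitive answer, is to copy the model frame and attach country to it by matching row names. This assumes the row names of df are unique and unchanged since the fit. The names mf and refit are made up for illustration.

```r
df <- data.frame(a = c(1, 2, 3),
                 b = c(2, 3, 1000),
                 country = c("Malawi", "USA", "UK"))
fit <- lm(a ~ b, data = df)

# Copy the model frame so the fitted object itself is left untouched
mf <- fit$model

# Attach the label column by matching row names
# (assumption: row names of df are unique and were not reset after fitting)
mf$country <- df[rownames(mf), "country"]

# Refit without the UK, filtering on the label instead of a row index
refit <- lm(a ~ b, data = mf[mf$country != "UK", ])
```

Rows dropped by na.action are also missing from fit$model, so matching on row names should stay aligned even when the original data contain missing values.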

Please help, and thank you so much!

https://stackoverflow.com/questions/65743637/how-to-keep-a-variable-in-fitmodel-for-lm-in-r-that-im-not-using-within-th January 16, 2021 at 05:16AM
