Suppose I have a data.frame composed of responses to a short online math test, where every test-taker answers 5 randomly selected items out of a possible 9 items, and suppose the data come to me like the following:
id   data
tt1  item01 0 item04 1 item03 1 item09 0 item05 0
tt2  item01 1 item06 1 item08 1 item02 0 item04 1
tt3  item05 1 item03 0 item07 1 item09 0 item02 1
I managed to parse the variable data, which is just a character string of five item-ID-and-right/wrong pairs, into a matrix of item responses that looks like this:
[1] [2] [3] [4] [5]
 0   1   1   0   0
 1   1   1   0   1
 1   0   1   0   1
However, since test-takers were given a random selection of items, the columns don't correspond to unique items across test-takers. What I really need is a sparse matrix like this:
item01 item02 item03 item04 item05 item06 item07 item08 item09
     0     NA      1      1      0     NA     NA     NA      0
     1      0     NA      1     NA      1     NA      1     NA
    NA      1      0     NA      1     NA      1     NA      0
My incredibly unwieldy and inefficient method is to create a data.frame with three columns (test-taker ID, item ID, and item score), pre-populated with every combination of test-taker and item and with all item scores set to NA. Then I go through every row of that data.frame, look up whether that test-taker saw that item, and, if so, record whether they got a 0 or a 1 on it. Finally, I transform the long data set into a wide data set, which is essentially my sparse matrix of item responses. I know there has to be a better way to do this, but I either can't find the tidyr function that I know must exist out there somewhere, or I can't wrap my head around it, or both.
Does anyone know of a more efficient method to create this sparse matrix than the monstrosity that I've come up with?
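One candidate worth trying is tidyr's pivot_wider(), the successor to spread(), applied to a long table that contains only the items each test-taker actually saw. The sketch below runs on the toy example from the top of the post, not on the real file. The names toy, tokens, long, and wide are made up for illustration, and it assumes the pairs are separated by single spaces, which may not hold for the real TEST_DATA strings.

```r
library(tidyr)  # pivot_wider() needs tidyr >= 1.0.0

# Toy data copied from the example at the top of the post
toy <- data.frame(
  id   = c("tt1", "tt2", "tt3"),
  data = c("item01 0 item04 1 item03 1 item09 0 item05 0",
           "item01 1 item06 1 item08 1 item02 0 item04 1",
           "item05 1 item03 0 item07 1 item09 0 item02 1"),
  stringsAsFactors = FALSE
)

# Split each string on whitespace: odd tokens are item IDs, even tokens are scores
tokens <- strsplit(toy$data, "\\s+")

# Long format with one row per (test-taker, item actually seen)
long <- data.frame(
  id    = rep(toy$id, times = lengths(tokens) / 2),
  item  = unlist(lapply(tokens, function(x) x[c(TRUE, FALSE)])),
  score = as.numeric(unlist(lapply(tokens, function(x) x[c(FALSE, TRUE)]))),
  stringsAsFactors = FALSE
)

# Wide format: items a test-taker never saw are filled with NA by default
wide <- pivot_wider(long, names_from = item, values_from = score)

# Columns come out in order of first appearance, so sort the item columns
wide <- wide[, c("id", sort(setdiff(names(wide), "id")))]
```

Because the long table never contains the unseen combinations, there is nothing to pre-populate or look up row by row; the reshape step introduces the NAs on its own.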
My actual code is below, and a sample data set is here: https://1drv.ms/u/s!AucYBk7HiTv6mcI7emgpAuI8dEYO-Q?e=fpI0lb
library(tidyr) # for function 'spread()'

M <- nrow(dat) # number of test-takers

# Get start and end position of each item ID in the long character string
varstart <- as.numeric(gregexpr(pattern = '[a-zA-Z]{2}\\d{6}', text = dat$TEST_DATA[1], perl = TRUE)[[1]])
varend <- varstart + 7

# Get start and end position of each score (0/1) in the long character string
scorestart <- varstart + 13
scoreend <- varstart + 15

# Create non-sparse item response matrix
tmp_data <- lapply(dat$TEST_DATA, function(y) {
  as.numeric(substring(y, scorestart, scoreend))
})
data <- matrix(unlist(tmp_data), ncol = length(scorestart), byrow = TRUE)

# Get all unique item IDs
items <- lapply(dat$TEST_DATA, substring, varstart, varend)
all_items <- sort(unique(unlist(items))) # all unique items

# Create long data set
ldat <- as.data.frame(matrix(NA, nrow = length(all_items) * length(dat$ID), ncol = 3))
names(ldat) <- c('ID', 'Item', 'Score')
for (i in 1:length(dat$ID)) { # for each test-taker
  start <- (i-1) * length(all_items) + 1 # start position in the long data set for this test-taker
  end <- start + length(all_items) - 1 # end position in the long data set for this test-taker
  ldat$ID[start:end] <- rep(dat$ID[i], times = length(all_items)) # fill in the test-taker ID
  ldat$Item[start:end] <- all_items # fill in the unique item IDs
  for (j in 1:length(all_items)) { # for each unique item
    jpos <- start + j - 1 # track position in the long data set
    if (all_items[j] %in% items[[i]]) { # if test-taker i saw item j
      num <- which(items[[i]] == all_items[j]) # find position of item j in test-taker i's item responses
      ldat$Score[jpos] <- data[i,num] # fill in 0/1 as appropriate
    }
  }
}

# Transform long data to wide
final_data <- spread(ldat, key='Item', value='Score')
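The nested loop could probably also be replaced in base R by filling an all-NA matrix through a two-column (row, column) index matrix. This is only a sketch under assumptions: ids, items, and scores are toy stand-ins for dat$ID, items, and data from the code above, with values taken from the example at the top of the post, and final_matrix, row_idx, and col_idx are names invented here. It also assumes each test-taker sees any given item at most once.

```r
# Toy stand-ins (assumed values from the example at the top of the post)
ids <- c("tt1", "tt2", "tt3")
items <- list(
  c("item01", "item04", "item03", "item09", "item05"),
  c("item01", "item06", "item08", "item02", "item04"),
  c("item05", "item03", "item07", "item09", "item02")
)
scores <- matrix(c(0, 1, 1, 0, 0,
                   1, 1, 1, 0, 1,
                   1, 0, 1, 0, 1), ncol = 5, byrow = TRUE)

all_items <- sort(unique(unlist(items)))

# Start from an all-NA matrix: one row per test-taker, one column per unique item
final_matrix <- matrix(NA_real_,
                       nrow = length(ids), ncol = length(all_items),
                       dimnames = list(ids, all_items))

# Row index: test-taker number repeated once per item that test-taker saw
row_idx <- rep(seq_along(items), times = lengths(items))

# Column index: position of each seen item within the sorted item list
col_idx <- match(unlist(items), all_items)

# as.vector(t(scores)) lists scores test-taker by test-taker,
# which is the same order as unlist(items)
final_matrix[cbind(row_idx, col_idx)] <- as.vector(t(scores))
```

This assigns every observed score in one vectorized step, so the work grows with the number of responses rather than with test-takers times items. The result is an ordinary numeric matrix with NAs; as.data.frame(final_matrix) would convert it if a data.frame is needed.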
https://stackoverflow.com/questions/66810082/need-help-creating-a-sparse-data-matrix-from-a-long-character-string-of-item-ids March 26, 2021 at 10:04AM