Thursday, March 25, 2021

Need help creating a sparse data matrix from a long character string of item IDs and item responses

Suppose I have a data.frame composed of responses to a short online math test, where every test-taker answers 5 randomly selected items out of a possible 9 items, and suppose the data come to me like the following:

id    data
tt1   item01 0   item04 1    item03 1    item09 0    item05 0
tt2   item01 1   item06 1    item08 1    item02 0    item04 1
tt3   item05 1   item03 0    item07 1    item09 0    item02 1

I managed to parse the variable data, which is just a character string of five item-ID-and-right/wrong pairs, into a matrix of item responses that looks like this:

[1]      [2]      [3]      [4]      [5]
0        1        1        0        0
1        1        1        0        1
1        0        1        0        1

However, since test-takers were given a random selection of items, the columns don't correspond to unique items across test-takers. What I really need is a sparse matrix like this:

item01    item02    item03    item04    item05    item06    item07    item08    item09
0         NA        1         1         0         NA        NA        NA        0
1         0         NA        1         NA        1         NA        1         NA
NA        1         0         NA        1         NA        1         NA        0

My incredibly unwieldy and inefficient method is to create a data.frame with three columns (test-taker ID, item ID, and item score), pre-populated with every combination of test-taker and item and with all item scores set to NA. Then I go through every row of that data.frame, look up whether that test-taker saw that item, and, if so, record whether they got a 0 or a 1 on it. Finally, I transform the long data set into a wide data set, which gives me my sparse matrix of item responses. I know there has to be a better way to do this, but I either can't find the tidyr function that I know must exist out there somewhere, or I can't wrap my head around it, or both.

Does anyone know of a more efficient method to create this sparse matrix than the monstrosity that I've come up with?

My actual code is below, and a sample data set is here: https://1drv.ms/u/s!AucYBk7HiTv6mcI7emgpAuI8dEYO-Q?e=fpI0lb

library(tidyr) # for function 'spread()'

M <- nrow(dat) # number of test-takers

# Get start and end position of each item ID in the long character string
varstart <- as.numeric(gregexpr(pattern = '[a-zA-Z]{2}\\d{6}', text = dat$TEST_DATA[1], perl = TRUE)[[1]])
varend <- varstart + 7

# Get start and end position of each score (0/1) in the long character string
scorestart <- varstart + 13
scoreend <- varstart + 15

# Create non-sparse item response matrix
tmp_data <- lapply(dat$TEST_DATA, function(y) {
  as.numeric(substring(y, scorestart, scoreend))
})
data <- matrix(unlist(tmp_data), ncol = length(scorestart), byrow = TRUE)

# Get all unique item IDs
items <- lapply(dat$TEST_DATA, substring, varstart, varend)
all_items <- sort(unique(unlist(items))) # all unique items

# Create long data set
ldat <- as.data.frame(matrix(NA, nrow = length(all_items) * length(dat$ID), ncol = 3))
names(ldat) <- c('ID', 'Item', 'Score')
for (i in 1:length(dat$ID)) {                                     # for each test-taker
  start <- (i-1) * length(all_items) + 1                          # start position in the long data set for this test-taker
  end <- start + length(all_items) - 1                            # end position in the long data set for this test-taker
  ldat$ID[start:end] <- rep(dat$ID[i], times = length(all_items)) # fill in the test-taker ID
  ldat$Item[start:end] <- all_items                               # fill in the unique item IDs
  for (j in 1:length(all_items)) {                                # for each unique item
    jpos <- start + j - 1                                         # track position in the long data set
    if (all_items[j] %in% items[[i]]) {                           # if test-taker i saw item j
      num <- which(items[[i]] == all_items[j])                    # find position of item j in test-taker i's item responses
      ldat$Score[jpos] <- data[i,num]                             # fill in 0/1 as appropriate
    }
  }
}

# Transform long data to wide
final_data <- spread(ldat, key='Item', value='Score')
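For comparison, here is a minimal sketch of one possible shortcut, written against the small toy example above rather than the real file. It assumes the columns are named id and data and that every pair looks like "itemNN 0" or "itemNN 1"; the real TEST_DATA strings use a different fixed-width layout, so the regular expression would need adjusting. The idea is to build only the observed (test-taker, item, score) rows and then drop them into an NA-filled matrix with a single indexed assignment, which avoids the nested loops.

```r
# Toy data mirroring the example in the question (column names are assumptions)
dat <- data.frame(
  id = c("tt1", "tt2", "tt3"),
  data = c(
    "item01 0   item04 1    item03 1    item09 0    item05 0",
    "item01 1   item06 1    item08 1    item02 0    item04 1",
    "item05 1   item03 0    item07 1    item09 0    item02 1"
  ),
  stringsAsFactors = FALSE
)

# Pull every "itemID score" pair out of each string (one character vector per row)
pairs <- regmatches(dat$data,
                    gregexpr("item[0-9]+\\s+[01]", dat$data, perl = TRUE))
flat <- unlist(pairs)

# Long format: one row per item that a test-taker actually saw
long <- data.frame(
  id    = rep(dat$id, lengths(pairs)),
  item  = sub("\\s+[01]$", "", flat, perl = TRUE),
  score = as.numeric(sub("^.*\\s+", "", flat, perl = TRUE)),
  stringsAsFactors = FALSE
)

# Start from an all-NA matrix, then fill only the observed cells
items <- sort(unique(long$item))
resp <- matrix(NA_real_, nrow = nrow(dat), ncol = length(items),
               dimnames = list(dat$id, items))
resp[cbind(match(long$id, dat$id), match(long$item, items))] <- long$score

print(resp)
```

If a tidyr solution is preferred, the same long data.frame could instead be reshaped with pivot_wider(long, names_from = item, values_from = score), which leaves unseen items as NA by default; the columns would then appear in order of first occurrence rather than sorted.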
https://stackoverflow.com/questions/66810082/need-help-creating-a-sparse-data-matrix-from-a-long-character-string-of-item-ids March 26, 2021 at 10:04AM
