Monday, December 21, 2020

Removing stopwords from R data frame column

Here's a problem whose solution seemed simple at first but has turned out to be more complicated than I expected.

I have an R data frame with three columns: an ID, a column with texts (reviews), and one with numeric values which I want to predict based on the text.

I have already done some preprocessing on the text column, so it is free of punctuation, in lower case, and ready to be tokenized and turned into a matrix so I can train a model on it. The problem is I can't figure out how to remove the stop words from that text.

Here's what I am trying to do with the text2vec package. At first I planned to remove the stop words before this chunk, but any point in the pipeline will do.

    library(text2vec)

    test_data <- data.frame(review_id=c(1,2,3),
                            review=c('is a masterpiece a work of art',
                                     'sporting some of the best writing and voice work',
                                     'better in every possible way when compared'),
                            score=c(90, 100, 100))

    tokens <- word_tokenizer(test_data$review)
    document_term_matrix <- create_dtm(itoken(tokens), hash_vectorizer())
    model_tfidf <- TfIdf$new()
    document_term_matrix <- model_tfidf$fit_transform(document_term_matrix)

    document_term_matrix <- as.matrix(document_term_matrix)

I am hoping to get the review column to be something like:

    review=c('masterpiece work art',
             'sporting best writing voice work',
             'better possible way compared')

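One way to get there, sketched in base R only: split each review into words, drop the words found in a stop-word list, and paste the rest back together. The `stop_words` vector below is a hypothetical, hand-picked list chosen just to fit the three example reviews. It is not a standard list, and a fuller one (for example from the `stopwords` or `tm` packages) may keep or drop different words, such as "every".

```r
# Toy data copied from the question.
test_data <- data.frame(
  review_id = c(1, 2, 3),
  review = c("is a masterpiece a work of art",
             "sporting some of the best writing and voice work",
             "better in every possible way when compared"),
  score = c(90, 100, 100),
  stringsAsFactors = FALSE
)

# Hypothetical stop-word list, hand-picked for this example only.
stop_words <- c("is", "a", "of", "some", "the", "and", "in", "every", "when")

# The text is already lower case and free of punctuation,
# so splitting on whitespace is enough here.
tokens <- strsplit(test_data$review, "\\s+")

# Keep only the words that are not in the stop-word list.
tokens <- lapply(tokens, function(words) words[!words %in% stop_words])

# Paste the remaining words back together and overwrite the column.
test_data$review <- vapply(tokens, paste, character(1), collapse = " ")
```

Since `tokens` is a list of character vectors, it should also be usable as the input to `itoken()` in place of the `word_tokenizer()` result, so the rest of the pipeline would not need to change.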
https://stackoverflow.com/questions/65401533/removing-stopwords-from-r-data-frame-column December 22, 2020 at 07:55AM
