Friday, February 5, 2021

CountVectorizer with PyTorch neural network, Bag-of-Words running out of memory

I have a dataset with 100,000 rows. I load it into a dataframe, shuffle it, and split it into train and test sets:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Read TSV content into a dataframe.
    df = pd.read_csv(data_location, delimiter='\t', nrows=300)  # TODO delete nrows

    # Shuffle data.
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)

    # Split into train and test sets. A 0.8 train / 0.2 test split is a common setup.
    # Balance target classes by stratifying on the 'source' column.
    X_train, X_test, y_train, y_test = train_test_split(df['text'],
                                                        df['source'],
                                                        test_size=0.2,
                                                        stratify=df['source'],
                                                        random_state=seed)
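
A quick sanity check (a sketch, not part of the original code) can confirm that stratification kept the class proportions identical in both splits:

    # Class proportions should match between the two splits.
    print(y_train.value_counts(normalize=True))
    print(y_test.value_counts(normalize=True))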

Then I build the vocabulary with CountVectorizer, limiting it to only around 100 features:

    from sklearn.feature_extraction.text import CountVectorizer

    # Prepare CountVectorizer and learn the vocabulary.
    vectorizer = CountVectorizer(max_features=max_features,
                                 lowercase=lowercase,
                                 binary=binary,
                                 min_df=min_df,
                                 ngram_range=ngram)
    vectorizer.fit(df['text'])
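
To sanity-check the cap, the fitted vocabulary can be inspected directly (a minimal sketch; get_feature_names_out assumes scikit-learn >= 1.0, while older releases spell it get_feature_names):

    vocab = vectorizer.get_feature_names_out()
    print(len(vocab))  # at most max_features entries
    print(vocab[:10])  # first few terms, sorted alphabetically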

Then I transform the data into BoW matrices and encode the labels as numbers:

    from sklearn.preprocessing import LabelEncoder

    # Vectorize the text; transform() returns sparse CSR matrices.
    X_train = vectorizer.transform(X_train)
    X_test = vectorizer.transform(X_test)

    # Convert categorical labels to numbers. Fit the encoder on the training
    # labels only, then reuse the same mapping for the test labels.
    LE = LabelEncoder()
    y_train = LE.fit_transform(y_train)
    y_test = LE.transform(y_test)
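
At this point X_train and X_test are scipy CSR matrices, which store only the non-zero counts. A quick check (a sketch; the shape in the comment assumes the full 100,000-row dataset with a 0.8 split) shows how compact they are before any dense conversion:

    print(X_train.shape)        # e.g. (80000, 100)
    print(X_train.nnz)          # number of stored non-zero counts
    print(X_train.data.nbytes)  # bytes used by the stored values alone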

Now I convert the matrices to PyTorch tensors:

    import torch

    # Convert the sparse CSR matrices to (dense) tensors.
    X_train = convert_csr_to_tensor(X_train)
    X_test = convert_csr_to_tensor(X_test)

    # Convert targets to tensors. PyTorch losses such as CrossEntropyLoss
    # expect class indices as 64-bit longs.
    y_train = torch.from_numpy(y_train).long()
    y_test = torch.from_numpy(y_test).long()

The convert_csr_to_tensor function looks like this:

    import numpy as np
    import torch

    def convert_csr_to_tensor(X):
        # Go through COO format to get explicit (row, col) index pairs.
        coo = X.tocoo()
        values = coo.data
        indices = np.vstack((coo.row, coo.col))
        i = torch.LongTensor(indices)
        v = torch.FloatTensor(values)
        shape = coo.shape
        # Build a sparse tensor, then materialize it as a dense matrix.
        return torch.sparse.FloatTensor(i, v, torch.Size(shape)).to_dense()
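
If the final to_dense() call turns out to be the culprit, one commonly suggested workaround (a sketch under assumptions, not the code above; SparseBowDataset, X_train_csr, and the batch size are hypothetical names/values) is to keep the CSR matrix as-is and densify only one minibatch at a time inside a Dataset:

    import numpy as np
    import torch
    from torch.utils.data import DataLoader, Dataset

    class SparseBowDataset(Dataset):
        """Wraps a scipy CSR matrix and densifies one row at a time."""
        def __init__(self, X_csr, y):
            self.X = X_csr  # stays sparse; no full dense copy is made
            self.y = y

        def __len__(self):
            return self.X.shape[0]

        def __getitem__(self, idx):
            # toarray() on a single row is cheap; at most one batch
            # of rows is ever dense in memory.
            row = self.X[idx].toarray().squeeze(0).astype(np.float32)
            return torch.from_numpy(row), self.y[idx]

    # Usage: pass the CSR matrix (i.e. skip convert_csr_to_tensor) together
    # with the encoded labels; the DataLoader then yields dense minibatches.
    train_loader = DataLoader(SparseBowDataset(X_train_csr, y_train),
                              batch_size=64, shuffle=True)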

However, when I operate on these tensors, I soon run out of memory. Where am I going wrong?
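
For reference, a back-of-envelope estimate of what the dense float32 matrix costs (the 50,000-feature figure is purely hypothetical, to show the scaling):

    # Dense float32 memory = rows * features * 4 bytes.
    print(100_000 * 100 * 4 / 1e6)     # 40.0 -> MB at 100 features
    print(100_000 * 50_000 * 4 / 1e9)  # 20.0 -> GB at 50,000 features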

https://stackoverflow.com/questions/66073183/countvectorizer-with-pytorch-neural-network-bag-of-words-running-out-of-memory February 06, 2021 at 11:01 AM
