I have a dataset with 100,000 rows. I load it into a dataframe, shuffle it, and split it into train and test sets:
# Read tsv content to dataframe.
df = pd.read_csv(data_location, delimiter='\t', nrows=300)  # TODO delete nrows

# Shuffle data.
df = df.sample(frac=1, random_state=seed).reset_index(drop=True)

# Split into train and development sets. 0.8 train and 0.2 test split is a common setup.
# Balance target classes by setting the 'stratify' parameter to the label column.
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['source'], test_size=0.2, stratify=df['source'], random_state=seed)
Then I get the vocabulary using CountVectorizer. I limit it to only around 100 features:
# Prepare CountVectorizer and get vocabulary.
vectorizer = CountVectorizer(max_features=max_features, lowercase=lowercase,
                             binary=binary, min_df=min_df, ngram_range=ngram)
vectorizer.fit(df['text'])
Then I transform the data to BoW matrices and the labels to numbers:
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

# Convert categorical labels to numbers.
LE = LabelEncoder()
y_train = LE.fit_transform(y_train)
y_test = LE.fit_transform(y_test)
Now I convert the matrices to PyTorch tensors.
# Convert sparse csr matrices to tensors.
X_train = convert_csr_to_tensor(X_train)
X_test = convert_csr_to_tensor(X_test)

# Convert targets to tensors. Requires longs for some reason.
y_train = torch.from_numpy(y_train).long()
y_test = torch.from_numpy(y_test).long()
The convert_csr_to_tensor function looks like this:
def convert_csr_to_tensor(X):
    # Convert a scipy CSR matrix to a dense PyTorch tensor via COO format.
    coo = X.tocoo()
    values = coo.data
    indices = np.vstack((coo.row, coo.col))
    i = torch.LongTensor(indices)
    v = torch.FloatTensor(values)
    shape = coo.shape
    return torch.sparse.FloatTensor(i, v, torch.Size(shape)).to_dense()
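For illustration, here is a tiny self-contained run of that function on a made-up scipy CSR matrix (the values are arbitrary; it just shows that the result comes back fully densified as float32):

import numpy as np
import torch
from scipy.sparse import csr_matrix

# Tiny made-up example: 3 "documents" x 5 vocabulary terms.
demo = csr_matrix(np.array([[0, 1, 0, 2, 0],
                            [1, 0, 0, 0, 0],
                            [0, 0, 3, 0, 1]]))
dense = convert_csr_to_tensor(demo)
print(dense.shape, dense.dtype)  # torch.Size([3, 5]) torch.float32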
However, when I operate on these tensors, I soon run out of memory. Where am I going wrong?
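For reference, a rough way to inspect how much memory the densified feature tensors occupy (a minimal sketch using the variables above; element_size() and nelement() are standard tensor methods):

# Rough size check: bytes per element times number of elements, in megabytes.
def tensor_size_mb(t):
    return t.element_size() * t.nelement() / 1024 ** 2

print('X_train:', round(tensor_size_mb(X_train), 1), 'MB')
print('X_test:', round(tensor_size_mb(X_test), 1), 'MB')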