I have a dataframe A containing docid (document ID), title (title of the article), lineid (line ID, i.e. the position of the paragraph), text, and tokencount (the word count, including whitespace):
   docid title  lineid                                           text  tokencount
0      0     A       0    shopping and orders have become more com...          66
1      0     A       1   people wrote to the postal service online...          67
2      0     A       2    text updates really from the U.S. Postal...          43
...
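For reference, a small reproducible stand-in for A can be built like this (the text values are the abbreviated placeholders from the sample above, not the full paragraphs):

import pandas as pd

A = pd.DataFrame({
    'docid': [0, 0, 0],
    'title': ['A', 'A', 'A'],
    'lineid': [0, 1, 2],
    'text': ['shopping and orders have become more com...',
             'people wrote to the postal service online...',
             'text updates really from the U.S. Postal...'],
    'tokencount': [66, 67, 43],
})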
I want to create a new dataframe based on A with the columns title, lineid, count, and query. query is a text string containing one or more words, such as "data analysis", "text message", or "shopping and orders". count is the number of times each word of the query appears in the text of that line.
The new dataframe should look like this:
title  lemma   count  lineid
A      "data"      0       0
A      "data"      1       1
A      "data"      4       2
A      "shop"      2       0
A      "shop"      1       1
A      "shop"      2       2
B      "data"      4       0
B      "data"      0       1
B      "data"      2       2
B      "shop"      9       0
B      "shop"      3       1
B      "shop"      1       2
...
How can I write a function that generates this new dataframe?
I have created a new dataframe df from A with a count column:
df = A[['title','lineid']].copy()   # .copy() avoids a SettingWithCopyWarning when adding the new column
df['count'] = 0
df.set_index(['title','lineid'], inplace=True)
Also, I have created a function to count the occurrences of the query words in a string:
from collections import Counter

def occurrence_counter(target_string, query):
    # count each word in the target string
    data = dict(Counter(target_string.split()))
    # sum the occurrences of every query word that appears in the string
    count = 0
    for key in query:
        if key in data:
            count += data[key]
    return count
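For example, assuming query is passed as a list of individual words rather than a single string:

occurrence_counter("shopping and orders have become more common", ["shopping", "orders"])   # returns 2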
But how can I use both of them in a function that builds the new dataframe?
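For reference, here is a minimal sketch of one way the two pieces could be combined, assuming the queries are supplied as a dict mapping a lemma label (e.g. "shop") to its list of words; the name build_counts and that dict structure are my own assumptions, not part of the original code:

import pandas as pd
from collections import Counter

def build_counts(A, queries):
    # queries: hypothetical structure, e.g.
    # {"data": ["data", "analysis"], "shop": ["shopping", "orders"]}
    rows = []
    for _, row in A.iterrows():
        # word frequencies for this paragraph's text
        word_counts = Counter(str(row['text']).split())
        for lemma, words in queries.items():
            # total occurrences of all words belonging to this lemma
            count = sum(word_counts[w] for w in words)
            rows.append({'title': row['title'],
                         'lemma': lemma,
                         'count': count,
                         'lineid': row['lineid']})
    return pd.DataFrame(rows, columns=['title', 'lemma', 'count', 'lineid'])

# usage: build_counts(A, {"data": ["data", "analysis"], "shop": ["shopping", "orders"]})
# returns one row per (title, lemma, lineid) combination, matching the target layout

This is only a sketch under those assumptions; the per-row loop could be replaced with a vectorized or groupby-based approach for larger data.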