I have a dataframe A containing docid (document ID), title (title of the article), lineid (line ID, i.e., the position of the paragraph), text, and tokencount (the number of whitespace-separated tokens):

   docid title  lineid                                          text  tokencount
0      0     A       0   shopping and orders have become more com...          66
1      0     A       1  people wrote to the postal service online...          67
2      0     A       2   text updates really from the U.S. Postal...          43
...

I want to create a new dataframe based on A that includes title, lineid, count, and query.
query is a text string containing one or more words, such as "data analysis", "text message", or "shopping and orders".
count is the number of occurrences of each query word in the paragraph's text.
The new dataframe should look like this:
title  lemma   count  lineid
A      "data"  0      0
A      "data"  1      1
A      "data"  4      2
A      "shop"  2      0
A      "shop"  1      1
A      "shop"  2      2
B      "data"  4      0
B      "data"  0      1
B      "data"  2      2
B      "shop"  9      0
B      "shop"  3      1
B      "shop"  1      2
...

How can I write a function that generates this new dataframe?
I have created a new dataframe df from A with a count column:

df = A[['title', 'lineid']]
df['count'] = 0
df.set_index(['title', 'lineid'], inplace=True)

I have also created a function that counts the query words:
from collections import Counter

def occurrence_counter(target_string, query):
    # query is expected to be an iterable of words (e.g. a list),
    # not a single string, or the loop would iterate over characters
    data = dict(Counter(target_string.split()))
    count = 0
    for key in query:
        if key in data:
            count += data[key]
    return count

But how can I use both of them together in a function that generates the new dataframe?
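A minimal sketch of one way to combine the two pieces, assuming the query is split on whitespace and matched by exact token (no NLTK lemmatization, which a real "lemma" column would need; build_count_df is a hypothetical helper name):

```python
from collections import Counter

import pandas as pd

def occurrence_counter(target_string, query_words):
    # Return a per-word count of how often each query word
    # appears among the whitespace tokens of target_string.
    data = Counter(target_string.split())
    return {word: data.get(word, 0) for word in query_words}

def build_count_df(A, query):
    # For every (title, lineid) row of A, emit one output row
    # per query word with its occurrence count in that paragraph.
    query_words = query.split()
    rows = []
    for _, row in A.iterrows():
        counts = occurrence_counter(row['text'], query_words)
        for word, count in counts.items():
            rows.append({'title': row['title'],
                         'lemma': word,
                         'count': count,
                         'lineid': row['lineid']})
    return pd.DataFrame(rows, columns=['title', 'lemma', 'count', 'lineid'])
```

This keeps the long format of the desired output (one row per title/lemma/lineid combination); matching "shop" against "shopping" as in the example would additionally require stemming or lemmatizing each token before counting.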
https://stackoverflow.com/questions/67248738/pre-process-text-string-with-ntlk April 25, 2021 at 08:40AM