Saturday, April 24, 2021

Pre-process text string with NLTK

I have a dataframe A containing docid (document ID), title (title of the article), lineid (line ID, i.e. the position of the paragraph within the document), text, and tokencount (the number of whitespace-separated tokens in the text):

       docid title  lineid                                          text  tokencount
    0      0     A       0  shopping and orders have become more com...          66
    1      0     A       1  people wrote to the postal service online...         67
    2      0     A       2   text updates really from the U.S. Postal...         43
    ...
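For reference, a toy stand-in for A might be built like this (the text values are abbreviated placeholders, not the real documents):

    import pandas as pd

    # Toy stand-in for dataframe A; the text values are placeholders.
    A = pd.DataFrame({
        'docid':      [0, 0, 0],
        'title':      ['A', 'A', 'A'],
        'lineid':     [0, 1, 2],
        'text':       ['shopping and orders have become more common ...',
                       'people wrote to the postal service online ...',
                       'text updates really from the U.S. Postal ...'],
        'tokencount': [66, 67, 43],
    })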

I want to create a new dataframe based on A that includes title, lineid, count, and query.

query is a text string containing one or more words, such as "data analysis", "text message", or "shopping and orders".

count is the number of times each query word occurs in the text of that line.

The new dataframe should look like this:

    title   lemma   count   lineid
    A       "data"      0        0
    A       "data"      1        1
    A       "data"      4        2
    A       "shop"      2        0
    A       "shop"      1        1
    A       "shop"      2        2
    B       "data"      4        0
    B       "data"      0        1
    B       "data"      2        2
    B       "shop"      9        0
    B       "shop"      3        1
    B       "shop"      1        2
    ...

How can I write a function that generates this new dataframe?


I have created a new dataframe df from A with a count column:

    df = A[['title', 'lineid']].copy()   # copy so we do not modify a view of A
    df['count'] = 0                      # placeholder column for the counts
    df.set_index(['title', 'lineid'], inplace=True)

I have also created a function that counts the occurrences of the query's words in a string:

    from collections import Counter

    def occurrence_counter(target_string, query):
        # Tally the whitespace-separated tokens of the target string.
        data = Counter(target_string.split())
        count = 0
        # query is a string such as "shopping and orders", so split it into words.
        for key in query.split():
            count += data.get(key, 0)
        return count
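A quick sanity check, using a made-up sentence and query (placeholders, not taken from the real data):

    # "shopping" and "orders" each appear once in the sentence, so this should print 2.
    print(occurrence_counter('shopping and orders have become more common', 'shopping orders'))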

But how can I use both of them in a function that generates the new dataframe?
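For what it's worth, here is a minimal sketch of one way to combine them, assuming A is the dataframe described above and the query is passed in as a plain string (the function name build_count_frame and the sample query are made up for illustration). It builds one row per (query word, line) pair, matching the shape of the target table:

    import pandas as pd

    def build_count_frame(A, query):
        """Return one row per (query word, line) with its occurrence count.

        A     -- source dataframe with 'title', 'lineid' and 'text' columns
        query -- a query string such as 'data shop' (hypothetical example)
        """
        frames = []
        for word in query.split():
            part = A[['title', 'lineid']].copy()
            part['lemma'] = word
            # Count this single word in each row's text via occurrence_counter.
            part['count'] = A['text'].apply(occurrence_counter, query=word)
            frames.append(part)
        return pd.concat(frames, ignore_index=True)[['title', 'lemma', 'count', 'lineid']]

    result = build_count_frame(A, 'data shop')

Note that the 'lemma' column in this sketch simply holds the raw query words; if the target table really requires lemmas, the query words would need to be lemmatized first (e.g. with NLTK's WordNetLemmatizer), which is not shown here.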

https://stackoverflow.com/questions/67248738/pre-process-text-string-with-ntlk April 25, 2021 at 08:40AM
