Saturday, March 13, 2021

How to count words per document, across a set of documents, in Python for MapReduce

I have two documents, and I need to count how many times each word occurs in each one, keeping the document name with every word. For example, doc1.txt = "I have an apple" and doc2.txt = "I live in an apartment". I want to run MapReduce so that the output looks like ((word, document name), count), for example ((apple, doc1.txt), 1).

#!/usr/bin/env python

import sys
import glob
#from string import punctuation

#--- get all lines from stdin ---
for line in sys.stdin:
    #--- remove leading and trailing whitespace---
    #line=line.translate(None, punctuation).strip('\t')
    line = line.strip()

    #--- split the line into words ---
    words = line.split()
    doc_name = glob.glob("*.txt")
    for doc in doc_name:
        print(doc)
        if doc[] == '':
            for word in words:
                #word = word.rstrip()
                key = word+ ',' +doc
                #print '%s\t%s' % (key, "1")

This code prints every word from each document, but it pairs each word with the names of both documents instead of only the one it came from, like this: (apple, doc1.txt),1 (apple, doc2.txt),1
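The pairing with both names is consistent with looping over glob.glob("*.txt") for every input line: the glob lists all files and says nothing about which file the current line came from. A minimal sketch of one way around that follows. It assumes the job runs under Hadoop Streaming, which exposes the current input path to the mapper as the environment variable mapreduce_map_input_file (map_input_file on older versions). The helper names current_doc_name, map_line and reduce_pairs are hypothetical, and the dictionary at the bottom only simulates the two documents locally; it is not part of a real job.

```python
import os
from collections import Counter


def current_doc_name(environ=os.environ):
    """Return the base name of the file this mapper is reading.

    Assumption: Hadoop Streaming exports the input path as an
    environment variable; fall back to "unknown" when it is absent.
    """
    path = (environ.get("mapreduce_map_input_file")
            or environ.get("map_input_file")
            or "unknown")
    return os.path.basename(path)


def map_line(line, doc):
    """Map step: emit one ((word, doc), 1) pair per word on the line."""
    return [((word, doc), 1) for word in line.split()]


def reduce_pairs(pairs):
    """Reduce step: sum the counts for each (word, doc) key."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts


# Local simulation using the two documents from the question.
docs = {"doc1.txt": "I have an apple", "doc2.txt": "I live in an apartment"}
pairs = []
for doc, text in docs.items():
    pairs.extend(map_line(text, doc))
counts = reduce_pairs(pairs)

for (word, doc), n in sorted(counts.items()):
    print("((%s, %s),%d)" % (word, doc, n))
```

In an actual streaming mapper, the idea would be to call current_doc_name() once, loop over sys.stdin, and print one tab-separated "word,doc" key with a count of 1 per word, leaving the summing to a separate reducer script.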

https://stackoverflow.com/questions/66620982/from-a-set-of-documents-how-to-calculate-the-words-documents-number-and-count-i March 14, 2021 at 12:04PM
