I have two documents, and I need to count how often each word occurs in them, together with the name of the document each word comes from. doc1.txt = "I have an apple", doc2.txt = "I live in an apartment". I want to do this with MapReduce, and the output should look like this: ((word, document name),count). Example: ((apple, doc1.txt),1)
    #!/usr/bin/env python
    import sys
    import glob
    #from string import punctuation

    #--- get all lines from stdin ---
    for line in sys.stdin:
        #--- remove leading and trailing whitespace---
        #line=line.translate(None, punctuation).strip('\t')
        line = line.strip()
        #--- split the line into words ---
        words = line.split()
        doc_name = glob.glob("*.txt")
        for doc in doc_name:
            print(doc)
            if doc[] == '':
                for word in words:
                    #word = word.rstrip()
                    key = word + ',' + doc
                    #print '%s\t%s' % (key, "1")

This code prints every word from each document, but it pairs every word with the names of both documents instead of only the document the word came from, like this: (apple, doc1.txt),1 (apple, doc2.txt),1
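The behaviour described above follows from `glob.glob("*.txt")`: it lists every `.txt` file in the working directory, so each word is paired with every file name, regardless of which file the line on stdin came from. A minimal sketch of one possible fix is shown below. It assumes the job runs under Hadoop Streaming, which exposes the path of the file being read by the current map task as an environment variable (`mapreduce_map_input_file`, or `map_input_file` on older versions). The function names, the `map`/`reduce` command-line argument, and the `unknown` fallback are hypothetical choices made for illustration.

```python
#!/usr/bin/env python3
import os
import sys
from itertools import groupby


def current_doc_name(environ=os.environ):
    """Return the base name of the file the current map task is reading."""
    # Assumption: Hadoop Streaming exports the input path as an environment
    # variable; the name differs between versions, so both are tried.
    path = (environ.get("mapreduce_map_input_file")
            or environ.get("map_input_file")
            or "unknown")  # hypothetical fallback when neither is set
    return os.path.basename(path)


def map_lines(lines, doc):
    """Yield one 'word,doc<TAB>1' record for every word on every line."""
    for line in lines:
        for word in line.split():
            yield "%s,%s\t1" % (word, doc)


def reduce_lines(lines):
    """Sum the counts per 'word,doc' key; the input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        # Split on the last comma, assuming document names contain no commas.
        word, doc = key.rsplit(",", 1)
        total = sum(int(count) for _, count in group)
        yield "((%s, %s),%d)" % (word, doc, total)


# Hypothetical convention: pass "map" or "reduce" as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "map":
        records = map_lines(sys.stdin, current_doc_name())
    else:
        records = reduce_lines(sys.stdin)
    for record in records:
        print(record)
```

Because the document name is read once per map task instead of being looped over, each word is emitted only with the file it actually came from. The reducer relies on the framework (or a local `sort`) delivering records grouped by key.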
https://stackoverflow.com/questions/66620982/from-a-set-of-documents-how-to-calculate-the-words-documents-number-and-count-i March 14, 2021 at 12:04PM