A column with short and full form of people names, I want to unify them, if the name is a part of the other name. e.g. "James.J" and "James.Jones", I want to tag them both as "James.J".
data = {'Name': ["Amelia.Smith", "Lucas.M", "James.J", "Elijah.Brown", "Amelia.S", "James.Jones", "Benjamin.Johnson"]} df = pd.DataFrame(data) I can't figure out how to do it in Pandas. So only a xlrd way, with similarity ratio by SequenceMatcher (and sort it manually in Excel):
import xlrd from xlrd import open_workbook,cellname import xlwt from xlutils.copy import copy workbook = xlrd.open_workbook("C:\\TEM\\input.xlsx") old_sheet = workbook.sheet_by_name("Sheet1") from difflib import SequenceMatcher wb = copy(workbook) sheet = wb.get_sheet(0) for row_index in range(0, old_sheet.nrows): current = old_sheet.cell(row_index, 0).value previous = old_sheet.cell(row_index-1, 0).value sro = SequenceMatcher(None, current.lower(), previous.lower(), autojunk=True).ratio() if sro > 0.7: sheet.write(row_index, 1, previous) sheet.write(row_index-1, 1, previous) wb.save("C:\\TEM\\output.xls") What's the nice Pandas way to do it/ Thank you.
https://stackoverflow.com/questions/67428436/pandas-to-mark-both-if-cell-value-is-a-substring-of-another May 07, 2021 at 10:56AM
没有评论:
发表评论