2021年5月6日星期四

Pandas to mark both if cell value is a substring of another

A column with short and full form of people names, I want to unify them, if the name is a part of the other name. e.g. "James.J" and "James.Jones", I want to tag them both as "James.J".

data = {'Name': ["Amelia.Smith",  "Lucas.M",  "James.J",  "Elijah.Brown",  "Amelia.S",  "James.Jones",  "Benjamin.Johnson"]}    df = pd.DataFrame(data)  

I can't figure out how to do it in Pandas. So only a xlrd way, with similarity ratio by SequenceMatcher (and sort it manually in Excel):

import xlrd  from xlrd import open_workbook,cellname  import xlwt  from xlutils.copy import copy     workbook = xlrd.open_workbook("C:\\TEM\\input.xlsx")    old_sheet = workbook.sheet_by_name("Sheet1")    from difflib import SequenceMatcher    wb = copy(workbook)   sheet = wb.get_sheet(0)     for row_index in range(0, old_sheet.nrows):        current = old_sheet.cell(row_index, 0).value      previous = old_sheet.cell(row_index-1, 0).value      sro = SequenceMatcher(None, current.lower(), previous.lower(), autojunk=True).ratio()        if sro > 0.7:          sheet.write(row_index, 1, previous)          sheet.write(row_index-1, 1, previous)    wb.save("C:\\TEM\\output.xls")  

What's the nice Pandas way to do it/ Thank you.

enter image description here

https://stackoverflow.com/questions/67428436/pandas-to-mark-both-if-cell-value-is-a-substring-of-another May 07, 2021 at 10:56AM

没有评论:

发表评论