I have a massive dataframe df with around 10 million rows:
df.sort_values(['pair','x1','x2']) x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair ------------------------------------------------------------------------------- A male H female a male d male 0 A male W male a male d male 0 (*) A male KK female a male d male 0 (**) B female C male a male d male 0 (-) B female W male a male d male 0 (*) B female BB female a male d male 0 B female KK female a male d male 0 (**) F male W male a male d male 0 (*) A male T female b female d male 1 A male BB female b female d male 1 B female C male b female d male 1 (-) D male E male b female d male 1 A male C male b female e female 2 ... Each column can be explained by the following:
-
x1genis a gender data ofx1,x2genis ofx2, and so on. -
x1citesy1andx2citesy2. - Each pair of
y1andy2is assigned a uniquepairvalue.
My objective is to find four values per unique pair:
-
malecitingmale -
malecitingfemale -
femalecitingmale -
femalecitingfemale
where, each citation network should not be counted more than once.
For example, in the given sample, x2 = W is appeared three times in pair = 0 (see (*)), so it should be counted once, not three times. Same applies to x2 = KK in pair = 0 (see (**)). However, we can count the same reference if it is a new pair. (C -> d in (-) is counted separately once per pair = 0 and pair = 1)
Hence, for the first pair pair = 0, the objective values are:
-
malecitingmale= 4 (A -> a, F -> a, W -> d, C -> d) -
malecitingfemale= 0 -
femalecitingmale= 4 (B -> a, H -> d, KK -> d, BB -> d) -
femalecitingfemale= 0
What I initially did was using a for loop and a set of if loops and creating four lists separately for x1 and x2:
mm = [1] mf = [0] fm = [0] ff = [0] mm1 = 1 mf1 = 0 fm1 = 0 ff1 = 0 for i in range(1, len(df)): if df['pair'][i] == df['pair'][i-1]: if df['x1'][i] != df['x1'][i-1]: if df['x1gen'][i] == 'male': if df['y1gen'][i] == 'male': mm1 += 1 else: mf1 += 1 else: if df['y1gen'][i] == 'male': fm1 += 1 else: ff1 += 1 ... and the gist is analogous (the code itself is MANY lines long, but just a repetition of those lines). As one can tell, this is HIGHLY inefficient (takes around 120 minutes).
What is the optimal way to find such values without having to do a highly inefficient string-matching?
https://stackoverflow.com/questions/66863761/comparing-multiple-columns-of-a-massive-dataframe-with-complex-duplicate-rows March 30, 2021 at 09:49AM
没有评论:
发表评论