有些事如何做: Comparing multiple columns of a massive DataFrame with complex duplicate rows

I have a massive dataframe df with around 10 million rows:

df.sort_values(['pair','x1','x2'])      x1    x1gen    x2     x2gen     y1     y1gen     y2     y2gen      pair  -------------------------------------------------------------------------------  A     male     H      female    a      male      d      male       0  A     male     W      male      a      male      d      male       0         (*)  A     male     KK     female    a      male      d      male       0         (**)  B     female   C      male      a      male      d      male       0         (-)  B     female   W      male      a      male      d      male       0         (*)  B     female   BB     female    a      male      d      male       0  B     female   KK     female    a      male      d      male       0         (**)  F     male     W      male      a      male      d      male       0         (*)  A     male     T      female    b      female    d      male       1  A     male     BB     female    b      female    d      male       1  B     female   C      male      b      female    d      male       1         (-)  D     male     E      male      b      female    d      male       1  A     male     C      male      b      female    e      female     2  ...

Each column can be explained by the following:

x1gen is a gender data of x1, x2gen is of x2, and so on.
x1 cites y1 and x2 cites y2.
Each pair of y1 and y2 is assigned a unique pair value.

My objective is to find four values per unique pair:

male citing male
male citing female
female citing male
female citing female

where, each citation network should not be counted more than once.

For example, in the given sample, x2 = W is appeared three times in pair = 0 (see (*)), so it should be counted once, not three times. Same applies to x2 = KK in pair = 0 (see (**)). However, we can count the same reference if it is a new pair. (C -> d in (-) is counted separately once per pair = 0 and pair = 1)

Hence, for the first pair pair = 0, the objective values are:

male citing male = 4 (A -> a, F -> a, W -> d, C -> d)
male citing female = 0
female citing male = 4 (B -> a, H -> d, KK -> d, BB -> d)
female citing female = 0

What I initially did was using a for loop and a set of if loops and creating four lists separately for x1 and x2:

mm = [1]  mf = [0]  fm = [0]  ff = [0]    mm1 = 1  mf1 = 0  fm1 = 0  ff1 = 0    for i in range(1, len(df)):    if df['pair'][i] == df['pair'][i-1]:      if df['x1'][i] != df['x1'][i-1]:        if df['x1gen'][i] == 'male':          if df['y1gen'][i] == 'male':            mm1 += 1          else:            mf1 += 1        else:          if df['y1gen'][i] == 'male':            fm1 += 1          else:            ff1 += 1  ...

and the gist is analogous (the code itself is MANY lines long, but just a repetition of those lines). As one can tell, this is HIGHLY inefficient (takes around 120 minutes).

What is the optimal way to find such values without having to do a highly inefficient string-matching?

https://stackoverflow.com/questions/66863761/comparing-multiple-columns-of-a-massive-dataframe-with-complex-duplicate-rows March 30, 2021 at 09:49AM

有些事如何做

2021年3月29日星期一

Comparing multiple columns of a massive DataFrame with complex duplicate rows

没有评论:

发表评论