Monday, March 29, 2021

Comparing multiple columns of a massive DataFrame with complex duplicate rows

I have a massive dataframe df with around 10 million rows:

df.sort_values(['pair','x1','x2'])

x1    x1gen    x2     x2gen     y1     y1gen     y2     y2gen      pair
-------------------------------------------------------------------------------
A     male     H      female    a      male      d      male       0
A     male     W      male      a      male      d      male       0         (*)
A     male     KK     female    a      male      d      male       0         (**)
B     female   C      male      a      male      d      male       0         (-)
B     female   W      male      a      male      d      male       0         (*)
B     female   BB     female    a      male      d      male       0
B     female   KK     female    a      male      d      male       0         (**)
F     male     W      male      a      male      d      male       0         (*)
A     male     T      female    b      female    d      male       1
A     male     BB     female    b      female    d      male       1
B     female   C      male      b      female    d      male       1         (-)
D     male     E      male      b      female    d      male       1
A     male     C      male      b      female    e      female     2
...

Each column can be explained by the following:

  • x1gen is the gender of x1, x2gen the gender of x2, and so on.
  • x1 cites y1 and x2 cites y2.
  • Each pair of y1 and y2 is assigned a unique pair value.
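To keep the later snippets concrete, here is a minimal reconstruction of the pair = 0 rows above (plus the repeated C -> d citation from pair = 1) as a pandas DataFrame; the column names and values are taken from the sample table, and the construction itself is only an illustration:

import pandas as pd

# Reconstruction of the sample rows shown above (pair 0 plus the pair-1 row
# marked (-)); all columns are plain strings except the integer pair id.
df = pd.DataFrame(
    [
        ('A', 'male',   'H',  'female', 'a', 'male',   'd', 'male', 0),
        ('A', 'male',   'W',  'male',   'a', 'male',   'd', 'male', 0),
        ('A', 'male',   'KK', 'female', 'a', 'male',   'd', 'male', 0),
        ('B', 'female', 'C',  'male',   'a', 'male',   'd', 'male', 0),
        ('B', 'female', 'W',  'male',   'a', 'male',   'd', 'male', 0),
        ('B', 'female', 'BB', 'female', 'a', 'male',   'd', 'male', 0),
        ('B', 'female', 'KK', 'female', 'a', 'male',   'd', 'male', 0),
        ('F', 'male',   'W',  'male',   'a', 'male',   'd', 'male', 0),
        ('B', 'female', 'C',  'male',   'b', 'female', 'd', 'male', 1),
    ],
    columns=['x1', 'x1gen', 'x2', 'x2gen', 'y1', 'y1gen', 'y2', 'y2gen', 'pair'],
)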

My objective is to find four values per unique pair:

  1. male citing male
  2. male citing female
  3. female citing male
  4. female citing female

where each citation should not be counted more than once within a pair.

For example, in the given sample, x2 = W appears three times in pair = 0 (see (*)), so it should be counted once, not three times. The same applies to x2 = KK in pair = 0 (see (**)). However, the same reference can be counted again when it occurs in a new pair: C -> d in (-) is counted once for pair = 0 and once for pair = 1.
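On the sample frame sketched above, this rule for the x2 side can be expressed with a drop_duplicates keyed on the pair, the citing author and the cited work (an assumed formulation, not code from the original post):

# Keep each x2 -> y2 citation once per pair: W -> d survives only once for
# pair = 0, while C -> d survives once for pair = 0 and once for pair = 1.
unique_x2 = df.drop_duplicates(['pair', 'x2', 'y2'])
print(unique_x2[['pair', 'x2', 'x2gen', 'y2', 'y2gen']])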

Hence, for the first pair (pair = 0), the objective values are:

  1. male citing male = 4 (A -> a, F -> a, W -> d, C -> d)
  2. male citing female = 0
  3. female citing male = 4 (B -> a, H -> d, KK -> d, BB -> d)
  4. female citing female = 0

What I initially did was to use a for loop with a series of nested if statements, building four counts separately for x1 and x2:

mm = [1]
mf = [0]
fm = [0]
ff = [0]

mm1 = 1
mf1 = 0
fm1 = 0
ff1 = 0

for i in range(1, len(df)):
    if df['pair'][i] == df['pair'][i-1]:
        if df['x1'][i] != df['x1'][i-1]:
            if df['x1gen'][i] == 'male':
                if df['y1gen'][i] == 'male':
                    mm1 += 1
                else:
                    mf1 += 1
            else:
                if df['y1gen'][i] == 'male':
                    fm1 += 1
                else:
                    ff1 += 1
...

and the rest is analogous (the full code is many lines long, but just a repetition of the block above for the other cases). As one can tell, this is highly inefficient: it takes around 120 minutes to run.

What is the optimal way to find these values without resorting to this highly inefficient row-by-row string matching?
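One way to avoid the Python-level loop entirely (a sketch under the assumption that the de-duplication key is the triple pair / citing author / cited work, not code from an accepted answer) is to reshape every row into its two citation edges, drop the duplicates once, and let a single groupby produce all four counts per pair:

import pandas as pd

def gender_citation_counts(df: pd.DataFrame) -> pd.DataFrame:
    """One row per pair, one column per (citer gender, cited gender) combination."""
    # One row per citation edge: x1 -> y1 and x2 -> y2.
    x1_edges = df[['pair', 'x1', 'x1gen', 'y1', 'y1gen']].rename(
        columns={'x1': 'citer', 'x1gen': 'citer_gen',
                 'y1': 'cited', 'y1gen': 'cited_gen'})
    x2_edges = df[['pair', 'x2', 'x2gen', 'y2', 'y2gen']].rename(
        columns={'x2': 'citer', 'x2gen': 'citer_gen',
                 'y2': 'cited', 'y2gen': 'cited_gen'})
    edges = pd.concat([x1_edges, x2_edges], ignore_index=True)

    # Each citation is counted at most once per pair, but the same citation
    # may reappear under a different pair (the C -> d case above).
    edges = edges.drop_duplicates(['pair', 'citer', 'cited'])

    # Wide result: index = pair, columns = (citer_gen, cited_gen), values = counts.
    return (edges.groupby(['pair', 'citer_gen', 'cited_gen'])
                 .size()
                 .unstack(['citer_gen', 'cited_gen'], fill_value=0))

On the small sample reconstructed above this yields female citing male = 4 and male citing male = 4 for pair = 0, matching the hand count (gender combinations that never occur in the data simply have no column unless they are reindexed in afterwards). On the full 10-million-row frame this should be far faster than the 120-minute loop, since everything stays in vectorized pandas operations.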

https://stackoverflow.com/questions/66863761/comparing-multiple-columns-of-a-massive-dataframe-with-complex-duplicate-rows March 30, 2021 at 09:49AM
