I have a massive dataframe df
with around 10 million rows:
df.sort_values(['pair','x1','x2']) x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair ------------------------------------------------------------------------------- A male H female a male d male 0 A male W male a male d male 0 (*) A male KK female a male d male 0 (**) B female C male a male d male 0 (-) B female W male a male d male 0 (*) B female BB female a male d male 0 B female KK female a male d male 0 (**) F male W male a male d male 0 (*) A male T female b female d male 1 A male BB female b female d male 1 B female C male b female d male 1 (-) D male E male b female d male 1 A male C male b female e female 2 ...
Each column can be explained by the following:
-
x1gen
is a gender data ofx1
,x2gen
is ofx2
, and so on. -
x1
citesy1
andx2
citesy2
. - Each pair of
y1
andy2
is assigned a uniquepair
value.
My objective is to find four values per unique pair
:
-
male
citingmale
-
male
citingfemale
-
female
citingmale
-
female
citingfemale
where, each citation network should not be counted more than once.
For example, in the given sample, x2 = W
is appeared three times in pair = 0
(see (*)
), so it should be counted once, not three times. Same applies to x2 = KK
in pair = 0
(see (**)
). However, we can count the same reference if it is a new pair. (C -> d
in (-)
is counted separately once per pair = 0
and pair = 1
)
Hence, for the first pair pair = 0
, the objective values are:
-
male
citingmale
= 4 (A -> a, F -> a, W -> d, C -> d
) -
male
citingfemale
= 0 -
female
citingmale
= 4 (B -> a, H -> d, KK -> d, BB -> d
) -
female
citingfemale
= 0
What I initially did was using a for
loop and a set of if
loops and creating four lists separately for x1
and x2
:
mm = [1] mf = [0] fm = [0] ff = [0] mm1 = 1 mf1 = 0 fm1 = 0 ff1 = 0 for i in range(1, len(df)): if df['pair'][i] == df['pair'][i-1]: if df['x1'][i] != df['x1'][i-1]: if df['x1gen'][i] == 'male': if df['y1gen'][i] == 'male': mm1 += 1 else: mf1 += 1 else: if df['y1gen'][i] == 'male': fm1 += 1 else: ff1 += 1 ...
and the gist is analogous (the code itself is MANY lines long, but just a repetition of those lines). As one can tell, this is HIGHLY inefficient (takes around 120 minutes).
What is the optimal way to find such values without having to do a highly inefficient string-matching?
https://stackoverflow.com/questions/66863761/comparing-multiple-columns-of-a-massive-dataframe-with-complex-duplicate-rows March 30, 2021 at 09:49AM
没有评论:
发表评论