Sunday, January 17, 2021

Aggregating a dataframe: weighted averages for numerical columns and concatenating as string for other types

I have a dataframe that looks like this:

import numpy as np
import pandas as pd

np.random.seed(11)
df = pd.DataFrame({
    'item_id': np.random.randint(1, 4, 10),
    'item_type': [chr(x) for x in np.random.randint(65, 80, 10)],
    'value1': np.round(np.random.rand(10)*30, 1),
    'value2': np.round(np.random.randn(10)*30, 1),
    'n': np.random.randint(100, size=10)
})

   item_id item_type  value1  value2   n
0        2         A    26.8   -39.2  59
1        1         N    25.7   -33.6   1
2        2         A     5.0    22.1   3
3        2         N    19.0    47.2   8
4        1         M     0.6    -0.9  87
5        2         N     3.5   -20.5  81
6        3         E     9.5    32.9  68
7        1         C     4.7    -9.3  72
8        2         M    22.8    21.8  32
9        1         B    24.5    46.5  78

I would like to transform this dataframe so that there is a single row for each item_id. The numerical columns value1 and value2 should be aggregated as weighted averages (weighted by n), and the categorical column item_type should be combined by joining its unique values when it is not unique within a group. The end result looks like this:

        item_type     value1     value2
item_id
1         B/C/M/N   9.778571  11.955882
2           A/M/N  15.089071 -15.474317
3               E   9.500000  32.900000
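For example, value1 for item_id 1 is just the n-weighted mean of its four rows (indices 1, 4, 7 and 9 in the input), which is easy to check by hand:

# Hand check of value1 for item_id 1: n-weighted mean of rows 1, 4, 7, 9
vals = [25.7, 0.6, 4.7, 24.5]
wts = [1, 87, 72, 78]
print(sum(v * w for v, w in zip(vals, wts)) / sum(wts))  # -> 9.778571...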

What I have tried

This can be done with a custom function and apply, like this:

def func(x):
    # join the sorted unique item_type values into a single string
    record = ['/'.join(sorted(x.item_type.unique()))]
    total_rows = x.n.sum()  # total weight (sum of n) for the group
    # weighted average of each value column, weighted by n
    for c in ['value1', 'value2']:
        record.append((x[c] * x.n / total_rows).sum())
    return pd.Series(record, index=['item_type', 'value1', 'value2'])
%%timeit
df.groupby('item_id').apply(func)

6.95 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is for 10 records, but my real dataframe has over 40 million. Since apply calls func once per group in Python, this gets expensive at scale. I am searching for the most efficient way to do this before I start thinking about going parallel. All the other operations I've done on this dataframe take less than a minute, but this one is slow.
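One direction I've been considering, sketched below (a rough, unbenchmarked attempt; the tmp/wv1/wv2 names are just placeholders): precompute the weight products so the weighted averages reduce to plain sums, which groupby handles in fast compiled code, leaving only the string join as a per-group Python call. This needs pandas >= 0.25 for named aggregation.

# Precompute weight * value products so value1/value2 become plain sums
tmp = df.assign(wv1=df['value1'] * df['n'], wv2=df['value2'] * df['n'])
agg = tmp.groupby('item_id').agg(
    item_type=('item_type', lambda s: '/'.join(sorted(s.unique()))),
    wv1=('wv1', 'sum'),
    wv2=('wv2', 'sum'),
    n=('n', 'sum'),
)
# divide out the summed weights to recover the weighted averages
agg['value1'] = agg['wv1'] / agg['n']
agg['value2'] = agg['wv2'] / agg['n']
result = agg[['item_type', 'value1', 'value2']]

On the toy example this should match the apply version up to floating-point rounding, but I haven't verified it on the full data.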

Any ideas appreciated!

https://stackoverflow.com/questions/65768564/aggregating-a-dataframe-weighted-averages-for-numerical-columns-and-concatenati January 18, 2021 at 11:46AM
