I have a dataframe which looks like this:
import numpy as np
import pandas as pd

np.random.seed(11)
df = pd.DataFrame({
    'item_id': np.random.randint(1, 4, 10),
    'item_type': [chr(x) for x in np.random.randint(65, 80, 10)],
    'value1': np.round(np.random.rand(10) * 30, 1),
    'value2': np.round(np.random.randn(10) * 30, 1),
    'n': np.random.randint(100, size=10)
})

   item_id item_type  value1  value2   n
0        2         A    26.8   -39.2  59
1        1         N    25.7   -33.6   1
2        2         A     5.0    22.1   3
3        2         N    19.0    47.2   8
4        1         M     0.6    -0.9  87
5        2         N     3.5   -20.5  81
6        3         E     9.5    32.9  68
7        1         C     4.7    -9.3  72
8        2         M    22.8    21.8  32
9        1         B    24.5    46.5  78
I would like to transform this dataframe to have a single row for each item_id. The columns should be aggregated by taking the weighted average of value1 and value2 (weighted by n), and by combining the categorical variable item_type where it is not unique. The end result looks like this:
        item_type     value1     value2
item_id
1         B/C/M/N   9.778571  11.955882
2           A/M/N  15.089071 -15.474317
3               E   9.500000  32.900000
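To make the weighting explicit, the value1 entry for item_id 1 is (25.7*1 + 0.6*87 + 4.7*72 + 24.5*78) / (1 + 87 + 72 + 78) = 2327.3 / 238 ≈ 9.778571.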
What I have tried
This can be done with a custom function and apply, like this one:
def func(x):
    # concatenate the sorted unique item_type values, e.g. 'B/C/M/N'
    record = ['/'.join(sorted(x.item_type.unique()))]
    total_rows = x.n.sum()
    # weighted average of each value column, weighted by n
    for c in ['value1', 'value2']:
        record.append((x[c] * x.n / total_rows).sum())
    return pd.Series(record, index=['item_type', 'value1', 'value2'])
%%timeit
df.groupby('item_id').apply(func)

6.95 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is for 10 records; my actual dataframe has over 40 million records. I am searching for the most efficient way to do this before I start thinking about going parallel. All the other operations I have done on this dataframe take less than a minute, but this one is slow.
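One direction I have considered but not benchmarked at scale is precomputing the weighted numerators so the groupby only has to sum, instead of calling a Python function that builds a Series for every group. A minimal sketch (the column names w1 and w2 are just placeholders I made up):

# Sketch of a vectorized alternative (not benchmarked on the full data):
# precompute value * n so the groupby only needs sums, then divide.
weighted = df.assign(w1=df['value1'] * df['n'], w2=df['value2'] * df['n'])
g = weighted.groupby('item_id')
sums = g[['w1', 'w2', 'n']].sum()

result = pd.DataFrame({
    # join the sorted unique item_type values per group, e.g. 'B/C/M/N'
    'item_type': g['item_type'].agg(lambda s: '/'.join(sorted(s.unique()))),
    'value1': sums['w1'] / sums['n'],
    'value2': sums['w2'] / sums['n'],
})

The item_type part still runs a Python lambda once per group, but the weighted averages become plain column arithmetic and groupby sums.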
Any ideas appreciated!
Source: https://stackoverflow.com/questions/65768564/aggregating-a-dataframe-weighted-averages-for-numerical-columns-and-concatenati (January 18, 2021)