2021年3月22日星期一

Elegant vectorized way to reference nearby rows? Modeling Tiered Pricing with Multi-indexed DataFrames

I'd like to be able to model tiered pricing in a multi-indexed Pandas DataFrame. I tried searching for this and I was surprised that nothing came up, honestly.

I'd like to begin with DataFrames like these two:

import pandas as pd    pricing_df = pd.DataFrame({   'Product': ['Widget','Widget','Widget','Doodad','Doodad','Doodad','Thingy','Thingy','Stuff'],                              'TierIndex': [       1,       2,       3,       1,       2,       3,       1,       2,     1 ],                              'TierStart': [       0,      10,     100,       0,     100,    1000,       0,      25,     0 ],                              'TierPrice': [     5.0,     4.0,     3.0,    30.0,    25.0,    12.0,   150.0,   125.0, 750.0 ],                             'ExtraStuff': [   'sdf', 'werty',  'zxvb',  'dfgh',   'wer',  'vbnm', 'sxcde', 'rtgfb', 'yhn' ] } )    pricing_df.set_index(['Product', 'TierIndex'], drop = True, inplace = True)    pricing_df  
invoices_df = pd.DataFrame({   'Product': ['Widget', 'Doodad', 'Thingy',  'Stuff', 'Widget', 'Doodad', 'Thingy',  'Stuff' ],                               'MonthEnd': ['20210131', '20210131', '20210131', '20210131', '20210228', '20210228', '20210228','20210228'],                                'Quantity': [     200,       90,       20,       10,       30,     1300,       40,        0 ],                              'ExtraStuff': [   'xdf',  'xerty',   'xxvb',   'xfgh',    'xer',   'xbnm',  'xxcde',  'xtgfb' ] } )    invoices_df['MonthEnd'] = pd.to_datetime(invoices_df['MonthEnd'], format='%Y%m%d', errors='ignore')    invoices_df.set_index(['Product', 'MonthEnd'], drop = True, inplace = True)    invoices_df['Cost'] = 'total calculated cost for this row here'    invoices_df  

I'd like to call in the data from pricing_df (ideally with a single function via .apply(), perhaps?) into the invoices_df.

I imagine starting with sorting by Product_Code, then within each Product_Code sorting by Tier_Index (1,2,3...n). Once that's done (example here was built that way from the get-go), I'd begin calculating the total price by something like:

[Max of top tier price * (max of 0 and (quantity - highest Tier_Start_Qty))] + [next highest tier price * (max of 0 and (min of (current Tier_Start_Qty - higher Tier_Start_Qty) and (quantity - current Tier_Start_Qty)] + ... + [bottom tier price * (max of 0 and (min of (next higher Tier_Start_Qty - current Tier_Start_Qty) and (quantity - current Tier_Start_Qty)]

so a Product with tiers at 0, 10, and 100 units for $5, $4 and $3 per unit would charge for 200 units as:

$3 * max(200 - 100, 0) + $4 * max(0, min(200 - 10, 100 - 10)) + $5 * (max(0, min(200 - 0, 10 - 0)) = $3 * 100 + $4 * 90 + $5 * 10 = $710 = invoices_df.loc[('Widget', '2021-01-31'),'Cost'] (or similar)

The same item with a quantity of 30 would cost: $3 * max(30 - 100, 0) + $4 * max(0, min(30 - 10, 100 - 10)) + $5 * (max(0, min(30 - 0, 10 - 0)) = $3 * 0 + $4 * 20 + $5 * 10 = $130 = invoices_df.loc[('Widget', '2021-02-28'),'Cost'] (or similar)

Not all Product codes will have the same number of tiers, so i'm hoping the above code will be able to function over however many tiers are needed (ideally without having to loop). At the end of the day, every Product should have a total cost calculated for a given month.

The bit that's hanging me up is the idea of having to read multiple rows neatly to get each term put together. I have done work in Anaplan where doing something like this is quite simple, but doing this in Pandas seems a lot trickier, at least without some kind of yucky nested for loop mess.

Is there an elegant, ideally vectorized way I can implment this? Any guidance would be greatly appreciated.

https://stackoverflow.com/questions/66754250/elegant-vectorized-way-to-reference-nearby-rows-modeling-tiered-pricing-with-mu March 23, 2021 at 05:43AM

没有评论:

发表评论