Can someone help me with using both aggregate and groupby functions in PySpark? I have created my DataFrames and applied filters and selects to get the data I want. However, I am now stuck trying to aggregate things correctly.
Currently, my code outputs the content below:
```
+----------+-----------+--------------+---------------+----------+---------+
|l_orderkey|o_orderdate|o_shippriority|l_extendedprice|l_discount|      rev|
+----------+-----------+--------------+---------------+----------+---------+
|     53634| 1995-02-22|             0|       20517.44|      0.08|18876.045|
|    265539| 1995-01-25|             0|       70423.08|      0.01| 69718.85|
|    331590| 1994-12-10|             0|       46692.75|      0.03| 45291.97|
|    331590| 1994-12-10|             0|        37235.1|       0.1| 33511.59|
|    420545| 1995-03-05|             0|        75542.1|      0.04|72520.414|
|    420545| 1995-03-05|             0|         1062.0|      0.07|987.66003|
|    420545| 1995-03-05|             0|        9729.45|       0.1| 8756.505|
|    420545| 1995-03-05|             0|        15655.6|      0.04|15029.375|
|    420545| 1995-03-05|             0|         3121.3|      0.03|3027.6611|
|    420545| 1995-03-05|             0|        71723.0|      0.03| 69571.31|
|    488928| 1995-02-15|             0|        1692.77|      0.01|1675.8423|
|    488928| 1995-02-15|             0|       22017.84|      0.01|21797.662|
|    488928| 1995-02-15|             0|       57100.42|      0.04|54816.402|
|    488928| 1995-02-15|             0|        3807.76|      0.05| 3617.372|
|    488928| 1995-02-15|             0|       73332.52|      0.01|72599.195|
|    510754| 1994-12-21|             0|       41171.78|      0.09| 37466.32|
|    512422| 1994-12-26|             0|       87251.56|      0.07| 81143.95|
|    677761| 1994-12-26|             0|       60123.34|       0.0| 60123.34|
|    956646| 1995-03-07|             0|       61853.68|      0.05|58760.996|
|   1218886| 1995-02-13|             0|        24844.0|      0.01| 24595.56|
+----------+-----------+--------------+---------------+----------+---------+
```
I want to group by `l_orderkey` and aggregate `rev` as a sum.
Here is my most recent attempt, where `t` is the DataFrame and `F` is the functions module from pyspark.sql (`from pyspark.sql import functions as F`):

```
(t
 .groupby(t.l_orderkey, t.o_orderdate, t.o_shippriority)
 .agg(F.collect_set(sum(t.rev)), F.collect_set(t.l_orderkey))
 .show())
```
Can someone tell me if I'm on the right track? I keep getting "Column is not iterable".
Source: https://stackoverflow.com/questions/66947084/pyspark-using-both-aggregate-and-group-by (April 05, 2021 at 08:44AM)