Sunday, April 4, 2021

PySpark using both aggregate and group by

Can someone help me with PySpark, using both the aggregate and groupBy functions? I have built my DataFrames and applied filters and selects to get the data I want. However, I am now stuck trying to aggregate things correctly.

Currently, my code outputs the content below:

+----------+-----------+--------------+---------------+----------+---------+
|l_orderkey|o_orderdate|o_shippriority|l_extendedprice|l_discount|      rev|
+----------+-----------+--------------+---------------+----------+---------+
|     53634| 1995-02-22|             0|       20517.44|      0.08|18876.045|
|    265539| 1995-01-25|             0|       70423.08|      0.01| 69718.85|
|    331590| 1994-12-10|             0|       46692.75|      0.03| 45291.97|
|    331590| 1994-12-10|             0|        37235.1|       0.1| 33511.59|
|    420545| 1995-03-05|             0|        75542.1|      0.04|72520.414|
|    420545| 1995-03-05|             0|         1062.0|      0.07|987.66003|
|    420545| 1995-03-05|             0|        9729.45|       0.1| 8756.505|
|    420545| 1995-03-05|             0|        15655.6|      0.04|15029.375|
|    420545| 1995-03-05|             0|         3121.3|      0.03|3027.6611|
|    420545| 1995-03-05|             0|        71723.0|      0.03| 69571.31|
|    488928| 1995-02-15|             0|        1692.77|      0.01|1675.8423|
|    488928| 1995-02-15|             0|       22017.84|      0.01|21797.662|
|    488928| 1995-02-15|             0|       57100.42|      0.04|54816.402|
|    488928| 1995-02-15|             0|        3807.76|      0.05| 3617.372|
|    488928| 1995-02-15|             0|       73332.52|      0.01|72599.195|
|    510754| 1994-12-21|             0|       41171.78|      0.09| 37466.32|
|    512422| 1994-12-26|             0|       87251.56|      0.07| 81143.95|
|    677761| 1994-12-26|             0|       60123.34|       0.0| 60123.34|
|    956646| 1995-03-07|             0|       61853.68|      0.05|58760.996|
|   1218886| 1995-02-13|             0|        24844.0|      0.01| 24595.56|
+----------+-----------+--------------+---------------+----------+---------+

I wish to group by l_orderkey and aggregate rev as a sum.
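In other words, the desired result would be something along the lines of the sketch below (rev_sum is just an illustrative alias for the summed column):

from pyspark.sql import functions as F

(t
 .groupBy("l_orderkey")
 .agg(F.sum("rev").alias("rev_sum"))
 .show())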

Here is my most recent attempt, with t being the DataFrame and F being pyspark.sql.functions (from pyspark.sql import functions as F):

(t
 .groupby(t.l_orderkey, t.o_orderdate, t.o_shippriority)
 .agg(F.collect_set(sum(t.rev)), F.collect_set(t.l_orderkey))
 .show())

Can someone tell me if I'm on the right track? I keep getting "Column is not iterable".
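The error most likely comes from the call to Python's builtin sum: it tries to iterate over the Column t.rev, and PySpark Columns raise "Column is not iterable" when iterated. The Spark aggregate is F.sum, and it should not be wrapped in F.collect_set, since Spark does not allow nesting one aggregate inside another. A minimal sketch of the corrected call, assuming all three grouping keys are intended (rev_sum is an illustrative alias):

(t
 .groupBy(t.l_orderkey, t.o_orderdate, t.o_shippriority)
 .agg(F.sum(t.rev).alias("rev_sum"))
 .show())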

https://stackoverflow.com/questions/66947084/pyspark-using-both-aggregate-and-group-by April 05, 2021 at 08:44AM
