Friday, February 5, 2021

Accessing aws s3 using wildcards vs spark filters

I have data stored in parquet files partitioned by date.

When I try to access them with Spark, I find that using a Spark filter, as here:

from pyspark.sql import functions as func  # sparkSession is assumed to be an existing SparkSession

basepath = 's3a://base/path/'
df_sp = (sparkSession.read.parquet(basepath)
         .filter((func.col('date_hour') >= '2021-01-20-00') &
                 (func.col('date_hour') < '2021-01-30-00')))
df_sp.count()

is about 10 times slower than using wildcard expressions in the path name:

basepath = 's3a://base/path/date_hour=2020-12-2*-00/'
df_sp = sparkSession.read.parquet(basepath)
df_sp.count()

When I look at the physical plan in Spark, the filter version is using partition filters, so I would have expected it to be fast.
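For reference, this is roughly how I am inspecting the plan (a minimal sketch assuming the same sparkSession and dataset layout as above):

# Sketch: print the extended plan; the date_hour predicate should show up
# under "PartitionFilters" in the physical plan (assumes sparkSession exists).
from pyspark.sql import functions as func

df_sp = (sparkSession.read.parquet('s3a://base/path/')
         .filter((func.col('date_hour') >= '2021-01-20-00') &
                 (func.col('date_hour') < '2021-01-30-00')))
df_sp.explain(True)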

Why is the Spark filter slower? Can I use native Spark functions to access this data as fast as with wildcards?
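One workaround I am considering, sketched below under my own assumptions (the hourly_paths helper and the use of the basePath read option are not from the original code), is to pass explicit partition paths to read.parquet so that S3 listing is restricted to the requested range, similar to what the wildcard achieves:

# Hypothetical sketch: enumerate hourly partition paths for the wanted range
# and read only those prefixes (assumes sparkSession exists and this layout).
from datetime import datetime, timedelta

def hourly_paths(base, start, end):
    t = start
    while t < end:
        yield f"{base}date_hour={t.strftime('%Y-%m-%d-%H')}"
        t += timedelta(hours=1)

paths = list(hourly_paths('s3a://base/path/',
                          datetime(2021, 1, 20, 0),
                          datetime(2021, 1, 30, 0)))
df_sp = (sparkSession.read
         .option('basePath', 's3a://base/path/')  # keep date_hour as a column
         .parquet(*paths))
df_sp.count()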

https://stackoverflow.com/questions/66072595/accessing-aws-s3-using-wildcards-vs-spark-filters February 06, 2021 at 09:06AM
