I have data stored in Parquet files partitioned by date.
When I try to access them using Spark, I find that filtering with the Spark API, as shown here:

basepath = 's3a://base/path/'
df_sp = (
    sparkSession.read.parquet(basepath)
    .filter((func.col('date_hour') >= '2021-01-20-00') &
            (func.col('date_hour') < '2021-01-30-00'))
)
df_sp.count()

is about 10 times slower than using a wildcard expression in the path name:

basepath = 's3a://base/path/date_hour=2020-12-2*-00/'
df_sp = sparkSession.read.parquet(basepath)
df_sp.count()

When I look at the physical plan in Spark, the filtered version does use partition filters, so I would have expected it to be fast.
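One way to see what the planner is doing is to print the physical plan and look for the PartitionFilters entry in the FileScan node. A minimal sketch, assuming the same sparkSession as above and that pyspark.sql.functions is imported under the alias func (as the snippets imply):

from pyspark.sql import functions as func

# Same filtered read as above, then print the physical plan instead of counting.
df_sp = (
    sparkSession.read.parquet('s3a://base/path/')
    .filter((func.col('date_hour') >= '2021-01-20-00') &
            (func.col('date_hour') < '2021-01-30-00'))
)
df_sp.explain()  # the FileScan parquet node should list PartitionFilters: [...]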
Why is the Spark filter slower? Can I use the native Spark functions to access this data as fast as with wildcards?
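For reference, a variation of the wildcard read, not used in the snippets above: when passing explicit partition paths or a path glob, Spark's documented basePath option keeps the date_hour partition column in the resulting schema. A sketch, assuming the same layout and sparkSession:

basepath = 's3a://base/path/'
df_glob = (
    sparkSession.read
    .option('basePath', basepath)  # retain partition columns when reading specific partition dirs
    .parquet(basepath + 'date_hour=2020-12-2*-00/')
)
df_glob.count()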
Source: https://stackoverflow.com/questions/66072595/accessing-aws-s3-using-wildcards-vs-spark-filters (February 06, 2021 at 09:06AM)