Sunday, May 2, 2021

Write to a cache using PySpark that is shared with other PySpark processes

I have a PySpark script that reads from a persistent store (HDFS) and creates a Spark DataFrame in memory. I believe this is called caching.
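For concreteness, here is a minimal sketch of that setup; the HDFS path, application name, and file format are placeholders, not the asker's actual values:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("nightly-cache").getOrCreate()

    # Read from HDFS (path is a placeholder) and cache the DataFrame in
    # this application's executor memory. cache() is lazy, so count()
    # is called to actually materialize the data.
    df = spark.read.parquet("hdfs:///data/events")
    df.cache()
    df.count()

Note that a DataFrame cached this way lives only inside this one Spark application; other PySpark processes cannot see it, which is exactly the limitation the question is about.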

What I need is this: every night the PySpark job should run and refresh the cache, so that other PySpark scripts can read directly from the cache without going back to the persistent store.

I understand one can use Redis to do this, but what are some other options? Kafka?
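As one illustration of the Redis route, the nightly job could push the refreshed rows into Redis using the plain redis-py client; the host, key prefix, and the assumption that each row has an "id" column are all hypothetical:

    import json
    import redis

    def write_partition(rows):
        # One connection per partition; host and port are placeholders.
        r = redis.Redis(host="redis-host", port=6379)
        pipe = r.pipeline()
        for row in rows:
            # Assumes each row has an "id" column to key on and that
            # all column values are JSON-serializable.
            pipe.set("events:{}".format(row["id"]), json.dumps(row.asDict()))
        pipe.execute()

    # Push the refreshed DataFrame out so other PySpark processes can
    # fetch rows from Redis instead of re-reading HDFS.
    df.foreachPartition(write_partition)

Other scripts would then read those keys back with the same client, rather than sharing Spark's in-application cache directly.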

https://stackoverflow.com/questions/67362727/write-to-cache-using-pyspark-that-is-shared-with-other-pyspark-processes May 03, 2021 at 10:05AM
