Thursday, December 31, 2020

RDD vs Pandas Dataframe vs Direct Read to create Spark DataFrame

To create a Spark DataFrame, we can read directly from raw data, pass an RDD, or pass a pandas DataFrame.

I experimented with these three methods.

Spark: standalone mode, using the pyspark.sql module

Method 1: Read the text/CSV file into pandas and pass the pandas DataFrame to create the Spark DataFrame.

 df3 = spark.createDataFrame(pandas_df)

Method 2: Create an RDD by passing the text file to 'sc.textFile', then use this RDD to create the Spark DataFrame.

df3 = spark.createDataFrame(RDD_list, StringType())

Method 3: Read directly from the raw data to create the Spark DataFrame.

df3 = spark.read.text("Data/bookpage.txt")

What I observed:

  1. The number of default partitions differs across the three cases:
     Method 1 (pandas): 8 (I have 8 cores)
     Method 2 (RDD): 2
     Method 3 (direct raw read): 1
  2. Conversion path:
     Method 1: Raw data => pandas DataFrame => Spark DataFrame
     Method 2: Raw data => RDD => Spark DataFrame
     Method 3: Raw data => Spark DataFrame

Questions:

  1. Which method is more efficient?
  2. Since everything in Spark is implemented at the RDD level, does explicitly creating the RDD in Method 2 make it more efficient?
  3. Why are the default partition counts different for the same data?
https://stackoverflow.com/questions/65526603/rdd-vs-pandas-dataframe-vs-direct-read-to-create-spark-dataframe January 01, 2021 at 08:52AM
