Thursday, December 31, 2020

RDD vs Pandas Dataframe vs Direct Read to create Spark DataFrame

To create a Spark DataFrame, we can read directly from raw data, pass an RDD, or pass a pandas DataFrame.

I experimented with these three methods.

Spark: standalone mode, using the pyspark.sql module

Method 1: Read the text/CSV file into pandas and pass the pandas DataFrame to create the Spark DataFrame.

 df3 = spark.createDataFrame(pandas_df)

Method 2: Create an RDD by passing the text file to 'sc.textFile', then use this RDD to create the Spark DataFrame.

df3 = spark.createDataFrame(RDD_list, StringType())

Method 3: Read directly from the raw data to create the Spark DataFrame.

df3 = spark.read.text("Data/bookpage.txt")

What I observed:

  1. The number of default partitions differs across the three cases:
     Method 1 (pandas): 8 (I have 8 cores)
     Method 2 (RDD): 2
     Method 3 (direct raw read): 1
  2. Conversion path:
     Method 1: Raw data => pandas DataFrame => Spark DataFrame
     Method 2: Raw data => RDD => Spark DataFrame
     Method 3: Raw data => Spark DataFrame

Questions:

  1. Which method is more efficient?
  2. Since everything in Spark is implemented at the RDD level, does explicitly creating the RDD in Method 2 make it more efficient?
  3. Why are the default partition counts different for the same data?
https://stackoverflow.com/questions/65526603/rdd-vs-pandas-dataframe-vs-direct-read-to-create-spark-dataframe January 01, 2021 at 08:52AM
