To create a Spark DataFrame, we can read directly from raw data, pass an RDD, or pass a pandas DataFrame.
I experimented with all three of these methods.
Spark: standalone mode, using the pyspark.sql module.

Method 1: Read the text/CSV file with pandas and pass the resulting pandas DataFrame to create the Spark DataFrame.
    df3 = spark.createDataFrame(pandas_df)

Method 2: Create an RDD by passing the text file to sc.textFile, then use that RDD to create the Spark DataFrame.
    df3 = spark.createDataFrame(RDD_list, StringType())

Method 3: Read the raw data directly into a Spark DataFrame.
    df3 = spark.read.text("Data/bookpage.txt")
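For reference, here is a minimal, self-contained sketch of the three methods side by side (assumptions: a local[*] standalone session and a plain-text file at Data/bookpage.txt, matching the snippets above; df.rdd.getNumPartitions() reads off the partition counts):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    # Assumed setup: a local standalone session; adjust master/appName as needed.
    spark = SparkSession.builder.master("local[*]").appName("df-creation-test").getOrCreate()
    sc = spark.sparkContext

    # Method 1: raw data => pandas DataFrame => Spark DataFrame.
    # (Lines are read manually so all three methods see the same one-column
    # data; pd.read_csv would work just as well for a real CSV.)
    with open("Data/bookpage.txt") as f:
        pandas_df = pd.DataFrame({"value": f.read().splitlines()})
    df1 = spark.createDataFrame(pandas_df)

    # Method 2: raw data => RDD => Spark DataFrame.
    RDD_list = sc.textFile("Data/bookpage.txt")
    df2 = spark.createDataFrame(RDD_list, StringType())

    # Method 3: raw data => Spark DataFrame.
    df3 = spark.read.text("Data/bookpage.txt")

    # Compare the default partition counts of the three DataFrames.
    for name, df in [("pandas", df1), ("RDD", df2), ("direct read", df3)]:
        print(name, df.rdd.getNumPartitions())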
What I have observed:

- The number of default partitions differs across the three cases:
  - Method 1 (pandas): 8 (I have 8 cores)
  - Method 2 (RDD): 2
  - Method 3 (direct raw read): 1
- The conversion path differs:
  - Method 1: raw data => pandas DataFrame => Spark DataFrame
  - Method 2: raw data => RDD => Spark DataFrame
  - Method 3: raw data => Spark DataFrame
Questions:

- Which method is more efficient?
- Since everything in Spark is ultimately implemented at the RDD level, does creating the RDD explicitly, as in Method 2, make it more efficient?
- Why are the default partition counts different for the same data? (A sketch for probing this follows below.)
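A sketch of how the defaults can be inspected and overridden (same session and file as above; minPartitions, repartition, and explain are standard Spark APIs):

    # In local mode, defaultParallelism is the number of cores
    # (8 on the machine in the observations above).
    print(sc.defaultParallelism)

    # Each method also accepts an explicit partition count, so the defaults
    # need not dictate the comparison:
    rdd8 = sc.textFile("Data/bookpage.txt", minPartitions=8)
    df2_8 = spark.createDataFrame(rdd8, StringType())
    df3_8 = spark.read.text("Data/bookpage.txt").repartition(8)
    print(df2_8.rdd.getNumPartitions(), df3_8.rdd.getNumPartitions())

    # explain() prints the physical plan, which is useful when comparing
    # what each creation path actually executes.
    df3_8.explain()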