I have two sets of vectors, of size k and 2k: the k vectors are converted from the rows of a dataframe, and the 2k vectors are the cluster center points returned by "model.clusterCenters". My goal is to calculate the cosine similarity between each pair of vectors (2k^2 pairs in total). I've noticed a huge difference in execution time when the loop order is switched.
Here is my sample code:
    // Imports inferred from usage (not shown in the original post)
    import scala.collection.mutable.ListBuffer
    import breeze.linalg.DenseVector
    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.mllib.linalg.Vectors

    // The first item only shows the element type of the list
    val cosine_list = ListBuffer(("sample_string", 0.0))
    for (i <- 0 until k) { // k: number of rows in the dataframe
      val cen0 = df.select("features").collect()(i).getAs[Vector](0)
      val cen0_new = org.apache.spark.mllib.linalg.Vectors.fromML(cen0)
      for (j <- 0 until 2 * k) { // number of center points is 2 * number of rows in df
        val cen1 = model.clusterCenters(j) // get the j-th center point vector
        val cen1_new = org.apache.spark.mllib.linalg.Vectors.fromML(cen1)
        val sqr_cen0 = Vectors.norm(cen0_new, 2)
        val sqr_cen1 = Vectors.norm(cen1_new, 2)
        val dot1 = DenseVector(cen0_new.toArray).dot(DenseVector(cen1_new.toArray))
        val cos = dot1 / (sqr_cen0 * sqr_cen1)
        val map_name = s"${i}_${j}"
        cosine_list.append((map_name, cos))
      }
    }

In this approach, the outer loop is over k and the inner loop is over 2*k. It takes around 10 minutes to execute on a single machine with my dataset. But when I switched the outer and inner loops as follows:
    val cosine_list = ListBuffer(("sample_string", 0.0))
    for (j <- 0 until 2 * k) {
      val cen1 = model.clusterCenters(j) // get the j-th center point vector
      val cen1_new = org.apache.spark.mllib.linalg.Vectors.fromML(cen1)
      for (i <- 0 until k) {
        val cen0 = df.select("features").collect()(i).getAs[Vector](0)
        val cen0_new = org.apache.spark.mllib.linalg.Vectors.fromML(cen0)
        val sqr_cen0 = Vectors.norm(cen0_new, 2)
        val sqr_cen1 = Vectors.norm(cen1_new, 2)
        val dot1 = DenseVector(cen0_new.toArray).dot(DenseVector(cen1_new.toArray))
        val cos = dot1 / (sqr_cen0 * sqr_cen1)
        val map_name = s"${j}_${i}"
        cosine_list.append((map_name, cos))
      }
    }

This takes more than 10 hours on the same configuration as the first approach.
My question is why the time difference is so large. My guess is that collect()(i) is executed k times in the first approach and 2*k^2 times in the second, and that this leads to the huge difference in execution time. Are there any other reasons? Also, how can I achieve the same result without a for loop, given that avoiding for loops is considered best practice in Scala?
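To make the collect() guess concrete, here is a minimal sketch of how the same pairs could be computed when the dataframe is collected only once, outside both loops. It is not a tested Spark job: the object name `PairwiseCosine` and its helpers are hypothetical, the vectors are assumed to be already extracted into plain arrays, and the Spark-side extraction appears only as a comment. A for-comprehension with `yield` stands in for the explicit `ListBuffer` appends.

```scala
object PairwiseCosine {
  // Assumed Spark-side extraction, done once outside any loop (not run here):
  //   val rows    = df.select("features").collect().map(_.getAs[Vector](0).toArray)
  //   val centers = model.clusterCenters.map(_.toArray)

  // Cosine similarity of two dense vectors of equal dimension.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot   = a.indices.map(n => a(n) * b(n)).sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  // All (row, center) pairs, keyed "i_j" as in the first loop order.
  def pairwise(rows: Array[Array[Double]],
               centers: Array[Array[Double]]): IndexedSeq[(String, Double)] =
    for {
      i <- rows.indices
      j <- centers.indices
    } yield (s"${i}_${j}", cosine(rows(i), centers(j)))
}
```

With the data held in local arrays, the loop order should no longer matter much, since neither loop triggers a Spark job. This assumes the collected rows fit in driver memory, which the original code already requires because it calls collect() on the whole column.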
https://stackoverflow.com/questions/65930450/switching-for-loop-order-leads-to-huge-difference-in-execution-time-and-avoid-us January 28, 2021 at 11:03AM