Wednesday, January 27, 2021

Switching for loop order leads to a huge difference in execution time, and how to avoid using for loops in Scala code

I have two sets of vectors, one of size k and one of size 2k. The 2k vectors are the cluster center points returned by "model.clusterCenters", and each of the k vectors is converted from a single row of a dataframe. My goal is to calculate the cosine similarity between each pair of vectors (2k^2 pairs in total). I've noticed a huge difference in execution time when the loop order is switched.
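For reference, the quantity being computed can be sketched with plain Scala arrays instead of Spark vectors. This is only an illustration: the object `CosineSketch`, the helper `cosine`, and the toy data are hypothetical and do not come from Spark or Breeze.

```scala
object CosineSketch {
  // Cosine similarity of two equal-length vectors:
  // dot(a, b) / (||a|| * ||b||), using the Euclidean (L2) norm.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  // Toy data (assumed for illustration): k row vectors and 2k center vectors.
  val k: Int = 3
  val rows: Array[Array[Double]]    = Array.tabulate(k)(i => Array(1.0 + i, 2.0))
  val centers: Array[Array[Double]] = Array.tabulate(2 * k)(j => Array(1.0, 1.0 + j))

  // Every row paired with every center: k * 2k = 2k^2 similarities.
  val pairs: Array[Double] = for (r <- rows; c <- centers) yield cosine(r, c)
}
```

Pairing k rows with 2k centers is where the 2k^2 count comes from.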

Here is my sample code:

    val cosine_list = ListBuffer(("sample_string", 0.0)) // first item shows the data structure of the list
    for (i <- 0 until k) { // k: number of rows in the dataframe
      val cen0 = df.select("features").collect()(i).getAs[Vector](0)
      val cen0_new = org.apache.spark.mllib.linalg.Vectors.fromML(cen0)
      for (j <- 0 until 2 * k) { // the number of center points is 2 * the number of rows in df
        val cen1 = model.clusterCenters(j) // get the j-th center point vector
        val cen1_new = org.apache.spark.mllib.linalg.Vectors.fromML(cen1)
        val sqr_cen0 = Vectors.norm(cen0_new, 2)
        val sqr_cen1 = Vectors.norm(cen1_new, 2)
        val dot1 = DenseVector(cen0_new.toArray).dot(DenseVector(cen1_new.toArray))
        val cos = dot1 / (sqr_cen0 * sqr_cen1)
        val map_name = s"${i}_${j}"
        cosine_list.append((map_name, cos))
      }
    }

In this approach, the outer loop runs over k and the inner loop over 2*k. It takes around 10 minutes to execute on a single machine with my dataset. But when I switched the outer and inner loops as follows:

    val cosine_list = ListBuffer(("sample_string", 0.0))
    for (j <- 0 until 2 * k) {
      val cen1 = model.clusterCenters(j) // get the j-th center point vector
      val cen1_new = org.apache.spark.mllib.linalg.Vectors.fromML(cen1)
      for (i <- 0 until k) {
        val cen0 = df.select("features").collect()(i).getAs[Vector](0)
        val cen0_new = org.apache.spark.mllib.linalg.Vectors.fromML(cen0)
        val sqr_cen0 = Vectors.norm(cen0_new, 2)
        val sqr_cen1 = Vectors.norm(cen1_new, 2)
        val dot1 = DenseVector(cen0_new.toArray).dot(DenseVector(cen1_new.toArray))
        val cos = dot1 / (sqr_cen0 * sqr_cen1)
        val map_name = s"${j}_${i}"
        cosine_list.append((map_name, cos))
      }
    }

this takes hours (more than 10) on the same configuration as the first approach.

My question is why the time difference is so huge. My guess is that collect()(i) is executed k times in the first approach and 2*k^2 times in the second, which leads to the huge difference in execution time. Are there any other reasons? Also, how can I achieve the same result without for loops, given that the best practice is to avoid for loops in Scala code?
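The guess about the call counts can be checked without Spark by counting calls to a stand-in for the expensive operation. The following is a minimal sketch under stated assumptions: `LoopOrderSketch`, `expensiveCollect`, `cosine`, and the toy data are all hypothetical, and `expensiveCollect` only imitates the shape of `df.select("features").collect()` rather than its real cost. The last method shows one possible way to fetch the rows once and build the result with a for/yield expression instead of appending inside nested loops.

```scala
object LoopOrderSketch {
  var calls = 0 // how many times the stand-in for collect() has run

  // Hypothetical stand-in for df.select("features").collect():
  // returns all k rows and records that it was invoked.
  def expensiveCollect(k: Int): Array[Array[Double]] = {
    calls += 1
    Array.tabulate(k)(i => Array(i + 1.0, 1.0))
  }

  // Hypothetical helper: cosine similarity on plain arrays.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
  }

  // Rows in the outer loop: the stand-in runs once per row, k times in total.
  def rowsOuter(k: Int, centers: Array[Array[Double]]): Int = {
    calls = 0
    for (i <- 0 until k) {
      val row = expensiveCollect(k)(i)
      for (j <- centers.indices) cosine(row, centers(j))
    }
    calls
  }

  // Centers in the outer loop: the stand-in runs once per pair, 2k * k times.
  def centersOuter(k: Int, centers: Array[Array[Double]]): Int = {
    calls = 0
    for (j <- centers.indices) {
      for (i <- 0 until k) {
        val row = expensiveCollect(k)(i)
        cosine(row, centers(j))
      }
    }
    calls
  }

  // Fetch the rows once, then build all (name, similarity) pairs with for/yield.
  def hoisted(k: Int, centers: Array[Array[Double]]): (Int, Seq[(String, Double)]) = {
    calls = 0
    val rows = expensiveCollect(k) // invoked a single time, outside any loop
    val sims = for {
      (row, i)    <- rows.toSeq.zipWithIndex
      (center, j) <- centers.toSeq.zipWithIndex
    } yield (s"${i}_${j}", cosine(row, center))
    (calls, sims)
  }
}
```

With k = 4 and 8 centers, the three methods report 4, 32, and 1 calls respectively, which matches the k versus 2*k^2 reasoning. Whether this alone accounts for the whole gap on a real cluster is not something the sketch can show.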

https://stackoverflow.com/questions/65930450/switching-for-loop-order-leads-to-huge-difference-in-execution-time-and-avoid-us January 28, 2021 at 11:03AM
