Thursday, March 4, 2021

How to print out short texts ordered by their distance to the center point of each cluster? NLP clustering in Python

NLP/K-MEANS/PYTHON

Hi all,

I'm currently working on a short-text clustering task in NLP, and I'm trying to cluster the short texts with K-means.

I have embedded the sentences (using GloVe), fed them to a CNN, and then run K-means on the result to do the clustering.

Most (or maybe all) online tutorials only show how to plot the clustering results; none of them show how to print out the sentences/documents in the clusters. I have figured out how to print the sentences in each cluster (I'm using Python).

My questions are:

  1. How can I print out the sentence/document at the center point of each cluster?
  2. How can I print out the sentences ordered by their distance to the center point of the cluster?

Can anyone help me with this issue?

Many thanks in advance!

My code:

    # print centers of the clusters
    centers = kmeans.cluster_centers_

    centroidpoint = pca.transform(centers)

    print("Centers- Kmeans")
    print(centers)

The output looks like this:

    Centers- Kmeans
    [[0.0752584  0.08675878 0.03207847 ... 0.10317419 0.07130289 0.0322413 ]
     [0.06198343 0.07327988 0.05582789 ... 0.10588244 0.0630549  0.03647455]
     ...

How can I find the sentence at the center point of each cluster, instead of just printing the vector values of the cluster center?
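A K-means center is the mean of the vectors in its cluster, so it generally does not coincide with any actual sentence. A common workaround is to report the sentence whose vector lies closest to each center. Below is a minimal sketch of that idea, not a definitive solution. It assumes `X` is the feature matrix that was passed to `kmeans.fit` and `sentences` is the list of texts in the same row order; both names and the toy data are hypothetical stand-ins. It uses scikit-learn's `pairwise_distances_argmin_min` with its default Euclidean metric.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# Toy stand-ins (assumptions): X plays the role of the sentence vectors
# given to K-means, sentences the role of data1 in the question's code.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0],
              [10.0, 10.0], [10.1, 10.0], [11.0, 10.0]])
sentences = ["a0", "a1", "a2", "b0", "b1", "b2"]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# For every center: index of the nearest row of X and the distance to it.
closest_idx, closest_dist = pairwise_distances_argmin_min(
    kmeans.cluster_centers_, X
)

# Map each cluster id to the sentence nearest to its center.
representatives = {}
for cluster_id, (idx, dist) in enumerate(zip(closest_idx, closest_dist)):
    representatives[cluster_id] = sentences[idx]
    print(f"CLUSTER {cluster_id}: {sentences[idx]} (distance {dist:.3f})")
```

The distances should be measured in the same space K-means was fitted on; if the clustering ran on the CNN features, the PCA-transformed points used for plotting would give different neighbours.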

    # print out the sentences in each cluster
    centroid_list = kmeans.cluster_centers_
    labels = kmeans.labels_
    n_clusters_ = len(centroid_list)

    # print "cluster centroids:",centroid_list

    print(labels)

    # collect the row indices that belong to each cluster
    cluster_menmbers_list = []
    for i in range(0, n_clusters_):
        menmbers_list = []
        for j in range(0, len(labels)):
            if labels[j] == i:
                menmbers_list.append(j)
        cluster_menmbers_list.append(menmbers_list)

    # print cluster_menmbers_list
    for i in range(0, len(cluster_menmbers_list)):
        print("CLUSTER" + " " + str(i) + ':')
        for j in range(0, len(cluster_menmbers_list[i])):
            a = cluster_menmbers_list[i][j]
            print(data1[a])

The output looks like:

    cluster 0:
    sentence1
    sentence2
    sentence3
    ...

    cluster 1:
    sentence1
    sentence2
    sentence3

But these sentences are not ordered by their distance to the center of the cluster, so they look very scattered.

How can I print out, say, the top 20 or top 30 sentences that are nearest to the center of each cluster?

Many thanks in advance!
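One possible way to get such a ranking, sketched under assumptions: scikit-learn's `KMeans.transform` returns the distance from every sample to every cluster center, so the members of each cluster can be sorted by the column belonging to their own cluster and cut off at the first N. As before, `X`, `sentences`, and `top_n` are hypothetical names, and the toy data only stands in for the real sentence vectors and `data1`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins (assumptions) for the sentence vectors and the texts.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0],
              [10.0, 10.0], [10.1, 10.0], [11.0, 10.0]])
sentences = ["a0", "a1", "a2", "b0", "b1", "b2"]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Shape (n_samples, n_clusters): distance of each sample to each center.
dist_to_centers = kmeans.transform(X)

top_n = 2  # would be 20 or 30 on real data
nearest_per_cluster = {}
for cluster_id in range(kmeans.n_clusters):
    # Row indices of the samples assigned to this cluster.
    members = np.where(labels == cluster_id)[0]
    # Distances of those members to their own center.
    member_dist = dist_to_centers[members, cluster_id]
    # Members sorted from nearest to farthest, truncated to top_n.
    order = members[np.argsort(member_dist)][:top_n]
    nearest_per_cluster[cluster_id] = [sentences[i] for i in order]

    print(f"CLUSTER {cluster_id}:")
    for i in order:
        print(f"  {dist_to_centers[i, cluster_id]:.3f}  {sentences[i]}")
```

The first entry of each list is also the sentence closest to the center, so the same loop covers both questions. If a cluster has fewer than `top_n` members, the slice simply returns all of them.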

https://stackoverflow.com/questions/66477417/how-to-print-out-the-short-text-by-their-distance-to-center-point-in-each-cluste March 04, 2021 at 10:48PM
