Thursday, March 25, 2021

Removing Duplicates from Java RDD

Here is my DataSet:

    code,city,airportname,lat,longi
    BLR,Bangalore,HAL,1111,222
    BLR,Bangalore,Int Airport,12344,5677
    BLR,Bangalore,Int Airport,12344,5677
    MUM,Mumbai,Shivaji Airport,55,66
    MUM,Mumbai,Mumbai Int,33,555
    CHN,Chennai,Channai Int Airport4,55,66666
    PUN,Punjab,Punjab Airport,33,77
    HAR,Hariyana,Hariyana Aairport,55,88
    KAS,Kashmir,Kashmir Airport,77,99

As you can see, the row BLR,Bangalore,Int Airport,12344,5677 appears twice in the data. I want to remove the duplicates from the filtered data, but distinct() is not working on the JavaRDD. Can someone please help? Here is my code:

    SparkConf sparkConf = new SparkConf().setAppName("AggreGateByKeyJavaJob").setMaster("local[*]");
    SparkSession spark = SparkSession.builder().appName("EligibilityJob")
            .config(sparkConf)
            .getOrCreate();

    JavaSparkContext javacontext = JavaSparkContext.fromSparkContext(spark.sparkContext());
    Dataset<Row> rowDataset = SparkSession
            .builder()
            .sparkContext(javacontext.sc())
            .getOrCreate()
            .read()
            .option("header", true)
            .format("csv")
            .option("delimiter", ",")
            .load("src/main/resources/airport_data.csv")
            .withColumnRenamed("code", "airportCode")
            .withColumnRenamed("city", "city")
            .withColumnRenamed("airportname", "airportname")
            .withColumnRenamed("lat", "lat")
            .withColumnRenamed("longi", "longi");

    Dataset<AirportDataRow> airportDataset = rowDataset
            .withColumn("lat", rowDataset.col("lat").cast(DataTypes.DoubleType))
            .withColumn("longi", rowDataset.col("longi").cast(DataTypes.DoubleType))
            .repartition(2)
            .as(Encoders.bean(AirportDataRow.class));

    JavaRDD<AirportDataRow> airprtRdd = airportDataset.javaRDD().distinct();
    JavaRDD<AirportDataRow> filteredRdd = airprtRdd.filter(new Function<AirportDataRow, Boolean>() {
        @Override
        public Boolean call(AirportDataRow airportDataRow) throws Exception {
            return airportDataRow != null && airportDataRow.getAirportCode() != null &&
                    !airportDataRow.getAirportCode().equalsIgnoreCase("KAS");
        }
    }).distinct();
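
One possible cause, assuming AirportDataRow (whose source is not shown here) does not override equals() and hashCode(): JavaRDD.distinct() decides which elements are duplicates through their hashCode() and equals(), so two bean instances holding the same field values would still count as different objects. The sketch below illustrates that effect with a plain java.util.HashSet instead of Spark. PlainRow, ValueRow, and countDistinct are hypothetical names made up for this illustration, and the rows carry only two of the five columns.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;

public class DistinctDemo {

    // Hypothetical stand-in for a bean that inherits identity-based
    // equals()/hashCode() from Object.
    static class PlainRow {
        final String airportCode;
        final String airportname;

        PlainRow(String airportCode, String airportname) {
            this.airportCode = airportCode;
            this.airportname = airportname;
        }
    }

    // Same fields, but equality and hash code are derived from the field values.
    static class ValueRow {
        final String airportCode;
        final String airportname;

        ValueRow(String airportCode, String airportname) {
            this.airportCode = airportCode;
            this.airportname = airportname;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof ValueRow)) return false;
            ValueRow that = (ValueRow) other;
            return Objects.equals(airportCode, that.airportCode)
                    && Objects.equals(airportname, that.airportname);
        }

        @Override
        public int hashCode() {
            return Objects.hash(airportCode, airportname);
        }
    }

    // Hash-based de-duplication: relies entirely on hashCode() and equals().
    static <T> int countDistinct(List<T> rows) {
        return new HashSet<>(rows).size();
    }

    public static void main(String[] args) {
        List<PlainRow> plain = Arrays.asList(
                new PlainRow("BLR", "HAL"),
                new PlainRow("BLR", "Int Airport"),
                new PlainRow("BLR", "Int Airport"));
        List<ValueRow> value = Arrays.asList(
                new ValueRow("BLR", "HAL"),
                new ValueRow("BLR", "Int Airport"),
                new ValueRow("BLR", "Int Airport"));

        System.out.println(countDistinct(plain)); // 3: the repeated row survives
        System.out.println(countDistinct(value)); // 2: the repeated row is dropped
    }
}
```

If that assumption holds, adding value-based equals() and hashCode() to AirportDataRow should let distinct() on the JavaRDD drop the repeated row. Another option is to de-duplicate before leaving the Dataset API, for example by calling rowDataset.dropDuplicates() or airportDataset.distinct(), since Dataset rows are compared by column values.
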
https://stackoverflow.com/questions/66809641/removing-duplicates-from-java-rdd March 26, 2021 at 09:04AM
