Here is my DataSet:
code,city,airportname,lat,longi
BLR,Bangalore,HAL,1111,222
BLR,Bangalore,Int Airport,12344,5677
BLR,Bangalore,Int Airport,12344,5677
MUM,Mumbai,Shivaji Airport,55,66
MUM,Mumbai,Mumbai Int,33,555
CHN,Chennai,Channai Int Airport4,55,66666
PUN,Punjab,Punjab Airport,33,77
HAR,Hariyana,Hariyana Aairport,55,88
KAS,Kashmir,Kashmir Airport,77,99
As you can see, the row BLR,Bangalore,Int Airport,12344,5677 appears twice in the data. I want to remove the duplicates from the filtered data, but distinct() is not working on the JavaRDD. Can someone please help? Here is my code:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

SparkConf sparkConf = new SparkConf()
        .setAppName("AggreGateByKeyJavaJob")
        .setMaster("local[*]");

SparkSession spark = SparkSession.builder()
        .appName("EligibilityJob")
        .config(sparkConf)
        .getOrCreate();

JavaSparkContext javacontext = JavaSparkContext.fromSparkContext(spark.sparkContext());

// Read the CSV with a header row and rename "code" to match the bean property.
Dataset<Row> rowDataset = spark.read()
        .option("header", true)
        .option("delimiter", ",")
        .format("csv")
        .load("src/main/resources/airport_data.csv")
        .withColumnRenamed("code", "airportCode");

// Cast the coordinate columns to doubles and map each row onto the bean.
Dataset<AirportDataRow> airportDataset = rowDataset
        .withColumn("lat", rowDataset.col("lat").cast(DataTypes.DoubleType))
        .withColumn("longi", rowDataset.col("longi").cast(DataTypes.DoubleType))
        .repartition(2)
        .as(Encoders.bean(AirportDataRow.class));

JavaRDD<AirportDataRow> airprtRdd = airportDataset.javaRDD().distinct();

// Drop nulls and the KAS row, then call distinct() again;
// the duplicate BLR row still survives both distinct() calls.
JavaRDD<AirportDataRow> filteredRdd = airprtRdd
        .filter(new Function<AirportDataRow, Boolean>() {
            @Override
            public Boolean call(AirportDataRow airportDataRow) throws Exception {
                return airportDataRow != null
                        && airportDataRow.getAirportCode() != null
                        && !airportDataRow.getAirportCode().equalsIgnoreCase("KAS");
            }
        })
        .distinct();
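Note: JavaRDD.distinct() deduplicates by comparing elements with their equals() and hashCode(). If AirportDataRow (not shown above) inherits the identity-based defaults from Object, two separately deserialized copies of the same row will never compare equal, so distinct() keeps both. A minimal sketch of the bean with both methods overridden; the field names are assumptions based on the getters used above, and the getters/setters are omitted:

import java.io.Serializable;
import java.util.Objects;

public class AirportDataRow implements Serializable {
    private String airportCode;
    private String city;
    private String airportname;
    private Double lat;
    private Double longi;

    // getters and setters omitted for brevity

    // distinct() relies on value-based equality, so compare every field.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof AirportDataRow)) return false;
        AirportDataRow that = (AirportDataRow) o;
        return Objects.equals(airportCode, that.airportCode)
                && Objects.equals(city, that.city)
                && Objects.equals(airportname, that.airportname)
                && Objects.equals(lat, that.lat)
                && Objects.equals(longi, that.longi);
    }

    // Must be consistent with equals(): equal rows hash to the same value.
    @Override
    public int hashCode() {
        return Objects.hash(airportCode, city, airportname, lat, longi);
    }
}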
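Alternatively, the deduplication can stay in the Dataset API, which compares rows column by column and therefore does not depend on the bean's equals()/hashCode() at all. A sketch using the built-in dropDuplicates(), reusing the airportDataset variable from above (filteredDataset is a hypothetical name):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lower;

// dropDuplicates() compares rows by column values, so no equals()/hashCode()
// is required; the filter also drops null airport codes, like the RDD version.
Dataset<AirportDataRow> filteredDataset = airportDataset
        .dropDuplicates()
        .filter(lower(col("airportCode")).notEqual("kas"));

If an RDD is still needed afterwards, filteredDataset.javaRDD() returns the already deduplicated rows.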