2021年4月29日星期四

How to drop rows from partitioned hive table?

I need to drop specific rows from a Hive table, which is partitioned. These rows for deletion matches certain conditions, so entire partitions can not be dropped in order to do so. Lets say the table Table has three columns: partner, date and source_key, and it is partitioned by date and source_key.

It is known that there is no operation supported for deletion or update of a particular set of records in hive (See How to delete and update a record in Hive) .

Following this solution I successfully executed the following queries in order to keep only the records which matches some given conditions, say: belong to some given range of date, having source_key='heaven' and the column partner<>'angel'...

Create temporary empty copy of the table Table.

CREATE TABLE IF NOT EXISTS tmpTable LIKE Table;  

Fill it with the current rows.

INSERT OVERWRITE TABLE tmpTable  PARTITION (date,source_key)  SELECT * FROM Table  WHERE  date >= '2020-05-01' AND date < '2020-11-30' AND  source_key = 'heaven';  

Delete target partitions.

ALTER TABLE Table DROP IF EXISTS  PARTITION (source_key = 'heaven' , date >= '2020-05-01' , date < '2020-11-30' );  

Insert the edited partitions to the target table. (couldn't manage to insert OVERWRITE because of syntax errors)

INSERT INTO Table  PARTITION (source_key,date)  SELECT * FROM tmpTable  WHERE  partner <> 'angel';  

Drop temporary table.

DROP TABLE IF EXISTS tmpTable;  

The query runs fine. Because the table Table is managed, when the partitions are dropped the hdfs files should be dropped within, but something is wrong (perhaps in the last INSERT INTO statement) because after the execution of all these queries the target table Table keeps all records with partner = 'angel' in the given range of dates and with , basically stays the same.

Where is the fault? What is missing? How to accurately delete specific rows matching certain conditions for such a Hive table?

https://stackoverflow.com/questions/67326939/how-to-drop-rows-from-partitioned-hive-table April 30, 2021 at 09:08AM

没有评论:

发表评论