Detailed Explanation of MySQL Removing Duplicate Data Example

2021-08-21 21:42:58
OfStack

There are two meanings of duplicate records, 1 is a completely duplicate record, that is, all fields are duplicate, and 2 is a partially duplicate record. For the first kind of duplication, it is easy to solve, only use distinct keyword to remove duplication in query statement, and almost all database systems support distinct operation. The main reason for this duplication is that the table is poorly designed, which can be avoided by adding primary keys or only 1 index columns to the table.


select distinct * from t;

For the second type of duplicate problem, it is usually required to find any one of the duplicate records. Assuming that the table t has three fields: id, name and address, id is the primary key, and the duplicate fields are name and address, it is required to get a result set with only one of these two fields.

--Oracle, MySQL, using related subqueries


select * from t t1

 where t1.id =

 (select min(t2.id)

  from t t2

  where t1.name = t2.name and t1.address = t2.address);

--Hive only supports subqueries in the FROM clause, which must have a name and columns must be only 1


select t1.*

 from t t1,

   (select name, address, min(id) id from t group by name, address) t2

 where t1.id = t2.id;

--You can also use the row_number () parsing function of hive


select t.id, t.name, t.address

 from (select id, name, address,

row_number() over (distribute by name, address sort by id) as rn 

     from t) t 

 where t.rn=1;

Thank you for reading, hope to help everyone, thank you for your support to this site!