Quickly deleting duplicate records in Oracle

  • 2020-05-12 06:23:14
  • OfStack

While I was working on a project, one of my colleagues accidentally loaded all the data into a table a second time. In other words, every record in the table now had one duplicate. The table holds tens of millions of rows and belongs to a production system, so deleting all the records was not an option; the duplicate records had to be removed quickly.

With that situation in mind, the methods below for deleting duplicate records are summarized, together with the advantages and disadvantages of each.

For the sake of presentation, assume the table is called tbl and has three columns, col1, col2 and col3, where col1 and col2 together form the primary key and are indexed.

1. Create temporary tables

You can first copy the data into a temporary table, then delete the data in the original table, and finally insert the data back into the original table. The SQL statements are as follows:

create table tbl_tmp as select distinct * from tbl;
truncate table tbl; -- empty the table
insert into tbl select * from tbl_tmp; -- insert the data from the temporary table back

This approach meets the requirement, but it is obviously slow for a table with tens of millions of records, and on a production system the cost is too high.
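As a rough illustration of the three steps, the sketch below replays them on a tiny in-memory table. It is not Oracle: it assumes SQLite through Python's sqlite3 module as a stand-in, uses made-up sample rows, and replaces truncate with delete from because SQLite has no truncate statement.

```python
import sqlite3

# Assumption: SQLite stands in for Oracle; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("create table tbl (col1 integer, col2 integer, col3 text)")
rows = [(1, 1, "a"), (1, 2, "b"), (2, 1, "c")]
conn.executemany("insert into tbl values (?, ?, ?)", rows + rows)  # every row loaded twice

# Step 1: copy the distinct rows into a temporary table.
conn.execute("create table tbl_tmp as select distinct * from tbl")
# Step 2: empty the original table (Oracle: truncate table tbl).
conn.execute("delete from tbl")
# Step 3: insert the deduplicated rows back, then drop the helper table.
conn.execute("insert into tbl select * from tbl_tmp")
conn.execute("drop table tbl_tmp")

remaining = conn.execute(
    "select col1, col2, col3 from tbl order by col1, col2"
).fetchall()
print(remaining)
```

Note that between steps 2 and 3 the original table is empty, which is part of why this method is hard to justify on a live system.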

2. Use rowid

In Oracle, each record has a rowid. The rowid is unique across the entire database and identifies the data file, block, and row in which Oracle stores the record. In duplicate records the contents of all columns may be the same, but the rowids will not be. The SQL statement reads as follows:

delete from tbl where rowid in (select a.rowid from tbl a, tbl b where a.rowid > b.rowid and a.col1 = b.col1 and a.col2 = b.col2);

If you already know that each record has only one duplicate, this SQL statement applies. However, if each record has N duplicates and N is unknown, consider the following method.
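To see which copy survives, the self-join delete can be tried on a small table. This is only a sketch under assumptions: SQLite (via Python's sqlite3) stands in for Oracle because it also exposes a rowid on ordinary tables, although its rowids are plain integers rather than physical addresses, and the sample rows are invented.

```python
import sqlite3

# Assumption: SQLite stands in for Oracle; rowids here are simply 1, 2, 3, ...
conn = sqlite3.connect(":memory:")
conn.execute("create table tbl (col1 integer, col2 integer, col3 text)")
rows = [(1, 1, "a"), (1, 2, "b"), (2, 1, "c")]
conn.executemany("insert into tbl values (?, ?, ?)", rows + rows)  # rowids 1..6

# Delete every row that has a twin with a smaller rowid.
conn.execute(
    "delete from tbl where rowid in ("
    " select a.rowid from tbl a, tbl b"
    " where a.rowid > b.rowid and a.col1 = b.col1 and a.col2 = b.col2)"
)

kept = conn.execute("select rowid, col1, col2 from tbl order by rowid").fetchall()
print(kept)
```

In this run the first-inserted copy of each (col1, col2) pair is the one left in the table.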

3. Use max or min functions

rowid is used here as well, but unlike the method above it is combined with the max or min function. The SQL statement is as follows:

delete from tbl a where rowid not in (select max(b.rowid) from tbl b where a.col1 = b.col1 and a.col2 = b.col2); -- min can be used instead of max

Or use the following statement:

delete from tbl a where rowid < (select max(b.rowid) from tbl b where a.col1 = b.col1 and a.col2 = b.col2); -- if max is replaced with min, "<" must be replaced with ">"
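A small experiment can show the max-based delete coping with more than one duplicate per record. The sketch below is an assumption-laden stand-in, not Oracle: it runs on SQLite through Python's sqlite3 module with invented rows, and it refers to the outer table by its name (tbl) instead of the alias a, which is a choice made here to stay portable across SQLite versions.

```python
import sqlite3

# Assumption: SQLite stands in for Oracle; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("create table tbl (col1 integer, col2 integer, col3 text)")
rows = [(1, 1, "a"), (1, 2, "b"), (2, 1, "c")]
conn.executemany("insert into tbl values (?, ?, ?)", rows * 3)  # rowids 1..9, three copies each

# Keep only the row with the largest rowid for each (col1, col2) pair.
conn.execute(
    "delete from tbl where rowid not in ("
    " select max(b.rowid) from tbl b"
    " where b.col1 = tbl.col1 and b.col2 = tbl.col2)"
)

kept_rowids = [r[0] for r in conn.execute("select rowid from tbl order by rowid")]
print(kept_rowids)
```

Swapping max for min in the subquery would keep the earliest copy of each pair instead of the latest.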

4. Use group by to improve efficiency

The idea is basically the same as the method above, but group by is used, which reduces the explicit comparison conditions and improves efficiency. The SQL statement is as follows:

delete from tbl where rowid not in (select max(rowid) from tbl t group by t.col1, t.col2);

There is another method, which is better suited to the case where few records in the table have duplicates and the columns are indexed. Assuming that there are indexes on col1 and col2 and that few records in tbl have duplicates, the SQL statement is as follows:

delete from tbl where (col1, col2) in (select col1, col2 from tbl group by col1, col2 having count(*) > 1) and rowid not in (select min(rowid) from tbl group by col1, col2 having count(*) > 1);
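The two group by statements can be compared side by side on a table where only some records are duplicated. What follows is a hedged sketch, not an Oracle session: it assumes SQLite via Python's sqlite3 module (a version with row-value support for the (col1, col2) in (...) form), and fresh_table and kept_rowids are hypothetical helpers introduced only for this illustration.

```python
import sqlite3

def fresh_table():
    # Assumption: SQLite stands in for Oracle; the rows are invented sample data.
    conn = sqlite3.connect(":memory:")
    conn.execute("create table tbl (col1 integer, col2 integer, col3 text)")
    rows = [(1, 1, "a"), (1, 2, "b"), (2, 1, "c")]
    # (1, 1) is stored three times, (1, 2) twice, (2, 1) once; rowids 1..6.
    conn.executemany("insert into tbl values (?, ?, ?)", rows + rows[:2] + rows[:1])
    return conn

def kept_rowids(conn):
    return [r[0] for r in conn.execute("select rowid from tbl order by rowid")]

# Variant A: keep the largest rowid of every (col1, col2) group.
conn_a = fresh_table()
conn_a.execute(
    "delete from tbl where rowid not in ("
    " select max(rowid) from tbl t group by t.col1, t.col2)"
)
result_a = kept_rowids(conn_a)

# Variant B: touch only groups that actually contain duplicates,
# keeping the smallest rowid in each of them.
conn_b = fresh_table()
conn_b.execute(
    "delete from tbl where (col1, col2) in ("
    " select col1, col2 from tbl group by col1, col2 having count(*) > 1)"
    " and rowid not in ("
    " select min(rowid) from tbl group by col1, col2 having count(*) > 1)"
)
result_b = kept_rowids(conn_b)

print(result_a, result_b)
```

Both variants leave one row per (col1, col2) pair; they differ in which copy is kept and in whether non-duplicated rows are examined at all.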

