Simple Realization Method of MySQL Data Table Merging and Duplication Removal

  • 2021-12-04 20:07:33
  • OfStack

Scenario:

The crawled data generates a data table with the same structure as the other main table. Merge + duplicate removal is required

Solution: (direct example)

First, create two tables, pep and pep2, where pep is the main table


CREATE TABLE IF NOT EXISTS `pep/pep2`(
`id` INT UNSIGNED AUTO_INCREMENT,
`no` VARCHAR(100) NOT NULL,
PRIMARY KEY ( `id` )
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

Then insert two pieces of data into pep, and one piece of data into pep2 is the same as one piece of data in pep


insert into pep(no) values('abc');
insert into pep(no) values('caa');

insert into pep2(no) values('abc');

Insert data from pep2 into pep


insert into pep (no) select no from pep2;

Grouping de-duplicate to create a new temporary table tmp


create table tmp select id,no from pep group by no;

Note: After creating the id field type of this table, it is no longer primary key self-increment


  It may also report errors 
 ```Syntax error or access violation: 1055 Expression #1 of SELECT 
 list is not in GROUP BY clause and contains nonaggregated 
 column 'XXX.Y.ZZZZ' which is not functionally dependent on 
 columns in GROUP BY clause; this is incompatible with
 sql_mode=only_full_group_by
 ```
  Resolve: Execute the following two commands: 
 ```
 mysql> set global sql_mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
 
 mysql> set session sql_mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
 ```

Delete the pep table and rename the tmp table to pep


drop table pep;
alter table tmp rename to pep;

Looking at the desc structure and select * from pep, it is found that the field type of id has changed, and it needs to be changed back to the original type here;


alter table pep add primary key (id);
alter table pep modify id int auto_increment;

You can also use join to do duplicate removal. Faster, you can also add a field (which can be several fields + md5 values), create a only 1 index unique for this field, and automatically filter out duplicate data when inserting data in the future.

Summarize


Related articles: