Simple Realization Method of MySQL Data Table Merging and Duplication Removal
- 2021-12-04 20:07:33
- OfStack
Scenario:
The crawled data generates a data table with the same structure as the other main table. Merge + duplicate removal is required
Solution: (direct example)
First, create two tables, pep and pep2, where pep is the main table
CREATE TABLE IF NOT EXISTS `pep/pep2`(
`id` INT UNSIGNED AUTO_INCREMENT,
`no` VARCHAR(100) NOT NULL,
PRIMARY KEY ( `id` )
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
Then insert two pieces of data into pep, and one piece of data into pep2 is the same as one piece of data in pep
insert into pep(no) values('abc');
insert into pep(no) values('caa');
insert into pep2(no) values('abc');
Insert data from pep2 into pep
insert into pep (no) select no from pep2;
Grouping de-duplicate to create a new temporary table tmp
create table tmp select id,no from pep group by no;
Note: After creating the id field type of this table, it is no longer primary key self-increment
It may also report errors
```Syntax error or access violation: 1055 Expression #1 of SELECT
list is not in GROUP BY clause and contains nonaggregated
column 'XXX.Y.ZZZZ' which is not functionally dependent on
columns in GROUP BY clause; this is incompatible with
sql_mode=only_full_group_by
```
Resolve: Execute the following two commands:
```
mysql> set global sql_mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
mysql> set session sql_mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
```
Delete the pep table and rename the tmp table to pep
drop table pep;
alter table tmp rename to pep;
Looking at the desc structure and select * from pep, it is found that the field type of id has changed, and it needs to be changed back to the original type here;
alter table pep add primary key (id);
alter table pep modify id int auto_increment;
You can also use join to do duplicate removal. Faster, you can also add a field (which can be several fields + md5 values), create a only 1 index unique for this field, and automatically filter out duplicate data when inserting data in the future.
Summarize