MySQL distinct statement to query duplicate records and related performance discussions

  • 2020-12-16 06:09:49
  • OfStack

In MySQL queries, duplicate values may be included. This is not a problem, but sometimes you might want to just list the different (distinct) values.

The keyword DISTINCT is used to return the only value that is different. The usage is also simple:


SELECT DISTINCT * FROM tableName

The keyword DISTINCT is used to filter out redundant duplicates and only one is kept.

In addition, if you want to de-weight a field, try:


SELECT *, COUNT(DISTINCT nowamagic) FROM table GROUP BY nowamagic

The MySQL version should not be too low for this usage.

We should even sort the filter criteria before writing the query, and the really efficient criteria (there may be multiple tables involved) are the main drivers of the query, with the inefficient criteria only playing a secondary role. So what are the criteria for defining an efficient filtering condition? First, see if the filtering conditions can reduce the amount of data that must be processed as quickly as possible. So, we have to pay more attention to the way conditions are written.
Let's say we have four tables: customers, orders, orderdetail, articles. Now let's say SQL has to deal with the problem of finding out all the customers who have ordered batmobile in the last six months living in The city of Gotham. Of course, there are several ways to write this query, and advocates of ANSI SQL might write the following statement:


select distinct c.custname
from customers c
join orders o
on o.custid = c.custid
join orderdetail od
on od.ordid = o.ordid
join articles a
on a.artid = od.artid
where c.city = 'GOTHAM'
and a.artname = 'BATMOBILE'
and o.ordered >= somefunc

Where somefunc is a function that returns the specific date 6 months ago. Note that the distinct is used on this one, as there have been several recent orders for the batmobile considering that a particular customer could be a big buyer.
Regardless of how the optimizer might rewrite this query, let's take a look at what this code means. First, the data from the customers table should only keep records with the city name Gotham. Next, the orders table is searched, which means that the custid field had better have indexes, otherwise the query speed can only be guaranteed if a hash table is created by sorting, merging, or scanning the orders table. For the orders table, also filter for the order date: if the optimizer is smart, it filters out 1 bit of data before joining (join), reducing the amount of data to be processed later. A less intelligent optimizer may do the connection first and then the filter, where specifying filter conditions in the connection can improve performance, such as:


join orders o
on o.custid = c.custid
and a.ordered >= somefunc

Note if it is:


left outer join orders o on
o.custid = c.custid
and a.ordered >= somefunc

The filter on the left table here will fail because it is a left outer join and all columns of the left table will appear in the join result set).
The optimizer is affected by the filter condition even if the filter condition is independent of the connection (join). For example, if the primary key of orderdetail is (ordid, artid), that is, ordid is the first attribute of the index, then we can use the index to find the records related to the order. Unfortunately, if the primary key is artid (ordid, ordid) (note that in terms of relational theory, either version is exactly the same), the access efficiency ratio (ordid, artid) is worse as an index, and even some database products cannot use the index (note 3). The only hope for 1 is to add a separate index to ordid.
Once you have joined tables orderdetail and orders, look at the articles table, which should not be a problem because table order includes the artid field. Finally, check that the value in articles is Batmobile. That's the end of the query, because with distinct, the customer names that have been filtered through layers must also be sorted to weed out duplicate items.
Avoiding distinct at the highest level should be a basic rule. The reason is that even if we miss a condition of the join, distinct makes the query perform "seemingly correctly" -- admittedly, it's easy to find duplicates and hard to find inaccuracies, so avoiding distinct at the highest level should be a basic rule.
Finding the results incorrect is harder. For example, if multiple customers happen to be called "Wayne", distinct will eliminate duplicates not only from multiple orders from the same customer, but also from different customers with the same name. In fact, both the unique customer ID and the customer name should be returned to ensure a complete list of batmobile buyers.
To get rid of distinct, consider the following: The customer is in The city of Gohtam and meets the existence test, i.e., has ordered a batmobile in the last 6 months. Note that most (but not all) SQL dialects support the following syntax:


select c.custname
from customers c
where c.city = 'GOTHAM'
and exists (select null
from orders o,
orderdetail od,
articles a
where a.artname = 'BATMOBILE'
and a.artid = od.artid
and od.ordid = o.ordid
and o.custid = c.custid
and o.ordered >= somefunc )

The presence test for the above example may occur multiple times for the same name, but only once for each customer, no matter how many orders he has. Some people think that I'm a little harsh on the ANSI SQL syntax (referring to the "batmobile buyer" example), because the customers table in the above code is no less important. In fact, the key difference is that the customers table in the new query is the only source of the query results (nested subqueries are responsible for finding the customer subset), whereas the previous query used join.
This nested subquery is closely related to the outer select by 10 points. As shown in line 11 (in bold), the subquery refers to the current record of the outer query, so the inner layer subquery is called an associative subquery (correlated subquery).
One weakness of this subquery is that it cannot be executed before the current customer has been identified. If the optimizer does not rewrite this query, it must first identify each customer and then check to see if the existence test is met, one by one. It is very efficient to execute when there are very few customers from the city of Gotham, otherwise the situation will be very bad (at this point, a good optimizer should try another way to execute the query).


select custname
from customers
where city = 'GOTHAM'
and custid in
(select o.custid
from orders o,
orderdetail od,
articles a
where a.artname = 'BATMOBILE'
and a.artid = od.artid
and od.ordid = o.ordid
and o.ordered >= somefunc)

In this example, the inner query is no longer dependent on the outer query; it has become an unrelated subquery (uncorrelated subquery) that only needs to be executed once. Obviously, this code follows the original execution flow. In the first example of this section, you must first search for customers who meet the location criteria (for example, all from GOTHAM), and then examine each order in turn. Now, customers who have ordered the batmobile can get it through an internal query.
However, on closer examination, there are more subtle differences between the two versions of the code. In code with associated subqueries, it is critical that the custid field in the orders table have an index, which is not important for the other 1 piece of code because the index (if any) to be used is the primary key index of the table customers.
You may have noticed that the new version of the query executes the implicit distinct. Indeed, because of a join operation, a subquery may return multiple records about a customer. But duplicates don't matter, because the in condition only checks if the item appears in the list returned by the subquery, and in doesn't care if a value appears in the list once or 100 times. But for the sake of 1, as a whole, the same rules should be applied to the subquery and the main query, that is, the existence test should also be added to the subquery:


select custname
from customers
where city = 'GOTHAM'
and custid in
(select o.custid
from orders o
where o.ordered >= somefunc
and exists (select null
from orderdetail od,
articles a
where a.artname = 'BATMOBILE'
and a.artid = od.artid
and od.ordid = o.ordid))

or


select custname
from customers
where city = 'GOTHAM'
and custid in
(select custid
from orders
where ordered >= somefunc
and ordid in (select od.ordid
from orderdetail od,
articles a
where a.artname = 'BATMOBILE'
and a.artid = od.artid)

Although the nesting becomes deeper and more difficult, whether exists should be chosen within the subquery or in's selection rule is the same: this selection depends on the validity of the date and the commodity conditions. Unless the last six months have been very slow, the name of the item should be the most effective filter, so in is better than exists in the subquery because it is quicker to find all the batmobile orders and then check if the sales occurred in the last six months rather than the other way around. This method would be faster if the artid field of table orderdetail had an index, otherwise, this clever move would be eclipsed.
The choice between in and exists is considered whenever a large number of records are checked for existence.
For the benefit of most SQL dialects, uncorrelated subqueries can be rewritten as embedded views in from clauses. However, it is important to keep in mind that in implicitly excludes duplicate items, which must be explicitly eliminated when the subquery is rewritten as an embedded view in the from clause. Such as:


select custname
from customers
where city = 'GOTHAM'
and custid in
(select o.custid
from orders o,
(select distinct od.ordid
from orderdetail od,
articles a
where a.artname = 'BATMOBILE'
and a.artid = od.artid) x
where o.ordered >= somefunc
and x.ordid = o.ordid)

Summary: Ensuring that the SQL statement returns the correct result is only the first step in establishing the best SQL statement.


Related articles: