The use of sphinx in mysql to search for subdomains

  • 2020-05-15 02:17:13
  • OfStack

Search subdomain

For example, a search for ofstack.com will bring up www.ofstack.com, ofstack.com, host.ofstack.com, and so on.

If mysql is used and like is used, the efficiency is very low. There is no way to use millions or even tens of millions of data, so sphinx is adopted to do it.

In the process of use found a lot of problems, it will be summarized here, but also let the unknown friends pay attention to these characters.

Analysis:
sphinx is a full-text index that searches for contained records.

First of all, if we don't set anything up, a search for ofstack.com will bring up aaofstack.com, jb51.a.cn, ofstack.com.com.

Why does this happen?

. We use/search - c configuration file - i index name 'ofstack. com' when search will find the following words part is divided into two parts' jb51 'and' cn ', on by default. As a delimiter, if we don't want to make it as a delimiter, you need to put. Added to the charset_table, there needs to be said is that we search domain name only letters, Numbers, "-" are required, such as characters, other do not. The Settings are as follows:
charset_table = 0..9,A..Z- > a.. z, a.. z,U+002e,U+002d,U+0040,U+0060 # where U+002e stands for ".",U+002d stands for "-",U+0040 stands for "@",U+0060 stands for "' ", here is the ascii code value.
This will result in the discovery of domain names such as jb51.a.cn.

What about ofstack. com. com? We can add the suffix "XXXXX", "concat" (search,'XXXXX') to the field of the index so that it doesn't come out.

Now leaves aaofstack com this kind of domain name, we use the keyword "'". ofstack. com "'" (note that is single quotation marks inside double quotes the search in this way, the main domain name plus alone, searched but found in the process of using this has nothing to do with the domain name of the domain name, such as aa. bb. cn so, the study is the issue of ". "later, then the". "to replace" @ ", to search the appeared, such as 12306, a lot of the problem domain can't read to, Later studies estimated that these special characters all had special meanings in sphinx, so the character "'" was found, and the first cut was normal after replacing it with this one.

Note: after replacing "." with "'", add this character in charset_table, otherwise it will be ignored.

So we need to pay attention to these special characters during the search.


Related articles: