C# short URL compression algorithm and analysis of how short URLs work

  • 2021-01-14 06:27:06
  • OfStack

Short URL services have become popular on China's major microblogs, for example url.cn on QQ Weibo and sinaurl.cn on Sina Weibo.

When we post a URL on QQ Weibo, the microblog automatically recognizes it and converts it to a short URL, for example: http://url.cn/2hytQx

Why do this? I think the reasons are as follows:

First, microblog posts are limited to 140 characters. If we post a very long link, it can take up nearly half of the post, which is clearly unacceptable, so short URLs came into being.

Second, short URLs are a good way to manage outbound URLs in our project. Some URLs lead to violent content, advertising, and the like. Once users report such a link, we can make sure it never appears in our application again, because the same URL always produces the same short address under the algorithm.

Third, we can collect traffic and click statistics for a series of websites and find out what most users care about, which helps us make better decisions about the follow-up work of the project.

These three points are purely my personal opinion. I will be using short URLs in the next phase of a project, so I looked into them. Let's take a look at the theory behind the short URL mapping algorithm (from material found on the Internet):

The long URL is hashed with MD5 into a 32-character hexadecimal signature string, which is divided into 4 segments of 8 characters each.
Each of the 4 segments is processed in a loop: the 8 characters are treated as a hexadecimal number and ANDed with 0x3fffffff (30 one-bits), so that anything above 30 bits is discarded.
These 30 bits are divided into 6 groups of 5 bits; each group serves as an index into the character table to select one character, yielding a 6-character string.
The full MD5 string therefore yields four 6-character strings, any one of which can be used as the short URL for the long URL.
The theory is very simple. We cannot claim that a resulting short URL is unique, but since four candidates are available, duplicates should rarely be a problem.
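To make the per-segment step concrete, here is a minimal sketch that encodes a single 8-character hex segment. It assumes C# 9 or later (top-level statements). `EncodeSegment` is a hypothetical helper name, not something from the original material. It follows the masks used in this article's C# method (0x3FFFFFFF, then 0x3D with a 5-bit shift) rather than a plain 5-bit mask.

```csharp
using System;

// Encodes one 8-character hex segment into a 6-character code,
// using the same masks and shift as the article's ShortUrl method.
string EncodeSegment(string hex8)
{
    // Character table from the article's listing: a-z, 0-9, A-Z (62 entries).
    const string table = "abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Parse as unsigned so segments with the top bit set are handled,
    // then keep only the low 30 bits.
    int value = (int)(Convert.ToUInt32(hex8, 16) & 0x3FFFFFFF);

    var code = new char[6];
    for (int j = 0; j < 6; j++)
    {
        code[j] = table[value & 0x3D]; // 0x3D is binary 111101, so the index is at most 61
        value >>= 5;                   // advance 5 bits per character
    }
    return new string(code);
}

Console.WriteLine(EncodeSegment("00000000")); // aaaaaa
Console.WriteLine(EncodeSegment("FFFFFFFF")); // ZZZZZ3
Console.WriteLine(EncodeSegment("00000002")); // aaaaaa
```

The last call shows a side effect of the 0x3D mask: it clears bit 1 of every index, so some distinct segments (here 00000000 and 00000002) map to the same code.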

Let's look at the program section:


// Requires: using System; plus a reference to System.Web (for FormsAuthentication).
public static string[] ShortUrl(string url)
{
    // Custom key mixed into the input before MD5 hashing
    string key = "Leejor";
    // Characters used to build the short URL
    string[] chars = new string[]{
        "a", "b", "c", "d", "e", "f", "g", "h",
        "i", "j", "k", "l", "m", "n", "o", "p",
        "q", "r", "s", "t", "u", "v", "w", "x",
        "y", "z", "0", "1", "2", "3", "4", "5",
        "6", "7", "8", "9", "A", "B", "C", "D",
        "E", "F", "G", "H", "I", "J", "K", "L",
        "M", "N", "O", "P", "Q", "R", "S", "T",
        "U", "V", "W", "X", "Y", "Z"
    };
    // MD5-hash the key plus the incoming URL; the result is a 32-character hex string.
    // Note: HashPasswordForStoringInConfigFile is marked obsolete in .NET Framework 4.5 and later.
    string hex = System.Web.Security.FormsAuthentication.HashPasswordForStoringInConfigFile(key + url, "md5");
    string[] resUrl = new string[4];
    for (int i = 0; i < 4; i++)
    {
        // Take the hash 8 hex characters at a time, parse them as base 16,
        // and AND with 0x3FFFFFFF to keep the low 30 bits
        int hexint = 0x3FFFFFFF & Convert.ToInt32("0x" + hex.Substring(i * 8, 8), 16);
        string outChars = string.Empty;
        for (int j = 0; j < 6; j++)
        {
            // AND the value with 0x0000003D (decimal 61) to get an index into chars
            int index = 0x0000003D & hexint;
            // Append the selected character
            outChars += chars[index];
            // Shift right by 5 bits on each iteration
            hexint = hexint >> 5;
        }
        // Store the string at the corresponding index of the output array
        resUrl[i] = outChars;
    }
    return resUrl;
}
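The listing above depends on System.Web, which is not available on newer .NET runtimes. As a hedged alternative, the hashing step could be done with System.Security.Cryptography.MD5. The sketch below assumes C# 9 or later (top-level statements), and `Md5Hex` is a hypothetical helper name. It is my understanding, not something the source states, that HashPasswordForStoringInConfigFile hashes the UTF-8 bytes of the input and returns uppercase hex, so treat the equivalence of the two as an assumption to verify.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Returns the MD5 digest of the UTF-8 bytes of `input` as 32 uppercase hex characters.
string Md5Hex(string input)
{
    using (MD5 md5 = MD5.Create())
    {
        byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(input));
        // BitConverter.ToString yields "D4-1D-8C-..."; strip the dashes to get plain hex.
        return BitConverter.ToString(digest).Replace("-", "");
    }
}

string key = "Leejor"; // the key used in the article's listing
string hex = Md5Hex(key + "https://www.ofstack.com");
Console.WriteLine(hex.Length); // 32
```

The resulting string can then be split into four 8-character segments exactly as in the loop above.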

Calling this method directly yields the following four values:


ShortUrl("https://www.ofstack.com")[0];  // returns fAVfui
ShortUrl("https://www.ofstack.com")[1];  // returns 3ayQry
ShortUrl("https://www.ofstack.com")[2];  // returns UZzyUr
ShortUrl("https://www.ofstack.com")[3];  // returns 36rQZn


I recommend TTServer for storing the URL data. For those who have not heard of TTServer, here is an introduction to the underlying database:

Tokyo Cabinet is a DBM database developed by Mikio Hirabayashi (平林幹雄) of Japan (note: the well-known DBM database QDBM is also his work). It reads and writes very fast. insert: 0.4 sec / 1,000,000 records (2,500,000 qps), that is, writing 1 million records takes only 0.4 seconds. search: 0.33 sec / 1,000,000 records (3,000,000 qps), that is, reading 1 million records takes only about 0.33 seconds.

For key/value dictionary lookups, this database is one of the most efficient I have seen so far, and it is lightweight enough to be a perfect fit for short URL / long URL mapping.

The system uses six short-code characters to represent addresses of any length. Valid characters are ASCII 'A' through 'Z' and '0' through '5', so each character has 2^5 (32) states. Six short-code characters can therefore represent 32^6 (1,073,741,824) URLs.
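The arithmetic can be checked with a few lines of code. This is a minimal sketch assuming C# 9 or later (top-level statements); `CodeSpace` is a hypothetical helper name. The second call uses the 62-character table from the earlier C# listing for comparison.

```csharp
using System;

// Number of distinct codes of a given length over an alphabet of a given size.
long CodeSpace(int alphabetSize, int length)
{
    long total = 1;
    for (int i = 0; i < length; i++)
    {
        total = checked(total * alphabetSize); // throw instead of silently overflowing
    }
    return total;
}

Console.WriteLine(CodeSpace(32, 6)); // 1073741824
Console.WriteLine(CodeSpace(62, 6)); // 56800235584
```

The 62^6 figure is only an upper bound for a 62-character table: it is reached only if the encoding can actually produce every index.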

First, you need a database table to store and retrieve your mapped URLs.


CREATE TABLE mappedURL (
    shortCode CHAR(6) NOT NULL,
    longURL TEXT NOT NULL,
    CONSTRAINT shortCodeInd PRIMARY KEY (shortCode)
);

Second, you need to define an algorithm that maps long URLs to short URLs. The algorithm was described above.

Third, you need to create a web page that looks up the original URL in the database's short URL mapping and redirects to it.
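The store-and-look-up steps can be sketched without a real database. The example below is only an illustration, assuming C# 9 or later (top-level statements): a Dictionary stands in for the mappedURL table, and `Register` and `Resolve` are hypothetical names. `Register` tries the four candidate codes in order, which is one possible way to use the four values the algorithm produces. The source does not prescribe this strategy.

```csharp
using System;
using System.Collections.Generic;

// In-memory stand-in for the mappedURL table: shortCode -> longURL.
var table = new Dictionary<string, string>();

// Tries each candidate code in order. Returns the code that was stored,
// or that already maps to longUrl. Returns null if every candidate is
// already taken by a different URL.
string Register(string longUrl, string[] candidates)
{
    foreach (string code in candidates)
    {
        if (!table.TryGetValue(code, out string existing))
        {
            table[code] = longUrl; // free slot: claim it
            return code;
        }
        if (existing == longUrl)
        {
            return code;           // this URL was registered before
        }
    }
    return null;
}

// Looks up the long URL for a short code; null means "not found",
// in which case a real page would answer 404 instead of redirecting.
string Resolve(string code)
{
    return table.TryGetValue(code, out string url) ? url : null;
}

// The four candidate values listed in the article.
string[] first = { "fAVfui", "3ayQry", "UZzyUr", "36rQZn" };
Console.WriteLine(Register("https://www.ofstack.com", first)); // fAVfui

// Hypothetical second URL whose first candidate clashes with the one above.
string[] second = { "fAVfui", "Qx7abc", "mmmmmm", "nnnnnn" };
Console.WriteLine(Register("https://example.com/other", second)); // Qx7abc
Console.WriteLine(Resolve("fAVfui")); // https://www.ofstack.com
```

In a web application, `Resolve` would sit behind the redirect page and issue an HTTP redirect to the returned URL.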

-- -- -- -- -- -- -- -- -- -

MD5 has been broken, so we cannot rule out an attacker forging a URL with the same MD5 for malicious purposes. Leaving that case aside, the probability of an accidental MD5 collision is extremely low, and you and I will probably never see one in our lifetimes.

Also, I'm not sure what practical benefit there is in requiring that the same URL always produce the same key. Even if the same URL ends up with different keys, that is not much of a waste, is it? Six alphanumeric characters alone can accommodate billions of combinations.

I ask this because of the concern about MD5 collisions.

The amount of URL data and associated records to be stored is very large.
Also, some URLs can be quite long, so use a text field.
If the hash value is stored as the lookup key, querying by this key is very convenient and fast.

It is like objects and their hashes in git: in practice there are no conflicts at all.

How are URL shortening services such as bit.ly implemented?
Do you need to look up the URL from the hash key? If so, the URL must be stored somewhere keyed by its hash, which raises the question of what happens when two hashes conflict.
MD5 is a 128-bit hash code (4 integers of 4 bytes each). A URL's MD5 code therefore has 2^128 possible values. The probability that the MD5 codes of two URLs are equal is 1 / 2^128, i.e. r = 2^-128

If URLs are inserted into the database keyed by their MD5, the first insert cannot produce a duplicate. The second insert duplicates the first with probability r. The third insert duplicates an earlier one with probability 2×r, and so on: the n-th insert produces a duplicate with probability (n-1)×r. Among n MD5 codes, the probability that some two are equal is approximately the sum of these probabilities: (1 + 2 + 3 + ... + (n-1)) × r = (1/2) × (n-1) × n × r

For n MD5 codes, the probability of a duplicate is therefore approximately (1/2) × (n / 2^64)^2

So collisions only need to be considered when n is large enough to be comparable to 2^64, and 2^64 is still a very large number.
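The estimate above is easy to evaluate numerically. This is a minimal sketch assuming C# 9 or later (top-level statements); `CollisionEstimate` is a hypothetical helper name. It implements only the approximation n(n-1)/2 × r from the text, which is meaningful only while the result stays well below 1. The `bits` parameter is my addition, so the same formula can be applied to the 30 bits each short code keeps after the 0x3FFFFFFF mask.

```csharp
using System;

// Approximate probability that n random values from a space of 2^bits
// contain a duplicate, using the estimate n(n-1)/2 * r with r = 2^-bits.
double CollisionEstimate(double n, int bits)
{
    double r = Math.Pow(2, -bits);
    return n * (n - 1) / 2 * r;
}

// One billion full 128-bit MD5 values: roughly 1.5e-21.
Console.WriteLine(CollisionEstimate(1e9, 128));

// One thousand 30-bit values: roughly 4.7e-4.
Console.WriteLine(CollisionEstimate(1000, 30));
```

The second figure suggests that the 2^64 argument applies to full MD5 digests. A 6-character code derived from only 30 bits of the digest would see duplicates far sooner, which is presumably where the four candidate codes and a uniqueness check in the database come in.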

Therefore, as long as there is no malicious attack, ordinary applications are unlikely to encounter a collision.

