Two kinds of JAVA short url service algorithm

  • 2020-04-01 04:05:22
  • OfStack

Short urls, as the name implies, are Short looking urls. Since twitter launched its shorting service, Internet companies have launched their own shorting service. The biggest advantage of short url is short, few characters, easy to publish, spread, copy and store.

Through a search on the Internet, it seems that two shorting algorithms have been passed around, One is based on the MD5 code and the other is based on the self-incrementing sequence.

1. Based on MD5 code: the short url length calculated by this algorithm is generally 5 or 6 bits, and collisions may occur in the calculation process (with a small probability). The number of expressible urls is 62

To the fifth or sixth power. It feels like Google (link: http://goo.gl), twitter is using this kind of algorithm (guess), may look better.

2. Based on self-increasing sequence: this algorithm is relatively simple to implement, with the possibility of collision being 0, the expressible URL reaching infinity, and the length starting from 1. It seems that baidu's short url service (link: http://dwz.cn/) is this algorithm.

The specific algorithm

1, MD5 code : suppose the length of the url is N

A. Calculate the MD5 code of the long address and divide the 32-bit MD code into 4 segments with 8 characters in each segment

B. Treat the 8 strings obtained by a as a hexadecimal number, and conduct & operation with the binary Numbers represented by N * 6 ones

I get a binary number N by 6

C. Divide the number obtained by b into N segments with 6 bits in each segment, and then conduct & operation on the N 6 digits with 61 respectively to obtain

The number is used as the INDEX to take the corresponding letters or Numbers of the alphabet, which is a short url of length N.


static final char[] DIGITS = 
{ '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g',
'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O',
'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z' };

public String shorten(String longUrl, int urlLength) {
  if (urlLength < 0 || urlLength > 6) {
   throw new IllegalArgumentException("the length of url must be between 0 and 6");
  }
  String md5Hex = DigestUtils.md5Hex(longUrl);
  // 6 digit binary can indicate 62 letter & number from 0-9a-zA-Z
  int binaryLength = urlLength * 6;
  long binaryLengthFixer = Long.valueOf(StringUtils.repeat("1", binaryLength), BINARY);
  for (int i = 0; i < 4; i++) {
   String subString = StringUtils.substring(md5Hex, i * 8, (i + 1) * 8);
   subString = Long.toBinaryString(Long.valueOf(subString, 16) & binaryLengthFixer);
   subString = StringUtils.leftPad(subString, binaryLength, "0");
   StringBuilder sbBuilder = new StringBuilder();
   for (int j = 0; j < urlLength; j++) {
    String subString2 = StringUtils.substring(subString, j * 6, (j + 1) * 6);
    int charIndex = Integer.valueOf(subString2, BINARY) & NUMBER_61;
    sbBuilder.append(DIGITS[charIndex]);
   }
   String shortUrl = sbBuilder.toString();
   if (lookupLong(shortUrl) != null) {
    continue;
   } else {
    return shortUrl;
   }
  }
  // if all 4 possibilities are already exists
  return null;
 }

2. Self-increasing sequence:

A. Or the self-increment of the sequence, representing the value in base 62.


private AtomicLong sequence = new AtomicLong(0);

 @Override
 protected String shorten(String longUrl) {
  long myseq = sequence.incrementAndGet();
  String shortUrl = to62RadixString(myseq);
  return shortUrl;
 }

 private String to62RadixString(long seq) {
  StringBuilder sBuilder = new StringBuilder();
  while (true) {
   int remainder = (int) (seq % 62);
   sBuilder.append(DIGITS[remainder]);
   seq = seq / 62;
   if (seq == 0) {
    break;
   }
  }
  return sBuilder.toString();
 }

The code in MAVEN project USES two maps to simulate the mapping of long and short urls. In practice, it may be implemented based on database table and index or some distributed KV system.

I hope this article is helpful for you to learn short url service.


Related articles: