Using Python to convert between Sina Weibo mid and url values (decimal and base 62)

  • 2020-04-02 13:36:28
  • OfStack

A status does not include its url directly. However, it does contain a mid field, and from the mid we can compute the url.

Before starting the calculation, it is worth explaining what base62 encoding is. It is simply a conversion between decimal and base 62. In base 62, the digits 0 to 9 keep their usual values, 10 is lowercase a, the 26 lowercase letters continue up to z at 35, then 36 is uppercase A, and 61 is uppercase Z. With that, we can implement encode and decode between decimal numbers and base62 strings. The following code is actually from Stack Overflow:


ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(num, alphabet=ALPHABET):
    """Encode a number in Base X

    `num`: The number to encode
    `alphabet`: The alphabet to use for encoding
    """
    if (num == 0):
        return alphabet[0]
    arr = []
    base = len(alphabet)
    while num:
        rem = num % base
        num = num // base
        arr.append(alphabet[rem])
    arr.reverse()
    return ''.join(arr)

def base62_decode(string, alphabet=ALPHABET):
    """Decode a Base X encoded string into the number

    Arguments:
    - `string`: The encoded string
    - `alphabet`: The alphabet to use for encoding
    """
    base = len(alphabet)
    strlen = len(string)
    num = 0

    idx = 0
    for char in string:
        power = (strlen - (idx + 1))
        num += alphabet.index(char) * (base ** power)
        idx += 1

    return num
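
As a quick check of the digit order described earlier (0-9, then a-z, then A-Z), the sketch below re-implements both directions compactly and verifies a round trip. The names `to_base62` and `from_base62` are illustrative and not part of the original code. Decoding here uses Horner's rule instead of explicit powers, which should give the same result.

```python
import string

# Digit order assumed from the article: 0-9, then a-z, then A-Z.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def to_base62(num):
    """Encode a non-negative integer (illustrative helper)."""
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num:
        num, rem = divmod(num, 62)
        digits.append(ALPHABET[rem])
    return ''.join(reversed(digits))

def from_base62(text):
    """Decode with Horner's rule: multiply by 62, then add the next digit."""
    num = 0
    for char in text:
        num = num * 62 + ALPHABET.index(char)
    return num

print(to_base62(35), to_base62(36), to_base62(61))  # z A Z
print(from_base62('10'))                            # 62
print(from_base62(to_base62(8379699)) == 8379699)   # True
```

The boundary values 35, 36 and 61 map to z, A and Z, which matches the alphabet string used by the functions above.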

Let's start with the url-to-mid conversion. A Sina Weibo url has a form such as http://weibo.com/2991905905/z579Hz9Wr. The number in the middle is the user's uid; the part that matters is the trailing string "z579Hz9Wr". The calculation is quite simple: starting from the end, split the string into groups of four characters, which gives:


z
579H
z9Wr

Decode each group from base62 to get its decimal value:


35
1219149
8379699
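
The grouping and decoding step can be sketched as follows. The helper names `split_from_end` and `decode` are illustrative, not from the original code; only the alphabet and the example string come from the article.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def split_from_end(text, width):
    """Split text into chunks of `width` characters, counting from the end."""
    first = len(text) % width  # length of the short leading chunk, if any
    groups = [text[:first]] if first else []
    groups += [text[i:i + width] for i in range(first, len(text), width)]
    return groups

def decode(group):
    """Base62-decode one group using the digit order in ALPHABET."""
    num = 0
    for char in group:
        num = num * 62 + ALPHABET.index(char)
    return num

groups = split_from_end("z579Hz9Wr", 4)
print(groups)                       # ['z', '579H', 'z9Wr']
print([decode(g) for g in groups])  # [35, 1219149, 8379699]
```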

Concatenate them to get the mid: "3512191498379699". One important detail: for every group except the first, if the decoded decimal number has fewer than 7 digits, it must be left-padded with zeros. For example, if the decoded numbers are 35, 33040 and 8906190, two zeros must be added in front of 33040, giving "3500330408906190".
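
A minimal sketch of this padding rule, using the two sets of decoded values mentioned above. The function name `join_groups` is an assumption made for illustration.

```python
def join_groups(values):
    """Join decoded groups into a mid string.

    The first group is kept as-is; every later group is left-padded
    with zeros to 7 digits.
    """
    head, tail = values[0], values[1:]
    return str(head) + ''.join(str(v).zfill(7) for v in tail)

print(join_groups([35, 1219149, 8379699]))  # 3512191498379699
print(join_groups([35, 33040, 8906190]))    # 3500330408906190
```

Without the padding, the second example would collapse to a 14-digit number and no longer identify the same status.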
The complete function is as follows:


def url_to_mid(url):
    '''
    >>> url_to_mid('z0JH2lOMb')
    3501756485200075
    >>> url_to_mid('z0Ijpwgk7')
    3501703397689247
    >>> url_to_mid('z0IgABdSn')
    3501701648871479
    >>> url_to_mid('z08AUBmUe')
    3500330408906190
    >>> url_to_mid('z06qL6b28')
    3500247231472384
    >>> url_to_mid('yCtxn8IXR')
    3491700092079471
    >>> url_to_mid('yAt1n2xRa')
    3486913690606804
    '''
    url = str(url)[::-1]
    # Number of 4-character groups, rounded up.
    size = (len(url) + 3) // 4
    result = []
    for i in range(size):
        # Take the i-th group counting from the end and restore its order.
        s = url[i * 4: (i + 1) * 4][::-1]
        s = str(base62_decode(s))
        # Every group except the leading one is left-padded to 7 digits.
        if i < size - 1:
            s = s.zfill(7)
        result.append(s)
    result.reverse()
    return int(''.join(result))

Converting a mid to a url is just as simple: starting from the end of the mid, split it into groups of seven digits, encode each group in base62, and join the results. Again, for every group except the first, if the base62 string is shorter than 4 characters, it must be left-padded with 0.
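
The steps above can be traced on one concrete value. This sketch uses the mid 3500330408906190 from the doctests; the helper names `encode` and `mid_groups` are illustrative, not from the original code.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(num):
    """Base62-encode a non-negative integer using ALPHABET."""
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num:
        num, rem = divmod(num, 62)
        digits.append(ALPHABET[rem])
    return ''.join(reversed(digits))

def mid_groups(mid):
    """Split the decimal digits of mid into groups of 7, counting from the end."""
    digits = str(mid)
    first = len(digits) % 7  # length of the short leading group, if any
    groups = [digits[:first]] if first else []
    groups += [digits[i:i + 7] for i in range(first, len(digits), 7)]
    return groups

groups = mid_groups(3500330408906190)
print(groups)           # ['35', '0033040', '8906190']

encoded = [encode(int(g)) for g in groups]
print(encoded)          # ['z', '8AU', 'BmUe']

# Pad every group except the first to 4 characters.
padded = [encoded[0]] + [e.rjust(4, '0') for e in encoded[1:]]
print(''.join(padded))  # z08AUBmUe
```

The middle group shows why the padding matters: 33040 encodes to the three characters 8AU, and only the padded form 08AU keeps the four-character grouping that decoding relies on.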


def mid_to_url(midint):
    '''
    >>> mid_to_url(3501756485200075)
    'z0JH2lOMb'
    >>> mid_to_url(3501703397689247)
    'z0Ijpwgk7'
    >>> mid_to_url(3501701648871479)
    'z0IgABdSn'
    >>> mid_to_url(3500330408906190)
    'z08AUBmUe'
    >>> mid_to_url(3500247231472384)
    'z06qL6b28'
    >>> mid_to_url(3491700092079471)
    'yCtxn8IXR'
    >>> mid_to_url(3486913690606804)
    'yAt1n2xRa'
    '''
    midint = str(midint)[::-1]
    # Number of 7-digit groups, rounded up.
    size = (len(midint) + 6) // 7
    result = []
    for i in range(size):
        # Take the i-th group counting from the end and restore its order.
        s = midint[i * 7: (i + 1) * 7][::-1]
        s = base62_encode(int(s))
        # Every group except the leading one is left-padded to 4 characters.
        if i < size - 1:
            s = s.rjust(4, '0')
        result.append(s)
    result.reverse()
    return ''.join(result)

Running doctest shows that all the test cases pass.

In the end, I don't quite understand why Sina Weibo doesn't simply include the url among the status fields. The Sina Weibo open platform also has quite a few things that do not follow standards, along with issues such as the refresh token, which I won't enumerate here.

