Simulating Sina Weibo login in Python for a Sina Weibo crawler

2020-04-02 13:17:02
OfStack

1. Main function (WeiboMain.py):


import urllib2
import cookielib
import WeiboEncode
import WeiboSearch

if __name__ == '__main__':
    # Email (account name) and password; replace the placeholders with real credentials
    weiboLogin = WeiboLogin('xxx@gmail.com', 'xxxx')
    if weiboLogin.Login():
        print "Login successful!"

The first two imports load Python's network modules (urllib2 and cookielib); the other two load the files WeiboEncode.py and WeiboSearch.py, described later. The main function creates a login object and then logs in.

2. WeiboLogin class (WeiboMain.py):


class WeiboLogin:
    def __init__(self, user, pwd, enableProxy=False):
        "Initialize WeiboLogin. enableProxy indicates whether a proxy server is used; it is off by default."
        print "Initializing WeiboLogin..."
        self.userName = user
        self.passWord = pwd
        self.enableProxy = enableProxy

        self.serverUrl = "http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=&rsakt=mod&client=ssologin.js(v1.4.11)&_=1379834957683"
        self.loginUrl = "http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.11)"
        self.postHeader = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0'}

The initialization function defines two key URL members. self.serverUrl is used for the first step of login (fetching servertime, nonce, and so on), which corresponds to steps 1 and 2 of the analyzed Sina Weibo login process. self.loginUrl is used for the second step (POSTing to this URL after encoding the user name and encrypting the password, with self.postHeader as the POST headers), which corresponds to step 3 of that process. The class has three more functions:


def Login(self):
    "Log in to Sina Weibo"
    self.EnableCookie(self.enableProxy)  # Configure cookies and, optionally, the proxy server

    serverTime, nonce, pubkey, rsakv = self.GetServerTime()  # Step 1 of login
    postData = WeiboEncode.PostEncode(self.userName, self.passWord, serverTime, nonce, pubkey, rsakv)  # Encode the user name and encrypt the password
    print "Post data length:\n", len(postData)
    req = urllib2.Request(self.loginUrl, postData, self.postHeader)
    print "Posting request..."
    result = urllib2.urlopen(req)  # Step 2 of login (step 3 of the analyzed login process)
    text = result.read()
    try:
        loginUrl = WeiboSearch.sRedirectData(text)  # Parse the redirect result
        urllib2.urlopen(loginUrl)
    except Exception:
        print 'Login error!'
        return False

    print 'Login sucess!'
    return True

self.EnableCookie sets up cookie handling and, optionally, a proxy server (many free proxy servers are available online). Login then performs the first step: it requests serverTime and related values from the Sina server and uses them to encode the user name, encrypt the password, and build the POST request. In the second step it sends the user name and password to self.loginUrl. The response contains redirect information, from which the final redirect URL is parsed. When that URL is opened, the server writes the user's login information into the cookie, and the login succeeds.


def EnableCookie(self, enableProxy):
    "Enable cookie & proxy (if needed)."

    cookiejar = cookielib.LWPCookieJar()  # Create the cookie jar
    cookie_support = urllib2.HTTPCookieProcessor(cookiejar)
    if enableProxy:
        proxy_support = urllib2.ProxyHandler({'http':'http://xxxxx.pac'})  # Use a proxy
        opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
        print "Proxy enabled"
    else:
        opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
    urllib2.install_opener(opener)  # Install the cookie-aware opener globally

The EnableCookie function is fairly simple.
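
The listings in this article target Python 2 (urllib2 and cookielib). As a minimal sketch of the same cookie and proxy setup on Python 3, assuming the standard-library modules http.cookiejar and urllib.request, something like the following could be used. The helper name build_cookie_opener and its proxy_url parameter are hypothetical and are not part of the original code.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener(proxy_url=None):
    """Return (opener, cookie_jar); the opener stores cookies in cookie_jar.

    proxy_url is a hypothetical placeholder, e.g. "http://127.0.0.1:8080".
    """
    cookie_jar = http.cookiejar.LWPCookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(cookie_jar)]
    if proxy_url:
        # Route plain HTTP requests through the given proxy.
        handlers.insert(0, urllib.request.ProxyHandler({"http": proxy_url}))
    opener = urllib.request.build_opener(*handlers)
    return opener, cookie_jar


opener, cookie_jar = build_cookie_opener()
# After this call, every urllib.request.urlopen() shares the same cookie jar.
urllib.request.install_opener(opener)
```

Returning the cookie jar alongside the opener makes it possible to inspect or save the cookies later, which the Python 2 version above does not expose.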


def GetServerTime(self):
    "Get server time and nonce, which are used to encode the password"

    print "Getting server time and nonce..."
    serverData = urllib2.urlopen(self.serverUrl).read()  # Fetch the prelogin response
    print serverData
    try:
        serverTime, nonce, pubkey, rsakv = WeiboSearch.sServerData(serverData)  # Parse serverTime, nonce, etc.
        return serverTime, nonce, pubkey, rsakv
    except Exception:
        print 'Get server time & nonce error!'
        return None

The functions in the WeiboSearch file parse the data returned by the server and are relatively simple.

3. sServerData function (WeiboSearch.py):


import re
import json
def sServerData(serverData):
    "Search the server time & nonce from server data"

    p = re.compile(r'\((.*)\)')  # Capture the JSON inside the callback's parentheses
    jsonData = p.search(serverData).group(1)
    data = json.loads(jsonData)
    serverTime = str(data['servertime'])
    nonce = data['nonce']
    pubkey = data['pubkey']
    rsakv = data['rsakv']
    print "Server time is:", serverTime
    print "Nonce is:", nonce
    return serverTime, nonce, pubkey, rsakv

The parsing relies mainly on regular expressions and JSON, both of which are easy to follow. The function that Login uses to parse the redirect result is also in this file:


def sRedirectData(text):
    p = re.compile(r'location\.replace\([\'"](.*?)[\'"]\)')  # Capture the URL passed to location.replace(...)
    loginUrl = p.search(text).group(1)
    print 'loginUrl:',loginUrl
    return loginUrl
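
The two parsing steps can be tried offline. The sketch below is written for Python 3 and runs both regular expressions against made-up sample strings. SAMPLE_PRELOGIN and SAMPLE_REDIRECT are hypothetical inputs shaped like the responses described in this article, not captured server output, and the function names are illustrative.

```python
import json
import re

# Hypothetical sample responses; every value in them is invented for illustration.
SAMPLE_PRELOGIN = (
    'sinaSSOController.preloginCallBack({"retcode":0,"servertime":1379834957,'
    '"nonce":"ABC123","pubkey":"EB2A38","rsakv":"1330428213"})'
)
SAMPLE_REDIRECT = (
    '<script>location.replace("http://example.com/next?retcode=0");</script>'
)


def parse_prelogin(text):
    """Extract the JSON object from the callback wrapper and return its fields."""
    match = re.search(r'\((.*)\)', text)
    data = json.loads(match.group(1))
    return str(data['servertime']), data['nonce'], data['pubkey'], data['rsakv']


def parse_redirect(text):
    """Return the URL passed to location.replace(...)."""
    match = re.search(r'location\.replace\([\'"](.*?)[\'"]\)', text)
    return match.group(1)


print(parse_prelogin(SAMPLE_PRELOGIN))
print(parse_redirect(SAMPLE_REDIRECT))
```

Note that the parentheses and the dot must be escaped in both patterns. Unescaped, they would be treated as regex grouping and wildcard characters and the patterns would not match the literal text.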

4. Between the first and second steps, the user name and password must be encoded and encrypted (WeiboEncode.py):


import urllib
import base64
import rsa
import binascii
def PostEncode(userName, passWord, serverTime, nonce, pubkey, rsakv):
    "Used to generate POST data"

    encodedUserName = GetUserName(userName)  # The user name is Base64-encoded
    encodedPassWord = get_pwd(passWord, serverTime, nonce, pubkey)  # The password is RSA-encrypted
    postPara = {
        'entry': 'weibo',
        'gateway': '1',
        'from': '',
        'savestate': '7',
        'userticket': '1',
        'ssosimplelogin': '1',
        'vsnf': '1',
        'vsnval': '',
        'su': encodedUserName,
        'service': 'miniblog',
        'servertime': serverTime,
        'nonce': nonce,
        'pwencode': 'rsa2',
        'sp': encodedPassWord,
        'encoding': 'UTF-8',
        'prelt': '115',
        'rsakv': rsakv,     
        'url': 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
        'returntype': 'META'
    }
    postData = urllib.urlencode(postPara)  # URL-encode the parameters
    return postData

The PostEncode function builds the body of the POST request; the fields must match those sent during a real login. The difficult part is encoding the user name and encrypting the password:


def GetUserName(userName):
    "Used to encode user name"

    userNameTemp = urllib.quote(userName)
    userNameEncoded = base64.encodestring(userNameTemp)[:-1]
    return userNameEncoded

def get_pwd(password, servertime, nonce, pubkey):
    rsaPublickey = int(pubkey, 16)
    key = rsa.PublicKey(rsaPublickey, 65537)  # Create the public key
    message = str(servertime) + '\t' + str(nonce) + '\n' + str(password)  # Build the plaintext the same way the login JavaScript file does
    passwd = rsa.encrypt(message, key)  # Encrypt
    passwd = binascii.b2a_hex(passwd)  # Convert the ciphertext to hexadecimal
    return passwd
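
On Python 3 these helpers need small changes: base64.encodestring is no longer available in recent versions, and both the RSA plaintext and the POST body must be bytes. The following is a sketch of the standard-library parts only, under those assumptions. The function names are hypothetical, and the RSA call itself is left out because it depends on the third-party rsa package used above.

```python
import base64
import urllib.parse


def encode_user_name(user_name):
    """URL-quote the user name, then Base64-encode it (the 'su' field)."""
    quoted = urllib.parse.quote(user_name)
    return base64.b64encode(quoted.encode('ascii')).decode('ascii')


def build_password_plaintext(password, servertime, nonce):
    """Join servertime, nonce and password with a tab and a newline, as bytes."""
    message = str(servertime) + '\t' + str(nonce) + '\n' + str(password)
    return message.encode('utf-8')


def encode_post_body(fields):
    """URL-encode a dict of fields into bytes suitable for a POST request."""
    return urllib.parse.urlencode(fields).encode('ascii')


print(encode_user_name('user@example.com'))
print(build_password_plaintext('secret', 1379834957, 'ABC123'))
print(encode_post_body({'entry': 'weibo', 'pwencode': 'rsa2'}))
```

base64.b64encode does not append a trailing newline, so the [:-1] slice from the Python 2 version is not needed here.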

Sina's login process used to encrypt the password with SHA1 and now uses RSA. This may change again, but Python has implementations of the various encryption algorithms, so once the encryption method in use has been identified, the program is relatively easy to write.

At this point the simulated login to Sina Weibo succeeds. Running the program prints:


loginUrl: http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack&ssosavestate=1390390056&ticket=ST-MzQ4NzQ5NTYyMA==-1387798056-xd-284624BFC19FE242BBAE2C39FB3A8CA8&retcode=0
Login sucess!

To crawl information from Weibo, add fetching and parsing modules after the Main function. For example, to read the content of a Weibo page:


htmlContent = urllib2.urlopen(myurl).read()  # Get the full content (HTML) of the page at myurl
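
As a rough Python 3 counterpart of the line above, a small helper could fetch a page through whatever opener has been installed and decode the bytes into text. The name fetch_html and the UTF-8 fallback are assumptions made for this sketch. The data: URL in the usage line is only there so the example runs without network access.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Fetch a page via the installed opener and return its content as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the response headers, else assume UTF-8.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')


# Offline usage example with a data: URL standing in for a real page.
print(fetch_html('data:text/html;charset=utf-8,%3Cp%3Ehi%3C%2Fp%3E'))
```

Because urlopen goes through the globally installed opener, any cookies set during login would be sent with this request as well.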

Different crawler modules can be designed for different requirements. The code for simulating login is given here.