Example analysis of Python crawler DNS parsing cache method

  • 2020-06-03 06:59:00
  • OfStack

This article describes the Python crawler DNS parsing cache method. To share for your reference, specific as follows:

Preface:

This is the core code of DNS parse cache module in Python crawler, which is the code of last year. Now you can have a look if you are interested.

The DNS parsing time for 1 domain name is between 10 and 60 milliseconds, which may seem trivial, but for large 1 point crawlers it is not negligible. For example, if we want to climb sina weibo with 10 million requests under the same domain name (which is not a lot), it will take between 100,000 and 600,000 seconds, or 86,400 seconds in 1 day. This means that the DNS parse item alone took several days, at which point the DNS parse cache was added and the effect became apparent.

Let's put the code directly below, with instructions at the end.

Code:


# encoding=utf-8
# ---------------------------------------
#   Version: 0.1
#   Date: 2016-04-26
#   The author: 9 tea <bone_ace@163.com>
#   Development environment: Win64 + Python 2.7
# ---------------------------------------
import socket
# from gevent import socket
_dnscache = {}
def _setDNSCache():
  """ DNS The cache  """
  def _getaddrinfo(*args, **kwargs):
    if args in _dnscache:
      # print str(args) + " in cache"
      return _dnscache[args]
    else:
      # print str(args) + " not in cache"
      _dnscache[args] = socket._getaddrinfo(*args, **kwargs)
      return _dnscache[args]
  if not hasattr(socket, '_getaddrinfo'):
    socket._getaddrinfo = socket.getaddrinfo
    socket.getaddrinfo = _getaddrinfo

Description:

In fact, it is not difficult to save the cache in socket to avoid repeated retrieval.
You can put the above code in an es25EN_cache.py file and call this in the crawler frame _setDNSCache() The method will do.

One thing to note is that if you use the gevent coroutine, and use it monkey.patch_all() , it should be noted that the crawler has changed to socket in gevent, and DNS parsing cache module should also use socket of gevent.

For more information about Python, please refer to Python Socket Programming Skills Summary, Python Data Structure and Algorithm Tutorial, Python Function Using Skills Summary, Python String Operation Skills Summary, Python Introduction and Advanced Classic Tutorial and Python File and Directory Operation Skills Summary.

I hope this article has been helpful in Python programming.


Related articles: