Receiving incomplete data when sending HTTP requests with Python sockets
- 2020-04-02 14:33:31
- OfStack
For work, I needed to write a crawler-like network collector in Python. Python's urllib module is more convenient and concise, but this job had some low-level requirements, such as manually setting the User-Agent and Referer headers, so I chose to build it directly on socket. This approach requires familiarity with the HTTP protocol, which I won't cover here. The complete Python code is as follows:
#!/usr/bin/env python
import socket

host = "www.baidu.com"
se = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
se.connect((host, 80))
# HTTP header lines end with CRLF ("\r\n"); an empty line ends the request.
se.sendall(b"GET / HTTP/1.1\r\n")
se.sendall(b"Accept: text/html,application/xhtml+xml,*/*;q=0.8\r\n")
# se.sendall(b"Accept-Encoding: gzip,deflate,sdch\r\n")
se.sendall(b"Accept-Language: zh-CN,zh;q=0.8,en;q=0.6\r\n")
se.sendall(b"Cache-Control: max-age=0\r\n")
se.sendall(b"Connection: keep-alive\r\n")
se.sendall(b"Host: " + host.encode("ascii") + b"\r\n")
se.sendall(b"Referer: http://www.baidu.com/\r\n")
se.sendall(b"User-Agent: Googlebot\r\n\r\n")
# A single recv() returns at most 1024 bytes of whatever has arrived so far.
print(se.recv(1024))
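The request above is just text with one header per line, so it can also be assembled in one place before sending. The following is a minimal sketch of that idea, not part of the original collector: build_request is a hypothetical helper name, and the Connection: close header in the usage example is my own assumption rather than something the original code sends.

```python
def build_request(host, path="/", headers=None):
    """Return a raw HTTP/1.1 GET request as bytes (hypothetical helper)."""
    lines = ["GET {} HTTP/1.1".format(path), "Host: {}".format(host)]
    for name, value in (headers or {}).items():
        lines.append("{}: {}".format(name, value))
    # Every line ends with CRLF, and an empty line terminates the header block.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


request = build_request(
    "www.baidu.com",
    headers={
        "User-Agent": "Googlebot",
        "Referer": "http://www.baidu.com/",
        "Connection": "close",  # assumption: ask the server to close after replying
    },
)
print(request.decode("ascii"))
```

The resulting bytes could then be passed to a single se.sendall(request) call instead of one call per header.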
The code ran without errors, but it exposed an important problem: the result contained only the HTTP headers, not the content of the web page. I searched online for a long time and found nothing. After a night of thinking, an explanation occurred to me: the resource I requested is probably large, and the size of an IP packet on the network is limited by many factors, most typically the MTU (maximum transmission unit). The response is therefore split into pieces, and the HTTP headers are only the first piece, while the rest is still in transit or waiting in a buffer. On top of that, recv(1024) returns at most 1024 bytes per call, so one call could not return the whole page anyway. So I read in a loop:
while True:
    buf = se.recv(1024)
    if not buf:  # an empty result means the server closed the connection
        break
    print(buf)
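The same loop can be tried without touching the network. The sketch below is an illustration under my own assumptions: recv_all is a hypothetical helper name, and socket.socketpair() stands in for the real server by sending 5000 bytes and then closing its end.

```python
import socket


def recv_all(sock, bufsize=1024):
    """Collect data from sock until the peer closes the connection."""
    chunks = []
    while True:
        buf = sock.recv(bufsize)
        if not buf:  # b"" means the peer closed its side
            break
        chunks.append(buf)
    return b"".join(chunks)


# Local stand-in for a server: one end sends 5000 bytes, then closes.
server, client = socket.socketpair()
payload = b"x" * 5000
server.sendall(payload)
server.close()

first = client.recv(1024)  # a single call returns at most 1024 bytes
rest = recv_all(client)    # the loop picks up everything that remains
client.close()
print(len(first), len(first) + len(rest))
```

Collecting the chunks in a list and joining them once at the end avoids rebuilding a growing bytes object on every iteration.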
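With this loop, all of the requested data is returned. Note that the loop ends only when recv returns an empty result, that is, when the server closes the connection; because the request sends Connection: keep-alive, that may not happen until the server's idle timeout expires. It seems that good network programming requires a solid understanding of the TCP/IP protocols.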
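One way to know when a response is complete without waiting for the server to close the connection is to read the Content-Length header and stop after that many body bytes. The sketch below assumes the server sends Content-Length; it does not handle chunked transfer encoding, which many real servers use. read_response is a hypothetical helper name, and a local socket pair plays the server so the example runs offline.

```python
import socket


def read_response(sock, bufsize=1024):
    """Read one HTTP response; return (header bytes, body bytes)."""
    data = b""
    # Step 1: read until the blank line that ends the header block.
    while b"\r\n\r\n" not in data:
        buf = sock.recv(bufsize)
        if not buf:
            break
        data += buf
    head, _, body = data.partition(b"\r\n\r\n")

    # Step 2: look for Content-Length (header names are case-insensitive).
    length = None
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())

    # Step 3: read the body; without Content-Length, fall back to reading until close.
    while length is None or len(body) < length:
        buf = sock.recv(bufsize)
        if not buf:
            break
        body += buf
    return head, body if length is None else body[:length]


# Local stand-in for a keep-alive server: it replies and leaves the connection open.
server, client = socket.socketpair()
sent_body = b"a" * 3000
server.sendall(
    b"HTTP/1.1 200 OK\r\nContent-Length: 3000\r\nConnection: keep-alive\r\n\r\n"
    + sent_body
)
head, got_body = read_response(client)
print(head.split(b"\r\n")[0], len(got_body))
client.close()
server.close()
```

Because the length check happens before each recv call, the function returns as soon as the announced number of bytes has arrived, even though the peer never closes its end.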