Example: Using PHP cURL to Simulate Login and Scrape Data

  • 2021-09-24 21:32:46
  • OfStack

This article walks through an example of using PHP's cURL extension to simulate a login and scrape data. It is shared here for reference.

With PHP's cURL extension you can simulate a login and then crawl data that is only visible after a user signs in. The process is as follows (a personal summary):

1. First, analyze the HTML source of the login page to obtain the necessary information:

(1) The address of the login page;

(2) The address of the verification code (CAPTCHA) image;

(3) The name of each field submitted in the login form, and the form's submission method;

(4) The address the login form is submitted to;

(5) In addition, the address of the data you want to crawl.
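Items (2) to (4) in the list above can often be read out of the login page programmatically. The following is a minimal sketch using PHP's DOM extension; the markup, the field names and the function name describeLoginForm() are all hypothetical, and a real login page would first be fetched with cURL.

```php
<?php
// Hypothetical login page markup, used here in place of a fetched page.
$html = <<<'HTML'
<form id="login" method="post" action="/member/login.php">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="text" name="seccodeverify">
  <img src="/captcha.php?t=1">
  <input type="hidden" name="formhash" value="abc123">
  <input type="submit" value="Log in">
</form>
HTML;

// Return the action URL, method, named input fields and image addresses
// of the first <form> found in the given HTML.
function describeLoginForm(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is often not well formed, so collect parser
    // warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $form = $doc->getElementsByTagName('form')->item(0);
    if ($form === null) {
        return [];
    }

    $fields = [];
    foreach ($form->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '') {              // unnamed inputs are not submitted
            $fields[$name] = $input->getAttribute('value');
        }
    }

    $images = [];
    foreach ($form->getElementsByTagName('img') as $img) {
        $images[] = $img->getAttribute('src');   // candidate CAPTCHA addresses
    }

    return [
        'action' => $form->getAttribute('action'),
        'method' => strtolower($form->getAttribute('method') ?: 'get'),
        'fields' => $fields,
        'images' => $images,
    ];
}

$info = describeLoginForm($html);
print_r($info);
```

Hidden fields such as the hypothetical formhash above are easy to miss when reading the source by eye, and they usually have to be submitted along with the visible fields.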

2. Fetch the cookies and store them in a file (for websites that rely on cookies):


$login_url = 'http://www.xxxxx';  // Login page address
$cookie_file = dirname(__FILE__)."/pic.cookie";  // Cookie file location (any writable path)
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $login_url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);  // Save received cookies to this file
curl_exec($ch);
curl_close($ch);
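To confirm that step 2 actually stored a session cookie, you can read the cookie file back. libcurl writes it in the tab-separated Netscape cookie format. The sketch below is only an illustration: the helper name readCookieJar() and the sample cookie values are made up, and a temporary file stands in for the real pic.cookie.

```php
<?php
// Parse a Netscape-format cookie file into a name => value array.
function readCookieJar(string $path): array
{
    $cookies = [];
    $lines = @file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return $cookies;                 // file missing or unreadable
    }
    foreach ($lines as $line) {
        if ($line === '') {
            continue;
        }
        // HttpOnly cookies are stored with a "#HttpOnly_" prefix;
        // any other line starting with "#" is a comment.
        if (strpos($line, '#HttpOnly_') === 0) {
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line[0] === '#') {
            continue;
        }
        // Fields: domain, subdomain flag, path, secure flag, expiry, name, value
        $parts = explode("\t", $line);
        if (count($parts) === 7) {
            $cookies[$parts[5]] = $parts[6];
        }
    }
    return $cookies;
}

// Sample jar with made-up values, written to a temporary file.
$jar = tempnam(sys_get_temp_dir(), 'jar');
file_put_contents($jar, implode("\n", [
    '# Netscape HTTP Cookie File',
    "www.example.com\tFALSE\t/\tFALSE\t0\tPHPSESSID\tabc123",
    "#HttpOnly_www.example.com\tFALSE\t/\tFALSE\t0\ttoken\txyz",
]) . "\n");

$cookies = readCookieJar($jar);
print_r($cookies);
unlink($jar);
```

libcurl writes the jar when the handle is cleaned up, so inspect the file only after curl_close() has run (or, on PHP 8, where curl_close() no longer has any effect, after the handle variable has been unset or reassigned).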

3. Fetch the verification code (CAPTCHA) image and store it (for websites that use one):


$verify_url = "http://www.xxxx";   // Verification code address
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $verify_url);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);  // Send the cookies saved in step 2
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$verify_img = curl_exec($ch);
curl_close($ch);
$fp = fopen("./verify/verifyCode.png",'w');  // Save the fetched image to a local file
fwrite($fp, $verify_img);
fclose($fp);

Note:

The program cannot recognize the verification code by itself, so my approach is to fetch the verification code image, save it to a local file, and display it on an HTML page in my own project for the user to read. Once the user has entered the account, password and verification code and clicked the Submit button, the program proceeds to the next step.
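Before saving the response from step 3, it can help to check that it really is an image, since an expired session or a wrong address usually returns an HTML page instead. A minimal sketch that inspects the leading bytes follows; the function name detectImageType() and the sample byte strings are hypothetical stand-ins for what curl_exec() returns.

```php
<?php
// Guess the image type from the leading "magic" bytes; returns null when
// the data is not a recognised image (for example an HTML error page).
function detectImageType(string $bytes): ?string
{
    if (strncmp($bytes, "\x89PNG\r\n\x1a\n", 8) === 0) {
        return 'png';
    }
    if (strncmp($bytes, "\xFF\xD8\xFF", 3) === 0) {
        return 'jpg';
    }
    if (strncmp($bytes, 'GIF87a', 6) === 0 || strncmp($bytes, 'GIF89a', 6) === 0) {
        return 'gif';
    }
    return null;
}

// Stand-ins for what curl_exec() might return in step 3.
$pngBytes  = "\x89PNG\r\n\x1a\n" . 'remaining image data';
$errorPage = '<html><body>Session expired</body></html>';

$pngType   = detectImageType($pngBytes);
$errorType = detectImageType($errorPage);

// Only write the file when the response really is an image,
// and use the detected type as the file extension.
if ($pngType !== null) {
    $target = sys_get_temp_dir() . '/verifyCode.' . $pngType;
    file_put_contents($target, $pngBytes);
}
```

Using the detected type as the extension also covers sites that serve the verification code as a JPEG or GIF rather than a PNG.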

4. Simulate submitting the login form:


$post_url = 'http://www.xxxx';   // Login form submission address
// Form data (field names come from the login form, values from user input);
// http_build_query() URL-encodes each value
$post = http_build_query(array(
    'username'      => $account,
    'password'      => $password,
    'seccodeverify' => $verifyCode,
));
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $post_url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);         // Setting POSTFIELDS makes the request a POST
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);  // Send the cookies saved earlier
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);   // Save any cookies set during login
curl_exec($ch);
curl_close($ch);

5. Grab data:


$data_url = "http://www.xxxx";   // Address where the data is located
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $data_url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // Return the page as a string instead of printing it
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
$data = curl_exec($ch);
curl_close($ch);

At this point, the page at the data address has been fetched and stored in the string variable $data.

Note that what you have fetched is the HTML source of a web page. The string contains the data you want, but also many HTML tags and other content you do not want. To extract the data you need, analyze the HTML of the page that holds it, then use string functions, regular expressions or similar methods to pull it out.
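As an illustration of the regular expression approach, the sketch below pulls name and price pairs out of a small table. The HTML fragment, the class names and the resulting $rows structure are invented for the example; a real page needs a pattern written against its own markup.

```php
<?php
// Hypothetical fragment of a crawled page, standing in for $data.
$data = '<table id="orders">
<tr><td class="name">Widget</td><td class="price">12.50</td></tr>
<tr><td class="name">Gadget</td><td class="price">7.00</td></tr>
</table>';

// Capture the text of each name cell and the number in the price cell
// that follows it. PREG_SET_ORDER groups the captures per table row.
preg_match_all(
    '#<td class="name">(.*?)</td>\s*<td class="price">([\d.]+)</td>#s',
    $data,
    $matches,
    PREG_SET_ORDER
);

$rows = [];
foreach ($matches as $m) {
    $rows[] = [
        'name'  => html_entity_decode(trim($m[1])),  // turn &amp; etc. back into characters
        'price' => (float) $m[2],
    ];
}

print_r($rows);
```

Regular expressions work well for small, regular fragments like this one. For deeply nested or irregular markup, PHP's DOMDocument and DOMXPath classes are usually less fragile.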

The method above works for most websites that use the HTTP protocol. To simulate logging in to a website that uses HTTPS, you need some additional handling:

1. Skip HTTPS certificate verification:


curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);  // Do not verify the server certificate
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);      // Do not check the host name in the certificate

2. Set a user agent:


$UserAgent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.0.04506; .NET CLR 3.5.21022; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
curl_setopt($ch, CURLOPT_USERAGENT, $UserAgent);  // Identify the request as coming from a browser

Note: Without these settings, the simulated login to an HTTPS site is likely to fail.
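The two HTTPS settings can be applied in one call with curl_setopt_array(). The helper below is a sketch rather than part of the original program: the function name applyHttpsOptions(), the example user agent string and the optional $caBundle parameter are assumptions. When a CA bundle path is supplied, the sketch keeps certificate verification switched on instead of skipping it.

```php
<?php
// Apply the user agent and HTTPS options to a cURL handle in one call.
// $caBundle is an optional path to a CA certificate file (hypothetical
// parameter); when it is null, verification is skipped as in the text above.
function applyHttpsOptions($ch, string $userAgent, ?string $caBundle = null): bool
{
    $options = [CURLOPT_USERAGENT => $userAgent];

    if ($caBundle !== null) {
        // Verify the server certificate against the given CA bundle.
        $options[CURLOPT_SSL_VERIFYPEER] = true;
        $options[CURLOPT_SSL_VERIFYHOST] = 2;
        $options[CURLOPT_CAINFO]         = $caBundle;
    } else {
        // Skip certificate and host name checks.
        $options[CURLOPT_SSL_VERIFYPEER] = false;
        $options[CURLOPT_SSL_VERIFYHOST] = 0;
    }

    // Returns false if any single option could not be set.
    return curl_setopt_array($ch, $options);
}

$ch = curl_init();
$ok = applyHttpsOptions($ch, 'Mozilla/5.0 (compatible; ExampleBot/1.0)');
var_dump($ok);
```

Calling the helper on every handle created in steps 2 to 5 keeps the settings consistent across the cookie, verification code, login and data requests.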

The program above usually logs in successfully, but you still need to account for the specifics of the target website. For example, some sites use a different character encoding, so the pages you crawl come out garbled and need to be converted, for example: $data = iconv("gb2312", "utf-8", $data); which converts GB2312-encoded content to UTF-8.

Some websites with high security requirements, such as online banking, put the verification code in an inline frame (iframe). In that case you need to fetch the iframe page first, extract the verification code address from it, and then fetch the verification code.

Some websites (online banking again) submit the form from JavaScript and process the data first, for example by encrypting it, so submitting the raw values directly will not log you in. You have to apply the same processing before submitting. If you can work out exactly what the JavaScript does, for example which encryption algorithm it uses, you can reproduce it and the login will succeed. The key point is that if you cannot work out what it does (say, the data is encrypted but you do not know the algorithm), you cannot reproduce the operation and cannot simulate the login. Online banking is the typical case: a browser security control processes the password and verification code before the JavaScript submits the form, and since we have no idea what it does, we cannot simulate it. So if you think that reading this article will let you simulate a login to online banking, you are being naive; would a bank's website really be that easy to log in to with a script? Of course, if you can crack the online banking control, that is another matter.

As for why I feel so strongly about this: I ran into exactly this problem myself. I won't go on; it still hurts to talk about.
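For the encoding problem mentioned above, the charset often does not have to be hard-coded, because many pages declare it in a meta tag. The sketch below reads that declaration and converts the page to UTF-8 with iconv(). The function name toUtf8() and the GBK fallback are assumptions, and the sample page is generated in the script rather than crawled.

```php
<?php
// Convert a crawled page to UTF-8, using the charset the page declares
// and falling back to $fallback when no declaration is found.
function toUtf8(string $html, string $fallback = 'GBK'): string
{
    $charset = $fallback;
    // Matches <meta charset="gbk"> as well as
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
    if (preg_match('/charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        $charset = $m[1];
    }
    if (strcasecmp($charset, 'utf-8') === 0 || strcasecmp($charset, 'utf8') === 0) {
        return $html;                    // already UTF-8
    }
    // GB2312 is a subset of GBK, so decoding as GBK handles both.
    if (strcasecmp($charset, 'gb2312') === 0) {
        $charset = 'GBK';
    }
    $converted = @iconv($charset, 'UTF-8', $html);
    return $converted === false ? $html : $converted;
}

// Build a GBK-encoded sample page from a UTF-8 string ("\u{4F60}\u{597D}" is "你好").
$utf8Page = "<meta charset=\"gb2312\"><p>\u{4F60}\u{597D}</p>";
$gbkPage  = iconv('UTF-8', 'GBK', $utf8Page);

$result = toUtf8($gbkPage);
var_dump($result === $utf8Page);
```

If the page declares no charset and the fallback is wrong, the text will still come out garbled, so it is worth checking a known string on the page after conversion.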


I hope this article helps with your PHP programming.
