Basic tutorial on capturing data with Selenium crawlers

  • 2021-06-28 09:29:18
  • OfStack

Before we start

I actually wrote this article a few months ago, but then got busy and forgot about it.

ps: Sometimes things just get delayed.

A few months ago, I remember a friend in the group mentioned that he wanted to use Selenium to crawl data. Crawling data generally means simulating visits to certain fixed websites, scraping the information you are interested in, and then processing the scraped data.

His requirement was to import articles directly into a rich text editor for publishing, which is really just one kind of crawling.

In fact, it is not difficult either; it is essentially a UI automation process. Let's get started.

Tools and materials

1. Java

2. IDEA as the development tool

3. JDK 1.8

4. selenium-server-standalone (version 3.0 or above)
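
If the project is built with Maven, the dependencies can be declared roughly as follows. This is an illustrative sketch: the article itself uses the selenium-server-standalone jar, while Maven projects typically pull in the selenium-java artifact instead; the version numbers shown are examples of 3.x-era releases.

```xml
<!-- Illustrative coordinates; any Selenium 3.x release matches the requirement above -->
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.141.59</version>
</dependency>
<!-- JUnit 4 is needed for the @BeforeClass/@AfterClass/@Test annotations below -->
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.13.2</version>
    <scope>test</scope>
</dependency>
```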

Steps

1. Breaking down the requirement:

The main requirement is to preserve the original formatting and style:

Select and copy the article to be crawled

Paste the copied content into a rich text editor

2. Implementation approach:

Simulate Ctrl+A with a keyboard event to select all

Simulate Ctrl+C with a keyboard event to copy

Simulate Ctrl+V with a keyboard event to paste

3. Example code


import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

import java.awt.AWTException;
import java.awt.Robot;
import java.awt.event.KeyEvent;
import java.util.concurrent.TimeUnit;

/**
 * @author rongrong
 * Selenium crawler example: visit a page, copy its content with simulated
 * keyboard events, and paste it into an online rich text editor.
 */
public class Demo {
    private static WebDriver driver;
    static final int MAX_TIMEOUT_IN_SECONDS = 5;

    @BeforeClass
    public static void setUpBeforeClass() {
        // Requires chromedriver on the PATH (or the webdriver.chrome.driver system property).
        driver = new ChromeDriver();
        String url = "https://temai.snssdk.com/article/feed/index?id=6675245569071383053&subscribe=5501679303&source_type=28&content_type=1&create_user_id=34013&adid=__AID__&tt_group_id=6675245569071383053";
        driver.manage().window().maximize();
        driver.manage().timeouts().implicitlyWait(MAX_TIMEOUT_IN_SECONDS, TimeUnit.SECONDS);
        driver.get(url);
    }

    @AfterClass
    public static void tearDownAfterClass() {
        if (driver != null) {
            System.out.println("Run finished!");
            driver.quit();
        }
    }

    @Test
    public void test() throws InterruptedException, AWTException {
        // Fail fast if the Robot cannot be created (e.g. in a headless
        // environment) instead of continuing with a null reference.
        Robot robot = new Robot();
        // Ctrl+A: select the whole page.
        robot.keyPress(KeyEvent.VK_CONTROL);
        robot.keyPress(KeyEvent.VK_A);
        robot.keyRelease(KeyEvent.VK_A);
        Thread.sleep(2000);
        // Ctrl+C: copy the selection to the clipboard.
        robot.keyPress(KeyEvent.VK_C);
        robot.keyRelease(KeyEvent.VK_C);
        robot.keyRelease(KeyEvent.VK_CONTROL);
        // Open the online UEditor demo and focus its editing area.
        driver.get("https://ueditor.baidu.com/website/onlinedemo.html");
        Thread.sleep(2000);
        driver.switchTo().frame(0); // the editor body lives in the first iframe
        driver.findElement(By.tagName("body")).click();
        // Ctrl+V: paste the copied content, formatting included.
        robot.keyPress(KeyEvent.VK_CONTROL);
        robot.keyPress(KeyEvent.VK_V);
        robot.keyRelease(KeyEvent.VK_V);
        robot.keyRelease(KeyEvent.VK_CONTROL);
        Thread.sleep(2000);
    }
}

Closing remarks

The author does not particularly recommend using Selenium for crawling, for the following reasons:

It is slow:

Every run launches a browser, and initialization has to load a pile of images, render JS, and so on.

It consumes too many resources:

Some people say switching to a headless browser works the same way, but it still launches a browser, and many websites validate request parameters. If the site decides your requests are malicious, it will block them, and then you have to tinker with the request headers again. It is hard to know exactly how much to change, each change means modifying the code, and it quickly becomes a hassle.

It demands more of the network:

The browser loads a number of supplemental files (such as css, js, and image files) that may be of no value to you. This can cost far more traffic (each fetched with a separate HTTP request) than just the resources you actually need.
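
To illustrate the contrast, here is a minimal sketch of fetching a page with the JDK's built-in HttpURLConnection (available on JDK 1.8, no extra dependencies): a single GET retrieves only the HTML document, with none of the css/js/image sub-requests a browser would issue. The class name, the example URL, and the User-Agent string are all placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class PlainFetch {
    // Hypothetical helper: one HTTP GET retrieves only the HTML document,
    // skipping the images, CSS, and JS a real browser would also download.
    static String fetchHtml(String pageUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        // Some sites reject requests without a browser-like User-Agent.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = fetchHtml("https://example.com");
        System.out.println("Fetched " + html.length() + " characters of HTML");
    }
}
```

Parsing the fetched HTML for the content you want is then a separate step, but the request itself costs a fraction of what a full browser load does.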

Summary

Selenium works well for this copy-and-paste scenario where the original formatting must be preserved, but for general-purpose data crawling a plain HTTP client is usually faster and cheaper.

