Simple use of Java regular expressions and a web crawler example

  • 2020-04-01 01:51:35
  • OfStack

A regular expression is a set of rules for operating on strings: it describes a pattern that a string can be matched against.

1. The String class has several methods that use regular expressions to match, split, and replace strings.

boolean matches(String regex): determines whether the whole string matches the given regular expression.

String[] split(String regex): splits the string around matches of the given regular expression.

String replaceAll(String regex, String replacement): replaces every substring that matches the regular expression with the replacement string.
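As a rough illustration of the three methods above, the following sketch wraps each one in a small helper. The class name, helper names, and sample inputs are made up for this example and are not from the original article.

```java
import java.util.Arrays;

class StringRegexDemo
{
    // matches() is true only when the WHOLE string fits the pattern
    static boolean isAllDigits(String s)
    {
        return s.matches("[0-9]+");
    }

    // split() cuts the string wherever the pattern matches (here: one or more commas)
    static String[] splitOnCommas(String s)
    {
        return s.split(",+");
    }

    // replaceAll() replaces every run of digits with a single '#'
    static String maskDigits(String s)
    {
        return s.replaceAll("[0-9]+", "#");
    }

    public static void main(String[] args)
    {
        System.out.println(isAllDigits("12345"));                      // true
        System.out.println(isAllDigits("123a5"));                      // false
        System.out.println(Arrays.toString(splitOnCommas("a,b,,c")));  // [a, b, c]
        System.out.println(maskDigits("ab12cd345"));                   // ab#cd#
    }
}
```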


2. Here are some common uses of regular expressions

(1)


String regex="[1-9][0-9]{4,15}";
//[1-9] means this character must be a digit from 1 to 9
//[0-9] means this character can be any digit from 0 to 9
//{4,15} means the preceding [0-9] must be repeated 4 to 15 times

This regular expression means that the first digit must be from 1 to 9, and it must be followed by at least 4 and at most 15 digits from 0 to 9, so a matching string has 5 to 16 digits in total.

For example:

10175 matches.

10 does not match, because [0-9]{4,15} must match at least 4 digits, and here only one digit follows the leading 1.
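The two cases above can be checked with String.matches. This is a minimal sketch; the class name, the helper name, and the two extra inputs (a leading zero and a 17-digit number) are assumptions added for illustration.

```java
class NumberCheck
{
    static final String REGEX = "[1-9][0-9]{4,15}";

    // true when the whole string is 5 to 16 digits and does not start with 0
    static boolean isValid(String s)
    {
        return s.matches(REGEX);
    }

    public static void main(String[] args)
    {
        System.out.println(isValid("10175"));              // true
        System.out.println(isValid("10"));                 // false: only one digit after the first
        System.out.println(isValid("01234"));              // false: first digit is 0
        System.out.println(isValid("12345678901234567"));  // false: 17 digits is too long
    }
}
```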

(2)

[a-zA-Z0-9_]{6} means exactly six characters, each taken from a-z, A-Z, 0-9, or _.

+ means one or more occurrences

* means zero or more occurrences

? means zero or one occurrence
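The quantifiers above can be compared side by side with a small test harness. This is only a sketch: the class name, the helper, and the patterns ab+, ab*, and ab? are chosen for this example.

```java
class QuantifierDemo
{
    // true when the whole input matches the regular expression
    static boolean matches(String regex, String input)
    {
        return input.matches(regex);
    }

    public static void main(String[] args)
    {
        System.out.println(matches("[a-zA-Z0-9_]{6}", "abc_12")); // true: exactly six allowed characters
        System.out.println(matches("[a-zA-Z0-9_]{6}", "abc_1"));  // false: only five characters
        System.out.println(matches("ab+", "a"));                  // false: + needs at least one b
        System.out.println(matches("ab+", "abbb"));               // true
        System.out.println(matches("ab*", "a"));                  // true: * allows zero b
        System.out.println(matches("ab?", "ab"));                 // true: ? allows one b
        System.out.println(matches("ab?", "abb"));                // false: ? allows at most one b
    }
}
```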


(3) Split a string according to a regular expression


String str="sjd.ksdj.skdjf";
String regex="\\."; // the regex \. is written "\\." in Java source

Note: in a regular expression, . is a special symbol that matches any character. To split on a literal ., we have to escape it as \., which is written "\\." inside a Java string literal.

This is because \ is also a special symbol in Java string literals, so two are needed to produce one. To match a literal \ in a regular expression, we need "\\\\".

String[] ss = str.split(regex); returns the array of strings "sjd", "ksdj", "skdjf", which splits the original string at each dot.
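The escaping rules for . and \ can be tried out as follows. This is a sketch with made-up class and helper names; the last line shows what happens when the dot is left unescaped.

```java
import java.util.Arrays;

class SplitDemo
{
    // "\\." in Java source is the regex \. which matches a literal dot
    static String[] splitOnDot(String s)
    {
        return s.split("\\.");
    }

    // "\\\\" in Java source is the regex \\ which matches a literal backslash
    static String[] splitOnBackslash(String s)
    {
        return s.split("\\\\");
    }

    public static void main(String[] args)
    {
        System.out.println(Arrays.toString(splitOnDot("sjd.ksdj.skdjf")));  // [sjd, ksdj, skdjf]
        System.out.println(Arrays.toString(splitOnBackslash("a\\b\\c")));   // [a, b, c]
        // An unescaped dot treats every character as a separator, so nothing is left
        System.out.println(Arrays.toString("sjd.ksdj.skdjf".split(".")));   // []
    }
}
```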

(4) Replace matching text according to a regular expression

Replace every run of 5 or more consecutive digits in a string with #


String str="abcd1334546lasjdfldsf2343424sdj";
String regex="[0-9]{5,}";
String newstr=str.replaceAll(regex,"#"); // newstr is "abcd#lasjdfldsf#sdj"
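Wrapped in a runnable class, the replacement might look like the sketch below. The class and method names are hypothetical, and the second input is added to show that runs shorter than five digits are left alone.

```java
class ReplaceDemo
{
    // replaces every run of five or more digits with a single '#'
    static String maskLongNumbers(String s)
    {
        return s.replaceAll("[0-9]{5,}", "#");
    }

    public static void main(String[] args)
    {
        System.out.println(maskLongNumbers("abcd1334546lasjdfldsf2343424sdj")); // abcd#lasjdfldsf#sdj
        System.out.println(maskLongNumbers("a123b12345c"));                     // a123b#c
    }
}
```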

(5) Get the substrings that match a regular expression


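Pattern p=Pattern.compile(regex); // compile the regular expression into a Pattern
Matcher m=p.matcher(str);         // create a Matcher for the input string
while(m.find())                   // find() advances to the next match, if any
{
 System.out.println(m.group());   // group() returns the text of the current match
}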
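A self-contained version of this loop, collecting the matches into a list instead of printing them, could look like this. The class name and the findAll helper are assumptions for the example; the regex and input are reused from case (4).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo
{
    // returns every substring of input that matches regex, in order of appearance
    static List<String> findAll(String regex, String input)
    {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find())
        {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args)
    {
        String str = "abcd1334546lasjdfldsf2343424sdj";
        System.out.println(findAll("[0-9]{5,}", str)); // [1334546, 2343424]
    }
}
```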

3. Building a web crawler

We write a program that reads a web page, extracts all the email addresses on it, and stores them in a text file.


/*
 Web crawler
 That is, get the strings from a web page that match a regular expression.
 Here: get the email addresses from a page on the network.
*/
import java.io.*;
import java.util.regex.*;
import java.net.*;
class MailTest
{
 public static void main(String[] args) throws Exception
 {
  getMailAddr();
 }
 public static void getMailAddr() throws Exception
 {
  URL url=new URL("http://bbs.jb51.net/topics/390148495");
  URLConnection con=url.openConnection();
  // In a Java string literal the regex \. must be written as \\.
  String regex="[a-zA-Z0-9_]{6,12}@[a-zA-Z0-9]+(\\.[a-zA-Z]+)+";
  Pattern p=Pattern.compile(regex);
  // try-with-resources closes both streams even if an exception is thrown
  try(BufferedReader bufIn=new BufferedReader(new InputStreamReader(con.getInputStream()));
      BufferedWriter bufw=new BufferedWriter(new FileWriter(new File("e://mailaddress.txt"))))
  {
   String str=null;
   // read the page line by line
   while((str=bufIn.readLine())!=null)
   {
    Matcher m=p.matcher(str);
    // write every email address found on this line to the file
    while(m.find())
    {
     String ss=m.group();
     bufw.write(ss,0,ss.length());
     bufw.newLine();
     bufw.flush();
    }
   }
  }
 }
}
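To try the extraction step without a network connection or an output file, the same regular expression can be applied to text held in memory. The sketch below is an offline variant, not the article's program: the class name, the extractEmails helper, and the sample addresses are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MailExtractDemo
{
    // same pattern as the crawler: 6 to 12 word characters, @, then a dotted domain
    static final Pattern MAIL =
        Pattern.compile("[a-zA-Z0-9_]{6,12}@[a-zA-Z0-9]+(\\.[a-zA-Z]+)+");

    // scans the text line by line, as the crawler does, and collects every match
    static List<String> extractEmails(String pageText)
    {
        List<String> found = new ArrayList<>();
        for (String line : pageText.split("\n"))
        {
            Matcher m = MAIL.matcher(line);
            while (m.find())
            {
                found.add(m.group());
            }
        }
        return found;
    }

    public static void main(String[] args)
    {
        // hypothetical page content standing in for the downloaded HTML
        String page = "Contact alice_01@example.com, thanks\n"
                    + "bob123456@mail.example.org and tom@x.cn";
        System.out.println(extractEmails(page));
        // [alice_01@example.com, bob123456@mail.example.org]
    }
}
```

Note that tom@x.cn is skipped because its local part has fewer than 6 characters. Because find() searches inside the line, a local part longer than 12 characters would still be matched from its last 12 characters onward, so this pattern is a rough filter rather than a full email validator.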
