C# Example of Using Regular Expressions to Crawl Website Information
- 2021-11-29 08:12:34
- OfStack
This article uses an example to show how to crawl website information with regular expressions in C#. It is shared here for your reference:
The example captures the product details of a JD.COM Mall product page.
1. Create the JdRobber.cs program class
using System;
using System.Text.RegularExpressions;

public class JdRobber
{
    /// <summary>
    /// Determines whether the address is a JD.COM product link
    /// </summary>
    /// <param name="url">Link address to check</param>
    /// <returns>true if the address has the product page format</returns>
    public bool ValidationUrl(string url)
    {
        bool result = false;
        if (!String.IsNullOrEmpty(url))
        {
            // Dots are escaped so that they match a literal "." only
            Regex regex = new Regex(@"^http://item\.jd\.com/\d+\.html$");
            Match match = regex.Match(url);
            if (match.Success)
            {
                result = true;
            }
        }
        return result;
    }

    /// <summary>
    /// Fetches a JD.COM product page and prints its title, picture, and price
    /// </summary>
    /// <param name="url">Product page address</param>
    public void GetInfo(string url)
    {
        if (ValidationUrl(url))
        {
            string htmlStr = WebHandler.GetHtmlStr(url, "Default");
            if (!String.IsNullOrEmpty(htmlStr))
            {
                string pattern = "";     // Regular expression
                string sourceWebID = ""; // Product ID
                string title = "";       // Title
                decimal price = 0;       // Price
                string picName = "";     // Picture

                // Extract the product ID from the URL
                pattern = @"http://item\.jd\.com/(?<Object>\d+)\.html";
                sourceWebID = WebHandler.GetRegexText(url, pattern);

                // Extract the title: the first <h1> after the "name" div
                // ([\s\S]*? is lazy so the match does not run on to a later <h1>)
                pattern = @"<div.*id=""name"".*>[\s\S]*?<h1>(?<Object>.*?)</h1>";
                title = WebHandler.GetRegexText(htmlStr, pattern);

                // Extract the picture from the "spec-n1" div
                int begin = htmlStr.IndexOf("<div id=\"spec-n1\"");
                int end = htmlStr.IndexOf("</div>", begin + 1);
                // IndexOf returns -1 when nothing is found, and 0 is a valid position
                if (begin >= 0 && end > begin)
                {
                    string subPicHtml = htmlStr.Substring(begin, end - begin);
                    pattern = @"<img.*src=""(?<Object>.*?)"".*/>";
                    picName = WebHandler.GetRegexText(subPicHtml, pattern);
                }

                // Extract the price, which is served as JSON from a separate address
                if (sourceWebID != "")
                {
                    string priceUrl = @"http://p.3.cn/prices/get?skuid=J_" + sourceWebID + "&type=1";
                    string priceJson = WebHandler.GetHtmlStr(priceUrl, "Default");
                    pattern = @"""p"":""(?<Object>\d+(\.\d{1,2})?)""";
                    price = WebHandler.GetValidPrice(WebHandler.GetRegexText(priceJson, pattern));
                }

                Console.WriteLine(" Commodity name: {0}", title);
                Console.WriteLine(" Picture: {0}", picName);
                Console.WriteLine(" Price: {0}", price);
            }
        }
    }
}
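One detail worth keeping in mind with URL patterns like the ones in this class is that an unescaped `.` in a regular expression matches any character, not only a literal dot. The following is a minimal, self-contained sketch of the difference; the class name `UrlPatternDemo` and both sample addresses are made up for illustration and are not part of the original program.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of JdRobber
public static class UrlPatternDemo
{
    // Unescaped dots: each "." matches any single character
    private static readonly Regex Loose =
        new Regex(@"^http://item.jd.com/\d+.html$");

    // Escaped dots match a literal "." only; the named group captures the ID
    private static readonly Regex Strict =
        new Regex(@"^http://item\.jd\.com/(?<Object>\d+)\.html$");

    public static bool IsLooseMatch(string url)
    {
        return Loose.IsMatch(url);
    }

    public static bool IsStrictMatch(string url)
    {
        return Strict.IsMatch(url);
    }

    // Returns the numeric ID, or an empty string when the URL does not match
    public static string ExtractId(string url)
    {
        Match m = Strict.Match(url);
        return m.Success ? m.Groups["Object"].Value : "";
    }

    public static void Main()
    {
        string good = "http://item.jd.com/123456.html"; // made-up product address
        string bad = "http://itemXjd.com/123456html";   // not a product address

        Console.WriteLine(IsLooseMatch(bad));   // the loose pattern accepts it
        Console.WriteLine(IsStrictMatch(bad));  // the strict pattern rejects it
        Console.WriteLine(ExtractId(good));     // the captured product ID
    }
}
```

The loose pattern accepts the second address because `.` stands in for the `X` and for the last digit before `html`, which is why escaping the dots is the safer choice for validation.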
2. Create the WebHandler.cs helper class
using System;
using System.Globalization;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

/// <summary>
/// Common helper methods
/// </summary>
public class WebHandler
{
    /// <summary>
    /// Gets the HTML source of a web page
    /// </summary>
    /// <param name="url">Link address</param>
    /// <param name="encoding">Encoding type: "UTF8" or "Default"</param>
    /// <returns>The page source, or an empty string on failure</returns>
    public static string GetHtmlStr(string url, string encoding)
    {
        string htmlStr = "";
        try
        {
            if (!String.IsNullOrEmpty(url))
            {
                // Any value other than "UTF8" falls back to the system default encoding
                Encoding ec = (encoding == "UTF8") ? Encoding.UTF8 : Encoding.Default;
                WebRequest request = WebRequest.Create(url);
                // The using blocks release the response, stream, and reader even if reading fails
                using (WebResponse response = request.GetResponse())
                using (Stream dataStream = response.GetResponseStream())
                using (StreamReader reader = new StreamReader(dataStream, ec))
                {
                    htmlStr = reader.ReadToEnd(); // Read data
                }
            }
        }
        catch (Exception)
        {
            // Best effort: any failure (bad address, network error) yields an empty string
        }
        return htmlStr;
    }

    /// <summary>
    /// Gets the text captured by the named group "Object" in a regular expression
    /// </summary>
    /// <param name="input">Text to search</param>
    /// <param name="pattern">Expression containing a group named "Object"</param>
    /// <returns>The captured text, or an empty string if there is no match</returns>
    public static string GetRegexText(string input, string pattern)
    {
        string result = "";
        if (!String.IsNullOrEmpty(input) && !String.IsNullOrEmpty(pattern))
        {
            Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
            Match match = regex.Match(input);
            if (match.Success)
            {
                result = match.Groups["Object"].Value;
            }
        }
        return result;
    }

    /// <summary>
    /// Converts a price string to a decimal value
    /// </summary>
    /// <param name="strPrice">Price text with at most two decimal places, such as "59.90"</param>
    /// <returns>The parsed price, or 0 if the text is not a valid price</returns>
    public static decimal GetValidPrice(string strPrice)
    {
        decimal price = 0;
        if (!String.IsNullOrEmpty(strPrice) && Regex.IsMatch(strPrice, @"^\d+(\.\d{1,2})?$"))
        {
            // Parse with the invariant culture so that "." is always the decimal separator;
            // TryParse leaves price at 0 if the value cannot be parsed
            decimal.TryParse(strPrice, NumberStyles.AllowDecimalPoint, CultureInfo.InvariantCulture, out price);
        }
        return price;
    }
}
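To see how the named group `Object` drives the extraction without making any network request, the patterns can be tried against small in-memory samples. The sketch below is an illustration only: the class `ExtractionDemo`, the HTML fragment, the image address, and the JSON shape are all assumptions made up for the demo and do not reflect the real JD.COM responses. The `Extract` helper mirrors what `GetRegexText` does.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical demo class; all sample data below is invented
public static class ExtractionDemo
{
    public const string TitlePattern = @"<div.*id=""name"".*>[\s\S]*?<h1>(?<Object>.*?)</h1>";
    public const string ImagePattern = @"<img.*src=""(?<Object>.*?)"".*/>";
    public const string PricePattern = @"""p"":""(?<Object>\d+(\.\d{1,2})?)""";

    // Made-up page fragment standing in for a downloaded product page
    public static readonly string SampleHtml =
        "<div id=\"name\">\n<h1>Sample Product</h1>\n</div>\n" +
        "<div id=\"spec-n1\"><img src=\"http://img.example.com/p.jpg\" alt=\"\" /></div>";

    // Made-up price response; the real service may return a different shape
    public static readonly string SamplePriceJson =
        "[{\"id\":\"J_123456\",\"p\":\"59.90\",\"m\":\"99.00\"}]";

    // Returns the text captured by the group named "Object", or "" if there is no match
    public static string Extract(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["Object"].Value : "";
    }

    // Parses the captured price text with "." as the decimal separator; 0 on failure
    public static decimal ParsePrice(string text)
    {
        decimal price;
        decimal.TryParse(text, NumberStyles.AllowDecimalPoint, CultureInfo.InvariantCulture, out price);
        return price;
    }

    public static void Main()
    {
        Console.WriteLine(Extract(SampleHtml, TitlePattern));
        Console.WriteLine(Extract(SampleHtml, ImagePattern));
        Console.WriteLine(ParsePrice(Extract(SamplePriceJson, PricePattern)));
    }
}
```

Because every pattern exposes its result through the same group name, one helper can serve all three extractions. Patterns of this kind depend on the exact markup of the page, so they need to be rechecked whenever the site layout changes.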
PS: Here are two very convenient regular expression tools for your reference:
JavaScript Regular Expression Online Test Tool:
http://tools.ofstack.com/regex/javascript
Regular expression online generation tool:
http://tools.ofstack.com/regex/create_reg
I hope this article is helpful for your C# programming.