C#: Fetching Web Page Data, Parsing the Title, Description, and Images, and Removing HTML Tags

  • 2021-09-12 01:56:46
  • OfStack

1. First, fetch the entire page content into a byte[] (data travels over the network as bytes), then convert it to a string so that it is easier to work with. For example:


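// Requires: using System.Net; using System.Text;
private static string GetPageData(string url)
{
    if (string.IsNullOrWhiteSpace(url))
        return null;
    using (WebClient wc = new WebClient())
    {
        wc.Credentials = CredentialCache.DefaultCredentials;
        // Download the raw bytes of the page.
        byte[] pageData = wc.DownloadData(url);
        // Encoding.Default depends on the system; switch to the page's real charset
        // (for example Encoding.UTF8 or Encoding.ASCII) if the text comes out garbled.
        return Encoding.Default.GetString(pageData);
    }
}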
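
Decoding with a fixed encoding can garble pages that use a different charset. As a minimal sketch of one way around this, the helper below picks the encoding from the value of the response's Content-Type header and falls back to UTF-8. The class name `PageEncoding` and both method names are hypothetical and not part of the original example; the UTF-8 fallback is an assumption.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original article.
public static class PageEncoding
{
    // Picks an Encoding from a Content-Type value such as "text/html; charset=utf-8".
    // Assumption: UTF-8 is used when no usable charset is declared.
    public static Encoding FromContentType(string contentType)
    {
        if (!string.IsNullOrEmpty(contentType))
        {
            foreach (string part in contentType.Split(';'))
            {
                string p = part.Trim();
                if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                {
                    string name = p.Substring("charset=".Length).Trim('"', '\'', ' ');
                    try
                    {
                        return Encoding.GetEncoding(name);
                    }
                    catch (ArgumentException)
                    {
                        // Unknown or unsupported charset name: fall through to the default.
                    }
                }
            }
        }
        return Encoding.UTF8;
    }

    // Decodes the downloaded bytes with the encoding chosen above.
    public static string Decode(byte[] pageData, string contentType)
    {
        return FromContentType(contentType).GetString(pageData);
    }
}
```

With a WebClient, the header value should be available after the download as wc.ResponseHeaders["Content-Type"], so the final line of GetPageData could become a call to PageEncoding.Decode(pageData, wc.ResponseHeaders["Content-Type"]). Pages that declare their charset only in a meta tag are not covered by this sketch.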

2. Once you have the data as a string, you can parse the page (in practice, this comes down to string operations and regular expressions):

The most common parsing tasks are as follows:

1. Get the title


// strResponse holds the page HTML returned by GetPageData.
Match TitleMatch = Regex.Match(strResponse, "<title>([^<]*)</title>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
title = TitleMatch.Groups[1].Value;
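
The pattern above expects a bare title tag. As a hedged variation, the sketch below also accepts attributes on the tag, lets the title span several lines, and decodes HTML entities in the result. The class name `TitleParser` and the method `GetTitle` are hypothetical names chosen for illustration.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original article.
public static class TitleParser
{
    // Returns the decoded, trimmed page title, or an empty string if no title is found.
    public static string GetTitle(string html)
    {
        // Singleline lets "." match line breaks inside the title.
        Match m = Regex.Match(
            html,
            @"<title\b[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return m.Success
            ? WebUtility.HtmlDecode(m.Groups[1].Value).Trim()
            : string.Empty;
    }
}
```

Calling TitleParser.GetTitle(strResponse) would then take the place of the two statements above.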

2. Get the description


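// Stop the capture at the closing quote so that extra attributes or a trailing "/" do not end up in the result.
Match Desc = Regex.Match(strResponse, "<meta\\s+name=\"description\"\\s+content=\"([^\"]*)\"[^>]*>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
strdesc = Desc.Groups[1].Value;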
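
A single pattern like the one above only matches when name comes before content and both use double quotes. The following sketch, under the assumption that attribute values do not contain a ">" character, first finds each meta tag and then reads the two attributes independently, so their order and quote style no longer matter. `MetaParser`, `GetDescription`, and `GetAttribute` are hypothetical names.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original article.
public static class MetaParser
{
    // Returns the content of the first <meta name="description"> tag, or an empty string.
    public static string GetDescription(string html)
    {
        foreach (Match tag in Regex.Matches(html, @"<meta\b[^>]*>", RegexOptions.IgnoreCase))
        {
            string name = GetAttribute(tag.Value, "name");
            if (string.Equals(name, "description", StringComparison.OrdinalIgnoreCase))
                return WebUtility.HtmlDecode(GetAttribute(tag.Value, "content"));
        }
        return string.Empty;
    }

    // Reads a single- or double-quoted attribute value from one tag.
    private static string GetAttribute(string tag, string attribute)
    {
        // The lookbehind keeps "name" from matching inside e.g. "data-name".
        Match m = Regex.Match(
            tag,
            @"(?<![\w-])" + Regex.Escape(attribute) + @"\s*=\s*(?:""(?<v>[^""]*)""|'(?<v>[^']*)')",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["v"].Value : string.Empty;
    }
}
```

Unquoted attribute values are deliberately not handled here; for pages with irregular markup a real HTML parser is usually the safer choice.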

3. Get the images


public class HtmlHelper
{
    /// <summary>
    /// Extracts all image URLs from an HTML string.
    /// </summary>
    public static List<string> PickupImgUrl(string html)
    {
        Regex regImg = new Regex(@"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);
        MatchCollection matches = regImg.Matches(html);
        List<string> lstImg = new List<string>();
        foreach (Match match in matches)
        {
            lstImg.Add(match.Groups["imgUrl"].Value);
        }
        return lstImg;
    }
    /// <summary>
    /// Extracts the first image URL from an HTML string, or returns an empty string if there is none.
    /// </summary>
    public static string PickupImgUrlFirst(string html)
    {
        List<string> lstImg = PickupImgUrl(html);
        return lstImg.Count == 0 ? string.Empty : lstImg[0];
    }
}
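
The src values returned by PickupImgUrl are often relative to the page they came from. As a small sketch, the helper below resolves each value against the page URL with Uri.TryCreate and skips entries that cannot be resolved. The class name `ImageUrlResolver`, the method `ToAbsoluteUrls`, and the decision to skip empty or invalid entries are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original article.
public static class ImageUrlResolver
{
    // Resolves image URLs (relative or absolute) against the URL of the page.
    public static List<string> ToAbsoluteUrls(string pageUrl, IEnumerable<string> imgUrls)
    {
        Uri baseUri = new Uri(pageUrl);
        List<string> result = new List<string>();

        foreach (string src in imgUrls)
        {
            // Assumption: empty src values are simply skipped.
            if (string.IsNullOrWhiteSpace(src))
                continue;

            Uri absolute;
            if (Uri.TryCreate(baseUri, src, out absolute))
                result.Add(absolute.AbsoluteUri);
        }
        return result;
    }
}
```

A typical call would be ImageUrlResolver.ToAbsoluteUrls(url, HtmlHelper.PickupImgUrl(html)), where url is the address that was passed to GetPageData.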

4. Remove HTML tags


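private string StripHtml(string strHtml)
{
    // Remove everything that looks like a tag; (.|\n) lets a tag span several lines.
    Regex objRegExp = new Regex("<(.|\n)+?>");
    string strOutput = objRegExp.Replace(strHtml, "");
    // Escape any stray angle brackets that were not part of a complete tag.
    strOutput = strOutput.Replace("<", "&lt;");
    strOutput = strOutput.Replace(">", "&gt;");
    return strOutput;
}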

Some edge cases leave tag fragments behind, so it is recommended to run the conversion twice in succession. Stripping tags also leaves behind the whitespace that surrounded them (line breaks and indentation), and long runs of whitespace get in the way of later string operations. So add the following statements:


// Collapse each run of whitespace into a single space
Regex r = new Regex(@"\s+");
wordsOnly = r.Replace(strResponse, " ");
// Trim() returns a new string, so the result has to be assigned
wordsOnly = wordsOnly.Trim();
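
To tie the tag removal and the whitespace cleanup together, here is one possible sketch of a single method that does both. It goes beyond the original snippets in three ways, all of which are assumptions: it drops script and style elements together with their content, it removes HTML comments, and it decodes entities such as &amp;amp; instead of leaving them in the text. `HtmlText` and `ToPlainText` are hypothetical names.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original article.
public static class HtmlText
{
    // Converts an HTML fragment into plain text with single spaces between words.
    public static string ToPlainText(string html)
    {
        // Remove <script> and <style> elements including their content.
        string text = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Remove HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);

        // Replace each remaining tag with a space so adjacent words stay separated.
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Turn entities such as &amp; back into characters.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace and trim the ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Replacing tags with a space rather than an empty string is a design choice in this sketch: it keeps words from neighbouring elements from running together, and the final whitespace pass removes the extra spaces again.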

