C#: Extracting Images from Web Pages with Regular Expressions

  • 2021-08-31 08:53:21
  • OfStack

The current project includes some image processing, so, following an example found online, I wrote a method that extracts the image addresses from HTML content.

Generally speaking, an HTML document contains many tags, such as "<html>", "<body>", and "<table>", so extracting the img tags from it is not trivial. Because img tags can be written in many different ways, finding them programmatically requires a robust regular expression; otherwise some img tags may be missed, or text that is not a valid img tag may be matched.

We can build this regular expression step by step from the format of the HTML tag. First, an img tag can be written in several ways. Ignoring case, some possible forms are:
<img> <img/> <img src=/>

These tags can be ignored because they contain no image address.
<img src = /images/pic.jpg/> <img src =" /images/pic.jpg"> <img src= '/images/pic.jpg ' />

These tags all contain an image address. The address may be wrapped in single or double quotation marks, or not quoted at all. If we do not require the opening and closing quotation marks to pair up, the regular expression can be written as: @"<img\s*src\s*=\s*[""']?\s*(?<imgUrl>[^\s""'<>]*)\s*/?\s*>"
<img width="320" height="240" src=/images/pic.jpg onclick="window.open('/images/pic.jpg')">

Because other attributes may appear between img and src, "<img" must end at a word boundary (it cannot be, for example, "<imgabc"), and src must likewise start at a word boundary. One advantage of the word-boundary anchor "\b" is that the "\s*" for spaces can then be omitted. In addition, because "<" and ">" cannot appear inside an img tag, the previous regular expression can be rewritten as: @"<img\b[^<>]*?\bsrc\s*=\s*[""']?\s*(?<imgUrl>[^\s""'<>]*)[^<>]*?/?\s*>"
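As a rough check of the pattern above, the following sketch runs it against a few of the tag forms discussed so far. The class name ImgSrcDemo and the method ExtractSrc are made up for this illustration; only the pattern itself comes from the article.

```csharp
using System.Text.RegularExpressions;

public static class ImgSrcDemo
{
    // Pattern from the text: word boundaries after "<img" and before "src",
    // with [^<>]*? skipping any other attributes inside the tag.
    private static readonly Regex ImgTag = new Regex(
        @"<img\b[^<>]*?\bsrc\s*=\s*[""']?\s*(?<imgUrl>[^\s""'<>]*)[^<>]*?/?\s*>",
        RegexOptions.IgnoreCase);

    // Returns the src value of the first img tag in the input,
    // or null if no img tag with a src attribute is found.
    public static string ExtractSrc(string html)
    {
        Match m = ImgTag.Match(html);
        return m.Success ? m.Groups["imgUrl"].Value : null;
    }
}
```

Under these assumptions, ExtractSrc returns /images/pic.jpg for the double-quoted, single-quoted, and multi-attribute forms, and null for a tag such as <imgabc src="x.jpg">. For the unquoted form <img src=/images/pic.jpg/> the captured value keeps the trailing slash, because "/" is not excluded from the imgUrl group.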
<img width="320" height="240" src = "
/images/pic.jpg" />

Tags are sometimes broken across lines like this, so wherever a space may appear, carriage returns, line feeds, and tab characters must be allowed as well. Conversely, spaces, tabs, carriage returns, and line feeds must not appear inside the image address.

So the regular expression above can be changed to: @"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>"
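In .NET regular expressions, \s already matches tab, carriage return, and line feed, so [\s\t\r\n] should behave the same as \s on its own. The sketch below compares the two spellings on a tag that is broken across lines. The names WhitespaceDemo, Explicit, Short, and FirstUrl are illustrative and do not come from the article.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Pattern from the text, with tab, CR and LF spelled out next to \s.
    public const string Explicit =
        @"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>";

    // The same pattern written with \s only (the shorter spelling being compared).
    public const string Short =
        @"<img\b[^<>]*?\bsrc\s*=\s*[""']?\s*(?<imgUrl>[^\s""'<>]*)[^<>]*?/?\s*>";

    // Returns the first captured image address, or null when the pattern does not match.
    public static string FirstUrl(string pattern, string html)
    {
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["imgUrl"].Value : null;
    }
}
```

For a tag whose src value starts on the next line, both patterns capture /images/pic.jpg, so the longer character class is harmless but not strictly needed.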

Here's the class HvtHtmlImage that gets the addresses of all the pictures in HTML:


using System.Text.RegularExpressions;

namespace HoverTree.HoverTreeFrame.HvtImage
{
    public class HvtHtmlImage
    {
        /// <summary>
        /// Gets the URLs of all images in the HTML.
        /// </summary>
        /// <param name="sHtmlText">HTML code</param>
        /// <returns>List of image URLs</returns>
        public static string[] GetHvtImgUrls(string sHtmlText)
        {
            // Define the regular expression that matches img tags
            Regex m_hvtRegImg = new Regex(@"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);

            // Search for matching strings
            MatchCollection matches = m_hvtRegImg.Matches(sHtmlText);
            int m_i = 0;
            string[] sUrlList = new string[matches.Count];

            // Collect the imgUrl group of every match
            foreach (Match match in matches)
                sUrlList[m_i++] = match.Groups["imgUrl"].Value;
            return sUrlList;
        }
    }
}
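GetHvtImgUrls returns one entry per matched tag, so a tag such as <img src=""> contributes an empty string and an image used several times appears several times. If that is not wanted, a variant along the following lines could filter the results. ImgUrlHelper and GetDistinctImgUrls are hypothetical names; the pattern is the one used above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ImgUrlHelper
{
    // Same pattern as in GetHvtImgUrls.
    private static readonly Regex ImgTag = new Regex(
        @"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>",
        RegexOptions.IgnoreCase);

    // Collects the imgUrl group of every match, dropping empty values and duplicates.
    public static List<string> GetDistinctImgUrls(string html)
    {
        return ImgTag.Matches(html)
            .Cast<Match>()                          // MatchCollection -> IEnumerable<Match>
            .Select(m => m.Groups["imgUrl"].Value)  // keep only the captured address
            .Where(url => url.Length > 0)           // skip tags with an empty src
            .Distinct()                             // remove repeated addresses
            .ToList();
    }
}
```

Returning a List<string> instead of an array is only a convenience here; calling ToArray() in place of ToList() would keep the original return type.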

Let's look at another example, which collects image URLs from src and href attributes and then downloads the images:


// Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
public Array MatchHtml(string html, string com)
    {
      List<string> urls = new List<string>();
      // Lowercase the document so the extension checks are case-insensitive
      // (note that this also lowercases the URLs themselves)
      html = html.ToLower();
      // Get the URLs in src attributes that end with an image extension
      Regex regexSrc = new Regex("src=\"([^\"]*\\.(?:jpg|png|gif|bmp|ico))\"");
      foreach (Match m in regexSrc.Matches(html))
      {
        string src = m.Groups[1].Value;
        // Prefix relative addresses with the site root
        if (!src.StartsWith("http", StringComparison.Ordinal))
          src = com + src;
        if (!urls.Contains(src))
          urls.Add(src);
      }
      // Get the URLs in href attributes that end with an image extension
      Regex regexHref = new Regex("href=\"([^\"]*\\.(?:jpg|png|gif|bmp|ico))\"");
      foreach (Match m in regexHref.Matches(html))
      {
        string href = m.Groups[1].Value;
        // Prefix relative addresses with the site root
        if (!href.StartsWith("http", StringComparison.Ordinal))
          href = com + href;
        if (!urls.Contains(href))
          urls.Add(href);
      }
      return urls.ToArray();
    }
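When matching file extensions, note that square brackets define a character class rather than a list of alternatives: [(.jpg)(.png)] matches any single one of the listed characters, whereas \.(?:jpg|png) matches a dot followed by one of the extensions. A small sketch of the difference, with illustrative class and method names:

```csharp
using System.Text.RegularExpressions;

public static class ExtensionPatternDemo
{
    // Character class: the value only has to END in one of the characters
    // ( . j p g ) n i f b m c o
    private static readonly Regex CharClass =
        new Regex("src=\"[^\"]*[(.jpg)(.png)(.gif)(.bmp)(.ico)]\"");

    // Alternation: the value has to end in a literal dot plus one of the extensions
    private static readonly Regex Alternation =
        new Regex("src=\"[^\"]*\\.(?:jpg|png|gif|bmp|ico)\"");

    public static bool MatchesCharClass(string html)
    {
        return CharClass.IsMatch(html);
    }

    public static bool MatchesAlternation(string html)
    {
        return Alternation.IsMatch(html);
    }
}
```

With these two patterns, src="/img/logo.png" is accepted by both, but src="/about/info" is accepted only by the character-class version, because the value happens to end in "o".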

    // Requires: using System; using System.Runtime.InteropServices;
    [DllImport("kernel32.dll")]
    static extern bool SetConsoleMode(IntPtr hConsoleHandle, int mode);
    [DllImport("kernel32.dll")]
    static extern bool GetConsoleMode(IntPtr hConsoleHandle, out int mode);
    [DllImport("kernel32.dll")]
    static extern IntPtr GetStdHandle(int handle);
    const int STD_INPUT_HANDLE = -10;
    // 0x40 = ENABLE_QUICK_EDIT_MODE, 0x80 = ENABLE_EXTENDED_FLAGS
    const int ENABLE_QUICK_EDIT_MODE = 0x40 | 0x80;
    // Turns on quick edit mode for the console window (Windows only)
    public static void EnableQuickEditMode()
    {
      int mode;
      IntPtr handle = GetStdHandle(STD_INPUT_HANDLE);
      GetConsoleMode(handle, out mode);
      mode |= ENABLE_QUICK_EDIT_MODE;
      SetConsoleMode(handle, mode);
    }
    static void Main(string[] args)
    {
      EnableQuickEditMode();
      int oldCount = 0;
      Console.Title = "TakeImageFromInternet";
      string path = "E:\\Download\\loading\\";
      while (true)
      {
        Console.Clear();
        string countFile = "E:\\CountFile.txt"; // Counter file, used to keep saved file names unique
        int cursor = 0;
        if (File.Exists(countFile))
        {
          string text = File.ReadAllText(countFile);
          try
          {
            cursor = oldCount = Convert.ToInt32(text); // Consider long (Convert.ToInt64) for large counts
          }
          catch { }
        }
        Console.Write("please input a url:");
        string url = "http://www.baidu.com/";
        string temp = Console.ReadLine();
        if (!string.IsNullOrEmpty(temp))
          url = temp;
        Match mcom = new Regex(@"^(?i)http://(\w+\.){2,3}(com(\.cn)?|cn|net)\b").Match(url); // Extract the site root (scheme and domain)
        string com = mcom.Value;
        //Console.WriteLine(mcom.Value);
        Console.Write("please input a save path:");
        temp = Console.ReadLine();
        if (Directory.Exists(temp))
          path = temp;
        Console.WriteLine();
        WebClient client = new WebClient();
        byte[] htmlData = null;
        htmlData = client.DownloadData(url);
        MemoryStream mstream = new MemoryStream(htmlData);
        string html = "";
        using (StreamReader sr = new StreamReader(mstream))
        {
          html = sr.ReadToEnd();
        }
        Array urls = new MatchHtmlImageUrl().MatchHtml(html,com);
 
        foreach (string imageurl in urls)
        {
         Console.WriteLine(imageurl);
          byte[] imageData = null;
          try
          {
            imageData = client.DownloadData(imageurl);
          }
          catch { }
          if (imageData != null && imageData.Length>0)
            using (MemoryStream ms = new MemoryStream(imageData))
            {
              try
              {
                
                string ext = Aping.Utility.File.FileOpration.ExtendName(imageurl);
                ImageFormat format = ImageFormat.Jpeg;
                switch (ext)
                {
                  case ".jpg":
                    format = ImageFormat.Jpeg;
                    break;
                  case ".bmp":
                    format = ImageFormat.Bmp;
                    break;
                  case ".png":
                    format = ImageFormat.Png;
                    break;
                  case ".gif":
                    format = ImageFormat.Gif;
                    break;
                  case ".ico":
                    format = ImageFormat.Icon;
                    break;
                  default:
                    continue;
                }
                Image image = new Bitmap(ms);
                if (Directory.Exists(path))
                  image.Save(path + "\\" + cursor + ext, format);
              }
              catch(Exception ex) { Console.WriteLine(ex.Message); }
            }
          cursor++;
        }
        mstream.Close();
        File.WriteAllText(countFile, cursor.ToString(), Encoding.UTF8);
        Console.WriteLine("take done...image count:"+(cursor-oldCount).ToString());
      }      
    }
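The domain pattern in Main only recognizes http:// addresses under a few top-level domains, and prefixing the site root to an address only works when that address starts with "/". One alternative, which is not what the code above does, is to let System.Uri resolve each address against the URL of the page. A minimal sketch, with UrlResolver and Resolve as hypothetical names:

```csharp
using System;

public static class UrlResolver
{
    // Resolves a possibly relative image address against the URL of the page it came from.
    // Returns null if the two cannot be combined into a valid absolute URL.
    public static string Resolve(string pageUrl, string imageUrl)
    {
        Uri baseUri;
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out baseUri))
            return null;

        Uri result;
        return Uri.TryCreate(baseUri, imageUrl, out result) ? result.AbsoluteUri : null;
    }
}
```

With the page URL http://www.example.com/news/index.html (an example address), Resolve turns /images/pic.jpg into http://www.example.com/images/pic.jpg and pic.jpg into http://www.example.com/news/pic.jpg, while an address that is already absolute is returned unchanged.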

