C# regular expression for matching image paths and image addresses in HTML

  • 2020-05-30 20:59:25
  • OfStack

Generally speaking, an HTML document contains many tags, such as "<html>", "<body>", and "<table>", and extracting the img tags from it is not easy. Because img tags can be written in many different ways, finding them programmatically requires a very robust regular expression; otherwise the search may miss some img tags or match text that is not a valid img tag.
We can work out this regular expression from the format of HTML tags. First, consider the ways an img tag can be written. Ignoring case, the possible forms are listed below.
<img> <img/> <img src=/>
These tags do not need to be considered, because there is no image resource address.
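<img src = /images/pic.jpg/> <img src =" /images/pic.jpg"> <img src= '/images/pic.jpg ' />
These tags all contain an image resource address. Another feature is the quotation marks around the address, which may be single, double, or absent altogether. Since the opening and closing quotation marks do not need to be checked against each other, a first draft of the regular expression can be written as: @"<img\s*src\s*=\s*[""']?\s*(?<imgUrl>[^\s""'<>]*)\s*/?\s*>" Note that this first draft has nothing that consumes a closing quotation mark, so as written it matches only unquoted addresses; the later versions absorb the closing quotation mark with "[^<>]*?".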
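As a quick check of how this first draft behaves, the sketch below runs it against individual tags. The class name ImgDraftPattern, the helper TryExtract, and the variant DraftWithClosingQuote are illustrative assumptions, not part of the original article; the variant adds an optional closing quotation mark after the address, because the draft as written has nothing that consumes one.

```csharp
using System.Text.RegularExpressions;

public static class ImgDraftPattern
{
    // First-draft pattern from the text: "src" must directly follow "<img".
    public const string Draft =
        @"<img\s*src\s*=\s*[""']?\s*(?<imgUrl>[^\s""'<>]*)\s*/?\s*>";

    // Assumed variant: also accepts an optional closing quotation mark.
    public const string DraftWithClosingQuote =
        @"<img\s*src\s*=\s*[""']?\s*(?<imgUrl>[^\s""'<>]*)\s*[""']?\s*/?\s*>";

    // Tries to match a single tag; on success, url holds the captured address.
    public static bool TryExtract(string pattern, string tag, out string url)
    {
        Match m = Regex.Match(tag, pattern, RegexOptions.IgnoreCase);
        url = m.Success ? m.Groups["imgUrl"].Value : "";
        return m.Success;
    }
}
```

With these definitions, Draft captures /images/pic.jpg from <img src=/images/pic.jpg> but does not match <img src="/images/pic.jpg">, while DraftWithClosingQuote handles both forms.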
<img width="320" height="240" src=/images/pic.jpg onclick="window.open('/images/pic.jpg')">
Since other attributes may appear between img and src, "<img" must end at a word boundary, so that, for example, "<imgabc" is not matched; the same applies to the start of "src". Using the word-boundary anchor "\b" has the advantage that the whitespace pattern "\s*" between them is no longer needed. In addition, "<" and ">" cannot appear inside an img tag. The previous regular expression can therefore be rewritten as: @"<img\b[^<>]*?\bsrc\s*=\s*[""']?\s*(?<imgUrl>[^\s""'<>]*)[^<>]*?/?\s*>"
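To see the effect of the word boundaries, the following minimal sketch applies this pattern to a few inputs. The class name ImgBoundaryPattern and the method Extract are hypothetical names chosen for the example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImgBoundaryPattern
{
    // Pattern from the text: "\b" after "<img" and before "src",
    // with "[^<>]*?" skipping any other attributes inside the tag.
    private static readonly Regex Pattern = new Regex(
        @"<img\b[^<>]*?\bsrc\s*=\s*[""']?\s*(?<imgUrl>[^\s""'<>]*)[^<>]*?/?\s*>",
        RegexOptions.IgnoreCase);

    // Returns every captured image address found in the input.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Pattern.Matches(html))
            urls.Add(m.Groups["imgUrl"].Value);
        return urls;
    }
}
```

Under these assumptions, the tag with width, height, and onclick attributes yields /images/pic.jpg, while inputs such as <imgabc src="/x.jpg"> and <img xsrc="/x.jpg"> yield nothing, because neither has a word boundary at the required position.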
<img width="320" height="240" src = "
/images/pic.jpg" />
Tags are sometimes broken across lines like this, so carriage return, line feed, and tab characters must be accepted wherever spaces can separate parts of the tag, and the image address itself must not contain spaces, tabs, carriage returns, or line feeds. The regular expression above can therefore be changed to: @"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>" In .NET, "\s" already matches tab, carriage return, and line feed, so the explicit "\t\r\n" is redundant but harmless; it mainly makes the intent visible.
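The sketch below checks the line-break case with both the explicit character classes and plain "\s". It relies on the fact that "\s" in .NET also covers tab, carriage return, and line feed; the class name ImgLineBreakPattern and the method FirstUrl are made up for this example.

```csharp
using System.Text.RegularExpressions;

public static class ImgLineBreakPattern
{
    // Pattern with the explicit [\s\t\r\n] classes, as given in the text.
    public const string WithExplicitClasses =
        @"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>";

    // The earlier pattern, which uses plain \s for whitespace.
    public const string WithPlainWhitespace =
        @"<img\b[^<>]*?\bsrc\s*=\s*[""']?\s*(?<imgUrl>[^\s""'<>]*)[^<>]*?/?\s*>";

    // Returns the first captured address, or an empty string if nothing matches.
    public static string FirstUrl(string pattern, string html)
    {
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["imgUrl"].Value : "";
    }
}
```

For a tag whose address starts on a new line after the opening quotation mark, both patterns capture the same address, which suggests the explicit classes serve mainly as documentation of intent.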
The following static method returns the addresses of all images in an HTML string.


        // Requires: using System.Text.RegularExpressions;
        /// <summary>
        /// Gets the URLs of all images in the HTML.
        /// </summary>
        /// <param name="sHtmlText">HTML code</param>
        /// <returns>The list of image URLs</returns>
        public static string[] GetHtmlImageUrlList(string sHtmlText)
        {
            // Define a regular expression that matches img tags
            Regex regImg = new Regex(@"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);
            // Search for matching strings
            MatchCollection matches = regImg.Matches(sHtmlText);
            int i = 0;
            string[] sUrlList = new string[matches.Count];
            // Collect the captured address from each match
            foreach (Match match in matches)
                sUrlList[i++] = match.Groups["imgUrl"].Value;
            return sUrlList;
        }
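A possible way to call the method is sketched here. The wrapper class HtmlImageDemo, the constant SampleHtml, and the method PrintImageUrls are assumptions made for the example; only GetHtmlImageUrlList comes from the article.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlImageDemo
{
    // Hypothetical sample document: two img tags with an address, one without.
    public const string SampleHtml =
        "<html><body>" +
        "<img src=\"/images/a.jpg\">" +
        "<IMG width=\"320\" SRC='/images/b.png' />" +
        "<img alt=\"no address\">" +
        "</body></html>";

    // The method from the article, repeated so the example is self-contained.
    public static string[] GetHtmlImageUrlList(string sHtmlText)
    {
        Regex regImg = new Regex(@"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);
        MatchCollection matches = regImg.Matches(sHtmlText);
        int i = 0;
        string[] sUrlList = new string[matches.Count];
        foreach (Match match in matches)
            sUrlList[i++] = match.Groups["imgUrl"].Value;
        return sUrlList;
    }

    // Prints one address per line: /images/a.jpg, then /images/b.png.
    public static void PrintImageUrls()
    {
        foreach (string url in GetHtmlImageUrlList(SampleHtml))
            Console.WriteLine(url);
    }
}
```

The tag without a src attribute is skipped, and the upper-case IMG and SRC are still matched because of RegexOptions.IgnoreCase.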

