C#: Removing HTML Tags with a for Loop

  • 2021-11-01 04:22:47
  • OfStack

Removing the HTML tags from a block of text strips out styles, paragraph markup, and so on. The most common approach is probably a regular expression. Note, however, that regular expressions cannot handle every HTML document, so it is sometimes better to iterate over the text yourself, for example with a for loop.

Look at the following code:


using System;
using System.Text.RegularExpressions;

/// <summary>
/// Methods to remove HTML from strings.
/// </summary>
public static class HtmlRemoval
{
    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}

The code provides three methods for removing HTML tags from a given string: two use a regular expression, and one fills a character array in a for loop. Here is a test of all three:


using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string html = "<p>There was a <b>.NET</b> programmer " +
            "and he stripped the <i>HTML</i> tags.</p>";
        Console.WriteLine(HtmlRemoval.StripTagsRegex(html));
        Console.WriteLine(HtmlRemoval.StripTagsRegexCompiled(html));
        Console.WriteLine(HtmlRemoval.StripTagsCharArray(html));
    }
}

The output is as follows:

There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.

The code above calls three different methods of the HtmlRemoval class, and they all return the same result: the given string with its HTML tags removed. The second method, which uses a predefined Regex object created with RegexOptions.Compiled, is faster than the first. However, RegexOptions.Compiled has a few drawbacks. In some cases, startup time can be tens of times longer. The following two articles give the details:

RegexOption.Compiled
Regex Performance

In general, regular expressions are not the most efficient option, so the HtmlRemoval class also provides a method that processes the string with a character array. The test program ran over 1000 HTML files of about 8000 characters each, all read with File.ReadAllText. The results show that the character array method is the fastest.

Performance test for HTML removal

HtmlRemoval.StripTagsRegex: 2404 ms
HtmlRemoval.StripTagsRegexCompiled: 1366 ms
HtmlRemoval.StripTagsCharArray: 287 ms [Fastest]

File length test for HTML removal

File length before: 8085 chars
HtmlRemoval.StripTagsRegex: 4382 chars
HtmlRemoval.StripTagsRegexCompiled: 4382 chars
HtmlRemoval.StripTagsCharArray: 4382 chars
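
The article does not show its test harness, so the following is only a minimal sketch of how such a timing comparison might be set up with Stopwatch. The class name StripBenchmark, the helpers Time and BuildSample, and the synthetic in-memory input are all assumptions made for illustration; the original test read real HTML files from disk, and the numbers this sketch prints will differ from those above.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class StripBenchmark
{
    // Same pattern and option as the article's compiled regex.
    static readonly Regex TagRegex = new Regex("<.*?>", RegexOptions.Compiled);

    public static string StripTagsRegexCompiled(string source)
    {
        return TagRegex.Replace(source, string.Empty);
    }

    // Same logic as the article's StripTagsCharArray.
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = c; }
        }
        return new string(array, 0, arrayIndex);
    }

    // Hypothetical helper: runs 'strip' on 'input' 'iterations' times
    // and returns the elapsed time in milliseconds.
    public static long Time(Func<string, string> strip, string input, int iterations)
    {
        strip(input); // warm-up call so one-time costs are not measured
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(input);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    // Hypothetical input: a synthetic page of roughly 8000 characters.
    public static string BuildSample()
    {
        StringBuilder builder = new StringBuilder();
        while (builder.Length < 8000)
        {
            builder.Append("<p>Some <b>bold</b> and <i>italic</i> text.</p>");
        }
        return builder.ToString();
    }

    public static void Run()
    {
        string html = BuildSample();
        Console.WriteLine("Regex (compiled): " + Time(StripTagsRegexCompiled, html, 1000) + " ms");
        Console.WriteLine("Char array:       " + Time(StripTagsCharArray, html, 1000) + " ms");
    }
}
```

Calling StripBenchmark.Run() from Main prints one timing line per method. The warm-up call is there so that JIT compilation and regex construction do not count against whichever method happens to run first.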

Using a character array can therefore save time when processing a large number of files. In the character array method, only characters outside HTML tags are copied into the array buffer. For efficiency, it uses a char array and the string constructor that takes a char array and a range, which is faster than using StringBuilder.
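
For comparison, here is a sketch of what the StringBuilder variant mentioned above might look like. It is not part of the original article's code; the class and method names are hypothetical, and it is shown only so the two approaches can be put side by side and timed.

```csharp
using System.Text;

public static class HtmlRemovalStringBuilder
{
    // Hypothetical StringBuilder counterpart of StripTagsCharArray.
    public static string StripTagsStringBuilder(string source)
    {
        // Pre-size the builder: the result is never longer than the input.
        StringBuilder builder = new StringBuilder(source.Length);
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<')
            {
                inside = true;
                continue;
            }
            if (c == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

The logic is the same as in the char array version; the only difference is that each kept character goes through Append instead of a direct array write.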

Self-closing HTML tags

In XHTML, some tags have no separate closing tag, such as <br/> and <img/>. The code above should handle self-closing HTML tags correctly. The following are some of the supported tags. Note that the regular expression methods may not handle invalid HTML tags properly.

Supported tags


<img src="" />
<img src=""/>
<br />
<br/>
< div >
<!-- -->
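
As a quick check, the sketch below runs the character array loop over inputs built from the tags listed above. The class name SelfClosingDemo and the sample strings are made up for illustration; the Strip method repeats the article's StripTagsCharArray logic so the block is self-contained.

```csharp
using System;

public static class SelfClosingDemo
{
    // Same logic as the article's StripTagsCharArray.
    public static string Strip(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = c; }
        }
        return new string(array, 0, arrayIndex);
    }

    public static void Run()
    {
        // Hypothetical sample inputs, one per supported tag form.
        string[] samples =
        {
            "one<img src=\"\" />two",
            "one<img src=\"\"/>two",
            "one<br />two",
            "one<br/>two",
            "< div >onetwo",
            "one<!-- -->two"
        };
        foreach (string sample in samples)
        {
            // Every sample reduces to "onetwo".
            Console.WriteLine(Strip(sample));
        }
    }
}
```

Because the loop only looks at the '<' and '>' characters, the slash, attributes, and extra spaces inside a tag make no difference to the result.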

Comments in HTML documents

The code presented in this article may fail to strip HTML comments correctly. Comments sometimes contain invalid HTML tags, which will not be completely removed during processing. Scanning for such incorrect markup may therefore sometimes be necessary.
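
To make the problem concrete: with the character array loop, the input "A<!-- x > y -->B" comes out as "A y --B", because the first '>' inside the comment ends the "inside a tag" state early. One possible way to cope with this, sketched below, is to skip everything from "<!--" to the next "-->" before applying the usual loop. This variant is not from the original article; the names are hypothetical, and dropping the rest of the text after an unterminated comment is an assumption of this sketch.

```csharp
using System;

public static class HtmlCommentRemoval
{
    // Hypothetical variant of StripTagsCharArray that also skips HTML comments.
    public static string StripTagsAndComments(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        for (int i = 0; i < source.Length; i++)
        {
            // Outside a tag, check whether a comment starts at this position.
            if (!inside && string.CompareOrdinal(source, i, "<!--", 0, 4) == 0)
            {
                int end = source.IndexOf("-->", i + 4, StringComparison.Ordinal);
                if (end < 0)
                {
                    break; // assumption: an unterminated comment swallows the rest
                }
                i = end + 2; // the loop increment then moves past the final '>'
                continue;
            }
            char c = source[i];
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = c; }
        }
        return new string(array, 0, arrayIndex);
    }
}
```

With this variant, "A<!-- x > y -->B" yields "AB". The extra comparison runs once per character outside a tag, so it adds some cost compared with the plain loop.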

How to validate

There are many ways to validate XHTML, and one is to iterate over the text in the same way as the code above. A simple approach is to check whether the '<' and '>' characters match, either in a loop or with a regular expression. The following resources describe these methods:

HTML Brackets: Validation

Validate XHTML
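
The bracket check described above can be sketched with the same kind of loop. This is an illustrative assumption rather than the method from the linked resources: the name BracketsMatch is hypothetical, and the rule it applies (every '<' must be closed by a '>' before the next '<', and no '>' may appear outside a tag) is far weaker than real XHTML validation.

```csharp
public static class BracketCheck
{
    // Hypothetical check: returns true when '<' and '>' strictly alternate,
    // starting with '<' and ending with '>'.
    public static bool BracketsMatch(string source)
    {
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<')
            {
                if (inside)
                {
                    return false; // a second '<' before the tag was closed
                }
                inside = true;
            }
            else if (c == '>')
            {
                if (!inside)
                {
                    return false; // a '>' with no opening '<'
                }
                inside = false;
            }
        }
        return !inside; // false if the text ends inside an unclosed tag
    }
}
```

For example, BracketsMatch("<p>ok</p>") returns true, while "<p", "a > b", and "<<p>" all return false. Text that fails this check is also the kind of input on which the stripping methods above are most likely to give unexpected results.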

There are many ways to remove HTML tags from a given string, and for well-formed input the three methods shown here return the same result. In the test above, iterating with a character array was clearly the fastest.

