Apr 2, 2020

An in-depth look at using Python to extract the body text from a web page's source code

This method is based on text density. The original idea comes from the general web page text extraction algorithm based on the line-block distribution function, developed at Harbin Institute of Technology. This article makes some minor modifications to it.

Convention: the method computes its statistics line by line, so it assumes that the page source has not been compressed (minified) and still contains normal line breaks.

On some news pages the text of the article may be relatively short but accompanied by an embedded video, so I give videos a high weight. The same applies to images. One drawback is that an image's weight should really depend on its displayed size, but the method in this article does not achieve this.

Because ads, navigation, and other non-body content usually appear as hyperlinks, this article gives hyperlink text a weight of zero.
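As a small illustration of the zero weight for links, the sketch below measures the text of a line while ignoring everything inside a elements. The helper name non_link_text_length and the two sample lines are made up for this example, and the regular expressions are a simplification rather than a real HTML parser.

```python
import re

def non_link_text_length(line):
    """Length of the visible text in `line`, not counting text inside <a>...</a>."""
    # Drop whole anchor elements, including the text they contain.
    without_links = re.sub(r'<a\b[^>]*>.*?</a>', '', line, flags=re.I | re.S)
    # Drop the remaining tags and keep only their text.
    text = re.sub(r'<[^>]+>', '', without_links)
    return len(text.strip())

# Hypothetical sample lines: a navigation row and a body paragraph.
nav = '<li><a href="/news">News</a> <a href="/sport">Sport</a></li>'
body = '<p>The council approved the plan on Monday.</p>'

print(non_link_text_length(nav))   # 0
print(non_link_text_length(body))  # 40
```

The navigation row contributes nothing, while the paragraph contributes its full length, which is what lets the later steps tell the two apart.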

The body content is assumed to be continuous, with no non-body content in the middle, so extracting the body comes down to finding the lines where it begins and ends.

Steps:

First, remove the CSS, JavaScript, comments, and meta and ins tags from the page, then remove the blank lines.

Calculate a value for each line of the processed page (explained below).

Find the start and end positions of the maximum-sum contiguous subsequence of the per-line values obtained above.
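The third step is the classic maximum-sum contiguous subsequence problem. The sketch below is one common way to solve it in a single pass (Kadane's algorithm) while also tracking where the best run starts and ends. The function name max_sum_run and the sample scores are invented for this example, and the input is assumed to be a non-empty list.

```python
def max_sum_run(values):
    """Return (start, end, total) so that values[start:end] has the largest sum.

    Assumes `values` is a non-empty list of numbers.
    """
    best_total = values[0]
    best_start, best_end = 0, 1
    run_total, run_start = 0, 0
    for i, v in enumerate(values):
        if run_total <= 0:
            # A non-positive running sum cannot help; start a new run here.
            run_total, run_start = v, i
        else:
            run_total += v
        if run_total > best_total:
            best_total = run_total
            best_start, best_end = run_start, i + 1
    return best_start, best_end, best_total

# Hypothetical per-line values: low at the edges, high in the middle.
scores = [-8, -8, 12, 40, -3, 55, -8, -8]
print(max_sum_run(scores))  # (2, 6, 104)
```

Note that the slightly negative line in the middle (-3) stays inside the selected range, because keeping it joins two high-value stretches into one.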

The second step needs some explanation:

For each line we need to calculate a value, as follows:

Each img tag counts as 50 characters of text (a fixed weight); x1 is the number of img tags in the line.

Each embed (video) tag counts as 1000 characters of text; x2 is the number of embed tags in the line.

The total length of the text inside all a (link) tags in the line is x3; it gets zero weight.

The length of the text inside all other tags is x4.

The value of each line is 50 * x1 + 1000 * x2 + x4 - 8

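Note: we subtract 8 because the next step looks for the maximum-sum contiguous subsequence, which only works if some lines have negative values. Subtracting a positive number from every line makes lines with little or no text negative. The value 8 was chosen empirically.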
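To make the formula concrete, here is a minimal sketch that computes the value of a single line directly from x1, x2, x3 and x4. It assumes x1 and x2 are tag counts, as the weights suggest. The function name row_value, the constant names, and the sample lines are made up for this example, and the regular expressions are only an approximation of HTML parsing.

```python
import re

IMG_WEIGHT = 50      # one <img> counts as 50 characters
VIDEO_WEIGHT = 1000  # one <embed> counts as 1000 characters
PENALTY = 8          # empirical constant subtracted from every line

def strip_tags(s):
    """Remove every tag and keep only the text between tags."""
    return re.sub(r'<[^>]+>', '', s)

def row_value(line):
    x1 = len(re.findall(r'<img\b[^>]*>', line, flags=re.I))    # image count
    x2 = len(re.findall(r'<embed\b[^>]*>', line, flags=re.I))  # video count
    link_texts = re.findall(r'<a\b[^>]*>(.*?)</a>', line, flags=re.I | re.S)
    x3 = sum(len(strip_tags(t)) for t in link_texts)           # link text, weight 0
    x4 = len(strip_tags(line).strip()) - x3                    # all other text
    return IMG_WEIGHT * x1 + VIDEO_WEIGHT * x2 + x4 - PENALTY

print(row_value('<p>Hello world</p>'))        # 3
print(row_value('<a href="/">Home</a>'))      # -8
print(row_value('<p><img src="a.png"></p>'))  # 42
print(row_value('<embed src="v.swf">'))       # 992
```

A line that contains only a link ends up negative, a short paragraph is slightly positive, and a single video outweighs a long stretch of navigation lines.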

The complete code

#coding:utf-8
import re
def remove_js_css (content):
    """ remove the javascript, the stylesheets, the comments, and the meta and ins tags (<script>....</script>, <style>....</style>, <!-- xxx -->, <meta ...>, <ins>....</ins>) """
    r = re.compile(r'''<script.*?</script>''',re.I|re.M|re.S)
    s = r.sub ('',content)
    r = re.compile(r'''<style.*?</style>''',re.I|re.M|re.S)
    s = r.sub ('', s)
    r = re.compile(r'''<!--.*?-->''', re.I|re.M|re.S)
    s = r.sub('',s)
    r = re.compile(r'''<meta.*?>''', re.I|re.M|re.S)
    s = r.sub('',s)
    r = re.compile(r'''<ins.*?</ins>''', re.I|re.M|re.S)
    s = r.sub('',s)
    return s
def remove_empty_line (content):
    """remove whitespace-only lines and collapse consecutive line breaks"""
    r = re.compile(r'''^\s+$''', re.M|re.S)
    s = r.sub ('', content)
    r = re.compile(r'''\n+''',re.M|re.S)
    s = r.sub('\n',s)
    return s
def remove_any_tag (s):
    s = re.sub(r'''<[^>]+>''','',s)
    return s.strip()
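def remove_any_tag_but_a (s):
    """return (length of the link text, length of all the text) for s"""
    # \b keeps tags such as <area> or <abbr> from being mistaken for links
    text = re.findall (r'''<a\b[^>]*>(.*?)</a>''',s,re.I|re.S)
    text_b = remove_any_tag (s)
    return len(''.join(text)),len(text_b)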
def remove_image (s,n=50):
    image = 'a' * n
    r = re.compile (r'''<img.*?>''',re.I|re.M|re.S)
    s = r.sub(image,s)
    return s
def remove_video (s,n=1000):
    video = 'a' * n
    r = re.compile (r'''<embed.*?>''',re.I|re.M|re.S)
    s = r.sub(video,s)
    return s
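def sum_max (values):
    """return (left, right) so that values[left:right] is the slice with the largest sum"""
    cur_max = 0  # start from 0; values[0] is added inside the loop
    glo_max = -999999
    left,right = 0,0
    for index,value in enumerate (values):
        cur_max += value
        if (cur_max > glo_max) :
            glo_max = cur_max
            right = index
        if (cur_max < 0):
            cur_max = 0  # a negative running sum cannot help later lines
    # walk back from right, subtracting values until the best sum is used up
    for i in range(right, -1, -1):
        glo_max -= values[i]
        if abs(glo_max) < 0.00001:
            left = i
            break
    return left,right+1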
def method_1 (content, k=1):
    """score the page in groups of k lines and return the best range of groups"""
    if not content:
        return None,None,None,None
    tmp = content.split('\n')
    group_value = []
    for i in range(0,len(tmp),k):
        group = '\n'.join(tmp[i:i+k])
        # images and videos become placeholder text of fixed length
        group = remove_image (group)
        group = remove_video (group)
        text_a,text_b= remove_any_tag_but_a (group)
        # non-link text length minus the empirical penalty of 8
        temp = (text_b - text_a) - 8
        group_value.append (temp)
    left,right = sum_max (group_value)
    return left,right, len('\n'.join(tmp[:left])), len ('\n'.join(tmp[:right]))
def extract (content):
    content = remove_empty_line(remove_js_css(content))
    left,right,x,y = method_1 (content)
    return '\n'.join(content.split('\n')[left:right])

Execution starts from the last function, extract.
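To see the three steps working together, here is a condensed, self-contained sketch of the same pipeline applied to a tiny page. It is a simplified variant written for this example, not the code above: the names clean, score and best_span and the sample page are hypothetical, and whole a elements are removed before measuring the text.

```python
import re

def clean(html):
    # Step 1: drop scripts, styles, comments, meta and ins, then blank lines.
    for pattern in (r'<script.*?</script>', r'<style.*?</style>', r'<!--.*?-->',
                    r'<meta.*?>', r'<ins.*?</ins>'):
        html = re.sub(pattern, '', html, flags=re.I | re.S)
    return [line for line in html.split('\n') if line.strip()]

def score(line):
    # Step 2: non-link text length (images and videos as placeholders) minus 8.
    line = re.sub(r'<img\b[^>]*>', 'a' * 50, line, flags=re.I)
    line = re.sub(r'<embed\b[^>]*>', 'a' * 1000, line, flags=re.I)
    line = re.sub(r'<a\b[^>]*>.*?</a>', '', line, flags=re.I | re.S)
    return len(re.sub(r'<[^>]+>', '', line).strip()) - 8

def best_span(values):
    # Step 3: maximum-sum contiguous run as a half-open [start, end) range.
    best, span = values[0], (0, 1)
    total, start = 0, 0
    for i, v in enumerate(values):
        if total <= 0:
            total, start = v, i
        else:
            total += v
        if total > best:
            best, span = total, (start, i + 1)
    return span

page = """<html><head><meta charset="utf-8"><style>p{margin:0}</style></head>
<body>
<ul><li><a href="/">Home</a></li><li><a href="/world">World</a></li></ul>
<p>The city council voted on Monday to extend the tram line to the harbour.</p>
<p>Construction is expected to start next spring and to take two years.</p>
<div><a href="/about">About us</a> <a href="/contact">Contact</a></div>
</body></html>"""

lines = clean(page)
start, end = best_span([score(l) for l in lines])
print('\n'.join(lines[start:end]))
```

On this sample the navigation list and the footer links score -8 each, so only the two paragraph lines are selected.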