Building a small "today's news" program with a Python web crawler

  • 2021-12-04 19:18:33
  • OfStack

Core code

requests.get: downloads the HTML page
bs4.BeautifulSoup: parses the HTML content


from requests import get
from bs4 import BeautifulSoup as bs
from datetime import datetime as dt
 
def Today(style=1):
    date = dt.today()
    if style!=1: return f'{date.month} Month {date.day} Day '
    return f'{date.year}-{date.month:02}-{date.day:02}'
 
def SinaNews(style=1):
    url1 = 'http://news.***.com.cn/'
    if style==1: url1 += 'world'
    elif style==2: url1 += 'china'
    else: url1='https://mil.news.sina.com.cn/'
    text = get(url1)
    text.encoding='utf-8'
    soup = bs(text.text,'html.parser')
    aTags = soup.find_all("a")
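    # Keep only links whose markup contains today's date; skip <a> tags without href
    return [(t.text,t['href']) for t in aTags if t.has_attr('href') and Today() in str(t)]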
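
The date string returned by Today() is what later selects the news links, so its exact shape matters. As a minimal sketch (the helper name date_tag, the fixed date and the example.com link are illustrative assumptions, not part of the program), the same zero-padded format can be produced with strftime and checked against a link:

```python
from datetime import date

def date_tag(d):
    """Format a date as YYYY-MM-DD, zero-padded like Today(style=1)."""
    return d.strftime('%Y-%m-%d')

fixed = date(2021, 9, 5)    # fixed date so the result is reproducible
manual = f'{fixed.year}-{fixed.month:02}-{fixed.day:02}'    # f-string form used by Today()

href = 'https://news.example.com/w/2021-09-05/doc-aaa.shtml'    # made-up link
print(date_tag(fixed) == manual)    # do both spellings give the same string?
print(date_tag(fixed) in href)      # does the padded form occur in the URL path?
print('2021-9-5' in href)           # would an unpadded date occur in it?
```

Without the :02 padding the date would not match paths of this shape, which is presumably why Today() pads the month and day.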

Crawling the titles

>>> for i,news in enumerate(SinaNews(1)):
    print(f'No{i+1}:',news[0])

    
No1: Foreign media: *****
No2: Japanese media: ******
......

(The content has been masked.)
>>> 

This was my first crawler, so I looked for a news website that needs no cracking: downloading the page is enough to get the content directly. Three pages are used as content sources: international, domestic and military news. requests.get downloads each page and BeautifulSoup parses the HTML text. Among all the <a href=...> tags, the ones whose markup contains today's date are exactly what is needed.
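
The date filter can be tried offline on a small piece of HTML. The sketch below is only an illustration under stated assumptions: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra installs, it checks the date in the href only (the program checks the whole tag), and the class name DatedLinkCollector, the sample HTML and the example.com links are all made up.

```python
from html.parser import HTMLParser

class DatedLinkCollector(HTMLParser):
    """Collect (text, href) pairs for <a> tags whose href contains a date string."""

    def __init__(self, date_str):
        super().__init__()
        self.date_str = date_str
        self.links = []
        self._href = None     # href of the currently open <a>, if it matched
        self._chunks = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            self._href = href if self.date_str in href else None
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._chunks).strip(), self._href))
            self._href = None

SAMPLE_HTML = '''
<a href="https://news.example.com/w/2021-09-29/doc-aaa.shtml">Headline one</a>
<a href="https://news.example.com/about.html">About us</a>
<a>No href at all</a>
<a href="https://news.example.com/c/2021-09-29/doc-bbb.shtml">Headline two</a>
'''

collector = DatedLinkCollector('2021-09-29')
collector.feed(SAMPLE_HTML)
for title, href in collector.links:
    print(title, href)
```

Only the two dated links survive; the navigation link and the anchor without an href are dropped.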

Crawling the article text

Next, download the article page from its URL. Inspecting the page shows that the body text sits in the <div> with id='article'. .get_text() is the key function for extracting the text, followed by some formatting as appropriate:


>>> def NewsDownload(url):
    html = get(url)
    html.encoding='utf-8'
    soup = bs(html.text,'html.parser')
    text = soup.find('div',id='article').get_text().strip()
    text = text.replace(' Click to enter the topic: ',' Related topics: ')
    text = text.replace('    ','\n    ')
    while '\n\n\n' in text:
        text = text.replace('\n\n\n','\n\n')
    return text
 
>>> url = 'https://******/w/2021-09-29/doc-iktzqtyt8811588.shtml'
>>> NewsDownload(url)
' Original title: ******************************************************'
>>> 
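
The formatting step of NewsDownload can be exercised without any network access. The following is a sketch, not the article's code: the function name tidy_article and the sample string are assumptions, and the while loop is replaced by a single regular expression that collapses three or more consecutive newlines into two, which gives the same result.

```python
import re

def tidy_article(text):
    """Apply the same clean-up as NewsDownload to an already extracted string."""
    text = text.strip()
    # Start a new line before every four-space paragraph indent.
    text = text.replace('    ', '\n    ')
    # Collapse runs of three or more newlines into exactly two.
    return re.sub(r'\n{3,}', '\n\n', text)

raw = '  Title\n\n\n\nFirst paragraph.    Second paragraph.  '
print(tidy_article(raw))
```

Keeping this step in its own function makes it easy to adjust the formatting rules without downloading a page each time.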

Interface code

The interface uses four widgets from the built-in GUI library tkinter: Text, Listbox, Scrollbar and Button. Set their basic properties, place them, bind their commands, and then debug until the program is finished.

Source code of News.pyw (the website name has been masked):


from requests import get
from bs4 import BeautifulSoup as bs
from datetime import datetime as dt
from os import path
import tkinter as tk
 
def Today(style=1):
    date = dt.today()
    if style!=1: return f'{date.month} Month {date.day} Day '
    return f'{date.year}-{date.month:02}-{date.day:02}'
 
def SinaNews(style=1):
    url1 = 'http://news.****.com.cn/'
    if style==1: url1 += 'world'
    elif style==2: url1 += 'china'
    else: url1='https://mil.****.com.cn/'
    text = get(url1)
    text.encoding='utf-8'
    soup = bs(text.text,'html.parser')
    aTags = soup.find_all("a")
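    # Keep only links whose markup contains today's date; skip <a> tags without href
    return [(t.text,t['href']) for t in aTags if t.has_attr('href') and Today() in str(t)]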
 
def NewsList(i):
    global news
    news = SinaNews(i)
    tList.delete(0,tk.END)
    for idx,item in enumerate(news):
        tList.insert(tk.END,f'{idx+1:03} {item[0]}')
    tText.config(state=tk.NORMAL)
    tText.delete(0.0,tk.END)
    tText.config(state=tk.DISABLED)
    NewsShow(0)
    
def NewsList1(): NewsList(1)
def NewsList2(): NewsList(2)
def NewsList3(): NewsList(3)
 
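def NewsShow(idx):
    # idx is 0 at start-up; on double-click Tk passes an event object instead
    if not news: return    # nothing was crawled, so there is nothing to show
    if idx!=0:
        idx = tList.curselection()[0]
    title,url = news[idx][0],news[idx][1]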
    html = get(url)
    html.encoding='utf-8'
    soup = bs(html.text,'html.parser')
    text = soup.find('div',id='article').get_text().strip()
    text = text.replace(' Click to enter the topic: ',' Related topics: ')
    text = text.replace('    ','\n    ')
    while '\n\n\n' in text:
        text = text.replace('\n\n\n','\n\n')
    tText.config(state=tk.NORMAL)
    tText.delete(0.0,tk.END)
    tText.insert(tk.END, title+'\n\n'+text)
    tText.config(state=tk.DISABLED)
    
def InitWindow(self,W,H):
    Y = self.winfo_screenheight()
    winPosition = str(W)+'x'+str(H)+'+8+'+str(Y-H-100)
    self.geometry(winPosition)
    icoFile = 'favicon.ico'
    f = path.exists(icoFile)
    if f: self.iconbitmap(icoFile)
    self.resizable(False,False)
    self.wm_attributes('-topmost',True)
    self.title(bTitle[0])
    SetControl()
    self.update()
    self.mainloop()
 
def SetControl():
    global tList,tText
    tScroll = tk.Scrollbar(win, orient=tk.VERTICAL)
    tScroll.place(x=450,y=320,height=300)
    tList = tk.Listbox(win,selectmode=tk.BROWSE,yscrollcommand=tScroll.set)
    tScroll.config(command=tList.yview)
    for idx,item in enumerate(news):
        tList.insert(tk.END,f'{idx+1:03} {item[0]}')
    tList.place(x=15,y=320,width=435,height=300)
    tList.select_set(0)
    tList.focus()
    bW,bH = 70,35    # Width and height of button 
    bX,bY = 95,270    # Coordinates of the button 
    tBtn1 = tk.Button(win,text=bTitle[1],command=NewsList1)
    tBtn1.place(x=bX,y=bY,width=bW,height=bH)
    tBtn2=tk.Button(win,text=bTitle[2],command=NewsList2)
    tBtn2.place(x=bX+100,y=bY,width=bW,height=bH)
    tBtn3 = tk.Button(win,text=bTitle[3],command=NewsList3)
    tBtn3.place(x=bX+200,y=bY,width=bW,height=bH)
    tScroll2 = tk.Scrollbar(win, orient=tk.VERTICAL)
    tScroll2.place(x=450,y=10,height=240)
    tText = tk.Text(win,yscrollcommand=tScroll2.set)
    tScroll2.config(command=tText.yview)
    tText.place(x=15,y=10,width=435,height=240)
    tText.config(state=tk.DISABLED,bg='azure',font=('SimSun', '14'))    # SimSun is the Song typeface
    NewsShow(0)
    tList.bind("<Double-Button-1>",NewsShow)
 
if __name__=='__main__':
 
    win = tk.Tk()
    bTitle = ("Today's news",'International news','Domestic news','Military news')
    news = SinaNews()
    InitWindow(win,480,640)
 

All the code is shown above, so I will not analyze it in detail here. If you have questions, please leave a comment for discussion. My environment is Win7 + Python 3.8.8, and the program runs there without errors. The name of the website involved in this article has been masked. If you can't guess the name, you can ask me in private.
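
One detail of the interface code that may be worth a sketch is how the three buttons are bound to commands. The program defines the wrappers NewsList1, NewsList2 and NewsList3 by hand. A possible alternative, shown here with a stand-in function news_list (hypothetical, it only returns a message instead of crawling), is functools.partial, which turns a one-argument function into the zero-argument callable that a Button's command option expects:

```python
from functools import partial

def news_list(category):
    """Stand-in for NewsList(i): only reports which category was requested."""
    return f'loading category {category}'

# Button label -> zero-argument callable, one per news category
commands = {
    'International news': partial(news_list, 1),
    'Domestic news': partial(news_list, 2),
    'Military news': partial(news_list, 3),
}

for label, command in commands.items():
    print(label, '->', command())
```

In the real program the same idea would read tk.Button(win, text=label, command=partial(NewsList, i)), which removes the need for the three wrapper functions.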

Packaging the software

Use pyinstaller.exe to package the program into a single executable file. Note that the suffix of the source file should be .pyw, otherwise a black cmd window will appear. Another small tip: the logo icon file of a website can generally be downloaded from its root directory, namely:
http(s)://websiteurl.com(.cn)/favicon.ico
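
As a small sketch of that rule, the favicon address can be derived from any page URL with the standard library. The function name favicon_url and the example.com address are assumptions for illustration, and not every site actually serves an icon at this location (some declare it in a <link rel="icon"> tag instead).

```python
from urllib.parse import urlsplit, urlunsplit

def favicon_url(page_url):
    """Return the conventional root-level favicon address for a page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, '/favicon.ico', '', ''))

print(favicon_url('https://www.example.com/news/2021/index.html?page=2'))
```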

The packaging command is as follows:

D:\>pyinstaller --onefile --nowindowed --icon="D:\favicon.ico" News.pyw

After packaging, a News.exe executable file is generated in the dist folder. Its size is about 15 MB, which is acceptable.
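
One caveat, offered as an assumption rather than something tested in this article: --icon sets the icon of the .exe file, while InitWindow looks for favicon.ico in the current directory at run time. If the icon file were also bundled into the executable (PyInstaller has an --add-data option for that), a helper along these lines could locate it. The name resource_path is hypothetical; sys._MEIPASS is the attribute that PyInstaller's one-file bootloader sets to its temporary unpack folder.

```python
import os
import sys

def resource_path(relative_path):
    """Return a path to a data file, both in a frozen build and when run from source."""
    # In a PyInstaller one-file build, bundled files are unpacked to sys._MEIPASS;
    # when running the plain script, fall back to the current directory.
    base_path = getattr(sys, '_MEIPASS', os.path.abspath('.'))
    return os.path.join(base_path, relative_path)

ico_file = resource_path('favicon.ico')
print(ico_file, os.path.exists(ico_file))
```

With such a helper, the existing path.exists check in InitWindow would still decide whether the window icon is applied.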

In any case, you can take the program and use it as is. Remember to bookmark this article before you leave. Thank you!

