A ready-to-use Python applet that crawls today's news

  • 2021-12-04 19:16:48
  • OfStack

Contents: Core code, Crawl the title, Interface code, Software compilation

Core code

requests.get downloads the HTML page
bs4.BeautifulSoup parses the HTML content


from requests import get
from bs4 import BeautifulSoup as bs
from datetime import datetime as dt

def Today(style=1):
    # style 1: 'YYYY-MM-DD' (the form that appears in article URLs); otherwise a month/day label
    date = dt.today()
    if style!=1: return f'{date.month} Month {date.day} Day '
    return f'{date.year}-{date.month:02}-{date.day:02}'

def SinaNews(style=1):
    # style 1: international news, 2: domestic news, anything else: military news
    url1 = 'http://news.***.com.cn/'
    if style==1: url1 += 'world'
    elif style==2: url1 += 'china'
    else: url1='https://mil.news.sina.com.cn/'
    text = get(url1)
    text.encoding='utf-8'
    soup = bs(text.text,'html.parser')
    aTags = soup.find_all("a")
    # Keep only the links whose markup contains today's date
    return [(t.text,t['href']) for t in aTags if Today() in str(t)]

Crawl the title


for i,news in enumerate(SinaNews(1)):
    print(f'No{i+1}:',news[0])

    
No1:  Foreign media: *****
No2:  Japanese media: ******
......

(The headlines have been masked.)

This was my first crawler, so I looked for a news website that needs no cracking: downloading a page is enough to get its content directly. Three pages, for international, domestic and military news, serve as the content sources. After requests.get downloads a page, the resulting HTML text is parsed, and the <a href=...> tags that contain today's date are exactly the ones needed.
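
The date filter can be tried offline on a small piece of HTML. The following is only a sketch and not the code from this article: so that it runs without third-party packages, it swaps BeautifulSoup for the standard library's html.parser, and it checks only the href attribute for the date (so a tag without an href is skipped rather than raising a KeyError). The class name DatedLinkCollector, the sample HTML and the example.com addresses are made up for illustration.

```python
from html.parser import HTMLParser


class DatedLinkCollector(HTMLParser):
    """Collect (title, href) pairs for <a> tags whose href contains a date string."""

    def __init__(self, date_text):
        super().__init__()
        self.date_text = date_text
        self.links = []          # finished (title, href) pairs
        self._href = None        # href of the matching <a> that is currently open
        self._title_parts = []   # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if self.date_text in href:
                self._href = href
                self._title_parts = []

    def handle_data(self, data):
        # Only text inside a matching <a> belongs to a headline.
        if self._href is not None:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            title = ''.join(self._title_parts).strip()
            self.links.append((title, self._href))
            self._href = None


# Hypothetical sample page; example.com stands in for the masked news site.
SAMPLE_HTML = '''
<ul>
  <li><a href="https://example.com/w/2021-09-29/doc-1.shtml">First headline</a></li>
  <li><a href="https://example.com/w/2021-09-28/doc-2.shtml">Yesterday's headline</a></li>
  <li><a href="https://example.com/about.html">About</a></li>
  <li><a>No link here 2021-09-29</a></li>
</ul>
'''

collector = DatedLinkCollector('2021-09-29')
collector.feed(SAMPLE_HTML)
for number, (title, href) in enumerate(collector.links, start=1):
    print(f'No{number}:', title, href)
```

Only the first link survives the filter, because it is the only one whose address contains the requested date.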

Crawl the text

Next, download the article page from its URL. Inspecting the page shows that the <div> with id='article' holds the body text; .get_text() is the key function for extracting it, after which a little formatting is applied:


>>> def NewsDownload(url):
    html = get(url)
    html.encoding='utf-8'
    soup = bs(html.text,'html.parser')
    text = soup.find('div',id='article').get_text().strip()
    text = text.replace(' Click to enter the topic: ',' Related topics: ')
    text = text.replace('    ','\n    ')
    while '\n\n\n' in text:
        text = text.replace('\n\n\n','\n\n')
    return text 
>>> url = 'https://******/w/2021-09-29/doc-iktzqtyt8811588.shtml'
>>> NewsDownload(url)
' Original title: ******************************************************'
>>> 
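
The formatting steps in NewsDownload are plain string operations, so they can be checked without any network access. Below is a minimal sketch of just those steps; the function name tidy_article_text and the sample string are hypothetical and stand in for whatever get_text() returns on a real page.

```python
def tidy_article_text(raw):
    """Apply the same clean-up steps as NewsDownload to an already extracted string."""
    text = raw.strip()
    # Treat a run of four spaces as a paragraph indent and start a new line there.
    text = text.replace('    ', '\n    ')
    # Collapse runs of three or more newlines down to a single blank line.
    while '\n\n\n' in text:
        text = text.replace('\n\n\n', '\n\n')
    return text


# Hypothetical sample standing in for the output of get_text().
raw = '\n\n  Original title: Sample\n\n\n\n\n    First paragraph.    Second paragraph.\n\n'
print(tidy_article_text(raw))
```

Note that soup.find('div', id='article') returns None when a page has no such <div>, in which case calling .get_text() raises an AttributeError, so a caller may want to check for None first.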

Interface code

The interface uses the built-in GUI library tkinter with the Text, Listbox, Scrollbar and Button controls. Set their basic properties, place them, bind their commands, and then debug until the program is done!
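
As a rough illustration of those steps, here is a minimal sketch of one Listbox wired to a Scrollbar and a double-click handler. It is not the article's program: the names list_label, build_ui and on_open are hypothetical, and the coordinates are arbitrary. Tk passes an event object to a bound handler, which is why the selection is read inside the handler itself.

```python
def list_label(index, title):
    """Format one Listbox row: a three-digit running number, then the title."""
    return f'{index + 1:03} {title}'


def build_ui(root, titles, on_open):
    """Create a Listbox with a vertical Scrollbar inside root.

    root is an existing tk.Tk() window; on_open(index) is a hypothetical
    callback that receives the index of the double-clicked row.
    """
    import tkinter as tk  # imported here so list_label works without a display

    scroll = tk.Scrollbar(root, orient=tk.VERTICAL)
    # The Listbox reports its view to the Scrollbar through yscrollcommand.
    listbox = tk.Listbox(root, selectmode=tk.BROWSE, yscrollcommand=scroll.set)
    # Dragging the Scrollbar moves the Listbox view.
    scroll.config(command=listbox.yview)
    for index, title in enumerate(titles):
        listbox.insert(tk.END, list_label(index, title))
    listbox.place(x=15, y=15, width=435, height=300)
    scroll.place(x=450, y=15, height=300)

    def handle_double_click(event):
        # Nothing may be selected, so check before indexing.
        selection = listbox.curselection()
        if selection:
            on_open(selection[0])

    listbox.bind('<Double-Button-1>', handle_double_click)
    return listbox


# To try it on a machine with a display:
#     import tkinter as tk
#     root = tk.Tk()
#     root.geometry('480x340')
#     build_ui(root, ['First headline', 'Second headline'], print)
#     root.mainloop()
```

Keeping the row formatting in its own small function means it can be checked without opening a window.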

Source code of News.pyw (the website name has been masked):


from requests import get
from bs4 import BeautifulSoup as bs
from datetime import datetime as dt
from os import path
import tkinter as tk 
def Today(style=1):
    date = dt.today()
    if style!=1: return f'{date.month} Month {date.day} Day '
    return f'{date.year}-{date.month:02}-{date.day:02}'
def SinaNews(style=1):
    url1 = 'http://news.****.com.cn/'
    if style==1: url1 += 'world'
    elif style==2: url1 += 'china'
    else: url1='https://mil.****.com.cn/'
    text = get(url1)
    text.encoding='utf-8'
    soup = bs(text.text,'html.parser')
    aTags = soup.find_all("a")
    return [(t.text,t['href']) for t in aTags if Today() in str(t)] 
def NewsList(i):
    global news
    news = SinaNews(i)
    tList.delete(0,tk.END)
    for idx,item in enumerate(news):
        tList.insert(tk.END,f'{idx+1:03} {item[0]}')
    tText.config(state=tk.NORMAL)
    tText.delete(0.0,tk.END)
    tText.config(state=tk.DISABLED)
    NewsShow(0)   
def NewsList1(): NewsList(1)
def NewsList2(): NewsList(2)
def NewsList3(): NewsList(3) 
def NewsShow(idx):
    if idx!=0:
        idx = tList.curselection()[0]
    title,url = news[idx][0],news[idx][1]
    html = get(url)
    html.encoding='utf-8'
    soup = bs(html.text,'html.parser')
    text = soup.find('div',id='article').get_text().strip()
    text = text.replace(' Click to enter the topic: ',' Related topics: ')
    text = text.replace('    ','\n    ')
    while '\n\n\n' in text:
        text = text.replace('\n\n\n','\n\n')
    tText.config(state=tk.NORMAL)
    tText.delete(0.0,tk.END)
    tText.insert(tk.END, title+'\n\n'+text)
    tText.config(state=tk.DISABLED)   
def InitWindow(self,W,H):
    Y = self.winfo_screenheight()
    winPosition = str(W)+'x'+str(H)+'+8+'+str(Y-H-100)
    self.geometry(winPosition)
    icoFile = 'favicon.ico'
    f = path.exists(icoFile)
    if f: self.iconbitmap(icoFile)
    self.resizable(False,False)
    self.wm_attributes('-topmost',True)
    self.title(bTitle[0])
    SetControl()
    self.update()
    self.mainloop()
def SetControl():
    global tList,tText
    tScroll = tk.Scrollbar(win, orient=tk.VERTICAL)
    tScroll.place(x=450,y=320,height=300)
    tList = tk.Listbox(win,selectmode=tk.BROWSE,yscrollcommand=tScroll.set)
    tScroll.config(command=tList.yview)
    for idx,item in enumerate(news):
        tList.insert(tk.END,f'{idx+1:03} {item[0]}')
    tList.place(x=15,y=320,width=435,height=300)
    tList.select_set(0)
    tList.focus()
    bW,bH = 70,35    # Width and height of button 
    bX,bY = 95,270    # Coordinates of the button 
    tBtn1 = tk.Button(win,text=bTitle[1],command=NewsList1)
    tBtn1.place(x=bX,y=bY,width=bW,height=bH)
    tBtn2=tk.Button(win,text=bTitle[2],command=NewsList2)
    tBtn2.place(x=bX+100,y=bY,width=bW,height=bH)
    tBtn3 = tk.Button(win,text=bTitle[3],command=NewsList3)
    tBtn3.place(x=bX+200,y=bY,width=bW,height=bH)
    tScroll2 = tk.Scrollbar(win, orient=tk.VERTICAL)
    tScroll2.place(x=450,y=10,height=240)
    tText = tk.Text(win,yscrollcommand=tScroll2.set)
    tScroll2.config(command=tText.yview)
    tText.place(x=15,y=10,width=435,height=240)
    tText.config(state=tk.DISABLED,bg='azure',font=('SimSun', '14'))    # SimSun is the Song typeface
    NewsShow(0)
    tList.bind("<Double-Button-1>",NewsShow)
if __name__=='__main__':
    win = tk.Tk()
    bTitle = ("Today's news",'International news','Domestic news','Military news')
    news = SinaNews()
    InitWindow(win,480,640)
 

All the code is shown above, so I will not analyze it in detail here; leave a message if you would like to discuss it. In my environment, Win7 + Python 3.8.8, it runs without errors. The website name in this article has been masked; if you can't guess it, you can ask me privately.

Software compilation

Use pyinstaller.exe to compile the script into a single executable file. Note that the source file should have the .pyw suffix, otherwise a black cmd window will appear. One more small tip: a website's logo icon file can generally be downloaded from its root directory, namely:
http(s)://websiteurl.com(.cn)/favicon.ico
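
As a small sketch of that tip, the icon address can be derived from any page URL with the standard library. The function name favicon_url is hypothetical and example.com is a placeholder; also, not every site serves an icon at this location, so treat the result as a guess to try rather than a guarantee.

```python
from urllib.parse import urlsplit, urlunsplit


def favicon_url(page_url):
    """Return the conventional /favicon.ico location for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, '/favicon.ico', '', ''))


# example.com is a placeholder, not the site used in this article.
print(favicon_url('https://example.com/w/2021-09-29/doc-1.shtml?from=home'))
```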

The compile command is as follows:

D:\> pyinstaller --onefile --nowindowed --icon="D:\favicon.ico" News.pyw

After compiling, a News.exe executable is generated in the dist folder. It is about 15 MB, which is acceptable.

Either way, feel free to take it and use it as is.
