python method for deleting duplicate files in local folders

  • 2020-11-03 22:30:18
  • OfStack

My last post was about downloading images from the web: I scraped an entire joke site, but the result contained a lot of duplicates, such as other images embedded in the pages and reposted pictures. So I went back to Python and wrote a script to delete duplicate images in a specified folder.

1. Methods and ideas

1. Comparing files for equality: the hashlib library provides a way to compute the MD5 hash of a file's contents, so we can decide whether two images are identical by comparing their MD5 values
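As a quick illustration of the hashlib API, the MD5 digest of a byte string can be computed like this (the hash of a file is simply the hash of its bytes):

```python
import hashlib

# Compute the MD5 hex digest of a byte string
digest = hashlib.md5(b"hello").hexdigest()
print(digest)  # 5d41402abc4b2a76b9719d911017c592

# Identical byte sequences always produce the same digest,
# which is what lets us detect duplicate files
assert digest == hashlib.md5(b"hel" + b"lo").hexdigest()
```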

2. File operations: the os library provides methods for working with files, for example os.remove() deletes the file at a given path, and os.listdir() returns the names of all files in the folder at a given path
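A minimal sketch of these two os calls, using a temporary folder so it runs anywhere:

```python
import os
import tempfile

# Create a throwaway folder containing two empty files
base = tempfile.mkdtemp()
for name in ("a.jpg", "b.jpg"):
    open(os.path.join(base, name), "w").close()

# os.listdir() returns bare file names; join them to the folder
# path to get usable full paths
paths = [os.path.join(base, n) for n in os.listdir(base)]
print(sorted(os.path.basename(p) for p in paths))  # ['a.jpg', 'b.jpg']

# os.remove() deletes one file by path
os.remove(paths[0])
print(len(os.listdir(base)))  # 1
```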

Idea: get all file names in the specified folder, join each name to the folder path to form a full path, compute the MD5 value of each file in turn, and delete a file if its MD5 value has already been seen

2. Code implementation


import os 
import hashlib 
import logging 
import sys 
 
def logger():
    """Get a configured logger."""
    logger = logging.getLogger()
    if not logger.handlers:
        # Specify the logger's output format
        formatter = logging.Formatter('%(asctime)s %(levelname)-8s: %(message)s')
        # Log file handler
        file_handler = logging.FileHandler("test.log")
        file_handler.setFormatter(formatter)  # setFormatter specifies the output format
        # Console handler
        console_handler = logging.StreamHandler(sys.stdout)
        console_handler.setFormatter(formatter)
        # Attach both handlers to the logger
        logger.addHandler(file_handler)
        logger.addHandler(console_handler)
        # Minimum output level (the default is WARNING)
        logger.setLevel(logging.INFO)
    return logger
 
def get_md5(filename):
    """Return the MD5 hex digest of a file's contents."""
    m = hashlib.md5()
    with open(filename, "rb") as mfile:
        m.update(mfile.read())
    return m.hexdigest()
 
def get_urllist():
    """Return the full path of every file in the target folder."""
    # Replace with your own folder path
    base = "F:\\pythonFile\\ Fried egg net \\ Boring figure \\jpg\\"
    file_names = os.listdir(base)
    url_list = []
    for name in file_names:
        url_list.append(base + name)
    return url_list
 
if __name__ == '__main__':
    log = logger()
    md5_list = []
    url_list = get_urllist()
    for path in url_list:
        md5 = get_md5(path)
        if md5 in md5_list:
            # Seen this MD5 before: the file is a duplicate, delete it
            os.remove(path)
            print("Duplicate: %s" % path)
            log.info("Duplicate: %s" % path)
        else:
            md5_list.append(md5)
    print("A total of %s unique photos" % len(md5_list))

We can then check the log to see which files were duplicates. Note that get_md5() reads the entire file into memory at once, which is fine for small images like these but wasteful for very large files. Just replace the path with your own and the script will run on your machine.
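For larger files, a common refinement (not part of the original script) is to feed the file to hashlib in fixed-size chunks, so only one chunk is held in memory at a time. A sketch, with the chunk size chosen arbitrarily:

```python
import hashlib

def get_md5_chunked(filename, chunk_size=1024 * 1024):
    """MD5 a file without loading it into memory all at once."""
    m = hashlib.md5()
    with open(filename, "rb") as f:
        # iter() with a sentinel keeps calling f.read(chunk_size)
        # until it returns b"" at end of file
        for chunk in iter(lambda: f.read(chunk_size), b""):
            m.update(chunk)
    return m.hexdigest()
```

The resulting digest is identical to the one from hashing the whole file in a single read; only the memory profile differs, so it can be dropped in as a replacement for get_md5() above.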
