Python method for deleting duplicate files in a local folder
- 2020-11-03 22:30:18
- OfStack
My last post was about downloading images from the web: I scraped an entire joke site, but the results contained many duplicates, such as other images embedded in the pages and reposted content. So I went back to Python and wrote a script to delete duplicate images in a specified folder.
1. Methods and ideas
1. Comparing files: the hashlib library provides a way to compute a file's MD5 digest, so we can decide whether two images are identical by comparing their MD5 values.
2. File operations: the os library provides operations on files, e.g. os.remove() deletes a specified file, and os.listdir() returns the names of all files in a given folder path.
Idea: get all file names in the specified folder, join each name with the folder path to form an absolute path, compute the MD5 value of each file in turn, and delete a file if its MD5 value has already been seen.
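The comparison in step 1 can be sketched on its own before looking at the full script (the helper name file_md5 is mine, for illustration only):

```python
import hashlib

def file_md5(path):
    # Read the file in binary mode and return its MD5 hex digest;
    # identical file contents always produce identical digests.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()
```

Two files are treated as duplicates exactly when their digests are equal, i.e. `file_md5(a) == file_md5(b)`.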
2. Code implementation
import os
import hashlib
import logging
import sys


def get_logger():
    """Build a logger that writes to both a log file and the console."""
    logger = logging.getLogger()
    if not logger.handlers:
        # Output format for this logger
        formatter = logging.Formatter('%(asctime)s %(levelname)-8s: %(message)s')
        # Log file
        file_handler = logging.FileHandler("test.log")
        file_handler.setFormatter(formatter)  # setFormatter specifies the output format
        # Console log
        console_handler = logging.StreamHandler(sys.stdout)
        console_handler.formatter = formatter  # assigning the formatter attribute directly also works
        # Attach both handlers to the logger
        logger.addHandler(file_handler)
        logger.addHandler(console_handler)
        # Minimum output level of the log; the default is WARNING
        logger.setLevel(logging.INFO)
    return logger


def get_md5(filename):
    """Return the MD5 hex digest of a file's contents."""
    m = hashlib.md5()
    with open(filename, "rb") as mfile:
        m.update(mfile.read())
    return m.hexdigest()


def get_urllist():
    # Replace this with your own folder path
    base = "F:\\pythonFile\\ Fried egg net \\ Boring figure \\jpg\\"
    file_names = os.listdir(base)
    url_list = []
    for name in file_names:
        url_list.append(os.path.join(base, name))
    return url_list


if __name__ == '__main__':
    log = get_logger()
    md5_list = []
    url_list = get_urllist()
    for path in url_list:
        md5 = get_md5(path)
        if md5 in md5_list:
            # This digest has been seen before, so the file is a duplicate
            os.remove(path)
            log.info("Duplicate: %s" % path)
        else:
            md5_list.append(md5)
    print("A total of %s photos" % len(md5_list))
We can then use the log to see which files were duplicates. Note that get_md5() reads the whole file into memory at once, which is wasteful for very large files, but it works fine for small images like these. Just replace my path with yours and the script will run on your computer.
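For very large files, a chunked variant of get_md5 keeps memory use constant by feeding the hash in pieces rather than reading the whole file at once (a sketch; the function name and chunk size are my choices):

```python
import hashlib

def get_md5_chunked(filename, chunk_size=8192):
    # Feed the file to the hash in fixed-size chunks instead of reading
    # it all into memory; the resulting digest is identical.
    m = hashlib.md5()
    with open(filename, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            m.update(chunk)
    return m.hexdigest()
```

This drops straight into the script above in place of get_md5 with no other changes.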