Python code to remove duplicate files in a directory, optimized step by step


These past two days I had some free time and went trawling Baidu for pictures. Not that many, a few tens of thousands, and quite a few of them are pictures of beautiful women! Ha ha! Here I'll talk about what happened after I got the pictures, not how I got them.
The first problem: some of the images have no suffix. Under Windows a file without a suffix is not recognized correctly; there is no preview, and you have to pick a program every time you open it. Laborious! This is easy to solve by adding a suffix to each image. There were not many suffix-less pictures, fewer than 1000, but changing them one by one would be tedious. Fortunately I study computing, and in one morning I wrote a program to rename them in batch (link: #; a minimal sketch of the idea appears below). With that problem solved I ran into a new one: there are so many pictures that duplicates are inevitable. Some pictures are exactly identical, and there is no need to keep them all, so I want to delete every duplicate.
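For reference, here is a minimal sketch of that kind of batch rename. It is not the original script, and it assumes every suffix-less file in the folder really is a JPEG:

import os

def add_suffix(path, suffix='.jpg'):
    # Append a suffix to every regular file in `path` that has no extension.
    # Assumption: all suffix-less files here are JPEGs.
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full) and not os.path.splitext(name)[1]:
            os.rename(full, full + suffix)

# Example: add_suffix(r'D:\pictures')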

Let's analyze the problem. First, there are so many files that checking them by hand is not practical; besides, it is very hard to spot two identical images among thousands with the naked eye, and if the files were not images but documents that cannot be previewed, telling them apart correctly would be even harder. So it has to be done programmatically. How, then, do we decide programmatically that two files are identical?

Judging by file name is unreliable, because a file name can be changed at will while the content stays the same; and within one folder the operating system will not allow two files with exactly the same name anyway. Another idea is to judge by file size. That helps, but two files of the same size are not necessarily the same picture. Moreover, pictures are generally small, rarely over 3 MB and mostly under 1 MB, so in a folder with a great many files, size collisions are quite likely. Size alone is not reliable either.

Yet another way is to read the content of every image and compare it against the content of every other image: if the contents are equal, the two images must be identical. This sounds perfect, but consider its time and space cost. Comparing each picture against all the others is a double loop; reading is slow and comparing is slower, so the whole thing is very time-consuming. As for memory, reading all the pictures into memory in advance would speed things up, but an ordinary computer's memory is limited; with several gigabytes of pictures that is simply not practical. If we do not keep everything in memory, then the file contents have to be re-read from disk before every comparison, and disk reads are slow, so that is no good either.

Is there a better way? I racked my brain and finally came up with MD5. What is MD5? You don't know? Well, you must be from Mars. Hurry up and duckduckgo it! Isn't MD5 for encryption, you may ask, and what does it have to do with our problem? Good question! MD5 maps input of arbitrary length to a fixed 32-character sequence of hexadecimal digits. Because any tiny change in the input changes the output, the MD5 can be viewed as the 'fingerprint' of the data. Since there are 16^32 possible MD5 strings, the probability that two different inputs produce the same MD5 is vanishingly small, practically zero. By the same token we can compute the MD5 of every file: if several files share one MD5, they are almost certainly identical, and the probability that the MD5s match while the files differ is small enough to ignore. So the plan is: compute the MD5 of each file and decide whether two images are the same by comparing their MD5s. Here is the code, in Python:
 
# -*- coding: cp936 -*-
import md5
import os
from time import clock as now

def getmd5(filename):
    file_txt = open(filename, 'rb').read()
    m = md5.new(file_txt)
    return m.hexdigest()

def main():
    path = raw_input("path: ")
    all_md5 = []
    total_file = 0
    total_delete = 0
    start = now()
    for file in os.listdir(path):
        total_file += 1
        real_path = os.path.join(path, file)
        if os.path.isfile(real_path) == True:
            filemd5 = getmd5(real_path)
            if filemd5 in all_md5:
                total_delete += 1
                print ' delete ', file
            else:
                all_md5.append(filemd5)
    end = now()
    time_last = end - start
    print ' Total number of documents: ', total_file
    print ' Number of deletions: ', total_delete
    print ' Time: ', time_last, ' seconds '

if __name__ == '__main__':
    main()

The principle of the program above is very simple: read each file in turn and compute its MD5. If that MD5 is not yet in the MD5 list, add it; if it is already there, then a file with this MD5 has been seen before, so this picture is redundant and can be deleted.
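A side note for anyone running this today: the md5 module used above exists only in Python 2. On a current interpreter the same fingerprint is computed with hashlib, and reading the file in chunks keeps memory use bounded even for large files. A possible replacement for getmd5 (a sketch, not part of the original program):

import hashlib

def getmd5(filename, chunk_size=1024 * 1024):
    # Compute the MD5 hex digest of a file, reading it chunk by chunk
    # so the whole file never has to sit in memory at once.
    m = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            m.update(chunk)
    return m.hexdigest()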
Here is a screenshot of the program in action: [screenshot of the first run]
We can see that there are 8,674 files under this folder, 31 of which are duplicates, and finding all of them takes 155.5 seconds. Not very efficient; can it be optimized? Analyzing the program, two operations take most of the time: computing the MD5 of every file, which accounts for the bulk of it, and checking whether an MD5 already exists in the list, which is also costly. These are the two angles from which we can optimize further.
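To see why the lookup matters, here is a small self-contained comparison of a membership test on a list versus on a set of fake digests (illustrative only; the absolute numbers depend on the machine):

import timeit

setup = """
import hashlib
digests = [hashlib.md5(str(i).encode()).hexdigest() for i in range(10000)]
as_list = list(digests)
as_set = set(digests)
probe = hashlib.md5(b'not-present').hexdigest()
"""

# Each probe scans the whole list (O(n)) ...
print(timeit.timeit('probe in as_list', setup=setup, number=1000))
# ... while the set answers with a single hash lookup (O(1) on average).
print(timeit.timeit('probe in as_set', setup=setup, number=1000))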

First I want to solve the lookup problem. We could keep the list sorted so it can be searched faster, but the list keeps growing, and re-sorting it every time would be inefficient. So I thought of using a dictionary. The defining feature of a dictionary is that each key maps to a value; we can use the MD5 as the key, and the value itself is not needed. A dictionary is backed by a hash table, so checking whether a key exists is, on average, a constant-time operation rather than a linear scan through a list, which makes the lookup much faster. With a dictionary we can simply test whether an MD5 value is already among the keys. Here is the improved code:
 
# -*- coding: cp936 -*-
import md5
import os
from time import clock as now

def getmd5(filename):
    file_txt = open(filename, 'rb').read()
    m = md5.new(file_txt)
    return m.hexdigest()

def main():
    path = raw_input("path: ")
    all_md5 = {}
    total_file = 0
    total_delete = 0
    start = now()
    for file in os.listdir(path):
        total_file += 1
        real_path = os.path.join(path, file)
        if os.path.isfile(real_path) == True:
            filemd5 = getmd5(real_path)
            # membership test on the dict itself: an average O(1) hash lookup
            if filemd5 in all_md5:
                total_delete += 1
                print ' delete ', file
            else:
                all_md5[filemd5] = ''
    end = now()
    time_last = end - start
    print ' Total number of documents: ', total_file
    print ' Number of deletions: ', total_delete
    print ' Time: ', time_last, ' seconds '

if __name__ == '__main__':
    main()

[Screenshot of the second run]
In terms of time it is faster than before, but still not ideal. What else can be optimized? The MD5! The program above computes an MD5 for every file, and that is what costs the most. Does every file really need an MD5? Can we reduce the number of MD5 computations? I came up with a method. In the analysis above we noted that comparing file sizes is fast but not accurate, while MD5 is accurate; can the two be combined? Yes. We can rely on this: if two files are exactly the same, then both their sizes and their MD5s must match, and if their sizes differ, the files must be different. So we only need to check whether a file's size already exists in a size dictionary. If it does not, we add it. If it does, then there are at least two files of the same size, and only then do we compute MD5s for the files of that size: if two MD5s match, the files are definitely identical and one can be deleted; if they differ, we add the new MD5 to that size's entry so it is not computed again later. The specific code is as follows:
 
# -*- coding: cp936 -*-
import md5
import os
from time import clock as now

def getmd5(filename):
    file_txt = open(filename, 'rb').read()
    m = md5.new(file_txt)
    return m.hexdigest()

def main():
    path = raw_input("path: ")
    all_md5 = {}
    all_size = {}
    total_file = 0
    total_delete = 0
    start = now()
    for file in os.listdir(path):
        total_file += 1
        real_path = os.path.join(path, file)
        if os.path.isfile(real_path) == True:
            size = os.stat(real_path).st_size
            name_and_md5 = [real_path, '']
            if size in all_size:
                new_md5 = getmd5(real_path)
                if all_size[size][1] == '':
                    all_size[size][1] = getmd5(all_size[size][0])
                if new_md5 in all_size[size]:
                    print ' delete ', file
                    total_delete += 1
                else:
                    all_size[size].append(new_md5)
            else:
                all_size[size] = name_and_md5
    end = now()
    time_last = end - start
    print ' Total number of documents: ', total_file
    print ' Number of deletions: ', total_delete
    print ' Time: ', time_last, ' seconds '

if __name__ == '__main__':
    main()

What about time efficiency? See below:

[Screenshot of the third run]

Only 7.28 seconds! More than ten times faster than either of the first two versions! This time is acceptable.

Algorithms are a magical thing; even casual use can bring unexpected rewards! The code above can be optimized further, for example with a better search strategy, and readers with ideas are welcome to get in touch. Rewriting it in C might be faster still. Oh, but I do love the simplicity of Python!
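For readers who want to experiment further, here is one possible refinement of the same size-then-MD5 idea in modern Python (a sketch under my own assumptions, not the original program): group files by size with a defaultdict, hash only the groups that contain more than one file, and actually remove the duplicates with os.remove instead of only printing them.

import hashlib
import os
from collections import defaultdict

def getmd5(filename):
    with open(filename, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

def remove_duplicates(path, dry_run=True):
    # Group files by size first; only same-size files can possibly be identical.
    by_size = defaultdict(list)
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            by_size[os.path.getsize(full)].append(full)

    deleted = 0
    for files in by_size.values():
        if len(files) < 2:      # unique size: no hashing needed
            continue
        seen = {}               # md5 -> first file seen with that digest
        for full in files:
            digest = getmd5(full)
            if digest in seen:
                print('delete', full)
                if not dry_run:
                    os.remove(full)
                deleted += 1
            else:
                seen[digest] = full
    return deleted

# Example: remove_duplicates(r'D:\pictures', dry_run=True)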
Blogger ma6174

