File details that identify the same content on Linux

  • 2021-01-19 22:41:42
  • OfStack

Preface

Duplicate copies of files can be a huge waste of hard disk space and can cause trouble when you want to update your files. Here are six commands to help you identify them.

In a recent post, we looked at how to identify and locate hard-linked files (that is, files that point to the same disk content and share an inode). In this article, we'll look at commands for finding files that have the same content but are not linked to each other.

Hard links are useful because they allow a file to appear in multiple places in the file system without taking up additional disk space. On the other hand, full copies of a file can be a huge waste of hard disk space and can cause confusion when you want to update the file. In this article, we'll look at several ways to identify these files.

Compare files with the diff command

Probably the easiest way to compare two files is to use the diff command. The output will show you the differences between your files. The < and > signs indicate whether the extra lines are in the first (<) or the second (>) file given as an argument. In this example, the extra lines are in backup.html.


$ diff index.html backup.html
2438a2439,2441
> <pre>
> That's all there is to report.
> </pre>

If diff produces no output, it means the two files are the same.


$ diff home.html index.html
$
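Since diff also reports its result through its exit status (0 when the files match), a quick check like this works well in scripts:

$ diff -q home.html index.html && echo "identical"
identical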

The only drawback to diff is that it can only compare two files at a time, and you must specify which files to compare. Some of the commands in this post can find multiple duplicate files for you.

Using a checksum

The cksum (checksum) command computes the checksums of files. A checksum is a mathematical reduction of a file's contents into a long number (for example, 2819078353). While checksums are not absolutely unique, the chance that files with different contents will end up with the same checksum is extremely small.


$ cksum *.html
2819078353 228029 backup.html
4073570409 227985 home.html
4073570409 227985 index.html

In the example above, you can see that the second and third files yield the same checksum, so they can be assumed to be identical.
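If you are checking more than a handful of files, you can let the shell pick out the repeated checksums for you; a small sketch, assuming the standard awk, sort, and uniq tools:

$ cksum *.html | awk '{print $1}' | sort | uniq -d
4073570409

Any checksum printed by uniq -d appears more than once, so the corresponding files are almost certainly duplicates.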

Use the find command

Although the find command does not have an option for finding duplicate files, it can still be used to find files by name or type and run the cksum command on each of them. For example:


$ find . -name "*.html" -exec cksum {} \;
4073570409 227985 ./home.html
2819078353 228029 ./backup.html
4073570409 227985 ./index.html
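Sorting that output numerically lines up identical checksums next to each other, which makes the duplicates easy to spot. Here the -exec ... {} + form simply hands the whole batch of files to cksum at once:

$ find . -name "*.html" -exec cksum {} + | sort -n
2819078353 228029 ./backup.html
4073570409 227985 ./home.html
4073570409 227985 ./index.html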

Use the fslint command

The fslint command can be used specifically to find duplicate files. Note that we give it a starting location, and that it can take a while to complete if it has to traverse a large number of files. Notice how it lists the duplicate files while also reporting other problems, such as empty directories and bad IDs.


$ fslint .
-----------------------------------file name lint
-------------------------------Invalid utf8 names
-----------------------------------file case lint
----------------------------------DUPlicate files  <==
home.html
index.html
-----------------------------------Dangling links
--------------------redundant characters in links
------------------------------------suspect links
--------------------------------Empty Directories
./.gnupg
----------------------------------Temporary Files
----------------------duplicate/conflicting Names
------------------------------------------Bad ids
-------------------------Non Stripped executables

You may need to install fslint on your system, and you may also need to add it to your command search path:


$ export PATH=$PATH:/usr/share/fslint/fslint
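If all you want is the duplicate-file report, the fslint package also installs a findup script in that same directory (assuming a standard fslint installation), and it can be run on its own:

$ findup .
home.html
index.html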

Use the rdfind command

The rdfind command will also look for duplicate files. Its name stands for "redundant data find," and one of its benefits is that it can determine which file is likely the original based on file dates, which is useful if you choose to delete the copies, since it will remove the newer files.


$ rdfind ~
Now scanning "/home/shark", found 12 files.
Now have 12 files in total.
Removed 1 files due to nonunique device and inode.
Total size is 699498 bytes or 683 KiB
Removed 9 files due to unique sizes from list.2 files left.
Now eliminating candidates based on first bytes:removed 0 files from list.2 files left.
Now eliminating candidates based on last bytes:removed 0 files from list.2 files left.
Now eliminating candidates based on sha1 checksum:removed 0 files from list.2 files left.
It seems like you have 2 files that are not unique
Totally, 223 KiB can be reduced.
Now making results file results.txt

You can run this command in dryrun mode (in other words, it will only report the changes that it would otherwise make).


$ rdfind -dryrun true ~
(DRYRUN MODE) Now scanning "/home/shark", found 12 files.
(DRYRUN MODE) Now have 12 files in total.
(DRYRUN MODE) Removed 1 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 699352 bytes or 683 KiB
Removed 9 files due to unique sizes from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes:removed 0 files from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes:removed 0 files from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum:removed 0 files from list.2 files left.
(DRYRUN MODE) It seems like you have 2 files that are not unique
(DRYRUN MODE) Totally, 223 KiB can be reduced.
(DRYRUN MODE) Now making results file results.txt

The rdfind command also provides options such as ignoring empty files (-ignoreempty) and following symbolic links (-followsymlinks). See the man page for explanations of these and the other options listed below; a short example follows the list.


-ignoreempty    ignore empty files
-minsize    ignore files smaller than specified size
-followsymlinks   follow symbolic links
-removeidentinode  remove files referring to identical inode
-checksum    identify checksum type to be used
-deterministic   determines how to sort files
-makesymlinks    turn duplicate files into symbolic links
-makehardlinks   replace duplicate files with hard links
-makeresultsfile  create a results file in the current directory
-outputname   provide name for results file
-deleteduplicates  delete/unlink duplicate files
-sleep     set sleep time between reading files (milliseconds)
-n, -dryrun   display what would have been done, but don't do it
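For example, if you want to reclaim the space without deleting anything, the -makehardlinks option listed above replaces each duplicate with a hard link to the surviving copy; running a dryrun pass first is a sensible precaution:

$ rdfind -dryrun true -makehardlinks true .
$ rdfind -makehardlinks true .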

Note that the rdfind command offers a -deleteduplicates true setting to remove the copies. Hopefully the command's small problem with grammar won't annoy you. ;-)


$ rdfind -deleteduplicates true .
...
Deleted 1 files.  <==

You will probably need to install the rdfind command on your system. It's probably a good idea to experiment with it to get familiar with how to use it.

Use the fdupes command

The fdupes command also makes it easy to identify duplicate files, and it provides a number of useful options, such as -r for recursion. In its simplest form, it groups duplicate files together like this:


$ fdupes .
./home.html
./index.html

This is an example using recursion (-r) over users' home directories. Note that many of the duplicate files it turns up are important (different users' .bashrc and .profile files, for example) and should not be deleted.


$ fdupes -r /home

Many of the fdupes command's options are listed below. Use the fdupes -h command or read the man page for the full details.


-r --recurse     include files in subdirectories
-s --symlinks    follow symlinked directories
-H --hardlinks   treat hard-linked files as duplicates
-n --noempty     exclude zero-length files
-f --omitfirst   omit the first file in each set of matches
-1 --sameline    list each set of matches on a single line
-S --size        show the size of duplicate files
-m --summarize   summarize duplicate file information
-q --quiet       hide the progress indicator
-d --delete      prompt for which files to keep, deleting the others
-N --noprompt    with --delete, keep the first file in each set without asking
-v --version     display the fdupes version
-h --help        display help
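As a quick illustration, combining -r with -m summarizes how much space your duplicates are consuming, which is a safe first step before considering anything like -d:

$ fdupes -r -m ~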

The fdupes command is another command that you may need to install and use for a while to become familiar with its many options.

Conclusion

Linux systems offer a good collection of tools that can locate and (potentially) remove duplicate files, along with options that let you specify where to search and what to do with the duplicate files you find.

via: https://www.networkworld.com/article/3390204/how-to-identify-same-content-files-on-linux.html#tk.rss_all

Author: Sandra Henry-Stocker | Topic selection: lujun9972 | Translator: tomjlw | Proofreader: wxy

