The fastest way in .NET Core to compare the contents of two files for equality

  • 2021-11-10 09:16:10
  • OfStack

Preface

Recently, a project of mine needed to compare whether the contents of two files of arbitrary size are identical. The requirements were as follows:

  • The project is .NET Core, so the comparison method must be written in C#
  • The file size is arbitrary, so the whole file cannot be read into memory for comparison (more professionally: a streaming, non-cached comparison is needed)
  • No third-party library dependencies
  • As fast as possible

To choose the best solution, I built a simple command-line project and prepared two identical files of 912 MB each. The code of the project's Main method is at the end of this article.

Let's try various comparison methods to choose the best solution:

To compare whether two files are identical, the first thought is to use a hash algorithm (such as MD5 or SHA) to compute the hash of each file and then compare the results.

Enough talk; let's roll up our sleeves and write an MD5 comparison method:


/// <summary>
/// Compare by computing the MD5 hash of each file
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <returns></returns>
private static bool CompareByMD5(string file1, string file2)
{
    // Use the MD5 implementation built into .NET
    using (var md5 = MD5.Create())
    {
        byte[] one, two;
        using (var fs1 = File.Open(file1, FileMode.Open))
        {
            // Read the file contents through the FileStream and compute the hash
            one = md5.ComputeHash(fs1);
        }
        using (var fs2 = File.Open(file2, FileMode.Open))
        {
            // Read the file contents through the FileStream and compute the hash
            two = md5.ComputeHash(fs2);
        }
        // Convert the MD5 results (byte arrays) to strings for comparison
        return BitConverter.ToString(one) == BitConverter.ToString(two);
    }
}

Comparison results:

Method: CompareByMD5, Identical: True. Elapsed: 00:00:05.7933178

It took 5.79 seconds, which feels pretty good. But is this the best solution?

In fact, if we think about it carefully, the answer should be no.

Any hash algorithm essentially has to process every byte of input, and that computation takes time.

Many download sites provide the hash of each downloadable file because the source file never changes: the hash only needs to be computed once, and can then be handed to users for verification.

In our requirement, neither file is fixed, so computing the hash of both files on every comparison is not appropriate.

Therefore, we pass on the hash-comparison scheme.

For this kind of find-the-best-algorithm problem, my past experience says: go search Stack Overflow :)

After some digging, I found a very pertinent question: How to compare 2 files fast using .NET?

I took the top-voted answer, adapted the code slightly, and put it into the project:


/// <summary>
/// https://stackoverflow.com/a/1359947
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <returns></returns>
private static bool CompareByToInt64(string file1, string file2)
{
    const int BYTES_TO_READ = sizeof(Int64); // Read 8 bytes at a time
    int iterations = (int)Math.Ceiling((double)new FileInfo(file1).Length / BYTES_TO_READ); // Number of reads needed

    using (FileStream fs1 = File.Open(file1, FileMode.Open))
    using (FileStream fs2 = File.Open(file2, FileMode.Open))
    {
        byte[] one = new byte[BYTES_TO_READ];
        byte[] two = new byte[BYTES_TO_READ];

        for (int i = 0; i < iterations; i++)
        {
            // Read the next chunk of each file into a byte array
            fs1.Read(one, 0, BYTES_TO_READ);
            fs2.Read(two, 0, BYTES_TO_READ);

            // Convert to Int64 and compare numerically
            if (BitConverter.ToInt64(one, 0) != BitConverter.ToInt64(two, 0))
                return false;
        }
    }

    return true;
}

The basic principle of this method is to loop over both files, reading 8 bytes at a time, converting them to Int64, and comparing the values. So how efficient is it?

Method: CompareByToInt64, Identical: True. Elapsed: 00:00:08.0918099

What? Over eight seconds! Slower than MD5? Isn't this the top-voted answer on SO? How did that happen?

In fact, a little analysis reveals the reason: only 8 bytes are read per iteration, so the program performs very frequent I/O operations, which kills performance. It seems even the top-voted answers on SO shouldn't be trusted blindly!

The direction of optimization then becomes reducing the overhead of those I/O operations.

Since 8 bytes at a time is too little, let's define a much larger buffer, say 10 KB. Read that many bytes into the array at a time, and then compare the byte arrays.

But this brings a new problem: how do we quickly compare whether two byte arrays are the same?

The first thing that comes to mind is what was already used in the MD5 method: converting the byte arrays to strings for comparison:


/// <summary>
/// Read into byte arrays for comparison (convert to strings and compare)
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <returns></returns>
private static bool CompareByString(string file1, string file2)
{
    const int BYTES_TO_READ = 1024 * 10;

    using (FileStream fs1 = File.Open(file1, FileMode.Open))
    using (FileStream fs2 = File.Open(file2, FileMode.Open))
    {
        byte[] one = new byte[BYTES_TO_READ];
        byte[] two = new byte[BYTES_TO_READ];
        while (true)
        {
            int len1 = fs1.Read(one, 0, BYTES_TO_READ);
            int len2 = fs2.Read(two, 0, BYTES_TO_READ);
            if (len1 != len2) return false; // Different amount of data left: the files differ in size
            if (BitConverter.ToString(one) != BitConverter.ToString(two)) return false;
            if (len1 == 0) break; // Both files have been read to the end; exit the while loop
        }
    }

    return true;
}

Results:

Method: CompareByString, Identical: True. Elapsed: 00:00:07.8088732

It took nearly 8 seconds, which is not much better than the previous method.

A little analysis shows the reason: the string conversion performed in every loop iteration is very time-consuming. Is there a byte-array comparison method that needs no type conversion?

I thought of SequenceEqual, LINQ's sequence-comparison method, and tried comparing with it:


/// <summary>
/// Read into byte arrays for comparison (compare with LINQ's SequenceEqual)
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <returns></returns>
private static bool CompareBySequenceEqual(string file1, string file2)
{
    const int BYTES_TO_READ = 1024 * 10;

    using (FileStream fs1 = File.Open(file1, FileMode.Open))
    using (FileStream fs2 = File.Open(file2, FileMode.Open))
    {
        byte[] one = new byte[BYTES_TO_READ];
        byte[] two = new byte[BYTES_TO_READ];
        while (true)
        {
            int len1 = fs1.Read(one, 0, BYTES_TO_READ);
            int len2 = fs2.Read(two, 0, BYTES_TO_READ);
            if (len1 != len2) return false; // Different amount of data left: the files differ in size
            if (!one.SequenceEqual(two)) return false;
            if (len1 == 0) break; // Both files have been read to the end; exit the while loop
        }
    }

    return true;
}

Results:

Method: CompareBySequenceEqual, Identical: True. Elapsed: 00:00:08.2174360

Even slower than the first two (in fact, the slowest of all the schemes); LINQ's SequenceEqual does not seem to be built for speed.

So, instead of fancy functionality, how about simply and honestly comparing the byte arrays with a while loop?


/// <summary>
/// Read into byte arrays for comparison (compare the byte arrays in a while loop)
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <returns></returns>
private static bool CompareByByteArry(string file1, string file2)
{
    const int BYTES_TO_READ = 1024 * 10;

    using (FileStream fs1 = File.Open(file1, FileMode.Open))
    using (FileStream fs2 = File.Open(file2, FileMode.Open))
    {
        byte[] one = new byte[BYTES_TO_READ];
        byte[] two = new byte[BYTES_TO_READ];
        while (true)
        {
            int len1 = fs1.Read(one, 0, BYTES_TO_READ);
            int len2 = fs2.Read(two, 0, BYTES_TO_READ);
            if (len1 != len2) return false; // Different amount of data left: the files differ in size
            int index = 0;
            while (index < len1)
            {
                if (one[index] != two[index]) return false;
                index++;
            }
            if (len1 == 0) break; // Both files have been read to the end; exit the while loop
        }
    }

    return true;
}

The result is...

Method: CompareByByteArry, Identical: True. Elapsed: 00:00:01.5356821

1.53 seconds! A big breakthrough! It seems that sometimes the clumsy method works better!

At this point, it takes about 1.5 seconds to compare two 900-plus-MB files. Are you satisfied with this method, dear reader?

No! I'm not satisfied! I believe that with some effort, an even faster way can be found!

Meanwhile, .NET Core itself is being continuously optimized for writing high-performance code.

So, how do we keep optimizing our code?

I suddenly thought of Span<T>, a value type added in C# 7.2 that represents a contiguous region of memory and provides a series of methods for operating on that region.

For our requirement, since we never modify the array contents, we can use its read-only counterpart, ReadOnlySpan<T>, in pursuit of even higher efficiency.
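As a quick standalone illustration (the class and variable names here are my own, not part of the article's project), a byte array converts implicitly to ReadOnlySpan<byte>, and Slice gives a view of part of the memory without copying anything:

```csharp
using System;

class SpanDemo
{
    static void Main()
    {
        byte[] data = { 1, 2, 3, 4, 5 };

        // Implicit conversion: no copy is made, the span points at the array's memory
        ReadOnlySpan<byte> span = data;

        // Slice(2, 2) views the bytes at indexes 2..3, again without copying
        ReadOnlySpan<byte> tail = span.Slice(2, 2);

        Console.WriteLine(tail[0]);     // prints 3
        Console.WriteLine(tail.Length); // prints 2
    }
}
```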

Modify the code to use ReadOnlySpan<T>:


/// <summary>
/// Read into byte arrays for comparison (ReadOnlySpan)
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <returns></returns>
private static bool CompareByReadOnlySpan(string file1, string file2)
{
    const int BYTES_TO_READ = 1024 * 10;

    using (FileStream fs1 = File.Open(file1, FileMode.Open))
    using (FileStream fs2 = File.Open(file2, FileMode.Open))
    {
        byte[] one = new byte[BYTES_TO_READ];
        byte[] two = new byte[BYTES_TO_READ];
        while (true)
        {
            int len1 = fs1.Read(one, 0, BYTES_TO_READ);
            int len2 = fs2.Read(two, 0, BYTES_TO_READ);
            if (len1 != len2) return false; // Different amount of data left: the files differ in size
            // Byte arrays convert directly to ReadOnlySpan<byte>
            if (!((ReadOnlySpan<byte>)one).SequenceEqual(two)) return false;
            if (len1 == 0) break; // Both files have been read to the end; exit the while loop
        }
    }

    return true;
}

The core is the SequenceEqual method of ReadOnlySpan<T>, an extension method on spans. Note that it merely shares a name with LINQ's SequenceEqual; the implementation is completely different.
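To make the distinction concrete (a minimal sketch; the names are mine), two separate arrays with the same bytes are unequal as object references but equal in content via the span-based SequenceEqual:

```csharp
using System;

class SequenceEqualDemo
{
    static void Main()
    {
        byte[] a = { 10, 20, 30 };
        byte[] b = { 10, 20, 30 };

        // == on arrays compares references: these are two distinct objects
        Console.WriteLine(a == b); // prints False

        // The span-based SequenceEqual compares the bytes themselves
        Console.WriteLine(((ReadOnlySpan<byte>)a).SequenceEqual(b)); // prints True
    }
}
```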

So how does this method perform?

Method: CompareByReadOnlySpan, Identical: True. Elapsed: 00:00:00.9287703

Less than a second!

Compared with the previous best result, the speed increased by almost 40%!

I am personally very satisfied with this result. If you have a faster way, please leave a comment; you are very welcome!

As for the Span<T> struct type, if you are interested, you can browse this article, which has a very detailed introduction.

Postscript

The code in this article is experimental in nature; in a real application the details can still be optimized, for example:

  • If the two files differ in size, return false directly
  • If the two file paths are identical, return true directly
  • ...
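A minimal sketch of those cheap pre-checks (the class name, method name, and nullable-bool contract are my own illustration, not the article's code): bail out before reading any content when the answer is already known.

```csharp
using System;
using System.IO;

static class FileComparePrechecks
{
    // Returns true/false when the answer is already decided by metadata,
    // or null when the caller must fall back to a content comparison.
    public static bool? TryCompareCheap(string file1, string file2)
    {
        // Same path: trivially identical.
        // (Exact match only; path case sensitivity is platform-dependent.)
        if (Path.GetFullPath(file1) == Path.GetFullPath(file2))
            return true;

        // Different length: the contents cannot be identical
        if (new FileInfo(file1).Length != new FileInfo(file2).Length)
            return false;

        return null;
    }
}
```

Returning bool? lets the caller distinguish "answer already known" from "a content comparison is still needed".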

Source code of the test project's Main method:


static void Main(string[] args)
{
    string file1 = @"C:\Users\WAKU\Desktop\file1.ISO";
    string file2 = @"C:\Users\WAKU\Desktop\file2.ISO";
    var methods = new Func<string, string, bool>[] { CompareByMD5, CompareByToInt64, CompareByString, CompareBySequenceEqual, CompareByByteArry, CompareByReadOnlySpan };
    foreach (var method in methods)
    {
        var sw = Stopwatch.StartNew();
        bool identical = method(file1, file2);
        Console.WriteLine("Method: {0}, Identical: {1}. Elapsed: {2}", method.Method.Name, identical, sw.Elapsed);
    }
}
