An analysis of reading binary files in C#

  • 2020-12-19 21:09:59
  • OfStack

This article looks in detail at how to read binary files in C#. It is shared here for reference; the analysis follows.

It would be convenient if every file had been converted to XML, but that is not the case. Plenty of files are not XML, or even ASCII. Binary files are still sent across networks, stored on disk, and passed between applications, and for these jobs they are more efficient than text files.

In C and C++, reading binary files is easy. Apart from one wrinkle, the translation of carriage return and line feed characters in text mode, every file C/C++ reads is a binary file. In fact, C/C++ only knows about binary files and how to make a binary file look like a text file. As the languages we use become more abstract, they can no longer read the files we create directly and easily, because they want to process data in their own way.

The problem:

In many areas of computing, C and C++ programs still store and read data directly according to the layout of their data structures. In C and C++ it is very easy to read and write files this way. In C, you call fwrite() with a pointer to your data, the size of one item, the number of items, and the file handle. The data is then written to the file directly in binary form.

If data was written this way and you know the data structure, reading it back is just as easy. You call fread() with a pointer to a buffer, the size of each item, the number of items to read, and the file handle, and fread() does the rest. The data is back in memory at once. There is no parsing and no object model; the file is simply read straight into memory.

In C and C++, the two biggest issues are data alignment (structure alignment) and byte swapping. Data alignment means the compiler sometimes leaves unused bytes between the fields of a structure, because accessing a misaligned field is not optimal for the processor: as a rule of thumb it can take about twice as long as an aligned access, and it may take more instructions. The compiler therefore optimizes for speed by skipping bytes or rearranging the data. Byte swapping, on the other hand, is the reordering of the bytes within a value, needed because different processors order them differently.

Data alignment

Because processors handle several bytes at once (in one clock cycle), they expect the data to be laid out in a particular way. Most Intel processors prefer a 32-bit integer to start at an address divisible by 4. An integer stored at any other address is slower to access, and on some processors the access fails outright. Compilers know this, so when they meet a piece of data that would cause the problem, they have three options.

First, they can insert unused padding bytes so that the integer starts at an address divisible by 4. This is the most common approach. Second, they can reorder the fields so that the integers fall on 4-byte boundaries. This is used less often because it creates other interesting problems. Third, they can leave the integers off the 4-byte boundary and generate code that copies them to a properly aligned location before use. This takes a little extra time, but it is useful when the data has to be packed tightly.
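
The first option, padding, can be observed from C# itself. The following is a minimal sketch, not code from the original article: the struct names PaddedRecord and PackedRecord are made up for illustration, and the sizes shown are those of the marshaled (unmanaged) layout reported by Marshal.SizeOf.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical record with default packing: three padding bytes are inserted
// after Tag so that Value starts on a 4-byte boundary.
[StructLayout(LayoutKind.Sequential)]
public struct PaddedRecord
{
    public byte Tag;
    public int Value;
}

// The same fields with Pack = 1: no padding, Value follows Tag directly.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct PackedRecord
{
    public byte Tag;
    public int Value;
}

public static class AlignmentDemo
{
    public static int SizeOfPadded() { return Marshal.SizeOf(typeof(PaddedRecord)); }

    public static int SizeOfPacked() { return Marshal.SizeOf(typeof(PackedRecord)); }

    // Byte offset of the Value field inside the marshaled layout of type t.
    public static int OffsetOfValue(Type t) { return (int)Marshal.OffsetOf(t, "Value"); }

    public static void Main()
    {
        Console.WriteLine(SizeOfPadded());                       // 8
        Console.WriteLine(OffsetOfValue(typeof(PaddedRecord)));  // 4
        Console.WriteLine(SizeOfPacked());                       // 5
        Console.WriteLine(OffsetOfValue(typeof(PackedRecord)));  // 1
    }
}
```

A file written with one layout and read with the other would place Value three bytes away from where the reader expects it, which is why both sides must agree on the packing.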

Most of this is compiler detail, so don't worry about it too much. It is not a problem if the program that writes the data and the program that reads it are built with the same compiler and the same settings: the compiler treats the same data the same way, and everything works. But when files move between platforms, all the data has to be laid out in an agreed way so that it can be converted. Some programmers also know how to tell the compiler to leave their data layout alone.

Byte swapping: big endian and little endian

Big endian and little endian are two different ways of storing integers in a computer. Because an integer takes more than one byte, the question is whether the most significant byte is read or written first. The least significant byte is the one that changes most often: if you keep adding 1 to an integer, the least significant byte changes 256 times for every one change of the next byte up.

Different processors store integers in different ways. Intel processors store integers little endian, that is, the low-order byte is read and written first. Many other processors store integers big endian. So when binary files are read and written on different platforms, you may have to reorder the bytes to get the correct value.
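
The reordering can be sketched in C# with BitConverter, which reports the byte order of the current machine. This is an illustrative sketch only; the class and method names (ByteOrderDemo, ToBigEndian, FromBigEndian) are made up here and do not come from the article.

```csharp
using System;

public static class ByteOrderDemo
{
    // Returns the four bytes of value with the most significant byte first,
    // whatever the byte order of the machine running the code.
    public static byte[] ToBigEndian(uint value)
    {
        byte[] bytes = BitConverter.GetBytes(value);  // machine byte order
        if (BitConverter.IsLittleEndian)
        {
            Array.Reverse(bytes);
        }
        return bytes;
    }

    // Rebuilds the value from four big-endian bytes.
    public static uint FromBigEndian(byte[] bytes)
    {
        byte[] copy = (byte[])bytes.Clone();          // leave the caller's array untouched
        if (BitConverter.IsLittleEndian)
        {
            Array.Reverse(copy);
        }
        return BitConverter.ToUInt32(copy, 0);
    }

    public static void Main()
    {
        byte[] bigEndian = ToBigEndian(0x11223344);
        Console.WriteLine(BitConverter.ToString(bigEndian));      // 11-22-33-44
        Console.WriteLine(FromBigEndian(bigEndian) == 0x11223344); // True
    }
}
```

Checking BitConverter.IsLittleEndian rather than always reversing keeps the helpers correct on both kinds of processor.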

UNIX raises a particular problem because it runs on Sun SPARC, HP, IBM PowerPC, Intel, and other processors. Moving from one processor to another may mean that the byte order of every such variable has to be flipped to match what the new processor expects.

Working with binary files in C#

Working with binary files in C# presents two additional challenges. The first is that all .NET languages are strongly typed, so you have to convert the stream of bytes in the file into the data types you want. The second is that some data types are more complex than they appear and need some kind of conversion.
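
The first challenge, turning a byte stream into typed values, is what the standard BinaryWriter and BinaryReader classes do one field at a time. The sketch below is an illustration rather than the article's own method; the record fields (id, price, name) and the class name TypedReadDemo are invented for the example.

```csharp
using System;
using System.IO;
using System.Text;

public static class TypedReadDemo
{
    // Writes three fields with one typed call per field and returns the raw bytes.
    public static byte[] WriteRecord(int id, double price, string name)
    {
        var stream = new MemoryStream();
        using (var writer = new BinaryWriter(stream, Encoding.UTF8))
        {
            writer.Write(id);     // 4 bytes, little endian
            writer.Write(price);  // 8 bytes
            writer.Write(name);   // length prefix followed by the UTF-8 bytes
        }
        return stream.ToArray();  // ToArray still works after the stream is closed
    }

    // Reads the fields back in the same order and with the same types.
    public static void ReadRecord(byte[] data, out int id, out double price, out string name)
    {
        using (var reader = new BinaryReader(new MemoryStream(data), Encoding.UTF8))
        {
            id = reader.ReadInt32();
            price = reader.ReadDouble();
            name = reader.ReadString();
        }
    }

    public static void Main()
    {
        byte[] data = WriteRecord(42, 9.5, "widget");
        Console.WriteLine(data.Length);  // 19 = 4 + 8 + 1 + 6

        int id;
        double price;
        string name;
        ReadRecord(data, out id, out price, out name);
        Console.WriteLine(id + " " + name);  // 42 widget
    }
}
```

This works well for files your own program defines. It does not help with a file whose layout was fixed by a C structure, which is the case the rest of the article deals with.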

Type breaking

Because .NET languages, including C#, are strongly typed, you cannot just read an arbitrary run of bytes from a file, drop it into a data structure, and expect it to work. When you want to break the type rules, you first read the number of bytes you need into a byte array and then copy them, from start to finish, into the data structure.

If you search Usenet, you will find several sets of routines in the microsoft.public.dotnet hierarchy that convert any object into a series of bytes and back again. They can be found in Listing A.
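
Listing A is not reproduced here. As a rough idea of what such routines can look like, the following sketch copies a structure to a byte array and back through unmanaged memory using the Marshal class. It is an assumption about the general technique, not the Listing A code, and the names FileHeader and RawSerializer are hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical fixed-size header, packed so that it occupies exactly 8 bytes.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct FileHeader
{
    public uint Magic;
    public ushort Version;
    public ushort Flags;
}

public static class RawSerializer
{
    // Copies the marshaled layout of a structure into a new byte array.
    public static byte[] ToBytes<T>(T value) where T : struct
    {
        int size = Marshal.SizeOf(typeof(T));
        byte[] buffer = new byte[size];
        IntPtr ptr = Marshal.AllocHGlobal(size);
        try
        {
            Marshal.StructureToPtr(value, ptr, false);
            Marshal.Copy(ptr, buffer, 0, size);
        }
        finally
        {
            Marshal.FreeHGlobal(ptr);
        }
        return buffer;
    }

    // Rebuilds a structure from the first Marshal.SizeOf(T) bytes of the buffer.
    public static T FromBytes<T>(byte[] buffer) where T : struct
    {
        int size = Marshal.SizeOf(typeof(T));
        if (buffer.Length < size)
        {
            throw new ArgumentException("Buffer is smaller than the structure.", "buffer");
        }
        IntPtr ptr = Marshal.AllocHGlobal(size);
        try
        {
            Marshal.Copy(buffer, 0, ptr, size);
            return (T)Marshal.PtrToStructure(ptr, typeof(T));
        }
        finally
        {
            Marshal.FreeHGlobal(ptr);
        }
    }

    public static void Main()
    {
        var header = new FileHeader { Magic = 0x11223344, Version = 2, Flags = 1 };
        byte[] raw = ToBytes(header);
        Console.WriteLine(raw.Length);  // 8

        FileHeader copy = FromBytes<FileHeader>(raw);
        Console.WriteLine(copy.Version);  // 2
    }
}
```

In practice the byte array would come from a FileStream or BinaryReader.ReadBytes call rather than from ToBytes. The bytes are produced in the byte order of the machine running the code.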

Complex data types

In C++, it is clear what is an object, what is an array, and what is neither. In C#, things are not as simple as they seem. A string is an object, and so is an array. Because C# has no raw arrays in the C sense and many objects have no fixed size, some complex data types do not map directly onto fixed-size binary data.

Fortunately, .NET provides a way around this. You can tell C# how you want strings and other array-like types to be laid out, using the MarshalAs attribute. Here is an example for a string in C#. The attribute must be placed immediately before the field it controls:

[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 50)]

The size of the string you want to read from or store in the binary file determines the value of SizeConst, which in turn fixes the maximum length of the string. Because the marshaler reserves room for a terminating null character, the longest string that fits is one character shorter than SizeConst.

Solving the earlier problems

Now you have seen how the problems introduced by .NET are solved. Next, you will see how easily .NET deals with the binary file problems described earlier.

Packing (pack)

There is no need to configure the compiler to control how data is laid out. You can use the StructLayout attribute to arrange or pack the data as you wish, which is useful when different structures need to be packed in different ways. It is like loading a car: you decide whether to fit each item in tightly or just toss things in, as long as they can be taken out again. The StructLayout attribute is used like this:

[StructLayout(LayoutKind.Sequential, Pack = 1)]

This makes the data ignore alignment boundaries and stay as tightly packed as possible. Apply this attribute to every structure you read from a binary file, and make sure the layout used when writing the file is the same as the layout used when reading it.
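
Putting the two attributes together, a fixed-size record containing a string might be declared as in the sketch below. The record (CustomerRecord) and helper class (FixedStringDemo) are hypothetical, and CharSet.Ansi is spelled out as an assumption so that each character of the string occupies one byte.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical record: a 4-byte integer followed by a 50-byte string field.
[StructLayout(LayoutKind.Sequential, Pack = 1, CharSet = CharSet.Ansi)]
public struct CustomerRecord
{
    public int Id;

    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 50)]
    public string Name;
}

public static class FixedStringDemo
{
    // Size of the record as it would appear in the file.
    public static int RecordSize()
    {
        return Marshal.SizeOf(typeof(CustomerRecord));
    }

    // Pushes a record through its unmanaged layout and back again.
    public static CustomerRecord RoundTrip(CustomerRecord record)
    {
        IntPtr ptr = Marshal.AllocHGlobal(RecordSize());
        try
        {
            Marshal.StructureToPtr(record, ptr, false);
            return (CustomerRecord)Marshal.PtrToStructure(ptr, typeof(CustomerRecord));
        }
        finally
        {
            Marshal.FreeHGlobal(ptr);
        }
    }

    public static void Main()
    {
        Console.WriteLine(RecordSize());  // 54 = 4 + 50

        var original = new CustomerRecord { Id = 7, Name = "Ada" };
        CustomerRecord copy = RoundTrip(original);
        Console.WriteLine(copy.Id + " " + copy.Name);  // 7 Ada
    }
}
```

Without Pack = 1 the same structure would be padded to a multiple of 4 bytes, so a file written by a program that packs its data would no longer line up with it.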

You may find that even adding this attribute does not solve the problem completely, and in some cases you will have to go through tedious trial and error. The cause is that different computers and compilers behave differently at the binary level, so binary data needs great care, especially across platforms. .NET is a good tool for working with other programs' binary files, but it is not a perfect one.

Endian flipping

One of the classic problems with reading and writing binary files is that some computers store the least significant byte first (Intel, for example), while others store the most significant byte first. In C and C++ you have to handle this by hand, flipping field by field. One advantage of the .NET Framework is that code can access type metadata at run time, so you can read that information and use it to fix the byte order of every field in a structure automatically. The source code is in Listing B, so you can see how it works.

Once you know the type of the object, you can get each of its fields and check whether it is a 16-bit or 32-bit unsigned integer. In either case, you can reverse the order of its bytes without damaging the data.
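
Listing B is not reproduced here. The idea can be sketched with reflection as follows; this is an illustration under stated assumptions, not the Listing B code, and the names EndianFlipper, FlipFields, and Packet are hypothetical. Only ushort and uint fields are flipped, matching the description above.

```csharp
using System;
using System.Reflection;

// Hypothetical structure: two unsigned fields that get flipped, one int that does not.
public struct Packet
{
    public ushort Length;
    public uint Sequence;
    public int Untouched;
}

public static class EndianFlipper
{
    // Reverses the two bytes of a 16-bit value.
    public static ushort Swap(ushort v)
    {
        return (ushort)(((v >> 8) | (v << 8)) & 0xFFFF);
    }

    // Reverses the four bytes of a 32-bit value.
    public static uint Swap(uint v)
    {
        return (v >> 24) | ((v >> 8) & 0x0000FF00u) | ((v << 8) & 0x00FF0000u) | (v << 24);
    }

    // Returns a copy of value with every ushort and uint field byte-swapped.
    public static T FlipFields<T>(T value) where T : struct
    {
        object boxed = value;  // box once so SetValue changes this copy
        BindingFlags flags = BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic;
        foreach (FieldInfo field in typeof(T).GetFields(flags))
        {
            if (field.FieldType == typeof(ushort))
            {
                field.SetValue(boxed, Swap((ushort)field.GetValue(boxed)));
            }
            else if (field.FieldType == typeof(uint))
            {
                field.SetValue(boxed, Swap((uint)field.GetValue(boxed)));
            }
            // Every other field type, including string, is left as it is.
        }
        return (T)boxed;
    }

    public static void Main()
    {
        var packet = new Packet { Length = 0x0102, Sequence = 0x01020304, Untouched = 7 };
        Packet flipped = FlipFields(packet);
        Console.WriteLine(flipped.Length.ToString("X4"));    // 0201
        Console.WriteLine(flipped.Sequence.ToString("X8"));  // 04030201
        Console.WriteLine(flipped.Untouched);                // 7
    }
}
```

Because the field list comes from metadata, the same FlipFields call works for any structure without writing per-field code, which is the advantage over the manual C and C++ approach.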

Note: you do not have to flip everything. Endianness does not affect strings, so the flip code leaves string fields alone. You only need to handle unsigned integers, because negative numbers are not represented the same way on every system: they can be stored in one's complement, but two's complement is more common. This makes negative numbers a little harder to move across platforms. Fortunately, negative numbers are rarely used in binary files.

Similarly, floating-point numbers are not always represented in a standard way. Although most systems use the IEEE format, a few older systems use other floating-point formats.

Overcoming the difficulties

Although C# still has some problems here, you can use it to read binary files. In fact, the way C# can inspect objects through metadata makes it a better language for reading binary files, because it can solve the byte-swapping problem for a whole structure automatically.

Hopefully this article has helped you with your C# programming.

