Copy on write technique for the standard C++ class string

  • 2020-04-02 01:54:59
  • OfStack

Memory sharing and copy-on-write in the standard C++ class std::string, by Chen Hao

1. Concept
 
Scott Meyers gives an example in More Effective C++, do you remember? When you were in school, your parents wanted you to stop watching TV and review your lessons, so you shut yourself in your room and pretended to study while actually doing something else, such as writing a love letter to a girl in your class. Only when your parents came to your room to check on you did you really pick up the book and read. This is the "delaying tactic": you don't do something until you have to.

Of course, in real life this kind of thing tends to end in an accident, but in the programming world it has become one of the most useful techniques. It is like the C++ feature of declaring variables wherever you need them, which Scott Meyers also recommends: declare a variable (and take up its storage) only when you really need it, so that the program pays the minimum memory cost at run time. Only at that point do we do the time-consuming work of allocating memory, which gives our program better run-time performance. After all, 20 percent of a program accounts for 80 percent of its running time.

This is of course not the only form of the delaying tactic. The technique is widely used, especially in operating systems. When a program finishes running, the operating system is in no hurry to clear it out of memory, because the program may well be run again right away (and loading a program from disk into memory is a very slow process). Only when memory runs short are these resident programs evicted.

Copy-on-write is a product of the programming world's "lazy behavior", that is, procrastination. Take a program that writes files, for example one that keeps writing data received from the network. If every fwrite or fprintf caused a disk I/O operation, the performance loss would be huge. So typically each write goes into a block of memory of a certain size (the disk cache), and the data is written to disk only when we close the file (this is why what you wrote may be lost if the file is never closed). Some systems go further: even closing the file does not write to disk, and the write is deferred until shutdown or until memory runs low. Unix is such a system; if it exits abnormally, the data is lost and the file is damaged.
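The deferred write described above can be sketched with a small in-memory model. This is not how stdio or any operating system implements its cache; BufferedWriter, its capacity parameter and the string that stands in for the disk are hypothetical names chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical model of a write cache: data is collected in memory and only
// pushed to the slow "disk" (here just a string) when the buffer fills up or
// the writer is flushed explicitly.
class BufferedWriter {
public:
    explicit BufferedWriter(std::size_t capacity) : capacity_(capacity) {}

    void write(const std::string& data) {
        buffer_ += data;
        if (buffer_.size() >= capacity_) flush();   // forced: buffer is full
    }

    void flush() {
        if (buffer_.empty()) return;
        sink_ += buffer_;      // the expensive operation, done once per batch
        buffer_.clear();
        ++flushes_;
    }

    const std::string& sink() const { return sink_; }
    int flushes() const { return flushes_; }

private:
    std::size_t capacity_;
    std::string buffer_;
    std::string sink_;
    int flushes_ = 0;
};

// Usage sketch: two small writes stay in memory until flush() is called.
void demo_buffered_writer() {
    BufferedWriter w(16);
    w.write("abc");
    w.write("def");
    assert(w.flushes() == 0);        // nothing has reached the sink yet
    w.flush();
    assert(w.sink() == "abcdef");    // both writes arrive in one batch
}
```

If the model is destroyed before flush() is called, the buffered data never reaches the sink, which mirrors the data loss mentioned above.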

So for the sake of performance we have to take this considerable risk. Fortunately, our programs are never so busy that they forget there is still a block of data to write to disk, so this practice is still well worth it.

2. Copy-on-write in the standard C++ class std::string
 
The string class in the STL (Standard Template Library), which we use all the time, is one such class with copy-on-write. C++ has long been questioned and criticized over performance, and to improve it many classes in the STL use copy-on-write. This lazy behavior does give programs that use the STL better performance.

Here I want to lift the veil on the copy-on-write technique in string from the perspective of C++ classes and design patterns, to give you a reference for designing class libraries in C++.

Before getting into the technique, I want to briefly explain how the string class allocates memory. By convention, the string class has a private member, a char*, that records the address of memory allocated from the heap. The memory is allocated at construction time and freed at destruction time. Because the memory comes from the heap, the string class maintains it very carefully: when it hands out the address, it returns only a const char*, which is read-only.

2.1 Features

Working from the outside in, from the perceptual to the rational, let's first look at the visible behavior of copy-on-write in the string class. Consider the following program:


#include <stdio.h>
#include <string>
using namespace std;

int main()
{
       string str1 = "hello world";
       string str2 = str1;   // copy construction: str2 may share str1's buffer

       printf("Sharing the memory:\n");
       printf("\tstr1's address: %p\n", (const void*)str1.c_str());
       printf("\tstr2's address: %p\n", (const void*)str2.c_str());

       str1[1] = 'q';        // a write: triggers the copy if the buffer is shared
       str2[1] = 'w';

       printf("After Copy-On-Write:\n");
       printf("\tstr1's address: %p\n", (const void*)str1.c_str());
       printf("\tstr2's address: %p\n", (const void*)str2.c_str());

       return 0;
}


This program constructs the second string from the first, prints the memory addresses where the data is stored, then modifies the contents of str1 and str2 in turn and prints the addresses again. The output looks like this (I got the same result with VC6.0 and g++ 2.95):

> g++ -o stringTest stringTest.cpp
> ./stringTest
Sharing the memory:
        str1's address: 343be9
        str2's address: 343be9
After Copy-On-Write:
        str1's address: 3407a9
        str2's address: 343be9


The results show that after the first two statements, str1 and str2 store their data at the same address, and that after the contents are modified, the address of str1 changes while the address of str2 stays the same. This example shows the copy-on-write technique of the string class at work.
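Whether the first two addresses match depends on the library. The result above comes from old compilers; since C++11 the standard's requirements on std::string effectively rule out copy-on-write, so a current compiler will most likely report two different addresses from the start. The sketch below repeats the experiment without assuming either behavior. The helper names shares_buffer and copy_shared_before_write are mine, not part of any library.

```cpp
#include <cassert>
#include <string>

// Returns true when both strings currently report the same buffer address.
bool shares_buffer(const std::string& a, const std::string& b) {
    return a.data() == b.data();
}

// Repeats the experiment; the return value tells whether the copy shared its
// buffer with the original before any write took place.
bool copy_shared_before_write() {
    std::string str1 = "hello world";
    std::string str2 = str1;
    bool shared = shares_buffer(str1, str2);   // true only on a COW library

    str1[1] = 'q';
    str2[1] = 'w';

    // After the writes the contents differ, so the two strings must use
    // distinct buffers on every implementation.
    assert(!shares_buffer(str1, str2));
    assert(str1 == "hqllo world" && str2 == "hwllo world");
    return shared;
}
```

Calling copy_shared_before_write() and printing its result is a quick way to find out whether a given standard library still shares buffers between copies.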

2.2 In depth

Before going deeper, the demonstration above tells us that to copy only on write, the string class has to solve two problems: memory sharing and copy-on-write. These two topics raise a number of questions, so let's study them with these questions in mind:
1.   What is the principle of copy-on-write?
2.   When does the string class share memory?
3.   When does the string class trigger copy-on-write?
4.   What happens during copy-on-write?
5.   How is copy-on-write implemented?

You might say that a look at the string source in the STL gives the answers. Of course; I too consulted the source of basic_string, the template class behind string. But if reading STL source feels like reading machine code, badly dents your confidence in C++, and even makes you doubt whether you know C++ at all, then keep reading this article.

OK, let's discuss one problem at a time, and gradually all the technical details will come out.

2.3 What is the principle of copy-on-write?

Experienced programmers know that copy-on-write must rely on reference counting, and yes, there has to be a variable like RefCnt. When the first object is constructed, the string constructor allocates memory from the heap according to the argument passed in. Each time another object comes to use that memory, the count is incremented. Each time an object is destructed, the count is decremented. When the last object is destructed, RefCnt reaches 1 or 0 (depending on where the count starts), and only then does the program actually free the memory allocated from the heap.

Yes, reference counting is the principle behind copy-on-write in the string class!
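The life cycle of the count can be observed with std::shared_ptr, used here only as a stand-in for a hand-written RefCnt; it is not what the string class of that era used internally. The function name and the freed flag are illustrative assumptions.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Walks through the life cycle of a reference count. Returns true if the
// buffer was released exactly when the last owner went away.
bool last_owner_frees_buffer() {
    bool freed = false;
    std::shared_ptr<std::string> first(
        new std::string("hello world"),
        [&freed](std::string* p) { freed = true; delete p; });
    assert(first.use_count() == 1);        // one owner after construction

    {
        std::shared_ptr<std::string> second = first;   // a second owner
        assert(first.use_count() == 2);    // the count was incremented
    }                                      // second is destroyed here
    assert(first.use_count() == 1);        // the count was decremented
    assert(!freed);                        // one owner left, nothing freed

    first.reset();                         // the last owner lets go
    return freed;
}
```

Note that shared_ptr counts owners starting from 1, so the memory is released when the count drops to 0.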

But where does RefCnt live? If it is stored in the string class, every string instance has its own copy, and there is no way to have a common RefCnt. If it is declared as a global variable or a static member, then all string objects share a single one. So how can this work? Life is a cycle of confusion, exploration, understanding and confusion again. Don't worry, I will come back to this question later.

2.3.1 When does the string class share memory?

The answer should be obvious. By common sense and logic, memory can be shared when one object wants to use another object's data. That makes sense: if you don't use my data, there is nothing to share; sharing happens only when you use mine.

There are two ways to use another object's data: 1) constructing yourself from the other object, and 2) being assigned from the other object. The first case invokes the copy constructor and the second invokes the assignment operator, and we can implement both in the class. In the first case, the copy constructor of the string class only needs a little extra work to increment the reference count. Likewise, in the second case we overload the assignment operator of the string class and add the same small piece of handling.

 
A few words:

1) differences between construction and assignment

Take these two statements from the program above:
            string str1 = "hello world";
            string str2 = str1;

Do not take the "=" here for an assignment. These two statements are in fact equivalent to:

            string str1("hello world");   // calls the constructor
            string str2(str1);            // calls the copy constructor

If str2 is written as follows instead:

            string str2;      // calls the default constructor, which yields an empty string: string str2("");
            str2 = str1;      // calls str2's assignment operator: str2.operator=(str1);

2) another situation
            char tmp[] = "hello world";
            string str1 = tmp;
            string str2 = tmp;
      Does this trigger memory sharing? Logically it should be shared. However, these two declarations with initialization match neither of the two cases described above, so no sharing takes place. Moreover, the existing features of C++ give us no way to make the objects share memory in this situation.
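The difference between construction and assignment in remark 1) can be checked with a small probe type that records which special member function ran. Probe is a hypothetical class written for this illustration only.

```cpp
#include <cassert>
#include <string>

// Hypothetical probe type: remembers the most recent operation that
// constructed or assigned this object.
struct Probe {
    std::string last;

    Probe() : last("default constructor") {}
    Probe(const char*) : last("converting constructor") {}
    Probe(const Probe&) : last("copy constructor") {}
    Probe& operator=(const Probe&) {
        last = "copy assignment";
        return *this;
    }
};

void demo_construction_vs_assignment() {
    Probe a("hello world");    // constructed from a C string
    Probe b = a;               // "=" in a declaration: copy constructor
    Probe c;                   // default constructor
    assert(c.last == "default constructor");
    c = a;                     // "=" on an existing object: copy assignment

    assert(a.last == "converting constructor");
    assert(b.last == "copy constructor");
    assert(c.last == "copy assignment");
}
```

A copy-on-write class hooks the sharing logic into exactly these two places, the copy constructor and the assignment operator, which is why two strings built separately from the same char array in remark 2) never get the chance to share.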

 

2.3.2 When does the string class trigger copy-on-write?

So when does the copy on write happen? Obviously, when the contents of an object that shares a block of memory with others are about to change. Examples are the operators [], =, += and + of the string class, member functions such as insert, replace and append, as well as the destructor.

Copy-on-write is triggered by modifying the data; as long as nothing is modified, nothing is copied. This is the essence of the delaying tactic: don't do it until you have to.

2.3.3 What happens during copy-on-write?

We can use the reference count to decide whether a copy is needed. See the following code:


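if (RefCnt > 0) {                                  // another object still references this memory
    char* tmp = (char*) malloc(strlen(_Ptr) + 1);  // allocate a private block
    strcpy(tmp, _Ptr);                             // copy the shared data into it
    _Ptr = tmp;                                    // from now on, write to the private copy
}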


The code above is a hypothetical copy routine: if another object still references this memory (which the reference count tells us), the object about to change has to make its own copy first.

We can wrap this copy in a function for use by every member function that changes the content.
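One way to picture such a helper is the sketch below. It uses std::shared_ptr for the counting instead of a hand-written RefCnt, so the test use_count() > 1 plays the role of RefCnt > 0 in the fragment above. SharedText, set and detach are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical wrapper: copies share one buffer, writers call detach() first.
class SharedText {
public:
    explicit SharedText(const char* s)
        : data_(std::make_shared<std::string>(s)) {}

    // Every mutating member goes through detach() before it writes.
    void set(std::size_t i, char c) {
        detach();
        (*data_)[i] = c;
    }

    const std::string& str() const { return *data_; }
    bool shares_with(const SharedText& o) const { return data_ == o.data_; }

private:
    // Copy the buffer only if somebody else is still using it.
    void detach() {
        if (data_.use_count() > 1)
            data_ = std::make_shared<std::string>(*data_);
    }

    std::shared_ptr<std::string> data_;
};

void demo_detach() {
    SharedText a("hello world");
    SharedText b = a;              // shares the buffer, no copy yet
    assert(a.shares_with(b));
    b.set(1, 'w');                 // first write: b gets a private copy
    assert(!a.shares_with(b));
    assert(a.str() == "hello world" && b.str() == "hwllo world");
}
```

Keeping the check in one private function means a new mutating member cannot forget it as long as it calls detach() first.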

2.3.4 How is copy-on-write implemented?

This last question is mainly about solving the puzzle raised earlier, a matter of "democracy versus centralism": where should the reference count live? Please look at the following code first:


string h1 = "hello";
string h2 = h1;
string h3;
h3 = h2;

string w1 = "world";
string w2("");
w2 = w1;


Obviously, we want h1, h2 and h3 to share one block of memory, and w1 and w2 to share another. That means maintaining one reference count for h1, h2 and h3, and a separate one for w1 and w2.

How can these two reference counts be created in a clever way? Recall that the memory of the string class is allocated dynamically on the heap. Since the objects that share memory all point to the same memory area, why not allocate a little extra space in that area to hold the reference count? That way, all objects sharing one area of memory have the same reference count, and because the count lives in the shared area itself, every one of them can access it and knows how many references there are to that area.
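The effect of one count per shared block can be seen with std::shared_ptr, which also keeps its count in heap memory shared by all owners. This only illustrates the idea; it does not reproduce the layout of any string implementation.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Mirrors the h1/h2/h3 and w1/w2 example: each group of owners ends up with
// its own count because the count travels with the shared block.
void demo_one_count_per_group() {
    auto h1 = std::make_shared<std::string>("hello");
    auto h2 = h1;
    std::shared_ptr<std::string> h3;
    h3 = h2;

    auto w1 = std::make_shared<std::string>("world");
    std::shared_ptr<std::string> w2;
    w2 = w1;

    assert(h1.use_count() == 3);   // h1, h2 and h3 share one count
    assert(w1.use_count() == 2);   // w1 and w2 share another
    assert(h1 != w1);              // the two groups use different blocks
}
```

No global or static counter is involved: creating a new block automatically creates a new count.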

Please see the following picture:

[Figure: o_string.jpg (//files.jb51.net/file_images/article/201311/o_string.jpg), the memory layout of a string]

With this mechanism, every time we allocate memory for a string we allocate a little extra space to hold the reference count, and every time a copy construction occurs, the value stored there is incremented by one. When the content is about to be modified, the string class checks whether the reference count is zero. If it is not zero, someone else is sharing this memory, so the object must first make its own copy: it decrements the shared reference count by one and then copies the data. The following code fragments illustrate these actions:

 
// Constructor (allocates memory)
string::string(const char* tmp)
{
    _Len = strlen(tmp);
    _Ptr = new char[_Len+1+1];   // text + terminating '\0' + one byte for the count
    strcpy( _Ptr, tmp );
    _Ptr[_Len+1] = 0;            // set the reference count: nobody else shares this block
}

// Copy constructor (shares memory)
string::string(const string& str)
{
    this->_Ptr = str._Ptr;       // share the memory
    this->_Len = str._Len;
    this->_Ptr[_Len+1]++;        // reference count plus one
}

// Copy-on-write: copy only when writing
char& string::operator[](unsigned int idx)
{
    if (idx > _Len || _Ptr == 0) {
        static char nullchar = 0;
        return nullchar;
    }

    if (_Ptr[_Len+1] > 0) {          // someone else shares this block
        _Ptr[_Len+1]--;              // subtract one from the shared reference count
        char* tmp = new char[_Len+1+1];
        strncpy( tmp, _Ptr, _Len+1 );
        _Ptr = tmp;
        _Ptr[_Len+1] = 0;            // set the reference count of the new block
    }

    return _Ptr[idx];
}

// Destructor
string::~string()
{
    // When the reference count is 0, nobody else uses the block: free it
    if (_Ptr[_Len+1] == 0) {
        delete[] _Ptr;
    } else {
        _Ptr[_Len+1]--;              // otherwise subtract one from the reference count
    }
}


Haha, the whole technical detail has come to light.

There is, however, a small difference from basic_string in the STL. If you open the STL source you will find that it accesses the reference count as _Ptr[-1]: the standard library places the memory for the reference count in front of the string data (the code I gave puts the count at the end, which is a poor choice). The advantage of putting it in front is that when the string grows, only the memory behind it needs to be extended and the reference count does not have to move, which saves a little time.

The memory layout of the string in the STL is like the picture I drew earlier, with _Ptr pointing to the data area and RefCnt at _Ptr-1, that is, _Ptr[-1].
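To make the layout concrete, here is a minimal self-contained sketch of a copy-on-write string that keeps its count in front of the characters. It is an illustration under my own assumptions, not the code of any real standard library: CowString, Header and all member names are hypothetical, the count starts at 1 for a single owner, and there is no bounds checking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>

class CowString {
    // The header sits directly in front of the characters.
    struct Header { std::size_t refs; std::size_t len; };
    char* ptr_;   // points at the characters, just past the Header

    Header* header() const { return reinterpret_cast<Header*>(ptr_) - 1; }

    // One allocation holds the header, the text and the terminating '\0'.
    static char* allocate(const char* src, std::size_t len) {
        void* block = ::operator new(sizeof(Header) + len + 1);
        Header* h = new (block) Header{1, len};     // one owner so far
        char* text = reinterpret_cast<char*>(h + 1);
        std::memcpy(text, src, len);
        text[len] = '\0';
        return text;
    }

    void release() {
        if (--header()->refs == 0) ::operator delete(header());
    }

    void detach() {                 // copy only if the block is shared
        if (header()->refs > 1) {
            char* copy = allocate(ptr_, header()->len);
            --header()->refs;       // leave the shared block
            ptr_ = copy;
        }
    }

public:
    CowString(const char* s) : ptr_(allocate(s, std::strlen(s))) {}
    CowString(const CowString& o) : ptr_(o.ptr_) { ++header()->refs; }
    CowString& operator=(const CowString& o) {
        ++o.header()->refs;         // increment first: safe for self-assignment
        release();
        ptr_ = o.ptr_;
        return *this;
    }
    ~CowString() { release(); }

    const char* c_str() const { return ptr_; }
    char operator[](std::size_t i) const { return ptr_[i]; }          // read only
    char& operator[](std::size_t i) { detach(); return ptr_[i]; }     // may write
    std::size_t use_count() const { return header()->refs; }
};

void demo_cow_string() {
    CowString s1("hello world");
    CowString s2 = s1;
    assert(s1.c_str() == s2.c_str());    // same block
    assert(s1.use_count() == 2);
    s1[1] = 'q';                         // s1 detaches before the write
    assert(s1.c_str() != s2.c_str());
    assert(s2.use_count() == 1);
    assert(std::strcmp(s1.c_str(), "hqllo world") == 0);
    assert(std::strcmp(s2.c_str(), "hello world") == 0);
}
```

Because the non-const operator[] cannot tell a read from a write, it detaches whenever the block is shared, which is the same conservative choice as in the fragments above.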

 
2.4 The bug

Who was it that said "where there is sunshine, there is shadow"? Many of us are perhaps superstitious about standard things, believing them too tried and tested to go wrong. Don't be, because any good design and any good code will have a bug under some particular circumstances. The STL is the same, and the memory sharing/copy-on-write technique of the string class is no exception. What is more, this bug can crash your entire program!

Don't believe it? Then let's look at a test case:

Suppose there is a dynamically linked library (called mynet.dll or mynet.so) with a function that returns a string class:


string GetIPAddress(string hostname)
{
    static string ip;
     ... 
     ... 
    return ip;
}


Your main program dynamically loads this dynamic link library and calls this function:


int main()
{
    // Load the function from the dynamic link library
    hDll = LoadLibrary( ... ..);
    pFun = GetProcAddress(hDll, "GetIPAddress");

    // Call the function in the dynamic link library
    string ip = (*pFun)("host1");
     ...
     ...
    // Release the dynamic link library
    FreeLibrary(hDll);
     ...
    cout << ip << endl;
}


Look at the code: the program dynamically loads the dynamic link library, calls the function in it through a function pointer, stores the return value in a string object, and then releases the dynamic link library. After the release, it prints the contents of ip.

From the function's definition we know that it returns by value, so the copy constructor is invoked when the function returns. Under the memory sharing mechanism of the string class, the variable ip in the main program then shares memory with the static string variable inside the function, and that memory lies in the address space of the dynamic link library. Assume the value of ip is never changed anywhere in the main program. Then, when the main program releases the dynamic link library, the shared memory area is released as well. Any later access to ip is therefore bound to touch an invalid memory address and crash the program. Even if ip is never used again, a memory access exception still occurs when the main program exits, because ip is destructed at that point and its destructor touches the released memory.

Memory access exceptions mean two things: 1) no matter how beautiful your program is, this error overshadows it and damages your reputation; 2) for some time to come you will be tormented by this system-level error (finding and eliminating this kind of memory error is no easy task in the C++ world). This is the lasting pain in the heart of every C/C++ programmer: a thousand-mile dike can collapse because of a single ant nest. And if you are not aware of this characteristic of the string class, hunting for such a memory exception in thousands of lines of code is a nightmare.

Note: there are many ways to fix the bug above. Here is one, for reference only:
string ip = (*pFun)("host1").c_str();
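The fix works because a string built from a const char* goes through the ordinary constructor, not the copy constructor, so it has to allocate its own storage. The sketch below shows that effect in isolation, without any dynamic library; deep_copy is a hypothetical helper and the address is only a placeholder value.

```cpp
#include <cassert>
#include <string>

// Builds a new string from the character data rather than from the string
// object, so the result cannot share the source's buffer.
std::string deep_copy(const std::string& source) {
    return std::string(source.c_str());
}

void demo_deep_copy() {
    std::string ip = "192.0.2.1";          // stand-in for the library's static string
    std::string mine = deep_copy(ip);
    assert(mine == ip);                    // same contents
    assert(mine.c_str() != ip.c_str());    // separate storage
}
```

One limitation of going through c_str() is that the copy stops at the first '\0' character, which does not matter for text such as an IP address.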

3. Afterword

The article should end here. It has the following main purposes:

1)       To introduce the copy-on-write/memory sharing technique.
2)       To present a design pattern, taking the string class in the STL as the example.
3)       To show that in the C++ world, no matter how clever your design or how solid your code, it is hard to cover every situation. Smart pointers are a typical example: however you design them, they end up with serious bugs.
4)       C++ is a double-edged sword: only by understanding the underlying principles can you use it well; otherwise you will get burned. If designing and using class libraries makes you feel that playing with C++ is like playing with fire and that you must be careful, then you are getting there, and are ready to learn how to control the fire.

Finally, let me use this afterword to introduce myself. I currently work on software development across Unix platforms, mainly the research and development of system-level software products. I am very interested in grid computing, the next revolution in computing, and equally in distributed computing, P2P, Web Services and J2EE. I also have some modest experience in project implementation, team management and project management. I hope that others of the younger generation who, like me, are fighting on the front line of "technology and management in equal measure" will get in touch with me often. My MSN and email are: haoel@hotmail.com.

