Source code analysis of the dynamic capacity expansion mechanism of common Array-based data structures

  • 2020-05-10 18:43:35
  • OfStack

The impulse to write this article came from a microblog post: "Everyone knows System.Collections.Generic.List<T>. What kind of data structure is it? How are its elements stored internally? And what about Dictionary<TKey, TValue>? ..."

If you go by the features of linear tables and hash tables as introduced in data structures and algorithms books, the textbook answer goes something like this: List<T> is a linear structure stored in contiguously allocated memory, with its elements laid out sequentially. Its advantage is that contiguous allocation is relatively space-efficient, and adding elements within the preset length is cheap. Its disadvantage is that lookup is O(n), slower than a hash structure's O(1), and if an insertion exceeds the preset length, memory must be reallocated, which is very expensive. Dictionary<TKey, TValue>, on the other hand, is a hash structure, blah blah blah. That's about it.

Then look at the replies under that microblog: opinions are all over the place. Most think it is implemented on top of an array, but none of the answers satisfied Lao Zhao. I had read articles and heard about how StringBuilder and HashTable are implemented internally, and about lists doubling their memory on expansion, but I didn't know the details and wasn't sure, so I really wanted to know the answer. Mr. Zhao said any programmer with a little curiosity should read the source of both implementations. Nothing in the world is difficult for one who sets their mind to it, and if you really do set your mind to it, you should record what you learn, right or wrong.

Note: if you are a novice, it is recommended that you stop reading here. If you are genuinely curious about the answer, the simplest correct approach, also the one my idol Lao Zhao recommends, is to check MSDN and the framework source yourself. To avoid misleading anyone, this article carries a label: written with no warranty.

1. StringBuilder

StringBuilder has six constructors, and objects are most commonly created through the parameterless one. MSDN (Chinese) says "the default capacity for this implementation is 16, and the default maximum capacity is Int32.MaxValue". The default capacity is 16, but 16 of what? Why so vague? Disassembling the source, we see that its constructors ultimately all call one method:


public unsafe StringBuilder(string value, int startIndex, int length, int capacity)
        {
            if (capacity < 0)
            {
                throw new ArgumentOutOfRangeException("capacity", Environment.GetResourceString("ArgumentOutOfRange_MustBePositive", new object[]
    {
     "capacity"
    }));
            }
            if (length < 0)
            {
                throw new ArgumentOutOfRangeException("length", Environment.GetResourceString("ArgumentOutOfRange_MustBeNonNegNum", new object[]
    {
     "length"
    }));
            }
            if (startIndex < 0)
            {
                throw new ArgumentOutOfRangeException("startIndex", Environment.GetResourceString("ArgumentOutOfRange_StartIndex"));
            }
            if (value == null)
            {
                value = string.Empty;
            }
            if (startIndex > value.Length - length)
            {
                throw new ArgumentOutOfRangeException("length", Environment.GetResourceString("ArgumentOutOfRange_IndexLength"));
            }
            this.m_MaxCapacity = 2147483647;
            if (capacity == 0)
            {
                capacity = 16; // The default value 
            }
            if (capacity < length)
            {
                capacity = length;
            }
            this.m_ChunkChars = new char[capacity]; // Character array capacity initialization 
            this.m_ChunkLength = length;
            fixed (char* ptr = value)
            {
                StringBuilder.ThreadSafeCopy(ptr + (IntPtr)startIndex, this.m_ChunkChars, 0, length);
            }
        }

So the default capacity is 16, which by the look of it means the default pre-allocated string buffer holds 16 characters.
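This reading is easy to sanity-check from the outside through the public Capacity and MaxCapacity properties (capacities are measured in characters):

```csharp
using System;
using System.Text;

class DefaultCapacityDemo
{
    static void Main()
    {
        var sb = new StringBuilder();       // parameterless constructor
        Console.WriteLine(sb.Capacity);     // 16: sixteen characters pre-allocated
        Console.WriteLine(sb.MaxCapacity);  // 2147483647: Int32.MaxValue
    }
}
```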

The two-parameter constructor public StringBuilder(int capacity, int maxCapacity) is essentially implemented as follows:


       public StringBuilder(int capacity, int maxCapacity)
        {
            if (capacity > maxCapacity)
            {
                throw new ArgumentOutOfRangeException("capacity", Environment.GetResourceString("ArgumentOutOfRange_Capacity"));
            }
            if (maxCapacity < 1)
            {
                throw new ArgumentOutOfRangeException("maxCapacity", Environment.GetResourceString("ArgumentOutOfRange_SmallMaxCapacity"));
            }
            if (capacity < 0)
            {
                throw new ArgumentOutOfRangeException("capacity", Environment.GetResourceString("ArgumentOutOfRange_MustBePositive", new object[]
    {
     "capacity"
    }));
            }
            if (capacity == 0)
            {
                capacity = Math.Min(16, maxCapacity); // Will be maximum capacity and default capacity 16 Let's compare them. Let's take the smaller one 
            }
            this.m_MaxCapacity = maxCapacity;
            this.m_ChunkChars = new char[capacity];
        }

Here is a question: when concatenating strings, we usually cannot guarantee the result stays within the default capacity of 16. If it grows beyond 16, how does StringBuilder achieve its dynamic expansion? Let's look at how the common Append(string value) method is implemented:

        public unsafe StringBuilder Append(string value)
        {
            if (value != null)
            {
                char[] chunkChars = this.m_ChunkChars;
                int chunkLength = this.m_ChunkLength;
                int length = value.Length;
                int num = chunkLength + length;
                if (num < chunkChars.Length) // The result still fits in the current chunk 
                {
                    if (length <= 2) // Fast path: appending 2 characters or fewer 
                    {
                        if (length > 0)
                        {
                            chunkChars[chunkLength] = value[0];
                        }
                        if (length > 1)
                        {
                            chunkChars[chunkLength + 1] = value[1];
                        }
                    }
                    else
                    {
                        fixed (char* smem = value)
                        {
                            fixed (char* ptr = &chunkChars[chunkLength])
                            {
                                string.wstrcpy(ptr, smem, length); // A C-style copy; the internal implementation is not visible here 
                            }
                        }
                    }
                    this.m_ChunkLength = num;
                }
                else
                {
                    this.AppendHelper(value);
                }
            }
            return this;
        }

In the code above, the logic for concatenations that no longer fit in the current chunk is handled by AppendHelper, which ultimately ends up in the following method:

      internal unsafe StringBuilder Append(char* value, int valueCount)
        {
            int num = valueCount + this.m_ChunkLength;
            if (num <= this.m_ChunkChars.Length)
            {
                StringBuilder.ThreadSafeCopy(value, this.m_ChunkChars, this.m_ChunkLength, valueCount);
                this.m_ChunkLength = num;
            }
            else
            {
                int num2 = this.m_ChunkChars.Length - this.m_ChunkLength;
                if (num2 > 0)
                {
                    StringBuilder.ThreadSafeCopy(value, this.m_ChunkChars, this.m_ChunkLength, num2);
                    this.m_ChunkLength = this.m_ChunkChars.Length;
                }
                int num3 = valueCount - num2;
                this.ExpandByABlock(num3); // At last, the expansion function 
                StringBuilder.ThreadSafeCopy(value + (IntPtr)num2, this.m_ChunkChars, 0, num3);
                this.m_ChunkLength = num3;
            }
            return this;
        }

Then analyze the capacity expansion function ExpandByABlock:

     private void ExpandByABlock(int minBlockCharCount)
        {
            if (minBlockCharCount + this.Length > this.m_MaxCapacity)
            {
                throw new ArgumentOutOfRangeException("requiredLength", Environment.GetResourceString("ArgumentOutOfRange_SmallCapacity"));
            }
            int num = Math.Max(minBlockCharCount, Math.Min(this.Length, 8000)); // Compute the size of the newly allocated chunk 
            this.m_ChunkPrevious = new StringBuilder(this); // Link the current chunk in as the previous one 
            this.m_ChunkOffset += this.m_ChunkLength;
            this.m_ChunkLength = 0;
            if (this.m_ChunkOffset + num < num)
            {
                this.m_ChunkChars = null;
                throw new OutOfMemoryException();
            }
            this.m_ChunkChars = new char[num]; // The array reallocates space 
        }

From the straightforward analysis above, it is obvious that instantiating a StringBuilder initializes and maintains an internal array char[] m_ChunkChars. Anyone who has learned C and data structures knows that an array requires contiguous memory allocated up front and cannot be extended in place, so arrays by themselves are poorly suited to dynamic growth. Yet basic data structures such as strings and linear tables are built on arrays.

A quick look at several other common concatenation methods and the indexer shows roughly the same internal implementation: almost everything boils down to operations on the character array. If you're interested, take a look at the source yourself.

At this point it is safe to conclude that the string operations inside StringBuilder are ultimately carried out on the character array char[] m_ChunkChars. Think about it: if StringBuilder were implemented by concatenating String with += internally, there would be no point in using it.

What cannot be ignored is its expansion algorithm, the heart of which is this line:

int num = Math.Max(minBlockCharCount, Math.Min(this.Length, 8000));

It is actually very simple arithmetic. StringBuilder's internal MaxChunkSize defaults to 8000, so you can work out for yourself what happens to memory allocation over many concatenations of arbitrary lengths, and whether the number 8000 makes sense. Reportedly, the internals of several other common framework classes follow a similarly balanced design, much like the GC's internal algorithms.
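To get a feel for that line, here is a minimal sketch that mirrors the new-chunk calculation (the helper name NextChunkSize is mine, not the framework's):

```csharp
using System;

class ChunkSizeSketch
{
    // Mirrors the line in ExpandByABlock:
    //   num = Math.Max(minBlockCharCount, Math.Min(this.Length, 8000))
    public static int NextChunkSize(int minBlockCharCount, int currentLength)
    {
        return Math.Max(minBlockCharCount, Math.Min(currentLength, 8000));
    }

    static void Main()
    {
        // While the builder is small, the new chunk matches the current
        // length, giving doubling-like growth...
        Console.WriteLine(NextChunkSize(10, 100));      // 100
        Console.WriteLine(NextChunkSize(10, 5000));     // 5000
        // ...but once it holds 8000+ characters, chunks are capped at 8000,
        Console.WriteLine(NextChunkSize(10, 50000));    // 8000
        // unless a single append needs more, in which case it gets that much.
        Console.WriteLine(NextChunkSize(20000, 50000)); // 20000
    }
}
```

So for a huge builder, each expansion allocates at most an 8000-character block instead of doubling forever, which is exactly the balance the number 8000 is there to strike.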

2. Linear structure with dynamic capacity expansion based on Array

You should be familiar with this pattern: like StringBuilder, many collections in the framework grow dynamically. For example, the familiar linear collections ArrayList, the generic List, Queue, Stack and so on are implemented directly on top of Array.

So how do these Array-based collections implement dynamic scaling internally, and how is the growth controlled? We may have heard that each expansion doubles the previous capacity. Is that really the case? With this in mind, let's pick the most common generic collection, List<T>, and analyze it.

As with the StringBuilder analysis, we start from the constructors and the Add method.

The no-parameter constructor is as follows:


     public List()
        {
            this._items = List<T>._emptyArray;
        }

_items is a generic array (private T[] _items; can we even call it a "generic array"?). Surely it must get a default initial capacity through the constructor, and _emptyArray must be initialized with some fixed amount of space, right? _emptyArray is defined as follows:

  private static readonly T[] _emptyArray = new T[0];

Now look at the constructor with one parameter:


        public List(int capacity)
        {
            if (capacity < 0)
            {
                ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.capacity, ExceptionResource.ArgumentOutOfRange_NeedNonNegNum);
            }
            this._items = new T[capacity];
        }

This time a newly allocated array object is assigned directly to _items.

OK, one more: the constructor that takes an initial collection:


public List(IEnumerable<T> collection)
        {
            if (collection == null)
            {
                ThrowHelper.ThrowArgumentNullException(ExceptionArgument.collection);
            }
            ICollection<T> collection2 = collection as ICollection<T>;
            if (collection2 != null)
            {
                int count = collection2.Count;
                this._items = new T[count];
                collection2.CopyTo(this._items, 0);
                this._size = count;
                return;
            }
            this._size = 0;
            this._items = new T[4];
            using (IEnumerator<T> enumerator = collection.GetEnumerator())
            {
                while (enumerator.MoveNext())
                {
                    this.Add(enumerator.Current);
                }
            }
        }

You can probably see the point here: every constructor initializes the _items variable, which is an array (so, as expected, the generic List really is implemented on top of an array).

Next, let's analyze the "add" in add/delete/update/query, that is, the Add method:


       public void Add(T item)
        {
            if (this._size == this._items.Length)
            {
                this.EnsureCapacity(this._size + 1); // Expansion function 
            }
            this._items[this._size++] = item; // Add elements by indexer assignment 
            this._version++;
        }

With a clear goal, we finally found the capacity expansion function EnsureCapacity:

        private void EnsureCapacity(int min)
        {
            if (this._items.Length < min)
            {
                int num = (this._items.Length == 0) ? 4 : (this._items.Length * 2); // capacity = Twice the current length 
                if (num < min)
                {
                    num = min;
                }
                this.Capacity = num; // Current array capacity assignment 
            }
        }
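The growth EnsureCapacity produces is easy to observe from the outside through the public Capacity property; this small program prints the capacity each time a reallocation happens:

```csharp
using System;
using System.Collections.Generic;

class ListGrowthDemo
{
    static void Main()
    {
        var list = new List<int>();
        int lastCapacity = -1;
        for (int i = 0; i < 40; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity) // a reallocation just happened
            {
                Console.WriteLine($"count={list.Count,2}  capacity={list.Capacity}");
                lastCapacity = list.Capacity;
            }
        }
        // Capacities go 4, 8, 16, 32, 64: the empty list jumps to 4 on the
        // first Add, then doubles each time it fills up.
    }
}
```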

Here we can see that when adding an element, if the count does not exceed the size of the pre-allocated array, the value is simply assigned into the array, just as the indexer below does:

      public T this[int index]
        {
            [TargetedPatchingOptOut("Performance critical to inline across NGen image boundaries")]
            get
            {
                if (index >= this._size)
                {
                    ThrowHelper.ThrowArgumentOutOfRangeException();
                }
                return this._items[index];
            }
            [TargetedPatchingOptOut("Performance critical to inline across NGen image boundaries")]
            set
            {
                if (index >= this._size)
                {
                    ThrowHelper.ThrowArgumentOutOfRangeException();
                }
                this._items[index] = value;
                this._version++;
            }
        }

If new items would exceed the current array's size, capacity is reallocated through the expansion function (by newing up another array object). In EnsureCapacity we can see:

int num = (this._items.Length == 0) ? 4 : (this._items.Length * 2); // capacity = twice the current length

The array capacity is recalculated here in a simple way. The most important part is the assignment to the Capacity property:

this.Capacity = num; // assign the new capacity

The setter of the Capacity property contains the actual array new operation:


    public int Capacity
        {
            [TargetedPatchingOptOut("Performance critical to inline across NGen image boundaries")]
            get
            {
                return this._items.Length;
            }
            set
            {
                if (value < this._size)
                {
                    ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.value, ExceptionResource.ArgumentOutOfRange_SmallCapacity);
                }
                if (value != this._items.Length)
                {
                    if (value > 0)
                    {
                        T[] array = new T[value]; // reinitialization 1 An array 
                        if (this._size > 0)
                        {
                            Array.Copy(this._items, 0, array, 0, this._size);
                        }
                        this._items = array; // _items now references the new array; the old one is discarded 
                        return;
                    }
                    this._items = List<T>._emptyArray;
                }
            }
        }

At this point we can summarize the collection's expansion algorithm in pseudocode:

public void Add(T item)
{
    // does the current element count equal the preset capacity?
    if (count == capacity)
    {
        // reallocate memory (twice the previous capacity)
        // copy the elements into the new space with Array.Copy
    }

    // place the element at the current position
    this[currentIndex] = item;
    currentIndex++;
}

One might ask: isn't simply multiplying by 2 on every reallocation rash, and a waste of memory? Isn't Microsoft's implementation supposed to be rigorous?

Indeed, when expansion happens with a large amount of data, memory is obviously wasted. The generic List provides a method named TrimExcess to address this:


       public void TrimExcess()
        {
            int num = (int)((double)this._items.Length * 0.9); // The current array space times 0.9
            if (this._size < num)
            {
                this.Capacity = this._size; // Reset the array space 
            }
        }

The principle is simple arithmetic: call it once (or more) from the outside and it rebalances the allocated space to save memory, trimming only when less than 90% of the array is actually in use.
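Its effect is easy to demonstrate:

```csharp
using System;
using System.Collections.Generic;

class TrimDemo
{
    static void Main()
    {
        var list = new List<int>();
        for (int i = 0; i < 5; i++) list.Add(i);
        Console.WriteLine(list.Capacity); // 8: growth overshot the 5 elements stored
        list.TrimExcess();                // 5 < 0.9 * 8, so the array is reallocated
        Console.WriteLine(list.Capacity); // 5: capacity shrunk to the element count
    }
}
```

Had the list held, say, 8 elements out of 8, the 0.9 threshold would not be crossed and TrimExcess would do nothing, avoiding a pointless copy.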

One very common operation is determining whether an element is in the list. The method is named Contains, and it is O(n):


    public bool Contains(T item)
        {
            if (item == null)
            {
                for (int i = 0; i < this._size; i++)
                {
                    if (this._items[i] == null)
                    {
                        return true;
                    }
                }
                return false;
            }
            EqualityComparer<T> @default = EqualityComparer<T>.Default;
            for (int j = 0; j < this._size; j++)
            {
                if (@default.Equals(this._items[j], item))
                {
                    return true;
                }
            }
            return false;
        }

That's it. Haha, who knew. The generic List is indeed array-based, and now we can say with confidence that we understand its internal expansion implementation. At this point I really want to pat myself on the shoulder: well done, brother, what a sense of achievement; now hurry up and spend more time thinking about the girl you like... and sure enough, my face is flushing again.

Reading further through the source, we find that many core methods operate directly on the _items array and call Array's static methods repeatedly, so we must not underestimate the underlying data structure, the array (Array).

Looking at Array's own source in the disassembler, its add/delete/update/query methods are surprisingly rich, and anyone with the inclination can dig out several classic algorithms in there (BinarySearch, for example, reportedly uses binary search). It has been greatly underestimated, at least by me. We now know that commonly used data structures such as StringBuilder, ArrayList, the generic List, Queue, Stack, and so on are all implemented on top of Array. Don't underestimate it: as the framework evolves, new Array-based data structures may well appear.

3. Dynamically expandable hash structures based on Array

Other data structures commonly used in development, such as HashTable, the generic Dictionary, HashSet, and the thread-safe container ConcurrentDictionary, also expand dynamically. Below is a brief analysis of the HashTable source.

Start with the no-argument constructor:


      public Hashtable()
            : this(0, 1f)
        {
        }

The constructors that the no-argument constructor ultimately needs to call are as follows:

      public Hashtable(int capacity, float loadFactor)
        {
            if (capacity < 0)
            {
                throw new ArgumentOutOfRangeException("capacity", Environment.GetResourceString("ArgumentOutOfRange_NeedNonNegNum"));
            }
            if (loadFactor < 0.1f || loadFactor > 1f)
            {
                throw new ArgumentOutOfRangeException("loadFactor", Environment.GetResourceString("ArgumentOutOfRange_HashtableLoadFactor", new object[]
    {
     0.1, 
     1.0
    }));
            }
            this.loadFactor = 0.72f * loadFactor; // the load factor: the familiar yet mysterious 0.72
            double num = (double)((float)capacity / this.loadFactor);
            if (num > 2147483647.0)
            {
                throw new ArgumentException(Environment.GetResourceString("Arg_HTCapacityOverflow"));
            }
            int num2 = (num > 3.0) ? HashHelpers.GetPrime((int)num) : 3; // Initializes the array space 
            this.buckets = new Hashtable.bucket[num2]; // so it turns out to be an array of bucket structs 
            this.loadsize = (int)(this.loadFactor * (float)num2);
            this.isWriterInProgress = false;
        }
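Plugging the defaults into the constructor math above shows how soon the first rehash comes. This is a sketch of the arithmetic only, not framework code:

```csharp
using System;

class LoadSizeSketch
{
    static void Main()
    {
        // new Hashtable() calls this(0, 1f), so:
        float loadFactor = 0.72f * 1f;              // effective load factor 0.72
        int buckets = 3;                            // capacity 0 -> minimum bucket count, 3
        int loadsize = (int)(loadFactor * buckets); // (int)2.16 = 2
        Console.WriteLine(loadsize);                // 2
        // Insert expands when count >= loadsize, so a default Hashtable
        // already rehashes while inserting its third element.
    }
}
```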

As we know, a hash function maps a key to a storage-slot address, and each storage slot is called a bucket. Isn't this buckets field exactly the hash bucket storage we are looking for? It turns out to be an array as well. See it? this.buckets is an array of bucket structs.

Ok, let's start with the Add method.

          public virtual void Add(object key, object value)
              {
                      this.Insert(key, value, true);
              }
Tracing into the Insert method, the code gets richer, but right away we see a judgment:

                  if (this.count >= this.loadsize)
                  {
                          this.expand();
                  }
The if above is crystal clear. Doesn't expand mean expansion? Indeed:

          private void expand()
          {
                  int prime = HashHelpers.GetPrime(this.buckets.Length * 2); // multiply by 2, with an auxiliary call to get a prime
                  this.rehash(prime); // hash everything again
          }
There is a rehash function; stepping into it:


        private void rehash(int newsize)
        {
            this.occupancy = 0;
            Hashtable.bucket[] newBuckets = new Hashtable.bucket[newsize]; // Redefine hash buckets
            for (int i = 0; i < this.buckets.Length; i++) // walk the old buckets and refill the new ones
            {
                Hashtable.bucket bucket = this.buckets[i];
                if (bucket.key != null && bucket.key != this.buckets)
                {
                    this.putEntry(newBuckets, bucket.key, bucket.val, bucket.hash_coll & 2147483647);
                }
            }
            Thread.BeginCriticalRegion();
            this.isWriterInProgress = true;
            this.buckets = newBuckets; // The old buckets Change the reference to newly populated newBuckets
            this.loadsize = (int)(this.loadFactor * (float)newsize);
            this.UpdateVersion();
            this.isWriterInProgress = false;
            Thread.EndCriticalRegion();
        }

Sure enough, I saw the re-definition and re-assignment of the array: this.buckets = newBuckets;

Isn't that the same old trick again? Look how often Array shows up: many of Hashtable's methods boil down to array operations. I can't help recalling the line from the Stephen Chow movie: "done, time to knock off."

But that's not all: there is a subtlety in how the new capacity is computed. Is it a straight multiply-by-2, as in the generic List? In the expand function we see this line:

int prime = HashHelpers.GetPrime(this.buckets.Length * 2);

So the doubled length is passed through an auxiliary function, HashHelpers.GetPrime, whose code is as follows:


       public static int GetPrime(int min)
        {
            if (min < 0)
            {
                throw new ArgumentException(Environment.GetResourceString("Arg_HTCapacityOverflow"));
            }
            for (int i = 0; i < HashHelpers.primes.Length; i++)
            {
                int num = HashHelpers.primes[i]; // first try the precomputed table 
                if (num >= min)
                {
                    return num;
                }
            }
            for (int j = min | 1; j < 2147483647; j += 2) // otherwise search odd numbers upward 
            {
                if (HashHelpers.IsPrime(j))
                {
                    return j;
                }
            }
            return min;
        }
The prime lookup looks familiar. Tracking down HashHelpers, it turns out its primes array lists precomputed primes from a minimum of 3 up to a maximum of 7199369. If your hash table is not extremely large, the primes from 3 to 7199369 are sufficient (the 80/20 principle at work); otherwise, when the table expands, a prime must be generated dynamically with HashHelpers.IsPrime, anywhere up to 2147483647. So it is not accurate to say the hash table's capacity doubles on expansion; the result is tied to the hash algorithm's preference for prime-sized tables. Loosely speaking, we can say it is approximately doubled.
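The "approximately doubled" growth can be illustrated with a simplified sketch. Note this is my own trial-division version, not the framework's: the real GetPrime consults the precomputed primes table first, so its results can differ from a plain next-prime search.

```csharp
using System;

class PrimeGrowthSketch
{
    // Trial-division primality test (the framework's HashHelpers.IsPrime
    // is equivalent in spirit, though not in code).
    public static bool IsPrime(int candidate)
    {
        if (candidate < 2) return false;
        for (int d = 2; d * d <= candidate; d++)
            if (candidate % d == 0) return false;
        return true;
    }

    public static int NextPrime(int min)
    {
        int n = min;
        while (!IsPrime(n)) n++;
        return n;
    }

    static void Main()
    {
        int size = 3; // a default Hashtable starts from 3 buckets
        for (int i = 0; i < 5; i++)
        {
            Console.WriteLine(size);
            size = NextPrime(size * 2); // the real code calls GetPrime(buckets.Length * 2)
        }
        // prints 3, 7, 17, 37, 79: each step roughly doubles, then
        // snaps up to a prime bucket count
    }
}
```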

Out of curiosity I also glanced at the source of the generic Dictionary. Its internals contain two arrays of structures, buckets and entries. I would guess their implementations also exploit array features to varying degrees, and they are Array-based after all. See? I haven't examined the other common hash containers carefully; if you're interested, read their source yourself and compare one or two typical ones. It should be very instructive.

Feel free to share what you know about other Array-based data structures.


Food for thought, a deliberately naive question to verify: must the dynamic behavior of containers be implemented on top of arrays? Is the array-based implementation the only option?

