Example of using cardinality sorting algorithm to sort strings in C

  • 2021-10-13 08:24:04
  • OfStack

Before beginning

Assuming that the longest string is L, L is the input length, and then assuming that all strings are "filled" to this length, this filling is only logical, we can imagine that there is a kind of "null character", which is smaller than any other character, and use this character to fill all strings with insufficient length. For example, if the longest string is 9 and one string A is 6, then when comparing the 7th character, we make A [7] a "null character".

If it seems difficult to include all the characters, we first define a character set in which all the characters in the string to be sorted are included


// Character set 
private string _myCharSet = "0123456789qwertyuiopasdfghjklzxcvbnm";

Another method to generate random strings (C # implementation):


private Random _random = new Random();
 
string[] GetRandStrings(int size, int minLength, int maxLength)
{
  string[] strs = new string[size];
  int len = 0;
  StringBuilder sb = new StringBuilder(maxLength);
 
  for (int i = 0; i < strs.Length; i++)
  {
    // Random determination first 1 Length of 
    len = _random.Next(minLength, maxLength);
    for (int j = 0; j < len; j++)
    {
      // Random selection 1 Characters 
      sb.Append(_myCharSet[_random.Next(_myCharSet.Length)]);
    }
    strs[i] = sb.ToString();
    sb.Clear();
  }
  return strs;
}

Here, the range of buckets is determined according to the integer representation of characters, and then one bucket is prepared for "empty characters". To represent the special case of "null character", we use default (char), that is, '\ 0', because when the string. ElementAtOrDefault (int) method is called, '\ 0' is returned if the index is exceeded.

Primary version (C #)


void StringRadixSort(string[] strArray)
{
  if (strArray == null
    || strArray.Length == 0
    || strArray.Contains(null))
  {
    return;
  }
 
  // Get the maximum length of a string 
  int maxLength = 0;
  foreach (string s in strArray)
  {
    if (s.Length > maxLength)
    {
      maxLength = s.Length;
    }
  }
 
  // Determine the integer range of characters 
  int rangeStart = _myCharSet[0];
  int rangeEnd = _myCharSet[0];
  foreach (char ch in _myCharSet)
  {
    if (ch < rangeStart)
      rangeStart = ch;
    if (ch >= rangeEnd)
      rangeEnd = ch + 1;
  }
 
  // Also for " Null character " Allocation 1 Buckets with an index of 0
  int bucketCount = rangeEnd - rangeStart + 1;
  LinkedList<string>[] buckets = new LinkedList<string>[bucketCount];
 
  // Initialize all buckets 
  for (int i = 0; i < buckets.Length; i++)
  {
    buckets[i] = new LinkedList<string>();
  }
 
  // From the end 1 Start sorting with characters 
  int currentIndex = maxLength - 1;
  while (currentIndex >= 0)
  {
    foreach (string theString in strArray)
    {
      // If the index is exceeded, return '\0' Character (default(char))
      char ch = theString.ElementAtOrDefault(currentIndex);
      if (ch == default(char))
      {  //" Null character " Treatment of 
        buckets[0].AddLast(theString);
      }
      else
      {  // Map characters to buckets 
        int index = ch - rangeStart + 1;
        buckets[index].AddLast(theString);
      }
    }
    // Retrieve the strings from the bucket in turn, and complete 1 Trip sorting 
    int i = 0;
    foreach (LinkedList<string> bucket in buckets)
    {
      while (bucket.Count > 0)
      {
        strArray[i++] = bucket.First();
        bucket.RemoveFirst();
      }
    }
    currentIndex--;
  }
}

Make a slight "improvement"

The code used to determine the integer range of characters is slightly painful, and according to the character set, not all the characters corresponding to integers in the range may appear, so there will be a situation where we allocate buckets to some characters that will not appear at all, which is a waste. We can use a dictionary (hash) to record the mapping between a character and its bucket. So there is the following code.


private Dictionary<char, int> _charOrderDict = 
        new Dictionary<char, int>(_myCharSet.Length);
void BuildCharOrderDict()
{
  char[] sortedCharSet = _myCharSet.ToArray();
  // Sort with default comparator 
  Array.Sort(sortedCharSet);
  // For " Null character " Create a map separately 
  _charOrderDict.Add(default(char), 0);
  for (int i = 0; i < sortedCharSet.Length; i++)
  {
    //  Save the index of the character and its corresponding bucket 
    _charOrderDict.Add(sortedCharSet[i], i + 1);
  }
}

Alternatively, instead of using the default character sorting as a mapping, you can define the size relationship between characters entirely by yourself. The following is the adjusted code:


void StringRadixSort(string[] strArray)
{
  if (strArray == null
    || strArray.Length == 0
    || strArray.Contains(null))
  {
    return;
  }
  // Get the maximum length of a string 
  int maxLength = 0;
  foreach (string s in strArray)
  {
    if (s.Length > maxLength)
    {
      maxLength = s.Length;
    }
  }
 
  // For each 1 Characters ( Include null characters '\0') Allocation 1 Bucket 
  //" Null character " The index should be 0
  int bucketCount = _myCharSet.Length + 1;
  LinkedList<string>[] buckets = new LinkedList<string>[bucketCount];
 
  // Initialize all buckets 
  for (int i = 0; i < buckets.Length; i++)
  {
    buckets[i] = new LinkedList<string>();
  }
 
  // From the end 1 Start sorting with characters 
  int currentIndex = maxLength - 1;
  while (currentIndex >= 0)
  {
    foreach (string theString in strArray)
    {
      // If the index is exceeded, return '\0' Character (default(char))
      char ch = theString.ElementAtOrDefault(currentIndex);
      // Query characters according to the definition of character order 
      int index = _charOrderDict[ch];
      buckets[index].AddLast(theString);
    }
    // Retrieve the strings from the bucket in turn, and complete 1 Trip sorting 
    int i = 0;
    foreach (LinkedList<string> bucket in buckets)
    {
      while (bucket.Count > 0)
      {
        strArray[i++] = bucket.First();
        bucket.RemoveFirst();
      }
    }
    currentIndex--;
  }
}

Now, it works! The time complexity is O (n *logn) O (n *logn). On the surface, cardinality sorting is better, but strictly speaking, the time complexity of cardinality sorting should be O (k *n) O (k *n), where k is positively correlated with string length. At this time, the comparison between the two algorithms can be approximately obtained by comparing the comparison results of k and lognlogn. If the string length is very long, that is, k is very large and the input size n is not large, there will be k > lognlogn, quick sorting is more advantageous at this time. On the contrary, cardinality sorting may be better.

Finally...

Incidentally, when I expanded the character set to include all the characters on the keyboard, I found that the result of cardinality sorting was not the same as that of Array. Sort (string [] method. A closer look at the resource manager's sorting of filenames reveals that its string sorting rules are much more complicated than simply comparing characters. After searching the relevant data, we find that the sorting of strings should even consider the influence of regional culture. Even if they are all Latin letters, the sorting rules of different regions may be different. Therefore, the string sorting algorithm realized by using cardinality sorting seems to have little practical value < T-T > .


Related articles: