Printing strings in hexadecimal form in C++

2020-05-10 18:33:01
OfStack

Preface

When developing i18n features, you often run into character encoding conversion errors. Being able to print the relevant string in hexadecimal, for example "abc" as "\x61\x62\x63", is a great help when debugging i18n code. In Python, the repr() function will do. How do you do this in C++?

Here is a simple implementation using ostream's formatting capabilities:


std::string get_raw_string(std::string const& s)
{
    std::ostringstream out;
    out << '\"';
    out << std::hex;
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it)
    {
        out << "\\x" << *it;
    }
    out << '\"';
    return out.str();
}

It looks straightforward, but unfortunately this code doesn't do what we want: it still outputs each character literally. Didn't we specify std::hex to format the output? The problem is that std::hex only affects integer types; when the value being written is a character type, the C++ stream still outputs it literally. A closer look at the ostream documentation reveals that C++ standard output streams offer very little control over formatted output: only a limited number of formatting options, most of which apply to integer and floating-point types, and none at all for character types.

ostream uses C++'s function overloading and strong typing to keep the expressive power of C's printf while eliminating the endless trouble printf is notorious for and greatly improving safety. Ironically, here that strong type safety is an obstacle to our goal: I just want ostream to print characters as integers! Fortunately, C++ also has explicit type casts, which let us get around the strong typing:


out << std::hex << "\\x" << static_cast<int>(*it);
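
To see the difference between the two overloads in isolation, here is a minimal sketch. The helper names hex_as_char and hex_as_int are hypothetical and exist only for this illustration, and an ASCII execution character set is assumed:

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: streams a char directly.
// operator<<(char) is selected, so std::hex has no effect.
std::string hex_as_char(char c)
{
    std::ostringstream out;
    out << std::hex << c;
    return out.str();
}

// Hypothetical helper: casts to int first.
// operator<<(int) is selected, so std::hex applies.
std::string hex_as_int(char c)
{
    std::ostringstream out;
    out << std::hex << static_cast<int>(c);
    return out.str();
}
```

With these helpers, hex_as_char('a') yields "a", while hex_as_int('a') yields "61".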

Well, now the characters are printed as integers, and std::hex tells ostream to output integers in base 16. Problem solved. But wait, why does the output for UTF-8 encoded Chinese text look like this:


"\xffffffe4\xffffffb8\xffffffad" // get_raw_string("中")

That is an unsightly number of f's. Can we get rid of them? They appear because each character is cast to int before output; int is 32 bits wide, and on platforms where char is signed, bytes of 0x80 and above are negative and get sign-extended, so the output grows to eight hex digits. To get rid of them, it seems we only need to convert to an 8-bit integer instead. Unfortunately, C/C++ has no 8-bit integer type that is distinct from the character types. The only thing you can do is:


typedef char int8_t;

However, casting to this int8_t does not work either, because in C++ a typedef does not create a new type; it merely defines an alias for the original type, and the alias plays no separate role in overload resolution. In other words, ostream says: don't think I can't recognize you just because you're wearing an int8_t disguise. I'll still output you as a char. No luck!
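
The alias behaviour can be checked directly. The following is a small sketch that assumes a C++11 compiler (for static_assert and <type_traits>); byte_alias and stream_alias are hypothetical names chosen to avoid clashing with the int8_t from <cstdint>:

```cpp
#include <sstream>
#include <string>
#include <type_traits>

// Hypothetical alias, standing in for the "typedef char int8_t" idea.
typedef char byte_alias;

// The alias and char are the very same type as far as the compiler is concerned.
static_assert(std::is_same<byte_alias, char>::value,
              "a typedef only introduces another name for the same type");

// Streaming the alias therefore selects operator<<(char),
// and the character is written literally despite std::hex.
std::string stream_alias(byte_alias b)
{
    std::ostringstream out;
    out << std::hex << b;
    return out.str();
}
```

Calling stream_alias('a') returns "a", not "61", which matches the behaviour described above.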

So should we give up on ostream? Not so fast. ostream does not print leading zeros by default, so as long as every bit above the low 8 bits is cleared to 0, we get the output we want.
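
The effect of clearing the high bits can be sketched on a single byte. This assumes a 32-bit int and two's complement representation; hex_widened and hex_masked are hypothetical helper names, and signed char is used explicitly so the result does not depend on whether plain char is signed:

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: widen to int and print in hex.
// A negative byte is sign-extended, so the high bits are all ones.
std::string hex_widened(signed char c)
{
    int widened = c;
    std::ostringstream out;
    out << std::hex << widened;
    return out.str();
}

// Hypothetical helper: widen, then keep only the low 8 bits.
// The cleared high bits become leading zeros, which ostream omits.
std::string hex_masked(signed char c)
{
    std::ostringstream out;
    out << std::hex << (static_cast<int>(c) & 0xff);
    return out.str();
}
```

For the byte 0xe4 (the value -28 as a signed char), hex_widened gives "ffffffe4" under the stated assumptions, while hex_masked gives "e4".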

Okay, here's the final version:


#include <sstream>
#include <string>

std::string get_raw_string(std::string const& s)
{
    std::ostringstream out;
    out << '\"';
    out << std::hex;
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it)
    {
        // AND 0xFF clears the sign-extended high bits in the output,
        // so that we get "\xab" instead of "\xffab"
        out << "\\x" << (static_cast<short>(*it) & 0xff);
    }
    out << '\"';
    return out.str();
}
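
One thing to be aware of with the version above: because ostream omits leading zeros, a byte such as 0x01 is written as \x1 rather than \x01. If fixed two-digit escapes are wanted, one possible variation is sketched below. It is not the article's original code; get_raw_string_padded is a hypothetical name, and it converts through unsigned char instead of masking, then pads with std::setw and std::setfill:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical variant of get_raw_string that always emits two hex digits per byte.
std::string get_raw_string_padded(std::string const& s)
{
    std::ostringstream out;
    // std::hex and std::setfill stay in effect for the whole stream.
    out << '"' << std::hex << std::setfill('0');
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it)
    {
        // unsigned char avoids sign extension regardless of char's signedness.
        unsigned int byte = static_cast<unsigned char>(*it);
        // std::setw only applies to the next output, so it is repeated per byte.
        out << "\\x" << std::setw(2) << byte;
    }
    out << '"';
    return out.str();
}
```

For example, get_raw_string_padded("abc") returns "\x61\x62\x63" including the surrounding double quotes.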

After a few twists and turns, we have managed to print a string in base 16 using the hexadecimal output facility that ostream provides. The reason it was so convoluted is that ostream itself is too weak at controlling formatted output. Is there a better tool for this in C++? boost::format looks like a candidate, but it does not resolve the dilemma we ran into above either. Fortunately, another Boost library provides a suitable answer: boost::spirit::karma.

Karma is part of the boost::spirit library. You are probably familiar with Spirit as a parser library for parsing strings. Through Karma, Spirit also provides the opposite functionality: it is designed specifically to format C++ data structures into character streams.

That is exactly what we need. Here is the code rewritten using the karma library:


#include <boost/spirit/include/karma.hpp>
#include <iterator>
#include <stdexcept>
#include <string>

template <typename OutputIterator>
bool generate_raw(OutputIterator sink, std::string s)
{
    using boost::spirit::karma::hex;
    using boost::spirit::karma::generate;

    // One "\x" followed by a hex number for every character of s, wrapped in quotes
    return generate(sink, '\"' << *("\\x" << hex) << '\"', s);
}

std::string get_raw_string_k(std::string const& s)
{
    std::string result;
    if (!generate_raw(std::back_inserter(result), s))
    {
        throw std::runtime_error("parse error");
    }

    return result;
}

The key here is karma's built-in generator karma::hex, which is a polymorphic generator. Unlike ostream, whose overloads produce hex output only for some types, it can produce hex output for all types, including char. Another advantage is that the code is more expressive: the output format is captured entirely in one line of code:


// The output format is "\x61\x62\x63", easy to paste directly into Python or C++ code
'\"' << *("\\x" << hex) << '\"'

If you want to change the output format, just change this line of code, for example:


// Change the output format to "0x61 0x62 0x63 "
'\"' << *("0x" << hex << " ") << '\"'

Is there a performance penalty? Here is a piece of test code that converts the same string using both implementations:


#include "boost/test/unit_test.hpp"
#include "boost/../libs/spirit/optimization/measure.hpp"
#include "string.hpp" // The functions under test

static std::string const message = "hex output performance test data  Chinese ";

struct using_karma : test::base
{
    void benchmark()
    {
        // Accumulate into val so the call cannot be optimized away
        this->val += get_raw_string_k(message).size();
    }
};

struct using_ostream : test::base
{
    void benchmark()
    {
        this->val += get_raw_string(message).size();
    }
};

BOOST_AUTO_TEST_CASE(TestStringPerformance)
{
    BOOST_SPIRIT_TEST_BENCHMARK(
        100,
        (using_karma)
        (using_ostream)
    );

    BOOST_CHECK_NE(0, live_code);
}

The results are shown below: the time taken by each implementation (lower is better):

Algorithm	Time (s)
karma	6.97
ostream	14.24

Perhaps surprisingly, karma is roughly twice as fast as ostream, which is in line with Spirit's official performance figures. Note that both functions return a std::string by value, and copying the return value takes a significant share of the time; karma's advantage would be even larger if only the formatting itself were measured. Another test suggests that karma may well be the fastest formatted-output solution you can find for C/C++.

This article may seem too long for such a simple function, but fortunately we ended up with an expressive, high-performance hexadecimal output solution. Good things are hard to come by, but in C++, complex as the language is, you can often find solutions that are both fast and highly abstract. It's just a little too complicated...

Conclusion

That is all for this article. I hope it is of some help in your study or work; if you have questions, feel free to leave a comment.