string_view odi et amo

Posted: January 3, 2017 in Programming Recipes

string_view-like wrappers have been used successfully in C++ codebases for years, made possible by libraries like boost::string_ref. As most of you know, string_view is part of the C++ standard library as of C++17.

Technically, basic_string_view is an object that can refer to a constant contiguous sequence of char-like objects with the first element of the sequence at position zero. The standard library provides several typedefs for standard character types and std::string_view is simply an alias for:

basic_string_view<char>

For simplicity, I’ll just refer to string_view for the rest of the post but what I’m going to discuss is valid for the other aliases as well.

You can think of string_view as a smart const char* that provides nearly all the const member functions of std::string (c_str() being the notable exception), as well as a few handy utilities to reduce its span. You cannot enlarge a string_view except by reassigning it. Other languages (e.g. Go) have similar constructs that can also grow the range and participate in its ownership. string_view does neither, yet the power of such a simple wrapper is huge.

The applications of string_view are many and it’s relatively simple to let string_view join your codebase. For years, I’ve been using a proprietary implementation of string_view dating back to the 90s, later improved on the basis of boost::string_ref and recently of std::string_view. If you start today, it’s very likely you can use your compiler’s string_view implementation (e.g. the latest Visual Studio 2017 RC, clang and GCC support it), grab an implementation from the web, or just use boost::string_ref or another library (e.g. Google’s, folly).

One might think that using string_view is as simple as using std::string, with the only differences being that string_view does not take ownership of the char sequence and cannot change its content. That’s not completely true. Using string_view requires you to pay attention to a few other traps that I’m going to describe later on. Before starting, let me show you a couple of simple examples.

Generally speaking, string_view is a good friend when we need to do text processing (e.g. parsing, comparing, searching), but first of all, string_view is an adapter: it allows different string types to be adapted into a std::string-like container. This means that string_view provides iterator support and STL naming conventions (e.g. size, empty). To create a string_view, we only require a null-terminated const char* or both a const char* and a length. Note that in the latter case we don’t need the char sequence to be null-terminated.

Suppose now that our codebase hosts many different string types but we want to write only one function doing a certain task on constant strings. Can string_view help? It can, if the string types manage a contiguous sequence of characters and also provide (read) access to it. Examples:
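As a minimal sketch, here are a few ways different string representations can be adapted. LegacyString and view_of are hypothetical names used only for illustration; they stand in for whatever string types a real codebase hosts:

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical in-house string type: a pointer plus a length,
// with no null terminator required.
struct LegacyString {
    const char* chars;
    std::size_t length;
};

// std::string converts implicitly to std::string_view.
std::string_view view_of(const std::string& s) { return s; }

// A null-terminated C string: the length is computed by the constructor.
std::string_view view_of(const char* s) { return s; }

// Any contiguous buffer works through the (pointer, length) constructor.
std::string_view view_of(const std::vector<char>& v) { return {v.data(), v.size()}; }
std::string_view view_of(const LegacyString& s) { return {s.chars, s.length}; }
```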

Then we may write only one function for our task:

ReturnType readonly_on_string_function(string_view sv); // only one implementation

Inside readonly_on_string_function we can exploit nearly the whole set of const functions of std::string. Just this simple capability is priceless. You know what I mean if you use more than three string types in your codebase 🙂

To show you other string_view functionalities, let me consider the problem of splitting a string. This problem can be tackled in many ways (e.g. iterator-based, range-based, etc) but let me keep things simple:
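A minimal sketch of a straightforward std::string-based version could look like this (the name split and the delimiter parameter are illustrative choices, and empty tokens are skipped by assumption):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits str at any of the characters in delims, skipping empty tokens.
std::vector<std::string> split(const std::string& str, const char* delims) {
    std::vector<std::string> tokens;
    std::size_t start = str.find_first_not_of(delims);
    while (start != std::string::npos) {
        const std::size_t end = str.find_first_of(delims, start);
        tokens.push_back(str.substr(start, end - start));  // copies the token
        start = str.find_first_not_of(delims, end);
    }
    return tokens;
}
```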

The worst things about this function are (imho):

  • we create a new string for each token (this possibly ends up with dynamic allocation);
  • we can split only std::string and no other types.

Since string_view provides every const function of string, let’s try simply replacing string with string_view:

Not only is the code still valid, but it is also potentially less demanding, because each token now costs just 8 or 16 bytes (a pointer and a length, on 32-bit and 64-bit platforms respectively) and no characters are copied.

Now, let’s use some utilities to shrink the span. Suppose I get a string from some proprietary UI framework control, providing its own string representation:

auto name = uiControl.GetText();

Then imagine we want to remove all the whitespaces from the start and the end of such string (we want to trim). We can do it without changing the string itself, just by using string_view:
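A minimal sketch of such a trim follows. The function name and the set of characters treated as whitespace are choices made here for illustration:

```cpp
#include <algorithm>
#include <string_view>

std::string_view trim(std::string_view sv) {
    constexpr std::string_view whitespace = " \t\r\n";
    // If every character is whitespace, find_first_not_of returns npos and
    // std::min clamps the prefix to the whole view, leaving it empty.
    sv.remove_prefix(std::min(sv.find_first_not_of(whitespace), sv.size()));
    // On an empty view find_last_not_of returns npos, so nothing is removed.
    const auto last = sv.find_last_not_of(whitespace);
    if (last != std::string_view::npos) {
        sv.remove_suffix(sv.size() - last - 1);
    }
    return sv;
}
```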

remove_prefix moves the start of the view forward by n characters; remove_suffix moves the end of the view back by n characters. Edge cases have been handled succinctly.

Now we have a string_view containing only the “good” part of the string. At this point, let me end with a bang: we’ll use the sanitized string to query a map without allocating extra memory for the key. How? Thanks to heterogeneous lookup of associative containers:
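A minimal sketch of such a lookup, assuming a map from names to integers (the alias IdByName and the helper id_or are hypothetical):

```cpp
#include <functional>
#include <map>
#include <string>
#include <string_view>

// std::less<> (no template argument) is transparent, which enables find()
// overloads that accept any type comparable with the key.
using IdByName = std::map<std::string, int, std::less<>>;

// Returns the mapped value or fallback. No temporary std::string is built
// for the key: the string_view is compared against the stored keys directly.
int id_or(const IdByName& ids, std::string_view name, int fallback) {
    const auto it = ids.find(name);
    return it != ids.end() ? it->second : fallback;
}
```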

That’s possible because less<> is a transparent comparator and std::string converts implicitly to string_view, so the standard comparison operators already work between the two (thus, we don’t need to write operator< between std::string and std::string_view ourselves). That’s powerful.

It should be clear that string_view can be dramatically helpful to your daily job and I think it’s quite useless to show you other examples to support this fact. Rather, let me discuss a few common pitfalls I have met in the last years and how to cope with them.

#1: “losing sight of the string”

The first error I have encountered many times is storing string_view as a member variable and forgetting that it will not participate in the ownership of the char sequence:
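A minimal sketch of this kind of class (StatefulParser and Parse are the names used in the text; the body and the Current accessor are assumptions made for illustration):

```cpp
#include <string>
#include <string_view>

class StatefulParser {
public:
    // Stores only a pointer and a length: the characters stay owned by the caller.
    void Parse(std::string_view current) { m_current = current; }

    // Rejects std::string temporaries at compile time. With this overload in
    // place, a string literal argument would be ambiguous, so it has to be
    // wrapped in std::string_view explicitly.
    void Parse(std::string&&) = delete;

    std::string_view Current() const { return m_current; }

private:
    std::string_view m_current;  // valid only while the viewed string is alive
};
```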

Suppose that Parse is never called with a temporary (we can even enforce that assumption by deleting the rvalue overload). This code is fine as long as the caller of Parse keeps ‘current’ in scope. Then, some time later, a programmer who is not very familiar with string_view (or who is simply heedless) puts the following error in the code:
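As a sketch of how that mistake can look (the Preprocess function, the StoredLength accessor and the "-processed" suffix are hypothetical; only the name someProcessing comes from the text):

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical minimal parser that keeps a view of its last input.
class StatefulParser {
public:
    void Parse(std::string_view current) { m_current = current; }
    std::size_t StoredLength() const { return m_current.size(); }

private:
    std::string_view m_current;
};

void Preprocess(StatefulParser& parser, const std::string& current) {
    std::string someProcessing = current + "-processed";  // local, owns its characters
    parser.Parse(someProcessing);                         // the parser now views a local
}  // someProcessing is destroyed here: the stored view dangles
```

The stored length is still readable afterwards, but dereferencing the view is undefined behavior.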

‘someProcessing’ is a short-lived string, so StatefulParser will very likely end up referring to garbage.

So, string_view (as well as span, array_view, etc) is often not recommended as a data member. However, I think that string_view as data member sometimes is useful and in these scenarios we need to be prudent, just like using references and pointers as data members.

#2: replacing const string& with string_view

string_view seems a drop-in replacement for const std::string& because it provides nearly the whole set of std::string’s const functions and also because it’s a view (reference). So, the general rule you hear pretty much everywhere (especially nowadays that string_view has officially joined the C++ standard) is “whenever you see const string&, just replace it with string_view”.

So let’s do that:

void I_dont_know_how_string_will_be_used_but_i_am_cool(const string& s);

We turn into:

void I_dont_know_how_string_will_be_used_but_i_am_cool(string_view s);

As users of this function, we are now permitted to pass whatever valid string_view, aren’t we?

As writers of this function, we may now have serious problems.

We have introduced a subtle change to our interface that breaks a sort of guarantee that we had before:  null-termination. string_view does not require (and then does not necessarily handle) a null-terminated sequence. On the other hand, string guarantees to get one back – with c_str().

Maybe you don’t need that feature; in that case the rest of the interface should be ok. Otherwise, if you are lucky, your code simply stops compiling because you are using c_str() somewhere. If instead you are using data(), the code continues to compile just fine, because string_view provides data() as well.

This is not a syntactic detail. What should be clear is that the interface of ‘I_dont_know_how_string_will_be_used_but_i_am_cool’ has not changed seamlessly, because now the user can pass in a sequence of characters that is not null-terminated:

string something = "hello world";
I_dont_know_how_string_will_be_used_but_i_am_cool(string_view{something.data(), 5}); // hello

Suppose at some point you call a C function expecting a null-terminated string (it’s common), so you pass it the result of .data() on the string_view. What the C function sees is “hello world\0” instead of what the user expected (“hello”). In this case, you may only get a logical error, because there is a \0 at the end of the underlying string. In this other case you are not so lucky:

char buff[] = {'h', 'e', 'l', 'l', 'o'};
I_dont_know_how_string_will_be_used_but_i_am_cool(string_view{buff, 5});

Even if uncommon (generally string_view refers to real strings, that are always null-terminated), that’s even worse, isn’t it?

In general, string_view “relaxes” (does not have) that requirement on null-termination (it’s just a wrapper on const char*). Imagine that the DNA, the identity, of string_view is made of both the pointer to the sequence of characters and the number of referred characters (the length of the span). On the other hand, since string::c_str() guarantees that the returned sequence of characters is null-terminated, you can think of the identity of a string as just what c_str() returns: the length is redundant information (e.g. computable with strlen(str.c_str()), as long as the string has no embedded \0).

To conclude this point, replacing const string& with string_view is safe as long as you don’t expect a null-terminated string. If you are using c_str(), you find that out at compile time because the code simply does not compile; otherwise you are possibly in trouble.

Since we are on the subject: replacing const string& with string_view has also another (minor) consequence because string_view involves some work, that is copying a pointer and a length. The latter is an extra, compared to const string&. That’s just theory. In practice you measure when in doubt.

#3: string = string_view::data() + string_view::size()

From the previous point, it should be evident that wherever you need to create a string from a string_view you have to use both data() and size(), and not only data(). You have to use the DNA of string_view. I have reviewed this error many times:

string_view sv = ...;
string s = sv.data(); // possibly UB

It does not work in general, for the same reasons I have just shown you: this constructor of std::string requires a null-terminated sequence of characters.

From C++17 you can just use one of string’s constructors:

string s { sv };

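If you are still on std::experimental::string_view, there is also to_string, for auto-everything-syntax fans (note that it did not make it into std::string_view):

auto s = sv.to_string(); // std::experimental::string_view only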

Before C++17, we have to use data() + size():

string s { sv.data(), sv.size() };

Clearly, as for std::string, you have to do the same for other string types. E.g.:

CString cstr ( sv.data(), static_cast<int>(sv.size()) ); // CString takes the length as an int

#4: numerical conversions

Although C and C++ provide many functions to perform conversions between a number and a string/C-string (and vice versa), none of them accepts a range of characters (e.g. begin + end, or begin + length). Moreover, every C/C++ conversion function expects the input string to be null-terminated. These facts lead to the conclusion that, before C++17, no function exists that can convert a string_view into a number out of the box. We can use some C/C++ functions, but with limitations. I’ll show you some in this section.

For instance, using atoi or C++11 functions we fall into traps or undefined behavior:

So, how to properly convert a string_view into a number? Many ways exist, generally motivated by different requirements and compromises. For the rest of this section I’ll refer only to int conversions because the end of the story is similar for other numeric types.

Sometimes, although it seems counterintuitive, to fulfill the null-termination requirement we can create an intermediate std::string (or char array):
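A minimal sketch of that idea (to_int_via_string is a hypothetical name, and std::stoi is just one of the conversion functions that could be used on the copy):

```cpp
#include <string>
#include <string_view>

int to_int_via_string(std::string_view sv) {
    // Null-terminated copy of exactly sv.size() characters.
    const std::string copy{sv};
    // Throws std::invalid_argument or std::out_of_range on failure.
    return std::stoi(copy);
}
```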

Actually, once we have a std::string we can rely on any C and C++ conversion function. Such an intermediate copy into a std::string is sometimes affordable because certain numeric types, like int, need only a few characters (e.g. at most 11 for a 32-bit int, sign included). As long as the char sequence really contains such a short number, the resulting std::string will be created without allocating dynamic memory thanks to SSO (Small String Optimization). Clearly, that shortcut does not hold for bigger numeric types, and in general it is not portable because SSO is an implementation detail.

Other fragile solutions I encountered were based on sscanf and friends:
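One possible shape of such a solution, as a sketch only: the characters are copied into a fixed stack buffer so that sscanf sees a null-terminated string. The function name and the 24-character buffer are assumptions for illustration:

```cpp
#include <cstdio>
#include <cstring>
#include <string_view>

// Returns true if an int could be read from the beginning of sv.
bool sscanf_to_int(std::string_view sv, int& out) {
    char buffer[24];
    if (sv.empty() || sv.size() >= sizeof buffer) {
        return false;  // nothing to read, or too long for the fixed buffer
    }
    std::memcpy(buffer, sv.data(), sv.size());
    buffer[sv.size()] = '\0';
    // %d skips leading whitespace and stops at the first non-digit.
    return std::sscanf(buffer, "%d", &out) == 1;
}
```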

In some cases this code does not behave how we expect – e.g. when the converted value overflows and when the sequence contains leading whitespaces. Although I don’t recommend this approach, compared to the previous one, it only allocates a fixed amount of characters (e.g. 24) on the stack.

In many other cases, the approach depends strictly on how string_view is employed. This means that we have to make some assumptions. For example, suppose we write a parser for URLs where we assume that tokens are separated by ‘/’. Since atoi and strtol stop at the first character they cannot interpret as part of the number, if the whole URL is both well-formed and stored in a null-terminated string (assumptions/preconditions) we can use such functions quite safely:

Basically, we assumed that the character past the end of any string_view is either a delimiter or the null-terminator. Pragmatically, many times we can make such assumptions, even if they distance our solution from genericity.

So, I encountered code like this:
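A sketch of that kind of function, assuming the parse_int(string_view, int&) shape used later in this section and the same precondition as before (the character past the end of the view is a delimiter or the null-terminator):

```cpp
#include <cstddef>
#include <cstdlib>
#include <string_view>

// Reads an int from the beginning of sv and returns what is left of the view.
// No error handling: this relies entirely on the stated precondition.
std::string_view parse_int(std::string_view sv, int& value) {
    char* endPtr = nullptr;
    value = static_cast<int>(std::strtol(sv.data(), &endPtr, 10));
    return sv.substr(static_cast<std::size_t>(endPtr - sv.data()));
}
```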

In this example we use strtol to read an int and then we return the rest of the string_view. We basically try to “consume”  an int from the beginning of the string_view.

Note that C and C++ conversion functions have more or less relaxed policies on errors (mainly for performance reasons). For instance, if the conversion cannot be performed, strtol returns 0 and if the representation overflows, it sets errno to ERANGE. Instead, in the latter case the return value of atoi is undefined. What I really mean is that if you decide to use such functions then you are going to accept the consequences of their limitations. So, just pay attention to such limitations and take actions if needed. For example, a more defensive version of the previous code is:
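One way such a defensive version can be sketched is shown below. The name parse_int_checked and the std::optional return type are choices made here for illustration; the length of the view is still not taken into account:

```cpp
#include <cerrno>
#include <climits>
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <string_view>

// Returns the rest of the view on success, std::nullopt on failure.
std::optional<std::string_view> parse_int_checked(std::string_view sv, int& value) {
    errno = 0;
    char* endPtr = nullptr;
    const long parsed = std::strtol(sv.data(), &endPtr, 10);
    if (endPtr == sv.data()) {
        return std::nullopt;  // no conversion was performed
    }
    if (errno == ERANGE || parsed < INT_MIN || parsed > INT_MAX) {
        return std::nullopt;  // the value does not fit into an int
    }
    value = static_cast<int>(parsed);
    if (*endPtr != 0) {
        // Something follows the number: hand back the remaining characters.
        return sv.substr(static_cast<std::size_t>(endPtr - sv.data()));
    }
    return std::string_view{};  // the terminator was reached, nothing is left
}
```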

The fact that it makes sense to check against the null-terminator (if (*endPtr != 0)) is the fundamental assumption we made here. Generally such an assumption is easy to make. Scenarios like this, instead:

string whole = "12345";
parse_int ( {whole.data(), 3}, i );

are still not covered, because the length of the string_view is not taken into account. For this, we have at least three options: create and use an intermediate std::string (or a std::stringstream, although only std::string benefits from SSO), improve the sscanf-based solution so that it somehow uses such information, or write a conversion function manually. It’s quite clear that C++ lacks a set of simple functions to convert char ranges to numbers easily, efficiently and with robust error handling.

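Actually, I think the most elegant, robust and generic solution is based on boost::spirit: its qi::parse function takes a begin/end pair of iterators and a parser such as qi::int_, so it works directly on the range covered by a string_view.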

However, if you don’t already depend on boost, it’s quite inconvenient to add it just for converting strings into numbers.

We have a happy ending, though. Finally, C++17 fills this gap by introducing elementary string conversion functions:
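A minimal sketch of how std::from_chars can be wrapped for string_view (view_to_int and the std::optional return type are illustrative choices, not part of the standard):

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Converts the characters of sv, and only those, into an int.
std::optional<int> view_to_int(std::string_view sv) {
    int value = 0;
    const auto [ptr, ec] = std::from_chars(sv.data(), sv.data() + sv.size(), value);
    if (ec != std::errc{}) {
        return std::nullopt;  // invalid input or out-of-range value
    }
    // ptr points to the first character that was not consumed; a stricter
    // variant could also require ptr == sv.data() + sv.size().
    return value;
}
```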

This new function, from_chars, just converts the given range of characters into an integer. It is locale-independent, non-allocating, and non-throwing. Only a small subset of the parsing policies used by other libraries (such as sscanf) is provided. This is intended to allow the fastest possible implementation. Clearly, overloads for other numeric types are provided by the standard.

To be thorough, here is an example of the opposite operation, using to_chars:
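A minimal sketch with std::to_chars (int_to_string and the buffer size are illustrative choices). Note that to_chars does not write a null terminator, so the length has to be taken from the returned pointer:

```cpp
#include <charconv>
#include <cstddef>
#include <string>
#include <system_error>

std::string int_to_string(int value) {
    char buffer[12];  // enough for a 32-bit int: 10 digits and a sign, plus one spare
    const auto [ptr, ec] = std::to_chars(buffer, buffer + sizeof buffer, value);
    if (ec != std::errc{}) {
        return {};  // the buffer was too small
    }
    // ptr is one past the last character written.
    return std::string(buffer, static_cast<std::size_t>(ptr - buffer));
}
```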

Both to_chars and from_chars return a minimal output which contains an error flag and a pointer to the first character at which the parsing stopped (e.g. something like what is written into endPtr in the strtol example).

Are you already looking forward to putting your hands on them?!

 

Here is a wrap-up of the main points we covered in this post:

  • string_view is a smart const char*: an object that refers to a constant sequence of characters, keeps track of its length and provides nearly all the const functions of std::string;
  • just like with a reference or a pointer, you have to be careful when storing string_view as a member variable;
  • string_view’s DNA is both the char sequence and the length:
    • the pointed sequence of characters is not necessarily null-terminated (which is why c_str() does not exist);
    • whenever you need to copy the content of a string_view into a string(-like container), you have to use both;
  • bear in mind that replacing const string& with string_view implies the user can start passing strings that are not null-terminated into your functions (just ask yourself if that makes sense);
  • to convert a string_view into a number:
    • pre-C++17: use boost::spirit if you can, agree to compromises and use C/C++ functions with their limitations, or roll some utilities yourself;
    • since C++17: use from_chars.
  • string_view is already available in:
    • Microsoft Visual Studio 2017 RC
    • clang HEAD 4.0 (or in 3.8, under the experimental include folder)
    • gcc HEAD 7.0
Comments
  1. Bartek F. says:

    Nice article!

    I haven’t played with string_view so far, but I’ve noticed that there’s also string_span/span in the GSL. It seems that string_view was renamed to string_span by the committee: http://stackoverflow.com/questions/34832090/whats-the-difference-between-span-and-array-view-in-the-gsl-library

    • Marco Arena says:

      Thanks man!

      string_span in the GSL originally was named string_view and then was renamed into string_span because string_view was joining the standard.

      gsl::string_span is actually a specialization of span (originally gsl::array_view) for char-sequences, but it totally lacks string-specific functions and utilities (e.g. find_last_not_of). Another major difference is that gsl::string_span is writable.

      At my company we have used a span-like construct for years (it was called array_view and I recently renamed it to span, conforming to GSL).

      Apart the ability to write, I don’t see any other real benefit of using string_span instead of string_view.

      • Bartek F. says:

        aaa… so I understand now. So those are two different things, but initially, I thought it’s only a rename. I need to play with that to understand it better! 🙂

  2. __vic says:

    There is no “string_view::to_string”. std::string_view is completely unaware of std::string. I understand that it was written about std::experimental::string_view, but people who read it now will be misled.
