Pay attention to unformatted nature of getline

Posted: November 15, 2015 in Programming Recipes

A couple of weeks ago I found a simple bug in a dusty corner of a production repository I usually work on. A certain feature was rarely used and it seemed to be covered by a test… Actually it was not: some years ago a guy, feeling protected by this test, changed the code and broke the feature, but nobody complained. Recently a user tried to use this old feature again and it didn’t work as expected.

I don’t want to bother you with all the details of the story, just to give you a bit of context: there was an old piece of code reading a small file through C FILE handles. Some years ago that piece of code was migrated to C++ streams (since the files to read were really small and simple) and a silly bug was introduced. Since this bug was really simple, I wondered whether it was caused by inattention or ignorance. I had a chat with the programmer who committed the change and discovered it was the latter. The fix was really easy.

Some days later I discussed this problem with some friends and realized they were unaware of it too. So I decided to write a short post about this story; maybe it will be useful to other coders.

Imagine you are using streams to read some data from the standard input. This is what the input looks like:

number
some words in a line
number
some words in a line
...

And then imagine the following code reading that input:

int num; string line;
while ( (cin >> num) && getline(cin, line) )
; // something

Did you spot any problems?

If not, don’t worry, it’s a bit subtle.

Consider the invisible characters contained in the input stream:

10'\n'
some words'\n'
...'\n'

Actually this is not formally true on Windows, but in general you have an LF character at the end of each line. Let’s follow the code flow:

  • cin >> num correctly reads the int, stopping at (in the language of streams: “detecting but not consuming”) '\n'.
  • getline(cin, line) now reads the next line until it encounters a line separator ('\n' by default). But '\n' is still in the stream buffer, so getline consumes it and returns immediately, storing nothing in line.
  • cin >> num is evaluated again, but this time it fails because the next characters in the stream (“some words…”) are not an int. failbit is set and the loop terminates.
  • The user complains because the feature does not work as he expects. Ok, sorry, this is not part of the code flow…
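
The flow above can be reproduced without touching the real standard input. What follows is only a minimal sketch: it assumes the input is fed through a std::istringstream instead of cin, and read_records_buggy is a made-up helper name, not part of the original code.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: runs the loop from the post on an in-memory input
// and returns every (number, line) pair the loop body gets to see.
std::vector<std::pair<int, std::string>> read_records_buggy(const std::string& input)
{
    std::istringstream in(input);
    std::vector<std::pair<int, std::string>> records;
    int num;
    std::string line;
    // operator>> leaves the '\n' that follows the number in the stream,
    // so the getline right after it only sees an empty line.
    while ((in >> num) && std::getline(in, line))
        records.emplace_back(num, line);
    return records;
}
```

Calling read_records_buggy("10\nsome words in a line\n20\nsome other words\n") yields a single record: the number 10 paired with an empty string. The second number is never reached.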

We just experienced a difference between operator>> and getline: the former skips any leading whitespace (actually any separator, according to the locale in use) before performing the read operation, whereas the latter does not.

Basically, it has to do with the difference between formatted and unformatted input functions. Stream operators (like operator>> for strings) belong to the former category, getline to the latter. In short, among other differences, formatted input functions skip leading separators (e.g. spaces, LF) by default, while unformatted functions do not.
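
What “separator according to the locale in use” means is easier to see with an example. The sketch below assumes a custom std::ctype<char> facet that classifies ',' as whitespace on top of the classic table; comma_is_space and read_two_ints are hypothetical names chosen only for this illustration.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: the classic "C" character table, plus ',' marked as whitespace.
struct comma_is_space : std::ctype<char>
{
    comma_is_space() : std::ctype<char>(make_table()) {}

    static const mask* make_table()
    {
        // Copy the classic table once, then flag the comma as a space character.
        static std::vector<mask> table(classic_table(), classic_table() + table_size);
        table[','] |= space;
        return table.data();
    }
};

// Hypothetical helper: reads two ints with operator>>, optionally after
// installing the facet above on the stream.
bool read_two_ints(const std::string& input, bool comma_as_space, int& a, int& b)
{
    std::istringstream in(input);
    if (comma_as_space)
        in.imbue(std::locale(in.getloc(), new comma_is_space)); // the locale owns the facet
    return static_cast<bool>(in >> a >> b);
}
```

With the facet installed, read_two_ints("1,2", true, a, b) succeeds because the formatted read of b skips the comma as if it were a blank. Without it, the same input fails on the second read.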

The long story is: both formatted and unformatted functions create basic_istream<CharT>::sentry objects to prepare input streams for I/O (e.g. checking the validity of the stream). One of the operations performed in the sentry constructor is skipping leading whitespace. To decide whether to skip or not, it uses two pieces of information:

  • a bool parameter passed to the constructor, which is false by default. Don’t get confused: it’s false when you want the sentry object to skip whitespace (in fact, it’s generally called noskipws, e.g. _Noskip on Visual Studio).
  • ios_base::skipws flag (set or not on the stream object).

If _Noskip is false and ios_base::skipws is set, then leading whitespace will be skipped.

I am sure you can already imagine the rest of the story: when a formatted function creates a sentry, the parameter is left at its default value (false) and, since cin’s ios_base::skipws is set, operations like cin >> num work as expected even if some whitespace stands in front of the int value. Conversely, unformatted functions create sentry instances by explicitly passing true to the constructor. For this reason the lonely leading '\n' is not discarded.
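
The interplay between the constructor parameter and the skipws flag can be observed by building a sentry by hand. This is just a sketch for experimenting: library code normally creates the sentry for you, and first_char_after_sentry is a hypothetical helper, not something found in the standard library.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: constructs a sentry manually and reports the first
// character still waiting in the stream afterwards.
char first_char_after_sentry(const std::string& input, bool noskipws_arg, bool skipws_flag)
{
    std::istringstream in(input);
    if (skipws_flag)
        in.setf(std::ios_base::skipws);
    else
        in.unsetf(std::ios_base::skipws);

    // The second constructor argument is the bool parameter described above.
    std::istream::sentry guard(in, noskipws_arg);
    (void)guard;

    // peek() is an unformatted function, so it does not skip anything itself.
    return static_cast<char>(in.peek());
}
```

On the input " \n\tX", only the combination noskipws_arg == false with the skipws flag set leaves 'X' as the next character. In the other combinations the leading blank is still there.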

[note]

Beware that something about formatted/unformatted functions changed between C++98 and C++11, in particular istream& operator>>(streambuf*): in C++98 it was a formatted operation, now it is unformatted.

[/note]

Why does getline preserve leading separators? I think because it’s a kind of raw read operation. Note that if the delimiter is found, it is extracted and discarded (i.e. it is not stored and the next input operation will begin after it). This is important, because it enables code like this to work as expected:

stringstream ss("the first line\nthe second line");
string line;
while (getline(ss, line))
{ ... // line does not contain '\n' at the end
}
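
A self-contained variant of the same idea, as a minimal sketch: split_lines is a hypothetical helper that collects what getline stores, so it is easy to check that the delimiter is consumed but never kept.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits a text into lines using getline.
std::vector<std::string> split_lines(const std::string& text)
{
    std::istringstream ss(text);
    std::vector<std::string> lines;
    std::string line;
    // Each call extracts up to and including '\n', but stores only what comes before it.
    while (std::getline(ss, line))
        lines.push_back(line);
    return lines;
}
```

For "the first line\nthe second line" this returns the two lines without any trailing '\n'. A text starting with '\n' produces an empty first element, which is exactly the behavior behind the bug described earlier.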

How did we fix this issue?

The simplest thing you can do is just:

while ( (cin >> num >> std::ws) && getline(cin, line) )
;

The left-hand side reads the int and then skips the whitespace that follows it in the stream. std::ws is an input manipulator created for this purpose.
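
To see the fix at work, here is a minimal sketch that runs the std::ws loop on an in-memory input. As an assumption for testability it uses std::istringstream rather than cin, and read_records_ws is a made-up helper name.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: the loop with std::ws, collecting every record it reads.
std::vector<std::pair<int, std::string>> read_records_ws(const std::string& input)
{
    std::istringstream in(input);
    std::vector<std::pair<int, std::string>> records;
    int num;
    std::string line;
    // std::ws eats the '\n' left behind by operator>>, so getline
    // starts at the first visible character of the next line.
    while ((in >> num >> std::ws) && std::getline(in, line))
        records.emplace_back(num, line);
    return records;
}
```

With the input "10\nsome words in a line\n20\nmore words\n" both records are now read, each number paired with the text on the following line.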

A bunch of other solutions are possible. For example the one using ignore:

while ( (cin >> num).ignore(numeric_limits<streamsize>::max(), '\n') && std::getline(cin, line))

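Here we discard characters (any characters, not just separators) until either count characters have been discarded, the delimiter (specified by the second parameter) is found and discarded, or the end of the stream is reached. Passing numeric_limits<streamsize>::max() as count effectively disables the count limit.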
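
The three stopping conditions of ignore can be checked in isolation. The following is a sketch under the assumption that looking at “what is left in the stream” is a good enough observation; rest_after_ignore is a hypothetical helper.

```cpp
#include <cassert>
#include <ios>
#include <limits>
#include <sstream>
#include <string>

// Hypothetical helper: calls ignore(count, delim) and returns whatever
// is still unread in the stream afterwards.
std::string rest_after_ignore(const std::string& input, std::streamsize count, char delim)
{
    std::istringstream in(input);
    in.ignore(count, delim);
    std::string rest;
    // Using '\0' as delimiter reads everything up to the end of this input.
    std::getline(in, rest, '\0');
    return rest;
}
```

For example, with count set to 3 the call stops after three characters even if no delimiter was seen, while with numeric_limits<streamsize>::max() it stops right after the first '\n', and only the first one.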

Not only is the former solution simple and effective enough, but it also prevents oversights like:

10'\n'
'\n'
some words

Above, the user left an empty line after the first number. The std::ws solution does not break, whereas the ignore one does. On the other hand, the std::ws solution does not preserve leading whitespace, if you need it. But that was not our case (you can imagine the final code looked a bit more defensive than the one I showed here, but the main observations hold).
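
The trade-off between the two fixes can be compared side by side. This is only an illustrative sketch: skip_strategy and line_after_number are hypothetical names, and the function reads a single number followed by a single line.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <limits>
#include <sstream>
#include <string>

// Hypothetical switch between the two fixes discussed above.
enum class skip_strategy { ws, ignore_rest_of_line };

// Hypothetical helper: reads a number, applies the chosen fix,
// then returns the line that getline reads next.
std::string line_after_number(const std::string& input, skip_strategy how)
{
    std::istringstream in(input);
    int num = 0;
    std::string line;
    in >> num;
    if (how == skip_strategy::ws)
        in >> std::ws;  // skips all whitespace, blank lines included
    else
        in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');  // skips up to the first '\n'
    std::getline(in, line);
    return line;
}
```

On "10\n\nsome words\n" the std::ws strategy returns "some words" while the ignore strategy returns an empty string. On "10\n   indented\n" the roles flip: ignore keeps the three leading spaces, std::ws drops them.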

One can also develop a proxy object to allow code like this:

cin >> num >> std::ws >> as_line >> line;

as_line may also embody the std::ws part:

cin >> num >> as_line >> line;

It’s not hard to code such machinery. For example:

struct lines_proxy
{
	istream& operator()(string& s)
	{
		return getline(is >> std::ws, s);
	}

	istream& is;
};

struct line_t {} as_line;

lines_proxy operator>>(istream& is, line_t)
{
	return{ is };
}

istream& operator>>(lines_proxy p, string& s)
{
	return p(s);
}

...

while (cin >> num >> as_line >> line)

Just for fun.

It’s over for today. The leading actor of our story was a really silly problem but the lesson learned was interesting: even if streams were designed as a “formatted” abstraction on top of I/O buffers, unformatted operations are still there. Mixing formatted and unformatted operations should be done with care.

Comments
  2. xboos says:

    Great lesson. Thank you.

  3. mingy says:

    what the `…(actually any separator – according to the locale in use) …` means? can you make a example?
