Merging intervals in next-gen C++

Posted: March 8, 2023 in Competitive Programming, Programming Recipes

A few weeks ago, I set this problem at Coding Gym: given an array of intervals where intervals[i] = [start_i, end_i], merge all overlapping intervals, and return an array of the non-overlapping intervals that cover all the intervals in the input.

Here is an example:

Input: intervals = [[1,3],[2,6],[8,10],[15,18]]
Output: [[1,6],[8,10],[15,18]]
Explanation: Since intervals [1,3] and [2,6] overlap, merge them into [1,6].

In this post I am revisiting this classical problem with next-generation C++.

You probably know that a naïve approach is quadratic: check each interval against all the others and, whenever two overlap, merge the second into the first and remove it from the array.
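
For reference, here is a deliberately simple sketch of that naïve idea (mergeNaive is a hypothetical helper, shown only for contrast with what follows):

vector<vector<int>> mergeNaive(vector<vector<int>> intervals)
{
	bool mergedSomething = true;
	while (mergedSomething) // repeat until no pair overlaps anymore
	{
		mergedSomething = false;
		for (auto i = 0u; i < intervals.size() && !mergedSomething; ++i)
		{
			for (auto j = i + 1; j < intervals.size() && !mergedSomething; ++j)
			{
				if (intervals[i][0] <= intervals[j][1] && intervals[j][0] <= intervals[i][1]) // overlap?
				{
					// merge j into i and drop it; the grown interval might
					// overlap intervals we already checked, hence the restart
					intervals[i] = {min(intervals[i][0], intervals[j][0]), max(intervals[i][1], intervals[j][1])};
					intervals.erase(begin(intervals) + j);
					mergedSomething = true;
				}
			}
		}
	}
	return intervals;
}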

It’s worth recalling how to determine when two intervals overlap. Assuming that interval1 precedes interval2, the magic condition is:

interval1.end >= interval2.begin

For example:

1 5
3 6

[1 5] overlaps with [3 6], indeed 5 >= 3.
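
In code, using a pair<int, int> per interval as later in this post, the check might read like this minimal sketch (overlaps is a hypothetical helper):

// assumes i1 starts no later than i2
// (which holds once the intervals are sorted by starting point)
bool overlaps(const pair<int, int>& i1, const pair<int, int>& i2)
{
	return i1.second >= i2.first;
}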

To reduce the number of checks, first sort the intervals by the starting point, then check if each interval overlaps only with the next one. Indeed, since the array is sorted, if an interval does not overlap with the next one then it can’t overlap with any. Bingo!

When merging two intervals together, just pay attention to the ending point: it’s the maximum of both. For example:

2 8
3 5

The result is [2, max(8, 5)] = [2, 8].

The logic applies until we find two disjoint intervals. At that point, “commit” the current merged interval and continue. Here is a first C++ solution:

vector<vector<int>> merge(vector<vector<int>>& intervals) 
{
	sort(begin(intervals), end(intervals));
	vector<int> curr = intervals[0]; // the problem guarantees at least one interval
	vector<vector<int>> merged;
	for (auto i=1u; i<intervals.size(); ++i)
	{
		if (curr[1] >= intervals[i][0]) // overlap?
		{
			curr[1] = max(intervals[i][1], curr[1]); // update ending point
		}
		else
		{
			merged.push_back(curr); // "commit" current merged interval
			curr = intervals[i];
		}
	}
	merged.push_back(curr); // last one is not committed yet
	return merged;
}

The use of vector<vector<int>> is because of LeetCode. Here is an in-place variant that uses a vector<pair<int, int>>:

sort(begin(intervals), end(intervals)); // intervals is vector<pair<int, int>>
auto j = 0u;
for (auto i=1u; i<intervals.size(); ++i)
{
    if (intervals[j].second >= intervals[i].first) // overlap?
    {
        intervals[j].second = max(intervals[i].second, intervals[j].second); // keep merging into the current one
    }
    else
    {
        ++j; // next "assignable" spot
        intervals[j] = intervals[i]; // next interval is the new current one
    }
}
intervals.resize(j + 1);
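
For convenience, here is the same in-place variant wrapped into a hypothetical mergeInPlace function, together with a usage sketch on the example above:

void mergeInPlace(vector<pair<int, int>>& intervals)
{
    if (intervals.empty())
        return; // resize(j + 1) below assumes at least one interval
    sort(begin(intervals), end(intervals));
    auto j = 0u;
    for (auto i = 1u; i < intervals.size(); ++i)
    {
        if (intervals[j].second >= intervals[i].first) // overlap?
            intervals[j].second = max(intervals[i].second, intervals[j].second);
        else
            intervals[++j] = intervals[i]; // next "assignable" spot
    }
    intervals.resize(j + 1);
}

int main()
{
    vector<pair<int, int>> intervals{{1, 3}, {2, 6}, {8, 10}, {15, 18}};
    mergeInPlace(intervals);
    for (const auto& [b, e] : intervals)
        cout << b << " " << e << "\n"; // prints: 1 6, 8 10, 15 18
}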

The two solutions work correctly and might be optimized even further. However, in the rest of the article I will have some fun with patterns.

Let’s go!

Playing with patterns

Consider this sample (sorted already for your convenience):

1 2
4 6
5 8
6 7
9 10

I recommend focusing on the first version of the algorithm (not in-place) because I think spotting the pattern is a bit easier.

To spot a pattern, it’s often useful to write down how the algorithm works, perhaps tracking how some variables evolve. Over time, after practicing, you will visualize it in your mind automatically. Sometimes, even when you are not consciously paying attention to the problem, an insight might knock at your door. Let me guide you through the one I had.

Let’s write down what curr looks like while the algorithm runs:

vector<vector<int>> merge(vector<vector<int>>& intervals) 
{
	sort(begin(intervals), end(intervals));
	vector<int> curr = intervals[0];
	std::cout << curr[0] << " " << curr[1] << "\n"; // <<- PRINT HERE
	vector<vector<int>> merged;
	for (auto i=1u; i<intervals.size(); ++i)
	{
		if (curr[1] >= intervals[i][0]) 
		{
			curr[1] = max(intervals[i][1], curr[1]); 
		}
		else
		{
			merged.push_back(curr); 
			curr = intervals[i];
		}
		std::cout << curr[0] << " " << curr[1] << "\n"; // <<- PRINT HERE
	}
	merged.push_back(curr); // last one is not committed yet
	return merged;
}

All eyes on me. Voilà:

1 2
4 6
4 8
4 8
9 10

Do you see the pattern already?

No? Ok, let me take you one step forward.

That “imaginary” array is a trace of both the final and the intermediate values that contribute to assembling merged, which is none other than the output of the merging. We distinguish two cases:

  • merging is in progress: the overlapping condition is true and curr is consuming all the next overlapping intervals. It’s an intermediate value;
  • otherwise, the overlapping condition is false and then curr has to be committed to the result. It’s a final value.

Intervals that will be merged together are easy to spot: they share the same starting value.

Suppose you have this array available. How would you find the solution to the problem?

The question might seem silly: we discard intermediate results, don’t we? Let’s jot down a couple of observations:

  • we can have an arbitrary number of intermediate values but only one final value;
  • for each merging phase, the final value corresponds to the last intermediate value (we just commit it when we find that it does not overlap with the next interval).

At this point, if I may, I would dare to suggest a theory: we go through all the pairs of this imaginary array one by one and remove all those sharing the very same first element (the interval start) except for the last one. Let me show you an example of this idea:

1 2  <- keep, since it does not overlap with the next one
4 6  <- remove, since it's being merged and it's not the last
4 8  <- remove, since it's being merged and it's not the last
4 8  <- keep, since it's being merged and it's the last
9 10 <- keep, since it does not overlap with the next one

Indeed, we get:

1 2
4 8
9 10

Please take your time to review this process and, if you need, try the logic on other samples.

This leads us to explore how this imaginary array can be built and pruned.

Let me start from the latter. How would you remove all the intermediate values except for the last one?

Well, two known patterns from the standard library come to the rescue:

imaginary.erase(begin(imaginary), unique(rbegin(imaginary), rend(imaginary), [](const auto& i1, const auto& i2){
    return i1[0] == i2[0]; // do they share the very same starting point?
}).base());

Indeed, unique removes adjacent elements that compare equal (according to a predicate we supply), keeping only the first of each group. We need to juggle with reverse iterators a bit to keep the last rather than the first.

Now the best part.

How to build imaginary?

If you think about it for a moment, curr is just an accumulator that is initialized with intervals[0] (printed at the beginning).

Then we iterate the other intervals one by one and compare each with the accumulator:

  • if they do not overlap, the ith element is assigned to the accumulator (in the algorithm itself, the previous accumulator value is committed to the result vector);
  • if they do overlap, only the ending point of the accumulator is updated (the accumulator is not committed yet).

The imaginary array is a sort of trace of the accumulator. If we had been interested only in the final value, we would have used accumulate. However, we want to keep track of intermediate values. It’s a running operation…

That’s a partial_sum:

vector<vector<int>> imaginary;
partial_sum(begin(intervals), end(intervals), back_inserter(imaginary), [](auto curr, auto i){
    return curr[1] >= i[0] ? vector{curr[0], max(curr[1], i[1])} : i;
});
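
For the record, putting the two pieces together gives a complete classic-STL version of this pattern (mergeViaTrace is a hypothetical name; assume the usual <algorithm>, <numeric>, <iterator> and <vector> headers):

vector<vector<int>> mergeViaTrace(vector<vector<int>>& intervals)
{
	sort(begin(intervals), end(intervals));
	vector<vector<int>> imaginary;
	partial_sum(begin(intervals), end(intervals), back_inserter(imaginary), [](auto curr, auto i){
		return curr[1] >= i[0] ? vector{curr[0], max(curr[1], i[1])} : i;
	});
	// keep only the last interval of each group sharing the same starting point
	imaginary.erase(begin(imaginary), unique(rbegin(imaginary), rend(imaginary), [](const auto& i1, const auto& i2){
		return i1[0] == i2[0];
	}).base());
	return imaginary;
}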

This works like a charm. However, it costs that extra imaginary vector. Can we do better?

Without beating around the bush, we can use ranges (to be honest, range-v3). views::partial_sum saves us that intermediate vector. However, at first I struggled with the second part because the unique + reverse + erase combo would still need to materialize the array, because of reverse.

My result was ugly and useless. It’s not worth sharing here.

At some point, I noticed one view I was unaware of. Or better, I knew it existed but I was unaware of one subtle detail.

The problem with views::unique is that it eliminates all except the first element from every consecutive group of equivalent elements from the range. Just like std::unique. It makes sense.

What I didn’t know was that another view does exactly the same thing except for one detail: it keeps the last element only.

That’s an adjacent_remove_if:

sort(intervals);
auto merged = views::partial_sum(intervals, [](auto curr, auto i){
    return curr[1] >= i[0] ? std::vector{curr[0], max(curr[1], i[1])} : i;
}) | views::adjacent_remove_if([](auto i1, auto i2){
    return i1[0] == i2[0];
});

Voilà!
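
To make the difference between the two views concrete, here is a toy sketch (assuming range-v3’s views::unique, which accepts a custom predicate just like adjacent_remove_if):

vector<pair<int, int>> v{{4, 6}, {4, 8}, {4, 9}};
auto sameStart = [](const auto& i1, const auto& i2){ return i1.first == i2.first; };

// views::unique keeps the FIRST of each group: prints "4 6"
for (const auto& [b, e] : v | views::unique(sameStart))
    std::cout << b << " " << e << "\n";

// views::adjacent_remove_if keeps the LAST of each group: prints "4 9"
for (const auto& [b, e] : v | views::adjacent_remove_if(sameStart))
    std::cout << b << " " << e << "\n";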

Here is the full solution making use of vector<pair<int, int>> as input:

sort(intervals);
auto merged = views::partial_sum(intervals, [](auto curr, auto i){
    return curr.second >= i.first ? std::pair{curr.first, max(curr.second, i.second)} : i;
}) | views::adjacent_remove_if([](auto i1, auto i2){
    return i1.first == i2.first;
});

for (const auto& [i,j] : merged)
{
    std::cout << i << " " << j << "\n";
}

Bear in mind that the code above is lazy: when the range-based for loop pulls one interval, adjacent_remove_if extracts two from partial_sum, which in turn advances just two steps. And so on.

That was one possible way to merge overlapping intervals in next-gen C++.

Bonus pattern

When I found this solution, I had the feeling that another one was hiding in plain sight.

I was right.

My intuition was: what if I group the intervals that can be merged together?

Consider this example:

{1, 2}, {4, 6}, {5, 8}, {6, 7}, {9, 10}

My idea is to form these three groups:

{1, 2}
{4, 6}, {5, 8}, {6, 7}
{9, 10}

Then, each group can be merged individually and it’s guaranteed that it can’t be merged with any other (because they are sorted, remember?).

Merging each group works this way:

  • the starting value corresponds to the starting point of the first pair;
  • the ending value corresponds to the maximum of all the ending points.

Grouping things together…mmm…that’s a chunk_by! And the predicate is just…the overlapping condition!

We can sort out the second part in multiple ways. For example, for_each + yield (list comprehension) + max_element:

auto merged = views::chunk_by(intervals, [](const auto& i1, const auto& i2){
	return i1.second >= i2.first;
}) | views::for_each([](auto rng){
	return yield(std::pair{front(rng).first, *max_element(views::values(rng))});
});

The main issue here is that we go through each element…twice! In other words, chunk_by builds a group by iterating the pairs until it finds two that do not satisfy the condition. One iteration. Then, max_element does another iteration.

Apart from this, the solution can be useful, for example, to “visualize” which intervals are going to be merged together. Also, the schema suggests another classical pattern: distributing the “merging tasks” to multiple workers. Clearly, consider just the idea, not this specific task.

Conclusion

The arrival point is not directly intended for production. However, I am fully convinced that the intellectual journey matters here. This is deliberate practice and we have unlimited possibilities to learn something new.

For example, I learned that adjacent_remove_if keeps the last element of a run of adjacent duplicates. My brain played with patterns, and that is good. After all, our brains are made for that.

Anyway, this is not only about learning during the process. Starting from working code, we can achieve even more.

Indeed, I think that not only does every problem have value, but every solution does too.

Do challenge yourself and explore whatever you are most interested in. What tickles your curiosity?

For example, do you know if that call to partial_sum can be safely parallelized? You know, the update function must be associative. Is this the case or not? Can you prove it?

And what about an in-place version of the algorithm that – somehow – takes advantage of ranges?

Is the range-based solution slower than the others? Can you measure it?

What if we can’t sort the input in advance?

And so on.

There are unlimited possibilities.

I hope you enjoyed the article and I would be glad to hear your thoughts.
