Blog 2020 09 08 The big STL Algorithms tutorial: modifying sequence operations - how to get distinct elements
Post
Cancel

The big STL Algorithms tutorial: modifying sequence operations - how to get distinct elements

In this next part of the big STL algorithm tutorial, we will discover the 2 modifying sequence algorithms that will help you get unique elements of a container:

  • unique
  • unique_copy

Let’s get started!

unique

unique - and as a matter of fact unique_copy - could have been implemented as two algorithms, just like remove and remove_if are two different algorithms.

Consistency is not the strongest feature of the <algortihms> header.

In this case, we simply have two separate overloaded signatures, but let’s go on the goal of this algorithm.

unique will remove all the duplicated elements from a container. But only if they are consecutive. In case, you have two identical elements that are not placed next to each other, both supposed to be kept. But we are going to check that.

The return value is the same in both cases, it points at the new end() of the container after the duplicate ones were moved past the new end.

In the first example, we’ll use the simpler signature where we only pass in an input range defined by the usual two iterators pointing in the beginning and the end of the range.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
#include <algorithm>
#include <iostream>
#include <vector>


int main()
{
    std::vector<int> numbers{9, 1, 3, 3, 3, 5, 1, 6, 1};
    std::cout << "Original values: " << '\n';
    std::for_each(numbers.begin(), numbers.end(), [](auto i) {std::cout << i << " ";});
    std::cout << '\n';
    std::cout << '\n';
    
    std::cout << "size: " << numbers.size() << ", capacity: " << numbers.capacity() << '\n';
    auto oldEnd = numbers.end();
    auto newEnd = std::unique(numbers.begin(), numbers.end());
    std::cout << "same values are only removed if they are next to each other:" << '\n';
    std::for_each(numbers.begin(), newEnd, [](auto i) {std::cout << i << " ";});
    std::cout << '\n';
    std::cout << '\n';
    
    std::cout << std::boolalpha << "oldEnd == newEnd? :" << (oldEnd == newEnd) << '\n';
    std::cout << "In fact, the end hasn't changed. oldEnd == numbers.end(): " << (oldEnd == numbers.end()) << '\n';
    std::cout << "number of elements removed: " << std::distance(newEnd, oldEnd) << '\n';
    std::cout << "Though if you use the end, stranfe results are there..." << '\n';
    std::for_each(numbers.begin(), oldEnd, [](auto i) {std::cout << i << " ";});
    std::cout << '\n';
    std::cout << '\n';
    
    std::cout << "size: " << numbers.size() << ", capacity: " << numbers.capacity() << ", these values haven't changed" << '\n';
    numbers.erase(newEnd, oldEnd);
    numbers.shrink_to_fit();
    std::cout << "size: " << numbers.size() << ", capacity: " << numbers.capacity() << ", we should erase what is between the return value of unique() and the old end" << '\n';
}

An interesting fact you might notice is that although the end of the vector has not changed numbers.end() is the same before and after calling std::unique(), what we have between the returned iterator and the (original) end has become meaningless. We could also say it is dangerous to use.

In fact, this makes perfect sense if we remind ourselves of how the STL is designed. Algorithms do not operate on collections, but on iterators. std::unique moves elements around each other, but it doesn’t remove anything from the underlying collection. That’s the exact same reason why you cannot delete elements with std::remove, but you have to use the remove-erase idiom.

So, I’d say that if we want to use this in-place unique algorithm, we should never use that container as a whole anymore. Either we take care of removing the elements beyond the returned iterator or we don’t use it anymore.

If we want to reuse the original container, it’s better to use std::unique_copy, but before, let’s have a glance at the other version of unique where we can customize how elements are compared.

As an optional third argument, we can pass in a binary predicate. In more understandable English, you can pass in a function, function object, lambda function taking two arguments (two elements next to each other in the collection) returning a boolean. The predicate should return true if the two elements are to be considered the same (not unique), false otherwise.

Here is a short example.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
#include <algorithm>
#include <iostream>
#include <vector>

struct Person {
    long id;
    std::string name;
    std::string phoneNumber;
};

int main()
{
    std::vector<Person> people { {1, "John D Smith", "555-1234"}, {1, "John David Smith", "784-1234"}, {2, "Adam Jones", "555-7894"} };
    auto it = std::unique(people.begin(), people.end(), [](auto lhs, auto rhs){ return lhs.id == rhs.id; });
    std::for_each(people.begin(), it, [](auto i) {std::cout << i.name << " " << '\n';});
}

In the above example, we have different Person objects that might reference the very same physical being. So the names can differ a little bit, the phone numbers might be still different, but still want to consider two persons as the same. In this particular example, we can use the id for that, we make our comparison based on the id field.

Otherwise, there are no differences between the two different signatures.

  • unique_copy

std::unique_copy works similarly to std::unique, but while the latter moves values around in the original container, the former copies the values to be kept into a target container.

As we learned for other algorithms, the target container is passed after the input, and while the input is denoted by a pair of operators, the target is only by a single one. This target collection has to be big enough to accommodate all the elements. The simplest way is to use a back_inserter for this purpose.

The return value is the same as for std::unique, an iterator pointing at right after the last copied element. Does this make sense? It does. First, it’s consistent with unique and second, passing an inserter iterator as the target is not the only option. Maybe you created a big enough target collection for all the values and there will be some free capacity in the target. By free capacity in this case we mean zero constructed elements. In that case, it’s useful to see where the copied values end.

Let’s see an example of this case.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> numbers{9, 1, 3, 3, 3, 5, 1, 6, 1};
    std::vector<int> uniqueNumbers(numbers.size());
    
    auto it = std::unique_copy(numbers.begin(), numbers.end(), uniqueNumbers.begin());

    std::cout << "Content of uniqueNumbers: " << '\n';
    std::for_each(uniqueNumbers.begin(), uniqueNumbers.end(), [](auto i) {std::cout << i << " ";});
    std::cout << '\n' << '\n';
    
    std::cout << "Content of uniqueNumbers until the returned iterator: " << '\n';
    std::for_each(uniqueNumbers.begin(), it, [](auto i) {std::cout << i << " ";});
    std::cout << '\n';
}

It the above example we initialize the target vector with the size of the original one with contiguous duplicates. As such, after calling the unique_copy there will still zero-initialized elements in the target vector.

We should also see as a reminder that even though we called unique_copy, the copied elements are not necessarily unique, as only the neighbouring duplicates got removed - exactly as the contract of the unique* algorithms promises.

Conclusion

Today, we learned about unique and unique_copy, algorithms that remove duplicated items from a range if the duplicated values are next to each other. That’s their biggest catch - that duplicated elements should be next to each other, but it’s well documented.

Next time we’ll learn about the algorithms that bring us some randomness. Stay tuned!

This post is licensed under CC BY 4.0 by the author.