Idiomatically split a string_view

I read "The most elegant way to iterate the words of a string" and enjoyed the succinctness of the answer. Now I want to do the same for string_view. The problem is that stringstream can't take a string_view:

#include <iostream>
#include <string>
#include <sstream>
#include <algorithm>
#include <iterator>

int main() {
    using namespace std;
    string_view sentence = "And I feel fine...";
    istringstream iss(sentence); // <== error
    copy(istream_iterator<string_view>(iss),
         istream_iterator<string_view>(),
         ostream_iterator<string_view>(cout, "\n"));
}

So is there a way to do this? If not, why would such a thing not be idiomatic?



Solution 1:[1]

Split by a delimiter and return a vector<string_view>.

Designed for rapid splitting of lines in a .csv file.

Tested under MSVC 2017 v15.9.6 and Intel Compiler v19.0 compiled with C++17 (which is required for string_view).

#include <string_view>
#include <vector>

std::vector<std::string_view> Split(const std::string_view str, const char delim = ',')
{   
    std::vector<std::string_view> result;

    int indexCommaToLeftOfColumn = 0;
    int indexCommaToRightOfColumn = -1;

    for (int i=0;i<static_cast<int>(str.size());i++)
    {
        if (str[i] == delim)
        {
            indexCommaToLeftOfColumn = indexCommaToRightOfColumn;
            indexCommaToRightOfColumn = i;
            int index = indexCommaToLeftOfColumn + 1;
            int length = indexCommaToRightOfColumn - index;

            // Bounds checking can be omitted as logically, this code can never be invoked 
            // Try it: put a breakpoint here and run the unit tests.
            /*if (index + length >= static_cast<int>(str.size()))
            {
                length--;
            }               
            if (length < 0)
            {
                length = 0;
            }*/

            std::string_view column(str.data() + index, length);
            result.push_back(column);
        }
    }
    const std::string_view finalColumn(str.data() + indexCommaToRightOfColumn + 1, str.size() - indexCommaToRightOfColumn - 1);
    result.push_back(finalColumn);
    return result;
}

Be careful of lifetimes: a string_view should never outlive the parent string that it is a window into. If the parent string goes out of scope, then what the string_view points to is invalid. In this particular case, the API design makes it difficult to go wrong, as the input and output are all string_views, which are all windows into the parent string. This ends up being rather efficient in terms of memory copying and CPU usage.
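To make the lifetime rule concrete, here is a minimal sketch of one way to stay safe when tokens must outlive the parent string: copy them into owning std::string objects before the parent is destroyed. The names OwnTokens and FirstTwoFields are hypothetical and are not part of the answer above.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: copies each view into an owning std::string so the
// tokens stay valid after the parent string is gone.
std::vector<std::string> OwnTokens(const std::vector<std::string_view>& views)
{
    std::vector<std::string> owned;
    owned.reserve(views.size());
    for (const std::string_view v : views)
        owned.emplace_back(v);              // std::string(string_view) copies the characters
    assert(owned.size() == views.size());   // one owning string per view
    return owned;
}

// Hypothetical example: the views are only used while `line` is alive.
std::vector<std::string> FirstTwoFields()
{
    const std::string line = "A,B,C";            // parent string, local to this function
    const std::string_view whole = line;
    const std::vector<std::string_view> views = {
        whole.substr(0, 1), whole.substr(2, 1)   // windows into `line`
    };
    // Returning `views` itself would leave every element dangling,
    // because `line` is destroyed when this function returns.
    return OwnTokens(views);
}
```

The copy costs exactly what the string_view approach was saving, so it only makes sense at the boundary where the data has to escape the parent's scope.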

Note that the only downside of using string_view is losing implicit null termination, so use functions that accept a string_view, e.g. the lexical_cast functions in Boost for converting strings to numbers.
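As a standard-library alternative to the Boost functions mentioned above (a swapped-in technique, not what the answer used), C++17's std::from_chars takes a pointer range and therefore does not need a null terminator. The sketch below assumes a hypothetical helper name, ParseInt, and only handles base-10 int values.

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parses the whole view as a base-10 int.
// Returns std::nullopt if the text is empty, not a number, out of range,
// or followed by trailing characters.
std::optional<int> ParseInt(std::string_view text)
{
    int value = 0;
    const char* const first = text.data();
    const char* const last = first + text.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last)
        return std::nullopt;
    return value;
}

// Usage sketch: the second call parses a window that is NOT null-terminated.
inline void ParseIntExamples()
{
    assert(ParseInt("42") == 42);
    assert(ParseInt(std::string_view("12,34").substr(0, 2)) == 12);
    assert(!ParseInt("4x").has_value());
}
```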

I used this to rapidly parse a .csv file. To get each new line in the .csv file, I used istringstream and getline(), which is blazingly fast (~2GB/second or 1,200,000 lines per second on a single core).
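The line-by-line loop is not shown in the answer, so the following is only a sketch of how it might look. SplitFields is a hypothetical, compact find-based stand-in for Split() above, and CountFields is a hypothetical driver; the important detail is that the views are consumed before the next getline() call overwrites the line buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for Split(): returns views into `line`,
// keeping empty fields, so "A," yields {"A", ""}.
std::vector<std::string_view> SplitFields(std::string_view line, char delim = ',')
{
    std::vector<std::string_view> fields;
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = line.find(delim, start);
        if (pos == std::string_view::npos) {
            fields.push_back(line.substr(start));   // final (possibly empty) field
            return fields;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
}

// Hypothetical driver: counts the fields in every line of `csv`.
std::size_t CountFields(const std::string& csv)
{
    std::istringstream in(csv);
    std::string line;                   // reused buffer; each view points into it
    std::size_t total = 0;
    while (std::getline(in, line)) {
        const auto fields = SplitFields(line);
        assert(!fields.empty());        // a line always yields at least one field
        total += fields.size();         // views are used before `line` changes
    }
    return total;
}
```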

Unit tests, written with Google Test (I installed it using vcpkg):

// Google Test integrates into VS2017 if ReSharper is installed. 
#include "gtest/gtest.h" // Can install using vcpkg
// In main(), call:   
// ::testing::InitGoogleTest(&argc, argv);return RUN_ALL_TESTS();

TEST(Strings, Split)
{
    {
        const std::string str = "A,B,C";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "C");
    }       
    {
        const std::string str = ",B,C";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "C");
    }
    {
        const std::string str = "A,B,";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "");
    }
    {
        const std::string str = "";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 1);
        EXPECT_TRUE(tokens[0] == "");
    }
    {
        const std::string str =  "A";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 1);
        EXPECT_TRUE(tokens[0] == "A");
    }
    {
        const std::string str =  ",";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "");
    }
    {
        const std::string str =  ",,";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "");
        EXPECT_TRUE(tokens[2] == "");
    }
    {
        const std::string str = "A,";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "");
    }
    {
        const std::string str = ",B";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "B");
    }       
}

Solution 2:[2]

If you want to use that particular method, you just have to convert the string_view to a string, explicitly:

istringstream iss{string(sentence)}; // N.B. braces to avoid most vexing parse
copy(istream_iterator<string>(iss),
     istream_iterator<string>(),
     ostream_iterator<string_view>(cout, "\n"));

The C++ standard library doesn't have good string manipulation functionality. You may want to look at what's available in Boost, Abseil, etc. Any of them are better than this.
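If adding a library is not an option, a hand-rolled splitter is still short. The sketch below is not from the answer; it assumes a hypothetical function name, SplitWords, and splits on runs of whitespace the way operator>> does, but returns views into the original text instead of copies.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: splits on runs of whitespace and returns views into `text`.
std::vector<std::string_view> SplitWords(std::string_view text,
                                         std::string_view whitespace = " \t\n\r\f\v")
{
    std::vector<std::string_view> words;
    std::size_t start = text.find_first_not_of(whitespace);
    while (start != std::string_view::npos) {
        std::size_t end = text.find_first_of(whitespace, start);
        if (end == std::string_view::npos)
            end = text.size();              // last word runs to the end of the text
        assert(end > start);                // a word has at least one character
        words.push_back(text.substr(start, end - start));
        start = text.find_first_not_of(whitespace, end);
    }
    return words;
}
```

Called as SplitWords(sentence) with the sentence from the question, this yields the same four words the stream version prints. The usual lifetime caveat applies: the result is only valid while the characters behind `text` are alive.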

Solution 3:[3]

stringstream owns the string that it operates on. That means it creates a copy of the given string. It can't merely reference the string.

Even with the proposed string_view-based stream types, streams are still not random-access ranges. They don't have a way to deal with sub-ranges of a string. That's why they extract data from the stream by copy, rather than via iterators or something.

What you want is best done via a regex-based mechanism, since that works without copying any of the string. Regex iterators work just fine over the characters of a string_view (though you'll have to construct the resulting string_views manually from each match).
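The answer gives no code, so this is only a sketch of what the regex approach might look like. It assumes a hypothetical function name, RegexWords, and the pattern \S+ (one or more non-whitespace characters); std::cregex_iterator walks the character range, and each view is built by hand from the match's start pointer and length.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string_view>
#include <vector>

// Hypothetical helper: returns one view per run of non-whitespace characters.
std::vector<std::string_view> RegexWords(std::string_view text)
{
    static const std::regex word(R"(\S+)");
    std::vector<std::string_view> words;
    const char* const first = text.data();
    const char* const last = first + text.size();
    for (std::cregex_iterator it(first, last, word), end; it != end; ++it) {
        const std::cmatch& m = *it;
        assert(m.length(0) > 0);   // \S+ never produces an empty match
        // m[0].first points into `text`, so the view is built manually
        // from pointer + length without copying any characters.
        words.emplace_back(m[0].first, static_cast<std::size_t>(m.length(0)));
    }
    return words;
}
```

The characters are never copied, but std::regex is generally not the fastest option, so it is worth measuring before using this in a hot loop.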

Solution 4:[4]

Contango's answer is good. I altered it slightly to work with std::string and boost::string_view in my project, and tried to get rid of a copy construction.

The following code splits a string into string_views.

You have to guarantee that the string is not destroyed while the views are in use.

Other answers may be more elegant syntactically; see https://www.bfilipek.com/2018/07/string-view-perf-followup.html. There is also an istringstream version above; if the string itself is long, the copy it makes is a cost you need to take care of.


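#include <string>
#include <vector>
#include <boost/numeric/conversion/cast.hpp>   // boost::numeric_cast
#include <boost/utility/string_view.hpp>       // boost::string_view

typedef boost::string_view StringView; // Or you can just typedef std::string_view StringView;
#if defined(_WIN32) || defined(WIN32)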
#pragma warning(push)
#pragma warning(disable:26486 26481)
#endif
        void SplitStringToStringView(const std::string& str, const char delim, std::vector<StringView>* outputPointer)
        {
            if (outputPointer == nullptr)
                return;

            std::vector<StringView>& result = *outputPointer;

            int indexCommaToLeftOfColumn = 0;
            int indexCommaToRightOfColumn = -1;

            const int end = boost::numeric_cast<int>(str.size());
            for (int i = 0; i < end; i++)
            {
                if (str.at(i) == delim)
                {
                    indexCommaToLeftOfColumn = indexCommaToRightOfColumn;
                    indexCommaToRightOfColumn = i;
                    const int index = indexCommaToLeftOfColumn + 1;
                    const int length = indexCommaToRightOfColumn - index;

                    // Bounds checking can be omitted as logically, this code can never be invoked 
                    // Try it: put a breakpoint here and run the unit tests.
                    /*if (index + length >= static_cast<int>(str.size()))
                    {
                        length--;
                    }
                    if (length < 0)
                    {
                        length = 0;
                    }*/

                    result.emplace_back(StringView(str.c_str() + index, length));
                }
            }
            const StringView finalColumn(str.c_str() + indexCommaToRightOfColumn + 1, 
                str.size() - indexCommaToRightOfColumn - 1);
            result.push_back(finalColumn);
        }
#if defined(_WIN32) || defined(WIN32)
#pragma warning(pop)
#endif

Contango provided unit test code, which is nice, so here is mine:

        {
            const std::string str = "A,B,C";
            std::vector<StringView> tokens;
            SplitStringToStringView(str, ',', &tokens);
            EXPECT_TRUE(tokens.size() == 3);
            EXPECT_TRUE(tokens[0] == "A");
            EXPECT_TRUE(tokens[1] == "B");
            EXPECT_TRUE(tokens[2] == "C");
        }
        {
            const std::string str = ",B,C";
            std::vector<StringView> tokens;
            SplitStringToStringView(str, ',', &tokens);
            EXPECT_TRUE(tokens.size() == 3);
            EXPECT_TRUE(tokens[0] == "");
            EXPECT_TRUE(tokens[1] == "B");
            EXPECT_TRUE(tokens[2] == "C");
        }
        {
            const std::string str = "A,B,";
            std::vector<StringView> tokens;
            SplitStringToStringView(str, ',', &tokens);
            EXPECT_TRUE(tokens.size() == 3);
            EXPECT_TRUE(tokens[0] == "A");
            EXPECT_TRUE(tokens[1] == "B");
            EXPECT_TRUE(tokens[2] == "");
        }
        {
            const std::string str = "";
            std::vector<StringView> tokens;
            SplitStringToStringView(str, ',', &tokens);
            EXPECT_TRUE(tokens.size() == 1);
            EXPECT_TRUE(tokens[0] == "");
        }
        {
            const std::string str = "A";
            std::vector<StringView> tokens;
            SplitStringToStringView(str, ',', &tokens);
            EXPECT_TRUE(tokens.size() == 1);
            EXPECT_TRUE(tokens[0] == "A");
        }
        {
            const std::string str = ",";
            std::vector<StringView> tokens;
            SplitStringToStringView(str, ',', &tokens);
            EXPECT_TRUE(tokens.size() == 2);
            EXPECT_TRUE(tokens[0] == "");
            EXPECT_TRUE(tokens[1] == "");
        }
        {
            const std::string str = ",,";
            std::vector<StringView> tokens;
            SplitStringToStringView(str, ',', &tokens);
            EXPECT_TRUE(tokens.size() == 3);
            EXPECT_TRUE(tokens[0] == "");
            EXPECT_TRUE(tokens[1] == "");
            EXPECT_TRUE(tokens[2] == "");
        }
        {
            const std::string str = "A,";
            std::vector<StringView> tokens;
            SplitStringToStringView(str, ',', &tokens);
            EXPECT_TRUE(tokens.size() == 2);
            EXPECT_TRUE(tokens[0] == "A");
            EXPECT_TRUE(tokens[1] == "");
        }
        {
            const std::string str = ",B";
            std::vector<StringView> tokens;
            SplitStringToStringView(str, ',', &tokens);
            EXPECT_TRUE(tokens.size() == 2);
            EXPECT_TRUE(tokens[0] == "");
            EXPECT_TRUE(tokens[1] == "B");
        }

Solution 5:[5]

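There is a way to avoid copying the sentence into the stream, but use it at your own risk: the effect of pubsetbuf() on a string stream buffer is implementation-defined, so a standard library may ignore the call, and the const_cast is only acceptable because the stream is never written to. Each extracted word is still copied into a std::string.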

#include <algorithm>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <string_view>

int main()
{
    std::string_view const sentence = "And I feel fine...";

    std::istringstream iss;
    iss.rdbuf()->pubsetbuf(const_cast<char *>(sentence.data()),
                           static_cast<std::streamsize>(sentence.size()));

    std::copy(std::istream_iterator<std::string>(iss),
              std::istream_iterator<std::string>(),
              std::ostream_iterator<std::string_view>(std::cout, "\n"));
}

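A variation that does not rely on pubsetbuf() is a small read-only stream buffer whose get area is set directly with setg(). This is a swapped-in technique rather than the answer's method, and ViewBuf and StreamWords are hypothetical names; treat it as a sketch that assumes the stream is only ever read from.

```cpp
#include <cassert>
#include <istream>
#include <streambuf>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical read-only stream buffer whose get area is the viewed characters.
class ViewBuf : public std::streambuf {
public:
    explicit ViewBuf(std::string_view sv)
    {
        // setg() takes non-const pointers, but this buffer overrides no
        // output or putback functions, so it never writes through them.
        char* const p = const_cast<char*>(sv.data());
        setg(p, p, p + sv.size());
    }
};

// Hypothetical usage: extracts whitespace-separated words through an istream.
// The sentence itself is not copied, but each word is copied into a std::string.
std::vector<std::string> StreamWords(std::string_view text)
{
    ViewBuf buf(text);
    std::istream in(&buf);
    std::vector<std::string> words;
    for (std::string w; in >> w;) {
        assert(!w.empty());   // a successful extraction reads at least one character
        words.push_back(w);
    }
    return words;
}
```

The buffer must not outlive the characters it views, which is the same lifetime rule that applies to the string_view itself.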

Solution 6:[6]

Consider C++20's lazy_split_view:

#include <algorithm>
#include <iostream>
#include <ranges>
#include <string_view>
 
// P2210R2: a temporary patch until online g++ >= 12
#define lazy_split_view split_view
#define lazy_split split
 
auto print = [](auto const& view)
{
    for (std::cout << "{ "; const auto element : view)
        std::cout << element;
    std::cout << " } ";
};
 
int main()
{
    constexpr std::string_view text { "And I feel fine..." };
    constexpr std::string_view delim { " " };
    std::cout << "\n" "substrings: ";
    std::ranges::for_each(text | std::views::lazy_split(delim), print);
}

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Barry
Solution 3 Nicol Bolas
Solution 4
Solution 5 Chnossos
Solution 6 Sergei Krivonos