Idiomatically split a string_view
I read "The most elegant way to iterate the words of a string" and enjoyed the succinctness of the answer. Now I want to do the same for string_view. Problem is, stringstream can't take a string_view:
#include <iostream>
#include <string>
#include <string_view>
#include <sstream>
#include <algorithm>
#include <iterator>
int main() {
    using namespace std;
    string_view sentence = "And I feel fine...";
    istringstream iss(sentence); // <== error
    copy(istream_iterator<string_view>(iss),
         istream_iterator<string_view>(),
         ostream_iterator<string_view>(cout, "\n"));
}
So is there a way to do this? If not, why would such a thing not be idiomatic?
Solution 1:[1]
Split by a delimiter and return a vector<string_view>.
Designed for rapid splitting of lines in a .csv file.
Tested under MSVC 2017 v15.9.6 and Intel Compiler v19.0, compiled with C++17 (which is required for string_view).
#include <string_view>
#include <vector>
std::vector<std::string_view> Split(const std::string_view str, const char delim = ',')
{
    std::vector<std::string_view> result;
    int indexCommaToLeftOfColumn = 0;
    int indexCommaToRightOfColumn = -1;
    for (int i = 0; i < static_cast<int>(str.size()); i++)
    {
        if (str[i] == delim)
        {
            indexCommaToLeftOfColumn = indexCommaToRightOfColumn;
            indexCommaToRightOfColumn = i;
            int index = indexCommaToLeftOfColumn + 1;
            int length = indexCommaToRightOfColumn - index;
            // Bounds checking can be omitted as logically, this code can never be invoked
            // Try it: put a breakpoint here and run the unit tests.
            /*if (index + length >= static_cast<int>(str.size()))
            {
                length--;
            }
            if (length < 0)
            {
                length = 0;
            }*/
            std::string_view column(str.data() + index, length);
            result.push_back(column);
        }
    }
    // Everything after the last delimiter (or the whole input if there was none).
    const std::string_view finalColumn(str.data() + indexCommaToRightOfColumn + 1, str.size() - indexCommaToRightOfColumn - 1);
    result.push_back(finalColumn);
    return result;
}
Be careful of lifetimes: a string_view should never outlive the parent string that it is a window into. If the parent string goes out of scope, then what the string_view points to is invalid. In this particular case, the API design makes it difficult to go wrong, as the input and output are all string_views, which are all windows into the parent string. This ends up being rather efficient in terms of memory copying and CPU usage.
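As a minimal sketch of the lifetime rule above: when the tokens have to outlive the parent string, one option is to copy them into owning std::string objects first. ToOwned is a hypothetical helper name used only for illustration; it is not part of the original answer.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: copies each view into an owning std::string so the
// result stays valid after the parent string has been destroyed.
std::vector<std::string> ToOwned(const std::vector<std::string_view>& views)
{
    std::vector<std::string> owned;
    owned.reserve(views.size());
    for (const std::string_view view : views)
        owned.emplace_back(view);  // explicit string(string_view) construction
    return owned;
}
```

This gives up the zero-copy benefit, so it only makes sense for the few tokens that really need to be kept.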
Note that when using string_view, the only downside is losing implicit null termination. So use functions that support string_view, e.g. the lexical_cast functions in Boost for converting strings to numbers.
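If Boost is not available, one possible alternative (an assumption here, not what this answer used) is C++17's std::from_chars, which works on a [first, last) pointer range and therefore does not need a null terminator. ParseInt is a hypothetical helper name.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parses the whole view as a base-10 int.
// Returns std::nullopt if the text is empty, not a number, out of range,
// or followed by trailing characters.
std::optional<int> ParseInt(std::string_view text)
{
    int value = 0;
    const char* const first = text.data();
    const char* const last = first + text.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last)
        return std::nullopt;
    return value;
}
```

A column returned by Split can be passed straight in, e.g. ParseInt(tokens[0]), without first copying it into a std::string.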
I used this to rapidly parse a .csv file. To get each new line in the .csv file, I used istringstream and std::getline(), which is blazingly fast (~2 GB/second or 1,200,000 lines per second on a single core).
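A rough sketch of such a line-by-line loop, under stated assumptions: SplitColumns is a compact stand-in for Split above so that the block is self-contained, and CountFields is a hypothetical function that just counts columns. It is not the code the throughput figure was measured with.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

// Stand-in for Split: one view per column, empty columns included.
std::vector<std::string_view> SplitColumns(std::string_view line, char delim = ',')
{
    std::vector<std::string_view> columns;
    std::size_t begin = 0;
    while (true)
    {
        const std::size_t end = line.find(delim, begin);
        if (end == std::string_view::npos)
        {
            columns.push_back(line.substr(begin));  // last (or only) column
            return columns;
        }
        columns.push_back(line.substr(begin, end - begin));
        begin = end + 1;
    }
}

// Hypothetical example: total number of columns over all lines of the input.
std::size_t CountFields(std::istream& input, char delim = ',')
{
    std::size_t total = 0;
    std::string line;  // reused for every line, so its capacity is recycled
    while (std::getline(input, line))
    {
        // The views point into `line` and are only valid until the next getline.
        const std::vector<std::string_view> columns = SplitColumns(line, delim);
        total += columns.size();
    }
    return total;
}
```

The key point is that the views must be consumed (or copied) inside the loop body, because the next std::getline call overwrites the buffer they refer to.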
Unit tests. Use Google Test for testing (I installed using vcpkg).
// Google Test integrates into VS2017 if ReSharper is installed.
#include "gtest/gtest.h" // Can install using vcpkg
// In main(), call:
// ::testing::InitGoogleTest(&argc, argv);return RUN_ALL_TESTS();
TEST(Strings, Split)
{
    {
        const std::string str = "A,B,C";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "C");
    }
    {
        const std::string str = ",B,C";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "C");
    }
    {
        const std::string str = "A,B,";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "");
    }
    {
        const std::string str = "";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 1);
        EXPECT_TRUE(tokens[0] == "");
    }
    {
        const std::string str = "A";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 1);
        EXPECT_TRUE(tokens[0] == "A");
    }
    {
        const std::string str = ",";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "");
    }
    {
        const std::string str = ",,";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "");
        EXPECT_TRUE(tokens[2] == "");
    }
    {
        const std::string str = "A,";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "");
    }
    {
        const std::string str = ",B";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "B");
    }
}
Solution 2:[2]
If you want to use that particular method, you just have to convert the string_view to a string, explicitly:
istringstream iss{string(sentence)}; // N.B. braces to avoid most vexing parse
copy(istream_iterator<string>(iss),
     istream_iterator<string>(),
     ostream_iterator<string_view>(cout, "\n"));
The C++ standard library doesn't have good string manipulation functionality. You may want to look at what's available in Boost, Abseil, etc. Any of them are better than this.
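If pulling in a library is not an option, a copy-free word split can also be hand-rolled on top of string_view::find_first_of and find_first_not_of. The following is only a sketch: SplitWords and its default delimiter set are assumptions made for illustration, chosen to roughly mimic how operator>> skips whitespace.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: splits `text` on any character in `delims`.
// Runs of delimiters are skipped, so no empty tokens are produced.
std::vector<std::string_view> SplitWords(std::string_view text,
                                         std::string_view delims = " \t\n\r\f\v")
{
    std::vector<std::string_view> words;
    std::size_t begin = text.find_first_not_of(delims);
    while (begin != std::string_view::npos)
    {
        const std::size_t end = text.find_first_of(delims, begin);
        // substr clamps the count, so an `end` of npos means "to the end".
        words.push_back(text.substr(begin, end - begin));
        // Searching from npos (or past the end) simply yields npos again.
        begin = text.find_first_not_of(delims, end);
    }
    return words;
}
```

Every returned view points into the original buffer, so the usual lifetime caveat applies: the buffer has to stay alive for as long as the views are used.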
Solution 3:[3]
stringstream owns the string that it operates on. That means it creates a copy of the given string. It can't merely reference the string.
Even with the proposed string_view-based stream types, streams are still not random-access ranges. They don't have a way to deal with sub-ranges of a string. That's why they extract data from the stream by copy, rather than via iterators or something.
What you want is best done via a regex-based mechanism, since that works without copying anything. Regexes work just fine with string_views (though you'll have to construct the string_views manually).
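A minimal sketch of the regex approach, assuming that a word is any run of non-whitespace characters. WordsByRegex is a hypothetical helper name; it iterates over the character range with std::cregex_iterator and builds each string_view by hand from the match's start pointer and length.

```cpp
#include <cstddef>
#include <regex>
#include <string_view>
#include <vector>

// Hypothetical helper: returns each whitespace-delimited word of `text`
// as a view into the original buffer.
std::vector<std::string_view> WordsByRegex(std::string_view text)
{
    static const std::regex word(R"(\S+)");
    std::vector<std::string_view> words;
    const char* const first = text.data();
    const char* const last = first + text.size();
    for (std::cregex_iterator it(first, last, word), end; it != end; ++it)
    {
        // Sub-match 0 is the whole match; it stores pointers into `text`.
        const std::csub_match& match = (*it)[0];
        words.emplace_back(match.first, static_cast<std::size_t>(match.length()));
    }
    return words;
}
```

No characters are copied here, although the regex machinery itself has a runtime cost, so it is worth measuring before using this on a hot path.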
Solution 4:[4]
Contango's answer is good. I altered it a little to work with std::string and boost::string_view in my project, and I tried to get rid of a copy construction.
The following code splits a string into string_views.
You have to guarantee that the string is not destroyed while the views are still in use.
There are other answers that may be more elegant; check this out: https://www.bfilipek.com/2018/07/string-view-perf-followup.html. There is also an istringstream version above; if the string itself is long, the copy it makes is a cost you need to take care of.
#include <string>
#include <vector>
#include <boost/numeric/conversion/cast.hpp> // boost::numeric_cast
#include <boost/utility/string_view.hpp>     // boost::string_view
typedef boost::string_view StringView; //Or you can just typedef std::string_view StringView;
#if defined(_WIN32) || defined(WIN32)
#pragma warning(push)
#pragma warning(disable:26486 26481)
#endif
void SplitStringToStringView(const std::string& str, const char delim, std::vector<StringView>* outputPointer)
{
    if (outputPointer == nullptr)
        return;
    std::vector<StringView>& result = *outputPointer;
    int indexCommaToLeftOfColumn = 0;
    int indexCommaToRightOfColumn = -1;
    const int end = boost::numeric_cast<int>(str.size());
    for (int i = 0; i < end; i++)
    {
        if (str.at(i) == delim)
        {
            indexCommaToLeftOfColumn = indexCommaToRightOfColumn;
            indexCommaToRightOfColumn = i;
            const int index = indexCommaToLeftOfColumn + 1;
            const int length = indexCommaToRightOfColumn - index;
            // Bounds checking can be omitted as logically, this code can never be invoked
            // Try it: put a breakpoint here and run the unit tests.
            /*if (index + length >= static_cast<int>(str.size()))
            {
                length--;
            }
            if (length < 0)
            {
                length = 0;
            }*/
            result.emplace_back(StringView(str.c_str() + index, length));
        }
    }
    const StringView finalColumn(str.c_str() + indexCommaToRightOfColumn + 1,
                                 str.size() - indexCommaToRightOfColumn - 1);
    result.push_back(finalColumn);
}
#if defined(_WIN32) || defined(WIN32)
#pragma warning(pop)
#endif
Since Contango kindly provided unit test code, so will I:
{
    const std::string str = "A,B,C";
    std::vector<StringView> tokens;
    SplitStringToStringView(str, ',', &tokens);
    EXPECT_TRUE(tokens.size() == 3);
    EXPECT_TRUE(tokens[0] == "A");
    EXPECT_TRUE(tokens[1] == "B");
    EXPECT_TRUE(tokens[2] == "C");
}
{
    const std::string str = ",B,C";
    std::vector<StringView> tokens;
    SplitStringToStringView(str, ',', &tokens);
    EXPECT_TRUE(tokens.size() == 3);
    EXPECT_TRUE(tokens[0] == "");
    EXPECT_TRUE(tokens[1] == "B");
    EXPECT_TRUE(tokens[2] == "C");
}
{
    const std::string str = "A,B,";
    std::vector<StringView> tokens;
    SplitStringToStringView(str, ',', &tokens);
    EXPECT_TRUE(tokens.size() == 3);
    EXPECT_TRUE(tokens[0] == "A");
    EXPECT_TRUE(tokens[1] == "B");
    EXPECT_TRUE(tokens[2] == "");
}
{
    const std::string str = "";
    std::vector<StringView> tokens;
    SplitStringToStringView(str, ',', &tokens);
    EXPECT_TRUE(tokens.size() == 1);
    EXPECT_TRUE(tokens[0] == "");
}
{
    const std::string str = "A";
    std::vector<StringView> tokens;
    SplitStringToStringView(str, ',', &tokens);
    EXPECT_TRUE(tokens.size() == 1);
    EXPECT_TRUE(tokens[0] == "A");
}
{
    const std::string str = ",";
    std::vector<StringView> tokens;
    SplitStringToStringView(str, ',', &tokens);
    EXPECT_TRUE(tokens.size() == 2);
    EXPECT_TRUE(tokens[0] == "");
    EXPECT_TRUE(tokens[1] == "");
}
{
    const std::string str = ",,";
    std::vector<StringView> tokens;
    SplitStringToStringView(str, ',', &tokens);
    EXPECT_TRUE(tokens.size() == 3);
    EXPECT_TRUE(tokens[0] == "");
    EXPECT_TRUE(tokens[1] == "");
    EXPECT_TRUE(tokens[2] == "");
}
{
    const std::string str = "A,";
    std::vector<StringView> tokens;
    SplitStringToStringView(str, ',', &tokens);
    EXPECT_TRUE(tokens.size() == 2);
    EXPECT_TRUE(tokens[0] == "A");
    EXPECT_TRUE(tokens[1] == "");
}
{
    const std::string str = ",B";
    std::vector<StringView> tokens;
    SplitStringToStringView(str, ',', &tokens);
    EXPECT_TRUE(tokens.size() == 2);
    EXPECT_TRUE(tokens[0] == "");
    EXPECT_TRUE(tokens[1] == "B");
}
Solution 5:[5]
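There is a way to avoid all copies, but use at your own risk (the effect of pubsetbuf on a stringbuf is implementation-defined, so this is not portable and may silently do nothing on some standard libraries):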
#include <algorithm>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <string_view>
int main()
{
    std::string_view const sentence = "And I feel fine...";
    std::istringstream iss;
    iss.rdbuf()->pubsetbuf(const_cast<char *>(sentence.data()),
                           static_cast<std::streamsize>(sentence.size()));
    std::copy(std::istream_iterator<std::string>(iss),
              std::istream_iterator<std::string>(),
              std::ostream_iterator<std::string_view>(std::cout, "\n"));
}
Solution 6:[6]
Consider C++20 lazy_split_view
#include <algorithm>
#include <iostream>
#include <ranges>
#include <string_view>
// P2210R2: a temporary patch until online g++ >= 12
#define lazy_split_view split_view
#define lazy_split split
auto print = [](auto const& view)
{
    for (std::cout << "{ "; const auto element : view)
        std::cout << element;
    std::cout << " } ";
};
int main()
{
    constexpr std::string_view text { "And I feel fine..." };
    constexpr std::string_view delim { " " };
    std::cout << "\n" "substrings: ";
    std::ranges::for_each(text | std::views::lazy_split(delim), print);
}
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Barry |
Solution 3 | Nicol Bolas |
Solution 4 | |
Solution 5 | Chnossos |
Solution 6 | Sergei Krivonos |