std::filesystem::path::u8string might not return valid UTF-8?
Consider this code, running on a Linux system (Compiler Explorer link):
#include <cstdio>
#include <exception>
#include <filesystem>

int main()
{
    try
    {
        // Invalid UTF-8: 0xf0 announces a 4-byte sequence, but 'a' is not a continuation byte.
        const char8_t bad_path[] = {0xf0, u8'a', 0};
        std::filesystem::path p(bad_path);
        for (auto c : p.u8string())
        {
            std::printf("%X ", static_cast<unsigned>(c));
        }
    }
    catch (const std::exception& e)
    {
        std::printf("error: %s\n", e.what());
    }
}
It deliberately constructs a std::filesystem::path object from a string with incorrect UTF-8 encoding (0xf0 starts a 4-byte sequence, but 'a' is not a continuation byte).
When u8string is called, no exception is thrown. I find this surprising, as the documentation at cppreference states:
- The result encoding in the case of u8string() is always UTF-8.
Checking the implementation in LLVM's libc++, I see that there is indeed no validation performed: the string held internally by std::filesystem::path is just copied into a u8string and returned:
_LIBCPP_INLINE_VISIBILITY _VSTD::u8string u8string() const { return _VSTD::u8string(__pn_.begin(), __pn_.end()); }
The GCC implementation (libstdc++) exhibits the same behavior.
Of course this is a contrived example, as I deliberately construct a path from an invalid string to keep things simple. But to my knowledge, the Linux kernel/filesystems do not enforce that file paths are valid UTF-8 strings, so I could encounter a path like that "in the wild" while e.g. iterating a directory.
Am I right to conclude that std::filesystem::path::u8string does not actually guarantee that a valid UTF-8 string will be returned, despite what the documentation says? If so, what is the motivation behind this design?
Solution 1:[1]
The current C++ standard states in fs.path.type.cvt:
char8_t: The encoding is UTF-8. The method of conversion is unspecified.
and also
If the encoding being converted to has no representation for source characters, the resulting converted characters, if any, are unspecified.
So, in a nutshell, anything involving the actual interpretation of the bytes making up the path is unspecified, meaning that implementations are free to handle invalid data as they see fit. So yes, std::filesystem::path::u8string() does not really guarantee that a valid UTF-8 string is returned.
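Since the library does not validate anything here, code that needs well-formed UTF-8 has to check the result itself. Below is a minimal sketch of such a check. The names is_valid_utf8_bytes and u8string_is_valid_utf8 are hypothetical helpers invented for this illustration, not part of any library. The sketch takes raw bytes so that it compiles whether u8string() returns std::u8string (C++20) or std::string (C++17).

```cpp
#include <cstddef>
#include <cstdint>
#include <filesystem>

// Hypothetical helper: returns true if the n bytes at `data` form well-formed
// UTF-8. Rejects bad lead bytes, truncated sequences, bad continuation bytes,
// overlong encodings, surrogates, and code points above U+10FFFF.
bool is_valid_utf8_bytes(const unsigned char* data, std::size_t n)
{
    // Smallest code point that legitimately needs a sequence of each length.
    static constexpr std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char b0 = data[i];
        std::size_t len = 0;
        std::uint32_t cp = 0;

        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1Fu; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0Fu; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07u; }
        else return false; // stray continuation byte or invalid lead byte

        if (len > n - i) return false; // sequence runs past the end

        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char b = data[i + k];
            if ((b & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3Fu);
        }

        if (len > 1 && cp < min_cp[len]) return false; // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                // outside Unicode

        i += len;
    }
    return true;
}

// Hypothetical wrapper: validates whatever p.u8string() hands back.
bool u8string_is_valid_utf8(const std::filesystem::path& p)
{
    const auto s = p.u8string();
    return is_valid_utf8_bytes(
        reinterpret_cast<const unsigned char*>(s.data()), s.size());
}
```

With the path from the question, u8string_is_valid_utf8(p) should return false on implementations that copy the bytes through unchanged, because the 0xf0 lead byte announces four bytes and only two are present.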
Regarding the motivation: the standard says nothing about it. But one might get an idea by looking at boost::filesystem, which the standard is based on. The documentation states:
When a class path function argument type matches the operating system's API argument type for paths, no conversion is performed rather than conversion to a specified encoding such as one of the Unicode encodings. This avoids unintended consequences, etc.
I guess you are using a POSIX system, in which case the underlying operating system API most likely treats filenames as UTF-8 or as opaque byte strings. Hence, inputs are kept as is, so as not to run into any conversion issues. Windows, on the other hand, uses UTF-16 and therefore has to convert the string when the path is constructed, which results in an exception when the input is invalid UTF-8 (godbolt).
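The "no conversion is performed" behavior can be observed directly on a POSIX system. The sketch below assumes a platform where std::filesystem::path::value_type is char (the static_assert makes that assumption explicit). The function name native_preserves_bytes is made up for this illustration.

```cpp
#include <filesystem>
#include <string>
#include <type_traits>

// This sketch only applies where paths are stored natively as char (POSIX).
static_assert(std::is_same_v<std::filesystem::path::value_type, char>,
              "sketch assumes a POSIX-style platform with char-based paths");

// Hypothetical helper: returns true if a path constructed from `bytes` stores
// exactly those bytes, i.e. construction neither converted nor validated them.
bool native_preserves_bytes(const std::string& bytes)
{
    const std::filesystem::path p(bytes);
    return p.native() == bytes;
}
```

Calling native_preserves_bytes(std::string("\xf0" "a")) with the invalid sequence from the question is expected to return true on such a platform: the argument type matches the native type, so the bytes are stored as given.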
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow