std::filesystem::path::u8string might not return valid UTF-8?

Consider this code, running on a Linux system (Compiler Explorer link):

#include <cstdint>
#include <cstdio>
#include <exception>
#include <filesystem>

int main()
{
    try
    {
        // Invalid UTF-8: 0xf0 starts a 4-byte sequence, but 'a' is not a continuation byte.
        const char8_t bad_path[] = {0xf0, u8'a', 0};
        std::filesystem::path p(bad_path);

        // Dump the bytes returned by u8string() in hex.
        for (auto c : p.u8string())
        {
            std::printf("%X ", static_cast<std::uint8_t>(c));
        }
    }
    catch (const std::exception& e)
    {
        std::printf("error: %s\n", e.what());
    }
}

It deliberately constructs a std::filesystem::path object using a string with incorrect UTF-8 encoding (0xf0 starts a 4-byte character, but 'a' is not a continuation byte; more info here).

When u8string is called, no exception is thrown; I find this surprising as the documentation at cppreference states:

  1. The result encoding in the case of u8string() is always UTF-8.

Checking the implementation in LLVM's libc++, I see that indeed no validation is performed: the string held internally by std::filesystem::path is simply copied into a u8string and returned:

_LIBCPP_INLINE_VISIBILITY _VSTD::u8string u8string() const { return _VSTD::u8string(__pn_.begin(), __pn_.end()); }

The GCC implementation (libstdc++) exhibits the same behavior.

Of course this is a contrived example, as I deliberately construct a path from an invalid string to keep things simple. But to my knowledge, the Linux kernel/filesystems do not enforce that file paths are valid UTF-8 strings, so I could encounter a path like that "in the wild" while e.g. iterating a directory.

Am I right to conclude that std::filesystem::path::u8string actually does not guarantee that a valid UTF-8 string will be returned, despite what the documentation says? If so, what is the motivation behind this design?



Solution 1:[1]

The current C++ standard states in fs.path.type.cvt:

char8_t: The encoding is UTF-8. The method of conversion is unspecified.

and also

If the encoding being converted to has no representation for source characters, the resulting converted characters, if any, are unspecified.

So, in a nutshell, anything involving the actual interpretation of the bytes making up the path is unspecified, meaning that implementations are free to handle invalid data as they see fit. So yes, std::filesystem::path::u8string() does not really guarantee that a valid UTF-8 string is returned.
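Since validity is not guaranteed, code that needs well-formed UTF-8 may have to check the result itself. Below is a minimal sketch of such a check. The helper name is_valid_utf8 is hypothetical (it is not part of the standard library), and the byte ranges follow the usual well-formed UTF-8 table (no overlong forms, no surrogates, nothing above U+10FFFF). It is a template so that it accepts both std::u8string and std::string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not a standard library function.
// Returns true if the bytes of `s` form well-formed UTF-8.
template <class String>
bool is_valid_utf8(const String& s)
{
    const std::size_t n = s.size();
    auto byte = [&s](std::size_t k) { return static_cast<unsigned char>(s[k]); };

    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char b0 = byte(i);
        if (b0 <= 0x7F) { ++i; continue; }  // ASCII, one byte

        std::size_t len = 0;                 // total length of the sequence
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
        if (b0 >= 0xC2 && b0 <= 0xDF)
        {
            len = 2;
        }
        else if (b0 >= 0xE0 && b0 <= 0xEF)
        {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;       // reject overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;       // reject UTF-16 surrogates
        }
        else if (b0 >= 0xF0 && b0 <= 0xF4)
        {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;       // reject overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;       // reject code points above U+10FFFF
        }
        else
        {
            return false;                    // stray continuation byte or invalid lead byte
        }

        if (n - i < len) return false;       // sequence is truncated
        if (byte(i + 1) < lo || byte(i + 1) > hi) return false;
        for (std::size_t k = 2; k < len; ++k)
        {
            if (byte(i + k) < 0x80 || byte(i + k) > 0xBF) return false;
        }
        i += len;
    }
    return true;
}
```

For the path from the question, is_valid_utf8(p.u8string()) would be expected to return false on implementations that copy the bytes unchanged, because 'a' (0x61) is outside the range 0x90..0xBF required after 0xF0.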

Regarding the motivation: the standard says nothing about it. One can get an idea, however, by looking at boost::filesystem, which the standard library is based on. Its documentation states:

When a class path function argument type matches the operating system's API argument type for paths, no conversion is performed rather than conversion to a specified encoding such as one of the Unicode encodings. This avoids unintended consequences, etc.

I guess you are using a POSIX system, in which case the underlying operating system API most likely uses UTF-8 or binary filenames. Hence, inputs are kept as is, so as not to run into any conversion issues. Windows, on the other hand, uses UTF-16 and therefore has to convert the string as soon as the path is constructed, which results in an exception when the input is not valid UTF-8 (godbolt).
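The "kept as is" behavior can be observed directly. The sketch below assumes a POSIX system, where std::filesystem::path::value_type is char and string() returns the stored bytes without conversion. The helper name kept_verbatim is hypothetical and only serves as an illustration.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper for illustration.
// Returns true if a path constructed from `raw` hands back exactly the same bytes.
// On POSIX this is expected to hold even for bytes that are not valid UTF-8,
// since no conversion takes place; on Windows the result may differ.
bool kept_verbatim(const std::string& raw)
{
    const std::filesystem::path p(raw);
    return p.string() == raw;
}
```

On Linux with libstdc++ or libc++, kept_verbatim(std::string("\xF0" "a")) would be expected to return true, which is consistent with u8string() later returning the same invalid bytes.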

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
