Fastest way to read a TSV file and store it in a map (C++)

I'm working on a project with 6 huge TSV mapping files. In most cases they look like this:

The first column is an ID: a unique string that starts with a letter and continues with digits. The second column is a string that should be mapped to that ID.

C1 somethingToMap1
C2 somethingToMap2
C3 somethingToMap3
...

In some files the second part contains multiple strings separated by semicolons, like this:

...
C122546 vvv;wwww
C231541 xxx;yyy;zzz;
...
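
For reference, such a value could be split into its parts roughly like this (just a sketch, not my actual code):

#include <string>
#include <vector>

// Sketch: split a multi-value field like "xxx;yyy;zzz;" into its parts.
// Empty tokens (e.g. from a trailing ';') are skipped.
std::vector<std::string> SplitValues(const std::string &field) {
    std::vector<std::string> values;
    size_t start = 0;
    while (start < field.size()) {
        size_t end = field.find(';', start);
        if (end == std::string::npos) end = field.size();
        if (end > start) values.push_back(field.substr(start, end - start));
        start = end + 1;
    }
    return values;
}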

I do need all the entries from the files in my mappings. The files contain between 100K and 100M lines of data each. At the moment I read them like this:

void ReadMappingFileXY() {
    string file_path = "mappings/XY.tsv";
    ifstream tsv_file(file_path);
    string line;
    while (getline(tsv_file, line)) {
        // Split line into tab-separated parts
        vector<string> line_data;
        split(line_data, line, boost::is_any_of("\t"));
        XY_mapping_[line_data[0]] = line_data[1];
    }
}

XY_mapping_ is an unordered_map<string, string>, which should be very fast once generated. Right now, generating all 6 mappings takes about 15 minutes and consumes roughly 35 GB of RAM.

I was expecting this, but the problem is that my program is still very slow when I'm using the mappings.

I'm looking up an ID in the map like this:

string GetXYByID(const string &id) const {
    auto iterator = id_to_xy_.find(id);
    if(iterator != id_to_xy_.end()) return iterator->second;
    else return "";
}

I've tested the program with much smaller files and it works like a charm. Is there a better way to read the files? And is there a better-fitting data structure for the mappings?



Solution 1:[1]

To check potential optimizations, I tried basically two approaches:

  1. Single-threaded, with optimizations worth a try
  2. Multi-threaded, using std::async

The multithreaded solution brought improvements, but not as big as expected. The limiting factor is disk I/O.
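
You can verify the disk I/O limit with a small throughput test, for example like the sketch below (the file name follows the test file scheme used further down). If reading the raw bytes alone already takes minutes, no parsing or hashing trick will get you below that time.

#include <chrono>
#include <fstream>
#include <iostream>
#include <vector>

// Sketch: measure the raw sequential read throughput of one test file.
// If this alone takes long, then parsing speed is not the bottleneck.
int main() {
    std::ifstream ifs{ "c:\\temp\\data1.tsv", std::ios::binary };
    std::vector<char> chunk(1'048'576);          // read in 1 MiB chunks
    size_t totalBytes{};

    auto tp1 = std::chrono::high_resolution_clock::now();
    while (ifs.read(chunk.data(), static_cast<std::streamsize>(chunk.size())) || ifs.gcount() > 0)
        totalBytes += static_cast<size_t>(ifs.gcount());
    auto tp2 = std::chrono::high_resolution_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(tp2 - tp1).count();
    std::cout << totalBytes / 1'000'000 << " MB read in " << ms << " ms\n";
}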

Anyway, here are some proposals for improvement.

  • Usage of a bigger input buffer
  • A different line/data read algorithm (one more variant is sketched right after this list)
  • A faster hash algorithm
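
Regarding the read algorithm: besides the getline variant used in the measured code below, reading a whole file into memory in one go and parsing it with std::string_view would be a further option. A rough sketch, which I did not benchmark; note that it temporarily needs the whole file in memory in addition to the map:

#include <fstream>
#include <iterator>
#include <string>
#include <string_view>
#include <unordered_map>

// Sketch (not benchmarked): slurp the whole file, then parse it with
// std::string_view so that no intermediate per-line strings are allocated.
std::unordered_map<std::string, std::string> readWholeFile(const std::string& fileName) {
    std::unordered_map<std::string, std::string> map;
    std::ifstream ifs{ fileName, std::ios::binary };
    std::string content{ std::istreambuf_iterator<char>(ifs), std::istreambuf_iterator<char>() };

    std::string_view rest{ content };
    while (!rest.empty()) {
        // Take the next line, without allocating
        size_t nl = rest.find('\n');
        std::string_view line = rest.substr(0, nl);
        rest.remove_prefix(nl == std::string_view::npos ? rest.size() : nl + 1);

        size_t tab = line.find('\t');
        if (tab == std::string_view::npos) continue;            // skip malformed lines
        std::string_view value = line.substr(tab + 1);
        if (!value.empty() && value.back() == '\r')             // tolerate CRLF line endings
            value.remove_suffix(1);
        map.emplace(std::string{ line.substr(0, tab) }, std::string{ value });
    }
    return map;
}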

I wrote a procedure to create big test files. It can be reused as is or modified.

Anyway, the result was an execution duration of 6 minutes for the simple optimization and below 5 minutes for the multithreaded approach, for roughly 16 GB of input data. The exact size will of course vary, because the data is created with random length and content.

The software was tested with MS Visual Studio 2022 Community Edition, x64, release mode, on a ten-year-old machine with a Xeon processor and 64 GB of RAM. The RAM was fully consumed and I guess there was even some swapping; at some point I saw a process memory usage of 75 GB in MSVS. So these numbers are not fully reliable, but they at least give some indication.

Let us first have a look at the "conventional" code:

#include <iostream>
#include <fstream>
#include <string>
#include <random>
#include <string_view>
#include <iomanip>
#include <chrono>
#include <array>
#include <unordered_map>

// Some initial definitions and parameters for the program
constexpr unsigned int NumberOfTestFiles{ 6u };
constexpr unsigned int NumberOfLinesMin = 20'000'000u;   // Per test file
constexpr unsigned int NumberOfLinesMax = 100'000'000u;  // Per test file
// Characters that will be used randomly as test string
constexpr std::string_view characterBase{ "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ;,:" };
// This will be used for output, for the std::setw function
constexpr unsigned int NumberOfDigits(unsigned int number) {
    unsigned int digits = 0u;
    do { number /= 10; ++digits; } while (number != 0);
    return digits;
}
// Key string length of Map
constexpr unsigned int MinLengthKey{ NumberOfDigits(NumberOfLinesMax) + 1u };
// Test files will be created with a certain scheme
const std::string baseFileName{ "c:\\temp\\data" };


// Creation of test files. Only needed once.
void createTestFiles() {

    // Initialize random generators for various use cases
    std::random_device randomDevice;
    std::mt19937 randomEngine(randomDevice());
    std::uniform_int_distribution<unsigned int> uniformDistributionLineNumbers(NumberOfLinesMin, NumberOfLinesMax);
    std::uniform_int_distribution<size_t> uniformDistributionCharacterSelector(0u, characterBase.size() - 1u);
    std::uniform_int_distribution<unsigned int> uniformDistributionStringLength(10u, 40u);

    std::cout << "\n\nCreating Test files. " << NumberOfTestFiles << " files will be created:\n\n";

    // Measure the time to create the files. File creation will take very long!
    auto tp1 = std::chrono::high_resolution_clock::now();

    // Create x test files
    for (unsigned int fileNumber{ 0 }; fileNumber < NumberOfTestFiles; ++fileNumber) {

        // Build the file name
        std::string fileName = baseFileName + std::to_string(fileNumber + 1) + ".tsv";

        // Open the output file and check, if it could be opened
        if (std::ofstream ofs{ fileName }; ofs) {

            // Write a random number of lines into this file
            unsigned int numberOfLines{ uniformDistributionLineNumbers(randomEngine) };
            std::cout << "File:  " << std::left << std::setw(fileName.length() + NumberOfDigits(NumberOfTestFiles))
                << fileName << "    Number of Lines:  " << numberOfLines << '\n';

            // Write data to file
            for (unsigned int lineNumber{}; lineNumber < numberOfLines; ++lineNumber) {

                ofs << 'C' << std::setfill('0') << std::setw(MinLengthKey) << (lineNumber + 1) << '\t';

                unsigned int randomStringLength{ uniformDistributionStringLength(randomEngine) };
                std::string randomString(randomStringLength, '\0');
                for (char& c : randomString) c = characterBase[uniformDistributionCharacterSelector(randomEngine)];

                ofs << randomString << '\n';
            }
        }
        else {
            std::cerr << "\n*** Error: Could not open '" << fileName << "' for writing. Exiting . . .\n\n";
            break;
        }
    }
    auto tp2 = std::chrono::high_resolution_clock::now();
    std::cout << "\n\nDuration to create files: " << std::chrono::duration_cast<std::chrono::milliseconds>(tp2 - tp1).count() << "ms\n\n\n\n";
}

// Simple and fast hash function for strings
struct MyHash {
    size_t operator()(const std::string& key) const {
        size_t hash{};
        for (const char c : key) hash = hash * 101 + c;
        return hash;
    }
};
// Define an own MAP and use fast hash function
using MyMap = std::array <std::unordered_map<std::string, std::string, MyHash>, NumberOfTestFiles>;

// Increase input buffer size. Effect will be low
constexpr size_t ifStreamBufferSize = 500'000u;
static char inputBuffer[ifStreamBufferSize];

// Main test function
int main() {

    // Please uncomment the following line to create test files
    //createTestFiles();

    // Here we will store all data
    MyMap data{};

    // Pre-reserve space to avoid rehashing during insertion
    for (unsigned int fileNumber{ 0 }; fileNumber < NumberOfTestFiles; ++fileNumber)
        data[fileNumber].reserve(NumberOfLinesMin);

    // Start overall duration measurement
    auto tp1 = std::chrono::high_resolution_clock::now();
    std::cout << "\n\nReading data from files\n\n";

    // And now read all x source files
    for (unsigned int fileNumber{ 0 }; fileNumber < NumberOfTestFiles; ++fileNumber) {

        // Build filename
        std::string fileName = baseFileName + std::to_string(fileNumber + 1) + ".tsv";
        std::cout << "Reading file:  " << std::left << std::setw(fileName.length() + NumberOfDigits(NumberOfTestFiles)) << fileName << '\n';

        // Open file for reading and check if it could be opened
        if (std::ifstream ifs{ fileName }; ifs) {

            // Set bigger input buffer size
            ifs.rdbuf()->pubsetbuf(inputBuffer, ifStreamBufferSize);


            // Start local duration timer for reading one file
            auto tp3 = std::chrono::high_resolution_clock::now();

            // Read the complete file and add data to map
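            // ifs >> key reads the key up to the first whitespace (the tab),
            // std::ws then skips the tab itself, and std::getline takes the
            // rest of the line as the value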
            std::string key, value;
            while (std::getline(ifs >> key >> std::ws, value))
                data[fileNumber][key] = std::move(value);

            // Show, how long it took to read that file
            auto tp4 = std::chrono::high_resolution_clock::now();
            std::cout << "\tDuration for reading file: " << std::chrono::duration_cast<std::chrono::milliseconds>(tp4 - tp3).count() << "ms\n";

        }
        else {
            std::cerr << "\n*** Error: Could not open '" << fileName << "' for reading . . .\n\n";
        }
    }
    // And show overall duration for reading all data
    auto tp2 = std::chrono::high_resolution_clock::now();
    std::cout << "\n\nDuration for reading all files: " << std::chrono::duration_cast<std::chrono::milliseconds>(tp2 - tp1).count() << "ms\n\n";
}

Result:


Roughly 373 seconds overall.


Then I rewrote the test program using a multithreaded approach.

The code looks like the following:

#include <iostream>
#include <fstream>
#include <string>
#include <random>
#include <string_view>
#include <iomanip>
#include <chrono>
#include <array>
#include <unordered_map>
#include <thread>
#include <future>
#include <vector>

constexpr unsigned int NumberOfTestFiles{ 6u };
constexpr unsigned int NumberOfLinesMin = 10'000'000u;
constexpr unsigned int NumberOfLinesMax = 100'000'000u;
constexpr std::string_view characterBase{ "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ;,:" };
constexpr unsigned int NumberOfDigits(unsigned int number) {
    unsigned int digits = 0u;
    do { number /= 10; ++digits; } while (number != 0);
    return digits;
}
constexpr unsigned int MinLengthKey{ NumberOfDigits(NumberOfLinesMax) + 1u };
const std::string baseFileName{ "c:\\temp\\data" };


void createTestFiles() {

    std::random_device randomDevice;
    std::mt19937 randomEngine(randomDevice());
    std::uniform_int_distribution<unsigned int> uniformDistributionLineNumbers(NumberOfLinesMin, NumberOfLinesMax);
    std::uniform_int_distribution<size_t> uniformDistributionCharacterSelector(0u, characterBase.size() - 1u);
    std::uniform_int_distribution<unsigned int> uniformDistributionStringLength(10u, 40u);

    std::cout << "\n\nCreating Test files. " << NumberOfTestFiles << " files will be created:\n\n";

    auto tp1 = std::chrono::high_resolution_clock::now();

    for (unsigned int fileNumber{ 0 }; fileNumber < NumberOfTestFiles; ++fileNumber) {

        std::string fileName = baseFileName + std::to_string(fileNumber + 1) + ".tsv";

        if (std::ofstream ofs{ fileName }; ofs) {

            unsigned int numberOfLines{ uniformDistributionLineNumbers(randomEngine) };
            std::cout << "File:  " << std::left << std::setw(fileName.length() + NumberOfDigits(NumberOfTestFiles))
                << fileName << "    Number of Lines:  " << numberOfLines << '\n';

            for (unsigned int lineNumber{}; lineNumber < numberOfLines; ++lineNumber) {

                ofs << 'C' << std::setfill('0') << std::setw(MinLengthKey) << (lineNumber + 1) << '\t';

                unsigned int randomStringLength{ uniformDistributionStringLength(randomEngine) };
                std::string randomString(randomStringLength, '\0');
                for (char& c : randomString) c = characterBase[uniformDistributionCharacterSelector(randomEngine)];

                ofs << randomString << '\n';
            }
        }
        else {
            std::cerr << "\n*** Error: Could not open '" << fileName << "' for writing. Exiting . . .\n\n";
            break;
        }
    }
    auto tp2 = std::chrono::high_resolution_clock::now();
    std::cout << "\n\nDuration to create files: " << std::chrono::duration_cast<std::chrono::milliseconds>(tp2 - tp1).count() << "ms\n\n\n\n";
}



struct MyHash {
    size_t operator()(const std::string& key) const {
        size_t hash{};
        for (const char c : key) hash = hash * 101 + c;
        return hash;
    }
};
using Map = std::unordered_map<std::string, std::string, MyHash>;

// For faster reading we will increase the input buffer size
constexpr size_t ifStreamBufferSize = 1'000'000u;


Map readTSV(const std::string& fileName) {

    Map map;

    std::vector<char> inputBuffer(ifStreamBufferSize);
    map.reserve(NumberOfLinesMax / 2);

    std::cout << "Reading file:  " << std::left << std::setw(fileName.length() + NumberOfDigits(NumberOfTestFiles)) << fileName << '\n';

    if (std::ifstream ifs{ fileName }; ifs) {

        ifs.rdbuf()->pubsetbuf(inputBuffer.data(), ifStreamBufferSize);

        auto tp1 = std::chrono::high_resolution_clock::now();

        std::string key, value;
        while (std::getline(ifs >> key >> std::ws, value))
            map[key] = std::move(value);

        auto tp2 = std::chrono::high_resolution_clock::now();
        std::cout << "Duration for reading file: " << fileName << " --> " << std::chrono::duration_cast<std::chrono::milliseconds>(tp2 - tp1).count() << "ms\n";
    }
    else {
        std::cerr << "\n*** Error: Could not open '" << fileName << "' for reading . . .\n\n";
    }
    return map;
}


int main() {
    // createTestFiles();

    std::array<std::shared_future<Map>, NumberOfTestFiles> maps{};

    auto tp1 = std::chrono::high_resolution_clock::now();
    std::cout << "\n\nReading data from files\n\n";

    for (unsigned int fileNumber{}; fileNumber < NumberOfTestFiles; ++fileNumber) {
        std::string fileName = baseFileName + std::to_string(fileNumber + 1) + ".tsv";
        // std::launch::async forces a real thread; without the policy the call may be deferred
        maps[fileNumber] = std::async(std::launch::async, readTSV, fileName);
    }
    for (unsigned int fileNumber{}; fileNumber < NumberOfTestFiles; ++fileNumber) {
        maps[fileNumber].wait();
    }
    std::cout << "\n\n";
    for (unsigned int fileNumber{}; fileNumber < NumberOfTestFiles; ++fileNumber) {
        std::cout << fileNumber << " --> " << maps[fileNumber].get().size() << '\n';
    }
    auto tp2 = std::chrono::high_resolution_clock::now();
    std::cout << "\n\nDuration for reading all files: " << std::chrono::duration_cast<std::chrono::milliseconds>(tp2 - tp1).count() << "ms\n\n";
}

The result was a little better, but not by much, and I was expecting this: the overall duration was 289 seconds, i.e. 84 seconds faster than the single-threaded version, and about 3 times faster than your solution, while processing more data.
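
One more remark, independent of the file reading: your GetXYByID returns the value by copy, which costs one string allocation per successful lookup. Returning a const reference avoids that; a small sketch of your function, modified (references into an unordered_map remain valid even after a rehash, as long as the element is not erased):

// Sketch: return a const reference instead of a copy
const std::string& GetXYByID(const std::string& id) const {
    static const std::string empty{};
    auto it = id_to_xy_.find(id);
    return it != id_to_xy_.end() ? it->second : empty;
}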

Maybe you can derive some ideas from the above for your own implementation.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Armin Montigny