CUDA - Implementing Device Hash Map?

Does anyone have any experience implementing a hash map on a CUDA Device? Specifically, I'm wondering how one might go about allocating memory on the Device and copying the result back to the Host, or whether there are any useful libraries that can facilitate this task.

It seems like I would need to know the maximum size of the hash map a priori in order to allocate Device memory. All my previous CUDA endeavors have used arrays and memcpys and therefore been fairly straightforward.

Any insight into this problem is appreciated. Thanks.



Solution 1:[1]

There is a GPU hash table implementation presented in "CUDA by Example" by Jason Sanders and Edward Kandrot.

Fortunately, you can get information on the book and download the examples' source code for free on this page:
http://developer.nvidia.com/object/cuda-by-example.html

In this implementation, the table is pre-allocated on the CPU side, and safe multithreaded access is ensured by a lock function built on the atomic operation atomicCAS (compare-and-swap).
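
As a rough illustration of that approach (a minimal sketch, not the book's actual code), the per-bucket lock boils down to spinning on atomicCAS and releasing with atomicExch, with lock attempts serialized within a warp to avoid SIMT livelock. The Entry struct, kernel name, and parameters below are simplified assumptions:

```cpp
// Sketch of the idea: a chained hash table whose bucket array, node pool and
// locks are all cudaMalloc'ed up front, with one spinlock per bucket.
struct Entry {
    unsigned int key;
    void        *value;
    Entry       *next;
};

struct Lock {
    int *mutex;  // 0 = unlocked, 1 = locked; points into device memory

    __device__ void lock()   { while (atomicCAS(mutex, 0, 1) != 0) { } }
    __device__ void unlock() { atomicExch(mutex, 0); }
};

// Each thread links one pre-allocated node into its bucket's chain.
__global__ void insert_kernel(const unsigned int *keys, void *const *values,
                              int n, Entry **buckets, Entry *node_pool,
                              Lock *locks, size_t bucket_count)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= n) return;

    size_t b = keys[tid] % bucket_count;   // trivial hash function
    Entry *node = &node_pool[tid];         // no device-side allocation needed
    node->key   = keys[tid];
    node->value = values[tid];

    // Let only one thread per warp attempt the lock at a time; a naive
    // spinlock taken by divergent threads of the same warp can livelock.
    for (int i = 0; i < 32; ++i) {
        if ((tid % 32) == i) {
            locks[b].lock();
            node->next = buckets[b];       // push-front onto the chain
            buckets[b] = node;
            locks[b].unlock();
        }
    }
}
```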

Moreover, newer hardware generations (compute capability 2.0 and up) combined with CUDA >= 4.0 are supposed to be able to use the new/delete operators directly on the GPU ( http://developer.nvidia.com/object/cuda_4_0_RC_downloads.html?utm_source=http://forums.nvidia.com&utm_medium=http://forums.nvidia.com&utm_term=Developers&utm_content=Developers&utm_campaign=CUDA4 ), which could serve your implementation. I haven't tested these features yet.
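
For reference, a small self-contained sketch of device-side new/delete (my own illustration, not from the answer): in-kernel allocations come from a separate device heap whose default size is small and can be raised with cudaDeviceSetLimit before the first launch.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

struct Node { unsigned int key; Node *next; };

// Each thread allocates a node from the device heap with `new` and frees it
// again with `delete`, purely to demonstrate in-kernel dynamic allocation.
__global__ void demo_device_new()
{
    Node *n = new Node;          // returns nullptr if the device heap is exhausted
    if (n != nullptr) {
        n->key  = threadIdx.x;
        n->next = nullptr;
        delete n;                // must be freed on the device as well
    }
}

int main()
{
    // Enlarge the device heap used by in-kernel new/malloc before launching.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64u << 20);  // 64 MiB
    demo_device_new<<<4, 256>>>();
    cudaDeviceSynchronize();
    printf("done\n");
    return 0;
}
```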

Solution 2:[2]

I recall someone developed a straightforward hash map implementation on top of thrust. There is some code for it here, although whether it works with current thrust releases is something I don't know. It might at least give you some ideas.

Solution 3:[3]

warpcore is a framework for creating high-throughput, purpose-built hashing data structures on CUDA accelerators ("hashing at the speed of light on modern CUDA-accelerators", as the project describes itself). You can find it here:

https://github.com/sleeepyjack/warpcore

Solution 4:[4]

AFAIK, the hash table given in "CUDA by Example" does not perform particularly well. Currently, I believe the fastest hash table on CUDA is the one presented in Dan Alcantara's PhD dissertation; see chapter 6.

Solution 5:[5]

cuCollections is a relatively new open-source library started by NVIDIA engineers that aims to implement efficient containers on the GPU.

cuCollections (cuco) is an open-source, header-only library of GPU-accelerated, concurrent data structures.

Similar to how Thrust and CUB provide STL-like, GPU-accelerated algorithms and primitives, cuCollections provides STL-like concurrent data structures. cuCollections is not a one-to-one, drop-in replacement for STL data structures like std::unordered_map. Instead, it provides functionally similar data structures tailored for efficient use with GPUs.

cuCollections is still under heavy development. Users should expect breaking changes and refactoring to be common.

At the moment it provides a fixed-size hash table, cuco::static_map, and one that can grow, cuco::dynamic_map.
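
For orientation only, here is a sketch of roughly what the bulk host-side API of cuco::static_map looks like. The header name, constructor arguments (capacity plus empty-key/empty-value sentinels) and pair type below are assumptions that have changed between releases, so check the project README for the current interface:

```cpp
#include <cuco/static_map.cuh>       // header/API details may differ by release
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>

int main()
{
    // Capacity is fixed up front; -1 marks empty slots for keys and values.
    cuco::static_map<int, int> map{100'000,
                                   cuco::empty_key{-1},
                                   cuco::empty_value{-1}};

    // Build key/value pairs on the host and copy them to the device.
    thrust::host_vector<cuco::pair<int, int>> h_pairs(50'000);
    for (int i = 0; i < static_cast<int>(h_pairs.size()); ++i)
        h_pairs[i] = {i, i * 10};
    thrust::device_vector<cuco::pair<int, int>> d_pairs = h_pairs;

    // Bulk insert and bulk find both run on the GPU.
    map.insert(d_pairs.begin(), d_pairs.end());

    thrust::device_vector<int> keys(50'000), values(50'000);
    thrust::sequence(keys.begin(), keys.end());
    map.find(keys.begin(), keys.end(), values.begin());
    return 0;
}
```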

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: jopasserat
Solution 2: talonmies
Solution 3: Mojtaba Valizadeh
Solution 4: Gabriel
Solution 5