'Database to check whether a path/ url is present

I have an application where it can have multiple thousands of urls. Let's assume the url paths are of the form,

  • /app/v1/test1
  • /app/v2/test2

I want to store these paths in a datastructure, so that when a request comes for a path which doesn't exist, then I want to catch this early and return 404 Not Found. There are 2 approaches that I could think of,

  1. Store in a hashset and check whether a particular string exists.
  2. Store in a Database, since I don't want to clog the memory as there could be millions of these urls.

With approach 2, I am thinking whether I can use simple RDBMS like Postgres or Mysql (Since already I have experience with them) or to a NoSql DB like Redis (not sure though)

With RDBMS, I see a problem if I put index on the column which stores the path since re-indexing will take time when I try to insert into a table which is already large. Wanted to know others thoughts like whether NoSql DB like Redis will suffice here or any other suggestions which will better suite my case here.



Solution 1:[1]

The disadvantage with approach 1 would be that you'd have to maintain a copy of that hashset in memory in each instance of your server process.

For approach 2 you could use Redis as a sort of shared memory and store the URLs in a Redis set which has O(1) insertion and O(1) membership check time complexity. You'd use the [SADD][1] and [SISMEMBER][2] commands.

Let's add some URLs to a Redis Set (adding to a set that doesn't exist creates it for you):

$ redis-cli
127.0.0.1:6379> sadd application_urls /app/something /app/somethingelse /api/v1/hello

(integer) 3

Now, let's check if a URL exists or should be a 404:

127.0.0.1:6379> sismember application_urls /app/something
(integer) 1
127.0.0.1:6379> sismember application_urls /app/whoops
(integer) 0

We can also use the SMISMEMBER command to check multiple URLs in a single network round trip to Redis:

127.0.0.1:6379> smismember application_urls /app/something /app/whoops
1) (integer) 1
2) (integer) 0

A reply of 1 means the URL exists, 0 means it doesn't and should be a 404.

This approach will be 100% accurate at the cost of taking up the memory associated with storing all of the URLs in Redis (Redis stores a copy of the entire dataset in RAM for speed). If the size of your dataset becomes a problem here, you could consider using the [RedisBloom module][3] to replace the set with a Bloom filter... instead of storing the actual data, the Bloom filter will trade off accuracy for space and store a hash of the data, meaning that when you check for membership you'll sometimes get a false positive.



  [1]: https://redis.io/commands/sadd
  [2]: https://redis.io/commands/sismember/
  [3]: https://redis.io/docs/stack/bloom/

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Simon Prickett