'In Scylla DB, How can I query the records in desc order?

I have a table in ScyllaDB:

CREATE TABLE myservice.auditlog (
    operatorid text,
    operationtime bigint,
    action text,
    actiontype text,
    appname text,
    entityid text,
    entitytype text,
    operatoripaddress text,
    operatorname text,
    payload text,
    PRIMARY KEY (operatorid, operationtime)
) WITH CLUSTERING ORDER BY (operationtime DESC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'LeveledCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX entityactid ON myservice.auditlog (entityid);
CREATE INDEX actionname ON myservice.auditlog (action);

I want to read N most recent entries in this table and hence I have defined cluster order as DESC. But when I query the data using the following query, I still see the results in ascending order:

select * from myservice.auditlog limit N; (N number of records need to fetch)

My table's primary key is (operatorid, operationtime). When I select, the data is always ordered by the operatorid in ascending order first and only then by operationtime. But I want to get all records ordered only by operationtime irrespective of the operatorid. How can I achieve the same?



Solution 1:[1]

The data layout in Scylla (and Cassandra and DynamoDB are the same, by the way), is that you have partitions, each of them is a list of rows (items). Partitions are indexed by a partition key (your operatorid) and rows inside a partition are indexed by the clustering key (your operationtime).

The important thing to note about the order of all this data is that:

  1. Inside each partition, the individual rows (items) are sorted according to the clustering key, i.e., according to operationtime.
  2. But the partitions themselves, are not sorted in any recognizable order among themselves. It's basically a hash table (the order is a hash function of your partition key).

So if you scan just a single partition - e.g., SELECT * FROM myservice.auditlog WHERE operatorid = 7 - you'll get back the results sorted by operationtime exactly the way you wanted. But if you scan the entire table, withSELECT * FROM myservice.auditlog, you'll get the partitions in seemingly random order - and see order only inside each partition.

There is no way, with your data model, to get all the data in the table (from all partitions) sorted by time. Scylla simply doesn't have it sorted this way - but rather it has your data distributed over possibly many nodes, distributed by the hash function of the partition key and not sorted. But note that you don't actually need to have all the data sorted - you just wanted the latest data. And there's a way to achieve that:

To achieve what you want you may need to organize your data differently. For example, imagine that instead of the partition key being operatorid, it will the current time in (say) one-hour resolution. So for example right now it's hour 1, next hour will be hour 2, tomorrow it will be hour 24, and so on. To read the most recent data, if now is hour 123, you read from just partition 123, and read the most recent data from that single partition with that LIMIT thing. If you're right on the edge of an hour you may need to read from two partitions - but never need to scan the entire database like you tried to do originally.

Finally, you probably still want operatorid to also be a key, because you need to get the latest data from operatorid=1 separately from the latest data of operator=2. There's also an easy way to do that: You can have a compound partition key - the partition key can be (operatorid, hour). Scanning the partition WHERE operatorid=1 AND hour=123 will give you the latest data for this operatorid.

This sort of data model is known as a time series, and Scylla has fairly good support for it, including a special compaction strategy (time window compaction strategy).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Nadav Har'El