'Neo4j query really slow when using collect(distinct) and order by

I'm a novice in Neo4j right now

database schema:

enter image description here Currently i'm trying to find all (developer|company|publisher) nodes with it's related platform's Games Nodes in my data:

  1. find a Game node where the game belong to a Platform node
  2. Game node must have a relationship with a Genre node contains a certain string, Game.deleted property must not be Null
  3. we have relationships of Platform -- Game, (Developer|Company|Publisher)-[r:DEVELOPED_BY|PUBLISHED_BY] - Game --> Genre
  4. query will return (Developer|Company|Publisher) with it's related Game node

So i have the following query:

MATCH (p:Platform)--(g:Game),(d)-[r:DEVELOPED_BY|PUBLISHED_BY]-(g)-->(ge:Genre)                    
WHERE ge.id CONTAINS 'Filtered' AND p.id = 'PS5' AND g.deleted is null                    
RETURN g, collect(DISTINCT d) as d ORDER BY g.title DESC SKIP 0 LIMIT 100

the return row contain 2 columns, games and it's d(developer|publisher|company) for examples:

enter image description here

PROBLEM: the query run really slow (15~30sec depend on platform) even if i indexed all the properties (Game.id), (Platform.id)

Started streaming 100 records after 312 ms and completed after 15743 ms.

Profile:

enter image description here

My data has hundred thousands of Game node. How do i improve the query? i see that these 2 have million of DB hits

enter image description here

UPDATES:

After removing the contains query and changed to using regex

=~ "Filtered.*" 

The query speed improved significantly, down to only 2000ms(2 secs) down from 15743ms(15 secs). I still don't understand why CONTAINS slow down the query by so much, even though i already indexed the field with text index type

the above query slow down by alot if i query on around 200000 game nodes, which still doesn't solve my problem

changing the query to this with the large game node dataset speed up the query to just under 1 second

MATCH (p:Platform)--(g:Game),(d)-[r:DEVELOPED_BY|PUBLISHED_BY]-(g)-->(ge:Genre)                    
    WHERE ge.id CONTAINS 'Filtered' AND p.id = 'PS5' AND g.deleted is null                    
    RETURN g, d LIMIT 100

why is collect(distinct d) and order by took so much time? how can i solve this?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source