'MySQL correlated subquery very slow when range is included

DB Fiddle: https://www.db-fiddle.com/f/u1N4YYV3wmiWndnZNb95xz/5

For the following tables nodes and leaves, with nodes having just 9 records and leaves having a large amount (I inserted ~400k rows just for testing here):

> select * from nodes;
+----+-----------+------------+
| id | leaf_left | leaf_right |
+----+-----------+------------+
|  1 |         1 |         10 |
|  2 |         1 |          9 |
|  3 |         1 |          8 |
|  4 |         1 |          7 |
|  5 |         1 |          5 |
|  6 |        11 |         20 |
|  7 |        11 |         19 |
|  8 |        11 |         18 |
|  9 |        11 |         17 |
+----+-----------+------------+
> select * from leaves limit 20;
+----+------+
| id | flag |
+----+------+
|  1 |    0 |
|  2 |    0 |
|  3 |    0 |
|  4 |    0 |
|  5 |    0 |
|  6 |    0 |
|  7 |    0 |
|  8 |    0 |
|  9 |    0 |
| 10 |    0 |
| 11 |    0 |
| 12 |    1 |
| 13 |    0 |
| 14 |    0 |
| 15 |    0 |
| 16 |    0 |
| 17 |    0 |
| 18 |    0 |
| 19 |    1 |
| 20 |    0 |
+----+------+

I am attempting to get all nodes having a leaf in the range of [leaf_left, leaf_right] where flag = true.

This query is slow (~0.4 sec):

select * from nodes
where exists (
  select 1 from leaves
  where leaves.flag and leaves.id between nodes.leaf_left and nodes.leaf_right
)

While this query, which removes the upper limit of the between clause, is extremely quick (0.001 sec):

select * from nodes
where exists (
  select 1 from leaves
  where leaves.flag and leaves.id > nodes.leaf_left
)

The EXPLAIN plan for both queries are identical:

+------+--------------------+--------+------+---------------+------+---------+------+--------+-------------+
| id   | select_type        | table  | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+------+--------------------+--------+------+---------------+------+---------+------+--------+-------------+
|    1 | PRIMARY            | nodes  | ALL  | NULL          | NULL | NULL    | NULL | 9      | Using where |
|    2 | DEPENDENT SUBQUERY | leaves | ALL  | PRIMARY       | NULL | NULL    | NULL | 424432 | Using where |
+------+--------------------+--------+------+---------------+------+---------+------+--------+-------------+

So it seems to me that the between in the subquery is somehow preventing MySQL from using its index efficiently, but I'm not sure. Any ideas?



Solution 1:[1]

I ran into this problem multiple times. It looks like MySQL can't use indexes of range selection (aka. BETWEEN) being used in a subquery. The fiddle in this question gives 150ms to execute for nested query, though. This issue is probably fixed in the last MySQL versions.

But e.g. we use the lastest AWS Aurora (MySQL 5.7.12) and a simple nested sets query runs for 2+ sec (!) for just several 10k rows. The fun fact is while you use constants as the boundaries if works just fine and fast:

SELECT * FROM nodes
WHERE EXISTS (
  SELECT 1 FROM leaves
  WHERE leaves.flag AND leaves.id BETWEEN :left AND :right
)

It goes down drastically once you use the fields from the outer query. I found a workaround to persuade MySQL the range bondaries are contrants using MySQL variables.

SELECT nodes.id,
       @lft := nodes.leaft_left,
       @rgt := nodes.leaf_right
FROM nodes
WHERE EXISTS (
  SELECT 1 FROM leaves
  WHERE leaves.flag AND leaves.id BETWEEN @lft AND @rgt
)

It works as expected. But I couldn't use this as we use Docrine ORM which doesn't understand MySQL variables. But I found the following example works as well:

SELECT * FROM nodes
WHERE EXISTS (
  SELECT 1 FROM leaves
  WHERE leaves.flag AND leaves.id BETWEEN (@lft := nodes.leaf_left) AND (@rht := nodes.leaf_right)
)

Looks like a dirty hack but it does what it should do. So I just moved the subquery to a custom DQL function.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1