'improve sql query with 2 EXISTS sub queries

I have this query (mysql):

 SELECT `budget_items`.*
    FROM `budget_items`
    WHERE (budget_category_id = 4
           AND ((is_custom_for_family = 0)
                OR (is_custom_for_family = 1
                    AND custom_item_family_id = 999))
           AND ((EXISTS
                   (SELECT 1
                    FROM balance_histories
                    WHERE balance_histories.budget_item_id = budget_items.id
                      AND balance_histories.family_id = 999
                      AND payment_date >= '2021-02-01'
                      AND payment_date <= '2021-02-28' ))
                OR (EXISTS
                      (SELECT 1
                       FROM budget_lines
                       WHERE family_id = 999
                         AND budget_id = 188311
                         AND budget_item_id = budget_items.id
                         AND amount > 0))))

It runs multiple times on app start. It takes more than 10 seconds (all of them).

I have indexes on:

balance_histories table: budget_item_id, family_id (tried also payment_date)

budget_lines table: family_id, budget_id, budget_item_id

How can I improve the speed? Query or maybe mysql (8) configuration.

balance_histories table: enter image description here

budget_lines table: enter image description here



Solution 1:[1]

I would start this query in reverse of what you have. Assuming you COULD have years of data, but your EXISTS query is looking more specifically at a date-range, or specific budget lines, start there, it will probably be much smaller. Once you have DISTINCT IDs, then go back to the budget items by qualified ID PLUS the additional criteria.

To help optimize the queries, I would have indexes on

table              index
balance_histories  ( family_id, payment_date, budget_item_id )
budget_lines       ( family_id, budget_id, amount )
budget_items       ( id, budget_category_id, is_custom_for_family, custom_item_family_id )


select
        bi.*
    from
        -- pre-query a list of DISTINCT IDs from the balance history
        -- and budget lines that qualify. THEN join to the rest.
        ( select distinct
                bh.budget_item_id id
            from
                balance_histories bh
            where
                    bh.family_id = 999
                AND bh.payment_date >= '2021-02-01'
                AND bh.payment_date <= '2021-02-28'
        UNION
        select 
                bl.budget_item_id
            FROM 
                budget_lines bl
            WHERE 
                    bl.family_id = 999
                AND bl.budget_id = 188311
                AND bl.amount > 0 ) PQ
            JOIN budget_items bi
                on PQ.id = bi.id
                AND bi.budget_category_id = 4
                AND (       bi.is_custom_for_family = 0
                        OR 
                            (   bi.is_custom_for_family = 1
                            AND bi.custom_item_family_id = 999 )
                    )

Feedback

As for many SQL queries, there are typically multiple ways to get a solution. Sometimes using EXISTS works well, sometimes not as much. You need to consider cardinality of your data, and that is what I was shooting for. Look at what you were asking for first: Get budget items that are all category for and custom for family is 1 or 0 (which is all), but if family, only those for 999. You were correct on your balance of AND/OR. However, this is going through EVERY RECORD, and if you have millions of rows, that is what you are scanning through. Only after scanning through every row are you now doing a secondary query (for each record that qualified) against the histories for the specific date range OR family/budget.

My guess is that the number of possible records returned from your two EXISTS queries was going to be very small. So, by starting by getting a DISTINCT list of just those IDs that are part of that union would be the very small subset. Once that single "ID" if found, it now becomes a direct match to the budget items table and have the final filtering limits of categoryID / Family / Custom Item considerations.

By having indexes better match the context of your query WHERE clause will optimize pulling data. I have had answers to several other questions with similar resolutions and clarify indexes and why in those... take a look for example, and another here.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1