'How to avoid double counting while using SQL

I have a transaction table which contains date, transaction_id and category (it is a sales table of clothes). It looks like this:

ndate transaction_id category
11 111 M
11 111 F
11 112 M
12 113 F
12 113 O
13 114 M

I am applying a sql query to aggregate the data and make a new table to reduce the number of rows (as the data is too huge):

select ndate, category, count(distinct transaction_id)
from transaction_table
group by ndate, category;

The output that I am getting is this:

ndate category transactions
11 M 2
11 F 1
12 F 1
12 O 1
13 M 1

As you can see from the transaction_table there have been only 4 transactions but in the output_table the total count of transaction is 6 which is wrongly been done due to the group by of category.

This is the output I am trying to get, I am not sure if a table like this can be made in sql (I am new to sql) :

desired output table

ndate Total transactions category transactions by category
11 2 M 2
F 1
12 1 F 1
O 1
13 1 M 1


Solution 1:[1]

select ndate, category,
    count(distinct transaction_id) as category_transactions,
    (
        select count(distinct transaction_id)
        from transaction_table t2
        where t2.ndate = t.ndate
    ) as ndate_transactions
from transaction_table t
group by ndate, category;

Or this should work but I can't promise it's faster/cheaper to run:

with data as (
    select ndate, category,
        dense_rank() over (partition by ndate order by transaction_id) as dr
    from transaction_table
), data2 as (
    select ndate, category,
        count(distinct dr) as category_transactions, max(dr) as max_dr
    from data
    group by ndate, category
)
select ndate, category, category_transactions,
    max(max_dr) over (partition by ndate) as ndate_transactions
from data2

To suppress repeated values you'd generally just use a lag():

case when ndate = lag(ndate) then null else ndate end

Solution 2:[2]

Consider below for BigQuery

select ndate, total_transactions,
  array(
    select as struct category, 
      count(distinct transaction_id) as transactions_by_category
    from t.arr
    group by category
  ) as transactions
from (
  select ndate, 
    count(distinct transaction_id) total_transactions,
    array_agg(struct(category, transaction_id)) arr
  from transaction_table
  group by ndate
) t

if applied to sample data in your question - output is

enter image description here

Solution 3:[3]

Close to what you want, just presented in a slightly different way.

Use case expressions to do conditional aggregation.

select ndate,
       sum(case when category = 'F' then 1 else 0 end) as f_category,
       sum(case when category = 'M' then 1 else 0 end) as m_category,
       sum(case when category = 'O' then 1 else 0 end) as o_category,
       count(distinct transaction_id) total_transactions
from transaction_table
group by ndate;

Core ANSI SQL-2016. I.e. expected to execute on any dbms.

Solution 4:[4]

You haven't told us what you expect, but I think from reading your question that you need to GROUP BY category, but you only want to count a transaction_id one time if it happened to exist in 2 categories.

We can't have it both ways, so we have to choose - if we list the category in GROUP BY then we have to decide how to count a transaction that happens in 2 categories.

Option 1: Choose "first" category (shown alphabetically)

WITH CTE_DEDUPE AS (
  SELECT 
    * 
  FROM 
    transaction_table QUALIFY ROW_NUMBER() OVER (
      PARTITION BY TRANSACTION_ID 
      ORDER BY 
        TRANSACTION_DT, 
        CATEGORY ASC
    ) = 1
), 
BASIC_AGGS AS (
  SELECT 
    TRANSACTION_DT, 
    CATEGORY, 
    COUNT(DISTINCT TRANSACTION_ID) as TRANSACTION_ID_COUNTDISTINCT, 
    COUNT(1) AS AGG_ROW_COUNT 
  FROM 
    CTE_DEDUPE 
  GROUP BY 
    TRANSACTION_DT, 
    CATEGORY
) 
SELECT 
  BASIC_AGGS.TRANSACTION_DT, 
  BASIC_AGGS.CATEGORY, 
  BASIC_AGGS.TRANSACTION_ID_COUNTDISTINCT 
FROM 
  BASIC_AGGS

Result:

TRANSACTION_DT CATEGORY TRANSACTION_ID_COUNTDISTINCT
11 F 1
12 F 1
13 M 1

There are other options but I will hold off until you tell us some more information about what you expect, and what specific instance of SQL you are running. I generated the SQL with Rasgo but I have that running on Snowflake right now, and the more complex we get with your solution might mean the syntax is slightly different depending on your system.

Edit: the markdown syntax is giving me issues in preview vs what the post renders as...

Solution 5:[5]

Results matching OP desired output (aside from sorting) but my assumption is the plan is the filter out the nulls in the end anyhow.

If sort needs to be exact a small edit can achieve it with an additional column, or with some more work (using a temp table or table variable) to show the exact result sorted and without the extra column, but here is what it looks like with the extra column: enter image description here

PS: even though this does get the result, I think it is going to be extremely inefficient and I highly recommend NOT using this solution but instead wait for one of the other guys to do it right. I mainly did this purely to see if I could (without caring about how horribly inefficient it would be).

select distinct
      "ndate" =
        case
            when row_number() over(partition by t1."ndate" order by "Total transactions" desc) = 1
                then t1."ndate"
            else null
        end
    , "Total transactions" =
        case
            when row_number() over(partition by t1."ndate" order by "Total transactions" desc) = 1
                then "Total transactions"
            else null
        end
    , t1."category"
    , "transactions by category"
From
(
select distinct
      "ndate"
    , "category"
    , "transactions by category" = count("transaction_id") over( partition by "ndate", "category")
from transaction_table as t1
) as t1
left join
(
select distinct
      "ndate"
    , "Total transactions" = count("transaction_id") over( partition by "ndate", "category" order by "ndate")
from transaction_table as t1
) as t2
on t1."ndate" = t2."ndate"
and t1."transactions by category" = t2."Total transactions"

output:

output:

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Mikhail Berlyant
Solution 3 jarlh
Solution 4
Solution 5