'How to avoid double counting while using SQL
I have a transaction table which contains date, transaction_id and category (it is a sales table of clothes). It looks like this:
| ndate | transaction_id | category | 
|---|---|---|
| 11 | 111 | M | 
| 11 | 111 | F | 
| 11 | 112 | M | 
| 12 | 113 | F | 
| 12 | 113 | O | 
| 13 | 114 | M | 
I am applying a sql query to aggregate the data and make a new table to reduce the number of rows (as the data is too huge):
select ndate, category, count(distinct transaction_id)
from transaction_table
group by ndate, category;
The output that I am getting is this:
| ndate | category | transactions | 
|---|---|---|
| 11 | M | 2 | 
| 11 | F | 1 | 
| 12 | F | 1 | 
| 12 | O | 1 | 
| 13 | M | 1 | 
As you can see from the transaction_table there have been only 4 transactions but in the output_table the total count of transaction is 6 which is wrongly been done due to the group by of category.
This is the output I am trying to get, I am not sure if a table like this can be made in sql (I am new to sql) :
| ndate | Total transactions | category | transactions by category | 
|---|---|---|---|
| 11 | 2 | M | 2 | 
| F | 1 | ||
| 12 | 1 | F | 1 | 
| O | 1 | ||
| 13 | 1 | M | 1 | 
Solution 1:[1]
select ndate, category,
    count(distinct transaction_id) as category_transactions,
    (
        select count(distinct transaction_id)
        from transaction_table t2
        where t2.ndate = t.ndate
    ) as ndate_transactions
from transaction_table t
group by ndate, category;
Or this should work but I can't promise it's faster/cheaper to run:
with data as (
    select ndate, category,
        dense_rank() over (partition by ndate order by transaction_id) as dr
    from transaction_table
), data2 as (
    select ndate, category,
        count(distinct dr) as category_transactions, max(dr) as max_dr
    from data
    group by ndate, category
)
select ndate, category, category_transactions,
    max(max_dr) over (partition by ndate) as ndate_transactions
from data2
To suppress repeated values you'd generally just use a lag():
case when ndate = lag(ndate) then null else ndate end
    					Solution 2:[2]
Consider below for BigQuery
select ndate, total_transactions,
  array(
    select as struct category, 
      count(distinct transaction_id) as transactions_by_category
    from t.arr
    group by category
  ) as transactions
from (
  select ndate, 
    count(distinct transaction_id) total_transactions,
    array_agg(struct(category, transaction_id)) arr
  from transaction_table
  group by ndate
) t
if applied to sample data in your question - output is
Solution 3:[3]
Close to what you want, just presented in a slightly different way.
Use case expressions to do conditional aggregation.
select ndate,
       sum(case when category = 'F' then 1 else 0 end) as f_category,
       sum(case when category = 'M' then 1 else 0 end) as m_category,
       sum(case when category = 'O' then 1 else 0 end) as o_category,
       count(distinct transaction_id) total_transactions
from transaction_table
group by ndate;
Core ANSI SQL-2016. I.e. expected to execute on any dbms.
Solution 4:[4]
You haven't told us what you expect, but I think from reading your question that you need to GROUP BY category, but you only want to count a transaction_id one time if it happened to exist in 2 categories.
We can't have it both ways, so we have to choose - if we list the category in GROUP BY then we have to decide how to count a transaction that happens in 2 categories.
Option 1: Choose "first" category (shown alphabetically)
WITH CTE_DEDUPE AS (
  SELECT 
    * 
  FROM 
    transaction_table QUALIFY ROW_NUMBER() OVER (
      PARTITION BY TRANSACTION_ID 
      ORDER BY 
        TRANSACTION_DT, 
        CATEGORY ASC
    ) = 1
), 
BASIC_AGGS AS (
  SELECT 
    TRANSACTION_DT, 
    CATEGORY, 
    COUNT(DISTINCT TRANSACTION_ID) as TRANSACTION_ID_COUNTDISTINCT, 
    COUNT(1) AS AGG_ROW_COUNT 
  FROM 
    CTE_DEDUPE 
  GROUP BY 
    TRANSACTION_DT, 
    CATEGORY
) 
SELECT 
  BASIC_AGGS.TRANSACTION_DT, 
  BASIC_AGGS.CATEGORY, 
  BASIC_AGGS.TRANSACTION_ID_COUNTDISTINCT 
FROM 
  BASIC_AGGS
Result:
| TRANSACTION_DT | CATEGORY | TRANSACTION_ID_COUNTDISTINCT | 
|---|---|---|
| 11 | F | 1 | 
| 12 | F | 1 | 
| 13 | M | 1 | 
There are other options but I will hold off until you tell us some more information about what you expect, and what specific instance of SQL you are running. I generated the SQL with Rasgo but I have that running on Snowflake right now, and the more complex we get with your solution might mean the syntax is slightly different depending on your system.
Edit: the markdown syntax is giving me issues in preview vs what the post renders as...
Solution 5:[5]
Results matching OP desired output (aside from sorting) but my assumption is the plan is the filter out the nulls in the end anyhow.
If sort needs to be exact a small edit can achieve it with an additional column, or with some more work (using a temp table or table variable) to show the exact result sorted and without the extra column, but here is what it looks like with the extra column:

PS: even though this does get the result, I think it is going to be extremely inefficient and I highly recommend NOT using this solution but instead wait for one of the other guys to do it right. I mainly did this purely to see if I could (without caring about how horribly inefficient it would be).
select distinct
      "ndate" =
        case
            when row_number() over(partition by t1."ndate" order by "Total transactions" desc) = 1
                then t1."ndate"
            else null
        end
    , "Total transactions" =
        case
            when row_number() over(partition by t1."ndate" order by "Total transactions" desc) = 1
                then "Total transactions"
            else null
        end
    , t1."category"
    , "transactions by category"
From
(
select distinct
      "ndate"
    , "category"
    , "transactions by category" = count("transaction_id") over( partition by "ndate", "category")
from transaction_table as t1
) as t1
left join
(
select distinct
      "ndate"
    , "Total transactions" = count("transaction_id") over( partition by "ndate", "category" order by "ndate")
from transaction_table as t1
) as t2
on t1."ndate" = t2."ndate"
and t1."transactions by category" = t2."Total transactions"
output:
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source | 
|---|---|
| Solution 1 | |
| Solution 2 | Mikhail Berlyant | 
| Solution 3 | jarlh | 
| Solution 4 | |
| Solution 5 | 


