How to avoid double counting while using SQL
I have a transaction table which contains ndate, transaction_id and category (it is a sales table for clothing). It looks like this:
ndate | transaction_id | category |
---|---|---|
11 | 111 | M |
11 | 111 | F |
11 | 112 | M |
12 | 113 | F |
12 | 113 | O |
13 | 114 | M |
I am applying a SQL query to aggregate the data into a new table, to reduce the number of rows (the data is very large):
select ndate, category, count(distinct transaction_id) as transactions
from transaction_table
group by ndate, category;
The output that I am getting is this:
ndate | category | transactions |
---|---|---|
11 | M | 2 |
11 | F | 1 |
12 | F | 1 |
12 | O | 1 |
13 | M | 1 |
As you can see from transaction_table, there were only 4 distinct transactions, but in the output table the counts sum to 6. The inflation comes from grouping by category: transactions 111 and 113 each fall under two categories, so each is counted twice.
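A quick way to see the inflation (a minimal check against the sample table above, standard SQL):

select count(distinct transaction_id) as total_transactions  -- returns 4 for the sample data
from transaction_table;

select sum(transactions) as summed_transactions  -- returns 6: 111 and 113 are each counted under two categories
from (
    select ndate, category, count(distinct transaction_id) as transactions
    from transaction_table
    group by ndate, category
) g;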
This is the output I am trying to get; I am not sure if a table like this can be made in SQL (I am new to SQL):
ndate | Total transactions | category | transactions by category |
---|---|---|---|
11 | 2 | M | 2 |
| | | F | 1 |
12 | 1 | F | 1 |
| | | O | 1 |
13 | 1 | M | 1 |
Solution 1:[1]
select ndate, category,
       count(distinct transaction_id) as category_transactions,
       (
           -- correlated subquery: distinct transactions across the whole ndate
           select count(distinct transaction_id)
           from transaction_table t2
           where t2.ndate = t.ndate
       ) as ndate_transactions
from transaction_table t
group by ndate, category;
Or this should work but I can't promise it's faster/cheaper to run:
with data as (
    select ndate, category,
           -- dense_rank numbers the distinct transaction_ids within each ndate,
           -- so the max rank per ndate equals that date's distinct transaction count
           dense_rank() over (partition by ndate order by transaction_id) as dr
    from transaction_table
), data2 as (
    select ndate, category,
           count(distinct dr) as category_transactions, max(dr) as max_dr
    from data
    group by ndate, category
)
select ndate, category, category_transactions,
       max(max_dr) over (partition by ndate) as ndate_transactions
from data2
To suppress repeated values you'd generally just use lag():

case when ndate = lag(ndate) over (order by ndate, category) then null else ndate end
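As a minimal sketch of that idea applied to the first query above (the window ordering by ndate, category is an assumption; use whatever display order you need, but keep it consistent with the final ORDER BY):

select case when ndate = lag(ndate) over (order by ndate, category)
            then null else ndate end as ndate,
       case when ndate = lag(ndate) over (order by ndate, category)
            then null else ndate_transactions end as total_transactions,
       category,
       category_transactions as transactions_by_category
from (
    select ndate, category,
           count(distinct transaction_id) as category_transactions,
           (select count(distinct transaction_id)
            from transaction_table t2
            where t2.ndate = t.ndate) as ndate_transactions
    from transaction_table t
    group by ndate, category
) g
order by g.ndate, g.category;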
Solution 2:[2]
Consider below for BigQuery
select ndate, total_transactions,
array(
select as struct category,
count(distinct transaction_id) as transactions_by_category
from t.arr
group by category
) as transactions
from (
select ndate,
count(distinct transaction_id) total_transactions,
array_agg(struct(category, transaction_id)) arr
from transaction_table
group by ndate
) t
If applied to the sample data in your question, the output is one row per ndate, containing total_transactions plus a nested array of per-category transaction counts.
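If a flat, one-row-per-category shape is easier to consume downstream, the same result can be re-flattened with UNNEST (a sketch in BigQuery syntax, wrapping the query above unchanged):

select ndate, total_transactions, tc.category, tc.transactions_by_category
from (
    -- the query from above, unchanged
    select ndate, total_transactions,
           array(
               select as struct category,
                      count(distinct transaction_id) as transactions_by_category
               from t.arr
               group by category
           ) as transactions
    from (
        select ndate,
               count(distinct transaction_id) total_transactions,
               array_agg(struct(category, transaction_id)) arr
        from transaction_table
        group by ndate
    ) t
) r, unnest(r.transactions) as tc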
Solution 3:[3]
Close to what you want, just presented in a slightly different way.
Use CASE expressions to do conditional aggregation.
select ndate,
sum(case when category = 'F' then 1 else 0 end) as f_category,
sum(case when category = 'M' then 1 else 0 end) as m_category,
sum(case when category = 'O' then 1 else 0 end) as o_category,
count(distinct transaction_id) total_transactions
from transaction_table
group by ndate;
This is core ANSI SQL:2016, i.e. it is expected to execute on any DBMS.
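One caveat: SUM(CASE ...) counts rows, not distinct transactions, so if a transaction_id could ever repeat within the same category the per-category figures would be inflated again. A count(distinct ...) variant avoids that (a sketch, still standard SQL):

select ndate,
       -- CASE with no ELSE yields NULL, which count(distinct ...) ignores
       count(distinct case when category = 'F' then transaction_id end) as f_category,
       count(distinct case when category = 'M' then transaction_id end) as m_category,
       count(distinct case when category = 'O' then transaction_id end) as o_category,
       count(distinct transaction_id) as total_transactions
from transaction_table
group by ndate;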
Solution 4:[4]
You haven't told us exactly what you expect, but from reading your question I think you need to GROUP BY category, yet you only want to count a transaction_id one time even if it happens to exist in 2 categories. We can't have it both ways, so we have to choose: if we list the category in GROUP BY, then we have to decide how to count a transaction that appears in 2 categories.
Option 1: Choose the "first" category (alphabetically)
WITH CTE_DEDUPE AS (
    -- keep one row per TRANSACTION_ID, preferring the alphabetically first CATEGORY
    SELECT *
    FROM transaction_table
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY TRANSACTION_ID
        ORDER BY TRANSACTION_DT, CATEGORY ASC
    ) = 1
),
BASIC_AGGS AS (
    -- aggregate the de-duplicated rows per date and category
    SELECT TRANSACTION_DT,
           CATEGORY,
           COUNT(DISTINCT TRANSACTION_ID) AS TRANSACTION_ID_COUNTDISTINCT,
           COUNT(1) AS AGG_ROW_COUNT
    FROM CTE_DEDUPE
    GROUP BY TRANSACTION_DT, CATEGORY
)
SELECT BASIC_AGGS.TRANSACTION_DT,
       BASIC_AGGS.CATEGORY,
       BASIC_AGGS.TRANSACTION_ID_COUNTDISTINCT
FROM BASIC_AGGS
Result:
TRANSACTION_DT | CATEGORY | TRANSACTION_ID_COUNTDISTINCT |
---|---|---|
11 | F | 1 |
12 | F | 1 |
13 | M | 1 |
There are other options, but I will hold off until you tell us more about what you expect and which SQL dialect you are running. I generated the SQL with Rasgo, which I have running on Snowflake right now, and as the solution gets more complex the syntax might differ slightly depending on your system.
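Since QUALIFY is Snowflake-specific, the same dedupe can also be written portably with a filtered ROW_NUMBER; the sketch below is an assumption on my part and uses the column names from the question (ndate) rather than TRANSACTION_DT:

with cte_dedupe as (
    -- keep one row per transaction_id, preferring the alphabetically first category
    select ndate, transaction_id, category,
           row_number() over (partition by transaction_id
                              order by ndate, category) as rn
    from transaction_table
)
select ndate, category, count(distinct transaction_id) as transactions
from cte_dedupe
where rn = 1
group by ndate, category;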
Solution 5:[5]
The results match the OP's desired output (aside from sorting), but my assumption is that the plan is to filter out the NULLs at the end anyhow.
If the sort needs to be exact, a small edit can achieve it with an additional column, or with some more work (using a temp table or table variable) the exact result can be produced sorted and without the extra column; here is what it looks like with the extra column.
PS: even though this does get the result, I think it is going to be extremely inefficient and I highly recommend NOT using this solution; instead, wait for one of the other answers to do it right. I mainly did this to see if I could, without caring about how horribly inefficient it would be.
select distinct
"ndate" =
case
when row_number() over(partition by t1."ndate" order by "Total transactions" desc) = 1
then t1."ndate"
else null
end
, "Total transactions" =
case
when row_number() over(partition by t1."ndate" order by "Total transactions" desc) = 1
then "Total transactions"
else null
end
, t1."category"
, "transactions by category"
From
(
select distinct
"ndate"
, "category"
, "transactions by category" = count("transaction_id") over( partition by "ndate", "category")
from transaction_table as t1
) as t1
left join
(
select distinct
"ndate"
, "Total transactions" = count("transaction_id") over( partition by "ndate", "category" order by "ndate")
from transaction_table as t1
) as t2
on t1."ndate" = t2."ndate"
and t1."transactions by category" = t2."Total transactions"
Output: matches the desired table from the question (aside from sorting), with NULLs in place of the repeated ndate and Total transactions values.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Mikhail Berlyant |
Solution 3 | jarlh |
Solution 4 | |
Solution 5 | |