'Why DISTINCT keyword is decreasing my search time and increasing performace in SQL Server
I am new to SQL and SQL Server and I am using SQL server 2012.
but recently I encountered a weird thing while working with SQL server. I got that adding DISTINCT in my query actually decreases my search time of query and returns result early as of not using DISTINCT in query.
I have a table with nearly 96 columns and I want to search all columns for a search string. I want to get the record if a search string exists anywhere in the table. For that I am using query like this
SELECT DISTINCT TOP 5000 [Row Id] ,[Account],[Created Date],[First Name],[A].[City] ... rest column
FROM [dbo].[Account]
WHERE ([Row Id] LIKE '%{search_string}%' OR [Account] LIKE '%{search_string}%' ... rest columns})
I know that this is not most effictive way but still it worked for me. I used DISTINCT as some rows are duplicates in table. and table contains nearly 1 million records.
But later my manager asked to decrease search time of query so I changed query and suddenly my search time increased by nearly 4x than previous. I inspected and got that I had removed DISTINCT to optimize query. But that made it slower. By slower I mean it's 4x slower than previous query.
I am confused that how can just adding or removing DISTINCT keyword can make so much difference. and removing DISTINCT should optimize my query but it's degrading it.
Note: I have experimented with query and it's really with just DISTINCT and not any other factor.
and even one thing to add while experiminting with query I redued the number of columns to search to cehek if searching more columns are creating problems. and I just search 5 columns now like
SELECT DISTINCT TOP 5000 [Row Id] ,[Account],[Created Date],[First Name],[A].[City]
FROM [dbo].[Account]
WHERE ([Row Id] LIKE '%{search_string}%' OR [Account] LIKE '%{search_string}%' ... 3 more columns})
but doing so still gives the same result. But here if I remove the TOP 5000 it again increases my search time even though query returns no(0) records.
If i run below query
SELECT DISTINCT [Row Id] ,[Account],[Created Date],[First Name],[A].[City]
FROM [dbo].[Account]
WHERE ([Row Id] LIKE '%{search_string}%' OR [Account] LIKE '%{search_string}%' ... 3 more columns})
it nearly 3x takes more time to serach. See only TOP is removed from above query.
Now I know that adding a TOP LIMIT is good. But here we are not getting any results so how can adding TOP LIMIT can decrease time.
Note: Table has no primary key or index columns.
Really confused with what is happening can any body help me with this, below is my SQL server info
product: Microsoft SQL Server Enterprise (64-bit)
OS: Microsoft Windows NT 6.2 (9200)
platform: NT x64
version: 11.0.2100.60
Memory: 65491 MB
processors: 8
Server Collation: SQL_Latin1_General_CP1_CI_AS
Is Clustered: False
Is HADR Enabled: False
Thanks in Advance.
Solution 1:[1]
The logical order of operations in SQL puts TOP
and OFFSET
after DISTINCT
. And DISTINCT
functions like GROUP BY
so it has to group all those columns. This is why you get a Hash Aggregate
.
So when you add DISTINCT
in a query with TOP
SELECT DISTINCT TOP (100)
a, b
FROM table
this is like doing the following
SELECT TOP (100)
*
FROM (
SELECT
a, b
FROM table
GROUP BY a, b
) t
so the whole table has to be scanned first.
Presumably, what you actually intended was for the grouping to occur after selecting the top 100 rows. This means that you can do a quick scan of the table for the first 100 rows, then group only those. You can do this like this
SELECT
a, b
FROM (
SELECT TOP (100)
a, b
FROM table
) t
GROUP BY a, b
This is one reason why
DISTINCT
should generally be avoided, it is a highly confusing operator.
Another thing to think about: Why does this need DISTINCT
at all? Why do you not have a primary key on this table? If you did then the DISTINCT
would be redundant as every row would be unique anyway.
One further point to make: Don't use column names that require quoting with
[]
, it's really annoying.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Charlieface |