'Grouping ranges of data

I have segmented linear data to identify the minimum and maximum begin and end values of each range, now I want to consolidate overlapping ranges. This is easy if the data is easy. For example, if the data is always increasing for both the begin and end values. (see the first 4 rows in my sample input below) But where the data is not as uniform, I'm having trouble computing the begin and end of a group of overlapping rows by relating data sets.

It's possible I'm headed down a rabbit hole because of an early decision about how to structure my inputs. The source data doesn't contain the rangebegin and rangeend columns I'm using for inputs. This was my choice to try to help me analyze the data. There may be other options.

Since the volume of information below may be an XY problem, here are the basics of what I'm trying to solve.

In my case, I'm trying to identify one-mile segments of roadway where there have been at least 5 crashes, then combine overlapping ranges to draw as a single geometric feature on a map. So...

Given this data:
Crash A at mile 2.22
Crash B at mile 2.24
Crash C at mile 3.10
Crash D at mile 3.92
Crash E at mile 5.03
Crash F at mile 5.21

I would have a segment from 2.22 to 3.92 because there is less than a mile between each crash in that range. Then there's a gap of more than a mile, then the last two crashes are within a mile of each other. So the output would look like

begin end
2.22 3.92
5.03 5.21

Because of some other variables, the way I am combining the data thus far has produced some ranges that are not sequential. Some of the ranges are included in other ranges. That seems to be causing my biggest problems.

I'm using SQL Server (mostly 2016), but the concept should hold for any database system.

--- Update ---

I am rewriting the question to refer to a "known good" solution as a starting point.

SO moderators: If rewriting this question after receiving answers is wrong, let's fix it.

tl;dr: I'm developing a query that will likely ultimately be written as a series of common table expressions to serve as a data set for a reporting system. I am using a reporting system, not a database management tool like SSMS, so I can't run a "batch". I also don't believe a temporary stored procedure is appropriate, nor do I want to create objects (stored procedures, functions,...) on the database server to solve this single reporting problem. I do realize that the nature of the problem may dictate that I do these things, but I am searching for a query-only solution first.

On the surface, this problem appears to be a classic gaps-and-islands problem, for which I have been using a similar solution for years. But this one involves not only overlapping ranges, but also ranges that don't sort well and so don't work with the classic solution. Starting with https://bertwagner.com/posts/gaps-and-islands/, as recommended by Josh presents a problem. Other overlapping gaps-and-islands solutions I have seen appear to be the same, and so have the same problem. Replacing the last line of inputs in Bert Wagner's solution with...

SELECT '11/2/2017', '11/4/2017' UNION ALL
SELECT '10/27/2017', '11/15/2017' UNION ALL
SELECT '11/5/2017', '11/10/2017'

...helps demonstrate my problem.

Bert's solution is written as a deeply-nested set of subqueries. To improve readability, I would do this as a series of common table expressions.

In an attempt to demonstrate the problem and keep each step in my attempt simple and reproducible, I have broken this down to a series of temporary tables, each with its own outputs. These temporary tables equate to the common table expressions.

DROP TABLE IF EXISTS #OverlappingDateRanges;
CREATE TABLE #OverlappingDateRanges (StartDate date, EndDate date);

INSERT INTO #OverlappingDateRanges
SELECT '8/24/2017', '9/23/2017'  UNION ALL
SELECT '8/24/2017', '9/20/2017'  UNION ALL 
SELECT '9/23/2017', '9/27/2017'  UNION ALL 
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017','10/18/2017' UNION ALL 
SELECT '10/25/2017','11/3/2017'  UNION ALL
SELECT '11/2/2017', '11/4/2017' UNION ALL
SELECT '10/27/2017', '11/15/2017' UNION ALL
SELECT '11/5/2017', '11/10/2017'

SELECT * FROM #OverlappingDateRanges;
StartDate  EndDate
---------- ----------
2017-08-24 2017-09-23
2017-08-24 2017-09-20
2017-09-23 2017-09-27
2017-09-25 2017-10-10
2017-10-17 2017-10-18
2017-10-25 2017-11-03
2017-11-02 2017-11-04
2017-10-27 2017-11-15
2017-11-05 2017-11-10
DROP TABLE IF EXISTS #Groups;
SELECT ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN
, StartDate
, EndDate
, LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
INTO #Groups
FROM #OverlappingDateRanges;

SELECT *
FROM #Groups;
RN     StartDate  EndDate    PreviousEndDate
------ ---------- ---------- ---------------
1      2017-08-24 2017-09-20 NULL
2      2017-08-24 2017-09-23 2017-09-20
3      2017-09-23 2017-09-27 2017-09-23
4      2017-09-25 2017-10-10 2017-09-27
5      2017-10-17 2017-10-18 2017-10-10
6      2017-10-25 2017-11-03 2017-10-18
7      2017-10-27 2017-11-15 2017-11-03
8      2017-11-02 2017-11-04 2017-11-15
9      2017-11-05 2017-11-10 2017-11-04    <-- Notice PreviousEndDate < StartDate
DROP TABLE IF EXISTS #Islands;
SELECT *
, CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd
, SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY RN) AS IslandId
INTO #Islands
FROM #Groups g;

SELECT *
FROM #Islands;
RN     StartDate  EndDate    PreviousEndDate IslandStartInd IslandId
------ ---------- ---------- --------------- -------------- -----------
1      2017-08-24 2017-09-20 NULL            1              1
2      2017-08-24 2017-09-23 2017-09-20      0              1
3      2017-09-23 2017-09-27 2017-09-23      0              1
4      2017-09-25 2017-10-10 2017-09-27      0              1
5      2017-10-17 2017-10-18 2017-10-10      1              2
6      2017-10-25 2017-11-03 2017-10-18      1              3
7      2017-10-27 2017-11-15 2017-11-03      0              3
8      2017-11-02 2017-11-04 2017-11-15      0              3
9      2017-11-05 2017-11-10 2017-11-04      1              4        <-- This is within IslandId = 3
SELECT MIN(StartDate) AS IslandStartDate
, MAX(EndDate) AS IslandEndDate
FROM #Islands
GROUP BY IslandId
ORDER BY IslandStartDate;
IslandStartDate IslandEndDate
--------------- -------------
2017-08-24      2017-10-10
2017-10-17      2017-10-18
2017-10-25      2017-11-15
2017-11-05      2017-11-10   <-- This is not a distinct island.

As you can see, the 4th island should be included in the third.

And reviewing Bert's example again in detail, and rewriting it (I'm a tactile learner) has given me some other thoughts regarding how I might solve this. I am posting this here first to be fair to others who may have spent time reading and working on this and care about SO points.

sql


Solution 1:[1]

I'm not sure how to put everything in one query but using adhoc/Stored procedure can solve this problem.

I have tested in my Sybase ASE, please try same approach with your SQL/Oracle.

Hope this help.

1. Add column SEQ (auto increment id) into source table

SEQ RANGE
1 2.22
2 2.24
3 3.10
4 3.92
5 5.03
6 5.21

2. Looping and calculate difference between 2 records

 CREATE TABLE TEST_RESULT
(
    BEGIN_MILE NUMERIC(10,2),
    END_MILE NUMERIC(10,2)
)

--Begin ad-hoc code
DECLARE @iLoop numeric(10), @recNo numeric(10), @sMsg VARCHAR(100), @tRANGE numeric(10,2), @rangebegin numeric(10,2), @rangeend numeric(10,2)

SET @recNo = (SELECT COUNT(*) FROM LLE_TEST_SEQ)
SET @iLoop = 3

--Set 1st begin
SET @rangebegin = (SELECT RANGE FROM LLE_TEST_SEQ WHERE SEQ = 1)
--Set 1st end
SET @rangeend = (SELECT RANGE FROM LLE_TEST_SEQ WHERE SEQ = 2)

WHILE (@iLoop <= @recNo)
    BEGIN
        SET @tRANGE = (SELECT RANGE FROM LLE_TEST_SEQ WHERE SEQ = @iLoop)
        
        --Display for testing
        PRINT '*****'
        SET @sMsg = CONVERT(VARCHAR, @rangeend)
        PRINT @sMsg     
        SET @sMsg = CONVERT(VARCHAR, @tRANGE)
        PRINT @sMsg
        PRINT '*****'
        
        --If delta < 1, increase end point
        IF @tRANGE - @rangeend < 1
            BEGIN
                SET @rangeend = @tRANGE
            END
        --Otherwise, break, insert and create new set
        ELSE
            BEGIN
                INSERT INTO TEST_RESULT (BEGIN_MILE, END_MILE) VALUES (@rangebegin, @rangeend)
                SET @rangebegin = @tRANGE
                SET @rangeend = @tRANGE
            END     
        
        
        SET @iLoop = @iLoop + 1
        
    END
INSERT INTO TEST_RESULT (BEGIN_MILE, END_MILE) VALUES (@rangebegin, @rangeend)
--End ad-hoc code

Output

SELECT * FROM TEST_RESULT
BEGIN_MILE END_MILE
2.22 3.92
5.03 5.21

Solution 2:[2]

Adding a column to #a and adding a few more data rows to address some scenarios not covered in original data set:

drop table if exists #a

create table #a
(pass        int
,category    varchar(5)
,rangebegin  int
,rangeend    int)

insert #a
values 
  (0, 'a', 1, 19)
, (0, 'a', 1, 27)
, (0, 'a', 9, 37)
, (0, 'a', 14, 42)
, (0, 'a', 45, 65)
, (0, 'a', 47, 70)
, (0, 'a', 52, 62)
, (0, 'a', 65, 65)

, (0, 'a', 81, 83)       -- standalone row/range
, (0, 'a', 95, 95)       -- standalone row/range; begin == end

, (0, 'b', 3, 33)
, (0, 'b', 17, 22)
, (0, 'b', 21, 44)

, (0, 'c', 30, 35)       -- no two rows
, (0, 'c', 33, 40)       -- define the
, (0, 'c', 39, 42)       -- entire range

One idea using a loop to iteratively consolidate ranges:

declare @pass           int,
        @prevcount      int,
        @currcount      int

select  @pass = 0,
        @prevcount = (select count(*) from #a where pass = 0),
        @currcount = 0

while   @currcount != @prevcount
begin
        insert  #a
                (pass, category, rangebegin, rangeend)
        select  distinct
                @pass + 1,
                a1.category,
                (select min(a2.rangebegin)
                 from   #a a2
                 where  a2.category     = a1.category
                 and    a2.pass         = @pass
                 and    (       a1.rangebegin   between  a2.rangebegin and a2.rangeend
                         or     a1.rangeend     between  a2.rangebegin and a2.rangeend)),
                (select max(a3.rangeend)
                 from   #a a3
                 where  a3.category     = a1.category
                 and    a3.pass         = @pass
                 and    (       a1.rangebegin   between  a3.rangebegin and a3.rangeend
                         or     a1.rangeend     between  a3.rangebegin and a3.rangeend))
        from    #a a1
        where   a1.pass = @pass

        select  @pass      = @pass + 1,
                @prevcount = @currcount,
                @currcount = (select count(*) from #a where pass = @pass+1)

--      print '############# %1!', @pass
--      select  * from #a where pass = @pass order by 2,3,4
end

select  category,
        rangebegin,
        rangeend
from    #a
where   pass = @pass
order by 1,2,3

This generates:

 category rangebegin  rangeend
 -------- ----------- -----------
 a                  1          42
 a                 45          70
 a                 81          83
 a                 95          95
 b                  3          44
 c                 30          42

NOTES:

  • the 'final' solution is actually determined twice due to looping until @currcount = @prevcount
  • designed/tested on a SAP(Sybase) ASE 16 instance (ASE does not have support for windows functions)
  • then verified the code also works in SQL Server 2019 (fiddle)
  • it may be possible to rewrite this using a recursive CTE ... maybe? yes? no? shrug (I'm a bit rusty with recursive CTEs)
  • for larger volumes of data OP may want to look at indexing #a
  • once a new set of data is inserted into #a the 'previous' set of data could be deleted as said data will no longer be needed ... OP's choice

Solution 3:[3]

Since we're sorting by StartDate and EndDate, we should be able to use the minimum of this and all future start dates and the maximum of this and all past end dates. This is just one more step beyond what Bert Wagner published.

DROP TABLE IF EXISTS #OverlappingDateRanges;
CREATE TABLE #OverlappingDateRanges (StartDate date, EndDate date);

INSERT INTO #OverlappingDateRanges
SELECT '8/24/2017', '9/23/2017'  UNION ALL
SELECT '8/24/2017', '9/20/2017'  UNION ALL 
SELECT '9/23/2017', '9/27/2017'  UNION ALL 
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017', '10/18/2017' UNION ALL 
SELECT '10/25/2017', '11/3/2017'  UNION ALL
SELECT '11/2/2017', '11/4/2017' UNION ALL
SELECT '10/27/2017', '11/15/2017' UNION ALL
SELECT '11/5/2017', '11/10/2017'


;
WITH
Ranges as (
    SELECT MIN(StartDate) OVER (ORDER BY StartDate, EndDate ROWS BETWEEN 0 FOLLOWING AND UNBOUNDED FOLLOWING) as StartDate
    , MAX(EndDate) OVER (ORDER BY StartDate, EndDate ROWS BETWEEN UNBOUNDED PRECEDING AND 0 PRECEDING) as EndDate
    FROM #OverlappingDateRanges
),
Groups as (
    SELECT StartDate
    , EndDate
    , LAG(StartDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousStartDate
    , LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
    FROM Ranges
),
Islands as (
    SELECT StartDate
    , EndDate
    , CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd
    , SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY StartDate, EndDate) AS IslandId
    FROM Groups
)

SELECT MIN(StartDate) AS IslandStartDate
, MAX(EndDate) AS IslandEndDate
FROM Islands
GROUP BY IslandId
ORDER BY IslandStartDate;
IslandStartDate IslandEndDate
--------------- -------------
2017-08-24      2017-10-10
2017-10-17      2017-10-18
2017-10-25      2017-11-15

I'm not finding any start/end combinations that are causing problems. I'll let this marinate for a few days before marking it as correct in case somebody can poke some holes in it.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Hana
Solution 2
Solution 3 dougp