'Remove banned words then collapse the data

Gets 'exact' banned words matched against a set of blog comments. It then creates a result set collapsing the banned words into the owner (the blog comment) showing the banned words found and the counts.

Any advice on how to do this in a more efficient manner - less DML - as there likely will be thousands of comments?

Banned words:
though
man
about
hear

Blog comments:
'There are many of us.'
'The man.'
'So glad to hear about.' 
'So glad to hear about. A regular guy.'
'though, though, though.' 

1st entry: word is NOT banned - it's a variant of a banned word. Entry is NOT to be selected.
2nd entry: 1 banned word. Entry selected as 1 row, 1 banned word, counted as 1 banned word.
3rd entry: 2 different banned words. Entry selected as 1 row, 2 banned words separate by 
           commas, counted as 2 banned words.
4th entry: 2 different banned words. Entry selected as 1 row, 2 banned words separate by
           commas, counted as 2 banned words.
5th entry: 3 same banned words. Entry selected as 1 row, 1 banned word, counted as 3 banned 
           word.

Rules:
- Get the banned words in the blog comment.
- Only EXACT matches to the banned words. Do NOT include variants of the banned word.
- If there are more than 1 banned words in the same blog comment, only 1 row should be
  generated. 
- Generate the owner's row, include the banned words BannedWords column - non-unique banned 
  words separated by comma. 1 word for unique banned words.
  Count the banned words and include that column in the generated row. 

Desired Result - 4 rows:
BlogCommentId  BannedWords  t_Text1  t_Text2  t_Text3  CntOfBannedWords
2              man          e        f        g        1
3              hear,about   h        i        j        2
4              hear,about   k        l        m        2
5              though       n        o        p        3    

The exact banned word matching code:

DECLARE @tableFinal TABLE (
        t0_BlogCommentId int,
        t0_Word VARCHAR(50),
        t0_Text1 varchar(10),
        t0_Text2 varchar(10),
        t0_Text3 varchar(10),
        t0_CntOfBannedWords int)

DECLARE @table1 TABLE (
        t_BlogCommentId int,
        t_Word VARCHAR(50),
        t_Text1 varchar(10),
        t_Text2 varchar(10),
        t_Text3 varchar(10));

DECLARE @BlogComment TABLE (
        BlogCommentId INT IDENTITY PRIMARY KEY,
        BlogCommentContent VARCHAR(MAX),
        Text1 varchar(10),
        Text2 varchar(10),
        Text3 varchar(10));
INSERT INTO @BlogComment 
           (BlogCommentContent 
            ,Text1                    
            ,Text2                    
            ,Text3)
VALUES
('There are many of us.',                        'a', 'b', 'c')
('The man.',                                     'e', 'f', 'g')
('So glad to hear about.',                       'h', 'i', 'j')
('So glad to hear about. A regular guy.',        'k', 'l', 'm')
('though, though, though.',                      'n', 'o', 'p');

DECLARE @BannedWords TABLE (
        BannedWordsId INT IDENTITY PRIMARY KEY,
        Word varchar(250));
INSERT INTO @BannedWords (Word) VALUES
('though'),
('man'),
('about'),
('hear');

;WITH rs AS
(
    SELECT word = REPLACE(REPLACE([value],'.',''),',','')
                  ,BlogCommentId
                  ,Text1
                  ,Text2
                  ,Text3
    FROM @BlogComment 
    CROSS APPLY STRING_SPLIT(BlogCommentContent, SPACE(1))
)
INSERT @table1
    (t_Word,
     t_BlogCommentId,
     t_Text1,
     t_Text2,
     t_Text3 )  
SELECT bw.Word
       ,rs.BlogCommentId
       ,rs.Text1
       ,rs.Text2
       ,rs.Text3
FROM rs
INNER JOIN @BannedWords bw ON rs.word = bw.Word;

Result from the WITH above before collapsing. I want the 'Desired Result' to be generated here if possible in the WITH and not have to add the additional code below it.

SELECT *
FROM @table1

Results:
t_BlogCommentId  t_BannedWords  t_Text1  t_Text2  t_Text3
2                man            e        f        g
3                about          h        i        j
3                hear           h        i        j
4                about          k        l        m
4                hear           k        l        m
5                though         n        o        p
5                though         n        o        p
5                though         n        o        p

-- The 'additional code to collapse':

INSERT @tableFinal
      (t0_BlogCommentId
       ,t0_Word
       ,t0_Text1
       ,t0_Text2
       ,t0_Text3
       ,t0_CntOfBannedWords )
SELECT DISTINCT t_BlogCommentId
                ,''
                ,''
                ,''
                ,''
                ,0
FROM @table1

UPDATE @tableFinal
SET t0_Word = t_Word
    ,t0_Text1 = t_Text1
    ,t0_Text2 = t_Text2
    ,t0_Text3 = t_Text3
FROM @table1
WHERE t0_BlogCommentId = t_BlogCommentId

UPDATE @tableFinal
SET t0_Word = t0_Word + ',' + t_Word
FROM @table1
WHERE t0_BlogCommentId = t_BlogCommentId AND t0_Word <> t_Word

UPDATE @tableFinal
SET t0_CntOfBannedWords = (SELECT Count (t_Word)
                           FROM @table1
                           WHERE t0_BlogCommentId = t_BlogCommentId)

Result of collapsing - now it's my 'Desired Result' - but more work and NOT likely suitable if there are a thousands plus comments:

SELECT t0_BlogCommentId as BlogCommentId
       ,t0_Word as BannedWords
       ,t0_Text1 as Text1
       ,t0_Text2 as Text2
       ,t0_Text3 as Text3
       ,t0_CntOfBannedWords as CntOfBannedWords
FROM @tableFinal

BlogCommentId  BannedWords  t_Text1  t_Text2  t_Text3  CntOfBannedWords
2              man          e        f        g        1
3              hear,about   h        i        j        2
4              hear,about   k        l        m        2
5              though       n        o        p        3    

https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=982989411c1a3e3fb784f1e0e46fd9e1



Solution 1:[1]

Excellent post. Thank you for providing all the data in an easy to use format. You were really close to having this all put together. You just needed that final piece to push the values back together. Here I used STUFF for this. I just started with your "rs" cte. I added another cte and then used the old STUFF trick. This produces the desired output for your sample data.

DECLARE @BlogComment TABLE (
        BlogCommentId INT IDENTITY PRIMARY KEY,
        BlogCommentContent VARCHAR(MAX),
        Text1 varchar(10),
        Text2 varchar(10),
        Text3 varchar(10));
INSERT INTO @BlogComment 
           (BlogCommentContent 
            ,Text1                    
            ,Text2                    
            ,Text3)
VALUES
('There are many of us.',                        'a', 'b', 'c')
, ('The man.',                                     'e', 'f', 'g')
, ('So glad to hear about.',                       'h', 'i', 'j')
, ('So glad to hear about. A regular guy.',        'k', 'l', 'm')
, ('though, though, though.',                      'n', 'o', 'p');

DECLARE @BannedWords TABLE (
        BannedWordsId INT IDENTITY PRIMARY KEY,
        Word varchar(250));
INSERT INTO @BannedWords (Word) VALUES
('though'),
('man'),
('about'),
('hear');

;WITH rs AS
(
    SELECT word = REPLACE(REPLACE([value],'.',''),',','')
                  ,BlogCommentId
                  ,Text1
                  ,Text2
                  ,Text3
    FROM @BlogComment 
    CROSS APPLY STRING_SPLIT(BlogCommentContent, SPACE(1))
)
, ExpandedWords as
(
    SELECT bw.Word
       ,rs.BlogCommentId
       ,rs.Text1
       ,rs.Text2
       ,rs.Text3
    FROM rs
    INNER JOIN @BannedWords bw ON rs.word = bw.Word
)

select BlogCommentId
    , BannedWords = STUFF((select ', ' + e2.Word
                            from ExpandedWords e2
                            where e2.BlogCommentId = e1.BlogCommentId
                            --you could add an order by here if you want the list of words in a certain order.
                            FOR XML PATH('')), 1, 1, ' ')
    , e1.Text1
    , e1.Text2
    , e1.Text3
    , BannedWordsCount = count(*)
from ExpandedWords e1
group by e1.BlogCommentId
    , e1.Text1
    , e1.Text2
    , e1.Text3
order by e1.BlogCommentId

For a more modern version of sql server using STRING_AGG is a bit less verbose and obtuse than using STUFF and FOR XML. Here is how that might look. I also use ROW_NUMBER here so you can only return a single instance of a banned word if it appears multiple times in the input. Same concept as above so this is starting a bit later in the code.

, ExpandedWords as
(
    SELECT bw.Word
       , rs.BlogCommentId
       , rs.Text1
       , rs.Text2
       , rs.Text3
       , RowNum = ROW_NUMBER()over(partition by rs.BlogCommentId, bw.Word order by (select newid())) --order doesn't really matter here
    FROM rs
    INNER JOIN @BannedWords bw ON rs.word = bw.Word
)

select e.BlogCommentId
    , STRING_AGG(e.Word, ', ')
    , e.Text1
    , e.Text2
    , e.Text3
    , RowNum
from ExpandedWords e
where RowNum = 1
group by e.BlogCommentId
    , e.Text1
    , e.Text2
    , e.Text3
    , RowNum
order by e.BlogCommentId

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1