'How can get html inner tag in PosgreSQL

I have some data like that. And I want to get html.

with t(x) as (values( XMLPARSE(DOCUMENT ('<root><NotificationServiceDetails NotificationNo="0" AlarmCode="mail" AlarmStartTime="10:00:00" AlarmTime="0" Id ="2" ><NotificationServiceDetail Id="2"><Title><![CDATA[aaaaaaaaaaaaa]]></Title><ContentJson><![CDATA[
      <html lang="en">
      <head>
          <meta charset="utf-8"/>
          <meta name="viewport" content="width=device-width, initial-scale=1"/>
      </head>
          <body>
              <table style="font-family: 굴림,맑은 고딕; font-size:12px;color:#333333;border-width: 1px;border-color: #ddd; border-collapse: collapse; margin:5px;width:auto; min-width:600px;">
                  <tbody>
                      <tr>
                          <td colspan="2" style="border-width: 1px;padding: 10px;border-style: solid;border-color: #ddd; background-color: #f5f5f5; text-align:left; font-weight:bold; font-size:13px;">aaaaaaaaaaaaa</td>
                      </tr>
                      <tr>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Writer</td>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Nguyen Ngo Giap (General Mgr.)</td>
                      </tr>
                      <tr>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Date</td>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">2022-01-04 10:00~11:00</td>
                      </tr>
                      <tr>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Schedule Div.</td>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">테스트함</td>
                      </tr>
                      <tr>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Content</td>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">aaaaaaaaaa</td>
                      </tr>
                      <tr>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Share</td>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;"></td>
                      </tr>
                  </tbody>
              </table>
          </body>
          </html>
            ]]></ContentJson></NotificationServiceDetail></NotificationServiceDetails></root>'))))
    select 
        unnest((xpath('//NotificationServiceDetails/NotificationServiceDetail/@Id',t.x)))::text::integer as Id,
        unnest((xpath('//NotificationServiceDetails/NotificationServiceDetail/Title/text()',t.x))):: text::character varying as Title,
        unnest(xpath('//NotificationServiceDetails/NotificationServiceDetail/ContentJson/text()',t.x))::xml as ContentJson,
        t.x
     from t;

but the ContentJson column gives me special characters. "<..." I want the real html

Expect result for column ContentJson.

<html lang="en">
      <head>
          <meta charset="utf-8"/>
          <meta name="viewport" content="width=device-width, initial-scale=1"/>
      </head>
          <body>
              <table style="font-family: 굴림,맑은 고딕; font-size:12px;color:#333333;border-width: 1px;border-color: #ddd; border-collapse: collapse; margin:5px;width:auto; min-width:600px;">
                  <tbody>
                      <tr>
                          <td colspan="2" style="border-width: 1px;padding: 10px;border-style: solid;border-color: #ddd; background-color: #f5f5f5; text-align:left; font-weight:bold; font-size:13px;">aaaaaaaaaaaaa</td>
                      </tr>
                      <tr>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Writer</td>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Nguyen Ngo Giap (General Mgr.)</td>
                      </tr>
                      <tr>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Date</td>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">2022-01-04 10:00~11:00</td>
                      </tr>
                      <tr>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Schedule Div.</td>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">테스트함</td>
                      </tr>
                      <tr>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Content</td>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">aaaaaaaaaa</td>
                      </tr>
                      <tr>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Share</td>
                          <td style="padding: 15px; background-color: #f9f9f9; text-align:left;"></td>
                      </tr>
                  </tbody>
              </table>
          </body>
          </html>

How can I do that



Solution 1:[1]

So far, I've provided 5 different solutions - all with their own advantages and problems - you'll have to test on your own data and hardware to ensure that it's working for you!

I did the following (all the relevant code is available on the fiddle here):

CREATE TABLE t (x TEXT);

and populated it with some text similar to yours, but shorter to making testing easier:

INSERT INTO t VALUES 
($SOMETAG$with t(x) as (values( XMLPARSE(DOCUMENT ('<root><NotificationServiceDetails NotificationNo="0" AlarmCode="mail" AlarmStartTime="10:00:00" AlarmTime="0" Id ="2" ><NotificationServiceDetail Id="2"><Title><![CDATA[aaaaaaaaaaaaa]]></Title><ContentJson><![CDATA[
      <html lang="en">
      <head>
          <meta charset="utf-8"/>
          more stuff
          more stuff
        </table>
      </body>
       </html>
            ]]></ContentJson></NotificationServiceDetail></NotificationServiceDetails></root>'))))
    select 
        unnest((xpath('//NotificationServiceDetails/NotificationServiceDetail/@Id',t.x)))::text::integer as Id,
        unnest((xpath('//NotificationServiceDetails/NotificationServiceDetail/Title/text()',t.x))):: text::character varying as Title,
        unnest(xpath('//NotificationServiceDetails/NotificationServiceDetail/ContentJson/text()',t.x))::xml as ContentJson,
        t.x
     from t;$SOMETAG$);

I picked up the $SOMETAG$...$SOMETAG$ technique here - very helpful for insterting characters like single quote (') and backslash (\).

There is a "one-pass" solution possible - it justs takes time patience to work it out. Oh, and BTW, there were slight errors in my orininal solution - since corrected.

Part 1:

First I remove all characters up to but not including <html lang="en"> as follows:

SELECT SUBSTRING(x, STRPOS (x,'<html lang="en"'));

SUBSTRING and STRPOS were taken from this snippet.

Part 2:

Then, I reverse that string using the REVERSE() function

Part 3:

Finally, I use the SUBSTRING/STRPOS "trick" to chop of the other end of the string to the point of <ydob>

Part 4 is the pièce de résistance

I reREVERSE() the string to bring it back to its original state - minus the undesirable bits to give the required result.

1st Solution (pretty horrible):

SELECT REVERSE(SUBSTR( REVERSE(SUBSTR(x, strpos(x, '<html lang="en">'))), strpos( REVERSE(SUBSTR(x, strpos(x, '<html lang="en">'))), '>lmth/<'))) FROM t;

It looks a bit better (or is more legible at lease):

SELECT 
  REVERSE(
    SUBSTR( 
      REVERSE( 
        SUBSTR(x, strpos(x, '<html lang="en">'))), 
          strpos(
            REVERSE(
              SUBSTR(x, strpos(x, '<html lang="en">'))), '>lmth/<'))) FROM t;

Instead of these horrible constructions, I used CTE's (Common Table Expressions - AKA the WITH CLAUSE to do this as follows:

Code spends most of its life in maintenance, so easier to read code is easier to repair.

WITH cte1 AS
(
  SELECT REVERSE(SUBSTR(x, strpos(x, '<html lang="en">'))) AS s1 FROM t
), cte2 AS
(
  SELECT REVERSE(SUBSTR(s1, strpos(s1, 'ydob'))) AS s2 FROM cte1
)
SELECT * from cte2;

Result:

reverse
<html lang="en">
      <head>
          <meta charset="utf-8"/>
          more stuff
          more stuff
        </table>
      </body>
       </html>

The answer is the same for all of them!

2nd Solution (a bit more elegant - fiddle available here):

SELECT SPLIT_PART(x, '</html>', 1) from t;

Result:

split_part
with t(x) as (values( XMLPARSE(DOCUMENT ('<root><NotificationServiceDetails NotificationNo="0" AlarmCode="mail" AlarmStartTime="10:00:00" AlarmTime="0" Id ="2" ><NotificationServiceDetail Id="2"><Title><![CDATA[aaaaaaaaaaaaa]]></Title><ContentJson><![CDATA[
      <html lang="en">
      <head>
          <meta charset="utf-8"/>
          more stuff
          more stuff
        </table>
      </body>
               <<=== Note there are 7 spaces here

So, SPLIT_PART() cuts the string up to, but not including the delimiter - </html> in this case. So, using this as follows:

So, we combine two SPLIT_PARTs in a CTE as follows:

WITH cte AS ( SELECT LENGTH(split_part(x, '', 1)) AS beg, LENGTH(split_part(x, '', 1)) AS fin FROM t )

Result:

substring
 <html lang="en">
      <head>
          <meta charset="utf-8"/>
          more stuff
          more stuff
        </table>
      </body>
       </html>

-- I cannot understand why I have to add 8 characters to the 408 of the length?
--
-- 7 (the length of </html>) I could possibly get, but why 8?

Which is the desired result.

3nd Solution (also reasonably elegant - fiddle available here):

I didn't go through all of the steps this time - I combined them all in one query. The interested reader is invited to go through it line by line.

SELECT 
  strpos(x, '<html lang="en">'),
  strpos(x, '</html>'),
  strpos(x, '</html>') - strpos(x, '<html lang="en">'),
  substring(x FROM strpos(x, '<html lang="en">') 
    for  ((strpos(x, '</html>') + 8) - strpos(x, '<html lang="en">'))             )
FROM t; 

--
-- Again, I'm puzzled by the necessity to use 8 characters.
--
--

Result:

strpos  strpos  ?column?    substring
265     409     144     <html lang="en">
      <head>
          <meta charset="utf-8"/>
          more stuff
          more stuff
        </table>
      </body>
       </html>

Et voilà - the desired result!

4th Solution (a bit convoluted, but may be instructive - fiddle available here):

1st Step:

I split the lines into records in a table as follows:

SELECT 
  x.idx,
  LENGTH(x.string),
  x.string
FROM t,
  REGEXP_SPLIT_TO_TABLE(t.x, '\n') WITH ORDINALITY AS x(string, idx);

2nd step:

I pull out the records I want which correspond to the desired result as follows:

WITH cte1 AS
(
  SELECT 
    x.idx,
    LENGTH(x.string) AS ls,
    x.string
  FROM t,
    REGEXP_SPLIT_TO_TABLE(t.x, '\n') WITH ORDINALITY AS x(string, idx)
), cte2 AS
(
  SELECT idx, ls, string 
  FROM cte1
  WHERE string ~ '<html lang="en">' OR string ~ '</html>'
  ORDER BY idx
)
SELECT idx, ls, string
FROM cte1
WHERE idx BETWEEN 
  (SELECT MIN(idx) FROM cte2) AND (SELECT MAX(idx) FROM cte2);

Result:

idx     ls  string
2   22        <html lang="en">
3   12        <head>
4   33            <meta charset="utf-8"/>
5   20            more stuff
6   20            more stuff
7   16          </table>
8   13        </body>
9   14         </html>

As you can see, the string field contains the data we want!

Solution 5 - using a relatively simple regular expression - fiddle here

1st pass:

We check the output of the very handy PG_TYPEOF() function, which from the docco here does:

pg_typeof returns the OID of the data type of the value that is passed to it. This can be helpful for troubleshooting or dynamically constructing SQL queries. The function is declared as returning regtype, which is an OID alias type (see Section 8.18); this means that it is the same as an OID for comparison purposes but displays as a type name.

So, our first query is:

SELECT 
  REGEXP_MATCH(x, '^.*(<head>.*</html>).*'), 
  PG_TYPEOF(REGEXP_MATCH(x, '^.*(<head>.*</html>).*'))
  FROM t;

Result:

regexp_match                          pg_typeof
{"<head>
          <meta charset=\"utf-8\"/>
          more stuff
          more stuff
        </table>
      </body>
       </html>"}                       text[]

So, we have our data, but it's surrounded by braces (curly brackets) - but we know from our PG_TYPEOF() function is a text array, so we know that it's the first (only) element of that array, so therefore we can use the array element notation as follows:

SELECT 
  (REGEXP_MATCHES(x, '^.*(<head>.*</html>).*'))[1]  -- <<-- Note [1]
  FROM t;

Result:

regexp_matches
<head>
          <meta charset="utf-8"/>
          more stuff
          more stuff
        </table>
      </body>
       </html>

Same as the others! Which is the same as for the others!

Crude performance analysis

After 5 runs, the order of merit appears to be the following (a fiddle of the tests run may be found here. Times will vary according to other uses to which the server may be being put, but as I said, I found them to be fairly consistent in terms of time and always in the same order on the 5 runs that I examined.

In descending order of run time:

So, we can see that the most expensive algorithm is ~ 50 times more expensive than the most efficient one. The usual caveats about benchmarking with only one record and on an unknown system apply - although the results where fairly consistent over at least 5 runs. Always benchmark on your own system with your own data!

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1