'How to extract tables names in a SQL script?

Assuming there is a sql script:

select *
from (
  select col1 from test.test_a join test.test_a1 on a.col1 = a1.col1) a
left join test.test_b b 
on a.col1 = b.col2
left join
    test.test_c c
on b.col2  = c.col3
left jon
   (select 
       col4 
    from
       test.test_d) d
on c.col3  = d.col4

I am reading this question and tring to extract all TABLE NAMES after 'from' or 'join' in the script above using python. The difficuty is I am processing the script line by line, but the table name and the key word may be NOT in the same line.

So how to extract such table names from the script? Any suggestion is appreciated.



Solution 1:[1]

If you would like to use core python:

txt = """
select *
from (
  select col1 from test.test_a join test.test_a1 on a.col1 = a1.col1) a
left join test.test_b b 
on a.col1 = b.col2
left join
    test.test_c c
on b.col2  = c.col3
left jon
   (select 
       col4 
    from
       test.test_d) d
on c.col3  = d.col4"""

replace_list = ['\n', '(', ')', '*', '=']
for i in replace_list:
    txt = txt.replace(i, ' ')
txt = txt.split()
res = []
for i in range(1, len(txt)):
    if txt[i-1] in ['from', 'join'] and txt[i] != 'select': 
        res.append(txt[i])
print(res)

Solution 2:[2]

Just convert all multiple sequences of spaces (including newlines) to a single space, than you'll get a single line, and cat look for your table names using a regular expression.

import re
sql = """select *
from (
  select col1 from test.test_a join test.test_a1 on a.col1 = a1.col1) a
left join test.test_b b 
on a.col1 = b.col2
left join
    test.test_c c
on b.col2  = c.col3
left join
   (select 
       col4 
    from
       test.test_d) d
on c.col3  = d.col4"""
sql_line = re.sub('\s+', ' ', sql)
tbl_re = re.compile(r'(?:\b(?:from)|(?:join)\b\s+)(\w+)\b')
tablenames = tbl_re.findall(sql_line)
print(tablenames)

Please note that the table name extraction regexp is simplified and only intended as an example (you have to take into account possible quoting, etc.).

Solution 3:[3]

Here is a quick improvement on top of @r.user.05apr answer. Combining some bits from https://stackoverflow.com/a/46177004/82961

import re

txt = """
select *
from (
  select col1 from  test.test_a join test.test_a1 on a.col1 = a1.col1) a
left join test.test_b b 
on a.col1 = b.col2
left join
    test.test_c c -- from xxx
on b.col2  = c.col3 /* join xxxxx */
left jon
   (select 
       col4 
    from
       test.test_d) d
on c.col3  = d.col4"""

def get_tables(sql_str):
    # remove the /* */ comments
    sql_str = re.sub(r"/\*[^*]*\*+(?:[^*/][^*]*\*+)*/", "", sql_str)

    # remove whole line -- and # comments
    lines = [line for line in sql_str.splitlines() if not re.match("^\s*(--|#)", line)]

    # remove trailing -- and # comments
    sql_str = " ".join([re.split("--|#", line)[0] for line in lines])

    replace_list = ['\n', '(', ')', '*', '=']
    for i in replace_list:
        sql_str = sql_str.replace(i, ' ')
    sql_str = sql_str.split()
    res = []
    for i in range(1, len(sql_str)):
        if sql_str[i-1] in ['from', 'join'] and sql_str[i] != 'select': 
            res.append(sql_str[i])
    print(res)
    
get_tables(txt)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 r.user.05apr
Solution 2 Marco Pantaleoni
Solution 3 Faiz