Python regex to print text from one specific pattern to another, but only if a specific string exists in between

So I have a file like this:

<html>
    <div>
        <h1>HOiihilasdl</h1>
    </div>
    <script src=https://example.com/file.js></script>
    <script>
        blabla
        blabla
        blabla
        blabla
        blabla
    </script>
    <script src=https://example.com/file.js></script>
    <script>
        blabla
        blabla
        cow
        blabla
        blabla
    </script>
</html>

I want to print everything from <script> to </script>, but only if the word cow appears in between (I want to do this using a Python regex).

The output would look like this:

    <script>
        blabla
        blabla
        cow
        blabla
        blabla
    </script>

I've searched through many answers, but I didn't find one that solves my problem.

I am also wondering whether it is possible to just return "script" if the word "cow" exists between <script> and </script>.

I'm using Python 3.10.4



Solution 1:[1]

I am not completely certain what you are going for here. If you only need to handle cases like the one in your question, a solution could look as follows: iterate through each line of the file and keep track of opening and closing tags. Whenever you meet an opening tag, you begin storing lines. If a pattern such as "cow" is not found before the next closing tag, the stored lines are discarded and the search starts over at the next opening tag.

Note: The solution below does not work for nested tags, but it can be altered to do so.

def find_pattern(file, pattern):
    lines = []
    start = False
    found_pattern = False

    with open(file, 'r') as f:
        # Iterate through the lines in the file
        for line in f:
            # Remove the trailing newline character
            line = line.rstrip("\n")

            # Remove the leading whitespace
            stripped_line = line.lstrip()

            # If we meet an opening tag such as <script>, we start keeping track of the lines until we meet a closing tag
            if not start and stripped_line.startswith("<") and "</" not in line:
                start = True

            # We only append lines while we are keeping track
            if start:
                lines.append(line)

                # If we find the pattern inside a tracked block, we set a flag
                if pattern in line:
                    found_pattern = True

            # If we meet a closing tag, we have two possibilities:
            # If we found the pattern, we stop and print. Otherwise, we discard the block and keep searching.
            if stripped_line.startswith("</"):
                if found_pattern:
                    break
                lines = []
                start = False

    # Print the stored lines only if the pattern was found
    if found_pattern:
        for line in lines:
            print(line)

find_pattern(file="t.txt", pattern="cow")
Output:
    <script>
        blabla
        blabla
        cow
        blabla
        blabla
    </script>
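
Since the question asks specifically for a regex, here is a minimal sketch of a regex-based alternative to the line-by-line approach above. It is not the method used in the code above. The function names blocks_containing and tag_if_contains are made up for illustration, and the sketch assumes that the tags are not nested and that the whole file fits in memory as one string. Regular expressions are generally fragile on real-world HTML, so a parser such as the standard library's html.parser is likely a safer choice for anything beyond simple files like this one.

```python
import re


def blocks_containing(text, word, tag="script"):
    """Return every <tag>...</tag> block whose content contains `word`.

    Hypothetical helper: assumes the tags are not nested.
    """
    t = re.escape(tag)
    # Optional leading indentation, an opening tag (with or without
    # attributes), then a non-greedy match up to the nearest closing tag.
    # re.DOTALL lets "." match newlines so a block can span several lines.
    block_re = re.compile(rf"[ \t]*<{t}\b[^>]*>(.*?)</{t}>", re.DOTALL)
    # \b restricts the search to whole words ("cow" but not "cowboy").
    word_re = re.compile(rf"\b{re.escape(word)}\b")
    # Group 1 is the content between the tags, so a word that only
    # appears inside the opening tag's attributes does not count.
    return [
        m.group(0)
        for m in block_re.finditer(text)
        if word_re.search(m.group(1))
    ]


def tag_if_contains(text, word, tag="script"):
    """Return the tag name if any <tag> block contains `word`, else None."""
    return tag if blocks_containing(text, word, tag) else None
```

With the contents of t.txt read into a string named text, printing each element of blocks_containing(text, "cow") should give the second inline <script> block from the question, and tag_if_contains(text, "cow") should return "script", which covers the second part of the question.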

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1