Search and replace specific strings with floating point representations in Python

Problem: I'm trying to replace multiple specific sequences in a string with multiple specific floating point representations in Python.

I have an array of strings in a JSON file, which I load into a Python script via the json module. The array of strings:

{
  "LinesToReplace": [
    "_ __ ___ ____ _____ ______ _______      ",
    "_._ __._ ___._ ____._ _____._ ______._  ",
    "_._ _.__ _.___ _.____ _._____ _.______  ",
    "_._ __.__ ___.___ ____.____ _____._____ ",
    "_. __. ___. ____. _____. ______.        "
  ]
}

I load the JSON file via the json module:

import json

with open("myFile.json") as jsonFile:
  data = json.load(jsonFile)

I'm trying to replace the sequences of _ with specific substrings of floating point representations.

Specification:

  • The character to find in a string is a single _ or a sequence of multiple _.
  • The length of a _-sequence is unknown.
  • If a _-sequence is followed by a ., which is in turn followed by another _-sequence, the . is part of the _-sequence.
  • The . is used to specify decimals.
  • If the . isn't followed by a _-sequence, the . is not part of the _-sequence.
  • A sequence of _ and . is to be replaced by a floating point representation, e.g., %f1.0.
  • The representation depends on the _- and .-sequences.

Examples:

  • __ is to be replaced by %f2.0.
  • _.___ is to be replaced by %f1.3.
  • ____.__ is to be replaced by %f4.2.
  • ___. is to be replaced by %f3.0. (the trailing . is not part of the sequence, so it is kept).
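The mapping in the examples above can be sketched as a small helper (a minimal sketch, assuming each token contains at most one .; the name placeholder_format is mine):

```python
def placeholder_format(token: str) -> str:
    """Map a single '_'/'.' placeholder token to its %f representation."""
    if token.endswith("."):
        # The trailing '.' is not part of the sequence, so it is kept verbatim
        return f"%f{len(token) - 1}.0."
    if "." in token:
        # Width before the dot = integer digits, width after = decimals
        int_part, frac_part = token.split(".", 1)
        return f"%f{len(int_part)}.{len(frac_part)}"
    return f"%f{len(token)}.0"


print(placeholder_format("____.__"))  # %f4.2
```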

For the above JSON file, the result should be:

{
  "ReplacedLines": [
    "%f1.0 %f2.0 %f3.0 %f4.0 %f5.0 %f6.0 %f7.0      ",
    "%f1.1 %f2.1 %f3.1 %f4.1 %f5.1 %f6.1  ",
    "%f1.1 %f1.2 %f1.3 %f1.4 %f1.5 %f1.6  ",
    "%f1.1 %f2.2 %f3.3 %f4.4 %f5.5 ",
    "%f1.0. %f2.0. %f3.0. %f4.0. %f5.0. %f6.0.        "
  ]
}

Here is some code that tries to replace a single _ with %f1.0 (it doesn't work):

import json

with open("myFile.json") as jsonFile:
  data = json.load(jsonFile)
  strToFind = "_"

  for line in data["LinesToReplace"]:
    for idl, l in enumerate(line):
      # Note: line[idl + 1] raises IndexError on the last character, and `l`
      # is a single character, so slicing it cannot work as intended
      if (line[idl] == strToFind and line[idl+1] != ".") and (line[idl+1] != strToFind and line[idl-1] != strToFind):
        l = line[:idl] + "%f1.0" + line[idl+1:] # replace string (result is discarded)

Any ideas on how to do this? I have also thought about using regular expressions.

EDIT

The algorithm should check whether each character is a "_", i.e., it should be able to format this:

{
  "LinesToReplace": [
    "Ex1:_ Ex2:_. Ex3:._ Ex4:_._ Ex5:_._.    ",
    "Ex6:._._ Ex7:._._. Ex8:__._ Ex9: _.__   ",
    "Ex10: _ Ex11: _. Ex12: ._ Ex13: _._     ",
    "Ex5:._._..Ex6:.._._.Ex7:.._._._._._._._."
  ]
}

Solution:

{
  "ReplacedLines": [
    "Ex1:%f1.0 Ex2:%f1.0. Ex3:.%f1.0 Ex4:%f1.1 Ex5:%f1.1.    ",
    "Ex6:.%f1.1 Ex7:.%f1.1. Ex8:%f2.1 Ex9: %f1.2   ",
    "Ex10: %f1.0 Ex11: %f1.0. Ex12: .%f1.0 Ex13: %f1.1     ",
    "Ex5:.%f1.1..Ex6:..%f1.1.Ex7:..%f1.1.%f1.1.%f1.1.%f1.0."
  ]
}

I have tried the following algorithm based on the above criteria, but I can't figure out how to complete it:

def replaceFunc3(lines: list[str]) -> list[str]:
    result = []
    charToFind = '_'
    charMatrix = []

    # Find indices of all "_" in lines
    for line in lines:
        charIndices = [idx for idx, c in enumerate(line) if c == charToFind]
        charMatrix.append(charIndices)

    for (line, char) in zip(lines, charMatrix):
        if not char:  # No "_" in current line, append the whole line
            result.append(line)
        else:
            pass
            # result.append(Something)
            # TODO: Insert "%fx.x" on all the placeholders

    return result


Solution 1:[1]

You can use the re module's re.sub together with a replacement function that performs the logic on the capture groups:

import re

def replace(line):
    return re.sub(
        '(_+)([.]_+)?',
        lambda m: f'%f{len(m.group(1))}.{len(m.group(2) or ".")-1}',
        line,
    )

lines = [replace(line) for line in lines_to_replace]

Explanation of regex:

  • (_+) matches one or more underscores; the () part makes them available as a capture group (the first such group, i.e. m.group(1)).
  • ([.]_+)? optionally matches a dot followed by one or more trailing underscores (made optional by the trailing ?); the dot is part of a character class ([]) because otherwise it would have the special meaning "any character". The () make this part available as the second capture group (m.group(2)).
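To see the pattern at work on the trickier tokens from the question (including the trailing-dot case), here is a quick self-contained run:

```python
import re

def replace(line):
    # '(_+)' captures the integer-part underscores; '([.]_+)?' optionally
    # captures a dot plus the decimal-part underscores
    return re.sub(
        '(_+)([.]_+)?',
        lambda m: f'%f{len(m.group(1))}.{len(m.group(2) or ".")-1}',
        line,
    )

print(replace("_ __ ___."))           # %f1.0 %f2.0 %f3.0.
print(replace("Ex8:__._ Ex9: _.__"))  # Ex8:%f2.1 Ex9: %f1.2
```

Note how the trailing . in "___." is left untouched, because ([.]_+)? only matches when underscores follow the dot.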

Solution 2:[2]

Neat problem. Personally, here is how I would do it:

from pprint import pprint

d = {
    "LinesToReplace": [
        "_ __ ___ ____ _____ ______ _______      ",
        "_._ __._ ___._ ____._ _____._ ______._  ",
        "_._ _.__ _.___ _.____ _._____ _.______  ",
        "_._ __.__ ___.___ ____.____ _____._____ ",
        "_. __. ___. ____. _____. ______.        "
    ]
}


def get_replaced_lines(lines: list[str]) -> list[str]:
    result = []

    for line in lines:
        trimmed_line = line.rstrip()
        trailing_spaces = len(line) - len(trimmed_line)

        underscores = trimmed_line.split()
        repl_line = []

        for s in underscores:
            n = len(s)

            if '.' in s:
                if s.endswith('.'):
                    repl_line.append(f'%f{n - 1}.0.')
                else:
                    idx = s.index('.')
                    repl_line.append(f'%f{idx}.{n - idx - 1}')

            else:
                repl_line.append(f'%f{n}.0')

        result.append(' '.join(repl_line) + ' ' * trailing_spaces)

    return result


if __name__ == '__main__':
    pprint(get_replaced_lines(d['LinesToReplace']))

Output:

['%f1.0 %f2.0 %f3.0 %f4.0 %f5.0 %f6.0 %f7.0      ',
 '%f1.1 %f2.1 %f3.1 %f4.1 %f5.1 %f6.1  ',
 '%f1.1 %f1.2 %f1.3 %f1.4 %f1.5 %f1.6  ',
 '%f1.1 %f2.2 %f3.3 %f4.4 %f5.5 ',
 '%f1.0. %f2.0. %f3.0. %f4.0. %f5.0. %f6.0.        ']

If you're curious, I've also timed it against the alternate regex approach, and found this to be about 40% faster overall. I like this test because it shows that, in general, regex is a little slower than doing the work by hand. The regex approach is nice, though, because it is certainly shorter :-)

Here is my test code:

import re
from timeit import timeit

d = {
    "LinesToReplace": [
        "_ __ ___ ____ _____ ______ _______      ",
        "_._ __._ ___._ ____._ _____._ ______._  ",
        "_._ _.__ _.___ _.____ _._____ _.______  ",
        "_._ __.__ ___.___ ____.____ _____._____ ",
        "_. __. ___. ____. _____. ______.        "
    ]
}


def get_replaced_lines(lines: list[str]) -> list[str]:
    result = []
    dot = '.'
    space = ' '

    for line in lines:
        trimmed_line = line.rstrip()
        trailing_spaces = len(line) - len(trimmed_line)

        underscores = trimmed_line.split()
        repl_line = []

        for s in underscores:
            n = len(s)

            if dot in s:
                if s[n - 1] == dot:  # if last character is a '.'
                    repl_line.append(f'%f{n - 1}.0.')
                else:
                    idx = s.index(dot)
                    repl_line.append(f'%f{idx}.{n - idx - 1}')

            else:
                repl_line.append(f'%f{n}.0')

        result.append(space.join(repl_line) + space * trailing_spaces)

    return result


def get_replaced_lines_regex(lines_to_replace):
    return [re.sub(
        '(_+)([.]_+)?',
        lambda m: f'%f{len(m.group(1))}.{len(m.group(2) or ".")-1}',
        line,
    ) for line in lines_to_replace]


if __name__ == '__main__':
    n = 100_000

    time_1 = timeit("get_replaced_lines(d['LinesToReplace'])", number=n, globals=globals())
    time_2 = timeit("get_replaced_lines_regex(d['LinesToReplace'])", number=n, globals=globals())

    print(f'get_replaced_lines:        {time_1:.3f}')
    print(f'get_replaced_lines_regex:  {time_2:.3f}')

    print(f'The first (non-regex) approach is faster by {(1 - time_1 / time_2) * 100:.2f}%')

    assert get_replaced_lines(d['LinesToReplace']) == get_replaced_lines_regex(d['LinesToReplace'])

Results on my M1 Mac:

get_replaced_lines:        0.813
get_replaced_lines_regex:  1.359
The first (non-regex) approach is faster by 40.14%

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
