'Reading accounting information with regex

Context

In my daily job I lost a lot of time putting values in excel from a balance sheet manually, so I start learning how to code in python to speed up my work.

Problem

I am trying to create a regex that is able to detect the value of one account and send me the money that the account haves. The issue is there are multiple software's and the regex gets very complex because of that.

Examples of different outputs from accounting software's

As you see there is always a pattern: One Account Name that is a string that can have multiple spaces and special characters; One column of values of credit and debt with a comma and ,

The account number can have commas or not in it

Example 1

Random number Account number Account Name credit$ Debt$ Value of account
10 53 Another Financial Instruments 1.100,00 1.500,00 400,00 D
531 Café C/C (Taxes) 100,00 50,00 50,00 C
230 5321 Example 100,00 50,00 50,00 C
54 Example 2 300,00 500,00 200,00 D

This the most complex output because they put a random number behind the account number so there is accounts with a double space between the random number and number of account and other do not have any space

10 __ 53 -> with space 54 -> no space because of no random number Also the column value of the account that is equals to Credit - Debt if credit > Debt they had a C into the value, else they had a D

Example 2

Account number Account Name credit$ Debt$ Value of account
5.3 Another Financial Instruments 1.100,00 1.500,00 400,00
5.3.1 Café C/C (Taxes) 100,00 50,00 50,00
5.3.2.1 Example 100,00 50,00 50,00
5.4 Example 2 300,00 500,00 200,00

In this example there is no random number or D or C in the value of the account, but there is a comma in the account number.

Example 3

Account number Account Name credit$ Debt$
53 Another Financial Instruments 1.100,00 1.500,00
531 Café C/C (Taxes) 100,00 50,00
5321 Example 100,00 50,00
54 Example 2 300,00 500,00

This is the easiest one, no comma in account numbers no random number, but there is no Account value column so the accountant always need to insert both debt and credit values and calculate the difference.

My newbie regex only for the 1st example (\d{2,4}\s) (\w+(\s+\w+)+(\s+\w+)*) (\s\d,\d\d\s) it's not working for all the values.

The end goal is to export the output of python to excel and do a PROCV of the account number and the value to do some analysis.

I have tried to convert the pdf to excel or to word but it gets broken.

Updade:

Example 1 real life example:

In my examples above I didnt mention that there are 3 more columns debt and credit of the month and the difference between them than you have the debt and credit of the year and the final value. I didnt mention to simplify the examples.

The real life are on portuguese the names sry for that.

12 DEPÓSITOS À ORDEM 55.884,44 42.165,87 750.656,46 427.920,52 322.735,94 D

2 121 Banco BPI, Sa 55.884,44 42.152,35 750.503,75 427.792,60 322.711,15 D

2 122 Novo Banco s a 0,00 13,52 152,71 127,92 24,79 D

13 OUTROS DEPÓSITOS BANCÁRIOS 0,00 0,00 87.277,22 0,00 87.277,22 D

131 DEPOSITOS A PRAZO 0,00 0,00 87.277,22 0,00 87.277,22 D

Example 3

The example 3 only haves debt and credit of the month and debt and credit of the year

12 Depósitos à ordem 4.237.864,32 3.524.858,99 713.005,33 0,00

13 Outros depósitos bancários 400.000,00 0,00 400.000,00 0,00

14 Outros Instrumentos financeiros 49.992,28 0,00 49.992,28 0,00

33 Matérias-primas, subsidiárias e de consumo 1.721.079,97 843.549,38 877.530,59 0,00

-- Ty for the comments!



Solution 1:[1]

For the example data, you might use an approach with capture groups and for the Value of account create an optional group at the end of the pattern.

If you have more columns, you can extend the pattern.

When using code, you can check for the existence of the last optional group.

^(\d+(?:\.\d+)*)\s+(.+?)\s+(\d{1,3}(?:\.\d{3})*,\d{2})\s+(\d{1,3}(?:\.\d{3})*,\d{2})(?:\s+(\d{1,3}(?:\.\d{3})*,\d{2}(?:\s+[CD])?))?$

The pattern matches:

  • ^ Start of string
  • (\d+(?:\.\d+)*) Capture group 1, match 1+ digits and optionally repeat matching a dot and 1+ digits for Account number
  • \s+ Match 1+ whitespace chars
  • (.+?) Capture group 2, match 1+ chars, as least as possible for Account name
  • \s+ Match 1+ whitespace chars
  • (\d{1,3}(?:\.\d{3})*,\d{2}) Capture group 3, the money format for credit$
  • \s+ Match 1+ whitespace chars
  • (\d{1,3}(?:\.\d{3})*,\d{2}) Capture group 4, the money format for Debt$
  • (?: Non capture group
    • \s+ Match 1+ whitespace chars
    • (\d{1,3}(?:\.\d{3})*,\d{2}(?:\s+[CD])?) capture group 5, the money format for Value of account with an optional part matching 1+ whitespace chars and either C or D
  • )? Close non caputure group and make it optional
  • $ End of string

See a regex101 demo.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 The fourth bird