Python: Changing the original data using a for loop

I have some really big txt files (> 2 GB) where the data quality is poor. In some columns (which should be numeric), values below 1000.00 use '.' as the decimal point (e.g. 473.71886), but values above 1000.00 look like 7.541,72419, i.e. ',' is the decimal point and '.' is the thousands separator.

I have already read the text file using pd.read_csv with the command below:

df = pd.read_csv('mseg.txt',delimiter=("#|#"),nrows=(1000),engine = 'python')

I tried to build a regular expression for the mixed-separator format, but it doesn't work: pattern = "[0-9]+[\.][0-9]+[,][0-9]+"
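For what it's worth, that pattern does match the mixed-separator form when written as a raw string (a minimal check, using the sample values from the question):

```python
import re

# Raw string avoids backslash-escape surprises; matches digits '.' digits ',' digits
pattern = r"[0-9]+[\.][0-9]+[,][0-9]+"

print(re.search(pattern, "7.541,72419"))  # matches the mixed-separator form
print(re.search(pattern, "473.71886"))    # None: plain decimal point, no comma
```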

I was thinking of using the code below to correct the problem, but it doesn't work. (In the code below I used pattern2 = "," to test it.)

for i in df.iloc[:,-5]:
    df3 = []
    if re.search(pattern2,i):
        k= i.replace(".","")
        print(k)
        df3.append(k)
    else:
        df3.append(k)
return dfe3

The print(k) inside the loop seems to work fine, but when I then inspect df3 I get the output below:

['\x00 \x003\x004\x00\x006\x006\x005\x00,\x002\x001\x007\x006\x000\x00']

Could anyone help?



Solution 1:[1]

I would suggest doing the following:

If there is a ',' in the number, replace it with a '.', but get rid of the '.' characters first. So a 1.234,567 becomes 1234,567 and then 1234.567. After that, all of your numbers are in the same format.

df3 = []
for i in df.iloc[:, -5]:
    if ',' in i:
        # Drop the thousands separators first, then switch the decimal comma
        i = i.replace('.', '').replace(',', '.')
    df3.append(i)
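A self-contained version of that loop (the sample values and column name are illustrative, not from the question's file):

```python
import pandas as pd

df = pd.DataFrame({"value": ["473.71886", "7.541,72419"]})

cleaned = []
for v in df["value"]:
    if "," in v:
        # European format: drop thousands '.' then turn the decimal ',' into '.'
        v = v.replace(".", "").replace(",", ".")
    cleaned.append(float(v))

df["value"] = cleaned
print(df["value"].tolist())  # [473.71886, 7541.72419]
```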

Solution 2:[2]

You can try this:

>>> df
             0
0    473.71886
1  7.541,72419
>>> df[0].str.split(r'[^\d]') \
...      .apply(lambda x: f"{''.join(x[:-1])}.{x[-1]}") \
...      .astype(float)
0     473.71886
1    7541.72419
Name: 0, dtype: float64
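An equivalent vectorized variant (not from the answer above, just a common alternative) applies the two replacements only to the rows that contain a comma, via a boolean mask:

```python
import pandas as pd

s = pd.Series(["473.71886", "7.541,72419"])

mask = s.str.contains(",")
# For European-formatted rows: strip '.' thousands separators, then ',' -> '.'
s = s.where(~mask, s.str.replace(".", "", regex=False)
                    .str.replace(",", ".", regex=False))
print(s.astype(float).tolist())  # [473.71886, 7541.72419]
```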

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Lorenz Hufe
Solution 2: (unattributed)