Python: Changing the original data using a for loop
I have some very large txt files (> 2 GB) where the data quality is poor. In some columns that should be numeric, values below 1000 use '.' as the decimal point (e.g. 473.71886), but values above 1000 are written like 7.541,72419, i.e. ',' as the decimal point and '.' as the thousands separator.
I have already read the text file using pd.read_csv with the command below:
df = pd.read_csv('mseg.txt', delimiter='#|#', nrows=1000, engine='python')
I tried to build a regular expression to match the affected values, but it doesn't work:
pattern = r"[0-9]+[\.][0-9]+[,][0-9]+"
I was thinking of using the code below to correct the problem, but it doesn't work either (to test the code I used pattern2 = ",").
for i in df.iloc[:,-5]:
    df3 = []
    if re.search(pattern2, i):
        k = i.replace(".", "")
        print(k)
        df3.append(k)
    else:
        df3.append(k)
return dfe3
The print(k) inside the loop seems to work fine, but when I inspect df3 afterwards I get the output below:
['\x00 \x003\x004\x00\x006\x006\x005\x00,\x002\x001\x007\x006\x000\x00']
Could anyone help?
Solution 1:[1]
I would suggest to do the following:
If there is a ',' in the number, replace it with a '.', but remove the existing '.' thousands separators first. That turns 1.234,567 into 1234,567 and then into 1234.567, so all of your numbers end up in the same format.
df3 = []
for i in df.iloc[:, -5]:
    if ',' in i:
        # remove thousands separators, then turn the decimal comma into a point
        i = i.replace('.', '').replace(',', '.')
    df3.append(i)
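The same cleanup can also be done without an explicit Python loop, using pandas' vectorized string methods (a sketch assuming the column of interest is named 'val'; adapt the selection to df.iloc[:, -5] as needed):

```python
import pandas as pd

# Sample data in the two formats described in the question
df = pd.DataFrame({'val': ['473.71886', '7.541,72419']})

# Rows containing a comma use ',' as decimal point and '.' as thousands separator
mask = df['val'].str.contains(',')
df.loc[mask, 'val'] = (df.loc[mask, 'val']
                       .str.replace('.', '', regex=False)    # drop thousands separators
                       .str.replace(',', '.', regex=False))  # decimal comma -> point
df['val'] = df['val'].astype(float)
print(df['val'].tolist())  # [473.71886, 7541.72419]
```

Note the regex=False flag: without it, str.replace would treat '.' as a regex wildcard and blank out every character.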
Solution 2:[2]
You can try this:
>>> df
0
0 473.71886
1 7.541,72419
>>> df[0].str.split(r'[^\d]') \
        .apply(lambda x: f"{''.join(x[:-1])}.{x[-1]}") \
        .astype(float)
0     473.71886
1    7541.72419
dtype: float64
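The same split-and-rejoin idea can be written as a plain function, which may be easier to test in isolation (a sketch; the function name normalize is mine, not from the answer):

```python
import re

def normalize(s):
    # Split on every non-digit character; the last piece is always
    # the fractional part, everything before it is the integer part.
    parts = re.split(r'[^\d]', s)
    return float(''.join(parts[:-1]) + '.' + parts[-1])

print(normalize('473.71886'))    # 473.71886
print(normalize('7.541,72419'))  # 7541.72419
```

Note that this approach assumes every value has a fractional part: a purely thousands-grouped integer such as '7.541' would be misread as 7.541.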
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Lorenz Hufe |
| Solution 2 | |