'Parse txt file in Pandas

I have the table in file and it looks like that:

+-------------+-----------------+---------------+---------------+--------------+
|number       |name             |very           |column4        |very long     |
|             |                 |long column3   |               |column 5      |
+-------------+-----------------+---------------+---------------+--------------+
|1.1          |Some             |       Х       |       Х       |2373833636.85 |
|             |position 1.1     |               |               |              |
+-------------+-----------------+---------------+---------------+--------------+
|1.2          |Some             |      0,5      |      0,70     |237383.90     |
|             |position 1.2     |               |               |              |
+-------------+-----------------+---------------+---------------+--------------+
|1.3          |Some             |       Х       |      0,5      |2864583.90    |
|             |position 1.3     |               |               |              |
+-------------+-----------------+---------------+---------------+--------------+

First row is the header of table. First two columns are of string type, other three columns are of number (float) type.

I need to convert this table to pandas DataFrame:

number name very long column3 column4 very long column 5
1.1 Some position 1.1 2373833636.85
1.2 Some position 1.2 0.5 0.7 237383.9
1.3 Some position 1.3 0.5 2864583.9


Solution 1:[1]

You could try this:

1. step: Converting the file into a csv-file (adjust the file names accordingly):

import csv
from itertools import groupby

with open("file.txt", "r") as fin, open("file.csv", "w") as fout:
    writer = csv.writer(fout)
    for skip, lines in groupby(fin, lambda l: l.startswith("+")):
        if skip:
            continue
        lines = (
            (part.strip() for part in line.strip().strip("|").split("|"))
            for line in lines
        )
        line = (" ".join(filter(None, parts)) for parts in zip(*lines))
        writer.writerow(line)
  • I'm using itertools.groupby() from the standard library to grab the lines that don't start with + in connected blocks.
  • Then the lines get .strip()ed of (1) the whitespace at the ends, (2) the | at the ends, .split() on |, and the individual components .strip()ed again.
  • Then the columns get build via zip(): Only the truthy parts get " ".join()ed.
  • The so build lines are written into a csv-file that looks like:
number,name,very long column3,column4,very long column 5
1.1,Some position 1.1,?,?,2373833636.85
1.2,Some position 1.2,"0,5","0,70",237383.90
1.3,Some position 1.3,?,"0,5",2864583.90

2. step: Reading into a dataframe and some post processing:

import pandas as pd

df = pd.read_csv("file.csv", na_values=[chr(1061)])
df.iloc[:, 0] = df.iloc[:, 0].astype(str)
for no in (2, 3):
    df.iloc[:, no] = df.iloc[:, no].str.replace(",", ".").astype(float)
  • Use pd.read_csv() to read the csv-file into a dataframe, with the character with unicode codepoint 1061 added to the NaN-values.
  • Convert the first column to string.
  • Replace the , in columns 3 and 4 with a proper decimal point and convert them to float.

Result:

  number               name  very long column3  column4  very long column 5
0    1.1  Some position 1.1                NaN      NaN        2.373834e+09
1    1.2  Some position 1.2                0.5      0.7        2.373839e+05
2    1.3  Some position 1.3                NaN      0.5        2.864584e+06

Solution 2:[2]

The example you show us is a table in Markdown format. Pandas itself doesn't offer a way to parse (understand) that. I also don't know another package that is able to breakdown a Markdown table into its logical components.

The answer to your question is that it is not possible to parse that with pandas.

You could write your own Python code to parse that but I wouldn't advice you to do that.

My advice is to go back to your data provider and tell them that they should provide you the data in a machine readable format e.g. CSV.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 buhtz