Extracting a .7z File into a Pandas Data Frame
I am using a Jupyter notebook (Google Colab) to try to extract data from a .7z file into a pandas DataFrame, using Linux commands. The data is from http://untroubled.org/spam/ . I wish to extract only the data from the 2020-01.7z file. So far I have:
!wget http://untroubled.org/spam/2020-01.7z
!7z x 2020-01.7z
import pandas as pd
import py7zr
archive = py7zr.SevenZipFile('2020-01.7z', mode='r')
archive.extractall(path="/tmp")
with open('2020-01.7z', 'r') as myfile:
    myfile.read()
mydf = pd.DataFrame(myfile)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 2: invalid start byte
I'm not really sure what the "/tmp" means. I know there is a way to do this; I just don't yet understand these commands well enough to use them. Any help is appreciated.
Solution 1:[1]
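The UnicodeDecodeError comes from open('2020-01.7z', 'r'), which tries to read the compressed archive itself as UTF-8 text. A .7z file is binary, so the data has to come from the extracted files instead. The "/tmp" passed to extractall is simply the directory the extracted files are written to. Both !7z x and py7zr extract the archive, so only one of them is needed. Below is a minimal sketch of one way to go from there, assuming the archive unpacks into a directory tree with one plain-text file per message. The names read_messages and load_archive and the column names filename and text are hypothetical, chosen for illustration only.

```python
from pathlib import Path


def read_messages(root):
    """Return one record per regular file found under `root` (hypothetical helper)."""
    records = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Assumption: each file is one text message. Messages are not
            # guaranteed to be valid UTF-8, so undecodable bytes are
            # replaced instead of raising UnicodeDecodeError.
            text = path.read_text(encoding="utf-8", errors="replace")
            records.append({"filename": path.name, "text": text})
    return records


def load_archive(archive_path, out_dir):
    """Extract a .7z archive into `out_dir` and load its files into a DataFrame."""
    # Third-party imports are kept local so read_messages works without them.
    import pandas as pd
    import py7zr

    with py7zr.SevenZipFile(archive_path, mode="r") as archive:
        archive.extractall(path=out_dir)  # out_dir plays the role of "/tmp"
    return pd.DataFrame(read_messages(out_dir), columns=["filename", "text"])
```

In the notebook this might be called as mydf = load_archive('2020-01.7z', '/tmp/spam'), where /tmp/spam is an arbitrary output directory. Using a dedicated subdirectory rather than /tmp itself keeps unrelated temporary files out of the DataFrame.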
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Shihab Masri |