Running nltk.download in Azure Synapse notebook ValueError: I/O operation on closed file
I'm experimenting with NLTK in an Azure Synapse notebook. When I try to run nltk.download('stopwords') I get the following error:
ValueError: I/O operation on closed file
Traceback (most recent call last):
File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 782, in download
show(msg.message)
File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 775, in show
subsequent_indent=prefix + prefix2 + " " * 4,
File "/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1616860588116_0001/container_1616860588116_0001_01_000001/tmp/9026485902214290372", line 536, in write
super(UnicodeDecodingStringIO, self).write(s)
ValueError: I/O operation on closed file
If I just run nltk.download() instead, I get the following error:
EOFError: EOF when reading a line
Traceback (most recent call last):
File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 765, in download
self._interactive_download()
File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 1117, in _interactive_download
DownloaderShell(self).run()
File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 1143, in run
user_input = input("Downloader> ").strip()
EOFError: EOF when reading a line
I'm hoping someone can give me some guidance on what may be causing this and how to work around it. I haven't been able to find much information on where to go from here.
Edit: The code I am using to generate the error is the following:
import nltk
nltk.download('stopwords')
Update I ended up opening a support request with Microsoft and this was their response:
Synapse does not support arbitrary shell scripts which is where you would download the related model corpus for NLTK
They recommended I use sc.addFile, which I ended up getting to work. So if anyone else finds this, here's what I did.
- Downloaded the NLTK stopwords here: http://nltk.org/nltk_data/
- Upload the stopwords to the following folder in storage: abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk_data/corpora/stopwords/
- Run the code below to import them
import nltk
from pyspark import SparkFiles
# Add the stopwords directory from storage (True = copy recursively)
sc.addFile('abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk_data/', True)
# Append the distributed copy to NLTK's data search path
nltk.data.path.append(SparkFiles.getRootDirectory() + '/nltk_data')
nltk.corpus.stopwords.words('english')
Thanks!
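Once the corpus is on nltk.data.path, nltk.corpus.stopwords.words('english') returns a plain list of lowercase words, so filtering tokens against it is ordinary Python. A minimal sketch of that step, with a few entries hardcoded for illustration in place of the real list loaded via nltk.corpus:

```python
# Hypothetical sketch: in the notebook, stop_words would be
# set(nltk.corpus.stopwords.words('english')); a handful of
# entries are hardcoded here so the example is self-contained.
stop_words = set(["the", "is", "a", "of", "and"])

def remove_stopwords(tokens):
    """Drop tokens found in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The price of the stock is a function of demand".split()
print(remove_stopwords(tokens))  # → ['price', 'stock', 'function', 'demand']
```

Building a set from the list first is worthwhile because membership tests against a set are O(1), which matters when filtering large token streams.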
Solution 1:[1]
I ended up opening a support request with Microsoft and this was their response:
Synapse does not support arbitrary shell scripts which is where you would download the related model corpus for NLTK
They recommended I use sc.addFile, which I ended up getting to work. So if anyone else finds this, here's what I did.
- Downloaded the NLTK stopwords here: http://nltk.org/nltk_data/
- Upload the stopwords to the following folder in storage: abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk_data/corpora/stopwords/
- Run the code below to import them
import nltk
from pyspark import SparkFiles
# Add the stopwords directory from storage (True = copy recursively)
sc.addFile('abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk_data/', True)
# Append the distributed copy to NLTK's data search path
nltk.data.path.append(SparkFiles.getRootDirectory() + '/nltk_data')
nltk.corpus.stopwords.words('english')
Thanks!
Solution 2:[2]
I recently had this same issue on Azure Synapse Analytics and ended up opening a support ticket with Microsoft.
Note that Synapse does not natively support NLTK stopwords, so you will have to download the stopwords yourself and put them in an Azure storage directory.
The Microsoft team recommended I use sc.addFile, which worked.
Just like the answers above, here's what I did.
- Downloaded the NLTK stopwords here: http://nltk.org/nltk_data/
- Upload the stopwords to the following folder in storage: abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk-data/corpora/stopwords/
- Run the code below to import them
Note that I used 'nltk-data' rather than 'nltk_data' in the directory names for my implementation.
import nltk
from pyspark import SparkFiles
# Add the stopwords directory from storage (True = copy recursively)
sc.addFile('abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk-data/', True)
# Append the distributed copy to NLTK's data search path
nltk.data.path.append(SparkFiles.getRootDirectory() + '/nltk-data')
nltk.corpus.stopwords.words('english')
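Whichever root name you pick, what matters is that each entry on nltk.data.path contains the layout NLTK searches for: a corpora/stopwords directory holding one plain-text file per language, one word per line. A minimal sketch of that expected layout built in a temporary directory (the file contents here are illustrative, not the real NLTK data bundle):

```python
import os
import tempfile

# Build the directory layout NLTK expects under a data root:
#   <root>/corpora/stopwords/english  (one word per line)
root = tempfile.mkdtemp()
stopwords_dir = os.path.join(root, "corpora", "stopwords")
os.makedirs(stopwords_dir)

# Illustrative contents; the real file ships with the NLTK data bundle.
with open(os.path.join(stopwords_dir, "english"), "w") as f:
    f.write("the\nis\na\n")

# nltk.data.path.append(root) would make this root searchable;
# here we just read the file the way the corpus reader would.
with open(os.path.join(stopwords_dir, "english")) as f:
    words = f.read().split()
print(words)  # → ['the', 'is', 'a']
```

If the upload lands at the wrong depth (for example the language files sit directly under nltk-data/ instead of nltk-data/corpora/stopwords/), NLTK raises a LookupError even though the path was appended correctly.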
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | User181 |
| Solution 2 | Ioudom Foubi Jephte |