Running nltk.download in Azure Synapse notebook: ValueError: I/O operation on closed file

I'm experimenting with NLTK in an Azure Synapse notebook. When I try to run nltk.download('stopwords') I get the following error:

ValueError: I/O operation on closed file
Traceback (most recent call last):

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 782, in download
    show(msg.message)

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 775, in show
    subsequent_indent=prefix + prefix2 + " " * 4,

  File "/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1616860588116_0001/container_1616860588116_0001_01_000001/tmp/9026485902214290372", line 536, in write
    super(UnicodeDecodingStringIO, self).write(s)

ValueError: I/O operation on closed file

If I just run nltk.download() instead, I get the following error:

EOFError: EOF when reading a line
Traceback (most recent call last):

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 765, in download
    self._interactive_download()

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 1117, in _interactive_download
    DownloaderShell(self).run()

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 1143, in run
    user_input = input("Downloader> ").strip()

EOFError: EOF when reading a line

I'm hoping someone can help me understand what may be causing this and how to work around it. I haven't been able to find much information on where to go from here.

Edit: The code I am using to generate the error is the following:

import nltk
nltk.download('stopwords')




Solution 1:[1]

I ended up opening a support request with Microsoft and this was their response:

Synapse does not support arbitrary shell scripts which is where you would download the related model corpus for NLTK

They recommended I use sc.addFile, which I eventually got working. So if anyone else runs into this, here's what I did.

  1. Downloaded the NLTK stopwords here: http://nltk.org/nltk_data/
  2. Upload the stopwords to the following folder in storage: abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk_data/corpora/stopwords/
  3. Run the below code to import them


import nltk
from pyspark import SparkFiles

# Copy the stopwords directory from storage onto the cluster nodes
# (sc is the SparkContext provided by the Synapse notebook;
# the second argument makes the copy recursive)
sc.addFile('abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk_data/', True)

# Append the distributed copy to NLTK's data search path
nltk.data.path.append(SparkFiles.getRootDirectory() + '/nltk_data')

nltk.corpus.stopwords.words('english')
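As a quick sanity check that the list loaded, you can filter tokens against it. The sketch below uses a small hardcoded subset in place of the full nltk.corpus.stopwords.words('english') result, so it runs even without the corpus installed:

```python
# Stand-in subset for nltk.corpus.stopwords.words('english');
# the real English list has around 180 entries.
stop_words = {"the", "is", "a", "of", "and"}

tokens = "the quick brown fox is a friend of the lazy dog".split()
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['quick', 'brown', 'fox', 'friend', 'lazy', 'dog']
```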

Thanks!

Solution 2:[2]

I recently had this same issue on synapse analytics and ended up opening a support request ticket with Microsoft.

Note that Synapse does not natively support downloading NLTK stopwords, so you will have to download them yourself and place them in an Azure storage directory.

The Microsoft team recommended I use sc.addFile, which worked.

Just like the others above, here's what I did.

  1. Downloaded the NLTK stopwords here: http://nltk.org/nltk_data/
  2. Upload the stopwords to the following folder in storage: abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk-data/corpora/stopwords/
  3. Run the below code to import them

Note that I used 'nltk-data' in the directory names, rather than 'nltk_data', for my implementation.

import nltk
from pyspark import SparkFiles

# Copy the stopwords directory from storage onto the cluster nodes
# (sc is the SparkContext provided by the Synapse notebook;
# the second argument makes the copy recursive)
sc.addFile('abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk-data/', True)

# Append the distributed copy to NLTK's data search path
nltk.data.path.append(SparkFiles.getRootDirectory() + '/nltk-data')

nltk.corpus.stopwords.words('english')
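For reference, the layout NLTK expects under any directory you append to nltk.data.path is corpora/stopwords/<language>, where each language file is a plain word list, one stopword per line. A minimal self-contained sketch of that layout (the directory and contents here are illustrative, not the full corpus):

```python
import tempfile
from pathlib import Path

# Build the expected on-disk layout: <root>/corpora/stopwords/<language>
root = Path(tempfile.mkdtemp()) / "nltk-data"
stopwords_dir = root / "corpora" / "stopwords"
stopwords_dir.mkdir(parents=True)

# Each language file is a plain word list, one stopword per line
(stopwords_dir / "english").write_text("a\nan\nthe\nis\n")

# NLTK's stopwords reader effectively splits such a file into words
words = (stopwords_dir / "english").read_text().split()
print(words)  # ['a', 'an', 'the', 'is']
```

This is why step 2 above uploads the files into a corpora/stopwords/ subfolder: the path appended to nltk.data.path must be the root above corpora/ for the lookup to resolve.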


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 User181
Solution 2 Ioudom Foubi Jephte