Error while installing Spark on Google Colab
I am getting an error while installing Spark on Google Colab. It says:
tar: spark-2.2.1-bin-hadoop2.7.tgz: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
These were my steps:
- !apt-get install openjdk-8-jdk-headless -qq > /dev/null
- !wget -q http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
- !tar xf spark-2.2.1-bin-hadoop2.7.tgz
- !pip install -q findspark
Solution 1:[1]
The problem is the download link you are using to download Spark:
http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
Apache mirrors only host recent releases, so older versions are removed over time; the quiet wget then fails without a visible error, leaving no .tgz file for tar to open. To download Spark without this problem, use the archive website (https://archive.apache.org/dist/spark/), which keeps every release.
For example, the following download link from their archive website works fine:
https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
Here is the complete code to install and set up Java, Spark, and PySpark:
# install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# install spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
# extract the Spark archive into the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz
# point the environment variables at the Java and Spark installs
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"
# install findspark using pip
!pip install -q findspark
Python users should also install pyspark with the following command:
!pip install pyspark
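To verify the setup from Solution 1, you can initialize findspark and start a local session (a minimal sanity check; it assumes the install commands above ran unchanged):
import findspark
findspark.init()  # reads SPARK_HOME and puts pyspark on sys.path
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)  # should print 3.0.0 for the download above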
Solution 2:[2]
This error is caused by the link used in the second line of the code. The following snippet worked for me on Google Colab. Remember to change the Spark version to the latest one and adjust the SPARK_HOME path accordingly. You can find the latest versions here: https://downloads.apache.org/spark/
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop2.7"
import findspark
findspark.init()
Solution 3:[3]
This is the correct code. I just tested it.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://mirrors.viethosting.com/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark
Solution 4:[4]
# for the most recent update as of 02/29/2020
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop3.2.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop3.2
Solution 5:[5]
Just go to https://downloads.apache.org/spark/, choose the version you need from the folders, and follow the instructions in https://colab.research.google.com/github/asifahmed90/pyspark-ML-in-Colab/blob/master/PySpark_Regression_Analysis.ipynb#scrollTo=m606eNuQgA82
Steps:
- Go to https://downloads.apache.org/spark/
- Select a folder, for example: "spark-3.0.1/"
- Copy the file name you want, for example: "spark-3.0.1-bin-hadoop3.2.tgz" (it ends with .tgz)
- Paste it into the script below
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://downloads.apache.org/spark/FOLDER_YOU_CHOSE/FILE_YOU_CHOSE
!tar -xvf FILE_YOU_CHOSE
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/FILE_YOU_CHOSE"
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
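As a quick sanity check after the session is created, a trivial job confirms Spark is actually executing (a hypothetical snippet, reusing the spark variable from above):
# run a small action end-to-end to exercise the local executor
spark.range(5).show()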
Solution 6:[6]
I have tried the following commands and they seem to work.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop3.2.tgz
!pip install -q findspark
I got the latest version, changed the download URL, and added the v flag to the tar command for verbose output.
Solution 7:[7]
!pip install pyspark
It worked with just !pip install pyspark.
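This works because the pyspark package on PyPI bundles its own Spark distribution, so no separate download, SPARK_HOME, or findspark step is required (Colab images already ship a JDK). A minimal session under that assumption:
# pyspark from pip includes Spark itself; create a session directly
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)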
Solution 8:[8]
You are using a link to an old version; the following commands, which point at a newer version, will work:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xf spark-2.4.0-bin-hadoop2.7.tgz
!pip install -q findspark
Solution 9:[9]
To run Spark in Colab, we first need to install the dependencies in the Colab environment: Apache Spark 2.4.3 with Hadoop 2.7, Java 8, and findspark (used to locate the Spark install on the system). The installation can be carried out inside the Colab notebook.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
!tar xf spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark
If you get this error again (tar: Cannot open: No such file or directory), visit the Apache Spark download sites and get the latest build version:
1. https://www-us.apache.org/dist/spark/
2. http://apache.osuosl.org/spark/
Then replace spark-2.4.3 in the commands above with the latest version.
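Since these mirror links keep breaking as old releases are rotated out, one option is to read the mirror's directory listing at runtime instead of hard-coding a version. A rough sketch; the regex and URL layout are assumptions about the current listing format of https://downloads.apache.org/spark/:
import re
import urllib.request
# fetch the directory listing and extract the version folders
listing = urllib.request.urlopen("https://downloads.apache.org/spark/").read().decode()
versions = sorted(set(re.findall(r"spark-(\d+\.\d+\.\d+)/", listing)),
                  key=lambda v: tuple(map(int, v.split("."))))
print(versions[-1])  # newest release still hosted on the mirror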
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | Matteo |
Solution 4 | |
Solution 5 | eemilk |
Solution 6 | zonksoft |
Solution 7 | Deepa Vasanthkumar |
Solution 8 | Vipul Sanjay Charthal |
Solution 9 | Roshan Bagdiya |