Using PySpark in Google Colab
This is my first question here after using Stack Overflow a lot, so correct me if I give inaccurate or incomplete info.
Up until this week I had a Colab notebook set up to run PySpark, following one of the many guides I found on the internet, but this week it started throwing a few different errors.
The code I used is pretty much this:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop2.7"
import findspark
findspark.init()
I have tried changing the Java version from 8 to 11 and using all of the available Spark builds on https://downloads.apache.org/spark/, changing the HOME paths accordingly. As suggested in one guide, I used pip freeze
to check the Spark version used in Colab; it said pyspark 3.0.0, so I tried all the 3.0.0 builds, and all I keep getting is this error:
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly
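In case it helps with debugging, here is a quick check of whether the py4j zip is actually where findspark expects it (under $SPARK_HOME/python/lib, as far as I can tell):
import os, glob
# check that the download/extract step actually produced the Spark directory
spark_home = os.environ.get("SPARK_HOME", "")
print("SPARK_HOME:", spark_home, "exists:", os.path.isdir(spark_home))
# findspark looks for the bundled py4j zip in python/lib; an empty list here
# would mean the tarball was never downloaded or extracted correctly
print(glob.glob(os.path.join(spark_home, "python", "lib", "py4j*.zip")))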
I don't understand much about why Java is needed for this, but I also tried installing py4j through !pip install py4j,
and it says it is already installed when I do. I have tried every guide I could find on the internet, but I can't run my Spark code anymore. Does anyone know how to fix this?
I only use Colab for college work because my PC is quite outdated, and I don't know much about this setup, but I really need to get this notebook running reliably. How do I know when it's time to update the builds I import?
Solution 1:[1]
Following this Colab notebook, which worked for me:
First cell:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
and that pretty much installs PySpark.
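If you want to confirm the install before moving on, a quick check (just a sketch) is:
# confirm the pip-installed PySpark is importable and see which version you got
import pyspark
print(pyspark.__version__)
# confirm Java 8 is visible to the runtime
!java -version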
But do follow the steps below to also launch the Spark UI, which is super-helpful for understanding physical plans, storage usage, and much more. Also: it has nice graphs ;)
Second cell:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
# create the configuration and pin the Spark UI to port 4050
conf = SparkConf().set("spark.ui.port", "4050")
# create the context
sc = SparkContext(conf=conf)
# create the session
spark = SparkSession.builder.getOrCreate()
Third cell:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('./ngrok http 4050 &')
!sleep 10
!curl -s http://localhost:4040/api/tunnels | python3 -c \
"import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"
The curl call queries ngrok's local inspection API (which defaults to port 4040) and prints the public URL of the tunnel; that URL is where you'll find the Spark UI. My example output was:
--2020-10-03 11:30:58-- https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 52.203.78.32, 52.73.16.193, 34.205.238.171, ...
Connecting to bin.equinox.io (bin.equinox.io)|52.203.78.32|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13773305 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip.1’
ngrok-stable-linux- 100%[===================>] 13.13M 13.9MB/s in 0.9s
2020-10-03 11:31:00 (13.9 MB/s) - ‘ngrok-stable-linux-amd64.zip.1’ saved [13773305/13773305]
Archive: ngrok-stable-linux-amd64.zip
replace ngrok? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
inflating: ngrok
http://989c77d52223.ngrok.io
and that last element, http://989c77d52223.ngrok.io, was where my Spark UI lived.
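To have something to look at in the UI, you can run a small throwaway job; the data below is made up purely for illustration:
# a tiny job so the Spark UI has stages and a SQL plan to display
df = spark.range(0, 1000000)
df.selectExpr("sum(id) AS total").show()
# prints the physical plan in the notebook; the same plan appears in the UI's SQL tab
df.selectExpr("sum(id) AS total").explain()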
Solution 2:[2]
@Victor I also had a similar problem. This is what I did:
1. Download your existing Jupyter notebook from Colab to your computer.
2. Create a new notebook in Colab.
3. Execute the following:
!pip3 install pyspark
4. Upload your notebook to the same Colab session.
5. Run a Spark session and check that it works, as in the sketch below.
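A minimal way to do that last check (it assumes nothing beyond the pip install above):
# build (or reuse) a local SparkSession and make sure it responds
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("check").getOrCreate()
print(spark.version)   # should print the installed Spark version
spark.range(3).show()  # tiny DataFrame to confirm the session actually works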
Solution 3:[3]
Spark version 2.3.2 works very well in Google Colab. Just follow my steps:
!pip install pyspark==2.3.2
import pyspark
Check the version we have installed:
pyspark.__version__
Try to create a SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Sparkify").getOrCreate()
And you can now use Spark in Colab. Enjoy!
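For example, a tiny smoke test (the rows and column names here are made up just for illustration):
# create a small DataFrame and run a trivial aggregation to confirm Spark works
data = [("alice", 3), ("bob", 5)]
df = spark.createDataFrame(data, ["name", "count"])
df.groupBy().sum("count").show()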
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Dharman
Solution 2 | Nihad TP
Solution 3 | Fahd Zaghdoudi