Installing PySpark on Windows
I have a few questions which I would like to clarify before installation. Please bear with me as I am still new to data science and installing packages.
1) I can do a pip install pyspark on my Windows machine. When I try to run the sample script below, it tells me that SPARK_HOME is not set. Do I still need to set SPARK_HOME, and how do I go about doing it? The blogs I have referred to online do a manual extraction of the Spark files from the Spark website and then set SPARK_HOME and PYTHONPATH afterwards. However, I thought this was eliminated with pip install pyspark.
import findspark
findspark.init()  # locate the Spark installation and add it to sys.path
import pyspark  # only run after findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.sql('''select 'spark' as hello ''')
df.show()
2) For IntelliJ, do I still need to do additional configuration once I have installed PySpark and set it up as necessary in 1)?
Thank you so much. Once again, I apologise; please excuse me if I am asking a silly question.
Solution 1:[1]
Check the directions here
https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c
You'll need to install Apache Spark (the whole thing) too!
I did it and it takes a good while; for the most part, when I'm learning or helping a friend, I'll use the notebooks at Zepl or Databricks.
If you do choose to install the whole thing and run into trouble, don't be shy about posting another question :)
Solution 2:[2]
I tried many ways, but I successfully installed it from the link below.
Source:
Solution 3:[3]
In general, if you do not need a full Spark installation, it is recommended that you just install it in your environment with pip:
pip install pyspark
Once the installation is ready, you should be able to invoke the Spark shell with the pyspark command.
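As a quick sanity check, a minimal smoke test like the following should also run without any extra configuration once the pip install succeeds (a sketch; the app name is arbitrary, and a working Java installation is assumed):

from pyspark.sql import SparkSession

# build a local session; "local[*]" uses all available cores on this machine
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
spark.range(5).show()  # prints a one-column DataFrame with ids 0..4
spark.stop()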
There are a couple of issues that you may encounter while running the executable. Notably, you may face the following error:
Python was not found but can be installed from the Microsoft Store
Obviously, if you have just installed PySpark, you have Python ready in your virtual environment, so you can turn off these default paths by disabling the python.exe and python3.exe App Installer aliases under Apps >> Apps & features >> App execution aliases in Settings.
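To confirm which interpreter actually resolves on your PATH (and whether the Store alias still shadows your virtual environment), a small check along these lines can help:

import shutil
import sys

print(sys.executable)           # the interpreter running this script
print(shutil.which("python"))   # what the bare "python" command resolves to
print(shutil.which("python3"))  # likewise for "python3"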
You may still face a seemingly related error:
Missing Python executable 'python3', defaulting to '[path_to_spark_home]' for SPARK_HOME environment variable. Please install Python or specify the correct Python executable in PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment variable to detect SPARK_HOME.
In this case, the issue could be either an incorrect SPARK_HOME variable or an unspecified Python path.
In the first case, try setting SPARK_HOME explicitly:
$env:SPARK_HOME=python -c "import pyspark, pathlib; print(pathlib.Path(pyspark.__file__).parent)"
Alternatively, set one of the Python-related variables to the Python path in your environment. For instance:
$env:PYSPARK_DRIVER_PYTHON=python -c "import sys; print(sys.executable)"
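If you prefer to keep this out of your shell configuration, the same variables can be set from inside the driver script before the session is created; a minimal sketch, assuming pyspark is importable:

import os
import pathlib
import sys

import pyspark

# point SPARK_HOME at the pip-installed package and both Python variables
# at the current interpreter, before building any SparkSession
os.environ.setdefault("SPARK_HOME", str(pathlib.Path(pyspark.__file__).parent))
os.environ.setdefault("PYSPARK_PYTHON", sys.executable)
os.environ.setdefault("PYSPARK_DRIVER_PYTHON", sys.executable)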
If you do not have Java 11 installed, you should download and install it, and set the JAVA_HOME environment variable to the path where it was installed.
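A quick way to verify the Java side from Python (a sketch; assumes java is on your PATH once installed):

import os
import subprocess

print("JAVA_HOME:", os.environ.get("JAVA_HOME"))  # should point at the JDK root
subprocess.run(["java", "-version"])  # the version banner is printed to stderr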
You may still face an error stating:
Did not find winutils.exe. java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
As expected, you need to download binaries for Hadoop and then set the HADOOP_HOME environment variable to that path.
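Spark looks for winutils.exe under %HADOOP_HOME%\bin, so a check along these lines (a sketch) can confirm the layout is what Spark expects:

import os
import pathlib

hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME:", hadoop_home)
if hadoop_home:
    winutils = pathlib.Path(hadoop_home) / "bin" / "winutils.exe"
    print("winutils.exe present:", winutils.exists())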
Once all these paths are specified, the PySpark executable should be able to run, as long as you have the necessary permissions.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 | thrinadhn
Solution 3 |