Installing PySpark on Windows

I have a few questions which I would like to clarify before installation. Please bear with me as I am still new to data science and installation packages.

1) I can run pip install pyspark on my Windows machine. When I try to run the sample script below, it tells me that SPARK_HOME is not set. Do I still need to set SPARK_HOME, and how do I go about doing it? The blogs I have found online manually extract the Spark files downloaded from the Spark website and then set SPARK_HOME and PYTHONPATH. However, I thought this was eliminated by pip install pyspark.

import findspark
findspark.init()

import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = spark.sql('''select 'spark' as hello ''')
df.show()

2) For IntelliJ, do I still need to do additional configuration once I have installed PySpark and set it up as necessary in 1)?

Thank you so much. Once again, I do apologise, and please excuse me if I am asking a silly question.



Solution 1:[1]

Check the directions here:

https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c

You'll need to install Apache Spark (the whole thing) too!

I did it and it takes a good while. For the most part, when I'm learning or helping a friend, I'll use the notebooks at Zepl or Databricks.

If you do choose to install the whole thing and have trouble, don't be shy about posting another question :)

Solution 2:[2]

I tried many ways, but I successfully installed it by following the source below.

Source:

  1. Eden Canlilar
  2. PySpark in Jupyter Notebook on Windows

Solution 3:[3]

In general, if you do not need a full Spark installation, it is recommended that you just install it in your environment with pip:

pip install pyspark

Once the installation is complete, you should be able to start the Spark shell with the pyspark command. There are a couple of issues that you may encounter while running the executable. Notably, you may face the following error:

Python was not found but can be installed from the Microsoft Store

If you have just installed PySpark, you already have Python available in your virtual environment, so you can turn off these default aliases by disabling the python.exe and python3.exe App Installers under Settings >> Apps >> Apps & features >> App execution aliases.
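To see whether the python command still resolves to the Microsoft Store alias, a small check along these lines may help. It is only a heuristic sketch: it assumes the alias stubs live in a folder named WindowsApps, and the helper name looks_like_store_alias is made up for this example. A Python that was actually installed from the Store also lives under WindowsApps, so a match is a hint, not proof.

```python
import shutil
from pathlib import PureWindowsPath


def looks_like_store_alias(path):
    """Heuristic: True if the path sits in a 'WindowsApps' folder.

    Assumption: the App execution alias stubs are placed in a
    directory named WindowsApps under the user's profile.
    """
    parts = [part.lower() for part in PureWindowsPath(path).parts]
    return "windowsapps" in parts


if __name__ == "__main__":
    # shutil.which returns the first match on PATH, or None.
    for name in ("python", "python3"):
        found = shutil.which(name)
        if found is None:
            print(f"{name}: not found on PATH")
        elif looks_like_store_alias(found):
            print(f"{name}: {found} (possibly the Store alias)")
        else:
            print(f"{name}: {found}")
```

If the interpreter from your virtual environment is listed instead, the alias is not what the shell is picking up.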

You may still face a seemingly related error:

Missing Python executable python3, defaulting to [path_to_spark_home] for SPARK_HOME environment variable. Please install Python or specify the correct Python executable in PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment variable to detect SPARK_HOME.

In this case, the issue could be either an inappropriate SPARK_HOME variable or an unspecified Python path.

In the first case, try setting SPARK_HOME explicitly:

$env:SPARK_HOME=python -c "import pyspark, pathlib; print(pathlib.Path(pyspark.__file__).parent)"

Alternatively, set one of the Python-related variables to the path of the Python executable in your environment. For instance:

$env:PYSPARK_DRIVER_PYTHON=python -c "import sys; print(sys.executable)"
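The same two settings can also be applied from inside a Python script, before the Spark session is created. The following is a minimal sketch, assuming a pip-installed PySpark whose package directory doubles as SPARK_HOME; the function names pip_spark_home and configure_spark_env are made up for this example and are not part of PySpark.

```python
import importlib.util
import os
import sys
from pathlib import Path


def pip_spark_home():
    """Return the directory of the installed pyspark package, or None.

    find_spec locates the package without importing it.
    """
    spec = importlib.util.find_spec("pyspark")
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


def configure_spark_env(env):
    """Fill in Spark-related variables without overwriting existing ones."""
    home = pip_spark_home()
    if home is not None:
        env.setdefault("SPARK_HOME", str(home))
    # Point both the driver and the workers at the current interpreter.
    env.setdefault("PYSPARK_PYTHON", sys.executable)
    env.setdefault("PYSPARK_DRIVER_PYTHON", sys.executable)
    return env


if __name__ == "__main__":
    configure_spark_env(os.environ)
    for key in ("SPARK_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"):
        print(key, "=", os.environ.get(key))
```

Calling configure_spark_env(os.environ) before SparkSession.builder.getOrCreate() only affects the current process, much like the $env: assignments only affect the current PowerShell session.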

If you do not have Java 11 installed, download and install it, then set the JAVA_HOME environment variable to the path where it was installed.

You may still face an error stating:

Did not find winutils.exe. java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.

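As the message suggests, you need to download the Hadoop binaries for Windows (including winutils.exe) and then set the HADOOP_HOME environment variable to that path. Hadoop looks for winutils.exe in the bin subfolder of HADOOP_HOME.

Once all these paths are specified, the PySpark executable should be able to run, as long as you have the necessary permissions.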
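Before launching PySpark, the variables discussed above can be checked with a short script. This is only a sketch: the function name missing_prerequisites is made up for this example, and it checks only that JAVA_HOME points to a directory and that winutils.exe sits in the bin subfolder of HADOOP_HOME. It does not verify the Java version.

```python
import os
from pathlib import Path


def missing_prerequisites(env, on_windows=(os.name == "nt")):
    """Return a list of human-readable problems found in env."""
    problems = []

    java_home = env.get("JAVA_HOME")
    if not java_home or not Path(java_home).is_dir():
        problems.append("JAVA_HOME is unset or not a directory")

    # winutils.exe is only relevant on Windows.
    if on_windows:
        hadoop_home = env.get("HADOOP_HOME")
        if not hadoop_home:
            problems.append("HADOOP_HOME is unset")
        elif not (Path(hadoop_home) / "bin" / "winutils.exe").is_file():
            problems.append("winutils.exe not found in the bin folder of HADOOP_HOME")

    return problems


if __name__ == "__main__":
    issues = missing_prerequisites(os.environ)
    if issues:
        for issue in issues:
            print("Problem:", issue)
    else:
        print("JAVA_HOME and HADOOP_HOME look usable.")
```

An empty list means the paths exist; Spark may still fail for other reasons, such as missing permissions.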

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 thrinadhn
Solution 3