Pyspark-pandas not working on Spark 3.1.2

I am using Spark 3.1.2 and trying to use pyspark-pandas. However, when I run from pyspark import pandas as ps, I get the following error:

ImportError: cannot import name 'pandas' from 'pyspark' (/databricks/spark/python/pyspark/__init__.py)

How can I use this package? (For reference, I am using Databricks.)



Solution 1:[1]

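The pandas API on Spark (pyspark.pandas) was added in Spark 3.2, so it cannot be imported on Spark 3.1.2. On Databricks, you need Databricks Runtime 10.0 or later, which ships with Spark 3.2.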
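To confirm which situation a cluster is in before attempting the import, a version check along these lines may help. This is only a minimal sketch: the helper name supports_pandas_api is made up for illustration, and it assumes the version string starts with numeric major.minor components (as in "3.1.2").

```python
def supports_pandas_api(spark_version: str) -> bool:
    """Return True if the version string denotes Spark 3.2 or newer.

    Hypothetical helper; assumes the string begins with "<major>.<minor>".
    """
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (3, 2)


try:
    import pyspark

    # pyspark.__version__ is the version of the installed PySpark package.
    print(pyspark.__version__, supports_pandas_api(pyspark.__version__))
except ImportError:
    print("pyspark is not installed in this environment")
```

On the cluster from the question (Spark 3.1.2) the check would report False, which matches the ImportError shown above.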

Solution 2:[2]

You could try following the steps from the equivalent Google Dataproc question: How to run spark 3.2.0 on google dataproc?

I think the solution here is to upgrade Databricks to a runtime version released after Spark 3.2, which came out in October 2021.

I am using EMR 6.5, which still does not have Spark 3.2, so we cannot use this yet either. The proposal to add pandas support lists other implementations that may work for me: Dask, Modin, and Koalas. See the proposal here.
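If upgrading is not an option, one way to use Koalas as a stand-in is to try pyspark.pandas first and fall back to databricks.koalas (the import name of the koalas package). The following is a rough sketch, not a tested recipe for EMR or Databricks: load_pandas_api is a hypothetical helper, and it assumes the koalas package has already been installed on the cluster.

```python
import importlib


def load_pandas_api(candidates=("pyspark.pandas", "databricks.koalas")):
    """Return (module_name, module) for the first candidate that imports.

    Hypothetical helper: tries each module name in order and returns
    (None, None) if none of them can be imported.
    """
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue
    return None, None


if __name__ == "__main__":
    name, ps = load_pandas_api()
    if ps is None:
        print("Neither pyspark.pandas nor databricks.koalas is available")
    else:
        print(f"Using {name} as the pandas-style API")
```

Koalas and pyspark.pandas expose largely the same interface, so code written against the returned module should need few changes after a later upgrade to Spark 3.2, though that is worth verifying for the specific calls you use.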

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 a.powell
Solution 2