'Pandas dataframe in pyspark to hive
How to send a pandas dataframe to a hive table?
I know if I have a spark dataframe, I can register it to a temporary table using
df.registerTempTable("table_name")
sqlContext.sql("create table table_name2 as select * from table_name")
but when I try to use the pandas dataFrame to registerTempTable, I get the below error:
AttributeError: 'DataFrame' object has no attribute 'registerTempTable'
Is there a way for me to use a pandas dataFrame to register a temp table or convert it to a spark dataFrame and then use it register a temp table so that I can send it back to hive.
Solution 1:[1]
I guess you are trying to use pandas df
instead of Spark's DF.
Pandas DataFrame has no such method as registerTempTable
.
you may try to create Spark DF from pandas DF.
UPDATE:
I've tested it under Cloudera (with installed Anaconda parcel, which includes Pandas module).
Make sure that you have set PYSPARK_PYTHON
to your anaconda python installation (or another one containing Pandas module) on all your Spark workers (usually in: spark-conf/spark-env.sh
)
Here is result of my test:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
>>> sdf = sqlContext.createDataFrame(df)
>>> sdf.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 98| 33| 75|
| 91| 57| 80|
| 20| 87| 85|
| 20| 61| 37|
| 96| 64| 60|
| 79| 45| 82|
| 82| 16| 22|
| 77| 34| 65|
| 74| 18| 17|
| 71| 57| 60|
+---+---+---+
>>> sdf.printSchema()
root
|-- A: long (nullable = true)
|-- B: long (nullable = true)
|-- C: long (nullable = true)
Solution 2:[2]
first u need to convert pandas dataframe to spark dataframe:
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
df = hive_context.createDataFrame(pd_df)
then u can create a temptable which is in memory:
df.registerTempTable('tmp')
now,u can use hive ql to save data into hive:
hive_context.sql("""insert overwrite table target partition(p='p') select a,b from tmp'''
note than:the hive_context must be keep to the same one!
Solution 3:[3]
I converted my pandas df to a temp table by
1) Converting the pandas dataframe to spark dataframe:
spark_df=sqlContext.createDataFrame(Pandas_df)
2) Make sure that the data is migrated properly
spark_df.select("*").show()
3) Convert the spark dataframe to a temp table for querying.
spark_df.registerTempTable("table_name").
Cheers..
Solution 4:[4]
By Following all the other answers here, I was able to convert a pandas dataframe to a permanent Hive table as follows:
# sc is a spark context created with enableHiveSupport()
from pyspark.sql import HiveContext
hc=HiveContext(sc)
# df is my pandas dataframe
sc.createDataFrame(df).registerTempTable('tmp')
# sch is the hive schema, and tabname is my new hive table name
hc.sql("create table sch.tabname as select * from tmp")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | B.Mr.W. |
Solution 3 | Abhi |
Solution 4 | Jim Bander |