Submit a Spark job from Airflow to an external Spark container
I have a Spark and Airflow cluster built with Docker Swarm. The Airflow container does not contain spark-submit, as I had expected it would.
I am using the following images, which are available on GitHub:
Spark: big-data-europe/docker-hadoop-spark-workbench
Airflow: puckel/docker-airflow (CeleryExecutor)
I prepared a .py file and added it under the dags folder:
from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from datetime import datetime, timedelta
args = {'owner': 'airflow', 'start_date': datetime(2018, 9, 24) }
dag = DAG('spark_example_new', default_args=args, schedule_interval="@once")
operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_default',
    java_class='Main',
    application='/SimpleSpark.jar',
    name='airflow-spark-example',
    conf={'master': 'spark://master:7077'},
    dag=dag)
I also configured the connection as follows in the web UI:
Master is the hostname of the Spark master container.
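The connection is roughly equivalent to creating it from code like this (a minimal sketch, assuming the Airflow 1.10 API shipped in the puckel image; the host and port simply mirror the values described above):
from airflow import settings
from airflow.models import Connection

# Register the 'spark_default' connection pointing at the Spark master container
spark_conn = Connection(
    conn_id='spark_default',
    conn_type='spark',
    host='spark://master',  # hostname of the Spark master container
    port=7077)

session = settings.Session()
session.add(spark_conn)
session.commit()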
But it cannot find spark-submit, and it produces the following error:
[2018-09-24 08:48:14,063] {{logging_mixin.py:95}} INFO - [2018-09-24 08:48:14,062] {{spark_submit_hook.py:283}} INFO - Spark-Submit cmd: ['spark-submit', '--master', 'spark://master:7077', '--conf', 'master=spark://master:7077', '--name', 'airflow-spark-example', '--class', 'Main', '--queue', 'root.default', '/SimpleSpark.jar']
[2018-09-24 08:48:14,067] {{models.py:1736}} ERROR - [Errno 2] No such file or directory: 'spark-submit': 'spark-submit'
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1633, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/contrib/operators/spark_submit_operator.py", line 168, in execute
self._hook.submit(self._application)
File "/usr/local/lib/python3.6/site-packages/airflow/contrib/hooks/spark_submit_hook.py", line 330, in submit
**kwargs)
File "/usr/local/lib/python3.6/subprocess.py", line 709, in __init__
restore_signals, start_new_session)
File "/usr/local/lib/python3.6/subprocess.py", line 1344, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit': 'spark-submit'
Solution 1:[1]
As far as I know, puckel/docker-airflow uses the Python slim image (https://hub.docker.com/_/python/). This image does not contain common packages; it only contains the minimal packages needed to run Python. Hence, you will need to extend the image and install spark-submit in your container.
Edit: Airflow does need the Spark binaries in the container to run SparkSubmitOperator, as documented here.
The other approach is to use SSHOperator to run the spark-submit command on an external VM by SSHing into the remote machine. But here as well, an SSH client must be available, which isn't included in the Puckel Airflow image.
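A minimal sketch of that SSHOperator alternative, assuming the SSH dependencies are installed and an SSH connection with the hypothetical id 'ssh_spark' points at a machine that has the Spark binaries (the spark-submit arguments simply reuse the values from the question):
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

dag = DAG('spark_via_ssh_example',
          default_args={'owner': 'airflow', 'start_date': datetime(2018, 9, 24)},
          schedule_interval='@once')

# Runs spark-submit on the remote host instead of inside the Airflow container
submit_over_ssh = SSHOperator(
    task_id='spark_submit_over_ssh',
    ssh_conn_id='ssh_spark',  # hypothetical SSH connection to the Spark host
    command='spark-submit --master spark://master:7077 --class Main /SimpleSpark.jar',
    dag=dag)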
Solution 2:[2]
This is a late answer.
You should install apache-airflow-providers-apache-spark.
Create a file called 'requirements.txt' and add apache-airflow-providers-apache-spark to it.
Then create a Dockerfile like this:
FROM apache/airflow:2.2.3

# Switch to root to install system packages (the base image runs as the airflow user)
USER root

# Install OpenJDK-11
RUN apt update && \
    apt-get install -y openjdk-11-jdk && \
    apt-get install -y ant && \
    apt-get clean;

# Set JAVA_HOME
ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64/
RUN export JAVA_HOME

USER airflow

COPY requirements.txt .
RUN pip install -r requirements.txt
In docker-compose.yml, comment out the line:
# image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.2.3}
and uncomment the line 'build: .'
Finally, run:
docker-compose build
docker-compose up
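With the provider package installed, the import path changes from airflow.contrib to airflow.providers. A minimal sketch of the question's DAG rewritten against that provider (the connection id, class name and jar path are just carried over from the question):
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG('spark_example_provider',
         start_date=datetime(2021, 1, 1),
         schedule_interval='@once',
         catchup=False) as dag:
    # 'spark_default' should point at spark://master:7077 in Admin > Connections
    submit_job = SparkSubmitOperator(
        task_id='spark_submit_job',
        conn_id='spark_default',
        application='/SimpleSpark.jar',
        java_class='Main',
        name='airflow-spark-example')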
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | moe_ |