Reading from Elasticsearch with PySpark fails with exception java.lang.NoClassDefFoundError: org/apache/commons/httpclient/protocol/ProtocolSocketFactory

I have a Spark cluster in Kubernetes based on the image

mcr.microsoft.com/mmlspark/spark2.4:v4.
Spark version 2.4.0
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_111
Compiled by user on 2018-10-29T06:48:44Z

I am trying to connect this Spark cluster to an Elasticsearch 6.2.3 instance.

Currently, all my attempts end with an exception when loading the Elasticsearch driver.

I tried to provide the httpclient jar via the --jars parameter of the spark-submit command:
spark-submit --deploy-mode client --master local --conf spark.executor.instances=1 --conf spark.app.name=ITPLogFuzzy --jars http://central.maven.org/maven2/org/elasticsearch/elasticsearch-spark-20_2.11/6.2.3/elasticsearch-spark-20_2.11-6.2.3.jar,/root/.m2/repository/org/apache/httpcomponents/httpclient/4.5.2/httpclient-4.5.2.jar test/test.py

I tried this with several different httpclient versions, with no success.

The base command I use is: spark-submit --deploy-mode client --master local --conf spark.executor.instances=1 --conf spark.app.name=LogFuzzy --jars http://central.maven.org/maven2/org/elasticsearch/elasticsearch-spark-20_2.11/6.2.3/elasticsearch-spark-20_2.11-6.2.3.jar test/test.py

It fails with the following error message:

19/06/18 09:30:17 INFO SparkContext: Added JAR http://central.maven.org/maven2/org/elasticsearch/elasticsearch-spark-20_2.11/6.2.3/elasticsearch-spark-20_2.11-6.2.3.jar at http://central.maven.org/maven2/org/elasticsearch/elasticsearch-spark-20_2.11/6.2.3/elasticsearch-spark-20_2.11-6.2.3.jar with timestamp 1560850217287
19/06/18 09:30:17 INFO Executor: Starting executor ID driver on host localhost
19/06/18 09:30:17 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33055.
19/06/18 09:30:17 INFO NettyBlockTransferService: Server created on devsparkzep-master-7bc578ff54-k5gtm:33055
19/06/18 09:30:17 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/06/18 09:30:17 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, devsparkzep-master-7bc578ff54-k5gtm, 33055, None)
19/06/18 09:30:17 INFO BlockManagerMasterEndpoint: Registering block manager devsparkzep-master-7bc578ff54-k5gtm:33055 with 366.3 MB RAM, BlockManagerId(driver, devsparkzep-master-7bc578ff54-k5gtm, 33055, None)
19/06/18 09:30:17 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, devsparkzep-master-7bc578ff54-k5gtm, 33055, None)
19/06/18 09:30:17 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, devsparkzep-master-7bc578ff54-k5gtm, 33055, None)
19/06/18 09:30:17 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6b842ee4{/metrics/json,null,AVAILABLE,@Spark}
19/06/18 09:30:17 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/root/spark-warehouse').
19/06/18 09:30:17 INFO SharedState: Warehouse path is 'file:/root/spark-warehouse'.
19/06/18 09:30:17 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4d86a066{/SQL,null,AVAILABLE,@Spark}
19/06/18 09:30:17 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6f686036{/SQL/json,null,AVAILABLE,@Spark}
19/06/18 09:30:17 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@60cb8ae{/SQL/execution,null,AVAILABLE,@Spark}
19/06/18 09:30:17 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@656c86b5{/SQL/execution/json,null,AVAILABLE,@Spark}
19/06/18 09:30:17 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3bd4a9ba{/static/sql,null,AVAILABLE,@Spark}
19/06/18 09:30:18 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/06/18 09:30:18 INFO Version: Elasticsearch Hadoop v6.2.3 [039a45c5a1]
Traceback (most recent call last):
  File "/root/test/test.py", line 44, in <module>
    dfITPLog = readerITPLog.load(esIdxITPLog).select(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 166, in load
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o34.load.
: java.lang.NoClassDefFoundError: org/apache/commons/httpclient/protocol/ProtocolSocketFactory
        at org.elasticsearch.hadoop.rest.commonshttp.CommonsHttpTransportFactory.create(CommonsHttpTransportFactory.java:39)
        at org.elasticsearch.hadoop.rest.NetworkClient.selectNextNode(NetworkClient.java:99)
        at org.elasticsearch.hadoop.rest.NetworkClient.<init>(NetworkClient.java:82)
        at org.elasticsearch.hadoop.rest.NetworkClient.<init>(NetworkClient.java:59)
        at org.elasticsearch.hadoop.rest.RestClient.<init>(RestClient.java:94)
        at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:317)
        at org.elasticsearch.spark.sql.SchemaUtils$.discoverMappingAndGeoFields(SchemaUtils.scala:98)
        at org.elasticsearch.spark.sql.SchemaUtils$.discoverMapping(SchemaUtils.scala:91)
        at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema$lzycompute(DefaultSource.scala:220)
        at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema(DefaultSource.scala:220)
        at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:224)
        at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:224)
        at scala.Option.getOrElse(Option.scala:121)
        at org.elasticsearch.spark.sql.ElasticsearchRelation.schema(DefaultSource.scala:224)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:403)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.httpclient.protocol.ProtocolSocketFactory
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)


Solution 1:[1]

For the Scala elasticsearch-spark connector (for example on CDH 6.2, or other versions), add the commons-httpclient dependency alongside the connector:

<!-- https://mvnrepository.com/artifact/commons-httpclient/commons-httpclient -->
<dependency>
    <groupId>commons-httpclient</groupId>
    <artifactId>commons-httpclient</artifactId>
    <version>3.1</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark-20_2.11</artifactId>
    <version>your_version</version>
</dependency>
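If you are submitting with spark-submit rather than building a fat jar, an alternative is to let Spark resolve both artifacts from Maven Central via --packages instead of managing jar files by hand. A sketch under that assumption (adjust the connector version to match your Elasticsearch cluster):

```shell
# Sketch: resolve the connector AND the old commons-httpclient 3.1
# from Maven Central at submit time; no local jar paths needed.
spark-submit \
  --deploy-mode client --master local \
  --packages org.elasticsearch:elasticsearch-spark-20_2.11:6.2.3,commons-httpclient:commons-httpclient:3.1 \
  test/test.py
```

With --packages, Spark downloads the artifacts and their declared dependencies into the local Ivy cache and puts them on both the driver and executor classpaths.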

Solution 2:[2]

In my case, I had to download the commons-httpclient jar from mvnrepository and pass its path to spark-submit via the --jars parameter. Note that the missing class lives in the old commons-httpclient 3.x artifact, not in org.apache.httpcomponents:httpclient 4.x, which is why adding httpclient-4.5.2 (as in the question) does not help.
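A sketch of such a command, assuming the jars are downloaded to /root/jars (the directory and filenames here are illustrative):

```shell
# Download the missing dependency once...
wget -P /root/jars https://repo1.maven.org/maven2/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar

# ...then ship it together with the connector jar at submit time.
spark-submit --deploy-mode client --master local \
  --jars /root/jars/elasticsearch-spark-20_2.11-6.2.3.jar,/root/jars/commons-httpclient-3.1.jar \
  test/test.py
```

Jars listed in --jars are added to the classpath of both the driver and the executors, which is what the connector's CommonsHttpTransportFactory needs.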

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 roamer
Solution 2 Chris Gong