Could you help me find the correct way to interact with an HDInsight Hadoop cluster (first of all with HDFS) from a Databricks notebook?
Currently I am trying to use the pyarrow library as follows:
hdfs1 = pa.hdfs.connect(host=host, port=8020, extra_conf=conf, driver='libhdfs3')
where host is my namenode and conf is a dict created from the HDFS_CLIENT hdfs-site.xml.
I get the following error message:
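For reference, I build the conf dict from hdfs-site.xml roughly like this (a sketch with an inline sample document instead of my real HDFS_CLIENT file; the actual property names and values come from the cluster config):

```python
import xml.etree.ElementTree as ET

def hdfs_site_to_dict(xml_text):
    """Parse an hdfs-site.xml document into a {property-name: value} dict."""
    conf = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            conf[name] = value
    return conf

# Inline sample standing in for the real hdfs-site.xml
sample = """<configuration>
  <property>
    <name>dfs.namenode.rpc-address</name>
    <value>namenode-host:8020</value>
  </property>
</configuration>"""

conf = hdfs_site_to_dict(sample)
print(conf)  # {'dfs.namenode.rpc-address': 'namenode-host:8020'}
```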
ArrowIOError: HDFS connection failed
---------------------------------------------------------------------------
ArrowIOError Traceback (most recent call last)
<command-3476367505086664> in <module>
1 hdfs1 = pa.hdfs.connect(host=host, port=8020,
----> 2 extra_conf=conf, driver='libhdfs3')
/databricks/python/lib/python3.7/site-packages/pyarrow/hdfs.py in connect(host, port, user, kerb_ticket, driver, extra_conf)
209 fs = HadoopFileSystem(host=host, port=port, user=user,
210 kerb_ticket=kerb_ticket, driver=driver,
--> 211 extra_conf=extra_conf)
212 return fs
/databricks/python/lib/python3.7/site-packages/pyarrow/hdfs.py in __init__(self, host, port, user, kerb_ticket, driver, extra_conf)
36 _maybe_set_hadoop_classpath()
37
---> 38 self._connect(host, port, user, kerb_ticket, driver, extra_conf)
39
40 def __reduce__(self):
/databricks/python/lib/python3.7/site-packages/pyarrow/io-hdfs.pxi in pyarrow.lib.HadoopFileSystem._connect()
/databricks/python/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowIOError: HDFS connection failed
Also, I am not entirely clear about the environment variables described in the documentation:
HADOOP_HOME: the root of your installed Hadoop distribution. Often has lib/native/libhdfs.so. - my Hadoop lives on the HDInsight cluster, but libhdfs.so is installed on my Databricks cluster together with pyarrow.
I accessed my HDFS node over SSH, and in the /usr/hdp/current/hadoop-client/lib/native/ directory I found only libhdfs.a, not libhdfs.so.
In that SSH session, echo $HADOOP_HOME and the other environment variables required by pyarrow return nothing.
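As far as I understand the documentation, something like the following would have to be set in the environment of the Python process before calling pa.hdfs.connect (the paths here are guesses based on the HDInsight layout I saw over SSH; on Databricks they would have to point at wherever libhdfs.so and a JVM actually live):

```python
import os

# Hypothetical paths -- adjust to the actual install locations
os.environ["HADOOP_HOME"] = "/usr/hdp/current/hadoop-client"
os.environ["ARROW_LIBHDFS_DIR"] = "/usr/hdp/current/hadoop-client/lib/native"
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

print(os.environ["ARROW_LIBHDFS_DIR"])
```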
At this point I strongly suspect I am doing something conceptually wrong... With pyarrow it looks like the Python script should be executed in the same OS and environment where the Hadoop cluster is installed. But when I use Databricks, these are definitely different clusters with different OSes and environments, not to mention the JupyterLab on my Windows machine.
I would be glad if you could help me find the right way.