Could you help me find the correct way to interact with an HDInsight Hadoop cluster (first of all with HDFS) from a Databricks notebook?
Currently I am trying to use the pyarrow library as follows:
hdfs1 = pa.hdfs.connect(host=host, port=8020, extra_conf=conf, driver='libhdfs3')
where host is my namenode and conf is a dict created from the HDFS_CLIENT hdfs-site.xml.
I get the following error message:
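For reference, I build the conf dict from hdfs-site.xml roughly like this (a sketch with an inline sample document instead of my real HDFS_CLIENT file; the actual property names and values come from the cluster config):

```python
import xml.etree.ElementTree as ET

def hdfs_site_to_dict(xml_text):
    """Parse an hdfs-site.xml document into a {property-name: value} dict."""
    conf = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            conf[name] = value
    return conf

# Inline sample standing in for the real hdfs-site.xml
sample = """<configuration>
  <property>
    <name>dfs.namenode.rpc-address</name>
    <value>namenode-host:8020</value>
  </property>
</configuration>"""

conf = hdfs_site_to_dict(sample)
print(conf)  # {'dfs.namenode.rpc-address': 'namenode-host:8020'}
```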
ArrowIOError: HDFS connection failed
---------------------------------------------------------------------------
ArrowIOError Traceback (most recent call last)
<command-3476367505086664> in <module>
1 hdfs1 = pa.hdfs.connect(host=host, port=8020,
----> 2 extra_conf=conf, driver='libhdfs3')
/databricks/python/lib/python3.7/site-packages/pyarrow/hdfs.py in connect(host, port, user, kerb_ticket, driver, extra_conf)
209 fs = HadoopFileSystem(host=host, port=port, user=user,
210 kerb_ticket=kerb_ticket, driver=driver,
--> 211 extra_conf=extra_conf)
212 return fs
/databricks/python/lib/python3.7/site-packages/pyarrow/hdfs.py in __init__(self, host, port, user, kerb_ticket, driver, extra_conf)
36 _maybe_set_hadoop_classpath()
37
---> 38 self._connect(host, port, user, kerb_ticket, driver, extra_conf)
39
40 def __reduce__(self):
/databricks/python/lib/python3.7/site-packages/pyarrow/io-hdfs.pxi in pyarrow.lib.HadoopFileSystem._connect()
/databricks/python/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowIOError: HDFS connection failed
Also, I am not entirely clear about the environment variables described in the documentation:
HADOOP_HOME: the root of your installed Hadoop distribution. Often has lib/native/libhdfs.so. - my Hadoop lives on the HDInsight cluster, but libhdfs.so is installed on my Databricks cluster together with pyarrow.
I accessed my HDFS node over SSH, and in the /usr/hdp/current/hadoop-client/lib/native/ directory I found only libhdfs.a, not libhdfs.so.
In that SSH session, echo $HADOOP_HOME and the other environment variables required by pyarrow return nothing.
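As far as I understand the documentation, something like the following would have to be set in the environment of the Python process before calling pa.hdfs.connect (the paths here are guesses based on the HDInsight layout I saw over SSH; on Databricks they would have to point at wherever libhdfs.so and a JVM actually live):

```python
import os

# Hypothetical paths -- adjust to the actual install locations
os.environ["HADOOP_HOME"] = "/usr/hdp/current/hadoop-client"
os.environ["ARROW_LIBHDFS_DIR"] = "/usr/hdp/current/hadoop-client/lib/native"
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

print(os.environ["ARROW_LIBHDFS_DIR"])
```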
At this point I strongly suspect I am doing something conceptually wrong... With pyarrow it looks like the Python script should be executed in the same OS and environment where the Hadoop cluster is installed. But when I use Databricks, these are definitely different clusters with different OSes and environments, not to mention the JupyterLab on my Windows machine.
I would be glad if you could help me find the right way.