I am experimenting with Hadoop and Spark, as the company I work for is getting ready to spin up Hadoop and wants to use Spark and other tools to do a lot of machine learning on our data.
Most of that falls to me, so I am preparing by learning on my own.
I have a machine I have set up as a single-node Hadoop cluster.
Here is what I have:
- CentOS 7 (minimal server install, added XOrg and OpenBox for GUI)
- Python 2.7
- Hadoop 2.7.2
- Spark 2.0.0
I followed these guides to set this up:
When I attempt to run 'pyspark', I get the following:
IPYTHON and IPYTHON_OPTS are removed in Spark 2.0+. Remove these from the environment and set PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS instead.
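Reading the message literally, it sounds like the old Spark 1.x variables are set somewhere in my environment and just need to be swapped for the new names. This is my rough understanding of what that would look like in a shell session (I'm assuming I still want the IPython shell and that ipython is actually installed, neither of which I've confirmed):

```bash
# see whether the removed Spark 1.x variables are set in my shell
env | grep IPYTHON

# clear them for the current session
unset IPYTHON IPYTHON_OPTS

# set the Spark 2.0+ replacement named in the error message
# (assumes I want the IPython shell and that ipython is on the PATH)
export PYSPARK_DRIVER_PYTHON=ipython
```

That would only fix the current session, though; presumably the old exports live in a startup file somewhere.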
I opened up the pyspark file in vi and examined it.
There is a lot going on in there, and I don't know where to start making the changes I need.
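The closest thing I found to the message is a guard near the top of the script. It looks roughly like this (paraphrasing from what I remember, so the exact text in my copy may differ):

```bash
# from /opt/spark-latest/bin/pyspark (approximate): it only inspects the
# environment and bails out, which suggests the script itself isn't broken
if [[ -n "$IPYTHON" || -n "$IPYTHON_OPTS" ]]; then
  echo "IPYTHON and IPYTHON_OPTS are removed in Spark 2.0+. ..."
  exit 1
fi
```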
My Spark installation is under /opt/spark-latest, the pyspark script is under /opt/spark-latest/bin/, and my Hadoop installation (though I don't think it factors in here) is under /opt/hadoop/.
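For completeness, this is roughly how the guides I followed had me wire up the environment, with the paths adjusted to my layout (I'm reconstructing this from memory rather than from my actual files, so treat it as a guess):

```bash
# shell profile entries, roughly as the setup guides had me add them
export HADOOP_HOME=/opt/hadoop
export SPARK_HOME=/opt/spark-latest
export PATH=$PATH:$HADOOP_HOME/bin:$SPARK_HOME/bin
```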
I know there must be a change I need to make somewhere, whether in the pyspark file or in my own environment, but I don't know where to begin.
I did some googling and found references to similar problems, but nothing that laid out the steps to fix this one.
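My best guess from that reading is that the old exports are hiding in a shell startup file rather than in pyspark itself, so I was planning to hunt for them with something like this (the file list is just the usual suspects, not something specific I've confirmed):

```bash
# search the common startup files for the removed variables
grep -n "IPYTHON" ~/.bashrc ~/.bash_profile /etc/profile /etc/profile.d/*.sh 2>/dev/null
```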
Can anyone give me a push in the right direction?