In a program I have a CSV extracted from Excel; I need to upload the CSV to HDFS and save it in Parquet format. Any Python or Spark version is fine, but no Scala please.
Almost all the discussions I came across are about Databricks; however, it seems the file cannot be found. Here is the code and the error:
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema","true").option("delimiter",",").load("file:///home/rxie/csv_out/wamp.csv")
Error:
java.io.FileNotFoundException: File file:/home/rxie/csv_out/wamp.csv does not exist
The file path:
ls -la /home/rxie/csv_out/wamp.csv
-rw-r--r-- 1 rxie linuxusers 2896878 Nov 12 14:59 /home/rxie/csv_out/wamp.csv
Thank you.
mdivk:
I found the issue now!

The "file not found" error is actually correct: I was creating the Spark context with setMaster("yarn-cluster"), which means all worker nodes look for the CSV file, and of course all worker nodes (except the one starting the program, where the CSV resides) do not have this file and hence error out. What I really should do is use setMaster("local").

FIX:

conf = SparkConf().setAppName('test').setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
csv = "file:///home/rxie/csv_out/wamp.csv"
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema","true").option("delimiter",",").load(csv)
2018-11-14T02:18:43