
Null values in field generates MatchError

Ask Time: 2017-06-22T20:21:27    Author: thebluephantom


The following is interesting:

val rddSTG = sc.parallelize(
  List(
    // the 8th field (ACTARRDATE) is null in every row
    ("RTD", "ANT", "SOYA BEANS", "20161123", "20161123", 4000,  "docid11", null, 5),
    ("RTD", "ANT", "SOYA BEANS", "20161124", "20161123", 6000,  "docid11", null, 4),
    ("RTD", "ANT", "BANANAS",    "20161124", "20161123", 7000,  "docid11", null, 9),
    ("HAM", "ANT", "CORN",       "20161123", "20161123", 1000,  "docid22", null, 33),
    ("LIS", "PAR", "BARLEY",     "20161123", "20161123", 11111, "docid33", null, 44)
  )
)

val dataframe = rddSTG.toDF("ORIG", "DEST", "PROD", "PLDEPDATE", "PLARRDATE", "PLCOST", "docid", "ACTARRDATE", "mutationseq")
dataframe.createOrReplaceTempView("STG")
spark.sql("SELECT * FROM STG ORDER BY PLDEPDATE DESC").show()

It generates an error as follows:

scala.MatchError: Null (of class scala.reflect.internal.Types$TypeRef$$anon$6)

As soon as I change one of the null values to something non-null, it works. I think I get it, in that no type can be inferred for that field, but it still seems odd. Ideas?
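
For what it's worth, the inference problem is visible without Spark at all. In a plain Scala 2 REPL (the transcript below matches Scala 2.11; newer versions print slightly differently), a tuple field holding only a bare null literal is typed as Null, and that is exactly the type Spark's reflection-based schema derivation then fails to match, as the MatchError above shows:

scala> val t = ("docid11", null)
t: (String, Null) = (docid11,null)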

Author: thebluephantom, reproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article: https://stackoverflow.com/questions/44699290/null-values-in-field-generates-matcherror
Raktotpal Bordoloi:

The problem is that Scala can't infer a useful type for the all-null field: it falls back to something too generic, and Spark just has no idea how to serialize NULL. You should explicitly provide a specific type. Since null can't be assigned to primitive types in Scala, you can use String to match the data type of the column's other values. So try this:

val sampleRdd = spark.sparkContext.parallelize(
  Seq(
    (1, null.asInstanceOf[String], 100, "YES"),
    (2, "RAKTOTPAL", 200, "NO"),
    (3, "BORDOLOI", 300, "YES"),
    (4, null.asInstanceOf[String], 400, "YES")))

sampleRdd.toDF("ID", "NAME", "SCORE", "FLAG")

This way, the DataFrame will retain the null values.

The other way is with a case class:

case class Record(id: Int, name: String, score: Int, flag: String)

val sampleRdd = spark.sparkContext.parallelize(
  Seq(
    Record(1, null.asInstanceOf[String], 100, "YES"),
    Record(2, "RAKTOTPAL", 200, "NO"),
    Record(3, "BORDOLOI", 300, "YES"),
    Record(4, null.asInstanceOf[String], 400, "YES")))

sampleRdd.toDF()
2017-06-22T12:43:08
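
A more idiomatic Scala alternative, sketched here under the same assumptions (a live SparkSession named spark, with spark.implicits._ in scope for toDF), is to model the nullable field as Option[String]; Spark's encoders turn None into a SQL NULL in a nullable string column:

import spark.implicits._

// None plays the role of null; the encoder maps Option[String] to a nullable string column
val optRdd = spark.sparkContext.parallelize(
  Seq(
    (1, None: Option[String], 100, "YES"),
    (2, Some("RAKTOTPAL"), 200, "NO")))

optRdd.toDF("ID", "NAME", "SCORE", "FLAG").printSchema()
// NAME is reported as: string (nullable = true)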
philantrovert:

I'm not quite sure of the reason behind the error, but my guess is that it occurs because Null can't be the datatype of a DataFrame column. Your second-to-last column is always null, and its type is the trait Null, which sits at the bottom of Scala's type hierarchy and can't be instantiated as any other type. But the Null type is a subtype of every reference type, so if you change even one of those nulls to, say, a String, the column becomes String type. This is just an assumption.

However, for your case, defining a case class will work:

val rdd = sc.parallelize(List(
  ("RTD", "ANT", "SOYA BEANS", "20161123", "20161123", 4000,  "docid11", null, 5),
  ("RTD", "ANT", "SOYA BEANS", "20161124", "20161123", 6000,  "docid11", null, 4),
  ("RTD", "ANT", "BANANAS",    "20161124", "20161123", 7000,  "docid11", null, 9),
  ("HAM", "ANT", "CORN",       "20161123", "20161123", 1000,  "docid22", null, 33),
  ("LIS", "PAR", "BARLEY",     "20161123", "20161123", 11111, "docid33", null, 44)))

case class df_schema(ORIG: String, DEST: String, PROD: String, PLDEPDATE: String, PLARRDATE: String, PLCOSTDATE: Int, DOCID: String, ACTARRDATE: String, MUTATIONSEQ: Int)

val rddSTG = rdd.map(x => df_schema(x._1, x._2, x._3, x._4, x._5, x._6, x._7, x._8, x._9))
val dataframe = sqlContext.createDataFrame(rddSTG)
2017-06-22T12:40:25
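
For completeness, a third route that sidesteps Scala type inference entirely, sketched here assuming the same spark session and sc as in the question, is to build untyped Rows and declare the schema by hand; StructField is nullable by default, so the all-null ACTARRDATE column is unproblematic:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Rows carry no Scala-level type information, so nothing is inferred from the nulls
val rowRDD = sc.parallelize(List(
  Row("RTD", "ANT", "SOYA BEANS", "20161123", "20161123", 4000, "docid11", null, 5),
  Row("LIS", "PAR", "BARLEY", "20161123", "20161123", 11111, "docid33", null, 44)))

// Each column's type is stated explicitly; nullable defaults to true
val schema = StructType(Seq(
  StructField("ORIG", StringType), StructField("DEST", StringType),
  StructField("PROD", StringType), StructField("PLDEPDATE", StringType),
  StructField("PLARRDATE", StringType), StructField("PLCOST", IntegerType),
  StructField("docid", StringType), StructField("ACTARRDATE", StringType),
  StructField("mutationseq", IntegerType)))

val dataframe = spark.createDataFrame(rowRDD, schema)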