How to Handle Null in Spark

Overview

In this article, we will talk about the second-ugliest exception in the history of programming and attempt to handle it in our Spark apps. If you’ve ever worked with Spark in its native language, you’ve probably faced this bizarre, hard-to-debug exception, famously known as NullPointerException!

Also, you might have done your homework hoping to find a one-size-fits-all solution. But after reading article after article, you are still not happy with the final result, or even frustrated with Scala and Spark!

Well, at least this was true in my case. As a newcomer to this language and ecosystem, the number of arguments (and counterarguments) around the topic was a massive surprise to me. And after some frustration, I learned that this is not a simple problem with a simple solution. Let’s see why that is:

What is Null?

Even though it goes by the same name, null has two different natures when you examine it in the context of programming languages versus data management systems. Let’s see how:

In programming languages

It’s probably easier to comprehend the concept of null pointers if you are familiar with low-level languages like C. A null pointer is a pointer that points to nothing. The usual use case in a language like C is to indicate the end of a string (the null character) or the end of a list of unknown length (a null pointer). However, with enough abstraction applied, it can turn into a nightmare :joy:.
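In JVM terms, the same idea shows up as a null reference. A minimal sketch, nothing Spark-specific: the variable exists, but it points to nothing, and dereferencing it is what blows up at runtime.

// A reference that points to nothing; perfectly legal to declare.
val name: String = null

// Dereferencing it is what fails at runtime:
// name.length   // => java.lang.NullPointerException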

The history of null pointers is a fascinating topic, and you might be surprised to learn how Tony Hoare (who came up with the idea of null) thinks about his invention:

I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object-oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn’t resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years. (source)

If you are interested, I highly recommend you watch the Null References: The Billion-dollar mistake presentation by Tony Hoare himself and read The worst mistake of computer science by Paul Draper, which goes into a fair amount of detail on the topic.

In data management

In data management systems, or more specifically SQL, null has a special meaning. Using null in your data indicates that there is no valid data available for a field, or that its value is unknown.

So, for example, say you have a coordinates column in your data, but you don’t know the value for a particular address. Instead of using something like Unknown or some other misleading value, you use a standardized null to express explicitly that you don’t know the value.

If you are interested in this topic, apart from Null (SQL) on Wikipedia, I highly recommend SQL by Design: The Reason for NULL by Michelle A. Poolet, which is a bit dated but will help clarify the topic.

Scala style and null

Now that we are familiar with the concept and use cases, let’s focus our attention on the problem we have. Let’s see how we can deal with null in Spark and Scala in a sane way.

In lots of Scala best practices, you can read statements like “Avoid nulls”. The usual reasoning is that in functional programming you think in terms of algebraic equations, null doesn’t make sense there, and therefore you should avoid using it when you program in the functional paradigm.
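The idiomatic alternative usually looks something like the following sketch (findUser is a made-up function, purely for illustration): model a possibly-missing value as an Option and let the type force the caller to deal with absence explicitly.

// Hypothetical example: the absence of a user is part of the return type.
def findUser(id: Long): Option[String] =
  if (id == 42L) Some("DataChef") else None

// Callers can't forget the missing case; the compiler makes them handle it.
findUser(1L).map(_.toUpperCase).getOrElse("unknown user")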

However, this is not as simple as it sounds. There are a few points to take note of before smoothly rolling your eyes at null:

  1. First of all, Scala also supports OOP, where null usually exists (in one form or another, or sometimes in multiple forms, yeah, I’m looking at you JavaScript :unamused:).

  2. Second, null is a citizen of the Java ecosystem. Unless you code up everything you ever use in your project in functional Scala style, you can’t ignore null.

  3. And third, most importantly, Spark can’t ignore null. As we stated earlier, in terms of data, null carries a particular meaning.

Spark and Null

We know Spark needs to be aware of null in terms of data, but you, as a programmer, should be aware of some details too. Null in Spark is not as straightforward as we wish it to be. At the beginning of this article, I stated that this is not a simple problem. Here I’m going to discuss why I think that’s the case:

Spark is Null safe, well, almost!

The fact that Spark functions are null safe (at least most of the time) is quite pleasant. Take a look at the following example:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import spark.implicits._  // for the $"column" syntax

// Both columns are nullable (the third StructField argument).
val schema = List(
  StructField("v1", IntegerType, true),
  StructField("v2", IntegerType, true)
)

val data = Seq(
  Row(1, 2),
  Row(3, 4),
  Row(null, 5)
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

// Add the two columns together; one operand can be null.
val result = df.withColumn("v3", $"v1" + $"v2")

As you can see, the third row of our data contains a null, but as the following output shows, Spark considers the result of that row to be null (which is the desired value when one side of your calculation is already null):

scala> result.show
+----+---+----+
|  v1| v2|  v3|
+----+---+----+
|   1|  2|   3|
|   3|  4|   7|
|null|  5|null|
+----+---+----+

But this is not the case for every Spark built-in function. For example, if for a more complicated computation you want to rely on a transformer like ml.feature.RegexTokenizer, then you need to make sure your desired column doesn’t contain null!
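For instance, here is a minimal sketch of guarding a column before tokenizing it (textDf and the text column are assumptions for illustration, not part of the earlier example): either drop the rows with null or replace null with an empty string before handing the column to the transformer.

import org.apache.spark.ml.feature.RegexTokenizer

val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("tokens")

// textDf is a hypothetical data frame with a nullable "text" column.
val safeDf = textDf.na.fill("", Seq("text"))   // or textDf.na.drop(Seq("text"))
val tokenized = tokenizer.transform(safeDf)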

The case of empty string

Now let’s see how Spark handles empty strings. We first read a data frame from a simple CSV file with the following definition:

# test.csv
key,value
"",1
,2

As you see, the key column in the first row is an empty string, but in the second row, it’s undefined. Let’s read it in and see what Spark thinks about it:

scala> val df = spark.read.option("header", true).csv("test.csv")
scala> df.show
+----+-----+
| key|value|
+----+-----+
|null|    1|
|null|    2|
+----+-----+

Cool, huh? Spark treated the empty string as null. But let’s see if that is always the case. Let’s try the same thing with a JSON data file:

[{"key": ""}, {"key": null}]

The value of key in the first row is an empty string and null in the second row. Let’s read it in:

scala> val df = spark.read.json("test.json")
scala> df.show
+----+
| key|
+----+
|    |
|null|
+----+

Yep, now we have it as an empty string. To me, these details sound like double standards. However, as you see, this is also inevitable, since the source type of your data decides what you’ll end up with in your data frame.
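If you want a single representation regardless of the source, one option is to normalize it yourself right after reading. A sketch, assuming the column is named key as in the JSON example above:

import org.apache.spark.sql.functions.{col, length, lit, when}

// Turn empty strings into proper nulls so both sources look the same downstream.
val normalized = df.withColumn(
  "key",
  when(length(col("key")) === 0, lit(null).cast("string")).otherwise(col("key"))
)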

The case of UDF

If you were considering what we talked about so far the worst-case scenario, you were missing the point of UDFs. They are an addition to Spark, to make matters worse (just kidding :joy:).

As always, you can find a lot of so-called best practices out there suggesting that you not use UDFs. And that’s partly true, as long as what you want to achieve is already supported by Spark’s built-in functions. But with any amount of real-world data engineering experience, you already know that’s not always possible. IMO, this is the nature of frameworks: it’s not practical, or sometimes even possible, to cover everything users might need out of the box.

Now that you are probably going to develop your own UDFs, you should take responsibility and deal with your nulls. Because as far as Spark is concerned, a UDF is a black box. If something goes wrong in there, have fun debugging the most inexpressive exception you’ve ever seen in your life.
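To make that concrete, here is a minimal sketch of a UDF that fails the moment Spark feeds it a null (the key column is just an assumption for illustration):

import org.apache.spark.sql.functions.udf

// toUpperCase is called on whatever Spark passes in, null included.
val shout = udf((s: String) => s.toUpperCase)

// Applying it to a column that contains null fails at runtime with an NPE
// buried inside a SparkException:
// df.withColumn("loud", shout($"key")).show()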

The case of Option

Option in Scala satisfies the definition of a monad in functional programming. So it’s no wonder that it’s usually offered as the solution for dealing with nulls. I believe the Option syntax is handy and the best practices are right on point. However, when it comes to Spark, there are some details we should be aware of:

  1. Option comes with some performance cost. In typical situations it is no big deal (as you can see in the Scala Option Performance Cost blog post by Lex Vorona). However, Databricks, in their Scala style guide, suggests preferring null over Option when you are dealing with performance-sensitive code (source). And that makes sense in the context of data: consider a UDF that uses some Option-dependent logic under the hood, applied to a data frame with millions of rows.

  2. UDFs can’t take an Option as a parameter. So again, you are on your own to deal with nulls inside the UDF body.

  3. And most importantly, Option will not magically resolve the issue of null; it kind of makes it worse. As we discussed earlier, null acts like a value while it’s not a value. So Some(null) is, unfortunately, a valid expression, but you can’t just use the Option/Some/None trio to resolve null exceptions in your code. For example, the following code greets you with a beautiful NullPointerException (here you can find more details on it):

     val df = Seq(Some("a"), Some(null)).toDF
On the other hand, Some(null) in Scala happily evaluates to Some[Null] = Some(null). These sorts of holes in the logic are a sign that Option is not able to fully cover NPE issues. And it’s not just true in the case of Spark; consider what happens in the following code:

    scala> def strfm(value: Option[String]): Option[String] = {
         |   value match {
         |     case Some(v) => Some(v.trim)
         |     case None    => None
         |   }
         | }

    // Test it with a normal String
    scala> strfm(Some(" DataChef is in the kitchen! "))
    res3: Option[String] = Some(DataChef is in the kitchen!)

    // Now let's go crazy!
    scala> strfm(Some(null))
    java.lang.NullPointerException
      at .strfm(<console>:18)
      ... 36 elided

Of course, there are some workarounds to cover this as well (e.g. passing Option(null) instead). But to me, these are workarounds, not a solid solution.
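For reference, the difference between the two constructors is easy to see in the REPL:

    scala> Option(null)   // Option.apply maps null to None
    res0: Option[Null] = None

    scala> Some(null)     // Some.apply wraps whatever it gets, null included
    res1: Some[Null] = Some(null)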

So how to cover it?

If you’ve read this far, you might be interested to see how I prefer to deal with this issue as of now. This solution by no means should be considered a best practice; however, it covers all of the concerns raised so far and, in my opinion, it doesn’t abstract away critical details.

In our use case, we tried to treat UDFs the way Spark treats them: as black boxes with a clear input and output mechanism. Having a black box means that at the beginning of each UDF, we deal with null explicitly, like this:

def awesomeFn(value: String): String = {
  value match {
    case null => null                      // short-circuit on null input
    case _    => applyAwesomeLogic(value)  // safe to run the real logic
  }
}

This way, we ensure we don’t get surprised when a null read by Spark ends up inside our functions and causes an NPE. In Dealing with null in Spark, Matthew Powers suggests an alternative solution:

def awesomeFn(value: String): Option[String] = {
  val v = Option(value).getOrElse(return None)
  Some(applyAwesomeLogic(v))
}
// In his sample the return value of the function is an Option, which we will
// come back to in a bit.

This is also fine, but I prefer to stick with pattern matching for two reasons:

  1. Even though null looks like a curse in the Scala world, I prefer to explicitly define how I’m handling it, instead of abstracting it away.

  2. The explicit return None statement, IMO, is a bit hard to read and reduces code clarity.

Now that we are sure about the values coming into our functions, let’s see how we are going to deal with return values. For the functions that get invoked directly by UDFs, I tried to lower the need for Option in return values and only use it where no other solution is possible (as a default value for an object of type Double, for example).

Yes, this means we should expect null to appear throughout the logic used by Spark, but I consider that a reasonable practice, since ignoring null was never possible in the first place. Plus, we don’t trade the performance of the main logic for a syntax that doesn’t resolve our main concern (NPE).
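As an illustration, here is a sketch of how such a null-aware function might be wired into a UDF (reusing the pattern-matching version of awesomeFn above; the column name key is also just an assumption):

import org.apache.spark.sql.functions.udf

// The UDF returns a plain, possibly-null String, so the resulting column keeps
// Spark's own notion of null and no Option is allocated per row.
val awesomeUdf = udf((value: String) => awesomeFn(value))

// df.withColumn("processed", awesomeUdf($"key"))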

However, this is not all. Other than the functions that directly deal with UDFs and Spark, we still have other code dependencies. Some use code written in Java, and some rely on the network or other external dependencies. To deal with these, we use Option, which looks reasonable. For example, here is a snippet that utilizes a Java function:

import scala.util.{Failure, Success, Try}

def createResponse(vatNumber: String, countryCode: String, response: EUVatCheckResponse): VatInfo = {
  // Map the service's "-" placeholder (and any null) to None.
  val name = response.getName match {
    case "-"       => None
    case n: String => Some(n)
    case _         => None
  }
  val address = response.getAddress match {
    case "-"       => None
    case n: String => Some(n)
    case _         => None
  }

  VatInfo(vatNumber, countryCode, response.isValid, name, address)
}

def getInfo(vatNumber: String, countryCode: String): Option[VatInfo] = {
  // Anything that goes wrong in the Java call (including NPE) ends up as None.
  Try(vatChecker.check(countryCode, vatNumber)) match {
    case Success(info) => Some(createResponse(vatNumber, countryCode, info))
    case Failure(exc)  =>
      log.error("Error while making a request to Vat service", exc)
      None
  }
}

In this code, the vatChecker call is a Java piece from the VatChecker library. Lots of things can go wrong on that call and cause errors (including NPE). In this case, the usage of an Option looked pretty reasonable, especially given that the result of this call gets cached and we aren’t going to repeat it for values we have already seen. So Option’s performance hit is managed here.
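The caching itself is outside the scope of this snippet, but as a rough sketch of the idea (a hypothetical in-memory memoization, not the actual implementation):

import scala.collection.concurrent.TrieMap

// Remember previous lookups so the network call (and the Option overhead)
// is paid at most once per (countryCode, vatNumber) pair.
val vatCache = TrieMap.empty[(String, String), Option[VatInfo]]

def getInfoCached(vatNumber: String, countryCode: String): Option[VatInfo] =
  vatCache.getOrElseUpdate((countryCode, vatNumber), getInfo(vatNumber, countryCode))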

Is this an optimal solution? It might not be; however, it currently covers all the details I shared in the previous sections of this article. To improve it, we would like to have your take on this issue and your real-world experience with it. So feel free to share your thoughts, or even better, share how you deal with nulls in your Spark/Scala applications.