If you’re using Scala (or another JVM language) and want to compute MD5 or SHA string hashes as hexadecimal strings (e.g. md5("foo") = "acbd18db4cc2f85cedef654fccc4a4d8"), consider using Guava’s Hashing or Apache Commons Codec’s DigestUtils instead of writing your own helper function: these libraries can be significantly faster than many of the DIY implementations floating around the web.

First I’ll list my recommended solutions. The examples here all return lowercase strings.

With Guava:

import com.google.common.hash.Hashing
import com.google.common.io.BaseEncoding
import java.nio.charset.StandardCharsets

Hashing.sha256().hashString("foo", StandardCharsets.UTF_8).toString
// returns "2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae"

Hashing.md5().hashString("foo", StandardCharsets.UTF_8).toString
// returns "acbd18db4cc2f85cedef654fccc4a4d8"

BaseEncoding.base16().lowerCase().encode(Array[Byte](-34, -83, -34, -83))
// returns "deaddead"

With Apache Commons:

import org.apache.commons.codec.binary.Hex
import org.apache.commons.codec.digest.DigestUtils
import java.nio.charset.StandardCharsets

DigestUtils.sha256Hex("foo".getBytes(StandardCharsets.UTF_8))
// returns "2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae"

DigestUtils.md5Hex("foo".getBytes(StandardCharsets.UTF_8))
// returns "acbd18db4cc2f85cedef654fccc4a4d8"

Hex.encodeHexString(Array[Byte](-34, -83, -34, -83))
// returns "deaddead"

These libraries also have functions for hashing streams, files, and other input types; see their API docs for details.

Comparing performance against a common DIY solution

There’s many implementations online which look something like this:

import java.security.MessageDigest

def diyMD5(str: String): String = {
    MessageDigest
        .getInstance("MD5")
        .digest(str.getBytes("UTF-8"))
        .map("%02x".format(_))
        .mkString("")
}

To compare performance to the Guava and Apache Commons implementations, let’s run a simple benchmark. Here I’m using a simple benchmarking helper function instead of JMH and am ignoring several factors (such as choice of input string). That said, this quick-and-dirty benchmark is sufficient to uncover a large performance gap:

def timeAndRecordAllocations(
    numWarmups: Int,
    numTrials: Int,
  )(functionToBenchmark: => Unit): Unit = {

  import java.lang.management.ManagementFactory
  import com.sun.management.ThreadMXBean
  val threadMxBean = ManagementFactory.getThreadMXBean.asInstanceOf[ThreadMXBean]
  val threadId = Thread.currentThread.getId

  // Warmup:
  {
    var i = 0
    while (i < numWarmups) {
      functionToBenchmark
      i += 1
    }
  }

  // Benchmark:
  System.gc()
  val bytesBefore = threadMxBean.getThreadAllocatedBytes(threadId)
  val cpuTimeBefore = threadMxBean.getThreadCpuTime(threadId)

  {
    var i = 0
    while (i < numTrials) {
      functionToBenchmark
      i += 1
    }
  }

  val cpuTimeAfter = threadMxBean.getThreadCpuTime(threadId)
  System.gc()
  val bytesAfter = threadMxBean.getThreadAllocatedBytes(threadId)
  println(s"Estimated bytes/op: ${Math.round(1.0 * (bytesAfter - bytesBefore) / numTrials)}")
  println(s"CPU time nanos/op: ${Math.round(1.0 * (cpuTimeAfter - cpuTimeBefore) / numTrials)}")
}

Running this on my laptop in an Ammonite shell with Scala 2.12.8 and OpenJDK 1.8.0_191:

// Setup dependencies:
@ import $ivy.`com.google.guava:guava:23.0`
@ import $ivy.`commons-codec:commons-codec:1.14`

// [...] define benchmarking helper functions as above [...]

// Run benchmarks:
@ timeAndRecordAllocations() { diyMD5("foo") }
Estimated bytes/op: 19912
CPU time nanos/op: 17393

@ timeAndRecordAllocations() { Hashing.md5().hashString("foo", StandardCharsets.UTF_8).toString }
Estimated bytes/op: 776
CPU time nanos/op: 520

@ timeAndRecordAllocations() { DigestUtils.md5Hex("foo".getBytes(StandardCharsets.UTF_8)) }
Estimated bytes/op: 785
CPU time nanos/op: 690

Eyeballing this, it looks like the Guava and Apache Commons solutions are ~25-33x faster and allocate ~25x less memory than the DIY implementation (at least for this small input string). There’s other DIY implementations which are faster (e.g. by using a BigInteger for formatting) but they’re generally still slower than these library functions.

A few notes:

  • If you do decide to write your own implementation (e.g. to cut out certain library overheads) then you should be more scientific and careful in your benchmarking than I’ve been here. My goal here is to dissuade folks from copying the slow implementations that you can find on Google, not to micro-optimize.
  • With small input strings the runtime is dominated by the cost of converting the MessageDigest’s output into a string of hexadecimal characters. If you’re hashing large files or byte streams then you may see less speedup from switching to the Guava or Apache implementations.
  • It may be possible to speed up the MessageDigest operations by using an alternative Java crypto provider. For example, see benchmark results from Amazon Corretto crypto provider (disclaimer: I haven’t had a chance to try this myself yet).
  • If you’re trying to replace slow DIY implementations in your own codebase, make sure to double-check the case of the character strings being generated: %02x and %02X will generate lowercase and uppercase strings, respectively, and accidentally mixing them up could have bad consequences if you’re, say, using the checksums as database keys.

Performance optimizations in the Guava and Apache Commons libraries

Here are some performance tricks used in third-party libraries’ implementations:

For Apache Spark

If you’re using Apache Spark’s DataFrame, Dataset, or SQL APIs then you should consider using its built-in md5 and sha2 functions: Spark’s built-in functions will generally be faster than UDFs because they can avoid conversions between Spark’s internal and external data representations. Using the built-ins will also allow your program to benefit from future improvements to Spark’s function implementations.

Under the hood, Spark 2.4.6’s implementation of these functions uses Apache Commons’ DigestUtils.