Columnar UDF Development Guide
Developers can implement columnar Hive UDFs or Scala UDFs in Gazelle for performance benefits.
The original UDF still needs to be registered with Spark for two reasons: 1) Spark creates an expression for the registered UDF, which makes expression replacement possible; 2) expression fallback still needs to work for cases where the expression tree contains expressions that are currently unsupported by Gazelle.
Hive UDF
Suppose there is a UDF for handling string-type input.
```java
package com.intel.test;

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyHiveUDF extends UDF {
  public String evaluate(String b) {
    ...
  }
}
```
The jar containing this class should be put on the Spark classpath. Then the UDF can be registered with Spark at runtime, as shown below.

```scala
spark.sql("CREATE TEMPORARY FUNCTION MyHiveUDFName AS 'com.intel.test.MyHiveUDF';")
```

`MyHiveUDFName` is the unique name for the UDF, and it is case-insensitive to both Spark and Gazelle.
We need to implement a native version of the `evaluate` function in Arrow/Gandiva. Then, in Gazelle's `ColumnarExpressionConverter.scala`, we need to add logic that finds the expression whose pretty name is `MyHiveUDFName` and replaces it with the implemented columnar expression. The columnar expression will call the Gandiva function to evaluate the given input.
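The name-matching step can be sketched as follows. This is a simplified, standalone illustration rather than the actual code in `ColumnarExpressionConverter.scala`: the object and set names are hypothetical, and the real converter matches against nodes of Spark's expression tree, not plain strings.

```scala
// Hypothetical sketch of the name-based matching that the converter performs.
object ColumnarUdfMatcher {
  // Names of UDFs that have a columnar (Gandiva-backed) implementation,
  // stored lower-cased because UDF names are case-insensitive.
  private val supportedUdfNames = Set("myhiveudfname")

  // Returns true when the expression's pretty name has a columnar replacement.
  def hasColumnarImplementation(prettyName: String): Boolean =
    supportedUdfNames.contains(prettyName.toLowerCase)
}
```

With a lookup like this, `hasColumnarImplementation` accepts the name in any casing, matching the case-insensitive behavior described above.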
Scala UDF
In Gazelle, the implementation for Scala UDFs is similar to that for Hive UDFs. The only difference is in how the matched expression is found before being replaced. See the code details in `ColumnarExpressionConverter.scala`.
We still need to register the original Scala UDF with Spark, e.g.,

```scala
spark.udf.register("MyScalaUDFName", (s: String) => s)
```
As with Hive UDFs, it is also required to implement a columnar expression and a Gandiva function.
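As a minimal sketch, the row-based logic behind such a UDF might look like the following. The uppercase transform is a hypothetical example (the identity function above elides the real logic), and only the plain Scala function is shown here, since registration requires a live SparkSession.

```scala
// Hypothetical UDF logic: uppercase a string (an assumption for illustration).
// At runtime the same function would be registered with Spark, e.g.:
//   spark.udf.register("MyScalaUDFName", toUpper)
// while the Gandiva function reimplements this logic natively over Arrow data.
val toUpper: String => String = s => s.toUpperCase
```

The columnar expression added in `ColumnarExpressionConverter.scala` then replaces calls to the registered UDF with the Gandiva-backed evaluation of this same logic.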