Spark Configurations for Native SQL Engine

Many configurations can impact Native SQL Engine performance and can be fine-tuned in Spark. You can add these configurations to spark-defaults.conf to enable or disable each setting.

| Parameters | Description | Recommended Setting |
| ---------- | ----------- | ------------------- |
| spark.driver.extraClassPath | Adds the Arrow Data Source and Native SQL Engine jar files to the Spark driver classpath | /path/to/jar_file1:/path/to/jar_file2 |
| spark.executor.extraClassPath | Adds the Arrow Data Source and Native SQL Engine jar files to the Spark executor classpath | /path/to/jar_file1:/path/to/jar_file2 |
| spark.executorEnv.LIBARROW_DIR | Sets the location of the Arrow library; by default, the location where the jar is uncompressed is searched | /path/to/arrow_library/ |
| spark.executorEnv.CC | Sets the location of gcc | /path/to/gcc/ |
| spark.executor.memory | Sets how much memory each Spark executor uses | |
| spark.memory.offHeap.size | Sets how much memory is used for Java off-heap. Note that Native SQL Engine leverages this setting to allocate memory space for native usage even when off-heap is disabled. The value depends on your system; increase it if you face out-of-memory issues in Native SQL Engine | 30G |
| spark.executor.extraJavaOptions | Sets how much direct memory Native SQL Engine uses. The value depends on your system; increase it if you face out-of-memory issues in Native SQL Engine | -XX:MaxDirectMemorySize=30G |
| spark.sql.sources.useV1SourceList | Chooses which sources use the V1 code path | avro |
| spark.sql.join.preferSortMergeJoin | Turns off preferSortMergeJoin in Spark | false |
| spark.sql.extensions | Turns on the Native SQL Engine plugin | com.intel.oap.ColumnarPlugin |
| spark.shuffle.manager | Turns on the Native SQL Engine columnar shuffle plugin | org.apache.spark.shuffle.sort.ColumnarShuffleManager |
| spark.oap.sql.columnar.batchscan | Enables or disables columnar batch scan; default is true | true |
| spark.oap.sql.columnar.hashagg | Enables or disables columnar hash aggregate; default is true | true |
| spark.oap.sql.columnar.projfilter | Enables or disables columnar project and filter; default is true | true |
| spark.oap.sql.columnar.codegen.sort | Enables or disables columnar sort; default is true | true |
| spark.oap.sql.columnar.window | Enables or disables columnar window; default is true | true |
| spark.oap.sql.columnar.shuffledhashjoin | Enables or disables columnar shuffled hash join; default is true | true |
| spark.oap.sql.columnar.sortmergejoin | Enables or disables columnar sort merge join; default is true | true |
| spark.oap.sql.columnar.union | Enables or disables columnar union; default is true | true |
| spark.oap.sql.columnar.expand | Enables or disables columnar expand; default is true | true |
| spark.oap.sql.columnar.broadcastexchange | Enables or disables columnar broadcast exchange; default is true | true |
| spark.oap.sql.columnar.nanCheck | Enables or disables NaN check; default is true | true |
| spark.oap.sql.columnar.hashCompare | Enables or disables hash compare in hash joins or hash aggregate; default is true | true |
| spark.oap.sql.columnar.broadcastJoin | Enables or disables columnar broadcast hash join; default is true | true |
| spark.oap.sql.columnar.wholestagecodegen | Enables or disables columnar whole-stage code generation; default is true | true |
| spark.oap.sql.columnar.preferColumnar | Enables or disables preferring columnar operators; default is false. This parameter can impact performance differently from case to case; in some cases setting it to false gives a performance boost | false |
| spark.oap.sql.columnar.joinOptimizationLevel | Falls back to row operators if there are several continuous joins | 6 |
| spark.sql.execution.arrow.maxRecordsPerBatch | Sets the maximum number of records per batch | 10000 |
| spark.oap.sql.columnar.wholestagecodegen.breakdownTime | Enables or disables metrics in columnar whole-stage code generation | false |
| spark.oap.sql.columnar.tmp_dir | Sets a folder to store the codegen files | /tmp |
| spark.oap.sql.columnar.shuffle.customizedCompression.codec | Sets the codec used for columnar shuffle; default is lz4 | lz4 |
| spark.oap.sql.columnar.numaBinding | Enables NUMA binding; default is false | true |
| spark.oap.sql.columnar.coreRange | Sets the core range for NUMA binding; only works when numaBinding is set to true. The setting depends on the number of cores in your system; 72 cores are used as an example | 0-17,36-53 \|18-35,54-71 |
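Instead of editing spark-defaults.conf, these settings can also be passed per job with `--conf` flags. A minimal command-line sketch; the jar paths and memory sizes are placeholders that you must adjust to your deployment:

```
# Placeholder jar paths; use your actual spark-columnar-core and
# spark-arrow-datasource jar locations.
spark-shell \
  --driver-class-path /path/to/jar_file1:/path/to/jar_file2 \
  --conf spark.executor.extraClassPath=/path/to/jar_file1:/path/to/jar_file2 \
  --conf spark.sql.extensions=com.intel.oap.ColumnarPlugin \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.memory.offHeap.size=30G \
  --conf spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=30G
```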

Below is an example spark-defaults.conf for the case where you used conda to install the OAP project.

```
##### Columnar Process Configuration

spark.sql.sources.useV1SourceList avro
spark.sql.join.preferSortMergeJoin false
spark.sql.extensions com.intel.oap.ColumnarPlugin
spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager

# note: Native SQL Engine depends on the Arrow Data Source
spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-<version>-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar
spark.executor.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-<version>-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar

spark.executorEnv.LIBARROW_DIR      $HOME/miniconda2/envs/oapenv
spark.executorEnv.CC                $HOME/miniconda2/envs/oapenv/bin/gcc
######
```
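Spark reads spark-defaults.conf as one property per line: the key runs up to the first whitespace, the rest of the line is the value, and lines starting with `#` are comments. A quick way to sanity-check a conf file before deploying it is to parse it the same way. The sketch below is illustrative only, not Spark's actual loader:

```python
def parse_spark_conf(text):
    """Parse spark-defaults.conf-style text into a dict.

    Each non-blank, non-comment line is split on the first run of
    whitespace: the first token is the key, the remainder the value.
    """
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


example = """
##### Columnar Process Configuration
spark.sql.join.preferSortMergeJoin  false
spark.sql.extensions                com.intel.oap.ColumnarPlugin
"""
settings = parse_spark_conf(example)
print(settings["spark.sql.extensions"])  # com.intel.oap.ColumnarPlugin
```

This mirrors the whitespace-separated layout of the example above, so a typo such as a missing value shows up immediately when you inspect the resulting dict.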

Before you start Spark, you must run the commands below to set the required environment variables.

```
export CC=$HOME/miniconda2/envs/oapenv/bin/gcc
export LIBARROW_DIR=$HOME/miniconda2/envs/oapenv/
```
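To avoid exporting these by hand in every session, the same variables can be placed in a shell profile or in `$SPARK_HOME/conf/spark-env.sh`. A minimal sketch; the conda prefix is an assumption carried over from the example above:

```shell
# Assumed conda prefix; adjust to where you installed the OAP environment.
OAP_ENV=$HOME/miniconda2/envs/oapenv
export CC=$OAP_ENV/bin/gcc
export LIBARROW_DIR=$OAP_ENV/
echo "CC=$CC"
echo "LIBARROW_DIR=$LIBARROW_DIR"
```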