Configuring a Sparkler Engine

This topic provides instructions for configuring a connection to a Sparkler compiler. Sparkler is Cambridge Semantics' Spark SPARQL interpreter: it expresses Spark ingestion jobs as SPARQL, and those jobs are executed by Spark. Jobs are submitted to Spark using Livy interactive sessions.
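
Anzo manages these Livy sessions for you, but for orientation, the sketch below shows roughly what starting an interactive session over Livy's REST API looks like. The endpoint (localhost:8998, the local Livy default) and the session options are illustrative assumptions, not values Anzo requires.

  # Minimal sketch: starting and polling a Livy interactive session over REST.
  # Anzo/Sparkler handles this automatically; the endpoint below is an
  # assumption matching the local Livy default (the Job Runner Endpoint setting).
  import time
  import requests

  LIVY_URL = "http://localhost:8998"

  # Ask Livy to start an interactive Spark session.
  resp = requests.post(f"{LIVY_URL}/sessions", json={"kind": "spark"})
  resp.raise_for_status()
  session_id = resp.json()["id"]

  # Poll until the session is ready to accept statements (or has failed).
  while True:
      state = requests.get(f"{LIVY_URL}/sessions/{session_id}").json()["state"]
      if state in ("idle", "error", "dead"):
          break
      time.sleep(2)
  print(f"Livy session {session_id} is {state}")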

  1. In the Administration application, expand the Connections menu and click ETL Engine Config. Anzo displays the ETL Engine Config screen, which lists existing ETL engine connections.

  2. On the ETL Engine Config screen, click the Add ETL Engine Config button and select Sparkler Engine Config. Anzo displays the Create Sparkler Engine Config screen.

  3. On the Create screen, type a Title and optional Description for the engine. Then click Save. Anzo displays the Details view for the new engine.

  4. Configure the engine by completing the required fields and adding any optional values on the Run, Advanced, and Publish tabs. To edit a field, click a value to make the field editable or click the edit icon. Click the check mark icon to save changes to an option, or click the X icon to clear the value for an option. See the Sparkler Settings Reference section below for descriptions of the settings.

Sparkler Settings Reference

This section provides reference information for the Sparkler ETL engine settings on each of the tabs.

Run Tab

  • Remote Server Name: The host name or IP address of the server where the compilation will be performed.
  • Job Runner Endpoint: The HTTP endpoint used to reach the Livy server. For example, when using the local Anzo Sparkler engine, the endpoint is localhost:8998. A quick connectivity check against this endpoint is sketched after this list.
  • Target Folder Name: The path and directory on the host where temporary artifacts can be created during the compilation and upload process. The location must be a valid path on the server that the user running the ETL job has access to.
  • Sparkler Home: The path and directory where the Sparkler compiler is installed on the host server.
  • SDI Dependencies Dir: The file system location where the Spark engine will look for the dependency .jar files, sdi-full-deps.jar and sdi-deps.jar. If you are using a remote Spark cluster, sdi-full-deps.jar and sdi-deps.jar can be copied to the Spark master node from the <install_path>/Server/data/sdiScripts/<Spark_version>/compile/dependencies-lib directory on the Anzo server.
  • Additional Jars: For relational database sources, this field lists the file system location for the JDBC driver .jar file or files that are used to connect to the source. All paths must be absolute. For multiple jar files, specify a comma-separated list. Do not include a space after the commas.

    For RDBs whose drivers are installed with Anzo, such as MSSQL (com.springsource.net.sourceforge.jtds_1.2.2.jar), Oracle (oracle.jdbc_11.2.0.3.jar), Amazon Redshift (org.postgresql.osgi.redshift_9.3.702.jar), and PostgreSQL (com.springsource.org.postgresql.jdbc3_8.3.603.jar), you can find the driver jar files in the <install_path>/Server/plugins directory.

    • If you use the local Spark ETL engine, the Additional Jars field should list the path to the jar files in the Anzo plugins directory. For example, /opt/Anzo/Server/plugins/org.postgresql.osgi.redshift_9.3.702.jar.
    • If you use a remote Spark cluster in cluster mode, the driver jar files need to be copied onto the HDFS. If Spark is running in client mode, jar files can be copied to the Hadoop/Spark master node file system. Specify the path to the copied jar files in the Additional Jars field.

    If a driver is uploaded to Anzo as described in Uploading a Plugin, the driver will be in the <install_path>/Server/dropins directory. For example, /opt/Anzo/Server/dropins/com.springsource.com.mysql.jdbc-5.1.6.jar.

  • Execute Locally: Select this option for local Sparkler engines on the Anzo server. Make sure this option is not selected when using a remote Sparkler server.
  • Do Callback: Select this option when you want Anzo to create a new data set in the Dataset catalog and generate load files for the graph source.
  • Run with Yarn: Employs the Spark YARN cluster manager when running ETL jobs.
  • Callback URL: When Do Callback is selected, enter one of the following URLs:
    http://Anzo_hostname_or_IP:Anzo_app_HTTP_port/anzoclient/call
    https://Anzo_hostname_or_IP:Anzo_app_HTTPS_port/anzoclient/call

    For example:

    https://10.100.0.1:8443/anzoclient/call
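
Before saving the configuration, it can be useful to confirm that the Job Runner Endpoint is reachable from the server where compilation runs. The sketch below simply queries Livy's sessions endpoint; the host, port, and timeout are assumptions, so substitute your own Job Runner Endpoint value.

  # Minimal sketch: checking that the Livy server behind the Job Runner
  # Endpoint responds. Host and port are assumptions; use your own values.
  import requests

  livy_endpoint = "http://localhost:8998"   # local Anzo Sparkler engine example
  resp = requests.get(f"{livy_endpoint}/sessions", timeout=10)
  resp.raise_for_status()
  print("Livy is reachable; sessions reported:", resp.json().get("total", 0))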

Advanced Tab

The options on this tab enable users with advanced Spark expertise to customize the values that are passed to Spark. A sketch of how these settings map to standard Spark configuration properties follows the list.

  • Enable CSV Error Reporting: Controls whether detailed CSV errors are displayed in the Anzo user interface.
  • Input Database Partition Default: By default, Sparkler attempts to partition a relational database table only if the table has a primary column with an integer data type and the source data has been profiled as described in Generating a Source Data Profile. When this option is enabled, Sparkler attempts to partition such tables even if a source data profile has not been generated.
  • Enable Hive Context (Enable in Livy Conf for Spark 2): Enables the Hive context for Spark version 1.6. For Spark 2, enable the Hive context in the Livy configuration instead.
  • Redirect Graph Output to Hive: Controls whether the ETL process writes data to Hive or to a file-based linked data set (FLDS). When this option is disabled (the default configuration), data is written to an FLDS that can be added to a graphmart and loaded to AnzoGraph. When this option is enabled, the ETL process writes data to Hive rather than creating an FLDS.
  • Run As User: Specifies the user to impersonate when starting the Livy session.
  • Max Graph Output File Size Default (Bytes): The maximum size, in bytes, for graph output files.
  • Max Input File Partition Size (Bytes): The maximum number of bytes to pack into a partition when reading files. Maps to the spark.files.maxPartitionBytes Spark configuration setting.
  • Spark Job Driver Cores: The number of cores to use for the driver process. Maps to the spark.driver.cores Spark configuration setting.
  • Spark Job Driver Memory: The amount of memory to use for the driver process. Maps to the spark.driver.memory Spark configuration setting.
  • Number of Executors Per Spark Job: The number of executors to request per Spark job. Maps to the spark.executor.instances Spark configuration setting.
  • Spark Job Cores Per Executor: The number of cores to use on each executor. Maps to the spark.executor.cores Spark configuration setting.
  • Spark Job Memory Per Executor: The amount of memory to use per executor process. Maps to the spark.executor.memory Spark configuration setting.
  • Off Heap Size (Bytes): The amount of memory in bytes that can be used for off-heap allocation. Maps to the spark.memory.offHeap.size Spark configuration setting.
  • Job Dependencies (Maven Package Coordinate): The comma-separated list of Maven jar coordinates to include on the driver and executor classpaths. Maps to the spark.jars.packages Spark configuration setting.
  • Maven Package Excludes: A comma-separated list of groupId:artifactId pairs to exclude while resolving the dependencies listed in Job Dependencies, to avoid dependency conflicts. Maps to the spark.jars.excludes Spark configuration setting.
  • Maven Repositories: A comma-separated list of additional remote repositories to search for the Maven coordinates specified in the Job Dependencies setting. Maps to the spark.jars.repositories Spark configuration setting.
  • Spark Job Deploy Mode (Livy Config has Precedence): The deploy mode of the Spark driver program. If this value is set in the Livy configuration, the Livy value takes precedence. Maps to the spark.submit.deployMode Spark configuration setting.
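
For reference, the Spark-related settings above correspond to standard Spark configuration properties, which a Livy interactive session accepts in its conf map. The sketch below only illustrates that mapping; the endpoint and all values are placeholder assumptions, and Anzo passes the values you configure on this tab for you.

  # Minimal sketch: the Advanced tab values expressed as standard Spark
  # properties in a Livy session request. All values are placeholders.
  import requests

  spark_conf = {
      "spark.driver.cores": "2",                     # Spark Job Driver Cores
      "spark.driver.memory": "4g",                   # Spark Job Driver Memory
      "spark.executor.instances": "4",               # Number of Executors Per Spark Job
      "spark.executor.cores": "4",                   # Spark Job Cores Per Executor
      "spark.executor.memory": "8g",                 # Spark Job Memory Per Executor
      "spark.memory.offHeap.size": "2147483648",     # Off Heap Size (Bytes)
      "spark.files.maxPartitionBytes": "134217728",  # Max Input File Partition Size (Bytes)
      "spark.submit.deployMode": "cluster",          # Spark Job Deploy Mode (Livy config has precedence)
  }

  resp = requests.post(
      "http://localhost:8998/sessions",              # assumption: local Livy endpoint
      json={"kind": "spark", "conf": spark_conf},
  )
  resp.raise_for_status()
  print("Created Livy session", resp.json()["id"])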

Publish Tab

The Publish tab controls the action of the Publish All button when a pipeline is published.

Sharing Tab

The Sharing tab enables you to share or restrict access to this ETL engine.

When the configuration is complete, Anzo offers this ETL engine as a choice when you ingest data and configure pipelines. To specify the default ETL engine that is used automatically whenever a pipeline is configured, see Configure the Default ETL Engine.
