Creating an Incremental Schema

If you onboard data via Spark ETL pipelines, follow the instructions below to create a Schema by writing an SQL query that defines a subset of the data to onboard in increments. If you do not use Spark, see Ingesting Data Incrementally with the GDI for instructions.

You can import or create up to 5 schemas per database data source. To include more than 5 schemas, create another data source for the additional schemas.

For better ETL pipeline performance, it is beneficial to include joins and/or filters in schema queries rather than configuring those operations at the mapping level. For more information, see Performance Considerations for Database ETL Pipelines.

  1. In the Anzo application, expand the Onboard menu and click Structured Data. Anzo displays the Data Sources screen, which lists any existing Data Sources. For example:

  2. Click the Data Source for which you want to create an incremental Schema. Anzo displays the Tables tab for the source. For example, the image below shows the Tables tab before any Schemas have been created for this source:

  3. Click the Create Schemas From Query button. Anzo displays the Create Schemas dialog box:

  4. In the Create Schemas dialog box, specify a name for this schema in the Schema Name field.
  5. In the Table Name field, specify a name for the table in the schema that the query will create.
  6. At the bottom of the screen, enable the Include increment data option by sliding the slider to the right. Anzo displays additional settings. For example:

  7. Populate the following fields so that you can use the values as a guide for writing the schema query:
    • Incremental Column Name: The source column whose value will be used to increment the data.
    • Value: The value in the column to use as the stopping point for the first import process and the starting point for the next import.

      Do not include quote characters in the Value field. If the SQL query requires quotes around values, such as ‘2010-01-01’ or ‘TestValue’, include the quotes around the {INCREMENTVALUE} parameter in the query and not in the Value field. For example, if the value to increment on is ‘2010-01-01’, specify 2010-01-01 in the Value field and add the quotes to the query like the following example:

      SELECT * FROM Orders WHERE OrderData > '{INCREMENTVALUE}'

      In addition, make sure that you use single quotes (') in schema queries. Including double quotes (") in a schema query results in an error when the query is run.

    • Comparator: The operator to use for comparing source values against the value above.
  8. In the query text field, type the SQL statement that will target the appropriate source data. The WHERE clause must include the incremental column name, the comparison operator, and an INCREMENTVALUE parameter that is substituted with the Value at runtime. For example, in the query below the incremental column name is EmployeeID, the comparator is > (greater than), and the {INCREMENTVALUE} parameter is specified after the comparator. {INCREMENTVALUE} will be replaced with the value in the Value field at runtime:
    SELECT EmployeeID, FirstName, LastName, Title, Salary, BirthDate, HireDate, Region, Country
    FROM northwind.Employees
    WHERE EmployeeID > {INCREMENTVALUE}

    Make sure that the query includes the INCREMENTVALUE parameter and uses the same Incremental Column Name and Comparator values as the fields below the query. For example:

  9. Click Save to save the query. Anzo creates the new schema and adds it to the list of schemas on the Tables screen. For example:

    You can expand the schema to view its tables. Selecting a row in the schema displays the sample data on the right side of the screen.

  10. If you want to create additional tables in the Schema, follow these steps:
    1. Click the menu icon () in the Actions column for the Schema name and select Add Table. For example:

      The Create New Table dialog box is displayed.

    2. In the Create New Table dialog box, specify a name for the new table in the Table Name field.
    3. In the Schema Query field, write the SQL query that defines the data for the new table.
    4. Click Save to add the table to the Schema and return to the Tables screen.

The source data can now be onboarded to Anzo. See Ingesting Data Sources via ETL Pipelines for next steps.

See Incremental Pipeline Reference for important information about running a pipeline that includes an incremental schema.

Related Topics