Load a Dataset from the Catalog (Load Dataset Step)

This topic describes how to configure a Load Dataset Step, which adds a dataset from the Datasets catalog to a graphmart. Follow the steps below to create a Load Dataset Step.

  1. Go to the graphmart to which you want to add a step and then click the Data Layers tab.
  2. On the Data Layers tab, find the layer that you want to add the step to. Click the menu icon for that layer and select Add Step/View. The Add Step/View dialog box is displayed with the New tab selected.

  3. Select Load Dataset Step and then click OK. Anzo creates the step and displays the Details tab.

  4. On the Details tab, configure the following options as needed:
    • Title: The required name of the step.
    • Description: An optional short description of the step.
    • Enabled: When creating a new step, the Enabled option is selected by default, indicating that the step is enabled and will run when the layer is loaded or refreshed. If you want to disable the step so that it is not processed, slide the Enabled slider to the left.
    • Linked Dataset: This field specifies the dataset to load. The list displays all of the datasets in the Datasets catalog. By default, the field is set to Exclude System Data. If you want to choose a system dataset, click the toggle button on the right side of the field to change it to Include System Data. When you select a dataset, the current working edition (Managed Edition) of the dataset is selected as the data to load. If you want to change the edition, click Modify Edition and follow the steps in Modifying an Edition.
    • Watch FLDS Directory: This option controls whether the FLDS directory is monitored for changes. If Watch FLDS Directory is enabled and changes to the files in the FLDS directory are detected, Anzo marks this step (and its layer) as needing a refresh.
    • Ignore Missing File or Directory: This option controls how the step handles missing files or subdirectories in the FLDS directory: either ignore them and proceed with the load, or fail the step.
    • Skip Elastic Search Snapshot Restoration if Index Already Exists: This option applies to graphmarts with Elasticsearch indexes and controls whether Anzo first checks to see if an index with the alias for the dataset already exists in Elasticsearch. If this setting is enabled and the index does exist, Anzo will not reload the index snapshot into Elasticsearch.
  5. Typically when users add a dataset to a graphmart, they want to load the entire dataset. However, if you are familiar with the data and want to exclude certain predicates from the dataset or write an INSERT query that filters the data, you can configure filtering options on the Filter tab. For information, see Filter Tab below.
  6. Click Save to save the step configuration.

Once the Details tab is configured, the step can be run. For information about running this step conditionally by setting up an execution condition, see Defining Execution Conditions.

Filter Tab

The Filter tab includes options for filtering out some of the data in the dataset. If you want to load all of the statements in the dataset, do not configure Filter options. If you want to exclude some statements, configure the Filter options.

Multiple Select

This option enables you to exclude certain triples from the load by selecting the predicates to filter out. These are known as Masked Predicates. To exclude predicates, select the Multiple Select radio button, then click the Masked Predicate drop-down list and select a predicate to add it to the Masked Predicate field. Click the field again to select additional predicates. You can remove a predicate from the masked list by clicking the X next to the predicate name.

Query

If you want to hand-pick the data to load, you can use this option to write a SPARQL query that inserts specific values or filters out certain values. To write a query, select the Query radio button, and then type an INSERT query in the text box. For example, you can use the following format to load only the triples whose subject has a specific type:

INSERT {
  GRAPH ${targetGraph} {
    ?s ?p ?o .
  }
}
${usingSources}
WHERE {
  ?s ?p ?o .
  # Match the type as a triple pattern so that ?type is bound when the FILTER is evaluated.
  ?s a ?type .
  FILTER(?type = <URI>)
}

The ${targetGraph} and ${usingSources} parameters are required. Anzo automatically replaces them with the appropriate graph URIs when the step is run.
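
To illustrate the substitution, the sketch below shows how a query of this shape might look after the parameters are resolved. Both graph URIs are hypothetical placeholders. The expansion of ${usingSources} into one or more USING clauses is an assumption, based on the parameter's position between the INSERT template and the WHERE clause, where SPARQL Update permits only USING clauses.

```sparql
# Hypothetical result of parameter substitution; all graph URIs are placeholders.
INSERT {
  GRAPH <http://example.com/hypothetical/graphmart/layer1> {
    ?s ?p ?o .
  }
}
# Assumed expansion of ${usingSources}: one USING clause per source graph.
USING <http://example.com/hypothetical/dataset/edition1>
WHERE {
  ?s ?p ?o .
}
```

This resolved form is shown for illustration only. In the Query text box, keep the two parameters in the query so that Anzo can supply the graph URIs.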

In load filter queries, URIs are not supported in the object position. To specify a URI as an object, include the standard ?s ?p ?o triple pattern in the WHERE clause and then apply FILTER statements with URIs as needed. URIs are supported in the subject or predicate position.
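
As a minimal sketch of this restriction, the query below excludes every triple whose object is a particular URI by comparing the ?o variable in a FILTER instead of writing the URI in the triple pattern. The URI http://example.com/hypothetical/status/Archived is a made-up placeholder; substitute a value from your own dataset.

```sparql
# Not supported in a load filter query: a URI written directly in the object position.
#   ?s ?p <http://example.com/hypothetical/status/Archived> .

# Supported: match the object with a variable, then compare it to the URI in a FILTER.
INSERT {
  GRAPH ${targetGraph} {
    ?s ?p ?o .
  }
}
${usingSources}
WHERE {
  ?s ?p ?o .
  FILTER (?o != <http://example.com/hypothetical/status/Archived>)
}
```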

For example, the following query filters the data in a sample dataset that includes information about people and the events they buy tickets for. The WHERE clause filters the data to load only the triples that are related to person1 (personid=1):

INSERT {
  GRAPH ${targetGraph} {
    ?s ?p ?o .
  }
}
${usingSources}
WHERE {
  ?s ?p ?o ;
     <http://cambridgesemantics.com/ont/autogen/c89d/Tickets#tickit_users_personid> ?id .
  FILTER (?id = 1)
}
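
A query can also exclude specific predicates, similar in spirit to selecting Masked Predicates under Multiple Select. The following sketch loads every triple except those that use the listed predicates. Both predicate URIs are hypothetical, and whether the result matches the Multiple Select option exactly is an assumption.

```sparql
INSERT {
  GRAPH ${targetGraph} {
    ?s ?p ?o .
  }
}
${usingSources}
WHERE {
  ?s ?p ?o .
  # Hypothetical predicate URIs; replace them with predicates from your dataset.
  FILTER (?p NOT IN (
    <http://example.com/hypothetical/ont#ssn>,
    <http://example.com/hypothetical/ont#creditCardNumber>
  ))
}
```

Because URIs are supported in the predicate position, the list in the FILTER can name the predicates directly.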