Creating an Unstructured Pipeline

Follow the instructions below to create a new unstructured pipeline.

  1. In the Anzo application, expand the Onboard menu and click Unstructured Data. Anzo displays the Pipelines screen, which lists any existing unstructured pipelines.

  2. Click the Add Unstructured Pipeline button and select Distributed Unstructured Pipeline. Anzo opens the Create Distributed Unstructured Pipeline dialog box.

  3. In the Title field, type a name for the pipeline.

    The Title serves as a key that identifies this pipeline and its corpus in multiple contexts, and the pipeline's corpus dataset name is derived from it. Specify a Title that is unique and stable.

  4. Type an optional description for the pipeline in the Description field.
  5. If necessary, click the Target Anzo Data Store field and select the Anzo Data Store for this pipeline.
  6. If the environment is configured for dynamic Kubernetes-based deployments of the Anzo Unstructured infrastructure, select the Deploy Unstructured Infrastructure Dynamically checkbox and leave the Static Elasticsearch Config field blank.
  7. If necessary, click the Static Elasticsearch Config field and select the Elasticsearch connection to use for this pipeline. If you use dynamic deployments to deploy Elasticsearch instances on-demand, leave this field blank; Anzo prompts you to choose a Cloud Location when the pipeline runs.
  8. Click Save to create the pipeline. Anzo displays the pipeline Overview screen.

    The pipeline is saved automatically and is continuously validated against its current configuration. Anzo displays validation issues in red at the top of the screen; the warnings disappear as you add components to the pipeline.

  9. If necessary, click Advanced to configure the advanced pipeline settings. For details about the advanced settings, see Pipeline Settings Reference.
  10. Click the Crawlers tab and follow the substeps below to add a crawler to the pipeline:
    1. Click Add Input. Anzo displays the Add Component dialog box. The New tab is selected and lists all available crawlers. The Existing Components tab lists crawlers that have been previously configured for other pipelines.

    2. To add a new crawler, select it on the New tab. To add an existing crawler, click the Existing Components tab and select a crawler. The list below describes each of the crawlers:
      • File Based Dataset Crawler: Include this crawler to process data from a file-based linked data set (FLDS) on a file store.
      • Filesystem Crawler: Include this crawler to process documents, such as email messages, PDF, XML, PowerPoint, Excel, OneNote, or Word files, and images, that are available on a file store.
      • Graphmart RDF Crawler: Include this crawler to process RDF in an online graphmart or specific data layer.
      • Local Volume Dataset Crawler: Include this crawler to process RDF data that is stored as a linked data set (LDS) in an Anzo journal.
    3. After selecting a crawler, click OK. Anzo opens the Create dialog box for that crawler so that you can configure it. The list below provides details about the settings for each crawler. Click a crawler name to view the details for that component:
    4. When you have finished configuring the crawler, click Save. Anzo adds the crawler to the pipeline and returns to the Crawlers screen.

    5. If you want to change the crawler configuration, click the Edit icon for the crawler and modify the settings as needed. If you want to add another crawler to the pipeline, repeat substeps 1 through 4.
  11. Click the Annotators tab and follow the substeps below to add an annotator to the pipeline:
    1. Click Add Output to select an annotator. Anzo opens the Add Component dialog box. The New tab is selected and lists the available annotators; the Existing Components tab lists annotators that have been previously configured for other pipelines.

    2. To add a new annotator to the pipeline, click the annotator name to select it. To add an existing annotator to the pipeline, click the Existing Components tab, and then select an annotator. The list below describes each of the default annotators:
      • Custom Relationship Annotator: Include this annotator to map relationships between annotations based on the number of characters between the annotations.
      • External Service Annotator: Include this annotator to call an HTTP endpoint that returns annotations.
      • Keyword and Phrase Annotator: Include this annotator to create annotations based on the phrases that you specify.
      • Knowledgebase Annotator: Include this annotator to link structured and unstructured data by finding instances in data layers, graphmarts, or Anzo linked datasets. Using the names and aliases of the entities, or patterns that are indicative of them, this annotator marks up documents with links to the structured entities.
      • Regex Annotator: Include this annotator to use regular expression rules to identify entities such as email addresses, URLs, phone numbers, or any other entity that can be matched using a regular expression.
    3. After selecting an annotator, click OK. Anzo opens the Create dialog box for the component. Complete the fields to configure the annotator. The list below provides details about the settings for the annotators that are typically used in pipelines. Click an annotator name to view the details for that component:
    4. When you have finished configuring the annotator, click Save. Anzo adds the annotator to the pipeline and returns to the Annotators screen.

    5. If you want to change the annotator configuration, click the Edit icon for the annotator and modify the settings as needed (see Annotator Settings Reference for information about the settings). If you want to add another annotator to the pipeline, repeat substeps 1 through 4.
  12. When you have finished adding crawlers and annotators to the pipeline, click the Run Pipeline button to run the pipeline.
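
The Regex Annotator described in step 11 identifies entities with regular expression rules. The sketch below is not Anzo's rule syntax (rules are configured in the annotator's settings, and the pattern names here are hypothetical); it only illustrates, in Python, the kind of matching such rules perform:

```python
import re

# Illustrative patterns only -- the actual rules are configured in the
# Regex Annotator settings, not written as Python code.
PATTERNS = {
    "EmailAddress": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "URL": re.compile(r"https?://[^\s]+"),
    "PhoneNumber": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def annotate(text):
    """Return (entity_type, matched_text, start, end) for each rule match."""
    annotations = []
    for entity_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            annotations.append((entity_type, m.group(), m.start(), m.end()))
    return sorted(annotations, key=lambda a: a[2])

doc = "Contact support@example.com or see https://example.com/help"
for ann in annotate(doc):
    print(ann)
```

Each tuple records the entity type, the matched text, and its character span, which is roughly the information an annotation needs in order to be tied back to its position in the source document.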

The process can take several minutes to complete. You can click the Progress tab to view details such as the pipeline status, runtime, number of documents processed, and errors.

When the pipeline finishes, a new dataset becomes available in the Datasets catalog. From the catalog, you can create a graphmart from the dataset so that you can explore and analyze the data. For instructions, see Creating a Graphmart from a Dataset. You can also add the dataset to an existing graphmart by following the steps in Adding a Dataset to an Existing Graphmart.
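
As a closing illustration, the Custom Relationship Annotator in step 11 maps relationships between annotations based on the number of characters between them. The Python below is a conceptual sketch of that proximity idea, not Anzo's implementation; the `(type, text, start, end)` tuple layout is an assumption for the example:

```python
def link_by_proximity(annotations, max_gap):
    """Pair annotations whose spans lie within max_gap characters of each other.

    annotations: list of (entity_type, text, start, end) tuples, as an
    earlier (hypothetical) annotation stage might produce them.
    """
    pairs = []
    ordered = sorted(annotations, key=lambda a: a[2])  # sort by start offset
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:]:
            gap = right[2] - left[3]  # characters between the two spans
            if gap > max_gap:
                break  # spans are sorted, so later ones are even farther away
            pairs.append((left[1], right[1], gap))
    return pairs

spans = [("Person", "Alice", 0, 5), ("Org", "Acme", 20, 24),
         ("Person", "Bob", 100, 103)]
print(link_by_proximity(spans, max_gap=30))  # -> [('Alice', 'Acme', 15)]
```

In the actual annotator, the character-distance threshold and the annotation types to pair are settings you configure in the dialog box; the sketch only shows why a character-gap rule is cheap to evaluate once spans are known.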