Pipeline Settings Reference

The table below defines the Advanced settings that are available on the Overview tab when viewing an unstructured pipeline.

| Setting | Description |
| --- | --- |
| Append Timestamp | Controls whether to add a timestamp to unstructured document URIs. This setting is enabled by default. |
| Diagnostic Logging | Controls whether verbose diagnostic logging is enabled for the pipeline. This setting is disabled by default. When enabled, debug-level logging is performed for the duration of the pipeline. |
| Current Pipeline Run | Points to the pipeline run object that tracks the ongoing execution of the pipeline. |
| Pipeline Network Connection | Specifies the network connection configuration that the pipeline's worker nodes use to connect to the Anzo server. If not specified, this setting defaults to the Unstructured Cluster connection configuration. |
| Persist Extracted Text | Controls whether to persist the extracted text from documents. This setting is enabled by default. |
| Persist HTML | Controls whether to persist the extracted highlighted/annotated HTML from documents. This setting is enabled by default. |
| Persist Original Binary | Controls whether to persist the binary from the original documents. This setting is enabled by default. |
| Persist Hit Spans | Controls whether to persist the hit spans for the annotations of unstructured documents. This setting is disabled by default. |
| Persist Nothing | Controls whether to persist no RDF data about the documents or annotations. When enabled, none of this RDF data is saved. This setting is disabled by default. |
| Skip Elastic Search Indexing | Controls whether to skip Elasticsearch indexing. This setting is disabled by default. |
| Skip Elastic Search JSON creation | Controls whether to skip creating Elasticsearch JSON. This setting is disabled by default. |
| Is Corpus Cumulative | Controls whether to add the components of each pipeline run to the working edition of the dataset. This setting is disabled by default. |
| Skip Text Extraction | Controls whether to skip text extraction. This setting is disabled by default. |
| Delete Elastic Search JSON files | Controls whether to delete the Elasticsearch JSON files after they are indexed. This setting is enabled by default. |
| Allow Empty Documents | Controls whether to allow documents that have no text to proceed through the pipeline. This setting is disabled by default. |
| Archive and Host Content | Controls whether to download, cleanse, encapsulate, archive, and host complete document content with inline artifacts. This setting is enabled by default. |
| HTTP Fetch in Archive | Controls whether the archiving process resolves and downloads HTTP URLs that are specified in documents. This setting is disabled by default. |
| Corpus Linked Dataset | Specifies the file-based linked data set (FLDS) used for documents and annotations from this pipeline. This setting defaults to the name of the pipeline. |
| Corpus Name | Specifies the name of the corpus (collection of documents) for the pipeline. |
| Phase Status Persistence | Specifies how phase status metadata is persisted for each document in the pipeline. |
| Write Status Updates to Jnl | Controls whether status updates for pipeline runs are written to the journal. This setting is enabled by default. |
| Write Status Updates to FLDS | Controls whether status updates for pipeline runs are written to an FLDS. This setting is disabled by default. |
| Write Original Binary On Timeout | Controls whether the original binary is written if the pipeline times out or errors. This setting is disabled by default. |
| RamDisk Directory Location | Specifies an optional RamDisk base directory under which to create temporary files. Using a RamDisk may speed up the pipeline. |
| Use File Name as Document Title | Controls whether to use the file's name on disk as the document title. This setting is disabled by default. |
| RDF Statement Buffer Size | Specifies the maximum number of statements to buffer before writing. The default value is 10,000. |
| RDF File Statement Count | Specifies the maximum number of statements to include in each RDF output file. |
| Batch Size | Specifies the number of documents to include in one batch. |
| Maximum Allowed Session Issues | Specifies the maximum number of issues that can be encountered in a run of this pipeline before the pipeline fails. |
| UI Update Interval (in milliseconds) | Specifies the interval to wait between the queries that update the data on the pipeline Progress screen. The default value is 30,000 milliseconds (30 seconds). |
| Document Processing Timeout | Specifies the timeout in milliseconds for each document batch to be processed. Leave this value unset (or set it to 0) to use the microservice cluster's default timeout value. |
| Error On No Documents Found | Controls whether to fail the pipeline if no documents are found. This setting is enabled by default. |
| Maximum Pipeline Run Status Journals | Specifies the maximum number of pipeline run status journals to keep before aging them off to an FLDS. By default, only the status of the most recent run of a pipeline remains stored in a status journal. The status of each earlier run is automatically converted to an FLDS, and the original status journal is deleted. |
| Elastic Search Bulk Actions | Specifies the maximum number of indexing actions to queue during Elasticsearch indexing. The default value is 2,000. |
| Elastic Search Bulk Size | Specifies the maximum size of the document queue during Elasticsearch indexing. The default value is 5. |
| Elastic Search Bulk Concurrent Requests | Specifies the maximum number of concurrent bulk requests to allow during Elasticsearch indexing. The default value is 1. |
| Elastic Search Bulk Max Threads | Specifies the maximum number of threads to use for Elasticsearch indexing. The default value is 1. |
| Elastic Search Mapping | Specifies (in JSON format) the mapping to use when indexing unstructured documents in Elasticsearch. |
| Elastic Search Pipeline Configuration | Specifies (in JSON format) the Elasticsearch pipeline configuration to use when indexing unstructured documents. |
| Elastic Search Directory Write-all | Controls whether to give write-all permission to the esi directory in the output corpus FLDS. |
| Elasticsearch Index Settings | Specifies (in JSON format) the index settings to use when indexing unstructured documents in Elasticsearch. |
| Skip Teardown Of Dynamic Resources | Controls whether dynamic Kubernetes-based resources associated with the pipeline are left running after the pipeline is complete. This setting is disabled by default. Enabling it can result in increased cloud resource usage. |
| Default Finish Pending Writes On Pipeline Cancellation | Controls whether to finish any pending writes for documents during pipeline cancellation. A flag in the cancellation request can override this setting. This setting is enabled by default. |
| Post-persist Postprocessor | Specifies any post-persist semantic postprocessors in the pipeline. |
| Rich Text Extractor | Lists the HTML extractors to use in the pipeline. |
| Post Worker Service | Specifies a service to invoke on documents after they are successfully processed by the pipeline worker processes. |
| Pre-persist Postprocessor | Specifies any pre-persist semantic postprocessors in the pipeline. |
| Status Journal Base Path | Specifies the base path for storage of the status journal. By default, status journals are written to a status_journals subdirectory in the Anzo Data Store that is specified for the pipeline. |
| Content Transformer | Specifies any content transformation and metadata extraction components to use in the pipeline. |
| Document Crawler Thread Count | Specifies the number of threads to use for document crawling. The default value is 4. |
| Worker Service ID | Specifies the worker service ID to send requests to. If not specified, the default is pipelineWorkerService. |
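
The Elastic Search Mapping and Elasticsearch Index Settings values are entered as JSON. The following is a minimal sketch of how such values might be drafted and validated before pasting them into the settings. The field names (title, created) and the shard and replica counts are hypothetical, and this reference does not state whether the pipeline expects these bare objects or objects wrapped in an outer "mappings" or "settings" key, so treat the shapes as assumptions to confirm against your environment.

```python
import json

# Hypothetical mapping: the field names below are illustrative only.
# "properties", "type", "text", and "date" are standard Elasticsearch
# mapping vocabulary; whether the pipeline expects this bare object is
# an assumption.
mapping = {
    "properties": {
        "title": {"type": "text"},
        "created": {"type": "date"},
    }
}

# Hypothetical index settings using standard Elasticsearch setting names.
index_settings = {
    "number_of_shards": 1,
    "number_of_replicas": 0,
}

# Serialize to the JSON text that would be pasted into each setting.
mapping_json = json.dumps(mapping, indent=2)
settings_json = json.dumps(index_settings, indent=2)

# Round-trip through the parser to confirm the text is well-formed JSON.
assert json.loads(mapping_json) == mapping
assert json.loads(settings_json) == index_settings

print(mapping_json)
print(settings_json)
```

Building the values as dictionaries and serializing them avoids hand-written JSON errors such as trailing commas or unquoted keys.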
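
The fallback rule described for Document Processing Timeout (unset or 0 means the microservice cluster's default applies) can be sketched as a small helper. The function name, its parameters, and the example default of 60,000 milliseconds are hypothetical and are not part of the product. Rejecting negative values is also an assumption, since this reference does not say how they are handled.

```python
from typing import Optional


def effective_timeout_ms(configured_ms: Optional[int], cluster_default_ms: int) -> int:
    """Return the timeout to apply to one document batch.

    Hypothetical helper: None models an unset value. Per the setting's
    description, an unset value or 0 falls back to the cluster default.
    """
    if configured_ms is None or configured_ms == 0:
        return cluster_default_ms
    if configured_ms < 0:
        # Assumption: negative timeouts are treated as invalid input.
        raise ValueError("timeout must be a non-negative number of milliseconds")
    return configured_ms


# Example with an assumed cluster default of 60,000 ms.
CLUSTER_DEFAULT_MS = 60_000
print(effective_timeout_ms(None, CLUSTER_DEFAULT_MS))
print(effective_timeout_ms(0, CLUSTER_DEFAULT_MS))
print(effective_timeout_ms(120_000, CLUSTER_DEFAULT_MS))
```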