Pipeline Settings Reference

The table below defines the Advanced settings that are available on the Overview tab when viewing an unstructured pipeline.

Setting	Description
Append Timestamp	Controls whether to add a timestamp to unstructured document URIs. This setting is enabled by default.
Diagnostic Logging	Controls whether verbose diagnostic logging is enabled for the pipeline. This setting is disabled by default. When enabled, debug-level logging is performed for the duration of the pipeline.
Current Pipeline Run	Points to the pipeline run object that tracks the ongoing execution of the pipeline.
Pipeline Network Connection	Specifies the network connection configuration that the pipeline's worker nodes use to connect to the Anzo server. If not specified, this setting defaults to the Unstructured Cluster connection configuration.
Persist Extracted Text	Controls whether to persist the extracted text from documents. This setting is enabled by default.
Persist HTML	Controls whether to persist the extracted highlighted/annotated HTML from documents. This setting is enabled by default.
Persist Original Binary	Controls whether to persist the binary from the original documents. This setting is enabled by default.
Persist Hit Spans	Controls whether to persist the hit spans for the annotations of unstructured documents. This setting is disabled by default.
Persist Nothing	Controls whether to skip persisting all RDF data about the documents and annotations. This setting is disabled by default.
Skip Elastic Search Indexing	Controls whether to skip Elasticsearch indexing. This setting is disabled by default.
Skip Elastic Search JSON creation	Controls whether to skip creating Elasticsearch JSON. This setting is disabled by default.
Is Corpus Cumulative	Controls whether to add the components of each pipeline run to the working edition of the dataset. This setting is disabled by default.
Skip Text Extraction	Controls whether to skip text extraction. This setting is disabled by default.
Delete Elastic Search JSON files	Controls whether to delete the Elasticsearch JSON files after they are indexed. This setting is enabled by default.
Allow Empty Documents	Controls whether to allow documents that have no text to proceed through the pipeline. This setting is disabled by default.
Archive and Host Content	Controls whether to download, cleanse, encapsulate, archive, and host complete document content with inline artifacts. This setting is enabled by default.
HTTP Fetch in Archive	Controls whether the archiving process should resolve and download HTTP URLs that are specified in documents. This setting is disabled by default.
Corpus Linked Dataset	Specifies the FLDS used for documents and annotations from this pipeline. This setting defaults to the name of the pipeline.
Corpus Name	Specifies the name of the corpus (collection of documents) for the pipeline.
Phase Status Persistence	Specifies how phase status metadata is persisted for each document in the pipeline.
Write Status Updates to Jnl	Controls whether status updates for pipeline runs are written to the journal. This setting is enabled by default.
Write Status Updates to FLDS	Controls whether status updates for pipeline runs are written to an FLDS. This setting is disabled by default.
Write Original Binary On Timeout	Controls whether the original binary is written if the pipeline times out or encounters an error. This setting is disabled by default.
RamDisk Directory Location	Specifies an optional RamDisk base directory to create temporary files under. Using a RamDisk may speed up the pipeline.
Use File Name as Document Title	Controls whether to use the file's name on disk as the document title. This setting is disabled by default.
RDF Statement Buffer Size	Specifies the maximum number of statements to buffer before writing. The default value is 10,000.
RDF File Statement Count	Specifies the maximum number of statements to include in each RDF output file.
Batch Size	Specifies the number of documents to include in one batch.
Maximum Allowed Session Issues	Specifies the maximum number of issues that can be encountered in a run of this pipeline before failing the pipeline.
UI Update Interval (in milliseconds)	Specifies how long to wait between the queries that refresh the data on the pipeline Progress screen. The default value is 30,000 milliseconds (30 seconds).
Document Processing Timeout	Specifies the timeout in milliseconds for each document batch to be processed. Leave this value unset (or set it to 0) to use the microservice cluster's default timeout value.
Error On No Documents Found	Controls whether to fail the pipeline if no documents are found. This setting is enabled by default.
Maximum Pipeline Run Status Journals	Specifies the maximum number of pipeline run status journals to keep before aging them off to an FLDS. By default, only the status of the most recent run of a pipeline remains stored in a status journal. The status of each earlier run is automatically converted to an FLDS, and its original status journal is deleted.
Elastic Search Bulk Actions	Specifies the maximum number of indexing actions to queue during Elasticsearch indexing. The default value is 2,000.
Elastic Search Bulk Size	Specifies the maximum size of the document queue during Elasticsearch indexing. The default value is 5.
Elastic Search Bulk Concurrent Requests	Specifies the maximum number of concurrent bulk requests to allow during Elasticsearch indexing. The default value is 1.
Elastic Search Bulk Max Threads	Specifies the maximum number of threads to use for Elasticsearch indexing. The default value is 1.
Elastic Search Mapping	Specifies (in JSON format) the mapping to use when indexing unstructured documents in Elasticsearch.
Elastic Search Pipeline Configuration	Specifies (in JSON format) the Elasticsearch pipeline configuration to use when indexing unstructured documents.
Elastic Search Directory Write-all	Controls whether to give write-all permission to the `esi` directory in the output corpus FLDS.
Elasticsearch Index Settings	Specifies (in JSON format) the index settings to use when indexing unstructured documents in Elasticsearch.
Skip Teardown Of Dynamic Resources	Controls whether dynamic Kubernetes-based resources associated with the pipeline are left running after the pipeline completes. This setting is disabled by default. Enabling it can result in increased cloud resource usage.
Default Finish Pending Writes On Pipeline Cancellation	Controls whether to finish any pending writes for documents during pipeline cancellation. This setting is enabled by default. A flag in the cancellation request can override it.
Post-persist Postprocessor	Specifies any post-persist semantic postprocessors in the pipeline.
Rich Text Extractor	Lists the HTML extractors to use in the pipeline.
Post Worker Service	Specifies a service to invoke on documents after they are successfully processed by the pipeline worker processes.
Pre-persist Postprocessor	Specifies any pre-persist semantic postprocessors in the pipeline.
Status Journal Base Path	Specifies the base path for storage of the status journal. By default, status journals are written to a `status_journals` subdirectory in the Anzo Data Store that is specified for the pipeline.
Content Transformer	Specifies any content transformation and metadata extraction components to use in the pipeline.
Document Crawler Thread Count	Specifies the number of threads to use for document crawling. The default value is 4.
Worker Service ID	Specifies the worker service ID to send requests to. If not specified, the default is `pipelineWorkerService`.
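
Several settings in the table fall back to a default when left unset. The sketch below is only an illustration of those fallback rules, not Anzo's implementation: the class, field, and function names are hypothetical, and the cluster default timeout in the usage lines is a made-up value. Only the numeric defaults and the `pipelineWorkerService` ID come from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentedDefaults:
    """Defaults stated in the table above. Field names are hypothetical."""
    rdf_statement_buffer_size: int = 10_000
    ui_update_interval_ms: int = 30_000
    es_bulk_actions: int = 2_000
    es_bulk_size: int = 5
    es_bulk_concurrent_requests: int = 1
    es_bulk_max_threads: int = 1
    document_crawler_thread_count: int = 4
    worker_service_id: str = "pipelineWorkerService"


def effective_timeout_ms(configured: Optional[int], cluster_default_ms: int) -> int:
    """Document Processing Timeout: unset or 0 means use the cluster default."""
    if configured is None or configured == 0:
        return cluster_default_ms
    return configured


def effective_worker_service_id(configured: Optional[str]) -> str:
    """Worker Service ID: an unset or empty value means the documented default."""
    if not configured:
        return DocumentedDefaults().worker_service_id
    return configured


# Usage: 60_000 is an assumed cluster default, chosen only for illustration.
timeout_unset = effective_timeout_ms(None, cluster_default_ms=60_000)
timeout_custom = effective_timeout_ms(15_000, cluster_default_ms=60_000)
service_id = effective_worker_service_id(None)
```

Treating an empty string the same as an unset Worker Service ID is an assumption of this sketch. The table only says what happens when the value is not specified.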