Searching for Text in Unstructured Documents
Anzo Hi-Res Analytics incorporates the Elasticsearch search engine so that you can run full-text searches on unstructured documents. This topic provides instructions for creating a dashboard with text search capability and running a search across unstructured documents.
For information about running a pipeline to create an unstructured document data set, see Onboarding Unstructured Data.
- In the Anzo application, expand the Blend menu and click Graphmarts. Anzo displays a list of the existing graphmarts.
- On the Graphmarts screen, click the name of the graphmart that contains the unstructured documents. Anzo displays the graphmart overview screen.
- Click the Create Dashboard button. The Hi-Res Analytics application opens and displays the Create Dashboard dialog box. Leave Graphmart Dashboard selected and click Next.
- Type a name for the dashboard in the Title field and enter an optional Description. Then click OK. Anzo creates the dashboard and populates the Graphmart and Data Layers panels.
- In the Data Types panel, click the plus icon (+) to open the Select Data Types dialog box. In the dialog box, select Unstructured Document.
- Click OK. Anzo adds the data type to the Data Types panel.
- Next, click the Lenses button in the main toolbar and select New from the drop-down list. Anzo opens the Create Lens dialog box.
- In the Create Lens dialog box, select Document Search and then click Next. Anzo displays the General Information dialog box.
- Type a name for the lens in the Title field and include an optional Description. Then click Finish. Anzo opens the Document Search Designer, where you can configure the search settings or customize the style sheet, query, and HTML, if necessary.
- In the Designer, change the optional search settings as needed. The list below describes each option:
  - Show No Results on Empty Search: Determines whether documents are listed in the search results before a search is run. When enabled, the Document Search lens remains blank until a search is run.
  - Allow Multi Select: Determines whether a user can select multiple documents at a time in the results. When enabled, multiple documents can be selected by holding the Shift key and clicking documents in the results.
  - Synonym Expansion Dictionary: Determines whether to display an option for including synonyms in text searches. When enabled, the lens displays an Include Synonyms checkbox next to the Search field.
  - Knowledge Base Dataset: Enables you to include a knowledge base in the search if one exists. Click the field to select an available knowledge base.
  - Ontology: Enables you to select a data model to use for the search.
  - Predicates: Enables you to select specific predicates from the model.
- Click Save. Anzo adds the lens to the dashboard. Depending on the search settings, the lens displays the list of documents.
- To run a search, type the text to find in the Search field and press Enter. For the syntax that you can use, see Supported Search Syntax below. Anzo finds documents that include the search value and displays the documents, snippets of text that show the context in which the matches were found, and the Elasticsearch relevance score for each match. For information about how the relevance score is calculated, see What Is Relevance? in the Elasticsearch documentation.
Clicking Show More expands the result to display additional matches.
- To refine the search, alter the text in the Search field and press Enter again. You can also click highlighted terms in the search results to open a dialog box that shows the full annotated document where the match was found.
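Because the lens relies on Elasticsearch, a search of this kind resembles an Elasticsearch query_string request that also asks for highlighted snippets. The sketch below builds such a request body as plain Python data. It is an illustration only: the function name build_search_body and the field name "text" are hypothetical, and nothing here describes how Anzo itself issues or shapes its requests.

```python
import json


def build_search_body(search_text: str, field: str = "text") -> dict:
    """Build an Elasticsearch-style request body for a keyword search.

    The field name is a hypothetical placeholder; the index that Anzo
    creates may name its document text field differently.
    """
    return {
        # query_string accepts the keyword syntax (wildcards, +, -, AND, OR, ...)
        "query": {"query_string": {"query": search_text, "default_field": field}},
        # Ask Elasticsearch to return snippets around each match
        "highlight": {"fields": {field: {}}},
    }


body = build_search_body("flight +New York")
print(json.dumps(body, indent=2))
```

In an Elasticsearch response, each hit carries a relevance score and, when highlighting is requested, the matching snippets, which corresponds to the score and context snippets that the lens displays.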
Supported Search Syntax
This section describes the keyword search syntax that Anzo supports.
Wildcard Characters: ? and *
- ?: Use a question mark (?) to represent a single wildcard character. For example, in the search co?l, the resulting documents will include terms like "cool" or "coal."
- *: Use an asterisk (*) to represent multiple wildcard characters. For example, in the search col*, the resulting documents will include terms like "collect" or "color."
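To make the wildcard behavior concrete, the following sketch translates a wildcard pattern into a Python regular expression and tests it against single terms. It only mimics the semantics described above and is not how the search engine evaluates wildcards. The function names are hypothetical, and treating * as "zero or more characters" and matching case-insensitively are assumptions.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate ? and * wildcards into an equivalent regular expression."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")       # exactly one character
        elif ch == "*":
            parts.append(".*")      # assumed: zero or more characters
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts), re.IGNORECASE)


def term_matches(pattern: str, term: str) -> bool:
    """Return True when the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None


for pattern, term in [("co?l", "cool"), ("co?l", "coal"), ("col*", "collect"), ("col*", "cool")]:
    print(pattern, term, term_matches(pattern, term))
```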
Boolean Operators: +, -, OR, AND, NOT
- +: Use a plus (+) character to indicate mandatory matches. For example, in the search flight +New York, the resulting documents can include "flight" as an optional match and must include "New York."
- -: Use a minus (-) character to indicate a term that must not match. For example, in the search flight +New York -Los Angeles, the resulting documents can include "flight" as an optional match, must include "New York," and must not include "Los Angeles."
- OR: In the search New York OR Los Angeles, the resulting documents will include a match for either "New York" or "Los Angeles."
- AND: In the search New York AND Los Angeles, the resulting documents must include matches for both "New York" and "Los Angeles."
- NOT: In the search New York NOT Los Angeles, the resulting documents must include "New York" and cannot contain "Los Angeles."
- Grouping operators: Use parentheses to group terms and control how operators are combined. For example, in the search (flight AND New York) OR Los Angeles, the resulting documents will include either both "flight" and "New York," or "Los Angeles."
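The effect of the + and - operators can be sketched as a small filter over document text. The example below mirrors the search flight +New York -Los Angeles: required phrases must appear, excluded phrases must not, and optional terms do not decide whether a document matches. This is a simplified illustration, not the search engine's implementation. The function name evaluate is hypothetical, and case-insensitive substring matching and the hit count used as a stand-in for relevance are assumptions.

```python
def contains(text: str, phrase: str) -> bool:
    """Case-insensitive substring check (a simplification of real term matching)."""
    return phrase.lower() in text.lower()


def evaluate(text, required=(), excluded=(), optional=()):
    """Return (matches, optional_hits) for one document.

    required  -> phrases marked with +
    excluded  -> phrases marked with -
    optional  -> unmarked terms; counted here only as a crude relevance hint
    """
    if any(contains(text, phrase) for phrase in excluded):
        return False, 0
    if not all(contains(text, phrase) for phrase in required):
        return False, 0
    hits = sum(1 for phrase in optional if contains(text, phrase))
    return True, hits


documents = [
    "Flight from Boston to New York",
    "Train to New York",
    "Flight from New York to Los Angeles",
    "Flight to Chicago",
]

# Equivalent of the search: flight +New York -Los Angeles
for doc in documents:
    print(doc, evaluate(doc, required=["New York"], excluded=["Los Angeles"], optional=["flight"]))
```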
Fuzzy Matches: ~n
To search for a fuzzy match, use a tilde (~) character followed by a number to represent the number of fuzzy or incorrect characters. For example, in the search Flgth~3, the resulting documents could include the term "Flight."
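The idea of counting "fuzzy or incorrect characters" can be illustrated with an edit distance, which counts the single-character insertions, deletions, and substitutions needed to turn one term into another. The sketch below uses plain Levenshtein distance as an assumption for illustration; the search engine's own distance measure may count some changes, such as transposed characters, differently. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn a into b (classic dynamic-programming formulation)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def within_fuzziness(term: str, candidate: str, max_edits: int) -> bool:
    """True when candidate is at most max_edits edits away from term."""
    return levenshtein(term.lower(), candidate.lower()) <= max_edits


print(levenshtein("Flgth", "Flight"))
print(within_fuzziness("Flgth", "Flight", 3))
```

Under this measure, "Flgth" and "Flight" are three edits apart, so the pair falls within a tolerance of 3 but not within a tolerance of 2.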
Regular Expressions
To search with a regular expression, enclose the pattern in forward slashes (/). For example, the following search expression matches email addresses: /([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})/.
For more information about the regular expression syntax that Elasticsearch supports, see Regular expression syntax in the Elasticsearch documentation.
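Before using a pattern in a search, it can help to try it against sample values. The sketch below tests the email pattern from this section with Python's re module. Python's regular expression dialect is not the same as the one Elasticsearch supports, so a pattern that behaves one way here may behave differently in a search; requiring the whole value to match (fullmatch) is also an assumption. The sample addresses are made up.

```python
import re

# The email pattern from this section, without the surrounding slashes
EMAIL_PATTERN = re.compile(r"([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})")


def is_email(value: str) -> bool:
    """True when the entire value matches the email pattern."""
    return EMAIL_PATTERN.fullmatch(value) is not None


for sample in ["jane.doe@example.com", "not-an-email", "user@host"]:
    print(sample, is_email(sample))
```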