Indexing and Search

Indexing and searching in ACE use Apache Solr. ACE tracks content changes, updates the index accordingly, and provides a search service with some extra features on top of Solr, but Solr handles most of the work related to search and indexing.

Searching

ACE includes a search service that acts as a proxy for Solr and adds permission filtering, filtering by view and/or variant, and optional inlining of content in the response. All of these additional features depend on the system fields, described below in the section on the indexer, being correctly populated.

Aside from the additions documented here, the API of the search service is the Solr search API; see https://lucene.apache.org/solr/guide/6_6/.

The proxy has some limitations:

It only supports JSON responses.
It may not work with custom request handlers that customize the response format; it is built to handle the output from the standard solr.SearchHandler.
It only supports searching using query parameters, not the JSON request API (but it does support POSTing the query parameters).

Endpoint

By default, the search service is deployed at http://ace-search-service:8086/. Searches are performed against http://ace-search-service:8086/<collection>/select, where <collection> is the name of the collection to query. In a production installation the service is likely to be behind a proxy, so the URL should not be hard-coded.
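As a minimal sketch of keeping the URL configurable, the helper below builds the select URL from a base URL read from the environment. The variable name ACE_SEARCH_URL and the collection name are hypothetical, and passing wt=json is an assumption made here because the proxy only supports JSON responses.

```python
import os
from urllib.parse import quote, urlencode

# Documented default; a real deployment would normally override it.
DEFAULT_BASE_URL = "http://ace-search-service:8086/"


def select_url(collection, params, base_url=None):
    """Build the <base>/<collection>/select URL for a Solr query.

    ACE_SEARCH_URL is a hypothetical environment variable name used
    only for this illustration.
    """
    base = base_url or os.environ.get("ACE_SEARCH_URL", DEFAULT_BASE_URL)
    query = dict(params)
    # Assumption: ask Solr for JSON explicitly, since the proxy only
    # supports JSON responses.
    query.setdefault("wt", "json")
    path = quote(collection, safe="")
    return f"{base.rstrip('/')}/{path}/select?{urlencode(query)}"


# "public" is a hypothetical collection name.
print(select_url("public", {"q": "*:*"}, base_url=DEFAULT_BASE_URL))
```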

Permissions

If the request to the search service includes an X-Auth-Token header, the search will be filtered by the user's permissions so only content the user is allowed to read is shown. If no X-Auth-Token header is included, only content on views marked as public in the search service configuration is included.

Views

The search service takes an optional query parameter named view. If the parameter is supplied, only content on that view is included in the results. The search service takes time state into account, so content must be both on the view and not disabled by time state to show up in search results.

Inline Content Data

The search service takes an optional query parameter named inlineData. If the parameter is supplied, the content data is included in the response, inlined in the corresponding Solr document as a field named _content.

Variant

The search service takes an optional query parameter named variant. If the parameter is supplied, only content on that variant is included.
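The optional parameters and the X-Auth-Token header described above can be combined in one request. The sketch below only builds the request object and does not send it. The view name, the query, and the value "true" for inlineData are assumptions for illustration; the source only states that the parameter must be supplied.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(select_url, query, token=None, view=None,
                         variant=None, inline_data=False):
    """Build a POST request carrying the query as form parameters."""
    params = {"q": query, "wt": "json"}
    if view:
        params["view"] = view
    if variant:
        params["variant"] = variant
    if inline_data:
        # Assumption: the exact value expected for inlineData is not
        # documented here, so "true" is used as a placeholder.
        params["inlineData"] = "true"

    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    if token:
        # Without this header only content on public views is returned.
        headers["X-Auth-Token"] = token

    body = urlencode(params).encode("utf-8")
    return Request(select_url, data=body, headers=headers, method="POST")


# Hypothetical URL, query, token and view name.
request = build_search_request(
    "http://ace-search-service:8086/public/select",
    "ace_type_s:article",
    token="example-token",
    view="live",
    inline_data=True,
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

The request can then be sent with urllib.request.urlopen(request) against a running search service.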

Indexing

When a new version of content is created, it is processed by the indexer. The content is indexed if it is available in the variant aceIndexing, in a form that is compatible with the Solr schema used for the index. The current version on each configured view will be fetched in the index variant.

Data format

The indexer expects an aspect of type aceSolrDocument, containing a Solr JSON document. The default composer for the variant aceIndexing tries to convert all mapped aspects to this format if they aren't already, and merges them into aceSolrDocument in a Solr-aware fashion. Unmapped aspects are ignored. This should mean that most use cases only require an aspect mapper that converts the aspect to a Solr JSON document. The default composer is also responsible for adding the system fields ace_modificationTime_dt (the time the version was created) and ace_type_s (the content type of the content, regardless of mapping).

Required fields

These fields are required for the Solr indexer to work at all.

Field	Description	Type
id	The unique key, contains the content version ID	string (solr.StrField), stored
_root_	Internal Solr field used for child documents, must be present even if child documents are not used	string (solr.StrField)
_version_	Internal Solr field	long (solr.TrieLongField)

Required Dynamic Fields

A number of dynamic fields are also expected in the schema in order for the standard indexing features to work. A completely custom index may not need these fields.

Field	Type
*_s	string (solr.StrField)
*_ss	string (solr.StrField), multivalued
*_dt	date (solr.TrieDateField)
*_drange	date_range (solr.DateRangeField)
*_l	long (solr.TrieLongField)
*_i	int (solr.TrieIntField)
*_b	boolean (solr.BoolField)
*_bs	booleans (solr.BoolField), multivalued
*_lc	lowercase_ws (solr.TextField), multivalued, see below

The *_lc field type needs an analyzer chain to process the text correctly, shown in the XML snippet below. It is typically used for autocompletion.

...
  <fieldType name="lowercase_ws" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="/" replacement=" "/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
...
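To make the effect of this analyzer chain concrete, the following sketch approximates it in plain Python: slashes become spaces, the text is split on whitespace, and each token is lowercased. It is an illustration of the three configured steps, not Solr's actual implementation.

```python
def lowercase_ws_tokens(value):
    """Approximate the lowercase_ws analyzer chain for one field value."""
    # charFilter: PatternReplaceCharFilterFactory, "/" -> " "
    without_slashes = value.replace("/", " ")
    # tokenizer: WhitespaceTokenizerFactory
    tokens = without_slashes.split()
    # filter: LowerCaseFilterFactory
    return [token.lower() for token in tokens]


print(lowercase_ws_tokens("USA/New York"))
```

A path such as "USA/New York" is therefore searchable by any of its lowercased words, which is what makes the field usable for autocompletion.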

System Fields

These fields are populated by the standard index composer and used by the search service to automatically filter the search, so it only includes content that should be visible to the searching user.

Field	Description
ace_type_s	The name of the content type
ace_modificationTime_dt	The time this version was created
ace_views_ss	The views this version of the content is on
ace_security_context_ss	The security contexts this content belongs to
ace_timestate__drange	The date range during which the view has this version

See also below about the aceCategorization aspect and the fields it uses.

Supported Solr features

The indexer essentially takes the JSON in the aceSolrDocument aspect and sends it to Solr as the doc object of an add command. The only really hard restrictions are that there must be an "id" field and a "_root_" field, and that there is no way to control anything outside the doc object (so e.g. no document boosting, no atomic updates, no deletes).
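The wrapping described above can be sketched as follows, using the add/doc structure of Solr's JSON update format. The function name, the sample field values, and the choice of setting _root_ (Solr's internal root field) to the same value as id are assumptions for illustration; the indexer's real code is not shown in this document.

```python
import json

# The two fields the indexer requires in every document.
REQUIRED_FIELDS = ("id", "_root_")


def to_add_command(solr_document):
    """Wrap a Solr JSON document as the doc object of an add command."""
    missing = [name for name in REQUIRED_FIELDS if name not in solr_document]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    # Nothing outside "doc" can be controlled from the aspect.
    return {"add": {"doc": solr_document}}


# Hypothetical document; title_s relies on the *_s dynamic field.
document = {
    "id": "example-version-id",
    "_root_": "example-version-id",  # assumption: same value as id
    "ace_type_s": "article",
    "title_s": "Hello",
}
print(json.dumps(to_add_command(document), indent=2))
```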

Most features of the Solr JSON document format related to indexing are supported by the aspect merging in the standard composer:

any field represented as a string, number or boolean in Solr
multi-valued fields
child documents
field boosting

If the standard composer fails to merge something correctly, it should be possible to use a custom composer to generate correct JSON.

Indexed Aspects

Most aspects are left up to the project to index, or not, but the aceCategorization aspect has a default mapper, named aceIndexing.categorization.

aceCategorization

The categorization aspect is indexed by the aspect mapper aceIndexing.categorization. This is mapped to aceCategorization by default in the aceIndexing variant. Categorization is indexed into dynamic fields named for the ID of each dimension. The name of each entity in a dimension, together with the full path to the entity and the prefixes of that path, is added to ace_tag_dimensionId_ss. The same values are also added to the field ace_tag_autocomplete_dimensionId_lc, where they are reprocessed for use in autocompletion. For entities that have corresponding content, paths based on IDs (i.e. their aliases in the taxonomy namespace) rather than names are added to ace_tag_id_dimensionId_ss.

Field	Description
ace_tag_*_ss	The entity names
ace_tag_autocomplete_*_lc	The entity names, processed for autocompletion
ace_tag_id_*_ss	The entity IDs, for entities that have IDs

Example

Suppose a piece of content is tagged with USA (_taxonomy/location.usa)/New York (_taxonomy/location.new-york)/Queens (_taxonomy/location.queens) in the Location dimension. What is indexed looks like this:

Field	Values
ace_tag_Location_ss	["USA","New York", "Queens", "USA/New York", "USA/New York/Queens"]
ace_tag_autocomplete_Location_lc	["USA","New York", "Queens", "USA/New York", "USA/New York/Queens"]
ace_tag_id_Location_ss	["location.usa", "location.new-york", "location.queens", "location.usa/location.new-york", "location.usa/location.new-york/location.queens"]
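The values in the table can be reproduced with a small helper that emits each entity on its own followed by every multi-segment prefix of the path. This is a sketch inferred from the example above; the ordering of values and the function name are assumptions, not the mapper's actual code.

```python
def path_values(segments):
    """Return each segment plus every path prefix longer than one segment."""
    singles = list(segments)
    prefixes = ["/".join(segments[:length])
                for length in range(2, len(segments) + 1)]
    return singles + prefixes


names = ["USA", "New York", "Queens"]
ids = ["location.usa", "location.new-york", "location.queens"]

print(path_values(names))  # values for ace_tag_Location_ss
print(path_values(ids))    # values for ace_tag_id_Location_ss
```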

Block joins

Because block joins in Solr are a bit of a hack, dealing with blocks when indexing is relatively complicated and requires a custom composer. When you have a block, all documents in the block need to be updated whenever any of them is updated, including when any child or parent is removed. To make this work with ACE indexing, you must make sure that the document for any content in a block contains the whole block, i.e. even when a child is updated, the document for the parent and all its children must be returned. You must also ensure that the parent content (or a sibling) is updated before a child content is deleted, so that the block is reindexed without the deleted child. Once a content is deleted it can no longer be indexed, so the block cannot be removed as a result of the delete itself.

The actual support for block joins in the indexer is limited to deleting all child documents of a content when the content is deleted (using the internal content ID of the content since we can no longer get it from the indexing variant) or updated (using the actual ID of the root document). This probably means multi-level hierarchies won't work, so while nothing prevents you from creating them they are not supported.
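As a rough illustration of "the document contains the whole block", the sketch below assembles a single-level block with the _childDocuments_ key that Solr 6.x's JSON update format uses for nested documents. Whether a custom ACE composer should emit exactly this shape is an assumption; the IDs and field names are hypothetical.

```python
def block_document(parent, children):
    """Return the parent document with all its children nested inside.

    The same full block should be produced no matter which member of
    the block triggered the indexing.
    """
    document = dict(parent)
    # Single level only: multi-level hierarchies are not supported.
    document["_childDocuments_"] = [dict(child) for child in children]
    return document


# Hypothetical parent and children.
parent = {"id": "gallery-1", "_root_": "gallery-1", "ace_type_s": "gallery"}
children = [
    {"id": "image-1", "ace_type_s": "image"},
    {"id": "image-2", "ace_type_s": "image"},
]
print(block_document(parent, children))
```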

Re-indexing

There will be situations in which you will want to rebuild search indexes, such as when you've changed the logic of the indexing configuration or if the indexes have become corrupted for some reason.

Re-indexing of the search indexes in ACE comes in the form of full system re-indexing, which means that all collections are rebuilt from scratch. Rebuilding search indexes means that all existing content in the system is submitted for re-indexing.

It is not possible at this time to re-index a specific index.

Re-indexing in ACE is performed by a secondary indexing container configured to use a different Kafka topic than the normal indexer. By default, the re-indexing topic is aceReindexEvents.

Triggering a re-index process

You trigger a search index re-index process by calling the re-indexing REST endpoint located at http://ace-content-service:8081/index/reindex:

$ curl --request POST \
       --location \
       --include \
       --header "X-Auth-Token: $TOKEN" \
       'http://ace-content-service:8081/index/reindex'

Response:

HTTP/1.1 200 OK
Date: Fri, 07 Sep 2018 07:57:52 GMT
Ace-Api-Version: 1.2.1
Content-Type: application/json
Content-Length: 51

Re-indexing session 'cf996304-a8a2-4b89-9a7b-f47eefd4e7fd' started. Kafka re-indexing queue populated with 239 records.

The re-index process will by default re-index all content of every content type in the system.

You can, however, limit the re-index process to a specific content type by appending the query parameter contentType to the call to the re-indexing REST endpoint:

$ curl --request POST \
       --location \
       --include \
       --header "X-Auth-Token: $TOKEN" \
       'http://ace-content-service:8081/index/reindex?contentType=article'

Response:

HTTP/1.1 200 OK
Date: Tue, 18 Sep 2018 11:45:50 GMT
Ace-Api-Version: 1.4.2
Content-Type: application/json
Content-Length: 143

Re-indexing session '1809a833-7a80-439b-9506-d5bd8f8c679e' started. Kafka re-indexing queue populated with 5 records of content type 'article'.
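For scripted use, the same call can be built in Python and the response message parsed for the session ID and record count. This is a sketch: the function names are hypothetical, and the regular expression is derived only from the two sample responses above, so it may need adjusting for other ACE versions.

```python
import re
from urllib.parse import urlencode
from urllib.request import Request

STARTED = re.compile(
    r"Re-indexing session '(?P<session>[^']+)' started\. "
    r"Kafka re-indexing queue populated with (?P<records>\d+) records"
    r"(?: of content type '(?P<content_type>[^']+)')?\."
)


def build_reindex_request(base_url, token, content_type=None):
    """Build (but do not send) the POST request that starts a re-index."""
    url = base_url.rstrip("/") + "/index/reindex"
    if content_type:
        url += "?" + urlencode({"contentType": content_type})
    return Request(url, headers={"X-Auth-Token": token}, method="POST")


def parse_started(body):
    """Extract session ID, record count and content type, or None."""
    match = STARTED.search(body)
    if match is None:
        return None
    return {
        "session": match.group("session"),
        "records": int(match.group("records")),
        "content_type": match.group("content_type"),
    }


request = build_reindex_request(
    "http://ace-content-service:8081", "example-token", "article")
print(request.get_method(), request.full_url)
```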

Re-indexing permissions

In order to perform a system re-index, the calling user must be:

Logged in (authenticated)
In the sysadmin role in the __global__ context

Re-indexing progress reporting

The (re-)indexer application logs the following types of messages during a re-indexing process:

Re-indexing process start

INFO [2018-09-28 06:18:40,324] com.atex.ace.indexing.indexer.EventProcessor: Re-indexing session '0dda2fe5-6cfb-475d-acad-70d2cb40eb21' started with a total of 237 content queued.

Re-indexing process completion

INFO [2018-09-28 06:18:58,828] com.atex.ace.indexing.indexer.ProgressTracker: Re-indexing session '0dda2fe5-6cfb-475d-acad-70d2cb40eb21' complete. Duration: 18 seconds (13 content / second).

Re-indexing progress messages

INFO [2018-09-28 06:18:43,104] com.atex.ace.indexing.indexer.ProgressTracker: Re-indexing session '0dda2fe5-6cfb-475d-acad-70d2cb40eb21' progress: 14% (duration: 2 seconds).

Progress messages are reported until the re-indexing process is complete, at which point the completion message is logged.
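If the indexer log is collected by a monitoring script, the three message types can be recognized with regular expressions. The patterns below are derived from the sample log lines above and are an assumption about the format, which may differ between versions.

```python
import re

SESSION = r"Re-indexing session '(?P<session>[^']+)'"

PATTERNS = {
    "started": re.compile(
        SESSION + r" started with a total of (?P<total>\d+) content queued"),
    "progress": re.compile(
        SESSION + r" progress: (?P<percent>\d+)% "
        r"\(duration: (?P<seconds>\d+) seconds\)"),
    "complete": re.compile(
        SESSION + r" complete\. Duration: (?P<seconds>\d+) seconds"),
}


def parse_log_line(line):
    """Return a dict describing a re-indexing log line, or None."""
    for kind, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            event = {"kind": kind, "session": match.group("session")}
            for key, value in match.groupdict().items():
                if key != "session":
                    event[key] = int(value)  # remaining groups are numeric
            return event
    return None


line = ("INFO [2018-09-28 06:18:43,104] "
        "com.atex.ace.indexing.indexer.ProgressTracker: Re-indexing session "
        "'0dda2fe5-6cfb-475d-acad-70d2cb40eb21' progress: 14% "
        "(duration: 2 seconds).")
print(parse_log_line(line))
```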

Un-indexing of content types

To remove an entire content type from the search index, add the ACE system composer ace.excludeContent to the list of composers for the type (please see Callback API for more information), then trigger a re-index process (see above).