Indexing and Search
Indexing and searching in ACE use Apache Solr. ACE tracks when content changes and updates the index, and provides a search service with some extra features on top of Solr, but Solr handles most of the work related to search and indexing.
Searching
ACE includes a search service that acts as a proxy for Solr and adds permission filtering, filtering by view and/or variant, and the option to inline content in the response. All these additional features depend on the system fields, described below in the section on the indexer, being correctly populated.
Aside from the additions documented here, the API to the search service is just the Solr search API. See https://lucene.apache.org/solr/guide/6_6/.
The proxy has some limitations:
- it only supports JSON responses
- it may not work with custom request handlers that customize the response format, since it is built to handle the output from the standard solr.SearchHandler
- it only supports searching using query parameters, not the JSON request API (but it does support POSTing the query parameters)
Endpoint
By default, the search service is deployed as http://ace-search-service:8086/. Searches are
performed against http://ace-search-service:8086/<collection>/select, where <collection> is
the name of the collection to query against. In a production installation it is likely to be behind
a proxy, so the URL should not be hard-coded.
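As a minimal sketch of avoiding a hard-coded URL, the select endpoint can be built from a configurable base URL. The environment variable name ACE_SEARCH_URL and the collection name "public" are assumptions made for this illustration; they are not defined by ACE.

```python
import os
from urllib.parse import quote, urlencode

# Default deployment URL from the text above.
DEFAULT_BASE = "http://ace-search-service:8086/"


def select_url(collection, params, base_url=None):
    """Return the select URL for a collection with URL-encoded query parameters."""
    # ACE_SEARCH_URL is a hypothetical variable name used only in this sketch.
    base = base_url or os.environ.get("ACE_SEARCH_URL", DEFAULT_BASE)
    return f"{base.rstrip('/')}/{quote(collection, safe='')}/select?{urlencode(params)}"


print(select_url("public", {"q": "*:*", "wt": "json"}, base_url=DEFAULT_BASE))
```

Passing base_url explicitly, or setting the environment variable, lets the same code run unchanged behind a proxy.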
Permissions
If the request to the search service includes an X-Auth-Token header, the search will be filtered by the user's permissions so only content the user is allowed to read is shown. If no X-Auth-Token header is included, only content on views marked as public in the search service configuration is included.
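The two cases above (with and without a token) can be sketched with the Python standard library. The token value and the collection name "public" are placeholders, and the request is only built here, not sent, since the host name resolves only inside an ACE deployment.

```python
from urllib.request import Request


def search_request(url, token=None):
    """Build a search request; the X-Auth-Token header is added only when a token is given."""
    headers = {"Accept": "application/json"}
    if token:
        headers["X-Auth-Token"] = token
    return Request(url, headers=headers)


url = "http://ace-search-service:8086/public/select?q=*:*"
authenticated = search_request(url, token="example-token")  # filtered by the user's permissions
anonymous = search_request(url)                             # only public views are searched

# urllib stores header names in capitalized form, hence "X-auth-token" when reading back.
print(authenticated.get_header("X-auth-token"))
print(anonymous.has_header("X-auth-token"))
```

Either request can then be passed to urllib.request.urlopen from a host that can reach the search service.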
Views
The search service takes an optional query parameter named view. If the parameter is supplied, only content on that view is included in the results. The search service takes time state into account, so content must be both on the view and not disabled by time state to show up in search results.
Content state
Content in the content state hidden is automatically filtered out of any search against an index based on
public content. Please see Content state for more information.
Inline Content Data
The search service takes an optional query parameter named inlineData. If the parameter is supplied, the content is included in the response, inlined in the corresponding Solr document as a field named _content.
Variant
The search service takes an optional query parameter named variant. If the parameter is supplied, only content on that variant is included.
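The optional parameters described above (view, inlineData and variant) are ordinary query parameters sent alongside the Solr query. The sketch below combines them into a query string. The value "true" for inlineData is an assumption, since the text only says the parameter is supplied, and the view and variant names in the usage line are hypothetical.

```python
from urllib.parse import urlencode


def search_params(query, view=None, variant=None, inline_data=False):
    """Combine a Solr query with the optional ACE search service parameters."""
    params = [("q", query), ("wt", "json")]
    if view:
        params.append(("view", view))
    if variant:
        params.append(("variant", variant))
    if inline_data:
        # Assumption: "true" is a guessed value; check the service for the accepted values.
        params.append(("inlineData", "true"))
    return urlencode(params)


# "web" and "teaser" are made-up view and variant names.
print(search_params("ace_type_s:article", view="web", variant="teaser", inline_data=True))
```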
Indexing
When a new version of content is created, it is processed by the
indexer. The content is indexed if it is available in the variant
aceIndexing, in a form that is compatible with the Solr schema
used for the index. The current version on each configured view
will be fetched in the index variant.
Data format
The indexer expects an aspect of type aceSolrDocument, containing
a Solr JSON document. The default composer for the variant aceIndexing
tries to convert all mapped aspects to this format if they aren't already,
and merges them into aceSolrDocument in a Solr-aware fashion. Unmapped
aspects are ignored. This should mean that most use cases only require
an aspect mapper that converts the aspect to a Solr JSON document. The
default composer is also responsible for adding the system fields
ace_modificationTime_dt (the time the version was created) and ace_type_s (the
content type of the content, regardless of mapping).
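To illustrate the kind of conversion an aspect mapper performs, the sketch below turns a hypothetical article aspect into Solr JSON document fields that use the dynamic field suffixes from the schema. It does not use the real ACE mapper API; the aspect layout and all field names are invented for the example.

```python
from datetime import datetime, timezone


def map_article_aspect(aspect):
    """Map a hypothetical article aspect (a plain dict) to Solr JSON document fields."""
    # Expects a timezone-aware datetime; Solr dates are written in UTC.
    published = aspect["published"].astimezone(timezone.utc)
    return {
        "title_s": aspect["title"],                                # *_s: single string
        "author_ss": list(aspect.get("authors", [])),              # *_ss: multivalued string
        "published_dt": published.strftime("%Y-%m-%dT%H:%M:%SZ"),  # *_dt: date
        "premium_b": bool(aspect.get("premium", False)),           # *_b: boolean
    }


example = {
    "title": "Hello",
    "authors": ["A. Writer"],
    "published": datetime(2018, 9, 7, 7, 57, 52, tzinfo=timezone.utc),
}
print(map_article_aspect(example))
```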
Required fields
These fields are required for the Solr indexer to work at all.
| Field | Description | Type |
|---|---|---|
| id | The unique key, contains the content version ID | string (solr.StrField), stored |
| _root_ | Internal Solr field used for child documents, must be present even if child documents are not used | string (solr.StrField) |
| _version_ | Internal Solr field | long (solr.TrieLongField) |
Required Dynamic Fields
A number of dynamic fields are also expected in the schema in order for the standard indexing features to work. A completely custom index may not need these fields.
| Field | Type |
|---|---|
| *_s | string (solr.StrField) |
| *_ss | string (solr.StrField), multivalued |
| *_dt | date (solr.TrieDateField) |
| *_drange | date_range (solr.DateRangeField) |
| *_l | long (solr.TrieLongField) |
| *_i | int (solr.TrieIntField) |
| *_b | boolean (solr.BoolField) |
| *_bs | booleans (solr.BoolField), multivalued |
| *_lc | lowercase_ws (solr.TextField), multivalued, see below |
The *_lc field needs some analyzers to process the text correctly, shown in the XML snippet below. This is typically used for autocompletion.
...
<fieldType name="lowercase_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="/" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
...
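To see what this analysis chain does to a value, the following sketch approximates the three steps in plain Python. It is an illustration only: Solr's tokenizer and lowercase filter may treat unusual whitespace or non-ASCII characters differently from Python's str.split and str.lower.

```python
def lowercase_ws_tokens(value):
    """Approximate the lowercase_ws analysis chain for a single field value."""
    replaced = value.replace("/", " ")             # char filter: "/" becomes a space
    tokens = replaced.split()                      # tokenizer: split on whitespace
    return [token.lower() for token in tokens]     # filter: lowercase each token


print(lowercase_ws_tokens("USA/New York/Queens"))
```

Because path separators become spaces, each word of a path can be matched on its own, which is what makes the field useful for autocompletion.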
System Fields
These fields are populated by the standard index composer and used by the search service to automatically filter the search, so it only includes content that should be visible to the searching user.
| Field | Description |
|---|---|
| ace_type_s | The name of the content type |
| ace_modificationTime_dt | The time this version was created |
| ace_views_ss | The views this version of the content is on |
| ace_security_context_ss | The security contexts this content belongs to |
| ace_timestate_ | The date range during which the view has this version |
| ace_contentstate_hidden_b | The hidden content state of this content. See Content state. |
See also below about the aceCategorization aspect and the fields it uses.
Supported Solr features
The indexer essentially takes the JSON in the aceSolrDocument aspect and sends it to Solr as the doc object of an add command. The only really hard restrictions are that there must be an "id" field and a "_root_" field, and that there is no way to control anything outside the doc object (so e.g. no document boosting, no atomic updates, no deletes).
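In Solr's JSON update format, an add command wraps the document as {"add": {"doc": ...}}. The sketch below shows that wrapping for a document taken from the aceSolrDocument aspect. It is an illustration of the message shape, not the indexer's actual code, and the sample field values are made up.

```python
import json


def add_command(doc):
    """Wrap a Solr JSON document as the doc object of an add command."""
    if "id" not in doc:
        raise ValueError("the document must contain an id field")
    # Only the doc object is supplied by the content; nothing outside it can be set.
    return json.dumps({"add": {"doc": doc}})


print(add_command({"id": "example-version-id", "title_s": "Hello"}))
```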
Most features of the Solr JSON document format related to indexing are supported by the aspect merging in the standard composer:
- any field represented as a string, number or boolean in Solr
- multi-valued fields
- child documents
- field boosting
If the standard composer fails to merge something correctly, it should be possible to use a custom composer to generate correct JSON.
Indexed Aspects
Most aspects are left up to the project to index, or not, but the
aceCategorization aspect has a default mapper, named
aceIndexing.categorization.
aceCategorization
The categorization aspect is indexed by the aspect mapper
aceIndexing.categorization. This is mapped to aceCategorization
by default in the aceIndexing variant. Categorization is indexed into dynamic fields named
for the ID of each dimension. The name of each entity in a dimension, the full path to each
entity, and the prefixes of those paths are added to
ace_tag_dimensionId_ss. The same values are also added to the field
ace_tag_autocomplete_dimensionId_lc, where they are reprocessed for use
in auto completion. For entities that have corresponding content,
paths based on IDs (i.e. their aliases in the taxonomy namespace) rather than names are added to
ace_tag_id_dimensionId_ss.
| Field | Description |
|---|---|
| ace_tag_*_ss | The entity names |
| ace_tag_autocomplete_*_lc | The entity names, processed for autocompletion |
| ace_tag_id_*_ss | The entity IDs, for entities that have IDs |
Example
Suppose a content is tagged with USA (_taxonomy/location.usa)/New York (_taxonomy/location.new-york)/Queens (_taxonomy/location.queens) in the Location dimension. What is indexed looks like this:
| Field | Values |
|---|---|
| ace_tag_Location_ss | ["USA","New York", "Queens", "USA/New York", "USA/New York/Queens"] |
| ace_tag_autocomplete_Location_lc | ["USA","New York", "Queens", "USA/New York", "USA/New York/Queens"] |
| ace_tag_id_Location_ss | ["location.usa", "location.new-york", "location.queens", "location.usa/location.new-york", "location.usa/location.new-york/location.queens"] |
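The values in the table can be reproduced with a small function: the individual entity names come first, followed by every path prefix of two or more segments. This is a sketch of the pattern shown in the example, not the mapper's implementation, and the ordering simply follows the table.

```python
def tag_values(path):
    """Return the entity names followed by the multi-segment prefixes of the path."""
    names = list(path)
    prefixes = ["/".join(path[:end]) for end in range(2, len(path) + 1)]
    return names + prefixes


print(tag_values(["USA", "New York", "Queens"]))
print(tag_values(["location.usa", "location.new-york", "location.queens"]))
```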
Block joins
Because block joins in Solr are a bit of a hack, dealing with blocks when indexing is relatively complicated and will require a custom composer. When you have a block, all documents in the block need to be updated whenever any of them is updated, including when any child or parent is removed. To make this work with ACE indexing requires you to make sure the document for any contents in a block contains the whole block, i.e. even when a child is updated the document for the parent and all its children must be returned. You must also ensure that before a child content is deleted, the parent content (or a sibling) has been updated, so that the block is reindexed without the deleted child, because when a content is deleted it is no longer possible to index it, and so we can't remove the block as a result of the delete itself.
The actual support for block joins in the indexer is limited to deleting all child documents of a content when the content is deleted (using the internal content ID of the content since we can no longer get it from the indexing variant) or updated (using the actual ID of the root document). This probably means multi-level hierarchies won't work, so while nothing prevents you from creating them they are not supported.
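The rule that every document in a block must contain the whole block can be sketched as follows. Solr's JSON format nests child documents under the _childDocuments_ key; everything else here (the function name, the dict-based documents, the sample IDs) is hypothetical and stands in for whatever a custom composer would actually do.

```python
def block_document(parent, children):
    """Return the full block: the parent's fields plus all of its current children."""
    doc = dict(parent)
    doc["_childDocuments_"] = [dict(child) for child in children]
    return doc


parent = {"id": "parent-1", "title_s": "Gallery"}
children = [{"id": "child-1"}, {"id": "child-2"}]

# Whichever member of the block changed, the same complete block is produced.
print(block_document(parent, children))
```

Leaving a child out of the list before it is deleted is what causes the block to be reindexed without it.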
Re-indexing
There will be situations in which you will want to rebuild the search indexes. This might for example be because you've changed the logic of the indexing configuration or because the indexes have become corrupted for one reason or another.
Re-indexing of the search indexes in ACE can be performed either fully or partially.
- Full system re-indexing means that all collections are rebuilt from scratch. All existing content in the system - no matter content type - will be submitted for force-indexing.
- Partial content re-indexing means re-indexing of a specific set of content, identified by either content type or by content main aliases.
It is not possible at this time to re-index a specific index.
Re-indexing in ACE is performed by using a secondary indexing container configured to use a different Kafka topic
compared to the normal indexer. By default, the re-indexing topic is aceReindexEvents.
Triggering a full system re-indexing process
You trigger a full system search index re-index process by calling the re-indexing REST endpoint located at
http://ace-content-service:8081/index/reindex/full:
$ curl --request POST \
    --location \
    --include \
    --header "X-Auth-Token: $TOKEN" \
    'http://ace-content-service:8081/index/reindex/full'
Response:
HTTP/1.1 200 OK
Date: Fri, 07 Sep 2018 07:57:52 GMT
Ace-Api-Version: 1.2.1
Content-Type: application/json
Content-Length: 51
Re-indexing session 'cf996304-a8a2-4b89-9a7b-f47eefd4e7fd' started. Kafka re-indexing queue populated with 239 records.
Triggering a re-index process by content type
You can limit the re-index process to a specific content type by calling the re-indexing REST endpoint located
at http://ace-content-service:8081/index/reindex/contentType/<contentType>:
$ curl --request POST \
    --location \
    --include \
    --header "X-Auth-Token: $TOKEN" \
    'http://ace-content-service:8081/index/reindex/contentType/article'
Response:
HTTP/1.1 200 OK
Date: Tue, 18 Sep 2018 11:45:50 GMT
Ace-Api-Version: 1.4.2
Content-Type: application/json
Content-Length: 143
Re-indexing session '1809a833-7a80-439b-9506-d5bd8f8c679e' started. Kafka re-indexing queue populated with 5 records of content type 'article'.
Triggering a re-index process by alias list
You can limit the re-index process to a specific set of content by calling the re-indexing REST endpoint located
at http://ace-content-service:8081/index/reindex in combination with the query parameter aliases:
$ curl --request POST \
    --location \
    --include \
    --header "X-Auth-Token: $TOKEN" \
    'http://ace-content-service:8081/index/reindex?aliases=contentid/MDYxN2M3NzYtNDM0Yy00,contentid/MDc2MzE4OTAtNmRkYy00'
The value of the aliases query parameter is a comma-separated list of content main aliases.
Response:
HTTP/1.1 200 OK
Date: Wed, 17 Apr 2019 08:51:16 GMT
Ace-Api-Version: 1.8.3
Content-Type: application/json
Content-Length: 89
Re-indexing session '2eb1732a-c030-457a-b995-261d07a73cae' started with 2 content queued.
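The three re-indexing endpoints differ only in their URL, so a script that triggers them can share one helper. The sketch below builds the URL for each case; the helper name is made up, and the default host is the one used in the examples above, which may differ in a production installation.

```python
from urllib.parse import quote

BASE = "http://ace-content-service:8081/index/reindex"


def reindex_url(content_type=None, aliases=None, base=BASE):
    """Return the re-indexing URL for a full, per-type or per-alias re-index."""
    if content_type and aliases:
        raise ValueError("choose either a content type or a list of aliases")
    if content_type:
        return f"{base}/contentType/{quote(content_type, safe='')}"
    if aliases:
        # Aliases form a comma-separated list; slashes are left unescaped.
        return f"{base}?aliases={','.join(quote(alias, safe='/') for alias in aliases)}"
    return f"{base}/full"


print(reindex_url())
print(reindex_url(content_type="article"))
print(reindex_url(aliases=["contentid/MDYxN2M3NzYtNDM0Yy00", "contentid/MDc2MzE4OTAtNmRkYy00"]))
```

Each URL is then called with a POST request carrying the X-Auth-Token header, as in the curl examples.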
Re-indexing permissions
In order to perform a system re-index, the calling user must:
- be logged in (authenticated)
- be in the sysadmin role in the __global__ context
Re-indexing progress reporting
The (re-)indexer application will log the following type of messages during a re-indexing process:
Re-indexing process start
INFO [2018-09-28 06:18:40,324] com.atex.ace.indexing.indexer.EventProcessor: Re-indexing session '0dda2fe5-6cfb-475d-acad-70d2cb40eb21' started with a total of 237 content queued.
Re-indexing process completion
INFO [2018-09-28 06:18:58,828] com.atex.ace.indexing.indexer.ProgressTracker: Re-indexing session '0dda2fe5-6cfb-475d-acad-70d2cb40eb21' complete. Duration: 18 seconds (13 content / second).
Re-indexing progress messages
INFO [2018-09-28 06:18:43,104] com.atex.ace.indexing.indexer.ProgressTracker: Re-indexing session '0dda2fe5-6cfb-475d-acad-70d2cb40eb21' progress: 14% (duration: 2 seconds).
The progress messages will keep being reported up until the re-indexing process is complete, at which time the re-indexing process completion message will be reported.
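When monitoring a re-index from the indexer log, the progress lines can be picked out with a regular expression. The pattern below is derived only from the sample messages shown above, so it is an assumption that other ACE versions use the same wording.

```python
import re

# Matches the progress message format shown in the sample log line.
PROGRESS = re.compile(
    r"Re-indexing session '(?P<session>[0-9a-f-]+)' progress: (?P<percent>\d+)%"
)


def parse_progress(line):
    """Return (session_id, percent) for a progress line, or None for any other line."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    return match.group("session"), int(match.group("percent"))


sample = (
    "INFO [2018-09-28 06:18:43,104] com.atex.ace.indexing.indexer.ProgressTracker: "
    "Re-indexing session '0dda2fe5-6cfb-475d-acad-70d2cb40eb21' progress: 14% (duration: 2 seconds)."
)
print(parse_progress(sample))
```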
Un-indexing of content types
To remove an entire content type from the search index, add the ACE system composer ace.excludeContent to the list of composers
for the type (please see Callback API for more information), then trigger a re-index process (see above).