Configuring Indexing Backend

Cordra uses indexers that are based on Apache Lucene, such as Lucene itself, Apache Solr, and Elasticsearch.

By default, Cordra uses Apache Lucene that is configured to use the local file system for storing the indexes. However, Cordra can be configured to use alternate indexing backend systems. It is mandatory to use an alternate indexing backend system when Cordra is deployed as a distributed system.

There are a few indexer technologies that Cordra can use for indexing. Cordra includes index modules, which translate Cordra indexing requirements into what each of the indexer technologies natively offer.

To configure a index module, other than for the default file system based index, add a index section to the Cordra config.json file. For example:

"index" : {
    "module" : "module-name-goes-here",
    "options" : {

    }
}

The following index modules are included within the Cordra distribution.

Index Modules

There are currently four indexing backends supported by Cordra.

Lucene Index

Module Name: lucene

Module Options:

Option name

Description

allowLeadingWildcard

Allow queries to start with a wildcard (*). May have performance implications on large indexes. (Default: false).

If no indexing backend is configured in config.json, the Cordra will use a filesystem-based Apache Lucene indexer. This module is only applicable for a single instance deployment scenario.

System Memory Lucene Index

Module Name: memory

Module Options:

Option name

Description

allowLeadingWildcard

Allow queries to start with a wildcard (*). May have performance implications on large indexes. (Default: false).

This module uses Lucene, but the index gets erased once the Cordra process is stopped. This module is useful for testing and is also only applicable for a single instance deployment scenario.

The index section of the config.json file looks like this:

"index" : {
    "module" : "memory"
}

Elasticsearch Index

Module Name: elasticsearch

Module Options:

Option name

Description

address

Address of Elasticsearch server (Default: localhost)

addressScheme

Protocol for Elasticsearch server (Default: https)

port

Port number for Elasticsearch server (Default: 9200)

baseUri

URI(s) of Elasticsearch server(s). If specified, this will be used instead of the previous address settings. Can be specified as a string or as an array and can be specified as “baseUris”.

indexName

Name of the index to use (Default: cordra)

authorization

If specified, the value of an Authorization: header to include with every request to Elasticsearch.

username, password

If specified and authorization is absent, a Basic Authorization header will be included with every request.

mappings

Mappings to be used when Cordra initializes the index, to augment or override Cordra defaults. If the index already exists, has no effect.

index.*

Index setting for Cordra index.

The Elasticsearch indexer works with both self-hosted instances of Elasticsearch and Amazon’s hosted Elasticsearch service. Cordra currently supports Elasticsearch versions 7 and 8.

By default, Cordra sets index.mapping.total_fields.limit to 10000. You can override this or send additional index configuration to Elasticsearch by including the appropriate index.* setting in your configuration. For example, to set the limit on total fields for the index to 5000, you could use the following configuration:

"index" : {
    "module" : "elasticsearch",
    "options" : {
        "address" : "localhost",
        "port"    : "9200",
        "addressScheme"  : "http",
        "index.mapping.total_fields.limit": "5000"
    }
}

When connecting to Elasticsearch using TLS, additional configuration may be required. See Enabling TLS for details.

Solr

Module Name: solr

Module Options:

Option name

Description

baseUri

URI of Cordra index on Solr indexing server. This should include the core name. (Default: http://localhost:8983/solr/cordra)

zkHosts

Connection string for ZooKeeper cluster used with SolrCloud.

collectionName

Name of the collection to use with SolrCloud (Default: cordra)

minRf

A number for the minimum desired replication factor in a SolrCloud configuration. If the achieved replication is lower Cordra will log a warning. Generally this will be set automatically based on SolrCloud configuration in ZooKeeper; this option can be used to set it lower to prevent warnings when Solr nodes are known to be down.

Cordra can be configured to connect to a standalone Solr server or a Solr Cloud cluster with its configuration stored in ZooKeeper. Cordra currently supports Solr versions 6, 7, and 8.

In addition to the Solr setting in the Cordra config.json file, the following Solr configuration file updates must be made. The default managed-schema file (called schemas.xml on older versions of Solr) should be replaced with the following (which can be downloaded here:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="cordra" version="1.6">
    <uniqueKey>id</uniqueKey>

    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="id" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory" />
    </analyzer>
    </fieldType>
    <fieldType name="keyword" class="solr.TextField" positionIncrementGap="10000">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
    </fieldType>
    <fieldType name="text" class="solr.TextField" positionIncrementGap="10000" >
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
    </fieldType>
    <fieldType name="dateRangeField" class="solr.DateRangeField" />
    <fieldType name="datePointField" class="solr.DatePointField" />

    <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
    <field name="id" type="id" indexed="true" stored="true" required="true" />
    <field name="repoid" type="keyword" indexed="true" stored="false" />
    <field name="type" type="keyword" indexed="true" stored="true" />
    <field name="aclRead" type="keyword" indexed="true" stored="false" multiValued="true" />
    <field name="aclWrite" type="keyword" indexed="true" stored="false" multiValued="true" />
    <field name="createdBy" type="keyword" indexed="true" stored="false" />
    <field name="remoteRepository" type="keyword" indexed="true" stored="false" />
    <field name="username" type="keyword" indexed="true" stored="false" />
    <field name="users" type="keyword" indexed="true" stored="false" multiValued="true" />
    <field name="schemaName" type="keyword" indexed="true" stored="false" />
    <field name="javaScriptModuleName" type="keyword" indexed="true" stored="false" multiValued="true" />
    <field name="isVersion" type="keyword" indexed="true" stored="false" />
    <field name="versionOf" type="keyword" indexed="true" stored="false" />
    <field name="payloadIndexState" type="keyword" indexed="true" stored="false" />
    <field name="payloadIndexCordraServiceId" type="keyword" indexed="true" stored="false" />
    <field name="internal.pointsAt" type="keyword" indexed="true" stored="false" multiValued="true" />
    <field name="internal.all" type="text" indexed="true" stored="false" multiValued="true" />
    <field name="txnId" type="long" indexed="true" stored="false" />
    <dynamicField name="/*" type="text" indexed="true" stored="false" multiValued="true" />
    <dynamicField name="objatt_*" type="text" indexed="true" stored="false" />
    <dynamicField name="elatt_*" type="text" indexed="true" stored="false" />
    <dynamicField name="elname_*" type="text" indexed="true" stored="false" />
    <dynamicField name="date_*" type="dateRangeField" indexed="true" stored="false" multiValued="true" />
    <dynamicField name="sort_date_*" type="datePointField" indexed="true" stored="false" docValues="true" omitNorms="true" />
    <dynamicField name="num_*" type="double" indexed="true" stored="false" multiValued="true" />
    <dynamicField name="sort_num_*" type="double" indexed="true" stored="false" docValues="true" omitNorms="true" />
    <dynamicField name="sort_*" type="keyword" indexed="true" stored="false" docValues="false" omitNorms="true" />
    <dynamicField name="acl/*" type="keyword" indexed="true" stored="false" multiValued="true" />
    <dynamicField name="*" type="text" indexed="true" stored="false" multiValued="true" />
</schema>

The default solrconfig.xml file should be modified in the following ways:

  • Change the maxTime value of autoSoftCommit to 10000

  • Change the maxTime value of autoCommit to 60000

  • Make sure the openSearcher value of autoCommit is set to false

  • Remove or comment out the searchComponent named “elevator” and the requestHandler named “/elevate”

  • Remove or comment out the updateRequestProcessorChain named “add-unknown-fields-to-the-schema”

  • Remove or comment out any initParams setting that make use of add-unknown-fields-to-the-schema

  • In initParams, change the value of df to internal.all. Any other df values used should also be changed to internal.all.

An example of a fully modified solrconfig.xml can be downloaded here.

When connecting to Solr using TLS, additional configuration may be required. See Enabling TLS for details.

Phrase Queries

You should refer to the query syntax supported by the indexing backend system that you configured with your Cordra instance.

One point is worth noting here. Queries that are placed within double quotes trigger exact match searches. However, queries without double quotes will be tokenized in a way which can sometimes be surprising. This is a side effect of the tokenization used by Lucene, Solr, and Elasticsearch.

For example, suppose you send the query /name:foo-bar to the indexer. The value is tokenized and treated as an OR statement. The query becomes /name:(foo bar), which will match items with the name foo, bar, foo-bar, and bar-foo. However, with double quotes, the query is turned into a phrase query, /name:"foo-bar", which will only match items with the name “foo-bar”.

In general, you should ensure that double quotes are used when a search might result in multiple tokens and only matches of the entire phrase are desired.