
Elasticsearch mapping design

Conventions

Access Control

Specify the _access field for access control.

"_access": {
  "type": "object",
  "properties": {
    "owner":{
      "type": "keyword"
    },
    "read": {
      "type": "keyword"
    },
    "update": {
      "type": "keyword"
    },
    "delete": {
      "type": "keyword"
    }
  }
}
Read more in the user docs.
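
For example, a document payload could then carry permissions like the following sketch (the egroup values are hypothetical; each field accepts one or more values):

"_access": {
  "owner": ["my-egroup@cern.ch"],
  "read": ["my-egroup@cern.ch", "readers-egroup@cern.ch"],
  "update": ["my-egroup@cern.ch"],
  "delete": ["my-egroup@cern.ch"]
}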

The default field for global searches (i.e. queries that are not restricted to specific fields) is _data. Therefore, you must define it and nest inside it all fields that are meant to be queried in global searches. Take into account that full-text search can only be performed over text and keyword fields, which means that fields of any other type must be placed outside _data.
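As an illustration, a minimal sketch of such a mapping could look like this (title, description and creation_date are hypothetical fields; note that creation_date, being a date, stays outside _data):

"_data": {
  "type": "object",
  "properties": {
    "title": {
      "type": "text"
    },
    "description": {
      "type": "text"
    }
  }
},
"creation_date": {
  "type": "date"
}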

Mandatory field definitions

If using strict mappings ("dynamic": "strict"), some internally generated fields must be defined in the schema.

  • control_number
    Corresponds to the CERN Search ID, which can be used for subsequent actions, i.e. get, update and delete. Self-generated.
"control_number": {
  "type": "keyword"
}
  • _created and _updated Self-generated by the Invenio framework.
"_updated": {
  "type": "date"
},
"_created": {
  "type": "date"
}
  • $schema
    This field is not mandatory in the payload and is not self-generated. However, if it is not set, the document will be inserted in the default schema (defined upon instance creation). The value should follow the pattern http://<host>/schemas/<search_instance>/<index_doc_type>.json. For example, for the mapping doc_v1.0.0.json and the environment variables:
    JSONSCHEMAS_ENDPOINT = '/schemas'
    JSONSCHEMAS_HOST = '' (should match your domain, i.e. JSONSCHEMAS_HOST=my-search-domain.ch)
    CERN_SEARCH_INSTANCE = 'my-search'
    $schema becomes: https://my-search-domain.web.cern.ch/schemas/my-search/doc_v1.0.0.json
"$schema": {
  "enabled": false
}
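
Following the example above, a document payload could then reference the schema like this (the _data content is a hypothetical example):

{
  "$schema": "https://my-search-domain.web.cern.ch/schemas/my-search/doc_v1.0.0.json",
  "_data": {
    "title": "My document"
  }
}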

Best practices

First of all, let us clarify that these best practices are not rules, not even close. They are mere "strategies", or examples of custom analysis properties, that have helped us tackle certain use cases and may help you in similar ones.

Custom analyzers: Avoid false positives due to always present characters

Spoiler alert: We will use Elasticsearch analyzers and tokenizers to tackle this problem.

Imagine the case of website URLs. Certain parts, such as the protocol and the domain suffix (i.e. http, https, ch, com, etc.), will be present in all of the strings, and therefore queries like "HTTP protocol" will produce many false positives.

This can be fixed by using a custom analyzer that removes these unwanted parts of the string. We can achieve that behaviour with a stop token filter and its stopwords property. However, in order to "stop" them, we first need to tokenize the string so those parts become "words". For that purpose we will use a custom tokenizer that splits on ., :// and /.

The following JSON code shows the final result. First you can see the tokenizer, the stop filter and the analyzer defined in the analysis section of the mapping. Later on you can see a field with the analyzer property set to url_analyzer, which is the name we gave to our custom one.

{
  "settings": {
    "index.percolator.map_unmapped_fields_as_string": true,
    "index.mapping.total_fields.limit": 50,
    "analysis": {
      "tokenizer": {
        "url_tokenizer": {
          "type": "pattern",
          "pattern": "\\.|(\\:\\/\\/)|\\/"
        }
      },
      "analyzer": {
        "url_analyzer": {
          "type": "standard",
          "tokenizer": "url_tokenizer",
          "stopwords": ["http", "https", "ftp", "www", "com", "ch"]
        }
      }
    }
  },
  "mappings": {
    "custom_analyzer_v1.0.0": {
      "dynamic": "strict",
      "numeric_detection": true,
      "_meta": {
        [...]
      },
      "properties": {
        [...]
        "url": {
          "type": "text",
          "analyzer": "url_analyzer"
        }
      }
    }
  }
}
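
You can verify the behaviour with the _analyze API. Assuming the mapping above was used to create an index named url-test (a hypothetical name), the following request should return only the tokens cern and home, with the protocol, the domain suffix and the separators stripped out:

GET /url-test/_analyze
{
  "analyzer": "url_analyzer",
  "text": "https://cern.ch/home"
}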

Case-insensitive exact match

Spoiler alert: We will use Elasticsearch normalizers.

Before going into details, case-insensitive exact match is not "exact match" per se, but it is in many cases the desired outcome.

On this occasion we will define a normalizer that processes the data both before indexing and before querying, and therefore produces the appropriate matches. This normalizer will change the string to lowercase and perform ASCII folding to allow matches over words that have diacritics. As a result, all case and diacritic variations of "test" will match: "TEST", "TeSt", "TésT"...

The following JSON code shows the final result. First you can see the normalizer defined in the analysis section of the mapping. Later on you can see a field with the normalizer property set to case_insensitive_normalizer, which is the name we gave to our custom one.

{
  "settings": {
    "index.percolator.map_unmapped_fields_as_string": true,
    "index.mapping.total_fields.limit": 50,
    "analysis": {
      "normalizer": {
        "case_insensitive_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "custom_analyzer_v1.0.0": {
      "dynamic": "strict",
      "numeric_detection": true,
      "_meta": {
        [...]
      },
      "properties": {
        [...]
        "promoted_keywords": {
          "type": "keyword",
          "normalizer": "case_insensitive_normalizer"
        }
      }
    }
  }
}
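
As a sanity check, a term query against an index created with this mapping (my-index is a hypothetical name) will now match a document indexed with "TEST", even though the query term carries different casing and a diacritic, because both sides go through the normalizer:

GET /my-index/_search
{
  "query": {
    "term": {
      "promoted_keywords": "TésT"
    }
  }
}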

Interesting properties

Access patterns depend on the operation

Search needs to answer the question "Which documents contain this term?", while sorting and aggregations need to answer a different question: "What is the value of this field for this document?".

  • analyzer: The analyzer to apply to text fields, for example french or english.
  • boost: Increases the importance of a field in relevance scoring; it has to be applied at query time. Default 1.0.
  • coerce: For numeric fields. Interprets and casts values to the corresponding format (e.g. "5" to 5).
  • doc_values: Disable it when sorting, aggregations and access to specific field values are not needed.
  • copy_to: Used to reimplement the legacy _all field when we want to avoid searching over certain fields (e.g. _access).
  • dynamic: Whether to accept the creation of fields that are sent but not defined in the design. It is recommended to set it to strict so that indexing fails and you are notified of the occurrence. Must be set at index level and on each inner object.
  • enabled: When you want to just store the field without indexing it. Default true.
  • fielddata: If you need to perform sorting or aggregations on text fields, you need to enable this property.
  • eager_global_ordinals: Activating it improves search response time by sacrificing some indexing performance. It might be useful to activate it for fields that you know will be queried in aggregations.
  • ignore_above: It will not index/store strings that are longer than the specified amount of characters. This option is also useful for protecting against Lucene’s term byte-length limit of 32766. Since the value for ignore_above is the character count and Lucene counts bytes, if you use UTF-8 text with many non-ASCII characters, you may want to set the limit to 32766 / 4 = 8191 (Since UTF-8 characters may occupy at most 4 bytes).
  • ignore_malformed: If enabled, data that does not have the expected format is ignored (the rest of the document is still indexed) instead of failing and throwing an exception.
  • index: The index option controls whether field values are indexed. Fields that are not indexed are not queryable. Default is true.
  • fields: It is often useful to index the same field in different ways for different purposes. This is the purpose of multi-fields (see the sketch after this list).
  • store: By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but the original field value cannot be retrieved.
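
As a sketch of the fields property mentioned above, the following hypothetical city field is indexed both as full text and as a raw keyword, so city can serve full-text search while city.raw serves sorting and aggregations:

"city": {
  "type": "text",
  "fields": {
    "raw": {
      "type": "keyword"
    }
  }
}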

It can consume a lot of resources

Fielddata can consume a lot of heap space, especially when loading high cardinality text fields. Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so.

This list might not be up to date: there might be more parameters, and some might have been deprecated. More information can be found in the official Elasticsearch documentation.