While Elasticsearch comes with a number of analyzers available out of the box, the real power comes from the ability to create your own custom analyzers by combining character filters, tokenizers, and token filters in a configuration that suits your particular data.
In Analysis and Analyzers, we said that an analyzer is a wrapper that combines three functions into a single package, which are executed in sequence:
- Character filters
Character filters are used to “tidy up” a string before it is tokenized. For instance, if our text is in HTML format, it will contain HTML tags like <p> or <div> that we don’t want to be indexed. We can use the html_strip character filter to remove all HTML tags and to convert HTML entities like &Aacute; into the corresponding Unicode character Á. An analyzer may have zero or more character filters.
- Tokenizers
An analyzer must have a single tokenizer. The tokenizer breaks up the string into individual terms or tokens. The standard tokenizer, which is used in the standard analyzer, breaks up a string into individual terms on word boundaries, and removes most punctuation, but other tokenizers exist that have different behavior. For instance, the keyword tokenizer outputs exactly the same string as it received, without any tokenization. The whitespace tokenizer splits text on whitespace only. The pattern tokenizer can be used to split text on a matching regular expression.
- Token filters
After tokenization, the resulting token stream is passed through any specified token filters, in the order in which they are specified.
Token filters may change, add, or remove tokens. We have already mentioned the lowercase and stop token filters, but there are many more available in Elasticsearch. Stemming token filters “stem” words to their root form. The asciifolding filter removes diacritics, converting a term like "très" into "tres". The ngram and edge_ngram token filters can produce tokens suitable for partial matching or autocomplete. An example that chains all three stages together follows this list.
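To see how the three stages fit together before building anything custom, the built-in pieces can be chained in a single analyze request. The request-body form below is only a sketch and assumes a more recent Elasticsearch version; older releases express the same thing with query-string parameters, like the test shown later in this chapter:

GET /_analyze
{
    "char_filter": [ "html_strip" ],
    "tokenizer":   "standard",
    "filter":      [ "lowercase" ],
    "text":        "<p>The QUICK Brown Fox</p>"
}

Here html_strip removes the <p> tags, the standard tokenizer splits the remaining text on word boundaries, and the lowercase token filter emits the terms the, quick, brown, and fox.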
In Search in Depth, we discuss examples of where and how to use these tokenizers and filters. But first, we need to explain how to create a custom analyzer.
In the same way as
we configured the es_std analyzer previously, we can configure
character filters, tokenizers, and token filters in their respective sections
under analysis:
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ... custom tokenizers ... },
            "filter":      { ... custom token filters ... },
            "analyzer":    { ... custom analyzers ... }
        }
    }
}

As an example, let’s set up a custom analyzer that will do the following:
- Strip out HTML by using the html_strip character filter.
- Replace & characters with " and ", using a custom mapping character filter (a quick way to test this filter on its own is sketched after this list):

"char_filter": {
    "&_to_and": {
        "type":     "mapping",
        "mappings": [ "&=> and "]
    }
}

- Tokenize words, using the standard tokenizer.
- Lowercase terms, using the lowercase token filter.
- Remove a custom list of stopwords, using a custom stop token filter:

"filter": {
    "my_stopwords": {
        "type":      "stop",
        "stopwords": [ "the", "a" ]
    }
}
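A component like the mapping character filter above can also be tried out on its own before it is wired into any index. On recent Elasticsearch versions, the analyze API accepts such definitions inline in the request body; the following is just a sketch of that idea:

GET /_analyze
{
    "tokenizer":   "standard",
    "char_filter": [ { "type": "mapping", "mappings": [ "&=> and " ] } ],
    "text":        "The quick & brown fox"
}

If the filter behaves as intended, the response contains the token and in place of the &.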
Our analyzer definition combines the predefined tokenizer and filters with the custom filters that we have configured previously:
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": [ "html_strip", "&_to_and" ],
"tokenizer": "standard",
"filter": [ "lowercase", "my_stopwords" ]
}
}To put it all together, the whole create-index request looks like this:
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":     "mapping",
                    "mappings": [ "&=> and "]
                }
            },
            "filter": {
                "my_stopwords": {
                    "type":      "stop",
                    "stopwords": [ "the", "a" ]
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "html_strip", "&_to_and" ],
                    "tokenizer":   "standard",
                    "filter":      [ "lowercase", "my_stopwords" ]
                }
            }
        }
    }
}

After creating the index, use the analyze API to test the new analyzer:
GET /my_index/_analyze?analyzer=my_analyzer
The quick & brown fox
The following abbreviated results show that our analyzer is working correctly:
{
    "tokens" : [
        { "token" : "quick", "position" : 2 },
        { "token" : "and",   "position" : 3 },
        { "token" : "brown", "position" : 4 },
        { "token" : "fox",   "position" : 5 }
    ]
}
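The query-string form used above belongs to older Elasticsearch releases; on newer versions the same test would be written with a request body, roughly like this:

GET /my_index/_analyze
{
    "analyzer": "my_analyzer",
    "text":     "The quick & brown fox"
}

Either form should return the same four tokens.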
The analyzer is not much use unless we tell Elasticsearch where to use it. We can apply it to a string field with a mapping such as the following:
PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": {
            "type":     "string",
            "analyzer": "my_analyzer"
        }
    }
}
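As a final sanity check, you could index a document and confirm that a search finds it through the analyzed terms. This is only a sketch; it assumes the my_type mapping above, and the document ID 1 is chosen purely for illustration:

PUT /my_index/my_type/1
{ "title": "The quick & brown fox" }

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": "and"
        }
    }
}

Because the & in the title is mapped to and at index time, and the match query analyzes its search terms with the same my_analyzer, this query should return the document once the index has refreshed.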