Out-of-the-box stemming solutions are never perfect. Algorithmic stemmers,
especially, will blithely apply their rules to any words they encounter,
perhaps conflating words that you would prefer to keep separate. Maybe, for
your use case, it is important to keep skies and skiing as distinct words
rather than stemming them both down to ski (as would happen with the
english analyzer).
The keyword_marker and stemmer_override token filters allow us to customize
the stemming process.
The stem_exclusion parameter for language analyzers (see Configuring
Language Analyzers) allows us to specify a list of words that
should not be stemmed. Internally, these language analyzers use the
keyword_marker token filter
to mark the listed words as keywords, which prevents subsequent stemming
token filters from touching those words.
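As a reminder of what that looks like, here is a minimal sketch of a language analyzer configured with stem_exclusion. The index name my_index matches the other examples in this section; the analyzer name english_no_skies is made up for illustration.

```console
# Sketch only: english_no_skies is a hypothetical analyzer name.
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_no_skies": {
          "type": "english",
          "stem_exclusion": [ "skies" ]
        }
      }
    }
  }
}
```

The keyword_marker token filter gives the same control inside a custom analyzer.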
For instance, we can create a simple custom analyzer that uses the
porter_stem token filter,
but prevents the word skies from being stemmed:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          "keywords": [ "skies" ]
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "no_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}
Testing it with the analyze API shows that just the word skies has
been excluded from stemming.
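One way to run that check is sketched below. It assumes an Elasticsearch version whose analyze API accepts a JSON request body; older releases instead take the analyzer as a query-string parameter, with the raw text as the request body. The sample text is chosen here purely for illustration.

```console
# Sketch only: the sample text is an assumption, picked to include
# several words that a stemmer might reduce to the same root.
GET /my_index/_analyze
{
  "analyzer": "my_english",
  "text": "sky skies skiing skis"
}
```

With the my_english analyzer defined above, skies should come back unchanged because it is marked as a keyword, while the remaining tokens pass through porter_stem as usual.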
While the language analyzers allow us only to specify an array of words in the
stem_exclusion parameter, the keyword_marker token filter also accepts a
keywords_path parameter that allows us to store all of our keywords in a
file.
The file should contain one word per line, and must be present on every
node in the cluster. See Updating Stopwords for tips on how to update this
file.
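A minimal sketch of the file-based variant follows. The path analysis/keywords.txt is a hypothetical location, taken here to be relative to the Elasticsearch config directory; an absolute path should work as well.

```console
# Sketch only: analysis/keywords.txt is a hypothetical file, assumed to
# exist under the config directory of every node, one keyword per line.
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          "keywords_path": "analysis/keywords.txt"
        }
      }
    }
  }
}
```

The no_stem filter is then placed in an analyzer's filter chain exactly as in the inline keywords example.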
In the preceding example, we prevented skies from being stemmed, but perhaps we
would prefer it to be stemmed to sky instead.
The stemmer_override token filter allows us to specify our own custom
stemming rules. At the same time,
we can handle some irregular forms like stemming mice to mouse and feet
to foot:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules": [
            "skies=>sky",
            "mice=>mouse",
            "feet=>foot"
          ]
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}
GET /my_index/_analyze?analyzer=my_english
The mice came down from the skies and ran over my feet 
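Rules take the form original=>stem.
The stemmer_override filter must be placed before the stemmer.
The request returns the, mouse, came, down, from, the, sky, and, ran, over, my, foot.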
Just as for the keyword_marker token filter, rules can be stored
in a file whose location should be specified with the rules_path
parameter.
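A minimal sketch of that variant follows. The path analysis/stem_rules.txt is a hypothetical location, taken here to be relative to the Elasticsearch config directory, and the file is assumed to hold one rule per line written like the inline rules (for example skies=>sky).

```console
# Sketch only: analysis/stem_rules.txt is a hypothetical file, assumed
# to be present on every node and to contain one rule per line.
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules_path": "analysis/stem_rules.txt"
        }
      }
    }
  }
}
```

As with the inline rules, custom_stem would still need to come before the stemmer in the analyzer's filter chain.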