The removal of stopwords is handled by the stop token filter, which can be
used when creating a custom analyzer (see Using the stop Token Filter).
However, some out-of-the-box analyzers
come with the stop filter pre-integrated:
- Language analyzers: Each language analyzer defaults to using the appropriate
  stopwords list for that language. For instance, the english analyzer uses the
  _english_ stopwords list.
- standard analyzer: Defaults to the empty stopwords list, _none_, essentially
  disabling stopwords.
- pattern analyzer: Defaults to _none_, like the standard analyzer.
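For example, a quick sketch using the analyze API (in the same style as the
examples later in this section) shows the english analyzer's default stopword
list in action; only quick and dead survive:

GET /_analyze?analyzer=english
The quick and the dead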
To use custom stopwords in conjunction with the standard analyzer, all we
need to do is create a configured version of the analyzer and pass in the
list of stopwords that we require:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": [ "and", "the" ]
        }
      }
    }
  }
}
- This is a custom analyzer called my_analyzer.
- This analyzer is the standard analyzer with some custom configuration.
- The stopwords to filter out are and and the.
This same technique can be used to configure custom stopword lists for any of the language analyzers.
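For example, a minimal sketch that gives the spanish analyzer a custom
stopwords list (the my_spanish name and the word list are just for
illustration):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_spanish": {
          "type": "spanish",
          "stopwords": [ "si", "el", "la" ]
        }
      }
    }
  }
}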
Returning to the my_analyzer example, the output from the analyze API is
quite interesting:
GET /my_index/_analyze?analyzer=my_analyzer
The quick and the dead
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "dead",
      "start_offset": 18,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}

The stopwords have been filtered out, as expected, but the interesting part is
that the position of the
two remaining terms is unchanged: quick is the
second word in the original sentence, and dead is the fifth. This is
important for phrase queries—if the positions of each term had been
adjusted, a phrase query for quick dead would have matched the preceding
example incorrectly.
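To make this concrete, here is a sketch of such a phrase query (the title
field, and the assumption that it is indexed with my_analyzer, are purely for
illustration):

GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "title": "quick dead"
    }
  }
}

Because quick and dead keep their original positions 2 and 5, this query does
not match; had the positions been collapsed to 1 and 2, it would have.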
Stopwords can be passed inline, as we did in the previous example, by specifying an array:
"stopwords": [ "and", "the" ]
The default stopword list for a particular language can be specified using the
_lang_ notation:
"stopwords": "_english_"
The predefined language-specific stopword
lists available in
Elasticsearch can be found in the
stop token filter documentation.
Stopwords can be disabled by
specifying the special list: _none_. For
instance, to use the english analyzer
without stopwords, you can do the
following:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords": "_none_"
        }
      }
    }
  }
}

Finally, stopwords can also be listed in a file with one word per line. The
file must be present on all nodes in the cluster, and the path can be
specified with the stopwords_path parameter.
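A minimal sketch of that configuration, assuming a hypothetical file at
config/stopwords/english.txt (stopwords_path is resolved relative to the
Elasticsearch config directory):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords_path": "stopwords/english.txt"
        }
      }
    }
  }
}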
The stop token filter can be combined
with a tokenizer
and other token filters when you need to create a custom
analyzer. For instance, let’s say that we wanted to
create a Spanish analyzer
with the following:
- A custom stopwords list
- The light_spanish stemmer
- The asciifolding filter to remove diacritics
We could set that up as follows:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": [ "si", "esta", "el", "la" ]
        },
        "light_spanish": {
          "type": "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "spanish_stop",
            "light_spanish"
          ]
        }
      }
    }
  }
}
- The stop token filter takes the same stopwords and stopwords_path
  parameters that are available for the standard analyzer.
- See Algorithmic Stemmers.
- The order of token filters is important, as explained next.
We have placed the spanish_stop filter after the asciifolding filter.
This
means that esta, ésta, and está will first have their diacritics
removed to become just esta, which will then be removed as a stopword. If,
instead, we wanted to remove esta and ésta, but not está, we
would have to put the spanish_stop filter before the asciifolding
filter, and specify both words in the stopwords list.
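As a sketch, that alternative setup might look like the following (reusing
the light_spanish filter definition from the preceding example, with both
spellings added to the stopwords list):

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": [ "si", "esta", "ésta", "el", "la" ]
        },
        "light_spanish": {
          "type": "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "spanish_stop",
            "asciifolding",
            "light_spanish"
          ]
        }
      }
    }
  }
}

With this ordering, esta and ésta are removed as stopwords before asciifolding
runs, while está passes through both filters.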
A few techniques can be used to update the list of stopwords used by an analyzer. Analyzers are instantiated at index creation time, when a node is restarted, or when a closed index is reopened.
If you specify stopwords inline with the stopwords parameter, your
only option is to close the index and update the analyzer configuration with the
update index settings API, then reopen
the index.
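For instance, a sketch of that workflow for the my_analyzer example from
earlier in this section (the added stopword or is just for illustration):

POST /my_index/_close

PUT /my_index/_settings
{
  "analysis": {
    "analyzer": {
      "my_analyzer": {
        "type": "standard",
        "stopwords": [ "and", "the", "or" ]
      }
    }
  }
}

POST /my_index/_open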
Updating stopwords is easier if you specify them in a file with the
stopwords_path parameter. You can just update the file (on every node in
the cluster) and then force the analyzers to be re-created by either of these actions:
- Closing and reopening the index (see open/close index), or
- Restarting each node in the cluster, one by one
Of course, updating the stopwords list will not change any documents that have already been indexed. It will apply only to searches and to new or updated documents. To apply the changes to existing documents, you will need to reindex your data. See Reindexing Your Data.