Just as the lowercase token filter is a good starting point for many languages
but falls short when exposed to the entire tower of Babel, the asciifolding
token filter needs a more capable Unicode character-folding counterpart to deal
with the many languages of the world.

The icu_folding token filter (provided by the icu plug-in) does the same job as
the asciifolding filter, but extends the transformation to scripts that are not
ASCII-based, such as Greek, Hebrew, and Han. It also converts numbers in other
scripts into their Latin equivalents and applies various other numeric,
symbolic, and punctuation transformations.

The icu_folding token filter applies Unicode normalization and case folding
from nfkc_cf automatically, so the icu_normalizer is not required:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_folder": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "icu_folding" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_folder
١٢٣٤٥
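To build some intuition for what this kind of folding involves, the following is a rough sketch in Python that uses only the standard `unicodedata` module. It is not the ICU algorithm and will not match `icu_folding` output in every case; the function name `approximate_fold` is hypothetical and exists only for this illustration. It combines NFKC normalization with case folding (approximating the nfkc_cf step), strips combining marks, and maps decimal digits from other scripts to their Latin equivalents.

```python
import unicodedata


def approximate_fold(text: str) -> str:
    """Rough stand-in for Unicode folding. NOT the ICU implementation."""
    # NFKC normalization plus case folding approximates the nfkc_cf step.
    text = unicodedata.normalize("NFKC", text).casefold()
    # Decompose so that diacritics become separate combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the combining mark (the diacritic)
        # Decimal digits in any script carry a numeric value of 0 to 9.
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return unicodedata.normalize("NFC", "".join(out))


print(approximate_fold("١٢٣٤٥"))    # Arabic-Indic digits
print(approximate_fold("Ünïcödé"))  # Latin letters with diacritics
```

The real filter covers far more transformations (symbols, punctuation, and script-specific rules), so treat this only as a mental model of the normalize, case-fold, and strip sequence.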
If there are particular characters that you would like to protect from
folding, you can use a UnicodeSet (much like a character class in regular
expressions) to specify which Unicode characters may be folded. For instance,
to exclude the Swedish letters å, ä, ö, Å, Ä, and Ö from folding, you would
specify a character class representing all Unicode characters except those
letters: [^åäöÅÄÖ] (^ means everything except).
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": {
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "swedish_folding", "lowercase" ]
        }
      }
    }
  }
}
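The effect of restricting folding to a set of characters, and the reason a lowercase filter follows the folding filter in the analyzer above, can be sketched in plain Python. This is an illustration under simplifying assumptions, not how ICU implements UnicodeSet filtering: the names `PROTECTED`, `fold_except`, and `swedish_analyze` are hypothetical, and the folding itself is reduced to stripping diacritics and case folding.

```python
import unicodedata

# The letters excluded from folding, mirroring the [^åäöÅÄÖ] set.
PROTECTED = set("åäöÅÄÖ")


def strip_marks(ch: str) -> str:
    """Remove combining marks from a single character."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fold_except(text: str, protected=PROTECTED) -> str:
    """Fold every character except those in the protected set."""
    # Compose first so that each protected letter is a single code point.
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch if ch in protected else strip_marks(ch).casefold()
        for ch in text
    )


def swedish_analyze(text: str) -> str:
    """Fold, then lowercase, like the two-filter chain in the analyzer."""
    return fold_except(text).lower()


print(fold_except("Åsa Müller"))      # protected Å keeps its case and ring
print(swedish_analyze("Åsa Müller"))  # lowercase step handles the protected Å
```

Because protected characters bypass folding entirely, they also bypass case folding, so an uppercase Å, Ä, or Ö would survive unchanged without the extra lowercase step.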