December 18, 2024
Elasticsearch analyzers are a fundamental aspect of text processing, shaping how data is indexed and searched within the system. In addition to the default analyzer, Elasticsearch offers a range of specialized analyzers tailored to specific needs. In this blog post, we will delve into analyzers such as Keyword, Language, Pattern, Simple, Standard, Stop, and Whitespace. Understanding when to use each analyzer will empower you to optimize your Elasticsearch setup for diverse scenarios.
Analyzers process text before it is indexed and again when it is searched, so that queries and documents are compared on the same terms. An analyzer is built from three kinds of components: character filters, a tokenizer, and token filters.
Character filters preprocess the raw text (for example, stripping HTML markup), the tokenizer splits the text into individual tokens, and token filters change, add, or remove those tokens. With these building blocks, Elasticsearch can handle tasks such as stemming (reducing words to their root form), lowercasing, and removing stop words to improve the quality of search results.
Elasticsearch ships with built-in analyzers, including ones for a variety of languages, and users can also define custom analyzers to meet specific indexing and search needs. Configuring analyzers well is critical for optimizing search functionality and increasing the relevance of search results.
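The three stages described above can be pictured as a small pipeline. The following is a minimal Python sketch of that idea, not Elasticsearch code: the function names, the tag-stripping regex, and the three-word stop list are all assumptions made for illustration.

```python
import re


def strip_html(text):
    """Character filter stage: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    """Tokenizer stage: split the filtered text on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter stage: lowercase every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens, stopwords=frozenset({"the", "a", "an"})):
    """Token filter stage: remove tokens found in a (toy) stop list."""
    return [t for t in tokens if t not in stopwords]


def analyze(text):
    """Run character filter -> tokenizer -> token filters, in that order."""
    tokens = whitespace_tokenizer(strip_html(text))
    for token_filter in (lowercase_filter, stop_filter):
        tokens = token_filter(tokens)
    return tokens


print(analyze("<b>The</b> QUICK fox"))  # ['quick', 'fox']
```

The order matters: the stop filter here only works because lowercasing runs first, which is also why filter order is significant in a real analyzer definition.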
Explore firsthand the functionality of Elasticsearch analyzers through practical code demonstrations. These examples serve as a gateway to understanding the inner workings of analyzers, showcasing how they facilitate efficient indexing and powerful search capabilities within Elasticsearch. Mastering these analyzers not only aids in refining Elasticsearch queries but also enhances overall indexing strategies for optimal performance.
The simple analyzer breaks text into tokens at any non-letter character, such as numbers, spaces, hyphens, and apostrophes, discards non-letter characters, and changes uppercase to lowercase.
The simple analyzer consists of a single component, the lowercase tokenizer.
Example
POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Tokens generated
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
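To make the splitting rule concrete, here is a rough Python approximation of the behaviour shown above. It is a sketch, not the Lucene implementation: it assumes that Python's `str.isalpha` is a close enough stand-in for "letter character", and the function name `simple_analyze` is made up for this example.

```python
from itertools import groupby


def simple_analyze(text):
    """Approximate the simple analyzer: keep runs of letters, lowercase them."""
    return [
        "".join(chars).lower()
        for is_letter, chars in groupby(text, key=str.isalpha)
        if is_letter  # runs of digits, spaces, and punctuation are discarded
    ]


sentence = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(simple_analyze(sentence))
# ['the', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Note how the digit `2` disappears entirely and `dog's` becomes two tokens, which is worth keeping in mind for text that contains version numbers or contractions.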
Use Case: Basic Tokenization
Scenario: In situations where a simple tokenization approach is sufficient, such as when dealing with less structured or informal text, the simple analyzer provides a straightforward solution without extensive filtering.
Mapping:
"mappings": {
  "properties": {
    "text_field": {
      "type": "text",
      "analyzer": "simple"
    }
  }
}
The standard analyzer is the default analyzer which is used if none is specified. It provides grammar based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.
Example
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Tokens generated
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
Use Case: Common English Words Inclusion
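Scenario: Use the standard analyzer for general full-text fields when you want every word, including common words such as "the", to stay indexed and searchable, with grammar-based tokenization and lowercase conversion. It does not remove stop words by default.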
Mapping:
"mappings": {
  "properties": {
    "text_field": {
      "type": "text",
      "analyzer": "standard"
    }
  }
}
The keyword analyzer is a “noop” analyzer that returns the entire input string as a single token.
Example
POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Token generated
[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]
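Because the whole input becomes one token, only a query for the complete, unmodified string can match it. The Python sketch below illustrates that consequence under simplifying assumptions: `keyword_analyze` and `term_matches` are hypothetical helpers, and the product code is invented for the example.

```python
def keyword_analyze(text):
    """Approximate the keyword analyzer: the input is returned as one token."""
    return [text]


def term_matches(query, indexed_value, analyze):
    """Toy term lookup: the query must equal one of the indexed tokens."""
    return query in analyze(indexed_value)


product_code = "SKU-1042-B"  # hypothetical identifier
print(term_matches("SKU-1042-B", product_code, keyword_analyze))  # True
print(term_matches("SKU", product_code, keyword_analyze))         # False
print(term_matches("sku-1042-b", product_code, keyword_analyze))  # False
```

The last line shows that case is preserved too, so exact matching here really means character-for-character.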
Use Case: Exact Match Searches
Scenario: You have identifiers like product codes, document IDs, or tags that should not be tokenized. The keyword analyzer is suitable for scenarios where you need to search for exact matches without breaking down the input into individual words.
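Mapping (the keyword analyzer is set on a text field; a field of type keyword does not accept an analyzer parameter and already indexes the whole value as a single term):
"mappings": {
  "properties": {
    "keyword_field": {
      "type": "text",
      "analyzer": "keyword"
    }
  }
}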
The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.
Example
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Tokens generated
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
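The token list above is what a plain whitespace split produces. As a rough Python illustration (assuming Python's notion of whitespace is close enough to the one Elasticsearch uses, and with `whitespace_analyze` as a made-up name):

```python
def whitespace_analyze(text):
    """Approximate the whitespace analyzer: split on whitespace, change nothing else."""
    return text.split()


sentence = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
tokens = whitespace_analyze(sentence)
print(tokens)
# ['The', '2', 'QUICK', 'Brown-Foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone.']

# Case and punctuation survive, so "bone" is not among the tokens, but "bone." is.
print("bone" in tokens, "bone." in tokens)  # False True
```

This is why the whitespace analyzer suits data whose terms are already well formed; for ordinary prose, trailing punctuation and mixed case usually get in the way.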
Use Case: Maintain Text Structure
Scenario: Your data has distinct terms separated by whitespace, and you want to preserve this structure. The whitespace analyzer tokenizes the input based on whitespace characters, allowing you to index and search for terms as they appear in the original text.
Mapping:
"mappings": {
  "properties": {
    "text_field": {
      "type": "text",
      "analyzer": "whitespace"
    }
  }
}
The pattern analyzer uses a regular expression to split the text into terms. The regular expression should match the token separators, not the tokens themselves. It defaults to \W+ (one or more non-word characters), and by default the resulting terms are also lowercased.
Example
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Token Generated
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
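The key point, that the regular expression describes the separators rather than the tokens, can be tried out with Python's `re.split`. This is only an approximation: Python and Java regular expressions differ in details (for instance in how `\W` treats non-ASCII characters), and `pattern_analyze` is a hypothetical helper, not an Elasticsearch API.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Approximate the pattern analyzer: split on the separator regex, drop empties."""
    terms = [t for t in re.split(pattern, text) if t]
    return [t.lower() for t in terms] if lowercase else terms


sentence = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(pattern_analyze(sentence))
# ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']

# A custom separator: commas with optional surrounding spaces.
print(pattern_analyze("Red, Green ,Blue", pattern=r"\s*,\s*"))
# ['red', 'green', 'blue']
```

Trying a pattern this way before putting it into an index definition is a cheap way to catch separators that split more, or less, than intended.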
Use Case: Custom Text Formats
Scenario: You have structured data with specific patterns or custom text formats that need specialized parsing. The pattern analyzer allows you to define regular expressions for tokenization, making it suitable for scenarios where a predefined structure exists. Examples: emails, phone numbers, dates, etc.
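Mapping (a custom pattern cannot be set directly on the field; it is configured as an analyzer of type pattern in the index settings, here under the illustrative name comma_analyzer, and then referenced from the field. This pattern tokenizes by commas with optional spaces):
"settings": {
  "analysis": {
    "analyzer": {
      "comma_analyzer": {
        "type": "pattern",
        "pattern": "\\s*,\\s*"
      }
    }
  }
},
"mappings": {
  "properties": {
    "custom_field": {
      "type": "text",
      "analyzer": "comma_analyzer"
    }
  }
}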
The stop analyzer is the same as the simple analyzer but adds support for removing stop words. It defaults to using the _english_ stop words. The common stop words in English are is, on, the, a, an, etc.
Example
POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Tokens generated
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
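The result above is the simple analyzer's output with stop words filtered out afterwards. A small Python sketch of that two-step behaviour follows; the stop list is written out from memory of Lucene's default English set and should be treated as an assumption to verify against your Elasticsearch version, and `stop_analyze` is an illustrative name.

```python
from itertools import groupby

# Assumed to mirror the default _english_ stop words; verify before relying on it.
ENGLISH_STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def stop_analyze(text, stopwords=ENGLISH_STOP_WORDS):
    """Approximate the stop analyzer: letter runs, lowercased, minus stop words."""
    tokens = [
        "".join(chars).lower()
        for is_letter, chars in groupby(text, key=str.isalpha)
        if is_letter
    ]
    return [t for t in tokens if t not in stopwords]


sentence = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(stop_analyze(sentence))
# ['quick', 'brown', 'foxes', 'jumped', 'over', 'lazy', 'dog', 's', 'bone']
```

Both occurrences of "the" are removed regardless of their original case, because lowercasing happens before the stop list is applied.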
Use Case: Case-Insensitive Searches with Stop Word Removal
Scenario: You want case-insensitive searches that ignore common stop words. Because the stop analyzer lowercases every token, it does not preserve case; on top of that, it filters out frequently occurring words that may not add significant value to your search results.
Mapping:
"mappings": {
  "properties": {
    "text_field": {
      "type": "text",
      "analyzer": "stop"
    }
  }
}
A language analyzer is tailored to a specific language (e.g., English, Spanish, French). It incorporates language-specific tokenization and stemming rules for more accurate and context-aware indexing.
Example: rebuilding the Bengali analyzer as a custom analyzer
PUT /bengali_example
{
  "settings": {
    "analysis": {
      "filter": {
        "bengali_stop": {
          "type": "stop",
          "stopwords": "_bengali_"
        },
        "bengali_keywords": {
          "type": "keyword_marker",
          "keywords": ["উদাহরণ"]
        },
        "bengali_stemmer": {
          "type": "stemmer",
          "language": "bengali"
        }
      },
      "analyzer": {
        "rebuilt_bengali": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "decimal_digit",
            "bengali_keywords",
            "indic_normalization",
            "bengali_normalization",
            "bengali_stop",
            "bengali_stemmer"
          ]
        }
      }
    }
  }
}
With this analyzer, you can analyze Bengali text using Bengali stop words and a Bengali stemmer.
Use Case: Multilingual Content
Scenario: Your dataset includes documents in different languages. By using language-specific analyzers (e.g., English, Spanish, French), you can account for language-specific tokenization and stemming, improving the accuracy of search results in diverse linguistic contexts.
Conclusion
Elasticsearch provides a rich set of analyzers catering to various use cases. Whether dealing with multilingual content, structured data, or specific tokenization needs, selecting the right analyzer is key to achieving efficient and accurate search results. By understanding the nuances of analyzers like Keyword, Language, Pattern, Simple, Standard, Stop, and Whitespace, you can fine-tune your Elasticsearch setup for optimal performance and relevance in diverse scenarios. Partnering with experts in Elasticsearch Consulting and Development Services can further amplify your Elasticsearch capabilities for tailored and effective solutions.