- references
- https://www.manning.com/books/elasticsearch-in-action
- https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
- https://www.elastic.co/blog/moving-from-types-to-typeless-apis-in-elasticsearch-7-0
- https://stackoverflow.com/questions/26001002/elasticsearch-difference-between-term-match-phrase-and-query-string
- https://stackoverflow.com/questions/43530610/how-to-do-a-mapping-of-array-of-strings-in-elasticsearch
- https://marcobonzanini.com/2015/02/09/phrase-match-and-proximity-search-in-elasticsearch/
- https://chatgpt.com/
- goals of this workshop
- https://github.com/mtumilowicz/java12-elasticsearch-inverted-index-workshop
- introduction to the basics of API
- creating index, customize analyzers
- define mappings
- indexing documents
- searching: query, filter
- analyzing output
- aggregating
- in
docker-compose
there is elasticsearch + kibana (7.6) prepared for local testing- cd
docker/compose
docker-compose up -d
- cd
- workshop and answers are in
workshop
directory
- note that 7.0 deprecated APIs that accept types, introduced new typeless APIs
- create index
PUT /index-name { "settings": { "index" : { ... // configure index }, "analysis": { ... // customize analyzer } }, "mappings": { "properties": { ... // fields } } }
- field datatypes
- any field can contain zero or more values by default, however, all values in the array must be of the same datatype
- string
- text
- full-text indexed (analyzed)
- are not used for sorting and seldom used for aggregations
- example: body of an email or the description of a product
- sometimes it is useful to have multiple version of the same field: one for full text search and the other for aggregations and sorting
- keyword
- are only searchable by their exact value (not analyzed)
- typically used for filtering, sorting, and aggregations
- example: IDs, email addresses, hostnames, status codes, zip codes or tags
- text
- numeric
- byte, short, integer, long ...
- date
"date": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis" }
- internally, dates are converted to UTC (if the time-zone is specified) and stored as a long number representing milliseconds-since-the-epoch
- queries are internally converted on this long representation, and the result is converted back to a string according to the field's date format
- many more: range, object, nested, geo-points...
- indexing documents
- example
- with specified id
POST /index-name/_create/id { ... // fields }
- autogenerated id
POST /products/_doc { "name": "Tablet", "price": 499 }
- with specified id
- each indexed document is versioned
- prevent conflicting updates
- example
- deleting documents
- example
DELETE /index-name/_doc/id
- optimistic locking - version can be specified
- example:
DELETE /products/_doc/123?version=3
- example:
- example
- updating documents
- every update is actually a delete + reindex operation.
- example
POST /index-name/_update/id { "doc" : { "name" : "new_name" } }
- partial update (patch):
POST
- for replace (full overwrite):
PUT
- for replace (full overwrite):
- optimistic locking - version can be specified
- example:
POST /products/_doc/123?version=3
- example:
- by default updates detect if they don’t change anything and return
"result": "noop"
- every node in the cluster can handle HTTP and Transport traffic by default
- HTTP = External API access (CRUD/search)
- Transport traffic = internal communication protocol that Elasticsearch nodes use to talk to each other
- search requests may involve data held on different data nodes
- a search request is executed in two phases
- scatter phase
- coordinator forwards request to the data nodes which hold the data
- each data node executes the request locally and returns its results to the coordinating node
- gather phase
- coordinator reduces each data node’s results into a single global resultset
- scatter phase
- template for search requests
GET index-name/_search // you can search entire cluster: GET /_search {...} { "query": { "query-type": { // query type: 'match', `range`, `term`, etc // payload specific to query type } } }
- query
match
- text is analyzed before matching
- standard query for performing a full-text search
- per field
- example
"match": { "description": { "query": "lightweight laptop with fast chip" } }
query_string
- text is analyzed before matching
- if no field default - search all document
- used to create a complex search (wildcard characters, searches across multiple fields)
- example
"query_string": { "query": "apple AND macbook OR iphone", "default_field": "name" }
match_phrase
- text is analyzed before matching
- all the terms must appear in the field
- terms must have the same order
- configured
slop
- allows some flexibility in word positions within the phrase
- example: gaps or slight reordering
this is a brown dog
andthe dog is brown
are OK withslope = 1
- allows some flexibility in word positions within the phrase
- configured
- example
"match_phrase": { "name": { "query": "macbook pro", "slop": 1 } }
range
- used for numeric, date, or keyword fields to search within a range
- example
"range": { "price": { "gte": 500, "lte": 1000 } }
term
- text is NOT analyzed before matching
- exact matching
- best for keyword fields, IDs, status flags, enums
- vs filtering: affects
_score
- example
"term": { "status": "available" }
bool
- boolean combinations of other queries
- all clauses are combined with a top-level logical AND
- example
"bool" : { "must" : { ... }, "filter": { ... }, "must_not" : { ... }, "should" : [ ... ], }
must
- must appear in matching documents
- contributes to the score
must_not
- must not appear in the matching documents
- scoring is ignored
- considered for caching
filter
- must appear in matching documents
- bypass analysis
- filtering 'Andy' when indexed is 'andy' will give no hit
- score of the query will be ignored
- considered for caching
- Elasticsearch constructs a bitset, which is a binary set of bits denoting whether the document matches this filter
should
- should appear in the matching document
- contributes to the score
- boolean combinations of other queries
{
"took" : 1, // in milliseconds
"timed_out" : false,
"_shards" : { // count of shards used for the request
"total" : 1, // total number of shards that require querying
"successful" : 1, // number of shards that executed the request successfully
"skipped" : 0,
"failed" : 0 // number of shards that failed to execute the request
},
"hits" : { // documents and metadata
"total" : { // metadata about the number of returned documents
"value" : 2, // total number of returned documents
"relation" : "eq"
},
"max_score" : 0.9395274, // highest returned document _score
"hits" : [ // array of returned document objects
{
"_index" : "programming-user-groups",
"_id" : "2",
"_score" : 0.9395274,
"_source" : { ... } // original JSON body
}
]
}
}
took
- time elapsed between: receiving request on the coordinator and being ready to send response to client
_shard.skipped
- skipped the request because a lightweight check helped realize that no documents could possibly match on this shard
- typically happens when a search request includes a range filter and the shard only has values that fall outside of that range
hits.total.relation
- tells whether
total.value
is exact or just an estimate- eq: Accurate
- gte: Lower bound, including returned documents
- tells whether
- template
GET /programming-user-groups/_search { "aggs": { "agg-name" : { "agg-type": { ... }, "aggs":{ // sub aggregations "sub-agg-name": { "sub-agg-type": { ... } } } } } }
- types
- bucketing
- group documents into buckets based on some criteria
- example: terms, ranges, dates, etc.
- can be nested
- bucketing aggregations can have sub-aggregations (bucketing or metric)
- types: terms, range, date_histogram, filters, histogram
- example
- request
"aggs": { "genres": { "terms": { "field": "genre.keyword" } }
- response
{ ... // same as search response "aggregations" : { "genres" : { ... "buckets" : [ { "key" : "electronic", "doc_count" : 6 }, { "key" : "rock", "doc_count" : 3 }, { "key" : "jazz", "doc_count" : 2 } ] } } }
- request
- group documents into buckets based on some criteria
- metrics
- calculate values like average, min, max, count
- usually inside buckets
- types: avg, sum, min/max, value_count, stats, percentiles
- example
- request
"aggs": { "max_price": { "max": { "field": "price" } } }
- response
{ ... // same as search response "aggregations": { "max_price": { "value": 200.0 } } }
- request
- calculate values like average, min, max, count
- pipeline
- use the output of other aggregations as input
- uses the order of buckets returned by a bucket aggregation
- types
- derivative
- formula: current_value - previous_value
- example: daily change in average product price
- moving_fn
- formula: avg(values in [N most recent buckets])
- example: 3-day moving average of pric
- cumulative_sum
- formula: sum = prev_sum + current_value
- example: revenue over days
- bucket_script
- formula: custom_script(params.metric1, params.metric2)
- bucket_selector
- formula: include bucket if script(params.metric) is true
- example: keep brands with avg price > 500
- avg_bucket, sum_bucket, max_bucket, min_bucket
- formula: aggregate([bucket_values])
- example: Average of daily average prices
- serial_diff
- formula: value[n] - value[n - lag]
- example: compare today's avg price to price 7 days ago
- derivative
- example: how average price changed over time
- request
GET /products/_search { "size": 0, "aggs": { "by_day": { "date_histogram": { "field": "timestamp", // must be a date field "calendar_interval": "day" }, "aggs": { "avg_price": { "avg": { "field": "price" } }, "price_change": { "derivative": { // calculates the difference from the previous bucket "buckets_path": "avg_price" } } } } } }
- response
"aggregations": { "by_day": { "buckets": [ { "key_as_string": "2025-05-01", "avg_price": { "value": 150 }, "price_change": { "value": null } }, { "key_as_string": "2025-05-02", "avg_price": { "value": 170 }, "price_change": { "value": 20 } }, { "key_as_string": "2025-05-03", "avg_price": { "value": 140 }, "price_change": { "value": -30 } } ] } }
- request
- bucketing
- drawbacks
- memory footprint
- terms aggregation builds an in-memory map of buckets
- bucket state lives in the JVM heap
- high-cardinality fields => number of unique values (buckets) is huge
- example
{ "brand:Apple" => 1542, "brand:Dell" => 1100, ... }
- terms aggregation builds an in-memory map of buckets
- aggregations run per shard
- each shard computes top N results independently => coordinating node merges
- problem: global ranking inaccuracy
- example: product not in any local top 5 are excluded, even if their global total would be higher
- all aggregations are on-the-fly
- no materialized views or persisted rollups
- memory footprint