Skip to content

Simple introduction to indexing, querying, filtering and aggregating data in elasticsearch.

License

Notifications You must be signed in to change notification settings

mtumilowicz/elasticsearch7-query-filter-aggregation-workshop

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

26 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

License: GPL v3

elasticsearch7-query-filter-aggregation-workshop

preface

  • goals of this workshop
  • in docker-compose there is elasticsearch + kibana (7.6) prepared for local testing
    • cd docker/compose
    • docker-compose up -d
  • workshop and answers are in workshop directory

index

  • note that 7.0 deprecated APIs that accept types, introduced new typeless APIs
  • create index
    PUT /index-name
    {
        "settings": {
            "index" : {
                ... // configure index
            },
            "analysis": {
                ... // customize analyzer
            }
        },
        "mappings": {
            "properties": {
                ... // fields
            }
        }
    }
    
  • field datatypes
    • any field can contain zero or more values by default, however, all values in the array must be of the same datatype
    • string
      • text
        • full-text indexed (analyzed)
        • are not used for sorting and seldom used for aggregations
        • example: body of an email or the description of a product
        • sometimes it is useful to have multiple version of the same field: one for full text search and the other for aggregations and sorting
      • keyword
        • are only searchable by their exact value (not analyzed)
        • typically used for filtering, sorting, and aggregations
        • example: IDs, email addresses, hostnames, status codes, zip codes or tags
    • numeric
      • byte, short, integer, long ...
    • date
      "date": {
          "type":   "date",
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
      
      • internally, dates are converted to UTC (if the time-zone is specified) and stored as a long number representing milliseconds-since-the-epoch
      • queries are internally converted on this long representation, and the result is converted back to a string according to the field's date format
    • many more: range, object, nested, geo-points...
  • indexing documents
    • example
      • with specified id
        POST /index-name/_create/id
        {
            ... // fields
        }
        
      • autogenerated id
        POST /products/_doc
        {
          "name": "Tablet",
          "price": 499
        }
        
    • each indexed document is versioned
      • prevent conflicting updates
  • deleting documents
    • example
      DELETE /index-name/_doc/id
      
    • optimistic locking - version can be specified
      • example: DELETE /products/_doc/123?version=3
  • updating documents
    • every update is actually a delete + reindex operation.
    • example
      POST /index-name/_update/id
      {
          "doc" : {
              "name" : "new_name"
          }
      }
      
    • partial update (patch): POST
      • for replace (full overwrite): PUT
    • optimistic locking - version can be specified
      • example: POST /products/_doc/123?version=3
    • by default updates detect if they don’t change anything and return "result": "noop"

search

mechanics

  • every node in the cluster can handle HTTP and Transport traffic by default
    • HTTP = External API access (CRUD/search)
    • Transport traffic = internal communication protocol that Elasticsearch nodes use to talk to each other
  • search requests may involve data held on different data nodes
  • a search request is executed in two phases
    • scatter phase
      • coordinator forwards request to the data nodes which hold the data
      • each data node executes the request locally and returns its results to the coordinating node
    • gather phase
      • coordinator reduces each data node’s results into a single global resultset

API

  • template for search requests
    GET index-name/_search // you can search entire cluster: GET /_search {...}
    {
        "query": {
            "query-type": { // query type: 'match', `range`, `term`, etc
              // payload specific to query type
            }
        }
    }
    
  • query
    • match
      • text is analyzed before matching
      • standard query for performing a full-text search
      • per field
      • example
        "match": {
          "description": {
            "query": "lightweight laptop with fast chip"
          }
        }
        
    • query_string
      • text is analyzed before matching
      • if no field default - search all document
      • used to create a complex search (wildcard characters, searches across multiple fields)
      • example
        "query_string": {
          "query": "apple AND macbook OR iphone",
          "default_field": "name"
        }
        
    • match_phrase
      • text is analyzed before matching
      • all the terms must appear in the field
      • terms must have the same order
        • configured slop
          • allows some flexibility in word positions within the phrase
            • example: gaps or slight reordering
          • this is a brown dog and the dog is brown are OK with slope = 1
      • example
        "match_phrase": {
          "name": {
            "query": "macbook pro",
            "slop": 1
          }
        }
        
    • range
      • used for numeric, date, or keyword fields to search within a range
      • example
        "range": {
          "price": {
            "gte": 500,
            "lte": 1000
          }
        }
        
    • term
      • text is NOT analyzed before matching
      • exact matching
        • best for keyword fields, IDs, status flags, enums
      • vs filtering: affects _score
      • example
        "term": {
          "status": "available"
        }
        
    • bool
      • boolean combinations of other queries
        • all clauses are combined with a top-level logical AND
      • example
        "bool" : {
            "must" : {
                ...
            },
            "filter": {
                ...
            },
            "must_not" : {
                ...
            },
            "should" : [
                ...
            ],
        }
        
      • must
        • must appear in matching documents
        • contributes to the score
      • must_not
        • must not appear in the matching documents
        • scoring is ignored
        • considered for caching
      • filter
        • must appear in matching documents
        • bypass analysis
          • filtering 'Andy' when indexed is 'andy' will give no hit
        • score of the query will be ignored
        • considered for caching
        • Elasticsearch constructs a bitset, which is a binary set of bits denoting whether the document matches this filter
      • should
        • should appear in the matching document
        • contributes to the score

response body

{
    "took" : 1, // in milliseconds
    "timed_out" : false,
    "_shards" : { // count of shards used for the request
        "total" : 1, // total number of shards that require querying
        "successful" : 1, // number of shards that executed the request successfully
        "skipped" : 0,
        "failed" : 0 // number of shards that failed to execute the request
    },
    "hits" : { // documents and metadata
        "total" : { // metadata about the number of returned documents
            "value" : 2, // total number of returned documents
            "relation" : "eq"
        },
        "max_score" : 0.9395274, // highest returned document _score
        "hits" : [ // array of returned document objects
            {
                "_index" : "programming-user-groups",
                "_id" : "2",
                "_score" : 0.9395274,
                "_source" : { ... } // original JSON body
            } 
        ]
    }
}

  • took
    • time elapsed between: receiving request on the coordinator and being ready to send response to client
  • _shard.skipped
    • skipped the request because a lightweight check helped realize that no documents could possibly match on this shard
    • typically happens when a search request includes a range filter and the shard only has values that fall outside of that range
  • hits.total.relation
    • tells whether total.value is exact or just an estimate
      • eq: Accurate
      • gte: Lower bound, including returned documents

aggregate

  • template
    GET /programming-user-groups/_search
    {
        "aggs": {
            "agg-name" : {
                "agg-type": { ... },
                "aggs":{ // sub aggregations
                    "sub-agg-name": {
                        "sub-agg-type": { ... }
                    }
                }
            }
        }
    }
    
  • types
    • bucketing
      • group documents into buckets based on some criteria
        • example: terms, ranges, dates, etc.
      • can be nested
        • bucketing aggregations can have sub-aggregations (bucketing or metric)
      • types: terms, range, date_histogram, filters, histogram
      • example
        • request
          "aggs": {
            "genres": {
              "terms": {
                "field": "genre.keyword"
              }
            }
          
        • response
          {
              ... // same as search response
              "aggregations" : {
                  "genres" : {
                      ...
                      "buckets" : [ 
                          { "key" : "electronic", "doc_count" : 6 },
                          { "key" : "rock", "doc_count" : 3 },
                          { "key" : "jazz", "doc_count" : 2 }
                      ]
                  }
              }
          }
          
    • metrics
      • calculate values like average, min, max, count
        • usually inside buckets
      • types: avg, sum, min/max, value_count, stats, percentiles
      • example
        • request
          "aggs": {
            "max_price": {
              "max": {
                "field": "price"
              }
            }
          }
          
        • response
          {
              ... // same as search response
              "aggregations": {
                  "max_price": {
                      "value": 200.0
                  }
              }
          }
          
    • pipeline
      • use the output of other aggregations as input
      • uses the order of buckets returned by a bucket aggregation
      • types
        • derivative
          • formula: current_value - previous_value
          • example: daily change in average product price
        • moving_fn
          • formula: avg(values in [N most recent buckets])
          • example: 3-day moving average of pric
        • cumulative_sum
          • formula: sum = prev_sum + current_value
          • example: revenue over days
        • bucket_script
          • formula: custom_script(params.metric1, params.metric2)
        • bucket_selector
          • formula: include bucket if script(params.metric) is true
          • example: keep brands with avg price > 500
        • avg_bucket, sum_bucket, max_bucket, min_bucket
          • formula: aggregate([bucket_values])
          • example: Average of daily average prices
        • serial_diff
          • formula: value[n] - value[n - lag]
          • example: compare today's avg price to price 7 days ago
      • example: how average price changed over time
        • request
          GET /products/_search
          {
            "size": 0,
            "aggs": {
              "by_day": {
                "date_histogram": {
                  "field": "timestamp",     // must be a date field
                  "calendar_interval": "day"
                },
                "aggs": {
                  "avg_price": {
                    "avg": {
                      "field": "price"
                    }
                  },
                  "price_change": {
                    "derivative": { // calculates the difference from the previous bucket
                      "buckets_path": "avg_price"
                    }
                  }
                }
              }
            }
          }
          
        • response
          "aggregations": {
            "by_day": {
              "buckets": [
                {
                  "key_as_string": "2025-05-01",
                  "avg_price": { "value": 150 },
                  "price_change": { "value": null }
                },
                {
                  "key_as_string": "2025-05-02",
                  "avg_price": { "value": 170 },
                  "price_change": { "value": 20 }
                },
                {
                  "key_as_string": "2025-05-03",
                  "avg_price": { "value": 140 },
                  "price_change": { "value": -30 }
                }
              ]
            }
          }
          
  • drawbacks
    • memory footprint
      • terms aggregation builds an in-memory map of buckets
        • bucket state lives in the JVM heap
        • high-cardinality fields => number of unique values (buckets) is huge
        • example
          {
            "brand:Apple" => 1542,
            "brand:Dell" => 1100,
            ...
          }
          
    • aggregations run per shard
      • each shard computes top N results independently => coordinating node merges
      • problem: global ranking inaccuracy
        • example: product not in any local top 5 are excluded, even if their global total would be higher
    • all aggregations are on-the-fly
      • no materialized views or persisted rollups