Elasticsearch Learning Series, Part 4 (Aggregation Search and Smart Suggestions) - 六虎

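Aggregation Analysis

Aggregation analysis is an important feature of a database: it computes aggregate statistics over the data set selected by a query, such as the maximum, minimum, sum, and average. In ES, computing a sum, maximum, minimum, and so on over a data set is called a metric aggregation, while grouping data the way a relational database does with GROUP BY is called bucketing.

Syntax: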

"aggregations" : {
  "<aggregation_name>" : { <!-- name of the aggregation -->
    "<aggregation_type>" : { <!-- type of the aggregation -->
       <aggregation_body> <!-- aggregation body: which fields to aggregate on -->
    }
    [,"meta" : { [<meta_data_body>] } ]? <!-- metadata -->
    [,"aggregations" : { [<sub_aggregation>]+ } ]? <!-- sub-aggregations defined inside this aggregation -->
 }
 [,"<aggregation_name_2>" : { ... } ]* <!-- name of another aggregation -->
}

aggregations can be abbreviated as aggs.
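
To make the nesting in the syntax above concrete, here is a minimal Python sketch that assembles a request body as a plain dictionary. The helper name agg is hypothetical and exists only for this illustration; the field price and the aggregation names are borrowed from the examples later in this article.

```python
import json


def agg(agg_type, body, sub_aggs=None, meta=None):
    """Build one named-aggregation entry: {type: body[, meta][, aggs]}."""
    entry = {agg_type: body}
    if meta:
        entry["meta"] = meta
    if sub_aggs:
        # "aggs" is the short form of "aggregations"
        entry["aggs"] = sub_aggs
    return entry


request_body = {
    "size": 0,  # only the aggregation result is wanted, not the hits
    "aggs": {
        "group_by_price": agg(
            "range",
            {"field": "price", "ranges": [{"from": 2000, "to": 3000}]},
            sub_aggs={"average_price": agg("avg", {"field": "price"})},
        )
    },
}

print(json.dumps(request_body, indent=2))
```

The printed JSON has the same shape as the bodies sent with POST /item/_search in the examples that follow.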

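Metric Aggregations

Example 1: Find the highest price among all products

Set size to 0, because only the aggregation result is needed, not the matching documents.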

POST /item/_search
{
  "size":0,
  "aggs": {
    "max_price": {
      "max": {
        "field": "price"
      }
    }
  }
}

Example 2: Document count

POST /item/_count
{
  "query": {
    "range": {
      "price": {
        "gte": 10,
        "lte": 5000
      }
    }
  }
}

Example 3: Count the documents that have a value for a field

POST /item/_search?size=0
{
  "aggs": {
    "price_count": {
      "value_count": {
        "field": "price"
      }
    }
  }
}

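Example 4: Distinct count with cardinality

Documents that share the same price are counted only once, so the result is the number of distinct values. Note that cardinality returns an approximate count.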

POST /item/_search?size=0
{
  "aggs":{
    "price_count":{
      "cardinality": {
        "field": "price"
      }
    }
  }
}

Example 5: stats computes five values: count, max, min, avg, and sum

POST /item/_search?size=0
{
  "aggs":{
    "price_stats":{
      "stats": {
        "field": "price"
      }
    }
  }
}

The result:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "price_stats" : {
      "count" : 5,
      "min" : 2333.0,
      "max" : 6888.0,
      "avg" : 4059.2,
      "sum" : 20296.0
    }
  }
}

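Example 6: extended_stats, an enhanced version of stats that adds the sum of squares, the variance, the standard deviation, and the interval of the mean plus or minus two standard deviations.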

POST /item/_search?size=0
{
  "aggs":{
    "price_stats":{
      "extended_stats": {
        "field": "price"
      }
    }
  }
}

Query result:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "price_stats" : {
      "count" : 5,
      "min" : 2333.0,
      "max" : 6888.0,
      "avg" : 4059.2,
      "sum" : 20296.0,
      "sum_of_squares" : 9.9816722E7,
      "variance" : 3486239.7599999993,
      "std_deviation" : 1867.1474928349928,
      "std_deviation_bounds" : {
        "upper" : 7793.494985669986,
        "lower" : 324.9050143300142
      }
    }
  }
}
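
The extra fields follow directly from the basic ones. The sketch below recomputes them in plain Python. The five prices are an assumption: the article never lists the indexed documents, so these values were chosen only because they are consistent with every result shown here. The function name extended_stats is likewise just for this illustration.

```python
import math

# Assumed sample data (not given in the article), consistent with its outputs.
prices = [2333, 2688, 2688, 5699, 6888]


def extended_stats(values, sigma=2.0):
    """Recompute the fields reported by stats / extended_stats."""
    count = len(values)
    total = sum(values)
    avg = total / count
    sum_of_squares = sum(v * v for v in values)
    # Population variance: mean of squares minus square of the mean.
    variance = max(sum_of_squares / count - avg * avg, 0.0)
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


stats = extended_stats(prices)
for name, value in stats.items():
    print(name, value)
```

The last digits may differ slightly from the ES response because of floating-point rounding, but the bounds are simply avg plus or minus two standard deviations.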

Example 7: Percentiles, which computes the value found at each given percentile


POST /item/_search?size=0
{
  "aggs":{
    "price_percents":{
      "percentiles": {
        "field": "price"
      }
    }
  }
}
# Specify the percentiles explicitly
POST /item/_search?size=0
{
  "aggs":{
    "price_percents":{
      "percentiles": {
        "field": "price",
        "percents": [
          1,
          5,
          25,
          50,
          75,
          95,
          99
        ]
      }
    }
  }
}

Query result:

......
  "aggregations" : {
    "price_percents" : {
      "values" : {
        "1.0" : 2333.0000000000005,
        "5.0" : 2333.0,
        "25.0" : 2599.25,
        "50.0" : 2688.0,
        "75.0" : 5996.25,
        "95.0" : 6888.0,
        "99.0" : 6888.0
      }
    }
  }
}
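
To see where numbers such as 2599.25 come from, the following sketch interpolates between neighbouring sorted values. Two caveats: the prices are assumed sample data chosen to be consistent with the outputs in this article, and ES by default estimates percentiles with the TDigest algorithm, so this exact midpoint interpolation is only an illustration that happens to agree on five values, not the implementation ES uses.

```python
import math

# Assumed sample data (not given in the article), consistent with its outputs.
prices = sorted([2333, 2688, 2688, 5699, 6888])


def percentile(sorted_values, p):
    """Midpoint interpolation: value i sits at percentile (i + 0.5) / n."""
    n = len(sorted_values)
    pos = p / 100 * n - 0.5
    pos = min(max(pos, 0.0), n - 1.0)  # clamp to the first / last value
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


for p in [1, 5, 25, 50, 75, 95, 99]:
    print(p, percentile(prices, p))
```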

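Percentile ranks: computes the percentage of documents whose value is less than or equal to a given value

The percentage of documents with a price of at most 3000 and of at most 5000: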

POST /item/_search?size=0
{
  "aggs":{
    "price_percents":{
      "percentile_ranks": {
        "field": "price",
        "values": [3000,5000]
      }
    }
  }
}

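Bucket Aggregations

A bucket aggregation groups documents: documents that share a given characteristic are placed in the same bucket. The output is typically a series of buckets, each containing multiple documents.

Example 1: Average per group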

POST /item/_search
{
  "size": 0,
  "aggs": {
    "group_by_price": {
      "range": {
        "field": "price",
        "ranges": [
          {
            "from": 50,
            "to": 100
          },
          {
            "from": 2000,
            "to": 3000
          },
          {
            "from": 3000,
            "to": 7000
          }
        ]
      },
      "aggs": {
        "average_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

Query result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_price" : {
      "buckets" : [
        {
          "key" : "50.0-100.0",
          "from" : 50.0,
          "to" : 100.0,
          "doc_count" : 0,
          "average_price" : {
            "value" : null
          }
        },
        {
          "key" : "2000.0-3000.0",
          "from" : 2000.0,
          "to" : 3000.0,
          "doc_count" : 3,
          "average_price" : {
            "value" : 2569.6666666666665
          }
        },
        {
          "key" : "3000.0-7000.0",
          "from" : 3000.0,
          "to" : 7000.0,
          "doc_count" : 2,
          "average_price" : {
            "value" : 6293.5
          }
        }
      ]
    }
  }
}
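
The same grouping can be sketched in plain Python to show how each bucket is filled. In a range aggregation each range includes its from value and excludes its to value. The prices below are assumed sample data (the article does not list the indexed documents), chosen to be consistent with the response above.

```python
import math

# Assumed sample data (not given in the article), consistent with its outputs.
prices = [2333, 2688, 2688, 5699, 6888]
ranges = [(50, 100), (2000, 3000), (3000, 7000)]


def range_buckets(values, ranges):
    """Group values into [from, to) ranges and average each group."""
    buckets = []
    for lower, upper in ranges:
        members = [v for v in values if lower <= v < upper]
        average = sum(members) / len(members) if members else None
        buckets.append({
            "key": f"{float(lower)}-{float(upper)}",
            "doc_count": len(members),
            "average_price": average,  # None mirrors the null of an empty bucket
        })
    return buckets


buckets = range_buckets(prices, ranges)
for bucket in buckets:
    print(bucket)
```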

Example 2: Count the documents in each group

POST /item/_search
{
  "size": 0,
  "aggs": {
    "group_by_price": {
      "range": {
        "field": "price",
        "ranges": [
          {
            "from": 50,
            "to": 100
          },
          {
            "from": 2000,
            "to": 3000
          },
          {
            "from": 3000,
            "to": 7000
          }
        ]
      },
      "aggs": {
        "price_count": {
          "value_count": {
            "field": "price"
          }
        }
      }
    }
  }
}

Example 3: Filtering buckets, similar to SQL HAVING

POST /item/_search
{
  "size": 0,
  "aggs": {
    "group_by_price": {
      "range": {
        "field": "price",
        "ranges": [
          {
            "from": 50,
            "to": 100
          },
          {
            "from": 2000,
            "to": 3000
          },
          {
            "from": 3000,
            "to": 7000
          }
        ]
      },
      "aggs": {
        "average_price": {
          "avg": {
            "field": "price"
          }
        },
        "having":{
          "bucket_selector": {
            "buckets_path": {
              "avg_price":"average_price"
            },
            "script": {
              "source": "params.avg_price >=2600"
            }
          }
        }
      }
    }
  }
}
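
The effect of bucket_selector can be pictured as a filter applied to the buckets after their metrics have been computed. The sketch below uses the bucket values from the response of Example 1. The function is a simplified model written for this article, and the handling of the empty bucket (dropped because it has no average) is an assumption of the sketch.

```python
# Bucket values copied from the response of Example 1 above.
buckets = [
    {"key": "50.0-100.0", "doc_count": 0, "average_price": None},
    {"key": "2000.0-3000.0", "doc_count": 3, "average_price": 2569.6666666666665},
    {"key": "3000.0-7000.0", "doc_count": 2, "average_price": 6293.5},
]


def bucket_selector(buckets, buckets_path, predicate):
    """Keep only the buckets for which the predicate holds."""
    kept = []
    for bucket in buckets:
        # Map script variable names to metric values, like buckets_path does.
        params = {var: bucket[path] for var, path in buckets_path.items()}
        if any(value is None for value in params.values()):
            continue  # assumption: a bucket without a value is dropped
        if predicate(params):
            kept.append(bucket)
    return kept


kept = bucket_selector(
    buckets,
    {"avg_price": "average_price"},
    lambda params: params["avg_price"] >= 2600,
)
print([bucket["key"] for bucket in kept])
```

With a threshold of 2600, only the 3000.0-7000.0 bucket passes in this model, since the 2000.0-3000.0 bucket averages about 2569.67.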

Smart Search Suggestions

First, create some test data:

PUT /blogs/
{
  "mappings": {
    "properties": {
      "body":{
        "type": "text"
      }
    }
  }
}
POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs" } }
{ "body": "Lucene is cool"}
{ "index" : { "_index" : "blogs" } }
{ "body": "Elasticsearch builds on top of lucene"}
{ "index" : { "_index" : "blogs" } }
{ "body": "Elasticsearch rocks"}
{ "index" : { "_index" : "blogs" } }
{ "body": "Elastic is the company behind ELK stack"}
{ "index" : { "_index" : "blogs" } }
{ "body": "elk rocks"}
{ "index" : { "_index" : "blogs"} }
{ "body": "elasticsearch is rock solid"}

Term Suggester

Search:

POST /blogs/_search
{
  "suggest": {
    "my_suggest": {
      "text": "rock",
      "term": {
        "field": "body",
        "suggest_mode":"missing"
      }
    }
  }
}

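suggest_mode accepts three values:

missing: suggest only for terms that are not in the index; since rock already exists, nothing is suggested
popular: even though rock is in the dictionary, similar terms with a higher frequency are suggested
always: suggest similar terms whether or not the input is in the dictionary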
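
The three modes can be modelled with a small dictionary of terms and their document frequencies. This is a deliberately simplified sketch, not how ES scores suggestions: it assumes whitespace tokenization with lowercasing, a plain Levenshtein distance of at most 2, a shared first character, and document frequency as the measure of popularity. The function names are made up for the illustration.

```python
from collections import Counter

docs = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid",
]

# Document frequency: in how many documents each term occurs.
doc_freq = Counter()
for body in docs:
    doc_freq.update(set(body.lower().split()))


def edit_distance(a, b):
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def term_suggest(term, mode="missing", max_edits=2):
    term = term.lower()
    term_freq = doc_freq.get(term, 0)
    if mode == "missing" and term_freq > 0:
        return []  # the term exists, so nothing is suggested
    scored = []
    for candidate, freq in doc_freq.items():
        if candidate == term or not candidate.startswith(term[:1]):
            continue
        distance = edit_distance(term, candidate)
        if distance > max_edits:
            continue
        if mode == "popular" and freq <= term_freq:
            continue  # only terms that are more frequent than the input
        scored.append((distance, -freq, candidate))
    return [candidate for _, _, candidate in sorted(scored)]


for mode in ("missing", "popular", "always"):
    print(mode, term_suggest("rock", mode))
```

In this model rock occurs in one document and rocks in two, so missing yields nothing while popular and always both propose rocks.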

Phrase suggester

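Building on the Term Suggester, the Phrase Suggester also considers the relationship between multiple terms, such as whether they appear together in the indexed source text and how close they are to each other.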

POST /blogs/_search
{
  "suggest": {
    "my_suggest": {
      "text": "lucne and elasticsear rock",
      "phrase": {
        "field": "body",
        "highlight":{
          "pre_tag":"<em>",
          "post_tag":"</em>"
        }
      }
    }
  }
}

Search result:

{
  "took" : 41,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "my_suggest" : [
      {
        "text" : "lucne and elasticsear rock",
        "offset" : 0,
        "length" : 26,
        "options" : [
          {
            "text" : "lucene and elasticsearch rock",
            "highlighted" : "<em>lucene</em> and <em>elasticsearch</em> rock",
            "score" : 0.004993905
          },
          {
            "text" : "lucne and elasticsearch rock",
            "highlighted" : "lucne and <em>elasticsearch</em> rock",
            "score" : 0.0033391973
          },
          {
            "text" : "lucene and elasticsear rock",
            "highlighted" : "<em>lucene</em> and elasticsear rock",
            "score" : 0.0029183894
          }
        ]
      }
    ]
  }
}

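options directly returns a list of phrases. Because lucene and elasticsearch appear together in the same source document, replacing both terms at once is more credible, so that option receives the highest score.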

Completion Suggester

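Its main use case is auto-completion, where a request is sent to the backend to look up matches every time the user types a character. Its data structure is therefore different from that of the two suggesters above: instead of relying on an inverted index, the analyzed data is encoded into an FST (finite state transducer) and stored alongside the index. For an open index, ES loads the whole FST into memory, so prefix lookups are extremely fast. However, an FST can only serve prefix queries. To use the Completion Suggester, the field type must be defined as completion.

PUT /blogs_completion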
{
  "mappings": {
    "properties": {
      "body":{
        "type": "completion"
      }
    }
  }
}

Insert some test data:

POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Lucene is cool"}
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Elasticsearch builds on top of lucene"}
{ "index" : { "_index" : "blogs_completion"} }
{ "body": "Elasticsearch rocks"}
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Elastic is the company behind ELK stack"}
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "the elk stack rocks"}
{ "index" : { "_index" : "blogs_completion"} }
{ "body": "elasticsearch is rock solid"}

Search example:

POST /blogs_completion/_search?pretty
{
  "size": 0,
  "suggest": {
    "blog-suggest": {
      "prefix": "elastic i",
      "completion": {
        "field": "body"
      }
    }
  }
}

The suggestions returned:

# partially omitted
"options" : [
          {
            "text" : "Elastic is the company behind ELK stack",
            "_index" : "blogs_completion",
            "_type" : "_doc",
            "_id" : "SG16oIEB1fsyWKAeKha5",
            "_score" : 1.0,
            "_source" : {
              "body" : "Elastic is the company behind ELK stack"
            }
          }
        ]
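
Only one document is returned because matching is done on the prefix of the whole input, not on individual words. The sketch below imitates this with a sorted list and binary search. This is a stand-in chosen for brevity, not the FST that ES actually uses, and lowercasing is the only analysis it applies.

```python
import bisect

bodies = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "the elk stack rocks",
    "elasticsearch is rock solid",
]

# Sort by the normalized text so all entries sharing a prefix are adjacent.
index = sorted((body.lower(), body) for body in bodies)
keys = [key for key, _ in index]


def complete(prefix, size=5):
    """Return up to `size` inputs that start with the given prefix."""
    prefix = prefix.lower()
    start = bisect.bisect_left(keys, prefix)  # first key >= prefix
    results = []
    for key, original in index[start:]:
        if not key.startswith(prefix):
            break  # past the block of matching keys
        results.append(original)
        if len(results) == size:
            break
    return results


print(complete("elastic i"))
print(complete("elasticsearch"))
print(complete("rock"))
```

The prefix elastic i matches only the Elastic is the company behind ELK stack entry, and rock matches nothing because no input starts with it, even though the word occurs inside several inputs.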

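Note that the analyzer affects the suggestions. With the english analyzer, the word is gets filtered out as a stop word, so the suggestion cannot be matched. preserve_separators and preserve_position_increments also affect the query.

preserve_separators: when set to false, separators such as spaces are ignored.
preserve_position_increments: if the first word of a suggestion is a stop word and the analyzer filters out stop words, this must be set to false.

If the Completion Suggester returns zero matches, the user may have made a typo, so try the Phrase Suggester next. If the Phrase Suggester finds no options either, fall back to the Term Suggester.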
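
This fallback order can be expressed as a simple chain. The sketch below is only an outline of the control flow: the three stub functions are hypothetical placeholders that return canned answers, and in a real application each would send the corresponding suggest request to ES.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Suggester = Callable[[str], List[str]]


def suggest_with_fallback(
    text: str, chain: Sequence[Tuple[str, Suggester]]
) -> Tuple[Optional[str], List[str]]:
    """Try each suggester in order and return the first non-empty result."""
    for name, suggester in chain:
        options = suggester(text)
        if options:
            return name, options
    return None, []


# Hypothetical stand-ins with canned answers, for illustration only.
def completion_stub(text: str) -> List[str]:
    return []  # pretend the prefix lookup found nothing


def phrase_stub(text: str) -> List[str]:
    canned = {"lucne and elasticsear rock": ["lucene and elasticsearch rock"]}
    return canned.get(text, [])


def term_stub(text: str) -> List[str]:
    canned = {"lucne": ["lucene"]}
    return canned.get(text, [])


chain = [
    ("completion", completion_stub),
    ("phrase", phrase_stub),
    ("term", term_stub),
]

print(suggest_with_fallback("lucne and elasticsear rock", chain))
print(suggest_with_fallback("lucne", chain))
```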

Context Suggester

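An extension of the Completion Suggester that lets a search carry additional context information, so that the same input, for example "star", yields different suggestions depending on the context:

Coffee-related: star bucks
Movie-related: star wars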
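
The idea can be sketched as a prefix lookup that is additionally filtered by a context value. Everything below is a toy in-memory model: the context name category and the values coffee and movie are made up for this illustration, and the function name is hypothetical.

```python
# Each entry carries the text to complete plus the contexts it belongs to.
entries = [
    {"input": "star bucks", "contexts": {"category": ["coffee"]}},
    {"input": "star wars", "contexts": {"category": ["movie"]}},
]


def context_suggest(prefix, category):
    """Prefix completion restricted to entries tagged with the category."""
    prefix = prefix.lower()
    return [
        entry["input"]
        for entry in entries
        if entry["input"].lower().startswith(prefix)
        and category in entry["contexts"]["category"]
    ]


print(context_suggest("star", "coffee"))
print(context_suggest("star", "movie"))
```

The same prefix star produces a different suggestion for each context, which is the behaviour described above.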
