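Aggregation analysis
Aggregation is an important database feature: it computes summary values over the set of documents matched by a query, such as the maximum, minimum, sum, or average. In ES, computing a sum, maximum, minimum, and so on over a data set is called a metric aggregation, while grouping data the way a relational database does with group by is called bucketing.
Syntax: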
"aggregations" : {
"<aggregation_name>" : { <!-- name of the aggregation -->
"<aggregation_type>" : { <!-- type of the aggregation -->
<aggregation_body> <!-- aggregation body: which fields to aggregate -->
}
[,"meta" : { [<meta_data_body>] } ]? <!-- metadata -->
[,"aggregations" : { [<sub_aggregation>]+ } ]? <!-- sub-aggregations defined inside this aggregation -->
}
[,"<aggregation_name_2>" : { ... } ]* <!-- further named aggregations -->
}
aggregations can be abbreviated as aggs.
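To make the nesting in the syntax above concrete, here is a minimal Python sketch that builds a request body as a plain dict. The helper name metric_agg is hypothetical and only for illustration; it is not part of any ES client.

```python
import json


def metric_agg(name, agg_type, field, sub_aggs=None):
    """Build one named aggregation: {name: {agg_type: {"field": field}}}.

    Hypothetical helper for illustration only.
    """
    node = {agg_type: {"field": field}}
    if sub_aggs:
        # Sub-aggregations sit next to the aggregation type, under "aggs".
        node["aggs"] = sub_aggs
    return {name: node}


# "aggs" is the short form of "aggregations".
body = {"size": 0, "aggs": metric_agg("max_price", "max", "price")}
print(json.dumps(body, indent=2))
```

The printed JSON has the same shape as the request bodies used in the examples of this article.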
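Metric aggregations
Example 1: find the highest price among all products
Set size to 0 so that only the aggregation result is returned, without the matching documents.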
POST /item/_search
{
"size":0,
"aggs": {
"max_price": {
"max": {
"field": "price"
}
}
}
}
Example 2: count documents
POST /item/_count
{
"query": {
"range": {
"price": {
"gte": 10,
"lte": 5000
}
}
}
}
Example 3: count the documents that have a value for a field
POST /item/_search?size=0
{
"aggs": {
"price_count": {
"value_count": {
"field": "price"
}
}
}
}
Example 4: distinct count with cardinality
If some price values are repeated, only the number of distinct values is counted.
POST /item/_search?size=0
{
"aggs":{
"price_count":{
"cardinality": {
"field": "price"
}
}
}
}
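The difference between value_count (Example 3) and cardinality (Example 4) can be sketched in a few lines of Python. The price list is an assumption made up for illustration, since the article does not list the indexed documents. Also note that ES computes cardinality approximately (HyperLogLog++), whereas this sketch counts exactly.

```python
# Assumed sample data; the article does not show the indexed documents.
prices = [2333, 2688, 2688, 5699, 6888]

# value_count: how many values were seen, duplicates included.
value_count = len(prices)

# cardinality: how many distinct values were seen.
cardinality = len(set(prices))

print(value_count, cardinality)
```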
Example 5: stats computes five values: count, max, min, avg, and sum
POST /item/_search?size=0
{
"aggs":{
"price_stats":{
"stats": {
"field": "price"
}
}
}
}
Result:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"price_stats" : {
"count" : 5,
"min" : 2333.0,
"max" : 6888.0,
"avg" : 4059.2,
"sum" : 20296.0
}
}
}
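Example 6: extended_stats, an enhanced version of stats that adds the sum of squares, variance, standard deviation, and the interval of the mean plus/minus two standard deviations.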
POST /item/_search?size=0
{
"aggs":{
"price_stats":{
"extended_stats": {
"field": "price"
}
}
}
}
Query result:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"price_stats" : {
"count" : 5,
"min" : 2333.0,
"max" : 6888.0,
"avg" : 4059.2,
"sum" : 20296.0,
"sum_of_squares" : 9.9816722E7,
"variance" : 3486239.7599999993,
"std_deviation" : 1867.1474928349928,
"std_deviation_bounds" : {
"upper" : 7793.494985669986,
"lower" : 324.9050143300142
}
}
}
}
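To show where the extra fields come from, the following Python sketch recomputes them by hand. The price list is an assumption: the article never lists the documents, and these five values were chosen only because they are consistent with the response above. The variance here is the population variance, which is what matches the numbers shown.

```python
import math

# Assumed prices, chosen to be consistent with the response above.
prices = [2333, 2688, 2688, 5699, 6888]


def extended_stats(values, sigma=2.0):
    """Recompute the extended_stats fields for a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    n = len(values)
    total = sum(values)
    avg = total / n
    sum_sq = sum(v * v for v in values)
    # Population variance: mean of squares minus square of mean.
    variance = max(0.0, sum_sq / n - avg * avg)
    std = math.sqrt(variance)
    return {
        "count": n,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_sq,
        "variance": variance,
        "std_deviation": std,
        # Bounds are avg plus/minus sigma standard deviations.
        "std_deviation_bounds": {
            "upper": avg + sigma * std,
            "lower": avg - sigma * std,
        },
    }


stats = extended_stats(prices)
print(stats)
```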
Example 7: percentiles, computing the value found at each percentile
POST /item/_search?size=0
{
"aggs":{
"price_percents":{
"percentiles": {
"field": "price"
}
}
}
}
# Specify the percentiles explicitly
POST /item/_search?size=0
{
"aggs":{
"price_percents":{
"percentiles": {
"field": "price",
"percents": [
1,
5,
25,
50,
75,
95,
99
]
}
}
}
}
Query result:
......
"aggregations" : {
"price_percents" : {
"values" : {
"1.0" : 2333.0000000000005,
"5.0" : 2333.0,
"25.0" : 2599.25,
"50.0" : 2688.0,
"75.0" : 5996.25,
"95.0" : 6888.0,
"99.0" : 6888.0
}
}
}
}
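Percentile ranks: the percentage of documents whose value is less than or equal to a given value
The percentage of documents with price at or below 3000 and at or below 5000: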
POST /item/_search?size=0
{
"aggs":{
"price_percents":{
"percentile_ranks": {
"field": "price",
"values": [3000, 5000]
}
}
}
}
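What a percentile rank means can be sketched as an exact calculation in Python. The price list is an assumption for illustration. ES estimates percentiles and percentile ranks approximately (TDigest by default), so the numbers it returns may differ from this exact count.

```python
# Assumed sample data; the article does not show the indexed documents.
prices = [2333, 2688, 2688, 5699, 6888]


def percentile_rank(values, threshold):
    """Exact percentage of values that are <= threshold (None if empty)."""
    if not values:
        return None
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)


ranks = {t: percentile_rank(prices, t) for t in (3000, 5000)}
print(ranks)
```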
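Bucket aggregations
A bucket aggregation groups documents: documents that share a given characteristic are placed in the same bucket (bucketing). The output is usually a list of buckets, each containing multiple documents.
Example 1: average per group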
POST /item/_search
{
"size": 0,
"aggs": {
"group_by_price": {
"range": {
"field": "price",
"ranges": [
{
"from": 50,
"to": 100
},
{
"from": 2000,
"to": 3000
},
{
"from": 3000,
"to": 7000
}
]
},
"aggs": {
"average_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
Query result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_price" : {
"buckets" : [
{
"key" : "50.0-100.0",
"from" : 50.0,
"to" : 100.0,
"doc_count" : 0,
"average_price" : {
"value" : null
}
},
{
"key" : "2000.0-3000.0",
"from" : 2000.0,
"to" : 3000.0,
"doc_count" : 3,
"average_price" : {
"value" : 2569.6666666666665
}
},
{
"key" : "3000.0-7000.0",
"from" : 3000.0,
"to" : 7000.0,
"doc_count" : 2,
"average_price" : {
"value" : 6293.5
}
}
]
}
}
}
Example 2: count the documents in each group
POST /item/_search
{
"size": 0,
"aggs": {
"group_by_price": {
"range": {
"field": "price",
"ranges": [
{
"from": 50,
"to": 100
},
{
"from": 2000,
"to": 3000
},
{
"from": 3000,
"to": 7000
}
]
},
"aggs": {
"price_count": {
"value_count": {
"field": "price"
}
}
}
}
}
}
Example 3: filtering buckets, similar to SQL HAVING
POST /item/_search
{
"size": 0,
"aggs": {
"group_by_price": {
"range": {
"field": "price",
"ranges": [
{
"from": 50,
"to": 100
},
{
"from": 2000,
"to": 3000
},
{
"from": 3000,
"to": 7000
}
]
},
"aggs": {
"average_price": {
"avg": {
"field": "price"
}
},
"having":{
"bucket_selector": {
"buckets_path": {
"avg_price":"average_price"
},
"script": {
"source": "params.avg_price >=2600"
}
}
}
}
}
}
}
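The three bucket examples can be simulated in plain Python to see what the range buckets, the avg sub-aggregation, and the bucket_selector filter each contribute. This is a sketch under assumptions: the price list is made up to be consistent with the Example 1 response, and the function names are hypothetical. Range buckets include from and exclude to. A bucket with no documents has no average, and the sketch drops it in the filter step.

```python
# Assumed prices, consistent with the Example 1 response.
prices = [2333, 2688, 2688, 5699, 6888]
ranges = [(50, 100), (2000, 3000), (3000, 7000)]


def range_buckets(values, bounds):
    """Group values into [from, to) buckets and attach the per-bucket average."""
    buckets = []
    for lo, hi in bounds:
        members = [v for v in values if lo <= v < hi]
        avg = sum(members) / len(members) if members else None
        buckets.append({
            "key": f"{float(lo)}-{float(hi)}",
            "doc_count": len(members),
            "average_price": avg,
        })
    return buckets


def having(buckets, min_avg):
    """Keep only buckets whose average is at least min_avg (like bucket_selector)."""
    return [
        b for b in buckets
        if b["average_price"] is not None and b["average_price"] >= min_avg
    ]


buckets = range_buckets(prices, ranges)
kept = having(buckets, 2600)
print([b["key"] for b in kept])
```

With these assumed prices, only the 3000 to 7000 bucket survives the filter, because the 2000 to 3000 bucket averages below 2600.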
Smart search suggestions
First, create some test data:
PUT /blogs/
{
"mappings": {
"properties": {
"body":{
"type": "text"
}
}
}
}
POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs" } }
{ "body": "Lucene is cool"}
{ "index" : { "_index" : "blogs" } }
{ "body": "Elasticsearch builds on top of lucene"}
{ "index" : { "_index" : "blogs" } }
{ "body": "Elasticsearch rocks"}
{ "index" : { "_index" : "blogs" } }
{ "body": "Elastic is the company behind ELK stack"}
{ "index" : { "_index" : "blogs" } }
{ "body": "elk rocks"}
{ "index" : { "_index" : "blogs"} }
{ "body": "elasticsearch is rock solid"}
Term Suggester
Search:
POST /blogs/_search
{
"suggest": {
"my_suggest": {
"text": "rock",
"term": {
"field": "body",
"suggest_mode":"missing"
}
}
}
}
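suggest_mode accepts three values:
- missing: only suggest for terms that are not in the index; since rock already exists, nothing is suggested
- popular: even though rock is already in the dictionary, similar terms with a higher document frequency are suggested
- always: suggest similar terms regardless of whether the term is in the dictionary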
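The three modes can be illustrated with a simplified Python model. This is only a sketch: it tokenizes by lowercasing and splitting on whitespace, uses plain Levenshtein distance with at most two edits, and ignores the scoring and settings (such as prefix_length) that the real term suggester applies. The function names are hypothetical.

```python
from collections import Counter

DOCS = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid",
]


def doc_freqs(docs):
    """Count in how many documents each lowercased term appears."""
    freqs = Counter()
    for doc in docs:
        freqs.update(set(doc.lower().split()))
    return freqs


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def suggest(term, freqs, mode="missing", max_edits=2):
    """Return (candidate, doc_freq) pairs according to the suggest mode."""
    own = freqs.get(term, 0)
    if mode == "missing" and own > 0:
        return []  # the term exists, so nothing is suggested
    found = []
    for cand, freq in freqs.items():
        if cand == term or edit_distance(term, cand) > max_edits:
            continue
        if mode == "popular" and freq <= own:
            continue  # only terms more frequent than the input
        found.append((cand, freq))
    return sorted(found, key=lambda c: (edit_distance(term, c[0]), -c[1], c[0]))


freqs = doc_freqs(DOCS)
for mode in ("missing", "popular", "always"):
    print(mode, suggest("rock", freqs, mode))
```

In this model rock appears in one document and rocks in two, so missing returns nothing while popular and always both propose rocks.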
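Phrase Suggester
Built on top of the Term Suggester, the Phrase Suggester also considers the relationship between multiple terms, such as whether they occur together in the indexed text and how close they are to each other.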
POST /blogs/_search
{
"suggest": {
"my_suggest": {
"text": "lucne and elasticsear rock",
"phrase": {
"field": "body",
"highlight":{
"pre_tag":"<em>",
"post_tag":"</em>"
}
}
}
}
}
Search result:
{
"took" : 41,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"my_suggest" : [
{
"text" : "lucne and elasticsear rock",
"offset" : 0,
"length" : 26,
"options" : [
{
"text" : "lucene and elasticsearch rock",
"highlighted" : "<em>lucene</em> and <em>elasticsearch</em> rock",
"score" : 0.004993905
},
{
"text" : "lucne and elasticsearch rock",
"highlighted" : "lucne and <em>elasticsearch</em> rock",
"score" : 0.0033391973
},
{
"text" : "lucene and elasticsear rock",
"highlighted" : "<em>lucene</em> and elasticsear rock",
"score" : 0.0029183894
}
]
}
]
}
}
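options returns a list of phrases directly. Because lucene and elasticsearch have appeared together in the same source text, replacing both terms at once is more credible, so that option gets the highest score.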
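Completion Suggester
Its main use case is auto completion, where a request is sent to the backend to look up matches every time the user types a character. Its data structure therefore differs from the two suggesters above: instead of using the inverted index, the analyzed input is encoded into an FST (finite state transducer) that is stored alongside the index. For an open index, ES loads the whole FST into memory, which makes prefix lookups extremely fast. The trade-off is that an FST can only serve prefix queries. To use the Completion Suggester, the field type must be defined as completion.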
PUT /blogs_completion
{
"mappings": {
"properties": {
"body":{
"type": "completion"
}
}
}
}
Insert some test data:
POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Lucene is cool"}
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Elasticsearch builds on top of lucene"}
{ "index" : { "_index" : "blogs_completion"} }
{ "body": "Elasticsearch rocks"}
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Elastic is the company behind ELK stack"}
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "the elk stack rocks"}
{ "index" : { "_index" : "blogs_completion"} }
{ "body": "elasticsearch is rock solid"}
Search example:
POST /blogs_completion/_search?pretty
{
"size": 0,
"suggest": {
"blog-suggest": {
"prefix": "elastic i",
"completion": {
"field": "body"
}
}
}
}
The suggestions returned are as follows:
# partially omitted
"options" : [
{
"text" : "Elastic is the company behind ELK stack",
"_index" : "blogs_completion",
"_type" : "_doc",
"_id" : "SG16oIEB1fsyWKAeKha5",
"_score" : 1.0,
"_source" : {
"body" : "Elastic is the company behind ELK stack"
}
}
]
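The prefix-only behaviour can be illustrated with a small Python stand-in. This is not how ES implements it: a sorted list with binary search replaces the FST, and the class name PrefixIndex is hypothetical. It only shows why a prefix of the whole input matches while a word from the middle does not.

```python
import bisect

INPUTS = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "the elk stack rocks",
    "elasticsearch is rock solid",
]


class PrefixIndex:
    """Sorted-list stand-in for a prefix structure such as an FST."""

    def __init__(self, inputs):
        # Sort by lowercased text so that matches for a prefix are contiguous.
        self._entries = sorted((s.lower(), s) for s in inputs)
        self._keys = [key for key, _ in self._entries]

    def complete(self, prefix):
        p = prefix.lower()
        start = bisect.bisect_left(self._keys, p)
        matches = []
        for key, original in self._entries[start:]:
            if not key.startswith(p):
                break
            matches.append(original)
        return matches


index = PrefixIndex(INPUTS)
print(index.complete("elastic i"))
print(index.complete("rock"))  # empty: no input starts with "rock"
```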
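Note that the analyzer affects the suggestions: with the english analyzer, the word is gets removed as a stop word, so the suggestion above can no longer be matched. preserve_separators and preserve_position_increments also affect the query.
- preserve_separators: when set to false, separators such as spaces are ignored
- preserve_position_increments: if the first word of the suggestion is a stop word and the analyzer removes stop words, this needs to be set to false.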
If the Completion Suggester returns zero matches, the user may well have made a typo, so try the Phrase Suggester next. If the Phrase Suggester finds no options either, fall back to the Term Suggester.
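This fallback order can be sketched as a small Python function. The three suggester functions below are hypothetical stubs that stand in for real requests to ES; only the control flow is the point.

```python
def suggest_with_fallback(text, suggesters):
    """Return (name, options) from the first suggester that yields options."""
    for name, fn in suggesters:
        options = fn(text)
        if options:
            return name, options
    return None, []


# Hypothetical stubs standing in for calls to the three ES suggesters.
def completion_stub(text):
    return []  # pretend the prefix lookup found nothing


def phrase_stub(text):
    return ["lucene and elasticsearch rock"] if " " in text else []


def term_stub(text):
    return ["rocks"]


chain = [
    ("completion", completion_stub),
    ("phrase", phrase_stub),
    ("term", term_stub),
]
print(suggest_with_fallback("lucne and elasticsear rock", chain))
print(suggest_with_fallback("rock", chain))
```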
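Context Suggester
An extension of the Completion Suggester that lets the search carry additional context, so that the same input, such as "star", yields different suggestions depending on that context. For example:
- Coffee: Starbucks
- Movies: Star Wars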
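The idea of context-dependent completion can be sketched in Python as a prefix match that is also filtered by a category. The entries and category names below are assumptions for illustration, and this is not the ES mapping or query syntax.

```python
# Assumed entries: (suggestion text, context category).
ENTRIES = [
    ("starbucks", "coffee"),
    ("star wars", "movie"),
]


def complete(prefix, category, entries=ENTRIES):
    """Return suggestions that start with prefix and belong to category."""
    p = prefix.lower()
    return [text for text, cat in entries if cat == category and text.startswith(p)]


print(complete("star", "coffee"))
print(complete("star", "movie"))
```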