Recently, one of our MySQL v8.0.20 database servers (48 cores, 256 GB RAM) started seeing CPU spikes at a fixed time every day. The alerting platform kept firing warnings about CPU usage above 90%, together with alerts that /tmp usage had exceeded 95%.
Pushed along by management, I finally got around to dealing with it a few days later.
For this kind of sudden CPU spike, the cause is usually a single very inefficient SQL statement, combined with concurrency arriving faster than that slow SQL can complete. Server resources keep getting eaten up until they are exhausted (a balance being broken; you can picture it for yourself). So after ruling out OS-level problems, the approach was still to find that SQL statement first.
Here is a small script, yw1_mysql_get_top5_sql.sh, for catching the SQL statements that consume the most CPU; feel free to adapt it to your own environment. Without further ado, the script follows. Of course, the slow query log is another place to dig.
#!/bin/bash
###Top 5 sql
###If you want the top 10 sql instead of 5, change the -A 5 in the "top -Hp" grep to -A 10 and the -A 6 used to build yw1_mysql_top_threads to -A 11
###Author:yideng
yw1_config_file_hidden=".config.cnf"
yw1_current_path=`pwd`
yw1_log_file_name="$yw1_current_path/log/cpu_top5_`date +%Y%m%d%H%M%S`.log"
[ ! -d log ] && mkdir ./log
#get username and passwd
yw1_mysql_user=`cat $yw1_current_path/$yw1_config_file_hidden |awk -F'=' '/USER/{print $2}'`
yw1_mysql_pd=`cat $yw1_current_path/$yw1_config_file_hidden |awk -F'=' '/PASSWORD/{print $2}'`
#echo -e "hint of command:top and top -Hp"
#read -p "mysql PID:" yw1_mysql_PID
echo "=====`date +%F' '%T`=====" >> $yw1_log_file_name
#get MYSQL PID
yw1_mysql_PID=`ps -aux | grep "mysql" | grep "mysql.sock"|grep -v "grep"|awk '{print $2}'`
top -b -n 1 | grep 'top -' -A 11 >> $yw1_log_file_name
echo -e "\n=========yw1_mysql_PID_tag=========">> $yw1_log_file_name
#per-thread view of the mysqld process (batch mode so the output can be redirected)
top -b -n 1 -Hp $yw1_mysql_PID | grep 'PID USER' -A 5 >> $yw1_log_file_name
#OS thread ids of the busiest threads: the PID column (first field) of the top -H lines
yw1_mysql_top_threads=`cat $yw1_log_file_name |grep "yw1_mysql_PID_tag" -A 6 |grep -v "PID USER" |grep -v "yw1_mysql_PID_tag"|awk '{print $1}'`
echo -e "\n===========TOP 5 SQL===========">> $yw1_log_file_name
for line in $yw1_mysql_top_threads;do
#map the OS thread id to its performance_schema thread, then pull the current statement digest
yw1_thread_id=`mysql -u"$yw1_mysql_user" -p${yw1_mysql_pd} -N -e "select THREAD_ID from performance_schema.threads where THREAD_OS_ID=$line;" 2> /dev/null`
echo -e "\nthread:$line yw1_thread_id:$yw1_thread_id\n------------------------------">> $yw1_log_file_name
mysql -u"$yw1_mysql_user" -p${yw1_mysql_pd} -N -e "select DIGEST_TEXT from performance_schema.events_statements_current where THREAD_ID=$yw1_thread_id;" 2> /dev/null >> $yw1_log_file_name
done
cat $yw1_log_file_name
echo -e "\n=======collect completed======="
echo -e "log path:\n$yw1_log_file_name"
That strange SQL statement was finally found, and it left me baffled: the result is a single row, so why paginate? It only needs to count name values, so why also select b.id? The id column is the primary key, so why GROUP BY it? (Heavy GROUP BY sorting in MySQL spills into the /tmp directory.) Still, none of that is the main issue.
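As an aside, the /tmp pressure from on-disk temporary tables and sorts can be confirmed from the server itself; a quick check with standard variables and status counters (nothing here is specific to this incident):
-- where filesort / spilled internal temporary tables are written
SHOW VARIABLES LIKE 'tmpdir';
-- size limits for in-memory temporary tables; results larger than this go to disk
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
-- how many temporary tables were created and how many of them ended up on disk
SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';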
Testing it and tracing it through the slow query log showed it runs in a bit over 5 seconds when nothing is blocking, 30-odd seconds when blocked, and sometimes as long as 120+ seconds.
SELECT COUNT(DISTINCT(r.xxx_name))
FROM (SELECT b.id, b.xxx_name
FROM bbb_bbbb_bbbbb_info b
LEFT JOIN elel_ell_list el
ON b.ENT_abcd = el.ENT_abcdefg
WHERE b.status = 1
and abcd_abc_type = 1
GROUP BY b.id) r LIMIT 14300,
100;
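Before any deeper tuning, the statement itself can be stated much more simply. The sketch below is my hedged reading of what it actually means: the GROUP BY on the primary key only de-duplicates join fan-out, which COUNT(DISTINCT ...) ignores anyway; the predicate on abcd_abc_type (a column of el, per the breakdown below) turns the LEFT JOIN into an inner join, so it can be written as a semi-join; and a LIMIT with an offset on a one-row aggregate only throws the row away, so it is dropped:
SELECT COUNT(DISTINCT b.xxx_name)
FROM bbb_bbbb_bbbbb_info b
WHERE b.status = 1
  AND EXISTS (SELECT 1
              FROM elel_ell_list el
              WHERE el.ENT_abcdefg = b.ENT_abcd
                AND el.abcd_abc_type = 1);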
First, a look at the overall table sizes:
The el table has about 76,000 rows and the b table about 750,000. These are two fairly modest tables, so why is the query so slow?
Let's dig into the columns the SQL uses, their cardinality and structure (a sketch of how such figures can be measured follows the list):
- The id columns of both el and b are primary keys;
- abcd_abc_type belongs to the el table and is indexed (the vast majority of values are 1, and its cardinality on el is 2, so it would be better off without the index);
- b.status is 1 for roughly 600,000+ rows, with a scattering of other values on the rest (no index, and an index on such a column would be pointless anyway);
- b.ENT_abcd is varchar(50); against the 750,000 rows of table b its cardinality is just under 50,000, a selectivity of only about 6.3% (this is the key point);
- el.ENT_abcdefg is varchar(40) with high cardinality, a selectivity of roughly 90%;
- xxx_name has no index.
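The cardinality and selectivity figures above can be obtained with plain counts; a sketch of the kind of query used (table and column names as in the statement):
-- selectivity of the join column on the big table: distinct values / total rows
SELECT COUNT(DISTINCT ENT_abcd)            AS cardinality,
       COUNT(*)                            AS total_rows,
       COUNT(DISTINCT ENT_abcd) / COUNT(*) AS selectivity
FROM bbb_bbbb_bbbbb_info;
-- same check on the other side of the join
SELECT COUNT(DISTINCT ENT_abcdefg) / COUNT(*) AS selectivity
FROM elel_ell_list;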
Knowing this, experienced readers will already see why it is slow; let's confirm it with the execution plan.
+----+-------------+------------+------------+-------+-------------------------------------+--------------------+---------+-----------------+--------+----------+------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+-------+-------------------------------------+--------------------+---------+-----------------+--------+----------+------------------------------------+
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 36367 | 100.00 | NULL |
| 2 | DERIVED | b | NULL | index | PRIMARY,idx_xxxx_xxx_xx_xx | PRIMARY | 8 | NULL | 727364 | 10.00 | Using where |
| 2 | DERIVED | el | NULL | ref | idx_xxxx_xxxx_xxxx,idx_xxx_xxx_xxxx | idx_xxx_xxxxx_xxxx | 162 | xxxx.b.xxx_xxxx | 1 | 50.00 | Using index condition; Using where |
+----+-------------+------------+------------+-------+-------------------------------------+--------------------+---------+-----------------+--------+----------+------------------------------------+
3 rows in set, 1 warning (0.01 sec)
Look at this part:
LEFT JOIN elel_ell_list el ON b.ENT_abcd = el.ENT_abcdefg
b.ENT_abcd is the main join condition, and its selectivity is only about 6%. This shows up in the execution plan as well: filtered is only around 10%.
We did not give up just yet, though. After refreshing the statistics with ANALYZE TABLE and defragmenting the tables, and then repeatedly rewriting the query and adding optimizer hints, the execution time finally came down to 2.25 s.
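For reference, the maintenance steps are standard MySQL statements; the hint below is only an example of the kind of rewrite that was tried, not the exact production statement:
-- refresh the optimizer statistics
ANALYZE TABLE bbb_bbbb_bbbbb_info, elel_ell_list;
-- rebuild the InnoDB tables to clear fragmentation
ALTER TABLE bbb_bbbb_bbbbb_info ENGINE=InnoDB;
ALTER TABLE elel_ell_list ENGINE=InnoDB;
-- example of MySQL 8.0 optimizer hint syntax: pin the join order explicitly
SELECT /*+ JOIN_ORDER(el, b) */ COUNT(DISTINCT b.xxx_name)
FROM elel_ell_list el
JOIN bbb_bbbb_bbbbb_info b ON b.ENT_abcd = el.ENT_abcdefg
WHERE b.status = 1
  AND el.abcd_abc_type = 1;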
But that still did not meet the business requirement, so we insisted on changing the business logic: why use such a low-cardinality column as the join condition, and was there really no other table that could serve this need?
After some thought, the developer did find another table that could, and a single-table primary-key query was enough (a hypothetical sketch follows).
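The final production query is not recorded in my notes; purely to illustrate what "single-table primary-key query" means here, something along these lines (table and column names are hypothetical):
-- hypothetical: the count is kept in a table keyed by the entity id,
-- so the join and the aggregate over 750k rows disappear entirely
SELECT xxx_name_count
FROM ent_name_summary          -- hypothetical table
WHERE id = ?;                  -- primary-key lookup, single row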
With that, the optimization on the production side was done.
Finally, I wanted to roughly reproduce the execution behaviour of this statement in a test environment:
mysql> select * from t100w limit 1; -- t100w and testjoin share the same table structure
+----+-------+------+------+---------------------+
| id | num | k1 | k2 | dt |
+----+-------+------+------+---------------------+
| 1 | 25503 | 0M | IJ56 | 2019-08-12 11:41:16 |
+----+-------+------+------+---------------------+
1 row in set (0.00 sec)
mysql> select count(*) from t100w;
+----------+
| count(*) |
+----------+
| 750000 |
+----------+
1 row in set (0.05 sec)
mysql> select count(*) from testjoin;
+----------+
| count(*) |
+----------+
| 80000 |
+----------+
1 row in set (0.00 sec)
-- the cardinality ratio here is even lower than in production
mysql> select count(*) from (select count(*) as c1 from t100w t group by k1) c2 ;
+----------+
| count(*) |
+----------+
| 1225 |
+----------+
1 row in set (0.62 sec)
mysql> select count(*) from (select count(*) as c1 from t100w t group by k2) c2 ;
+----------+
| count(*) |
+----------+
| 1296 |
+----------+
1 row in set (0.83 sec)
-- index setup: essentially the same indexes as production; column k2 has no index
mysql> show index from t100w;
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| t100w | 0 | PRIMARY | 1 | id | A | 748032 | NULL | NULL | | BTREE | | | YES | NULL |
| t100w | 1 | k1 | 1 | k1 | A | 1310 | NULL | NULL | YES | BTREE | | | YES | NULL |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
2 rows in set (0.01 sec)
mysql> show index from testjoin;
+----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| testjoin | 0 | PRIMARY | 1 | id | A | 77683 | NULL | NULL | | BTREE | | | YES | NULL |
| testjoin | 1 | idx_k1 | 1 | k1 | A | 493 | NULL | NULL | YES | BTREE | | | YES | NULL |
+----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
2 rows in set (0.01 sec)
-- the statement as simulated in the test environment; the two '=1' predicates were dropped, since their cardinality is so low they are basically equivalent to 1=1
select distinct(count(r.k2)) from (select t1.k2 from t100w t1 left join testjoin t2 on t1.k1=t2.k1 ) r;
+---------------+
| (count(r.k2)) |
+---------------+
| 55355739 |
+---------------+
1 row in set (22.10 sec)
-- the actual execution plan (EXPLAIN ANALYZE): roughly 74 matching t2 rows per t1 row * 750,000 t1 rows, about 55 million joined rows
desc analyze select distinct(count(r.k2)) from (select t1.k2 from t100w t1 left join testjoin t2 on t1.k1=t2.k1 ) r;
| EXPLAIN
| -> Aggregate: count(t1.k2) (cost=24584501.44 rows=117868903) (actual time=29190.135..29190.136 rows=1 loops=1)
-> Nested loop left join (cost=12797611.14 rows=117868903) (actual time=0.111..26300.421 rows=55355740 loops=1)
-> Table scan on t1 (cost=77077.32 rows=748032) (actual time=0.059..345.737 rows=750000 loops=1)
-> Covering index lookup on t2 using idx_k1 (k1=t1.k1) (cost=1.25 rows=158) (actual time=0.021..0.030 rows=74 loops=750000)
|
1 row in set (29.19 sec)
-- execution plan, tree format
desc format=tree select distinct(count(r.k2)) from (select t1.k2 from t100w t1 left join testjoin t2 on t1.k1=t2.k1 ) r;
| EXPLAIN
| -> Aggregate: count(t1.k2) (cost=24584501.44 rows=117868903)
-> Nested loop left join (cost=12797611.14 rows=117868903)
-> Table scan on t1 (cost=77077.32 rows=748032)
-> Covering index lookup on t2 using idx_k1 (k1=t1.k1) (cost=1.25 rows=158)
|
-- execution plan, JSON format
desc format=json select distinct(count(r.k2)) from (select t1.k2 from t100w t1 left join testjoin t2 on t1.k1=t2.k1 ) r;
| EXPLAIN
| {
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "12797611.14"
},
"nested_loop": [
{
"table": {
"table_name": "t1",
"access_type": "ALL",
"rows_examined_per_scan": 748032,
"rows_produced_per_join": 748032,
"filtered": "100.00",
"cost_info": {
"read_cost": "2274.12",
"eval_cost": "74803.20",
"prefix_cost": "77077.32",
"data_read_per_join": "28M"
},
"used_columns": [
"k1",
"k2"
]
}
},
{
"table": {
"table_name": "t2",
"access_type": "ref",
"possible_keys": [
"idx_k1"
],
"key": "idx_k1",
"used_key_parts": [
"k1"
],
"key_length": "9",
"ref": [
"test.t1.k1"
],
"rows_examined_per_scan": 157,
"rows_produced_per_join": 117868902,
"filtered": "100.00",
"using_index": true,
"cost_info": {
"read_cost": "933643.52",
"eval_cost": "11786890.30",
"prefix_cost": "12797611.14",
"data_read_per_join": "4G"
},
"used_columns": [
"k1"
]
}
}
]
}
} |
Even with the primary key id as the join condition, it still takes a noticeable amount of time:
mysql> select distinct(count(r.k2)) from (select t1.k2 from t100w t1 left join testjoin t2 on t1.id=t2.id ) r;
+---------------+
| (count(r.k2)) |
+---------------+
| 750000 |
+---------------+
1 row in set (3.92 sec)
That is as far as my limited knowledge takes me for now; SQL built on this kind of business logic leaves me feeling rather helpless.
Hopefully situations like this can be avoided from the design stage of the business onwards.
If anyone more experienced can point out a better way, I would be very grateful!