最近Mysql v8.20 48c 256g的数据库服务器每天固定时间CPU飙升,告警平台超越90%的正告信息不断的报
一起顺便/tmp使用超95%告警

在领导不断安排下,几天后总算开端着手处理这个工作了。

关于这种突发的CPU冲高,大部分仍是因为某条SQL句子功率很低,一起有伴随着大于低效SQL运行功率的并发呈现,造成了服务器资源不断被占用,直到服务器资源耗尽(这是一种平衡被打破的状况,具体咱们可以领会一下),所以在扫除OS问题后,思路仍是先找到那条SQL句子。
顺便一个抓取大量消耗CPU的SQL的小脚本yw1_mysql_get_top5_sql.sh,有兴趣可以自己本地化一下,废话不说,上脚本,当然也有慢日志可以挖。

#!/bin/bash
###Top 5 sql
###If you want to get top 10 sql,please modify variable yw1_mysql_top_threads's -A 10
###Author:yideng
yw1_config_file_hidden=".config.cnf"
yw1_current_path=`pwd`
yw1_log_file_name="$yw1_current_path/log/cpu_top5_`date +%Y%m%d%H%M%S`.log"
[ ! -d log ] && mkdir ./log
#get username and passwd
yw1_mysql_user=`cat $yw1_current_path/$yw1_config_file_hidden |awk -F'=' '/USER/{print $2}'`
yw1_mysql_pd=`cat $yw1_current_path/$yw1_config_file_hidden |awk -F'=' '/PASSWORD/{print $2}'`
#echo -e "hint of command:top and top -Hp"
#read -p "mysql PID:" yw1_mysql_PID
echo "=====`date +%F' '%T`=====" >> $yw1_log_file_name
#get MYSQL PID
yw1_mysql_PID=`ps -aux | grep  "mysql" | grep "mysql.sock"|grep -v "grep"|awk '{print $2}'`
top -b -n 1 | grep 'top -' -A 11 >> $yw1_log_file_name
echo -e  "\n=========yw1_mysql_PID_tag=========">> $yw1_log_file_name
top -n 1 -Hp $yw1_mysql_PID | grep 'PID USER' -A 5 >> $yw1_log_file_name;
yw1_mysql_top_threads=`cat $yw1_log_file_name |grep "yw1_mysql_PID_tag" -A 6 |grep -v "PID USER" |grep -v "yw1_mysql_PID_tag"|awk '{print$2}'`
echo -e  "\n===========TOP 5 SQL===========">> $yw1_log_file_name
for line in $yw1_mysql_top_threads;do
yw1_thread_id=`mysql -u "$user" -p${yw1_mysql_pd} -N -e  "select yw1_thread_id from performance_schema.threads where THREAD_OS_ID=$line;"  2> /dev/null`
echo -e "\nthread:$line   yw1_thread_id:$yw1_thread_id\n------------------------------">> $yw1_log_file_name
mysql -u"$yw1_mysql_user" -p${yw1_mysql_pd} -N -e "select DIGEST_TEXT from performance_schema.events_statements_current where yw1_thread_id=$yw1_thread_id;" 2> /dev/null >> $yw1_log_file_name
done
cat $yw1_log_file_name
echo -e  "\n=======collect completed======="
echo -e "log path:\n$yw1_log_file_name"

这条奇怪的SQL总算找到了。我看到这条SQL都蒙了,结果就一条为什么还要分页?明明只需要统计name的数量,为什么还要列出b.id?id列是个主键,为什么还要group by?(大量group by排序在Mysql里将用/tmp目录)不过这些都不是首要的。

测验+慢日志追踪它,发现不堵塞的时分履行时间是5秒多,堵塞的时分履行功率是30多秒,有时高达120多秒。

SELECT COUNT(DISTINCT(r.xxx_name))
  FROM (SELECT b.id, b.xxx_name
          FROM bbb_bbbb_bbbbb_info b
          LEFT JOIN elel_ell_list el
            ON b.ENT_abcd = el.ENT_abcdefg
         WHERE b.status = 1
           and abcd_abc_type = 1
         GROUP BY b.id) r LIMIT 14300,
       100;

先全表数据量了解一下:

el表数据量7.6万,b表数据量75多万,明明是两个不太大的表,为何功率会这么低下呢?

让咱们再深入了解一下SQL使用的列的数据量与结构:

  1. el与b表的id列均为主键;
  2. abcd_abc_type列是el表的,有索引(绝大部分值都是1,并且此列在el表上的基数为2,还不如不要索引);
  3. b.status表大约有60多万行值为1,其他乱七八糟有几个其他值(无索引,这种列建了索引也没什么意义);
  4. b.ENT_abcd类型varchar(50)关于75万行数据的B表基数为5W不到一点,份额仅为6.3%;(重点在此)
  5. el.ENT_abcdefg类型varchar(40)列基数很高,份额90%左右。
  6. xxx_name字段无索引。

了解到这里,有经验的小伙伴都知道了慢的原因了,下面咱们持续经过履行计划进行求证。

+----+-------------+------------+------------+-------+-------------------------------------+--------------------+---------+-----------------+--------+----------+------------------------------------+
| id | select_type | table      | partitions | type  | possible_keys                       | key                | key_len | ref             | rows   | filtered | Extra                              |
+----+-------------+------------+------------+-------+-------------------------------------+--------------------+---------+-----------------+--------+----------+------------------------------------+
|  1 | PRIMARY     | <derived2> | NULL       | ALL   | NULL                                | NULL               | NULL    | NULL            |  36367 |   100.00 | NULL                               |
|  2 | DERIVED     | b          | NULL       | index | PRIMARY,idx_xxxx_xxx_xx_xx          | PRIMARY            | 8       | NULL            | 727364 |    10.00 | Using where                        |
|  2 | DERIVED     | el         | NULL       | ref   | idx_xxxx_xxxx_xxxx,idx_xxx_xxx_xxxx | idx_xxx_xxxxx_xxxx | 162     | xxxx.b.xxx_xxxx |      1 |    50.00 | Using index condition; Using where |
+----+-------------+------------+------------+-------+-------------------------------------+--------------------+---------+-----------------+--------+----------+------------------------------------+
3 rows in set, 1 warning (0.01 sec)

咱们看下
LEFT JOIN elel_ell_list el ON b.ENT_abcd = el.ENT_abcdefg
b.ENT_abcd 作为join的首要条件,基数份额仅为6%,这点从履行计划里也看到了,filtered仅为10%左右。

不过咱们仍是没有抛弃,在先收集统计信息ANALYZE TABLE,做碎片整理后。
然后再不断尝试改写,加hit的状况下,最终履行功率到达了2.25s
可是并未到达事务需求,果断要求改事务逻辑,为何用如此低基数的列做为匹配条件,莫非就没有其他表能完成此事务吗?
经过开发小哥的思考,总算找到其他表来完成此事务,单表主键查询就可以了

至此,出产环境优化结束。

最终必须在测验环境大约重现出此句子履行状况:

mysql> select * from t100w limit 1;  --t100w,testjoin两张表结构相同
+----+-------+------+------+---------------------+
| id | num   | k1   | k2   | dt                  |
+----+-------+------+------+---------------------+
|  1 | 25503 | 0M   | IJ56 | 2019-08-12 11:41:16 |
+----+-------+------+------+---------------------+
1 row in set (0.00 sec)
mysql> select count(*) from t100w;
+----------+
| count(*) |
+----------+
|   750000 |
+----------+
1 row in set (0.05 sec)
mysql> select count(*) from testjoin;
+----------+
| count(*) |
+----------+
|    80000 |
+----------+
1 row in set (0.00 sec)
--基数比更低一些
mysql> select count(*) from (select count(*) as c1 from t100w t group by k1) c2 ; 
+----------+
| count(*) |
+----------+
|     1225 |
+----------+
1 row in set (0.62 sec)
mysql> select count(*) from (select count(*) as c1 from t100w t group by k2) c2 ;
+----------+
| count(*) |
+----------+
|     1296 |
+----------+
1 row in set (0.83 sec)
--索引创立状况,索引基本相同,字段k2上无索引
mysql> show index from t100w;
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| t100w |          0 | PRIMARY  |            1 | id          | A         |      748032 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| t100w |          1 | k1       |            1 | k1          | A         |        1310 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
2 rows in set (0.01 sec)
mysql>  show index from testjoin;
+----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table    | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| testjoin |          0 | PRIMARY  |            1 | id          | A         |       77683 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| testjoin |          1 | idx_k1   |            1 | k1          | A         |         493 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
+----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
2 rows in set (0.01 sec)
--测验环境模拟的句子--后面两个=1的条件省掉了,因为基数太小,基本等于1=1;
select distinct(count(r.k2)) from (select t1.k2 from t100w t1 left join testjoin t2 on t1.k1=t2.k1 ) r;
+---------------+
| (count(r.k2)) |
+---------------+
|      55355739 |
+---------------+
1 row in set (22.10 sec)
--真实履行计划
desc analyze select distinct(count(r.k2)) from (select t1.k2 from t100w t1 left join testjoin t2 on t1.k1=t2.k1 ) r;
| EXPLAIN
| -> Aggregate: count(t1.k2)  (cost=24584501.44 rows=117868903) (actual time=29190.135..29190.136 rows=1 loops=1)
    -> Nested loop left join  (cost=12797611.14 rows=117868903) (actual time=0.111..26300.421 rows=55355740 loops=1)
        -> Table scan on t1  (cost=77077.32 rows=748032) (actual time=0.059..345.737 rows=750000 loops=1)
        -> Covering index lookup on t2 using idx_k1 (k1=t1.k1)  (cost=1.25 rows=158) (actual time=0.021..0.030 rows=74 loops=750000)
 |
1 row in set (29.19 sec)
--tree格局,履行计划
desc format=tree select distinct(count(r.k2)) from (select t1.k2 from t100w t1 left join testjoin t2 on t1.k1=t2.k1 ) r;
| EXPLAIN
| -> Aggregate: count(t1.k2)  (cost=24584501.44 rows=117868903)
    -> Nested loop left join  (cost=12797611.14 rows=117868903)
        -> Table scan on t1  (cost=77077.32 rows=748032)
        -> Covering index lookup on t2 using idx_k1 (k1=t1.k1)  (cost=1.25 rows=158)
 |
--json格局,履行计划
desc format=json select distinct(count(r.k2)) from (select t1.k2 from t100w t1 left join testjoin t2 on t1.k1=t2.k1 ) r;
| EXPLAIN
| {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "12797611.14"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ALL",
          "rows_examined_per_scan": 748032,
          "rows_produced_per_join": 748032,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "2274.12",
            "eval_cost": "74803.20",
            "prefix_cost": "77077.32",
            "data_read_per_join": "28M"
          },
          "used_columns": [
            "k1",
            "k2"
          ]
        }
      },
      {
        "table": {
          "table_name": "t2",
          "access_type": "ref",
          "possible_keys": [
            "idx_k1"
          ],
          "key": "idx_k1",
          "used_key_parts": [
            "k1"
          ],
          "key_length": "9",
          "ref": [
            "test.t1.k1"
          ],
          "rows_examined_per_scan": 157,
          "rows_produced_per_join": 117868902,
          "filtered": "100.00",
          "using_index": true,
          "cost_info": {
            "read_cost": "933643.52",
            "eval_cost": "11786890.30",
            "prefix_cost": "12797611.14",
            "data_read_per_join": "4G"
          },
          "used_columns": [
            "k1"
          ]
        }
      }
    ]
  }
} |

就算使用id主键作为join匹配条件耗时也较久:

mysql>  select distinct(count(r.k2)) from (select t1.k2 from t100w t1 left join testjoin t2 on t1.id=t2.id ) r;
+---------------+
| (count(r.k2)) |
+---------------+
|        750000 |
+---------------+
1 row in set (3.92 sec)

孤陋寡闻,目前只能做到这里了,拿这样的事务逻辑SQL有种无法的感觉‍♂️

希望事务从设计开端就能避免这样的状况呈现

如果有大佬能点拨一下,不胜感激!