How to Address the Impact of Intel's PAUSE Instruction Change on MySQL Performance

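Thanks to its open-source nature, mature commercial operation, well-run community, and steady iteration and refinement of features, MySQL has become the standard relational database of the internet industry. It is fair to say that x86 servers and Linux, as infrastructure, together with MySQL form the cornerstone of internet data storage services, and the three complement one another. This article shares a real case from our work: a change in the latency of Intel's PAUSE instruction caused a MySQL performance bottleneck. It describes how the Meituan MySQL DBA team analyzed, located, and optimized the problem step by step across these three layers. We hope the approach is useful to others.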

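1. Background

In 2017, Intel released its new-generation server platform, Purley, and reorganized the Intel Xeon Scalable Processor family into four tiers: Platinum, Gold, Silver, and Bronze. Product positioning and architecture also became clearer.

Meituan's back-end services, such as high-volume online transactions and storage, depend on a large number of high-performance servers. As some of the online Grantly-platform E-series servers approached the end of their life cycle, and as the products themselves evolved, starting in 2019 the RDS (relational database service) back-end storage (MySQL) began deploying Purley-platform Skylake CPU servers in large numbers, including the Silver 4110.

Compared with the previous-generation E5-2620 V4, the Silver 4110 supports a higher memory frequency, more memory channels, a larger L2 cache, and a faster bus transfer rate. Intel's official figures show the Silver 4110 delivering 10% more performance than the E5-2620 V4.

However, as the number of Skylake servers online grew and more and more services were migrated onto them, the Meituan MySQL DBA team found that the performance of some MySQL instances did not match expectations and sometimes dropped substantially. After continued performance analysis, we identified the following bottlenecks on Skylake servers:

Relatively high CPU load.
Reduced throughput, such as TPS.

Next, we analyze and locate the problem from three angles (Intel CPUs, the ut_delay function, and the PAUSE instruction) and explore possible optimizations.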

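2. Performance Analysis

2.1 Performance Differences Between Grantly and Purley CPUs

First, we ran benchmarks on the CPUs of the two platform generations (Grantly and Purley) and compared their performance side by side under different operating systems.

The benchmark data can be summarized as follows:

1. In the oltp_write_only (write-only) scenario, the performance of the Purley 4110 drops noticeably.
2. On the same Purley 4110, CentOS 7 performs better than CentOS 6 in oltp_write_only (write-only).

We use a two-dimensional line chart to show the performance differences:

In the chart above, on the same Purley 4110, CentOS 7 performs better than CentOS 6. The specific reasons are not the focus of this article, so we only outline them briefly here.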

New MCS-based Locking Mechanism

Red Hat Enterprise Linux 7.1 introduces a new locking mechanism, MCS locks. This new locking mechanism significantly reduces spinlock overhead in large systems, which makes spinlocks generally more efficient in Red Hat Enterprise Linux 7.1.

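The Red Hat release notes show that, starting with kernel 3.10.0-229, a new locking mechanism, the MCS lock, was introduced. It reduces spinlock overhead and so runs more efficiently. With an ordinary spinlock on a multi-core CPU, all waiting CPUs spin on the same shared lock variable, which only one of them can acquire at a time. To keep the data correct, the cache coherence protocol must synchronize and invalidate the state and data of that cache line across all CPUs, which degrades performance. An MCS lock instead gives each CPU its own local "spinlock" variable and spins only on that local copy, avoiding the cache-line synchronization and improving performance. The community debate over spinlock optimization has been considerable, however; qspinlock was later implemented on top of MCS and merged in the 4.x kernels. For implementation details, see: MCS locks and qspinlocks.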

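Having outlined the performance improvement in CentOS 7, we now take a closer look at why the Skylake Silver 4110 CPU degrades performance.

3. CPU Performance Tracing

3.1 Locating the Hotspot Function

We pinpointed the 4110 performance bottleneck in the following steps:

First, use perf top to trace CPU overhead on Linux.
Then, use perf record to record the share of CPU cycles consumed by each function.
Finally, use a flame graph to verify and locate the hotspot function.

The function that accounts for the largest share of CPU consumption is ut_delay.

Let's dig further into the call chain: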

# Children      Self  Command  Shared Object        Symbol
# ........  ........  .......  ...................  ........................
#
93.54%     0.00%  mysqld   libpthread-2.17.so   [.] start_thread
|
---start_thread
|
|--77.07%--pfs_spawn_thread
|          |
|           --77.05%--handle_connection
|                     |
|                      --76.97%--do_command
|                                |
|                                |--74.30%--dispatch_command
|                                |          |
|                                |          |--71.16%--mysqld_stmt_execute
|                                |          |          |
|                                |          |           --70.74%--Prepared_statement::execute_loop
|                                |          |                     |
|                                |          |                     |--69.53%--Prepared_statement::execute
|                                |          |                     |          |
|                                |          |                     |          |--67.90%--mysql_execute_command
|                                |          |                     |          |          |
|                                |          |                     |          |          |--23.43%--trans_commit_stmt
|                                |          |                     |          |          |          |
|                                |          |                     |          |          |           --23.30%--ha_commit_trans
|                                |          |                     |          |          |                     |
|                                |          |                     |          |          |                     |--18.86%--MYSQL_BIN_LOG::commit
|                                |          |                     |          |          |                     |          |
|                                |          |                     |          |          |                     |           --18.18%--MYSQL_BIN_LOG::ordered_commit
|                                |          |                     |          |          |                     |                     |
|                                |          |                     |          |          |                     |                     |--8.02%--MYSQL_BIN_LOG::change_stage
|                                |          |                     |          |          |                     |                     |          |
|                                |          |                     |          |          |                     |                     |          |--2.35%--__lll_unlock_wake
|                                |          |                     |          |          |                     |                     |          |          |
|                                |          |                     |          |          |                     |                     |          |           --2.24%--system_call_fastpath
|                                |          |                     |          |          |                     |                     |          |                     |
|                                |          |                     |          |          |                     |                     |          |                      --2.24%--sys_futex
|                                |          |                     |          |          |                     |                     |          |                                |
|                                |          |                     |          |          |                     |                     |          |                                 --2.23%--do_futex
|                                |          |                     |          |          |                     |                     |          |                                           |
|                                |          |                     |          |          |                     |                     |          |                                            --2.14%--futex_wake
|                                |          |                     |          |          |                     |                     |          |                                                      |
|                                |          |                     |          |          |                     |                     |          |                                                       --1.38%--wake_up_q
|                                |          |                     |          |          |                     |                     |          |                                                                 |
|                                |          |                     |          |          |                     |                     |          |                                                                  --1.33%--try_to_wake_up
...

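The call chain above is shown more intuitively as a flame graph:

At this point it is fairly certain that, across all these call paths, most of the cost ultimately ends up in ut_delay.

3.2 The Relationship Between ut_delay and PAUSE and Its Performance Impact

3.2.1 MySQL's ut_delay Implementation

Next, let's look at what the ut_delay function does in the MySQL source: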

/***************************************************************//**
Runs an idle loop on CPU. The argument gives the desired delay
in microseconds on 100 MHz Pentium + Visual C++.
@return dummy value */
ulint
ut_delay(
/*=====*/
    ulint delay)  /*!< in: delay in microseconds on 100 MHz Pentium */
{
    ulint i, j;

    UT_LOW_PRIORITY_CPU();

    j = 0;

    for (i = 0; i < delay * 50; i++) {
        j += i;
        UT_RELAX_CPU();
    }

    UT_RESUME_PRIORITY_CPU();

    return(j);
}
...

#   define UT_RELAX_CPU() asm ("pause" )
#   define UT_RELAX_CPU() __asm__ __volatile__ ("pause")

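As shown, MySQL's spin loop invokes the PAUSE instruction to improve the performance of the spin-wait loop.

3.2.2 Evolution of the PAUSE Instruction Latency

Intel's documentation also describes the change to PAUSE in the new platform architecture: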

Pause Latency in Skylake Microarchitecture

The PAUSE instruction is typically used with software threads executing on two logical processors located in the same processor core, waiting for a lock to be released. Such short wait loops tend to last between tens and a few hundreds of cycles, so performance-wise it is better to wait while occupying the CPU than yielding to the OS. When the wait loop is expected to last for thousands of cycles or more, it is preferable to yield to the operating system by calling an OS synchronization API function, such as WaitForSingleObject on Windows* OS or futex on Linux.

…

The latency of the PAUSE instruction in prior generation microarchitectures is about 10 cycles, whereas in Skylake microarchitecture it has been extended to as many as 140 cycles.

The increased latency (allowing more effective utilization of competitively-shared microarchitectural resources to the logical processor ready to make forward progress) has a small positive performance impact of 1-2% on highly threaded applications. It is expected to have negligible impact on less threaded applications if forward progress is not blocked executing a fixed number of looped PAUSE instructions. There's also a small power benefit in 2-core and 4-core systems.

As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss.

…

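In the previous-generation architecture (Grantly-platform E series), PAUSE takes about 10 cycles; in the new Skylake architecture it takes up to 140 cycles.
If a program uses a fixed number of looped PAUSE instructions to implement a delay and thereby block execution, it may incur unexpected latency.
Because the PAUSE latency has increased, applications sensitive to PAUSE will suffer some performance loss.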

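A simplified formula for program execution time:

ExecutionTime(T) = InstructionCount × TimePerCycle × CPI

That is: program execution time = total instruction count × time per CPU clock cycle × average clock cycles per instruction.

MySQL's internal spin is implemented as a loop of a fixed number of PAUSE instructions. So when the PAUSE latency increases, the time spent spinning increases too; program execution time grows accordingly, and overall system throughput suffers.

Clearly, the Intel documentation states that the PAUSE latency differs across platforms and CPU architectures.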

Below, we use a test case to roughly verify and compare the PAUSE cycle counts on old and new CPU architectures:

#include <stdio.h>
#define TIMES 5

/* Read the x86 time-stamp counter. */
static inline unsigned long long rdtsc(void)
{
    unsigned int low, high;
    __asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high));
    return ((unsigned long long)low) | (((unsigned long long)high) << 32);
}

/* Execute TIMES * 20 = 100 PAUSE instructions. */
void pause_test(void)
{
    int i;
    for (i = 0; i < TIMES; i++) {
        __asm__ __volatile__(
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
            "pause\n"
        );
    }
}

/* Print the average number of TSC cycles per PAUSE. */
unsigned long pause_cycle(void)
{
    unsigned long long start, finish, elapsed;
    start = rdtsc();
    pause_test();
    finish = rdtsc();
    elapsed = finish - start;
    printf("Approximate cycles per PAUSE: %llu\n", elapsed / 100);
    return 0;
}

int main(void)
{
    pause_cycle();
    return 0;
}

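The results are summarized below:

The 4110 and 5118 have a long PAUSE latency, both above 100 cycles; they belong to Skylake, the first-generation Purley architecture.
The 4210 and 5218 improve on the previous generation because both belong to Cascade Lake, the second-generation Purley architecture, in which the PAUSE instruction was optimized.

3.2.3 Speculation on Why Intel Increased the PAUSE Latency

We speculate that Intel increased the PAUSE latency to reduce the probability of spinlock conflicts and to lower power consumption; the side effect is that PAUSE takes longer to execute, which reduces overall throughput.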

The increased latency (allowing more effective utilization of competitively-shared microarchitectural resources to the logical processor ready to make forward progress) has a small positive performance impact of 1-2% on highly threaded applications. It is expected to have negligible impact on less threaded applications if forward progress is not blocked executing a fixed number of looped PAUSE instructions.

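3.3 Analysis of the Write Bottleneck Caused by PAUSE

Next, we analyze in depth why the PAUSE instruction causes a MySQL write bottleneck.

First, we use MySQL's internal statistics to check the InnoDB semaphore monitoring data: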

SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 153720
--Thread 139868617205504 has waited at row0row.cc line 1075 for 0.00 seconds the semaphore:
X-lock on RW-latch at 0x7f4298084250 created in file buf0buf.cc line 1425
a writer (thread id 139869284108032) has reserved it in mode  SX
number of readers 0, waiters flag 1, lock_word: 10000000
Last time read locked in file not yet reserved line 0
Last time write locked in file /mnt/workspace/percona-server-5.7-redhat-binary-rocks-new/label_exp/min-centos-7-x64/test/rpmbuild/BUILD/percona-server-5.7.26-29/percona-server-5.7.26-29/storage/innobase/buf/buf0flu.cc line 1216
OS WAIT ARRAY INFO: signal count 441329
RW-shared spins 0, rounds 1498677, OS waits 111991
RW-excl spins 0, rounds 717200, OS waits 9012
RW-sx spins 47596, rounds 366136, OS waits 4100
Spin rounds per wait: 1498677.00 RW-shared, 717200.00 RW-excl, 7.69 RW-sx

The write operation is blocked at the call on line 1216 of storage/innobase/buf/buf0flu.cc.

Tracing the source where the wait occurs, buf0flu.cc line 1216:

    if (flush_type == BUF_FLUSH_LIST
        && is_uncompressed
        && !rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE)) {    // check for a lock conflict before locking
        if (!fsp_is_system_temporary(bpage->id.space())) {
            /* avoiding deadlock possibility involves
            doublewrite buffer, should flush it, because
            it might hold the another block->lock. */
            buf_dblwr_flush_buffered_writes(
                buf_parallel_dblwr_partition(bpage,
                    flush_type));
        } else {
            buf_dblwr_sync_datafiles();
        }
        rw_lock_sx_lock_gen(rw_lock, BUF_IO_WRITE);        // take the SX lock
    }
...
#define rw_lock_sx_lock_nowait(M, P)       \
    rw_lock_sx_lock_low((M), (P), __FILE__, __LINE__)
...

rw_lock_sx_lock_func(                                       // SX lock function
/*=================*/
    rw_lock_t*  lock, /*!< in: pointer to rw-lock */
    ulint   pass, /*!< in: pass value; != 0, if the lock will
            be passed to another thread to unlock */
    const char* file_name,/*!< in: file name where lock requested */
    ulint   line) /*!< in: line where requested */

{
    ulint   i = 0;
    sync_array_t* sync_arr;
    ulint   spin_count = 0;
    uint64_t  count_os_wait = 0;
    ulint   spin_wait_count = 0;

    ut_ad(rw_lock_validate(lock));
    ut_ad(!rw_lock_own(lock, RW_LOCK_S));

lock_loop:

    if (rw_lock_sx_lock_low(lock, pass, file_name, line)) {

        if (count_os_wait > 0) {
            lock->count_os_wait +=
                static_cast<uint32_t>(count_os_wait);
            rw_lock_stats.rw_sx_os_wait_count.add(count_os_wait);
        }

        rw_lock_stats.rw_sx_spin_round_count.add(spin_count);
        rw_lock_stats.rw_sx_spin_wait_count.add(spin_wait_count);

        /* Locking succeeded */
        return;

    } else {

        ++spin_wait_count;

        /* Spin waiting for the lock_word to become free */
        os_rmb;
        while (i < srv_n_spin_wait_rounds
               && lock->lock_word <= X_LOCK_HALF_DECR) {

            if (srv_spin_wait_delay) {
                ut_delay(ut_rnd_interval(
                    0, srv_spin_wait_delay));                         // lock not acquired, call ut_delay
            }

            i++;
        }

        spin_count += i;

        if (i >= srv_n_spin_wait_rounds) {

            os_thread_yield();

        } else {

            goto lock_loop;
        }
...
ulong srv_n_spin_wait_rounds  = 30;
ulong srv_spin_wait_delay = 6;

The source above shows that MySQL lock waits are implemented by calling ut_delay to run an idle loop.

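InnoDB has three lock modes: S (shared), X (exclusive), and SX (shared-exclusive). SX is mutually exclusive with SX and X. Taking an SX lock does not affect reads; it only blocks writes. So under heavy write load there are many lock waits, and therefore many PAUSE instructions.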

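At this point, we can summarize the two factors that affect throughput:

The spin duration, which in the source of MySQL 5.7 and earlier is hard-coded as a loop count of spin_wait_delay * 50.
The latency of the Intel CPU PAUSE instruction.

Next, we approach the problem from these two directions and evaluate the room for optimization and its effect.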

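4. Optimizing and Exploring the PAUSE Instruction and Spin Parameters

4.1 MySQL Spin Parameter Optimization

4.1.1 MySQL 5.7 Spin Parameter Optimization

We can look for optimization opportunities based on the existing MySQL version, the hardware, and so on.

MySQL exposes parameters for spin control that can be adjusted; we tuned them according to their characteristics:

innodb_spin_wait_delay

The unit of innodb_spin_wait_delay is the time a 100 MHz Pentium processor takes for a 1-microsecond delay, as stated in the ut_delay source comment. The default value is 6, meaning a spin of at most 6 microseconds on a 100 MHz Pentium processor.

innodb_sync_spin_loops

When an InnoDB thread tries to acquire a mutex and fails, it retries up to innodb_sync_spin_loops times.

Of these, innodb_spin_wait_delay affects how long PAUSE runs, so we ran tuning tests on this parameter.

As before, we used benchmarks to compare the performance and effect of tuning the parameter:

In summary:

Adjusting innodb_spin_wait_delay has some effect on TPS and QPS: the smaller the value, the better MySQL performs, and vice versa.
The gain from tuning innodb_spin_wait_delay is limited and still falls short of what online services require.

4.2 Porting the MySQL 8.0 Spin Feature

4.2.1 Porting spin_wait_pause_multiplier

For the throughput drop caused by PAUSE on Skylake CPUs, tuning the MySQL 5.7 spin control parameter innodb_spin_wait_delay did not produce a significant improvement.

So we turned to a new feature in MySQL 8.0: to deal with PAUSE, the 8.0 source adds the spin_wait_pause_multiplier parameter, replacing the previously hard-coded loop count.

4.2.2 spin_wait_pause_multiplier Implementation

In the MySQL 8.0 source, the previously fixed factor of 50 loop iterations became a tunable parameter: spin_wait_pause_multiplier.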

ulint ut_delay(ulint delay) {
  ulint i, j;
  /* We don't expect overflow here, as ut::spin_wait_pause_multiplier is limited
  to 100, and values of delay are not larger than @@innodb_spin_wait_delay
  which is limited by 1 000. Anyway, in case an overflow happened, the program
  would still work (as iterations is unsigned). */
  const ulint iterations = delay * ut::spin_wait_pause_multiplier;
  UT_LOW_PRIORITY_CPU();

  j = 0;

  for (i = 0; i < iterations; i++) {
    j += i;
    UT_RELAX_CPU();
  }

  UT_RESUME_PRIORITY_CPU();

  return (j);
}
...
namespace ut {
ulong spin_wait_pause_multiplier = 50;
}

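4.2.3 Optimization by Porting the spin_wait_pause_multiplier Patch

Since the MySQL 8.0 parameter spin_wait_pause_multiplier controls how long PAUSE runs, lowering it reduces the overall impact of PAUSE.

After studying the relevant MySQL 8.0 code, we ported the patch to our stable online version: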

MySQL>select version();
+------------------+
| version()        |
+------------------+
| 5.7.26-29-mt-log |
+------------------+
1 row in set (0.00 sec)

MySQL>show global variables like '%spin%';
+-----------------------------------+-------+
| Variable_name                     | Value |
+-----------------------------------+-------+
| innodb_spin_wait_delay            | 6     |
| innodb_spin_wait_pause_multiplier | 5     |
| innodb_sync_spin_loops            | 30    |
+-----------------------------------+-------+
3 rows in set (0.00 sec)

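As shown above, the PAUSE latency of the Silver 4110 is about 14 times that of the E5-2620 v4. On that basis, we set innodb_spin_wait_pause_multiplier to roughly 1/14 of the default and chose a slightly larger value: 5. That is, the parameter was changed from its default of 50 to 5.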

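Finally, we again use a two-dimensional line chart to compare the benchmark data after tuning with the patch:

After porting the spin_wait_pause_multiplier patch to the Silver 4110 and tuning it, the 4110 (patch) shows a large performance improvement.
The Silver 4110 (patch) outperforms tuning innodb_spin_wait_delay alone.
In the write-only scenario with more than 64 concurrent threads, the Silver 4110 (patch) is slightly slower than the E5-2620 V4; in all other cases it is faster.
At the real online read/write ratio, the 4110 (patch) restores throughput to its previous level.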

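4.3 PAUSE Instruction Latency Optimization

In the section above, we measured a lower PAUSE latency on Cascade Lake CPUs. Intel technical experts confirmed that, starting with Cascade Lake, the second-generation Purley product, Intel reduced the PAUSE latency to 44 cycles. (Presumably Intel also noticed the performance bottleneck caused by the longer PAUSE latency in the first generation.)

We continued benchmarking on the second-generation CPU to see how it performs:

Next, we use perf diff to compare the ut_delay overhead on the 4110 and the 4210:

The share of ut_delay on the 4210 is 8% lower than on the 4110.
Because the PAUSE latency is still several times that of the E5-series CPUs, under high load the PAUSE overhead on the 4210 still has a considerable impact on MySQL throughput. Below 128 concurrent threads, however, performance improves substantially over the 4110 and should be enough to meet online service requirements (this test result matches the performance curve of the ported spin_wait_pause_multiplier patch).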

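5. Summary

Finally, a brief summary of this article:

On its new-platform CPUs, Intel increased the PAUSE instruction latency. Under high concurrency with heavy spinlock contention, this can cause a substantial performance loss (especially in programs that execute a fixed number of PAUSE instructions).
For Skylake-architecture CPUs (such as the 4110), the performance problems caused by the longer PAUSE latency can be addressed as follows:

Port the MySQL 8.0 innodb_spin_wait_pause_multiplier patch to the stable online version (or upgrade to MySQL 8.0), and raise throughput by shortening the time spent executing PAUSE.
If the OS is CentOS 6, upgrade to CentOS 7; the spinlock optimizations in CentOS 7 also give MySQL some performance gain.
The simplest and most direct approach is to switch to Cascade Lake CPUs.

For Cascade Lake CPUs, Intel has already reduced the PAUSE latency and corrected the performance problem. The optimizations above can still be applied to push performance a step further.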

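6. About the Author

Chunlin joined Meituan in 2017 and is mainly responsible for MySQL operations development and optimization.

Recruitment

The Meituan DBA team is hiring for a range of roles, based in Beijing or Shanghai. We are committed to providing the company with stable, reliable, and efficient online storage services and to building an industry-leading database team. We run tens of thousands of MySQL instances of various architectures, serving trillions of OLTP requests every day: a truly massive, distributed, high-concurrency environment. If you are interested, send your résumé to tech@meituan.com (email subject: Meituan DBA team).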
