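1. What is SystemTap?
SystemTap is a tool for debugging the Linux kernel and for tracing and debugging applications. It can read and modify the state of a running program, and it offers low latency and dynamic instrumentation. To use it, you write a script in the SystemTap language; SystemTap translates the script into C code and then compiles that code into a kernel module. The module is loaded into the kernel while the script runs and unloaded when the script finishes. This is similar to eBPF, which is very popular in the industry right now, but SystemTap lacks eBPF's safety checks, such as loop detection and the rejection of unreachable instructions. Even so, SystemTap has been developed for many years and has plenty of supporting tools and mature solutions, so it is still well worth learning.
The overall flow runs from the script to C code, then to a kernel module, and finally to the probe output.
For how to set up a SystemTap environment, see my other article: A hands-on guide to using flame graphs to diagnose performance bottlenecks in Lua code on OpenResty (nginx)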
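2. Hello World
The contents of hello-world.stp are as follows:
    #!/usr/bin/env stap
    probe oneshot {
        # Embedded C code (requires guru mode)
        %{ printk(KERN_ALERT "Hello World from systemtap c\n") %};
        # Standard stap script syntax
        printf("Hello World\n");
        exit();
    }
Running it produces the following output:
    [root@localhost systemtap]# stap -g hello-world.stp
    Hello World
Command-line options:
- -g enables guru mode, which is required to run embedded C code.
Code walkthrough:
- probe oneshot means the script in this block runs only once. The syntax is similar to AWK.
- C code can be embedded inside %{ %}, which lets us call printk, the function used in kernel development, to print output. Note that this output does not go to standard output; use dmesg to view it.
- printf is the stap script's print function, commonly used to print collected statistics.
The output of the embedded C code is as follows:
    [root@localhost systemtap]# dmesg|tail -n 1
    [51913.062057] Hello World from systemtap c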
3. Basic syntax
Terminology:
- event: for example timer events and function events (enter/exit)
- handler: the code executed when an event occurs
- probe = event + handler. The syntax for defining a probe is:
    probe event {statements}
- script: a SystemTap script; one script can contain multiple probes
- function: functions can be used in a probe's statements
    function function_name(arguments) {statements}
    probe event {function_name(arguments)}
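To make the terms above concrete, here is a minimal sketch of a script with one function and one probe. The function name greet and its argument are made up for illustration; begin is a standard SystemTap event.

```systemtap
#!/usr/bin/env stap
# A function that any probe handler can call.
# "greet" and "name" are illustrative names, not part of any tapset.
function greet(name) {
    printf("hello, %s\n", name)
}

# probe = event (begin) + handler (the block in braces)
probe begin {
    greet("systemtap")
    exit()
}
```

The argument type is inferred from the call site, so no type annotation is needed in this simple case.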
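3.1 Event
SystemTap has two broad classes of events: synchronous and asynchronous. A synchronous event fires when execution reaches a specific location in the code; an asynchronous event is not tied to any particular code location.
Run man stapprobes to see the full list of available events.
Examples of synchronous events:
- syscall.system_call: system calls
- vfs.file_operation
- module("module").function("function")
- kernel.function("function")
- kernel.trace("tracepoint")
Examples of asynchronous events:
- begin
- end
- oneshot
- timer events
    [root@localhost systemtap]# stap -e 'probe timer.s(1) {print("1 second elapsed...\n")}'
    1 second elapsed...
    1 second elapsed...
    1 second elapsed...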
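The two classes of events can be mixed in one script. The following sketch counts read system calls (a synchronous event) and uses begin, timer, and end (asynchronous events) to control the run. The 5-second duration and the variable name reads are arbitrary choices for illustration.

```systemtap
#!/usr/bin/env stap
global reads

# Asynchronous: fires once when the script starts.
probe begin { printf("counting read() calls for 5 seconds...\n") }

# Synchronous: fires every time any process enters the read system call.
probe syscall.read { reads++ }

# Asynchronous: fires after 5 seconds and ends the session.
probe timer.s(5) { exit() }

# Asynchronous: fires once when the script is shutting down.
probe end { printf("read() was called %d times\n", reads) }
```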
Use stap -l to list the probe points available on the system.
- List kernel functions
    [vagrant@localhost openresty-1.21.4.1]$ sudo stap -l 'kernel.function("vfs_read*")'
    kernel.function("vfs_read@fs/read_write.c:436")
    kernel.function("vfs_readlink@fs/namei.c:4598")
    kernel.function("vfs_readv@fs/read_write.c:833")
- List syscalls
    [vagrant@localhost openresty-1.21.4.1]$ sudo stap -l 'syscall.*read'
    syscall.pread
    syscall.read
- List tracepoints
    [vagrant@localhost ~]$ sudo stap -l 'kernel.trace("*readpage")'
    kernel.trace("ext3:ext3_readpage")
    kernel.trace("ext4:ext4_readpage")
    kernel.trace("f2fs:f2fs_readpage")
- List the functions of a user-space program
    [vagrant@localhost ~]$ sudo stap -l 'process("/usr/local/openresty/nginx/sbin/nginx").function("ngx_http_write*")'
    process("/usr/local/openresty/nginx/sbin/nginx").function("ngx_http_write_filter@src/http/ngx_http_write_filter_module.c:48")
    process("/usr/local/openresty/nginx/sbin/nginx").function("ngx_http_write_filter_init@src/http/ngx_http_write_filter_module.c:357")
    process("/usr/local/openresty/nginx/sbin/nginx").function("ngx_http_write_request_body@src/http/ngx_http_request_body.c:529")
    process("/usr/local/openresty/nginx/sbin/nginx").function("ngx_http_writer@src/http/ngx_http_request.c:2826")
- List the probes predefined in the tapset
    [vagrant@localhost ~]$ sudo stap -l 'netdev.change*'
    netdev.change_mac
    netdev.change_mtu
    netdev.change_rx_flag
3.2 tapset
The tapset .stp files shipped with the system live in /usr/share/systemtap/tapset. They contain predefined probes (events) and functions.
An example of a predefined probe:
    probe netdev.receive
        = kernel.function("netif_receive_skb_internal") !,
          kernel.function("netif_receive_skb")
    {
        try { dev_name = get_netdev_name($skb->dev) } catch { }
        try { length = $skb->len } catch { }
        try { protocol = $skb->protocol } catch { }
        try { truesize = $skb->truesize } catch { }
    }
This defines the probe netdev.receive as the kernel function netif_receive_skb_internal (if it exists), or otherwise netif_receive_skb. It also prepares the following variables in advance for use in handlers:
- dev_name
- length
- protocol
- truesize
In the definition of netdev.receive, the ! marker means: if netif_receive_skb_internal exists, use it; if it does not, fall back to the function after the !, which is netif_receive_skb.
- There is another marker similar to !, namely ?. It means the script should not report an error if the probe point does not exist.
- The if (expr) marker: signal.*? if (switch) means the probe point is active only while switch is true.
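The three markers can be tried out in a small script. This is a sketch under assumptions: the global names enabled and hits are made up, hypothetical_missing_function is deliberately a name that should not exist in the kernel, and the timer intervals are arbitrary. Probing kernel functions also assumes kernel debuginfo is installed.

```systemtap
#!/usr/bin/env stap
global enabled = 1   # only global variables may appear in an "if" condition
global hits

# "!": if netif_receive_skb_internal resolves, use it and skip the rest;
# otherwise fall back to netif_receive_skb.
probe kernel.function("netif_receive_skb_internal") !,
      kernel.function("netif_receive_skb")
{
    hits++
}

# "?": optional probe point. The function name is made up, so this probe
# is dropped without an error.
probe kernel.function("hypothetical_missing_function") ?
{
    printf("never printed\n")
}

# "if (expr)": the handler runs only while "enabled" is non-zero.
probe timer.s(1) if (enabled)
{
    printf("packets seen so far: %d\n", hits)
}

# Flip the switch every 3 seconds, stop after 10 seconds.
probe timer.s(3) { enabled = !enabled }
probe timer.s(10) { exit() }
```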
An example of a function:
    function strlen:long(s:string)
    %{ /* pure */ /* unprivileged */ /* unmodified-fnargs */
        STAP_RETURN(strlen(STAP_ARG_s));
    %}
The function strlen returns the length of a string.
An example that uses the tapset, nettop.stp:
    #! /usr/bin/env stap
    global ifxmit, ifrecv
    global ifmerged
    probe netdev.transmit
    {
        ifxmit[pid(), dev_name, execname(), uid()] <<< length
        ifmerged[pid(), dev_name, execname(), uid()] <<< 1
    }
    probe netdev.receive
    {
        ifrecv[pid(), dev_name, execname(), uid()] <<< length
        ifmerged[pid(), dev_name, execname(), uid()] <<< 1
    }
    function print_activity()
    {
        printf("%5s %5s %-12s %7s %7s %7s %7s %-15s\n",
               "PID", "UID", "DEV", "XMIT_PK", "RECV_PK",
               "XMIT_KB", "RECV_KB", "COMMAND")
        foreach ([pid, dev, exec, uid] in ifmerged-) {
            n_xmit = @count(ifxmit[pid, dev, exec, uid])
            n_recv = @count(ifrecv[pid, dev, exec, uid])
            printf("%5d %5d %-12s %7d %7d %7d %7d %-15s\n",
                   pid, uid, dev, n_xmit, n_recv,
                   @sum(ifxmit[pid, dev, exec, uid])/1024,
                   @sum(ifrecv[pid, dev, exec, uid])/1024,
                   exec)
        }
        print("\n")
        delete ifxmit
        delete ifrecv
        delete ifmerged
    }
    probe timer.ms(5000), end, error
    {
        print_activity()
    }
This script uses the tapset's predefined probe points netdev.transmit and netdev.receive to report which processes are using the network.
The output looks like this:
    [root@localhost systemtap]# stap nettop.stp
      PID   UID DEV          XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
        0     0 eth0               0      28       0       1 swapper/3
     3201  1000 eth0              14       0       1       0 sshd
    14913  1000 eth0               5       0       0       0 ping
     1983  1000 eth0               1       0       0       0 sshd
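3.3 Modifiers
When probing functions, you will often see suffixes such as .inline / .call / .return.
- .return probes the point where the function returns. In the handler of such a probe, the $return variable gives the function's return value.
- .inline restricts the probe to inlined instances of the function.
- .call is the opposite of .inline: it matches only non-inlined instances.
- .maxactive (used with .return probes) sets how many instances of the probed function can be tracked at the same time. If some calls are missed because the system default maxactive is too low, raise it with .maxactive.
- ? means the script should not fail if the probe point does not exist.
- ! means that once this probe point resolves, the remaining alternatives are not resolved.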
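As a sketch of .return and .maxactive together, the script below prints the return value of vfs_read for a few seconds. The value 64 for maxactive and the 3-second duration are arbitrary choices, and reading $return assumes kernel debuginfo is available.

```systemtap
#!/usr/bin/env stap
# .return fires when vfs_read returns; $return holds its return value
# (number of bytes read, or a negative error code).
# .maxactive(64) allows up to 64 concurrent calls to be tracked.
probe kernel.function("vfs_read").return.maxactive(64)
{
    printf("%s[%d] vfs_read returned %d\n", execname(), pid(), $return)
}

# Stop after 3 seconds so the output stays short.
probe timer.s(3) { exit() }
```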
3.4 Common stap command-line options
- -x traces the process with the given PID; target() returns the PID specified by -x
- -c runs a command and traces it
- -e takes the script from the command line
    [root@localhost vagrant]# stap -v -e 'probe vfs.read{ printf("read performed"); exit()}'
    Pass 1: parsed user script and 473 library scripts using 271956virt/69188res/3504shr/65752data kb, in 420usr/20sys/442real ms.
    Pass 2: analyzed script: 1 probe, 1 function, 7 embeds, 0 globals using 439512virt/233636res/4820shr/233308data kb, in 1360usr/460sys/1813real ms.
    Pass 3: using cached /root/.systemtap/cache/89/stap_89794dac39b59bfbd6e29e0bad1d4100_2803.c
    Pass 4: using cached /root/.systemtap/cache/89/stap_89794dac39b59bfbd6e29e0bad1d4100_2803.ko
    Pass 5: starting run.
    read performedPass 5: run completed in 0usr/40sys/595real ms.
- -l lists probe points
    [root@localhost vagrant]# stap -l 'syscall.write*'
    syscall.write
    syscall.writev
- -L prints detailed information about probe points, including the variables they support
    [vagrant@localhost ~]$ sudo stap -L 'netdev.receive'
    netdev.receive dev_name:string length:long protocol:long truesize:long $skb:struct sk_buff*
The variables available in netdev.receive are: dev_name / length / protocol / truesize / $skb
- --dump-probe-types lists the probe syntax that stap supports
    [root@localhost systemtap]# stap --dump-probe-types
    begin
    begin(number)
    end
    end(number)
    error
    error(number)
    java(number).class(string).method(string)
    java(number).class(string).method(string).return
    java(string).class(string).method(string)
    java(string).class(string).method(string).return
    kernel.data(number).length(number).rw
    kernel.data(number).length(number).write
    kernel.data(number).rw
    kernel.data(number).write
    kernel.data(string).rw
    kernel.data(string).write
    kernel.function(number)
    kernel.function(number).call
    kernel.function(number).exported
    ... (truncated)
- --dump-probe-aliases lists the probe aliases predefined in the tapset
    [root@localhost systemtap]# stap --dump-probe-aliases|grep netdev
    netdev.change_mac = kernel.function("dev_set_mac_address")?
    netdev.change_mtu = kernel.function("dev_set_mtu")
    netdev.change_rx_flag = kernel.function("dev_change_rx_flags")?
    netdev.close = kernel.function("dev_close")
    netdev.get_stats = kernel.function("dev_get_stats")?
    netdev.hard_transmit = kernel.function("dev_hard_start_xmit")?
    netdev.ioctl = kernel.function("dev_ioctl")
    netdev.open = kernel.function("dev_open")
    netdev.receive = kernel.function("netif_receive_skb_internal")!, kernel.function("netif_receive_skb")
    netdev.register = kernel.function("register_netdevice"), kernel.function("register_netdev")
    netdev.rx = kernel.function("netif_rx")
    netdev.set_promiscuity = kernel.function("dev_set_promiscuity")
    netdev.transmit = kernel.function("__dev_queue_xmit")!, kernel.function("dev_queue_xmit")
    netdev.unregister = kernel.function("unregister_netdev")
- --dump-functions lists the functions predefined in the tapset
    [root@localhost systemtap]# stap --dump-functions |grep user_string
    set_user_string:unknown (addr:long, val:string) /* guru */
    set_user_string_n:unknown (addr:long, n:long, val:string) /* guru */
    user_string2:string (addr:long, err_msg:string) /* unprivileged */
    user_string2_n_warn:string (addr:long, n:long, warn_msg:string)
    user_string2_utf16:string (addr:long, err_msg:string)
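To show how -x and -c interact with target(), here is a sketch that counts read system calls made by the traced process only. The file name target-reads.stp, the variable name reads, and the 5-second limit are assumptions made for this example.

```systemtap
#!/usr/bin/env stap
# Run as:  stap target-reads.stp -x <PID>
#     or:  stap target-reads.stp -c "<command>"
global reads

probe syscall.read
{
    # target() is the PID given by -x, or the PID of the command started by -c
    if (pid() == target())
        reads++
}

# Stop after 5 seconds if the session has not ended earlier.
probe timer.s(5) { exit() }

probe end { printf("PID %d issued %d read() calls\n", target(), reads) }
```

Filtering on pid() == target() inside the handler keeps the probe definition generic, so the same script works with either option.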
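3.5 SystemTap built-in functions
The so-called built-in functions are also predefined in the tapset. Most of them implement their logic with embedded C code.
- pp(): the name of the current probe point
- target(): the PID of the process specified by -x PID or -c CMD
- ctime(): converts a number of seconds since the epoch into a human-readable date string
- cpu(): the number of the current CPU
- probefunc(): the function in which the current probe point is located
For the function documentation, see: sourceware.org/systemtap/t…
Taking probefunc as an example, its implementation is in the file linux/context-symbols.stp.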
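A quick way to see what these functions return is to print them from a single probe hit. This sketch assumes vfs_read can be probed on your kernel; gettimeofday_s() is another tapset function, used here only to feed ctime().

```systemtap
#!/usr/bin/env stap
probe kernel.function("vfs_read")
{
    # Print the built-in context values for the first hit, then stop.
    printf("time: %s\n", ctime(gettimeofday_s()))
    printf("cpu:  %d\n", cpu())
    printf("func: %s\n", probefunc())
    printf("pp:   %s\n", pp())
    exit()
}
```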
3.6 Associative arrays
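As a minimal sketch of an associative array, the script below counts vfs.read events per process name and prints the ten busiest processes. The array name reads, the 5-second window, and the limit of 10 are arbitrary choices for illustration.

```systemtap
#!/usr/bin/env stap
# Associative arrays must be declared global.
global reads

probe vfs.read
{
    # Key: process name. Value: number of reads seen so far.
    reads[execname()]++
}

probe timer.s(5)
{
    # "reads-" iterates in descending order of value; "limit 10" keeps the top 10.
    foreach (name in reads- limit 10)
        printf("%-20s %d\n", name, reads[name])
    exit()
}
```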
3.7 Reading command-line arguments
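Script arguments are referenced by position: $N substitutes the N-th argument as an integer literal and @N substitutes it as a string. The sketch below assumes a hypothetical file name args.stp, with a duration in seconds as the first argument and a process name as the second.

```systemtap
#!/usr/bin/env stap
# Run as:  stap args.stp 5 nginx
#   $1 -> 5        (integer literal)
#   @2 -> "nginx"  (string literal)
global reads

probe begin
{
    printf("counting read() calls of \"%s\" for %d seconds\n", @2, $1)
}

probe syscall.read
{
    if (execname() == @2)
        reads++
}

probe timer.s($1) { exit() }

probe end { printf("%s: %d read() calls\n", @2, reads) }
```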
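3.9 Using C code in scripts
3.10 label and marker
3.11 Cross-machine compilation
SystemTap supports cross-machine compilation: you compile the script into a kernel module on machine A and run that module on machine B. For details, see: sourceware.org/systemtap/S…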
4. User-space events
5. References
- SystemTap script examples: sourceware.org/systemtap/e…