Background
If you keep allocating memory with the code below, physical memory is eventually exhausted and all of the app's processes get killed. In one crash investigation, the app clearly died but no crash log was ever captured; the root cause turned out to be a memory leak that triggered the low memory killer. So how can we monitor these low-memory-killer kills in production?
#include <jni.h>
#include <stdlib.h>
#include <string.h>
// Intentionally leaked: the block is never freed, and memset commits the pages
// so physical memory is really consumed.
JNIEXPORT void JNICALL
Java_com_test_Java2C_tm(JNIEnv *env, jclass clazz, jlong size) {
    char *t = (char *) malloc(size);
    memset(t, 0, size);
}
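On the Java side, a small driver like the sketch below keeps calling the leaky allocation until lmkd steps in. The class name mirrors the JNI signature above; the library name "java2c" and the 64 MB step size are assumptions for illustration:

package com.test;

public class Java2C {
    static {
        System.loadLibrary("java2c"); // assumed library name
    }

    // Matches Java_com_test_Java2C_tm above: allocates `size` bytes and never frees them.
    public static native void tm(long size);

    // Keep leaking until the device runs out of memory and lmkd kills the process.
    public static void leakUntilKilled() {
        while (true) {
            tm(64L * 1024 * 1024); // 64 MB per iteration (arbitrary)
            try {
                Thread.sleep(200);  // slow down so the kill shows up clearly in logcat
            } catch (InterruptedException ignored) { }
        }
    }
}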
How to monitor it
Starting with Android R (API 30), the system process exposes an API to app processes: android.app.ActivityManager#getHistoricalProcessExitReasons lets an app query why its processes were previously killed. On every startup we can call this API, inspect the kill records and report them (see the sketch after the field list below).
Return a list of ApplicationExitInfo records containing the reasons for the most recent app deaths.
Note: System stores this historical information in a ring buffer and only the most recent records will be returned.
Note: In the case that this application was bound to an external service with flag Context.BIND_EXTERNAL_SERVICE, the process of that external service will be included in this package's exit info.
Params:
packageName – Optional, a null value means match all packages belonging to the caller's UID. If this package belongs to another UID, you must hold Manifest.permission.DUMP in order to retrieve it.
pid – A process ID that used to belong to this package but died later; a value of 0 means to ignore this parameter and return all matching records.
maxNum – The maximum number of results to be returned; a value of 0 means to ignore this parameter and return all matching records
Returns: a list of ApplicationExitInfo records matching the criteria, sorted from most recent to least recent.
public List<ApplicationExitInfo> getHistoricalProcessExitReasons(@Nullable String packageName,
@IntRange(from = 0) int pid, @IntRange(from = 0) int maxNum) {
try {
ParceledListSlice<ApplicationExitInfo> r = getService().getHistoricalProcessExitReasons(
packageName, pid, maxNum, mContext.getUserId());
return r == null ? Collections.emptyList() : r.getList();
} catch (RemoteException e) {
throw e.rethrowFromSystemServer();
}
}
Based on the mReason field of ApplicationExitInfo we can roughly distinguish the causes below; type 3 is the low memory killer. Running the leak test code above does indeed yield reason 3.
Type | Description |
---|---|
1 | REASON_EXIT_SELF: the process exited itself (e.g. System#exit) |
2 | REASON_SIGNALED: killed by an OS signal (OsConstants#SIGKILL) |
3 | REASON_LOW_MEMORY: killed by the low memory killer |
4 | REASON_CRASH: Java crash |
5 | REASON_CRASH_NATIVE: native crash |
6 | REASON_ANR |
7 | REASON_INITIALIZATION_FAILURE |
8 | REASON_PERMISSION_CHANGE: runtime permission change |
9 | REASON_EXCESSIVE_RESOURCE_USAGE: excessive resource usage |
10 | REASON_USER_REQUESTED: process killed at the user's request |
11 | REASON_USER_STOPPED: killed because the device user (multi-user scenario) was stopped |
12 | REASON_DEPENDENCY_DIED |
13 | REASON_OTHER |
A few other ApplicationExitInfo fields are worth reporting as well:
- mImportance: the process's importance when it exited
  - app in the foreground: 100 (IMPORTANCE_FOREGROUND)
  - app in the background (cached): 400 (IMPORTANCE_CACHED)
- mPss: the process's memory footprint (PSS, in kB) at exit
- mTimestamp: the timestamp at which the process exited
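Putting it together, a startup check could look like the sketch below; reportLmkKill is a hypothetical placeholder for whatever reporting pipeline you use:

import android.app.ActivityManager;
import android.app.ApplicationExitInfo;
import android.content.Context;
import java.util.List;

public final class ExitReasonMonitor {

    // Call once per app start. Requires API 30+ (Android R).
    public static void checkLastExitReasons(Context context) {
        ActivityManager am = context.getSystemService(ActivityManager.class);
        // null package = our own package; pid 0 and maxNum 0 = all matching records
        List<ApplicationExitInfo> exits = am.getHistoricalProcessExitReasons(null, 0, 0);
        for (ApplicationExitInfo info : exits) {
            if (info.getReason() == ApplicationExitInfo.REASON_LOW_MEMORY) {
                reportLmkKill(info.getTimestamp(),  // when the process died
                              info.getImportance(), // 100 = foreground, 400 = cached
                              info.getPss());       // memory at exit, in kB
            }
        }
    }

    // Hypothetical reporting hook: plug in your own telemetry/upload here.
    private static void reportLmkKill(long timestampMs, int importance, long pssKb) {
    }
}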
Building the metrics
Per-device analysis
A quick look at the low memory kill mechanism
Searching the source (based on Android 13)
A search shows that kills with this reason do not happen inside the system_server process; they are carried out by the lmkd process, which then notifies system_server. In com.android.server.am.ProcessList we find mAppExitInfoSourceLmkd:
/**
* Bookkeeping low memory kills info from lmkd.
*/
final AppExitInfoExternalSource mAppExitInfoSourceLmkd =
new AppExitInfoExternalSource("lmkd", ApplicationExitInfo.REASON_LOW_MEMORY);
Stack trace debugging
In com.android.server.am.LmkdConnection, system_server connects to lmkd over a LocalSocket. After lmkd kills a process it notifies system_server, and system_server records the kill:
public boolean connect() {
synchronized (mLmkdSocketLock) {
if (mLmkdSocket != null) {
return true;
}
// temporary sockets and I/O streams
final LocalSocket socket = openSocket();
if (socket == null) {
Slog.w(TAG, "Failed to connect to lowmemorykiller, retry later");
return false;
}
final OutputStream ostream;
final InputStream istream;
try {
ostream = socket.getOutputStream();
istream = socket.getInputStream();
} catch (IOException ex) {
IoUtils.closeQuietly(socket);
return false;
}
// execute onConnect callback
if (mListener != null && !mListener.onConnect(ostream)) {
Slog.w(TAG, "Failed to communicate with lowmemorykiller, retry later");
IoUtils.closeQuietly(socket);
return false;
}
// connection established
mLmkdSocket = socket;
mLmkdOutputStream = ostream;
mLmkdInputStream = istream;
mMsgQueue.addOnFileDescriptorEventListener(mLmkdSocket.getFileDescriptor(),
EVENT_INPUT | EVENT_ERROR,
new MessageQueue.OnFileDescriptorEventListener() {
public int onFileDescriptorEvents(FileDescriptor fd, int events) {
return fileDescriptorEventHandler(fd, events);
}
}
);
mLmkdSocketLock.notifyAll();
}
return true;
}
The unsolicited LMK_PROCKILL message from lmkd arrives in system_server with a stack like this:
at com.android.server.am.ProcessList$1.handleUnsolicitedMessage(ProcessList.java:810)
at com.android.server.am.LmkdConnection.processIncomingData(LmkdConnection.java:217)
at com.android.server.am.LmkdConnection.fileDescriptorEventHandler(LmkdConnection.java:172)
at com.android.server.am.LmkdConnection.-$$Nest$mfileDescriptorEventHandler(Unknown Source:0)
at com.android.server.am.LmkdConnection$1.onFileDescriptorEvents(LmkdConnection.java:158)
at android.os.MessageQueue.dispatchEvents(MessageQueue.java:293)
at android.os.MessageQueue.nativePollOnce(Native Method)
at android.os.MessageQueue.next(MessageQueue.java:335)
at android.os.Looper.loopOnce(Looper.java:161)
at android.os.Looper.loop(Looper.java:288)
at android.os.HandlerThread.run(HandlerThread.java:67)
at com.android.server.ServiceThread.run(ServiceThread.java:44)
The kill is then recorded as an ApplicationExitInfo entry by AppExitInfoTracker:
at com.android.server.am.AppExitInfoTracker.updateExistingExitInfoRecordLocked(AppExitInfoTracker.java:477)
at com.android.server.am.AppExitInfoTracker.handleNoteProcessDiedLocked(AppExitInfoTracker.java:392)
at com.android.server.am.AppExitInfoTracker$KillHandler.handleMessage(AppExitInfoTracker.java:1632)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loopOnce(Looper.java:201)
at android.os.Looper.loop(Looper.java:288)
at android.os.HandlerThread.run(HandlerThread.java:67)
at com.android.server.ServiceThread.run(ServiceThread.java:44)
Analyzing the lmkd process
Logcat: we will work through the analysis using the following logcat line as our case.
2023-04-13 13:49:24.921 456-456/? I/lowmemorykiller: Kill 'com.test' (9213), uid 10251, oom_score_adj 700 to free 128292kB rss, 50152kB swap; reason: low watermark is breached and swap is low (852kB < 314572kB)
As the log shows, the kill is performed by lmkd, which then notifies our system_server process. Let's keep following the lmkd side.
Source location: system/memory/lmkd
enum lmk_cmd {
LMK_TARGET = 0, /* Associate minfree with oom_adj_score */
LMK_PROCPRIO, /* Register a process and set its oom_adj_score */
LMK_PROCREMOVE, /* Unregister a process */
LMK_PROCPURGE, /* Purge all registered processes */
LMK_GETKILLCNT, /* Get number of kills */
LMK_SUBSCRIBE, /* Subscribe for asynchronous events */
LMK_PROCKILL, /* Unsolicited msg to subscribed clients on proc kills */
LMK_UPDATE_PROPS, /* Reinit properties */
LMK_STAT_KILL_OCCURRED, /* Unsolicited msg to subscribed clients on proc kills for statsd log */
LMK_STAT_STATE_CHANGED, /* Unsolicited msg to subscribed clients on state changed */
};
LMK_PROCKILL is also defined in lmkd.h; we can use it to trace the flow.
static inline size_t lmkd_pack_set_prockills(LMKD_CTRL_PACKET packet, pid_t pid, uid_t uid) {
packet[0] = htonl(LMK_PROCKILL);
packet[1] = htonl(pid);
packet[2] = htonl(uid);
return 3 * sizeof(int);
}
static void ctrl_data_write_lmk_kill_occurred(pid_t pid, uid_t uid) {
LMKD_CTRL_PACKET packet;
size_t len = lmkd_pack_set_prockills(packet, pid, uid);
for (int i = 0; i < MAX_DATA_CONN; i++) {
if (data_sock[i].sock >= 0 && data_sock[i].async_event_mask & 1 << LMK_ASYNC_EVENT_KILL) {
ctrl_data_write(i, (char*)packet, len);
}
}
}
static int kill_one_process(struct proc* procp, int min_oom_score, struct kill_info *ki,
union meminfo *mi, struct wakeup_info *wi, struct timespec *tm,
struct psi_data *pd) {
int pid = procp->pid;
int pidfd = procp->pidfd;
uid_t uid = procp->uid;
char *taskname;
int kill_result;
int result = -1;
struct memory_stat *mem_st;
struct kill_stat kill_st;
int64_t tgid;
int64_t rss_kb;
int64_t swap_kb;
char buf[PAGE_SIZE];
char desc[LINE_MAX];
if (!procp->valid || !read_proc_status(pid, buf, sizeof(buf))) {
goto out;
}
if (!parse_status_tag(buf, PROC_STATUS_TGID_FIELD, &tgid)) {
ALOGE("Unable to parse tgid from /proc/%d/status", pid);
goto out;
}
if (tgid != pid) {
ALOGE("Possible pid reuse detected (pid %d, tgid %" PRId64 ")!", pid, tgid);
goto out;
}
// Zombie processes will not have RSS / Swap fields.
if (!parse_status_tag(buf, PROC_STATUS_RSS_FIELD, &rss_kb)) {
goto out;
}
if (!parse_status_tag(buf, PROC_STATUS_SWAP_FIELD, &swap_kb)) {
goto out;
}
taskname = proc_get_name(pid, buf, sizeof(buf));
// taskname will point inside buf, do not reuse buf onwards.
if (!taskname) {
goto out;
}
mem_st = stats_read_memory_stat(per_app_memcg, pid, uid, rss_kb * 1024, swap_kb * 1024);
snprintf(desc, sizeof(desc), "lmk,%d,%d,%d,%d,%d", pid, ki ? (int)ki->kill_reason : -1,
procp->oomadj, min_oom_score, ki ? ki->max_thrashing : -1);
result = lmkd_free_memory_before_kill_hook(procp, rss_kb / page_k, procp->oomadj,
ki ? (int)ki->kill_reason : -1);
if (result > 0) {
/*
* Memory was freed elsewhere; no need to kill. Note: intentionally do not
* pid_remove(pid) since it was not killed.
*/
ALOGI("Skipping kill; %ld kB freed elsewhere.", result * page_k);
return result;
}
trace_kill_start(desc);
start_wait_for_proc_kill(pidfd < 0 ? pid : pidfd);
kill_result = reaper.kill({ pidfd, pid, uid }, false);
trace_kill_end();
if (kill_result) {
stop_wait_for_proc_kill(false);
ALOGE("kill(%d): errno=%d", pid, errno);
/* Delete process record even when we fail to kill so that we don't get stuck on it */
goto out;
}
last_kill_tm = *tm;
inc_killcnt(procp->oomadj);
if (ki) {
kill_st.kill_reason = ki->kill_reason;
kill_st.thrashing = ki->thrashing;
kill_st.max_thrashing = ki->max_thrashing;
ALOGI("Kill '%s' (%d), uid %d, oom_score_adj %d to free %" PRId64 "kB rss, %" PRId64
"kB swap; reason: %s", taskname, pid, uid, procp->oomadj, rss_kb, swap_kb,
ki->kill_desc);
} else {
kill_st.kill_reason = NONE;
kill_st.thrashing = 0;
kill_st.max_thrashing = 0;
ALOGI("Kill '%s' (%d), uid %d, oom_score_adj %d to free %" PRId64 "kB rss, %" PRId64
"kb swap", taskname, pid, uid, procp->oomadj, rss_kb, swap_kb);
}
killinfo_log(procp, min_oom_score, rss_kb, swap_kb, ki, mi, wi, tm, pd);
kill_st.uid = static_cast<int32_t>(uid);
kill_st.taskname = taskname;
kill_st.oom_score = procp->oomadj;
kill_st.min_oom_score = min_oom_score;
kill_st.free_mem_kb = mi->field.nr_free_pages * page_k;
kill_st.free_swap_kb = get_free_swap(mi) * page_k;
stats_write_lmk_kill_occurred(&kill_st, mem_st);
ctrl_data_write_lmk_kill_occurred((pid_t)pid, uid);
result = rss_kb / page_k;
out:
/*
* WARNING: After pid_remove() procp is freed and can't be used!
* Therefore placed at the end of the function.
*/
pid_remove(pid);
return result;
}
The ALOGI call in kill_one_process, ALOGI("Kill '%s' (%d), uid %d, oom_score_adj %d to free %" PRId64 "kB rss, %" PRId64 "kB swap; reason: %s", taskname, pid, uid, procp->oomadj, rss_kb, swap_kb, ki->kill_desc), produces exactly the line we saw in the console:
2023-04-13 13:49:24.921 456-456/? I/lowmemorykiller: Kill 'com.test' (9213), uid 10251, oom_score_adj 700 to free 128292kB rss, 50152kB swap; reason: low watermark is breached and swap is low (852kB < 314572kB)
Call chain
mp_event_psi -> find_and_kill_process -> kill_one_process -> ctrl_data_write_lmk_kill_occurred
mp_event_psi is triggered by monitoring /proc/pressure/memory.
/proc/pressure/memory is part of the Linux PSI (Pressure Stall Information) interface. It contains two lines, `some` and `full`: each reports the percentage of time (over 10s, 60s and 300s windows, plus a cumulative total) that tasks were stalled waiting for memory. `some` means at least one task was stalled; `full` means all non-idle tasks were stalled at the same time.
lmkd registers pressure thresholds on this file and is woken up (via epoll) whenever memory pressure crosses them; that wake-up is what ends up calling mp_event_psi.
Reading /proc/pressure/memory therefore tells you how hard the system is currently struggling to satisfy memory demand, which is exactly the signal lmkd uses to decide when to start killing.
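For reference, the file's contents look like this (values are illustrative, not taken from the device in this case):
some avg10=0.00 avg60=1.12 avg300=0.76 total=1253745
full avg10=0.00 avg60=0.43 avg300=0.25 total=492783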
Kill conditions (matching the log line)
swap_is_low condition: SwapFree < SwapTotal * swap_free_low_percentage / 100
- SwapFree and SwapTotal both come from /proc/meminfo. On my device SwapTotal is 3145724 kB, and the threshold printed in the log is 314572 kB, so swap_free_low_percentage must be 10 (see the arithmetic below).
wmark condition: MemFree - CmaFree < WMARK_HIGH, where WMARK_HIGH is the high watermark read from /proc/zoneinfo.
In short, the less memory is available, the easier these conditions are to satisfy.
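Plugging the numbers from the log line into the swap_is_low condition (swap_free_low_percentage = 10 is inferred, not read from the device):
swap_low_threshold = SwapTotal * swap_free_low_percentage / 100
                   = 3145724 kB * 10 / 100
                   ≈ 314572 kB
free swap = 852 kB < 314572 kB, so swap_is_low is true and the branch below is taken.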
else if (swap_is_low && wmark < WMARK_HIGH) {
/* Both free memory and swap are low */
kill_reason = LOW_MEM_AND_SWAP;
snprintf(kill_desc, sizeof(kill_desc), "%s watermark is breached and swap is low (%"
PRId64 "kB < %" PRId64 "kB)", wmark < WMARK_LOW ? "min" : "low",
get_free_swap(&mi) * page_k, swap_low_threshold * page_k);
/* Do not kill perceptible apps unless below min watermark or heavily thrashing */
if (wmark > WMARK_MIN && thrashing < thrashing_critical_pct) {
min_score_adj = PERCEPTIBLE_APP_ADJ + 1;
}
}
Kill order
Processes are killed through find_and_kill_process.
Candidates are scanned from the highest oom_score_adj downward. Within the same adj level, once choose_heaviest_task is set (which is always the case by the time perceptible apps are reached), the process with the largest RSS is killed first; otherwise the process at the tail of that adj bucket is taken.
After each kill, ctrl_data_write_lmk_kill_occurred notifies the system_server process.
for (i = OOM_SCORE_ADJ_MAX; i >= min_score_adj; i--) {
struct proc *procp;
if (!choose_heaviest_task && i <= PERCEPTIBLE_APP_ADJ) {
/*
* If we have to choose a perceptible process, choose the heaviest one to
* hopefully minimize the number of victims.
*/
choose_heaviest_task = true;
}
while (true) {
procp = choose_heaviest_task ?
proc_get_heaviest(i) : proc_adj_tail(i);
if (!procp)
break;
killed_size = kill_one_process(procp, min_score_adj, ki, mi, wi, tm, pd);
if (killed_size >= 0) {
if (!lmk_state_change_start) {
lmk_state_change_start = true;
stats_write_lmk_state_changed(STATE_START);
}
break;
}
}
if (killed_size) {
break;
}
}
How a process's oom_adj reaches lmkd
Whenever a process's oom_score_adj changes, ProcessList.writeLmkd sends the new value to lmkd through sLmkdConnection:
at com.android.server.am.ProcessList.writeLmkd(ProcessList.java:1478)
at com.android.server.am.ProcessList.setOomAdj(ProcessList.java:1404)
at com.android.server.am.OomAdjuster.applyOomAdjLSP(OomAdjuster.java:2595)
at com.android.server.am.OomAdjuster.updateAndTrimProcessLSP(OomAdjuster.java:1066)
at com.android.server.am.OomAdjuster.updateOomAdjInnerLSP(OomAdjuster.java:860)
at com.android.server.am.OomAdjuster.performUpdateOomAdjPendingTargetsLocked(OomAdjuster.java:730)
at com.android.server.am.OomAdjuster.updateOomAdjPendingTargetsLocked(OomAdjuster.java:710)
at com.android.server.am.ActivityManagerService.updateOomAdjPendingTargetsLocked(ActivityManagerService.java:15505)
public static void setOomAdj(int pid, int uid, int amt) {
// This indicates that the process is not started yet and so no need to proceed further.
if (pid <= 0) {
return;
}
if (amt == UNKNOWN_ADJ)
return;
long start = SystemClock.elapsedRealtime();
ByteBuffer buf = ByteBuffer.allocate(4 * 4);
buf.putInt(LMK_PROCPRIO);
buf.putInt(pid);
buf.putInt(uid);
buf.putInt(amt);
writeLmkd(buf, null);
long now = SystemClock.elapsedRealtime();
if ((now-start) > 250) {
Slog.w("ActivityManager", "SLOW OOM ADJ: " + (now-start) + "ms for pid " + pid
+ " = " + amt);
}
}
The AMS scheduleTrimMemory mechanism
In other words: under what conditions does our Application receive the onTrimMemory callback? Let's look at it from the AMS side.
- First, AMS decides whether memory is currently "normal". This judgement is made from the number of cached processes (processes with a large procState), not from a direct memory measurement. That makes sense: when memory is tight, lmkd kicks in, cached processes are the easiest to kill, and the cached-process count drops.
- When memFactor == ProcessStats.ADJ_MEM_FACTOR_NORMAL, a process whose UI has just become hidden (the user pressed Home or Back, so the app's UI is no longer visible) gets scheduleTrimMemory(TRIM_MEMORY_UI_HIDDEN).
- When memFactor != ProcessStats.ADJ_MEM_FACTOR_NORMAL, the scheduleTrimMemory logic runs, and foreground and background processes are scheduled separately.
- Background processes first (those with app.curProcState >= ActivityManager.PROCESS_STATE_HOME): they are split evenly into three buckets. The first bucket gets app.trimMemoryLevel = TRIM_MEMORY_COMPLETE, the second TRIM_MEMORY_MODERATE, and the last TRIM_MEMORY_BACKGROUND; from the perspective of AMS's kill mechanism, the last bucket is the easiest to kill. On the next check the scheduleTrimMemory logic fires again and each process's Application receives onTrimMemory.
- For foreground processes, a fgTrimLevel is computed first; it depends on the number of cached processes numCached (for example, numCached <= 5 gives fgTrimLevel = TRIM_MEMORY_RUNNING_LOW). Each process's app.trimMemoryLevel is then updated, and the next time fgTrimLevel becomes more severe than the level already recorded for a process, scheduleTrimMemory is triggered.
So scheduleTrimMemory is not driven by computing actual memory usage; it is inferred indirectly from the number of low-priority cached processes. When the cached-process count drops, memory is considered tight and apps are told to release memory.
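On the app side all of this arrives through ComponentCallbacks2. A minimal handler might look like the sketch below; which caches to drop at each level is of course app-specific:

import android.app.Application;
import android.content.ComponentCallbacks2;

public class MyApplication extends Application {

    @Override
    public void onTrimMemory(int level) {
        super.onTrimMemory(level);
        if (level == ComponentCallbacks2.TRIM_MEMORY_UI_HIDDEN) {
            // UI just became invisible (Home/Back): drop UI-related caches.
        } else if (level >= ComponentCallbacks2.TRIM_MEMORY_BACKGROUND) {
            // One of the three background buckets described above;
            // TRIM_MEMORY_COMPLETE means we are next in line to be killed,
            // so release as much as we can.
        } else if (level >= ComponentCallbacks2.TRIM_MEMORY_RUNNING_MODERATE) {
            // Still in the foreground, but the system is short on memory
            // (the fgTrimLevel path): trim non-critical caches.
        }
    }
}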