内核小惊慌OOPS，我也有点小惊慌

Posted Aug 8, 2020

By 9 min read

内核小惊慌OOPS，我也有点小惊慌

8月6号的时候一个土耳其的内核用户说我的内核打不开cpu热询轮，而我自己也发现devcheck和aida64等软件也无法正常使用，卡屏。

一个良好的内核应该不会出现死锁，软锁或随机重启等问题，Linux kernel每天都在不断的更新，出现内核惊慌也是可理解之事。因此，Linux kernel才会不断的更新版本，android设备对应的longterm也会不断更新小版本，例如4.9.z更新至4.9.z+200……

我们首先假设你开启了pstore，偶尔发现手机随机重启问题，开机速度变慢或者app打开卡屏。

我们可以通过cat /proc/last_kmsg获取内核日志，或者dmesg获取当前ring缓冲日志。发现内核出现了call trace，例如这个：kernel issue: call trace: can’t open app cpu throttling

[   32.602658] Workqueue: thermal_passive_wq thermal_zone_device_check
[   32.602661] Call trace:
[   32.602670] [<0000000030b1c339>] __switch_to+0xd4/0xf0
[   32.602677] [<000000001e413388>] __schedule+0x940/0xc04
[   32.602680] [<000000001fc8480f>] schedule+0x70/0x90
[   32.602683] [<000000008c461601>] schedule_timeout+0x3c/0x4bc
[   32.602686] [<000000003f40c10d>] wait_for_common+0xe8/0x198
[   32.602689] [<000000000ee6a67b>] wait_for_completion+0x14/0x1c
[   32.602693] [<00000000baeabb64>] flush_work+0x250/0x2b4
[   32.602695] [<00000000a08a3d80>] __cancel_work_timer+0x124/0x1cc
[   32.602698] [<000000004513e14c>] cancel_delayed_work_sync+0x10/0x1c
[   32.602702] [<000000009f8c58d6>] handle_thermal_trip+0x2c4/0x2e0
[   32.602705] [<000000006925b5a8>] thermal_zone_device_update+0x14c/0x190
[   32.602708] [<00000000bc6fdb3b>] thermal_zone_device_check+0x14/0x1c
//..
[   32.620108] [<00000000be0dc30a>] mutex_lock+0x44/0x48
[   32.620115] [<000000005aaf70e6>] of_thermal_set_mode+0x28/0x78
[   32.621052] [<00000000be0dc30a>] mutex_lock+0x44/0x48
[   32.621059] [<000000002890db9c>] thermal_zone_get_temp+0x44/0x120
//....
[   32.621588] sysrq: SysRq : Show backtrace of all active CPUs
[   32.621643] sysrq: CPU0:
[   32.621645] Call trace:
[   32.621656] [<00000000bd0ae753>] dump_backtrace+0x0/0x3fc
[   32.621659] [<000000005bbb9a50>] show_stack+0x14/0x1c
[   32.621665] [<0000000034e72ea2>] showacpu+0x64/0x7c
[   32.621671] [<00000000f20db29a>] flush_smp_call_function_queue+0xc4/0x164
[   32.621674] [<00000000771999e5>] generic_smp_call_function_single_interrupt+0x10/0x18
[   32.621678] [<000000007a81dcee>] handle_IPI+0x150/0x2d8
[   32.621681] [<000000001f1aa1f8>] gic_handle_irq+0x110/0x1ac
[   32.621684] Exception stack(0xffffffc48328bec0 to 0xffffffc48328bff0)

这个时候就要留意Workqueue（WQ）中的什么函数在运行。根据函数名和地址溯源，我们可以发现内核在WQ链表进行了thermal-core进行延时取消工作的操作。

然后我们来分析一下

什么，居然还上了两个锁！！这不被阻塞死了么。怪不得devcheck无法打开，aida64无法打开检测温度界面。

[   32.620108] [<00000000be0dc30a>] mutex_lock+0x44/0x48
[   32.620115] [<000000005aaf70e6>] of_thermal_set_mode+0x28/0x78
[   32.621052] [<00000000be0dc30a>] mutex_lock+0x44/0x48
[   32.621059] [<000000002890db9c>] thermal_zone_get_temp+0x44/0x120

我们从有两个互斥锁的[32.620108]分析，一切从从这里开始。

        
      
static int of_thermal_set_mode(struct thermal_zone_device *tz,
             enum thermal_device_mode mode)
{
  struct __thermal_zone *data = tz->devdata;

  mutex_lock(&tz->lock);
//...

  mutex_unlock(&tz->lock);
  thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);

  return 0;
}

到了thermal_zone_device_update()

        
      
void thermal_zone_device_update(struct thermal_zone_device *tz)
{
//....
    /* 处理触发点，这里面就会调到具体的 governor */
              handle_thermal_trip(tz, count);
}

到了handle_thermal_trip();，还记得call trace有个[ 32.602702] [<000000009f8c58d6>] handle_thermal_trip+0x2c4/0x2e0

        
      
static void handle_thermal_trip(struct thermal_zone_device *tz, int trip)
{
  enum thermal_trip_type type;

//...

  monitor_thermal_zone(tz);
  trace_thermal_handle_trip(tz, trip);
}

到了 monitor_thermal_zone()

        
      
static void monitor_thermal_zone(struct thermal_zone_device *tz)
{
  mutex_lock(&tz->lock);

  if (tz->passive)
    thermal_zone_device_set_polling(thermal_passive_wq,
            tz, tz->passive_delay);
  else if (tz->polling_delay)
    thermal_zone_device_set_polling(
        system_freezable_power_efficient_wq,
        tz, tz->polling_delay);
  else
    thermal_zone_device_set_polling(NULL, tz, 0);

  mutex_unlock(&tz->lock);
}

到了 thermal_zone_device_set_polling()

        
      
static void thermal_zone_device_set_polling(struct workqueue_struct *queue,
              struct thermal_zone_device *tz,
              int delay)
{
  if (delay > 1000)
    mod_delayed_work(queue, &tz->poll_queue,
         round_jiffies(msecs_to_jiffies(delay)));
  else if (delay)
    mod_delayed_work(queue, &tz->poll_queue,
         msecs_to_jiffies(delay));
  else
    cancel_delayed_work_sync(&tz->poll_queue);
}

什么，你这又回到了thermal_zone_device_set_polling()，然而thermal_zone_device_set_polling()里面调用cancel_delayed_work_sync(&tz->poll_queue)是进行延时取消工作。

函数cancel_delayed_work_sync（）是让函数cancel_work_sync()延时进行，返回bool类型。

/*
 *cancel_delayed_work_sync - cancel a delayed work and wait for it to finish
 *@dwork: the delayed work cancel
 *
 *This is cancel_work_sync() for delayed works.
 *
 *RETURNS:
 *%true if @dwork was pending, %false otherwise.
 *返回1则延时进行延时取消工作操作，0则是反之
*/

第一次WQ处理cancel_delayed_work_sync(&tz->poll_queue)线程，cancel_delayed_work_sync（）让WQ延时调用cancel_work_sync()处理工作，接着第二次从of_thermal_set_mode()开始执行monitor_thermal_zone()又执行了一次thermal_zone_device_set_polling()，而thermal_zone_device_set_polling()里又再次添加了cancel_delayed_work()线程到WQ工作队列。由于monitor_thermal_zone()函数内部上锁了，只能thermal_zone_device_set_polling(NULL, tz, 0);执行完毕才能解锁。。。但是上锁了，Workqueue（WQ）中其他任务cancel_work_sync()是不能被执行的，线程于是在这里阻塞了。

当时我在想，直接把thermal_zone_device_set_polling()函数里面的cancel_delayed_work_sync（）延时取消工作函数改为cancel_delayed_work（）就应该可以了吧，这样就一步就可以取消了延时。

接着我还发现一系列的问题，

设备冷启动函数执行顺序如下，来自我的想象

        
      
手机热启动

//...

thermal_zone_device_set_polling(NULL, tz, 0);
cancel_delayed_work_sync(&tz->poll_queue);
monitor_thermal_zone(struct thermal_zone_device *tz)
mutex_lock(&tz->lock);
thermal_zone_device_set_polling(NULL, tz, 0);
cancel_delayed_work_sync(&tz->poll_queue);
mutex_unlock(&tz->lock)

//...

static int of_thermal_set_mode(struct thermal_zone_device *tz,enum thermal_device_mode mode)
thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);
static void handle_thermal_trip(struct thermal_zone_device *tz, int trip)
static void monitor_thermal_zone(struct thermal_zone_device *tz)
//...

由于CPU是多线程，以下两个主要函数同时两个不同核心进行，这两个函数里面都有mutex_lock(&tz->lock);
static void monitor_thermal_zone(struct thermal_zone_device *tz)
int thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)

  mutex_lock(&tz->lock);
//...

        
      
这里与上面的·thermal_zone_get_temp()同步进行
static void monitor_thermal_zone(struct thermal_zone_device *tz)
mutex_lock(&tz->lock);
thermal_zone_device_set_polling(NULL, tz, 0);

当时我在想这里有两个锁多线程进行，这不就是锁的竞争么,这是死锁。

后续我还注意到连续对8个CPU都进行了中断操作，然后每个CPU发现死锁后立即中断了thermal-core在WQ的队列，并且dump backstack。

于是我断定代码发生了死锁。

我们可以注意到每一个CPU都有sysrq的迹象，因为thermal-core，导致其他core也跟着中断起来。

//....
[   32.622139] sysrq: CPU5:
[   32.622140] Call trace:
//....
[   32.622119] [<000000009e18ebc9>] pfk_load_key_start+0xb0/0x1e0
[   32.622123] [<0000000092c77623>] qcom_ice_config_start+0x6c/0x2ac
//....
[   32.622169] sysrq: CPU7:
[   32.622170] Call trace:
//....

//....

于是便去上游寻找修复补丁。

于是我相关字眼去搜索Linux上游，看看有啥修复的，结果发现了thermal: Fix deadlock in thermal thermal_zone_device_check，还一下子修复了两个问题(修改后会产生另外一个免费使用问题])，很厉害，原来上面的问题是由上游免费使用问题的提交导致的。。。

当然，因为修改引入新问题我没有注意到，有时候自己一个人是很难发现问题所在的，因为当时我并不了解thermal是如何工作的，直到阅读魅族的帖子才得知，

依靠上游或者开源社区的提交是最好的解决方法。

最后，问题使用git cherry-pick解决了。

thermal: Fix deadlock in thermal thermal_zone_device_check

后来，我发现该上游的v4.9.203修复了这个错误，所以以后必须要养成给linux设备打linux-stable的习惯。

Blog

This post is licensed under CC BY 4.0 by the author.

内核小惊慌OOPS，我也有点小惊慌

Trending Tags