What Exactly Is Scheduling Latency?

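The performance metric chosen for this investigation is scheduling latency, so the first goal is to pin down what scheduling latency actually is. In scheduler terms, it is the time interval within which every runnable process is guaranteed to run at least once. Put in terms of a single task: it is the time from the moment a task whose state has become TASK_RUNNING enters a CPU's runqueue until it actually starts executing (is given the CPU).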

Note that the Linux kernel accounts scheduling latency at two levels, per task and per runqueue (rq). Here we focus on the task level.

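The relationship between the runqueue and the scheduler's sched period therefore matters. Start with the scheduling period: it is the time window in which every runnable task gets to run on the CPU once. In Linux CFS this value is not fixed. When the number of runnable tasks is at most 8, the sched period is a fixed 6 ms. When more than 8 tasks are on the runqueue, each task is instead guaranteed to run for a certain minimum time, called the minimum granularity. The CFS default minimum granularity is 0.75 ms, stored in sysctl_sched_min_granularity. The sched period is determined by the following kernel function: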

/*
 * The idea is to set a period in which each task runs once.
 *
 * When there are too many tasks (sched_nr_latency) we have to stretch
 * this period because otherwise the slices get too small.
 *
 * p = (nr <= nl) ? l : l*nr/nl
 */
static u64 __sched_period(unsigned long nr_running)
{
    if (unlikely(nr_running > sched_nr_latency))
        return nr_running * sysctl_sched_min_granularity;
    else
        return sysctl_sched_latency;
}

nr_running is the number of runnable tasks.
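To make the two branches concrete, here is a minimal userspace sketch of the same calculation. It assumes the default values quoted above (6 ms latency, 0.75 ms minimum granularity, threshold of 8 tasks); the constant and function names are made up for this illustration, and a running kernel may use different, tunable values.

```c
#include <stdint.h>

/* Defaults quoted in the text, in nanoseconds. Illustrative constants,
 * not read from a running kernel. */
#define SCHED_LATENCY_NS          6000000ULL  /* 6 ms */
#define SCHED_MIN_GRANULARITY_NS   750000ULL  /* 0.75 ms */
#define SCHED_NR_LATENCY                8UL   /* 6 ms / 0.75 ms */

/* Same branch structure as __sched_period(): the period is stretched
 * once more than SCHED_NR_LATENCY tasks are runnable. */
static uint64_t sched_period_ns(unsigned long nr_running)
{
    if (nr_running > SCHED_NR_LATENCY)
        return nr_running * SCHED_MIN_GRANULARITY_NS;
    return SCHED_LATENCY_NS;
}
```

With these assumed defaults, 4 or 8 runnable tasks give a 6 ms period, while 16 runnable tasks stretch it to 16 * 0.75 ms = 12 ms.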

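This raises a question: isn't this value exactly the scheduling latency? After all, every computation yields a definite scheduling period. The catch is that this period is only used inside the scheduling algorithm. It expresses a target: within this window, every task on the runqueue should be scheduled at least once. It is only a target, though, and in practice it is not met, because the state of the system, the state of each task, each task's slice, and so on change constantly. On every tick the periodic scheduler checks whether the current task's slice has expired and, if so, triggers a preemption. The periodic scheduler itself has limited precision. Leaving hrtick aside, we can check the system's timer frequency: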

$ grep CONFIG_HZ /boot/config-$(uname -r)
# CONFIG_HZ_PERIODIC is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250

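That is only 250 Hz, i.e. one timer interrupt every 4 ms. There is therefore no guarantee that the slice a task actually gets on the CPU matches the slice it is entitled to, let alone that the scheduling period is honored, and that is before wakeups, preemption, and other events are taken into account.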
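The arithmetic behind that claim can be sketched in a few lines. The helper names below are made up for illustration; the only inputs are the HZ value shown above and the 0.75 ms minimum granularity mentioned earlier.

```c
#include <stdint.h>

/* Length of one scheduler tick in microseconds for a given CONFIG_HZ. */
static uint64_t tick_period_us(unsigned int hz)
{
    return 1000000ULL / hz;
}

/* Number of whole ticks that fit inside a slice. A result of 0 means the
 * slice is shorter than one tick, so tick-driven checks alone cannot
 * enforce it precisely. */
static uint64_t ticks_per_slice(uint64_t slice_us, unsigned int hz)
{
    return slice_us / tick_period_us(hz);
}
```

At HZ=250 a tick lasts 4000 us, so a 750 us slice contains no whole tick at all, and even a 6000 us slice contains only one.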

1. How atop collects the metric

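Since the computed value cannot be used directly, the latency has to be measured some other way. The Linux kernel already keeps scheduling-delay statistics for every task. Inside the kernel the term is run delay, and the value is exposed through the proc filesystem. Existing traditional tools, atop for example, therefore most likely take their raw scheduling-delay data from proc.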

Where run delay is found in proc:

Scheduling delay of a process: /proc//schedstat
Scheduling delay of a thread: /proc//task//schedstat
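As a rough illustration of consuming this file, the sketch below parses one schedstat line into its three fields (CPU time, runqueue wait time, and run count, in that order). The struct and function names are hypothetical, and the nanosecond units in the field names are an assumption based on the kernel's scheduler statistics documentation rather than something stated in this article.

```c
#include <stdio.h>

/* Hypothetical holder for the three fields of a schedstat line. */
struct schedstat_sample {
    unsigned long long runtime_ns;   /* time spent on the CPU */
    unsigned long long rundelay_ns;  /* time spent waiting on a runqueue */
    unsigned long      pcount;       /* times scheduled onto a CPU */
};

/* Parse one "runtime rundelay pcount" line.
 * Returns 0 on success, -1 if the line does not hold three numbers. */
static int parse_schedstat_line(const char *line, struct schedstat_sample *out)
{
    int n = sscanf(line, "%llu %llu %lu",
                   &out->runtime_ns, &out->rundelay_ns, &out->pcount);
    return n == 3 ? 0 : -1;
}

/* Read and parse the first line of the schedstat file at `path`.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
static int read_schedstat(const char *path, struct schedstat_sample *out)
{
    char line[256];
    int rc = -1;
    FILE *fp = fopen(path, "r");

    if (!fp)
        return -1;
    if (fgets(line, sizeof line, fp))
        rc = parse_schedstat_line(line, out);
    fclose(fp);
    return rc;
}
```

On a kernel with the statistics enabled, calling read_schedstat("/proc/self/schedstat", &s) fills s with the calling process's own counters.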

The goal now is to work out how atop collects scheduling delay.

atop can already report the scheduling delay of every user-space process and thread: start atop and press the s key, and the RDELAY column shows it. To see how atop obtains this value, clone the atop source:

git@github.com:Atoptool/atop.git

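Since the goal is only to understand how atop collects the scheduling-delay metric, I look only at the code related to that. The display code is of no interest here.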

Overall, atop works roughly like this:

int
main(int argc, char *argv[])
{
···
    // get the interval
    interval = atoi(argv[optind]);

    // start the collection engine
    engine();
···
    return 0;    /* never reached */
}

interval is the sampling interval given when atop is started, i.e. how often atop extracts data.

All of the computation happens in the engine() function.

engine() works as follows:

static void
engine(void)
{
···
    /*
    ** install the signal-handler for ALARM, USR1 and USR2 (triggers
    * for the next sample)
    */
    memset(&sigact, 0, sizeof sigact);
    sigact.sa_handler = getusr1;
    sigaction(SIGUSR1, &sigact, (struct sigaction *)0);
···
    if (interval > 0)
        alarm(interval);
···
    for (sampcnt=0; sampcnt < nsamples; sampcnt++)
    {
···
        if (sampcnt > 0 && awaittrigger)
            pause();
        awaittrigger = 1;
···
        do
        {
            curtlen   = counttasks();    // worst-case value
            curtpres  = realloc(curtpres,
                    curtlen * sizeof(struct tstat));

            ptrverify(curtpres, "Malloc failed for %lu tstats\n",
                                curtlen);

            memset(curtpres, 0, curtlen * sizeof(struct tstat));
        }
        while ( (ntaskpres = photoproc(curtpres, curtlen)) == curtlen);

···
    } /* end of main-loop */
}

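I will not go through the details. The main loop is the for loop over sampcnt, and the scheduling-delay value is actually obtained in photoproc(), called in the condition of the do/while loop. It is passed the task array to fill in and the number of entries in that array.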

Here is where the value is finally computed:

unsigned long
photoproc(struct tstat *tasklist, int maxtask)
{
···
        procschedstat(curtask);        /* from /proc/pid/schedstat */
···
        if (curtask->gen.nthr > 1)
        {
···
            curtask->cpu.rundelay = 0;
···
            /*
            ** open underlying task directory
            */
            if ( chdir("task") == 0 )
            {
···
                while ((tent=readdir(dirtask)) && tval<maxtask)
                {
···
                    curtask->cpu.rundelay +=
                        procschedstat(curthr);
···
                }
···
            }
        }
···
    return tval;
}

The procschedstat() call near the top of that excerpt is the function that reads the proc schedstat file:

static count_t
procschedstat(struct tstat *curtask)
{
    FILE    *fp;
    char    line[4096];
    count_t    runtime, rundelay = 0;
    unsigned long pcount;
    static char *schedstatfile = "schedstat";

    /*
    ** open the schedstat file
    */
    if ( (fp = fopen(schedstatfile, "r")) )
    {
        curtask->cpu.rundelay = 0;

        if (fgets(line, sizeof line, fp))
        {
            sscanf(line, "%llu %llu %lu\n",
                    &runtime, &rundelay, &pcount);

            curtask->cpu.rundelay = rundelay;
        }

        /*
        ** verify if fgets returned NULL due to error i.s.o. EOF
        */
        if (ferror(fp))
            curtask->cpu.rundelay = 0;

        fclose(fp);
    }
    else
    {
        curtask->cpu.rundelay = 0;
    }

    return curtask->cpu.rundelay;
}

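In photoproc(), the curtask->gen.nthr > 1 check tests whether the process has more than one thread. If it does, the scheduling delays of all its threads are added up to give the scheduling delay of the process.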

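Having traced how atop handles scheduling delay, the picture is this: once atop is running, every time the specified interval elapses the main loop reads the proc filesystem once and stores the value. In other words, atop currently extracts scheduling delay by reading the schedstat files under proc every interval seconds.
What atop obtains is therefore a snapshot of the scheduling delay of the processes present at each interval, sampled at a granularity of seconds.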
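The two operations involved here, summing per-thread values and comparing successive snapshots, can be sketched as follows. This is only an illustration under stated assumptions: the function names are invented, and the delta helper shows one way a sampling tool could turn two readings of a cumulative counter into a per-interval figure, not a description of what atop itself does with the stored values.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum the run delays of all threads of one process, mirroring the
 * accumulation photoproc() performs for multi-threaded tasks. */
static uint64_t total_rundelay(const uint64_t *thread_rundelay, size_t nthreads)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < nthreads; i++)
        sum += thread_rundelay[i];
    return sum;
}

/* Delay accumulated between two snapshots of the same cumulative counter.
 * Returns 0 if the counter appears to have gone backwards, for example
 * because a thread exited between the two samples. */
static uint64_t rundelay_delta(uint64_t prev, uint64_t cur)
{
    return cur >= prev ? cur - prev : 0;
}
```

For example, three threads with delays 100, 250 and 50 give a process total of 400, and snapshots of 400 and then 1000 mean 600 units of waiting were added during that interval.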

2. What proc does underneath: the task level

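The source of the data is now clear: it comes from proc, and everything in proc is accounted by the kernel itself while it runs. The next goal is therefore to find out how the kernel accounts the scheduling delay of each task, which means locating the place in the kernel where proc produces this value.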

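The method is simple: write a small program that reads a schedstat file and trace it with ftrace. The trace shows which function generates the data in the schedstat file. The ftrace output: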

2)   0.125 us    |            single_start();  
2)               |            proc_single_show() {  
2)               |              get_pid_task() {  
2)   0.124 us    |                rcu_read_unlock_strict();  
2)   0.399 us    |              }  
2)               |              proc_pid_schedstat() {  
2)               |                seq_printf() {  
2)   1.145 us    |                  seq_vprintf();  
2)   1.411 us    |                }  
2)   1.722 us    |              }  
2)   2.599 us    |            }

It is easy to see that the function we are after is proc_pid_schedstat():

#ifdef CONFIG_SCHED_INFO
/*
 * Provides /proc/PID/schedstat
 */
static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
                              struct pid *pid, struct task_struct *task)
{
    if (unlikely(!sched_info_on()))
        seq_puts(m, "0 0 0\n");
    else
        seq_printf(m, "%llu %llu %lu\n",
                   (unsigned long long)task->se.sum_exec_runtime,
                   (unsigned long long)task->sched_info.run_delay,
                   task->sched_info.pcount);

    return 0;
}
#endif

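The sched_info_on() check tests a feature controlled by a kernel configuration option. It is normally enabled by default. If the schedstat file shows non-zero values, it is enabled; you can also look up the state of the option with make menuconfig.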

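proc takes the scheduling-delay value straight from run_delay in the sched_info member of the task_struct passed in. It is read in one go, with no averaging or other processing, so it too is snapshot-style data.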

First, the sched_info structure:

struct sched_info {
#ifdef CONFIG_SCHED_INFO
    /* Cumulative counters: */

    /* # of times we have run on this CPU: */
    unsigned long            pcount;

    /* Time spent waiting on a runqueue: */
    unsigned long long        run_delay;

    /* Timestamps: */

    /* When did we last run on a CPU? */
    unsigned long long        last_arrival;

    /* When were we last queued to run? */
    unsigned long long        last_queued;

#endif /* CONFIG_SCHED_INFO */
};

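It is guarded by the same macro as the proc function above, so this macro is very likely what enables the kernel's per-task scheduling statistics. The code comments explain each field clearly enough. The kernel's own definition of scheduling delay is the time spent waiting on a runqueue.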

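The goal now becomes finding out how the kernel computes the run_delay field. Looking back at sched_info, the last two fields, last_arrival and last_queued, are the timestamps the calculation relies on. This is also where the Linux scheduler framework and the CFS scheduler come in. The first step is to collect the functions involved in per-task scheduling statistics, i.e. to see which functions CONFIG_SCHED_INFO wraps and find where they are declared. They live in kernel/sched/stats.h.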

The functions involved are:

sched_info_queued(rq, t)
sched_info_reset_dequeued(t)
sched_info_dequeued(rq, t)
sched_info_depart(rq, t)
sched_info_arrive(rq, next)
sched_info_switch(rq, t, next)

As an aside, the functions that account scheduling delay at the rq level are:

rq_sched_info_arrive()
rq_sched_info_dequeued()
rq_sched_info_depart()

Note that these functions only collect scheduling statistics. Reading their code, the three that matter for scheduling delay are:

sched_info_depart(rq, t)
sched_info_queued(rq, t)
sched_info_arrive(rq, next)

They are always called at the key points in scheduling:

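1. Entering the runqueue
A task switches from another state (sleeping, uninterruptible, and so on) to runnable; this is the moment it starts waiting on the runqueue.

2. Being scheduled off the CPU and then entering the runqueue
A task moves from one CPU's runqueue to another CPU's runqueue; the start time is updated to the moment it enters the new runqueue.
A running task is taken off the CPU involuntarily and put on the runqueue; this is the moment it starts waiting.

3. A new task is created and enters the runqueue.

4. Being scheduled onto the CPU
When the task is picked from the runqueue to run on a CPU, last_arrival is updated.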

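Put simply, a task is either going onto the CPU or coming off it, and coming off the CPU while still in the TASK_RUNNING state is exactly the moment of entering the runqueue.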

Every path into the runqueue eventually calls sched_info_queued(), and case 2 goes through sched_info_depart() first:

static inline void sched_info_depart(struct rq *rq, struct task_struct *t)
{
    unsigned long long delta = rq_clock(rq) - t->sched_info.last_arrival;

    rq_sched_info_depart(rq, delta);

    if (t->state == TASK_RUNNING)
        sched_info_queued(rq, t);
}

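The first statement computes how long the task has just spent on the CPU: the current time minus last_arrival (the time it was last scheduled onto the CPU). The result is passed to rq_sched_info_depart().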

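In case 2, the t->state == TASK_RUNNING check is what matters. If the task is still in the TASK_RUNNING state at this point, it was taken off the CPU involuntarily and is once again waiting on the runqueue. Tasks in other states are not counted because they cannot be on a runqueue. A task waiting for I/O, for example, only enters the runqueue once its wait is over, and that is case 1 again. In case 1 the task goes straight to sched_info_queued(). Either way the task has entered the runqueue, and sched_info_queued() records the timestamp of this latest entry (i.e. now) in last_queued.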

The code of sched_info_queued() is:

static inline void sched_info_queued(struct rq *rq, struct task_struct *t)
{
    if (unlikely(sched_info_on())) {
        if (!t->sched_info.last_queued)
            t->sched_info.last_queued = rq_clock(rq);
    }
}

Then comes the last key point: when the task is scheduled onto the CPU, sched_info_arrive() is called:

static void sched_info_arrive(struct rq *rq, struct task_struct *t)
{
    unsigned long long now = rq_clock(rq), delta = 0;

    if (t->sched_info.last_queued)
        delta = now - t->sched_info.last_queued;
    sched_info_reset_dequeued(t);
    t->sched_info.run_delay += delta;
    t->sched_info.last_arrival = now;
    t->sched_info.pcount++;

    rq_sched_info_arrive(rq, delta);
}

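Now the scheduling delay can be computed. If a last_queued timestamp was recorded, the current timestamp minus that timestamp is the delay the task has just experienced. It is added to the run_delay field, the time of arrival on the CPU is stored in last_arrival, and pcount counts how many times the task has been scheduled onto a CPU.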

The formula is:

scheduling delay of the task = timestamp at which the task is scheduled onto the CPU - last_queued (timestamp at which the task last entered the runqueue)
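The bookkeeping described above can be replayed in userspace with a small model. This is a sketch, not kernel code: the struct reuses the field names of sched_info, the model_* functions are hypothetical stand-ins for sched_info_queued(), sched_info_arrive() and sched_info_depart(), timestamps are plain integers supplied by the caller, and 0 is reserved to mean "not currently waiting".

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's struct sched_info. */
struct sched_info_model {
    unsigned long pcount;   /* times scheduled onto a CPU */
    uint64_t run_delay;     /* cumulative time spent waiting on a runqueue */
    uint64_t last_arrival;  /* when the task last got the CPU */
    uint64_t last_queued;   /* when the task entered a runqueue; 0 = not waiting */
};

/* Task enters a runqueue at `now`: record the timestamp only once. */
static void model_queued(struct sched_info_model *si, uint64_t now)
{
    if (!si->last_queued)
        si->last_queued = now;
}

/* Task is scheduled onto the CPU at `now`: account the wait. */
static void model_arrive(struct sched_info_model *si, uint64_t now)
{
    uint64_t delta = 0;

    if (si->last_queued)
        delta = now - si->last_queued;
    si->last_queued = 0;    /* the effect of sched_info_reset_dequeued() */
    si->run_delay += delta;
    si->last_arrival = now;
    si->pcount++;
}

/* Task leaves the CPU at `now`. If it is still runnable it starts waiting
 * again immediately. Returns the time it just spent on the CPU. */
static uint64_t model_depart(struct sched_info_model *si, uint64_t now,
                             int still_runnable)
{
    uint64_t ran = now - si->last_arrival;

    if (still_runnable)
        model_queued(si, now);
    return ran;
}
```

Starting from a zeroed struct, queueing at time 100 and arriving at time 250 yields run_delay = 150. If the task is then preempted at 400 while still runnable and runs again at 430, run_delay grows to 180, which shows that the field is a running total of every wait rather than the length of the latest one.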

Reviewing editor: Peng Jing
