
Commit 3270476

workqueue: reimplement WQ_HIGHPRI using a separate worker_pool
WQ_HIGHPRI was implemented by queueing highpri work items at the head of the global worklist. Other than queueing at the head, they weren't handled differently; unfortunately, this could lead to execution latency of a few seconds on heavily loaded systems.

Now that workqueue code has been updated to deal with multiple worker_pools per global_cwq, this patch reimplements WQ_HIGHPRI using a separate worker_pool. NR_WORKER_POOLS is bumped to two and gcwq->pools[0] is used for normal pri work items and ->pools[1] for highpri. Highpri workers get a -20 nice level and have an 'H' suffix in their names. Note that this change increases the number of kworkers per cpu.

POOL_HIGHPRI_PENDING, pool_determine_ins_pos() and the highpri chain wakeup code in process_one_work() are no longer used and are removed.

This allows proper prioritization of highpri work items and removes the high execution latency of highpri work items.

v2: nr_running indexing bug in get_pool_nr_running() fixed.

v3: Refreshed for the get_pool_nr_running() update in the previous patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josh Hunt <joshhunt00@gmail.com>
LKML-Reference: <CAKA=qzaHqwZ8eqpLNFjxnO2fX-tgAOjmpvxgBFjv6dJeQaOW1w@mail.gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
1 parent 4ce62e9 commit 3270476
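To illustrate the user-visible effect, here is a minimal driver-side sketch (a hypothetical module with invented names, not part of this commit) of requesting highpri service. After this patch, work queued this way is executed by the separate highpri pool's workers, which run at nice -20 and carry an 'H' suffix in their kworker names, rather than being queued at the head of the shared worklist:

/* Illustrative only -- a hypothetical module, not part of this commit. */
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *hp_wq;
static struct work_struct hp_work;

static void hp_work_fn(struct work_struct *work)
{
        /* Served by a highpri kworker (e.g. "kworker/0:1H") at nice -20. */
        pr_info("highpri work executed\n");
}

static int __init hp_example_init(void)
{
        /* WQ_HIGHPRI now selects the gcwq's highpri worker_pool. */
        hp_wq = alloc_workqueue("hp_example", WQ_HIGHPRI, 0);
        if (!hp_wq)
                return -ENOMEM;

        INIT_WORK(&hp_work, hp_work_fn);
        queue_work(hp_wq, &hp_work);
        return 0;
}

static void __exit hp_example_exit(void)
{
        flush_workqueue(hp_wq);
        destroy_workqueue(hp_wq);
}

module_init(hp_example_init);
module_exit(hp_example_exit);
MODULE_LICENSE("GPL");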

2 files changed (+65, -138 lines)


Documentation/workqueue.txt

+38, -65
@@ -89,52 +89,55 @@ called thread-pools.
 
 The cmwq design differentiates between the user-facing workqueues that
 subsystems and drivers queue work items on and the backend mechanism
-which manages thread-pool and processes the queued work items.
+which manages thread-pools and processes the queued work items.
 
 The backend is called gcwq. There is one gcwq for each possible CPU
-and one gcwq to serve work items queued on unbound workqueues.
+and one gcwq to serve work items queued on unbound workqueues. Each
+gcwq has two thread-pools - one for normal work items and the other
+for high priority ones.
 
 Subsystems and drivers can create and queue work items through special
 workqueue API functions as they see fit. They can influence some
 aspects of the way the work items are executed by setting flags on the
 workqueue they are putting the work item on. These flags include
-things like CPU locality, reentrancy, concurrency limits and more. To
-get a detailed overview refer to the API description of
+things like CPU locality, reentrancy, concurrency limits, priority and
+more. To get a detailed overview refer to the API description of
 alloc_workqueue() below.
 
-When a work item is queued to a workqueue, the target gcwq is
-determined according to the queue parameters and workqueue attributes
-and appended on the shared worklist of the gcwq. For example, unless
-specifically overridden, a work item of a bound workqueue will be
-queued on the worklist of exactly that gcwq that is associated to the
-CPU the issuer is running on.
+When a work item is queued to a workqueue, the target gcwq and
+thread-pool is determined according to the queue parameters and
+workqueue attributes and appended on the shared worklist of the
+thread-pool. For example, unless specifically overridden, a work item
+of a bound workqueue will be queued on the worklist of either normal
+or highpri thread-pool of the gcwq that is associated to the CPU the
+issuer is running on.
 
 For any worker pool implementation, managing the concurrency level
 (how many execution contexts are active) is an important issue. cmwq
 tries to keep the concurrency at a minimal but sufficient level.
 Minimal to save resources and sufficient in that the system is used at
 its full capacity.
 
-Each gcwq bound to an actual CPU implements concurrency management by
-hooking into the scheduler. The gcwq is notified whenever an active
-worker wakes up or sleeps and keeps track of the number of the
-currently runnable workers. Generally, work items are not expected to
-hog a CPU and consume many cycles. That means maintaining just enough
-concurrency to prevent work processing from stalling should be
-optimal. As long as there are one or more runnable workers on the
-CPU, the gcwq doesn't start execution of a new work, but, when the
-last running worker goes to sleep, it immediately schedules a new
-worker so that the CPU doesn't sit idle while there are pending work
-items. This allows using a minimal number of workers without losing
-execution bandwidth.
+Each thread-pool bound to an actual CPU implements concurrency
+management by hooking into the scheduler. The thread-pool is notified
+whenever an active worker wakes up or sleeps and keeps track of the
+number of the currently runnable workers. Generally, work items are
+not expected to hog a CPU and consume many cycles. That means
+maintaining just enough concurrency to prevent work processing from
+stalling should be optimal. As long as there are one or more runnable
+workers on the CPU, the thread-pool doesn't start execution of a new
+work, but, when the last running worker goes to sleep, it immediately
+schedules a new worker so that the CPU doesn't sit idle while there
+are pending work items. This allows using a minimal number of workers
+without losing execution bandwidth.
 
 Keeping idle workers around doesn't cost other than the memory space
 for kthreads, so cmwq holds onto idle ones for a while before killing
 them.
 
 For an unbound wq, the above concurrency management doesn't apply and
-the gcwq for the pseudo unbound CPU tries to start executing all work
-items as soon as possible. The responsibility of regulating
+the thread-pools for the pseudo unbound CPU try to start executing all
+work items as soon as possible. The responsibility of regulating
 concurrency level is on the users. There is also a flag to mark a
 bound wq to ignore the concurrency management. Please refer to the
 API section for details.
@@ -205,31 +208,22 @@ resources, scheduled and executed.
 
 WQ_HIGHPRI
 
-Work items of a highpri wq are queued at the head of the
-worklist of the target gcwq and start execution regardless of
-the current concurrency level. In other words, highpri work
-items will always start execution as soon as execution
-resource is available.
+Work items of a highpri wq are queued to the highpri
+thread-pool of the target gcwq. Highpri thread-pools are
+served by worker threads with elevated nice level.
 
-Ordering among highpri work items is preserved - a highpri
-work item queued after another highpri work item will start
-execution after the earlier highpri work item starts.
-
-Although highpri work items are not held back by other
-runnable work items, they still contribute to the concurrency
-level. Highpri work items in runnable state will prevent
-non-highpri work items from starting execution.
-
-This flag is meaningless for unbound wq.
+Note that normal and highpri thread-pools don't interact with
+each other. Each maintain its separate pool of workers and
+implements concurrency management among its workers.
 
 WQ_CPU_INTENSIVE
 
 Work items of a CPU intensive wq do not contribute to the
 concurrency level. In other words, runnable CPU intensive
-work items will not prevent other work items from starting
-execution. This is useful for bound work items which are
-expected to hog CPU cycles so that their execution is
-regulated by the system scheduler.
+work items will not prevent other work items in the same
+thread-pool from starting execution. This is useful for bound
+work items which are expected to hog CPU cycles so that their
+execution is regulated by the system scheduler.
 
 Although CPU intensive work items don't contribute to the
 concurrency level, start of their executions is still
@@ -239,14 +233,6 @@ resources, scheduled and executed.
 
 This flag is meaningless for unbound wq.
 
-WQ_HIGHPRI | WQ_CPU_INTENSIVE
-
-This combination makes the wq avoid interaction with
-concurrency management completely and behave as a simple
-per-CPU execution context provider. Work items queued on a
-highpri CPU-intensive wq start execution as soon as resources
-are available and don't affect execution of other work items.
-
 @max_active:
 
 @max_active determines the maximum number of execution contexts per
@@ -328,20 +314,7 @@ If @max_active == 2,
 35             w2 wakes up and finishes
 
 Now, let's assume w1 and w2 are queued to a different wq q1 which has
-WQ_HIGHPRI set,
-
-TIME IN MSECS  EVENT
-0              w1 and w2 start and burn CPU
-5              w1 sleeps
-10             w2 sleeps
-10             w0 starts and burns CPU
-15             w0 sleeps
-15             w1 wakes up and finishes
-20             w2 wakes up and finishes
-25             w0 wakes up and burns CPU
-30             w0 finishes
-
-If q1 has WQ_CPU_INTENSIVE set,
+WQ_CPU_INTENSIVE set,
 
 TIME IN MSECS  EVENT
 0              w0 starts and burns CPU
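The pool-selection rule documented above can be made concrete with a small userspace model (illustrative only, not kernel code; the names and flag value are invented for the sketch): each per-CPU gcwq carries a normal and a highpri thread-pool, and a workqueue's WQ_HIGHPRI flag simply picks which of the two receives its work items.

/* Simplified userspace model of the documented behaviour -- not kernel code. */
#include <stdio.h>

#define WQ_HIGHPRI 0x1                  /* stand-in for the real flag */

struct pool {
        const char *name;
        int nice;
};

struct gcwq {
        struct pool pools[2];           /* [0] = normal, [1] = highpri */
};

/* A highpri workqueue targets pools[1] of the issuing CPU's gcwq. */
static struct pool *target_pool(struct gcwq *gcwq, unsigned int wq_flags)
{
        int idx = !!(wq_flags & WQ_HIGHPRI);
        return &gcwq->pools[idx];
}

int main(void)
{
        struct gcwq cpu0 = { .pools = {
                { .name = "kworker/0:*",  .nice = 0   },
                { .name = "kworker/0:*H", .nice = -20 },
        } };
        struct pool *p = target_pool(&cpu0, WQ_HIGHPRI);

        printf("highpri work served by %s at nice %d\n", p->name, p->nice);
        return 0;
}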

kernel/workqueue.c

+27, -73
@@ -52,7 +52,6 @@ enum {
         /* pool flags */
         POOL_MANAGE_WORKERS   = 1 << 0,  /* need to manage workers */
         POOL_MANAGING_WORKERS = 1 << 1,  /* managing workers */
-        POOL_HIGHPRI_PENDING  = 1 << 2,  /* highpri works on queue */
 
         /* worker flags */
         WORKER_STARTED        = 1 << 0,  /* started */
@@ -74,7 +73,7 @@ enum {
         TRUSTEE_RELEASE = 3,             /* release workers */
         TRUSTEE_DONE    = 4,             /* trustee is done */
 
-        NR_WORKER_POOLS = 1,             /* # worker pools per gcwq */
+        NR_WORKER_POOLS = 2,             /* # worker pools per gcwq */
 
         BUSY_WORKER_HASH_ORDER = 6,      /* 64 pointers */
         BUSY_WORKER_HASH_SIZE = 1 << BUSY_WORKER_HASH_ORDER,
@@ -95,6 +94,7 @@ enum {
          * all cpus. Give -20.
          */
         RESCUER_NICE_LEVEL = -20,
+        HIGHPRI_NICE_LEVEL = -20,
 };
 
 /*
@@ -174,7 +174,7 @@ struct global_cwq {
         struct hlist_head busy_hash[BUSY_WORKER_HASH_SIZE];
                                          /* L: hash of busy workers */
 
-        struct worker_pool pool;         /* the worker pools */
+        struct worker_pool pools[2];     /* normal and highpri pools */
 
         struct task_struct *trustee;     /* L: for gcwq shutdown */
         unsigned int trustee_state;      /* L: trustee state */
@@ -277,7 +277,8 @@ EXPORT_SYMBOL_GPL(system_nrt_freezable_wq);
 #include <trace/events/workqueue.h>
 
 #define for_each_worker_pool(pool, gcwq)                                \
-        for ((pool) = &(gcwq)->pool; (pool); (pool) = NULL)
+        for ((pool) = &(gcwq)->pools[0];                                \
+             (pool) < &(gcwq)->pools[NR_WORKER_POOLS]; (pool)++)
 
 #define for_each_busy_worker(worker, i, pos, gcwq)                      \
         for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++)                     \
@@ -473,6 +474,11 @@ static atomic_t unbound_pool_nr_running[NR_WORKER_POOLS] = {
 
 static int worker_thread(void *__worker);
 
+static int worker_pool_pri(struct worker_pool *pool)
+{
+        return pool - pool->gcwq->pools;
+}
+
 static struct global_cwq *get_gcwq(unsigned int cpu)
 {
         if (cpu != WORK_CPU_UNBOUND)
@@ -484,7 +490,7 @@ static struct global_cwq *get_gcwq(unsigned int cpu)
 static atomic_t *get_pool_nr_running(struct worker_pool *pool)
 {
         int cpu = pool->gcwq->cpu;
-        int idx = 0;
+        int idx = worker_pool_pri(pool);
 
         if (cpu != WORK_CPU_UNBOUND)
                 return &per_cpu(pool_nr_running, cpu)[idx];
@@ -586,15 +592,14 @@ static struct global_cwq *get_work_gcwq(struct work_struct *work)
 }
 
 /*
- * Policy functions. These define the policies on how the global
- * worker pool is managed. Unless noted otherwise, these functions
- * assume that they're being called with gcwq->lock held.
+ * Policy functions. These define the policies on how the global worker
+ * pools are managed. Unless noted otherwise, these functions assume that
+ * they're being called with gcwq->lock held.
  */
 
 static bool __need_more_worker(struct worker_pool *pool)
 {
-        return !atomic_read(get_pool_nr_running(pool)) ||
-                (pool->flags & POOL_HIGHPRI_PENDING);
+        return !atomic_read(get_pool_nr_running(pool));
 }
 
 /*
@@ -621,9 +626,7 @@ static bool keep_working(struct worker_pool *pool)
 {
         atomic_t *nr_running = get_pool_nr_running(pool);
 
-        return !list_empty(&pool->worklist) &&
-                (atomic_read(nr_running) <= 1 ||
-                 (pool->flags & POOL_HIGHPRI_PENDING));
+        return !list_empty(&pool->worklist) && atomic_read(nr_running) <= 1;
 }
 
 /* Do we need a new worker? Called from manager. */
@@ -891,43 +894,6 @@ static struct worker *find_worker_executing_work(struct global_cwq *gcwq,
                                                  work);
 }
 
-/**
- * pool_determine_ins_pos - find insertion position
- * @pool: pool of interest
- * @cwq: cwq a work is being queued for
- *
- * A work for @cwq is about to be queued on @pool, determine insertion
- * position for the work. If @cwq is for HIGHPRI wq, the work is
- * queued at the head of the queue but in FIFO order with respect to
- * other HIGHPRI works; otherwise, at the end of the queue. This
- * function also sets POOL_HIGHPRI_PENDING flag to hint @pool that
- * there are HIGHPRI works pending.
- *
- * CONTEXT:
- * spin_lock_irq(gcwq->lock).
- *
- * RETURNS:
- * Pointer to inserstion position.
- */
-static inline struct list_head *pool_determine_ins_pos(struct worker_pool *pool,
-                                        struct cpu_workqueue_struct *cwq)
-{
-        struct work_struct *twork;
-
-        if (likely(!(cwq->wq->flags & WQ_HIGHPRI)))
-                return &pool->worklist;
-
-        list_for_each_entry(twork, &pool->worklist, entry) {
-                struct cpu_workqueue_struct *tcwq = get_work_cwq(twork);
-
-                if (!(tcwq->wq->flags & WQ_HIGHPRI))
-                        break;
-        }
-
-        pool->flags |= POOL_HIGHPRI_PENDING;
-        return &twork->entry;
-}
-
 /**
  * insert_work - insert a work into gcwq
  * @cwq: cwq @work belongs to
@@ -1068,7 +1034,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
         if (likely(cwq->nr_active < cwq->max_active)) {
                 trace_workqueue_activate_work(work);
                 cwq->nr_active++;
-                worklist = pool_determine_ins_pos(cwq->pool, cwq);
+                worklist = &cwq->pool->worklist;
         } else {
                 work_flags |= WORK_STRUCT_DELAYED;
                 worklist = &cwq->delayed_works;
@@ -1385,6 +1351,7 @@ static struct worker *create_worker(struct worker_pool *pool, bool bind)
 {
         struct global_cwq *gcwq = pool->gcwq;
         bool on_unbound_cpu = gcwq->cpu == WORK_CPU_UNBOUND;
+        const char *pri = worker_pool_pri(pool) ? "H" : "";
         struct worker *worker = NULL;
         int id = -1;
 
@@ -1406,15 +1373,17 @@ static struct worker *create_worker(struct worker_pool *pool, bool bind)
 
         if (!on_unbound_cpu)
                 worker->task = kthread_create_on_node(worker_thread,
-                                                      worker,
-                                                      cpu_to_node(gcwq->cpu),
-                                                      "kworker/%u:%d", gcwq->cpu, id);
+                                        worker, cpu_to_node(gcwq->cpu),
+                                        "kworker/%u:%d%s", gcwq->cpu, id, pri);
         else
                 worker->task = kthread_create(worker_thread, worker,
-                                              "kworker/u:%d", id);
+                                              "kworker/u:%d%s", id, pri);
         if (IS_ERR(worker->task))
                 goto fail;
 
+        if (worker_pool_pri(pool))
+                set_user_nice(worker->task, HIGHPRI_NICE_LEVEL);
+
         /*
          * A rogue worker will become a regular one if CPU comes
          * online later on. Make sure every worker has
@@ -1761,10 +1730,9 @@ static void cwq_activate_first_delayed(struct cpu_workqueue_struct *cwq)
 {
         struct work_struct *work = list_first_entry(&cwq->delayed_works,
                                                     struct work_struct, entry);
-        struct list_head *pos = pool_determine_ins_pos(cwq->pool, cwq);
 
         trace_workqueue_activate_work(work);
-        move_linked_works(work, pos, NULL);
+        move_linked_works(work, &cwq->pool->worklist, NULL);
         __clear_bit(WORK_STRUCT_DELAYED_BIT, work_data_bits(work));
         cwq->nr_active++;
 }
@@ -1879,21 +1847,6 @@ __acquires(&gcwq->lock)
         set_work_cpu(work, gcwq->cpu);
         list_del_init(&work->entry);
 
-        /*
-         * If HIGHPRI_PENDING, check the next work, and, if HIGHPRI,
-         * wake up another worker; otherwise, clear HIGHPRI_PENDING.
-         */
-        if (unlikely(pool->flags & POOL_HIGHPRI_PENDING)) {
-                struct work_struct *nwork = list_first_entry(&pool->worklist,
-                                        struct work_struct, entry);
-
-                if (!list_empty(&pool->worklist) &&
-                    get_work_cwq(nwork)->wq->flags & WQ_HIGHPRI)
-                        wake_up_worker(pool);
-                else
-                        pool->flags &= ~POOL_HIGHPRI_PENDING;
-        }
-
         /*
          * CPU intensive works don't participate in concurrency
          * management. They're the scheduler's responsibility.
@@ -3047,9 +3000,10 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
         for_each_cwq_cpu(cpu, wq) {
                 struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
                 struct global_cwq *gcwq = get_gcwq(cpu);
+                int pool_idx = (bool)(flags & WQ_HIGHPRI);
 
                 BUG_ON((unsigned long)cwq & WORK_STRUCT_FLAG_MASK);
-                cwq->pool = &gcwq->pool;
+                cwq->pool = &gcwq->pools[pool_idx];
                 cwq->wq = wq;
                 cwq->flush_color = -1;
                 cwq->max_active = max_active;
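One detail worth calling out in the implementation above: worker_pool_pri() derives a pool's index purely from pointer arithmetic, because the normal and highpri pools are consecutive elements of their gcwq's pools[] array, which is also what for_each_worker_pool() iterates over. A standalone sketch of that idiom (plain C with invented names, not the kernel structures):

/* Demonstrates the pointer-arithmetic index trick behind worker_pool_pri(). */
#include <stdio.h>

#define NR_POOLS 2

struct owner;

struct pool {
        struct owner *owner;
};

struct owner {
        struct pool pools[NR_POOLS];
};

/* Index of a pool within its owner: 0 for normal, 1 for highpri. */
static int pool_index(struct pool *pool)
{
        return pool - pool->owner->pools;
}

int main(void)
{
        struct owner o;
        struct pool *p;

        /* Mirrors for_each_worker_pool(): walk the fixed-size pool array. */
        for (p = &o.pools[0]; p < &o.pools[NR_POOLS]; p++) {
                p->owner = &o;
                printf("pool %d is %s\n", pool_index(p),
                       pool_index(p) ? "highpri" : "normal");
        }
        return 0;
}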
