GCD Internals

The undocumented side of the Grand Central Dispatcher

Jonathan Levin, http://newosxbook.com/ - 02/15/14

About

The book touches very little on Apple's Grand Central Dispatcher, which is becoming the de-facto standard for multi-threaded applications in OS X and iOS, as it pushes aside the venerable (and standard) pthread APIs. While I do discuss the kernel support for GCD (Chapter 14, pg. 550, "Work Queues"), the implementation has changed considerably: Apple added a new SPI in Mountain Lion (XNU 2050)/iOS 6, and has completely externalized pthread functionality to the pthread kernel extension in Mavericks and iOS 7. The user-mode pthread support has likewise moved (as of OS X 10.9) from its former home in libsystem_c to /usr/lib/system/libsystem_pthread.dylib, built from its own (closed source) project, and has been enhanced with a powerful introspection feature.

On (yet another) flight to PVG, where I have to deliver a presentation on (among other things) GCD internals, I figured I might as well make public the information, in my attempt to keep the book as updated as possible for readers such as yourself. This article covers libdispatch versions 339.1.9 (OS X 10.9, for which the source is available), 354.3.1 (iOS 7, no source), and XNU 2050 (OS X 10.8) and 2423 (~ OS X 10.9/iOS 7).

Why should you care? (Target Audience)

Arguably, most developers couldn't care less about the implementation of GCD, as it "magically" provides concurrency and scheduling support. I'm more of the view that all "magic" has a logical explanation, and this is what I aim to provide here. Unlike other articles I've posted thus far, which can come in quite handy when you develop apps, this article is more of a deep dive into the esoteric. So maybe you do care, or maybe you don't. That's for you to decide. Me, I still have 11 hours to kill.

I: User Mode (libdispatch)

The Grand Central Dispatcher is implemented in /usr/lib/system/libdispatch.dylib, which - like other libraries in its path - is reexported by libSystem.B.dylib. The GCD APIs are well documented by Apple in the Concurrency Programming Guide [1], the libdispatch reference [2], the header files (<dispatch/dispatch.h> and friends) and the man pages (q.v. dispatch(3)). The rest of this article builds on those references as a foundation, though in a nutshell the process for using GCD can be summarized as follows (a minimal example follows the list):

  • GCD offers the application several global dispatch queues, of different priorities: DISPATCH_QUEUE_PRIORITY_HIGH (2), _DEFAULT (0), _LOW (-2) and _BACKGROUND (-32768). The queues are scheduled in decreasing priority. The _BACKGROUND queue is also run on a background thread (i.e. priority of about 4), with I/O throttling. These queues are obtained by dispatch_get_global_queue(priority, flags), with the only supported flag being DISPATCH_QUEUE_OVERCOMMIT.
  • An application also has a main thread queue, which can be obtained by a call to dispatch_get_main_queue. This is the queue served by the well known CF/NSRunLoop constructs.
  • An application can create additional queues using dispatch_queue_create(label, attr). The label is an optional name (which can be obtained by dispatch_queue_get_label and debugging tools), and attr is either DISPATCH_QUEUE_SERIAL (1-by-1, FIFO) or DISPATCH_QUEUE_CONCURRENT (parallelized execution), controlling the execution of blocks. What Apple doesn't mention here is that (as of 10.9/7) there is also a dispatch_queue_create_with_target, which takes a third argument - an already existing queue - to serve as the target queue.

  • To schedule work, an application can call one of the following functions:
    • dispatch_async[_f]: Sending a block or function (_f) to the queue specified. Execution is asynchronous, "as soon as possible".
    • dispatch_sync[_f]: Sending a block or function (_f) to the queue specified, and blocking until execution completes. Note that this doesn't necessarily mean the block or function will be executed in the current thread context - only that the current thread will block (that is, hang) so as to synchronize execution with the block or function.
  • GCD also supports dispatch sources. These can be created with dispatch_source_create, which takes four arguments: a source type, a (type-dependent) handle, a (type-dependent) mask of events to handle, and a queue on which the handler will run. The handler itself is set with dispatch_source_set_event_handler[_f], after which the source may be started with a call to dispatch_resume.
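To make the above concrete, here is a minimal example using only the documented API. The queue label (com.example.demo) is arbitrary, and the example intentionally relies on the FIFO guarantee of a serial queue, so the dispatch_sync call also serves as a crude barrier for the preceding dispatch_async:

#include <stdio.h>
#include <pthread.h>
#include <dispatch/dispatch.h>

int main(void)
{
    // A global (root) queue, and a private serial queue created by the application.
    dispatch_queue_t global = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_queue_t serial = dispatch_queue_create("com.example.demo", DISPATCH_QUEUE_SERIAL);

    // Asynchronous: returns at once; the block runs later, in FIFO order on the serial queue.
    dispatch_async(serial, ^{ printf("async block on thread %p\n", (void *) pthread_self()); });

    // Synchronous: the caller blocks until the block completes. Because the queue is serial,
    // the async block above is guaranteed to have run by the time this returns.
    dispatch_sync(serial, ^{ printf("sync block on thread %p\n", (void *) pthread_self()); });

    // Same idea against a root queue - note that dispatch_sync may simply run the block
    // on the calling thread rather than on a worker.
    dispatch_sync(global, ^{ printf("sync block on the default root queue\n"); });

    dispatch_release(serial);   // plain C (no ARC): we own the reference returned by _create
    return 0;
}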

The root and predefined queues

What the Apple documentation refers to as "global" queues (in the sense of being global to the application, requiring no initialization), libdispatch calls "root" queues. The queues are hard-coded in an array (queue.c) as shown in the following table:
Index   Serial #   Queue name
0       4          com.apple.root.low-priority
1       5          com.apple.root.low-overcommit-priority
2       6          com.apple.root.default-priority
3       7          com.apple.root.default-overcommit-priority
4       8          com.apple.root.high-priority
5       9          com.apple.root.high-overcommit-priority
6       10         com.apple.root.background-priority
7       11         com.apple.root.background-overcommit-priority

The implementation of dispatch_get_global_queue calls the internal _dispatch_get_root_queue with the same arguments, which returns the appropriate queue from the _dispatch_root_queues array, mapping the priority code to an index of 0 (LOW), 2 (DEFAULT), 4 (HIGH) or 6 (BACKGROUND), or their off-by-one odd numbers if OVERCOMMIT was specified. Application-created queues (i.e. dispatch_queue_create) are always mapped to the low priority queue (index 0), with serial queues going to its overcommit twin (index 1).
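The mapping itself is trivial. The following is a hypothetical sketch of the logic just described (the real code lives in _dispatch_get_root_queue() in libdispatch's queue.c; the helper name here is purely illustrative):

#include <stdio.h>
#include <stdbool.h>
#include <dispatch/dispatch.h>

// Illustrative only: map a DISPATCH_QUEUE_PRIORITY_* code (and the overcommit flag)
// to an index into _dispatch_root_queues, as described in the text.
static int root_queue_index(long priority, bool overcommit)
{
    int idx;
    switch (priority) {
    case DISPATCH_QUEUE_PRIORITY_LOW:        idx = 0; break;
    case DISPATCH_QUEUE_PRIORITY_DEFAULT:    idx = 2; break;
    case DISPATCH_QUEUE_PRIORITY_HIGH:       idx = 4; break;
    case DISPATCH_QUEUE_PRIORITY_BACKGROUND: idx = 6; break;
    default:                                 return -1;   // unknown priority code
    }
    return overcommit ? idx + 1 : idx;                     // odd indices are the overcommit twins
}

int main(void)
{
    printf("DEFAULT           -> index %d\n", root_queue_index(DISPATCH_QUEUE_PRIORITY_DEFAULT, false));
    printf("HIGH (overcommit) -> index %d\n", root_queue_index(DISPATCH_QUEUE_PRIORITY_HIGH, true));
    return 0;
}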

Looking at the above table you might wonder why the queues' serial numbers start at 4. This is because libdispatch also creates a queue for the application's main thread - com.apple.main-thread (Serial #1, from init.c) - and uses internal queues for its own management: com.apple.root.libdispatch-manager (Serial #2), and com.apple.libdispatch-manager (Serial #3). Serial #0 is unused.

Dispatch Queue implementation

The dispatch queue is defined in queue_internal.h, using three macros, in a way that mimics what C++ would consider classes and subclasses.

The dispatch queue starts by including the DISPATCH_STRUCT_HEADER - as all dispatch objects do. This common header consists of an OS_OBJECT_HEADER (which provides the object operations table (vtable) and the reference counts), and several more fields, including the target queue (settable by dispatch_set_target_queue). The target queue is one of the root queues (usually the default one). Custom queues as well as dispatch sources thus eventually get coalesced into the root queues.

The dispatch queue then follows with its subclass fields: DISPATCH_QUEUE_HEADER, and the DISPATCH_QUEUE_CACHELINE_PADDING. The latter is used to ensure that the structure fits optimally within the CPU's cache lines. The former (DISPATCH_QUEUE_HEADER) is used to maintain the queue metadata, including the "width" (# of threads in pool), label (for debugging), serial #, and work item list. The annotated header is shown below:

Listing 1: The dispatch_queue_s
struct dispatch_queue_s {
    /* DISPATCH_STRUCT_HEADER(queue) - from queue_internal.h */

        /* _OS_OBJECT_HEADER(const struct dispatch_queue_vtable_s *do_vtable, do_ref_cnt, do_xref_cnt); */
        /* from os/object_private.h */
        const struct dispatch_queue_vtable_s *do_vtable;   // object operations table
        int volatile do_ref_cnt;                           // reference count
        int volatile do_xref_cnt;                          // cross (external) reference count

        struct dispatch_queue_s *volatile do_next;         // pointer to next object (i.e. linked list)
        struct dispatch_queue_s *do_targetq;               // actual target of object (one of the root queues)
        void *do_ctxt;                                     // context
        void *do_finalizer;                                // set with dispatch_set_finalizer[_f]
        unsigned int do_suspend_cnt;                       // incremented/decremented by dispatch_suspend/resume

    /* DISPATCH_QUEUE_HEADER */
        uint32_t volatile dq_running;                      // how many dispatch objects are currently running
        struct dispatch_object_s *volatile dq_items_head;  // pointer to first item on dispatch queue (for remove)
        /* LP64 global queue cacheline boundary */
        struct dispatch_object_s *volatile dq_items_tail;  // pointer to last item on dispatch queue (for insert)
        dispatch_queue_t dq_specific_q;                    // used for dispatch_queue_set/get_specific
        uint32_t dq_width;                                 // concurrency "width" (how many objects run in parallel)
        unsigned int dq_is_thread_bound:1;                 // true for the main thread queue
        unsigned long dq_serialnum;                        // serial # (1-12)
        const char *dq_label;                              // user-defined; obtain with dispatch_queue_get_label

    /* DISPATCH_INTROSPECTION_QUEUE_LIST */
        TAILQ_ENTRY(dispatch_queue_s) diq_list;            // introspection builds (-DDISPATCH_INTROSPECTION) only

    /* DISPATCH_QUEUE_CACHELINE_PADDING */
        char _dq_pad[DISPATCH_QUEUE_CACHELINE_PAD];        // pads to a 64-byte boundary
};

Note that queues are not threads! A single queue may be served by multiple worker threads, and vice versa. You can easily see the internals of GCD by using lldb on a sample program, say something as crude as:

Listing 2: A simple GCD program
#include <stdio.h>
#include <dispatch/dispatch.h>
#include <pthread.h>

int main (int argc, char **argv)
{
    // Using pthread_self() inside a block will show you the thread it
    // is being run in. The interested reader might want to dispatch
    // this block several times, and note that the # of threads can
    // change according to GCD's internal decisions..

    void (^myblock1) (void) = ^ { printf("%p Blocks are cool - 1 \n",
                                  (void *) pthread_self());  };
    dispatch_queue_t q = 
       dispatch_queue_create("com.technologeeks.demoq",  // Our name
                             DISPATCH_QUEUE_CONCURRENT); // DISPATCH_QUEUE_SERIAL or CONCURRENT

    dispatch_group_t g = dispatch_group_create();
 
    dispatch_group_async(g, q, myblock1);
        
    int rc = dispatch_group_wait(g, DISPATCH_TIME_FOREVER);

    return rc;  // 0 if all blocks in the group completed before the timeout
}

By placing a breakpoint inside a block, you'll see something similar to:

Output 1: Debugging program from Listing 2 (10.8)
morpheus@Zephyr (~)$ cc /tmp/a.c -o /tmp/a
morpheus@Zephyr (~)$ lldb /tmp/a
Current executable set to '/tmp/a' (x86_64).
(lldb) b printf
Breakpoint 1: where = libsystem_c.dylib`printf, address = 0x0000000000080784
(lldb) 
Process 9454 launched: '/tmp/a' (x86_64)
Process 9454 stopped
* thread #2: tid = 0xee5c1, 0x00007fff83232784 libsystem_c.dylib`printf, 
   queue = 'com.technologeeks.demoq, stop reason = breakpoint 1.1
    frame #0: 0x00007fff83232784 libsystem_c.dylib`printf
libsystem_c.dylib`printf:
-> 0x7fff83232784:  pushq  %rbp
   0x7fff83232785:  movq   %rsp, %rbp
   0x7fff83232788:  pushq  %r15
   0x7fff8323278a:  pushq  %r14
(lldb) bt all
# 
# Main thread is blocking in dispatch_group_wait, which is basically like pthread_join
#
  thread #1: tid = 0xee5b0, 0x00007fff86ff76c2 libsystem_kernel.dylib`semaphore_wait_trap + 10, 
   queue = 'com.apple.main-thread
    frame #0: 0x00007fff86ff76c2 libsystem_kernel.dylib`semaphore_wait_trap + 10
    frame #1: 0x00007fff893d983b libdispatch.dylib`_dispatch_group_wait_slow + 154
    frame #2: 0x0000000100000e54 a`main + 100
    frame #3: 0x00007fff8621e7e1 libdyld.dylib`start + 1
# 
# Block is executing asynchronously on a worker thread, handled as a custom queue by libdispatch
# Offsets on Mavericks/iOS7 are (naturally) different, and worker_thread2 calls root_queue_drain
#
* thread #2: tid = 0xee5c1, 0x00007fff83232784 libsystem_c.dylib`printf, 
   queue = 'com.technologeeks.demoq, stop reason = breakpoint 1.1
    frame #0: 0x00007fff83232784 libsystem_c.dylib`printf
    frame #1: 0x0000000100000e97 a`__main_block_invoke + 39
    frame #2: 0x00007fff893d7f01 libdispatch.dylib`_dispatch_call_block_and_release + 15
    frame #3: 0x00007fff893d40b6 libdispatch.dylib`_dispatch_client_callout + 8
    frame #4: 0x00007fff893d9317 libdispatch.dylib`_dispatch_async_f_redirect_invoke + 117
    frame #5: 0x00007fff893d40b6 libdispatch.dylib`_dispatch_client_callout + 8
    frame #6: 0x00007fff893d51fa libdispatch.dylib`_dispatch_worker_thread2 + 304
    frame #7: 0x00007fff831c8cdb libsystem_c.dylib`_pthread_wqthread + 404
    frame #8: 0x00007fff831b3191 libsystem_c.dylib`start_wqthread + 13
(lldb) 

When I tried this code in my 10.9 VM, the same breakpoint caught the main thread in the act of dispatching - before the dispatch_group_wait:

Output 2: Debugging program from Listing 2, again (10.9)
  thread #1: tid = 0x6231, 0x00007fff86bace6a libsystem_kernel.dylib`__workq_kernreturn + 10, 
   queue = 'com.apple.main-thread
    frame #0: 0x00007fff86bace6a libsystem_kernel.dylib`__workq_kernreturn + 10
    frame #1: 0x00007fff8e96afa7 libsystem_pthread.dylib`pthread_workqueue_addthreads_np + 47
    frame #2: 0x00007fff9432dba1 libdispatch.dylib`_dispatch_queue_wakeup_global_slow + 64
    frame #3: 0x0000000100000e41 a`main + 81
    frame #4: 0x00007fff911795fd libdyld.dylib`start + 1

This isn't due to 10.9's GCD being different - rather, it demonstrates the true asynchronous nature of GCD: the main thread has yet to return from requesting the worker (which it does via pthread_workqueue_addthreads_np, as I'll describe later), yet the worker thread has already spawned and is mid-execution, possibly on another CPU core. The exact state of the main thread with respect to the worker is largely unpredictable.

Note another cool feature of GCD: the queue name in thread #2 has been set to that of the custom queue. GCD renames the root queues when they are working on behalf of custom queues (as in this example), in a way that is visible to lldb. I'm working on adding this functionality to process explorer. In case you're wondering why "_dispatch_worker_thread2" is used - that's because libdispatch defines three worker thread functions: the first for use when compiled with DISPATCH_USE_PTHREAD_POOL, the second (this one) for use with HAVE_PTHREAD_WORKQUEUE_SETDISPATCH_NP, and the third for HAVE_PTHREAD_WORKQUEUES. The second also falls through to the third.

Dispatch Sources

A key function of dispatch queues is connecting them to dispatch sources. These enable an application to multiplex multiple event listeners, much as would traditionally be provided by select(2), but with far wider support of event sources - from file descriptors, through sockets, Mach ports, signals, process events and timers, to even custom sources.

All of the myriad sources are built on top of the kernel's kqueue mechanism. The type argument to dispatch_source_create is, in fact, a struct dispatch_source_type_s pointer, defined in source_internal.h as follows:

Listing 3: The dispatch_source_type_s definition (from source_internal.h)
struct dispatch_source_type_s {
        struct kevent64_s ke;
        uint64_t mask;
        void (*init)(dispatch_source_t ds, dispatch_source_type_t type,
                        uintptr_t handle, unsigned long mask, dispatch_queue_t q);
};

A dispatch source can be thought of as a special case of a queue. The two are closely related, and the former is a "subclass" of the latter, as can be seen by the definition:

Listing 4: The dispatch_source_s definition (from source_internal.h)
struct dispatch_source_s {
     /* DISPATCH_STRUCT_HEADER(source); */            // as per all other dispatch objects...

     /* DISPATCH_QUEUE_HEADER; */                     // as per the dispatch_queue definition

     /* DISPATCH_SOURCE_HEADER(source); expands to:
        dispatch_kevent_t ds_dkev;                    // linked list of events and source refs
        dispatch_source_refs_t ds_refs;
        unsigned int ds_atomic_flags;
        unsigned int
                ds_is_level:1,
                ds_is_adder:1,                        // true for DISPATCH_SOURCE_ADD
                ds_is_installed:1,                    // true if source is installed on the manager queue
                ds_needs_rearm:1,                     // true if needs rearming on the manager queue
                ds_is_timer:1,                        // true for timer sources only
                ds_cancel_is_block:1,                 // true if the source's cancel_handler is a block
                ds_handler_is_block:1,                // true if the source's event_handler is a block
                ds_registration_is_block:1,           // true if the source's registration handler is a block
                dm_connect_handler_called:1,          // used by Mach sources only
                dm_cancel_handler_called:1;           // true if in the process of calling the cancel block
        unsigned long ds_pending_data_mask;           // returned by dispatch_source_get_mask()
     */
        unsigned long ds_ident_hack;                  // returned by dispatch_source_get_handle()
        unsigned long ds_data;                        // returned by dispatch_source_get_data()
        unsigned long ds_pending_data;
};

The operation of dispatch_source_create is straightforward: following validation of the type argument, it allocates and initializes a dispatch_source_s structure, in particular populating its ds_dkev with the kevent() parameters passed to the function.
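From the API side, all of this machinery stays hidden behind the calls described earlier. As a quick, minimal example (using only the documented API), the following sets up a one-second timer source on the default root queue:

#include <stdio.h>
#include <dispatch/dispatch.h>

int main(void)
{
    // A timer source, firing once a second on the default (root) queue.
    dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER,
                                                     0,   // handle: unused for timers
                                                     0,   // mask: unused for timers
                                                     q);
    dispatch_source_set_timer(timer,
                              dispatch_time(DISPATCH_TIME_NOW, 0),  // start now
                              1ull * NSEC_PER_SEC,                  // interval
                              100ull * NSEC_PER_MSEC);              // leeway
    dispatch_source_set_event_handler(timer, ^{ printf("tick\n"); });
    dispatch_resume(timer);   // sources are created suspended
    dispatch_main();          // parks the main thread; never returns
}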

Internally, most (if not all) sources eventually get triggered by kevent(). I cover this important syscall in both chapter 2 (page 57) and 14 (500 pages later..). This means that most sources use the same kqueue. Most, with the exception of Mach sources, which use Mach's request_notification mechanism.

You can see this for yourself by using lldb on a program or daemon which uses dispatch sources. One example to debug is diskarbitration:

bash-3.2# ps -ef | grep diskarb
    0    16     1   0 Sun10AM ??         0:02.40 /usr/sbin/diskarbitrationd
bash-3.2# lldb -p 16
Attaching to process with:
    process attach -p 16
Process 16 stopped
Executable module set to "/usr/sbin/diskarbitrationd".
Architecture set to: x86_64-apple-macosx.
(lldb) thread backtrace all
# 
# The CFRunLoop construct (which is also responsible for the main thread queue)
# blocks on mach_msg_trap, which will return when a message is received
#
* thread #1: tid = 0x0140, 0x00007fff86ff7686 libsystem_kernel.dylib`mach_msg_trap + 10, 
   queue = 'com.apple.main-thread, stop reason = signal SIGSTOP
    frame #0: 0x00007fff86ff7686 libsystem_kernel.dylib`mach_msg_trap + 10
    frame #1: 0x00007fff86ff6c42 libsystem_kernel.dylib`mach_msg + 70
    frame #2: 0x00007fff8be77233 CoreFoundation`__CFRunLoopServiceMachPort + 195
    frame #3: 0x00007fff8be7c916 CoreFoundation`__CFRunLoopRun + 1078
    frame #4: 0x00007fff8be7c0e2 CoreFoundation`CFRunLoopRunSpecific + 290
    frame #5: 0x00007fff8be8add1 CoreFoundation`CFRunLoopRun + 97
    frame #6: 0x00000001069d83e6 diskarbitrationd`___lldb_unnamed_function176$$diskarbitrationd + 2377
    frame #7: 0x00007fff8621e7e1 libdyld.dylib`start + 1

#
# The manager queue (holds a kqueue() and blocks on kevent until a source "fires")
#
  thread #2: tid = 0x0146, 0x00007fff86ff9d16 libsystem_kernel.dylib`kevent + 10, 
   queue = 'com.apple.libdispatch-manager
    frame #0: 0x00007fff86ff9d16 libsystem_kernel.dylib`kevent + 10
    frame #1: 0x00007fff893d6dea libdispatch.dylib`_dispatch_mgr_invoke + 883
    frame #2: 0x00007fff893d69ee libdispatch.dylib`_dispatch_mgr_thread + 54
(lldb) detach
Detaching from process 16
Process 16 detached

When a source does fire, the libdispatch manager triggers the callback on another thread (via _dispatch_worker_thread2, as usual, though it goes on to call _dispatch_source_invoke, resulting in a slightly different stack). This way, the manager thread remains available to process events from other sources.

II: Still in User Mode (pthread)

GCD, contrary to the impression one might get, does not replace threads - it builds on them. The underlying support for libdispatch is still the venerable POSIX threads library (pthread), though most of the support comes from non-POSIX-compliant Apple extensions (which are easily identifiable by the _np suffix in their function names). Most of those functions were silently introduced in Leopard (10.5), with others added in 10.6, as GCD was formally introduced. The API, however, has undergone significant changes, making it a moving target.

To exacerbate matters, though the Apple pthread implementation was formerly a part of LibC (and thus open source), this has changed as of OS X 10.9 (somewhere between LibC-825 and 997). Pthreads is now its own library (libsystem_pthread.dylib) and project (presently, libpthread-53.1.4). I'm not entirely clear why Apple decided to refactor it out (and maybe the source is still somewhere on opensource.apple.com..), but the move also aligns with the one in kernel mode - all pthread support has been moved out to pthread.kext (which is part of the above project, in kern/kern_support.c). Seeing as these were non-POSIX extensions (and mostly APPLE_PRIVATE APIs), I guess they figured developers were forewarned.

The last open source implementation of pthreads, therefore, is that of 10.8 (LibC-825), wherein Apple changed the API and added new _np calls. 10.9 changes the API further, and it seems like it might take a while before the dust settles. This is also evident in the code of libdispatch, in the sections guarded by DISPATCH_USE_LEGACY_WORKQUEUE_FALLBACK, though as of 10.8 the legacy interface has effectively been removed: both libdispatch and pthreads check whether the kernel supports the new interface (referred to as the "new SPI"), and return an error if that is not the case.

The non-standard pthread extensions provided by Apple were, surprisingly enough, documented - not by Apple, but by the FreeBSD man pages, since GCD has been ported to it. Apple, however, has effectively dropped almost all of those extensions in favor of new ones, as shown in the following output:

# OS X 10.8 output:
morpheus@Zephyr$ jtool -S -v /usr/lib/system/libsystem_c.dylib | grep pthread_workqueue
00000000000cfd80 d ___pthread_workqueue_pool_head
0000000000015b39 T ___pthread_workqueue_setkill
0000000000017230 T _pthread_workqueue_additem_np
0000000000016fb7 T _pthread_workqueue_addthreads_np
0000000000016aad T _pthread_workqueue_atfork_child
0000000000016aa3 T _pthread_workqueue_atfork_parent
0000000000016a99 T _pthread_workqueue_atfork_prepare
00000000000167bb T _pthread_workqueue_attr_destroy_np
0000000000016808 T _pthread_workqueue_attr_getovercommit_np
00000000000167d1 T _pthread_workqueue_attr_getqueuepriority_np
000000000001679f T _pthread_workqueue_attr_init_np
0000000000016822 T _pthread_workqueue_attr_setovercommit_np
00000000000167eb T _pthread_workqueue_attr_setqueuepriority_np
0000000000016ff8 T _pthread_workqueue_create_np
0000000000017848 T _pthread_workqueue_getovercommit_np
000000000001683a T _pthread_workqueue_init_np
0000000000016a56 T _pthread_workqueue_requestconcurrency
0000000000016f26 T _pthread_workqueue_setdispatch_np     
# OS X 10.9 output:
morpheus@simulacrum$ jtool -S -v /usr/lib/system/libsystem_pthread.dylib | grep pthread_workqueue
0000000000002c0d t _pthread_workqueue_atfork_child         # survived, but made private
0000000000002371 T ___pthread_workqueue_setkill            # make thread killable by pthread_kill
0000000000002f78 T _pthread_workqueue_addthreads_np        # 
0000000000002f19 T _pthread_workqueue_setdispatch_np       # q.v. below
0000000000002f12 T _pthread_workqueue_setdispatchoffset_np #

Since virtually the entire "legacy" API has been eradicated, let's focus on those functions which did make the cut:
pthread_workqueue_addthreads_np(int queue_priority, int options, int numthreads)
    Adds numthreads threads to the workqueue of priority queue_priority, according to options. The only
    option supported is WORKQ_ADDTHREADS_OPTION_OVERCOMMIT. As you could see in Output 2, this call
    asynchronously spawns the worker threads.

pthread_workqueue_setdispatch_np(void (*worker_func)(int queue_priority, int options, void *ctxt))
    - Sets the dispatch worker function (always _dispatch_worker_thread2)
    - Makes sure the new SPI is supported
    - Calls workq_open()

pthread_workqueue_setdispatchoffset_np
    A new addition to the API (10.9). Used by libdispatch when setting up the root queues; it passes
    the offset of the dq_serialnum member relative to the dispatch_queue_s struct.

As you can see, there is no longer a way to manipulate most aspects of work queues via pthreads. Whereas before pthread exported an _additem_np (which would enable scheduling of a work item), this has been removed in favor of _addthreads_np, and the work function itself is set by _setdispatch_np, normally once per process instance, during libdispatch's root_queue_init(). This means that the actual work queue thread pool management is handled by the kernel.
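To make the division of labor concrete, here is a hypothetical sketch of how a libdispatch-like client could use the two surviving calls, assuming the signatures shown in the table above. The prototypes are declared by hand (they normally live in a private header, not the SDK), the priority value 0 is an assumption, and this is illustrative of the mechanism only - not a supported or recommended interface:

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

// Hand-declared prototypes for the private SPI, per the signatures in the table above.
int pthread_workqueue_setdispatch_np(void (*worker_func)(int queue_priority, int options, void *ctxt));
int pthread_workqueue_addthreads_np(int queue_priority, int options, int numthreads);

// Each kernel-spawned workqueue thread lands here (libdispatch points this at
// _dispatch_worker_thread2, which then drains the matching root queue).
static void my_worker(int queue_priority, int options, void *ctxt)
{
    (void) ctxt;
    printf("worker thread %p: priority %d, options %d\n",
           (void *) pthread_self(), queue_priority, options);
}

int main(void)
{
    // Register the per-process worker function (per the table, this also calls workq_open()).
    if (pthread_workqueue_setdispatch_np(my_worker) != 0) {
        fprintf(stderr, "new SPI not supported?\n");
        return 1;
    }
    // Ask the kernel for two threads; 0 is assumed here to mean the default priority band.
    pthread_workqueue_addthreads_np(0, 0, 2);
    sleep(1);   // the request is asynchronous - give the workers a chance to run
    return 0;
}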

Work queue diagnostics

Apple's fantabulous yet undocumented proc_info syscall (#336), which I laud so much in the book, also has a PROC_PIDWORKQUEUEINFO code (#12). It provides a very high level view of the workqueue, as shown here:

Listing 5: The proc_workqueueinfo (from <sys/proc_info.h>)
struct proc_workqueueinfo {
        uint32_t        pwq_nthreads;           /* total number of workqueue threads */
        uint32_t        pwq_runthreads;         /* total number of running workqueue threads */
        uint32_t        pwq_blockedthreads;     /* total number of blocked workqueue threads */
        uint32_t        pwq_state;
};

The latest version of my Process Explorer (v0.2.9 and later) automatically displays associated work queue information, if work queues are detected in the process whose information you are querying.
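You can query the same information yourself through the libproc wrapper for proc_info. A minimal sketch (inspecting other users' processes generally requires root):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <libproc.h>         // proc_pidinfo()
#include <sys/proc_info.h>   // PROC_PIDWORKQUEUEINFO, struct proc_workqueueinfo

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    pid_t pid = (pid_t) atoi(argv[1]);

    struct proc_workqueueinfo pwq;
    // Returns the number of bytes filled in, or <= 0 on error (bad pid, no privileges,
    // or a process with no workqueue).
    int rc = proc_pidinfo(pid, PROC_PIDWORKQUEUEINFO, 0, &pwq, sizeof(pwq));
    if (rc <= 0) { fprintf(stderr, "proc_pidinfo failed (%d)\n", rc); return 1; }

    printf("threads: %u  running: %u  blocked: %u  state: 0x%x\n",
           pwq.pwq_nthreads, pwq.pwq_runthreads, pwq.pwq_blockedthreads, pwq.pwq_state);
    return 0;
}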

III: Kernel support (workqueues)

System call interface

As stated in the book [3], workqueues in the kernel are supported through two undocumented system calls - workq_open (#367) and workq_kernreturn (#368). Though the system calls remain constant, their implementation has changed with 10.8/6 and the introduction of the "new SPI". Beginning with 10.9/7, the implementation of the system calls has moved to pthread.kext, leaving nothing but the shims in the kernel source. Another function of importance is bsdthread_register. You can find the definitions in bsd/kern/syscalls.master:

Listing 6: Workqueue related system calls
366     AUE_NULL        ALL     { int bsdthread_register(user_addr_t threadstart, user_addr_t wqthread, int pthsize,
                                  user_addr_t dummy_value, user_addr_t targetconc_ptr, uint64_t dispatchqueue_offset) 
                                  NO_SYSCALL_STUB; } 
367     AUE_WORKQOPEN   ALL     { int workq_open(void) NO_SYSCALL_STUB; }
368     AUE_WORKQOPS    ALL     { int workq_kernreturn(int options, user_addr_t item, int affinity, int prio) 
                                  NO_SYSCALL_STUB; }

There's a reason why all three have NO_SYSCALL_STUB: like other (crazy useful) syscalls in XNU, Apple doesn't want you to use them. If XNU weren't open source, nobody but Apple would likely know how to use them, either.

workq_open works in essentially the same way it did before. workq_kernreturn, however, has been completely modified: rather than offering the WQOPS discussed in the book as options, the new SPI deprecates all of them save WQOPS_THREAD_RETURN, and instead offers two new ones:

  • WQOPS_QUEUE_NEWSPISUPP (0x10), which is used to check for SPI support - and merely returns 0 if supported (a small probe sketch follows this list).
  • WQOPS_QUEUE_REQTHREADS (0x20). This code requests the kernel to run n more (possibly overcommitted) requests of a given priority. The value of n is passed in the "affinity" argument, while the "item" argument (formerly used to pass the user mode address to execute for WQOPS_QUEUE_ADD) is ignored.
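As a hedged illustration of the first operation, the probe below issues workq_kernreturn directly through syscall(2), using the syscall number (#368) and the WQOPS_QUEUE_NEWSPISUPP value (0x10) quoted above. This is obviously not a supported interface; the constants are hard-coded only because the private headers are not in the SDK:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    // workq_kernreturn (#368) with WQOPS_QUEUE_NEWSPISUPP (0x10) returns 0 iff the
    // "new SPI" is supported; the item/affinity/prio arguments are ignored for this op.
    int rc = syscall(368 /* workq_kernreturn */,
                     0x10 /* WQOPS_QUEUE_NEWSPISUPP */, NULL, 0, 0);
    printf("new SPI %ssupported (rc = %d)\n", rc == 0 ? "" : "NOT ", rc);
    return 0;
}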

The kernel workqueue implementation

Kernel workqueue support was in bsd/sys/pthread_internal.h - but has become opaque as of 10.9. The last reported sighting of a workqueue in the wild (in the xnu-2050.22.13 sources) looked like so (annotations added):

Listing 7: The kernel workqueue implementation
struct workqueue {
   proc_t          wq_proc;                          // Owning process
   vm_map_t        wq_map;                           // VM Map for work thread stacks
   task_t          wq_task;                          // The owning process's task port (used to create thread)
   thread_call_t   wq_atimer_call;
   int             wq_flags;                         // WQ_EXITING, WQ_ATIMER_RUNNING, WQ_LIST_INITED,
   int             wq_lflags;                        // WQL_ATIMER_BUSY, _WAITING
   uint64_t        wq_thread_yielded_timestamp;      // set by workqueue_thread_yielded()
   uint32_t        wq_thread_yielded_count;          // count of yielded threads, used with threshold
   uint32_t        wq_timer_interval;
   uint32_t        wq_affinity_max;
   uint32_t        wq_threads_scheduled;
   uint32_t        wq_constrained_threads_scheduled;
   uint32_t        wq_nthreads;                      // # of threads in this workqueue
   uint32_t        wq_thidlecount;                   // .. of which how many are idle
   uint32_t        wq_reqcount;                      // # of current requests (incremented by WQOPS_QUEUE_REQTHREADS)
   TAILQ_HEAD(, threadlist) wq_thrunlist;            // List of active threads
   TAILQ_HEAD(, threadlist) wq_thidlelist;           // List of idle ("parked") threads
   uint16_t        wq_requests[WORKQUEUE_NUMPRIOS];  // # of current requests, by priority
   uint16_t        wq_ocrequests[WORKQUEUE_NUMPRIOS];// # of overcommitted requests, by priority
   uint16_t        wq_reqconc[WORKQUEUE_NUMPRIOS];           /* requested concurrency for each priority level */
   uint16_t        *wq_thscheduled_count[WORKQUEUE_NUMPRIOS];
   uint32_t        *wq_thactive_count[WORKQUEUE_NUMPRIOS];   /* must be uint32_t since we OSAddAtomic on these */
   uint64_t        *wq_lastblocked_ts[WORKQUEUE_NUMPRIOS];
};

@TODO: detail more about work queue implementation..

sysctl variables

The kernel exports several variables to control work queues. These are basically the same as those of FreeBSD, and are exported by the kernel proper (pre 10.9/7) or by pthread.kext (10.9/7 and later). The variables are shown in the following table:

sysctl variable                     Controls
kern.wq_yielded_threshold           Maximum # of threads that may be yielded
kern.wq_yielded_window_usecs        Yielded window size
kern.wq_stalled_window_usecs        Maximum # of usecs a thread may be unresponsive before it is deemed stalled
kern.wq_reduce_pool_window_usecs    Maximum # of usecs a thread may idle before the thread pool is reduced
kern.wq_max_timer_interval_usecs    Maximum # of usecs between thread checks
kern.wq_max_threads                 Maximum # of threads in the work queue
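These tunables can be read (and, with root, set) with sysctl(8), or programmatically with sysctlbyname(3). A minimal sketch, assuming the values are exported as plain ints:

#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main(void)
{
    int max_threads = 0;
    size_t len = sizeof(max_threads);
    // Reads one of the workqueue tunables listed above; no privileges required for reading.
    if (sysctlbyname("kern.wq_max_threads", &max_threads, &len, NULL, 0) != 0) {
        perror("sysctlbyname");
        return 1;
    }
    printf("kern.wq_max_threads = %d\n", max_threads);
    return 0;
}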

kdebug codes

As with all kernel operations, the workqueue mechanism is laced with KERNEL_DEBUG macro calls, to mark function calls and arguments. Unlike other calls, however, the macros often define the debug codes as hex constants, rather than meaningful names. Unsurprisingly, the codes aren't listed in CoreProfile, either. I'm working on adding these to my kdebugView tool. I still need to delve into the "how" of kernel mode - so updates will follow. Me, I need to get off this flight already.

@TODO
  • Usage of sysctl vars inside pthread_synch
  • flow of workqueue_run_nextreq
  • wq_runreq and setup_wqthread
  • Kdebug constants..

References

  1. Apple, Concurrency Programming Guide
  2. Apple, Grand Central Dispatch (GCD) Reference
  3. My book