Chapter 12 The Moby run-time system
12.1 Run-time representations
This section describes the run-time representations of various
kinds of Moby data values.
We use C-like type and expression syntax to describe these representations,
where the C type MobyWord_t is the unsigned
integer type that is the same size as a pointer.
12.1.1 Function closures
Function closures are represented by a two-word record object: the first
word holds the function's code addressand the second word holds
the function's environment pointer.
Note that the environment pointer is not always a pointer --- if the
function's environment consists of a single value that has uniform
representation, then the environment pointer will be that value, and if
the environment is empty, then the environment pointer will be zero.
12.1.2 Objects and method suites
Objects are represented as a pair of the object's data pointer
and the object's class ID (see Figure 12.1.
Figure 12.1: Object representation
The object's data pointer points to a record whose first field is a pointer
to the object's method suite, and whose other fields contain the
object's fields.
Every object generated from a particular class shares the same method suite.
The object's class ID is used as a key when dynamically searching for the
field or method slot for a given label.
When executing code inside a method body, we do not need the class ID to
locate slots, since the slot offsets are static w.r.t. the particular
class.
Therefore, we represent self by just the pointer to the object's
data inside methods (and when we pass self to a method's body
during dispatch).
Tag types are implemented using a scheme described by Reppy and
Riecke [RR96].
For each declared tagtype, there is a statically allocated descriptor
with the following layout:
typedef struct MOBY_tagtype_desc *MOBY_tagtype_t;
struct MOBY_tagtype_desc {
MobyWord_t _depth;
MOBY_tagtype_t _id[N];
char _name[];
};
where N is the depth value for the tagtype.
The address of the tagtype's descriptor serves as its unique ID.
The _id field of the descriptor holds the IDs of the
ancestors of the tagtype, with X->_id[X->_depth-1]
holding X, for any tagtype descriptor X.
Following the ancestor IDs is a null-terminated string, which
is used by the run-time system to report the name of exceptions.
We can test if a tagtype value matches the constructor
X as follows:
((tag->_depth >= X.depth)(tag->_id[tag->_depth] == X))
where tag is the value's ID.
12.1.4 Arrays and vectors
We use a two-level representation of arrays, strings, and vectors.
12.2 Runtime system data structures
Figure 12.2: Per-task data structures
12.3 Garbage collection
We are using a modified version of Morrisett and Smith's Mostly-copying Collector (MCC) [SM97].
Although we have maintained much of the algorithm's structure, we have
made liberal modifications to the source code.
In this section, we describe only the single-threaded version of the
garbage collector. By single-threaded, we mean the version that works
only for Moby programs that do not spawn additional threads. Both
versions of the collector run in one thread (they are not parallel)
and must run while all mutator threads are stopped (they are not
incremental). The next section explains how the multi-threaded
version is different. Because the multi-threaded version is less
stable, we currently have two source trees, one for each version. In
the future, we hope to merge them, turning off expensive
synchronization operations for single-threaded programs.
MCC has conservative objects and record objects. (It also has arrays
and large objects.) Both objects have 32-bit headers. Conservative
objects' headers just record the size of the object; we do not know
which of the fields hold pointer values. Record objects' headers have
a bitmask --- a 1 means a field is definitely a pointer, a 0 means a
field is definitely not a pointer. Unfortunately, this encoding means
an object with just one ambiguous field must be given a conservative
header, so all fields are considered ambiguous. So in the future, we
intend to change the header encoding for record objects to use two
bits per object field. Then one of the four bit patterns can indicate
that a particular field is ambiguous.
The glue code to transfer control between the mutator and the
collector and back is new and subtle, so we discuss it in detail. (It
is also rather different for multi-threaded programs, so read the next
section for that case.) The biggest difference from the original MCC
is that the allocation pointer does not stay in a global variable.
At program initialization time, the Moby runtime calls
gcInit. This function gets an initial heap, breaks it up
into blocks, breaks the blocks up into pages, etc. Unlike the
original MCC, it does not establish an allocation pointer. (In MCC
terminology, the entire space for collectable objects is called the
heap. It is broken up into contiguous-memory segments called
blocks and the blocks are broken up into pages. Large
objects can span pages, but not blocks. Small objects do not span
pages. Pages can be used for mutator objects or for system
information about other pages. Page size is a power of 2 (currently
8K for Moby) and pages are aligned on the same boundary.
gcNewAllocPtr returns a pointer to a fresh to-space page.
The mutator is then responsible for allocating off this pointer. It
should not let the pointer leave the page.
Each module's initialization code registers its statically allocated
pointers by calling
MOBY_add_staticseg so that they
will be scanned during garbage collection. The collector assumes the
values in such a segment are unambiguous pointers, so they can
be small values, but they cannot be integers that could appear to be
pointers in the Moby heap. Therefore, the compiler segregates
top-level bindings by pointerness. This segregation works because
top-level bindings are monomorphic, but it may make functors harder to
compile. We may want a third category for the segregation for
ambiguous pointers.
Because we expect allocation performance to be crucial for Moby
applications, the compiler reserves a register (edi on the
IA-32) to hold the allocation pointer while Moby code is running. The
Moby code does heap-limit checks to ensure that there is room for
ensuing allocations on the current page. (The placement of these
checks is currently broken because it does not take into account that
function calls can perform an arbitrary amount of allocation. That's
the compiler's problem.)
When a heap-limit check fails, the mutator calls
MOBY_callgc, written in assembly using a special calling
convention. Basically, the function stores all the registers on the
stack and then calls the C routine MOBY_do_gc. It is
not necessary to preserve all the registers; indeed, the
multi-threaded version does something more sophisticated. In other
words, this assembly code should be improved for the single-threaded
version.
The C routine then moves the registers from the stack into a global
data structure and transfers control to the MCC routine
gcAllocSpace. (Again, all this stuff is cleaner in the
multi-threaded version, but we refrain from cleaning up the
single-threaded case for fear of introducing bugs.) This routine
marks the rest of the old page as unused, grabs a new page, and
returns a pointer onto this new page. Upon return,
MOBY_callgc moves this pointer into register
edi and restores all the other registers.
The slow path, of course, is if the fast path described above has been
taken enough times that a full garbage collection is warranted. The
biggest difference in collection between Moby and the original MCC is
how the roots are scanned. Therefore, the scanRoots function
is re-implemented in the Moby run-time.
scanRoots first crawls the stack and
then scans the registered static segments. This order is very
important. MCC assumes that all ambiguous pointers are scanned
before any unambiguous pointers. Otherwise, we may move an object
that must be pinned. The order works because our current information
about the stack requires us to treat all live locations as ambiguous
pointers. In the future, we might put information in the PC map to
indicate that some slots were live, unambiguous pointers. Doing
this naively will break the collector. After all, it would break the
invariant that all ambiguous roots are scanned first. The simplest
fix is probably to maintain a stack of unambiguous roots and do them
after all stack walking is done. Another option is two stack walks.
Stack-crawling must know where to stop. We currently use a
straightforward kludge: Moby code is called from C code, so we will
reach this C code's activation frame before ``falling off'' the
deep-end of the stack. We can detect the C code because Moby frames
have return addresses that appear in the PC map. So once a return
address lookup fails, we stop. A more robust solution is to record
the deep-end of the Moby stack. The trick here is to get it exactly
right when transferring control to Moby code; some hand-written
assembly may be necessary. Also, a more complicated solution will be
necessary if we allow C code to ``callback'' into Moby code.
When we are done with the collection, we put the allocation pointer
on a fresh page and return it. Note that the multi-threaded version
does not move any allocation pointers during collection.
Because Moby code keeps its allocation pointer in a register, it is
inconvenient to have foreign code allocate off of this pointer.
Therefore, the collector maintains another allocation pointer, called
specialAllocPtr in a global variable. Calls to
malloc-replacement functions, such as gcMalloc, use this
pointer to do allocation. The collector just needs to put this
pointer onto a fresh page after collection is complete.
12.4 Multithreading
Note: the multithreaded version works fine for single-threaded
programs, so we encourage using it unless bugs are encountered or
performance demands otherwise. (The degraded performance would
involve managing some unneeded mutexes, maintaining some unneeded
global state, and enduring some unneeded function call indirections.)
In fact, this situation is the only one that has been tested. That
is, we have not yet invoked multiple threads with enough allocation to
trigger a collection.
Currently, each Moby thread runs in its own Posix thread (pthread).
Some day we hope to multiplex multiple Moby threads in a pthread, but
we ignore that complication here and just focus on the
garbage-collection issues. (A major reason we temporarily abandoned
this multiplexing is that the current Linux pthread implementation
determines thread identity by masking the stack pointer. Newer
versions have an implementation that does not rely on register
esp.) Be warned that by default, these threads have only
2 megabyte stacks.
For each thread, there is a Moby context
(moby_exec_ctxt_l_t) structure. All the
contexts are kept in a doubly-linked list that is reachable through
the global moby_runtime_info. Each thread also keeps
a pointer to its own context in thread-local-data (using the key
MobyCtxtKey). But wait, there's more. Concerned that
accessing thread-local-data is too expensive, we also put a pointer to
the context at the bottom of the page into which the context's
allocation pointer currently points. So we can mask the allocation
pointer to get the context pointer. (To be precise, we write a record
object with a header indicating one non-pointer field and then the
context pointer. It is a non-pointer in the sense that it does not
point into the Moby heap.) When a thread gets a new page for
allocating, the context pointer must be copied into that page.
Access to the global MOBY runtime information is guarded by a single,
global lock. We need the lock only for module initialization, thread
creation, thread destruction, and (full-blown) garbage collection, so
we don't expect this to be a major bottleneck. However, when we
support lighter-weight threads (much lighter than a pthread), we
probably should strive not to need hold the global lock just to create
and destruct such a lightweight thread.
We now describe the fields of a
moby_exec_ctxt_l_t structure:
typedef struct moby_exec_ctxt_l {
greg_t saved_alloc_ptr;
stk_ptr_t saved_stack_ptr;
stk_ptr_t saved_frame_ptr;
code_ptr_t saved_ret_addr;
ucontext_t * uctxt;
int gc_instigator;
moby_thd_status_t thd_status;
pthread_t pthread;
MOBY_closure_t * closure;
struct moby_exec_ctxt_l * prev;
struct moby_exec_ctxt_l * next;
} moby_exec_ctxt_l_t;
The prev and next fields establish a doubly-linked
list. Code modifying this list should hold the global MOBY info lock.
The closure field initially holds a pointer to the closure
that a thread first executes. This field is necessary because the
closure lives in the Moby heap: If a garbage collection occurs between
the time the thread is created and when it starts running, then the
closure field must be treated as a root and updated by the collector.
Once the collector detects that the Moby thread has started
running, it assigns NULL to this field.
The pthread field is what you think it is. When the
garbage-collection instigator stops the world, it does so by walking
through the list of contexts and sending a signal to each pthread
encountered, except for itself.
The gc_instigator field is set only for the thread that
instigated the garbage collection. The stackwalk must treat this
thread differently for a couple of reasons. First, we can find the
callee-save registers and Moby stack portion in known stack offsets.
Second, the uctxt field is not correct because we do not
signal ourselves (see below). Now that I think about it, we don't
need to maintain this state because the fragment
pthread_equal(pthread_self(),ctxt->pthread)
should suffice. (This fragment works because the collection runs in
the instigator's thread.)
The thd_status field should be deleted. We need
something like this field when we do our own scheduling, but the
operating system currently does all of the scheduling for us.
The uctxt field points to space that is allocated during
thread-creation and destroyed during thread-destruction. The signal
handler for an interrupted-for-garbage-collection thread copies the
uctxt provided to it by the system into this field. (The signal
handler kludges are described in more detail below.) Then on
restarting, we somehow use this information to perform a
setcontext. (This function is currently unimplemented
in Linux, but we have some ideas on workarounds.) Note that this
field is not correct for the instigating thread.
The saved_* fields are written by the trampoline code
that is used to let Moby code call C code. In the single-threaded
case, Moby code can call C code directly, especially because Moby has
strictly fewer callee-save registers (namely, none). Also, for the
time being we might assume that the C code is not allocating into the
Moby heap, so we are assured that a garbage collection will not occur
while executing C code. (If it can occur, we probably need trampoline
code because C code does not use MOBY_callgc.) But in
the multi-threaded case, another thread could instigate a garbage
collection. We use the saved information to determine the Moby
portion of the stack and the allocation pointer. Then the stack
portion above (shallower) than the Moby portion is treated
conservatively.
The trampoline code is in the file MOBY-ccall.S. The
compiler uses the multithreaded flag (-t) to decide to
generate C function calls to go through this code instead of directly
to C. Also, the multithreaded version of MOBY_callgc
actually calls the trampoline code, and relies on its
implementation. Specifically, it stores the callee-save registers
(ebx and esi) on the stack. Later, the stack-walker
for the instigator will use the saved_stack_ptr field
to find these values. This trick avoids the need to scan the C
portion of the instigator's stack.
Now we can describe how garbage collection occurs. First we describe
the synchronization and signalling. Then we describe how the actual
collection process is different from the signal-threaded case.
Recall that most calls to gcAllocSpace take the fast path of
finishing up the old page, grabbing a free page, writing the
context-pointer record into it, and returning a pointer. These last
three steps constitute a critical section for two reasons. First, we
cannot have concurrent access to the free list (for grabbing a free
page). Second, we cannot have a full-blown collection occur before
the context-pointer record is written and the allocation pointer
updated. Hence the call to collect is also in this critical
section. We protect gcNewAllocPtr with the same lock because
it also manipulates the list of free pages. In the future, we intend
to use a (faster?) spinlock for this code because we do not expect
much contention.
Once a thread decides to do a full-blown collection, it is the
instigator. No other thread can also be instigating because of mutual
exclusion. To stop the world, the instigator sends a signal to every
other thread. It then waits for all of the threads to receive the
signal, save appropriate state in their context, and wait to be
restarted. The synchronization to do this stopping stage is
essentially the mutex encoding of a counting semaphore. The
synchronization for restarting is essentially the inverse --- we set a
start flag and broadcast to the other threads. (These other threads
check the flag before stopping in case collection happens ``really
fast.'') None of this synchronization has been tested, of
course.
Note that the description above implies that signal handlers are
performing synchronization operations. The pthreads implementation
forbids doing so and uses scary words like ``deadlock'' and
``unpredictable'' to dissuade us. Therefore, we pull a dirty trick.
After the handler copies the ucontext into its thread's context
record,1 it updates the system's ucontext's program counter so that
the handler will return somewhere else, specifically the C function
stop_for_gc.
Therefore, stop_for_gc must not take any arguments!
Also, no function can ever have data at negative offsets from the
stack pointer! We access the context pointer using thread-local data
and do the synchronization described above. Upon restart, we use
setcontext along with the saved ctxt field to
restore things how they were. The memory management of uctxt
here is quite subtle. Why don't we pass setcontext the
actual field from the context record, not a copy? If
setcontext frees this memory (it probably doesn't), the next
collection will have a memory error. If setcontext does not
free the memory, then we have a race condition: Another collection
might occur before setcontext is done reading the
uctxt field. When I think about what would happen as a
result, my head feels fuzzy. The other option is to pass a copy of
the uctxt to setcontext. But now we have a memory
leak if setcontext does not free the memory. So we need to
stack allocate this copy. Particularly subtle is that this structure
is not flat; the pointer field must also refer to a stack-allocated
object. But if setcontext does free the memory, we're hosed
because it is stack-allocated. The bottom line is we need to
ensure that passing a stack-allocated parameter to setcontext
is the right thing to do.
Once the world is stopped, the actual garbage collection is much like
the single-threaded case. Of course, we may have more than one stack
to crawl and only one stack actually instigated the garbage
collection. Starting a stack crawl is difficult because of
pre-emption and is probably bug-ridden. There are five cases:
-
The thread is the instigator. The top of the Moby portion of
the stack is in the context record. The callee-save registers are in
a known position. The C portion of the stack contains no roots.
- The thread is running Moby code. The context record is not
accurate. We may be in an allocation sequence; if so we roll back the
program counter because the already-initialized fields may have moved.
- The thread is running C code. The context record is accurate.
Treat the C portion of the stack conservatively; doing so ensures we
trace the callee-save registers.
- The thread has not started running yet. Trace the
closure field only. The stack has no Moby roots.
- The thread is in the trampoline code. How to find the necessary
roots depends on the exact value of the program counter. This
case is unimplemented.
The other main difference in the multi-threaded collector is that we
do not adjust any allocation pointers. For this policy to be correct,
it suffices to pin the pages into which allocation pointers point. So
if there are n contexts, we will pin an additional n+1 pages (we
still have the specialAllocPtr). The obvious disadvantage is
less compaction. The advantage is that allocation pointers are not
volatile; they are never changed by another thread. Changing them
wouldn't be so bad if they always lived in register edi. But
with foreign calls and concurrent run-time calls, it gets very
difficult to always track where a context holds its ``true''
allocation pointer. For example, imagine if another thread is in
gcAllocSpace when it is pre-empted. Of course, we could have
larger critical sections instead. Pinning allocation-pointer pages
turned out to be a simple solution and a pleasing interaction with the
mostly-coping concept.
Note that by not moving allocation pointers, it would be fairly simple
to avoid the need for rollback. All we need to do is trace the
already-initialized fields as roots. However, the real advantage of
rollback comes when we have multiple Moby threads per context, all
sharing an allocation pointer. Here, rollback is a duty of the
thread scheduler anyway.
There are two unfortunate properties of MCC that may prove problematic
for us. The first is that conservative objects (and, by extension,
objects with ambiguous pointer fields) are maintained in an explicit
data structure so that collection only has to traverse the pointer
graph once. (This distinction is the primary difference between MCC
and Bartlett's collector.) Therefore, allocating such objects is
likely to be too complicated to do with in-line Moby code. Also the
accessing the explicit data structure must by synchronized.
BUG: The current code does not synchronize access to the
conservative objects queue. A better solution, though, is probably
to have a separate queue per context, thus avoiding the need for
synchronization.
The second is that part of determining if an ambiguous pointer is
valid involves testing whether it points to the beginning of an
object. To detect object beginnings, the collector must walk from the
beginning of a page to the beginning of the object. Every ambiguous
pointer that is a pointer suffers this walk. Two changes suggest that
this overhead will be much worse for us than for the original MCC.
The first is that we use a larger page size. The second is that we
intend, at least at first, to consider all polymorphic fields to be
ambiguous pointers. (MCC's original mutator created conservative
objects only in C code.)
The MCC code base has some support for using bitmasks to determine
object beginnings. (So for each word --- or double-word if objects
are so aligned --- have one bit indicating whether it is the beginning
of an object.) I think these are a good idea, but the MCC code
indicates that this feature is currently broken. It's probably worth
fixing and/or finding out why it wasn't used. Another option, of
course, is to be more conservative and conclude an ambiguous pointer
might be a pointer as soon as it points into the Moby heap with
correct alignment. The disadvantage here is more false pointers.
- 1
- We use pthread_getspecific for this
operation; the pthread implementation suggests this operation is safe
in a handler even though the documentation does not specifically
permit it.