mulle-objc: tagged pointers, boon or bane?
Continued from mulle-objc: fast methods make mulle_objc_object_call even faster.
Something that I’ve always passed over in the previous articles is the function _mulle_objc_object_get_isa. Let’s have a look at it:

    static inline struct _mulle_objc_class *_mulle_objc_object_get_isa( void *obj)
    {
       unsigned int                 index;
       struct _mulle_objc_runtime   *runtime;

       // index is 0 for conventional objects (the expected case)
       index = _mulle_objc_object_get_taggedpointer_index( obj);
       if( __builtin_expect( ! index, 1))
          return( _mulle_objc_objectheader_get_isa( _mulle_objc_object_get_objectheader( obj)));

       // tagged pointer: look up the class in the runtime
       runtime = mulle_objc_inlined_get_runtime();
       return( runtime->taggedpointers.pointerclass[ index]);
    }

We can see that there is the “classical” way of getting the isa from self, by reading the Class pointer from an offset from self:

    return( _mulle_objc_objectheader_get_isa( _mulle_objc_object_get_objectheader( obj)));

The other parts of the code handle tagged pointers, or TPS for short.
The scheme used by _mulle_objc_object_get_taggedpointer_index is the following:
Architecture | Bitmask |
---|---|
32 bit | 0x3 |
64 bit | 0x7 |
See mulle_objc_taggedpointer.h for more details.
If the value of self ANDed with the bitmask is zero, then self is a conventional object. Any other value indicates a tagged pointer, and that value is the index used to look up the class.
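
The tag test can be sketched in a few lines of C. This is only an illustration of the bitmask scheme from the table above: all `demo_` names are hypothetical and not part of the mulle-objc API, and placing the payload in the bits above the tag is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical names, for illustration only (not the mulle-objc API).
// Number of tag bits: 2 on 32 bit (mask 0x3), 3 on 64 bit (mask 0x7).
#define DEMO_TAG_BITS   (sizeof( uintptr_t) == 8 ? 3 : 2)
#define DEMO_TAG_MASK   (((uintptr_t) 1 << DEMO_TAG_BITS) - 1)

// 0 means conventional object, anything else is a TPS class index
static inline unsigned int   demo_get_taggedpointer_index( void *obj)
{
   return( (unsigned int) ((uintptr_t) obj & DEMO_TAG_MASK));
}

// Assumption: the payload lives in the bits above the tag
static inline void   *demo_create_taggedpointer( uintptr_t payload,
                                                 unsigned int index)
{
   assert( index && index <= DEMO_TAG_MASK);
   // the payload must survive the shift without losing bits
   assert( ((payload << DEMO_TAG_BITS) >> DEMO_TAG_BITS) == payload);
   return( (void *) ((payload << DEMO_TAG_BITS) | index));
}

static inline uintptr_t   demo_get_taggedpointer_payload( void *obj)
{
   return( (uintptr_t) obj >> DEMO_TAG_BITS);
}
```

A pointer to a suitably aligned conventional object has all tag bits clear, so demo_get_taggedpointer_index returns 0 for it, while a pointer made with demo_create_taggedpointer( 42, 1) yields index 1 and payload 42.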
Classes for TPS are stored in the runtime. And here we have the first problem.
In mulle-objc the runtime is usually accessed via the class, but we don’t have
the class yet.
Getting the TPS class
There are two possible configurations for the runtime: global and
thread-local. Global is the default. In this case the runtime is stored in
a global variable. Access to it is assumed to be reasonably fast, but it is
still another overhead incurred on every method call.
In the thread-local case, though, the runtime is retrieved via mulle_thread_tss_get, which does a pthread_getspecific on many platforms.
Now pthread_getspecific is very fast:

    pthread_getspecific:
    -> 0x100000f28 <+0>: jmpq *0xe2(%rip) ; (void *)0x00007fff86670d4c: pthread_getspecific
    libsystem_pthread.dylib`pthread_getspecific:
    -> 0x7fff86670d4c <+0>: movq %gs:(,%rdi,8), %rax
    0x7fff86670d55 <+9>: retq

Still, calling a shared library function could put even more of a damper on the proceedings. But none of this has really been benchmarked so far.
pthreads really should provide an inline function for pthread_getspecific.
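
To make the difference between the two configurations concrete, here is a minimal sketch. It uses plain pthreads directly instead of mulle_thread_tss_get, and every `demo_` name is made up for the example; it is not how the mulle-objc runtime is actually laid out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

// Hypothetical stand-in for the real runtime structure
struct demo_runtime
{
   int   dummy;
};

// global configuration: the address is known at link time
static struct demo_runtime   demo_global_runtime;

static inline struct demo_runtime   *demo_get_runtime_global( void)
{
   return( &demo_global_runtime);
}

// thread-local configuration: every lookup goes through the pthread key
static pthread_key_t    demo_runtime_key;
static pthread_once_t   demo_runtime_once = PTHREAD_ONCE_INIT;

static void   demo_create_key( void)
{
   if( pthread_key_create( &demo_runtime_key, NULL))
      abort();
}

static void   demo_set_runtime_threadlocal( struct demo_runtime *runtime)
{
   assert( runtime);
   pthread_once( &demo_runtime_once, demo_create_key);
   if( pthread_setspecific( demo_runtime_key, runtime))
      abort();
}

// returns NULL if the calling thread has not set a runtime yet
static struct demo_runtime   *demo_get_runtime_threadlocal( void)
{
   pthread_once( &demo_runtime_once, demo_create_key);
   return( pthread_getspecific( demo_runtime_key));
}
```

The global getter can be inlined down to a constant address, whereas the thread-local getter pays for at least one call into the pthread library on every lookup.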
Pros and Cons of TPS
What can we fit into a TPS? Small strings like mulle_char5_t, for example.
A standard object in mulle-objc has a guaranteed footprint of at least
2 * sizeof( uintptr_t), which translates to 16 bytes on 64 bit. This memory
is used for the retain count and isa.
Now add the data required for the characters. In an app that holds 16 M
unique strings of 7 ASCII characters each, that is 256 MB overhead for a
payload of about half the size.
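
The arithmetic behind these numbers can be checked with a small helper. The function names are hypothetical, and "16 M" is taken to mean 16 * 1024 * 1024 here.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helpers, for illustration only.
// Per-object header: retain count + isa, one uintptr_t each.
static inline uint64_t   demo_header_size( void)
{
   return( 2 * sizeof( uintptr_t));
}

// total bytes used by `count` items of `per_item` bytes each
static inline uint64_t   demo_total_bytes( uint64_t count, uint64_t per_item)
{
   // guard against overflow of the multiplication
   assert( ! per_item || count <= UINT64_MAX / per_item);
   return( count * per_item);
}
```

With 16 * 1024 * 1024 strings and a 16 byte header, demo_total_bytes gives 268435456 bytes (256 MB) of overhead. The same count with 7 bytes of characters each gives 117440512 bytes (112 MB) of payload, which is a little under half of the overhead.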
With tagged pointers you can eliminate this overhead, if the strings fit the
TPS encoding. The creation of a TPS object is also cheaper than that of a
conventional object, since you don’t call malloc. Retain and release of the
object are also very cheap, as they are NOPs.
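
As a rough sketch of how a short string could live entirely inside a pointer-sized value: seven 7-bit ASCII characters need 49 bits, which fit above the 3 tag bits of a 64 bit value. This is not the mulle_char5_t encoding and not the encoding mulle-objc uses; the names, the tag value and the bit layout are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical layout, for illustration only
#define DEMO_TAG_BITS     3
#define DEMO_TAG_MASK     0x7
#define DEMO_STRING_TAG   1     // made-up class index for strings

// Pack up to 7 ASCII characters (7 bits each) above the tag bits.
// No malloc is involved: the "object" is just the returned value.
static inline uint64_t   demo_pack_ascii7( const char *s)
{
   uint64_t   bits;
   size_t     i;
   size_t     len;

   len = strlen( s);
   assert( len <= 7);

   bits = 0;
   for( i = 0; i < len; i++)
   {
      // only 7-bit ASCII, and no embedded zero characters
      assert( s[ i] > 0 && (unsigned char) s[ i] < 0x80);
      bits |= (uint64_t) s[ i] << (7 * i);
   }
   return( (bits << DEMO_TAG_BITS) | DEMO_STRING_TAG);
}

// Unpack into buf, which must hold at least 8 bytes; returns the length
static inline size_t   demo_unpack_ascii7( uint64_t tagged, char *buf)
{
   uint64_t   bits;
   size_t     i;

   assert( (tagged & DEMO_TAG_MASK) == DEMO_STRING_TAG);

   bits = tagged >> DEMO_TAG_BITS;
   for( i = 0; i < 7 && (bits & 0x7F); i++)
   {
      buf[ i] = (char) (bits & 0x7F);
      bits  >>= 7;
   }
   buf[ i] = 0;
   return( i);
}
```

Since no memory is allocated, there is nothing to free, which is why retain and release can do nothing for such a value.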
The big downside of TPS is that it slows down the method calls of all other,
non-TPS objects, since every call now has to check for a tag first. The
carefully crafted inlinable code section of mulle_objc_object_call
now suddenly grows by quite a bit. This might make first-stage inlining
prohibitive. But if we remove this inlining, we will slow down other
objects even more.
So Boon or Bane?
I don’t know!
My gut feeling is that TPS will pay off in most programs. Currently
the compiler compiles with TPS by default. This defines __MULLE_OBJC_TPS__
(in “future” version 3.9.1.1). The runtime checks
this and adds the TPS-related code.
You can turn off the generation of tagged pointers with -fno-objc-tps.
Since you cannot mix TPS with non-TPS code, the runtime checks that you
don’t load classes with mixed settings.