Slabifier: A slab allocator with minimal meta information

Lately I have started tinkering with the slab allocator, in particular after
Matt Mackall mentioned at the Kernel Summit that the slab should be more
modular. One particular design issue with the current slab is that it is built
on the basic notion of shifting object references from list to list. Without
NUMA this is wild enough with the per cpu caches and the shared cache, but with
NUMA we now have per node shared arrays, per node lists and per node per node
alien caches. Somehow this all works, but one wonders: does it have to be that
way? On very large systems the number of these entities grows to unbelievable
numbers.

So I thought it may be best to try to develop another basic slab layer
that does not have all the object queues and that does not have to carry
so much state information. I have also had concerns about the way locking
is handled for a while. We could increase parallelism by finer grained locking.
This in turn may avoid the need for object queues.

After toying around for a while I came to the realization that the page struct
contains all the information necessary to manage a slab block. One can put
all the management information there, which is also advantageous
for performance since we constantly have to use the page struct anyway for
reverse object lookups and during slab creation. So this also reduces the
cache footprint of the slab. The alignment is naturally the best since the
first object starts right at the page boundary. This reduces the complexity
of alignment calculations.
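
To illustrate how little arithmetic this layout needs, here is a small
userspace sketch (not part of the patch). The helper names align_up(),
objects_per_slab() and object_offset() are made up for this example; the
patch does the equivalent computation in slab_create().

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of a; a must be a power of two. */
static size_t align_up(size_t x, size_t a)
{
	assert(a && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/*
 * Number of objects that fit into a slab of (page_size << order) bytes
 * when the first object sits at offset 0 and the objects are packed back
 * to back at multiples of the aligned size.
 */
static size_t objects_per_slab(size_t page_size, unsigned int order,
			       size_t size, size_t align)
{
	return (page_size << order) / align_up(size, align);
}

/* Byte offset of object number nr from the start of the slab. */
static size_t object_offset(size_t nr, size_t size, size_t align)
{
	return nr * align_up(size, align);
}
```

Because the slab starts on a page boundary, aligning the object size is
enough to align every object.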

We use two locks:

1. The per slab list_lock to protect the lists. This lock does not protect
   the slab and is only taken during list manipulation. List manipulation
   is reduced to the moves that are necessary when the state of a page
   changes. Allocating or freeing an object in a slab does not require
   the list_lock unless the slab changes state as a result (for example
   from full to partial, or from partial to empty).

2. The page lock in struct page is used to protect the slab during
   allocation and freeing. This lock is conveniently placed in a cacheline
   that is already available. Other key information is also placed there.
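
The rule for when a free has to touch the lists can be sketched as a pure
decision function. This is a userspace illustration under my reading of
slab_free() in this patch, not code from it; the enum and function names
are hypothetical.

```c
#include <assert.h>

/* What a free has to do with the lists. Hypothetical names. */
enum list_action {
	LIST_NONE,		/* slab lock only, list_lock is not taken */
	LIST_ADD_PARTIAL,	/* slab was full, now has one free object */
	LIST_DISCARD,		/* last object freed: unlink the slab (if it
				 * was listed) and return it to the page
				 * allocator */
};

static enum list_action free_list_action(int inuse_before, int objects,
					 int is_cpu_slab)
{
	assert(inuse_before > 0 && inuse_before <= objects);

	if (is_cpu_slab)		/* cpu slabs are not on any list */
		return LIST_NONE;
	if (inuse_before == 1)		/* slab becomes empty */
		return LIST_DISCARD;
	if (inuse_before == objects)	/* slab was full before the free */
		return LIST_ADD_PARTIAL;
	return LIST_NONE;		/* partial stays partial */
}
```

In the common case (a partial slab that stays partial, or a free into a
cpu slab) only the page lock is needed.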

struct page overloading:

- _mapcount	=> Used to count the objects in use in a slab
- mapping	=> Reference to the slab structure
- index		=> Pointer to the first free element in a slab
- lru		=> Used for list management.

Flag overloading:

PageReferenced	=> Used to control per cpu slab freeing.
PageActive	=> slab is under active allocation.
PageLocked	=> slab locking
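
The way PageReferenced controls per cpu slab freeing amounts to a
second-chance scheme: allocations set the flag, and the periodic flusher
either clears it or, if it was already clear, flushes the cpu slab. A
userspace model of that idea, with a made-up struct and made-up function
names standing in for the page flags, might look like this:

```c
#include <assert.h>

/* Hypothetical stand-in for the per cpu slab state kept in page flags. */
struct cpu_slab_model {
	int present;	/* this cpu currently has a cpuslab */
	int referenced;	/* set by every allocation (PageReferenced) */
};

static void model_note_alloc(struct cpu_slab_model *c)
{
	c->referenced = 1;
}

/* One flusher pass. Returns the number of cpu slabs still in use. */
static int model_flusher_pass(struct cpu_slab_model *slabs, int ncpus)
{
	int cpu, nr_active = 0;

	assert(ncpus >= 0);
	for (cpu = 0; cpu < ncpus; cpu++) {
		struct cpu_slab_model *c = &slabs[cpu];

		if (!c->present)
			continue;
		if (c->referenced) {
			c->referenced = 0;	/* give it one more interval */
			nr_active++;
		} else {
			c->present = 0;		/* idle: flush back to the lists */
		}
	}
	return nr_active;
}
```

A cpu slab therefore survives at most one full interval without an
allocation before it is given back.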

The free objects of each page are managed as a chained list.
The struct page contains a pointer to the first free element. One pointer-sized
word in each free element (the first word by default, or the word at the
configured offset) contains a pointer to the next free element, and so on
until the chain ends with NULL.
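
A minimal userspace sketch of such a chain, assuming a hypothetical
struct slab_page_model in place of struct page (freelist stands in for
page->index and inuse for page->_mapcount). It is not the patch code, and
it leaves out all locking:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of one slab page. */
struct slab_page_model {
	void *freelist;		/* first free object, NULL when full */
	int inuse;		/* number of objects handed out */
};

/*
 * Thread a NULL-terminated free chain through 'objects' slots of 'size'
 * bytes each. 'offset' is the index, in pointer-sized words, of the link
 * word inside every slot. 'start' must be pointer aligned.
 */
static void model_init(struct slab_page_model *pg, void *start,
		       size_t size, size_t offset, int objects)
{
	char *p = start;
	int i;

	assert(size % sizeof(void *) == 0);
	assert((offset + 1) * sizeof(void *) <= size);
	for (i = 0; i < objects; i++) {
		void **slot = (void **)(p + i * size);

		slot[offset] = (i + 1 < objects) ? p + (i + 1) * size : NULL;
	}
	pg->freelist = objects ? start : NULL;
	pg->inuse = 0;
}

/* Pop the first free object off the chain. */
static void *model_alloc(struct slab_page_model *pg, size_t offset)
{
	void **object = pg->freelist;

	if (!object)
		return NULL;
	pg->freelist = object[offset];
	pg->inuse++;
	return object;
}

/* Push an object back onto the front of the chain. */
static void model_free(struct slab_page_model *pg, void *x, size_t offset)
{
	void **object = x;

	object[offset] = pg->freelist;
	pg->freelist = object;
	pg->inuse--;
}
```

Note that a freed object goes to the front of the chain, so the most
recently freed object is the next one handed out.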

There is no freelist for slabs. Slabs are immediately returned to the page
allocator. The page allocator has its own per cpu page queues that should
provide enough caching (this only works for order 0 pages though).

Per cpu caches exist in the sense that each processor has a per processor
"cpuslab". Objects in this slab will only be allocated from this processor.
The page state is likely going to stay in the cache. Allocation will be
very fast since all we need is the page struct, which is likely not
contended at all. Fetching the next free pointer from the location of the
object nicely prefetches the object.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.18-rc4/mm/slabifier.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc4/mm/slabifier.c	2006-08-23 12:08:40.039798479 -0700
@@ -0,0 +1,1029 @@
+/*
+ * Generic Slabifier for any allocator with minimal management overhead.
+ *
+ * (C) 2006 Silicon Graphics Inc., Christoph Lameter <clameter@sgi.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/allocator.h>
+#include <linux/bit_spinlock.h>
+#include <linux/module.h>
+#include <linux/interrupt.h>
+
+// #define SLABIFIER_DEBUG
+
+#ifdef SLABIFIER_DEBUG
+#define	DBUG_ON(_x) BUG_ON(_x)
+#else
+#define DBUG_ON(_x)
+#endif
+
+// #define TPRINTK printk
+#define TPRINTK(x, ...)
+
+struct slab {
+	struct slab_cache sc;
+	struct work_struct flush;
+	ZONE_PADDING(slab_pad);		/* Align to cacheline boundary */
+	int size;			/* Slab size */
+	int offset;			/* Free pointer offset. */
+	int objects;			/* Number of objects in slab */
+	atomic_t refcount;		/* Refcount for destroy */
+	atomic_long_t nr_slabs;		/* Total slabs used */
+	spinlock_t list_lock;
+	struct list_head partial;	/* List of partially allocated slabs */
+	unsigned long nr_partial;	/* Partial slabs */
+	int flusher_active;
+	struct page *active[NR_CPUS];	/* Per CPU slabs list protected by
+					 * page lock
+					 */
+};
+
+/*
+ * The page struct is used to keep necessary information about a slab.
+ * For a compound page the first page keeps the slab state.
+ *
+ * Overloaded fields in struct page:
+ *
+ * 	lru	 -> used to link a slab on the lists
+ *	mapping	 -> pointer to struct slab
+ *	index	 -> pointer to next free object
+ *	_mapcount -> count number of elements in use
+ *
+ * Lock order:
+ *   1. slab_lock(page)
+ *   2. slab->list_lock
+ *
+ * The slabifier assigns one slab for allocation to each processor.
+ * This slab is referenced from the per cpu active array and allocations
+ * occur only on the active slabs. If a cpu slab is active then
+ * a workqueue thread checks every 10 seconds if the cpu slab is
+ * still in use. It is dropped back to the partial list if not.
+ *
+ * Leftover slabs with free elements are kept on the partial list.
+ * Full slabs are not kept on any list.
+ *
+ * Slabs are freed when they become empty. Teardown and setup is
+ * minimal so we rely on the page allocator's per cpu caches for
+ * performance on frees/allocs.
+ */
+
+#define lru_to_last_page(_head) (list_entry((_head)->prev, struct page, lru))
+#define lru_to_first_page(_head) (list_entry((_head)->next, struct page, lru))
+
+/* Some definitions to overload certain fields in struct page */
+
+static void *get_object_pointer(struct page *page)
+{
+	return (void *)page->index;
+}
+
+static void set_object_pointer(struct page *page, void *object)
+{
+	page->index = (unsigned long)object;
+}
+
+static struct slab *get_slab(struct page *page)
+{
+	return (struct slab *)page->mapping;
+}
+
+static void set_slab(struct page *page, struct slab *s)
+{
+	page->mapping = (void *)s;
+}
+
+static int *object_counter(struct page *page)
+{
+	return (int *)&page->_mapcount;
+}
+
+static void inc_object_counter(struct page *page)
+{
+	(*object_counter(page))++;
+}
+
+static void dec_object_counter(struct page *page)
+{
+	(*object_counter(page))--;
+}
+
+static void set_object_counter(struct page *page, int counter)
+{
+	(*object_counter(page)) = counter;
+}
+
+static int get_object_counter(struct page *page)
+{
+	return (*object_counter(page));
+}
+
+/*
+ * Locking for each individual slab using the pagelock
+ */
+static __always_inline void slab_lock(struct page *page)
+{
+	bit_spin_lock(PG_locked, &page->flags);
+}
+
+static __always_inline void slab_unlock(struct page *page)
+{
+	bit_spin_unlock(PG_locked, &page->flags);
+}
+
+static void add_partial(struct slab *s, struct page *page)
+{
+	spin_lock(&s->list_lock);
+	s->nr_partial++;
+	list_add_tail(&page->lru, &s->partial);
+	spin_unlock(&s->list_lock);
+}
+
+static void remove_partial(struct slab *s, struct page *page)
+{
+	spin_lock(&s->list_lock);
+	list_del(&page->lru);
+	s->nr_partial--;
+	spin_unlock(&s->list_lock);
+}
+
+/*
+ * Get a page and remove it from the partial list
+ * Must hold list_lock
+ */
+static int lock_and_del_slab(struct slab *s, struct page *page)
+{
+	if (bit_spin_trylock(PG_locked, &page->flags)) {
+		list_del(&page->lru);
+		s->nr_partial--;
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * Get a partial page, lock it and return it.
+ */
+static struct page *get_partial(struct slab *s, int node)
+{
+	struct page *page;
+	struct list_head *h;
+	int wanted_node;
+
+	spin_lock(&s->list_lock);
+
+#ifdef CONFIG_NUMA
+	/*
+	 * Search for slab on the right node
+	 *
+	 * This search is a scalability concern. Searching big
+	 * lists under lock can cause latencies.
+	 *
+	 * On the other hand picking the right slab that
+	 * is from the node where we are and maybe even
+	 * from the same cpu as before is very good
+	 * for latency.
+	 */
+	wanted_node = node < 0 ? numa_node_id() : node;
+	list_for_each(h, &s->partial) {
+		page = container_of(h, struct page, lru);
+
+		if (likely(page_to_nid(page) == wanted_node) &&
+			lock_and_del_slab(s, page))
+			goto out;
+	}
+
+	if (node >= 0)
+		goto fail;
+
+#endif
+	list_for_each(h, &s->partial) {
+		page = container_of(h, struct page, lru);
+
+		if (lock_and_del_slab(s, page))
+			goto out;
+	}
+fail:
+	page = NULL;
+out:
+	spin_unlock(&s->list_lock);
+	return page;
+}
+
+static void check_slab(struct page *page)
+{
+#ifdef SLABIFIER_DEBUG
+	if (!PageSlab(page)) {
+		printk(KERN_CRIT "Not a valid slab page @%p flags=%lx"
+			" mapping=%p count=%d \n",
+			page, page->flags, page->mapping, page_count(page));
+		BUG();
+	}
+#endif
+}
+
+static void check_active_slab(struct page *page)
+{
+#ifdef SLABIFIER_DEBUG
+	if (!PageActive(page)) {
+		printk(KERN_CRIT "Not an active slab page @%p flags=%lx"
+			" mapping=%p count=%d \n",
+			page, page->flags, page->mapping, page_count(page));
+		BUG();
+	}
+#endif
+}
+
+/*
+ * Discard an unused slab page
+ */
+static void discard_slab(struct slab *s, struct page *page)
+{
+	DBUG_ON(PageActive(page));
+	DBUG_ON(PageLocked(page));
+	atomic_long_dec(&s->nr_slabs);
+
+	/* Restore page state */
+	page->mapping = NULL;		/* was used for slab pointer */
+	page->index = 0;		/* was used for the object pointer */
+	reset_page_mapcount(page);	/* Was used for inuse counter */
+	__ClearPageSlab(page);
+
+	s->sc.page_alloc->free(s->sc.page_alloc, page, s->sc.order);
+	sub_zone_page_state(page_zone(page), NR_SLAB, 1 << s->sc.order);
+}
+
+/*
+ * Move a page back to the lists.
+ *
+ * Must be called with the slab lock held.
+ * On exit the slab lock will have been dropped.
+ */
+static void putback_slab(struct slab *s, struct page *page)
+{
+	int inuse;
+
+	inuse = get_object_counter(page);
+
+	TPRINTK(KERN_CRIT "putback_slab %s: %p %d/%d\n",s->sc.name, page, inuse, s->objects);
+	if (inuse) {
+		if (inuse < s->objects)
+			add_partial(s, page);
+		slab_unlock(page);
+	} else {
+		slab_unlock(page);
+		discard_slab(s, page);
+	}
+}
+
+/*
+ * Make the current active page inactive
+ */
+static void deactivate_slab(struct slab *s, struct page *page, int cpu)
+{
+	s->active[cpu] = NULL;
+	smp_wmb();
+	ClearPageActive(page);
+	ClearPageReferenced(page);
+
+	putback_slab(s, page);
+}
+
+static int check_valid_pointer(struct slab *s, struct page *page, void *object, void *origin)
+{
+#ifdef SLABIFIER_DEBUG
+	void *base = page_address(page);
+
+	if (object < base || object >= base + s->objects * s->size) {
+		printk(KERN_CRIT "slab %s size %d: pointer %p->%p\nnot in"
+			" range (%p-%p) in page %p\n", s->sc.name, s->size,
+			origin, object, base, base + s->objects * s->size,
+			page);
+		return 0;
+	}
+
+	if ((object - base) % s->size) {
+		printk(KERN_CRIT "slab %s size %d: pointer %p->%p\n"
+			"does not properly point "
+			"to an object in page %p\n",
+			s->sc.name, s->size, origin, object, page);
+		return 0;
+	}
+#endif
+	return 1;
+}
+
+/*
+ * Determine if a certain object on a page is on the freelist and
+ * therefore free. Must hold the slab lock for active slabs to
+ * guarantee that the chains are consistent.
+ */
+static int on_freelist(struct slab *s, struct page *page, void *search)
+{
+	int nr = 0;
+	void **object = get_object_pointer(page);
+	void *origin = &page->lru;
+
+	check_slab(page);
+
+	while (object && nr <= s->objects) {
+		if (object == search)
+			return 1;
+		if (!check_valid_pointer(s, page, object, origin))
+			goto try_recover;
+		origin = object;
+		object = object[s->offset];
+		nr++;
+	}
+
+	if (get_object_counter(page) != s->objects - nr) {
+		printk(KERN_CRIT "slab %s: page %p wrong object count."
+			" counter is %d but counted were %d\n",
+			s->sc.name, page, get_object_counter(page), s->objects - nr);
+try_recover:
+		printk(KERN_CRIT "****** Trying to continue by marking "
+			"all objects used (memory leak!)\n");
+		set_object_counter(page, s->objects);
+		set_object_pointer(page, NULL);
+	}
+	return 0;
+}
+
+void check_free_chain(struct slab *s, struct page *page)
+{
+#ifdef SLABIFIER_DEBUG
+	on_freelist(s, page, NULL);
+#endif
+}
+
+/*
+ * Allocate a new slab and prepare an empty freelist
+ * and the basic struct page settings.
+ * Return with the slab locked.
+ */
+static struct page *new_slab(struct slab *s, gfp_t flags, int node)
+{
+	void *p, *start, *end;
+	void **last;
+	struct page *page;
+
+	page = s->sc.page_alloc->allocate(s->sc.page_alloc, s->sc.order,
+			flags, node < 0 ? s->sc.node : node);
+	if (!page)
+		return NULL;
+
+	set_slab(page, s);
+	start = page_address(page);
+	set_object_pointer(page, start);
+
+	end = start + s->objects * s->size;
+	last = start;
+	for (p = start + s->size; p < end; p += s->size) {
+		last[s->offset] = p;
+		last = p;
+	}
+	last[s->offset] = NULL;
+	set_object_counter(page, 0);
+	__SetPageSlab(page);
+	check_free_chain(s, page);
+	add_zone_page_state(page_zone(page), NR_SLAB, 1 << s->sc.order);
+	atomic_long_inc(&s->nr_slabs);
+	slab_lock(page);
+	return page;
+}
+
+/*
+ * Acquire the slab lock from the active array. If there is no active
+ * slab for this processor then return NULL;
+ */
+static __always_inline struct page *get_and_lock_active(struct slab *s, int cpu) {
+	struct page *page;
+
+redo:
+	page = s->active[cpu];
+	if (unlikely(!page))
+		return NULL;
+	slab_lock(page);
+	if (unlikely(s->active[cpu] != page)) {
+		slab_unlock(page);
+		goto redo;
+	}
+	check_active_slab(page);
+	check_free_chain(s, page);
+	return page;
+}
+
+/*
+ * Flush an active slab back to the lists.
+ */
+static void flush_active(struct slab *s, int cpu)
+{
+	struct page *page;
+	unsigned long flags;
+
+	TPRINTK(KERN_CRIT "flush_active %s cpu=%d\n", s->sc.name, cpu);
+	local_irq_save(flags);
+	page = get_and_lock_active(s, cpu);
+	if (likely(page))
+		deactivate_slab(s, page, cpu);
+	local_irq_restore(flags);
+}
+
+/*
+ * Flush per cpu slabs if they are not in use.
+ */
+void flusher(void *d)
+{
+	struct slab *s = d;
+	int cpu = smp_processor_id();
+	struct page *page;
+	int nr_active = 0;
+
+	for_each_online_cpu(cpu) {
+
+		page = s->active[cpu];
+		if (!page)
+			continue;
+
+		if (PageReferenced(page)) {
+			ClearPageReferenced(page);
+			nr_active++;
+		} else
+			flush_active(s, cpu);
+	}
+	if (nr_active)
+		schedule_delayed_work(&s->flush, 10 * HZ);
+	else
+		s->flusher_active = 0;
+}
+
+/*
+ * Drain all per cpu slabs
+ */
+static void drain_all(struct slab *s)
+{
+	int cpu;
+
+	if (s->flusher_active) {
+		cancel_delayed_work(&s->flush);
+		for_each_possible_cpu(cpu)
+			flush_active(s, cpu);
+		s->flusher_active = 0;
+	}
+}
+
+/*
+ * slab_create produces objects aligned at size and the first object
+ * is placed at offset 0 in the slab (We have no metainformation on the
+ * slab, all slabs are in essence off slab).
+ *
+ * In order to get the desired alignment one just needs to align the
+ * size. For example:
+ *
+ * slab_create(&my_cache, ALIGN(sizeof(struct mystruct), L1_CACHE_BYTES),
+ *				2, page_allocator);
+ *
+ * Notice that the allocation order determines the sizes of the per cpu
+ * caches. Each processor has always one slab available for allocations.
+ * Increasing the allocation order reduces the number of times that slabs
+ * must be moved on and off lists and therefore influences overhead.
+ *
+ * The offset is used to relocate the free list link in each object. It is
+ * therefore possible to move the free list link behind the object. This
+ * is necessary for RCU to work properly and also useful for debugging.
+ */
+static struct slab_cache *slab_create(struct slab_control *x,
+	const struct slab_cache *sc)
+{
+	struct slab *s = (void *)x;
+	int cpu;
+
+	BUG_ON(sizeof(struct slab_control) < sizeof(struct slab));
+
+	memcpy(&x->sc, sc, sizeof(struct slab_cache));
+
+	s->size = ALIGN(sc->size, sizeof(void *));
+
+	if (sc->offset > s->size - sizeof(void *) || (sc->offset % sizeof(void*)))
+		return NULL;
+
+	s->offset = sc->offset / sizeof(void *);
+	s->objects = (PAGE_SIZE << sc->order) / s->size;
+	atomic_long_set(&s->nr_slabs, 0);
+	s->nr_partial = 0;
+	s->flusher_active = 0;
+
+	if (!s->objects)
+		return NULL;
+
+	INIT_LIST_HEAD(&s->partial);
+
+	atomic_set(&s->refcount, 1);
+	spin_lock_init(&s->list_lock);
+	INIT_WORK(&s->flush, &flusher, s);
+
+	for_each_possible_cpu(cpu)
+		s->active[cpu] = NULL;
+	return &s->sc;
+}
+
+/*
+ * Reload a new active cpu slab
+ *
+ * If we have reloaded successfully then we exit with holding the slab lock
+ * and return the pointer to the new page.
+ *
+ * Return NULL if we cannot reload.
+ */
+static struct page *reload(struct slab *s, unsigned long cpu, gfp_t flags,
+							int node)
+{
+	struct page *page;
+
+redo:
+	if (s->nr_partial) { /* Racy check. If we do a useless allocation then
+			         we just build up the partial list */
+		page = get_partial(s, node);
+		if (page)
+			goto gotpage;
+	}
+
+	if ((flags & __GFP_WAIT)) {
+		local_irq_enable();
+		page = new_slab(s, flags, node);
+		local_irq_disable();
+	} else
+		page = new_slab(s, flags, node);
+
+	if (!page)
+		return NULL;
+
+gotpage:
+	/*
+	 * Now we have a page that is isolated from the lists and
+	 * locked,
+	 */
+	SetPageActive(page);
+	ClearPageReferenced(page);
+
+	/*
+	 * Barrier is needed so that a racing process never
+	 * sees a page that does not have the active flag set.
+	 */
+	smp_wmb();
+
+	if (cmpxchg(&s->active[cpu], NULL, page) != NULL) {
+
+		TPRINTK(KERN_CRIT "active already provided %s\n", s->sc.name);
+
+		ClearPageActive(page);
+		add_partial(s, page);
+		slab_unlock(page);
+
+		page = get_and_lock_active(s, cpu);
+		if (page)
+			return page;
+		goto redo;
+	}
+
+	check_free_chain(s, page);
+	if (keventd_up() && !s->flusher_active &&
+			s->size != (PAGE_SIZE << s->sc.order))
+		schedule_delayed_work(&s->flush, 10 * HZ);
+	return page;
+}
+
+/*
+ * If the gfp mask has __GFP_WAIT set then slab_alloc() may enable interrupts
+ * if it needs to acquire more pages for new slabs.
+ */
+static __always_inline void *__slab_alloc(struct slab_cache *sc, gfp_t gfpflags,
+		int node)
+{
+	struct slab *s = (void *)sc;
+	struct page *page;
+	void **object;
+	void *next_object;
+	unsigned long flags;
+	int cpu = smp_processor_id();
+
+	local_irq_save(flags);
+	page = get_and_lock_active(s, cpu);
+	if (unlikely(!page))
+		goto load;
+
+	while (unlikely(!get_object_pointer(page) ||
+		(node >= 0 && page_to_nid(page) != node))) {
+
+		/* Current slab is unfit for allocation */
+		deactivate_slab(s, page, cpu);
+load:
+		/* Get a new slab */
+		page = reload(s, cpu, gfpflags, node);
+		if (!page) {
+			local_irq_restore(flags);
+			return NULL;
+		}
+	}
+
+	inc_object_counter(page);
+	object = get_object_pointer(page);
+	next_object = object[s->offset];
+	set_object_pointer(page, next_object);
+	check_free_chain(s, page);
+	SetPageReferenced(page);
+	slab_unlock(page);
+	local_irq_restore(flags);
+	return object;
+
+}
+
+static void *slab_alloc(struct slab_cache *sc, gfp_t gfpflags)
+{
+	return __slab_alloc(sc, gfpflags, -1);
+}
+
+static void *slab_alloc_node(struct slab_cache *sc, gfp_t gfpflags, int node)
+{
+	return __slab_alloc(sc, gfpflags, node);
+}
+
+/* Figure out on which slab page the object resides */
+static struct page *get_object_page(const void *x)
+{
+	struct page * page;
+
+	if (unlikely((unsigned long)x >= VMALLOC_START &&
+			(unsigned long)x < VMALLOC_END))
+		page = vmalloc_to_page(x);
+	else
+		page = virt_to_page(x);
+
+	if (unlikely(PageCompound(page)))
+		page = (struct page *)page_private(page);
+
+	if (!PageSlab(page))
+		return NULL;
+
+	return page;
+}
+
+/*
+ * If the struct slab pointer is NULL then we will determine the
+ * proper cache. Otherwise the slab ownership will be verified
+ * before object removal.
+ */
+static void slab_free(struct slab_cache *sc, const void *x)
+{
+	struct slab *s = (void *)sc;
+	struct page * page;
+	int leftover;
+	void *prior;
+	void **object = (void *)x;
+	unsigned long flags;
+
+	if (!object)
+		return;
+
+	page = get_object_page(object);
+	if (unlikely(!page)) {
+		printk(KERN_CRIT "slab_free %s size %d: attempt to free object "
+			"(%p) outside of slab.\n", s->sc.name, s->size, object);
+		goto dumpret;
+	}
+
+	if (!s) {
+		s = get_slab(page);
+
+		if (unlikely(!s)) {
+#ifdef CONFIG_SLAB
+			extern void __kfree(void *);
+			/*
+			 * Oops, we use multiple slab allocators. This
+			 * must be another one (regular slab?)
+			 */
+			__kfree(object);
+			return;
+#else
+			printk(KERN_CRIT
+				"slab_free : no slab(NULL) for object %p.\n",
+						object);
+#endif
+			goto dumpret;
+		}
+	}
+
+	if (unlikely(s != get_slab(page))) {
+		printk(KERN_CRIT "slab_free %s: object at %p"
+				" belongs to slab %p\n",
+				s->sc.name, object, get_slab(page));
+		dump_stack();
+		s = get_slab(page);
+	}
+
+	if (unlikely(!check_valid_pointer(s, page, object, NULL))) {
+dumpret:
+		dump_stack();
+		printk(KERN_CRIT "***** Trying to continue by not "
+				"freeing object.\n");
+		return;
+	}
+
+	local_irq_save(flags);
+	slab_lock(page);
+
+	/* Check for double free */
+#ifdef SLABIFIER_DEBUG
+	if (on_freelist(s, page, object)) {
+		printk(KERN_CRIT "slab_free %s: object %p already free.\n",
+						s->sc.name, object);
+		dump_stack();
+		goto out_unlock;
+	}
+#endif
+
+	prior = get_object_pointer(page);
+	object[s->offset] = prior;
+
+	set_object_pointer(page, object);
+	dec_object_counter(page);
+	leftover = get_object_counter(page);
+
+	if (unlikely(PageActive(page)))
+		goto out_unlock;
+
+	if (unlikely(leftover == 0)) {
+		/*
+		 * We deallocated all objects in a slab and the slab
+		 * is not under allocation. So we can free it.
+		 */
+		if (s->objects > 1)
+			remove_partial(s, page);
+		check_free_chain(s, page);
+		slab_unlock(page);
+		discard_slab(s, page);
+		goto out;
+	}
+	if (unlikely(!prior)) {
+		/*
+		 * Page was fully used before. It will only have one free
+		 * object now. So move to the front of the partial list.
+		 * This will increase the chances of the first object
+		 * to be reused soon. It is likely cache hot.
+		 */
+		add_partial(s, page);
+	}
+out_unlock:
+	slab_unlock(page);
+out:
+	local_irq_restore(flags);
+}
+
+/*
+ * Check if a given pointer is valid
+ */
+static int slab_pointer_valid(struct slab_cache *sc, const void *object)
+{
+	struct slab *s = (void *)sc;
+	struct page * page;
+	void *addr;
+
+	TPRINTK(KERN_CRIT "slab_pointer_valid(%s,%p)\n",s->sc.name, object);
+
+	page = get_object_page(object);
+
+	if (!page || s != get_slab(page))
+		return 0;
+
+	addr = page_address(page);
+	if (object < addr || object >= addr + s->objects * s->size)
+		return 0;
+
+	if ((object - addr) % s->size)
+		return 0;
+
+	return 1;
+}
+
+/*
+ * Determine the size of a slab object
+ */
+static unsigned long slab_object_size(struct slab_cache *sc,
+						const void *object)
+{
+	struct page *page;
+	struct slab *s;
+
+	TPRINTK(KERN_CRIT "slab_object_size(%p)\n",object);
+
+	page = get_object_page(object);
+	if (page) {
+		s = get_slab(page);
+		BUG_ON(sc && s != (void *)sc);
+		if (s)
+			return s->size;
+	}
+	BUG();
+	return 0;	/* Satisfy compiler */
+}
+
+/*
+ * Move slab objects in a given slab by calling the move_objects
+ * function.
+ *
+ * Must be called with the slab lock held but will drop and reacquire the
+ * slab lock.
+ */
+static int move_slab_objects(struct slab *s, struct page *page,
+			 int (*move_objects)(struct slab_cache *, void *))
+{
+	int unfreeable = 0;
+	void *addr = page_address(page);
+
+	while (get_object_counter(page) - unfreeable > 0) {
+		void *p;
+
+		for (p = addr; p < addr + s->objects * s->size; p += s->size) {
+			if (!on_freelist(s, page, p)) {
+				/*
+				 * Drop the lock here to allow the
+				 * move_object function to do things
+				 * with the slab_cache and maybe this
+				 * page.
+				 *
+				 */
+				slab_unlock(page);
+				local_irq_enable();
+				if (move_objects((struct slab_cache *)s, p))
+					slab_free(&s->sc, p);
+				else
+					unfreeable++;
+				local_irq_disable();
+				slab_lock(page);
+			}
+		}
+	}
+	return unfreeable;
+}
+
+/*
+ * Shrinking drops all the active per cpu slabs and also reaps all empty
+ * slabs off the partial list. Returns the number of slabs freed.
+ *
+ * If a move_object function is specified then the partial list is going
+ * to be compacted by calling the function on all slabs until only a single
+ * slab is on the partial list. The move_object function
+ * will be called for each object in the respective slab. Move_object needs to
+ * perform a new allocation for the object and move the contents
+ * of the object to the new location. If move_object returns 1
+ * for success then the object is going to be removed. If 0 then the
+ * slab object cannot be freed at all.
+ *
+ * Try to remove as many slabs as possible. In particular try to undo the
+ * effect of slab_preload which may have added empty pages to the
+ * partial list.
+ *
+ * Returns the number of slabs freed.
+ */
+static int slab_shrink(struct slab_cache *sc,
+			int (*move_object)(struct slab_cache *, void *))
+{
+	struct slab *s = (void *)sc;
+	unsigned long flags;
+	int slabs_freed = 0;
+	int i;
+
+	drain_all(s);
+
+	local_irq_save(flags);
+	for(i = 0; s->nr_partial > 1 && i < s->nr_partial - 1; i++ ) {
+		struct page *page = get_partial(s, -1);
+
+		if (!page)	/* Nothing left that we can lock */
+			break;
+		SetPageActive(page);	/* Pin page so that slab_free will not free */
+		/*
+		 * Ok. The page cannot become active anymore.
+		 * Only frees can occur in the page.
+		 */
+		if (s->nr_partial) {
+			if (get_object_counter(page) < s->objects)
+				if (move_slab_objects(s,
+						page, move_object) == 0)
+					slabs_freed++;
+
+		}
+		/*
+		 * This will put the slab on the front of the partial
+		 * list, the used list or free it.
+		 */
+		putback_slab(s, page);
+	}
+	local_irq_restore(flags);
+
+	return slabs_freed;
+
+}
+
+/* Duplicate a slab handle */
+static struct slab_cache *slab_dup(struct slab_cache *sc)
+{
+	struct slab *s = (void *)sc;
+
+	atomic_inc(&s->refcount);
+	return &s->sc;
+}
+
+static void free_list(struct slab *s, struct list_head *list)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&s->list_lock, flags);
+	while (!list_empty(list))
+		discard_slab(s, lru_to_last_page(list));
+
+	spin_unlock_irqrestore(&s->list_lock, flags);
+}
+
+/*
+ * Release all leftover slabs. If there are any leftover pointers dangling
+ * to these objects then we will get into a lot of trouble later.
+ */
+static int slab_destroy(struct slab_cache *sc)
+{
+	struct slab * s = (void *)sc;
+
+	if (!atomic_dec_and_test(&s->refcount))
+		return 0;
+
+	TPRINTK("Slab destroy %s\n",sc->name);
+
+	drain_all(s);
+
+	/* There may be empty slabs on the partial list */
+	free_list(s, &s->partial);
+
+	if (atomic_long_read(&s->nr_slabs))
+		return 1;
+
+	/* Just to make sure that no one uses this again */
+	s->size = 0;
+	return 0;
+
+}
+
+static unsigned long count_objects(struct slab *s, struct list_head *list)
+{
+	int count = 0;
+	struct list_head *h;
+	unsigned long flags;
+
+	spin_lock_irqsave(&s->list_lock, flags);
+	list_for_each(h, list) {
+		struct page *page = list_entry(h, struct page, lru);
+
+		count += get_object_counter(page);
+	}
+	spin_unlock_irqrestore(&s->list_lock, flags);
+	return count;
+}
+
+static unsigned long slab_objects(struct slab_cache *sc,
+			unsigned long *p_active, unsigned long *p_partial)
+{
+	struct slab *s = (void *)sc;
+	int partial;
+	int active = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		if (s->active[cpu])
+			active++;
+
+	partial = count_objects(s, &s->partial);
+
+	if (p_partial)
+		*p_partial = partial;
+
+	if (p_active)
+		*p_active = active;
+
+	return active + partial +
+		(atomic_long_read(&s->nr_slabs) - s->nr_partial) * s->objects;
+}
+
+const struct slab_allocator slabifier_allocator = {
+	.name = "Slabifier",
+	.create = slab_create,
+	.alloc = slab_alloc,
+	.alloc_node = slab_alloc_node,
+	.free = slab_free,
+	.valid_pointer = slab_pointer_valid,
+	.object_size = slab_object_size,
+	.objects = slab_objects,
+	.shrink = slab_shrink,
+	.dup = slab_dup,
+	.destroy = slab_destroy,
+	.destructor = null_slab_allocator_destructor,
+};
+EXPORT_SYMBOL(slabifier_allocator);
Index: linux-2.6.18-rc4/mm/Makefile
===================================================================
--- linux-2.6.18-rc4.orig/mm/Makefile	2006-08-22 21:57:45.429194510 -0700
+++ linux-2.6.18-rc4/mm/Makefile	2006-08-23 12:08:00.279564703 -0700
@@ -24,5 +24,5 @@ obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
-obj-$(CONFIG_MODULAR_SLAB) += allocator.o
+obj-$(CONFIG_MODULAR_SLAB) += allocator.o slabifier.o
 
Index: linux-2.6.18-rc4/mm/memory.c
===================================================================
--- linux-2.6.18-rc4.orig/mm/memory.c	2006-08-06 11:20:11.000000000 -0700
+++ linux-2.6.18-rc4/mm/memory.c	2006-08-23 12:08:00.291282727 -0700
@@ -2432,7 +2432,7 @@ int make_pages_present(unsigned long add
 /* 
  * Map a vmalloc()-space virtual address to the physical page.
  */
-struct page * vmalloc_to_page(void * vmalloc_addr)
+struct page * vmalloc_to_page(const void * vmalloc_addr)
 {
 	unsigned long addr = (unsigned long) vmalloc_addr;
 	struct page *page = NULL;
Index: linux-2.6.18-rc4/include/linux/mm.h
===================================================================
--- linux-2.6.18-rc4.orig/include/linux/mm.h	2006-08-06 11:20:11.000000000 -0700
+++ linux-2.6.18-rc4/include/linux/mm.h	2006-08-23 12:08:00.299094743 -0700
@@ -1013,7 +1013,7 @@ static inline unsigned long vma_pages(st
 }
 
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
-struct page *vmalloc_to_page(void *addr);
+struct page *vmalloc_to_page(const void *addr);
 unsigned long vmalloc_to_pfn(void *addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);