SLUB: slab defragmentation and kmem_cache_vacate

Slab defragmentation occurs when the slabs are shrunk (after inode, dentry
shrinkers have been run from the reclaim code) or when a manual shrinking
is requested via slabinfo. During the shrink operation SLUB will generate a
list of partially populated slabs sorted by the number of objects in use.

We extract pages off that list that are only filled less than a quarter and
attempt to motivate the users of those slabs to either remove the objects
or move the objects.

Targeted reclaim allows to target a single slab for reclaim. This is done by
calling

kmem_cache_vacate(page);

It will return 1 on success, 0 if the operation failed.


In order for a slabcache to support defragmentation a couple of functions
must be defined via kmem_cache_ops. These are

void *get(struct kmem_cache *s, int nr, void **objects)

	Must obtain a reference to the listed objects. SLUB guarantees that
	the objects are still allocated. However, other threads may be blocked
	in slab_free attempting to free objects in the slab. These may succeed
	as soon as get() returns to the slab allocator. The function must
	be able to detect the situation and void the attempts to handle such
	objects (by for example voiding the corresponding entry in the objects
	array).

	No slab operations may be performed in get_reference(). Interrupts
	are disabled. What can be done is very limited. The slab lock
	for the page with the object is taken. Any attempt to perform a slab
	operation may lead to a deadlock.

	get() returns a private pointer that is passed to kick. Should we
	be unable to obtain all references then that pointer may indicate
	to the kick() function that it should not attempt any object removal
	or move but simply remove the reference counts.

void kick(struct kmem_cache *, int nr, void **objects, void *get_result)

	After SLUB has established references to the objects in a
	slab it will drop all locks and then use kick() to move objects out
	of the slab. The existence of the object is guaranteed by virtue of
	the earlier obtained references via get(). The callback may perform
	any slab operation since no locks are held at the time of call.

	The callback should remove the object from the slab in some way. This
	may be accomplished by reclaiming the object and then running
	kmem_cache_free() or reallocating it and then running
	kmem_cache_free(). Reallocation is advantageous because the partial
	slabs were just sorted to have the partial slabs with the most objects
	first. Allocation is likely to result in filling up a slab so that
	it can be removed from the partial list.

	Kick() does not return a result. SLUB will check the number of
	remaining objects in the slab. If all objects were removed then
	we know that the operation was successful.

If a kmem_cache_vacate on a page fails then the slab has usually a pretty
low usage ratio. Go through the slab and resequence the freelist so that
object addresses increase as we allocate objects. This will trigger the
cacheline prefetcher when we start allocating from the slab again and
thereby increase allocations speed.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slab.h |   31 +++++
 mm/slab.c            |    9 +
 mm/slob.c            |    9 +
 mm/slub.c            |  271 ++++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 307 insertions(+), 13 deletions(-)

Index: slub/include/linux/slab.h
===================================================================
--- slub.orig/include/linux/slab.h	2007-05-31 14:34:12.000000000 -0700
+++ slub/include/linux/slab.h	2007-05-31 14:34:42.000000000 -0700
@@ -85,6 +85,7 @@ void kmem_cache_free(struct kmem_cache *
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+int kmem_cache_vacate(struct page *);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
Index: slub/mm/slub.c
===================================================================
--- slub.orig/mm/slub.c	2007-05-31 14:34:12.000000000 -0700
+++ slub/mm/slub.c	2007-05-31 14:35:32.000000000 -0700
@@ -1009,6 +1009,7 @@ static inline int slab_pad_check(struct 
 static inline int check_object(struct kmem_cache *s, struct page *page,
 			void *object, int active) { return 1; }
 static inline void add_full(struct kmem_cache_node *n, struct page *page) {}
+static inline void remove_full(struct kmem_cache *s, struct page *page) {}
 static inline void kmem_cache_open_debug_check(struct kmem_cache *s) {}
 #define slub_debug 0
 #endif
@@ -2426,6 +2427,121 @@ out:
 }
 
 /*
+ * Order the freelist so that addresses increase as object are allocated.
+ * This is useful to trigger the cpu cacheline prefetching logic.
+ */
+void resequence_freelist(struct kmem_cache *s, struct page *page)
+{
+	void *p;
+	void *last;
+	void *addr = page_address(page);
+	DECLARE_BITMAP(map, s->objects);
+
+	bitmap_zero(map, s->objects);
+
+	/* Figure out which objects are on the freelist */
+	for_each_free_object(p, s, page->freelist)
+		set_bit(slab_index(p, s, addr), map);
+
+	last = NULL;
+	for_each_object(p, s, addr)
+		if (test_bit(slab_index(p, s, addr), map)) {
+			if (last)
+				set_freepointer(s, last, p);
+			else
+				page->freelist = p;
+			last = p;
+		}
+
+	if (last)
+		set_freepointer(s, last, NULL);
+	else
+		page->freelist = NULL;
+}
+
+/*
+ * Get a page off a list and freeze it. Must be holding slab lock.
+ */
+static void freeze_from_list(struct kmem_cache *s, struct page *page)
+{
+	if (page->inuse < s->objects)
+		remove_partial(s, page);
+	else if (s->flags & SLAB_STORE_USER)
+		remove_full(s, page);
+	SetSlabFrozen(page);
+}
+
+/*
+ * Attempt to free objects in a page. Return 1 if succesful.
+ */
+int kmem_cache_vacate(struct page *page)
+{
+	unsigned long flags;
+	struct kmem_cache *s;
+	int vacated = 0;
+	void **vector = NULL;
+
+	/*
+	 * Get a reference to the page. Return if its freed or being freed.
+	 * This is necessary to make sure that the page does not vanish
+	 * from under us before we are able to check the result.
+	 */
+	if (!get_page_unless_zero(page))
+		return 0;
+
+	if (!PageSlab(page))
+		goto out;
+
+	s = page->slab;
+	if (!s)
+		goto out;
+
+	vector = kmalloc(s->objects * sizeof(void *), GFP_KERNEL);
+	if (!vector)
+		goto out2;
+
+	local_irq_save(flags);
+	/*
+	 * The implicit memory barrier in slab_lock guarantees that page->inuse
+	 * is loaded after PageSlab(page) has been established to be true. This is
+	 * only revelant for a  newly created slab.
+	 */
+	slab_lock(page);
+
+	/*
+	 * We may now have locked a page that may be in various stages of
+	 * being freed. If the PageSlab bit is off then we have already
+	 * reached the page allocator. If page->inuse is zero then we are
+	 * in SLUB but freeing or allocating the page.
+	 * page->inuse is never modified without the slab lock held.
+	 *
+	 * Also abort if the page happens to be already frozen. If its
+	 * frozen then a concurrent vacate may be in progress.
+	 */
+	if (!PageSlab(page) || SlabFrozen(page) || !page->inuse)
+		goto out_locked;
+
+	/*
+	 * We are holding a lock on a slab page and all operations on the
+	 * slab are blocking.
+	 */
+	if (!s->ops->get || !s->ops->kick)
+		goto out_locked;
+	freeze_from_list(s, page);
+	vacated = __kmem_cache_vacate(s, page, flags, vector);
+out:
+	kfree(vector);
+out2:
+	put_page(page);
+	return vacated == 0;
+out_locked:
+	slab_unlock(page);
+	local_irq_restore(flags);
+	goto out;
+
+}
+
+/*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
  * most items in use come first. New allocations will then fill those up
Index: slub/mm/slab.c
===================================================================
--- slub.orig/mm/slab.c	2007-05-31 14:34:12.000000000 -0700
+++ slub/mm/slab.c	2007-05-31 14:34:42.000000000 -0700
@@ -2516,6 +2516,15 @@ int kmem_cache_shrink(struct kmem_cache 
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+/*
+ * SLAB does not support slab defragmentation
+ */
+int kmem_cache_vacate(struct page *page)
+{
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_vacate);
+
 /**
  * kmem_cache_destroy - delete a cache
  * @cachep: the cache to destroy
Index: slub/mm/slob.c
===================================================================
--- slub.orig/mm/slob.c	2007-05-31 14:34:12.000000000 -0700
+++ slub/mm/slob.c	2007-05-31 14:34:42.000000000 -0700
@@ -591,6 +591,15 @@ int kmem_cache_shrink(struct kmem_cache 
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+/*
+ * SLOB does not support slab defragmentation
+ */
+int kmem_cache_vacate(struct page *page)
+{
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_vacate);
+
 int kmem_ptr_validate(struct kmem_cache *a, const void *b)
 {
 	return 0;