From clameter@sgi.com Fri May 4 15:05:19 2007
Message-Id: <20070504215839.290346570@sgi.com>
User-Agent: quilt/0.46-1
Date: Fri, 04 May 2007 14:58:39 -0700
From: clameter@sgi.com
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, dgc@sgi.com, Dipankar Sarma, Eric Dumazet, Mel Gorman
Subject: [patch 0/3] [RFC] Slab Defrag / Slab Targeted Reclaim and general Slab API changes

I originally intended this for the 2.6.23 development cycle, but since there
is an aggressive push for SLUB I thought that we may want to introduce this
earlier.

This is an RFC for patches that make major changes to the way slab
allocations are handled, in order to introduce some more advanced features
and to get rid of some things that are no longer used or are awkward.
Specifically:

A. Add slab defragmentation

On kmem_cache_shrink SLUB will not only sort the partial slabs by object
number but also attempt to free objects out of partial slabs that have a low
number of objects. Doing so increases the object density in the remaining
partial slabs.

Ideally kmem_cache_shrink would be able to completely defragment the partial
list so that only one partial slab is left over. But it is advantageous to
keep slabs with a few free objects around since that speeds up kfree. Going
to the extreme here would also require the reclaimable slabs to be able to
move objects. So we just free objects in slabs with a low population ratio.

B. Targeted reclaim

This mainly supports antifragmentation / defragmentation methods. The slab
allocator gains a new function

	kmem_cache_vacate(struct page *)

which can be used to request that a page be cleared of all objects. This
makes it possible to reduce the size of the RECLAIMABLE fragmentation area
and to move slabs into the MOVABLE area, enhancing the capabilities of
antifragmentation significantly.

C. Introduce a slab_ops structure that allows a slab user to provide
operations on slabs. This replaces the current constructor / destructor
scheme and is necessary in order to support the additional methods needed
for targeted reclaim and slab defragmentation.

A slab supporting targeted reclaim and slab defragmentation must provide the
following additional methods:

1. get_reference(void *)

Get a reference on a particular slab object.

2. kick_object(void *)

Kick an object off a slab. The object is either reclaimed (easiest) or a new
object is allocated using kmem_cache_alloc() and the object is moved to the
new location.

D. Slab creation is no longer done using kmem_cache_create()

kmem_cache_create() is not a clean API: it has only two callbacks, for
constructor and destructor, and does not allow the specification of a
slab_ops structure. Its parameters are confusing. For example, it is
possible to specify alignment information in the alignment field and in
addition in the flags field (SLAB_HWCACHE_ALIGN). The semantics of
SLAB_HWCACHE_ALIGN are fuzzy because it only aligns objects larger than half
a cacheline, so there is no guarantee of an alignment at all.

All of this is really not necessary since the compiler knows how to align
structures, and we should use this information instead of having the user
specify an alignment. I would like to get rid of SLAB_HWCACHE_ALIGN and
kmem_cache_create(). Instead one would use the following macros (which then
result in a call to __kmem_cache_create()):

	KMEM_CACHE(<struct>, <flags>)

The macro will determine the slab name from the struct name and use it for
/sys/slab, will use the size of the struct for the slab size and the
alignment of the structure for the alignment.
This means one will be able to set slab object alignment by specifying the
usual alignment options for static allocations when defining the structure.
Since the name is derived from the struct name it will be much easier to
find the source code for slabs listed in /sys/slab.

An additional macro is provided if the slab also supports slab operations:

	KMEM_CACHE_OPS(<struct>, <flags>, <slab_ops>)

It is likely that this macro will be rarely used.

E. kmem_cache_create() / SLAB_HWCACHE_ALIGN legacy interface

In order to avoid having to modify all slab creation calls throughout the
kernel we provide a kmem_cache_create() emulation. That function is the only
call that will still understand SLAB_HWCACHE_ALIGN. If that flag is
specified then the emulation sets up the proper alignment (the slab
allocators never see that flag). If a constructor or destructor is specified
then we allocate a slab_ops structure and populate it with the values given.
Note that this will cause a memory leak if the slab is disposed of later. If
you need disposable slabs then the new API must be used.

F. Remove destructor support from all slab allocators?

I am only aware of two call sites left after all the changes that are
scheduled to go into 2.6.22-rc1 have been merged. These are in the FRV and
sh arch code. The one in FRV will go away if they switch to quicklists like
i386. Sh contains another use, but a single user is no justification for
keeping destructors around.

--

From clameter@sgi.com Fri May 4 15:05:19 2007
Message-Id: <20070504220519.815033039@sgi.com>
References: <20070504215839.290346570@sgi.com>
User-Agent: quilt/0.46-1
Date: Fri, 04 May 2007 14:58:40 -0700
From: clameter@sgi.com
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, dgc@sgi.com, Dipankar Sarma, Eric Dumazet, Mel Gorman
Subject: [patch 1/3] SLUB: slab_ops instead of constructors / destructors
Content-Disposition: inline; filename=slabapic23

This patch gets rid of constructors and destructors and replaces them with a
slab operations structure that is passed to SLUB. For backward compatibility
we provide a kmem_cache_create() emulation that can construct a slab
operations structure on the fly.

The new APIs to create slabs are:

Without any callbacks:

	slabhandle = KMEM_CACHE(<struct>, <flags>)

Creates a slab based on the structure definition, using the structure's
alignment, size and name. This is cleaner because the name showing up in
/sys/slab/xxx will be the structure name, so one can search the source for
it. The usual alignment attributes on the struct control the slab alignment.

Note: SLAB_HWCACHE_ALIGN is *not* supported as a flag. The flags do *not*
specify alignments. Alignment is taken from the structure and nowhere else.

Create a slab cache with slab_ops (please use only for special slabs):

	KMEM_CACHE_OPS(<struct>, <flags>, <slab_ops>)

Old kmem_cache_create() support:

kmem_cache_create() still accepts the specification of SLAB_HWCACHE_ALIGN
*if* no other alignment is specified. In that case kmem_cache_create() will
generate a proper alignment depending on the size of the structure.
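
To make the new calling convention concrete, here is a rough sketch of how a
slab cache might be set up with these macros. The struct name (my_item), the
constructor and the init function are made up for illustration and are not
part of this patch; the ctor keeps the three-argument signature carried over
into struct slab_ops (see the slab.h changes below).

	/* Hypothetical example only -- not part of this patch. */
	struct my_item {
		struct list_head list;
		unsigned long state;
	} ____cacheline_aligned_in_smp;	/* alignment comes from the struct */

	static void my_item_ctor(void *object, struct kmem_cache *s,
						unsigned long flags)
	{
		struct my_item *item = object;

		INIT_LIST_HEAD(&item->list);
		item->state = 0;
	}

	static struct slab_ops my_item_slab_ops = {
		.ctor = my_item_ctor,
	};

	static struct kmem_cache *my_item_cache;

	static int __init my_item_init(void)
	{
		/*
		 * Common case: no callbacks. Name, size and alignment are
		 * all derived from struct my_item itself.
		 */
		my_item_cache = KMEM_CACHE(my_item, SLAB_PANIC);

		/*
		 * Special slabs that need callbacks would instead use:
		 *	my_item_cache = KMEM_CACHE_OPS(my_item, SLAB_PANIC,
		 *					&my_item_slab_ops);
		 */
		return 0;
	}

The point is that alignment is expressed once, on the structure definition,
and the cache name that appears in /sys/slab matches the struct name.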
Signed-off-by: Christoph Lameter --- include/linux/slab.h | 60 ++++++++++++++++++++++++++++++------ include/linux/slub_def.h | 3 - mm/slub.c | 77 +++++++++++++++++------------------------------ 3 files changed, 80 insertions(+), 60 deletions(-) Index: slub/include/linux/slab.h =================================================================== --- slub.orig/include/linux/slab.h 2007-05-03 20:53:00.000000000 -0700 +++ slub/include/linux/slab.h 2007-05-04 02:38:38.000000000 -0700 @@ -23,7 +23,6 @@ typedef struct kmem_cache kmem_cache_t _ #define SLAB_DEBUG_FREE 0x00000100UL /* DEBUG: Perform (expensive) checks on free */ #define SLAB_RED_ZONE 0x00000400UL /* DEBUG: Red zone objs in a cache */ #define SLAB_POISON 0x00000800UL /* DEBUG: Poison objects */ -#define SLAB_HWCACHE_ALIGN 0x00002000UL /* Align objs on cache lines */ #define SLAB_CACHE_DMA 0x00004000UL /* Use GFP_DMA memory */ #define SLAB_STORE_USER 0x00010000UL /* DEBUG: Store the last owner for bug hunting */ #define SLAB_RECLAIM_ACCOUNT 0x00020000UL /* Objects are reclaimable */ @@ -32,19 +31,21 @@ typedef struct kmem_cache kmem_cache_t _ #define SLAB_MEM_SPREAD 0x00100000UL /* Spread some memory over cpuset */ #define SLAB_TRACE 0x00200000UL /* Trace allocations and frees */ -/* Flags passed to a constructor functions */ -#define SLAB_CTOR_CONSTRUCTOR 0x001UL /* If not set, then deconstructor */ - /* * struct kmem_cache related prototypes */ void __init kmem_cache_init(void); int slab_is_available(void); -struct kmem_cache *kmem_cache_create(const char *, size_t, size_t, - unsigned long, - void (*)(void *, struct kmem_cache *, unsigned long), - void (*)(void *, struct kmem_cache *, unsigned long)); +struct slab_ops { + /* FIXME: ctor should only take the object as an argument. */ + void (*ctor)(void *, struct kmem_cache *, unsigned long); + /* FIXME: Remove all destructors ? */ + void (*dtor)(void *, struct kmem_cache *, unsigned long); +}; + +struct kmem_cache *__kmem_cache_create(const char *, size_t, size_t, + unsigned long, struct slab_ops *s); void kmem_cache_destroy(struct kmem_cache *); int kmem_cache_shrink(struct kmem_cache *); void *kmem_cache_alloc(struct kmem_cache *, gfp_t); @@ -62,9 +63,14 @@ int kmem_ptr_validate(struct kmem_cache * f.e. add ____cacheline_aligned_in_smp to the struct declaration * then the objects will be properly aligned in SMP configurations. */ -#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\ +#define KMEM_CACHE(__struct, __flags) __kmem_cache_create(#__struct,\ sizeof(struct __struct), __alignof__(struct __struct),\ - (__flags), NULL, NULL) + (__flags), NULL) + +#define KMEM_CACHE_OPS(__struct, __flags, __ops) \ + __kmem_cache_create(#__struct, sizeof(struct __struct), \ + __alignof__(struct __struct), (__flags), (__ops)) + #ifdef CONFIG_NUMA extern void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node); @@ -236,6 +242,40 @@ extern void *__kmalloc_node_track_caller extern const struct seq_operations slabinfo_op; ssize_t slabinfo_write(struct file *, const char __user *, size_t, loff_t *); +/* + * Legacy functions + * + * The sole reason that these definitions are here is because of their + * frequent use. Remove when all call sites have been updated. 
+ */ +#define SLAB_HWCACHE_ALIGN 0x8000000000UL +#define SLAB_CTOR_CONSTRUCTOR 0x001UL + +static inline struct kmem_cache *kmem_cache_create(const char *s, + size_t size, size_t align, unsigned long flags, + void (*ctor)(void *, struct kmem_cache *, unsigned long), + void (*dtor)(void *, struct kmem_cache *, unsigned long)) +{ + struct slab_ops *so = NULL; + + if ((flags & SLAB_HWCACHE_ALIGN) && size > L1_CACHE_BYTES / 2) { + /* Clear the align flag. It is no longer supported */ + flags &= ~SLAB_HWCACHE_ALIGN; + + /* Do not allow conflicting alignment specificiations */ + BUG_ON(align); + + /* And set the cacheline alignment */ + align = L1_CACHE_BYTES; + } + if (ctor || dtor) { + so = kzalloc(sizeof(struct slab_ops), GFP_KERNEL); + so->ctor = ctor; + so->dtor = dtor; + } + return __kmem_cache_create(s, size, align, flags, so); +} + #endif /* __KERNEL__ */ #endif /* _LINUX_SLAB_H */ Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-04 02:23:34.000000000 -0700 +++ slub/mm/slub.c 2007-05-04 02:40:13.000000000 -0700 @@ -209,6 +209,11 @@ static inline struct kmem_cache_node *ge #endif } +struct slab_ops default_slab_ops = { + NULL, + NULL +}; + /* * Object debugging */ @@ -809,8 +814,8 @@ static void setup_object(struct kmem_cac init_tracking(s, object); } - if (unlikely(s->ctor)) - s->ctor(object, s, SLAB_CTOR_CONSTRUCTOR); + if (unlikely(s->slab_ops->ctor)) + s->slab_ops->ctor(object, s, SLAB_CTOR_CONSTRUCTOR); } static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node) @@ -867,16 +872,18 @@ out: static void __free_slab(struct kmem_cache *s, struct page *page) { int pages = 1 << s->order; + void (*dtor)(void *, struct kmem_cache *, unsigned long) = + s->slab_ops->dtor; - if (unlikely(PageError(page) || s->dtor)) { + if (unlikely(PageError(page) || dtor)) { void *start = page_address(page); void *end = start + (pages << PAGE_SHIFT); void *p; slab_pad_check(s, page); for (p = start; p <= end - s->size; p += s->size) { - if (s->dtor) - s->dtor(p, s, 0); + if (dtor) + dtor(p, s, 0); check_object(s, page, p, 0); } } @@ -1618,7 +1625,7 @@ static int calculate_sizes(struct kmem_c * then we should never poison the object itself. 
*/ if ((flags & SLAB_POISON) && !(flags & SLAB_DESTROY_BY_RCU) && - !s->ctor && !s->dtor) + s->slab_ops->ctor && !s->slab_ops->dtor) s->flags |= __OBJECT_POISON; else s->flags &= ~__OBJECT_POISON; @@ -1647,7 +1654,7 @@ static int calculate_sizes(struct kmem_c s->inuse = size; if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) || - s->ctor || s->dtor)) { + s->slab_ops->ctor || s->slab_ops->dtor)) { /* * Relocate free pointer after the object if it is not * permitted to overwrite the first word of the object on @@ -1731,13 +1738,11 @@ static int __init finish_bootstrap(void) static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags, const char *name, size_t size, size_t align, unsigned long flags, - void (*ctor)(void *, struct kmem_cache *, unsigned long), - void (*dtor)(void *, struct kmem_cache *, unsigned long)) + struct slab_ops *slab_ops) { memset(s, 0, kmem_size); s->name = name; - s->ctor = ctor; - s->dtor = dtor; + s->slab_ops = slab_ops; s->objsize = size; s->flags = flags; s->align = align; @@ -1757,7 +1762,7 @@ static int kmem_cache_open(struct kmem_c if (s->size >= 65535 * sizeof(void *)) { BUG_ON(flags & (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | SLAB_DESTROY_BY_RCU)); - BUG_ON(ctor || dtor); + BUG_ON(slab_ops->ctor || slab_ops->dtor); } else /* @@ -1992,7 +1997,7 @@ static struct kmem_cache *create_kmalloc down_write(&slub_lock); if (!kmem_cache_open(s, gfp_flags, name, size, ARCH_KMALLOC_MINALIGN, - flags, NULL, NULL)) + flags, &default_slab_ops)) goto panic; list_add(&s->list, &slab_caches); @@ -2313,23 +2318,21 @@ static int slab_unmergeable(struct kmem_ if (slub_nomerge || (s->flags & SLUB_NEVER_MERGE)) return 1; - if (s->ctor || s->dtor) + if (s->slab_ops != &default_slab_ops) return 1; return 0; } static struct kmem_cache *find_mergeable(size_t size, - size_t align, unsigned long flags, - void (*ctor)(void *, struct kmem_cache *, unsigned long), - void (*dtor)(void *, struct kmem_cache *, unsigned long)) + size_t align, unsigned long flags, struct slab_ops *slab_ops) { struct list_head *h; if (slub_nomerge || (flags & SLUB_NEVER_MERGE)) return NULL; - if (ctor || dtor) + if (slab_ops != &default_slab_ops) return NULL; size = ALIGN(size, sizeof(void *)); @@ -2364,15 +2367,17 @@ static struct kmem_cache *find_mergeable return NULL; } -struct kmem_cache *kmem_cache_create(const char *name, size_t size, +struct kmem_cache *__kmem_cache_create(const char *name, size_t size, size_t align, unsigned long flags, - void (*ctor)(void *, struct kmem_cache *, unsigned long), - void (*dtor)(void *, struct kmem_cache *, unsigned long)) + struct slab_ops *slab_ops) { struct kmem_cache *s; + if (!slab_ops) + slab_ops = &default_slab_ops; + down_write(&slub_lock); - s = find_mergeable(size, align, flags, dtor, ctor); + s = find_mergeable(size, align, flags, slab_ops); if (s) { s->refcount++; /* @@ -2386,7 +2391,7 @@ struct kmem_cache *kmem_cache_create(con } else { s = kmalloc(kmem_size, GFP_KERNEL); if (s && kmem_cache_open(s, GFP_KERNEL, name, - size, align, flags, ctor, dtor)) { + size, align, flags, slab_ops)) { if (sysfs_slab_add(s)) { kfree(s); goto err; @@ -2406,7 +2411,7 @@ err: s = NULL; return s; } -EXPORT_SYMBOL(kmem_cache_create); +EXPORT_SYMBOL(__kmem_cache_create); void *kmem_cache_zalloc(struct kmem_cache *s, gfp_t flags) { @@ -2961,28 +2966,6 @@ static ssize_t order_show(struct kmem_ca } SLAB_ATTR_RO(order); -static ssize_t ctor_show(struct kmem_cache *s, char *buf) -{ - if (s->ctor) { - int n = sprint_symbol(buf, (unsigned long)s->ctor); - - return n + 
sprintf(buf + n, "\n"); - } - return 0; -} -SLAB_ATTR_RO(ctor); - -static ssize_t dtor_show(struct kmem_cache *s, char *buf) -{ - if (s->dtor) { - int n = sprint_symbol(buf, (unsigned long)s->dtor); - - return n + sprintf(buf + n, "\n"); - } - return 0; -} -SLAB_ATTR_RO(dtor); - static ssize_t aliases_show(struct kmem_cache *s, char *buf) { return sprintf(buf, "%d\n", s->refcount - 1); @@ -3213,8 +3196,6 @@ static struct attribute * slab_attrs[] = &slabs_attr.attr, &partial_attr.attr, &cpu_slabs_attr.attr, - &ctor_attr.attr, - &dtor_attr.attr, &aliases_attr.attr, &align_attr.attr, &sanity_checks_attr.attr, Index: slub/include/linux/slub_def.h =================================================================== --- slub.orig/include/linux/slub_def.h 2007-05-04 02:23:51.000000000 -0700 +++ slub/include/linux/slub_def.h 2007-05-04 02:24:27.000000000 -0700 @@ -39,8 +39,7 @@ struct kmem_cache { /* Allocation and freeing of slabs */ int objects; /* Number of objects in slab */ int refcount; /* Refcount for slab cache destroy */ - void (*ctor)(void *, struct kmem_cache *, unsigned long); - void (*dtor)(void *, struct kmem_cache *, unsigned long); + struct slab_ops *slab_ops; int inuse; /* Offset to metadata */ int align; /* Alignment */ const char *name; /* Name (only for display!) */ -- From clameter@sgi.com Fri May 4 15:05:20 2007 Message-Id: <20070504220519.978854977@sgi.com> References: <20070504215839.290346570@sgi.com> User-Agent: quilt/0.46-1 Date: Fri, 04 May 2007 14:58:41 -0700 From: clameter@sgi.com To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, dgc@sgi.com, Dipankar Sarma , Eric Dumazet , Mel Gorman Subject: [patch 2/3] SLUB: Implement targeted reclaim and partial list defragmentation Content-Disposition: inline; filename=reclaim_callback Targeted reclaim allows to target a single slab for reclaim. This is done by calling kmem_cache_vacate(slab, page); It will return 1 on success, 0 if the operation failed. The vacate functionality is also used for slab shrinking. During the shrink operation SLUB will generate a list sorted by the number of objects in use. We extract pages off that list that are only filled less than a quarter. These objects are then processed using kmem_cache_vacate. In order for a slabcache to support this functionality two functions must be defined via slab_operations. get_reference(void *) Must obtain a reference to the object if it has not been freed yet. It is up to the slabcache to resolve the race. SLUB guarantees that the objects is still allocated. However, another thread may be blocked in slab_free attempting to free the same object. It may succeed as soon as get_reference() returns to the slab allocator. get_reference() processing must recognize this situation (i.e. check refcount for zero) and fail in such a sitation (no problem since the object will be freed as soon we drop the slab lock before doing kick calls). No slab operations may be performed in get_reference(). The slab lock for the page with the object is taken. Any slab operations may lead to a deadlock. 2. kick_object(void *) After SLUB has established references to the remaining objects in a slab it will drop all locks and then use kick_object on each of the objects for which we obtained a reference. The existence of the objects is guaranteed by virtue of the earlier obtained reference. The callback may perform any slab operation since no locks are held at the time of call. The callback should remove the object from the slab in some way. 
This may be accomplished by reclaiming the object and then running kmem_cache_free() or reallocating it and then running kmem_cache_free(). Reallocation is advantageous at this point because it will then allocate from the partial slabs with the most objects because we have just finished slab shrinking. NOTE: This patch is for conceptual review. I'd appreciate any feedback especially on the locking approach taken here. It will be critical to resolve the locking issue for this approach to become feasable. Signed-off-by: Christoph Lameter --- include/linux/slab.h | 3 mm/slub.c | 159 ++++++++++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 154 insertions(+), 8 deletions(-) Index: slub/include/linux/slab.h =================================================================== --- slub.orig/include/linux/slab.h 2007-05-04 13:32:34.000000000 -0700 +++ slub/include/linux/slab.h 2007-05-04 13:32:50.000000000 -0700 @@ -42,6 +42,8 @@ struct slab_ops { void (*ctor)(void *, struct kmem_cache *, unsigned long); /* FIXME: Remove all destructors ? */ void (*dtor)(void *, struct kmem_cache *, unsigned long); + int (*get_reference)(void *); + void (*kick_object)(void *); }; struct kmem_cache *__kmem_cache_create(const char *, size_t, size_t, @@ -54,6 +56,7 @@ void kmem_cache_free(struct kmem_cache * unsigned int kmem_cache_size(struct kmem_cache *); const char *kmem_cache_name(struct kmem_cache *); int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr); +int kmem_cache_vacate(struct page *); /* * Please use this macro to create slab caches. Simply specify the Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-04 13:32:34.000000000 -0700 +++ slub/mm/slub.c 2007-05-04 13:56:25.000000000 -0700 @@ -173,7 +173,7 @@ static struct notifier_block slab_notifi static enum { DOWN, /* No slab functionality available */ PARTIAL, /* kmem_cache_open() works but kmalloc does not */ - UP, /* Everything works */ + UP, /* Everything works but does not show up in sysfs */ SYSFS /* Sysfs up */ } slab_state = DOWN; @@ -211,6 +211,8 @@ static inline struct kmem_cache_node *ge struct slab_ops default_slab_ops = { NULL, + NULL, + NULL, NULL }; @@ -839,13 +841,10 @@ static struct page *new_slab(struct kmem n = get_node(s, page_to_nid(page)); if (n) atomic_long_inc(&n->nr_slabs); + page->offset = s->offset / sizeof(void *); page->slab = s; - page->flags |= 1 << PG_slab; - if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON | - SLAB_STORE_USER | SLAB_TRACE)) - page->flags |= 1 << PG_error; - + page->inuse = 0; start = page_address(page); end = start + s->objects * s->size; @@ -862,7 +861,17 @@ static struct page *new_slab(struct kmem set_freepointer(s, last, NULL); page->freelist = start; - page->inuse = 0; + + /* + * pages->inuse must be visible when PageSlab(page) becomes + * true for targeted reclaim + */ + smp_wmb(); + page->flags |= 1 << PG_slab; + if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON | + SLAB_STORE_USER | SLAB_TRACE)) + page->flags |= 1 << PG_error; + out: if (flags & __GFP_WAIT) local_irq_disable(); @@ -2124,6 +2133,111 @@ void kfree(const void *x) EXPORT_SYMBOL(kfree); /* + * Vacate all objects in the given slab. Slab must be locked. + * + * Will drop and regain and drop the slab lock. + * Slab must be marked PageActive() to avoid concurrent slab_free from + * remove the slab from the lists. At the end the slab will either + * be freed or have been returned to the partial lists. 
+ * + * Return error code or number of remaining objects + */ +static int __kmem_cache_vacate(struct kmem_cache *s, struct page *page) +{ + void *p; + void *addr = page_address(page); + unsigned long map[BITS_TO_LONGS(s->objects)]; + int leftover; + + if (!page->inuse) + return 0; + + /* Determine free objects */ + bitmap_zero(map, s->objects); + for(p = page->freelist; p; p = get_freepointer(s, p)) + set_bit((p - addr) / s->size, map); + + /* + * Get a refcount for all used objects. If that fails then + * no KICK callback can be performed. + */ + for(p = addr; p < addr + s->objects * s->size; p += s->size) + if (!test_bit((p - addr) / s->size, map)) + if (!s->slab_ops->get_reference(p)) + set_bit((p - addr) / s->size, map); + + /* Got all the references we need. Now we can drop the slab lock */ + slab_unlock(page); + + /* Perform the KICK callbacks to remove the objects */ + for(p = addr; p < addr + s->objects * s->size; p += s->size) + if (!test_bit((p - addr) / s->size, map)) + s->slab_ops->kick_object(p); + + slab_lock(page); + leftover = page->inuse; + ClearPageActive(page); + putback_slab(s, page); + return leftover; +} + +/* + * Remove a page from the lists. Must be holding slab lock. + */ +static void remove_from_lists(struct kmem_cache *s, struct page *page) +{ + if (page->inuse < s->objects) + remove_partial(s, page); + else if (s->flags & SLAB_STORE_USER) + remove_full(s, page); +} + +/* + * Attempt to free objects in a page. Return 1 when succesful. + */ +int kmem_cache_vacate(struct page *page) +{ + struct kmem_cache *s; + int rc = 0; + + /* Get a reference to the page. Return if its freed or being freed */ + if (!get_page_unless_zero(page)) + return 0; + + /* Check that this is truly a slab page */ + if (!PageSlab(page)) + goto out; + + slab_lock(page); + + /* + * We may now have locked a page that is in various stages of being + * freed. If the PageSlab bit is off then we have already reached + * the page allocator. If page->inuse is zero then we are + * in SLUB but freeing or allocating the page. + * page->inuse is never modified without the slab lock held. + * + * Also abort if the page happens to be a per cpu slab + */ + if (!PageSlab(page) || PageActive(page) || !page->inuse) { + slab_unlock(page); + goto out; + } + + /* + * We are holding a lock on a slab page that is not in the + * process of being allocated or freed. + */ + s = page->slab; + remove_from_lists(s, page); + SetPageActive(page); + rc = __kmem_cache_vacate(s, page) == 0; +out: + put_page(page); + return rc; +} + +/* * kmem_cache_shrink removes empty slabs from the partial lists * and then sorts the partially allocated slabs by the number * of items in use. 
The slabs with the most items in use @@ -2137,11 +2251,12 @@ int kmem_cache_shrink(struct kmem_cache int node; int i; struct kmem_cache_node *n; - struct page *page; + struct page *page, *page2; struct page *t; struct list_head *slabs_by_inuse = kmalloc(sizeof(struct list_head) * s->objects, GFP_KERNEL); unsigned long flags; + LIST_HEAD(zaplist); if (!slabs_by_inuse) return -ENOMEM; @@ -2194,8 +2309,36 @@ int kmem_cache_shrink(struct kmem_cache for (i = s->objects - 1; i >= 0; i--) list_splice(slabs_by_inuse + i, n->partial.prev); + if (!s->slab_ops->get_reference || !s->slab_ops->kick_object) + goto out; + + /* Take objects with just a few objects off the tail */ + while (n->nr_partial > MAX_PARTIAL) { + page = container_of(n->partial.prev, struct page, lru); + + /* + * We are holding the list_lock so we can only + * trylock the slab + */ + if (!slab_trylock(page)) + break; + + if (page->inuse > s->objects / 4) + break; + + list_move(&page->lru, &zaplist); + n->nr_partial--; + SetPageActive(page); + slab_unlock(page); + } out: spin_unlock_irqrestore(&n->list_lock, flags); + + /* Now we can free objects in the slabs on the zaplist */ + list_for_each_entry_safe(page, page2, &zaplist, lru) { + slab_lock(page); + __kmem_cache_vacate(s, page); + } } kfree(slabs_by_inuse); -- From clameter@sgi.com Fri May 4 15:05:20 2007 Message-Id: <20070504220520.163037547@sgi.com> References: <20070504215839.290346570@sgi.com> User-Agent: quilt/0.46-1 Date: Fri, 04 May 2007 14:58:42 -0700 From: clameter@sgi.com To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, dgc@sgi.com, Dipankar Sarma , Eric Dumazet , Mel Gorman Subject: [patch 3/3] Support targeted reclaim and slab defrag for dentry cache Content-Disposition: inline; filename=dcache_targetd_reclaim This is an experimental patch for locking review only. I am not that familiar with dentry cache locking. We setup the dcache cache a bit differently using the new APIs and define a get_reference and kick_object() function for the dentry cache. get_dentry_reference simply works by incrementing the dentry refcount if its not already zero. If it is zero then the slab called us while another processor is in the process of freeing the object. The other process will finish this free as soon as we return from this call. So we have to fail. kick_dentry_object() is called after get_dentry_reference() has been used and after the slab has dropped all of its own locks. Trying to use the dentry pruning here. Hope that is correct. Signed-off-by: Christoph Lameter --- fs/dcache.c | 48 +++++++++++++++++++++++++++++++++++++++--------- include/linux/fs.h | 2 +- 2 files changed, 40 insertions(+), 10 deletions(-) Index: slub/fs/dcache.c =================================================================== --- slub.orig/fs/dcache.c 2007-05-04 13:32:15.000000000 -0700 +++ slub/fs/dcache.c 2007-05-04 13:55:39.000000000 -0700 @@ -2133,17 +2133,48 @@ static void __init dcache_init_early(voi INIT_HLIST_HEAD(&dentry_hashtable[loop]); } +/* + * The slab is holding locks on the current slab. We can just + * get a reference + */ +int get_dentry_reference(void *private) +{ + struct dentry *dentry = private; + + return atomic_inc_not_zero(&dentry->d_count); +} + +/* + * Slab has dropped all the locks. Get rid of the + * refcount we obtained earlier and also rid of the + * object. 
+ */ +void kick_dentry_object(void *private) +{ + struct dentry *dentry = private; + + spin_lock(&dentry->d_lock); + if (atomic_read(&dentry->d_count) > 1) { + spin_unlock(&dentry->d_lock); + dput(dentry); + } + spin_lock(&dcache_lock); + prune_one_dentry(dentry, 1); + spin_unlock(&dcache_lock); +} + +struct slab_ops dentry_slab_ops = { + .get_reference = get_dentry_reference, + .kick_object = kick_dentry_object +}; + static void __init dcache_init(unsigned long mempages) { int loop; - /* - * A constructor could be added for stable state like the lists, - * but it is probably not worth it because of the cache nature - * of the dcache. - */ - dentry_cache = KMEM_CACHE(dentry, - SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD); + dentry_cache = KMEM_CACHE_OPS(dentry, + SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD, + &dentry_slab_ops); register_shrinker(&dcache_shrinker); @@ -2192,8 +2223,7 @@ void __init vfs_caches_init(unsigned lon names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); - filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, - SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); + filp_cachep = KMEM_CACHE(file, SLAB_PANIC); dcache_init(mempages); inode_init(mempages); Index: slub/include/linux/fs.h =================================================================== --- slub.orig/include/linux/fs.h 2007-05-04 13:32:15.000000000 -0700 +++ slub/include/linux/fs.h 2007-05-04 13:55:39.000000000 -0700 @@ -785,7 +785,7 @@ struct file { spinlock_t f_ep_lock; #endif /* #ifdef CONFIG_EPOLL */ struct address_space *f_mapping; -}; +} ____cacheline_aligned; extern spinlock_t files_lock; #define file_list_lock() spin_lock(&files_lock); #define file_list_unlock() spin_unlock(&files_lock); --
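
Finally, a caller-side sketch of the kmem_cache_vacate() interface added in
patch 2/3, as it might be used from antifragmentation / defragmentation
code. The function name try_to_empty_slab_page is made up; only the
int kmem_cache_vacate(struct page *) prototype comes from the series.

	/*
	 * Illustrative only: ask SLUB to clear a slab page so that the
	 * underlying page can be freed or migrated to the MOVABLE area.
	 */
	static int try_to_empty_slab_page(struct page *page)
	{
		/*
		 * kmem_cache_vacate() verifies that the page is still a
		 * slab page, takes the slab lock, obtains references to
		 * the remaining objects via get_reference() and then
		 * removes them via kick_object().
		 *
		 * Returns 1 if the page was vacated, 0 if the operation
		 * failed (e.g. the page is a per cpu slab, is being freed,
		 * or objects could not be removed).
		 */
		return kmem_cache_vacate(page);
	}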