GIT adec43f801ee39ce1db37211d44f3e9b4184e727 master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.16.git

commit adec43f801ee39ce1db37211d44f3e9b4184e727
Author: Herbert Xu
Date:   Tue Nov 22 14:49:54 2005 -0800

    [IPV4]: Safer reassembly

    Another spin of Herbert Xu's "safer ip reassembly" patch for 2.6.16.

    (The original patch is here:
    http://marc.theaimsgroup.com/?l=linux-netdev&m=112281936522415&w=2
    and my only contribution is to have tested it.)

    This patch (optionally) does additional checks before accepting IP
    fragments, which can greatly reduce the possibility of reassembling
    fragments which originated from different IP datagrams.

    Signed-off-by: Herbert Xu
    Signed-off-by: Arthur Kepner
    Signed-off-by: David S. Miller

commit 491b6a20c7c4861ad59aab1520f66a2d9869af05
Author: Harald Welte
Date:   Tue Nov 22 14:28:10 2005 -0800

    [NETFILTER] arp_tables: Fix bug introduced with NUMA aware allocation

    This fix shows how bad I am with copy & paste.

    Signed-off-by: Harald Welte
    Signed-off-by: David S. Miller

commit 065ffdd5195e2adf001a3c6d773d14f655c25144
Author: Bart De Schuymer
Date:   Tue Nov 22 14:26:55 2005 -0800

    [NETFILTER] ebtables: Support nf_log API from ebt_log and ebt_ulog

    This makes ebt_log and ebt_ulog use the new nf_log API. This enables
    the bridging packet filter to log packets e.g. via nfnetlink_log.

    Signed-off-by: Bart De Schuymer
    Signed-off-by: Harald Welte
    Signed-off-by: David S. Miller

commit 6d1273850cac8db77c462caaa145b813d93adc55
Author: Eric Dumazet
Date:   Fri Nov 18 16:00:15 2005 -0800

    [NETFILTER] ip_tables: NUMA-aware allocation

    Part of a performance problem with ip_tables is that its memory
    allocation is not NUMA aware, but 'only' SMP aware (i.e., each CPU
    normally touches separate cache lines). Even with small iptables
    rules, the cost of this misplacement can be high on common
    workloads.

    Instead of using one vmalloc() area (located in the node of the
    iptables process), we now allocate an area for each possible CPU,
    using vmalloc_node() so that memory should be allocated in the
    CPU's node if possible.

    Port to arp_tables and ip6_tables by Harald Welte.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

commit 1170fe6bda47b082acec7f58bd6928d62a19816a
Author: Stephen Hemminger
Date:   Fri Nov 18 15:56:13 2005 -0800

    [TCP] BIC: CUBIC window growth (2.0)

    Replace the existing BIC version 1.1 with the new version 2.0. The
    main change is to replace the window growth function with a cubic
    function as described in:
    http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/cubic-paper.pdf

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

commit ab799dd68908a32b3764adb109d539c0c2fedd74
Author: Stephen Hemminger
Date:   Fri Nov 18 15:55:01 2005 -0800

    [TCP] BIC: spelling and whitespace

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

commit 1c8fc8e5bafaedf8c7c243063919283f4fb1402f
Author: Stephen Hemminger
Date:   Fri Nov 18 15:54:29 2005 -0800

    [TCP] BIC: remove low utilization code

    The latest BICTCP patch at:
    http://www.csc.ncsu.edu:8080/faculty/rhee/export/bitcp/index_files/Page546.htm
    disables the low_utilization feature of BICTCP because it doesn't
    work in some cases. This patch removes it.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller
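For reference, the window growth curve adopted by the CUBIC commit above is
W(t) = C*(t - K)^3 + W_max, where W_max is the congestion window at the last
loss event and K = cbrt(W_max * b / C) is the time needed to regrow to W_max
after the window is reduced by a fraction b. Below is a minimal user-space
sketch of that curve in floating point; the constants (C = 0.4, b = 0.2) and
names are illustrative only, not the kernel's values. The tcp_bic.c diff at
the end of this series implements the same curve in fixed point (bic_scale =
41 corresponds roughly to C = 0.4, and times are scaled by 2^BICTCP_HZ).

/*
 * Sketch of the CUBIC growth function from the paper referenced in
 * commit 1170fe6b ("[TCP] BIC: CUBIC window growth (2.0)").
 * Floating point for clarity; build with: cc cubic.c -lm
 */
#include <math.h>
#include <stdio.h>

static double cubic_cwnd(double t, double w_max, double c, double b)
{
        /* K: seconds until the window regrows to w_max after a loss */
        double k = cbrt(w_max * b / c);

        return c * pow(t - k, 3.0) + w_max;
}

int main(void)
{
        const double w_max = 100.0;     /* cwnd at the last loss (packets) */
        const double c = 0.4;           /* scaling constant from the paper */
        const double b = 0.2;           /* fraction shed at a loss event */
        double t;

        /* Concave growth toward w_max, flat near t = K, convex probing after */
        for (t = 0.0; t <= 10.0; t += 1.0)
                printf("t=%4.1fs cwnd=%6.1f\n", t, cubic_cwnd(t, w_max, c, b));
        return 0;
}

At t = 0 this yields (1 - b) * W_max, i.e. the post-loss window, which is why
the plateau around K makes CUBIC stable near the previous saturation point.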
commit 03a5dc6766288ddc3256763cc74e9889f069247f
Author: Trent Jaeger
Date:   Fri Nov 18 12:52:19 2005 -0800

    [LSM-IPSec]: Per-packet access control.

    This patch series implements per-packet access control via the
    extension of the Linux Security Modules (LSM) interface by hooks in
    the XFRM and pfkey subsystems that leverage IPSec security
    associations to label packets. Extensions to the SELinux LSM are
    included that leverage the patch for this purpose.

    This patch implements the changes necessary to the SELinux LSM to
    create, deallocate, and use security contexts for policies
    (xfrm_policy) and security associations (xfrm_state) that enable
    control of a socket's ability to send and receive packets.

    Patch purpose:

    The patch is designed to enable the SELinux LSM to implement access
    control on individual packets based on the strongly authenticated
    IPSec security association. Such access controls augment the
    existing ones in SELinux based on network interface and IP address.
    The former are very coarse-grained, and the latter can be spoofed.
    By using IPSec, SELinux can control access to remote hosts based on
    cryptographic keys generated using the IPSec mechanism. This
    enables access control on a per-machine basis, or per-application
    if the remote machine is running the same mechanism and is trusted
    to enforce the access control policy.

    Patch design approach:

    The patch's main function is to authorize a socket's access to an
    IPSec policy based on their security contexts. Since the
    communication is implemented by a security association, the patch
    ensures that the security associations negotiated and used have the
    same security context. The patch enables allocation and
    deallocation of such security contexts for policies and security
    associations. It also enables copying of the security context when
    policies are cloned. Lastly, the patch ensures that packets that
    are sent without using an IPSec security association with a
    security context are allowed to be sent in that manner.

    A presentation available at
    www.selinux-symposium.org/2005/presentations/session2/2-3-jaeger.pdf
    from the SELinux symposium describes the overall approach.

    Patch implementation details:

    The function which authorizes a socket to perform a requested
    operation (send/receive) on an IPSec policy (xfrm_policy) is
    selinux_xfrm_policy_lookup. The Netfilter and rcv_skb hooks ensure
    that if an IPSec SA with a security context has not been used, then
    the socket is allowed to send or receive the packet, respectively.

    The patch implements the SELinux functions for allocating security
    contexts when policies (xfrm_policy) are created via the pfkey or
    xfrm_user interfaces via selinux_xfrm_policy_alloc. When a security
    association is built, SELinux allocates the security context
    designated by the XFRM subsystem, which is based on that of the
    authorized policy, via selinux_xfrm_state_alloc. When an
    xfrm_policy is cloned, the security context of that policy, if any,
    is copied to the clone via selinux_xfrm_policy_clone. When an
    xfrm_policy or xfrm_state is freed, its security context, if any,
    is also freed, at selinux_xfrm_policy_free or
    selinux_xfrm_state_free.

    Testing:

    The SELinux authorization function is tested using ipsec-tools. We
    created policies and security associations with particular security
    contexts and added SELinux access control policy entries to verify
    the authorization decision. We also made sure that packets for
    which no security context was supplied (which either did or did not
    use security associations) were authorized using an unlabelled
    context.

    Signed-off-by: Trent Jaeger
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller
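A note on the matching rule these two patches rely on: two security contexts
are considered equal when both are absent, or when both are present and agree
on SID, DOI, and algorithm. This is exactly the xfrm_sec_ctx_match() helper
added to include/net/xfrm.h in the diff below; here is a standalone user-space
rendering of the same predicate. The struct fields come from the patch's
include/linux/xfrm.h (the variable-length ctx_str[] is omitted), while the SID
values in main() are made up for illustration:

#include <stdint.h>
#include <stdio.h>

/* Field layout from struct xfrm_sec_ctx in the patch */
struct xfrm_sec_ctx {
        uint8_t  ctx_doi;       /* domain of interpretation, e.g. XFRM_SC_DOI_LSM */
        uint8_t  ctx_alg;       /* labeling module, e.g. XFRM_SC_ALG_SELINUX */
        uint16_t ctx_len;
        uint32_t ctx_sid;       /* SELinux security identifier */
};

/* If neither has a context --> match; otherwise both must have one
 * and the SID, DOI and algorithm must all agree. */
static int xfrm_sec_ctx_match(struct xfrm_sec_ctx *s1, struct xfrm_sec_ctx *s2)
{
        return (!s1 && !s2) ||
               (s1 && s2 &&
                s1->ctx_sid == s2->ctx_sid &&
                s1->ctx_doi == s2->ctx_doi &&
                s1->ctx_alg == s2->ctx_alg);
}

int main(void)
{
        struct xfrm_sec_ctx a = { .ctx_doi = 1, .ctx_alg = 1, .ctx_sid = 42 };
        struct xfrm_sec_ctx b = a;
        struct xfrm_sec_ctx c = { .ctx_doi = 1, .ctx_alg = 1, .ctx_sid = 7 };

        printf("unlabelled vs unlabelled: %d\n", xfrm_sec_ctx_match(NULL, NULL)); /* 1 */
        printf("identical contexts:       %d\n", xfrm_sec_ctx_match(&a, &b));     /* 1 */
        printf("different SIDs:           %d\n", xfrm_sec_ctx_match(&a, &c));     /* 0 */
        printf("labelled vs unlabelled:   %d\n", xfrm_sec_ctx_match(&a, NULL));   /* 0 */
        return 0;
}

Because unlabelled still matches unlabelled, traffic that carries no security
context keeps working as before, which is how the patches keep their impact
on non-IPSec and context-free IPSec traffic minimal.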
commit d0c763bedba25b659bc9667d394835471a6da07d
Author: Trent Jaeger
Date:   Fri Nov 18 12:52:00 2005 -0800

    [LSM-IPSec]: Security association restriction.

    This patch series implements per-packet access control via the
    extension of the Linux Security Modules (LSM) interface by hooks in
    the XFRM and pfkey subsystems that leverage IPSec security
    associations to label packets. Extensions to the SELinux LSM are
    included that leverage the patch for this purpose.

    This patch implements the changes necessary to the XFRM subsystem,
    pfkey interface, ipv4/ipv6, and xfrm_user interface to restrict a
    socket to use only authorized security associations (or no security
    association) to send/receive network packets.

    Patch purpose:

    The patch is designed to enable per-packet access control based on
    the strongly authenticated IPSec security association. Such access
    controls augment the existing ones based on network interface and
    IP address. The former are very coarse-grained, and the latter can
    be spoofed. By using IPSec, the system can control access to remote
    hosts based on cryptographic keys generated using the IPSec
    mechanism. This enables access control on a per-machine basis, or
    per-application if the remote machine is running the same mechanism
    and is trusted to enforce the access control policy.

    Patch design approach:

    The overall approach is that policy (xfrm_policy) entries set by
    user-level programs (e.g., setkey for ipsec-tools) are extended
    with a security context that is used at policy selection time in
    the XFRM subsystem to restrict the sockets that can send/receive
    packets via security associations (xfrm_states) that are built from
    those policies.

    A presentation available at
    www.selinux-symposium.org/2005/presentations/session2/2-3-jaeger.pdf
    from the SELinux symposium describes the overall approach.

    Patch implementation details:

    On output, the policy retrieved (via xfrm_policy_lookup or
    xfrm_sk_policy_lookup) must be authorized for the security context
    of the socket, and the same security context is required for the
    resultant security association (retrieved or negotiated via racoon
    in ipsec-tools). This is enforced in xfrm_state_find.

    On input, the policy retrieved must also be authorized for the
    socket (at __xfrm_policy_check), and the security context of the
    policy must also match the security association being used.

    The patch has virtually no impact on packets that do not use IPSec.
    The existing Netfilter (outgoing) and LSM rcv_skb hooks are used as
    before. Also, if IPSec is used without security contexts, the
    impact is minimal. The LSM must allow such policies to be selected
    for the combination of socket and remote machine, but subsequent
    IPSec processing proceeds as in the original case.

    Testing:

    The pfkey interface is tested using ipsec-tools. ipsec-tools has
    been modified (a separate ipsec-tools patch is available for
    version 0.5) to support assignment of xfrm_policy entries and
    security associations with security contexts via setkey and
    negotiation using those security contexts via racoon. The
    xfrm_user interface is tested via ad hoc programs that set security
    contexts. These programs are also available from me, and include
    programs for setting, getting, and deleting policy for testing this
    interface. Testing of the SA functions was done by tracing kernel
    behavior.

    Signed-off-by: Trent Jaeger
    Signed-off-by: Herbert Xu
    Signed-off-by: David S.
Miller --- diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index ebc09a1..2b7cf19 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -46,6 +46,29 @@ ipfrag_secret_interval - INTEGER for the hash secret) for IP fragments. Default: 600 +ipfrag_max_dist - INTEGER + ipfrag_max_dist is a non-negative integer value which defines the + maximum "disorder" which is allowed among fragments which share a + common IP source address. Note that reordering of packets is + not unusual, but if a large number of fragments arrive from a source + IP address while a particular fragment queue remains incomplete, it + probably indicates that one or more fragments belonging to that queue + have been lost. When ipfrag_max_dist is positive, an additional check + is done on fragments before they are added to a reassembly queue - if + ipfrag_max_dist (or more) fragments have arrived from a particular IP + address between additions to any IP fragment queue using that source + address, it's presumed that one or more fragments in the queue are + lost. The existing fragment queue will be dropped, and a new one + started. An ipfrag_max_dist value of zero disables this check. + + Using a very small value, e.g. 1 or 2, for ipfrag_max_dist can + result in unnecessarily dropping fragment queues when normal + reordering of packets occurs, which could lead to poor application + performance. Using a very large value, e.g. 50000, increases the + likelihood of incorrectly reassembling IP fragments that originate + from different IP datagrams, which could result in data corruption. + Default: 64 + INET peer storage: inet_peer_threshold - INTEGER diff --git a/include/linux/pfkeyv2.h b/include/linux/pfkeyv2.h index 7240667..6351c40 100644 --- a/include/linux/pfkeyv2.h +++ b/include/linux/pfkeyv2.h @@ -216,6 +216,16 @@ struct sadb_x_nat_t_port { } __attribute__((packed)); /* sizeof(struct sadb_x_nat_t_port) == 8 */ +/* Generic LSM security context */ +struct sadb_x_sec_ctx { + uint16_t sadb_x_sec_len; + uint16_t sadb_x_sec_exttype; + uint8_t sadb_x_ctx_alg; /* LSMs: e.g., selinux == 1 */ + uint8_t sadb_x_ctx_doi; + uint16_t sadb_x_ctx_len; +} __attribute__((packed)); +/* sizeof(struct sadb_sec_ctx) = 8 */ + /* Message types */ #define SADB_RESERVED 0 #define SADB_GETSPI 1 @@ -325,7 +335,8 @@ struct sadb_x_nat_t_port { #define SADB_X_EXT_NAT_T_SPORT 21 #define SADB_X_EXT_NAT_T_DPORT 22 #define SADB_X_EXT_NAT_T_OA 23 -#define SADB_EXT_MAX 23 +#define SADB_X_EXT_SEC_CTX 24 +#define SADB_EXT_MAX 24 /* Identity Extension values */ #define SADB_IDENTTYPE_RESERVED 0 diff --git a/include/linux/security.h b/include/linux/security.h index f7e0ae0..ef75365 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -59,6 +59,12 @@ struct sk_buff; struct sock; struct sockaddr; struct socket; +struct flowi; +struct dst_entry; +struct xfrm_selector; +struct xfrm_policy; +struct xfrm_state; +struct xfrm_user_sec_ctx; extern int cap_netlink_send(struct sock *sk, struct sk_buff *skb); extern int cap_netlink_recv(struct sk_buff *skb); @@ -788,6 +794,52 @@ struct swap_info_struct; * which is used to copy security attributes between local stream sockets. * @sk_free_security: * Deallocate security structure. + * @sk_getsid: + * Retrieve the LSM-specific sid for the sock to enable caching of network + * authorizations. + * + * Security hooks for XFRM operations. 
+ * + * @xfrm_policy_alloc_security: + * @xp contains the xfrm_policy being added to Security Policy Database + * used by the XFRM system. + * @sec_ctx contains the security context information being provided by + * the user-level policy update program (e.g., setkey). + * Allocate a security structure to the xp->selector.security field. + * The security field is initialized to NULL when the xfrm_policy is + * allocated. + * Return 0 if operation was successful (memory to allocate, legal context) + * @xfrm_policy_clone_security: + * @old contains an existing xfrm_policy in the SPD. + * @new contains a new xfrm_policy being cloned from old. + * Allocate a security structure to the new->selector.security field + * that contains the information from the old->selector.security field. + * Return 0 if operation was successful (memory to allocate). + * @xfrm_policy_free_security: + * @xp contains the xfrm_policy + * Deallocate xp->selector.security. + * @xfrm_state_alloc_security: + * @x contains the xfrm_state being added to the Security Association + * Database by the XFRM system. + * @sec_ctx contains the security context information being provided by + * the user-level SA generation program (e.g., setkey or racoon). + * Allocate a security structure to the x->sel.security field. The + * security field is initialized to NULL when the xfrm_state is + * allocated. + * Return 0 if operation was successful (memory to allocate, legal context). + * @xfrm_state_free_security: + * @x contains the xfrm_state. + * Deallocate x>sel.security. + * @xfrm_policy_lookup: + * @xp contains the xfrm_policy for which the access control is being + * checked. + * @sk_sid contains the sock security label that is used to authorize + * access to the policy xp. + * @dir contains the direction of the flow (input or output). + * Check permission when a sock selects a xfrm_policy for processing + * XFRMs on a packet. The hook is called when selecting either a + * per-socket policy or a generic xfrm policy. + * Return 0 if permission is granted. 
* * Security hooks affecting all Key Management operations * @@ -1237,8 +1289,18 @@ struct security_operations { int (*socket_getpeersec) (struct socket *sock, char __user *optval, int __user *optlen, unsigned len); int (*sk_alloc_security) (struct sock *sk, int family, gfp_t priority); void (*sk_free_security) (struct sock *sk); + unsigned int (*sk_getsid) (struct sock *sk, struct flowi *fl, u8 dir); #endif /* CONFIG_SECURITY_NETWORK */ +#ifdef CONFIG_SECURITY_NETWORK_XFRM + int (*xfrm_policy_alloc_security) (struct xfrm_policy *xp, struct xfrm_user_sec_ctx *sec_ctx); + int (*xfrm_policy_clone_security) (struct xfrm_policy *old, struct xfrm_policy *new); + void (*xfrm_policy_free_security) (struct xfrm_policy *xp); + int (*xfrm_state_alloc_security) (struct xfrm_state *x, struct xfrm_user_sec_ctx *sec_ctx); + void (*xfrm_state_free_security) (struct xfrm_state *x); + int (*xfrm_policy_lookup)(struct xfrm_policy *xp, u32 sk_sid, u8 dir); +#endif /* CONFIG_SECURITY_NETWORK_XFRM */ + /* key management security hooks */ #ifdef CONFIG_KEYS int (*key_alloc)(struct key *key); @@ -2679,6 +2741,11 @@ static inline void security_sk_free(stru { return security_ops->sk_free_security(sk); } + +static inline unsigned int security_sk_sid(struct sock *sk, struct flowi *fl, u8 dir) +{ + return security_ops->sk_getsid(sk, fl, dir); +} #else /* CONFIG_SECURITY_NETWORK */ static inline int security_unix_stream_connect(struct socket * sock, struct socket * other, @@ -2795,8 +2862,73 @@ static inline int security_sk_alloc(stru static inline void security_sk_free(struct sock *sk) { } + +static inline unsigned int security_sk_sid(struct sock *sk, struct flowi *fl, u8 dir) +{ + return 0; +} #endif /* CONFIG_SECURITY_NETWORK */ +#ifdef CONFIG_SECURITY_NETWORK_XFRM +static inline int security_xfrm_policy_alloc(struct xfrm_policy *xp, struct xfrm_user_sec_ctx *sec_ctx) +{ + return security_ops->xfrm_policy_alloc_security(xp, sec_ctx); +} + +static inline int security_xfrm_policy_clone(struct xfrm_policy *old, struct xfrm_policy *new) +{ + return security_ops->xfrm_policy_clone_security(old, new); +} + +static inline void security_xfrm_policy_free(struct xfrm_policy *xp) +{ + security_ops->xfrm_policy_free_security(xp); +} + +static inline int security_xfrm_state_alloc(struct xfrm_state *x, struct xfrm_user_sec_ctx *sec_ctx) +{ + return security_ops->xfrm_state_alloc_security(x, sec_ctx); +} + +static inline void security_xfrm_state_free(struct xfrm_state *x) +{ + security_ops->xfrm_state_free_security(x); +} + +static inline int security_xfrm_policy_lookup(struct xfrm_policy *xp, u32 sk_sid, u8 dir) +{ + return security_ops->xfrm_policy_lookup(xp, sk_sid, dir); +} +#else /* CONFIG_SECURITY_NETWORK_XFRM */ +static inline int security_xfrm_policy_alloc(struct xfrm_policy *xp, struct xfrm_user_sec_ctx *sec_ctx) +{ + return 0; +} + +static inline int security_xfrm_policy_clone(struct xfrm_policy *old, struct xfrm_policy *new) +{ + return 0; +} + +static inline void security_xfrm_policy_free(struct xfrm_policy *xp) +{ +} + +static inline int security_xfrm_state_alloc(struct xfrm_state *x, struct xfrm_user_sec_ctx *sec_ctx) +{ + return 0; +} + +static inline void security_xfrm_state_free(struct xfrm_state *x) +{ +} + +static inline int security_xfrm_policy_lookup(struct xfrm_policy *xp, u32 sk_sid, u8 dir) +{ + return 0; +} +#endif /* CONFIG_SECURITY_NETWORK_XFRM */ + #ifdef CONFIG_KEYS #ifdef CONFIG_SECURITY static inline int security_key_alloc(struct key *key) diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h 
index 6bc03c9..73dc03c 100644 --- a/include/linux/sysctl.h +++ b/include/linux/sysctl.h @@ -390,6 +390,7 @@ enum NET_IPV4_ICMP_ERRORS_USE_INBOUND_IFADDR=109, NET_TCP_CONG_CONTROL=110, NET_TCP_ABC=111, + NET_IPV4_IPFRAG_MAX_DIST=112, }; enum { diff --git a/include/linux/xfrm.h b/include/linux/xfrm.h index 0fb077d..82fbb75 100644 --- a/include/linux/xfrm.h +++ b/include/linux/xfrm.h @@ -27,6 +27,22 @@ struct xfrm_id __u8 proto; }; +struct xfrm_sec_ctx { + __u8 ctx_doi; + __u8 ctx_alg; + __u16 ctx_len; + __u32 ctx_sid; + char ctx_str[0]; +}; + +/* Security Context Domains of Interpretation */ +#define XFRM_SC_DOI_RESERVED 0 +#define XFRM_SC_DOI_LSM 1 + +/* Security Context Algorithms */ +#define XFRM_SC_ALG_RESERVED 0 +#define XFRM_SC_ALG_SELINUX 1 + /* Selector, used as selector both on policy rules (SPD) and SAs. */ struct xfrm_selector @@ -146,6 +162,18 @@ enum { #define XFRM_NR_MSGTYPES (XFRM_MSG_MAX + 1 - XFRM_MSG_BASE) +/* + * Generic LSM security context for comunicating to user space + * NOTE: Same format as sadb_x_sec_ctx + */ +struct xfrm_user_sec_ctx { + __u16 len; + __u16 exttype; + __u8 ctx_alg; /* LSMs: e.g., selinux == 1 */ + __u8 ctx_doi; + __u16 ctx_len; +}; + struct xfrm_user_tmpl { struct xfrm_id id; __u16 family; @@ -176,6 +204,7 @@ enum xfrm_attr_type_t { XFRMA_TMPL, /* 1 or more struct xfrm_user_tmpl */ XFRMA_SA, XFRMA_POLICY, + XFRMA_SEC_CTX, /* struct xfrm_sec_ctx */ __XFRMA_MAX #define XFRMA_MAX (__XFRMA_MAX - 1) diff --git a/include/net/flow.h b/include/net/flow.h index 9a5c94b..ec7eb86 100644 --- a/include/net/flow.h +++ b/include/net/flow.h @@ -84,11 +84,12 @@ struct flowi { #define FLOW_DIR_OUT 1 #define FLOW_DIR_FWD 2 -typedef void (*flow_resolve_t)(struct flowi *key, u16 family, u8 dir, +struct sock; +typedef void (*flow_resolve_t)(struct flowi *key, u32 sk_sid, u16 family, u8 dir, void **objp, atomic_t **obj_refp); -extern void *flow_cache_lookup(struct flowi *key, u16 family, u8 dir, - flow_resolve_t resolver); +extern void *flow_cache_lookup(struct flowi *key, u32 sk_sid, u16 family, u8 dir, + flow_resolve_t resolver); extern void flow_cache_flush(void); extern atomic_t flow_cache_genid; diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h index 7fda471..0965515 100644 --- a/include/net/inetpeer.h +++ b/include/net/inetpeer.h @@ -25,6 +25,7 @@ struct inet_peer __u32 v4daddr; /* peer's address */ __u16 avl_height; __u16 ip_id_count; /* IP ID for the next packet */ + atomic_t rid; /* Frag reception counter */ __u32 tcp_ts; unsigned long tcp_ts_stamp; }; diff --git a/include/net/ip.h b/include/net/ip.h index e4563bb..4d6294b 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -45,6 +45,7 @@ struct inet_skb_parm #define IPSKB_TRANSLATED 2 #define IPSKB_FORWARDED 4 #define IPSKB_XFRM_TUNNEL_SIZE 8 +#define IPSKB_FRAG_COMPLETE 16 }; struct ipcm_cookie @@ -168,6 +169,7 @@ extern int sysctl_ipfrag_high_thresh; extern int sysctl_ipfrag_low_thresh; extern int sysctl_ipfrag_time; extern int sysctl_ipfrag_secret_interval; +extern int sysctl_ipfrag_max_dist; /* From inetpeer.c */ extern int inet_peer_threshold; diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 5beae1c..f327cb3 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -144,6 +144,9 @@ struct xfrm_state * transformer. */ struct xfrm_type *type; + /* Security context */ + struct xfrm_sec_ctx *security; + /* Private data of this transformer, format is opaque, * interpreted by xfrm_type methods. 
*/ void *data; @@ -298,6 +301,7 @@ struct xfrm_policy __u8 flags; __u8 dead; __u8 xfrm_nr; + struct xfrm_sec_ctx *security; struct xfrm_tmpl xfrm_vec[XFRM_MAX_DEPTH]; }; @@ -510,6 +514,25 @@ xfrm_selector_match(struct xfrm_selector return 0; } +#ifdef CONFIG_SECURITY_NETWORK_XFRM +/* If neither has a context --> match + * Otherwise, both must have a context and the sids, doi, alg must match + */ +static inline int xfrm_sec_ctx_match(struct xfrm_sec_ctx *s1, struct xfrm_sec_ctx *s2) +{ + return ((!s1 && !s2) || + (s1 && s2 && + (s1->ctx_sid == s2->ctx_sid) && + (s1->ctx_doi == s2->ctx_doi) && + (s1->ctx_alg == s2->ctx_alg))); +} +#else +static inline int xfrm_sec_ctx_match(struct xfrm_sec_ctx *s1, struct xfrm_sec_ctx *s2) +{ + return 1; +} +#endif + /* A struct encoding bundle of transformations to apply to some set of flow. * * dst->child points to the next element of bundle. @@ -878,8 +901,8 @@ static inline int xfrm_dst_lookup(struct struct xfrm_policy *xfrm_policy_alloc(gfp_t gfp); extern int xfrm_policy_walk(int (*func)(struct xfrm_policy *, int, int, void*), void *); int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl); -struct xfrm_policy *xfrm_policy_bysel(int dir, struct xfrm_selector *sel, - int delete); +struct xfrm_policy *xfrm_policy_bysel_ctx(int dir, struct xfrm_selector *sel, + struct xfrm_sec_ctx *ctx, int delete); struct xfrm_policy *xfrm_policy_byid(int dir, u32 id, int delete); void xfrm_policy_flush(void); u32 xfrm_get_acqseq(void); diff --git a/net/bridge/netfilter/Kconfig b/net/bridge/netfilter/Kconfig index c70b3be..b84fc60 100644 --- a/net/bridge/netfilter/Kconfig +++ b/net/bridge/netfilter/Kconfig @@ -196,9 +196,13 @@ config BRIDGE_EBT_LOG To compile it as a module, choose M here. If unsure, say N. config BRIDGE_EBT_ULOG - tristate "ebt: ulog support" + tristate "ebt: ulog support (OBSOLETE)" depends on BRIDGE_NF_EBTABLES help + This option enables the old bridge-specific "ebt_ulog" implementation + which has been obsoleted by the new "nfnetlink_log" code (see + CONFIG_NETFILTER_NETLINK_LOG). + This option adds the ulog watcher, that you can use in any rule in any ebtables table. The packet is passed to a userspace logging daemon using netlink multicast sockets. This differs diff --git a/net/bridge/netfilter/ebt_log.c b/net/bridge/netfilter/ebt_log.c index 662975b..c436e6c 100644 --- a/net/bridge/netfilter/ebt_log.c +++ b/net/bridge/netfilter/ebt_log.c @@ -3,6 +3,7 @@ * * Authors: * Bart De Schuymer + * Harald Welte * * April, 2002 * @@ -10,6 +11,7 @@ #include #include +#include #include #include #include @@ -55,27 +57,30 @@ static void print_MAC(unsigned char *p) } #define myNIPQUAD(a) a[0], a[1], a[2], a[3] -static void ebt_log(const struct sk_buff *skb, unsigned int hooknr, - const struct net_device *in, const struct net_device *out, - const void *data, unsigned int datalen) +static void +ebt_log_packet(unsigned int pf, unsigned int hooknum, + const struct sk_buff *skb, const struct net_device *in, + const struct net_device *out, const struct nf_loginfo *loginfo, + const char *prefix) { - struct ebt_log_info *info = (struct ebt_log_info *)data; - char level_string[4] = "< >"; + unsigned int bitmask; - level_string[1] = '0' + info->loglevel; spin_lock_bh(&ebt_log_lock); - printk(level_string); - printk("%s IN=%s OUT=%s ", info->prefix, in ? in->name : "", - out ? out->name : ""); + printk("<%c>%s IN=%s OUT=%s MAC source = ", '0' + loginfo->u.log.level, + prefix, in ? in->name : "", out ? 
out->name : ""); - printk("MAC source = "); print_MAC(eth_hdr(skb)->h_source); printk("MAC dest = "); print_MAC(eth_hdr(skb)->h_dest); printk("proto = 0x%04x", ntohs(eth_hdr(skb)->h_proto)); - if ((info->bitmask & EBT_LOG_IP) && eth_hdr(skb)->h_proto == + if (loginfo->type == NF_LOG_TYPE_LOG) + bitmask = loginfo->u.log.logflags; + else + bitmask = NF_LOG_MASK; + + if ((bitmask & EBT_LOG_IP) && eth_hdr(skb)->h_proto == htons(ETH_P_IP)){ struct iphdr _iph, *ih; @@ -84,10 +89,9 @@ static void ebt_log(const struct sk_buff printk(" INCOMPLETE IP header"); goto out; } - printk(" IP SRC=%u.%u.%u.%u IP DST=%u.%u.%u.%u,", - NIPQUAD(ih->saddr), NIPQUAD(ih->daddr)); - printk(" IP tos=0x%02X, IP proto=%d", ih->tos, - ih->protocol); + printk(" IP SRC=%u.%u.%u.%u IP DST=%u.%u.%u.%u, IP " + "tos=0x%02X, IP proto=%d", NIPQUAD(ih->saddr), + NIPQUAD(ih->daddr), ih->tos, ih->protocol); if (ih->protocol == IPPROTO_TCP || ih->protocol == IPPROTO_UDP) { struct tcpudphdr _ports, *pptr; @@ -104,7 +108,7 @@ static void ebt_log(const struct sk_buff goto out; } - if ((info->bitmask & EBT_LOG_ARP) && + if ((bitmask & EBT_LOG_ARP) && ((eth_hdr(skb)->h_proto == htons(ETH_P_ARP)) || (eth_hdr(skb)->h_proto == htons(ETH_P_RARP)))) { struct arphdr _arph, *ah; @@ -144,6 +148,21 @@ static void ebt_log(const struct sk_buff out: printk("\n"); spin_unlock_bh(&ebt_log_lock); + +} + +static void ebt_log(const struct sk_buff *skb, unsigned int hooknr, + const struct net_device *in, const struct net_device *out, + const void *data, unsigned int datalen) +{ + struct ebt_log_info *info = (struct ebt_log_info *)data; + struct nf_loginfo li; + + li.type = NF_LOG_TYPE_LOG; + li.u.log.level = info->loglevel; + li.u.log.logflags = info->bitmask; + + nf_log_packet(PF_BRIDGE, hooknr, skb, in, out, &li, info->prefix); } static struct ebt_watcher log = @@ -154,13 +173,32 @@ static struct ebt_watcher log = .me = THIS_MODULE, }; +static struct nf_logger ebt_log_logger = { + .name = "ebt_log", + .logfn = &ebt_log_packet, + .me = THIS_MODULE, +}; + static int __init init(void) { - return ebt_register_watcher(&log); + int ret; + + ret = ebt_register_watcher(&log); + if (ret < 0) + return ret; + if (nf_log_register(PF_BRIDGE, &ebt_log_logger) < 0) { + printk(KERN_WARNING "ebt_log: not logging via system console " + "since somebody else already registered for PF_INET\n"); + /* we cannot make module load fail here, since otherwise + * ebtables userspace would abort */ + } + + return 0; } static void __exit fini(void) { + nf_log_unregister_logger(&ebt_log_logger); ebt_unregister_watcher(&log); } diff --git a/net/bridge/netfilter/ebt_ulog.c b/net/bridge/netfilter/ebt_ulog.c index aae26ae..ce617b3 100644 --- a/net/bridge/netfilter/ebt_ulog.c +++ b/net/bridge/netfilter/ebt_ulog.c @@ -3,6 +3,7 @@ * * Authors: * Bart De Schuymer + * Harald Welte * * November, 2004 * @@ -115,14 +116,13 @@ static struct sk_buff *ulog_alloc_skb(un return skb; } -static void ebt_ulog(const struct sk_buff *skb, unsigned int hooknr, +static void ebt_ulog_packet(unsigned int hooknr, const struct sk_buff *skb, const struct net_device *in, const struct net_device *out, - const void *data, unsigned int datalen) + const struct ebt_ulog_info *uloginfo, const char *prefix) { ebt_ulog_packet_msg_t *pm; size_t size, copy_len; struct nlmsghdr *nlh; - struct ebt_ulog_info *uloginfo = (struct ebt_ulog_info *)data; unsigned int group = uloginfo->nlgroup; ebt_ulog_buff_t *ub = &ulog_buffers[group]; spinlock_t *lock = &ub->lock; @@ -216,6 +216,39 @@ alloc_failure: goto unlock; } +/* this 
function is registered with the netfilter core */ +static void ebt_log_packet(unsigned int pf, unsigned int hooknum, + const struct sk_buff *skb, const struct net_device *in, + const struct net_device *out, const struct nf_loginfo *li, + const char *prefix) +{ + struct ebt_ulog_info loginfo; + + if (!li || li->type != NF_LOG_TYPE_ULOG) { + loginfo.nlgroup = EBT_ULOG_DEFAULT_NLGROUP; + loginfo.cprange = 0; + loginfo.qthreshold = EBT_ULOG_DEFAULT_QTHRESHOLD; + loginfo.prefix[0] = '\0'; + } else { + loginfo.nlgroup = li->u.ulog.group; + loginfo.cprange = li->u.ulog.copy_len; + loginfo.qthreshold = li->u.ulog.qthreshold; + strlcpy(loginfo.prefix, prefix, sizeof(loginfo.prefix)); + } + + ebt_ulog_packet(hooknum, skb, in, out, &loginfo, prefix); +} + +static void ebt_ulog(const struct sk_buff *skb, unsigned int hooknr, + const struct net_device *in, const struct net_device *out, + const void *data, unsigned int datalen) +{ + struct ebt_ulog_info *uloginfo = (struct ebt_ulog_info *)data; + + ebt_ulog_packet(hooknr, skb, in, out, uloginfo, NULL); +} + + static int ebt_ulog_check(const char *tablename, unsigned int hookmask, const struct ebt_entry *e, void *data, unsigned int datalen) { @@ -240,6 +273,12 @@ static struct ebt_watcher ulog = { .me = THIS_MODULE, }; +static struct nf_logger ebt_ulog_logger = { + .name = EBT_ULOG_WATCHER, + .logfn = &ebt_log_packet, + .me = THIS_MODULE, +}; + static int __init init(void) { int i, ret = 0; @@ -265,6 +304,13 @@ static int __init init(void) else if ((ret = ebt_register_watcher(&ulog))) sock_release(ebtulognl->sk_socket); + if (nf_log_register(PF_BRIDGE, &ebt_ulog_logger) < 0) { + printk(KERN_WARNING "ebt_ulog: not logging via ulog " + "since somebody else already registered for PF_BRIDGE\n"); + /* we cannot make module load fail here, since otherwise + * ebtables userspace would abort */ + } + return ret; } @@ -273,6 +319,7 @@ static void __exit fini(void) ebt_ulog_buff_t *ub; int i; + nf_log_unregister_logger(&ebt_ulog_logger); ebt_unregister_watcher(&ulog); for (i = 0; i < EBT_ULOG_MAXNLGROUPS; i++) { ub = &ulog_buffers[i]; diff --git a/net/core/flow.c b/net/core/flow.c index 7e95b39..c4f2538 100644 --- a/net/core/flow.c +++ b/net/core/flow.c @@ -23,6 +23,7 @@ #include #include #include +#include struct flow_cache_entry { struct flow_cache_entry *next; @@ -30,6 +31,7 @@ struct flow_cache_entry { u8 dir; struct flowi key; u32 genid; + u32 sk_sid; void *object; atomic_t *object_ref; }; @@ -162,7 +164,7 @@ static int flow_key_compare(struct flowi return 0; } -void *flow_cache_lookup(struct flowi *key, u16 family, u8 dir, +void *flow_cache_lookup(struct flowi *key, u32 sk_sid, u16 family, u8 dir, flow_resolve_t resolver) { struct flow_cache_entry *fle, **head; @@ -186,6 +188,7 @@ void *flow_cache_lookup(struct flowi *ke for (fle = *head; fle; fle = fle->next) { if (fle->family == family && fle->dir == dir && + fle->sk_sid == sk_sid && flow_key_compare(key, &fle->key) == 0) { if (fle->genid == atomic_read(&flow_cache_genid)) { void *ret = fle->object; @@ -210,6 +213,7 @@ void *flow_cache_lookup(struct flowi *ke *head = fle; fle->family = family; fle->dir = dir; + fle->sk_sid = sk_sid; memcpy(&fle->key, key, sizeof(*key)); fle->object = NULL; flow_count(cpu)++; @@ -221,7 +225,7 @@ nocache: void *obj; atomic_t *obj_ref; - resolver(key, family, dir, &obj, &obj_ref); + resolver(key, sk_sid, family, dir, &obj, &obj_ref); if (fle) { fle->genid = atomic_read(&flow_cache_genid); diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c index 2fc3fd3..ce5fe3f 100644 --- 
a/net/ipv4/inetpeer.c +++ b/net/ipv4/inetpeer.c @@ -401,6 +401,7 @@ struct inet_peer *inet_getpeer(__u32 dad return NULL; n->v4daddr = daddr; atomic_set(&n->refcnt, 1); + atomic_set(&n->rid, 0); n->ip_id_count = secure_ip_id(daddr); n->tcp_ts_stamp = 0; diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c index 8ce0ce2..ce2b70c 100644 --- a/net/ipv4/ip_fragment.c +++ b/net/ipv4/ip_fragment.c @@ -22,6 +22,7 @@ * Patrick McHardy : LRU queue of frag heads for evictor. */ +#include #include #include #include @@ -38,6 +39,7 @@ #include #include #include +#include #include #include #include @@ -56,6 +58,8 @@ int sysctl_ipfrag_high_thresh = 256*1024; int sysctl_ipfrag_low_thresh = 192*1024; +int sysctl_ipfrag_max_dist = 64; + /* Important NOTE! Fragment queue must be destroyed before MSL expires. * RFC791 is wrong proposing to prolongate timer each fragment arrival by TTL. */ @@ -89,8 +93,10 @@ struct ipq { spinlock_t lock; atomic_t refcnt; struct timer_list timer; /* when will this queue expire? */ - int iif; struct timeval stamp; + int iif; + unsigned int rid; + struct inet_peer *peer; }; /* Hash table. */ @@ -195,6 +201,9 @@ static void ip_frag_destroy(struct ipq * BUG_TRAP(qp->last_in&COMPLETE); BUG_TRAP(del_timer(&qp->timer) == 0); + if (qp->peer) + inet_putpeer(qp->peer); + /* Release all fragment data. */ fp = qp->fragments; while (fp) { @@ -353,6 +362,7 @@ static struct ipq *ip_frag_create(unsign qp->meat = 0; qp->fragments = NULL; qp->iif = 0; + qp->peer = sysctl_ipfrag_max_dist ? inet_getpeer(iph->saddr, 1) : NULL; /* Initialize a timer for this entry. */ init_timer(&qp->timer); @@ -398,6 +408,56 @@ static inline struct ipq *ip_find(struct return ip_frag_create(hash, iph, user); } +/* Is the fragment too far ahead to be part of ipq? */ +static inline int ip_frag_too_far(struct ipq *qp) +{ + struct inet_peer *peer = qp->peer; + unsigned int max = sysctl_ipfrag_max_dist; + unsigned int start, end; + + int rc; + + if (!peer || !max) + return 0; + + start = qp->rid; + end = atomic_inc_return(&peer->rid); + qp->rid = end; + + rc = qp->fragments && (end - start) > max; + + if (rc) { + IP_INC_STATS_BH(IPSTATS_MIB_REASMFAILS); + } + + return rc; +} + +static int ip_frag_reinit(struct ipq *qp) +{ + struct sk_buff *fp; + + if (!mod_timer(&qp->timer, jiffies + sysctl_ipfrag_time)) { + atomic_inc(&qp->refcnt); + return -ETIMEDOUT; + } + + fp = qp->fragments; + do { + struct sk_buff *xp = fp->next; + frag_kfree_skb(fp, NULL); + fp = xp; + } while (fp); + + qp->last_in = 0; + qp->len = 0; + qp->meat = 0; + qp->fragments = NULL; + qp->iif = 0; + + return 0; +} + /* Add new segment to existing queue. */ static void ip_frag_queue(struct ipq *qp, struct sk_buff *skb) { @@ -408,6 +468,12 @@ static void ip_frag_queue(struct ipq *qp if (qp->last_in & COMPLETE) goto err; + if (!(IPCB(skb)->flags & IPSKB_FRAG_COMPLETE) && + unlikely(ip_frag_too_far(qp)) && unlikely(ip_frag_reinit(qp))) { + ipq_kill(qp); + goto err; + } + offset = ntohs(skb->nh.iph->frag_off); flags = offset & ~IP_OFFSET; offset &= IP_OFFSET; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 11c2f68..a6c055a 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -445,6 +445,7 @@ int ip_fragment(struct sk_buff *skb, int hlen = iph->ihl * 4; mtu = dst_mtu(&rt->u.dst) - hlen; /* Size of data space */ + IPCB(skb)->flags |= IPSKB_FRAG_COMPLETE; /* When frag_list is given, use it. 
First, check its validity: * some transformers could create wrong frag_list or break existing diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index 3c2e963..f8be250 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -68,19 +68,14 @@ struct arpt_table_info { unsigned int initial_entries; unsigned int hook_entry[NF_ARP_NUMHOOKS]; unsigned int underflow[NF_ARP_NUMHOOKS]; - char entries[0] __attribute__((aligned(SMP_CACHE_BYTES))); + void *entries[NR_CPUS]; }; static LIST_HEAD(arpt_target); static LIST_HEAD(arpt_tables); +#define SET_COUNTER(c,b,p) do { (c).bcnt = (b); (c).pcnt = (p); } while(0) #define ADD_COUNTER(c,b,p) do { (c).bcnt += (b); (c).pcnt += (p); } while(0) -#ifdef CONFIG_SMP -#define TABLE_OFFSET(t,p) (SMP_ALIGN((t)->size)*(p)) -#else -#define TABLE_OFFSET(t,p) 0 -#endif - static inline int arp_devaddr_compare(const struct arpt_devaddr_info *ap, char *hdr_addr, int len) { @@ -269,9 +264,7 @@ unsigned int arpt_do_table(struct sk_buf outdev = out ? out->name : nulldevname; read_lock_bh(&table->lock); - table_base = (void *)table->private->entries - + TABLE_OFFSET(table->private, - smp_processor_id()); + table_base = (void *)table->private->entries[smp_processor_id()]; e = get_entry(table_base, table->private->hook_entry[hook]); back = get_entry(table_base, table->private->underflow[hook]); @@ -462,7 +455,8 @@ static inline int unconditional(const st /* Figures out from what hook each rule can be called: returns 0 if * there are loops. Puts hook bitmask in comefrom. */ -static int mark_source_chains(struct arpt_table_info *newinfo, unsigned int valid_hooks) +static int mark_source_chains(struct arpt_table_info *newinfo, + unsigned int valid_hooks, void *entry0) { unsigned int hook; @@ -472,7 +466,7 @@ static int mark_source_chains(struct arp for (hook = 0; hook < NF_ARP_NUMHOOKS; hook++) { unsigned int pos = newinfo->hook_entry[hook]; struct arpt_entry *e - = (struct arpt_entry *)(newinfo->entries + pos); + = (struct arpt_entry *)(entry0 + pos); if (!(valid_hooks & (1 << hook))) continue; @@ -514,13 +508,13 @@ static int mark_source_chains(struct arp goto next; e = (struct arpt_entry *) - (newinfo->entries + pos); + (entry0 + pos); } while (oldpos == pos + e->next_offset); /* Move along one */ size = e->next_offset; e = (struct arpt_entry *) - (newinfo->entries + pos + size); + (entry0 + pos + size); e->counters.pcnt = pos; pos += size; } else { @@ -537,7 +531,7 @@ static int mark_source_chains(struct arp newpos = pos + e->next_offset; } e = (struct arpt_entry *) - (newinfo->entries + newpos); + (entry0 + newpos); e->counters.pcnt = pos; pos = newpos; } @@ -689,6 +683,7 @@ static inline int cleanup_entry(struct a static int translate_table(const char *name, unsigned int valid_hooks, struct arpt_table_info *newinfo, + void *entry0, unsigned int size, unsigned int number, const unsigned int *hook_entries, @@ -713,8 +708,8 @@ static int translate_table(const char *n ret = ARPT_ENTRY_ITERATE(newinfo->entries, newinfo->size, check_entry_size_and_hooks, newinfo, - newinfo->entries, - newinfo->entries + size, + entry0, + entry0 + size, hook_entries, underflows, &i); duprintf("translate_table: ARPT_ENTRY_ITERATE gives %d\n", ret); if (ret != 0) @@ -743,29 +738,26 @@ static int translate_table(const char *n } } - if (!mark_source_chains(newinfo, valid_hooks)) { + if (!mark_source_chains(newinfo, valid_hooks, entry0)) { duprintf("Looping hook\n"); return -ELOOP; } /* Finally, each sanity check must pass */ i = 0; - ret = 
ARPT_ENTRY_ITERATE(newinfo->entries, newinfo->size, + ret = ARPT_ENTRY_ITERATE(entry0, newinfo->size, check_entry, name, size, &i); if (ret != 0) { - ARPT_ENTRY_ITERATE(newinfo->entries, newinfo->size, + ARPT_ENTRY_ITERATE(entry0, newinfo->size, cleanup_entry, &i); return ret; } /* And one copy for every other CPU */ for_each_cpu(i) { - if (i == 0) - continue; - memcpy(newinfo->entries + SMP_ALIGN(newinfo->size) * i, - newinfo->entries, - SMP_ALIGN(newinfo->size)); + if (newinfo->entries[i] && newinfo->entries[i] != entry0) + memcpy(newinfo->entries[i], entry0, newinfo->size); } return ret; @@ -807,20 +799,45 @@ static inline int add_entry_to_counter(c return 0; } +static inline int set_entry_to_counter(const struct arpt_entry *e, + struct arpt_counters total[], + unsigned int *i) +{ + SET_COUNTER(total[*i], e->counters.bcnt, e->counters.pcnt); + + (*i)++; + return 0; +} + static void get_counters(const struct arpt_table_info *t, struct arpt_counters counters[]) { unsigned int cpu; unsigned int i; + unsigned int curcpu = get_cpu(); + + /* Instead of clearing (by a previous call to memset()) + * the counters and using adds, we set the counters + * with data used by current CPU */ + + i = 0; + ARPT_ENTRY_ITERATE(t->entries[curcpu], + t->size, + set_entry_to_counter, + counters, + &i); for_each_cpu(cpu) { + if (cpu == curcpu) + continue; i = 0; - ARPT_ENTRY_ITERATE(t->entries + TABLE_OFFSET(t, cpu), + ARPT_ENTRY_ITERATE(t->entries[cpu], t->size, add_entry_to_counter, counters, &i); } + put_cpu(); } static int copy_entries_to_user(unsigned int total_size, @@ -831,6 +848,7 @@ static int copy_entries_to_user(unsigned struct arpt_entry *e; struct arpt_counters *counters; int ret = 0; + void *loc_cpu_entry; /* We need atomic snapshot of counters: rest doesn't change * (other than comefrom, which userspace doesn't care @@ -843,13 +861,13 @@ static int copy_entries_to_user(unsigned return -ENOMEM; /* First, sum counters... */ - memset(counters, 0, countersize); write_lock_bh(&table->lock); get_counters(table->private, counters); write_unlock_bh(&table->lock); - /* ... then copy entire thing from CPU 0... */ - if (copy_to_user(userptr, table->private->entries, total_size) != 0) { + loc_cpu_entry = table->private->entries[get_cpu()]; + /* ... then copy entire thing ... 
*/ + if (copy_to_user(userptr, loc_cpu_entry, total_size) != 0) { ret = -EFAULT; goto free_counters; } @@ -859,7 +877,7 @@ static int copy_entries_to_user(unsigned for (off = 0, num = 0; off < total_size; off += e->next_offset, num++){ struct arpt_entry_target *t; - e = (struct arpt_entry *)(table->private->entries + off); + e = (struct arpt_entry *)(loc_cpu_entry + off); if (copy_to_user(userptr + off + offsetof(struct arpt_entry, counters), &counters[num], @@ -880,6 +898,7 @@ static int copy_entries_to_user(unsigned } free_counters: + put_cpu(); vfree(counters); return ret; } @@ -911,6 +930,47 @@ static int get_entries(const struct arpt return ret; } +static void free_table_info(struct arpt_table_info *info) +{ + int cpu; + for_each_cpu(cpu) { + if (info->size <= PAGE_SIZE) + kfree(info->entries[cpu]); + else + vfree(info->entries[cpu]); + } + kfree(info); +} + +static struct arpt_table_info *alloc_table_info(unsigned int size) +{ + struct arpt_table_info *newinfo; + int cpu; + + newinfo = kzalloc(sizeof(struct arpt_table_info), GFP_KERNEL); + if (!newinfo) + return NULL; + + newinfo->size = size; + + for_each_cpu(cpu) { + if (size <= PAGE_SIZE) + newinfo->entries[cpu] = kmalloc_node(size, + GFP_KERNEL, + cpu_to_node(cpu)); + else + newinfo->entries[cpu] = vmalloc_node(size, + cpu_to_node(cpu)); + + if (newinfo->entries[cpu] == NULL) { + free_table_info(newinfo); + return NULL; + } + } + + return newinfo; +} + static int do_replace(void __user *user, unsigned int len) { int ret; @@ -918,6 +978,7 @@ static int do_replace(void __user *user, struct arpt_table *t; struct arpt_table_info *newinfo, *oldinfo; struct arpt_counters *counters; + void *loc_cpu_entry, *loc_cpu_old_entry; if (copy_from_user(&tmp, user, sizeof(tmp)) != 0) return -EFAULT; @@ -930,13 +991,13 @@ static int do_replace(void __user *user, if ((SMP_ALIGN(tmp.size) >> PAGE_SHIFT) + 2 > num_physpages) return -ENOMEM; - newinfo = vmalloc(sizeof(struct arpt_table_info) - + SMP_ALIGN(tmp.size) * - (highest_possible_processor_id()+1)); + newinfo = alloc_table_info(tmp.size); if (!newinfo) return -ENOMEM; - if (copy_from_user(newinfo->entries, user + sizeof(tmp), + /* choose the copy that is on our node/cpu */ + loc_cpu_entry = newinfo->entries[get_cpu()]; + if (copy_from_user(loc_cpu_entry, user + sizeof(tmp), tmp.size) != 0) { ret = -EFAULT; goto free_newinfo; @@ -947,10 +1008,9 @@ static int do_replace(void __user *user, ret = -ENOMEM; goto free_newinfo; } - memset(counters, 0, tmp.num_counters * sizeof(struct arpt_counters)); ret = translate_table(tmp.name, tmp.valid_hooks, - newinfo, tmp.size, tmp.num_entries, + newinfo, loc_cpu_entry, tmp.size, tmp.num_entries, tmp.hook_entry, tmp.underflow); if (ret != 0) goto free_newinfo_counters; @@ -989,8 +1049,10 @@ static int do_replace(void __user *user, /* Get the old counters. 
*/ get_counters(oldinfo, counters); /* Decrease module usage counts and free resource */ - ARPT_ENTRY_ITERATE(oldinfo->entries, oldinfo->size, cleanup_entry,NULL); - vfree(oldinfo); + loc_cpu_old_entry = oldinfo->entries[smp_processor_id()]; + ARPT_ENTRY_ITERATE(loc_cpu_old_entry, oldinfo->size, cleanup_entry,NULL); + put_cpu(); + free_table_info(oldinfo); if (copy_to_user(tmp.counters, counters, sizeof(struct arpt_counters) * tmp.num_counters) != 0) ret = -EFAULT; @@ -1002,11 +1064,12 @@ static int do_replace(void __user *user, module_put(t->me); up(&arpt_mutex); free_newinfo_counters_untrans: - ARPT_ENTRY_ITERATE(newinfo->entries, newinfo->size, cleanup_entry, NULL); + ARPT_ENTRY_ITERATE(loc_cpu_entry, newinfo->size, cleanup_entry, NULL); free_newinfo_counters: vfree(counters); free_newinfo: - vfree(newinfo); + put_cpu(); + free_table_info(newinfo); return ret; } @@ -1030,6 +1093,7 @@ static int do_add_counters(void __user * struct arpt_counters_info tmp, *paddc; struct arpt_table *t; int ret = 0; + void *loc_cpu_entry; if (copy_from_user(&tmp, user, sizeof(tmp)) != 0) return -EFAULT; @@ -1059,7 +1123,9 @@ static int do_add_counters(void __user * } i = 0; - ARPT_ENTRY_ITERATE(t->private->entries, + /* Choose the copy that is on our node */ + loc_cpu_entry = t->private->entries[smp_processor_id()]; + ARPT_ENTRY_ITERATE(loc_cpu_entry, t->private->size, add_counter_to_entry, paddc->counters, @@ -1220,30 +1286,33 @@ int arpt_register_table(struct arpt_tabl struct arpt_table_info *newinfo; static struct arpt_table_info bootstrap = { 0, 0, 0, { 0 }, { 0 }, { } }; + void *loc_cpu_entry; - newinfo = vmalloc(sizeof(struct arpt_table_info) - + SMP_ALIGN(repl->size) * - (highest_possible_processor_id()+1)); + newinfo = alloc_table_info(repl->size); if (!newinfo) { ret = -ENOMEM; return ret; } - memcpy(newinfo->entries, repl->entries, repl->size); + + /* choose the copy on our node/cpu */ + loc_cpu_entry = newinfo->entries[get_cpu()]; + memcpy(loc_cpu_entry, repl->entries, repl->size); ret = translate_table(table->name, table->valid_hooks, - newinfo, repl->size, + newinfo, loc_cpu_entry, repl->size, repl->num_entries, repl->hook_entry, repl->underflow); + put_cpu(); duprintf("arpt_register_table: translate table gives %d\n", ret); if (ret != 0) { - vfree(newinfo); + free_table_info(newinfo); return ret; } ret = down_interruptible(&arpt_mutex); if (ret != 0) { - vfree(newinfo); + free_table_info(newinfo); return ret; } @@ -1272,20 +1341,24 @@ int arpt_register_table(struct arpt_tabl return ret; free_unlock: - vfree(newinfo); + free_table_info(newinfo); goto unlock; } void arpt_unregister_table(struct arpt_table *table) { + void *loc_cpu_entry; + down(&arpt_mutex); LIST_DELETE(&arpt_tables, table); up(&arpt_mutex); /* Decrease module usage counts and free resources */ - ARPT_ENTRY_ITERATE(table->private->entries, table->private->size, + loc_cpu_entry = table->private->entries[get_cpu()]; + ARPT_ENTRY_ITERATE(loc_cpu_entry, table->private->size, cleanup_entry, NULL); - vfree(table->private); + put_cpu(); + free_table_info(table->private); } /* The built-in targets: standard (NULL) and error. */ diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index 75c27e9..091bc1c 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -83,11 +83,6 @@ static DECLARE_MUTEX(ipt_mutex); context stops packets coming through and allows user context to read the counters or update the rules. - To be cache friendly on SMP, we arrange them like so: - [ n-entries ] - ... 
cache-align padding ... - [ n-entries ] - Hence the start of any table is given by get_table() below. */ /* The table itself */ @@ -105,20 +100,15 @@ struct ipt_table_info unsigned int underflow[NF_IP_NUMHOOKS]; /* ipt_entry tables: one per CPU */ - char entries[0] ____cacheline_aligned; + void *entries[NR_CPUS]; }; static LIST_HEAD(ipt_target); static LIST_HEAD(ipt_match); static LIST_HEAD(ipt_tables); +#define SET_COUNTER(c,b,p) do { (c).bcnt = (b); (c).pcnt = (p); } while(0) #define ADD_COUNTER(c,b,p) do { (c).bcnt += (b); (c).pcnt += (p); } while(0) -#ifdef CONFIG_SMP -#define TABLE_OFFSET(t,p) (SMP_ALIGN((t)->size)*(p)) -#else -#define TABLE_OFFSET(t,p) 0 -#endif - #if 0 #define down(x) do { printk("DOWN:%u:" #x "\n", __LINE__); down(x); } while(0) #define down_interruptible(x) ({ int __r; printk("DOWNi:%u:" #x "\n", __LINE__); __r = down_interruptible(x); if (__r != 0) printk("ABORT-DOWNi:%u\n", __LINE__); __r; }) @@ -290,8 +280,7 @@ ipt_do_table(struct sk_buff **pskb, read_lock_bh(&table->lock); IP_NF_ASSERT(table->valid_hooks & (1 << hook)); - table_base = (void *)table->private->entries - + TABLE_OFFSET(table->private, smp_processor_id()); + table_base = (void *)table->private->entries[smp_processor_id()]; e = get_entry(table_base, table->private->hook_entry[hook]); #ifdef CONFIG_NETFILTER_DEBUG @@ -563,7 +552,8 @@ unconditional(const struct ipt_ip *ip) /* Figures out from what hook each rule can be called: returns 0 if there are loops. Puts hook bitmask in comefrom. */ static int -mark_source_chains(struct ipt_table_info *newinfo, unsigned int valid_hooks) +mark_source_chains(struct ipt_table_info *newinfo, + unsigned int valid_hooks, void *entry0) { unsigned int hook; @@ -572,7 +562,7 @@ mark_source_chains(struct ipt_table_info for (hook = 0; hook < NF_IP_NUMHOOKS; hook++) { unsigned int pos = newinfo->hook_entry[hook]; struct ipt_entry *e - = (struct ipt_entry *)(newinfo->entries + pos); + = (struct ipt_entry *)(entry0 + pos); if (!(valid_hooks & (1 << hook))) continue; @@ -622,13 +612,13 @@ mark_source_chains(struct ipt_table_info goto next; e = (struct ipt_entry *) - (newinfo->entries + pos); + (entry0 + pos); } while (oldpos == pos + e->next_offset); /* Move along one */ size = e->next_offset; e = (struct ipt_entry *) - (newinfo->entries + pos + size); + (entry0 + pos + size); e->counters.pcnt = pos; pos += size; } else { @@ -645,7 +635,7 @@ mark_source_chains(struct ipt_table_info newpos = pos + e->next_offset; } e = (struct ipt_entry *) - (newinfo->entries + newpos); + (entry0 + newpos); e->counters.pcnt = pos; pos = newpos; } @@ -855,6 +845,7 @@ static int translate_table(const char *name, unsigned int valid_hooks, struct ipt_table_info *newinfo, + void *entry0, unsigned int size, unsigned int number, const unsigned int *hook_entries, @@ -875,11 +866,11 @@ translate_table(const char *name, duprintf("translate_table: size %u\n", newinfo->size); i = 0; /* Walk through entries, checking offsets. 
*/ - ret = IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size, + ret = IPT_ENTRY_ITERATE(entry0, newinfo->size, check_entry_size_and_hooks, newinfo, - newinfo->entries, - newinfo->entries + size, + entry0, + entry0 + size, hook_entries, underflows, &i); if (ret != 0) return ret; @@ -907,27 +898,24 @@ translate_table(const char *name, } } - if (!mark_source_chains(newinfo, valid_hooks)) + if (!mark_source_chains(newinfo, valid_hooks, entry0)) return -ELOOP; /* Finally, each sanity check must pass */ i = 0; - ret = IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size, + ret = IPT_ENTRY_ITERATE(entry0, newinfo->size, check_entry, name, size, &i); if (ret != 0) { - IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size, + IPT_ENTRY_ITERATE(entry0, newinfo->size, cleanup_entry, &i); return ret; } /* And one copy for every other CPU */ for_each_cpu(i) { - if (i == 0) - continue; - memcpy(newinfo->entries + SMP_ALIGN(newinfo->size) * i, - newinfo->entries, - SMP_ALIGN(newinfo->size)); + if (newinfo->entries[i] && newinfo->entries[i] != entry0) + memcpy(newinfo->entries[i], entry0, newinfo->size); } return ret; @@ -943,15 +931,12 @@ replace_table(struct ipt_table *table, #ifdef CONFIG_NETFILTER_DEBUG { - struct ipt_entry *table_base; - unsigned int i; + int cpu; - for_each_cpu(i) { - table_base = - (void *)newinfo->entries - + TABLE_OFFSET(newinfo, i); - - table_base->comefrom = 0xdead57ac; + for_each_cpu(cpu) { + struct ipt_entry *table_base = newinfo->entries[cpu]; + if (table_base) + table_base->comefrom = 0xdead57ac; } } #endif @@ -986,21 +971,47 @@ add_entry_to_counter(const struct ipt_en return 0; } +static inline int +set_entry_to_counter(const struct ipt_entry *e, + struct ipt_counters total[], + unsigned int *i) +{ + SET_COUNTER(total[*i], e->counters.bcnt, e->counters.pcnt); + + (*i)++; + return 0; +} + static void get_counters(const struct ipt_table_info *t, struct ipt_counters counters[]) { unsigned int cpu; unsigned int i; + unsigned int curcpu = get_cpu(); + + /* Instead of clearing (by a previous call to memset()) + * the counters and using adds, we set the counters + * with data used by current CPU */ + + i = 0; + IPT_ENTRY_ITERATE(t->entries[curcpu], + t->size, + set_entry_to_counter, + counters, + &i); for_each_cpu(cpu) { + if (cpu == curcpu) + continue; i = 0; - IPT_ENTRY_ITERATE(t->entries + TABLE_OFFSET(t, cpu), + IPT_ENTRY_ITERATE(t->entries[cpu], t->size, add_entry_to_counter, counters, &i); } + put_cpu(); } static int @@ -1012,6 +1023,7 @@ copy_entries_to_user(unsigned int total_ struct ipt_entry *e; struct ipt_counters *counters; int ret = 0; + void *loc_cpu_entry; /* We need atomic snapshot of counters: rest doesn't change (other than comefrom, which userspace doesn't care @@ -1023,13 +1035,14 @@ copy_entries_to_user(unsigned int total_ return -ENOMEM; /* First, sum counters... */ - memset(counters, 0, countersize); write_lock_bh(&table->lock); get_counters(table->private, counters); write_unlock_bh(&table->lock); - /* ... then copy entire thing from CPU 0... */ - if (copy_to_user(userptr, table->private->entries, total_size) != 0) { + /* choose the copy that is on our node/cpu, ... */ + loc_cpu_entry = table->private->entries[get_cpu()]; + /* ... then copy entire thing ... 
*/ + if (copy_to_user(userptr, loc_cpu_entry, total_size) != 0) { ret = -EFAULT; goto free_counters; } @@ -1041,7 +1054,7 @@ copy_entries_to_user(unsigned int total_ struct ipt_entry_match *m; struct ipt_entry_target *t; - e = (struct ipt_entry *)(table->private->entries + off); + e = (struct ipt_entry *)(loc_cpu_entry + off); if (copy_to_user(userptr + off + offsetof(struct ipt_entry, counters), &counters[num], @@ -1078,6 +1091,7 @@ copy_entries_to_user(unsigned int total_ } free_counters: + put_cpu(); vfree(counters); return ret; } @@ -1110,6 +1124,45 @@ get_entries(const struct ipt_get_entries return ret; } +static void free_table_info(struct ipt_table_info *info) +{ + int cpu; + for_each_cpu(cpu) { + if (info->size <= PAGE_SIZE) + kfree(info->entries[cpu]); + else + vfree(info->entries[cpu]); + } + kfree(info); +} + +static struct ipt_table_info *alloc_table_info(unsigned int size) +{ + struct ipt_table_info *newinfo; + int cpu; + + newinfo = kzalloc(sizeof(struct ipt_table_info), GFP_KERNEL); + if (!newinfo) + return NULL; + + newinfo->size = size; + + for_each_cpu(cpu) { + if (size <= PAGE_SIZE) + newinfo->entries[cpu] = kmalloc_node(size, + GFP_KERNEL, + cpu_to_node(cpu)); + else + newinfo->entries[cpu] = vmalloc_node(size, cpu_to_node(cpu)); + if (newinfo->entries[cpu] == 0) { + free_table_info(newinfo); + return NULL; + } + } + + return newinfo; +} + static int do_replace(void __user *user, unsigned int len) { @@ -1118,6 +1171,7 @@ do_replace(void __user *user, unsigned i struct ipt_table *t; struct ipt_table_info *newinfo, *oldinfo; struct ipt_counters *counters; + void *loc_cpu_entry, *loc_cpu_old_entry; if (copy_from_user(&tmp, user, sizeof(tmp)) != 0) return -EFAULT; @@ -1130,13 +1184,13 @@ do_replace(void __user *user, unsigned i if ((SMP_ALIGN(tmp.size) >> PAGE_SHIFT) + 2 > num_physpages) return -ENOMEM; - newinfo = vmalloc(sizeof(struct ipt_table_info) - + SMP_ALIGN(tmp.size) * - (highest_possible_processor_id()+1)); + newinfo = alloc_table_info(tmp.size); if (!newinfo) return -ENOMEM; - if (copy_from_user(newinfo->entries, user + sizeof(tmp), + /* choose the copy that is our node/cpu */ + loc_cpu_entry = newinfo->entries[get_cpu()]; + if (copy_from_user(loc_cpu_entry, user + sizeof(tmp), tmp.size) != 0) { ret = -EFAULT; goto free_newinfo; @@ -1147,10 +1201,9 @@ do_replace(void __user *user, unsigned i ret = -ENOMEM; goto free_newinfo; } - memset(counters, 0, tmp.num_counters * sizeof(struct ipt_counters)); ret = translate_table(tmp.name, tmp.valid_hooks, - newinfo, tmp.size, tmp.num_entries, + newinfo, loc_cpu_entry, tmp.size, tmp.num_entries, tmp.hook_entry, tmp.underflow); if (ret != 0) goto free_newinfo_counters; @@ -1189,8 +1242,10 @@ do_replace(void __user *user, unsigned i /* Get the old counters. 
*/ get_counters(oldinfo, counters); /* Decrease module usage counts and free resource */ - IPT_ENTRY_ITERATE(oldinfo->entries, oldinfo->size, cleanup_entry,NULL); - vfree(oldinfo); + loc_cpu_old_entry = oldinfo->entries[smp_processor_id()]; + IPT_ENTRY_ITERATE(loc_cpu_old_entry, oldinfo->size, cleanup_entry,NULL); + put_cpu(); + free_table_info(oldinfo); if (copy_to_user(tmp.counters, counters, sizeof(struct ipt_counters) * tmp.num_counters) != 0) ret = -EFAULT; @@ -1202,11 +1257,12 @@ do_replace(void __user *user, unsigned i module_put(t->me); up(&ipt_mutex); free_newinfo_counters_untrans: - IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size, cleanup_entry,NULL); + IPT_ENTRY_ITERATE(loc_cpu_entry, newinfo->size, cleanup_entry,NULL); free_newinfo_counters: vfree(counters); free_newinfo: - vfree(newinfo); + put_cpu(); + free_table_info(newinfo); return ret; } @@ -1239,6 +1295,7 @@ do_add_counters(void __user *user, unsig struct ipt_counters_info tmp, *paddc; struct ipt_table *t; int ret = 0; + void *loc_cpu_entry; if (copy_from_user(&tmp, user, sizeof(tmp)) != 0) return -EFAULT; @@ -1268,7 +1325,9 @@ do_add_counters(void __user *user, unsig } i = 0; - IPT_ENTRY_ITERATE(t->private->entries, + /* Choose the copy that is on our node */ + loc_cpu_entry = t->private->entries[smp_processor_id()]; + IPT_ENTRY_ITERATE(loc_cpu_entry, t->private->size, add_counter_to_entry, paddc->counters, @@ -1460,28 +1519,30 @@ int ipt_register_table(struct ipt_table struct ipt_table_info *newinfo; static struct ipt_table_info bootstrap = { 0, 0, 0, { 0 }, { 0 }, { } }; + void *loc_cpu_entry; - newinfo = vmalloc(sizeof(struct ipt_table_info) - + SMP_ALIGN(repl->size) * - (highest_possible_processor_id()+1)); + newinfo = alloc_table_info(repl->size); if (!newinfo) return -ENOMEM; - memcpy(newinfo->entries, repl->entries, repl->size); + /* choose the copy on our node/cpu */ + loc_cpu_entry = newinfo->entries[get_cpu()]; + memcpy(loc_cpu_entry, repl->entries, repl->size); ret = translate_table(table->name, table->valid_hooks, - newinfo, repl->size, + newinfo, loc_cpu_entry, repl->size, repl->num_entries, repl->hook_entry, repl->underflow); + put_cpu(); if (ret != 0) { - vfree(newinfo); + free_table_info(newinfo); return ret; } ret = down_interruptible(&ipt_mutex); if (ret != 0) { - vfree(newinfo); + free_table_info(newinfo); return ret; } @@ -1510,20 +1571,24 @@ int ipt_register_table(struct ipt_table return ret; free_unlock: - vfree(newinfo); + free_table_info(newinfo); goto unlock; } void ipt_unregister_table(struct ipt_table *table) { + void *loc_cpu_entry; + down(&ipt_mutex); LIST_DELETE(&ipt_tables, table); up(&ipt_mutex); /* Decrease module usage counts and free resources */ - IPT_ENTRY_ITERATE(table->private->entries, table->private->size, + loc_cpu_entry = table->private->entries[get_cpu()]; + IPT_ENTRY_ITERATE(loc_cpu_entry, table->private->size, cleanup_entry, NULL); - vfree(table->private); + put_cpu(); + free_table_info(table->private); } /* Returns 1 if the port is matched by the range, 0 otherwise */ diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 01444a0..dbf8295 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -22,6 +22,7 @@ extern int sysctl_ip_nonlocal_bind; #ifdef CONFIG_SYSCTL +static int zero; static int tcp_retr1_max = 255; static int ip_local_port_range_min[] = { 1, 1 }; static int ip_local_port_range_max[] = { 65535, 65535 }; @@ -614,6 +615,15 @@ ctl_table ipv4_table[] = { .strategy = &sysctl_jiffies }, { + .ctl_name = 
NET_IPV4_IPFRAG_MAX_DIST, + .procname = "ipfrag_max_dist", + .data = &sysctl_ipfrag_max_dist, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec_minmax, + .extra1 = &zero + }, + { .ctl_name = NET_TCP_NO_METRICS_SAVE, .procname = "tcp_no_metrics_save", .data = &sysctl_tcp_nometrics_save, diff --git a/net/ipv4/tcp_bic.c b/net/ipv4/tcp_bic.c index 1d0cd86..f00b787 100644 --- a/net/ipv4/tcp_bic.c +++ b/net/ipv4/tcp_bic.c @@ -1,12 +1,12 @@ /* - * Binary Increase Congestion control for TCP + * Binary Increase Congestion control for TCP v2.0 * - * This is from the implementation of BICTCP in - * Lison-Xu, Kahaled Harfoush, and Injong Rhee. - * "Binary Increase Congestion Control for Fast, Long Distance - * Networks" in InfoComm 2004 + * This is from the implementation of CUBIC TCP in + * Injong Rhee, Lisong Xu. + * "CUBIC: A New TCP-Friendly High-Speed TCP Variant + * in PFLDnet 2005 * Available from: - * http://www.csc.ncsu.edu/faculty/rhee/export/bitcp.pdf + * http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/cubic-paper.pdf * * Unless BIC is enabled and congestion window is large * this behaves the same as the original Reno. @@ -25,32 +25,27 @@ * In binary search, * go to point (max+min)/N */ +#define BICTCP_HZ 10 /* BIC HZ 2^10 = 1024 */ static int fast_convergence = 1; static int max_increment = 16; -static int low_window = 14; static int beta = 819; /* = 819/1024 (BICTCP_BETA_SCALE) */ -static int low_utilization_threshold = 153; -static int low_utilization_period = 2; static int initial_ssthresh = 100; -static int smooth_part = 20; +static int bic_scale = 41; +static int tcp_friendliness = 1; module_param(fast_convergence, int, 0644); MODULE_PARM_DESC(fast_convergence, "turn on/off fast convergence"); module_param(max_increment, int, 0644); MODULE_PARM_DESC(max_increment, "Limit on increment allowed during binary search"); -module_param(low_window, int, 0644); -MODULE_PARM_DESC(low_window, "lower bound on congestion window (for TCP friendliness)"); module_param(beta, int, 0644); MODULE_PARM_DESC(beta, "beta for multiplicative increase"); -module_param(low_utilization_threshold, int, 0644); -MODULE_PARM_DESC(low_utilization_threshold, "percent (scaled by 1024) for low utilization mode"); -module_param(low_utilization_period, int, 0644); -MODULE_PARM_DESC(low_utilization_period, "if average delay exceeds then goto to low utilization mode (seconds)"); module_param(initial_ssthresh, int, 0644); MODULE_PARM_DESC(initial_ssthresh, "initial value of slow start threshold"); -module_param(smooth_part, int, 0644); -MODULE_PARM_DESC(smooth_part, "log(B/(B*Smin))/log(B/(B-1))+B, # of RTT from Wmax-B to Wmax"); +module_param(bic_scale, int, 0644); +MODULE_PARM_DESC(bic_scale, "scale (scaled by 1024) value for bic function (bic_scale/1024)"); +module_param(tcp_friendliness, int, 0644); +MODULE_PARM_DESC(tcp_friendliness, "turn on/off tcp friendliness"); /* BIC TCP Parameters */ @@ -60,12 +55,12 @@ struct bictcp { u32 loss_cwnd; /* congestion window at last loss */ u32 last_cwnd; /* the last snd_cwnd */ u32 last_time; /* time when updated last_cwnd */ + u32 bic_origin_point;/* origin point of bic function */ + u32 bic_K; /* time to origin point from the beginning of the current epoch */ u32 delay_min; /* min delay */ - u32 delay_max; /* max delay */ - u32 last_delay; - u8 low_utilization;/* 0: high; 1: low */ - u32 low_utilization_start; /* starting time of low utilization detection*/ u32 epoch_start; /* beginning of an epoch */ + u32 ack_cnt; /* number of acks */ + u32 tcp_cwnd; 
/* estimated tcp cwnd */ #define ACK_RATIO_SHIFT 4 u32 delayed_ack; /* estimate the ratio of Packets/ACKs << 4 */ }; @@ -77,13 +72,13 @@ static inline void bictcp_reset(struct b ca->loss_cwnd = 0; ca->last_cwnd = 0; ca->last_time = 0; + ca->bic_origin_point = 0; + ca->bic_K = 0; ca->delay_min = 0; - ca->delay_max = 0; - ca->last_delay = 0; - ca->low_utilization = 0; - ca->low_utilization_start = 0; ca->epoch_start = 0; ca->delayed_ack = 2 << ACK_RATIO_SHIFT; + ca->ack_cnt = 0; + ca->tcp_cwnd = 0; } static void bictcp_init(struct sock *sk) @@ -93,11 +88,131 @@ static void bictcp_init(struct sock *sk) tcp_sk(sk)->snd_ssthresh = initial_ssthresh; } +/* 65536 times the cubic root */ +static const u64 cubic_table[8] + = {0, 65536, 82570, 94519, 104030, 112063, 119087, 125367}; + +/* + * calculate the cubic root of x + * the basic idea is that x can be expressed as i*8^j + * so cubic_root(x) = cubic_root(i)*2^j + * in the following code, x is i, and y is 2^j + * because of integer calculation, there are errors in calculation + * so finally use binary search to find out the exact solution + */ +static u32 cubic_root(u64 x) +{ + u64 y, app, target, start, end, mid, start_diff, end_diff; + + if (x == 0) + return 0; + + target = x; + + /* first estimate lower and upper bound */ + y = 1; + while (x >= 8){ + x = (x >> 3); + y = (y << 1); + } + start = (y*cubic_table[x])>>16; + if (x==7) + end = (y<<1); + else + end = (y*cubic_table[x+1]+65535)>>16; + + /* binary search for more accurate one */ + while (start < end-1) { + mid = (start+end) >> 1; + app = mid*mid*mid; + if (app < target) + start = mid; + else if (app > target) + end = mid; + else + return mid; + } + + /* find the most accurate one from start and end */ + app = start*start*start; + if (app < target) + start_diff = target - app; + else + start_diff = app - target; + app = end*end*end; + if (app < target) + end_diff = target - app; + else + end_diff = app - target; + + if (start_diff < end_diff) + return (u32)start; + else + return (u32)end; +} + +static inline u32 bictcp_K(u32 dist, u32 srtt) +{ + u64 d64; + u32 d32; + u32 count; + u32 result; + + /* calculate the "K" for (wmax-cwnd) = c/rtt * K^3 + so K = cubic_root( (wmax-cwnd)*rtt/c ) + the unit of K is bictcp_HZ=2^10, not HZ + + c = bic_scale >> 10 + rtt = (tp->srtt >> 3 ) / HZ + + the following code has been designed and tested for + cwnd < 1 million packets + RTT < 100 seconds + HZ < 1,000,00 (corresponding to 10 nano-second) + + */ + + /* 1/c * 2^2*bictcp_HZ */ + d32 = (1 << (10+2*BICTCP_HZ)) / bic_scale; + d64 = (__u64)d32; + + /* srtt * 2^count / HZ + 1) to get a better accuracy of the following d32, + the larger the "count", the better the accuracy + 2) and avoid overflow of the following d64 + the larger the "count", the high possibility of overflow + 3) so find a "count" between bictcp_hz-3 and bictcp_hz + "count" may be less than bictcp_HZ, + then d64 becomes 0. that is OK + */ + d32 = srtt; + count = 0; + while (((d32 & 0x80000000)==0) && (count < BICTCP_HZ)){ + d32 = d32 << 1; + count++; + } + d32 = d32 / HZ; + + /* (wmax-cwnd) * (srtt>>3 / HZ) / c * 2^(3*bictcp_HZ) */ + d64 = (d64 * dist * d32) >> (count+3-BICTCP_HZ); + + /* cubic root */ + d64 = cubic_root(d64); + + result = (u32)d64; + return result; +} + /* * Compute congestion window to use. 
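To make the integer cube root concrete (illustration, not patch code): the first loop rewrites x as roughly i * 8^j with i < 8, using the identity cbrt(i * 8^j) = cbrt(i) * 2^j; cubic_table[] supplies cbrt(i) scaled by 65536, and the binary search repairs the truncation error. For x = 1000 the loop leaves i = 1 and y = 8, the table estimate brackets the answer between 8 and 11, and the search returns 10 (10^3 = 1000). The decomposition step in isolation:

    /*
     * What the first loop in cubic_root() does: strip factors of 8 from x
     * while doubling y, so that cbrt(original x) ~= cbrt(i) * y.
     * Low bits discarded by the shifts are what the binary search fixes up.
     */
    static void decompose(u64 x, u64 *i, u64 *y)
    {
            *y = 1;
            while (x >= 8) {
                    x >>= 3;        /* remove one factor of 8 ... */
                    *y <<= 1;       /* ... and account for its cube root */
            }
            *i = x;                 /* remainder 0..7, looked up in cubic_table[] */
    }
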
*/ static inline void bictcp_update(struct bictcp *ca, u32 cwnd) { + u64 d64; + u32 d32, t, srtt, bic_target, min_cnt, max_cnt; + + ca->ack_cnt++; /* count the number of ACKs */ + if (ca->last_cwnd == cwnd && (s32)(tcp_time_stamp - ca->last_time) <= HZ / 32) return; @@ -105,49 +220,88 @@ static inline void bictcp_update(struct ca->last_cwnd = cwnd; ca->last_time = tcp_time_stamp; - if (ca->epoch_start == 0) /* record the beginning of an epoch */ - ca->epoch_start = tcp_time_stamp; + srtt = (HZ << 3)/10; /* use real time-based growth function */ - /* start off normal */ - if (cwnd <= low_window) { - ca->cnt = cwnd; - return; + if (ca->epoch_start == 0) { + ca->epoch_start = tcp_time_stamp; /* record the beginning of an epoch */ + ca->ack_cnt = 1; /* start counting */ + ca->tcp_cwnd = cwnd; /* syn with cubic */ + + if (ca->last_max_cwnd <= cwnd) { + ca->bic_K = 0; + ca->bic_origin_point = cwnd; + } else { + ca->bic_K = bictcp_K(ca->last_max_cwnd-cwnd, srtt); + ca->bic_origin_point = ca->last_max_cwnd; + } } - /* binary increase */ - if (cwnd < ca->last_max_cwnd) { - __u32 dist = (ca->last_max_cwnd - cwnd) - / BICTCP_B; - - if (dist > max_increment) - /* linear increase */ - ca->cnt = cwnd / max_increment; - else if (dist <= 1U) - /* binary search increase */ - ca->cnt = (cwnd * smooth_part) / BICTCP_B; - else - /* binary search increase */ - ca->cnt = cwnd / dist; - } else { - /* slow start AMD linear increase */ - if (cwnd < ca->last_max_cwnd + BICTCP_B) - /* slow start */ - ca->cnt = (cwnd * smooth_part) / BICTCP_B; - else if (cwnd < ca->last_max_cwnd + max_increment*(BICTCP_B-1)) - /* slow start */ - ca->cnt = (cwnd * (BICTCP_B-1)) - / (cwnd - ca->last_max_cwnd); - else - /* linear increase */ - ca->cnt = cwnd / max_increment; + /* cubic function - calc*/ + /* calculate c * time^3 / rtt, + * while considering overflow in calculation of time^3 + * (so time^3 is done by using d64) + * and without the support of division of 64bit numbers + * (so all divisions are done by using d32) + * also NOTE the unit of those veriables + * time = (t - K) / 2^bictcp_HZ + * c = bic_scale >> 10 + * rtt = (srtt >> 3) / HZ + * !!! The following code does not have overflow problems, + * if the cwnd < 1 million packets !!! 
+ */ + + /* change the unit from HZ to bictcp_HZ */ + t = ((tcp_time_stamp + ca->delay_min - ca->epoch_start) + << BICTCP_HZ) / HZ; + + if (t < ca->bic_K) /* t - K */ + d32 = ca->bic_K - t; + else + d32 = t - ca->bic_K; + + d64 = (u64)d32; + d32 = (bic_scale << 3) * HZ / srtt; /* 1024*c/rtt */ + d64 = (d32 * d64 * d64 * d64) >> (10+3*BICTCP_HZ); /* c/rtt * (t-K)^3 */ + d32 = (u32)d64; + if (t < ca->bic_K) /* below origin*/ + bic_target = ca->bic_origin_point - d32; + else /* above origin*/ + bic_target = ca->bic_origin_point + d32; + + /* cubic function - calc bictcp_cnt*/ + if (bic_target > cwnd) { + ca->cnt = cwnd / (bic_target - cwnd); + } else { + ca->cnt = 100 * cwnd; /* very small increment*/ + } + + if (ca->delay_min > 0) { + /* max increment = Smax * rtt / 0.1 */ + min_cnt = (cwnd * HZ * 8)/(10 * max_increment * ca->delay_min); + if (ca->cnt < min_cnt) + ca->cnt = min_cnt; } - /* if in slow start or link utilization is very low */ - if ( ca->loss_cwnd == 0 || - (cwnd > ca->loss_cwnd && ca->low_utilization)) { - if (ca->cnt > 20) /* increase cwnd 5% per RTT */ - ca->cnt = 20; - } + /* slow start and low utilization */ + if (ca->loss_cwnd == 0) /* could be aggressive in slow start */ + ca->cnt = 50; + + /* TCP Friendly */ + if (tcp_friendliness) { + u32 scale = 8*(BICTCP_BETA_SCALE+beta)/3/(BICTCP_BETA_SCALE-beta); + d32 = (cwnd * scale) >> 3; + while (ca->ack_cnt > d32) { /* update tcp cwnd */ + ca->ack_cnt -= d32; + ca->tcp_cwnd++; + } + + if (ca->tcp_cwnd > cwnd){ /* if bic is slower than tcp */ + d32 = ca->tcp_cwnd - cwnd; + max_cnt = cwnd / d32; + if (ca->cnt > max_cnt) + ca->cnt = max_cnt; + } + } ca->cnt = (ca->cnt << ACK_RATIO_SHIFT) / ca->delayed_ack; if (ca->cnt == 0) /* cannot be zero */ @@ -155,58 +309,26 @@ static inline void bictcp_update(struct } -/* Detect low utilization in congestion avoidance */ -static inline void bictcp_low_utilization(struct sock *sk, int flag) +/* Keep track of minimum rtt */ +static inline void measure_delay(struct sock *sk) { const struct tcp_sock *tp = tcp_sk(sk); struct bictcp *ca = inet_csk_ca(sk); - u32 dist, delay; + u32 delay; /* No time stamp */ if (!(tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr) || /* Discard delay samples right after fast recovery */ - tcp_time_stamp < ca->epoch_start + HZ || - /* this delay samples may not be accurate */ - flag == 0) { - ca->last_delay = 0; - goto notlow; - } + (s32)(tcp_time_stamp - ca->epoch_start) < HZ) + return; - delay = ca->last_delay<<3; /* use the same scale as tp->srtt*/ - ca->last_delay = tcp_time_stamp - tp->rx_opt.rcv_tsecr; - if (delay == 0) /* no previous delay sample */ - goto notlow; + delay = tcp_time_stamp - tp->rx_opt.rcv_tsecr; + if (delay == 0) + delay = 1; /* first time call or link delay decreases */ - if (ca->delay_min == 0 || ca->delay_min > delay) { - ca->delay_min = ca->delay_max = delay; - goto notlow; - } - - if (ca->delay_max < delay) - ca->delay_max = delay; - - /* utilization is low, if avg delay < dist*threshold - for checking_period time */ - dist = ca->delay_max - ca->delay_min; - if (dist <= ca->delay_min>>6 || - tp->srtt - ca->delay_min >= (dist*low_utilization_threshold)>>10) - goto notlow; - - if (ca->low_utilization_start == 0) { - ca->low_utilization = 0; - ca->low_utilization_start = tcp_time_stamp; - } else if ((s32)(tcp_time_stamp - ca->low_utilization_start) - > low_utilization_period*HZ) { - ca->low_utilization = 1; - } - - return; - - notlow: - ca->low_utilization = 0; - ca->low_utilization_start = 0; - + if (ca->delay_min == 0 || ca->delay_min > 
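Stripped of the fixed-point scaling, the growth rule bictcp_update() now implements is the cubic window function: target = origin_point +/- (C/rtt) * (t - K)^3 with C = bic_scale/1024 (below the origin while t < K, above it afterwards), and cwnd is then nudged toward that target roughly once per cnt ACKs. A floating-point paraphrase for readability (illustration only; the kernel code has to stay in integer units of 2^BICTCP_HZ and avoid 64-bit divisions):

    /* Floating-point paraphrase of the fixed-point math above; not kernel
     * code.  't' and 'K' are in seconds here, not BICTCP_HZ units. */
    static double cubic_target(double t, double K, double origin,
                               double bic_scale, double rtt)
    {
            double c = bic_scale / 1024.0;          /* module parameter scale */

            return origin + (c / rtt) * (t - K) * (t - K) * (t - K);
    }

    /* Grow cwnd by one segment roughly every cnt ACKs. */
    static unsigned int cubic_cnt(double target, unsigned int cwnd)
    {
            if (target > cwnd)
                    return cwnd / (target - cwnd);
            return 100 * cwnd;                      /* at/above target: crawl */
    }
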
delay) + ca->delay_min = delay; } static void bictcp_cong_avoid(struct sock *sk, u32 ack, @@ -215,7 +337,8 @@ static void bictcp_cong_avoid(struct soc struct tcp_sock *tp = tcp_sk(sk); struct bictcp *ca = inet_csk_ca(sk); - bictcp_low_utilization(sk, data_acked); + if (data_acked) + measure_delay(sk); if (!tcp_is_cwnd_limited(sk, in_flight)) return; @@ -238,10 +361,6 @@ static void bictcp_cong_avoid(struct soc } -/* - * behave like Reno until low_window is reached, - * then increase congestion window slowly - */ static u32 bictcp_recalc_ssthresh(struct sock *sk) { const struct tcp_sock *tp = tcp_sk(sk); @@ -249,11 +368,6 @@ static u32 bictcp_recalc_ssthresh(struct ca->epoch_start = 0; /* end of epoch */ - /* in case of wrong delay_max*/ - if (ca->delay_min > 0 && ca->delay_max > ca->delay_min) - ca->delay_max = ca->delay_min - + ((ca->delay_max - ca->delay_min)* 90) / 100; - /* Wmax and fast convergence */ if (tp->snd_cwnd < ca->last_max_cwnd && fast_convergence) ca->last_max_cwnd = (tp->snd_cwnd * (BICTCP_BETA_SCALE + beta)) @@ -263,24 +377,19 @@ static u32 bictcp_recalc_ssthresh(struct ca->loss_cwnd = tp->snd_cwnd; - - if (tp->snd_cwnd <= low_window) - return max(tp->snd_cwnd >> 1U, 2U); - else - return max((tp->snd_cwnd * beta) / BICTCP_BETA_SCALE, 2U); + return max((tp->snd_cwnd * beta) / BICTCP_BETA_SCALE, 2U); } static u32 bictcp_undo_cwnd(struct sock *sk) { - const struct tcp_sock *tp = tcp_sk(sk); - const struct bictcp *ca = inet_csk_ca(sk); - return max(tp->snd_cwnd, ca->last_max_cwnd); + struct bictcp *ca = inet_csk_ca(sk); + + return max(tcp_sk(sk)->snd_cwnd, ca->last_max_cwnd); } static u32 bictcp_min_cwnd(struct sock *sk) { - const struct tcp_sock *tp = tcp_sk(sk); - return tp->snd_ssthresh; + return tcp_sk(sk)->snd_ssthresh; } static void bictcp_state(struct sock *sk, u8 new_state) @@ -289,14 +398,14 @@ static void bictcp_state(struct sock *sk bictcp_reset(inet_csk_ca(sk)); } -/* Track delayed acknowledgement ratio using sliding window +/* Track delayed acknowledgment ratio using sliding window * ratio = (15*ratio + sample) / 16 */ static void bictcp_acked(struct sock *sk, u32 cnt) { const struct inet_connection_sock *icsk = inet_csk(sk); - if (cnt > 0 && icsk->icsk_ca_state == TCP_CA_Open) { + if (cnt > 0 && icsk->icsk_ca_state == TCP_CA_Open) { struct bictcp *ca = inet_csk_ca(sk); cnt -= ca->delayed_ack >> ACK_RATIO_SHIFT; ca->delayed_ack += cnt; @@ -330,6 +439,7 @@ static void __exit bictcp_unregister(voi module_init(bictcp_register); module_exit(bictcp_unregister); -MODULE_AUTHOR("Stephen Hemminger"); +MODULE_AUTHOR("Sangtae Ha, Stephen Hemminger"); MODULE_LICENSE("GPL"); MODULE_DESCRIPTION("BIC TCP"); +MODULE_VERSION("2.0"); diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index 7d49222..0a216e0 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -86,11 +86,6 @@ static DECLARE_MUTEX(ip6t_mutex); context stops packets coming through and allows user context to read the counters or update the rules. - To be cache friendly on SMP, we arrange them like so: - [ n-entries ] - ... cache-align padding ... - [ n-entries ] - Hence the start of any table is given by get_table() below. 
*/ /* The table itself */ @@ -108,20 +103,15 @@ struct ip6t_table_info unsigned int underflow[NF_IP6_NUMHOOKS]; /* ip6t_entry tables: one per CPU */ - char entries[0] ____cacheline_aligned; + void *entries[NR_CPUS]; }; static LIST_HEAD(ip6t_target); static LIST_HEAD(ip6t_match); static LIST_HEAD(ip6t_tables); +#define SET_COUNTER(c,b,p) do { (c).bcnt = (b); (c).pcnt = (p); } while(0) #define ADD_COUNTER(c,b,p) do { (c).bcnt += (b); (c).pcnt += (p); } while(0) -#ifdef CONFIG_SMP -#define TABLE_OFFSET(t,p) (SMP_ALIGN((t)->size)*(p)) -#else -#define TABLE_OFFSET(t,p) 0 -#endif - #if 0 #define down(x) do { printk("DOWN:%u:" #x "\n", __LINE__); down(x); } while(0) #define down_interruptible(x) ({ int __r; printk("DOWNi:%u:" #x "\n", __LINE__); __r = down_interruptible(x); if (__r != 0) printk("ABORT-DOWNi:%u\n", __LINE__); __r; }) @@ -376,8 +366,7 @@ ip6t_do_table(struct sk_buff **pskb, read_lock_bh(&table->lock); IP_NF_ASSERT(table->valid_hooks & (1 << hook)); - table_base = (void *)table->private->entries - + TABLE_OFFSET(table->private, smp_processor_id()); + table_base = (void *)table->private->entries[smp_processor_id()]; e = get_entry(table_base, table->private->hook_entry[hook]); #ifdef CONFIG_NETFILTER_DEBUG @@ -649,7 +638,8 @@ unconditional(const struct ip6t_ip6 *ipv /* Figures out from what hook each rule can be called: returns 0 if there are loops. Puts hook bitmask in comefrom. */ static int -mark_source_chains(struct ip6t_table_info *newinfo, unsigned int valid_hooks) +mark_source_chains(struct ip6t_table_info *newinfo, + unsigned int valid_hooks, void *entry0) { unsigned int hook; @@ -658,7 +648,7 @@ mark_source_chains(struct ip6t_table_inf for (hook = 0; hook < NF_IP6_NUMHOOKS; hook++) { unsigned int pos = newinfo->hook_entry[hook]; struct ip6t_entry *e - = (struct ip6t_entry *)(newinfo->entries + pos); + = (struct ip6t_entry *)(entry0 + pos); if (!(valid_hooks & (1 << hook))) continue; @@ -708,13 +698,13 @@ mark_source_chains(struct ip6t_table_inf goto next; e = (struct ip6t_entry *) - (newinfo->entries + pos); + (entry0 + pos); } while (oldpos == pos + e->next_offset); /* Move along one */ size = e->next_offset; e = (struct ip6t_entry *) - (newinfo->entries + pos + size); + (entry0 + pos + size); e->counters.pcnt = pos; pos += size; } else { @@ -731,7 +721,7 @@ mark_source_chains(struct ip6t_table_inf newpos = pos + e->next_offset; } e = (struct ip6t_entry *) - (newinfo->entries + newpos); + (entry0 + newpos); e->counters.pcnt = pos; pos = newpos; } @@ -941,6 +931,7 @@ static int translate_table(const char *name, unsigned int valid_hooks, struct ip6t_table_info *newinfo, + void *entry0, unsigned int size, unsigned int number, const unsigned int *hook_entries, @@ -964,8 +955,8 @@ translate_table(const char *name, ret = IP6T_ENTRY_ITERATE(newinfo->entries, newinfo->size, check_entry_size_and_hooks, newinfo, - newinfo->entries, - newinfo->entries + size, + entry0, + entry0 + size, hook_entries, underflows, &i); if (ret != 0) return ret; @@ -993,27 +984,24 @@ translate_table(const char *name, } } - if (!mark_source_chains(newinfo, valid_hooks)) + if (!mark_source_chains(newinfo, valid_hooks, entry0)) return -ELOOP; /* Finally, each sanity check must pass */ i = 0; - ret = IP6T_ENTRY_ITERATE(newinfo->entries, newinfo->size, + ret = IP6T_ENTRY_ITERATE(entry0, newinfo->size, check_entry, name, size, &i); if (ret != 0) { - IP6T_ENTRY_ITERATE(newinfo->entries, newinfo->size, + IP6T_ENTRY_ITERATE(entry0, newinfo->size, cleanup_entry, &i); return ret; } /* And one copy for every other CPU 
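The hunk above is the structural core of the NUMA change for ip6_tables, mirroring the ip_tables change earlier in this patch: the single cacheline-aligned blob that was sliced per CPU by offset arithmetic becomes an array of per-CPU pointers. Schematically (field lists abbreviated, not the full structs):

    /* Before: one allocation, sliced per CPU by offset arithmetic. */
    struct old_table_info {
            unsigned int size;
            char entries[0] ____cacheline_aligned;  /* CPU n's copy begins at
                                                     * SMP_ALIGN(size) * n */
    };

    /* After: one allocation per possible CPU, placed on that CPU's node. */
    struct new_table_info {
            unsigned int size;
            void *entries[NR_CPUS];                 /* entries[n]: kmalloc'd or
                                                     * vmalloc'd by
                                                     * alloc_table_info() */
    };

    /* The packet-path access changes accordingly:
     *   old: base = (void *)info->entries + TABLE_OFFSET(info, smp_processor_id());
     *   new: base = info->entries[smp_processor_id()];
     */
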
*/ for_each_cpu(i) { - if (i == 0) - continue; - memcpy(newinfo->entries + SMP_ALIGN(newinfo->size) * i, - newinfo->entries, - SMP_ALIGN(newinfo->size)); + if (newinfo->entries[i] && newinfo->entries[i] != entry0) + memcpy(newinfo->entries[i], entry0, newinfo->size); } return ret; @@ -1029,15 +1017,12 @@ replace_table(struct ip6t_table *table, #ifdef CONFIG_NETFILTER_DEBUG { - struct ip6t_entry *table_base; - unsigned int i; - - for_each_cpu(i) { - table_base = - (void *)newinfo->entries - + TABLE_OFFSET(newinfo, i); + int cpu; - table_base->comefrom = 0xdead57ac; + for_each_cpu(cpu) { + struct ip6t_entry *table_base = newinfo->entries[cpu]; + if (table_base) + table_base->comefrom = 0xdead57ac; } } #endif @@ -1072,21 +1057,47 @@ add_entry_to_counter(const struct ip6t_e return 0; } +static inline int +set_entry_to_counter(const struct ip6t_entry *e, + struct ip6t_counters total[], + unsigned int *i) +{ + SET_COUNTER(total[*i], e->counters.bcnt, e->counters.pcnt); + + (*i)++; + return 0; +} + static void get_counters(const struct ip6t_table_info *t, struct ip6t_counters counters[]) { unsigned int cpu; unsigned int i; + unsigned int curcpu = get_cpu(); + + /* Instead of clearing (by a previous call to memset()) + * the counters and using adds, we set the counters + * with data used by current CPU */ + + i = 0; + IP6T_ENTRY_ITERATE(t->entries[curcpu], + t->size, + set_entry_to_counter, + counters, + &i); for_each_cpu(cpu) { + if (cpu == curcpu) + continue; i = 0; - IP6T_ENTRY_ITERATE(t->entries + TABLE_OFFSET(t, cpu), + IP6T_ENTRY_ITERATE(t->entries[cpu], t->size, add_entry_to_counter, counters, &i); } + put_cpu(); } static int @@ -1098,6 +1109,7 @@ copy_entries_to_user(unsigned int total_ struct ip6t_entry *e; struct ip6t_counters *counters; int ret = 0; + void *loc_cpu_entry; /* We need atomic snapshot of counters: rest doesn't change (other than comefrom, which userspace doesn't care @@ -1109,13 +1121,13 @@ copy_entries_to_user(unsigned int total_ return -ENOMEM; /* First, sum counters... */ - memset(counters, 0, countersize); write_lock_bh(&table->lock); get_counters(table->private, counters); write_unlock_bh(&table->lock); - /* ... then copy entire thing from CPU 0... 
*/ - if (copy_to_user(userptr, table->private->entries, total_size) != 0) { + /* choose the copy that is on ourc node/cpu */ + loc_cpu_entry = table->private->entries[get_cpu()]; + if (copy_to_user(userptr, loc_cpu_entry, total_size) != 0) { ret = -EFAULT; goto free_counters; } @@ -1127,7 +1139,7 @@ copy_entries_to_user(unsigned int total_ struct ip6t_entry_match *m; struct ip6t_entry_target *t; - e = (struct ip6t_entry *)(table->private->entries + off); + e = (struct ip6t_entry *)(loc_cpu_entry + off); if (copy_to_user(userptr + off + offsetof(struct ip6t_entry, counters), &counters[num], @@ -1164,6 +1176,7 @@ copy_entries_to_user(unsigned int total_ } free_counters: + put_cpu(); vfree(counters); return ret; } @@ -1196,6 +1209,46 @@ get_entries(const struct ip6t_get_entrie return ret; } +static void free_table_info(struct ip6t_table_info *info) +{ + int cpu; + for_each_cpu(cpu) { + if (info->size <= PAGE_SIZE) + kfree(info->entries[cpu]); + else + vfree(info->entries[cpu]); + } + kfree(info); +} + +static struct ip6t_table_info *alloc_table_info(unsigned int size) +{ + struct ip6t_table_info *newinfo; + int cpu; + + newinfo = kzalloc(sizeof(struct ip6t_table_info), GFP_KERNEL); + if (!newinfo) + return NULL; + + newinfo->size = size; + + for_each_cpu(cpu) { + if (size <= PAGE_SIZE) + newinfo->entries[cpu] = kmalloc_node(size, + GFP_KERNEL, + cpu_to_node(cpu)); + else + newinfo->entries[cpu] = vmalloc_node(size, + cpu_to_node(cpu)); + if (newinfo->entries[cpu] == NULL) { + free_table_info(newinfo); + return NULL; + } + } + + return newinfo; +} + static int do_replace(void __user *user, unsigned int len) { @@ -1204,6 +1257,7 @@ do_replace(void __user *user, unsigned i struct ip6t_table *t; struct ip6t_table_info *newinfo, *oldinfo; struct ip6t_counters *counters; + void *loc_cpu_entry, *loc_cpu_old_entry; if (copy_from_user(&tmp, user, sizeof(tmp)) != 0) return -EFAULT; @@ -1212,13 +1266,13 @@ do_replace(void __user *user, unsigned i if ((SMP_ALIGN(tmp.size) >> PAGE_SHIFT) + 2 > num_physpages) return -ENOMEM; - newinfo = vmalloc(sizeof(struct ip6t_table_info) - + SMP_ALIGN(tmp.size) * - (highest_possible_processor_id()+1)); + newinfo = alloc_table_info(tmp.size); if (!newinfo) return -ENOMEM; - if (copy_from_user(newinfo->entries, user + sizeof(tmp), + /* choose the copy that is on our node/cpu */ + loc_cpu_entry = newinfo->entries[get_cpu()]; + if (copy_from_user(loc_cpu_entry, user + sizeof(tmp), tmp.size) != 0) { ret = -EFAULT; goto free_newinfo; @@ -1229,10 +1283,9 @@ do_replace(void __user *user, unsigned i ret = -ENOMEM; goto free_newinfo; } - memset(counters, 0, tmp.num_counters * sizeof(struct ip6t_counters)); ret = translate_table(tmp.name, tmp.valid_hooks, - newinfo, tmp.size, tmp.num_entries, + newinfo, loc_cpu_entry, tmp.size, tmp.num_entries, tmp.hook_entry, tmp.underflow); if (ret != 0) goto free_newinfo_counters; @@ -1271,8 +1324,10 @@ do_replace(void __user *user, unsigned i /* Get the old counters. 
*/ get_counters(oldinfo, counters); /* Decrease module usage counts and free resource */ - IP6T_ENTRY_ITERATE(oldinfo->entries, oldinfo->size, cleanup_entry,NULL); - vfree(oldinfo); + loc_cpu_old_entry = oldinfo->entries[smp_processor_id()]; + IP6T_ENTRY_ITERATE(loc_cpu_old_entry, oldinfo->size, cleanup_entry,NULL); + put_cpu(); + free_table_info(oldinfo); if (copy_to_user(tmp.counters, counters, sizeof(struct ip6t_counters) * tmp.num_counters) != 0) ret = -EFAULT; @@ -1284,11 +1339,12 @@ do_replace(void __user *user, unsigned i module_put(t->me); up(&ip6t_mutex); free_newinfo_counters_untrans: - IP6T_ENTRY_ITERATE(newinfo->entries, newinfo->size, cleanup_entry,NULL); + IP6T_ENTRY_ITERATE(loc_cpu_entry, newinfo->size, cleanup_entry,NULL); free_newinfo_counters: vfree(counters); free_newinfo: - vfree(newinfo); + put_cpu(); + free_table_info(newinfo); return ret; } @@ -1321,6 +1377,7 @@ do_add_counters(void __user *user, unsig struct ip6t_counters_info tmp, *paddc; struct ip6t_table *t; int ret = 0; + void *loc_cpu_entry; if (copy_from_user(&tmp, user, sizeof(tmp)) != 0) return -EFAULT; @@ -1350,7 +1407,9 @@ do_add_counters(void __user *user, unsig } i = 0; - IP6T_ENTRY_ITERATE(t->private->entries, + /* Choose the copy that is on our node */ + loc_cpu_entry = t->private->entries[smp_processor_id()]; + IP6T_ENTRY_ITERATE(loc_cpu_entry, t->private->size, add_counter_to_entry, paddc->counters, @@ -1543,28 +1602,30 @@ int ip6t_register_table(struct ip6t_tabl struct ip6t_table_info *newinfo; static struct ip6t_table_info bootstrap = { 0, 0, 0, { 0 }, { 0 }, { } }; + void *loc_cpu_entry; - newinfo = vmalloc(sizeof(struct ip6t_table_info) - + SMP_ALIGN(repl->size) * - (highest_possible_processor_id()+1)); + newinfo = alloc_table_info(repl->size); if (!newinfo) return -ENOMEM; - memcpy(newinfo->entries, repl->entries, repl->size); + /* choose the copy on our node/cpu */ + loc_cpu_entry = newinfo->entries[get_cpu()]; + memcpy(loc_cpu_entry, repl->entries, repl->size); ret = translate_table(table->name, table->valid_hooks, - newinfo, repl->size, + newinfo, loc_cpu_entry, repl->size, repl->num_entries, repl->hook_entry, repl->underflow); + put_cpu(); if (ret != 0) { - vfree(newinfo); + free_table_info(newinfo); return ret; } ret = down_interruptible(&ip6t_mutex); if (ret != 0) { - vfree(newinfo); + free_table_info(newinfo); return ret; } @@ -1593,20 +1654,24 @@ int ip6t_register_table(struct ip6t_tabl return ret; free_unlock: - vfree(newinfo); + free_table_info(newinfo); goto unlock; } void ip6t_unregister_table(struct ip6t_table *table) { + void *loc_cpu_entry; + down(&ip6t_mutex); LIST_DELETE(&ip6t_tables, table); up(&ip6t_mutex); /* Decrease module usage counts and free resources */ - IP6T_ENTRY_ITERATE(table->private->entries, table->private->size, + loc_cpu_entry = table->private->entries[get_cpu()]; + IP6T_ENTRY_ITERATE(loc_cpu_entry, table->private->size, cleanup_entry, NULL); - vfree(table->private); + put_cpu(); + free_table_info(table->private); } /* Returns 1 if the port is matched by the range, 0 otherwise */ diff --git a/net/key/af_key.c b/net/key/af_key.c index 3903168..d32f779 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -336,6 +336,7 @@ static u8 sadb_ext_min_len[] = { [SADB_X_EXT_NAT_T_SPORT] = (u8) sizeof(struct sadb_x_nat_t_port), [SADB_X_EXT_NAT_T_DPORT] = (u8) sizeof(struct sadb_x_nat_t_port), [SADB_X_EXT_NAT_T_OA] = (u8) sizeof(struct sadb_address), + [SADB_X_EXT_SEC_CTX] = (u8) sizeof(struct sadb_x_sec_ctx), }; /* Verify sadb_address_{len,prefixlen} against sa_family. 
*/ @@ -383,6 +384,55 @@ static int verify_address_len(void *p) return 0; } +static inline int pfkey_sec_ctx_len(struct sadb_x_sec_ctx *sec_ctx) +{ + int len = 0; + + len += sizeof(struct sadb_x_sec_ctx); + len += sec_ctx->sadb_x_ctx_len; + len += sizeof(uint64_t) - 1; + len /= sizeof(uint64_t); + + return len; +} + +static inline int verify_sec_ctx_len(void *p) +{ + struct sadb_x_sec_ctx *sec_ctx = (struct sadb_x_sec_ctx *)p; + int len; + + if (sec_ctx->sadb_x_ctx_len > PAGE_SIZE) + return -EINVAL; + + len = pfkey_sec_ctx_len(sec_ctx); + + if (sec_ctx->sadb_x_sec_len != len) + return -EINVAL; + + return 0; +} + +static inline struct xfrm_user_sec_ctx *pfkey_sadb2xfrm_user_sec_ctx(struct sadb_x_sec_ctx *sec_ctx) +{ + struct xfrm_user_sec_ctx *uctx = NULL; + int ctx_size = sec_ctx->sadb_x_ctx_len; + + uctx = kmalloc((sizeof(*uctx)+ctx_size), GFP_KERNEL); + + if (!uctx) + return NULL; + + uctx->len = pfkey_sec_ctx_len(sec_ctx); + uctx->exttype = sec_ctx->sadb_x_sec_exttype; + uctx->ctx_doi = sec_ctx->sadb_x_ctx_doi; + uctx->ctx_alg = sec_ctx->sadb_x_ctx_alg; + uctx->ctx_len = sec_ctx->sadb_x_ctx_len; + memcpy(uctx + 1, sec_ctx + 1, + uctx->ctx_len); + + return uctx; +} + static int present_and_same_family(struct sadb_address *src, struct sadb_address *dst) { @@ -438,6 +488,10 @@ static int parse_exthdrs(struct sk_buff if (verify_address_len(p)) return -EINVAL; } + if (ext_type == SADB_X_EXT_SEC_CTX) { + if (verify_sec_ctx_len(p)) + return -EINVAL; + } ext_hdrs[ext_type-1] = p; } p += ext_len; @@ -586,6 +640,9 @@ static struct sk_buff * pfkey_xfrm_state struct sadb_key *key; struct sadb_x_sa2 *sa2; struct sockaddr_in *sin; + struct sadb_x_sec_ctx *sec_ctx; + struct xfrm_sec_ctx *xfrm_ctx; + int ctx_size = 0; #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) struct sockaddr_in6 *sin6; #endif @@ -609,6 +666,12 @@ static struct sk_buff * pfkey_xfrm_state sizeof(struct sadb_address)*2 + sockaddr_size*2 + sizeof(struct sadb_x_sa2); + + if ((xfrm_ctx = x->security)) { + ctx_size = PFKEY_ALIGN8(xfrm_ctx->ctx_len); + size += sizeof(struct sadb_x_sec_ctx) + ctx_size; + } + /* identity & sensitivity */ if ((x->props.family == AF_INET && @@ -899,6 +962,20 @@ static struct sk_buff * pfkey_xfrm_state n_port->sadb_x_nat_t_port_reserved = 0; } + /* security context */ + if (xfrm_ctx) { + sec_ctx = (struct sadb_x_sec_ctx *) skb_put(skb, + sizeof(struct sadb_x_sec_ctx) + ctx_size); + sec_ctx->sadb_x_sec_len = + (sizeof(struct sadb_x_sec_ctx) + ctx_size) / sizeof(uint64_t); + sec_ctx->sadb_x_sec_exttype = SADB_X_EXT_SEC_CTX; + sec_ctx->sadb_x_ctx_doi = xfrm_ctx->ctx_doi; + sec_ctx->sadb_x_ctx_alg = xfrm_ctx->ctx_alg; + sec_ctx->sadb_x_ctx_len = xfrm_ctx->ctx_len; + memcpy(sec_ctx + 1, xfrm_ctx->ctx_str, + xfrm_ctx->ctx_len); + } + return skb; } @@ -909,6 +986,7 @@ static struct xfrm_state * pfkey_msg2xfr struct sadb_lifetime *lifetime; struct sadb_sa *sa; struct sadb_key *key; + struct sadb_x_sec_ctx *sec_ctx; uint16_t proto; int err; @@ -993,6 +1071,21 @@ static struct xfrm_state * pfkey_msg2xfr x->lft.soft_add_expires_seconds = lifetime->sadb_lifetime_addtime; x->lft.soft_use_expires_seconds = lifetime->sadb_lifetime_usetime; } + + sec_ctx = (struct sadb_x_sec_ctx *) ext_hdrs[SADB_X_EXT_SEC_CTX-1]; + if (sec_ctx != NULL) { + struct xfrm_user_sec_ctx *uctx = pfkey_sadb2xfrm_user_sec_ctx(sec_ctx); + + if (!uctx) + goto out; + + err = security_xfrm_state_alloc(x, uctx); + kfree(uctx); + + if (err) + goto out; + } + key = (struct sadb_key*) ext_hdrs[SADB_EXT_KEY_AUTH-1]; if (sa->sadb_sa_auth) { int keysize 
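The new pfkey_sec_ctx_len()/verify_sec_ctx_len() pair follows the usual PF_KEY convention that extension lengths are counted in 8-byte units, rounded up. A worked example (hypothetical 25-byte context string, and assuming struct sadb_x_sec_ctx packs to 8 bytes of header):

    /*
     * pfkey_sec_ctx_len(), step by step, for sadb_x_ctx_len == 25:
     *
     *   len  = sizeof(struct sadb_x_sec_ctx)    ->  8
     *   len += sec_ctx->sadb_x_ctx_len          -> 33
     *   len += sizeof(uint64_t) - 1             -> 40
     *   len /= sizeof(uint64_t)                 ->  5   (5 * 8 = 40 bytes)
     *
     * verify_sec_ctx_len() then requires userspace to have set
     * sadb_x_sec_len == 5 for such an extension.
     */
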
= 0; @@ -1720,6 +1813,18 @@ parse_ipsecrequests(struct xfrm_policy * return 0; } +static inline int pfkey_xfrm_policy2sec_ctx_size(struct xfrm_policy *xp) +{ + struct xfrm_sec_ctx *xfrm_ctx = xp->security; + + if (xfrm_ctx) { + int len = sizeof(struct sadb_x_sec_ctx); + len += xfrm_ctx->ctx_len; + return PFKEY_ALIGN8(len); + } + return 0; +} + static int pfkey_xfrm_policy2msg_size(struct xfrm_policy *xp) { int sockaddr_size = pfkey_sockaddr_size(xp->family); @@ -1733,7 +1838,8 @@ static int pfkey_xfrm_policy2msg_size(st (sockaddr_size * 2) + sizeof(struct sadb_x_policy) + (xp->xfrm_nr * (sizeof(struct sadb_x_ipsecrequest) + - (socklen * 2))); + (socklen * 2))) + + pfkey_xfrm_policy2sec_ctx_size(xp); } static struct sk_buff * pfkey_xfrm_policy2msg_prep(struct xfrm_policy *xp) @@ -1757,6 +1863,8 @@ static void pfkey_xfrm_policy2msg(struct struct sadb_lifetime *lifetime; struct sadb_x_policy *pol; struct sockaddr_in *sin; + struct sadb_x_sec_ctx *sec_ctx; + struct xfrm_sec_ctx *xfrm_ctx; #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) struct sockaddr_in6 *sin6; #endif @@ -1941,6 +2049,21 @@ static void pfkey_xfrm_policy2msg(struct } } } + + /* security context */ + if ((xfrm_ctx = xp->security)) { + int ctx_size = pfkey_xfrm_policy2sec_ctx_size(xp); + + sec_ctx = (struct sadb_x_sec_ctx *) skb_put(skb, ctx_size); + sec_ctx->sadb_x_sec_len = ctx_size / sizeof(uint64_t); + sec_ctx->sadb_x_sec_exttype = SADB_X_EXT_SEC_CTX; + sec_ctx->sadb_x_ctx_doi = xfrm_ctx->ctx_doi; + sec_ctx->sadb_x_ctx_alg = xfrm_ctx->ctx_alg; + sec_ctx->sadb_x_ctx_len = xfrm_ctx->ctx_len; + memcpy(sec_ctx + 1, xfrm_ctx->ctx_str, + xfrm_ctx->ctx_len); + } + hdr->sadb_msg_len = size / sizeof(uint64_t); hdr->sadb_msg_reserved = atomic_read(&xp->refcnt); } @@ -1976,12 +2099,13 @@ out: static int pfkey_spdadd(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) { - int err; + int err = 0; struct sadb_lifetime *lifetime; struct sadb_address *sa; struct sadb_x_policy *pol; struct xfrm_policy *xp; struct km_event c; + struct sadb_x_sec_ctx *sec_ctx; if (!present_and_same_family(ext_hdrs[SADB_EXT_ADDRESS_SRC-1], ext_hdrs[SADB_EXT_ADDRESS_DST-1]) || @@ -2028,6 +2152,22 @@ static int pfkey_spdadd(struct sock *sk, if (xp->selector.dport) xp->selector.dport_mask = ~0; + sec_ctx = (struct sadb_x_sec_ctx *) ext_hdrs[SADB_X_EXT_SEC_CTX-1]; + if (sec_ctx != NULL) { + struct xfrm_user_sec_ctx *uctx = pfkey_sadb2xfrm_user_sec_ctx(sec_ctx); + + if (!uctx) { + err = -ENOBUFS; + goto out; + } + + err = security_xfrm_policy_alloc(xp, uctx); + kfree(uctx); + + if (err) + goto out; + } + xp->lft.soft_byte_limit = XFRM_INF; xp->lft.hard_byte_limit = XFRM_INF; xp->lft.soft_packet_limit = XFRM_INF; @@ -2051,10 +2191,9 @@ static int pfkey_spdadd(struct sock *sk, err = xfrm_policy_insert(pol->sadb_x_policy_dir-1, xp, hdr->sadb_msg_type != SADB_X_SPDUPDATE); - if (err) { - kfree(xp); - return err; - } + + if (err) + goto out; if (hdr->sadb_msg_type == SADB_X_SPDUPDATE) c.event = XFRM_MSG_UPDPOLICY; @@ -2069,6 +2208,7 @@ static int pfkey_spdadd(struct sock *sk, return 0; out: + security_xfrm_policy_free(xp); kfree(xp); return err; } @@ -2078,9 +2218,10 @@ static int pfkey_spddelete(struct sock * int err; struct sadb_address *sa; struct sadb_x_policy *pol; - struct xfrm_policy *xp; + struct xfrm_policy *xp, tmp; struct xfrm_selector sel; struct km_event c; + struct sadb_x_sec_ctx *sec_ctx; if (!present_and_same_family(ext_hdrs[SADB_EXT_ADDRESS_SRC-1], ext_hdrs[SADB_EXT_ADDRESS_DST-1]) || @@ -2109,7 +2250,24 @@ static int 
pfkey_spddelete(struct sock * if (sel.dport) sel.dport_mask = ~0; - xp = xfrm_policy_bysel(pol->sadb_x_policy_dir-1, &sel, 1); + sec_ctx = (struct sadb_x_sec_ctx *) ext_hdrs[SADB_X_EXT_SEC_CTX-1]; + memset(&tmp, 0, sizeof(struct xfrm_policy)); + + if (sec_ctx != NULL) { + struct xfrm_user_sec_ctx *uctx = pfkey_sadb2xfrm_user_sec_ctx(sec_ctx); + + if (!uctx) + return -ENOMEM; + + err = security_xfrm_policy_alloc(&tmp, uctx); + kfree(uctx); + + if (err) + return err; + } + + xp = xfrm_policy_bysel_ctx(pol->sadb_x_policy_dir-1, &sel, tmp.security, 1); + security_xfrm_policy_free(&tmp); if (xp == NULL) return -ENOENT; @@ -2660,6 +2818,7 @@ static struct xfrm_policy *pfkey_compile { struct xfrm_policy *xp; struct sadb_x_policy *pol = (struct sadb_x_policy*)data; + struct sadb_x_sec_ctx *sec_ctx; switch (family) { case AF_INET: @@ -2709,10 +2868,32 @@ static struct xfrm_policy *pfkey_compile (*dir = parse_ipsecrequests(xp, pol)) < 0) goto out; + /* security context too */ + if (len >= (pol->sadb_x_policy_len*8 + + sizeof(struct sadb_x_sec_ctx))) { + char *p = (char *)pol; + struct xfrm_user_sec_ctx *uctx; + + p += pol->sadb_x_policy_len*8; + sec_ctx = (struct sadb_x_sec_ctx *)p; + if (len < pol->sadb_x_policy_len*8 + + sec_ctx->sadb_x_sec_len) + goto out; + if ((*dir = verify_sec_ctx_len(p))) + goto out; + uctx = pfkey_sadb2xfrm_user_sec_ctx(sec_ctx); + *dir = security_xfrm_policy_alloc(xp, uctx); + kfree(uctx); + + if (*dir) + goto out; + } + *dir = pol->sadb_x_policy_dir-1; return xp; out: + security_xfrm_policy_free(xp); kfree(xp); return NULL; } diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 0db9e57..4daaddf 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -10,7 +10,7 @@ * YOSHIFUJI Hideaki * Split up af-specific portion * Derek Atkins Add the post_input processor - * + * */ #include @@ -256,6 +256,7 @@ void __xfrm_policy_destroy(struct xfrm_p if (del_timer(&policy->timer)) BUG(); + security_xfrm_policy_free(policy); kfree(policy); } EXPORT_SYMBOL(__xfrm_policy_destroy); @@ -349,7 +350,8 @@ int xfrm_policy_insert(int dir, struct x write_lock_bh(&xfrm_policy_lock); for (p = &xfrm_policy_list[dir]; (pol=*p)!=NULL;) { - if (!delpol && memcmp(&policy->selector, &pol->selector, sizeof(pol->selector)) == 0) { + if (!delpol && memcmp(&policy->selector, &pol->selector, sizeof(pol->selector)) == 0 && + xfrm_sec_ctx_match(pol->security, policy->security)) { if (excl) { write_unlock_bh(&xfrm_policy_lock); return -EEXIST; @@ -388,14 +390,15 @@ int xfrm_policy_insert(int dir, struct x } EXPORT_SYMBOL(xfrm_policy_insert); -struct xfrm_policy *xfrm_policy_bysel(int dir, struct xfrm_selector *sel, - int delete) +struct xfrm_policy *xfrm_policy_bysel_ctx(int dir, struct xfrm_selector *sel, + struct xfrm_sec_ctx *ctx, int delete) { struct xfrm_policy *pol, **p; write_lock_bh(&xfrm_policy_lock); for (p = &xfrm_policy_list[dir]; (pol=*p)!=NULL; p = &pol->next) { - if (memcmp(sel, &pol->selector, sizeof(*sel)) == 0) { + if ((memcmp(sel, &pol->selector, sizeof(*sel)) == 0) && + (xfrm_sec_ctx_match(ctx, pol->security))) { xfrm_pol_hold(pol); if (delete) *p = pol->next; @@ -410,7 +413,7 @@ struct xfrm_policy *xfrm_policy_bysel(in } return pol; } -EXPORT_SYMBOL(xfrm_policy_bysel); +EXPORT_SYMBOL(xfrm_policy_bysel_ctx); struct xfrm_policy *xfrm_policy_byid(int dir, u32 id, int delete) { @@ -491,7 +494,7 @@ EXPORT_SYMBOL(xfrm_policy_walk); /* Find policy to apply to this flow. 
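pfkey_spddelete() above (and xfrm_get_policy() later in this patch) share a small idiom for feeding the user-supplied label into the new context-aware lookup: a zeroed, throwaway xfrm_policy on the stack carries the freshly allocated security blob just long enough for the comparison, and is freed immediately afterwards. Condensed (names as in the patch, error paths trimmed):

    /* Condensed from pfkey_spddelete(); 'tmp' exists only so tmp.security
     * can be handed to the selector+context lookup. */
    struct xfrm_policy tmp;

    memset(&tmp, 0, sizeof(tmp));
    if (uctx) {                             /* label supplied by userspace */
            err = security_xfrm_policy_alloc(&tmp, uctx);
            if (err)
                    return err;
    }
    xp = xfrm_policy_bysel_ctx(pol->sadb_x_policy_dir - 1, &sel,
                               tmp.security, 1);
    security_xfrm_policy_free(&tmp);        /* the matched policy, if any,
                                             * keeps its own context */
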
*/ -static void xfrm_policy_lookup(struct flowi *fl, u16 family, u8 dir, +static void xfrm_policy_lookup(struct flowi *fl, u32 sk_sid, u16 family, u8 dir, void **objp, atomic_t **obj_refp) { struct xfrm_policy *pol; @@ -505,9 +508,12 @@ static void xfrm_policy_lookup(struct fl continue; match = xfrm_selector_match(sel, fl, family); + if (match) { - xfrm_pol_hold(pol); - break; + if (!security_xfrm_policy_lookup(pol, sk_sid, dir)) { + xfrm_pol_hold(pol); + break; + } } } read_unlock_bh(&xfrm_policy_lock); @@ -515,15 +521,37 @@ static void xfrm_policy_lookup(struct fl *obj_refp = &pol->refcnt; } -static struct xfrm_policy *xfrm_sk_policy_lookup(struct sock *sk, int dir, struct flowi *fl) +static inline int policy_to_flow_dir(int dir) +{ + if (XFRM_POLICY_IN == FLOW_DIR_IN && + XFRM_POLICY_OUT == FLOW_DIR_OUT && + XFRM_POLICY_FWD == FLOW_DIR_FWD) + return dir; + switch (dir) { + default: + case XFRM_POLICY_IN: + return FLOW_DIR_IN; + case XFRM_POLICY_OUT: + return FLOW_DIR_OUT; + case XFRM_POLICY_FWD: + return FLOW_DIR_FWD; + }; +} + +static struct xfrm_policy *xfrm_sk_policy_lookup(struct sock *sk, int dir, struct flowi *fl, u32 sk_sid) { struct xfrm_policy *pol; read_lock_bh(&xfrm_policy_lock); if ((pol = sk->sk_policy[dir]) != NULL) { - int match = xfrm_selector_match(&pol->selector, fl, + int match = xfrm_selector_match(&pol->selector, fl, sk->sk_family); + int err = 0; + if (match) + err = security_xfrm_policy_lookup(pol, sk_sid, policy_to_flow_dir(dir)); + + if (match && !err) xfrm_pol_hold(pol); else pol = NULL; @@ -596,6 +624,10 @@ static struct xfrm_policy *clone_policy( if (newp) { newp->selector = old->selector; + if (security_xfrm_policy_clone(old, newp)) { + kfree(newp); + return NULL; /* ENOMEM */ + } newp->lft = old->lft; newp->curlft = old->curlft; newp->action = old->action; @@ -707,22 +739,6 @@ xfrm_bundle_create(struct xfrm_policy *p return err; } -static inline int policy_to_flow_dir(int dir) -{ - if (XFRM_POLICY_IN == FLOW_DIR_IN && - XFRM_POLICY_OUT == FLOW_DIR_OUT && - XFRM_POLICY_FWD == FLOW_DIR_FWD) - return dir; - switch (dir) { - default: - case XFRM_POLICY_IN: - return FLOW_DIR_IN; - case XFRM_POLICY_OUT: - return FLOW_DIR_OUT; - case XFRM_POLICY_FWD: - return FLOW_DIR_FWD; - }; -} static int stale_bundle(struct dst_entry *dst); @@ -741,19 +757,20 @@ int xfrm_lookup(struct dst_entry **dst_p int err; u32 genid; u16 family = dst_orig->ops->family; + u8 dir = policy_to_flow_dir(XFRM_POLICY_OUT); + u32 sk_sid = security_sk_sid(sk, fl, dir); restart: genid = atomic_read(&flow_cache_genid); policy = NULL; if (sk && sk->sk_policy[1]) - policy = xfrm_sk_policy_lookup(sk, XFRM_POLICY_OUT, fl); + policy = xfrm_sk_policy_lookup(sk, XFRM_POLICY_OUT, fl, sk_sid); if (!policy) { /* To accelerate a bit... */ if ((dst_orig->flags & DST_NOXFRM) || !xfrm_policy_list[XFRM_POLICY_OUT]) return 0; - policy = flow_cache_lookup(fl, family, - policy_to_flow_dir(XFRM_POLICY_OUT), + policy = flow_cache_lookup(fl, sk_sid, family, dir, xfrm_policy_lookup); } @@ -934,16 +951,20 @@ int __xfrm_policy_check(struct sock *sk, { struct xfrm_policy *pol; struct flowi fl; + u8 fl_dir = policy_to_flow_dir(dir); + u32 sk_sid; if (_decode_session(skb, &fl, family) < 0) return 0; + sk_sid = security_sk_sid(sk, &fl, fl_dir); + /* First, check used SA against their selectors. 
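The recurring change in the xfrm_policy.c hunks is that a policy now has to pass two tests before it is applied to a flow: the selector must match, and the LSM must approve the (socket SID, policy label, direction) combination. Condensed into a single predicate (names as in the patch, control flow simplified):

    /* A policy applies only if the selector matches AND the security hook
     * permits this socket SID to use the policy's label in this direction.
     * security_xfrm_policy_lookup() returns 0 for "permitted". */
    static int policy_applies(struct xfrm_policy *pol, struct flowi *fl,
                              unsigned short family, u32 sk_sid, u8 dir)
    {
            if (!xfrm_selector_match(&pol->selector, fl, family))
                    return 0;
            return security_xfrm_policy_lookup(pol, sk_sid, dir) == 0;
    }
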
*/ if (skb->sp) { int i; for (i=skb->sp->len-1; i>=0; i--) { - struct sec_decap_state *xvec = &(skb->sp->x[i]); + struct sec_decap_state *xvec = &(skb->sp->x[i]); if (!xfrm_selector_match(&xvec->xvec->sel, &fl, family)) return 0; @@ -958,11 +979,10 @@ int __xfrm_policy_check(struct sock *sk, pol = NULL; if (sk && sk->sk_policy[dir]) - pol = xfrm_sk_policy_lookup(sk, dir, &fl); + pol = xfrm_sk_policy_lookup(sk, dir, &fl, sk_sid); if (!pol) - pol = flow_cache_lookup(&fl, family, - policy_to_flow_dir(dir), + pol = flow_cache_lookup(&fl, sk_sid, family, fl_dir, xfrm_policy_lookup); if (!pol) diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 7cf48aa..df7ffec 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -10,7 +10,7 @@ * Split up af-specific functions * Derek Atkins * Add UDP Encapsulation - * + * */ #include @@ -70,6 +70,7 @@ static void xfrm_state_gc_destroy(struct x->type->destructor(x); xfrm_put_type(x->type); } + security_xfrm_state_free(x); kfree(x); } @@ -343,7 +344,8 @@ xfrm_state_find(xfrm_address_t *daddr, x selector. */ if (x->km.state == XFRM_STATE_VALID) { - if (!xfrm_selector_match(&x->sel, fl, family)) + if (!xfrm_selector_match(&x->sel, fl, family) || + !xfrm_sec_ctx_match(pol->security, x->security)) continue; if (!best || best->km.dying > x->km.dying || @@ -354,7 +356,8 @@ xfrm_state_find(xfrm_address_t *daddr, x acquire_in_progress = 1; } else if (x->km.state == XFRM_STATE_ERROR || x->km.state == XFRM_STATE_EXPIRED) { - if (xfrm_selector_match(&x->sel, fl, family)) + if (xfrm_selector_match(&x->sel, fl, family) && + xfrm_sec_ctx_match(pol->security, x->security)) error = -ESRCH; } } diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 0cdd9a0..92e2b80 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -7,7 +7,7 @@ * Kazunori MIYAZAWA @USAGI * Kunihiro Ishiguro * IPv6 support - * + * */ #include @@ -88,6 +88,34 @@ static int verify_encap_tmpl(struct rtat return 0; } + +static inline int verify_sec_ctx_len(struct rtattr **xfrma) +{ + struct rtattr *rt = xfrma[XFRMA_SEC_CTX - 1]; + struct xfrm_user_sec_ctx *uctx; + int len = 0; + + if (!rt) + return 0; + + if (rt->rta_len < sizeof(*uctx)) + return -EINVAL; + + uctx = RTA_DATA(rt); + + if (uctx->ctx_len > PAGE_SIZE) + return -EINVAL; + + len += sizeof(struct xfrm_user_sec_ctx); + len += uctx->ctx_len; + + if (uctx->len != len) + return -EINVAL; + + return 0; +} + + static int verify_newsa_info(struct xfrm_usersa_info *p, struct rtattr **xfrma) { @@ -145,6 +173,8 @@ static int verify_newsa_info(struct xfrm goto out; if ((err = verify_encap_tmpl(xfrma))) goto out; + if ((err = verify_sec_ctx_len(xfrma))) + goto out; err = -EINVAL; switch (p->mode) { @@ -209,6 +239,30 @@ static int attach_encap_tmpl(struct xfrm return 0; } + +static inline int xfrm_user_sec_ctx_size(struct xfrm_policy *xp) +{ + struct xfrm_sec_ctx *xfrm_ctx = xp->security; + int len = 0; + + if (xfrm_ctx) { + len += sizeof(struct xfrm_user_sec_ctx); + len += xfrm_ctx->ctx_len; + } + return len; +} + +static int attach_sec_ctx(struct xfrm_state *x, struct rtattr *u_arg) +{ + struct xfrm_user_sec_ctx *uctx; + + if (!u_arg) + return 0; + + uctx = RTA_DATA(u_arg); + return security_xfrm_state_alloc(x, uctx); +} + static void copy_from_user_state(struct xfrm_state *x, struct xfrm_usersa_info *p) { memcpy(&x->id, &p->id, sizeof(x->id)); @@ -253,6 +307,9 @@ static struct xfrm_state *xfrm_state_con if (err) goto error; + if ((err = attach_sec_ctx(x, xfrma[XFRMA_SEC_CTX-1]))) + goto error; + x->km.seq = p->seq; return 
x; @@ -272,11 +329,11 @@ static int xfrm_add_sa(struct sk_buff *s int err; struct km_event c; - err = verify_newsa_info(p, (struct rtattr **) xfrma); + err = verify_newsa_info(p, (struct rtattr **)xfrma); if (err) return err; - x = xfrm_state_construct(p, (struct rtattr **) xfrma, &err); + x = xfrm_state_construct(p, (struct rtattr **)xfrma, &err); if (!x) return err; @@ -390,6 +447,19 @@ static int dump_one_state(struct xfrm_st if (x->encap) RTA_PUT(skb, XFRMA_ENCAP, sizeof(*x->encap), x->encap); + if (x->security) { + int ctx_size = sizeof(struct xfrm_sec_ctx) + + x->security->ctx_len; + struct rtattr *rt = __RTA_PUT(skb, XFRMA_SEC_CTX, ctx_size); + struct xfrm_user_sec_ctx *uctx = RTA_DATA(rt); + + uctx->exttype = XFRMA_SEC_CTX; + uctx->len = ctx_size; + uctx->ctx_doi = x->security->ctx_doi; + uctx->ctx_alg = x->security->ctx_alg; + uctx->ctx_len = x->security->ctx_len; + memcpy(uctx + 1, x->security->ctx_str, x->security->ctx_len); + } nlh->nlmsg_len = skb->tail - b; out: sp->this_idx++; @@ -603,6 +673,18 @@ static int verify_newpolicy_info(struct return verify_policy_dir(p->dir); } +static int copy_from_user_sec_ctx(struct xfrm_policy *pol, struct rtattr **xfrma) +{ + struct rtattr *rt = xfrma[XFRMA_SEC_CTX-1]; + struct xfrm_user_sec_ctx *uctx; + + if (!rt) + return 0; + + uctx = RTA_DATA(rt); + return security_xfrm_policy_alloc(pol, uctx); +} + static void copy_templates(struct xfrm_policy *xp, struct xfrm_user_tmpl *ut, int nr) { @@ -681,7 +763,10 @@ static struct xfrm_policy *xfrm_policy_c } copy_from_user_policy(xp, p); - err = copy_from_user_tmpl(xp, xfrma); + + if (!(err = copy_from_user_tmpl(xp, xfrma))) + err = copy_from_user_sec_ctx(xp, xfrma); + if (err) { *errp = err; kfree(xp); @@ -702,8 +787,11 @@ static int xfrm_add_policy(struct sk_buf err = verify_newpolicy_info(p); if (err) return err; + err = verify_sec_ctx_len((struct rtattr **)xfrma); + if (err) + return err; - xp = xfrm_policy_construct(p, (struct rtattr **) xfrma, &err); + xp = xfrm_policy_construct(p, (struct rtattr **)xfrma, &err); if (!xp) return err; @@ -761,6 +849,27 @@ rtattr_failure: return -1; } +static int copy_to_user_sec_ctx(struct xfrm_policy *xp, struct sk_buff *skb) +{ + if (xp->security) { + int ctx_size = sizeof(struct xfrm_sec_ctx) + + xp->security->ctx_len; + struct rtattr *rt = __RTA_PUT(skb, XFRMA_SEC_CTX, ctx_size); + struct xfrm_user_sec_ctx *uctx = RTA_DATA(rt); + + uctx->exttype = XFRMA_SEC_CTX; + uctx->len = ctx_size; + uctx->ctx_doi = xp->security->ctx_doi; + uctx->ctx_alg = xp->security->ctx_alg; + uctx->ctx_len = xp->security->ctx_len; + memcpy(uctx + 1, xp->security->ctx_str, xp->security->ctx_len); + } + return 0; + + rtattr_failure: + return -1; +} + static int dump_one_policy(struct xfrm_policy *xp, int dir, int count, void *ptr) { struct xfrm_dump_info *sp = ptr; @@ -782,6 +891,8 @@ static int dump_one_policy(struct xfrm_p copy_to_user_policy(xp, p, dir); if (copy_to_user_tmpl(xp, skb) < 0) goto nlmsg_failure; + if (copy_to_user_sec_ctx(xp, skb)) + goto nlmsg_failure; nlh->nlmsg_len = skb->tail - b; out: @@ -852,8 +963,25 @@ static int xfrm_get_policy(struct sk_buf if (p->index) xp = xfrm_policy_byid(p->dir, p->index, delete); - else - xp = xfrm_policy_bysel(p->dir, &p->sel, delete); + else { + struct rtattr **rtattrs = (struct rtattr **)xfrma; + struct rtattr *rt = rtattrs[XFRMA_SEC_CTX-1]; + struct xfrm_policy tmp; + + err = verify_sec_ctx_len(rtattrs); + if (err) + return err; + + memset(&tmp, 0, sizeof(struct xfrm_policy)); + if (rt) { + struct xfrm_user_sec_ctx *uctx = 
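Emitting the new XFRMA_SEC_CTX attribute also means the notification paths must reserve room for it before alloc_skb(): xfrm_user_sec_ctx_size() returns the header-plus-string length, and the acquire/expire senders add RTA_SPACE() of that to the message size. A worked sizing example (hypothetical 25-byte context string, header assumed to pack to 8 bytes):

    /*
     * For xp->security->ctx_len == 25:
     *
     *   xfrm_user_sec_ctx_size(xp) = sizeof(struct xfrm_user_sec_ctx) + 25 = 33
     *   RTA_SPACE(33) = RTA_ALIGN(RTA_LENGTH(33)) = RTA_ALIGN(4 + 33) = 40
     *
     * so 40 extra bytes are reserved per message that carries the context.
     */
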
RTA_DATA(rt); + + if ((err = security_xfrm_policy_alloc(&tmp, uctx))) + return err; + } + xp = xfrm_policy_bysel_ctx(p->dir, &p->sel, tmp.security, delete); + security_xfrm_policy_free(&tmp); + } if (xp == NULL) return -ENOENT; @@ -1224,6 +1352,8 @@ static int build_acquire(struct sk_buff if (copy_to_user_tmpl(xp, skb) < 0) goto nlmsg_failure; + if (copy_to_user_sec_ctx(xp, skb)) + goto nlmsg_failure; nlh->nlmsg_len = skb->tail - b; return skb->len; @@ -1241,6 +1371,7 @@ static int xfrm_send_acquire(struct xfrm len = RTA_SPACE(sizeof(struct xfrm_user_tmpl) * xp->xfrm_nr); len += NLMSG_SPACE(sizeof(struct xfrm_user_acquire)); + len += RTA_SPACE(xfrm_user_sec_ctx_size(xp)); skb = alloc_skb(len, GFP_ATOMIC); if (skb == NULL) return -ENOMEM; @@ -1324,6 +1455,8 @@ static int build_polexpire(struct sk_buf copy_to_user_policy(xp, &upe->pol, dir); if (copy_to_user_tmpl(xp, skb) < 0) goto nlmsg_failure; + if (copy_to_user_sec_ctx(xp, skb)) + goto nlmsg_failure; upe->hard = !!hard; nlh->nlmsg_len = skb->tail - b; @@ -1341,6 +1474,7 @@ static int xfrm_exp_policy_notify(struct len = RTA_SPACE(sizeof(struct xfrm_user_tmpl) * xp->xfrm_nr); len += NLMSG_SPACE(sizeof(struct xfrm_user_polexpire)); + len += RTA_SPACE(xfrm_user_sec_ctx_size(xp)); skb = alloc_skb(len, GFP_ATOMIC); if (skb == NULL) return -ENOMEM; diff --git a/security/Kconfig b/security/Kconfig index 64d3f1e..34f5934 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -54,6 +54,19 @@ config SECURITY_NETWORK implement socket and networking access controls. If you are unsure how to answer this question, answer N. +config SECURITY_NETWORK_XFRM + bool "XFRM (IPSec) Networking Security Hooks" + depends on XFRM && SECURITY_NETWORK + help + This enables the XFRM (IPSec) networking security hooks. + If enabled, a security module can use these hooks to + implement per-packet access controls based on labels + derived from IPSec policy. Non-IPSec communications are + designated as unlabelled, and only sockets authorized + to communicate unlabelled data can send without using + IPSec. + If you are unsure how to answer this question, answer N. 
+ config SECURITY_CAPABILITIES tristate "Default Linux Capabilities" depends on SECURITY diff --git a/security/dummy.c b/security/dummy.c index 3ca5f2b..a15c547 100644 --- a/security/dummy.c +++ b/security/dummy.c @@ -776,8 +776,42 @@ static inline int dummy_sk_alloc_securit static inline void dummy_sk_free_security (struct sock *sk) { } + +static unsigned int dummy_sk_getsid(struct sock *sk, struct flowi *fl, u8 dir) +{ + return 0; +} #endif /* CONFIG_SECURITY_NETWORK */ +#ifdef CONFIG_SECURITY_NETWORK_XFRM +static int dummy_xfrm_policy_alloc_security(struct xfrm_policy *xp, struct xfrm_user_sec_ctx *sec_ctx) +{ + return 0; +} + +static inline int dummy_xfrm_policy_clone_security(struct xfrm_policy *old, struct xfrm_policy *new) +{ + return 0; +} + +static void dummy_xfrm_policy_free_security(struct xfrm_policy *xp) +{ +} + +static int dummy_xfrm_state_alloc_security(struct xfrm_state *x, struct xfrm_user_sec_ctx *sec_ctx) +{ + return 0; +} + +static void dummy_xfrm_state_free_security(struct xfrm_state *x) +{ +} + +static int dummy_xfrm_policy_lookup(struct xfrm_policy *xp, u32 sk_sid, u8 dir) +{ + return 0; +} +#endif /* CONFIG_SECURITY_NETWORK_XFRM */ static int dummy_register_security (const char *name, struct security_operations *ops) { return -EINVAL; @@ -970,7 +1004,16 @@ void security_fixup_ops (struct security set_to_dummy_if_null(ops, socket_getpeersec); set_to_dummy_if_null(ops, sk_alloc_security); set_to_dummy_if_null(ops, sk_free_security); -#endif /* CONFIG_SECURITY_NETWORK */ + set_to_dummy_if_null(ops, sk_getsid); + #endif /* CONFIG_SECURITY_NETWORK */ +#ifdef CONFIG_SECURITY_NETWORK_XFRM + set_to_dummy_if_null(ops, xfrm_policy_alloc_security); + set_to_dummy_if_null(ops, xfrm_policy_clone_security); + set_to_dummy_if_null(ops, xfrm_policy_free_security); + set_to_dummy_if_null(ops, xfrm_state_alloc_security); + set_to_dummy_if_null(ops, xfrm_state_free_security); + set_to_dummy_if_null(ops, xfrm_policy_lookup); +#endif /* CONFIG_SECURITY_NETWORK_XFRM */ #ifdef CONFIG_KEYS set_to_dummy_if_null(ops, key_alloc); set_to_dummy_if_null(ops, key_free); diff --git a/security/selinux/Makefile b/security/selinux/Makefile index b038cd0..06d54d9 100644 --- a/security/selinux/Makefile +++ b/security/selinux/Makefile @@ -8,5 +8,7 @@ selinux-y := avc.o hooks.o selinuxfs.o n selinux-$(CONFIG_SECURITY_NETWORK) += netif.o +selinux-$(CONFIG_SECURITY_NETWORK_XFRM) += xfrm.o + EXTRA_CFLAGS += -Isecurity/selinux/include diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index fc77443..3d496ea 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -73,6 +73,7 @@ #include "avc.h" #include "objsec.h" #include "netif.h" +#include "xfrm.h" #define XATTR_SELINUX_SUFFIX "selinux" #define XATTR_NAME_SELINUX XATTR_SECURITY_PREFIX XATTR_SELINUX_SUFFIX @@ -3349,6 +3350,10 @@ static int selinux_socket_sock_rcv_skb(s err = avc_has_perm(sock_sid, port_sid, sock_class, recv_perm, &ad); } + + if (!err) + err = selinux_xfrm_sock_rcv_skb(sock_sid, skb); + out: return err; } @@ -3401,6 +3406,24 @@ static void selinux_sk_free_security(str sk_free_security(sk); } +static unsigned int selinux_sk_getsid_security(struct sock *sk, struct flowi *fl, u8 dir) +{ + struct inode_security_struct *isec; + u32 sock_sid = SECINITSID_ANY_SOCKET; + + if (!sk) + return selinux_no_sk_sid(fl); + + read_lock_bh(&sk->sk_callback_lock); + isec = get_sock_isec(sk); + + if (isec) + sock_sid = isec->sid; + + read_unlock_bh(&sk->sk_callback_lock); + return sock_sid; +} + static int selinux_nlmsg_perm(struct 
diff --git a/security/selinux/include/av_perm_to_string.h b/security/selinux/include/av_perm_to_string.h
index 1deb59e..71aeb12 100644
--- a/security/selinux/include/av_perm_to_string.h
+++ b/security/selinux/include/av_perm_to_string.h
@@ -238,3 +238,5 @@
    S_(SECCLASS_NSCD, NSCD__SHMEMHOST, "shmemhost")
    S_(SECCLASS_ASSOCIATION, ASSOCIATION__SENDTO, "sendto")
    S_(SECCLASS_ASSOCIATION, ASSOCIATION__RECVFROM, "recvfrom")
+   S_(SECCLASS_ASSOCIATION, ASSOCIATION__RELABELFROM, "relabelfrom")
+   S_(SECCLASS_ASSOCIATION, ASSOCIATION__RELABELTO, "relabelto")
diff --git a/security/selinux/include/av_permissions.h b/security/selinux/include/av_permissions.h
index a78b5d5..d1d0996 100644
--- a/security/selinux/include/av_permissions.h
+++ b/security/selinux/include/av_permissions.h
@@ -908,6 +908,8 @@
 #define ASSOCIATION__SENDTO                       0x00000001UL
 #define ASSOCIATION__RECVFROM                     0x00000002UL
+#define ASSOCIATION__RELABELFROM                  0x00000004UL
+#define ASSOCIATION__RELABELTO                    0x00000008UL
 #define NETLINK_KOBJECT_UEVENT_SOCKET__IOCTL      0x00000001UL
 #define NETLINK_KOBJECT_UEVENT_SOCKET__READ       0x00000002UL
diff --git a/security/selinux/include/xfrm.h b/security/selinux/include/xfrm.h
new file mode 100644
index 0000000..e664ca1
--- /dev/null
+++ b/security/selinux/include/xfrm.h
@@ -0,0 +1,44 @@
+/*
+ * SELinux support for the XFRM LSM hooks
+ *
+ * Author : Trent Jaeger
+ */
+#ifndef _SELINUX_XFRM_H_
+#define _SELINUX_XFRM_H_
+
+int selinux_xfrm_policy_alloc(struct xfrm_policy *xp, struct xfrm_user_sec_ctx *sec_ctx);
+int selinux_xfrm_policy_clone(struct xfrm_policy *old, struct xfrm_policy *new);
+void selinux_xfrm_policy_free(struct xfrm_policy *xp);
+int selinux_xfrm_state_alloc(struct xfrm_state *x, struct xfrm_user_sec_ctx *sec_ctx);
+void selinux_xfrm_state_free(struct xfrm_state *x);
+int selinux_xfrm_policy_lookup(struct xfrm_policy *xp, u32 sk_sid, u8 dir);
+int selinux_xfrm_sock_rcv_skb(u32 isec_sid, struct sk_buff *skb);
+int selinux_xfrm_postroute_last(u32 isec_sid, struct sk_buff *skb);
+
+/*
+ * Extract the security blob from the sock (it's actually on the socket)
+ */
+static inline struct inode_security_struct *get_sock_isec(struct sock *sk)
+{
+	if (!sk->sk_socket)
+		return NULL;
+
+	return SOCK_INODE(sk->sk_socket)->i_security;
+}
+
+static inline u32 selinux_no_sk_sid(struct flowi *fl)
+{
+	/* NOTE: no sock occurs on ICMP reply, forwards, ... */
+	/* icmp_reply: authorize as kernel packet */
+	if (fl && fl->proto == IPPROTO_ICMP)
+		return SECINITSID_KERNEL;
+
+	return SECINITSID_ANY_SOCKET;
+}
+
+#endif /* _SELINUX_XFRM_H_ */
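The selinux_xfrm_policy_alloc() and selinux_xfrm_state_alloc() prototypes above take a struct xfrm_user_sec_ctx, the security-context blob that userspace hands to the kernel through the pfkey or xfrm_user interfaces. Its layout (from include/linux/xfrm.h of this era, reproduced here for reference) places the variable-length context string immediately after the fixed header, which is why the allocation code below copies from uctx + 1:

	struct xfrm_user_sec_ctx {
		__u16	len;
		__u16	exttype;
		__u8	ctx_alg;	/* LSM algorithm, e.g. XFRM_SC_ALG_SELINUX */
		__u8	ctx_doi;	/* domain of interpretation, e.g. XFRM_SC_DOI_LSM */
		__u16	ctx_len;	/* length of the context string that follows */
	};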
diff --git a/security/selinux/xfrm.c b/security/selinux/xfrm.c
new file mode 100644
index 0000000..c641a3b
--- /dev/null
+++ b/security/selinux/xfrm.c
@@ -0,0 +1,325 @@
+/*
+ *  NSA Security-Enhanced Linux (SELinux) security module
+ *
+ *  This file contains the SELinux XFRM hook function implementations.
+ *
+ *  Authors:  Serge Hallyn
+ *	      Trent Jaeger
+ *
+ *  Copyright (C) 2005 International Business Machines Corporation
+ *
+ *	This program is free software; you can redistribute it and/or modify
+ *	it under the terms of the GNU General Public License version 2,
+ *	as published by the Free Software Foundation.
+ */
+
+/*
+ * USAGE:
+ * NOTES:
+ *   1. Make sure to enable the following options in your kernel config:
+ *	CONFIG_SECURITY=y
+ *	CONFIG_SECURITY_NETWORK=y
+ *	CONFIG_SECURITY_NETWORK_XFRM=y
+ *	CONFIG_SECURITY_SELINUX=m/y
+ * ISSUES:
+ *   1. Caching packets, so they are not dropped during negotiation
+ *   2. Emulating a reasonable SO_PEERSEC across machines
+ *   3. Testing addition of sk_policy's with security context via setsockopt
+ */
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/security.h>
+#include <linux/types.h>
+#include <linux/netfilter.h>
+#include <linux/netfilter_ipv4.h>
+#include <linux/netfilter_ipv6.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/skbuff.h>
+#include <linux/xfrm.h>
+#include <net/xfrm.h>
+#include <net/checksum.h>
+#include <net/udp.h>
+#include <asm/semaphore.h>
+
+#include "avc.h"
+#include "objsec.h"
+#include "xfrm.h"
+
+/*
+ * Returns true if the context is an LSM/SELinux context.
+ */
+static inline int selinux_authorizable_ctx(struct xfrm_sec_ctx *ctx)
+{
+	return (ctx &&
+		(ctx->ctx_doi == XFRM_SC_DOI_LSM) &&
+		(ctx->ctx_alg == XFRM_SC_ALG_SELINUX));
+}
+
+/*
+ * Returns true if the xfrm carries a security blob for SELinux.
+ */
+static inline int selinux_authorizable_xfrm(struct xfrm_state *x)
+{
+	return selinux_authorizable_ctx(x->security);
+}
+
+#ifdef CONFIG_SECURITY_NETWORK_XFRM
+/*
+ * LSM hook implementation that authorizes a socket to be used with
+ * the corresponding xfrm_sec_ctx and direction.
+ */
+int selinux_xfrm_policy_lookup(struct xfrm_policy *xp, u32 sk_sid, u8 dir)
+{
+	int rc = 0;
+	u32 sel_sid = SECINITSID_UNLABELED;
+	struct xfrm_sec_ctx *ctx;
+
+	/* Context sid is either set to label or ANY_ASSOC */
+	if ((ctx = xp->security)) {
+		if (!selinux_authorizable_ctx(ctx))
+			return -EINVAL;
+
+		sel_sid = ctx->ctx_sid;
+	}
+
+	rc = avc_has_perm(sk_sid, sel_sid, SECCLASS_ASSOCIATION,
+			  ((dir == FLOW_DIR_IN) ? ASSOCIATION__RECVFROM :
+			   ((dir == FLOW_DIR_OUT) ? ASSOCIATION__SENDTO :
+			    (ASSOCIATION__SENDTO | ASSOCIATION__RECVFROM))),
+			  NULL);
+
+	return rc;
+}
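selinux_xfrm_policy_lookup() reads ctx->ctx_sid, the SID that was resolved from the context string when the blob was allocated. The kernel-side blob, struct xfrm_sec_ctx (from include/net/xfrm.h of this era, shown for reference), mirrors the userspace header but caches that SID and stores the string inline:

	struct xfrm_sec_ctx {
		__u8	ctx_doi;
		__u8	ctx_alg;
		__u16	ctx_len;
		__u32	ctx_sid;	/* SID resolved from ctx_str */
		char	ctx_str[0];	/* context string, allocated inline */
	};

The dir argument selects the permission to check: FLOW_DIR_IN maps to recvfrom, FLOW_DIR_OUT to sendto, and any other direction (e.g. forwarding) requires both.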
+
+/*
+ * Security blob allocation for xfrm_policy and xfrm_state:
+ * *ctxp does not have a meaningful value on input.
+ */
+static int selinux_xfrm_sec_ctx_alloc(struct xfrm_sec_ctx **ctxp, struct xfrm_user_sec_ctx *uctx)
+{
+	int rc = 0;
+	struct task_security_struct *tsec = current->security;
+	struct xfrm_sec_ctx *ctx;
+
+	BUG_ON(!uctx);
+	BUG_ON(uctx->ctx_doi != XFRM_SC_DOI_LSM);
+
+	if (uctx->ctx_len >= PAGE_SIZE)
+		return -ENOMEM;
+
+	*ctxp = ctx = kmalloc(sizeof(*ctx) + uctx->ctx_len, GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->ctx_doi = uctx->ctx_doi;
+	ctx->ctx_len = uctx->ctx_len;
+	ctx->ctx_alg = uctx->ctx_alg;
+
+	/* The context string follows the xfrm_user_sec_ctx header. */
+	memcpy(ctx->ctx_str, uctx + 1, ctx->ctx_len);
+	rc = security_context_to_sid(ctx->ctx_str, ctx->ctx_len, &ctx->ctx_sid);
+	if (rc)
+		goto out;
+
+	/*
+	 * Does the subject have permission to set the security context,
+	 * i.e. permission to do the relabel?  It must be permitted to
+	 * relabel from the default socket type (process type) to the
+	 * specified context.
+	 */
+	rc = avc_has_perm(tsec->sid, tsec->sid,
+			  SECCLASS_ASSOCIATION,
+			  ASSOCIATION__RELABELFROM, NULL);
+	if (rc)
+		goto out;
+
+	rc = avc_has_perm(tsec->sid, ctx->ctx_sid,
+			  SECCLASS_ASSOCIATION,
+			  ASSOCIATION__RELABELTO, NULL);
+	if (rc)
+		goto out;
+
+	return rc;
+
+out:
+	*ctxp = NULL;
+	kfree(ctx);
+	return rc;
+}
+
+/*
+ * LSM hook implementation that allocates a security blob from the uctx
+ * spec and attaches it to an xfrm_policy.
+ */
+int selinux_xfrm_policy_alloc(struct xfrm_policy *xp, struct xfrm_user_sec_ctx *uctx)
+{
+	BUG_ON(!xp);
+
+	return selinux_xfrm_sec_ctx_alloc(&xp->security, uctx);
+}
+
+/*
+ * LSM hook implementation that copies the security data structure from
+ * old to new when a policy is cloned.
+ */
+int selinux_xfrm_policy_clone(struct xfrm_policy *old, struct xfrm_policy *new)
+{
+	struct xfrm_sec_ctx *old_ctx, *new_ctx;
+
+	old_ctx = old->security;
+	if (old_ctx) {
+		new_ctx = new->security = kmalloc(sizeof(*new_ctx) +
+						  old_ctx->ctx_len,
+						  GFP_KERNEL);
+		if (!new_ctx)
+			return -ENOMEM;
+
+		memcpy(new_ctx, old_ctx, sizeof(*new_ctx));
+		memcpy(new_ctx->ctx_str, old_ctx->ctx_str, new_ctx->ctx_len);
+	}
+	return 0;
+}
+
+/*
+ * LSM hook implementation that frees xfrm_policy security information.
+ */
+void selinux_xfrm_policy_free(struct xfrm_policy *xp)
+{
+	kfree(xp->security);	/* kfree(NULL) is a no-op */
+}
+
+/*
+ * LSM hook implementation that allocates a security blob from the
+ * sec_ctx spec and attaches it to an xfrm_state.
+ */
+int selinux_xfrm_state_alloc(struct xfrm_state *x, struct xfrm_user_sec_ctx *uctx)
+{
+	BUG_ON(!x);
+
+	return selinux_xfrm_sec_ctx_alloc(&x->security, uctx);
+}
+
+/*
+ * LSM hook implementation that frees xfrm_state security information.
+ */
+void selinux_xfrm_state_free(struct xfrm_state *x)
+{
+	kfree(x->security);
+}
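The two avc_has_perm() calls in selinux_xfrm_sec_ctx_alloc() follow the standard SELinux relabel pattern: the calling task needs relabelfrom on its own (source) SID and relabelto on the target context's SID. A condensed sketch of the same check, pulled out into a hypothetical helper (not part of the patch), makes the pairing explicit:

	/* Hypothetical condensation of the relabel check above: task_sid is
	 * the calling task's SID, ctx_sid the SID of the requested context. */
	static int may_relabel_assoc(u32 task_sid, u32 ctx_sid)
	{
		int rc;

		rc = avc_has_perm(task_sid, task_sid, SECCLASS_ASSOCIATION,
				  ASSOCIATION__RELABELFROM, NULL);
		if (rc)
			return rc;

		return avc_has_perm(task_sid, ctx_sid, SECCLASS_ASSOCIATION,
				    ASSOCIATION__RELABELTO, NULL);
	}

Only if both checks pass does the new ASSOCIATION__RELABELFROM/ASSOCIATION__RELABELTO permission pair (added to av_permissions.h above) allow a labeled policy or SA to be created.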
+
+/*
+ * LSM hook that controls access to unlabelled packets.  If an
+ * xfrm_state is authorizable (as defined by the macro above), it was
+ * already authorized by the IPSec process.  If not, we need to check
+ * for unlabelled access, since the packet may not have gone through
+ * the IPSec process.
+ */
+int selinux_xfrm_sock_rcv_skb(u32 isec_sid, struct sk_buff *skb)
+{
+	int i, rc = 0;
+	struct sec_path *sp;
+
+	sp = skb->sp;
+	if (sp) {
+		/*
+		 * __xfrm_policy_check does not approve unless xfrm_policy_ok
+		 * says that the SPIs match for the policy and the socket.
+		 *
+		 * Only need to verify the existence of an authorizable sp.
+		 */
+		for (i = 0; i < sp->len; i++) {
+			struct xfrm_state *x = sp->x[i].xvec;
+
+			if (x && selinux_authorizable_xfrm(x))
+				goto accept;
+		}
+	}
+
+	/* check SELinux sock for unlabelled access */
+	rc = avc_has_perm(isec_sid, SECINITSID_UNLABELED, SECCLASS_ASSOCIATION,
+			  ASSOCIATION__RECVFROM, NULL);
+	if (rc)
+		goto drop;
+
+accept:
+	return 0;
+
+drop:
+	return rc;
+}
+
+/*
+ * POSTROUTE_LAST hook's XFRM processing:
+ * If we have no security association, then we need to determine whether
+ * the socket is allowed to send to an unlabelled destination.  If we do
+ * have an authorizable security association, then it has already been
+ * checked in the xfrm_policy_lookup hook.
+ */
+int selinux_xfrm_postroute_last(u32 isec_sid, struct sk_buff *skb)
+{
+	struct dst_entry *dst;
+	int rc = 0;
+
+	dst = skb->dst;
+	if (dst) {
+		struct dst_entry *dst_test;
+
+		/* walk the bundle looking for an authorizable SA */
+		for (dst_test = dst; dst_test != NULL;
+		     dst_test = dst_test->child) {
+			struct xfrm_state *x = dst_test->xfrm;
+
+			if (x && selinux_authorizable_xfrm(x))
+				goto accept;
+		}
+	}
+
+	rc = avc_has_perm(isec_sid, SECINITSID_UNLABELED, SECCLASS_ASSOCIATION,
+			  ASSOCIATION__SENDTO, NULL);
+	if (rc)
+		goto drop;
+
+accept:
+	return NF_ACCEPT;
+
+drop:
+	return NF_DROP;
+}
+
+#else /* CONFIG_SECURITY_NETWORK_XFRM */
+
+int selinux_xfrm_sock_rcv_skb(u32 isec_sid, struct sk_buff *skb)
+{
+	return 0;
+}
+
+int selinux_xfrm_postroute_last(u32 isec_sid, struct sk_buff *skb)
+{
+	return NF_ACCEPT;
+}
+#endif /* CONFIG_SECURITY_NETWORK_XFRM */
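For context on the NF_ACCEPT/NF_DROP return values: selinux_xfrm_postroute_last() is invoked from SELinux's existing NF_IP_POST_ROUTING hook in hooks.c (the selinux_ip_postroute path patched earlier), and its verdict is handed straight back to netfilter. A minimal sketch of how a 2.6-era netfilter hook of that kind is registered; the names here are illustrative, not taken from the patch:

	static unsigned int my_postroute_hook(unsigned int hooknum,
					      struct sk_buff **pskb,
					      const struct net_device *in,
					      const struct net_device *out,
					      int (*okfn)(struct sk_buff *))
	{
		/* placeholder: a real hook derives this from the sending sock */
		u32 isec_sid = SECINITSID_ANY_SOCKET;

		return selinux_xfrm_postroute_last(isec_sid, *pskb);
	}

	static struct nf_hook_ops my_postroute_ops = {
		.hook     = my_postroute_hook,
		.pf       = PF_INET,
		.hooknum  = NF_IP_POST_ROUTING,
		.priority = NF_IP_PRI_SELINUX_LAST,
	};

	/* registered at init time with nf_register_hook(&my_postroute_ops) */

Running at NF_IP_PRI_SELINUX_LAST ensures the unlabelled-destination check is the last word before the packet leaves the box, after any transforms have been applied to the dst bundle that the function walks.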