GIT 5d993b3c9996d8df5a83a2d0f7e056b8963d7ee1 git+ssh://master.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.17.git commit 5d993b3c9996d8df5a83a2d0f7e056b8963d7ee1 Author: Harald Welte Date: Mon Feb 13 02:35:31 2006 -0800 [NETFILTER] nf_conntrack: clean up to reduce size of 'struct nf_conn' This patch moves all helper related data fields of 'struct nf_conn' into a separate structure 'struct nf_conn_help'. This new structure is only present in conntrack entries for which we actually have a helper loaded. Also, this patch cleans up the nf_conntrack 'features' mechanism to resemble what the original idea was: Just glue the feature-specific data structures at the end of 'struct nf_conn', and explicitly re-calculate the pointer to it when needed rather than keeping pointers around. Saves 20 bytes per conntrack on my x86_64 box. A non-helped conntrack is 276 bytes. We still need to save another 20 bytes in order to fit into to target of 256bytes. Signed-off-by: Harald Welte Signed-off-by: David S. Miller commit b14652393ae4ea569e655c2b36158aeb74531a52 Author: Michael Chan Date: Fri Feb 10 14:48:13 2006 -0800 [BNX2]: include Include so that it compiles properly on all archs. Update version to 1.4.38. Signed-off-by: Michael Chan Signed-off-by: David S. Miller commit 55bb045aa49d5e5234c6213d1ed0bfef0c636971 Author: John Heffner Date: Sun Feb 5 22:14:53 2006 -0800 [TCP]: MTU probing Implementation of packetization layer path mtu discovery for TCP, based on the internet-draft currently found at . Signed-off-by: John Heffner Signed-off-by: David S. Miller commit 02eaa3d3c84eeac4c22b79a930aff0a22a0a472e Author: Michael Chan Date: Sun Feb 5 18:10:56 2006 -0800 [BNX2]: Update version Update version to 1.4.37. Add missing flush_scheduled_work() in bnx2_suspend as noted by Jeff Garzik. Signed-off-by: Michael Chan Signed-off-by: David S. Miller commit 1757bf951cc7bf8b570c442529aed666f01ae936 Author: Michael Chan Date: Sun Feb 5 18:10:29 2006 -0800 [BNX2]: Support larger rx ring sizes (part 2) Support bigger rx ring sizes (up to 1020) in the rx fast path. Signed-off-by: Michael Chan Signed-off-by: David S. Miller commit 66c19fb0b2af2bcf6f5d12767531201ad7c45413 Author: Michael Chan Date: Sun Feb 5 18:10:01 2006 -0800 [BNX2]: Support larger rx ring sizes (part 1) Increase maximum receive ring size from 255 to 1020 by supporting up to 4 linked pages of receive descriptors. To accomodate the higher memory usage, each physical descriptor page is allocated separately and the software ring that keeps track of the SKBs and the DMA addresses is allocated using vmalloc. Some of the receive-related fields in the bp structure are re- organized a bit for better locality of reference. The max. was reduced to 1020 from 4080 after discussion with David Miller. This patch contains ring init code changes only. This next patch contains rx data path code changes. Signed-off-by: Michael Chan Signed-off-by: David S. Miller commit f01eb69fb4843a13ced4c389ace4fe12b4ebc6d8 Author: Michael Chan Date: Sun Feb 5 18:09:29 2006 -0800 [BNX2]: Fix bug when rx ring is full Fix the rx code path that does not handle the full rx ring correctly. When the rx ring is set to the max. size (i.e. 255), the consumer and producer indices will be the same when completing an rx packet. Fix the rx code to handle this condition properly. Signed-off-by: Michael Chan Signed-off-by: David S. Miller commit b97f7e16ec6dc5f9e054845c3e808abe4ad35d9b Author: Michael Chan Date: Sun Feb 5 18:08:55 2006 -0800 [BNX2]: Add ethtool -d support Add ETHTOOL_GREGS support. Signed-off-by: Michael Chan Signed-off-by: David S. Miller commit d6b5cbaef96284548de32a785434b73606dc8b2f Author: Michael Chan Date: Sun Feb 5 18:08:16 2006 -0800 [BNX2]: Reduce register test size Eliminate some of the registers in ethtool register test to reduce driver size. Signed-off-by: Michael Chan Signed-off-by: David S. Miller commit a9215c5c856a156562246de5f5c3f75137171af7 Author: Michael Chan Date: Sun Feb 5 16:30:20 2006 -0800 [TG3]: Update version and reldate Update version to 3.50. Signed-off-by: Michael Chan Signed-off-by: David S. Miller commit f53f4abb514cb1b07f66738b2964255786bf6fdc Author: Michael Chan Date: Sun Feb 5 16:27:37 2006 -0800 [TG3]: Support shutdown WoL. Support WoL during shutdown by calling tg3_set_power_state(tp, PCI_D3hot) during tg3_close(). Change the power state parameter to pci_power_t type and use constants defined in pci.h. Certain ethtool operations cannot be performed after tg3_close() because the device will go to low power state. Add return -EAGAIN in such cases where appropriate. Signed-off-by: Michael Chan Signed-off-by: David S. Miller commit e61bab3fd86b1d0eed5f8b25c2057e8f4ce24813 Author: Michael Chan Date: Sun Feb 5 16:27:02 2006 -0800 [TG3]: Enable TSO by default Enable TSO by default on newer chips that support TSO in hardware. Leave TSO off by default on older chips that do firmware TSO because performance is slightly lower. Signed-off-by: Michael Chan Signed-off-by: David S. Miller commit 2acd34c87fda44f10266883377118c7324223f32 Author: Michael Chan Date: Sun Feb 5 16:26:37 2006 -0800 [TG3]: Add support for 5714S and 5715S Add support for 5714S and 5715S. Signed-off-by: Michael Chan Signed-off-by: David S. Miller commit e7d06140499657a8c76c3f696dd7ea12df721984 Author: Adrian Bunk Date: Sat Feb 4 15:51:24 2006 -0800 [IPV4] fib_rules.c: make struct fib_rules static again struct fib_rules became global for no good reason. Signed-off-by: Adrian Bunk Signed-off-by: David S. Miller commit 6c5a258f773cce164662fc88ff62d2764eec398b Author: Jesper Juhl Date: Sat Feb 4 15:50:27 2006 -0800 [IPCOMP6]: don't check vfree() argument for NULL. vfree does it's own NULL checking, so checking a pointer before handing it to vfree is pointless. Signed-off-by: Jesper Juhl Signed-off-by: David S. Miller commit ec331fa6d35c5f4249c1fdd03438b43fd655f9a1 Author: Andrea Bittau Date: Sat Feb 4 19:10:17 2006 -0200 [DCCP]: Initial feature negotiation implementation Still needs more work, but boots and doesn't crashes, even does some negotiation! 18:38:52.174934 127.0.0.1.43458 > 127.0.0.1.5001: request 18:38:52.218526 127.0.0.1.5001 > 127.0.0.1.43458: response 18:38:52.185398 127.0.0.1.43458 > 127.0.0.1.5001: :-) Signed-off-by: Andrea Bittau Signed-off-by: Arnaldo Carvalho de Melo commit 057184f56752b9d1f4c6ea38b5ad750c0b6d4692 Author: Andrea Bittau Date: Fri Feb 3 15:56:41 2006 -0200 [DCCP] CCID2: Initial CCID2 (TCP-Like) implementation Original work by Andrea Bittau, Arnaldo Melo cleaned up and fixed several issues on the merge process. For now CCID2 was turned the default for all SOCK_DCCP connections, but this will be remedied soon with the merge of the feature negotiation code. Signed-off-by: Andrea Bittau Signed-off-by: Arnaldo Carvalho de Melo commit 17c943adb4c84429595e76a491e1b05ce1bf9eb6 Author: Arnaldo Carvalho de Melo Date: Fri Feb 3 15:49:14 2006 -0200 [DCCP] CCID3: Set the no_feedback_timer fields near init_timer Signed-off-by: Arnaldo Carvalho de Melo commit 88f542e044ce469de5953b312062778333bcf568 Author: Arnaldo Carvalho de Melo Date: Fri Feb 3 15:30:45 2006 -0200 [DCCP]: Don't alloc ack vector for the control sock Signed-off-by: Arnaldo Carvalho de Melo commit 27737d52978d1682b048fec8b4479c3483fd6202 Author: Arnaldo Carvalho de Melo Date: Fri Feb 3 15:28:41 2006 -0200 [DCCP] ackvec: Delete all the ack vector records in dccp_ackvec_free Signed-off-by: Arnaldo Carvalho de Melo commit d9c237ac462b8f53fdf637191bcaad6e3af258d5 Author: Arnaldo Carvalho de Melo Date: Fri Feb 3 15:23:55 2006 -0200 [DCCP] CCID: Allow ccid_{init,exit} to be NULL Testing if the ccid being instantiated has these methods in ccid_init(). Signed-off-by: Arnaldo Carvalho de Melo commit 10db97358ec72904821bae42269e74deccddb4b7 Author: Arnaldo Carvalho de Melo Date: Fri Feb 3 11:04:36 2006 -0200 [DCCP] ackvec: Introduce ack vector records Based on a patch by Andrea Bittau. Signed-off-by: Andrea Bittau Signed-off-by: Arnaldo Carvalho de Melo commit 43e56221a1168de330e44df1e914722d9b26f9af Author: Arnaldo Carvalho de Melo Date: Fri Feb 3 11:00:59 2006 -0200 [LIST]: Introduce list_for_each_entry_from For iterating over list of given type continuing from existing point. Signed-off-by: Arnaldo Carvalho de Melo commit 2b82c4ef2389d39f835bf52fb5623420a4e909ee Author: Robert Olsson Date: Thu Feb 2 17:31:35 2006 -0800 [IPV4]: Use RCU locking in fib_rules. Signed-off-by: Robert Olsson Signed-off-by: David S. Miller commit 195ade10a4a445737b6a1e87a0b8360ec0f54a31 Author: Arnaldo Carvalho de Melo Date: Thu Feb 2 18:15:14 2006 -0200 [LIST]: Introduce list_for_each_entry_safe_from For iterate over list of given type from existing point safe against removal of list entry. Signed-off-by: Arnaldo Carvalho de Melo commit 0c137fb9bfabac7011924b5475fff9093122c3dd Author: Arnaldo Carvalho de Melo Date: Thu Feb 2 13:20:15 2006 -0200 [DCCP] ackvec: Introduce dccp_ackvec_slab Signed-off-by: Arnaldo Carvalho de Melo commit f182587ae43142b0a232f05d36e3e319dc165a17 Author: Arnaldo Carvalho de Melo Date: Thu Feb 2 11:52:30 2006 -0200 [DCCP]: Fix error handling in dccp_init Signed-off-by: Arnaldo Carvalho de Melo commit 8466461c84119ff3fa8d16d96c699c614a0278fe Author: Arnaldo Carvalho de Melo Date: Thu Feb 2 10:05:41 2006 -0200 [DCCP] ackvec: Ditch dccpav_buf_len Simplifying the code a bit as we're always using DCCP_MAX_ACKVEC_LEN. Signed-off-by: Arnaldo Carvalho de Melo commit 9beafe9bf84d81d6c1593fa48256196bbcc3ff2b Author: David S. Miller Date: Wed Feb 1 00:39:07 2006 -0800 [NET] socket: Use put_filp not fput. Another conversion error in the sock_map_fd() splitup for sys_accept(). Based upon a report by Andrew Morton. Signed-off-by: David S. Miller commit 80f86b4f08bfab3e64f87f231770771a056f725f Author: Harald Welte Date: Mon Jan 30 15:50:09 2006 -0800 [NETFILTER] nfnetlink_log: add sequence numbers for log events By using a sequence number for every logged netfilter event, we can determine from userspace whether logging information was lots somewhere downstream. The user has a choice of either having per-instance local sequence counters, or using a global sequence counter, or both. Signed-off-by: Harald Welte Signed-off-by: David S. Miller commit 6cc67900b13fab33fdd59bfdba155aecb037f244 Author: Harald Welte Date: Mon Jan 30 15:49:48 2006 -0800 [NETFILTER] NAT sequence adjustment: Save eight bytes per conntrack This patch reduces the size of 'struct ip_conntrack' on systems with NAT by eight bytes. The sequence number delta values can be int16_t, since we only support one sequence number modification per window anyway, and one such modification is not going to exceed 32kB ;) Signed-off-by: Harald Welte Signed-off-by: David S. Miller commit f308e9c1861a86fdac9c544f81ed2a683eeff59f Author: David S. Miller Date: Sun Jan 29 21:08:25 2006 -0800 [NET] sys_accept: Pass correct socket to sock_attach_fd(). Also, make sure "filep" is assigned to on every path through sock_alloc_fd(). Thanks to Andrew Morton for the OOPS trace. Signed-off-by: David S. Miller commit 9fab2ff4a8d036902a8ea64e24969296cae724c6 Author: David S. Miller Date: Thu Jan 26 20:45:15 2006 -0800 [NET]: Do not lose accepted socket when -ENFILE/-EMFILE. Try to allocate the struct file and an unused file descriptor before we try to pull a newly accepted socket out of the protocol layer. Based upon a patch by Prassana Meda. Signed-off-by: David S. Miller commit a5a4b332720327cb179dc7f07f66036acdde7a8f Author: Patrick McHardy Date: Tue Jan 24 12:43:55 2006 -0800 [NET]: Reduce size of struct sk_buff on 64 bit architectures Move skb->nf_mark next to skb->tc_index to remove a 4 byte hole between skb->nfmark and skb->nfct and another one between skb->users and skb->head when CONFIG_NETFILTER, CONFIG_NET_SCHED and CONFIG_NET_CLS_ACT are enabled. For all other combinations the size stays the same. Signed-off-by: Patrick McHardy Signed-off-by: David S. Miller commit 8efbb841a312701ddbd7663f03cffaa2d385f59f Author: Stefan Rompf Date: Fri Jan 20 02:56:51 2006 -0800 [VLAN]: translate IF_OPER_DORMANT to netif_dormant_on() this patch adds support to the VLAN driver to translate IF_OPER_DORMANT of the underlying device to netif_dormant_on(). Beside clean state forwarding, this allows running independant userspace supplicants on both the real device and the stacked VLAN. It depends on my RFC2863 patch. Signed-off-by: Stefan Rompf Signed-off-by: David S. Miller commit eaa5bb15d1d88c1960f6f45f79bd0eafe15bbeee Author: Stefan Rompf Date: Fri Jan 20 02:55:48 2006 -0800 [NET] core: add RFC2863 operstate this patch adds a dormant flag to network devices, RFC2863 operstate derived from these flags and possibility for userspace interaction. It allows drivers to signal that a device is unusable for user traffic without disabling queueing (and therefore the possibility for protocol establishment traffic to flow) and a userspace supplicant (WPA, 802.1X) to mark a device unusable without changes to the driver. It is the result of our long discussion. However I must admit that it represents what Jamal and I agreed on with compromises towards Krzysztof, but Thomas and Krzysztof still disagree with some parts. Anyway I think it should be applied. Signed-off-by: Stefan Rompf Signed-off-by: David S. Miller commit de45361b34ddb6a929521282229916d6e95394b5 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:57 2006 +0900 [IPV6]: ROUTE: Ensure to accept redirects from nexthop for the target. It is possible to get redirects from nexthop of "more-specific" routes. Signed-off-by: YOSHIFUJI Hideaki commit f61f070fa6449dd8debd6253af5b0ca9dcbb0c8b Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:55 2006 +0900 [IPV6]: ROUTE: Add accept_ra_rt_info_max_plen sysctl. Signed-off-by: YOSHIFUJI Hideaki commit 3cef153c103b5b8329167a8a83204a393f4ccae8 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:54 2006 +0900 [IPV6]: ROUTE: Flag RTF_DEFAULT for Route Infomation for ::/0. Signed-off-by: YOSHIFUJI Hideaki commit b31bb7186fc9fd4e7a5e1be6f9b32e9d4469f05d Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:53 2006 +0900 [IPV6]: ROUTE: Add experimental support for Route Information Option in RA (RFC4191). Signed-off-by: YOSHIFUJI Hideaki commit 3439e162018e45b4f67d267fb195582f1b8be6d3 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:51 2006 +0900 [IPV6]: ROUTE: Add router_probe_interval sysctl. Signed-off-by: YOSHIFUJI Hideaki commit 078fe970b7382489b05749034d1e3516cf322bc0 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:50 2006 +0900 [IPV6]: ROUTE: Add accept_ra_rtr_pref sysctl. Signed-off-by: YOSHIFUJI Hideaki commit 704df97d60bd37a823d738cfdf87419d1414ddd7 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:49 2006 +0900 [IPV6]: ROUTE: Add Router Reachability Probing (RFC4191). Signed-off-by: YOSHIFUJI Hideaki commit c9897cdfd6d16003534c1780c468a24288440f47 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:47 2006 +0900 [IPV6]: ROUTE: Add support for Router Preference (RFC4191). Signed-off-by: YOSHIFUJI Hideaki commit b7cf3132b15b509518db04f2dc10f13beca82c1e Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:46 2006 +0900 [IPV6]: ROUTE: Handle finding the next best route in reachability in BACKTRACK(). Signed-off-by: YOSHIFUJI Hideaki commit 8329407f124cffc7b321139724142d4696cc8491 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:45 2006 +0900 [IPV6]: ROUTE: Try finding the next best route. Signed-off-by: YOSHIFUJI Hideaki commit 85ffbd5103fea8c612435e025dbe4634b6094786 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:44 2006 +0900 [IPV6]: ROUTE: Clean up rt6_select() code path in ip6_route_{intput,output}(). Signed-off-by: YOSHIFUJI Hideaki commit 3692bcfe86df0c8fe9345c806f3594a0e9ae6cda Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:43 2006 +0900 [IPV6]: ROUTE: Try selecting better route for non-default routes as well. Signed-off-by: YOSHIFUJI Hideaki commit 163845215d5967a31fd12445670047c6f20bb2b7 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:41 2006 +0900 [IPV6]: ROUTE: More strict check for default routers in rt6_get_dflt_router(). Check RTF_ADDRCONF|RTF_DEFAULT in rt6_get_dflt_router(). Signed-off-by: YOSHIFUJI Hideaki commit 3aed04a3e0b47615c976aa115281b5d8bf45412f Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:40 2006 +0900 [IPV6]: ROUTE: Eliminate lock for default route pointer. And prepare for more advanced router selection. Signed-off-by: YOSHIFUJI Hideaki commit 8b212b7042c3a40cc511ae011d96b9952ea3cd8d Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:39 2006 +0900 [IPV6]: ROUTE: Clean-up cow'ing in ip6_route_{intput,output}(). Signed-off-by: YOSHIFUJI Hideaki commit 6cb62171ab64db059bb07a98b543f00a1881e76e Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:38 2006 +0900 [IPV6]: ROUTE: Convert rt6_cow() to rt6_alloc_cow(). Signed-off-by: YOSHIFUJI Hideaki commit 27fc8ced0db48e4224cba23fe4a5d27b5dd13aa7 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:36 2006 +0900 [IPV6]: ROUTE: Clean up reference counting / unlocking for returning object. Signed-off-by: YOSHIFUJI Hideaki commit 1422fff2c7f034b2428cf3026c3e9d94b4805212 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:35 2006 +0900 [IPV6]: ROUTE: Unify two code paths for pmtu disc. Signed-off-by: YOSHIFUJI Hideaki commit 640d8dee123ebbc4fad3a3f2b7fa8398a3d51415 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:34 2006 +0900 [IPV6]: ROUTE: Add rt6_alloc_clone() for cloning route allocation. Signed-off-by: YOSHIFUJI Hideaki commit dc8df78d0f7fbacc2207fafafa9d82693e574c1e Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:33 2006 +0900 [IPV6]: ROUTE: Copy u.dst.error for RTF_REJECT routes when cloning. Signed-off-by: YOSHIFUJI Hideaki commit 3c35bbd06cfba8b9607f046db313eb58e2ea775a Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:32 2006 +0900 [IPV6]: ROUTE: Set appropriate information before inserting a route. Signed-off-by: YOSHIFUJI Hideaki commit ba3553dcb260670a9423d085f3c9227a4508043a Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:30 2006 +0900 [IPV6]: ROUTE: Split up rt6_cow() for future changes. Signed-off-by: YOSHIFUJI Hideaki commit e33fea75479da39eea82207df26f5d93399a082d Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:29 2006 +0900 [IPV6]: ADDRCONF: Add accept_ra_pinfo sysctl. This controls whether we accept Prefix Information in RAs. Signed-off-by: YOSHIFUJI Hideaki commit 6bbb6f6b894c37741637bc6c6f7f093202d77d85 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:28 2006 +0900 [IPV6]: ROUTE: Add accept_ra_defrtr sysctl. This controls whether we accept default router information in RAs. Signed-off-by: YOSHIFUJI Hideaki commit 3f44bf0ec4e37fc9eddba14d741f9e8bc57f600b Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:27 2006 +0900 [IPV6]: ADDRCONF: Split up ipv6_generate_eui64() by device type. Signed-off-by: YOSHIFUJI Hideaki commit a9916dac9a6812b4b9e0b8035459e5f566588877 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:25 2006 +0900 [IPV6]: ADDRCONF: Use our standard algorithm for randomized ifid. RFC 3041 describes an algorithm to generate random interface identifier. In RFC 3041bis, it is allowed to use different algorithm than one described in RFC 3041. So, let's use our standard pseudo random algorithm to simplify our implementation. Signed-off-by: YOSHIFUJI Hideaki commit ec1ca6961bbd7f98fdece1ac61aa9b741049614c Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:24 2006 +0900 [NET]: NEIGHBOUR: Ensure to record time to neigh->updated when neighbour's state changed. Signed-off-by: YOSHIFUJI Hideaki commit bec9a273e762875828629cda18ef103e23531415 Author: YOSHIFUJI Hideaki Date: Tue Jan 10 01:49:23 2006 +0900 [IPV6]: TUNNEL6: Don't try to add multicast route twice. Since addrconf_add_dev() has already called addrconf_add_mroute() to added route for multicast prefix, there's no point to call it again in addrconf_ip6_tnl_config(). Signed-off-by: YOSHIFUJI Hideaki --- diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 26364d0..35aed1c 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -717,6 +717,33 @@ accept_ra - BOOLEAN Functional default: enabled if local forwarding is disabled. disabled if local forwarding is enabled. +accept_ra_defrtr - BOOLEAN + Learn default router in Router Advertisement. + + Functional default: enabled if accept_ra is enabled. + disabled if accept_ra is disabled. + +accept_ra_pinfo - BOOLEAN + Learn Prefix Inforamtion in Router Advertisement. + + Functional default: enabled if accept_ra is enabled. + disabled if accept_ra is disabled. + +accept_ra_rt_info_max_plen - INTEGER + Maximum prefix length of Route Information in RA. + + Route Information w/ prefix larger than or equal to this + variable shall be ignored. + + Functional default: 0 if accept_ra_rtr_pref is enabled. + -1 if accept_ra_rtr_pref is disabled. + +accept_ra_rtr_pref - BOOLEAN + Accept Router Preference in RA. + + Functional default: enabled if accept_ra is enabled. + disabled if accept_ra is disabled. + accept_redirects - BOOLEAN Accept Redirects. @@ -727,8 +754,8 @@ autoconf - BOOLEAN Autoconfigure addresses using Prefix Information in Router Advertisements. - Functional default: enabled if accept_ra is enabled. - disabled if accept_ra is disabled. + Functional default: enabled if accept_ra_pinfo is enabled. + disabled if accept_ra_pinfo is disabled. dad_transmits - INTEGER The amount of Duplicate Address Detection probes to send. @@ -771,6 +798,12 @@ mtu - INTEGER Default Maximum Transfer Unit Default: 1280 (IPv6 required minimum) +router_probe_interval - INTEGER + Minimum interval (in seconds) between Router Probing described + in RFC4191. + + Default: 60 + router_solicitation_delay - INTEGER Number of seconds to wait after interface is brought up before sending Router Solicitations. diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c index a24200d..fc19fb8 100644 --- a/drivers/net/bnx2.c +++ b/drivers/net/bnx2.c @@ -14,8 +14,8 @@ #define DRV_MODULE_NAME "bnx2" #define PFX DRV_MODULE_NAME ": " -#define DRV_MODULE_VERSION "1.4.31" -#define DRV_MODULE_RELDATE "January 19, 2006" +#define DRV_MODULE_VERSION "1.4.38" +#define DRV_MODULE_RELDATE "February 10, 2006" #define RUN_AT(x) (jiffies + (x)) @@ -360,6 +360,8 @@ bnx2_netif_start(struct bnx2 *bp) static void bnx2_free_mem(struct bnx2 *bp) { + int i; + if (bp->stats_blk) { pci_free_consistent(bp->pdev, sizeof(struct statistics_block), bp->stats_blk, bp->stats_blk_mapping); @@ -378,19 +380,23 @@ bnx2_free_mem(struct bnx2 *bp) } kfree(bp->tx_buf_ring); bp->tx_buf_ring = NULL; - if (bp->rx_desc_ring) { - pci_free_consistent(bp->pdev, - sizeof(struct rx_bd) * RX_DESC_CNT, - bp->rx_desc_ring, bp->rx_desc_mapping); - bp->rx_desc_ring = NULL; + for (i = 0; i < bp->rx_max_ring; i++) { + if (bp->rx_desc_ring[i]) + pci_free_consistent(bp->pdev, + sizeof(struct rx_bd) * RX_DESC_CNT, + bp->rx_desc_ring[i], + bp->rx_desc_mapping[i]); + bp->rx_desc_ring[i] = NULL; } - kfree(bp->rx_buf_ring); + vfree(bp->rx_buf_ring); bp->rx_buf_ring = NULL; } static int bnx2_alloc_mem(struct bnx2 *bp) { + int i; + bp->tx_buf_ring = kmalloc(sizeof(struct sw_bd) * TX_DESC_CNT, GFP_KERNEL); if (bp->tx_buf_ring == NULL) @@ -404,18 +410,23 @@ bnx2_alloc_mem(struct bnx2 *bp) if (bp->tx_desc_ring == NULL) goto alloc_mem_err; - bp->rx_buf_ring = kmalloc(sizeof(struct sw_bd) * RX_DESC_CNT, - GFP_KERNEL); + bp->rx_buf_ring = vmalloc(sizeof(struct sw_bd) * RX_DESC_CNT * + bp->rx_max_ring); if (bp->rx_buf_ring == NULL) goto alloc_mem_err; - memset(bp->rx_buf_ring, 0, sizeof(struct sw_bd) * RX_DESC_CNT); - bp->rx_desc_ring = pci_alloc_consistent(bp->pdev, - sizeof(struct rx_bd) * - RX_DESC_CNT, - &bp->rx_desc_mapping); - if (bp->rx_desc_ring == NULL) - goto alloc_mem_err; + memset(bp->rx_buf_ring, 0, sizeof(struct sw_bd) * RX_DESC_CNT * + bp->rx_max_ring); + + for (i = 0; i < bp->rx_max_ring; i++) { + bp->rx_desc_ring[i] = + pci_alloc_consistent(bp->pdev, + sizeof(struct rx_bd) * RX_DESC_CNT, + &bp->rx_desc_mapping[i]); + if (bp->rx_desc_ring[i] == NULL) + goto alloc_mem_err; + + } bp->status_blk = pci_alloc_consistent(bp->pdev, sizeof(struct status_block), @@ -1520,7 +1531,7 @@ bnx2_alloc_rx_skb(struct bnx2 *bp, u16 i struct sk_buff *skb; struct sw_bd *rx_buf = &bp->rx_buf_ring[index]; dma_addr_t mapping; - struct rx_bd *rxbd = &bp->rx_desc_ring[index]; + struct rx_bd *rxbd = &bp->rx_desc_ring[RX_RING(index)][RX_IDX(index)]; unsigned long align; skb = dev_alloc_skb(bp->rx_buf_size); @@ -1656,23 +1667,30 @@ static inline void bnx2_reuse_rx_skb(struct bnx2 *bp, struct sk_buff *skb, u16 cons, u16 prod) { - struct sw_bd *cons_rx_buf = &bp->rx_buf_ring[cons]; - struct sw_bd *prod_rx_buf = &bp->rx_buf_ring[prod]; - struct rx_bd *cons_bd = &bp->rx_desc_ring[cons]; - struct rx_bd *prod_bd = &bp->rx_desc_ring[prod]; + struct sw_bd *cons_rx_buf, *prod_rx_buf; + struct rx_bd *cons_bd, *prod_bd; + + cons_rx_buf = &bp->rx_buf_ring[cons]; + prod_rx_buf = &bp->rx_buf_ring[prod]; pci_dma_sync_single_for_device(bp->pdev, pci_unmap_addr(cons_rx_buf, mapping), bp->rx_offset + RX_COPY_THRESH, PCI_DMA_FROMDEVICE); - prod_rx_buf->skb = cons_rx_buf->skb; - pci_unmap_addr_set(prod_rx_buf, mapping, - pci_unmap_addr(cons_rx_buf, mapping)); + bp->rx_prod_bseq += bp->rx_buf_use_size; - memcpy(prod_bd, cons_bd, 8); + prod_rx_buf->skb = skb; - bp->rx_prod_bseq += bp->rx_buf_use_size; + if (cons == prod) + return; + + pci_unmap_addr_set(prod_rx_buf, mapping, + pci_unmap_addr(cons_rx_buf, mapping)); + cons_bd = &bp->rx_desc_ring[RX_RING(cons)][RX_IDX(cons)]; + prod_bd = &bp->rx_desc_ring[RX_RING(prod)][RX_IDX(prod)]; + prod_bd->rx_bd_haddr_hi = cons_bd->rx_bd_haddr_hi; + prod_bd->rx_bd_haddr_lo = cons_bd->rx_bd_haddr_lo; } static int @@ -1699,14 +1717,19 @@ bnx2_rx_int(struct bnx2 *bp, int budget) u32 status; struct sw_bd *rx_buf; struct sk_buff *skb; + dma_addr_t dma_addr; sw_ring_cons = RX_RING_IDX(sw_cons); sw_ring_prod = RX_RING_IDX(sw_prod); rx_buf = &bp->rx_buf_ring[sw_ring_cons]; skb = rx_buf->skb; - pci_dma_sync_single_for_cpu(bp->pdev, - pci_unmap_addr(rx_buf, mapping), + + rx_buf->skb = NULL; + + dma_addr = pci_unmap_addr(rx_buf, mapping); + + pci_dma_sync_single_for_cpu(bp->pdev, dma_addr, bp->rx_offset + RX_COPY_THRESH, PCI_DMA_FROMDEVICE); rx_hdr = (struct l2_fhdr *) skb->data; @@ -1747,8 +1770,7 @@ bnx2_rx_int(struct bnx2 *bp, int budget) skb = new_skb; } else if (bnx2_alloc_rx_skb(bp, sw_ring_prod) == 0) { - pci_unmap_single(bp->pdev, - pci_unmap_addr(rx_buf, mapping), + pci_unmap_single(bp->pdev, dma_addr, bp->rx_buf_use_size, PCI_DMA_FROMDEVICE); skb_reserve(skb, bp->rx_offset); @@ -1794,8 +1816,6 @@ reuse_rx: rx_pkt++; next_rx: - rx_buf->skb = NULL; - sw_cons = NEXT_RX_BD(sw_cons); sw_prod = NEXT_RX_BD(sw_prod); @@ -3340,27 +3360,35 @@ bnx2_init_rx_ring(struct bnx2 *bp) bp->hw_rx_cons = 0; bp->rx_prod_bseq = 0; - rxbd = &bp->rx_desc_ring[0]; - for (i = 0; i < MAX_RX_DESC_CNT; i++, rxbd++) { - rxbd->rx_bd_len = bp->rx_buf_use_size; - rxbd->rx_bd_flags = RX_BD_FLAGS_START | RX_BD_FLAGS_END; - } + for (i = 0; i < bp->rx_max_ring; i++) { + int j; - rxbd->rx_bd_haddr_hi = (u64) bp->rx_desc_mapping >> 32; - rxbd->rx_bd_haddr_lo = (u64) bp->rx_desc_mapping & 0xffffffff; + rxbd = &bp->rx_desc_ring[i][0]; + for (j = 0; j < MAX_RX_DESC_CNT; j++, rxbd++) { + rxbd->rx_bd_len = bp->rx_buf_use_size; + rxbd->rx_bd_flags = RX_BD_FLAGS_START | RX_BD_FLAGS_END; + } + if (i == (bp->rx_max_ring - 1)) + j = 0; + else + j = i + 1; + rxbd->rx_bd_haddr_hi = (u64) bp->rx_desc_mapping[j] >> 32; + rxbd->rx_bd_haddr_lo = (u64) bp->rx_desc_mapping[j] & + 0xffffffff; + } val = BNX2_L2CTX_CTX_TYPE_CTX_BD_CHN_TYPE_VALUE; val |= BNX2_L2CTX_CTX_TYPE_SIZE_L2; val |= 0x02 << 8; CTX_WR(bp, GET_CID_ADDR(RX_CID), BNX2_L2CTX_CTX_TYPE, val); - val = (u64) bp->rx_desc_mapping >> 32; + val = (u64) bp->rx_desc_mapping[0] >> 32; CTX_WR(bp, GET_CID_ADDR(RX_CID), BNX2_L2CTX_NX_BDHADDR_HI, val); - val = (u64) bp->rx_desc_mapping & 0xffffffff; + val = (u64) bp->rx_desc_mapping[0] & 0xffffffff; CTX_WR(bp, GET_CID_ADDR(RX_CID), BNX2_L2CTX_NX_BDHADDR_LO, val); - for ( ;ring_prod < bp->rx_ring_size; ) { + for (i = 0; i < bp->rx_ring_size; i++) { if (bnx2_alloc_rx_skb(bp, ring_prod) < 0) { break; } @@ -3375,6 +3403,29 @@ bnx2_init_rx_ring(struct bnx2 *bp) } static void +bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size) +{ + u32 num_rings, max; + + bp->rx_ring_size = size; + num_rings = 1; + while (size > MAX_RX_DESC_CNT) { + size -= MAX_RX_DESC_CNT; + num_rings++; + } + /* round to next power of 2 */ + max = MAX_RX_RINGS; + while ((max & num_rings) == 0) + max >>= 1; + + if (num_rings != max) + max <<= 1; + + bp->rx_max_ring = max; + bp->rx_max_ring_idx = (bp->rx_max_ring * RX_DESC_CNT) - 1; +} + +static void bnx2_free_tx_skbs(struct bnx2 *bp) { int i; @@ -3419,7 +3470,7 @@ bnx2_free_rx_skbs(struct bnx2 *bp) if (bp->rx_buf_ring == NULL) return; - for (i = 0; i < RX_DESC_CNT; i++) { + for (i = 0; i < bp->rx_max_ring_idx; i++) { struct sw_bd *rx_buf = &bp->rx_buf_ring[i]; struct sk_buff *skb = rx_buf->skb; @@ -3506,74 +3557,9 @@ bnx2_test_registers(struct bnx2 *bp) { 0x0c00, 0, 0x00000000, 0x00000001 }, { 0x0c04, 0, 0x00000000, 0x03ff0001 }, { 0x0c08, 0, 0x0f0ff073, 0x00000000 }, - { 0x0c0c, 0, 0x00ffffff, 0x00000000 }, - { 0x0c30, 0, 0x00000000, 0xffffffff }, - { 0x0c34, 0, 0x00000000, 0xffffffff }, - { 0x0c38, 0, 0x00000000, 0xffffffff }, - { 0x0c3c, 0, 0x00000000, 0xffffffff }, - { 0x0c40, 0, 0x00000000, 0xffffffff }, - { 0x0c44, 0, 0x00000000, 0xffffffff }, - { 0x0c48, 0, 0x00000000, 0x0007ffff }, - { 0x0c4c, 0, 0x00000000, 0xffffffff }, - { 0x0c50, 0, 0x00000000, 0xffffffff }, - { 0x0c54, 0, 0x00000000, 0xffffffff }, - { 0x0c58, 0, 0x00000000, 0xffffffff }, - { 0x0c5c, 0, 0x00000000, 0xffffffff }, - { 0x0c60, 0, 0x00000000, 0xffffffff }, - { 0x0c64, 0, 0x00000000, 0xffffffff }, - { 0x0c68, 0, 0x00000000, 0xffffffff }, - { 0x0c6c, 0, 0x00000000, 0xffffffff }, - { 0x0c70, 0, 0x00000000, 0xffffffff }, - { 0x0c74, 0, 0x00000000, 0xffffffff }, - { 0x0c78, 0, 0x00000000, 0xffffffff }, - { 0x0c7c, 0, 0x00000000, 0xffffffff }, - { 0x0c80, 0, 0x00000000, 0xffffffff }, - { 0x0c84, 0, 0x00000000, 0xffffffff }, - { 0x0c88, 0, 0x00000000, 0xffffffff }, - { 0x0c8c, 0, 0x00000000, 0xffffffff }, - { 0x0c90, 0, 0x00000000, 0xffffffff }, - { 0x0c94, 0, 0x00000000, 0xffffffff }, - { 0x0c98, 0, 0x00000000, 0xffffffff }, - { 0x0c9c, 0, 0x00000000, 0xffffffff }, - { 0x0ca0, 0, 0x00000000, 0xffffffff }, - { 0x0ca4, 0, 0x00000000, 0xffffffff }, - { 0x0ca8, 0, 0x00000000, 0x0007ffff }, - { 0x0cac, 0, 0x00000000, 0xffffffff }, - { 0x0cb0, 0, 0x00000000, 0xffffffff }, - { 0x0cb4, 0, 0x00000000, 0xffffffff }, - { 0x0cb8, 0, 0x00000000, 0xffffffff }, - { 0x0cbc, 0, 0x00000000, 0xffffffff }, - { 0x0cc0, 0, 0x00000000, 0xffffffff }, - { 0x0cc4, 0, 0x00000000, 0xffffffff }, - { 0x0cc8, 0, 0x00000000, 0xffffffff }, - { 0x0ccc, 0, 0x00000000, 0xffffffff }, - { 0x0cd0, 0, 0x00000000, 0xffffffff }, - { 0x0cd4, 0, 0x00000000, 0xffffffff }, - { 0x0cd8, 0, 0x00000000, 0xffffffff }, - { 0x0cdc, 0, 0x00000000, 0xffffffff }, - { 0x0ce0, 0, 0x00000000, 0xffffffff }, - { 0x0ce4, 0, 0x00000000, 0xffffffff }, - { 0x0ce8, 0, 0x00000000, 0xffffffff }, - { 0x0cec, 0, 0x00000000, 0xffffffff }, - { 0x0cf0, 0, 0x00000000, 0xffffffff }, - { 0x0cf4, 0, 0x00000000, 0xffffffff }, - { 0x0cf8, 0, 0x00000000, 0xffffffff }, - { 0x0cfc, 0, 0x00000000, 0xffffffff }, - { 0x0d00, 0, 0x00000000, 0xffffffff }, - { 0x0d04, 0, 0x00000000, 0xffffffff }, { 0x1000, 0, 0x00000000, 0x00000001 }, { 0x1004, 0, 0x00000000, 0x000f0001 }, - { 0x1044, 0, 0x00000000, 0xffc003ff }, - { 0x1080, 0, 0x00000000, 0x0001ffff }, - { 0x1084, 0, 0x00000000, 0xffffffff }, - { 0x1088, 0, 0x00000000, 0xffffffff }, - { 0x108c, 0, 0x00000000, 0xffffffff }, - { 0x1090, 0, 0x00000000, 0xffffffff }, - { 0x1094, 0, 0x00000000, 0xffffffff }, - { 0x1098, 0, 0x00000000, 0xffffffff }, - { 0x109c, 0, 0x00000000, 0xffffffff }, - { 0x10a0, 0, 0x00000000, 0xffffffff }, { 0x1408, 0, 0x01c00800, 0x00000000 }, { 0x149c, 0, 0x8000ffff, 0x00000000 }, @@ -3585,111 +3571,9 @@ bnx2_test_registers(struct bnx2 *bp) { 0x14c4, 0, 0x00003fff, 0x00000000 }, { 0x14cc, 0, 0x00000000, 0x00000001 }, { 0x14d0, 0, 0xffffffff, 0x00000000 }, - { 0x1500, 0, 0x00000000, 0xffffffff }, - { 0x1504, 0, 0x00000000, 0xffffffff }, - { 0x1508, 0, 0x00000000, 0xffffffff }, - { 0x150c, 0, 0x00000000, 0xffffffff }, - { 0x1510, 0, 0x00000000, 0xffffffff }, - { 0x1514, 0, 0x00000000, 0xffffffff }, - { 0x1518, 0, 0x00000000, 0xffffffff }, - { 0x151c, 0, 0x00000000, 0xffffffff }, - { 0x1520, 0, 0x00000000, 0xffffffff }, - { 0x1524, 0, 0x00000000, 0xffffffff }, - { 0x1528, 0, 0x00000000, 0xffffffff }, - { 0x152c, 0, 0x00000000, 0xffffffff }, - { 0x1530, 0, 0x00000000, 0xffffffff }, - { 0x1534, 0, 0x00000000, 0xffffffff }, - { 0x1538, 0, 0x00000000, 0xffffffff }, - { 0x153c, 0, 0x00000000, 0xffffffff }, - { 0x1540, 0, 0x00000000, 0xffffffff }, - { 0x1544, 0, 0x00000000, 0xffffffff }, - { 0x1548, 0, 0x00000000, 0xffffffff }, - { 0x154c, 0, 0x00000000, 0xffffffff }, - { 0x1550, 0, 0x00000000, 0xffffffff }, - { 0x1554, 0, 0x00000000, 0xffffffff }, - { 0x1558, 0, 0x00000000, 0xffffffff }, - { 0x1600, 0, 0x00000000, 0xffffffff }, - { 0x1604, 0, 0x00000000, 0xffffffff }, - { 0x1608, 0, 0x00000000, 0xffffffff }, - { 0x160c, 0, 0x00000000, 0xffffffff }, - { 0x1610, 0, 0x00000000, 0xffffffff }, - { 0x1614, 0, 0x00000000, 0xffffffff }, - { 0x1618, 0, 0x00000000, 0xffffffff }, - { 0x161c, 0, 0x00000000, 0xffffffff }, - { 0x1620, 0, 0x00000000, 0xffffffff }, - { 0x1624, 0, 0x00000000, 0xffffffff }, - { 0x1628, 0, 0x00000000, 0xffffffff }, - { 0x162c, 0, 0x00000000, 0xffffffff }, - { 0x1630, 0, 0x00000000, 0xffffffff }, - { 0x1634, 0, 0x00000000, 0xffffffff }, - { 0x1638, 0, 0x00000000, 0xffffffff }, - { 0x163c, 0, 0x00000000, 0xffffffff }, - { 0x1640, 0, 0x00000000, 0xffffffff }, - { 0x1644, 0, 0x00000000, 0xffffffff }, - { 0x1648, 0, 0x00000000, 0xffffffff }, - { 0x164c, 0, 0x00000000, 0xffffffff }, - { 0x1650, 0, 0x00000000, 0xffffffff }, - { 0x1654, 0, 0x00000000, 0xffffffff }, { 0x1800, 0, 0x00000000, 0x00000001 }, { 0x1804, 0, 0x00000000, 0x00000003 }, - { 0x1840, 0, 0x00000000, 0xffffffff }, - { 0x1844, 0, 0x00000000, 0xffffffff }, - { 0x1848, 0, 0x00000000, 0xffffffff }, - { 0x184c, 0, 0x00000000, 0xffffffff }, - { 0x1850, 0, 0x00000000, 0xffffffff }, - { 0x1900, 0, 0x7ffbffff, 0x00000000 }, - { 0x1904, 0, 0xffffffff, 0x00000000 }, - { 0x190c, 0, 0xffffffff, 0x00000000 }, - { 0x1914, 0, 0xffffffff, 0x00000000 }, - { 0x191c, 0, 0xffffffff, 0x00000000 }, - { 0x1924, 0, 0xffffffff, 0x00000000 }, - { 0x192c, 0, 0xffffffff, 0x00000000 }, - { 0x1934, 0, 0xffffffff, 0x00000000 }, - { 0x193c, 0, 0xffffffff, 0x00000000 }, - { 0x1944, 0, 0xffffffff, 0x00000000 }, - { 0x194c, 0, 0xffffffff, 0x00000000 }, - { 0x1954, 0, 0xffffffff, 0x00000000 }, - { 0x195c, 0, 0xffffffff, 0x00000000 }, - { 0x1964, 0, 0xffffffff, 0x00000000 }, - { 0x196c, 0, 0xffffffff, 0x00000000 }, - { 0x1974, 0, 0xffffffff, 0x00000000 }, - { 0x197c, 0, 0xffffffff, 0x00000000 }, - { 0x1980, 0, 0x0700ffff, 0x00000000 }, - - { 0x1c00, 0, 0x00000000, 0x00000001 }, - { 0x1c04, 0, 0x00000000, 0x00000003 }, - { 0x1c08, 0, 0x0000000f, 0x00000000 }, - { 0x1c40, 0, 0x00000000, 0xffffffff }, - { 0x1c44, 0, 0x00000000, 0xffffffff }, - { 0x1c48, 0, 0x00000000, 0xffffffff }, - { 0x1c4c, 0, 0x00000000, 0xffffffff }, - { 0x1c50, 0, 0x00000000, 0xffffffff }, - { 0x1d00, 0, 0x7ffbffff, 0x00000000 }, - { 0x1d04, 0, 0xffffffff, 0x00000000 }, - { 0x1d0c, 0, 0xffffffff, 0x00000000 }, - { 0x1d14, 0, 0xffffffff, 0x00000000 }, - { 0x1d1c, 0, 0xffffffff, 0x00000000 }, - { 0x1d24, 0, 0xffffffff, 0x00000000 }, - { 0x1d2c, 0, 0xffffffff, 0x00000000 }, - { 0x1d34, 0, 0xffffffff, 0x00000000 }, - { 0x1d3c, 0, 0xffffffff, 0x00000000 }, - { 0x1d44, 0, 0xffffffff, 0x00000000 }, - { 0x1d4c, 0, 0xffffffff, 0x00000000 }, - { 0x1d54, 0, 0xffffffff, 0x00000000 }, - { 0x1d5c, 0, 0xffffffff, 0x00000000 }, - { 0x1d64, 0, 0xffffffff, 0x00000000 }, - { 0x1d6c, 0, 0xffffffff, 0x00000000 }, - { 0x1d74, 0, 0xffffffff, 0x00000000 }, - { 0x1d7c, 0, 0xffffffff, 0x00000000 }, - { 0x1d80, 0, 0x0700ffff, 0x00000000 }, - - { 0x2004, 0, 0x00000000, 0x0337000f }, - { 0x2008, 0, 0xffffffff, 0x00000000 }, - { 0x200c, 0, 0xffffffff, 0x00000000 }, - { 0x2010, 0, 0xffffffff, 0x00000000 }, - { 0x2014, 0, 0x801fff80, 0x00000000 }, - { 0x2018, 0, 0x000003ff, 0x00000000 }, { 0x2800, 0, 0x00000000, 0x00000001 }, { 0x2804, 0, 0x00000000, 0x00003f01 }, @@ -3707,16 +3591,6 @@ bnx2_test_registers(struct bnx2 *bp) { 0x2c00, 0, 0x00000000, 0x00000011 }, { 0x2c04, 0, 0x00000000, 0x00030007 }, - { 0x3000, 0, 0x00000000, 0x00000001 }, - { 0x3004, 0, 0x00000000, 0x007007ff }, - { 0x3008, 0, 0x00000003, 0x00000000 }, - { 0x300c, 0, 0xffffffff, 0x00000000 }, - { 0x3010, 0, 0xffffffff, 0x00000000 }, - { 0x3014, 0, 0xffffffff, 0x00000000 }, - { 0x3034, 0, 0xffffffff, 0x00000000 }, - { 0x3038, 0, 0xffffffff, 0x00000000 }, - { 0x3050, 0, 0x00000001, 0x00000000 }, - { 0x3c00, 0, 0x00000000, 0x00000001 }, { 0x3c04, 0, 0x00000000, 0x00070000 }, { 0x3c08, 0, 0x00007f71, 0x07f00000 }, @@ -3726,88 +3600,11 @@ bnx2_test_registers(struct bnx2 *bp) { 0x3c18, 0, 0x00000000, 0xffffffff }, { 0x3c1c, 0, 0xfffff000, 0x00000000 }, { 0x3c20, 0, 0xffffff00, 0x00000000 }, - { 0x3c24, 0, 0xffffffff, 0x00000000 }, - { 0x3c28, 0, 0xffffffff, 0x00000000 }, - { 0x3c2c, 0, 0xffffffff, 0x00000000 }, - { 0x3c30, 0, 0xffffffff, 0x00000000 }, - { 0x3c34, 0, 0xffffffff, 0x00000000 }, - { 0x3c38, 0, 0xffffffff, 0x00000000 }, - { 0x3c3c, 0, 0xffffffff, 0x00000000 }, - { 0x3c40, 0, 0xffffffff, 0x00000000 }, - { 0x3c44, 0, 0xffffffff, 0x00000000 }, - { 0x3c48, 0, 0xffffffff, 0x00000000 }, - { 0x3c4c, 0, 0xffffffff, 0x00000000 }, - { 0x3c50, 0, 0xffffffff, 0x00000000 }, - { 0x3c54, 0, 0xffffffff, 0x00000000 }, - { 0x3c58, 0, 0xffffffff, 0x00000000 }, - { 0x3c5c, 0, 0xffffffff, 0x00000000 }, - { 0x3c60, 0, 0xffffffff, 0x00000000 }, - { 0x3c64, 0, 0xffffffff, 0x00000000 }, - { 0x3c68, 0, 0xffffffff, 0x00000000 }, - { 0x3c6c, 0, 0xffffffff, 0x00000000 }, - { 0x3c70, 0, 0xffffffff, 0x00000000 }, - { 0x3c74, 0, 0x0000003f, 0x00000000 }, - { 0x3c78, 0, 0x00000000, 0x00000000 }, - { 0x3c7c, 0, 0x00000000, 0x00000000 }, - { 0x3c80, 0, 0x3fffffff, 0x00000000 }, - { 0x3c84, 0, 0x0000003f, 0x00000000 }, - { 0x3c88, 0, 0x00000000, 0xffffffff }, - { 0x3c8c, 0, 0x00000000, 0xffffffff }, - - { 0x4000, 0, 0x00000000, 0x00000001 }, - { 0x4004, 0, 0x00000000, 0x00030000 }, - { 0x4008, 0, 0x00000ff0, 0x00000000 }, - { 0x400c, 0, 0xffffffff, 0x00000000 }, - { 0x4088, 0, 0x00000000, 0x00070303 }, - - { 0x4400, 0, 0x00000000, 0x00000001 }, - { 0x4404, 0, 0x00000000, 0x00003f01 }, - { 0x4408, 0, 0x7fff00ff, 0x00000000 }, - { 0x440c, 0, 0xffffffff, 0x00000000 }, - { 0x4410, 0, 0xffff, 0x0000 }, - { 0x4414, 0, 0xffff, 0x0000 }, - { 0x4418, 0, 0xffff, 0x0000 }, - { 0x441c, 0, 0xffff, 0x0000 }, - { 0x4428, 0, 0xffffffff, 0x00000000 }, - { 0x442c, 0, 0xffffffff, 0x00000000 }, - { 0x4430, 0, 0xffffffff, 0x00000000 }, - { 0x4434, 0, 0xffffffff, 0x00000000 }, - { 0x4438, 0, 0xffffffff, 0x00000000 }, - { 0x443c, 0, 0xffffffff, 0x00000000 }, - { 0x4440, 0, 0xffffffff, 0x00000000 }, - { 0x4444, 0, 0xffffffff, 0x00000000 }, - - { 0x4c00, 0, 0x00000000, 0x00000001 }, - { 0x4c04, 0, 0x00000000, 0x0000003f }, - { 0x4c08, 0, 0xffffffff, 0x00000000 }, - { 0x4c0c, 0, 0x0007fc00, 0x00000000 }, - { 0x4c10, 0, 0x80003fe0, 0x00000000 }, - { 0x4c14, 0, 0xffffffff, 0x00000000 }, - { 0x4c44, 0, 0x00000000, 0x9fff9fff }, - { 0x4c48, 0, 0x00000000, 0xb3009fff }, - { 0x4c4c, 0, 0x00000000, 0x77f33b30 }, - { 0x4c50, 0, 0x00000000, 0xffffffff }, { 0x5004, 0, 0x00000000, 0x0000007f }, { 0x5008, 0, 0x0f0007ff, 0x00000000 }, { 0x500c, 0, 0xf800f800, 0x07ff07ff }, - { 0x5400, 0, 0x00000008, 0x00000001 }, - { 0x5404, 0, 0x00000000, 0x0000003f }, - { 0x5408, 0, 0x0000001f, 0x00000000 }, - { 0x540c, 0, 0xffffffff, 0x00000000 }, - { 0x5410, 0, 0xffffffff, 0x00000000 }, - { 0x5414, 0, 0x0000ffff, 0x00000000 }, - { 0x5418, 0, 0x0000ffff, 0x00000000 }, - { 0x541c, 0, 0x0000ffff, 0x00000000 }, - { 0x5420, 0, 0x0000ffff, 0x00000000 }, - { 0x5428, 0, 0x000000ff, 0x00000000 }, - { 0x542c, 0, 0xff00ffff, 0x00000000 }, - { 0x5430, 0, 0x001fff80, 0x00000000 }, - { 0x5438, 0, 0xffffffff, 0x00000000 }, - { 0x543c, 0, 0xffffffff, 0x00000000 }, - { 0x5440, 0, 0xf800f800, 0x07ff07ff }, - { 0x5c00, 0, 0x00000000, 0x00000001 }, { 0x5c04, 0, 0x00000000, 0x0003000f }, { 0x5c08, 0, 0x00000003, 0x00000000 }, @@ -4794,6 +4591,64 @@ bnx2_get_drvinfo(struct net_device *dev, info->fw_version[5] = 0; } +#define BNX2_REGDUMP_LEN (32 * 1024) + +static int +bnx2_get_regs_len(struct net_device *dev) +{ + return BNX2_REGDUMP_LEN; +} + +static void +bnx2_get_regs(struct net_device *dev, struct ethtool_regs *regs, void *_p) +{ + u32 *p = _p, i, offset; + u8 *orig_p = _p; + struct bnx2 *bp = netdev_priv(dev); + u32 reg_boundaries[] = { 0x0000, 0x0098, 0x0400, 0x045c, + 0x0800, 0x0880, 0x0c00, 0x0c10, + 0x0c30, 0x0d08, 0x1000, 0x101c, + 0x1040, 0x1048, 0x1080, 0x10a4, + 0x1400, 0x1490, 0x1498, 0x14f0, + 0x1500, 0x155c, 0x1580, 0x15dc, + 0x1600, 0x1658, 0x1680, 0x16d8, + 0x1800, 0x1820, 0x1840, 0x1854, + 0x1880, 0x1894, 0x1900, 0x1984, + 0x1c00, 0x1c0c, 0x1c40, 0x1c54, + 0x1c80, 0x1c94, 0x1d00, 0x1d84, + 0x2000, 0x2030, 0x23c0, 0x2400, + 0x2800, 0x2820, 0x2830, 0x2850, + 0x2b40, 0x2c10, 0x2fc0, 0x3058, + 0x3c00, 0x3c94, 0x4000, 0x4010, + 0x4080, 0x4090, 0x43c0, 0x4458, + 0x4c00, 0x4c18, 0x4c40, 0x4c54, + 0x4fc0, 0x5010, 0x53c0, 0x5444, + 0x5c00, 0x5c18, 0x5c80, 0x5c90, + 0x5fc0, 0x6000, 0x6400, 0x6428, + 0x6800, 0x6848, 0x684c, 0x6860, + 0x6888, 0x6910, 0x8000 }; + + regs->version = 0; + + memset(p, 0, BNX2_REGDUMP_LEN); + + if (!netif_running(bp->dev)) + return; + + i = 0; + offset = reg_boundaries[0]; + p += offset; + while (offset < BNX2_REGDUMP_LEN) { + *p++ = REG_RD(bp, offset); + offset += 4; + if (offset == reg_boundaries[i + 1]) { + offset = reg_boundaries[i + 2]; + p = (u32 *) (orig_p + offset); + i += 2; + } + } +} + static void bnx2_get_wol(struct net_device *dev, struct ethtool_wolinfo *wol) { @@ -4979,7 +4834,7 @@ bnx2_get_ringparam(struct net_device *de { struct bnx2 *bp = netdev_priv(dev); - ering->rx_max_pending = MAX_RX_DESC_CNT; + ering->rx_max_pending = MAX_TOTAL_RX_DESC_CNT; ering->rx_mini_max_pending = 0; ering->rx_jumbo_max_pending = 0; @@ -4996,17 +4851,28 @@ bnx2_set_ringparam(struct net_device *de { struct bnx2 *bp = netdev_priv(dev); - if ((ering->rx_pending > MAX_RX_DESC_CNT) || + if ((ering->rx_pending > MAX_TOTAL_RX_DESC_CNT) || (ering->tx_pending > MAX_TX_DESC_CNT) || (ering->tx_pending <= MAX_SKB_FRAGS)) { return -EINVAL; } - bp->rx_ring_size = ering->rx_pending; + if (netif_running(bp->dev)) { + bnx2_netif_stop(bp); + bnx2_reset_chip(bp, BNX2_DRV_MSG_CODE_RESET); + bnx2_free_skbs(bp); + bnx2_free_mem(bp); + } + + bnx2_set_rx_ring_size(bp, ering->rx_pending); bp->tx_ring_size = ering->tx_pending; if (netif_running(bp->dev)) { - bnx2_netif_stop(bp); + int rc; + + rc = bnx2_alloc_mem(bp); + if (rc) + return rc; bnx2_init_nic(bp); bnx2_netif_start(bp); } @@ -5360,6 +5226,8 @@ static struct ethtool_ops bnx2_ethtool_o .get_settings = bnx2_get_settings, .set_settings = bnx2_set_settings, .get_drvinfo = bnx2_get_drvinfo, + .get_regs_len = bnx2_get_regs_len, + .get_regs = bnx2_get_regs, .get_wol = bnx2_get_wol, .set_wol = bnx2_set_wol, .nway_reset = bnx2_nway_reset, @@ -5678,7 +5546,7 @@ bnx2_init_board(struct pci_dev *pdev, st bp->mac_addr[5] = (u8) reg; bp->tx_ring_size = MAX_TX_DESC_CNT; - bp->rx_ring_size = 100; + bnx2_set_rx_ring_size(bp, 100); bp->rx_csum = 1; @@ -5897,6 +5765,7 @@ bnx2_suspend(struct pci_dev *pdev, pm_me if (!netif_running(dev)) return 0; + flush_scheduled_work(); bnx2_netif_stop(bp); netif_device_detach(dev); del_timer_sync(&bp->timer); diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h index 9f691cb..fd4b7f2 100644 --- a/drivers/net/bnx2.h +++ b/drivers/net/bnx2.h @@ -23,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -3792,8 +3793,10 @@ struct l2_fhdr { #define TX_DESC_CNT (BCM_PAGE_SIZE / sizeof(struct tx_bd)) #define MAX_TX_DESC_CNT (TX_DESC_CNT - 1) +#define MAX_RX_RINGS 4 #define RX_DESC_CNT (BCM_PAGE_SIZE / sizeof(struct rx_bd)) #define MAX_RX_DESC_CNT (RX_DESC_CNT - 1) +#define MAX_TOTAL_RX_DESC_CNT (MAX_RX_DESC_CNT * MAX_RX_RINGS) #define NEXT_TX_BD(x) (((x) & (MAX_TX_DESC_CNT - 1)) == \ (MAX_TX_DESC_CNT - 1)) ? \ @@ -3805,8 +3808,10 @@ struct l2_fhdr { (MAX_RX_DESC_CNT - 1)) ? \ (x) + 2 : (x) + 1 -#define RX_RING_IDX(x) ((x) & MAX_RX_DESC_CNT) +#define RX_RING_IDX(x) ((x) & bp->rx_max_ring_idx) +#define RX_RING(x) (((x) & ~MAX_RX_DESC_CNT) >> 8) +#define RX_IDX(x) ((x) & MAX_RX_DESC_CNT) /* Context size. */ #define CTX_SHIFT 7 @@ -3903,6 +3908,15 @@ struct bnx2 { struct status_block *status_blk; u32 last_status_idx; + u32 flags; +#define PCIX_FLAG 1 +#define PCI_32BIT_FLAG 2 +#define ONE_TDMA_FLAG 4 /* no longer used */ +#define NO_WOL_FLAG 8 +#define USING_DAC_FLAG 0x10 +#define USING_MSI_FLAG 0x20 +#define ASF_ENABLE_FLAG 0x40 + struct tx_bd *tx_desc_ring; struct sw_bd *tx_buf_ring; u32 tx_prod_bseq; @@ -3920,19 +3934,22 @@ struct bnx2 { u32 rx_offset; u32 rx_buf_use_size; /* useable size */ u32 rx_buf_size; /* with alignment */ - struct rx_bd *rx_desc_ring; - struct sw_bd *rx_buf_ring; + u32 rx_max_ring_idx; + u32 rx_prod_bseq; u16 rx_prod; u16 rx_cons; u32 rx_csum; + struct sw_bd *rx_buf_ring; + struct rx_bd *rx_desc_ring[MAX_RX_RINGS]; + /* Only used to synchronize netif_stop_queue/wake_queue when tx */ /* ring is full */ spinlock_t tx_lock; - /* End of fileds used in the performance code paths. */ + /* End of fields used in the performance code paths. */ char *name; @@ -3945,15 +3962,6 @@ struct bnx2 { /* Used to synchronize phy accesses. */ spinlock_t phy_lock; - u32 flags; -#define PCIX_FLAG 1 -#define PCI_32BIT_FLAG 2 -#define ONE_TDMA_FLAG 4 /* no longer used */ -#define NO_WOL_FLAG 8 -#define USING_DAC_FLAG 0x10 -#define USING_MSI_FLAG 0x20 -#define ASF_ENABLE_FLAG 0x40 - u32 phy_flags; #define PHY_SERDES_FLAG 1 #define PHY_CRC_FIX_FLAG 2 @@ -4004,8 +4012,9 @@ struct bnx2 { dma_addr_t tx_desc_mapping; + int rx_max_ring; int rx_ring_size; - dma_addr_t rx_desc_mapping; + dma_addr_t rx_desc_mapping[MAX_RX_RINGS]; u16 tx_quick_cons_trip; u16 tx_quick_cons_trip_int; diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c index e7dc653..f147d73 100644 --- a/drivers/net/tg3.c +++ b/drivers/net/tg3.c @@ -69,8 +69,8 @@ #define DRV_MODULE_NAME "tg3" #define PFX DRV_MODULE_NAME ": " -#define DRV_MODULE_VERSION "3.49" -#define DRV_MODULE_RELDATE "Feb 2, 2006" +#define DRV_MODULE_VERSION "3.50" +#define DRV_MODULE_RELDATE "Feb 4, 2006" #define TG3_DEF_MAC_MODE 0 #define TG3_DEF_RX_MODE 0 @@ -223,8 +223,12 @@ static struct pci_device_id tg3_pci_tbl[ PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0UL }, { PCI_VENDOR_ID_BROADCOM, PCI_DEVICE_ID_TIGON3_5714, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0UL }, + { PCI_VENDOR_ID_BROADCOM, PCI_DEVICE_ID_TIGON3_5714S, + PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0UL }, { PCI_VENDOR_ID_BROADCOM, PCI_DEVICE_ID_TIGON3_5715, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0UL }, + { PCI_VENDOR_ID_BROADCOM, PCI_DEVICE_ID_TIGON3_5715S, + PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0UL }, { PCI_VENDOR_ID_BROADCOM, PCI_DEVICE_ID_TIGON3_5780, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0UL }, { PCI_VENDOR_ID_BROADCOM, PCI_DEVICE_ID_TIGON3_5780S, @@ -1038,9 +1042,11 @@ static void tg3_frob_aux_power(struct tg struct net_device *dev_peer; dev_peer = pci_get_drvdata(tp->pdev_peer); + /* remove_one() may have been run on the peer. */ if (!dev_peer) - BUG(); - tp_peer = netdev_priv(dev_peer); + tp_peer = tp; + else + tp_peer = netdev_priv(dev_peer); } if ((tp->tg3_flags & TG3_FLAG_WOL_ENABLE) != 0 || @@ -1131,7 +1137,7 @@ static int tg3_halt_cpu(struct tg3 *, u3 static int tg3_nvram_lock(struct tg3 *); static void tg3_nvram_unlock(struct tg3 *); -static int tg3_set_power_state(struct tg3 *tp, int state) +static int tg3_set_power_state(struct tg3 *tp, pci_power_t state) { u32 misc_host_ctrl; u16 power_control, power_caps; @@ -1150,7 +1156,7 @@ static int tg3_set_power_state(struct tg power_control |= PCI_PM_CTRL_PME_STATUS; power_control &= ~(PCI_PM_CTRL_STATE_MASK); switch (state) { - case 0: + case PCI_D0: power_control |= 0; pci_write_config_word(tp->pdev, pm + PCI_PM_CTRL, @@ -1163,15 +1169,15 @@ static int tg3_set_power_state(struct tg return 0; - case 1: + case PCI_D1: power_control |= 1; break; - case 2: + case PCI_D2: power_control |= 2; break; - case 3: + case PCI_D3hot: power_control |= 3; break; @@ -2680,6 +2686,12 @@ static int tg3_setup_fiber_mii_phy(struc err |= tg3_readphy(tp, MII_BMSR, &bmsr); err |= tg3_readphy(tp, MII_BMSR, &bmsr); + if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5714) { + if (tr32(MAC_TX_STATUS) & TX_STATUS_LINK_UP) + bmsr |= BMSR_LSTATUS; + else + bmsr &= ~BMSR_LSTATUS; + } err |= tg3_readphy(tp, MII_BMCR, &bmcr); @@ -2748,6 +2760,13 @@ static int tg3_setup_fiber_mii_phy(struc bmcr = new_bmcr; err |= tg3_readphy(tp, MII_BMSR, &bmsr); err |= tg3_readphy(tp, MII_BMSR, &bmsr); + if (GET_ASIC_REV(tp->pci_chip_rev_id) == + ASIC_REV_5714) { + if (tr32(MAC_TX_STATUS) & TX_STATUS_LINK_UP) + bmsr |= BMSR_LSTATUS; + else + bmsr &= ~BMSR_LSTATUS; + } tp->tg3_flags2 &= ~TG3_FLG2_PARALLEL_DETECT; } } @@ -5568,6 +5587,9 @@ static int tg3_reset_hw(struct tg3 *tp) tg3_abort_hw(tp, 1); } + if (tp->tg3_flags2 & TG3_FLG2_MII_SERDES) + tg3_phy_reset(tp); + err = tg3_chip_reset(tp); if (err) return err; @@ -6080,6 +6102,17 @@ static int tg3_reset_hw(struct tg3 *tp) tp->tg3_flags2 |= TG3_FLG2_HW_AUTONEG; } + if ((tp->tg3_flags2 & TG3_FLG2_MII_SERDES) && + (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5714)) { + u32 tmp; + + tmp = tr32(SERDES_RX_CTRL); + tw32(SERDES_RX_CTRL, tmp | SERDES_RX_SIG_DETECT); + tp->grc_local_ctrl &= ~GRC_LCLCTRL_USE_EXT_SIG_DETECT; + tp->grc_local_ctrl |= GRC_LCLCTRL_USE_SIG_DETECT; + tw32(GRC_LOCAL_CTRL, tp->grc_local_ctrl); + } + err = tg3_setup_phy(tp, 1); if (err) return err; @@ -6158,7 +6191,7 @@ static int tg3_init_hw(struct tg3 *tp) int err; /* Force the chip into D0. */ - err = tg3_set_power_state(tp, 0); + err = tg3_set_power_state(tp, PCI_D0); if (err) goto out; @@ -6445,6 +6478,10 @@ static int tg3_open(struct net_device *d tg3_full_lock(tp, 0); + err = tg3_set_power_state(tp, PCI_D0); + if (err) + return err; + tg3_disable_ints(tp); tp->tg3_flags &= ~TG3_FLAG_INIT_COMPLETE; @@ -6459,7 +6496,9 @@ static int tg3_open(struct net_device *d if ((tp->tg3_flags2 & TG3_FLG2_5750_PLUS) && (GET_CHIP_REV(tp->pci_chip_rev_id) != CHIPREV_5750_AX) && - (GET_CHIP_REV(tp->pci_chip_rev_id) != CHIPREV_5750_BX)) { + (GET_CHIP_REV(tp->pci_chip_rev_id) != CHIPREV_5750_BX) && + !((GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5714) && + (tp->pdev_peer == tp->pdev))) { /* All MSI supporting chips should support tagged * status. Assert that this is the case. */ @@ -6822,7 +6861,6 @@ static int tg3_close(struct net_device * tp->tg3_flags &= ~(TG3_FLAG_INIT_COMPLETE | TG3_FLAG_GOT_SERDES_FLOWCTL); - netif_carrier_off(tp->dev); tg3_full_unlock(tp); @@ -6839,6 +6877,10 @@ static int tg3_close(struct net_device * tg3_free_consistent(tp); + tg3_set_power_state(tp, PCI_D3hot); + + netif_carrier_off(tp->dev); + return 0; } @@ -7157,6 +7199,9 @@ static void tg3_get_regs(struct net_devi memset(p, 0, TG3_REGDUMP_LEN); + if (tp->link_config.phy_is_low_power) + return; + tg3_full_lock(tp, 0); #define __GET_REG32(reg) (*(p)++ = tr32(reg)) @@ -7231,6 +7276,9 @@ static int tg3_get_eeprom(struct net_dev u8 *pd; u32 i, offset, len, val, b_offset, b_count; + if (tp->link_config.phy_is_low_power) + return -EAGAIN; + offset = eeprom->offset; len = eeprom->len; eeprom->len = 0; @@ -7292,6 +7340,9 @@ static int tg3_set_eeprom(struct net_dev u32 offset, len, b_offset, odd_len, start, end; u8 *buf; + if (tp->link_config.phy_is_low_power) + return -EAGAIN; + if (eeprom->magic != TG3_EEPROM_MAGIC) return -EINVAL; @@ -8212,6 +8263,9 @@ static void tg3_self_test(struct net_dev { struct tg3 *tp = netdev_priv(dev); + if (tp->link_config.phy_is_low_power) + tg3_set_power_state(tp, PCI_D0); + memset(data, 0, sizeof(u64) * TG3_NUM_TEST); if (tg3_test_nvram(tp) != 0) { @@ -8269,6 +8323,9 @@ static void tg3_self_test(struct net_dev tg3_full_unlock(tp); } + if (tp->link_config.phy_is_low_power) + tg3_set_power_state(tp, PCI_D3hot); + } static int tg3_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd) @@ -8288,6 +8345,9 @@ static int tg3_ioctl(struct net_device * if (tp->tg3_flags2 & TG3_FLG2_PHY_SERDES) break; /* We have no PHY */ + if (tp->link_config.phy_is_low_power) + return -EAGAIN; + spin_lock_bh(&tp->lock); err = tg3_readphy(tp, data->reg_num & 0x1f, &mii_regval); spin_unlock_bh(&tp->lock); @@ -8304,6 +8364,9 @@ static int tg3_ioctl(struct net_device * if (!capable(CAP_NET_ADMIN)) return -EPERM; + if (tp->link_config.phy_is_low_power) + return -EAGAIN; + spin_lock_bh(&tp->lock); err = tg3_writephy(tp, data->reg_num & 0x1f, data->val_in); spin_unlock_bh(&tp->lock); @@ -9718,7 +9781,7 @@ static int __devinit tg3_get_invariants( tp->grc_local_ctrl |= GRC_LCLCTRL_GPIO_OE3; /* Force the chip into D0. */ - err = tg3_set_power_state(tp, 0); + err = tg3_set_power_state(tp, PCI_D0); if (err) { printk(KERN_ERR PFX "(%s) transition to D0 failed\n", pci_name(tp->pdev)); @@ -10771,11 +10834,12 @@ static int __devinit tg3_init_one(struct tp->tg3_flags2 |= TG3_FLG2_TSO_CAPABLE; } - /* TSO is off by default, user can enable using ethtool. */ -#if 0 - if (tp->tg3_flags2 & TG3_FLG2_TSO_CAPABLE) + /* TSO is on by default on chips that support hardware TSO. + * Firmware TSO on older chips gives lower performance, so it + * is off by default, but can be enabled using ethtool. + */ + if (tp->tg3_flags2 & TG3_FLG2_HW_TSO) dev->features |= NETIF_F_TSO; -#endif #endif @@ -10968,7 +11032,7 @@ static int tg3_resume(struct pci_dev *pd pci_restore_state(tp->pdev); - err = tg3_set_power_state(tp, 0); + err = tg3_set_power_state(tp, PCI_D0); if (err) return err; diff --git a/include/linux/dccp.h b/include/linux/dccp.h index 088529f..f91c8a6 100644 --- a/include/linux/dccp.h +++ b/include/linux/dccp.h @@ -154,6 +154,10 @@ enum { DCCPO_MANDATORY = 1, DCCPO_MIN_RESERVED = 3, DCCPO_MAX_RESERVED = 31, + DCCPO_CHANGE_L = 32, + DCCPO_CONFIRM_L = 33, + DCCPO_CHANGE_R = 34, + DCCPO_CONFIRM_R = 35, DCCPO_NDP_COUNT = 37, DCCPO_ACK_VECTOR_0 = 38, DCCPO_ACK_VECTOR_1 = 39, @@ -168,7 +172,9 @@ enum { /* DCCP features */ enum { DCCPF_RESERVED = 0, + DCCPF_CCID = 1, DCCPF_SEQUENCE_WINDOW = 3, + DCCPF_ACK_RATIO = 5, DCCPF_SEND_ACK_VECTOR = 6, DCCPF_SEND_NDP_COUNT = 7, /* 10-127 reserved */ @@ -176,9 +182,18 @@ enum { DCCPF_MAX_CCID_SPECIFIC = 255, }; +/* this structure is argument to DCCP_SOCKOPT_CHANGE_X */ +struct dccp_so_feat { + __u8 dccpsf_feat; + __u8 *dccpsf_val; + __u8 dccpsf_len; +}; + /* DCCP socket options */ #define DCCP_SOCKOPT_PACKET_SIZE 1 #define DCCP_SOCKOPT_SERVICE 2 +#define DCCP_SOCKOPT_CHANGE_L 3 +#define DCCP_SOCKOPT_CHANGE_R 4 #define DCCP_SOCKOPT_CCID_RX_INFO 128 #define DCCP_SOCKOPT_CCID_TX_INFO 192 @@ -314,9 +329,9 @@ static inline unsigned int dccp_hdr_len( /* initial values for each feature */ #define DCCPF_INITIAL_SEQUENCE_WINDOW 100 -/* FIXME: for now we're using CCID 3 (TFRC) */ -#define DCCPF_INITIAL_CCID 3 -#define DCCPF_INITIAL_SEND_ACK_VECTOR 0 +#define DCCPF_INITIAL_CCID 2 +#define DCCPF_INITIAL_ACK_RATIO 2 +#define DCCPF_INITIAL_SEND_ACK_VECTOR 1 /* FIXME: for now we're default to 1 but it should really be 0 */ #define DCCPF_INITIAL_SEND_NDP_COUNT 1 @@ -335,6 +350,24 @@ struct dccp_options { __u8 dccpo_tx_ccid; __u8 dccpo_send_ack_vector; __u8 dccpo_send_ndp_count; + __u8 dccpo_ack_ratio; + struct list_head dccpo_pending; + struct list_head dccpo_conf; +}; + +struct dccp_opt_conf { + __u8 *dccpoc_val; + __u8 dccpoc_len; +}; + +struct dccp_opt_pend { + struct list_head dccpop_node; + __u8 dccpop_type; + __u8 dccpop_feat; + __u8 *dccpop_val; + __u8 dccpop_len; + int dccpop_conf; + struct dccp_opt_conf *dccpop_sc; }; extern void __dccp_options_init(struct dccp_options *dccpo); @@ -430,6 +463,8 @@ struct dccp_sock { struct timeval dccps_timestamp_time; __u32 dccps_timestamp_echo; __u32 dccps_packet_size; + __u16 dccps_l_ack_ratio; + __u16 dccps_r_ack_ratio; unsigned long dccps_ndp_count; __u32 dccps_mss_cache; struct dccp_options dccps_options; diff --git a/include/linux/icmpv6.h b/include/linux/icmpv6.h index 0cf6c8b..c771a7d 100644 --- a/include/linux/icmpv6.h +++ b/include/linux/icmpv6.h @@ -40,14 +40,16 @@ struct icmp6hdr { struct icmpv6_nd_ra { __u8 hop_limit; #if defined(__LITTLE_ENDIAN_BITFIELD) - __u8 reserved:6, + __u8 reserved:4, + router_pref:2, other:1, managed:1; #elif defined(__BIG_ENDIAN_BITFIELD) __u8 managed:1, other:1, - reserved:6; + router_pref:2, + reserved:4; #else #error "Please fix " #endif @@ -70,8 +72,13 @@ struct icmp6hdr { #define icmp6_addrconf_managed icmp6_dataun.u_nd_ra.managed #define icmp6_addrconf_other icmp6_dataun.u_nd_ra.other #define icmp6_rt_lifetime icmp6_dataun.u_nd_ra.rt_lifetime +#define icmp6_router_pref icmp6_dataun.u_nd_ra.router_pref }; +#define ICMPV6_ROUTER_PREF_LOW 0x3 +#define ICMPV6_ROUTER_PREF_MEDIUM 0x0 +#define ICMPV6_ROUTER_PREF_HIGH 0x1 +#define ICMPV6_ROUTER_PREF_INVALID 0x2 #define ICMPV6_DEST_UNREACH 1 #define ICMPV6_PKT_TOOBIG 2 diff --git a/include/linux/if.h b/include/linux/if.h index ce627d9..720e42d 100644 --- a/include/linux/if.h +++ b/include/linux/if.h @@ -33,7 +33,7 @@ #define IFF_LOOPBACK 0x8 /* is a loopback net */ #define IFF_POINTOPOINT 0x10 /* interface is has p-p link */ #define IFF_NOTRAILERS 0x20 /* avoid use of trailers */ -#define IFF_RUNNING 0x40 /* interface running and carrier ok */ +#define IFF_RUNNING 0x40 /* interface RFC2863 OPER_UP */ #define IFF_NOARP 0x80 /* no ARP protocol */ #define IFF_PROMISC 0x100 /* receive all packets */ #define IFF_ALLMULTI 0x200 /* receive all multicast packets*/ @@ -43,12 +43,16 @@ #define IFF_MULTICAST 0x1000 /* Supports multicast */ -#define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|IFF_MASTER|IFF_SLAVE|IFF_RUNNING) - #define IFF_PORTSEL 0x2000 /* can set media type */ #define IFF_AUTOMEDIA 0x4000 /* auto media select active */ #define IFF_DYNAMIC 0x8000 /* dialup device with changing addresses*/ +#define IFF_LOWER_UP 0x10000 /* driver signals L1 up */ +#define IFF_DORMANT 0x20000 /* driver signals dormant */ + +#define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|\ + IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT) + /* Private (from user) interface flags (netdevice->priv_flags). */ #define IFF_802_1Q_VLAN 0x1 /* 802.1Q VLAN device. */ #define IFF_EBRIDGE 0x2 /* Ethernet bridging device. */ @@ -80,6 +84,22 @@ #define IF_PROTO_FR_ETH_PVC 0x200B #define IF_PROTO_RAW 0x200C /* RAW Socket */ +/* RFC 2863 operational status */ +enum { + IF_OPER_UNKNOWN, + IF_OPER_NOTPRESENT, + IF_OPER_DOWN, + IF_OPER_LOWERLAYERDOWN, + IF_OPER_TESTING, + IF_OPER_DORMANT, + IF_OPER_UP, +}; + +/* link modes */ +enum { + IF_LINK_MODE_DEFAULT, + IF_LINK_MODE_DORMANT, /* limit upward transition to dormant */ +}; /* * Device mapping structure. I'd just gone off and designed a diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h index 9c8f4c9..1263d8c 100644 --- a/include/linux/ipv6.h +++ b/include/linux/ipv6.h @@ -145,6 +145,15 @@ struct ipv6_devconf { __s32 max_desync_factor; #endif __s32 max_addresses; + __s32 accept_ra_defrtr; + __s32 accept_ra_pinfo; +#ifdef CONFIG_IPV6_ROUTER_PREF + __s32 accept_ra_rtr_pref; + __s32 rtr_probe_interval; +#ifdef CONFIG_IPV6_ROUTE_INFO + __s32 accept_ra_rt_info_max_plen; +#endif +#endif void *sysctl; }; @@ -167,6 +176,11 @@ enum { DEVCONF_MAX_DESYNC_FACTOR, DEVCONF_MAX_ADDRESSES, DEVCONF_FORCE_MLD_VERSION, + DEVCONF_ACCEPT_RA_DEFRTR, + DEVCONF_ACCEPT_RA_PINFO, + DEVCONF_ACCEPT_RA_RTR_PREF, + DEVCONF_RTR_PROBE_INTERVAL, + DEVCONF_ACCEPT_RA_RT_INFO_MAX_PLEN, DEVCONF_MAX }; diff --git a/include/linux/ipv6_route.h b/include/linux/ipv6_route.h index d7c41d1..b323ff5 100644 --- a/include/linux/ipv6_route.h +++ b/include/linux/ipv6_route.h @@ -23,12 +23,22 @@ #define RTF_NONEXTHOP 0x00200000 /* route with no nexthop */ #define RTF_EXPIRES 0x00400000 +#define RTF_ROUTEINFO 0x00800000 /* route information - RA */ + #define RTF_CACHE 0x01000000 /* cache entry */ #define RTF_FLOW 0x02000000 /* flow significant route */ #define RTF_POLICY 0x04000000 /* policy route */ +#define RTF_PREF(pref) ((pref) << 27) +#define RTF_PREF_MASK 0x18000000 + #define RTF_LOCAL 0x80000000 +#ifdef __KERNEL__ +#define IPV6_EXTRACT_PREF(flag) (((flag) & RTF_PREF_MASK) >> 27) +#define IPV6_DECODE_PREF(pref) ((pref) ^ 2) /* 1:low,2:med,3:high */ +#endif + struct in6_rtmsg { struct in6_addr rtmsg_dst; struct in6_addr rtmsg_src; diff --git a/include/linux/list.h b/include/linux/list.h index 47208bd..67258b4 100644 --- a/include/linux/list.h +++ b/include/linux/list.h @@ -411,6 +411,17 @@ static inline void list_splice_init(stru pos = list_entry(pos->member.next, typeof(*pos), member)) /** + * list_for_each_entry_from - iterate over list of given type + * continuing from existing point + * @pos: the type * to use as a loop counter. + * @head: the head for your list. + * @member: the name of the list_struct within the struct. + */ +#define list_for_each_entry_from(pos, head, member) \ + for (; prefetch(pos->member.next), &pos->member != (head); \ + pos = list_entry(pos->member.next, typeof(*pos), member)) + +/** * list_for_each_entry_safe - iterate over list of given type safe against removal of list entry * @pos: the type * to use as a loop counter. * @n: another type * to use as temporary storage @@ -438,6 +449,19 @@ static inline void list_splice_init(stru pos = n, n = list_entry(n->member.next, typeof(*n), member)) /** + * list_for_each_entry_safe_from - iterate over list of given type + * from existing point safe against removal of list entry + * @pos: the type * to use as a loop counter. + * @n: another type * to use as temporary storage + * @head: the head for your list. + * @member: the name of the list_struct within the struct. + */ +#define list_for_each_entry_safe_from(pos, n, head, member) \ + for (n = list_entry(pos->member.next, typeof(*pos), member); \ + &pos->member != (head); \ + pos = n, n = list_entry(n->member.next, typeof(*n), member)) + +/** * list_for_each_entry_safe_reverse - iterate backwards over list of given type safe against * removal of list entry * @pos: the type * to use as a loop counter. diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 7fda03d..b825be2 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -230,7 +230,8 @@ enum netdev_state_t __LINK_STATE_SCHED, __LINK_STATE_NOCARRIER, __LINK_STATE_RX_SCHED, - __LINK_STATE_LINKWATCH_PENDING + __LINK_STATE_LINKWATCH_PENDING, + __LINK_STATE_DORMANT, }; @@ -335,11 +336,14 @@ struct net_device */ - unsigned short flags; /* interface flags (a la BSD) */ + unsigned int flags; /* interface flags (a la BSD) */ unsigned short gflags; unsigned short priv_flags; /* Like 'flags' but invisible to userspace. */ unsigned short padded; /* How much padding added by alloc_netdev() */ + unsigned char operstate; /* RFC2863 operstate */ + unsigned char link_mode; /* mapping policy to operstate */ + unsigned mtu; /* interface MTU value */ unsigned short type; /* interface hardware type */ unsigned short hard_header_len; /* hardware hdr length */ @@ -714,6 +718,10 @@ static inline void dev_put(struct net_de /* Carrier loss detection, dial on demand. The functions netif_carrier_on * and _off may be called from IRQ context, but it is caller * who is responsible for serialization of these calls. + * + * The name carrier is inappropriate, these functions should really be + * called netif_lowerlayer_*() because they represent the state of any + * kind of lower layer not just hardware media. */ extern void linkwatch_fire_event(struct net_device *dev); @@ -729,6 +737,29 @@ extern void netif_carrier_on(struct net_ extern void netif_carrier_off(struct net_device *dev); +static inline void netif_dormant_on(struct net_device *dev) +{ + if (!test_and_set_bit(__LINK_STATE_DORMANT, &dev->state)) + linkwatch_fire_event(dev); +} + +static inline void netif_dormant_off(struct net_device *dev) +{ + if (test_and_clear_bit(__LINK_STATE_DORMANT, &dev->state)) + linkwatch_fire_event(dev); +} + +static inline int netif_dormant(const struct net_device *dev) +{ + return test_bit(__LINK_STATE_DORMANT, &dev->state); +} + + +static inline int netif_oper_up(const struct net_device *dev) { + return (dev->operstate == IF_OPER_UP || + dev->operstate == IF_OPER_UNKNOWN /* backward compat */); +} + /* Hot-plugging. */ static inline int netif_device_present(struct net_device *dev) { diff --git a/include/linux/netfilter/nfnetlink_log.h b/include/linux/netfilter/nfnetlink_log.h index b04b038..a7497c7 100644 --- a/include/linux/netfilter/nfnetlink_log.h +++ b/include/linux/netfilter/nfnetlink_log.h @@ -47,6 +47,8 @@ enum nfulnl_attr_type { NFULA_PAYLOAD, /* opaque data payload */ NFULA_PREFIX, /* string prefix */ NFULA_UID, /* user id of socket */ + NFULA_SEQ, /* instance-local sequence number */ + NFULA_SEQ_GLOBAL, /* global sequence number */ __NFULA_MAX }; @@ -77,6 +79,7 @@ enum nfulnl_attr_config { NFULA_CFG_NLBUFSIZ, /* u_int32_t buffer size */ NFULA_CFG_TIMEOUT, /* u_int32_t in 1/100 s */ NFULA_CFG_QTHRESH, /* u_int32_t */ + NFULA_CFG_FLAGS, /* u_int16_t */ __NFULA_CFG_MAX }; #define NFULA_CFG_MAX (__NFULA_CFG_MAX -1) @@ -85,4 +88,7 @@ enum nfulnl_attr_config { #define NFULNL_COPY_META 0x01 #define NFULNL_COPY_PACKET 0x02 +#define NFULNL_CFG_F_SEQ 0x0001 +#define NFULNL_CFG_F_SEQ_GLOBAL 0x0002 + #endif /* _NFNETLINK_LOG_H */ diff --git a/include/linux/netfilter_ipv4/ip_nat.h b/include/linux/netfilter_ipv4/ip_nat.h index 41a107d..e9f5ed1 100644 --- a/include/linux/netfilter_ipv4/ip_nat.h +++ b/include/linux/netfilter_ipv4/ip_nat.h @@ -23,7 +23,7 @@ struct ip_nat_seq { * modification (if any) */ u_int32_t correction_pos; /* sequence number offset before and after last modification */ - int32_t offset_before, offset_after; + int16_t offset_before, offset_after; }; /* Single range specification. */ diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h index 92a619b..37ab970 100644 --- a/include/linux/pci_ids.h +++ b/include/linux/pci_ids.h @@ -1852,12 +1852,14 @@ #define PCI_DEVICE_ID_TIGON3_5705M 0x165d #define PCI_DEVICE_ID_TIGON3_5705M_2 0x165e #define PCI_DEVICE_ID_TIGON3_5714 0x1668 +#define PCI_DEVICE_ID_TIGON3_5714S 0x1669 #define PCI_DEVICE_ID_TIGON3_5780 0x166a #define PCI_DEVICE_ID_TIGON3_5780S 0x166b #define PCI_DEVICE_ID_TIGON3_5705F 0x166e #define PCI_DEVICE_ID_TIGON3_5750 0x1676 #define PCI_DEVICE_ID_TIGON3_5751 0x1677 #define PCI_DEVICE_ID_TIGON3_5715 0x1678 +#define PCI_DEVICE_ID_TIGON3_5715S 0x1679 #define PCI_DEVICE_ID_TIGON3_5750M 0x167c #define PCI_DEVICE_ID_TIGON3_5751M 0x167d #define PCI_DEVICE_ID_TIGON3_5751F 0x167e diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h index d50482b..edccefb 100644 --- a/include/linux/rtnetlink.h +++ b/include/linux/rtnetlink.h @@ -733,6 +733,8 @@ enum #define IFLA_MAP IFLA_MAP IFLA_WEIGHT, #define IFLA_WEIGHT IFLA_WEIGHT + IFLA_OPERSTATE, + IFLA_LINKMODE, __IFLA_MAX }; diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index ad7cc22..838ce0f 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -270,7 +270,6 @@ struct sk_buff { void (*destructor)(struct sk_buff *skb); #ifdef CONFIG_NETFILTER - __u32 nfmark; struct nf_conntrack *nfct; #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE) struct sk_buff *nfct_reasm; @@ -278,6 +277,7 @@ struct sk_buff { #ifdef CONFIG_BRIDGE_NETFILTER struct nf_bridge_info *nf_bridge; #endif + __u32 nfmark; #endif /* CONFIG_NETFILTER */ #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h index 32a4139..8f3fb04 100644 --- a/include/linux/sysctl.h +++ b/include/linux/sysctl.h @@ -395,6 +395,8 @@ enum NET_TCP_CONG_CONTROL=110, NET_TCP_ABC=111, NET_IPV4_IPFRAG_MAX_DIST=112, + NET_TCP_MTU_PROBING=113, + NET_TCP_BASE_MSS=114, }; enum { @@ -529,6 +531,11 @@ enum { NET_IPV6_MAX_DESYNC_FACTOR=15, NET_IPV6_MAX_ADDRESSES=16, NET_IPV6_FORCE_MLD_VERSION=17, + NET_IPV6_ACCEPT_RA_DEFRTR=18, + NET_IPV6_ACCEPT_RA_PINFO=19, + NET_IPV6_ACCEPT_RA_RTR_PREF=20, + NET_IPV6_RTR_PROBE_INTERVAL=21, + NET_IPV6_ACCEPT_RA_RT_INFO_MAX_PLEN=22, __NET_IPV6_MAX }; diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h index eb8afe3..e459e1a 100644 --- a/include/net/if_inet6.h +++ b/include/net/if_inet6.h @@ -180,11 +180,8 @@ struct inet6_dev #ifdef CONFIG_IPV6_PRIVACY u8 rndid[8]; - u8 entropy[8]; struct timer_list regen_timer; struct inet6_ifaddr *tempaddr_list; - __u8 work_eui64[8]; - __u8 work_digest[16]; #endif struct neigh_parms *nd_parms; diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index fa587c9..103e9a1 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -72,6 +72,7 @@ struct inet_connection_sock_af_ops { * @icsk_probes_out: unanswered 0 window probes * @icsk_ext_hdr_len: Network protocol overhead (IP/IPv6 options) * @icsk_ack: Delayed ACK control data + * @icsk_mtup; MTU probing control data */ struct inet_connection_sock { /* inet_sock has to be the first member! */ @@ -104,6 +105,18 @@ struct inet_connection_sock { __u16 last_seg_size; /* Size of last incoming segment */ __u16 rcv_mss; /* MSS used for delayed ACK decisions */ } icsk_ack; + struct { + int enabled; + + /* Range of MTUs to search */ + int search_high; + int search_low; + + /* Information on the current probe. */ + int probe_size; + __u32 probe_seq_start; + __u32 probe_seq_end; + } icsk_mtup; u32 icsk_ca_priv[16]; #define ICSK_CA_PRIV_SIZE (16 * sizeof(u32)) }; diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h index 1f2e428..a398ae5 100644 --- a/include/net/ip6_route.h +++ b/include/net/ip6_route.h @@ -7,6 +7,23 @@ #define IP6_RT_PRIO_KERN 512 #define IP6_RT_FLOW_MASK 0x00ff +struct route_info { + __u8 type; + __u8 length; + __u8 prefix_len; +#if defined(__BIG_ENDIAN_BITFIELD) + __u8 reserved_h:3, + route_pref:2, + reserved_l:3; +#elif defined(__LITTLE_ENDIAN_BITFIELD) + __u8 reserved_l:3, + route_pref:2, + reserved_h:3; +#endif + __u32 lifetime; + __u8 prefix[0]; /* 0,8 or 16 */ +}; + #ifdef __KERNEL__ #include @@ -87,11 +104,14 @@ extern struct rt6_info *addrconf_dst_all extern struct rt6_info * rt6_get_dflt_router(struct in6_addr *addr, struct net_device *dev); extern struct rt6_info * rt6_add_dflt_router(struct in6_addr *gwaddr, - struct net_device *dev); + struct net_device *dev, + unsigned int pref); extern void rt6_purge_dflt_routers(void); -extern void rt6_reset_dflt_pointer(struct rt6_info *rt); +extern int rt6_route_rcv(struct net_device *dev, + u8 *opt, int len, + struct in6_addr *gwaddr); extern void rt6_redirect(struct in6_addr *dest, struct in6_addr *saddr, diff --git a/include/net/ndisc.h b/include/net/ndisc.h index bbac87e..91fa271 100644 --- a/include/net/ndisc.h +++ b/include/net/ndisc.h @@ -22,6 +22,8 @@ enum { ND_OPT_PREFIX_INFO = 3, /* RFC2461 */ ND_OPT_REDIRECT_HDR = 4, /* RFC2461 */ ND_OPT_MTU = 5, /* RFC2461 */ + __ND_OPT_ARRAY_MAX, + ND_OPT_ROUTE_INFO = 24, /* RFC4191 */ __ND_OPT_MAX }; diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 6d075ca..9d1f0e6 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -67,6 +67,18 @@ do { \ struct nf_conntrack_helper; +/* nf_conn feature for connections that have a helper */ +struct nf_conn_help { + /* Helper. if any */ + struct nf_conntrack_helper *helper; + + union nf_conntrack_help help; + + /* Current number of expected connections */ + unsigned int expecting; +}; + + #include struct nf_conn { @@ -81,6 +93,9 @@ struct nf_conn /* Have we seen traffic both ways yet? (bitset) */ unsigned long status; + /* If we were expected by an expectation, this will be it */ + struct nf_conn *master; + /* Timer function; drops refcnt when it goes off. */ struct timer_list timeout; @@ -88,38 +103,22 @@ struct nf_conn /* Accounting Information (same cache line as other written members) */ struct ip_conntrack_counter counters[IP_CT_DIR_MAX]; #endif - /* If we were expected by an expectation, this will be it */ - struct nf_conn *master; - - /* Current number of expected connections */ - unsigned int expecting; /* Unique ID that identifies this conntrack*/ unsigned int id; - /* Helper. if any */ - struct nf_conntrack_helper *helper; - /* features - nat, helper, ... used by allocating system */ u_int32_t features; - /* Storage reserved for other modules: */ - - union nf_conntrack_proto proto; - #if defined(CONFIG_NF_CONNTRACK_MARK) u_int32_t mark; #endif - /* These members are dynamically allocated. */ - - union nf_conntrack_help *help; + /* Storage reserved for other modules: */ + union nf_conntrack_proto proto; - /* Layer 3 dependent members. (ex: NAT) */ - union { - struct nf_conntrack_ipv4 *ipv4; - } l3proto; - void *data[0]; + /* features dynamically at the end: helper, nat (both optional) */ + char data[0]; }; struct nf_conntrack_expect @@ -373,10 +372,23 @@ nf_conntrack_expect_event(enum ip_conntr #define NF_CT_F_NUM 4 extern int -nf_conntrack_register_cache(u_int32_t features, const char *name, size_t size, - int (*init_conntrack)(struct nf_conn *, u_int32_t)); +nf_conntrack_register_cache(u_int32_t features, const char *name, size_t size); extern void nf_conntrack_unregister_cache(u_int32_t features); +/* valid combinations: + * basic: nf_conn, nf_conn .. nf_conn_help + * nat: nf_conn .. nf_conn_nat, nf_conn .. nf_conn_nat, nf_conn help + */ +static inline struct nf_conn_help *nfct_help(const struct nf_conn *ct) +{ + unsigned int offset = sizeof(struct nf_conn); + + if (!(ct->features & NF_CT_F_HELP)) + return NULL; + + return (struct nf_conn_help *) ((void *)ct + offset); +} + #endif /* __KERNEL__ */ #endif /* _NF_CONNTRACK_H */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 77f21c6..16879fa 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -60,6 +60,9 @@ extern void tcp_time_wait(struct sock *s /* Minimal RCV_MSS. */ #define TCP_MIN_RCVMSS 536U +/* The least MTU to use for probing */ +#define TCP_BASE_MSS 512 + /* After receiving this amount of duplicate ACKs fast retransmit starts. */ #define TCP_FASTRETRANS_THRESH 3 @@ -219,6 +222,8 @@ extern int sysctl_tcp_nometrics_save; extern int sysctl_tcp_moderate_rcvbuf; extern int sysctl_tcp_tso_win_divisor; extern int sysctl_tcp_abc; +extern int sysctl_tcp_mtu_probing; +extern int sysctl_tcp_base_mss; extern atomic_t tcp_memory_allocated; extern atomic_t tcp_sockets_allocated; @@ -447,6 +452,10 @@ extern int tcp_read_sock(struct sock *sk extern void tcp_initialize_rcv_mss(struct sock *sk); +extern int tcp_mtu_to_mss(struct sock *sk, int pmtu); +extern int tcp_mss_to_mtu(struct sock *sk, int mss); +extern void tcp_mtup_init(struct sock *sk); + static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd) { tp->pred_flags = htonl((tp->tcp_header_len << 26) | diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c index fa76220..510c2af 100644 --- a/net/8021q/vlan.c +++ b/net/8021q/vlan.c @@ -69,7 +69,7 @@ static struct packet_type vlan_packet_ty /* Bits of netdev state that are propagated from real device to virtual */ #define VLAN_LINK_STATE_MASK \ - ((1<<__LINK_STATE_PRESENT)|(1<<__LINK_STATE_NOCARRIER)) + ((1<<__LINK_STATE_PRESENT)|(1<<__LINK_STATE_NOCARRIER)|(1<<__LINK_STATE_DORMANT)) /* End of global variables definitions. */ @@ -344,6 +344,26 @@ static void vlan_setup(struct net_device new_dev->do_ioctl = vlan_dev_ioctl; } +static void vlan_transfer_operstate(const struct net_device *dev, struct net_device *vlandev) +{ + /* Have to respect userspace enforced dormant state + * of real device, also must allow supplicant running + * on VLAN device + */ + if (dev->operstate == IF_OPER_DORMANT) + netif_dormant_on(vlandev); + else + netif_dormant_off(vlandev); + + if (netif_carrier_ok(dev)) { + if (!netif_carrier_ok(vlandev)) + netif_carrier_on(vlandev); + } else { + if (netif_carrier_ok(vlandev)) + netif_carrier_off(vlandev); + } +} + /* Attach a VLAN device to a mac address (ie Ethernet Card). * Returns the device that was created, or NULL if there was * an error of some kind. @@ -450,7 +470,7 @@ static struct net_device *register_vlan_ new_dev->flags = real_dev->flags; new_dev->flags &= ~IFF_UP; - new_dev->state = real_dev->state & VLAN_LINK_STATE_MASK; + new_dev->state = real_dev->state & ~(1<<__LINK_STATE_START); /* need 4 bytes for extra VLAN header info, * hope the underlying device can handle it. @@ -498,6 +518,10 @@ static struct net_device *register_vlan_ if (register_netdevice(new_dev)) goto out_free_newdev; + new_dev->iflink = real_dev->ifindex; + vlan_transfer_operstate(real_dev, new_dev); + linkwatch_fire_event(new_dev); /* _MUST_ call rfc2863_policy() */ + /* So, got the sucker initialized, now lets place * it into our local structure. */ @@ -573,25 +597,12 @@ static int vlan_device_event(struct noti switch (event) { case NETDEV_CHANGE: /* Propagate real device state to vlan devices */ - flgs = dev->state & VLAN_LINK_STATE_MASK; for (i = 0; i < VLAN_GROUP_ARRAY_LEN; i++) { vlandev = grp->vlan_devices[i]; if (!vlandev) continue; - if (netif_carrier_ok(dev)) { - if (!netif_carrier_ok(vlandev)) - netif_carrier_on(vlandev); - } else { - if (netif_carrier_ok(vlandev)) - netif_carrier_off(vlandev); - } - - if ((vlandev->state & VLAN_LINK_STATE_MASK) != flgs) { - vlandev->state = (vlandev->state &~ VLAN_LINK_STATE_MASK) - | flgs; - netdev_state_change(vlandev); - } + vlan_transfer_operstate(dev, vlandev); } break; diff --git a/net/core/dev.c b/net/core/dev.c index 2afb0de..7ca47bf 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2152,12 +2152,20 @@ unsigned dev_get_flags(const struct net_ flags = (dev->flags & ~(IFF_PROMISC | IFF_ALLMULTI | - IFF_RUNNING)) | + IFF_RUNNING | + IFF_LOWER_UP | + IFF_DORMANT)) | (dev->gflags & (IFF_PROMISC | IFF_ALLMULTI)); - if (netif_running(dev) && netif_carrier_ok(dev)) - flags |= IFF_RUNNING; + if (netif_running(dev)) { + if (netif_oper_up(dev)) + flags |= IFF_RUNNING; + if (netif_carrier_ok(dev)) + flags |= IFF_LOWER_UP; + if (netif_dormant(dev)) + flags |= IFF_DORMANT; + } return flags; } diff --git a/net/core/link_watch.c b/net/core/link_watch.c index d43d120..23027a4 100644 --- a/net/core/link_watch.c +++ b/net/core/link_watch.c @@ -49,6 +49,45 @@ struct lw_event { /* Avoid kmalloc() for most systems */ static struct lw_event singleevent; +static unsigned char default_operstate(const struct net_device *dev) +{ + if (!netif_carrier_ok(dev)) + return (dev->ifindex != dev->iflink ? + IF_OPER_LOWERLAYERDOWN : IF_OPER_DOWN); + + if (netif_dormant(dev)) + return IF_OPER_DORMANT; + + return IF_OPER_UP; +} + + +static void rfc2863_policy(struct net_device *dev) +{ + unsigned char operstate = default_operstate(dev); + + if (operstate == dev->operstate) + return; + + write_lock_bh(&dev_base_lock); + + switch(dev->link_mode) { + case IF_LINK_MODE_DORMANT: + if (operstate == IF_OPER_UP) + operstate = IF_OPER_DORMANT; + break; + + case IF_LINK_MODE_DEFAULT: + default: + break; + }; + + dev->operstate = operstate; + + write_unlock_bh(&dev_base_lock); +} + + /* Must be called with the rtnl semaphore held */ void linkwatch_run_queue(void) { @@ -74,6 +113,7 @@ void linkwatch_run_queue(void) */ clear_bit(__LINK_STATE_LINKWATCH_PENDING, &dev->state); + rfc2863_policy(dev); if (dev->flags & IFF_UP) { if (netif_carrier_ok(dev)) { WARN_ON(dev->qdisc_sleeping == &noop_qdisc); diff --git a/net/core/neighbour.c b/net/core/neighbour.c index e68700f..e98975a 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -750,11 +750,13 @@ static void neigh_timer_handler(unsigned neigh->used + neigh->parms->delay_probe_time)) { NEIGH_PRINTK2("neigh %p is delayed.\n", neigh); neigh->nud_state = NUD_DELAY; + neigh->updated = jiffies; neigh_suspect(neigh); next = now + neigh->parms->delay_probe_time; } else { NEIGH_PRINTK2("neigh %p is suspected.\n", neigh); neigh->nud_state = NUD_STALE; + neigh->updated = jiffies; neigh_suspect(neigh); } } else if (state & NUD_DELAY) { @@ -762,11 +764,13 @@ static void neigh_timer_handler(unsigned neigh->confirmed + neigh->parms->delay_probe_time)) { NEIGH_PRINTK2("neigh %p is now reachable.\n", neigh); neigh->nud_state = NUD_REACHABLE; + neigh->updated = jiffies; neigh_connect(neigh); next = neigh->confirmed + neigh->parms->reachable_time; } else { NEIGH_PRINTK2("neigh %p is probed.\n", neigh); neigh->nud_state = NUD_PROBE; + neigh->updated = jiffies; atomic_set(&neigh->probes, 0); next = now + neigh->parms->retrans_time; } @@ -780,6 +784,7 @@ static void neigh_timer_handler(unsigned struct sk_buff *skb; neigh->nud_state = NUD_FAILED; + neigh->updated = jiffies; notify = 1; NEIGH_CACHE_STAT_INC(neigh->tbl, res_failed); NEIGH_PRINTK2("neigh %p is failed.\n", neigh); @@ -843,10 +848,12 @@ int __neigh_event_send(struct neighbour if (neigh->parms->mcast_probes + neigh->parms->app_probes) { atomic_set(&neigh->probes, neigh->parms->ucast_probes); neigh->nud_state = NUD_INCOMPLETE; + neigh->updated = jiffies; neigh_hold(neigh); neigh_add_timer(neigh, now + 1); } else { neigh->nud_state = NUD_FAILED; + neigh->updated = jiffies; write_unlock_bh(&neigh->lock); if (skb) @@ -857,6 +864,7 @@ int __neigh_event_send(struct neighbour NEIGH_PRINTK2("neigh %p is delayed.\n", neigh); neigh_hold(neigh); neigh->nud_state = NUD_DELAY; + neigh->updated = jiffies; neigh_add_timer(neigh, jiffies + neigh->parms->delay_probe_time); } diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index e8b2acb..25fced6 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -91,6 +91,7 @@ NETDEVICE_SHOW(iflink, fmt_dec); NETDEVICE_SHOW(ifindex, fmt_dec); NETDEVICE_SHOW(features, fmt_long_hex); NETDEVICE_SHOW(type, fmt_dec); +NETDEVICE_SHOW(link_mode, fmt_dec); /* use same locking rules as GIFHWADDR ioctl's */ static ssize_t format_addr(char *buf, const unsigned char *addr, int len) @@ -133,6 +134,43 @@ static ssize_t show_carrier(struct class return -EINVAL; } +static ssize_t show_dormant(struct class_device *dev, char *buf) +{ + struct net_device *netdev = to_net_dev(dev); + + if (netif_running(netdev)) + return sprintf(buf, fmt_dec, !!netif_dormant(netdev)); + + return -EINVAL; +} + +static const char *operstates[] = { + "unknown", + "notpresent", /* currently unused */ + "down", + "lowerlayerdown", + "testing", /* currently unused */ + "dormant", + "up" +}; + +static ssize_t show_operstate(struct class_device *dev, char *buf) +{ + const struct net_device *netdev = to_net_dev(dev); + unsigned char operstate; + + read_lock(&dev_base_lock); + operstate = netdev->operstate; + if (!netif_running(netdev)) + operstate = IF_OPER_DOWN; + read_unlock(&dev_base_lock); + + if (operstate >= sizeof(operstates)) + return -EINVAL; /* should not happen */ + + return sprintf(buf, "%s\n", operstates[operstate]); +} + /* read-write attributes */ NETDEVICE_SHOW(mtu, fmt_dec); @@ -190,9 +228,12 @@ static struct class_device_attribute net __ATTR(ifindex, S_IRUGO, show_ifindex, NULL), __ATTR(features, S_IRUGO, show_features, NULL), __ATTR(type, S_IRUGO, show_type, NULL), + __ATTR(link_mode, S_IRUGO, show_link_mode, NULL), __ATTR(address, S_IRUGO, show_address, NULL), __ATTR(broadcast, S_IRUGO, show_broadcast, NULL), __ATTR(carrier, S_IRUGO, show_carrier, NULL), + __ATTR(dormant, S_IRUGO, show_dormant, NULL), + __ATTR(operstate, S_IRUGO, show_operstate, NULL), __ATTR(mtu, S_IRUGO | S_IWUSR, show_mtu, store_mtu), __ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags), __ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len, diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 8700379..7a40764 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -179,6 +179,33 @@ rtattr_failure: } +static void set_operstate(struct net_device *dev, unsigned char transition) +{ + unsigned char operstate = dev->operstate; + + switch(transition) { + case IF_OPER_UP: + if ((operstate == IF_OPER_DORMANT || + operstate == IF_OPER_UNKNOWN) && + !netif_dormant(dev)) + operstate = IF_OPER_UP; + break; + + case IF_OPER_DORMANT: + if (operstate == IF_OPER_UP || + operstate == IF_OPER_UNKNOWN) + operstate = IF_OPER_DORMANT; + break; + }; + + if (dev->operstate != operstate) { + write_lock_bh(&dev_base_lock); + dev->operstate = operstate; + write_unlock_bh(&dev_base_lock); + netdev_state_change(dev); + } +} + static int rtnetlink_fill_ifinfo(struct sk_buff *skb, struct net_device *dev, int type, u32 pid, u32 seq, u32 change, unsigned int flags) @@ -209,6 +236,13 @@ static int rtnetlink_fill_ifinfo(struct } if (1) { + u8 operstate = netif_running(dev)?dev->operstate:IF_OPER_DOWN; + u8 link_mode = dev->link_mode; + RTA_PUT(skb, IFLA_OPERSTATE, sizeof(operstate), &operstate); + RTA_PUT(skb, IFLA_LINKMODE, sizeof(link_mode), &link_mode); + } + + if (1) { struct rtnl_link_ifmap map = { .mem_start = dev->mem_start, .mem_end = dev->mem_end, @@ -399,6 +433,22 @@ static int do_setlink(struct sk_buff *sk dev->weight = *((u32 *) RTA_DATA(ida[IFLA_WEIGHT - 1])); } + if (ida[IFLA_OPERSTATE - 1]) { + if (ida[IFLA_OPERSTATE - 1]->rta_len != RTA_LENGTH(sizeof(u8))) + goto out; + + set_operstate(dev, *((u8 *) RTA_DATA(ida[IFLA_OPERSTATE - 1]))); + } + + if (ida[IFLA_LINKMODE - 1]) { + if (ida[IFLA_LINKMODE - 1]->rta_len != RTA_LENGTH(sizeof(u8))) + goto out; + + write_lock_bh(&dev_base_lock); + dev->link_mode = *((u8 *) RTA_DATA(ida[IFLA_LINKMODE - 1])); + write_unlock_bh(&dev_base_lock); + } + if (ifm->ifi_index >= 0 && ida[IFLA_IFNAME - 1]) { char ifname[IFNAMSIZ]; diff --git a/net/dccp/Kconfig b/net/dccp/Kconfig index 187ac18..24a6981 100644 --- a/net/dccp/Kconfig +++ b/net/dccp/Kconfig @@ -24,6 +24,10 @@ config INET_DCCP_DIAG def_tristate y if (IP_DCCP = y && INET_DIAG = y) def_tristate m +config IP_DCCP_ACKVEC + depends on IP_DCCP + def_bool N + source "net/dccp/ccids/Kconfig" menu "DCCP Kernel Hacking" diff --git a/net/dccp/Makefile b/net/dccp/Makefile index 87b27ff..5736ace 100644 --- a/net/dccp/Makefile +++ b/net/dccp/Makefile @@ -4,7 +4,7 @@ dccp_ipv6-y := ipv6.o obj-$(CONFIG_IP_DCCP) += dccp.o -dccp-y := ccid.o input.o ipv4.o minisocks.o options.o output.o proto.o \ +dccp-y := ccid.o feat.o input.o ipv4.o minisocks.o options.o output.o proto.o \ timer.o dccp-$(CONFIG_IP_DCCP_ACKVEC) += ackvec.o diff --git a/net/dccp/ackvec.c b/net/dccp/ackvec.c index 2c77daf..08b1b08 100644 --- a/net/dccp/ackvec.c +++ b/net/dccp/ackvec.c @@ -13,18 +13,77 @@ #include "dccp.h" #include +#include +#include +#include #include +#include #include +static kmem_cache_t *dccp_ackvec_slab; +static kmem_cache_t *dccp_ackvec_record_slab; + +static struct dccp_ackvec_record *dccp_ackvec_record_new(void) +{ + struct dccp_ackvec_record *avr = + kmem_cache_alloc(dccp_ackvec_record_slab, GFP_ATOMIC); + + if (avr != NULL) + INIT_LIST_HEAD(&avr->dccpavr_node); + + return avr; +} + +static void dccp_ackvec_record_delete(struct dccp_ackvec_record *avr) +{ + if (unlikely(avr == NULL)) + return; + /* Check if deleting a linked record */ + WARN_ON(!list_empty(&avr->dccpavr_node)); + kmem_cache_free(dccp_ackvec_record_slab, avr); +} + +static void dccp_ackvec_insert_avr(struct dccp_ackvec *av, + struct dccp_ackvec_record *avr) +{ + /* + * AVRs are sorted by seqno. Since we are sending them in order, we + * just add the AVR at the head of the list. + * -sorbo. + */ + if (!list_empty(&av->dccpav_records)) { + const struct dccp_ackvec_record *head = + list_entry(av->dccpav_records.next, + struct dccp_ackvec_record, + dccpavr_node); + BUG_ON(before48(avr->dccpavr_ack_seqno, + head->dccpavr_ack_seqno)); + } + + list_add(&avr->dccpavr_node, &av->dccpav_records); +} + int dccp_insert_option_ackvec(struct sock *sk, struct sk_buff *skb) { struct dccp_sock *dp = dccp_sk(sk); +#ifdef CONFIG_IP_DCCP_DEBUG + const char *debug_prefix = dp->dccps_role == DCCP_ROLE_CLIENT ? + "CLIENT tx: " : "server tx: "; +#endif struct dccp_ackvec *av = dp->dccps_hc_rx_ackvec; int len = av->dccpav_vec_len + 2; struct timeval now; u32 elapsed_time; unsigned char *to, *from; + struct dccp_ackvec_record *avr; + + if (DCCP_SKB_CB(skb)->dccpd_opt_len + len > DCCP_MAX_OPT_LEN) + return -1; + + avr = dccp_ackvec_record_new(); + if (avr == NULL) + return -1; dccp_timestamp(sk, &now); elapsed_time = timeval_delta(&now, &av->dccpav_time) / 10; @@ -32,19 +91,6 @@ int dccp_insert_option_ackvec(struct soc if (elapsed_time != 0) dccp_insert_option_elapsed_time(sk, skb, elapsed_time); - if (DCCP_SKB_CB(skb)->dccpd_opt_len + len > DCCP_MAX_OPT_LEN) - return -1; - - /* - * XXX: now we have just one ack vector sent record, so - * we have to wait for it to be cleared. - * - * Of course this is not acceptable, but this is just for - * basic testing now. - */ - if (av->dccpav_ack_seqno != DCCP_MAX_SEQNO + 1) - return -1; - DCCP_SKB_CB(skb)->dccpd_opt_len += len; to = skb_push(skb, len); @@ -55,8 +101,8 @@ int dccp_insert_option_ackvec(struct soc from = av->dccpav_buf + av->dccpav_buf_head; /* Check if buf_head wraps */ - if ((int)av->dccpav_buf_head + len > av->dccpav_vec_len) { - const u32 tailsize = av->dccpav_vec_len - av->dccpav_buf_head; + if ((int)av->dccpav_buf_head + len > DCCP_MAX_ACKVEC_LEN) { + const u32 tailsize = DCCP_MAX_ACKVEC_LEN - av->dccpav_buf_head; memcpy(to, from, tailsize); to += tailsize; @@ -73,45 +119,37 @@ int dccp_insert_option_ackvec(struct soc * sequence number it used for the ack packet; ack_ptr will equal * buf_head; ack_ackno will equal buf_ackno; and ack_nonce will * equal buf_nonce. - * - * This implemention uses just one ack record for now. */ - av->dccpav_ack_seqno = DCCP_SKB_CB(skb)->dccpd_seq; - av->dccpav_ack_ptr = av->dccpav_buf_head; - av->dccpav_ack_ackno = av->dccpav_buf_ackno; - av->dccpav_ack_nonce = av->dccpav_buf_nonce; - av->dccpav_sent_len = av->dccpav_vec_len; + avr->dccpavr_ack_seqno = DCCP_SKB_CB(skb)->dccpd_seq; + avr->dccpavr_ack_ptr = av->dccpav_buf_head; + avr->dccpavr_ack_ackno = av->dccpav_buf_ackno; + avr->dccpavr_ack_nonce = av->dccpav_buf_nonce; + avr->dccpavr_sent_len = av->dccpav_vec_len; + + dccp_ackvec_insert_avr(av, avr); dccp_pr_debug("%sACK Vector 0, len=%d, ack_seqno=%llu, " "ack_ackno=%llu\n", - debug_prefix, av->dccpav_sent_len, - (unsigned long long)av->dccpav_ack_seqno, - (unsigned long long)av->dccpav_ack_ackno); - return -1; + debug_prefix, avr->dccpavr_sent_len, + (unsigned long long)avr->dccpavr_ack_seqno, + (unsigned long long)avr->dccpavr_ack_ackno); + return 0; } -struct dccp_ackvec *dccp_ackvec_alloc(const unsigned int len, - const gfp_t priority) +struct dccp_ackvec *dccp_ackvec_alloc(const gfp_t priority) { - struct dccp_ackvec *av; - - BUG_ON(len == 0); + struct dccp_ackvec *av = kmem_cache_alloc(dccp_ackvec_slab, priority); - if (len > DCCP_MAX_ACKVEC_LEN) - return NULL; - - av = kmalloc(sizeof(*av) + len, priority); if (av != NULL) { - av->dccpav_buf_len = len; av->dccpav_buf_head = - av->dccpav_buf_tail = av->dccpav_buf_len - 1; - av->dccpav_buf_ackno = - av->dccpav_ack_ackno = av->dccpav_ack_seqno = ~0LLU; + av->dccpav_buf_tail = DCCP_MAX_ACKVEC_LEN - 1; + av->dccpav_buf_ackno = DCCP_MAX_SEQNO + 1; av->dccpav_buf_nonce = av->dccpav_buf_nonce = 0; av->dccpav_ack_ptr = 0; av->dccpav_time.tv_sec = 0; av->dccpav_time.tv_usec = 0; av->dccpav_sent_len = av->dccpav_vec_len = 0; + INIT_LIST_HEAD(&av->dccpav_records); } return av; @@ -119,7 +157,20 @@ struct dccp_ackvec *dccp_ackvec_alloc(co void dccp_ackvec_free(struct dccp_ackvec *av) { - kfree(av); + if (unlikely(av == NULL)) + return; + + if (!list_empty(&av->dccpav_records)) { + struct dccp_ackvec_record *avr, *next; + + list_for_each_entry_safe(avr, next, &av->dccpav_records, + dccpavr_node) { + list_del_init(&avr->dccpavr_node); + dccp_ackvec_record_delete(avr); + } + } + + kmem_cache_free(dccp_ackvec_slab, av); } static inline u8 dccp_ackvec_state(const struct dccp_ackvec *av, @@ -146,7 +197,7 @@ static inline int dccp_ackvec_set_buf_he unsigned int gap; long new_head; - if (av->dccpav_vec_len + packets > av->dccpav_buf_len) + if (av->dccpav_vec_len + packets > DCCP_MAX_ACKVEC_LEN) return -ENOBUFS; gap = packets - 1; @@ -158,7 +209,7 @@ static inline int dccp_ackvec_set_buf_he gap + new_head + 1); gap = -new_head; } - new_head += av->dccpav_buf_len; + new_head += DCCP_MAX_ACKVEC_LEN; } av->dccpav_buf_head = new_head; @@ -251,7 +302,7 @@ int dccp_ackvec_add(struct dccp_ackvec * goto out_duplicate; delta -= len + 1; - if (++index == av->dccpav_buf_len) + if (++index == DCCP_MAX_ACKVEC_LEN) index = 0; } } @@ -297,44 +348,50 @@ void dccp_ackvec_print(const struct dccp } #endif -static void dccp_ackvec_throw_away_ack_record(struct dccp_ackvec *av) +static void dccp_ackvec_throw_record(struct dccp_ackvec *av, + struct dccp_ackvec_record *avr) { - /* - * As we're keeping track of the ack vector size (dccpav_vec_len) and - * the sent ack vector size (dccpav_sent_len) we don't need - * dccpav_buf_tail at all, but keep this code here as in the future - * we'll implement a vector of ack records, as suggested in - * draft-ietf-dccp-spec-11.txt Appendix A. -acme - */ -#if 0 - u32 new_buf_tail = av->dccpav_ack_ptr + 1; - if (new_buf_tail >= av->dccpav_vec_len) - new_buf_tail -= av->dccpav_vec_len; - av->dccpav_buf_tail = new_buf_tail; -#endif - av->dccpav_vec_len -= av->dccpav_sent_len; + struct dccp_ackvec_record *next; + + av->dccpav_buf_tail = avr->dccpavr_ack_ptr - 1; + if (av->dccpav_buf_tail == 0) + av->dccpav_buf_tail = DCCP_MAX_ACKVEC_LEN - 1; + + av->dccpav_vec_len -= avr->dccpavr_sent_len; + + /* free records */ + list_for_each_entry_safe_from(avr, next, &av->dccpav_records, + dccpavr_node) { + list_del_init(&avr->dccpavr_node); + dccp_ackvec_record_delete(avr); + } } void dccp_ackvec_check_rcv_ackno(struct dccp_ackvec *av, struct sock *sk, const u64 ackno) { - /* Check if we actually sent an ACK vector */ - if (av->dccpav_ack_seqno == DCCP_MAX_SEQNO + 1) - return; + struct dccp_ackvec_record *avr; - if (ackno == av->dccpav_ack_seqno) { + /* + * If we traverse backwards, it should be faster when we have large + * windows. We will be receiving ACKs for stuff we sent a while back + * -sorbo. + */ + list_for_each_entry_reverse(avr, &av->dccpav_records, dccpavr_node) { + if (ackno == avr->dccpavr_ack_seqno) { #ifdef CONFIG_IP_DCCP_DEBUG - struct dccp_sock *dp = dccp_sk(sk); - const char *debug_prefix = dp->dccps_role == DCCP_ROLE_CLIENT ? - "CLIENT rx ack: " : "server rx ack: "; + struct dccp_sock *dp = dccp_sk(sk); + const char *debug_prefix = dp->dccps_role == DCCP_ROLE_CLIENT ? + "CLIENT rx ack: " : "server rx ack: "; #endif - dccp_pr_debug("%sACK packet 0, len=%d, ack_seqno=%llu, " - "ack_ackno=%llu, ACKED!\n", - debug_prefix, 1, - (unsigned long long)av->dccpav_ack_seqno, - (unsigned long long)av->dccpav_ack_ackno); - dccp_ackvec_throw_away_ack_record(av); - av->dccpav_ack_seqno = DCCP_MAX_SEQNO + 1; + dccp_pr_debug("%sACK packet 0, len=%d, ack_seqno=%llu, " + "ack_ackno=%llu, ACKED!\n", + debug_prefix, 1, + (unsigned long long)avr->dccpavr_ack_seqno, + (unsigned long long)avr->dccpavr_ack_ackno); + dccp_ackvec_throw_record(av, avr); + break; + } } } @@ -344,28 +401,20 @@ static void dccp_ackvec_check_rcv_ackvec const unsigned char *vector) { unsigned char i; + struct dccp_ackvec_record *avr; /* Check if we actually sent an ACK vector */ - if (av->dccpav_ack_seqno == DCCP_MAX_SEQNO + 1) + if (list_empty(&av->dccpav_records)) return; - /* - * We're in the receiver half connection, so if the received an ACK - * vector ackno (e.g. 50) before dccpav_ack_seqno (e.g. 52), we're - * not interested. - * - * Extra explanation with example: - * - * if we received an ACK vector with ackno 50, it can only be acking - * 50, 49, 48, etc, not 52 (the seqno for the ACK vector we sent). - */ - /* dccp_pr_debug("is %llu < %llu? ", ackno, av->dccpav_ack_seqno); */ - if (before48(ackno, av->dccpav_ack_seqno)) { - /* dccp_pr_debug_cat("yes\n"); */ - return; - } - /* dccp_pr_debug_cat("no\n"); */ i = len; + /* + * XXX + * I think it might be more efficient to work backwards. See comment on + * rcv_ackno. -sorbo. + */ + avr = list_entry(av->dccpav_records.next, struct dccp_ackvec_record, + dccpavr_node); while (i--) { const u8 rl = *vector & DCCP_ACKVEC_LEN_MASK; u64 ackno_end_rl; @@ -373,14 +422,20 @@ static void dccp_ackvec_check_rcv_ackvec dccp_set_seqno(&ackno_end_rl, ackno - rl); /* - * dccp_pr_debug("is %llu <= %llu <= %llu? ", ackno_end_rl, - * av->dccpav_ack_seqno, ackno); + * If our AVR sequence number is greater than the ack, go + * forward in the AVR list until it is not so. */ - if (between48(av->dccpav_ack_seqno, ackno_end_rl, ackno)) { + list_for_each_entry_from(avr, &av->dccpav_records, + dccpavr_node) { + if (!after48(avr->dccpavr_ack_seqno, ackno)) + goto found; + } + /* End of the dccpav_records list, not found, exit */ + break; +found: + if (between48(avr->dccpavr_ack_seqno, ackno_end_rl, ackno)) { const u8 state = (*vector & DCCP_ACKVEC_STATE_MASK) >> 6; - /* dccp_pr_debug_cat("yes\n"); */ - if (state != DCCP_ACKVEC_STATE_NOT_RECEIVED) { #ifdef CONFIG_IP_DCCP_DEBUG struct dccp_sock *dp = dccp_sk(sk); @@ -393,19 +448,16 @@ static void dccp_ackvec_check_rcv_ackvec "ACKED!\n", debug_prefix, len, (unsigned long long) - av->dccpav_ack_seqno, + avr->dccpavr_ack_seqno, (unsigned long long) - av->dccpav_ack_ackno); - dccp_ackvec_throw_away_ack_record(av); + avr->dccpavr_ack_ackno); + dccp_ackvec_throw_record(av, avr); } /* - * If dccpav_ack_seqno was not received, no problem - * we'll send another ACK vector. + * If it wasn't received, continue scanning... we might + * find another one. */ - av->dccpav_ack_seqno = DCCP_MAX_SEQNO + 1; - break; } - /* dccp_pr_debug_cat("no\n"); */ dccp_set_seqno(&ackno, ackno_end_rl - 1); ++vector; @@ -424,3 +476,43 @@ int dccp_ackvec_parse(struct sock *sk, c len, value); return 0; } + +static char dccp_ackvec_slab_msg[] __initdata = + KERN_CRIT "DCCP: Unable to create ack vectors slab caches\n"; + +int __init dccp_ackvec_init(void) +{ + dccp_ackvec_slab = kmem_cache_create("dccp_ackvec", + sizeof(struct dccp_ackvec), 0, + SLAB_HWCACHE_ALIGN, NULL, NULL); + if (dccp_ackvec_slab == NULL) + goto out_err; + + dccp_ackvec_record_slab = + kmem_cache_create("dccp_ackvec_record", + sizeof(struct dccp_ackvec_record), + 0, SLAB_HWCACHE_ALIGN, NULL, NULL); + if (dccp_ackvec_record_slab == NULL) + goto out_destroy_slab; + + return 0; + +out_destroy_slab: + kmem_cache_destroy(dccp_ackvec_slab); + dccp_ackvec_slab = NULL; +out_err: + printk(dccp_ackvec_slab_msg); + return -ENOBUFS; +} + +void dccp_ackvec_exit(void) +{ + if (dccp_ackvec_slab != NULL) { + kmem_cache_destroy(dccp_ackvec_slab); + dccp_ackvec_slab = NULL; + } + if (dccp_ackvec_record_slab != NULL) { + kmem_cache_destroy(dccp_ackvec_record_slab); + dccp_ackvec_record_slab = NULL; + } +} diff --git a/net/dccp/ackvec.h b/net/dccp/ackvec.h index f7dfb5f..1d9ec4a 100644 --- a/net/dccp/ackvec.h +++ b/net/dccp/ackvec.h @@ -13,6 +13,7 @@ #include #include +#include #include #include @@ -42,39 +43,57 @@ * Ack Vectors it has recently sent. For each packet sent carrying an * Ack Vector, it remembers four variables: * - * @dccpav_ack_seqno - the Sequence Number used for the packet - * (HC-Receiver seqno) * @dccpav_ack_ptr - the value of buf_head at the time of acknowledgement. - * @dccpav_ack_ackno - the Acknowledgement Number used for the packet - * (HC-Sender seqno) + * @dccpav_records - list of dccp_ackvec_record * @dccpav_ack_nonce - the one-bit sum of the ECN Nonces for all State 0. * - * @dccpav_buf_len - circular buffer length * @dccpav_time - the time in usecs * @dccpav_buf - circular buffer of acknowledgeable packets */ struct dccp_ackvec { u64 dccpav_buf_ackno; - u64 dccpav_ack_seqno; - u64 dccpav_ack_ackno; + struct list_head dccpav_records; struct timeval dccpav_time; u8 dccpav_buf_head; u8 dccpav_buf_tail; u8 dccpav_ack_ptr; u8 dccpav_sent_len; u8 dccpav_vec_len; - u8 dccpav_buf_len; u8 dccpav_buf_nonce; u8 dccpav_ack_nonce; - u8 dccpav_buf[0]; + u8 dccpav_buf[DCCP_MAX_ACKVEC_LEN]; +}; + +/** struct dccp_ackvec_record - ack vector record + * + * ACK vector record as defined in Appendix A of spec. + * + * The list is sorted by dccpavr_ack_seqno + * + * @dccpavr_node - node in dccpav_records + * @dccpavr_ack_seqno - sequence number of the packet this record was sent on + * @dccpavr_ack_ackno - sequence number being acknowledged + * @dccpavr_ack_ptr - pointer into dccpav_buf where this record starts + * @dccpavr_ack_nonce - dccpav_ack_nonce at the time this record was sent + * @dccpavr_sent_len - lenght of the record in dccpav_buf + */ +struct dccp_ackvec_record { + struct list_head dccpavr_node; + u64 dccpavr_ack_seqno; + u64 dccpavr_ack_ackno; + u8 dccpavr_ack_ptr; + u8 dccpavr_ack_nonce; + u8 dccpavr_sent_len; }; struct sock; struct sk_buff; #ifdef CONFIG_IP_DCCP_ACKVEC -extern struct dccp_ackvec *dccp_ackvec_alloc(unsigned int len, - const gfp_t priority); +extern int dccp_ackvec_init(void); +extern void dccp_ackvec_exit(void); + +extern struct dccp_ackvec *dccp_ackvec_alloc(const gfp_t priority); extern void dccp_ackvec_free(struct dccp_ackvec *av); extern int dccp_ackvec_add(struct dccp_ackvec *av, const struct sock *sk, @@ -92,8 +111,16 @@ static inline int dccp_ackvec_pending(co return av->dccpav_sent_len != av->dccpav_vec_len; } #else /* CONFIG_IP_DCCP_ACKVEC */ -static inline struct dccp_ackvec *dccp_ackvec_alloc(unsigned int len, - const gfp_t priority) +static inline int dccp_ackvec_init(void) +{ + return 0; +} + +static inline void dccp_ackvec_exit(void) +{ +} + +static inline struct dccp_ackvec *dccp_ackvec_alloc(const gfp_t priority) { return NULL; } diff --git a/net/dccp/ccid.c b/net/dccp/ccid.c index 9d8fc0e..06b191a 100644 --- a/net/dccp/ccid.c +++ b/net/dccp/ccid.c @@ -59,9 +59,6 @@ int ccid_register(struct ccid *ccid) { int err; - if (ccid->ccid_init == NULL) - return -1; - ccids_write_lock(); err = -EEXIST; if (ccids[ccid->ccid_id] == NULL) { @@ -106,7 +103,7 @@ struct ccid *ccid_init(unsigned char id, if (!try_module_get(ccid->ccid_owner)) goto out_err; - if (ccid->ccid_init(sk) != 0) + if (ccid->ccid_init != NULL && ccid->ccid_init(sk) != 0) goto out_module_put; out: ccids_read_unlock(); diff --git a/net/dccp/ccids/Kconfig b/net/dccp/ccids/Kconfig index 7684d83..422af19 100644 --- a/net/dccp/ccids/Kconfig +++ b/net/dccp/ccids/Kconfig @@ -1,6 +1,34 @@ menu "DCCP CCIDs Configuration (EXPERIMENTAL)" depends on IP_DCCP && EXPERIMENTAL +config IP_DCCP_CCID2 + tristate "CCID2 (TCP) (EXPERIMENTAL)" + depends on IP_DCCP + select IP_DCCP_ACKVEC + ---help--- + CCID 2, TCP-like Congestion Control, denotes Additive Increase, + Multiplicative Decrease (AIMD) congestion control with behavior + modelled directly on TCP, including congestion window, slow start, + timeouts, and so forth [RFC 2581]. CCID 2 achieves maximum + bandwidth over the long term, consistent with the use of end-to-end + congestion control, but halves its congestion window in response to + each congestion event. This leads to the abrupt rate changes + typical of TCP. Applications should use CCID 2 if they prefer + maximum bandwidth utilization to steadiness of rate. This is often + the case for applications that are not playing their data directly + to the user. For example, a hypothetical application that + transferred files over DCCP, using application-level retransmissions + for lost packets, would prefer CCID 2 to CCID 3. On-line games may + also prefer CCID 2. + + CCID 2 is further described in: + http://www.icir.org/kohler/dccp/draft-ietf-dccp-ccid2-10.txt + + This text was extracted from: + http://www.icir.org/kohler/dccp/draft-ietf-dccp-spec-13.txt + + If in doubt, say M. + config IP_DCCP_CCID3 tristate "CCID3 (TFRC) (EXPERIMENTAL)" depends on IP_DCCP @@ -15,10 +43,15 @@ config IP_DCCP_CCID3 suitable than CCID 2 for applications such streaming media where a relatively smooth sending rate is of importance. - CCID 3 is further described in [CCID 3 PROFILE]. The TFRC - congestion control algorithms were initially described in RFC 3448. + CCID 3 is further described in: + + http://www.icir.org/kohler/dccp/draft-ietf-dccp-ccid3-11.txt. + + The TFRC congestion control algorithms were initially described in + RFC 3448. - This text was extracted from draft-ietf-dccp-spec-11.txt. + This text was extracted from: + http://www.icir.org/kohler/dccp/draft-ietf-dccp-spec-13.txt If in doubt, say M. diff --git a/net/dccp/ccids/Makefile b/net/dccp/ccids/Makefile index 956f79f..438f20b 100644 --- a/net/dccp/ccids/Makefile +++ b/net/dccp/ccids/Makefile @@ -2,4 +2,8 @@ obj-$(CONFIG_IP_DCCP_CCID3) += dccp_ccid dccp_ccid3-y := ccid3.o +obj-$(CONFIG_IP_DCCP_CCID2) += dccp_ccid2.o + +dccp_ccid2-y := ccid2.o + obj-y += lib/ diff --git a/net/dccp/ccids/ccid2.c b/net/dccp/ccids/ccid2.c new file mode 100644 index 0000000..5495980 --- /dev/null +++ b/net/dccp/ccids/ccid2.c @@ -0,0 +1,838 @@ +/* + * net/dccp/ccids/ccid2.c + * + * Copyright (c) 2005, 2006 Andrea Bittau + * + * Changes to meet Linux coding standards, and DCCP infrastructure fixes. + * + * Copyright (c) 2006 Arnaldo Carvalho de Melo + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +/* + * This implementation should follow: draft-ietf-dccp-ccid2-10.txt + * + * BUGS: + * - sequence number wrapping + * - jiffies wrapping + */ + +#include +#include "../ccid.h" +#include "../dccp.h" +#include "ccid2.h" + +static int ccid2_debug; + +#if 0 +#define CCID2_DEBUG +#endif + +#ifdef CCID2_DEBUG +#define ccid2_pr_debug(format, a...) \ + do { if (ccid2_debug) \ + printk(KERN_DEBUG "%s: " format, __FUNCTION__, ##a); \ + } while (0) +#else +#define ccid2_pr_debug(format, a...) +#endif + +static const int ccid2_seq_len = 128; + +static inline struct ccid2_hc_tx_sock *ccid2_hc_tx_sk(const struct sock *sk) +{ + return dccp_sk(sk)->dccps_hc_tx_ccid_private; +} + +static inline struct ccid2_hc_rx_sock *ccid2_hc_rx_sk(const struct sock *sk) +{ + return dccp_sk(sk)->dccps_hc_rx_ccid_private; +} + +#ifdef CCID2_DEBUG +static void ccid2_hc_tx_check_sanity(const struct ccid2_hc_tx_sock *hctx) +{ + int len = 0; + struct ccid2_seq *seqp; + int pipe = 0; + + seqp = hctx->ccid2hctx_seqh; + + /* there is data in the chain */ + if (seqp != hctx->ccid2hctx_seqt) { + seqp = seqp->ccid2s_prev; + len++; + if (!seqp->ccid2s_acked) + pipe++; + + while (seqp != hctx->ccid2hctx_seqt) { + struct ccid2_seq *prev; + + prev = seqp->ccid2s_prev; + len++; + if (!prev->ccid2s_acked) + pipe++; + + /* packets are sent sequentially */ + BUG_ON(seqp->ccid2s_seq <= prev->ccid2s_seq); + BUG_ON(seqp->ccid2s_sent < prev->ccid2s_sent); + BUG_ON(len > ccid2_seq_len); + + seqp = prev; + } + } + + BUG_ON(pipe != hctx->ccid2hctx_pipe); + ccid2_pr_debug("len of chain=%d\n", len); + + do { + seqp = seqp->ccid2s_prev; + len++; + BUG_ON(len > ccid2_seq_len); + } while(seqp != hctx->ccid2hctx_seqh); + + BUG_ON(len != ccid2_seq_len); + ccid2_pr_debug("total len=%d\n", len); +} +#else +#define ccid2_hc_tx_check_sanity(hctx) do {} while (0) +#endif + +static int ccid2_hc_tx_send_packet(struct sock *sk, + struct sk_buff *skb, int len) +{ + struct ccid2_hc_tx_sock *hctx; + + switch (DCCP_SKB_CB(skb)->dccpd_type) { + case 0: /* XXX data packets from userland come through like this */ + case DCCP_PKT_DATA: + case DCCP_PKT_DATAACK: + break; + /* No congestion control on other packets */ + default: + return 0; + } + + hctx = ccid2_hc_tx_sk(sk); + + ccid2_pr_debug("pipe=%d cwnd=%d\n", hctx->ccid2hctx_pipe, + hctx->ccid2hctx_cwnd); + + if (hctx->ccid2hctx_pipe < hctx->ccid2hctx_cwnd) { + /* OK we can send... make sure previous packet was sent off */ + if (!hctx->ccid2hctx_sendwait) { + hctx->ccid2hctx_sendwait = 1; + return 0; + } + } + + return 100; /* XXX */ +} + +static void ccid2_change_l_ack_ratio(struct sock *sk, int val) +{ + struct dccp_sock *dp = dccp_sk(sk); + /* + * XXX I don't really agree with val != 2. If cwnd is 1, ack ratio + * should be 1... it shouldn't be allowed to become 2. + * -sorbo. + */ + if (val != 2) { + struct ccid2_hc_tx_sock *hctx = ccid2_hc_tx_sk(sk); + int max = hctx->ccid2hctx_cwnd / 2; + + /* round up */ + if (hctx->ccid2hctx_cwnd & 1) + max++; + + if (val > max) + val = max; + } + + ccid2_pr_debug("changing local ack ratio to %d\n", val); + WARN_ON(val <= 0); + dp->dccps_l_ack_ratio = val; +} + +static void ccid2_change_cwnd(struct sock *sk, int val) +{ + struct ccid2_hc_tx_sock *hctx = ccid2_hc_tx_sk(sk); + + if (val == 0) + val = 1; + + /* XXX do we need to change ack ratio? */ + ccid2_pr_debug("change cwnd to %d\n", val); + + BUG_ON(val < 1); + hctx->ccid2hctx_cwnd = val; +} + +static void ccid2_start_rto_timer(struct sock *sk); + +static void ccid2_hc_tx_rto_expire(unsigned long data) +{ + struct sock *sk = (struct sock *)data; + struct ccid2_hc_tx_sock *hctx = ccid2_hc_tx_sk(sk); + long s; + + /* XXX I don't think i'm locking correctly + * -sorbo. + */ + bh_lock_sock(sk); + if (sock_owned_by_user(sk)) { + sk_reset_timer(sk, &hctx->ccid2hctx_rtotimer, + jiffies + HZ / 5); + goto out; + } + + ccid2_pr_debug("RTO_EXPIRE\n"); + + ccid2_hc_tx_check_sanity(hctx); + + /* back-off timer */ + hctx->ccid2hctx_rto <<= 1; + + s = hctx->ccid2hctx_rto / HZ; + if (s > 60) + hctx->ccid2hctx_rto = 60 * HZ; + + ccid2_start_rto_timer(sk); + + /* adjust pipe, cwnd etc */ + hctx->ccid2hctx_pipe = 0; + hctx->ccid2hctx_ssthresh = hctx->ccid2hctx_cwnd >> 1; + if (hctx->ccid2hctx_ssthresh < 2) + hctx->ccid2hctx_ssthresh = 2; + ccid2_change_cwnd(sk, 1); + + /* clear state about stuff we sent */ + hctx->ccid2hctx_seqt = hctx->ccid2hctx_seqh; + hctx->ccid2hctx_ssacks = 0; + hctx->ccid2hctx_acks = 0; + hctx->ccid2hctx_sent = 0; + + /* clear ack ratio state. */ + hctx->ccid2hctx_arsent = 0; + hctx->ccid2hctx_ackloss = 0; + hctx->ccid2hctx_rpseq = 0; + hctx->ccid2hctx_rpdupack = -1; + ccid2_change_l_ack_ratio(sk, 1); + ccid2_hc_tx_check_sanity(hctx); +out: + bh_unlock_sock(sk); +/* sock_put(sk); */ +} + +static void ccid2_start_rto_timer(struct sock *sk) +{ + struct ccid2_hc_tx_sock *hctx = ccid2_hc_tx_sk(sk); + + ccid2_pr_debug("setting RTO timeout=%ld\n", hctx->ccid2hctx_rto); + + BUG_ON(timer_pending(&hctx->ccid2hctx_rtotimer)); + sk_reset_timer(sk, &hctx->ccid2hctx_rtotimer, + jiffies + hctx->ccid2hctx_rto); +} + +static void ccid2_hc_tx_packet_sent(struct sock *sk, int more, int len) +{ + struct dccp_sock *dp = dccp_sk(sk); + struct ccid2_hc_tx_sock *hctx = ccid2_hc_tx_sk(sk); + u64 seq; + + ccid2_hc_tx_check_sanity(hctx); + + BUG_ON(!hctx->ccid2hctx_sendwait); + hctx->ccid2hctx_sendwait = 0; + hctx->ccid2hctx_pipe++; + BUG_ON(hctx->ccid2hctx_pipe < 0); + + /* There is an issue. What if another packet is sent between + * packet_send() and packet_sent(). Then the sequence number would be + * wrong. + * -sorbo. + */ + seq = dp->dccps_gss; + + hctx->ccid2hctx_seqh->ccid2s_seq = seq; + hctx->ccid2hctx_seqh->ccid2s_acked = 0; + hctx->ccid2hctx_seqh->ccid2s_sent = jiffies; + hctx->ccid2hctx_seqh = hctx->ccid2hctx_seqh->ccid2s_next; + + ccid2_pr_debug("cwnd=%d pipe=%d\n", hctx->ccid2hctx_cwnd, + hctx->ccid2hctx_pipe); + + if (hctx->ccid2hctx_seqh == hctx->ccid2hctx_seqt) { + /* XXX allocate more space */ + WARN_ON(1); + } + + hctx->ccid2hctx_sent++; + + /* Ack Ratio. Need to maintain a concept of how many windows we sent */ + hctx->ccid2hctx_arsent++; + /* We had an ack loss in this window... */ + if (hctx->ccid2hctx_ackloss) { + if (hctx->ccid2hctx_arsent >= hctx->ccid2hctx_cwnd) { + hctx->ccid2hctx_arsent = 0; + hctx->ccid2hctx_ackloss = 0; + } + } + /* No acks lost up to now... */ + else { + /* decrease ack ratio if enough packets were sent */ + if (dp->dccps_l_ack_ratio > 1) { + /* XXX don't calculate denominator each time */ + int denom; + + denom = dp->dccps_l_ack_ratio * dp->dccps_l_ack_ratio - + dp->dccps_l_ack_ratio; + denom = hctx->ccid2hctx_cwnd * hctx->ccid2hctx_cwnd / denom; + + if (hctx->ccid2hctx_arsent >= denom) { + ccid2_change_l_ack_ratio(sk, dp->dccps_l_ack_ratio - 1); + hctx->ccid2hctx_arsent = 0; + } + } + /* we can't increase ack ratio further [1] */ + else { + hctx->ccid2hctx_arsent = 0; /* or maybe set it to cwnd*/ + } + } + + /* setup RTO timer */ + if (!timer_pending(&hctx->ccid2hctx_rtotimer)) { + ccid2_start_rto_timer(sk); + } +#ifdef CCID2_DEBUG + ccid2_pr_debug("pipe=%d\n", hctx->ccid2hctx_pipe); + ccid2_pr_debug("Sent: seq=%llu\n", seq); + do { + struct ccid2_seq *seqp = hctx->ccid2hctx_seqt; + + while (seqp != hctx->ccid2hctx_seqh) { + ccid2_pr_debug("out seq=%llu acked=%d time=%lu\n", + seqp->ccid2s_seq, seqp->ccid2s_acked, + seqp->ccid2s_sent); + seqp = seqp->ccid2s_next; + } + } while(0); + ccid2_pr_debug("=========\n"); + ccid2_hc_tx_check_sanity(hctx); +#endif +} + +/* XXX Lame code duplication! + * returns -1 if none was found. + * else returns the next offset to use in the function call. + */ +static int ccid2_ackvector(struct sock *sk, struct sk_buff *skb, int offset, + unsigned char **vec, unsigned char *veclen) +{ + const struct dccp_hdr *dh = dccp_hdr(skb); + unsigned char *options = (unsigned char *)dh + dccp_hdr_len(skb); + unsigned char *opt_ptr; + const unsigned char *opt_end = (unsigned char *)dh + + (dh->dccph_doff * 4); + unsigned char opt, len; + unsigned char *value; + + BUG_ON(offset < 0); + options += offset; + opt_ptr = options; + if (opt_ptr >= opt_end) + return -1; + + while (opt_ptr != opt_end) { + opt = *opt_ptr++; + len = 0; + value = NULL; + + /* Check if this isn't a single byte option */ + if (opt > DCCPO_MAX_RESERVED) { + if (opt_ptr == opt_end) + goto out_invalid_option; + + len = *opt_ptr++; + if (len < 3) + goto out_invalid_option; + /* + * Remove the type and len fields, leaving + * just the value size + */ + len -= 2; + value = opt_ptr; + opt_ptr += len; + + if (opt_ptr > opt_end) + goto out_invalid_option; + } + + switch (opt) { + case DCCPO_ACK_VECTOR_0: + case DCCPO_ACK_VECTOR_1: + *vec = value; + *veclen = len; + return offset + (opt_ptr - options); + break; + } + } + + return -1; + +out_invalid_option: + BUG_ON(1); /* should never happen... options were previously parsed ! */ + return -1; +} + +static void ccid2_hc_tx_kill_rto_timer(struct ccid2_hc_tx_sock *hctx) +{ + if (del_timer(&hctx->ccid2hctx_rtotimer)) + ccid2_pr_debug("deleted RTO timer\n"); +} + +static inline void ccid2_new_ack(struct sock *sk, + struct ccid2_seq *seqp, + unsigned int *maxincr) +{ + struct ccid2_hc_tx_sock *hctx = ccid2_hc_tx_sk(sk); + + /* slow start */ + if (hctx->ccid2hctx_cwnd < hctx->ccid2hctx_ssthresh) { + hctx->ccid2hctx_acks = 0; + + /* We can increase cwnd at most maxincr [ack_ratio/2] */ + if (*maxincr) { + /* increase every 2 acks */ + hctx->ccid2hctx_ssacks++; + if (hctx->ccid2hctx_ssacks == 2) { + ccid2_change_cwnd(sk, hctx->ccid2hctx_cwnd + 1); + hctx->ccid2hctx_ssacks = 0; + *maxincr = *maxincr - 1; + } + } + /* increased cwnd enough for this single ack */ + else { + hctx->ccid2hctx_ssacks = 0; + } + } + else { + hctx->ccid2hctx_ssacks = 0; + hctx->ccid2hctx_acks++; + + if (hctx->ccid2hctx_acks >= hctx->ccid2hctx_cwnd) { + ccid2_change_cwnd(sk, hctx->ccid2hctx_cwnd + 1); + hctx->ccid2hctx_acks = 0; + } + } + + /* update RTO */ + if (hctx->ccid2hctx_srtt == -1 || + (jiffies - hctx->ccid2hctx_lastrtt) >= hctx->ccid2hctx_srtt) { + unsigned long r = jiffies - seqp->ccid2s_sent; + int s; + + /* first measurement */ + if (hctx->ccid2hctx_srtt == -1) { + ccid2_pr_debug("R: %lu Time=%lu seq=%llu\n", + r, jiffies, seqp->ccid2s_seq); + hctx->ccid2hctx_srtt = r; + hctx->ccid2hctx_rttvar = r >> 1; + } + else { + /* RTTVAR */ + long tmp = hctx->ccid2hctx_srtt - r; + if (tmp < 0) + tmp *= -1; + + tmp >>= 2; + hctx->ccid2hctx_rttvar *= 3; + hctx->ccid2hctx_rttvar >>= 2; + hctx->ccid2hctx_rttvar += tmp; + + /* SRTT */ + hctx->ccid2hctx_srtt *= 7; + hctx->ccid2hctx_srtt >>= 3; + tmp = r >> 3; + hctx->ccid2hctx_srtt += tmp; + } + s = hctx->ccid2hctx_rttvar << 2; + /* clock granularity is 1 when based on jiffies */ + if (!s) + s = 1; + hctx->ccid2hctx_rto = hctx->ccid2hctx_srtt + s; + + /* must be at least a second */ + s = hctx->ccid2hctx_rto / HZ; + /* DCCP doesn't require this [but I like it cuz my code sux] */ +#if 1 + if (s < 1) + hctx->ccid2hctx_rto = HZ; +#endif + /* max 60 seconds */ + if (s > 60) + hctx->ccid2hctx_rto = HZ * 60; + + hctx->ccid2hctx_lastrtt = jiffies; + + ccid2_pr_debug("srtt: %ld rttvar: %ld rto: %ld (HZ=%d) R=%lu\n", + hctx->ccid2hctx_srtt, hctx->ccid2hctx_rttvar, + hctx->ccid2hctx_rto, HZ, r); + hctx->ccid2hctx_sent = 0; + } + + /* we got a new ack, so re-start RTO timer */ + ccid2_hc_tx_kill_rto_timer(hctx); + ccid2_start_rto_timer(sk); +} + +static void ccid2_hc_tx_dec_pipe(struct ccid2_hc_tx_sock *hctx) +{ + hctx->ccid2hctx_pipe--; + BUG_ON(hctx->ccid2hctx_pipe < 0); + + if (hctx->ccid2hctx_pipe == 0) + ccid2_hc_tx_kill_rto_timer(hctx); +} + +static void ccid2_hc_tx_packet_recv(struct sock *sk, struct sk_buff *skb) +{ + struct dccp_sock *dp = dccp_sk(sk); + struct ccid2_hc_tx_sock *hctx = ccid2_hc_tx_sk(sk); + u64 ackno, seqno; + struct ccid2_seq *seqp; + unsigned char *vector; + unsigned char veclen; + int offset = 0; + int done = 0; + int loss = 0; + unsigned int maxincr = 0; + + ccid2_hc_tx_check_sanity(hctx); + /* check reverse path congestion */ + seqno = DCCP_SKB_CB(skb)->dccpd_seq; + + /* XXX this whole "algorithm" is broken. Need to fix it to keep track + * of the seqnos of the dupacks so that rpseq and rpdupack are correct + * -sorbo. + */ + /* need to bootstrap */ + if (hctx->ccid2hctx_rpdupack == -1) { + hctx->ccid2hctx_rpdupack = 0; + hctx->ccid2hctx_rpseq = seqno; + } + else { + /* check if packet is consecutive */ + if ((hctx->ccid2hctx_rpseq + 1) == seqno) { + hctx->ccid2hctx_rpseq++; + } + /* it's a later packet */ + else if (after48(seqno, hctx->ccid2hctx_rpseq)) { + hctx->ccid2hctx_rpdupack++; + + /* check if we got enough dupacks */ + if (hctx->ccid2hctx_rpdupack >= + hctx->ccid2hctx_numdupack) { + + hctx->ccid2hctx_rpdupack = -1; /* XXX lame */ + hctx->ccid2hctx_rpseq = 0; + + ccid2_change_l_ack_ratio(sk, dp->dccps_l_ack_ratio << 1); + } + } + } + + /* check forward path congestion */ + /* still didn't send out new data packets */ + if (hctx->ccid2hctx_seqh == hctx->ccid2hctx_seqt) + return; + + switch (DCCP_SKB_CB(skb)->dccpd_type) { + case DCCP_PKT_ACK: + case DCCP_PKT_DATAACK: + break; + + default: + return; + } + + ackno = DCCP_SKB_CB(skb)->dccpd_ack_seq; + seqp = hctx->ccid2hctx_seqh->ccid2s_prev; + + /* If in slow-start, cwnd can increase at most Ack Ratio / 2 packets for + * this single ack. I round up. + * -sorbo. + */ + maxincr = dp->dccps_l_ack_ratio >> 1; + maxincr++; + + /* go through all ack vectors */ + while ((offset = ccid2_ackvector(sk, skb, offset, + &vector, &veclen)) != -1) { + /* go through this ack vector */ + while (veclen--) { + const u8 rl = *vector & DCCP_ACKVEC_LEN_MASK; + u64 ackno_end_rl; + + dccp_set_seqno(&ackno_end_rl, ackno - rl); + ccid2_pr_debug("ackvec start:%llu end:%llu\n", ackno, + ackno_end_rl); + /* if the seqno we are analyzing is larger than the + * current ackno, then move towards the tail of our + * seqnos. + */ + while (after48(seqp->ccid2s_seq, ackno)) { + if (seqp == hctx->ccid2hctx_seqt) { + done = 1; + break; + } + seqp = seqp->ccid2s_prev; + } + if (done) + break; + + /* check all seqnos in the range of the vector + * run length + */ + while (between48(seqp->ccid2s_seq,ackno_end_rl,ackno)) { + const u8 state = (*vector & + DCCP_ACKVEC_STATE_MASK) >> 6; + + /* new packet received or marked */ + if (state != DCCP_ACKVEC_STATE_NOT_RECEIVED && + !seqp->ccid2s_acked) { + if (state == + DCCP_ACKVEC_STATE_ECN_MARKED) { + loss = 1; + } + else { + ccid2_new_ack(sk, seqp, + &maxincr); + } + + seqp->ccid2s_acked = 1; + ccid2_pr_debug("Got ack for %llu\n", + seqp->ccid2s_seq); + ccid2_hc_tx_dec_pipe(hctx); + } + if (seqp == hctx->ccid2hctx_seqt) { + done = 1; + break; + } + seqp = seqp->ccid2s_next; + } + if (done) + break; + + + dccp_set_seqno(&ackno, ackno_end_rl - 1); + vector++; + } + if (done) + break; + } + + /* The state about what is acked should be correct now + * Check for NUMDUPACK + */ + seqp = hctx->ccid2hctx_seqh->ccid2s_prev; + done = 0; + while (1) { + if (seqp->ccid2s_acked) { + done++; + if (done == hctx->ccid2hctx_numdupack) { + break; + } + } + if (seqp == hctx->ccid2hctx_seqt) { + break; + } + seqp = seqp->ccid2s_prev; + } + + /* If there are at least 3 acknowledgements, anything unacknowledged + * below the last sequence number is considered lost + */ + if (done == hctx->ccid2hctx_numdupack) { + struct ccid2_seq *last_acked = seqp; + + /* check for lost packets */ + while (1) { + if (!seqp->ccid2s_acked) { + loss = 1; + ccid2_hc_tx_dec_pipe(hctx); + } + if (seqp == hctx->ccid2hctx_seqt) + break; + seqp = seqp->ccid2s_prev; + } + + hctx->ccid2hctx_seqt = last_acked; + } + + /* trim acked packets in tail */ + while (hctx->ccid2hctx_seqt != hctx->ccid2hctx_seqh) { + if (!hctx->ccid2hctx_seqt->ccid2s_acked) + break; + + hctx->ccid2hctx_seqt = hctx->ccid2hctx_seqt->ccid2s_next; + } + + if (loss) { + /* XXX do bit shifts guarantee a 0 as the new bit? */ + ccid2_change_cwnd(sk, hctx->ccid2hctx_cwnd >> 1); + hctx->ccid2hctx_ssthresh = hctx->ccid2hctx_cwnd; + if (hctx->ccid2hctx_ssthresh < 2) + hctx->ccid2hctx_ssthresh = 2; + } + + ccid2_hc_tx_check_sanity(hctx); +} + +static int ccid2_hc_tx_init(struct sock *sk) +{ + struct dccp_sock *dp = dccp_sk(sk); + struct ccid2_hc_tx_sock *hctx; + int seqcount = ccid2_seq_len; + int i; + + dp->dccps_hc_tx_ccid_private = kzalloc(sizeof(*hctx), gfp_any()); + if (dp->dccps_hc_tx_ccid_private == NULL) + return -ENOMEM; + + hctx = ccid2_hc_tx_sk(sk); + + /* XXX init variables with proper values */ + hctx->ccid2hctx_cwnd = 1; + hctx->ccid2hctx_ssthresh = 10; + hctx->ccid2hctx_numdupack = 3; + + /* XXX init ~ to window size... */ + hctx->ccid2hctx_seqbuf = kmalloc(sizeof(*hctx->ccid2hctx_seqbuf) * + seqcount, gfp_any()); + if (hctx->ccid2hctx_seqbuf == NULL) { + kfree(dp->dccps_hc_tx_ccid_private); + dp->dccps_hc_tx_ccid_private = NULL; + return -ENOMEM; + } + for (i = 0; i < (seqcount - 1); i++) { + hctx->ccid2hctx_seqbuf[i].ccid2s_next = + &hctx->ccid2hctx_seqbuf[i + 1]; + hctx->ccid2hctx_seqbuf[i + 1].ccid2s_prev = + &hctx->ccid2hctx_seqbuf[i]; + } + hctx->ccid2hctx_seqbuf[seqcount - 1].ccid2s_next = + hctx->ccid2hctx_seqbuf; + hctx->ccid2hctx_seqbuf->ccid2s_prev = + &hctx->ccid2hctx_seqbuf[seqcount - 1]; + + hctx->ccid2hctx_seqh = hctx->ccid2hctx_seqbuf; + hctx->ccid2hctx_seqt = hctx->ccid2hctx_seqh; + hctx->ccid2hctx_sent = 0; + hctx->ccid2hctx_rto = 3 * HZ; + hctx->ccid2hctx_srtt = -1; + hctx->ccid2hctx_rttvar = -1; + hctx->ccid2hctx_lastrtt = 0; + hctx->ccid2hctx_rpdupack = -1; + + hctx->ccid2hctx_rtotimer.function = &ccid2_hc_tx_rto_expire; + hctx->ccid2hctx_rtotimer.data = (unsigned long)sk; + init_timer(&hctx->ccid2hctx_rtotimer); + + ccid2_hc_tx_check_sanity(hctx); + return 0; +} + +static void ccid2_hc_tx_exit(struct sock *sk) +{ + struct dccp_sock *dp = dccp_sk(sk); + struct ccid2_hc_tx_sock *hctx = dp->dccps_hc_tx_ccid_private; + + ccid2_hc_tx_kill_rto_timer(hctx); + + kfree(hctx->ccid2hctx_seqbuf); + + kfree(dp->dccps_hc_tx_ccid_private); + dp->dccps_hc_tx_ccid_private = NULL; +} + +static void ccid2_hc_rx_packet_recv(struct sock *sk, struct sk_buff *skb) +{ + const struct dccp_sock *dp = dccp_sk(sk); + struct ccid2_hc_rx_sock *hcrx = ccid2_hc_rx_sk(sk); + + switch (DCCP_SKB_CB(skb)->dccpd_type) { + case DCCP_PKT_DATA: + case DCCP_PKT_DATAACK: + hcrx->ccid2hcrx_data++; + if (hcrx->ccid2hcrx_data >= dp->dccps_r_ack_ratio) { + dccp_send_ack(sk); + hcrx->ccid2hcrx_data = 0; + } + break; + } +} + +static int ccid2_hc_rx_init(struct sock *sk) +{ + struct dccp_sock *dp = dccp_sk(sk); + dp->dccps_hc_rx_ccid_private = kzalloc(sizeof(struct ccid2_hc_rx_sock), + gfp_any()); + return dp->dccps_hc_rx_ccid_private == NULL ? -ENOMEM : 0; +} + +static void ccid2_hc_rx_exit(struct sock *sk) +{ + struct dccp_sock *dp = dccp_sk(sk); + + kfree(dp->dccps_hc_rx_ccid_private); + dp->dccps_hc_rx_ccid_private = NULL; +} + +static struct ccid ccid2 = { + .ccid_id = 2, + .ccid_name = "ccid2", + .ccid_owner = THIS_MODULE, + .ccid_hc_tx_init = ccid2_hc_tx_init, + .ccid_hc_tx_exit = ccid2_hc_tx_exit, + .ccid_hc_tx_send_packet = ccid2_hc_tx_send_packet, + .ccid_hc_tx_packet_sent = ccid2_hc_tx_packet_sent, + .ccid_hc_tx_packet_recv = ccid2_hc_tx_packet_recv, + .ccid_hc_rx_init = ccid2_hc_rx_init, + .ccid_hc_rx_exit = ccid2_hc_rx_exit, + .ccid_hc_rx_packet_recv = ccid2_hc_rx_packet_recv, +}; + +module_param(ccid2_debug, int, 0444); +MODULE_PARM_DESC(ccid2_debug, "Enable debug messages"); + +static __init int ccid2_module_init(void) +{ + return ccid_register(&ccid2); +} +module_init(ccid2_module_init); + +static __exit void ccid2_module_exit(void) +{ + ccid_unregister(&ccid2); +} +module_exit(ccid2_module_exit); + +MODULE_AUTHOR("Andrea Bittau "); +MODULE_DESCRIPTION("DCCP TCP CCID2 CCID"); +MODULE_LICENSE("GPL"); +MODULE_ALIAS("net-dccp-ccid-2"); diff --git a/net/dccp/ccids/ccid2.h b/net/dccp/ccids/ccid2.h new file mode 100644 index 0000000..0b08c90 --- /dev/null +++ b/net/dccp/ccids/ccid2.h @@ -0,0 +1,69 @@ +/* + * net/dccp/ccids/ccid2.h + * + * Copyright (c) 2005 Andrea Bittau + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ +#ifndef _DCCP_CCID2_H_ +#define _DCCP_CCID2_H_ + +struct ccid2_seq { + u64 ccid2s_seq; + unsigned long ccid2s_sent; + int ccid2s_acked; + struct ccid2_seq *ccid2s_prev; + struct ccid2_seq *ccid2s_next; +}; + +/** struct ccid2_hc_tx_sock - CCID2 TX half connection + * + * @ccid2hctx_ssacks - ACKs recv in slow start + * @ccid2hctx_acks - ACKS recv in AI phase + * @ccid2hctx_sent - packets sent in this window + * @ccid2hctx_lastrtt -time RTT was last measured + * @ccid2hctx_arsent - packets sent [ack ratio] + * @ccid2hctx_ackloss - ack was lost in this win + * @ccid2hctx_rpseq - last consecutive seqno + * @ccid2hctx_rpdupack - dupacks since rpseq +*/ +struct ccid2_hc_tx_sock { + int ccid2hctx_cwnd; + int ccid2hctx_ssacks; + int ccid2hctx_acks; + int ccid2hctx_ssthresh; + int ccid2hctx_pipe; + int ccid2hctx_numdupack; + struct ccid2_seq *ccid2hctx_seqbuf; + struct ccid2_seq *ccid2hctx_seqh; + struct ccid2_seq *ccid2hctx_seqt; + long ccid2hctx_rto; + long ccid2hctx_srtt; + long ccid2hctx_rttvar; + int ccid2hctx_sent; + unsigned long ccid2hctx_lastrtt; + struct timer_list ccid2hctx_rtotimer; + unsigned long ccid2hctx_arsent; + int ccid2hctx_ackloss; + u64 ccid2hctx_rpseq; + int ccid2hctx_rpdupack; + int ccid2hctx_sendwait; +}; + +struct ccid2_hc_rx_sock { + int ccid2hcrx_data; +}; + +#endif /* _DCCP_CCID2_H_ */ diff --git a/net/dccp/ccids/ccid3.c b/net/dccp/ccids/ccid3.c index aa68e0a..165fc40 100644 --- a/net/dccp/ccids/ccid3.c +++ b/net/dccp/ccids/ccid3.c @@ -76,15 +76,6 @@ static struct dccp_tx_hist *ccid3_tx_his static struct dccp_rx_hist *ccid3_rx_hist; static struct dccp_li_hist *ccid3_li_hist; -static int ccid3_init(struct sock *sk) -{ - return 0; -} - -static void ccid3_exit(struct sock *sk) -{ -} - /* TFRC sender states */ enum ccid3_hc_tx_states { TFRC_SSTATE_NO_SENT = 1, @@ -316,8 +307,6 @@ static int ccid3_hc_tx_send_packet(struc switch (hctx->ccid3hctx_state) { case TFRC_SSTATE_NO_SENT: - hctx->ccid3hctx_no_feedback_timer.function = ccid3_hc_tx_no_feedback_timer; - hctx->ccid3hctx_no_feedback_timer.data = (unsigned long)sk; sk_reset_timer(sk, &hctx->ccid3hctx_no_feedback_timer, jiffies + usecs_to_jiffies(TFRC_INITIAL_TIMEOUT)); hctx->ccid3hctx_last_win_count = 0; @@ -681,6 +670,9 @@ static int ccid3_hc_tx_init(struct sock hctx->ccid3hctx_t_rto = USEC_PER_SEC; hctx->ccid3hctx_state = TFRC_SSTATE_NO_SENT; INIT_LIST_HEAD(&hctx->ccid3hctx_hist); + + hctx->ccid3hctx_no_feedback_timer.function = ccid3_hc_tx_no_feedback_timer; + hctx->ccid3hctx_no_feedback_timer.data = (unsigned long)sk; init_timer(&hctx->ccid3hctx_no_feedback_timer); return 0; @@ -1178,8 +1170,6 @@ static struct ccid ccid3 = { .ccid_id = 3, .ccid_name = "ccid3", .ccid_owner = THIS_MODULE, - .ccid_init = ccid3_init, - .ccid_exit = ccid3_exit, .ccid_hc_tx_init = ccid3_hc_tx_init, .ccid_hc_tx_exit = ccid3_hc_tx_exit, .ccid_hc_tx_send_packet = ccid3_hc_tx_send_packet, diff --git a/net/dccp/feat.c b/net/dccp/feat.c new file mode 100644 index 0000000..fb74260 --- /dev/null +++ b/net/dccp/feat.c @@ -0,0 +1,554 @@ +/* + * net/dccp/feat.c + * + * An implementation of the DCCP protocol + * Andrea Bittau + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#include +#include + +#include "dccp.h" +#include "feat.h" + +#define DCCP_FEAT_SP_NOAGREE (-123) + +int dccp_feat_change(struct sock *sk, u8 type, u8 feature, u8 *val, u8 len, + gfp_t gfp) +{ + struct dccp_sock *dp = dccp_sk(sk); + struct dccp_opt_pend *opt; + + dccp_pr_debug("feat change type=%d feat=%d\n", type, feature); + + /* check if that feature is already being negotiated */ + list_for_each_entry(opt, &dp->dccps_options.dccpo_pending, + dccpop_node) { + /* ok we found a negotiation for this option already */ + if (opt->dccpop_feat == feature && opt->dccpop_type == type) { + dccp_pr_debug("Replacing old\n"); + /* replace */ + BUG_ON(opt->dccpop_val == NULL); + kfree(opt->dccpop_val); + opt->dccpop_val = val; + opt->dccpop_len = len; + opt->dccpop_conf = 0; + return 0; + } + } + + /* negotiation for a new feature */ + opt = kmalloc(sizeof(*opt), gfp); + if (opt == NULL) + return -ENOMEM; + + opt->dccpop_type = type; + opt->dccpop_feat = feature; + opt->dccpop_len = len; + opt->dccpop_val = val; + opt->dccpop_conf = 0; + opt->dccpop_sc = NULL; + + BUG_ON(opt->dccpop_val == NULL); + + list_add_tail(&opt->dccpop_node, &dp->dccps_options.dccpo_pending); + return 0; +} + +EXPORT_SYMBOL_GPL(dccp_feat_change); + +/* XXX taking only u8 vals */ +static int dccp_feat_update(struct sock *sk, u8 type, u8 feat, u8 val) +{ + /* FIXME implement */ + dccp_pr_debug("changing [%d] feat %d to %d\n", type, feat, val); + return 0; +} + +static int dccp_feat_reconcile(struct sock *sk, struct dccp_opt_pend *opt, + u8 *rpref, u8 rlen) +{ + struct dccp_sock *dp = dccp_sk(sk); + u8 *spref, slen, *res = NULL; + int i, j, rc, agree = 1; + + BUG_ON(rpref == NULL); + + /* check if we are the black sheep */ + if (dp->dccps_role == DCCP_ROLE_CLIENT) { + spref = rpref; + slen = rlen; + rpref = opt->dccpop_val; + rlen = opt->dccpop_len; + } else { + spref = opt->dccpop_val; + slen = opt->dccpop_len; + } + /* + * Now we have server preference list in spref and client preference in + * rpref + */ + BUG_ON(spref == NULL); + BUG_ON(rpref == NULL); + + /* FIXME sanity check vals */ + + /* Are values in any order? XXX Lame "algorithm" here */ + /* XXX assume values are 1 byte */ + for (i = 0; i < slen; i++) { + for (j = 0; j < rlen; j++) { + if (spref[i] == rpref[j]) { + res = &spref[i]; + break; + } + } + if (res) + break; + } + + /* we didn't agree on anything */ + if (res == NULL) { + /* confirm previous value */ + switch (opt->dccpop_feat) { + case DCCPF_CCID: + /* XXX did i get this right? =P */ + if (opt->dccpop_type == DCCPO_CHANGE_L) + res = &dp->dccps_options.dccpo_tx_ccid; + else + res = &dp->dccps_options.dccpo_rx_ccid; + break; + + default: + WARN_ON(1); /* XXX implement res */ + return -EFAULT; + } + + dccp_pr_debug("Don't agree... reconfirming %d\n", *res); + agree = 0; /* this is used for mandatory options... */ + } + + /* need to put result and our preference list */ + /* XXX assume 1 byte vals */ + rlen = 1 + opt->dccpop_len; + rpref = kmalloc(rlen, GFP_ATOMIC); + if (rpref == NULL) + return -ENOMEM; + + *rpref = *res; + memcpy(&rpref[1], opt->dccpop_val, opt->dccpop_len); + + /* put it in the "confirm queue" */ + if (opt->dccpop_sc == NULL) { + opt->dccpop_sc = kmalloc(sizeof(*opt->dccpop_sc), GFP_ATOMIC); + if (opt->dccpop_sc == NULL) { + kfree(rpref); + return -ENOMEM; + } + } else { + /* recycle the confirm slot */ + BUG_ON(opt->dccpop_sc->dccpoc_val == NULL); + kfree(opt->dccpop_sc->dccpoc_val); + dccp_pr_debug("recycling confirm slot\n"); + } + memset(opt->dccpop_sc, 0, sizeof(*opt->dccpop_sc)); + + opt->dccpop_sc->dccpoc_val = rpref; + opt->dccpop_sc->dccpoc_len = rlen; + + /* update the option on our side [we are about to send the confirm] */ + rc = dccp_feat_update(sk, opt->dccpop_type, opt->dccpop_feat, *res); + if (rc) { + kfree(opt->dccpop_sc->dccpoc_val); + kfree(opt->dccpop_sc); + opt->dccpop_sc = 0; + return rc; + } + + dccp_pr_debug("Will confirm %d\n", *rpref); + + /* say we want to change to X but we just got a confirm X, suppress our + * change + */ + if (!opt->dccpop_conf) { + if (*opt->dccpop_val == *res) + opt->dccpop_conf = 1; + dccp_pr_debug("won't ask for change of same feature\n"); + } + + return agree ? 0 : DCCP_FEAT_SP_NOAGREE; /* used for mandatory opts */ +} + +static int dccp_feat_sp(struct sock *sk, u8 type, u8 feature, u8 *val, u8 len) +{ + struct dccp_sock *dp = dccp_sk(sk); + struct dccp_opt_pend *opt; + int rc = 1; + u8 t; + + /* + * We received a CHANGE. We gotta match it against our own preference + * list. If we got a CHANGE_R it means it's a change for us, so we need + * to compare our CHANGE_L list. + */ + if (type == DCCPO_CHANGE_L) + t = DCCPO_CHANGE_R; + else + t = DCCPO_CHANGE_L; + + /* find our preference list for this feature */ + list_for_each_entry(opt, &dp->dccps_options.dccpo_pending, + dccpop_node) { + if (opt->dccpop_type != t || opt->dccpop_feat != feature) + continue; + + /* find the winner from the two preference lists */ + rc = dccp_feat_reconcile(sk, opt, val, len); + break; + } + + /* We didn't deal with the change. This can happen if we have no + * preference list for the feature. In fact, it just shouldn't + * happen---if we understand a feature, we should have a preference list + * with at least the default value. + */ + BUG_ON(rc == 1); + + return rc; +} + +static int dccp_feat_nn(struct sock *sk, u8 type, u8 feature, u8 *val, u8 len) +{ + struct dccp_opt_pend *opt; + struct dccp_sock *dp = dccp_sk(sk); + u8 *copy; + int rc; + + /* NN features must be change L */ + if (type == DCCPO_CHANGE_R) { + dccp_pr_debug("received CHANGE_R %d for NN feat %d\n", + type, feature); + return -EFAULT; + } + + /* XXX sanity check opt val */ + + /* copy option so we can confirm it */ + opt = kzalloc(sizeof(*opt), GFP_ATOMIC); + if (opt == NULL) + return -ENOMEM; + + copy = kmalloc(len, GFP_ATOMIC); + if (copy == NULL) { + kfree(opt); + return -ENOMEM; + } + memcpy(copy, val, len); + + opt->dccpop_type = DCCPO_CONFIRM_R; /* NN can only confirm R */ + opt->dccpop_feat = feature; + opt->dccpop_val = copy; + opt->dccpop_len = len; + + /* change feature */ + rc = dccp_feat_update(sk, type, feature, *val); + if (rc) { + kfree(opt->dccpop_val); + kfree(opt); + return rc; + } + + dccp_pr_debug("Confirming NN feature %d (val=%d)\n", feature, *copy); + list_add_tail(&opt->dccpop_node, &dp->dccps_options.dccpo_conf); + + return 0; +} + +static void dccp_feat_empty_confirm(struct sock *sk, u8 type, u8 feature) +{ + struct dccp_sock *dp = dccp_sk(sk); + /* XXX check if other confirms for that are queued and recycle slot */ + struct dccp_opt_pend *opt = kzalloc(sizeof(*opt), GFP_ATOMIC); + + if (opt == NULL) { + /* XXX what do we do? Ignoring should be fine. It's a change + * after all =P + */ + return; + } + + opt->dccpop_type = type == DCCPO_CHANGE_L ? DCCPO_CONFIRM_R : + DCCPO_CONFIRM_L; + opt->dccpop_feat = feature; + opt->dccpop_val = 0; + opt->dccpop_len = 0; + + /* change feature */ + dccp_pr_debug("Empty confirm feature %d type %d\n", feature, type); + list_add_tail(&opt->dccpop_node, &dp->dccps_options.dccpo_conf); +} + +static void dccp_feat_flush_confirm(struct sock *sk) +{ + struct dccp_sock *dp = dccp_sk(sk); + /* Check if there is anything to confirm in the first place */ + int yes = !list_empty(&dp->dccps_options.dccpo_conf); + + if (!yes) { + struct dccp_opt_pend *opt; + + list_for_each_entry(opt, &dp->dccps_options.dccpo_pending, + dccpop_node) { + if (opt->dccpop_conf) { + yes = 1; + break; + } + } + } + + if (!yes) + return; + + /* OK there is something to confirm... */ + /* XXX check if packet is in flight? Send delayed ack?? */ + if (sk->sk_state == DCCP_OPEN) + dccp_send_ack(sk); +} + +int dccp_feat_change_recv(struct sock *sk, u8 type, u8 feature, u8 *val, u8 len) +{ + int rc; + + dccp_pr_debug("got feat change type=%d feat=%d\n", type, feature); + + /* figure out if it's SP or NN feature */ + switch (feature) { + /* deal with SP features */ + case DCCPF_CCID: + rc = dccp_feat_sp(sk, type, feature, val, len); + break; + + /* deal with NN features */ + case DCCPF_ACK_RATIO: + rc = dccp_feat_nn(sk, type, feature, val, len); + break; + + /* XXX implement other features */ + default: + rc = -EFAULT; + break; + } + + /* check if there were problems changing features */ + if (rc) { + /* If we don't agree on SP, we sent a confirm for old value. + * However we propagate rc to caller in case option was + * mandatory + */ + if (rc != DCCP_FEAT_SP_NOAGREE) + dccp_feat_empty_confirm(sk, type, feature); + } + + /* generate the confirm [if required] */ + dccp_feat_flush_confirm(sk); + + return rc; +} + +EXPORT_SYMBOL_GPL(dccp_feat_change_recv); + +int dccp_feat_confirm_recv(struct sock *sk, u8 type, u8 feature, + u8 *val, u8 len) +{ + u8 t; + struct dccp_opt_pend *opt; + struct dccp_sock *dp = dccp_sk(sk); + int rc = 1; + int all_confirmed = 1; + + dccp_pr_debug("got feat confirm type=%d feat=%d\n", type, feature); + + /* XXX sanity check type & feat */ + + /* locate our change request */ + t = type == DCCPO_CONFIRM_L ? DCCPO_CHANGE_R : DCCPO_CHANGE_L; + + list_for_each_entry(opt, &dp->dccps_options.dccpo_pending, + dccpop_node) { + if (!opt->dccpop_conf && opt->dccpop_type == t && + opt->dccpop_feat == feature) { + /* we found it */ + /* XXX do sanity check */ + + opt->dccpop_conf = 1; + + /* We got a confirmation---change the option */ + dccp_feat_update(sk, opt->dccpop_type, + opt->dccpop_feat, *val); + + dccp_pr_debug("feat %d type %d confirmed %d\n", + feature, type, *val); + rc = 0; + break; + } + + if (!opt->dccpop_conf) + all_confirmed = 0; + } + + /* fix re-transmit timer */ + /* XXX gotta make sure that no option negotiation occurs during + * connection shutdown. Consider that the CLOSEREQ is sent and timer is + * on. if all options are confirmed it might kill timer which should + * remain alive until close is received. + */ + if (all_confirmed) { + dccp_pr_debug("clear feat negotiation timer %p\n", sk); + inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS); + } + + if (rc) + dccp_pr_debug("feat %d type %d never requested\n", + feature, type); + return 0; +} + +EXPORT_SYMBOL_GPL(dccp_feat_confirm_recv); + +void dccp_feat_clean(struct sock *sk) +{ + struct dccp_sock *dp = dccp_sk(sk); + struct dccp_opt_pend *opt, *next; + + list_for_each_entry_safe(opt, next, &dp->dccps_options.dccpo_pending, + dccpop_node) { + BUG_ON(opt->dccpop_val == NULL); + kfree(opt->dccpop_val); + + if (opt->dccpop_sc != NULL) { + BUG_ON(opt->dccpop_sc->dccpoc_val == NULL); + kfree(opt->dccpop_sc->dccpoc_val); + kfree(opt->dccpop_sc); + } + + kfree(opt); + } + INIT_LIST_HEAD(&dp->dccps_options.dccpo_pending); + + list_for_each_entry_safe(opt, next, &dp->dccps_options.dccpo_conf, + dccpop_node) { + BUG_ON(opt == NULL); + if (opt->dccpop_val != NULL) + kfree(opt->dccpop_val); + kfree(opt); + } + INIT_LIST_HEAD(&dp->dccps_options.dccpo_conf); +} + +EXPORT_SYMBOL_GPL(dccp_feat_clean); + +/* this is to be called only when a listening sock creates its child. It is + * assumed by the function---the confirm is not duplicated, but rather it is + * "passed on". + */ +int dccp_feat_clone(struct sock *oldsk, struct sock *newsk) +{ + struct dccp_sock *olddp = dccp_sk(oldsk); + struct dccp_sock *newdp = dccp_sk(newsk); + struct dccp_opt_pend *opt; + int rc = 0; + + INIT_LIST_HEAD(&newdp->dccps_options.dccpo_pending); + INIT_LIST_HEAD(&newdp->dccps_options.dccpo_conf); + + list_for_each_entry(opt, &olddp->dccps_options.dccpo_pending, + dccpop_node) { + struct dccp_opt_pend *newopt; + /* copy the value of the option */ + u8 *val = kmalloc(opt->dccpop_len, GFP_ATOMIC); + + if (val == NULL) + goto out_clean; + memcpy(val, opt->dccpop_val, opt->dccpop_len); + + newopt = kmalloc(sizeof(*newopt), GFP_ATOMIC); + if (newopt == NULL) { + kfree(val); + goto out_clean; + } + + /* insert the option */ + memcpy(newopt, opt, sizeof(*newopt)); + newopt->dccpop_val = val; + list_add_tail(&newopt->dccpop_node, + &newdp->dccps_options.dccpo_pending); + + /* XXX what happens with backlogs and multiple connections at + * once... + */ + /* the master socket no longer needs to worry about confirms */ + opt->dccpop_sc = 0; /* it's not a memleak---new socket has it */ + + /* reset state for a new socket */ + opt->dccpop_conf = 0; + } + + /* XXX not doing anything about the conf queue */ + +out: + return rc; + +out_clean: + dccp_feat_clean(newsk); + rc = -ENOMEM; + goto out; +} + +EXPORT_SYMBOL_GPL(dccp_feat_clone); + +static int __dccp_feat_init(struct sock *sk, u8 type, u8 feat, u8 *val, u8 len) +{ + int rc = -ENOMEM; + u8 *copy = kmalloc(len, GFP_KERNEL); + + if (copy != NULL) { + memcpy(copy, val, len); + rc = dccp_feat_change(sk, type, feat, copy, len, GFP_KERNEL); + if (rc) + kfree(copy); + } + return rc; +} + +int dccp_feat_init(struct sock *sk) +{ + struct dccp_sock *dp = dccp_sk(sk); + int rc; + + INIT_LIST_HEAD(&dp->dccps_options.dccpo_pending); + INIT_LIST_HEAD(&dp->dccps_options.dccpo_conf); + + /* CCID L */ + rc = __dccp_feat_init(sk, DCCPO_CHANGE_L, DCCPF_CCID, + &dp->dccps_options.dccpo_tx_ccid, 1); + if (rc) + goto out; + + /* CCID R */ + rc = __dccp_feat_init(sk, DCCPO_CHANGE_R, DCCPF_CCID, + &dp->dccps_options.dccpo_rx_ccid, 1); + if (rc) + goto out; + + /* Ack ratio */ + rc = __dccp_feat_init(sk, DCCPO_CHANGE_L, DCCPF_ACK_RATIO, + &dp->dccps_options.dccpo_ack_ratio, 1); +out: + return rc; +} + +EXPORT_SYMBOL_GPL(dccp_feat_init); diff --git a/net/dccp/feat.h b/net/dccp/feat.h new file mode 100644 index 0000000..f7c73a3 --- /dev/null +++ b/net/dccp/feat.h @@ -0,0 +1,28 @@ +#ifndef _DCCP_FEAT_H +#define _DCCP_FEAT_H +/* + * net/dccp/feat.h + * + * An implementation of the DCCP protocol + * Copyright (c) 2005 Andrea Bittau + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include + +struct sock; + +extern int dccp_feat_change(struct sock *sk, u8 type, u8 feature, + u8 *val, u8 len, gfp_t gfp); +extern int dccp_feat_change_recv(struct sock *sk, u8 type, u8 feature, + u8 *val, u8 len); +extern int dccp_feat_confirm_recv(struct sock *sk, u8 type, u8 feature, + u8 *val, u8 len); +extern void dccp_feat_clean(struct sock *sk); +extern int dccp_feat_clone(struct sock *oldsk, struct sock *newsk); +extern int dccp_feat_init(struct sock *sk); + +#endif /* _DCCP_FEAT_H */ diff --git a/net/dccp/input.c b/net/dccp/input.c index b6cba72..4b6d43d 100644 --- a/net/dccp/input.c +++ b/net/dccp/input.c @@ -300,6 +300,9 @@ static int dccp_rcv_request_sent_state_p goto out_invalid_packet; } + if (dccp_parse_options(sk, skb)) + goto out_invalid_packet; + if (dp->dccps_options.dccpo_send_ack_vector && dccp_ackvec_add(dp->dccps_hc_rx_ackvec, sk, DCCP_SKB_CB(skb)->dccpd_seq, diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c index dc0487b..14adcf9 100644 --- a/net/dccp/ipv4.c +++ b/net/dccp/ipv4.c @@ -28,6 +28,7 @@ #include "ackvec.h" #include "ccid.h" #include "dccp.h" +#include "feat.h" struct inet_hashinfo __cacheline_aligned dccp_hashinfo = { .lhash_lock = RW_LOCK_UNLOCKED, @@ -535,7 +536,8 @@ int dccp_v4_conn_request(struct sock *sk if (req == NULL) goto drop; - /* FIXME: process options */ + if (dccp_parse_options(sk, skb)) + goto drop; dccp_openreq_init(req, &dp, skb); @@ -1041,13 +1043,6 @@ int dccp_v4_init_sock(struct sock *sk) dccp_options_init(&dp->dccps_options); do_gettimeofday(&dp->dccps_epoch); - if (dp->dccps_options.dccpo_send_ack_vector) { - dp->dccps_hc_rx_ackvec = dccp_ackvec_alloc(DCCP_MAX_ACKVEC_LEN, - GFP_KERNEL); - if (dp->dccps_hc_rx_ackvec == NULL) - return -ENOMEM; - } - /* * FIXME: We're hardcoding the CCID, and doing this at this point makes * the listening (master) sock get CCID control blocks, which is not @@ -1056,6 +1051,13 @@ int dccp_v4_init_sock(struct sock *sk) * setsockopt(CCIDs-I-want/accept). -acme */ if (likely(!dccp_ctl_socket_init)) { + int rc; + + if (dp->dccps_options.dccpo_send_ack_vector) { + dp->dccps_hc_rx_ackvec = dccp_ackvec_alloc(GFP_KERNEL); + if (dp->dccps_hc_rx_ackvec == NULL) + return -ENOMEM; + } dp->dccps_hc_rx_ccid = ccid_init(dp->dccps_options.dccpo_rx_ccid, sk); dp->dccps_hc_tx_ccid = ccid_init(dp->dccps_options.dccpo_tx_ccid, @@ -1071,8 +1073,16 @@ int dccp_v4_init_sock(struct sock *sk) dp->dccps_hc_rx_ccid = dp->dccps_hc_tx_ccid = NULL; return -ENOMEM; } - } else + + rc = dccp_feat_init(sk); + if (rc) + return rc; + } else { + /* control socket doesn't need feat nego */ + INIT_LIST_HEAD(&dp->dccps_options.dccpo_pending); + INIT_LIST_HEAD(&dp->dccps_options.dccpo_conf); dccp_ctl_socket_init = 0; + } dccp_init_xmit_timers(sk); icsk->icsk_rto = DCCP_TIMEOUT_INIT; @@ -1083,6 +1093,7 @@ int dccp_v4_init_sock(struct sock *sk) dp->dccps_mss_cache = 536; dp->dccps_role = DCCP_ROLE_UNDEFINED; dp->dccps_service = DCCP_SERVICE_INVALID_VALUE; + dp->dccps_l_ack_ratio = dp->dccps_r_ack_ratio = 1; return 0; } @@ -1119,6 +1130,9 @@ int dccp_v4_destroy_sock(struct sock *sk ccid_exit(dp->dccps_hc_tx_ccid, sk); dp->dccps_hc_rx_ccid = dp->dccps_hc_tx_ccid = NULL; + /* clean up feature negotiation state */ + dccp_feat_clean(sk); + return 0; } diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c index 29261fc..9e1de59 100644 --- a/net/dccp/minisocks.c +++ b/net/dccp/minisocks.c @@ -22,6 +22,7 @@ #include "ackvec.h" #include "ccid.h" #include "dccp.h" +#include "feat.h" struct inet_timewait_death_row dccp_death_row = { .sysctl_max_tw_buckets = NR_FILE * 2, @@ -114,10 +115,12 @@ struct sock *dccp_create_openreq_child(s newicsk->icsk_rto = DCCP_TIMEOUT_INIT; do_gettimeofday(&newdp->dccps_epoch); + if (dccp_feat_clone(sk, newsk)) + goto out_free; + if (newdp->dccps_options.dccpo_send_ack_vector) { newdp->dccps_hc_rx_ackvec = - dccp_ackvec_alloc(DCCP_MAX_ACKVEC_LEN, - GFP_ATOMIC); + dccp_ackvec_alloc(GFP_ATOMIC); /* * XXX: We're using the same CCIDs set on the parent, * i.e. sk_clone copied the master sock and left the diff --git a/net/dccp/options.c b/net/dccp/options.c index 0a76426..7f99306 100644 --- a/net/dccp/options.c +++ b/net/dccp/options.c @@ -21,12 +21,14 @@ #include "ackvec.h" #include "ccid.h" #include "dccp.h" +#include "feat.h" /* stores the default values for new connection. may be changed with sysctl */ static const struct dccp_options dccpo_default_values = { .dccpo_sequence_window = DCCPF_INITIAL_SEQUENCE_WINDOW, .dccpo_rx_ccid = DCCPF_INITIAL_CCID, .dccpo_tx_ccid = DCCPF_INITIAL_CCID, + .dccpo_ack_ratio = DCCPF_INITIAL_ACK_RATIO, .dccpo_send_ack_vector = DCCPF_INITIAL_SEND_ACK_VECTOR, .dccpo_send_ndp_count = DCCPF_INITIAL_SEND_NDP_COUNT, }; @@ -69,6 +71,8 @@ int dccp_parse_options(struct sock *sk, unsigned char opt, len; unsigned char *value; u32 elapsed_time; + int rc; + int mandatory = 0; memset(opt_recv, 0, sizeof(*opt_recv)); @@ -100,6 +104,11 @@ int dccp_parse_options(struct sock *sk, switch (opt) { case DCCPO_PADDING: break; + case DCCPO_MANDATORY: + if (mandatory) + goto out_invalid_option; + mandatory = 1; + break; case DCCPO_NDP_COUNT: if (len > 3) goto out_invalid_option; @@ -108,6 +117,31 @@ int dccp_parse_options(struct sock *sk, dccp_pr_debug("%sNDP count=%d\n", debug_prefix, opt_recv->dccpor_ndp); break; + case DCCPO_CHANGE_L: + /* fall through */ + case DCCPO_CHANGE_R: + if (len < 2) + goto out_invalid_option; + rc = dccp_feat_change_recv(sk, opt, *value, value + 1, + len - 1); + /* + * When there is a change error, change_recv is + * responsible for dealing with it. i.e. reply with an + * empty confirm. + * If the change was mandatory, then we need to die. + */ + if (rc && mandatory) + goto out_invalid_option; + break; + case DCCPO_CONFIRM_L: + /* fall through */ + case DCCPO_CONFIRM_R: + if (len < 2) + goto out_invalid_option; + if (dccp_feat_confirm_recv(sk, opt, *value, + value + 1, len - 1)) + goto out_invalid_option; + break; case DCCPO_ACK_VECTOR_0: case DCCPO_ACK_VECTOR_1: if (pkt_type == DCCP_PKT_DATA) @@ -208,6 +242,9 @@ int dccp_parse_options(struct sock *sk, sk, opt, len); break; } + + if (opt != DCCPO_MANDATORY) + mandatory = 0; } return 0; @@ -356,7 +393,7 @@ void dccp_insert_option_timestamp(struct { struct timeval tv; u32 now; - + dccp_timestamp(sk, &tv); now = timeval_usecs(&tv) / 10; /* yes this will overflow but that is the point as we want a @@ -402,7 +439,7 @@ static void dccp_insert_option_timestamp tstamp_echo = htonl(dp->dccps_timestamp_echo); memcpy(to, &tstamp_echo, 4); to += 4; - + if (elapsed_time_len == 2) { const u16 var16 = htons((u16)elapsed_time); memcpy(to, &var16, 2); @@ -421,6 +458,93 @@ static void dccp_insert_option_timestamp dp->dccps_timestamp_time.tv_usec = 0; } +static int dccp_insert_feat_opt(struct sk_buff *skb, u8 type, u8 feat, + u8 *val, u8 len) +{ + u8 *to; + + if (DCCP_SKB_CB(skb)->dccpd_opt_len + len + 3 > DCCP_MAX_OPT_LEN) { + LIMIT_NETDEBUG(KERN_INFO "DCCP: packet too small" + " to insert feature %d option!\n", feat); + return -1; + } + + DCCP_SKB_CB(skb)->dccpd_opt_len += len + 3; + + to = skb_push(skb, len + 3); + *to++ = type; + *to++ = len + 3; + *to++ = feat; + + if (len) + memcpy(to, val, len); + dccp_pr_debug("option %d feat %d len %d\n", type, feat, len); + + return 0; +} + +static void dccp_insert_feat(struct sock *sk, struct sk_buff *skb) +{ + struct dccp_sock *dp = dccp_sk(sk); + struct dccp_opt_pend *opt, *next; + int change = 0; + + /* confirm any options [NN opts] */ + list_for_each_entry_safe(opt, next, &dp->dccps_options.dccpo_conf, + dccpop_node) { + dccp_insert_feat_opt(skb, opt->dccpop_type, + opt->dccpop_feat, opt->dccpop_val, + opt->dccpop_len); + /* fear empty confirms */ + if (opt->dccpop_val) + kfree(opt->dccpop_val); + kfree(opt); + } + INIT_LIST_HEAD(&dp->dccps_options.dccpo_conf); + + /* see which features we need to send */ + list_for_each_entry(opt, &dp->dccps_options.dccpo_pending, + dccpop_node) { + /* see if we need to send any confirm */ + if (opt->dccpop_sc) { + dccp_insert_feat_opt(skb, opt->dccpop_type + 1, + opt->dccpop_feat, + opt->dccpop_sc->dccpoc_val, + opt->dccpop_sc->dccpoc_len); + + BUG_ON(!opt->dccpop_sc->dccpoc_val); + kfree(opt->dccpop_sc->dccpoc_val); + kfree(opt->dccpop_sc); + opt->dccpop_sc = NULL; + } + + /* any option not confirmed, re-send it */ + if (!opt->dccpop_conf) { + dccp_insert_feat_opt(skb, opt->dccpop_type, + opt->dccpop_feat, opt->dccpop_val, + opt->dccpop_len); + change++; + } + } + + /* Retransmit timer. + * If this is the master listening sock, we don't set a timer on it. It + * should be fine because if the dude doesn't receive our RESPONSE + * [which will contain the CHANGE] he will send another REQUEST which + * will "retrnasmit" the change. + */ + if (change && dp->dccps_role != DCCP_ROLE_LISTEN) { + dccp_pr_debug("reset feat negotiation timer %p\n", sk); + + /* XXX don't reset the timer on re-transmissions. I.e. reset it + * only when sending new stuff i guess. Currently the timer + * never backs off because on re-transmission it just resets it! + */ + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, + inet_csk(sk)->icsk_rto, DCCP_RTO_MAX); + } +} + void dccp_insert_options(struct sock *sk, struct sk_buff *skb) { struct dccp_sock *dp = dccp_sk(sk); @@ -447,6 +571,17 @@ void dccp_insert_options(struct sock *sk dp->dccps_hc_tx_insert_options = 0; } + /* Feature negotiation */ + switch(DCCP_SKB_CB(skb)->dccpd_type) { + /* Data packets can't do feat negotiation */ + case DCCP_PKT_DATA: + case DCCP_PKT_DATAACK: + break; + default: + dccp_insert_feat(sk, skb); + break; + } + /* XXX: insert other options when appropriate */ if (DCCP_SKB_CB(skb)->dccpd_opt_len != 0) { diff --git a/net/dccp/output.c b/net/dccp/output.c index efd7ffb..0cc2bcf 100644 --- a/net/dccp/output.c +++ b/net/dccp/output.c @@ -64,6 +64,10 @@ static int dccp_transmit_skb(struct sock case DCCP_PKT_DATAACK: break; + case DCCP_PKT_REQUEST: + set_ack = 0; + /* fall through */ + case DCCP_PKT_SYNC: case DCCP_PKT_SYNCACK: ackno = dcb->dccpd_seq; diff --git a/net/dccp/proto.c b/net/dccp/proto.c index 65b11ea..ae2ab50 100644 --- a/net/dccp/proto.c +++ b/net/dccp/proto.c @@ -37,6 +37,7 @@ #include "ccid.h" #include "dccp.h" +#include "feat.h" DEFINE_SNMP_STAT(struct dccp_mib, dccp_statistics) __read_mostly; @@ -255,6 +256,39 @@ static int dccp_setsockopt_service(struc return 0; } +/* byte 1 is feature. the rest is the preference list */ +static int dccp_setsockopt_change(struct sock *sk, int type, + struct dccp_so_feat __user *optval) +{ + struct dccp_so_feat opt; + u8 *val; + int rc; + + if (copy_from_user(&opt, optval, sizeof(opt))) + return -EFAULT; + + val = kmalloc(opt.dccpsf_len, GFP_KERNEL); + if (!val) + return -ENOMEM; + + if (copy_from_user(val, opt.dccpsf_val, opt.dccpsf_len)) { + rc = -EFAULT; + goto out_free_val; + } + + rc = dccp_feat_change(sk, type, opt.dccpsf_feat, val, opt.dccpsf_len, + GFP_KERNEL); + if (rc) + goto out_free_val; + +out: + return rc; + +out_free_val: + kfree(val); + goto out; +} + int dccp_setsockopt(struct sock *sk, int level, int optname, char __user *optval, int optlen) { @@ -284,6 +318,25 @@ int dccp_setsockopt(struct sock *sk, int case DCCP_SOCKOPT_PACKET_SIZE: dp->dccps_packet_size = val; break; + + case DCCP_SOCKOPT_CHANGE_L: + if (optlen != sizeof(struct dccp_so_feat)) + err = -EINVAL; + else + err = dccp_setsockopt_change(sk, DCCPO_CHANGE_L, + (struct dccp_so_feat *) + optval); + break; + + case DCCP_SOCKOPT_CHANGE_R: + if (optlen != sizeof(struct dccp_so_feat)) + err = -EINVAL; + else + err = dccp_setsockopt_change(sk, DCCPO_CHANGE_R, + (struct dccp_so_feat *) + optval); + break; + default: err = -ENOPROTOOPT; break; @@ -799,6 +852,7 @@ static int __init dccp_init(void) if (rc) goto out; + rc = -ENOBUFS; dccp_hashinfo.bind_bucket_cachep = kmem_cache_create("dccp_bind_bucket", sizeof(struct inet_bind_bucket), 0, @@ -866,7 +920,8 @@ static int __init dccp_init(void) INIT_HLIST_HEAD(&dccp_hashinfo.bhash[i].chain); } - if (init_dccp_v4_mibs()) + rc = init_dccp_v4_mibs(); + if (rc) goto out_free_dccp_bhash; rc = -EAGAIN; @@ -875,11 +930,17 @@ static int __init dccp_init(void) inet_register_protosw(&dccp_v4_protosw); - rc = dccp_ctl_sock_init(); + rc = dccp_ackvec_init(); if (rc) goto out_unregister_protosw; + + rc = dccp_ctl_sock_init(); + if (rc) + goto out_ackvec_exit; out: return rc; +out_ackvec_exit: + dccp_ackvec_exit(); out_unregister_protosw: inet_unregister_protosw(&dccp_v4_protosw); inet_del_protocol(&dccp_protocol, IPPROTO_DCCP); @@ -921,6 +982,7 @@ static void __exit dccp_fini(void) sizeof(struct inet_ehash_bucket))); kmem_cache_destroy(dccp_hashinfo.bind_bucket_cachep); proto_unregister(&dccp_prot); + dccp_ackvec_exit(); } module_init(dccp_init); diff --git a/net/dccp/timer.c b/net/dccp/timer.c index aa34b57..d7c7866 100644 --- a/net/dccp/timer.c +++ b/net/dccp/timer.c @@ -141,6 +141,17 @@ static void dccp_retransmit_timer(struct { struct inet_connection_sock *icsk = inet_csk(sk); + /* retransmit timer is used for feature negotiation throughout + * connection. In this case, no packet is re-transmitted, but rather an + * ack is generated and pending changes are splaced into its options. + */ + if (sk->sk_send_head == NULL) { + dccp_pr_debug("feat negotiation retransmit timeout %p\n", sk); + if (sk->sk_state == DCCP_OPEN) + dccp_send_ack(sk); + goto backoff; + } + /* * sk->sk_send_head has to have one skb with * DCCP_SKB_CB(skb)->dccpd_type set to one of the retransmittable DCCP @@ -177,6 +188,7 @@ static void dccp_retransmit_timer(struct goto out; } +backoff: icsk->icsk_backoff++; icsk->icsk_retransmits++; diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c index 0dd4d06..d5ddcda 100644 --- a/net/ipv4/fib_rules.c +++ b/net/ipv4/fib_rules.c @@ -40,6 +40,8 @@ #include #include #include +#include +#include #include #include @@ -52,7 +54,7 @@ struct fib_rule { - struct fib_rule *r_next; + struct hlist_node hlist; atomic_t r_clntref; u32 r_preference; unsigned char r_table; @@ -75,6 +77,7 @@ struct fib_rule #endif char r_ifname[IFNAMSIZ]; int r_dead; + struct rcu_head rcu; }; static struct fib_rule default_rule = { @@ -85,7 +88,6 @@ static struct fib_rule default_rule = { }; static struct fib_rule main_rule = { - .r_next = &default_rule, .r_clntref = ATOMIC_INIT(2), .r_preference = 0x7FFE, .r_table = RT_TABLE_MAIN, @@ -93,23 +95,24 @@ static struct fib_rule main_rule = { }; static struct fib_rule local_rule = { - .r_next = &main_rule, .r_clntref = ATOMIC_INIT(2), .r_table = RT_TABLE_LOCAL, .r_action = RTN_UNICAST, }; -static struct fib_rule *fib_rules = &local_rule; -static DEFINE_RWLOCK(fib_rules_lock); +static struct hlist_head fib_rules; + +/* writer func called from netlink -- rtnl_sem hold*/ int inet_rtm_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg) { struct rtattr **rta = arg; struct rtmsg *rtm = NLMSG_DATA(nlh); - struct fib_rule *r, **rp; + struct fib_rule *r; + struct hlist_node *node; int err = -ESRCH; - for (rp=&fib_rules; (r=*rp) != NULL; rp=&r->r_next) { + hlist_for_each_entry(r, node, &fib_rules, hlist) { if ((!rta[RTA_SRC-1] || memcmp(RTA_DATA(rta[RTA_SRC-1]), &r->r_src, 4) == 0) && rtm->rtm_src_len == r->r_src_len && rtm->rtm_dst_len == r->r_dst_len && @@ -126,10 +129,8 @@ int inet_rtm_delrule(struct sk_buff *skb if (r == &local_rule) break; - write_lock_bh(&fib_rules_lock); - *rp = r->r_next; + hlist_del_rcu(&r->hlist); r->r_dead = 1; - write_unlock_bh(&fib_rules_lock); fib_rule_put(r); err = 0; break; @@ -150,21 +151,30 @@ static struct fib_table *fib_empty_table return NULL; } +static inline void fib_rule_put_rcu(struct rcu_head *head) +{ + struct fib_rule *r = container_of(head, struct fib_rule, rcu); + kfree(r); +} + void fib_rule_put(struct fib_rule *r) { if (atomic_dec_and_test(&r->r_clntref)) { if (r->r_dead) - kfree(r); + call_rcu(&r->rcu, fib_rule_put_rcu); else printk("Freeing alive rule %p\n", r); } } +/* writer func called from netlink -- rtnl_sem hold*/ + int inet_rtm_newrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg) { struct rtattr **rta = arg; struct rtmsg *rtm = NLMSG_DATA(nlh); - struct fib_rule *r, *new_r, **rp; + struct fib_rule *r, *new_r, *last = NULL; + struct hlist_node *node = NULL; unsigned char table_id; if (rtm->rtm_src_len > 32 || rtm->rtm_dst_len > 32 || @@ -188,6 +198,7 @@ int inet_rtm_newrule(struct sk_buff *skb if (!new_r) return -ENOMEM; memset(new_r, 0, sizeof(*new_r)); + if (rta[RTA_SRC-1]) memcpy(&new_r->r_src, RTA_DATA(rta[RTA_SRC-1]), 4); if (rta[RTA_DST-1]) @@ -220,28 +231,28 @@ int inet_rtm_newrule(struct sk_buff *skb if (rta[RTA_FLOW-1]) memcpy(&new_r->r_tclassid, RTA_DATA(rta[RTA_FLOW-1]), 4); #endif + r = container_of(fib_rules.first, struct fib_rule, hlist); - rp = &fib_rules; if (!new_r->r_preference) { - r = fib_rules; - if (r && (r = r->r_next) != NULL) { - rp = &fib_rules->r_next; + if (r && r->hlist.next != NULL) { + r = container_of(r->hlist.next, struct fib_rule, hlist); if (r->r_preference) new_r->r_preference = r->r_preference - 1; } } - - while ( (r = *rp) != NULL ) { + + hlist_for_each_entry(r, node, &fib_rules, hlist) { if (r->r_preference > new_r->r_preference) break; - rp = &r->r_next; + last = r; } - - new_r->r_next = r; atomic_inc(&new_r->r_clntref); - write_lock_bh(&fib_rules_lock); - *rp = new_r; - write_unlock_bh(&fib_rules_lock); + + if (last) + hlist_add_after_rcu(&last->hlist, &new_r->hlist); + else + hlist_add_before_rcu(&new_r->hlist, &r->hlist); + return 0; } @@ -254,30 +265,30 @@ u32 fib_rules_tclass(struct fib_result * } #endif +/* callers should hold rtnl semaphore */ static void fib_rules_detach(struct net_device *dev) { + struct hlist_node *node; struct fib_rule *r; - for (r=fib_rules; r; r=r->r_next) { - if (r->r_ifindex == dev->ifindex) { - write_lock_bh(&fib_rules_lock); + hlist_for_each_entry(r, node, &fib_rules, hlist) { + if (r->r_ifindex == dev->ifindex) r->r_ifindex = -1; - write_unlock_bh(&fib_rules_lock); - } + } } +/* callers should hold rtnl semaphore */ + static void fib_rules_attach(struct net_device *dev) { + struct hlist_node *node; struct fib_rule *r; - for (r=fib_rules; r; r=r->r_next) { - if (r->r_ifindex == -1 && strcmp(dev->name, r->r_ifname) == 0) { - write_lock_bh(&fib_rules_lock); + hlist_for_each_entry(r, node, &fib_rules, hlist) { + if (r->r_ifindex == -1 && strcmp(dev->name, r->r_ifname) == 0) r->r_ifindex = dev->ifindex; - write_unlock_bh(&fib_rules_lock); - } } } @@ -286,14 +297,17 @@ int fib_lookup(const struct flowi *flp, int err; struct fib_rule *r, *policy; struct fib_table *tb; + struct hlist_node *node; u32 daddr = flp->fl4_dst; u32 saddr = flp->fl4_src; FRprintk("Lookup: %u.%u.%u.%u <- %u.%u.%u.%u ", NIPQUAD(flp->fl4_dst), NIPQUAD(flp->fl4_src)); - read_lock(&fib_rules_lock); - for (r = fib_rules; r; r=r->r_next) { + + rcu_read_lock(); + + hlist_for_each_entry_rcu(r, node, &fib_rules, hlist) { if (((saddr^r->r_src) & r->r_srcmask) || ((daddr^r->r_dst) & r->r_dstmask) || (r->r_tos && r->r_tos != flp->fl4_tos) || @@ -309,14 +323,14 @@ FRprintk("tb %d r %d ", r->r_table, r->r policy = r; break; case RTN_UNREACHABLE: - read_unlock(&fib_rules_lock); + rcu_read_unlock(); return -ENETUNREACH; default: case RTN_BLACKHOLE: - read_unlock(&fib_rules_lock); + rcu_read_unlock(); return -EINVAL; case RTN_PROHIBIT: - read_unlock(&fib_rules_lock); + rcu_read_unlock(); return -EACCES; } @@ -327,16 +341,16 @@ FRprintk("tb %d r %d ", r->r_table, r->r res->r = policy; if (policy) atomic_inc(&policy->r_clntref); - read_unlock(&fib_rules_lock); + rcu_read_unlock(); return 0; } if (err < 0 && err != -EAGAIN) { - read_unlock(&fib_rules_lock); + rcu_read_unlock(); return err; } } FRprintk("FAILURE\n"); - read_unlock(&fib_rules_lock); + rcu_read_unlock(); return -ENETUNREACH; } @@ -414,20 +428,25 @@ rtattr_failure: return -1; } +/* callers should hold rtnl semaphore */ + int inet_dump_rules(struct sk_buff *skb, struct netlink_callback *cb) { - int idx; + int idx = 0; int s_idx = cb->args[0]; struct fib_rule *r; + struct hlist_node *node; + + rcu_read_lock(); + hlist_for_each_entry(r, node, &fib_rules, hlist) { - read_lock(&fib_rules_lock); - for (r=fib_rules, idx=0; r; r = r->r_next, idx++) { if (idx < s_idx) continue; if (inet_fill_rule(skb, r, cb, NLM_F_MULTI) < 0) break; + idx++; } - read_unlock(&fib_rules_lock); + rcu_read_unlock(); cb->args[0] = idx; return skb->len; @@ -435,5 +454,9 @@ int inet_dump_rules(struct sk_buff *skb, void __init fib_rules_init(void) { + INIT_HLIST_HEAD(&fib_rules); + hlist_add_head(&local_rule.hlist, &fib_rules); + hlist_add_after(&local_rule.hlist, &main_rule.hlist); + hlist_add_after(&main_rule.hlist, &default_rule.hlist); register_netdevice_notifier(&fib_rules_notifier); } diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c index 167619f..e52b50b 100644 --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c @@ -141,19 +141,21 @@ static unsigned int ipv4_conntrack_help( { struct nf_conn *ct; enum ip_conntrack_info ctinfo; + struct nf_conn_help *help; /* This is where we call the helper: as the packet goes out. */ ct = nf_ct_get(*pskb, &ctinfo); - if (ct && ct->helper) { - unsigned int ret; - ret = ct->helper->help(pskb, - (*pskb)->nh.raw - (*pskb)->data - + (*pskb)->nh.iph->ihl*4, - ct, ctinfo); - if (ret != NF_ACCEPT) - return ret; - } - return NF_ACCEPT; + if (!ct) + return NF_ACCEPT; + + help = nfct_help(ct); + if (!help || !help->helper) + return NF_ACCEPT; + + return help->helper->help(pskb, + (*pskb)->nh.raw - (*pskb)->data + + (*pskb)->nh.iph->ihl*4, + ct, ctinfo); } static unsigned int ipv4_conntrack_defrag(unsigned int hooknum, diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 16984d4..ebf2e0b 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -664,6 +664,22 @@ ctl_table ipv4_table[] = { .mode = 0644, .proc_handler = &proc_dointvec, }, + { + .ctl_name = NET_TCP_MTU_PROBING, + .procname = "tcp_mtu_probing", + .data = &sysctl_tcp_mtu_probing, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = NET_TCP_BASE_MSS, + .procname = "tcp_base_mss", + .data = &sysctl_tcp_base_mss, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, { .ctl_name = 0 } }; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index a97ed54..7b1a695 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1890,6 +1890,34 @@ static void tcp_try_to_open(struct sock } } +static void tcp_mtup_probe_failed(struct sock *sk) +{ + struct inet_connection_sock *icsk = inet_csk(sk); + + icsk->icsk_mtup.search_high = icsk->icsk_mtup.probe_size - 1; + icsk->icsk_mtup.probe_size = 0; +} + +static void tcp_mtup_probe_success(struct sock *sk, struct sk_buff *skb) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct inet_connection_sock *icsk = inet_csk(sk); + + /* FIXME: breaks with very large cwnd */ + tp->prior_ssthresh = tcp_current_ssthresh(sk); + tp->snd_cwnd = tp->snd_cwnd * + tcp_mss_to_mtu(sk, tp->mss_cache) / + icsk->icsk_mtup.probe_size; + tp->snd_cwnd_cnt = 0; + tp->snd_cwnd_stamp = tcp_time_stamp; + tp->rcv_ssthresh = tcp_current_ssthresh(sk); + + icsk->icsk_mtup.search_low = icsk->icsk_mtup.probe_size; + icsk->icsk_mtup.probe_size = 0; + tcp_sync_mss(sk, icsk->icsk_pmtu_cookie); +} + + /* Process an event, which can update packets-in-flight not trivially. * Main goal of this function is to calculate new estimate for left_out, * taking into account both packets sitting in receiver's buffer and @@ -2022,6 +2050,17 @@ tcp_fastretrans_alert(struct sock *sk, u return; } + /* MTU probe failure: don't reduce cwnd */ + if (icsk->icsk_ca_state < TCP_CA_CWR && + icsk->icsk_mtup.probe_size && + tp->snd_una == icsk->icsk_mtup.probe_seq_start) { + tcp_mtup_probe_failed(sk); + /* Restores the reduction we did in tcp_mtup_probe() */ + tp->snd_cwnd++; + tcp_simple_retransmit(sk); + return; + } + /* Otherwise enter Recovery state */ if (IsReno(tp)) @@ -2241,6 +2280,13 @@ static int tcp_clean_rtx_queue(struct so acked |= FLAG_SYN_ACKED; tp->retrans_stamp = 0; } + + /* MTU probing checks */ + if (icsk->icsk_mtup.probe_size) { + if (!after(icsk->icsk_mtup.probe_seq_end, TCP_SKB_CB(skb)->end_seq)) { + tcp_mtup_probe_success(sk, skb); + } + } if (sacked) { if (sacked & TCPCB_RETRANS) { @@ -4100,6 +4146,7 @@ static int tcp_rcv_synsent_state_process if (tp->rx_opt.sack_ok && sysctl_tcp_fack) tp->rx_opt.sack_ok |= 2; + tcp_mtup_init(sk); tcp_sync_mss(sk, icsk->icsk_pmtu_cookie); tcp_initialize_rcv_mss(sk); @@ -4210,6 +4257,7 @@ discard: if (tp->ecn_flags&TCP_ECN_OK) sock_set_flag(sk, SOCK_NO_LARGESEND); + tcp_mtup_init(sk); tcp_sync_mss(sk, icsk->icsk_pmtu_cookie); tcp_initialize_rcv_mss(sk); @@ -4398,6 +4446,7 @@ int tcp_rcv_state_process(struct sock *s */ tp->lsndtime = tcp_time_stamp; + tcp_mtup_init(sk); tcp_initialize_rcv_mss(sk); tcp_init_buffer_space(sk); tcp_fast_path_on(tp); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 233bdf2..57e7a26 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -900,6 +900,7 @@ struct sock *tcp_v4_syn_recv_sock(struct inet_csk(newsk)->icsk_ext_hdr_len = newinet->opt->optlen; newinet->id = newtp->write_seq ^ jiffies; + tcp_mtup_init(newsk); tcp_sync_mss(newsk, dst_mtu(dst)); newtp->advmss = dst_metric(dst, RTAX_ADVMSS); tcp_initialize_rcv_mss(newsk); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index a7623ea..bd7e996 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -51,6 +51,12 @@ int sysctl_tcp_retrans_collapse = 1; */ int sysctl_tcp_tso_win_divisor = 3; +int sysctl_tcp_mtu_probing = 0; +int sysctl_tcp_base_mss = 512; + +EXPORT_SYMBOL(sysctl_tcp_mtu_probing); +EXPORT_SYMBOL(sysctl_tcp_base_mss); + static void update_send_head(struct sock *sk, struct tcp_sock *tp, struct sk_buff *skb) { @@ -681,6 +687,62 @@ int tcp_trim_head(struct sock *sk, struc return 0; } +/* Not accounting for SACKs here. */ +int tcp_mtu_to_mss(struct sock *sk, int pmtu) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct inet_connection_sock *icsk = inet_csk(sk); + int mss_now; + + /* Calculate base mss without TCP options: + It is MMS_S - sizeof(tcphdr) of rfc1122 + */ + mss_now = pmtu - icsk->icsk_af_ops->net_header_len - sizeof(struct tcphdr); + + /* Clamp it (mss_clamp does not include tcp options) */ + if (mss_now > tp->rx_opt.mss_clamp) + mss_now = tp->rx_opt.mss_clamp; + + /* Now subtract optional transport overhead */ + mss_now -= icsk->icsk_ext_hdr_len; + + /* Then reserve room for full set of TCP options and 8 bytes of data */ + if (mss_now < 48) + mss_now = 48; + + /* Now subtract TCP options size, not including SACKs */ + mss_now -= tp->tcp_header_len - sizeof(struct tcphdr); + + return mss_now; +} + +/* Inverse of above */ +int tcp_mss_to_mtu(struct sock *sk, int mss) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct inet_connection_sock *icsk = inet_csk(sk); + int mtu; + + mtu = mss + + tp->tcp_header_len + + icsk->icsk_ext_hdr_len + + icsk->icsk_af_ops->net_header_len; + + return mtu; +} + +void tcp_mtup_init(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct inet_connection_sock *icsk = inet_csk(sk); + + icsk->icsk_mtup.enabled = sysctl_tcp_mtu_probing > 1; + icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp + sizeof(struct tcphdr) + + icsk->icsk_af_ops->net_header_len; + icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, sysctl_tcp_base_mss); + icsk->icsk_mtup.probe_size = 0; +} + /* This function synchronize snd mss to current pmtu/exthdr set. tp->rx_opt.user_mss is mss set by user by TCP_MAXSEG. It does NOT counts @@ -708,25 +770,12 @@ unsigned int tcp_sync_mss(struct sock *s { struct tcp_sock *tp = tcp_sk(sk); struct inet_connection_sock *icsk = inet_csk(sk); - /* Calculate base mss without TCP options: - It is MMS_S - sizeof(tcphdr) of rfc1122 - */ - int mss_now = (pmtu - icsk->icsk_af_ops->net_header_len - - sizeof(struct tcphdr)); - - /* Clamp it (mss_clamp does not include tcp options) */ - if (mss_now > tp->rx_opt.mss_clamp) - mss_now = tp->rx_opt.mss_clamp; - - /* Now subtract optional transport overhead */ - mss_now -= icsk->icsk_ext_hdr_len; + int mss_now; - /* Then reserve room for full set of TCP options and 8 bytes of data */ - if (mss_now < 48) - mss_now = 48; + if (icsk->icsk_mtup.search_high > pmtu) + icsk->icsk_mtup.search_high = pmtu; - /* Now subtract TCP options size, not including SACKs */ - mss_now -= tp->tcp_header_len - sizeof(struct tcphdr); + mss_now = tcp_mtu_to_mss(sk, pmtu); /* Bound mss with half of window */ if (tp->max_window && mss_now > (tp->max_window>>1)) @@ -734,6 +783,8 @@ unsigned int tcp_sync_mss(struct sock *s /* And store cached results */ icsk->icsk_pmtu_cookie = pmtu; + if (icsk->icsk_mtup.enabled) + mss_now = min(mss_now, tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low)); tp->mss_cache = mss_now; return mss_now; @@ -1059,6 +1110,140 @@ static int tcp_tso_should_defer(struct s return 1; } +/* Create a new MTU probe if we are ready. + * Returns 0 if we should wait to probe (no cwnd available), + * 1 if a probe was sent, + * -1 otherwise */ +static int tcp_mtu_probe(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct inet_connection_sock *icsk = inet_csk(sk); + struct sk_buff *skb, *nskb, *next; + int len; + int probe_size; + unsigned int pif; + int copy; + int mss_now; + + /* Not currently probing/verifying, + * not in recovery, + * have enough cwnd, and + * not SACKing (the variable headers throw things off) */ + if (!icsk->icsk_mtup.enabled || + icsk->icsk_mtup.probe_size || + inet_csk(sk)->icsk_ca_state != TCP_CA_Open || + tp->snd_cwnd < 11 || + tp->rx_opt.eff_sacks) + return -1; + + /* Very simple search strategy: just double the MSS. */ + mss_now = tcp_current_mss(sk, 0); + probe_size = 2*tp->mss_cache; + if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)) { + /* TODO: set timer for probe_converge_event */ + return -1; + } + + /* Have enough data in the send queue to probe? */ + len = 0; + if ((skb = sk->sk_send_head) == NULL) + return -1; + while ((len += skb->len) < probe_size && !tcp_skb_is_last(sk, skb)) + skb = skb->next; + if (len < probe_size) + return -1; + + /* Receive window check. */ + if (after(TCP_SKB_CB(skb)->seq + probe_size, tp->snd_una + tp->snd_wnd)) { + if (tp->snd_wnd < probe_size) + return -1; + else + return 0; + } + + /* Do we need to wait to drain cwnd? */ + pif = tcp_packets_in_flight(tp); + if (pif + 2 > tp->snd_cwnd) { + /* With no packets in flight, don't stall. */ + if (pif == 0) + return -1; + else + return 0; + } + + /* We're allowed to probe. Build it now. */ + if ((nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC)) == NULL) + return -1; + sk_charge_skb(sk, nskb); + + skb = sk->sk_send_head; + __skb_insert(nskb, skb->prev, skb, &sk->sk_write_queue); + sk->sk_send_head = nskb; + + TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(skb)->seq; + TCP_SKB_CB(nskb)->end_seq = TCP_SKB_CB(skb)->seq + probe_size; + TCP_SKB_CB(nskb)->flags = TCPCB_FLAG_ACK; + TCP_SKB_CB(nskb)->sacked = 0; + nskb->csum = 0; + if (skb->ip_summed == CHECKSUM_HW) + nskb->ip_summed = CHECKSUM_HW; + + len = 0; + while (len < probe_size) { + next = skb->next; + + copy = min_t(int, skb->len, probe_size - len); + if (nskb->ip_summed) + skb_copy_bits(skb, 0, skb_put(nskb, copy), copy); + else + nskb->csum = skb_copy_and_csum_bits(skb, 0, + skb_put(nskb, copy), copy, nskb->csum); + + if (skb->len <= copy) { + /* We've eaten all the data from this skb. + * Throw it away. */ + TCP_SKB_CB(nskb)->flags |= TCP_SKB_CB(skb)->flags; + __skb_unlink(skb, &sk->sk_write_queue); + sk_stream_free_skb(sk, skb); + } else { + TCP_SKB_CB(nskb)->flags |= TCP_SKB_CB(skb)->flags & + ~(TCPCB_FLAG_FIN|TCPCB_FLAG_PSH); + if (!skb_shinfo(skb)->nr_frags) { + skb_pull(skb, copy); + if (skb->ip_summed != CHECKSUM_HW) + skb->csum = csum_partial(skb->data, skb->len, 0); + } else { + __pskb_trim_head(skb, copy); + tcp_set_skb_tso_segs(sk, skb, mss_now); + } + TCP_SKB_CB(skb)->seq += copy; + } + + len += copy; + skb = next; + } + tcp_init_tso_segs(sk, nskb, nskb->len); + + /* We're ready to send. If this fails, the probe will + * be resegmented into mss-sized pieces by tcp_write_xmit(). */ + TCP_SKB_CB(nskb)->when = tcp_time_stamp; + if (!tcp_transmit_skb(sk, nskb, 1, GFP_ATOMIC)) { + /* Decrement cwnd here because we are sending + * effectively two packets. */ + tp->snd_cwnd--; + update_send_head(sk, tp, nskb); + + icsk->icsk_mtup.probe_size = tcp_mss_to_mtu(sk, nskb->len); + icsk->icsk_mtup.probe_seq_start = TCP_SKB_CB(nskb)->seq; + icsk->icsk_mtup.probe_seq_end = TCP_SKB_CB(nskb)->end_seq; + + return 1; + } + + return -1; +} + + /* This routine writes packets to the network. It advances the * send_head. This happens as incoming acks open up the remote * window for us. @@ -1072,6 +1257,7 @@ static int tcp_write_xmit(struct sock *s struct sk_buff *skb; unsigned int tso_segs, sent_pkts; int cwnd_quota; + int result; /* If we are closed, the bytes will have to remain here. * In time closedown will finish, we empty the write queue and all @@ -1081,12 +1267,20 @@ static int tcp_write_xmit(struct sock *s return 0; sent_pkts = 0; + + /* Do MTU probing. */ + if ((result = tcp_mtu_probe(sk)) == 0) { + return 0; + } else if (result > 0) { + sent_pkts = 1; + } + while ((skb = sk->sk_send_head)) { unsigned int limit; tso_segs = tcp_init_tso_segs(sk, skb, mss_now); BUG_ON(!tso_segs); - + cwnd_quota = tcp_cwnd_test(tp, skb); if (!cwnd_quota) break; @@ -1451,9 +1645,15 @@ void tcp_simple_retransmit(struct sock * int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb) { struct tcp_sock *tp = tcp_sk(sk); + struct inet_connection_sock *icsk = inet_csk(sk); unsigned int cur_mss = tcp_current_mss(sk, 0); int err; - + + /* Inconslusive MTU probe */ + if (icsk->icsk_mtup.probe_size) { + icsk->icsk_mtup.probe_size = 0; + } + /* Do not sent more than we queued. 1/4 is reserved for possible * copying overhead: fragmentation, tunneling, mangling etc. */ @@ -1879,6 +2079,7 @@ static void tcp_connect_init(struct sock if (tp->rx_opt.user_mss) tp->rx_opt.mss_clamp = tp->rx_opt.user_mss; tp->max_window = 0; + tcp_mtup_init(sk); tcp_sync_mss(sk, dst_mtu(dst)); if (!tp->window_clamp) @@ -2176,3 +2377,4 @@ EXPORT_SYMBOL(tcp_make_synack); EXPORT_SYMBOL(tcp_simple_retransmit); EXPORT_SYMBOL(tcp_sync_mss); EXPORT_SYMBOL(sysctl_tcp_tso_win_divisor); +EXPORT_SYMBOL(tcp_mtup_init); diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index e188095..7c1bde3 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -119,8 +119,10 @@ static int tcp_orphan_retries(struct soc /* A write timeout has occurred. Process the after effects. */ static int tcp_write_timeout(struct sock *sk) { - const struct inet_connection_sock *icsk = inet_csk(sk); + struct inet_connection_sock *icsk = inet_csk(sk); + struct tcp_sock *tp = tcp_sk(sk); int retry_until; + int mss; if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) { if (icsk->icsk_retransmits) @@ -128,25 +130,19 @@ static int tcp_write_timeout(struct sock retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries; } else { if (icsk->icsk_retransmits >= sysctl_tcp_retries1) { - /* NOTE. draft-ietf-tcpimpl-pmtud-01.txt requires pmtu black - hole detection. :-( - - It is place to make it. It is not made. I do not want - to make it. It is disgusting. It does not work in any - case. Let me to cite the same draft, which requires for - us to implement this: - - "The one security concern raised by this memo is that ICMP black holes - are often caused by over-zealous security administrators who block - all ICMP messages. It is vitally important that those who design and - deploy security systems understand the impact of strict filtering on - upper-layer protocols. The safest web site in the world is worthless - if most TCP implementations cannot transfer data from it. It would - be far nicer to have all of the black holes fixed rather than fixing - all of the TCP implementations." - - Golden words :-). - */ + /* Black hole detection */ + if (sysctl_tcp_mtu_probing) { + if (!icsk->icsk_mtup.enabled) { + icsk->icsk_mtup.enabled = 1; + tcp_sync_mss(sk, icsk->icsk_pmtu_cookie); + } else { + mss = min(sysctl_tcp_base_mss, + tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low)/2); + mss = max(mss, 68 - tp->tcp_header_len); + icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss); + tcp_sync_mss(sk, icsk->icsk_pmtu_cookie); + } + } dst_negative_advice(&sk->sk_dst_cache); } diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig index ab7a912..163ab31 100644 --- a/net/ipv6/Kconfig +++ b/net/ipv6/Kconfig @@ -6,8 +6,6 @@ config IPV6 tristate "The IPv6 protocol" default m - select CRYPTO if IPV6_PRIVACY - select CRYPTO_MD5 if IPV6_PRIVACY ---help--- This is complemental support for the IP version 6. You will still be able to do traditional IPv4 networking as well. @@ -22,7 +20,7 @@ config IPV6 module will be called ipv6. config IPV6_PRIVACY - bool "IPv6: Privacy Extensions (RFC 3041) support" + bool "IPv6: Privacy Extensions support" depends on IPV6 ---help--- Privacy Extensions for Stateless Address Autoconfiguration in IPv6 @@ -30,6 +28,9 @@ config IPV6_PRIVACY pseudo-random global-scope unicast address(es) will assigned to your interface(s). + We use our standard pseudo random algorithm to generate randomized + interface identifier, instead of one described in RFC 3041. + By default, kernel do not generate temporary addresses. To use temporary addresses, do @@ -37,6 +38,25 @@ config IPV6_PRIVACY See for details. +config IPV6_ROUTER_PREF + bool "IPv6: Router Preference (RFC 4191) support" + depends on IPV6 + ---help--- + Router Preference is an optional extension to the Router + Advertisement message to improve the ability of hosts + to pick more appropriate router, especially when the hosts + is placed in a multi-homed network. + + If unsure, say N. + +config IPV6_ROUTE_INFO + bool "IPv6: Route Information (RFC 4191) support (EXPERIMENTAL)" + depends on IPV6_ROUTER_PREF && EXPERIMENTAL + ---help--- + This is experimental support of Route Information. + + If unsure, say N. + config INET6_AH tristate "IPv6: AH transformation" depends on IPV6 diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 1db5048..d6d97c1 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -78,8 +78,6 @@ #ifdef CONFIG_IPV6_PRIVACY #include -#include -#include #endif #include @@ -110,8 +108,6 @@ static int __ipv6_try_regen_rndid(struct static void ipv6_regen_rndid(unsigned long data); static int desync_factor = MAX_DESYNC_FACTOR * HZ; -static struct crypto_tfm *md5_tfm; -static DEFINE_SPINLOCK(md5_tfm_lock); #endif static int ipv6_count_addresses(struct inet6_dev *idev); @@ -169,6 +165,15 @@ struct ipv6_devconf ipv6_devconf = { .max_desync_factor = MAX_DESYNC_FACTOR, #endif .max_addresses = IPV6_MAX_ADDRESSES, + .accept_ra_defrtr = 1, + .accept_ra_pinfo = 1, +#ifdef CONFIG_IPV6_ROUTER_PREF + .accept_ra_rtr_pref = 1, + .rtr_probe_interval = 60 * HZ, +#ifdef CONFIG_IPV6_ROUTE_INFO + .accept_ra_rt_info_max_plen = 0, +#endif +#endif }; static struct ipv6_devconf ipv6_devconf_dflt = { @@ -190,6 +195,15 @@ static struct ipv6_devconf ipv6_devconf_ .max_desync_factor = MAX_DESYNC_FACTOR, #endif .max_addresses = IPV6_MAX_ADDRESSES, + .accept_ra_defrtr = 1, + .accept_ra_pinfo = 1, +#ifdef CONFIG_IPV6_ROUTER_PREF + .accept_ra_rtr_pref = 1, + .rtr_probe_interval = 60 * HZ, +#ifdef CONFIG_IPV6_ROUTE_INFO + .accept_ra_rt_info_max_plen = 0, +#endif +#endif }; /* IPv6 Wildcard Address and Loopback Address defined by RFC2553 */ @@ -371,8 +385,6 @@ static struct inet6_dev * ipv6_add_dev(s in6_dev_hold(ndev); #ifdef CONFIG_IPV6_PRIVACY - get_random_bytes(ndev->rndid, sizeof(ndev->rndid)); - get_random_bytes(ndev->entropy, sizeof(ndev->entropy)); init_timer(&ndev->regen_timer); ndev->regen_timer.function = ipv6_regen_rndid; ndev->regen_timer.data = (unsigned long) ndev; @@ -1305,52 +1317,67 @@ static void addrconf_leave_anycast(struc __ipv6_dev_ac_dec(ifp->idev, &addr); } +static int addrconf_ifid_eui48(u8 *eui, struct net_device *dev) +{ + if (dev->addr_len != ETH_ALEN) + return -1; + memcpy(eui, dev->dev_addr, 3); + memcpy(eui + 5, dev->dev_addr + 3, 3); + + /* + * The zSeries OSA network cards can be shared among various + * OS instances, but the OSA cards have only one MAC address. + * This leads to duplicate address conflicts in conjunction + * with IPv6 if more than one instance uses the same card. + * + * The driver for these cards can deliver a unique 16-bit + * identifier for each instance sharing the same card. It is + * placed instead of 0xFFFE in the interface identifier. The + * "u" bit of the interface identifier is not inverted in this + * case. Hence the resulting interface identifier has local + * scope according to RFC2373. + */ + if (dev->dev_id) { + eui[3] = (dev->dev_id >> 8) & 0xFF; + eui[4] = dev->dev_id & 0xFF; + } else { + eui[3] = 0xFF; + eui[4] = 0xFE; + eui[0] ^= 2; + } + return 0; +} + +static int addrconf_ifid_arcnet(u8 *eui, struct net_device *dev) +{ + /* XXX: inherit EUI-64 from other interface -- yoshfuji */ + if (dev->addr_len != ARCNET_ALEN) + return -1; + memset(eui, 0, 7); + eui[7] = *(u8*)dev->dev_addr; + return 0; +} + +static int addrconf_ifid_infiniband(u8 *eui, struct net_device *dev) +{ + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; +} + static int ipv6_generate_eui64(u8 *eui, struct net_device *dev) { switch (dev->type) { case ARPHRD_ETHER: case ARPHRD_FDDI: case ARPHRD_IEEE802_TR: - if (dev->addr_len != ETH_ALEN) - return -1; - memcpy(eui, dev->dev_addr, 3); - memcpy(eui + 5, dev->dev_addr + 3, 3); - - /* - * The zSeries OSA network cards can be shared among various - * OS instances, but the OSA cards have only one MAC address. - * This leads to duplicate address conflicts in conjunction - * with IPv6 if more than one instance uses the same card. - * - * The driver for these cards can deliver a unique 16-bit - * identifier for each instance sharing the same card. It is - * placed instead of 0xFFFE in the interface identifier. The - * "u" bit of the interface identifier is not inverted in this - * case. Hence the resulting interface identifier has local - * scope according to RFC2373. - */ - if (dev->dev_id) { - eui[3] = (dev->dev_id >> 8) & 0xFF; - eui[4] = dev->dev_id & 0xFF; - } else { - eui[3] = 0xFF; - eui[4] = 0xFE; - eui[0] ^= 2; - } - return 0; + return addrconf_ifid_eui48(eui, dev); case ARPHRD_ARCNET: - /* XXX: inherit EUI-64 from other interface -- yoshfuji */ - if (dev->addr_len != ARCNET_ALEN) - return -1; - memset(eui, 0, 7); - eui[7] = *(u8*)dev->dev_addr; - return 0; + return addrconf_ifid_arcnet(eui, dev); case ARPHRD_INFINIBAND: - if (dev->addr_len != INFINIBAND_ALEN) - return -1; - memcpy(eui, dev->dev_addr + 12, 8); - eui[0] |= 2; - return 0; + return addrconf_ifid_infiniband(eui, dev); } return -1; } @@ -1376,34 +1403,9 @@ static int ipv6_inherit_eui64(u8 *eui, s /* (re)generation of randomized interface identifier (RFC 3041 3.2, 3.5) */ static int __ipv6_regen_rndid(struct inet6_dev *idev) { - struct net_device *dev; - struct scatterlist sg[2]; - - sg_set_buf(&sg[0], idev->entropy, 8); - sg_set_buf(&sg[1], idev->work_eui64, 8); - - dev = idev->dev; - - if (ipv6_generate_eui64(idev->work_eui64, dev)) { - printk(KERN_INFO - "__ipv6_regen_rndid(idev=%p): cannot get EUI64 identifier; use random bytes.\n", - idev); - get_random_bytes(idev->work_eui64, sizeof(idev->work_eui64)); - } regen: - spin_lock(&md5_tfm_lock); - if (unlikely(md5_tfm == NULL)) { - spin_unlock(&md5_tfm_lock); - return -1; - } - crypto_digest_init(md5_tfm); - crypto_digest_update(md5_tfm, sg, 2); - crypto_digest_final(md5_tfm, idev->work_digest); - spin_unlock(&md5_tfm_lock); - - memcpy(idev->rndid, &idev->work_digest[0], 8); + get_random_bytes(idev->rndid, sizeof(idev->rndid)); idev->rndid[0] &= ~0x02; - memcpy(idev->entropy, &idev->work_digest[8], 8); /* * : @@ -2143,7 +2145,6 @@ static void addrconf_ip6_tnl_config(stru return; } ip6_tnl_add_linklocal(idev); - addrconf_add_mroute(dev); } static int addrconf_notify(struct notifier_block *this, unsigned long event, @@ -3130,6 +3131,15 @@ static void inline ipv6_store_devconf(st array[DEVCONF_MAX_DESYNC_FACTOR] = cnf->max_desync_factor; #endif array[DEVCONF_MAX_ADDRESSES] = cnf->max_addresses; + array[DEVCONF_ACCEPT_RA_DEFRTR] = cnf->accept_ra_defrtr; + array[DEVCONF_ACCEPT_RA_PINFO] = cnf->accept_ra_pinfo; +#ifdef CONFIG_IPV6_ROUTER_PREF + array[DEVCONF_ACCEPT_RA_RTR_PREF] = cnf->accept_ra_rtr_pref; + array[DEVCONF_RTR_PROBE_INTERVAL] = cnf->rtr_probe_interval; +#ifdef CONFIV_IPV6_ROUTE_INFO + array[DEVCONF_ACCEPT_RA_RT_INFO_MAX_PLEN] = cnf->accept_ra_rt_info_max_plen; +#endif +#endif } static int inet6_fill_ifinfo(struct sk_buff *skb, struct inet6_dev *idev, @@ -3583,6 +3593,51 @@ static struct addrconf_sysctl_table .proc_handler = &proc_dointvec, }, { + .ctl_name = NET_IPV6_ACCEPT_RA_DEFRTR, + .procname = "accept_ra_defrtr", + .data = &ipv6_devconf.accept_ra_defrtr, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = NET_IPV6_ACCEPT_RA_PINFO, + .procname = "accept_ra_pinfo", + .data = &ipv6_devconf.accept_ra_pinfo, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#ifdef CONFIG_IPV6_ROUTER_PREF + { + .ctl_name = NET_IPV6_ACCEPT_RA_RTR_PREF, + .procname = "accept_ra_rtr_pref", + .data = &ipv6_devconf.accept_ra_rtr_pref, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = NET_IPV6_RTR_PROBE_INTERVAL, + .procname = "router_probe_interval", + .data = &ipv6_devconf.rtr_probe_interval, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec_jiffies, + .strategy = &sysctl_jiffies, + }, +#ifdef CONFIV_IPV6_ROUTE_INFO + { + .ctl_name = NET_IPV6_ACCEPT_RA_RT_INFO_MAX_PLEN, + .procname = "accept_ra_rt_info_max_plen", + .data = &ipv6_devconf.accept_ra_rt_info_max_plen, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#endif +#endif + { .ctl_name = 0, /* sentinel */ } }, @@ -3757,13 +3812,6 @@ int __init addrconf_init(void) register_netdevice_notifier(&ipv6_dev_notf); -#ifdef CONFIG_IPV6_PRIVACY - md5_tfm = crypto_alloc_tfm("md5", 0); - if (unlikely(md5_tfm == NULL)) - printk(KERN_WARNING - "failed to load transform for md5\n"); -#endif - addrconf_verify(0); rtnetlink_links[PF_INET6] = inet6_rtnetlink_table; #ifdef CONFIG_SYSCTL @@ -3826,11 +3874,6 @@ void __exit addrconf_cleanup(void) rtnl_unlock(); -#ifdef CONFIG_IPV6_PRIVACY - crypto_free_tfm(md5_tfm); - md5_tfm = NULL; -#endif - #ifdef CONFIG_PROC_FS proc_net_remove("if_inet6"); #endif diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 1bf6d9a..2cb6149 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -1105,7 +1105,6 @@ static int fib6_age(struct rt6_info *rt, if (rt->rt6i_flags&RTF_EXPIRES && rt->rt6i_expires) { if (time_after(now, rt->rt6i_expires)) { RT6_TRACE("expiring %p\n", rt); - rt6_reset_dflt_pointer(rt); return -1; } gc_args.more++; diff --git a/net/ipv6/ipcomp6.c b/net/ipv6/ipcomp6.c index d511a88..6107592 100644 --- a/net/ipv6/ipcomp6.c +++ b/net/ipv6/ipcomp6.c @@ -286,8 +286,8 @@ static void ipcomp6_free_scratches(void) for_each_cpu(i) { void *scratch = *per_cpu_ptr(scratches, i); - if (scratch) - vfree(scratch); + + vfree(scratch); } free_percpu(scratches); diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c index cb8856b..dfa20d3 100644 --- a/net/ipv6/ndisc.c +++ b/net/ipv6/ndisc.c @@ -156,7 +156,11 @@ struct neigh_table nd_tbl = { /* ND options */ struct ndisc_options { - struct nd_opt_hdr *nd_opt_array[__ND_OPT_MAX]; + struct nd_opt_hdr *nd_opt_array[__ND_OPT_ARRAY_MAX]; +#ifdef CONFIG_IPV6_ROUTE_INFO + struct nd_opt_hdr *nd_opts_ri; + struct nd_opt_hdr *nd_opts_ri_end; +#endif }; #define nd_opts_src_lladdr nd_opt_array[ND_OPT_SOURCE_LL_ADDR] @@ -255,6 +259,13 @@ static struct ndisc_options *ndisc_parse if (ndopts->nd_opt_array[nd_opt->nd_opt_type] == 0) ndopts->nd_opt_array[nd_opt->nd_opt_type] = nd_opt; break; +#ifdef CONFIG_IPV6_ROUTE_INFO + case ND_OPT_ROUTE_INFO: + ndopts->nd_opts_ri_end = nd_opt; + if (!ndopts->nd_opts_ri) + ndopts->nd_opts_ri = nd_opt; + break; +#endif default: /* * Unknown options must be silently ignored, @@ -1019,10 +1030,11 @@ static void ndisc_router_discovery(struc struct ra_msg *ra_msg = (struct ra_msg *) skb->h.raw; struct neighbour *neigh = NULL; struct inet6_dev *in6_dev; - struct rt6_info *rt; + struct rt6_info *rt = NULL; int lifetime; struct ndisc_options ndopts; int optlen; + unsigned int pref = 0; __u8 * opt = (__u8 *)(ra_msg + 1); @@ -1081,8 +1093,19 @@ static void ndisc_router_discovery(struc (ra_msg->icmph.icmp6_addrconf_other ? IF_RA_OTHERCONF : 0); + if (!in6_dev->cnf.accept_ra_defrtr) + goto skip_defrtr; + lifetime = ntohs(ra_msg->icmph.icmp6_rt_lifetime); +#ifdef CONFIG_IPV6_ROUTER_PREF + pref = ra_msg->icmph.icmp6_router_pref; + /* 10b is handled as if it were 00b (medium) */ + if (pref == ICMPV6_ROUTER_PREF_INVALID || + in6_dev->cnf.accept_ra_rtr_pref) + pref = ICMPV6_ROUTER_PREF_MEDIUM; +#endif + rt = rt6_get_dflt_router(&skb->nh.ipv6h->saddr, skb->dev); if (rt) @@ -1098,7 +1121,7 @@ static void ndisc_router_discovery(struc ND_PRINTK3(KERN_DEBUG "ICMPv6 RA: adding default router.\n"); - rt = rt6_add_dflt_router(&skb->nh.ipv6h->saddr, skb->dev); + rt = rt6_add_dflt_router(&skb->nh.ipv6h->saddr, skb->dev, pref); if (rt == NULL) { ND_PRINTK0(KERN_ERR "ICMPv6 RA: %s() failed to add default route.\n", @@ -1117,6 +1140,8 @@ static void ndisc_router_discovery(struc return; } neigh->flags |= NTF_ROUTER; + } else if (rt) { + rt->rt6i_flags |= (rt->rt6i_flags & ~RTF_PREF_MASK) | RTF_PREF(pref); } if (rt) @@ -1128,6 +1153,8 @@ static void ndisc_router_discovery(struc rt->u.dst.metrics[RTAX_HOPLIMIT-1] = ra_msg->icmph.icmp6_hop_limit; } +skip_defrtr: + /* * Update Reachable Time and Retrans Timer */ @@ -1186,7 +1213,21 @@ static void ndisc_router_discovery(struc NEIGH_UPDATE_F_ISROUTER); } - if (ndopts.nd_opts_pi) { +#ifdef CONFIG_IPV6_ROUTE_INFO + if (in6_dev->cnf.accept_ra_rtr_pref && ndopts.nd_opts_ri) { + struct nd_opt_hdr *p; + for (p = ndopts.nd_opts_ri; + p; + p = ndisc_next_option(p, ndopts.nd_opts_ri_end)) { + if (((struct route_info *)p)->prefix_len > in6_dev->cnf.accept_ra_rt_info_max_plen) + continue; + rt6_route_rcv(skb->dev, (u8*)p, (p->nd_opt_len) << 3, + &skb->nh.ipv6h->saddr); + } + } +#endif + + if (in6_dev->cnf.accept_ra_pinfo && ndopts.nd_opts_pi) { struct nd_opt_hdr *p; for (p = ndopts.nd_opts_pi; p; diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c index ac702a2..ac35f95 100644 --- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c +++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c @@ -179,31 +179,36 @@ static unsigned int ipv6_confirm(unsigne int (*okfn)(struct sk_buff *)) { struct nf_conn *ct; + struct nf_conn_help *help; enum ip_conntrack_info ctinfo; + unsigned int ret, protoff; + unsigned int extoff = (u8*)((*pskb)->nh.ipv6h + 1) + - (*pskb)->data; + unsigned char pnum = (*pskb)->nh.ipv6h->nexthdr; + /* This is where we call the helper: as the packet goes out. */ ct = nf_ct_get(*pskb, &ctinfo); - if (ct && ct->helper) { - unsigned int ret, protoff; - unsigned int extoff = (u8*)((*pskb)->nh.ipv6h + 1) - - (*pskb)->data; - unsigned char pnum = (*pskb)->nh.ipv6h->nexthdr; - - protoff = nf_ct_ipv6_skip_exthdr(*pskb, extoff, &pnum, - (*pskb)->len - extoff); - if (protoff < 0 || protoff > (*pskb)->len || - pnum == NEXTHDR_FRAGMENT) { - DEBUGP("proto header not found\n"); - return NF_ACCEPT; - } + if (!ct) + goto out; - ret = ct->helper->help(pskb, protoff, ct, ctinfo); - if (ret != NF_ACCEPT) - return ret; + help = nfct_help(ct); + if (!help || !help->helper) + goto out; + + protoff = nf_ct_ipv6_skip_exthdr(*pskb, extoff, &pnum, + (*pskb)->len - extoff); + if (protoff < 0 || protoff > (*pskb)->len || + pnum == NEXTHDR_FRAGMENT) { + DEBUGP("proto header not found\n"); + return NF_ACCEPT; } + ret = help->helper->help(pskb, protoff, ct, ctinfo); + if (ret != NF_ACCEPT) + return ret; +out: /* We've seen it coming out the other side: confirm it */ - return nf_conntrack_confirm(pskb); } diff --git a/net/ipv6/route.c b/net/ipv6/route.c index e0d3ad0..074c49f 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -72,6 +72,10 @@ #define RT6_TRACE(x...) do { ; } while (0) #endif +#define CLONE_OFFLINK_ROUTE 0 + +#define RT6_SELECT_F_IFACE 0x1 +#define RT6_SELECT_F_REACHABLE 0x2 static int ip6_rt_max_size = 4096; static int ip6_rt_gc_min_interval = HZ / 2; @@ -94,6 +98,14 @@ static int ip6_pkt_discard_out(struct s static void ip6_link_failure(struct sk_buff *skb); static void ip6_rt_update_pmtu(struct dst_entry *dst, u32 mtu); +#ifdef CONFIG_IPV6_ROUTE_INFO +static struct rt6_info *rt6_add_route_info(struct in6_addr *prefix, int prefixlen, + struct in6_addr *gwaddr, int ifindex, + unsigned pref); +static struct rt6_info *rt6_get_route_info(struct in6_addr *prefix, int prefixlen, + struct in6_addr *gwaddr, int ifindex); +#endif + static struct dst_ops ip6_dst_ops = { .family = AF_INET6, .protocol = __constant_htons(ETH_P_IPV6), @@ -214,150 +226,211 @@ static __inline__ struct rt6_info *rt6_d return rt; } +#ifdef CONFIG_IPV6_ROUTER_PREF +static void rt6_probe(struct rt6_info *rt) +{ + struct neighbour *neigh = rt ? rt->rt6i_nexthop : NULL; + /* + * Okay, this does not seem to be appropriate + * for now, however, we need to check if it + * is really so; aka Router Reachability Probing. + * + * Router Reachability Probe MUST be rate-limited + * to no more than one per minute. + */ + if (!neigh || (neigh->nud_state & NUD_VALID)) + return; + read_lock_bh(&neigh->lock); + if (!(neigh->nud_state & NUD_VALID) && + time_after(jiffies, neigh->updated + rt->rt6i_idev->cnf.rtr_probe_interval)) { + struct in6_addr mcaddr; + struct in6_addr *target; + + neigh->updated = jiffies; + read_unlock_bh(&neigh->lock); + + target = (struct in6_addr *)&neigh->primary_key; + addrconf_addr_solict_mult(target, &mcaddr); + ndisc_send_ns(rt->rt6i_dev, NULL, target, &mcaddr, NULL); + } else + read_unlock_bh(&neigh->lock); +} +#else +static inline void rt6_probe(struct rt6_info *rt) +{ + return; +} +#endif + /* - * pointer to the last default router chosen. BH is disabled locally. + * Default Router Selection (RFC 2461 6.3.6) */ -static struct rt6_info *rt6_dflt_pointer; -static DEFINE_SPINLOCK(rt6_dflt_lock); +static int inline rt6_check_dev(struct rt6_info *rt, int oif) +{ + struct net_device *dev = rt->rt6i_dev; + if (!oif || dev->ifindex == oif) + return 2; + if ((dev->flags & IFF_LOOPBACK) && + rt->rt6i_idev && rt->rt6i_idev->dev->ifindex == oif) + return 1; + return 0; +} -void rt6_reset_dflt_pointer(struct rt6_info *rt) +static int inline rt6_check_neigh(struct rt6_info *rt) { - spin_lock_bh(&rt6_dflt_lock); - if (rt == NULL || rt == rt6_dflt_pointer) { - RT6_TRACE("reset default router: %p->NULL\n", rt6_dflt_pointer); - rt6_dflt_pointer = NULL; + struct neighbour *neigh = rt->rt6i_nexthop; + int m = 0; + if (neigh) { + read_lock_bh(&neigh->lock); + if (neigh->nud_state & NUD_VALID) + m = 1; + read_unlock_bh(&neigh->lock); } - spin_unlock_bh(&rt6_dflt_lock); + return m; } -/* Default Router Selection (RFC 2461 6.3.6) */ -static struct rt6_info *rt6_best_dflt(struct rt6_info *rt, int oif) +static int rt6_score_route(struct rt6_info *rt, int oif, + int strict) { - struct rt6_info *match = NULL; - struct rt6_info *sprt; - int mpri = 0; - - for (sprt = rt; sprt; sprt = sprt->u.next) { - struct neighbour *neigh; - int m = 0; - - if (!oif || - (sprt->rt6i_dev && - sprt->rt6i_dev->ifindex == oif)) - m += 8; + int m = rt6_check_dev(rt, oif); + if (!m && (strict & RT6_SELECT_F_IFACE)) + return -1; +#ifdef CONFIG_IPV6_ROUTER_PREF + m |= IPV6_DECODE_PREF(IPV6_EXTRACT_PREF(rt->rt6i_flags)) << 2; +#endif + if (rt6_check_neigh(rt)) + m |= 16; + else if (strict & RT6_SELECT_F_REACHABLE) + return -1; + return m; +} - if (rt6_check_expired(sprt)) - continue; +static struct rt6_info *rt6_select(struct rt6_info **head, int oif, + int strict) +{ + struct rt6_info *match = NULL, *last = NULL; + struct rt6_info *rt, *rt0 = *head; + u32 metric; + int mpri = -1; - if (sprt == rt6_dflt_pointer) - m += 4; + RT6_TRACE("%s(head=%p(*head=%p), oif=%d)\n", + __FUNCTION__, head, head ? *head : NULL, oif); - if ((neigh = sprt->rt6i_nexthop) != NULL) { - read_lock_bh(&neigh->lock); - switch (neigh->nud_state) { - case NUD_REACHABLE: - m += 3; - break; + for (rt = rt0, metric = rt0->rt6i_metric; + rt && rt->rt6i_metric == metric; + rt = rt->u.next) { + int m; - case NUD_STALE: - case NUD_DELAY: - case NUD_PROBE: - m += 2; - break; + if (rt6_check_expired(rt)) + continue; - case NUD_NOARP: - case NUD_PERMANENT: - m += 1; - break; + last = rt; - case NUD_INCOMPLETE: - default: - read_unlock_bh(&neigh->lock); - continue; - } - read_unlock_bh(&neigh->lock); - } else { + m = rt6_score_route(rt, oif, strict); + if (m < 0) continue; - } - if (m > mpri || m >= 12) { - match = sprt; + if (m > mpri) { + rt6_probe(match); + match = rt; mpri = m; - if (m >= 12) { - /* we choose the last default router if it - * is in (probably) reachable state. - * If route changed, we should do pmtu - * discovery. --yoshfuji - */ - break; - } + } else { + rt6_probe(rt); } } - spin_lock(&rt6_dflt_lock); - if (!match) { - /* - * No default routers are known to be reachable. - * SHOULD round robin - */ - if (rt6_dflt_pointer) { - for (sprt = rt6_dflt_pointer->u.next; - sprt; sprt = sprt->u.next) { - if (sprt->u.dst.obsolete <= 0 && - sprt->u.dst.error == 0 && - !rt6_check_expired(sprt)) { - match = sprt; - break; - } - } - for (sprt = rt; - !match && sprt; - sprt = sprt->u.next) { - if (sprt->u.dst.obsolete <= 0 && - sprt->u.dst.error == 0 && - !rt6_check_expired(sprt)) { - match = sprt; - break; - } - if (sprt == rt6_dflt_pointer) - break; - } - } + if (!match && + (strict & RT6_SELECT_F_REACHABLE) && + last && last != rt0) { + /* no entries matched; do round-robin */ + *head = rt0->u.next; + rt0->u.next = last->u.next; + last->u.next = rt0; } - if (match) { - if (rt6_dflt_pointer != match) - RT6_TRACE("changed default router: %p->%p\n", - rt6_dflt_pointer, match); - rt6_dflt_pointer = match; + RT6_TRACE("%s() => %p, score=%d\n", + __FUNCTION__, match, mpri); + + return (match ? match : &ip6_null_entry); +} + +#ifdef CONFIG_IPV6_ROUTE_INFO +int rt6_route_rcv(struct net_device *dev, u8 *opt, int len, + struct in6_addr *gwaddr) +{ + struct route_info *rinfo = (struct route_info *) opt; + struct in6_addr prefix_buf, *prefix; + unsigned int pref; + u32 lifetime; + struct rt6_info *rt; + + if (len < sizeof(struct route_info)) { + return -EINVAL; } - spin_unlock(&rt6_dflt_lock); - if (!match) { - /* - * Last Resort: if no default routers found, - * use addrconf default route. - * We don't record this route. - */ - for (sprt = ip6_routing_table.leaf; - sprt; sprt = sprt->u.next) { - if (!rt6_check_expired(sprt) && - (sprt->rt6i_flags & RTF_DEFAULT) && - (!oif || - (sprt->rt6i_dev && - sprt->rt6i_dev->ifindex == oif))) { - match = sprt; - break; - } + /* Sanity check for prefix_len and length */ + if (rinfo->length > 3) { + return -EINVAL; + } else if (rinfo->prefix_len > 128) { + return -EINVAL; + } else if (rinfo->prefix_len > 64) { + if (rinfo->length < 2) { + return -EINVAL; } - if (!match) { - /* no default route. give up. */ - match = &ip6_null_entry; + } else if (rinfo->prefix_len > 0) { + if (rinfo->length < 1) { + return -EINVAL; } } - return match; + pref = rinfo->route_pref; + if (pref == ICMPV6_ROUTER_PREF_INVALID) + pref = ICMPV6_ROUTER_PREF_MEDIUM; + + lifetime = htonl(rinfo->lifetime); + if (lifetime == 0xffffffff) { + /* infinity */ + } else if (lifetime > 0x7fffffff/HZ) { + /* Avoid arithmetic overflow */ + lifetime = 0x7fffffff/HZ - 1; + } + + if (rinfo->length == 3) + prefix = (struct in6_addr *)rinfo->prefix; + else { + /* this function is safe */ + ipv6_addr_prefix(&prefix_buf, + (struct in6_addr *)rinfo->prefix, + rinfo->prefix_len); + prefix = &prefix_buf; + } + + rt = rt6_get_route_info(prefix, rinfo->prefix_len, gwaddr, dev->ifindex); + + if (rt && !lifetime) { + ip6_del_rt(rt, NULL, NULL, NULL); + rt = NULL; + } + + if (!rt && lifetime) + rt = rt6_add_route_info(prefix, rinfo->prefix_len, gwaddr, dev->ifindex, + pref); + else if (rt) + rt->rt6i_flags = RTF_ROUTEINFO | + (rt->rt6i_flags & ~RTF_PREF_MASK) | RTF_PREF(pref); + + if (rt) { + if (lifetime == 0xffffffff) { + rt->rt6i_flags &= ~RTF_EXPIRES; + } else { + rt->rt6i_expires = jiffies + HZ * lifetime; + rt->rt6i_flags |= RTF_EXPIRES; + } + dst_release(&rt->u.dst); + } + return 0; } +#endif struct rt6_info *rt6_lookup(struct in6_addr *daddr, struct in6_addr *saddr, int oif, int strict) @@ -397,14 +470,9 @@ int ip6_ins_rt(struct rt6_info *rt, stru return err; } -/* No rt6_lock! If COW failed, the function returns dead route entry - with dst->error set to errno value. - */ - -static struct rt6_info *rt6_cow(struct rt6_info *ort, struct in6_addr *daddr, - struct in6_addr *saddr, struct netlink_skb_parms *req) +static struct rt6_info *rt6_alloc_cow(struct rt6_info *ort, struct in6_addr *daddr, + struct in6_addr *saddr) { - int err; struct rt6_info *rt; /* @@ -435,25 +503,30 @@ static struct rt6_info *rt6_cow(struct r rt->rt6i_nexthop = ndisc_get_neigh(rt->rt6i_dev, &rt->rt6i_gateway); - dst_hold(&rt->u.dst); - - err = ip6_ins_rt(rt, NULL, NULL, req); - if (err == 0) - return rt; + } - rt->u.dst.error = err; + return rt; +} - return rt; +static struct rt6_info *rt6_alloc_clone(struct rt6_info *ort, struct in6_addr *daddr) +{ + struct rt6_info *rt = ip6_rt_copy(ort); + if (rt) { + ipv6_addr_copy(&rt->rt6i_dst.addr, daddr); + rt->rt6i_dst.plen = 128; + rt->rt6i_flags |= RTF_CACHE; + if (rt->rt6i_flags & RTF_REJECT) + rt->u.dst.error = ort->u.dst.error; + rt->u.dst.flags |= DST_HOST; + rt->rt6i_nexthop = neigh_clone(ort->rt6i_nexthop); } - dst_hold(&ip6_null_entry.u.dst); - return &ip6_null_entry; + return rt; } #define BACKTRACK() \ -if (rt == &ip6_null_entry && strict) { \ +if (rt == &ip6_null_entry) { \ while ((fn = fn->parent) != NULL) { \ if (fn->fn_flags & RTN_ROOT) { \ - dst_hold(&rt->u.dst); \ goto out; \ } \ if (fn->fn_flags & RTN_RTINFO) \ @@ -465,115 +538,138 @@ if (rt == &ip6_null_entry && strict) { \ void ip6_route_input(struct sk_buff *skb) { struct fib6_node *fn; - struct rt6_info *rt; + struct rt6_info *rt, *nrt; int strict; int attempts = 3; + int err; + int reachable = RT6_SELECT_F_REACHABLE; - strict = ipv6_addr_type(&skb->nh.ipv6h->daddr) & (IPV6_ADDR_MULTICAST|IPV6_ADDR_LINKLOCAL); + strict = ipv6_addr_type(&skb->nh.ipv6h->daddr) & (IPV6_ADDR_MULTICAST|IPV6_ADDR_LINKLOCAL) ? RT6_SELECT_F_IFACE : 0; relookup: read_lock_bh(&rt6_lock); +restart_2: fn = fib6_lookup(&ip6_routing_table, &skb->nh.ipv6h->daddr, &skb->nh.ipv6h->saddr); restart: - rt = fn->leaf; - - if ((rt->rt6i_flags & RTF_CACHE)) { - rt = rt6_device_match(rt, skb->dev->ifindex, strict); - BACKTRACK(); - dst_hold(&rt->u.dst); - goto out; - } - - rt = rt6_device_match(rt, skb->dev->ifindex, strict); + rt = rt6_select(&fn->leaf, skb->dev->ifindex, strict | reachable); BACKTRACK(); + if (rt == &ip6_null_entry || + rt->rt6i_flags & RTF_CACHE) + goto out; - if (!rt->rt6i_nexthop && !(rt->rt6i_flags & RTF_NONEXTHOP)) { - struct rt6_info *nrt; - dst_hold(&rt->u.dst); - read_unlock_bh(&rt6_lock); + dst_hold(&rt->u.dst); + read_unlock_bh(&rt6_lock); - nrt = rt6_cow(rt, &skb->nh.ipv6h->daddr, - &skb->nh.ipv6h->saddr, - &NETLINK_CB(skb)); + if (!rt->rt6i_nexthop && !(rt->rt6i_flags & RTF_NONEXTHOP)) + nrt = rt6_alloc_cow(rt, &skb->nh.ipv6h->daddr, &skb->nh.ipv6h->saddr); + else { +#if CLONE_OFFLINK_ROUTE + nrt = rt6_alloc_clone(rt, &skb->nh.ipv6h->daddr); +#else + goto out2; +#endif + } - dst_release(&rt->u.dst); - rt = nrt; + dst_release(&rt->u.dst); + rt = nrt ? : &ip6_null_entry; - if (rt->u.dst.error != -EEXIST || --attempts <= 0) + dst_hold(&rt->u.dst); + if (nrt) { + err = ip6_ins_rt(nrt, NULL, NULL, &NETLINK_CB(skb)); + if (!err) goto out2; - - /* Race condition! In the gap, when rt6_lock was - released someone could insert this route. Relookup. - */ - dst_release(&rt->u.dst); - goto relookup; } - dst_hold(&rt->u.dst); + + if (--attempts <= 0) + goto out2; + + /* + * Race condition! In the gap, when rt6_lock was + * released someone could insert this route. Relookup. + */ + dst_release(&rt->u.dst); + goto relookup; out: + if (reachable) { + reachable = 0; + goto restart_2; + } + dst_hold(&rt->u.dst); read_unlock_bh(&rt6_lock); out2: rt->u.dst.lastuse = jiffies; rt->u.dst.__use++; skb->dst = (struct dst_entry *) rt; + return; } struct dst_entry * ip6_route_output(struct sock *sk, struct flowi *fl) { struct fib6_node *fn; - struct rt6_info *rt; + struct rt6_info *rt, *nrt; int strict; int attempts = 3; + int err; + int reachable = RT6_SELECT_F_REACHABLE; - strict = ipv6_addr_type(&fl->fl6_dst) & (IPV6_ADDR_MULTICAST|IPV6_ADDR_LINKLOCAL); + strict = ipv6_addr_type(&fl->fl6_dst) & (IPV6_ADDR_MULTICAST|IPV6_ADDR_LINKLOCAL) ? RT6_SELECT_F_IFACE : 0; relookup: read_lock_bh(&rt6_lock); +restart_2: fn = fib6_lookup(&ip6_routing_table, &fl->fl6_dst, &fl->fl6_src); restart: - rt = fn->leaf; - - if ((rt->rt6i_flags & RTF_CACHE)) { - rt = rt6_device_match(rt, fl->oif, strict); - BACKTRACK(); - dst_hold(&rt->u.dst); + rt = rt6_select(&fn->leaf, fl->oif, strict | reachable); + BACKTRACK(); + if (rt == &ip6_null_entry || + rt->rt6i_flags & RTF_CACHE) goto out; - } - if (rt->rt6i_flags & RTF_DEFAULT) { - if (rt->rt6i_metric >= IP6_RT_PRIO_ADDRCONF) - rt = rt6_best_dflt(rt, fl->oif); - } else { - rt = rt6_device_match(rt, fl->oif, strict); - BACKTRACK(); - } - if (!rt->rt6i_nexthop && !(rt->rt6i_flags & RTF_NONEXTHOP)) { - struct rt6_info *nrt; - dst_hold(&rt->u.dst); - read_unlock_bh(&rt6_lock); + dst_hold(&rt->u.dst); + read_unlock_bh(&rt6_lock); - nrt = rt6_cow(rt, &fl->fl6_dst, &fl->fl6_src, NULL); + if (!rt->rt6i_nexthop && !(rt->rt6i_flags & RTF_NONEXTHOP)) + nrt = rt6_alloc_cow(rt, &fl->fl6_dst, &fl->fl6_src); + else { +#if CLONE_OFFLINK_ROUTE + nrt = rt6_alloc_clone(rt, &fl->fl6_dst); +#else + goto out2; +#endif + } - dst_release(&rt->u.dst); - rt = nrt; + dst_release(&rt->u.dst); + rt = nrt ? : &ip6_null_entry; - if (rt->u.dst.error != -EEXIST || --attempts <= 0) + dst_hold(&rt->u.dst); + if (nrt) { + err = ip6_ins_rt(nrt, NULL, NULL, NULL); + if (!err) goto out2; - - /* Race condition! In the gap, when rt6_lock was - released someone could insert this route. Relookup. - */ - dst_release(&rt->u.dst); - goto relookup; } - dst_hold(&rt->u.dst); + + if (--attempts <= 0) + goto out2; + + /* + * Race condition! In the gap, when rt6_lock was + * released someone could insert this route. Relookup. + */ + dst_release(&rt->u.dst); + goto relookup; out: + if (reachable) { + reachable = 0; + goto restart_2; + } + dst_hold(&rt->u.dst); read_unlock_bh(&rt6_lock); out2: rt->u.dst.lastuse = jiffies; @@ -999,8 +1095,6 @@ int ip6_del_rt(struct rt6_info *rt, stru write_lock_bh(&rt6_lock); - rt6_reset_dflt_pointer(NULL); - err = fib6_del(rt, nlh, _rtattr, req); dst_release(&rt->u.dst); @@ -1050,59 +1144,63 @@ static int ip6_route_del(struct in6_rtms void rt6_redirect(struct in6_addr *dest, struct in6_addr *saddr, struct neighbour *neigh, u8 *lladdr, int on_link) { - struct rt6_info *rt, *nrt; - - /* Locate old route to this destination. */ - rt = rt6_lookup(dest, NULL, neigh->dev->ifindex, 1); - - if (rt == NULL) - return; - - if (neigh->dev != rt->rt6i_dev) - goto out; + struct rt6_info *rt, *nrt = NULL; + int strict; + struct fib6_node *fn; /* - * Current route is on-link; redirect is always invalid. - * - * Seems, previous statement is not true. It could - * be node, which looks for us as on-link (f.e. proxy ndisc) - * But then router serving it might decide, that we should - * know truth 8)8) --ANK (980726). + * Get the "current" route for this destination and + * check if the redirect has come from approriate router. + * + * RFC 2461 specifies that redirects should only be + * accepted if they come from the nexthop to the target. + * Due to the way the routes are chosen, this notion + * is a bit fuzzy and one might need to check all possible + * routes. */ - if (!(rt->rt6i_flags&RTF_GATEWAY)) - goto out; + strict = ipv6_addr_type(dest) & (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL); - /* - * RFC 2461 specifies that redirects should only be - * accepted if they come from the nexthop to the target. - * Due to the way default routers are chosen, this notion - * is a bit fuzzy and one might need to check all default - * routers. - */ - if (!ipv6_addr_equal(saddr, &rt->rt6i_gateway)) { - if (rt->rt6i_flags & RTF_DEFAULT) { - struct rt6_info *rt1; - - read_lock(&rt6_lock); - for (rt1 = ip6_routing_table.leaf; rt1; rt1 = rt1->u.next) { - if (ipv6_addr_equal(saddr, &rt1->rt6i_gateway)) { - dst_hold(&rt1->u.dst); - dst_release(&rt->u.dst); - read_unlock(&rt6_lock); - rt = rt1; - goto source_ok; - } - } - read_unlock(&rt6_lock); + read_lock_bh(&rt6_lock); + fn = fib6_lookup(&ip6_routing_table, dest, NULL); +restart: + for (rt = fn->leaf; rt; rt = rt->u.next) { + /* + * Current route is on-link; redirect is always invalid. + * + * Seems, previous statement is not true. It could + * be node, which looks for us as on-link (f.e. proxy ndisc) + * But then router serving it might decide, that we should + * know truth 8)8) --ANK (980726). + */ + if (rt6_check_expired(rt)) + continue; + if (!(rt->rt6i_flags & RTF_GATEWAY)) + continue; + if (neigh->dev != rt->rt6i_dev) + continue; + if (!ipv6_addr_equal(saddr, &rt->rt6i_gateway)) + continue; + break; + } + if (rt) + dst_hold(&rt->u.dst); + else if (strict) { + while ((fn = fn->parent) != NULL) { + if (fn->fn_flags & RTN_ROOT) + break; + if (fn->fn_flags & RTN_RTINFO) + goto restart; } + } + read_unlock_bh(&rt6_lock); + + if (!rt) { if (net_ratelimit()) printk(KERN_DEBUG "rt6_redirect: source isn't a valid nexthop " "for redirect target\n"); - goto out; + return; } -source_ok: - /* * We have finally decided to accept it. */ @@ -1210,38 +1308,27 @@ void rt6_pmtu_discovery(struct in6_addr 1. It is connected route. Action: COW 2. It is gatewayed route or NONEXTHOP route. Action: clone it. */ - if (!rt->rt6i_nexthop && !(rt->rt6i_flags & RTF_NONEXTHOP)) { - nrt = rt6_cow(rt, daddr, saddr, NULL); - if (!nrt->u.dst.error) { - nrt->u.dst.metrics[RTAX_MTU-1] = pmtu; - if (allfrag) - nrt->u.dst.metrics[RTAX_FEATURES-1] |= RTAX_FEATURE_ALLFRAG; - /* According to RFC 1981, detecting PMTU increase shouldn't be - happened within 5 mins, the recommended timer is 10 mins. - Here this route expiration time is set to ip6_rt_mtu_expires - which is 10 mins. After 10 mins the decreased pmtu is expired - and detecting PMTU increase will be automatically happened. - */ - dst_set_expires(&nrt->u.dst, ip6_rt_mtu_expires); - nrt->rt6i_flags |= RTF_DYNAMIC|RTF_EXPIRES; - } - dst_release(&nrt->u.dst); - } else { - nrt = ip6_rt_copy(rt); - if (nrt == NULL) - goto out; - ipv6_addr_copy(&nrt->rt6i_dst.addr, daddr); - nrt->rt6i_dst.plen = 128; - nrt->u.dst.flags |= DST_HOST; - nrt->rt6i_nexthop = neigh_clone(rt->rt6i_nexthop); - dst_set_expires(&nrt->u.dst, ip6_rt_mtu_expires); - nrt->rt6i_flags |= RTF_DYNAMIC|RTF_CACHE|RTF_EXPIRES; + if (!rt->rt6i_nexthop && !(rt->rt6i_flags & RTF_NONEXTHOP)) + nrt = rt6_alloc_cow(rt, daddr, saddr); + else + nrt = rt6_alloc_clone(rt, daddr); + + if (nrt) { nrt->u.dst.metrics[RTAX_MTU-1] = pmtu; if (allfrag) nrt->u.dst.metrics[RTAX_FEATURES-1] |= RTAX_FEATURE_ALLFRAG; + + /* According to RFC 1981, detecting PMTU increase shouldn't be + * happened within 5 mins, the recommended timer is 10 mins. + * Here this route expiration time is set to ip6_rt_mtu_expires + * which is 10 mins. After 10 mins the decreased pmtu is expired + * and detecting PMTU increase will be automatically happened. + */ + dst_set_expires(&nrt->u.dst, ip6_rt_mtu_expires); + nrt->rt6i_flags |= RTF_DYNAMIC|RTF_EXPIRES; + ip6_ins_rt(nrt, NULL, NULL, NULL); } - out: dst_release(&rt->u.dst); } @@ -1280,6 +1367,57 @@ static struct rt6_info * ip6_rt_copy(str return rt; } +#ifdef CONFIG_IPV6_ROUTE_INFO +static struct rt6_info *rt6_get_route_info(struct in6_addr *prefix, int prefixlen, + struct in6_addr *gwaddr, int ifindex) +{ + struct fib6_node *fn; + struct rt6_info *rt = NULL; + + write_lock_bh(&rt6_lock); + fn = fib6_locate(&ip6_routing_table, prefix ,prefixlen, NULL, 0); + if (!fn) + goto out; + + for (rt = fn->leaf; rt; rt = rt->u.next) { + if (rt->rt6i_dev->ifindex != ifindex) + continue; + if ((rt->rt6i_flags & (RTF_ROUTEINFO|RTF_GATEWAY)) != (RTF_ROUTEINFO|RTF_GATEWAY)) + continue; + if (!ipv6_addr_equal(&rt->rt6i_gateway, gwaddr)) + continue; + dst_hold(&rt->u.dst); + break; + } +out: + write_unlock_bh(&rt6_lock); + return rt; +} + +static struct rt6_info *rt6_add_route_info(struct in6_addr *prefix, int prefixlen, + struct in6_addr *gwaddr, int ifindex, + unsigned pref) +{ + struct in6_rtmsg rtmsg; + + memset(&rtmsg, 0, sizeof(rtmsg)); + rtmsg.rtmsg_type = RTMSG_NEWROUTE; + ipv6_addr_copy(&rtmsg.rtmsg_dst, prefix); + rtmsg.rtmsg_dst_len = prefixlen; + ipv6_addr_copy(&rtmsg.rtmsg_gateway, gwaddr); + rtmsg.rtmsg_metric = 1024; + rtmsg.rtmsg_flags = RTF_GATEWAY | RTF_ADDRCONF | RTF_ROUTEINFO | RTF_UP | RTF_PREF(pref); + /* We should treat it as a default route if prefix length is 0. */ + if (!prefixlen) + rtmsg.rtmsg_flags |= RTF_DEFAULT; + rtmsg.rtmsg_ifindex = ifindex; + + ip6_route_add(&rtmsg, NULL, NULL, NULL); + + return rt6_get_route_info(prefix, prefixlen, gwaddr, ifindex); +} +#endif + struct rt6_info *rt6_get_dflt_router(struct in6_addr *addr, struct net_device *dev) { struct rt6_info *rt; @@ -1290,6 +1428,7 @@ struct rt6_info *rt6_get_dflt_router(str write_lock_bh(&rt6_lock); for (rt = fn->leaf; rt; rt=rt->u.next) { if (dev == rt->rt6i_dev && + ((rt->rt6i_flags & (RTF_ADDRCONF | RTF_DEFAULT)) == (RTF_ADDRCONF | RTF_DEFAULT)) && ipv6_addr_equal(&rt->rt6i_gateway, addr)) break; } @@ -1300,7 +1439,8 @@ struct rt6_info *rt6_get_dflt_router(str } struct rt6_info *rt6_add_dflt_router(struct in6_addr *gwaddr, - struct net_device *dev) + struct net_device *dev, + unsigned int pref) { struct in6_rtmsg rtmsg; @@ -1308,7 +1448,8 @@ struct rt6_info *rt6_add_dflt_router(str rtmsg.rtmsg_type = RTMSG_NEWROUTE; ipv6_addr_copy(&rtmsg.rtmsg_gateway, gwaddr); rtmsg.rtmsg_metric = 1024; - rtmsg.rtmsg_flags = RTF_GATEWAY | RTF_ADDRCONF | RTF_DEFAULT | RTF_UP | RTF_EXPIRES; + rtmsg.rtmsg_flags = RTF_GATEWAY | RTF_ADDRCONF | RTF_DEFAULT | RTF_UP | RTF_EXPIRES | + RTF_PREF(pref); rtmsg.rtmsg_ifindex = dev->ifindex; @@ -1326,8 +1467,6 @@ restart: if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ADDRCONF)) { dst_hold(&rt->u.dst); - rt6_reset_dflt_pointer(NULL); - read_unlock_bh(&rt6_lock); ip6_del_rt(rt, NULL, NULL, NULL); diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index ca9cf68..14de503 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -987,6 +987,7 @@ static struct sock * tcp_v6_syn_recv_soc inet_csk(newsk)->icsk_ext_hdr_len = (newnp->opt->opt_nflen + newnp->opt->opt_flen); + tcp_mtup_init(newsk); tcp_sync_mss(newsk, dst_mtu(dst)); newtp->advmss = dst_metric(dst, RTAX_ADVMSS); tcp_initialize_rcv_mss(newsk); diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 0ce337a..ece4e83 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -3,7 +3,7 @@ extension. */ /* (C) 1999-2001 Paul `Rusty' Russell - * (C) 2002-2005 Netfilter Core Team + * (C) 2002-2006 Netfilter Core Team * (C) 2003,2004 USAGI/WIDE Project * * This program is free software; you can redistribute it and/or modify @@ -20,6 +20,9 @@ * - generalize L3 protocol denendent part. * 23 Mar 2004: Yasuyuki Kozakai @USAGI * - add support various size of conntrack structures. + * 26 Jan 2006: Harald Welte + * - restructure nf_conn (introduce nf_conn_help) + * - redesign 'features' how they were originally intended * * Derived from net/ipv4/netfilter/ip_conntrack_core.c */ @@ -55,7 +58,7 @@ #include #include -#define NF_CONNTRACK_VERSION "0.4.1" +#define NF_CONNTRACK_VERSION "0.5.0" #if 0 #define DEBUGP printk @@ -259,21 +262,8 @@ static inline u_int32_t hash_conntrack(c nf_conntrack_hash_rnd); } -/* Initialize "struct nf_conn" which has spaces for helper */ -static int -init_conntrack_for_helper(struct nf_conn *conntrack, u_int32_t features) -{ - - conntrack->help = (union nf_conntrack_help *) - (((unsigned long)conntrack->data - + (__alignof__(union nf_conntrack_help) - 1)) - & (~((unsigned long)(__alignof__(union nf_conntrack_help) -1)))); - return 0; -} - int nf_conntrack_register_cache(u_int32_t features, const char *name, - size_t size, - int (*init)(struct nf_conn *, u_int32_t)) + size_t size) { int ret = 0; char *cache_name; @@ -296,8 +286,7 @@ int nf_conntrack_register_cache(u_int32_ DEBUGP("nf_conntrack_register_cache: already resisterd.\n"); if ((!strncmp(nf_ct_cache[features].name, name, NF_CT_FEATURES_NAMELEN)) - && nf_ct_cache[features].size == size - && nf_ct_cache[features].init_conntrack == init) { + && nf_ct_cache[features].size == size) { DEBUGP("nf_conntrack_register_cache: reusing.\n"); nf_ct_cache[features].use++; ret = 0; @@ -340,7 +329,6 @@ int nf_conntrack_register_cache(u_int32_ write_lock_bh(&nf_ct_cache_lock); nf_ct_cache[features].use = 1; nf_ct_cache[features].size = size; - nf_ct_cache[features].init_conntrack = init; nf_ct_cache[features].cachep = cachep; nf_ct_cache[features].name = cache_name; write_unlock_bh(&nf_ct_cache_lock); @@ -377,7 +365,6 @@ void nf_conntrack_unregister_cache(u_int name = nf_ct_cache[features].name; nf_ct_cache[features].cachep = NULL; nf_ct_cache[features].name = NULL; - nf_ct_cache[features].init_conntrack = NULL; nf_ct_cache[features].size = 0; write_unlock_bh(&nf_ct_cache_lock); @@ -432,11 +419,15 @@ nf_ct_invert_tuple(struct nf_conntrack_t /* nf_conntrack_expect helper functions */ void nf_ct_unlink_expect(struct nf_conntrack_expect *exp) { + struct nf_conn_help *master_help = nfct_help(exp->master); + + NF_CT_ASSERT(master_help); ASSERT_WRITE_LOCK(&nf_conntrack_lock); NF_CT_ASSERT(!timer_pending(&exp->timeout)); + list_del(&exp->list); NF_CT_STAT_INC(expect_delete); - exp->master->expecting--; + master_help->expecting--; nf_conntrack_expect_put(exp); } @@ -508,9 +499,10 @@ find_expectation(const struct nf_conntra void nf_ct_remove_expectations(struct nf_conn *ct) { struct nf_conntrack_expect *i, *tmp; + struct nf_conn_help *help = nfct_help(ct); /* Optimization: most connection never expect any others. */ - if (ct->expecting == 0) + if (!help || help->expecting == 0) return; list_for_each_entry_safe(i, tmp, &nf_conntrack_expect_list, list) { @@ -713,6 +705,7 @@ __nf_conntrack_confirm(struct sk_buff ** conntrack_tuple_cmp, struct nf_conntrack_tuple_hash *, &ct->tuplehash[IP_CT_DIR_REPLY].tuple, NULL)) { + struct nf_conn_help *help; /* Remove from unconfirmed list */ list_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].list); @@ -726,7 +719,8 @@ __nf_conntrack_confirm(struct sk_buff ** set_bit(IPS_CONFIRMED_BIT, &ct->status); NF_CT_STAT_INC(insert); write_unlock_bh(&nf_conntrack_lock); - if (ct->helper) + help = nfct_help(ct); + if (help && help->helper) nf_conntrack_event_cache(IPCT_HELPER, *pskb); #ifdef CONFIG_NF_NAT_NEEDED if (test_bit(IPS_SRC_NAT_DONE_BIT, &ct->status) || @@ -842,8 +836,9 @@ __nf_conntrack_alloc(const struct nf_con { struct nf_conn *conntrack = NULL; u_int32_t features = 0; + struct nf_conntrack_helper *helper; - if (!nf_conntrack_hash_rnd_initted) { + if (unlikely(!nf_conntrack_hash_rnd_initted)) { get_random_bytes(&nf_conntrack_hash_rnd, 4); nf_conntrack_hash_rnd_initted = 1; } @@ -863,8 +858,11 @@ __nf_conntrack_alloc(const struct nf_con /* find features needed by this conntrack. */ features = l3proto->get_features(orig); + + /* FIXME: protect helper list per RCU */ read_lock_bh(&nf_conntrack_lock); - if (__nf_ct_helper_find(repl) != NULL) + helper = __nf_ct_helper_find(repl); + if (helper) features |= NF_CT_F_HELP; read_unlock_bh(&nf_conntrack_lock); @@ -872,7 +870,7 @@ __nf_conntrack_alloc(const struct nf_con read_lock_bh(&nf_ct_cache_lock); - if (!nf_ct_cache[features].use) { + if (unlikely(!nf_ct_cache[features].use)) { DEBUGP("nf_conntrack_alloc: not supported features = 0x%x\n", features); goto out; @@ -886,12 +884,10 @@ __nf_conntrack_alloc(const struct nf_con memset(conntrack, 0, nf_ct_cache[features].size); conntrack->features = features; - if (nf_ct_cache[features].init_conntrack && - nf_ct_cache[features].init_conntrack(conntrack, features) < 0) { - DEBUGP("nf_conntrack_alloc: failed to init\n"); - kmem_cache_free(nf_ct_cache[features].cachep, conntrack); - conntrack = NULL; - goto out; + if (helper) { + struct nf_conn_help *help = nfct_help(conntrack); + NF_CT_ASSERT(help); + help->helper = helper; } atomic_set(&conntrack->ct_general.use, 1); @@ -972,11 +968,8 @@ init_conntrack(const struct nf_conntrack #endif nf_conntrack_get(&conntrack->master->ct_general); NF_CT_STAT_INC(expect_new); - } else { - conntrack->helper = __nf_ct_helper_find(&repl_tuple); - + } else NF_CT_STAT_INC(new); - } /* Overload tuple linked list to put us in unconfirmed list. */ list_add(&conntrack->tuplehash[IP_CT_DIR_ORIGINAL].list, &unconfirmed); @@ -1206,14 +1199,16 @@ void nf_conntrack_expect_put(struct nf_c static void nf_conntrack_expect_insert(struct nf_conntrack_expect *exp) { + struct nf_conn_help *master_help = nfct_help(exp->master); + atomic_inc(&exp->use); - exp->master->expecting++; + master_help->expecting++; list_add(&exp->list, &nf_conntrack_expect_list); init_timer(&exp->timeout); exp->timeout.data = (unsigned long)exp; exp->timeout.function = expectation_timed_out; - exp->timeout.expires = jiffies + exp->master->helper->timeout * HZ; + exp->timeout.expires = jiffies + master_help->helper->timeout * HZ; add_timer(&exp->timeout); exp->id = ++nf_conntrack_expect_next_id; @@ -1239,10 +1234,12 @@ static void evict_oldest_expect(struct n static inline int refresh_timer(struct nf_conntrack_expect *i) { + struct nf_conn_help *master_help = nfct_help(i->master); + if (!del_timer(&i->timeout)) return 0; - i->timeout.expires = jiffies + i->master->helper->timeout*HZ; + i->timeout.expires = jiffies + master_help->helper->timeout*HZ; add_timer(&i->timeout); return 1; } @@ -1251,8 +1248,11 @@ int nf_conntrack_expect_related(struct n { struct nf_conntrack_expect *i; struct nf_conn *master = expect->master; + struct nf_conn_help *master_help = nfct_help(master); int ret; + NF_CT_ASSERT(master_help); + DEBUGP("nf_conntrack_expect_related %p\n", related_to); DEBUGP("tuple: "); NF_CT_DUMP_TUPLE(&expect->tuple); DEBUGP("mask: "); NF_CT_DUMP_TUPLE(&expect->mask); @@ -1271,8 +1271,8 @@ int nf_conntrack_expect_related(struct n } } /* Will be over limit? */ - if (master->helper->max_expected && - master->expecting >= master->helper->max_expected) + if (master_help->helper->max_expected && + master_help->expecting >= master_help->helper->max_expected) evict_oldest_expect(master); nf_conntrack_expect_insert(expect); @@ -1283,24 +1283,6 @@ out: return ret; } -/* Alter reply tuple (maybe alter helper). This is for NAT, and is - implicitly racy: see __nf_conntrack_confirm */ -void nf_conntrack_alter_reply(struct nf_conn *conntrack, - const struct nf_conntrack_tuple *newreply) -{ - write_lock_bh(&nf_conntrack_lock); - /* Should be unconfirmed, so not in hash table yet */ - NF_CT_ASSERT(!nf_ct_is_confirmed(conntrack)); - - DEBUGP("Altering reply tuple of %p to ", conntrack); - NF_CT_DUMP_TUPLE(newreply); - - conntrack->tuplehash[IP_CT_DIR_REPLY].tuple = *newreply; - if (!conntrack->master && conntrack->expecting == 0) - conntrack->helper = __nf_ct_helper_find(newreply); - write_unlock_bh(&nf_conntrack_lock); -} - int nf_conntrack_helper_register(struct nf_conntrack_helper *me) { int ret; @@ -1308,9 +1290,8 @@ int nf_conntrack_helper_register(struct ret = nf_conntrack_register_cache(NF_CT_F_HELP, "nf_conntrack:help", sizeof(struct nf_conn) - + sizeof(union nf_conntrack_help) - + __alignof__(union nf_conntrack_help), - init_conntrack_for_helper); + + sizeof(struct nf_conn_help) + + __alignof__(struct nf_conn_help)); if (ret < 0) { printk(KERN_ERR "nf_conntrack_helper_reigster: Unable to create slab cache for conntracks\n"); return ret; @@ -1338,9 +1319,12 @@ __nf_conntrack_helper_find_byname(const static inline int unhelp(struct nf_conntrack_tuple_hash *i, const struct nf_conntrack_helper *me) { - if (nf_ct_tuplehash_to_ctrack(i)->helper == me) { - nf_conntrack_event(IPCT_HELPER, nf_ct_tuplehash_to_ctrack(i)); - nf_ct_tuplehash_to_ctrack(i)->helper = NULL; + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(i); + struct nf_conn_help *help = nfct_help(ct); + + if (help && help->helper == me) { + nf_conntrack_event(IPCT_HELPER, ct); + help->helper = NULL; } return 0; } @@ -1356,7 +1340,8 @@ void nf_conntrack_helper_unregister(stru /* Get rid of expectations */ list_for_each_entry_safe(exp, tmp, &nf_conntrack_expect_list, list) { - if (exp->master->helper == me && del_timer(&exp->timeout)) { + struct nf_conn_help *help = nfct_help(exp->master); + if (help->helper == me && del_timer(&exp->timeout)) { nf_ct_unlink_expect(exp); nf_conntrack_expect_put(exp); } @@ -1695,7 +1680,7 @@ int __init nf_conntrack_init(void) } ret = nf_conntrack_register_cache(NF_CT_F_BASIC, "nf_conntrack:basic", - sizeof(struct nf_conn), NULL); + sizeof(struct nf_conn)); if (ret < 0) { printk(KERN_ERR "Unable to create nf_conn slab cache\n"); goto err_free_hash; diff --git a/net/netfilter/nf_conntrack_ftp.c b/net/netfilter/nf_conntrack_ftp.c index 6f210f3..cd191b0 100644 --- a/net/netfilter/nf_conntrack_ftp.c +++ b/net/netfilter/nf_conntrack_ftp.c @@ -440,7 +440,7 @@ static int help(struct sk_buff **pskb, u32 seq; int dir = CTINFO2DIR(ctinfo); unsigned int matchlen, matchoff; - struct ip_ct_ftp_master *ct_ftp_info = &ct->help->ct_ftp_info; + struct ip_ct_ftp_master *ct_ftp_info = &nfct_help(ct)->help.ct_ftp_info; struct nf_conntrack_expect *exp; struct nf_conntrack_man cmd = {}; diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 9ff3463..f0d6fc9 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -2,7 +2,7 @@ * protocol helpers and general trouble making from userspace. * * (C) 2001 by Jay Schulist - * (C) 2002-2005 by Harald Welte + * (C) 2002-2006 by Harald Welte * (C) 2003 by Patrick Mchardy * (C) 2005 by Pablo Neira Ayuso * @@ -44,7 +44,7 @@ MODULE_LICENSE("GPL"); -static char __initdata version[] = "0.92"; +static char __initdata version[] = "0.93"; #if 0 #define DEBUGP printk @@ -165,15 +165,16 @@ static inline int ctnetlink_dump_helpinfo(struct sk_buff *skb, const struct nf_conn *ct) { struct nfattr *nest_helper; + const struct nf_conn_help *help = nfct_help(ct); - if (!ct->helper) + if (!help || !help->helper) return 0; nest_helper = NFA_NEST(skb, CTA_HELP); - NFA_PUT(skb, CTA_HELP_NAME, strlen(ct->helper->name), ct->helper->name); + NFA_PUT(skb, CTA_HELP_NAME, strlen(help->helper->name), help->helper->name); - if (ct->helper->to_nfattr) - ct->helper->to_nfattr(skb, ct); + if (help->helper->to_nfattr) + help->helper->to_nfattr(skb, ct); NFA_NEST_END(skb, nest_helper); @@ -903,11 +904,17 @@ static inline int ctnetlink_change_helper(struct nf_conn *ct, struct nfattr *cda[]) { struct nf_conntrack_helper *helper; + struct nf_conn_help *help = nfct_help(ct); char *helpname; int err; DEBUGP("entered %s\n", __FUNCTION__); + if (!help) { + /* FIXME: we need to reallocate and rehash */ + return -EBUSY; + } + /* don't change helper of sibling connections */ if (ct->master) return -EINVAL; @@ -924,18 +931,18 @@ ctnetlink_change_helper(struct nf_conn * return -EINVAL; } - if (ct->helper) { + if (help->helper) { if (!helper) { /* we had a helper before ... */ nf_ct_remove_expectations(ct); - ct->helper = NULL; + help->helper = NULL; } else { /* need to zero data of old helper */ - memset(&ct->help, 0, sizeof(ct->help)); + memset(&help->help, 0, sizeof(help->help)); } } - ct->helper = helper; + help->helper = helper; return 0; } @@ -1050,14 +1057,9 @@ ctnetlink_create_conntrack(struct nfattr ct->mark = ntohl(*(u_int32_t *)NFA_DATA(cda[CTA_MARK-1])); #endif - ct->helper = nf_ct_helper_find_get(rtuple); - add_timer(&ct->timeout); nf_conntrack_hash_insert(ct); - if (ct->helper) - nf_ct_helper_put(ct->helper); - DEBUGP("conntrack with id %u inserted\n", ct->id); return 0; @@ -1417,7 +1419,8 @@ ctnetlink_del_expect(struct sock *ctnl, } list_for_each_entry_safe(exp, tmp, &nf_conntrack_expect_list, list) { - if (exp->master->helper == h + struct nf_conn_help *m_help = nfct_help(exp->master); + if (m_help->helper == h && del_timer(&exp->timeout)) { nf_ct_unlink_expect(exp); nf_conntrack_expect_put(exp); @@ -1452,6 +1455,7 @@ ctnetlink_create_expect(struct nfattr *c struct nf_conntrack_tuple_hash *h = NULL; struct nf_conntrack_expect *exp; struct nf_conn *ct; + struct nf_conn_help *help; int err = 0; DEBUGP("entered %s\n", __FUNCTION__); @@ -1472,8 +1476,9 @@ ctnetlink_create_expect(struct nfattr *c if (!h) return -ENOENT; ct = nf_ct_tuplehash_to_ctrack(h); + help = nfct_help(ct); - if (!ct->helper) { + if (!help || !help->helper) { /* such conntrack hasn't got any helper, abort */ err = -EINVAL; goto out; diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 617599a..290d5a0 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -839,7 +839,6 @@ EXPORT_SYMBOL(nf_conntrack_l3proto_unreg EXPORT_SYMBOL(nf_conntrack_protocol_register); EXPORT_SYMBOL(nf_conntrack_protocol_unregister); EXPORT_SYMBOL(nf_ct_invert_tuplepr); -EXPORT_SYMBOL(nf_conntrack_alter_reply); EXPORT_SYMBOL(nf_conntrack_destroyed); EXPORT_SYMBOL(need_conntrack); EXPORT_SYMBOL(nf_conntrack_helper_register); diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c index 3b3c781..dfb25fd 100644 --- a/net/netfilter/nfnetlink_log.c +++ b/net/netfilter/nfnetlink_log.c @@ -11,6 +11,10 @@ * it under the terms of the GNU General Public License version 2 as * published by the Free Software Foundation. * + * 2006-01-26 Harald Welte + * - Add optional local and global sequence number to detect lost + * events from userspace + * */ #include #include @@ -68,11 +72,14 @@ struct nfulnl_instance { unsigned int nlbufsiz; /* netlink buffer allocation size */ unsigned int qthreshold; /* threshold of the queue */ u_int32_t copy_range; + u_int32_t seq; /* instance-local sequential counter */ u_int16_t group_num; /* number of this queue */ + u_int16_t flags; u_int8_t copy_mode; }; static DEFINE_RWLOCK(instances_lock); +static atomic_t global_seq; #define INSTANCE_BUCKETS 16 static struct hlist_head instance_table[INSTANCE_BUCKETS]; @@ -310,6 +317,16 @@ nfulnl_set_qthresh(struct nfulnl_instanc return 0; } +static int +nfulnl_set_flags(struct nfulnl_instance *inst, u_int16_t flags) +{ + spin_lock_bh(&inst->lock); + inst->flags = ntohs(flags); + spin_unlock_bh(&inst->lock); + + return 0; +} + static struct sk_buff *nfulnl_alloc_skb(unsigned int inst_size, unsigned int pkt_size) { @@ -377,6 +394,8 @@ static void nfulnl_timer(unsigned long d spin_unlock_bh(&inst->lock); } +/* This is an inline function, we don't really care about a long + * list of arguments */ static inline int __build_packet_message(struct nfulnl_instance *inst, const struct sk_buff *skb, @@ -515,6 +534,17 @@ __build_packet_message(struct nfulnl_ins read_unlock_bh(&skb->sk->sk_callback_lock); } + /* local sequence number */ + if (inst->flags & NFULNL_CFG_F_SEQ) { + tmp_uint = htonl(inst->seq++); + NFA_PUT(inst->skb, NFULA_SEQ, sizeof(tmp_uint), &tmp_uint); + } + /* global sequence number */ + if (inst->flags & NFULNL_CFG_F_SEQ_GLOBAL) { + tmp_uint = atomic_inc_return(&global_seq); + NFA_PUT(inst->skb, NFULA_SEQ_GLOBAL, sizeof(tmp_uint), &tmp_uint); + } + if (data_len) { struct nfattr *nfa; int size = NFA_LENGTH(data_len); @@ -607,6 +637,11 @@ nfulnl_log_packet(unsigned int pf, spin_lock_bh(&inst->lock); + if (inst->flags & NFULNL_CFG_F_SEQ) + size += NFA_SPACE(sizeof(u_int32_t)); + if (inst->flags & NFULNL_CFG_F_SEQ_GLOBAL) + size += NFA_SPACE(sizeof(u_int32_t)); + qthreshold = inst->qthreshold; /* per-rule qthreshold overrides per-instance */ if (qthreshold > li->u.ulog.qthreshold) @@ -736,10 +771,14 @@ static const int nfula_min[NFULA_MAX] = [NFULA_TIMESTAMP-1] = sizeof(struct nfulnl_msg_packet_timestamp), [NFULA_IFINDEX_INDEV-1] = sizeof(u_int32_t), [NFULA_IFINDEX_OUTDEV-1]= sizeof(u_int32_t), + [NFULA_IFINDEX_PHYSINDEV-1] = sizeof(u_int32_t), + [NFULA_IFINDEX_PHYSOUTDEV-1] = sizeof(u_int32_t), [NFULA_HWADDR-1] = sizeof(struct nfulnl_msg_packet_hw), [NFULA_PAYLOAD-1] = 0, [NFULA_PREFIX-1] = 0, [NFULA_UID-1] = sizeof(u_int32_t), + [NFULA_SEQ-1] = sizeof(u_int32_t), + [NFULA_SEQ_GLOBAL-1] = sizeof(u_int32_t), }; static const int nfula_cfg_min[NFULA_CFG_MAX] = { @@ -748,6 +787,7 @@ static const int nfula_cfg_min[NFULA_CFG [NFULA_CFG_TIMEOUT-1] = sizeof(u_int32_t), [NFULA_CFG_QTHRESH-1] = sizeof(u_int32_t), [NFULA_CFG_NLBUFSIZ-1] = sizeof(u_int32_t), + [NFULA_CFG_FLAGS-1] = sizeof(u_int16_t), }; static int @@ -859,6 +899,12 @@ nfulnl_recv_config(struct sock *ctnl, st nfulnl_set_qthresh(inst, ntohl(qthresh)); } + if (nfula[NFULA_CFG_FLAGS-1]) { + u_int16_t flags = + *(u_int16_t *)NFA_DATA(nfula[NFULA_CFG_FLAGS-1]); + nfulnl_set_flags(inst, ntohl(flags)); + } + out_put: instance_put(inst); return ret; diff --git a/net/netfilter/xt_helper.c b/net/netfilter/xt_helper.c index 38b6715..c451169 100644 --- a/net/netfilter/xt_helper.c +++ b/net/netfilter/xt_helper.c @@ -96,6 +96,7 @@ match(const struct sk_buff *skb, { const struct xt_helper_info *info = matchinfo; struct nf_conn *ct; + struct nf_conn_help *master_help; enum ip_conntrack_info ctinfo; int ret = info->invert; @@ -111,7 +112,8 @@ match(const struct sk_buff *skb, } read_lock_bh(&nf_conntrack_lock); - if (!ct->master->helper) { + master_help = nfct_help(ct->master); + if (!master_help || !master_help->helper) { DEBUGP("xt_helper: master ct %p has no helper\n", exp->expectant); goto out_unlock; @@ -123,8 +125,8 @@ match(const struct sk_buff *skb, if (info->name[0] == '\0') ret ^= 1; else - ret ^= !strncmp(ct->master->helper->name, info->name, - strlen(ct->master->helper->name)); + ret ^= !strncmp(master_help->helper->name, info->name, + strlen(master_help->helper->name)); out_unlock: read_unlock_bh(&nf_conntrack_lock); return ret; diff --git a/net/socket.c b/net/socket.c index a00851f..3ca31e8 100644 --- a/net/socket.c +++ b/net/socket.c @@ -351,8 +351,8 @@ static struct dentry_operations sockfs_d /* * Obtains the first available file descriptor and sets it up for use. * - * This function creates file structure and maps it to fd space - * of current process. On success it returns file descriptor + * These functions create file structures and maps them to fd space + * of the current process. On success it returns file descriptor * and file struct implicitly stored in sock->file. * Note that another thread may close file descriptor before we return * from this function. We use the fact that now we do not refer @@ -365,52 +365,67 @@ static struct dentry_operations sockfs_d * but we take care of internal coherence yet. */ -int sock_map_fd(struct socket *sock) +static int sock_alloc_fd(struct file **filep) { int fd; - struct qstr this; - char name[32]; - - /* - * Find a file descriptor suitable for return to the user. - */ fd = get_unused_fd(); - if (fd >= 0) { + if (likely(fd >= 0)) { struct file *file = get_empty_filp(); - if (!file) { + *filep = file; + if (unlikely(!file)) { put_unused_fd(fd); - fd = -ENFILE; - goto out; + return -ENFILE; } + } else + *filep = NULL; + return fd; +} - this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino); - this.name = name; - this.hash = SOCK_INODE(sock)->i_ino; - - file->f_dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this); - if (!file->f_dentry) { - put_filp(file); +static int sock_attach_fd(struct socket *sock, struct file *file) +{ + struct qstr this; + char name[32]; + + this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino); + this.name = name; + this.hash = SOCK_INODE(sock)->i_ino; + + file->f_dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this); + if (unlikely(!file->f_dentry)) + return -ENOMEM; + + file->f_dentry->d_op = &sockfs_dentry_operations; + d_add(file->f_dentry, SOCK_INODE(sock)); + file->f_vfsmnt = mntget(sock_mnt); + file->f_mapping = file->f_dentry->d_inode->i_mapping; + + sock->file = file; + file->f_op = SOCK_INODE(sock)->i_fop = &socket_file_ops; + file->f_mode = FMODE_READ | FMODE_WRITE; + file->f_flags = O_RDWR; + file->f_pos = 0; + file->private_data = sock; + + return 0; +} + +int sock_map_fd(struct socket *sock) +{ + struct file *newfile; + int fd = sock_alloc_fd(&newfile); + + if (likely(fd >= 0)) { + int err = sock_attach_fd(sock, newfile); + + if (unlikely(err < 0)) { + put_filp(newfile); put_unused_fd(fd); - fd = -ENOMEM; - goto out; + return err; } - file->f_dentry->d_op = &sockfs_dentry_operations; - d_add(file->f_dentry, SOCK_INODE(sock)); - file->f_vfsmnt = mntget(sock_mnt); - file->f_mapping = file->f_dentry->d_inode->i_mapping; - - sock->file = file; - file->f_op = SOCK_INODE(sock)->i_fop = &socket_file_ops; - file->f_mode = FMODE_READ | FMODE_WRITE; - file->f_flags = O_RDWR; - file->f_pos = 0; - file->private_data = sock; - fd_install(fd, file); + fd_install(fd, newfile); } - -out: return fd; } @@ -1352,7 +1367,8 @@ asmlinkage long sys_listen(int fd, int b asmlinkage long sys_accept(int fd, struct sockaddr __user *upeer_sockaddr, int __user *upeer_addrlen) { struct socket *sock, *newsock; - int err, len; + struct file *newfile; + int err, len, newfd; char address[MAX_SOCK_ADDR]; sock = sockfd_lookup(fd, &err); @@ -1372,28 +1388,38 @@ asmlinkage long sys_accept(int fd, struc */ __module_get(newsock->ops->owner); + newfd = sock_alloc_fd(&newfile); + if (unlikely(newfd < 0)) { + err = newfd; + goto out_release; + } + + err = sock_attach_fd(newsock, newfile); + if (err < 0) + goto out_fd; + err = security_socket_accept(sock, newsock); if (err) - goto out_release; + goto out_fd; err = sock->ops->accept(sock, newsock, sock->file->f_flags); if (err < 0) - goto out_release; + goto out_fd; if (upeer_sockaddr) { if(newsock->ops->getname(newsock, (struct sockaddr *)address, &len, 2)<0) { err = -ECONNABORTED; - goto out_release; + goto out_fd; } err = move_addr_to_user(address, len, upeer_sockaddr, upeer_addrlen); if (err < 0) - goto out_release; + goto out_fd; } /* File flags are not inherited via accept() unlike another OSes. */ - if ((err = sock_map_fd(newsock)) < 0) - goto out_release; + fd_install(newfd, newfile); + err = newfd; security_socket_post_accept(sock, newsock); @@ -1401,6 +1427,9 @@ out_put: sockfd_put(sock); out: return err; +out_fd: + put_filp(newfile); + put_unused_fd(newfd); out_release: sock_release(newsock); goto out_put;