deliverable/linux.git
9 years agorocker: do not delete fdb entries in rocker_port_fdb_flush() when preparing transactions
Simon Horman [Thu, 21 May 2015 03:40:14 +0000 (12:40 +0900)] 
rocker: do not delete fdb entries in rocker_port_fdb_flush() when preparing transactions

rocker_port_fdb_flush() is called by rocker_port_stp_update() which in
turn may be called with trans == SWITCHDEV_TRANS_PREPARE and then
trans == SWITCHDEV_TRANS_COMMIT from switchdev_port_attr_set() via
br_set_state().

When rocker_port_fdb_flush() is called with trans == SWITCHDEV_TRANS_PREPARE
it calls rocker_port_fdb_learn() for each entry in the FDB table which in
turn calls rocker_flow_tbl_bridge() which will allocate memory using
rocker_port_kzalloc(). rocker_port_fdb_learn() will then remove the entry
from the FDB table.

Then when rocker_port_fdb_learn() is called with
trans == SWITCHDEV_TRANS_PREPARE no calls are made to rocker_port_fdb_learn()
because there are no longer any entries present in the FDB table. Thus the
memory previously allocated by rocker_port_fdb_learn() is leaked resulting
in the kernel BUG() below.

Furthermore, it looks like the driver ends up with an incorrect view of the
fdb table as the FDB entries are purged from the driver's table but not the
hardware's table.

ip link add br0 type bridge
ip link set up dev eth0
sleep 1
ip link set dev eth0 master br0
[    3.704360] ------------[ cut here ]------------
[    3.704611] kernel BUG at drivers/net/ethernet/rocker/rocker.c:4289!
[    3.704962] invalid opcode: 0000 [#1] SMP
[    3.705537] Modules linked in:
[    3.705919] CPU: 0 PID: 63 Comm: ip Not tainted 4.1.0-rc3-01046-gb9fbe709de4d #1044
[    3.706191] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org 04/01/2014
[    3.706820] task: ffff880019f70150 ti: ffff88001f92c000 task.ti: ffff88001f92c000
[    3.707138] RIP: 0010:[<ffffffff811f0080>]  [<ffffffff811f0080>] rocker_port_attr_set+0xe0/0xf0
[    3.707990] RSP: 0018:ffff88001f92f808  EFLAGS: 00000212
[    3.708200] RAX: ffff880019d4fa68 RBX: ffff880019d4f000 RCX: 0000000000000000
[    3.708471] RDX: 000000000000000c RSI: ffff88001f92f890 RDI: ffff880019d4f680
[    3.708740] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000004
[    3.708999] R10: ffff880000034024 R11: 0000000000000000 R12: ffff88001f92f890
[    3.709276] R13: ffff88001f8f1c00 R14: 000000000000000b R15: 0000000000000000
[    3.709303] FS:  00007f8ab66bd700(0000) GS:ffff88001b000000(0000) knlGS:0000000000000000
[    3.709303] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.709303] CR2: 0000000000654988 CR3: 000000001f8f3000 CR4: 00000000000006b0
[    3.709303] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    3.709303] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
[    3.709303] Stack:
[    3.709303]  ffff88001f8f1c00 000000000000000b ffff88001f92f890 ffff880019d4f000
[    3.709303]  ffff88001f92f890 ffffffff813332f5 ffff88001f92f880 0000000000000000
[    3.709303]  ffff88001f92f890 0000000000000001 ffff880019d4f000 ffffffff81333627
[    3.709303] Call Trace:
[    3.709303]  [<ffffffff813332f5>] ? __switchdev_port_attr_set+0x25/0x90
[    3.709303]  [<ffffffff81333627>] ? switchdev_port_attr_set+0x27/0x120
[    3.709303]  [<ffffffff81318e86>] ? br_set_state+0x36/0x50
[    3.709303]  [<ffffffff8131795c>] ? br_add_if+0x37c/0x400
[    3.709303]  [<ffffffff81238ce1>] ? do_setlink+0x7e1/0x800
[    3.709303]  [<ffffffff8111f980>] ? radix_tree_lookup_slot+0x10/0x30
[    3.709303]  [<ffffffff81136fba>] ? nla_parse+0xaa/0x110
[    3.709303]  [<ffffffff81239c98>] ? rtnl_newlink+0x548/0x870
[    3.709303]  [<ffffffff8111f900>] ? __radix_tree_lookup+0x40/0xb0
[    3.709303]  [<ffffffff81136f3e>] ? nla_parse+0x2e/0x110
[    3.709303]  [<ffffffff81237d7e>] ? rtnetlink_rcv_msg+0x7e/0x250
[    3.709303]  [<ffffffff8121d1be>] ? __skb_recv_datagram+0xfe/0x4b0
[    3.709303]  [<ffffffff81237d00>] ? rtnetlink_rcv+0x30/0x30
[    3.709303]  [<ffffffff81247948>] ? netlink_rcv_skb+0xa8/0xd0
[    3.709303]  [<ffffffff81237cef>] ? rtnetlink_rcv+0x1f/0x30
[    3.709303]  [<ffffffff81247210>] ? netlink_unicast+0x150/0x200
[    3.709303]  [<ffffffff81247704>] ? netlink_sendmsg+0x374/0x3e0
[    3.709303]  [<ffffffff8120f8cf>] ? sock_sendmsg+0xf/0x30
[    3.709303]  [<ffffffff8120ffc3>] ? ___sys_sendmsg+0x1f3/0x200
[    3.709303]  [<ffffffff812100d5>] ? ___sys_recvmsg+0x105/0x140
[    3.709303]  [<ffffffff812228d9>] ? dev_get_by_name_rcu+0x69/0x90
[    3.709303]  [<ffffffff812228d9>] ? dev_get_by_name_rcu+0x69/0x90
[    3.709303]  [<ffffffff81217b7d>] ? skb_dequeue+0x4d/0x60
[    3.709303]  [<ffffffff81217bb0>] ? skb_queue_purge+0x20/0x30
[    3.709303]  [<ffffffff810ebdcf>] ? __inode_wait_for_writeback+0x5f/0xb0
[    3.709303]  [<ffffffff810648b0>] ? autoremove_wake_function+0x30/0x30
[    3.709303]  [<ffffffff81210ee9>] ? __sys_sendmsg+0x39/0x70
[    3.709303]  [<ffffffff8133e097>] ? system_call_fastpath+0x12/0x6a
[    3.709303] Code: bb 90 06 00 00 48 c7 04 24 00 00 00 00 45 31 c9 45 31 c0 48 c7 c1 c0 b7 1e 81 89 ea e8 da da ff ff eb 95 0f 1f 84 00 00 00 00 00 <0f> 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 83 fe 15 75
[    3.709303] RIP  [<ffffffff811f0080>] rocker_port_attr_set+0xe0/0xf0
[    3.709303]  RSP <ffff88001f92f808>
[    3.721409] ---[ end trace b7481fcb7cb032aa ]---
Segmentation fault

Fixes: c4f20321d968 ("rocker: support prepare-commit transaction model")
Acked-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agospider_net: Use DECLARE_BITMAP
Joe Perches [Wed, 20 May 2015 01:37:55 +0000 (18:37 -0700)] 
spider_net: Use DECLARE_BITMAP

Use the generic mechanism to declare a bitmap instead of unsigned long.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'ebpf-tail-call'
David S. Miller [Thu, 21 May 2015 21:08:00 +0000 (17:08 -0400)] 
Merge branch 'ebpf-tail-call'

Alexei Starovoitov says:

====================
bpf: introduce bpf_tail_call() helper

introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
  ...
  bpf_tail_call(ctx, &jmp_table, index);
  ...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
  ...
  if (jmp_table[index])
    return (*jmp_table[index])(ctx);
  ...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.

Use cases:
- simplify complex programs
- dispatch into other programs
  (for example: index in jump table can be syscall number or network protocol)
- build dynamic chains of programs

The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.

patch 1 - support bpf_tail_call() in interpreter
patch 2 - support in x64 JIT
We've discussed what's neccessary to support it in arm64/s390 JITs
and it looks fine.
patch 3 - sample example for tracing
patch 4 - sample example for networking

More details in every patch.

This set went through several iterations of reviews/fixes and older
attempts can be seen:
https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/log/?h=tail_call_v[123456]
- tail_call_v1 does it without touching JITs but introduces overhead
  for all programs that don't use this helper function.
- tail_call_v2 still has some overhead and x64 JIT does full stack
  unwind (prologue skipping optimization wasn't there)
- tail_call_v3 reuses 'call' instruction encoding and has interpreter
  overhead for every normal call
- tail_call_v4 fixes above architectural shortcomings and v5,v6 fix few
  more bugs

This last tail_call_v6 approach seems to be the best.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agosamples/bpf: bpf_tail_call example for networking
Alexei Starovoitov [Tue, 19 May 2015 23:59:06 +0000 (16:59 -0700)] 
samples/bpf: bpf_tail_call example for networking

Usage:
$ sudo ./sockex3
IP     src.port -> dst.port               bytes      packets
127.0.0.1.42010 -> 127.0.0.1.12865         1568            8
127.0.0.1.59526 -> 127.0.0.1.33778     11422636       173070
127.0.0.1.33778 -> 127.0.0.1.59526  11260224828       341974
127.0.0.1.12865 -> 127.0.0.1.42010         1832           12
IP     src.port -> dst.port               bytes      packets
127.0.0.1.42010 -> 127.0.0.1.12865         1568            8
127.0.0.1.59526 -> 127.0.0.1.33778     23198092       351486
127.0.0.1.33778 -> 127.0.0.1.59526  22972698518       698616
127.0.0.1.12865 -> 127.0.0.1.42010         1832           12

this example is similar to sockex2 in a way that it accumulates per-flow
statistics, but it does packet parsing differently.
sockex2 inlines full packet parser routine into single bpf program.
This sockex3 example have 4 independent programs that parse vlan, mpls, ip, ipv6
and one main program that starts the process.
bpf_tail_call() mechanism allows each program to be small and be called
on demand potentially multiple times, so that many vlan, mpls, ip in ip,
gre encapsulations can be parsed. These and other protocol parsers can
be added or removed at runtime. TLVs can be parsed in similar manner.
Note, tail_call_cnt dynamic check limits the number of tail calls to 32.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agosamples/bpf: bpf_tail_call example for tracing
Alexei Starovoitov [Tue, 19 May 2015 23:59:05 +0000 (16:59 -0700)] 
samples/bpf: bpf_tail_call example for tracing

kprobe example that demonstrates how future seccomp programs may look like.
It attaches to seccomp_phase1() function and tail-calls other BPF programs
depending on syscall number.

Existing optimized classic BPF seccomp programs generated by Chrome look like:
if (sd.nr < 121) {
  if (sd.nr < 57) {
    if (sd.nr < 22) {
      if (sd.nr < 7) {
        if (sd.nr < 4) {
          if (sd.nr < 1) {
            check sys_read
          } else {
            if (sd.nr < 3) {
              check sys_write and sys_open
            } else {
              check sys_close
            }
          }
        } else {
      } else {
    } else {
  } else {
} else {
}

the future seccomp using native eBPF may look like:
  bpf_tail_call(&sd, &syscall_jmp_table, sd.nr);
which is simpler, faster and leaves more room for per-syscall checks.

Usage:
$ sudo ./tracex5
<...>-366   [001] d...     4.870033: : read(fd=1, buf=00007f6d5bebf000, size=771)
<...>-369   [003] d...     4.870066: : mmap
<...>-369   [003] d...     4.870077: : syscall=110 (one of get/set uid/pid/gid)
<...>-369   [003] d...     4.870089: : syscall=107 (one of get/set uid/pid/gid)
   sh-369   [000] d...     4.891740: : read(fd=0, buf=00000000023d1000, size=512)
   sh-369   [000] d...     4.891747: : write(fd=1, buf=00000000023d3000, size=512)
   sh-369   [000] d...     4.891747: : read(fd=1, buf=00000000023d3000, size=512)

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agox86: bpf_jit: implement bpf_tail_call() helper
Alexei Starovoitov [Tue, 19 May 2015 23:59:04 +0000 (16:59 -0700)] 
x86: bpf_jit: implement bpf_tail_call() helper

bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table

In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue, so the callee program reuses the same stack.

The logic can be roughly expressed in C like:

u32 tail_call_cnt;

void *jumptable[2] = { &&label1, &&label2 };

int bpf_prog1(void *ctx)
{
label1:
    ...
}

int bpf_prog2(void *ctx)
{
label2:
    ...
}

int bpf_prog1(void *ctx)
{
    ...
    if (tail_call_cnt++ < MAX_TAIL_CALL_CNT)
        goto *jumptable[index]; ... and pass my 'ctx' to callee ...

    ... fall through if no entry in jumptable ...
}

Note that 'skip current program epilogue and next program prologue' is
an optimization. Other JITs don't have to do it the same way.
>From safety point of view it's valid as well, since programs always
initialize the stack before use, so any residue in the stack left by
the current program is not going be read. The same verifier checks are
done for the calls from the kernel into all bpf programs.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agobpf: allow bpf programs to tail-call other bpf programs
Alexei Starovoitov [Tue, 19 May 2015 23:59:03 +0000 (16:59 -0700)] 
bpf: allow bpf programs to tail-call other bpf programs

introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
  ...
  bpf_tail_call(ctx, &jmp_table, index);
  ...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
  ...
  if (jmp_table[index])
    return (*jmp_table[index])(ctx);
  ...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.

bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table

Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.

New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.

The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.

Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs

- dispatch routine
  For tracing and future seccomp the program may be triggered on all system
  calls, but processing of syscall arguments will be different. It's more
  efficient to implement them as:
  int syscall_entry(struct seccomp_data *ctx)
  {
     bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
     ... default: process unknown syscall ...
  }
  int sys_write_event(struct seccomp_data *ctx) {...}
  int sys_read_event(struct seccomp_data *ctx) {...}
  syscall_jmp_table[__NR_write] = sys_write_event;
  syscall_jmp_table[__NR_read] = sys_read_event;

  For networking the program may call into different parsers depending on
  packet format, like:
  int packet_parser(struct __sk_buff *skb)
  {
     ... parse L2, L3 here ...
     __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
     bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
     ... default: process unknown protocol ...
  }
  int parse_tcp(struct __sk_buff *skb) {...}
  int parse_udp(struct __sk_buff *skb) {...}
  ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
  ipproto_jmp_table[IPPROTO_UDP] = parse_udp;

- for TC use case, bpf_tail_call() allows to implement reclassify-like logic

- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
  are atomic, so user space can build chains of BPF programs on the fly

Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
  It could have been implemented without JIT changes as a wrapper on top of
  BPF_PROG_RUN() macro, but with two downsides:
  . all programs would have to pay performance penalty for this feature and
    tail call itself would be slower, since mandatory stack unwind, return,
    stack allocate would be done for every tailcall.
  . tailcall would be limited to programs running preempt_disabled, since
    generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
    need to be either global per_cpu variable accessed by helper and by wrapper
    or global variable protected by locks.

  In this implementation x64 JIT bypasses stack unwind and jumps into the
  callee program after prologue.

- bpf_prog_array_compatible() ensures that prog_type of callee and caller
  are the same and JITed/non-JITed flag is the same, since calling JITed
  program from non-JITed is invalid, since stack frames are different.
  Similarly calling kprobe type program from socket type program is invalid.

- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
  abstraction, its user space API and all of verifier logic.
  It's in the existing arraymap.c file, since several functions are
  shared with regular array map.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dev: reduce both ingress hook ifdefs
Daniel Borkmann [Tue, 19 May 2015 20:33:25 +0000 (22:33 +0200)] 
net: dev: reduce both ingress hook ifdefs

Reduce ifdef pollution slightly, no functional change. We can simply
remove the extra alternative definition of handle_ing() and nf_ingress().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: add a force_schedule argument to sk_stream_alloc_skb()
Eric Dumazet [Tue, 19 May 2015 20:26:55 +0000 (13:26 -0700)] 
tcp: add a force_schedule argument to sk_stream_alloc_skb()

In commit 8e4d980ac215 ("tcp: fix behavior for epoll edge trigger")
we fixed a possible hang of TCP sockets under memory pressure,
by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
if no packet is in socket write queue.

It turns out there are other cases where we want to force memory
schedule :

tcp_fragment() & tso_fragment() need to split a big TSO packet into
two smaller ones. If we block here because of TCP memory pressure,
we can effectively block TCP socket from sending new data.
If no further ACK is coming, this hang would be definitive, and socket
has no chance to effectively reduce its memory usage.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoneigh: Better handling of transition to NUD_PROBE state
Erik Kline [Mon, 18 May 2015 10:44:41 +0000 (19:44 +0900)] 
neigh: Better handling of transition to NUD_PROBE state

[1] When entering NUD_PROBE state via neigh_update(), perhaps received
    from userspace, correctly (re)initialize the probes count to zero.

    This is useful for forcing revalidation of a neighbor (for example
    if the host is attempting to do DNA [IPv4 4436, IPv6 6059]).

[2] Notify listeners when a neighbor goes into NUD_PROBE state.

    By sending notifications on entry to NUD_PROBE state listeners get
    more timely warnings of imminent connectivity issues.

    The current notifications on entry to NUD_STALE have somewhat
    limited usefulness: NUD_STALE is a perfectly normal state, as is
    NUD_DELAY, whereas notifications on entry to NUD_FAILURE come after
    a neighbor reachability problem has been confirmed (typically after
    three probes).

Signed-off-by: Erik Kline <ek@google.com>
Acked-By: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoip: remove unused function prototype
Andy Zhou [Tue, 19 May 2015 19:41:47 +0000 (12:41 -0700)] 
ip: remove unused function prototype

ip_do_nat() function was removed prior to kernel 3.4. Remove the
unnecessary function prototype as well.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: add rfc3168, section 6.1.1.1. fallback
Daniel Borkmann [Tue, 19 May 2015 19:04:22 +0000 (21:04 +0200)] 
tcp: add rfc3168, section 6.1.1.1. fallback

This work as a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested from the RFC on the first timeout:

  [...] A host that receives no reply to an ECN-setup SYN within the
  normal SYN retransmission timeout interval MAY resend the SYN and
  any subsequent SYN retransmissions with CWR and ECE cleared. [...]

Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
ECN sysctl to allow server-side only ECN"):

 1) Normal ECN-capable path:

    SYN ECE CWR ----->
                <----- SYN ACK ECE
            ACK ----->

 2) Path with broken middlebox, when client has fallback:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
            SYN ----->
                <----- SYN ACK
            ACK ----->

In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)

In any case, it's rather a smaller percentage of sites where there would
occur such additional setup latency: it was found in end of 2014 that ~56%
of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:

  Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
  Gorry Fairhurst, and Richard Scheffenegger:
    "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
    Proc. PAM 2015, New York.
  http://ecn.ethz.ch/ecn-pam15.pdf

Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
allows for disabling the fallback.

tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.

Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave That <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'cxgb4-next'
David S. Miller [Tue, 19 May 2015 20:47:32 +0000 (16:47 -0400)] 
Merge branch 'cxgb4-next'

Hariprasad Shenai says:

====================
cxgb4: Remove dead code and replace byte-oder functions

This series removes dead fn t4_read_edc and t4_read_mc, also replaces
ntoh{s,l} and hton{s,l} calls with the generic byteorder.

PATCH 2/2 was sent as a single PATCH, but had some byte-ordering issues
in t4_read_edc and t4_read_mc function. Found that t4_read_edc and
t4_read_mc is unused, so PATCH 1/2 is added to remove it.

This patch series is created against net-next tree and includes
patches on cxgb4 driver.

We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agocxgb4: replace ntoh{s, l} and hton{s, l} calls with the generic byteorder
Hariprasad Shenai [Tue, 19 May 2015 12:50:44 +0000 (18:20 +0530)] 
cxgb4: replace ntoh{s, l} and hton{s, l} calls with the generic byteorder

replace ntoh{s,l} and hton{s,l} calls with the generic byteorder in
cxgb4/t4_hw.c file

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agocxgb4: Remove dead function t4_read_edc and t4_read_mc
Hariprasad Shenai [Tue, 19 May 2015 12:50:43 +0000 (18:20 +0530)] 
cxgb4: Remove dead function t4_read_edc and t4_read_mc

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge tag 'mac80211-next-for-davem-2015-05-19' of git://git.kernel.org/pub/scm/linux...
David S. Miller [Tue, 19 May 2015 20:43:17 +0000 (16:43 -0400)] 
Merge tag 'mac80211-next-for-davem-2015-05-19' of git://git./linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
This just has a few fixes:
 * LED throughput trigger was crashing
 * fast-xmit wasn't treating QoS changes in IBSS correctly
 * TDLS could use the wrong channel definition
 * using a reserved channel context could use the wrong channel width
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agobe2net: make hwmon interface optional
Arnd Bergmann [Mon, 18 May 2015 21:06:45 +0000 (23:06 +0200)] 
be2net: make hwmon interface optional

The hwmon interface in the be2net driver causes a link error when
be2net is built-in while the hwmon subsystem is a loadable module:

drivers/built-in.o: In function `be_probe':
drivers/net/ethernet/emulex/benet/be_main.c:5761: undefined reference to `devm_hwmon_device_register_with_groups'

This adds a new Kconfig symbol, following the example of multiple
other drivers that have the same problem. The new CONFIG_BE2NET_HWMON
will not be available when (BE2NET=y && HWMON=m) to avoid this
problem.

We have to also mark be_hwmon_show_temp as 'static' to ensure the
compiler can optimize out all the unused code.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 29e9122b3a ("be2net: Export board temperature using hwmon-sysfs interface.")
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: Return error instead of partial read for saved syn headers
Eric B Munson [Mon, 18 May 2015 18:35:58 +0000 (14:35 -0400)] 
tcp: Return error instead of partial read for saved syn headers

Currently the getsockopt() requesting the cached contents of the syn
packet headers will fail silently if the caller uses a buffer that is
too small to contain the requested data.  Rather than fail silently and
discard the headers, getsockopt() should return an error and report the
required size to hold the data.

Signed-off-by: Eric B Munson <emunson@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet-next: ethtool: Added port speed macros.
Parav Pandit [Mon, 18 May 2015 11:01:47 +0000 (16:31 +0530)] 
net-next: ethtool: Added port speed macros.

Signed-off-by: Parav Pandit <parav.pandit@avagotech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'icmp_frag'
David S. Miller [Tue, 19 May 2015 04:15:50 +0000 (00:15 -0400)] 
Merge branch 'icmp_frag'

Andy Zhou says:

====================
fragmentation ICMP

Currently, we send ICMP packets when errors occur during fragmentation or
de-fragmentation.  However, it is a bug when sending those ICMP packets
in the context of using netfilter for bridging.

Those ICMP packets are only expected in the context of routing, not in
bridging mode.

The local stack is not involved in bridging forward decisions, thus
should be not used for deciding the reverse path for those ICMP messages.

This bug only affects IPV4, not in IPv6.

v1->v2:  restructure the patches into two patches that fix defragmentation and
         fragmentation respectively.

 A bit is add in IPCB to control whether ICMP packet should be
 generated for defragmentation.

 Fragmentation ICMP is now removed by restructuring the
 ip_fragment() API.

v2->v3:  Add droping icmp for bridging contrack users
         drop exporting ip_fragment() API.

v3->v4:  Remove unnecessary parentheses in 'return' statements

v4->v5:  Drop the patch that sets and checks a bit in IPCB
         that prevents ip_defrag to send ICMP.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agobridge_netfilter: No ICMP packet on IPv4 fragmentation error
Andy Zhou [Fri, 15 May 2015 21:15:37 +0000 (14:15 -0700)] 
bridge_netfilter: No ICMP packet on IPv4 fragmentation error

When bridge netfilter re-fragments an IP packet for output, all
packets that can not be re-fragmented to their original input size
should be silently discarded.

However, current bridge netfilter output path generates an ICMP packet
with 'size exceeded MTU' message for such packets, this is a bug.

This patch refactors the ip_fragment() API to allow two separate
use cases. The bridge netfilter user case will not
send ICMP, the routing output will, as before.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoIPv4: skip ICMP for bridge contrack users when defrag expires
Andy Zhou [Fri, 15 May 2015 21:15:36 +0000 (14:15 -0700)] 
IPv4: skip ICMP for bridge contrack users when defrag expires

users in [IP_DEFRAG_CONNTRACK_BRIDGE_IN, __IP_DEFRAG_CONNTRACK_BR_IN]
should not ICMP message also.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoipv4: introduce frag_expire_skip_icmp()
Andy Zhou [Fri, 15 May 2015 21:15:35 +0000 (14:15 -0700)] 
ipv4: introduce frag_expire_skip_icmp()

Improve readability of skip ICMP for de-fragmentation expiration logic.
This change will also make the logic easier to maintain when the
following patches in this series are applied.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoselftests/net: expect headroom in psock_fanout rollover
Willem de Bruijn [Mon, 18 May 2015 19:42:11 +0000 (15:42 -0400)] 
selftests/net: expect headroom in psock_fanout rollover

psock_fanout tests the various fanout modes. Change the test for
rollover mode to expect early rollover due to socket pressure
as implemented in 2ccdbaa6d55b ("packet: rollover lock contention
avoidance").

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agosfc: nicer log message on Siena SR-IOV probe fail
Edward Cree [Mon, 18 May 2015 13:18:27 +0000 (14:18 +0100)] 
sfc: nicer log message on Siena SR-IOV probe fail

We expect that MC_CMD_SRIOV will fail if the card has no VFs configured.
So output a readable message instead of a cryptic MCDI error.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
David S. Miller [Mon, 18 May 2015 18:47:36 +0000 (14:47 -0400)] 
Merge git://git./linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next. Briefly
speaking, cleanups and minor fixes for ipset from Jozsef Kadlecsik and
Serget Popovich, more incremental updates to make br_netfilter a better
place from Florian Westphal, ARP support to the x_tables mark match /
target from and context Zhang Chunyu and the addition of context to know
that the x_tables runs through nft_compat. More specifically, they are:

1) Fix sparse warning in ipset/ip_set_hash_ipmark.c when fetching the
   IPSET_ATTR_MARK netlink attribute, from Jozsef Kadlecsik.

2) Rename STREQ macro to STRNCMP in ipset, also from Jozsef.

3) Use skb->network_header to calculate the transport offset in
   ip_set_get_ip{4,6}_port(). From Alexander Drozdov.

4) Reduce memory consumption per element due to size miscalculation,
   this patch and follow up patches from Sergey Popovich.

5) Expand nomatch field from 1 bit to 8 bits to allow to simplify
   mtype_data_reset_flags(), also from Sergey.

6) Small clean for ipset macro trickery.

7) Fix error reporting when both ip_set_get_hostipaddr4() and
   ip_set_get_extensions() from per-set uadt functions.

8) Simplify IPSET_ATTR_PORT netlink attribute validation.

9) Introduce HOST_MASK instead of hardcoded 32 in ipset.

10) Return true/false instead of 0/1 in functions that return boolean
    in the ipset code.

11) Validate maximum length of the IPSET_ATTR_COMMENT netlink attribute.

12) Allow to dereference from ext_*() ipset macros.

13) Get rid of incorrect definitions of HKEY_DATALEN.

14) Include linux/netfilter/ipset/ip_set.h in the x_tables set match.

15) Reduce nf_bridge_info size in br_netfilter, from Florian Westphal.

16) Release nf_bridge_info after POSTROUTING since this is only needed
    from the physdev match, also from Florian.

17) Reduce size of ipset code by deinlining ip_set_put_extensions(),
    from Denys Vlasenko.

18) Oneliner to add ARP support to the x_tables mark match/target, from
    Zhang Chunyu.

19) Add context to know if the x_tables extension runs from nft_compat,
    to address minor problems with three existing extensions.

20) Correct return value in several seqfile *_show() functions in the
    netfilter tree, from Joe Perches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'qeth-next'
David S. Miller [Mon, 18 May 2015 16:14:18 +0000 (12:14 -0400)] 
Merge branch 'qeth-next'

Ursula Braun says:

====================
s390: network patches for net-next

here are s390 related patches for net-next
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agos390/lcs: Fix null-pointer access in msg
Peter Oberparleiter [Mon, 18 May 2015 12:27:59 +0000 (14:27 +0200)] 
s390/lcs: Fix null-pointer access in msg

An attempt to configure a CTC device as LCS results in the
following error message:

  (null): Detecting a network adapter for LCS devices failed
          with rc=-5 (0xfffffffb)

"(null)" results from access to &card->dev->dev in lcs_new_device()
which is only initialized later in the function. Fix this by using
&ccwgdev->dev instead which is initialized before lcs_new_device()
is called.

Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoqeth: replace ENOSYS with EOPNOTSUPP
Eugene Crosser [Mon, 18 May 2015 12:27:58 +0000 (14:27 +0200)] 
qeth: replace ENOSYS with EOPNOTSUPP

Since recently, `checkpatch.pl` advices that ENOSYS should not be
used for anything other than "invalid syscall nr". This patch
replaces ENOSYS return code with EOPNOTSUPP for the "unsupported
function" conditions.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoqeth: BRIDGEPORT "sanity check"
Eugene Crosser [Mon, 18 May 2015 12:27:57 +0000 (14:27 +0200)] 
qeth: BRIDGEPORT "sanity check"

Forbid enabling IFF_PROMISC reflection to BRIDGEPORT when a role
is already assigned, and forbid direct manipulation of the role
when reflection mode is engaged.

Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoqeth: OSA version of SETBRIDGEPORT command
Eugene Crosser [Mon, 18 May 2015 12:27:56 +0000 (14:27 +0200)] 
qeth: OSA version of SETBRIDGEPORT command

OSA Ethernet hardware is introducing BRIDGEPORT functionality
similar (but not identical) to HiperSockets BRIDGEPORT. This
patch makes HiperSockets BRIDGEPORT related sysfs attributes
and udev events work with OSA hardware too.

Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoqeth: IFF_PROMISC flag to BRIDGE PORT mode
Eugene Crosser [Mon, 18 May 2015 12:27:55 +0000 (14:27 +0200)] 
qeth: IFF_PROMISC flag to BRIDGE PORT mode

OSA and HiperSocket devices do not support promiscuous mode proper,
but they support "BRIDGE PORT" mode that is functionally similar.
This update introduces sysfs attribute that, when set, makes the driver
try to "reflect" setting and resetting of the IFF_PROMISC flag on the
interface into setting and resetting PRIMARY or SECONDARY bridge port
role on the underlying OSA or HiperSocket device.

Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoqeth: remove locks from sysfs _show
Eugene Crosser [Mon, 18 May 2015 12:27:54 +0000 (14:27 +0200)] 
qeth: remove locks from sysfs _show

Locking is probably unnecessary in this case, and the rest of the
qeth sysfs code does not use locks in the *_show() functions.
Remove locks from the layer2 *_show() functions in which they where
accidentally introduced.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoqeth: fix handling of IPA return codes
Eugene Crosser [Mon, 18 May 2015 12:27:53 +0000 (14:27 +0200)] 
qeth: fix handling of IPA return codes

Function that executes IPA commands returns the result code from the
IPA response block. If non-negative, it needs to be transformed into
errno-compatible code before returning to the caller.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoqeth: fix rx checksum offload handling
Thomas Richter [Mon, 18 May 2015 12:27:52 +0000 (14:27 +0200)] 
qeth: fix rx checksum offload handling

ethtool is used to change some device driver features
such as RX/TX hardware checksum offloading.
The qeth device driver callback function to
turn on/off RX hardware check sum handling never changes
the hardware configuration.
The NETIF_F_RXCSUM bit is cleared when the feature bitset
type netdev_features_t(64bit) is assigned to 32 a bit
variable.

This patch fixes the NETIF_F_RXCSUM handling.
Also there is no need to manipulate the device's features
bit set as this is done by the caller when no error occurs.

Signed-off-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonetlink: Use random autobind rover
Herbert Xu [Sun, 17 May 2015 02:45:34 +0000 (10:45 +0800)] 
netlink: Use random autobind rover

Currently we use a global rover to select a port ID that is unique.
This used to work consistently when it was protected with a global
lock.  However as we're now lockless, the global rover can exhibit
pathological behaviour should multiple threads all stomp on it at
the same time.

Granted this will eventually resolve itself but the process is
suboptimal.

This patch replaces the global rover with a pseudorandom starting
point to avoid this issue.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonetns: make nsid_lock per net
WANG Cong [Fri, 15 May 2015 21:47:32 +0000 (14:47 -0700)] 
netns: make nsid_lock per net

The spinlock is used to protect netns_ids which is per net,
so there is no need to use a global spinlock.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: bcm_sf2: properly propagate carrier down state for MoCA
Florian Fainelli [Fri, 15 May 2015 19:38:01 +0000 (12:38 -0700)] 
net: dsa: bcm_sf2: properly propagate carrier down state for MoCA

MoCA interfaces require the use of an user-space daemon (mocad) which
will typically use cmd->autoneg to force the link. This is causing other
network manager applications not to get proper carrier down
notifications because of the following sequence of events:

- link down interrupt is received, link is set to 0 by the interrupt
  handler
- fixed_link update callback runs and updates the BMSR register
  accordingly
- PHY library polls the PHY for link status, sees the link is down,
  proceeds with reporting that
- mocad gets notified of the link state and call phy_ethtool_sset()
  with cmd->autoneg set to the link status (0)
- phy_start_aneg() is called at the end of phy_ethtool_sset() and sets
  the PHY state to PHY_FORCING

Just make sure we notify the interface carrier appropriately when we
detect that the link is down in our fixed_link update callback. This is
made local to the bcm_sf2 driver as the PHY library does the right thing
in any case. This is similar to the GENET change introduced in
54d7c01d3ed699cfc213115eaecfe1175cfaff8f ("net: bcmgenet: enable MoCA
link state change detection").

Fixes: 246d7f773c13 ("net: dsa: add Broadcom SF2 switch driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoflow_dissector: remove bogus return in tipc section
Jiri Pirko [Fri, 15 May 2015 11:27:32 +0000 (13:27 +0200)] 
flow_dissector: remove bogus return in tipc section

Fixes: 06635a35d13d42b9 ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agohv_netvsc: change member name of struct netvsc_stats
sixiao@microsoft.com [Fri, 15 May 2015 09:33:03 +0000 (02:33 -0700)] 
hv_netvsc: change member name of struct netvsc_stats

Currently the struct netvsc_stats has a member s_sync
of type u64_stats_sync.
This definition will break kernel build as the macro
netdev_alloc_pcpu_stats requires this member name to be syncp.
(see netdev_alloc_pcpu_stats definition in ./include/linux/netdevice.h)

This patch changes netvsc_stats's member name from s_sync to syncp to fix
the build break.

Signed-off-by: Simon Xiao <sixiao@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoswitchdev: add support for fdb add/del/dump via switchdev_port_obj ops.
Samudrala, Sridhar [Thu, 14 May 2015 04:55:43 +0000 (21:55 -0700)] 
switchdev: add support for fdb add/del/dump via switchdev_port_obj ops.

- introduce port fdb obj and generic switchdev_port_fdb_add/del/dump()
- use switchdev_port_fdb_add/del/dump in rocker/team/bonding ndo ops.
- add support for fdb obj in switchdev_port_obj_add/del/dump()
- switch rocker to implement fdb ops via switchdev_ops

v3: updated to sync with named union changes.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'tcp_mem_pressure'
David S. Miller [Mon, 18 May 2015 02:45:49 +0000 (22:45 -0400)] 
Merge branch 'tcp_mem_pressure'

Eric Dumazet says:

====================
tcp: better handling of memory pressure

When testing commit 790ba4566c1a ("tcp: set SOCK_NOSPACE under memory
pressure") using edge triggered epoll applications, I found various
issues under memory pressure and thousands of active sockets.

This patch series is a first round to solve these issues, in send
and receive paths. There are probably other fixes needed, but
with this series, my tests now all succeed.

v2: fix typo in "allow one skb to be received per socket under memory pressure",
as spotted by Jason Baron.
====================

Acked-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: halves tcp_mem[] limits
Eric Dumazet [Fri, 15 May 2015 19:39:30 +0000 (12:39 -0700)] 
tcp: halves tcp_mem[] limits

Allowing tcp to use ~19% of physical memory is way too much,
and allowed bugs to be hidden. Add to this that some drivers use a full
page per incoming frame, so real cost can be twice the advertized one.

Reduce tcp_mem by 50 % as a first step to sanity.

tcp_mem[0,1,2] defaults are now 4.68%, 6.25%, 9.37% of physical memory.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: allow one skb to be received per socket under memory pressure
Eric Dumazet [Fri, 15 May 2015 19:39:29 +0000 (12:39 -0700)] 
tcp: allow one skb to be received per socket under memory pressure

While testing tight tcp_mem settings, I found tcp sessions could be
stuck because we do not allow even one skb to be received on them.

By allowing one skb to be received, we introduce fairness and
eventuallu force memory hogs to release their allocation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: fix behavior for epoll edge trigger
Eric Dumazet [Fri, 15 May 2015 19:39:28 +0000 (12:39 -0700)] 
tcp: fix behavior for epoll edge trigger

Under memory pressure, tcp_sendmsg() can fail to queue a packet
while no packet is present in write queue. If we return -EAGAIN
with no packet in write queue, no ACK packet will ever come
to raise EPOLLOUT.

We need to allow one skb per TCP socket, and make sure that
tcp sockets can release their forward allocations under pressure.

This is a followup to commit 790ba4566c1a ("tcp: set SOCK_NOSPACE
under memory pressure")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: introduce tcp_under_memory_pressure()
Eric Dumazet [Fri, 15 May 2015 19:39:27 +0000 (12:39 -0700)] 
tcp: introduce tcp_under_memory_pressure()

Introduce an optimized version of sk_under_memory_pressure()
for TCP. Our intent is to use it in fast paths.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: rename sk_forced_wmem_schedule() to sk_forced_mem_schedule()
Eric Dumazet [Fri, 15 May 2015 19:39:26 +0000 (12:39 -0700)] 
tcp: rename sk_forced_wmem_schedule() to sk_forced_mem_schedule()

We plan to use sk_forced_wmem_schedule() in input path as well,
so make it non static and rename it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: fix sk_mem_reclaim_partial()
Eric Dumazet [Fri, 15 May 2015 19:39:25 +0000 (12:39 -0700)] 
net: fix sk_mem_reclaim_partial()

sk_mem_reclaim_partial() goal is to ensure each socket has
one SK_MEM_QUANTUM forward allocation. This is needed both for
performance and better handling of memory pressure situations in
follow up patches.

SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes
as some arches have 64KB pages.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet-packet: fix null pointer exception in rollover mode
Willem de Bruijn [Sun, 17 May 2015 23:44:02 +0000 (19:44 -0400)] 
net-packet: fix null pointer exception in rollover mode

Rollover can be enabled as flag or mode. Allocate state in both cases.
This solves a NULL pointer exception in fanout_demux_rollover on
referencing po->rollover if using mode rollover.

Also make sure that in rollover mode each silo is tried (contrary
to rollover flag, where the main socket is excluded after an initial
try_self).

Tested:
  Passes tools/testing/net/psock_fanout.c, which tests both modes and
  flag. My previous tests were limited to bench_rollover, which only
  stresses the flag. The test now completes safely. it still gives an
  error for mode rollover, because it does not expect the new headroom
  (ROOM_NORMAL) requirement. I will send a separate patch to the test.

Fixes: 0648ab70afe6 ("packet: rollover prepare: per-socket state")
Signed-off-by: Willem de Bruijn <willemb@google.com>
----

I should have run this test and caught this before submission, of
course. Apologies for the oversight.
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: fix two sparse errors
Eric Dumazet [Fri, 15 May 2015 12:48:07 +0000 (05:48 -0700)] 
net: fix two sparse errors

First one in __skb_checksum_validate_complete() fixes the following
(and other callers)

make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/tcp_ipv4.o
  CHECK   net/ipv4/tcp_ipv4.c
include/linux/skbuff.h:3052:24: warning: incorrect type in return expression (different base types)
include/linux/skbuff.h:3052:24:    expected restricted __sum16
include/linux/skbuff.h:3052:24:    got int

Second is fixing gso_make_checksum() :

  CHECK   net/ipv4/gre_offload.c
include/linux/skbuff.h:3360:14: warning: incorrect type in assignment (different base types)
include/linux/skbuff.h:3360:14:    expected unsigned short [unsigned] [usertype] csum
include/linux/skbuff.h:3360:14:    got restricted __sum16
include/linux/skbuff.h:3365:16: warning: incorrect type in return expression (different base types)
include/linux/skbuff.h:3365:16:    expected restricted __sum16
include/linux/skbuff.h:3365:16:    got unsigned short [unsigned] [usertype] csum

Fixes: 5a21232983aa7 ("net: Support for csum_bad in skbuff")
Fixes: 7e2b10c1e52ca ("net: Support for multiple checksums with gso")
Signed-off-by: Eric Dumazet <edumazet@google.com>
CC: Tom Herbert <tom@herbertland.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonetfilter: synproxy: fix sparse errors
Eric Dumazet [Fri, 15 May 2015 16:07:31 +0000 (09:07 -0700)] 
netfilter: synproxy: fix sparse errors

Fix verbose sparse errors :

make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/netfilter/ipt_SYNPROXY.o

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoipip: fix one sparse error
Eric Dumazet [Fri, 15 May 2015 15:58:45 +0000 (08:58 -0700)] 
ipip: fix one sparse error

make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/ipip.o
  CHECK   net/ipv4/ipip.c
net/ipv4/ipip.c:254:27: warning: incorrect type in assignment (different base types)
net/ipv4/ipip.c:254:27:    expected restricted __be32 [addressable] [usertype] o_key
net/ipv4/ipip.c:254:27:    got restricted __be16 [addressable] [usertype] i_flags

Fixes: 3b7b514f44bf ("ipip: fix a regression in ioctl")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: fix sparse error in csum_replace4()
Eric Dumazet [Fri, 15 May 2015 15:52:19 +0000 (08:52 -0700)] 
net: fix sparse error in csum_replace4()

make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/netfilter/nf_nat_l3proto_ipv4.o
  CHECK   net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
include/net/checksum.h:125:64: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:64:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:64:    got restricted __be32 [usertype] from
include/net/checksum.h:125:71: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:71:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:71:    got restricted __be32 [usertype] to
include/net/checksum.h:125:64: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:64:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:64:    got restricted __be32 [usertype] from
include/net/checksum.h:125:71: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:71:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:71:    got restricted __be32 [usertype] to

Fixes: 4565af0d406b ("net: optimise csum_replace4()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonetfilter: Use correct return for seq_show functions
Joe Perches [Wed, 13 May 2015 01:28:23 +0000 (18:28 -0700)] 
netfilter: Use correct return for seq_show functions

Using seq_has_overflowed doesn't produce the right return value.
Either 0 or -1 is, but 0 is much more common and works well when
seq allocation retries.

I believe this doesn't matter as the initial allocation is always
sufficient, this is just a correctness patch.

Miscellanea:

o Don't use strlen, use *ptr to determine if a string
  should be emitted like all the other tests here
o Delete unnecessary return statements

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
9 years agonet: phy: Add state machine state transitions debug prints
Florian Fainelli [Sat, 16 May 2015 17:17:56 +0000 (10:17 -0700)] 
net: phy: Add state machine state transitions debug prints

It can be useful to debug the PHY state machine, add dynamic debug
prints of the old and new PHY devices state under a friendly format.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agorocker: fix a neigh entry leak issue
Ying Xue [Fri, 15 May 2015 04:53:21 +0000 (12:53 +0800)] 
rocker: fix a neigh entry leak issue

Once we get a neighbour through looking up arp cache or creating a
new one in rocker_port_ipv4_resolve(), the neighbour's refcount is
already taken. But as we don't put the refcount again after it's
used, this makes the neighbour entry leaked.

Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'amd-xgbe-next'
David S. Miller [Fri, 15 May 2015 19:21:44 +0000 (15:21 -0400)] 
Merge branch 'amd-xgbe-next'

Tom Lendacky says:

====================
amd-xgbe: AMD XGBE driver updates 2015-05-12

The following series of patches includes functional updates and changes
to the driver.

- Add additional statistics to be collected and reported
- Use the netif_* functions for issuing some debug and informational
  driver messages
- Rx path SKB allocation cleanup/simplification
- Remove stand-alone phylib driver and incorporate function into the nic
  driver
- Simplify device tree support while maintaining backwards compatibility
- Fix the flow control negotiation logic to properly configure flow
  control
- Remove the checking and setting of the device dma_mask field

This patch series is based on net-next.

Changes in v2:
- Change from using the netif_msg_*/netdev_* combination for issuing
  messages to the more concise netif_*
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoamd-xgbe: Remove manual check and set of dma_mask pointer
Lendacky, Thomas [Thu, 14 May 2015 16:44:33 +0000 (11:44 -0500)] 
amd-xgbe: Remove manual check and set of dma_mask pointer

The underlying device support will set the device dma_mask pointer
if DMA is set up properly for the device.  Remove the check for and
assignment of dma_mask when it is null. Instead, just error out if
the dma_set_mask_and_coherent function fails because dma_mask is null.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoamd-xgbe: Fix flow control setting logic
Lendacky, Thomas [Thu, 14 May 2015 16:44:27 +0000 (11:44 -0500)] 
amd-xgbe: Fix flow control setting logic

The flow control negotiation logic is flawed and does not properly
advertise and process auto-negotiation of the flow control settings.
Update the flow control support to properly set the flow control
auto-negotiation settings and process the results approrpriately.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoamd-xgbe: Support defining PHY resources in ETH device node
Lendacky, Thomas [Thu, 14 May 2015 16:44:21 +0000 (11:44 -0500)] 
amd-xgbe: Support defining PHY resources in ETH device node

Simplify the device tree support of the amd-xgbe driver by defining
the PHY-related resources within the ethernet device node. The support
provides backwards compatibility with the original way.

Update the driver version to 1.0.2.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoamd-xgbe: Move the PHY support into amd-xgbe
Lendacky, Thomas [Thu, 14 May 2015 16:44:15 +0000 (11:44 -0500)] 
amd-xgbe: Move the PHY support into amd-xgbe

The AMD XGBE device is intended to work with a specific integrated PHY
and that PHY is not meant to be a standalone PHY for use by other
devices. As such this patch removes the phylib driver and implements
the PHY support in the amd-xgbe driver (the majority of the logic from
the phylib driver is moved into the amd-xgbe driver).

Update the driver version to 1.0.1.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoamd-xgbe: Rework the Rx path SKB allocation
Lendacky, Thomas [Thu, 14 May 2015 16:44:09 +0000 (11:44 -0500)] 
amd-xgbe: Rework the Rx path SKB allocation

Rework the SKB allocation so that all of the buffers of the first
descriptor are handled in the SKB allocation routine. After copying the
data in the header buffer (which can be just the header if split header
processing succeeded for header plus data if split header processing did
not succeed) into the SKB, check for remaining data in the receive
buffer. If there is data remaining in the receive buffer, add that as a
frag to the SKB. Once an SKB has been allocated, all other descriptors
are added as frags to the SKB.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoamd-xgbe: Add netif_* message support to the driver
Lendacky, Thomas [Thu, 14 May 2015 16:44:03 +0000 (11:44 -0500)] 
amd-xgbe: Add netif_* message support to the driver

Add support for the network interface message level settings for
determining whether to issue some of the driver messages. Make
use of the netif_* interface where appropriate.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoamd-xgbe: Add additional stats to be reported via ethtool
Lendacky, Thomas [Thu, 14 May 2015 16:43:57 +0000 (11:43 -0500)] 
amd-xgbe: Add additional stats to be reported via ethtool

Add additional/extended statistics beyond what is provided by the
hardware to be reported via ethtool. The new stats focus on the
calls into ndo_start_xmit and the napi_poll routine.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonetfilter: x_tables: add context to know if extension runs from nft_compat
Pablo Neira Ayuso [Thu, 14 May 2015 12:57:23 +0000 (14:57 +0200)] 
netfilter: x_tables: add context to know if extension runs from nft_compat

Currently, we have four xtables extensions that cannot be used from the
xt over nft compat layer. The problem is that they need real access to
the full blown xt_entry to validate that the rule comes with the right
dependencies. This check was introduced to overcome the lack of
sufficient userspace dependency validation in iptables.

To resolve this problem, this patch introduces a new field to the
xt_tgchk_param structure that tell us if the extension is run from
nft_compat context.

The three affected extensions are:

1) CLUSTERIP, this target has been superseded by xt_cluster. So just
   bail out by returning -EINVAL.

2) TCPMSS. Relax the checking when used from nft_compat. If used with
   the wrong configuration, it will corrupt !syn packets by adding TCP
   MSS option.

3) ebt_stp. Relax the check to make sure it uses the reserved
   destination MAC address for STP.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
9 years agoMerge branch 'stmmac-platform-glue'
David S. Miller [Fri, 15 May 2015 16:44:23 +0000 (12:44 -0400)] 
Merge branch 'stmmac-platform-glue'

Joachim Eastwood says:

====================
convert stmmac glue layers into platform drivers

This patch set aims to convert the current dwmac glue layers into
proper platform drivers as request by Arnd[1]. These changes start
from patch 3 and onwards.

Overview:
Platform driver functions like probe and remove are exported from
the stmmac platform and then used in subsequent glue later
conversions. The conversion involes adding the platform driver
boiler plate code and adding it to the build system. The last patch
removes the driver from the stmmac platform code thus making it into
a library for common platform driver functions.

The two first patches adds glue layer for my platform. I chose to
first add old style glue layer and then convert it. The churn this
creates is just 3 lines.

I would be very nice if people could test this patch set on their
respective platform. My testing has been limited to compiling and
testing on my (LPC18xx) platform. Thanks!

Next I will look into cleaning up the stmmac platform code.

[1] http://marc.info/?l=linux-arm-kernel&m=143059524606459&w=2
====================

Tested-by: Chen-Yu Tsai <wens@csie.org>
Tested-by: Dinh Nguyen <dinguyen@opensource.altera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agostmmac: drop driver from stmmac platform code
Joachim Eastwood [Thu, 14 May 2015 10:11:06 +0000 (12:11 +0200)] 
stmmac: drop driver from stmmac platform code

The dwmac-generic replaces the driver inside the stmmac
platform code. This turns stmmac platform into a library
used by drivers for common platform driver functions.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agostmmac: convert dwmac-sunxi to platform driver
Joachim Eastwood [Thu, 14 May 2015 10:11:05 +0000 (12:11 +0200)] 
stmmac: convert dwmac-sunxi to platform driver

Convert platform glue layer into a proper platform
driver and add it to the build system.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agostmmac: convert dwmac-sti to platform driver
Joachim Eastwood [Thu, 14 May 2015 10:11:04 +0000 (12:11 +0200)] 
stmmac: convert dwmac-sti to platform driver

Convert platform glue layer into a proper platform
driver and add it to the build system.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agostmmac: convert dwmac-socfpga to platform driver
Joachim Eastwood [Thu, 14 May 2015 10:11:03 +0000 (12:11 +0200)] 
stmmac: convert dwmac-socfpga to platform driver

Convert platform glue layer into a proper platform
driver and add it to the build system.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agostmmac: convert dwmac-rk to platform driver
Joachim Eastwood [Thu, 14 May 2015 10:11:02 +0000 (12:11 +0200)] 
stmmac: convert dwmac-rk to platform driver

Convert platform glue layer into a proper platform
driver and add it to the build system.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agostmmac: convert dwmac-meson to platform driver
Joachim Eastwood [Thu, 14 May 2015 10:11:01 +0000 (12:11 +0200)] 
stmmac: convert dwmac-meson to platform driver

Convert platform glue layer into a proper platform
driver and add it to the build system.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agostmmac: convert dwmac-lpc18xx to a platform driver
Joachim Eastwood [Thu, 14 May 2015 10:11:00 +0000 (12:11 +0200)] 
stmmac: convert dwmac-lpc18xx to a platform driver

Convert platform glue layer into a proper platform
driver and add it to the build system.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agostmmac: add a generic dwmac driver
Joachim Eastwood [Thu, 14 May 2015 10:10:59 +0000 (12:10 +0200)] 
stmmac: add a generic dwmac driver

Create a new driver around the generic device tree match strings
in the stmmac platform code. This driver is intended to be used
by all platforms that doesn't require any platform specific code
to function or is using platform data.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agostmmac: prepare stmmac platform to support stand alone drivers
Joachim Eastwood [Thu, 14 May 2015 10:10:58 +0000 (12:10 +0200)] 
stmmac: prepare stmmac platform to support stand alone drivers

Prepare the stmmac platform code to support standalone drivers
by exporting the need functions and having of_match_device use
the match table reference already present in the driver struct.

This will allow us to reuse the platform driver functions from
this code easily in other stand alone platform drivers.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agodoc: dt: add documentation for nxp,lpc1850-dwmac
Joachim Eastwood [Thu, 14 May 2015 10:10:57 +0000 (12:10 +0200)] 
doc: dt: add documentation for nxp,lpc1850-dwmac

Add device tree binding documentation for nxp,lpc1850-dwmac.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agostmmac: add dwmac glue for NXP 18xx/43xx family
Joachim Eastwood [Thu, 14 May 2015 10:10:56 +0000 (12:10 +0200)] 
stmmac: add dwmac glue for NXP 18xx/43xx family

Add support for Ethernet on NXP LPC18xx and LPC43xx using the
dwmac driver. This glue is required to setup phy interface
mode, MII or RMII, on the SoC.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agohv_netvsc: use per_cpu stats to calculate TX/RX data
sixiao@microsoft.com [Thu, 14 May 2015 08:00:25 +0000 (01:00 -0700)] 
hv_netvsc: use per_cpu stats to calculate TX/RX data

Current code does not lock anything when calculating the TX and RX stats.
As a result, the RX and TX data reported by ifconfig are not accuracy in a
system with high network throughput and multiple CPUs (in my test,
RX/TX = 83% between 2 HyperV VM nodes which have 8 vCPUs and 40G Ethernet).

This patch fixed the above issue by using per_cpu stats.
netvsc_get_stats64() summarizes TX and RX data by iterating over all CPUs
to get their respective stats.

This v2 patch addressed David's comments on the cleanup path when
netdev_alloc_pcpu_stats() failed.

Signed-off-by: Simon Xiao <sixiao@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotest_bpf: fix sparse warnings
Michael Holzheu [Thu, 14 May 2015 03:40:39 +0000 (20:40 -0700)] 
test_bpf: fix sparse warnings

Fix several sparse warnings like:
lib/test_bpf.c:1824:25: sparse: constant 4294967295 is so big it is long
lib/test_bpf.c:1878:25: sparse: constant 0x0000ffffffff0000 is so big it is long

Fixes: cffc642d93f9 ("test_bpf: add 173 new testcases for eBPF")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: core: set qdisc pkt len before tc_classify
Florian Westphal [Wed, 13 May 2015 22:36:28 +0000 (00:36 +0200)] 
net: core: set qdisc pkt len before tc_classify

commit d2788d34885d4ce5ba ("net: sched: further simplify handle_ing")
removed the call to qdisc_enqueue_root().

However, after this removal we no longer set qdisc pkt length.
This breaks traffic policing on ingress.

This is the minimum fix: set qdisc pkt length before tc_classify.

Only setting the length does remove support for 'stab' on ingress, but
as Alexei pointed out:
 "Though it was allowed to add qdisc_size_table to ingress, it's useless.
  Nothing takes advantage of recomputed qdisc_pkt_len".

Jamal suggested to use qdisc_pkt_len_init(), but as Eric mentioned that
would result in qdisc_pkt_len_init to no longer get inlined due to the
additional 2nd call site.

ingress policing is rare and GRO doesn't really work that well with police
on ingress, as we see packets > mtu and drop skbs that  -- without
aggregation -- would still have fitted the policier budget.
Thus to have reliable/smooth ingress policing GRO has to be turned off.

Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Fixes: d2788d34885d ("net: sched: further simplify handle_ing")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonetns: fix unbalanced spin_lock on error
Nicolas Dichtel [Wed, 13 May 2015 11:43:09 +0000 (13:43 +0200)] 
netns: fix unbalanced spin_lock on error

Unlock was missing on error path.

Fixes: 95f38411df05 ("netns: use a spin_lock to protect nsid management")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agomdio-gpio: Propagate mii_bus.phy_ignore_ta_mask
Bert Vermeulen [Wed, 13 May 2015 11:35:39 +0000 (13:35 +0200)] 
mdio-gpio: Propagate mii_bus.phy_ignore_ta_mask

This also changes mii_bus.phy_mask to u32 for consistency.

Signed-off-by: Bert Vermeulen <bert@biot.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotest_bpf: add tests related to BPF_MAXINSNS
Daniel Borkmann [Wed, 13 May 2015 11:12:43 +0000 (13:12 +0200)] 
test_bpf: add tests related to BPF_MAXINSNS

Couple of torture test cases related to the bug fixed in 0b59d8806a31
("ARM: net: delegate filter to kernel interpreter when imm_offset()
return value can't fit into 12bits.").

I've added a helper to allocate and fill the insn space. Output on
x86_64 from my laptop:

test_bpf: #233 BPF_MAXINSNS: Maximum possible literals jited:0 7 PASS
test_bpf: #234 BPF_MAXINSNS: Single literal jited:0 8 PASS
test_bpf: #235 BPF_MAXINSNS: Run/add until end jited:0 11553 PASS
test_bpf: #236 BPF_MAXINSNS: Too many instructions PASS
test_bpf: #237 BPF_MAXINSNS: Very long jump jited:0 9 PASS
test_bpf: #238 BPF_MAXINSNS: Ctx heavy transformations jited:0 20329 20398 PASS
test_bpf: #239 BPF_MAXINSNS: Call heavy transformations jited:0 32178 32475 PASS
test_bpf: #240 BPF_MAXINSNS: Jump heavy test jited:0 10518 PASS

test_bpf: #233 BPF_MAXINSNS: Maximum possible literals jited:1 4 PASS
test_bpf: #234 BPF_MAXINSNS: Single literal jited:1 4 PASS
test_bpf: #235 BPF_MAXINSNS: Run/add until end jited:1 1625 PASS
test_bpf: #236 BPF_MAXINSNS: Too many instructions PASS
test_bpf: #237 BPF_MAXINSNS: Very long jump jited:1 8 PASS
test_bpf: #238 BPF_MAXINSNS: Ctx heavy transformations jited:1 3301 3174 PASS
test_bpf: #239 BPF_MAXINSNS: Call heavy transformations jited:1 24107 23491 PASS
test_bpf: #240 BPF_MAXINSNS: Jump heavy test jited:1 8651 PASS

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Nicolas Schichan <nschichan@freebox.fr>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: syncookies: extend validity range
Eric Dumazet [Thu, 14 May 2015 21:26:56 +0000 (14:26 -0700)] 
tcp: syncookies: extend validity range

Now we allow storing more request socks per listener, we might
hit syncookie mode less often and hit following bug in our stack :

When we send a burst of syncookies, then exit this mode,
tcp_synq_no_recent_overflow() can return false if the ACK packets coming
from clients are coming three seconds after the end of syncookie
episode.

This is a way too strong requirement and conflicts with rest of
syncookie code which allows ACK to be aged up to 2 minutes.

Perfectly valid ACK packets are dropped just because clients might be
in a crowded wifi environment or on another planet.

So let's fix this, and also change tcp_synq_overflow() to not
dirty a cache line for every syncookie we send, as we are under attack.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoip_tunnel: Report Rx dropped in ip_tunnel_get_stats64
Alexander Duyck [Thu, 14 May 2015 21:31:28 +0000 (14:31 -0700)] 
ip_tunnel: Report Rx dropped in ip_tunnel_get_stats64

The rx_dropped stat wasn't being reported when ip_tunnel_get_stats64 was
called.  This was leading to some confusing results in my debug as I was
seeing rx_errors increment but no other value which pointed me toward the
type of error being seen.

This change corrects that by using netdev_stats_to_stats64 to copy all
available dev stats instead of just the few that were hand picked.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agopacket: fix warnings in rollover lock contention
Willem de Bruijn [Thu, 14 May 2015 19:25:02 +0000 (15:25 -0400)] 
packet: fix warnings in rollover lock contention

Avoid two xchg calls whose return values were unused, causing a
warning on some architectures.

The relevant variable is a hint and read without mutual exclusion.
This fix makes all writers hold the receive_queue lock.

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: batch of last_rx update avoidance in ethernet drivers.
françois romieu [Thu, 14 May 2015 18:17:22 +0000 (20:17 +0200)] 
net: batch of last_rx update avoidance in ethernet drivers.

None of those drivers uses last_rx for its own needs.

See 4dc89133f49b8cfd77ba7e83f5960aed63aaa99e ("net: add a comment on
netdev->last_rx") for reference.

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Zhangfei Gao <zhangfei.gao@linaro.org>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Wingman Kwok <w-kwok2@ti.com>
Cc: Murali Karicheri <m-karicheri2@ti.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'phy_turn_around'
David S. Miller [Thu, 14 May 2015 17:40:55 +0000 (13:40 -0400)] 
Merge branch 'phy_turn_around'

Florian Fainelli says:

====================
net: phy: broken turn-around support

This is an attempt at solving the broken turn-around problem in a way that
is not specific to the mdio-gpio driver, since it affects different kinds of
platforms.

We cannot make that localized to PHY device drivers because probing the PHY
device which has a broken turn-around can fail as early as in get_phy_id(),
therefore we need a bit of help from Device Tree/platform_data.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: phy: mdio-gpio: Handle phy_ignore_ta_mask
Florian Fainelli [Tue, 12 May 2015 17:33:26 +0000 (10:33 -0700)] 
net: phy: mdio-gpio: Handle phy_ignore_ta_mask

Update mdiobb_read() to read whether the PHY has a broken turn-around,
and if it does, ignore it to make the read succeeed.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoof: mdio: Add a "broken-turn-around" property
Florian Fainelli [Tue, 12 May 2015 17:33:25 +0000 (10:33 -0700)] 
of: mdio: Add a "broken-turn-around" property

Some Ethernet PHY devices/switches may not properly release the MDIO bus
during turn-around time, and fail to drive it low, which can be seen by
some controllers as a read failure, while the data clocked in is still
correct.

Add a boolean property "broken-turn-around" which is parsed by the
generic MDIO bus probing code and will set the corresponding bit in the
MDIO bus phy_ignore_ta_mask bitmask for MDIO bus drivers to utilize that
information.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: phy: Add phy_ignore_ta_mask to account for broken turn-around
Florian Fainelli [Tue, 12 May 2015 17:33:24 +0000 (10:33 -0700)] 
net: phy: Add phy_ignore_ta_mask to account for broken turn-around

Some PHY devices/switches will not release the turn-around line as they
should do at the end of a MDIO transaction. To help with such
situations, allow MDIO bus drivers to be made aware of such
restrictions.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotipc: use sock_create_kern interface to create kernel socket
Ying Xue [Wed, 13 May 2015 03:20:38 +0000 (11:20 +0800)] 
tipc: use sock_create_kern interface to create kernel socket

After commit eeb1bd5c40ed ("net: Add a struct net parameter to
sock_create_kern"), we should use sock_create_kern() to create kernel
socket as the interface doesn't reference count struct net any more.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agocls_flower: Fix compile error
Brian Haley [Thu, 14 May 2015 17:20:15 +0000 (13:20 -0400)] 
cls_flower: Fix compile error

Fix compile error in net/sched/cls_flower.c

    net/sched/cls_flower.c: In function ‘fl_set_key’:
    net/sched/cls_flower.c:240:3: error: implicit declaration of
     function ‘tcf_change_indev’ [-Werror=implicit-function-declaration]
       err = tcf_change_indev(net, tb[TCA_FLOWER_INDEV]);

Introduced in 77b9900ef53ae

Fixes: 77b9900ef53ae ("tc: introduce Flower classifier")
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'tipc-next'
David S. Miller [Thu, 14 May 2015 16:24:46 +0000 (12:24 -0400)] 
Merge branch 'tipc-next'

Jon Maloy says:

====================
tipc: some link layer improvements

We continue eliminating redundant complexity at the link layer, and
add a couple of improvements to the packet sending functionality.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotipc: add packet sequence number at instant of transmission
Jon Paul Maloy [Thu, 14 May 2015 14:46:18 +0000 (10:46 -0400)] 
tipc: add packet sequence number at instant of transmission

Currently, the packet sequence number is updated and added to each
packet at the moment a packet is added to the link backlog queue.
This is wasteful, since it forces the code to traverse the send
packet list packet by packet when adding them to the backlog queue.
It would be better to just splice the whole packet list into the
backlog queue when that is the right action to do.

In this commit, we do this change. Also, since the sequence numbers
cannot now be assigned to the packets at the moment they are added
the backlog queue, we do instead calculate and add them at the moment
of transmission, when the backlog queue has to be traversed anyway.
We do this in the function tipc_link_push_packet().

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotipc: improve link congestion algorithm
Jon Paul Maloy [Thu, 14 May 2015 14:46:17 +0000 (10:46 -0400)] 
tipc: improve link congestion algorithm

The link congestion algorithm used until now implies two problems.

- It is too generous towards lower-level messages in situations of high
  load by giving "absolute" bandwidth guarantees to the different
  priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranted
  20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the
  available bandwidth. But, in the absence of higher level traffic, the
  ratio between two distinct levels becomes unreasonable. E.g. if there
  is only LOW and MEDIUM traffic on a system, the former is guaranteed
  1/3 of the bandwidth, and the latter 2/3. This again means that if
  there is e.g. one LOW user and 10 MEDIUM users, the  former will have
  33.3% of the bandwidth, and the others will have to compete for the
  remainder, i.e. each will end up with 6.7% of the capacity.

- Packets of type MSG_BUNDLER are created at SYSTEM importance level,
  but only after the packets bundled into it have passed the congestion
  test for their own respective levels. Since bundled packets don't
  result in incrementing the level counter for their own importance,
  only occasionally for the SYSTEM level counter, they do in practice
  obtain SYSTEM level importance. Hence, the current implementation
  provides a gap in the congestion algorithm that in the worst case
  may lead to a link reset.

We now refine the congestion algorithm as follows:

- A message is accepted to the link backlog only if its own level
  counter, and all superior level counters, permit it.

- The importance of a created bundle packet is set according to its
  contents. A bundle packet created from messges at levels LOW to
  CRITICAL is given importance level CRITICAL, while a bundle created
  from a SYSTEM level message is given importance SYSTEM. In the latter
  case only subsequent SYSTEM level messages are allowed to be bundled
  into it.

This solves the first problem described above, by making the bandwidth
guarantee relative to the total number of users at all levels; only
the upper limit for each level remains absolute. In the example
described above, the single LOW user would use 1/11th of the bandwidth,
the same as each of the ten MEDIUM users, but he still has the same
guarantee against starvation as the latter ones.

The fix also solves the second problem. If the CRITICAL level is filled
up by bundle packets of that level, no lower level packets will be
accepted any more.

Suggested-by: Gergely Kiss <gergely.kiss@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotipc: simplify link supervision checkpointing
Jon Paul Maloy [Thu, 14 May 2015 14:46:16 +0000 (10:46 -0400)] 
tipc: simplify link supervision checkpointing

We change the sequence number checkpointing that is performed
by the timer in order to discover if the peer is active. Currently,
we store a checkpoint of the next expected sequence number "rcv_nxt"
at each timer expiration, and compare it to the current expected
number at next timeout expiration. Instead, we now use the already
existing field "silent_intv_cnt" for this task. We step the counter
at each timeout expiration, and zero it at each valid received packet.
If no valid packet has been received from the peer after "abort_limit"
number of silent timer intervals, the link is declared faulty and reset.

We also remove the multiple instances of timer activation from inside
the FSM function "link_state_event()", and now do it at only one place;
at the end of the timer function itself.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotipc: rename fields in struct tipc_link
Jon Paul Maloy [Thu, 14 May 2015 14:46:15 +0000 (10:46 -0400)] 
tipc: rename fields in struct tipc_link

We rename some fields in struct tipc_link, in order to give them more
descriptive names:

next_in_no -> rcv_nxt
next_out_no-> snd_nxt
fsm_msg_cnt-> silent_intv_cnt
cont_intv  -> keepalive_intv
last_retransmitted -> last_retransm

There are no functional changes in this commit.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotipc: simplify packet sequence number handling
Jon Paul Maloy [Thu, 14 May 2015 14:46:14 +0000 (10:46 -0400)] 
tipc: simplify packet sequence number handling

Although the sequence number in the TIPC protocol is 16 bits, we have
until now stored it internally as an unsigned 32 bits integer.
We got around this by always doing explicit modulo-65535 operations
whenever we need to access a sequence number.

We now make the incoming and outgoing sequence numbers to unsigned
16-bit integers, and remove the modulo operations where applicable.

We also move the arithmetic inline functions for 16 bit integers
to core.h, and the function buf_seqno() to msg.h, so they can easily
be accessed from anywhere in the code.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotipc: simplify include dependencies
Jon Paul Maloy [Thu, 14 May 2015 14:46:13 +0000 (10:46 -0400)] 
tipc: simplify include dependencies

When we try to add new inline functions in the code, we sometimes
run into circular include dependencies.

The main problem is that the file core.h, which really should be at
the root of the dependency chain, instead is a leaf. I.e., core.h
includes a number of header files that themselves should be allowed
to include core.h. In reality this is unnecessary, because core.h does
not need to know the full signature of any of the structs it refers to,
only their type declaration.

In this commit, we remove all dependencies from core.h towards any
other tipc header file.

As a consequence of this change, we can now move the function
tipc_own_addr(net) from addr.c to addr.h, and make it inline.

There are no functional changes in this commit.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This page took 0.054459 seconds and 5 git commands to generate.