[deliverable/linux.git] / Documentation / networking / filter.txt

Linux Socket Filtering aka Berkeley Packet Filter (BPF)
=======================================================

Introduction
------------

Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
Though there are some distinct differences between the BSD and Linux
Kernel filtering, but when we speak of BPF or LSF in Linux context, we
mean the very same mechanism of filtering in the Linux kernel.

BPF allows a user-space program to attach a filter onto any socket and
allow or disallow certain types of data to come through the socket. LSF
follows exactly the same filter code structure as BSD's BPF, so referring
to the BSD bpf.4 manpage is very helpful in creating filters.

On Linux, BPF is much simpler than on BSD. One does not have to worry
about devices or anything like that. You simply create your filter code,
send it to the kernel via the SO_ATTACH_FILTER option and if your filter
code passes the kernel check on it, you then immediately begin filtering
data on that socket.

You can also detach filters from your socket via the SO_DETACH_FILTER
option. This will probably not be used much since when you close a socket
that has a filter on it the filter is automagically removed. The other
less common case may be adding a different filter on the same socket where
you had another filter that is still running: the kernel takes care of
removing the old one and placing your new one in its place, assuming your
filter has passed the checks, otherwise if it fails the old filter will
remain on that socket.

SO_LOCK_FILTER option allows to lock the filter attached to a socket. Once
set, a filter cannot be removed or changed. This allows one process to
setup a socket, attach a filter, lock it then drop privileges and be
assured that the filter will be kept until the socket is closed.

The biggest user of this construct might be libpcap. Issuing a high-level
filter command like `tcpdump -i em1 port 22` passes through the libpcap
internal compiler that generates a structure that can eventually be loaded
via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
displays what is being placed into this structure.

Although we were only speaking about sockets here, BPF in Linux is used
in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places
such as team driver, PTP code, etc where BPF is being used.

 [1] Documentation/prctl/seccomp_filter.txt

Original BPF paper:

Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
architecture for user-level packet capture. In Proceedings of the
USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993
Conference Proceedings (USENIX'93). USENIX Association, Berkeley,
CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]

Structure
---------

User space applications include <linux/filter.h> which contains the
following relevant structures:

struct sock_filter {	/* Filter block */
	__u16	code;   /* Actual filter code */
	__u8	jt;	/* Jump true */
	__u8	jf;	/* Jump false */
	__u32	k;      /* Generic multiuse field */
};

Such a structure is assembled as an array of 4-tuples, that contains
a code, jt, jf and k value. jt and jf are jump offsets and k a generic
value to be used for a provided code.

struct sock_fprog {			/* Required for SO_ATTACH_FILTER. */
	unsigned short		   len;	/* Number of filter blocks */
	struct sock_filter __user *filter;
};

For socket filtering, a pointer to this structure (as shown in
follow-up example) is being passed to the kernel through setsockopt(2).

Example
-------

#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
/* ... */

/* From the example above: tcpdump -i em1 port 22 -dd */
struct sock_filter code[] = {
	{ 0x28,  0,  0, 0x0000000c },
	{ 0x15,  0,  8, 0x000086dd },
	{ 0x30,  0,  0, 0x00000014 },
	{ 0x15,  2,  0, 0x00000084 },
	{ 0x15,  1,  0, 0x00000006 },
	{ 0x15,  0, 17, 0x00000011 },
	{ 0x28,  0,  0, 0x00000036 },
	{ 0x15, 14,  0, 0x00000016 },
	{ 0x28,  0,  0, 0x00000038 },
	{ 0x15, 12, 13, 0x00000016 },
	{ 0x15,  0, 12, 0x00000800 },
	{ 0x30,  0,  0, 0x00000017 },
	{ 0x15,  2,  0, 0x00000084 },
	{ 0x15,  1,  0, 0x00000006 },
	{ 0x15,  0,  8, 0x00000011 },
	{ 0x28,  0,  0, 0x00000014 },
	{ 0x45,  6,  0, 0x00001fff },
	{ 0xb1,  0,  0, 0x0000000e },
	{ 0x48,  0,  0, 0x0000000e },
	{ 0x15,  2,  0, 0x00000016 },
	{ 0x48,  0,  0, 0x00000010 },
	{ 0x15,  0,  1, 0x00000016 },
	{ 0x06,  0,  0, 0x0000ffff },
	{ 0x06,  0,  0, 0x00000000 },
};

struct sock_fprog bpf = {
	.len = ARRAY_SIZE(code),
	.filter = code,
};

sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if (sock < 0)
	/* ... bail out ... */

ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
if (ret < 0)
	/* ... bail out ... */

/* ... */
close(sock);

The above example code attaches a socket filter for a PF_PACKET socket
in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
be dropped for this socket.

The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments
and SO_LOCK_FILTER for preventing the filter to be detached, takes an
integer value with 0 or 1.

Note that socket filters are not restricted to PF_PACKET sockets only,
but can also be used on other socket families.

Summary of system calls:

 * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
 * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
 * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER,   &val, sizeof(val));

Normally, most use cases for socket filtering on packet sockets will be
covered by libpcap in high-level syntax, so as an application developer
you should stick to that. libpcap wraps its own layer around all that.

Unless i) using/linking to libpcap is not an option, ii) the required BPF
filters use Linux extensions that are not supported by libpcap's compiler,
iii) a filter might be more complex and not cleanly implementable with
libpcap's compiler, or iv) particular filter codes should be optimized
differently than libpcap's internal compiler does; then in such cases
writing such a filter "by hand" can be of an alternative. For example,
xt_bpf and cls_bpf users might have requirements that could result in
more complex filter code, or one that cannot be expressed with libpcap
(e.g. different return codes for various code paths). Moreover, BPF JIT
implementors may wish to manually write test cases and thus need low-level
access to BPF code as well.

BPF engine and instruction set
------------------------------

Under tools/net/ there's a small helper tool called bpf_asm which can
be used to write low-level filters for example scenarios mentioned in the
previous section. Asm-like syntax mentioned here has been implemented in
bpf_asm and will be used for further explanations (instead of dealing with
less readable opcodes directly, principles are the same). The syntax is
closely modelled after Steven McCanne's and Van Jacobson's BPF paper.

The BPF architecture consists of the following basic elements:

  Element          Description

  A                32 bit wide accumulator
  X                32 bit wide X register
  M[]              16 x 32 bit wide misc registers aka "scratch memory
                   store", addressable from 0 to 15

A program, that is translated by bpf_asm into "opcodes" is an array that
consists of the following elements (as already mentioned):

  op:16, jt:8, jf:8, k:32

The element op is a 16 bit wide opcode that has a particular instruction
encoded. jt and jf are two 8 bit wide jump targets, one for condition
"jump if true", the other one "jump if false". Eventually, element k
contains a miscellaneous argument that can be interpreted in different
ways depending on the given instruction in op.

The instruction set consists of load, store, branch, alu, miscellaneous
and return instructions that are also represented in bpf_asm syntax. This
table lists all bpf_asm instructions available resp. what their underlying
opcodes as defined in linux/filter.h stand for:

  Instruction      Addressing mode      Description

  ld               1, 2, 3, 4, 10       Load word into A
  ldi              4                    Load word into A
  ldh              1, 2                 Load half-word into A
  ldb              1, 2                 Load byte into A
  ldx              3, 4, 5, 10          Load word into X
  ldxi             4                    Load word into X
  ldxb             5                    Load byte into X

  st               3                    Store A into M[]
  stx              3                    Store X into M[]

  jmp              6                    Jump to label
  ja               6                    Jump to label
  jeq              7, 8                 Jump on k == A
  jneq             8                    Jump on k != A
  jne              8                    Jump on k != A
  jlt              8                    Jump on k < A
  jle              8                    Jump on k <= A
  jgt              7, 8                 Jump on k > A
  jge              7, 8                 Jump on k >= A
  jset             7, 8                 Jump on k & A

  add              0, 4                 A + <x>
  sub              0, 4                 A - <x>
  mul              0, 4                 A * <x>
  div              0, 4                 A / <x>
  mod              0, 4                 A % <x>
  neg              0, 4                 !A
  and              0, 4                 A & <x>
  or               0, 4                 A | <x>
  xor              0, 4                 A ^ <x>
  lsh              0, 4                 A << <x>
  rsh              0, 4                 A >> <x>

  tax                                   Copy A into X
  txa                                   Copy X into A

  ret              4, 9                 Return

The next table shows addressing formats from the 2nd column:

  Addressing mode  Syntax               Description

   0               x/%x                 Register X
   1               [k]                  BHW at byte offset k in the packet
   2               [x + k]              BHW at the offset X + k in the packet
   3               M[k]                 Word at offset k in M[]
   4               #k                   Literal value stored in k
   5               4*([k]&0xf)          Lower nibble * 4 at byte offset k in the packet
   6               L                    Jump label L
   7               #k,Lt,Lf             Jump to Lt if true, otherwise jump to Lf
   8               #k,Lt                Jump to Lt if predicate is true
   9               a/%a                 Accumulator A
  10               extension            BPF extension

The Linux kernel also has a couple of BPF extensions that are used along
with the class of load instructions by "overloading" the k argument with
a negative offset + a particular extension offset. The result of such BPF
extensions are loaded into A.

Possible BPF extensions are shown in the following table:

  Extension                             Description

  len                                   skb->len
  proto                                 skb->protocol
  type                                  skb->pkt_type
  poff                                  Payload start offset
  ifidx                                 skb->dev->ifindex
  nla                                   Netlink attribute of type X with offset A
  nlan                                  Nested Netlink attribute of type X with offset A
  mark                                  skb->mark
  queue                                 skb->queue_mapping
  hatype                                skb->dev->type
  rxhash                                skb->rxhash
  cpu                                   raw_smp_processor_id()
  vlan_tci                              vlan_tx_tag_get(skb)
  vlan_pr                               vlan_tx_tag_present(skb)

These extensions can also be prefixed with '#'.
Examples for low-level BPF:

** ARP packets:

  ldh [12]
  jne #0x806, drop
  ret #-1
  drop: ret #0

** IPv4 TCP packets:

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #6, drop
  ret #-1
  drop: ret #0

** (Accelerated) VLAN w/ id 10:

  ld vlan_tci
  jneq #10, drop
  ret #-1
  drop: ret #0

** SECCOMP filter example:

  ld [4]                  /* offsetof(struct seccomp_data, arch) */
  jne #0xc000003e, bad    /* AUDIT_ARCH_X86_64 */
  ld [0]                  /* offsetof(struct seccomp_data, nr) */
  jeq #15, good           /* __NR_rt_sigreturn */
  jeq #231, good          /* __NR_exit_group */
  jeq #60, good           /* __NR_exit */
  jeq #0, good            /* __NR_read */
  jeq #1, good            /* __NR_write */
  jeq #5, good            /* __NR_fstat */
  jeq #9, good            /* __NR_mmap */
  jeq #14, good           /* __NR_rt_sigprocmask */
  jeq #13, good           /* __NR_rt_sigaction */
  jeq #35, good           /* __NR_nanosleep */
  bad: ret #0             /* SECCOMP_RET_KILL */
  good: ret #0x7fff0000   /* SECCOMP_RET_ALLOW */

The above example code can be placed into a file (here called "foo"), and
then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf
and cls_bpf understands and can directly be loaded with. Example with above
ARP code:

$ ./bpf_asm foo
4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,

In copy and paste C-like output:

$ ./bpf_asm -c foo
{ 0x28,  0,  0, 0x0000000c },
{ 0x15,  0,  1, 0x00000806 },
{ 0x06,  0,  0, 0xffffffff },
{ 0x06,  0,  0, 0000000000 },

In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
filters that might not be obvious at first, it's good to test filters before
attaching to a live system. For that purpose, there's a small tool called
bpf_dbg under tools/net/ in the kernel source directory. This debugger allows
for testing BPF filters against given pcap files, single stepping through the
BPF code on the pcap's packets and to do BPF machine register dumps.

Starting bpf_dbg is trivial and just requires issuing:

# ./bpf_dbg

In case input and output do not equal stdin/stdout, bpf_dbg takes an
alternative stdin source as a first argument, and an alternative stdout
sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.

Other than that, a particular libreadline configuration can be set via
file "~/.bpf_dbg_init" and the command history is stored in the file
"~/.bpf_dbg_history".

Interaction in bpf_dbg happens through a shell that also has auto-completion
support (follow-up example commands starting with '>' denote bpf_dbg shell).
The usual workflow would be to ...

> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
  Loads a BPF filter from standard output of bpf_asm, or transformed via
  e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT
  debugging (next section), this command creates a temporary socket and
  loads the BPF code into the kernel. Thus, this will also be useful for
  JIT developers.

> load pcap foo.pcap
  Loads standard tcpdump pcap file.

> run [<n>]
bpf passes:1 fails:9
  Runs through all packets from a pcap to account how many passes and fails
  the filter will generate. A limit of packets to traverse can be given.

> disassemble
l0:	ldh [12]
l1:	jeq #0x800, l2, l5
l2:	ldb [23]
l3:	jeq #0x1, l4, l5
l4:	ret #0xffff
l5:	ret #0
  Prints out BPF code disassembly.

> dump
/* { op, jt, jf, k }, */
{ 0x28,  0,  0, 0x0000000c },
{ 0x15,  0,  3, 0x00000800 },
{ 0x30,  0,  0, 0x00000017 },
{ 0x15,  0,  1, 0x00000001 },
{ 0x06,  0,  0, 0x0000ffff },
{ 0x06,  0,  0, 0000000000 },
  Prints out C-style BPF code dump.

> breakpoint 0
breakpoint at: l0:	ldh [12]
> breakpoint 1
breakpoint at: l1:	jeq #0x800, l2, l5
  ...
  Sets breakpoints at particular BPF instructions. Issuing a `run` command
  will walk through the pcap file continuing from the current packet and
  break when a breakpoint is being hit (another `run` will continue from
  the currently active breakpoint executing next instructions):

  > run
  -- register dump --
  pc:       [0]                       <-- program counter
  code:     [40] jt[0] jf[0] k[12]    <-- plain BPF code of current instruction
  curr:     l0:	ldh [12]              <-- disassembly of current instruction
  A:        [00000000][0]             <-- content of A (hex, decimal)
  X:        [00000000][0]             <-- content of X (hex, decimal)
  M[0,15]:  [00000000][0]             <-- folded content of M (hex, decimal)
  -- packet dump --                   <-- Current packet from pcap (hex)
  len: 42
    0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
   16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
   32: 00 00 00 00 00 00 0a 3b 01 01
  (breakpoint)
  >

> breakpoint
breakpoints: 0 1
  Prints currently set breakpoints.

> step [-<n>, +<n>]
  Performs single stepping through the BPF program from the current pc
  offset. Thus, on each step invocation, above register dump is issued.
  This can go forwards and backwards in time, a plain `step` will break
  on the next BPF instruction, thus +1. (No `run` needs to be issued here.)

> select <n>
  Selects a given packet from the pcap file to continue from. Thus, on
  the next `run` or `step`, the BPF program is being evaluated against
  the user pre-selected packet. Numbering starts just as in Wireshark
  with index 1.

> quit
#
  Exits bpf_dbg.

JIT compiler
------------

The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC,
ARM and s390 and can be enabled through CONFIG_BPF_JIT. The JIT compiler is
transparently invoked for each attached filter from user space or for internal
kernel users if it has been previously enabled by root:

  echo 1 > /proc/sys/net/core/bpf_jit_enable

For JIT developers, doing audits etc, each compile run can output the generated
opcode image into the kernel log via:

  echo 2 > /proc/sys/net/core/bpf_jit_enable

Example output from dmesg:

[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3

In the kernel source tree under tools/net/, there's bpf_jit_disasm for
generating disassembly out of the kernel log's hexdump:

# ./bpf_jit_disasm
70 bytes emitted from JIT compiler (pass:3, flen:6)
ffffffffa0069c8f + <x>:
   0:	push   %rbp
   1:	mov    %rsp,%rbp
   4:	sub    $0x60,%rsp
   8:	mov    %rbx,-0x8(%rbp)
   c:	mov    0x68(%rdi),%r9d
  10:	sub    0x6c(%rdi),%r9d
  14:	mov    0xd8(%rdi),%r8
  1b:	mov    $0xc,%esi
  20:	callq  0xffffffffe0ff9442
  25:	cmp    $0x800,%eax
  2a:	jne    0x0000000000000042
  2c:	mov    $0x17,%esi
  31:	callq  0xffffffffe0ff945e
  36:	cmp    $0x1,%eax
  39:	jne    0x0000000000000042
  3b:	mov    $0xffff,%eax
  40:	jmp    0x0000000000000044
  42:	xor    %eax,%eax
  44:	leaveq
  45:	retq

Issuing option `-o` will "annotate" opcodes to resulting assembler
instructions, which can be very useful for JIT developers:

# ./bpf_jit_disasm -o
70 bytes emitted from JIT compiler (pass:3, flen:6)
ffffffffa0069c8f + <x>:
   0:	push   %rbp
	55
   1:	mov    %rsp,%rbp
	48 89 e5
   4:	sub    $0x60,%rsp
	48 83 ec 60
   8:	mov    %rbx,-0x8(%rbp)
	48 89 5d f8
   c:	mov    0x68(%rdi),%r9d
	44 8b 4f 68
  10:	sub    0x6c(%rdi),%r9d
	44 2b 4f 6c
  14:	mov    0xd8(%rdi),%r8
	4c 8b 87 d8 00 00 00
  1b:	mov    $0xc,%esi
	be 0c 00 00 00
  20:	callq  0xffffffffe0ff9442
	e8 1d 94 ff e0
  25:	cmp    $0x800,%eax
	3d 00 08 00 00
  2a:	jne    0x0000000000000042
	75 16
  2c:	mov    $0x17,%esi
	be 17 00 00 00
  31:	callq  0xffffffffe0ff945e
	e8 28 94 ff e0
  36:	cmp    $0x1,%eax
	83 f8 01
  39:	jne    0x0000000000000042
	75 07
  3b:	mov    $0xffff,%eax
	b8 ff ff 00 00
  40:	jmp    0x0000000000000044
	eb 02
  42:	xor    %eax,%eax
	31 c0
  44:	leaveq
	c9
  45:	retq
	c3

For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful
toolchain for developing and testing the kernel's JIT compiler.

Misc
----

Also trinity, the Linux syscall fuzzer, has built-in support for BPF and
SECCOMP-BPF kernel fuzzing.

Written by
----------

The document was written in the hope that it is found useful and in order
to give potential BPF hackers or security auditors a better overview of
the underlying architecture.

Jay Schulist <jschlst@samba.org>
Daniel Borkmann <dborkman@redhat.com>
Commit	Line	Data
7924cd5e DB	1	Linux Socket Filtering aka Berkeley Packet Filter (BPF)
7924cd5e DB	2	=======================================================
1da177e4 LT	3
1da177e4 LT	4	Introduction
7924cd5e DB	5	------------
	6
	7	Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
	8	Though there are some distinct differences between the BSD and Linux
	9	Kernel filtering, but when we speak of BPF or LSF in Linux context, we
	10	mean the very same mechanism of filtering in the Linux kernel.
	11
	12	BPF allows a user-space program to attach a filter onto any socket and
	13	allow or disallow certain types of data to come through the socket. LSF
	14	follows exactly the same filter code structure as BSD's BPF, so referring
	15	to the BSD bpf.4 manpage is very helpful in creating filters.
	16
	17	On Linux, BPF is much simpler than on BSD. One does not have to worry
	18	about devices or anything like that. You simply create your filter code,
	19	send it to the kernel via the SO_ATTACH_FILTER option and if your filter
	20	code passes the kernel check on it, you then immediately begin filtering
	21	data on that socket.
	22
	23	You can also detach filters from your socket via the SO_DETACH_FILTER
	24	option. This will probably not be used much since when you close a socket
	25	that has a filter on it the filter is automagically removed. The other
	26	less common case may be adding a different filter on the same socket where
	27	you had another filter that is still running: the kernel takes care of
	28	removing the old one and placing your new one in its place, assuming your
	29	filter has passed the checks, otherwise if it fails the old filter will
	30	remain on that socket.
	31
	32	SO_LOCK_FILTER option allows to lock the filter attached to a socket. Once
	33	set, a filter cannot be removed or changed. This allows one process to
	34	setup a socket, attach a filter, lock it then drop privileges and be
	35	assured that the filter will be kept until the socket is closed.
	36
	37	The biggest user of this construct might be libpcap. Issuing a high-level
	38	filter command like `tcpdump -i em1 port 22` passes through the libpcap
	39	internal compiler that generates a structure that can eventually be loaded
	40	via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
	41	displays what is being placed into this structure.
	42
	43	Although we were only speaking about sockets here, BPF in Linux is used
	44	in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
	45	qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places
	46	such as team driver, PTP code, etc where BPF is being used.
	47
	48	[1] Documentation/prctl/seccomp_filter.txt
	49
	50	Original BPF paper:
	51
	52	Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
	53	architecture for user-level packet capture. In Proceedings of the
	54	USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993
	55	Conference Proceedings (USENIX'93). USENIX Association, Berkeley,
	56	CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]
	57
	58	Structure
	59	---------
	60
	61	User space applications include <linux/filter.h> which contains the
	62	following relevant structures:
	63
	64	struct sock_filter { /* Filter block */
	65	__u16 code; /* Actual filter code */
	66	__u8 jt; /* Jump true */
	67	__u8 jf; /* Jump false */
	68	__u32 k; /* Generic multiuse field */
69	};
70
71	Such a structure is assembled as an array of 4-tuples, that contains
72	a code, jt, jf and k value. jt and jf are jump offsets and k a generic
73	value to be used for a provided code.
74
75	struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
76	unsigned short len; /* Number of filter blocks */
77	struct sock_filter __user *filter;
78	};
79
80	For socket filtering, a pointer to this structure (as shown in
81	follow-up example) is being passed to the kernel through setsockopt(2).
82
83	Example
84	-------
85
86	#include <sys/socket.h>
87	#include <sys/types.h>
88	#include <arpa/inet.h>
89	#include <linux/if_ether.h>
90	/* ... */
91
92	/* From the example above: tcpdump -i em1 port 22 -dd */
93	struct sock_filter code[] = {
94	{ 0x28, 0, 0, 0x0000000c },
95	{ 0x15, 0, 8, 0x000086dd },
96	{ 0x30, 0, 0, 0x00000014 },
97	{ 0x15, 2, 0, 0x00000084 },
98	{ 0x15, 1, 0, 0x00000006 },
99	{ 0x15, 0, 17, 0x00000011 },
100	{ 0x28, 0, 0, 0x00000036 },
101	{ 0x15, 14, 0, 0x00000016 },
102	{ 0x28, 0, 0, 0x00000038 },
103	{ 0x15, 12, 13, 0x00000016 },
104	{ 0x15, 0, 12, 0x00000800 },
105	{ 0x30, 0, 0, 0x00000017 },
106	{ 0x15, 2, 0, 0x00000084 },
107	{ 0x15, 1, 0, 0x00000006 },
108	{ 0x15, 0, 8, 0x00000011 },
109	{ 0x28, 0, 0, 0x00000014 },
110	{ 0x45, 6, 0, 0x00001fff },
111	{ 0xb1, 0, 0, 0x0000000e },
112	{ 0x48, 0, 0, 0x0000000e },
113	{ 0x15, 2, 0, 0x00000016 },
114	{ 0x48, 0, 0, 0x00000010 },
115	{ 0x15, 0, 1, 0x00000016 },
116	{ 0x06, 0, 0, 0x0000ffff },
117	{ 0x06, 0, 0, 0x00000000 },
118	};
119
120	struct sock_fprog bpf = {
121	.len = ARRAY_SIZE(code),
122	.filter = code,
123	};
124
125	sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
126	if (sock < 0)
127	/* ... bail out ... */
128
129	ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
130	if (ret < 0)
131	/* ... bail out ... */
132
133	/* ... */
134	close(sock);
135
136	The above example code attaches a socket filter for a PF_PACKET socket
137	in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
138	be dropped for this socket.
139
140	The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments
141	and SO_LOCK_FILTER for preventing the filter to be detached, takes an
142	integer value with 0 or 1.
143
144	Note that socket filters are not restricted to PF_PACKET sockets only,
145	but can also be used on other socket families.
146
147	Summary of system calls:
148
149	* setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
150	* setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
151	* setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val));
152
153	Normally, most use cases for socket filtering on packet sockets will be
154	covered by libpcap in high-level syntax, so as an application developer
155	you should stick to that. libpcap wraps its own layer around all that.
156
157	Unless i) using/linking to libpcap is not an option, ii) the required BPF
158	filters use Linux extensions that are not supported by libpcap's compiler,
159	iii) a filter might be more complex and not cleanly implementable with
160	libpcap's compiler, or iv) particular filter codes should be optimized
161	differently than libpcap's internal compiler does; then in such cases
162	writing such a filter "by hand" can be of an alternative. For example,
163	xt_bpf and cls_bpf users might have requirements that could result in
164	more complex filter code, or one that cannot be expressed with libpcap
165	(e.g. different return codes for various code paths). Moreover, BPF JIT
166	implementors may wish to manually write test cases and thus need low-level
167	access to BPF code as well.
168
169	BPF engine and instruction set
170	------------------------------
171
172	Under tools/net/ there's a small helper tool called bpf_asm which can
173	be used to write low-level filters for example scenarios mentioned in the
174	previous section. Asm-like syntax mentioned here has been implemented in
175	bpf_asm and will be used for further explanations (instead of dealing with
176	less readable opcodes directly, principles are the same). The syntax is
177	closely modelled after Steven McCanne's and Van Jacobson's BPF paper.
178
179	The BPF architecture consists of the following basic elements:
180
181	Element Description
182
183	A 32 bit wide accumulator
184	X 32 bit wide X register
185	M[] 16 x 32 bit wide misc registers aka "scratch memory
186	store", addressable from 0 to 15
187
188	A program, that is translated by bpf_asm into "opcodes" is an array that
189	consists of the following elements (as already mentioned):
190
191	op:16, jt:8, jf:8, k:32
192
193	The element op is a 16 bit wide opcode that has a particular instruction
194	encoded. jt and jf are two 8 bit wide jump targets, one for condition
195	"jump if true", the other one "jump if false". Eventually, element k
196	contains a miscellaneous argument that can be interpreted in different
197	ways depending on the given instruction in op.
198
199	The instruction set consists of load, store, branch, alu, miscellaneous
200	and return instructions that are also represented in bpf_asm syntax. This
201	table lists all bpf_asm instructions available resp. what their underlying
202	opcodes as defined in linux/filter.h stand for:
203
204	Instruction Addressing mode Description
205
206	ld 1, 2, 3, 4, 10 Load word into A
207	ldi 4 Load word into A
208	ldh 1, 2 Load half-word into A
209	ldb 1, 2 Load byte into A
210	ldx 3, 4, 5, 10 Load word into X
211	ldxi 4 Load word into X
212	ldxb 5 Load byte into X
213
214	st 3 Store A into M[]
215	stx 3 Store X into M[]
216
217	jmp 6 Jump to label
218	ja 6 Jump to label
219	jeq 7, 8 Jump on k == A
220	jneq 8 Jump on k != A
221	jne 8 Jump on k != A
222	jlt 8 Jump on k < A
223	jle 8 Jump on k <= A
224	jgt 7, 8 Jump on k > A
225	jge 7, 8 Jump on k >= A
226	jset 7, 8 Jump on k & A
227
228	add 0, 4 A + <x>
229	sub 0, 4 A - <x>
230	mul 0, 4 A * <x>
231	div 0, 4 A / <x>
232	mod 0, 4 A % <x>
233	neg 0, 4 !A
234	and 0, 4 A & <x>
235	or 0, 4 A \| <x>
236	xor 0, 4 A ^ <x>
237	lsh 0, 4 A << <x>
238	rsh 0, 4 A >> <x>
239
240	tax Copy A into X
241	txa Copy X into A
242
243	ret 4, 9 Return
244
245	The next table shows addressing formats from the 2nd column:
246
247	Addressing mode Syntax Description
248
249	0 x/%x Register X
250	1 [k] BHW at byte offset k in the packet
251	2 [x + k] BHW at the offset X + k in the packet
252	3 M[k] Word at offset k in M[]
253	4 #k Literal value stored in k
254	5 4([k]&0xf) Lower nibble 4 at byte offset k in the packet
255	6 L Jump label L
256	7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf
257	8 #k,Lt Jump to Lt if predicate is true
258	9 a/%a Accumulator A
259	10 extension BPF extension
260
261	The Linux kernel also has a couple of BPF extensions that are used along
262	with the class of load instructions by "overloading" the k argument with
263	a negative offset + a particular extension offset. The result of such BPF
264	extensions are loaded into A.
265
266	Possible BPF extensions are shown in the following table:
267
268	Extension Description
269
270	len skb->len
271	proto skb->protocol
272	type skb->pkt_type
273	poff Payload start offset
274	ifidx skb->dev->ifindex
275	nla Netlink attribute of type X with offset A
276	nlan Nested Netlink attribute of type X with offset A
277	mark skb->mark
278	queue skb->queue_mapping
279	hatype skb->dev->type
280	rxhash skb->rxhash
281	cpu raw_smp_processor_id()
282	vlan_tci vlan_tx_tag_get(skb)
283	vlan_pr vlan_tx_tag_present(skb)
284
285	These extensions can also be prefixed with '#'.
286	Examples for low-level BPF:
287
288	** ARP packets:
289
290	ldh [12]
291	jne #0x806, drop
292	ret #-1
293	drop: ret #0
294
295	** IPv4 TCP packets:
296
297	ldh [12]
298	jne #0x800, drop
299	ldb [23]
300	jneq #6, drop
301	ret #-1
302	drop: ret #0
303
304	** (Accelerated) VLAN w/ id 10:
305
306	ld vlan_tci
307	jneq #10, drop
308	ret #-1
309	drop: ret #0
310
311	** SECCOMP filter example:
312
313	ld [4] /* offsetof(struct seccomp_data, arch) */
314	jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */
315	ld [0] /* offsetof(struct seccomp_data, nr) */
316	jeq #15, good /* __NR_rt_sigreturn */
317	jeq #231, good /* __NR_exit_group */
318	jeq #60, good /* __NR_exit */
319	jeq #0, good /* __NR_read */
320	jeq #1, good /* __NR_write */
321	jeq #5, good /* __NR_fstat */
322	jeq #9, good /* __NR_mmap */
323	jeq #14, good /* __NR_rt_sigprocmask */
324	jeq #13, good /* __NR_rt_sigaction */
325	jeq #35, good /* __NR_nanosleep */
326	bad: ret #0 /* SECCOMP_RET_KILL */
327	good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */
328
329	The above example code can be placed into a file (here called "foo"), and
330	then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf
331	and cls_bpf understands and can directly be loaded with. Example with above
332	ARP code:
333
334	$ ./bpf_asm foo
335	4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,
336
337	In copy and paste C-like output:
338
339	$ ./bpf_asm -c foo
340	{ 0x28, 0, 0, 0x0000000c },
341	{ 0x15, 0, 1, 0x00000806 },
342	{ 0x06, 0, 0, 0xffffffff },
343	{ 0x06, 0, 0, 0000000000 },
344
345	In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
346	filters that might not be obvious at first, it's good to test filters before
347	attaching to a live system. For that purpose, there's a small tool called
348	bpf_dbg under tools/net/ in the kernel source directory. This debugger allows
349	for testing BPF filters against given pcap files, single stepping through the
350	BPF code on the pcap's packets and to do BPF machine register dumps.
351
352	Starting bpf_dbg is trivial and just requires issuing:
353
354	# ./bpf_dbg
355
356	In case input and output do not equal stdin/stdout, bpf_dbg takes an
357	alternative stdin source as a first argument, and an alternative stdout
358	sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.
359
360	Other than that, a particular libreadline configuration can be set via
361	file "~/.bpf_dbg_init" and the command history is stored in the file
362	"~/.bpf_dbg_history".
363
364	Interaction in bpf_dbg happens through a shell that also has auto-completion
365	support (follow-up example commands starting with '>' denote bpf_dbg shell).
366	The usual workflow would be to ...
367
368	> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
369	Loads a BPF filter from standard output of bpf_asm, or transformed via
370	e.g. `tcpdump -iem1 -ddd port 22 \| tr '\n' ','`. Note that for JIT
371	debugging (next section), this command creates a temporary socket and
372	loads the BPF code into the kernel. Thus, this will also be useful for
373	JIT developers.
374
375	> load pcap foo.pcap
376	Loads standard tcpdump pcap file.
377
378	> run [<n>]
379	bpf passes:1 fails:9
380	Runs through all packets from a pcap to account how many passes and fails
381	the filter will generate. A limit of packets to traverse can be given.
382
383	> disassemble
384	l0: ldh [12]
385	l1: jeq #0x800, l2, l5
386	l2: ldb [23]
387	l3: jeq #0x1, l4, l5
388	l4: ret #0xffff
389	l5: ret #0
390	Prints out BPF code disassembly.
391
392	> dump
393	/* { op, jt, jf, k }, */
394	{ 0x28, 0, 0, 0x0000000c },
395	{ 0x15, 0, 3, 0x00000800 },
396	{ 0x30, 0, 0, 0x00000017 },
397	{ 0x15, 0, 1, 0x00000001 },
398	{ 0x06, 0, 0, 0x0000ffff },
399	{ 0x06, 0, 0, 0000000000 },
400	Prints out C-style BPF code dump.
401
402	> breakpoint 0
403	breakpoint at: l0: ldh [12]
404	> breakpoint 1
405	breakpoint at: l1: jeq #0x800, l2, l5
406	...
407	Sets breakpoints at particular BPF instructions. Issuing a `run` command
408	will walk through the pcap file continuing from the current packet and
409	break when a breakpoint is being hit (another `run` will continue from
410	the currently active breakpoint executing next instructions):
411
412	> run
413	-- register dump --
414	pc: [0] <-- program counter
415	code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction
416	curr: l0: ldh [12] <-- disassembly of current instruction
417	A: [00000000][0] <-- content of A (hex, decimal)
418	X: [00000000][0] <-- content of X (hex, decimal)
419	M[0,15]: [00000000][0] <-- folded content of M (hex, decimal)
420	-- packet dump -- <-- Current packet from pcap (hex)
421	len: 42
422	0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
423	16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
424	32: 00 00 00 00 00 00 0a 3b 01 01
425	(breakpoint)
426	>
427
428	> breakpoint
429	breakpoints: 0 1
430	Prints currently set breakpoints.
431
432	> step [-<n>, +<n>]
433	Performs single stepping through the BPF program from the current pc
434	offset. Thus, on each step invocation, above register dump is issued.
435	This can go forwards and backwards in time, a plain `step` will break
436	on the next BPF instruction, thus +1. (No `run` needs to be issued here.)
437
438	> select <n>
439	Selects a given packet from the pcap file to continue from. Thus, on
440	the next `run` or `step`, the BPF program is being evaluated against
441	the user pre-selected packet. Numbering starts just as in Wireshark
442	with index 1.
443
444	> quit
445	#
446	Exits bpf_dbg.
447
448	JIT compiler
449	------------
450
451	The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC,
452	ARM and s390 and can be enabled through CONFIG_BPF_JIT. The JIT compiler is
453	transparently invoked for each attached filter from user space or for internal
454	kernel users if it has been previously enabled by root:
455
456	echo 1 > /proc/sys/net/core/bpf_jit_enable
457
458	For JIT developers, doing audits etc, each compile run can output the generated
459	opcode image into the kernel log via:
460
461	echo 2 > /proc/sys/net/core/bpf_jit_enable
462
463	Example output from dmesg:
464
465	[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
466	[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
467	[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
468	[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
469	[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
470	[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3
471
472	In the kernel source tree under tools/net/, there's bpf_jit_disasm for
473	generating disassembly out of the kernel log's hexdump:
474
475	# ./bpf_jit_disasm
476	70 bytes emitted from JIT compiler (pass:3, flen:6)
477	ffffffffa0069c8f + <x>:
478	0: push %rbp
479	1: mov %rsp,%rbp
480	4: sub $0x60,%rsp
481	8: mov %rbx,-0x8(%rbp)
482	c: mov 0x68(%rdi),%r9d
483	10: sub 0x6c(%rdi),%r9d
484	14: mov 0xd8(%rdi),%r8
485	1b: mov $0xc,%esi
486	20: callq 0xffffffffe0ff9442
487	25: cmp $0x800,%eax
488	2a: jne 0x0000000000000042
489	2c: mov $0x17,%esi
490	31: callq 0xffffffffe0ff945e
491	36: cmp $0x1,%eax
492	39: jne 0x0000000000000042
493	3b: mov $0xffff,%eax
494	40: jmp 0x0000000000000044
495	42: xor %eax,%eax
496	44: leaveq
497	45: retq
498
499	Issuing option `-o` will "annotate" opcodes to resulting assembler
500	instructions, which can be very useful for JIT developers:
501
502	# ./bpf_jit_disasm -o
503	70 bytes emitted from JIT compiler (pass:3, flen:6)
504	ffffffffa0069c8f + <x>:
505	0: push %rbp
506	55
507	1: mov %rsp,%rbp
508	48 89 e5
509	4: sub $0x60,%rsp
510	48 83 ec 60
511	8: mov %rbx,-0x8(%rbp)
512	48 89 5d f8
513	c: mov 0x68(%rdi),%r9d
514	44 8b 4f 68
515	10: sub 0x6c(%rdi),%r9d
516	44 2b 4f 6c
517	14: mov 0xd8(%rdi),%r8
518	4c 8b 87 d8 00 00 00
519	1b: mov $0xc,%esi
520	be 0c 00 00 00
521	20: callq 0xffffffffe0ff9442
522	e8 1d 94 ff e0
523	25: cmp $0x800,%eax
524	3d 00 08 00 00
525	2a: jne 0x0000000000000042
526	75 16
527	2c: mov $0x17,%esi
528	be 17 00 00 00
529	31: callq 0xffffffffe0ff945e
530	e8 28 94 ff e0
531	36: cmp $0x1,%eax
532	83 f8 01
533	39: jne 0x0000000000000042
534	75 07
535	3b: mov $0xffff,%eax
536	b8 ff ff 00 00
537	40: jmp 0x0000000000000044
538	eb 02
539	42: xor %eax,%eax
540	31 c0
541	44: leaveq
542	c9
543	45: retq
544	c3
545
546	For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful
547	toolchain for developing and testing the kernel's JIT compiler.
548
549	Misc
550	----
551
552	Also trinity, the Linux syscall fuzzer, has built-in support for BPF and
553	SECCOMP-BPF kernel fuzzing.
554
555	Written by
556	----------
557
558	The document was written in the hope that it is found useful and in order
559	to give potential BPF hackers or security auditors a better overview of
560	the underlying architecture.
561
562	Jay Schulist <jschlst@samba.org>
563	Daniel Borkmann <dborkman@redhat.com>