Commit | Line | Data |
---|---|---|
10016594 TH |
1 | Kernel Connection Multiplexor |
2 | ----------------------------- | |
3 | ||
4 | Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based | |
5 | interface over TCP for generic application protocols. With KCM an application | |
6 | can efficiently send and receive application protocol messages over TCP using | |
7 | datagram sockets. | |
8 | ||
9 | KCM implements an NxM multiplexor in the kernel as diagrammed below: | |
10 | ||
11 | +------------+   +------------+   +------------+   +------------+ | |
12 | | KCM socket |   | KCM socket |   | KCM socket |   | KCM socket | | |
13 | +------------+   +------------+   +------------+   +------------+ | |
14 |       |                 |               |                | | |
15 |       +-----------+     |               |     +----------+ | |
16 |                   |     |               |     | | |
17 |                +----------------------------------+ | |
18 |                |           Multiplexor            | | |
19 |                +----------------------------------+ | |
20 |                  |   |           |           |  | | |
21 |        +---------+   |           |           |  ------------+ | |
22 |        |             |           |           |              | | |
23 | +----------+  +----------+  +----------+  +----------+ +----------+ | |
24 | |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   | | |
25 | +----------+  +----------+  +----------+  +----------+ +----------+ | |
26 |       |              |           |            |             | | |
27 | +----------+  +----------+  +----------+  +----------+ +----------+ | |
28 | | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock | | |
29 | +----------+  +----------+  +----------+  +----------+ +----------+ | |
30 | ||
31 | KCM sockets | |
32 | ----------- | |
33 | ||
34 | The KCM sockets provide the user interface to the multiplexor. All the KCM sockets | |
35 | bound to a multiplexor are considered to have equivalent function, and I/O | |
36 | operations in different sockets may be done in parallel without the need for | |
37 | synchronization between threads in userspace. | |
38 | ||
39 | Multiplexor | |
40 | ----------- | |
41 | ||
42 | The multiplexor provides the message steering. In the transmit path, messages | |
43 | written on a KCM socket are sent atomically on an appropriate TCP socket. | |
44 | Similarly, in the receive path, messages are constructed on each TCP socket | |
45 | (Psock) and complete messages are steered to a KCM socket. | |
46 | ||
47 | TCP sockets & Psocks | |
48 | -------------------- | |
49 | ||
50 | TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated | |
51 | for each bound TCP socket; this structure holds the state for constructing | |
52 | messages on receive as well as other connection specific information for KCM. | |
53 | ||
54 | Connected mode semantics | |
55 | ------------------------ | |
56 | ||
57 | Each multiplexor assumes that all attached TCP connections are to the same | |
58 | destination and can use the different connections for load balancing when | |
59 | transmitting. The normal send and recv calls (including sendmmsg and recvmmsg) | |
60 | can be used to send and receive messages from the KCM socket. | |
61 | ||
62 | Socket types | |
63 | ------------ | |
64 | ||
65 | KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types. | |
66 | ||
67 | Message delineation | |
68 | ------------------- | |
69 | ||
70 | Messages are sent over a TCP stream with some application protocol message | |
71 | format that typically includes a header which frames the messages. The length | |
72 | of a received message can be deduced from the application protocol header | |
73 | (often just a simple length field). | |
74 | ||
75 | A TCP stream must be parsed to determine message boundaries. Berkeley Packet | |
76 | Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a | |
77 | BPF program must be specified. The program is called at the start of receiving | |
78 | a new message and is given an skbuff that contains the bytes received so far. | |
79 | It parses the message header and returns the length of the message. Given this | |
80 | information, KCM will construct the message of the stated length and deliver it | |
81 | to a KCM socket. | |
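As a rough illustration of the length computation such a parser performs, here is a userspace C sketch. It assumes a hypothetical framing in which every message starts with a 4-byte big-endian length that counts only the payload. The names frame_len and count_frames are invented for this example and are not part of KCM or BPF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical framing (an assumption of this sketch): a 4-byte
 * big-endian payload length followed by the payload.
 * Returns the total message length (header + payload), or 0 when
 * fewer than 4 header bytes are available. */
size_t frame_len(const uint8_t *buf, size_t avail)
{
	uint32_t payload;

	assert(buf != NULL);

	if (avail < 4)
		return 0;
	payload = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		  ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
	return (size_t)payload + 4;
}

/* Count the complete messages at the front of a byte stream,
 * stopping at the first partial message. */
size_t count_frames(const uint8_t *buf, size_t avail)
{
	size_t n = 0;

	while (avail > 0) {
		size_t len = frame_len(buf, avail);

		if (len == 0 || len > avail)
			break;
		buf += len;
		avail -= len;
		n++;
	}
	return n;
}
```

In KCM only the equivalent of frame_len runs in the BPF program; walking the stream and assembling each message is done by the kernel.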
82 | ||
83 | TCP socket management | |
84 | --------------------- | |
85 | ||
86 | When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and | |
87 | write space available (POLLOUT) events are handled by the multiplexor. If there | |
88 | is a state change (disconnection) or other error on a TCP socket, an error is | |
89 | posted on the TCP socket so that a POLLERR event happens and KCM discontinues | |
90 | using the socket. When the application gets the error notification for a | |
91 | TCP socket, it should unattach the socket from KCM and then handle the error | |
92 | condition (the typical response is to close the socket and create a new | |
93 | connection if necessary). | |
94 | ||
95 | KCM limits the maximum receive message size to be the size of the receive | |
96 | socket buffer on the attached TCP socket (the socket buffer size can be set by | |
97 | SO_RCVBUF). If the length of a new message reported by the BPF program is | |
98 | greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP | |
99 | socket. The BPF program may also enforce a maximum message size and report an | |
100 | error when it is exceeded. | |
101 | ||
102 | A timeout may be set for assembling messages on a receive socket. The timeout | |
103 | value is taken from the receive timeout of the attached TCP socket (this is set | |
104 | by SO_RCVTIMEO). If the timer expires before assembly is complete, an error | |
105 | (ETIMEDOUT) is posted on the socket. | |
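Both limits are ordinary socket options on the TCP socket, so they can be set with setsockopt before the socket is attached. A minimal sketch follows; the helper name set_kcm_rx_limits and its parameters are made up for this example.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Set the receive buffer size (in bytes) and the receive timeout (in
 * seconds) on a TCP socket. Returns 0 on success, -1 on error with
 * errno set by setsockopt. */
int set_kcm_rx_limits(int tcpfd, int rcvbuf_bytes, long timeout_sec)
{
	struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

	assert(rcvbuf_bytes > 0 && timeout_sec >= 0);

	if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVBUF,
		       &rcvbuf_bytes, sizeof(rcvbuf_bytes)) < 0)
		return -1;
	return setsockopt(tcpfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

For instance, set_kcm_rx_limits(tcpfd, 65536, 5) requests a 64 KiB receive buffer and a 5 second assembly timeout. The kernel may adjust the buffer size it actually applies.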
106 | ||
107 | User interface | |
108 | ============== | |
109 | ||
110 | Creating a multiplexor | |
111 | ---------------------- | |
112 | ||
113 | A new multiplexor and an initial KCM socket are created by a socket call: | |
114 | ||
115 | socket(AF_KCM, type, protocol) | |
116 | ||
117 | - type is either SOCK_DGRAM or SOCK_SEQPACKET | |
118 | - protocol is KCMPROTO_CONNECTED | |
119 | ||
120 | Cloning KCM sockets | |
121 | ------------------- | |
122 | ||
123 | After the first KCM socket is created using the socket call as described | |
124 | above, additional sockets for the multiplexor can be created by cloning | |
125 | a KCM socket. This is accomplished by an ioctl on a KCM socket: | |
126 | ||
127 | /* From linux/kcm.h */ | |
128 | struct kcm_clone { | |
129 | int fd; | |
130 | }; | |
131 | ||
132 | struct kcm_clone info; | |
133 | ||
134 | memset(&info, 0, sizeof(info)); | |
135 | ||
136 | err = ioctl(kcmfd, SIOCKCMCLONE, &info); | |
137 | ||
138 | if (!err) | |
139 | newkcmfd = info.fd; | |
140 | ||
141 | Attach transport sockets | |
142 | ------------------------ | |
143 | ||
144 | Attaching of transport sockets to a multiplexor is performed by calling an | |
145 | ioctl on a KCM socket for the multiplexor, e.g.: | |
146 | ||
147 | /* From linux/kcm.h */ | |
148 | struct kcm_attach { | |
149 | int fd; | |
150 | int bpf_fd; | |
151 | }; | |
152 | ||
153 | struct kcm_attach info; | |
154 | ||
155 | memset(&info, 0, sizeof(info)); | |
156 | ||
157 | info.fd = tcpfd; | |
158 | info.bpf_fd = bpf_prog_fd; | |
159 | ||
160 | ioctl(kcmfd, SIOCKCMATTACH, &info); | |
161 | ||
162 | The kcm_attach structure contains: | |
163 | fd: file descriptor for TCP socket being attached | |
164 | bpf_fd: file descriptor for compiled BPF program downloaded | |
165 | ||
166 | Unattach transport sockets | |
167 | -------------------------- | |
168 | ||
169 | Unattaching a transport socket from a multiplexor is straightforward. An | |
170 | "unattach" ioctl is done with the kcm_unattach structure as the argument: | |
171 | ||
172 | /* From linux/kcm.h */ | |
173 | struct kcm_unattach { | |
174 | int fd; | |
175 | }; | |
176 | ||
177 | struct kcm_unattach info; | |
178 | ||
179 | memset(&info, 0, sizeof(info)); | |
180 | ||
181 | info.fd = tcpfd; | |
182 | ||
183 | ioctl(kcmfd, SIOCKCMUNATTACH, &info); | |
184 | ||
185 | Disabling receive on KCM socket | |
186 | ------------------------------- | |
187 | ||
188 | A setsockopt is used to disable or enable receiving on a KCM socket. | |
189 | When receive is disabled, any pending messages in the socket's | |
190 | receive buffer are moved to other sockets. This feature is useful | |
191 | if an application thread knows that it will be doing a lot of | |
192 | work on a request and won't be able to service new messages for a | |
193 | while. Example use: | |
194 | ||
195 | int val = 1; | |
196 | ||
197 | setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val)); | |
198 | ||
199 | BPF programs for message delineation | |
200 | ------------------------------------ | |
201 | ||
202 | BPF programs can be compiled using the BPF LLVM backend. For example, | |
203 | the BPF program for parsing Thrift is: | |
204 | ||
205 | #include "bpf.h" /* for __sk_buff */ | |
206 | #include "bpf_helpers.h" /* for load_word intrinsic */ | |
207 | ||
208 | SEC("socket_kcm") | |
209 | int bpf_prog1(struct __sk_buff *skb) | |
210 | { | |
211 | return load_word(skb, 0) + 4; | |
212 | } | |
213 | ||
214 | char _license[] SEC("license") = "GPL"; | |
215 | ||
216 | Use in applications | |
217 | =================== | |
218 | ||
219 | KCM accelerates application layer protocols. Specifically, it allows | |
220 | applications to use a message based interface for sending and receiving | |
221 | messages. The kernel provides necessary assurances that messages are sent | |
222 | and received atomically. This relieves much of the burden applications have | |
223 | in mapping a message based protocol onto the TCP stream. KCM also makes | |
224 | application layer messages a unit of work in the kernel for the purposes of | |
225 | steering and scheduling, which in turn allows a simpler networking model in | |
226 | multithreaded applications. | |
227 | ||
228 | Configurations | |
229 | -------------- | |
230 | ||
231 | In an Nx1 configuration, KCM logically provides multiple socket handles | |
232 | to the same TCP connection. This allows parallelism between I/O | |
233 | operations on the TCP socket (for instance copyin and copyout of data is | |
234 | parallelized). In an application, a KCM socket can be opened for each | |
235 | processing thread and added to an epoll set (similar to how SO_REUSEPORT | |
236 | is used to allow multiple listener sockets on the same port). | |
237 | ||
238 | In an NxM configuration, multiple connections are established to the | |
239 | same destination. These are used for simple load balancing. | |
240 | ||
241 | Message batching | |
242 | ---------------- | |
243 | ||
244 | The primary purpose of KCM is load balancing between KCM sockets and hence | |
245 | threads in a nominal use case. Perfect load balancing, that is steering | |
246 | each received message to a different KCM socket or steering each sent | |
247 | message to a different TCP socket, can negatively impact performance | |
248 | since this doesn't allow for affinities to be established. Balancing | |
249 | based on groups, or batches of messages, can be beneficial for performance. | |
250 | ||
251 | On transmit, there are three ways an application can batch (pipeline) | |
252 | messages on a KCM socket. | |
253 | 1) Send multiple messages in a single sendmmsg. | |
254 | 2) Send a group of messages each with a sendmsg call, where all messages | |
255 | except the last have MSG_BATCH in the flags of the sendmsg call. | |
256 | 3) Create a "super message" composed of multiple messages and send this | |
257 | with a single sendmsg. | |
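The second option above can be sketched as a small helper. The function name send_batched is hypothetical, and the fallback definition of MSG_BATCH is only an assumption for libc headers that do not provide the flag.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef MSG_BATCH
#define MSG_BATCH 0x40000	/* value from linux/socket.h; fallback for older libc headers */
#endif

/* Send n messages on fd with one sendmsg call each. Every message
 * except the last carries MSG_BATCH to signal that more follow.
 * Returns 0 on success, -1 on the first failed sendmsg. */
int send_batched(int fd, struct iovec *msgs, size_t n)
{
	size_t i;

	assert(msgs != NULL && n > 0);

	for (i = 0; i < n; i++) {
		struct msghdr hdr = { .msg_iov = &msgs[i], .msg_iovlen = 1 };
		int flags = (i + 1 < n) ? MSG_BATCH : 0;

		if (sendmsg(fd, &hdr, flags) < 0)
			return -1;
	}
	return 0;
}
```

Each iovec entry describes one complete application message. The helper does not retry or report how many messages were sent before a failure, which a real application would likely need.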
258 | ||
259 | On receive, the KCM module attempts to queue messages received on the | |
260 | same KCM socket during each TCP ready callback. The targeted KCM socket | |
261 | changes at each receive ready callback on the KCM socket. The application | |
262 | does not need to configure this. | |
263 | ||
264 | Error handling | |
265 | -------------- | |
266 | ||
267 | An application should include a thread to monitor errors raised on | |
268 | the TCP connection. Normally, this will be done by placing each | |
269 | TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR | |
270 | event. If an error occurs on an attached TCP socket, KCM sets an EPIPE | |
271 | on the socket thus waking up the application thread. When the application | |
272 | sees the error (which may just be a disconnect) it should unattach the | |
273 | socket from KCM and then close it. It is assumed that once an error is | |
274 | posted on the TCP socket the data stream is unrecoverable (i.e. an error | |
275 | may have occurred in the middle of receiving a message). | |
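A monitoring thread along these lines might register each attached TCP socket with epoll and read the pending error when it is woken. The sketch below uses the epoll spelling of the event (EPOLLERR). The helper names watch_tcp_errors and pending_socket_error are invented for this example, and reading the error through SO_ERROR is an assumption about how the application chooses to inspect it.

```c
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>

/* Add an attached TCP socket to an epoll set for error notification.
 * epoll always reports EPOLLERR and EPOLLHUP; they are listed here
 * only for readability. Returns 0 on success, -1 on error. */
int watch_tcp_errors(int epfd, int tcpfd)
{
	struct epoll_event ev;

	assert(epfd >= 0 && tcpfd >= 0);

	memset(&ev, 0, sizeof(ev));
	ev.events = EPOLLERR | EPOLLHUP;
	ev.data.fd = tcpfd;
	return epoll_ctl(epfd, EPOLL_CTL_ADD, tcpfd, &ev);
}

/* Fetch and clear the pending error on a socket.
 * Returns the errno-style error (0 if none), or -1 if getsockopt fails. */
int pending_socket_error(int fd)
{
	int err = 0;
	socklen_t len = sizeof(err);

	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
		return -1;
	return err;
}
```

After epoll_wait reports the socket, the thread would unattach it from KCM and close it as described above.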
276 | ||
277 | TCP connection monitoring | |
278 | ------------------------- | |
279 | ||
280 | In KCM there is no means to correlate a message to the TCP socket that | |
281 | was used to send or receive the message (except in the case there is | |
282 | only one attached TCP socket). However, the application does retain | |
283 | an open file descriptor to the socket so it will be able to get statistics | |
284 | from the socket which can be used in detecting issues (such as high | |
285 | retransmissions on the socket). |