Commit | Line | Data |
---|---|---|
0c5f9b88 AG |
1 | |
2 | Overview | |
3 | ======== | |
4 | ||
5 | This readme tries to provide some background on the hows and whys of RDS, | |
6 | and will hopefully help you find your way around the code. | |
7 | ||
8 | In addition, please see this email about RDS origins: | |
9 | http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html | |
10 | ||
11 | RDS Architecture | |
12 | ================ | |
13 | ||
14 | RDS provides reliable, ordered datagram delivery by using a single | |
15 | reliable connection between any two nodes in the cluster. This allows | |
16 | applications to use a single socket to talk to any other process in the | |
17 | cluster - so in a cluster with N processes you need N sockets, in contrast | |
18 | to N*N if you use a connection-oriented socket transport like TCP. | |
19 | ||
20 | RDS is not Infiniband-specific; it was designed to support different | |
21 | transports. The current implementation used to support RDS over TCP as well | |
22 | as IB. Work is in progress to support RDS over iWARP, and using DCE to | |
23 | guarantee no dropped packets on Ethernet, it may be possible to use RDS over | |
24 | UDP in the future. | |
25 | ||
26 | The high-level semantics of RDS from the application's point of view are | |
27 | ||
28 | * Addressing | |
29 | RDS uses IPv4 addresses and 16bit port numbers to identify | |
30 | the end point of a connection. All socket operations that involve | |
31 | passing addresses between kernel and user space generally | |
32 | use a struct sockaddr_in. | |
33 | ||
34 | The fact that IPv4 addresses are used does not mean the underlying | |
35 | transport has to be IP-based. In fact, RDS over IB uses a | |
36 | reliable IB connection; the IP address is used exclusively to | |
37 | locate the remote node's GID (by ARPing for the given IP). | |
38 | ||
39 | The port space is entirely independent of UDP, TCP or any other | |
40 | protocol. | |
41 | ||
42 | * Socket interface | |
43 | RDS sockets work *mostly* as you would expect from a BSD | |
44 | socket. The next section will cover the details. At any rate, | |
45 | all I/O is performed through the standard BSD socket API. | |
46 | Some additions like zerocopy support are implemented through | |
47 | control messages, while other extensions use the getsockopt/ | |
48 | setsockopt calls. | |
49 | ||
50 | Sockets must be bound before you can send or receive data. | |
51 | This is needed because binding also selects a transport and | |
52 | attaches it to the socket. Once bound, the transport assignment | |
53 | does not change. RDS will tolerate IPs moving around (eg in | |
54 | a active-active HA scenario), but only as long as the address | |
55 | doesn't move to a different transport. | |
56 | ||
57 | * sysctls | |
58 | RDS supports a number of sysctls in /proc/sys/net/rds | |
59 | ||
60 | ||
61 | Socket Interface | |
62 | ================ | |
63 | ||
64 | AF_RDS, PF_RDS, SOL_RDS | |
ebe96e64 SV |
65 | AF_RDS and PF_RDS are the domain type to be used with socket(2) |
66 | to create RDS sockets. SOL_RDS is the socket-level to be used | |
67 | with setsockopt(2) and getsockopt(2) for RDS specific socket | |
68 | options. | |
0c5f9b88 AG |
69 | |
70 | fd = socket(PF_RDS, SOCK_SEQPACKET, 0); | |
71 | This creates a new, unbound RDS socket. | |
72 | ||
73 | setsockopt(SOL_SOCKET): send and receive buffer size | |
74 | RDS honors the send and receive buffer size socket options. | |
75 | You are not allowed to queue more than SO_SNDSIZE bytes to | |
76 | a socket. A message is queued when sendmsg is called, and | |
77 | it leaves the queue when the remote system acknowledges | |
78 | its arrival. | |
79 | ||
80 | The SO_RCVSIZE option controls the maximum receive queue length. | |
81 | This is a soft limit rather than a hard limit - RDS will | |
82 | continue to accept and queue incoming messages, even if that | |
83 | takes the queue length over the limit. However, it will also | |
84 | mark the port as "congested" and send a congestion update to | |
85 | the source node. The source node is supposed to throttle any | |
86 | processes sending to this congested port. | |
87 | ||
88 | bind(fd, &sockaddr_in, ...) | |
89 | This binds the socket to a local IP address and port, and a | |
90 | transport. | |
91 | ||
92 | sendmsg(fd, ...) | |
93 | Sends a message to the indicated recipient. The kernel will | |
94 | transparently establish the underlying reliable connection | |
95 | if it isn't up yet. | |
96 | ||
97 | An attempt to send a message that exceeds SO_SNDSIZE will | |
98 | return with -EMSGSIZE | |
99 | ||
100 | An attempt to send a message that would take the total number | |
101 | of queued bytes over the SO_SNDSIZE threshold will return | |
102 | EAGAIN. | |
103 | ||
104 | An attempt to send a message to a destination that is marked | |
105 | as "congested" will return ENOBUFS. | |
106 | ||
107 | recvmsg(fd, ...) | |
108 | Receives a message that was queued to this socket. The sockets | |
109 | recv queue accounting is adjusted, and if the queue length | |
110 | drops below SO_SNDSIZE, the port is marked uncongested, and | |
111 | a congestion update is sent to all peers. | |
112 | ||
113 | Applications can ask the RDS kernel module to receive | |
114 | notifications via control messages (for instance, there is a | |
115 | notification when a congestion update arrived, or when a RDMA | |
116 | operation completes). These notifications are received through | |
117 | the msg.msg_control buffer of struct msghdr. The format of the | |
118 | messages is described in manpages. | |
119 | ||
120 | poll(fd) | |
121 | RDS supports the poll interface to allow the application | |
122 | to implement async I/O. | |
123 | ||
124 | POLLIN handling is pretty straightforward. When there's an | |
125 | incoming message queued to the socket, or a pending notification, | |
126 | we signal POLLIN. | |
127 | ||
128 | POLLOUT is a little harder. Since you can essentially send | |
129 | to any destination, RDS will always signal POLLOUT as long as | |
130 | there's room on the send queue (ie the number of bytes queued | |
131 | is less than the sendbuf size). | |
132 | ||
133 | However, the kernel will refuse to accept messages to | |
134 | a destination marked congested - in this case you will loop | |
135 | forever if you rely on poll to tell you what to do. | |
136 | This isn't a trivial problem, but applications can deal with | |
137 | this - by using congestion notifications, and by checking for | |
138 | ENOBUFS errors returned by sendmsg. | |
139 | ||
140 | setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in) | |
141 | This allows the application to discard all messages queued to a | |
142 | specific destination on this particular socket. | |
143 | ||
144 | This allows the application to cancel outstanding messages if | |
145 | it detects a timeout. For instance, if it tried to send a message, | |
146 | and the remote host is unreachable, RDS will keep trying forever. | |
147 | The application may decide it's not worth it, and cancel the | |
148 | operation. In this case, it would use RDS_CANCEL_SENT_TO to | |
149 | nuke any pending messages. | |
150 | ||
151 | ||
152 | RDMA for RDS | |
153 | ============ | |
154 | ||
155 | see rds-rdma(7) manpage (available in rds-tools) | |
156 | ||
157 | ||
158 | Congestion Notifications | |
159 | ======================== | |
160 | ||
161 | see rds(7) manpage | |
162 | ||
163 | ||
164 | RDS Protocol | |
165 | ============ | |
166 | ||
167 | Message header | |
168 | ||
169 | The message header is a 'struct rds_header' (see rds.h): | |
170 | Fields: | |
171 | h_sequence: | |
172 | per-packet sequence number | |
173 | h_ack: | |
174 | piggybacked acknowledgment of last packet received | |
175 | h_len: | |
176 | length of data, not including header | |
177 | h_sport: | |
178 | source port | |
179 | h_dport: | |
180 | destination port | |
181 | h_flags: | |
182 | CONG_BITMAP - this is a congestion update bitmap | |
183 | ACK_REQUIRED - receiver must ack this packet | |
184 | RETRANSMITTED - packet has previously been sent | |
185 | h_credit: | |
186 | indicate to other end of connection that | |
187 | it has more credits available (i.e. there is | |
188 | more send room) | |
189 | h_padding[4]: | |
190 | unused, for future use | |
191 | h_csum: | |
192 | header checksum | |
193 | h_exthdr: | |
194 | optional data can be passed here. This is currently used for | |
195 | passing RDMA-related information. | |
196 | ||
197 | ACK and retransmit handling | |
198 | ||
199 | One might think that with reliable IB connections you wouldn't need | |
200 | to ack messages that have been received. The problem is that IB | |
201 | hardware generates an ack message before it has DMAed the message | |
202 | into memory. This creates a potential message loss if the HCA is | |
203 | disabled for any reason between when it sends the ack and before | |
204 | the message is DMAed and processed. This is only a potential issue | |
205 | if another HCA is available for fail-over. | |
206 | ||
207 | Sending an ack immediately would allow the sender to free the sent | |
208 | message from their send queue quickly, but could cause excessive | |
209 | traffic to be used for acks. RDS piggybacks acks on sent data | |
210 | packets. Ack-only packets are reduced by only allowing one to be | |
211 | in flight at a time, and by the sender only asking for acks when | |
212 | its send buffers start to fill up. All retransmissions are also | |
213 | acked. | |
214 | ||
215 | Flow Control | |
216 | ||
217 | RDS's IB transport uses a credit-based mechanism to verify that | |
218 | there is space in the peer's receive buffers for more data. This | |
219 | eliminates the need for hardware retries on the connection. | |
220 | ||
221 | Congestion | |
222 | ||
223 | Messages waiting in the receive queue on the receiving socket | |
224 | are accounted against the sockets SO_RCVBUF option value. Only | |
225 | the payload bytes in the message are accounted for. If the | |
226 | number of bytes queued equals or exceeds rcvbuf then the socket | |
227 | is congested. All sends attempted to this socket's address | |
228 | should return block or return -EWOULDBLOCK. | |
229 | ||
230 | Applications are expected to be reasonably tuned such that this | |
231 | situation very rarely occurs. An application encountering this | |
232 | "back-pressure" is considered a bug. | |
233 | ||
234 | This is implemented by having each node maintain bitmaps which | |
235 | indicate which ports on bound addresses are congested. As the | |
236 | bitmap changes it is sent through all the connections which | |
237 | terminate in the local address of the bitmap which changed. | |
238 | ||
239 | The bitmaps are allocated as connections are brought up. This | |
240 | avoids allocation in the interrupt handling path which queues | |
241 | sages on sockets. The dense bitmaps let transports send the | |
242 | entire bitmap on any bitmap change reasonably efficiently. This | |
243 | is much easier to implement than some finer-grained | |
244 | communication of per-port congestion. The sender does a very | |
245 | inexpensive bit test to test if the port it's about to send to | |
246 | is congested or not. | |
247 | ||
248 | ||
249 | RDS Transport Layer | |
250 | ================== | |
251 | ||
252 | As mentioned above, RDS is not IB-specific. Its code is divided | |
253 | into a general RDS layer and a transport layer. | |
254 | ||
255 | The general layer handles the socket API, congestion handling, | |
256 | loopback, stats, usermem pinning, and the connection state machine. | |
257 | ||
258 | The transport layer handles the details of the transport. The IB | |
259 | transport, for example, handles all the queue pairs, work requests, | |
260 | CM event handlers, and other Infiniband details. | |
261 | ||
262 | ||
263 | RDS Kernel Structures | |
264 | ===================== | |
265 | ||
266 | struct rds_message | |
267 | aka possibly "rds_outgoing", the generic RDS layer copies data to | |
268 | be sent and sets header fields as needed, based on the socket API. | |
269 | This is then queued for the individual connection and sent by the | |
270 | connection's transport. | |
271 | struct rds_incoming | |
272 | a generic struct referring to incoming data that can be handed from | |
273 | the transport to the general code and queued by the general code | |
274 | while the socket is awoken. It is then passed back to the transport | |
275 | code to handle the actual copy-to-user. | |
276 | struct rds_socket | |
277 | per-socket information | |
278 | struct rds_connection | |
279 | per-connection information | |
280 | struct rds_transport | |
281 | pointers to transport-specific functions | |
282 | struct rds_statistics | |
283 | non-transport-specific statistics | |
284 | struct rds_cong_map | |
285 | wraps the raw congestion bitmap, contains rbnode, waitq, etc. | |
286 | ||
287 | Connection management | |
288 | ===================== | |
289 | ||
290 | Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and | |
291 | ERROR states. | |
292 | ||
293 | The first time an attempt is made by an RDS socket to send data to | |
294 | a node, a connection is allocated and connected. That connection is | |
295 | then maintained forever -- if there are transport errors, the | |
296 | connection will be dropped and re-established. | |
297 | ||
298 | Dropping a connection while packets are queued will cause queued or | |
299 | partially-sent datagrams to be retransmitted when the connection is | |
300 | re-established. | |
301 | ||
302 | ||
303 | The send path | |
304 | ============= | |
305 | ||
306 | rds_sendmsg() | |
307 | struct rds_message built from incoming data | |
308 | CMSGs parsed (e.g. RDMA ops) | |
309 | transport connection alloced and connected if not already | |
310 | rds_message placed on send queue | |
311 | send worker awoken | |
312 | rds_send_worker() | |
313 | calls rds_send_xmit() until queue is empty | |
314 | rds_send_xmit() | |
315 | transmits congestion map if one is pending | |
316 | may set ACK_REQUIRED | |
317 | calls transport to send either non-RDMA or RDMA message | |
318 | (RDMA ops never retransmitted) | |
319 | rds_ib_xmit() | |
320 | allocs work requests from send ring | |
321 | adds any new send credits available to peer (h_credits) | |
322 | maps the rds_message's sg list | |
323 | piggybacks ack | |
324 | populates work requests | |
325 | post send to connection's queue pair | |
326 | ||
327 | The recv path | |
328 | ============= | |
329 | ||
330 | rds_ib_recv_cq_comp_handler() | |
331 | looks at write completions | |
332 | unmaps recv buffer from device | |
333 | no errors, call rds_ib_process_recv() | |
334 | refill recv ring | |
335 | rds_ib_process_recv() | |
336 | validate header checksum | |
337 | copy header to rds_ib_incoming struct if start of a new datagram | |
338 | add to ibinc's fraglist | |
339 | if competed datagram: | |
340 | update cong map if datagram was cong update | |
341 | call rds_recv_incoming() otherwise | |
342 | note if ack is required | |
343 | rds_recv_incoming() | |
344 | drop duplicate packets | |
345 | respond to pings | |
346 | find the sock associated with this datagram | |
347 | add to sock queue | |
348 | wake up sock | |
349 | do some congestion calculations | |
350 | rds_recvmsg | |
351 | copy data into user iovec | |
352 | handle CMSGs | |
353 | return to application | |
354 | ||
355 |