
Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

There is a *little* bit of extra documentation available. The rds-tools
package has two manpages, rds(7) and rds-rdma(7), that describe the
interface a little. If you search the rds-devel archives, Rick Frank posted
a bunch of messages discussing the ideas behind RDS between early and
mid November 2007. Not all of that material still applies 100% to the
current code - for instance we no longer have RDMA barriers. But it may
be helpful.

In particular, there's a message dated Nov 15, subject "What is RDS and
why did we build it?" which contains a doc file with some motivation on
the design.

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.
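
As a rough illustration of the socket counts quoted above, the following
sketch compares the two totals. The function names are made up for this
example, and the N*N figure is taken from the text as-is.

```c
/* Hypothetical helpers; the names are illustrative only. */

/* One RDS socket per process reaches every peer in the cluster. */
unsigned long rds_sockets_needed(unsigned long nprocs)
{
	return nprocs;
}

/* Connection-oriented transport: the N*N figure quoted in the text. */
unsigned long connected_sockets_needed(unsigned long nprocs)
{
	return nprocs * nprocs;
}
```

For a cluster of 64 processes this gives 64 sockets with RDS against 4096
with a connection-oriented transport.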

RDS is not InfiniBand-specific; it was designed to support different
transports.  The current implementation supports RDS over IB as well as
TCP. Work is in progress to support RDS over iWARP.

The high-level semantics of RDS from the application's point of view are:

 *	Addressing
	RDS uses IPv4 addresses and 16-bit port numbers to identify
	the end point of a connection. Socket operations that pass
	addresses between kernel and user space generally use a
	struct sockaddr_in.

	The fact that IPv4 addresses are used does not mean the underlying
	transport has to be IP-based. In fact, RDS over IB uses a
	reliable IB connection; the IP address is used exclusively to
	locate the remote node's GID (by ARPing for the given IP).

	The port space is entirely independent of UDP, TCP or any other
	protocol.

 *	Socket interface
	RDS sockets work *mostly* as you would expect from a BSD
	socket. The next section will cover the details. At any rate,
	all I/O is performed through the standard BSD socket API.
	Some additions like zerocopy support are implemented through
	control messages, while other extensions use the getsockopt/
	setsockopt calls.

	Sockets must be bound before you can send or receive data.
	This is needed because binding also selects a transport and
	attaches it to the socket. Once bound, the transport assignment
	does not change. RDS will tolerate IPs moving around (e.g. in
	an active-active HA scenario), but only as long as the address
	doesn't move to a different transport.

 *	sysctls
	RDS supports a number of sysctls in /proc/sys/net/rds.

Socket Interface
================

AF_RDS, PF_RDS, SOL_RDS
	These constants haven't been assigned yet, because RDS isn't in
	mainline yet. Currently, the kernel module assigns some constants
	and publishes them to user space through two sysctl files:
		/proc/sys/net/rds/pf_rds
		/proc/sys/net/rds/sol_rds

fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
	This creates a new, unbound RDS socket.
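
A minimal sketch of how an application might pick up the protocol family
from the sysctl file named above and create the socket. The helper names
read_int_file() and rds_socket_create() are hypothetical; only the file
path and the socket() arguments come from this document. If your system
headers already define PF_RDS, using that constant directly is simpler.

```c
#include <stdio.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: read one integer from a sysctl-style file.
 * Returns 0 on success, -1 if the file is missing or malformed.
 */
int read_int_file(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Hypothetical helper: create a new, unbound RDS socket.
 * Returns the file descriptor, or -1 if RDS is not available.
 */
int rds_socket_create(void)
{
	int pf_rds;

	if (read_int_file("/proc/sys/net/rds/pf_rds", &pf_rds) < 0)
		return -1;
	return socket(pf_rds, SOCK_SEQPACKET, 0);
}
```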

setsockopt(SOL_SOCKET): send and receive buffer size
	RDS honors the send and receive buffer size socket options.
	You are not allowed to queue more than SO_SNDBUF bytes to
	a socket. A message is queued when you call sendmsg, and
	it leaves the queue when the remote system acknowledges
	its arrival.

	The SO_RCVBUF option controls the maximum receive queue length.
	This is a soft limit rather than a hard limit - RDS will
	continue to accept and queue incoming messages, even if that
	takes the queue length over the limit. However, it will also
	mark the port as "congested" and send a congestion update to
	the source node. The source node is supposed to throttle any
	processes sending to this congested port.
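
The buffer limits can be set before any traffic is sent. The sketch below
assumes that the send and receive buffer size options described above are
the standard SOL_SOCKET options SO_SNDBUF and SO_RCVBUF; the wrapper name
rds_set_buffers() is hypothetical.

```c
#include <sys/socket.h>

/*
 * Hypothetical helper: set the send and receive buffer sizes in bytes.
 * Assumes the standard SO_SNDBUF and SO_RCVBUF socket options.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt).
 */
int rds_set_buffers(int fd, int sndbuf, int rcvbuf)
{
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
		return -1;
	return 0;
}
```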

bind(fd, &sockaddr_in, ...)
	This binds the socket to a local IP address and port, and a
	transport.
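
A minimal sketch of filling in the address and binding, assuming a dotted
quad local address and the usual AF_INET family value in the sockaddr_in.
The helper names rds_make_addr() and rds_bind() are hypothetical.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: build an RDS end point from an IPv4 address
 * string and a 16-bit port. Returns 0 on success, -1 on a bad address.
 */
int rds_make_addr(struct sockaddr_in *sin, const char *ip, uint16_t port)
{
	memset(sin, 0, sizeof(*sin));
	sin->sin_family = AF_INET;	/* assumption: usual IPv4 family value */
	sin->sin_port = htons(port);
	if (inet_pton(AF_INET, ip, &sin->sin_addr) != 1) {
		errno = EINVAL;
		return -1;
	}
	return 0;
}

/*
 * Hypothetical helper: bind an RDS socket to a local address and port.
 * Returns 0 on success, -1 on failure.
 */
int rds_bind(int fd, const char *ip, uint16_t port)
{
	struct sockaddr_in sin;

	if (rds_make_addr(&sin, ip, port) < 0)
		return -1;
	return bind(fd, (struct sockaddr *)&sin, sizeof(sin));
}
```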

sendmsg(fd, ...)
	Sends a message to the indicated recipient. The kernel will
	transparently establish the underlying reliable connection
	if it isn't up yet.

	An attempt to send a message that exceeds SO_SNDBUF will
	return EMSGSIZE.

	An attempt to send a message that would take the total number
	of queued bytes over the SO_SNDBUF threshold will return
	EAGAIN.

	An attempt to send a message to a destination that is marked
	as "congested" will return ENOBUFS.
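
The three error cases above call for different reactions, so it can help
to classify them in one place. The following is a sketch only; the enum
and the functions rds_classify_errno() and rds_send() are hypothetical
names, and the destination is passed through msg_name as with any
datagram socket.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

enum rds_send_result {
	RDS_SEND_OK,
	RDS_SEND_TOO_BIG,	/* EMSGSIZE: message exceeds the send buffer */
	RDS_SEND_QUEUE_FULL,	/* EAGAIN: send queue would overflow */
	RDS_SEND_CONGESTED,	/* ENOBUFS: destination port is congested */
	RDS_SEND_ERROR		/* anything else */
};

/* Map an errno value from sendmsg to one of the cases in the text. */
enum rds_send_result rds_classify_errno(int err)
{
	switch (err) {
	case 0:
		return RDS_SEND_OK;
	case EMSGSIZE:
		return RDS_SEND_TOO_BIG;
	case EAGAIN:
		return RDS_SEND_QUEUE_FULL;
	case ENOBUFS:
		return RDS_SEND_CONGESTED;
	default:
		return RDS_SEND_ERROR;
	}
}

/* Send one datagram to dest and report which case applied. */
enum rds_send_result rds_send(int fd, const struct sockaddr_in *dest,
			      const void *buf, size_t len)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg;

	memset(&msg, 0, sizeof(msg));
	msg.msg_name = (void *)dest;
	msg.msg_namelen = sizeof(*dest);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;

	if (sendmsg(fd, &msg, 0) < 0)
		return rds_classify_errno(errno);
	return RDS_SEND_OK;
}
```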

recvmsg(fd, ...)
	Receives a message that was queued to this socket. The socket's
	receive queue accounting is adjusted, and if the queue length
	drops below SO_RCVBUF, the port is marked uncongested, and
	a congestion update is sent to all peers.

	Applications can ask the RDS kernel module to receive
	notifications via control messages (for instance, there is a
	notification when a congestion update arrives, or when an RDMA
	operation completes). These notifications are received through
	the msg.msg_control buffer of struct msghdr. The format of the
	messages is described in the manpages.
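
A sketch of receiving a message together with any control messages. It
only counts the control messages at the RDS level and does not decode
them, since their layout is defined in the manpages rather than here. The
function names, the 256-byte control buffer, and passing the SOL_RDS value
as a parameter are assumptions made for this example.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Count the control messages in msg that carry the given level. */
int rds_count_notifications(struct msghdr *msg, int sol_rds)
{
	struct cmsghdr *cmsg;
	int count = 0;

	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg))
		if (cmsg->cmsg_level == sol_rds)
			count++;
	return count;
}

/*
 * Receive one message into buf and report how many RDS-level control
 * messages came with it. Returns the payload length, or -1 on error.
 */
ssize_t rds_recv(int fd, void *buf, size_t len, int sol_rds,
		 int *notifications)
{
	/* The union keeps the control buffer aligned for struct cmsghdr. */
	union {
		char buf[256];
		struct cmsghdr align;
	} ctl;
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg;
	ssize_t n;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	n = recvmsg(fd, &msg, 0);
	if (n < 0)
		return -1;
	*notifications = rds_count_notifications(&msg, sol_rds);
	return n;
}
```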

poll(fd)
	RDS supports the poll interface to allow the application
	to implement async I/O.

	POLLIN handling is pretty straightforward. When there's an
	incoming message queued to the socket, or a pending notification,
	we signal POLLIN.

	POLLOUT is a little harder. Since you can essentially send
	to any destination, RDS will always signal POLLOUT as long as
	there's room on the send queue (i.e. the number of bytes queued
	is less than the sendbuf size).
	
	However, the kernel will refuse to accept messages to
	a destination marked congested - in this case you will loop
	forever if you rely on poll to tell you what to do.
	This isn't a trivial problem, but applications can deal with
	this - by using congestion notifications, and by checking for
	ENOBUFS errors returned by sendmsg.
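
One way to avoid the busy loop described above is to choose what to wait
for based on the error sendmsg returned. This is a sketch under the
assumption that congestion notifications have already been enabled as
described in the manpages, so that a congestion update shows up as a
pending notification and therefore as POLLIN. Both function names are
hypothetical.

```c
#include <errno.h>
#include <poll.h>

/*
 * Decide which poll event to wait for after sendmsg failed with err.
 * EAGAIN means the send queue is full, so POLLOUT is meaningful.
 * ENOBUFS means the destination is congested; POLLOUT would fire
 * right away, so wait for a notification (POLLIN) instead.
 * Returns 0 for errors that waiting cannot fix.
 */
short rds_events_after_send_error(int err)
{
	if (err == EAGAIN)
		return POLLOUT;
	if (err == ENOBUFS)
		return POLLIN;
	return 0;
}

/*
 * Wait up to timeout_ms for the given events on fd.
 * Returns the revents mask, 0 on timeout, or -1 on error.
 */
int rds_wait(int fd, short events, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };
	int rc = poll(&pfd, 1, timeout_ms);

	if (rc < 0)
		return -1;
	return rc == 0 ? 0 : pfd.revents;
}
```

A sender would call rds_wait(fd, rds_events_after_send_error(errno), ...)
after a failed sendmsg and retry once the wait returns, using a finite
timeout so that a missed notification cannot stall it indefinitely.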

setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
	This allows the application to discard all messages queued to a
	specific destination on this particular socket.

	This allows the application to cancel outstanding messages if
	it detects a timeout. For instance, if it tried to send a message,
	and the remote host is unreachable, RDS will keep trying forever.
	The application may decide it's not worth it, and cancel the
	operation. In this case, it would use RDS_CANCEL_SENT_TO to
	nuke any pending messages.
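
A sketch of the cancel call. The SOL_RDS value is passed in as a parameter
because this document says it is published through a sysctl file. The
fallback definition of RDS_CANCEL_SENT_TO and the wrapper name
rds_cancel_sent_to() are assumptions; prefer the definition from the RDS
header installed on your system.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Assumption: the option number normally comes from the RDS header
 * shipped with rds-tools or the kernel. The value below is only a
 * fallback for this example.
 */
#ifndef RDS_CANCEL_SENT_TO
#define RDS_CANCEL_SENT_TO 1
#endif

/*
 * Hypothetical helper: discard all messages queued on fd for dest.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt).
 */
int rds_cancel_sent_to(int fd, int sol_rds, const struct sockaddr_in *dest)
{
	return setsockopt(fd, sol_rds, RDS_CANCEL_SENT_TO,
			  dest, sizeof(*dest));
}
```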

RDMA for RDS
============

See the rds-rdma(7) manpage for now.

Congestion Notifications
========================

See the rds(7) and rds-rdma(7) manpages for now.

RDS Protocol
============

  Message header
  ACK and retransmit handling
  Cancellation
  Congestion Control

RDS Kernel Structures
=====================

  struct rds_socket
  struct rds_connection
  struct rds_transport
  rds work structs: send, recv, conn

Connection management
=====================

  Connection states
  Taking connections up and down

The send path
=============

  Zero-wait send path
  Using trans->xmit

The recv path
=============

  Receiving congestion updates

RDS over InfiniBand
===================

  ib_cm
  rds_ib_xmit
  ib_rdma
