Understanding the Transport Layer and Its Protocols

The transport layer runs only on end hosts. It sits between the application and network layers and enables communication between processes on different hosts via a transport protocol. A process is an instance of an application, e.g., a browser or a Netflix client. While the network layer provides communication between hosts, transport protocols provide logical communication between processes. The common transport protocols are TCP (Transmission Control Protocol) and UDP (User Datagram Protocol). TCP provides connection-oriented, reliable, sequenced communication; HTTP uses it. UDP provides connectionless, unreliable message delivery; DNS uses it. Transport protocols live in the OS kernel because they are common to many applications, while applications run in user space. Sockets are an abstraction for a communication endpoint, just as a file is an abstraction for data on disk. Application processes create sockets and send and receive data on them: SOCK_STREAM is a TCP socket, SOCK_DGRAM a UDP socket. A single machine can run multiple processes; each process creates its own socket(s), and each may communicate with different machines. The transport layer carries outgoing traffic from multiple sockets and delivers incoming packets to the right sockets via multiplexing and demultiplexing.

Multiplexing (sender side): handle data from multiple sockets and add a transport header. Demultiplexing (receiver side): use header information to deliver received segments to the correct socket. Transport protocols do this by storing source and destination port numbers in the packet header to identify sockets. Demultiplexing thus associates each incoming packet with a socket at the receiver. TCP demultiplexing uses the IP source and destination addresses together with the TCP source and destination port numbers; this 4-tuple uniquely identifies the connection. UDP demultiplexing uses only the destination port, since UDP is connectionless.
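The demultiplexing rules above can be sketched as two lookup tables, keyed the way each protocol identifies a socket. This is a toy sketch, not a kernel implementation; the socket names and addresses are hypothetical.

```python
# Sketch of transport-layer demultiplexing: UDP keys on the destination
# port alone, TCP keys on the full connection 4-tuple.

udp_sockets = {}   # key: destination port (UDP is connectionless)
tcp_sockets = {}   # key: (src_ip, src_port, dst_ip, dst_port)

def deliver_udp(dst_port, payload):
    """Find the UDP socket bound to dst_port, if any."""
    return udp_sockets.get(dst_port)

def deliver_tcp(src_ip, src_port, dst_ip, dst_port):
    """Find the TCP socket for this exact connection 4-tuple."""
    return tcp_sockets.get((src_ip, src_port, dst_ip, dst_port))

# Two TCP connections to the same local port 80 map to different sockets:
tcp_sockets[("1.2.3.4", 5000, "9.9.9.9", 80)] = "sock_A"
tcp_sockets[("5.6.7.8", 6000, "9.9.9.9", 80)] = "sock_B"
udp_sockets[53] = "dns_sock"
```

Note that both connections share destination port 80; only the full 4-tuple tells them apart, which is why UDP's port-only lookup would not work for TCP.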

User Datagram Protocol (UDP) provides minimal transport: a best-effort service model, demultiplexing via 16-bit ports, and a checksum to verify data. It is used to implement user-level transport protocols and fast protocols with small messages, like DNS. TCP Service Model: Best-effort is not sufficient for many applications. TCP provides reliable, in-order delivery: packets are guaranteed to be delivered in the order sent; without TCP, applications would need to implement this themselves. However, TCP provides no performance guarantees, which are hard to give on the Internet. The TCP service model has sender- and receiver-side code, implements the transport protocol, and uses a collection of techniques to ensure reliable transport.

Reliable transport is difficult because TCP runs over IP, which provides best-effort delivery with no reliability guarantee. Effects and the techniques that handle them: corruption/checksums, reordering/sequence numbers, duplication/sequence numbers, drops/acknowledgements and retransmissions, delays/timers. A checksum is a single number computed by a function over part or all of a packet. It detects corruption as follows: 1. the sender computes checksum C and includes it in the packet; 2. the receiver computes checksum C2; 3. the packet is corrupted if C != C2. The checksum function must be fast. The Internet checksum uses the 16-bit one's complement sum. Algorithm: 1. sum all 16-bit words; 2. add back any carry-over bits; 3. take the complement of the resulting sum. The checksum may not catch all bit flips, but it is good enough given the low rate of packet corruption; this is an example of an engineering trade-off. IP computes its checksum over the IP header only, excluding some fields. TCP computes its checksum over the TCP header, the payload, and some IP header fields.
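The three-step algorithm above is short enough to implement directly; a minimal sketch:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement checksum, per the algorithm above:
    sum all 16-bit words, fold any carry-over bits back in, then
    take the one's complement of the result."""
    if len(data) % 2:                      # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                     # step 2: add back carry-over bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF                 # step 3: complement
```

The receiver's check falls out of the arithmetic: recomputing the checksum over the data together with the stored checksum yields 0 when nothing was corrupted.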

The sender learns whether the checksums matched from what the receiver does. If the checksum matches, the receiver sends an acknowledgement (ACK), a control packet that carries no data but lets the receiver send information back to the sender. If the checksum does not match, the receiver can send a negative acknowledgement (NACK). Sequence numbers are assigned to packets and indicate each packet's position in the sequence of packets from sender to receiver. An ACK or NACK also carries the sequence number of the packet whose receipt it ACKs or NACKs; this is what matches ACKs to packets. Reordering occurs when packet n is received before packet m even though m was sent first. Duplication occurs when two packets with the same sequence number n are received; the receiver can drop one.

Timers and Retransmissions: 1. After sending packet n, 2. set a timer for time T. 3. If an ACK is received before T expires, cancel the timer. 4. Else, retransmit packet n. This works when a packet is lost, and also when the ACK is lost. If T is too small, the receiver may get duplicates, because the previous packet may only have been delayed. If T is too large, the sender sits idle and throughput drops, since the network is used less efficiently. Stop-and-wait, with sender (s) and receiver (r): 1. (s) checksum and send packet n, set timer. 2. (r) receive packet. 3. (r) verify checksum. 4. (r) if it matches, send ACK, else NACK. 5. (s) on ACK, n++, cancel timer, go to step 1. 6. (s) on NACK or timeout, repeat step 1. Stop-and-wait is inefficient: the time to receive an ACK is one round-trip time (RTT), so with a large RTT on a fast network the sender cannot transmit data even though the network could accommodate it. A transport protocol must be both reliable and efficient.
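The stop-and-wait loop above can be sketched in a few lines. This is a toy model, not real socket code: the `channel` callback is a hypothetical stand-in for the network, collapsing "ACK arrived before the timer T" into a single boolean.

```python
def stop_and_wait_send(packets, channel):
    """Stop-and-wait sender sketch: send packet n, wait for its ACK,
    retransmit on timeout/NACK, only then advance to n+1.
    channel(pkt) -> True models 'ACK received before timer T fired'."""
    sends = 0
    for n, pkt in enumerate(packets):
        while True:
            sends += 1                    # (re)transmit and set timer
            if channel((n, pkt)):         # ACK arrived: cancel timer
                break                     # advance to the next packet
    return sends

# A channel that loses every other transmission:
state = {"count": 0}
def flaky(pkt):
    state["count"] += 1
    return state["count"] % 2 == 0

total = stop_and_wait_send(["a", "b", "c"], flaky)   # each packet sent twice
```

The inefficiency is visible in the structure: the sender is blocked inside the inner loop for a full RTT per packet, no matter how fast the link is.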

Sliding window protocols are more efficient: send up to n unacknowledged packets, i.e., a sliding window of n packets. Packets fall into four groups: 1. sent and acknowledged; 2. sent but not acknowledged; 3. not yet sent, but within the window; 4. beyond the window. The window slides when the receiver ACKs a packet, and from the ACKs the sender may decide to retransmit if packet(s) are lost. Cumulative ACK: the ACK carries a single sequence number m telling the sender that the receiver has received packets 1…m. If packet m+1 is lost but m+2 is received, the receiver sends a cumulative ACK for m when it gets m+2: a duplicate ACK. The downside is that the sender cannot tell which packet(s) beyond m were lost. The receiver only needs to remember the highest cumulative sequence number received so far and can choose to drop out-of-order packets, which makes this simple to implement. Go-Back-N: with cumulative ACKs, when a packet is lost the sender gets a duplicate ACK, but it cannot tell whether the packet was lost or the network duplicated the ACK. Loss recovery strategy: one timer for the earliest unacknowledged packet; on timeout, retransmit all unacknowledged in-flight packets. Selective ACKs are an alternative to cumulative ACKs: the receiver individually ACKs packets even when they arrive out of order. This is better than cumulative ACKs because the sender knows exactly which packets were received. Selective repeat is loss recovery using selective ACKs: the sender maintains a timer for every packet and retransmits only that packet when its timer fires. It must wait for the timer: if packets k-1 and k+1 have been acknowledged, the sender cannot assume packet k is lost, since it could merely be delayed. Go-Back-N is easier to implement but less efficient; selective repeat is more complex to implement but more efficient.
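Go-Back-N's cost, retransmitting everything in flight after the gap, can be made concrete with a small sketch. This is an assumption-laden toy model: each packet in `loss` is dropped only on its first transmission, and timers/ACK timing are abstracted into one send-the-window loop per round.

```python
def go_back_n(num_packets, window, loss=frozenset()):
    """Go-Back-N sender sketch. Returns total transmissions, counting
    retransmitted copies. Packets in `loss` are dropped once."""
    base, sends, dropped = 0, 0, set()
    while base < num_packets:
        end = min(base + window, num_packets)
        lost = None
        for n in range(base, end):          # transmit the whole window
            sends += 1
            if n in loss and n not in dropped:
                dropped.add(n)
                if lost is None:
                    lost = n                # first gap in the sequence
        if lost is None:
            base = end                      # cumulative ACK slides the window
        else:
            base = lost                     # timeout: go back to the gap
    return sends
```

With a window of 4 over 6 packets and packet 2 lost, the sender transmits 0–3, times out, and resends 2–5: 8 transmissions instead of 6, even though only one packet was actually lost. Selective repeat would resend only packet 2.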

TCP is connection-oriented and provides reliable, in-order, byte-stream delivery. The TCP header is composed of 32-bit rows, with values stored in binary. TCP uses checksums, sequence numbers, and acknowledgements, all reflected in the header. The checksum field holds an Internet checksum computed over a pseudo-header and the payload; the sender adds the checksum, and the receiver checks it. The pseudo-header includes the TCP header plus some IP header fields, e.g., the source and destination IP addresses and the protocol, which are needed for demultiplexing. Sequence numbers are byte numbers in the byte stream, not packet numbers, and acknowledgements likewise specify byte numbers. Applications see only a sequence of bytes being sent and received: the sender might send N bytes, and the receiver might get M bytes first, then the remaining N-M bytes. TCP accumulates data from the application into a TCP segment, which is sent when the segment is full or a timer fires. IP packet: no bigger than the MTU (1500 B for Ethernet, 9 KB for jumbo frames on high-speed Ethernet). TCP packet: an IP packet containing a TCP header (20 B) plus data. TCP segment: no more than the maximum segment size (MSS), up to 1460 consecutive bytes for Ethernet. Sequence numbers: the header value is the sequence number of the first byte in the segment; numbering starts at a non-zero initial sequence number chosen during connection establishment. ACK sequence number: sequence number of the next expected byte = segment sequence number + length. ACKs are cumulative: when a packet arrives out of order, the ACK still carries the next in-order byte expected. As in Go-Back-N, the lost packet is retransmitted after a timeout; but the more duplicate ACKs arrive, the more likely a loss has occurred, so TCP retransmits after three duplicate ACKs, which may be faster than waiting for a timeout. A timeout is still needed in case of heavy loss or fewer than three duplicate ACKs. A duplicate ACK means the receiver got some packet sent after the lost packet. In TCP, the window does not slide while duplicate ACKs arrive.
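The size and numbering arithmetic above fits in a few lines; the initial sequence number here is a hypothetical value for illustration.

```python
# Worked example of the sizes and sequence numbers above (Ethernet case).
MTU = 1500                           # Ethernet maximum transmission unit, bytes
IP_HEADER = 20
TCP_HEADER = 20
MSS = MTU - IP_HEADER - TCP_HEADER   # max payload bytes per TCP segment

# A full segment carrying bytes [seg_seq, seg_seq + MSS - 1] is ACKed
# with the sequence number of the next byte the receiver expects:
isn = 4000                           # hypothetical initial sequence number
seg_seq = isn + 1                    # first data byte (SYN consumes isn)
seg_len = MSS
ack = seg_seq + seg_len              # next in-order byte expected
```

This is where the 1460-byte figure in the notes comes from: the 1500-byte Ethernet MTU minus 20 bytes each of IP and TCP header.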

Timeouts: The sender sets a retransmission timer for the first packet in the window; when the window slides, the timer is reset to that of the earliest unacknowledged packet. If the timer fires, a timeout has occurred, and the sender retransmits the unacknowledged packets in the window. The timeout should be a function of the RTT, which must be continuously estimated: if a packet is sent at t1 and its ACK received at t2, the RTT sample is t2 - t1. TCP smooths these samples with an exponentially weighted moving average (EWMA): EstRTT = (1-a)*EstRTT + a*SampleRTT. TCP simply ignores samples from retransmitted packets. A simple retransmission timeout is RTO = 2*EstRTT; each time the RTO fires it is doubled, and when an ACK is received it is reset to twice the estimated RTT. A better estimate also tracks an exponential average of the deviation, DevRTT, from |SampleRTT - EstRTT|, giving RTO = EstRTT + 4*DevRTT. Connection-oriented protocols need to establish a connection initially to set up connection parameters both ends agree upon. TCP connection establishment sets up the initial sequence numbers and exchanges the options each side understands, using the flags field. Three-way handshake: 1. Host A sends a SYN (open: synchronize sequence numbers) to host B. 2. Host B returns a SYN acknowledgement (SYN ACK). 3. Host A sends an ACK to acknowledge the SYN ACK. The flags in the TCP header: 1. SYN, initiate connection; 2. ACK, packet contains an acknowledgement; 3. FIN, terminate connection; 4. RST, abnormal termination. If a SYN is lost, the sender retransmits it on timeout. When one side wants to close the connection, it sends a packet with the FIN flag set; the FIN occupies one byte in the sequence space. The other host acknowledges that byte to confirm, which closes one direction; the other side repeats this to close both directions, and it can close at the same time by responding with FIN+ACK. If the application crashes, the OS may abruptly terminate the connection by sending a packet with the RST flag set; the remote end does not respond to an RST.
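The RTT-estimation formulas above can be sketched as one update step per ACK. The gains a = 1/8 and b = 1/4 are assumptions here (values commonly used in practice, not stated in the notes).

```python
def update_rto(est_rtt, dev_rtt, sample_rtt, a=0.125, b=0.25):
    """One EWMA update per the formulas above:
    EstRTT = (1-a)*EstRTT + a*SampleRTT, DevRTT smoothed the same way
    over |SampleRTT - EstRTT|, and RTO = EstRTT + 4*DevRTT."""
    est_rtt = (1 - a) * est_rtt + a * sample_rtt
    dev_rtt = (1 - b) * dev_rtt + b * abs(sample_rtt - est_rtt)
    return est_rtt, dev_rtt, est_rtt + 4 * dev_rtt
```

Note how a steady stream of identical samples drives DevRTT toward zero, shrinking the RTO toward EstRTT, while jittery samples inflate DevRTT and give the timer more headroom.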

Flow Control: Receivers need to buffer packets that arrive out of order or faster than the application reads them, and buffer memory may be limited; the sender may otherwise send more packets than the receiver has space for. The advertised window (RWND) is set by the receiver: RWND = B - (LastByteReceived - LastByteRead), and the sender ensures the total bytes in flight never exceed RWND. The sender can therefore send no faster than RWND/RTT bytes/sec. The receiver advertises a larger window only after the application has consumed previously arrived data. When RWND = 0, the sender keeps probing with one-data-byte packets. The sliding window can thus be used to control the sending rate. Buffers also exist in routers and switches; in TCP literature these are called queues. A packet-switching router buffers packets before sending them, e.g., when two packets arrive at the same time. When the queue is full, arriving packets are dropped and the senders retransmit; if many packets are dropped, a large number of retransmissions occur, which waste network capacity and further congest the network. When queues fill, latency and RTT increase; as RTT increases, the sending rate (window/RTT) decreases and the RTO grows. Larger RTTs mean goodput (the rate of actual data delivered) drops. Congestion collapse is the point where goodput drops dramatically. The right response: if the network is congested, decrease the sending rate; otherwise, increase it. Routers could send explicit congestion information to senders and receivers, but TCP does not use this. Instead, hosts infer congestion from packet loss or delay, with no explicit congestion signal from routers; TCP uses this approach because it is easy to develop and deploy, though less optimal. TCP congestion control main ideas: assume exactly one bottleneck link; if the sending rate exceeds the bottleneck capacity, delays and drops occur. A packet drop is a signal of congestion, along with possibly increased delay. Corruption also causes drops, but TCP assumes this is uncommon.
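The RWND bookkeeping above reduces to two small functions; a minimal sketch with hypothetical names:

```python
def advertised_window(buffer_size, last_byte_received, last_byte_read):
    """RWND = B - (LastByteReceived - LastByteRead): the free space left
    in the receiver's buffer after unread bytes are accounted for."""
    return buffer_size - (last_byte_received - last_byte_read)

def can_send(bytes_in_flight, rwnd):
    """Sender-side rule: keep total unacknowledged bytes within RWND."""
    return bytes_in_flight < rwnd
```

When the application falls behind (last_byte_read stalls while last_byte_received grows), RWND shrinks toward zero and the sender is throttled, which is exactly the back-pressure flow control is meant to provide.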

Goals: Send at the rate R at which no packets are dropped, where increasing R to R+δ results in losses; R is called the available bandwidth. Approach: continuously probe the network by sending more data until a packet is dropped; rate = window/RTT. TCP keeps a congestion window (cwnd) alongside the receiver window (rwnd) used for flow control, and the sender uses min(cwnd, rwnd). Phases: 1. Find the available bandwidth at startup with the slow-start algorithm. 2. Track changes to available bandwidth with the congestion-avoidance algorithm. Common scheme: upon ACK receipt, increase cwnd; upon loss, decrease cwnd; the actual increase or decrease depends on the phase. Slow start: cwnd = 1; sending_rate = MSS/RTT; when (ACK received) { cwnd += 1; } After one RTT, cwnd = 2*cwnd, so the window doubles every RTT. Stop increasing cwnd when it becomes unsafe (loss happens); cwnd/2 is guaranteed safe, since that window was just ACKed. Pseudocode: cwnd = 1; // in packets; ssthresh = ∞; when an ACK is received { cwnd++; send as many packets as cwnd allows; } when a packet loss is detected { ssthresh = cwnd/2; } Additive Increase Multiplicative Decrease (AIMD): Additive increase: for each ACK { cwnd += 1/cwnd; }, so if all segments in the window are ACKed, cwnd increases by 1. Multiplicative decrease: ssthresh = cwnd/2; cwnd = 1; go back to slow start. Overall: exponential window growth during slow start; at ssthresh, additive increase begins, growing the window linearly until the first loss; then multiplicative decrease, and slow start again.
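The slow-start and AIMD behavior above can be traced per RTT. This sketch follows the Tahoe-style rules in the notes (loss resets cwnd to 1); loss timing is supplied as a set of hypothetical RTT indices rather than simulated from queues.

```python
def cwnd_trace(rtts, ssthresh, loss_at=frozenset()):
    """Per-RTT sketch of the scheme above: cwnd doubles each RTT in
    slow start, grows by 1 per RTT in congestion avoidance; on loss,
    ssthresh = cwnd/2 and cwnd restarts at 1. Returns cwnd per RTT."""
    cwnd, trace = 1, []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_at:
            ssthresh = max(cwnd // 2, 1)   # multiplicative decrease
            cwnd = 1                       # back to slow start
        elif cwnd < ssthresh:
            cwnd *= 2                      # slow start: +1 per ACK => x2 per RTT
        else:
            cwnd += 1                      # additive increase: +1 per window
    return trace
```

With ssthresh = 8 and no loss, the trace shows the exponential-then-linear shape described above: 1, 2, 4, 8, then 9, 10, 11, …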

Flow control restricts the window to rwnd to avoid receiver buffer overruns; congestion control restricts the window to cwnd to avoid router buffer overruns and the resulting packet drops. They work together: the sender uses min(rwnd, cwnd). Phases of TCP congestion control: slow start finds the available bandwidth at startup; congestion avoidance tracks changes to it afterwards. Congestion control state machine: cwnd is initialized to a small constant, ssthresh to a large constant. Events: ACK (new data), dupACK (duplicate ACK for old data), timeout. States and events: Slow start: on ACK { cwnd += 1; stay in slow start; }; when cwnd >= ssthresh, switch to congestion avoidance. Congestion avoidance: on ACK, cwnd += 1/cwnd. In either state, on timeout: set ssthresh = cwnd/2, cwnd = 1, and go to slow start. Fast retransmit: dupACKcount++; if dupACKcount == 3, set ssthresh = cwnd/2, retransmit the missing packet, set cwnd = 1, and go to slow start. Fast recovery: each dupACK means a packet has left the network, so cwnd can grow to keep packets in flight. At dupACKcount == 3: set ssthresh = cwnd/2, cwnd = ssthresh + 3. On each further dupACK: cwnd += 1. On a new ACK: cwnd = ssthresh, and return to congestion avoidance. TCP variants: TCP-Tahoe uses slow start, congestion avoidance, and fast retransmit. TCP-Reno is like Tahoe but skips slow start after a fast retransmit, halving cwnd instead. TCP-NewReno adds fast recovery to Reno. TCP-SACK adds selective acknowledgements. TCP CUBIC, commonly used in Linux and designed for high-speed networks, increases the window faster than linearly: it uses a parameter K, the time at which the window reaches Wmax, and the window grows as the cube of the time elapsed relative to K. Modeling TCP throughput: throughput depends on the RTT and the loss rate p: Throughput = (1/RTT) * sqrt(3/(2p)). Implications of the model: TCP fairness: TCP allocates bandwidth based on RTT and loss. Achieving the low loss rates needed for high throughput on high-speed networks is challenging. TCP-friendliness: non-TCP applications should match TCP rates to avoid congestion. Non-congestive losses: TCP assumes loss is due to congestion, but on wireless media packets may be corrupted, hurting TCP performance; wireless standards often use link-layer retransmissions to improve reliability. Limitations of the model: short TCP connections (e.g., web connections transferring ~10 KB of data) never leave slow start, making the model less applicable; web servers often compensate by increasing the initial cwnd or using parallel connections.
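The throughput model above is easy to evaluate numerically. The notes give the per-segment form 1/RTT * sqrt(3/(2p)); scaling by an assumed MSS of 1460 B (an assumption, to get bytes/sec) makes the numbers concrete:

```python
from math import sqrt

def tcp_throughput(rtt, p, mss=1460):
    """TCP throughput model: (MSS/RTT) * sqrt(3/(2p)) bytes/sec,
    where rtt is in seconds and p is the packet loss rate."""
    return (mss / rtt) * sqrt(3 / (2 * p))
```

For example, with RTT = 100 ms and p = 1% loss, the model predicts roughly 180 KB/s, and the square-root dependence means the loss rate must fall by 4x for the throughput to merely double, which is why sustaining very high rates demands extremely low loss.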

Transport Layer: runs only on end hosts. Network Layer: runs on all routers and hosts, with the goal of routing packets from source to destination. Terminology: Autonomous System (AS): a set of routers under one administrative entity, e.g., an ISP or an enterprise. Interior routers connect routers within an AS; border routers connect to routers in other ASes. Route/Path: the sequence of routers a packet traverses from source to destination. Routing and Forwarding: Routing computes paths using routing algorithms, producing a forwarding table at each router. Forwarding, on each router, uses the forwarding table to decide which neighbor to send each packet to next. Control Plane: software that runs routing algorithms to determine paths. Data Plane: hardware that forwards packets based on network-layer header contents. The Internet Protocol (IP): IP is the network-layer protocol, providing a best-effort service with no assurances on bandwidth, loss, or latency.

IP Header Functions: Addressing: globally routable source/destination addresses. Loop prevention: the Time to Live (TTL) field prevents packets from looping forever. Corruption resistance: the checksum field detects header corruption. Variable sizing: fragmentation fields allow large packets to be split. Demultiplexing: the protocol field selects the transport-layer protocol. Special handling: options support unusual uses, e.g., Type of Service or Record Route. Addressing: Identifier: a unique 32-bit IP address in 4-octet dotted notation (e.g., 192.168.1.1). Locator: the prefix of the address identifies a network or subnet; the suffix identifies a host within that network. Hierarchical assignment: networks receive IP blocks and assign sub-blocks downward (e.g., an ISP assigns sub-blocks to customers). Route aggregation: lets an ISP represent all the addresses it has assigned as a single prefix, aiding scalable Internet routing.

Loop-Prevention: TTL field: each router decrements the TTL; the packet is dropped if the TTL reaches zero. Corruption-Resistance: a 16-bit checksum detects packet header corruption; it must be recalculated by each router because the TTL changes. Demultiplexing: the protocol field selects the transport-layer protocol, e.g., TCP (6) or UDP (17). Special Handling Options: Type of Service (ToS): indicates priority/class of service, e.g., DSCP bits for differentiated service or congestion notification. Options field: optional directives (e.g., Record Route, Loose Source Route) that make the header variable length. Packet Size Support: Fragmentation: if a packet exceeds the Maximum Transmission Unit (MTU), the IP layer fragments it. Reassembly: fragments are reassembled at the destination. Path MTU Discovery: avoids fragmentation by probing the path's MTU. IPv6 routers do not fragment packets; senders are expected to use Path MTU Discovery, and IPv6 guarantees a minimum link MTU of 1280 bytes. IPv6: Benefits: 128-bit addresses solve address exhaustion, and unnecessary IPv4 features are removed. Fixed header: optional functionality moves into extension headers placed between the IP and TCP headers, so no header-length field is needed. Flow label: groups the packets of a connection for consistent handling. NATs and Address Space: NATs allow multiple devices to share one public IP by using private IPs (e.g., 10.0.0.1) that are not routable on the Internet. NATs help conserve IPv4 addresses, but IPv6 adoption is still necessary as demand grows.

Routers transfer data between network ports. Ports: each router has N bi-directional ports of speed R (bits/sec), for a total router capacity of N x R. Router capacity types: 1. Small capacity: home/small-business routers, a few Gbps. 2. Medium capacity: enterprise/small-ISP routers, hundreds of Gbps. 3. Large capacity: ISP core routers, Tbps. Router components: 1. Input port: handles packet reception, TTL update, checksum verification, and output-port lookup. 2. Fabric: transfers packets from input to output ports; types include memory-, bus-, and interconnection-based. 3. Output port: manages buffering and packet scheduling. Input-port queuing occurs when simultaneous sends to the same output port create delays. Output-port lookup determines the output port using destination-based forwarding with longest-prefix match: when a forwarding table has multiple matching entries, the entry with the longest prefix wins. Radix trees (software) or TCAMs (hardware) implement the lookup. Switching fabric types: 1. Memory-based: stores packets in a memory bank; low cost but slower. 2. Bus-based: ports contend for bus access; moderate speed. 3. Interconnection-based: high-performance, non-blocking fabric, ideal for large routers. Output port & packet scheduling: 1. Buffer management: decides which packet to drop or mark when the queue is full. 2. Scheduler: determines packet transmission order, affecting delay and throughput. Scheduling methods: 1. FIFO (First-In-First-Out): transmits packets in arrival order; risks delay for latency-sensitive traffic. 2. Priority scheduling: multiple queues by priority level; reduces delay for high-priority packets but risks starving lower-priority ones. 3. Fair-share scheduling: allocates a queue per flow and transmits using round-robin or weighted round-robin, ensuring equitable bandwidth distribution.
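Longest-prefix match, as described above, can be sketched with a linear scan over a toy forwarding table (real routers use radix trees or TCAMs precisely because linear scans do not scale). The prefixes and port numbers are hypothetical.

```python
import ipaddress

def longest_prefix_match(table, dst):
    """Return the output port whose prefix matches `dst` with the
    longest prefix length, or None if nothing matches.
    `table` maps CIDR prefix strings to output port numbers."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, port in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, port)
    return best[1] if best else None

# Toy table: a covering /8, a more specific /16, and a default route.
table = {"10.0.0.0/8": 1, "10.1.0.0/16": 2, "0.0.0.0/0": 0}
```

A destination like 10.1.2.3 matches all three entries, but the /16 wins because it is the longest; 10.2.3.4 falls back to the /8, and everything else takes the default route.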
