Table of contents

  1. Link Layer
  2. Network Layer
  3. Transport Layer

Networking layer 2

nodes: hosts and routers

links: communication channels between adjacent nodes

  • wired
  • wireless
  • LANs

frame: layer-2 packet encapsulating (layer 3?) datagram

Error Detection

Parity Checking

single bit parity: detect single bit error

  • add bit so that # of 1’s is even/odd for even/odd parity

2D bit parity: detect and correct single bit errors

  • parity of each row/col

Cyclic redundancy check (CRC)

Setup:

  • data of bits
  • choose CRC of bits
  • choose generator of bits

Idea:

  • should be exactly divisible by (mod 2)
  • receiver knows
  • detect all burst errors less than bits

Calculate CRC:

Multiple access protocols

Apply to shared (broadcast) links

Ideally:

  • when one node wants to transmit, it can send at the full link rate
  • when nodes want to transmit, each can send at average rate
  • decentralized (no master, clock synchronization)
  • simple (plug and play, no configuration)

3 types of MAC (medium access control) protocols:

  • channel partitioning (avoids collisions)
  • random access (allows collisions)
  • taking turns

Channel partitioning MAC protocols

time division multiple access (TDMA)

  • nodes get access to channel in rounds
  • adv: fair, collision free
  • disadv: waste bandwidth, unnecessary waiting

frequency division multiple access (FDMA)

  • divide channel into frequency bands
  • same adv/disadv as TDMA

Random access protocols

slotted ALOHA

  • frames, slots same size
  • if collision
    • all nodes detect collision
    • node retransmits in each subsequent slot with probability
  • adv: single node full channel rate, decentralized, simple
  • disadv: collisions waste slots, idle slots, clock sync
  • best: 37% useful transmissions

pure (unslotted) ALOHA

  • no synchronization
  • transmit frames immediately
  • collision probability increases
  • best: 18% useful transmissions

carrier sense multiple access (CSMA)

  • listen before transmit
  • if channel idle, transmit entire frame
  • if channel busy, defer transmission
  • collisions can still occur - propagation delay means nodes may not hear each other’s transmission

CSMA/CD (collision detection)

  • collision detection
    • easy in wired LAN: measure signal strengths, compare transmitted and received signals
    • difficult in wireless LAN: received signal strength overwhelmed by local transmission strength
  • if NIC detects another transmission while transmitting, abort and sen d jam signal and enter exponential backoff (up until value of 10 times)
  • medium sensing for 96 bit times
  • jamming signal length is 48 bits - enough energy for collision detection
  • efficiency as and

CSMA/CA (collision avoidance)

  • if channel idle: transmit frame
  • if channel busy:
    • start random backoff
    • while channel busy, timer counter frozen
    • timer counts down while channel idle
    • transmit when timer expires
    • if no ACK, increase random backoff interval

Taking turn protocols

polling: master polls slaves nodes

  • no collisions and empty slots
  • disadv: polling overhead, single point of failure

token passing: control token passed from between nodes

  • disadv: token overhead, single point of failure (token)

MAC addresses and ARP

MAC address: used locally to get frame from one interface to another interface

  • each adapter on LAN has unique LAN address
  • allocation admistered by IEEE, manufacturer buys portion of MAC address space
  • MAC address is portable - can move LAN card from one LAN to another

address resolution protocol (ARP)

  • determines an interface’s MAC address knowing its IP address
  • plug-and-play
  • ARP table: each IP node (host, router) on LAN has a table
    • <IP address, MAC address, TTL>
    • TTL (time to live): time after which address mapping is forgotten

ARP on same LAN

  • A wants to send to B
    • B’s MAC address not in A’s ARP table
  • A broadcasts ARP query with B’s IP address, include destination MAC address FF-FF-FF-FF-FF-FF
  • B receives ARP packet, replies to A with its MAC address
  • A saves info into ARP table

ARP routing to another LAN

  • first hop router forwards A’s packet to B

Ethernet

Topologies

bus: all nodes in same collision domain

star: prevails today

  • active switch in center
  • no collision between nodes

Ethernet frame

  • preamble: 7 bytes + 1 byte (wakeup procedure to sync receiver and sender clock rates)
  • addresses: 6 byte source and 6 byte destination MAC address
    • on receiving a matching dest address or broadcast address, passes data to network layer
    • otherwise discard frame
  • type: higher layer protocol (usually IP)
  • CRC: if error detected, drop frame

Reliability

connectionless: no handshaking between sender and receiver

unreliable: receiving NIC doesn’t send ack or nack

  • dropped data recovered only if sender uses higher layer protocol (TCP)

Uses unslotted CSMA/CD with binary backoff

Ethernet switch

  • link layer device
    • store and forward ethernet frames
    • examine incoming frames, selectively forward
  • transparent: hosts unaware of presence of switches
  • plug-and-play: no config
  • multiple simultaneous transmissions
    • hosts have dedicated connection to switch
    • each link is own collision domain
    • full duplex

switch table

  • <MAC address of host, interface to host, TTL>
  • like routing table
  • switch learns which hosts can be reached through which interface
    • on frame receive, switch learns location of sender - update switch table
    • if destination entry found, forward to that interface (as long it’s not the incoming interface)
    • otherwise flood - send to every interface (switch still does not learn MAC address until it sends something back)

Wireless LAN (WLAN/Wifi)

Wireless access point (AP) is physically connected to a router

  • AP has physical layer and a link layer above that that converts Wifi to ethernet

basic service set (BSS): an AP and the hosts connected wirelessly to the AP

hidden terminal problem:

  • C transmits frame to B
  • A is unaware of C’s transaction (only sees B)
  • if A transmits, will collide with C’s transaction at B
  • C is hidden from A
  • solution: CSMA/CA

exposed terminal problem:

  • A transmits to D
  • B sees A, knows that it is transmitting
  • B can transmit to C, but it doesn’t because of A’s transmission
  • loss of bandwidth
  • B is exposed to A’s transaction
  • no solution provided so far…

Modes of operation

distributed coordination function (DCF)

  • no need for AP - if there is an AP, it acts like an ordinary node and just provides connectivity to the internet
  • nodes make their own decisions to transmit
  • handshake mode, w/out handshake mode

point coordination function (PCF)

  • AP is the controller in the BSS and decides who and when of transmission
  • no contention for medium access
  • waste of bandwidth if scheduled node has no traffic

Distributed coordination function (DCF) with handshake

dotRTSThreshold: MAC management holds this variable

  • if frame length >= dotRTSThreshold, use DCF with handshake
    • long frames - want to reduce probability of collision
  • otherwise, use DCF without handshake

Control frames:

  • RTS (request to send)
    • 20 bytes
    • duration is how long the sender wants to use the medium for
  • CTS (clear to send)
    • 14 bytes
    • duration is how long the sender wants to use the medium for

WLAN MAC: CSMA/CA via 2 techniques

  • physical level carrier sensing (hardward)
  • virtual carrier sensing:
    • integer variable in each node NAV (network allocation vector) indicates whether a nearby node is likely to be transmitting
    • NAV > 0: most likely another node is transmitting
    • NAV = 0: no one nearby is transmitting

NAV in detail

  • all nodes which receive RTS and CTS (and DATA) update NAV as follows
    • at each micro second, decrease NAV if NAV > 0
    • if a frame with duration is received, NAV=max(, NAV)
  • 4 timing intervals:
    • physical level
      • SIFS: short inter frame space (10 microsec)
        • how long to wait before sending CTS, DATA, ACK
      • aSlot (20 microsec)
    • MAC level
      • PIFS: priority IFS (SIFS + aSlot)
      • DIFS: distributed IFS (PIFS + aSlot)
        • how long to wait (and medium idle and NAV = 0) before sending RTS
    • SIFS &lt PIFS &lt DIFS
    • no one can transmit between RTS - CTS - DATA - ACK
    • duration variable of RTS includes time to send CTS + DATA + ACK and their included SIFS times

Distributed coordination function (DCF) without handshake

  • no RTS/CTS
  • NAV is still used - nodes must process RTS/CTS of others
  • sender wait for DIFS before sending DATA
  • receiver wait for SIFS before sending ACK

Point coordination function (PCF)

  • if AP finds medium idle for PIFS, transmit beacon frame
  • beacon contains CFPMaxDuration field (CFP: contention free period)
  • nodes receiving beacon update NAV and don’t transmit unless asked by AP
  • after beacon, AP waits for PIFS before transmitting one of
    • DATA
      • unicast or broadcast
      • PIFS before sending data again
    • CF poll (contention free)
      • user can send after SIFS
      • if user does not have data, it sends a null DATA frame
    • DATA + CF poll
      • polled user receives data from AP and sends data to another user
    • ACK
    • CF end
      • signifies end of CF period, receiving nodes set NAV=0
  • if ACK not received from AP, user waits until it is polled again

Wifi frame:

  • AP converts between Wifi and ethernet frames
  • A1: destination of the wifi link
  • A2: source transmitter of the wifi link
  • A3: source/destination - the node that the AP is connected to (usually a router)

3. Network Layer

  • transports transport layer segments between hosts
  • encapsulates segments into datagrams
  • in every host and router

Key functions

forwarding: move packets from router’s input to appropriate router output

routing: determine route taken by packets from source to dest via routing algorithms

  • the output of routing algorithm is a forwarding table (<IP address, interface>)

Key router functions

  • run routing algorithm (software)
  • forward datagrams (hardware)

Input port functions

  1. line termination: physical layer
  2. link layer receive
  3. lookup in forwarding table in input port memory, forwarding, queueing

queueing

  • fabric slower than input ports combined
  • HOL (head of line) blocking: queued datagram at front of queue prevents others in queue from moving forward

Switching fabric

Transports packet from input buffer to appropriate output buffer

memory switching

  • traditional computer, switching under direct control of CPU
  • packet copied into system memory
  • 2 bus crossings per datagram

bus switching

  • input port memory to output port memory via shared bus
  • contention for shared bus (1 transmission at a time)
  • 1 bus access per datagram

interconnection network switching

  • like a grid

Output ports

buffering: required when datagrams arrive from fabric faster than transmission rate

  • recommended size with RTT (round trip time), link capacity, flows user buffer size

scheduling discipline: choose among queued datagrams for transmission

Switch vs router

  Routers Switches
Layer network layer device link layer device
Store and forward yes yes
Forwarding table yes (via routing algo) yes (learned)
Traffic isolation yes no (broadcast seen at all interfaces)
Processing delay slower faster
Require config yes no (plug and play)
Broadcast storm no yes

IP datagram format

TODO

Header checksum

Example calculation at the sender:

IP header (in hex) with checksum set to 0000:

                         csum
4500 0073 0000 4000 4011 0000 c0a8 0001 c0a8 00c7

Step 1: add all 16 bit clocks of the header

  • result = 2 479c
  • add carry (2) to the rest to get temp = 479e

Step 2: take 1’s complement of temp to get checksum

  • checksum = b861

Step 3: put this in the header

Example calculation at the receiver:

Assume header received without bit error

                         csum
4500 0073 0000 4000 4011 b861 c0a8 0001 c0a8 00c7

Step 1: add all 16 bit blocks of the header

  • result = 2 fffd
  • add carry (2) to the rest to get temp = ffff

Step 2: take 1’s complement of temp

  • result = 0000

Step 3: if result == 0000 the no error. Otherwise bit-error, drop packet

IP fragmentation

Network links have MTU (maximum transmission unit): largest possible link-level frame.

Large IP datagrams get divided into multiple datagrams.

  • reassembled at final destination
  • IP header bits used to identify and order the fragments
    • offset field in header, which are multiples of 8 bytes

IP addresses

In IPv4 IP addresses are 32-big long

L2 switches and hubs do not have IP addresses

End-devices generally have 1 IP address each (laptop, desktop, server)

Routers have multiple IP addresses, 1 or more for each physical interface

Network prefix

All IP addresses on the same link have a common portion in the most significant bits, known as network prefix.

Example:

     prefix      host ID
IP1: x.y.z.10101 001
IP2: x.y.z.10101 010
IP3: x.y.z.10101 011
IP4: x.y.z.10101 100

Classful addressing

Characterized by fixed length prefixes.

  • class A: addr begins with 0 and prefix is 8 bits
  • class B: addr begins with 10 and prefix is 16 bits
  • class C: addr begins with 110 and prefix is 24 bits

Fixed number of networks and addresses per network

Classless addressing

CIDR (classless interdomain routing): prefix is arbitrary length

Example: 200.23.16.0/23 specifies that 200.23.16.0 is the prefix of length 23

For a network prefix of 129.97.0.0/16, the router represents it as 129.97.0.0/255.255.0.0 where 255.255.0.0 is a mask.

Network ID

network ID appear in routing tables, individual host IPs do not

broadcast address used to perform IP-level broadcast

Example: given a network 10.2.5.16/28

  • first address is 10.2.5.16 and is the network ID
  • last address is 10.2.5.31 and is the broadcast address

Public and private IP addresses

Public:

  • globally unique
  • recognized by every router
  • any host can open a TCP connection

Private:

  • unique within an org
  • only recognized by routers within the org
  • host outside the org cannot open TCP connection
  • IP reuse and added security

Packet forwarding

packet forwarding: relay packets from one physical interface to another by routers

Next hop calculated with destination address and routing table.

  • routing table entries learned through routing protocols
  • for a destination ip_dest and given entry of ip, mask, match occurs when ip_dest && mask == ip && mask

Many matches can occur in a routing table. In this case find the entry with longest prefix (smaller pool of possible destination IPs).

DHCP

dynamic host configuration protocol: allow host to dynamically obtain IP address from network server

  • host broadcasts DHCP discover msg (optional)
  • DHCP server responds with DHCP offer msg (optional)
  • host request IP with DHCP request msg
  • DHCP server sends IP address with DHCP ack msg. This also returns:
    • IP address of first-hop router
    • name and IP of DNS server
    • network mask

DHCP works on UDP.

Address aggregation (supernetting)

If router can reach:

192.168.0.0/24
192.168.1.0/24
192.168.2.0/24
192.168.3.0/24

Instead of advertising all 4 networks, advertise 1 address and say it can reach 192.168.0.0/22

NAT

NAT (network address translation): local network uses one IP address in view of outside world.

  • just need one IP address for all devices
  • change address in local network without notifying external
  • devices not externally accessible

NAT does public-private address/port translation.

  • uses 16 bit port number field
  • controversial since higher than layer 3

NAT traversal problem

External wants to connect to server with private IP 10.0.0.1 (can’t use it as dest addr).

solution 1: statically configure NAT to forward incoming requests at some port to the server

solution 2: universal plug and play (UPnP). Allow server to learn public IP address and add/remove port mappings

solution 3: relaying (Skype)

  • server establishes connection to relay (which is outside the LAN)
  • external client connects to relay
  • relay bridges packets between connections

ICMP

ICMP (internet control message protocol): used by hosts and routers to communicate network level information

  • error reporting: unreachable host, network port, protocol
  • echo request/reply (ping)
  • carried in IP datagrams

Traceroute and ICMP

  • source sends series of UDP segments to dest
    • first set has TTL = 1, second set TTL = 2, etc.
    • unlikely port number
  • when n-th set of datagrams arrives at n-th router
    • router discards datagrams
    • sends source ICMP messages including name of router & IP address
    • when ICMP messages arrive, source records RTTs
  • stopping criteria:
    • UDP segment eventually arrives at dest host
    • returns ICMP “port unreachable”
    • source stops

IPv6

32-bit address space soon to be completely allocated

datagram format:

  • fixed-length 40 byte header
  • no framentation
  • no checksum to reduce processing time at hops
  • options allowed, but outside of header
  • ICMPv6
    • additional “packet too big” message

Transition from IPv4 to IPv6

  • not all routers can be upgraded simultaneously
  • tunneling: IPv6 datagrams carried as payload in IPv4 datagram

Routing algorithms

Classification

global vs decentralized information:

global: all routers within autonomous system have complete topology

  • link state algorithms

decentralized: router knows physically connected neighbours

  • iterative process of computation, exchange info with neighbours
  • distance vector algorithms

static vs dynamic:

static: routes change slowly over time

dynamic: routes change more quickly

  • periodic update in response to link cost changes

Autonomous system (AS)

A set of routers under a single technical admin, using an interior gateway protocol and common metrics to route packets within the AS.

Uses exterior gateway protocol to route packets to other AS’s.

Identified by a 16-bit ID AS-number.

Intra AS routing:

  • RIP: routing information protocol (distance vector)
  • OSPF: open shortest path first (link state)

Inter AS routing:

  • BGP: border gateway protocol

Distance Vectors

is the estimate of least cost from to

  • maintains distance vector
  • is the set of all nodes in the network

Node

  • knows cost for each neighbour
  • maintains neighbours’ distance vectors. For each neighbour , maintains

Algorithm:

  • occasionally each node sends its own distance vector to neighbours
  • when receives new distance vector from neighbour it updates its own distance vector as per Bellman-Ford equation:
  • the estimate should converge to the actual cost

Each local iteration caused by:

  • local link cost change
  • DV update message from neighbour

Each node notifies neighbours only when its DV changes

Operation of node:

Image

Two node instability problem:

The algorithm can get stuck in a loop when node X goes down. Solutions:

  • split horizon: if node B learns about a path to X from A, B does not tell A about it
  • poisoned reverse: if A tries to route to X via B, A tells B that it has an infinity-cost path to X when the (X, A) link is broken
    • enables B won’t try to forward the packet to X via A

Does not solve three-node or four-node instability.

RIP (routing information protocol)

Distance vector algorithm

  • distance metric: # hops (max = 15)
  • each link has cost 1
  • DVs exchanged with neighbours every 30s
  • each advertisement: list up to 25 destination subnets

Link failure: if no advertisement heard after 180s, neighbour declared dead

  • routes via neighbour invalid
  • new advertisement sent
  • propagates quickly
  • poison reverse used to prevent ping-pong loops (infinite distance = 16 hops)

RIP table processing

RIP routing tables managed by app-level process called route-d.

Advertisements sent in UDP packets.

OSPF (Open shortest path first)

  • link state algorithm
  • each node has topology map
  • advertisements carry one entry per neighbour
  • advertisement flooded to entire AS

Overview:

  • each router learns complete topology of network within AS via flooding of OSPF advertisement messages
  • use Dijkstra’s algo to compute shortest paths between all pairs of nodes
  • for a given source/dest pair, find next hop from shortest path

Advanced features:

  • all messages authenticated
  • multiple same-cost paths allowed

Hierarchical OSPF: TODO

BGP (Border gateway protocol)

Forwarding table configured by both intra AS and inter AS routing algorithms.

BGP provides each AS a means to:

  • eBGP (AS external): obtain subnet reachability info from neighbouring AS’s
  • iBGP (AS internal): propagate reachability info to all AS-internal routers
  • determine good routes to other networks based on reachability info

BGP messages: exchanged between peers over TCP

  • OPEN: open TCP to peer and authenticate sender
  • UPDATE: advertise new path or withdraw old
  • KEEPALIVE: keeps connection alive in absence of UPDATES, also acks OPEN
  • NOTIFICATION: report errors in previous message, also used to close connection

BGP session: between two BGP routers

  • advertise paths to different network prefixes
  • on advertisement of a prefix, the AS promises it will forward datagrams toward that prefix

Path attributes: advertised prefix includes BGP attributes (prefix + attributes = route)

  • AS-PATH: contains AS’s through with prefix advertisement has passed
    • on ties, router selects a route based on shortest AS-PATH
  • NEXT-HOP: indicates specific internal-AS router to next-hop AS
    • router uses OSPF to find shortest intra AS path to the NEXT-HOP and saves in the forwarding table
  • gateway router can use import policy to accept/decline route advertisement
    • ex. never accept a route that went through AS 217

Hot potato routing: on equal AS-PATH length, choose route with closest NEXT-HOP using OSPF

BGP algorithm:

  1. router becomes aware of prefix via BGP advertisements
  2. determine router output port for prefix
    • use BGP route selection to find best inter-AS route
    • use OSPF to find best intra-AS route leading to best inter-AS route
    • router identifies router port for that route
  3. enter prefix-port entry in forwarding table

AS export policy: which routes it will advertise

  • to a customer: all routes
  • to a peer or service provider:
    • routes to all its own subnets and to customers’ subnets
    • but not to subnets learned from other providers/peers

4. Transport Layer

Provides logical communication between app processes running on different hosts.

Runs on end systems:

  • send side: break app messages into segments, passes to network layer
  • receive side: reassemble segments into messages, passes to app layer

UDP (user datagram protocol): no reliability (RIP, SNMP)

TCP (transmission control protocol): reliable data delivery (FTP, HTTP)

Ports

Programs use port numbers to address other programs.

Ports identified by 16-bit ID. UDP ports and TCP ports.

3 addresses:

  • MAC addr for 1 hop comm
  • IP addr for multi-hop comm
  • ports for app-app comm

Multiplexing/demultiplexing

multiplexing at sender: handle data from multiple sockets, add transport header

demultiplexing at receiver: use header info to deliver segments to correct socket

  • use IP address and port number to direct segment to appropriate socket

connectionless demultiplexing: UDP segment directed only based on dest address and port

connection oriented demux: TCP socket identified by source/dest address/port

UDP

“best effort” service

Segments can be:

  • lost
  • delievered out of order

Add reliability at app-layer

Connectionless:

  • no handshake
  • each segment handled independently

Uses:

  • streaming multimedia
  • DNS
  • SNMP

Reliable data transfer

  • Automatic Repeat Request (ARQ) protocol
  • requires feedback from receiver (ack)
  • sequence numbers used to solve out-of-order and duplicate packets
  • timers used to solve lost packets

Stop and wait

  • send packet, start timer, wait for ack
  • if timer runs out, retransmit

Duplicate packets and acks will be discarded.

Performance is bad, low utilization of link.

Pipelined protocols

Increases utilization.

Sender can have up to N unacked packets in pipeline

go back N

  • window of up to N consecutive unacked packets
  • receiver only sends cumulative ack: doesn’t ack if there’s a gap
  • sender has timer for oldest unacked packet
    • on timer expires, retransmit all unacked packets
  • out of order packet:
    • discard (no buffer)
    • re-ack packet with highest in-order seq #
  • duplicate ack discarded by sender

selective repeat

  • window of up to N consecutive packets
  • receiver sends individual ack for each packet
  • maintain timer for each unacked packet
    • on timer expire, retransmit only that unacked packet
  • buffer out-of-order packets as needed for eventual in-order delivery to upper layer
  • window size should be at most half the sequence # space
    • because then advancing the window by a full window size will not lead to ambiguity of accepting a sequence # that could be from current window or next window

TCP

  • connection oriented: handshake initializes sender and receiver
  • point to point: 1:1 sender to receiver
  • reliable, in order byte stream
  • full duplex
  • pipelined
  • flow control: sender will not overwhelm receiver

Sequence number and ack number fields counted by # of bytes (not segments).

sequence numbers: byte stream “number” of first byte in segment’s data

ack number: sequence # of next byte expected from other side

  • cumulative ack

Receiver generally buffers out-of-order segments.

Piggybacking

ACK for client-to-server data carried in data segment from server-to-client.

Eliminates need for separate ACK messages.

TCP fast retransmit

  • timeout period relatively long
  • instead, detect lost segments via duplicate ACK
    • if segment lost, there will likely be many duplicate ACKs

fast retransmit:

  • if sender receives 3 ACKs for same data, resend unacked segment with smallest seq #
    • likely the unacked segment was lost, so don’t wait for the existing timeout

TCP flow control

Receiver controls sender so sender won’t overflow receiver buffer by transmitting too fast.

Receiver advertises free buffer space by including rwnd (receiver window) value in TCP header. RcvBuffer is total buffer size.

  • sender limits window to receiver’s rwnd value

TCP handshake

TCP fields:

  • connection request (SYN)
  • initial sequence number (ISN)
  • acknowledgement (ACK)
  • receive window size

3 way handshake

  1. SYN segment client to server
    • SYN = 1
    • random ISN
    • RWND undefined
    • options
  2. SYN+ACK segment server to client
    • SYN = 1
    • random ISN
    • ACK = 1 (ack the SYN)
    • ack seq #: ask for first data byte
    • RWND: receive window size
  3. ACK from client to server
    • ACK the second SYN
    • RWND

Closing TCP connection

  • client and server close by sending TCP segmenet with FIN = 1
  • respond to received FIN with ACK
  • on receiving FIN, ACK can be combined with own FIN

TCP congestion control

congestion:

  • too many sources sending too much data for network to handle
  • router buffer delay and overflow
  • different from flow control

Two approaches:

  • end-end congestion control
    • no explicit feedback from network
    • congestion inferred from end-system observed loss, delay
    • done by TCP
  • network-assisted congestion control:
    • routers provide feedback to end systems
    • single bit indicating congestion
    • explicit rate for sender to send at

TCP approach: sender increases transmission rate (window size), probing for usable bandwidth until loss occurs

  • additive increase: increase cwnd (congestion window) by 1 MSS (maximum segment size) for every acked segment until loss detected
  • multiplicative decrease: cut cwnd in half after loss
  • cwnd is a dynamic function of perceived network congestion
  • sender window limited by min(cwnd, rwnd) (rwnd usually large at receiver)
  • TCP send rate: send cwnd bytes, wait RTT for ACKs, then send more ()

TCP slow start

When connection begins, increase rate exponentially until first loss event:

  • initially cwnd = 1 MSS
  • double cwnd every RTT
  • done by incrementing cwnd for every ACK received

Ramp up is exponentially fast.

TCP congestion avoidance

Once cwnd reaches slow start threshold, TCP connection goes into congestion avoidance (CA) phase.

In CA, sender increases cwnd by 1 MSS when all segments in previous cwnd have been acked.

TCP loss reaction

  • loss indicated by timeout
    • ssthresh = cwnd/2
    • cwnd = 1 MSS
    • window then grows exponentially (like slow start) to threshold
    • when cwnd == ssthresh switch from exponential to linear increase
  • loss indicated by 3 duplicate ACKs (TCP Reno):
    • dup ACKs indicate network capable of delivering some segments
    • ssthresh = cwnd / 2
    • cwnd = ssthresh

Tahoe and Reno are two TCP implementations

  • Tahoe sets cwnd = 1 on 3 duplicate ACKs
  • Reno sets cwnd = ssthresh on 3 duplicate ACKs