On Host Network Interfaces and Architectures for High Speed Networks

Kazantzidis Manthos - UCLA CS Grad
CS218 Winter 1997
prof. Mario Gerla

In the last few years, very high speed packet networks have come into play. Bandwidths of hundreds of Mbps or even a few Gbps are typical of present high speed networks, which are called gigabit networks after their speed. These figures are comparable to the bandwidths seen inside a workstation host, much closer to the CPU. With the advent of such networks the bottleneck migrates to the end host, so further speed enhancements in network medium technology alone would provide no performance gain to the end user. Either the buses must be used efficiently or new architectures must be proposed that alleviate this bottleneck. Further impetus to overcome the problem comes from the services promised by integrated networks, which include multimedia, virtual reality, and hard and soft real-time traffic. In this paper I present three different approaches to this problem, each at a different design layer. An attempt to categorize them follows, and one of the approaches studied (MyriNet, the integrated one) is measured to get some hands-on experience with its unicast and multicast operation. I find that small messages are still very problematic, since they achieve a throughput of less than 1 Mbps, 2 Mbps at best, and I experiment with multicasting on Myrinet both by emulating it with unicast connections and by multicasting on a cycle.

1. Introduction

With the advent of gigabit networks the bottleneck point has migrated within the end host. A number of problem areas can be identified to contribute to this bottleneck:

1. The host network interface, which typically resides on the I/O bus. According to layered design principles its job is to copy incoming and outgoing data to and from host memory. For that, only a pair of FIFO queues is needed to provide buffering close to the network. The job can even be carried out with a pair of registers, postponing all other functions to a software layer. It is now evident that the host network interface can play a very important role in alleviating this bottleneck.

2. The operating system and the host architecture are often responsible for poor performance since the path to the end user goes through data copying, software checksumming, interrupt servicing and context switches.

3. The overhead of the network communication protocols, caused by moving data between protocol structures, processing incoming and outgoing streams, using timers, and handling small control messages.

4. Finally, the lack of integration between the hardware architecture, the operating system and the network communication protocols. Inconsistencies caused mainly by layering can be the cause of a lot of extra overhead. A striking example is given in [Is Layering Harmful? - Clark et al], where a mismatch between layers causes spikes in performance curves. In many areas layering is increasingly seen as a design abstraction only, not to be carried over into the implementation.

The typical host network interface follows, as mentioned earlier, from the layered design approach. With this approach six bus crossings are needed on a send or receive operation, as described in [1]. The application copies the data to a buffer (residing in host memory) to be used for the socket call. The socket code copies the data into mbufs in the system area. The transport protocol reads the data to calculate the checksum and writes it back. Finally, the data link layer (DLL) writes the data to the network interface queues (through the system area).
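A rough back-of-envelope sketch of why the number of crossings matters. It assumes, as a simplification that is not taken from [1], that every crossing of a byte consumes the same shared bus bandwidth; the function name and the 500 Mbps figure are illustrative only.

```python
def throughput_ceiling_mbps(bus_bandwidth_mbps: float, crossings: int) -> float:
    """Upper bound on network throughput when each byte crosses the bus `crossings` times.

    Simplifying assumption: all crossings share one bus with the given sustained bandwidth.
    """
    if crossings < 1:
        raise ValueError("a message must cross the bus at least once")
    return bus_bandwidth_mbps / crossings


# Illustrative numbers only: the same bus with six crossings versus a single crossing.
six_crossings = throughput_ceiling_mbps(500.0, 6)
one_crossing = throughput_ceiling_mbps(500.0, 1)
print(f"six crossings: {six_crossings:.1f} Mbps, one crossing: {one_crossing:.1f} Mbps")
```

Under this simple model, reducing the number of crossings raises the ceiling proportionally, which is the motivation for the zero and one copy designs discussed next.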

Solving these problems means alleviating the bottleneck. To remove it altogether, a new architecture is needed that can overcome the limitations placed by technology on the memory, the system bus and the I/O bus. Note that caching cannot improve memory bandwidth very effectively, since network data have very low locality.

1.1 Zero and One Copy Protocols

Approaches so far are characterized by migrating hardware towards the host network interface, in order to avoid crossing the system bus and accessing main memory. If the functions needed by the communication protocols are provided in hardware residing between the network and the system bus then, provided that the interface has sufficient buffering, the data will never have to cross the bus. Ideally the applications would be able to manipulate their data in a memory location that is also used by the host network interface. The architecture is then said to provide a zero copy protocol.

In a one copy protocol the application does not share structures with the host network interface. All the functions needed for the protocol implementation are in hardware on the host network interface. The only copy needed then is the one from the application structures to the host network interface structures. In reality there is always a shared region of memory between the applications and the network interface. However, treating this region as a limited resource leads to characterizing the protocol as a one copy protocol, since the memory available to the process is effectively reduced.

Zero or one copy protocols are achieved by properly manipulating virtual memory pages. For example, on transmit the memory pages containing the message are pinned in physical memory until the message is acknowledged. If the application attempts to overwrite these pages, it is either blocked or the API copies the new data to another location in a manner transparent to the application. On receive, data is placed in pinned memory pages. If the application issues the receive using an address that corresponds to a page boundary, the data can be made available without copying or remapping. If a non-page-aligned address is specified, the data must be copied. To make zero or one copy protocols accessible at the application layer, an Application Programming Interface must be designed.
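The receive-side rule just described can be sketched as a small decision function. This is only an illustration of the alignment test; the function name and the returned labels are hypothetical, and a real implementation would act on page tables rather than return a string.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE  # page size of the machine running this sketch


def receive_strategy(user_address: int, page_size: int = PAGE_SIZE) -> str:
    """Return 'no_copy' when the receive address is page-aligned, else 'copy'.

    The labels are hypothetical names for the two cases described in the text.
    """
    return "no_copy" if user_address % page_size == 0 else "copy"


print(receive_strategy(4 * PAGE_SIZE))       # address on a page boundary
print(receive_strategy(4 * PAGE_SIZE + 12))  # address inside a page
```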

1.2 Hardware Support on the Network Adapter

Hardware migration to the host network interface can be categorized in three areas:

1. Introducing DMA support for more efficient data movement, eliminating the involvement of the CPU in copying the data.

2. Adapter buffer support

3. Protocol processing hardware support on the adapter: Specific functions can be provided at the network adapter, such as checksumming or separating header and data. The most aggressive option is to perform protocol processing on the network adapter. The protocol can either be hardwired on the adapter or run on an embedded processor. The latter gains flexibility to support a wide range of protocols, including user-defined ones.

I go on to describe three different proposals on this issue before attempting to identify similarities and differences between the approaches. After that, the impact on multicasting of one of these architectures, the commercial product MyriNet, is evaluated.

2. U-Net

U-Net is a communication architecture that provides processes with a virtual view of the network interface to enable user level access to high-speed communication devices. The architecture is implemented on a standard off-the-shelf hardware platform (SPARCstations with Fore ATM interfaces) and offers the potential to remove the kernel completely from the critical path.

A U-Net prototype built on an 8-node ATM cluster of SPARCs offers 65 microseconds round-trip latency and 15 Mbytes/sec bandwidth (126 Mbps).

The key idea is to place the entire protocol stack at user level, while the operating system and the hardware on the host network interface allow for protected user level access to the network. In this way the kernel is removed from the critical path and the processes can tailor the communication layer to their demands.

The issues that need to be solved then are:

1. Multiplexing the network among processes
2. Providing for protection among processes
3. Managing the limited communication resources without needing the kernel
4. Designing an efficient yet versatile programming interface to the network

U-Net claims to be the first presentation of a full system that does not require custom hardware or OS modification.

2.1 U-Net User Level Network Architecture

The role of U-Net is to multiplex the actual Network Interface among all processes accessing the network and to enforce protection boundaries and resource consumption limits. The process has control over both the contents of each message and the management of send and receive resources.

2.1.1 Sending/Receiving in U-Net

The main building blocks of U-Net are shown in figure 1.

1. endpoints serve as an application's handle into the network
2. communication segments are regions of memory that hold message data
3. message queues hold descriptors for messages that are to be sent or that have been received.

Each process wishing to access the network creates an endpoint and associates with it a communication segment and a send, a receive and a free message queue. To send a message:

1. Use the communication segment to compose the data
2. Push the descriptor to the send queue
3. U-Net's Network Interface multiplexes the message into the network with the necessary info for demultiplexing.

On receive, U-Net's Network Interface demultiplexes messages based on their destination to the appropriate communication segment and pushes a descriptor to the receive queue. Receive can be event-driven or polling, blocking or non-blocking.

Multiplexing and demultiplexing of messages is done through message tags. The flow identification mechanism of the network substrate may be used for this, for example the VCI in an ATM network. A process registers its tags with U-Net. On outgoing messages the channel identifier is then used to place the correct tag into the message, and on incoming messages the tag is mapped to a channel identifier to signal the origin of the message to the application.

An OS service must help the application determine the correct tag based on the destination process and the route between the nodes. The operating system must also handle other network specific tasks. The service may also perform authentication and authorization checks for security, and avoid conflicts by checking that endpoints are used by their owner process. If all the checks pass and the path to the peer has been determined, the resulting tag is registered so that U-Net can perform its multiplexing/demultiplexing function. A channel identifier is returned to the application to identify the communication channel to the destination.
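The send and receive steps above can be sketched with a toy model. Everything here is hypothetical scaffolding for illustration (the class names, the `ToyInterface` routing table, and the choice to drop a message when no free buffer fits); it is not the U-Net API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    channel: int  # identifies the destination (send) or origin (receive)
    offset: int   # start of the message inside the communication segment
    length: int   # message length, or buffer capacity for free-queue entries


@dataclass
class Endpoint:
    segment: bytearray  # communication segment holding message data
    send_q: deque = field(default_factory=deque)
    recv_q: deque = field(default_factory=deque)
    free_q: deque = field(default_factory=deque)

    def send(self, channel: int, offset: int, payload: bytes) -> None:
        # Step 1: compose the data in the segment. Step 2: push a descriptor.
        self.segment[offset:offset + len(payload)] = payload
        self.send_q.append(Descriptor(channel, offset, len(payload)))


class ToyInterface:
    """Stands in for the network interface; moves messages between endpoints."""

    def __init__(self) -> None:
        self.routes = {}  # (id of source endpoint, channel) -> destination endpoint

    def connect(self, src: Endpoint, channel: int, dst: Endpoint) -> None:
        self.routes[(id(src), channel)] = dst

    def service(self, src: Endpoint) -> None:
        # Step 3: drain the send queue and deliver into buffers from the free queue.
        while src.send_q:
            d = src.send_q.popleft()
            dst = self.routes[(id(src), d.channel)]
            if not dst.free_q or dst.free_q[0].length < d.length:
                continue  # toy policy: drop when no suitable receive buffer exists
            buf = dst.free_q.popleft()
            data = src.segment[d.offset:d.offset + d.length]
            dst.segment[buf.offset:buf.offset + d.length] = data
            dst.recv_q.append(Descriptor(d.channel, buf.offset, d.length))


a, b = Endpoint(bytearray(256)), Endpoint(bytearray(256))
ni = ToyInterface()
ni.connect(a, channel=7, dst=b)
b.free_q.append(Descriptor(channel=0, offset=64, length=64))  # receiver offers a buffer
a.send(7, 0, b"hello")
ni.service(a)
got = b.recv_q.popleft()
message = bytes(b.segment[got.offset:got.offset + got.length])
print(message)
```

The receiver polls `recv_q` in this sketch; an event-driven variant would invoke a callback where the descriptor is appended.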

U-Net handles sends and receives with data residing in the communication segments. Communication segments are virtual memory regions, but typical workstation I/O bus addressing does not allow them to span the whole memory space. Consequently zero copy protocols cannot be supported in the strict sense, since the application cannot make full use of its memory and has to treat the communication segments as a limited resource. In a one copy protocol the data are copied from the application structures to the network interface structures before being sent out to the network (the reverse holds for the receive operation).

The communication segments are typically pinned in physical memory. Send and receive queues hold information about the destination and origin endpoint addresses of messages, respectively, as well as their lengths and offsets within the communication segment. Free queues hold descriptors that are made available to the network interface for receiving messages.

The management of send buffers, meaning their size, number and allocation policy, is entirely up to the process. The process also provides receive buffers explicitly to the network interface via the free queue, but it cannot control the order in which these are filled.

3. The APIC

APIC stands for ATM Port Interconnect Controller. It is a custom chip, the centerpiece for the design of a high performance ATM host network interface. A prototype built could support a sustained aggregate bi-directional data rate of 2.4 Gbps.

3.1. Concept

The key idea of this design is to provide an ATM interconnect that extends to the desk area. This means that APIC chips reside between the ATM LAN and the processor. Streaming data from I/O devices to the network is supported, as is streaming between devices residing on the interconnect, without crossing the system bus. Important features include full AAL-5 segmentation and reassembly, pacing control that provides single-parameter bandwidth reservation, and a high degree of scalability in the number of I/O devices that can be supported simultaneously. A zero copy interface to system memory, achieved through page remapping techniques, was developed after the publication of [3].

A desk area network is constructed by interconnecting a number of APICs. Each one interfaces to an I/O device or to the main system bus.

A typical workstation architecture is shown in figure 2. As shown in such an architecture the network interface card would reside on the I/O bus, significantly limiting its performance and flexibility. These limitations arise from:

1. The bottleneck imposed by the two buses
2. The effective memory bandwidth is normally no more than 500Mbps
3. Existing protocols require data to cross the bus and access memory multiple times, exacerbating the problem
4. In multiprocessor architectures processors working on multiple media streams have to share the same bus

Based on the above considerations, the APIC design approaches the problem of host network interfacing by revising both the host communication architecture and the I/O subsystem as a whole.

3.2 The Architecture

Figure 3 shows a high level schematic of the proposed DAN architecture. Every APIC can interface to one or more I/O devices. One of the APICs is connected to the external ATM network through a link interface. ATM cells arriving from the network pass from one APIC to the next until they reach an APIC that is directly connected to the destination device or the system bus. Similarly, data originating from a source device is passed as an ATM cell stream from the directly connected APIC down the APIC interconnect until it reaches the APIC connected to the link interface. A video cell stream originating at a camera, for example, could traverse the interconnect to reach the local display device without crossing the system bus.

The linear topology for interconnecting the I/O devices is not a constraint. One can implement, for example, a perfect shuffle network, which would guarantee O(log n) hops from every device attached directly to an APIC to the network (or to any other APIC).

For multimedia workstations, an embedded signal processor could reside on the data path between a continuous media device and VRAM. In this way the processor would not be shared among devices, but uncompressed data would never have to cross the interconnect. Alternatively, a special media processor could reside as a separate device on the interconnect.
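A small sketch comparing hop counts for the two interconnect shapes. The constant in the logarithmic bound is an assumption: ceil(log2 n) is used here as a stand-in for the O(log n) claim, and both function names are illustrative.

```python
import math


def linear_hops(n_apics: int) -> int:
    """Worst-case hops in a linear chain: from one end to the other."""
    return n_apics - 1


def log_hops(n_apics: int) -> int:
    """Assumed hop bound of ceil(log2 n) for a shuffle-style interconnect."""
    return math.ceil(math.log2(n_apics)) if n_apics > 1 else 0


for n in (4, 16, 64):
    print(f"{n:3d} APICs: linear {linear_hops(n):2d} hops, shuffle-style {log_hops(n)} hops")
```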

The APIC chip behaves like a 3x3 ATM switch. Two ports are connected to two other APIC chips and the third port interfaces to external memory. One of its functions takes specified data from memory, makes an ATM cell stream out of them and forwards it. In the opposite direction the APIC is responsible for doing a VPI/VCI virtual channel lookup on every arriving cell to determine if it is destined for the local device.

3.3 APIC Features

Any host network interface for an ATM network must handle segmentation and reassembly. The APIC also includes some interesting additional features, some of which are mentioned here:

1. Multipoint Connections and Loopback: This allows for example a video stream from a camera to be viewed in the local display and be sent out over the network at the same time

2. A Serial Access Memory is included for interfacing with the VRAM. In this way batches of more than one ATM cell are used for transfers.

3. Whenever queuing occurs inside the APIC there is always a bypass queue for low latency traffic. Capability on a per connection basis to specify a low latency requirement is provided.

4. To provide all these features a considerable amount of state information must be kept per connection. To support many connections, connection caching is used: only the state of active connections is kept in a static RAM on-chip. An active connection is one that has received or sent a packet within a reasonable amount of time.
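Connection caching can be sketched as a small cache in front of a larger store. The least-recently-used eviction policy, the backing store for inactive connections, and all names below are assumptions made for illustration; the source only states that active connections are kept on-chip.

```python
from collections import OrderedDict


class ConnectionCache:
    """Toy model: a small on-chip cache of per-connection state (LRU is assumed)."""

    def __init__(self, capacity: int, backing_store: dict) -> None:
        self.capacity = capacity
        self.backing = backing_store  # stands in for wherever inactive state lives
        self.cache = OrderedDict()    # stands in for the on-chip static RAM
        self.misses = 0

    def lookup(self, vci: int):
        if vci in self.cache:
            self.cache.move_to_end(vci)  # mark the connection as recently active
            return self.cache[vci]
        self.misses += 1
        state = self.backing[vci]
        self.cache[vci] = state
        if len(self.cache) > self.capacity:
            old_vci, old_state = self.cache.popitem(last=False)  # evict least recent
            self.backing[old_vci] = old_state                    # write its state back
        return state


cache = ConnectionCache(capacity=2, backing_store={1: "a", 2: "b", 3: "c"})
for vci in (1, 2, 1, 3, 2):
    cache.lookup(vci)
print(cache.misses, list(cache.cache))
```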

5. Single parameter pacing is provided.

3.4 Cell Forwarding Path

Let us now consider how a cell coming in from an upstream APIC is forwarded. In figure 4 the internal architecture of an APIC is shown using high-level blocks.

1. Entering through one of the two Input Framers: Synchronization to the internal clock and error checking is performed.

2. Written in SRAM

3. A copy is passed to the Virtual Circuit Translation Table, which does the table look-up using the virtual path and channel identifiers from the cell header. It determines whether the cell is destined for the local device, and places it into the appropriate FIFO.

4. The Dispatcher will eventually take this entry and schedule the cell contents to be read from SRAM into the Output Framer or the Receive Formatter accordingly.
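The forwarding steps above can be sketched as follows. This is a toy model under stated simplifications: the framer's synchronization and error checking (step 1) is omitted, a single FIFO with a target label replaces the separate FIFOs, and all names are hypothetical.

```python
from collections import deque

LOCAL, TRANSIT = "receive_formatter", "output_framer"


class ToyApic:
    def __init__(self, local_vcs) -> None:
        self.local_vcs = set(local_vcs)  # (vpi, vci) pairs that end at the local device
        self.cell_sram = {}              # slot number -> cell payload
        self.fifo = deque()              # entries waiting for the dispatcher
        self.next_slot = 0

    def accept(self, vpi: int, vci: int, payload: bytes) -> None:
        slot = self.next_slot
        self.next_slot += 1
        self.cell_sram[slot] = payload                                # step 2: write to SRAM
        target = LOCAL if (vpi, vci) in self.local_vcs else TRANSIT  # step 3: VC look-up
        self.fifo.append((slot, target))

    def dispatch(self):
        slot, target = self.fifo.popleft()                            # step 4: dispatcher
        return target, self.cell_sram.pop(slot)


apic = ToyApic(local_vcs={(0, 5)})
apic.accept(0, 5, b"for the local device")
apic.accept(0, 9, b"for a downstream APIC")
first = apic.dispatch()
second = apic.dispatch()
print(first, second)
```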

For sending, the Pacer is responsible for implementing the pacing algorithm:

1. It generates a pacing tag containing the address in external memory from where the cell (or batch) contents are to be read, the ATM header to use, the destination ATM port and some other information.

2. The tag is placed in the TagFIFO from where it can be read by the Host Memory Interface.

3. The HMI will then fetch the data and place it to the Transmit FIFO.

4. The Transmit Formatter extracts cells from the Transmit FIFO, takes care of the necessary header information and places them into the Cell SRAM.

5. From this point on, the Dispatcher treats the cell exactly the same way as if it was a transit cell, and forwards it to the output port.

This design allows the APIC to serve as a single chip ATM host-network interface. Using multiple chips allows going beyond traditional network interfacing and providing DAN capabilities.

4. MyriNet

MyriNet is a local area network that employs the same technology used for packet switches within massively parallel processors. It offers a remarkable cost/performance ratio among high speed LANs. Its main disadvantage is the 25 meter limit on communication channel length. This limit follows from the low error rate that MPP architectures assume. Optical-fiber translators are being developed so that MyriNet packets can span longer distances.

A MyriNet bi-directional link has an aggregate throughput of 1.28 Gbps (640 Mbps * 2). This rate, however, is achieved only with channel lengths of up to 10 m.

MyriNet switches can be connected in an arbitrary topology, and regular topologies can be used to take advantage of their routing efficiency and scalability. Cut-through routing is employed at the switches, meaning that the packet advances into the required outgoing channel as soon as the header arrives and is decoded. Flow control is used to block a packet when the outgoing channel is busy; packet buffering is thereby avoided, at the expense of providing flow control mechanisms on each link.
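To illustrate why cut-through routing keeps latency nearly independent of hop count, here is a simplified latency model. It ignores propagation delay, header decoding time and blocking, and the 16-bit header size is an assumed placeholder rather than a MyriNet figure.

```python
def store_and_forward_us(packet_bits: int, link_mbps: float, links: int) -> float:
    """Each switch receives the whole packet before sending it on (bits / Mbps = microseconds)."""
    return links * packet_bits / link_mbps


def cut_through_us(packet_bits: int, header_bits: int, link_mbps: float, links: int) -> float:
    """Each switch forwards as soon as the header has arrived; the body is pipelined."""
    return (links - 1) * header_bits / link_mbps + packet_bits / link_mbps


PACKET_BITS = 8192 * 8  # an 8192-byte message
saf = store_and_forward_us(PACKET_BITS, 640.0, links=2)  # two links, one switch between
ct = cut_through_us(PACKET_BITS, 16, 640.0, links=2)     # header size is an assumption
print(f"store-and-forward: {saf:.3f} us, cut-through: {ct:.3f} us")
```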

Flow control is managed by having the receiver inject Stop and Go control symbols into the opposite-going channel of the link. The receiver includes a queue-organized slack buffer. If the downstream flow is blocked so that the slack buffer is filled up to the Stop limit the receiver generates a Stop. When the downstream flow resumes and reaches the Go limit a Go is generated. Thanks to this flow control mechanism mixed data rates depending on communications link length may be accommodated automatically.
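The Stop and Go mechanism can be sketched as a buffer with two thresholds. The capacity and limits below are made-up values, and the class is a toy model of the receiver side only.

```python
class SlackBuffer:
    """Toy receiver-side slack buffer that emits STOP and GO control symbols."""

    def __init__(self, capacity: int, stop_limit: int, go_limit: int) -> None:
        if not go_limit < stop_limit <= capacity:
            raise ValueError("need go_limit < stop_limit <= capacity")
        self.capacity, self.stop_limit, self.go_limit = capacity, stop_limit, go_limit
        self.fill = 0
        self.sender_stopped = False

    def arrive(self, n: int = 1):
        """n symbols arrive from upstream; return 'STOP' if one must be sent back."""
        self.fill = min(self.capacity, self.fill + n)
        if not self.sender_stopped and self.fill >= self.stop_limit:
            self.sender_stopped = True
            return "STOP"
        return None

    def drain(self, n: int = 1):
        """n symbols leave downstream; return 'GO' if the sender may resume."""
        self.fill = max(0, self.fill - n)
        if self.sender_stopped and self.fill <= self.go_limit:
            self.sender_stopped = False
            return "GO"
        return None


buf = SlackBuffer(capacity=16, stop_limit=12, go_limit=4)
arrivals = [buf.arrive() for _ in range(12)]  # fill up to the Stop limit
drains = [buf.drain() for _ in range(8)]      # drain down to the Go limit
print(arrivals[-1], drains[-1])
```

The gap between the two limits provides hysteresis, so the receiver does not alternate Stop and Go on every symbol.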

Of particular interest, and a key point in the MyriNet architecture, is the MyriNet host/network interface. It executes a control program that interacts directly with host processes and in this way avoids the multiple copies that typical protocol implementations require. It handles sending, receiving, packet buffering, and network mapping and monitoring. Sustained one-way data rates of 550 Mbps can be achieved between user processes and the network on Pentium Pro hosts when using the API and Myricom's control program.

4.1 Measurements on Myrinet

A Myrinet network is installed for experimentation at UCLA. The installation on which I ran the measurements consists of five Myrinet switches and 6 hosts. Hosts can be connected to and disconnected from the network at any time, so the topology looks different each time a user unplugs a cable from a Myrinet switch. The topology at the time I ran the measurements is shown in the next picture.

The MCP program provided by Myricom runs complete monitoring on the host network interface LANai chip. An application interface is provided for user access to the MCP's internal monitoring variables. Commands for creating traffic are also available.

The following diagram describes the layering of the programs provided with Myrinet.

Applications using API    ht for creating traffic
Mt                        User friendly sending and receiving
API                       Low level primitives for handling NI queues
MCP                       LANai chip

Measured bandwidth (Mbps) by packet size (bytes):

Packet size   Nothing special   Flow control   Copy     Copy and flow control
4             0.2               0              0.2      0
16            0.52              0.3            0.51     0.31
32            0.9               0.27           0.9      0.3
56            1.46              0.89           1.48     0.94
100           2.51              0.68           2.51     0.16
512           12.16             0.42           12.07    0.51
576           13.47             0.02           13.7     1
640           15.1              0.61           15.01    0.5
768           17.85             0.45           17.89    0
896           20.4              1.02           20.6     1.2
1024          23.26             1.96           23.53    0.04
3000          60.57             1.07           60.49    1.12
5000          86.81             0.2            87.67    0.16
8192          113.21            0.32           117.54   0.32
9200          105.48            0.36           107.46   0.36

4.1.1 One to one unicast connection

I measured the bandwidth achieved by an application using the Mt library, with the program ht used to generate and receive traffic. The table gives the results of sending from one host (barbican-ssn) to another (shloomiel-ssn) separated by one Myrinet switch. Additional experiments were performed. Since the network was operating at very low load, the distance of two hops affected the results only slightly. Cut-through switching contributes to that.

Flow control is performed at the application level with minimum buffering, and essentially just prevents the application from sending until all the bytes have been received and acknowledged. I did not find sufficient information on how the API handles flow control, so I am reporting only this one. In any case these results are not related to that subject.

Columns featuring data copying use the Copy Block to buffer the data between the API and the network. The Copy Block is a chunk of kernel memory that the MCP uses for DMA transfers. The results show that DMA transfers can be more efficient than CPU transfers for large messages. See the section on hardware optimizations for the reasons why.

The anomaly appearing at message sizes of 8192 and 9200 bytes is attributed to buffer size rounding problems. An integral number of 8192-byte messages is likely to fit into a buffer of any size, whereas a 9200-byte message is more likely to be dropped, or stopped when flow control is used.


The most striking observation is that the maximum bandwidth achieved at low load by an application using the Mt library is approximately 120 Mbps.

4.1.2 Small messages

We see a dramatic and steady increase in the effective (end to end, peer to peer, i.e. ht to ht) bandwidth used by the connection as the message size increases. Small messages suffer from latencies that are large compared to their transfer time, resulting in very small bandwidths. Myrinet includes a send immediate primitive in the API that bypasses the NI queues and writes the message into the LANai directly from the host. This saves the overhead of setting up the DMA in the MCP and so reduces latency for small messages. It does not work for messages larger than 1K. The effect on throughput when using small messages is shown in the graph.
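One way to see the fixed per-message cost is to convert measured bandwidth back into time per message. The sketch below uses four rows of the "nothing special" column from the unicast results and assumes those values are in Mbps; the function name is illustrative.

```python
# (message size in bytes, measured bandwidth), "nothing special" column.
# Assumption: the bandwidth values are in Mbps.
samples = [(4, 0.2), (100, 2.51), (1024, 23.26), (8192, 113.21)]


def time_per_message_us(size_bytes: int, bandwidth_mbps: float) -> float:
    """Bits divided by megabits per second gives microseconds per message."""
    return size_bytes * 8 / bandwidth_mbps


for size, bandwidth in samples:
    print(f"{size:5d} bytes: {time_per_message_us(size, bandwidth):7.1f} us per message")
```

Under that assumption, the time per message grows far more slowly than the message size, which is consistent with a large fixed cost per message dominating the small-message case.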

We see that for message sizes of more than 500 bytes we may get less throughput when the message size causes fragmentation in the mapped LANai buffers, i.e. when the message size is not a power of two. In other cases bypassing the DMA is beneficial.

A discussion of the importance of small messages can be found in [1].

4.2 Multicasting on MyriNet

Multicasting can be achieved in two ways in MyriNet.

1. Sending one packet along a cycle - the sender sends one copy of the packet, which passes through all the receivers. Ideally each intermediate receiver should be able to copy the packet on the fly and forward it to the next receiver in the cycle. However, the MyriNet host network interface does not support this. Consequently the packet is stored at every intermediate station before being forwarded to the next station in the group.

2. Sending multiple packets - the sender sends multiple copies of the same packet each destined for one different receiver in the group. This generates a lot more traffic but potentially reduces the latency experienced by each receiver.

4.2.1 Attempt to Produce Measurements for Multicasting

It would be interesting to measure the effective bandwidth delivered to multicast applications using the modified MyriNet Control Program to support multicasting.

Presently, ways to do multicasting on Myrinet are being studied at UCLA. The MCP program has been changed to handle multicast addresses, and multicast functions have been introduced in the API, each pairing with the corresponding unicast function. The Mt library is used just to call the API functions with the appropriate parameters to select unicast or multicast operation. Currently a few changes in function calls and a recompilation of the Mt library are needed to switch between the unicast and the multicast functions.

Unfortunately, at the time of this report multicasting was not operable, and my efforts to make it work could only address superficial defects. Apparently it was necessary to look deeper into the API installation. The problem, finally, was that part of the LANai memory had been corrupted by misbehaving experiments run on the Myrinet. Some multicast measurements were produced, although multicast became operable too late for this report to cover them in depth.


Before getting multicast to work, I attempted to measure multicast on Myrinet emulated by unicast connections, i.e. the multiple-packet approach. The system configuration, though, did not allow stressing the system enough to produce meaningful results. The basic limitation was that multicast had to be emulated using many unicast connections, whereas the Myricom settings allow for only 3 user API interfaces on each host network interface card. Interface 0 is used by the IP driver and must always be active, since it is used to set the MCP's address and DMA bursts and to handle resets. Running one API program per interface would hardly stress the system.

The results given in the following table are for a multicast group of 2 receivers (milanet and shloomiel) and of 3 receivers (milanet, shloomiel and aldgate), to be compared with a 3 and a 4 member group respectively, since in the cycle approach the sender needs to be a member. Barbican is the sender in this case. The worst case receiving bandwidth is reported in each case.

The negative spike occurring at 8K messages with 3 receivers is attributed to a synchronization that may occur especially when using a message size that fits the NI queue sizes exactly. It is an outcome of not actually scheduling the sends but using a different application layer for each unicast connection, so it will be ignored.

Performance degrades slightly even when adding 1 member to a 3-member group, which may suggest a more observable degradation for larger groups. However, this is conjecture, and the actual response from Myrinet must be awaited before suggesting anything. Note that this is the largest group one can have given 3 API interfaces.

4.2.2 Measurements with multicast version of ht

I was given access to 4 machines connected to the Myrinet network at UCLA. These are barbican-ssn, shloomiel-ssn, aldgate-ssn and milanet-ssn. Consequently the largest group that I could create (with mkcirct) has 4 members. Since the sender is included in the group, I had 3 receivers. The hops in the group are in the order mentioned here, with barbican-ssn as the sender.

Limited by the system's available interfaces, I could furthermore have up to 3 connections for the group. Using the same multicast group with different connections can be considered equivalent to having different multicast groups. I chose this approach because it stresses the system more.


The measurements appear in the table as the average bandwidth received by each receiver (all the receivers received the same bandwidth). The bandwidth is sampled every 2 seconds, measured over those 2 seconds, and the samples are then averaged arithmetically. Each line represents a different number of active connections.

One can observe that when one group is active, using larger messages allows more bandwidth to be received. The utilization of the links also increases when more connections for the multicast group (or equivalently more groups) are active simultaneously. The maximum bandwidth was achieved for 3 active group connections sending 8192-byte messages.

I will not try to draw conclusions on performance from these measurements. The system needs to be stressed more before anything can be suggested about multicasting performance or Myrinet link utilization.
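The sampling procedure can be sketched as follows. The cumulative byte counter, the function name and the sample values are assumptions for illustration; they are not taken from ht.

```python
def average_bandwidth_mbps(byte_counts, interval_s: float = 2.0) -> float:
    """Arithmetic mean of per-interval bandwidths.

    byte_counts: cumulative bytes received, read once every `interval_s` seconds.
    """
    rates = [
        (later - earlier) * 8 / interval_s / 1e6  # bytes in one interval -> Mbps
        for earlier, later in zip(byte_counts, byte_counts[1:])
    ]
    if not rates:
        raise ValueError("need at least two samples")
    return sum(rates) / len(rates)


# Made-up counter readings taken 2 seconds apart.
readings = [0, 2_000_000, 5_000_000, 7_000_000]
average = average_bandwidth_mbps(readings)
print(f"{average:.2f} Mbps")
```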

5. Comparison of the approaches presented

To quantitatively evaluate and compare these design proposals, we must be able to assess the hardware in conjunction with the software that provides a specific set of communication operations to the user. Even though it should be possible to use any of the popular communication layers such as TCP/IP for this evaluation these layers impose large software overheads which obscure many other factors as suggested in [S5]. It is proposed in [S1] that Active Messages provide for a systematic way to assess the combination of fast hardware and software delivering communication performance to the applications.

In this paper the comparison aims rather to characterize the approaches qualitatively, since the decisions made for each of them were driven by an inherently different target, even though the final goal is the same: to deliver the raw bandwidth of high speed networks to the applications. Thus U-Net defines an architectural model that subsequent designs use as a baseline, APIC attacks the bottleneck itself rather than minimizing its impact, and MyriNet is an integrated low cost solution for present systems.

Targets compared: U-Net, APIC, MyriNet
Architectural Assumptions
  U-Net: Off-the-shelf hardware components
  APIC: Custom I/O bus
  MyriNet: Custom network switches
Approach
  U-Net: Avoid copies using OS and hardware on the network adapter
  APIC: Avoid system bus crossings by providing a flexible new architecture for an I/O bus
  MyriNet: Provide a cost effective, flexible integrated network
Key idea
  U-Net: Entire protocol stack can be placed at user level with minimum help from hardware and OS
  APIC: Extend network switches to a DAN, thus bringing the network closer to the CPU
  MyriNet: Employ technology used in MPP networks to achieve high throughputs
Main Consideration for End to End Performance
  U-Net: Bandwidth and latency for small messages
  APIC: Bandwidth
  MyriNet: Bandwidth
Bandwidth
  U-Net: 126 Mbps on a prototype for the model
  APIC: 2.4 Gbps
  MyriNet: 1.28 Gbps
Optimized TCP Bandwidth
  U-Net: 115 Mbps for messages of more than 1000 bytes on a 50/60 MHz SuperSPARC with SBA-200 interface to Fore ATM switches
  APIC: N/A
  MyriNet: 70 Mbps on a SPARC/2 20 MHz SBus
Advantages
  • Low small message latency
  • Off-the-shelf platforms
  • Flexible architectural model
  • The only approach that overcomes the physical limitations placed by I/O bus bandwidth
  • Allows remote controlling of devices residing on custom bus
  • Allows hardware Coding Decoding Schemes with minimal CPU interference
  • Single chip host network interface
  • Allows multicasting within the bus
  • Integrated h/w support for protocols
  • Very high cost/performance factor
  • LANai processor can be programmed to support user-defined protocols based on user needs
  • Flexibility at supporting existing protocols
  • Low error rates
  • Self mapping and monitoring
Disadvantages
  • Zero Copy although provided by the model cannot be supported with off-the-shelf components
  • Requires a whole new host architecture
  • Very expensive hardware
  • No inherent multicasting
  • 10 meter maximum link length for high bandwidth operation (25m for 640 Mbps)

Acknowledgments

Special thanks to Simon Walton for his valuable contribution and answers.

References

[1] A Systematic Approach to Host Interface Design for High-Speed Networks - Steenkiste - IEEE 1994

[2] U-Net: A User Level Network Interface for Parallel and Distributed Computing - von Eicken, Basu, Buch, Vogels - ACM 1995

[3] Design of the APIC: A High Performance ATM Host-Network Interface Chip - Dittia, Cox Jr, Parulkar - IEEE 1995

[4] Catching Up With the Networks: Host I/O at Gigabit Rates - Dittia, Cox Jr, Parulkar - Technical Report WUCS-94-11

[5] MyriNet: A Gigabit-per-Second Local Area Network - Boden, Cohen, Felderman, Kulawik, Seitz, Seizovic, Su - IEEE 1995

Other Sources

[S1] Assessing Fast Network Interfaces - Culler, Liu, Martin, Yoshikawa - IEEE 1996

[S2] Programming the LANai at UCLA - Walton - UCLA 1995

[S3] Myrinet Reference Manuals - MyriCom

[S4] Multicasting in SSN - Gerla, Gafni

[S5] An Analysis of TCP Processing Overhead - Clark et al - IEEE 1989