Introduction

Hello everyone! It’s been a while. I am immersing myself in the UE spec once again, and this time I want to talk about libfabric. This is the sort of topic that we network engineers tend to gloss over, thinking it’s not important, but I believe understanding all the pieces of the puzzle makes us better engineers. Remember, you are not only building networks; you are partnering with other teams to build a supercomputer, a whole system. The better you know what those teams are talking about, the better decisions you’ll make about the network. So without further ado, let’s get into it.

What is Libfabric ?

Libfabric is a software component that acts as an abstraction layer between the application and the transport layer. The benefit is that the application can send and receive messages without needing to be aware of the internal workings of the technology-specific transport layer (IB verbs, AWS EFA, UET, etc.).

There are two layers to libfabric :

The application layer : This is the upper layer, the one that communicates with the application. The application expresses its intent to send a message to a remote host in the same way regardless of the transport mechanism (RDMA, IB, UET, EFA, etc.).
The provider : This is the transport-specific layer, where the application's API calls are translated into low-level transport operations. The provider’s job is also to translate those transport operations into low-level NIC operations.
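To make the two-layer split concrete, here is a minimal sketch in plain C. It is a toy model, not libfabric code: the names provider_ops, verbs_send, tcp_send and app_send are all hypothetical, and the "wire formats" are invented. It only shows the idea that the application calls one function while the selected provider decides what actually goes out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical provider interface: one function-pointer table per transport. */
struct provider_ops {
    const char *name;
    size_t (*send)(const void *buf, size_t len, char *wire, size_t wire_cap);
};

/* Toy "verbs" provider: copies the payload to the wire unchanged. */
static size_t verbs_send(const void *buf, size_t len, char *wire, size_t wire_cap)
{
    assert(len <= wire_cap);
    memcpy(wire, buf, len);
    return len;
}

/* Toy "tcp" provider: prepends an invented 1-byte length header. */
static size_t tcp_send(const void *buf, size_t len, char *wire, size_t wire_cap)
{
    assert(len < 256 && len + 1 <= wire_cap);
    wire[0] = (char)len;
    memcpy(wire + 1, buf, len);
    return len + 1;
}

const struct provider_ops verbs_provider = { "verbs", verbs_send };
const struct provider_ops tcp_provider   = { "tcp",   tcp_send };

/* Application-facing call: identical no matter which provider is selected.
 * Returns the number of bytes the provider placed on the wire. */
size_t app_send(const struct provider_ops *prov, const char *msg,
                char *wire, size_t wire_cap)
{
    return prov->send(msg, strlen(msg), wire, wire_cap);
}
```

Calling app_send(&verbs_provider, "hello", wire, sizeof wire) and app_send(&tcp_provider, "hello", wire, sizeof wire) uses the exact same application code, yet the bytes produced differ per provider. That is the whole point of the abstraction.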

Libfabric Software Architecture

The Libfabric SW architecture is broken down into four main categories :

Control (discovery)

This layer allows the application to talk with the underlying hardware and discover what providers are available and what capabilities those providers have. If the application runs on your laptop vs on a server that has an IB HCA the responses will be different.

For example, the application can call fi_getinfo() to query the available providers, and each provider responds with something like the following :

For InfiniBand:

  info->domain_attr->name = "mlx5_0"
  info->fabric_attr->prov_name = "verbs"

For TCP sockets:

  info->domain_attr->name = "lo"
  info->fabric_attr->prov_name = "sockets"

So it’s basically the application side asking the provider “Who are you and what are you capable of ?” and the provider side responding “I’m verbs (aka Bob), running on mlx5_0, and here’s what I can do: Remote Memory Access (RMA), atomics, etc.”
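The "what are you capable of ?" exchange can be sketched as a capability filter. This is a simplified model in plain C, not the real fi_getinfo() implementation: the CAP_* bits stand in for libfabric's capability flags, and the provider names (hw_rdma, sw_basic) and their capability sets are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits, standing in for libfabric's capability flags. */
enum {
    CAP_MSG    = 1u << 0,
    CAP_RMA    = 1u << 1,
    CAP_ATOMIC = 1u << 2,
    CAP_TAGGED = 1u << 3
};

struct provider_info {
    const char *prov_name;  /* made-up provider name */
    const char *domain;     /* made-up device / interface name */
    uint32_t    caps;       /* what this provider claims to support */
};

/* Invented inventory: what "this machine" happens to offer. */
const struct provider_info inventory[] = {
    { "hw_rdma",  "nic0", CAP_MSG | CAP_RMA | CAP_ATOMIC | CAP_TAGGED },
    { "sw_basic", "lo",   CAP_MSG | CAP_TAGGED },
};

/* Return every provider whose capabilities include ALL requested bits.
 * Writes at most max_matches pointers and returns how many were written. */
size_t discover(uint32_t wanted_caps, const struct provider_info **matches,
                size_t max_matches)
{
    size_t n = 0;
    assert(matches != NULL);
    for (size_t i = 0; i < sizeof inventory / sizeof inventory[0]; i++) {
        if (n < max_matches && (inventory[i].caps & wanted_caps) == wanted_caps)
            matches[n++] = &inventory[i];
    }
    return n;
}
```

Asking for CAP_MSG alone matches both entries, while asking for CAP_MSG | CAP_RMA leaves only hw_rdma. This mirrors why the same query gives different answers on a laptop and on a server with an RDMA-capable NIC.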

Communication (connection management, address vectors)

This is the component used to set up connections and track peer addresses. It generally consists of 3 steps :

Creating an endpoint : During this step we create an endpoint, which is the source and destination of all operations to and from the application. Common endpoint types are :
1. FI_EP_MSG → reliable, connection-oriented messaging (similar to a TCP socket, supports fi_send/fi_recv).
2. FI_EP_RDM → reliable datagram: connectionless, supports RMA/atomics, unordered by default.
3. FI_EP_DGRAM → unreliable datagram.
Populating the address vector : We add the remote hosts that we want to communicate with to the address vector (AV). Think of it as your phone book, where you can look up all of your friends' phone numbers. Address vectors are used by the connectionless endpoint types (FI_EP_RDM and FI_EP_DGRAM). Please note that libfabric does not handle how both ends of the communication exchange addresses; that step is handled by the upper layers.
Connection management : This component handles the setup of connections for the endpoint types that require it (FI_EP_MSG).

So the application side of the component is saying “Now that I know what you're capable of, let me set up a communication path to my peer”

The provider is setting up that connection with the specifics of the underlying hw capabilities.

The examples below show creating an address vector (used by connectionless endpoints) and then opening a connection (used by connection-oriented FI_EP_MSG endpoints) :

Creating an address vector (AV) :

struct fid_av *av;                  // Declare a pointer to an Address Vector object
struct fi_av_attr av_attr = {0};    // Create and initialize the AV attributes to zero
av_attr.type  = FI_AV_MAP;          // Choose "map mode" (IDs map directly to addresses)
av_attr.count = 16;                 // Reserve space for up to 16 addresses

int ret = fi_av_open(domain, &av_attr, &av, NULL); // Call libfabric to create the AV
printf("fi_av_open() returned %d\n", ret);         // Print the return code (0 means success)

Inserting a peer's address into the AV :

fi_addr_t peer_id;                                 // Will hold the short ID for the peer
struct sockaddr_in peer_sockaddr = { ... };        // The peer's real network address (like IP+port)

// fi_av_insert() returns the number of addresses successfully inserted
int inserted = fi_av_insert(av, &peer_sockaddr, 1, &peer_id, 0, NULL);
printf("fi_av_insert() inserted %d, peer_id=%llu\n", inserted, (unsigned long long)peer_id);

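Initiating the connection (connection-oriented FI_EP_MSG endpoints) :

// fi_connect() takes the peer's raw address, not an fi_addr_t from the AV
int ret = fi_connect(ep, &peer_sockaddr, NULL, 0);
printf("fi_connect() returned %d\n", ret);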

Polling the event queue to validate connection status

struct fi_eq_cm_entry entry;                       // Will hold connection management info
uint32_t event;                                    // Will hold the event type (e.g., FI_CONNECTED)
ssize_t rd = fi_eq_read(eq, &event, &entry, sizeof(entry), 0);   // Non-blocking read of the next event
printf("fi_eq_read(): event=%u (should be FI_CONNECTED), rd=%zd bytes read\n", event, rd);

If the connection was successful you would expect this type of output :

fi_eq_read(): event=2 (FI_CONNECTED), rd=32 bytes read
Connection established with peer.

Completion

The completion part of libfabric is used to track the status of send/receive operations between the NIC and the application. The application is responsible for polling or waiting on these queues, similar to how a network operator would poll counters to watch for incrementing CRC errors on an interface. It’s composed of 3 parts :

Event Queue (EQ) : the EQ keeps track of fabric control events such as established connections, peers shutting down connections, connection requests etc.
Completion Queue (CQ) : Tracks the success or failure of individual operations. This lets the application know, for example, that the buffer allocated to an operation is free again and can be reused for another operation, or that the operation has failed and why it failed.
Counter : The Counters keep track of the number of successfully completed operations. The benefit of using counters is that it’s a lightweight operation to keep track of progress. For example if you have posted 10000 sends you can just keep polling the counters until they reach 10000. If it doesn’t reach that number you can then dive into the completion queue and see if there has been an error.

These queues are owned by software, but they are produced or consumed by the NIC. Take the completion queue: when the NIC completes an operation, for example writing data into a remote host’s buffer, it writes an entry directly into the completion queue in host memory, where it can be read directly by the application.
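The "poll the counter first, dig into the completion queue only if something is off" pattern can be sketched as follows. This is a toy model in plain C with hypothetical names (endpoint_model, nic_complete, check_progress); it does not use the libfabric API and simplifies things by keeping successes and failures in a single queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical completion record, loosely modeled on a CQ entry. */
struct completion {
    uint64_t op_id;   /* which posted operation finished */
    int      status;  /* 0 = success, positive = error code */
};

#define CQ_DEPTH 8

struct endpoint_model {
    uint64_t          success_cntr;   /* counter: successful completions only */
    struct completion cq[CQ_DEPTH];   /* completion queue: one entry per op */
    size_t            cq_len;
};

/* "NIC side": record the outcome of one finished operation. */
void nic_complete(struct endpoint_model *ep, uint64_t op_id, int status)
{
    assert(ep->cq_len < CQ_DEPTH && status >= 0);
    ep->cq[ep->cq_len].op_id = op_id;
    ep->cq[ep->cq_len].status = status;
    ep->cq_len++;
    if (status == 0)
        ep->success_cntr++;
}

/* "Application side": cheap counter check first, CQ scan only on a shortfall.
 * Returns 0 if all `posted` operations succeeded, the error code of the first
 * failed operation (its id is stored in *failed_op), or -1 if nothing has
 * failed and some operations are simply still in flight. */
int check_progress(const struct endpoint_model *ep, uint64_t posted,
                   uint64_t *failed_op)
{
    if (ep->success_cntr == posted)
        return 0;
    for (size_t i = 0; i < ep->cq_len; i++) {
        if (ep->cq[i].status != 0) {
            *failed_op = ep->cq[i].op_id;
            return ep->cq[i].status;
        }
    }
    return -1;
}
```

In the common case the application only compares one integer against the number of posted operations, which is why counters are the lightweight option. The per-operation detail in the queue is consulted only when the numbers do not add up.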

Data Transfer

Now that we have discovered our capabilities, set up our communication, and put our tracking queues in place, we can actually start sending and receiving data. This is the part of libfabric that moves data between applications across the fabric. Libfabric supports several communication types, let’s see what they are :

Messages : These are basic send/receive operations (fi_send / fi_recv). This is a two-sided operation: the sender posts a message, and the receiver must post a matching receive, ideally in advance.
Remote Memory Access (RMA) : These are operations that read from or write to the remote host’s registered memory directly (fi_read / fi_write). This is a one-sided operation: the initiator directly reads or writes the remote host’s registered application memory. Think of RMA as a direct memory-to-memory transfer over the network, like sending a packet straight into a remote host’s buffer without involving the remote CPU.
Tag matching : This is a messaging variant (fi_tsend / fi_trecv) where every send carries a tag, which allows the receiver to filter on those tags. The NIC can deliver each tagged message to a different receive buffer in the application based on its tag. This is useful because HPC applications carry a lot of different types of traffic. For example, you can tag all the control traffic with one tag and the data with another, so that control traffic does not have to wait behind data traffic in the same receive queue before being processed.
Atomics : Atomics are like RMA in the sense that they’re one-sided operations which allow a host to read, write, or modify a remote host’s memory. The difference is that RMA simply reads or writes the remote buffer, while atomics go through a read-modify-write cycle that is performed as one indivisible step, so two hosts cannot clobber each other's modifications of the same remote memory location. For example, let’s say Host A and Host B both want to update a counter in Host C’s memory, which starts at 0. With RMA, Host A writes 10 and Host B writes 20, and the counter ends up holding whichever value was written last (either 10 or 20). With atomics, Host A performs an atomic add of 10 and Host B performs an atomic fetch-and-add of 20. Host B's operation fetches the current value (10 if Host A's add landed first) and adds its own 20, so after both operations the counter value is 30, regardless of the order in which they arrive.

How does it relate to Ultra-Ethernet ?

Ultra-Ethernet uses Libfabric as its communication abstraction layer because it is already widely used in the HPC and AI/ML industry. UET comes in as a Libfabric provider, so applications can leverage Ultra-Ethernet’s high-speed, low-latency capabilities without rewriting application code or application-facing APIs. This makes it easier for AI infra engineers to deploy UET in existing HPC and AI clusters.

UET Libfabric Provider Software Architecture

Before initiating communications, there first has to be a setup phase where job IDs, security keys/partitions, and resource allocation are assigned. Prior to UET, there was no single consistent way of doing so; it was scattered through different vendor-specific libraries, kernel drivers, etc. To simplify this, UET introduces a standardized control API built on Netlink. This aims to provide a standard and consistent way to configure UET NICs regardless of the provider. This control API requires vendors to provide a kernel driver in order to facilitate the setup operations. This is kind of like the control plane vs data plane we are used to in networking.

So when a communication is initiated by an orchestrator (Slurm or Kubernetes, for example), it first uses the control path through the kernel driver to set up the job (partition and security enforcement, JobID assignment, etc.). Once the setup is complete, the application uses the kernel-bypass path to transfer data at very high speeds. Bypassing the kernel is what accelerates the data transfer significantly.

Quick Local Test

I installed Libfabric on my local machine and tested it by executing fi_info just to see what it practically looked like :


provider: tcp
    fabric: 192.168.1.0/24
    domain: en0
    version: 203.0
    type: FI_EP_RDM
    protocol: FI_PROTO_XNET

provider: tcp → using the TCP provider (standard sockets, not RDMA).
fabric: 192.168.1.0/24 → communication happens over your local subnet.
domain: en0 → bound to your network interface en0.
version: 203.0 → provider version; for the built-in providers this encodes the libfabric release, here 2.3.0.
type: FI_EP_RDM → reliable datagram endpoints (message-based, reliable, connectionless, unordered by default).
protocol: FI_PROTO_XNET → libfabric’s internal reliable protocol over TCP.
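For the curious, here is how a value like 203.0 can be unpacked. The bit layout below mirrors libfabric's FI_VERSION / FI_MAJOR / FI_MINOR macros (major in the upper 16 bits, minor in the lower 16). The second step is an assumption on my part: that built-in providers report a major field of release_major * 100 + release_minor and a minor field of revision * 10. The names VER_PACK, release and decode_prov_version are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit layout as libfabric's FI_VERSION / FI_MAJOR / FI_MINOR macros. */
#define VER_PACK(major, minor) ((uint32_t)(((major) << 16) | (minor)))
#define VER_MAJOR(v)           ((v) >> 16)
#define VER_MINOR(v)           ((v) & 0xFFFF)

/* Sanity check of the packing: 203 is 0xCB, placed in the upper 16 bits. */
static_assert(VER_PACK(203, 0) == 0x00CB0000u, "major goes in the upper 16 bits");

struct release {
    unsigned major, minor, revision;
};

/* ASSUMPTION: provider version = (rel_major * 100 + rel_minor).(revision * 10).
 * Under that assumption, recover the libfabric release from a packed value. */
struct release decode_prov_version(uint32_t prov_version)
{
    struct release r;
    uint32_t hi = VER_MAJOR(prov_version);
    uint32_t lo = VER_MINOR(prov_version);

    r.major    = hi / 100;
    r.minor    = hi % 100;
    r.revision = lo / 10;
    return r;
}
```

Under that assumption, decode_prov_version(VER_PACK(203, 0)) yields release 2.3.0, which matches the fi_info output above.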

Want to learn more ?

Libfabric github repo readme
Libfabric developers guide on the libfabric website.
For details on the UET provider you can find that in chapter 2 of the UET spec.
A Brief Introduction to OpenFabrics (libfabric)
Tutorial: OFI Libfabric

What's next ?

For me this wasn’t the easiest topic as I don’t have a software engineering background, but it’s nonetheless useful to cover in order to see the big picture and to understand Ultra-Ethernet in depth. In the next blog post, things will get interesting as we start on the transport layer, the heart of the UE spec. I am excited to start digging into that. I hope this was informative for you and, as usual, let’s engage on LinkedIn or in the comments below. Let me know if I missed something or if anything wasn’t clear enough. Thanks for reading and see you soon!

Ultra Ethernet 1.0 Specification: What Is Libfabric ?
