This is a different kind of article about the Lightning Network. Many articles discuss the mechanism of channels. However, nodes in the network perform two primary functions: creating channels and gossiping information. This series of articles is going to address the less talked about part of the Lighting Network: routing gossip.
This article will be an introduction to gossip. Subsequent articles will break down those topics in detail that goes beyond the BOLT specification and would be useful for anyone taking a deep dive into Lightning Network routing gossip. So let's get started!
The Lightning Network consists of nodes and channels. A node is a machine running the Lightning Network software (LND, c-lightning, Eclaire, rust-lightning, etc). A node performs two primary functions: establishing channels with other nodes and gossiping about those channels. When two nodes create a channel, those two nodes can send payments back and forth along the channel trustlessly, instantaneously, and with no fees.
Channels create connections between the nodes. If we stop and think about nodes and channels, these naturally form a graph where the channels make up the edges of the graph. You have likely seen a graph of the Lightning Network visualized such as through our network visualizer.
This graph isn't just for looks. The creation of a graph is necessary to support one of the neat aspects of the Lightning Network: routing payments through the network. A node in a channel can trustlessly send a payment to another node in the network without requiring a direct channel to that other node.
For example, think about four independent nodes operated by Alice, Bob, Carol and Dave:
- Alice has a channel to Bob
- Bob has a channel with Carol
- Carol has a channel with Dave
The Lightning Network allows Alice to pay Dave by routing a payment through Bob and Carol. This is referred to as a routed payment.
The graph is a directed graph with edges originating on both side. That is a payment can be routed through Alice to Bob and a payment can be routed through Bob to Alice. Now this may sound like an undirected graph since you can traverse in both directions, but it is highly dependent on the balances on each side.
Lightning channels have balances on both side of the channel. A payment that could be sent from one side might not be possible from the other side if there are not sufficient funds. For example, if Alice has 1BTC on her side and Bob has 0BTC, it's not possible to route a payment from Bob to Alice since he has no funds to send!
Because Lightning channel balances are not publicly known this presents an interesting problem for path traversal. We know that an edge exists in both directions, but we don't know if a payment will succeed.
Now we won't dive into the mechanics of path finding or how routed payments work just yet in the series. Instead we're going to focus on the prerequisite for all that fun: how Alice even knows a potential path exists to Dave!
Enter our graph of the Lightning Network that we showed above. Alice needs a view of the nodes and channels that are capable of routing a payment to Dave. We refer to this as the routing topology.
How does Alice obtain this graph? Nodes are constantly opening and closing channels. New nodes are coming online. This is a dynamically changing environment that Alice needs to be aware of to successfully route a payment.
To solve this issue Lightning Network nodes use a peer-to-peer gossip protocol.
What is a gossip protocol?
Gossip protocols are interesting because they are relatively simple to understand but are rather difficult to define. I'll start with some background on the Computer Science side of the fence before jumping into the nitty-gritty of Lightning. So let's take a step backwards and focus the lens on what gossip protocols are and then we can apply this to Alice's problem of building and maintaining a view of the network.
Two fascinating papers discussing gossip protocols are The Promise, and Limitations, of Gossip Protocols by Berman  and Gossiping in Distributed Systems by Kermarrec and van Steen .
One of the shared themes is that gossip protocols are particularly well suited for the spread of information: referred to as information diffusion. When used in this manner, reliability of diffusion is very high, meaning we can be reasonably sure that nodes will receive the information. Information is convergently consistent, meaning after a period of time has passed it is highly probable that nodes will have the same information. Information diffusion is also fairly efficient, data generally flows across the network in a logarithmic number of steps to the size of the network. Gossip is particularly useful for situations where the network is constantly in flux.
This all sound sounds fantastic for solving Alice's problem!
So what is a gossip protocol? Over the years there have been numerous attempts to pin down a definition of what constitutes a gossip protocol. In The Promise, and Limitations, of Gossip Protocols, Berman describes some common features of gossip protocols:
- Involves periodic, pairwise, inter-process interactions
- The information exchanged is of small, bounded size
- When nodes interact, the state of one or both changes to reflect the state of the other node
- Reliable communication is not assumed
- There is some form of randomness to peer selection
These are helpful concepts but lead to a broad definition, so let's define a few different techniques for gossip.
The modern form of gossip protocols was formalized in the paper Epidemic Algorithms for Replicated Databases in 1987 by Demers et al . This paper formalized and analyzed concepts of information diffusion in distributed systems.
Demers et al applied concepts from epidemiology, the study of the spread of diseases, to discuss how information spreads through a system. There are striking similarities to information diffusion and epidemic spread. This is pretty interesting stuff given our unfortunate experiences of 2020.
Demers et al highlight two common types of gossip protocols: rumor mongering and anti-entropy.
The idea of rumor mongering for the spread of information is logically the same as how rumors spread between people: Alice has a secret that she shares with Bob and Carol. They share this information with more people. Those people share with more people, etc. Pretty soon almost everyone has the information. Eventually once people know the rumor, they stop sharing it. With rumor mongering, information spreads with minimal individual effort.
To study how information diffusion works, Demers et al, applied concepts of complex epidemics known as the SIR model where a person is susceptible to infection, infected, or removed due to death or immunity. When applied to information diffusion:
- Information is the pathogen
- Nodes without the information are susceptible to the pathogen
- Nodes become infected with that information and begin trying to infect other nodes
- Eventually a node stops trying to infect other nodes with this information, and the node is considered removed
One consideration for rumor mongering is how long a rumor remains hot. When a rumor first enters the system, it is hot, meaning it is novel. At the onset of this information entering the system there is a very high probability that another node has not seen this new piece of information. Nodes becomes infected with the information and starts gossiping about the information. Fairly quickly the node will start to encounter other nodes that are infected. This means that it becomes increasingly difficult to infect new nodes. After a threshold has been reached, a node may decide the information is no longer hot and will stop trying to gossip.
This presents some interesting choices for a protocol designer: How does an infected peer determine which nodes to try to infect? How does a node know when to stop trying to broadcast a piece of information? There many choices that can be made to optimize rumor mongering.
Demers et all also discussed another type of information spread known as anti-entropy. This type of information spread is the comparison and synchronization of state between two nodes in the network. The goal of anti-entropy is to reduce the total chaos in the system by ensuring that information between two nodes is in-sync.
Intuitively, this seems like an expensive process, and it is. The requirement that two nodes needs to synchronize state from the beginning of a system or past a certain point in time will involve a significant amount of information exchange and processing. Therefore, anti-entropy is usually used to complement rumor mongering during bootstrapping or periodic synchronization to look for missed messages.
To add a bit more complexity to the mix, there are different techniques and pro's and con's to how synchronization happens with anti-entropy. Does a node periodically push its state to another node? Does a node request information from another node? Does it do both? These are known as push, pull, and push-pull variations. Again we are presented with a variety of choices.
But back to the problem at hand, now that we know of two ways to share information and a bunch of options to tinker with. How does Alice go about learning of the nodes and channels in the Lightning Network?
Gossip Protocol for Routing Topography
Let's start at the beginning. When a node first comes online it has no routing information, there is no graph. This new node needs to construct its graph from scratch. Looking back on our gossip protocol theory we discussed above, this fits firmly in the anti-entropy category. That is, we need to perform a full synchronization from one or more peers. This is expensive, so we don't want to do it frequently.
For a new node, a natural prerequisite is actually finding its first peer. Similar to Bitcoin, a node uses DNS seeds to find its first peers. I won't go into the details of this in this article, but suffice it to say, using DNS queries we are able to find our first node to connect with.
Once a node has connected to its first peer, it performs an initial routing synchronization. This synchronization asks for a complete routing table dump from the peer. The new node receives these messages, validates them, and is now aware of the network! It can now connect with additional peers to fill in gaps that may exist in the routing table.
In some regards this synchronization is similar to Bitcoin's initial block download. In most implementations of the Lightning Network, synchronization is performed when the node comes online and uses a look back period to retrieve information it may have missed while it was offline. There are multiple ways the full synchronization can be performed and we'll discuss these in subsequent articles.
Once the initial routing sync has been completed, nodes begin periodically gossiping with their peers. New messages, once validated, are enqueued and pushed to peers periodically. This is the rumor mongering technique used for information diffusion.
Rumor mongering usually happens with peers that you are directly connected with as those connections are usually long-lived. One could argue that information diffusion could happen faster on the network if some randomized peer selection happened. We also see that information only remains hot for a single round. Since we are only sending to our directly connected peers, they become infected with the information and we no longer need to send it, so it is considered removed.
So to bring this full circle with our definition of gossip protocols:
1. Involves periodic, pairwise, inter-process interactions
Check! In Lightning we use rumor mongering to send messages to our peers on a periodic basis.
2. The information exchanged is of small, bounded size
Check! We didn't really discuss this, but Lightning Network messages have a max payload size of 65k bytes. Gossip usually involves many such messages.
3. When nodes interact, the state of one or both changes to reflect the state of the other node
Check! When using anti-entropy to perform a sync we request the remote node send us information (pull). When rumor mongering, the state of both nodes may change as both peers gossip about messages.
4. Reliable communication is not assumed
Check! Gossip on the Lightning Network is a best effort. Our peers may connect or disconnect. Messages may be sent or received more than once. We just need to be eventually consistent (on a best effort) with our peers.
5. There is some form of randomness to peer selection
Check! Lightning Network gossip is a little different because connections are long-lived with our peers. We don't randomly connect with peers in each round of gossip.
So you can see the Lightning Network definitely fits the definition of a gossip protocol by Berman by using both rumor mongering and anti-entropy.
We previously mentioned that we send many small (max size of 65k bytes) messages to gossip about the network. But what information is being exchanged?
The routing graph is constructed using three messages: channel_announcement, channel_update, and node_announcement.
The channel_announcement is the big kahuna. This message gets broadcast by a node when it has a public channel and wants to announce it to the rest of the network. This message looks the same when broadcast by either of the two nodes involved with a channel. For our graph, this generates an edge between the two nodes.
The channel_update message is broadcast by each of the nodes belonging to a channel. This message specifies the routing fees and locktime that must be used to route a payment through the channel when originating from that particular node. A channel cannot be used for routing until a channel_update has been received for at least one of the nodes.
The node_announcement message is an optional message that describes the name, address, color, and features for a particular node. This helps with identifying nodes or to limit connections to nodes that support features of interest.
So this is the basics of what is going on to obtain the routing graph. The next sequence of articles will break down the messages that are sent. Then we'll discuss how initial sync and gossip_queries are used to perform anti-entropy and finally discuss rumor mongering. Thanks for reading!
 The Promise, and Limitations, of Gossip Protocols. Ken Birman. ACM SIGOPS Operating Systems Review Volume 41, Issue 5. October 2007. Link.
 Gossiping in Distributed Systems. Anne-Marie Kermarrec and Maarten van Steen. ACM SIGOPS Operating Systems Review Volume 41, Issue 5. October 2007. Link.
 Epidemic Algorithms for Replicated Database Maintenance. Demers, Greene, Hauser, Larons, Shenker, Sturgis, Swinehart, & Terry. Proceedings of the Sixth Annual AMC Symposium on Principles of Distributed Computing 1987. Link.