DoS: Channel Open Race in CLN

CLN versions between 23.02 and 23.05.2 are susceptible to a DoS attack involving the exploitation of a race condition during channel opens. If you are running any version in this range, your funds may be at risk! Update to at least 23.08 to help protect your node.

The Vulnerability

The vulnerability arises from a race condition between two different flows in CLN: the channel open flow and the peer connection flow.

The Channel Open Flow

When a peer opens a channel with a CLN node, the following interactions occur on the CLN node.

channel open diagram

  1. The connectd daemon notifies lightningd about the channel open request.
  2. lightningd launches a new openingd daemon to handle the channel open negotiation.
  3. openingd completes the channel open negotiation up to the point where the funding outpoint is known.
  4. openingd sends the funding outpoint to lightningd and exits.
  5. lightningd launches a channeld daemon to manage the new channel.

The Peer Connection Flow

Once a peer has a channel with a CLN node, if the peer disconnects and reconnects the following occurs on the CLN node.

channel exists diagram

  1. The connectd daemon notifies lightningd about the new peer connection.
  2. lightningd calls a plugin hook notifying the chanbackup plugin about the new peer connection.
  3. chanbackup notifies lightningd that it is done running the hook.
  4. With the hook finished, lightningd recognizes that a previous channel exists with the peer and launches a channeld daemon to manage it.

The Race Condition

Problems arise when the peer connection flow overlaps with the channel open flow, causing lightningd to attempt launching the same channeld daemon twice. This can happen if the peer quickly opens a channel after connecting, and the chanbackup plugin is delayed in handling the peer connection hook, leading to the following interactions on the CLN node.

channel open race diagram

  1. The connectd daemon notifies lightningd about the new peer connection.
  2. lightningd calls a plugin hook notifying the chanbackup plugin about the new peer connection.
  3. The connectd daemon notifies lightningd about the channel open request.
  4. lightningd launches a new openingd daemon to handle the channel open negotiation.
  5. openingd completes the channel open negotiation up to the point where the funding outpoint is known.
  6. openingd sends the funding outpoint to lightningd and exits.
  7. lightningd launches a channeld daemon to manage the new channel.
  8. chanbackup notifies lightningd that it is done running the hook.
  9. With the hook finished, lightningd recognizes that a previous channel exists with the peer and attempts to launch a channeld daemon to manage it. Since the daemon is already running, an assertion failure occurs and CLN crashes.

The DoS Attack

To reliably trigger the assertion failure, an attacker needs to somehow slow down the chanbackup plugin so that a channel can be opened before the plugin finishes running the peer connected hook. One way to do this is to overload chanbackup with many peer connections and channel state changes. As it turns out, the fake channel DoS attack is a trivial and free method of generating these events and overloading chanbackup.

On a local network with low latency, I was able to generate enough load on chanbackup to consistently crash CLN nodes in under 5 seconds. In the real world the attack would be carried out across the Internet with higher latencies, so more load on chanbackup would be required to trigger the race condition. In my experiments, crashing CLN nodes across the Internet took around 30 seconds.

The Defense

To prevent the assertion failure from triggering, a small patch was added to CLN 23.08 that checks if a channeld is already running when the peer connected hook returns. If so, lightningd does not attempt to start the channeld again.

Note that this patch does not actually remove the race condition, though it does prevent crashing when the race occurs.

Discovery

This vulnerability was discovered during follow-up testing prior to the disclosure of the fake channel DoS vector. At the time, Rusty and I agreed to move forward with the planned disclosure of the fake channel DoS vector, but to delay disclosure of this channel open race until a later date.

Since the channel open race can be triggered by the fake channel DoS attack, it is a valid question how the race went undiscovered during the implementation of defenses against that attack. The answer is that the race was actually untriggerable until a few weeks after the fake channel DoS defenses were merged.

While the race condition was introduced in March 2022, the race couldn’t actually trigger because no plugins used the peer connected hook. It wasn’t until February 2023 that the race was exposed, when the peer storage backup feature made chanbackup the first official plugin to use the hook.

Timeline

  • 2022-03-23: Race condition introduced to CLN 0.11.
  • 2022-12-15: Fake channel DoS vector disclosed to Blockstream.
  • 2023-01-21: Fake channel DoS defenses fully merged [1, 2].
  • 2023-02-08: Peer storage backup feature introduced, exposing the channel open race vulnerability.
  • 2023-03-03: CLN 23.02 released.
  • 2023-07-28: Rusty gives the OK to disclose the fake channel DoS vector.
  • 2023-08-14: Follow-up testing reveals the channel open race vulnerability. Disclosed to Blockstream.
  • 2023-08-21: Defense against the channel open race DoS merged.
  • 2023-08-22: Rusty gives the OK to continue with the fake channel DoS disclosure, but requests that the channel open race vulnerability be omitted from the disclosure.
  • 2023-08-23: Public disclosure of the fake channel DoS.
  • 2023-08-23: CLN 23.08 released.
  • 2023-12-04: Rusty gives the OK to disclose the channel open race vulnerability.
  • 2024-01-08: Public disclosure.

Prevention

This vulnerability could have been prevented by a couple software engineering best practices.

Avoid Race Conditions

The original purpose of the peer connected hook was to enable plugins to filter and reject incoming connections from certain peers. Therefore the hook was designed to be synchronous, and all other events initiated by the peer were blocked until the hook returned. Unfortunately, PR 5078 destroyed that property of the hook by introducing a known race condition to the code (search for “this is racy” in commit 2424b7d). If PR 5078 hadn’t done this, there would be no race condition to exploit and this vulnerability would never have existed.

Race conditions can be nasty and should be avoided whenever possible. Knowingly adding race conditions where they didn’t previously exist is generally a bad idea.

Do Stress Testing

When I disclosed the fake channel DoS vector to Blockstream, I also provided a DoS program that demonstrated the attack. That same DoS program revealed the channel open race vulnerability after it became triggerable in February 2023. If a stress test based on the DoS program had been added to CLN’s CI pipeline or release process, this vulnerability could have been caught much earlier, before it was included in any releases.

In general, there is some difficulty in releasing such a test publicly while the vulnerability it tests for is still secret. In such situations the test can remain unreleased until the vulnerability has been publicly disclosed, and in the meantime the test can be run privately during the release process to ensure no regressions have been introduced. In CLN’s case, this may have been unnecessary – a stress test could have plausibly been added to PR 5849 without raising suspicion.

Takeaways

  • Avoid race conditions.
  • Use regression and stress testing.
  • Update your CLN nodes to at least v23.08.
Invoice Parsing Bugs in CLN

Several invoice parsing bugs were fixed in CLN 23.11, including bugs that caused crashes, undefined behavior, and use of uninitialized memory. These bugs could be reliably triggered by specially crafted invoices, enabling a malicious counterparty to crash the victim’s node upon invoice payment.

The parsing bugs were discovered by a new fuzz test written by Niklas Gögge and enhanced by me.

Bugs fixed in v23.11

#   Type Root Cause Fix
1   undefined behavior unchecked return value eeec529
2   use of uninitialized memory missing check for 0-length TLV ee501b0
3   crash unnecessary assertion ee8cf69
4   crash missing recovery ID validation c1f2068
5   crash missing pubkey validation 87f4907

The fuzz target

The fuzz target that uncovered these bugs was initially written by Niklas Gögge in December 2022, though it wasn’t made public until October 2023. The target simply provides fuzzer-generated inputs to CLN’s invoice decoding function, similar to fuzz targets written for other implementations [1, 2].

To improve the fuzzer’s efficiency, Niklas also wrote a custom mutator for the target. Invoices are encoded in bech32 which requires a valid checksum at the end of the encoding, making it quite difficult for fuzzers to generate valid bech32 consistently. As a result, bech32-naive fuzzers will generally get stuck at the bech32 decoding stage and have a hard time exploring deeper into the invoice parsing logic. Niklas’ custom mutator teaches the fuzzer how to generate valid bech32 so that it can focus its fuzzing on invoice parsing.

Initial fuzzing in 2022

After writing the fuzz target in December 2022, Niklas privately reported several bugs to CLN including a stack buffer overflow, an assertion failure, and undefined behavior due to a 0-length array. Many of the bugs were fixed in PR 5891 and released in CLN 23.02.

Merging the fuzz target in 2023

In October 2023, Niklas submitted his fuzz target for review in PR 6750. The initial corpus in that PR actually triggered bugs 1 and 2, but Niklas didn’t notice because he had been fuzzing with some UBSan options misconfigured. CLN’s CI didn’t detect the bugs either, since UBSan had previously been accidentally disabled in CI.

Niklas also discovered bug 3 during initial fuzzing, but he initially thought it was a false report and hard-coded an exception for it in the fuzz target.

Enhancements

The initial fuzz target only fuzzed the invoice decoding logic, skipping signature checks. I modified the target to also run the signature-checking logic, which enabled the fuzzer to quickly find bug 4.

While bug 5 should have also been discoverable by the fuzzer after this change, it remained undetected even after many weeks of CPU time. It wasn’t until I added a custom cross-over mutator for the fuzz target that bug 5 was discovered. The cross-over mutator is based on Niklas’ custom mutator and simply combines pieces from multiple bech32-decoded invoices before re-encoding the result in bech32. Within a few CPU hours of fuzzing with this extra mutator, the fuzzer found bug 5.

Impact

The severity of these bugs seems relatively low since they can only be triggered when paying an invoice. If a malicious invoice causes your node to crash, as long as you can restart your node in a timely manner and avoid paying any more invoices from the malicious counterparty, no further harm can be done.

Since bug 2 involves uninitialized memory it could potentially be more serious, as a sophisticated attacker may be able to extract sensitive data from the invoice-decoding process. Such an attack would be quite complex, and it is unclear whether it would even be possible in practice. It’s also unclear exactly what sensitive data could be extracted, since CLN handles private keys in a separate dedicated process (the hsmd daemon).

Takeaways

  • Fuzz testing is an essential component of writing robust and secure software. Any API that consumes untrusted inputs should be fuzz tested.
  • Custom mutators can be very powerful for fuzzing deeper logic in the codebase.
  • Fuzz testing of C or C++ code should use both ASan and UBSan. MSan and valgrind can also be useful.
DoS: Fake Lightning Channels

Lightning nodes released prior to the following versions are susceptible to a DoS attack involving the creation of large numbers of fake channels:

If you are running node software older than this, your funds may be at risk! Update to at least the above versions to help protect your node.

The vulnerability

When one lightning node (the funder) wishes to open a channel to another node (the fundee), the following sequence of events takes place:

channel funding diagram

  1. The funder sends an open_channel message with the desired parameters for the channel.
  2. The fundee checks that the channel parameters are reasonable and then sends an accept_channel message.
  3. The funder creates the funding transaction and sends a funding_created message containing the funding outpoint and their signature for the commitment transaction.
  4. The fundee verifies the funder’s commitment signature and sends funding_signed with their own signature for the commitment. The fundee begins watching the chain for the funding transaction.
  5. The funder verifies the fundee’s commitment signature, broadcasts the funding transaction, and then watches for it to show up onchain.
  6. Both nodes send channel_ready once the funding transaction has enough confirmations. Payments can now be sent across the channel.

But what happens if the funder doesn’t broadcast the funding transaction in step 5?

channel funding DoS diagram

The fundee, eager for inbound liquidity, is willing to wait for the funding transaction to confirm for a period of time. But eventually the fundee needs to give up on the pending channel and reclaim the resources allocated to it. BOLT 2 recommends waiting for 2016 blocks (2 weeks) before abandoning the pending channel.

Thus for 2 weeks the fundee devotes some amount of database storage, RAM, and CPU time to watching for the pending channel to confirm.

The fake channel DoS attack

An attacker can thus force a victim node to consume a small amount of resources by opening a fake channel with the victim and never publishing it onchain. If the attacker can create lots of fake channels, they can lock up lots of the victim’s resources.

Fake channels are trivial to create. Since there is no way for the victim to verify the funding outpoint sent to them in the funding_created message, the attacker doesn’t even need to construct a real funding transaction. They can use a randomly-generated funding transaction ID and sign a commitment transaction based on that fake ID. The victim will successfully verify the commitment signature against the provided (fake) funding outpoint and gladly allocate resources for the fake pending channel.

Opening lots of these fake channels is also trivial against node software older than the above releases. Some older node implementations do impose a limit on the number of pending channels allowed per peer, but such limits are easily bypassed by using a new attacker node ID for each fake channel.

DoS effects

In my experiments, I was able to create hundreds of thousands of fake channels against victim nodes (owned by me), with all kinds of adverse effects. In some cases, funds were clearly at risk of being stolen due to the victim node’s inability to respond to cheating attempts.

Here’s how the DoS attack affected each node implementation.

LND

Over the course of a couple days, LND’s performance degraded so drastically that it stopped responding to requests from its peers or from the CLI. The performance degradation continued on restart, even if the attacker was no longer actively DoSing.

I didn’t continue the DoS experiment for more than a couple days, but it’s very possible that with enough time the victim node would have become unresponsive enough that funds could be stolen without consequence.

CLN

After one day of the DoS attack, CLN’s connectd daemon was completely blocked and unable to respond to connection requests from other nodes. Most other functionality of CLN continued to work, and funds were not at risk since the separate lightningd daemon was not blocked by the DoS attack.

eclair

One day into the DoS, eclair OOM crashed. After that, every time eclair restarted, it OOM crashed again within 30 minutes, even if the attacker was no longer actively DoSing. Funds were clearly at risk, since an offline node cannot catch cheating attempts.

LDK

Since LDK is a library and not a full node implementation, it was trickier to experiment with. LDK Node didn’t exist at the time, but I found the ldk-sample node and modified it to run on mainnet for the experiment.

Within hours of the DoS attack, ldk-sample’s performance degraded drastically, causing it to unsync with the blockchain. A few days later, ldk-sample’s view of the blockchain was pinned more than 144 blocks in the past, preventing it from responding to cheating attempts before the attacker’s CSV timelock expired.

DoS defenses

I reported the DoS vector to the 4 major lightning implementations around the start of 2023. eclair and LDK were already aware of the potential DoS vector but hadn’t realized the severity of the vulnerability. Within days of receiving my report, every lightning implementation began working on defenses, some openly and others in secret.

All implementations have now shipped releases with defenses against the DoS. If you’re interested in the technical details of the defenses, see the linked pull requests and commits.

Date Reported Implementation Defenses Release
2022-12-12 LND pending channel limit [1] 0.16.0
2022-12-15 CLN significant performance improvements [1, 2] 23.02
2022-12-28 eclair pending channel and peer limits [1, 2] 0.9.0
2023-01-17 LDK pending channel and peer limits [1] 0.0.114

Lessons

Use watchtowers

When all else fails, watchtowers help to protect funds if your lightning node is incapacitated by a DoS attack. If you have significant funds at risk, it’s cheap insurance to run a private watchtower on a separate machine.

Multiple processes

Prior to the above releases, CLN was the only lightning implementation that clearly kept user funds safe while under DoS, because CLN actually runs as multiple separate daemon processes. In the case of this DoS attack, the connectd daemon responsible for handling peer connections became locked up while the lightningd daemon watching the blockchain was relatively unaffected.

Multiprocess architectures in general provide some defense against DoS, as one process slowing down or crashing doesn’t automatically bring down the other processes. For this reason, other implementations may want to consider splitting their nodes into separate processes. CLN could also improve robustness further by attempting to restart DoS-able subdaemons like connectd and gossipd if they crash, rather than shutting the whole node down.

More security auditing needed

I discovered this DoS vector last year. I had been reviewing the dual funding protocol and found a griefing attack involving fake dual-funded channels. After discussing the attack with Bastien Teinturier, I came to realize that a similar attack may also affect the single-funded protocol.

But I convinced myself for a couple months that surely such a trivial attack would have been defended against already. It wasn’t until I spent some time studying implementations’ funding code that I realized there were no defenses.

The fact that this DoS vector went unnoticed since the beginning of the Lightning Network should make everyone a little scared. If a newcomer like me could discover this vulnerability in a couple months, there are probably many other vulnerabilities in the Lightning Network waiting to be found and exploited.

For quite some time, it seems that security and robustness have not been the top priority for node implementations, with some implementations not even having security policies until 6-10 months ago [1, 2]. Everyone wants new lightning features: dual funded channels, Taproot channels, splicing, BOLT 12, etc. And those things are important. But every one of them introduces more complexity and more potential attack surface. If we’re going to make lightning even more complex, we also need to ramp up the engineering effort we put towards making the network secure and robust.

Because in the end it doesn’t matter how feature-rich and easy-to-use the Lightning Network is if it can’t keep user funds safe.