DoS: LND Onion Bomb

LND versions prior to 0.17.0 are vulnerable to a DoS attack where malicious onion packets cause the node to instantly run out of memory (OOM) and crash. If you are running an LND release older than this, your funds are at risk! Update to at least 0.17.0 to protect your node.

Severity

It is critical that users update to at least LND 0.17.0 for several reasons.

  • The attack is cheap and easy to carry out and will keep the victim offline for as long as it lasts.
  • The source of the attack is concealed via onion routing. The attacker does not need to connect directly to the victim.
  • All LND versions prior to 0.17.0 are vulnerable. The fix was not backported to the LND 0.16.x series or earlier.

The Vulnerability

The Lightning Network uses onion routing to provide senders and receivers of payments some degree of privacy. Each node along a payment route receives an onion packet from the previous node, containing forwarding instructions for the next node on the route. The onion packet is encrypted by the initiator of the payment, so that each node can only read its own forwarding instructions.

Once a node has “peeled off” its layer of encryption from the onion packet, it can extract its forwarding instructions according to the format specified in the LN protocol:

Field Name   Size              Description
length       1-9 bytes         The length of the payload field, encoded as BigSize.
payload      length bytes      The forwarding instructions.
hmac         32 bytes          The HMAC to use for the forwarded onion packet.
next_onion   remaining bytes   The onion packet to forward.

Prior to LND 0.17.0, the code that extracted these instructions was essentially:

// Decode unpacks an encoded HopPayload from the passed reader into the
// target HopPayload.
func (hp *HopPayload) Decode(r io.Reader) error {
    bufReader := bufio.NewReader(r)

    var b [8]byte
    varInt, err := ReadVarInt(bufReader, &b)
    if err != nil {
        return err
    }

    payloadSize := uint32(varInt)

    // Now that we know the payload size, we'll create a new buffer to
    // read it out in full.
    hp.Payload = make([]byte, payloadSize)
    if _, err := io.ReadFull(bufReader, hp.Payload[:]); err != nil {
        return err
    }
    if _, err := io.ReadFull(bufReader, hp.HMAC[:]); err != nil {
        return err
    }

    return nil
}

Note the absence of a bounds check on payloadSize!

Regardless of the actual payload size, LND allocates memory for whatever length is encoded in the onion packet up to UINT32_MAX (4 GB).

The DoS Attack

It is trivial for an attacker to craft an onion packet that contains an encoded length of UINT32_MAX for the victim’s forwarding instructions. If the victim’s node has less than 4 GB of memory available, it will OOM crash instantly upon receiving the attacker’s packet.
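
To make the size of the attack concrete, here is a minimal sketch of the length prefix an attacker would place in front of the victim's forwarding instructions. encodeBigSize is a hypothetical helper written for this illustration (it is not LND code), and it assumes the BigSize layout from the LN protocol: a single byte for small values, or a 0xfd/0xfe/0xff marker followed by a 2-, 4-, or 8-byte big-endian integer.

```go
package main

import (
    "encoding/binary"
    "fmt"
)

// encodeBigSize is an illustrative BigSize encoder (assumed layout: one byte
// for values below 0xfd, otherwise a marker byte followed by a big-endian
// integer of 2, 4, or 8 bytes).
func encodeBigSize(v uint64) []byte {
    switch {
    case v < 0xfd:
        return []byte{byte(v)}
    case v <= 0xffff:
        b := make([]byte, 3)
        b[0] = 0xfd
        binary.BigEndian.PutUint16(b[1:], uint16(v))
        return b
    case v <= 0xffffffff:
        b := make([]byte, 5)
        b[0] = 0xfe
        binary.BigEndian.PutUint32(b[1:], uint32(v))
        return b
    default:
        b := make([]byte, 9)
        b[0] = 0xff
        binary.BigEndian.PutUint64(b[1:], v)
        return b
    }
}

func main() {
    // Five bytes on the wire are enough to declare a payload of
    // UINT32_MAX bytes.
    prefix := encodeBigSize(0xffffffff)
    fmt.Printf("%x\n", prefix) // prints "feffffffff"
}
```

The asymmetry is the point: the attacker spends five bytes, while the unpatched decoder reserves gigabytes before it notices that the payload is not actually there.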

However, if the victim’s node has more than 4 GB of memory available, it is able to recover from the malicious packet. The victim’s node will temporarily allocate 4 GB, but the Go garbage collector will quickly reclaim that memory after decoding fails.

So nodes with more than 4 GB of RAM are safe, right?

Not quite. The attacker can send many malicious packets simultaneously. If the victim processes enough malicious packets before the garbage collector kicks in, an OOM will still occur. And since LND decodes onion packets in parallel, it is not difficult for an attacker to beat the garbage collector. In my experiments I was able to consistently crash nodes with up to 128 GB of RAM in just a few seconds.

The Fix

A bounds check on the encoded length field was concealed in a large refactoring commit and included in LND 0.17.0. The fixed code is essentially:

// Decode unpacks an encoded HopPayload from the passed reader into the
// target HopPayload.
func (hp *HopPayload) Decode(r io.Reader) error {
    bufReader := bufio.NewReader(r)

    payloadSize, err := tlvPayloadSize(bufReader)
    if err != nil {
        return err
    }

    // Now that we know the payload size, we'll create a new buffer to
    // read it out in full.
    hp.Payload = make([]byte, payloadSize)
    if _, err := io.ReadFull(bufReader, hp.Payload[:]); err != nil {
        return err
    }
    if _, err := io.ReadFull(bufReader, hp.HMAC[:]); err != nil {
        return err
    }

    return nil
}

// tlvPayloadSize uses the passed reader to extract the payload length
// encoded as a var-int.
func tlvPayloadSize(r io.Reader) (uint16, error) {
    var b [8]byte
    varInt, err := ReadVarInt(r, &b)
    if err != nil {
        return 0, err
    }

    if varInt > math.MaxUint16 {
        return 0, fmt.Errorf("payload size of %d is larger than the "+
            "maximum allowed size of %d", varInt, math.MaxUint16)
    }

    return uint16(varInt), nil
}

This new code reduces the maximum amount of memory LND will allocate when decoding an onion packet from 4 GB to 64 KB, which is enough to fully mitigate the DoS attack.
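
The effect of the bound can be checked in isolation. The following is a self-contained sketch, not the LND implementation: readBigSize and declaredSize are hypothetical stand-ins for ReadVarInt and tlvPayloadSize, and the reader does not enforce canonical (minimal) encodings as a production decoder should.

```go
package main

import (
    "bytes"
    "errors"
    "fmt"
    "io"
    "math"
)

// errTooLarge is returned before any payload buffer is allocated.
var errTooLarge = errors.New("declared payload size exceeds 65535 bytes")

// readBigSize decodes a BigSize integer (illustrative; canonical-encoding
// checks are omitted to keep the sketch short).
func readBigSize(r io.Reader) (uint64, error) {
    var b [8]byte
    if _, err := io.ReadFull(r, b[:1]); err != nil {
        return 0, err
    }
    var n int
    switch b[0] {
    case 0xfd:
        n = 2
    case 0xfe:
        n = 4
    case 0xff:
        n = 8
    default:
        return uint64(b[0]), nil
    }
    if _, err := io.ReadFull(r, b[:n]); err != nil {
        return 0, err
    }
    var v uint64
    for _, x := range b[:n] {
        v = v<<8 | uint64(x) // big-endian accumulation
    }
    return v, nil
}

// declaredSize returns the payload length declared by raw, rejecting any
// value that does not fit in a uint16.
func declaredSize(raw []byte) (uint16, error) {
    n, err := readBigSize(bytes.NewReader(raw))
    if err != nil {
        return 0, err
    }
    if n > math.MaxUint16 {
        return 0, errTooLarge
    }
    return uint16(n), nil
}

func main() {
    inputs := [][]byte{
        {0x20},                         // 32 bytes: accepted
        {0xfd, 0xff, 0xff},             // 65535 bytes: accepted
        {0xfe, 0xff, 0xff, 0xff, 0xff}, // UINT32_MAX: rejected
    }
    for _, in := range inputs {
        size, err := declaredSize(in)
        fmt.Printf("%x -> size=%d err=%v\n", in, size, err)
    }
}
```

A table of accepted and rejected prefixes like this one also works well as a regression test, since it fails immediately if the bound is ever loosened.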

Discovery

A simple fuzz test for onion packet encoding and decoding revealed this vulnerability.

Timeline

  • 2023-06-20: Vulnerability discovered and disclosed to Lightning Labs.
  • 2023-08-23: Fix merged.
  • 2023-10-03: LND 0.17.0 released containing the fix.
  • 2024-05-16: Laolu gives the OK to disclose publicly once LND 0.18.0 is released and has some uptake.
  • 2024-05-30: LND 0.18.0 released.
  • 2024-06-18: Public disclosure.

Prevention

This vulnerability was found in less than a minute of fuzz testing. If basic fuzz tests had been written at the time the original onion decoding functions were introduced, the bug would have been caught before it was merged.

In general any function that processes untrusted inputs is a strong candidate for fuzz testing, and often these fuzz tests are easier to write than traditional unit tests. A minimal fuzz test that detects this particular vulnerability is exceedingly simple:

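package sphinx_test

import (
    "bytes"
    "testing"

    sphinx "github.com/lightningnetwork/lightning-onion"
)

func FuzzHopPayload(f *testing.F) {
    f.Fuzz(func(t *testing.T, data []byte) {
        // Hop payloads larger than 1300 bytes violate the spec and never
        // reach the decoding step in practice.
        if len(data) > 1300 {
            return
        }

        // Decoding errors are expected for random inputs and are ignored;
        // the fuzzer only needs to surface crashes.
        var hopPayload sphinx.HopPayload
        _ = hopPayload.Decode(bytes.NewReader(data))
    })
}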

Takeaways

  • Write fuzz tests for all APIs that consume untrusted inputs.
  • Update your LND nodes to at least 0.17.0.

DoS: Channel Open Race in CLN

CLN versions between 23.02 and 23.05.2 are susceptible to a DoS attack involving the exploitation of a race condition during channel opens. If you are running any version in this range, your funds may be at risk! Update to at least 23.08 to help protect your node.

The Vulnerability

The vulnerability arises from a race condition between two different flows in CLN: the channel open flow and the peer connection flow.

The Channel Open Flow

When a peer opens a channel with a CLN node, the following interactions occur on the CLN node.

[Diagram: channel open flow]

  1. The connectd daemon notifies lightningd about the channel open request.
  2. lightningd launches a new openingd daemon to handle the channel open negotiation.
  3. openingd completes the channel open negotiation up to the point where the funding outpoint is known.
  4. openingd sends the funding outpoint to lightningd and exits.
  5. lightningd launches a channeld daemon to manage the new channel.

The Peer Connection Flow

Once a peer has a channel with a CLN node, if the peer disconnects and reconnects the following occurs on the CLN node.

[Diagram: peer connection flow with an existing channel]

  1. The connectd daemon notifies lightningd about the new peer connection.
  2. lightningd calls a plugin hook notifying the chanbackup plugin about the new peer connection.
  3. chanbackup notifies lightningd that it is done running the hook.
  4. With the hook finished, lightningd recognizes that a previous channel exists with the peer and launches a channeld daemon to manage it.

The Race Condition

Problems arise when the peer connection flow overlaps with the channel open flow, causing lightningd to attempt launching the same channeld daemon twice. This can happen if the peer quickly opens a channel after connecting, and the chanbackup plugin is delayed in handling the peer connection hook, leading to the following interactions on the CLN node.

[Diagram: channel open race]

  1. The connectd daemon notifies lightningd about the new peer connection.
  2. lightningd calls a plugin hook notifying the chanbackup plugin about the new peer connection.
  3. The connectd daemon notifies lightningd about the channel open request.
  4. lightningd launches a new openingd daemon to handle the channel open negotiation.
  5. openingd completes the channel open negotiation up to the point where the funding outpoint is known.
  6. openingd sends the funding outpoint to lightningd and exits.
  7. lightningd launches a channeld daemon to manage the new channel.
  8. chanbackup notifies lightningd that it is done running the hook.
  9. With the hook finished, lightningd recognizes that a previous channel exists with the peer and attempts to launch a channeld daemon to manage it. Since the daemon is already running, an assertion failure occurs and CLN crashes.
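
The interleaving above can be replayed in a toy model. This is only an illustration written in Go; CLN itself is written in C, and the node type, launchChanneld, and errAlreadyRunning are hypothetical names, with the returned error standing in for the assertion failure.

```go
package main

import (
    "errors"
    "fmt"
)

// errAlreadyRunning stands in for the assertion failure in the real daemon.
var errAlreadyRunning = errors.New("channeld already running for peer")

// node is a toy model of lightningd's bookkeeping (not CLN code).
type node struct {
    channeld map[string]bool // peer ID -> channeld currently running
}

// launchChanneld models a launch that assumes no daemon exists yet.
func (n *node) launchChanneld(peer string) error {
    if n.channeld[peer] {
        return errAlreadyRunning
    }
    n.channeld[peer] = true
    return nil
}

// replayRace replays the problematic ordering and returns the result of
// each launch attempt.
func replayRace() (openErr, hookErr error) {
    n := &node{channeld: make(map[string]bool)}
    const peer = "attacker"

    // Steps 1-2: the peer connects and the chanbackup hook starts, but the
    // plugin is slow to answer.
    // Steps 3-7: the channel open completes first and launches channeld.
    openErr = n.launchChanneld(peer)

    // Steps 8-9: the hook finally returns and lightningd launches again.
    hookErr = n.launchChanneld(peer)
    return openErr, hookErr
}

func main() {
    openErr, hookErr := replayRace()
    fmt.Println("launch after channel open:", openErr)
    fmt.Println("launch after hook returns:", hookErr)
}
```

The model is sequential on purpose: the bug does not need true parallelism, only an ordering in which the hook returns after the channel open has already finished.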

The DoS Attack

To reliably trigger the assertion failure, an attacker needs to somehow slow down the chanbackup plugin so that a channel can be opened before the plugin finishes running the peer connected hook. One way to do this is to overload chanbackup with many peer connections and channel state changes. As it turns out, the fake channel DoS attack is a trivial and free method of generating these events and overloading chanbackup.

On a local network with low latency, I was able to generate enough load on chanbackup to consistently crash CLN nodes in under 5 seconds. In the real world the attack would be carried out across the Internet with higher latencies, so more load on chanbackup would be required to trigger the race condition. In my experiments, crashing CLN nodes across the Internet took around 30 seconds.

The Defense

To prevent the assertion failure from triggering, a small patch was added to CLN 23.08 that checks if a channeld is already running when the peer connected hook returns. If so, lightningd does not attempt to start the channeld again.

Note that this patch does not actually remove the race condition, though it does prevent crashing when the race occurs.
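
In the same spirit, the shape of the patch can be sketched as a guard in the hook-completion path. Again this is a Go model with hypothetical names (node, onHookReturned, launchChanneld), not the actual CLN change, and it assumes the peer has a channel by the time the hook returns.

```go
package main

import "fmt"

// node is a toy model of lightningd's bookkeeping (not CLN code).
type node struct {
    channeld map[string]bool // peer ID -> channeld currently running
    launches int             // number of daemons started
}

// launchChanneld panics on a double launch, mimicking the assertion.
func (n *node) launchChanneld(peer string) {
    if n.channeld[peer] {
        panic("channeld already running for " + peer)
    }
    n.channeld[peer] = true
    n.launches++
}

// onHookReturned models the patched behavior: if a channeld is already
// running for the peer, do nothing instead of launching a second one.
func (n *node) onHookReturned(peer string) {
    if n.channeld[peer] {
        return
    }
    n.launchChanneld(peer)
}

// raceOrder replays the racy ordering: channel open first, hook second.
func raceOrder() int {
    n := &node{channeld: make(map[string]bool)}
    n.launchChanneld("peer") // channel open flow finishes first
    n.onHookReturned("peer") // delayed hook returns afterwards
    return n.launches
}

// reconnectOrder replays the ordinary reconnect flow for an existing channel.
func reconnectOrder() int {
    n := &node{channeld: make(map[string]bool)}
    n.onHookReturned("peer")
    return n.launches
}

func main() {
    fmt.Println(raceOrder(), reconnectOrder()) // prints "1 1"
}
```

Both orderings end with exactly one daemon, which matches the description above: the two flows can still interleave, but the second launch attempt is now skipped rather than fatal.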

Discovery

This vulnerability was discovered during follow-up testing prior to the disclosure of the fake channel DoS vector. At the time, Rusty and I agreed to move forward with the planned disclosure of the fake channel DoS vector, but to delay disclosure of this channel open race until a later date.

Since the channel open race can be triggered by the fake channel DoS attack, it is a valid question how the race went undiscovered during the implementation of defenses against that attack. The answer is that the race was actually untriggerable until a few weeks after the fake channel DoS defenses were merged.

While the race condition was introduced in March 2022, the race couldn’t actually trigger because no plugins used the peer connected hook. It wasn’t until February 2023 that the race was exposed, when the peer storage backup feature made chanbackup the first official plugin to use the hook.

Timeline

  • 2022-03-23: Race condition introduced to CLN 0.11.
  • 2022-12-15: Fake channel DoS vector disclosed to Blockstream.
  • 2023-01-21: Fake channel DoS defenses fully merged [1, 2].
  • 2023-02-08: Peer storage backup feature introduced, exposing the channel open race vulnerability.
  • 2023-03-03: CLN 23.02 released.
  • 2023-07-28: Rusty gives the OK to disclose the fake channel DoS vector.
  • 2023-08-14: Follow-up testing reveals the channel open race vulnerability. Disclosed to Blockstream.
  • 2023-08-21: Defense against the channel open race DoS merged.
  • 2023-08-22: Rusty gives the OK to continue with the fake channel DoS disclosure, but requests that the channel open race vulnerability be omitted from the disclosure.
  • 2023-08-23: Public disclosure of the fake channel DoS.
  • 2023-08-23: CLN 23.08 released.
  • 2023-12-04: Rusty gives the OK to disclose the channel open race vulnerability.
  • 2024-01-08: Public disclosure.

Prevention

This vulnerability could have been prevented by a couple of software engineering best practices.

Avoid Race Conditions

The original purpose of the peer connected hook was to enable plugins to filter and reject incoming connections from certain peers. Therefore the hook was designed to be synchronous, and all other events initiated by the peer were blocked until the hook returned. Unfortunately, PR 5078 destroyed that property of the hook by introducing a known race condition to the code (search for “this is racy” in commit 2424b7d). If PR 5078 hadn’t done this, there would be no race condition to exploit and this vulnerability would never have existed.

Race conditions can be nasty and should be avoided whenever possible. Knowingly adding race conditions where they didn’t previously exist is generally a bad idea.

Do Stress Testing

When I disclosed the fake channel DoS vector to Blockstream, I also provided a DoS program that demonstrated the attack. That same DoS program revealed the channel open race vulnerability after it became triggerable in February 2023. If a stress test based on the DoS program had been added to CLN’s CI pipeline or release process, this vulnerability could have been caught much earlier, before it was included in any releases.

In general, there is some difficulty in releasing such a test publicly while the vulnerability it tests for is still secret. In such situations the test can remain unreleased until the vulnerability has been publicly disclosed, and in the meantime the test can be run privately during the release process to ensure no regressions have been introduced. In CLN’s case, this may have been unnecessary – a stress test could have plausibly been added to PR 5849 without raising suspicion.

Takeaways

  • Avoid race conditions.
  • Use regression and stress testing.
  • Update your CLN nodes to at least v23.08.

Invoice Parsing Bugs in CLN

Several invoice parsing bugs were fixed in CLN 23.11, including bugs that caused crashes, undefined behavior, and use of uninitialized memory. These bugs could be reliably triggered by specially crafted invoices, enabling a malicious counterparty to crash the victim’s node upon invoice payment.

The parsing bugs were discovered by a new fuzz test written by Niklas Gögge and enhanced by me.

Bugs fixed in v23.11

#   Type                          Root Cause                       Fix
1   undefined behavior            unchecked return value           eeec529
2   use of uninitialized memory   missing check for 0-length TLV   ee501b0
3   crash                         unnecessary assertion            ee8cf69
4   crash                         missing recovery ID validation   c1f2068
5   crash                         missing pubkey validation        87f4907

The fuzz target

The fuzz target that uncovered these bugs was initially written by Niklas Gögge in December 2022, though it wasn’t made public until October 2023. The target simply provides fuzzer-generated inputs to CLN’s invoice decoding function, similar to fuzz targets written for other implementations [1, 2].

To improve the fuzzer’s efficiency, Niklas also wrote a custom mutator for the target. Invoices are encoded in bech32 which requires a valid checksum at the end of the encoding, making it quite difficult for fuzzers to generate valid bech32 consistently. As a result, bech32-naive fuzzers will generally get stuck at the bech32 decoding stage and have a hard time exploring deeper into the invoice parsing logic. Niklas’ custom mutator teaches the fuzzer how to generate valid bech32 so that it can focus its fuzzing on invoice parsing.
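
The core trick of such a mutator can be sketched in a few lines: mutate the 5-bit data symbols, then recompute the six checksum symbols so the result still passes bech32 validation. The code below is an illustrative Go sketch, not Niklas' mutator (which targets CLN's C fuzz harness); the checksum routine follows the bech32 reference algorithm from BIP 173, and mutateAndFix is a hypothetical helper.

```go
package main

import "fmt"

// polymod computes the bech32 checksum polynomial over 5-bit symbols.
func polymod(values []byte) uint32 {
    gen := [5]uint32{0x3b6a57b2, 0x26508e6d, 0x1ea119fa, 0x3d4233dd, 0x2a1462b3}
    chk := uint32(1)
    for _, v := range values {
        top := chk >> 25
        chk = (chk&0x1ffffff)<<5 ^ uint32(v)
        for i := 0; i < 5; i++ {
            if (top>>uint(i))&1 == 1 {
                chk ^= gen[i]
            }
        }
    }
    return chk
}

// hrpExpand spreads the human-readable part into 5-bit symbols.
func hrpExpand(hrp string) []byte {
    out := make([]byte, 0, 2*len(hrp)+1)
    for i := 0; i < len(hrp); i++ {
        out = append(out, hrp[i]>>5)
    }
    out = append(out, 0)
    for i := 0; i < len(hrp); i++ {
        out = append(out, hrp[i]&31)
    }
    return out
}

// checksum returns the six checksum symbols for hrp and payload.
func checksum(hrp string, payload []byte) []byte {
    values := append(hrpExpand(hrp), payload...)
    values = append(values, 0, 0, 0, 0, 0, 0)
    mod := polymod(values) ^ 1
    sum := make([]byte, 6)
    for i := range sum {
        sum[i] = byte(mod>>uint(5*(5-i))) & 31
    }
    return sum
}

// valid reports whether symbols (payload plus checksum) verify under hrp.
func valid(hrp string, symbols []byte) bool {
    return polymod(append(hrpExpand(hrp), symbols...)) == 1
}

// mutateAndFix replaces one payload symbol and appends a fresh checksum, so
// the mutated input still gets past the bech32 decoding stage.
func mutateAndFix(hrp string, payload []byte, idx int, sym byte) []byte {
    mutated := append([]byte(nil), payload...)
    mutated[idx] = sym & 31
    return append(mutated, checksum(hrp, mutated)...)
}

func main() {
    hrp := "lnbc"
    payload := []byte{1, 2, 3, 4, 5}
    original := append(append([]byte(nil), payload...), checksum(hrp, payload)...)

    // A checksum-unaware mutation changes a symbol and leaves the old checksum.
    naive := append([]byte(nil), original...)
    naive[0] ^= 1

    fixed := mutateAndFix(hrp, payload, 0, payload[0]^1)
    fmt.Println(valid(hrp, original), valid(hrp, naive), valid(hrp, fixed))
}
```

The naive mutation is rejected at the checksum stage, while the checksum-aware one reaches the parser, which is exactly where the fuzzer's effort is wanted.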

Initial fuzzing in 2022

After writing the fuzz target in December 2022, Niklas privately reported several bugs to CLN including a stack buffer overflow, an assertion failure, and undefined behavior due to a 0-length array. Many of the bugs were fixed in PR 5891 and released in CLN 23.02.

Merging the fuzz target in 2023

In October 2023, Niklas submitted his fuzz target for review in PR 6750. The initial corpus in that PR actually triggered bugs 1 and 2, but Niklas didn’t notice because he had been fuzzing with some UBSan options misconfigured. CLN’s CI didn’t detect the bugs either, since UBSan had previously been accidentally disabled in CI.

Niklas also discovered bug 3 during initial fuzzing, but he initially thought it was a false report and hard-coded an exception for it in the fuzz target.

Enhancements

The initial fuzz target only fuzzed the invoice decoding logic, skipping signature checks. I modified the target to also run the signature-checking logic, which enabled the fuzzer to quickly find bug 4.

While bug 5 should have also been discoverable by the fuzzer after this change, it remained undetected even after many weeks of CPU time. It wasn’t until I added a custom cross-over mutator for the fuzz target that bug 5 was discovered. The cross-over mutator is based on Niklas’ custom mutator and simply combines pieces from multiple bech32-decoded invoices before re-encoding the result in bech32. Within a few CPU hours of fuzzing with this extra mutator, the fuzzer found bug 5.
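
As a rough illustration of the cross-over idea, the sketch below splices the front of one decoded invoice onto the back of another. It is a simplified assumption of how such a mutator might work, not the actual implementation: crossOver and clamp are hypothetical helpers, the inputs are taken to be already bech32-decoded symbol slices, and in a real fuzzer the cut points would come from the fuzzer's random seed.

```go
package main

import "fmt"

// clamp restricts v to the range [lo, hi].
func clamp(v, lo, hi int) int {
    if v < lo {
        return lo
    }
    if v > hi {
        return hi
    }
    return v
}

// crossOver returns a[:cutA] followed by b[cutB:]. Out-of-range cut points
// are clamped so that arbitrary fuzzer-chosen values are safe to pass in.
func crossOver(a, b []byte, cutA, cutB int) []byte {
    cutA = clamp(cutA, 0, len(a))
    cutB = clamp(cutB, 0, len(b))
    out := make([]byte, 0, cutA+len(b)-cutB)
    out = append(out, a[:cutA]...)
    out = append(out, b[cutB:]...)
    return out
}

func main() {
    invoiceA := []byte{1, 2, 3, 4}
    invoiceB := []byte{9, 8, 7, 6, 5}
    fmt.Println(crossOver(invoiceA, invoiceB, 2, 3)) // prints "[1 2 6 5]"
}
```

The spliced symbols would then be re-encoded in bech32 with a freshly computed checksum before being handed to the invoice parser.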

Impact

The severity of these bugs seems relatively low since they can only be triggered when paying an invoice. If a malicious invoice causes your node to crash, as long as you can restart your node in a timely manner and avoid paying any more invoices from the malicious counterparty, no further harm can be done.

Since bug 2 involves uninitialized memory it could potentially be more serious, as a sophisticated attacker may be able to extract sensitive data from the invoice-decoding process. Such an attack would be quite complex, and it is unclear whether it would even be possible in practice. It’s also unclear exactly what sensitive data could be extracted, since CLN handles private keys in a separate dedicated process (the hsmd daemon).

Takeaways

  • Fuzz testing is an essential component of writing robust and secure software. Any API that consumes untrusted inputs should be fuzz tested.
  • Custom mutators can be very powerful for fuzzing deeper logic in the codebase.
  • Fuzz testing of C or C++ code should use both ASan and UBSan. MSan and valgrind can also be useful.

DoS: Fake Lightning Channels

Lightning nodes released prior to the following versions are susceptible to a DoS attack involving the creation of large numbers of fake channels:

  • LND 0.16.0
  • CLN 23.02
  • eclair 0.9.0
  • LDK 0.0.114

If you are running node software older than this, your funds may be at risk! Update to at least the above versions to help protect your node.

The vulnerability

When one lightning node (the funder) wishes to open a channel to another node (the fundee), the following sequence of events takes place:

[Diagram: channel funding flow]

  1. The funder sends an open_channel message with the desired parameters for the channel.
  2. The fundee checks that the channel parameters are reasonable and then sends an accept_channel message.
  3. The funder creates the funding transaction and sends a funding_created message containing the funding outpoint and their signature for the commitment transaction.
  4. The fundee verifies the funder’s commitment signature and sends funding_signed with their own signature for the commitment. The fundee begins watching the chain for the funding transaction.
  5. The funder verifies the fundee’s commitment signature, broadcasts the funding transaction, and then watches for it to show up onchain.
  6. Both nodes send channel_ready once the funding transaction has enough confirmations. Payments can now be sent across the channel.

But what happens if the funder doesn’t broadcast the funding transaction in step 5?

[Diagram: channel funding DoS]

The fundee, eager for inbound liquidity, is willing to wait for the funding transaction to confirm for a period of time. But eventually the fundee needs to give up on the pending channel and reclaim the resources allocated to it. BOLT 2 recommends waiting for 2016 blocks (2 weeks) before abandoning the pending channel.

Thus for 2 weeks the fundee devotes some amount of database storage, RAM, and CPU time to watching for the pending channel to confirm.
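
The bookkeeping this implies can be sketched as follows. The types and function names are hypothetical and not taken from any implementation; the only value drawn from the text is the 2016-block timeout recommended by BOLT 2, and the two-week figure assumes the usual average of roughly 10 minutes per block.

```go
package main

import "fmt"

// fundingTimeoutBlocks is the BOLT 2 recommendation mentioned above.
const fundingTimeoutBlocks = 2016

// pendingChannel is a toy record of a channel awaiting confirmation.
type pendingChannel struct {
    fundingTxid    string
    acceptedHeight uint32 // block height at which funding_signed was sent
}

// expired reports whether the fundee may forget the channel at height tip.
func expired(ch pendingChannel, tip uint32) bool {
    return tip >= ch.acceptedHeight &&
        tip-ch.acceptedHeight >= fundingTimeoutBlocks
}

// prune returns the channels that must still be tracked at height tip.
func prune(chans []pendingChannel, tip uint32) []pendingChannel {
    var kept []pendingChannel
    for _, ch := range chans {
        if !expired(ch, tip) {
            kept = append(kept, ch)
        }
    }
    return kept
}

func main() {
    // 2016 blocks at roughly 10 minutes each is about 14 days.
    fmt.Println(fundingTimeoutBlocks*10/(60*24), "days") // prints "14 days"

    chans := []pendingChannel{
        {fundingTxid: "fake-1", acceptedHeight: 800000},
        {fundingTxid: "fake-2", acceptedHeight: 801000},
    }
    fmt.Println(len(prune(chans, 802016))) // prints "1"
}
```

Every record that survives prune costs the fundee storage and chain-watching work, whether or not a funding transaction will ever appear.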

The fake channel DoS attack

An attacker can thus force a victim node to consume a small amount of resources by opening a fake channel with the victim and never publishing it onchain. If the attacker can create lots of fake channels, they can lock up lots of the victim’s resources.

Fake channels are trivial to create. Since there is no way for the victim to verify the funding outpoint sent to them in the funding_created message, the attacker doesn’t even need to construct a real funding transaction. They can use a randomly-generated funding transaction ID and sign a commitment transaction based on that fake ID. The victim will successfully verify the commitment signature against the provided (fake) funding outpoint and gladly allocate resources for the fake pending channel.

Opening lots of these fake channels is also trivial against node software older than the above releases. Some older node implementations do impose a limit on the number of pending channels allowed per peer, but such limits are easily bypassed by using a new attacker node ID for each fake channel.
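
A small model shows why a per-peer limit alone does not help. perPeerLimiter and the attacker loops below are hypothetical illustrations, not code from any node implementation.

```go
package main

import "fmt"

// perPeerLimiter caps pending channels per node ID and nothing else.
type perPeerLimiter struct {
    max     int
    pending map[string]int
}

// accept records a pending channel for nodeID if the peer is under its cap.
func (l *perPeerLimiter) accept(nodeID string) bool {
    if l.pending[nodeID] >= l.max {
        return false
    }
    l.pending[nodeID]++
    return true
}

// openWithOneID attempts n fake channels from a single node ID.
func openWithOneID(l *perPeerLimiter, n int) int {
    accepted := 0
    for i := 0; i < n; i++ {
        if l.accept("attacker") {
            accepted++
        }
    }
    return accepted
}

// openWithFreshIDs attempts n fake channels, each from a new node ID.
func openWithFreshIDs(l *perPeerLimiter, n int) int {
    accepted := 0
    for i := 0; i < n; i++ {
        if l.accept(fmt.Sprintf("attacker-%d", i)) {
            accepted++
        }
    }
    return accepted
}

func main() {
    single := &perPeerLimiter{max: 1, pending: make(map[string]int)}
    fresh := &perPeerLimiter{max: 1, pending: make(map[string]int)}
    fmt.Println(openWithOneID(single, 1000))   // prints "1"
    fmt.Println(openWithFreshIDs(fresh, 1000)) // prints "1000"
}
```

Since generating a new node ID only requires generating a new key pair, the second loop costs the attacker essentially nothing.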

DoS effects

In my experiments, I was able to create hundreds of thousands of fake channels against victim nodes (owned by me), with all kinds of adverse effects. In some cases, funds were clearly at risk of being stolen due to the victim node’s inability to respond to cheating attempts.

Here’s how the DoS attack affected each node implementation.

LND

Over the course of a couple days, LND’s performance degraded so drastically that it stopped responding to requests from its peers or from the CLI. The performance degradation continued on restart, even if the attacker was no longer actively DoSing.

I didn’t continue the DoS experiment for more than a couple days, but it’s very possible that with enough time the victim node would have become unresponsive enough that funds could be stolen without consequence.

CLN

After one day of the DoS attack, CLN’s connectd daemon was completely blocked and unable to respond to connection requests from other nodes. Most other functionality of CLN continued to work, and funds were not at risk since the separate lightningd daemon was not blocked by the DoS attack.

eclair

One day into the DoS, eclair OOM crashed. After that, every time eclair restarted, it OOM crashed again within 30 minutes, even if the attacker was no longer actively DoSing. Funds were clearly at risk, since an offline node cannot catch cheating attempts.

LDK

Since LDK is a library and not a full node implementation, it was trickier to experiment with. LDK Node didn’t exist at the time, but I found the ldk-sample node and modified it to run on mainnet for the experiment.

Within hours of the DoS attack, ldk-sample’s performance degraded drastically, causing it to fall out of sync with the blockchain. A few days later, ldk-sample’s view of the blockchain was pinned more than 144 blocks in the past, preventing it from responding to cheating attempts before the attacker’s CSV timelock expired.

DoS defenses

I reported the DoS vector to the four major lightning implementations around the start of 2023. eclair and LDK were already aware of the potential DoS vector but hadn’t realized the severity of the vulnerability. Within days of receiving my report, every lightning implementation began working on defenses, some openly and others in secret.

All implementations have now shipped releases with defenses against the DoS. If you’re interested in the technical details of the defenses, see the linked pull requests and commits.

Date Reported   Implementation   Defenses                                      Release
2022-12-12      LND              pending channel limit [1]                     0.16.0
2022-12-15      CLN              significant performance improvements [1, 2]   23.02
2022-12-28      eclair           pending channel and peer limits [1, 2]        0.9.0
2023-01-17      LDK              pending channel and peer limits [1]           0.0.114
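
For readers who want a feel for what a "pending channel and peer limit" looks like, here is one plausible shape, sketched under assumptions. The limiter type, its thresholds, and its method names are hypothetical; the real defenses differ per implementation and are more nuanced, so the linked pull requests remain the authoritative reference.

```go
package main

import "fmt"

// limiter bounds pending channels both per node ID and across all peers.
type limiter struct {
    maxPerPeer int
    maxTotal   int
    perPeer    map[string]int
    total      int
}

func newLimiter(maxPerPeer, maxTotal int) *limiter {
    return &limiter{
        maxPerPeer: maxPerPeer,
        maxTotal:   maxTotal,
        perPeer:    make(map[string]int),
    }
}

// accept admits a new pending channel only if both limits allow it.
func (l *limiter) accept(nodeID string) bool {
    if l.total >= l.maxTotal || l.perPeer[nodeID] >= l.maxPerPeer {
        return false
    }
    l.perPeer[nodeID]++
    l.total++
    return true
}

// release frees a slot once a channel confirms or times out.
func (l *limiter) release(nodeID string) {
    if l.perPeer[nodeID] == 0 {
        return
    }
    l.perPeer[nodeID]--
    if l.perPeer[nodeID] == 0 {
        delete(l.perPeer, nodeID)
    }
    l.total--
}

// floodWithFreshIDs attempts one fake channel per freshly generated node ID.
func floodWithFreshIDs(l *limiter, attempts int) int {
    accepted := 0
    for i := 0; i < attempts; i++ {
        if l.accept(fmt.Sprintf("attacker-%d", i)) {
            accepted++
        }
    }
    return accepted
}

func main() {
    l := newLimiter(2, 100)
    fmt.Println(floodWithFreshIDs(l, 100000)) // prints "100"
}
```

A global cap turns unbounded resource growth into a bounded one, at the cost that an attacker can occupy the pending slots for a while. That trade-off is presumably one reason the actual defenses are more elaborate than this sketch.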

Lessons

Use watchtowers

When all else fails, watchtowers help to protect funds if your lightning node is incapacitated by a DoS attack. If you have significant funds at risk, it’s cheap insurance to run a private watchtower on a separate machine.

Multiple processes

Prior to the above releases, CLN was the only lightning implementation that clearly kept user funds safe while under DoS, because CLN actually runs as multiple separate daemon processes. In the case of this DoS attack, the connectd daemon responsible for handling peer connections became locked up while the lightningd daemon watching the blockchain was relatively unaffected.

Multiprocess architectures in general provide some defense against DoS, as one process slowing down or crashing doesn’t automatically bring down the other processes. For this reason, other implementations may want to consider splitting their nodes into separate processes. CLN could also improve robustness further by attempting to restart DoS-able subdaemons like connectd and gossipd if they crash, rather than shutting the whole node down.

More security auditing needed

I discovered this DoS vector last year. I had been reviewing the dual funding protocol and found a griefing attack involving fake dual-funded channels. After discussing the attack with Bastien Teinturier, I came to realize that a similar attack may also affect the single-funded protocol.

But I convinced myself for a couple months that surely such a trivial attack would have been defended against already. It wasn’t until I spent some time studying implementations’ funding code that I realized there were no defenses.

The fact that this DoS vector went unnoticed since the beginning of the Lightning Network should make everyone a little scared. If a newcomer like me could discover this vulnerability in a couple months, there are probably many other vulnerabilities in the Lightning Network waiting to be found and exploited.

For quite some time, it seems that security and robustness have not been the top priority for node implementations, with some implementations not even having security policies until 6-10 months ago [1, 2]. Everyone wants new lightning features: dual-funded channels, Taproot channels, splicing, BOLT 12, etc. And those things are important. But every one of them introduces more complexity and more potential attack surface. If we’re going to make lightning even more complex, we also need to ramp up the engineering effort we put towards making the network secure and robust.

Because in the end it doesn’t matter how feature-rich and easy-to-use the Lightning Network is if it can’t keep user funds safe.