LDK: Duplicate HTLC Force Close Griefing

LDK 0.1 and below are vulnerable to a griefing attack that causes all of the victim’s channels to be force closed. Update to LDK 0.1.1 to protect your channels.

Background

Whenever a new payment is routed through a lightning channel, or whenever an existing payment is settled on the channel, the parties in that channel need to update their commitment transactions to match the new set of active HTLCs. During the course of these regular commitment updates, there is always a brief moment where one of the parties holds two valid commitment transactions. Normally that party immediately revokes the older commitment transaction after it receives a signature for the new one, bringing their number of valid commitment transactions back down to one. But for that brief moment, the other party in the channel must be able to handle the case where either of the valid commitments confirms on chain.

For this reason, LDK contains logic to detect when there’s a difference between the counterparty’s confirmed commitment transaction and the set of currently outstanding HTLCs. Any HTLCs missing from the confirmed commitment transaction are considered unrecoverable and are immediately failed backward on the upstream channel, while all other HTLCs are left active until the resolution of the downstream HTLC on chain.

Because the same payment hash and amount can be used for multiple HTLCs (e.g., multi-part payments), some extra data is stored to match HTLCs on commitment transactions against the set of outstanding HTLCs. LDK calls this extra data the “HTLC source” data and maintains it for both of the counterparty’s valid commitment transactions.

The Duplicate HTLC Failback Bug

Once a counterparty commitment transaction has been revoked, however, LDK forgets the HTLC source data for that commitment transaction to save memory. As a result, if a revoked commitment transaction later confirms, LDK must attempt to match the HTLCs on the confirmed commitment transaction against the set of outstanding HTLCs using only payment hashes and amounts. LDK’s logic to do this matching works as follows:

for htlc, htlc_source in outstanding_htlcs:
  if !confirmed_commitment_tx.is_revoked() &&
      confirmed_commitment_tx.contains_source(htlc_source):
    continue
  if confirmed_commitment_tx.is_revoked() &&
      confirmed_commitment_tx.contains_htlc(htlc.payment_hash, htlc.amount):
    continue

  failback_upstream_htlc(htlc_source)

Note that this logic short-circuits whenever an outstanding HTLC matches the payment hash and amount of an HTLC on the revoked commitment transaction. Thus if there are multiple outstanding HTLCs with the same payment hash and amount, a single HTLC on the revoked commitment transaction can prevent all of the duplicate outstanding HTLCs from being failed back immediately.

Those duplicate HTLCs remain outstanding until corresponding downstream HTLCs are resolved on chain. Except, in this case there’s only one downstream HTLC to resolve on chain, and its resolution only triggers one of the duplicate HTLCs to be failed upstream. All the other duplicate HTLCs are left outstanding indefinitely.
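
To make the failure mode concrete, here is a small, runnable toy version of the matching loop above (plain Python, not LDK’s actual code). Three outstanding HTLCs share a payment hash and amount, the revoked commitment transaction contains only one matching HTLC, and yet none of the three get failed back:

# The revoked commitment's HTLC source data has been forgotten, so only
# (payment_hash, amount) pairs are available for matching.
revoked_commitment_htlcs = {("hash_A", 100_000)}

# Outstanding HTLCs, each with its own HTLC source data.
outstanding_htlcs = [
    ("hash_A", 100_000, "source_1"),
    ("hash_A", 100_000, "source_2"),
    ("hash_A", 100_000, "source_3"),
]

failed_back = []
for payment_hash, amount, htlc_source in outstanding_htlcs:
    if (payment_hash, amount) in revoked_commitment_htlcs:
        # Every duplicate matches the single HTLC on the revoked commitment,
        # so the loop short-circuits for all of them.
        continue
    failed_back.append(htlc_source)

print(failed_back)  # [] -- all three duplicates are left outstanding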

Force Close Griefing

Consider the following topology, where B is the victim and the A_[1..N] nodes are all the nodes that B has channels with. M_1 and M_2 are controlled by the attacker.

     -- A_1 --
    /         \
M_1 --  ...  -- B -- M_2
    \         /
     -- A_N --

The attacker routes N HTLCs from M_1 to M_2 using the same payment hash and amount for each, with each payment going through a different A node. M_2 then confirms a revoked commitment that contains only one of the N HTLCs. Due to the duplicate HTLC failback bug, only one of the routed HTLCs gets failed backwards, while the remaining N-1 HTLCs get stuck.

Finally, after upstream HTLCs expire, all the A nodes with stuck HTLCs force close their channels with B to reclaim the stuck HTLCs.

Attack Cost

The attacker must broadcast a revoked commitment transaction, thereby forfeiting their channel balance. But the size of the channel can be minimal, and the attacker can spend their balance down to the 1% reserve before executing the attack. As a result, the cost of the attack can be negligible compared to the damage caused.

The Fix

Starting in v0.1.1, LDK preemptively fails back HTLCs when their deadlines approach if the downstream channel has been force closed or is in the process of force closing. While the main purpose of this behavior is to prevent cascading force closures when mempool fee rates spike, it also has a nice side effect of ensuring that duplicate HTLCs always get failed back eventually after a revoked commitment transaction confirms. As a result, the duplicate HTLCs are never stuck long enough that the upstream nodes need to force close to reclaim them.
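
As a rough sketch of the idea (plain Python with hypothetical names and a made-up deadline buffer, not LDK’s actual API), the mitigation amounts to a per-block check along these lines:

from dataclasses import dataclass

@dataclass
class OutstandingHtlc:
    source: str              # the "HTLC source" data described earlier
    upstream_expiry: int     # CLTV expiry of the upstream (incoming) HTLC
    downstream_closed: bool  # downstream channel force closed or closing

GRACE_BLOCKS = 3  # illustrative buffer; the real deadline logic differs

def htlcs_to_fail_back(current_height, outstanding_htlcs):
    """Return the sources of HTLCs to fail back preemptively instead of
    waiting for the corresponding downstream HTLCs to resolve on chain."""
    return [
        h.source for h in outstanding_htlcs
        if h.downstream_closed and current_height >= h.upstream_expiry - GRACE_BLOCKS
    ]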

Discovery

This vulnerability was discovered during an audit of LDK’s chain module.

Timeline

  • 2024-12-07: Vulnerability reported to the LDK security mailing list.
  • 2025-01-27: Fix merged.
  • 2025-01-28: LDK 0.1.1 released containing the fix, with public disclosure in release notes.
  • 2025-01-29: Detailed description of vulnerability published.

Prevention

Prior to the introduction of the duplicate HTLC failback bug in 2022, LDK would immediately fail back all outstanding HTLCs once a revoked commitment reached 6 confirmations. This was the safe and conservative thing to do – HTLC source information was missing, so proper matching of HTLCs could not be done. And since all outputs on the revoked commitment and HTLC transactions could be claimed via revocation key, there was no concern about losing funds if the downstream counterparty confirmed an HTLC claim before LDK could.

Better Documentation

Considering that LDK previously had a test explicitly checking for the original (conservative) failback behavior, it does appear that the original behavior was understood and intentional. Unfortunately the original author did not document the reason for the original behavior anywhere in the code or test.

A single comment in the code would likely have been enough to prevent later contributors from introducing the buggy behavior:

// We fail back *all* outstanding HTLCs when a revoked commitment
// confirms because we don't have HTLC source information for revoked
// commitments, and attempting to match up HTLCs based on payment hashes
// and amounts is inherently unreliable.
//
// Failing back all HTLCs after a 6 block delay is safe in this case
// since we can use the revocation key to reliably claim all funds in the
// downstream channel and therefore won't lose funds overall.

Takeaways

  • Code documentation matters for preventing bugs.
  • Update to LDK 0.1.1 for the vulnerability fix.

LDK: Invalid Claims Liquidity Griefing

LDK 0.0.125 and below are vulnerable to a liquidity griefing attack against anchor channels. The attack locks up funds such that they can only be recovered by manually constructing and broadcasting a valid claim transaction. Affected users can unlock their funds by upgrading to LDK 0.1 and replaying the sequence of commitment and HTLC transactions that led to the lock up.

Background

When a channel is force closed, LDK creates and broadcasts transactions to claim any HTLCs it can from the commitment transaction that confirmed on chain. To save on fees, some HTLC claims are aggregated and broadcast together in the same transaction.

If the channel counterparty is able to get a competing HTLC claim confirmed first, it can cause one of LDK’s aggregated transactions to become invalid, since the corresponding HTLC input has already been spent by the counterparty’s claim. LDK contains logic to detect this scenario and remove the already-claimed input from its aggregated claim transaction. When everything works correctly, the aggregated transaction becomes valid again and LDK is able to claim the remaining HTLCs.

The Invalid Claims Bug

Prior to LDK 0.1, the logic to detect conflicting claims works like this:

for confirmed_transaction in confirmed_block:
  for input in confirmed_transaction:
    if claimable_outpoints.contains(input.prevout):
      agg_tx = get_aggregated_transaction_from_outpoint(input.prevout)
      agg_tx.remove_matching_inputs(confirmed_transaction)
      break  # This is the bug.

Note that this logic stops processing a confirmed transaction after finding the first aggregated transaction that conflicts with it. If the confirmed transaction conflicts with multiple aggregated transactions, conflicting inputs are only removed from the first matching aggregated transaction, and any other conflicting aggregated transactions are left invalid.

Any HTLCs claimed by invalid aggregated transactions get locked up and can only be recovered by manually constructing and broadcasting valid claim transactions.

Liquidity Griefing

Prior to LDK 0.1, there are only two types of HTLC claims that are aggregated:

  • HTLC preimage claims
  • revoked commitment HTLC claims

For HTLC preimage claims, LDK takes care to confirm them before their HTLCs time out, so there’s no reliable way for an attacker to confirm a conflicting timeout claim and trigger the invalid claims bug.

For revoked commitment transactions, however, an attacker can immediately spend any incoming HTLC outputs via HTLC-Success transactions. Although LDK is then able to claim the HTLC-Success outputs via the revocation key, the attacker can exploit the invalid claims bug to lock up any remaining HTLCs on the revoked commitment transaction.

Setup

The attacker opens an anchor channel with the victim, creating a network topology as follows:

A -- B -- M

In this case B is the victim LDK node and M is the node controlled by the attacker. The attacker must use an anchor channel so that they can spend multiple HTLC claims in the same transaction and trigger the invalid claims bug.

The attacker then routes HTLCs along the path A->B->M as follows:

  1. 1 small HTLC with CLTV of X
  2. 1 small HTLC with CLTV of X+1
  3. 1 large HTLC with CLTV of X+1 (this is the one the attacker will lock up)

The attacker knows preimages for all HTLCs but withholds them for now.

To complete the setup, the attacker routes some other HTLC through the channel, causing the commitment transaction with the above HTLCs to be revoked.

Forcing Multiple Aggregations

Next the attacker waits until block X-13 and force closes the B-M channel using their revoked commitment transaction, being sure to get it confirmed in block X-12. By confirming in this specific block, the attacker can exploit LDK’s buggy aggregation logic prior to v0.1 (see below), causing LDK to aggregate HTLC justice claims as follows:

  • Transaction 1: HTLC 1
  • Transaction 2: HTLCs 2 and 3

Buggy Aggregation Logic

Prior to v0.1, LDK only aggregates HTLC claims if their timeouts are more than 12 blocks in the future. Presumably 12 blocks was deemed “too soon” to guarantee that LDK can confirm preimage claims before the HTLCs time out, and once one HTLC times out the counterparty can pin a competing timeout claim in mempools, thereby preventing confirmation of all the aggregated preimage claims. In other words, by claiming HTLCs separately in this scenario, LDK limits the damage the counterparty could do if one of those HTLCs expires before LDK successfully claims it.

Unfortunately, this aggregation strategy makes no sense when LDK is trying to group justice claims that the counterparty can spend immediately via HTLC-Success, since the timeout on those HTLCs does not apply to the counterparty. Nevertheless, prior to LDK 0.1, the same 12 block aggregation check applies equally to all justice claims, regardless of whether the counterparty can spend them immediately or must wait to spend via HTLC-Timeout.

An attacker can exploit this buggy aggregation logic to make LDK create multiple claim transactions, as described above.
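
A simplified sketch of that rule (plain Python with an assumed constant name and boundary, not LDK’s code) shows why the attacker’s choice of confirmation height works:

AGGREGATION_BUFFER_BLOCKS = 12  # assumed name for the pre-0.1 threshold

def can_aggregate(htlc_timeout_height, current_height):
    # Pre-0.1 rule: only aggregate a claim if its HTLC timeout is more than
    # 12 blocks in the future, even for revoked-commitment claims that the
    # counterparty can spend immediately via HTLC-Success.
    return htlc_timeout_height > current_height + AGGREGATION_BUFFER_BLOCKS

# Revoked commitment confirmed at height X-12:
X = 800_000
print(can_aggregate(X, X - 12))      # False -> HTLC 1 claimed alone (Transaction 1)
print(can_aggregate(X + 1, X - 12))  # True  -> HTLCs 2 and 3 aggregated (Transaction 2)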

Locking Up Funds

Finally, the attacker broadcasts and confirms a transaction spending HTLCs 1 and 2 via HTLC-Success. The attacker’s transaction conflicts with both Transaction 1 and Transaction 2, but due to the invalid claims bug, LDK only notices the conflict with Transaction 1. LDK continues to fee bump and rebroadcast Transaction 2 indefinitely, even though it can never be mined.

As a result, the funds in HTLC 3 remain inaccessible until a valid claim transaction is manually constructed and broadcast.

Note that if the attacker ever tries to claim HTLC 3 via HTLC-Success, LDK is able to immediately recover it via the revocation key. So while the attacker can lock up HTLC 3, they cannot actually steal it once the upstream HTLC times out.

Attack Cost

When the attacker’s revoked commitment transaction confirms, LDK is able to immediately claim the attacker’s channel balance. LDK is also able to claim HTLCs 1 and 2 via the revocation key on the B-M channel, while also claiming them via the preimage on the upstream A-B channel.

Thus a smart attacker would minimize costs by spending their channel balance down to the 1% reserve before carrying out the attack and would then set the amounts of HTLCs 1 and 2 to just above the dust threshold. The attacker would also maximize the pain inflicted on the victim by setting HTLC 3 to the maximum allowed amount.

Stealing HTLCs in 0.1-beta

Beginning in v0.1-beta, LDK started aggregating HTLC timeout claims that have compatible locktimes. As a result, the beta release is vulnerable to a variant of the liquidity griefing attack that enables the attacker to steal funds. Thankfully the invalid claims bug was fixed between the 0.1-beta and 0.1 releases, so the final LDK 0.1 release is not vulnerable to this attack.

The fund-stealing variant for LDK 0.1-beta works as follows.

Setup

The attack setup is identical to the liquidity griefing attack, except that the attacker does not cause its commitment transaction to be revoked.

Forcing Multiple Aggregations

The attacker then force closes the B-M channel. Due to differing locktimes, LDK creates HTLC timeout claims as follows:

  • Transaction 1: HTLC 1 (locktime X)
  • Transaction 2: HTLCs 2 and 3 (locktime X+1)

Once height X is reached, LDK broadcasts Transaction 1. At height X+1, LDK broadcasts Transaction 2.

At this point, if Transaction 1 confirmed immediately in block X+1, the attack fails since the attacker can no longer spend HTLCs 1 and 2 together in the same transaction. But if Transaction 1 did not confirm immediately (which is more likely), the attack can continue.

Stealing Funds

The attacker broadcasts and confirms a transaction spending HTLCs 1 and 2 via HTLC-Success. This transaction conflicts with both Transaction 1 and Transaction 2, but due to the invalid claims bug, LDK only notices the conflict with Transaction 1. LDK continues to fee bump and rebroadcast Transaction 2 indefinitely, even though it can never be mined.

Once HTLC 3’s upstream timeout expires, node A force closes and claims a refund, leaving the coast clear for the attacker to claim the downstream HTLC via preimage.

The Fix

The invalid claims bug was fixed by a one-line patch just prior to the LDK 0.1 release.
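
In terms of the pseudocode shown earlier, the corrected control flow no longer stops at the first conflicting aggregated transaction; every input of the confirmed transaction is checked. A runnable toy version (plain Python, not LDK’s actual code):

def remove_conflicting_inputs(confirmed_tx_inputs, claimable_outpoints, pending_claims):
    """confirmed_tx_inputs: set of outpoints spent by the confirmed transaction.
    claimable_outpoints: dict mapping outpoint -> claim_id of its aggregated claim.
    pending_claims: dict mapping claim_id -> set of outpoints still to claim."""
    for outpoint in confirmed_tx_inputs:
        claim_id = claimable_outpoints.get(outpoint)
        if claim_id is None:
            continue
        # Remove every input the confirmed transaction already spent from this
        # aggregated claim, then keep scanning the remaining inputs: the same
        # confirmed transaction may conflict with other aggregated claims too.
        pending_claims[claim_id] -= confirmed_tx_inputs

# Example: the attacker's transaction spends HTLCs 1 and 2, while LDK's claims
# are aggregated as {HTLC 1} and {HTLC 2, HTLC 3}.
claims = {"tx1": {"htlc1"}, "tx2": {"htlc2", "htlc3"}}
outpoint_to_claim = {"htlc1": "tx1", "htlc2": "tx2", "htlc3": "tx2"}
remove_conflicting_inputs({"htlc1", "htlc2"}, outpoint_to_claim, claims)
print(claims)  # {'tx1': set(), 'tx2': {'htlc3'}} -- both claims are updated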

Discovery

This vulnerability was discovered during an audit of LDK’s chain module.

Timeline

  • 2024-12-23: Vulnerability reported to the LDK security mailing list.
  • 2025-01-15: Fix merged.
  • 2025-01-16: LDK 0.1 released containing the fix, with public disclosure in release notes.
  • 2025-01-23: Detailed description of vulnerability published.

Prevention

The invalid claims bug is fundamentally a problem of incorrect control flow – a break statement was inserted into a loop where it shouldn’t have been. Why wasn’t it caught during initial code review, and why wasn’t it noticed for years after that?

The break statement was introduced back in 2019, long before LDK supported anchor channels. The code was actually correct back then, because before anchor channels there was no way for the counterparty to construct a transaction that conflicted with two of LDK’s aggregated transactions. But even after LDK 0.0.116 added support for anchor channels, the bug went unnoticed for over two years, despite multiple changes being made to the surrounding code in that time frame.

It’s impossible to say exactly what kept the bug hidden, but I think the complexity and unreadability of the surrounding code were likely contributors. Here’s the for-loop containing the buggy code:

let mut bump_candidates = new_hash_map();
if !txn_matched.is_empty() { maybe_log_intro(); }
for tx in txn_matched {
    // Scan all input to verify is one of the outpoint spent is of interest for us
    let mut claimed_outputs_material = Vec::new();
    for inp in &tx.input {
        if let Some((claim_id, _)) = self.claimable_outpoints.get(&inp.previous_output) {
            // If outpoint has claim request pending on it...
            if let Some(request) = self.pending_claim_requests.get_mut(claim_id) {
                //... we need to check if the pending claim was for a subset of the outputs
                // spent by the confirmed transaction. If so, we can drop the pending claim
                // after ANTI_REORG_DELAY blocks, otherwise we need to split it and retry
                // claiming the remaining outputs.
                let mut is_claim_subset_of_tx = true;
                let mut tx_inputs = tx.input.iter().map(|input| &input.previous_output).collect::<Vec<_>>();
                tx_inputs.sort_unstable();
                for request_input in request.outpoints() {
                    if tx_inputs.binary_search(&request_input).is_err() {
                        is_claim_subset_of_tx = false;
                        break;
                    }
                }

                macro_rules! clean_claim_request_after_safety_delay {
                    () => {
                        let entry = OnchainEventEntry {
                            txid: tx.compute_txid(),
                            height: conf_height,
                            block_hash: Some(conf_hash),
                            event: OnchainEvent::Claim { claim_id: *claim_id }
                        };
                        if !self.onchain_events_awaiting_threshold_conf.contains(&entry) {
                            self.onchain_events_awaiting_threshold_conf.push(entry);
                        }
                    }
                }

                // If this is our transaction (or our counterparty spent all the outputs
                // before we could anyway with same inputs order than us), wait for
                // ANTI_REORG_DELAY and clean the RBF tracking map.
                if is_claim_subset_of_tx {
                    clean_claim_request_after_safety_delay!();
                } else { // If false, generate new claim request with update outpoint set
                    let mut at_least_one_drop = false;
                    for input in tx.input.iter() {
                        if let Some(package) = request.split_package(&input.previous_output) {
                            claimed_outputs_material.push(package);
                            at_least_one_drop = true;
                        }
                        // If there are no outpoints left to claim in this request, drop it entirely after ANTI_REORG_DELAY.
                        if request.outpoints().is_empty() {
                            clean_claim_request_after_safety_delay!();
                        }
                    }
                    //TODO: recompute soonest_timelock to avoid wasting a bit on fees
                    if at_least_one_drop {
                        bump_candidates.insert(*claim_id, request.clone());
                        // If we have any pending claim events for the request being updated
                        // that have yet to be consumed, we'll remove them since they will
                        // end up producing an invalid transaction by double spending
                        // input(s) that already have a confirmed spend. If such spend is
                        // reorged out of the chain, then we'll attempt to re-spend the
                        // inputs once we see it.
                        #[cfg(debug_assertions)] {
                            let existing = self.pending_claim_events.iter()
                                .filter(|entry| entry.0 == *claim_id).count();
                            assert!(existing == 0 || existing == 1);
                        }
                        self.pending_claim_events.retain(|entry| entry.0 != *claim_id);
                    }
                }
                break; //No need to iterate further, either tx is our or their
            } else {
                panic!("Inconsistencies between pending_claim_requests map and claimable_outpoints map");
            }
        }
    }
    for package in claimed_outputs_material.drain(..) {
        let entry = OnchainEventEntry {
            txid: tx.compute_txid(),
            height: conf_height,
            block_hash: Some(conf_hash),
            event: OnchainEvent::ContentiousOutpoint { package },
        };
        if !self.onchain_events_awaiting_threshold_conf.contains(&entry) {
            self.onchain_events_awaiting_threshold_conf.push(entry);
        }
    }
}

Perhaps others have a better mental parser than me, but I find this code quite difficult to read and understand. The loop is so long, with so much nesting and so many low-level implementation details that by the time I get to the buggy break statement, I’ve completely forgotten what loop it applies to. And since the comment attached to the break statement gives a believable explanation, it’s easy to gloss right over it.

Perhaps the buggy control flow would be easier to spot if the loop were simpler and more compact. By hand-waving some helper functions into existence and refactoring, the same code could be written as follows:

maybe_log_intro();

let mut bump_candidates = new_hash_map();
for tx in txn_matched {
    for inp in &tx.input {
        if let Some(claim_request) = self.get_mut_claim_request_from_outpoint(inp.previous_output) {
            let split_requests = claim_request.split_off_matching_inputs(&tx.input);
            debug_assert!(!split_requests.is_empty());

            if claim_request.outpoints().is_empty() {
                // Request has been fully claimed.
                self.mark_request_claimed(claim_request, tx, conf_height, conf_hash);
                break;
            }

            // After removing conflicting inputs, there's still more to claim.  Add the modified
            // request to bump_candidates so it gets fee bumped and rebroadcast.
            self.remove_pending_claim_events(claim_request);
            bump_candidates.insert(claim_request.clone());

            self.mark_requests_contentious(split_requests, tx, conf_height, conf_hash);
            break;
        }
    }
}

The control flow in this version is much more apparent to the reader. And although there’s no guarantee that the buggy break statement would have been discovered sooner had the code been written this way, I do think the odds would have been much better.

Takeaways

  • Code readability matters for preventing bugs.
  • Update to LDK 0.1 for the vulnerability fix.

DoS: LND Onion Bomb

LND versions prior to 0.17.0 are vulnerable to a DoS attack where malicious onion packets cause the node to instantly run out of memory (OOM) and crash. If you are running an LND release older than this, your funds are at risk! Update to at least 0.17.0 to protect your node.

Severity

It is critical that users update to at least LND 0.17.0 for several reasons.

  • The attack is cheap and easy to carry out and will keep the victim offline for as long as it lasts.
  • The source of the attack is concealed via onion routing. The attacker does not need to connect directly to the victim.
  • Prior to LND 0.17.0, all nodes are vulnerable. The fix was not backported to the LND 0.16.x series or earlier.

The Vulnerability

The Lightning Network uses onion routing to provide senders and receivers of payments some degree of privacy. Each node along a payment route receives an onion packet from the previous node, containing forwarding instructions for the next node on the route. The onion packet is encrypted by the initiator of the payment, so that each node can only read its own forwarding instructions.

Once a node has “peeled off” its layer of encryption from the onion packet, it can extract its forwarding instructions according to the format specified in the LN protocol:

Field Name    Size             Description
length        1-9 bytes        The length of the payload field, encoded as BigSize.
payload       length bytes     The forwarding instructions.
hmac          32 bytes         The HMAC to use for the forwarded onion packet.
next_onion    remaining bytes  The onion packet to forward.

Prior to LND 0.17.0, the code that extracts these instructions is essentially:

// Decode unpacks an encoded HopPayload from the passed reader into the
// target HopPayload.
func (hp *HopPayload) Decode(r io.Reader) error {
    bufReader := bufio.NewReader(r)

    var b [8]byte
    varInt, err := ReadVarInt(bufReader, &b)
    if err != nil {
        return err
    }

    payloadSize := uint32(varInt)

    // Now that we know the payload size, we'll create a new buffer to
    // read it out in full.
    hp.Payload = make([]byte, payloadSize)
    if _, err := io.ReadFull(bufReader, hp.Payload[:]); err != nil {
        return err
    }
    if _, err := io.ReadFull(bufReader, hp.HMAC[:]); err != nil {
        return err
    }

    return nil
}

Note the absence of a bounds check on payloadSize!

Regardless of the actual payload size, LND allocates memory for whatever length is encoded in the onion packet up to UINT32_MAX (4 GB).

The DoS Attack

It is trivial for an attacker to craft an onion packet that contains an encoded length of UINT32_MAX for the victim’s forwarding instructions. If the victim’s node has less than 4 GB of memory available, it will OOM crash instantly upon receiving the attacker’s packet.

However, if the victim’s node has more than 4 GB of memory available, it is able to recover from the malicious packet. The victim’s node will temporarily allocate 4 GB, but the Go garbage collector will quickly reclaim that memory after decoding fails.

So nodes with more than 4 GB of RAM are safe, right?

Not quite. The attacker can send many malicious packets simultaneously. If the victim processes enough malicious packets before the garbage collector kicks in, an OOM will still occur. And since LND decodes onion packets in parallel, it is not difficult for an attacker to beat the garbage collector. In my experiments I was able to consistently crash nodes with up to 128 GB of RAM in just a few seconds.

The Fix

A bounds check on the encoded length field was concealed in a large refactoring commit and included in LND 0.17.0. The fixed code is essentially:

// Decode unpacks an encoded HopPayload from the passed reader into the
// target HopPayload.
func (hp *HopPayload) Decode(r io.Reader) error {
    bufReader := bufio.NewReader(r)

    payloadSize, err := tlvPayloadSize(bufReader)
    if err != nil {
        return err
    }

    // Now that we know the payload size, we'll create a new buffer to
    // read it out in full.
    hp.Payload = make([]byte, payloadSize)
    if _, err := io.ReadFull(bufReader, hp.Payload[:]); err != nil {
        return err
    }
    if _, err := io.ReadFull(bufReader, hp.HMAC[:]); err != nil {
        return err
    }

    return nil
}

// tlvPayloadSize uses the passed reader to extract the payload length
// encoded as a var-int.
func tlvPayloadSize(r io.Reader) (uint16, error) {
    var b [8]byte
    varInt, err := ReadVarInt(r, &b)
    if err != nil {
        return 0, err
    }

    if varInt > math.MaxUint16 {
        return 0, fmt.Errorf("payload size of %d is larger than the "+
            "maximum allowed size of %d", varInt, math.MaxUint16)
    }

    return uint16(varInt), nil
}

This new code reduces the maximum amount of memory LND will allocate when decoding an onion packet from 4 GB to 64 KB, which is enough to fully mitigate the DoS attack.

Discovery

A simple fuzz test for onion packet encoding and decoding revealed this vulnerability.

Timeline

  • 2023-06-20: Vulnerability discovered and disclosed to Lightning Labs.
  • 2023-08-23: Fix merged.
  • 2023-10-03: LND 0.17.0 released containing the fix.
  • 2024-05-16: Laolu gives the OK to disclose publicly once LND 0.18.0 is released and has some uptake.
  • 2024-05-30: LND 0.18.0 released.
  • 2024-06-18: Public disclosure.

Prevention

This vulnerability was found in less than a minute of fuzz testing. If basic fuzz tests had been written at the time the original onion decoding functions were introduced, the bug would have been caught before it was merged.

In general any function that processes untrusted inputs is a strong candidate for fuzz testing, and often these fuzz tests are easier to write than traditional unit tests. A minimal fuzz test that detects this particular vulnerability is exceedingly simple:

func FuzzHopPayload(f *testing.F) {
    f.Fuzz(func(t *testing.T, data []byte) {
        // Hop payloads larger than 1300 bytes violate the spec and never
        // reach the decoding step in practice.
        if len(data) > 1300 {
            return
        }

        var hopPayload sphinx.HopPayload
        hopPayload.Decode(bytes.NewReader(data))
    })
}

Takeaways

  • Write fuzz tests for all APIs that consume untrusted inputs.
  • Update your LND nodes to at least 0.17.0.

DoS: Channel Open Race in CLN

CLN versions between 23.02 and 23.05.2 are susceptible to a DoS attack involving the exploitation of a race condition during channel opens. If you are running any version in this range, your funds may be at risk! Update to at least 23.08 to help protect your node.

The Vulnerability

The vulnerability arises from a race condition between two different flows in CLN: the channel open flow and the peer connection flow.

The Channel Open Flow

When a peer opens a channel with a CLN node, the following interactions occur on the CLN node.

[Diagram: the channel open flow]

  1. The connectd daemon notifies lightningd about the channel open request.
  2. lightningd launches a new openingd daemon to handle the channel open negotiation.
  3. openingd completes the channel open negotiation up to the point where the funding outpoint is known.
  4. openingd sends the funding outpoint to lightningd and exits.
  5. lightningd launches a channeld daemon to manage the new channel.

The Peer Connection Flow

Once a peer has a channel with a CLN node, if the peer disconnects and reconnects, the following occurs on the CLN node.

[Diagram: the peer connection flow with an existing channel]

  1. The connectd daemon notifies lightningd about the new peer connection.
  2. lightningd calls a plugin hook notifying the chanbackup plugin about the new peer connection.
  3. chanbackup notifies lightningd that it is done running the hook.
  4. With the hook finished, lightningd recognizes that a previous channel exists with the peer and launches a channeld daemon to manage it.

The Race Condition

Problems arise when the peer connection flow overlaps with the channel open flow, causing lightningd to attempt launching the same channeld daemon twice. This can happen if the peer quickly opens a channel after connecting, and the chanbackup plugin is delayed in handling the peer connection hook, leading to the following interactions on the CLN node.

[Diagram: the channel open race]

  1. The connectd daemon notifies lightningd about the new peer connection.
  2. lightningd calls a plugin hook notifying the chanbackup plugin about the new peer connection.
  3. The connectd daemon notifies lightningd about the channel open request.
  4. lightningd launches a new openingd daemon to handle the channel open negotiation.
  5. openingd completes the channel open negotiation up to the point where the funding outpoint is known.
  6. openingd sends the funding outpoint to lightningd and exits.
  7. lightningd launches a channeld daemon to manage the new channel.
  8. chanbackup notifies lightningd that it is done running the hook.
  9. With the hook finished, lightningd recognizes that a previous channel exists with the peer and attempts to launch a channeld daemon to manage it. Since the daemon is already running, an assertion failure occurs and CLN crashes.

The DoS Attack

To reliably trigger the assertion failure, an attacker needs to somehow slow down the chanbackup plugin so that a channel can be opened before the plugin finishes running the peer connected hook. One way to do this is to overload chanbackup with many peer connections and channel state changes. As it turns out, the fake channel DoS attack is a trivial and free method of generating these events and overloading chanbackup.

On a local network with low latency, I was able to generate enough load on chanbackup to consistently crash CLN nodes in under 5 seconds. In the real world the attack would be carried out across the Internet with higher latencies, so more load on chanbackup would be required to trigger the race condition. In my experiments, crashing CLN nodes across the Internet took around 30 seconds.

The Defense

To prevent the assertion failure from triggering, a small patch was added to CLN 23.08 that checks if a channeld is already running when the peer connected hook returns. If so, lightningd does not attempt to start the channeld again.
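
Conceptually (an illustrative Python sketch with made-up names, not CLN’s C code), the guard just makes the channeld launch idempotent:

from dataclasses import dataclass, field

@dataclass
class Channel:
    channeld_running: bool = False

@dataclass
class Peer:
    channels: list = field(default_factory=list)

def launch_channeld(channel):
    channel.channeld_running = True

def on_peer_connected_hook_done(peer):
    for channel in peer.channels:
        # The 23.08 guard: if the channel open flow already launched a channeld
        # for this channel while the hook was still running, skip it instead of
        # hitting the assertion and crashing.
        if channel.channeld_running:
            continue
        launch_channeld(channel)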

Note that this patch does not actually remove the race condition, though it does prevent crashing when the race occurs.

Discovery

This vulnerability was discovered during follow-up testing prior to the disclosure of the fake channel DoS vector. At the time, Rusty and I agreed to move forward with the planned disclosure of the fake channel DoS vector, but to delay disclosure of this channel open race until a later date.

Since the channel open race can be triggered by the fake channel DoS attack, it is fair to ask how the race went undiscovered during the implementation of defenses against that attack. The answer is that the race was actually untriggerable until a few weeks after the fake channel DoS defenses were merged.

While the race condition was introduced in March 2022, the race couldn’t actually trigger because no plugins used the peer connected hook. It wasn’t until February 2023 that the race was exposed, when the peer storage backup feature made chanbackup the first official plugin to use the hook.

Timeline

  • 2022-03-23: Race condition introduced to CLN 0.11.
  • 2022-12-15: Fake channel DoS vector disclosed to Blockstream.
  • 2023-01-21: Fake channel DoS defenses fully merged [1, 2].
  • 2023-02-08: Peer storage backup feature introduced, exposing the channel open race vulnerability.
  • 2023-03-03: CLN 23.02 released.
  • 2023-07-28: Rusty gives the OK to disclose the fake channel DoS vector.
  • 2023-08-14: Follow-up testing reveals the channel open race vulnerability. Disclosed to Blockstream.
  • 2023-08-21: Defense against the channel open race DoS merged.
  • 2023-08-22: Rusty gives the OK to continue with the fake channel DoS disclosure, but requests that the channel open race vulnerability be omitted from the disclosure.
  • 2023-08-23: Public disclosure of the fake channel DoS.
  • 2023-08-23: CLN 23.08 released.
  • 2023-12-04: Rusty gives the OK to disclose the channel open race vulnerability.
  • 2024-01-08: Public disclosure.

Prevention

This vulnerability could have been prevented by a couple of software engineering best practices.

Avoid Race Conditions

The original purpose of the peer connected hook was to enable plugins to filter and reject incoming connections from certain peers. Therefore the hook was designed to be synchronous, and all other events initiated by the peer were blocked until the hook returned. Unfortunately, PR 5078 destroyed that property of the hook by introducing a known race condition to the code (search for “this is racy” in commit 2424b7d). If PR 5078 hadn’t done this, there would be no race condition to exploit and this vulnerability would never have existed.

Race conditions can be nasty and should be avoided whenever possible. Knowingly adding race conditions where they didn’t previously exist is generally a bad idea.

Do Stress Testing

When I disclosed the fake channel DoS vector to Blockstream, I also provided a DoS program that demonstrated the attack. That same DoS program revealed the channel open race vulnerability after it became triggerable in February 2023. If a stress test based on the DoS program had been added to CLN’s CI pipeline or release process, this vulnerability could have been caught much earlier, before it was included in any releases.

In general, there is some difficulty in releasing such a test publicly while the vulnerability it tests for is still secret. In such situations the test can remain unreleased until the vulnerability has been publicly disclosed, and in the meantime the test can be run privately during the release process to ensure no regressions have been introduced. In CLN’s case, this may have been unnecessary – a stress test could have plausibly been added to PR 5849 without raising suspicion.

Takeaways

  • Avoid race conditions.
  • Use regression and stress testing.
  • Update your CLN nodes to at least v23.08.

Invoice Parsing Bugs in CLN

Several invoice parsing bugs were fixed in CLN 23.11, including bugs that caused crashes, undefined behavior, and use of uninitialized memory. These bugs could be reliably triggered by specially crafted invoices, enabling a malicious counterparty to crash the victim’s node upon invoice payment.

The parsing bugs were discovered by a new fuzz test written by Niklas Gögge and enhanced by me.

Bugs fixed in v23.11

#  Type                         Root Cause                      Fix
1  undefined behavior           unchecked return value          eeec529
2  use of uninitialized memory  missing check for 0-length TLV  ee501b0
3  crash                        unnecessary assertion           ee8cf69
4  crash                        missing recovery ID validation  c1f2068
5  crash                        missing pubkey validation       87f4907

The fuzz target

The fuzz target that uncovered these bugs was initially written by Niklas Gögge in December 2022, though it wasn’t made public until October 2023. The target simply provides fuzzer-generated inputs to CLN’s invoice decoding function, similar to fuzz targets written for other implementations [1, 2].

To improve the fuzzer’s efficiency, Niklas also wrote a custom mutator for the target. Invoices are encoded in bech32, which requires a valid checksum at the end of the encoding, making it quite difficult for fuzzers to generate valid bech32 consistently. As a result, bech32-naive fuzzers will generally get stuck at the bech32 decoding stage and have a hard time exploring deeper into the invoice parsing logic. Niklas’ custom mutator teaches the fuzzer how to generate valid bech32 so that it can focus its fuzzing on invoice parsing.

Initial fuzzing in 2022

After writing the fuzz target in December 2022, Niklas privately reported several bugs to CLN, including a stack buffer overflow, an assertion failure, and undefined behavior due to a 0-length array. Many of the bugs were fixed in PR 5891 and released in CLN 23.02.

Merging the fuzz target in 2023

In October 2023, Niklas submitted his fuzz target for review in PR 6750. The initial corpus in that PR actually triggered bugs 1 and 2, but Niklas didn’t notice because he had been fuzzing with some UBSan options misconfigured. CLN’s CI didn’t detect the bugs either, since UBSan had previously been accidentally disabled in CI.

Niklas also discovered bug 3 during initial fuzzing, but he initially thought it was a false report and hard-coded an exception for it in the fuzz target.

Enhancements

The initial fuzz target only fuzzed the invoice decoding logic, skipping signature checks. I modified the target to also run the signature-checking logic, which enabled the fuzzer to quickly find bug 4.

While bug 5 should have also been discoverable by the fuzzer after this change, it remained undetected even after many weeks of CPU time. It wasn’t until I added a custom cross-over mutator for the fuzz target that bug 5 was discovered. The cross-over mutator is based on Niklas’ custom mutator and simply combines pieces from multiple bech32-decoded invoices before re-encoding the result in bech32. Within a few CPU hours of fuzzing with this extra mutator, the fuzzer found bug 5.

Impact

The severity of these bugs seems relatively low since they can only be triggered when paying an invoice. If a malicious invoice causes your node to crash, no further harm can be done as long as you restart your node in a timely manner and avoid paying any more invoices from the malicious counterparty.

Since bug 2 involves uninitialized memory it could potentially be more serious, as a sophisticated attacker may be able to extract sensitive data from the invoice-decoding process. Such an attack would be quite complex, and it is unclear whether it would even be possible in practice. It’s also unclear exactly what sensitive data could be extracted, since CLN handles private keys in a separate dedicated process (the hsmd daemon).

Takeaways

  • Fuzz testing is an essential component of writing robust and secure software. Any API that consumes untrusted inputs should be fuzz tested.
  • Custom mutators can be very powerful for fuzzing deeper logic in the codebase.
  • Fuzz testing of C or C++ code should use both ASan and UBSan. MSan and valgrind can also be useful.