LND's Deadline-Aware Budget Sweeper

Starting with v0.18.0, LND has a completely rewritten sweeper subsystem for managing transaction batching and fee bumping. The new sweeper uses HTLC deadlines and fee budgets to compute a fee rate curve, dynamically adjusting fees (fee bumping) to prioritize urgent transactions. This new fee bumping strategy has some nice security benefits and is something other Lightning implementations should consider adopting.

Background

When an unreliable (or malicious) Lightning node goes offline while HTLCs are in flight, the other node in the channel can no longer claim the HTLCs off chain and will eventually have to force close and claim the HTLCs on chain. When this happens, it is critical that all HTLCs are claimed before certain deadlines:

  • Incoming HTLCs need to be claimed before their timelocks expire; otherwise, the channel counterparty can submit a competing timeout claim.
  • Outgoing HTLCs need to be claimed before their corresponding upstream HTLCs expire; otherwise, the upstream node can reclaim them on chain.

If HTLCs are not claimed before their deadlines, they can be entirely lost (or stolen).

Thus Lightning nodes need to pay enough transaction fees to ensure timely confirmation of their commitment and HTLC transactions. At the same time, nodes don’t want to overpay the fees, as these fees can become a major cost for node operators.

The solution implemented by all Lightning nodes is to start with a relatively low fee rate for these transactions and then use RBF to increase the fee rate as deadlines get closer.

RBF Strategies

Each node implementation uses a slightly different algorithm for choosing RBF fee rates, but in general there are two main strategies:

  • external fee rate estimators
  • exponential bumping

External Fee Rate Estimators

This strategy chooses fee rates based on Bitcoin Core’s (or some other) fee rate estimator. The estimator is queried with the HTLC deadline as the confirmation target, and the returned fee rate is used for commitment and HTLC transactions. Typically the estimator is requeried every block to update fee rates and RBF any unconfirmed transactions.
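
As a rough sketch (with purely illustrative names, not any implementation's actual code), the strategy amounts to re-querying the estimator each block with the remaining deadline as the confirmation target:

package main

import "fmt"

// feeRateForDeadline sketches the estimator-driven strategy: the remaining
// blocks until the deadline are used directly as the estimator's
// confirmation target, and the estimator is re-queried every block. The
// estimator here is just a function returning sat/vB for a given target.
func feeRateForDeadline(estimator func(confTarget uint32) float64,
    currentHeight, deadlineHeight uint32) float64 {

    confTarget := uint32(1)
    if deadlineHeight > currentHeight {
        confTarget = deadlineHeight - currentHeight
    }
    return estimator(confTarget)
}

func main() {
    // Toy estimator: fee rates rise as the confirmation target shrinks.
    estimator := func(confTarget uint32) float64 {
        return 100.0 / float64(confTarget)
    }

    // Re-query every block as the deadline at height 840100 approaches.
    for height := uint32(840094); height < 840100; height++ {
        fmt.Printf("height %d: %.1f sat/vB\n", height,
            feeRateForDeadline(estimator, height, 840100))
    }
}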

CLN and LND prior to v0.18.0 use this strategy exclusively. eclair uses this strategy until deadlines are within 6 blocks, after which it switches to exponential bumping. LDK uses a combined strategy that sometimes uses the fee rate from the estimator and other times uses exponential bumping.

Exponential Bumping

In this strategy, the fee rate estimator is used to determine the initial fee rate, after which a fixed multiplier is used to increase fee rates for each RBF transaction.

eclair uses this strategy when deadlines are within 6 blocks, increasing fee rates by 20% each block while capping the total fees paid at the value of the HTLC being claimed. When LDK uses this strategy, it increases fee rates by 25% on each RBF.
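
As a rough illustration (not eclair's or LDK's actual code), the resulting fee rates form a geometric progression from the starting estimate; eclair additionally caps total fees at the HTLC value, which this sketch omits:

package main

import "fmt"

// exponentialBump sketches the bumping strategy described above: start from
// an estimator-provided fee rate and multiply by a fixed factor on each RBF.
// The 1.2 factor mirrors eclair's 20% per-block bump; LDK uses 1.25.
func exponentialBump(startFeeRate, factor float64, attempts int) []float64 {
    rates := make([]float64, 0, attempts)
    rate := startFeeRate
    for i := 0; i < attempts; i++ {
        rates = append(rates, rate)
        rate *= factor
    }
    return rates
}

func main() {
    // 10 sat/vB starting rate, 20% bump per block, 6 RBF attempts.
    for i, r := range exponentialBump(10, 1.2, 6) {
        fmt.Printf("attempt %d: %.1f sat/vB\n", i, r)
    }
}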

Problems

While external fee rate estimators can be helpful, they’re not perfect. And relying on them too much can lead to missed deadlines when unusual things are happening in the mempool or with miners (e.g., increasing mempool congestion, pinning, replacement cycling, miner censorship). In such situations, higher-than-estimated fee rates may be needed to actually get transactions confirmed. Exponential bumping strategies help here but can still be ineffective if the original fee rate was too low.

The Deadline and Budget Aware RBF Strategy

LND’s new sweeper subsystem, released in v0.18.0, takes a novel approach to RBFing commitment and HTLC transactions. The system was designed around a key observation: for each HTLC on a commitment transaction, there are specific deadline and budget constraints for claiming that HTLC. The deadline is the block height by which the node needs to confirm the claim transaction for the HTLC. The budget is the maximum absolute fee the node operator is willing to pay to sweep the HTLC by the deadline. In practice, the budget is likely to be a fixed proportion of the HTLC value (i.e. operators are willing to pay more fees for larger HTLCs), so LND’s budget configuration parameters are based on proportions.

The sweeper operates by aggregating HTLC claims with matching deadlines into a single batched transaction. The budget for the batched transaction is calculated as the sum of the budgets for the individual HTLCs in the transaction. Based on the transaction budget and deadline, a fee function is computed that determines how much of the budget is spent as the deadline approaches. By default, a linear fee function is used which starts at a low fee (determined by the minimum relay fee rate or an external estimator) and ends with the total budget being allocated to fees when the deadline is one block away. The initial batched transaction is published and a “fee bumper” is assigned to monitor confirmation status in the background. For each block the transaction remains unconfirmed, the fee bumper broadcasts a new transaction with a higher fee rate determined by the fee function.
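
A minimal sketch of such a linear fee function, assuming an absolute-fee budget in sats and purely illustrative parameter names (this is not LND's exact implementation):

package main

import "fmt"

// linearFeeFunction sketches the default fee function described above: the
// fee starts low when the sweep is first broadcast and ramps linearly so
// that the full budget is spent when the deadline is one block away.
// Parameter names and endpoint handling are illustrative.
func linearFeeFunction(startFee, budget uint64, startHeight, deadlineHeight,
    currentHeight uint32) uint64 {

    endHeight := deadlineHeight - 1 // spend the full budget one block early
    if currentHeight >= endHeight {
        return budget
    }
    if currentHeight <= startHeight {
        return startFee
    }

    // Linearly interpolate between the starting fee and the full budget.
    elapsed := uint64(currentHeight - startHeight)
    width := uint64(endHeight - startHeight)
    return startFee + (budget-startFee)*elapsed/width
}

func main() {
    // 330 sat starting fee, 50,000 sat budget, deadline 40 blocks away.
    for height := uint32(840000); height < 840040; height += 10 {
        fmt.Printf("height %d: %d sats\n", height,
            linearFeeFunction(330, 50_000, 840000, 840040, height))
    }
}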

The sweeper architecture looks like this:

[sweeper architecture diagram]

For more details about LND’s new sweeper, see the technical documentation. In this blog post, we’ll focus mostly on the sweeper’s deadline and budget aware RBF strategy.

Benefits

LND’s new sweeper system provides greater security against replacement cycling, pinning, and other adversarial or unexpected scenarios. It also fixed some bad bugs and vulnerabilities present with LND’s previous sweeper system.

Replacement Cycling Defense

Transaction rebroadcasting is a simple mitigation against replacement cycling attacks that has been adopted by all implementations. However, rebroadcasting alone does not guarantee that such attacks become uneconomical, especially when HTLC values are much larger than the fees Lightning nodes are willing to pay when claiming them on chain. By setting fee budgets in proportion to HTLC values, LND’s new sweeper is able to provide much stronger guarantees that any replacement cycling attacks will be uneconomical.

Cost of Replacement Cycling Attacks

With LND’s default parameters an attacker must generally spend at least 20x the value of the HTLC to successfully carry out a replacement cycling attack.

Default parameters:

  • fee budget: 50% of HTLC value
  • CLTV delta: 80 blocks

Assuming the attacker must do a minimum of one replacement per block:

\[attack\_cost \ge \sum_{t=0}^{80} fee\_function(t)\]

\[attack\_cost \ge \sum_{t=0}^{80} 0.5 \cdot htlc\_value \cdot \frac{t}{80}\]

\[attack\_cost \ge 20 \cdot htlc\_value\]

LND also rebroadcasts transactions every minute by default, so in practice the attacker must do ~10 replacements per block, making the cost closer to 200x the HTLC value.
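
The bound can be checked numerically; the following Go snippet (with an illustrative HTLC value) sums the default linear fee function over the deadline window:

package main

import "fmt"

// Reproduce the bound above numerically: with a 50% fee budget and an
// 80-block CLTV delta, sum the default linear fee function over every block
// of the deadline window to get the minimum an attacker must spend on
// replacements. The HTLC value is illustrative; only the multiples matter.
func main() {
    const (
        htlcValue    = 1_000_000.0 // sats
        budgetRatio  = 0.5
        cltvDelta    = 80
        replPerBlock = 10 // ~1 replacement per minute due to rebroadcasting
    )

    cost := 0.0
    for t := 0; t <= cltvDelta; t++ {
        cost += budgetRatio * htlcValue * float64(t) / float64(cltvDelta)
    }

    fmt.Printf("1 replacement per block:   %.1fx the HTLC value\n", cost/htlcValue)
    fmt.Printf("10 replacements per block: %.1fx the HTLC value\n",
        replPerBlock*cost/htlcValue)
}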

Partial Pinning Defense

Because LND’s new default RBF strategy pays up to 50% of the HTLC value, LND now has a much greater ability to outbid pinning attacks, especially for larger HTLCs. It is unfortunate that significant fees need to be burned in this case, but the end result is still better than losing the full value of the HTLC.

Reduced Reliance on Fee Rate Estimators

As explained earlier, fee rate estimators are not always accurate, especially when mempool conditions are changing rapidly. In these situations, it can be very beneficial to use a simpler RBF strategy, especially when deadlines are approaching. LDK and eclair use exponential bumping in these scenarios, which helps in many cases. But ultimately the fee rate curve for an exponential bumping strategy still depends heavily on the starting fee rate, and if that fee rate is too low then deadlines can be missed. The exponential bumping strategy also ignores the value of the HTLC being claimed, which means that larger HTLCs get the same fee rates as smaller HTLCs, even when deadlines are getting close.

LND’s budget-based approach takes HTLC values into consideration when establishing the fee rate curve, ensuring that budgets are never exceeded and that HTLCs are never lost before an attempt to spend the full budget has been made. As such, the budget-based approach provides more consistent results and greater security in unexpected or adversarial situations.

LND-Specific Bug and Vulnerability Fixes

LND’s new sweeper fixed some bad bugs and vulnerabilities that existed with the previous sweeper.

Fee Bump Failures

Previously, LND had an inconsistent approach to broadcasting and fee bumping urgent transactions. In some places transactions would get broadcast with a specific confirmation target and would never be fee bumped again. In other places transactions would be RBF’d if the fee rate estimator determined that mempool fee rates had gone up, but the confirmation target given to the estimator would not be adjusted as deadlines approached.

Perhaps the worst of these fee bumping failures was a bug reported by Carsten Otto, where LND would fail to use the anchor output to CPFP a commitment transaction if the initial HTLC deadlines were far enough in the future. While this behavior is initially desirable to save on fees, it becomes a major problem when deadlines get closer and the commitment hasn’t confirmed on its own. Because LND did not adjust confirmation targets as deadlines approached, the commitment transaction would remain un-CPFP’d and could fail to confirm before HTLCs expired, putting funds at risk of loss. To make matters worse, the bug was trivial for an attacker to exploit.

LND’s sweeper rewrite took the opportunity to correct and unify all the transaction broadcasting and fee bumping logic in one place and fix all of these fee bumping failures at once.

Invalid Batching

LND’s previous sweeper also sometimes generated invalid or unsafe transactions when batching inputs together. This could happen in a couple ways:

  • Inputs that were invalid or had been double-spent could be batched with urgent HTLC claims, making the whole transaction invalid.
  • Anchor spends could be batched together, thereby violating the CPFP carve out and enabling channel counterparties to pin commitment transactions.

Rather than addressing these issues directly, the previous sweeper would use exponential backoff to regroup inputs after random delays and hope for a valid transaction. If another invalid transaction occurred, longer delays would be used before the next regrouping. Eventually, deadlines could be missed and funds lost.

LND’s new sweeper fixed these issues by being more careful about which inputs could be grouped together and by removing double-spent inputs from transactions that failed to broadcast.

Risks

The security of a Lightning node depends heavily on its ability to resolve HTLCs on chain when necessary. And unfortunately proper on-chain resolution can be tricky to get right (see 1, 2, 3). Making changes to the existing on-chain logic runs the risk of introducing new bugs and vulnerabilities.

For example, during code reviews of LND’s new sweeper there were many serious bugs discovered and fixed, ranging from catastrophic fee function failures to new fund-stealing exploits and more (1, 2, 3, 4, 5, 6). Node implementers should tread carefully when touching these parts of the codebase and remember that simplicity is often the best security.

Conclusion

LND’s new deadline-aware budget sweeper provides more secure fee bumping in adversarial situations and more consistent behavior when mempools are rapidly changing. Other implementations should consider incorporating budget awareness into their fee bumping strategies to improve defenses against replacement cycling and pinning attacks, and to reduce reliance on external fee estimators. At the same time, implementers would do well to avoid complete rewrites of the on-chain logic and instead keep the changes small and review them well.

LND: Excessive Failback Exploit

LND 0.17.5 and below contain a bug in the on-chain resolution logic that can be exploited to steal funds. For the attack to be practical the attacker must be able to force a restart of the victim node, perhaps via an unpatched DoS vector. Update to at least LND 0.18.0 to protect your node.

Background

Whenever a new payment is routed through a lightning channel, or whenever an existing payment is settled on the channel, the parties in that channel need to update their commitment transactions to match the new set of active HTLCs. During the course of these regular commitment updates, there is always a brief moment where one of the parties holds two valid commitment transactions. Normally that party immediately revokes the older commitment transaction after it receives a signature for the new one, bringing their number of valid commitment transactions back down to one. But for that brief moment, the other party in the channel must be able to handle the case where either of the valid commitments confirms on chain.

As part of this handling, nodes need to detect when any currently outstanding HTLCs are missing from the confirmed commitment transaction so that those HTLCs can be failed backward on the upstream channel.

The Excessive Failback Bug

Prior to v0.18.0, LND’s logic to detect and fail back missing HTLCs works like this:

func failBackMissingHtlcs(confirmedCommit Commitment) {
  currentCommit, pendingCommit := getValidCounterpartyCommitments()

  var danglingHtlcs HtlcSet
  if confirmedCommit == pendingCommit {
    danglingHtlcs = currentCommit.Htlcs()
  } else {
    danglingHtlcs = pendingCommit.Htlcs()
  }

  confirmedHtlcs := confirmedCommit.Htlcs()
  missingHtlcs := danglingHtlcs.SetDifference(confirmedHtlcs)
  for _, htlc := range missingHtlcs {
    failBackHtlc(htlc)
  }
}

LND compares the HTLCs present on the confirmed commitment transaction against the HTLCs present on the counterparty’s other valid commitment (if there is one) and fails back any HTLCs that are missing from the confirmed commitment. This logic is mostly correct, but it does the wrong thing in one particular scenario:

  1. LND forwards an HTLC H to the counterparty, signing commitment C0 with H added as an output. The previous commitment is revoked.
  2. The counterparty claims H by revealing the preimage to LND.
  3. LND forwards the preimage upstream to start the process of claiming the incoming HTLC.
  4. LND signs a new counterparty commitment C1 with H removed and its value added to the counterparty’s balance.
  5. The counterparty refuses to revoke C0.
  6. The counterparty broadcasts and confirms C1.

In this case, LND compares the confirmed commitment C1 against the other valid commitment C0 and determines that H is missing from the confirmed commitment. As a result, LND incorrectly determines that H needs to be failed back upstream, and executes the following logic:

func failBackHtlc(htlc Htlc) {
  markFailedInDatabase(htlc)
  
  incomingHtlc, ok := incomingHtlcMap[htlc]
  if !ok {
    log("Incoming HTLC has already been resolved")
    return
  }
  failHtlc(incomingHtlc)
  delete(incomingHtlcMap, htlc)
}

In this case, the preimage for the incoming HTLC was already sent upstream (step 3), so the corresponding entry in incomingHtlcMap has already been removed. Thus LND catches the “double resolution” and returns from failBackHtlc without sending the incorrect failure message upstream. Unfortunately, LND only catches the double resolution after H is marked as failed in the database. As a result, when LND next restarts it will reconstruct its state from the database and determine that H still needs to be failed back. If the incoming HTLC hasn’t been fully resolved with the upstream node, the reconstructed incomingHtlcMap will have an entry for H this time, and LND will incorrectly send a failure message upstream.

At that point, the downstream node will have claimed H via preimage while the upstream node will have had the HTLC refunded to them, causing LND to lose the full value of H.

Stealing HTLCs

Consider the following topology, where B is the victim and M0 and M1 are controlled by the attacker.

M0 -- B -- M1

The attacker can steal funds as follows:

  1. M0 routes a large HTLC along the path M0 -> B -> M1.
  2. M0 goes offline.
  3. M1 claims the HTLC from B by revealing the preimage, receives a new commitment signature from B, and then refuses to revoke the previous commitment.
  4. B attempts to claim the upstream HTLC from M0 but can’t because M0 is offline.
  5. M1 force closes the B-M1 channel using their new commitment, thus triggering the excessive failback bug.
  6. The attacker crashes B using an unpatched DoS vector.
  7. M0 comes back online.
  8. B restarts, loads HTLC resolution data from the database, and incorrectly fails the HTLC with M0.

At this point, the attacker has succeeded in stealing the HTLC from B. M0 got the HTLC refunded, while M1 got the value of the HTLC added to their balance on the confirmed commitment.

The Fix

The excessive failback bug was fixed by a small change to prevent failback of HTLCs for which the preimage is already known. The updated logic now explicitly checks for preimage availability before failing back each HTLC:

func failBackMissingHtlcs(confirmedCommit Commitment) {
  currentCommit, pendingCommit := getValidCounterpartyCommitments()

  var danglingHtlcs HtlcSet
  if confirmedCommit == pendingCommit {
    danglingHtlcs = currentCommit.Htlcs()
  } else {
    danglingHtlcs = pendingCommit.Htlcs()
  }

  confirmedHtlcs := confirmedCommit.Htlcs()
  missingHtlcs := danglingHtlcs.SetDifference(confirmedHtlcs)
  for _, htlc := range missingHtlcs {
    if preimageIsKnown(htlc.PaymentHash()) {
      continue  // Don't fail back HTLCs we can claim.
    }
    failBackHtlc(htlc)
  }
}

The preimageIsKnown check prevents failBackHtlc from being called when the preimage is known, so such HTLCs are never failed backward or marked as failed in the database. On restart, the incorrect failback behavior no longer occurs.

The patch was hidden in a massive rewrite of LND’s sweeper system and was released in LND 0.18.0.

Discovery

This vulnerability was discovered during an audit of LND’s contractcourt package, which handles on-chain resolution of force closures.

Timeline

  • 2024-03-20: Vulnerability reported to the LND security mailing list.
  • 2024-04-19: Fix merged.
  • 2024-05-30: LND 0.18.0 released containing the fix.
  • 2025-02-17: Gijs gives the OK to disclose publicly in March.
  • 2025-03-04: Public disclosure.

Prevention

It appears all other lightning implementations have independently discovered and handled the corner case that LND mishandled:

  • CLN added a preimage check to the failback logic in 2018.
  • eclair introduced failback logic in 2023 that filtered upstream HTLCs by preimage availability.
  • LDK added a preimage check to the failback logic in 2023.

Yet the BOLT specification has not been updated to describe this corner case. In fact, by a strict interpretation the specification actually requires the incorrect behavior that LND implemented:

## HTLC Output Handling: Remote Commitment, Local Offers

### Requirements

A local node:
  - for any committed HTLC that does NOT have an output in this commitment transaction:
    - once the commitment transaction has reached reasonable depth:
      - MUST fail the corresponding incoming HTLC (if any).

It is quite unfortunate that all implementations had to independently discover and correct this bug. If any single implementation had contributed a small patch to the specification after discovering the issue, it would have at least sparked some discussion about whether the other implementations had considered this corner case. And if CLN had recognized that the specification needed updating back in 2018, there’s a good chance all other implementations would have handled this case correctly from the start.

Takeaways

  • Keeping specifications up-to-date can improve security for all implementations.
  • Update to at least LND 0.18.0 to protect your funds.

LDK: Duplicate HTLC Force Close Griefing

LDK 0.1 and below are vulnerable to a griefing attack that causes all of the victim’s channels to be force closed. Update to LDK 0.1.1 to protect your channels.

Background

Whenever a new payment is routed through a lightning channel, or whenever an existing payment is settled on the channel, the parties in that channel need to update their commitment transactions to match the new set of active HTLCs. During the course of these regular commitment updates, there is always a brief moment where one of the parties holds two valid commitment transactions. Normally that party immediately revokes the older commitment transaction after it receives a signature for the new one, bringing their number of valid commitment transactions back down to one. But for that brief moment, the other party in the channel must be able to handle the case where either of the valid commitments confirms on chain.

For this reason, LDK contains logic to detect when there’s a difference between the counterparty’s confirmed commitment transaction and the set of currently outstanding HTLCs. Any HTLCs missing from the confirmed commitment transaction are considered unrecoverable and are immediately failed backward on the upstream channel, while all other HTLCs are left active until the resolution of the downstream HTLC on chain.

Because the same payment hash and amount can be used for multiple HTLCs (e.g., multi-part payments), some extra data is stored to match HTLCs on commitment transactions against the set of outstanding HTLCs. LDK calls this extra data the “HTLC source” data, and LDK maintains this data for both of the counterparty’s valid commitment transactions.

The Duplicate HTLC Failback Bug

Once a counterparty commitment transaction has been revoked, however, LDK forgets the HTLC source data for that commitment transaction to save memory. As a result, if a revoked commitment transaction later confirms, LDK must attempt to match commitment transaction HTLCs up to outstanding HTLCs using only payment hashes and amounts. LDK’s logic to do this matching works as follows:

for htlc, htlc_source in outstanding_htlcs:
  if !confirmed_commitment_tx.is_revoked() &&
      confirmed_commitment_tx.contains_source(htlc_source):
    continue
  if confirmed_commitment_tx.is_revoked() &&
      confirmed_commitment_tx.contains_htlc(htlc.payment_hash, htlc.amount):
    continue

  failback_upstream_htlc(htlc_source)

Note that this logic short-circuits whenever an outstanding HTLC matches the payment hash and amount of an HTLC on the revoked commitment transaction. Thus if there are multiple outstanding HTLCs with the same payment hash and amount, a single HTLC on the revoked commitment transaction can prevent all of the duplicate outstanding HTLCs from being failed back immediately.

Those duplicate HTLCs remain outstanding until corresponding downstream HTLCs are resolved on chain. Except, in this case there’s only one downstream HTLC to resolve on chain, and its resolution only triggers one of the duplicate HTLCs to be failed upstream. All the other duplicate HTLCs are left outstanding indefinitely.
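
To make the short-circuit concrete, here is a toy Go reproduction of the matching check (not LDK's actual code), with three outstanding HTLCs sharing a payment hash and amount:

package main

import "fmt"

// A toy reproduction of the matching logic above: three outstanding HTLCs
// share the same payment hash and amount, while the confirmed revoked
// commitment contains only one HTLC with that hash and amount. Because the
// revoked-commitment check matches on (hash, amount) alone, every duplicate
// matches the single on-chain HTLC and none of them is failed back
// immediately.
type htlc struct {
    paymentHash string
    amountMsat  uint64
}

func main() {
    outstanding := []htlc{
        {"deadbeef", 100_000},
        {"deadbeef", 100_000},
        {"deadbeef", 100_000},
    }
    revokedCommitmentHtlcs := []htlc{{"deadbeef", 100_000}}

    failedBack := 0
    for _, o := range outstanding {
        matched := false
        for _, c := range revokedCommitmentHtlcs {
            if c.paymentHash == o.paymentHash && c.amountMsat == o.amountMsat {
                matched = true
                break
            }
        }
        if matched {
            continue // left outstanding, awaiting on-chain resolution
        }
        failedBack++
    }

    fmt.Printf("failed back %d of %d duplicate HTLCs\n", failedBack, len(outstanding))
    // Prints: failed back 0 of 3 duplicate HTLCs
}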

Force Close Griefing

Consider the following topology, where B is the victim and the A_[1..N] nodes are all the nodes that B has channels with. M_1 and M_2 are controlled by the attacker.

     -- A_1 --
    /         \
M_1 --  ...  -- B -- M_2
    \         /
     -- A_N --

The attacker routes N HTLCs from M_1 to M_2 using the same payment hash and amount for each, with each payment going through a different A node. M_2 then confirms a revoked commitment that contains only one of the N HTLCs. Due to the duplicate HTLC failback bug, only one of the routed HTLCs gets failed backwards, while the remaining N-1 HTLCs get stuck.

Finally, after upstream HTLCs expire, all the A nodes with stuck HTLCs force close their channels with B to reclaim the stuck HTLCs.

Attack Cost

The attacker must broadcast a revoked commitment transaction, thereby forfeiting their channel balance. But the size of the channel can be minimal, and the attacker can spend their balance down to the 1% reserve before executing the attack. As a result, the cost of the attack can be negligible compared to the damage caused.

The Fix

Starting in v0.1.1, LDK preemptively fails back HTLCs when their deadlines approach if the downstream channel has been force closed or is in the process of force closing. While the main purpose of this behavior is to prevent cascading force closures when mempool fee rates spike, it also has a nice side effect of ensuring that duplicate HTLCs always get failed back eventually after a revoked commitment transaction confirms. As a result, the duplicate HTLCs are never stuck long enough that the upstream nodes need to force close to reclaim them.
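
A rough sketch of that behavior, with illustrative names and an assumed deadline buffer (not LDK's actual code or constants):

package main

import "fmt"

type htlc struct {
    paymentHash  string
    expiryHeight uint32
}

// preemptiveFailbacks sketches the v0.1.1 behavior described above: once the
// downstream channel is force closing, any HTLC whose deadline is within the
// buffer is failed back upstream rather than waiting for downstream on-chain
// resolution.
func preemptiveFailbacks(outstanding []htlc, currentHeight uint32,
    downstreamForceClosing bool, buffer uint32) []htlc {

    if !downstreamForceClosing {
        return nil
    }
    var toFail []htlc
    for _, h := range outstanding {
        if h.expiryHeight <= currentHeight+buffer {
            toFail = append(toFail, h)
        }
    }
    return toFail
}

func main() {
    // Two duplicate HTLCs, one of which would otherwise be stuck forever.
    outstanding := []htlc{
        {"deadbeef", 840050},
        {"deadbeef", 840050},
    }
    toFail := preemptiveFailbacks(outstanding, 840045, true, 12)
    fmt.Printf("preemptively failing back %d HTLCs\n", len(toFail))
}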

Discovery

This vulnerability was discovered during an audit of LDK’s chain module.

Timeline

  • 2024-12-07: Vulnerability reported to the LDK security mailing list.
  • 2025-01-27: Fix merged.
  • 2025-01-28: LDK 0.1.1 released containing the fix, with public disclosure in release notes.
  • 2025-01-29: Detailed description of vulnerability published.

Prevention

Prior to the introduction of the duplicate HTLC failback bug in 2022, LDK would immediately fail back all outstanding HTLCs once a revoked commitment reached 6 confirmations. This was the safe and conservative thing to do – HTLC source information was missing, so proper matching of HTLCs could not be done. And since all outputs on the revoked commitment and HTLC transactions could be claimed via revocation key, there was no concern about losing funds if the downstream counterparty confirmed an HTLC claim before LDK could.

Better Documentation

Considering that LDK previously had a test explicitly checking for the original (conservative) failback behavior, it does appear that the behavior was understood and intentional. Unfortunately, the original author did not document the reasoning anywhere in the code or test.

A single comment in the code would likely have been enough to prevent later contributors from introducing the buggy behavior:

// We fail back *all* outstanding HTLCs when a revoked commitment
// confirms because we don't have HTLC source information for revoked
// commitments, and attempting to match up HTLCs based on payment hashes
// and amounts is inherently unreliable.
//
// Failing back all HTLCs after a 6 block delay is safe in this case
// since we can use the revocation key to reliably claim all funds in the
// downstream channel and therefore won't lose funds overall.

Takeaways

  • Code documentation matters for preventing bugs.
  • Update to LDK 0.1.1 for the vulnerability fix.

LDK: Invalid Claims Liquidity Griefing

LDK 0.0.125 and below are vulnerable to a liquidity griefing attack against anchor channels. The attack locks up funds such that they can only be recovered by manually constructing and broadcasting a valid claim transaction. Affected users can unlock their funds by upgrading to LDK 0.1 and replaying the sequence of commitment and HTLC transactions that led to the lock up.

Background

When a channel is force closed, LDK creates and broadcasts transactions to claim any HTLCs it can from the commitment transaction that confirmed on chain. To save on fees, some HTLC claims are aggregated and broadcast together in the same transaction.

If the channel counterparty is able to get a competing HTLC claim confirmed first, it can cause one of LDK’s aggregated transactions to become invalid, since the corresponding HTLC input has already been spent by the counterparty’s claim. LDK contains logic to detect this scenario and remove the already-claimed input from its aggregated claim transaction. When everything works correctly, the aggregated transaction becomes valid again and LDK is able to claim the remaining HTLCs.

The Invalid Claims Bug

Prior to LDK 0.1, the logic to detect conflicting claims works like this:

for confirmed_transaction in confirmed_block:
  for input in confirmed_transaction:
    if claimable_outpoints.contains(input.prevout):
      agg_tx = get_aggregated_transaction_from_outpoint(input.prevout)
      agg_tx.remove_matching_inputs(confirmed_transaction)
      break  # This is the bug.

Note that this logic stops processing a confirmed transaction after finding the first aggregated transaction that conflicts with it. If the confirmed transaction conflicts with multiple aggregated transactions, conflicting inputs are only removed from the first matching aggregated transaction, and any other conflicting aggregated transactions are left invalid.

Any HTLCs claimed by invalid aggregated transactions get locked up and can only be recovered by manually constructing and broadcasting valid claim transactions.

Liquidity Griefing

Prior to LDK 0.1, there are only two types of HTLC claims that are aggregated:

  • HTLC preimage claims
  • revoked commitment HTLC claims

For HTLC preimage claims, LDK takes care to confirm them before their HTLCs time out, so there’s no reliable way for an attacker to confirm a conflicting timeout claim and trigger the invalid claims bug.

For revoked commitment transactions, however, an attacker can immediately spend any incoming HTLC outputs via HTLC-Success transactions. Although LDK is then able to claim the HTLC-Success outputs via the revocation key, the attacker can exploit the invalid claims bug to lock up any remaining HTLCs on the revoked commitment transaction.

Setup

The attacker opens an anchor channel with the victim, creating a network topology as follows:

A -- B -- M

In this case B is the victim LDK node and M is the node controlled by the attacker. The attacker must use an anchor channel so that they can spend multiple HTLC claims in the same transaction and trigger the invalid claims bug.

The attacker then routes HTLCs along the path A->B->M as follows:

  1. 1 small HTLC with CLTV of X
  2. 1 small HTLC with CLTV of X+1
  3. 1 large HTLC with CLTV of X+1 (this is the one the attacker will lock up)

The attacker knows preimages for all HTLCs but withholds them for now.

To complete the setup, the attacker routes some other HTLC through the channel, causing the commitment transaction with the above HTLCs to be revoked.

Forcing Multiple Aggregations

Next the attacker waits until block X-13 and force closes the B-M channel using their revoked commitment transaction, being sure to get it confirmed in block X-12. By confirming in this specific block, the attacker can exploit LDK’s buggy aggregation logic prior to v0.1 (see below), causing LDK to aggregate HTLC justice claims as follows:

  • Transaction 1: HTLC 1
  • Transaction 2: HTLCs 2 and 3

Buggy Aggregation Logic

Prior to v0.1, LDK only aggregates HTLC claims if their timeouts are more than 12 blocks in the future. Presumably 12 blocks was deemed “too soon” to guarantee that LDK can confirm preimage claims before the HTLCs time out; once one HTLC times out, the counterparty can pin a competing timeout claim in mempools, preventing confirmation of all the aggregated preimage claims. In other words, by claiming HTLCs separately in this scenario, LDK limits the damage the counterparty could do if one of those HTLCs expires before LDK successfully claims it.

Unfortunately, this aggregation strategy makes no sense when LDK is trying to group justice claims that the counterparty can spend immediately via HTLC-Success, since the timeout on those HTLCs does not apply to the counterparty. Nevertheless, prior to LDK 0.1, the same 12 block aggregation check applies equally to all justice claims, regardless of whether the counterparty can spend them immediately or must wait to spend via HTLC-Timeout.
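
Here is a sketch of that aggregation check (illustrative, not LDK's actual code), showing why confirming the revoked commitment in block X-12 splits the justice claims into two transactions:

package main

import "fmt"

// aggregatable sketches the pre-0.1 aggregation check described above: a
// claim is only batched with others if its HTLC timeout is more than 12
// blocks away, and the same check is applied to justice claims even though
// their timeouts don't constrain the counterparty.
func aggregatable(htlcExpiry, currentHeight uint32) bool {
    const aggregationBuffer = 12
    return htlcExpiry > currentHeight+aggregationBuffer
}

func main() {
    // Revoked commitment confirmed at height X-12: the HTLC with CLTV X is
    // exactly 12 blocks out and claimed alone, while the two HTLCs with
    // CLTV X+1 are batched together, producing the two transactions above.
    const x = 840100
    currentHeight := uint32(x - 12)
    fmt.Println("HTLC 1 (CLTV X) aggregated:  ", aggregatable(x, currentHeight))
    fmt.Println("HTLC 2 (CLTV X+1) aggregated:", aggregatable(x+1, currentHeight))
    fmt.Println("HTLC 3 (CLTV X+1) aggregated:", aggregatable(x+1, currentHeight))
}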

An attacker can exploit this buggy aggregation logic to make LDK create multiple claim transactions, as described above.

Locking Up Funds

Finally, the attacker broadcasts and confirms a transaction spending HTLCs 1 and 2 via HTLC-Success. The attacker’s transaction conflicts with both Transaction 1 and Transaction 2, but due to the invalid claims bug, LDK only notices the conflict with Transaction 1. LDK continues to fee bump and rebroadcast Transaction 2 indefinitely, even though it can never be mined.

As a result, the funds in HTLC 3 remain inaccessible until a valid claim transaction is manually constructed and broadcast.

Note that if the attacker ever tries to claim HTLC 3 via HTLC-Success, LDK is able to immediately recover it via the revocation key. So while the attacker can lock up HTLC 3, they cannot actually steal it once the upstream HTLC times out.

Attack Cost

When the attacker’s revoked commitment transaction confirms, LDK is able to immediately claim the attacker’s channel balance. LDK is also able to claim HTLCs 1 and 2 via the revocation key on the B-M channel, while also claiming them via the preimage on the upstream A-B channel.

Thus a smart attacker would minimize costs by spending their channel balance down to the 1% reserve before carrying out the attack and would then set the amounts of HTLCs 1 and 2 to just above the dust threshold. The attacker would also maximize the pain inflicted on the victim by setting HTLC 3 to the maximum allowed amount.

Stealing HTLCs in 0.1-beta

Beginning in v0.1-beta, LDK started aggregating HTLC timeout claims that have compatible locktimes. As a result, the beta release is vulnerable to a variant of the liquidity griefing attack that enables the attacker to steal funds. Thankfully the invalid claims bug was fixed between the 0.1-beta and 0.1 releases, so the final LDK 0.1 release is not vulnerable to this attack.

The fund-stealing variant for LDK 0.1-beta works as follows.

Setup

The attack setup is identical to the liquidity griefing attack, except that the attacker does not cause their commitment transaction to be revoked.

Forcing Multiple Aggregations

The attacker then force closes the B-M channel. Due to differing locktimes, LDK creates HTLC timeout claims as follows:

  • Transaction 1: HTLC 1 (locktime X)
  • Transaction 2: HTLCs 2 and 3 (locktime X+1)

Once height X is reached, LDK broadcasts Transaction 1. At height X+1, LDK broadcasts Transaction 2.

At this point, if Transaction 1 confirmed immediately in block X+1, the attack fails since the attacker can no longer spend HTLCs 1 and 2 together in the same transaction. But if Transaction 1 did not confirm immediately (which is more likely), the attack can continue.

Stealing Funds

The attacker broadcasts and confirms a transaction spending HTLCs 1 and 2 via HTLC-Success. This transaction conflicts with both Transaction 1 and Transaction 2, but due to the invalid claims bug, LDK only notices the conflict with Transaction 1. LDK continues to fee bump and rebroadcast Transaction 2 indefinitely, even though it can never be mined.

Once HTLC 3’s upstream timeout expires, node A force closes and claims a refund, leaving the coast clear for the attacker to claim the downstream HTLC via preimage.

The Fix

The invalid claims bug was fixed by a one-line patch just prior to the LDK 0.1 release.

Discovery

This vulnerability was discovered during an audit of LDK’s chain module.

Timeline

  • 2024-12-23: Vulnerability reported to the LDK security mailing list.
  • 2025-01-15: Fix merged.
  • 2025-01-16: LDK 0.1 released containing the fix, with public disclosure in release notes.
  • 2025-01-23: Detailed description of vulnerability published.

Prevention

The invalid claims bug is fundamentally a problem of incorrect control flow – a break statement was inserted into a loop where it shouldn’t have been. Why wasn’t it caught during initial code review, and why wasn’t it noticed for years after that?

The break statement was introduced back in 2019, long before LDK supported anchor channels. The code was actually correct back then, because before anchor channels there was no way for the counterparty to construct a transaction that conflicted with two of LDK’s aggregated transactions. But even after LDK 0.0.116 added support for anchor channels, the bug went unnoticed for over two years, despite multiple changes being made to the surrounding code in that time frame.

It’s impossible to say exactly what kept the bug hidden, but I think the complexity and unreadability of the surrounding code was a likely contributor. Here’s the for-loop containing the buggy code:

let mut bump_candidates = new_hash_map();
if !txn_matched.is_empty() { maybe_log_intro(); }
for tx in txn_matched {
    // Scan all input to verify is one of the outpoint spent is of interest for us
    let mut claimed_outputs_material = Vec::new();
    for inp in &tx.input {
        if let Some((claim_id, _)) = self.claimable_outpoints.get(&inp.previous_output) {
            // If outpoint has claim request pending on it...
            if let Some(request) = self.pending_claim_requests.get_mut(claim_id) {
                //... we need to check if the pending claim was for a subset of the outputs
                // spent by the confirmed transaction. If so, we can drop the pending claim
                // after ANTI_REORG_DELAY blocks, otherwise we need to split it and retry
                // claiming the remaining outputs.
                let mut is_claim_subset_of_tx = true;
                let mut tx_inputs = tx.input.iter().map(|input| &input.previous_output).collect::<Vec<_>>();
                tx_inputs.sort_unstable();
                for request_input in request.outpoints() {
                    if tx_inputs.binary_search(&request_input).is_err() {
                        is_claim_subset_of_tx = false;
                        break;
                    }
                }

                macro_rules! clean_claim_request_after_safety_delay {
                    () => {
                        let entry = OnchainEventEntry {
                            txid: tx.compute_txid(),
                            height: conf_height,
                            block_hash: Some(conf_hash),
                            event: OnchainEvent::Claim { claim_id: *claim_id }
                        };
                        if !self.onchain_events_awaiting_threshold_conf.contains(&entry) {
                            self.onchain_events_awaiting_threshold_conf.push(entry);
                        }
                    }
                }

                // If this is our transaction (or our counterparty spent all the outputs
                // before we could anyway with same inputs order than us), wait for
                // ANTI_REORG_DELAY and clean the RBF tracking map.
                if is_claim_subset_of_tx {
                    clean_claim_request_after_safety_delay!();
                } else { // If false, generate new claim request with update outpoint set
                    let mut at_least_one_drop = false;
                    for input in tx.input.iter() {
                        if let Some(package) = request.split_package(&input.previous_output) {
                            claimed_outputs_material.push(package);
                            at_least_one_drop = true;
                        }
                        // If there are no outpoints left to claim in this request, drop it entirely after ANTI_REORG_DELAY.
                        if request.outpoints().is_empty() {
                            clean_claim_request_after_safety_delay!();
                        }
                    }
                    //TODO: recompute soonest_timelock to avoid wasting a bit on fees
                    if at_least_one_drop {
                        bump_candidates.insert(*claim_id, request.clone());
                        // If we have any pending claim events for the request being updated
                        // that have yet to be consumed, we'll remove them since they will
                        // end up producing an invalid transaction by double spending
                        // input(s) that already have a confirmed spend. If such spend is
                        // reorged out of the chain, then we'll attempt to re-spend the
                        // inputs once we see it.
                        #[cfg(debug_assertions)] {
                            let existing = self.pending_claim_events.iter()
                                .filter(|entry| entry.0 == *claim_id).count();
                            assert!(existing == 0 || existing == 1);
                        }
                        self.pending_claim_events.retain(|entry| entry.0 != *claim_id);
                    }
                }
                break; //No need to iterate further, either tx is our or their
            } else {
                panic!("Inconsistencies between pending_claim_requests map and claimable_outpoints map");
            }
        }
    }
    for package in claimed_outputs_material.drain(..) {
        let entry = OnchainEventEntry {
            txid: tx.compute_txid(),
            height: conf_height,
            block_hash: Some(conf_hash),
            event: OnchainEvent::ContentiousOutpoint { package },
        };
        if !self.onchain_events_awaiting_threshold_conf.contains(&entry) {
            self.onchain_events_awaiting_threshold_conf.push(entry);
        }
    }
}

Perhaps others have a better mental parser than me, but I find this code quite difficult to read and understand. The loop is so long, with so much nesting and so many low-level implementation details that by the time I get to the buggy break statement, I’ve completely forgotten what loop it applies to. And since the comment attached to the break statement gives a believable explanation, it’s easy to gloss right over it.

Perhaps the buggy control flow would be easier to spot if the loop was simpler and more compact. By hand-waving some helper functions into existence and refactoring, the same code could be written as follows:

maybe_log_intro();

let mut bump_candidates = new_hash_map();
for tx in txn_matched {
    for inp in &tx.input {
        if let Some(claim_request) = self.get_mut_claim_request_from_outpoint(inp.previous_output) {
            let split_requests = claim_request.split_off_matching_inputs(&tx.input);
            debug_assert!(!split_requests.is_empty());

            if claim_request.outpoints().is_empty() {
                // Request has been fully claimed.
                self.mark_request_claimed(claim_request, tx, conf_height, conf_hash);
                break;
            }

            // After removing conflicting inputs, there's still more to claim.  Add the modified
            // request to bump_candidates so it gets fee bumped and rebroadcast.
            self.remove_pending_claim_events(claim_request);
            bump_candidates.insert(claim_request.clone());

            self.mark_requests_contentious(split_requests, tx, conf_height, conf_hash);
            break;
        }
    }
}

The control flow in this version is much more apparent to the reader. And although there’s no guarantee that the buggy break statement would have been discovered sooner if the code had been written this way, I do think the odds would have been much better.

Takeaways

  • Code readability matters for preventing bugs.
  • Update to LDK 0.1 for the vulnerability fix.

DoS: LND Onion Bomb

LND versions prior to 0.17.0 are vulnerable to a DoS attack where malicious onion packets cause the node to instantly run out of memory (OOM) and crash. If you are running an LND release older than this, your funds are at risk! Update to at least 0.17.0 to protect your node.

Severity

It is critical that users update to at least LND 0.17.0 for several reasons.

  • The attack is cheap and easy to carry out and will keep the victim offline for as long as it lasts.
  • The source of the attack is concealed via onion routing. The attacker does not need to connect directly to the victim.
  • Prior to LND 0.17.0, all nodes are vulnerable. The fix was not backported to the LND 0.16.x series or earlier.

The Vulnerability

The Lightning Network uses onion routing to provide senders and receivers of payments some degree of privacy. Each node along a payment route receives an onion packet from the previous node, containing forwarding instructions for the next node on the route. The onion packet is encrypted by the initiator of the payment, so that each node can only read its own forwarding instructions.

Once a node has “peeled off” its layer of encryption from the onion packet, it can extract its forwarding instructions according to the format specified in the LN protocol:

Field Name   Size             Description
length       1-9 bytes        The length of the payload field, encoded as BigSize.
payload      length bytes     The forwarding instructions.
hmac         32 bytes         The HMAC to use for the forwarded onion packet.
next_onion   remaining bytes  The onion packet to forward.

Prior to LND 0.17.0, the code that extracts these instructions is essentially:

// Decode unpacks an encoded HopPayload from the passed reader into the
// target HopPayload.
func (hp *HopPayload) Decode(r io.Reader) error {
    bufReader := bufio.NewReader(r)

    var b [8]byte
    varInt, err := ReadVarInt(bufReader, &b)
    if err != nil {
        return err
    }

    payloadSize := uint32(varInt)

    // Now that we know the payload size, we'll create a new buffer to
    // read it out in full.
    hp.Payload = make([]byte, payloadSize)
    if _, err := io.ReadFull(bufReader, hp.Payload[:]); err != nil {
        return err
    }
    if _, err := io.ReadFull(bufReader, hp.HMAC[:]); err != nil {
        return err
    }

    return nil
}

Note the absence of a bounds check on payloadSize!

Regardless of the actual payload size, LND allocates memory for whatever length is encoded in the onion packet up to UINT32_MAX (4 GB).

The DoS Attack

It is trivial for an attacker to craft an onion packet that contains an encoded length of UINT32_MAX for the victim’s forwarding instructions. If the victim’s node has less than 4 GB of memory available, it will OOM crash instantly upon receiving the attacker’s packet.
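
To illustrate, here is a minimal sketch that feeds such an encoded length to the Decode function shown above, assuming the lightning-onion sphinx package used by the fuzz test later in this post:

package main

import (
    "bytes"
    "fmt"

    sphinx "github.com/lightningnetwork/lightning-onion"
)

func main() {
    // BigSize encoding of UINT32_MAX: the 0xfe prefix followed by the value
    // as four big-endian bytes. The payload claims to be ~4 GB long while
    // carrying no payload bytes at all.
    malicious := []byte{0xfe, 0xff, 0xff, 0xff, 0xff}

    var hopPayload sphinx.HopPayload
    // On vulnerable versions, Decode allocates ~4 GB before io.ReadFull
    // fails; on patched versions the oversized length is rejected up front.
    err := hopPayload.Decode(bytes.NewReader(malicious))
    fmt.Println("decode error:", err)
}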

However, if the victim’s node has more than 4 GB of memory available, it is able to recover from the malicious packet. The victim’s node will temporarily allocate 4 GB, but the Go garbage collector will quickly reclaim that memory after decoding fails.

So nodes with more than 4 GB of RAM are safe, right?

Not quite. The attacker can send many malicious packets simultaneously. If the victim processes enough malicious packets before the garbage collector kicks in, an OOM will still occur. And since LND decodes onion packets in parallel, it is not difficult for an attacker to beat the garbage collector. In my experiments I was able to consistently crash nodes with up to 128 GB of RAM in just a few seconds.

The Fix

A bounds check on the encoded length field was concealed in a large refactoring commit and included in LND 0.17.0. The fixed code is essentially:

// Decode unpacks an encoded HopPayload from the passed reader into the
// target HopPayload.
func (hp *HopPayload) Decode(r io.Reader) error {
    bufReader := bufio.NewReader(r)

    payloadSize, err := tlvPayloadSize(bufReader)
    if err != nil {
        return err
    }

    // Now that we know the payload size, we'll create a new buffer to
    // read it out in full.
    hp.Payload = make([]byte, payloadSize)
    if _, err := io.ReadFull(bufReader, hp.Payload[:]); err != nil {
        return err
    }
    if _, err := io.ReadFull(bufReader, hp.HMAC[:]); err != nil {
        return err
    }

    return nil
}

// tlvPayloadSize uses the passed reader to extract the payload length
// encoded as a var-int.
func tlvPayloadSize(r io.Reader) (uint16, error) {
    var b [8]byte
    varInt, err := ReadVarInt(r, &b)
    if err != nil {
        return 0, err
    }

    if varInt > math.MaxUint16 {
        return 0, fmt.Errorf("payload size of %d is larger than the "+
            "maximum allowed size of %d", varInt, math.MaxUint16)
    }

    return uint16(varInt), nil
}

This new code reduces the maximum amount of memory LND will allocate when decoding an onion packet from 4 GB to 64 KB, which is enough to fully mitigate the DoS attack.

Discovery

A simple fuzz test for onion packet encoding and decoding revealed this vulnerability.

Timeline

  • 2023-06-20: Vulnerability discovered and disclosed to Lightning Labs.
  • 2023-08-23: Fix merged.
  • 2023-10-03: LND 0.17.0 released containing the fix.
  • 2024-05-16: Laolu gives the OK to disclose publicly once LND 0.18.0 is released and has some uptake.
  • 2024-05-30: LND 0.18.0 released.
  • 2024-06-18: Public disclosure.

Prevention

This vulnerability was found in less than a minute of fuzz testing. If basic fuzz tests had been written at the time the original onion decoding functions were introduced, the bug would have been caught before it was merged.

In general any function that processes untrusted inputs is a strong candidate for fuzz testing, and often these fuzz tests are easier to write than traditional unit tests. A minimal fuzz test that detects this particular vulnerability is exceedingly simple:

func FuzzHopPayload(f *testing.F) {
    f.Fuzz(func(t *testing.T, data []byte) {
        // Hop payloads larger than 1300 bytes violate the spec and never
        // reach the decoding step in practice.
        if len(data) > 1300 {
            return
        }

        var hopPayload sphinx.HopPayload
        hopPayload.Decode(bytes.NewReader(data))
    })
}
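
Because this test uses Go's built-in fuzzing support (the testing.F type), it can be run directly with go test -fuzz=FuzzHopPayload; no separate harness or seed corpus is required.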

Takeaways

  • Write fuzz tests for all APIs that consume untrusted inputs.
  • Update your LND nodes to at least 0.17.0.