Share via

Distribution Cleanup Job Failing Silently in SQL Server 2019 Transactional Replication Causing Distribution DB Growth to 1.5 TB

Parag Nimbhore 5 Reputation points
2026-05-16T08:14:26.1766667+00:00

Hello All,

We are facing an issue with SQL Server Transactional Replication where the Distribution Cleanup job has been failing silently, and the distribution database has grown to approximately 1.5 TB, causing severe disk space issues.

Environment

  • SQL Server Transactional Push Replication
  • Dedicated Distribution database
  • Cleanup job: Distribution Cleanup: distribution

Distribution Server

Microsoft SQL Server 2019 (RTM-CU32) (KB5054833)

15.0.4430.1 (X64)

Standard Edition

Windows Server 2019 Standard (Build 17763)

Hypervisor-based VM

Publisher Server

Microsoft SQL Server 2019 (RTM-CU32-GDR) (KB5084816)

15.0.4465.1 (X64)

Standard Edition

Windows Server 2019 Standard (Build 17763)

Hypervisor-based VM

Issue Details

The replication cleanup job appears to hang indefinitely without throwing any explicit errors.

When we manually execute or monitor the cleanup job, it remains stuck for a long time on the following internal statement:

fetch hCdirs into @dir, @xact_seqno, @command_id, @type

No meaningful cleanup occurs during this time, and rows continue accumulating in replication system tables such as:

  • MSrepl_commands
  • MSrepl_transactions

As a result, the distribution database has continued to grow and is now consuming nearly all available disk space.

Troubleshooting Already Performed

  • Checked for blocking sessions and long-running queries
  • No blocking chains or long-running transactions were identified
  • Replication agents appear to be running normally
  • Cleanup job remains active for an extended duration without reducing data volume

Observed Behavior

  • Cleanup job remains in a “Running” state for hours/days
  • No significant reduction in row count from replication metadata tables
  • Distribution database size has reached approximately 1.5 TB
  • Disk space exhaustion is now impacting server stability and replication maintenance

Assistance Required

We would appreciate guidance on the following:

  1. What could cause the Distribution Cleanup job to hang on the above cursor fetch statement even when there is no blocking or long-running activity?
  2. Are there known issues related to:
  • Replication metadata volume
  • Retention settings
  • Distribution database performance
    • Replication cleanup internals that could cause this behavior?
  1. What is the recommended and safest way to aggressively clean up the distribution database in this scenario?
  2. Are there Microsoft-recommended approaches for:
  • Reducing the size of MSrepl_commands and MSrepl_transactions
  • Identifying why cleanup is not progressing
  • Validating replication health before cleanup
    • Recovering from excessive distribution database growth
  1. Would adjusting replication retention settings or manually invoking replication cleanup procedures help in this situation?

Any troubleshooting steps, DMV queries, known issues, or best practices would be greatly appreciated.

Thank you.

 

SQL Server | Other
SQL Server | Other

Additional SQL Server features and topics not covered by specific categories


Answer recommended by moderator

AI answer

2026-05-16T08:14:48.1233333+00:00

The Distribution Cleanup job removes replicated transactions from the distribution database and also deactivates subscriptions that have not synchronized within the maximum distribution retention period. If this job is not making progress, the distribution database will continue to grow because rows in MSrepl_commands and MSrepl_transactions are not being removed.

From the provided information, the following points are relevant:

  1. The Distribution Cleanup job is expected to:
    • Remove replicated transactions from the distribution database on a frequent schedule.
    • Deactivate subscriptions that have not synchronized within the maximum distribution retention period (@max_distretention).
    This behavior is documented for transactional replication and is driven by the retention settings configured when the Distributor and publications were created.
  2. Subscription expiration and deactivation are controlled by:
    • The maximum distribution retention period (@max_distretention on the distribution database).
    • The publication retention period (@retention on each publication).
    If a subscription is not synchronized within the maximum distribution retention period and there are undistributed commands, the Expired Subscription Clean Up job on the Distributor deactivates the subscription and requires reinitialization. If a subscription is not synchronized within the publication retention period, the Expired Subscription Clean Up job on the Publisher drops the subscription and it must be recreated. If retention values are set very high, or if subscriptions are never allowed to expire, the Distribution Cleanup job will have a very large volume of metadata to evaluate and delete, which can cause it to run for a long time and appear to hang.
  3. Replication maintenance jobs and their roles:
    • Distribution Clean Up: Distribution
      • Removes replicated transactions from the distribution database.
    • Expired Subscription Clean Up
      • Detects and removes expired subscriptions from publication databases.
      • On the Distributor, deactivates subscriptions that were not synchronized within the maximum distribution retention period.
    Both jobs must be running successfully and on an appropriate schedule to keep the distribution database from growing excessively.
  4. To assess whether reinitialization or more aggressive cleanup is appropriate, use replication monitoring tools:
    • Use sp_replmonitorsubscriptionpendingcmds or the Undistributed Commands tab in Replication Monitor to determine:
      • How many commands are still pending in the distribution database for each subscription.
      • The estimated time to deliver those commands.
    • If the estimated time to deliver pending commands is greater than the time required to generate and apply a new snapshot, consider reinitializing the affected subscriptions. Reinitialization will allow the Distribution Agent to apply a fresh snapshot and then resume with a smaller backlog of commands.
  5. Recommended and safer approaches to reduce distribution database size and recover from excessive growth, based on the documented behavior:
    • Verify and, if necessary, adjust retention settings:
      • Ensure @max_distretention (maximum distribution retention) is not set excessively high.
      • Ensure publication @retention values are appropriate for business requirements.
    • Ensure the following jobs are enabled and completing successfully:
      • Distribution Clean Up: Distribution (runs every 10 minutes by default).
      • Expired Subscription Clean Up (runs daily at 1:00 A.M. by default).
    • Use Replication Monitor and sp_replmonitorsubscriptionpendingcmds to:
      • Identify subscriptions with very large numbers of undistributed commands.
      • Decide whether to reinitialize those subscriptions instead of allowing the Distribution Agent to process an extremely large backlog.
    • For subscriptions that are no longer needed or are far behind and not required, allow them to expire or drop them so that the cleanup jobs can remove their associated metadata.
  6. Validating replication health before aggressive cleanup:
    • Confirm that Distribution Agents are running and successfully applying commands to Subscribers.
    • Confirm that there are no subscriptions that must retain historical commands for business or recovery reasons.
    • Use Replication Monitor to verify that publications show a healthy status (green) and that undistributed commands are not increasing unexpectedly.
  7. Regarding manual invocation of cleanup procedures and aggressive cleanup:
    • The documented and supported way to clean up the distribution database is to rely on the built-in maintenance jobs (Distribution Clean Up: Distribution and Expired Subscription Clean Up) with appropriate retention settings.
    • Before making retention more aggressive (shorter periods), ensure that all required Subscribers are synchronizing within the new window; otherwise, subscriptions may be deactivated or expire and require reinitialization.

In summary, the core levers available are:

  • Correct and possibly reduce retention settings (@max_distretention and publication @retention).
  • Ensure the Distribution Cleanup and Expired Subscription Cleanup jobs are running and completing.
  • Use sp_replmonitorsubscriptionpendingcmds and Replication Monitor to identify large backlogs and decide where reinitialization is preferable to processing a huge volume of old commands.
  • Remove or allow expiration of obsolete or non-synchronizing subscriptions so that their metadata can be cleaned up.

References:

AI-generated content may be incorrect. Read our transparency notes for more information.

Was this answer helpful?

1 additional answer

Sort by: Most helpful
  1. Erland Sommarskog 134.4K Reputation points MVP Volunteer Moderator
    2026-05-16T09:14:50.0433333+00:00

    First a disclaimer: While I'm generally considered an expert on SQL Server, I don't work a whole lot with replication. But since you have done a great job with identifying the troublesome statement, I was able to find the procedure where it appears. Although, here I need to add a second disclaimer: I only have a setup with replication on SQL 2022, and you are on SQL 2019. But since replication is a mature technology, I would not expect any significant differences.

    I like to start with some more troubleshooting steps, before I present a possible solution, since that solution is, well, you'll find out.

    When you have a loop like this and a statement appears to be stuck, there are two possibilities:

    1. The statement is truly stuck and is not going anywhere.
    2. The statement is running very slow, so every time you check current activity, you see this statement. (This cursor loop is not doing any actual cleanup, but only seems to be a preparatory step, so even if it is moving forward, you would not see any space reduction.)

    Unfortunately, there is no DMV column which gives us the time when the current statement started. But you could use sp_sqltrace, a free tool written by Lee Tudor and which I host on my web site. Find out the spid for the cleanup process and say:

    EXEC sp_sqltrace, nn, @snoop_time = 60
    

    The trace will then run for 60 seconds and then aggregate the statements. If nothing is printed, the FETCH runs for at least 60 seconds, which is a long time. Rather than using sp_sqltrace, you can use Profiler or Extended Events directly. I just find sp_sqltrace to be very convenient for this type of investigations.

    Another troubleshooting tool I have on my web site is beta_lockinfo. This tool gives the current activity in the SQL Server instance and this includes the execution plans for the running statements. beta_lockinfo uses a DMV that gives you the actual values so far. Possibly this can help you to identify tables that may be in of update statistics or index rebuilds.

    The name of the procedure where this FETCH statement appears is sp_MSdelete_publisherdb_trans. It includes this piece of code:

    	if exists (select * from MSsubscriptions where
    			publisher_database_id = @publisher_database_id and
    			subscriber_id < 0)
    		select @has_immediate_sync = 1
    	else
    		select @has_immediate_sync = 0
    

    The cursor can then be set up in two different ways depending on @has_immeidate_sync. You should check by running the SELECT to see what applies in your case. I would guess on @has_immediate_sync = 1, because that cursor seems more complex.

    The cursor is set up as FAST_FORWARD. To be honest, I have never been able to understand this cursor type. Documentation says "like DYNAMIC, but with optimisations". A dynamic cursor is evaluated on every FETCH, which definitely can be slow. My preference is absolutely for STATIC which means that the query in the cursor is evaluated once for all and the result is saved to a work table and FETCH then retrieves rows from this work table. So if nothing else helps, you change the procedure by replacing FAST_FORWARD with STATIC.

    Obviously, changing code shipped my Microsoft comes with all sorts of caveats and it could render you unsupported. Then again, you are on a version of SQL Server which is out of mainstream support.

    It statistics update/index rebuild does not help and you don't want to change the procedure, I would recommend that you open a support case. But as I said, SQL 2019 is out of mainstream support.

    Was this answer helpful?


Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.