Release pipelines should be boring

Release automation tends to get written in a hopeful mood. One pull request merges, one job runs, one tag gets created, everything lands cleanly. You assume only one thing happens at a time. The assumption isn't deliberate. It's just how you think when you're writing the happy path.

The trouble is that the happy path is a special case. As soon as a repository has a steady stream of automated dependency updates, several pull requests can merge within minutes of each other, and each merge fires its own release job. The jobs start from roughly the same point and then race each other to write back. We had a shared GitHub Actions template running releases across a set of repositories, and it turned out to be hiding three separate race conditions, each at a different stage of the same job.

Two jobs creating the same tag

The first race is at tag creation. Two jobs start at almost the same moment, both read the current version, both compute the next one, and both try to create the same version tag. One wins. The other fails with a tag conflict.
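
To make the collision concrete, here is a minimal sketch of the version calculation. The helper name next_patch and the tag v1.4.2 are made up for illustration, and the helper assumes plain major.minor.patch tags with no pre-release suffix. It is not the template's actual logic, only a way to show why two jobs reading the same starting point arrive at the same tag.

```shell
# Hypothetical helper: compute the next patch version from the latest tag.
# Assumes a plain "vMAJOR.MINOR.PATCH" (or "MAJOR.MINOR.PATCH") input.
next_patch() {
  version="${1#v}"        # strip an optional leading "v"
  major="${version%%.*}"  # text before the first dot
  rest="${version#*.}"    # text after the first dot
  minor="${rest%%.*}"
  patch="${rest#*.}"
  echo "v${major}.${minor}.$((patch + 1))"
}

# Both jobs read the same latest tag before either has written anything.
latest="v1.4.2"
job_a_tag="$(next_patch "$latest")"
job_b_tag="$(next_patch "$latest")"
echo "job A wants ${job_a_tag}, job B wants ${job_b_tag}"
```

Nothing in the calculation is wrong. Each job is correct in isolation, and the conflict only exists because both ran against the same snapshot.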

The fix is a concurrency group on the workflow. GitHub Actions supports this natively: you name a group, and when a new run joins a group that already has a run in flight, either the new run waits or the in-flight run is cancelled to make way for it. For releases you want the new run to wait. You're serialising, not skipping.

concurrency:
  group: release-${{ github.ref }}
  cancel-in-progress: false

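With that in place, two release jobs on the same ref queue instead of colliding. The second runs cleanly once the first is done. One caveat: a group holds at most one pending run. If a third run arrives while one is running and one is waiting, GitHub cancels the older waiting run in favour of the newest, so a burst of merges can produce fewer release runs than merges.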

Two jobs pushing to the same branch

Serialising tag creation doesn't cover the last step: pushing the version-bump commit back to the branch. A job checks out the repo, bumps the version, commits, and pushes. If anything else pushed to the branch in the gap between checkout and push, the working copy is now one or more commits behind, and git correctly refuses the non-fast-forward push.

There are two halves to fixing this. The first is to sync with the remote at the last possible moment before writing, so the bump lands on a current view of the branch:

- name: Sync with remote before versioning
  run: git pull --rebase origin ${{ github.ref_name }}

- name: Bump version
  run: npm version patch --no-git-tag-version

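The ordering matters. Pull and rebase before the bump, so you're rebasing onto current remote state rather than dragging your version commit over changes that might conflict with it. There is a mechanical reason too: with --no-git-tag-version the bump leaves the manifest modified but uncommitted, and git pull --rebase refuses to run on a working tree with unstaged changes unless you ask it to autostash.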

The second half is to make the push itself resilient. Even with a pre-push sync, two jobs can reach the push inside the same narrow window. So the push step retries: on a non-fast-forward rejection it pulls, rebases the bump onto the new tip, and tries again, bounded so it fails loudly instead of looping forever.

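- name: Push Changes
  run: |
    # Bounded retry: at most three push attempts.
    for attempt in 1 2 3; do
      # Stop as soon as a push succeeds.
      if git push origin "HEAD:${{ github.ref_name }}"; then
        break
      fi
      if [ "$attempt" -lt 3 ]; then
        # The remote moved: rebase the bump onto the new tip, then retry.
        echo "Push failed, pulling and retrying..."
        git pull --rebase origin "${{ github.ref_name }}"
      else
        # Out of attempts: fail the step loudly.
        echo "Push failed after $attempt attempts"
        exit 1
      fi
    done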

Note that the loop recovers after a failed attempt rather than pulling before every one. In the common case the first push succeeds and no extra fetch or rebase runs at all. The rebase, with its own risk of hitting a conflict, only happens when the remote has actually moved.
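
The same fail-first, recover-after shape can be pulled out into a small function, which makes it easier to test away from a real remote. The sketch below is one possible way to do that and is not part of the original template. The names retry_with_recovery, push_branch, sync_branch and the BRANCH variable are all hypothetical.

```shell
# Hypothetical helper: run an action up to $1 times. The recovery command
# runs only after a failed attempt, never before the first one and never
# after the last.
retry_with_recovery() {
  max_attempts="$1"
  action="$2"
  recover="$3"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$action"; then
      return 0
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      echo "Attempt ${attempt} failed, recovering before retry..." >&2
      # If recovery itself fails (for example a rebase conflict), stop.
      "$recover" || return 1
    fi
    attempt=$((attempt + 1))
  done
  echo "Giving up after ${max_attempts} attempts" >&2
  return 1
}

# Assumed wiring for the push step; BRANCH is a placeholder variable that
# the workflow would have to set, for example from github.ref_name.
push_branch() { git push origin "HEAD:${BRANCH}"; }
sync_branch() { git pull --rebase origin "${BRANCH}"; }
# Usage: retry_with_recovery 3 push_branch sync_branch
```

Passing the action and the recovery as function names keeps the retry logic independent of git, so it can be exercised with stub functions.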

Cleaning up after a race that already happened

By the time the fixes land, a race can already have left a mess. A job writes a version-bump commit locally but fails to push it, so the version recorded in the package manifest is a step ahead of the tags actually published. Recovering means going through the CI logs to find the run that partially completed, working out what it left behind, and replaying that specific bump cleanly against the current branch. The archaeology takes longer than the fix.
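
A cheap check can at least surface this kind of drift before anyone has to dig through logs. The sketch below is an assumption about how such a check might look, not something the template shipped with. The helper name version_drift is hypothetical, and it takes both versions as plain strings so it has no dependency on a particular repository layout.

```shell
# Hypothetical helper: compare the manifest version with the latest
# published tag. A leading "v" on either side is ignored.
version_drift() {
  manifest_version="${1#v}"
  tag_version="${2#v}"
  if [ "$manifest_version" = "$tag_version" ]; then
    echo "in-sync"
  else
    echo "drift: manifest=${manifest_version} tag=${tag_version}"
    return 1
  fi
}

# Assumed usage inside a repository with Node.js available:
#   manifest="$(node -p "require('./package.json').version")"
#   latest="$(git describe --tags --abbrev=0)"
#   version_drift "$manifest" "$latest"
```

Run as an early step, a non-zero exit here turns a silent mismatch into a visible failure with both values printed side by side.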

That's the real cost of partial failures in automation. The failure itself is cheap. Reconstructing the state it leaves behind is what costs you. So you make release steps idempotent where you can: a step that's safe to re-run without doubling its effects turns a tense recovery into a re-run.
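
As one illustration of what idempotent can mean here, a tagging step can check for the tag before creating it. This is a minimal sketch under the assumption that tags are lightweight and the check is against the local clone. The helper name ensure_tag is hypothetical.

```shell
# Hypothetical helper: create a tag only if it does not already exist, so
# re-running the step after a partial failure has no extra effect.
ensure_tag() {
  tag="$1"
  if git rev-parse -q --verify "refs/tags/${tag}" >/dev/null; then
    echo "Tag ${tag} already exists, skipping"
    return 0
  fi
  git tag "${tag}"
  echo "Created tag ${tag}"
}
```

Because the check only sees tags present in the local clone, a CI job would likely need to fetch tags first (for example with git fetch --tags) for the guard to reflect what has actually been published.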

Why a flaky pipeline is worse than it looks

A release job that fails about half the time sits in an awkward blind spot. It's not bad enough to block anyone. You retry, it passes, you move on. The workaround is cheaper than the fix, so the failure gets normalised, and people stop reading it as a signal and start treating it as weather.

The cumulative cost is real, and it isn't only the wasted retries. An unreliable pipeline changes how people work. If every release risks a babysitting session, the rational move is to batch changes up, which is the opposite of the small, frequent pull requests that make review easier and rollbacks cheaper. A pipeline that runs quietly on every merge takes that disincentive away, and goes back to being something nobody has to think about.

Doing this in a shared template multiplies the payoff. The fixes land once, and every repository that inherits the template gets them without anyone rediscovering the problem repo by repo, one retry button at a time.