Reliability & Circuit Breakers

PerfLocale wraps every external dependency — machine-translation providers, webhook receivers, exchange-rate APIs, geo-IP services, AI quality scoring — in a circuit breaker. When a dependency starts failing, the breaker detects the pattern in seconds and stops piling retries onto the broken service. Subsequent calls return instantly with a typed exception so callers can route to a graceful-degradation path. Recovery is automatic: after a cooldown the breaker probes once; if the probe succeeds, normal traffic resumes.

This page is an operator's guide. The hooks reference covers every filter; this page covers when to use them, how to interpret what Site Health shows, and what to do when something stays broken.

How a breaker behaves

Each breaker is a tiny state machine with three states:

  • CLOSED — normal operation. Calls go through. Each failure increments a counter in a sliding window (default 5 minutes).
  • OPEN — the counter crossed the threshold (default 5 failures in 5 min). Calls short-circuit instantly with \PerfLocale\Concurrency\BreakerOpenException. After the cooldown (default 5 min), the breaker promotes itself to HALF_OPEN.
  • HALF_OPEN — the next call is allowed as a probe. Success closes the breaker; failure re-opens it for another cooldown cycle.

Authentication errors (HTTP 401/403, "invalid API key") trip the breaker on the first hit. There's no point retrying a rotated key — it will just keep failing. Rate-limit and transient errors (HTTP 429, 5xx, network timeouts) accumulate toward the normal threshold.
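
The three states and the auth fast-path can be sketched as a toy in-memory state machine. This is a minimal illustration under stated assumptions, not PerfLocale's implementation: the class name SketchBreaker and its method names are hypothetical, state is held in memory rather than in a transient, and timestamps are passed in explicitly so the behavior is easy to follow.

```php
<?php
// Toy in-memory breaker for illustration only; SketchBreaker is a hypothetical name.
final class SketchBreaker {
	private string $state = 'closed';
	/** @var int[] Unix timestamps of failures inside the sliding window. */
	private array $failures = array();
	private int $opened_at = 0;
	private int $threshold;
	private int $window;
	private int $cooldown;

	public function __construct( int $threshold = 5, int $window = 300, int $cooldown = 300 ) {
		$this->threshold = $threshold;
		$this->window    = $window;
		$this->cooldown  = $cooldown;
	}

	// Current state; an OPEN breaker becomes HALF_OPEN once the cooldown has elapsed.
	public function state( int $now ): string {
		if ( 'open' === $this->state && $now - $this->opened_at >= $this->cooldown ) {
			$this->state = 'half_open';
		}
		return $this->state;
	}

	// CLOSED and HALF_OPEN let the call through; OPEN short-circuits.
	public function allows_call( int $now ): bool {
		return 'open' !== $this->state( $now );
	}

	// A successful probe in HALF_OPEN closes the breaker.
	public function record_success( int $now ): void {
		if ( 'half_open' === $this->state( $now ) ) {
			$this->state    = 'closed';
			$this->failures = array();
		}
	}

	public function record_failure( int $now, bool $is_auth = false ): void {
		$state = $this->state( $now );
		if ( 'open' === $state ) {
			return; // Already tripped; nothing to count.
		}
		if ( 'half_open' === $state || $is_auth ) {
			$this->trip( $now ); // Failed probe or auth error: open immediately.
			return;
		}
		// Keep only failures still inside the window, then add this one.
		$this->failures   = array_values( array_filter(
			$this->failures,
			function ( int $t ) use ( $now ): bool {
				return $now - $t < $this->window;
			}
		) );
		$this->failures[] = $now;
		if ( count( $this->failures ) >= $this->threshold ) {
			$this->trip( $now );
		}
	}

	private function trip( int $now ): void {
		$this->state     = 'open';
		$this->opened_at = $now;
		$this->failures  = array();
	}
}
```

With the defaults, five transient failures inside five minutes open the sketch breaker, a single failure flagged as auth opens it at once, and the first success after the cooldown closes it again.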

Breakers that ship by default

Key | Wraps | Trip threshold
mt_<provider_id> | Each MT provider (DeepL, Google, Microsoft, LibreTranslate, WP AI Client, custom) | 1 auth error, OR 5 transient errors in 5 min
webhook_<uuid> | Each registered webhook receiver | 1 auth error, OR 5 transient errors in 5 min
fx_sync | Exchange-rate sync (WooCommerce multi-currency) | 5 empty-fetch failures
geo_<provider_id> | Each geo-IP provider (ipapi.co, ipinfo, ipinfo lite, ipstack, ip-api.com) | 3 failures in 5 min (more conservative because a missed lookup costs a wrong-language landing)
mt_quality_scoring | The hourly AI quality-scoring background job | 1 auth/rate_limit error on row 1, OR 5 transient errors in one tick

Geo-IP breakers use a 15-minute cooldown (the longest of the defaults) because the visitor-facing impact of a wrong-language redirect is more durable than a missed admin webhook.

Site Health visibility

Open breakers appear under Tools → Site Health in a card titled "PerfLocale circuit breakers". Three states:

  • Good (green) — no active breakers. System healthy.
  • Recommended (yellow) — one or more breakers are in HALF_OPEN, probing for recovery. No action needed; the breaker will close itself if the probe succeeds.
  • Critical (red) — one or more breakers are OPEN. The card lists each open breaker with its reason, cooldown countdown, and a one-click Reset now link.

The "Reset now" link does NOT wait for cooldown; it force-closes the breaker so the next call hits the upstream again. Use it after you've manually verified the upstream is healthy (rotated API key, fixed webhook receiver, etc.) and don't want to wait through the remaining cooldown. The link is capability-protected: only users with manage_options see it, and each request carries a per-key CSRF nonce.

Other Site Health checks

The breaker card above is the headline reliability signal, but four sibling tests run on every Site Health load:

perflocale_eager_link_map — Eager-link-map state
Reports the live byte size of the autoloaded perflocale_eager_links_post and perflocale_eager_links_term options and whether either has flipped to the too_large sentinel. Fails Recommended when a map exceeds the cap and link lookups fall back from alloptions to the per-key cascade. What to do: install a persistent object cache so the cascade hits L2, or raise the cap via perflocale/cache/eager_map_row_cap (the byte-size cap is perflocale/cache/eager_map_byte_cap).
perflocale_cron_schedule — Background cron schedule
Verifies the watchdog, daily GC, and lock-cleanup hooks are scheduled (or Action Scheduler is present). Fails Critical when hooks are missing or DISABLE_WP_CRON is defined without Action Scheduler available. What to do: re-activate the plugin to re-register schedules, or if DISABLE_WP_CRON is intentional ensure a system cron hits wp-cron.php on a regular interval.
perflocale_stuck_translations — Stuck translations
Counts translations sitting in in_progress or pending for more than 7 days. Fails Recommended when count is above zero — usually a crashed worker that left a row in flight before the lock-cleanup tier swept its lock. What to do: open PerfLocale Jobs admin and retry or mark failed; wp perflocale jobs list --status=running surfaces wedged job IDs.
perflocale_orphan_rows — Orphan translation rows
Counts translation_links rows referencing a wp_posts or wp_terms row that has been hard-deleted bypassing trash. Fails Recommended when count is above zero. What to do: orphans are harmless (lookup paths ignore them) and the daily GC tier sweeps them automatically; wp perflocale health-check --fix clears them on demand.

Cost note: stuck_translations and orphan_rows each cache their COUNT result in a 1-hour transient, so reloading Tools → Site Health (or polling from a monitoring agent) does not repeat the scan against the translations and translation_links tables. First load after transient expiry pays the count; the next 59 minutes are free.

Managing breakers from the command line

WP-CLI subcommands for ops automation and dashboards:

# Every currently-tracked breaker + its state
wp perflocale breakers list

# Detailed status of one breaker
wp perflocale breakers status mt_deepl

# Force-close one breaker after verifying the upstream
wp perflocale breakers reset mt_deepl

# Force-close every breaker (e.g. after a known-fixed outage)
wp perflocale breakers reset --all

# Machine-readable output for monitoring
wp perflocale breakers list --format=json | jq '.[] | select(.state=="open")'

The list command is safe to run on a per-minute cron from a monitoring host; it's a single transient/option read per breaker (typically < 5 ms for the whole call).
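
If the monitoring agent is itself written in PHP, the jq filter above can be mirrored in a few lines. This is a sketch that assumes only what the jq example implies: the JSON output is an array of objects, each with a state field. The function name sketch_open_breakers is hypothetical.

```php
<?php
// Hypothetical helper: keep only rows whose "state" is "open".
// Assumes the JSON is an array of objects, as the jq filter above implies.
function sketch_open_breakers( string $json ): array {
	$rows = json_decode( $json, true );
	if ( ! is_array( $rows ) ) {
		return array(); // Invalid or empty output: report nothing open.
	}
	return array_values( array_filter(
		$rows,
		static function ( $row ): bool {
			return is_array( $row ) && ( $row['state'] ?? '' ) === 'open';
		}
	) );
}
```

A monitoring script could pass the captured command output to this function and alert whenever the returned array is non-empty.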

Catching BreakerOpenException in your own code

When you call PerfLocale's MT service from your own code (a custom translation flow, a Gutenberg block, a cron handler), the breaker can interrupt the call by throwing \PerfLocale\Concurrency\BreakerOpenException. The exception is a typed companion to \RuntimeException — catch it specifically to route to a graceful fallback, instead of conflating with a genuine downstream error.

use PerfLocale\Concurrency\BreakerOpenException;

try {
	$translated = $mt_service->translate_text( $text, 'en', 'de', 'deepl' );
} catch ( BreakerOpenException $e ) {
	// Provider is currently in cooldown. The exception carries:
	//   - $e->get_breaker_key()       — e.g. "mt_deepl"
	//   - $e->getMessage()            — human-readable description (includes the retry-in seconds)
	//
	// Degrade gracefully: cached translation, translation memory, or
	// just return the source text unchanged.
	$translated = my_translation_memory_lookup( $text ) ?? $text;
} catch ( \RuntimeException $e ) {
	// Genuine downstream error (not a breaker pre-emption).
	// Log + skip; the breaker will catch the next one if it persists.
	error_log( '[my-plugin] translation failed: ' . $e->getMessage() );
	$translated = $text;
}

The same pattern works for any custom code calling through AbstractProvider::make_request().

When to tune cooldowns & thresholds

Defaults are sized for the median use case. You should tune when:

  • Your MT provider has unusually slow recovery (4+ hour outages are common during quota resets). Raise perflocale/breaker/cooldown_seconds/mt_<provider> to HOUR_IN_SECONDS or higher so you're not probing every 5 minutes during a sustained outage.
  • You run dozens of webhooks against the same receiver. The default 5-failure trip might bounce too often if the receiver flaps. Either consolidate the webhooks (one PerfLocale webhook, one fan-out endpoint on your side) or raise perflocale/breaker/threshold/webhook_<uuid>.
  • Your operator dashboard wants stricter breakers. Tighten perflocale/breaker/threshold/... to 2 or 3 so ops gets alerted faster, paired with a monitoring agent that watches wp perflocale breakers list.
  • You're debugging a flaky upstream and want the breaker out of the way temporarily. add_filter( 'perflocale/breaker/disabled', '__return_true' ) is the global kill-switch. Re-enable as soon as debugging is done.
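
As a configuration sketch, the first and third cases above might look like the following in a site-specific plugin or mu-plugin. The filter names follow the patterns given in this section; the assumption that each filter passes the current integer value as its first argument, and the use of the mt_deepl key with the threshold filter, should be checked against the hooks reference.

```php
<?php
// Assumption: each filter receives the current integer value and expects an integer back.

// Probe DeepL once an hour during a sustained outage instead of every 5 minutes.
add_filter(
	'perflocale/breaker/cooldown_seconds/mt_deepl',
	static function ( $seconds ) {
		return HOUR_IN_SECONDS;
	}
);

// Trip after 3 transient failures so ops gets alerted sooner.
add_filter(
	'perflocale/breaker/threshold/mt_deepl',
	static function ( $threshold ) {
		return 3;
	}
);
```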

Where breaker state lives

Each breaker's state lives in a single transient: perflocale_breaker_<key>. When you have a persistent object cache (Redis, Memcached) configured, that's where the state actually lives — transients bypass wp_options. Without an object cache, transients fall back to wp_options rows.

A small autoloaded index option perflocale_breakers_index tracks every breaker key ever touched so the Site Health card + wp perflocale breakers list can enumerate them regardless of where the transient state actually lives. The index entries are cleaned up by Breaker::reset() and the full uninstall sweep.

Cache eviction: if Redis flushes under memory pressure, the breaker "forgets" its state. That's intentional — better one extra failed call than a stuck-open breaker. The breaker will re-trip on the next round of failures.

Concurrency locks (the other half of reliability)

Breakers protect external dependencies; locks protect internal critical sections. PerfLocale uses two flavors:

  • \PerfLocale\Concurrency\Lock — general-purpose advisory lock backed by atomic INSERT IGNORE INTO wp_options. Used everywhere a critical section can't tolerate concurrent execution: translation creation, FX sync, scoring cron, settings update.
  • \PerfLocale\Background\JobLock — per-job and per-type locks for the background-jobs worker pool. Same atomic primitive; different option-key namespace.

Token-guarded release. Both lock types stamp a per-acquire random token in the stored value. release() only deletes the row when the stored value still matches what THIS request stamped — so a lock holder that hangs past TTL and gets its lock taken over by another worker can't accidentally delete the new owner's row when it finally finishes.
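
The takeover scenario is easier to see in a toy model. The sketch below keeps lock rows in a PHP array instead of wp_options and is not PerfLocale's code; the class SketchLockStore and its methods are hypothetical, and the atomicity that INSERT IGNORE provides in the real lock is simply assumed away by running single-threaded.

```php
<?php
// Toy in-memory lock store for illustration; SketchLockStore is a hypothetical name.
final class SketchLockStore {
	/** @var array<string, array{token: string, expires: int}> */
	private array $rows = array();

	// Returns a fresh token on success, or null when a live holder exists.
	public function acquire( string $name, int $ttl, int $now ): ?string {
		$row = $this->rows[ $name ] ?? null;
		if ( null !== $row && $row['expires'] > $now ) {
			return null; // Held and not yet expired.
		}
		// Free, or expired and taken over: stamp a new per-acquire token.
		$token               = bin2hex( random_bytes( 16 ) );
		$this->rows[ $name ] = array(
			'token'   => $token,
			'expires' => $now + $ttl,
		);
		return $token;
	}

	// Deletes the row only when the stored token matches the caller's token.
	public function release( string $name, string $token ): bool {
		$row = $this->rows[ $name ] ?? null;
		if ( null === $row || ! hash_equals( $row['token'], $token ) ) {
			return false; // Someone else owns the lock now; leave it alone.
		}
		unset( $this->rows[ $name ] );
		return true;
	}
}
```

In this model, a worker that hangs past its TTL and then calls release() with its old token gets false back, and the new owner's row stays in place.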

Verified at scale. The concurrency test harness (in tools/concurrency-tests/ in the GitHub repo, excluded from the wp.org zip) runs 22+ scenarios across 3 WP installs (single-site, subdir multisite, subdomain multisite). 1000-way parallel acquire produces exactly one winner. 100-blog network activation completes cleanly with all per-blog crons scheduled. 50-way concurrent create_translation() calls for the same post + language produce exactly one new wp_posts row, never an orphan.

Disaster-recovery idempotency (WPML / Polylang / TranslatePress importers)

The three migration importers all guarantee idempotent re-runs — including the disaster-recovery scenario where an operator restores a database backup to a pre-import state and re-runs the importer. Without that guarantee, every re-import after a restore would allocate fresh translation_groups rows for content that already had groups in the prior run, duplicating every linkable post and term.

The mechanism is a dedicated perflocale_migration_source_map table that pins a stable identifier from the source plugin to the translation_groups row PerfLocale created for it. Strings already enjoyed this guarantee via the REPLACE INTO shape of StringTranslationRepository::set(); the source map extends the same shape to posts and terms.

  • Schema. perflocale_migration_source_map has UNIQUE (migration_type, source_key) backing an ON DUPLICATE KEY UPDATE upsert. migration_type is one of wpml, polylang, or trp; source_key is a per-importer natural key (e.g. WPML uses "<trid>|post" / "<trid>|term"; Polylang uses the translation-term id; TranslatePress uses "<post_id>|<lang_id>").
  • Atomic with group creation. The map insert lives inside the same START TRANSACTION as the translation_groups insert in TranslationGroupRepository::create_group(). A crash here rolls back both rows together — never leaving a stale map entry pointing at a non-existent group, and never leaving an orphan group with no map entry.
  • WPML and Polylang importers look up get_group_id(type, key) before calling create_group(). If a mapping exists, the existing group_id is reused; otherwise create_group() writes both rows in one transaction.
  • TranslatePress importer uses a different dedup pattern because it wp_insert_post()s new translation posts on the destination side. The pre-check there is get_translation_in_language() against the translation_links table — if the source post already has a translation in the target language, the importer skips the wp_insert_post() and re-records the source-map row for cross-restart consistency.
  • Operator escape hatch. wp perflocale migrate <source> --force-restart clears the source-map for one importer when the operator has deliberately restored a clean pre-migration backup and wants fresh allocations. The CLI logs the row-count cleared before the new import begins. The default (omit the flag) is the safer choice for almost every scenario.
  • Concurrency. The UNIQUE (migration_type, source_key) constraint serializes concurrent imports via MySQL’s InnoDB locking on the unique index. Two parallel importers hitting the same source key converge to one row via ON DUPLICATE KEY UPDATE; verified live with the regression harness at tools/regression-tests/idempotency.php.
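
The lookup-before-create pattern used by the WPML and Polylang importers can be sketched with an in-memory map. This is an illustration only: SketchSourceMap and resolve_group() are hypothetical names, the array stands in for the perflocale_migration_source_map table, and the transaction that keeps the map row and the group row atomic is not modeled.

```php
<?php
// Toy stand-in for the source-map table; SketchSourceMap is a hypothetical name.
final class SketchSourceMap {
	/** @var array<string, int> Composite key (type + NUL + source key) => group id. */
	private array $map = array();
	private int $next_group_id = 1;

	public function get_group_id( string $type, string $key ): ?int {
		return $this->map[ $type . "\0" . $key ] ?? null;
	}

	// Reuse the mapped group when one exists; otherwise allocate and record a new one.
	public function resolve_group( string $type, string $key ): int {
		$existing = $this->get_group_id( $type, $key );
		if ( null !== $existing ) {
			return $existing;
		}
		$group_id                          = $this->next_group_id++;
		$this->map[ $type . "\0" . $key ] = $group_id;
		return $group_id;
	}
}
```

Running the same source key through resolve_group() twice returns the same group id, which is the property that makes a re-run after a restore safe.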

Performance: get_group_id() takes 60–90 µs at p50 and set/upsert takes ~1 ms at p50. At realistic scale a 50,000-trid WPML migration spends ~55 seconds on the source-map path (consistent with 50,000 × roughly 1.1 ms for one lookup plus one upsert); a re-run, where every lookup hits, takes ~4 seconds (50,000 × roughly 75 µs). Table footprint is ~235 bytes per row: ~11 MB at 50K rows, ~112 MB at 500K rows.

Migration jobs (WpmlMigrationJob, PolylangMigrationJob, TranslatePressMigrationJob, DataImportJob) all call MigrationCacheHelper::flush_post_migration_caches() after a successful import. This flushes the L1 static memos on TranslationGroupRepository, deletes the autoloaded perflocale_eager_links_* + perflocale_has_any_groups options, and flushes the L2 cache. Long-running CLI / cron workers no longer serve pre-import group memos for the rest of their process lifetime.

Transactional FK cascades

Operations that touch multiple tables in lock-step run inside a single SQL transaction so a partial failure rolls the whole change back rather than committing an inconsistent state. Two load-bearing paths use this:

  • Language delete. LanguageRepository::delete() wraps every dependent-table DELETE (translation_links, orphan-group GC, glossary source/target, workflow, translation_memory source/target, slug_translations, string_translations) in one START TRANSACTION with ROLLBACK on any per-table failure. A script timeout, dropped DB connection, or unrecoverable error mid-cleanup no longer leaves the site with the languages row gone but tens of thousands of FK-orphan rows still referencing its language_id.
  • Translation group linking. TranslationGroupRepository::link_object() is the primitive every create_translation() call runs through (post, term, string, and migration paths). Its three internal DELETEs enforce the documented one-object-one-group invariant; on any DELETE failure the method rolls back when it owns the transaction and returns false. The TranslatePress importer’s batch transaction now checks this return and throws so a failed link rolls the whole batch back rather than inflating the imported counter past the actual link count on disk.

Helpers that participate in those transactions (e.g. StringTranslationRepository::delete_for_language()) throw on DB failure rather than returning silent 0 values, so the outer cascade catches the failure and rolls back. wp_insert_post() callers across the migration and translation paths pass $wp_error=true so the is_wp_error() guard surfaces real DB errors instead of conflating them with the legitimate int 0 “couldn’t insert” return.

Uninstall and deactivation cleanup

Uninstall cleans up everything the plugin ever wrote, including non-obvious surfaces:

  • L2 object cache flush. SiteCleanup::full_purge() iterates the canonical CacheManager::GROUPS list of plugin-owned cache groups and calls wp_cache_flush_group() on each. Without this, persistent Redis or Memcached entries can linger for up to 12 hours past uninstall and be served as ghost reads if the plugin is reinstalled in the same window.
  • Direct-grant capability orphans. SiteCleanup::sweep_orphan_user_caps() strips every perflocale_* key from per-user wp_capabilities meta. It’s called from BOTH plugin deactivation and full uninstall so direct-grant caps no longer accumulate across deactivation/reactivation cycles.
  • Per-blog isolation on multisite. uninstall.php iterates every subsite via get_sites() + switch_to_blog(), runs the full cleanup inside a try/finally so a fatal on one subsite doesn’t leave subsequent iterations running against the wrong blog, and respects each subsite’s own delete_data_on_uninstall preference.
  • Importer checkpoints — per-importer resume-checkpoint options (e.g. perflocale_trp_import_post_checkpoint) are included in the uninstall sweep even when an interrupted import left them behind.

MT rate-limit: fail-CLOSED + site-wide cap

The per-user hourly ceiling on machine-translation requests is enforced inside an atomic lock so concurrent POST /perflocale/v1/machine-translate requests can’t both read the pre-increment count, both pass the < limit check, and each write count+1 — stampeding past the cap. If a request can’t acquire the lock, it now fails CLOSED with HTTP 429 + Retry-After rather than open: a request that can’t even take the lock cannot have its count incremented, so treating the miss as “allow” would defeat the cap.

The new perflocale/mt/rate_limit_site filter (default 5000) caps total MT requests per hour across every user summed together. Per-user and site counters share a single global lock so a hostile editor with the perflocale_use_mt capability can’t fan out parallel requests to drain the site-wide budget faster than the rate-limit check can register them.
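
The fail-closed decision can be sketched as a small in-memory counter. This is not the plugin's code: SketchMtRateLimiter is a hypothetical class, the hourly window reset is omitted, and whether the lock was acquired is passed in as a boolean so the two refusal paths are easy to compare.

```php
<?php
// Toy limiter for illustration; SketchMtRateLimiter is a hypothetical name.
// The hourly window reset is deliberately left out.
final class SketchMtRateLimiter {
	/** @var array<int, int> Requests counted per user id. */
	private array $user_counts = array();
	private int $site_count = 0;
	private int $user_limit;
	private int $site_limit;

	public function __construct( int $user_limit, int $site_limit = 5000 ) {
		$this->user_limit = $user_limit;
		$this->site_limit = $site_limit;
	}

	// Returns an HTTP-style status: 200 when the request was counted, 429 when refused.
	public function check( int $user_id, bool $lock_acquired ): int {
		if ( ! $lock_acquired ) {
			return 429; // Fail closed: without the lock the count cannot be incremented.
		}
		$user_count = $this->user_counts[ $user_id ] ?? 0;
		if ( $user_count >= $this->user_limit || $this->site_count >= $this->site_limit ) {
			return 429; // Per-user or site-wide cap reached.
		}
		// Both counters move together, under the same lock.
		$this->user_counts[ $user_id ] = $user_count + 1;
		$this->site_count++;
		return 200;
	}
}
```

Because the read, the comparison, and the increment all sit behind the same lock check, two parallel requests cannot both observe the pre-increment count in this model.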

Data retention & garbage collection

Every plugin-owned data store has a clear bound. There are no tables, options, or meta keys that can grow forever — either the schema invariant (UNIQUE keys + cascade-on-delete from posts / terms / languages) keeps the row count proportional to content, or a layered GC system trims it on a schedule. Every retention knob is filterable.

Strings mark-and-sweep (perflocale_strings & perflocale_string_translations)

The perflocale_strings table holds every translatable gettext call the string scanner has discovered across the active plugins and themes. Disable a plugin (or update one that drops a string), and those rows would otherwise linger forever — carrying with them an arbitrary number of perflocale_string_translations rows per active language.

The scanner stamps a last_seen_at timestamp on every row it re-discovers. register_setting_string() does the same for manually-registered strings (workflow email templates, settings labels). A daily GC fires inside the existing perflocale_jobs_gc cron and deletes rows whose last_seen_at is older than perflocale/strings/stale_retention_days (default 90), cascading to perflocale_string_translations in the same DELETE.

Two safety nets prevent over-eager eviction:

  • 90-day retention — tolerates rare-but-real code paths (settings pages, workflow emails) that may only run a few times a year. Tune via the filter above.
  • Context whitelist. perflocale/strings/manual_contexts lists context values that the GC never deletes, no matter how stale. Defaults to ['workflow_email_subject', 'workflow_email_body']; add your own register_setting_string contexts as a hedge against the retention window expiring before the code path that registers them runs.
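
Both safety nets are adjusted through filters. The fragment below is a configuration sketch for a site-specific plugin: it assumes each filter passes the current value as its first argument, and my_plugin_settings_label is a hypothetical context name used only for illustration.

```php
<?php
// Assumption: each filter receives the current value as its first argument.

// Keep unseen strings for 180 days instead of the default 90.
add_filter(
	'perflocale/strings/stale_retention_days',
	static function ( $days ) {
		return 180;
	}
);

// Protect a custom register_setting_string context from the GC.
add_filter(
	'perflocale/strings/manual_contexts',
	static function ( $contexts ) {
		$contexts   = is_array( $contexts ) ? $contexts : array();
		$contexts[] = 'my_plugin_settings_label'; // Hypothetical context name.
		return array_values( array_unique( $contexts ) );
	}
);
```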

A separate orphan-sweep runs in the same daily window: a single DELETE … LEFT JOIN over perflocale_string_translations drops any row whose parent strings.id has been deleted by a path that bypassed the cascade (manual SQL, partial-failure recovery, future code paths that forget the cascade). Cheap when nothing to delete; emits perflocale/string_translations/orphans_swept when it finds work.

Translation Memory LRU eviction (perflocale_translation_memory)

The UNIQUE KEY (source_hash, source_language_id, target_language_id) on the TM table makes duplicate writes impossible — store() uses ON DUPLICATE KEY UPDATE to bump usage_count when the same source string + language pair comes back. Combinatorial growth is therefore #distinct-source-strings × #language-pairs, which is bounded but can still be large on a high-vocabulary site.

A weekly LRU GC caps the table at perflocale/tm/gc_row_cap (default 100,000 rows). When the cap is exceeded, the lowest-scoring rows (ORDER BY usage_count ASC, updated_at ASC — least-reused first, then oldest) are deleted down to perflocale/tm/gc_target_rows (default 90,000), leaving 10 % headroom before the next GC tick. Misconfigured filter values (target ≥ cap, target < 0) get sanity-clamped so a typo can't wipe the table.
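
The cap and target arithmetic can be sketched as a pure function. The function name sketch_tm_rows_to_delete is hypothetical, and the fallback used for a misconfigured target (90 % of the cap) is an assumption made for this example; the text above only states that such values get clamped, not what they are clamped to.

```php
<?php
// Hypothetical helper: how many rows a GC tick would delete.
// The 90 % fallback for a bad target is an assumption, not documented behavior.
function sketch_tm_rows_to_delete( int $current_rows, int $cap = 100000, int $target = 90000 ): int {
	$cap = max( 1, $cap );
	if ( $target < 0 || $target >= $cap ) {
		$target = intdiv( $cap * 9, 10 ); // Assumed clamp: leave 10 % headroom.
	}
	if ( $current_rows <= $cap ) {
		return 0; // Under the cap: nothing to evict.
	}
	return $current_rows - $target;
}
```

With the defaults, a table of 120,000 rows would lose 30,000 rows in one tick, while a table of 95,000 rows would be left alone because it has not crossed the cap.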

The GC is wired now as forward-protection. TranslationMemory::store() has no production callers yet (only integration tests and the future learn-on-save feature use the API), so today the weekly hook is a no-op. When learn-on-save activates, the cap is already in place.

Other bounded data stores

For completeness, every other plugin-owned data store and its bound:

perflocale_translation_links + perflocale_translation_groups
Bounded by content. The before_delete_post hook unlinks the deleted post's row; orphan groups (groups left with zero links) get swept in the same daily perflocale_jobs_gc via TranslationGroupRepository::gc_empty_groups() with a 1,000-row cap per tick.
perflocale_slug_translations
Bounded by content. One row per (object_type, object_id, language_id) — cleaned up when the post / term is deleted (cascading hooks) or the language is deleted (transactional cascade in LanguageRepository::delete).
perflocale_content_hashes
Bounded by content. One row per (object_type, object_id) — rehashing UPDATEs the row in place; new objects INSERT; deleted objects cascade.
perflocale_workflow
Bounded by content. One row per (object_type, object_id, language_id) assignment; cleaned up when the underlying translation row is deleted.
perflocale_glossary
Operator-managed. Each row is an admin-curated entry; no automated growth path.
perflocale_jobs
Status-based TTL. JobState::gc() deletes completed / failed / canceled jobs older than 24 h in the daily perflocale_jobs_gc; live jobs (queued, running) stay until they finish.
perflocale_mt_usage_YYYY_MM options
One per blog per month. Weekly perflocale_mt_usage_gc drops anything older than 13 months (only the current month and 12 prior are read by the admin UI).
perflocale_breaker_*, perflocale_*_lock_*, perflocale_addon_* options
Breaker rows are bounded by the small set of named breakers (5–10 typically). Lock rows are TTL-based with daily Lock::reap_expired(). Addon settings + disabled-addon lists are byte-capped (16 KiB and 4 KiB respectively).

Self-healing worker re-schedule

When a background worker finds its per-job lock already held (typically a leaked row from a crashed sibling worker that never released), it now re-queues itself with exponential backoff + jitter instead of silently returning. The retry chain is capped at 20 attempts; on cap exhaustion, the job is marked failed with an actionable diagnostic message pointing at the wedged perflocale_job_lock_<id> row.

Before this fix, the cron event was consumed pre-dispatch but the worker returned without doing anything, leaving the job stuck in 'running' state until the stuck-sweep window (typically several hours). Operators saw long-running jobs that never advanced and assumed silent failure.

Configure via perflocale/jobs/lock_busy_max_retries (default 20) and perflocale/jobs/lock_busy_max_seconds (default 600s).
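
A capped exponential backoff with jitter can be sketched as follows. The function names are hypothetical, and several details are assumptions made for illustration: the 5-second base delay, the use of full jitter, and the reading of the 600-second default as a ceiling on a single delay.

```php
<?php
// Hypothetical helpers; base delay, jitter strategy, and the meaning of the
// 600 s ceiling are assumptions for this sketch.
function sketch_lock_busy_delay( int $attempt, int $base_seconds = 5, int $max_seconds = 600 ): int {
	$attempt = max( 1, $attempt );
	// Double the ceiling on every attempt, capped at $max_seconds.
	$ceiling = min( $max_seconds, $base_seconds * ( 2 ** min( $attempt - 1, 20 ) ) );
	// Full jitter: pick uniformly between 1 second and the ceiling.
	return random_int( 1, max( 1, $ceiling ) );
}

// True while the retry chain has not exhausted its cap.
function sketch_should_retry( int $attempt, int $max_retries = 20 ): bool {
	return $attempt <= $max_retries;
}
```

Jitter spreads out workers that found the same lock busy at the same moment, so they do not all wake up and collide again on the next attempt.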

Troubleshooting

The Site Health card shows a breaker as OPEN and it won't close
  1. Click the "Reset now" link — this force-closes the breaker without waiting for cooldown.
  2. If the breaker re-trips immediately, the upstream is genuinely failing. Check the PHP error log for lines tagged with the breaker's reason ([auth], [rate_limit], [transient]).
  3. For MT providers: verify the API key is valid and the monthly quota isn't exhausted. For webhooks: hit the URL with curl from the WP host; if it doesn't respond, the receiver is the problem. For FX/geo: check the provider's status page.
I never see any breakers tripped — is the system working?
That's the desired steady state. Run wp perflocale breakers list to confirm: an empty list means every external call is succeeding. To verify the subsystem is actually loaded, plant a test trip: wp eval '\PerfLocale\Concurrency\Breaker::record_failure( "test", "auth", 1 );' then check wp perflocale breakers list — you should see a row.
My background job is stuck. How do I diagnose?
The PerfLocale → Jobs admin page shows current status + a per-job log. Look for entries like "Per-job lock held by another worker; deferred by Ns (retry K/20)" — that means a sibling worker is stuck. After 20 retries the job will auto-mark-failed with a pointer to the wedged lock row. Manually clear it with: wp eval 'global $wpdb; $wpdb->delete( $wpdb->options, ["option_name" => "perflocale_job_lock_<the-job-id>"] );'
Can I disable the entire reliability layer?
add_filter( 'perflocale/breaker/disabled', '__return_true' ) kills all breakers (calls go straight through to the upstream, no short-circuit). Locks can't be disabled — they're load-bearing for correctness, not optional safety. If you're hitting lock contention, the right answer is to raise the type-busy or job-busy retry cap, not disable.
