Trustworthy AI: From Self-Aware Models to Test-Time Repair

Deployed models rarely fail loudly. They drift, grow overconfident, and quietly mispredict when the world stops looking like their training data. This four-part series follows a single reliability pipeline for catching that before it causes harm. It begins with anchoring, a one-line change to how a network reads its input that gives the model a calibrated sense of its own uncertainty. That uncertainty becomes a precise account of when a model is about to fail, then a way to characterize how a distribution has shifted and why, and finally a basis for adapting a deployed model at test time without retraining or access to the original data.

Part 1: Anchoring, a Simple Idea with Outsized Consequences

References:
Single Model Uncertainty Estimation via Stochastic Data Centering (Δ-UQ), NeurIPS 2022
Out of Distribution Detection via Neural Network Anchoring, ACML 2022
On the Use of Anchoring for Training Vision Models, NeurIPS 2024

This series is about building machine-learning models we can actually trust once they leave the lab: models that know when they are unreliable, that flag their own failures, and that hold up when the test data drifts from what they were trained on. It opens here with a single, deceptively small change to how a neural network reads its input, called anchoring, that turns out to be a remarkably powerful starting point for reliability. Anchoring is the thread through the next part as well, where its immediate payoff is a precise account of when, and how badly, a model is about to fail.

Here is the whole idea. Instead of feeding a network a sample x, you feed it a pair: a reference point r drawn at random from the training data, and the residual x − r. The two are stacked together and handed to the same network you would have trained anyway, with only the first layer widened to accept the doubled input. The label you ask it to predict is unchanged. That is it. You have asked the model to predict y from a relative description of x with respect to r, rather than from x directly.
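
To make the construction concrete, here is a minimal NumPy sketch of forming the anchored input. The function name `make_anchored_batch` and the toy data are illustrative assumptions, not code from the papers; the only thing taken from the text is the tuple [r, x − r] with r drawn at random from the training data.

```python
import numpy as np

def make_anchored_batch(x, reference_set, rng):
    """Pair each sample with a random reference and stack [r, x - r]."""
    # Draw one reference index per sample, uniformly from the reference set.
    idx = rng.integers(0, len(reference_set), size=len(x))
    r = reference_set[idx]
    # Concatenate along the feature axis, so the first layer sees 2*d inputs.
    return np.concatenate([r, x - r], axis=-1)

rng = np.random.default_rng(0)
train_x = rng.normal(size=(100, 4))      # toy training data with d = 4 (assumed)
batch = train_x[:8]
anchored = make_anchored_batch(batch, train_x, rng)
print(anchored.shape)                    # (8, 8): the input width has doubled
```

Adding the two halves of each row gives back the original sample, so no information is lost; the input is only re-expressed relative to r. For images the same stacking would presumably happen along the channel axis rather than a flat feature axis.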

Anchoring reparameterizes an input into a tuple of a random reference (anchor) and the residual. The network is trained to predict the same label regardless of which anchor is drawn, forcing it to model the joint distribution of references and residuals.

Why a Trivial-Looking Change Is Not Trivial

At first glance this looks like a cosmetic reposing of standard training. It is not. Because a given sample x can be paired with many different references, the network sees many relative views of the same point and is asked to return the same answer for all of them. To satisfy that demand it has to learn the joint structure of references and residuals rather than a direct input-to-label shortcut.

Fourier spectra of the neural tangent kernel. Because the NTK is not shift invariant, varying the reference (anchor) produces a family of distinct kernels (right), the single-model analogue of training many shifted models. Positional embeddings (bottom) change the spectrum but preserve the effect.

The reason this produces something useful comes from the neural tangent kernel. The kernel induced by ordinary deep networks is not shift invariant, so centering the data on different reference points yields genuinely different solutions. You can see this directly in the kernel’s Fourier spectrum: a trivial shift in the input reshapes the NTK, and varying the reference traces out a family of related-but-distinct kernels. Training a separate model on each shift would give you an ensemble whose disagreement is a strong indicator of epistemic uncertainty. Anchoring captures that same effect inside a single network by marginalizing over the choice of reference, which means you get ensemble-quality uncertainty estimates without paying for an ensemble.

At inference, a single anchored model is queried with many anchors for the same input. The mean of those predictions is the answer; their spread is a calibrated estimate of epistemic uncertainty. This is Δ-UQ: ensemble-style uncertainty from one model.
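
The inference loop described above can be sketched in a few lines. This is an illustration under stated assumptions: `toy_model` is an untrained stand-in for an anchored network, and `delta_uq_predict` is a hypothetical helper name. Only the recipe (K anchors, mean as prediction, spread as uncertainty) comes from the text.

```python
import numpy as np

def delta_uq_predict(model, x, reference_set, k, rng):
    """Query one anchored model with k random anchors; return mean and spread."""
    preds = []
    for _ in range(k):
        idx = rng.integers(0, len(reference_set), size=len(x))
        r = reference_set[idx]
        preds.append(model(np.concatenate([r, x - r], axis=-1)))
    preds = np.stack(preds, axis=0)      # shape (k, n, out_dim)
    # Mean over anchors is the prediction, standard deviation is the uncertainty.
    return preds.mean(axis=0), preds.std(axis=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 1))              # hypothetical weights, not a trained model
toy_model = lambda z: np.tanh(z @ W)

refs = rng.normal(size=(50, 4))
x_test = rng.normal(size=(5, 4))
mean, spread = delta_uq_predict(toy_model, x_test, refs, k=20, rng=rng)
```

Setting `k=1` gives the cheap single-reference protocol, at the price of having no spread to report.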

The clearest way to see this working is a simple one-dimensional regression. Where training samples are dense, predictions from different anchors agree and the deviation band collapses onto the true function. In the gaps between clusters of training points, and out past the last samples on either side, the anchors disagree and the band fans out, exactly where a trustworthy model should be unsure. A single anchored network produces this calibrated, input-dependent uncertainty for free.

Δ-UQ on a 1D regression function. The mean estimate (green) tracks the true function (dashed) where training samples (red) are dense, while the deviation band widens in the gaps between samples and beyond the training support, the regions where the model should be uncertain.

These uncertainties are good enough to act on, not just to report. In sequential, black-box optimization, where each function evaluation is expensive and the next query must balance exploration against exploitation, the quality of the uncertainty estimate is what drives the search. Using Δ-UQ’s uncertainty inside a Bayesian-optimization loop on a suite of benchmark functions matches or beats Gaussian processes, MC-dropout, and deep ensembles, while only ever training a single model.
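
As a rough illustration of how a mean and a spread can drive candidate selection, the sketch below uses an upper-confidence-bound rule. The text does not say which acquisition function was used, so UCB, the function name `ucb_select`, and the weight `beta` are all assumptions chosen for illustration.

```python
import numpy as np

def ucb_select(mean, spread, beta=2.0):
    """Pick the candidate maximizing mean + beta * spread (maximization setting)."""
    return int(np.argmax(mean + beta * spread))

# Toy per-candidate predictions and uncertainties (made-up numbers).
mean = np.array([0.2, 0.5, 0.4])
spread = np.array([0.05, 0.05, 0.30])

best = ucb_select(mean, spread)              # uncertainty-aware choice
greedy = ucb_select(mean, spread, beta=0.0)  # pure exploitation for comparison
```

With these numbers the uncertainty-aware rule picks the third candidate, whose mean is lower but whose spread is large, while the greedy rule picks the second. That difference is the exploration that a good uncertainty estimate buys.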

A Δ-UQ demonstration: on black-box optimization (e.g., the Ackley function), the single-model uncertainty estimate guides candidate selection effectively, competing with Gaussian processes and deep ensembles at a fraction of the cost.

From Uncertainty to a General Training Principle

The same machinery does more than quantify uncertainty. By tempering each prediction with its anchor-derived spread, anchoring yields a sample-specific calibration that sharply improves out-of-distribution detection, without exposure to outlier data, custom calibration objectives, or ensembling. The well-calibrated, distribution-aware signal that anchoring produces is precisely the raw material the rest of this series builds on.

It helps to be concrete about what training and inference actually involve, because both are lighter than they sound. During training, every time a sample x is drawn, a fresh reference r is sampled from the reference set and the tuple [r, x − r] is formed on the fly. The reference set is just a subset of the training data, so no extra data is needed. Over many epochs the same x is paired with a wide variety of references, and the loss insists that all of those relative views map to the same label. The only architectural change is the first layer, which is widened to accept the stacked tuple. Everything else, the optimizer, the schedule, the augmentations, stays exactly as it was, which is why anchoring drops cleanly into existing pipelines.
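
A toy training loop shows how little changes. The sketch below swaps the neural network for a linear model so it fits in a few lines of NumPy. The data, learning rate, and epoch count are assumptions; the parts taken from the text are the fresh reference per draw, the on-the-fly tuple, and the unchanged label.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 3))
y = x @ np.array([1.0, -2.0, 0.5])           # toy regression targets (assumed)

w = np.zeros(6)                              # "first layer" widened from 3 to 6 inputs
lr = 0.05
losses = []
for epoch in range(200):
    # Fresh references every epoch: the same x meets a different r each time.
    r = x[rng.integers(0, len(x), size=len(x))]
    z = np.concatenate([r, x - r], axis=1)   # tuple [r, x - r], built on the fly
    err = z @ w - y                          # the label y is unchanged
    losses.append(float(np.mean(err ** 2)))
    w -= lr * 2 * z.T @ err / len(x)         # gradient step on mean squared error
```

In this linear toy case the weights on r and on x − r converge to the same values, so the prediction no longer depends on which reference was drawn, which is exactly what the loss asks for.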

At inference there is a choice of protocol, and a useful finding is that the choice matters less than you might expect. The cheapest option uses a single reference per sample. A more careful option draws K references, predicts K times, and marginalizes, recovering the mean prediction and the uncertainty spread. A still more elaborate option, bilinear transduction, searches for the single best reference for each test sample so that the resulting residual looks like one the model saw in training, which is especially natural for extrapolation. Across these protocols the accuracy is statistically indistinguishable while the compute cost grows sharply. The practical takeaway is to use the simple single-reference protocol for prediction and reserve marginalization for when you actually want the uncertainty estimate.

That finding carries a sharper implication: if smarter inference cannot close the gap, the limitation must live in training itself. This is where more recent work establishes anchoring as a general, architecture-agnostic protocol spanning convolutional networks and transformers, and surfaces a real subtlety. Naively, you would expect that exposing the model to a larger, more diverse reference set would steadily improve generalization. In practice, standard anchored training fails to exploit that diversity. The reason is combinatorial: when the reference set is the whole training set, the number of reference-residual pairs is enormous, far more than any fixed training budget can sample. The joint space is covered too sparsely, and the model takes the easy route of predicting from the residual alone while ignoring the reference. That is exactly the kind of shortcut anchoring was supposed to prevent, since a sample should not be identifiable without its reference.

Standard anchored training improves on standard training but does not benefit from a larger, more diverse reference set. A reference-masking regularizer recovers the lost generalization, turning reference diversity back into an asset.

The fix is a simple reference-masking regularizer. With some probability α during training, the reference in a tuple is zeroed out, leaving [0, x − r], and the model is explicitly taught to produce a high-entropy, uniform prediction for these masked tuples. In other words, when the reference is taken away the model is required to be maximally uncertain. This makes it impossible to succeed by reading the residual alone and forces genuine use of the reference, restoring the benefit that reference diversity was supposed to deliver. The effect is sensitive to scale in an intuitive way: when the reference set is tiny, the model already sees most reference-residual combinations, so heavy masking just causes underfitting, and a small or zero α is best. The regularizer earns its keep precisely when the reference set is large, which is the regime that matters for real models.
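
One way to picture the regularizer is as a change to the batch and its targets. In the sketch below, the helper names and the use of a soft cross-entropy against a uniform target are assumptions about how "taught to produce a uniform prediction" might be implemented; the zeroed reference [0, x − r] and the probability α come from the text.

```python
import numpy as np

def mask_references(r, x, y_onehot, alpha, rng):
    """With probability alpha, zero the reference and swap the target for uniform."""
    n, num_classes = y_onehot.shape
    masked = rng.random(n) < alpha
    r_used = np.where(masked[:, None], 0.0, r)
    tuples = np.concatenate([r_used, x - r], axis=1)   # residual still uses the true r
    uniform = np.full((n, num_classes), 1.0 / num_classes)
    targets = np.where(masked[:, None], uniform, y_onehot)
    return tuples, targets, masked

def soft_cross_entropy(logits, targets):
    """Cross-entropy against soft targets, averaged over the batch."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
r = rng.normal(size=(6, 4))
y = np.eye(3)[rng.integers(0, 3, size=6)]
tuples, targets, masked = mask_references(r, x, y, alpha=0.5, rng=rng)
```

On masked rows the loss is smallest when the model outputs the uniform distribution, so the residual alone can never earn a confident prediction.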

The payoff is concrete and it scales. Anchored training with reference masking produces models that generalize better under distribution shift, are better calibrated, and reject anomalies more reliably, across both convolutional and transformer backbones. On ImageNet-1K, anchored vision transformers leave clean accuracy essentially untouched while opening a widening lead over standard training as corruptions intensify, with the largest gains in exactly the severe-shift regimes where reliability matters most.

Top-1 accuracy gain of anchored training with reference masking over standard training on ImageNet-1K, averaged across four transformer architectures (SwinV2-T/S/B and ViT-B/16). The advantage grows with corruption severity, reaching several points at the highest levels, while clean ImageNet accuracy is unchanged.

Underlying this, the masking regularizer steers training toward flatter, wider optima in the loss landscape, the geometry usually associated with better generalization. The benefits stack on top of standard augmentation pipelines rather than competing with them, and because the tuple construction never uses a reference’s label, anchoring stays robust even when a fraction of the training labels are corrupted.

That is the foundation. A one-line change to the input gives a model that knows what it does not know, calibrates itself, and generalizes better, all from the same network you were going to train anyway. The next part puts this signal to its first safety use: turning anchored uncertainty into a precise account of when, and how badly, a model is about to fail.

Part 2: Knowing When a Model Is About to Fail

Reference: PAGER: Accurate Failure Characterization in Deep Regression Models, ICML 2024

Part 1 left us with a model that can estimate its own uncertainty cheaply, through anchoring. The obvious next question for safety is whether that uncertainty is enough to tell us when the model is about to be wrong. For classifiers the question is relatively clean: a failure is a misclassification. For regression it is murkier, because “wrong” is a matter of degree, and the tolerance that counts as failure depends on the application. This part is about characterizing failure in deep regressors, and it starts by puncturing a comfortable assumption.

Uncertainty Is Necessary but Not Sufficient

The intuitive recipe for failure detection is: trust the model where it is confident, distrust it where it is uncertain. A central observation is that this recipe quietly fails in both directions. A model can be confidently wrong, and it can be uncertain yet correct. Consider a simple 1D function. In a region just outside the training support the model may report low uncertainty while drifting well away from the truth, a confident error. In another region the model may extrapolate cleanly even though the inputs are unfamiliar, low risk despite high uncertainty. Uncertainty alone cannot separate these cases.

Why uncertainty is insufficient. On a 1D function, regions just outside the training support can show low uncertainty yet sizable error (confident mistakes), while some unfamiliar regions extrapolate well. PAGER flags these correctly as moderate risk where uncertainty-only methods do not.

The missing ingredient is a sense of whether a test point actually conforms to the data the model was trained on. A prediction can be unreliable not because the model hesitates, but because the input simply does not belong to the manifold the model learned. PAGER calls this manifold non-conformity, and pairs it with uncertainty to get a complete picture of risk.

Two Reads from One Anchored Model

Here is where anchoring pays off a second time. Recall that an anchored model takes a tuple of a reference and a residual, [r, x − r]. PAGER extracts two different signals from the very same trained model simply by changing what it treats as the query.

Running the model the usual way, with the test sample as the query and many references marginalized out, gives the epistemic uncertainty from Part 1. This is forward anchoring. The new trick is to run it backwards. Swap the roles: treat the test sample as the anchor and ask the model to predict a known training target, evaluating F([x, r − x]). Because the target for the training reference is known, the error in recovering it becomes a direct measure of how poorly the test sample sits relative to the training manifold. This reverse anchoring needs no auxiliary model, no calibration set, and no labels for the test data; it falls straight out of the anchored network already in hand.
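
Reverse anchoring is easy to sketch once the roles are swapped. The code below is an illustration, not PAGER's implementation: `toy_model` is an untrained stand-in, the absolute error and the maximum over references are one plausible way to aggregate, and all names are hypothetical.

```python
import numpy as np

def non_conformity(model, x, refs, ref_targets):
    """Use each test point as the anchor and measure how badly known targets are recovered."""
    scores = np.empty(len(x))
    for i, xi in enumerate(x):
        anchors = np.tile(xi, (len(refs), 1))                        # test sample is the anchor
        tuples = np.concatenate([anchors, refs - anchors], axis=1)   # [x, r - x]
        errors = np.abs(model(tuples) - ref_targets)                 # targets of r are known
        scores[i] = errors.max()                                     # worst case over references
    return scores

rng = np.random.default_rng(0)
w = rng.normal(size=8)                     # hypothetical weights, not a trained model
toy_model = lambda z: np.tanh(z @ w)

refs = rng.normal(size=(50, 4))
ref_targets = rng.normal(size=50)
x_test = rng.normal(size=(5, 4))
scores = non_conformity(toy_model, x_test, refs, ref_targets)
```

Note that nothing here touches a label for `x_test`: the only supervision used is the set of training targets that were already available.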

A side note: notice what the non-conformity score actually is, a scalar function of the input that sits low on the training manifold and rises as a sample moves off it. That is precisely the role of an energy function in an energy-based model, where low energy marks the data manifold and high energy marks everything else. Reverse anchoring effectively reads such an energy off a model that was only ever trained to regress, much as recent work shows an ordinary classifier can be reinterpreted as an energy-based model. The analogy is only implicit, since the score is a reconstruction error maximized over references rather than a normalized log-density, but the intuition is a useful lens. The more interesting consequence is practical. Energy-based models are not used everywhere because training them in high dimensions is genuinely painful: the sampling and partition-function machinery is unstable and expensive. Anchoring sidesteps that entirely: it is trained with ordinary supervised learning and yet exposes an energy-like landscape as a by-product. That suggests a concrete recipe worth exploring: train an anchored model the easy way, then use the non-conformity landscape it already provides as the scaffold for fitting an actual energy model, turning a hard high-dimensional density-estimation problem into a much more tractable one.

PAGER reads two complementary signals from a single anchored model. Forward anchoring (predict the query) gives epistemic uncertainty; reverse anchoring (predict the anchor’s known target) gives a manifold non-conformity score. Neither requires extra models or calibration data.

Organizing Failure into Risk Regimes

Rather than forcing a single brittle threshold on “failure”, we can bin each of the two scores into low, moderate, and high, and read the joint pattern. The combination is what carries the information. Low uncertainty with low non-conformity is genuinely in-distribution and safe. High on both is unambiguous high risk. The interesting cells are the off-diagonal ones: low uncertainty but high non-conformity (a confident prediction on an off-manifold point) or moderate uncertainty with good conformity (an unfamiliar but extrapolable point). Mapping the grid of score combinations onto four regimes, in-distribution, low, moderate, and high risk, turns a vague notion of failure into an actionable triage.
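
The triage can be sketched as two binning steps and a lookup. Everything specific below is an assumption made for illustration: the 25th and 75th percentile cut points, the helper names, and the cell-to-regime table, which is one plausible assignment rather than the published one.

```python
import numpy as np

LEVELS = ("low", "moderate", "high")

# Keys are (uncertainty level, non-conformity level). Illustrative mapping only.
REGIMES = {
    ("low", "low"): "in-distribution",
    ("low", "moderate"): "low risk",
    ("moderate", "low"): "low risk",
    ("moderate", "moderate"): "moderate risk",
    ("low", "high"): "moderate risk",
    ("high", "low"): "moderate risk",
    ("moderate", "high"): "high risk",
    ("high", "moderate"): "high risk",
    ("high", "high"): "high risk",
}

def bin_scores(scores, lo_q=25, hi_q=75):
    """Label each score low, moderate, or high using percentile cut points."""
    lo, hi = np.percentile(scores, [lo_q, hi_q])
    # np.digitize gives 0 below lo, 1 from lo up to hi, and 2 at or above hi.
    return [LEVELS[i] for i in np.digitize(scores, [lo, hi])]

def assign_regimes(uncertainty, non_conformity):
    u = bin_scores(uncertainty)
    c = bin_scores(non_conformity)
    return [REGIMES[(ui, ci)] for ui, ci in zip(u, c)]

toy_scores = np.arange(9.0)                     # made-up scores, same for both axes
regimes = assign_regimes(toy_scores, toy_scores)
```

Because the cut points are percentiles of the scores themselves, this version needs no labeled calibration set, in keeping with the text; thresholds tuned on held-out data could replace them when such data exists.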

Each sample can be placed on a grid of uncertainty (forward anchoring) against non-conformity (reverse anchoring), and the cells can be mapped to different risk regimes: in-distribution, low, moderate, and high risk. No calibration data or rigid failure threshold is required.

To judge such a detector, PAGER also proposes metrics that respect this structure rather than a single accuracy number. False negatives count genuinely high-risk samples that slipped into the safe regimes, the most costly error. False positives count harmless samples needlessly flagged. Two confusion metrics measure how often neighboring regimes bleed into each other. Across a suite of high-dimensional regression tasks and image-regression problems, PAGER cuts false negatives and false positives by more than half relative to uncertainty-only methods like DEUP and conformity-only methods like DataSUITE, while remaining cheap enough to run online.

Two practical points make this more than a benchmark result. First, anchored training does not cost accuracy, so the failure-characterization machinery is effectively free once you have adopted anchoring. Second, PAGER needs no held-out calibration data to organize the regimes, though it can use such data to sharpen them when available. The upshot is a single anchored model that not only predicts, but also tells you, sample by sample, how much to trust each prediction, and why.

Part 3: Characterizing How a Distribution Has Shifted

References:
Accurate and Robust Feature Importance Estimation under Distribution Shifts (PRoFILE), AAAI 2021
DECIDER: Leveraging Foundation Model Priors for Improved Model Failure Detection and Explanation, ECML-PKDD 2024

Parts 1 and 2 gave a model the ability to say how confident it is and when it is likely to fail. The natural follow-up is diagnostic rather than predictive: when the test data drifts away from training, can we say something about how it has shifted and why a model reacts to it? This part is about characterizing distribution shift at two complementary levels of granularity, the individual feature and the whole model.

PRoFILE: Which Features Matter, Even When the Distribution Moves

Post-hoc explanation tools like LIME and SHAP answer a basic question: which input features drove this prediction? They are widely used, but they share two weaknesses. They are computationally heavy, since they probe the model with many perturbed copies of each input, and, more damagingly, their explanations are fragile under distribution shift, drifting or breaking exactly when you most need to understand what the model is doing.

PRoFILE attacks both problems with one idea: train a direct error estimator alongside the predictor. As the main model learns to predict the target, an auxiliary head learns to predict the main model’s own error from its internal features. This idea has been found to be relevant even for characterizing LLM failures. Once trained, this estimator lets you ask a causal question cheaply. To gauge how important a feature is, mask it and read off how much the estimated loss rises. The bigger the jump, the more the model relied on that feature, a Granger-causal notion of importance computed from loss estimates rather than from expensive repeated forward passes.
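
The mask-and-measure step can be sketched as follows. The loss estimator here is a hand-written toy function standing in for the trained auxiliary head, and masking by setting a feature to zero is just one of the masking strategies one could choose; both are assumptions for illustration.

```python
import numpy as np

def feature_importance(loss_estimator, x, mask_value=0.0):
    """Importance of feature j is the rise in estimated loss when j is masked."""
    base = loss_estimator(x[None, :])[0]
    scores = np.empty(len(x))
    for j in range(len(x)):
        masked = x.copy()
        masked[j] = mask_value
        scores[j] = loss_estimator(masked[None, :])[0] - base
    return scores

# Toy stand-in: estimated loss is the squared gap between a linear score and 3.0.
w = np.array([2.0, 0.0, 1.0])
toy_loss_estimator = lambda batch: (batch @ w - 3.0) ** 2

x = np.array([1.0, 5.0, 1.0])
scores = feature_importance(toy_loss_estimator, x)
print(scores)   # [4. 0. 1.]
```

The second feature has the largest raw value but zero weight in the toy estimator, so masking it changes nothing and its importance is zero. The ranking reflects what the model relies on, not how large an input happens to be.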

PRoFILE trains a loss estimator jointly with the predictor, consuming the model’s multi-stage features. Feature importance is then read off by masking a feature and measuring the rise in estimated loss, a causal score that needs no retraining and works with any masking strategy.

Because the loss estimator is trained to track the model’s error, it inherits a useful side effect: it becomes a detector of distribution shift in its own right. When test inputs drift from training, the estimated loss rises systematically, growing monotonically with corruption severity on benchmarks like CIFAR-10-C and flagging atypical handwriting styles when a digit model trained on MNIST meets USPS. The same signal that makes the explanations robust also reports that the ground has moved.

PRoFILE in action. The loss estimator ranks features by their causal effect on the prediction; masking the top-ranked features flips the model’s decision as expected, and the resulting importance maps stay consistent even on shifted, corrupted inputs where LIME and SHAP degrade.

The result is feature attributions that are faithful to the model, cheap to compute, agnostic to the masking strategy, and, crucially, stable under the distribution shifts where conventional explainers fall apart.

DECIDER: Why a Whole Classifier Fails, in Words

PRoFILE characterizes shift at the level of features. DECIDER zooms out to the level of the entire model and asks a blunter question: given a test image, will this classifier fail, and if so, can we say why in human terms? The obstacle is that the things that cause failure, spurious correlations, class imbalance, domain and corruption shifts, are usually nuisance attributes that are hard to name using visual features alone.

DECIDER’s move is to bring in foundation models as a source of prior knowledge. A large language model is queried for the core, task-relevant attributes of each class, the things that should genuinely define a category. A vision-language model like CLIP then aligns image features to those textual attributes, producing a debiased twin of the original classifier called a Prior-Induced Model. The twin is built to lean on the attributes that ought to matter, rather than whatever shortcut the original model happened to learn.

DECIDER builds a debiased twin of a classifier using language priors (an LLM lists core attributes) and a vision-language model (CLIP aligns image features to those attributes). Disagreement between the original model and the twin flags likely failures, and an attribute-ablation step explains them.

Failure detection then becomes a comparison. When the original classifier and its debiased twin disagree on a sample, that sample is flagged as a likely failure, the intuition being that a prediction resting on nuisance shortcuts will not survive alignment to core attributes. Better still, DECIDER can explain the failure: by adjusting how much each attribute is weighted until the twin’s prediction matches the original model’s, it surfaces which attributes the original model leaned on, giving a human-readable account of the mistake.

The DECIDER pipeline at a glance: language and vision-language priors construct the prior-induced model, and the cross-entropy disagreement between it and the task model is the failure score.
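
The failure score itself is a one-liner. In this sketch the direction of the cross-entropy (task model as the reference distribution), the function name, and the threshold of 1.0 are assumptions; the text only states that cross-entropy disagreement between the two models is the score.

```python
import numpy as np

def disagreement_score(task_probs, pim_probs, eps=1e-12):
    """Per-sample cross-entropy between the task model and its debiased twin."""
    return -(task_probs * np.log(pim_probs + eps)).sum(axis=1)

# Two toy samples: the twin agrees on the first and disagrees on the second.
task_probs = np.array([[0.9, 0.1],
                       [0.9, 0.1]])
pim_probs = np.array([[0.9, 0.1],
                      [0.1, 0.9]])

scores = disagreement_score(task_probs, pim_probs)
flags = scores > 1.0          # hypothetical threshold for "likely failure"
```

Here the first sample scores about 0.33 and the second about 2.08, so only the second is flagged. In practice the threshold would have to be chosen on validation data.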

Across spurious-correlation benchmarks, class imbalance, and covariate shifts, DECIDER delivers state-of-the-art failure detection and strikes a better balance between catching failures and not over-flagging successes than score-based baselines or ensemble-disagreement methods. It is also robust to imperfect attribute lists. Revealingly, swapping the carefully aligned twin for a raw zero-shot CLIP classifier destroys the signal, confirming that the disagreement works precisely because the twin shares the task model’s backbone while correcting its biases.

Seen together, PRoFILE and DECIDER bracket the problem of distribution shift. One tells you which input features a prediction depends on and keeps that answer trustworthy as the data drifts; the other tells you whether an entire model is about to break under a shift and names the attributes responsible. Both turn an opaque reaction to shifted data into something a practitioner can inspect and act on.

Part 4: Adapting to the Shift, at Test Time

References:
Domain Alignment Meets Fully Test-Time Adaptation (CATTAn), ACML 2022
Single-Shot Domain Adaptation via Target-Aware Generative Augmentations (SiSTA), ICML 2023

The series has, so far, built a model that knows its own uncertainty, can flag when it is about to fail, and can describe how the test distribution has drifted. The closing question is the most actionable one: once we know a shift has happened, can we do something about it on the spot, without retraining from scratch and often without any access to the original training data? This is test-time adaptation, and it is where the reliability story turns from diagnosis to repair.

The two methods here cover the two situations you actually face in deployment. Sometimes you have a decent batch of unlabeled data from the new domain and just need to realign the model to it. Other times you have almost nothing, a single example of the new domain, and must manufacture the rest. CATTAn handles the first; SiSTA handles the second.

CATTAn: Realign the Features You Already Have

Fully test-time adaptation is deliberately strict about what it assumes. You get a trained model and a stream of unlabeled target data, and crucially no access to the source training data, which may be too large to ship or too sensitive to share. Existing methods in this setting, like entropy minimization, adapt by sharpening the model’s own predictions, but they leave a powerful tool from classical domain adaptation on the table: explicitly aligning the source and target feature distributions. They skip it because alignment seems to require the source data.

CATTAn’s insight is that you do not need the source data, only a compact geometric summary of it. During training you compute a low-dimensional subspace that captures where source features live and store just its basis, a couple of megabytes rather than gigabytes of data, with no way to reconstruct individual training examples. At test time you fit the analogous subspace to the target features and learn a simple linear map that rotates the target subspace onto the source one. The frozen classifier, which already knows how to read source-aligned features, can then be reused as-is on the realigned target.
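
To give a feel for the geometry, the sketch below uses the classical closed-form subspace alignment, in which the map is the product of the transposed target basis and the source basis. This is a swapped-in stand-in: CATTAn learns its map as part of the adaptation objective, and the function names, the subspace dimension k = 4, and the toy features are all assumptions.

```python
import numpy as np

def pca_basis(features, k):
    """Top-k principal directions (as columns) of mean-centered features."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T                           # shape (d, k)

def align_target(target_feats, source_basis):
    """Project target features and map them into the stored source subspace."""
    k = source_basis.shape[1]
    target_basis = pca_basis(target_feats, k)
    m = target_basis.T @ source_basis         # (k, k) closed-form alignment map
    centered = target_feats - target_feats.mean(axis=0)
    return centered @ target_basis @ m

rng = np.random.default_rng(0)
source_feats = rng.normal(size=(200, 16)) @ rng.normal(size=(16, 16))
source_basis = pca_basis(source_feats, k=4)   # the only thing kept from the source
target_feats = source_feats[:64] * 1.5 + 0.3  # toy "shifted" target batch
aligned = align_target(target_feats, source_basis)
```

The stored object is a d-by-k matrix of floats, which is why the footprint stays small, and individual source samples cannot be read back out of it.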

CATTAn bridges classical domain alignment and source-free test-time adaptation. Only a low-rank subspace basis of the source features is stored. At test time the target subspace is aligned to it, and the frozen classifier is reused on the realigned features, with only batch-norm parameters updated.

Because alignment is folded into the adaptation objective alongside prediction-calibration terms, and only the batch-norm parameters and the small alignment map are updated, the overhead is minimal. The payoff is consistent: CATTAn improves on strong test-time adaptation baselines across image-corruption robustness and standard domain-adaptation benchmarks, extends to 3D point clouds and transformer features, and holds up even when the target batch is small or the source model was already robustly trained. It even admits a simple post-hoc check that detects when an incoming sample actually belongs to the source domain, so the alignment can be switched off and source accuracy preserved.

SiSTA: Manufacture the Target You Don’t Have

CATTAn assumes a reasonable amount of unlabeled target data. SiSTA confronts the opposite extreme: a single example from the target domain. With one image you cannot estimate a subspace or minimize entropy reliably; generic augmentations like cropping and color jitter cannot conjure a large semantic shift such as photo-to-sketch. The data simply is not there.

SiSTA’s answer is to generate it. Starting from a generative model of the source domain, a StyleGAN, it fine-tunes the generator to the target using just the one available example, so the generator now produces images with the target’s style. Then comes the clever part: rather than naively sampling, SiSTA prunes the generator’s internal activations to deliberately diversify what it produces, either zeroing low activations or rewinding them to their source-model values. This sweeps out a spread of plausible target-domain images instead of minor variations on a single one.
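
The two pruning strategies can be sketched on a single activation tensor. How "low" is defined is not specified in the text, so the percentile threshold below is an assumption, as are the function names and the toy values.

```python
import numpy as np

def prune_zero(target_act, percentile=50):
    """Zero out activations that fall below the chosen percentile."""
    thresh = np.percentile(target_act, percentile)
    return np.where(target_act < thresh, 0.0, target_act)

def prune_rewind(target_act, source_act, percentile=50):
    """Replace low activations with the source generator's values at the same positions."""
    thresh = np.percentile(target_act, percentile)
    return np.where(target_act < thresh, source_act, target_act)

# Toy activations from the fine-tuned (target) and original (source) generators.
target_act = np.array([0.1, 0.9, 0.2, 0.8])
source_act = np.array([1.0, 2.0, 3.0, 4.0])

zeroed = prune_zero(target_act)                 # [0.0, 0.9, 0.0, 0.8]
rewound = prune_rewind(target_act, source_act)  # [1.0, 0.9, 3.0, 0.8]
```

In a real generator this would presumably be applied layer by layer during sampling, with the percentile acting as a knob for how far the samples spread from the single target example.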

SiSTA in four steps: take a source classifier and source StyleGAN, fine-tune the generator to the target domain using a single example, generate a diverse synthetic target set via activation pruning, and use it to adapt the classifier with any source-free method.

The resulting synthetic dataset, though grown from one real example, is diverse enough to drive ordinary source-free adaptation of the classifier. The pruning strategies are what make the difference between a useful spread of target-like samples and a near-duplicate collection.

Activation pruning controls the diversity of the synthetic target set. Compared to naive sampling, the prune-zero and prune-rewind strategies produce more varied target-domain reconstructions from the same source image, which is what makes single-shot adaptation work.

On face-attribute tasks under increasingly severe shifts, this single-shot recipe improves substantially over generic-augmentation baselines, by as much as twenty points in the hardest cases, and lands within a few points of an oracle that was allowed to adapt using the entire target dataset. From one example, SiSTA recovers most of the benefit of having the whole target domain in hand.

Closing the Loop

Step back and the four parts form a single reliability pipeline. Anchoring gives a model a calibrated sense of its own uncertainty. That uncertainty, joined with a manifold-conformity signal, becomes a precise account of when the model will fail. Feature- and model-level diagnostics then explain how a distribution has shifted and which attributes are responsible. And finally, test-time adaptation acts on that knowledge, realigning the model when target data is plentiful and synthesizing it when it is scarce. None of these steps demands retraining from scratch or shipping the original training data, which is precisely what makes the whole chain practical for models that are already deployed. Trustworthy machine learning, in this telling, is less about any single trick than about closing the loop from knowing you might be wrong to doing something about it.