Identifying Interactions at Scale for LLMs
Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process more transparent to model builders and impacted humans, a step toward safer and more trustworthy AI. To gain a comprehensive understanding, we can analyze these systems through different lenses: feature attribution, which isolates the specific input features driving a prediction (Lundberg & Lee, 2017; Ribeiro et al., 2022); data attribution, which links model behaviors to influential training examples (Koh & Liang, 2017; Ilyas et al., 2022); and mechanistic interpretability, which dissects the functions of internal components (Conmy et al., 2023; Sharkey et al., 2025).
Across these perspectives, the same fundamental hurdle persists: complexity at scale. Model behavior is rarely the result of isolated components; rather, it emerges from dependencies and patterns that span many of them. To achieve state-of-the-art performance, models synthesize relationships between features, find shared patterns across diverse training examples, and process information through highly interconnected internal components.
Therefore, interpretability methods that are grounded in how models actually behave must also be able to capture these influential interactions. As the number of features, training data points, and model components grows, the number of potential interactions grows exponentially, making exhaustive analysis computationally infeasible. In this blog post, we describe the fundamental ideas behind SPEX and ProxySPEX, algorithms capable of identifying these critical interactions at scale.
Attribution through Ablation
Central to our approach is the concept of ablation, measuring influence by observing what changes when a component is removed.
Feature Attribution: We mask or remove specific segments of the input prompt and measure the resulting shift in the predictions.
Data Attribution: We train models on different subsets of the training set, assessing how the model’s output on a test point shifts in the absence of specific training data.
Model Component Attribution (Mechanistic Interpretability): We intervene on the model’s forward pass by removing the influence of specific internal components, determining which internal structures are responsible for the model’s prediction.
In each case, the goal is the same: to isolate the drivers of a decision by systematically perturbing the system, in hopes of discovering influential interactions. Since each ablation incurs a significant cost, whether through expensive inference calls or retrainings, we aim to compute attributions with the fewest possible ablations.
Masking different parts of the input, we measure the difference between the original and ablated outputs.
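To make the ablation idea concrete, here is a minimal sketch of single-token ablation for feature attribution. The scoring function `toy_model` and the `[MASK]` placeholder are hypothetical stand-ins for an LLM call and its masking convention; they are assumptions for illustration, not part of SPEX.

```python
from typing import Callable, List, Sequence


def toy_model(tokens: Sequence[str]) -> float:
    """Hypothetical stand-in for an LLM sentiment score."""
    score = 0.0
    if "good" in tokens:
        score += 1.0
    # A negation flips the effect of "good": an interaction between two tokens.
    if "not" in tokens and "good" in tokens:
        score -= 2.0
    return score


def ablation_attributions(
    tokens: Sequence[str],
    model: Callable[[Sequence[str]], float],
    mask_token: str = "[MASK]",
) -> List[float]:
    """Score each token by how much the output changes when it is masked."""
    baseline = model(tokens)
    scores = []
    for i in range(len(tokens)):
        ablated = list(tokens)
        ablated[i] = mask_token  # remove one token, keep the rest intact
        scores.append(baseline - model(ablated))
    return scores


tokens = ["the", "movie", "was", "not", "good"]
attributions = ablation_attributions(tokens, toy_model)
for token, score in zip(tokens, attributions):
    print(f"{token:>6}: {score:+.1f}")
```

Each call to `model` corresponds to one ablation, so this loop costs one model call per token plus the baseline. Single-token ablations like these assign scores to "not" and "good" separately but cannot reveal that the two act jointly, which is the gap interaction discovery aims to fill.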
SPEX and ProxySPEX Framework
To discover influential interactions with a tractable number of ablations, we have developed SPEX (Spectral Explainer). This framework draws on signal processing and coding theory to advance interaction discovery to scales orders of magnitude greater than prior methods. SPEX circumvents the exponential growth in candidate interactions by exploiting a key structural observation: while the total number of interactions is prohibitively large, the number of influential interactions is actually quite small.
We formalize this through two observations: sparsity (relatively few interactions truly drive the output) and low-degreeness (influential interactions typically involve only a small subset of features). These properties allow us to reframe the difficult search problem into a solvable sparse recovery problem. Drawing on powerful tools from signal processing and coding theory, SPEX uses strategically selected ablations to combine many candidate interactions together. Then, using efficient decoding algorithms, we disentangle these combined signals to isolate the specific interactions responsible for the model’s behavior.
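The two properties can be seen on a toy example. The sketch below brute-forces every interaction coefficient of a small, hypothetical value function using the Möbius transform, which is one common way to define interaction coefficients; the choice of basis and the function `value` are assumptions made for illustration. This exhaustive enumeration is exactly what SPEX is designed to avoid, so it should not be read as the SPEX algorithm.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable, Iterator, List


def all_subsets(items: Iterable[int]) -> Iterator[FrozenSet[int]]:
    """Yield every subset of `items`, from the empty set upward."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def mobius_transform(
    value: Callable[[FrozenSet[int]], float], features: List[int]
) -> Dict[FrozenSet[int], float]:
    """Compute a(S) = sum over T subset of S of (-1)^(|S|-|T|) * value(T)."""
    coeffs = {}
    for s in all_subsets(features):
        total = 0.0
        for t in all_subsets(sorted(s)):
            total += (-1) ** (len(s) - len(t)) * value(t)
        if abs(total) > 1e-12:  # keep only the non-zero interactions
            coeffs[s] = total
    return coeffs


def value(kept: FrozenSet[int]) -> float:
    """Hypothetical model output when only the features in `kept` are unmasked."""
    out = 0.0
    if 0 in kept:
        out += 1.0  # one main effect
    if 2 in kept and 3 in kept:
        out += 3.0  # one pairwise interaction
    return out


features = list(range(6))
coeffs = mobius_transform(value, features)
print(f"{len(coeffs)} non-zero coefficients out of {2 ** len(features)} candidates")
for subset, coeff in coeffs.items():
    print(sorted(subset), coeff)
```

Here only 2 of the 64 candidate subsets carry a non-zero coefficient (sparsity), and neither involves more than two features (low degree). With thousands of features the candidate count makes this loop impossible, which is why a sparse recovery formulation is needed.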
In a subsequent algorithm, ProxySPEX, we identified another structural property common in complex machine learning models: hierarchy. This means that where a higher-order interaction is important, its lower-order subsets are likely to be important as well. Exploiting this additional structure yields a dramatic reduction in computational cost: ProxySPEX matches the performance of SPEX with around 10x fewer ablations. Collectively, these frameworks enable efficient interaction discovery, unlocking new applications in feature, data, and model component attribution.
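One way to see why hierarchy helps is that it shrinks the set of higher-order interactions worth testing. The sketch below is not the ProxySPEX algorithm; it is a simplified, hypothetical candidate filter that applies a strict version of the hierarchy assumption (every lower-order subset must already be marked important), with made-up pairs as input.

```python
from itertools import combinations
from typing import FrozenSet, List, Set


def hierarchical_candidates(
    important: Set[FrozenSet[int]], order: int
) -> List[FrozenSet[int]]:
    """Return size-`order` sets whose every (order - 1)-subset is important."""
    lower = {s for s in important if len(s) == order - 1}
    # Only features that appear in some important lower-order set can qualify.
    features = sorted(set().union(*lower)) if lower else []
    candidates = []
    for combo in combinations(features, order):
        subsets = combinations(combo, order - 1)
        if all(frozenset(sub) in lower for sub in subsets):
            candidates.append(frozenset(combo))
    return candidates


# Hypothetical pairwise interactions already found to be important.
important_pairs = {frozenset(p) for p in [(0, 1), (0, 2), (1, 2), (2, 3)]}
triples = hierarchical_candidates(important_pairs, order=3)
print([sorted(t) for t in triples])
```

Of the four possible triples over features 0 to 3, only one survives the filter, so only that triple would need to be checked with further ablations. The post states that lower-order subsets are likely, not guaranteed, to be important, so a real method would use the property more softly than this filter does.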
Feature Attribution
Feature attribution techniques assign importance scores to input features based on their influence on the model’s output. For example, if an LLM were used to make a medical diagnosis, this approach could identify exactly which symptoms led the model to its conclusion. While attributing importance to individual features can be valuable, the true power of sophisticated models lies in their ability to capture complex relationships between features. The figure below illustrates examples of these influential interactions: from a double negative changing sentiment (left) to the necessary synthesis of multiple documents in a RAG task (right).
The figure below illustrates the feature attribution performance of SPEX on a sentiment analysis task. We evaluate performance using faithfulness: a measure of how accurately the recovered attributions can predict the model’s output on unseen test ablations. We find that SPEX matches the high faithfulness of existing interaction techniques (Faith-Shap, Faith-Banzhaf) on short inputs, but uniquely retains this performance as the context scales to thousands of features. In contrast, while marginal approaches (LIME, Banzhaf) can also operate at this scale, they exhibit significantly lower faithfulness because they fail to capture the complex interactions driving the model’s output.
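As a rough illustration of faithfulness, the sketch below scores two hand-written surrogates against a toy model using the coefficient of determination (R²). Treating faithfulness as R² is an assumption here, as are the toy model and both surrogates; for brevity the toy evaluates on all eight masks, whereas in practice the masks would be held out from fitting.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Mask = Tuple[int, ...]


def faithfulness(true_outputs: Sequence[float], predicted: Sequence[float]) -> float:
    """R^2 between model outputs and surrogate predictions on test ablations."""
    mean = sum(true_outputs) / len(true_outputs)
    ss_res = sum((t - p) ** 2 for t, p in zip(true_outputs, predicted))
    ss_tot = sum((t - mean) ** 2 for t in true_outputs)
    return 1.0 - ss_res / ss_tot


def model(m: Mask) -> float:
    """Hypothetical model: one main effect plus one pairwise interaction."""
    return m[0] + 2.0 * m[1] * m[2]


def marginal_surrogate(m: Mask) -> float:
    """Additive surrogate: the best it can do is approximate the interaction."""
    return m[0] + m[1] + m[2] - 0.5


def interaction_surrogate(m: Mask) -> float:
    """Surrogate that includes the pairwise term."""
    return m[0] + 2.0 * m[1] * m[2]


def evaluate(surrogate: Callable[[Mask], float], masks: List[Mask]) -> float:
    truth = [model(m) for m in masks]
    preds = [surrogate(m) for m in masks]
    return faithfulness(truth, preds)


test_masks = list(product([0, 1], repeat=3))  # 1 = feature kept, 0 = ablated
faith_marginal = evaluate(marginal_surrogate, test_masks)
faith_interaction = evaluate(interaction_surrogate, test_masks)
print(f"marginal: {faith_marginal:.2f}, interaction: {faith_interaction:.2f}")
```

The additive surrogate loses faithfulness precisely because it has no term for the joint effect of the second and third features, mirroring the gap between marginal and interaction methods described above.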
SPEX was also applied to a modified version of the trolley problem, where the moral ambiguity of the problem is removed, making “True” the clear correct answer. Given the modification below, GPT-4o mini answered correctly only 8% of the time. When we applied standard feature attribution (SHAP), it identified individual instances of the word trolley as the primary factors driving the incorrect response. However, replacing trolley with synonyms such as tram or streetcar had little impact on the prediction of the model. SPEX revealed a much richer story, identifying a dominant high-order synergy between the two instances of trolley, as well as the words pulling and lever, a finding that aligns with human intuition about the core components of the dilemma. When these four words were replaced with synonyms, the model’s failure rate dropped to near zero.
Data Attribution
Data attribution identifies which training data points are most responsible for a model’s prediction on a new test point. Identifying influential interactions between these data points is key to explaining unexpected model behaviors. Redundant interactions, such as semantic duplicates, often reinforce specific (and possibly incorrect) concepts, while synergistic interactions are essential for defining decision boundaries that no single sample could form alone. To demonstrate this, we applied ProxySPEX to a ResNet model trained on CIFAR-10, identifying the most significant examples of both interaction types for a variety of difficult test points, as shown in the figure below.
As illustrated, synergistic interactions (left) often involve semantically distinct classes working together to define a decision boundary. For example, grounding the synergy in human perception, the automobile (bottom left) shares visual traits with the provided training images, including the low-profile chassis of the sports car, the boxy shape of the yellow truck, and the horizontal stripe of the red delivery vehicle. On the other hand, redundant interactions (right) tend to capture visual duplicates that reinforce a specific concept. For instance, the horse prediction (middle right) is heavily influenced by a cluster of dog images with similar silhouettes. This fine-grained analysis allows for the development of new data selection techniques that preserve necessary synergies while safely removing redundancies.
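The distinction between synergy and redundancy can be sketched with a pairwise interaction score. The sign convention below (positive for synergy, negative for redundancy) and the utility numbers are assumptions for illustration; the utilities stand in for a test-point metric obtained after training on each subset, and are not results from the CIFAR-10 experiment.

```python
from typing import Dict, FrozenSet


def pairwise_interaction(
    utility: Dict[FrozenSet[str], float], a: str, b: str
) -> float:
    """v({a, b}) - v({a}) - v({b}) + v({}): the joint effect beyond the parts."""
    return (
        utility[frozenset({a, b})]
        - utility[frozenset({a})]
        - utility[frozenset({b})]
        + utility[frozenset()]
    )


def classify(score: float, tol: float = 1e-9) -> str:
    if score > tol:
        return "synergy"
    if score < -tol:
        return "redundancy"
    return "none"


# Hypothetical utilities: model quality on one test point for each training subset.
utility = {
    frozenset(): 0.0,
    # Near-duplicate images: either one alone gives the full benefit.
    frozenset({"dup_1"}): 0.6,
    frozenset({"dup_2"}): 0.6,
    frozenset({"dup_1", "dup_2"}): 0.6,
    # Complementary images: little benefit alone, large benefit together.
    frozenset({"car"}): 0.1,
    frozenset({"truck"}): 0.1,
    frozenset({"car", "truck"}): 0.8,
}

duplicate_label = classify(pairwise_interaction(utility, "dup_1", "dup_2"))
complement_label = classify(pairwise_interaction(utility, "car", "truck"))
print(duplicate_label, complement_label)
```

Under this convention, a data selection rule could drop one member of a redundant pair at little cost while keeping both members of a synergistic pair.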
Attention Head Attribution (Mechanistic Interpretability)
The goal of model component attribution is to identify which internal parts of the model, such as specific layers or attention heads, are most responsible for a particular behavior. Here too, ProxySPEX uncovers the responsible interactions between different parts of the architecture. Understanding these structural dependencies is vital for architectural interventions, such as task-specific attention head pruning. On an MMLU dataset (highschool‐us‐history), we demonstrate that a ProxySPEX-informed pruning strategy not only outperforms competing methods, but can actually improve model performance on the target task.
On this task, we also analyzed the interaction structure across the model’s depth. We observe that early layers function in a predominantly linear regime, where heads contribute largely independently to the target task. In later layers, the role of interactions between attention heads becomes more pronounced, with most of the contribution coming from interactions among heads in the same layer.
What’s Next?
The SPEX framework represents a significant step forward for interpretability, extending interaction discovery from dozens to thousands of components. We have demonstrated the versatility of the framework across the entire model lifecycle: exploring feature attribution on long-context inputs, identifying synergies and redundancies among training data points, and discovering interactions between internal model components. Looking ahead, many interesting research questions remain around unifying these different perspectives to provide a more holistic understanding of a machine learning system. It is also of great interest to systematically evaluate interaction discovery methods against existing scientific knowledge in fields such as genomics and materials science, both to ground model findings and to generate new, testable hypotheses.
We invite the research community to join us in this effort: the code for both SPEX and ProxySPEX is fully integrated and available within the popular SHAP-IQ repository.
https://github.com/mmschlk/shapiq (SHAP-IQ Github)
https://openreview.net/forum?id=KI8qan2EA7 (ProxySPEX NeurIPS 2025)
https://openreview.net/forum?id=pRlKbAwczl (SPEX ICML 2025)
https://openreview.net/forum?id=glGeXu1zG4 (Learning to Understand NeurIPS 2024)
[2026-03-13]
Information-Driven Design of Imaging Systems
An encoder (optical system) maps objects to noiseless images, which noise corrupts into measurements. Our information estimator uses only these noisy measurements and a noise model to quantify how well measurements distinguish objects.
Many imaging systems produce measurements that humans never see or cannot interpret directly. Your smartphone processes raw sensor data through algorithms before producing the final photo. MRI scanners collect frequency-space measurements that require reconstruction before doctors can view them. Self-driving cars process camera and LiDAR data directly with neural networks.
What matters in these systems is not how measurements look, but how much useful information they contain. AI can extract this information even when it is encoded in ways that humans cannot interpret.
And yet we rarely evaluate information content directly. Traditional metrics like resolution and signal-to-noise ratio assess individual aspects of quality separately, making it difficult to compare systems that trade off between these factors. The common alternative, training neural networks to reconstruct or classify images, conflates the quality of the imaging hardware with the quality of the algorithm.
We developed a framework that enables direct evaluation and optimization of imaging systems based on their information content. In our NeurIPS 2025 paper, we show that this information metric predicts system performance across four imaging domains, and that optimizing it produces designs that match state-of-the-art end-to-end methods while requiring less memory, less compute, and no task-specific decoder design.
Why mutual information?
Mutual information quantifies how much a measurement reduces uncertainty about the object that produced it. Two systems with the same mutual information are equivalent in their ability to distinguish objects, even if their measurements look completely different.
This single number captures the combined effect of resolution, noise, sampling, and all other factors that affect measurement quality. A blurry, noisy image that preserves the features needed to distinguish objects can contain more information than a sharp, clean image that loses those features.
Information unifies traditionally separate quality metrics. It accounts for noise, resolution, and spectral sensitivity together rather than treating them as independent factors.
Previous attempts to apply information theory to imaging took one of two approaches, each with a serious limitation. The first treated imaging systems as unconstrained communication channels, ignoring the physical limitations of lenses and sensors, which produced wildly inaccurate estimates. The second required explicit models of the objects being imaged, limiting generality.
Our method avoids both problems by estimating information directly from measurements.
Estimating information from measurements
Estimating mutual information between high-dimensional variables is notoriously difficult. Sample requirements grow exponentially with dimensionality, and estimates suffer from high bias and variance.
However, imaging systems have properties that enable decomposing this hard problem into simpler subproblems. Mutual information can be written as:
\[I(X; Y) = H(Y) - H(Y \mid X)\]
The first term, $H(Y)$, measures total variation in measurements from both object differences and noise. The second term, $H(Y \mid X)$, measures variation from noise alone.
Mutual information equals the difference between total measurement variation and noise-only variation.
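The decomposition can be checked by hand on a small discrete example. The sketch below uses a hypothetical two-object, two-measurement system in which noise flips the measurement 10% of the time; these numbers are assumptions chosen only to make the arithmetic easy to follow.

```python
import math
from typing import List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(
    p_x: Sequence[float], p_y_given_x: List[List[float]]
) -> Tuple[float, float, float]:
    """Return (I(X;Y), H(Y), H(Y|X)) for a discrete object/measurement pair."""
    num_y = len(p_y_given_x[0])
    # Marginal over measurements: mixes object differences and noise.
    p_y = [
        sum(p_x[i] * p_y_given_x[i][j] for i in range(len(p_x)))
        for j in range(num_y)
    ]
    h_y = entropy(p_y)
    # Noise-only variation: entropy of the measurement for a fixed object.
    h_y_given_x = sum(p_x[i] * entropy(p_y_given_x[i]) for i in range(len(p_x)))
    return h_y - h_y_given_x, h_y, h_y_given_x


p_x = [0.5, 0.5]                    # two equally likely objects
p_y_given_x = [[0.9, 0.1],          # object 0 is measured correctly 90% of the time
               [0.1, 0.9]]          # object 1 likewise
mi, h_y, h_y_given_x = mutual_information(p_x, p_y_given_x)
print(f"H(Y) = {h_y:.3f} bits, H(Y|X) = {h_y_given_x:.3f} bits, I = {mi:.3f} bits")
```

A noiseless system would have $H(Y \mid X) = 0$ and deliver the full 1 bit; the 10% flip rate removes roughly 0.47 bits of it.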
Imaging systems have well-characterized noise. Photon shot noise follows a Poisson distribution. Electronic readout noise is Gaussian. This known noise physics means we can compute $H(Y \mid X)$ directly, leaving only $H(Y)$ to be learned from data.
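As a minimal sketch of what computing the noise term directly can look like, the functions below evaluate the entropy of the two noise models named above. The function names, the independence across pixels, and the truncation of the Poisson sum are assumptions for illustration, not the paper's implementation.

```python
import math


def gaussian_noise_entropy(sigma: float, num_pixels: int) -> float:
    """Differential entropy (nats) of independent Gaussian read noise per image."""
    return num_pixels * 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def poisson_entropy(mean_photons: float, tail: float = 12.0) -> float:
    """Entropy (nats) of a Poisson count, summed numerically over its support."""
    upper = int(mean_photons + tail * math.sqrt(mean_photons) + 20)
    total = 0.0
    for k in range(upper + 1):
        # log of exp(-lam) * lam^k / k!, computed stably with lgamma
        log_p = -mean_photons + k * math.log(mean_photons) - math.lgamma(k + 1)
        total -= math.exp(log_p) * log_p
    return total


h_read = gaussian_noise_entropy(sigma=1.0, num_pixels=1)
h_shot = poisson_entropy(mean_photons=100.0)
# For large photon counts, shot noise is close to a Gaussian with variance = mean.
h_shot_approx = 0.5 * math.log(2 * math.pi * math.e * 100.0)
print(f"read noise: {h_read:.4f} nats, shot noise: {h_shot:.4f} nats "
      f"(Gaussian approximation {h_shot_approx:.4f})")
```

For Gaussian read noise the entropy does not depend on the object, so $H(Y \mid X)$ is a single closed-form number. For shot noise the entropy depends on the photon count at each pixel, so the conditional entropy would need to be averaged over the noiseless images in the dataset.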
For $H(Y)$, we fit a probabilistic model (e.g. a transformer or other autoregressive model) to a dataset of measurements. The model learns the distribution of all possible measurements. We tested three models spanning efficiency-accuracy tradeoffs: a stationary Gaussian process (fastest), a full Gaussian (intermediate), and an autoregressive PixelCNN (most accurate). Because the cross-entropy of a fitted model can never be smaller than the true entropy, the approach provides an upper bound on true information: modeling error can only cause overestimation, never underestimation.
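Putting the two terms together, here is a toy one-pixel version of the estimator, assuming a Gaussian model for the measurements and Gaussian noise of known standard deviation. The function name and simulated data are hypothetical, and the Gaussian signal is chosen so the estimate can be compared with a closed-form answer.

```python
import math
import random
from typing import Sequence


def estimate_information_1d(measurements: Sequence[float], noise_sigma: float) -> float:
    """Estimate I(X;Y) in nats as (model cross-entropy) - (known noise entropy)."""
    n = len(measurements)
    mean = sum(measurements) / n
    var = sum((y - mean) ** 2 for y in measurements) / n
    # Average negative log-likelihood under the fitted Gaussian estimates H(Y).
    h_y = sum(
        0.5 * math.log(2 * math.pi * var) + (y - mean) ** 2 / (2 * var)
        for y in measurements
    ) / n
    # Noise-only term, available in closed form from the noise model.
    h_y_given_x = 0.5 * math.log(2 * math.pi * math.e * noise_sigma ** 2)
    return h_y - h_y_given_x


rng = random.Random(0)
signal_sigma, noise_sigma = 2.0, 1.0
# Only noisy measurements are passed to the estimator, never the objects.
measurements = [
    rng.gauss(0.0, signal_sigma) + rng.gauss(0.0, noise_sigma)
    for _ in range(20000)
]
estimate = estimate_information_1d(measurements, noise_sigma)
exact = 0.5 * math.log(1.0 + signal_sigma ** 2 / noise_sigma ** 2)
print(f"estimated: {estimate:.3f} nats, exact: {exact:.3f} nats")
```

In this toy the Gaussian model matches the true measurement distribution, so the estimate lands close to the exact value. With a mismatched model the cross-entropy term would be larger, pushing the estimate upward. The sketch scores the model on the same samples it was fit to for brevity; evaluating on held-out measurements would be the safer choice.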
Validation across four imaging domains
Information estimates should predict decoder performance if they capture what limits real systems. We tested this relationship across four imaging applications.
Information estimates predict decoder performance across color photography, radio astronomy, lensless imaging, and microscopy. Higher information consistently produces better results on downstream tasks.
Color photography. Digital cameras encode color using filter arrays that restrict each pixel to detect only certain wavelengths. We compared three filter designs: the traditional Bayer pattern, a random arrangement, and a learned arrangement. Information estimates correctly ranked which designs would produce better color reconstructions, matching the rankings from neural network demosaicing without requiring any reconstruction algorithm.
Radio astronomy. Telescope arrays achieve high angular resolution by combining signals from sites across the globe. Selecting optimal telescope locations is computationally intractable because each site’s value depends on all others. Information estimates predicted reconstruction quality across telescope configurations, enabling site selection without expensive image reconstruction.
Lensless imaging. Lensless cameras replace traditional optics with light-modulating masks. Their measurements bear no visual resemblance to scenes. Information estimates predicted reconstruction accuracy across a lens, microlens array, and diffuser design at various noise levels.
Microscopy. LED array microscopes use programmable illumination to generate different contrast modes. Information estimates correlated with neural network accuracy at predicting protein expression from cell images, enabling evaluation without expensive protein labeling experiments.
In all cases, higher information meant better downstream performance.
Designing systems with IDEAL
Information estimates can do more than evaluate existing systems. Our Information-Driven Encoder Analysis Learning (IDEAL) method uses gradient ascent on information estimates to optimize imaging system parameters.
IDEAL optimizes imaging system parameters through gradient feedback on information estimates, without requiring a decoder network.
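The idea of gradient ascent on an information estimate can be sketched on a toy encoder with a single parameter. Everything here is an assumption for illustration: the scene has two independent Gaussian components, the encoder measures one linear combination of them set by an angle `theta`, and information is computed in closed form and differentiated by finite differences, rather than through the learned estimators described in this post.

```python
import math
from typing import Tuple


def information(
    theta: float,
    signal_vars: Tuple[float, float] = (4.0, 1.0),
    noise_var: float = 1.0,
) -> float:
    """Closed-form I(X;Y) in nats for y = cos(theta)*x1 + sin(theta)*x2 + noise."""
    captured = (
        math.cos(theta) ** 2 * signal_vars[0]
        + math.sin(theta) ** 2 * signal_vars[1]
    )
    return 0.5 * math.log(1.0 + captured / noise_var)


def gradient_ascent(
    theta: float, learning_rate: float = 0.5, steps: int = 200, eps: float = 1e-5
) -> float:
    """Maximize information over the encoder parameter; no decoder is involved."""
    for _ in range(steps):
        # Central finite difference stands in for automatic differentiation.
        grad = (information(theta + eps) - information(theta - eps)) / (2 * eps)
        theta += learning_rate * grad
    return theta


theta_start = 1.0
theta_opt = gradient_ascent(theta_start)
print(f"start: theta={theta_start:.3f}, information={information(theta_start):.3f} nats")
print(f"final: theta={theta_opt:.3f}, information={information(theta_opt):.3f} nats")
```

The optimizer rotates the encoder toward the higher-variance scene component, which is the direction that carries the most information per measurement. The only quantity it ever evaluates is the information objective, which is the property that removes the decoder from the training loop.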
The standard approach to computational imaging design, end-to-end optimization, jointly trains the imaging hardware and a neural network decoder. This requires backpropagating through the entire decoder, creating memory constraints and potential optimization difficulties.
IDEAL avoids these problems by optimizing the encoder alone. We tested it on color filter design. Starting from a random filter arrangement, IDEAL progressively improved the design. The final result matched end-to-end optimization in both information content and reconstruction quality.
IDEAL matches end-to-end optimization performance while avoiding decoder complexity during training.
Implications
Information-based evaluation creates new possibilities for rigorous assessment of imaging systems in real-world conditions. Current approaches require either subjective visual assessment, ground truth data that is unavailable in deployment, or isolated metrics that miss overall capability. Our method provides an objective, unified metric from measurements alone.
The computational efficiency of IDEAL suggests possibilities for designing imaging systems that were previously intractable. By avoiding decoder backpropagation, the approach reduces memory requirements and training complexity. We explore these capabilities more extensively in follow-on work.
The framework may extend beyond imaging to other sensing domains. Any system that can be modeled as deterministic encoding with known noise characteristics could benefit from information-based evaluation and design, including electronic, biological, and chemical sensors.
This post is based on our NeurIPS 2025 paper “Information-driven design of imaging systems”. Code is available on GitHub. A video summary is available on the project website.
[2026-01-10]