Opening the black box

Explaining explainability in AI.

I've often found explainability in AI to be an elusive concept—essential, yet difficult to pin down. It underpins trust, adoption, and regulatory compliance, but how does it actually work? A recent paper, Explainable Artificial Intelligence (XAI): From Inherent Explainability to Large Language Models by Fuseini Mumuni and Alhassan Mumuni, traces the evolution of explainability techniques, from inherently interpretable models to LLM-driven explanations. As usual, reading, learning, and writing about it has helped me connect a few more dots.


The Foundations of Explainability

Explainability in AI is about understanding how a model arrives at its decisions. At its core, explainability can be categorized into two approaches: inherent explainability and post-hoc explainability. Inherently explainable models, such as decision trees, linear regression, and generalized additive models (GAMs), offer transparency by design. Their mathematical structure allows direct inspection of how inputs contribute to outputs, making them naturally interpretable without additional tools.
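To make "transparency by design" concrete, here is a toy illustration (my own, not from the paper): a linear model fit with plain NumPy is its own explanation, because each learned coefficient states exactly how a unit change in a feature moves the prediction.

```python
import numpy as np

# Toy data generated from a known rule: y = 2*x0 - 1*x1 + 0.5
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 0.0]])
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5

# Fit by ordinary least squares (append a column of ones for the intercept)
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w0, w1, intercept = coef

# The coefficients ARE the explanation: no external tool needed
print(w0, w1, intercept)  # recovers 2.0, -1.0, 0.5
```

No post-hoc machinery is required here; inspecting the fitted parameters answers "why this prediction" directly, which is exactly what black-box models give up.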

By contrast, post-hoc explainability applies to black-box models—complex architectures like deep neural networks, ensemble methods, and transformer-based models—where decision-making processes are not immediately apparent. These models demand external interpretability techniques to shed light on their inner workings. The need for robust post-hoc explainability arises because black-box models often provide superior predictive performance but at the cost of opacity.


Methods for Understanding Black-Box Models

Post-hoc explainability methods can be broadly classified into feature attribution, counterfactual reasoning, and symbolic approaches.

Feature attribution methods attempt to quantify how much each input variable contributes to a model’s prediction. One of the most mathematically rigorous techniques is SHAP (Shapley Additive Explanations), which draws from cooperative game theory. SHAP assigns each input feature an importance score based on how much it changes the model’s output, computed by evaluating all possible feature combinations. Unlike simpler methods, SHAP ensures consistency and fairness in attribution, making it one of the most robust techniques available.
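A small sketch (mine, not the paper's) makes the game-theoretic idea concrete: exact Shapley values for a toy model, computed by brute-force enumeration of feature coalitions. Practical tools like the shap library approximate this, since exact enumeration is exponential in the number of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating all feature coalitions.
    Exponential in the number of features -- fine for toys, and the
    reason practical implementations rely on approximations."""
    n = len(x)
    phi = [0.0] * n

    def value(S):
        # Features in coalition S take their actual value, the rest the baseline
        z = [x[i] if i in S else baseline[i] for i in range(n)]
        return f(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (value(set(S) | {i}) - value(set(S)))
    return phi

# Toy additive model: feature contributions should be exactly x0 and 2*x1
f = lambda z: z[0] + 2 * z[1]
phi = shapley_values(f, x=[3.0, 1.0], baseline=[0.0, 0.0])
print(phi)  # [3.0, 2.0]
```

The attributions sum to the difference between the model's output at x and at the baseline, which is the "fairness" property (efficiency) that distinguishes SHAP from ad-hoc scoring.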

LIME (Local Interpretable Model-agnostic Explanations) is an alternative technique. Instead of analyzing the entire model, LIME generates local approximations by perturbing input data and fitting a simpler, interpretable model (often a linear regression) around a specific prediction. This allows users to understand why a model made a particular decision but does not provide a global understanding of model behavior.
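The mechanics can be sketched in a few lines (toy code, not the actual lime library): perturb the input, query the black box, weight samples by proximity, and fit a weighted linear surrogate whose coefficients serve as the local explanation.

```python
import numpy as np

def lime_explain(f, x, n_samples=500, width=0.75, seed=0):
    """LIME-style sketch: fit a proximity-weighted linear surrogate
    to the black box f in the neighborhood of a single input x."""
    rng = np.random.default_rng(seed)
    d = len(x)
    Z = x + rng.normal(scale=1.0, size=(n_samples, d))  # perturbed inputs
    y = np.array([f(z) for z in Z])                     # black-box outputs
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / width ** 2)               # proximity kernel
    # Weighted least squares: (A^T W A) beta = A^T W y
    A = np.column_stack([Z, np.ones(n_samples)])
    Aw = A * w[:, None]
    beta = np.linalg.solve(A.T @ Aw, Aw.T @ y)
    return beta[:-1]  # local feature weights (intercept dropped)

# Black box that behaves locally like 4*x0 - 1*x1 near x = (1, 1)
f = lambda z: 2 * z[0] ** 2 - z[1]
weights = lime_explain(f, x=np.array([1.0, 1.0]))
print(weights)  # roughly [4, -1]
```

Note that the surrogate is only valid near the chosen point: repeating this at a different x would yield different weights, which is precisely the local-versus-global limitation described above.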

For deep learning models, particularly those used in computer vision, methods like Grad-CAM (Gradient-weighted Class Activation Mapping) visualize which parts of an image contributed most to a classification decision. Grad-CAM backpropagates gradients from the output layer to the convolutional layers, highlighting the most influential pixels in an image. This approach is particularly useful for validating whether a model is attending to relevant features rather than spurious correlations.
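The combination step at the heart of Grad-CAM fits in a few lines of NumPy, assuming the convolutional activations and their gradients have already been captured (in practice via framework hooks, e.g. in PyTorch); this toy sketch is my own, not code from the paper.

```python
import numpy as np

def grad_cam_map(activations, gradients):
    """Grad-CAM combination step: given a conv layer's activations
    (C, H, W) and the gradients of the target class score with
    respect to them, produce a coarse importance heatmap."""
    # Global-average-pool the gradients to get one weight per channel
    weights = gradients.mean(axis=(1, 2))             # shape (C,)
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize for display
    return cam

# Tiny synthetic example: gradients say only channel 0 matters
acts = np.zeros((2, 4, 4)); acts[0, 1, 1] = 1.0; acts[1, 2, 2] = 1.0
grads = np.zeros((2, 4, 4)); grads[0] = 1.0
cam = grad_cam_map(acts, grads)
print(cam)  # heat concentrated at position (1, 1)
```

In a real pipeline the heatmap is upsampled to the input resolution and overlaid on the image, which is how one checks whether the model attended to the object or to a spurious background cue.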

Then there are counterfactual methods. Instead of attributing importance scores, they ask: What is the minimal change needed to flip a model’s decision? Counterfactual explanations generate hypothetical scenarios where small modifications in input variables lead to different predictions. This method is especially useful in high-stakes applications like finance and healthcare, where users need insights on how to achieve a desired outcome.
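A naive sketch of the idea, using a hypothetical loan-scoring model of my own invention (real counterfactual methods optimize a distance-penalized objective rather than stepping greedily):

```python
import numpy as np

def find_counterfactual(score, x, step=0.05, max_iter=1000):
    """Greedy counterfactual sketch: the model approves when score >= 0.
    At each step, change the single feature (by +/- step) that improves
    the score most, until the decision flips. The resulting delta is
    the 'what would need to be different' explanation."""
    x = np.array(x, dtype=float)
    for _ in range(max_iter):
        if score(x) >= 0:
            return x
        candidates = []
        for i in range(len(x)):
            for d in (step, -step):
                c = x.copy()
                c[i] += d
                candidates.append((score(c), c))
        x = max(candidates, key=lambda t: t[0])[1]
    return None  # no counterfactual found within the budget

# Toy loan model: approve when 0.4*income + 0.3*savings - 1.0 >= 0
score = lambda z: 0.4 * z[0] + 0.3 * z[1] - 1.0
applicant = [1.0, 0.5]  # currently rejected
cf = find_counterfactual(score, applicant)
print(cf)  # income rises while savings stay put: the cheapest path to approval
```

The output is directly actionable: it tells the applicant which lever to pull and by how much, which is the property that makes counterfactuals attractive in finance and healthcare.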

And finally there are symbolic and rule-based approaches, which try to distill black-box model behavior into human-readable logic. Decision rule extraction techniques approximate neural network behavior with if-then rules, which provide transparency but often sacrifice fidelity. Hybrid models that integrate neural networks with symbolic reasoning are an emerging area of research, aiming to retain both high performance and interpretability.
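A toy construction of my own shows both the appeal and the fidelity caveat: probe a black box on sample data, search for the single if-then threshold rule that best reproduces its decisions, and report how faithful the rule actually is.

```python
import numpy as np

def extract_rule(black_box, X):
    """Rule-extraction sketch: find the single rule of the form
    'if x[i] > t then positive' that best mimics the black box on X.
    Transparent, but only an approximation -- fidelity is measured,
    not guaranteed."""
    y = np.array([black_box(x) for x in X])
    best = (0.0, None)
    for i in range(X.shape[1]):
        for t in np.unique(X[:, i]):
            pred = X[:, i] > t
            fidelity = (pred == y).mean()  # agreement with the black box
            if fidelity > best[0]:
                best = (fidelity, (i, t))
    return best  # (fidelity on X, (feature index, threshold))

# A black box whose decision is secretly driven by feature 1
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
black_box = lambda x: x[1] > 0.6
fidelity, (feat, thr) = extract_rule(black_box, X)
print(f"if x[{feat}] > {thr:.2f} then positive (fidelity {fidelity:.2f})")
```

Here a single rule happens to capture the model perfectly; for a real neural network the extracted rule set grows, and reported fidelity drops, which is the transparency-versus-fidelity trade-off mentioned above.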


The Role of LLMs in Explainability

Large language models (LLMs) are increasingly being used to generate textual justifications for AI predictions. Vision-language models (VLMs) extend this capability further by producing natural language explanations for image classifications. The appeal of LLM-generated explanations is their accessibility—users can ask models to justify decisions in plain English, making complex AI systems more approachable.

However, this approach introduces new challenges. The biggest risk is hallucination—LLMs can generate explanations that are coherent yet factually incorrect. Because these models are trained on vast corpora of human language, they may prioritize fluency over faithfulness, making it difficult to distinguish genuine causal reasoning from plausible-sounding speculation. Ensuring that LLM-generated explanations align with ground truth remains an open research problem.


Challenges and Open Research Questions

Explainability techniques face several persistent challenges. The trade-off between faithfulness and simplicity is one of the most fundamental. Many post-hoc methods simplify model behavior to make explanations understandable, but this abstraction can lead to misleading interpretations. SHAP and LIME, for example, provide approximations that may not fully capture model internals.

Scalability is another concern. As AI systems are deployed in real-time applications, many existing explainability methods struggle with computational efficiency. Exact SHAP, for instance, requires evaluating every possible feature subset, a cost that grows exponentially with dimensionality and forces practical implementations to rely on sampling-based approximations.

Finally, regulatory compliance is an evolving challenge. Explainability is now a legal requirement in many jurisdictions, with laws such as the EU’s AI Act demanding “meaningful explanations” for automated decisions. However, there is no universally accepted standard for what constitutes a sufficient explanation. Ensuring that explainability techniques meet legal and ethical standards without compromising performance is an ongoing debate.


The Future of Explainability in AI

The next phase of explainability research will likely involve automated explanation generation, better benchmarks for evaluating explanation quality, and deeper integration of causal inference methods into AI models. Combining feature attribution with counterfactual reasoning may offer clearer insights. Similarly, the fusion of symbolic reasoning with deep learning could provide explanations that are both interpretable and faithful to underlying model behavior.

Ultimately, as AI systems continue to shape critical sectors like finance, healthcare, and autonomous systems, their ability to explain themselves will be a key determinant in their success.
