The Explainability Challenge of Generative AI and LLMs
This blog explores the fundamental challenges and current approaches to governing AI systems, particularly large language models, whose decision-making processes cannot be fully explained, and discusses potential solutions and regulatory implications for organizations deploying these technologies.
With the rise of generative AI, how do you govern AI systems whose outputs and internal processes are not fully explainable?
With the rapid advancements in generative AI and large language models (LLMs), these systems can now produce a vast range of often surprising outputs that mimic human language, generate detailed images, and even create novel ideas. But as powerful as they are, these models present unique challenges for governance and oversight. A significant part of these challenges stems from the inherent unpredictability and opacity of their decision-making processes. While there are tools and frameworks that bring more insight into how these models operate, the truth is that explaining or predicting every output of generative AI remains out of our current reach.
In this blog, I explore what is (and is not) currently possible when it comes to understanding and explaining the basis for the outputs of these complex systems. I’ll discuss some of our current methods and ongoing research. It’s important to acknowledge that the quest for transparency in generative AI is a complex, ongoing journey, one that will require new approaches and innovations as these systems evolve.
Key Takeaways
- LLM explainability presents fundamental governance challenges that cannot be fully resolved with current technology
- Organizations must balance performance benefits against explainability requirements
- Regulatory requirements are driving the need for new governance approaches
- Boards and senior executives should focus on risk management and oversight frameworks rather than complete technical transparency
1. The Reality of Explainability in Generative AI and LLMs
A fundamental question is “Are Generative AI and LLMs Explainable?” When we talk about explainability in AI, traditional machine learning models provide a baseline: they often have more transparent decision paths, making it easier to see how inputs relate to outputs. But with generative AI and LLMs, the process is far from straightforward. These models produce outputs through a cascade of internal steps, making it difficult to backtrack and pinpoint the exact reasoning behind a given result.
Even when we use the best available explainability techniques, we’re often only scratching the surface. Post-hoc explanation methods—tools that attempt to interpret decisions after they’ve been made—can give us insights, but they don’t provide the full picture. They can approximate why a model leaned a certain way in its output, but they’re not fully reliable, nor can they be considered definitive explanations.
The relationship between model size and explainability follows a counter-intuitive pattern: as models grow larger and more capable, they can become harder to interpret, yet they simultaneously develop more consistent and predictable behaviors in certain areas. This phenomenon, sometimes called the 'scaling paradox,' means that while we might better predict what a larger model will do, we often have less insight into how it arrives at its decisions. Recent research suggests that this relationship isn't linear: beyond certain scale thresholds, entirely new capabilities and behaviors can emerge, further complicating the ability to understand these systems.
A critical challenge in AI governance is the inherent tension between model performance and explainability. Simpler, more transparent models often struggle to match the sophisticated capabilities of complex LLMs, creating a trade-off where increasing explainability frequently comes at the cost of reduced performance on complex tasks. This presents leaders with difficult choices: should we prioritize understanding exactly how a model arrives at its decisions, or accept some degree of opacity in exchange for superior capabilities?
The bottom line? We may be able to approximate, but for many outputs, especially those that are creative or abstract, complete explainability is well beyond current capabilities.
2. Current Methods for Governing LLMs
While fully explaining every aspect of an LLM’s behavior is not feasible today, and may never be, there are approaches to manage and govern these systems. Here are some of the strategies that offer partial solutions to the explainability challenge:
Post-Hoc Explanation Techniques
- SHAP and LIME: These post-hoc explanation techniques attempt to approximate a model's decision-making by systematically varying inputs and analyzing their impact on outputs. While useful for traditional machine learning models, they face significant limitations with LLMs due to their non-linear nature and the challenge of identifying relevant feature interactions across millions or billions of parameters. These techniques can offer localized insights into specific decisions, but their reliability decreases dramatically when analyzing the more complex, context-dependent outputs typical of large language models. A minimal sketch of this perturbation-based approach appears after this list.
- Surrogate Models: Building simpler models to mimic the behavior of a complex system can sometimes give stakeholders a rough idea of how decisions are made. However, surrogate models might fall short of capturing the complexity of LLMs. Recent research suggests they often fail to reflect the behaviors of LLMs, especially for more complex tasks.
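To make this concrete, here is a minimal sketch of a perturbation-based explanation using the open-source `lime` package with a Hugging Face `transformers` sentiment pipeline. The model checkpoint and label names are illustrative assumptions, not a recommendation, and the `predict_proba` wrapper is just an adapter written for this example. Note that LIME itself works by fitting a simple local surrogate around a single prediction, which is why the two bullets above are closely related.

```python
# A minimal, illustrative sketch (not a definitive recipe): explain one
# prediction of a text classifier with LIME, which perturbs the input text and
# fits a simple local surrogate model around that single prediction.
# Assumptions: the `lime` and `transformers` packages are installed, and the
# checkpoint below is only an example binary sentiment model.
import numpy as np
from lime.lime_text import LimeTextExplainer
from transformers import pipeline

clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english")

def predict_proba(texts):
    """Adapter returning an (n_samples, n_classes) probability array, as LIME expects."""
    results = clf(list(texts), top_k=None)  # top_k=None returns scores for all labels
    return np.array([
        [next(s["score"] for s in r if s["label"] == "NEGATIVE"),
         next(s["score"] for s in r if s["label"] == "POSITIVE")]
        for r in results
    ])

explainer = LimeTextExplainer(class_names=["NEGATIVE", "POSITIVE"])
explanation = explainer.explain_instance(
    "The loan officer was helpful, but the terms felt unfair.",
    predict_proba,
    num_features=6,    # keep only the six most influential tokens
    num_samples=500,   # number of perturbed variants LIME generates
)
print(explanation.as_list())  # [(token, weight), ...] from the local surrogate
```

Token-level attributions like these can be informative for short, classification-style outputs, but, as noted above, they become far less reliable for long, free-form generative text.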
Human-in-the-Loop (HITL) Systems
- For high-stakes scenarios, human oversight of AI models is crucial. Humans can evaluate outputs flagged by automated tools or review results that fall below a confidence threshold. HITL oversight allows some egregious errors, biases, or alignment issues to be caught and studied, but it doesn’t necessarily make the AI more explainable.
- Threshold-Based Reviews: Establishing criteria for review, where only certain high-impact or high-risk outputs require human assessment, can reduce the burden on human reviewers while adding a layer of control.
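As an illustration of threshold-based review, the sketch below routes an output to a human reviewer when it falls below a confidence threshold or touches a high-risk category. The confidence score, risk labels, and threshold values are hypothetical placeholders that a real deployment would define from its own models and policies.

```python
# Illustrative sketch of threshold-based routing for human review. The
# confidence score and risk categories are hypothetical inputs that a real
# system would supply from its own model and policy metadata.
from dataclasses import dataclass

@dataclass
class ModelOutput:
    text: str
    confidence: float     # e.g., a calibrated probability or heuristic score
    risk_category: str    # e.g., "credit_decision", "general_chat"

HIGH_RISK_CATEGORIES = {"credit_decision", "medical_advice", "hiring"}
CONFIDENCE_THRESHOLD = 0.85   # below this, escalate to a human reviewer

def route(output: ModelOutput) -> str:
    """Decide whether an output ships automatically or goes to human review."""
    if output.risk_category in HIGH_RISK_CATEGORIES:
        return "human_review"          # high-impact outputs are always reviewed
    if output.confidence < CONFIDENCE_THRESHOLD:
        return "human_review"          # low-confidence outputs are escalated
    return "auto_release"

print(route(ModelOutput("Approved for a $10,000 limit.", 0.97, "credit_decision")))
# -> human_review
```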
Ethical and Safety Guardrails
- Generative models can be constrained by built-in filters or safety mechanisms that aim to prevent harmful outputs. However, these safeguards have limits and may not always align with complex real-world contexts. Reinforcement Learning from Human Feedback (RLHF), for example, offers an iterative approach but doesn’t guarantee ethical outputs in every scenario.
- While RLHF represents an important advance in AI alignment, it comes with significant challenges. Human feedback can inadvertently amplify existing biases and inconsistencies, as different annotators may have conflicting views on what constitutes appropriate or optimal responses. Moreover, the process struggles with edge cases and novel scenarios where human preferences aren't well-defined, and the reward signals can become ambiguous or contradictory, potentially leading to unexpected model behaviors in these situations.
- Implementing safety guardrails presents a complex balancing act: too restrictive, and they limit the AI's utility and legitimate uses, particularly in contexts involving sensitive but important topics like healthcare or crisis response; too permissive, and they risk enabling harm or misuse. This tension is complicated by cultural and contextual differences in what constitutes appropriate content, making it challenging to design universal safety mechanisms that work effectively across different use cases, cultures, and geographical regions.
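The sketch below illustrates this balancing act in miniature: a simple output-level guardrail that combines a pattern blocklist with a score threshold. The blocked pattern and the `toxicity_score` stand-in are invented for the example; a production system would rely on trained moderation models and policies tuned to its context.

```python
# A deliberately simple sketch of an output-level guardrail. The blocklist and
# the `toxicity_score` stand-in are illustrative only; a real deployment would
# use trained moderation models and jurisdiction-specific policies.
import re

BLOCKED_PATTERNS = [r"\bsocial security number\b"]   # illustrative policy pattern

def toxicity_score(text: str) -> float:
    """Placeholder for a real moderation model; returns a fake score here."""
    return 0.9 if "hate" in text.lower() else 0.1

def guardrail(text: str, threshold: float = 0.8) -> bool:
    """Return True if the text may be released, False if it should be blocked.

    Raising `threshold` makes the filter more permissive; lowering it makes it
    more restrictive -- the balancing act described above.
    """
    if any(re.search(p, text, flags=re.IGNORECASE) for p in BLOCKED_PATTERNS):
        return False
    return toxicity_score(text) < threshold

print(guardrail("Here is general first-aid guidance for a burn."))  # True
```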
Continuous Monitoring and Auditing
- Regular monitoring of AI outputs for signs of bias, inaccuracy, or unusual patterns can help maintain some level of accountability. This process is essential, especially in regulated industries. However, it only partially addresses the challenges, as unpredictable and unexplainable outputs may still occur.
- Current monitoring and detection systems operate primarily at the output level, employing statistical analysis, pattern matching, and targeted testing to identify potential issues. These systems can detect obvious anomalies and known patterns of unwanted behavior but often struggle with subtle biases and context-dependent errors. Moreover, the dynamic nature of LLM behavior means that monitoring systems must constantly evolve, as what works for detecting issues in one version of a model may become less effective as the model is updated or as usage patterns change. This creates a persistent gap between monitoring capabilities and the full range of potential model behaviors.
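As a small illustration of output-level monitoring, the sketch below tracks a rolling refusal rate and raises a drift alert when it moves away from a baseline. The refusal heuristic, baseline rate, window size, and tolerance are all assumptions made for the example.

```python
# Illustrative output-level monitor: track a rolling refusal rate and flag a
# drift alert when it departs from a baseline. The baseline, window size, and
# refusal heuristic are assumptions for the sake of the sketch.
from collections import deque

class OutputMonitor:
    def __init__(self, baseline_rate: float = 0.05, window: int = 500,
                 tolerance: float = 0.05):
        self.baseline_rate = baseline_rate    # expected refusal rate
        self.recent = deque(maxlen=window)    # rolling window of recent outputs
        self.tolerance = tolerance            # allowed drift before alerting

    @staticmethod
    def looks_like_refusal(text: str) -> bool:
        return text.strip().lower().startswith(("i can't", "i cannot", "sorry"))

    def record(self, text: str) -> bool:
        """Record one output; return True if the rolling rate has drifted."""
        self.recent.append(self.looks_like_refusal(text))
        if len(self.recent) < self.recent.maxlen:
            return False                      # wait until the window fills
        rate = sum(self.recent) / len(self.recent)
        return abs(rate - self.baseline_rate) > self.tolerance

monitor = OutputMonitor(window=3)
for reply in ["Sure, here you go.", "I cannot help with that.", "I can't assist."]:
    print(monitor.record(reply))
# -> False, False, True (2/3 refusals exceeds baseline + tolerance)
```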
Transparency and Documentation
- Documenting the training data, assumptions, and known limitations of an LLM helps set expectations for stakeholders, but it doesn’t make the model’s reasoning clearer. This approach is more about managing expectations than achieving true transparency.
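One practical way to keep such documentation consistent is to make it machine-readable. The sketch below uses a lightweight structure loosely modeled on common "model card" templates; the field names and the example system are illustrative only.

```python
# A lightweight, machine-readable stand-in for model documentation. The field
# names loosely follow common "model card" templates and the example system is
# hypothetical; this sets stakeholder expectations but does not explain the
# model's reasoning.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ModelCard:
    model_name: str
    intended_use: str
    training_data_summary: str
    known_limitations: list = field(default_factory=list)
    evaluation_notes: str = ""

card = ModelCard(
    model_name="internal-support-assistant-v2",   # hypothetical system
    intended_use="Drafting customer-support replies for human review.",
    training_data_summary="Public web text plus anonymized support tickets.",
    known_limitations=[
        "May produce confident but incorrect answers (hallucinations).",
        "Not evaluated for medical, legal, or financial advice.",
    ],
    evaluation_notes="Spot-checked monthly against a bias test suite.",
)
print(json.dumps(asdict(card), indent=2))
```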
3. The Challenge Ahead: Adapting and Evolving Our Tools
Generative AI’s potential is tremendous, but so are the challenges it brings to transparency and trust. As the models improve, they often become harder to predict and explain, which raises important questions for future AI governance. Here are a few areas where innovation is needed:
- Developing Deeper Explanation Models: To improve our understanding of LLMs, we need tools that can offer more than approximations. Future models may require fundamentally new architectures that embed transparency into their core processes, not as an afterthought.
- Advancing Monitoring and Detection Mechanisms: Real-time, dynamic monitoring tools are crucial for identifying problematic patterns in generative models. But these tools need to become more sophisticated, especially as AI systems become more autonomous.
- Building Ethical, Self-Regulating Models: Some are beginning to explore ways for models to “self-regulate” or recognize when their outputs could be potentially harmful. Reinforcement learning techniques are a step in this direction. However, the concept of "self-regulation" in AI systems is highly contested, as it implies a level of self-awareness that current systems don't possess.
A promising area of research in understanding LLMs is “mechanistic interpretability.” The aim here is to reverse-engineer how these models process information and generate outputs. Unlike “black box” testing approaches, mechanistic interpretability seeks to understand the internal mechanisms and computational patterns that emerge during model training and inference. Recent work in this field has begun to reveal how LLMs encode and manipulate information internally.
However, mechanistic interpretability faces significant challenges. As models get larger and more complex, the patterns and mechanisms become increasingly difficult to study. What works for understanding one component or behavior might not generalize to others, and the emergence of capabilities through model scaling can introduce new phenomena. Still, mechanistic interpretability research could lead to a deeper understanding of how LLMs work, which might someday lead to more controllable and interpretable models.
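To give a flavor of the tooling involved, the sketch below shows one basic building block of this kind of work: capturing the hidden activations of a single transformer block with a forward hook, assuming a GPT-2-style model from the `transformers` library whose blocks are exposed as `model.h`. Real mechanistic analysis (tracing circuits, intervening on activations) goes far beyond this, but it starts from the same kind of access.

```python
# A tiny building block used in mechanistic-interpretability work: record the
# hidden activations of one transformer block with a forward hook. This only
# captures activations; circuit-level analysis goes much further. Assumes a
# GPT-2-style model whose blocks are exposed as `model.h`.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

captured = {}

def save_activation(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden state.
    captured["block_5"] = output[0].detach()

hook = model.h[5].register_forward_hook(save_activation)

with torch.no_grad():
    tokens = tokenizer("Paris is the capital of", return_tensors="pt")
    model(**tokens)

hook.remove()
print(captured["block_5"].shape)  # (batch, sequence_length, hidden_size)
```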
4. The Regulatory Landscape and Legal Requirements for AI Explainability
The regulatory environment for AI explainability is rapidly evolving, with different jurisdictions taking notably different approaches. The European Union's AI Act stands at the forefront, mandating varying levels of transparency and explainability based on an AI system's risk level and use case. For 'high-risk' AI applications—such as those used in hiring, credit scoring, or healthcare—organizations must demonstrate their ability to provide meaningful explanations of AI decisions that affect individuals.
In contrast, the United States has thus far taken a sector-specific approach. Financial regulators require explainability for AI systems focused on lending decisions, while healthcare regulations demand transparency in AI-assisted medical diagnostics. The FDA's proposed framework for AI/ML in medical devices emphasizes the importance of explainable AI for patient safety and clinical decision-making.
These regulatory requirements create practical challenges for deploying advanced LLMs:
- Organizations must balance model performance with compliance requirements
- Different explainability standards across jurisdictions may require region-specific model variations
- Documentation requirements for model decisions may limit the use of fully opaque AI systems
- Legal liability concerns are driving investment in explainability research and tools
The challenge of meeting these varied regulatory requirements while maintaining model performance has led to the emergence of compliance-focused AI development practices. Some organizations are building explainability features directly into their model architecture rather than treating it as an afterthought. Others are developing hybrid approaches that combine more explainable models for regulated decisions with more complex models for less sensitive tasks.
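A simplified sketch of that hybrid pattern follows: regulated decisions are routed to a simpler, auditable model, while everything else goes to a general-purpose LLM. The task categories and model labels are illustrative assumptions rather than a recommended architecture.

```python
# Illustrative sketch of the "hybrid" pattern described above: regulated
# decisions go to a simpler, auditable model, everything else to a general LLM.
# The task tags and model labels are assumptions, not a recommended stack.
REGULATED_TASKS = {"credit_scoring", "hiring_screen", "medical_triage"}

def pick_model(task: str) -> str:
    if task in REGULATED_TASKS:
        # e.g., a scorecard or tree-based model with documented feature attributions
        return "interpretable_model"
    # e.g., a large general-purpose LLM for drafting, summarization, chat
    return "general_llm"

for task in ("credit_scoring", "marketing_copy"):
    print(task, "->", pick_model(task))
# credit_scoring -> interpretable_model
# marketing_copy -> general_llm
```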
Looking ahead, we can expect regulatory requirements for AI explainability to become more stringent and widespread. The challenge for organizations will be developing governance frameworks that can adapt to this evolving regulatory landscape while maintaining the benefits of advanced AI capabilities.
5. Why Absolute Explainability May Not Be Possible—and Why That’s Okay
It’s important to recognize that full transparency in LLMs may never be achieved. Much like human cognition, AI’s reasoning may always retain an element of mystery. This is not necessarily a failure; rather, it is a recognition of the limitations inherent in current AI architectures.
This doesn’t mean we should accept a lack of accountability or explainability. By embracing transparency to the fullest extent possible, documenting limitations, and involving human oversight at key decision points, we can build governance frameworks that make generative AI and LLMs safer and more trustworthy, even if they are not entirely explainable.
The Path Forward
As AI becomes more capable, governing its outputs and ensuring they are ethical, accurate, and reliable will be an ongoing challenge. We have tools today that provide some insight and control, but these are incomplete solutions. The answer to whether generative AI can ever be fully explainable may be “no.” (Many researchers argue that complete explainability is fundamentally impossible due to the emergent properties of large neural networks, not just a temporary technical limitation.)
Recognizing the limitations of current tools and acknowledging the complexity of these systems is essential to creating a realistic governance model for generative AI and LLMs. This journey toward better governance and oversight is only beginning, and it will require an honest assessment of what we can achieve—and what we must continue striving to improve.