I got excited when I heard about XAI (explainable AI). Really. After all, it promises a way to look into the decision-making process of a machine. At least that’s the idea. Or that was the idea…
When I read a bit more about it, what struck me was that true explainability is still limited and slow. What we actually have now is remarkably similar to what we humans do: reasoning backwards from decisions already made.
How humans explain
Humans rarely “think step by step” in a fully transparent way, except for very simple, formalised tasks like arithmetic or following a checklist. Most of our decisions are driven by emotion, habit, intuition, gut feeling (a.k.a. the implicit knowledge we aren’t aware of), social pressures, identity, and self-image.
When we explain, we are almost always doing post hoc reconstruction. Even when we think we’re giving a step-by-step process, we’re simplifying or rationalising what happened to make it look good or to avoid regret. In other words, the “white box” inside a human doesn’t really exist – at least not fully in conscious form.
One way to construct a coherent story after a decision is made is to use counterfactual thinking – the process of imagining alternatives to our past choices. We even assign value to the paths we didn’t take to guide our future decisions and to make us feel better about the choices we made. (Wang et al., 2026)
How AI explains
True AI white boxes exist only in very constrained, simple models, like decision trees and linear regression (i.e., following a process, like arithmetic). They give complete transparency, but they do not allow for the flexibility or adaptability of the human thought process. They scale poorly for tasks humans do effortlessly, like image recognition or natural language understanding, and they aren’t “brain-like” in the sense of distributed, highly interconnected processing.
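To make “complete transparency” concrete, here is a minimal sketch of a linear model whose explanation is simply its own arithmetic. The feature names, weights, and bias are made up for illustration; nothing here comes from a real system.

```python
# Hypothetical linear scoring model: the weights and bias are invented for this sketch.
WEIGHTS = {"income": 0.5, "debt": -0.8, "years_employed": 0.3}
BIAS = 1.0

def predict(features):
    """Score = bias + sum of weight * value, nothing hidden."""
    return BIAS + sum(WEIGHTS[name] * value for name, value in features.items())

def explain(features):
    """Return each feature's exact contribution to the score."""
    return {name: WEIGHTS[name] * value for name, value in features.items()}

applicant = {"income": 4.0, "debt": 2.0, "years_employed": 5.0}
score = predict(applicant)
contributions = explain(applicant)

# The explanation is the model itself: bias plus contributions reproduces the score.
reconstructed = BIAS + sum(contributions.values())
print(contributions, score, reconstructed)
```

Here the explanation and the decision process are the same thing, which is exactly what stops being true once the model gets more complex.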
Neural networks, the AI as we know it, are structurally close to our brains in that they work with distributed representations and spot patterns – but that very structure makes them opaque by nature. Even the learning process itself is not explainable at every step. (Mainz et al., 2025)
That very type of AI also explains its decisions backwards and in a counterfactual way. What’s even more interesting: the explanation doesn’t come from inside it but from a system put on top (sounds like cognition put on top of emotion to me, btw).
AI surrogate explanation models like LIME or SHAP create a simplified “shadow” that approximates the logic of the complex black box after the black box has produced its output. The actual process is not fully visible to the explanation models either, so they only approximate the real process and are not fully faithful to it. (Turpin et al., 2023)
They can hide bias or produce misleading narratives, just like humans do. They optimise for human intelligibility rather than causal accuracy, which is exactly what we do – optimising for other people’s understanding rather than stating the actual reasons for our decisions.
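A rough sketch of the “shadow” idea, in the spirit of local surrogates but not LIME’s or SHAP’s actual algorithm: I assume a toy black box with a hidden interaction between its two inputs, probe it on a small symmetric grid around one point, and fit a straight-line model to the answers. The function names and numbers are hypothetical.

```python
def black_box(x1, x2):
    # Stand-in for an opaque model; the surrogate only sees inputs and outputs.
    return x1 * x2 + x1 ** 2

def local_surrogate(model, point, offsets=(-0.1, -0.05, 0.0, 0.05, 0.1)):
    """Fit y ~ intercept + w1*d1 + w2*d2 on a symmetric grid around `point`."""
    samples = [(d1, d2, model(point[0] + d1, point[1] + d2))
               for d1 in offsets for d2 in offsets]
    mean_y = sum(y for _, _, y in samples) / len(samples)
    # On a balanced, zero-centred grid the two perturbation columns are
    # uncorrelated, so each least-squares slope can be computed on its own.
    w1 = (sum(d1 * (y - mean_y) for d1, _, y in samples)
          / sum(d1 * d1 for d1, _, _ in samples))
    w2 = (sum(d2 * (y - mean_y) for _, d2, y in samples)
          / sum(d2 * d2 for _, d2, _ in samples))
    return mean_y, w1, w2

def surrogate_predict(fit, point, x):
    intercept, w1, w2 = fit
    return intercept + w1 * (x[0] - point[0]) + w2 * (x[1] - point[1])

point = (1.0, 2.0)
fit = local_surrogate(black_box, point)
intercept, w1, w2 = fit

near, far = (1.05, 2.05), (-1.0, 2.0)
near_error = abs(surrogate_predict(fit, point, near) - black_box(*near))
far_error = abs(surrogate_predict(fit, point, far) - black_box(*far))
print(f"weights: x1={w1:.2f}, x2={w2:.2f}")
print(f"error near the point: {near_error:.4f}, far from it: {far_error:.4f}")
```

The surrogate tells a tidy story (“x1 matters about four times as much as x2”) that holds near the probed point and falls apart further away, and it never mentions the interaction that actually drives the output.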
Interestingly, the same “counterfactual explanation” methods used to understand AI can be used by attackers to find adversarial paths, highlighting the power of reasoning backwards from an outcome. (Bernardo et al., 2023) Though that’s a different story for another time…
Thank goodness AI doesn’t keep learning once training is over. Otherwise, it would be very human-like of it to believe its own rationalisations.
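For the curious, a counterfactual explanation can be sketched as a search for the smallest input change that flips the outcome. This is a brute-force, one-feature toy under my own assumptions (a made-up loan rule, a fixed step size), not any particular published method.

```python
def approve(applicant):
    # Hypothetical decision rule standing in for a black box; the search
    # below only queries it and never looks at these numbers.
    return 0.5 * applicant["income"] - 0.8 * applicant["debt"] > 1.0

def find_counterfactual(decide, applicant, feature, step, max_steps=100):
    """Brute-force search: nudge one feature until the decision flips."""
    original = decide(applicant)
    for k in range(1, max_steps + 1):
        candidate = dict(applicant)
        candidate[feature] = applicant[feature] + k * step
        if decide(candidate) != original:
            return candidate
    return None  # no flip found within the search budget

applicant = {"income": 4.0, "debt": 2.0}
counterfactual = find_counterfactual(approve, applicant, "income", step=0.5)
if counterfactual is not None:
    print(f"Rejected, but approved if income were {counterfactual['income']}")
```

The same loop serves both sides: it gives a person an actionable “what would have changed the result”, and it gives an attacker the cheapest nudge that gets past the model.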
So what
What we often want from AI is “Make it like a human – but smarter, faster, and fully transparent.” But our brains weren’t designed for transparency. Conscious thought is like a “reporting interface” for internal processes, not a faithful log.
So what if it’s structurally impossible to build a human-like network of synapses, which is a black box itself, and expect it to be a white-box decision-maker?
What if being brain-like and being fully explainable are mutually conflicting requirements for a system?
What if it’s a fundamental mismatch between the way intelligent systems naturally work and the way we want explanations to work?
And are bias and rationalisation a repeatable process that we can study and quantify in AI as we do in humans? Or are they still too random, and that’s the very thing we do not have in common?
Because if it’s the former, then what if the act of explaining, the “math” of an AI’s self-justification, matches the “math” of human cognitive dissonance in its structure? What if the bias is not as data-dependent as we think, in both AI and us, but is part of the process itself?
—
Sources:
Bernardo, V., Attoresi, M., Lareo, X., Velasco, L., & European Data Protection Supervisor (Eds.). (2023). TechDispatch: Explainable artificial intelligence; #2/2023. Publications Office. https://doi.org/10.2804/132319
Mainz, J., Munch, L., & Bjerring, J. C. (2025). Cost-effectiveness and algorithmic decision-making. AI and Ethics, 5(3), 2681–2693. https://doi.org/10.1007/s43681-024-00528-0
Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2305.04388
Wang, Z., Li, X., & Bakkour, A. (2026). The inferred value of unchosen options spreads to related items in memory. Cognition, 272, 106508. https://doi.org/10.1016/j.cognition.2026.106508