If you had asked me on April 1 whether artificial intelligence could hallucinate its way into life-and-death decisions, I would have laughed and then listed a dozen technical caveats. On April 2 the laugh is gone. The problem is not a clever turn of phrase. It is operational. When systems trained on incomplete, biased, or synthetic data are asked to infer who or what is a legitimate target, they do not merely make probabilistic errors. They invent patterns, assert spurious linkages, and hand those inventions to human beings whose cognitive load and institutional incentives make them likely to accept the output. The result is a new variety of error: an AI hallucination that can be fatal.
We already have concrete examples of this dynamic in active conflicts. Reporting beginning in late 2023 documented an IDF targeting pipeline that includes an AI decision-support tool nicknamed the Gospel and a large person-centric database referred to as Lavender. Journalistic accounts and expert commentary described the Gospel as producing hundreds of potential targets in a matter of days, where human analysts had previously produced tens in a year, and Lavender as a rapidly compiled list of tens of thousands of named individuals scored by algorithmic criteria for suspected membership in militant groups. These systems were described as feeding recommendations to human analysts who then had to decide, often at pace, whether to act. The scale and speed are not hypothetical. They are real.
What do we mean when we say “hallucination” in this operational setting? In machine learning parlance a hallucination is an output that confidently states something false or unsupported by the inputs—an invented factual claim or an inferred pattern that has no grounding in reality. In large language models this appears as fabricated citations or invented facts; in sensor-fusion and targeting pipelines it takes the form of misclassified objects, mis-linked identifiers, and manufactured associations between people and hostile activities. The medical literature and empirical studies of generative systems have documented how often models fabricate references or details when they attempt to be helpful but lack verification. This is not mere rhetoric; it is a documented failure mode of contemporary models.
Why does that matter for targeting? There are three linked reasons. First, statistical models generalize from their training data. If that data lacks examples of “non-targets” that resemble real-world civilians, the model will over-generalize and label civilians as threats. Second, automation bias leads human operators to overweight machine recommendations under pressure, time scarcity, or institutional demand for output. In high-tempo operations the human in the loop can become the human on the loop: present to satisfy legal formality but effectively deferential in practice. Third, opaque provenance and the lack of reproducible audit trails make ex post review and accountability much harder. Each of these dynamics has been analyzed in the legal and human factors literature as a root cause of mistaken lethal decisions in automated contexts.
The examples from Gaza are an instructive case study because they combine scale, urgency, and a preexisting institutional appetite for mass target generation. Journalists and analysts reported that the AI-enabled pipeline dramatically increased the number of candidate targets and that human review was sometimes shallow given the throughput. Those are precisely the conditions under which hallucinations and over-trust interact to create catastrophic outcomes. Whether one accepts every detail of the reporting or not, the structural warning is clear: a system that speeds production of actionable targeting recommendations without proportionate improvements in data provenance, explainability, and operator training invites error.
We must also be precise about the technical character of these failures. Hallucination here is not random noise. It is systematic confabulation driven by optimization objectives that reward recall of suspicious patterns and penalize false negatives, in training regimes where the cost function privileges target discovery. Combined with datasets that discard negative examples and with synthetic augmentation, models develop brittle heuristics that can look plausible to a human reviewer but are, in effect, artifacts of the pipeline. The medical studies showing high rates of fabricated references in LLM outputs are a laboratory analogue of the same phenomenon: plausible but false artifacts produced because the model is solving for fluency and pattern completion rather than truth.
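To make that mechanism concrete, here is a toy sketch, entirely my own illustration on synthetic data rather than a description of any fielded system, of how a training objective that heavily penalizes missed positives buys its recall with false alarms: the re-weighted model recovers more of the rare class and, in the same move, flags many more negatives as positives.

```python
# Illustrative only: a toy classifier on synthetic data, showing how a loss
# that heavily penalizes false negatives (missed "targets") also inflates
# false positives (innocents flagged). No real system or data is modeled.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: 5% positives standing in for a rare class of interest.
X, y = make_classification(n_samples=20_000, n_features=20, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for label, weights in [("symmetric loss", None),
                       ("false negatives penalized 25x", {0: 1, 1: 25})]:
    model = LogisticRegression(max_iter=1000, class_weight=weights).fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
    print(f"{label:32s} true positives={tp:5d}   false positives={fp:5d}")
```

The specific numbers do not matter; the shape of the trade-off does, and in a targeting pipeline every additional false positive is a person.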
So what should militaries, policy makers, and technologists do about it? I propose a short, rigorous checklist grounded in both engineering practice and the demands of international law:
- Treat human-in-the-loop as an engineering requirement, not a checkbox. Design interfaces that surface uncertainty, provenance, and counterfactual data rather than a single “recommended” action. Force operators to see the raw signals that led to a recommendation and require active corroboration steps before escalation (a sketch of such a record follows this list).
- Mandate negative-example curation. Every targeting model must be trained on explicit examples of “non-targets” that mirror the civilian patterns expected in theater. Without negative examples, models will generalize disastrously (a curation-gate sketch follows the list).
- Require operational red teams and adversarial testing under realistic stress. Hallucinations are often revealed only when models are stressed with edge cases, degraded sensors, or data-poisoning attempts (see the stress-harness sketch after the list).
- Enforce immutable audit logs and reproducible decision trails. If a model recommends a strike, the system should record the inputs, model weights, confidence estimates, and the human queries that led to authorization (see the hash-chained log sketch after the list). This is both good engineering and the foundation of accountability under international humanitarian law.
- Limit velocity where the cost of error is high. Speed is a military advantage, but not when it converts probabilistic recommendations into routine lethal action without matching increases in verification capacity (see the backpressure sketch after the list).
- Insist on external legal review. Article 36-style weapons reviews need to expand to cover data, model lifecycle, and operational interface audits. Lawful use cannot be certified by a closed testing lab alone.
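On the first item, a minimal sketch, with hypothetical field names and thresholds of my own, of what surfacing uncertainty and provenance can mean at the data-structure level: the recommendation object carries its evidence with it, and escalation is refused until a human has reviewed the raw signals and logged independent corroboration, however confident the model is.

```python
# Hypothetical recommendation record: it carries its evidence and uncertainty
# with it, and cannot be escalated on model confidence alone. Field names and
# the corroboration rule are invented for this sketch.
from dataclasses import dataclass, field

@dataclass
class SourceSignal:
    sensor_id: str        # which sensor or database produced the signal
    collected_at: str     # ISO 8601 timestamp of collection
    raw_reference: str    # pointer to the unprocessed data, not a summary

@dataclass
class Recommendation:
    candidate_id: str
    model_version: str
    confidence: float                          # calibrated probability, 0..1
    confidence_interval: tuple[float, float]   # uncertainty shown, not hidden
    signals: list[SourceSignal] = field(default_factory=list)
    corroborations: list[str] = field(default_factory=list)  # analyst notes
    raw_signals_reviewed: bool = False

    def can_escalate(self, min_corroborations: int = 2) -> bool:
        """Escalation requires reviewed raw signals plus independent
        corroboration, regardless of how confident the model is."""
        return (self.raw_signals_reviewed
                and len(self.corroborations) >= min_corroborations)
```

An interface built around a record like this has no choice but to show the operator the signals, because the escalation path will not proceed until someone has looked at them.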
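On negative-example curation, a sketch of one enforceable form it could take, a pre-training gate; the tag name and the coverage threshold are invented for the illustration.

```python
# Illustrative pre-training gate: refuse to train unless the negative class
# includes enough examples curated to mirror in-theater civilian patterns.
# The "in_theater_civilian" tag and the 30% threshold are invented here.
REQUIRED_CIVILIAN_LIKE_FRACTION = 0.30

def check_negative_coverage(examples: list[dict]) -> None:
    negatives = [e for e in examples if e["label"] == 0]
    if not negatives:
        raise ValueError("training set contains no negative examples at all")
    civilian_like = [e for e in negatives if "in_theater_civilian" in e["tags"]]
    fraction = len(civilian_like) / len(negatives)
    if fraction < REQUIRED_CIVILIAN_LIKE_FRACTION:
        raise ValueError(
            f"only {fraction:.0%} of negatives mirror in-theater civilian "
            f"patterns; at least {REQUIRED_CIVILIAN_LIKE_FRACTION:.0%} required")

# Toy dataset: one of three negatives is civilian-like, so the gate passes here.
dataset = [
    {"label": 1, "tags": ["signals_hit"]},
    {"label": 0, "tags": ["random_sample"]},
    {"label": 0, "tags": ["random_sample"]},
    {"label": 0, "tags": ["in_theater_civilian"]},
]
check_negative_coverage(dataset)
```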
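On red teaming, a sketch of the kind of stress harness meant: degrade the inputs the way a contested environment would, with noise and intermittently dead channels, and track the false-positive rate against a fixed acceptance budget rather than headline accuracy. Model, data, noise levels, and budget are all synthetic placeholders.

```python
# Illustrative stress harness: track a classifier's false-positive rate as its
# inputs are degraded, against a fixed acceptance budget. Everything here is
# a synthetic placeholder, not an operational model or sensor feed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

FPR_BUDGET = 0.05          # hypothetical ceiling set by policy, not by the model
rng = np.random.default_rng(0)

X, y = make_classification(n_samples=10_000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

def false_positive_rate(model, X, y) -> float:
    """Fraction of true negatives that the model labels positive."""
    pred = model.predict(X)
    return float(np.mean(pred[y == 0] == 1))

for noise in [0.0, 0.5, 1.0, 2.0]:                      # escalating degradation
    X_deg = X + rng.normal(scale=noise, size=X.shape)    # additive sensor noise
    X_deg[rng.random(X.shape) < 0.05 * noise] = 0.0      # intermittently dead channels
    fpr = false_positive_rate(model, X_deg, y)
    print(f"noise={noise:3.1f}  FPR={fpr:.3f}  {'PASS' if fpr <= FPR_BUDGET else 'FAIL'}")
```

The budget belongs to policy, not to the model; the point of the harness is that a capability which only behaves with pristine sensors gets discovered on a test bench rather than over a populated city.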
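On audit logs, a sketch of what immutable can mean in practice: an append-only log in which every entry commits to its predecessor by hash, so later editing or deletion is detectable on verification. The fields are the ones the checklist item names; the chaining scheme is a generic pattern, not a claim about any deployed system.

```python
# Illustrative append-only, hash-chained decision log. Each entry commits to
# its predecessor, so tampering with any past entry breaks verification.
# Generic pattern for the sketch, not a description of a fielded system.
import hashlib
import json
import time

GENESIS = "0" * 64

class DecisionLog:
    def __init__(self):
        self.entries = []

    def append(self, *, inputs_digest: str, model_version: str,
               confidence: float, operator_queries: list[str],
               authorized_by: str) -> dict:
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else GENESIS
        body = {
            "timestamp": time.time(),
            "inputs_digest": inputs_digest,   # hash of the raw inputs, stored elsewhere
            "model_version": model_version,
            "confidence": confidence,
            "operator_queries": operator_queries,
            "authorized_by": authorized_by,
            "prev_hash": prev_hash,
        }
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        entry = {**body, "entry_hash": digest}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited or deleted entry breaks it."""
        prev = GENESIS
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "entry_hash"}
            digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev_hash"] != prev or digest != e["entry_hash"]:
                return False
            prev = e["entry_hash"]
        return True
```

Whether the chain is anchored in a write-once store or countersigned externally is a secondary design choice; what matters is that the record of inputs, model state, and human authorization cannot be quietly rewritten after the fact.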
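And on velocity, a sketch of the backpressure rule implied: candidate recommendations are released only at the rate the verification team can actually clear, and when the queue is full it is the pipeline that slows down, not the review. The capacities and times are invented for the example.

```python
# Illustrative backpressure: the machine side is throttled to the human
# verification capacity rather than the other way round. All numbers invented.
from collections import deque

class VerificationGate:
    def __init__(self, max_pending: int, analysts: int, minutes_per_review: int):
        self.max_pending = max_pending
        self.reviews_per_hour = analysts * 60 // minutes_per_review
        self.pending = deque()

    def submit(self, recommendation) -> bool:
        """Accept a candidate only if the review queue has room; otherwise the
        generator must wait, not the reviewers."""
        if len(self.pending) >= self.max_pending:
            return False                     # backpressure: defer new output
        self.pending.append(recommendation)
        return True

gate = VerificationGate(max_pending=20, analysts=4, minutes_per_review=30)
accepted = [gate.submit(f"candidate-{i}") for i in range(30)]
print(gate.reviews_per_hour, "reviews/hour capacity;",
      sum(accepted), "accepted,", len(accepted) - sum(accepted), "deferred")
```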
These are not novel prescriptions. They are, however, rarely implemented with the necessary technical rigor. Too often the rush to exploit a new capability overshadows the work required to operate it safely. The result is a moral hazard: technology that looks precise because it produces more targets, but that is less precise where precision matters, in separating legitimate targets from everyone else. The number of recommended targets is not a metric of success if what is being counted is error laundered through human acquiescence.
Finally, we must reframe the ethical conversation. The term hallucination risks trivializing what is, at core, a human problem mediated by machines. Machines do what we ask them to do. If we ask them to overproduce actionable inferences without robust safeguards, they will. The responsibility therefore rests with the architects of the system, the commanders who field it, and the states that permit its use in populated environments. Technologists must refuse feature requests that trade explainability and negative-example coverage for throughput. Jurists must adapt and make clear that delegating the production of target lists to opaque algorithms does not dilute legal responsibility. Ethicists and philosophers must insist that the fog of war is not a license to hand the fog to an algorithm and call it precision.
If April Fools’ Day taught us anything, it is that a joke lands only when the audience recognizes it as one. Hallucinations in AI targeting are not jokes. They are predictably emergent failures of systems designed under pressure to produce more than they can safely justify. It is time to stop treating these failures as amusing metaphors and start treating them as avoidable design choices with lethal consequences.