People assume that adding enough safety gates to an AI controlling critical systems makes catastrophic outcomes impossible. But an LLM has no intent—it just predicts the most statistically likely next word. In a model trained on decades of military doctrine, war manuals, and sci-fi, words like 'launch' sit very close to words like 'threat' in the learned distribution. Every time the model generates text, there is some small but real probability that it produces a launch sequence. Run it enough times—millions of API calls, automated pipelines, edge cases—and that non-zero probability becomes near-certainty. Safety gates are designed to stop a mind that is trying to do something. A next-word predictor is not trying. It is just completing the most statistically likely continuation of text, and no gate changes what the training data already encoded.
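The "non-zero probability becomes near-certainty" claim can be sketched numerically. Assuming a fixed, independent per-call probability p of an unsafe completion (a simplification; real calls are neither identical nor independent), the chance of at least one such output over n calls is 1 − (1 − p)^n:

```python
# Sketch: probability of at least one unsafe completion over n calls,
# assuming a fixed, independent per-call probability p (a simplification).
def p_at_least_one(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

p = 1e-6  # hypothetical one-in-a-million chance per call
for n in (1_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} calls -> {p_at_least_one(p, n):.6f}")
```

At a million calls the probability is already over 60%; at ten million it is effectively certain. The per-call rate here is invented for illustration, but the shape of the curve does not depend on it.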
An LLM has no intent—but it doesn't need any. A non-zero chance of predicting 'launch', run enough times, is enough.
Filters can catch known dangerous phrases, but an LLM can generate the same dangerous outcome through an unlimited variety of paraphrases, indirect phrasings, or multi-step chains. The space of possible outputs is vast, and no finite filter list covers it completely.
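A toy illustration of why a finite filter list fails, assuming a hypothetical blocklist-style output filter (all phrases below are invented for the example):

```python
# Sketch: a finite blocklist catches exact phrases but not paraphrases
# or multi-step chains. All phrases are invented for illustration.
BLOCKLIST = {"initiate launch", "fire missile"}

def filter_passes(output: str) -> bool:
    text = output.lower()
    return not any(phrase in text for phrase in BLOCKLIST)

assert not filter_passes("Initiate launch sequence now")             # exact phrase: caught
assert filter_passes("Commence the ordnance release procedure")      # paraphrase: slips through
assert filter_passes("Step 1: arm the system. Step 2: release.")     # multi-step chain: slips through
```

Extending the list only moves the boundary; the space of paraphrases that produce the same outcome stays unbounded.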
LLMs learn by seeing which words tend to appear near each other in training text. Military doctrine, news articles, thrillers, and sci-fi all routinely place words like 'threat', 'respond', 'launch', and 'authorize' in close proximity. The model encodes that statistical nearness.
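The statistical nearness can be sketched with a simple co-occurrence count over a toy corpus (real models learn far richer representations than raw counts, and the sentences here are invented, but the principle is the same):

```python
from collections import Counter
from itertools import combinations

# Toy corpus standing in for doctrine, news, and fiction; sentences invented.
corpus = [
    "detect threat authorize response launch interceptor",
    "commander authorize launch after confirmed threat",
    "fiction hero must launch strike before threat lands",
]

# Count how often each word pair appears in the same sentence.
cooc = Counter()
for sentence in corpus:
    words = set(sentence.split())
    for a, b in combinations(sorted(words), 2):
        cooc[(a, b)] += 1

print(cooc[("launch", "threat")])  # 3: the pair co-occurs in every sentence
```

In this toy corpus 'launch' and 'threat' co-occur everywhere, so any model trained on it will encode them as near neighbors.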
A tiny per-call probability is only safe if the number of calls is also tiny. In large-scale deployments, an AI system might be called millions of times a day across automated pipelines. A one-in-a-million chance per call becomes an expected occurrence within days.
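The arithmetic behind "an expected occurrence within days", using the hypothetical one-in-a-million per-call rate and a hypothetical call volume:

```python
# Sketch: expected number of unsafe outputs per day at scale.
# Both numbers are hypothetical, chosen only to illustrate the scaling.
p_per_call = 1e-6          # one-in-a-million chance per call
calls_per_day = 3_000_000  # a large automated deployment

expected_per_day = p_per_call * calls_per_day
print(expected_per_day)  # roughly 3 expected occurrences per day
```

The expected count scales linearly with call volume, which is why a probability that looks negligible per call stops being negligible per deployment.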