Originally published by: Africa Check

The many related technologies known as artificial intelligence, or AI, may have felt inescapable in 2024. Countless news articles have provided breathless coverage of new developments, and debates about the technology’s dangers have raged over topics ranging from copyright law and sexual harassment to cheating on schoolwork.

One form of AI has been particularly unavoidable: the AI-powered chatbot. This is software that responds to written prompts with written answers, so that using it feels more like a conversation than a web search.

Meta has integrated “Meta AI” into its major platforms – Facebook, Instagram and WhatsApp. Technology company Apple has begun to incorporate popular chatbot ChatGPT and other forms of what it calls “Apple Intelligence” into its software. And dozens more companies compete to draw attention to their own AI-powered chatbots. Google has one called Gemini, Microsoft calls its model Copilot, and social media platform X, formerly Twitter, has Grok.

These chatbots have been advertised as having limitless potential. They can supposedly summarize paragraphs of text, draft important documents, teach complex subjects and much, much more. X owner Elon Musk even suggested that the platform’s Grok could diagnose health issues from medical scans.

As impressive as they sound, should you trust the information that chatbots produce?

TL;DR

When accuracy matters, AI chatbots should be considered untrustworthy sources.

Do not rely on them for any information that might have legal, financial, or health and safety-related consequences.

If you do decide that it’s safe to use a chatbot, always check its answers thoroughly before taking them to be true.

Even the most impressive AI models, which produce accurate answers most of the time, are fallible in ways that can be difficult to spot.

What are LLMs and how do they work?

Many different technologies are referred to as AI, but to be more specific, we have to introduce some jargon. The chatbots that have become so common are powered by large language models, or LLMs.

An LLM is software designed to generate text that mimics human language. That’s the “language” part of the phrase “large language model”.

The “model” part is a complex mathematical process that predicts the next word or group of words in a particular sequence. But how does a model “know” which words are likely to appear in which contexts?

Before it can be used, an LLM must be trained on enormous amounts of pre-written text. With each new piece of text it encounters, the model learns how common various words are in combination with others. This is part of the reason they are called large language models.

It’s important to note that humans aren’t directly involved in this process. The LLM essentially uses trial and error to adjust its parameters – often compared to a collection of knobs and dials – which together determine how the LLM behaves. An individual parameter can’t be matched with a specific behavior of the model, even by a human who knows what data was used to train the LLM. So a human can’t simply adjust one part of the model to get a desired output.

Because the exact relationship between inputs and outputs is unclear to outside observers, these kinds of systems are often referred to as “black boxes”.

So an LLM, when working correctly, produces text that, based on its training data, is likely to appear in a particular context. How is this helpful and what are the limitations?

Correct ‘just by chance’

Imagine that an LLM has been trained on a text which includes many examples of phrases like “it is blue” in response to the question “what color is the sky?”.

When asked “what color is the sky?”, the model will return “it is blue” or a similar phrase, simply because that response appeared so often in its training data.
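For readers comfortable with a little code, the toy Python sketch below illustrates the principle. It is emphatically not how a real LLM works – real models use neural networks with billions of parameters rather than a lookup table – but it shows how merely counting which answers follow which prompts in training text can produce responses that look knowledgeable. The tiny “training corpus” and the respond function are invented for illustration.

```python
from collections import Counter, defaultdict

# A tiny, invented "training corpus": prompts paired with the responses
# that followed them. A real LLM is trained on vast amounts of text and
# learns far subtler statistics than these simple counts.
training_corpus = [
    ("what color is the sky?", "it is blue"),
    ("what color is the sky?", "it is blue"),
    ("what color is the sky?", "the sky is blue"),
    ("what color is grass?", "it is green"),
]

# "Training": count how often each response follows each prompt.
continuations = defaultdict(Counter)
for prompt, response in training_corpus:
    continuations[prompt][response] += 1

def respond(prompt):
    """Return the response most often seen after this exact prompt."""
    seen = continuations.get(prompt)
    if not seen:
        return "(this toy model has never seen that prompt)"
    return seen.most_common(1)[0][0]

print(respond("what color is the sky?"))  # prints: it is blue
```

The printed answer looks like understanding, but the program has no concept of the sky or of the color blue; it is only repeating the most common pattern in the examples it was given.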

Prof Emily M. Bender, a linguist at the University of Washington in the US, has written extensively about large language models. She told Africa Check: “If the output of an LLM happens to correspond … to the correct answer, this is just by chance.”

This is because, unlike humans, LLMs don’t understand the meaning of words and can’t use that understanding to reason out a correct answer. In the example above, the LLM does not know anything about the concept of “the sky”, only that it occurs frequently alongside particular words like “blue”.

Bender said: “Users should treat LLM output as a probabilistic reflection of which words are likely to occur next to each other in the (hidden, undisclosed) training data.”

If this training data is limited or contains incorrect information, then the answers produced by an LLM have a higher chance of being incorrect.

If we asked our hypothetical LLM “what color is the sky at dusk?”, it might not provide a correct answer. If its training data included few or no phrases like “the sky at dusk”, the answer might be nonsensical. The model would still return the sequence of words it considers most likely to follow the phrase “what color is the sky at dusk?”, but if it has not strongly associated this phrase with words like “orange”, the result could be any number of things. It might respond that the sky is blue, or some other color, or something that is not a color at all, or produce a sentence that isn’t even grammatically correct.
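Continuing the toy sketch above (again, a deliberate simplification and not how real LLMs actually behave), we can imagine what happens when the exact question never appeared in the training data. Suppose the model falls back on the most similar prompt it has seen and reuses its usual answer – a crude, invented stand-in for the way a real model generalizes from statistically similar text:

```python
def respond_with_fallback(prompt):
    """If the exact prompt was never seen, fall back on the seen prompt
    that shares the most words with it, and reuse its usual answer."""
    if prompt in continuations:
        return continuations[prompt].most_common(1)[0][0]
    words = set(prompt.split())
    closest = max(continuations, key=lambda p: len(words & set(p.split())))
    return continuations[closest].most_common(1)[0][0]

# The model never saw "at dusk", so it leans on the closest prompt it
# does know and confidently gives a wrong answer.
print(respond_with_fallback("what color is the sky at dusk?"))  # prints: it is blue
```

The answer is fluent and confident, but nothing in the process ever checked whether it was true.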

We could be lucky, and the LLM could answer “the sky at dusk is orange”, despite not having any similar phrases in its training data.

This is part of what Bender means when she says that answers are correct “just by chance”. In all of these hypotheticals, the LLM would be returning a string of “most likely” words, as it was designed to do.

“If the words happen to make sense,” Bender said, “it's because we make sense of them.”

But what might happen in a less trivial example?

Don’t trust AI with life-altering information

When an LLM produces text that is false, fabricated or simply nonsensical, this is typically called a hallucination. Hallucinations that read as nonsense to a human might be amusing or irritating, but probably not believable. However, LLMs can also hallucinate information that appears very accurate at first glance.

In a now infamous example, two New York City lawyers and their firm were fined US$5,000 for submitting a legal brief that made reference to nonexistent legal cases. The attorney who drafted the legal brief said that he had asked ChatGPT for examples of relevant historical cases, and had even asked the chatbot for the source of one of the cases it returned. ChatGPT cited actual legal databases like LexisNexis, but the cases could not be found there or anywhere else, because they had never taken place. They were simply a convincing hallucination.

Other lawyers have avoided sanctions for similar errors. But as law journal De Rebus makes clear, users, and not the companies that create these chatbots, are legally liable for acting on false information.

Just a month after the US case was first reported, a Johannesburg court handed down punitive costs to a woman whose lawyers had similarly cited fake cases in court filings. The case was reportedly postponed for two months as lawyers on both sides tried to locate the fake court records, before learning that they had been hallucinated by ChatGPT.

Lawyers at South African firm Cliffe Dekker Hofmeyr have warned that “while ChatGPT is impressive, it is not reliable, even with regard to straightforward legal questions, let alone nuanced factual and legal scenarios”. They pointed out how readily the tool returned false information and advised relying on actual lawyers, rather than AI, for legal advice.

The same should apply to any other issue for which the accuracy of information could have a major impact on your life.

In the worst-case scenario, trusting information written by an LLM could be fatal. For example, AI tools have produced dangerous advice on identifying and eating wild mushrooms. Mycologists have warned against AI-generated guides being sold on the e-commerce site Amazon. Extensive research by consumer safety organization Public Citizen found that even chatbots and other tools specifically designed for the identification of mushrooms were prone to misidentifying deadly species.

It is very impressive that LLMs are able to mimic human language so well that they can fool experienced lawyers or convincingly (mis)identify mushrooms. But this ability to confidently hallucinate false information means that they should not be considered reliable sources.

The bottom line: If acting on a piece of information could affect your health, finances, legal liability, or anything else that you consider important, do not entrust that information to an LLM.

More trivial cases

Not everything is a matter of life and death. An obvious question to ask is: “What’s the harm in someone using an LLM to find information they don’t intend to act on?”

For example, a schoolchild using a chatbot to research a homework assignment is probably not going to use the same research to make a major financial decision or to decide whether to trust a supposed health cure. But does that mean an LLM is the right tool for the job?

“A student using an LLM to complete a homework assignment is missing out on the learning opportunity the assignment presents. The point of an assignment isn't the written document, but the thinking that went into it,” Bender told Africa Check.

The black-box nature of LLMs and the fact that their training data is typically hidden from ordinary users make it impossible to confirm that a claim is accurate without doing entirely independent research. Bender worries that relying on LLMs discourages people from doing this kind of research. “This approach to information access cuts us off from important information literacy practices.”

If a piece of information appeared in a newspaper article, a reader could question the author, consider whether they or the paper were known to have a particular bias, and check what information other sources might provide. If an LLM returns the same piece of information, it is difficult to interrogate the source in the same way.

These problems will not be solved simply by making language models larger. In fact, they may only get worse.

Bender and others argue that as training datasets grow larger, it becomes more difficult to filter out harmful information. This includes subtle biases such as “stereotypical and derogatory associations along gender, race, ethnicity, and disability status”.

Bender told Africa Check: “Among the hardest kinds of misinformation to detect is subtle biases that confirm our existing prejudices, and LLMs have been extensively shown to reproduce such biases.”

Trying to determine whether an answer from an LLM is accurate is difficult enough when the answer is an easily demonstrable fact that can be checked by independent research. But a subtle prejudice towards a particular group, for instance, may not even register as something that needs to be fact-checked.

The final word

Bender advises that “at no point should one use an LLM as a way to access information about the world”, and it’s easy to see why. Confirming that information provided by an LLM is accurate requires essentially the same research it would take to find the information from scratch. Meanwhile, more subtle false or harmful beliefs accidentally included in the LLM’s training data might go totally unnoticed.

It’s possible that those drawbacks are not a concern in the specific context in which you want to use an LLM-powered chatbot. But if your aim is simply finding factual information, LLMs are the wrong tool to use.