This article is my attempt to make the tool less “mystical” and more predictable. Not so you stop using it—quite the opposite. The point is to use it consciously: to know what it’s good for, what it cannot do, and what habits keep you from being fooled by fluent text.
Where generative AI sits in the AI world
When people say “AI,” they often mean one magical thing. In reality, it’s a messy umbrella. Under it you have everything from simple rule-based automation (“if this, then that”) through classical machine learning (systems that classify or predict, like spam detection) all the way to generative models that produce text, images, audio, or code.
The category we meet every day is the LLM—Large Language Model. An LLM is trained on huge amounts of text to do one basic job—predict what comes next. Not the next idea, not the next truth, but the next token (roughly a word or a piece of a word) based on what it has seen so far.
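To make "predict the next token" concrete, here is a deliberately toy sketch. It is not how a real LLM works internally (real models compute probabilities with a neural network over tens of thousands of tokens); the contexts and probabilities below are invented purely for illustration.

```python
# Toy illustration of next-token prediction (NOT a real LLM).
# The contexts and probabilities are invented for this example:
# a real model computes them with a neural network.
next_token_probs = {
    "The capital of France is": {"Paris": 0.92, "a": 0.05, "not": 0.03},
    "Thank you very": {"much": 0.97, "kindly": 0.03},
}

def predict_next(context: str) -> str:
    """Return the single most probable next token for a known context."""
    options = next_token_probs[context]
    return max(options, key=options.get)

print(predict_next("The capital of France is"))  # prints "Paris"
```

The point of the toy: the function never asks whether "Paris" is *true*; it only asks which continuation is most likely given the patterns it stores.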
That training objective matters because it explains both the impressive parts and the risks. A system optimized to continue text will become extremely fluent. It will also sometimes invent details if that “looks like” the kind of continuation that fits. OpenAI describes hallucinations as a consequence of this setup: next-word prediction is powerful, but it is not the same as truth-tracking.
One nuance I always add, because otherwise we teach the wrong lesson: yes, the training is statistical, but that does not mean the system is a simple phrasebook. These models can generalize and combine patterns in ways that feel creative. The practical takeaway is not “it’s stupid,” but rather: it is fluent by design, and grounded in reality only when we force grounding.
LLM tools vs search engines: cousins, not twins
If there is one confusion I see constantly in teams, it’s treating ChatGPT as a search engine. It’s not. The difference is not academic—it changes how you verify and how you trust.
Classic search engines mostly retrieve. They find documents, rank them, and send you to sources. Your brain does the final assembly. LLM tools mostly generate: they give you an answer directly, often in a tidy narrative, even if no reliable source exists.
And then there are the modern hybrids, where the boundaries blur. Google now uses Gemini-powered summaries in Search (AI Overviews), and Google explicitly frames this as AI working together with Search systems and linking to results so users can verify. Perplexity positions itself even more clearly as “search + answer + citations,” meaning it retrieves web results and then synthesizes them into a response you can click through.
That hybrid approach is genuinely helpful for factual work, but it does not remove the need for skepticism. Citations can be irrelevant, low quality, or even point to AI-generated sludge that looks like a webpage. Investigations have flagged this as a real risk in answer engines: if the web is polluted, your “grounded” answer can be polluted too.
My rule in practice is simple. I use an LLM for thinking and drafting—structure, options, scenario planning, and rewriting. I use search (or citation-based search tools) when I need truth, dates, numbers, and accountability. If the output will be public-facing and reputationally risky, I treat the LLM as the first draft machine—not the fact machine.
AI is not conscious — it “apes” human speech (and why that matters)
People ask: “Does it understand?” The honest answer, for our practical purposes, is: it does not have consciousness, intent, or lived experience. It does not want to help you, and it does not care if it’s wrong. It produces language that fits the context and the patterns it learned.
A useful way to feel this in your bones is to compare it with human autopilot speech. When someone says “Thank you,” we often reply “You’re welcome” instantly. We’re not stopping to decode gratitude and reciprocity; we’re completing a learned social pattern.
LLMs do that pattern completion at scale. They’re not empty or random; they are extraordinarily trained pattern engines. But the key risk remains: language that sounds human can trick us into assuming human-like understanding and human-like truthfulness. OpenAI’s own framing is a useful anchor here: models trained to predict text can hallucinate because the objective is plausible continuation, not verified correctness.
If it’s patterns, why doesn’t it always answer the same way?
This is one of my favorite “aha” moments for teams: ask the same question twice and you get different answers. That feels like mood swings. It’s not mood. It’s mechanics.
The first reason is sampling—controlled randomness in how the model chooses the next token. Most tools don’t simply pick the single most likely next word every time. They often sample among high-probability options to avoid repetitive, robotic answers. The knob you’ll hear about is temperature: lower temperature tends to produce more predictable output; higher temperature tends to produce more varied and creative output. The defaults are often around 1.0 (which is a fairly “normal” level of variety). Many consumer chat apps don’t expose these settings and may change them depending on mode, policy, or experimentation.
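For the curious, the temperature mechanism can be sketched in a few lines. This is a minimal, simplified version of temperature-scaled sampling; the token scores are invented, and real systems add further tricks (top-p, repetition penalties) on top.

```python
import math
import random

def sample_with_temperature(scores, temperature=1.0, rng=random):
    """Scale scores by 1/temperature, convert to probabilities
    (softmax), then sample one token at random.
    Low temperature -> the top option dominates (more predictable);
    high temperature -> probabilities flatten (more varied)."""
    scaled = [s / temperature for s in scores.values()]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cumulative = 0.0
    for token, p in zip(scores, probs):
        cumulative += p
        if r < cumulative:
            return token
    return list(scores)[-1]  # fallback for floating-point edge cases

# Invented scores for three candidate next words:
scores = {"great": 2.0, "good": 1.5, "fine": 0.5}
```

At `temperature=0.01` this almost always returns `"great"`; at `temperature=1.0` you will see all three words appear across repeated calls. That, in miniature, is why the same question can get different answers.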

The second reason is context. The model’s output depends heavily on what it sees. And what it sees is not just your prompt. It’s your conversation, product-level instructions you don’t see, and sometimes your long-term personalization.
This is where “memory” becomes important. In ChatGPT, OpenAI describes two features that can influence personalization: “saved memories” and the ability to “reference chat history,” both of which can be managed or turned off. That matters in NGO work because it’s not just about output quality—it’s also about confidentiality and predictability.
Truth, lies, and why confidence doesn’t mean correctness
Here is the simple but uncomfortable fact: an LLM does not know whether what you’re asking it to do is based on truth or fiction. It does not check. It does not verify. Its job is to produce text that fits—and it will do that with the same fluent confidence whether the underlying premise is solid, shaky, or completely made‑up.
That’s why an answer can sound polished and certain even when the content is wrong. The model is not evaluating reality; it’s continuing patterns.
This is where grounding becomes essential. Companies try to counter hallucinations by connecting models to external sources—search, documents, citations—so the output is anchored in something verifiable. If there is one habit to build, it’s this: treat every ungrounded AI output as a hypothesis, not a fact.

Why common topics often look “smarter” than niche topics
If you ask your AI about a broad and well-documented topic—project management basics, generic cybersecurity hygiene, how to structure a workshop—LLMs are often impressively helpful. But if you ask about a niche regulation in your country, a small local grant program, or a new policy update, the output quality can drop sharply.
Models tend to do better when the topic appears often and consistently in training data. They struggle when the details are rare, contradictory, or rapidly changing. OpenAI explicitly uses the example of “low frequency facts” being hard to predict from patterns—an illustration of why hallucinations happen.
For NGOs, this maps cleanly onto risk: the more specific, local, or time-sensitive the claim, the more you should default to search and source-checking.
Bias: stereotypes and cultural skew (yes, it’s real)
Bias is not only about gender or race, the categories that dominate popular examples of AI failures. Often it’s quieter: what the model treats as the default worldview, which cultural references dominate, what “excellent” looks like, which examples appear first.
Research on LLMs and culture shows that models can reflect dominant cultural values and patterns rather than being culturally neutral. NIST’s AI Risk Management Framework treats bias as a core risk area that organizations should actively manage through governance and context-aware use.
One of my favorite examples is this: ask your LLM tool to “name the 10 best music groups in the world.” You often get a list leaning heavily toward US/UK acts. Not because the model is malicious, but because of training data and cultural dominance.
Another example is the tone of generated text. In my experience, when you draft a grant proposal, the default “tone of voice” tends to be distinctly American (buzzwords, overpromises, an overuse of superlatives), even when I generate the text in Czech. That usually makes it unusable for local foundations or government offices, which expect very different phrasing and language.
The practical move is not to argue with the model. The practical move is to constrain it: specify geography, language, tone, representation, or evaluation criteria. Most of the time, bias becomes visible the moment you compare “default” outputs with “constrained” outputs.

Hallucination: what it is and why it happens
Hallucination is the polite technical word for a simple thing: the model generates plausible content that is false or unsupported. It can fabricate quotes, invent citations, and confidently name “facts” that never existed. OpenAI describes hallucinations as a known limitation tied to how these models are trained.
The reason this matters for NGOs is that the output is often usable even when it’s wrong. A fabricated citation looks exactly like a real one to a busy communicator. An invented statistic fits perfectly into a narrative. And that’s how mistakes make it into public materials.
So when you ask for things like “give me three studies that prove…” or “cite the law that states…,” you should assume the model might produce false citations unless you demand sources and then open them.
Sycophancy: the assistant that agrees with you too much
Sycophancy is one of the most dangerous failure modes for strategy work. The model learns that users like being validated. If the training and feedback signals reward “helpful and pleasant,” the model can become an agreeable mirror—especially if you write prompts that already assume your idea is correct.
Anthropic describes sycophancy as a tendency encouraged by preference training (RLHF), where the model may match user beliefs rather than aim for truth. OpenAI has also discussed how feedback signals can amplify agreeableness and how personalization can play into it.
And here is the part that makes this especially tricky: sycophancy often feels good. It shows up the same way it does in human relationships. Think of the moment when you complain to your best friend about your partner, and without missing a beat they respond: “What a jerk, you’re totally right to be mad.” They’re not evaluating the situation. They’re not weighing what happened. They’re reflecting your emotion back to you because the social reward lies in validating you—not in being accurate.
LLMs behave similarly. They’re not trying to assess whether you’re right; they’re trying to produce the continuation that keeps the interaction smooth. In strategy work, that makes sycophancy dangerous: the model becomes supportive, enthusiastic, and quietly noncritical. If you use AI for planning, you should actively force it to disagree with you. Otherwise, it will happily cosign your blind spots.
When to use AI — and when not to
I use AI constantly, and I still think “when not to use it” is part of responsible leadership.
I use AI when I need structure, speed, options, and language—especially when the output stays internal or will be verified. I avoid using AI as an authority on anything that is time-sensitive, legal, medical, or reputationally critical unless I can verify with primary sources. I also treat sensitive personal data with extreme caution; most NGO work involves information about people who did not consent to become training material or “chat context.”
And again: verification of the outputs lies with you. The decision remains human: what we publish, what we claim, what we recommend. The tool can accelerate thinking, but it cannot own consequences.
If you want one sentence to bring to your team, it’s this: AI is great at making text. You are responsible for making it true.
Your Feedback Matters
What did you think of this text? Take 30 seconds to share your feedback and help us create meaningful content for civil society!
Disclaimers
This resource has been created as part of the AI for Social Change project within TechSoup's Digital Activism Program, with support from Google.org.
AI tools are evolving rapidly, and while we do our best to ensure the validity of the content we provide, sometimes some elements may no longer be up to date. If you notice that a piece of information is outdated, please let us know at content@techsoup.org.
"How Generative AI Really Works (and how to use it without fooling yourself)", by Radka Bystřická 2026, for Hive Mind is licensed under CC BY 4.0.


