AI Playground

AI Playground

Learn how AI language models work. Visualize tokenization, simulate temperature sampling, count tokens, build prompts. Interactive LLM tutorial and prompt engineering helper

"How do LLMs actually work" is best understood by watching the moving parts: tokens (the units the model sees), temperature (the randomness knob), and the prompt structure (system → user → assistant turns). This playground visualizes tokenization for major models (GPT-4, Claude, Gemini), simulates temperature sampling with the actual softmax math, counts tokens for the popular tokenizers (tiktoken, BPE), and helps you compose better prompts by showing what the model actually sees.

What a token actually is

Tokens are sub-word units the model operates on. GPT-4's tokenizer turns "hello world" into 2 tokens (hello, " world"); turns "tokenization" into 1 token; turns "anti-disestablishmentarianism" into 5 tokens (anti-, dis-, establish-, ment-, arian-ism). Common words and prefixes/suffixes are single tokens; rare/long words get split. Roughly: 1 token ≈ 0.75 English words ≈ 4 characters.

  • GPT-4 / GPT-3.5 — tiktoken cl100k_base encoder. ~100,256 tokens in vocabulary.
  • GPT-4o / o1 — o200k_base. ~200k tokens; better handling of non-English text.
  • Claude — separate BPE tokenizer; similar token counts for English to GPT-4 but different exact splits.
  • Llama / open-source — varies. Llama 3 has 128k tokens; older Llama had 32k.
  • Tokens carry no meaning by themselves — they are statistical units the model learned during training to be useful predictive units.

Working example: prompt tokenization

Input

Prompt: "Write a haiku about debugging."
Model: GPT-4

Output

Tokenization (tiktoken cl100k_base):
  "Write"     → token 4936
  " a"        → token 264
  " haiku"    → token 95041
  " about"    → token 922
  " debug"    → token 7585
  "ging"      → token 3252
  "."         → token 13

Total: 7 tokens.

Whole prompt represented as: [4936, 264, 95041, 922, 7585, 3252, 13]

Cost calculation:
  Input cost at $0.005/1K tokens: 7 × 0.000005 = $0.000035
  Output ~80 tokens at $0.015/1K tokens: $0.0012
  Total round-trip: ~$0.0013

The "debug-ging" split shows BPE: "debug" is a frequent token; "ging" is a frequent
suffix added separately. The model sees them as two units but learned during
training what they combine into.

The cost-per-token model is why GPT-4 / Claude cost approximately what they cost — they charge per token (input and output), and the cheap way to use them is short, well-structured prompts. Verbose prompts are not "smarter" — they cost more and often perform worse.

Temperature, top-p, and sampling

  • Temperature 0 — deterministic. Same prompt always produces the same output (modulo non-determinism in the API). Use for: code generation, structured output, factual recall.
  • Temperature 0.7 (default) — moderately creative. Good for general dialogue, content generation, brainstorming.
  • Temperature 1.0+ — high variance. Use for: creative writing, lateral thinking, generating multiple alternatives.
  • top_p (nucleus sampling) — only consider tokens whose cumulative probability is ≤ p. top_p=0.9 means "consider the top 90% of probable next tokens". Combined with temperature for finer control.
  • top_k ��� only consider the top k tokens regardless of probability. Less commonly used in 2026.
  • For deterministic critical applications (medical, legal, code), use temperature=0. For exploration and creative tasks, raise it.

Prompt engineering principles that survive 2026

  • Specify the output format. "Return JSON with fields..." beats "give me a summary". Schema-constrained output (JSON mode, structured outputs) eliminates most parsing failures.
  • Provide examples (few-shot). 1-3 input→output examples teach the model the format. Beats long verbose instructions.
  • Chain of thought / reasoning — "Think step by step before answering" improves accuracy on multi-step reasoning. For models with built-in reasoning (o1, Claude Extended Thinking), explicit prompt is less necessary.
  • Role prompting — "You are an expert engineer reviewing this code..." sets context. Modest improvement; sometimes over-emphasized.
  • Avoid negation — "Do not include code" is less reliable than "Provide only prose explanation". Models attend to positive instructions.
  • Be specific about constraints — length, tone, audience, what to exclude. Vague prompts produce vague outputs.

When to reach for this tool

  • You are estimating API costs for an LLM application — count tokens for typical input + output.
  • You are debugging "the model misunderstood my prompt" — see how tokenization fragments the input.
  • You are learning how LLMs work and want to experiment with sampling parameters.
  • You are building tooling around prompts and want to count tokens accurately before sending.

What this tool will not do

  • It will not run inference. The tool tokenizes and simulates sampling; for actual LLM responses, use the API of your chosen model (OpenAI, Anthropic, Google, etc.).
  • It will not optimize prompts for you. Improvement is iterative; tools like LangChain prompt-tuners are alternatives.
  • It will not give exact pricing for every model. Pricing changes; check provider websites for current rates.
  • It will not handle multimodal (images, audio) tokenization. Different models tokenize images differently (CLIP vs ViT vs DINOv2) — image tokenization is an active research area.

Frequently asked questions

Why does the same word produce different token counts in different models?

Different tokenizers. GPT-4 (cl100k_base) and Claude (BPE) have different vocabularies trained on different data. The same word may be one token in one model and three in another. For pricing, count tokens with the specific model's tokenizer.

Should I always use temperature 0?

For deterministic tasks (code, JSON, factual queries), yes — reduces variance and makes outputs reproducible. For creative tasks (writing, brainstorming), higher temperature explores more options. For chatbots, 0.5-0.8 is the typical range — engaging but coherent.

How long should my prompt be?

As short as possible while being unambiguous. Longer prompts cost more, take longer to process, and often perform worse (model attends to less of each part). Aim for clear instructions + necessary examples + structured output specification. Verbose is not better.

How accurate is token counting?

Exact for the tokenizer used. If you count with cl100k_base and call GPT-4o (which uses o200k_base), counts can differ by 10-30%. Use the matching tokenizer for the model you call.

Why do LLMs hallucinate?

They are next-token predictors. When uncertain, they produce a plausible-looking but factually wrong continuation rather than refusing or expressing uncertainty. Training methods like RLHF reduce but do not eliminate this. For factual accuracy, ground in retrieved documents (RAG) and verify outputs.

Is open-source competitive with GPT-4 / Claude in 2026?

For many tasks, yes. Llama 3 405B, DeepSeek V3, Qwen 2.5 72B perform comparably to GPT-4 on benchmarks. Commercial models still lead in some areas (reasoning, multilingual, multimodal). The gap is narrowing; open-source has 6-12 month lag on frontier models.

Related tools

Last updated · E-Utils editorial team