
Understanding Large Language Models: What You Need to Know


Post #2 in the Complete Prompt Engineering Series

Welcome back! In What is Prompt Engineering? A Complete Introduction, you learned what prompt engineering is and why it matters. Now we're going deeper: understanding the engine under the hood. You don't need to become a machine learning engineer, but understanding how LLMs work will transform how you prompt them.

Why Understanding LLMs Makes You a Better Prompt Engineer

Here's a question: Would you be a better driver if you understood how an engine works? Maybe, maybe not—but you'd definitely be better at diagnosing problems and maximizing performance.

The same applies to prompt engineering.

When you understand:

  • Why temperature settings change output personality

  • How tokens affect your costs and results

  • Why context windows are your biggest constraint

  • What makes different models excel at different tasks

...you stop guessing and start engineering.

This post will give you that X-ray vision. By the end, when an AI gives you an unexpected response, you'll know exactly why—and how to fix it.

Part 1: How Large Language Models Actually Work (The Simplified Truth)

The Core Concept: Statistical Prediction at Scale

Let's start with a truth that changes everything:

Large Language Models don't "understand" language. They predict it.

Imagine you're playing a game where I say: "The cat sat on the..."

Your brain instantly suggests: mat, chair, windowsill, table—all reasonable completions. That's essentially what an LLM does, except:

  1. It analyzes trillions of word patterns from its training data

  2. It calculates the probability of each possible next word

  3. It selects based on those probabilities (modified by parameters we'll discuss)

  4. It repeats this process word by word (strictly, token by token—more on that in Part 2) until the response is complete

Key insight: Every word you see from an AI is a prediction based on:

  • The patterns it learned during training

  • Your prompt (the context you provided)

  • The parameters you've set (temperature, etc.)
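
To make that loop concrete, here is a minimal sketch in Python. The toy probability table is invented for illustration (a real LLM computes these probabilities with billions of parameters), but the one-token-at-a-time structure is faithful to how generation works:

```python
import random

def toy_next_token_probs(context: str) -> dict:
    """Stand-in for a real model: maps context to next-token probabilities."""
    if context.endswith("the"):
        return {"mat": 0.5, "chair": 0.2, "windowsill": 0.2, "moon": 0.1}
    return {"<end>": 1.0}

def generate(prompt: str, max_tokens: int = 10) -> str:
    text = prompt
    for _ in range(max_tokens):
        probs = toy_next_token_probs(text)
        tokens, weights = zip(*probs.items())
        next_token = random.choices(tokens, weights=weights)[0]  # sample by probability
        if next_token == "<end>":
            break
        text += " " + next_token
    return text

print(generate("The cat sat on the"))  # e.g. "The cat sat on the mat"
```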

The Transformer Architecture: The Breakthrough That Changed Everything

In 2017, Google researchers published a paper called "Attention is All You Need" that introduced the Transformer architecture. This was the earthquake that created the AI revolution we're experiencing.

What made it revolutionary?

Before Transformers: Sequential Processing

Older models (RNNs, LSTMs) had to process text sequentially—one word at a time, left to right. This meant:

  • ❌ Slow processing

  • ❌ Limited context understanding

  • ❌ Difficulty with long-range dependencies

  • ❌ Couldn't parallelize (use multiple processors simultaneously)

Example problem:
In the sentence "The chef who trained in France and Italy and worked at multiple Michelin-starred restaurants opened his new restaurant," the word "his" refers back to "chef"—but there are 13 words in between. Older models struggled with these long-range connections.

The Transformer Revolution: Attention Mechanism

Transformers introduced self-attention, which allows the model to:

  • ✅ Look at all words simultaneously

  • ✅ Understand relationships between any words in the input

  • ✅ Process in parallel (much faster)

  • ✅ Handle long-range dependencies effectively

The "Attention" mechanism answers: "When processing this word, which other words in the sequence should I pay attention to?"

Visual analogy:

  • Old way: Reading a book by looking at one word at a time through a tiny hole

  • Transformer way: Seeing the entire page at once and understanding how every word relates to every other word
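
For the curious, here is a hedged numpy sketch of the core computation, scaled dot-product attention from the 2017 paper: softmax(QKᵀ/√d)·V. The random matrices stand in for learned weights; the point to notice is that every token's new representation is a weighted mix of all tokens, computed in one parallel step:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # token-to-token relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                              # each output mixes all values

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                  # 6 tokens, 8-dim vectors (toy sizes)
X = rng.normal(size=(seq_len, d_model))  # token embeddings
# In a real Transformer, Q, K, V come from learned linear projections of X.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(attention(X @ Wq, X @ Wk, X @ Wv).shape)  # (6, 8): one new vector per token
```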

The Three Components of a Transformer

1. Input Embeddings

  • Your text is converted into numerical vectors (we'll cover this in tokenization)

  • These vectors capture semantic meaning: "king" - "man" + "woman" ≈ "queen"

2. Encoder-Decoder Architecture (or Decoder-Only)

  • Encoder: Processes the input and creates a rich representation

  • Decoder: Generates the output based on that representation

  • Modern LLMs (the GPT series, LLaMA) use a decoder-only architecture for generation tasks

3. Attention Layers (The Magic)

  • Multiple layers of attention mechanisms

  • Each layer learns different patterns: grammar, facts, reasoning, style, etc.

  • GPT-4 is estimated to have 120+ layers; Claude's internals are undisclosed but likely of similar depth

Training: How Models Learn

Phase 1: Pre-training (The Foundation)

The model reads massive amounts of text from the internet and learns to predict the next word.

Scale we're talking about:

  • GPT-3: Trained on ~45TB of text (300 billion tokens)

  • GPT-4: Estimated 10-20 trillion tokens

  • LLaMA 2: 2 trillion tokens

  • Claude 3: Estimated similar or greater scale

What it learns:

  • Language patterns and grammar

  • Factual knowledge (though imperfectly)

  • Reasoning patterns

  • Common sense associations

  • Cultural and domain knowledge

  • Unfortunately, also biases present in training data

Training objective: Given this sequence of words, predict the next one. Repeat billions of times across trillions of words.

Phase 2: Fine-Tuning (Specialization)

After pre-training, models undergo additional training:

a) Supervised Fine-Tuning (SFT)

  • Human-written examples of good responses

  • Teaches the model to follow instructions

  • "When someone asks X, respond like Y"

b) Reinforcement Learning from Human Feedback (RLHF)

  • Humans rank multiple model outputs

  • Model learns which responses humans prefer

  • This is why ChatGPT and Claude are so much better at conversation than raw GPT-3

The result: A model that can follow complex instructions, maintain context, and generate human-like text.

Parameters: The Model's "Memory"

You've probably heard models described by their parameter count: "GPT-4 has 1.7 trillion parameters!"

What are parameters?
Think of them as the model's "knowledge weights"—the numerical values that determine how the model processes and generates text.

Model sizes:

  • GPT-3: 175 billion parameters

  • GPT-4: ~1.7 trillion parameters (estimated; rumored to be a mixture of 8 experts)

  • Claude 3 Opus: Estimated ~500B-1T parameters

  • LLaMA 2 70B: 70 billion parameters

  • Gemini Ultra: Estimated 1.5T+ parameters

Bigger = better? Usually, but...

  • ✅ More parameters = more knowledge capacity

  • ✅ Better reasoning and nuance

  • ❌ Much more expensive to run

  • ❌ Slower response times

  • ❌ Not always necessary for simpler tasks

Practical implication: A 70B open-source model might be perfect for your use case, saving you 90% in costs compared to GPT-4.

Why This Matters for Prompting

Understanding this architecture explains:

1. Why context position matters
Attention isn't spread evenly across your prompt. As we'll see in Part 3, models tend to weight the beginning and end of the context more heavily than the middle.

Practical tip: Put the most important instructions near the start or end of your prompt, not buried in the middle.

2. Why the model can be confident yet wrong
It's predicting based on patterns, not retrieving facts from a database. High probability ≠ factually correct.

Practical tip: Request citations, use chain-of-thought prompting, verify critical information.

3. Why examples are so powerful
Examples directly influence the probability distribution for the next tokens.

Practical tip: Few-shot prompting works because you're literally showing the model the pattern you want.

4. Why some tasks are harder than others
Complex reasoning requires the model to maintain and manipulate abstract representations across many steps.

Practical tip: Break complex tasks into smaller steps (prompt chaining).

Part 2: Tokenization—The Hidden Language of AI

Here's something that will change how you write prompts forever: AI doesn't see words. It sees tokens.

What Are Tokens?

Tokens are the basic units of text that models process. They're not exactly words, not exactly characters—they're subword units.

The rough rule: 1 token ≈ 4 characters or ≈ 0.75 words in English

Examples:

Unknown"Hello world!" = 3 tokens ["Hello", " world", "!"]
"artificial intelligence" = 4 tokens ["art", "ificial", " intelligence"]
"ChatGPT" = 2 tokens ["Chat", "GPT"]
"antidisestablishmentarianism" = 6 tokens ["ant", "idis", "establish", "ment", "arian", "ism"]

Why not just use words?

  • Many languages don't have clear word boundaries (Chinese, Japanese)

  • Tokenization handles new/rare words better

  • More efficient processing

  • Handles numbers, punctuation, code, etc.

How Tokenization Works

Step 1: Byte-Pair Encoding (BPE)
The most common approach. The algorithm:

  1. Starts with individual characters

  2. Finds the most frequently occurring pairs

  3. Merges them into single tokens

  4. Repeats until reaching the desired vocabulary size

Result: Common words = 1 token, rare words = multiple tokens

Vocabulary sizes:

  • GPT-3: ~50,257 tokens

  • GPT-4: ~100,000 tokens (the cl100k_base encoding)

  • Claude: ~100,000 tokens

  • LLaMA: ~32,000 tokens
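
You can inspect BPE splits yourself. A quick sketch using OpenAI's open-source tiktoken library (`pip install tiktoken`); exact splits vary by encoding, so treat the earlier examples as illustrative rather than exact:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 / GPT-3.5 Turbo encoding

for text in ["Hello world!", "antidisestablishmentarianism", "1234"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]  # show each token as text
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")
```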

Why Tokenization Matters for You

1. Cost Calculation

API pricing is per token, not per word:

  • OpenAI GPT-4: $0.03/1K input tokens, $0.06/1K output tokens

  • Claude Opus: $0.015/1K input tokens, $0.075/1K output tokens

Your 100-word prompt isn't 100 tokens—it's more like 130-150 tokens.

Pro tip: Check your counts with a tokenizer tool (OpenAI's online tokenizer, or the tiktoken snippet shown earlier) before relying on estimates.
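
For budgeting, a tiny estimator helps. A sketch using the GPT-4 rates quoted above (prices change often, so verify against current pricing):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  in_price: float, out_price: float) -> float:
    """Rough API cost in dollars, given per-1K-token prices."""
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price

# GPT-4 rates quoted above: $0.03/1K input, $0.06/1K output
print(estimate_cost(1500, 500, 0.03, 0.06))  # 0.075 -> about 7.5 cents per call
```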

2. Context Window Limits

Models have token limits, not word limits:

  • GPT-4: 8K, 32K, or 128K tokens

  • Claude 3.5: 200K tokens

  • Gemini 1.5 Pro: 1M tokens

Common mistake: "100,000 words and 100,000 tokens are the same thing!"
Reality: 100,000 words ≈ 130,000-150,000 tokens. That fits Claude's 200K window, but it would exceed GPT-4 Turbo's 128K limit.

3. Token Efficiency

Some ways of writing use fewer tokens:

Example:

```text
❌ Inefficient (many tokens):
"The individual who is responsible for the management and oversight of..."

✅ Efficient (fewer tokens):
"The manager of..."
```

But don't obsess over this. Clarity trumps token optimization except at massive scale.

4. Why Some Words Are "Harder" Than Others

Words that tokenize into more pieces are harder for the model:

Easy (1 token): "the", "and", "computer", "science"
Harder (multiple tokens): "antidisestablishmentarianism", rare names, specialized jargon

This is why:

  • Models sometimes misspell unusual names

  • They struggle with very rare technical terms

  • They're better with common vocabulary

5. Numbers and Code

Numbers tokenize unpredictably:

Unknown"1234" might be ["123", "4"] or ["12", "34"] or ["1", "2", "3", "4"]

This is why LLMs are bad at arithmetic—they're predicting token sequences, not calculating.

Code tokenizes differently than prose:

```python
def calculate_total(items):
    return sum(items)
```

This might be 15-20 tokens, depending on how the tokenizer handles code syntax.

Practical Tokenization Tips for Prompting

1. Check your token count before submission
Especially for long prompts approaching context limits.

2. Be aware of token-heavy formats

  • JSON (lots of special characters)

  • Tables (formatting characters)

  • Code (syntax elements)

3. Don't pad unnecessarily
This doesn't help and wastes tokens: "Please, if you could, maybe, possibly..."

4. Use the model's native token count in API calls
Most APIs return usage.total_tokens—monitor this for cost tracking.

Part 3: Context Windows and Their Limitations

The context window is the total amount of text (in tokens) a model can consider at once—including both your prompt and its response.

Think of it as the model's "working memory."

Current Context Window Sizes

| Model | Context Window | Practical Capacity |
|---|---|---|
| GPT-3.5 Turbo | 16K tokens | ~12,000 words |
| GPT-4 | 8K tokens | ~6,000 words |
| GPT-4 (extended) | 32K tokens | ~24,000 words |
| GPT-4 Turbo | 128K tokens | ~96,000 words |
| Claude 3.5 Sonnet | 200K tokens | ~150,000 words |
| Claude 3 Opus | 200K tokens | ~150,000 words |
| Gemini 1.5 Pro | 1M tokens | ~750,000 words |
| LLaMA 3 | 8K-32K tokens | ~6,000-24,000 words |

The race is on: Companies are competing to offer larger context windows.

Why Context Windows Matter

1. They Define What You Can Input

Want to analyze an entire book? You need:

  • "The Great Gatsby" = ~47,000 words = ~63,000 tokens

  • Required: 128K+ context window

2. They Include the Response

If you have 8K tokens total:

  • Your prompt uses 6K tokens

  • Only 2K tokens remain for the response

  • That's ~1,500 words maximum output

Common issue: Long prompts with insufficient room for responses.

3. They Affect Attention Quality

Lost in the Middle Problem: Research shows models pay more attention to:

  • The beginning of the context

  • The end of the context

  • Less attention to information in the middle

Study findings (Liu et al., 2023):

  • Information at the start: ~70% retrieval accuracy

  • Information in the middle: ~40% retrieval accuracy

  • Information at the end: ~65% retrieval accuracy

Practical implication: Put critical information at the beginning or end of your prompt.

Context Window Limitations

1. Processing Time

Larger contexts = longer processing:

  • 8K tokens: ~2-5 seconds

  • 128K tokens: ~10-30 seconds

  • 1M tokens: Several minutes

2. Cost

You pay for every token in the context:

```text
Example: Analyzing a 100K-token document (at Claude 3 Opus input rates)
Input cost: 100K tokens × $0.015/1K = $1.50 per query
If you query 1,000 times: $1,500
```

3. Quality Degradation

Models perform worse as context windows fill:

  • Needle-in-a-haystack tests show retrieval accuracy can drop significantly once the window is more than about half full

  • Complex reasoning degrades faster than simple retrieval

4. The Recency Bias

Models weight more recent information higher. In long contexts:

  • Earlier information may be "forgotten"

  • Later information dominates

Strategies for Working Within Context Limits

1. Summarization

```text
Step 1: Summarize document in chunks
Step 2: Combine summaries
Step 3: Analyze final summary
```

2. Sliding Window
Process text in overlapping chunks:

```text
Chunk 1: Tokens 0-8,000
Chunk 2: Tokens 6,000-14,000
Chunk 3: Tokens 12,000-20,000
```

3. Retrieval-Augmented Generation (RAG)
Don't put everything in context—retrieve only relevant sections:

```text
1. Index all documents
2. User asks question
3. Retrieve top 5 relevant passages
4. Feed only those to the model
```

We'll cover this in detail in Post #18.
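
As a preview, here is a toy sketch of the retrieval step: score every passage against the question by cosine similarity of embeddings, then keep only the top hits. The random vectors stand in for real embeddings from an embedding model:

```python
import numpy as np

rng = np.random.default_rng(0)
passages = [f"passage {i}" for i in range(100)]
passage_vecs = rng.normal(size=(100, 64))   # stand-ins for real embedding vectors
question_vec = rng.normal(size=64)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([cosine(question_vec, v) for v in passage_vecs])
top_5 = np.argsort(scores)[::-1][:5]        # indices of the 5 most similar passages
context = "\n\n".join(passages[i] for i in top_5)
# Prompt the model with `context` plus the question, not with all 100 passages.
```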

4. Prompt Compression
Remove unnecessary verbosity:

Unknown❌ "I would really appreciate it if you could possibly help me understand..."
✅ "Explain..."

5. Stateful Conversations
For chatbots, summarize history periodically:

```text
Every 10 messages:
- Summarize conversation so far
- Replace old messages with summary
- Continue with compressed context
```

The Future of Context Windows

Trends to watch:

  • Infinite context: Research into models without fixed context limits

  • Hierarchical attention: Models that process different context levels differently

  • External memory: Models that can read/write to external storage

  • Selective attention: Models that automatically focus on relevant parts

But for now: Work within the constraints. They're getting better, but they're not going away soon.

Part 4: Parameters That Control Output (Temperature, Top-p, and More)

These are the knobs and dials that change how the AI generates text. Understanding them is like understanding shutter speed and aperture in photography—technical knowledge that unlocks creative control.

Temperature: The Creativity Dial

Definition: Controls randomness in token selection. Range: 0.0 to 2.0 (practical range: 0.0 to 1.5)

How it works:

The model calculates probabilities for all possible next tokens:

Unknown"The sky is ____"
- blue: 40% probability
- clear: 25% probability
- cloudy: 20% probability
- falling: 0.1% probability

Temperature modifies these probabilities:

Temperature = 0 (Deterministic)

  • Always picks the highest probability token

  • Same prompt = same response every time

  • Maximum consistency, zero creativity

Temperature = 0.7 (Default/Balanced)

  • Mostly picks high-probability tokens

  • Some variation allowed

  • Balance of consistency and creativity

Temperature = 1.5 (High Creativity)

  • Flattens probability distribution

  • Low-probability tokens get more chances

  • More creative, more unpredictable

  • Higher risk of nonsense

Visual representation:

```text
Low temp (0.2):   ████████ blue
                  ██ clear
                  █ cloudy

High temp (1.5):  ████ blue
                  ███ clear
                  ██ cloudy
                  █ falling (now more likely!)
```
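
Under the hood, temperature divides the model's raw scores (logits) before they're turned into probabilities. A numpy sketch with made-up logits for the example above:

```python
import numpy as np

def temperature_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax with temperature: T < 1 sharpens the distribution, T > 1 flattens it."""
    if temperature == 0:
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0   # T = 0: always the single most likely token
        return probs
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

tokens = ["blue", "clear", "cloudy", "falling"]
logits = np.array([4.0, 3.5, 3.3, -2.0])  # made-up raw scores for "The sky is ___"
for t in (0.2, 0.7, 1.5):
    print(t, dict(zip(tokens, temperature_probs(logits, t).round(3))))
```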

When to Use Different Temperatures

Temperature 0 - 0.3: Maximum Precision

  • ✅ Factual Q&A

  • ✅ Code generation

  • ✅ Data extraction

  • ✅ Classification tasks

  • ✅ Legal/medical applications

  • ✅ Anything requiring consistency

Example prompt:

```text
Temperature: 0
Extract the following from this email: sender, date, main request, urgency level.
```

Temperature 0.7 - 1.0: Balanced (Default)

  • ✅ General conversation

  • ✅ Explanations

  • ✅ Content writing

  • ✅ Problem-solving

  • ✅ Most use cases

Example prompt:

```text
Temperature: 0.7
Write a professional email declining a meeting request.
```

Temperature 1.0 - 1.5: Maximum Creativity

  • ✅ Creative writing

  • ✅ Brainstorming

  • ✅ Unique content generation

  • ✅ Exploring unconventional ideas

  • ❌ Not for factual accuracy

Example prompt:

```text
Temperature: 1.3
Generate 20 unusual marketing campaign ideas for an artisanal pickle company.
```

Temperature 1.5 - 2.0: Experimental

  • Often produces nonsensical or incoherent results

  • Rarely useful in practice

  • Fun for experimentation

Top-p (Nucleus Sampling): The Alternative Control

Definition: Instead of temperature, controls which tokens are considered by cumulative probability. Range: 0.0 to 1.0

How it works:

Top-p = 0.9 means "consider tokens until their cumulative probability reaches 90%"

```text
Token probabilities:
- blue: 40% (cumulative: 40%)
- clear: 25% (cumulative: 65%)
- cloudy: 20% (cumulative: 85%)
- bright: 10% (cumulative: 95%) ← stops here at top-p = 0.9
- falling: 5% (excluded)
```

The difference from temperature:

  • Temperature: Adjusts ALL probabilities

  • Top-p: Excludes low-probability tokens entirely

Typical values:

  • Top-p = 0.1: Very conservative (top 10% of probability mass)

  • Top-p = 0.5: Moderately conservative

  • Top-p = 0.9: Balanced (default for many models)

  • Top-p = 1.0: Consider all tokens (no filtering)

Pro tip: Use EITHER temperature OR top-p, not both. Combining them can produce unexpected results.
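
A sketch of the nucleus filter itself, using the probabilities from the example above; note how "falling" is cut entirely rather than merely down-weighted:

```python
import numpy as np

def top_p_filter(probs: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]              # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # include the token that crosses p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()             # renormalize the survivors

# blue, clear, cloudy, bright, falling
probs = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
print(top_p_filter(probs, 0.9))  # "falling" is zeroed out; the rest renormalized
```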

Other Important Parameters

Max Tokens (Max Length)

Definition: Maximum number of tokens in the response.

Use cases:

  • Controlling costs (shorter = cheaper)

  • Enforcing brevity

  • Ensuring responses fit in UI elements

Common mistake:

```text
❌ Setting max_tokens = 50 and asking for a detailed essay
Result: Response will be cut off mid-sentence
```

Best practice:

```text
✅ Set max_tokens to slightly more than the expected output
For a 500-word response: max_tokens = 700-800
```

Frequency Penalty

Definition: Reduces likelihood of repeating the same tokens. Range: -2.0 to 2.0

How it works:

  • 0: No penalty (default)

  • Positive values (0.5 - 1.0): Discourages repetition

  • Negative values: Encourages repetition (rarely useful)

Use when:

  • Model is being repetitive

  • You want more diverse vocabulary

  • Generating lists or variations

Example:

```text
Without frequency penalty:
"The product is great. The product is amazing. The product is fantastic."

With frequency penalty (0.7):
"The product is great. It's amazing. This is fantastic."
```

Presence Penalty

Definition: Encourages the model to introduce new topics. Range: -2.0 to 2.0

How it works:

  • 0: No penalty (default)

  • Positive values: Encourages talking about new concepts

  • Negative values: Encourages staying on topic (rarely used)

Use when:

  • Model keeps circling back to same points

  • You want broader coverage

  • Brainstorming diverse ideas

The difference:

  • Frequency penalty: "Don't use the same WORDS repeatedly"

  • Presence penalty: "Don't talk about the same TOPICS repeatedly"
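
OpenAI's API reference describes both penalties as simple logit adjustments: subtract frequency_penalty once per prior occurrence of a token, plus presence_penalty once if it has appeared at all. A sketch of that arithmetic:

```python
import numpy as np

def apply_penalties(logits: np.ndarray, counts: np.ndarray,
                    frequency_penalty: float, presence_penalty: float) -> np.ndarray:
    """Penalize tokens by how often (frequency) and whether (presence) they appeared."""
    return logits - counts * frequency_penalty - (counts > 0) * presence_penalty

logits = np.array([2.0, 1.8, 1.5])  # scores for ["product", "great", "amazing"]
counts = np.array([3, 1, 0])        # occurrences of each token so far in the output
print(apply_penalties(logits, counts, frequency_penalty=0.7, presence_penalty=0.0))
# "product" (already used 3 times) falls from 2.0 to -0.1, so repetition fades
```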

Stop Sequences

Definition: Tokens that signal the model to stop generating.

Common use:

```python
stop_sequences = ["\n\n", "###", "END"]
```

Use cases:

  • Generating until specific delimiter

  • Stopping at natural breakpoints

  • Controlling output format

Example:

UnknownPrompt: "List 5 ideas. Use ### to separate each idea."
Stop sequence: "###"

Output: "Idea 1: Product launch ###"
(stops, preventing premature continuation)
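
Client-side, the effect is equivalent to truncating at the first stop sequence found. A minimal sketch (real APIs do this server-side and, by convention, don't return the stop sequence itself):

```python
def truncate_at_stop(text: str, stop_sequences: list) -> str:
    """Cut generated text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(truncate_at_stop("Idea 1: Product launch ### Idea 2: ...", ["###", "END"]))
# -> "Idea 1: Product launch "
```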

Parameter Combinations for Common Tasks

1. Factual Q&A

```json
{
  "temperature": 0.1,
  "max_tokens": 500,
  "top_p": 0.9,
  "frequency_penalty": 0,
  "presence_penalty": 0
}
```

2. Creative Writing

```json
{
  "temperature": 1.0,
  "max_tokens": 2000,
  "top_p": 0.95,
  "frequency_penalty": 0.5,
  "presence_penalty": 0.3
}
```

3. Code Generation

```json
{
  "temperature": 0.2,
  "max_tokens": 1500,
  "top_p": 0.9,
  "frequency_penalty": 0.2,
  "presence_penalty": 0
}
```

4. Brainstorming

```json
{
  "temperature": 1.2,
  "max_tokens": 1000,
  "top_p": 0.95,
  "frequency_penalty": 0.8,
  "presence_penalty": 0.6
}
```

5. Conversation

```json
{
  "temperature": 0.7,
  "max_tokens": 800,
  "top_p": 0.9,
  "frequency_penalty": 0.3,
  "presence_penalty": 0.1
}
```
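
These presets map directly onto API parameters. As one example, here's how the factual-Q&A preset looks with the OpenAI Python SDK (other providers expose near-identical fields); the model name and message are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Extract the sender, date, and urgency from this email: ..."}],
    temperature=0.1,        # the factual Q&A preset above
    max_tokens=500,
    top_p=0.9,
    frequency_penalty=0,
    presence_penalty=0,
)
print(response.choices[0].message.content)
print(response.usage.total_tokens)  # monitor this for cost tracking
```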

Experimentation Framework

To find optimal parameters:

  1. Start with defaults (temp: 0.7, top_p: 0.9)

  2. Test one variable at a time

  3. Run multiple times (at temp > 0, outputs vary)

  4. Measure results against your criteria

  5. Document what works

Pro tip: Create a spreadsheet tracking:

| Task | Temperature | Top_p | Max_tokens | Quality (1-10) | Notes |
|---|---|---|---|---|---|

Part 5: Different Model Architectures (GPT, Claude, Gemini, LLaMA, and More)

Not all LLMs are created equal. Understanding the landscape helps you choose the right tool for the job.

The Major Players: A Comparative Overview

OpenAI GPT Series

GPT-3.5 Turbo

  • Size: Undisclosed (descended from the 175B-parameter GPT-3)

  • Context: 16K tokens

  • Strengths: Fast, cheap, good for most tasks

  • Weaknesses: Less capable than GPT-4, more hallucinations

  • Best for: High-volume, cost-sensitive applications

  • Cost: $0.0015/1K input, $0.002/1K output

GPT-4

  • Size: ~1.7T parameters (mixture of experts)

  • Context: 8K, 32K, or 128K tokens

  • Strengths: Best reasoning, complex tasks, instruction following

  • Weaknesses: Expensive, slower

  • Best for: Complex analysis, high-stakes content, reasoning

  • Cost: $0.03/1K input, $0.06/1K output

GPT-4 Turbo

  • Updated GPT-4 with better performance and longer context

  • Context: 128K tokens

  • Cost: Slightly cheaper than GPT-4

GPT-4V (Vision)

  • Multimodal: text + images

  • Can analyze screenshots, diagrams, photos

  • Same core capabilities as GPT-4

Anthropic Claude Series

Claude 3 Haiku

  • Size: Smallest in Claude 3 family

  • Context: 200K tokens

  • Strengths: Fastest, cheapest Claude, massive context

  • Best for: Simple tasks needing large context

  • Cost: $0.00025/1K input, $0.00125/1K output

Claude 3 Sonnet

  • Size: Mid-tier

  • Context: 200K tokens

  • Strengths: Balance of capability and cost

  • Best for: Most production use cases

  • Cost: $0.003/1K input, $0.015/1K output

Claude 3.5 Sonnet

  • Updated Sonnet with improved performance

  • Better at: Coding, reasoning, nuanced tasks

  • Context: 200K tokens

Claude 3 Opus

  • Size: Largest Claude model

  • Context: 200K tokens

  • Strengths: Highest quality, excellent at complex reasoning

  • Best for: Challenging tasks, long-document analysis

  • Cost: $0.015/1K input, $0.075/1K output

Claude's Unique Characteristics:

  • ✅ Constitutional AI (safety-focused training)

  • ✅ Better at declining inappropriate requests

  • ✅ Excellent at long-form analysis

  • ✅ Strong performance on nuanced tasks

  • ✅ Less verbose than GPT-4

Google Gemini Series

Gemini 1.0 Pro

  • Context: 32K tokens

  • Strengths: Multimodal (text, images, audio, video)

  • Best for: Applications needing multiple input types

Gemini 1.5 Pro

  • Context: 1M tokens (largest available)

  • Strengths: Analyzing entire codebases, books, video transcripts

  • Weaknesses: Long context = slower, expensive

  • Best for: Tasks requiring massive context

Gemini Ultra

  • Largest Gemini model

  • Competitive with GPT-4 and Claude 3 Opus

  • Multimodal capabilities

Gemini's Unique Characteristics:

  • ✅ Native multimodal design (not bolted-on vision)

  • ✅ Longest context window (1M tokens)

  • ✅ Strong at technical/scientific tasks

  • ✅ Deep Google integration

Meta LLaMA Series

LLaMA 2

  • Sizes: 7B, 13B, 70B parameters

  • License: Open source (with usage restrictions)

  • Context: 4K tokens natively (extended variants reach further)

  • Strengths: Can self-host, customize, fine-tune

  • Best for: Privacy-sensitive applications, customization needs

LLaMA 3

  • Improved over LLaMA 2

  • Better multilingual support

  • Enhanced reasoning

Open Source Advantages:

  • ✅ Self-host (data stays internal)

  • ✅ Fine-tune for specific domains

  • ✅ No per-token costs (just compute)

  • ✅ Full control over model behavior

Open Source Disadvantages:

  • ❌ Requires infrastructure

  • ❌ Need ML expertise for deployment

  • ❌ Generally less capable than frontier models

  • ❌ You handle all safety/moderation

Other Notable Models

Mistral AI

  • Mistral 7B: Efficient, competitive with 13B models

  • Mixtral 8x7B: Mixture of experts, strong performance

  • Open source: Similar benefits to LLaMA

Cohere

  • Command: Optimized for business applications

  • Strong at: Classification, embeddings, search

AI21 Jurassic

  • Jurassic-2: Various sizes

  • Focus: Multi-language, long-form content

Specialized Models Worth Knowing

Code-Specific Models

CodeLlama (Meta)

  • Based on LLaMA, trained on code

  • Better at programming than base LLaMA

StarCoder

  • Open source, 15B parameters

  • Trained on 80+ programming languages

Phind-CodeLlama

  • Fine-tuned CodeLlama for development

Embedding Models

OpenAI text-embedding-ada-002

  • For semantic search, clustering

  • 1536 dimensions

Cohere Embed

  • Multilingual embeddings

  • Various size options

Sentence Transformers

  • Open source embedding models

  • Self-hostable

Model Comparison Matrix

| Feature | GPT-4 | Claude 3 Opus | Gemini Ultra | LLaMA 2 70B |
|---|---|---|---|---|
| Reasoning | ★★★★★ | ★★★★★ | ★★★★☆ | ★★★☆☆ |
| Coding | ★★★★★ | ★★★★★ | ★★★★☆ | ★★★☆☆ |
| Creative Writing | ★★★★★ | ★★★★★ | ★★★★☆ | ★★★★☆ |
| Long Context | ★★★★☆ | ★★★★★ | ★★★★★ | ★★☆☆☆ |
| Speed | ★★★☆☆ | ★★★★☆ | ★★★☆☆ | ★★★★★ |
| Cost Efficiency | ★★☆☆☆ | ★★★☆☆ | ★★★☆☆ | ★★★★★ |
| Multilingual | ★★★★☆ | ★★★★☆ | ★★★★★ | ★★☆☆☆ |
| Safety | ★★★★☆ | ★★★★★ | ★★★★☆ | ★★★☆☆ |

Architectural Differences That Matter

1. Mixture of Experts (MoE)

Used by: GPT-4, Mixtral

How it works:

  • Multiple smaller "expert" models

  • Router network decides which experts to activate

  • Only activate relevant experts for each token

Advantages:

  • More efficient than single large model

  • Specialization (different experts for different domains)

Disadvantages:

  • More complex to train and deploy

  • Can be inconsistent
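
A toy numpy sketch of the routing idea (sizes and weights invented for illustration): a small gate network scores the experts for each token, and only the top-k actually run:

```python
import numpy as np

def moe_layer(x, experts, gate, k=2):
    """Toy MoE: route one token's vector x through the k best-scoring experts."""
    gate_logits = gate @ x
    top_k = np.argsort(gate_logits)[::-1][:k]     # choose which experts fire
    gate_probs = np.exp(gate_logits[top_k])
    gate_probs /= gate_probs.sum()                # softmax over the chosen experts
    outputs = [experts[i] @ x for i in top_k]     # only these experts do work
    return sum(p * out for p, out in zip(gate_probs, outputs))

rng = np.random.default_rng(0)
d, n_experts = 8, 8
x = rng.normal(size=d)                        # one token's representation
experts = rng.normal(size=(n_experts, d, d))  # each expert: a toy linear map
gate = rng.normal(size=(n_experts, d))
print(moe_layer(x, experts, gate).shape)      # (8,): same shape, a fraction of the compute
```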

2. Constitutional AI

Used by: Claude series

How it works:

  • Model is trained with explicit "constitution" of principles

  • Self-critiques and revises responses

  • Trained to explain refusals

Result: Claude tends to be more careful, explicit about limitations

3. Retrieval-Enhanced Generation

Used by: Some specialized models, Perplexity AI

How it works:

  • Model can search external sources

  • Grounds responses in retrieved information

  • Provides citations

Advantage: Reduced hallucinations, up-to-date information

4. Multimodal Architecture

Native multimodal (Gemini):

  • Trained on text, images, audio, video simultaneously

  • Better cross-modal understanding

Adapter multimodal (GPT-4V):

  • Base model + vision adapter

  • Still very capable but different architecture

Choosing the Right Model: Decision Framework

Pick the column that best matches your situation:

| GPT-4 | Claude 3 Opus | Claude 3.5 Sonnet | GPT-3.5 Turbo | Gemini 1.5 Pro | LLaMA/Mistral |
|---|---|---|---|---|---|
| Maximum quality is critical | Long-document analysis (200K context) | Need strong coding assistance | High volume, cost matters | Need 1M token context | Data privacy is critical (self-host) |
| Complex reasoning required | Nuanced, thoughtful responses needed | Balance of cost and quality | Simpler tasks | Analyzing entire codebases/books | Need to fine-tune |
| Budget allows | Safety/ethics are paramount | Production applications | Speed is priority | Multimodal inputs | High volume (no per-token cost) |
| Tasks are high-stakes | You prefer more concise outputs | Large context helpful but not required | Good enough > perfect | Google ecosystem integration | Have ML infrastructure |

Model Performance on Common Tasks

| Task | Models (strongest options first) |
|---|---|
| Code Generation | Claude 3.5 Sonnet (best), GPT-4, GPT-3.5 Turbo, LLaMA 2 70B |
| Creative Writing | Claude 3 Opus, GPT-4, Claude 3.5 Sonnet, GPT-3.5 Turbo |
| Factual Q&A | GPT-4, Claude 3 Opus, Gemini Ultra, Claude 3.5 Sonnet |
| Long-Document Analysis | Claude 3 Opus (200K), Gemini 1.5 Pro (1M), GPT-4 Turbo (128K), Claude 3.5 Sonnet (200K) |
| Cost per Quality | Claude 3.5 Sonnet, GPT-3.5 Turbo, Claude 3 Haiku, Gemini Pro |
| Reasoning & Logic | GPT-4, Claude 3 Opus, Claude 3.5 Sonnet, Gemini Ultra |

Part 6: Putting It All Together

Your Model Selection Worksheet

Answer these questions to choose the right model:

1. What's your primary task?

  • Simple Q&A → GPT-3.5 Turbo, Claude Haiku

  • Complex reasoning → GPT-4, Claude 3 Opus

  • Coding → Claude 3.5 Sonnet, GPT-4

  • Creative writing → Claude 3 Opus, GPT-4

  • Document analysis → Claude 3 Opus, Gemini 1.5 Pro

2. What's your context requirement?

  • < 8K tokens → Any model

  • 8K-32K tokens → GPT-4, Claude, Gemini Pro

  • 32K-200K tokens → Claude 3, GPT-4 Turbo

  • 200K+ tokens → Gemini 1.5 Pro

3. What's your volume?

  • Low (< 1M tokens/month) → Use best quality

  • Medium (1M-10M) → Balance cost/quality

  • High (> 10M) → Cost-optimize or self-host

4. What's your budget per 1K tokens?

  • < $0.01 → GPT-3.5, Claude Haiku, self-host

  • $0.01-$0.05 → Claude 3.5 Sonnet, GPT-4 Turbo

  • > $0.05 → GPT-4, Claude 3 Opus (when needed)

5. What's your latency requirement?

  • Real-time (<2s) → GPT-3.5 Turbo, Claude Haiku

  • Interactive (<5s) → Most models

  • Batch processing → Any model, optimize for cost

6. Do you need special capabilities?

  • Vision → GPT-4V, Gemini

  • Massive context → Gemini 1.5 Pro

  • Self-hosting → LLaMA, Mistral

  • Maximum safety → Claude 3

Practical Exercises

Exercise 1: Token Counting

Test these prompts in a tokenizer:

  1. "Hello world"

  2. "The quick brown fox jumps over the lazy dog"

  3. Your name (if it's unusual, see how it tokenizes)

  4. A sentence in another language you speak

  5. A code snippet

What did you learn about tokenization?

Exercise 2: Temperature Experiments

Use the same prompt with different temperatures:

Prompt: "Write a product description for wireless headphones"

Try:

  • Temperature 0

  • Temperature 0.7

  • Temperature 1.2

Compare: Consistency, creativity, quality

Exercise 3: Context Window Testing

Take a long document (5,000+ words). Try:

  1. Summarizing the whole thing

  2. Asking about information at the beginning

  3. Asking about information in the middle

  4. Asking about information at the end

Notice: Where accuracy is highest/lowest

Exercise 4: Model Comparison

Same prompt, three different models:

Prompt: "Explain quantum entanglement to a high school student"

Test with:

  • GPT-3.5 Turbo

  • Claude 3.5 Sonnet

  • GPT-4 (if available)

Compare: Clarity, accuracy, style, length

Common Mistakes to Avoid

❌ Mistake 1: Ignoring tokenization
"I'll just write naturally and not worry about it"
Problem: Hitting context limits unexpectedly, wasting tokens

❌ Mistake 2: Using maximum context always
"I'll always use the longest context available"
Problem: Slower, more expensive, worse quality at high capacity

❌ Mistake 3: Default parameters for everything
"I'll just use temperature 0.7 for all tasks"
Problem: Suboptimal results—code generation needs lower, creative needs higher

❌ Mistake 4: Not testing different models
"GPT-4 is best, so I'll use it for everything"
Problem: Overpaying for simple tasks, missing specialized strengths

❌ Mistake 5: Assuming determinism at temp > 0
"I ran it once, so that's what it always does"
Problem: Inconsistent results surprise you in production

❌ Mistake 6: Exceeding context without checking
"I'll just paste this whole document"
Problem: Truncated results, missed information, wasted tokens

❌ Mistake 7: Treating all models the same
"They're all LLMs, so prompts should work identically"
Problem: Each model has quirks, optimal prompting differs

Key Takeaways: What You Must Remember

🔹 LLMs predict text, they don't understand it
This explains why they can be confidently wrong.

🔹 Tokens ≠ words
Budget and plan in tokens, not words.

🔹 Context windows are your constraint
Design prompts that fit. Put critical info at the start/end.

🔹 Temperature controls creativity
Low for facts, high for creativity.

🔹 Different models have different strengths
Choose based on task, not just "best overall."

🔹 Parameters matter as much as the prompt
The same prompt with different parameters produces different results.

🔹 Bigger isn't always better
A well-prompted smaller model beats a poorly-prompted larger one.

🔹 Context quality > context quantity
200K tokens of irrelevant information < 2K tokens of perfect context.

What's Next

In Post #3: "Anatomy of an Effective Prompt", we'll take everything you've learned about how models work and translate it into practical prompt construction:

  • The essential components every prompt needs

  • Clear vs. vague instructions (with examples)

  • How to provide context effectively

  • Formatting outputs exactly as you want

  • Common beginner mistakes and how to avoid them

  • Your first prompt templates

Now you understand the engine. Next, you'll learn to drive it masterfully.


Research Papers:

  • "Attention Is All You Need" (Vaswani et al., 2017): the Transformer paper

  • "Lost in the Middle: How Language Models Use Long Contexts" (Liu et al., 2023): the context-window study

  • "Constitutional AI: Harmlessness from AI Feedback" (Bai et al., 2022): Anthropic's approach

Your Assignment Before Post #3

  1. Experiment with at least two different models on the same task

  2. Try the same prompt with different temperatures (0, 0.7, 1.2)

  3. Check the token count of your typical prompts

  4. Test a long document against context limits

  5. Document your findings—what surprised you?

Share your discoveries in the comments! What worked? What didn't? What confused you?

Next up: Post #3 - "Anatomy of an Effective Prompt"

You now understand the machine. Time to master the interface.

Questions? Confused about anything? Drop a comment—I read and respond to all of them. This series only gets better with your input.

Welcome to Level 2 of prompt engineering mastery. You're building something powerful.

A developer By Profession, Artist By Nature. I started Blogging recently and I hope, I can provide more contribution to Open Source. This Blog Content will mostly related to Coding, Tips & Tricks.