Generative AI Modalities
Generative AI does not just classify — it creates. Explore text, image, code, audio and multimodal models, the architectures behind them, and the ethics of generated content.
11.1 What is Generative AI?
Key terminology
| Term | Meaning |
|---|---|
| Training Data | Information used to teach the model (books, code repos, images) |
| Parameters | The model's learned weights; GPT-3 has 175 billion parameters |
| Prompt | The input/instruction given to the model |
| Inference | The model generating output — happens after training, not during it |
| Token | The smallest unit of text the model handles — a word may be several tokens |
| Fine-tuning | Adapting a pre-trained model to a specific domain |
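To make the "Token" row concrete, here is a minimal sketch of a greedy longest-match sub-word tokenizer. The vocabulary `TOY_VOCAB` and the function `tokenize_word` are hypothetical names invented for this illustration; real tokenizers learn much larger vocabularies from data and use different algorithms.

```python
# Toy greedy longest-match sub-word tokenizer.
# TOY_VOCAB is a hypothetical vocabulary invented for this illustration.
TOY_VOCAB = {"micro", "trans", "action", "al", "the", "cat"}


def tokenize_word(word, vocab=TOY_VOCAB):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens


print(tokenize_word("cat"))                 # one word, one token
print(tokenize_word("microtransactional"))  # one word, several tokens
```

A word that is in the vocabulary stays whole, while a longer word is split into several pieces, which is why token counts and word counts differ.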
11.2 Generative vs Discriminative Models
| Aspect | Discriminative | Generative |
|---|---|---|
| Core function | Map input → label | Learn the data distribution to create new data |
| Question answered | "What category is this?" | "What would a new example look like?" |
| What it learns | Decision boundaries | The full data distribution |
| Output | Classifications, predictions | New text, images, audio |
| Examples | Logistic Regression, SVM, Random Forest, CNNs | GPT, Stable Diffusion, GANs, VAEs |
| Metaphor | A judge who evaluates and categorises | An artist who creates original content |
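The contrast in the table can be sketched on a one-dimensional toy problem. The data below is made up, and the helper names (`train_discriminative`, `predict`, `fit_gaussian`) are assumptions for this illustration only. The discriminative part learns a decision boundary with logistic regression; the generative part estimates the distribution of one class and samples new points from it.

```python
import math
import random

random.seed(0)

# Made-up 1-D data: class 0 clusters around -1, class 1 around +1.
class0 = [random.gauss(-1.0, 0.3) for _ in range(200)]
class1 = [random.gauss(1.0, 0.3) for _ in range(200)]
xs = class0 + class1
ys = [0] * len(class0) + [1] * len(class1)


def train_discriminative(xs, ys, lr=0.5, epochs=300):
    """Logistic regression by gradient descent: learns only a boundary."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b


def predict(x, w, b):
    """Answer 'what category is this?' with a label."""
    return 1 if w * x + b > 0 else 0


def fit_gaussian(values):
    """Estimate the data distribution of one class (mean and std)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


w, b = train_discriminative(xs, ys)
print("label for x=0.8:", predict(0.8, w, b))

mean, std = fit_gaussian(class1)
new_points = [random.gauss(mean, std) for _ in range(3)]
print("new class-1 examples:", new_points)
```

The classifier can only label inputs it is given, whereas the fitted distribution can be sampled to produce values that were never in the training set.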
11.3 Text & Code Models
Text Generation
Models learn statistical patterns in text and repeatedly predict the next token; chained together, those predictions build sentences, then paragraphs, then whole documents. Powered by Transformers (Lecture 10). Used in:
- Text Generation — chatbots, creative writing, drafting.
- Text Summarization — condensing long documents into key points. Two flavours: extractive (pick out existing sentences) and abstractive (write a fresh summary in new words).
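Next-token prediction can be illustrated with a far simpler model than a Transformer. The sketch below is a bigram model: it only looks at the previous token, whereas real language models condition on long contexts. The corpus is made up, and `build_bigram_table` and `generate` are hypothetical helper names.

```python
import random
from collections import defaultdict

# Tiny made-up corpus, purely for illustration.
corpus = (
    "the model reads text . the model predicts the next token . "
    "the next token extends the text ."
)


def build_bigram_table(text):
    """Record, for every token, the tokens that followed it in the corpus."""
    tokens = text.split()
    table = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        table[current].append(following)
    return table


def generate(table, start, max_tokens, rng):
    """Repeatedly sample a next token given only the previous one."""
    output = [start]
    for _ in range(max_tokens - 1):
        candidates = table.get(output[-1])
        if not candidates:
            break  # the last token never had a successor in the corpus
        # Followers seen more often appear more often in the list,
        # so they are proportionally more likely to be chosen.
        output.append(rng.choice(candidates))
    return " ".join(output)


table = build_bigram_table(corpus)
print(generate(table, "the", 8, random.Random(0)))
```

Each generated token becomes the context for the next prediction, which is the same loop a large model runs at a much larger scale.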
Code Models
Trained on programming languages and developer patterns. They help with:
- Autocomplete — predict the next lines of code.
- Refactoring — suggest cleaner versions of legacy code.
- Bug detection — highlight errors and propose fixes.
- Query generation — convert plain English into SQL.
```python
from transformers import pipeline

# Text generation: continue a prompt with GPT-2
generator = pipeline("text-generation", model="gpt2")
print(generator("The future of AI is", max_length=20)[0]['generated_text'])

# Text summarization: uses the pipeline's default model
summarizer = pipeline("summarization")
long_text = "Generative AI has transformed many industries..."
# Returns a list of dicts, each with a 'summary_text' key
print(summarizer(long_text, max_length=30, min_length=10))
```
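The extractive flavour of summarization described in this section can be approximated without any model. The following is a minimal sketch, assuming a simple word-frequency score; `extractive_summary` and the sample text are invented for this illustration, and production systems use more robust scoring.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average corpus frequency of the words in this sentence.
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


sample = (
    "Diffusion models generate images from noise. "
    "Many teams now use generative models for images and text. "
    "The weather was pleasant yesterday."
)
summary = extractive_summary(sample, num_sentences=1)
print(summary)
```

Because the function only selects sentences, every sentence in the output appears verbatim in the input; an abstractive summarizer would instead write new sentences.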
11.4 Image & Audio Models
Image Generation — Diffusion Models
During training, noise is added to images step by step and the model learns to reverse that process; at generation time it starts from random noise and removes it gradually until an image emerges. Used in marketing visuals, concept art, synthetic medical-imaging data and educational illustrations. Example prompt: "A futuristic classroom with AI robots teaching students, photorealistic, detailed."
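The noising process can be sketched numerically. The example below assumes the DDPM-style closed form x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, which this lecture does not spell out; `alpha_bar`, `add_noise` and `remove_noise` are illustrative names, and the 4-value "image" is made up. No network is trained here: the true noise stands in for the prediction a trained model would supply.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Forward process: blend clean values with Gaussian noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps


def remove_noise(xt, eps_pred, alpha_bar):
    """Invert the blend given a noise estimate.

    In a real diffusion model, eps_pred comes from a trained network.
    """
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(xt, eps_pred)]


rng = random.Random(0)
clean = [0.0, 0.5, 1.0, 0.5]  # a made-up 4-pixel "image"
noisy, true_noise = add_noise(clean, alpha_bar=0.5, rng=rng)
recovered = remove_noise(noisy, true_noise, alpha_bar=0.5)

print("noisy:    ", noisy)
print("recovered:", recovered)
```

With a perfect noise estimate the clean values are recovered exactly; the quality of a real generator depends on how well the network predicts that noise at each step.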
Audio Models
- Text-to-Speech (TTS) — converts written text into natural human-like speech, with different voices, accents and emotions. Used for accessibility, audiobooks, voice assistants.
- Voice cloning — replicates a voice from sample audio. Powerful but requires strict consent and ethical use.
- Music generation — creates original music or assists composition (background scores, prototyping).
11.5 Multimodal Models & Architectures
Different tasks need different architectures
| Content type | Architecture | How it works |
|---|---|---|
| Text | Transformers | Self-attention to model context across long sequences |
| Images | Diffusion Models | Start from noise, refine step-by-step, guided by text embeddings |
| Audio | Specialised audio models | Learn temporal/acoustic patterns, generate natural prosody |
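The self-attention entry in the table can be sketched in a few lines. This is a simplified version, assuming queries, keys and values are all the raw input vectors; real Transformers apply learned projection matrices first and use several attention heads. The token vectors are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product attention with queries = keys = values = inputs."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Similarity of this token to every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        # Each output is a weighted mix of all token vectors.
        out = [sum(w * v[j] for w, v in zip(weights, vectors))
               for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # made-up 2-D token vectors
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how much one token attends to every other token, so similar vectors receive more weight regardless of how far apart they sit in the sequence.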
Emerging trends
Current trends include multimodal models, personalisation, efficiency (smaller models, less compute), controllability (finer control of tone and style), and safety and alignment (outputs that match human values).
11.6 Ethics in Generative AI
| Practice | Why it matters |
|---|---|
| Verify outputs | AI can hallucinate — always cross-check facts with reliable sources |
| Check for bias | Models inherit data biases — review outputs for stereotypes |
| Protect privacy | Never enter sensitive/personal data — public models may log and reuse inputs |
| Be transparent | Disclose when AI tools are used; follow attribution rules |
| Cite synthetic content | Treat AI-generated material like any other source — transparency builds credibility |
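One way to act on the "Protect privacy" row is to strip obvious personal details from a prompt before it is sent anywhere. The sketch below is an assumption-laden illustration: the regular expressions and the `redact` function are invented here, they catch only simple email and phone formats, and they are not a complete privacy filter.

```python
import re

# Deliberately simple, hypothetical patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(prompt):
    """Replace email addresses and phone-like numbers with placeholders."""
    prompt = EMAIL.sub("[EMAIL]", prompt)
    return PHONE.sub("[PHONE]", prompt)


raw = "Contact jane.doe@example.com or +44 20 7946 0958 about the report."
print(redact(raw))
```

Names, addresses and other identifiers would pass through unchanged, so a manual review of the prompt is still needed.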
Modalities, architectures and terminology are tested heavily here.
Generative AI differs from traditional (discriminative) AI because it:
Generative models learn the data distribution and produce new content (text, images, audio). Discriminative models only assign labels to existing data.
Which of these is a discriminative model?
Logistic Regression maps inputs to labels (discriminative). GPT, Stable Diffusion and GANs all generate new content (generative).
Image generators like Stable Diffusion create pictures by:
Diffusion models begin with noise and iteratively denoise it into a coherent image, guided by the text prompt's embeddings.
When does inference happen?
Training adjusts the parameters; inference is the later phase where the trained model produces outputs from a prompt.
"A token is always exactly one word." This statement is:
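Tokens are often sub-word chunks, and how a word is split depends on the tokenizer. A long or rare word like "microtransactional" is typically broken into several tokens, while a short common word is often a single token.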
A multimodal model is one that:
Multimodal = multiple modalities. GPT-4V, for instance, handles both images and text in one model.
A summariser that writes a fresh summary in new words rather than copying sentences is performing:
Abstractive summarization generates new sentences; extractive summarization picks existing sentences out of the source text.
Which is the safest practice when using a public generative AI tool?
AI can hallucinate, so verify facts; and public models may log inputs, so never enter sensitive/personal data. Transparency about AI use is also good practice.
Explain the difference between a generative model and a discriminative model using one example of each.
A discriminative model learns the boundary between classes and answers "what category is this?" — e.g. Logistic Regression deciding spam vs not-spam. A generative model learns the data distribution and answers "what would a new example look like?" — e.g. GPT generating a new paragraph or Stable Diffusion creating a new image.
Use the Hugging Face pipeline to perform sentiment analysis on the sentence "I love this course".
```python
from transformers import pipeline

# Load the default sentiment-analysis model
classifier = pipeline("sentiment-analysis")
result = classifier("I love this course")
print(result)
```
The pipeline helper downloads a pre-trained model and runs inference in one line.
Write code using Hugging Face to generate text continuing the prompt "Machine learning is".
```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
output = generator("Machine learning is", max_length=25,
                   num_return_sequences=1)
print(output[0]['generated_text'])
```
Match each modality to its typical architecture: text, images, sequences/audio.
Text → Transformers (self-attention for long-range context). Images → Diffusion models (denoise random noise into a picture). Audio/sequences → specialised audio/sequence models that learn temporal and acoustic patterns. Different content types need architectures matched to their structure.