⚡ LECTURE 11

Generative AI Modalities

Generative AI does not just classify — it creates. Explore text, image, code, audio and multimodal models, the architectures behind them, and the ethics of generated content.

Syllabus topics 39–44 · ⏱ ~24 min read · 12 practice questions

In this lecture

What is Generative AI?
Generative vs Discriminative Models
Text & Code Models
Image & Audio Models
Multimodal Models & Architectures
Ethics in Generative AI
Practice Questions

11.1 What is Generative AI?

Generative AI — models that learn patterns from massive datasets and use those patterns to create new, original content — text, images, audio or code. They do not retrieve old information; they generate new outputs using learned probability patterns.

Key terminology

Term	Meaning
Training Data	Information used to teach the model (books, code repos, images)
Parameters	The model's learned settings — GPT-3 has 175 billion
Prompt	The input/instruction given to the model
Inference	The model generating output — happens after training, not during it
Token	The smallest unit of text the model handles — a word may be several tokens
Fine-tuning	Adapting a pre-trained model to a specific domain

⚠️ Two common true/false traps: "Inference happens during training" → False (inference is after training). "A token is always one word" → False (a word can be split into multiple tokens; a long or rare word such as "microtransactional" typically becomes several, and the exact split depends on the tokenizer).
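The point that one word can become several tokens can be illustrated with a toy tokenizer. This is a minimal sketch, not how any production tokenizer works: the vocabulary `VOCAB` is hand-picked and hypothetical, and the greedy longest-match rule is a simplification of the learned sub-word schemes real models use.

```python
# Toy greedy longest-match sub-word tokenizer.
# VOCAB is a hypothetical, hand-picked vocabulary; real tokenizers learn
# theirs from data, so real splits will differ.
VOCAB = {"micro", "trans", "action", "al", "the", "cat", "un", "believ", "able"}

def tokenize(word: str) -> list[str]:
    """Split `word` into the longest vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible piece first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece matches: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("cat"))                 # ['cat']
print(tokenize("microtransactional"))  # ['micro', 'trans', 'action', 'al']
print(tokenize("unbelievable"))        # ['un', 'believ', 'able']
```

A short, common word maps to one token here, while the longer words are assembled from several pieces, which is why token counts and word counts differ.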

11.2 Generative vs Discriminative Models

Aspect	Discriminative	Generative
Core function	Map input → label	Learn the data distribution to create new data
Question answered	"What category is this?"	"What would a new example look like?"
What it learns	Decision boundaries	The full data distribution
Output	Classifications, predictions	New text, images, audio
Examples	Logistic Regression, SVM, Random Forest, CNN classifiers	GPT, Stable Diffusion, GANs, VAEs
Metaphor	A judge who evaluates and categorises	An artist who creates original content
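The contrast in the table can be sketched on one-dimensional data. This is an illustrative example under stated assumptions: the height data and class names are made up, the generative side fits a simple Gaussian per class, and the discriminative side is a bare threshold rather than a real classifier such as Logistic Regression.

```python
import random
import statistics

random.seed(0)

# Hypothetical 1-D training data: heights (cm) for two made-up classes.
data = {
    "child": [random.gauss(120, 8) for _ in range(200)],
    "adult": [random.gauss(172, 9) for _ in range(200)],
}

# Generative view: model each class's distribution (here, a Gaussian).
params = {
    label: (statistics.mean(xs), statistics.stdev(xs))
    for label, xs in data.items()
}

def generate(label: str) -> float:
    """Sample a brand-new example from the learned class distribution."""
    mean, std = params[label]
    return random.gauss(mean, std)

# Discriminative view: learn only a decision boundary (a threshold).
def fit_threshold(children, adults):
    """Pick the candidate threshold with the fewest training errors."""
    best_t, best_err = None, float("inf")
    for t in sorted(children + adults):
        errors = sum(x >= t for x in children) + sum(x < t for x in adults)
        if errors < best_err:
            best_t, best_err = t, errors
    return best_t

threshold = fit_threshold(data["child"], data["adult"])

def classify(x: float) -> str:
    """Map an input to a label using the learned boundary."""
    return "adult" if x >= threshold else "child"

print("new child-like sample:", round(generate("child"), 1))
print("decision boundary:", round(threshold, 1))
print("classify(180):", classify(180))
```

The threshold can label a height but cannot produce one; the fitted Gaussians can produce new heights because they describe where the data lives.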

11.3 Text & Code Models

Text Generation

Models learn patterns in text and predict the next token; repeating that prediction one token at a time builds sentences, then paragraphs, then whole documents. Powered by Transformers (Lecture 10). Used in:

Text Generation — chatbots, creative writing, drafting.
Text Summarization — condensing long documents into key points. Two flavours: extractive (pick out existing sentences) and abstractive (write a fresh summary in new words).
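Next-token prediction, the mechanism behind text generation, can be sketched with a toy bigram model. This is a deliberately tiny stand-in for a Transformer: the corpus is made up, whitespace-separated words play the role of tokens, and the "model" is just a lookup table of which token followed which.

```python
import random
from collections import defaultdict

# Tiny made-up training corpus; real models train on billions of tokens.
corpus = (
    "the model reads the prompt . "
    "the model predicts the next token . "
    "the next token extends the prompt ."
).split()

# "Training": record which tokens follow each token (a bigram table).
# Storing repeats makes common continuations more likely to be sampled.
followers = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current].append(nxt)

def generate(prompt: str, max_new_tokens: int = 8, seed: int = 0) -> str:
    """Extend `prompt` one token at a time by sampling from the table."""
    rng = random.Random(seed)
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        options = followers.get(tokens[-1])
        if not options:  # unseen token: nothing learned to continue with
            break
        tokens.append(rng.choice(options))
    return " ".join(tokens)

print(generate("the model"))
```

Each new token depends only on the previous one here, whereas a Transformer conditions on the whole preceding context, but the generate-append-repeat loop has the same shape.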

Code Models

Trained on programming languages and developer patterns. They help with:

Autocomplete — predict the next lines of code.
Refactoring — suggest cleaner versions of legacy code.
Bug detection — highlight errors and propose fixes.
Query generation — convert plain English into SQL.

Python · text generation & summarization (Hugging Face)

from transformers import pipeline

# Text generation
generator = pipeline("text-generation", model="gpt2")
print(generator("The future of AI is", max_length=20)[0]['generated_text'])

# Text summarization
summarizer = pipeline("summarization")
long_text = "Generative AI has transformed many industries..."
print(summarizer(long_text, max_length=30, min_length=10))

Output (illustrative; generated text is sampled and varies between runs)
The future of AI is bright and full of new possibilities for...
[{'summary_text': 'Generative AI has transformed many industries.'}]

11.4 Image & Audio Models

Image Generation — Diffusion Models

Diffusion Model — generates images by starting from random noise and refining it step by step into a clear picture, guided by a text prompt. Example: Stable Diffusion.

The model is trained to reverse a noising process — it learns to remove noise gradually. Used in marketing visuals, concept art, synthetic medical-imaging data and educational illustrations. Example prompt: "A futuristic classroom with AI robots teaching students, photorealistic, detailed."
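The noising process that the model learns to reverse can be sketched numerically. This is a minimal illustration under assumptions: the linear noise schedule and the 4-value "image" are made up for the example, and the true noise is passed in where a trained network's prediction would go. A real sampler starts from pure noise and applies many small denoising steps with a learned predictor.

```python
import math
import random

random.seed(0)

T = 1000
# Linear noise schedule (an assumed, commonly used choice).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar[t] = product of (1 - beta) up to step t: how much signal is left.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def add_noise(x0, t):
    """Forward process: jump straight to noise level t in closed form."""
    eps = [random.gauss(0, 1) for _ in x0]
    a = alpha_bar[t]
    xt = [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

def estimate_clean(xt, predicted_eps, t):
    """Invert the forward formula given a noise prediction."""
    a = alpha_bar[t]
    return [(x - math.sqrt(1 - a) * e) / math.sqrt(a)
            for x, e in zip(xt, predicted_eps)]

x0 = [0.9, -0.4, 0.1, 0.7]  # a made-up 4-"pixel" image
for t in (0, 499, 999):
    print(f"t={t:>3}  signal kept: {math.sqrt(alpha_bar[t]):.3f}")

xt, eps = add_noise(x0, 500)
# Oracle: the true noise stands in for a trained network's prediction.
recovered = estimate_clean(xt, eps, 500)
print([round(v, 3) for v in recovered])
```

The printed "signal kept" values shrink towards zero as `t` grows, which is why generation can start from pure noise. Training amounts to teaching a network to supply the `predicted_eps` argument.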

Audio Models

Text-to-Speech (TTS) — converts written text into natural human-like speech, with different voices, accents and emotions. Used for accessibility, audiobooks, voice assistants.
Voice cloning — replicates a voice from sample audio. Powerful but requires strict consent and ethical use.
Music generation — creates original music or assists composition (background scores, prototyping).

11.5 Multimodal Models & Architectures

Multimodal Model — a model that works across multiple types of data at once — text, images, audio, video. Example: GPT-4V can accept an image and answer questions about it in text.

Different tasks need different architectures

Content type	Architecture	How it works
Text	Transformers	Self-attention to model context across long sequences
Images	Diffusion Models	Start from noise, refine step-by-step, guided by text embeddings
Audio	Specialised audio models	Learn temporal/acoustic patterns, generate natural prosody
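The self-attention entry in the table can be made concrete with a small sketch. It is simplified on purpose: the embeddings are made up, and queries, keys and values are all the raw input vectors, whereas a real Transformer derives them through learned projection matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention with Q = K = V = the inputs.

    Real Transformers first apply learned projection matrices to get
    separate queries, keys and values; those are omitted here.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Score this token against every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        # Output = weighted average of all token vectors.
        out = [sum(w * v[i] for w, v in zip(weights, vectors))
               for i in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up 2-D embeddings for a 3-token sequence.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row of weights shows how much one token draws on every other token, regardless of distance in the sequence; the first token weights its near-duplicate more heavily than the dissimilar third token.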

Emerging trends

Multimodal models, personalisation, efficiency (smaller models, less compute), controllability (finer control of tone/style), and safety & alignment (outputs matching human values).

11.6 Ethics in Generative AI

Practice	Why it matters
Verify outputs	AI can hallucinate — always cross-check facts with reliable sources
Check for bias	Models inherit data biases — review outputs for stereotypes
Protect privacy	Never enter sensitive/personal data; public AI services may log inputs and reuse them
Be transparent	Disclose when AI tools are used; follow attribution rules
Cite synthetic content	Treat AI-generated material like any other source — transparency builds credibility

💡 Tip: what never to type into a public AI tool. Passwords, API keys, customer data, medical records, financial details, or any confidential company information. Public AI services may store and reuse what you submit.
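One way to back up this habit is to scrub a prompt before it leaves your machine. The sketch below is a hedged illustration only: the regular expressions and the `redact` helper are hypothetical, they cover just three shapes of sensitive data, and they will miss many real secrets, so they complement rather than replace human judgement.

```python
import re

# Hypothetical patterns for illustration; real secrets come in many more
# shapes, so treat this as a safety net, not a guarantee.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "API_KEY": re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9]{16,}\b"),
    "CARD_NUMBER": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(prompt: str) -> str:
    """Replace anything matching a known pattern with a placeholder."""
    for name, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{name} REDACTED]", prompt)
    return prompt

raw = (
    "Email jane.doe@example.com, key sk-abcdef1234567890abcd, "
    "card 4111 1111 1111 1111."
)
print(redact(raw))
```

Running the redaction locally, before any network call, means the sensitive values never reach the service's logs in the first place.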

Practice Questions

Modalities, architectures and terminology are tested heavily here.

MCQ · Q1 · Definition

Generative AI differs from traditional (discriminative) AI because it:

A Only classifies existing data into categories
B Creates new, original content from learned patterns
C Never needs training data
D Works only with numbers

Answer: B

Generative models learn the data distribution and produce new content (text, images, audio). Discriminative models only assign labels to existing data.

MCQ · Q2 · Discriminative

Which of these is a discriminative model?

A GPT
B Stable Diffusion
C Logistic Regression
D A GAN

Answer: C

Logistic Regression maps inputs to labels (discriminative). GPT, Stable Diffusion and GANs all generate new content (generative).

MCQ · Q3 · Image models

Image generators like Stable Diffusion create pictures by:

A Copying images from the internet
B Starting from random noise and refining it step-by-step
C Using a single Transformer decoder
D Drawing pixel by pixel from left to right

Answer: B

Diffusion models begin with noise and iteratively denoise it into a coherent image, guided by the text prompt's embeddings.

MCQ · Q4 · Terminology

When does inference happen?

A During training
B After training, when the model generates output
C Before any data is collected
D Only when fine-tuning

Answer: B

Training adjusts the parameters; inference is the later phase where the trained model produces outputs from a prompt.

MCQ · Q5 · Tokens

"A token is always exactly one word." This statement is:

A True
B False — a word can be split into several tokens
C True only for English
D True only for code models

Answer: B

Tokens are often sub-word chunks. A long or rare word like "microtransactional" is typically broken into several tokens, while a short, common word may be a single token; the exact split depends on the tokenizer.

MCQ · Q6 · Multimodal

A multimodal model is one that:

A Has multiple hidden layers
B Works across multiple data types (e.g. text + images)
C Runs on multiple computers
D Was trained by multiple companies

Answer: B

Multimodal = multiple modalities. GPT-4V, for instance, handles both images and text in one model.

MCQ · Q7 · Summarization

A summariser that writes a fresh summary in new words rather than copying sentences is performing:

A Extractive summarization
B Abstractive summarization
C Tokenization
D Classification

Answer: B

Abstractive summarization generates new sentences; extractive summarization picks existing sentences out of the source text.
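The extractive approach is simple enough to sketch without a model. This is a toy illustration under assumptions: the stop-word list is short and hypothetical, and scoring sentences by summed word frequency is one basic heuristic among many (it also favours longer sentences). Abstractive summarization has no comparably small equivalent, because it needs a generative model to write new sentences.

```python
import re
from collections import Counter

# A short, hypothetical stop-word list; real systems use larger ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "on"}

def extractive_summary(text: str, n_sentences: int = 1) -> str:
    """Return the `n_sentences` highest-scoring sentences, in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    freq = Counter(words)

    def score(sentence: str) -> int:
        # A sentence scores the summed frequency of its content words.
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower())
                   if w not in STOP_WORDS)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Keep the chosen sentences in the order they appeared in the source.
    return " ".join(s for s in sentences if s in top)

text = (
    "Diffusion models generate images from noise. "
    "The weather was pleasant on Tuesday. "
    "Diffusion models remove noise step by step to form images."
)
print(extractive_summary(text))
# Diffusion models remove noise step by step to form images.
```

The summary is always a verbatim sentence from the source, which is the defining property of the extractive flavour.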

MCQ · Q8 · Ethics

Which is the safest practice when using a public generative AI tool?

A Paste confidential customer data to get better answers
B Verify outputs against reliable sources and avoid entering sensitive data
C Always trust the output without checking
D Never disclose that AI was used

Answer: B

AI can hallucinate, so verify facts; and public AI services may log inputs, so never enter sensitive/personal data. Transparency about AI use is also good practice.

Short Answer · Q9 · Concept

Explain the difference between a generative model and a discriminative model using one example of each.

Model answer

A discriminative model learns the boundary between classes and answers "what category is this?" — e.g. Logistic Regression deciding spam vs not-spam. A generative model learns the data distribution and answers "what would a new example look like?" — e.g. GPT generating a new paragraph or Stable Diffusion creating a new image.

Coding · Q10 · Hugging Face

Use the Hugging Face pipeline to perform sentiment analysis on the sentence "I love this course".

Solution

Python

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I love this course")
print(result)

Output (the exact score may differ slightly)
[{'label': 'POSITIVE', 'score': 0.9998}]

The pipeline helper downloads a pre-trained model and runs inference in one line.

Coding · Q11 · Text generation

Write code using Hugging Face to generate text continuing the prompt "Machine learning is".

Solution

Python

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
output = generator("Machine learning is", max_length=25,
                   num_return_sequences=1)
print(output[0]['generated_text'])

Output (illustrative; generated text is sampled and varies between runs)
Machine learning is a powerful tool that allows computers to learn patterns from data and improve over time.

Short Answer · Q12 · Architectures

Match each modality to its typical architecture: text, images, sequences/audio.

Model answer

Text → Transformers (self-attention for long-range context). Images → Diffusion models (denoise random noise into a picture). Audio/sequences → specialised audio/sequence models that learn temporal and acoustic patterns. Different content types need architectures matched to their structure.

🎯 Lecture 11 must-remember: Generative AI creates new content; discriminative AI labels existing data. Modalities: text/code (Transformers), images (Diffusion: noise → image), audio (TTS, voice cloning, music). Multimodal = many data types. Inference happens after training; a token is not always one word. Ethics: verify, check bias, protect privacy, be transparent.
