GenAI Commercial APIs
Use powerful AI models without training your own. Learn the OpenAI and Google AI APIs, how to manage API keys securely, how billing and rate limits work, and how to get embeddings via API.
12.1 What are Commercial AI APIs?
The provider manages all the heavy infrastructure — GPUs, scaling, updates. You just send a request and get a response. Key characteristics:
- REST-based — JSON requests and JSON responses over HTTP.
- Pay-as-you-go — subscription or usage-based pricing; no fixed cost to start.
- Authenticated — every request needs a valid API key.
- Enterprise-grade — built for security and scalability.
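As a rough illustration of these characteristics, the sketch below builds (but does not send) a JSON request for OpenAI's chat completions REST endpoint, with the key passed as a bearer token. The helper name `build_chat_request` and the placeholder key are made up for illustration.

```python
import json

OPENAI_CHAT_URL = "https://api.openai.com/v1/chat/completions"

def build_chat_request(api_key, prompt, model="gpt-4o-mini"):
    """Return the URL, headers and JSON body of a chat request (hypothetical helper)."""
    headers = {
        "Authorization": f"Bearer {api_key}",  # authenticated: every request carries the key
        "Content-Type": "application/json",    # REST-based: JSON in, JSON out
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return OPENAI_CHAT_URL, headers, body

# "sk-placeholder" is a dummy value, not a real key
url, headers, body = build_chat_request("sk-placeholder", "Explain RAG in one line.")
print(url)
print(body)
```

Any HTTP client can POST this body to the URL. The official SDKs wrap the same steps behind a client object.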
12.2 OpenAI API & Google AI API
OpenAI API
Gives access to OpenAI's models via HTTP requests for text generation, summarization, translation and embeddings. Core components:
| Component | Purpose |
|---|---|
| Client App | Sends API requests |
| API Gateway | Authentication & routing |
| AI Models | Generate responses |
| Usage Tracker | Billing & limits |
| Response Handler | Returns JSON output |
OpenAI offers chat models (conversations, Q&A), embedding models (search, RAG), lightweight models (low-cost tasks) and advanced models (complex reasoning).
Google AI API (Gemini)
Gemini is Google's AI model family — text, reasoning, code, and some image support — accessed via Google AI Studio or Vertex AI, using REST or SDKs. It is part of the broader Google Cloud ecosystem and integrates with BigQuery and Cloud Storage.
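For comparison with the OpenAI call, here is a minimal sketch of a Gemini text request over plain REST, using only the standard library. It assumes the Google AI Studio endpoint (`generativelanguage.googleapis.com`) and a key passed in the `x-goog-api-key` header. The model name `gemini-1.5-flash`, the environment variable `GEMINI_API_KEY`, and both helper names are assumptions for illustration, so check the current model list before relying on them.

```python
import json
import os
import urllib.request

GEMINI_BASE = "https://generativelanguage.googleapis.com/v1beta/models"

def build_gemini_request(api_key, prompt, model="gemini-1.5-flash"):
    """Return URL, headers and JSON body for a generateContent call (hypothetical helper)."""
    url = f"{GEMINI_BASE}/{model}:generateContent"
    headers = {"x-goog-api-key": api_key, "Content-Type": "application/json"}
    body = json.dumps({"contents": [{"parts": [{"text": prompt}]}]})
    return url, headers, body

def ask_gemini(prompt):
    """Send the request. Needs network access and GEMINI_API_KEY in the environment."""
    url, headers, body = build_gemini_request(os.environ["GEMINI_API_KEY"], prompt)
    request = urllib.request.Request(
        url, data=body.encode("utf-8"), headers=headers, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    # The generated text sits inside the first candidate's content parts
    return data["candidates"][0]["content"]["parts"][0]["text"]
```

Calling `ask_gemini("Explain RAG in one line.")` would perform the real request. Vertex AI uses different endpoints and Google Cloud authentication, which this sketch does not cover.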
OpenAI vs Google AI
| Aspect | OpenAI API | Google AI API |
|---|---|---|
| Focus | Simple APIs, fast developer onboarding | Enterprise integration, cloud-native |
| Platform style | Standalone AI platform | Part of Google Cloud |
| Billing | Token-based pricing | Integrated with Google Cloud billing |
| Best for | Startups, rapid development | Large-scale enterprise applications |
```python
import os
from openai import OpenAI

# Key is read from an environment variable - NEVER hard-coded
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain RAG in one line."}]
)
print(response.choices[0].message.content)
```
12.3 API Key Management
API key lifecycle
| Stage | What happens |
|---|---|
| Creation | Generate the key in the provider's dashboard |
| Configuration | Set permissions / usage restrictions |
| Usage | The application uses the key to authenticate calls |
| Rotation | Periodically replace the key to reduce risk |
| Revocation | Immediately disable compromised or unused keys |
Best practices for storing keys
- Store keys in environment variables, not in source code.
- Use a secrets manager for stronger protection.
- Never expose keys in frontend code — the browser can be inspected by anyone.
- Keep API calls on the backend server.
- Rotate keys periodically and limit access by role/environment.
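One way to back up the first rule is to scan source files for key-like strings before they are committed. The sketch below is a rough heuristic only: it assumes OpenAI-style keys that start with `sk-`, the minimum length of 20 characters is a guess, and `find_hardcoded_keys` is a made-up helper name. Dedicated secret scanners are more thorough.

```python
import re

# Heuristic pattern: "sk-" followed by a long run of key-like characters (assumed shape)
KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{20,}")

def find_hardcoded_keys(source_text):
    """Return (line_number, redacted_match) pairs for key-like strings."""
    findings = []
    for line_number, line in enumerate(source_text.splitlines(), start=1):
        for match in KEY_PATTERN.finditer(line):
            # Report only a short prefix so the scan itself never prints a full key
            findings.append((line_number, match.group()[:5] + "..."))
    return findings

# The fake key is assembled at runtime so this example contains no key-like literal
sample = (
    "import os\n"
    'api_key = "sk-' + "a" * 24 + '"\n'
    'safe_key = os.environ["OPENAI_API_KEY"]\n'
)
print(find_hardcoded_keys(sample))
```

Only the second line of the sample is flagged. The line that reads the key from the environment passes the check.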
12.4 Costs, Billing & Rate Limits
Token-based pricing
OpenAI charges based on the number of tokens processed — and this counts both directions:
- Input tokens — the size of your prompt.
- Output tokens — the size of the model's response.
Different models have different per-token prices: lightweight models are cheaper but less capable; advanced models cost more but reason better. Pricing is usage-based with no fixed monthly fee by default.
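The arithmetic behind a token-based bill can be sketched as follows. The prices used here are placeholders chosen to make the numbers easy to follow, not real quotes from any provider, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimate the cost of one request from its token counts and per-million prices."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost

# Placeholder prices in USD per million tokens (NOT real pricing)
cost = estimate_cost(
    input_tokens=1_200,
    output_tokens=300,
    input_price_per_million=0.50,
    output_price_per_million=1.50,
)
print(f"Estimated cost: ${cost:.6f}")
```

With these placeholder prices the input side costs 1,200 / 1,000,000 × 0.50 = 0.0006 and the output side 300 / 1,000,000 × 1.50 = 0.00045, for a total of 0.00105. In the OpenAI Python SDK, the actual counts for a chat request are reported on the response as `response.usage.prompt_tokens` and `response.usage.completion_tokens`.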
Rate Limits
Limits may apply per API key, per user, or per IP. Exceeding them temporarily blocks requests: the API returns HTTP 429 (Too Many Requests), often with a Retry-After header indicating how long to wait. Other common codes: 401 = authentication issue, 403 = access restricted.
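A client can branch on these status codes and honor the Retry-After header when it is present. The sketch below assumes the header carries a number of seconds (it may also be an HTTP date, in which case this code falls back to a default) and that `headers` is a plain dict with that exact key spelling. Both function names are hypothetical.

```python
def classify_status(status_code):
    """Map an HTTP status code to a coarse action for the caller."""
    if status_code == 429:
        return "retry"            # rate limited: wait, then try again
    if status_code in (401, 403):
        return "check-access"     # bad key or restricted access: retrying will not help
    if 200 <= status_code < 300:
        return "ok"
    return "error"

def retry_delay(headers, default_delay=1.0):
    """Seconds to wait before retrying: Retry-After if numeric, else default_delay."""
    value = headers.get("Retry-After")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        # Header missing (None) or not a plain number, e.g. an HTTP date
        return default_delay

print(classify_status(429), retry_delay({"Retry-After": "3"}))
```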
Handling rate limits gracefully
- Exponential backoff — on a 429, wait, retry; if it fails again wait longer, retry; keep doubling the wait. This avoids hammering the server.
- Queue requests instead of sending bursts.
- Cache results to avoid duplicate calls.
- Monitor usage; upgrade the plan for higher throughput.
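```python
import time

def call_with_backoff(make_request, max_retries=5):
    """Call make_request(), retrying on HTTP 429 with exponentially growing waits."""
    delay = 1  # seconds to wait before the first retry
    for attempt in range(max_retries):
        response = make_request()
        if response.status_code != 429:  # success or a non-rate-limit error
            return response
        if attempt < max_retries - 1:  # no point waiting after the final attempt
            print(f"Rate limited. Waiting {delay}s...")
            time.sleep(delay)
            delay *= 2  # double the wait each time
    raise RuntimeError("Still rate-limited after retries")
```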
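Caching, another strategy from the list above, can be sketched with a plain dictionary keyed by the prompt. Here `fake_model` stands in for a real API call so the example runs offline. Both it and `make_cached` are hypothetical names.

```python
def make_cached(call_model):
    """Wrap call_model so repeated prompts are answered from memory."""
    cache = {}

    def cached_call(prompt):
        if prompt not in cache:
            cache[prompt] = call_model(prompt)  # only the first request reaches the API
        return cache[prompt]

    return cached_call

calls = []  # records every prompt that actually reaches the "API"

def fake_model(prompt):
    """Stand-in for a real API call."""
    calls.append(prompt)
    return f"answer to: {prompt}"

ask = make_cached(fake_model)
ask("What is RAG?")
ask("What is RAG?")  # served from the cache, no second call
print(len(calls))
```

The standard library's `functools.lru_cache` provides the same idea with a size limit. Caching by exact prompt is only appropriate when an identical answer is acceptable for an identical prompt.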
12.5 Word Embeddings via API
Both OpenAI and Google offer dedicated embedding models via API. They convert text into fixed-length numerical vectors that capture semantic meaning — the foundation of search, recommendations and RAG (Lecture 16).
- Similar texts → vectors that are close in vector space.
- An embedding is generated once per text, then stored and reused, so repeated searches over the same documents do not pay to embed them again.
- They are stored in vector databases for fast similarity search.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How many vacation days do I get?"
)
vector = response.data[0].embedding
print("Embedding length:", len(vector))  # a fixed-length numeric vector
```
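To make "close in vector space" concrete, the sketch below compares vectors with cosine similarity, a common closeness measure for embeddings. The three-dimensional vectors are invented toy values, since real embeddings have hundreds or thousands of dimensions, and `most_similar` is a hypothetical stand-in for what a vector database does at scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_vector, documents):
    """Return the name of the document whose vector is closest to the query."""
    return max(
        documents,
        key=lambda name: cosine_similarity(query_vector, documents[name]),
    )

# Toy vectors, made up for illustration; real ones come from an embeddings API
docs = {
    "leave policy": [0.9, 0.1, 0.0],
    "expense policy": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "How many vacation days do I get?"
print(most_similar(query, docs))
```

The query vector points in nearly the same direction as the "leave policy" vector, so that document is returned even though no keywords were compared.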
API security and rate limits are common MCQ material.
The main benefit of using a commercial AI API is that:
The provider hosts the models and manages the GPUs/scaling. You just send HTTP requests — no training or infrastructure needed.
An API key is primarily used to:
The key identifies who is calling, what access they have, and lets the provider track usage and apply billing and rate limits.
Where should an API key be stored in a production application?
Keys belong in environment variables or a secrets manager and should only be used server-side. Keys that are hard-coded or exposed in frontend code or public repositories are easy to find and abuse.
Why must the API call code be kept on the backend, not the frontend?
Anyone can open browser dev tools and read frontend code. Keeping the key and call on the backend hides the credential from users.
OpenAI's token-based pricing charges for:
Both the prompt and the generated response count toward token usage — that is why concise prompts and outputs reduce cost.
Which HTTP status code means "Too Many Requests" (rate limit exceeded)?
HTTP 429 = Too Many Requests. 401 = authentication issue, 403 = access restricted, 200 = success.
"Exponential backoff" means that after each failed retry you:
Exponential backoff doubles the wait time after each failure (1s, 2s, 4s…), reducing load on the server and improving the chance of eventual success.
Why are embeddings better than keyword search?
Embeddings capture semantic meaning, so semantically similar texts have nearby vectors — even when they share no exact keywords.
What happens if an API key is leaked publicly, and what two actions should you take?
A leaked key can be used by attackers to make requests on your account — causing unexpected billing spikes and possibly generating harmful content under your name. You should immediately revoke (disable) the compromised key and generate a new one (rotate), then update your application to use it from a secure store.
Write code that creates an OpenAI client by reading the API key securely from an environment variable.
```python
import os
from openai import OpenAI

# Read the key from the environment - never hard-code it
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise ValueError("OPENAI_API_KEY not set")

client = OpenAI(api_key=api_key)
print("Client created securely.")
```
The key lives in the environment, not in the code, so it is never committed to version control.
Write code to send a chat request to the OpenAI API asking it to "Summarise photosynthesis in one sentence" and print the reply.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user",
         "content": "Summarise photosynthesis in one sentence"}
    ]
)
print(response.choices[0].message.content)
```
Why do AI providers enforce rate limits? Name two strategies for handling them gracefully.
AI inference is compute-heavy, so rate limits prevent system overload during traffic spikes, ensure fair usage for all customers, maintain low latency, and enforce pricing tiers. Two graceful strategies: exponential backoff (wait progressively longer before each retry) and caching results / queuing requests to avoid bursts and duplicate calls.