Adding an LLM to your product is not a strategy. "AI features" that don't solve a specific, painful user problem are a distraction and an expense. This guide is for teams that have identified a concrete use case — search, summarisation, content generation, data extraction — and need a practical path from API key to production feature.
Choosing the Right LLM for Your Use Case
The model choice is consequential and should be made based on task requirements, not model prestige. The relevant axes are capability, cost, latency, and context window:
- Complex reasoning and long-document analysis — Claude 3.7 Sonnet or GPT-4o. Higher cost, longer latency, but noticeably better on tasks requiring multi-step reasoning.
- High-volume, latency-sensitive tasks — Claude Haiku 3.5 or GPT-4o mini. 10–50× cheaper than frontier models with acceptable quality for classification, extraction, and short-form generation.
- Privacy requirements or on-premise deployment — Open-source models (Llama 3.3, Mistral, Qwen) deployed on your own infrastructure. Higher operational overhead but full data control.
Start with the cheaper, faster model and only upgrade to the more capable one when you have evidence the cheaper model isn't meeting your quality bar. You can measure this — build an evaluation set before you ship.
Integration Patterns
Pattern 1: Prompt-Response (Simplest)
User provides input → you enrich it with context → send to LLM → return response. Appropriate for content generation, summarisation, and classification tasks where the model doesn't need external data.
async function summariseDocument(text: string): Promise<string> {
const response = await anthropic.messages.create({
model: 'claude-haiku-4-5-20251001',
max_tokens: 1024,
messages: [{
role: 'user',
content: `Summarise this document in 3 bullet points:\n\n${text}`
}]
});
return response.content[0].text;
}
Pattern 2: RAG (Retrieval-Augmented Generation)
For use cases where the LLM needs to answer questions about your specific data (internal docs, product catalogue, knowledge base), RAG is the standard architecture. The LLM doesn't need to memorise your data — it retrieves relevant chunks at query time and uses them as context.
A minimal RAG system has three components: an embedding model (converts text to vectors), a vector database (stores and retrieves vectors by semantic similarity), and an LLM (generates an answer given the retrieved context). Pinecone, Weaviate, and pgvector (a PostgreSQL extension) are the most common vector database choices.
async function answerQuestion(question: string): Promise<string> {
// 1. Embed the question
const embedding = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: question
});
// 2. Retrieve relevant documents
const relevant = await vectorDB.query({
vector: embedding.data[0].embedding,
topK: 5,
includeMetadata: true
});
// 3. Build context and query LLM
const context = relevant.matches
.map(m => m.metadata.text)
.join('\n\n');
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-6',
max_tokens: 2048,
messages: [{
role: 'user',
content: \`Answer based on this context:\n\n\${context}\n\nQuestion: \${question}\`
}]
});
return response.content[0].text;
}
Pattern 3: Tool Use (Agentic)
The model can decide to call external functions (search, database queries, API calls) to gather information before responding. Appropriate for complex tasks where the model needs to take actions, not just generate text. More powerful, more expensive, harder to test reliably.
Cost Optimisation Strategies
LLM API costs can spiral quickly. The most impactful optimisations:
- Prompt caching — If you have a long system prompt that's the same across requests, cache it. Anthropic's prompt caching reduces repeated context costs by 90%.
- Response caching — Cache LLM responses for identical or near-identical inputs. Semantic caching (using embeddings to find similar previous queries) can cache responses for paraphrase-equivalent queries.
- Model routing — Use a cheap model for simple queries and route complex ones to the frontier model. A classifier (or just prompt length + keywords) can make this routing decision cheaply.
- Async where possible — For non-real-time tasks (report generation, email drafts, batch analysis), use the Batch API where available. Typically 50% cheaper than synchronous requests.
Evaluation Before Shipping
Every LLM feature needs an eval before it ships. An eval is a set of test cases — input/expected output pairs — that you run against the model to measure quality. Without evals, you're deploying blind: you don't know if a prompt change improved quality or regressed it, and you can't catch model degradation when providers update their models.
Start with 20–50 representative examples. Run them against your prompt. Measure the failure rate. Set a threshold ("we ship when <5% of examples fail"). Automate the eval to run in CI on every prompt change.
Safety and Guardrails
Any LLM feature that accepts arbitrary user input is a potential vector for prompt injection — an attacker including text in their input that overrides your system instructions. Mitigations: don't let user content appear verbatim in system prompts, validate that outputs match expected format/content, use an output classifier for high-risk actions, and apply rate limiting aggressively.
For consumer-facing features, implement content moderation on both inputs and outputs. Most LLM providers offer built-in moderation APIs.
Putting It All Together
The LLM is a component, not a product. The product is the workflow it enables. The teams that ship successful AI features are the ones that started with a specific, painful user problem, chose the simplest architecture that solves it, built an eval suite before writing production code, and measured real user outcomes — not just model quality — after shipping.
Want help integrating AI capabilities into your product? The Blaze Technologies team has built LLM-powered features for products ranging from enterprise knowledge bases to consumer mobile apps.