AI Model Test Prompts — Reproducible Benchmark Tasks
Reproducible benchmark tasks for evaluating LLMs. Use these to test models yourself — all prompts are open-source.
These are the same prompts we use in our daily scorecards. Last updated: March 10, 2026.
Why Share Our Prompts?
- Reproducibility: You can verify our scores yourself
- Transparency: No hidden tests or secret criteria
- Usefulness: These are real production tasks, not synthetic
- Community: Help developers choose the right model for their use case
Note: Model scores vary by prompt phrasing. These prompts are designed for operator-grade evaluation (production readiness), not academic benchmarks.
Coding Tasks
TypeScript Generic Type Inference
Difficulty: Hard
This generic API client loses type information when chaining methods. Fix it so the response type is correctly inferred through the chain.
```typescript
class ApiClient<T> {
  private endpoint: string;

  constructor(endpoint: string) {
    this.endpoint = endpoint;
  }

  get<U>(id: string): Promise<U> {
    return fetch(`${this.endpoint}/${id}`).then(r => r.json());
  }

  transform<U>(fn: (data: T) => U): ApiClient<U> {
    return new ApiClient<U>(this.endpoint);
  }
}

// Usage should infer User, not unknown
const client = new ApiClient('/users');
const user = await client.get<User>('123'); // Should be Promise<User>
```
Evaluation Criteria
- Identifies the generic type flow issue
- Preserves type information through method chains
- Uses proper generic constraints
- Solution compiles without `any` or `unknown`
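For self-grading, one possible direction (a sketch, not the only acceptable answer): supply the element type once, at construction, so it flows through the chain instead of being re-declared on `get()`. The `User` interface here is illustrative.

```typescript
interface User { id: string; name: string }

class ApiClient<T> {
  constructor(private endpoint: string) {}

  // Resolves to the client's own T, so no per-call type argument is needed
  get(id: string): Promise<T> {
    return fetch(`${this.endpoint}/${id}`).then(r => r.json() as Promise<T>);
  }

  transform<U>(fn: (data: T) => U): ApiClient<U> {
    return new ApiClient<U>(this.endpoint);
  }
}

// Type parameter supplied once; it then flows through transform()
const client = new ApiClient<User>('/users');
const names = client.transform(u => u.name); // ApiClient<string>
```

A graded solution may instead keep the original call site and infer `T` another way; the criteria above, not this sketch, are what matter.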
Multi-Cloud Terraform Module
Difficulty: Hard
Create a Terraform module that deploys:
1. An AWS S3 bucket with versioning enabled
2. A GCP Cloud Storage bucket as backup target
3. A Cloudflare Worker that syncs between them daily
Include:
- Proper IAM roles (least privilege)
- Environment variable handling
- README with usage example
- Variables for bucket names and regions
Evaluation Criteria
- Uses proper provider configuration with aliases
- Implements least-privilege IAM policies
- Includes runnable example in README
- Handles environment variables securely
- Worker code is complete and functional
Reasoning Tasks
B2B SaaS Pricing Strategy
Difficulty: Hard
Your B2B SaaS has 500 customers on a 3-tier plan ($29/$99/$299).
Churn analysis shows:
- Tier 1: 8%/mo churn (price-sensitive)
- Tier 2: 3%/mo churn
- Tier 3: 1%/mo churn
- 60% of Tier 1 users hit API limits monthly
Recommend whether to:
1. Introduce usage-based pricing
2. Add a $49 middle tier
3. Grandfather existing users
Model the 12-month revenue impact for each option.
Evaluation Criteria
- Quantifies revenue impact for each option
- Considers migration friction and support costs
- Accounts for churn reduction vs revenue optimization
- Provides clear recommendation with reasoning
- Shows sensitivity analysis for key assumptions
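A strong answer typically builds a small cohort model before comparing options. As a sketch of the status-quo baseline only, using the churn rates from the prompt and an assumed customer split (the 300/150/50 split is illustrative; the prompt does not specify one):

```typescript
// 12-month revenue projection for one tier, with compounding monthly churn.
// Ignores expansion, reactivation, and mid-month proration for simplicity.
function projectRevenue(customers: number, price: number, churn: number): number {
  let total = 0;
  let c = customers;
  for (let m = 0; m < 12; m++) {
    total += c * price;
    c *= 1 - churn; // cohort shrinks each month
  }
  return total;
}

// Assumed split: 300 on Tier 1, 150 on Tier 2, 50 on Tier 3
const baseline =
  projectRevenue(300, 29, 0.08) +
  projectRevenue(150, 99, 0.03) +
  projectRevenue(50, 299, 0.01); // roughly $390k under these assumptions
```

Each pricing option then becomes a variant of this model (e.g. a $49 tier changes the split and the Tier 1 churn assumption), which is what the sensitivity-analysis criterion is probing.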
Tool Use Tasks
Redis Rate Limiter
Difficulty: Medium
Implement a token bucket rate limiter in Redis that:
- Allows 100 requests per minute per user
- Returns remaining quota in response
- Handles race conditions correctly
- Expires old buckets after 1 hour of inactivity
Provide:
- Redis Lua script
- Node.js/TypeScript wrapper
- Test cases for edge conditions
Evaluation Criteria
- Uses Redis Lua for atomicity
- Correctly implements token bucket algorithm
- Handles concurrent requests safely
- Includes proper expiration handling
- Test cases cover edge conditions
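To make the task concrete, the refill-and-consume step that the Lua script must perform atomically can be sketched in plain TypeScript. Names and structure are illustrative, and this in-memory version deliberately omits the Redis/atomicity part the prompt is actually testing.

```typescript
interface Bucket { tokens: number; lastRefillMs: number }

const CAPACITY = 100;               // 100 requests...
const REFILL_PER_MS = 100 / 60000;  // ...per minute, refilled continuously

// Refill based on elapsed time (capped at capacity), then try to take a token.
function tryConsume(bucket: Bucket, nowMs: number): { allowed: boolean; remaining: number } {
  const elapsed = nowMs - bucket.lastRefillMs;
  bucket.tokens = Math.min(CAPACITY, bucket.tokens + elapsed * REFILL_PER_MS);
  bucket.lastRefillMs = nowMs;
  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    return { allowed: true, remaining: Math.floor(bucket.tokens) };
  }
  return { allowed: false, remaining: 0 };
}
```

In the real solution this body becomes the Lua script (run via `EVAL`, with `PEXPIRE` handling the 1-hour inactivity expiry), which is what makes concurrent requests safe.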
JWT + Service Mesh Authentication
Difficulty: Medium
Design authentication for microservices that:
- Uses JWTs for user identity
- Propagates user context through service mesh
- Handles token refresh without breaking requests
- Supports service-to-service auth
Provide:
- JWT payload structure
- Middleware code for Node.js services
- Service mesh configuration (Istio or Linkerd)
- Token refresh strategy
Evaluation Criteria
- JWT structure follows best practices
- Middleware correctly validates and propagates context
- Token refresh is seamless (no dropped requests)
- Service-to-service auth is secure
- Configuration is complete and deployable
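As a hint at the expected level of detail, a payload shape and an unverified decode helper might look like this. The claim names beyond the registered `sub`/`iss`/`exp` are assumptions, and real validation would use a JWT library (e.g. `jose`) to check the signature; this helper only decodes.

```typescript
// Illustrative payload shape for this exercise
interface ServicePayload {
  sub: string;            // user id
  iss: string;            // issuing service
  exp: number;            // expiry (epoch seconds)
  scope: string[];        // coarse permissions
  act?: { sub: string };  // acting service, for service-to-service calls
}

// Unverified decode of the payload segment (base64url, per RFC 7519).
// Signature verification is intentionally omitted here.
function decodePayload(jwt: string): ServicePayload {
  const [, payload] = jwt.split('.');
  return JSON.parse(Buffer.from(payload, 'base64url').toString('utf8'));
}
```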
Contribute a Prompt
Have a good test prompt? Submit it via GitHub Issues and we'll add it with attribution.
Good prompts:
- Are real production tasks (not synthetic)
- Have clear evaluation criteria
- Take 5-20 minutes for a human to complete
- Test practical skills (coding, reasoning, tool use)
How to Use These Prompts
1. Copy the prompt text
2. Paste it into your preferred model (ChatGPT, Claude, Gemini, etc.)
3. Evaluate the response using our rubric
4. Compare scores across models
For fair comparison, use the same prompt verbatim across all models. Small wording changes can affect scores.