AI Model Test Prompts — Reproducible Benchmark Tasks
Reproducible benchmark tasks for evaluating LLMs. Use these to test models yourself — all prompts are open-source.
These are the same prompts we use in our daily scorecards. Last updated: March 10, 2026.
Why Share Our Prompts?
- Reproducibility: You can verify our scores yourself
- Transparency: No hidden tests or secret criteria
- Usefulness: These are real production tasks, not synthetic
- Community: Help developers choose the right model for their use case
Note: Model scores vary by prompt phrasing. These prompts are designed for operator-grade evaluation (production readiness), not academic benchmarks.
Coding Tasks
TypeScript Generic Type Inference
Difficulty: Hard
This generic API client loses type information when chaining methods. Fix it so the response type is correctly inferred through the chain.
```typescript
class ApiClient<T> {
  private endpoint: string;

  constructor(endpoint: string) {
    this.endpoint = endpoint;
  }

  get<U>(id: string): Promise<U> {
    return fetch(`${this.endpoint}/${id}`).then(r => r.json());
  }

  transform<U>(fn: (data: T) => U): ApiClient<U> {
    return new ApiClient<U>(this.endpoint);
  }
}

// Usage should infer User, not unknown
const client = new ApiClient('/users');
const user = await client.get<User>('123'); // Should be Promise<User>
```
Evaluation Criteria
- Identifies the generic type flow issue
- Preserves type information through method chains
- Uses proper generic constraints
- Solution compiles without `any` or `unknown`
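For self-grading, one possible direction (a sketch, not the only acceptable answer): supply the element type once, at construction, so it flows through the chain instead of being re-declared on `get()`. The `User` interface here is illustrative.

```typescript
interface User { id: string; name: string }

class ApiClient<T> {
  constructor(private endpoint: string) {}

  // Resolves to the client's own T, so no per-call type argument is needed
  get(id: string): Promise<T> {
    return fetch(`${this.endpoint}/${id}`).then(r => r.json() as Promise<T>);
  }

  transform<U>(fn: (data: T) => U): ApiClient<U> {
    return new ApiClient<U>(this.endpoint);
  }
}

// Type parameter supplied once; it then flows through transform()
const client = new ApiClient<User>('/users');
const names = client.transform(u => u.name); // ApiClient<string>
```

A graded solution may instead keep the original call site and infer `T` another way; the criteria above, not this sketch, are what matter.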
Multi-Cloud Terraform Module
Difficulty: Hard
Create a Terraform module that deploys:
1. An AWS S3 bucket with versioning enabled
2. A GCP Cloud Storage bucket as backup target
3. A Cloudflare Worker that syncs between them daily
Include:
- Proper IAM roles (least privilege)
- Environment variable handling
- README with usage example
- Variables for bucket names and regions
Evaluation Criteria
- Uses proper provider configuration with aliases
- Implements least-privilege IAM policies
- Includes runnable example in README
- Handles environment variables securely
- Worker code is complete and functional
Reasoning Tasks
B2B SaaS Pricing Strategy
Difficulty: Hard
Your B2B SaaS has 500 customers on a 3-tier plan ($29/$99/$299).
Churn analysis shows:
- Tier 1: 8%/mo churn (price-sensitive)
- Tier 2: 3%/mo churn
- Tier 3: 1%/mo churn
- 60% of Tier 1 users hit API limits monthly
Recommend whether to:
1. Introduce usage-based pricing
2. Add a $49 middle tier
3. Grandfather existing users
Model the 12-month revenue impact for each option.
Evaluation Criteria
- Quantifies revenue impact for each option
- Considers migration friction and support costs
- Accounts for churn reduction vs revenue optimization
- Provides clear recommendation with reasoning
- Shows sensitivity analysis for key assumptions
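A strong answer typically builds a small cohort model before comparing options. As a sketch of the status-quo baseline only, using the churn rates from the prompt and an assumed customer split (the 300/150/50 split is illustrative; the prompt does not specify one):

```typescript
// 12-month revenue projection for one tier, with compounding monthly churn.
// Ignores expansion, reactivation, and mid-month proration for simplicity.
function projectRevenue(customers: number, price: number, churn: number): number {
  let total = 0;
  let c = customers;
  for (let m = 0; m < 12; m++) {
    total += c * price;
    c *= 1 - churn; // cohort shrinks each month
  }
  return total;
}

// Assumed split: 300 on Tier 1, 150 on Tier 2, 50 on Tier 3
const baseline =
  projectRevenue(300, 29, 0.08) +
  projectRevenue(150, 99, 0.03) +
  projectRevenue(50, 299, 0.01); // roughly $390k under these assumptions
```

Each pricing option then becomes a variant of this model (e.g. a $49 tier changes the split and the Tier 1 churn assumption), which is what the sensitivity-analysis criterion is probing.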
Tool Use Tasks
Redis Rate Limiter
Difficulty: Medium
Implement a token bucket rate limiter in Redis that:
- Allows 100 requests per minute per user
- Returns remaining quota in response
- Handles race conditions correctly
- Expires old buckets after 1 hour of inactivity
Provide:
- Redis Lua script
- Node.js/TypeScript wrapper
- Test cases for edge conditions
Evaluation Criteria
- Uses Redis Lua for atomicity
- Correctly implements token bucket algorithm
- Handles concurrent requests safely
- Includes proper expiration handling
- Test cases cover edge conditions
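To make the task concrete, the refill-and-consume step that the Lua script must perform atomically can be sketched in plain TypeScript. Names and structure are illustrative, and this in-memory version deliberately omits the Redis/atomicity part the prompt is actually testing.

```typescript
interface Bucket { tokens: number; lastRefillMs: number }

const CAPACITY = 100;               // 100 requests...
const REFILL_PER_MS = 100 / 60000;  // ...per minute, refilled continuously

// Refill based on elapsed time (capped at capacity), then try to take a token.
function tryConsume(bucket: Bucket, nowMs: number): { allowed: boolean; remaining: number } {
  const elapsed = nowMs - bucket.lastRefillMs;
  bucket.tokens = Math.min(CAPACITY, bucket.tokens + elapsed * REFILL_PER_MS);
  bucket.lastRefillMs = nowMs;
  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    return { allowed: true, remaining: Math.floor(bucket.tokens) };
  }
  return { allowed: false, remaining: 0 };
}
```

In the real solution this body becomes the Lua script (run via `EVAL`, with `PEXPIRE` handling the 1-hour inactivity expiry), which is what makes concurrent requests safe.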
JWT + Service Mesh Authentication
Difficulty: Medium
Design authentication for microservices that:
- Uses JWTs for user identity
- Propagates user context through service mesh
- Handles token refresh without breaking requests
- Supports service-to-service auth
Provide:
- JWT payload structure
- Middleware code for Node.js services
- Service mesh configuration (Istio or Linkerd)
- Token refresh strategy
Evaluation Criteria
- JWT structure follows best practices
- Middleware correctly validates and propagates context
- Token refresh is seamless (no dropped requests)
- Service-to-service auth is secure
- Configuration is complete and deployable
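As a hint at the expected level of detail, a payload shape and an unverified decode helper might look like this. The claim names beyond the registered `sub`/`iss`/`exp` are assumptions, and real validation would use a JWT library (e.g. `jose`) to check the signature; this helper only decodes.

```typescript
// Illustrative payload shape for this exercise
interface ServicePayload {
  sub: string;            // user id
  iss: string;            // issuing service
  exp: number;            // expiry (epoch seconds)
  scope: string[];        // coarse permissions
  act?: { sub: string };  // acting service, for service-to-service calls
}

// Unverified decode of the payload segment (base64url, per RFC 7519).
// Signature verification is intentionally omitted here.
function decodePayload(jwt: string): ServicePayload {
  const [, payload] = jwt.split('.');
  return JSON.parse(Buffer.from(payload, 'base64url').toString('utf8'));
}
```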
Contribute a Prompt
Have a good test prompt? Submit it via GitHub Issues and we'll add it with attribution.
Good prompts:
- Are real production tasks (not synthetic)
- Have clear evaluation criteria
- Take 5-20 minutes for a human to complete
- Test practical skills (coding, reasoning, tool use)
How to Use These Prompts
1. Copy the prompt text
2. Paste it into your preferred model (ChatGPT, Claude, Gemini, etc.)
3. Evaluate the response using our rubric
4. Compare scores across models
For fair comparison, use the same prompt verbatim across all models. Small wording changes can affect scores.