
AI Model Test Prompts

Reproducible benchmark tasks for evaluating LLMs. Use these to test models yourself — all prompts are open-source.

These are the same prompts we use in our daily scorecards. Last updated: March 10, 2026.

Why Share Our Prompts?

  • Reproducibility: You can verify our scores yourself
  • Transparency: No hidden tests or secret criteria
  • Usefulness: These are real production tasks, not synthetic
  • Community: Help developers choose the right model for their use case

Note: Model scores vary by prompt phrasing. These prompts are designed for operator-grade evaluation (production readiness), not academic benchmarks.

Coding Tasks

TypeScript Generic Type Inference

Hard
⏱️ 5-10 minutes 📖 Daily scorecard tests
This generic API client loses type information when chaining methods. Fix it so the response type is correctly inferred through the chain.

```typescript
class ApiClient<T> {
  private endpoint: string;
  
  constructor(endpoint: string) {
    this.endpoint = endpoint;
  }
  
  get<U>(id: string): Promise<U> {
    return fetch(`${this.endpoint}/${id}`).then(r => r.json());
  }
  
  transform<U>(fn: (data: T) => U): ApiClient<U> {
    return new ApiClient<U>(this.endpoint);
  }
}

// A representative response shape for the example
interface User {
  id: string;
  name: string;
}

// Usage should infer User, not unknown
const client = new ApiClient('/users');
const user = await client.get<User>('123'); // Should be Promise<User>
```

Evaluation Criteria

  • Identifies the generic type flow issue
  • Preserves type information through method chains
  • Uses proper generic constraints
  • Solution compiles without `any` or `unknown`
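For reference, one shape a passing solution might take. This is a sketch, not the canonical answer: the `User` interface and the injectable `fetchJson` parameter are assumptions added for illustration. The idea is to fix the response type when the client is created and have `transform` compose a stored decode function, so type information survives the chain.

```typescript
// Hypothetical response shape, added for illustration.
interface User {
  id: string;
  name: string;
}

class ApiClient<Raw, Out = Raw> {
  constructor(
    private endpoint: string,
    // Maps the raw API payload to the exposed type; identity at the root.
    private decode: (raw: Raw) => Out,
  ) {}

  static at<T>(endpoint: string): ApiClient<T, T> {
    return new ApiClient<T, T>(endpoint, (r) => r);
  }

  async get(
    id: string,
    // Injectable fetcher so the client is testable without a network.
    fetchJson: (url: string) => Promise<Raw> = (url) =>
      fetch(url).then((r) => r.json() as Promise<Raw>),
  ): Promise<Out> {
    const raw = await fetchJson(`${this.endpoint}/${id}`);
    return this.decode(raw);
  }

  // Returns a client whose output type is U, so chaining keeps inference.
  transform<U>(fn: (data: Out) => U): ApiClient<Raw, U> {
    return new ApiClient<Raw, U>(this.endpoint, (raw) => fn(this.decode(raw)));
  }
}

const client = ApiClient.at<User>('/users');
const names = client.transform((u) => u.name);
// client.get('123') is Promise<User>; names.get('123') is Promise<string>
```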

Multi-Cloud Terraform Module

Hard
⏱️ 15-20 minutes 📖 Daily scorecard tests
Create a Terraform module that deploys: 
1. An AWS S3 bucket with versioning enabled
2. A GCP Cloud Storage bucket as backup target  
3. A Cloudflare Worker that syncs between them daily

Include:
- Proper IAM roles (least privilege)
- Environment variable handling
- README with usage example
- Variables for bucket names and regions

Evaluation Criteria

  • Uses proper provider configuration with aliases
  • Implements least-privilege IAM policies
  • Includes runnable example in README
  • Handles environment variables securely
  • Worker code is complete and functional
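The Worker's daily sync step reduces to a diff between the two bucket listings. A sketch of that core in TypeScript — the `ObjectMeta` shape is an assumption for illustration, not the real S3 or GCS listing schema:

```typescript
// Minimal object metadata; field names are illustrative only.
interface ObjectMeta {
  key: string;
  etag: string;
}

// Returns the keys the Worker would need to copy so that `backup`
// mirrors `source`: objects missing from backup or with a changed etag.
function keysToSync(source: ObjectMeta[], backup: ObjectMeta[]): string[] {
  const backupEtags = new Map(backup.map((o) => [o.key, o.etag]));
  return source
    .filter((o) => backupEtags.get(o.key) !== o.etag)
    .map((o) => o.key);
}
```

A cron-triggered Worker would call this with listings from both buckets, then stream each changed object from S3 to Cloud Storage.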

Reasoning Tasks

B2B SaaS Pricing Strategy

Hard
⏱️ 10-15 minutes 📖 Daily scorecard tests
Your B2B SaaS has 500 customers on a 3-tier plan ($29/$99/$299). 

Churn analysis shows:
- Tier 1: 8%/mo churn (price-sensitive)
- Tier 2: 3%/mo churn  
- Tier 3: 1%/mo churn
- 60% of Tier 1 users hit API limits monthly

Recommend whether to:
1. Introduce usage-based pricing
2. Add a $49 middle tier
3. Grandfather existing users

Model the 12-month revenue impact for each option.

Evaluation Criteria

  • Quantifies revenue impact for each option
  • Considers migration friction and support costs
  • Accounts for churn reduction vs revenue optimization
  • Provides clear recommendation with reasoning
  • Shows sensitivity analysis for key assumptions
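A strong answer models revenue explicitly rather than hand-waving. A minimal projection helper, assuming constant monthly churn and no new signups — the prompt does not give the per-tier customer split, so any concrete totals depend on that assumption:

```typescript
// Projects cumulative recurring revenue over a horizon, assuming a
// constant monthly churn rate and no new signups (a simplification).
function projectRevenue(
  customers: number,
  pricePerMonth: number,
  monthlyChurn: number, // e.g. 0.08 for 8%
  months: number,
): number {
  let remaining = customers;
  let total = 0;
  for (let m = 0; m < months; m++) {
    total += remaining * pricePerMonth;
    remaining *= 1 - monthlyChurn;
  }
  return total;
}
```

Running this per tier and per scenario (usage-based pricing, a $49 tier, grandfathering) gives comparable 12-month figures, and varying the churn inputs doubles as the sensitivity analysis the rubric asks for.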

Tool Use Tasks

Redis Rate Limiter

Medium
⏱️ 10-15 minutes 📖 Daily scorecard tests
Implement a token bucket rate limiter in Redis that:
- Allows 100 requests per minute per user
- Returns remaining quota in response
- Handles race conditions correctly
- Expires old buckets after 1 hour of inactivity

Provide:
- Redis Lua script
- Node.js/TypeScript wrapper
- Test cases for edge conditions

Evaluation Criteria

  • Uses Redis Lua for atomicity
  • Correctly implements token bucket algorithm
  • Handles concurrent requests safely
  • Includes proper expiration handling
  • Test cases cover edge conditions
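For reference, the token bucket math the Lua script must implement, ported to TypeScript so it can be read outside Redis. In production this read-modify-write belongs in a single Lua script so Redis executes it atomically; the state layout here is an assumption for illustration:

```typescript
interface Bucket {
  tokens: number;
  lastRefillMs: number;
}

// Capacity 100, refilled continuously at 100 tokens per minute.
const CAPACITY = 100;
const REFILL_PER_MS = 100 / 60_000;

function tryConsume(
  bucket: Bucket,
  nowMs: number,
): { allowed: boolean; remaining: number } {
  // Refill proportionally to elapsed time, capped at capacity.
  const elapsed = nowMs - bucket.lastRefillMs;
  bucket.tokens = Math.min(CAPACITY, bucket.tokens + elapsed * REFILL_PER_MS);
  bucket.lastRefillMs = nowMs;
  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    return { allowed: true, remaining: Math.floor(bucket.tokens) };
  }
  return { allowed: false, remaining: 0 };
}
```

In the Redis version, the same logic reads and writes a hash keyed per user, with `PEXPIRE` setting the 1-hour inactivity TTL on every call.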

JWT + Service Mesh Authentication

Medium
⏱️ 15-20 minutes 📖 Daily scorecard tests
Design authentication for microservices that:
- Uses JWTs for user identity
- Propagates user context through service mesh
- Handles token refresh without breaking requests
- Supports service-to-service auth

Provide:
- JWT payload structure
- Middleware code for Node.js services
- Service mesh configuration (Istio or Linkerd)
- Token refresh strategy

Evaluation Criteria

  • JWT structure follows best practices
  • Middleware correctly validates and propagates context
  • Token refresh is seamless (no dropped requests)
  • Service-to-service auth is secure
  • Configuration is complete and deployable
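A plausible payload structure and the claim checks middleware would run after signature verification, sketched as an assumption — field names beyond the registered JWT claims (`sub`, `iss`, `aud`, `iat`, `exp`) are illustrative, and signature verification itself should be delegated to a vetted library such as jose or jsonwebtoken:

```typescript
// Hypothetical access-token payload; `scope` is an illustrative custom claim.
interface AccessTokenPayload {
  sub: string;     // user id
  iss: string;     // issuing auth service
  aud: string;     // intended audience (service name)
  iat: number;     // issued-at, seconds since epoch
  exp: number;     // expiry, seconds since epoch
  scope: string[]; // granted permissions
}

// Claim checks a sidecar or middleware runs after the signature is
// verified: right audience, not expired, not issued in the future.
function validateClaims(
  p: AccessTokenPayload,
  audience: string,
  nowSec: number,
): boolean {
  return p.aud === audience && p.exp > nowSec && p.iat <= nowSec;
}
```

Each service would check its own name against `aud`, which is also how service-to-service tokens are scoped to a single downstream.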

Contribute a Prompt

Have a good test prompt? Submit it via GitHub Issues and we'll add it with attribution.

Good prompts:

  • Are real production tasks (not synthetic)
  • Have clear evaluation criteria
  • Take 5-20 minutes for a human to complete
  • Test practical skills (coding, reasoning, tool use)

How to Use These Prompts

  1. Copy the prompt text
  2. Paste into your preferred model (ChatGPT, Claude, Gemini, etc.)
  3. Evaluate the response using our rubric
  4. Compare scores across models

For fair comparison, use the same prompt verbatim across all models. Small wording changes can affect scores.