Benchmarking Prompt Enhancement Quality Across Leading LLMs
Challenge: Prompt enhancement — rewriting user prompts to improve downstream model output — directly shapes generation quality, cost, and user trust, yet no standard benchmark exists for comparing enhancement approaches.
Solution: The evaluation was structured in two phases, one per modality (text and video), with a shared methodology for aggregating human judgments into statistically defensible recommendations.
Result: The benchmark replaced subjective internal debate with a quantitative production recommendation.
Prompt enhancement — rewriting user prompts to improve downstream model output — directly shapes generation quality, cost, and user trust, yet no standard benchmark exists for comparing enhancement approaches. The client was evaluating multiple LLM-based enhancement solutions and could not determine which produced the best downstream results. Internal A/B tests on small prompt sets produced inconsistent conclusions depending on prompt selection, evaluator, and modality. For video generation, where each iteration cycle is expensive (minutes of GPU time per render), deploying a suboptimal enhancement solution compounds cost across every user request. They needed a statistically rigorous comparison framework, not another subjective review.
The evaluation was structured in two phases, one per modality, with a shared methodology for aggregating human judgments into statistically defensible recommendations.
The text phase used side-by-side comparisons: for each of 180 prompts, all candidate enhancement solutions produced an enhanced version, and 3 independent annotators selected the best one. Annotators evaluated holistic quality across three dimensions — intent preservation (does the enhanced prompt maintain the user's original goal?), structural clarity (is the enhanced prompt well-organized and unambiguous?), and effectiveness (would the enhanced prompt produce better downstream output?). Majority vote determined the winner per prompt; prompts with no majority were flagged as ties.
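The per-prompt aggregation described above can be sketched in a few lines. This is an illustrative implementation, not the client's actual pipeline; the vote format (one solution name per annotator) is an assumption.

```python
from collections import Counter

def majority_winner(votes):
    """Return the majority-vote winner among annotator picks, or None for a tie.

    votes: list of solution names, one entry per annotator
    (hypothetical format; with 3 annotators a 2-of-3 pick wins).
    """
    top, top_count = Counter(votes).most_common(1)[0]
    # Strict majority: more than half the annotators must agree,
    # otherwise the prompt is flagged as a tie.
    if top_count > len(votes) / 2:
        return top
    return None

print(majority_winner(["A", "A", "B"]))  # A
print(majority_winner(["A", "B", "C"]))  # None (tie)
```

With three annotators, a strict majority is exactly the 2-of-3 rule described above; the same function generalizes if the annotator pool grows.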
The video phase evaluated enhancement quality indirectly through downstream output. Enhanced prompts were fed to the client's video generation model, and annotators compared the resulting videos in pairwise format. This design is critical: a prompt enhancement that reads well as text but produces worse video is a net negative. By evaluating the enhancement's effect on the final output, the benchmark measured what actually matters for production deployment.
Results were aggregated across annotators using weighted voting (annotators with higher calibration scores received proportionally more weight) and tested for statistical significance using a paired permutation test. Two models emerged as clear leaders with p < 0.01 separation from the remaining candidates.
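A minimal sketch of the aggregation and significance test described above, assuming a simplified setup: each annotator casts a +1/-1 vote between two solutions, votes are weighted by a calibration score, and per-prompt margins feed a sign-flip (paired permutation) test. Function names and the vote encoding are hypothetical.

```python
import random

def weighted_margin(votes, weights):
    """Calibration-weighted per-prompt margin for solution A over B.

    votes: list of +1 (annotator picked A) or -1 (picked B).
    weights: calibration score per annotator; higher scores count more.
    """
    return sum(w * v for w, v in zip(weights, votes))

def paired_permutation_test(diffs, n_perm=10_000, seed=0):
    """Two-sided paired permutation test on per-prompt margins.

    Under the null hypothesis the two solutions are interchangeable,
    so each prompt's margin is equally likely to have either sign.
    Returns an approximate p-value.
    """
    rng = random.Random(seed)
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_perm):
        # Randomly flip the sign of each prompt's margin.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            extreme += 1
    return extreme / n_perm
```

For example, twenty prompts that all favor one solution yield a p-value far below 0.01, while margins that roughly cancel out do not reach significance, which is the separation criterion the benchmark used.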
The benchmark replaced subjective internal debate with a quantitative production recommendation. The client deployed the top-ranked enhancement model for their video generation workflow, where the cost impact is highest. The evaluation framework was retained as organizational infrastructure — each new candidate enhancement solution can now be benchmarked against the existing leader using a subset of the original prompt set, reducing evaluation cost for future decisions by an estimated 60%.
Representative record from the annotation pipeline.
Camera slowly pans right across the landscape, revealing more of the mountain range
{
  "project_title": "LLM Prompt Enhancer",
  "better_video_id": "Claude 3.7 Sonnet",
  "prompt_text": "Camera slowly pans right across the landscape...",
  "video_1_key": "Claude 3.7 Sonnet",
  "video_2_key": "Llama v3 Quality",
  "created_at": "June 10, 2025, 4:28 PM",
  "status": "completed"
}
Ready to build your next dataset?
Tell us about your project and we will scope a plan within 48 hours.