Evaluation

Benchmarking Prompt Enhancement Quality Across Leading LLMs

180 prompts benchmarked across modalities
summary.md

Challenge: Prompt enhancement — rewriting user prompts to improve downstream model output — directly shapes generation quality, cost, and user trust, yet no standard benchmark exists for comparing enhancement approaches.

Solution: The evaluation was structured in two phases, one per modality, with a shared methodology for aggregating human judgments into statistically defensible recommendations.

Result: The benchmark replaced subjective internal debate with a quantitative production recommendation.

// THE CHALLENGE

Prompt enhancement — rewriting user prompts to improve downstream model output — directly shapes generation quality, cost, and user trust, yet no standard benchmark exists for comparing enhancement approaches. The client was evaluating multiple LLM-based enhancement solutions and could not determine which produced the best downstream results. Internal A/B tests on small prompt sets produced inconsistent conclusions depending on prompt selection, evaluator, and modality. For video generation, where each iteration cycle is expensive (minutes of GPU time per render), deploying a suboptimal enhancement solution compounds cost across every user request. They needed a statistically rigorous comparison framework, not another subjective review.

// OUR APPROACH

The evaluation was structured in two phases, one per modality, with a shared methodology for aggregating human judgments into statistically defensible recommendations.

The text phase used side-by-side comparisons: for each of 180 prompts, all candidate enhancement solutions produced an enhanced version, and 3 independent annotators selected the best one. Annotators evaluated holistic quality across three dimensions — intent preservation (does the enhanced prompt maintain the user's original goal?), structural clarity (is the enhanced prompt well-organized and unambiguous?), and effectiveness (would the enhanced prompt produce better downstream output?). Majority vote determined the winner per prompt; prompts with no majority were flagged as ties.
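The per-prompt aggregation described above can be sketched in a few lines. This is a minimal illustration, not the actual pipeline code; the function and variable names are hypothetical:

```python
from collections import Counter

def aggregate_votes(votes):
    """Majority vote across annotators for one prompt.

    votes: the enhancement solution each annotator picked as best,
    e.g. ["model_a", "model_a", "model_b"]. With no strict majority,
    the prompt is flagged as a tie.
    """
    counts = Counter(votes)
    winner, top = counts.most_common(1)[0]
    if top > len(votes) / 2:
        return winner
    return "tie"  # flagged for exclusion from the win tally
```

With 3 annotators, `aggregate_votes(["a", "a", "b"])` returns `"a"`, while a three-way split such as `["a", "b", "c"]` is flagged as `"tie"`.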

The video phase evaluated enhancement quality indirectly through downstream output. Enhanced prompts were fed to the client's video generation model, and annotators compared the resulting videos in pairwise format. This design is critical: a prompt enhancement that reads well as text but produces worse video is a net negative. By evaluating the enhancement's effect on the final output, the benchmark measured what actually matters for production deployment.

Results were aggregated across annotators using weighted voting (annotators with higher calibration scores received proportionally more weight) and tested for statistical significance using a paired permutation test. Two models emerged as clear leaders with p < 0.01 separation from the remaining candidates.
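A paired permutation test of this kind can be sketched as follows. This simplified version works on per-prompt win indicators and omits the annotator calibration weights used in the actual evaluation; names are illustrative:

```python
import random

def paired_permutation_test(wins_a, wins_b, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test on per-prompt outcomes.

    wins_a / wins_b: 1/0 win indicators for two candidates on the
    same prompts (ties excluded). Under the null hypothesis the
    candidate labels are exchangeable within each prompt, so each
    resample randomly swaps them and recomputes the win difference.
    """
    rng = random.Random(seed)
    observed = sum(wins_a) - sum(wins_b)
    extreme = 0
    for _ in range(n_resamples):
        diff = 0
        for a, b in zip(wins_a, wins_b):
            if rng.random() < 0.5:
                a, b = b, a  # swap labels for this prompt
            diff += a - b
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / n_resamples
```

A candidate that wins every one of 20 paired comparisons yields a p-value near zero, while a perfectly even split yields p = 1.0.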

01 Generate: Enhance 180 prompts across LLMs
02 Compare: Side-by-side human evaluation (text)
03 Rank: Pairwise comparison with annotators (video)
04 Recommend: Identify top 2 models for production
// RESULTS
180 Prompts evaluated across 2 modalities
2 Clear statistical winners identified (p < 0.01)
3 Evaluation dimensions (intent, clarity, effectiveness)
p < 0.01 Statistical significance separating top 2 from field
// IMPACT

The benchmark replaced subjective internal debate with a quantitative production recommendation. The client deployed the top-ranked enhancement model for their video generation workflow, where the cost impact is highest. The evaluation framework was retained as organizational infrastructure — each new candidate enhancement solution can now be benchmarked against the existing leader using a subset of the original prompt set, reducing evaluation cost for future decisions by an estimated 60%.

// SAMPLE DATA

Representative record from the annotation pipeline.

llm_enhancer_sample.json
// INPUT PROMPT

Camera slowly pans right across the landscape, revealing more of the mountain range

// ENHANCED OUTPUT COMPARISON
Video 1 — Claude 3.7 Sonnet
Video 2 — Llama v3 Quality
Which video better fulfills the prompt? WINNER: Claude 3.7 Sonnet
// JSON_RESPONSE
{
"project_title": "LLM Prompt Enhancer",
"better_video_id": "Claude 3.7 Sonnet",
"prompt_text": "Camera slowly pans right across the landscape...",
"video_1_key": "Claude 3.7 Sonnet",
"video_2_key": "Llama v3 Quality",
"created_at": "June 10, 2025, 4:28 PM",
"status": "completed"
}
Processing Time: 42ms
Status: 200 OK
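A record in this shape can be parsed and sanity-checked before it enters aggregation. The field names below are taken from the sample record; the validation rules themselves are illustrative, not from the actual pipeline:

```python
import json

# Fields present in the sample annotation record.
REQUIRED_KEYS = {"project_title", "better_video_id", "prompt_text",
                 "video_1_key", "video_2_key", "created_at", "status"}

def validate_record(raw):
    """Parse one annotation record and check its expected fields."""
    record = json.loads(raw)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # The selected winner must be one of the two compared candidates.
    if record["better_video_id"] not in (record["video_1_key"],
                                         record["video_2_key"]):
        raise ValueError("better_video_id must match one of the video keys")
    return record
```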

Ready to build your next dataset?

Tell us about your project and we will scope a plan within 48 hours.