Benchmarking Prompt Enhancement Quality Across Leading LLMs
Challenge: Prompt enhancement — rewriting user prompts to improve downstream model output — directly shapes generation quality, cost, and user trust, yet no standard benchmark exists for comparing enhancement approaches.
Solution: The evaluation was structured in two phases, one per modality (text and video), with a shared methodology for aggregating human judgments into statistically defensible recommendations.
Result: The benchmark replaced subjective internal debate with a quantitative production recommendation.
Prompt enhancement — rewriting user prompts to improve downstream model output — directly shapes generation quality, cost, and user trust, yet no standard benchmark exists for comparing enhancement approaches. The client was evaluating multiple LLM-based enhancement solutions and could not determine which produced the best downstream results. Internal A/B tests on small prompt sets produced inconsistent conclusions depending on prompt selection, evaluator, and modality. For video generation, where each iteration cycle is expensive (minutes of GPU time per render), deploying a suboptimal enhancement solution compounds cost across every user request. They needed a statistically rigorous comparison framework, not another subjective review.
The evaluation was structured in two phases, one per modality, with a shared methodology for aggregating human judgments into statistically defensible recommendations.
The text phase used side-by-side comparisons: for each of 180 prompts, all candidate enhancement solutions produced an enhanced version, and 3 independent annotators selected the best one. Annotators evaluated holistic quality across three dimensions — intent preservation (does the enhanced prompt maintain the user's original goal?), structural clarity (is the enhanced prompt well-organized and unambiguous?), and effectiveness (would the enhanced prompt produce better downstream output?). Majority vote determined the winner per prompt; prompts with no majority were flagged as ties.
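The per-prompt aggregation described above can be sketched in a few lines. This is an illustrative implementation, not the client's actual pipeline; the vote format (one solution name per annotator) is an assumption.

```python
from collections import Counter

def majority_winner(votes):
    """Return the majority-vote winner among annotator picks, or None for a tie.

    votes: list of solution names, one entry per annotator
    (hypothetical format; with 3 annotators a 2-of-3 pick wins).
    """
    top, top_count = Counter(votes).most_common(1)[0]
    # Strict majority: more than half the annotators must agree,
    # otherwise the prompt is flagged as a tie.
    if top_count > len(votes) / 2:
        return top
    return None

print(majority_winner(["A", "A", "B"]))  # A
print(majority_winner(["A", "B", "C"]))  # None (tie)
```

With three annotators, a strict majority is exactly the 2-of-3 rule described above; the same function generalizes if the annotator pool grows.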
The video phase evaluated enhancement quality indirectly through downstream output. Enhanced prompts were fed to the client's video generation model, and annotators compared the resulting videos in pairwise format. This design is critical: a prompt enhancement that reads well as text but produces worse video is a net negative. By evaluating the enhancement's effect on the final output, the benchmark measured what actually matters for production deployment.
Results were aggregated across annotators using weighted voting (annotators with higher calibration scores received proportionally more weight) and tested for statistical significance using a paired permutation test. Two models emerged as clear leaders with p < 0.01 separation from the remaining candidates.
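A minimal sketch of the aggregation and significance test described above, assuming a simplified setup: each annotator casts a +1/-1 vote between two solutions, votes are weighted by a calibration score, and per-prompt margins feed a sign-flip (paired permutation) test. Function names and the vote encoding are hypothetical.

```python
import random

def weighted_margin(votes, weights):
    """Calibration-weighted per-prompt margin for solution A over B.

    votes: list of +1 (annotator picked A) or -1 (picked B).
    weights: calibration score per annotator; higher scores count more.
    """
    return sum(w * v for w, v in zip(weights, votes))

def paired_permutation_test(diffs, n_perm=10_000, seed=0):
    """Two-sided paired permutation test on per-prompt margins.

    Under the null hypothesis the two solutions are interchangeable,
    so each prompt's margin is equally likely to have either sign.
    Returns an approximate p-value.
    """
    rng = random.Random(seed)
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_perm):
        # Randomly flip the sign of each prompt's margin.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            extreme += 1
    return extreme / n_perm
```

For example, twenty prompts that all favor one solution yield a p-value far below 0.01, while margins that roughly cancel out do not reach significance, which is the separation criterion the benchmark used.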
The benchmark replaced subjective internal debate with a quantitative production recommendation. The client deployed the top-ranked enhancement model for their video generation workflow, where the cost impact is highest. The evaluation framework was retained as organizational infrastructure — each new candidate enhancement solution can now be benchmarked against the existing leader using a subset of the original prompt set, reducing evaluation cost for future decisions by an estimated 60%.
Representative record from the annotation pipeline.
Camera slowly pans right across the landscape, revealing more of the mountain range
{
  "project_title": "LLM Prompt Enhancer",
  "better_video_id": "Claude 3.7 Sonnet",
  "prompt_text": "Camera slowly pans right across the landscape...",
  "video_1_key": "Claude 3.7 Sonnet",
  "video_2_key": "Llama v3 Quality",
  "created_at": "June 10, 2025, 4:28 PM",
  "status": "completed"
}
Ready to build your next dataset?
Tell us about your project and we will scope a plan within 48 hours.