Tencent improves testing creative AI models with new benchmark - Printable Version
Tencent improves testing creative AI models with new benchmark - Emmettarout - 08-07-2025

Getting it right, like a human would

So, how does Tencent's AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.

Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a secure, sandboxed environment.

To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.

Finally, it hands all of this evidence – the original request, the AI's code, and the screenshots – to a Multimodal LLM (MLLM) to act as a judge.

This MLLM judge doesn't just give a vague opinion; instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring covers functionality, user experience, and even aesthetic quality. This keeps the scoring fair, consistent, and thorough.

The big question is: does this automated judge actually have good taste? The results suggest it does. When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with 94.4% consistency. That is a big jump over older automated benchmarks, which only managed around 69.4% consistency. On top of this, the framework's judgments showed over 90% agreement with professional human developers.

https://www.artificialintelligence-news.com/
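For anyone who wants a mental model of the pipeline described above (creative task → sandboxed build/run → screenshot capture → MLLM checklist scoring), here is a minimal Python sketch. Every function and metric name in it is a hypothetical placeholder invented for illustration; it is a sketch of the workflow as reported, not Tencent's actual ArtifactsBench code or API.

Code:
# Minimal, illustrative sketch of an ArtifactsBench-style evaluation loop.
# All names are hypothetical placeholders -- not the real implementation.

from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    task_prompt: str           # the original creative task (e.g. a mini-game spec)
    generated_code: str        # the code produced by the model under test
    screenshots: list[bytes]   # frames captured while the artifact runs
    scores: dict[str, float]   # per-metric scores from the MLLM judge


def run_in_sandbox(code: str):
    """Placeholder: build and run the generated code in an isolated environment."""
    return object()  # stand-in for a handle to the running application


def capture_screenshots(app, frames: int = 5) -> list[bytes]:
    """Placeholder: grab a time series of screenshots so animations and state
    changes (e.g. after a simulated button click) become visible to the judge."""
    return [b"" for _ in range(frames)]


def mllm_judge(prompt: str, code: str, shots: list[bytes]) -> dict[str, float]:
    """Placeholder: ask a multimodal LLM to score the artifact against a
    per-task checklist rather than give a single vague opinion. Only three of
    the ten metrics are named in the article; the rest are omitted here."""
    return {"functionality": 0.0, "user_experience": 0.0, "aesthetic_quality": 0.0}


def evaluate(task_prompt: str, generated_code: str) -> TaskResult:
    app = run_in_sandbox(generated_code)                      # build and run in a sandbox
    shots = capture_screenshots(app)                          # collect dynamic evidence
    scores = mllm_judge(task_prompt, generated_code, shots)   # checklist-based scoring
    return TaskResult(task_prompt, generated_code, shots, scores)


def overall_score(result: TaskResult) -> float:
    """Aggregate per-metric scores; a simple mean is used purely for illustration."""
    return mean(result.scores.values())

The only point the sketch tries to capture is that the judge sees three kinds of evidence (the prompt, the code, and the screenshots) and returns per-metric scores, which is what makes the resulting rankings comparable against human votes on platforms like WebDev Arena.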