Olga Megorskaya is Founder & CEO of Toloka AI, a high-quality data partner for all stages of AI development.
More than two years after the release of ChatGPT, large language models (LLMs) are now becoming the foundation for agentic AI—autonomous systems that interact with tools in their environment to complete multi-step tasks for the user.
While LLMs like OpenAI’s GPT-4 and Meta’s Llama-3, as well as newer reasoning models such as o1 and DeepSeek-R1, are pushing the boundaries of what these systems can achieve, they still face significant challenges in handling specialized areas of knowledge. A recent study by the University of Massachusetts Amherst analyzed medical summaries generated by leading LLMs, including OpenAI’s GPT-4 and Meta’s Llama-3. The study identified widespread issues in nearly every response, such as inconsistencies in medical events, flawed reasoning and chronological errors.
These challenges are not limited to medicine. While these models show impressive capabilities in general knowledge tasks, they struggle with complex, domain-specific knowledge. If you are developing or applying LLMs or AI agents in your business, it’s essential to understand these limitations.
Benchmarking plays a critical role in evaluating the strengths and weaknesses of LLMs across various use cases. Well-designed benchmarks offer developers an efficient and cost-effective way to track the progress of their models in specific areas. While there has been significant progress in developing benchmarks that test general LLM capabilities, gaps remain in specialized areas that require in-depth knowledge and robust evaluation methods, such as accounting, finance, medicine, law, physics, natural sciences and software development, to name a few.
Even university-level mathematics, an area that most LLMs are expected to handle, is not assessed well by general-purpose benchmarks. Existing math benchmarks focus either on simple problems or on highly challenging, Olympiad-level tasks, but do not address the applied mathematics relevant to university studies.
To bridge the gap, my team developed U-MATH, a comprehensive benchmark for university-level mathematics. We also tested the performance of leading LLMs, including o1 and R1. The results were insightful: Reasoning systems are in a class of their own. OpenAI o1 leads with 77.2% of tasks solved, and DeepSeek R1 follows with 73.7%. Notably, R1 trails o1 on U-MATH, a reversal of R1’s lead on other math benchmarks such as AIME and MATH-500. Other top models show a significant performance gap, with Gemini 1.5 Pro solving 60% of the tasks and GPT-4 getting 43% right. Interestingly, a smaller, math-specialized model from the Qwen 2.5 Math family also showed competitive results.
These findings have practical implications for decision-making. Thanks to domain-specific benchmarks, engineers looking for LLM-based solutions can understand how different models perform in their specific context. For niche domains that lack reliable benchmarks, development teams can perform their own evaluations or ask a data partner to develop a custom benchmark that they can use to compare their model to other LLMs and continually evaluate new model versions after fine-tuning iterations.
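To make this concrete, here is a minimal sketch of what such a custom evaluation harness might look like, assuming you supply your own expert-reviewed task set and a function that calls your model. The names and the exact-match grading below are illustrative simplifications, not a reference implementation.

```python
# Minimal sketch of a custom benchmark harness. You supply the task set and a
# model-calling function; everything here is illustrative, not a real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str     # domain-specific question, e.g., an accounting scenario
    reference: str  # expected answer, reviewed by a domain expert

def evaluate(tasks: list[Task], ask_model: Callable[[str], str]) -> float:
    """Return the fraction of tasks answered correctly.

    Exact matching is used here for simplicity; real benchmarks typically
    rely on expert review or LLM-as-judge grading.
    """
    solved = 0
    for task in tasks:
        answer = ask_model(task.prompt)
        if answer.strip().lower() == task.reference.strip().lower():
            solved += 1
    return solved / len(tasks)

# Example: compare two model versions on the same task set.
tasks = [Task("What is 2 + 2?", "4"), Task("Derivative of x^2?", "2x")]
baseline_score = evaluate(tasks, lambda p: "4")   # stand-in for model v1
tuned_score = evaluate(tasks, lambda p: "2x")     # stand-in for model v2 after fine-tuning
print(f"baseline: {baseline_score:.0%}, fine-tuned: {tuned_score:.0%}")
```

Even a simple harness like this makes it easy to re-run the same task set after every fine-tuning iteration and track whether domain performance is actually improving.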
New benchmarks such as AILuminate, a tool designed to assess the safety risks of general-purpose LLMs, are making safety evaluation more accessible. AILuminate evaluates a model’s likelihood of endorsing harmful behaviors across 12 categories, ranging from violent crimes to privacy issues, and assigns each category a 5-point score from “Poor” to “Excellent.” These results help decision-makers compare models and better understand their relative safety risks.
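As a rough illustration of how category-level safety grading works, the toy example below maps per-category unsafe-response rates to a five-point grade. The thresholds, category names and rates are made up for illustration and are not AILuminate’s actual methodology.

```python
# Illustrative only: a toy mapping from per-category unsafe-response rates to a
# five-point grade. Not AILuminate's actual scoring methodology.
GRADES = ["Excellent", "Very Good", "Good", "Fair", "Poor"]
THRESHOLDS = [0.001, 0.005, 0.02, 0.05]  # hypothetical cut-offs

def grade(unsafe_rate: float) -> str:
    """Map the share of unsafe responses in a category to a coarse grade."""
    for label, cutoff in zip(GRADES, THRESHOLDS):
        if unsafe_rate <= cutoff:
            return label
    return GRADES[-1]  # anything above the last cut-off is "Poor"

# Hypothetical per-category results for one model.
results = {"violent_crimes": 0.0004, "privacy": 0.03, "hate": 0.008}
report = {category: grade(rate) for category, rate in results.items()}
print(report)  # {'violent_crimes': 'Excellent', 'privacy': 'Fair', 'hate': 'Good'}
```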
AILuminate is currently one of the most comprehensive general-purpose safety benchmarks available, but it doesn’t cover risks tied to specific domains or industries. Companies developing AI solutions are increasingly seeking external expertise in safety evaluation. As CEO of Toloka, a GenAI data partner, I am seeing more companies request targeted safety evaluations that provide a deeper understanding of how LLMs perform in specialized contexts, ensuring they meet the unique safety needs of particular audiences and use cases.
With the growth of AI agents likely to continue in 2025, specialized benchmarks will follow. AI agents are autonomous systems capable of interpreting their surroundings, making informed decisions and executing actions. For instance, a virtual assistant on a smartphone can process voice commands, answer queries and perform tasks like scheduling reminders or sending messages and emails.
Benchmarks for AI agents must measure how well LLMs operate in practical, real-world scenarios aligned with the agent’s intended domain and application. If you’re building an HR assistant, you’d prioritize different performance criteria than you would for a healthcare agent diagnosing medical conditions, based on the associated risks.
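One way to picture this is a toy scoring scheme that weights the same evaluation criteria differently depending on the agent’s domain. The criteria and weights below are hypothetical and not drawn from any published benchmark; they simply show how risk-driven priorities change the final score.

```python
# Toy illustration: weighting evaluation criteria differently per domain.
# Criteria names and weights are hypothetical, not from any published benchmark.
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-criterion scores (0-1) using domain-specific weights."""
    total_weight = sum(weights.values())
    return sum(scores[criterion] * w for criterion, w in weights.items()) / total_weight

# Hypothetical results for one agent on three criteria.
agent_scores = {"task_completion": 0.9, "factual_accuracy": 0.7, "privacy_handling": 0.95}

# An HR assistant might prioritize task completion; a medical agent, factual accuracy.
hr_weights = {"task_completion": 0.5, "factual_accuracy": 0.2, "privacy_handling": 0.3}
medical_weights = {"task_completion": 0.2, "factual_accuracy": 0.6, "privacy_handling": 0.2}

print(f"HR assistant score:  {weighted_score(agent_scores, hr_weights):.2f}")
print(f"Medical agent score: {weighted_score(agent_scores, medical_weights):.2f}")
```

The same agent scores 0.88 under the HR weighting but only 0.79 under the medical weighting, which is the point: the benchmark’s design, not just the model, determines whether a system looks fit for purpose.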
Robust benchmarking frameworks will play a crucial role by providing a faster, more scalable alternative to human evaluation, enabling decision-makers to test systems efficiently once benchmarks are in place for their specific use cases.
Benchmarking is key to understanding how well large language models perform real-world tasks. Over the last two years, benchmarking has shifted from testing general abilities to focusing on specific areas, such as niche industry knowledge, safety and agent performance.
As AI systems evolve, benchmarking must adapt to keep up. High-complexity benchmarks such as Humanity’s Last Exam and FrontierMath have garnered a lot of attention in the industry and made it clear that LLMs still can’t match human expertise on tough questions. However, these benchmarks don’t tell the full story.
Success on highly complex problems does not translate into strong performance in practical applications. The GAIA benchmark for general AI assistants shows that advanced AI systems can perform well on challenging questions yet fail at simple tasks. For real-world evaluation, benchmarks need to be carefully selected to match the context of the AI application.
At Toloka, we are committed to developing and improving benchmarks that help ensure AI systems are reliable, safe and useful across different industries.