Redefining AI Energy Measurement: A New Framework for Agentic Systems
In brief
- Researchers have introduced A-LEMS, a framework that changes how the energy consumption of AI systems is measured.
- Current methods typically assess energy use per model invocation or training run, but this approach falls short for agentic systems, which handle multi-step tasks involving tool calls, retries, and recovery from failures.
- The new system, A-LEMS, shifts the focus to "Energy per Goal" (EpG), which calculates the total energy consumed across all attempts to complete a task, including failures, divided by the number of successful goals.
- Because failed attempts and retries count toward the total, EpG gives a more accurate picture of energy use in complex, real-world scenarios than per-invocation figures do.
- The study reveals that agentic workflows consume significantly more energy than linear approaches, 4.33 times higher on average.
- However, for tasks that require tool usage, agentic systems can actually be more efficient, consuming less energy per goal.
- This finding underscores the importance of understanding how orchestration structure, not computing power alone, influences energy consumption.
- The framework also introduces the Orchestration Overhead Index (OOI), which helps isolate the energy costs associated with orchestration compared to linear execution under the same conditions.
- This new approach to measuring AI energy use is a critical step toward more accurate benchmarking and optimizing efficiency in agentic systems.
- As AI becomes more integrated into complex, real-world applications, tools like A-LEMS will play a vital role in assessing and improving energy performance.
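The Energy per Goal definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the A-LEMS implementation: the `Attempt` record, the joule unit, and the choice to raise an error when no goal succeeds are all hypothetical and do not come from the brief.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Attempt:
    """One attempt at a goal (hypothetical record, not from the paper)."""
    energy_joules: float  # measured energy for this attempt; unit is an assumption
    succeeded: bool       # whether the attempt reached the goal


def energy_per_goal(attempts: Iterable[Attempt]) -> float:
    """Total energy across all attempts, failures included,
    divided by the number of successful goals."""
    attempts = list(attempts)
    total_energy = sum(a.energy_joules for a in attempts)
    successes = sum(1 for a in attempts if a.succeeded)
    if successes == 0:
        # Assumption: EpG is treated as undefined when nothing succeeds.
        raise ValueError("EpG is undefined with zero successful goals")
    return total_energy / successes


if __name__ == "__main__":
    # Made-up numbers: one failed attempt and two successful ones.
    runs = [Attempt(10.0, False), Attempt(20.0, True), Attempt(30.0, True)]
    print(energy_per_goal(runs))  # (10 + 20 + 30) / 2 = 30.0
```

The failed attempt still contributes its 10 J to the numerator, which is the point of the metric: energy spent on retries is charged to the goals that eventually succeed.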
Terms in this brief
- A-LEMS
- A framework for measuring energy consumption in AI systems, particularly agentic ones that handle multi-step tasks. Its central metric, Energy per Goal (EpG), divides the total energy consumed across all attempts, including failures, by the number of successful goals, providing a more accurate picture of real-world energy efficiency.
- Orchestration Overhead Index (OOI)
- A metric introduced by A-LEMS to isolate and measure the energy costs associated with the orchestration of AI tasks compared to linear execution under the same conditions. It helps understand how orchestration structures influence energy consumption.
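The brief does not give the formula for OOI, so the following Python sketch only illustrates the general idea of comparing orchestrated and linear execution under the same conditions. It assumes a simple ratio of the two energy measurements; the function name, the joule unit, and the ratio form itself are hypothetical and may differ from the definition in the paper.

```python
def orchestration_overhead_ratio(agentic_joules: float, linear_joules: float) -> float:
    """Ratio of agentic to linear energy for the same task and conditions.

    Assumption: a plain ratio stands in for OOI here; the actual A-LEMS
    definition is not stated in the brief.
    """
    if linear_joules <= 0:
        raise ValueError("linear baseline energy must be positive")
    return agentic_joules / linear_joules


if __name__ == "__main__":
    # Made-up measurements for one task run both ways on the same hardware.
    print(orchestration_overhead_ratio(150.0, 100.0))  # 1.5
```

Under this assumed form, a value above 1 would mean orchestration costs more energy than the linear baseline, and a value below 1 would mean it costs less.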
Read full story at arXiv CS.AI →
More briefs
AI Finds Folic Acid May Help Heal Diabetic Wounds
Scientists used artificial intelligence to find new uses for old drugs. They looked at 3000 existing drugs to see if any could help heal diabetic wounds. This matters because diabetic wounds are hard to heal. Many biological processes are disrupted at the same time. The scientists used AI to scan scientific literature and identify which drugs may help. They found that folic acid, a common vitamin, is a top candidate to help heal diabetic wounds. The team will now test folic acid in more experiments to see if it really works.
Scientist Runs Groundbreaking Computer Simulation
Mary Tsingou wrote code for a first-of-its-kind numerical experiment in 1955. She used one of the world's first scientific computers to run the experiment. The results of the experiment showed that nonlinear systems behave in surprisingly stable and structured ways. This discovery reshaped how scientists think about systems like the atmosphere and the human heart. The experiment involved a one-dimensional line of masses connected by springs with a small nonlinear change to the spring force. The simulation took years to run and showed that energy flowed back into its original mode. This discovery was a watershed moment and changed how scientists think about nonlinear systems. The results will continue to influence scientific research.
AI's Personality Test Fails When Put to Work
A new study reveals that AI models trained to mimic specific personalities in chat conversations struggle when given real-world tasks. Researchers tested three major AI model families (Llama, Qwen, and Gemma) trained with persona-based supervised fine-tuning (SFT). These models were scored using a classifier designed to identify their personas, achieving high accuracy (86-95%) in controlled chat settings. However, the same models performed poorly when asked to act autonomously, for example when composing emails or making decisions. The classifier's accuracy dropped sharply to 29-55%, showing that AI personalities don't translate well beyond structured chat interactions. This suggests that SFT, a common training method for character-driven AI, may not prepare models for practical, agent-like tasks. The findings highlight the limitations of current personality-training techniques and emphasize the need for more generalized alignment methods. As AI becomes more integrated into daily life, understanding how these systems behave outside of controlled chats will be crucial for developers aiming to create reliable and versatile AI assistants.
AI Models Often Give Right Answers but Point to Wrong Sources
Leading AI models like GPT and Gemini have been found to cite incorrect text passages in their analyses, even when their answers are correct. This issue, called "attribution hallucination," poses risks in fields like law and medicine where accuracy is crucial. Researchers at Peking University developed the CiteVQA benchmark to systematically test for this problem. This discovery highlights a significant flaw in AI systems that could impact reliability in regulated industries. If an AI provides accurate advice but cites wrong sources, it may lead to serious consequences in areas like legal decisions or medical diagnoses. The CiteVQA benchmark aims to identify and address these issues, ensuring AI models provide trustworthy evidence alongside their answers. Looking ahead, researchers hope this new tool will help improve the accuracy of AI systems by pinpointing where they go wrong in attribution. As AI becomes more integrated into critical decision-making processes, tools like CiteVQA will be essential for maintaining trust and reliability in their outputs.
Researchers Find AI Coding Agents Struggle with Complex Constraints
New research shows that AI coding agents perform poorly when generating code with strict structural constraints. The agents were tested on 80 tasks across eight web frameworks, and their performance declined significantly as the number of constraints increased. Capable agents lost 30 points on average in assertion pass rates from simple to complex tasks. The study highlights a key challenge for AI coding agents: satisfying both functional and structural requirements. Future work will need to address it.