Scientists have designed a new way of measuring how capable artificial intelligence (AI) systems are: how fast they can beat, or compete with, humans in challenging tasks.
Although AIs can generally outperform humans in text prediction and knowledge tasks, they are less effective when given more substantial projects to carry out, such as acting as a remote assistant.
To quantify these performance gains in AI models, a new study has proposed measuring AIs by the duration of the tasks they can complete, compared with how long those tasks take humans. The researchers published their findings on March 30 on the preprint database arXiv, so they have not yet been peer-reviewed.
“We find that measuring the length of tasks that models can complete is a helpful lens for understanding current AI capabilities,” researchers from the AI organization Model Evaluation & Threat Research (METR) explained in a blog post accompanying the study.
The researchers found that AI models completed tasks that would take humans less than four minutes with a success rate of almost 100%. However, this dropped to 10% for tasks taking more than four hours. Older AI models performed worse on longer tasks than the latest systems.
This was to be expected, with the study highlighting that the length of tasks generalist AIs can complete with 50% reliability has been doubling roughly every seven months for the past six years.
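One way to picture the "50% reliability" task length is as the midpoint of a success curve that falls as tasks get longer. The sketch below assumes a logistic curve in the logarithm of task duration; the function names, the 60-minute horizon and the slope value are hypothetical illustrations, not figures or code from the study.

```python
import math

def success_probability(task_minutes, horizon_minutes, slope):
    """Assumed logistic curve in log2(task length).

    Returns 0.5 when task_minutes equals horizon_minutes and
    decreases as tasks get longer (for slope > 0).
    """
    x = math.log2(task_minutes) - math.log2(horizon_minutes)
    return 1.0 / (1.0 + math.exp(slope * x))

def task_length_at(reliability, horizon_minutes, slope):
    """Task length (minutes) at which the curve equals `reliability`."""
    # Invert p = 1 / (1 + exp(slope * x)) for x, then undo the log2.
    x = math.log((1.0 - reliability) / reliability) / slope
    return horizon_minutes * 2.0 ** x

if __name__ == "__main__":
    # Hypothetical model: 50% horizon of one hour, slope chosen arbitrarily.
    for minutes in (4, 60, 240):
        p = success_probability(minutes, horizon_minutes=60, slope=0.6)
        print(f"{minutes:>4} min task -> success probability {p:.2f}")
    # A stricter reliability target corresponds to a shorter task length.
    print(task_length_at(0.8, horizon_minutes=60, slope=0.6))
```

Under this assumed curve, short tasks sit near the top of the curve and long tasks near the bottom, which mirrors the pattern reported above without reproducing the study's actual numbers.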
To conduct their study, the researchers took a variety of AI models, from Sonnet 3.7 and GPT-4 to Claude 3 Opus and older GPT models, and pitted them against a series of tasks. These ranged from easy assignments that typically take humans a few minutes (such as looking up a basic factual question on Wikipedia) to ones that take human experts several hours, such as complex programming tasks like writing CUDA kernels or fixing a subtle bug in PyTorch.
Testing tools including HCAST and RE-Bench were used. The former has 189 autonomy software tasks set up to assess the capabilities of AI agents in handling tasks around machine learning, cybersecurity and software engineering, while the latter uses seven open-ended machine-learning research engineering tasks, such as optimizing a GPU kernel, benchmarked against human experts.
The researchers then rated these tasks for “messiness”, assessing whether they involved things such as the need to coordinate between several streams of work in real time, which makes a task messier to complete and therefore more representative of real-world tasks.
The researchers also developed software atomic actions (SWAA) to establish how fast people can complete the tasks. These are single-step tasks ranging from one to 30 seconds, baselined by METR employees.
In effect, the study found that the “attention span” of AI is advancing at speed. By extrapolating this trend, the researchers projected (if their results can be generally applied to real-world tasks) that AI could automate a month's worth of human software development by 2032.
To better understand the advancing capabilities of AI and its potential impact on and risks to society, this study could form a new benchmark relating to real-world outcomes to enable “a meaningful interpretation of absolute performance, not just relative performance,” the scientists said.
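The arithmetic behind this kind of extrapolation is simple exponential growth. The sketch below is a minimal illustration; the function names and every input value (starting task length, doubling time, and the 167-hour stand-in for a working month) are placeholder assumptions, not figures taken from the study, so the result should not be read as reproducing the 2032 projection.

```python
import math

def months_until(target_hours, current_hours, doubling_months):
    """Months until an exponentially growing task length reaches target_hours."""
    if target_hours <= current_hours:
        return 0.0
    # Number of doublings needed, times the length of each doubling period.
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_months

def task_length_after(months, current_hours, doubling_months):
    """Task length (hours) after `months` of steady doubling."""
    return current_hours * 2.0 ** (months / doubling_months)

if __name__ == "__main__":
    # Illustrative inputs only: a 1-hour task length today, a 7-month
    # doubling time, and roughly 167 working hours in a month.
    months = months_until(target_hours=167, current_hours=1, doubling_months=7)
    print(f"about {months / 12:.1f} years under these assumed inputs")
```

Because the growth is exponential, the answer is very sensitive to the assumed doubling time and starting point, which is why the projection above carries the caveat about generalizing to real-world tasks.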
A new frontier for assessing AI?
A potential new benchmark could allow us to better understand the actual intelligence and capabilities of AI systems.
“The metric itself is not likely to change the course of AI development, but it will track how quickly progress is being made on certain types of tasks in which AI systems will ideally be used,” Sohrob Kazerounian, a distinguished AI researcher at Vectra AI, told Live Science.
“Measuring AI against the length of time it takes a human to accomplish a given task is an interesting proxy metric for intelligence and general capabilities,” said Kazerounian. “First, because there is no singular metric that captures what we mean when we say ‘intelligence.’ Second [...],” he added.
Eleanor Watson, an IEEE member and AI ethics engineer at Singularity University, agreed that the research is useful.
Measuring AIs on the length of tasks is “valuable and intuitive” and “directly reflects real-world complexity, capturing AI's proficiency at maintaining coherent goal-directed behavior over time,” compared with traditional tests that assess AI performance on short, isolated problems, she told Live Science.
Generalist AI is coming
Arguably, beyond providing a new benchmark metric, the paper's greatest impact is in highlighting how quickly AI systems are advancing, alongside the upward trend in their ability to handle lengthy tasks. With this in mind, Watson predicts that the emergence of generalist AI agents that can handle a variety of tasks is imminent.
“By 2026, we will see AI becoming increasingly general, handling varied tasks across an entire day or week rather than short, narrowly defined assignments,” said Watson.
For businesses, Watson noted, this could yield AIs that can take on substantial portions of professional workloads, which could not only reduce costs and improve efficiency but also let people focus on more creative, strategic and interpersonal tasks.
“For consumers, AI will evolve from a simple assistant into a dependable personal manager, capable of handling complex life tasks, such as travel planning, health monitoring or managing financial portfolios, over days or weeks, with minimal oversight,” added Watson.
In effect, the ability of AIs to handle a broad range of lengthy tasks could have a significant impact on how society interacts with and uses AI in the coming years.
“While specialized AI tools will persist in niche applications for efficiency reasons, powerful generalist AI agents, capable of flexibly switching between diverse tasks, will emerge prominently,” concluded Watson. “These systems will integrate specialized skills into broader, goal-directed workflows, reshaping daily life and professional practices in fundamental ways.”