OpenAI says it is examining evidence that the Chinese start-up DeepSeek violated its terms of service by harvesting large amounts of data from its AI technologies.
The San Francisco-based start-up, now valued at $157 billion, said DeepSeek may have used data generated by OpenAI's technologies to teach similar skills to its own systems.
This process, called distillation, is common across the AI field. But OpenAI's terms of service say the company does not allow anyone to use data generated by its systems to build technologies that compete in the same market.
“We know that P.R.C.-based groups are actively working to use methods, including what is known as distillation, to replicate advanced U.S. AI models,” said Liz Bourgeois, an OpenAI spokeswoman, in a statement emailed to The New York Times, referring to the People's Republic of China.
“We are aware of and reviewing indications that DeepSeek may have inappropriately distilled our models, and will share information as we know more,” she said. “We take aggressive, proactive countermeasures to protect our technology and will continue working closely with the U.S. government to protect the most capable models being built here.”
DeepSeek did not immediately respond to a request for comment.
DeepSeek spooked Silicon Valley's tech companies and sent U.S. financial markets into a tailspin earlier this week after releasing AI technologies that matched the performance of anything else on the market.
The conventional wisdom had been that the most powerful systems could not be built without billions of dollars in specialized computer chips, but DeepSeek said it had created its technologies using far fewer resources.
Like every other AI company, DeepSeek built its technologies using computer code and data culled from across the internet. AI companies lean heavily on a practice called open sourcing, freely sharing the code underlying their technologies and reusing code shared by others. They see this as a way of accelerating technological development.
They also need enormous amounts of online data to train their AI systems. These systems learn their skills by identifying patterns in text, computer programs, images, sounds and videos. The leading systems acquire their skills by analyzing just about all the text on the internet.
Distillation is often used to train new systems. If a company takes the data from proprietary technology, the practice may be legally problematic. But it is often allowed by open source technologies.
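For readers unfamiliar with the technique: in its classic form, distillation trains a smaller “student” model to imitate the output probabilities of a larger “teacher” model, rather than learning only from raw labeled data. Below is a minimal, self-contained sketch of the core loss computation; the function names, the temperature value and the use of NumPy are illustrative assumptions, not a description of any company's actual pipeline.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert raw model scores (logits) into a probability distribution.

    A temperature above 1.0 'softens' the distribution, exposing the
    teacher's relative preferences among wrong answers -- the signal
    distillation exploits.
    """
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions.

    The student is trained to minimize this quantity, pulling its
    predictions toward the teacher's. Returns 0.0 when the two
    distributions match exactly, and a positive value otherwise.
    """
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

In practice the teacher's "soft targets" can come either from direct access to a model's internals or, as alleged here, from large volumes of a model's generated outputs collected through an API.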
OpenAI is now facing more than a dozen lawsuits accusing it of illegally using copyrighted internet data to train its systems. That includes a suit brought by The New York Times against OpenAI and its partner, Microsoft.
The suit contends that millions of articles published by The Times were used to train automated chatbots that now compete with the news outlet as a source of reliable information. OpenAI and Microsoft have denied the claims.
A Times report also showed that OpenAI used speech recognition technology to transcribe the audio from YouTube videos, yielding new conversational text that would make an AI system smarter. Some OpenAI employees discussed how such a move might go against YouTube's rules, according to three people with knowledge of the conversations.
An OpenAI team, including the company's president, Greg Brockman, transcribed more than one million hours of YouTube videos, the people said. The texts were then fed into a system called GPT-4, which was widely considered one of the world's most powerful AI models and was the basis of the latest version of the ChatGPT chatbot.