Cloudflare hopes to offer most affordable solution for running inference with general availability of Workers AI


Cloudflare has announced that Workers AI is now generally available. Workers AI is a solution that allows developers to run machine learning models on the Cloudflare network.

The company says its goal is for Workers AI to be the most affordable solution for running inference. To make that happen, it has made several optimizations since the beta, including a 7x price reduction for running Llama 2 models and a 14x price reduction for running Mistral 7B models.

“The recent generative AI boom has companies across industries investing massive amounts of time and money into AI. Some of it will work, but the real challenge of AI is that the demo is easy, but putting it into production is incredibly hard,” said Matthew Prince, CEO and co-founder, Cloudflare. “We can solve this by abstracting away the cost and complexity of building AI-powered apps. Workers AI is one of the most affordable and accessible solutions to run inference.”


Cloudflare also made improvements to load balancing, so requests are now routed to more cities, and each city is aware of the total capacity available. This means that if a request would otherwise have to wait in a queue, it can be routed to another city instead. The company currently has GPUs for running inference in over 150 cities around the world and plans to add more in the coming months.

Cloudflare also increased the rate limits for all models. Most LLMs now have a limit of 300 requests per minute, up from just 50 per minute during the beta. Smaller models may have limits between 1,500 and 3,000 requests per minute.
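Applications that send many requests will want to stay under whichever per-minute cap applies to their model. A minimal client-side token-bucket throttle is one way to do that; the sketch below is illustrative only and is not part of any Cloudflare SDK, and the 300 requests-per-minute default simply mirrors the LLM limit mentioned above.

```python
import time


class RateLimiter:
    """Token-bucket throttle to keep calls under a requests-per-minute cap."""

    def __init__(self, requests_per_minute: int = 300):
        self.capacity = requests_per_minute
        self.tokens = float(requests_per_minute)
        self.refill_rate = requests_per_minute / 60.0  # tokens added per second
        self.last = time.monotonic()

    def acquire(self) -> float:
        """Consume one token, sleeping if the bucket is empty; returns seconds waited."""
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        waited = 0.0
        if self.tokens < 1:
            waited = (1 - self.tokens) / self.refill_rate
            time.sleep(waited)
            self.tokens = 1.0
        self.tokens -= 1
        return waited


limiter = RateLimiter(requests_per_minute=300)
limiter.acquire()  # call before each inference request
```

With a full bucket, `acquire()` returns immediately; once the burst allowance is spent, callers are paced to roughly one request every 200 ms at the 300/min setting.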

The company also reworked the Workers AI dashboard and AI playground. The dashboard now shows analytics for usage across models and the AI playground allows developers to test and compare different models as well as configure prompts and parameters, Cloudflare explained. 

Cloudflare and Hugging Face also expanded their partnership, and customers will be able to run models that are available on Hugging Face directly from within Workers AI. The company currently offers 14 models from Hugging Face, and as part of the GA release, it added four new models: Mistral 7B v0.2, Nous Research’s Hermes 2 Pro, Google’s Gemma 7B, and Starling-LM-7B-beta.

“We are excited to work with Cloudflare to make AI more accessible to developers,” said Julien Chaumond, co-founder and CTO, Hugging Face. “Offering the most popular open models with a serverless API, powered by a global fleet of GPUs is an amazing proposition for the Hugging Face community, and I can’t wait to see what they build with it.”

Another new addition is Bring Your Own LoRAs, which allows developers to take a model and adapt only some of the model parameters, rather than all of them. According to Cloudflare, this feature will enable developers to get fine-tuned model outputs without having to go through the process of actually fine-tuning a model. 
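Conceptually, a LoRA (low-rank adaptation) keeps the base weight matrix frozen and learns only a small low-rank update, so the effective weights become W + B·A, where the rank r is far smaller than the matrix dimensions. The NumPy sketch below illustrates why this trains only a tiny fraction of the parameters; it is a conceptual illustration only, not how Workers AI loads adapter files, and the dimensions chosen are arbitrary.

```python
import numpy as np

d, r = 1024, 8  # model dimension and adapter rank (r << d)

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen base weights (not trained)
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # zero-initialized, so the adapter starts as a no-op


def forward(x):
    # Base output plus the low-rank adaptation term (B @ A).
    return x @ W.T + x @ (B @ A).T


base_params = W.size
adapter_params = A.size + B.size
print(f"adapter trains {adapter_params / base_params:.2%} of the base parameter count")
# → adapter trains 1.56% of the base parameter count
```

Because only A and B are updated during fine-tuning, the adapter here holds about 1.6% as many parameters as the frozen base matrix, which is why swapping LoRAs is much cheaper than fine-tuning a whole model.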

 


