Hugging Face StarCoder: A State-of-the-Art LLM for Code

Posted in Recipe on May 5, 2023 by Venkatesh S ‐ 2 min read

HuggingFace AI ML Beginner Intermediate 10 Minutes

The race among AI-based LLMs has been remarkable. While OpenAI keeps updating and upgrading ChatGPT and adding new models, an open source race has begun to challenge the commercial, closed-source mindset. One such effort comes from Hugging Face.

For those who do not know it, Hugging Face is an AI community building the future. They build, train, and deploy state-of-the-art models powered by the reference open source in machine learning. More than 5,000 organizations currently use Hugging Face, and thousands of creators work as a community to solve audio, vision, and language problems with AI. Refer to their website for more details on Hugging Face.

I have recently been exploring these LLMs (ChatGPT, StarCoder) to check how effective they can be as coding assistants, and they never fail to amaze me. While ChatGPT continues to provide consistent coding assistance, there is a new kid on the block armed with enough ammunition to challenge ChatGPT: StarCoder.

Today’s blog is a short one that introduces a new open source LLM, StarCoder.

What is StarCoder?

StarCoder and StarCoderBase are Large Language Models for Code (Code LLMs) trained on permissively licensed data from GitHub, including 80+ programming languages, Git commits, GitHub issues, and Jupyter notebooks. Similar to LLaMA, a ~15B parameter model was trained for 1 trillion tokens. Hugging Face then fine-tuned the StarCoderBase model on 35B Python tokens, resulting in a new model called StarCoder.

Some cool facts about this model:

StarCoderBase outperforms existing open Code LLMs on popular programming benchmarks and matches or surpasses closed models such as code-cushman-001 from OpenAI (the original Codex model that powered early versions of GitHub Copilot).
With a context length of over 8,000 tokens, the StarCoder models can process more input than any other open LLM, enabling a wide range of interesting applications.
The model can act as a technical assistant: it can autocomplete code, modify code via instructions, and explain a code snippet in natural language.
Above all, it is open source.
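
To make the autocomplete and context-length points concrete, here is a minimal Python sketch of how one might ask the model to complete a piece of code through the transformers library. Several things here are assumptions and not taken from this post: the Hub checkpoint id "bigcode/starcoder", the exact window size of 8,192 tokens (the post only says "over 8,000"), and the helper names fit_to_context and complete_code, which are hypothetical. Treat it as an illustration, not the official way to run the model.

```python
# Assumptions (not from the original post): the checkpoint id, the exact
# window size, and both helper names are illustrative.
MODEL_ID = "bigcode/starcoder"  # assumed Hub id; access may require accepting its license
CONTEXT_TOKENS = 8192           # assumed exact window; the post only says "over 8,000"


def fit_to_context(token_ids, max_new_tokens, context_tokens=CONTEXT_TOKENS):
    """Keep the most recent prompt tokens so prompt + completion fit the window."""
    if max_new_tokens >= context_tokens:
        raise ValueError("max_new_tokens must be smaller than the context window")
    budget = context_tokens - max_new_tokens  # tokens left for the prompt
    return list(token_ids)[-budget:]


def complete_code(prompt, max_new_tokens=64):
    """Autocomplete `prompt` with a causal language model loaded via transformers."""
    # Imported lazily so fit_to_context works without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Tokenize, then drop the oldest tokens if the prompt is too long.
    ids = fit_to_context(tokenizer(prompt)["input_ids"], max_new_tokens)
    output = model.generate(torch.tensor([ids]), max_new_tokens=max_new_tokens)

    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output[0][len(ids):], skip_special_tokens=True)
```

Calling complete_code("def fibonacci(n):") would return the model's continuation of that function. Keep in mind that this downloads the weights of a ~15B parameter model, so it needs a machine with plenty of memory; the trimming helper on its own runs anywhere.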