Open source code generator (an alternative to OpenAI Codex)
github.com/salesforce/CodeGen
One can also run one of these models via HuggingFace; they are based on the paper "A Conversational Paradigm for Program Synthesis", arxiv.org/abs/2203.13474
Someone has even created FauxPilot, an open-source GitHub Copilot alternative built on these models (useful for those who prefer VSCode): github.com/moyix/fauxpilot
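Running one of these checkpoints locally is straightforward; here is a minimal sketch using the HuggingFace transformers library (the choice of the 350M-mono checkpoint, the prompt, and the generation settings are just illustrative):

    from transformers import AutoTokenizer, AutoModelForCausalLM

    # 350M-mono is the smallest Python-specialized checkpoint;
    # any of the other sizes/variants can be substituted here
    tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
    model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

    prompt = 'def bubble_sort(a):\n    """sort a list"""\n'
    inputs = tokenizer(prompt, return_tensors="pt")
    # greedy decoding, capped at 64 new tokens (both are arbitrary choices)
    sample = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(sample[0], skip_special_tokens=True))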
(That's one of the approaches to "quick and dirty neural architecture search"; see also a few 2020 references on this topic in https://github.com/anhinga/2020-notes/blob/master/attention-based-models/possible-experiments.md)
https://huggingface.co/models?search=salesforce+codegen
E.g. https://huggingface.co/Salesforce/codegen-350M-mono?text=My+name+is+Thomas+and+my+main
produces:
My name is Thomas and my main class is Classname. This name is named "My Classname" and it has been created in Python (a script): "python main.py"
def main():
    myName = "Thomas"
whereas https://huggingface.co/Salesforce/codegen-350M-mono?text=def+bubble_sort%28a%29%3A%0A++++%22%22%22sort+a+list%22%22%22
produces:
def bubble_sort(a):
    """sort a list"""
    n = len(a)
    for j in range(n):
        swap = True  # swap flag
        i = 1
        while i
Some work is required to make it actually useful.
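For example, one simple cleanup step (my own sketch, not something from the CodeGen paper) is to truncate the completion once generation escapes the function body, so the rambling text after the function is discarded:

    def truncate_completion(text):
        """Keep the prompted function and drop whatever the model generates
        afterwards: cut at the first non-indented line after the first line.
        (A hypothetical helper, just to illustrate the kind of cleanup needed.)"""
        lines = text.splitlines()
        kept = lines[:1]
        for line in lines[1:]:
            if line and not line[0].isspace():
                break  # generation has left the function body
            kept.append(line)
        return "\n".join(kept)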
The checkpoints come in 3 pre-training data variants (NL, Multi, Mono) and 4 model size variants (350M, 2B, 6B, 16B).
NL is just a natural language model pretrained on the Pile. Multi is obtained by fine-tuning NL on https://console.cloud.google.com/marketplace/details/github/github-repos, a large-scale dataset of multiple programming languages from GitHub repositories (119.2B tokens, including C, C++, Go, Java, JavaScript, and Python). Mono is further fine-tuned from Multi on the BigPython dataset (71.7B tokens of Python).
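The HuggingFace checkpoint names appear to follow the pattern Salesforce/codegen-{size}-{variant}, so the full grid can be enumerated like this (assuming that naming convention holds for all twelve combinations):

    sizes = ["350M", "2B", "6B", "16B"]
    variants = ["nl", "multi", "mono"]  # NL, Multi, Mono from the paper
    checkpoints = [f"Salesforce/codegen-{s}-{v}" for s in sizes for v in variants]
    # e.g. "Salesforce/codegen-350M-mono", "Salesforce/codegen-16B-multi", ...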