`Llemma`: an open language model for mathematics

Repository for Llemma: an open language model for mathematics [Azerbayev et al., 2023].

This repository hosts data and training code related to the following artifacts:

| Name | HF Hub Link |
| --- | --- |
| Llemma 7b | `EleutherAI/llemma_7b` |
| Llemma 34b | `EleutherAI/llemma_34b` |
| Proof-Pile-2 | `EleutherAI/ProofPile2` |
| AlgebraicStack | `EleutherAI/AlgebraicStack` |

This repository also contains submodules related to the overlap, fine-tuning, and theorem proving experiments described in the paper. Additional evaluation code is in a fork of the Eleuther LM Evaluation Harness.

Directories

This repository contains the following directories:

proof_pile_2: scripts for downloading and preprocessing data.
gpt-neox: git submodule containing a modified branch of EleutherAI/gpt-neox.
lm-evaluation-harness: code for all evaluations, except formal2formal theorem proving.
llemma_formal2formal: git submodule containing scripts for the formal2formal experiments.
overlap: git submodule containing the overlap and memorization analysis.
finetunes: git submodule containing scripts for the fine-tuning experiments.

Because this project contains submodules, clone it with the `--recurse-submodules` flag. Alternatively, run `git submodule update --init --recursive` from within the project directory after cloning. After running `git pull`, also run `git submodule update` so that the submodules match the commits recorded by the main repository.
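The submodule workflow above can be sketched as a few small shell helpers. This is a minimal sketch, not part of the repository: the function names (`clone_with_submodules`, `init_submodules`, `pull_with_submodules`) are hypothetical, and the default `REPO_URL` is an assumption inferred from the project name `EleutherAI/math-lm`.

```shell
#!/bin/sh
# Assumed repository URL (inferred from the project name); override as needed.
REPO_URL="${REPO_URL:-https://github.com/EleutherAI/math-lm.git}"

# Fresh clone that also fetches every submodule in one step.
# $1: target directory (defaults to math-lm).
clone_with_submodules() {
    git clone --recurse-submodules "$REPO_URL" "${1:-math-lm}"
}

# For a checkout that was cloned without --recurse-submodules:
# initialize and fetch all submodules, including nested ones.
# $1: path to the existing checkout.
init_submodules() {
    git -C "$1" submodule update --init --recursive
}

# Pull upstream changes, then check out the submodule commits
# that the main repository now records.
# $1: path to the existing checkout.
pull_with_submodules() {
    git -C "$1" pull && git -C "$1" submodule update
}
```

For example, `clone_with_submodules math-lm` creates a new checkout, and `pull_with_submodules math-lm` brings an existing one up to date.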

Citation

Please cite the following:

@article{azerbayev2023llemma,
  title={Llemma: An Open Language Model For Mathematics}, 
  author={Azerbayev, Zhangir and Schoelkopf, Hailey and Paster, Keiran and Dos Santos, Marco and McAleer, Stephen and Jiang, Albert Q. and Deng, Jia and Biderman, Stella and Welleck, Sean},
  journal={arXiv preprint arXiv:2310.10631},
  year={2023}
}

You may also be interested in citing our training data, which is a mix of novel data and data from the following sources:

@article{paster2023openwebmath,
  title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
  author={Paster, Keiran and Santos, Marco Dos and Azerbayev, Zhangir and Ba, Jimmy},
  journal={arXiv preprint arXiv:2310.06786},
  year={2023}
}

@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = April,
  year = 2023,
  url = {https://github.com/togethercomputer/RedPajama-Data}
}

@article{kocetkov2022stack,
  title={The Stack: 3 TB of permissively licensed source code},
  author={Kocetkov, Denis and Li, Raymond and Allal, Loubna Ben and Li, Jia and Mou, Chenghao and Ferrandis, Carlos Mu{\~n}oz and Jernite, Yacine and Mitchell, Margaret and Hughes, Sean and Wolf, Thomas and Bahdanau, Dzmitry and von Werra, Leandro and de Vries, Harm},
  journal={arXiv preprint arXiv:2211.15533},
  year={2022}
}
