Corpora in many languages for testing, evaluating, benchmarking, and training Unicode algorithms.
The main content of this repository is a collection of books from Project Gutenberg, machine-translated by Google. More than 200 languages are represented, with output available as HTML, txt, and segmented text.
See the gutenberg directory.
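To get a quick overview of a local checkout, the following is a minimal sketch that counts files by extension under a directory. It assumes the repository has been cloned and that `gutenberg` is reachable as a relative path. The function name `count_by_extension` is hypothetical, and the actual directory layout and file extensions are not described here, so treat the output as exploratory.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(root):
    """Count regular files under root, keyed by lowercase file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts


if __name__ == "__main__":
    # Assumption: the script runs from the top of a local checkout.
    root = Path("gutenberg")
    if root.is_dir():
        for ext, n in sorted(count_by_extension(root).items()):
            print(f"{ext or '(no extension)'}: {n}")
```

Extensions are lowercased so that, for example, `.TXT` and `.txt` are counted together.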
Copyright © 2023-2024 Unicode, Inc. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.
A CLA is required to contribute to this project. Please refer to the CONTRIBUTING.md file (or start a Pull Request) for more information.
The contents of this repository are governed by the Unicode Terms of Use and are released under LICENSE.