google/sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

C++, Python, CMake, SWIG, Jupyter Notebook, Perl, natural-language-processing, neural-machine-translation, word-segmentation
Stars and forks statistics for the google/sentencepiece repository. As of 25 Apr 2024, this repository has 8307 stars and 1041 forks.

SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for Neural Network-based text generation systems where the vocabulary size is fixed before the neural model is trained. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing. This...
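To make the idea of learning subword units directly from raw sentences more concrete, here is a toy byte-pair-encoding sketch in plain Python. It is not SentencePiece's implementation and does not use its API: the function names (`to_symbols`, `learn_merges`, `encode`, `decode`) are hypothetical, and the rule that merges never cross a word boundary is an assumption made to keep the example small. The only detail borrowed from SentencePiece is the use of "▁" (U+2581) as the meta symbol that escapes whitespace, which is what lets detokenization recover the original text without language-specific rules.

```python
from collections import Counter

WORD_BOUNDARY = "\u2581"  # the whitespace meta symbol used by SentencePiece


def to_symbols(sentence):
    """Turn a raw sentence into single-character symbols, escaping spaces."""
    return list(WORD_BOUNDARY + sentence.replace(" ", WORD_BOUNDARY))


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(sentences, num_merges):
    """Greedily learn up to `num_merges` merges from raw sentences."""
    corpus = [to_symbols(s) for s in sentences]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for symbols in corpus:
            # Assumption for this sketch: never merge across a word boundary.
            counts.update(
                (a, b)
                for a, b in zip(symbols, symbols[1:])
                if not b.startswith(WORD_BOUNDARY)
            )
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def encode(sentence, merges):
    """Tokenize by replaying the learned merges in order."""
    symbols = to_symbols(sentence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


def decode(pieces):
    """Detokenize: join the pieces, restore spaces, drop the leading marker."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ")[1:]


merges = learn_merges(["low lower lowest", "new newer newest"], num_merges=10)
pieces = encode("lowest newer", merges)
print(pieces)
print(decode(pieces))
```

Because whitespace is carried inside the pieces, `decode` is a plain string join followed by a character replacement. The number of merges plays the role of the predetermined vocabulary budget mentioned above.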
repo                          techs                    stars   weekly  forks   weekly
FiYHer/kernel_window_hide     C++                      261     0       108     0
facebook/folly                C++, Python, CMake       26.1k   +28     5.5k    +4
Light-City/CPlusPlusThings    C++, Starlark, C         33.5k   0       7.9k    0
facebook/rocksdb              C++, Java, C             26.1k   0       5.9k    0
ossrs/srs                     C++, JavaScript, HTML    22.8k   0       5.1k    0
esp8266/Arduino               C++, C, Python           15.3k   0       13.4k   0
taichi-dev/taichi             C++, Python, C           23.9k   0       2.3k    0
huihut/interview              C++, C, CMake            30.6k   0       7.6k    0
qinguoyi/TinyWebServer        C++, C, HTML             13k     +77     3.4k    +23
envoyproxy/envoy              C++, Starlark, Java      22.8k   0       4.5k    0