huggingface/tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

As of 29 Apr 2024, the huggingface/tokenizers repository has 7,655 stars and 649 forks.

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignments tracking. It's always possible to get...
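To illustrate the alignment-tracking idea mentioned above, here is a minimal stdlib-only sketch (a toy, not the library's Rust implementation): each token keeps its character span in the original text, so any token can always be traced back to the source string even after normalization.

```python
import re

def tokenize_with_alignments(text):
    """Toy sketch of alignment tracking: lowercase-normalize, split on
    whitespace, and record each token's character span in the ORIGINAL
    text. Assumes a length-preserving (ASCII) normalization."""
    normalized = text.lower()  # offsets stay valid for ASCII input
    return [(m.group(), (m.start(), m.end()))
            for m in re.finditer(r"\S+", normalized)]

text = "Fast State-of-the-Art Tokenizers"
for token, (start, end) in tokenize_with_alignments(text):
    # The span always recovers the corresponding slice of the original.
    print(token, "->", repr(text[start:end]))
```

The real library generalizes this: its normalizers maintain an offset mapping even for length-changing transformations, which is what makes "get the part of the original sentence for a given token" possible after arbitrary normalization.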