All languages are NOT created (tokenized) equal!

This project compares tokenization lengths across languages. With some tokenizers, a message in one language can be split into 10-20x more tokens than a comparable message in another language (e.g., try English vs. Burmese).
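As a quick illustration, the snippet below counts tokens for a parallel pair of sentences. It is a minimal sketch assuming the Hugging Face transformers library; the tokenizer choice ("gpt2") and the Burmese translation are illustrative, not taken from this page.

```python
# Minimal sketch: compare token counts for parallel sentences.
# Assumes the Hugging Face `transformers` library; the tokenizer
# ("gpt2") and the sentences below are illustrative examples.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

texts = {
    "English": "wake me up at nine am on friday",
    # Illustrative Burmese translation of the sentence above.
    "Burmese": "သောကြာနေ့ မနက် ကိုးနာရီမှာ ငါ့ကို နှိုးပါ",
}

counts = {lang: len(tokenizer.encode(text)) for lang, text in texts.items()}
for lang, n in counts.items():
    print(f"{lang}: {n} tokens")

# How many times longer the Burmese encoding is than the English one.
print(f"ratio: {counts['Burmese'] / counts['English']:.1f}x")
```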

This is part of a larger project on measuring inequality in NLP. See the original article, 'All languages are NOT created (tokenized) equal', on Art Fish Intelligence.

Settings

Select Tokenizer

Data Source

The data in this figure is the validation set of the Amazon MASSIVE dataset, which consists of 2033 short sentences and phrases translated into 52 different languages (2033 × 52 = 105,716 rows). Learn more about the dataset from Amazon's blog post.
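If you want to pull the same split yourself, the sketch below uses the Hugging Face datasets library. The dataset identifier ("AmazonScience/massive"), the per-locale config names, and the "utt" field are assumptions about the public mirror, not details confirmed on this page.

```python
# Sketch of loading the validation split. The dataset id
# ("AmazonScience/massive"), per-locale configs ("en-US", "my-MM"),
# and the "utt" text field are assumptions about the public mirror.
from datasets import load_dataset

en = load_dataset("AmazonScience/massive", "en-US", split="validation")
my = load_dataset("AmazonScience/massive", "my-MM", split="validation")

print(len(en))       # expected: 2033 utterances per language
print(en[0]["utt"])  # the raw utterance text
```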

Data loaded: 105,716 rows (2033 sentences × 52 languages)
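The per-language comparison shown in the figure can be approximated in a few lines. This is a hedged sketch under the same assumptions as above; the tokenizer, dataset identifier, configs, and field names are illustrative.

```python
# Sketch of the per-language comparison behind the figure: median
# token count per locale, then the ratio relative to English.
# Same assumptions as above: dataset id, configs, and "utt" field.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

medians = {}
for locale in ["en-US", "my-MM"]:
    ds = load_dataset("AmazonScience/massive", locale, split="validation")
    lengths = sorted(len(tokenizer.encode(ex["utt"])) for ex in ds)
    medians[locale] = lengths[len(lengths) // 2]  # median of 2033 lengths

print(medians)
print(f"Burmese/English ratio: {medians['my-MM'] / medians['en-US']:.1f}x")
```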

Select Languages (max 6)

Visualizations

Example Texts