> You may think of these symbols as "Latin" because they're how people writing i...

yorwba · 2025-05-07T12:33:01 1746621181

Precisely because the exact same codepoints are used for digits and mathematical symbols, there's nothing script-specific about them and their linguistic association is determined by the training data mixture. A model trained predominantly on text scraped from Chinese websites would learn to associate them more with Mandarin than English in the latent space, since that would be the context where they most often appear.