Polish Language Beats English in Coding Prompts?
Polish leads a multilingual AI prompting benchmark — a look at why structure, logic, and bilingual precision matter when coding with LLMs.
Recently, I came across an article in Rzeczpospolita referencing a study from the University of Maryland and Microsoft Research. It claimed that Polish outperformed English in prompting large language models on coding-related tasks.
At first, it sounded surprising — but not entirely.
Back in March 2025, researchers released a paper titled “One ruler to measure them all: Benchmarking multilingual long-context language models,” available on arXiv.
The benchmark, called OneRuler, evaluated how models handle reasoning tasks across 26 languages, using extremely long contexts, from 8,000 up to 128,000 tokens. One of the co-authors, Marzena Karpinska, is Polish, which likely ensured accurate translation and validation for our language.
According to the study, Polish reached about 88% accuracy, ranking first. English scored just under 84%, landing in sixth place.
That difference appeared mainly in tasks with very long prompts, between 64k and 128k tokens, where the model had to search for or aggregate specific information across long sequences of text.
The missing context.
These results are interesting, but they don’t tell the whole story. We don’t actually know what kind of Polish was used in those prompts. The paper doesn’t specify whether it was pure natural Polish or a technical hybrid that mixed English terminology (functions, libraries, classes, framework names) with Polish syntax and logic.
That detail matters a lot.
After reading the study, I looked at how I write prompts myself. And I realized that what I use isn’t really “Polish.” It’s something in between: what I’d call technical Polish.
A hybrid language in action.
Take this example:
“Napisz funkcję generateUserToken() w Node.js, która tworzy token JWT z określonym czasem ważności. Zapisz w bazie timestamp jako query.”
(In English: “Write a generateUserToken() function in Node.js that creates a JWT token with a defined expiry time. Save the timestamp in the database as a query.”)
Grammatically, it’s Polish. But semantically, it’s filled with English programming constructs — generateUserToken, Node.js, token JWT, timestamp, query.
The AI doesn’t treat it as a translation problem. It interprets it as a structured instruction, a natural command (“write a function”) anchored by technical symbols it already knows.
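To make this concrete, here is a minimal sketch of the kind of function that prompt tends to elicit, assuming the widely used jsonwebtoken package. The saveTimestamp() helper is a hypothetical stand-in, since the prompt leaves the storage layer unspecified:

```typescript
// A minimal sketch of the output the hybrid prompt tends to elicit.
// Assumes the "jsonwebtoken" package; saveTimestamp() is a hypothetical
// placeholder, because the prompt doesn't specify the storage layer.
import jwt from "jsonwebtoken";

const JWT_SECRET = process.env.JWT_SECRET ?? "dev-secret";

// "tworzy token JWT z określonym czasem ważności"
// -> creates a JWT with a defined expiry time
export function generateUserToken(userId: string, expiresInSeconds = 3600): string {
  const token = jwt.sign({ sub: userId }, JWT_SECRET, {
    expiresIn: expiresInSeconds,
  });

  // "Zapisz w bazie timestamp jako query"
  // -> persist the issue timestamp in the database
  saveTimestamp(userId, Date.now());

  return token;
}

// Hypothetical persistence helper standing in for a real query,
// e.g. INSERT INTO token_log (user_id, issued_at) VALUES ($1, $2)
function saveTimestamp(userId: string, issuedAt: number): void {
  console.log(`token issued for ${userId} at ${new Date(issuedAt).toISOString()}`);
}
```

Notice that nothing needed translating: generateUserToken, JWT, and timestamp pass straight from the prompt into the code, while the Polish sentence structure dictates the order of operations.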
This hybrid works so well because:
Polish provides structural clarity — verbs, relationships, and dependencies are explicit.
English provides exact references — to code, APIs, and functions that exist in the model’s training data.
Together, they form a kind of developer pidgin, a semi-formal, semi-natural syntax optimized for reasoning about code. The Polish hybrid version naturally breaks the task into steps: create → store → define conditions. It reads more like an algorithmic instruction, something between human reasoning and pseudo-code.
This might be one reason Polish performs so well in those benchmark tests. Not because it’s “better” linguistically, but because its grammatical precision combines well with English technical references that anchor meaning inside the model.
Why the benchmark still matters.
Even with those nuances, OneRuler provides an important insight. Language structure directly affects how AI interprets logic and intent.
Polish is morphologically dense. Word endings encode who acts, what changes, and how actions relate, information that English distributes across separate words. That density may help models preserve meaning across long contexts and reduce ambiguity.
Essentially, Polish “compresses” logical information, packing into word endings what English spells out with separate words, which could improve performance when token limits are tight.
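Whether that compression survives tokenization is easy to check empirically. Here is a quick sketch, assuming the js-tiktoken package (a JavaScript port of OpenAI’s tokenizer); cl100k_base is the encoding used by GPT-4-era models:

```typescript
// Count tokens for the same instruction in Polish and English.
// Assumes the "js-tiktoken" package; cl100k_base is the encoding
// used by GPT-4-era OpenAI models.
import { getEncoding } from "js-tiktoken";

const enc = getEncoding("cl100k_base");

const polish =
  "Napisz funkcję generateUserToken() w Node.js, która tworzy token JWT z określonym czasem ważności.";
const english =
  "Write a generateUserToken() function in Node.js that creates a JWT token with a defined expiry time.";

console.log("PL tokens:", enc.encode(polish).length);
console.log("EN tokens:", enc.encode(english).length);
```

One caveat: BPE tokenizers trained mostly on English text often split Polish words into more pieces, so the density advantage is about information per clause rather than guaranteed lower token counts. It’s worth measuring for the specific model you use.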
But this doesn’t mean the model is “better” at Polish.
It probably means Polish prompts are more explicit by design, and when mixed with English technical vocabulary, they hit a balance that models find easy to process.
The next step: studying technical hybrids.
This points to an interesting direction for further research: how hybrid prompting — mixing natural language and programming semantics — influences reasoning accuracy.
Questions worth exploring:
Does blending Polish syntax with English code tokens systematically improve results?
Are multilingual, semi-structured prompts more effective for specific task types (like code generation or data extraction)?
Could “technical hybrids” evolve into their own meta-language for working with LLMs?
Right now, developers are already using such hybrids intuitively. It’s time academia caught up and measured their impact systematically.
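A systematic comparison doesn’t require heavy infrastructure. Here is a rough sketch of the kind of A/B harness such a study could start from, assuming the official openai npm client; the model name and the scoring step are placeholders:

```typescript
// Rough A/B harness: send matched Polish-hybrid and English prompts
// to the same model and compare the outputs. Assumes the official
// "openai" npm package; reads OPENAI_API_KEY from the environment.
import OpenAI from "openai";

const client = new OpenAI();

const variants = {
  polishHybrid:
    "Napisz funkcję generateUserToken() w Node.js, która tworzy token JWT z określonym czasem ważności.",
  english:
    "Write a generateUserToken() function in Node.js that creates a JWT token with a defined expiry time.",
};

async function compare(): Promise<void> {
  for (const [label, prompt] of Object.entries(variants)) {
    const res = await client.chat.completions.create({
      model: "gpt-4o-mini", // placeholder; use the model under evaluation
      messages: [{ role: "user", content: prompt }],
    });
    // A real study would score the output (run tests, lint, diff against
    // a reference) instead of just printing it.
    console.log(`--- ${label} ---\n${res.choices[0].message.content}\n`);
  }
}

compare().catch(console.error);
```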
Why this matters.
Poland has a strong AI and developer community. Many of us think in Polish but code, write, and build in English. That bilingual mindset might actually be an advantage: it trains us to move naturally between descriptive logic and formal systems, which is exactly how prompt engineering works.
What’s more, Polish engineers and researchers hold key positions in leading AI companies — from OpenAI (Jakub Pachocki, Wojciech Zaremba) to Anthropic, DeepMind, ElevenLabs, and Wordware. This presence isn’t just symbolic. It means Polish expertise directly influences how large models are trained, tested, and optimized, including how they interpret Polish language inputs.
Real representation in research and engineering teams often determines which languages receive more careful data preparation, evaluation, and continuous improvement.
The best results might come not from choosing one language over another, but from combining them intelligently: Polish for logic and constraints, English for technical anchors.
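As one possible illustration of that division of labor, a hybrid prompt could be templated like this; the structure is a suggestion, not a validated pattern:

```typescript
// Hypothetical template: Polish carries the task logic and constraints,
// English identifiers supply the technical anchors (libraries, function
// names, formats) the model already knows from its training data.
function hybridPrompt(fn: string, runtime: string, constraint: string): string {
  return [
    `Napisz funkcję ${fn} w ${runtime}.`, // task framing in Polish
    `Wymagania: ${constraint}.`,          // constraints in Polish
    "Zwróć sam kod, bez komentarzy.",     // expected output shape
  ].join(" ");
}

console.log(
  hybridPrompt("generateUserToken()", "Node.js", "token JWT z czasem ważności 1h"),
);
```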
Our challenge now isn’t linguistic. It’s educational.
Much of our expertise still comes from individual effort and self-learning. If anything, this study should remind us that the potential is already here. We just need to formalize it through research, education, and shared experimentation.
At the same time, we often underestimate how deeply AI is already changing us, how it reshapes the way we think, reason, and even structure language itself. That’s why studies like this matter. They’re not just about model performance or linguistic ranking.
They help us understand how intelligence — both human and artificial — adapts through interaction. And that understanding will be just as important as the technology itself.