LLM Evaluation · NLP · Pragmatics · Computational Linguistics

Roland Mühlenbernd

// ML Researcher · Architect of Language Models

I build ML models of language — and increasingly, I evaluate whether today's LLMs actually understand it. My current work develops novel calibration metrics (ESR, CDS) to benchmark frontier models (GPT, Claude, Gemini) on fine-grained social and pragmatic tasks.

My background spans Computer Science, Media Studies, and Computational Linguistics (BSc–MSc–PhD). I bring 15+ years of rigorous modeling — game theory, multi-agent RL, probabilistic NLP — to the question that matters most in applied AI: does the model actually get what humans mean?

I'm actively seeking roles in NLP research, LLM evaluation, or applied language AI where linguistic depth meets engineering ambition.

🏅 ERC Seal of Excellence 💶 €75K Research Grant (NAWA) 🎤 Invited Speaker, Stanford
40+ Publications · 50+ Presentations · 15+ Years Research · 25+ Courses Taught
researcher_profile.py
# Roland Mühlenbernd
focus = [
    "LLM evaluation & calibration",
    "pragmatics & social meaning",
    "language dynamics",
]
stack = {
    "ml": ["PyTorch", "HuggingFace"],
    "stats": ["Python", "R"],
}
publications = 40  # and counting
Areas of Expertise

What I Do

Combining formal linguistic theory with modern ML methods to understand and model natural language.

🧠 AI · ML
LLM Evaluation & Calibration
Developing calibration metrics (ESR, CDS) to assess LLM performance on fine-grained social meaning tasks. Testing GPT, Claude, and Gemini across varied prompting conditions.
PyTorch HuggingFace prompt engineering
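ESR and CDS are bespoke metrics whose definitions are not given on this page. As a generic illustration of the calibration family they belong to, here is a minimal Expected Calibration Error in plain Python; the toy scores and labels are hypothetical:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and average |accuracy - confidence|,
    weighting each bin by its share of the predictions."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += len(b) / total * abs(accuracy - avg_conf)
    return ece

# Hypothetical model outputs: the 0.5-confidence answers are right half the
# time (perfectly calibrated), the 0.9-confidence answers are always right.
scores = [0.5, 0.5, 0.9, 0.9]
labels = [True, False, True, True]
# expected_calibration_error(scores, labels) -> 0.05
```

A well-calibrated model drives this toward zero; the same binning logic extends to per-register or per-politeness-level breakdowns.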
💬 NLP
Pragmatics & Social Meaning
Probabilistic speaker models predicting politeness, precision, and register choices. Benchmarking LLMs against human pragmatic judgments for fine-grained social evaluation.
RSA models corpus analysis spaCy
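The RSA (Rational Speech Acts) framework behind these speaker models can be sketched in a few lines: a literal listener conditions on truth, and a pragmatic speaker soft-maximizes informativity. The two-utterance lexicon below is a textbook toy, not one of the author's models:

```python
import math

# utterance -> set of worlds in which it is literally true
lexicon = {
    "some": {"some", "all"},
    "all": {"all"},
}
worlds = ["some", "all"]

def literal_listener(utterance):
    """Uniform posterior over the worlds consistent with the utterance."""
    true_in = lexicon[utterance]
    return {w: (1 / len(true_in) if w in true_in else 0.0) for w in worlds}

def pragmatic_speaker(world, alpha=1.0):
    """P(utterance | world) proportional to exp(alpha * log L0(world | utterance))."""
    scores = {}
    for u in lexicon:
        p = literal_listener(u)[world]
        scores[u] = math.exp(alpha * math.log(p)) if p > 0 else 0.0
    z = sum(scores.values())
    return {u: s / z for u, s in scores.items()}

# In the "all" world the speaker prefers the stronger utterance "all";
# benchmarking amounts to comparing such distributions to human choices.
```

Raising `alpha` makes the speaker more deterministic; the same recursion stacks into a pragmatic listener for interpretation tasks.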
🔁 Modeling
Reinforcement Learning & Agents
Multi-agent RL models of cooperation and communication. Introduced learning paradigms that outperformed established models across key behavioral metrics.
RL agents game theory Python
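A baseline for this line of work is Roth-Erev urn reinforcement in a two-state Lewis signaling game; the sketch below is that classic setup, not the author's own learning paradigm:

```python
import random

random.seed(0)
STATES, SIGNALS, ACTS = [0, 1], ["a", "b"], [0, 1]

# Each agent keeps an "urn" of weights per context and samples proportionally.
sender = {s: {m: 1.0 for m in SIGNALS} for s in STATES}
receiver = {m: {a: 1.0 for a in ACTS} for m in SIGNALS}

def draw(weights):
    return random.choices(list(weights), weights=weights.values())[0]

successes = 0
for t in range(5000):
    state = random.choice(STATES)
    signal = draw(sender[state])          # sender maps state -> signal
    act = draw(receiver[signal])          # receiver maps signal -> act
    if act == state:                      # communication succeeded
        sender[state][signal] += 1.0      # reinforce both choices
        receiver[signal][act] += 1.0
        successes += 1

# Success climbs well above the 50% chance baseline as signals conventionalize.
```

Comparing a new learning rule against this baseline on metrics like convergence speed and final success rate is the standard evaluation pattern.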
🌐 Computational
Language Dynamics & Networks
Agent-based simulations of how linguistic conventions emerge, spread, and change across social networks. Population-level models of pragmatic behavior over time.
ABM network analysis simulation
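A minimal agent-based sketch of convention spread: agents on a ring network repeatedly align with a random neighbor (a voter-model-style naming game). The topology and update rule here are illustrative choices, not the author's models:

```python
import random

random.seed(1)
N = 30
agents = [random.choice(["V1", "V2"]) for _ in range(N)]  # initial variants

while len(set(agents)) > 1:                 # run until one convention wins out
    i = random.randrange(N)
    j = (i + random.choice([-1, 1])) % N    # a neighbor on the ring
    agents[j] = agents[i]                   # hearer adopts speaker's variant

# The population reaches consensus on a single shared variant almost surely;
# network structure (ring vs. scale-free, etc.) governs how fast.
```

Swapping the ring for an empirical social network, or adding variant prestige, turns this toy into the population-level models described above.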
🧪 Empirical
Experimental Design & Statistics
Designing and running behavioral experiments (LabVanced) to test theoretical predictions. Full data pipelines: collection, statistical modeling, and visualization in Python and R.
LabVanced R statsmodels
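The modeling end of such a pipeline can be as simple as a regression of responses on condition. As a dependency-free stand-in for the R/statsmodels fits, here is closed-form simple linear regression over a hypothetical condition-by-rating dataset:

```python
def ols_fit(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data: politeness ratings rising with formality of condition.
xs = [0, 1, 2, 3]
ys = [1.0, 1.9, 3.1, 4.0]
slope, intercept = ols_fit(xs, ys)  # slope == 1.02, intercept == 0.97
```

In practice the same fit would run as `statsmodels.api.OLS` or R's `lm()`, with mixed effects for by-participant variation.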
📐 Theory
Formal Linguistics & Game Theory
Mathematical modeling of grammar, semantics, and pragmatics. Evolutionary game theory applied to language change, semantic ambiguity, and social signaling conventions.
signaling games replicator dynamics Bayesian models
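Discrete-time replicator dynamics, the workhorse of these evolutionary models, fits in a few lines. The two-strategy coordination payoffs below are illustrative, not drawn from a specific paper:

```python
payoff = [[2.0, 0.0],    # payoff[i][j]: strategy i against strategy j
          [0.0, 1.0]]    # strategy 0 is payoff-dominant

def replicator_step(x):
    """One discrete replicator update: shares grow in proportion to
    their fitness relative to the population average."""
    fitness = [sum(payoff[i][j] * x[j] for j in range(2)) for i in range(2)]
    avg = sum(x[i] * fitness[i] for i in range(2))
    return [x[i] * fitness[i] / avg for i in range(2)]

x = [0.6, 0.4]           # initial population shares
for _ in range(200):
    x = replicator_step(x)
# Starting above the basin boundary (x0 = 1/3), the payoff-dominant
# strategy takes over the population.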
Technical Stack
  • PyTorch · TensorFlow · scikit-learn
  • HuggingFace Transformers · fine-tuning
  • LLM evaluation · prompt engineering
  • Python (expert) · R · Julia · C++
  • spaCy · NLTK · Pandas · NumPy
  • Git · JupyterLab · Google Colab
  • Statistical modeling · Bayesian inference
Linguistic Expertise
  • Pragmatics · politeness theory · register
  • Language evolution & historical change
  • Sociolinguistics · network variation
  • Formal semantics · game-theoretic pragmatics
  • Experimental linguistics · behavioral data
  • Morphology · grammaticalization
  • Computational sociolinguistics
Career Highlights

By the Numbers

📄 40+ Peer-reviewed Publications
🎤 50+ Conference Presentations
🏫 25+ University Courses
💶 €75K Research Grant (NAWA)
Current Work

What I'm Working On

Active research at the intersection of NLP, LLM evaluation, and computational pragmatics.

● Active · Submitted

LLM Calibration Metrics for Social Meaning Tasks

Developing ESR and CDS metrics to evaluate GPT-4, Claude, and Gemini on politeness, precision, and register. Submitted to CMCL 2026 · follow-up targeting ACL/EMNLP
● Active · In progress

Probabilistic Speaker Models for Pragmatic NLP

Bayesian/RSA models predicting human judgments of politeness and (im)precision with >90% accuracy · benchmarked against transformer-based baselines

Full Research & Publications Record

40+ peer-reviewed papers spanning NLP, AI, linguistics, and cognitive science · journals include Synthese, Linguistics & Philosophy, Experimental Economics, AI & Society

Completed

EvoSAL: Evolution of Semantic Ambiguity in the Lab

NAWA-funded (€75K) · Nicolaus Copernicus University · 2019–2021 · iterated learning models, experimental paradigm design