Bayesian Teaching for Persistent User Modelling in LLMs
Most LLM personalisation methods model user preferences at the task level, so what they learn does not generalise across domains. This dissertation asks whether stable Big Five personality traits provide a more robust substrate, by extending the Qiu et al. (2026) Bayesian teaching framework from task-level preferences to higher-order trait inference.
Approach
A six-stage Bayesian-teaching pipeline
A complete hypothesis space of 3,125 synthetic users was generated by enumerating every combination of the five Big Five traits, each scored on a 1–5 scale (5^5 = 3,125). Each synthetic user answered the 50-item IPIP questionnaire under a Graded Response Model noise process. A Bayesian teacher maintained a posterior over the 3,125 hypotheses and updated it after every response, producing 15,625 conversation traces that were used to fine-tune six open-weight LLMs.
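The enumeration and the noise process can be sketched in a few lines. This is a minimal illustration, not the dissertation's code: the trait labels, the discrimination parameter `a`, and the threshold values are assumptions chosen only to make the example run.

```python
import itertools
import math

TRAITS = ("O", "C", "E", "A", "ES")  # hypothetical labels for the five traits
LEVELS = range(1, 6)                 # each trait scored 1..5

# Every combination of five traits at five levels: 5**5 = 3,125 hypotheses.
hypotheses = list(itertools.product(LEVELS, repeat=len(TRAITS)))

def grm_probs(theta, thresholds, a=1.5):
    """Graded Response Model: probability of each ordered response category.

    theta      -- latent trait level of the synthetic user
    thresholds -- ascending category boundaries (assumed values)
    a          -- item discrimination (assumed value)
    """
    # P(response above boundary b) follows a logistic curve in (theta - b).
    above = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative = [1.0] + above + [0.0]
    # Category probability is the difference of adjacent cumulative terms.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]

probs = grm_probs(theta=3, thresholds=[1.5, 2.5, 3.5, 4.5])
```

With these assumed thresholds, a user at level 3 is most likely to pick the middle Likert option, with probability falling off symmetrically on either side.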
Mechanism
Try the Bayesian update step
The bars below show a posterior over a synthetic user's Openness score on a 1–5 scale. The model starts from a uniform prior, so it knows nothing about the user. Each Likert answer the user submits updates the posterior through a simplified item likelihood. This is the same kind of update the Bayesian teacher performs, item by item, over the full 3,125-user hypothesis space.
Q1. I have a vivid imagination.
Q2. I'm full of ideas.
Q3. I enjoy thinking about abstract concepts.
Q4. I'm interested in many different things.
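The update step itself is ordinary Bayes' rule over five hypotheses. The sketch below assumes a simple distance-based likelihood (answers close to the true level are more probable); the `likelihood` function, its `sharpness` parameter, and the example answers are illustrative assumptions, not the likelihood used in the dissertation.

```python
import math

LEVELS = [1, 2, 3, 4, 5]  # candidate Openness scores

def likelihood(answer, level, sharpness=1.0):
    """Hypothetical item likelihood P(answer | level), peaked at answer == level."""
    weights = [math.exp(-sharpness * abs(k - level)) for k in LEVELS]
    return weights[answer - 1] / sum(weights)

def update(posterior, answer):
    """One Bayesian update: multiply by the likelihood, then renormalise."""
    unnormalised = [p * likelihood(answer, lvl) for p, lvl in zip(posterior, LEVELS)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

posterior = [1 / len(LEVELS)] * len(LEVELS)  # uniform prior
for answer in [4, 5, 4, 4]:                  # hypothetical answers to Q1..Q4
    posterior = update(posterior, answer)
```

After these four answers the mass concentrates on level 4, with level 5 a distant second. The teacher applies the same multiply-and-renormalise step, only over 3,125 joint hypotheses instead of five.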
Results
Six fine-tuned LLMs converged to the Bayesian teacher
| Metric | Baseline | Fine-tuned | Bayesian teacher |
|---|---|---|---|
| Within-task accuracy | 26.7 – 39.4% | 53.1 – 53.4% | 53.6% |
| Teacher agreement | 24.8 – 50.2% | 91.6 – 96.3% | — |
| Predictive entropy (nats) | 0.13 – 0.99 | 1.05 – 1.15 | 1.14 |
| Human validation (n=12) | ~40% | ~51% | — |
| Cross-task transfer (3 of 6) | 32.3 – 48.8% | 44.7 – 51.6% | — |
All six fine-tuned models matched the Bayesian teacher within 0.5 percentage points on the 50-item questionnaire. The KL-divergence training objective preserved the teacher's calibrated uncertainty: five of the six models stayed within 0.022 nats of the teacher's predictive entropy. In early experiments, by contrast, standard cross-entropy training had produced logit collapse, with 98% of probability mass on a single token.
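To see why the choice of objective matters, compare a student that matches the teacher's distribution with one that has collapsed onto a single answer. The numbers below are invented for illustration (only the 98% figure echoes the text), and the functions are a plain-Python sketch rather than the training code.

```python
import math

def entropy(p):
    """Predictive entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(teacher, student):
    """KL(teacher || student) in nats; zero only when the two match."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

# Hypothetical teacher distribution over the five Likert answers.
teacher = [0.05, 0.10, 0.25, 0.40, 0.20]

# A student that reproduces the teacher, and one with 98% mass on one answer.
calibrated = list(teacher)
collapsed = [0.005, 0.005, 0.005, 0.98, 0.005]

kl_calibrated = kl_divergence(teacher, calibrated)
kl_collapsed = kl_divergence(teacher, collapsed)
entropy_gap = entropy(teacher) - entropy(collapsed)
```

Cross-entropy against a one-hot label is minimised by putting all mass on that label, so it rewards the collapsed student. KL against the teacher's soft distribution is minimised only when the student reproduces the whole distribution, which is why it penalises collapse and keeps the entropy close to the teacher's.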
Critically, the personality model transferred to a structurally different task. With zero flight-specific training data, the three models tested on flight recommendation improved from 32–49% to 45–52%, which is direct evidence that the trained behaviour was not a questionnaire-specific pattern.
Surprise
Action without report
“Fine-tuning produced models that acted on personality without being able to report it — a dissociation between action and report that appears to be induced by the training method itself.”
A side-channel probe asked the fine-tuned models to state their belief about each trait directly (“On a scale of 1 to 5, what is this user's level of Openness?”). Belief tracking was inconsistent and model-dependent. The most extreme case: Qwen 14B with full personality context achieved the highest stated-belief correlation (r = +0.81) yet the lowest fine-tuned accuracy on flight recommendation (43.1%).
The likely cause is structural. During training each questionnaire item loaded on exactly one trait, so corrective signal was unambiguous. Flight feedback was different: a single accept/reject was jointly determined by Openness, Conscientiousness, and Emotional Stability, so the per-trait attribution was ambiguous. The models learned how to act on personality, not how to describe it — a procedural rather than declarative skill.
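The attribution problem can be made concrete with a toy posterior over three traits. The sketch below is an assumption-laden illustration: the two likelihoods (`item_on_openness` and `joint_accept`) are hypothetical stand-ins for a single-trait questionnaire item and a jointly determined accept signal, not the mappings used in the study.

```python
import itertools
import math

LEVELS = range(1, 6)
# Joint hypothesis space over (Openness, Conscientiousness, Emotional Stability).
space = list(itertools.product(LEVELS, repeat=3))

def posterior(likelihood):
    """Posterior over the joint space from a uniform prior and one observation."""
    weights = [likelihood(h) for h in space]
    total = sum(weights)
    return [w / total for w in weights]

def marginal_entropy(post, trait_index=0):
    """Entropy (nats) of the marginal belief about one trait."""
    marginal = {level: 0.0 for level in LEVELS}
    for p, h in zip(post, space):
        marginal[h[trait_index]] += p
    return -sum(p * math.log(p) for p in marginal.values() if p > 0)

def item_on_openness(h):
    # Hypothetical questionnaire item: answer "5" observed, loads on Openness only.
    return math.exp(-abs(5 - h[0]))

def joint_accept(h):
    # Hypothetical flight feedback: "accept" depends on the sum of all three traits.
    return 1.0 / (1.0 + math.exp(-(sum(h) - 9)))

h_item = marginal_entropy(posterior(item_on_openness))
h_joint = marginal_entropy(posterior(joint_accept))
```

Under these assumptions a single questionnaire answer sharpens the belief about Openness far more than a single accept does, because the accept is consistent with many different combinations of the three traits.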
Caveats
Limits and what comes next
Limitations
- Cross-task feedback coupled three traits, making per-trait attribution ambiguous.
- Trait-to-flight mapping was deliberately arbitrary, not psychometrically grounded.
- Belief probe was a single-token forced choice — a narrow channel.
- Human validation used n=12 participants.
- Train/test split shuffled at the episode level, not the user level.
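For the last limitation, a user-level split would hold out whole users so that no synthetic user contributes episodes to both sets. The sketch below assumes episodes are stored as `(user_id, episode)` pairs; that layout, the function name, and the default fraction are illustrative assumptions.

```python
import random

def split_by_user(episodes, test_fraction=0.2, seed=0):
    """Hold out entire users rather than individual episodes.

    episodes -- list of (user_id, episode) pairs (hypothetical layout)
    """
    users = sorted({user_id for user_id, _ in episodes})
    rng = random.Random(seed)
    rng.shuffle(users)
    n_test = max(1, int(len(users) * test_fraction))
    test_users = set(users[:n_test])
    train = [e for e in episodes if e[0] not in test_users]
    test = [e for e in episodes if e[0] in test_users]
    return train, test

# Ten hypothetical users with five episodes each.
episodes = [(user, idx) for user in range(10) for idx in range(5)]
train, test = split_by_user(episodes)
```

An episode-level shuffle, by contrast, can place episodes from the same user on both sides of the split, so test accuracy may partly reflect users already seen in training.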
Future work
- Single-trait downstream tasks to isolate per-trait belief tracking.
- Psychometrically grounded trait-to-feature mappings.
- Scaling tests at 70B+ to check whether the dissociation persists.
- Personality representations beyond the Big Five.