Research · 2026

Bayesian Teaching for Persistent User Modelling in LLMs

Most LLM personalisation methods model user preferences at the task level — they don't generalise across domains. This dissertation asks whether stable Big Five personality traits provide a more robust substrate, by extending the Qiu et al. (2026) Bayesian teaching framework from task-level preferences to higher-order trait inference.

School of Informatics · University of Edinburgh · Supervised by Sohan Seth & Kieran Richards


Approach

A six-stage Bayesian-teaching pipeline

A complete hypothesis space of 3,125 synthetic users was generated by enumerating every combination of the five Big Five traits, each scored on a 1–5 scale (5^5 = 3,125). Each synthetic user answered the 50-item IPIP questionnaire under a Graded Response Model noise process. A Bayesian teacher maintained a posterior over the 3,125 hypotheses, updating after every response, and produced 15,625 conversation traces used to fine-tune six open-weight LLMs.

Step 1 · Synthetic users · 3,125

Step 2 · GRM responses · 50 items

Step 3 · Bayesian teacher · posterior trace

Step 4 · Training data · 11,250 episodes

Step 5 · LoRA fine-tune · KL-divergence

Step 6 · Evaluation · within + transfer
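The first two stages can be sketched in a few lines of Python. This is a minimal illustration rather than the dissertation's code: the function names, the discrimination value, the thresholds, and the use of the raw 1–5 trait level as the latent score are all assumptions chosen for readability. The sketch enumerates the 5^5 = 3,125 trait profiles and draws one Likert response from a standard graded response model.

```python
import itertools
import math
import random

# Trait names are labels for illustration only.
TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "emotional_stability"]
LEVELS = [1, 2, 3, 4, 5]

def enumerate_users():
    """Every combination of five traits on a 1-5 scale: 5**5 profiles."""
    return [dict(zip(TRAITS, combo))
            for combo in itertools.product(LEVELS, repeat=len(TRAITS))]

def grm_probs(theta, discrimination=1.5, thresholds=(1.5, 2.5, 3.5, 4.5)):
    """Graded response model: P(Y = k) = P(Y >= k) - P(Y >= k + 1).

    discrimination and thresholds are assumed values, not fitted ones.
    """
    cumulative = [1.0]  # P(Y >= 1) is always 1
    for b in thresholds:
        cumulative.append(1.0 / (1.0 + math.exp(-discrimination * (theta - b))))
    cumulative.append(0.0)  # P(Y >= 6) is always 0
    return [cumulative[k] - cumulative[k + 1] for k in range(5)]

def sample_response(theta, rng):
    """Draw one noisy Likert answer (1-5) for a user with trait level theta."""
    return rng.choices(LEVELS, weights=grm_probs(theta), k=1)[0]

users = enumerate_users()
print(len(users))  # 3125

rng = random.Random(0)
answer = sample_response(users[-1]["openness"], rng)  # user with all traits at 5
```

With these assumed parameters a user at level 5 answers 5 most often but still gives lower answers some of the time, which is the kind of noise the teacher's posterior has to absorb.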

Mechanism

Try the Bayesian update step

The bars below show a posterior over a synthetic user's Openness score on a 1–5 scale. The model starts from a uniform prior: it knows nothing. Each Likert answer the user submits updates the posterior using a simplified item likelihood, the same kind of update the Bayesian teacher performs over the full 3,125-user hypothesis space, item by item.

Belief over Openness · 0/4 answered · uniform prior

Q1. I have a vivid imagination.

Q2. I'm full of ideas.

Q3. I enjoy thinking about abstract concepts.

Q4. I'm interested in many different things.

1 = strongly disagree · 5 = strongly agree
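For readers without the interactive widget, the update over a single trait can be written out directly. The sketch below is an assumption-laden stand-in: the likelihood (probability mass decaying exponentially with the distance between the answer and the trait level) and the four example answers are hypothetical, and the page's own "simplified item likelihood" may differ.

```python
import math

LEVELS = [1, 2, 3, 4, 5]

def item_likelihood(answer, level):
    """P(answer | trait level). Assumed form: mass decays with distance."""
    weights = [math.exp(-abs(a - level)) for a in LEVELS]
    return weights[answer - 1] / sum(weights)

def bayes_update(prior, answer):
    """Posterior is proportional to prior times likelihood, then renormalised."""
    unnormalised = [p * item_likelihood(answer, level)
                    for p, level in zip(prior, LEVELS)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

posterior = [1 / 5] * 5          # uniform prior: the model knows nothing
for answer in [4, 5, 4, 4]:      # hypothetical answers to Q1-Q4
    posterior = bayes_update(posterior, answer)

most_likely = LEVELS[posterior.index(max(posterior))]
print(most_likely)  # 4
```

The full teacher applies the same multiply-and-renormalise step, but over 3,125 joint trait profiles instead of five levels of one trait.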

Results

Six fine-tuned LLMs converged to the Bayesian teacher

Llama 3.2 3B · Llama 3.1 8B · Gemma 3 4B · Gemma 3 12B · Qwen 2.5 7B · Qwen 2.5 14B

Metric	Baseline	Fine-tuned	Bayesian teacher
Within-task accuracy	26.7 – 39.4%	53.1 – 53.4%	53.6%
Teacher agreement	24.8 – 50.2%	91.6 – 96.3%	—
Predictive entropy	0.13 – 0.99	1.05 – 1.15	1.14
Human validation (n=12)	~40%	~51%	—
Cross-task transfer (3 of 6)	32.3 – 48.8%	44.7 – 51.6%	—

All six fine-tuned models matched the Bayesian teacher within 0.5 percentage points on the 50-item questionnaire. The KL-divergence training objective preserved the teacher's calibrated uncertainty — five of the six models stayed within 0.022 nats of the teacher's predictive entropy, where standard cross-entropy training in early experiments had produced logit collapse with 98% probability mass on a single token.
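One way to see why the choice of objective matters is to compare the two losses on a single prediction. The sketch below is illustrative only: the teacher distribution is made up, the KL direction (teacher relative to student) is an assumption, and the "collapsed" student simply mirrors the 98% single-token figure quoted above. Against a single hard label, cross-entropy scores the collapsed student better than a student that copies the teacher, whereas KL against the teacher's full distribution is zero only when the student reproduces the teacher's uncertainty.

```python
import math

def entropy(p):
    """Predictive entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats, with p the teacher and q the student (assumed direction)."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def hard_cross_entropy(label_index, q):
    """Cross-entropy against a one-hot label: -log q[label]."""
    return -math.log(q[label_index])

# Hypothetical teacher posterior predictive over the five Likert options.
teacher = [0.05, 0.15, 0.35, 0.30, 0.15]
# Student that copies the teacher versus one with 98% mass on a single option.
matched = list(teacher)
collapsed = [0.005, 0.005, 0.98, 0.005, 0.005]
label = 2  # index of the sampled answer used as the hard target

ce_matched = hard_cross_entropy(label, matched)
ce_collapsed = hard_cross_entropy(label, collapsed)   # lower: collapse is rewarded
kl_matched = kl_divergence(teacher, matched)          # zero: uncertainty preserved
kl_collapsed = kl_divergence(teacher, collapsed)      # large: collapse is penalised
```

In this toy case the collapsed student also has much lower predictive entropy than the teacher, which is the quantity the Predictive entropy row of the table tracks.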

Critically: the personality model transferred to a structurally different task. With zero flight-specific training data, the three models tested on flight recommendation improved from 32–49% to 45–52% — direct evidence that the trained behaviour was not a questionnaire-specific pattern.

Surprise

Action without report

“Fine-tuning produced models that acted on personality without being able to report it — a dissociation between action and report that appears to be induced by the training method itself.”

A side-channel probe asked the fine-tuned models to state their belief about each trait directly (“On a scale of 1 to 5, what is this user's level of Openness?”). Belief tracking was inconsistent and model-dependent. The most extreme case: Qwen 2.5 14B with full personality context achieved the highest stated-belief correlation (r = +0.81) yet the lowest fine-tuned accuracy on flight recommendation (43.1%).

The likely cause is structural. During training, each questionnaire item loaded on exactly one trait, so the corrective signal was unambiguous. Flight feedback was different: a single accept/reject was jointly determined by Openness, Conscientiousness, and Emotional Stability, so the per-trait attribution was ambiguous. The models learned how to act on personality, not how to describe it: a procedural rather than declarative skill.
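The attribution argument can be made concrete with a toy Bayesian calculation. Everything in the sketch below is hypothetical: the distance-based item likelihood, the acceptance rule (a logistic function of the mean of the three traits), and the single observations are assumptions, not the dissertation's flight task. It compares how much one observation narrows the belief about Openness when the observation depends on Openness alone versus on all three traits jointly.

```python
import itertools
import math

LEVELS = [1, 2, 3, 4, 5]
# Hypotheses are (openness, conscientiousness, emotional_stability) triples.
HYPOTHESES = list(itertools.product(LEVELS, repeat=3))

def item_likelihood(answer, level):
    """Single-trait questionnaire item (assumed distance-based form)."""
    weights = [math.exp(-abs(a - level)) for a in LEVELS]
    return weights[answer - 1] / sum(weights)

def accept_likelihood(hypothesis):
    """Hypothetical flight rule: P(accept) rises with the mean of all three traits."""
    return 1.0 / (1.0 + math.exp(-(sum(hypothesis) / 3 - 3)))

def posterior(likelihood):
    """Posterior over all triples from a uniform prior and one observation."""
    unnormalised = [likelihood(h) for h in HYPOTHESES]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

def marginal(post, trait_index):
    """Marginal belief over one trait, summing out the other two."""
    out = [0.0] * len(LEVELS)
    for h, p in zip(HYPOTHESES, post):
        out[h[trait_index] - 1] += p
    return out

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

# One Openness item answered "5" versus one accepted flight.
after_item = posterior(lambda h: item_likelihood(5, h[0]))
after_accept = posterior(accept_likelihood)

h_item = entropy(marginal(after_item, 0))      # Openness belief after the item
h_accept = entropy(marginal(after_accept, 0))  # Openness belief after the accept
```

Under these assumptions the questionnaire item leaves a noticeably sharper Openness marginal than the accepted flight does, because the accept signal is shared across three traits and none of them can claim it outright.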

Caveats

Limits and what comes next

Limitations

Cross-task feedback coupled three traits, making per-trait attribution ambiguous.
Trait-to-flight mapping was deliberately arbitrary, not psychometrically grounded.
Belief probe was a single-token forced choice — a narrow channel.
Human validation used n=12 participants.
Train/test split shuffled at the episode level, not the user level.
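The last limitation has a simple remedy in principle: hold out whole users rather than individual episodes, so that no synthetic user contributes to both sides of the split. The sketch below is a generic illustration, not the dissertation's pipeline; the `user_id` field, the `split_by_user` helper, and the 20% test fraction are assumed names and values.

```python
import random

def split_by_user(episodes, test_fraction=0.2, seed=0):
    """Split episodes so that each user appears in train or test, never both."""
    user_ids = sorted({ep["user_id"] for ep in episodes})
    rng = random.Random(seed)
    rng.shuffle(user_ids)
    n_test = max(1, round(len(user_ids) * test_fraction))
    test_users = set(user_ids[:n_test])
    train = [ep for ep in episodes if ep["user_id"] not in test_users]
    test = [ep for ep in episodes if ep["user_id"] in test_users]
    return train, test

# Toy data: 10 hypothetical users with 5 episodes each.
episodes = [{"user_id": u, "episode": e} for u in range(10) for e in range(5)]
train, test = split_by_user(episodes)

train_users = {ep["user_id"] for ep in train}
test_users = {ep["user_id"] for ep in test}
print(train_users.isdisjoint(test_users))  # True
```

An episode-level shuffle would instead scatter each user's episodes across both sets, so test accuracy could partly reflect users already seen during training.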

Future work

Single-trait downstream tasks to isolate per-trait belief tracking.
Psychometrically grounded trait-to-feature mappings.
Scaling tests at 70B+ to check whether the dissociation persists.
Personality representations beyond the Big Five.
