# Parameter uncertainty in stochastic control problems

Day 3 of my short-term visitor stay before the working group starts.

• Wednesday: Travel

• Thursday: 8:45 Group Meeting: Paul, Michael, Lance, Dan, Jake, Carl.

Training problem II discussion – problem taxonomy: stochastic, model uncertainty, parameter uncertainty, state uncertainty. Learning about the uncertainty (passive/active adaptive management) in the uncertainty cases, all of which increase the parameter space. Evening with Paul’s students.

• Friday: Writing up Training problem II with Jake, Michael. O’Meara meeting. Evening of old-time fiddle and being locked out of NIMBioS.

• Saturday: Catching up on reading. Writing down transition matrices to account for the learning is non-trivial.

• Sunday: Writing Training problem II with Jake. Brief meeting with Eric & Paul.

## Literature

Still catching up on the literature from the 70s and 80s on optimal control with an uncertain parameter (or an uncertain model, or possibly an uncertain state).

The nicest walk-through of the different approaches in this collection is probably Ludwig & Walters, 1982, which introduces the non-adaptive (“average equilibrium”) strategy, passive adaptive management (“Myopic Bayes”), and active adaptive management (also with Bayesian updating, but computed through policy iteration rather than value iteration (SDP)).

(Ludwig & Hilborn, 1983) discuss difficulties in estimating stock abundance, but in the context of not knowing the parameter values, as opposed to inherent errors in the stock size.

(Walters & Hilborn, 1976) give a nice walk-through in an example with a 3D state space (stock size, mean + covariance matrix for distribution of parameters), where the updating rules come from the regression formulae:

$a_n = a_{n-1} - \frac{\hat P_{n-1} x_n }{ \sigma^2 + x^T_n \hat P_{n-1} x_n} (a_{n-1}^T x_n - y_n )$

$\hat P_n = \hat P_{n-1} - \frac{\hat P_{n-1} x_n x^T_n \hat P_{n-1} }{ \sigma^2 + x^T_n \hat P_{n-1} x_n}$

where $\sigma^2$ is the regression error variance, and for the Ricker model,

$y_n = \ln \left(\frac{R_t}{S_{t-1}}\right), \quad x_n = \begin{pmatrix} 1 \\ -S_{t-1} \end{pmatrix}, \quad a_n = \begin{pmatrix} \alpha_t \\ \beta_t \end{pmatrix}$
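As a minimal sketch of these updating rules (not code from Walters & Hilborn), the two recursions can be written as a single function and run on simulated Ricker data. The function name `rls_update`, the simulated stock sizes, the noise level, and the vague starting values `a_hat = 0`, `P_hat = 100 I` are all assumptions made for illustration.

```python
import numpy as np

def rls_update(a_prev, P_prev, x, y, sigma2):
    """One step of the regression updating rules for y = a^T x + noise."""
    Px = P_prev @ x                       # \hat P_{n-1} x_n
    denom = sigma2 + x @ Px               # sigma^2 + x_n^T \hat P_{n-1} x_n
    a_new = a_prev - Px * (a_prev @ x - y) / denom
    P_new = P_prev - np.outer(Px, x @ P_prev) / denom
    return a_new, P_new

# Simulated Ricker data: R_t = S_{t-1} * exp(alpha - beta * S_{t-1} + noise).
# All numbers below are hypothetical.
rng = np.random.default_rng(0)
alpha_true, beta_true, sigma = 1.0, 0.01, 0.1

a_hat = np.zeros(2)           # assumed prior mean for (alpha, beta)
P_hat = 100.0 * np.eye(2)     # assumed vague prior covariance

for _ in range(500):
    S = rng.uniform(10.0, 100.0)
    R = S * np.exp(alpha_true - beta_true * S + rng.normal(0.0, sigma))
    x = np.array([1.0, -S])   # x_n = (1, -S_{t-1})
    y = np.log(R / S)         # y_n = ln(R_t / S_{t-1})
    a_hat, P_hat = rls_update(a_hat, P_hat, x, y, sigma**2)

print("estimate of (alpha, beta):", a_hat)
print("parameter covariance:\n", P_hat)
```

The pair `(a_hat, P_hat)` is exactly the extra state that has to be carried alongside stock size, which is why the state space grows once learning is included.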

Their annual review, (Walters & Hilborn, 1978), is a nice broad perspective without details for implementation. Surprised that ch 8 of Walters 86 textbook doesn’t get beyond the basics of Bellman’s value iteration. Mangel’s 1985 edited volume seems to do a nice job (but the Google copy doesn’t quite cut it).

## Two-armed bandit problem

Imagine machine 1 has payoff $\alpha_1$ with unknown probability $p_1$, machine 2 has payoff $\alpha_2$ with known probability $p_2$.

Pr( $k$ wins in $m$ gambles $| p_1$ ) is `dbinom(k, m, p_1)`, and we need a prior density $f(p_1)$ that is conjugate to it: choosing `dbeta(x, alpha, beta)` ensures that the posterior is also a beta density.
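A small sketch of why the beta prior is convenient: after $k$ wins in $m$ gambles, the posterior is again a beta density with parameters $\alpha + k$ and $\beta + m - k$. The prior values and counts below are hypothetical, and the grid calculation is only a numerical sanity check of the closed form.

```python
import numpy as np

def beta_posterior(alpha, beta, k, m):
    """Beta(alpha, beta) prior plus k wins in m gambles gives a Beta posterior."""
    return alpha + k, beta + (m - k)

# Hypothetical prior and data.
alpha0, beta0 = 2, 2
k, m = 7, 10
alpha_n, beta_n = beta_posterior(alpha0, beta0, k, m)   # (9, 5)

# Numerical check: posterior is proportional to likelihood * prior on a grid.
# Normalising constants cancel in the ratio below.
p = np.linspace(0.0, 1.0, 100_001)
weights = p**k * (1 - p)**(m - k) * p**(alpha0 - 1) * (1 - p)**(beta0 - 1)
grid_mean = np.sum(p * weights) / np.sum(weights)
closed_form_mean = alpha_n / (alpha_n + beta_n)

print(alpha_n, beta_n, grid_mean, closed_form_mean)
```

Because the posterior stays in the beta family, the belief about $p_1$ is summarised by two numbers, which keeps the dynamic programming state small.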

The value-to-go in the SDP solution is a function of the distribution parameters of the prior,

$J_n(\alpha,\beta) = \max\left( \alpha_2 p_2 , \alpha_1 \frac{\alpha}{\alpha + \beta}\right)$,

which is just the integral over the posterior at that time, since no further information will be gained. The earlier steps can be computed from this by working backwards.
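One way the earlier steps might be computed is the usual backward recursion, sketched here under stated assumptions: a finite horizon with no discounting, payoffs that simply add up, and a belief that changes only when the unknown machine is played. The names `bandit_value`, `payoff1`, `payoff2` are hypothetical (they stand for $\alpha_1$, $\alpha_2$, renamed to avoid clashing with the beta parameters `a`, `b`).

```python
from functools import lru_cache

def bandit_value(a, b, steps, payoff1, payoff2, p2):
    """Expected total payoff with `steps` gambles left and a Beta(a, b) belief on p1."""

    @lru_cache(maxsize=None)
    def J(a, b, n):
        mean1 = a / (a + b)                   # posterior mean of p1
        if n == 1:
            # Final gamble: no further learning, so just compare expected payoffs.
            return max(payoff2 * p2, payoff1 * mean1)
        # Known machine: belief about p1 is unchanged.
        known = payoff2 * p2 + J(a, b, n - 1)
        # Unknown machine: a win moves the belief to (a + 1, b), a loss to (a, b + 1).
        unknown = (payoff1 * mean1
                   + mean1 * J(a + 1, b, n - 1)
                   + (1 - mean1) * J(a, b + 1, n - 1))
        return max(known, unknown)

    return J(a, b, steps)

# Hypothetical example: uniform prior, equal payoffs, known machine wins half the time.
one_step = bandit_value(1, 1, 1, payoff1=1.0, payoff2=1.0, p2=0.5)
two_step = bandit_value(1, 1, 2, payoff1=1.0, payoff2=1.0, p2=0.5)
print(one_step, two_step)
```

In this example the two-step value exceeds twice the one-step value, which is the value of information: playing the unknown machine first lets the second gamble respond to what was learned.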

## Algorithmic implementations

Iadine’s Markov decision process toolbox for Matlab provides nice clean algorithmic specifications for Q-learning, value iteration, policy iteration, etc. The remaining work is in specifying the transition matrices for these contrived state spaces.
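For orientation, here is a generic value-iteration sketch in Python. It is not the toolbox's code or API; the array layout (`P` indexed as action, state, next state; `R` indexed as state, action), the discount factor, and the two-state toy problem are assumptions chosen for illustration.

```python
import numpy as np

def value_iteration(P, R, discount=0.95, tol=1e-8, max_iter=10_000):
    """P has shape (A, S, S) with rows summing to 1; R has shape (S, A)."""
    V = np.zeros(R.shape[0])
    for _ in range(max_iter):
        # Q[s, a] = R[s, a] + discount * sum_t P[a, s, t] * V[t]
        Q = R + discount * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V, Q.argmax(axis=1)

# Hypothetical toy problem: action 0 stays put, action 1 switches state,
# and being in state 1 pays a reward of 1 whatever the action.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],     # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])    # action 1: switch
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])

V, policy = value_iteration(P, R)
print(V, policy)
```

The algorithm itself is a few lines; for the adaptive problems the effort goes into building `P` once the belief parameters are part of the state.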

## Confused

• Does Myopic Bayes (passive adaptive management) differ from Q-learning (which “gives the expected utility of taking a given action in a given state and following a fixed policy thereafter”)? That is, do they agree for a given learning rate? Both are value-iteration updates.

• Why use policy iteration for the active adaptive management solution and value iteration (stochastic dynamic programming/Bellman equation) for Myopic Bayes?

• How are partially observable Markov decision processes different from just adding uncertainty to the observed state (à la Sethi et al., etc.)? In particular, why not include observations as part of the state space?

#### Estimating what we don’t know

Random ideas that have come up in discussions.

• How not to do this: estimate confidence intervals (Harvey, 1997; overconfidence effect).

• Better: estimate the largest reasonable value and the smallest reasonable value. Order matters.

• Better yet? Bet on outcomes (prediction market?) (Bueno de Mesquita, 2010).