Mdptoolbox Ex 2

Adapted from Marescot et al. appendix 5, to Reed optimal control problem, including direct comparison against (semi) analytic optimum.

step 1: define objectives

This is a conceptual step which does not require coding

step 2: define states

K <- 150 # state space limit
states <- 0:K # Vector of all possible states

step 3: define control actions

# Vector of actions: harvest
H <- states

step 4: define dynamic model (with demographic parameters)

p <- c(6,0.05)
f <- function(x, h){
  A <- p[1] 
  B <- p[2] 
  s <- pmax(x-h, 0)
  A * s/(1 + B * s)

sigma_g = 0.1

step 5: define utility

# Utility function
get_utility <- function(x,h) {

step 6: solve bellman equation with value iteration

# Initialize transition matrix
transition <- array(0, dim = c(length(states), length(states), length(H)))

# Initialize utility matrix
utility <- array(0, dim = c(length(states), length(H)))
# Fill in the transition and utility matrix
# Loop on all states
for (k in 0:K) {

    # Loop on all actions
    for (i in 1:length(H)) {

# Calculate the transition state at the next step, given the 
# current state k and the harvest H[i]
        nextpop <- f(k, H[i])
        if(nextpop <= 0)
          transition[k+1, , i] <- c(1, rep(0, length(states) - 1))
    # Implement demographic stochasticity by drawing 
  # probability from a density function
        else {

# We need to correct this density for the final capping state ("Pile on boundary")
# For discrete probability distribution, this is easy if `states` includes all possible
# discrete states below the capping state (e.g. all non-negative integers less than K).  
# For a continuous distribution, this is more problematic as we have to first normalize the densities.
# EDIT: this can be negative, due to floating-point errors. so we take max(v,0) to avoid

# Get long-tailed denominator as normalizing factor (continuous distributions only):
          fine_states <- seq(min(states), 10 * max(states), by = states[2]-states[1])
        N <- sum(dlnorm(fine_states, log(nextpop), sdlog = sigma_g))

      transition[k+1, , i] <- dlnorm(states, log(nextpop), sdlog = sigma_g) / N
        # We need to correct this density for the final capping state ("Pile on boundary")
        transition[k+1, K+1, i] <- max(1 - sum(transition[k+1, -(K+1), i]), 0)

        # Compute utility
        utility[k+1, i] <- get_utility(k, H[i])

    } # end of action loop
} # end of state loop
# Discount factor
discount <- 0.95

# Action value vector at tmax
Vtmax <- numeric(length(states))

# Action value vector at t and t+1
Vt <- numeric(length(states))
Vtplus <- numeric(length(states))

# Optimal policy vector
D <- numeric(length(states))

# Time horizon
Tmax <- 150

Solution calculated explicitly:

The backward iteration consists in storing action values in the vector Vt which is the maximum of utility plus the future action values for all possible next states. Knowing the final action values, we can then backwardly reset the next action value Vtplus to the new value Vt. We start The backward iteration at time T-1 since we already defined the action value at Tmax.

for (t in (Tmax - 1):1) {

# We define a matrix Q that stores the updated action values for 
# all states (rows)
# actions (columns)
    Q <- array(0, dim = c(length(states), length(H)))
    for (i in 1:length(H)) {
# For each harvest rate we fill for all states values (row) 
# the ith column (Action) of matrix Q
# The utility of the ith action recorded for all states is 
# added to the product of the transition matrix of the ith 
# action by the action value of all states 
        Q[,i] <- utility[, i] + discount * (transition[,,i] %*% Vtplus)
    } # end of the harvest loop

    # Find the optimal action value at time t is the maximum of Q
    Vt <- apply(Q, 1, max)

# After filling vector Vt of the action values at all states, we 
# update the vector Vt+1 to Vt and we go to the next step standing 
# for previous time t-1, since we iterate backward
    Vtplus <- Vt

} # end of the time loop

# Find optimal action for each state
for (k in 0:K) {
# We look for each state which column of Q corresponds to the 
# maximum of the last updated value 
# of Vt (the one at time t + 1). If the index vector is longer than 1 
# (if there is more than one optimal value we chose the minimum 
# harvest rate)
    D[k + 1] <- H[(min(which(Q[k + 1, ] == Vt[k + 1])))]

plot solution

plot(states, states - D, xlab="Population size", ylab="Escapement")

proof of optimality: compare with analytical solution

From Reed (1979) we know that the optimal solution is a constant-escapement rule:

\[f'(s^*) = 1/\alpha\]

For growth-rate function \(f\), where \(\alpha\) is the discount factor and \(s^*\) the stock size for the constant escapement. Analytic solutions are clearly possible for certain growth functions, but here I’ve just implemented a generic numerical solution.

fun <- function(x) - f(x,0) + x / discount
out <- optimize(f = fun, interval = c(0,K))
S_star <- out$minimum

exact_policy <- sapply(states, 
                        if(x < S_star) 0
                        else x - S_star)
plot(states, states - D, xlab="Population size", ylab="Escapement")

# The difference between Bellman equation solution and the analytical 
# solution is small:
lines(states, states - exact_policy)

Using toolbox

mdp_check(P = transition, R = utility)
[1] ""
out <- mdp_value_iteration(transition, utility, discount = discount, epsilon = 0.001, max_iter = 5e3, V0 = Vtmax)
[1] "MDP Toolbox: iterations stopped, epsilon-optimal policy found"
plot(states, states - D, xlab="Population size", ylab="Escapement")
lines(states, states - H[out$policy], col="red", lty=2)