This paper presents a policy improvement-value approximation algorithm for the average-reward Markov decision process in which all transition matrices are unichained. In contrast with Howard's algorithm, we do not solve for the exact gain and relative value vector but only approximate them. It is shown that the value approximation algorithm produces a nearly optimal strategy. This paper extends the results of a previous paper in which transient states were not allowed; the algorithm presented here also differs slightly.
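The abstract describes replacing the exact policy-evaluation step of Howard's algorithm with an approximation of the gain and relative values. The sketch below is not taken from the paper; it only illustrates, under assumed data layouts and names, how such a scheme can look for a small unichain MDP, using a few relative value-iteration sweeps as the approximate evaluation before each improvement step.

```python
# Illustrative sketch only: approximate policy iteration for an
# average-reward unichain MDP.  The function name, array layout,
# sweep count, and stopping rule are assumptions, not the paper's method.
import numpy as np

def approximate_policy_iteration(P, r, eval_sweeps=20, max_iters=100, ref_state=0):
    """P[a] is the transition matrix and r[a] the reward vector of action a.
    Every transition matrix is assumed to be unichained."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)      # start from an arbitrary policy
    v = np.zeros(n_states)                      # relative value vector
    g = 0.0                                     # gain estimate

    for _ in range(max_iters):
        P_pi = P[policy, np.arange(n_states)]   # transition matrix under policy
        r_pi = r[policy, np.arange(n_states)]   # reward vector under policy

        # Approximate evaluation: a few relative value-iteration sweeps
        # instead of solving g*e + v = r_pi + P_pi v exactly.
        for _ in range(eval_sweeps):
            w = r_pi + P_pi @ v
            g = w[ref_state] - v[ref_state]     # gain estimate at the reference state
            v = w - w[ref_state]                # keep values relative to ref_state

        # Policy improvement using the approximate relative values.
        q = r + P @ v                           # q[a, s] for every action/state pair
        new_policy = q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy

    return policy, g, v
```

With a stochastic array `P` of shape (actions, states, states) and rewards `r` of shape (actions, states), the routine returns a policy, an estimated gain, and approximate relative values; how close these come to optimal depends on the accuracy of the evaluation step, which is the question the paper analyzes.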
Name | Memorandum COSOR
---|---
Volume | 7827
ISSN (Print) | 0926-4493