Please report any queries concerning the funding data grouped in the sections named "Externally Awarded" or "Internally Disbursed" (shown on the profile page) to
your Research Finance Administrator. You can find your Research Finance Administrator at https://www.ucl.ac.uk/finance/research/rs-contacts.php by entering your department.
Please report any queries concerning the student data shown on the profile page to:
Email: portico-services@ucl.ac.uk
Help Desk: http://www.ucl.ac.uk/ras/portico/helpdesk
Publication Detail
Modelling transition dynamics in MDPs with RKHS embeddings
Publication Type: Conference
Authors: Grunewalder S, Lever G, Baldassarre L, Pontil M, Gretton A
Publication date: 18/06/2012
Keywords: cs.LG
Notes: ICML 2012
Abstract
We propose a new, nonparametric approach to learning and representing
transition dynamics in Markov decision processes (MDPs), which can be combined
easily with dynamic programming methods for policy optimisation and value
estimation. This approach makes use of a recently developed representation of
conditional distributions as embeddings in a reproducing kernel Hilbert
space (RKHS). Such representations bypass the need for estimating transition
probabilities or densities, and apply to any domain on which kernels can be
defined. This avoids the need to calculate intractable integrals, since
expectations are represented as RKHS inner products whose computation has
linear complexity in the number of points used to represent the embedding. We
provide guarantees for the proposed applications in MDPs: in the context of a
value iteration algorithm, we prove convergence to either the optimal policy,
or to the closest projection of the optimal policy in our model class (an
RKHS), under reasonable assumptions. In experiments, we investigate a learning
task in a typical classical control setting (the under-actuated pendulum), and
on a navigation problem where only images from a sensor are observed. For
policy optimisation we compare with least-squares policy iteration where a
Gaussian process is used for value function estimation. For value estimation we
also compare to the NPDP method. Our approach achieves better performance in
all experiments.
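To make the central claim of the abstract concrete, the following is a minimal sketch (not the authors' code) of an empirical conditional mean embedding of the transition dynamics, used to approximate E[V(s') | s] as a weighted sum over sampled successor states. The Gaussian kernel, bandwidth, regulariser lam, and the toy one-dimensional dynamics are illustrative assumptions, not details taken from the paper.

import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    # Gaussian RBF kernel matrix between row-vector sample sets X and Y.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def embedding_weights(S, s_query, lam=1e-3, bandwidth=1.0):
    # Weights beta(s) of the empirical conditional mean embedding.
    # Given transitions S[i] -> S_next[i], the embedding of P(s'|s) is
    # mu(s) = sum_i beta_i(s) k(S_next[i], .), with
    # beta(s) = (K + lam * n * I)^{-1} k_S(s).
    n = S.shape[0]
    K = rbf_kernel(S, S, bandwidth)                    # Gram matrix on start states
    k_s = rbf_kernel(S, s_query[None, :], bandwidth)   # kernel evaluations at the query state
    return np.linalg.solve(K + lam * n * np.eye(n), k_s).ravel()

# Toy 1-D dynamics: noisy drift towards the origin (illustrative only).
rng = np.random.default_rng(0)
S = rng.uniform(-2, 2, size=(200, 1))                  # sampled start states
S_next = 0.8 * S + 0.1 * rng.standard_normal(S.shape)  # sampled successor states

V = lambda s: -np.abs(s).ravel()                       # an arbitrary value function
s0 = np.array([1.5])
beta = embedding_weights(S, s0)

# The expectation of V under the estimated transition model is a weighted
# sum over the sampled successors; no transition density is ever estimated.
expected_V = beta @ V(S_next)
print(expected_V)

Once the weights beta(s) are available, the expectation itself is a single dot product over the sampled successors, which is the linear-in-sample-size computation the abstract refers to; embedding a value-iteration backup then amounts to applying this expectation step inside the usual Bellman update.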