Artificial Intelligence A Modern Approach Fourth Edition
PEARSON SERIES IN ARTIFICIAL INTELLIGENCE Stuart Russell and Peter Norvig, Editors
FORSYTH & PONCE
Computer Vision: A Modern Approach, 2nd ed.
JURAFSKY & MARTIN
Speech and Language Processing, 2nd ed.
RUSSELL & NORVIG
Artificial Intelligence: A Modern Approach, 4th ed.
GRAHAM
ANSI Common Lisp
NEAPOLITAN
Learning Bayesian Networks
Artificial Intelligence A Modern Approach Fourth Edition Stuart J. Russell and Peter Norvig
Contributing writers: Ming-Wei Chang, Jacob Devlin, Anca Dragan, David Forsyth, Ian Goodfellow, Jitendra M. Malik, Vikash Mansinghka, Judea Pearl, Michael Wooldridge
Pearson
Copyright © 2021, 2010, 2003 by Pearson Education, Inc. or its affiliates, 221 River Street, Hoboken, NJ 07030. All Rights Reserved. Manufactured in the United States of America. This publication is protected by copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise. For information regarding permissions, request forms, and the appropriate contacts within the Pearson Education Global Rights and Permissions department, please visit www.pearsoned.com/permissions/.
Acknowledgments of third-party content appear on the appropriate page within the text.
Cover Images:
Alan Turing: Science History Images/Alamy Stock Photo
Statue of Aristotle: Panos Karas/Shutterstock
Ada Lovelace: Pictorial Press Ltd/Alamy Stock Photo
Autonomous cars: Andrey Suslov/Shutterstock
Atlas Robot: Boston Dynamics, Inc.
Berkeley Campanile and Golden Gate Bridge: Ben Chu/Shutterstock
Background ghosted nodes: Eugene Sergeev/Alamy Stock Photo
Chess board with chess figure: Titania/Shutterstock
Mars Rover: Stocktrek Images, Inc./Alamy Stock Photo
Kasparov: KATHY WILLENS/AP Images
PEARSON, ALWAYS LEARNING is an exclusive trademark owned by Pearson Education, Inc. or its affiliates in the U.S. and/or other countries.
Unless otherwise indicated herein, any third-party trademarks, logos, or icons that may appear in this work are the property of their respective owners, and any references to third-party trademarks, logos, icons, or other trade dress are for demonstrative or descriptive purposes only. Such references are not intended to imply any sponsorship, endorsement, authorization, or promotion of Pearson’s products by the owners of such marks, or any relationship between the owner and Pearson Education, Inc., or its affiliates, authors, licensees, or distributors.
Library of Congress Cataloging-in-Publication Data
Russell, Stuart J. (Stuart Jonathan), author. | Norvig, Peter, author.
Artificial intelligence : a modern approach / Stuart J. Russell and Peter Norvig.
Description: Fourth edition. | Hoboken : Pearson, [2021] | Series: Pearson series in artificial intelligence | Includes bibliographical references and index. | Summary: “Updated edition of popular textbook on Artificial Intelligence.” Provided by publisher.
Identifiers: LCCN 2019047498 | ISBN 9780134610993 (hardcover)
Subjects: LCSH: Artificial intelligence.
Classification: LCC Q335 .R86 2021 | DDC 006.3-dc23
LC record available at https://lccn.loc.gov/2019047498
ISBN-10: 0134610997
ISBN-13: 9780134610993
For Loy, Gordon, Lucy, George, and Isaac — S.J.R. For Kris, Isabella, and Juliet — P.N.
Preface
Artificial Intelligence (AI) is a big field, and this is a big book. We have tried to explore the full breadth of the field, which encompasses logic, probability, and continuous mathematics; perception, reasoning, learning, and action; fairness, trust, social good, and safety; and applications that range from microelectronic devices to robotic planetary explorers to online services with billions of users.
The subtitle of this book is “A Modern Approach.” That means we have chosen to tell the story from a current perspective. We synthesize what is now known into a common framework, recasting early work using the ideas and terminology that are prevalent today. We apologize to those whose subfields are, as a result, less recognizable.
New to this edition
This edition reflects the changes in AI since the last edition in 2010:
• We focus more on machine learning rather than handcrafted knowledge engineering, due to the increased availability of data, computing resources, and new algorithms.
• Deep learning, probabilistic programming, and multiagent systems receive expanded coverage, each with their own chapter.
• The coverage of natural language understanding, robotics, and computer vision has been revised to reflect the impact of deep learning.
• The robotics chapter now includes robots that interact with humans and the application of reinforcement learning to robotics.
• Previously we defined the goal of AI as creating systems that try to maximize expected utility, where the specific utility information (the objective) is supplied by the human designers of the system. Now we no longer assume that the objective is fixed and known by the AI system; instead, the system may be uncertain about the true objectives of the humans on whose behalf it operates. It must learn what to maximize and must function appropriately even while uncertain about the objective.
• We increase coverage of the impact of AI on society, including the vital issues of ethics, fairness, trust, and safety.
• We have moved the exercises from the end of each chapter to an online site. This allows us to continuously add to, update, and improve the exercises, to meet the needs of instructors and to reflect advances in the field and in AI-related software tools.
• Overall, about 25% of the material in the book is brand new. The remaining 75% has been largely rewritten to present a more unified picture of the field. 22% of the citations in this edition are to works published after 2010.
Overview of the book
The main unifying theme is the idea of an intelligent agent. We define AI as the study of agents that receive percepts from the environment and perform actions. Each such agent implements a function that maps percept sequences to actions, and we cover different ways to represent these functions, such as reactive agents, real-time planners, decision-theoretic systems, and deep learning systems. We emphasize learning both as a construction method for competent systems and as a way of extending the reach of the designer into unknown environments. We treat robotics and vision not as independently defined problems, but as occurring in the service of achieving goals. We stress the importance of the task environment in determining the appropriate agent design.
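To make the idea of an agent function concrete, here is a minimal Python sketch of a function that maps a percept sequence to an action. It is an illustration only, not the book's pseudocode: the two-square vacuum world, the percept format `(location, status)`, and the names `make_agent` and `reflex_vacuum` are all assumptions introduced for this example.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical percept format for illustration: (location, status).
Percept = Tuple[str, str]
Action = str


def make_agent(agent_function: Callable[[Sequence[Percept]], Action]):
    """Wrap an agent function so it can be fed one percept at a time.

    The wrapper stores the percept history and passes the whole
    sequence to the agent function on every step.
    """
    history = []

    def step(percept: Percept) -> Action:
        history.append(percept)
        return agent_function(tuple(history))

    return step


def reflex_vacuum(percepts: Sequence[Percept]) -> Action:
    """A reactive agent function: it looks only at the latest percept."""
    location, status = percepts[-1]
    if status == "Dirty":
        return "Suck"
    return "Right" if location == "A" else "Left"


agent = make_agent(reflex_vacuum)
actions = [agent(p) for p in [("A", "Dirty"), ("A", "Clean"), ("B", "Dirty")]]
```

A reactive agent like this one ignores everything but the most recent percept; other designs would consult more of the history before choosing an action.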
Our primary aim is to convey the ideas that have emerged over the past seventy years of AI research and the past two millennia of related work. We have tried to avoid excessive formality in the presentation of these ideas, while retaining precision. We have included mathematical formulas and pseudocode algorithms to make the key ideas concrete; mathematical concepts and notation are described in Appendix A and our pseudocode is described in Appendix B.
This book is primarily intended for use in an undergraduate course or course sequence. The book has 28 chapters, each requiring about a week’s worth of lectures, so working through the whole book requires a two-semester sequence. A one-semester course can use selected chapters to suit the interests of the instructor and students. The book can also be used in a graduate-level course (perhaps with the addition of some of the primary sources suggested in the bibliographical notes), or for self-study or as a reference.
Throughout the book, important points are marked with a triangle icon in the margin. Wherever a new term is defined, it is also noted in the margin. Subsequent significant uses of the term are in bold, but not in the margin. We have included a comprehensive index and an extensive bibliography.
The only prerequisite is familiarity with basic concepts of computer science (algorithms, data structures, complexity) at a sophomore level. Freshman calculus and linear algebra are useful for some of the topics.
Online resources
Online resources are available through pearsonhighered.com/cs-resources or at the book’s Web site, aima.cs.berkeley.edu. There you will find:
• Exercises, programming projects, and research projects. These are no longer at the end of each chapter; they are online only. Within the book, we refer to an online exercise with a name like “Exercise 6.NARY.” Instructions on the Web site allow you to find exercises by name or by topic.
• Implementations of the algorithms in the book in Python, Java, and other programming languages (currently hosted at github.com/aimacode).
• A list of over 1400 schools that have used the book, many with links to online course materials and syllabi.
• Supplementary material and links for students and instructors.
• Instructions on how to report errors in the book, in the likely event that some exist.
Book cover
The cover depicts the final position from the decisive game 6 of the 1997 chess match in which the program Deep Blue defeated Garry Kasparov (playing Black), making this the first time a computer had beaten a world champion in a chess match. Kasparov is shown at the top. To his right is a pivotal position from the second game of the historic Go match between former world champion Lee Sedol and DeepMind’s ALPHAGO program. Move 37 by ALPHAGO violated centuries of Go orthodoxy and was immediately seen by human experts as an embarrassing mistake, but it turned out to be a winning move. At top left is an Atlas humanoid robot built by Boston Dynamics. A depiction of a self-driving car sensing its environment appears between Ada Lovelace, the world’s first computer programmer, and Alan Turing, whose fundamental work defined artificial intelligence. At the bottom of the chess board are a Mars Exploration Rover robot and a statue of Aristotle, who pioneered the study of logic; his planning algorithm from De Motu Animalium appears behind the authors’ names. Behind the chess board is a probabilistic programming model used by the UN Comprehensive Nuclear-Test-Ban Treaty Organization for detecting nuclear explosions from seismic signals.
Acknowledgments
It takes a global village to make a book. Over 600 people read parts of the book and made
suggestions for improvement. The complete list is at aima.cs.berkeley.edu/ack.html; we are grateful to all of them. We have space here to mention only a few especially important contributors. First the contributing writers:
• Judea Pearl (Section 13.5, Causal Networks);
• Vikash Mansinghka (Section 15.3, Programs as Probability Models);
• Michael Wooldridge (Chapter 18, Multiagent Decision Making);
• Ian Goodfellow (Chapter 21, Deep Learning);
• Jacob Devlin and Ming-Wei Chang (Chapter 24, Deep Learning for Natural Language);
• Jitendra Malik and David Forsyth (Chapter 25, Computer Vision);
• Anca Dragan (Chapter 26, Robotics).
Then some key roles:
• Cynthia Yeung and Malika Cantor (project management);
• Julie Sussman and Tom Galloway (copyediting and writing suggestions);
• Omari Stephens (illustrations);
• Tracy Johnson (editor);
• Erin Ault and Rose Kernan (cover and color conversion);
• Nalin Chhibber, Sam Goto, Raymond de Lacaze, Ravi Mohan, Ciaran O’Reilly, Amit Patel, Dragomir Radev, and Samagra Sharma (online code development and mentoring);
• Google Summer of Code students (online code development).
Stuart would like to thank his wife, Loy Sheflott, for her endless patience and boundless wisdom. He hopes that Gordon, Lucy, George, and Isaac will soon be reading this book after they have forgiven him for working so long on it. RUGS (Russell’s Unusual Group of Students) have been unusually helpful, as always.
Peter would like to thank his parents (Torsten and Gerda) for getting him started, and his wife (Kris), children (Bella and Juliet), colleagues, boss, and friends for encouraging and tolerating him through the long hours of writing and rewriting.
About the Authors
Stuart Russell was born in 1962 in Portsmouth, England. He received his B.A. with first-class honours in physics from Oxford University in 1982, and his Ph.D. in computer science from Stanford in 1986. He then joined the faculty of the University of California at Berkeley, where he is a professor and former chair of computer science, director of the Center for Human-Compatible AI, and holder of the Smith-Zadeh Chair in Engineering. In 1990, he received the Presidential Young Investigator Award of the National Science Foundation, and in 1995 he was co-winner of the Computers and Thought Award. He is a Fellow of the American Association for Artificial Intelligence, the Association for Computing Machinery, and the American Association for the Advancement of Science, an Honorary Fellow of Wadham College, Oxford, and an Andrew Carnegie Fellow. He held the Chaire Blaise Pascal in Paris from 2012 to 2014. He has published over 300 papers on a wide range of topics in artificial intelligence. His other books include The Use of Knowledge in Analogy and Induction, Do the Right Thing: Studies in Limited Rationality (with Eric Wefald), and Human Compatible: Artificial Intelligence and the Problem of Control.
Peter Norvig is currently a Director of Research at Google, Inc., and was previously the director responsible for the core Web search algorithms. He co-taught an online AI class that signed up 160,000 students, helping to kick off the current round of massive open online classes. He was head of the Computational Sciences Division at NASA Ames Research Center, overseeing research and development in artificial intelligence and robotics. He received a B.S. in applied mathematics from Brown University and a Ph.D. in computer science from Berkeley. He has been a professor at the University of Southern California and a faculty member at Berkeley and Stanford. He is a Fellow of the American Association for Artificial Intelligence, the Association for Computing Machinery, the American Academy of Arts and Sciences, and the California Academy of Science. His other books are Paradigms of AI Programming: Case Studies in Common Lisp, Verbmobil: A Translation System for Face-to-Face Dialog, and Intelligent Help Systems for UNIX.
The two authors shared the inaugural AAAI/EAAI Outstanding Educator award in 2016.
Contents

I Artificial Intelligence
1 Introduction
  1.1 What Is AI?
  1.2 The Foundations of Artificial Intelligence
  1.3 The History of Artificial Intelligence
  1.4 The State of the Art
  1.5 Risks and Benefits of AI
  Summary
  Bibliographical and Historical Notes
2 Intelligent Agents
  2.1 Agents and Environments
  2.2 Good Behavior: The Concept of Rationality
  2.3 The Nature of Environments
  2.4 The Structure of Agents
  Summary
  Bibliographical and Historical Notes

II Problem-solving
3 Solving Problems by Searching
  3.1 Problem-Solving Agents
  3.2 Example Problems
  3.3 Search Algorithms
  3.4 Uninformed Search Strategies
  3.5 Informed (Heuristic) Search Strategies
  3.6 Heuristic Functions
  Summary
  Bibliographical and Historical Notes
4 Search in Complex Environments
  4.1 Local Search and Optimization Problems
  4.2 Local Search in Continuous Spaces
  4.3 Search with Nondeterministic Actions
  4.4 Search in Partially Observable Environments
  4.5 Online Search Agents and Unknown Environments
  Summary
  Bibliographical and Historical Notes
5 Adversarial Search and Games
  5.1 Game Theory
  5.2 Optimal Decisions in Games
  5.3 Heuristic Alpha-Beta Tree Search
  5.4 Monte Carlo Tree Search
  5.5 Stochastic Games
  5.6 Partially Observable Games
  5.7 Limitations of Game Search Algorithms
  Summary
  Bibliographical and Historical Notes
6 Constraint Satisfaction Problems
  6.1 Defining Constraint Satisfaction Problems
  6.2 Constraint Propagation: Inference in CSPs
  6.3 Backtracking Search for CSPs
  6.4 Local Search for CSPs
  6.5 The Structure of Problems
  Summary
  Bibliographical and Historical Notes

III Knowledge, reasoning, and planning
7 Logical Agents
  7.1 Knowledge-Based Agents
  7.2 The Wumpus World
  7.3 Logic
  7.4 Propositional Logic: A Very Simple Logic
  7.5 Propositional Theorem Proving
  7.6 Effective Propositional Model Checking
  7.7 Agents Based on Propositional Logic
  Summary
  Bibliographical and Historical Notes
8 First-Order Logic
  8.1 Representation Revisited
  8.2 Syntax and Semantics of First-Order Logic
  8.3 Using First-Order Logic
  8.4 Knowledge Engineering in First-Order Logic
  Summary
  Bibliographical and Historical Notes
9 Inference in First-Order Logic
  9.1 Propositional vs. First-Order Inference
  9.2 Unification and First-Order Inference
  9.3 Forward Chaining
  9.4 Backward Chaining
  9.5 Resolution
  Summary
  Bibliographical and Historical Notes
10 Knowledge Representation
  10.1 Ontological Engineering
  10.2 Categories and Objects
  10.3 Events
  10.4 Mental Objects and Modal Logic
  10.5 Reasoning Systems for Categories
  10.6 Reasoning with Default Information
  Summary
  Bibliographical and Historical Notes
11 Automated Planning
  11.1 Definition of Classical Planning
  11.2 Algorithms for Classical Planning
  11.3 Heuristics for Planning
  11.4 Hierarchical Planning
  11.5 Planning and Acting in Nondeterministic Domains
  11.6 Time, Schedules, and Resources
  11.7 Analysis of Planning Approaches
  Summary
  Bibliographical and Historical Notes

IV Uncertain knowledge and reasoning
12 Quantifying Uncertainty
  12.1 Acting under Uncertainty
  12.2 Basic Probability Notation
  12.3 Inference Using Full Joint Distributions
  12.4 Independence
  12.5 Bayes’ Rule and Its Use
  12.6 Naive Bayes Models
  12.7 The Wumpus World Revisited
  Summary
  Bibliographical and Historical Notes
13 Probabilistic Reasoning
  13.1 Representing Knowledge in an Uncertain Domain
  13.2 The Semantics of Bayesian Networks
  13.3 Exact Inference in Bayesian Networks
  13.4 Approximate Inference for Bayesian Networks
  13.5 Causal Networks
  Summary
  Bibliographical and Historical Notes
14 Probabilistic Reasoning over Time
  14.1 Time and Uncertainty
  14.2 Inference in Temporal Models
  14.3 Hidden Markov Models
  14.4 Kalman Filters
  14.5 Dynamic Bayesian Networks
  Summary
  Bibliographical and Historical Notes
15 Probabilistic Programming
  15.1 Relational Probability Models
  15.2 Open-Universe Probability Models
  15.3 Keeping Track of a Complex World
  Summary
  Bibliographical and Historical Notes
16 Making Simple Decisions
  16.1 Combining Beliefs and Desires under Uncertainty
  16.2 The Basis of Utility Theory
  16.3 Utility Functions
  16.4 Multiattribute Utility Functions
  16.5 Decision Networks
  16.6 The Value of Information
  16.7 Unknown Preferences
  Summary
  Bibliographical and Historical Notes
17 Making Complex Decisions
  17.1 Sequential Decision Problems
  17.2 Algorithms for MDPs
  17.3 Bandit Problems
  17.4 Partially Observable MDPs
  17.5 Algorithms for Solving POMDPs
  Summary
  Bibliographical and Historical Notes
18 Multiagent Decision Making
  18.1 Properties of Multiagent Environments
  18.2 Non-Cooperative Game Theory
  18.3 Cooperative Game Theory
  18.4 Making Collective Decisions
  Summary
  Bibliographical and Historical Notes

V Machine Learning
19 Learning from Examples
  19.1 Forms of Learning
  19.2 Supervised Learning
  19.3 Learning Decision Trees
  19.4 Model Selection and Optimization
  19.5 The Theory of Learning
  19.6 Linear Regression and Classification
  19.7 Nonparametric Models
  19.8 Ensemble Learning
  19.9 Developing Machine Learning Systems
  Summary
  Bibliographical and Historical Notes
20 Learning Probabilistic Models
  20.1 Statistical Learning
  20.2 Learning with Complete Data
  20.3 Learning with Hidden Variables: The EM Algorithm
  Summary
  Bibliographical and Historical Notes
21 Deep Learning
  21.1 Simple Feedforward Networks
  21.2 Computation Graphs for Deep Learning
  21.3 Convolutional Networks
  21.4 Learning Algorithms
  21.5 Generalization
  21.6 Recurrent Neural Networks
  21.7 Unsupervised Learning and Transfer Learning
  21.8 Applications
  Summary
  Bibliographical and Historical Notes
22 Reinforcement Learning
  22.1 Learning from Rewards
  22.2 Passive Reinforcement Learning
  22.3 Active Reinforcement Learning
  22.4 Generalization in Reinforcement Learning
  22.5 Policy Search
  22.6 Apprenticeship and Inverse Reinforcement Learning
  22.7 Applications of Reinforcement Learning
  Summary
  Bibliographical and Historical Notes

VI Communicating, perceiving, and acting
23 Natural Language Processing
  23.1 Language Models
  23.2 Grammar
  23.3 Parsing
  23.4 Augmented Grammars
  23.5 Complications of Real Natural Language
  23.6 Natural Language Tasks
  Summary
  Bibliographical and Historical Notes
24 Deep Learning for Natural Language Processing
  24.1 Word Embeddings
  24.2 Recurrent Neural Networks for NLP
  24.3 Sequence-to-Sequence Models
  24.4 The Transformer Architecture
  24.5 Pretraining and Transfer Learning
  24.6 State of the art
  Summary
  Bibliographical and Historical Notes
25 Computer Vision
  25.1 Introduction
  25.2 Image Formation
  25.3 Simple Image Features
  25.4 Classifying Images
  25.5 Detecting Objects
  25.6 The 3D World
  25.7 Using Computer Vision
  Summary
  Bibliographical and Historical Notes
26 Robotics
  26.1 Robots
  26.2 Robot Hardware
  26.3 What kind of problem is robotics solving?
  26.4 Robotic Perception
  26.5 Planning and Control
  26.6 Planning Uncertain Movements
  26.7 Reinforcement Learning in Robotics
  26.8 Humans and Robots
  26.9 Alternative Robotic Frameworks
  26.10 Application Domains
  Summary
  Bibliographical and Historical Notes

VII Conclusions
27 Philosophy, Ethics, and Safety of AI
  27.1 The Limits of AI
  27.2 Can Machines Really Think?
  27.3 The Ethics of AI
  Summary
  Bibliographical and Historical Notes
28 The Future of AI
  28.1 AI Components
  28.2 AI Architectures

A Mathematical Background
  A.1 Complexity Analysis and O() Notation
  A.2 Vectors, Matrices, and Linear Algebra
  A.3 Probability Distributions
  Bibliographical and Historical Notes
B Notes on Languages and Algorithms
  B.1 Defining Languages with Backus-Naur Form (BNF)
  B.2 Describing Algorithms with Pseudocode
  B.3 Online Supplemental Material

Bibliography
Index
1 INTRODUCTION
In which we try to explain why we consider artificial intelligence to be a subject most worthy of study, and in which we try to decide what exactly it is, this being a good thing to decide before embarking.
We call ourselves Homo sapiens (man the wise) because our intelligence is so important to us. For thousands of years, we have tried to understand how we think and act; that is, how our brain, a mere handful of matter, can perceive, understand, predict, and manipulate a world far larger and more complicated than itself. The field of artificial intelligence, or AI, is concerned with not just understanding but also building intelligent entities: machines that can compute how to act effectively and safely in a wide variety of novel situations.
Surveys regularly rank AI as one of the most interesting and fastest-growing fields, and it is already generating over a trillion dollars a year in revenue. AI expert Kai-Fu Lee predicts that its impact will be “more than anything in the history of mankind.” Moreover, the intellectual frontiers of AI are wide open. Whereas a student of an older science such as physics might feel that the best ideas have already been discovered by Galileo, Newton, Curie, Einstein, and the rest, AI still has many openings for full-time masterminds.
AI currently encompasses a huge variety of subfields, ranging from the general (learning, reasoning, perception, and so on) to the specific, such as playing chess, proving mathematical theorems, writing poetry, driving a car, or diagnosing diseases. AI is relevant to any intellectual task; it is truly a universal field.
1.1 What Is AI?
We have claimed that AI is interesting, but we have not said what it is. Historically, researchers have pursued several different versions of AI. Some have defined intelligence in terms of fidelity to human performance, while others prefer an abstract, formal definition of intelligence called rationality: loosely speaking, doing the “right thing.” The subject matter itself also varies: some consider intelligence to be a property of internal thought processes and reasoning, while others focus on intelligent behavior, an external characterization.¹
From these two dimensions (human vs. rational² and thought vs. behavior) there are four possible combinations, and there have been adherents and research programs for all four. The methods used are necessarily different: the pursuit of human-like intelligence must be in part an empirical science related to psychology, involving observations and hypotheses about actual human behavior and thought processes; a rationalist approach, on the other hand, involves a combination of mathematics and engineering, and connects to statistics, control theory, and economics. The various groups have both disparaged and helped each other. Let us look at the four approaches in more detail.
¹ In the public eye, there is sometimes confusion between the terms “artificial intelligence” and “machine learning.” Machine learning is a subfield of AI that studies the ability to improve performance based on experience. Some AI systems use machine learning methods to achieve competence, but some do not.
² We are not suggesting that humans are “irrational” in the dictionary sense of “deprived of normal mental clarity.” We are merely conceding that human decisions are not always mathematically perfect.
1.1.1 Acting humanly: The Turing test approach
The Turing test, proposed by Alan Turing (1950), was designed as a thought experiment that would sidestep the philosophical vagueness of the question “Can a machine think?” A computer passes the test if a human interrogator, after posing some written questions, cannot tell whether the written responses come from a person or from a computer. Chapter 27 discusses the details of the test and whether a computer would really be intelligent if it passed. For now, we note that programming a computer to pass a rigorously applied test provides plenty to work on. The computer would need the following capabilities:
• natural language processing to communicate successfully in a human language;
• knowledge representation to store what it knows or hears;
• automated reasoning to answer questions and to draw new conclusions;
• machine learning to adapt to new circumstances and to detect and extrapolate patterns.
Turing viewed the physical simulation of a person as unnecessary to demonstrate intelligence. However, other researchers have proposed a total Turing test, which requires interaction with objects and people in the real world. To pass the total Turing test, a robot will need
• computer vision and speech recognition to perceive the world;
• robotics to manipulate objects and move about.
These six disciplines compose most of AI. Yet AI researchers have devoted little effort to passing the Turing test, believing that it is more important to study the underlying principles of intelligence. The quest for “artificial flight” succeeded when engineers and inventors stopped imitating birds and started using wind tunnels and learning about aerodynamics. Aeronautical engineering texts do not define the goal of their field as making “machines that fly so exactly like pigeons that they can fool even other pigeons.”
1.1.2 Thinking humanly: The cognitive modeling approach
To say that a program thinks like a human, we must know how humans think. We can learn about human thought in three ways:
• introspection: trying to catch our own thoughts as they go by;
• psychological experiments: observing a person in action;
• brain imaging: observing the brain in action.
Once we have a sufficiently precise theory of the mind, it becomes possible to express the theory as a computer program. If the program’s input-output behavior matches corresponding human behavior, that is evidence that some of the program’s mechanisms could also be operating in humans.
For example, Allen Newell and Herbert Simon, who developed GPS, the “General Problem Solver” (Newell and Simon, 1961), were not content merely to have their program solve problems correctly. They were more concerned with comparing the sequence and timing of its reasoning steps to those of human subjects solving the same problems. The interdisciplinary field of cognitive science brings together computer models from AI and experimental techniques from psychology to construct precise and testable theories of the human mind.
Cognitive science is a fascinating field in itself, worthy of several textbooks and at least one encyclopedia (Wilson and Keil, 1999). We will occasionally comment on similarities or differences between AI techniques and human cognition. Real cognitive science, however, is necessarily based on experimental investigation of actual humans or animals. We will leave that for other books, as we assume the reader has only a computer for experimentation.
In the early days of AI there was often confusion between the approaches. An author would argue that an algorithm performs well on a task and that it is therefore a good model of human performance, or vice versa. Modern authors separate the two kinds of claims; this distinction has allowed both AI and cognitive science to develop more rapidly. The two fields fertilize each other, most notably in computer vision, which incorporates neurophysiological evidence into computational models. Recently, the combination of neuroimaging methods with machine learning techniques for analyzing such data has led to the beginnings of a capability to “read minds,” that is, to ascertain the semantic content of a person’s inner thoughts. This capability could, in turn, shed further light on how human cognition works.
1.1.3 Thinking rationally: The “laws of thought” approach
The Greek philosopher Aristotle was one of the first to attempt to codify “right thinking,” that is, irrefutable reasoning processes. His syllogisms provided patterns for argument structures that always yielded correct conclusions when given correct premises. The canonical example starts with Socrates is a man and all men are mortal and concludes that Socrates is mortal. (This example is probably due to Sextus Empiricus rather than Aristotle.) These laws of thought were supposed to govern the operation of the mind; their study initiated the field called logic.
Logicians in the 19th century developed a precise notation for statements about objects in the world and the relations among them. (Contrast this with ordinary arithmetic notation, which provides only for statements about numbers.) By 1965, programs could, in principle, solve any solvable problem described in logical notation. The so-called logicist tradition within artificial intelligence hopes to build on such programs to create intelligent systems.
Logic as conventionally understood requires knowledge of the world that is certain, a condition that, in reality, is seldom achieved. We simply don’t know the rules of, say, politics or warfare in the same way that we know the rules of chess or arithmetic. The theory of probability fills this gap, allowing rigorous reasoning with uncertain information. In principle, it allows the construction of a comprehensive model of rational thought, leading from raw perceptual information to an understanding of how the world works to predictions about the future. What it does not do is generate intelligent behavior. For that, we need a theory of rational action. Rational thought, by itself, is not enough.
1.1.4 Acting rationally: The rational agent approach
An agent is just something that acts (agent comes from the Latin agere, to do). Of course, all computer programs do something, but computer agents are expected to do more: operate autonomously, perceive their environment, persist over a prolonged time period, adapt to change, and create and pursue goals. A rational agent is one that acts so as to achieve the best outcome or, when there is uncertainty, the best expected outcome.
In the “laws of thought” approach to AI, the emphasis was on correct inferences. Making correct inferences is sometimes part of being a rational agent, because one way to act rationally is to deduce that a given action is best and then to act on that conclusion. On the other hand, there are ways of acting rationally that cannot be said to involve inference. For example, recoiling from a hot stove is a reflex action that is usually more successful than a slower action taken after careful deliberation.
All the skills needed for the Turing test also allow an agent to act rationally. Knowledge representation and reasoning enable agents to reach good decisions. We need to be able to generate comprehensible sentences in natural language to get by in a complex society. We need learning not only for erudition, but also because it improves our ability to generate effective behavior, especially in circumstances that are new.
The rational-agent approach to AI has two advantages over the other approaches. First, it is more general than the “laws of thought” approach because correct inference is just one of several possible mechanisms for achieving rationality. Second, it is more amenable to scientific development. The standard of rationality is mathematically well defined and completely general. We can often work back from this specification to derive agent designs that provably achieve it, something that is largely impossible if the goal is to imitate human behavior or thought processes.
For these reasons, the rational-agent approach to AI has prevailed throughout most of the field’s history. In the early decades, rational agents were built on logical foundations and formed definite plans to achieve specific goals. Later, methods based on probability theory and machine learning allowed the creation of agents that could make decisions under uncertainty to attain the best expected outcome. In a nutshell, AI has focused on the study and construction of agents that do the right thing. What counts as the right thing is defined by the objective that we provide to the agent. This general paradigm is so pervasive that we might call it the standard model. It prevails not only in AI but also in control theory, where a controller minimizes a cost function; in operations research, where a policy maximizes a sum of rewards; in statistics, where a decision rule minimizes a loss function; and in economics, where a decision maker maximizes utility or some measure of social welfare.
We need to make one important refinement to the standard model to account for the fact that perfect rationality (always taking the exactly optimal action) is not feasible in complex environments. The computational demands are just too high. Chapters 5 and 17 deal with the issue of limited rationality: acting appropriately when there is not enough time to do all the computations one might like. However, perfect rationality often remains a good starting point for theoretical analysis.
1.1.5 Beneficial machines
The standard model has been a useful guide for AI research since its inception, but it is probably not the right model in the long run. The reason is that the standard model assumes that we will supply a fully specified objective to the machine.
For an artificially defined task such as chess or shortest-path computation, the task comes with an objective built in, so the standard model is applicable. As we move into the real world, however, it becomes more and more difficult to specify the objective completely and correctly. For example, in designing a self-driving car, one might think that the objective is to reach the destination safely. But driving along any road incurs a risk of injury due to other errant drivers, equipment failure, and so on; thus, a strict goal of safety requires staying in the garage. There is a tradeoff between making progress towards the destination and incurring a risk of injury. How should this tradeoff be made? Furthermore, to what extent can we allow the car to take actions that would annoy other drivers? How much should the car moderate its acceleration, steering, and braking to avoid shaking up the passenger? These kinds of questions are difficult to answer a priori. They are particularly problematic in the general area of human-robot interaction, of which the self-driving car is one example.
The problem of achieving agreement between our true preferences and the objective we put into the machine is called the value alignment problem: the values or objectives put into the machine must be aligned with those of the human. If we are developing an AI system in the lab or in a simulator (as has been the case for most of the field’s history), there is an easy fix for an incorrectly specified objective: reset the system, fix the objective, and try again. As the field progresses towards increasingly capable intelligent systems that are deployed in the real world, this approach is no longer viable. A system deployed with an incorrect objective will have negative consequences. Moreover, the more intelligent the system, the more negative the consequences.
Returning to the apparently unproblematic example of chess, consider what happens if the machine is intelligent enough to reason and act beyond the confines of the chessboard. In that case, it might attempt to increase its chances of winning by such ruses as hypnotizing or blackmailing its opponent or bribing the audience to make rustling noises during its opponent’s thinking time.³ It might also attempt to hijack additional computing power for itself. These behaviors are not “unintelligent” or “insane”; they are a logical consequence of defining winning as the sole objective for the machine. It is impossible to anticipate all the ways in which a machine pursuing a fixed objective might misbehave. There is good reason, then, to think that the standard model is inadequate.
We don’t want machines that are intelligent in the sense of pursuing their objectives; we want them to pursue our objectives. If we cannot transfer those objectives perfectly to the machine, then we need a new formulation, one in which the machine is pursuing our objectives, but is necessarily uncertain as to what they are. When a machine knows that it doesn’t know the complete objective, it has an incentive to act cautiously, to ask permission, to learn more about our preferences through observation, and to defer to human control. Ultimately, we want agents that are provably beneficial to humans. We will return to this topic in Section 1.5.
1.2 The Foundations of Artificial Intelligence
In this section, we provide a brief history of the disciplines that contributed ideas, viewpoints, and techniques to AI. Like any history, this one concentrates on a small number of people, events, and ideas and ignores others that also were important. We organize the history around a series of questions. We certainly would not wish to give the impression that these questions are the only ones the disciplines address or that the disciplines have all been working toward AI as their ultimate fruition.
3 In one of the first books on chess, Ruy Lopez (1561) wrote, “Always place the board so the sun is in your opponent’s eyes.”
1.2.1 Philosophy
• Can formal rules be used to draw valid conclusions?
• How does the mind arise from a physical brain?
• Where does knowledge come from?
• How does knowledge lead to action?
Aristotle (384-322 BCE) was the first to formulate a precise set of laws governing the rational part of the mind. He developed an informal system of syllogisms for proper reasoning, which in principle allowed one to generate conclusions mechanically, given initial premises. Ramon Llull (c. 1232-1315) devised a system of reasoning published as Ars Magna or The Great Art (1305). Llull tried to implement his system using an actual mechanical device: a set of paper wheels that could be rotated into different permutations.
Around 1500, Leonardo da Vinci (1452-1519) designed but did not build a mechanical calculator; recent reconstructions have shown the design to be functional. The first known calculating machine was constructed around 1623 by the German scientist Wilhelm Schickard (1592-1635). Blaise Pascal (1623-1662) built the Pascaline in 1642 and wrote that it “produces effects which appear nearer to thought than all the actions of animals.” Gottfried Wilhelm Leibniz (1646-1716) built a mechanical device intended to carry out operations on concepts rather than numbers, but its scope was rather limited. In his 1651 book Leviathan, Thomas Hobbes (1588-1679) suggested the idea of a thinking machine, an “artificial animal” in his words, arguing “For what is the heart but a spring; and the nerves, but so many strings; and the joints, but so many wheels.” He also suggested that reasoning was like numerical computation: “For ‘reason’ ... is nothing but ‘reckoning,’ that is adding and subtracting.”
It’s one thing to say that the mind operates, at least in part, according to logical or numerical rules, and to build physical systems that emulate some of those rules. It’s another to say that the mind itself is such a physical system. René Descartes (1596-1650) gave the first clear discussion of the distinction between mind and matter. He noted that a purely physical conception of the mind seems to leave little room for free will. If the mind is governed entirely by physical laws, then it has no more free will than a rock “deciding” to fall downward. Descartes was a proponent of dualism. He held that there is a part of the human mind (or soul or spirit) that is outside of nature, exempt from physical laws. Animals, on the other hand, did not possess this dual quality; they could be treated as machines.
An alternative to dualism is materialism, which holds that the brain’s operation according to the laws of physics constitutes the mind. Free will is simply the way that the perception of available choices appears to the choosing entity. The terms physicalism and naturalism are also used to describe this view that stands in contrast to the supernatural.
Given a physical mind that manipulates knowledge, the next problem is to establish the source of knowledge. The empiricism movement, starting with Francis Bacon’s (1561-1626) Novum Organum,⁴ is characterized by a dictum of John Locke (1632-1704): “Nothing is in the understanding, which was not first in the senses.” David Hume’s (1711-1776) A Treatise of Human Nature (Hume, 1739) proposed what is now known as the principle of induction: that general rules are acquired by exposure to repeated associations between their elements.
4 The Novum Organum is an update of Aristotle’s Organon, or instrument of thought.
Building on the work of Ludwig Wittgenstein (1889-1951) and Bertrand Russell (1872-1970), the famous Vienna Circle (Sigmund, 2017), a group of philosophers and mathematicians meeting in Vienna in the 1920s and 1930s, developed the doctrine of logical positivism. This doctrine holds that all knowledge can be characterized by logical theories connected, ultimately, to observation sentences that correspond to sensory inputs; thus logical positivism combines rationalism and empiricism. The confirmation theory of Rudolf Carnap (1891-1970) and Carl Hempel (1905-1997) attempted to analyze the acquisition of knowledge from experience by quantifying the degree of belief that should be assigned to logical sentences based on their connection to observations that confirm or disconfirm them. Carnap’s book The Logical Structure of the World (1928) was perhaps the first theory of mind as a computational process.
The final element in the philosophical picture of the mind is the connection between knowledge and action. This question is vital to AI because intelligence requires action as well as reasoning. Moreover, only by understanding how actions are justified can we understand how to build an agent whose actions are justifiable (or rational).
Aristotle argued (in De Motu Animalium) that actions are justified by a logical connection between goals and knowledge of the action’s outcome:
But how does it happen that thinking is sometimes accompanied by action and sometimes not, sometimes by motion, and sometimes not? It looks as if almost the same thing happens as in the case of reasoning and making inferences about unchanging objects. But in that case the end is a speculative proposition ... whereas here the conclusion which results from the two premises is an action. ... I need covering; a cloak is a covering. I need a cloak. What I need, I have to make; I need a cloak. I have to make a cloak. And the conclusion, the “I have to make a cloak,” is an action.
In the Nicomachean Ethics (Book III. 3, 1112b), Aristotle further elaborates on this topic, suggesting an algorithm:
We deliberate not about ends, but about means. For a doctor does not deliberate whether he shall heal, nor an orator whether he shall persuade, ... They assume the end and consider how and by what means it is attained, and if it seems easily and best produced thereby; while if it is achieved by one means only they consider how it will be achieved by this and by what means this will be achieved, till they come to the first cause, ... and what is last in the order of analysis seems to be first in the order of becoming. And if we come on an impossibility, we give up the search, e.g., if we need money and this cannot be got; but if a thing appears possible we try to do it.
Aristotle’s algorithm was implemented 2300 years later by Newell and Simon in their General Problem Solver program. We would now call it a greedy regression planning system (see Chapter 11). Methods based on logical planning to achieve definite goals dominated the first few decades of theoretical research in AI.
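The deliberation Aristotle describes, working backward from the end to some means that is immediately available, can be sketched as a small backward-chaining search. This is only an illustration of the idea, not a reconstruction of the General Problem Solver; the MEANS table, the goal names, and the function regress are hypothetical.

```python
# Hypothetical means-ends table: each goal maps to alternative sets of
# subgoals that would achieve it. An empty set means the goal can be
# achieved directly (the "first cause" in Aristotle's wording).
MEANS = {
    "have cloak": [{"make cloak"}],
    "make cloak": [{"have cloth", "have needle"}],
    "have cloth": [{"buy cloth"}, {"weave cloth"}],
    "buy cloth": [{"have money"}],
    "weave cloth": [set()],
    "have needle": [set()],
    # "have money" has no entry: in this toy world it "cannot be got".
}


def regress(goal, means=MEANS, visiting=frozenset()):
    """Work backward from goal. Return the goals in the order they would be
    achieved, or None if the search hits an impossibility."""
    if goal in visiting or goal not in means:
        return None  # circular or unachievable: give up on this branch
    for subgoals in means[goal]:  # greedy: accept the first means that works
        plan = []
        for sub in sorted(subgoals):  # sorted only to make the order deterministic
            sub_plan = regress(sub, means, visiting | {goal})
            if sub_plan is None:
                break  # this means fails; try the next alternative
            plan.extend(step for step in sub_plan if step not in plan)
        else:
            return plan + [goal]
    return None


plan = regress("have cloak")
```

In this toy table, buying cloth fails because money cannot be obtained, so the search falls back to weaving. The step found last in the analysis (weave cloth) comes first in the resulting plan.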
Thinking purely in terms of actions achieving goals is often useful but sometimes inapplicable. For example, if there are several different ways to achieve a goal, there needs to be some way to choose among them. More importantly, it may not be possible to achieve a goal with certainty, but some action must still be taken. How then should one decide? Antoine Arnauld (1662), analyzing the notion of rational decisions in gambling, proposed a quantitative formula for maximizing the expected monetary value of the outcome. Later, Daniel Bernoulli (1738) introduced the more general notion of utility to capture the internal, subjective value of an outcome. The modern notion of rational decision making under uncertainty involves maximizing expected utility, as explained in Chapter 16.
In matters of ethics and public policy, a decision maker must consider the interests of multiple individuals. Jeremy Bentham (1823) and John Stuart Mill (1863) promoted the idea of utilitarianism: that rational decision making based on maximizing utility should apply to all spheres of human activity, including public policy decisions made on behalf of many individuals. Utilitarianism is a specific kind of consequentialism: the idea that what is right and wrong is determined by the expected outcomes of an action.
In contrast, Immanuel Kant, in 1785, proposed a theory of rule-based or deontological ethics, in which “doing the right thing” is determined not by outcomes but by universal social laws that govern allowable actions, such as “don’t lie” or “don’t kill.” Thus, a utilitarian could tell a white lie if the expected good outweighs the bad, but a Kantian would be bound not to, because lying is inherently wrong. Mill acknowledged the value of rules, but understood them as efficient decision procedures compiled from first-principles reasoning about consequences. Many modern AI systems adopt exactly this approach.
1.2.2 Mathematics
• What are the formal rules to draw valid conclusions?
• What can be computed?
• How do we reason with uncertain information?
Philosophers staked out some of the fundamental ideas of AI, but the leap to a formal science required the mathematization of logic and probability and the introduction of a new branch of mathematics: computation.
The idea of formal logic can be traced back to the philosophers of ancient Greece, India, and China, but its mathematical development really began with the work of George Boole (1815-1864), who worked out the details of propositional, or Boolean, logic (Boole, 1847). In 1879, Gottlob Frege (1848-1925) extended Boole’s logic to include objects and relations, creating the first-order logic that is used today.⁵ In addition to its central role in the early period of AI research, first-order logic motivated the work of Gödel and Turing that underpinned computation itself, as we explain below.
The theory of probability can be seen as generalizing logic to situations with uncertain information, a consideration of great importance for AI. Gerolamo Cardano (1501-1576) first framed the idea of probability, describing it in terms of the possible outcomes of gambling events. In 1654, Blaise Pascal (1623-1662), in a letter to Pierre Fermat (1601-1665), showed how to predict the future of an unfinished gambling game and assign average payoffs to the gamblers. Probability quickly became an invaluable part of the quantitative sciences, helping to deal with uncertain measurements and incomplete theories. Jacob Bernoulli (1654-1705, uncle of Daniel), Pierre Laplace (1749-1827), and others advanced the theory and introduced new statistical methods. Thomas Bayes (1702-1761) proposed a rule for updating probabilities in the light of new evidence; Bayes’ rule is a crucial tool for AI systems.
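As a small numerical illustration of updating a probability in the light of new evidence, the sketch below applies Bayes' rule, P(H | E) = P(E | H) P(H) / P(E), to a single hypothesis H and a single piece of evidence E. The function name bayes_update and all the numbers are assumptions chosen for illustration.

```python
def bayes_update(prior, likelihood, likelihood_given_not):
    """Return the posterior P(H | E) from the prior P(H), the likelihood
    P(E | H), and the likelihood of the evidence when H is false, P(E | not H)."""
    # Total probability of the evidence, summed over H and not-H.
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: H is rare (1%), and the evidence is nine times as
# likely as not when H holds but still shows up 5% of the time otherwise.
posterior = bayes_update(prior=0.01, likelihood=0.9, likelihood_given_not=0.05)
```

With these made-up numbers the posterior is roughly 0.15: the evidence raises the probability of H considerably, but the low prior keeps it well below one half.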
The formalization of probability, combined with the availability of data, led to the emergence of statistics as a field. One of the first uses was John Graunt’s analysis of London census data in 1662. Ronald Fisher is considered the first modern statistician (Fisher, 1922). He brought together the ideas of probability, experiment design, analysis of data, and computing: in 1919, he insisted that he couldn’t do his work without a mechanical calculator called the MILLIONAIRE (the first calculator that could do multiplication), even though the cost of the calculator was more than his annual salary (Ross, 2012).
5 Frege’s proposed notation for first-order logic (an arcane combination of textual and geometric features) never became popular.
The history of computation is as old as the history of numbers, but the first nontrivial algorithm is thought to be Euclid’s algorithm for computing greatest common divisors. The word algorithm comes from Muhammad ibn Musa al-Khwarizmi, a 9th century mathematician, whose writings also introduced Arabic numerals and algebra to Europe. Boole and others discussed algorithms for logical deduction, and, by the late 19th century, efforts were under way to formalize general mathematical reasoning as logical deduction.
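For readers who have not seen it, Euclid's algorithm mentioned above can be written in a few lines. This is the usual modern remainder-based formulation, not Euclid's original presentation, and the function name gcd is our own choice.

```python
def gcd(a, b):
    """Greatest common divisor of two non-negative integers, computed by
    repeatedly replacing the pair (a, b) with (b, a mod b)."""
    while b != 0:
        a, b = b, a % b
    return a


example = gcd(48, 18)
```

Each step keeps the set of common divisors unchanged while shrinking the second number, so the loop must stop, and the value left in a is the greatest common divisor.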
Kurt Gödel (1906-1978) showed that there exists an effective procedure to prove any true statement in the first-order logic of Frege and Russell, but that first-order logic could not capture the principle of mathematical induction needed to characterize the natural numbers. In 1931, Gödel showed that limits on deduction do exist. His incompleteness theorem showed that in any formal theory as strong as Peano arithmetic (the elementary theory of natural numbers), there are necessarily true statements that have no proof within the theory.
This fundamental result can also be interpreted as showing that some functions on the integers cannot be represented by an algorithm; that is, they cannot be computed. This motivated Alan Turing (1912-1954) to try to characterize exactly which functions are computable, that is, capable of being computed by an effective procedure. The Church-Turing thesis proposes to identify the general notion of computability with functions computed by a Turing machine (Turing, 1936). Turing also showed that there were some functions that no Turing machine can compute. For example, no machine can tell in general whether a given program will return an answer on a given input or run forever.
Although computability is important to an understanding of computation, the notion of tractability has had an even greater impact on AI. Roughly speaking, a problem is called intractable if the time required to solve instances of the problem grows exponentially with the size of the instances. The distinction between polynomial and exponential growth in complexity was first emphasized in the mid-1960s (Cobham, 1964; Edmonds, 1965). It is important because exponential growth means that even moderately large instances cannot be solved in any reasonable time.
The theory of NP-completeness, pioneered by Cook (1971) and Karp (1972), provides a basis for analyzing the tractability of problems: any problem class to which the class of NP-complete problems can be reduced is likely to be intractable. (Although it has not been proved that NP-complete problems are necessarily intractable, most theoreticians believe it.) These results contrast with the optimism with which the popular press greeted the first computers: “Electronic Super-Brains” that were “Faster than Einstein!” Despite the increasing speed of computers, careful use of resources and necessary imperfection will characterize intelligent systems. Put crudely, the world is an extremely large problem instance!
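To make the gap between polynomial and exponential growth concrete, the sketch below compares the running time of a hypothetical cubic algorithm with that of a hypothetical 2^n algorithm on an instance of size 100. The machine speed of one billion steps per second and the function names are assumptions for illustration only.

```python
def steps_polynomial(n, degree=3):
    """Number of steps taken by a hypothetical n**degree algorithm."""
    return n ** degree


def steps_exponential(n, base=2):
    """Number of steps taken by a hypothetical base**n algorithm."""
    return base ** n


def seconds(steps, steps_per_second=10**9):
    """Running time on an assumed machine executing 10**9 steps per second."""
    return steps / steps_per_second


n = 100
poly_time = seconds(steps_polynomial(n))   # one million steps
expo_time = seconds(steps_exponential(n))  # 2**100 steps
```

Under these assumptions the cubic algorithm finishes in about a millisecond, whereas the exponential one would need more than 10^21 seconds, far longer than the age of the universe. A faster machine changes the constants but not the conclusion.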
1.2.3 Economics
• How should we make decisions in accordance with our preferences?
• How should we do this when others may not go along?
• How should we do this when the payoff may be far in the future?
The science of economics originated in 1776, when Adam Smith (1723-1790) published An Inquiry into the Nature and Causes of the Wealth of Nations. Smith proposed to analyze economies as consisting of many individual agents attending to their own interests. Smith was not, however, advocating financial greed as a moral position: his earlier (1759) book The Theory of Moral Sentiments begins by pointing out that concern for the well-being of others is an essential component of the interests of every individual.
Most people think of economics as being about money, and indeed the first mathematical analysis of decisions under uncertainty, the maximum-expected-value formula of Arnauld (1662), dealt with the monetary value of bets. Daniel Bernoulli (1738) noticed that this formula didn’t seem to work well for larger amounts of money, such as investments in maritime trading expeditions. He proposed instead a principle based on maximization of expected utility, and explained human investment choices by proposing that the marginal utility of an additional quantity of money diminished as one acquired more money.
Léon Walras (pronounced “Valrasse”) (1834-1910) gave utility theory a more general foundation in terms of preferences between gambles on any outcomes (not just monetary outcomes). The theory was improved by Ramsey (1931) and later by John von Neumann and Oskar Morgenstern in their book The Theory of Games and Economic Behavior (1944). Economics is no longer the study of money; rather it is the study of desires and preferences.
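The difference between maximizing expected monetary value and maximizing expected utility can be seen in a small worked example. The sketch below assumes a logarithmic utility of total wealth, one standard way of modeling diminishing marginal utility; the wealth level, the gamble, and the function names are hypothetical.

```python
import math


def expected_value(lottery):
    """Expected change in wealth; lottery is a list of (probability, change) pairs."""
    return sum(p * change for p, change in lottery)


def expected_log_utility(wealth, lottery):
    """Expected utility when utility is the logarithm of final wealth, so that
    each extra unit of money is worth less the more one already has."""
    return sum(p * math.log(wealth + change) for p, change in lottery)


wealth = 1000.0
gamble = [(0.5, 1000.0), (0.5, -900.0)]  # even odds: double up or lose 900
decline = [(1.0, 0.0)]                   # keep current wealth for certain

monetary_gain = expected_value(gamble)
prefers_decline = expected_log_utility(wealth, decline) > expected_log_utility(wealth, gamble)
```

The gamble has a positive expected monetary value (a gain of 50), so an expected-value maximizer would accept it. The log-utility decision maker declines, because the loss of 900 hurts more than the gain of 1000 helps.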
Decision theory, which combines probability theory with utility theory, provides a formal and complete framework for individual decisions (economic or otherwise) made under uncertainty, that is, in cases where probabilistic descriptions appropriately capture the decision maker’s environment. This is suitable for “large” economies where each agent need pay no attention to the actions of other agents as individuals. For “small” economies, the situation is much more like a game: the actions of one player can significantly affect the utility of another (either positively or negatively). Von Neumann and Morgenstern’s development of game theory (see also Luce and Raiffa, 1957) included the surprising result that, for some games, a rational agent should adopt policies that are (or at least appear to be) randomized. Unlike decision theory, game theory does not offer an unambiguous prescription for selecting actions. In AI, decisions involving multiple agents are studied under the heading of multiagent systems (Chapter 18).
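The game of matching pennies gives a simple illustration of why a randomized policy can be rational. Two players each show heads or tails; the row player wins 1 if the coins match and loses 1 otherwise. The sketch below, with hypothetical names PAYOFF and worst_case_payoff, computes what the row player can guarantee against an opponent who knows the row player's policy and replies as well as possible.

```python
# Payoff to the row player: +1 if the two coins match, -1 otherwise.
PAYOFF = {("H", "H"): 1, ("H", "T"): -1, ("T", "H"): -1, ("T", "T"): 1}


def worst_case_payoff(p_heads):
    """Expected payoff to the row player, who shows heads with probability
    p_heads, against the opponent's best deterministic reply."""
    def expected(opponent_move):
        return (p_heads * PAYOFF[("H", opponent_move)]
                + (1 - p_heads) * PAYOFF[("T", opponent_move)])
    return min(expected("H"), expected("T"))


always_heads = worst_case_payoff(1.0)
fair_coin = worst_case_payoff(0.5)
```

Any deterministic policy can be exploited and guarantees only a loss of 1, while the fair-coin policy guarantees an expected payoff of 0, the best the row player can secure in this game.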
Economists, with some exceptions, did not address the third question listed above: how to make rational decisions when payoffs from actions are not immediate but instead result from several actions taken in sequence. This topic was pursued in the field of operations research, which emerged in World War II from efforts in Britain to optimize radar installations, and later found innumerable civilian applications. The work of Richard Bellman (1957) formalized a class of sequential decision problems called Markov decision processes, which we study in Chapter 17 and, under the heading of reinforcement learning, in Chapter 22.
Work in economics and operations research has contributed much to our notion of rational agents, yet for many years AI research developed along entirely separate paths. One reason was the apparent complexity of making rational decisions. The pioneering AI researcher Herbert Simon (1916-2001) won the Nobel Prize in economics in 1978 for his early work showing that models based on satisficing (making decisions that are “good enough,” rather than laboriously calculating an optimal decision) gave a better description of actual human behavior (Simon, 1947). Since the 1990s, there has been a resurgence of interest in decision-theoretic techniques for AI.
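The contrast between satisficing and optimizing can be sketched in a few lines. This is a toy illustration rather than Simon's own model: the aspiration level, the option scores, and the function names are assumptions.

```python
def optimize(options, utility):
    """Evaluate every option; return the best one and the number of evaluations."""
    scored = [(utility(option), option) for option in options]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, len(scored)


def satisfice(options, utility, aspiration):
    """Return the first option that is "good enough" (utility at or above the
    aspiration level), together with the number of evaluations made."""
    for count, option in enumerate(options, start=1):
        if utility(option) >= aspiration:
            return option, count
    return None, len(options)  # nothing met the aspiration level


scores = [3, 7, 8, 5, 9, 6]  # hypothetical options, scored by their own value
good_enough = satisfice(scores, lambda x: x, aspiration=7)
best = optimize(scores, lambda x: x)
```

Here the satisficer stops after two evaluations with an option worth 7, while the optimizer examines all six options to find the one worth 9. When each evaluation is costly, stopping early can be the more sensible policy.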
1.2.4 Neuroscience
• How do brains process information?
Neuroscience is the study of the nervous system, particularly the brain. Although the exact way in which the brain enables thought is one of the great mysteries of science, the fact that it does enable thought has been appreciated for thousands of years because of the evidence that strong blows to the head can lead to mental incapacitation. It has also long been known that human brains are somehow different; in about 335 BCE Aristotle wrote, “Of all the animals, man has the largest brain in proportion to his size.”⁶ Still, it was not until the middle of the 18th century that the brain was widely recognized as the seat of consciousness. Before then, candidate locations included the heart and the spleen.
Paul Broca’s (1824-1880) investigation of aphasia (speech deficit) in brain-damaged patients in 1861 initiated the study of the brain’s functional organization by identifying a localized area in the left hemisphere, now called Broca’s area, that is responsible for speech production.⁷ By that time, it was known that the brain consisted largely of nerve cells, or neurons, but it was not until 1873 that Camillo Golgi (1843-1926) developed a staining technique allowing the observation of individual neurons (see Figure 1.1). This technique was used by Santiago Ramon y Cajal (1852-1934) in his pioneering studies of neuronal organization.⁸ It is now widely accepted that cognitive functions result from the electrochemical operation of these structures. That is, a collection of simple cells can lead to thought, action, and consciousness. In the pithy words of John Searle (1992), brains cause minds.
Figure 2.14 A model-based, utility-based agent. It uses a model of the world, along with a utility function that measures its preferences among states of the world. Then it chooses the action that leads to the best expected utility, where expected utility is computed by averaging over all possible outcome states, weighted by the probability of the outcome.
aim for, none of which can be achieved with certainty, utility provides a way in which the likelihood of success can be weighed against the importance of the goals.
Partial observability and nondeterminism are ubiquitous in the real world, and so, therefore, is decision making under uncertainty. Technically speaking, a rational utility-based agent chooses the action that maximizes the expected utility of the action outcomes, that is, the utility the agent expects to derive, on average, given the probabilities and utilities of each outcome. (Appendix A defines expectation more precisely.) In Chapter 16, we show that any rational agent must behave as if it possesses a utility function whose expected value it tries to maximize. An agent that possesses an explicit utility function can make rational decisions with a general-purpose algorithm that does not depend on the specific utility function being maximized. In this way, the “global” definition of rationality (designating as rational those agent functions that have the highest performance) is turned into a “local” constraint on rational-agent designs that can be expressed in a simple program.
The utility-based agent structure appears in Figure 2.14.
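The selection rule just described, averaging the utility of each outcome state weighted by its probability and then picking the best action, can be sketched as follows. This is a minimal illustration, not one of the book's agent programs; the outcome model, the utilities, and the function names are hypothetical.

```python
def expected_utility(action, outcome_model, utility):
    """Average the utility of each possible outcome state of the action,
    weighted by the probability of that outcome."""
    return sum(prob * utility[state]
               for state, prob in outcome_model[action].items())


def choose_action(outcome_model, utility):
    """Return the action whose expected utility is highest."""
    return max(outcome_model,
               key=lambda action: expected_utility(action, outcome_model, utility))


# Hypothetical model: P(outcome state | action) for two routes.
OUTCOME_MODEL = {
    "highway": {"on_time": 0.7, "late": 0.2, "accident": 0.1},
    "back_roads": {"on_time": 0.5, "late": 0.5, "accident": 0.0},
}
# Hypothetical utility of each outcome state.
UTILITY = {"on_time": 10.0, "late": 2.0, "accident": -100.0}

best_action = choose_action(OUTCOME_MODEL, UTILITY)
```

Note that choose_action does not depend on the particular utility values: swapping in a different UTILITY table changes the decision without changing the algorithm. With the numbers above, the small chance of an accident makes the highway's expected utility negative, so the agent takes the back roads.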
Utility-based agent programs appear in Chapters 16 and 17, where we design decision-making agents that must handle the uncertainty inherent in nondeterministic or partially observable environments. Decision making in multiagent environments is also studied in the framework of utility theory, as explained in Chapter 18.
At this point, the reader may be wondering, “Is it that simple? We just build agents that maximize expected utility, and we're done?” It’s true that such agents would be intelligent, but it’s not simple. A utility-based agent has to model and keep track of its environment, tasks that have involved a great deal of research on perception, representation, reasoning, and learning. The results of this research fill many of the chapters of this book. Choosing the utility-maximizing course of action is also a difficult task, requiring ingenious algorithms that fill several more chapters.
Even with these algorithms, perfect rationality is usually unachievable in practice because of computational complexity, as we noted in Chapter 1. We also note that not all utility-based agents are model-based; we will see in Chapters 22 and 26 that a model-free agent can learn what action is best in a particular situation without ever learning exactly how that action changes the environment.
Finally, all of this assumes that the designer can specify the utility function correctly; Chapters 17, 18, and 22 consider the issue of unknown utility functions in more depth.
Figure 2.15 A general learning agent. The “performance element” box represents what we have previously considered to be the whole agent program. Now, the “learning element” box gets to modify that program to improve its performance.
2.4.6 Learning agents
We have described agent programs with various methods for selecting actions. We have not, so far, explained how the agent programs come into being. In his famous early paper, Turing (1950) considers the idea of actually programming his intelligent machines by hand. He estimates how much work this might take and concludes, “Some more expeditious method seems desirable.” The method he proposes is to build learning machines and then to teach them. In many areas of AI, this is now the preferred method for creating state-of-the-art systems. Any type of agent (model-based, goal-based, utility-based, etc.) can be built as a learning agent (or not).
Learning has another advantage, as we noted earlier: it allows the agent to operate in initially unknown environments and to become more competent than its initial knowledge alone might allow. In this section, we briefly introduce the main ideas of learning agents. Throughout the book, we comment on opportunities and methods for learning in particular kinds of agents. Chapters 19-22 go into much more depth on the learning algorithms themselves.
A learning agent can be divided into four conceptual components, as shown in Figure 2.15. The most important distinction is between the learning element, which is responsible for making improvements, and the performance element, which is responsible for selecting external actions. The performance element is what we have previously considered to be the entire agent: it takes in percepts and decides on actions. The learning element uses feedback from the critic on how the agent is doing and determines how the performance element should be modified to do better in the future.
The design of the learning element depends very much on the design of the performance
element. When trying to design an agent that learns a certain capability, the first question is
not “How am I going to get it to learn this?” but “What kind of performance element will my
agent use to do this once it has learned how?” Given a design for the performance element,
learning mechanisms can be constructed to improve every part of the agent.
The critic tells the learning element how well the agent is doing with respect to a fixed
performance standard. The critic is necessary because the percepts themselves provide no indication of the agent’s success. For example, a chess program could receive a percept indicating that it has checkmated its opponent, but it needs a performance standard to know that this is a good thing; the percept itself does not say so. It is important that the performance
standard be fixed. Conceptually, one should think of it as being outside the agent altogether because the agent must not modify it to fit its own behavior.
The last component of the learning agent is the problem generator. It is responsible for suggesting actions that will lead to new and informative experiences. If the performance element had its way, it would keep doing the actions that are best, given what it knows, but if the agent is willing to explore a little and do some perhaps suboptimal actions in the short run, it might discover much better actions for the long run. The problem generator’s job is to suggest these exploratory actions. This is what scientists do when they carry out experiments. Galileo did not think that dropping rocks from the top of a tower in Pisa was valuable in itself. He was not trying to break the rocks or to modify the brains of unfortunate pedestrians. His aim was to modify his own brain by identifying a better theory of the motion of objects.
The learning element can make changes to any of the “knowledge” components shown in the agent diagrams (Figures 2.9, 2.11, 2.13, and 2.14). The simplest cases involve learning directly from the percept sequence. Observation of pairs of successive states of the environment can allow the agent to learn “What my actions do” and “How the world evolves” in response to its actions. For example, if the automated taxi exerts a certain braking pressure when driving on a wet road, then it will soon find out how much deceleration is actually achieved, and whether it skids off the road. The problem generator might identify certain parts of the model that are in need of improvement and suggest experiments, such as trying out the brakes on different road surfaces under different conditions.
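The four components can be wired together in a few lines of Python. What follows is a toy sketch under loud assumptions, not the design in Figure 2.15 itself: the knowledge is just a running average of critic feedback per action, the problem generator explores at random with a fixed probability, and the braking environment and its numbers are invented for the example.

```python
import random


class LearningAgent:
    """Toy learning agent with the four components named in the text."""

    def __init__(self, actions, performance_standard, explore_prob=0.1, seed=0):
        self.actions = list(actions)
        # The standard is fixed and supplied from outside; the agent never edits it.
        self.performance_standard = performance_standard
        self.explore_prob = explore_prob
        self.rng = random.Random(seed)
        # Knowledge used by the performance element (assumed form: running averages).
        self.value = {a: 0.0 for a in self.actions}
        self.count = {a: 0 for a in self.actions}

    def performance_element(self):
        """Select the action that looks best given current knowledge."""
        return max(self.actions, key=lambda a: self.value[a])

    def problem_generator(self):
        """Suggest an exploratory, possibly suboptimal, action."""
        return self.rng.choice(self.actions)

    def critic(self, percept):
        """Score a percept against the fixed performance standard."""
        return self.performance_standard(percept)

    def learning_element(self, action, feedback):
        """Move the stored estimate for `action` toward the critic's feedback."""
        self.count[action] += 1
        self.value[action] += (feedback - self.value[action]) / self.count[action]

    def choose(self):
        if self.rng.random() < self.explore_prob:
            return self.problem_generator()
        return self.performance_element()

    def observe(self, action, percept):
        self.learning_element(action, self.critic(percept))


def wet_road(action):
    """Hypothetical environment: maps a braking action to a percept."""
    return {
        "gentle": {"stop_m": 60.0, "skid": False},
        "firm": {"stop_m": 40.0, "skid": False},
        "hard": {"stop_m": 30.0, "skid": True},
    }[action]


def standard(percept):
    """Hypothetical standard: shorter stops are better, skids are heavily penalized."""
    return -percept["stop_m"] - (100.0 if percept["skid"] else 0.0)


agent = LearningAgent(["gentle", "firm", "hard"], standard, explore_prob=0.1)
for _ in range(50):
    action = agent.choose()
    agent.observe(action, wet_road(action))
```

Note that the percept alone (a stopping distance and a skid flag) says nothing about quality; only the critic, applying the external standard, turns it into feedback. Because untried actions start at 0 and every score here is negative, the performance element also tries each action at least once before settling.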
Improving the model components of a model-based agent so that they conform better with reality is almost always a good idea, regardless of the external performance standard. (In some cases, it is better from a computational point of view to have a simple but slightly inaccurate model rather than a perfect but fiendishly complex model.) Information from the external standard is needed when trying to learn a reflex component or a utility function.
For example, suppose the taxi-driving agent receives no tips from passengers who have been thoroughly shaken up during the trip. The external performance standard must inform the agent that the loss of tips is a negative contribution to its overall performance; then the agent might be able to learn that violent maneuvers do not contribute to its own utility. In a sense, the performance standard distinguishes part of the incoming percept as a reward (or penalty) that provides direct feedback on the quality of the agent’s behavior. Hard-wired performance standards such as pain and hunger in animals can be understood in this way.
More generally, human choices can provide information about human preferences. For example, suppose the taxi does not know that people generally don’t like loud noises, and settles on the idea of blowing its horn continuously as a way of ensuring that pedestrians know it’s coming. The consequent human behavior (covering ears, using bad language, and possibly cutting the wires to the horn) would provide evidence to the agent with which to update its utility function. This issue is discussed further in Chapter 22.
In summary, agents have a variety of components, and those components can be represented in many ways within the agent program, so there appears to be great variety among learning methods. There is, however, a single unifying theme. Learning in intelligent agents can be summarized as a process of modification of each component of the agent to bring the components into closer agreement with the available feedback information, thereby improving the overall performance of the agent.
2.4.7 How the components of agent programs work
We have described agent programs (in very high-level terms) as consisting of various components, whose function it is to answer questions such as: “What is the world like now?” “What action should I do now?” “What do my actions do?” The next question for a student of AI is, “How on Earth do these components work?” It takes about a thousand pages to begin to answer that question properly, but here we want to draw the reader’s attention to some basic distinctions among the various ways that the components can represent the environment that the agent inhabits.
Roughly speaking, we can place the representations along an axis of increasing complexity and expressive power: atomic, factored, and structured. To illustrate these ideas, it helps to consider a particular agent component, such as the one that deals with “What my actions do.” This component describes the changes that might occur in the environment as the result of taking an action, and Figure 2.16 provides schematic depictions of how those transitions might be represented.
Figure 2.16 Three ways to represent states and the transitions between them. (a) Atomic representation: a state (such as B or C) is a black box with no internal structure; (b) Factored representation: a state consists of a vector of attribute values; values can be Boolean, real-valued, or one of a fixed set of symbols. (c) Structured representation: a state includes objects, each of which may have attributes of its own as well as relationships to other objects.
In an atomic representation each state of the world is indivisible: it has no internal structure. Consider the task of finding a driving route from one end of a country to the other via some sequence of cities (we address this problem in Figure 3.1 on page 64). For the purposes of solving this problem, it may suffice to reduce the state of the world to just the name of the city we are in: a single atom of knowledge, a “black box” whose only discernible property is that of being identical to or different from another black box. The standard algorithms underlying search and game-playing (Chapters 3-5), hidden Markov models (Chapter 14), and Markov decision processes (Chapter 17) all work with atomic representations.
A factored representation splits up each state into a fixed set of variables or attributes, each of which can have a value. Consider a higher-fidelity description for the same driving problem, where we need to be concerned with more than just atomic location in one city or another; we might need to pay attention to how much gas is in the tank, our current GPS coordinates, whether or not the oil warning light is working, how much money we have for tolls, what station is on the radio, and so on. While two different atomic states have nothing in common (they are just different black boxes), two different factored states can share some attributes (such as being at some particular GPS location) and not others (such as having lots of gas or having no gas); this makes it much easier to work out how to turn one state into another. Many important areas of AI are based on factored representations, including constraint satisfaction algorithms (Chapter 6), propositional logic (Chapter 7), planning (Chapter 11), Bayesian networks (Chapters 12-16), and various machine learning algorithms.
For many purposes, we need to understand the world as having things in it that are related to each other, not just variables with values. For example, we might notice that a large truck ahead of us is reversing into the driveway of a dairy farm, but a loose cow is blocking the truck’s path. A factored representation is unlikely to be pre-equipped with the attribute TruckAheadBackingIntoDairyFarmDrivewayBlockedByLooseCow with value true or false. Instead, we would need a structured representation, in which objects such as cows and trucks and their various and varying relationships can be described explicitly (see Figure 2.16(c)). Structured representations underlie relational databases and first-order logic (Chapters 8, 9, and 10), first-order probability models (Chapter 15), and much of natural language understanding (Chapters 23 and 24). In fact, much of what humans express in natural language concerns objects and their relationships.
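To make the three kinds of representation concrete, here is a small Python sketch of the same driving scene encoded each way. The variable names, attribute names, and relation names are all hypothetical; the book does not prescribe any particular data structure.

```python
from dataclasses import dataclass

# Atomic: a state is an opaque label; the only available test is equality.
atomic_a, atomic_b = "Arad", "Sibiu"
same_state = atomic_a == atomic_b

# Factored: a fixed set of variables, each with a value.
factored_a = {"city": "Arad", "fuel_litres": 40.0, "oil_light_ok": True}
factored_b = {"city": "Arad", "fuel_litres": 5.0, "oil_light_ok": True}
# Two factored states can agree on some attributes and differ on others.
shared = {k for k in factored_a if factored_a[k] == factored_b[k]}


# Structured: objects with attributes, plus explicit relations among them.
@dataclass(frozen=True)
class Thing:
    name: str
    kind: str


truck = Thing("truck1", "Truck")
cow = Thing("cow1", "Cow")
driveway = Thing("driveway1", "Driveway")

relations = {
    ("Ahead", truck),
    ("BackingInto", truck, driveway),
    ("Blocking", cow, truck),
}


def blockers_of(target):
    """Return every object related to `target` by the Blocking relation."""
    return {r[1] for r in relations if r[0] == "Blocking" and r[2] == target}
```

The structured version needs no special-purpose attribute for the truck-and-cow situation: the scene is assembled from general relations, and a query such as `blockers_of(truck)` works for any object.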
As we mentioned earlier, the axis along which atomic, factored, and structured representations lie is the axis of increasing expressiveness. Roughly speaking, a more expressive representation can capture, at least as concisely, everything a less expressive one can capture, plus some more. Often, the more expressive language is much more concise; for example, the rules of chess can be written in a page or two of a structured-representation language such as first-order logic but require thousands of pages when written in a factored-representation language such as propositional logic and around 10^38 pages when written in an atomic language such as that of finite-state automata. On the other hand, reasoning and learning become more complex as the expressive power of the representation increases. To gain the benefits of expressive representations while avoiding their drawbacks, intelligent systems for the real world may need to operate at all points along the axis simultaneously.
Another axis for representation involves the mapping of concepts to locations in physical memory, whether in a computer or in a brain. If there is a one-to-one mapping between concepts and memory locations, we call that a localist representation. On the other hand, if the representation of a concept is spread over many memory locations, and each memory location is employed as part of the representation of multiple different concepts, we call that a distributed representation. Distributed representations are more robust against noise and information loss. With a localist representation, the mapping from concept to memory location is arbitrary, and if a transmission error garbles a few bits, we might confuse Truck with the unrelated concept Truce. But with a distributed representation, you can think of each concept representing a point in multidimensional space, and if you garble a few bits you move to a nearby point in that space, which will have similar meaning.
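The Truck/Truce contrast can be checked with a short Python sketch. Two assumptions are made purely for illustration: the ASCII spelling of a word stands in for an arbitrary localist code, and the three-dimensional vectors are invented stand-ins for a distributed representation (they are not learned from any data).

```python
def hamming_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Arbitrary symbolic code: a handful of flipped bits turns one concept
# into an unrelated one.
flips = hamming_bits(b"Truck", b"Truce")

# Distributed code (made-up vectors): similar concepts sit close together.
EMBEDDING = {
    "Truck": (0.9, 0.8, 0.1),
    "Van": (0.8, 0.9, 0.2),
    "Truce": (0.1, 0.2, 0.9),
}


def nearest(vec):
    """Return the concept whose vector is closest to `vec` (squared Euclidean distance)."""
    def dist2(name):
        return sum((a - b) ** 2 for a, b in zip(vec, EMBEDDING[name]))
    return min(EMBEDDING, key=dist2)


# "Truck" after a small corruption of each coordinate.
noisy_truck = (0.8, 0.9, 0.15)
decoded = nearest(noisy_truck)
```

In ASCII, "k" and "e" differ in three bit positions, so three flipped bits are enough to read Truck as Truce. In the vector version, the corrupted point decodes to a neighboring vehicle concept rather than to Truce, which is the robustness property the paragraph describes.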
Summary
This chapter has been something of a whirlwind tour of AI, which we have conceived of as the science of agent design. The major points to recall are as follows:
• An agent is something that perceives and acts in an environment. The agent function for an agent specifies the action taken by the agent in response to any percept sequence.
• The performance measure evaluates the behavior of the agent in an environment. A rational agent acts so as to maximize the expected value of the performance measure, given the percept sequence it has seen so far.
• A task environment specification includes the performance measure, the external environment, the actuators, and the sensors. In designing an agent, the first step must always be to specify the task environment as fully as possible.
• Task environments vary along several significant dimensions. They can be fully or partially observable, single-agent or multiagent, deterministic or nondeterministic, episodic or sequential, static or dynamic, discrete or continuous, and known or unknown.
• In cases where the performance measure is unknown or hard to specify correctly, there is a significant risk of the agent optimizing the wrong objective. In such cases the agent design should reflect uncertainty about the true objective.
• The agent program implements the agent function. There exists a variety of basic agent program designs reflecting the kind of information made explicit and used in the decision process. The designs vary in efficiency, compactness, and flexibility. The appropriate design of the agent program depends on the nature of the environment.
• Simple reflex agents respond directly to percepts, whereas model-based reflex agents maintain internal state to track aspects of the world that are not evident in the current percept. Goal-based agents act to achieve their goals, and utility-based agents try to maximize their own expected “happiness.”
• All agents can improve their performance through learning.
Bibliographical and Historical Notes
The central role of action in intelligence (the notion of practical reasoning) goes back at least as far as Aristotle’s Nicomachean Ethics. Practical reasoning was also the subject of McCarthy’s influential paper “Programs with Common Sense” (1958). The fields of robotics and control theory are, by their very nature, concerned principally with physical agents. The concept of a controller in control theory is identical to that of an agent in AI. Perhaps surprisingly, AI has concentrated for most of its history on isolated components of agents (question-answering systems, theorem-provers, vision systems, and so on) rather than on whole agents. The discussion of agents in the text by Genesereth and Nilsson (1987) was an influential exception. The whole-agent view is now widely accepted and is a central theme in recent texts (Padgham and Winikoff, 2004; Jones, 2007; Poole and Mackworth, 2017).
Chapter 1 traced the roots of the concept of rationality in philosophy and economics. In AI, the concept was of peripheral interest until the mid-1980s, when it began to suffuse many discussions about the proper technical foundations of the field. A paper by Jon Doyle (1983) predicted that rational agent design would come to be seen as the core mission of AI, while other popular topics would spin off to form new disciplines.
Careful attention to the properties of the environment and their consequences for rational agent design is most apparent in the control theory tradition. For example, classical control systems (Dorf and Bishop, 2004; Kirk, 2004) handle fully observable, deterministic environments; stochastic optimal control (Kumar and Varaiya, 1986; Bertsekas and Shreve, 2007) handles partially observable, stochastic environments; and hybrid control (Henzinger and Sastry, 1998; Cassandras and Lygeros, 2006) deals with environments containing both discrete and continuous elements. The distinction between fully and partially observable environments is also central in the dynamic programming literature developed in the field of operations research (Puterman, 1994), which we discuss in Chapter 17.
Although simple reflex agents were central to behaviorist psychology (see Chapter 1), most AI researchers view them as too simple to provide much leverage. (Rosenschein (1985) and Brooks (1986) questioned this assumption; see Chapter 26.) A great deal of work has gone into finding efficient algorithms for keeping track of complex environments (Bar-Shalom et al., 2001; Choset et al., 2005; Simon, 2006), most of it in the probabilistic setting.
Goal-based agents are presupposed in everything from Aristotle’s view of practical reasoning to McCarthy’s early papers on logical AI. Shakey the Robot (Fikes and Nilsson, 1971; Nilsson, 1984) was the first robotic embodiment of a logical, goal-based agent. A full logical analysis of goal-based agents appeared in Genesereth and Nilsson (1987), and a goal-based programming methodology called agent-oriented programming was developed by Shoham (1993). The agent-based approach is now extremely popular in software engineering (Ciancarini and Wooldridge, 2001). It has also infiltrated the area of operating systems, where autonomic computing refers to computer systems and networks that monitor and control themselves with a perceive-act loop and machine learning methods (Kephart and Chess, 2003). Noting that a collection of agent programs designed to work well together in a true multiagent environment necessarily exhibits modularity (the programs share no internal state and communicate with each other only through the environment), it is common within the field of multiagent systems to design the agent program of a single agent as a collection of autonomous subagents. In some cases, one can even prove that the resulting system gives the same optimal solutions as a monolithic design.
The goal-based view of agents also dominates the cognitive psychology tradition in the area of problem solving, beginning with the enormously influential Human Problem Solving (Newell and Simon, 1972) and running through all of Newell’s later work (Newell, 1990). Goals, further analyzed as desires (general) and intentions (currently pursued), are central to the influential theory of agents developed by Michael Bratman (1987).
As noted in Chapter 1, the development of utility theory as a basis for rational behavior goes back hundreds of years. In AI, early research eschewed utilities in favor of goals, with some exceptions (Feldman and Sproull, 1977). The resurgence of interest in probabilistic methods in the 1980s led to the acceptance of maximization of expected utility as the most general framework for decision making (Horvitz et al., 1988). The text by Pearl (1988) was the first in AI to cover probability and utility theory in depth; its exposition of practical methods for reasoning and decision making under uncertainty was probably the single biggest factor in the rapid shift towards utility-based agents in the 1990s (see Chapter 16). The formalization of reinforcement learning within a decision-theoretic framework also contributed to this shift (Sutton, 1988). Somewhat remarkably, almost all AI research until very recently has assumed that the performance measure can be exactly and correctly specified in the form of a utility function or reward function (Hadfield-Menell et al., 2017a; Russell, 2019).
The general design for learning agents portrayed in Figure 2.15 is classic in the machine learning literature (Buchanan et al., 1978; Mitchell, 1997). Examples of the design, as embodied in programs, go back at least as far as Arthur Samuel’s (1959, 1967) learning program for playing checkers. Learning agents are discussed in depth in Chapters 19-22.
Some early papers on agent-based approaches are collected by Huhns and Singh (1998) and Wooldridge and Rao (1999). Texts on multiagent systems provide a good introduction to many aspects of agent design (Weiss, 2000a; Wooldridge, 2009). Several conference series devoted to agents began in the 1990s, including the International Workshop on Agent Theories, Architectures, and Languages (ATAL), the International Conference on Autonomous Agents (AGENTS), and the International Conference on Multi-Agent Systems (ICMAS). In 2002, these three merged to form the International Joint Conference on Autonomous Agents and Multi-Agent Systems (AAMAS). From 2000 to 2012 there were annual workshops on Agent-Oriented Software Engineering (AOSE). The journal Autonomous Agents and Multi-Agent Systems was founded in 1998. Finally, Dung Beetle Ecology (Hanski and Cambefort, 1991) provides a wealth of interesting information on the behavior of dung beetles. YouTube has inspiring video recordings of their activities.
CHAPTER 3
SOLVING PROBLEMS BY SEARCHING
In which we see how an agent can look ahead to find a sequence of actions that will eventually achieve its goal.
When the correct action to take is not immediately obvious, an agent may need to plan ahead: to consider a sequence of actions that form a path to a goal state. Such an agent is called a problem-solving agent, and the computational process it undertakes is called search.
Problem-solving agents use atomic representations, as described in Section 2.4.7; that is, states of the world are considered as wholes, with no internal structure visible to the problem-solving algorithms. Agents that use factored or structured representations of states are called planning agents and are discussed in Chapters 7 and 11.
We will cover several search algorithms. In this chapter, we consider only the simplest environments: episodic, single agent, fully observable, deterministic, static, discrete, and known. We distinguish between informed algorithms, in which the agent can estimate how far it is from the goal, and uninformed algorithms, where no such estimate is available. Chapter 4 relaxes the constraints on environments, and Chapter 5 considers multiple agents.
This chapter uses the concepts of asymptotic complexity (that is, O(n) notation). Readers unfamiliar with these concepts should consult Appendix A.
3.1 Problem-Solving Agents
Imagine an agent enjoying a touring vacation in Romania. The agent wants to take in the sights, improve its Romanian, enjoy the nightlife, avoid hangovers, and so on. The decision problem is a complex one. Now, suppose the agent is currently in the city of Arad and has a nonrefundable ticket to fly out of Bucharest the following day. The agent observes street signs and sees that there are three roads leading out of Arad: one toward Sibiu, one to Timisoara, and one to Zerind. None of these are the goal, so unless the agent is familiar with the geography of Romania, it will not know which road to follow.¹
If the agent has no additional information (that is, if the environment is unknown), then the agent can do no better than to execute one of the actions at random. This sad situation is discussed in Chapter 4. In this chapter, we will assume our agents always have access to information about the world, such as the map in Figure 3.1. With that information, the agent can follow this four-phase problem-solving process:
• Goal formulation: The agent adopts the goal of reaching Bucharest. Goals organize behavior by limiting the objectives and hence the actions to be considered.
• Problem formulation: The agent devises a description of the states and actions necessary to reach the goal, an abstract model of the relevant part of the world. For our agent, one good model is to consider the actions of traveling from one city to an adjacent city, and therefore the only fact about the state of the world that will change due to an action is the current city.
• Search: Before taking any action in the real world, the agent simulates sequences of actions in its model, searching until it finds a sequence of actions that reaches the goal. Such a sequence is called a solution. The agent might have to simulate multiple sequences that do not reach the goal, but eventually it will find a solution (such as going from Arad to Sibiu to Fagaras to Bucharest), or it will find that no solution is possible.
• Execution: The agent can now execute the actions in the solution, one at a time.
Figure 3.1 A simplified road map of part of Romania, with road distances in miles.
¹ We are assuming that most readers are in the same position and can easily imagine themselves to be as clueless as our agent. We apologize to Romanian readers who are unable to take advantage of this pedagogical device.
It is an important property that in a fully observable, deterministic, known environment, the solution to any problem is a fixed sequence of actions: drive to Sibiu, then Fagaras, then Bucharest. If the model is correct, then once the agent has found a solution, it can ignore its percepts while it is executing the actions (closing its eyes, so to speak) because the solution is guaranteed to lead to the goal. Control theorists call this an open-loop system: ignoring the percepts breaks the loop between agent and environment. If there is a chance that the model is incorrect, or the environment is nondeterministic, then the agent would be safer using a closed-loop approach that monitors the percepts (see Section 4.4).
In partially observable or nondeterministic environments, a solution would be a branching
strategy that recommends different future actions depending on what percepts arrive. For
example, the agent might plan to drive from Arad to Sibiu but might need a contingency plan
in case it arrives in Zerind by accident or finds a sign saying “Drum inchis” (Road Closed).
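The difference between open-loop and closed-loop execution can be sketched in Python. Everything here is a hypothetical toy: the `ROADS` table is the agent's model, the `World` class plays the real environment and can be told to "slip" once, and the contingency table stands in for whatever replanning method the agent actually has.

```python
# The agent's model: (city, action) -> predicted next city. Hypothetical fragment.
ROADS = {
    ("Arad", "ToSibiu"): "Sibiu",
    ("Zerind", "ToArad"): "Arad",
    ("Sibiu", "ToFagaras"): "Fagaras",
    ("Fagaras", "ToBucharest"): "Bucharest",
}


class World:
    """Toy environment; `slips` maps (city, action) to a one-off wrong outcome."""

    def __init__(self, start, slips=None):
        self.city = start
        self.slips = dict(slips or {})

    def step(self, action):
        key = (self.city, action)
        if key in self.slips:
            self.city = self.slips.pop(key)  # the accident happens only once
        elif key in ROADS:
            self.city = ROADS[key]
        # An inapplicable action leaves the agent where it is.
        return self.city  # the percept: the city we observe


def open_loop(world, plan):
    """Execute the whole plan without looking at any percept."""
    for action in plan:
        world.step(action)
    return world.city


def closed_loop(world, plan, replan):
    """Compare each percept with the model's prediction; replan on a mismatch."""
    plan = list(plan)
    while plan:
        action = plan.pop(0)
        predicted = ROADS.get((world.city, action))
        observed = world.step(action)
        if observed != predicted:
            plan = replan(observed)
    return world.city


# Hypothetical contingency plans, keyed by the city the agent finds itself in.
CONTINGENCY = {"Zerind": ["ToArad", "ToSibiu", "ToFagaras", "ToBucharest"]}


def replan(city):
    return list(CONTINGENCY.get(city, []))


PLAN = ["ToSibiu", "ToFagaras", "ToBucharest"]
ACCIDENT = {("Arad", "ToSibiu"): "Zerind"}

open_result = open_loop(World("Arad", ACCIDENT), PLAN)
closed_result = closed_loop(World("Arad", ACCIDENT), PLAN, replan)
```

When the first drive lands in Zerind by accident, the open-loop agent keeps issuing actions that no longer apply and ends up stranded there, while the closed-loop agent notices the mismatch, switches to its contingency plan, and reaches Bucharest. With no accidents, both behave identically.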
3.1.1 Search problems and solutions
A search problem can be defined formally as follows:
• A set of possible states that the environment can be in. We call this the state space.
• The initial state that the agent starts in. For example: Arad.
• A set of one or more goal states. Sometimes there is one goal state (e.g., Bucharest), sometimes there is a small set of alternative goal states, and sometimes the goal is defined by a property that applies to many states (potentially an infinite number). For example, in a vacuum-cleaner world, the goal might be to have no dirt in any location, regardless of any other facts about the state. We can account for all three of these possibilities by specifying an IS-GOAL method for a problem. In this chapter we will sometimes say “the goal” for simplicity, but what we say also applies to “any one of the possible goal states.”
• The actions available to the agent. Given a state s, ACTIONS(s) returns a finite² set of actions that can be executed in s. We say that each of these actions is applicable in s. An example:
  ACTIONS(Arad) = {ToSibiu, ToTimisoara, ToZerind}
• A transition model, which describes what each action does. RESULT(s, a) returns the state that results from doing action a in state s. For example,
  RESULT(Arad, ToZerind) = Zerind.
• An action cost function, denoted by ACTION-COST(s, a, s′) when we are programming or c(s, a, s′) when we are doing math, that gives the numeric cost of applying action a in state s to reach state s′. A problem-solving agent should use a cost function that reflects its own performance measure; for example, for route-finding agents, the cost of an action might be the length in miles (as seen in Figure 3.1), or it might be the time it takes to complete the action.
A sequence of actions forms a path, and a solution is a path from the initial state to a goal state. We assume that action costs are additive; that is, the total cost of a path is the sum of the individual action costs. An optimal solution has the lowest path cost among all solutions. In this chapter, we assume that all action costs will be positive, to avoid certain complications.³
The state space can be represented as a graph in which the vertices are states and the directed edges between them are actions. The map of Romania shown in Figure 3.1 is such a graph, where each road indicates two actions, one in each direction.
² For problems with an infinite number of actions we would need techniques that go beyond this chapter.
³ In any problem with a cycle of net negative cost, the cost-optimal solution is to go around that cycle an infinite number of times. The Bellman-Ford and Floyd-Warshall algorithms (not covered here) handle negative-cost actions, as long as there are no negative cycles. It is easy to accommodate zero-cost actions, as long as the number of consecutive zero-cost actions is bounded. For example, we might have a robot where there is a cost to move, but zero cost to rotate 90°; the algorithms in this chapter can handle this as long as no more than three consecutive 90° turns are allowed. There is also a complication with problems that have an infinite number of arbitrarily small action costs. Consider a version of Zeno’s paradox where there is an action to move halfway to the goal, at a cost of half of the previous move. This problem has no solution with a finite number of actions, but to prevent a search from taking an unbounded number of actions without quite reaching the goal, we can require that all action costs be at least ε, for some small positive value ε.
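The formal definition above maps naturally onto a small class. The Python sketch below is one possible encoding, not the book's code: the class name, method names, and the convention of writing an action as "To" followed by the city name are assumptions, and the road table covers only a fragment of the map, using distances from the standard Figure 3.1.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# Fragment of the Romania map; each road appears once per direction.
ROADS: Dict[str, Dict[str, int]] = {
    "Arad": {"Sibiu": 140, "Timisoara": 118, "Zerind": 75},
    "Sibiu": {"Arad": 140, "Fagaras": 99, "Rimnicu Vilcea": 80},
    "Fagaras": {"Sibiu": 99, "Bucharest": 211},
    "Rimnicu Vilcea": {"Sibiu": 80, "Pitesti": 97},
    "Pitesti": {"Rimnicu Vilcea": 97, "Bucharest": 101},
    "Timisoara": {"Arad": 118},
    "Zerind": {"Arad": 75},
    "Bucharest": {"Fagaras": 211, "Pitesti": 101},
}


@dataclass(frozen=True)
class RouteProblem:
    initial: str
    goals: FrozenSet[str]

    def is_goal(self, state: str) -> bool:
        return state in self.goals

    def actions(self, state: str) -> Set[str]:
        """Finite set of actions applicable in `state`."""
        return {"To" + city for city in ROADS[state]}

    def result(self, state: str, action: str) -> str:
        """Transition model: the state reached by doing `action` in `state`."""
        city = action[2:]  # strip the "To" prefix
        if city not in ROADS[state]:
            raise ValueError(f"{action} is not applicable in {state}")
        return city

    def action_cost(self, state: str, action: str, next_state: str) -> int:
        return ROADS[state][next_state]


def follow(problem: RouteProblem, actions: Iterable[str]) -> Tuple[str, int]:
    """Return the final state and the additive path cost of an action sequence."""
    state, total = problem.initial, 0
    for action in actions:
        next_state = problem.result(state, action)
        total += problem.action_cost(state, action, next_state)
        state = next_state
    return state, total


problem = RouteProblem("Arad", frozenset({"Bucharest"}))
via_fagaras = follow(problem, ["ToSibiu", "ToFagaras", "ToBucharest"])
via_pitesti = follow(problem, ["ToSibiu", "ToRimnicu Vilcea", "ToPitesti", "ToBucharest"])
```

Both action sequences are solutions because they end in a goal state, but their path costs differ: 140 + 99 + 211 = 450 through Fagaras versus 140 + 80 + 97 + 101 = 418 through Rimnicu Vilcea and Pitesti, so the second is the cheaper of the two.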
Chapter 3 Solving Problems by Searching
3.1.2 Formulating problems
Our formulation of the problem of getting to Bucharest is a model (an abstract mathematical description) and not the real thing. Compare the simple atomic state description Arad to an actual cross-country trip, where the state of the world includes so many things: the traveling companions, the current radio program, the scenery out of the window, the proximity of law enforcement officers, the distance to the next rest stop, the condition of the road, the weather, the traffic, and so on. All these considerations are left out of our model because they are irrelevant to the problem of finding a route to Bucharest. The process of removing detail from a representation is called abstraction.
A good problem formulation has the right level of detail. If the actions were at the level of “move the right foot forward a centimeter” or “turn the steering wheel one degree left,” the agent would probably never find its way out of the parking lot, let alone to Bucharest.
Can we be more precise about the appropriate level of abstraction? Think of the abstract states and actions we have chosen as corresponding to large sets of detailed world states and detailed action sequences. Now consider a solution to the abstract problem: for example, the path from Arad to Sibiu to Rimnicu Vilcea to Pitesti to Bucharest. This abstract solution corresponds to a large number of more detailed paths. For example, we could drive with the radio on between Sibiu and Rimnicu Vilcea, and then switch it off for the rest of the trip.
The abstraction is valid if we can elaborate any abstract solution into a solution in the more detailed world; a sufficient condition is that for every detailed state that is “in Arad,” there is a detailed path to some state that is “in Sibiu,” and so on.⁴ The abstraction is useful if carrying out each of the actions in the solution is easier than the original problem; in our case, the action “drive from Arad to Sibiu” can be carried out without further search or planning by a driver with average skill. The choice of a good abstraction thus involves removing as much detail as possible while retaining validity and ensuring that the abstract actions are easy to carry out. Were it not for the ability to construct useful abstractions, intelligent agents would be completely swamped by the real world.
3.2 Example Problems
The problem-solving approach has been applied to a vast array of task environments. We list some of the best known here, distinguishing between standardized and real-world problems. A standardized problem is intended to illustrate or exercise various problem-solving methods. It can be given a concise, exact description and hence is suitable as a benchmark for researchers to compare the performance of algorithms. A real-world problem, such as robot navigation, is one whose solutions people actually use, and whose formulation is idiosyncratic, not standardized, because, for example, each robot has different sensors that produce different data.
3.2.1 Standardized problems
A grid world problem is a two-dimensional rectangular array of square cells in which agents can move from cell to cell. Typically the agent can move to any obstacle-free adjacent cell, horizontally or vertically and in some problems diagonally. Cells can contain objects, which the agent can pick up, push, or otherwise act upon; a wall or other impassible obstacle in a cell prevents an agent from moving into that cell. The vacuum world from Section 2.1 can be formulated as a grid world problem as follows:
• States: A state of the world says which objects are in which cells. For the vacuum world, the objects are the agent and any dirt. In the simple two-cell version, the agent can be in either of the two cells, and each cell can either contain dirt or not, so there are 2 · 2 · 2 = 8 states (see Figure 3.2). In general, a vacuum environment with n cells has n · 2^n states.
• Initial state: Any state can be designated as the initial state.
• Actions: In the two-cell world we defined three actions: Suck, move Left, and move Right. In a two-dimensional multi-cell world we need more movement actions. We could add Upward and Downward, giving us four absolute movement actions, or we could switch to egocentric actions, defined relative to the viewpoint of the agent: for example, Forward, Backward, TurnRight, and TurnLeft.
• Transition model: Suck removes any dirt from the agent’s cell; Forward moves the agent ahead one cell in the direction it is facing, unless it hits a wall, in which case the action has no effect. Backward moves the agent in the opposite direction, while TurnRight and TurnLeft change the direction it is facing by 90°.
• Goal states: The states in which every cell is clean.
• Action cost: Each action costs 1.
Figure 3.2 The state-space graph for the two-cell vacuum world. There are 8 states and three actions for each state: L = Left, R = Right, S = Suck.
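The two-cell vacuum world formulation can be sketched in a few lines of Python. This is only one possible encoding, under our own assumptions: the cells are named "A" (left) and "B" (right), and a state is a tuple (agent location, dirt in A, dirt in B). None of these names come from the text.

```python
from itertools import product

CELLS = ("A", "B")                  # assumed names for the left and right cells
ACTIONS = ("Left", "Right", "Suck")


def all_states():
    """Every (agent location, dirt in A, dirt in B) combination."""
    return list(product(CELLS, (False, True), (False, True)))


def result(state, action):
    """Transition model for the two-cell world; every action costs 1."""
    loc, dirt_a, dirt_b = state
    if action == "Left":
        return ("A", dirt_a, dirt_b)    # no effect if already in A
    if action == "Right":
        return ("B", dirt_a, dirt_b)    # no effect if already in B
    if action == "Suck":
        return (loc, False, dirt_b) if loc == "A" else (loc, dirt_a, False)
    raise ValueError(f"unknown action: {action}")


def is_goal(state):
    """Goal states are those in which every cell is clean."""
    _, dirt_a, dirt_b = state
    return not dirt_a and not dirt_b


def count_states(n):
    """Number of states in a vacuum environment with n cells: n * 2**n."""
    return n * 2 ** n


print(len(all_states()), count_states(2))  # 8 8
```

Enumerating the states explicitly is feasible only for tiny worlds; for larger n the count n · 2^n grows too quickly, which is why the formulation is given as functions rather than as a table.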
Another type of grid world is the sokoban puzzle, in which the agent’s goal is to push a number of boxes, scattered about the grid, to designated storage locations. There can be at most one box per cell. When an agent moves forward into a cell containing a box and there is an empty cell on the other side of the box, then both the box and the agent move forward. The agent can’t push a box into another box or a wall. For a world with n non-obstacle cells and b boxes, there are n × n!/(b!(n − b)!) states; for example on an 8 × 8 grid with a dozen boxes, there are over 200 trillion states.
Figure 3.3 A typical instance of the 8-puzzle.
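The sokoban state count quoted above is easy to check numerically. The sketch below simply evaluates the formula n × n!/(b!(n − b)!) with Python's `math.comb`; the function name `sokoban_states` is our own.

```python
import math


def sokoban_states(n, b):
    """Evaluate n * n! / (b! (n - b)!) for n non-obstacle cells and b boxes."""
    return n * math.comb(n, b)


# An 8 x 8 grid with no obstacles has n = 64 cells; a dozen boxes gives b = 12.
count = sokoban_states(64, 12)
print(count > 200 * 10**12)  # True
```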
In a sliding-tile puzzle, a number of tiles (sometimes called blocks or pieces) are arranged in a grid with one or more blank spaces so that some of the tiles can slide into the blank space. One variant is the Rush Hour puzzle, in which cars and trucks slide around a 6 × 6 grid in an attempt to free a car from the traffic jam. Perhaps the best-known variant is the 8-puzzle (see Figure 3.3), which consists of a 3 × 3 grid with eight numbered tiles and one blank space, and the 15-puzzle on a 4 × 4 grid. The object is to reach a specified goal state, such as the one shown on the right of the figure. The standard formulation of the 8-puzzle is as follows:
• States: A state description specifies the location of each of the tiles.
• Initial state: Any state can be designated as the initial state. Note that a parity property partitions the state space: any given goal can be reached from exactly half of the possible initial states (see Exercise 3.PART).
• Actions: While in the physical world it is a tile that slides, the simplest way of describing an action is to think of the blank space moving Left, Right, Up, or Down. If the blank is at an edge or corner then not all actions will be applicable.
• Transition model: Maps a state and action to a resulting state; for example, if we apply Left to the start state in Figure 3.3, the resulting state has the 5 and the blank switched.
• Goal state: Although any state could be the goal, we typically specify a state with the numbers in order, as in Figure 3.3.
• Action cost: Each action costs 1.
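A minimal Python sketch of this formulation follows. It represents a state as a tuple of nine numbers read row by row, with 0 standing for the blank. The particular `START` and `GOAL` layouts are our assumption about what Figure 3.3 shows (the figure itself is not reproduced here), and the function names are illustrative.

```python
# Assumed layouts for Figure 3.3; 0 denotes the blank space.
START = (7, 2, 4,
         5, 0, 6,
         8, 3, 1)
GOAL = (0, 1, 2,
        3, 4, 5,
        6, 7, 8)

# (row, column) offsets for moving the blank.
MOVES = {"Left": (0, -1), "Right": (0, 1), "Up": (-1, 0), "Down": (1, 0)}


def actions(state):
    """Blank moves that stay on the 3 x 3 board."""
    row, col = divmod(state.index(0), 3)
    return [name for name, (dr, dc) in MOVES.items()
            if 0 <= row + dr < 3 and 0 <= col + dc < 3]


def result(state, action):
    """Swap the blank with the neighboring tile in the given direction."""
    if action not in actions(state):
        raise ValueError(f"{action} is not applicable in this state")
    blank = state.index(0)
    dr, dc = MOVES[action]
    target = blank + 3 * dr + dc
    tiles = list(state)
    tiles[blank], tiles[target] = tiles[target], tiles[blank]
    return tuple(tiles)


print(result(START, "Left"))  # (7, 2, 4, 0, 5, 6, 8, 3, 1)
```

With the assumed start state, applying Left switches the 5 and the blank, as described in the transition model above; in a corner such as the blank position of `GOAL`, only two actions are applicable.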
Note that every problem formulation involves abstractions. The 8-puzzle actions are abstracted to their beginning and final states, ignoring the intermediate locations where the tile is sliding. We have abstracted away actions such as shaking the board when tiles get stuck and ruled out extracting the tiles with a knife and putting them back again. We are left with a description of the rules, avoiding all the details of physical manipulations.
Our final standardized problem was devised by Donald Knuth (1964) and illustrates how infinite state spaces can arise. Knuth conjectured that starting with the number 4, a sequence of square root, floor, and factorial operations can reach any desired positive integer. For example, we can reach 5 from 4 as follows:
$$\left\lfloor \sqrt{\sqrt{\sqrt{\sqrt{\sqrt{(4!)!}}}}} \right\rfloor = 5.$$
The problem definition is simple:
• States: Positive real numbers.
• Initial state: 4.
• Actions: Apply square root, floor, or factorial operation (factorial for integers only).
• Transition model: As given by the mathematical definitions of the operations.
• Goal state: The desired positive integer.
• Action cost: Each action costs 1.
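The example path from 4 to 5 can be replayed directly. The following sketch applies two factorials, five square roots, and one floor, in that order; the function name `knuth_five` is ours, and floating-point square roots are assumed to be accurate enough at this scale.

```python
import math


def knuth_five():
    """Replay the action sequence: factorial, factorial, sqrt x 5, floor."""
    x = math.factorial(math.factorial(4))   # (4!)! = 24!
    y = float(x)
    for _ in range(5):                       # five successive square roots
        y = math.sqrt(y)
    return math.floor(y)


print(math.factorial(math.factorial(4)))  # 620448401733239439360000
print(knuth_five())                       # 5
```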
The state space for this problem is infinite: for any integer greater than 2 the factorial operator will always yield a larger integer. The problem is interesting because it explores very large numbers: the shortest path to 5 goes through (4!)! = 620,448,401,733,239,439,360,000. Infinite state spaces arise frequently in tasks involving the generation of mathematical expressions, circuits, proofs, programs, and other recursively defined objects.
3.2.2 Real-world problems
We have already seen how the route-finding problem is defined in terms of specified locations and transitions along edges between them. Route-finding algorithms are used in a variety of applications. Some, such as Web sites and in-car systems that provide driving directions, are relatively straightforward extensions of the Romania example. (The main complications are varying costs due to traffic-dependent delays, and rerouting due to road closures.) Others, such as routing video streams in computer networks, military operations planning, and airline travel-planning systems, involve much more complex specifications. Consider the airline travel problems that must be solved by a travel-planning Web site:
• States: Each state obviously includes a location (e.g., an airport) and the current time. Furthermore, because the cost of an action (a flight segment) may depend on previous segments, their fare bases, and their status as domestic or international, the state must record extra information about these “historical” aspects.
• Initial state: The user’s home airport.
• Actions: Take any flight from the current location, in any seat class, leaving after the current time, leaving enough time for within-airport transfer if needed.
• Transition model: The state resulting from taking a flight will have the flight’s destination as the new location and the flight’s arrival time as the new time.
• Goal state: A destination city. Sometimes the goal can be more complex, such as “arrive at the destination on a nonstop flight.”
• Action cost: A combination of monetary cost, waiting time, flight time, customs and immigration procedures, seat quality, time of day, type of airplane, frequent-flyer reward points, and so on.
Commercial travel advice systems use a problem formulation of this kind, with many additional complications to handle the airlines’ byzantine fare structures. Any seasoned traveler knows, however, that not all air travel goes according to plan. A really good system should include contingency plans: what happens if this flight is delayed and the connection is missed?
Touring problems describe a set of locations that must be visited, rather than a single goal destination. The traveling salesperson problem (TSP) is a touring problem in which every city on a map must be visited. The aim is to find a tour with cost < C (or in the optimization version, to find a tour with the lowest cost possible). An enormous amount of effort has been expended to improve the capabilities of TSP algorithms. The algorithms can also be extended to handle fleets of vehicles. For example, a search and optimization algorithm for routing school buses in Boston saved $5 million, cut traffic and air pollution, and saved time for drivers and students (Bertsimas et al., 2019). In addition to planning trips, search algorithms have been used for tasks such as planning the movements of automatic circuit-board drills and of stocking machines on shop floors.
A VLSI layout problem requires positioning millions of components and connections on a chip to minimize area, minimize circuit delays, minimize stray capacitances, and maximize manufacturing yield. The layout problem comes after the logical design phase and is usually split into two parts: cell layout and channel routing. In cell layout, the primitive components of the circuit are grouped into cells, each of which performs some recognized function. Each cell has a fixed footprint (size and shape) and requires a certain number of connections to each of the other cells. The aim is to place the cells on the chip so that they do not overlap and so that there is room for the connecting wires to be placed between the cells. Channel
routing finds a specific route for each wire through the gaps between the cells. These search problems are extremely complex, but definitely worth solving.
Robot navigation is a generalization of the route-finding problem described earlier. Rather than following distinct paths (such as the roads in Romania), a robot can roam around, in effect making its own paths. For a circular robot moving on a flat surface, the space is essentially two-dimensional. When the robot has arms and legs that must also be controlled, the search space becomes many-dimensional, with one dimension for each joint angle. Advanced techniques are required just to make the essentially continuous search space finite (see Chapter 26). In addition to the complexity of the problem, real robots must also deal with errors in their sensor readings and motor controls, with partial observability, and with other agents that might alter the environment.
Automatic assembly sequencing of complex objects (such as electric motors) by a robot has been standard industry practice since the 1970s. Algorithms first find a feasible assembly sequence and then work to optimize the process. Minimizing the amount of manual human labor on the assembly line can produce significant savings in time and cost. In assembly problems, the aim is to find an order in which to assemble the parts of some object. If the wrong order is chosen, there will be no way to add some part later in the sequence without undoing some of the work already done. Checking an action in the sequence for feasibility is a difficult geometrical search problem closely related to robot navigation. Thus, the generation of legal actions is the expensive part of assembly sequencing. Any practical algorithm must avoid exploring all but a tiny fraction of the state space. One important assembly problem is protein design, in which the goal is to find a sequence of amino acids that will fold into a three-dimensional protein with the right properties to cure some disease.
3.3 Search Algorithms
A search algorithm takes a search problem as input and returns a solution, or an indication of failure. In this chapter we consider algorithms that superimpose a search tree over the state-space graph, forming various paths from the initial state, trying to find a path that reaches a goal state. Each node in the search tree corresponds to a state in the state space and the edges in the search tree correspond to actions. The root of the tree corresponds to the initial state of the problem.
It is important to understand the distinction between the state space and the search tree. The state space describes the (possibly infinite) set of states in the world, and the actions that allow transitions from one state to another. The search tree describes paths between these states, reaching towards the goal. The search tree may have multiple paths to (and thus multiple nodes for) any given state, but each node in the tree has a unique path back to the root (as in all trees).
Figure 3.4 Three partial search trees for finding a route from Arad to Bucharest. Nodes that have been expanded are lavender with bold letters; nodes on the frontier that have been generated but not yet expanded are in green; the set of states corresponding to these two types of nodes are said to have been reached. Nodes that could be generated next are shown in faint dashed lines. Notice in the bottom tree there is a cycle from Arad to Sibiu to Arad; that can’t be an optimal path, so search should not continue from there.
Figure 3.5 A sequence of search trees generated by a graph search on the Romania problem of Figure 3.1. At each stage, we have expanded every node on the frontier, extending every path with all applicable actions that don’t result in a state that has already been reached. Notice that at the third stage, the topmost city (Oradea) has two successors, both of which have already been reached by other paths, so no paths are extended from Oradea.
Figure 3.6 The separation property of graph search, illustrated on a rectangular-grid problem. The frontier (green) separates the interior (lavender) from the exterior (faint dashed). The frontier is the set of nodes (and corresponding states) that have been reached but not yet expanded; the interior is the set of nodes (and corresponding states) that have been expanded; and the exterior is the set of states that have not been reached. In (a), just the root has been expanded. In (b), the top frontier node is expanded. In (c), the remaining successors of the root are expanded in clockwise order.
Figure 3.4 shows the first few steps in finding a path from Arad to Bucharest. The root node of the search tree is at the initial state, Arad. We can expand the node, by considering
the available ACTIONS for that state, using the RESULT function to see where those actions lead to, and generating a new node (called a child node or successor node) for each of the resulting states. Each child node has Arad as its parent node.
Now we must choose which of these three child nodes to consider next. This is the essence of search: following up one option now and putting the others aside for later. Suppose we choose to expand Sibiu first. Figure 3.4 (bottom) shows the result: a set of 6 unexpanded nodes (outlined in bold). We call this the frontier of the search tree. We say that any state that has had a node generated for it has been reached (whether or not that node has been expanded). Figure 3.5 shows the search tree superimposed on the state-space graph.
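The expansion step just described can be sketched as follows. This is an illustrative Python fragment, not the book's pseudocode: the `Node` fields, the `expand` generator, and the `ROADS` table (roads out of Arad and Sibiu only, with distances from Figure 3.1) are our own assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Roads out of Arad and Sibiu only; a partial, illustrative copy of Figure 3.1.
ROADS = {
    "Arad": {"Sibiu": 140, "Timisoara": 118, "Zerind": 75},
    "Sibiu": {"Arad": 140, "Fagaras": 99, "Oradea": 151, "Rimnicu Vilcea": 80},
}


@dataclass
class Node:
    state: str
    parent: Optional["Node"] = None
    action: Optional[str] = None
    path_cost: int = 0


def expand(node):
    """Generate one child node for each action available in node.state."""
    for city, miles in ROADS.get(node.state, {}).items():
        yield Node(state=city, parent=node, action=f"Go({city})",
                   path_cost=node.path_cost + miles)


root = Node("Arad")
frontier = list(expand(root))              # the three children of Arad
sibiu = next(n for n in frontier if n.state == "Sibiu")
frontier.remove(sibiu)                     # choose Sibiu ...
frontier.extend(expand(sibiu))             # ... and expand it
print(sorted(n.state for n in frontier))
```

After the two expansions the frontier holds six unexpanded nodes, one of which is a second node for Arad; this is the cycle visible in the bottom tree of Figure 3.4.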
Note that the frontier separates two regions of the state-space graph: an interior region where every state has been expanded, and an exterior region of states that have not yet been reached. This property is illustrated in Figure 3.6.
5 Some authors call the frontier the open list, which is both geographically less evocative and computationally less appropriate, because a queue is more efficient than a list here. Those authors use the term closed list to refer to the set of previously expanded nodes, which in our terminology would be the reached nodes minus the frontier.
function BEST-FIRST-SEARCH(problem, f) returns a solution node or failure
  node ← NODE(STATE=problem.INITIAL)
  frontier ← …
Section 3.4 Uninformed Search Strategies
Figure 3.8 Breadth-first search on a simple binary tree. At each stage, the node to be expanded next is indicated by the triangular marker.
function BREADTH-FIRST-SEARCH(problem) returns a solution node or failure
  node ← NODE(problem.INITIAL)
  if problem.IS-GOAL(node.STATE) then return node
  frontier ← …
$\cdots + b^d = O(b^d)$. All the nodes remain in memory, so both time and space complexity are $O(b^d)$. Exponential bounds like that are scary. As a typical real-world example, consider a problem with branching factor b = 10, processing speed 1 million nodes/second, and memory requirements of 1 Kbyte/node. A search to depth d = 10 would take less than 3 hours, but would require 10 terabytes of memory. The memory requirements are a bigger problem for breadth-first search than the execution time. But time is still an important factor. At depth d = 14, even with infinite memory, the search would take 3.5 years. In general, exponential-complexity search problems cannot be solved by uninformed search for any but the smallest instances.
3.4.2 Dijkstra’s algorithm or uniform-cost search
When actions have different costs, an obvious choice is to use best-first search where the evaluation function is the cost of the path from the root to the current node. This is called Dijkstra’s algorithm by the theoretical computer science community, and uniform-cost search by the AI community. The idea is that while breadth-first search spreads out in waves of uniform depth (first depth 1, then depth 2, and so on), uniform-cost search spreads out in waves of uniform path-cost. The algorithm can be implemented as a call to BEST-FIRST-SEARCH with PATH-COST as the evaluation function, as shown in Figure 3.9.
Figure 3.10 Part of the Romania state space, selected to illustrate uniform-cost search.
Consider Figure 3.10, where the problem is to get from Sibiu to Bucharest. The successors of Sibiu are Rimnicu Vilcea and Fagaras, with costs 80 and 99, respectively. The least-cost node, Rimnicu Vilcea, is expanded next, adding Pitesti with cost 80 + 97 = 177. The least-cost node is now Fagaras, so it is expanded, adding Bucharest with cost 99 + 211 = 310. Bucharest is the goal, but the algorithm tests for goals only when it expands a node, not when it generates a node, so it has not yet detected that this is a path to the goal.
The algorithm continues on, choosing Pitesti for expansion next and adding a second path to Bucharest with cost 80 + 97 + 101 = 278. It has a lower cost, so it replaces the previous path in reached and is added to the frontier. It turns out this node now has the lowest cost, so it is considered next, found to be a goal, and returned. Note that if we had checked for a goal upon generating a node rather than when expanding the lowest-cost node, then we would have returned a higher-cost path (the one through Fagaras).
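The trace above can be reproduced with a short priority-queue implementation. The sketch below is one way to write uniform-cost search in Python, assuming the five edges of the worked example; the data layout (`GRAPH`, the `reached` dictionary, tuples on a `heapq`) is our choice and not the book's pseudocode.

```python
import heapq

# Directed edges used in the worked example (costs as given in the text).
GRAPH = {
    "Sibiu": {"Rimnicu Vilcea": 80, "Fagaras": 99},
    "Rimnicu Vilcea": {"Pitesti": 97},
    "Fagaras": {"Bucharest": 211},
    "Pitesti": {"Bucharest": 101},
    "Bucharest": {},
}


def uniform_cost_search(graph, start, goal):
    """Best-first search ordered by path cost, with the goal test on expansion."""
    frontier = [(0, start, [start])]        # (path cost, state, path so far)
    reached = {start: 0}                    # cheapest known cost to each state
    expanded = []
    while frontier:
        cost, state, path = heapq.heappop(frontier)
        if cost > reached[state]:
            continue                        # stale entry: a cheaper path exists
        if state == goal:                   # late goal test
            return cost, path, expanded
        expanded.append(state)
        for child, step in graph[state].items():
            new_cost = cost + step
            if child not in reached or new_cost < reached[child]:
                reached[child] = new_cost
                heapq.heappush(frontier, (new_cost, child, path + [child]))
    return None


cost, path, expanded = uniform_cost_search(GRAPH, "Sibiu", "Bucharest")
print(cost, path)   # 278 ['Sibiu', 'Rimnicu Vilcea', 'Pitesti', 'Bucharest']
print(expanded)     # ['Sibiu', 'Rimnicu Vilcea', 'Fagaras', 'Pitesti']
```

Rather than updating a frontier entry in place, this sketch pushes the cheaper Bucharest node and skips the stale 310-cost entry if it is ever popped; that is a common simplification when the priority queue has no decrease-key operation.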
The complexity of uniform-cost search is characterized in terms of C*, the cost of the optimal solution, and ε, a lower bound on the cost of each action, with ε > 0. Then the algorithm’s worst-case time and space complexity is $O(b^{1+\lfloor C^*/\epsilon \rfloor})$
… ε > 0; cost-optimal if action costs are all identical; if both directions are breadth-first or uniform-cost.
3.5 Informed (Heuristic) Search Strategies
This section shows how an informed search strategy, one that uses domain-specific hints about the location of goals, can find solutions more efficiently than an uninformed strategy. The hints come in the form of a heuristic function, denoted h(n):¹⁰
h(n) = estimated cost of the cheapest path from the state at node n to a goal state.
For example, in route-finding problems, we can estimate the distance from the current state to a goal by computing the straight-line distance on the map between the two points. We study heuristics and where they come from in more detail in Section 3.6.
10 It may seem odd that the heuristic function operates on a node, when all it really needs is the node’s state. It is traditional to use h(n) rather than h(s) to be consistent with the evaluation function f(n) and the path cost g(n).
Arad        366    Mehadia          241
Bucharest     0    Neamt            234
Craiova     160    Oradea           380
Drobeta     242    Pitesti          100
Eforie      161    Rimnicu Vilcea   193
Fagaras     176    Sibiu            253
Giurgiu      77    Timisoara        329
Hirsova     151    Urziceni          80
Iasi        226    Vaslui           199
Lugoj       244    Zerind           374
Figure 3.16 Values of h_SLD: straight-line distances to Bucharest.
3.5.1 Greedy best-first search
Greedy best-first search is a form of best-first search that expands first the node with the lowest h(n) value (the node that appears to be closest to the goal) on the grounds that this is likely to lead to a solution quickly. So the evaluation function is f(n) = h(n).
Let us see how this works for route-finding problems in Romania; we use the straight-line distance heuristic, which we will call h_SLD. If the goal is Bucharest, we need to know the straight-line distances to Bucharest, which are shown in Figure 3.16. For example, h_SLD(Arad) = 366. Notice that the values of h_SLD cannot be computed from the problem description itself (that is, the ACTIONS and RESULT functions). Moreover, it takes a certain amount of world knowledge to know that h_SLD is correlated with actual road distances and is, therefore, a useful heuristic.
Figure 3.17 shows the progress of a greedy best-first search using h_SLD to find a path from Arad to Bucharest. The first node to be expanded from Arad will be Sibiu because the heuristic says it is closer to Bucharest than is either Zerind or Timisoara. The next node to be expanded will be Fagaras because it is now closest according to the heuristic. Fagaras in turn generates Bucharest, which is the goal. For this particular problem, greedy best-first search using h_SLD finds a solution without ever expanding a node that is not on the solution path. The solution it found does not have optimal cost, however: the path via Sibiu and Fagaras to Bucharest is 32 miles longer than the path through Rimnicu Vilcea and Pitesti. This is why the algorithm is called “greedy”: on each iteration it tries to get as close to a goal as it can, but greediness can lead to worse results than being careful.
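The greedy search just traced can be sketched in Python as a graph search whose priority queue is ordered by h alone. The code below assumes only the part of the map the search touches (road distances from Figure 3.1, h_SLD values from Figure 3.16); the names `ROADS`, `H_SLD`, and `greedy_best_first` are illustrative.

```python
import heapq

# Part of the Romania map (road distances, Figure 3.1).
ROADS = [
    ("Arad", "Sibiu", 140), ("Arad", "Timisoara", 118), ("Arad", "Zerind", 75),
    ("Sibiu", "Fagaras", 99), ("Sibiu", "Oradea", 151),
    ("Sibiu", "Rimnicu Vilcea", 80), ("Rimnicu Vilcea", "Pitesti", 97),
    ("Fagaras", "Bucharest", 211), ("Pitesti", "Bucharest", 101),
]
# Straight-line distances to Bucharest (Figure 3.16).
H_SLD = {"Arad": 366, "Bucharest": 0, "Fagaras": 176, "Oradea": 380,
         "Pitesti": 100, "Rimnicu Vilcea": 193, "Sibiu": 253,
         "Timisoara": 329, "Zerind": 374}

GRAPH = {}
for a, b, miles in ROADS:
    GRAPH.setdefault(a, {})[b] = miles
    GRAPH.setdefault(b, {})[a] = miles


def greedy_best_first(start, goal):
    """Graph search that always expands the frontier node with the lowest h."""
    frontier = [(H_SLD[start], start, [start], 0)]   # (h, state, path, cost)
    reached = {start}
    expanded = []
    while frontier:
        _, state, path, cost = heapq.heappop(frontier)
        if state == goal:
            return path, cost, expanded
        expanded.append(state)
        for child, miles in GRAPH[state].items():
            if child not in reached:
                reached.add(child)
                heapq.heappush(
                    frontier, (H_SLD[child], child, path + [child], cost + miles))
    return None


path, cost, expanded = greedy_best_first("Arad", "Bucharest")
print(path, cost)   # ['Arad', 'Sibiu', 'Fagaras', 'Bucharest'] 450
print(expanded)     # ['Arad', 'Sibiu', 'Fagaras']
```

The returned cost of 450 is 32 more than the 418-mile route through Rimnicu Vilcea and Pitesti, matching the difference noted above.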
Greedy best-first graph search is complete in finite state spaces, but not in infinite ones. The worst-case time and space complexity is O(|V|). With a good heuristic function, however, the complexity can be reduced substantially, on certain problems reaching O(bm).
3.5.2 A* search
The most common informed search algorithm is A* search (pronounced “A-star search”), a best-first search that uses the evaluation function
f(n) = g(n) + h(n)
where g(n) is the path cost from the initial state to node n, and h(n) is the estimated cost of the shortest path from n to a goal state, so we have
f(n) = estimated cost of the best path that continues from n to a goal.
Figure 3.17 Stages in a greedy best-first tree-like search for Bucharest with the straight-line distance heuristic h_SLD. Nodes are labeled with their h-values. Panels: (a) the initial state; (b) after expanding Arad; (c) after expanding Sibiu; (d) after expanding Fagaras.
In Figure 3.18, we show the progress of an A* search with the goal of reaching Bucharest. The values of g are computed from the action costs in Figure 3.1, and the values of h_SLD are given in Figure 3.16. Notice that Bucharest first appears on the frontier at step (e), but it is not selected for expansion (and thus not detected as a solution) because at f = 450 it is not the lowest-cost node on the frontier; that would be Pitesti, at f = 417. Another way to say this is that there might be a solution through Pitesti whose cost is as low as 417, so the algorithm will not settle for a solution that costs 450. At step (f), a different path to Bucharest is now the lowest-cost node, at f = 418, so it is selected and detected as the optimal solution.
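The A* trace can likewise be reproduced in a few lines. The sketch below orders the frontier by f = g + h and tests for the goal on expansion; it assumes the same partial map and h_SLD values as Figures 3.1 and 3.16, and the names `astar`, `best_g`, and `order` are our own.

```python
import heapq

# Part of the Romania map (road distances, Figure 3.1).
ROADS = [
    ("Arad", "Sibiu", 140), ("Arad", "Timisoara", 118), ("Arad", "Zerind", 75),
    ("Sibiu", "Fagaras", 99), ("Sibiu", "Oradea", 151),
    ("Sibiu", "Rimnicu Vilcea", 80), ("Rimnicu Vilcea", "Pitesti", 97),
    ("Rimnicu Vilcea", "Craiova", 146), ("Pitesti", "Craiova", 138),
    ("Fagaras", "Bucharest", 211), ("Pitesti", "Bucharest", 101),
]
# Straight-line distances to Bucharest (Figure 3.16).
H_SLD = {"Arad": 366, "Bucharest": 0, "Craiova": 160, "Fagaras": 176,
         "Oradea": 380, "Pitesti": 100, "Rimnicu Vilcea": 193, "Sibiu": 253,
         "Timisoara": 329, "Zerind": 374}

GRAPH = {}
for a, b, miles in ROADS:
    GRAPH.setdefault(a, {})[b] = miles
    GRAPH.setdefault(b, {})[a] = miles


def astar(start, goal):
    """Best-first search with f(n) = g(n) + h(n) and a late goal test."""
    frontier = [(H_SLD[start], 0, start, [start])]   # (f, g, state, path)
    best_g = {start: 0}                              # cheapest g per state
    order = []                                       # (state, f) as expanded
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if g > best_g[state]:
            continue                                 # stale, costlier entry
        if state == goal:
            return path, g, order
        order.append((state, f))
        for child, miles in GRAPH[state].items():
            new_g = g + miles
            if child not in best_g or new_g < best_g[child]:
                best_g[child] = new_g
                heapq.heappush(
                    frontier,
                    (new_g + H_SLD[child], new_g, child, path + [child]))
    return None


path, cost, order = astar("Arad", "Bucharest")
print(path, cost)
print(order)
```

On this fragment the search expands Arad (366), Sibiu (393), Rimnicu Vilcea (413), Fagaras (415), and Pitesti (417), and only then selects Bucharest at f = 418, passing over the earlier Bucharest node at f = 450.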
A* search is complete.¹¹ Whether A* is cost-optimal depends on certain properties of the heuristic. A key property is admissibility: an admissible heuristic is one that never overestimates the cost to reach a goal. (An admissible heuristic is therefore optimistic.) With an admissible heuristic, A* is cost-optimal, which we can show with a proof by contradiction. Suppose the optimal path has cost C*, but the algorithm returns a path with cost C > C*. Then there must be some node n which is on the optimal path and is unexpanded (because if all the nodes on the optimal path had been expanded, then we would have returned that optimal solution). So then, using the notation g*(n) to mean the cost of the optimal path from the start to n, and h*(n) to mean the cost of the optimal path from n to the nearest goal, we have:
$$
\begin{aligned}
f(n) &> C^* && \text{(otherwise $n$ would have been expanded)} \\
f(n) &= g(n) + h(n) && \text{(by definition)} \\
f(n) &= g^*(n) + h(n) && \text{(because $n$ is on an optimal path)} \\
f(n) &\le g^*(n) + h^*(n) && \text{(because of admissibility, $h(n) \le h^*(n)$)} \\
f(n) &\le C^* && \text{(by definition, $C^* = g^*(n) + h^*(n)$)}
\end{aligned}
$$
The first and last lines form a contradiction, so the supposition that the algorithm could return a suboptimal path must be wrong; it must be that A* returns only cost-optimal paths.
11 Again, assuming all action costs are > ε > 0, and the state space either has a solution or is finite.
Figure 3.18 Stages in an A* search for Bucharest. Nodes are labeled with f = g + h. The h values are the straight-line distances to Bucharest taken from Figure 3.16.
Figure 3.19 Triangle inequality: If the heuristic h is consistent, then the single number h(n) will be less than the sum of the cost c(n, a, n′) of the action from n to n′ plus the heuristic estimate h(n′).
A slightly stronger property is called consistency. A heuristic h(n) is consistent if, for every node n and every successor n′ of n generated by an action a, we have:
$$h(n) \le c(n, a, n') + h(n').$$
This is a form of the triangle inequality, which stipulates that a side of a triangle cannot be longer than the sum of the other two sides (see Figure 3.19). An example of a consistent heuristic is the straight-line distance h_SLD that we used in getting to Bucharest.
Every consistent heuristic is admissible (but not vice versa), so with a consistent heuristic, A* is cost-optimal. In addition, with a consistent heuristic, the first time we reach a state it will be on an optimal path, so we never have to re-add a state to the frontier, and never have to change an entry in reached. But with an inconsistent heuristic, we may end up with multiple paths reaching the same state, and if each new path has a lower path cost than the previous one, then we will end up with multiple nodes for that state in the frontier, costing us both time and space. Because of that, some implementations of A* take care to only enter a state into the frontier once, and if a better path to the state is found, all the successors of the state are updated (which requires that nodes have child pointers as well as parent pointers). These complications have led many implementers to avoid inconsistent heuristics, but Felner et al. (2011) argue that the worst effects rarely happen in practice, and one shouldn’t be afraid of inconsistent heuristics.
Figure 3.20 Map of Romania showing contours at f = 380, f = 400, and f = 420, with Arad as the start state. Nodes inside a given contour have f = g + h costs less than or equal to the contour value.
With an inadmissible heuristic, A* may or may not be cost-optimal. Here are two cases where it is: First, if there is even one cost-optimal path on which h(n) is admissible for all nodes n on the path, then that path will be found, no matter what the heuristic says for states off the path. Second, if the optimal solution has cost C*, and the second-best has cost C₂, and if h(n) overestimates some costs, but never by more than C₂ − C*, then A* is guaranteed to return cost-optimal solutions.
3.5.3 Search contours
A useful way to visualize a search is to draw contours in the state space, just like the contours in a topographic map. Figure 3.20 shows an example. Inside the contour labeled 400, all nodes have f(n) = g(n) + h(n) ≤ 400, and so on. Then, because A* expands the frontier node of lowest f-cost, we can see that an A* search fans out from the start node, adding nodes in concentric bands of increasing f-cost.
With uniform-cost search, we also have contours, but of g-cost, not g + h. The contours with uniform-cost search will be “circular” around the start state, spreading out equally in all directions with no preference towards the goal. With A* search using a good heuristic, the g + h bands will stretch toward a goal state (as in Figure 3.20) and become more narrowly focused around an optimal path.
It should be clear that as you extend a path, the g costs are monotonic: the path cost always increases as you go along a path, because action costs are always positive.¹² Therefore you get concentric contour lines that don’t cross each other, and if you choose to draw the lines fine enough, you can put a line between any two nodes on any path.
12 Technically, we say “strictly monotonic” for costs that always increase, and “monotonic” for costs that never decrease, but might remain the same.
Chapter 3 Solving Problems by Searching
But it is not obvious whether the f = g + h cost will monotonically increase. As you extend a path from n to n′, the cost goes from g(n) + h(n) to g(n) + c(n, a, n′) + h(n′). Canceling out the g(n) term, we see that the path’s cost will be monotonically increasing if and only if h(n) ≤ c(n, a, n′) + h(n′); in other words if and only if the heuristic is consistent.¹³ But note that a path might contribute several nodes in a row with the same g(n) + h(n) score; this will happen whenever the decrease in h is exactly equal to the action cost just taken (for example, in a grid problem, when n is in the same row as the goal and you take a step towards the goal, g is increased by 1 and h is decreased by 1). If C* is the cost of the optimal solution path, then we can say the following:
• A* expands all nodes that can be reached from the initial state on a path where every node on the path has f(n) < C*. We say these are surely expanded nodes. A* might then expand some of the nodes right on the “goal contour” (where f(n) = C*) before selecting a goal node.
• A* expands no nodes with f(n) > C*.
We say that A* with a consistent heuristic is optimally efficient in the sense that any algorithm that extends search paths from the initial state, and uses the same heuristic information, must expand all nodes that are surely expanded by A* (because any one of them could have been part of an optimal solution). Among the nodes with f(n) = C*, one algorithm could get lucky and choose the optimal one first while another algorithm is unlucky; we don’t consider this difference in defining optimal efficiency.
A* is efficient because it prunes away search tree nodes that are not necessary for finding an optimal solution. In Figure 3.18(b) we see that Timisoara has f = 447 and Zerind has f = 449. Even though they are children of the root and would be among the first nodes expanded by uniform-cost or breadth-first search, they are never expanded by A* search because the solution with f = 418 is found first. The concept of pruning (eliminating possibilities from consideration without having to examine them) is important for many areas of AI.
That A* search is complete, cost-optimal, and optimally efficient among all such algorithms is rather satisfying. Unfortunately, it does not mean that A* is the answer to all our searching needs. The catch is that for many problems, the number of nodes expanded can be exponential in the length of the solution. For example, consider a version of the vacuum world with a super-powerful vacuum that can clean up any one square at a cost of 1 unit, without even having to visit the square; in that scenario, squares can be cleaned in any order. With N initially dirty squares, there are 2^N states where some subset has been cleaned; all of those states are on an optimal solution path, and hence satisfy f(n) < C*, so all of them would be visited by A*.
3.5.4 Satisficing search: Inadmissible heuristics and weighted A*
A* search has many good qualities, but it expands a lot of nodes. We can explore fewer nodes (taking less time and space) if we are willing to accept solutions that are suboptimal, but are “good enough”; we call these satisficing solutions. If we allow A* search to use an inadmissible heuristic (one that may overestimate), then we risk missing the optimal solution, but the heuristic can potentially be more accurate, thereby reducing the number of nodes expanded. For example, road engineers know the concept of a detour index, which is a multiplier applied to the straight-line distance to account for the typical curvature of roads. A detour index of 1.3 means that if two cities are 10 miles apart in straight-line distance, a good estimate of the best path between them is 13 miles. For most localities, the detour index ranges between 1.2 and 1.6.
13 In fact, the term “monotonic heuristic” is a synonym for “consistent heuristic.” The two ideas were developed independently, and then it was proved that they are equivalent (Pearl, 1984).
Figure 3.21 Two searches on the same grid: (a) an A* search and (b) a weighted A* search with weight W = 2. The gray bars are obstacles, the purple line is the path from the green start to red goal, and the small dots are states that were reached by each search. On this particular problem, weighted A* explores 7 times fewer states and finds a path that is 5% more costly.
We can apply this idea to any problem, not just ones involving roads, with an approach called weighted A* search where we weight the heuristic value more heavily, giving us the evaluation function f(n) = g(n) + W × h(n), for some W > 1.
Figure 3.21 shows a search problem on a grid world. In (a), an A* search finds the optimal solution, but has to explore a large portion of the state space to find it. In (b), a weighted A* search finds a solution that is slightly costlier, but the search is much faster. We see that the weighted search focuses the contour of reached states towards a goal. That means that fewer states are explored, but if the optimal path ever strays outside of the weighted search’s contour (as it does in this case), then the optimal path will not be found. In general, if the optimal solution costs C*, a weighted A* search will find a solution that costs somewhere between C* and W × C*; but in practice we usually get results much closer to C* than W × C*.
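As a rough illustration of the evaluation function f(n) = g(n) + W × h(n), the following Python sketch runs a best-first search on a small grid. The grid layout, the function names (weighted_a_star, grid_neighbors, manhattan), and the choice of Manhattan distance as the heuristic are assumptions made for this example and do not come from the text.

```python
import heapq

def weighted_a_star(start, goal, neighbors, h, w=1.0):
    """Best-first search ordered by f(n) = g(n) + w * h(n).

    Returns (path_cost, nodes_expanded), or (None, nodes_expanded) on failure.
    w = 1 gives ordinary A*; w > 1 weights the heuristic more heavily.
    """
    frontier = [(w * h(start), 0, start)]   # entries are (f, g, state)
    reached = {start: 0}                    # cheapest g found so far for each state
    expanded = 0
    while frontier:
        _, g, state = heapq.heappop(frontier)
        if g > reached[state]:
            continue                        # stale entry: a cheaper path was found later
        if state == goal:
            return g, expanded
        expanded += 1
        for nxt, cost in neighbors(state):
            g2 = g + cost
            if nxt not in reached or g2 < reached[nxt]:
                reached[nxt] = g2
                heapq.heappush(frontier, (g2 + w * h(nxt), g2, nxt))
    return None, expanded

# Hypothetical 6x6 grid with a wall at x == 3 that leaves a gap only at y == 5.
WALL = {(3, y) for y in range(5)}
START, GOAL = (0, 0), (5, 0)

def grid_neighbors(state):
    x, y = state
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < 6 and 0 <= ny < 6 and (nx, ny) not in WALL:
            yield (nx, ny), 1               # every step costs 1

def manhattan(state):
    return abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1])

optimal_cost, expanded_astar = weighted_a_star(START, GOAL, grid_neighbors, manhattan, w=1.0)
weighted_cost, expanded_weighted = weighted_a_star(START, GOAL, grid_neighbors, manhattan, w=2.0)
```

With W = 1 this sketch behaves as A*, so optimal_cost is C* for the toy grid; weighted_cost should fall between C* and W × C*. Comparing expanded_astar with expanded_weighted gives a feel for the trade-off on a particular problem, although a grid this small says little about typical savings.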
We have considered searches that evaluate states by combining g and h in various ways; weighted A* can be seen as a generalization of the others:
A* search:                 g(n) + h(n)        (W = 1)
Uniform-cost search:       g(n)               (W = 0)
Greedy best-first search:  h(n)               (W = ∞)
Weighted A* search:        g(n) + W × h(n)    (1 < W < ∞)

    if ΔE > 0 then current ← next
    else current ← next only with probability e^(−ΔE/T)
Figure 4.5 The simulated annealing algorithm, a version of stochastic hill climbing where some downhill moves are allowed. The schedule input determines the value of the “temperature” T as a function of time.
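The idea in Figure 4.5 can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: the sign convention (delta_e is taken as the value of next minus the value of current, so that a downhill move of size |ΔE| is accepted with probability e^(−|ΔE|/T)), the toy objective, the neighbor function, and the geometric cooling schedule are all choices made for this example.

```python
import math
import random

def simulated_annealing(initial, neighbors, value, schedule, rng):
    """Stochastic hill climbing that sometimes accepts downhill moves."""
    current = initial
    t = 1
    while True:
        T = schedule(t)
        if T <= 0:
            return current
        nxt = rng.choice(neighbors(current))
        delta_e = value(nxt) - value(current)   # > 0 means an uphill (improving) move
        # Uphill moves are always taken; other moves with probability e^(delta_e / T),
        # which shrinks as the move gets worse or the temperature gets lower.
        if delta_e > 0 or rng.random() < math.exp(delta_e / T):
            current = nxt
        t += 1

# Toy problem (an assumption for illustration): maximize -(x - 3)^2 over the integers.
def value(x):
    return -(x - 3) ** 2

def neighbors(x):
    return [x - 1, x + 1]

def schedule(t):
    T = 10.0 * 0.95 ** t           # geometric cooling, chosen arbitrarily
    return T if T > 1e-3 else 0    # treat very small temperatures as zero

result = simulated_annealing(0, neighbors, value, schedule, random.Random(0))
```

Passing an explicit random.Random instance keeps the run repeatable; the schedule here is only one of many possible choices.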
all the probability is concentrated on the global maxima, which the algorithm will find with probability approaching 1.
Simulated annealing was used to solve VLSI layout problems beginning in the 1980s. It has been applied widely to factory scheduling and other large-scale optimization tasks.
4.1.3 Local beam search
Keeping just one node in memory might seem to be an extreme reaction to the problem of memory limitations. The local beam search algorithm keeps track of k states rather than just one. It begins with k randomly generated states. At each step, all the successors of all k states are generated. If any one is a goal, the algorithm halts. Otherwise, it selects the k best successors from the complete list and repeats.
At first sight, a local beam search with k states might seem to be nothing more than running k random restarts in parallel instead of in sequence. In fact, the two algorithms are quite different. In a random-restart search, each search process runs independently of the others. In a local beam search, useful information is passed among the parallel search threads. In effect, the states that generate the best successors say to the others, “Come over
2, which occurs
only rarely in nature but is easy enough to simulate on computers.
• The selection process for selecting the individuals who will become the parents of the next generation: one possibility is to select from all individuals with probability proportional to their fitness score. Another possibility is to randomly select n individuals (n > p), and then select the p most fit ones as parents.
• The recombination procedure. One common approach (assuming p = 2), is to randomly select a crossover point to split each of the parent strings, and recombine the parts to form two children, one with the first part of parent 1 and the second part of parent 2; the other with the second part of parent 1 and the first part of parent 2.
• The mutation rate, which determines how often offspring have random mutations to their representation. Once an offspring has been generated, every bit in its composition is flipped with probability equal to the mutation rate.
• The makeup of the next generation. This can be just the newly formed offspring, or it can include a few top-scoring parents from the previous generation (a practice called elitism, which guarantees that overall fitness will never decrease over time). The practice of culling, in which all individuals below a given threshold are discarded, can lead to a speedup (Baum et al., 1995).
Figure 4.6(a) shows a population of four 8-digit strings, each representing a state of the 8-queens puzzle: the cth digit represents the row number of the queen in column c. In (b), each state is rated by the fitness function. Higher fitness values are better, so for the 8-queens problem we use the number of nonattacking pairs of queens, which has a value of 8 × 7/2 = 28 for a solution. The values of the four states in (b) are 24, 23, 20, and 11. The fitness scores are then normalized to probabilities, and the resulting values are shown next to the fitness values in (b).
Section 4.1 Local Search and Optimization Problems
Figure 4.7 The 8-queens states corresponding to the first two parents in Figure 4.6(c) and the first offspring in Figure 4.6(d). The green columns are lost in the crossover step and the red columns are retained. (To interpret the numbers in Figure 4.6: row 1 is the bottom row, and 8 is the top row.)
In (c), two pairs of parents are selected, in accordance with the probabilities in (b). Notice that one individual is selected twice and one not at all. For each selected pair, a crossover point (dotted line) is chosen randomly. In (d), we cross over the parent strings at the crossover points, yielding new offspring. For example, the first child of the first pair gets the first three digits (327) from the first parent and the remaining digits (48552) from the second parent. The 8-queens states involved in this recombination step are shown in Figure 4.7.
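The fitness, normalization, and crossover steps just described can be sketched in Python as below. The function names are invented for this example, and the two full parent strings are assumptions chosen only so that they agree with the digits quoted in the text (327 from the first parent, 48552 from the second).

```python
from itertools import combinations

def nonattacking_pairs(state):
    """Fitness for 8-queens: the number of pairs of queens that do not attack each other.

    state is a string of digits; the c-th digit is the row of the queen in column c.
    """
    rows = [int(ch) for ch in state]
    count = 0
    for (c1, r1), (c2, r2) in combinations(enumerate(rows), 2):
        same_row = r1 == r2
        same_diagonal = abs(r1 - r2) == abs(c1 - c2)
        if not (same_row or same_diagonal):
            count += 1
    return count

def normalize(scores):
    """Turn raw fitness scores into selection probabilities that sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

def crossover(parent1, parent2, point):
    """Return the two children formed by swapping the tails after `point` digits."""
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

# Fitness values quoted in the text for the four states in Figure 4.6(b).
selection_probabilities = normalize([24, 23, 20, 11])

# Assumed parent strings (see the note above); the crossover point is 3.
first_child, second_child = crossover("32752411", "24748552", 3)
```

Because one queen is placed per column, only shared rows and shared diagonals need to be checked when counting attacks.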
Finally, in (e), each location in each string is subject to random mutation with a small independent probability. One digit was mutated in the first, third, and fourth offspring. In the 8-queens problem, this corresponds to choosing a queen at random and moving it to a random square in its column. It is often the case that the population is diverse early on in the process, so crossover frequently takes large steps in the state space early in the search process (as in simulated annealing). After many generations of selection towards higher fitness, the population becomes less diverse, and smaller steps are typical. Figure 4.8 describes an algorithm that implements all these steps.
Genetic algorithms are similar to stochastic beam search, but with the addition of the crossover operation. This is advantageous if there are blocks that perform useful functions. For example, it could be that putting the first three queens in positions 2, 4, and 6 (where they do not attack each other) constitutes a useful block that can be combined with other useful blocks that appear in other individuals to construct a solution. It can be shown mathematically that, if the blocks do not serve a purpose (for example, if the positions of the genetic code are randomly permuted), then crossover conveys no advantage.
The theory of genetic algorithms explains how this works using the idea of a schema, which is a substring in which some of the positions can be left unspecified. For example, the schema 246***** describes all 8-queens states in which the first three queens are in positions 2, 4, and 6, respectively. Strings that match the schema (such as 24613578) are called instances of the schema. It can be shown that if the average fitness of the instances of a schema is above the mean, then the number of instances of the schema will grow over time.
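To make the schema idea concrete, here is a small Python sketch that tests whether a string is an instance of a schema and counts instances in a population. The helper names and the sample population are hypothetical; only the schema 246***** and the instance 24613578 come from the text.

```python
def matches_schema(schema, string, wildcard="*"):
    """True if `string` is an instance of `schema`; wildcard positions match any symbol."""
    return len(schema) == len(string) and all(
        s == wildcard or s == ch for s, ch in zip(schema, string)
    )

def count_instances(schema, population):
    """Number of individuals in the population that are instances of the schema."""
    return sum(matches_schema(schema, individual) for individual in population)

is_instance = matches_schema("246*****", "24613578")

# A made-up population, used only to exercise count_instances.
sample_population = ["24613578", "24748552", "24657813"]
n_instances = count_instances("246*****", sample_population)
```

Tracking count_instances from one generation to the next is one simple way to observe whether a schema is spreading through a population.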
Chapter 4 Search in Complex Environments
Evolution and Search
The theory of evolution was developed by Charles Darwin in On the Origin of Species by Means of Natural Selection (1859) and independently by Alfred Russel Wallace (1858). The central idea is simple: variations occur in reproduction and will be preserved in successive generations approximately in proportion to their effect on reproductive fitness.
Darwin’s theory was developed with no knowledge of how the traits of organisms can be inherited and modified. The probabilistic laws governing these processes were first identified by Gregor Mendel (1866), a monk who experimented with sweet peas. Much later, Watson and Crick (1953) identified the structure of the DNA molecule and its alphabet, AGTC (adenine, guanine, thymine, cytosine). In the standard model, variation occurs both by point mutations in the letter sequence and by “crossover” (in which the DNA of an offspring is generated by combining long sections of DNA from each parent).
The analogy to local search algorithms has already been described; the principal difference between stochastic beam search and evolution is the use of sexual reproduction, wherein successors are generated from multiple individuals rather than just one. The actual mechanisms of evolution are, however, far richer than most genetic algorithms allow. For example, mutations can involve reversals, duplications, and movement of large chunks of DNA; some viruses borrow DNA from one organism and insert it into another; and there are transposable genes that do nothing but copy themselves many thousands of times within the genome. There are even genes that poison cells from potential mates that do not carry the gene, thereby increasing their own chances of replication. Most important is the fact that the genes themselves encode the mechanisms whereby the genome is reproduced and translated into an organism. In genetic algorithms, those mechanisms
are a separate program that is not represented within the strings being manipulated. Darwinian evolution may appear inefficient, having generated blindly some 10" or so organisms without improving its search heuristics one iota. But learning does play a role in evolution.
Although the otherwise great French naturalist Jean Lamarck (1809) was wrong to propose that traits acquired by adaptation during an organism’s lifetime would be passed on to its offspring, James Baldwin’s (1896) superficially similar theory is correct: learning can effectively relax the fitness landscape, leading to an acceleration in the rate of evolution. An organism that has a trait that is not quite adaptive for its environment will pass on the trait if it also has enough plasticity to learn to adapt to the environment in a way that is beneficial. Computer simulations (Hinton and Nowlan, 1987) confirm that this Baldwin effect is real, and that a consequence is that things that are hard to learn end up in the genome, but things that are easy to learn need not reside there (Morgan and Griffiths, 2015).
Section 4.2
Local Search in Continuous Spaces
function GENETIC-ALGORITHM(population, fitness) returns an individual
  repeat
    weights ← WEIGHTED-BY(population, fitness)
    population2 ← empty list
    for i = 1 to SIZE(population) do
      parent1, parent2 ← WEIGHTED-RANDOM-CHOICES(population, weights, 2)
      child ← REPRODUCE(parent1, parent2)
      if (small random probability) then child ← MUTATE(child)
      add child to population2
    population ← population2
  until some individual is fit enough, or enough time has elapsed
  return the best individual in population, according to fitness

function REPRODUCE(parent1, parent2) returns an individual
  n ← LENGTH(parent1)
  c ← random number from 1 to n
  return APPEND(SUBSTRING(parent1, 1, c), SUBSTRING(parent2, c + 1, n))

Figure 4.8 A genetic algorithm. Within the function, population is an ordered list of individuals, weights is a list of corresponding fitness values for each individual, and fitness is a function to compute these values.
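One way the pseudocode in Figure 4.8 might be rendered in Python is sketched below. It is not a definitive implementation: the stopping parameters (generations, good_enough), the mutation_probability default, the bit-flip mutate_bit operator, and the toy bit-string fitness are all assumptions introduced for this example.

```python
import random

def reproduce(parent1, parent2, rng):
    """Single-point crossover: a prefix of parent1 followed by the rest of parent2."""
    n = len(parent1)
    c = rng.randint(1, n)                  # crossover point, from 1 to n inclusive
    return parent1[:c] + parent2[c:]

def mutate_bit(child, rng):
    """Flip one randomly chosen bit (a hypothetical MUTATE for bit strings)."""
    i = rng.randrange(len(child))
    flipped = "1" if child[i] == "0" else "0"
    return child[:i] + flipped + child[i + 1:]

def genetic_algorithm(population, fitness, mutate, rng,
                      generations=100, mutation_probability=0.1, good_enough=None):
    """Evolve the population; return the fittest individual of the final generation."""
    for _generation in range(generations):
        if good_enough is not None and max(map(fitness, population)) >= good_enough:
            break                          # some individual is fit enough
        weights = [fitness(individual) for individual in population]
        next_population = []
        for _slot in range(len(population)):
            # Fitness-proportional selection of two parents (with replacement).
            parent1, parent2 = rng.choices(population, weights=weights, k=2)
            child = reproduce(parent1, parent2, rng)
            if rng.random() < mutation_probability:
                child = mutate(child, rng)
            next_population.append(child)
        population = next_population
    return max(population, key=fitness)

# Toy fitness (an assumption): 1 + number of 1-bits, so that every weight is positive.
def ones_fitness(individual):
    return 1 + individual.count("1")

rng = random.Random(0)
start = ["".join(rng.choice("01") for _ in range(12)) for _ in range(20)]
best = genetic_algorithm(start, ones_fitness, mutate_bit, rng, good_enough=13)
```

The fitness values are used directly as selection weights, so they must be nonnegative and not all zero; that is why the toy fitness adds 1.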
Clearly, this effect is unlikely to be significant if adjacent bits are totally unrelated to each other, because then there will be few contiguous blocks that provide a consistent benefit. Genetic algorithms work best when schemas correspond to meaningful components of a solution. For example, if the string is a representation of an antenna, then the schemas may represent components of the antenna, such as reflectors and deflectors. A good component is likely to be good in a variety of different designs. This suggests that successful use of genetic algorithms requires careful engineering of the representation.
In practice, genetic algorithms have their place within the broad landscape of optimization methods (Marler and Arora, 2004), particularly for complex structured problems such as circuit layout or job-shop scheduling, and more recently for evolving the architecture of deep neural networks (Miikkulainen et al., 2019). It is not clear how much of the appeal of genetic algorithms arises from their superiority on specific tasks, and how much from the appealing metaphor of evolution.
4.2 Local Search in Continuous Spaces
In Chapter 2, we explained the distinction between discrete and continuous environments, pointing out that most real-world environments are continuous. A continuous action space has an infinite branching factor, and thus can’t be handled by most of the algorithms we have covered so far (with the exception of first-choice hill climbing and simulated annealing).
This section provides a very brief introduction to some local search techniques for continuous spaces. The literature on this topic is vast; many of the basic techniques originated in the 17th century, after the development of calculus by Newton and Leibniz.² We find uses for these techniques in several places in this book, including the chapters on learning, vision, and robotics.
We begin with an example. Suppose we want to place three new airports anywhere in Romania, such that the sum of squared straight-line distances from each city on the map to its nearest airport is minimized. (See Figure 3.1 for the map of Romania.) The state space is then defined by the coordinates of the three airports: (x_1, y_1), (x_2, y_2), and (x_3, y_3). This is a six-dimensional space; we also say that states are defined by six variables. In general, states are defined by an n-dimensional vector of variables, x. Moving around in this space corresponds to moving one or more of the airports on the map. The objective function f(x) = f(x_1, y_1, x_2, y_2, x_3, y_3) is relatively easy to compute for any particular state once we compute the closest cities. Let C_i be the set of cities whose closest airport (in the state x) is airport i. Then, we have
\[ f(\mathbf{x}) = f(x_1, y_1, x_2, y_2, x_3, y_3) = \sum_{i=1}^{3} \sum_{c \in C_i} (x_i - x_c)^2 + (y_i - y_c)^2. \qquad (4.1) \]
This equation is correct not only for the state x but also for states in the local neighborhood of x. However, it is not correct globally; if we stray too far from x (by altering the location of one or more of the airports by a large amount) then the set of closest cities for that airport changes, and we need to recompute C_i.
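The objective in Equation (4.1) could be computed along the following lines. This is a sketch with made-up coordinates rather than the real map of Romania, and the helper names (closest_airport, city_sets, objective) are invented for the example.

```python
def closest_airport(city, airports):
    """Index of the airport nearest to `city` (by squared Euclidean distance)."""
    cx, cy = city
    return min(range(len(airports)),
               key=lambda i: (airports[i][0] - cx) ** 2 + (airports[i][1] - cy) ** 2)

def city_sets(airports, cities):
    """C_i for each airport i: the cities whose closest airport is airport i."""
    sets = [[] for _ in airports]
    for city in cities:
        sets[closest_airport(city, airports)].append(city)
    return sets

def objective(airports, cities):
    """Sum of squared distances from each city to its nearest airport (Equation 4.1)."""
    total = 0.0
    for (ax, ay), members in zip(airports, city_sets(airports, cities)):
        total += sum((ax - cx) ** 2 + (ay - cy) ** 2 for cx, cy in members)
    return total

# Made-up coordinates, used only to exercise the functions.
cities = [(0, 0), (2, 0), (10, 0), (10, 4)]
airports = [(1, 0), (10, 2), (50, 50)]
assignment = city_sets(airports, cities)
value = objective(airports, cities)
```

Recomputing city_sets on every call is what keeps the value correct globally; the closed-form expression with fixed C_i is only valid near the current state.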
One way to deal with a continuous state space is to discretize it. For example, instead of allowing the (x_i, y_i) locations to be any point in continuous two-dimensional space, we could limit them to fixed points on a rectangular grid with spacing of size δ (delta). Then instead of having an infinite number of successors, each state in the space would have only 12 successors, corresponding to incrementing one of the 6 variables by ±δ. We can then apply any of our local search algorithms to this discrete space. Alternatively, we could make the branching factor finite by sampling successor states randomly, moving in a random direction by a small amount, δ. Methods that measure progress by the change in the value of the objective function between two nearby points are called empirical gradient methods. Empirical gradient search is the same as steepest-ascent hill climbing in a discretized version of the state space. Reducing the value of δ over time can give a more accurate solution, but does not necessarily converge to a global optimum in the limit.
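The discretization can be sketched as follows: each variable is changed by +δ or −δ to generate successors, and a hill-climbing loop then moves to the best one. The function names and the toy objective are assumptions; because the airport problem is a minimization, the loop below descends rather than ascends.

```python
def successors(state, delta):
    """All states reached by changing exactly one variable by +delta or -delta."""
    result = []
    for k in range(len(state)):
        for step in (delta, -delta):
            nxt = list(state)
            nxt[k] += step
            result.append(tuple(nxt))
    return result

def hill_climb_min(f, state, delta):
    """Steepest-descent hill climbing on the grid of spacing delta (minimizes f)."""
    while True:
        best = min(successors(state, delta), key=f)
        if f(best) >= f(state):
            return state           # no neighbor improves: a local minimum on the grid
        state = best

# Six variables (three airports) give 2 * 6 = 12 successors.
n_successors = len(successors((0.0, 0.0, 1.0, 1.0, 2.0, 2.0), 0.5))

# Toy objective (an assumption): a simple bowl with its minimum at the origin.
def bowl(state):
    return sum(v * v for v in state)

local_min = hill_climb_min(bowl, (1.0, -2.0), 0.5)
```

Any objective that accepts a tuple of variables can be passed in place of bowl, including an airport objective over six coordinates.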
Often we have an objective function expressed in a mathematical form such that we can use calculus to solve the problem analytically rather than empirically. Many methods attempt to use the gradient of the landscape to find a maximum. The gradient of the objective function is a vector ∇f that gives the magnitude and direction of the steepest slope. For our problem, we have
\[ \nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial y_1}, \frac{\partial f}{\partial x_2}, \frac{\partial f}{\partial y_2}, \frac{\partial f}{\partial x_3}, \frac{\partial f}{\partial y_3} \right). \]
In some cases, we can find a maximum by solving the equation ∇f = 0. (This could be done, for example, if we were placing just one airport; the solution is the arithmetic mean of all the cities’ coordinates.) In many cases, however, this equation cannot be solved in closed form. For example, with three airports, the expression for the gradient depends on what cities are closest to each airport in the current state. This means we can compute the gradient locally (but not globally); for example,
\[ \frac{\partial f}{\partial x_1} = 2 \sum_{c \in C_1} (x_1 - x_c). \qquad (4.2) \]
2 Knowledge of vectors, matrices, and derivatives is useful for this section (see Appendix A).
Given a locally correct expression for the gradient, we can perform steepest-ascent hill climbing by updating the current state according to the formula
\[ \mathbf{x} \leftarrow \mathbf{x} + \alpha \nabla f(\mathbf{x}), \]
where α (alpha) is a small constant often called the step size. There exist a huge variety of methods for adjusting α. The basic problem is that if α is too small, too many steps are needed; if α is too large, the search could overshoot the maximum.
The technique of line search tries to overcome this dilemma by extending the current gradient direction (usually by repeatedly doubling α) until f starts to decrease again. The point at which this occurs becomes the new current state. There are several schools of thought about how the new direction should be chosen at this point.
For many problems, the most effective algorithm is the venerable Newton-Raphson method. This is a general technique for finding roots of functions, that is, solving equations of the form g(x) = 0. It works by computing a new estimate for the root x according to Newton’s formula
\[ x \leftarrow x - g(x)/g'(x). \]
To find a maximum or minimum of f, we need to find x such that the gradient is a zero vector (i.e., ∇f(x) = 0). Thus, g(x) in Newton’s formula becomes ∇f(x), and the update equation can be written in matrix-vector form as
\[ \mathbf{x} \leftarrow \mathbf{x} - \mathbf{H}_f^{-1}(\mathbf{x}) \nabla f(\mathbf{x}), \]
where H_f(x) is the Hessian matrix of second derivatives, whose elements H_ij are given by ∂²f/∂x_i∂x_j. For our airport example, we can see from Equation (4.2) that H_f(x) is particularly simple: the off-diagonal elements are zero and the diagonal elements for airport i are just twice the number of cities in C_i. A moment’s calculation shows that one step of the update moves airport i directly to the centroid of C_i, which is the minimum of the local expression for f from Equation (4.1).³ For high-dimensional problems, however, computing the n² entries of the Hessian and inverting it may be expensive, so many approximate versions of the Newton-Raphson method have been developed.
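The claim that one Newton-Raphson step moves each airport to the centroid of its cities can be checked numerically. The sketch below uses the locally valid gradient from Equation (4.2) and the diagonal Hessian described above; the coordinates, the use of two airports instead of three, and the function names are assumptions made to keep the example short.

```python
def city_sets(airports, cities):
    """C_i: the cities whose nearest airport is airport i."""
    sets = [[] for _ in airports]
    for cx, cy in cities:
        i = min(range(len(airports)),
                key=lambda j: (airports[j][0] - cx) ** 2 + (airports[j][1] - cy) ** 2)
        sets[i].append((cx, cy))
    return sets

def gradient(airports, cities):
    """Locally valid gradient: for airport i, 2 * sum over c in C_i of (x_i - x_c, y_i - y_c)."""
    grads = []
    for (ax, ay), members in zip(airports, city_sets(airports, cities)):
        gx = 2 * sum(ax - cx for cx, _ in members)
        gy = 2 * sum(ay - cy for _, cy in members)
        grads.append((gx, gy))
    return grads

def newton_step(airports, cities):
    """One Newton-Raphson update; the Hessian is diagonal with entries 2 * |C_i|."""
    updated = []
    sets = city_sets(airports, cities)
    for (ax, ay), (gx, gy), members in zip(airports, gradient(airports, cities), sets):
        if not members:                     # no cities assigned: leave the airport in place
            updated.append((float(ax), float(ay)))
            continue
        h = 2 * len(members)                # diagonal Hessian entry for this airport
        updated.append((ax - gx / h, ay - gy / h))
    return updated

# Made-up coordinates with two airports, for brevity.
cities = [(1, 1), (3, 1), (9, 9), (11, 13)]
airports = [(0, 0), (10, 10)]
moved = newton_step(airports, cities)
```

In this example the first airport lands on the mean of (1, 1) and (3, 1) and the second on the mean of (9, 9) and (11, 13). After such a step the sets C_i may change, so the update generally has to be repeated.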
Local search methods suffer from local maxima, ridges, and plateaus in continuous state spaces just as much as in discrete spaces. Random restarts and simulated annealing are often helpful. High-dimensional continuous spaces are, however, big places in which it is very easy to get lost.
A final topic is constrained optimization. An optimization problem is constrained if solutions must satisfy some hard constraints on the values of the variables. For example, in our airport-siting problem, we might constrain sites to be inside Romania and on dry land (rather than in the middle of lakes). The difficulty of constrained optimization problems depends on the nature of the constraints and the objective function. The best-known category is that of linear programming problems, in which constraints must be linear inequalities forming a convex set⁴ and the objective function is also linear. The time complexity of linear programming is polynomial in the number of variables.
Linear programming is probably the most widely studied and broadly useful method for optimization. It is a special case of the more general problem of convex optimization, which allows the constraint region to be any convex region and the objective to be any function that is convex within the constraint region. Under certain conditions, convex optimization problems are also polynomially solvable and may be feasible in practice with thousands of variables. Several important problems in machine learning and control theory can be formulated as convex optimization problems (see Chapter 20).
3 In general, the Newton-Raphson update can be seen as fitting a quadratic surface to f at x and then moving directly to the minimum of that surface, which is also the minimum of f if f is quadratic.
4.3 Search with Nondeterministic Actions
In Chapter 3, we assumed a fully observable, deterministic, known environment. Therefore, an agent can observe the initial state, calculate a sequence of actions that reach the goal, and execute the actions with its “eyes closed,” never having to use its percepts.
When the environment is partially observable, however, the agent doesn’t know for sure what state it is in; and when the environment is nondeterministic, the agent doesn’t know what state it transitions to after taking an action. That means that rather than thinking “I’m in state s_1 and if I do action a I’ll end up in state s_8,” an agent will now be thinking “I’m either in state s_1 or s_3, and if I do action a I’ll end up in state s_2, s_4 or s_5.” We call a set of physical states that the agent believes are possible a belief state.
In partially observable and nondeterministic environments, the solution to a problem is no longer a sequence, but rather a conditional plan (sometimes called a contingency plan or a strategy) that specifies what to do depending on what percepts the agent receives while executing the plan. We examine nondeterminism in this section and partial observability in the next.
4.3.1 The erratic vacuum world
The vacuum world from Chapter 2 has eight states, as shown in Figure 4.9. There are three actions (Right, Left, and Suck) and the goal is to clean up all the dirt (states 7 and 8). If the environment is fully observable, deterministic, and completely known, then the problem is easy to solve with any of the algorithms in Chapter 3, and the solution is an action sequence. For example, if the initial state is 1, then the action sequence [Suck, Right, Suck] will reach a goal state, 8.
Figure 4.9 The eight possible states of the vacuum world; states 7 and 8 are goal states.
Now suppose that we introduce nondeterminism in the form of a powerful but erratic vacuum cleaner. In the erratic vacuum world, the Suck action works as follows:
• When applied to a dirty square the action cleans the square and sometimes cleans up dirt in an adjacent square, too.
• When applied to a clean square the action sometimes deposits dirt on the carpet.⁵
To provide a precise formulation of this problem, we need to generalize the notion of a transition model from Chapter 3. Instead of defining the transition model by a RESULT function that returns a single outcome state, we use a RESULTS function that returns a set of possible outcome states. For example, in the erratic vacuum world, the Suck action in state 1 cleans up either just the current location, or both locations:
    RESULTS(1, Suck) = {5, 7}
If we start in state 1, no single sequence of actions solves the problem, but the following conditional plan does:
    [Suck, if State = 5 then [Right, Suck] else []]    (4.3)
Here we see that a conditional plan can contain if-then-else steps; this means that solutions are trees rather than sequences. Here the conditional in the if statement tests to see what the current state is; this is something the agent will be able to observe at runtime, but doesn’t know at planning time. Alternatively, we could have had a formulation that tests the percept rather than the state. Many problems in the real, physical world are contingency problems, because exact prediction of the future is impossible. For this reason, many people keep their eyes open while walking around.
4 A set of points S is convex if the line joining any two points in S is also contained in S. A convex function is one for which the space “above” it forms a convex set; by definition, convex functions have no local (as opposed to global) minima.
5 We assume that most readers face similar problems and can sympathize with our agent. We apologize to owners of modern, efficient cleaning appliances who cannot take advantage of this pedagogical device.
4.3.2 AND-OR search trees
How do we find these contingent solutions to nondeterministic problems? As in Chapter 3, we begin by constructing search trees, but here the trees have a different character. In a deterministic environment, the only branching is introduced by the agent’s own choices in each state: I can do this action or that action. We call these nodes OR nodes. In the vacuum world, for example, at an OR node the agent chooses Left or Right or Suck. In a nondeterministic environment, branching is also introduced by the environment’s choice of outcome for each action. We call these nodes AND nodes. For example, the Suck action in state 1 results in the belief state {5, 7}, so the agent would need to find a plan for state 5 and for state 7. These two kinds of nodes alternate, leading to an AND-OR tree as illustrated in Figure 4.10.
Figure 4.10 The first two levels of the search tree for the erratic vacuum world. State nodes are OR nodes where some action must be chosen. At the AND nodes, shown as circles, every outcome must be handled, as indicated by the arc linking the outgoing branches. The solution found is shown in bold lines.
A solution for an ANDOR search problem is a subtree of the complete search tree that
(1) has a goal node at every leaf, (2) specifies one action at each of its OR nodes, and (3) includes every outcome branch at each of its AND nodes. The solution is shown in bold lines
in the figure; it corresponds to the plan given in Equation (4.3).
Figure 4.11 gives a recursive, depth-first algorithm for AND-OR graph search. One key aspect of the algorithm is the way in which it deals with cycles, which often arise in nondeterministic problems (e.g., if an action sometimes has no effect or if an unintended effect can be corrected). If the current state is identical to a state on the path from the root, then it returns with failure. This doesn’t mean that there is no solution from the current state; it simply means that if there is a noncyclic solution, it must be reachable from the earlier incarnation of the current state, so the new incarnation can be discarded. With this check, we ensure that the algorithm terminates in every finite state space, because every path must reach a goal, a dead end, or a repeated state. Notice that the algorithm does not check whether the current state is a repetition of a state on some other path from the root, which is important for efficiency.
AND-OR graphs can be explored either breadth-first or best-first. The concept of a heuristic function must be modified to estimate the cost of a contingent solution rather than a sequence, but the notion of admissibility carries over and there is an analog of the A* algorithm for finding optimal solutions. (See the bibliographical notes at the end of the chapter.)
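The recursive depth-first scheme of Figure 4.11 can be sketched in Python as follows. This is only an illustrative sketch, not the book's code: the class name `ErraticVacuum`, the tuple encoding of states as (location, left dirty, right dirty), and the nested list/dictionary encoding of conditional plans are all assumptions made for the example. The erratic Suck model is our reading of the erratic vacuum world (Suck on a dirty square may also clean the other square; Suck on a clean square may deposit dirt).

```python
def and_or_search(problem):
    """Return a conditional plan, or None on failure."""
    return or_search(problem, problem.initial, [])

def or_search(problem, state, path):
    if problem.is_goal(state):
        return []                          # the empty plan
    if state in path:                      # repeated state on the current path
        return None
    for action in problem.actions(state):
        plan = and_search(problem, problem.results(state, action), [state] + path)
        if plan is not None:
            return [action, plan]
    return None

def and_search(problem, states, path):
    plans = {}
    for s in states:                       # every outcome must be handled
        plan = or_search(problem, s, path)
        if plan is None:
            return None
        plans[s] = plan
    return plans                           # maps each outcome state to its subplan

class ErraticVacuum:
    """Hypothetical encoding: state = (location, left_dirty, right_dirty)."""
    initial = ("L", True, True)            # state 1: agent on the left, both squares dirty

    def is_goal(self, s):
        return not s[1] and not s[2]

    def actions(self, s):
        return ["Suck", "Right", "Left"]

    def results(self, s, action):
        loc, dirt_l, dirt_r = s
        if action == "Left":
            return {("L", dirt_l, dirt_r)}
        if action == "Right":
            return {("R", dirt_l, dirt_r)}
        here_dirty = dirt_l if loc == "L" else dirt_r
        if here_dirty:                     # cleans here, sometimes the other square too
            only_here = ("L", False, dirt_r) if loc == "L" else ("R", dirt_l, False)
            return {only_here, (loc, False, False)}
        # Suck on a clean square sometimes deposits dirt
        dirtied = ("L", True, dirt_r) if loc == "L" else ("R", dirt_l, True)
        return {s, dirtied}

plan = and_or_search(ErraticVacuum())
print(plan)
```

In this encoding a plan is either the empty list or a pair of an action and a dictionary from outcome states to subplans, so the dictionary plays the role of the if-then-else construct in the figure.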
Section 4.3
Search with Nondeterministic Actions
function AND-OR-SEARCH(problem) returns a conditional plan, or failure
    return OR-SEARCH(problem, problem.INITIAL, [])

function OR-SEARCH(problem, state, path) returns a conditional plan, or failure
    if problem.IS-GOAL(state) then return the empty plan
    if IS-CYCLE(path) then return failure
    for each action in problem.ACTIONS(state) do
        plan ← AND-SEARCH(problem, RESULTS(state, action), [state] + path)
        if plan ≠ failure then return [action] + plan
    return failure

function AND-SEARCH(problem, states, path) returns a conditional plan, or failure
    for each s_i in states do
        plan_i ← OR-SEARCH(problem, s_i, path)
        if plan_i = failure then return failure
    return [if s_1 then plan_1 else if s_2 then plan_2 else ... if s_{n-1} then plan_{n-1} else plan_n]
Figure 4.11 An algorithm for searching AND-OR graphs generated by nondeterministic environments. A solution is a conditional plan that considers every nondeterministic outcome and makes a plan for each one.
4.3.3 Try, try again
Consider a slippery vacuum world, which is identical to the ordinary (non-erratic) vacuum world except that movement actions sometimes fail, leaving the agent in the same location. For example, moving Right in state 1 leads to the belief state {1,2}. Figure 4.12 shows part of the search graph; clearly, there are no longer any acyclic solutions from state 1, and AND-OR-SEARCH would return with failure. There is, however, a cyclic solution, which is to keep trying Right until it works. We can express this with a new while construct:
    [Suck, while State = 5 do Right, Suck]
or by adding a label to denote some portion of the plan and referring to that label later:
    [Suck, L1 : Right, if State = 5 then L1 else Suck].
When is a cyclic plan a solution? A minimum condition is that every leaf is a goal state and
that a leaf is reachable from every point in the plan. In addition to that, we need to consider the cause of the nondeterminism. If it is really the case that the vacuum robot’s drive mechanism
works some of the time, but randomly and independently slips on other occasions, then the
agent can be confident that if the action is repeated enough times, eventually it will work and the plan will succeed.
But if the nondeterminism is due to some unobserved fact about the
robot or environment—perhaps a drive belt has snapped and the robot will never move—then repeating the action will not help. One way to understand this decision is to say that the initial problem formulation (fully observable, nondeterministic) is abandoned in favor of a different formulation (partially observable, deterministic) where the failure of the cyclic plan is attributed to an unobserved
property of the drive belt. In Chapter 12 we discuss how to decide which of several uncertain possibilities is more likely.
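To make the distinction concrete, here is a small simulation of the cyclic plan [Suck, while State = 5 do Right, Suck] starting in state 1. It is a sketch under stated assumptions: the function name `run_cyclic_plan`, the `slip_probability` parameter, and the `max_tries` cutoff are all hypothetical, and slipping is modeled as an independent random event on each attempt. A slip probability of 1.0 stands in for the snapped drive belt.

```python
import random

def run_cyclic_plan(slip_probability, rng, max_tries=10_000):
    """Execute [Suck, while State = 5 do Right, Suck] from state 1.

    Returns (final_state, number_of_Right_attempts). The cutoff max_tries
    exists only so that the simulation itself always terminates.
    """
    state = 5                                # Suck in state 1 (deterministic here)
    tries = 0
    while state == 5:
        if tries == max_tries:
            return state, tries              # gave up: the plan never left state 5
        tries += 1
        if rng.random() >= slip_probability: # the move succeeded this time
            state = 6
    state = 8                                # Suck in state 6 reaches a goal state
    return state, tries

print(run_cyclic_plan(0.5, random.Random(0)))                 # random, independent slips
print(run_cyclic_plan(1.0, random.Random(0), max_tries=100))  # the robot can never move
```

With independent slips the loop exits after finitely many attempts with probability 1, whereas with a permanently broken mechanism no number of repetitions helps, which is exactly the case where the cyclic plan is not a solution.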
Figure 4.12 Part of the search graph for a slippery vacuum world, where we have shown (some) cycles explicitly. All solutions for this problem are cyclic plans because there is no way to move reliably.
4.4 Search in Partially Observable Environments
We now turn to the problem of partial observability, where the agent’s percepts are not enough to pin down the exact state. That means that some of the agent’s actions will be aimed at reducing uncertainty about the current state.
4.4.1 Searching with no observation
When the agent’s percepts provide no information at all, we have what is called a sensorless
problem (or a conformant problem). At first, you might think the sensorless agent has no hope of solving a problem if it has no idea what state it starts in, but sensorless solutions are
surprisingly common and useful, primarily because they don’t rely on sensors working properly. In manufacturing systems, for example, many ingenious methods have been developed for orienting parts correctly from an unknown initial position by using a sequence of actions
with no sensing at all. Sometimes a sensorless plan is better even when a conditional plan
with sensing is available. For example, doctors often prescribe a broad-spectrum antibiotic
rather than using the conditional plan of doing a blood test, then waiting for the results to
come back, and then prescribing a more specific antibiotic. The sensorless plan saves time and money, and avoids the risk of the infection worsening before the test results are available.
Consider a sensorless version of the (deterministic) vacuum world. Assume that the agent knows the geography of its world, but not its own location or the distribution of dirt. In that case, its initial belief state is {1,2,3,4,5,6,7,8} (see Figure 4.9). Now, if the agent moves Right it will be in one of the states {2,4,6,8}: the agent has gained information without perceiving anything! After [Right,Suck] the agent will always end up in one of the states {4,8}. Finally, after [Right,Suck,Left,Suck] the agent is guaranteed to reach the goal state 7, no matter what the start state. We say that the agent can coerce the world into state 7.
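The coercion argument above can be checked mechanically. The sketch below assumes a particular numbering of the eight states (odd states have the agent on the left; states 1-2 have both squares dirty, 3-4 only the left square dirty, 5-6 only the right square dirty, 7-8 both clean), chosen because it is consistent with the belief states quoted in the text; the names `STATES`, `result`, `predict`, and `trace` are ours.

```python
# Assumed numbering: state -> (agent location, left dirty?, right dirty?)
STATES = {
    1: ("L", True, True),   2: ("R", True, True),
    3: ("L", True, False),  4: ("R", True, False),
    5: ("L", False, True),  6: ("R", False, True),
    7: ("L", False, False), 8: ("R", False, False),
}
NUMBER = {description: n for n, description in STATES.items()}

def result(s, action):
    """Deterministic physical transition model."""
    loc, dirt_l, dirt_r = STATES[s]
    if action == "Left":
        loc = "L"
    elif action == "Right":
        loc = "R"
    elif action == "Suck":
        if loc == "L":
            dirt_l = False
        else:
            dirt_r = False
    return NUMBER[(loc, dirt_l, dirt_r)]

def predict(belief, action):
    """Belief-state transition: apply the action to every possible state."""
    return frozenset(result(s, action) for s in belief)

def trace(belief, plan):
    """Return the list of belief states visited while executing plan."""
    beliefs = []
    for action in plan:
        belief = predict(belief, action)
        beliefs.append(sorted(belief))
    return beliefs

for action, b in zip(["Right", "Suck", "Left", "Suck"],
                     trace(frozenset(STATES), ["Right", "Suck", "Left", "Suck"])):
    print(action, b)
```

Each step can only keep the belief state the same size or shrink it, because the actions are deterministic; after the fourth action only state 7 remains.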
The solution to a sensorless problem is a sequence of actions, not a conditional plan (because there is no perceiving). But we search in the space of belief states rather than physical states.⁶ In belief-state space, the problem is fully observable because the agent always knows its own belief state. Furthermore, the solution (if any) for a sensorless problem is always a sequence of actions. This is because, as in the ordinary problems of Chapter 3, the percepts received after each action are completely predictable: they’re always empty! So there are no contingencies to plan for. This is true even if the environment is nondeterministic.
We could introduce new algorithms for sensorless search problems. But instead, we can use the existing algorithms from Chapter 3 if we transform the underlying physical problem into a belief-state problem, in which we search over belief states rather than physical states. The original problem, P, has components $\text{ACTIONS}_P$, $\text{RESULT}_P$, etc., and the belief-state problem has the following components:
• States: The belief-state space contains every possible subset of the physical states. If P has N states, then the belief-state problem has $2^N$ belief states, although many of those may be unreachable from the initial state.
• Initial state: Typically the belief state consisting of all states in P, although in some cases the agent will have more knowledge than this.
• Actions: This is slightly tricky. Suppose the agent is in belief state $b = \{s_1, s_2\}$, but $\text{ACTIONS}_P(s_1) \neq \text{ACTIONS}_P(s_2)$; then the agent is unsure of which actions are legal. If we assume that illegal actions have no effect on the environment, then it is safe to take the union of all the actions in any of the physical states in the current belief state b:
\[ \text{ACTIONS}(b) = \bigcup_{s \in b} \text{ACTIONS}_P(s). \]
On the other hand, if an illegal action might lead to catastrophe, it is safer to allow only the intersection, that is, the set of actions legal in all the states. For the vacuum world, every state has the same legal actions, so both methods give the same result.
• Transition model: For deterministic actions, the new belief state has one result state for each of the current possible states (although some result states may be the same):
\[ b' = \text{RESULT}(b, a) = \{ s' : s' = \text{RESULT}_P(s, a) \text{ and } s \in b \}. \tag{4.4} \]
With nondeterminism, the new belief state consists of all the possible results of applying the action to any of the states in the current belief state:
\[ b' = \text{RESULT}(b, a) = \{ s' : s' \in \text{RESULTS}_P(s, a) \text{ and } s \in b \} = \bigcup_{s \in b} \text{RESULTS}_P(s, a). \]
The size of $b'$ will be the same or smaller than b for deterministic actions, but may be larger than b with nondeterministic actions (see Figure 4.13).
• Goal test: The agent possibly achieves the goal if any state s in the belief state satisfies the goal test of the underlying problem, $\text{IS-GOAL}_P(s)$. The agent necessarily achieves the goal if every state satisfies $\text{IS-GOAL}_P(s)$. We aim to necessarily achieve the goal.
• Action cost: This is also tricky. If the same action can have different costs in different states, then the cost of taking an action in a given belief state could be one of several values. (This gives rise to a new class of problems, which we explore in Exercise 4.MVAL.) For now we assume that the cost of an action is the same in all states and so can be transferred directly from the underlying physical problem.
Figure 4.13 (a) Predicting the next belief state for the sensorless vacuum world with the deterministic action, Right. (b) Prediction for the same belief state and action in the slippery version of the sensorless vacuum world.
6 In a fully observable environment, each belief state contains one physical state. Thus, we can view the algorithms in Chapter 3 as searching in a belief-state space of singleton belief states.
Figure 4.14 shows the reachable belief-state space for the deterministic, sensorless vacuum world. There are only 12 reachable belief states out of $2^8 = 256$ possible belief states.
The preceding definitions enable the automatic construction of the belief-state problem formulation from the definition of the underlying physical problem. Once this is done, we can solve sensorless problems with any of the ordinary search algorithms of Chapter 3.
In ordinary graph search, newly reached states are tested to see if they were previously reached. This works for belief states, too; for example, in Figure 4.14, the action sequence [Suck,Left,Suck] starting at the initial state reaches the same belief state as [Right,Left,Suck], namely, {5,7}. Now, consider the belief state reached by [Left], namely, {1,3,5,7}. Obviously, this is not identical to {5,7}, but it is a superset. We can discard (prune) any such superset belief state. Why? Because a solution from {1,3,5,7} must be a solution for each of the individual states 1, 3, 5, and 7, and thus it is a solution for any combination of these individual states, such as {5,7}; therefore we don’t need to try to solve {1,3,5,7}, we can concentrate on trying to solve the strictly easier belief state {5,7}.
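As a rough illustration of solving the sensorless problem with an ordinary algorithm, the sketch below runs breadth-first graph search over belief states of the deterministic vacuum world, using frozensets so that previously reached belief states can be detected. The state numbering and all function names are assumptions made for this example (the numbering is chosen to agree with the belief states quoted in the text), and the superset pruning just described is deliberately left out to keep the sketch short.

```python
from collections import deque

# Assumed numbering: state -> (agent location, left dirty?, right dirty?)
STATES = {
    1: ("L", True, True),   2: ("R", True, True),
    3: ("L", True, False),  4: ("R", True, False),
    5: ("L", False, True),  6: ("R", False, True),
    7: ("L", False, False), 8: ("R", False, False),
}
NUMBER = {description: n for n, description in STATES.items()}
ACTIONS = ["Left", "Right", "Suck"]

def result(s, action):
    loc, dirt_l, dirt_r = STATES[s]
    if action == "Left":
        loc = "L"
    elif action == "Right":
        loc = "R"
    elif loc == "L":            # Suck on the left square
        dirt_l = False
    else:                       # Suck on the right square
        dirt_r = False
    return NUMBER[(loc, dirt_l, dirt_r)]

def predict(belief, action):
    return frozenset(result(s, action) for s in belief)

def is_goal(belief):
    # "Necessarily achieves the goal": every state in the belief state is clean.
    return all(s in (7, 8) for s in belief)

def reachable(initial):
    """All belief states reachable from the initial belief state."""
    reached, frontier = {initial}, deque([initial])
    while frontier:
        b = frontier.popleft()
        for a in ACTIONS:
            nb = predict(b, a)
            if nb not in reached:
                reached.add(nb)
                frontier.append(nb)
    return reached

def sensorless_bfs(initial):
    """Shortest action sequence that necessarily achieves the goal, or None."""
    reached, frontier = {initial}, deque([(initial, [])])
    while frontier:
        b, plan = frontier.popleft()
        if is_goal(b):
            return plan
        for a in ACTIONS:
            nb = predict(b, a)
            if nb not in reached:
                reached.add(nb)
                frontier.append((nb, plan + [a]))
    return None

initial = frozenset(STATES)
print(len(reachable(initial)))   # number of reachable belief states
print(sensorless_bfs(initial))   # a shortest coercing action sequence
```

Under this numbering the reachable set has the 12 belief states of Figure 4.14, and the returned plan is a four-action sequence.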
Conversely, if {1,3,5,7} has already been generated and found to be solvable, then any subset, such as {5,7}, is guaranteed to be solvable. (If I have a solution that works when I’m very confused about what state I’m in, it will still work when I’m less confused.) This extra level of pruning may dramatically improve the efficiency of sensorless problem solving.
Even with this improvement, however, sensorless problem-solving as we have described it is seldom feasible in practice. One issue is the vastness of the belief-state space: we saw in the previous chapter that often a search space of size N is too large, and now we have search spaces of size $2^N$. Furthermore, each element of the search space is a set of up to N elements. For large N, we won’t be able to represent even a single belief state without running out of memory space.
One solution is to represent the belief state by some more compact description. In English, we could say the agent knows “Nothing” in the initial state; after moving Left, we could say, “Not in the rightmost column,” and so on. Chapter 7 explains how to do this in a formal representation scheme.
Figure 4.14 The reachable portion of the belief-state space for the deterministic, sensorless vacuum world. Each rectangular box corresponds to a single belief state. At any given point, the agent has a belief state but does not know which physical state it is in. The initial belief state (complete ignorance) is the top center box.
Another approach is to avoid the standard search algorithms, which treat belief states as black boxes just like any other problem state. Instead, we can look inside the belief states and develop incremental belief-state search algorithms that build up the solution one physical state at a time. For example, in the sensorless vacuum world, the initial belief state is {1,2,3,4,5,6,7,8}, and we have to find an action sequence that works in all 8 states. We can do this by first finding a solution that works for state 1; then we check if it works for state 2; if not, go back and find a different solution for state 1, and so on.
Just as an AND-OR search has to find a solution for every branch at an AND node, this algorithm has to find a solution for every state in the belief state; the difference is that AND-OR search can find a different solution for each branch, whereas an incremental belief-state search has to find one solution that works for all the states.
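The "solve for state 1, then check state 2, and so on" idea can be illustrated with a deliberately naive sketch. It enumerates candidate action sequences by brute force rather than by a real search, so it shows only the incremental checking and early rejection, not an efficient algorithm. The state numbering and the names `solves` and `incremental_search` are assumptions made for this example.

```python
from itertools import product

# Assumed numbering: state -> (agent location, left dirty?, right dirty?)
STATES = {
    1: ("L", True, True),   2: ("R", True, True),
    3: ("L", True, False),  4: ("R", True, False),
    5: ("L", False, True),  6: ("R", False, True),
    7: ("L", False, False), 8: ("R", False, False),
}
NUMBER = {description: n for n, description in STATES.items()}
ACTIONS = ("Left", "Right", "Suck")
GOALS = {7, 8}

def result(s, action):
    loc, dirt_l, dirt_r = STATES[s]
    if action == "Left":
        loc = "L"
    elif action == "Right":
        loc = "R"
    elif loc == "L":
        dirt_l = False
    else:
        dirt_r = False
    return NUMBER[(loc, dirt_l, dirt_r)]

def solves(plan, s):
    """Does the action sequence take physical state s to a goal state?"""
    for action in plan:
        s = result(s, action)
    return s in GOALS

def incremental_search(belief, max_len=6):
    first, *rest = sorted(belief)
    for n in range(max_len + 1):
        for plan in product(ACTIONS, repeat=n):
            # A candidate must first solve one physical state; the remaining
            # states are then checked one at a time, and all() stops at the
            # first state for which the candidate fails.
            if solves(plan, first) and all(solves(plan, s) for s in rest):
                return list(plan)
    return None

print(incremental_search(STATES))
```

Most candidates are rejected after looking at only one or two physical states, which is the source of the speedup discussed in the text.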
The main advantage of the incremental approach is that it is typically able to detect failure quickly: when a belief state is unsolvable, it is usually the case that a small subset of the belief state, consisting of the first few states examined, is also unsolvable. In some cases, this leads to a speedup proportional to the size of the belief states, which may themselves be as large as the physical state space itself.
4.4.2 Searching in partially observable environments
Many problems cannot be solved without sensing. For example, the sensorless 8-puzzle is impossible. On the other hand, a little bit of sensing can go a long way: we can solve 8-puzzles if we can see just the upper-left corner square. The solution involves moving each tile in turn into the observable square and keeping track of its location from then on.
For a partially observable problem, the problem specification will specify a PERCEPT(s) function that returns the percept received by the agent in a given state. If sensing is nondeterministic, then we can use a PERCEPTS function that returns a set of possible percepts. For fully observable problems, PERCEPT(s) = s for every state s, and for sensorless problems PERCEPT(s) = null.
Consider a local-sensing vacuum world, in which the agent has a position sensor that yields the percept L in the left square, and R in the right square, and a dirt sensor that yields Dirty when the current square is dirty and Clean when it is clean. Thus, the PERCEPT in state 1 is [L, Dirty]. With partial observability, it will usually be the case that several states produce the same percept; state 3 will also produce [L, Dirty]. Hence, given this initial percept, the initial belief state will be {1,3}. We can think of the transition model between belief states for partially observable problems as occurring in three stages, as shown in Figure 4.15:
• The prediction stage computes the belief state resulting from the action, RESULT(b, a), exactly as we did with sensorless problems. To emphasize that this is a prediction, we use the notation $\hat{b} = \text{RESULT}(b, a)$, where the “hat” over the b means “estimated,” and we also use PREDICT(b, a) as a synonym for RESULT(b, a).
• The possible percepts stage computes the set of percepts that could be observed in the predicted belief state (using the letter o for observation):
\[ \text{POSSIBLE-PERCEPTS}(\hat{b}) = \{ o : o = \text{PERCEPT}(s) \text{ and } s \in \hat{b} \}. \]
• The update stage computes, for each possible percept, the belief state that would result from the percept. The updated belief state $b_o$ is the set of states in $\hat{b}$ that could have produced the percept:
\[ b_o = \text{UPDATE}(\hat{b}, o) = \{ s : o = \text{PERCEPT}(s) \text{ and } s \in \hat{b} \}. \]
The agent needs to deal with possible percepts at planning time, because it won’t know the actual percepts until it executes the plan. Notice that nondeterminism in the physical environment can enlarge the belief state in the prediction stage, but each updated belief state $b_o$ can be no larger than the predicted belief state $\hat{b}$; observations can only help reduce uncertainty. Moreover, for deterministic sensing, the belief states for the different possible percepts will be disjoint, forming a partition of the original predicted belief state.
Putting these three stages together, we obtain the possible belief states resulting from a given action and the subsequent possible percepts:
\[ \text{RESULTS}(b, a) = \{ b_o : b_o = \text{UPDATE}(\text{PREDICT}(b, a), o) \text{ and } o \in \text{POSSIBLE-PERCEPTS}(\text{PREDICT}(b, a)) \}. \tag{4.5} \]
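The three stages translate almost directly into code. The following sketch applies them to the local-sensing vacuum world; the state numbering, the `slippery` flag, and the choice to return a dictionary from percepts to belief states (a labeled version of the set in Equation (4.5)) are assumptions made for illustration.

```python
# Assumed numbering: state -> (agent location, left dirty?, right dirty?)
STATES = {
    1: ("L", True, True),   2: ("R", True, True),
    3: ("L", True, False),  4: ("R", True, False),
    5: ("L", False, True),  6: ("R", False, True),
    7: ("L", False, False), 8: ("R", False, False),
}
NUMBER = {description: n for n, description in STATES.items()}

def result(s, action):
    loc, dirt_l, dirt_r = STATES[s]
    if action == "Left":
        loc = "L"
    elif action == "Right":
        loc = "R"
    elif loc == "L":
        dirt_l = False
    else:
        dirt_r = False
    return NUMBER[(loc, dirt_l, dirt_r)]

def percept(s):
    """Local sensing: the agent's location and the dirt status of its own square."""
    loc, dirt_l, dirt_r = STATES[s]
    dirty_here = dirt_l if loc == "L" else dirt_r
    return (loc, "Dirty" if dirty_here else "Clean")

def predict(belief, action, slippery=False):
    predicted = set()
    for s in belief:
        predicted.add(result(s, action))
        if slippery and action in ("Left", "Right"):
            predicted.add(s)               # the movement may fail
    return frozenset(predicted)

def possible_percepts(belief):
    return {percept(s) for s in belief}

def update(belief, o):
    return frozenset(s for s in belief if percept(s) == o)

def results(belief, action, slippery=False):
    """Equation (4.5), returned as a mapping percept -> updated belief state."""
    b_hat = predict(belief, action, slippery)
    return {o: update(b_hat, o) for o in possible_percepts(b_hat)}

b = frozenset({1, 3})                      # initial percept [L, Dirty]
print(results(b, "Right"))
print(results(b, "Right", slippery=True))
```

The deterministic call yields two singleton belief states and the slippery call yields three belief states, matching the two cases described for Figure 4.15. Because sensing is deterministic, the updated belief states partition the predicted one.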
Figure 4.15 Two examples of transitions in local-sensing vacuum worlds. (a) In the deterministic world, Right is applied in the initial belief state, resulting in a new predicted belief state with two possible physical states; for those states, the possible percepts are [R, Dirty] and [R, Clean], leading to two belief states, each of which is a singleton. (b) In the slippery world, Right is applied in the initial belief state, giving a new belief state with four physical states; for those states, the possible percepts are [L, Dirty], [R, Dirty], and [R, Clean], leading to three belief states as shown.
Figure 4.16 The first level of the AND-OR search tree for a problem in the local-sensing vacuum world; Suck is the first action in the solution.
4.4.3 Solving partially observable problems
The preceding section showed how to derive the RESULTS function for a nondeterministic belief-state problem from an underlying physical problem, given the PERCEPT function. With this formulation, the AND-OR search algorithm of Figure 4.11 can be applied directly to derive a solution. Figure 4.16 shows part of the search tree for the local-sensing vacuum world, assuming an initial percept [A, Dirty]. The solution is the conditional plan
    [Suck, Right, if Bstate = {6} then Suck else []].
Notice that, because we supplied a belief-state problem to the AND-OR search algorithm, it returned a conditional plan that tests the belief state rather than the actual state. This is as it should be: in a partially observable environment the agent won’t know the actual state.
As in the case of standard search algorithms applied to sensorless problems, the AND-OR search algorithm treats belief states as black boxes, just like any other states. One can improve on this by checking for previously generated belief states that are subsets or supersets of the current state, just as for sensorless problems. One can also derive incremental search algorithms, analogous to those described for sensorless problems, that provide substantial speedups over the black-box approach.
4.4.4 An agent for partially observable environments
An agent for partially observable environments formulates a problem, calls a search algorithm (such as AND-OR-SEARCH) to solve it, and executes the solution. There are two main differences between this agent and the one for fully observable deterministic environments. First, the solution will be a conditional plan rather than a sequence; to execute an if-then-else expression, the agent will need to test the condition and execute the appropriate branch of the conditional. Second, the agent will need to maintain its belief state as it performs actions and receives percepts. This process resembles the prediction-observation-update process in Equation (4.5) but is actually simpler because the percept is given by the environment rather than calculated by the agent. Given an initial belief state b, an action a, and a percept o, the new belief state is:
\[ b' = \text{UPDATE}(\text{PREDICT}(b, a), o). \tag{4.6} \]
Consider a kindergarten vacuum world wherein agents sense only the state of their current square, and any square may become dirty at any time unless the agent is actively cleaning it at that moment.⁷ Figure 4.17 shows the belief state being maintained in this environment.
In partially observable environments (which include the vast majority of real-world environments), maintaining one’s belief state is a core function of any intelligent system. This function goes under various names, including monitoring, filtering, and state estimation. Equation (4.6) is called a recursive state estimator because it computes the new belief state from the previous one rather than by examining the entire percept sequence. If the agent is not to “fall behind,” the computation has to happen as fast as percepts are coming in. As the environment becomes more complex, the agent will only have time to compute an approximate belief state, perhaps focusing on the implications of the percept for the aspects of the environment that are of current interest. Most work on this problem has been done for stochastic, continuous-state environments with the tools of probability theory, as explained in Chapter 14.
7 The usual apologies to those who are unfamiliar with the effect of small children on the environment.
Figure 4.17 Two prediction-update cycles of belief-state maintenance in the kindergarten vacuum world with local sensing.
In this section we will show an example in a discrete environment with deterministic sensors and nondeterministic actions. The example concerns a robot with a particular state estimation task called localization: working out where it is, given a map of the world and a sequence of percepts and actions. Our robot is placed in the maze-like environment of Figure 4.18. The robot is equipped with four sonar sensors that tell whether there is an obstacle (the outer wall or a dark shaded square in the figure) in each of the four compass directions. The percept is in the form of a bit vector, one bit for each of the directions north, east, south, and west in that order, so 1011 means there are obstacles to the north, south, and west, but not east.
We assume that the sensors give perfectly correct data, and that the robot has a correct map of the environment. But unfortunately, the robot’s navigational system is broken, so when it executes a Right action, it moves randomly to one of the adjacent squares. The robot’s task is to determine its current location.
Suppose the robot has just been switched on, and it does not know where it is: its initial belief state b consists of the set of all locations. The robot then receives the percept 1011 and does an update using the equation $b_o = \text{UPDATE}(b, 1011)$, yielding the 4 locations shown in Figure 4.18(a). You can inspect the maze to see that those are the only four locations that yield the percept 1011.
Next the robot executes a Right action, but the result is nondeterministic. The new belief state, $b_a = \text{PREDICT}(b_o, \textit{Right})$, contains all the locations that are one step away from the locations in $b_o$. When the second percept, 1010, arrives, the robot does $\text{UPDATE}(b_a, 1010)$ and finds that the belief state has collapsed down to the single location shown in Figure 4.18(b). That’s the only location that could be the result of
\[ \text{UPDATE}(\text{PREDICT}(\text{UPDATE}(b, 1011), \textit{Right}), 1010). \]
With nondeterministic actions the PREDICT step grows the belief state, but the UPDATE step shrinks it back down, as long as the percepts provide some useful identifying information.
Sometimes the percepts don’t help much for localization: if there were one or more long east-west corridors, then a robot could receive a long sequence of 1010 percepts, but never know where in the corridor(s) it was. But for environments with reasonable variation in geography, localization often converges quickly to a single point, even when actions are nondeterministic.
Figure 4.18 Possible positions of the robot, ⊙, (a) after one observation, $E_1 = 1011$, and (b) after moving one square and making a second observation, $E_2 = 1010$. When sensors are noiseless and the transition model is accurate, there is only one possible location for the robot consistent with this sequence of two observations.
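The predict-update cycle for localization can be sketched in a few lines. The map below is a small hypothetical grid, not the maze of Figure 4.18, and the names `GRID`, `blocked`, `percept`, `predict`, `update`, and `localize` are ours. As in the text, sensors are assumed noiseless and each action moves the robot to some adjacent free square.

```python
GRID = ["....#",
        ".#...",
        "...#."]          # hypothetical map: '#' is an obstacle, '.' is free
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # north, east, south, west

def blocked(r, c):
    """True for obstacles and for anything beyond the outer wall."""
    inside = 0 <= r < len(GRID) and 0 <= c < len(GRID[0])
    return not inside or GRID[r][c] == "#"

FREE = frozenset((r, c) for r in range(len(GRID))
                 for c in range(len(GRID[0])) if not blocked(r, c))

def percept(cell):
    """Four-bit string in the order N, E, S, W; '1' means an obstacle."""
    r, c = cell
    return "".join("1" if blocked(r + dr, c + dc) else "0" for dr, dc in DIRS)

def predict(belief):
    """The broken navigation system moves to some adjacent free square."""
    return frozenset((r + dr, c + dc) for r, c in belief
                     for dr, dc in DIRS if not blocked(r + dr, c + dc))

def update(belief, o):
    return frozenset(cell for cell in belief if percept(cell) == o)

def localize(true_path):
    """Belief state after observing along true_path (used only to generate percepts)."""
    belief = update(FREE, percept(true_path[0]))
    for cell in true_path[1:]:
        belief = update(predict(belief), percept(cell))   # Equation (4.6)
    return belief

print(sorted(localize([(0, 0)])))
print(sorted(localize([(0, 0), (0, 1)])))
```

Because the sensors are correct and PREDICT includes every adjacent square, the robot's true location always remains in the belief state; how quickly the belief state shrinks depends on how distinctive the local geometry is.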
What happens if the sensors are faulty? If we can reason only with Boolean logic, then we have to treat every sensor bit as being either correct or incorrect, which is the same as having no perceptual information at all. But we will see that probabilistic reasoning (Chapter 12) allows us to extract useful information from a faulty sensor as long as it is wrong less than half the time.
4.5 Online Search Agents and Unknown Environments
So far we have concentrated on agents that use offline search algorithms. They compute a complete solution before taking their first action. In contrast, an online search⁸ agent interleaves computation and action: first it takes an action, then it observes the environment and computes the next action. Online search is a good idea in dynamic or semi-dynamic environments, where there is a penalty for sitting around and computing too long. Online search is also helpful in nondeterministic domains because it allows the agent to focus its computational efforts on the contingencies that actually arise rather than those that might happen but probably won’t.
8 The term “online” here refers to algorithms that must process input as it is received rather than waiting for the entire input data set to become available. This usage of “online” is unrelated to the concept of “having an Internet connection.”
Of course, there is a tradeoff: the more an agent plans ahead, the less often it will find itself up the creek without a paddle. In unknown environments, where the agent does not know what states exist or what its actions do, the agent must use its actions as experiments in order to learn about the environment.
A canonical example of online search is the mapping problem: a robot is placed in an unknown building and must explore to build a map that can later be used for getting from A to B. Methods for escaping from labyrinths (required knowledge for aspiring heroes of antiquity) are also examples of online search algorithms. Spatial exploration is not the only form of online exploration, however. Consider a newborn baby: it has many possible actions but knows the outcomes of none of them, and it has experienced only a few of the possible states that it can reach.
4.5.1 Online search problems
An online search problem is solved by interleaving computation, sensing, and acting. We’ll start by assuming a deterministic and fully observable environment (Chapter 17 relaxes these assumptions) and stipulate that the agent knows only the following:
• ACTIONS(s), the legal actions in state s;
• c(s, a, s′), the cost of applying action a in state s to arrive at state s′. Note that this cannot be used until the agent knows that s′ is the outcome.
• IS-GOAL(s), the goal test.
Note in particular that the agent cannot determine RESULT(s, a) except by actually being in s and doing a. For example, in the maze problem shown in Figure 4.19, the agent does not know that going Up from (1,1) leads to (1,2); nor, having done that, does it know that going Down will take it back to (1,1). This degree of ignorance can be reduced in some applications; for example, a robot explorer might know how its movement actions work and be ignorant only of the locations of obstacles.
Finally, the agent might have access to an admissible heuristic function h(s) that estimates the distance from the current state to a goal state. For example, in Figure 4.19, the agent might know the location of the goal and be able to use the Manhattan-distance heuristic (page 97).
Typically, the agent’s objective is to reach a goal state while minimizing cost. (Another
possible objective is simply to explore the entire environment.) The cost is the total path
cost that the agent incurs as it travels. It is common to compare this cost with the path cost
the agent would incur if it knew the search space in advance—that is, the optimal path in the known environment. In the language of online algorithms, this comparison is called the competitive ratio; we would like it to be as small as possible.
Online explorers are vulnerable to dead ends: states from which no goal state is reachable. If the agent doesn’t know what each action does, it might execute the “jump into bottomless pit” action, and thus never reach the goal. In general, no algorithm can avoid dead ends in all state spaces. Consider the two dead-end state spaces in Figure 4.20(a). An online search algorithm that has visited states S and A cannot tell if it is in the top state or the bottom one; the two look identical based on what the agent has seen. Therefore, there is no
n, then the constraint cannot be satisfied.
Section 6.2 Constraint Propagation: Inference in CSPs
This leads to the following simple algorithm: First, remove any variable in the constraint that has a singleton domain, and delete that variable’s value from the domains of the remaining variables. Repeat as long as there are singleton variables. If at any point an empty domain is produced or there are more variables than domain values left, then an inconsistency has been detected.
This method can detect the inconsistency in the assignment {WA = red, NSW = red} for Figure 6.1. Notice that the variables SA, NT, and Q are effectively connected by an Alldiff constraint because each pair must have two different colors. After applying AC-3 with the partial assignment, the domains of SA, NT, and Q are all reduced to {green, blue}. That is, we have three variables and only two colors, so the Alldiff constraint is violated. Thus, a simple consistency procedure for a higher-order constraint is sometimes more effective than applying arc consistency to an equivalent set of binary constraints.
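The simple Alldiff procedure described above might be sketched as follows. The function name `alldiff_consistent` and the dictionary-of-sets representation of domains are assumptions made for this example; a return value of True means only that this procedure found no inconsistency, not that the constraint is satisfiable.

```python
def alldiff_consistent(domains):
    """domains: dict mapping each variable in the Alldiff constraint to a set of values.

    Returns False if the simple procedure detects an inconsistency, else True.
    """
    domains = {var: set(values) for var, values in domains.items()}   # work on a copy
    while True:
        if any(not values for values in domains.values()):
            return False                     # an empty domain was produced
        values_left = set().union(*domains.values())
        if len(domains) > len(values_left):
            return False                     # more variables than values left
        singletons = [var for var, values in domains.items() if len(values) == 1]
        if not singletons:
            return True
        var = singletons[0]
        (value,) = domains.pop(var)          # remove the singleton variable ...
        for values in domains.values():
            values.discard(value)            # ... and its value from the other domains

# The map-coloring case from the text: three variables, two colors.
print(alldiff_consistent({"SA": {"green", "blue"},
                          "NT": {"green", "blue"},
                          "Q":  {"green", "blue"}}))
```

For the three-variable example the counting test fails immediately, even though every pair of variables is arc-consistent on its own.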
Another important higher-order constraint is the resource constraint, sometimes called the Atmost constraint. For example, in a scheduling problem, let $P_1, \ldots, P_4$ denote the numbers of personnel assigned to each of four tasks. The constraint that no more than 10 personnel are assigned in total is written as $\textit{Atmost}(10, P_1, P_2, P_3, P_4)$. We can detect an inconsistency simply by checking the sum of the minimum values of the current domains; for example, if each variable has the domain {3,4,5,6}, the Atmost constraint cannot be satisfied. We can also enforce consistency by deleting the maximum value of any domain if it is not consistent with the minimum values of the other domains. Thus, if each variable in our example has the domain {2,3,4,5,6}, the values 5 and 6 can be deleted from each domain.
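Both checks for the Atmost constraint fit in a few lines. In this sketch the function name `atmost_prune` and the list-of-sets representation are assumptions; it returns None when the sum of the minimum values already exceeds the limit, and otherwise returns the domains with unsupportable large values removed.

```python
def atmost_prune(limit, domains):
    """domains: list of non-empty sets of integers, one per variable.

    Returns None if the minimum values alone exceed the limit; otherwise a
    new list of domains in which each value fits alongside the minimum
    values of all the other domains.
    """
    mins = [min(d) for d in domains]
    total_min = sum(mins)
    if total_min > limit:
        return None                                   # inconsistency detected
    pruned = []
    for d, m in zip(domains, mins):
        slack = limit - (total_min - m)               # largest value this variable may take
        pruned.append({v for v in d if v <= slack})
    return pruned

print(atmost_prune(10, [{3, 4, 5, 6}] * 4))      # minimum values sum to 12 > 10
print(atmost_prune(10, [{2, 3, 4, 5, 6}] * 4))   # values 5 and 6 are deleted
```

In the second call the other three minimum values sum to 6, so each variable can be at most 4, which reproduces the deletion of 5 and 6 described in the text.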
For large resource-limited problems with integer values, such as logistical problems involving moving thousands of people in hundreds of vehicles, it is usually not possible to represent the domain of each variable as a large set of integers and gradually reduce that set by consistency-checking methods. Instead, domains are represented by upper and lower bounds and are managed by bounds propagation. For example, in an airline-scheduling problem, let’s suppose there are two flights, $F_1$ and $F_2$, for which the planes have capacities 165 and 385, respectively. The initial domains for the numbers of passengers on flights $F_1$ and $F_2$ are then

$D_1 = [0, 165]$ and $D_2 = [0, 385]$.

Now suppose we have the additional constraint that the two flights together must carry 420 people: $F_1 + F_2 = 420$. Propagating bounds constraints, we reduce the domains to

$D_1 = [35, 165]$ and $D_2 = [255, 385]$.
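The flight example can be reproduced with a small interval-tightening routine. This is only a sketch for a single constraint of the form $F_1 + F_2 = \mathit{total}$; the function name `propagate_sum` and the `(low, high)` tuple representation are assumptions made for illustration.

```python
def propagate_sum(d1, d2, total):
    """Tighten interval domains (low, high) under x1 + x2 == total.

    Returns the two tightened intervals, or None if one becomes empty.
    """
    lo1, hi1 = d1
    lo2, hi2 = d2
    changed = True
    while changed:
        # x1 = total - x2, so x1 lies in [total - hi2, total - lo2].
        new1 = (max(lo1, total - hi2), min(hi1, total - lo2))
        # x2 = total - x1, using the freshly tightened bounds of x1.
        new2 = (max(lo2, total - new1[1]), min(hi2, total - new1[0]))
        changed = (new1, new2) != ((lo1, hi1), (lo2, hi2))
        (lo1, hi1), (lo2, hi2) = new1, new2
        if lo1 > hi1 or lo2 > hi2:
            return None  # empty interval: the constraint cannot be met
    return (lo1, hi1), (lo2, hi2)


d1, d2 = propagate_sum((0, 165), (0, 385), 420)
```

Only the endpoints are stored and updated, which is why the approach scales to domains far too large to enumerate.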
We say that a CSP is bounds-consistent if for every variable X, and for both the lower-bound and upper-bound values of X, there exists some value of Y that satisfies the constraint between X and Y for every variable Y. This kind of bounds propagation is widely used in practical constraint problems.

6.2.6 Sudoku
The popular Sudoku puzzle has introduced millions of people to constraint satisfaction problems, although they may not realize it. A Sudoku board consists of 81 squares, some of which are initially filled with digits from 1 to 9. The puzzle is to fill in all the remaining squares such that no digit appears twice in any row, column, or 3 × 3 box (see Figure 6.4). A row, column, or box is called a unit.
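To make the notion of a unit concrete, the 27 units can be enumerated directly. The square names used here (rows A to I, columns 1 to 9, so the top-left square is A1) are an assumed naming convention for this sketch, as is the function name `sudoku_units`.

```python
ROWS = "ABCDEFGHI"
COLS = "123456789"


def sudoku_units():
    """Return the 27 units (9 rows, 9 columns, 9 boxes) as lists of square names."""
    rows = [[r + c for c in COLS] for r in ROWS]
    cols = [[r + c for r in ROWS] for c in COLS]
    boxes = [[r + c for r in rs for c in cs]
             for rs in ("ABC", "DEF", "GHI")
             for cs in ("123", "456", "789")]
    return rows + cols + boxes


UNITS = sudoku_units()
# Squares that share at least one unit with A1 (its "peers" in this sketch).
PEERS_OF_A1 = {s for u in UNITS if "A1" in u for s in u} - {"A1"}
```

Each unit corresponds to one requirement that its nine squares hold different digits, so a puzzle can be stated with 27 such constraints over 81 variables.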
To solve a tree-structured CSP, first pick any variable to be the root of the tree, and choose an ordering of the variables such that each variable appears after its parent in the tree. Such an ordering is called a topological sort. Figure 6.10(a) shows a sample tree and (b) shows one possible ordering. Any tree with n nodes has n − 1 edges, so we can make this graph directed arc-consistent in O(n) steps, each of which must compare up to d possible domain values for two variables, for a total time of $O(nd^2)$. Once we have a directed arc-consistent graph, we can just march down the list of variables and choose any remaining value. Since each edge from a parent to its child is arc-consistent, we know that for any value we choose for the parent, there will be a valid value left to choose for the child. That means we won’t have to backtrack; we can move linearly through the variables. The complete algorithm is shown in Figure 6.11.

3 A careful cartographer or patriotic Tasmanian might object that Tasmania should not be colored the same as its nearest mainland neighbor, to avoid the impression that it might be part of that state.
4 Sadly, very few regions of the world have tree-structured maps, although Sulawesi comes close.

Figure 6.10 (a) The constraint graph of a tree-structured CSP. (b) A linear ordering of the variables consistent with the tree with A as the root. This is known as a topological sort of the variables.

function TREE-CSP-SOLVER(csp) returns a solution, or failure
  inputs: csp, a CSP with components X, D, C
  n ← number of variables in X
  assignment ← an empty assignment
  root ← any variable in X
  X ← TOPOLOGICALSORT(X, root)
  for j = n down to 2 do
    MAKE-ARC-CONSISTENT(PARENT(X_j), X_j)
    if it cannot be made consistent then return failure
  for i = 1 to n do
    assignment[X_i] ← any consistent value from D_i
    if there is no consistent value then return failure
  return assignment

Figure 6.11 The TREE-CSP-SOLVER algorithm for solving tree-structured CSPs. If the CSP has a solution, we will find it in linear time; if not, we will detect a contradiction.
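The two passes of the tree algorithm translate fairly directly into Python. The following is a minimal sketch rather than a reference implementation: the caller is assumed to supply the variables already in topological order, a `parent` dictionary describing the tree, and a `consistent` callback that tests one binary constraint. All of these names are hypothetical.

```python
def tree_csp_solver(variables, domains, parent, consistent):
    """Solve a tree-structured binary CSP without backtracking.

    variables:  list in topological order (every parent before its children)
    domains:    dict variable -> iterable of values
    parent:     dict child -> parent (roots are absent from the dict)
    consistent: function (x, vx, y, vy) -> bool for the constraint between x and y
    Returns an assignment dict, or None on failure.
    """
    domains = {v: set(domains[v]) for v in variables}  # work on a copy

    # Backward pass: make each arc (parent -> child) consistent, last variable first.
    for child in reversed(variables):
        if child not in parent:
            continue
        p = parent[child]
        domains[p] = {vp for vp in domains[p]
                      if any(consistent(p, vp, child, vc) for vc in domains[child])}
        if not domains[p]:
            return None

    # Forward pass: any value consistent with the parent's value will do.
    assignment = {}
    for var in variables:
        if var in parent:
            p = parent[var]
            candidates = {v for v in domains[var]
                          if consistent(p, assignment[p], var, v)}
        else:
            candidates = domains[var]
        if not candidates:
            return None
        assignment[var] = min(candidates)  # arbitrary but deterministic choice
    return assignment


# A chain WA - NT - Q - NSW - V colored with two colors.
order = ["WA", "NT", "Q", "NSW", "V"]
tree = {"NT": "WA", "Q": "NT", "NSW": "Q", "V": "NSW"}
different = lambda x, vx, y, vy: vx != vy
solution = tree_csp_solver(order, {v: {"red", "green"} for v in order}, tree, different)
```

Because variables without a parent are treated as roots, the same function also handles a forest of several trees.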
Now that we have an efficient algorithm for trees, we can consider whether more general constraint graphs can be reduced to trees somehow. There are two ways to do this: by removing nodes (Section 6.5.1) or by collapsing nodes together (Section 6.5.2).

6.5.1 Cutset conditioning
The first way to reduce a constraint graph to a tree involves assigning values to some variables so that the remaining variables form a tree. Consider the constraint graph for Australia, shown again in Figure 6.12(a). Without South Australia, the graph would become a tree, as in (b). Fortunately, we can delete South Australia (in the graph, not the country) by fixing a value for SA and deleting from the domains of the other variables any values that are inconsistent with the value chosen for SA.

Figure 6.12 (a) The original constraint graph from Figure 6.1. (b) After the removal of SA, the constraint graph becomes a forest of two trees.

Now, any solution for the CSP after SA and its constraints are removed will be consistent with the value chosen for SA. (This works for binary CSPs; the situation is more complicated with higher-order constraints.) Therefore, we can solve the remaining tree with the algorithm given above and thus solve the whole problem. Of course, in the general case (as opposed to map coloring), the value chosen for SA could be the wrong one, so we would need to try each possible value. The general algorithm is as follows:

1. Choose a subset S of the CSP’s variables such that the constraint graph becomes a tree after removal of S. S is called a cycle cutset.
2. For each possible assignment to the variables in S that satisfies all constraints on S,
   (a) remove from the domains of the remaining variables any values that are inconsistent with the assignment for S, and
   (b) if the remaining CSP has a solution, return it together with the assignment for S.
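Step 1 of the algorithm above requires a way to test whether a candidate set S really leaves an acyclic graph behind. A minimal sketch using union-find is shown below; the function name `is_cycle_cutset` is hypothetical, and the edge list is our reading of the Australia constraint graph in Figure 6.1.

```python
def is_cycle_cutset(variables, edges, cutset):
    """Return True if removing `cutset` leaves a graph with no cycles (a forest)."""
    remaining = [v for v in variables if v not in cutset]
    root = {v: v for v in remaining}

    def find(v):
        # Follow parent links to the representative, compressing the path.
        while root[v] != v:
            root[v] = root[root[v]]
            v = root[v]
        return v

    for a, b in edges:
        if a in cutset or b in cutset:
            continue  # edges touching the cutset disappear with it
        ra, rb = find(a), find(b)
        if ra == rb:
            return False  # a and b were already connected: this edge closes a cycle
        root[ra] = rb
    return True


REGIONS = ["WA", "NT", "SA", "Q", "NSW", "V", "T"]
EDGES = [("WA", "NT"), ("WA", "SA"), ("NT", "SA"), ("NT", "Q"), ("SA", "Q"),
         ("SA", "NSW"), ("SA", "V"), ("Q", "NSW"), ("NSW", "V")]

sa_is_cutset = is_cycle_cutset(REGIONS, EDGES, {"SA"})
```

Removing SA leaves the chain WA, NT, Q, NSW, V plus the isolated node T, so `sa_is_cutset` is true, whereas the empty set fails because of the WA, NT, SA triangle.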
If the cycle cutset has size c, then the total run time is $O(d^c \cdot (n - c)d^2)$: we have to try each of the $d^c$ combinations of values for the variables in S, and for each combination we must solve a tree problem of size n − c. If the graph is “nearly a tree,” then c will be small and the savings over straight backtracking will be huge: for our 100-Boolean-variable example, if we could find a cutset of size c = 20, this would get us down from the lifetime of the Universe to a few minutes. In the worst case, however, c can be as large as (n − 2). Finding the smallest cycle cutset is NP-hard, but several efficient approximation algorithms are known. The overall algorithmic approach is called cutset conditioning; it comes up again in Chapter 13, where it is used for reasoning about probabilities.
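The size of the saving is easy to check numerically. The snippet below simply evaluates the bound $d^c \cdot (n - c)d^2$ for the 100-Boolean-variable example and compares it with the $d^n$ complete assignments; it counts abstract steps only, and the function name `cutset_steps` is an assumption of this sketch.

```python
def cutset_steps(n, d, c):
    """Step count d^c * (n - c) * d^2 from the cutset-conditioning bound."""
    return d**c * (n - c) * d**2


n, d, c = 100, 2, 20
with_cutset = cutset_steps(n, d, c)   # 2^20 * 80 * 4 = 335,544,320
exhaustive = d**n                     # 2^100 complete assignments
ratio = exhaustive // with_cutset     # how many times fewer steps
```

The ratio exceeds $10^{21}$, which is the gap between an infeasible computation and a routine one.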
6.5.2 Tree decomposition
The second way to reduce a constraint graph to a tree is based on constructing a tree decomposition of the constraint graph: a transformation of the original graph into a tree where each node in the tree consists of a set of variables, as in Figure 6.13. A tree decomposition must satisfy these three requirements:

• Every variable in the original problem appears in at least one of the tree nodes.
• If two variables are connected by a constraint in the original problem, they must appear together (along with the constraint) in at least one of the tree nodes.
• If a variable appears in two nodes in the tree, it must appear in every node along the path connecting those nodes.

Figure 6.13 A tree decomposition of the constraint graph in Figure 6.12(a).

The first two conditions ensure that all the variables and constraints are represented in the tree decomposition. The third condition seems rather technical, but allows us to say that any variable from the original problem must have the same value wherever it appears: the constraints in the tree say that a variable in one node of the tree must have the same value as the corresponding variable in the adjacent node in the tree. For example, SA appears in all four of the connected nodes in Figure 6.13, so each edge in the tree decomposition therefore includes the constraint that the value of SA in one node must be the same as the value of SA in the next. You can verify from Figure 6.12 that this decomposition makes sense.
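The three requirements can be checked mechanically. The sketch below is illustrative only: the function name, the node labels N1 to N5, and the contents of the last two mainland nodes are our assumptions about Figure 6.13 (the text confirms only {WA, NT, SA}, the node containing SA, NT, and Q, and that SA appears in four connected nodes). The check also assumes that `tree_edges` already forms a tree or forest.

```python
def is_tree_decomposition(variables, constraints, nodes, tree_edges):
    """Check the three tree-decomposition requirements.

    nodes:      dict node name -> set of variables in that node
    tree_edges: list of (node, node) pairs, assumed to form a tree or forest
    """
    # 1. Every variable appears in at least one node.
    if any(all(v not in bag for bag in nodes.values()) for v in variables):
        return False
    # 2. Every constrained pair appears together in at least one node.
    if any(all(not (a in bag and b in bag) for bag in nodes.values())
           for a, b in constraints):
        return False
    # 3. The nodes containing a variable must be connected to one another
    #    using only nodes that contain it (paths in a tree are unique).
    for v in variables:
        holding = {name for name, bag in nodes.items() if v in bag}
        start = next(iter(holding))
        seen, frontier = {start}, [start]
        while frontier:
            cur = frontier.pop()
            for a, b in tree_edges:
                nxt = b if a == cur else a if b == cur else None
                if nxt in holding and nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        if seen != holding:
            return False
    return True


REGIONS = ["WA", "NT", "SA", "Q", "NSW", "V", "T"]
CONSTRAINTS = [("WA", "NT"), ("WA", "SA"), ("NT", "SA"), ("NT", "Q"), ("SA", "Q"),
               ("SA", "NSW"), ("SA", "V"), ("Q", "NSW"), ("NSW", "V")]
NODES = {"N1": {"WA", "NT", "SA"}, "N2": {"NT", "SA", "Q"},
         "N3": {"SA", "Q", "NSW"}, "N4": {"SA", "NSW", "V"}, "N5": {"T"}}
TREE_EDGES = [("N1", "N2"), ("N2", "N3"), ("N3", "N4")]

valid = is_tree_decomposition(REGIONS, CONSTRAINTS, NODES, TREE_EDGES)
```

Dropping SA from the second node would leave the first two requirements satisfied but break the third, because SA would then appear on both sides of a node that does not contain it.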
Once we have a tree-structured graph, we can apply TREE-CSP-SOLVER to get a solution in $O(nd^2)$ time, where n is the number of tree nodes and d is the size of the largest domain. But note that in the tree, a domain is a set of tuples of values, not just individual values.

For example, the top left node in Figure 6.13 represents, at the level of the original problem, a subproblem with variables {WA, NT, SA}, domain {red, green, blue}, and constraints WA ≠ NT, SA ≠ NT, WA ≠ SA. At the level of the tree, the node represents a single variable, which we can call SANTWA, whose value must be a three-tuple of colors, such as (red, green, blue), but not (red, red, blue), because that would violate the constraint SA ≠ NT from the original problem. We can then move from that node to the adjacent one, with the variable we can call SANTQ, and find that there is only one tuple, (red, green, blue), that is consistent with the choice for SANTWA. The exact same process is repeated for the next two nodes, and independently we can make any choice for T.

We can solve any tree decomposition problem in $O(nd^2)$ time with TREE-CSP-SOLVER, which will be efficient as long as d remains small. Going back to our example with 100 Boolean variables, if each node has 10 variables, then $d = 2^{10}$ and we should be able to solve the problem in seconds. But if there is a node with 30 variables, it would take centuries.
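The tuple-valued domains can be enumerated explicitly for the two nodes discussed above. In this sketch we assume the tuple order follows the variable names, so a SANTWA value is (SA, NT, WA) and a SANTQ value is (SA, NT, Q); the helper name `compatible` is hypothetical.

```python
from itertools import permutations

COLORS = ("red", "green", "blue")

# In both nodes every pair of variables is constrained to differ,
# so each node's domain is the set of all-different color triples.
SANTWA_DOMAIN = list(permutations(COLORS, 3))   # tuples ordered (SA, NT, WA)
SANTQ_DOMAIN = list(permutations(COLORS, 3))    # tuples ordered (SA, NT, Q)


def compatible(santwa_value):
    """SANTQ tuples that agree with a SANTWA tuple on the shared variables SA and NT."""
    sa, nt, _ = santwa_value
    return [t for t in SANTQ_DOMAIN if t[0] == sa and t[1] == nt]


choices = compatible(("red", "green", "blue"))
```

Each node has 6 legal tuples out of the 27 possible triples, and once SANTWA is fixed only one SANTQ tuple remains, as the text states.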
A given graph admits many tree decompositions; in choosing a decomposition, the aim is to make the subproblems as small as possible. (Putting all the variables into one node is technically a tree, but is not helpful.) The tree width of a tree decomposition of a graph is one less than the size of the largest node; the tree width of the graph itself is defined to be the minimum width among all its tree decompositions. If a graph has tree width w then the problem can be solved in $O(nd^{w+1})$ time given the corresponding tree decomposition. Hence, CSPs with constraint graphs of bounded tree width are solvable in polynomial time.

Unfortunately, finding the decomposition with minimal tree width is NP-hard, but there are heuristic methods that work well in practice. Which is better: the cutset decomposition with time $O(d^c \cdot (n - c)d^2)$, or the tree decomposition with time $O(nd^{w+1})$? Whenever you have a cycle cutset of size c, there is also a tree width of size $w < c + 1$, and it may be far smaller in some cases. So time consideration favors tree decomposition, but the advantage of the cycle-cutset approach is that it can be executed in linear memory, while tree decomposition requires memory exponential in w.
6.5.3 Value symmetry
So far, we have looked at the structure of the constraint graph. There can also be important structure in the values of variables, or in the structure of the constraint relations themselves. Consider the map-coloring problem with d colors. For every consistent solution, there is actually a set of d! solutions formed by permuting the color names. For example, on the Australia map we know that WA, NT, and SA must all have different colors, but there are 3! = 6 ways to assign three colors to three regions. This is called value symmetry. We would like to reduce the search space by a factor of d! by breaking the symmetry in assignments. We do this by introducing a symmetry-breaking constraint. For our example, we might impose an arbitrary ordering constraint, NT < SA < WA, that requires the three values to be in alphabetical order. This constraint ensures that only one of the d! solutions is possible: {NT = blue, SA = green, WA = red}.

For map coloring, it was easy to find a constraint that eliminates the symmetry. In general it is NP-hard to eliminate all symmetry, but breaking value symmetry has proved to be important and effective on a wide range of problems.
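The effect of the ordering constraint can be confirmed by brute force on the Australia map. This is a small enumeration sketch, not a practical solver; the edge list is our reading of Figure 6.1, and the alphabetical ordering is implemented with ordinary string comparison.

```python
from itertools import product

REGIONS = ["WA", "NT", "SA", "Q", "NSW", "V", "T"]
EDGES = [("WA", "NT"), ("WA", "SA"), ("NT", "SA"), ("NT", "Q"), ("SA", "Q"),
         ("SA", "NSW"), ("SA", "V"), ("Q", "NSW"), ("NSW", "V")]
COLORS = ["blue", "green", "red"]


def solutions(extra=lambda a: True):
    """Enumerate all colorings that satisfy the map constraints and `extra`."""
    found = []
    for values in product(COLORS, repeat=len(REGIONS)):
        a = dict(zip(REGIONS, values))
        if all(a[x] != a[y] for x, y in EDGES) and extra(a):
            found.append(a)
    return found


all_solutions = solutions()                                       # 18 colorings
ordered = solutions(lambda a: a["NT"] < a["SA"] < a["WA"])        # 3 colorings
```

The 18 solutions are the 3! = 6 permutations of one mainland coloring times the 3 free choices for T. With the symmetry-breaking constraint the mainland coloring is unique, and only the choice for T remains.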
Summary

• Constraint satisfaction problems (CSPs) represent a state with a set of variable/value pairs and represent the conditions for a solution by a set of constraints on the variables. Many important real-world problems can be described as CSPs.
• A number of inference techniques use the constraints to rule out certain variable assignments. These include node, arc, path, and k-consistency.
• Backtracking search, a form of depth-first search, is commonly used for solving CSPs. Inference can be interwoven with search.
• The minimum-remaining-values and degree heuristics are domain-independent methods for deciding which variable to choose next in a backtracking search. The least-constraining-value heuristic helps in deciding which value to try first for a given variable. Backtracking occurs when no legal assignment can be found for a variable. Conflict-directed backjumping backtracks directly to the source of the problem. Constraint learning records the conflicts as they are encountered during search in order to avoid the same conflict later in the search.
• Local search using the min-conflicts heuristic has also been applied to constraint satisfaction problems with great success.
• The complexity of solving a CSP is strongly related to the structure of its constraint graph. Tree-structured problems can be solved in linear time. Cutset conditioning can reduce a general CSP to a tree-structured one and is quite efficient (requiring only linear memory) if a small cutset can be found. Tree decomposition techniques transform the CSP into a tree of subproblems and are efficient if the tree width of the constraint graph is small; however they need memory exponential in the tree width of the constraint graph. Combining cutset conditioning with tree decomposition can allow a better tradeoff of memory versus time.
Bibliographical and Historical Notes

The Greek mathematician Diophantus (c. 200–284) presented and solved problems involving algebraic constraints on equations, although he didn’t develop a generalized methodology. We now call equations over integer domains Diophantine equations. The Indian mathematician Brahmagupta (c. 650) was the first to show a general solution over the domain of integers for the equation ax + by = c. Systematic methods for solving linear equations by variable elimination were studied by Gauss (1829); the solution of linear inequality constraints goes back to Fourier (1827).

Finite-domain constraint satisfaction problems also have a long history. For example, graph coloring (of which map coloring is a special case) is an old problem in mathematics. The four-color conjecture (that every planar graph can be colored with four or fewer colors) was first made by Francis Guthrie, a student of De Morgan, in 1852. It resisted solution, despite several published claims to the contrary, until a proof was devised by Appel and Haken (1977) (see the book Four Colors Suffice (Wilson, 2004)). Purists were disappointed that part of the proof relied on a computer, so Georges Gonthier (2008), using the COQ theorem prover, derived a formal proof that Appel and Haken’s proof program was correct.
Specific classes of constraint satisfaction problems occur throughout the history of computer science. One of the most influential early examples was SKETCHPAD (Sutherland, 1963), which solved geometric constraints in diagrams and was the forerunner of modern drawing programs and CAD tools. The identification of CSPs as a general class is due to Ugo Montanari (1974). The reduction of higher-order CSPs to purely binary CSPs with auxiliary variables (see Exercise 6.NARY) is due originally to the 19th-century logician Charles Sanders Peirce. It was introduced into the CSP literature by Dechter (1990b) and was elaborated by Bacchus and van Beek (1998). CSPs with preferences among solutions are studied widely in the optimization literature; see Bistarelli et al. (1997) for a generalization of the CSP framework to allow for preferences.
Constraint propagation methods were popularized by Waltz’s (1975) success on polyhedral line-labeling problems for computer vision. Waltz showed that in many problems, propagation completely eliminates the need for backtracking. Montanari (1974) introduced the notion of constraint graphs and propagation by path consistency. Alan Mackworth (1977) proposed the AC-3 algorithm for enforcing arc consistency as well as the general idea of combining backtracking with some degree of consistency enforcement. AC-4, a more efficient arc-consistency algorithm developed by Mohr and Henderson (1986), runs in $O(cd^2)$ worst-case time but can be slower than AC-3 on average cases. The PC-2 algorithm (Mackworth, 1977) achieves path consistency in much the same way that AC-3 achieves arc consistency.

Soon after Mackworth’s paper appeared, researchers began experimenting with the tradeoff between the cost of consistency enforcement and the benefits in terms of search reduction. Haralick and Elliott (1980) favored the minimal forward-checking algorithm described by McGregor (1979), whereas Gaschnig (1979) suggested full arc-consistency checking after each variable assignment, an algorithm later called MAC by Sabin and Freuder (1994). The latter paper provides somewhat convincing evidence that on harder CSPs, full arc-consistency checking pays off. Freuder (1978, 1982) investigated the notion of k-consistency and its relationship to the complexity of solving CSPs. Dechter and Dechter (1987) introduced directional arc consistency. Apt (1999) describes a generic algorithmic framework within which consistency propagation algorithms can be analyzed, and surveys are given by Bessière (2006) and Barták et al. (2010).

Special methods for handling higher-order or global constraints were developed first within the context of constraint logic programming. Marriott and Stuckey (1998) provide excellent coverage of research in this area. The Alldiff constraint was studied by Regin (1994), Stergiou and Walsh (1999), and van Hoeve (2001). There are more complex inference algorithms for Alldiff (see van Hoeve and Katriel, 2006) that propagate more constraints but are more computationally expensive to run. Bounds constraints were incorporated into constraint logic programming by Van Hentenryck et al. (1998). A survey of global constraints is provided by van Hoeve and Katriel (2006).

Sudoku has become the most widely known CSP and was described as such by Simonis (2005). Agerbeck and Hansen (2008) describe some of the strategies and show that Sudoku on an $n^2 \times n^2$ board is in the class of NP-hard problems.

In 1850, C. F. Gauss described a recursive backtracking algorithm for solving the 8-queens problem, which had been published in the German chess magazine Schachzeitung in 1848. Gauss called his method Tatonniren, derived from the French word tâtonner: to grope around, as if in the dark.

According to Donald Knuth (personal communication), R. J. Walker introduced the term backtrack in the 1950s. Walker (1960) described the basic backtracking algorithm and used it to find all solutions to the 13-queens problem. Golomb and Baumert (1965) formulated, with examples, the general class of combinatorial problems to which backtracking can be applied, and introduced what we call the MRV heuristic. Bitner and Reingold (1975) provided an influential survey of backtracking techniques. Brelaz (1979) used the degree heuristic as a tiebreaker after applying the MRV heuristic. The resulting algorithm, despite its simplicity, is still the best method for k-coloring arbitrary graphs. Haralick and Elliott (1980) proposed the least-constraining-value heuristic.

The basic backjumping method is due to John Gaschnig (1977, 1979). Kondrak and van Beek (1997) showed that this algorithm is essentially subsumed by forward checking. Conflict-directed backjumping was devised by Prosser (1993). Dechter (1990a) introduced graph-based backjumping, which bounds the complexity of backjumping-based algorithms as a function of the constraint graph (Dechter and Frost, 2002).
A very general form of intelligent backtracking was developed early on by Stallman and Sussman (1977). Their technique of dependency-directed backtracking combines backjumping with no-good learning (McAllester, 1990) and led to the development of truth maintenance systems (Doyle, 1979), which we discuss in Section 10.6.2. The connection between the two areas is analyzed by de Kleer (1989).
The work of Stallman and Sussman also introduced the idea of constraint learning, in which partial results obtained by search can be saved and reused later in the search. The idea was formalized by Dechter (1990a). Backmarking (Gaschnig, 1979) is a particularly simple method in which consistent and inconsistent pairwise assignments are saved and used to avoid rechecking constraints. Backmarking can be combined with conflict-directed backjumping; Kondrak and van Beek (1997) present a hybrid algorithm that provably subsumes either method taken separately.

The method of dynamic backtracking (Ginsberg, 1993) retains successful partial assignments from later subsets of variables when backtracking over an earlier choice that does not invalidate the later success. Moskewicz et al. (2001) show how these techniques and others are used to create an efficient SAT solver. Empirical studies of several randomized backtracking methods were done by Gomes et al. (2000) and Gomes and Selman (2001).
Van Beek (2006) surveys backtracking.

Local search in constraint satisfaction problems was popularized by the work of Kirkpatrick et al. (1983) on simulated annealing (see Chapter 4), which is widely used for VLSI layout and scheduling problems. Beck et al. (2011) give an overview of recent work on job-shop scheduling. The min-conflicts heuristic was first proposed by Gu (1989) and was developed independently by Minton et al. (1992). Sosic and Gu (1994) showed how it could be applied to solve the 3,000,000 queens problem in less than a minute. The astounding success of local search using min-conflicts on the n-queens problem led to a reappraisal of the nature and prevalence of “easy” and “hard” problems. Peter Cheeseman et al. (1991) explored the difficulty of randomly generated CSPs and discovered that almost all such problems either are trivially easy or have no solutions. Only if the parameters of the problem generator are set in a certain narrow range, within which roughly half of the problems are solvable, do we find “hard” problem instances. We discuss this phenomenon further in Chapter 7.

Konolige (1994) showed that local search is inferior to backtracking search on problems with a certain degree of local structure; this led to work that combined local search and inference, such as that by Pinkas and Dechter (1995). Hoos and Tsang (2006) provide a survey of local search techniques, and textbooks are offered by Hoos and Stützle (2004) and Aarts and Lenstra (2003).

Work relating the structure and complexity of CSPs originates with Freuder (1985) and Mackworth and Freuder (1985), who showed that search on arc-consistent trees works without any backtracking. A similar result, with extensions to acyclic hypergraphs, was developed in the database community (Beeri et al., 1983). Bayardo and Miranker (1994) present an algorithm for tree-structured CSPs that runs in linear time without any preprocessing. Dechter (1990a) describes the cycle-cutset approach.

Since those papers were published, there has been a great deal of progress in developing more general results relating the complexity of solving a CSP to the structure of its constraint graph. The notion of tree width was introduced by the graph theorists Robertson and Seymour (1986). Dechter and Pearl (1987, 1989), building on the work of Freuder, applied a related notion (which they called induced width but is identical to tree width) to constraint satisfaction problems and developed the tree decomposition approach sketched in Section 6.5.
Drawing on this work and on results from database theory, Gottlob et al. (1999a, 1999b) developed a notion, hypertree width, that is based on the characterization of the CSP as a hypergraph. In addition to showing that any CSP with hypertree width w can be solved in time $O(n^{w+1} \log n)$, they also showed that hypertree width subsumes all previously defined measures of “width” in the sense that there are cases where the hypertree width is bounded and the other measures are unbounded.
The RELSAT algorithm of Bayardo and Schrag (1997) combined constraint learning and backjumping and was shown to outperform many other algorithms of the time. This led to AND-OR search algorithms applicable to both CSPs and probabilistic reasoning (Dechter and Mateescu, 2007). Brown et al. (1988) introduce the idea of symmetry breaking in CSPs, and Gent et al. (2006) give a survey.
The field of distributed constraint satisfaction looks at solving CSPs when there is a
collection of agents, each of which controls a subset of the constraint variables. There have
been annual workshops on this problem since 2000, and good coverage elsewhere (Collin et al., 1999; Pearce et al., 2008).
Comparing CSP algorithms is mostly an empirical science: few theoretical results show that one algorithm dominates another on all problems; instead, we need to run experiments to see which algorithms perform better on typical instances of problems. As Hooker (1995) points out, we need to be careful to distinguish between competitive testing, as occurs in competitions among algorithms based on run time, and scientific testing, whose goal is to identify the properties of an algorithm that determine its efficacy on a class of problems.

The textbooks by Apt (2003), Dechter (2003), Tsang (1993), and Lecoutre (2009), and the collection by Rossi et al. (2006), are excellent resources on constraint processing. There are several good survey articles, including those by Dechter and Frost (2002), and Barták et al. (2010). Carbonnel and Cooper (2016) survey tractable classes of CSPs. Kondrak and van Beek (1997) give an analytical survey of backtracking search algorithms, and Bacchus and van Run (1995) give a more empirical survey. Constraint programming is covered in the books by Apt (2003) and Frühwirth and Abdennadher (2003). Papers on constraint satisfaction appear regularly in Artificial Intelligence and in the specialist journal Constraints; the latest SAT solvers are described in the annual International SAT Competition. The primary conference venue is the International Conference on Principles and Practice of Constraint Programming, often called CP.
CHAPTER 7
LOGICAL AGENTS

In which we design agents that can form representations of a complex world, use a process of inference to derive new representations about the world, and use these new representations to deduce what to do.
Humans, it seems, know things; and what they know helps them do things. In AI, knowledge-based agents use a process of reasoning over an internal representation of knowledge to decide what actions to take.

The problem-solving agents of Chapters 3 and 4 know things, but only in a very limited, inflexible sense. They know what actions are available and what the result of performing a specific action from a specific state will be, but they don’t know general facts. A route-finding agent doesn’t know that it is impossible for a road to be a negative number of kilometers long. An 8-puzzle agent doesn’t know that two tiles cannot occupy the same space. The knowledge they have is very useful for finding a path from the start to a goal, but not for anything else.

The atomic representations used by problem-solving agents are also very limiting. In a partially observable environment, for example, a problem-solving agent’s only choice for representing what it knows about the current state is to list all possible concrete states. I could give a human the goal of driving to a U.S. town with population less than 10,000, but to say that to a problem-solving agent, I could formally describe the goal only as an explicit set of the 16,000 or so towns that satisfy the description.

Chapter 6 introduced our first factored representation, whereby states are represented as assignments of values to variables; this is a step in the right direction, enabling some parts of the agent to work in a domain-independent way and allowing for more efficient algorithms. In this chapter, we take this step to its logical conclusion, so to speak: we develop logic as a general class of representations to support knowledge-based agents. These agents can combine and recombine information to suit myriad purposes. This can be far removed from the needs of the moment, as when a mathematician proves a theorem or an astronomer calculates the Earth’s life expectancy. Knowledge-based agents can accept new tasks in the form of explicitly described goals; they can achieve competence quickly by being told or learning new knowledge about the environment; and they can adapt to changes in the environment by updating the relevant knowledge.

We begin in Section 7.1 with the overall agent design. Section 7.2 introduces a simple new environment, the wumpus world, and illustrates the operation of a knowledge-based agent without going into any technical detail. Then we explain the general principles of logic in Section 7.3 and the specifics of propositional logic in Section 7.4. Propositional logic is a factored representation; while less expressive than first-order logic (Chapter 8), which is the canonical structured representation, propositional logic illustrates all the basic concepts of logic. It also comes with well-developed inference technologies, which we describe in Sections 7.5 and 7.6. Finally, Section 7.7 combines the concept of knowledge-based agents with the technology of propositional logic to build some simple agents for the wumpus world.

function KB-AGENT(percept) returns an action
  persistent: KB, a knowledge base
              t, a counter, initially 0, indicating time
  TELL(KB, MAKE-PERCEPT-SENTENCE(percept, t))
  action ← ASK(KB, MAKE-ACTION-QUERY(t))
  TELL(KB, MAKE-ACTION-SENTENCE(action, t))
  t ← t + 1
  return action

Figure 7.1 A generic knowledge-based agent. Given a percept, the agent adds the percept to its knowledge base, asks the knowledge base for the best action, and tells the knowledge base that it has in fact taken that action.

7.1 Knowledge-Based Agents
The central component of a knowledgebased agent is its knowledge base, or KB. A knowledge base
is a set of sentences.
(Here “sentence” is used as a technical term.
It is related
but not identical to the sentences of English and other natural languages.) Each sentence is
Knowledge base Sentence
expressed in a language called a knowledge representation language and represents some
Knowledge representation language
from other sentences, we call it an axiom.
Axiom
assertion about the world. When the sentence is taken as being given without being derived
There must be a way to add new sentences to the knowledge base and a way to query
what is known.
The standard names for these operations are TELL and ASK, respectively.
Both operations may involve inference—that is, deriving new sentences from old. Inference Inference must obey the requirement that when one ASKs a question of the knowledge base, the answer
should follow from what has been told (or TELLed) to the knowledge base previously. Later in this chapter, we will be more precise about the crucial word “follow.” For now, take it to
mean that the inference process should not make things up as it goes along. Figure 7.1 shows the outline of a knowledgebased agent program. Like all our agents, it takes a percept as input and returns an action. The agent maintains a knowledge base, KB,
which may initially contain some background knowledge.
Each time the agent program is called, it does three things. First, it TELLS the knowledge
base what it perceives. Second, it ASKs the knowledge base what action it should perform. In
the process of answering this query, extensive reasoning may be done about the current state of the world, about the outcomes of possible action sequences, and so on. Third, the agent
program TELLs the knowledge base which action was chosen, and returns the action so that it can be executed.
The details of the representation language are hidden inside three functions that implement the interface between the sensors and actuators on one side and the core representation and reasoning system on the other. MAKE-PERCEPT-SENTENCE constructs a sentence asserting that the agent perceived the given percept at the given time. MAKE-ACTION-QUERY constructs a sentence that asks what action should be done at the current time. Finally, MAKE-ACTION-SENTENCE constructs a sentence asserting that the chosen action was executed. The details of the inference mechanisms are hidden inside TELL and ASK. Later
sections will reveal these details.
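The three-step loop described above can be sketched in Python. This is only an illustrative outline, not the book's Figure 7.1: the class names `SimpleKB` and `KBAgent`, the string encoding of sentences, and the `grab_if_glitter` policy are all assumptions made for the example, and the "inference" inside `ask` is a placeholder rather than a real reasoning procedure.

```python
from typing import Callable, List


class SimpleKB:
    """Toy knowledge base: a list of sentences plus a pluggable ASK procedure."""

    def __init__(self, background: List[str],
                 answer: Callable[[List[str], str], str]) -> None:
        self.sentences = list(background)  # optional background knowledge
        self.answer = answer               # hypothetical stand-in for real inference

    def tell(self, sentence: str) -> None:
        self.sentences.append(sentence)

    def ask(self, query: str) -> str:
        return self.answer(self.sentences, query)


# The three interface functions; the string formats are assumptions.
def make_percept_sentence(percept: List[str], t: int) -> str:
    return f"Percept({percept}, {t})"


def make_action_query(t: int) -> str:
    return f"BestAction({t})"


def make_action_sentence(action: str, t: int) -> str:
    return f"Did({action}, {t})"


class KBAgent:
    def __init__(self, kb: SimpleKB) -> None:
        self.kb = kb
        self.t = 0  # time counter, incremented on every call

    def __call__(self, percept: List[str]) -> str:
        self.kb.tell(make_percept_sentence(percept, self.t))   # 1. TELL the percept
        action = self.kb.ask(make_action_query(self.t))        # 2. ASK for an action
        self.kb.tell(make_action_sentence(action, self.t))     # 3. TELL the chosen action
        self.t += 1
        return action


def grab_if_glitter(sentences: List[str], query: str) -> str:
    # Placeholder policy: inspect only the percept sentence that was just told.
    return "Grab" if "Glitter" in sentences[-1] else "Forward"


agent = KBAgent(SimpleKB([], grab_if_glitter))
```

Calling `agent(["None", "None", "Glitter", "None", "None"])` tells the percept, asks for an action, records the answer, and returns it. Swapping in a different `answer` function changes the agent's behavior without touching the loop, which is the point of hiding inference behind TELL and ASK.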
The agent in Figure 7.1 appears quite similar to the agents with internal state described in Chapter 2. Because of the definitions of TELL and ASK, however, the knowledge-based agent is not an arbitrary program for calculating actions. It is amenable to a description at the knowledge level, where we need specify only what the agent knows and what its goals are, in order to determine its behavior.
For example, an automated taxi might have the goal of taking a passenger from San Francisco to Marin County and might know that the Golden Gate Bridge is the only link between the two locations. Then we can expect it to cross the Golden Gate Bridge because it knows that that will achieve its goal. Notice that this analysis is independent of how the taxi works at the implementation level. It doesn’t matter whether its geographical knowledge is implemented as linked lists or pixel maps, or whether it reasons by manipulating strings of symbols stored in registers or by propagating noisy signals in a network of neurons.
A knowledge-based agent can be built simply by TELLing it what it needs to know. Starting with an empty knowledge base, the agent designer can TELL sentences one by one until the agent knows how to operate in its environment. This is called the declarative approach to system building. In contrast, the procedural approach encodes desired behaviors directly as program code. In the 1970s and 1980s, advocates of the two approaches engaged in heated debates. We now understand that a successful agent often combines both declarative and procedural elements in its design, and that declarative knowledge can often be compiled into more efficient procedural code.
We can also provide a knowledge-based agent with mechanisms that allow it to learn for
itself. These mechanisms, which are discussed in Chapter 19, create general knowledge about
the environment from a series of percepts. A learning agent can be fully autonomous.
7.2 The Wumpus World
In this section we describe an environment in which knowledge-based agents can show their worth. The wumpus world is a cave consisting of rooms connected by passageways. Lurking somewhere in the cave is the terrible wumpus, a beast that eats anyone who enters its room. The wumpus can be shot by an agent, but the agent has only one arrow. Some rooms contain bottomless pits that will trap anyone who wanders into these rooms (except for the wumpus, which is too big to fall in). The only redeeming feature of this bleak environment is the possibility of finding a heap of gold. Although the wumpus world is rather tame by modern computer game standards, it illustrates some important points about intelligence.
A sample wumpus world is shown in Figure 7.2. The precise definition of the task environment is given, as suggested in Section 2.3, by the PEAS description:
• Performance measure: +1000 for climbing out of the cave with the gold, −1000 for falling into a pit or being eaten by the wumpus, −1 for each action taken, and −10 for using up the arrow. The game ends either when the agent dies or when the agent climbs out of the cave.
Figure 7.2 A typical wumpus world. The agent is in the bottom left corner, facing east (rightward).
• Environment: A 4 × 4 grid of rooms, with walls surrounding the grid. The agent always starts in the square labeled [1,1], facing to the east. The locations of the gold and the wumpus are chosen randomly, with a uniform distribution, from the squares other than the start square. In addition, each square other than the start can be a pit, with probability 0.2.
• Actuators: The agent can move Forward, TurnLeft by 90°, or TurnRight by 90°. The agent dies a miserable death if it enters a square containing a pit or a live wumpus. (It is safe, albeit smelly, to enter a square with a dead wumpus.) If an agent tries to move forward and bumps into a wall, then the agent does not move. The action Grab can be used to pick up the gold if it is in the same square as the agent. The action Shoot can be used to fire an arrow in a straight line in the direction the agent is facing. The arrow continues until it either hits (and hence kills) the wumpus or hits a wall. The agent has only one arrow, so only the first Shoot action has any effect. Finally, the action Climb can be used to climb out of the cave, but only from square [1,1].
• Sensors: The agent has five sensors, each of which gives a single bit of information:
  - In the squares directly (not diagonally) adjacent to the wumpus, the agent will perceive a Stench.¹
  - In the squares directly adjacent to a pit, the agent will perceive a Breeze.
  - In the square where the gold is, the agent will perceive a Glitter.
  - When an agent walks into a wall, it will perceive a Bump.
  - When the wumpus is killed, it emits a woeful Scream that can be perceived anywhere in the cave.
The percepts will be given to the agent program in the form of a list of five symbols; for example, if there is a stench and a breeze, but no glitter, bump, or scream, the agent program will get [Stench, Breeze, None, None, None].
¹ Presumably the square containing the wumpus also has a stench, but any agent entering that square is eaten before being able to perceive anything.
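The sensor rules in the PEAS description can be turned into a small percept function. The sketch below is illustrative only: the function names are invented for the example, and the layout constants are an assumption chosen to agree with the walkthrough of Figure 7.2 later in this section (wumpus in [1,3], gold in [2,3], a pit in [3,1]); the other two pit positions are assumed. Bump and Scream depend on the agent's last action, so they are passed in as flags.

```python
from typing import List, Optional, Set, Tuple

Square = Tuple[int, int]

# Assumed layout for illustration; only partly confirmed by the text.
WUMPUS: Square = (1, 3)
GOLD: Square = (2, 3)
PITS: Set[Square] = {(3, 1), (3, 3), (4, 4)}


def neighbors(square: Square, size: int = 4) -> List[Square]:
    """Squares directly (not diagonally) adjacent, clipped to the grid."""
    x, y = square
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(a, b) for a, b in candidates if 1 <= a <= size and 1 <= b <= size]


def percept(agent: Square, bumped: bool = False,
            screamed: bool = False) -> List[Optional[str]]:
    """Return the five-element percept list for the agent's current square."""
    adjacent = neighbors(agent)
    return [
        "Stench" if WUMPUS in adjacent else None,
        "Breeze" if any(p in adjacent for p in PITS) else None,
        "Glitter" if agent == GOLD else None,
        "Bump" if bumped else None,
        "Scream" if screamed else None,
    ]
```

With this layout, `percept((1, 1))` is a list of five `None` values and `percept((2, 1))` reports only a breeze, matching the two percepts quoted in the caption of Figure 7.3.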
Figure 7.3 The first step taken by the agent in the wumpus world. (a) The initial situation, after percept [None, None, None, None, None]. (b) After moving to [2,1] and perceiving [None, Breeze, None, None, None].
We can characterize the wumpus environment along the various dimensions given in Chapter 2. Clearly, it is deterministic, discrete, static, and single-agent. (The wumpus doesn’t move, fortunately.) It is sequential, because rewards may come only after many actions are taken. It is partially observable, because some aspects of the state are not directly perceivable: the agent’s location, the wumpus’s state of health, and the availability of an arrow. As for the locations of the pits and the wumpus: we could treat them as unobserved parts of the state, in which case the transition model for the environment is completely known, and finding the locations of pits completes the agent’s knowledge of the state. Alternatively, we could say that the transition model itself is unknown because the agent doesn’t know which Forward actions are fatal, in which case discovering the locations of pits and wumpus completes the agent’s knowledge of the transition model.
For an agent in the environment, the main challenge is its initial ignorance of the configuration of the environment; overcoming this ignorance seems to require logical reasoning. In most instances of the wumpus world, it is possible for the agent to retrieve the gold safely. Occasionally, the agent must choose between going home empty-handed and risking death to find the gold. About 21% of the environments are utterly unfair, because the gold is in a pit or surrounded by pits.
Let us watch a knowledge-based wumpus agent exploring the environment shown in Figure 7.2. We use an informal knowledge representation language consisting of writing down symbols in a grid (as in Figures 7.3 and 7.4).
The agent’s initial knowledge base contains the rules of the environment, as described previously; in particular, it knows that it is in [1,1] and that [1,1] is a safe square; we denote that with an “A” and “OK,” respectively, in square [1,1]. The first percept is [None, None, None, None, None], from which the agent can conclude that its neighboring squares, [1,2] and [2,1], are free of dangers: they are OK. Figure 7.3(a) shows the agent’s state of knowledge at this point.
Figure 7.4 Two later stages in the progress of the agent. (a) After moving to [1,1] and then [1,2], and perceiving [Stench, None, None, None, None]. (b) After moving to [2,2] and then [2,3], and perceiving [Stench, Breeze, Glitter, None, None].
A cautious agent will move only into a square that it knows to be OK. Let us suppose the agent decides to move forward to [2,1]. The agent perceives a breeze (denoted by “B”) in [2,1], so there must be a pit in a neighboring square. The pit cannot be in [1,1], by the rules of the game, so there must be a pit in [2,2] or [3,1] or both. The notation “P?” in Figure 7.3(b) indicates a possible pit in those squares. At this point, there is only one known square that is OK and that has not yet been visited. So the prudent agent will turn around, go back to [1,1], and then proceed to [1,2].
The agent perceives a stench in [1,2], resulting in the state of knowledge shown in Figure 7.4(a). The stench in [1,2] means that there must be a wumpus nearby. But the wumpus cannot be in [1,1], by the rules of the game, and it cannot be in [2,2] (or the agent would have detected a stench when it was in [2,1]). Therefore, the agent can infer that the wumpus is in [1,3]. The notation “W!” indicates this inference. Moreover, the lack of a breeze in [1,2] implies that there is no pit in [2,2]. Yet the agent has already inferred that there must be a pit in either [2,2] or [3,1], so this means it must be in [3,1]. This is a fairly difficult inference, because it combines knowledge gained at different times in different places and relies on the lack of a percept to make one crucial step.
The agent has now proved to itself that there is neither a pit nor a wumpus in [2,2], so it is OK to move there. We do not show the agent’s state of knowledge at [2,2]; we just assume that the agent turns and moves to [2,3], giving us Figure 7.4(b). In [2,3], the agent detects a glitter, so it should grab the gold and then return home.
Note that in each case for which the agent draws a conclusion from the available information, that conclusion is guaranteed to be correct if the available information is correct.
This is a fundamental property of logical reasoning. In the rest of this chapter, we describe
how to build logical agents that can represent information and draw conclusions such as those
described in the preceding paragraphs.
7.3 Logic
This section summarizes the fundamental concepts of logical representation and reasoning. These beautiful ideas are independent of any of logic’s particular forms. We therefore postpone the technical details of those forms until the next section, using instead the familiar example of ordinary arithmetic.
In Section 7.1, we said that knowledge bases consist of sentences. These sentences are expressed according to the syntax of the representation language, which specifies all the sentences that are well formed. The notion of syntax is clear enough in ordinary arithmetic: “x + y = 4” is a well-formed sentence, whereas “x4y+ =” is not.
A logic must also define the semantics, or meaning, of sentences. The semantics defines the truth of each sentence with respect to each possible world. For example, the semantics for arithmetic specifies that the sentence “x + y = 4” is true in a world where x is 2 and y is 2, but false in a world where x is 1 and y is 1. In standard logics, every sentence must be either true or false in each possible world; there is no “in between.”²
When we need to be precise, we use the term model in place of “possible world.” Whereas possible worlds might be thought of as (potentially) real environments that the agent might or might not be in, models are mathematical abstractions, each of which has a fixed truth value (true or false) for every relevant sentence. Informally, we may think of a possible world as, for example, having x men and y women sitting at a table playing bridge, and the sentence x + y = 4 is true when there are four people in total. Formally, the possible models are just all possible assignments of nonnegative integers to the variables x and y. Each such assignment determines the truth of any sentence of arithmetic whose variables are x and y. If a sentence α is true in model m, we say that m satisfies α or sometimes m is a model of α. We use the notation M(α) to mean the set of all models of α.
Now that we have a notion of truth, we are ready to talk about logical reasoning. This involves the relation of logical entailment between sentences: the idea that a sentence follows logically from another sentence. In mathematical notation, we write
    α ⊨ β
to mean that the sentence α entails the sentence β. The formal definition of entailment is this:
    α ⊨ β if and only if, in every model in which α is true, β is also true.
Using the notation just introduced, we can write
    α ⊨ β if and only if M(α) ⊆ M(β).
(Note the direction of the ⊆ here: if α ⊨ β, then α is a stronger assertion than β: it rules out more possible worlds.) The relation of entailment is familiar from arithmetic; we are happy with the idea that the sentence x = 0 entails the sentence xy = 0. Obviously, in any model where x is zero, it is the case that xy is zero (regardless of the value of y).
We can apply the same kind of analysis to the wumpus-world reasoning example given in the preceding section. Consider the situation in Figure 7.3(b): the agent has detected nothing in [1,1] and a breeze in [2,1]. These percepts, combined with the agent’s knowledge of the rules of the wumpus world, constitute the KB. The agent is interested in whether the adjacent squares [1,2], [2,2], and [3,1] contain pits. Each of the three squares might or might not contain a pit, so (ignoring other aspects of the world for now) there are 2³ = 8 possible models. These eight models are shown in Figure 7.5.³
Figure 7.5 Possible models for the presence of pits in squares [1,2], [2,2], and [3,1]. The KB corresponding to the observations of nothing in [1,1] and a breeze in [2,1] is shown by the solid line. (a) Dotted line shows models of α₁ (no pit in [1,2]). (b) Dotted line shows models of α₂ (no pit in [2,2]).
² Fuzzy logic, discussed in Chapter 13, allows for degrees of truth.
The KB can be thought of as a set of sentences or as a single sentence that asserts all the individual sentences. The KB is false in models that contradict what the agent knows. For example, the KB is false in any model in which [1,2] contains a pit, because there is no breeze in [1,1]. There are in fact just three models in which the KB is true, and these are shown surrounded by a solid line in Figure 7.5. Now let us consider two possible conclusions:
    α₁ = “There is no pit in [1,2].”
    α₂ = “There is no pit in [2,2].”
We have surrounded the models of α₁ and α₂ with dotted lines in Figures 7.5(a) and 7.5(b), respectively. By inspection, we see the following: in every model in which KB is true, α₁ is also true. Hence, KB ⊨ α₁: there is no pit in [1,2]. We can also see that in some models in which KB is true, α₂ is false. Hence, KB does not entail α₂: the agent cannot conclude that there is no pit in [2,2]. (Nor can it conclude that there is a pit in [2,2].)⁴
The preceding example not only illustrates entailment but also shows how the definition of entailment can be applied to derive conclusions, that is, to carry out logical inference. The inference algorithm illustrated in Figure 7.5 is called model checking, because it enumerates all possible models to check that α is true in all models in which KB is true, that is, that M(KB) ⊆ M(α).
³ Although the figure shows the models as partial wumpus worlds, they are really nothing more than assignments of true and false to the sentences “there is a pit in [1,2]” etc. Models, in the mathematical sense, do not need to have ’orrible ’airy wumpuses in them.
⁴ The agent can calculate the probability that there is a pit in [2,2]; Chapter 12 shows how.
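The enumeration behind Figure 7.5 is small enough to run directly. The following sketch checks entailment by brute force over the eight pit models. The symbol names `P12`, `P22`, `P31` and the encoding of the KB as a Python predicate are assumptions made for the example; the KB here captures only the two facts used in the text (no breeze in [1,1], a breeze in [2,1]).

```python
from itertools import product
from typing import Callable, Dict, Iterator, List

Model = Dict[str, bool]

# One symbol per square of interest: pit in [1,2], [2,2], [3,1].
SYMBOLS: List[str] = ["P12", "P22", "P31"]


def models(symbols: List[str]) -> Iterator[Model]:
    """Yield every assignment of True/False to the given symbols."""
    for values in product([False, True], repeat=len(symbols)):
        yield dict(zip(symbols, values))


def kb(m: Model) -> bool:
    no_breeze_in_11 = not m["P12"]       # [2,1] was visited safely, so only [1,2] matters
    breeze_in_21 = m["P22"] or m["P31"]  # [1,1] is safe, so the pit is in [2,2] or [3,1]
    return no_breeze_in_11 and breeze_in_21


def alpha1(m: Model) -> bool:
    return not m["P12"]  # "There is no pit in [1,2]."


def alpha2(m: Model) -> bool:
    return not m["P22"]  # "There is no pit in [2,2]."


def entails(kb: Callable[[Model], bool], alpha: Callable[[Model], bool],
            symbols: List[str]) -> bool:
    """KB entails alpha iff alpha holds in every model in which KB holds."""
    return all(alpha(m) for m in models(symbols) if kb(m))
```

Here `entails(kb, alpha1, SYMBOLS)` is true and `entails(kb, alpha2, SYMBOLS)` is false, and exactly three of the eight models satisfy `kb`, in agreement with the figure. The cost is exponential in the number of symbols, which is why this approach applies only when the space of models is finite and small.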
Figure 7.6 Sentences are physical configurations of the agent, and reasoning is a process of constructing new physical configurations from old ones. Logical reasoning should ensure that the new configurations represent aspects of the world that actually follow from the aspects that the old configurations represent.
In understanding entailment and inference, it might help to think of the set of all consequences of KB as a haystack and of α as a needle. Entailment is like the needle being in the haystack; inference is like finding it. This distinction is embodied in some formal notation: if an inference algorithm i can derive α from KB, we write
    KB ⊢ᵢ α,
which is pronounced “α is derived from KB by i” or “i derives α from KB.”
An inference algorithm that derives only entailed sentences is called sound or truth-preserving. Soundness is a highly desirable property. An unsound inference procedure essentially makes things up as it goes along: it announces the discovery of nonexistent needles. It is easy to see that model checking, when it is applicable,⁵ is a sound procedure.
The property of completeness is also desirable: an inference algorithm is complete if it can derive any sentence that is entailed. For real haystacks, which are finite in extent, it seems obvious that a systematic examination can always decide whether the needle is in the haystack. For many knowledge bases, however, the haystack of consequences is infinite, and completeness becomes an important issue.⁶ Fortunately, there are complete inference procedures for logics that are sufficiently expressive to handle many knowledge bases.
We have described a reasoning process whose conclusions are guaranteed to be true in any world in which the premises are true; in particular, if KB is true in the real world, then any sentence α derived from KB by a sound inference procedure is also true in the real world. So, while an inference process operates on “syntax” (internal physical configurations such as bits in registers or patterns of electrical blips in brains), the process corresponds to the real-world relationship whereby some aspect of the real world is the case by virtue of other aspects of the real world being the case.⁷ This correspondence between world and representation is illustrated in Figure 7.6.

⁵ Model checking works if the space of models is finite, for example, in wumpus worlds of fixed size. For arithmetic, on the other hand, the space of models is infinite: even if we restrict ourselves to the integers, there are infinitely many pairs of values for x and y in the sentence x + y = 4.
⁶ Compare with the case of infinite search spaces in Chapter 3, where depth-first search is not complete.
⁷ As Wittgenstein (1922) put it in his famous Tractatus: “The world is everything that is the case.”
The final issue to consider is grounding: the connection between logical reasoning processes and the real environment in which the agent exists. In particular, how do we know that KB is true in the real world? (After all, KB is just “syntax” inside the agent’s head.) This is a philosophical question about which many, many books have been written. (See Chapter 27.) A simple answer is that the agent’s sensors create the connection. For example, our wumpus-world agent has a smell sensor. The agent program creates a suitable sentence whenever there is a smell. Then, whenever that sentence is in the knowledge base, it is true in the real world. Thus, the meaning and truth of percept sentences are defined by the processes of sensing and sentence construction that produce them. What about the rest of the agent’s knowledge, such as its belief that wumpuses cause smells in adjacent squares? This is not a direct representation of a single percept, but a general rule, derived perhaps from perceptual experience but not identical to a statement of that experience. General rules like this are produced by a sentence construction process called learning, which is the subject of Part V. Learning is fallible. It could be the case that wumpuses cause smells except on February 29 in leap years, which is when they take their baths. Thus, KB may not be true in the real world, but with
good learning procedures, there is reason for optimism.
7.4 Propositional Logic: A Very Simple Logic
We now present propositional logic. We describe its syntax (the structure of sentences) and its semantics (the way in which the truth of sentences is determined). From these, we derive a simple, syntactic algorithm for logical inference that implements the semantic notion of entailment. Everything takes place, of course, in the wumpus world.
7.4.1 Syntax
The syntax of propositional logic defines the allowable sentences. The atomic sentences consist of a single proposition symbol. Each such symbol stands for a proposition that can be true or false. We use symbols that start with an uppercase letter and may contain other letters or subscripts, for example: P, Q, R, W₁,₃ and FacingEast. The names are arbitrary but are often chosen to have some mnemonic value: we use W₁,₃ to stand for the proposition that the wumpus is in [1,3]. (Remember that symbols such as W₁,₃ are atomic, i.e., W, 1, and 3 are not meaningful parts of the symbol.) There are two proposition symbols with fixed meanings: True is the always-true proposition and False is the always-false proposition. Complex sentences are constructed from simpler sentences, using parentheses and operators called logical connectives. There are five connectives in common use:
¬ (not). A sentence such as ¬W₁,₃ is called the negation of W₁,₃. A literal is either an atomic sentence (a positive literal) or a negated atomic sentence (a negative literal).
∧ (and). A sentence whose main connective is ∧, such as W₁,₃ ∧ P₃,₁, is called a conjunction; its parts are the conjuncts. (The ∧ looks like an “A” for “And.”)
∨ (or). A sentence whose main connective is ∨, such as (W₁,₃ ∧ P₃,₁) ∨ W₂,₂, is a disjunction; its parts are disjuncts: in this example, (W₁,₃ ∧ P₃,₁) and W₂,₂.
⇒ (implies). A sentence such as (W₁,₃ ∧ P₃,₁) ⇒ ¬W₂,₂ is called an implication (or conditional). Its premise or antecedent is (W₁,₃ ∧ P₃,₁), and its conclusion or consequent is ¬W₂,₂. Implications are also known as rules or if-then statements. The implication symbol is sometimes written in other books as ⊃ or →.
⇔ (if and only if). The sentence W₁,₃ ⇔ ¬W₂,₂ is a biconditional.
Sentence → AtomicSentence | ComplexSentence
AtomicSentence → True | False | P | Q | R | ...
ComplexSentence → ( Sentence )
                | ¬ Sentence
                | Sentence ∧ Sentence
                | Sentence ∨ Sentence
                | Sentence ⇒ Sentence
                | Sentence ⇔ Sentence
=8 possible models, exactly those depicted in Figure 7.5. Notice, however, that the models are purely mathematical objects with no necessary connection to wumpus worlds. P₁,₂ is just a symbol; it might mean “there is a pit in [1,2]” or “I’m in Paris today and tomorrow.”
The semantics for propositional logic must specify how to compute the truth value of any sentence, given a model. This is done recursively. All sentences are constructed from atomic sentences and the five connectives; therefore, we need to specify how to compute the truth of atomic sentences and how to compute the truth of sentences formed with each of the five connectives. Atomic sentences are easy:
• True is true in every model and False is false in every model.
• The truth value of every other proposition symbol must be specified directly in the model. For example, in the model m₁ given earlier, P₁,₂ is false.
Figure 7.8 Truth tables for the five logical connectives. To use the table to compute, for example, the value of P ∨ Q when P is true and Q is false, first look on the left for the row where P is true and Q is false (the third row). Then look in that row under the P ∨ Q column to see the result: true.
For complex sentences, we have five rules, which hold for any subsentences P and Q (atomic or complex) in any model m (here “iff” means “if and only if”):
• ¬P is true iff P is false in m.
• P ∧ Q is true iff both P and Q are true in m.
• P ∨ Q is true iff either P or Q is true in m.
• P ⇒ Q is true unless P is true and Q is false in m.
• P ⇔ Q is true iff P and Q are both true or both false in m.
The rules can also be expressed with truth tables that specify the truth value of a complex sentence for each possible assignment of truth values to its components. Truth tables for the five connectives are given in Figure 7.8. From these tables, the truth value of any sentence s can be computed with respect to any model m by a simple recursive evaluation. For example, the sentence ¬P₁,₂ ∧ (P₂,₂ ∨ P₃,₁), evaluated in m₁, gives true ∧ (false ∨ true) = true ∧ true = true. Exercise 7.TRUV asks you to write the algorithm PL-TRUE?(s, m), which computes the truth value of a propositional logic sentence s in a model m.
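The five rules translate almost line for line into a recursive evaluator. The sketch below is one possible shape for such a function, not the book's solution to the exercise: the representation of sentences as nested tuples, the connective names (`"not"`, `"and"`, `"or"`, `"implies"`, `"iff"`), and the symbol names are all assumptions made for the example.

```python
from typing import Dict, Tuple, Union

# A sentence is a proposition symbol (str) or a tuple (connective, operand, ...).
Sentence = Union[str, Tuple]


def pl_true(s: Sentence, m: Dict[str, bool]) -> bool:
    """Evaluate propositional sentence s in model m by recursion on its structure."""
    if isinstance(s, str):
        if s == "True":
            return True    # true in every model
        if s == "False":
            return False   # false in every model
        return m[s]        # other symbols are looked up in the model
    op, *args = s
    if op == "not":
        return not pl_true(args[0], m)
    p, q = pl_true(args[0], m), pl_true(args[1], m)
    if op == "and":
        return p and q
    if op == "or":
        return p or q
    if op == "implies":
        return (not p) or q  # false only when p is true and q is false
    if op == "iff":
        return p == q
    raise ValueError(f"unknown connective: {op}")


# The model m1 and the example sentence from the text.
m1 = {"P12": False, "P22": False, "P31": True}
example = ("and", ("not", "P12"), ("or", "P22", "P31"))
```

Evaluating `pl_true(example, m1)` follows the same steps as the worked example: the negation gives true, the disjunction gives true, and the conjunction of the two is true.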
The truth tables for “and,” “or,” and “not” are in close accord with our intuitions about the English words. The main point of possible confusion is that P ∨ Q is true when P is true or Q is true or both. A different connective, called “exclusive or” (“xor” for short), yields false when both disjuncts are true.⁸ There is no consensus on the symbol for exclusive or; some choices are ∨̇ or ≠ or ⊕.
The truth table for ⇒ may not quite fit one’s intuitive understanding of “P implies Q” or “if P then Q.” For one thing, propositional logic does not require any relation of causation or relevance between P and Q. The sentence “5 is odd implies Tokyo is the capital of Japan” is a true sentence of propositional logic (under the normal interpretation), even though it is a decidedly odd sentence of English. Another point of confusion is that any implication is true whenever its antecedent is false. For example, “5 is even implies Sam is smart” is true, regardless of whether Sam is smart. This seems bizarre, but it makes sense if you think of “P ⇒ Q” as saying, “If P is true, then I am claiming that Q is true; otherwise I am making no claim.” The only way for this sentence to be false is if P is true but Q is false.
The biconditional, P ⇔ Q, is true whenever both P ⇒ Q and Q ⇒ P are true. In English, this is often written as “P if and only if Q.” Many of the rules of the wumpus world are best written using
⁸ Latin uses two separate words: “vel” is inclusive or and “aut” is exclusive or.
For this reason, we have used a special notation, the double-boxed link, in Figure 10.4. This link asserts that
    ∀x x ∈ Persons ⇒ [∀y HasMother(x,y) ⇒ y ∈ FemalePersons].
We might also want to assert that persons have two legs, that is,
    ∀x x ∈ Persons ⇒ Legs(x,2).
As before, we need to be careful not to assert that a category has legs; the single-boxed link in Figure 10.4 is used to assert properties of every member of a category.
The semantic network notation makes it convenient to perform inheritance reasoning of the kind introduced in Section 10.2. For example, by virtue of being a person, Mary inherits the property of having two legs. Thus, to find out how many legs Mary has, the inheritance algorithm follows the MemberOf link from Mary to the category she belongs to, and then follows SubsetOf links up the hierarchy until it finds a category for which there is a boxed Legs link; in this case, the Persons category. The simplicity and efficiency of this inference mechanism, compared with semidecidable logical theorem proving, has been one of the main attractions of semantic networks.
⁵ Several early systems failed to distinguish between properties of members of a category and properties of the category as a whole. This can lead directly to inconsistencies, as pointed out by Drew McDermott (1976) in his article “Artificial Intelligence Meets Natural Stupidity.” Another common problem was the use of IsA links for both subset and membership relations, in correspondence with English usage: “a cat is a mammal” and “Fifi is a cat.” See Exercise 10.NATS for more on these issues.
Figure 10.4 A semantic network with four objects (John, Mary, 1, and 2) and four categories. Relations are denoted by labeled links.
Figure 10.5 A fragment of a semantic network showing the representation of the logical assertion Fly(Shankar, NewYork, NewDelhi, Yesterday).
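The inheritance procedure described above (follow MemberOf once, then climb SubsetOf links until a boxed value is found) can be sketched as a short lookup. The dictionaries below are an assumed encoding of part of Figure 10.4: the category names `FemalePersons`, `MalePersons`, and `Mammals` are taken to match the figure, and the sketch assumes each category has at most one parent, so multiple inheritance is not handled. A value attached directly to an object is checked first, which is how the section's later discussion of default values treats John's one leg.

```python
from typing import Dict, Optional

# Assumed encoding of Figure 10.4; single parent per category.
member_of: Dict[str, str] = {"Mary": "FemalePersons", "John": "MalePersons"}
subset_of: Dict[str, str] = {
    "FemalePersons": "Persons",
    "MalePersons": "Persons",
    "Persons": "Mammals",
}
# Boxed links: properties asserted of every member of a category.
boxed_links: Dict[str, Dict[str, int]] = {"Persons": {"Legs": 2}}
# Links attached directly to individual objects.
object_links: Dict[str, Dict[str, int]] = {"John": {"Legs": 1}}


def inherit(obj: str, prop: str) -> Optional[int]:
    """Return the first value of prop found walking upward from obj, else None."""
    if prop in object_links.get(obj, {}):
        return object_links[obj][prop]
    category = member_of.get(obj)             # follow the MemberOf link
    while category is not None:
        if prop in boxed_links.get(category, {}):
            return boxed_links[category][prop]
        category = subset_of.get(category)    # climb one SubsetOf link
    return None
```

For Mary the walk visits FemalePersons, then Persons, where it finds `Legs = 2`. The number of steps is bounded by the depth of the hierarchy, which is the efficiency the text contrasts with general theorem proving.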
Inheritance becomes complicated when an object can belong to more than one category or when a category can be a subset of more than one other category; this is called multiple inheritance. In such cases, the inheritance algorithm might find two or more conflicting values answering the query. For this reason, multiple inheritance is banned in some object-oriented programming (OOP) languages, such as Java, that use inheritance in a class hierarchy. It is usually allowed in semantic networks, but we defer discussion of that until Section 10.6.
The reader might have noticed an obvious drawback of semantic network notation, compared to first-order logic: the fact that links between bubbles represent only binary relations. For example, the sentence Fly(Shankar, NewYork, NewDelhi, Yesterday) cannot be asserted directly in a semantic network. Nonetheless, we can obtain the effect of n-ary assertions by reifying the proposition itself as an event belonging to an appropriate event category. Figure 10.5 shows the semantic network structure for this particular event. Notice that the
restriction to binary relations forces the creation of a rich ontology of reified concepts.
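Reification can be illustrated by rewriting the four-argument Fly assertion as a set of binary links on a new event object. In the sketch below, the event name `Fly17`, the category `FlyEvents`, and the role names `Origin` and `Destination` are assumptions chosen for the example; `MemberOf`, `Agent`, and `During` appear as link labels in Figure 10.5. The helper functions are hypothetical and exist only to show the idea.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object): one binary link


def reify(event_id: str, category: str, roles: Dict[str, str]) -> List[Triple]:
    """Turn an n-ary assertion into binary links hanging off a new event object."""
    triples: List[Triple] = [(event_id, "MemberOf", category)]
    triples += [(event_id, role, filler) for role, filler in roles.items()]
    return triples


def query(triples: List[Triple], subject: str, relation: str) -> List[str]:
    """Return every object linked to subject by relation."""
    return [o for s, r, o in triples if s == subject and r == relation]


# Fly(Shankar, NewYork, NewDelhi, Yesterday) as five binary links.
network = reify("Fly17", "FlyEvents", {
    "Agent": "Shankar",
    "Origin": "NewYork",
    "Destination": "NewDelhi",
    "During": "Yesterday",
})
```

Each argument of the original assertion becomes one link, plus one MemberOf link for the event itself, so `query(network, "Fly17", "Agent")` returns `["Shankar"]`. The price is the extra vocabulary: the event object, its category, and a role name per argument all have to exist in the ontology.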
Reification of propositions makes it possible to represent every ground, function-free atomic sentence of first-order logic in the semantic network notation. Certain kinds of universally quantified sentences can be asserted using inverse links and the singly boxed and doubly boxed arrows applied to categories, but that still leaves us a long way short of full first-order logic. Negation, disjunction, nested function symbols, and existential quantification are all missing. Now it is possible to extend the notation to make it equivalent to first-order logic (as in Peirce’s existential graphs), but doing so negates one of the main advantages of semantic networks, which is the simplicity and transparency of the inference processes. Designers can build a large network and still have a good idea about what queries will be efficient, because (a) it is easy to visualize the steps that the inference procedure will go through and (b) in some cases the query language is so simple that difficult queries cannot be posed.
In cases where the expressive power proves to be too limiting, many semantic network systems provide for procedural attachment to fill in the gaps. Procedural attachment is a technique whereby a query about (or sometimes an assertion of) a certain relation results in a call to a special procedure designed for that relation rather than a general inference algorithm.
One of the most important aspects of semantic networks is their ability to represent default values for categories. Examining Figure 10.4 carefully, one notices that John has one leg, despite the fact that he is a person and all persons have two legs. In a strictly logical KB, this would be a contradiction, but in a semantic network, the assertion that all persons have
two legs has only default status; that is, a person is assumed to have two legs unless this is contradicted by more specific information. The default semantics is enforced naturally by the inheritance algorithm, because it follows links upwards from the object itself (John in this
case) and stops as soon as it finds a value. We say that the default is overridden by the more
specific value. Notice that we could also override the default number of legs by creating a category of OneLeggedPersons, a subset of Persons of which John is a member.
We can retain a strictly logical semantics for the network if we say that the Legs assertion
for Persons includes an exception for John:
$\forall x\; x \in \mathit{Persons} \land x \neq \mathit{John} \Rightarrow \mathit{Legs}(x, 2)$.
For a fixed network, this is semantically adequate but will be much less concise than the network notation itself if there are lots of exceptions. For a network that will be updated with
more assertions, however, such an approach fails: we really want to say that any persons as
yet unknown with one leg are exceptions too. Section 10.6 goes into more depth on this issue and on default reasoning in general.

10.5.2 Description logics
The syntax of first-order logic is designed to make it easy to say things about objects. Description logics are notations that are designed to make it easier to describe definitions
and properties of categories. Description logic systems evolved from semantic networks in response
to pressure to formalize what the networks mean while retaining the emphasis on
taxonomic structure as an organizing principle.
The principal inference tasks for description logics are subsumption (checking if one category is a subset of another by comparing their definitions) and classification (checking whether an object belongs to a category). Some systems also include consistency of a category
definition: whether the membership criteria are logically satisfiable.
Concept → Thing | ConceptName
    | And(Concept, ...)
    | All(RoleName, Concept)
    | AtLeast(Integer, RoleName)
    | AtMost(Integer, RoleName)
    | Fills(RoleName, IndividualName, ...)
    | SameAs(Path, Path)
    | OneOf(IndividualName, ...)
Path → [RoleName, ...]
ConceptName → Adult | Female | Male | ...
RoleName → Spouse | Daughter | Son | ...
Figure 10.6 The syntax of descriptions in a subset of the CLASSIC language.
The CLASSIC language (Borgida et al., 1989) is a typical description logic. The syntax
of CLASSIC descriptions is shown in Figure 10.6.⁶ For example, to say that bachelors are
unmarried adult males we would write
Bachelor = And(Unmarried, Adult, Male).
The equivalent in first-order logic would be
$\mathit{Bachelor}(x) \Leftrightarrow \mathit{Unmarried}(x) \land \mathit{Adult}(x) \land \mathit{Male}(x)$.
Notice that the description logic has an algebra of operations on predicates, which of course
we can’t do in first-order logic. Any description in CLASSIC can be translated into an equivalent
first-order sentence, but some descriptions are more straightforward in CLASSIC. For
example, to describe the set of men with at least three sons who are all unemployed and married to doctors, and at most two daughters who are all professors in physics or math
departments, we would use
And(Man, AtLeast(3, Son), AtMost(2, Daughter),
    All(Son, And(Unemployed, Married, All(Spouse, Doctor))),
    All(Daughter, And(Professor, Fills(Department, Physics, Math)))).
We leave it as an exercise to translate this into first-order logic.
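To make "subsumption by comparing definitions" concrete, here is a small sketch for a fragment with only concept names, And, AtLeast, and AtMost. The tuple encoding, the `definitions` table, and both function names are assumptions for illustration. The test is purely structural and ignores inconsistent descriptions, so it should be read as a sound approximation rather than CLASSIC's actual algorithm.

```python
# Descriptions are nested tuples, e.g. ("And", "Adult", ("AtLeast", 3, "Son")).
definitions = {"Bachelor": ("And", "Unmarried", "Adult", "Male")}

def normalize(desc):
    """Flatten a description into (primitive atoms, AtLeast bounds, AtMost bounds)."""
    atoms, at_least, at_most = set(), {}, {}

    def walk(d):
        if isinstance(d, str):
            if d == "Thing":
                return                      # Thing imposes no constraint
            if d in definitions:
                walk(definitions[d])        # expand a defined concept name
            else:
                atoms.add(d)                # primitive concept name
        elif d[0] == "And":
            for part in d[1:]:
                walk(part)
        elif d[0] == "AtLeast":
            at_least[d[2]] = max(at_least.get(d[2], 0), d[1])
        elif d[0] == "AtMost":
            at_most[d[2]] = min(at_most.get(d[2], d[1]), d[1])
        else:
            raise ValueError(f"unsupported constructor: {d[0]}")

    walk(desc)
    return atoms, at_least, at_most

def subsumes(general, specific):
    """True if every constraint in `general` is implied by one in `specific`."""
    g_atoms, g_min, g_max = normalize(general)
    s_atoms, s_min, s_max = normalize(specific)
    return (g_atoms <= s_atoms
            and all(s_min.get(r, 0) >= n for r, n in g_min.items())
            and all(r in s_max and s_max[r] <= n for r, n in g_max.items()))
```

For example, `subsumes(("And", "Adult", "Male"), "Bachelor")` holds because Bachelor's definition contains both conjuncts, whereas the reverse test fails. Each check is a single pass over the two descriptions, which hints at why restricted languages can keep subsumption cheap.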
Perhaps the most important aspect of description logics is their emphasis on tractability of inference. A problem instance is solved by describing it and then asking if it is subsumed by one of several possible solution categories. In standard first-order logic systems, predicting the solution time is often impossible. It is frequently left to the user to engineer the representation
to detour around sets of sentences that seem to be causing the system to take several
weeks to solve a problem. The thrust in description logics, on the other hand, is to ensure that
subsumption-testing can be solved in time polynomial in the size of the descriptions.⁷

⁶ Notice that the language does not allow one to simply state that one concept, or category, is a subset of another. This is a deliberate policy: subsumption between categories must be derivable from some aspects of the descriptions of the categories. If not, then something is missing from the descriptions.
This sounds wonderful in principle, until one realizes that it can only have one of two
consequences: either hard problems cannot be stated at all, or they require exponentially
large descriptions! However, the tractability results do shed light on what sorts of constructs
cause problems and thus help the user to understand how different representations behave. For example, description logics usually lack negation and disjunction. Each forces first-order logical systems to go through a potentially exponential case analysis in order to ensure completeness.
CLASSIC allows only a limited form of disjunction in the Fills and OneOf
constructs, which permit disjunction over explicitly enumerated individuals but not over descriptions. With disjunctive descriptions, nested definitions can lead easily to an exponential
number of alternative routes by which one category can subsume another.

10.6 Reasoning with Default Information
In the preceding section, we saw a simple example of an assertion with default status: people have two legs. This default can be overridden by more specific information, such as that Long John Silver has one leg. We saw that the inheritance mechanism in semantic networks
implements the overriding of defaults in a simple and natural way. In this section, we study
defaults more generally, with a view toward understanding the semantics of defaults rather than just providing a procedural mechanism.

10.6.1 Circumscription and default logic
We have seen two examples of reasoning processes that violate the monotonicity property of
logic that was proved in Chapter 7.⁸ In this chapter we saw that a property inherited by all members of a category in a semantic network could be overridden by more specific information
for a subcategory. In Section 9.4.4, we saw that under the closed-world assumption, if a
proposition $\alpha$ is not mentioned in $\mathit{KB}$ then $\mathit{KB} \models \lnot\alpha$, but $\mathit{KB} \land \alpha \models \alpha$.
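The closed-world case can be seen in a few lines. This sketch assumes a KB that is just a set of ground atoms written as strings; the atoms are made up for the example.

```python
def holds_negation(kb, atom):
    """Closed-world assumption: a ground atom absent from the KB is taken to be false."""
    return atom not in kb

kb = {"Course(CS, 101)", "Course(CS, 102)"}

# Conclusion drawn from the smaller KB.
before = holds_negation(kb, "Course(EE, 101)")

# Adding a sentence withdraws that conclusion, which is the failure of monotonicity.
kb.add("Course(EE, 101)")
after = holds_negation(kb, "Course(EE, 101)")
```

Here `before` is true and `after` is false: a conclusion available from the original KB is no longer available once the KB grows.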
Simple introspection suggests that these failures of monotonicity are widespread in commonsense
reasoning. It seems that humans often “jump to conclusions.” For example, when
one sees a car parked on the street, one is normally willing to believe that it has four wheels
even though only three are visible. Now, probability theory can certainly provide a conclusion
that the fourth wheel exists with high probability; yet, for most people, the possibility that the
car does not have four wheels will not arise unless some new evidence presents itself. Thus,
it seems that the four-wheel conclusion is reached by default, in the absence of any reason to
doubt it. If new evidence arrives—for example, if one sees the owner carrying a wheel and notices that the car is jacked up—then the conclusion can be retracted. This kind of reasoning
is said to exhibit nonmonotonicity, because the set of beliefs does not grow monotonically over time as new evidence arrives. Nonmonotonic logics have been devised with modified notions of truth and entailment in order to capture such behavior. We will look at two such
logics that have been studied extensively: circumscription and default logic.
⁷ CLASSIC provides efficient subsumption testing in practice, but the worst-case run time is exponential.
⁸ Recall that monotonicity requires all entailed sentences to remain entailed after new sentences are added to the KB. That is, if $\mathit{KB} \models \alpha$ then $\mathit{KB} \land \beta \models \alpha$.

Circumscription can be seen as a more powerful and precise version of the closed-world
assumption. The idea is to specify particular predicates that are assumed to be “as false as possible”, that is, false for every object except those for which they are known to be true.
For example, suppose we want to assert the default rule that birds fly. We would introduce a predicate, say $\mathit{Abnormal}_1(x)$, and write
$\mathit{Bird}(x) \land \lnot\mathit{Abnormal}_1(x) \Rightarrow \mathit{Flies}(x)$.
If we say that $\mathit{Abnormal}_1$ is to be circumscribed, a circumscriptive reasoner is entitled to
assume $\lnot\mathit{Abnormal}_1(x)$ unless $\mathit{Abnormal}_1(x)$ is known to be true. This allows the conclusion
$\mathit{Flies}(\mathit{Tweety})$ to be drawn from the premise $\mathit{Bird}(\mathit{Tweety})$, but the conclusion no longer holds if $\mathit{Abnormal}_1(\mathit{Tweety})$ is asserted.
Circumscription can be viewed as an example of a model preference logic. In such
logics, a sentence is entailed (with default status) if it is true in all preferred models of the KB,
as opposed to the requirement of truth in all models in classical logic. For circumscription,
one model is preferred to another if it has fewer abnormal objects.⁹ Let us see how this idea
works in the context of multiple inheritance in semantic networks. The standard example for
which multiple inheritance is problematic is called the “Nixon diamond.” It arises from the observation that Richard Nixon was both a Quaker (and hence by default a pacifist) and a
Republican (and hence by default not a pacifist). We can write this as follows:
$\mathit{Republican}(\mathit{Nixon}) \land \mathit{Quaker}(\mathit{Nixon})$.
$\mathit{Republican}(x) \land \lnot\mathit{Abnormal}_2(x) \Rightarrow \lnot\mathit{Pacifist}(x)$.
$\mathit{Quaker}(x) \land \lnot\mathit{Abnormal}_3(x) \Rightarrow \mathit{Pacifist}(x)$.
If we circumscribe $\mathit{Abnormal}_2$ and $\mathit{Abnormal}_3$, there are two preferred models: one in which
$\mathit{Abnormal}_2(\mathit{Nixon})$ and $\mathit{Pacifist}(\mathit{Nixon})$ are true and one in which $\mathit{Abnormal}_3(\mathit{Nixon})$ and $\lnot\mathit{Pacifist}(\mathit{Nixon})$ are true. Thus, the circumscriptive reasoner remains properly agnostic as
to whether Nixon was a pacifist. If we wish, in addition, to assert that religious beliefs take
precedence over political beliefs, we can use a formalism called prioritized circumscription to give preference to models where $\mathit{Abnormal}_3$ is minimized.
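The two preferred models can be found by brute force. The sketch below enumerates truth assignments for Nixon over three propositional atoms, where `Ab2` and `Ab3` stand for the Republican and Quaker abnormality predicates applied to Nixon. It treats "fewer abnormal objects" as subset-minimality of the abnormal atoms. All names are illustrative, and the prioritized variant is one simple lexicographic reading of prioritized circumscription.

```python
from itertools import product

def models():
    """All truth assignments for Nixon that satisfy the two default axioms."""
    for ab2, ab3, pacifist in product([False, True], repeat=3):
        republican_rule = ab2 or not pacifist   # ¬Ab2 ⇒ ¬Pacifist
        quaker_rule = ab3 or pacifist           # ¬Ab3 ⇒ Pacifist
        if republican_rule and quaker_rule:
            yield {"Ab2": ab2, "Ab3": ab3, "Pacifist": pacifist}

def abnormal(model):
    """The set of abnormality atoms that are true in a model."""
    return {k for k in ("Ab2", "Ab3") if model[k]}

def preferred(ms):
    """Models whose abnormal set is not a strict superset of another model's."""
    ms = list(ms)
    return [m for m in ms if not any(abnormal(o) < abnormal(m) for o in ms)]

def prioritized(ms, order=("Ab3", "Ab2")):
    """Minimize the abnormality atoms lexicographically; the first has priority."""
    ms = list(ms)
    best = min(tuple(m[k] for k in order) for m in ms)
    return [m for m in ms if tuple(m[k] for k in order) == best]
```

`preferred(models())` yields one model with `Pacifist` true and one with it false, so neither conclusion is entailed. `prioritized(models())` minimizes `Ab3` first and leaves only the model in which Nixon is a pacifist.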
Default logic is a formalism in which default rules can be written to generate contingent,
nonmonotonic conclusions. A default rule looks like this:
$\mathit{Bird}(x) : \mathit{Flies}(x) / \mathit{Flies}(x)$.
This rule means that if $\mathit{Bird}(x)$ is true, and if $\mathit{Flies}(x)$ is consistent with the knowledge base, then $\mathit{Flies}(x)$ may be concluded by default. In general, a default rule has the form
$P : J_1, \ldots, J_n / C$
where $P$ is called the prerequisite, $C$ is the conclusion, and $J_i$ are the justifications; if any one
of them can be proven false, then the conclusion cannot be drawn. Any variable that appears
in $J_i$ or $C$ must also appear in $P$. The Nixon-diamond example can be represented in default logic with one fact and two default rules:
$\mathit{Republican}(\mathit{Nixon}) \land \mathit{Quaker}(\mathit{Nixon})$.
$\mathit{Republican}(x) : \lnot\mathit{Pacifist}(x) / \lnot\mathit{Pacifist}(x)$.
$\mathit{Quaker}(x) : \mathit{Pacifist}(x) / \mathit{Pacifist}(x)$.

⁹ For the closed-world assumption, one model is preferred to another if it has fewer true atoms; that is, preferred models are minimal models. There is a natural connection between the closed-world assumption and definite-clause KBs, because the fixed point reached by forward chaining on definite-clause KBs is the unique minimal model. See page 231 for more on this.
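The reading of a default rule given above (prerequisite true, justification consistent, then conclude) can be tried out on the Nixon diamond. This is a sketch under strong assumptions: beliefs are ground literals written as strings, `~` marks negation, consistency is checked only against the complementary literal, and each default has a single justification equal to its conclusion. The function names are invented for the example.

```python
from itertools import permutations

def negate(literal):
    """Return the complementary literal, using a leading '~' for negation."""
    return literal[1:] if literal.startswith("~") else "~" + literal

# Each default is (prerequisite, justification, conclusion) over ground literals.
defaults = [
    ("Republican(Nixon)", "~Pacifist(Nixon)", "~Pacifist(Nixon)"),
    ("Quaker(Nixon)", "Pacifist(Nixon)", "Pacifist(Nixon)"),
]
facts = {"Republican(Nixon)", "Quaker(Nixon)"}

def apply_in_order(facts, ordered_defaults):
    """Fire each default whose prerequisite holds and whose justification is consistent."""
    beliefs = set(facts)
    changed = True
    while changed:
        changed = False
        for pre, just, concl in ordered_defaults:
            if pre in beliefs and negate(just) not in beliefs and concl not in beliefs:
                beliefs.add(concl)
                changed = True
    return frozenset(beliefs)

def candidate_belief_sets(facts, defaults):
    """Collect the distinct belief sets reached by trying every firing order."""
    return {apply_in_order(facts, order) for order in permutations(defaults)}
```

Whichever default fires first blocks the other, so `candidate_belief_sets(facts, defaults)` contains two belief sets, one with `Pacifist(Nixon)` and one with `~Pacifist(Nixon)`.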
To interpret what the default rules mean, we define the notion of an extension of a default
theory to be a maximal set of consequences of the theory. That is, an extension S consists
of the original known facts and a set of conclusions from the default rules, such that no additional conclusions can be drawn from S, and the justifications of every default conclusion
in S are consistent with S. As in the case of the preferred models in circumscription, we have
two possible extensions for the Nixon diamond: one wherein he is a pacifist and one wherein he is not. Prioritized schemes exist in which some default rules can be given precedence over
others, allowing some ambiguities to be resolved. Since 1980, when nonmonotonic logics were first proposed, a great deal of progress
has been made in understanding their mathematical properties. There are still unresolved questions, however. For example, if “Cars have four wheels” is false, what does it mean to have it in one’s knowledge base? What is a good set of default rules to have? If we cannot
decide, for each rule separately, whether it belongs in our knowledge base, then we have a serious problem of nonmodularity.
Finally, how can beliefs that have default status be used
to make decisions? This is probably the hardest issue for default reasoning.
Decisions often involve tradeoffs, and one therefore needs to compare the strengths of belief
in the outcomes of different actions, and the costs of making a wrong decision. In cases
where the same kinds of decisions are being made repeatedly, it is possible to interpret default rules as “threshold probability” statements. For example, the default rule “My brakes are always OK” really means “The probability that my brakes are OK, given no other information, is sufficiently high that the optimal decision is for me to drive without checking them.” When
the decision context changes—for example, when one is driving a heavily laden truck down a steep mountain road—the default rule suddenly becomes inappropriate, even though there is no new evidence of faulty brakes. These considerations have led researchers to consider how
to embed default reasoning within probability theory or utility theory.
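The "threshold probability" reading of the brakes example can be made concrete with a simple expected-cost comparison. The model below is an assumption made for illustration: checking has a fixed cost and always catches a fault, and driving with faulty brakes incurs a single failure cost. The numbers are invented.

```python
def should_check_brakes(p_ok, cost_check, cost_failure):
    """Check when the expected loss of driving unchecked exceeds the cost of checking."""
    expected_loss_unchecked = (1 - p_ok) * cost_failure
    return expected_loss_unchecked > cost_check

# Same belief about the brakes, two decision contexts (hypothetical costs).
everyday_drive = should_check_brakes(p_ok=0.9999, cost_check=1, cost_failure=1_000)
laden_truck = should_check_brakes(p_ok=0.9999, cost_check=1, cost_failure=100_000)
```

With these numbers the default "my brakes are OK" is acted on in the everyday case but not in the truck case, even though `p_ok` is unchanged. Under this model the threshold is `p_ok > 1 - cost_check / cost_failure`, so it moves with the stakes rather than with the evidence.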
10.6.2 Truth maintenance systems
We have seen that many of the inferences drawn by a knowledge representation system will
have only default status, rather than being absolutely certain. Inevitably, some of these inferred facts will turn out to be wrong and will have to be retracted in the face of new information.
This process is called belief revision.¹⁰ Suppose that a knowledge base KB contains a sentence P (perhaps a default conclusion recorded by a forward-chaining algorithm, or perhaps just an incorrect assertion) and we want to execute TELL(KB, $\lnot P$).
To avoid creating a contradiction, we must first execute RETRACT(KB, P).
This sounds easy enough.
Problems arise, however, if any additional sentences were inferred from P and asserted in the KB. For example, the implication $P \Rightarrow Q$ might have been used to add Q. The obvious “solution” (retracting all sentences inferred from P) fails because such sentences may have other justifications besides P. For example, if R and $R \Rightarrow Q$ are also in the KB, then Q does not have to be removed after all. Truth maintenance systems, or TMSs, are designed to
handle exactly these kinds of complications.

¹⁰ Belief revision is often contrasted with belief update, which occurs when a knowledge base is revised to reflect a change in the world rather than new information about a fixed world. Belief update combines belief revision with reasoning about time and change; it is also related to the process of filtering described in Chapter 14.

One simple approach to truth maintenance is to keep track of the order in which sentences are told to the knowledge
base by numbering them from $P_1$ to $P_n$. When the call RETRACT(KB, $P_i$) is made, the system reverts to the state just before $P_i$ was added, thereby
removing both $P_i$ and any inferences that were derived from $P_i$. The sentences $P_{i+1}$ through $P_n$ can then be added again. This is simple, and it guarantees that the knowledge base will be consistent, but retracting $P_i$ requires retracting and reasserting $n - i$ sentences as well as undoing and redoing all the inferences drawn from those sentences. For systems to which many facts are being added, such as large commercial databases, this is impractical.
A more efficient approach is the justification-based truth maintenance system, or JTMS.
In a JTMS, each sentence in the knowledge base is annotated with a justification consisting
of the set of sentences from which it was inferred. For example, if the knowledge base already contains $P \Rightarrow Q$, then TELL(P) will cause Q to be added with the justification $\{P, P \Rightarrow Q\}$.
In general, a sentence can have any number of justifications. Justifications make retraction efficient. Given the call RETRACT(P), the JTMS will delete exactly those sentences for which P is a member of every justification. So, if a sentence Q had the single justification $\{P, P \Rightarrow Q\}$,
it would be removed; if it had the additional justification $\{P, P \lor R \Rightarrow Q\}$, it
would still be removed; but if it also had the justification $\{R, P \lor R \Rightarrow Q\}$, then it would
be spared. In this way, the time required for retraction of P depends only on the number of sentences derived from P rather than on the number of sentences added after P.
The JTMS assumes that sentences that are considered once will probably be considered
again, so rather than deleting a sentence from the knowledge base entirely when it loses
all justifications, we merely mark the sentence as being out of the knowledge base. If a subsequent assertion restores one of the justifications, then we mark the sentence as being back in.
In this way, the JTMS
retains all the inference chains that it uses and need not
rederive sentences when a justification becomes valid again.
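The in/out bookkeeping described above can be sketched with a toy class. This is an illustrative assumption, not a production JTMS: sentences are plain strings, the class and method names are invented, and labels are recomputed from scratch on each query, whereas a real system would update them incrementally. Justifications are never deleted, so a retracted sentence is only marked out.

```python
class JTMS:
    """Toy justification-based TMS: told sentences are premises; derived
    sentences are 'in' only while at least one justification is fully 'in'."""

    def __init__(self):
        self.premises = set()
        self.justifications = {}    # sentence -> list of frozensets of supporters

    def tell(self, sentence):
        """Add a sentence as a premise."""
        self.premises.add(sentence)

    def justify(self, sentence, support):
        """Record that `sentence` was inferred from the sentences in `support`."""
        self.justifications.setdefault(sentence, []).append(frozenset(support))

    def retract(self, sentence):
        """Withdraw a premise; recorded justifications are kept for later reuse."""
        self.premises.discard(sentence)

    def in_set(self):
        """Return every sentence currently labeled 'in'."""
        current = set(self.premises)
        changed = True
        while changed:
            changed = False
            for sentence, justs in self.justifications.items():
                if sentence not in current and any(j <= current for j in justs):
                    current.add(sentence)
                    changed = True
        return current
```

Suppose Q is recorded with the two justifications `{"P", "P => Q"}` and `{"R", "P v R => Q"}`. Retracting P alone leaves Q in. Retracting R as well marks Q out. Telling P again brings Q back in without any new inference step, because the stored justification becomes valid again.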
In addition to handling the retraction of incorrect information, TMSs
can be used to
speed up the analysis of multiple hypothetical situations. Suppose, for example, that the Romanian Olympic Committee is choosing sites for the swimming, athletics, and equestrian
events at the 2048 Games to be held in Romania.
For example, let the first hypothesis be
Site(Swimming, Pitesti), Site(Athletics, Bucharest), and Site(Equestrian,Arad). A great deal of reasoning must then be done to work out the logistical consequences and hence
the desirability of this selection.
If we want to consider Site(Athletics, Sibiu)
instead, the TMS avoids the need to start again from scratch. Instead, we simply retract Site(Athletics, Bucharest) and assert Site(Athletics, Sibiu) and the TMS takes care of the necessary revisions. Inference chains generated from the choice of Bucharest can be reused with
Sibiu, provided that the conclusions are the same.
An assumption-based truth maintenance system, or ATMS, makes this type of context
switching between hypothetical worlds particularly efficient. In a JTMS, the maintenance of
justifications allows you to move quickly from one state to another by making a few retractions
and assertions, but at any time only one state is represented. An ATMS represents all the states that have ever been considered at the same time. Whereas a JTMS simply labels each
sentence as being in or out, an ATMS keeps track, for each sentence, of which assumptions
would cause the sentence to be true. In other words, each sentence has a label that consists of
a set of assumption sets. The sentence is true just in those cases in which all the assumptions
in one of the assumption sets are true.
Truth maintenance systems also provide a mechanism for generating explanations. Technically,
an explanation of a sentence P is a set of sentences E such that E entails P. If the
sentences in E are already known to be true, then E simply provides a sufficient basis for proving that P must be the case. But explanations can also include assumptions: sentences
that are not known to be true, but would suffice to prove P if they were true. For example, if your car won’t start, you probably don’t have enough information to definitively prove the
reason for the problem. But a reasonable explanation might include the assumption that the
battery is dead. This, combined with knowledge of how cars operate, explains the observed nonbehavior.
In most cases, we will prefer an explanation E that is minimal, meaning that
there is no proper subset of E that is also an explanation. An ATMS can generate explanations for the “car won’t start” problem by making assumptions (such as “no gas in car” or “battery
dead”) in any order we like, even if some assumptions are contradictory. Then we look at the
label for the sentence “car won’t start” to read off the sets of assumptions that would justify the sentence.
The exact algorithms used to implement truth maintenance systems are a little complicated,
and we do not cover them here. The computational complexity of the truth maintenance problem is at least as great as that of propositional inference, that is, NP-hard. Therefore,
you should not expect truth maintenance to be a panacea. When used carefully, however, a TMS can provide a substantial increase in the ability of a logical system to handle complex environments and hypotheses.
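Although the real algorithms are beyond this section, the data structure behind an ATMS label is easy to sketch. The label contents below are written by hand for the "car won't start" example, since computing labels is the part an actual ATMS does and this sketch omits it. The function names are assumptions.

```python
# Label of a sentence: a set of assumption sets (hand-written, illustrative).
labels = {
    "car won't start": {
        frozenset({"battery dead"}),
        frozenset({"no gas in car"}),
        frozenset({"battery dead", "no gas in car"}),
    },
}

def minimal(environments):
    """Keep only assumption sets with no proper subset also in the label."""
    return {e for e in environments if not any(o < e for o in environments)}

def holds(sentence, assumptions):
    """True if all assumptions in at least one of the sentence's sets are true."""
    return any(env <= set(assumptions) for env in labels.get(sentence, ()))
```

`minimal(labels["car won't start"])` drops the redundant two-assumption set and leaves the two minimal explanations. `holds` can then test the sentence against any hypothetical world without retracting or reasserting anything.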
Summary
By delving into the details of how one represents a variety of knowledge, we hope we have given the reader a sense of how real knowledge bases are constructed and a feeling for the interesting philosophical issues that arise. The major points are as follows:
• Large-scale knowledge representation requires a general-purpose ontology to organize and tie together the various specific domains of knowledge.
• A general-purpose ontology needs to cover a wide variety of knowledge and should be capable, in principle, of handling any domain.
• Building a large, general-purpose ontology is a significant challenge that has yet to be fully realized, although current frameworks seem to be quite robust.
• We presented an upper ontology based on categories and the event calculus. We covered categories, subcategories, parts, structured objects, measurements, substances, events, time and space, change, and beliefs.
• Natural kinds cannot be defined completely in logic, but properties of natural kinds can be represented.
• Actions, events, and time can be represented with the event calculus. Such representations enable an agent to construct sequences of actions and make logical inferences about what will be true when these actions happen.
• Special-purpose representation systems, such as semantic networks and description logics, have been devised to help in organizing a hierarchy of categories. Inheritance is an important form of inference, allowing the properties of objects to be deduced from their membership in categories.
• The closed-world assumption, as implemented in logic programs, provides a simple way to avoid having to specify lots of negative information. It is best interpreted as a default that can be overridden by additional information.
• Nonmonotonic logics, such as circumscription and default logic, are intended to capture default reasoning in general.
• Truth maintenance systems handle knowledge updates and revisions efficiently.
• It is difficult to construct large ontologies by hand; extracting knowledge from text makes the job easier.

Bibliographical and Historical Notes

Briggs (1985) claims that knowledge representation research began with first millennium BCE
Indian theorizing about the grammar of Shastric Sanskrit. Western philosophers trace their work on the subject back to c. 300 BCE in Aristotle’s Metaphysics (literally, what comes after
the book on physics). The development of technical terminology in any field can be regarded as a form of knowledge representation.
Early discussions of representation in AI tended to focus on “problem representation”
rather than “knowledge representation.” (See, for example, Amarel’s (1968) discussion of the “Missionaries and Cannibals” problem.) In the 1970s, AI emphasized the development of
“expert systems” (also called “knowledge-based systems”) that could, if given the appropriate domain knowledge, match or exceed the performance of human experts on narrowly defined tasks. For example, the first expert system, DENDRAL (Feigenbaum et al., 1971; Lindsay
et al., 1980), interpreted the output of a mass spectrometer (a type of instrument used to analyze
the structure of organic chemical compounds) as accurately as expert chemists. Although
the success of DENDRAL was instrumental in convincing the AI research community of the importance of knowledge representation, the representational formalisms used in DENDRAL
are highly specific to the domain of chemistry.
Over time, researchers became interested in standardized knowledge representation formalisms
and ontologies that could assist in the creation of new expert systems. This brought
them into territory previously explored by philosophers of science and of language. The discipline imposed in AI by the need for one’s theories to “work” has led to more rapid and deeper progress than when these problems were the exclusive domain of philosophy (although it has at times also led to the repeated reinvention of the wheel).
But to what extent can we trust expert knowledge? As far back as 1955, Paul Meehl
(see also Grove and Meehl, 1996) studied the decision-making processes of trained experts at subjective tasks such as predicting the success of a student in a training program or the recidivism of a criminal. In 19 out of the 20 studies he looked at, Meehl found that simple
statistical learning algorithms (such as linear regression or naive Bayes) predict better than the experts. Tetlock (2017) also studies expert knowledge and finds it lacking in difficult cases. The Educational Testing Service has used an automated program to grade millions of essay questions on the GMAT exam since 1999. The program agrees with human graders 97%
of the time, about the same level that two human graders agree (Burstein et al., 2001).
(This does not mean the program understands essays, just that it can distinguish good ones from bad ones about as well as human graders can.)
The creation of comprehensive taxonomies or classifications dates back to ancient times.
Aristotle (384–322 BCE) strongly emphasized classification and categorization schemes. His
Organon, a collection of works on logic assembled by his students after his death, included a
treatise called Categories in which he attempted to construct what we would now call an upper ontology. He also introduced the notions of genus and species for lower-level classification. Our present system of biological classification, including the use of “binomial nomenclature” (classification via genus and species in the technical sense), was invented by the Swedish biologist Carolus Linnaeus, or Carl von Linné (1707–1778). The problems associated with
natural kinds and inexact category boundaries have been addressed by Wittgenstein (1953), Quine (1953), Lakoff (1987), and Schwartz (1977), among others.
See Chapter 24 for a discussion of deep neural network representations of words and concepts that escape some of the problems of a strict ontology, but also sacrifice some of the precision.
We still don’t know the best way to combine the advantages of neural networks
and logical semantics for representation.
Interest in larger-scale ontologies is increasing, as documented by the Handbook on Ontologies (Staab, 2004). The OPENCYC project (Lenat and Guha, 1990; Matuszek et al.,
2006) has released a 150,000-concept ontology, with an upper ontology similar to the one in Figure 10.1 as well as specific concepts like “OLED Display” and “iPhone,” which is a type of “cellular phone,” which in turn is a type of “consumer electronics,” “phone,” “wireless communication device,” and other concepts. The NEXTKB project extends CYC and other resources including FrameNet and WordNet into a knowledge base with almost 3 million
facts, and provides a reasoning engine, FIRE, to go with it (Forbus et al., 2010).
The DBPEDIA project extracts structured data from Wikipedia, specifically from Infoboxes:
the attribute/value pairs that accompany many Wikipedia articles (Wu and Weld, 2008; Bizer et al., 2007). As of 2015, DBPEDIA contained 400 million facts about 4 million
objects in the English version alone; counting all 110 languages yields 1.5 billion facts (Lehmann et al., 2015).
The IEEE working group P1600.1 created SUMO, the Suggested Upper Merged Ontology (Niles and Pease, 2001; Pease and Niles, 2002), with about 1000 terms in the upper ontology and links to over 20,000 domain-specific terms. Stoffel et al. (1997) describe algorithms for efficiently managing a very large ontology. A survey of techniques for extracting knowledge from Web pages is given by Etzioni et al. (2008).
On the Web, representation languages are emerging. RDF (Brickley and Guha, 2004)
allows for assertions to be made in the form of relational triples and provides some means for
evolving the meaning of names over time. OWL (Smith et al., 2004) is a description logic that supports inferences over these triples. So far, usage seems to be inversely proportional
to representational complexity: the traditional HTML and CSS formats account for over 99% of Web content, followed by the simplest representation schemes, such as RDFa (Adida and Birbeck, 2008), and microformats (Khare, 2006; Patel-Schneider, 2014) which use HTML
and XHTML markup to add attributes to text on web pages. Usage of sophisticated RDF and OWL ontologies is not yet widespread, and the full vision of the Semantic Web (Berners-Lee et al., 2001) has not been realized. The conferences on Formal Ontology in Information
Systems (FOIS) cover both general and domain-specific ontologies.
The taxonomy used in this chapter was developed by the authors and is based in part
on their experience in the CYC project and in part on work by Hwang and Schubert (1993)
and Davis (1990, 2005). An inspirational discussion of the general project of commonsense
knowledge representation appears in Hayes’s (1978, 1985b) “Naive Physics Manifesto.”
Successful deep ontologies within a specific field include the Gene Ontology project (Gene Ontology Consortium, 2008) and the Chemical Markup Language (Murray-Rust et al., 2003). Doubts about the feasibility of a single ontology for all knowledge are expressed by Doctorow (2001), Gruber (2004), Halevy et al. (2009), and Smith (2004).
The event calculus was introduced by Kowalski and Sergot (1986) to handle continuous time, and there have been several variations (Sadri and Kowalski, 1995; Shanahan, 1997) and overviews (Shanahan, 1999; Mueller, 2006). James Allen introduced time intervals for the same reason (Allen, 1984), arguing that intervals were much more natural than situations for reasoning about extended and concurrent events. In van Lambalgen and Hamm (2005) we see how the logic of events maps onto the language we use to talk about events. An alternative to the event and situation calculi is the fluent calculus (Thielscher, 1999), which reifies the facts
out of which states are composed.
Peter Ladkin (1986a, 1986b) introduced “concave” time intervals (intervals with gaps—
essentially, unions of ordinary “convex” time intervals) and applied the techniques of mathematical abstract algebra to time representation. Allen (1991) systematically investigates the wide variety of techniques available for time representation; van Beek and Manchak (1996)
analyze algorithms for temporal reasoning. There are significant commonalities between the
eventbased ontology given in this chapter and an analysis of events due to the philosopher
Donald Davidson (1980). The histories in Pat Hayes’s (1985a) ontology of liquids and the chronicles in McDermott’s (1985) theory of plans were also important influences on the field
and on this chapter.
The question of the ontological status of substances has a long history. Plato proposed
that substances were abstract entities entirely distinct from physical objects; he would say $\mathit{MadeOf}(\mathit{Butter}_3, \mathit{Butter})$ rather than $\mathit{Butter}_3 \in \mathit{Butter}$. This leads to a substance hierarchy in which, for example, UnsaltedButter is a more specific substance than Butter.
The position
adopted in this chapter, in which substances are categories of objects, was championed by
Richard Montague (1973). It has also been adopted in the CYC project. Copeland (1993) mounts a serious, but not invincible, attack.
The alternative approach mentioned in the chapter, in which butter is one object consisting of all buttery objects in the universe, was proposed originally by the Polish logician Leśniewski (1916). His mereology (the name is derived from the Greek word for “part”) used the part-whole relation as a substitute for mathematical set theory, with the aim of eliminating abstract entities such as sets. A more readable exposition of these ideas is given by Leonard and Goodman (1940), and Goodman’s The Structure of Appearance (1977) applies the ideas to various problems in knowledge representation. While some aspects of the mereological approach are awkward (for example, the need for a separate inheritance mechanism based on part-whole relations), the approach gained the support of Quine (1960). Harry Bunt (1985) has provided an extensive analysis of its use in knowledge representation. Casati and Varzi (1999) cover parts, wholes, and a general theory of spatial locations.
There are three main approaches to the study of mental objects. The one taken in this chapter, based on modal logic and possible worlds, is the classical approach from philosophy (Hintikka, 1962; Kripke, 1963; Hughes and Cresswell, 1996). The book Reasoning about Knowledge (Fagin et al., 1995) provides a thorough introduction, and Gordon and Hobbs (2017) provide A Formal Theory of Commonsense Psychology.
The second approach is a first-order theory in which mental objects are fluents. Davis (2005) and Davis and Morgenstern (2005) describe this approach. It relies on the possible-worlds formalism, and builds on work by Robert Moore (1980, 1985).
The third approach is a syntactic theory, in which mental objects are represented by character strings. A string is just a complex term denoting a list of symbols, so CanFly(Clark) can be represented by the list of symbols [C,a,n,F,l,y,(,C,l,a,r,k,)]. The syntactic theory of mental objects was first studied in depth by Kaplan and Montague (1960), who showed that it led to paradoxes if not handled carefully. Ernie Davis (1990) provides an excellent comparison of the syntactic and modal theories of knowledge.
Pnueli (1977) describes a temporal logic used to reason about programs, work that won him the Turing Award and which was expanded upon by Vardi (1996). Littman et al. (2017) show that a temporal logic can be a good language for specifying goals to a reinforcement learning robot in a way that is easy for a human to specify, and generalizes well to different environments.
The Greek philosopher Porphyry (c. 234–305 CE), commenting on Aristotle’s Categories, drew what might qualify as the first semantic network. Charles S. Peirce (1909) developed existential graphs as the first semantic network formalism using modern logic. Ross Quillian (1961), driven by an interest in human memory and language processing, initiated work on semantic networks within AI. An influential paper by Marvin Minsky (1975) presented a version of semantic networks called frames; a frame was a representation of an object or category, with attributes and relations to other objects or categories.
The question of semantics arose quite acutely with respect to Quillian’s semantic networks (and those of others who followed his approach), with their ubiquitous and very vague “IS-A links.” Bill Woods’s (1975) famous article “What’s In a Link?” drew the attention of AI researchers to the need for precise semantics in knowledge representation formalisms. Ron Brachman (1979) elaborated on this point and proposed solutions. Patrick Hayes’s (1979) “The Logic of Frames” cut even deeper, claiming that “Most of ‘frames’ is just a new syntax for parts of first-order logic.” Drew McDermott’s (1978b) “Tarskian Semantics, or, No Notation without Denotation!” argued that the model-theoretic approach to semantics used in first-order logic should be applied to all knowledge representation formalisms. This remains a controversial idea; notably, McDermott himself has reversed his position in “A Critique of Pure Reason” (McDermott, 1987). Selman and Levesque (1993) discuss the complexity of inheritance with exceptions, showing that in most formulations it is NP-complete.
Description logics were developed as a useful subset of first-order logic for which inference is computationally tractable. Hector Levesque and Ron Brachman (1987) showed that certain uses of disjunction and negation were primarily responsible for the intractability of logical inference. This led to a better understanding of the interaction between complexity and expressiveness in reasoning systems. Calvanese et al. (1999) summarize the state of the art, and Baader et al. (2007) present a comprehensive handbook of description logic.
The three main formalisms for dealing with nonmonotonic inference are circumscription (McCarthy, 1980), default logic (Reiter, 1980), and modal nonmonotonic logic (McDermott and Doyle, 1980); all three were introduced in one special issue of the AI Journal. Delgrande and Schaub (2003) discuss the merits of the variants, given 25 years of hindsight. Answer set programming can be seen as an extension of negation as failure or as a refinement of circumscription; the underlying theory of stable model semantics was introduced by Gelfond and Lifschitz (1988), and the leading answer set programming systems are DLV (Eiter et al., 1998) and SMODELS (Niemelä et al., 2000). Lifschitz (2001) discusses the use of answer set programming for planning. Brewka et al. (1997) give a good overview of the various approaches to nonmonotonic logic. Clark (1978) covers the negation-as-failure approach to logic programming and Clark completion. A variety of nonmonotonic reasoning systems based on logic programming are documented in the proceedings of the conferences on Logic Programming and Nonmonotonic Reasoning (LPNMR).
The study of truth maintenance systems began with the TMS (Doyle, 1979) and RUP (McAllester, 1980) systems, both of which were essentially JTMSs. Forbus and de Kleer (1993) explain in depth how TMSs can be used in AI applications. Nayak and Williams (1997) show how an efficient incremental TMS called an ITMS makes it feasible to plan the operations of a NASA spacecraft in real time.
This chapter could not cover every area of knowledge representation in depth. The three principal topics omitted are the following:
Qualitative physics: Qualitative physics is a subfield of knowledge representation concerned specifically with constructing a logical, nonnumeric theory of physical objects and processes. The term was coined by Johan de Kleer (1975), although the enterprise could be said to have started in Fahlman’s (1974) BUILD, a sophisticated planner for constructing complex towers of blocks. Fahlman discovered in the process of designing it that most of the effort (80%, by his estimate) went into modeling the physics of the blocks world to calculate the stability of various subassemblies of blocks, rather than into planning per se. He sketches a hypothetical naive-physics-like process to explain why young children can solve BUILD-like problems without access to the high-speed floating-point arithmetic used in BUILD’s physical modeling. Hayes (1985a) uses “histories” (four-dimensional slices of space-time similar to Davidson’s events) to construct a fairly complex naive physics of liquids. Davis (2008) gives an update to the ontology of liquids that describes the pouring of liquids into containers.
De Kleer and Brown (1985), Ken Forbus (1985), and Benjamin Kuipers (1985) independently and almost simultaneously developed systems that can reason about a physical system based on qualitative abstractions of the underlying equations. Qualitative physics soon developed to the point where it became possible to analyze an impressive variety of complex physical systems (Yip, 1991). Qualitative techniques have been used to construct novel designs for clocks, windshield wipers, and six-legged walkers (Subramanian and Wang, 1994). The collection Readings in Qualitative Reasoning about Physical Systems (Weld and de Kleer, 1990), an encyclopedia article by Kuipers (2001), and a handbook article by Davis (2007) provide good introductions to the field.
Spatial reasoning: The reasoning necessary to navigate in the wumpus world is trivial in comparison to the rich spatial structure of the real world. The earliest serious attempt to capture commonsense reasoning about space appears in the work of Ernest Davis (1986, 1990). The region connection calculus of Cohn et al. (1997) supports a form of qualitative spatial reasoning and has led to new kinds of geographical information systems; see also (Davis, 2006). As with qualitative physics, an agent can go a long way, so to speak, without resorting to a full metric representation.
Psychological reasoning: Psychological reasoning involves the development of a working psychology for artificial agents to use in reasoning about themselves and other agents. This is often based on so-called folk psychology, the theory that humans in general are believed to use in reasoning about themselves and other humans. When AI researchers provide their artificial agents with psychological theories for reasoning about other agents, the theories are frequently based on the researchers’ description of the logical agents’ own design. Psychological reasoning is currently most useful within the context of natural language understanding, where divining the speaker’s intentions is of paramount importance.
Minker (2001) collects papers by leading researchers in knowledge representation, summarizing 40 years of work in the field. The proceedings of the international conferences on Principles of Knowledge Representation and Reasoning provide the most up-to-date sources for work in this area. Readings in Knowledge Representation (Brachman and Levesque, 1985) and Formal Theories of the Commonsense World (Hobbs and Moore, 1985) are excellent anthologies on knowledge representation; the former focuses more on historically important papers in representation languages and formalisms, the latter on the accumulation of the knowledge itself. Davis (1990), Stefik (1995), and Sowa (1999) provide textbook introductions to knowledge representation, van Harmelen et al. (2007) contributes a handbook, and Davis and Morgenstern (2004) edited a special issue of the AI Journal on the topic. Davis (2017) gives a survey of logic for commonsense reasoning. The biennial conference on Theoretical Aspects of Reasoning About Knowledge (TARK) covers applications of the theory of knowledge in AI, economics, and distributed systems.
CHAPTER 11
AUTOMATED PLANNING
In which we see how an agent can take advantage of the structure of a problem to efficiently construct complex plans of action.
Planning a course of action is a key requirement for an intelligent agent. The right representation for actions and states and the right algorithms can make this easier. In Section 11.1 we introduce a general factored representation language for planning problems that can naturally and succinctly represent a wide variety of domains, can efficiently scale up to large problems, and does not require ad hoc heuristics for a new domain. Section 11.4 extends the representation language to allow for hierarchical actions, allowing us to tackle more complex problems. We cover efficient algorithms for planning in Section 11.2, and heuristics for them in Section 11.3. In Section 11.5 we account for partially observable and nondeterministic domains, and in Section 11.6 we extend the language once again to cover scheduling problems with resource constraints. This gets us closer to planners that are used in the real world for planning and scheduling the operations of spacecraft, factories, and military campaigns. Section 11.7 analyzes the effectiveness of these techniques.
11.1 Definition of Classical Planning
Classical planning is defined as the task of finding a sequence of actions to accomplish a goal in a discrete, deterministic, static, fully observable environment. We have seen two approaches to this task: the problem-solving agent of Chapter 3 and the hybrid propositional logical agent of Chapter 7. Both share two limitations. First, they both require ad hoc heuristics for each new domain: a heuristic evaluation function for search, and handwritten code for the hybrid wumpus agent. Second, they both need to explicitly represent an exponentially large state space. For example, in the propositional logic model of the wumpus world, the axiom for moving a step forward had to be repeated for all four agent orientations, T time steps, and n² current locations.
In response to these limitations, planning researchers have invested in a factored representation using a family of languages called PDDL, the Planning Domain Definition Language (Ghallab et al., 1998), which allows us to express all 4Tn² actions with a single action schema, and does not need domain-specific knowledge. Basic PDDL can handle classical planning domains, and extensions can handle nonclassical domains that are continuous, partially observable, concurrent, and multi-agent. The syntax of PDDL is based on Lisp, but we will translate it into a form that matches the notation used in this book.
In PDDL, a state is represented as a conjunction of ground atomic fluents. Recall that “ground” means no variables, “fluent” means an aspect of the world that changes over time, and “ground atomic” means there is a single predicate, and if there are any arguments, they must be constants. For example, Poor ∧ Unknown might represent the state of a hapless agent, and At(Truck₁, Melbourne) ∧ At(Truck₂, Sydney) could represent a state in a package delivery problem. PDDL uses database semantics: the closed-world assumption means that any fluents that are not mentioned are false, and the unique names assumption means that Truck₁ and Truck₂ are distinct.
The following fluents are not allowed in a state: At(x, y) (because it has variables), ¬Poor (because it is a negation), and At(Spouse(Ali), Sydney) (because it uses a function symbol, Spouse). When convenient, we can think of the conjunction of fluents as a set of fluents.
An action schema represents a family of ground actions. For example, here is an action schema for flying a plane from one location to another:
Action(Fly(p, from, to),
  PRECOND: At(p, from) ∧ Plane(p) ∧ Airport(from) ∧ Airport(to)
  EFFECT: ¬At(p, from) ∧ At(p, to))
The schema consists of the action name, a list of all the variables used in the schema, a precondition, and an effect. The precondition and the effect are each conjunctions of literals (positive or negated atomic sentences). We can choose constants to instantiate the variables, yielding a ground (variable-free) action:
Action(Fly(P₁, SFO, JFK),
  PRECOND: At(P₁, SFO) ∧ Plane(P₁) ∧ Airport(SFO) ∧ Airport(JFK)
  EFFECT: ¬At(P₁, SFO) ∧ At(P₁, JFK))
A ground action a is applicable in state s if s entails the precondition of a; that is, if every
positive literal in the precondition is in s and every negated literal is not.
The result of executing applicable action a in state s is defined as a state s’ which is
represented by the set of fluents formed by starting with s, removing the fluents that appear as negative literals in the action’s effects (what we call the delete list or DEL(a)), and adding the fluents that are positive literals in the action’s effects (what we call the add list or ADD(a)):
$$\mathrm{RESULT}(s, a) = (s - \mathrm{DEL}(a)) \cup \mathrm{ADD}(a) \tag{11.1}$$
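The applicability test and Equation (11.1) can be sketched directly with set operations. This is a minimal illustration, assuming that fluents are written as plain strings and that a ground action is a small record with hypothetical field names (`pos_pre`, `neg_pre`, `add`, `delete`); none of these names come from PDDL itself.

```python
from typing import FrozenSet, NamedTuple

State = FrozenSet[str]  # a state is a set of ground fluents (closed-world assumption)

class GroundAction(NamedTuple):
    name: str
    pos_pre: FrozenSet[str]  # positive precondition literals
    neg_pre: FrozenSet[str]  # fluents whose negation appears in the precondition
    add: FrozenSet[str]      # ADD(a): positive effect literals
    delete: FrozenSet[str]   # DEL(a): fluents that appear negated in the effect

def applicable(s: State, a: GroundAction) -> bool:
    """True if every positive precondition is in s and no negated one is."""
    return a.pos_pre <= s and not (a.neg_pre & s)

def result(s: State, a: GroundAction) -> State:
    """RESULT(s, a) = (s - DEL(a)) | ADD(a), defined only when a is applicable."""
    if not applicable(s, a):
        raise ValueError(f"{a.name} is not applicable in this state")
    return (s - a.delete) | a.add

fly = GroundAction(
    name="Fly(P1,SFO,JFK)",
    pos_pre=frozenset({"At(P1,SFO)", "Plane(P1)", "Airport(SFO)", "Airport(JFK)"}),
    neg_pre=frozenset(),
    add=frozenset({"At(P1,JFK)"}),
    delete=frozenset({"At(P1,SFO)"}),
)
s0 = frozenset({"At(P1,SFO)", "Plane(P1)", "Airport(SFO)", "Airport(JFK)"})
s1 = result(s0, fly)
```

After the call, `s1` contains At(P1,JFK) and no longer contains At(P1,SFO), so the same ground action is no longer applicable in `s1`.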
For example, with the action Fly(P₁, SFO, JFK), we would remove the fluent At(P₁, SFO) and add the fluent At(P₁, JFK).
A set of action schemas serves as a definition of a planning domain. A specific problem within the domain is defined with the addition of an initial state and a goal. The initial state is a conjunction of ground fluents (introduced with the keyword Init in Figure 11.1). As with all states, the closed-world assumption is used, which means that any atoms that are not mentioned are false. The goal (introduced with Goal) is just like a precondition: a conjunction of literals (positive or negative) that may contain variables. For example, the goal At(C₁, SFO) ∧ ¬At(C₂, SFO) ∧ At(p, SFO) refers to any state in which cargo C₁ is at SFO but C₂ is not, and in which there is a plane at SFO.
11.1.1 Example domain: Air cargo transport
Figure 11.1 shows an air cargo transport problem involving loading and unloading cargo and flying it from place to place. The problem can be defined with three actions: Load, Unload, and Fly. The actions affect two predicates: In(c, p) means that cargo c is inside plane p, and At(x, a) means that object x (either plane or cargo) is at airport a. Note that some care must be taken to make sure the At predicates are maintained properly. When a plane flies from one airport to another, all the cargo inside the plane goes with it. In first-order logic it would be easy to quantify over all objects that are inside the plane. But PDDL does not have a universal quantifier, so we need a different solution. The approach we use is to say that a piece of cargo ceases to be At anywhere when it is In a plane; the cargo only becomes At the new airport when it is unloaded. So At really means “available for use at a given location.”
Init(At(C₁, SFO) ∧ At(C₂, JFK) ∧ At(P₁, SFO) ∧ At(P₂, JFK)
  ∧ Cargo(C₁) ∧ Cargo(C₂) ∧ Plane(P₁) ∧ Plane(P₂)
  ∧ Airport(JFK) ∧ Airport(SFO))
Goal(At(C₁, JFK) ∧ At(C₂, SFO))
Action(Load(c, p, a),
  PRECOND: At(c, a) ∧ At(p, a) ∧ Cargo(c) ∧ Plane(p) ∧ Airport(a)
  EFFECT: ¬At(c, a) ∧ In(c, p))
Action(Unload(c, p, a),
  PRECOND: In(c, p) ∧ At(p, a) ∧ Cargo(c) ∧ Plane(p) ∧ Airport(a)
  EFFECT: At(c, a) ∧ ¬In(c, p))
Action(Fly(p, from, to),
  PRECOND: At(p, from) ∧ Plane(p) ∧ Airport(from) ∧ Airport(to)
  EFFECT: ¬At(p, from) ∧ At(p, to))
Figure 11.1 A PDDL description of an air cargo transportation planning problem.
The following plan is a solution to the problem:
[Load(C₁, P₁, SFO), Fly(P₁, SFO, JFK), Unload(C₁, P₁, JFK), Load(C₂, P₂, JFK), Fly(P₂, JFK, SFO), Unload(C₂, P₂, SFO)].
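One way to convince ourselves that this plan solves the problem is to execute it step by step. The sketch below is an illustration only: it assumes fluents are strings, encodes each schema of Figure 11.1 as a hypothetical Python function returning a (precondition, add list, delete list) triple, and handles only positive preconditions, which is all this domain needs.

```python
def load(c, p, a):
    pre = {f"At({c},{a})", f"At({p},{a})", f"Cargo({c})", f"Plane({p})", f"Airport({a})"}
    return pre, {f"In({c},{p})"}, {f"At({c},{a})"}

def unload(c, p, a):
    pre = {f"In({c},{p})", f"At({p},{a})", f"Cargo({c})", f"Plane({p})", f"Airport({a})"}
    return pre, {f"At({c},{a})"}, {f"In({c},{p})"}

def fly(p, frm, to):
    pre = {f"At({p},{frm})", f"Plane({p})", f"Airport({frm})", f"Airport({to})"}
    return pre, {f"At({p},{to})"}, {f"At({p},{frm})"}

INIT = {"At(C1,SFO)", "At(C2,JFK)", "At(P1,SFO)", "At(P2,JFK)",
        "Cargo(C1)", "Cargo(C2)", "Plane(P1)", "Plane(P2)",
        "Airport(JFK)", "Airport(SFO)"}
GOAL = {"At(C1,JFK)", "At(C2,SFO)"}

PLAN = [load("C1", "P1", "SFO"), fly("P1", "SFO", "JFK"), unload("C1", "P1", "JFK"),
        load("C2", "P2", "JFK"), fly("P2", "JFK", "SFO"), unload("C2", "P2", "SFO")]

def execute(state, plan):
    """Apply each ground action in turn, failing if a precondition does not hold."""
    state = set(state)
    for step, (pre, add, delete) in enumerate(plan):
        if not pre <= state:
            raise ValueError(f"step {step}: unmet preconditions {pre - state}")
        state = (state - delete) | add
    return state

final = execute(INIT, PLAN)
plan_is_solution = GOAL <= final
```

Note how the cargo is At no airport while it is In a plane: after the first Load, At(C1,SFO) is gone, and At(C1,JFK) appears only after the Unload.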
11.1.2 Example domain: The spare tire problem
Consider the problem of changing a flat tire (Figure 11.2). The goal is to have a good spare tire properly mounted onto the car’s axle, where the initial state has a flat tire on the axle and a good spare tire in the trunk. To keep it simple, our version of the problem is an abstract one, with no sticky lug nuts or other complications. There are just four actions: removing the spare from the trunk, removing the flat tire from the axle, putting the spare on the axle, and leaving the car unattended overnight. We assume that the car is parked in a particularly bad neighborhood, so that the effect of leaving it overnight is that the tires disappear. [Remove(Flat, Axle), Remove(Spare, Trunk), PutOn(Spare, Axle)] is a solution to the problem.
11.1.3 Example domain: The blocks world
One of the most famous planning domains is the blocks world. This domain consists of a set of cube-shaped blocks sitting on an arbitrarily large table.¹ The blocks can be stacked, but only one block can fit directly on top of another. A robot arm can pick up a block and move it to another position, either on the table or on top of another block. The arm can pick up only one block at a time, so it cannot pick up a block that has another one on top of it. A typical goal is to get block A on B and block B on C (see Figure 11.3).
¹ The blocks world commonly used in planning research is much simpler than SHRDLU’s version (page 20).
Init(Tire(Flat) ∧ Tire(Spare) ∧ At(Flat, Axle) ∧ At(Spare, Trunk))
Goal(At(Spare, Axle))
Action(Remove(obj, loc),
  PRECOND: At(obj, loc)
  EFFECT: ¬At(obj, loc) ∧ At(obj, Ground))
Action(PutOn(t, Axle),
  PRECOND: Tire(t) ∧ At(t, Ground) ∧ ¬At(Flat, Axle) ∧ ¬At(Spare, Axle)
  EFFECT: ¬At(t, Ground) ∧ At(t, Axle))
Action(LeaveOvernight,
  PRECOND:
  EFFECT: ¬At(Spare, Ground) ∧ ¬At(Spare, Axle) ∧ ¬At(Spare, Trunk)
    ∧ ¬At(Flat, Ground) ∧ ¬At(Flat, Axle) ∧ ¬At(Flat, Trunk))
Figure 11.2 The simple spare tire problem.
Figure 11.3 Diagram of the blocks-world problem in Figure 11.4. (Panels: Start State, Goal State.)
Init(On(A, Table) ∧ On(B, Table) ∧ On(C, A)
  ∧ Block(A) ∧ Block(B) ∧ Block(C) ∧ Clear(B) ∧ Clear(C) ∧ Clear(Table))
Goal(On(A, B) ∧ On(B, C))
Action(Move(b, x, y),
  PRECOND: On(b, x) ∧ Clear(b) ∧ Clear(y) ∧ Block(b) ∧ Block(y) ∧
    (b ≠ x) ∧ (b ≠ y) ∧ (x ≠ y),
  EFFECT: On(b, y) ∧ Clear(x) ∧ ¬On(b, x) ∧ ¬Clear(y))
Action(MoveToTable(b, x),
  PRECOND: On(b, x) ∧ Clear(b) ∧ Block(b) ∧ Block(x),
  EFFECT: On(b, Table) ∧ Clear(x) ∧ ¬On(b, x))
Figure 11.4 A planning problem in the blocks world: building a three-block tower. One solution is the sequence [MoveToTable(C, A), Move(B, Table, C), Move(A, Table, B)].
We use On(b, x) to indicate that block b is on x, where x is either another block or the table. The action for moving block b from the top of x to the top of y will be Move(b, x, y). Now, one of the preconditions on moving b is that no other block be on it. In first-order logic, this would be ¬∃x On(x, b) or, alternatively, ∀x ¬On(x, b). Basic PDDL does not allow quantifiers, so instead we introduce a predicate Clear(x) that is true when nothing is on x. (The complete problem description is in Figure 11.4.)
The action Move moves a block b from x to y if both b and y are clear. After the move is made, b is still clear but y is not. A first attempt at the Move schema is
Action(Move(b, x, y),
  PRECOND: On(b, x) ∧ Clear(b) ∧ Clear(y),
  EFFECT: On(b, y) ∧ Clear(x) ∧ ¬On(b, x) ∧ ¬Clear(y)).
Unfortunately, this does not maintain Clear properly when x or y is the table. When x is the Table, this action has the effect Clear(Table), but the table should not become clear; and when y = Table, it has the precondition Clear(Table), but the table does not have to be clear for us to move a block onto it. To fix this, we do two things. First, we introduce another action to move a block b from x to the table:
Action(MoveToTable(b, x),
  PRECOND: On(b, x) ∧ Clear(b),
  EFFECT: On(b, Table) ∧ Clear(x) ∧ ¬On(b, x)).
Second, we take the interpretation of Clear(x) to be “there is a clear space on x to hold a block.” Under this interpretation, Clear(Table) will always be true. The only problem is that nothing prevents the planner from using Move(b, x, Table) instead of MoveToTable(b, x). We could live with this problem (it will lead to a larger-than-necessary search space, but will not lead to incorrect answers), or we could introduce the predicate Block and add Block(b) ∧ Block(y) to the precondition of Move, as shown in Figure 11.4.
11.2 Algorithms for Classical Planning
The description of a planning problem provides an obvious way to search from the initial state through the space of states, looking for a goal. A nice advantage of the declarative representation of action schemas is that we can also search backward from the goal, looking for the initial state (Figure 11.5 compares forward and backward searches). A third possibility is to translate the problem description into a set of logic sentences, to which we can apply a logical inference algorithm to find a solution.
11.2.1 Forward state-space search for planning
We can solve planning problems by applying any of the heuristic search algorithms from Chapter 3 or Chapter 4. The states in this search state space are ground states, where every fluent is either true or not. The goal is a state that has all the positive fluents in the problem’s goal and none of the negative fluents. The applicable actions in a state, Actions(s), are grounded instantiations of the action schemas; that is, actions where the variables have all been replaced by constant values.
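As a small illustration of forward state-space search, the sketch below runs an uninformed breadth-first search (not one of the heuristic algorithms, which would be needed for larger problems) on the spare tire problem of Figure 11.2. The ground actions are written out by hand, only the two Remove instances that matter are listed, and the tuple layout and helper names are assumptions made for this example.

```python
from collections import deque

# Ground action layout: (name, positive preconds, negative preconds, add list, delete list)
def remove(obj, loc):
    return (f"Remove({obj},{loc})", {f"At({obj},{loc})"}, set(),
            {f"At({obj},Ground)"}, {f"At({obj},{loc})"})

def put_on(t):
    return (f"PutOn({t},Axle)", {f"Tire({t})", f"At({t},Ground)"},
            {"At(Flat,Axle)", "At(Spare,Axle)"},
            {f"At({t},Axle)"}, {f"At({t},Ground)"})

ALL_AT = {f"At({t},{loc})" for t in ("Flat", "Spare") for loc in ("Ground", "Axle", "Trunk")}
leave_overnight = ("LeaveOvernight", set(), set(), set(), ALL_AT)

ACTIONS = [remove("Flat", "Axle"), remove("Spare", "Trunk"),
           put_on("Spare"), put_on("Flat"), leave_overnight]
INIT = frozenset({"Tire(Flat)", "Tire(Spare)", "At(Flat,Axle)", "At(Spare,Trunk)"})
GOAL = {"At(Spare,Axle)"}  # only positive goal literals in this problem

def forward_search(init, goal, actions):
    """Breadth-first progression search over ground states; returns a list of action names."""
    frontier = deque([(init, [])])
    reached = {init}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:
            return plan
        for name, pos, neg, add, delete in actions:
            if pos <= state and not (neg & state):      # action is applicable
                nxt = frozenset((state - delete) | add)  # RESULT(s, a)
                if nxt not in reached:
                    reached.add(nxt)
                    frontier.append((nxt, plan + [name]))
    return None

plan = forward_search(INIT, GOAL, ACTIONS)
```

Because breadth-first search explores shallow plans first, the plan it returns has three steps and ends with PutOn(Spare,Axle); LeaveOvernight leads only to a dead-end state in which no tire is anywhere.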
To determine the applicable actions we unify the current state against the preconditions of each action schema. For each unification that successfully results in a substitution, we apply the substitution to the action schema to yield a ground action with no variables. (It is a requirement of action schemas that any variable in the effect must also appear in the precondition; that way, we are guaranteed that no variables remain after the substitution.)
Each schema may unify in multiple ways. In the spare tire example (page 346), the Remove action has the precondition At(obj, loc), which matches against the initial state in two ways, resulting in the two substitutions {obj/Flat, loc/Axle} and {obj/Spare, loc/Trunk}; applying these substitutions yields two ground actions. If an action has multiple literals in the precondition, then each of them can potentially be matched against the current state in multiple ways.
Figure 11.5 Two approaches to searching for a plan. (a) Forward (progression) search through the space of ground states, starting in the initial state and using the problem’s actions to search forward for a member of the set of goal states. (b) Backward (regression) search through state descriptions, starting at the goal and using the inverse of the actions to search backward for the initial state.
At first, it seems that the state space might be too big for many problems. Consider an air cargo problem with 10 airports, where each airport initially has 5 planes and 20 pieces of cargo. The goal is to move all the cargo at airport A to airport B. There is a 41-step solution to the problem: load the 20 pieces of cargo into one of the planes at A, fly the plane to B, and unload the 20 pieces.
Finding this apparently straightforward solution can be difficult because the average branching factor is huge: each of the 50 planes can fly to 9 other airports, and each of the 200 packages can be either unloaded (if it is loaded) or loaded into any plane at its airport (if it is unloaded). So in any state there is a minimum of 450 actions (when all the packages are at airports with no planes) and a maximum of 10,450 (when all packages and planes are at the same airport). On average, let’s say there are about 2000 possible actions per state, so the search graph up to the depth of the 41-step solution has about 2000^41 nodes.
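The branching-factor bounds quoted above follow from simple counting, which the short calculation below reproduces. The variable names are chosen for this illustration only.

```python
airports, planes_per_airport, cargo_per_airport = 10, 5, 20
planes = airports * planes_per_airport     # 50 planes in total
packages = airports * cargo_per_airport    # 200 packages in total

fly_actions = planes * (airports - 1)      # each plane can fly to 9 other airports
min_actions = fly_actions                  # no package can be loaded or unloaded
# With everything at one airport, each package can be loaded into any of the 50 planes.
max_actions = fly_actions + packages * planes

# Size of the search graph at depth 41 with about 2000 actions per state.
digits_in_node_count = len(str(2000 ** 41))
```

The node count 2000^41 is a number with 136 decimal digits, which is why uninformed forward search is hopeless here.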
Clearly, even this relatively small problem instance is hopeless without an accurate heuristic. Although many real-world applications of planning have relied on domain-specific heuristics, it turns out (as we see in Section 11.3) that strong domain-independent heuristics can be derived automatically; that is what makes forward search feasible.
11.2.2 Backward search for planning
In backward search (also called regression search) we start at the goal and apply the actions backward until we find a sequence of steps that reaches the initial state. At each step we consider relevant actions (in contrast to forward search, which considers actions that are applicable). This reduces the branching factor significantly, particularly in domains with many possible actions.
A relevant action is one with an effect that unifies with one of the goal literals, but with no effect that negates any part of the goal. For example, with the goal ¬Poor ∧ Famous, an action with the sole effect Famous would be relevant, but one with the effect Poor ∧ Famous is not considered relevant: even though that action might be used at some point in the plan (to establish Famous), it cannot appear at this point in the plan because then Poor would appear in the final state.
What does it mean to apply an action in the backward direction? Given a goal g and an action a, the regression from g over a gives us a state description g′ whose positive and negative literals are given by
$$\mathrm{POS}(g') = (\mathrm{POS}(g) - \mathrm{ADD}(a)) \cup \mathrm{POS}(\mathrm{Precond}(a))$$
$$\mathrm{NEG}(g') = (\mathrm{NEG}(g) - \mathrm{DEL}(a)) \cup \mathrm{NEG}(\mathrm{Precond}(a)).$$
That is, the preconditions must have held before, or else the action could not have been executed, but the positive/negative literals that were added/deleted by the action need not have been true before.
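For ground goals and ground actions, the two regression equations translate directly into set operations. The sketch below assumes string fluents and a dictionary layout for actions (keys `pos_pre`, `neg_pre`, `add`, `del`) invented for this example; it also assumes the action has already been checked for relevance, and it picks the plane P2 purely for illustration.

```python
def regress(goal_pos, goal_neg, action):
    """Return (POS(g'), NEG(g')) for the regression of a ground goal over a ground action."""
    pos = (goal_pos - action["add"]) | action["pos_pre"]
    neg = (goal_neg - action["del"]) | action["neg_pre"]
    return pos, neg

# Ground instance Unload(C2, P2, SFO) of the schema in Figure 11.1.
unload = {
    "pos_pre": {"In(C2,P2)", "At(P2,SFO)", "Cargo(C2)", "Plane(P2)", "Airport(SFO)"},
    "neg_pre": set(),
    "add": {"At(C2,SFO)"},
    "del": {"In(C2,P2)"},
}

goal_pos, goal_neg = {"At(C2,SFO)", "At(C1,JFK)"}, set()
new_pos, new_neg = regress(goal_pos, goal_neg, unload)
```

The goal literal At(C2,SFO) is achieved by the action and so drops out of the regressed description, the untouched literal At(C1,JFK) carries over, and the action's preconditions are added.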
These equations are straightforward for ground literals, but some care is required when there are variables in g and a. For example, suppose the goal is to deliver a specific piece of cargo to SFO: At(C₂, SFO). The Unload action schema has the effect At(c, a). When we unify that with the goal, we get the substitution {c/C₂, a/SFO}; applying that substitution to the schema gives us a new schema which captures the idea of using any plane that is at SFO:
Action(Unload(C₂, p′, SFO),
  PRECOND: In(C₂, p′) ∧ At(p′, SFO) ∧ Cargo(C₂) ∧ Plane(p′) ∧ Airport(SFO)
  EFFECT: At(C₂, SFO) ∧ ¬In(C₂, p′)).
Here we replaced p with a new variable named p′. This is an instance of standardizing apart variable names so there will be no conflict between different variables that happen to have the same name (see page 284). The regressed state description gives us a new goal:
g′ = In(C₂, p′) ∧ At(p′, SFO) ∧ Cargo(C₂) ∧ Plane(p′) ∧ Airport(SFO)
As another example, consider the goal of owning a book with a specific ISBN number: Own(9780134610993). Given a trillion 13-digit ISBNs and the single action schema
A = Action(Buy(i), PRECOND: ISBN(i), EFFECT: Own(i)),
a forward search without a heuristic would have to start enumerating the trillion ground Buy actions. But with backward search, we would unify the goal Own(9780134610993) with the effect Own(i′), yielding the substitution θ = {i′/9780134610993}. Then we would regress over the action SUBST(θ, A) to yield the predecessor state description ISBN(9780134610993). This is part of the initial state, so we have a solution and we are done, having considered just one action, not a trillion.
More formally, assume a goal description g that contains a goal literal g_i and an action schema A. If A has an effect literal e′_j where UNIFY(g_i, e′_j) = θ and where we define A′ = SUBST(θ, A), and if there is no effect in A′ that is the negation of a literal in g, then A′ is a relevant action towards g.
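The relevance test can be sketched as follows. This is a simplified illustration under several assumptions: literals are tuples (sign, predicate, arguments...), the goal is ground so no standardizing apart is needed, variables are the lowercase arguments, and the schema layout with `precond` and `effect` lists is invented for this example.

```python
def is_var(term):
    # Assumed convention: variables start with a lowercase letter.
    return term[0].islower()

def negate(lit):
    return (not lit[0],) + lit[1:]

def unify(goal_lit, effect_lit):
    """Match a ground goal literal against an effect literal; return theta or None."""
    if goal_lit[:2] != effect_lit[:2] or len(goal_lit) != len(effect_lit):
        return None
    theta = {}
    for g, e in zip(goal_lit[2:], effect_lit[2:]):
        if is_var(e):
            if theta.setdefault(e, g) != g:
                return None
        elif e != g:
            return None
    return theta

def subst(theta, lit):
    return lit[:2] + tuple(theta.get(arg, arg) for arg in lit[2:])

def relevant_instances(goal, schema):
    """Return (theta, instantiated precondition) for each relevant instance of the schema."""
    found = []
    for g in goal:
        for e in schema["effect"]:
            theta = unify(g, e)
            if theta is None:
                continue
            effects = [subst(theta, x) for x in schema["effect"]]
            if any(negate(x) in goal for x in effects):
                continue  # some effect contradicts the goal, so the action is not relevant
            found.append((theta, [subst(theta, p) for p in schema["precond"]]))
    return found

buy = {"precond": [(True, "ISBN", "i")], "effect": [(True, "Own", "i")]}
book_goal = {(True, "Own", "9780134610993")}
buy_matches = relevant_instances(book_goal, buy)

fame_goal = {(False, "Poor"), (True, "Famous")}
only_famous = {"precond": [], "effect": [(True, "Famous")]}
poor_and_famous = {"precond": [], "effect": [(True, "Poor"), (True, "Famous")]}
```

With the Buy schema, a single unification yields the one relevant instance and its precondition ISBN(9780134610993), without enumerating any other ground action. The two hypothetical schemas at the end reproduce the ¬Poor ∧ Famous example: the first is relevant, the second is not.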
For most problem domains backward search keeps the branching factor lower than forward search. However, the fact that backward search uses states with variables rather than ground states makes it harder to come up with good heuristics. That is the main reason why the majority of current systems favor forward search.
11.2.3 Planning as Boolean satisfiability
In Section 7.7.4 we showed how some clever axiom-rewriting could turn a wumpus world problem into a propositional logic satisfiability problem that could be handed to an efficient satisfiability solver. SAT-based planners such as SATPLAN operate by translating a PDDL problem description into propositional form. The translation involves a series of steps:
• Propositionalize the actions: for each action schema, form ground propositions by substituting constants for each of the variables. So instead of a single $\mathit{Unload}(c,p,a)$ schema, we would have separate action propositions for each combination of cargo, plane, and airport (here written with subscripts), and for each time step (here written as a superscript).
• Add action exclusion axioms saying that no two actions can occur at the same time, e.g. $\lnot(\mathit{FlyP_1SFOJFK}^1 \land \mathit{FlyP_1SFOBUH}^1)$.
• Add precondition axioms: for each ground action $A^t$, add the axiom $A^t \Rightarrow \text{PRE}(A)^t$, that is, if an action is taken at time $t$, then the preconditions must have been true. For example, $\mathit{FlyP_1SFOJFK}^1 \Rightarrow \mathit{At}(P_1,\mathit{SFO})^1 \land \mathit{Plane}(P_1)^1 \land \mathit{Airport}(\mathit{SFO})^1 \land \mathit{Airport}(\mathit{JFK})^1$.
• Define the initial state: assert $F^0$ for every fluent $F$ in the problem’s initial state, and $\lnot F^0$ for every fluent not mentioned in the initial state.
• Propositionalize the goal: the goal becomes a disjunction over all of its ground instances, where variables are replaced by constants. For example, the goal of having block A on another block, $\mathit{On}(A,x) \land \mathit{Block}(x)$ in a world with objects A, B and C, would be replaced by the goal $(\mathit{On}(A,A) \land \mathit{Block}(A)) \lor (\mathit{On}(A,B) \land \mathit{Block}(B)) \lor (\mathit{On}(A,C) \land \mathit{Block}(C))$.
• Add successor-state axioms: for each fluent $F$, add an axiom of the form $F^{t+1} \Leftrightarrow \mathit{ActionCausesF}^t \lor (F^t \land \lnot \mathit{ActionCausesNotF}^t)$, where ActionCausesF stands for a disjunction of all the ground actions that add $F$, and ActionCausesNotF stands for a disjunction of all the ground actions that delete $F$.
The resulting translation is typically much larger than the original PDDL, but the efficiency of modern SAT solvers often more than makes up for this.
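To give a feel for how quickly the propositional form grows, the first two steps above can be sketched as follows. This is a minimal illustration, not SATPLAN's actual encoding: the string format for propositions, the function names, and the tiny domain are all assumptions.

```python
from itertools import combinations, product

def ground_actions(schema_name, domains, horizon):
    """Yield one proposition name per ground instance and per time step."""
    for args in product(*domains):
        for t in range(horizon):
            yield f"{schema_name}({','.join(args)})^{t}"

def exclusion_axioms(action_props_at_t):
    """Pairwise exclusion: not (a and b) for every pair of actions at one time step."""
    return [f"~({a} & {b})" for a, b in combinations(action_props_at_t, 2)]

# Hypothetical domain: 2 pieces of cargo, 2 planes, 2 airports, 3 time steps.
cargo, planes, airports = ["C1", "C2"], ["P1", "P2"], ["SFO", "JFK"]
unload = list(ground_actions("Unload", [cargo, planes, airports], horizon=3))
step0 = [p for p in unload if p.endswith("^0")]
axioms = exclusion_axioms(step0)
print(len(unload), len(step0), len(axioms))
```

Even in this toy domain a single schema yields 24 action propositions, and the exclusion axioms grow quadratically in the number of ground actions per step.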
11.2.4 Other classical planning approaches
The three approaches we covered above are not the only ones tried in the 50-year history of automated planning. We briefly describe some others here.
An approach called Graphplan uses a specialized data structure, a planning graph, to encode constraints on how actions are related to their preconditions and effects, and on which things are mutually exclusive.
Situation calculus is a method of describing planning problems in first-order logic. It uses successor-state axioms just as SATPLAN does, but first-order logic allows for more flexibility and more succinct axioms. Overall the approach has contributed to our theoretical understanding of planning, but has not made a big impact in practical applications, perhaps because first-order provers are not as well developed as propositional satisfiability programs.
It is possible to encode a bounded planning problem (i.e., the problem of finding a plan of length $k$) as a constraint satisfaction problem (CSP). The encoding is similar to the encoding to a SAT problem (Section 11.2.3), with one important simplification: at each time step we need only a single variable, $\mathit{Action}^t$, whose domain is the set of possible actions. We no longer need one variable for every action, and we don’t need the action exclusion axioms.
All the approaches we have seen so far construct totally ordered plans consisting of strictly linear sequences of actions. But if an air cargo problem has 30 packages being loaded onto one plane and 50 packages being loaded onto another, it seems pointless to decree a specific linear ordering of the 80 load actions. An alternative called partial-order planning represents a plan as a graph rather than a linear sequence: each action is a node in the graph, and for each precondition of the action there is an edge from another action (or from the initial state) that indicates that the predecessor action establishes the precondition. So we could have a partial-order plan that says that actions Remove(Spare, Trunk) and Remove(Flat, Axle) must come before PutOn(Spare, Axle), but without saying which of the two Remove actions should come first. We search in the space of plans rather than world-states, inserting actions to satisfy conditions.
In the 1980s and 1990s, partial-order planning was seen as the best way to handle planning problems with independent subproblems. By 2000, forward-search planners had developed excellent heuristics that allowed them to efficiently discover the independent subproblems that partial-order planning was designed for. Moreover, SATPLAN was able to take advantage of Moore’s law: a propositionalization that was hopelessly large in 1980 now looks tiny, because computers have 10,000 times more memory today. As a result, partial-order planners are not competitive on fully automated classical planning problems.
Nonetheless, partial-order planning remains an important part of the field. For some specific tasks, such as operations scheduling, partial-order planning with domain-specific heuristics is the technology of choice. Many of these systems use libraries of high-level plans, as described in Section 11.4.
Partial-order planning is also often used in domains where it is important for humans to understand the plans. For example, operational plans for spacecraft and Mars rovers are generated by partial-order planners and are then checked by human operators before being uploaded to the vehicles for execution. The plan refinement approach makes it easier for the humans to understand what the planning algorithms are doing and to verify that the plans are correct before they are executed.
11.3 Heuristics for Planning
Neither forward nor backward search is efficient without a good heuristic function. Recall from Chapter 3 that a heuristic function $h(s)$ estimates the distance from a state $s$ to the goal, and that if we can derive an admissible heuristic for this distance (one that does not overestimate) then we can use A* search to find optimal solutions.
By definition, there is no way to analyze an atomic state, and thus it requires some ingenuity by an analyst (usually human) to define good domain-specific heuristics for search problems with atomic states. But planning uses a factored representation for states and actions, which makes it possible to define good domain-independent heuristics.
Recall that an admissible heuristic can be derived by defining a relaxed problem that is easier to solve. The exact cost of a solution to this easier problem then becomes the heuristic for the original problem. A search problem is a graph where the nodes are states and the edges are actions. The problem is to find a path connecting the initial state to a goal state. There are two main ways we can relax this problem to make it easier: by adding more edges to the graph, making it strictly easier to find a path, or by grouping multiple nodes together, forming an abstraction of the state space that has fewer states, and thus is easier to search.
We look first at heuristics that add edges to the graph. Perhaps the simplest is the ignore-preconditions heuristic, which drops all preconditions from actions. Every action becomes applicable in every state, and any single goal fluent can be achieved in one step (if there are any applicable actions; if not, the problem is impossible). This almost implies that the number of steps required to solve the relaxed problem is the number of unsatisfied goals: almost but not quite, because (1) some action may achieve multiple goals and (2) some actions may undo the effects of others. For many problems an accurate heuristic is obtained by considering (1) and ignoring (2). First, we relax the actions by removing all preconditions and all effects except those that are literals in the goal. Then, we count the minimum number of actions required such that the union of those actions’ effects satisfies the goal. This is an instance of the set-cover problem. There is one minor irritation: the set-cover problem is NP-hard. Fortunately a simple greedy algorithm is guaranteed to return a set covering whose size is within a factor of $\log n$ of the true minimum covering, where $n$ is the number of literals in the goal. Unfortunately, the greedy algorithm loses the guarantee of admissibility.
It is also possible to ignore only selected preconditions of actions. Consider the sliding-tile puzzle (8-puzzle or 15-puzzle) from Section 3.2. We could encode this as a planning problem involving tiles with a single schema Slide:
Action(Slide($t, s_1, s_2$),
  PRECOND: $\mathit{On}(t,s_1) \land \mathit{Tile}(t) \land \mathit{Blank}(s_2) \land \mathit{Adjacent}(s_1,s_2)$
  EFFECT: $\mathit{On}(t,s_2) \land \mathit{Blank}(s_1) \land \lnot \mathit{On}(t,s_1) \land \lnot \mathit{Blank}(s_2)$)
As we saw in Section 3.6, if we remove the preconditions $\mathit{Blank}(s_2) \land \mathit{Adjacent}(s_1,s_2)$ then any tile can move in one action to any space and we get the number-of-misplaced-tiles heuristic. If we remove only the $\mathit{Blank}(s_2)$ precondition then we get the Manhattan-distance heuristic. It is easy to see how these heuristics could be derived automatically from the action schema description. The ease of manipulating the action schemas is the great advantage of the factored representation of planning problems, as compared with the atomic representation of search problems.
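The greedy set-cover estimate described for the ignore-preconditions heuristic might look like the following sketch. The representation (goal literals as strings, each action as a set of effect literals) and the action names are assumptions for illustration; as noted above, the greedy count is not guaranteed to be admissible.

```python
def greedy_set_cover_estimate(goal, action_effects):
    """Greedily pick actions until their effects cover every goal literal.

    goal: set of goal literals.
    action_effects: dict mapping an action name to its set of effect literals.
    Preconditions are ignored, and effects outside the goal are dropped.
    Returns (count, chosen actions), or None if some goal literal is unreachable.
    """
    uncovered = set(goal)
    relevant = {name: eff & goal for name, eff in action_effects.items()}
    chosen = []
    while uncovered:
        # Pick the action covering the most still-uncovered goal literals.
        best = max(relevant, key=lambda n: len(relevant[n] & uncovered), default=None)
        if best is None or not relevant[best] & uncovered:
            return None
        chosen.append(best)
        uncovered -= relevant[best]
    return len(chosen), chosen

# Hypothetical relaxed problem with four goal literals.
goal = {"g1", "g2", "g3", "g4"}
actions = {"A": {"g1", "g2"}, "B": {"g2", "g3"}, "C": {"g3", "g4"}, "D": {"g1", "x"}}
print(greedy_set_cover_estimate(goal, actions))
```

Here the greedy choice happens to find the minimum cover of two actions; on other inputs it may pick more actions than necessary, which is exactly where admissibility is lost.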
Figure 11.6 Two state spaces from planning problems with the ignore-delete-lists heuristic. The height above the bottom plane is the heuristic score of a state; states on the bottom plane are goals. There are no local minima, so search for the goal is straightforward. From Hoffmann (2005).
Another possibility is the ignore-delete-lists heuristic. Assume for a moment that all goals and preconditions contain only positive literals. We want to create a relaxed version of the original problem that will be easier to solve, and where the length of the solution will serve as a good heuristic. We can do that by removing the delete lists from all actions (i.e., removing all negative literals from effects). That makes it possible to make monotonic progress towards the goal: no action will ever undo progress made by another action. It turns out it is still NP-hard to find the optimal solution to this relaxed problem, but an approximate solution can be found in polynomial time by hill climbing.
Figure 11.6 diagrams part of the state space for two planning problems using the ignore-delete-lists heuristic. The dots represent states and the edges actions, and the height of each dot above the bottom plane represents the heuristic value. States on the bottom plane are solutions. In both of these problems, there is a wide path to the goal. There are no dead ends, so no need for backtracking; a simple hill-climbing search will easily find a solution to these problems (although it may not be an optimal solution).
11.3.1 Domain-independent pruning
Factored representations make it obvious that many states are just variants of other states. For example, suppose we have a dozen blocks on a table, and the goal is to have block A on top of a three-block tower. The first step in a solution is to place some block x on top of block y (where x, y, and A are all different). After that, place A on top of x and we’re done. There are 11 choices for x, and given x, 10 choices for y, and thus 110 states to consider. But all these states are symmetric: choosing one over another makes no difference, and thus a planner should only consider one of them. This is the process of symmetry reduction: we prune out of consideration all symmetric branches of the search tree except for one. For many domains, this makes the difference between intractable and efficient solving.
Footnote (on the assumption that goals and preconditions contain only positive literals): for problems that aren’t written this way, replace every negative literal $\lnot P$ in a goal or precondition with a new positive literal, and modify the initial state and the action effects accordingly.
Another possibility is to do forward pruning, accepting the risk that we might prune away an optimal solution, in order to focus the search on promising branches. We can define a preferred action as follows: First, define a relaxed version of the problem, and solve it to get a relaxed plan. Then a preferred action is either a step of the relaxed plan, or it achieves some precondition of the relaxed plan.
Sometimes it is possible to solve a problem efficiently by recognizing that negative interactions can be ruled out. We say that a problem has serializable subgoals if there exists an order of subgoals such that the planner can achieve them in that order without having to undo any of the previously achieved subgoals. For example, in the blocks world, if the goal is to build a tower (e.g., A on B, which in turn is on C, which in turn is on the Table, as in Figure 11.3 on page 347), then the subgoals are serializable bottom to top: if we first achieve C on Table, we will never have to undo it while we are achieving the other subgoals. A planner that uses the bottom-to-top trick can solve any problem in the blocks world without backtracking (although it might not always find the shortest plan). As another example, if there is a room with n light switches, each controlling a separate light, and the goal is to have them all on, then we don’t have to consider permutations of the order; we could arbitrarily restrict ourselves to plans that flip switches in, say, ascending order.
For the Remote Agent planner that commanded NASA’s Deep Space One spacecraft, it was determined that the propositions involved in commanding a spacecraft are serializable. This is perhaps not too surprising, because a spacecraft is designed by its engineers to be as easy as possible to control (subject to other constraints). Taking advantage of the serialized ordering of goals, the Remote Agent planner was able to eliminate most of the search. This meant that it was fast enough to control the spacecraft in real time, something previously considered impossible.
11.3.2 State abstraction in planning
A relaxed problem leaves us with a simplified planning problem just to calculate the value of the heuristic function. Many planning problems have $10^{100}$ states or more, and relaxing the actions does nothing to reduce the number of states, which means that it may still be expensive to compute the heuristic. Therefore, we now look at relaxations that decrease the number of states by forming a state abstraction: a many-to-one mapping from states in the ground representation of the problem to the abstract representation.
The easiest form of state abstraction is to ignore some fluents. For example, consider an air cargo problem with 10 airports, 50 planes, and 200 pieces of cargo. Each plane can be at one of 10 airports and each package can be either in one of the planes or unloaded at one of the airports. So there are $10^{50} \times (50+10)^{200} \approx 10^{405}$ states. Now consider a particular problem in that domain in which it happens that all the packages are at just 5 of the airports, and all packages at a given airport have the same destination. Then a useful abstraction of the problem is to drop all the At fluents except for the ones involving one plane and one package at each of the 5 airports. Now there are only $10^5 \times (5+10)^5 \approx 10^{11}$ states. A solution in this abstract state space will be shorter than a solution in the original space (and thus will be an admissible heuristic), and the abstract solution is easy to extend to a solution to the original problem (by adding additional Load and Unload actions).
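The two state counts can be checked with a few lines of arithmetic. The helper name and its parameters are illustrative; it works in base-10 logarithms because the full count is far too large to be meaningful as a float.

```python
from math import log10

def log10_states(plane_locations, planes, package_locations, packages):
    """Return log10 of plane_locations**planes * package_locations**packages."""
    return planes * log10(plane_locations) + packages * log10(package_locations)

# Full problem: 50 planes at 10 airports, 200 packages each in one of 50 planes or 10 airports.
full = log10_states(10, 50, 50 + 10, 200)
# Abstraction: 5 planes at 10 airports, 5 packages each in one of 5 planes or 10 airports.
abstract = log10_states(10, 5, 5 + 10, 5)
print(f"full ~ 10^{full:.1f}, abstract ~ 10^{abstract:.1f}")
```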
A key idea in defining heuristics is decomposition: dividing a problem into parts, solving each part independently, and then combining the parts. The subgoal independence assumption is that the cost of solving a conjunction of subgoals is approximated by the sum of the costs of solving each subgoal independently. The subgoal independence assumption can be optimistic or pessimistic. It is optimistic when there are negative interactions between the subplans for each subgoal, for example, when an action in one subplan deletes a goal achieved by another subplan. It is pessimistic, and therefore inadmissible, when subplans contain redundant actions, for instance, two actions that could be replaced by a single action in the merged plan.
Suppose the goal is a set of fluents $G$, which we divide into disjoint subsets $G_1, \ldots, G_n$. We then find optimal plans $P_1, \ldots, P_n$ that solve the respective subgoals. What is an estimate of the cost of the plan for achieving all of $G$? We can think of each $\text{COST}(P_i)$ as a heuristic estimate, and we know that if we combine estimates by taking their maximum value, we always get an admissible heuristic. So $\max_i \text{COST}(P_i)$ is admissible, and sometimes it is exactly correct: it could be that $P_1$ serendipitously achieves all the $G_i$. But usually the estimate is too low. Could we sum the costs instead? For many problems that is a reasonable estimate, but it is not admissible. The best case is when $G_i$ and $G_j$ are independent, in the sense that plans for one cannot reduce the cost of plans for the other. In that case, the estimate $\text{COST}(P_i) + \text{COST}(P_j)$ is admissible, and more accurate than the max estimate.
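A small numeric illustration of the two ways of combining subplan costs follows. The cost values are hypothetical, chosen so that the two subplans share one action and the sum therefore overestimates.

```python
def h_max(subplan_costs):
    """Admissible combination: the largest single-subgoal cost."""
    return max(subplan_costs, default=0)

def h_sum(subplan_costs):
    """Subgoal-independence estimate: admissible only when subplans cannot help each other."""
    return sum(subplan_costs)

# Hypothetical numbers: each subgoal has an optimal subplan of cost 3,
# but the two subplans share one action, so the true joint optimum costs 5.
costs = [3, 3]
true_cost = 5
print(h_max(costs), h_sum(costs), true_cost)
```

The max estimate (3) stays below the true cost, while the sum (6) exceeds it, matching the pessimistic case described above.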
It is clear that there is great potential for cutting down the search space by forming abstractions. The trick is choosing the right abstractions and using them in a way that makes the total cost (defining an abstraction, doing an abstract search, and mapping the abstraction back to the original problem) less than the cost of solving the original problem. The techniques of pattern databases from Section 3.6.3 can be useful, because the cost of creating the pattern database can be amortized over multiple problem instances.
A system that makes use of effective heuristics is FF, or FASTFORWARD (Hoffmann, 2005), a forward state-space searcher that uses the ignore-delete-lists heuristic, estimating the heuristic with the help of a planning graph. FF then uses hill-climbing search (modified to keep track of the plan) with the heuristic to find a solution. FF’s hill-climbing algorithm is nonstandard: it avoids local maxima by running a breadth-first search from the current state until a better one is found. If this fails, FF switches to a greedy best-first search instead.
11.4 Hierarchical Planning
The problem-solving and planning methods of the preceding chapters all operate with a fixed set of atomic actions. Actions can be strung together, and state-of-the-art algorithms can generate solutions containing thousands of actions. That’s fine if we are planning a vacation and the actions are at the level of “fly from San Francisco to Honolulu,” but at the motor-control level of “bend the left knee by 5 degrees” we would need to string together millions or billions of actions, not thousands.
Bridging this gap requires planning at higher levels of abstraction. A high-level plan for a Hawaii vacation might be “Go to San Francisco airport; take flight HA 11 to Honolulu; do vacation stuff for two weeks; take HA 12 back to San Francisco; go home.” Given such a plan, the action “Go to San Francisco airport” can be viewed as a planning task in itself, with a solution such as “Choose a ride-hailing service; order a car; ride to airport.” Each of these actions, in turn, can be decomposed further, until we reach the low-level motor control actions like a button-press.
In this example, planning and acting are interleaved; for example, one would defer the problem of planning the walk from the curb to the gate until after being dropped off. Thus, that particular action will remain at an abstract level prior to the execution phase. We defer discussion of this topic until Section 11.5. Here, we concentrate on the idea of hierarchical decomposition, an idea that pervades almost all attempts to manage complexity. For example, complex software is created from a hierarchy of subroutines and classes; armies, governments and corporations have organizational hierarchies. The key benefit of hierarchical structure is that at each level of the hierarchy, a computational task, military mission, or administrative function is reduced to a small number of activities at the next lower level, so the computational cost of finding the correct way to arrange those activities for the current problem is small.
11.4.1 High-level actions
The basic formalism we adopt to understand hierarchical decomposition comes from the area of hierarchical task networks or HTN planning. For now we assume full observability and determinism and a set of actions, now called primitive actions, with standard precondition-effect schemas. The key additional concept is the high-level action or HLA, for example, the action “Go to San Francisco airport.” Each HLA has one or more possible refinements, into a sequence of actions, each of which may be an HLA or a primitive action. For example, the action “Go to San Francisco airport,” represented formally as Go(Home, SFO), might have two possible refinements, as shown in Figure 11.7. The same figure shows a recursive refinement for navigation in the vacuum world: to get to a destination, take a step, and then go to the destination.
These examples show that high-level actions and their refinements embody knowledge about how to do things. For instance, the refinements for Go(Home, SFO) say that to get to the airport you can drive or take a ride-hailing service; buying milk, sitting down, and moving the knight to e4 are not to be considered.
An HLA refinement that contains only primitive actions is called an implementation of the HLA. In a grid world, the sequences [Right, Right, Down] and [Down, Right, Right] both implement the HLA Navigate([1,3],[3,2]). An implementation of a high-level plan (a sequence of HLAs) is the concatenation of implementations of each HLA in the sequence. Given the precondition-effect definitions of each primitive action, it is straightforward to determine whether any given implementation of a high-level plan achieves the goal. We can say, then, that a high-level plan achieves the goal from a given state if at least one of its implementations achieves the goal from that state. The “at least one” in this definition is crucial: not all implementations need to achieve the goal, because the agent gets to decide which implementation it will execute. Thus, the set of possible implementations in HTN planning, each of which may have a different outcome, is not the same as the set of possible outcomes in nondeterministic planning. There, we required that a plan work for all outcomes because the agent doesn’t get to choose the outcome; nature does.
The simplest case is an HLA that has exactly one implementation. In that case, we can compute the preconditions and effects of the HLA from those of the implementation (see Exercise 11.HLAU) and then treat the HLA exactly as if it were a primitive action itself. It can be shown that the right collection of HLAs can result in the time complexity of blind search dropping from exponential in the solution depth to linear in the solution depth, although devising such a collection of HLAs may be a nontrivial task in itself. When HLAs have multiple possible implementations, there are two options: one is to search among the implementations for one that works, as in Section 11.4.2; the other is to reason directly about the HLAs (despite the multiplicity of implementations) as explained in Section 11.4.3. The latter method enables the derivation of provably correct abstract plans, without the need to consider their implementations.
Refinement(Go(Home, SFO),
  STEPS: [Drive(Home, SFOLongTermParking), Shuttle(SFOLongTermParking, SFO)])
Refinement(Go(Home, SFO),
  STEPS: [Taxi(Home, SFO)])
Refinement(Navigate([a,b], [x,y]),
  PRECOND: $a = x \land b = y$
  STEPS: [])
Refinement(Navigate([a,b], [x,y]),
  PRECOND: Connected([a,b], [a-1,b])
  STEPS: [Left, Navigate([a-1,b], [x,y])])
Refinement(Navigate([a,b], [x,y]),
  PRECOND: Connected([a,b], [a+1,b])
  STEPS: [Right, Navigate([a+1,b], [x,y])])
Figure 11.7 Definitions of possible refinements for two high-level actions: going to San Francisco airport and navigating in the vacuum world. In the latter case, note the recursive nature of the refinements and the use of preconditions.
11.4.2 Searching for primitive solutions
HTN planning is often formulated with a single “top level” action called Act, where the aim is to find an implementation of Act that achieves the goal. This approach is entirely general. For example, classical planning problems can be defined as follows: for each primitive action $a_i$, provide one refinement of Act with steps $[a_i, \mathit{Act}]$. That creates a recursive definition of Act that lets us add actions. But we need some way to stop the recursion; we do that by providing one more refinement for Act, one with an empty list of steps and with a precondition equal to the goal of the problem. This says that if the goal is already achieved, then the right implementation is to do nothing.
The approach leads to a simple algorithm: repeatedly choose an HLA in the current plan and replace it with one of its refinements, until the plan achieves the goal. One possible implementation based on breadth-first tree search is shown in Figure 11.8. Plans are considered in order of depth of nesting of the refinements, rather than number of primitive steps. It is straightforward to design a graph-search version of the algorithm as well as depth-first and iterative deepening versions.
function HIERARCHICAL-SEARCH(problem, hierarchy) returns a solution or failure
  frontier ← a FIFO queue with [Act] as the only element
  while true do
    if IS-EMPTY(frontier) then return failure
    plan ← POP(frontier)    // chooses the shallowest plan in frontier
    hla ← the first HLA in plan, or null if none
    prefix, suffix ← the action subsequences before and after hla in plan
    outcome ← RESULT(problem.INITIAL, prefix)
    if hla is null then    // so plan is primitive and outcome is its result
      if problem.IS-GOAL(outcome) then return plan
    else for each sequence in REFINEMENTS(hla, outcome, hierarchy) do
      add APPEND(prefix, sequence, suffix) to frontier
Figure 11.8 A breadth-first implementation of hierarchical forward planning search. The initial plan supplied to the algorithm is [Act]. The REFINEMENTS function returns a set of action sequences, one for each refinement of the HLA whose preconditions are satisfied by the specified state, outcome.
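The pseudocode of Figure 11.8 can be turned into a small runnable sketch. The representation below is an assumption made for illustration: a hierarchy is a dict from HLA name to a list of (precondition, steps) pairs, `result` returns None when a primitive action is inapplicable, and `max_expansions` is an added safety bound that is not part of the figure. The travel domain is a toy version of the Go(Home, SFO) example.

```python
from collections import deque

def hierarchical_search(initial, is_goal, result, hierarchy, top, max_expansions=10_000):
    """Breadth-first hierarchical search; returns a primitive plan or None."""
    frontier = deque([[top]])
    for _ in range(max_expansions):
        if not frontier:
            return None
        plan = frontier.popleft()                      # shallowest plan first
        idx = next((i for i, a in enumerate(plan) if a in hierarchy), None)
        prefix = plan if idx is None else plan[:idx]
        outcome = initial
        for action in prefix:                          # simulate the primitive prefix
            outcome = result(outcome, action)
            if outcome is None:
                break
        if outcome is None:
            continue                                   # prefix is not executable
        if idx is None:                                # plan is fully primitive
            if is_goal(outcome):
                return plan
        else:
            suffix = plan[idx + 1:]
            for precond, steps in hierarchy[plan[idx]]:
                if precond(outcome):
                    frontier.append(prefix + steps + suffix)
    return None

# Toy domain: a state is (location, has_cash).
MOVES = {("Home", "Drive"): "Parking", ("Parking", "Shuttle"): "SFO", ("Home", "Taxi"): "SFO"}

def result(state, action):
    loc, cash = state
    nxt = MOVES.get((loc, action))
    return None if nxt is None else (nxt, cash)

hierarchy = {
    "Go": [
        (lambda s: True, ["Drive", "Shuttle"]),
        (lambda s: s[1], ["Taxi"]),                    # taking a taxi requires cash
    ]
}

plan = hierarchical_search(("Home", False), lambda s: s[0] == "SFO", result, hierarchy, "Go")
print(plan)
```

Because refinement preconditions are checked against the state reached by the primitive prefix, the taxi refinement is never even added to the frontier when the agent has no cash.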
In essence, this form of hierarchical search explores the space of sequences that conform to the knowledge contained in the HLA library about how things are to be done. A great deal of knowledge can be encoded, not just in the action sequences specified in each refinement but also in the preconditions for the refinements. For some domains, HTN planners have been able to generate huge plans with very little search. For example, O-PLAN (Bell and Tate, 1985), which combines HTN planning with scheduling, has been used to develop production plans for Hitachi. A typical problem involves a product line of 350 different products, 35 assembly machines, and over 2000 different operations. The planner generates a 30-day schedule with three 8-hour shifts a day, involving tens of millions of steps. Another important aspect of HTN plans is that they are, by definition, hierarchically structured; usually this makes them easy for humans to understand.
The computational benefits of hierarchical search can be seen by examining an idealized case. Suppose that a planning problem has a solution with $d$ primitive actions. For a nonhierarchical, forward state-space planner with $b$ allowable actions at each state, the cost is $O(b^d)$, as explained in Chapter 3. For an HTN planner, let us suppose a very regular refinement structure: each nonprimitive action has $r$ possible refinements, each into $k$ actions at the next lower level. We want to know how many different refinement trees there are with this structure. Now, if there are $d$ actions at the primitive level, then the number of levels below the root is $\log_k d$, so the number of internal refinement nodes is $1 + k + k^2 + \cdots + k^{\log_k d - 1} = (d-1)/(k-1)$. Each internal node has $r$ possible refinements, so $r^{(d-1)/(k-1)}$ possible decomposition trees could be constructed.
Examining this formula, we see that keeping $r$ small and $k$ large can result in huge savings: we are taking the $k$th root of the nonhierarchical cost, if $b$ and $r$ are comparable. Small $r$ and large $k$ means a library of HLAs with a small number of refinements each yielding a long action sequence. This is not always possible: long action sequences that are usable across a wide range of problems are extremely rare.
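The counting argument can be checked numerically. The function names are illustrative, and the sketch assumes the same idealized structure as the text, with $d$ an exact power of $k$.

```python
def internal_nodes(d, k):
    """Sum 1 + k + k^2 + ... over the levels above the d primitive actions."""
    if k < 2:
        raise ValueError("k must be at least 2")
    total, width = 0, 1
    while width < d:
        total += width
        width *= k
    if width != d:
        raise ValueError("this idealized model assumes d is a power of k")
    return total

def decomposition_trees(d, k, r):
    """Number of decomposition trees: r choices at each internal node."""
    return r ** internal_nodes(d, k)

d, k, r, b = 27, 3, 2, 2
print(internal_nodes(d, k), (d - 1) // (k - 1))   # both count the internal nodes
print(decomposition_trees(d, k, r), b ** d)       # hierarchical vs. flat search space
```

With these small values the hierarchical space has 2^13 trees against 2^27 flat action sequences, and the gap widens rapidly as $k$ grows.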
The key to HTN planning is a plan library containing known methods for implementing complex, high-level actions. One way to construct the library is to learn the methods from problem-solving experience. After the excruciating experience of constructing a plan from scratch, the agent can save the plan in the library as a method for implementing the high-level action defined by the task. In this way, the agent can become more and more competent over time as new methods are built on top of old methods. One important aspect of this learning process is the ability to generalize the methods that are constructed, eliminating detail that is specific to the problem instance (e.g., the name of the builder or the address of the plot of land) and keeping just the key elements of the plan. It seems to us inconceivable that humans could be as competent as they are without some such mechanism.
11.4.3 Searching for abstract solutions
The hierarchical search algorithm in the preceding section refines HLAs all the way to primitive action sequences to determine if a plan is workable. This contradicts common sense: one should be able to determine that the two-HLA high-level plan
[Drive(Home, SFOLongTermParking), Shuttle(SFOLongTermParking, SFO)]
gets one to the airport without having to determine a precise route, choice of parking spot, and so on. The solution is to write precondition-effect descriptions of the HLAs, just as we do for primitive actions. From the descriptions, it ought to be easy to prove that the high-level plan achieves the goal. This is the holy grail, so to speak, of hierarchical planning, because if we derive a high-level plan that provably achieves the goal, working in a small search space of high-level actions, then we can commit to that plan and work on the problem of refining each step of the plan. This gives us the exponential reduction we seek.
For this to work, it has to be the case that every high-level plan that “claims” to achieve the goal (by virtue of the descriptions of its steps) does in fact achieve the goal in the sense defined earlier: it must have at least one implementation that does achieve the goal. This property has been called the downward refinement property for HLA descriptions.
Writing HLA descriptions that satisfy the downward refinement property is, in principle, easy: as long as the descriptions are true, then any high-level plan that claims to achieve the goal must in fact do so; otherwise, the descriptions are making some false claim about what the HLAs do. We have already seen how to write true descriptions for HLAs that have exactly one implementation (Exercise 11.HLAU); a problem arises when the HLA has multiple implementations. How can we describe the effects of an action that can be implemented in many different ways?
One safe answer (at least for problems where all preconditions and goals are positive) is to include only the positive effects that are achieved by every implementation of the HLA and the negative effects of any implementation. Then the downward refinement property would be satisfied. Unfortunately, this semantics for HLAs is much too conservative.
Consider again the HLA Go(Home, SFO), which has two refinements, and suppose, for the sake of argument, a simple world in which one can always drive to the airport and park, but taking a taxi requires Cash as a precondition. In that case, Go(Home, SFO) doesn’t always get you to the airport. In particular, it fails if Cash is false, and so we cannot assert At(Agent, SFO) as an effect of the HLA. This makes no sense, however; if the agent didn’t have Cash, it would drive itself. Requiring that an effect hold for every implementation is equivalent to assuming that someone else (an adversary) will choose the implementation. It treats the HLA’s multiple outcomes exactly as if the HLA were a nondeterministic action, as in Section 4.3. For our case, the agent itself will choose the implementation.
Figure 11.9 Schematic examples of reachable sets. The set of goal states is shaded in purple. Black and gray arrows indicate possible implementations of $h_1$ and $h_2$, respectively. (a) The reachable set of an HLA $h_1$ in a state $s$. (b) The reachable set for the sequence $[h_1, h_2]$. Because this intersects the goal set, the sequence achieves the goal.
The programming languages community has coined the term demonic nondeterminism for the case where an adversary makes the choices, contrasting this with angelic nondeterminism, where the agent itself makes the choices. We borrow this term to define angelic semantics for HLA descriptions. The basic concept required for understanding angelic semantics is the reachable set of an HLA: given a state s, the reachable set for an HLA h, written as REACH(s, h), is the set of states reachable by any of the HLA’s implementations.
The key idea is that the agent can choose which element of the reachable set it ends up in when it executes the HLA; thus, an HLA with multiple refinements is more “powerful” than the same HLA with fewer refinements. We can also define the reachable set of a sequence of HLAs. For example, the reachable set of a sequence [h1, h2] is the union of all the reachable sets obtained by applying h2 in each state in the reachable set of h1:

$$\mathrm{REACH}(s, [h_1, h_2]) = \bigcup_{s' \in \mathrm{REACH}(s, h_1)} \mathrm{REACH}(s', h_2).$$

Given these definitions, a high-level plan (a sequence of HLAs) achieves the goal if its reachable set intersects the set of goal states. (Compare this to the much stronger condition for demonic semantics, where every member of the reachable set has to be a goal state.)
Conversely, if the reachable set doesn’t intersect the goal, then the plan definitely doesn’t work. Figure 11.9 illustrates these ideas.
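The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: states are modeled as frozen sets of true fluents, an HLA is modeled as a function from a state to its reachable set, and the fluent names (AtHome, AtSFO, Cash, CarParkedAtSFO) and the function names are hypothetical choices made for this sketch, not part of the planning formalism.

```python
from typing import Callable, FrozenSet, List, Set

State = FrozenSet[str]                 # a state is the set of fluents that are true
HLA = Callable[[State], Set[State]]    # maps a state s to REACH(s, h)


def go_home_sfo(s: State) -> Set[State]:
    """Toy REACH(s, Go(Home, SFO)): one outcome per applicable refinement."""
    outcomes: Set[State] = set()
    if "AtHome" in s:
        # Refinement 1: drive and park (always possible in this toy world).
        outcomes.add((s - {"AtHome"}) | {"AtSFO", "CarParkedAtSFO"})
        if "Cash" in s:
            # Refinement 2: take a taxi, which uses up the cash.
            outcomes.add((s - {"AtHome", "Cash"}) | {"AtSFO"})
    return outcomes


def reach(s: State, plan: List[HLA]) -> Set[State]:
    """REACH(s, [h1, ..., hn]): apply each HLA to every state reached so far."""
    current = {s}
    for h in plan:
        current = {s2 for s1 in current for s2 in h(s1)}
    return current


def achieves_angelic(s: State, plan: List[HLA], is_goal) -> bool:
    """The agent chooses: some reachable state must be a goal state."""
    return any(is_goal(t) for t in reach(s, plan))


def achieves_demonic(s: State, plan: List[HLA], is_goal) -> bool:
    """An adversary chooses: every reachable state must be a goal state."""
    reachable = reach(s, plan)
    return bool(reachable) and all(is_goal(t) for t in reachable)


start = frozenset({"AtHome", "Cash"})
at_sfo_with_cash = lambda t: "AtSFO" in t and "Cash" in t
print(achieves_angelic(start, [go_home_sfo], at_sfo_with_cash))
print(achieves_demonic(start, [go_home_sfo], at_sfo_with_cash))
```

With the goal "at SFO and still holding cash," the angelic test succeeds because the agent can pick the driving refinement, whereas the demonic test fails because the taxi refinement spends the cash.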
The notion of reachable sets yields a straightforward algorithm: search among high-level plans, looking for one whose reachable set intersects the goal; once that happens, the
algorithm can commit to that abstract plan, knowing that it works, and focus on refining the
plan further. We will return to the algorithmic issues later; for now consider how the effects
of an HLA (the reachable set for each possible initial state) are represented. A primitive action can set a fluent to true or false or leave it unchanged. (With conditional effects (see Section 11.5.1) there is a fourth possibility: flipping a variable to its opposite.)
An HLA under angelic semantics can do more: it can control the value of a fluent, setting
it to true or false depending on which implementation is chosen. That means that an HLA can
have nine different effects on a fluent: if the variable starts out true, it can always keep it true,
always make it false, or have a choice; if the fluent starts out false, it can always keep it false, always make it true, or have a choice; and the three choices for both cases can be combined
arbitrarily, making nine. Notationally, this is a bit challenging. We’ll use the language of add lists and delete lists (rather than true/false fluents) along with the ~ symbol to mean “possibly, if the agent so chooses.” Thus, the effect $\tilde{+}A$ means “possibly add A,” that is, either leave A unchanged or make it true. Similarly, $\tilde{-}A$ means “possibly delete A” and $\tilde{\pm}A$ means “possibly add or delete A.”
For example, the HLA Go(Home, SFO), with the two refinements shown in Figure 11.7, possibly deletes Cash (if the agent decides to take a taxi), so it should have the effect $\tilde{-}$Cash. Thus, we see that the descriptions of HLAs are derivable from the descriptions of their refinements. Now, suppose we have the following schemas for the HLAs h1 and h2:

Action(h1, PRECOND: ¬A, EFFECT: A ∧ $\tilde{-}$B)
Action(h2, PRECOND: ¬B, EFFECT: $\tilde{+}$A ∧ $\tilde{\pm}$C)

That is, h1 adds A and possibly deletes B, while h2 possibly adds A and has full control over C. Now, if only B is true in the initial state and the goal is A ∧ C then the sequence [h1, h2] achieves the goal: we choose an implementation of h1 that makes B false, then choose an implementation of h2 that leaves A true and makes C true.

The preceding discussion assumes that the effects of an HLA (the reachable set for any given initial state) can be described exactly by describing the effect on each fluent. It would be nice if this were always true, but in many cases we can only approximate the effects because an HLA may have infinitely many implementations and may produce arbitrarily wiggly reachable sets, rather like the wiggly-belief-state problem illustrated in Figure 7.21 on page 243. For example, we said that Go(Home, SFO) possibly deletes Cash; it also possibly adds At(Car, SFOLongTermParking); but it cannot do both; in fact, it must do exactly one.
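The h1, h2 example above can be checked mechanically. The following is a minimal sketch, assuming that a per-fluent angelic description is stored as a dictionary with definite and "possible" add and delete sets; the helper names hla, reach, and reach_plan and the dictionary keys are hypothetical. Because each "possible" effect is chosen independently, the sketch also exhibits the limitation just discussed: it cannot express that exactly one of two possible effects must occur.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Set

State = FrozenSet[str]   # the set of fluents that are true


def hla(pre=None, add=(), delete=(), maybe_add=(), maybe_delete=()) -> Dict:
    """Per-fluent angelic description; `pre` maps a fluent to its required value."""
    return {"pre": dict(pre or {}), "add": set(add), "del": set(delete),
            "maybe_add": set(maybe_add), "maybe_del": set(maybe_delete)}


def reach(s: State, h: Dict) -> Set[State]:
    """Reachable set of one HLA from s (empty if the precondition fails)."""
    if any((f in s) != required for f, required in h["pre"].items()):
        return set()
    base = (s - h["del"]) | h["add"]          # definite effects first
    per_fluent = []
    for f in sorted(h["maybe_add"] | h["maybe_del"]):
        values = {f in base}                   # the agent may leave f alone
        if f in h["maybe_add"]:
            values.add(True)
        if f in h["maybe_del"]:
            values.add(False)
        per_fluent.append([(f, v) for v in sorted(values)])
    outcomes: Set[State] = set()
    for combo in product(*per_fluent):         # every combination of choices
        chosen = set(base)
        for f, v in combo:
            if v:
                chosen.add(f)
            else:
                chosen.discard(f)
        outcomes.add(frozenset(chosen))
    return outcomes


def reach_plan(s: State, plan: List[Dict]) -> Set[State]:
    current = {s}
    for h in plan:
        current = {t for u in current for t in reach(u, h)}
    return current


h1 = hla(pre={"A": False}, add={"A"}, maybe_delete={"B"})
h2 = hla(pre={"B": False}, maybe_add={"A", "C"}, maybe_delete={"C"})
final_states = reach_plan(frozenset({"B"}), [h1, h2])
print(any({"A", "C"} <= t for t in final_states))   # does the plan reach A and C?
```

Starting from the state in which only B is true, the reachable set of [h1, h2] contains a state satisfying A ∧ C, which is the angelic criterion for the plan achieving the goal.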
As with belief states, we may need to write approximate descriptions. We will use two kinds of approximation: an optimistic description REACH⁺(s, h) of an HLA h may overstate the reachable set, while a pessimistic description REACH⁻(s, h) may understate the reachable set. Thus, we have

$$\mathrm{REACH}^-(s, h) \subseteq \mathrm{REACH}(s, h) \subseteq \mathrm{REACH}^+(s, h).$$

For example, an optimistic description of Go(Home, SFO) says that it possibly deletes Cash and possibly adds At(Car, SFOLongTermParking). Another good example arises in the 8-puzzle, half of whose states are unreachable from any given state (see Exercise 11.PART): the optimistic description of Act might well include the whole state space, since the exact reachable set is quite wiggly.
With approximate descriptions, the test for whether a plan achieves the goal needs to be modified slightly. If the optimistic reachable set for the plan doesn’t intersect the goal, then the plan doesn’t work; if the pessimistic reachable set intersects the goal, then the plan does work (Figure 11.10(a)). With exact descriptions, a plan either works or it doesn’t, but with approximate descriptions, there is a middle ground: if the optimistic set intersects the goal but the pessimistic set doesn’t, then we cannot tell if the plan works (Figure 11.10(b)). When this circumstance arises, the uncertainty can be resolved by refining the plan. This is a very common situation in human reasoning. For example, in planning the aforementioned two-week Hawaii vacation, one might propose to spend two days on each of seven islands. Prudence would indicate that this ambitious plan needs to be refined by adding details of inter-island transportation.

Figure 11.10 Goal achievement for high-level plans with approximate descriptions. The set of goal states is shaded in purple. For each plan, the pessimistic (solid lines, light blue) and optimistic (dashed lines, light green) reachable sets are shown. (a) The plan indicated by the black arrow definitely achieves the goal, while the plan indicated by the gray arrow definitely doesn’t. (b) A plan that possibly achieves the goal (the optimistic reachable set intersects the goal) but does not necessarily achieve the goal (the pessimistic reachable set does not intersect the goal). The plan would need to be refined further to determine if it really does achieve the goal.
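The three-way test described above is small enough to write down directly. The sketch below assumes the optimistic and pessimistic reachable sets have already been computed as finite Python sets; the names Verdict and evaluate_plan are hypothetical.

```python
from enum import Enum
from typing import Set, TypeVar

S = TypeVar("S")


class Verdict(Enum):
    WORKS = "definitely achieves the goal"
    FAILS = "definitely does not achieve the goal"
    UNKNOWN = "must be refined further"


def evaluate_plan(optimistic: Set[S], pessimistic: Set[S], goal: Set[S]) -> Verdict:
    """Classify a high-level plan from its approximate reachable sets."""
    if not pessimistic <= optimistic:
        raise ValueError("the pessimistic set must be a subset of the optimistic set")
    if not (optimistic & goal):
        return Verdict.FAILS       # even the overestimate misses the goal
    if pessimistic & goal:
        return Verdict.WORKS       # a guaranteed-reachable goal state exists
    return Verdict.UNKNOWN         # optimistic hits the goal, pessimistic does not


goal = {7, 8, 9}
print(evaluate_plan({1, 2, 7}, {2, 7}, goal))   # Verdict.WORKS
print(evaluate_plan({1, 2, 3}, {2}, goal))      # Verdict.FAILS
print(evaluate_plan({1, 2, 7}, {2}, goal))      # Verdict.UNKNOWN
```

Only the UNKNOWN case forces the planner to refine the plan before deciding whether to commit to it or discard it.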
An algorithm for hierarchical planning with approximate angelic descriptions is shown in Figure 11.11. For simplicity, we have kept to the same overall scheme used previously in Figure 11.8, that is, a breadth-first search in the space of refinements. As just explained, the algorithm can detect plans that will and won’t work by checking the intersections of the optimistic and pessimistic reachable sets with the goal. (The details of how to compute the reachable sets of a plan, given approximate descriptions of each step, are covered in Exercise 11.HLAP.)
When a workable abstract plan is found, the algorithm decomposes the original problem into subproblems, one for each step of the plan. The initial state and goal for each subproblem are obtained by regressing a guaranteed-reachable goal state through the action schemas for each step of the plan. (See Section 11.2.2 for a discussion of how regression works.) Figure 11.9(b) illustrates the basic idea: the right-hand circled state is the guaranteed-reachable goal state, and the left-hand circled state is the intermediate goal obtained by regressing the goal through the final action.
function ANGELIC-SEARCH(problem, hierarchy, initialPlan) returns solution or fail
  frontier ← a FIFO queue with initialPlan as the only element
  while true do
    if EMPTY?(frontier) then return fail
    plan ← POP(frontier)  // chooses the shallowest node in frontier
    if REACH⁺(problem.INITIAL, plan) intersects problem.GOAL then
      if plan is primitive then return plan  // REACH⁺ is exact for primitive plans
      guaranteed ← REACH⁻(problem.INITIAL, plan) ∩ problem.GOAL
      if guaranteed ≠ { } and MAKING-PROGRESS(plan, initialPlan) then
        finalState ← any element of guaranteed
        return DECOMPOSE(hierarchy, problem.INITIAL, plan, finalState)
      hla ← some HLA in plan
      prefix, suffix ← the action subsequences before and after hla in plan
      outcome ← RESULT(problem.INITIAL, prefix)
      for each sequence in REFINEMENTS(hla, outcome, hierarchy) do
        frontier ← INSERT(APPEND(prefix, sequence, suffix), frontier)

function DECOMPOSE(hierarchy, s0, plan, sf) returns a solution
  solution ← an empty plan
  while plan is not empty do
    action ← REMOVE-LAST(plan)
    si ← a state in REACH⁻(s0, plan) such that sf ∈ REACH⁻(si, action)
    problem ← a problem with INITIAL = si and GOAL = sf
    solution ← APPEND(ANGELIC-SEARCH(problem, hierarchy, action), solution)
    sf ← si
  return solution
Figure 11.11 A hierarchical planning algorithm that uses angelic semantics to identify and commit to high-level plans that work while avoiding high-level plans that don’t. The predicate MAKING-PROGRESS checks to make sure that we aren’t stuck in an infinite regression of refinements. At top level, call ANGELIC-SEARCH with [Act] as the initialPlan.
The ability to commit to or reject high-level plans can give ANGELIC-SEARCH a significant computational advantage over HIERARCHICAL-SEARCH, which in turn may have a large advantage over plain old BREADTH-FIRST-SEARCH. Consider, for example, cleaning up a large vacuum world consisting of an arrangement of rooms connected by narrow corridors, where each room is a w × h rectangle of squares. It makes sense to have an HLA for Navigate (as shown in Figure 11.7) and one for CleanWholeRoom. (Cleaning the room could be implemented with the repeated application of another HLA to clean each row.) Since there are five primitive actions, the cost for BREADTH-FIRST-SEARCH grows as 5^d, where d is the length of the shortest solution (roughly twice the total number of squares); the algorithm cannot manage even two 3 × 3 rooms. HIERARCHICAL-SEARCH is more efficient, but still suffers from exponential growth because it tries all ways of cleaning that are consistent with the hierarchy. ANGELIC-SEARCH scales approximately linearly in the number of squares: it commits to a good high-level sequence of room-cleaning and navigation steps and prunes away the other options.
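To get a feel for the 5^d figure, the following back-of-the-envelope sketch plugs in the two 3 × 3 rooms mentioned above, taking the text’s rough estimate that d is twice the total number of squares. The function name bfs_cost is hypothetical, and the result is only an order-of-magnitude estimate of the search tree size, not a measured cost.

```python
def bfs_cost(num_rooms: int, w: int, h: int, branching: int = 5) -> int:
    """Rough node count b^d, with d taken as twice the total number of squares."""
    squares = num_rooms * w * h
    depth = 2 * squares
    return branching ** depth


nodes = bfs_cost(num_rooms=2, w=3, h=3)   # 18 squares, so d is about 36
print(f"{nodes:.2e} nodes")
```

For two 3 × 3 rooms the estimate exceeds 10^25 nodes, which is why uninformed breadth-first search is hopeless here.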
Cleaning a set of rooms by cleaning each room in turn is hardly rocket science: it is
easy for humans because of the hierarchical structure of the task. When we consider how
difficult humans find it to solve small puzzles such as the 8-puzzle, it seems likely that the
human capacity for solving complex problems derives not from considering combinatorics, but rather from skill in abstracting and decomposing problems to eliminate combinatorics.
The angelic approach can be extended to find least-cost solutions by generalizing the notion of reachable set. Instead of a state being reachable or not, each state will have a cost for the most efficient way to get there. (The cost is infinite for unreachable states.) The optimistic and pessimistic descriptions bound these costs. In this way, angelic search can find provably optimal abstract plans without having to consider their implementations. The same approach can be used to obtain effective hierarchical lookahead algorithms for online search, in the style of LRTA* (page 140). In some ways, such algorithms mirror aspects of human deliberation in tasks such as planning a vacation to Hawaii: consideration of alternatives is done initially at an abstract level over long time scales; some parts of the plan are left quite abstract until execution time, such as how to spend two lazy days on Moloka'i, while other parts are planned in detail, such as the flights to be taken and lodging to be reserved. Without these latter refinements, there is no guarantee that the plan would be feasible.

11.5 Planning and Acting in Nondeterministic Domains
In this section we extend planning to handle partially observable, nondeterministic, and unknown environments. The basic concepts mirror those in Chapter 4, but there are differences arising from the use of factored representations rather than atomic representations. This affects the way we represent the agent’s capability for action and observation and the way we represent belief states (the sets of possible physical states the agent might be in) for partially observable environments. We can also take advantage of many of the domain-independent methods given in Section 11.3 for calculating search heuristics.
We will cover sensorless planning (also known as conformant planning) for environments with no observations; contingency planning for partially observable and nondeterministic environments; and online planning and replanning for unknown environments. This will allow us to tackle sizable real-world problems.
Consider this problem: given a chair and a table, the goal is to have them match, that is, have the same color. In the initial state we have two cans of paint, but the colors of the paint and the furniture are unknown. Only the table is initially in the agent’s field of view:

Init(Object(Table) ∧ Object(Chair) ∧ Can(C1) ∧ Can(C2) ∧ InView(Table))
Goal(Color(Chair, c) ∧ Color(Table, c))

There are two actions: removing the lid from a paint can and painting an object using the paint from an open can.

Action(RemoveLid(can),
  PRECOND: Can(can)
  EFFECT: Open(can))
Action(Paint(x, can),
  PRECOND: Object(x) ∧ Can(can) ∧ Color(can, c) ∧ Open(can)
  EFFECT: Color(x, c))
The action schemas are straightforward, with one exception: preconditions and effects now may contain variables that are not part of the action’s variable list.
That is, Paint(x,can)
does not mention the variable c, representing the color of the paint in the can. In the fully
observable case, this is not allowed—we would have to name the action Paint(x,can,c). But
in the partially observable case, we might or might not know what color is in the can.
To solve a partially observable problem, the agent will have to reason about the percepts it will obtain when it is executing the plan. The percept will be supplied by the agent’s sensors when it is actually acting, but when it is planning it will need a model of its sensors. In Chapter 4, this model was given by a function, PERCEPT(s). For planning, we augment PDDL with a new type of schema, the percept schema:

Percept(Color(x, c),
  PRECOND: Object(x) ∧ InView(x))
Percept(Color(can, c),
  PRECOND: Can(can) ∧ InView(can) ∧ Open(can))
The first schema says that whenever an object is in view, the agent will perceive the color of the object (that is, for the object x, the agent will learn the truth value of Color(x, c) for all c). The second schema says that if an open can is in view, then the agent perceives the color of the paint in the can. Because there are no exogenous events in this world, the color
of an object will remain the same, even if it is not being perceived, until the agent performs
an action to change the object’s color. Of course, the agent will need an action that causes objects (one at a time) to come into view:
Action(LookAt(x),
  PRECOND: InView(y) ∧ (x ≠ y)
  EFFECT: InView(x) ∧ ¬InView(y))
For a fully observable environment, we would have a Percept schema with no preconditions
for each fluent. A sensorless agent, on the other hand, has no Percept schemas at all. Note
that even a sensorless agent can solve the painting problem. One solution is to open any can of paint and apply it to both chair and table, thus coercing them to be the same color (even
though the agent doesn’t know what the color is). A contingent planning agent with sensors can generate a better plan. First, look at the
table and chair to obtain their colors; if they are already the same then the plan is done. If
not, look at the paint cans; if the paint in a can is the same color as one piece of furniture,
then apply that paint to the other piece. Otherwise, paint both pieces with any color.
Finally, an online planning agent might generate a contingent plan with fewer branches
at first—perhaps ignoring the possibility that no cans match any of the furniture—and deal
with problems when they arise by replanning. It could also deal with incorrectness of its
action schemas.
Whereas a contingent planner simply assumes that the effects of an action
always succeed—that painting the chair does the job—a replanning agent would check the result and make an additional plan to fix any unexpected failure, such as an unpainted area or the original color showing through. In the real world, agents use a combination of approaches. Car manufacturers sell spare tires and air bags, which are physical embodiments of contingent plan branches designed to handle punctures or crashes.
On the other hand, most car drivers never consider these
possibilities; when a problem arises they respond as replanning agents. In general, agents
plan only for contingencies that have important consequences and a nonnegligible chance of happening. Thus, a car driver contemplating a trip across the Sahara desert should make explicit contingency plans for breakdowns, whereas a trip to the supermarket requires less advance planning. We next look at each of the three approaches in more detail.

11.5.1 Sensorless planning
Section 4.4.1 (page 126) introduced the basic idea of searching in belief-state space to find a solution for sensorless problems. Conversion of a sensorless planning problem to a belief-state planning problem works much the same way as it did in Section 4.4.1; the main differences are that the underlying physical transition model is represented by a collection of action schemas, and the belief state can be represented by a logical formula instead of by an explicitly enumerated set of states. We assume that the underlying planning problem is deterministic.
The initial belief state for the sensorless painting problem can ignore InView fluents because the agent has no sensors. Furthermore, we take as given the unchanging facts Object(Table) ∧ Object(Chair) ∧ Can(C1) ∧ Can(C2) because these hold in every belief state. The agent doesn’t know the colors of the cans or the objects, or whether the cans are open or closed, but it does know that objects and cans have colors: ∀x ∃c Color(x, c). After Skolemizing (see Section 9.5.1), we obtain the initial belief state:

b0 = Color(x, C(x)).

In classical planning, where the closed-world assumption is made, we would assume that any fluent not mentioned in a state is false, but in sensorless (and partially observable) planning we have to switch to an open-world assumption in which states contain both positive and negative fluents, and if a fluent does not appear, its value is unknown. Thus, the belief state corresponds exactly to the set of possible worlds that satisfy the formula. Given this initial belief state, the following action sequence is a solution:

[RemoveLid(Can1), Paint(Chair, Can1), Paint(Table, Can1)].

We now show how to progress the belief state through the action sequence to show that the final belief state satisfies the goal.
First, note that in a given belief state b, the agent can consider any action whose preconditions are satisfied by b. (The other actions cannot be used because the transition model doesn’t define the effects of actions whose preconditions might be unsatisfied.) According to Equation (4.4) (page 127), the general formula for updating the belief state b given an applicable action a in a deterministic world is as follows:

$$b' = \mathrm{RESULT}(b, a) = \{s' : s' = \mathrm{RESULT}_P(s, a) \text{ and } s \in b\}$$

where RESULT_P defines the physical transition model. For the time being, we assume that the initial belief state is always a conjunction of literals, that is, a 1-CNF formula. To construct the new belief state b′, we must consider what happens to each literal ℓ in each physical state s in b when action a is applied. For literals whose truth value is already known in b, the truth value in b′ is computed from the current value and the add list and delete list of the action. (For example, if ℓ is in the delete list of the action, then ¬ℓ is added to b′.) What about a literal whose truth value is unknown in b? There are three cases:

1. If the action adds ℓ, then ℓ will be true in b′ regardless of its initial value.
2. If the action deletes ℓ, then ℓ will be false in b′ regardless of its initial value.
3. If the action does not affect ℓ, then ℓ will retain its initial value (which is unknown) and will not appear in b′.

Hence, we see that the calculation of b′ is almost identical to the observable case, which was specified by Equation (11.1) on page 345:

$$b' = \mathrm{RESULT}(b, a) = (b - \mathrm{DEL}(a)) \cup \mathrm{ADD}(a).$$

We cannot quite use the set semantics because (1) we must make sure that b′ does not contain both ℓ and ¬ℓ, and (2) atoms may contain unbound variables.
But it is still the case that RESULT(b, a) is computed by starting with b, setting any atom that appears in DEL(a) to false, and setting any atom that appears in ADD(a) to true. For example, if we apply RemoveLid(Can1) to the initial belief state b0, we get

b1 = Color(x, C(x)) ∧ Open(Can1).

When we apply the action Paint(Chair, Can1), the precondition Color(Can1, c) is satisfied by the literal Color(x, C(x)) with binding {x/Can1, c/C(Can1)} and the new belief state is

b2 = Color(x, C(x)) ∧ Open(Can1) ∧ Color(Chair, C(Can1)).

Finally, we apply the action Paint(Table, Can1) to obtain

b3 = Color(x, C(x)) ∧ Open(Can1) ∧ Color(Chair, C(Can1)) ∧ Color(Table, C(Can1)).

The final belief state satisfies the goal, Color(Table, c) ∧ Color(Chair, c), with the variable c bound to C(Can1).
The preceding analysis of the update rule has shown a very important fact: the family of belief states defined as conjunctions of literals is closed under updates defined by PDDL action schemas. That is, if the belief state starts as a conjunction of literals, then any update will yield a conjunction of literals. That means that in a world with n fluents, any belief state can be represented by a conjunction of size O(n). This is a very comforting result, considering that there are 2^n states in the world. It says we can compactly represent all the subsets of those 2^n states that we will ever need. Moreover, the process of checking for belief states that are subsets or supersets of previously visited belief states is also easy, at least in the propositional case.
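The progression from b0 to b3 can be replayed with a small propositional sketch. It rests on several assumptions made only for illustration: the Skolemized literal Color(x, C(x)) is grounded for the four named objects, atoms are represented as tuples, a belief state is a dictionary holding only the literals whose truth value is known, and the helper names result, remove_lid, paint, and goal_satisfied are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Atom = Tuple[str, ...]        # e.g. ("Open", "Can1")
Belief = Dict[Atom, bool]     # known literals only; absent atoms are unknown


def result(b: Belief, pre: Iterable[Atom], add: Iterable[Atom],
           delete: Iterable[Atom] = ()) -> Belief:
    """b' = (b - DEL(a)) U ADD(a), kept as a consistent conjunction of literals."""
    if not all(b.get(p) is True for p in pre):
        raise ValueError("precondition is not known to hold in this belief state")
    new = dict(b)
    for atom in delete:
        new[atom] = False     # deleted atoms become known-false
    for atom in add:
        new[atom] = True      # added atoms become known-true
    return new


def remove_lid(b: Belief, can: str) -> Belief:
    return result(b, pre=[("Can", can)], add=[("Open", can)])


def paint(b: Belief, obj: str, can: str) -> Belief:
    color = f"C({can})"       # Skolem term naming the can's unknown color
    pre = [("Object", obj), ("Can", can), ("Color", can, color), ("Open", can)]
    return result(b, pre=pre, add=[("Color", obj, color)])


def goal_satisfied(b: Belief) -> bool:
    """Is there a color term known to be the color of both Table and Chair?"""
    def colors(obj: str):
        return {a[2] for a, v in b.items() if v and a[0] == "Color" and a[1] == obj}
    return bool(colors("Table") & colors("Chair"))


names = ["Table", "Chair", "Can1", "Can2"]
b0: Belief = {("Object", "Table"): True, ("Object", "Chair"): True,
              ("Can", "Can1"): True, ("Can", "Can2"): True}
b0.update({("Color", x, f"C({x})"): True for x in names})   # Color(x, C(x))

b1 = remove_lid(b0, "Can1")
b2 = paint(b1, "Chair", "Can1")
b3 = paint(b2, "Table", "Can1")
print(goal_satisfied(b0), goal_satisfied(b3))
```

Each update only sets individual atoms to true or false, so the belief state remains a conjunction of literals throughout, as the closure argument above requires.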
The fly in the ointment of this pleasant picture is that it only works for action schemas that have the same effects for all states in which their preconditions are satisfied. It is this property that enables the preservation of the 1-CNF belief-state representation. As soon as the effect can depend on the state, dependencies are introduced between fluents, and the 1-CNF property is lost.
Consider, for example, the simple vacuum world defined in Section 3.2.1. Let the fluents be AtL and AtR for the location of the robot and CleanL and CleanR for the state of the squares. According to the definition of the problem, the Suck action has no precondition, so it can always be done. The difficulty is that its effect depends on the robot’s location: when the robot is AtL, the result is CleanL, but when it is AtR, the result is CleanR. For such actions, our action schemas will need something new: a conditional effect. These have the syntax “when condition: effect,” where condition is a logical formula to be compared against the current state, and effect is a formula describing the resulting state. For the vacuum world:

Action(Suck,
  EFFECT: when AtL: CleanL ∧ when AtR: CleanR).

When applied to the initial belief state True, the resulting belief state is (AtL ∧ CleanL) ∨ (AtR ∧ CleanR), which is no longer in 1-CNF. (This transition can be seen in Figure 4.14 on page 129.) In general, conditional effects can induce arbitrary dependencies among the fluents in a belief state, leading to belief states of exponential size in the worst case.
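The loss of the 1-CNF property can be confirmed by brute-force enumeration of physical states. In this sketch the robot’s location is modeled by the single Boolean AtL, with AtR treated as its negation; that encoding, and the helper names suck, update, known_literals, and is_one_cnf, are assumptions made for the example.

```python
from itertools import product
from typing import Dict, Set, Tuple

FLUENTS = ("AtL", "CleanL", "CleanR")   # AtR is modeled as "not AtL"
State = Tuple[bool, bool, bool]


def suck(s: State) -> State:
    """Conditional effects: when AtL: CleanL; when AtR: CleanR."""
    at_l, clean_l, clean_r = s
    return (at_l, clean_l or at_l, clean_r or not at_l)


def update(belief: Set[State], action) -> Set[State]:
    """Exact belief update: apply the action to every possible physical state."""
    return {action(s) for s in belief}


def known_literals(belief: Set[State]) -> Dict[str, bool]:
    """Literals that hold in every state of the belief state."""
    known = {}
    for i, fluent in enumerate(FLUENTS):
        values = {s[i] for s in belief}
        if len(values) == 1:
            known[fluent] = values.pop()
    return known


def is_one_cnf(belief: Set[State]) -> bool:
    """True iff the belief state is exactly the set of models of its known literals."""
    free = len(FLUENTS) - len(known_literals(belief))
    return len(belief) == 2 ** free


b0 = set(product([False, True], repeat=len(FLUENTS)))   # belief state True
b1 = update(b0, suck)
print(is_one_cnf(b0), is_one_cnf(b1), len(b1))
```

After Suck, no single literal holds in all four remaining states, yet only half of the eight states are possible, so no conjunction of literals describes the belief state exactly.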
It is important to understand the difference between preconditions and conditional effects. All conditional effects whose conditions are satisfied have their effects applied to generate the resulting belief state; if none are satisfied, then the resulting state is unchanged. On the other hand, if a precondition is unsatisfied, then the action is inapplicable and the resulting state is undefined. From the point of view of sensorless planning, it is better to have conditional effects than an inapplicable action. For example, we could split Suck into two actions with unconditional effects as follows:

Action(SuckL,
  PRECOND: AtL; EFFECT: CleanL)
Action(SuckR,
  PRECOND: AtR; EFFECT: CleanR).

Now we have only unconditional schemas, so the belief states all remain in 1-CNF; unfortunately, we cannot determine the applicability of SuckL and SuckR in the initial belief state.
It seems inevitable, then, that nontrivial problems will involve wiggly belief states, just like those encountered when we considered the problem of state estimation for the wumpus world (see Figure 7.21 on page 243). The solution suggested then was to use a conservative approximation to the exact belief state; for example, the belief state can remain in 1-CNF if it contains all literals whose truth values can be determined and treats all other literals as unknown. While this approach is sound, in that it never generates an incorrect plan, it is incomplete because it may be unable to find solutions to problems that necessarily involve interactions among literals. To give a trivial example, if the goal is for the robot to be on a clean square, then [Suck] is a solution but a sensorless agent that insists on 1-CNF belief states will not find it.
Perhaps a better solution is to look for action sequences that keep the belief state as simple as possible. In the sensorless vacuum world, the action sequence [Right, Suck, Left, Suck] generates the following sequence of belief states:

b0 = True
b1 = AtR
b2 = AtR ∧ CleanR
b3 = AtL ∧ CleanR
b4 = AtL ∧ CleanR ∧ CleanL

That is, the agent can solve the problem while retaining a 1-CNF belief state, even though some sequences (e.g., those beginning with Suck) go outside 1-CNF. The general lesson is not lost on humans: we are always performing little actions (checking the time, patting our pockets to make sure we have the car keys, reading street signs as we navigate through a city) to eliminate uncertainty and keep our belief state manageable.
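A conservative 1-CNF tracker of the kind just described can be sketched as a three-valued update. The assumptions here are ours: a belief state is a dictionary of known literals, each action is a list of conditional effects with a single-fluent condition (or None for an unconditional effect), and Right and Left are modeled as also making the opposite location false. The names ACTIONS, update, and progress are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Belief = Dict[str, bool]                      # known literals; missing means unknown
Effect = Tuple[Optional[str], str, bool]      # (condition fluent or None, fluent, value)

ACTIONS: Dict[str, List[Effect]] = {
    "Right": [(None, "AtR", True), (None, "AtL", False)],
    "Left":  [(None, "AtL", True), (None, "AtR", False)],
    "Suck":  [("AtL", "CleanL", True), ("AtR", "CleanR", True)],
}


def update(belief: Belief, action: str) -> Belief:
    """Sound 1-CNF update: keep only literals whose value can be determined."""
    new = dict(belief)
    for cond, fluent, value in ACTIONS[action]:
        # Conditions are evaluated in the old belief state: True, False, or None.
        fires = True if cond is None else belief.get(cond)
        if fires is True:
            new[fluent] = value
        elif fires is None and belief.get(fluent) != value:
            # The effect may or may not happen, so the fluent becomes unknown.
            new.pop(fluent, None)
    return new


def progress(belief: Belief, plan: List[str]) -> Belief:
    for action in plan:
        belief = update(belief, action)
    return belief


print(progress({}, ["Right", "Suck", "Left", "Suck"]))
print(progress({}, ["Suck"]))
```

Starting from the empty (all-unknown) belief state, the four-step plan ends with both squares known to be clean, whereas [Suck] alone leaves the tracker knowing nothing, which is the incompleteness noted earlier for agents that insist on 1-CNF belief states.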
There is another, quite different approach to the problem of unmanageably wiggly belief states: don’t bother computing them at all. Suppose the initial belief state is b0 and we would like to know the belief state resulting from the action sequence [a1, ..., am]. Instead of computing it explicitly, just represent it as “b0 then [a1, ..., am].” This is a lazy but unambiguous representation of the belief state, and it’s quite concise: O(n + m), where n is the size of the initial belief state (assumed to be in 1-CNF) and m is the maximum length of an action sequence. As a belief-state representation, it suffers from one drawback, however: determining whether the goal is satisfied, or an action is applicable, may require a lot of computation.
The computation can be implemented as an entailment test: if A_m represents the collection of successor-state axioms required to define occurrences of the actions a1, ..., am (as explained for SATPLAN in Section 11.2.3) and G_m asserts that the goal is true after m steps, then the plan achieves the goal if b0 ∧ A_m ⊨ G_m, that is, if b0 ∧ A_m ∧ ¬G_m is unsatisfiable. Given a modern SAT solver, it may be possible to do this much more quickly than computing the full belief state. For example, if none of the actions in the sequence has a particular goal fluent in its add list, the solver will detect this immediately. It also helps if partial results about the belief state (for example, fluents known to be true or false) are cached to simplify subsequent computations.
The final piece of the sensorless planning puzzle is a heuristic function to guide the search. The meaning of the heuristic function is the same as for classical planning: an estimate (perhaps admissible) of the cost of achieving the goal from the given belief state. With belief states, we have one additional fact: solving any subset of a belief state is necessarily easier than solving the belief state:

if b1 ⊆ b2 then h*(b1) ≤ h*(b2).

Hence, any admissible heuristic computed for a subset is admissible for the belief state itself. The most obvious candidates are the singleton subsets, that is, individual physical states. We can take any random collection of states s1, ..., sN that are in the belief state b, apply any admissible heuristic h, and return

H(b) = max{h(s1), ..., h(sN)}

as the heuristic estimate for solving b. We can also use inadmissible heuristics such as the ignore-delete-lists heuristic (page 354), which seems to work quite well in practice.

11.5.2 Contingent planning
We saw in Chapter 4 that contingency planning (the generation of plans with conditional branching based on percepts) is appropriate for environments with partial observability, nondeterminism, or both. For the partially observable painting problem with the percept schemas given earlier, one possible conditional solution is as follows:

[LookAt(Table), LookAt(Chair),
 if Color(Table, c) ∧ Color(Chair, c) then NoOp
 else [RemoveLid(Can1), LookAt(Can1), RemoveLid(Can2), LookAt(Can2),
       if Color(Table, c) ∧ Color(can, c) then Paint(Chair, can)
       else if Color(Chair, c) ∧ Color(can, c) then Paint(Table, can)
       else [Paint(Chair, Can1), Paint(Table, Can1)]]]
Variables in this plan should be considered existentially quantified; the second line says that if there exists some color c that is the color of the table and the chair, then the agent need not do anything to achieve the goal. When executing this plan, a contingent-planning agent can maintain its belief state as a logical formula and evaluate each branch condition by determining if the belief state entails the condition formula or its negation. (It is up to the contingent-planning algorithm to make sure that the agent will never end up in a belief state where the condition formula’s truth value is unknown.) Note that with first-order conditions, the formula may be satisfied in more than one way; for example, the condition Color(Table, c) ∧ Color(can, c) might be satisfied by {can/Can1} and by {can/Can2} if both cans are the same color as the table. In that case, the agent can choose any satisfying substitution to apply to the rest of the plan.
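To see the branches of the conditional plan in action, the sketch below executes it against a hidden world given as a dictionary of colors. It is a simplification under stated assumptions: percepts are modeled by reading the world directly after the corresponding LookAt step rather than by maintaining a logical belief state, and the function name run_painting_plan and the string form of the action trace are hypothetical.

```python
from typing import Dict, List, Tuple


def run_painting_plan(colors: Dict[str, str]) -> Tuple[List[str], Dict[str, str]]:
    """Execute the conditional painting plan; return (action trace, final colors)."""
    world = dict(colors)          # hidden state, consulted only after a LookAt
    trace = ["LookAt(Table)", "LookAt(Chair)"]
    if world["Table"] == world["Chair"]:
        return trace + ["NoOp"], world

    cans = ["Can1", "Can2"]
    for can in cans:
        trace += [f"RemoveLid({can})", f"LookAt({can})"]

    # Any satisfying substitution for the plan variable `can` may be used.
    table_match = next((c for c in cans if world[c] == world["Table"]), None)
    chair_match = next((c for c in cans if world[c] == world["Chair"]), None)
    if table_match is not None:
        world["Chair"] = world[table_match]
        trace.append(f"Paint(Chair, {table_match})")
    elif chair_match is not None:
        world["Table"] = world[chair_match]
        trace.append(f"Paint(Table, {chair_match})")
    else:
        world["Chair"] = world["Table"] = world["Can1"]
        trace += ["Paint(Chair, Can1)", "Paint(Table, Can1)"]
    return trace, world


trace, final = run_painting_plan(
    {"Table": "red", "Chair": "blue", "Can1": "green", "Can2": "red"})
print(trace[-1], final["Table"] == final["Chair"])
```

Whatever the hidden colors are, the plan ends with the table and chair matching; the branches differ only in how many paint actions are needed.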
As shown in Section 4.4.2, calculating the new belief state $b'$ after an action a and subsequent percept is done in two stages. The first stage calculates the belief state after the action, just as for the sensorless agent:

$$\hat{b} = (b - \mathrm{DEL}(a)) \cup \mathrm{ADD}(a)$$

where, as before, we have assumed a belief state represented as a conjunction of literals. The second stage is a little trickier. Suppose that percept literals $p_1, \ldots, p_k$ are received. One might think that we simply need to add these into the belief state; in fact, we can also infer that the preconditions for sensing are satisfied. Now, if a percept p has exactly one percept schema, Percept(p, PRECOND: c), where c is a conjunction of literals, then those literals can be thrown into the belief state along with p. On the other hand, if p has more than one percept schema whose preconditions might hold according to the predicted belief state $\hat{b}$, then we have to add in the disjunction of the preconditions. Obviously, this takes the belief state outside 1-CNF and brings up the same complications as conditional effects, with much the same classes of solutions.
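The two stages can be sketched as follows for the easy case in which every received percept has exactly one percept schema. The string encoding of literals, the dictionary layout for actions and schemas, and the fluent names `Closed` and `Open` are assumptions made for this example; the multi-schema case, which needs a disjunction, is deliberately left out.

```python
def update_belief(belief, action, percepts, percept_schemas):
    """Two-stage update for a belief state held as a set of literals (1-CNF)."""
    # Stage 1: predict, exactly as for the sensorless agent.
    predicted = (belief - action["del"]) | action["add"]
    # Stage 2: add each percept, plus the sensing preconditions when the
    # percept has exactly one schema.
    updated = set(predicted)
    for p in percepts:
        updated.add(p)
        schemas = percept_schemas.get(p, [])
        if len(schemas) == 1:
            updated |= schemas[0]
        # With several applicable schemas a disjunction of preconditions
        # would be required, which leaves 1-CNF; not handled in this sketch.
    return updated

# Hypothetical example: remove the lid of Can1, then perceive its color.
belief = {"InView(Can1)", "Closed(Can1)"}
remove_lid = {"add": {"Open(Can1)"}, "del": {"Closed(Can1)"}}
schemas = {"Color(Can1, White)": [{"InView(Can1)", "Open(Can1)"}]}

new_belief = update_belief(belief, remove_lid, ["Color(Can1, White)"], schemas)
print(sorted(new_belief))
```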
Given a mechanism for computing exact or approximate belief states, we can generate contingent plans with an extension of the AND-OR forward search over belief states used in Section 4.4. Actions with nondeterministic effects, which are defined simply by using a disjunction in the EFFECT of the action schema, can be accommodated with minor changes to the belief-state update calculation and no change to the search algorithm.³ For the heuristic function, many of the methods suggested for sensorless planning are also applicable in the partially observable, nondeterministic case.

3 If cyclic solutions are required for a nondeterministic problem, AND-OR search must be generalized to a loopy version such as LAO* (Hansen and Zilberstein, 2001).

11.5.3 Online planning

Imagine watching a spot-welding robot in a car plant. The robot’s fast, accurate motions are repeated over and over again as each car passes down the line. Although technically impressive, the robot probably does not seem at all intelligent because the motion is a fixed, preprogrammed sequence; the robot obviously doesn’t “know what it’s doing” in any meaningful sense. Now suppose that a poorly attached door falls off the car just as the robot is about to apply a spot weld. The robot quickly replaces its welding actuator with a gripper, picks up the door, checks it for scratches, reattaches it to the car, sends an email to the floor supervisor, switches back to the welding actuator, and resumes its work. All of a sudden, the robot’s behavior seems purposive rather than rote; we assume it results not from a vast, precomputed contingent plan but from an online replanning process, which means that the robot does need to know what it’s trying to do.

Replanning presupposes some form of execution monitoring to determine the need for a new plan. One such need arises when a contingent-planning agent gets tired of planning for every little contingency, such as whether the sky might fall on its head.⁴ This means that the contingent plan is left in an incomplete form. For example, some branches of a partially constructed contingent plan can simply say Replan; if such a branch is reached during execution, the agent reverts to planning mode. As we mentioned earlier, the decision as to how much of the problem to solve in advance and how much to leave to replanning is one that involves tradeoffs among possible events with different costs and probabilities of occurring. Nobody wants to have a car break down in the middle of the Sahara desert and only then think about having enough water.
Replanning may be needed if the agent’s model of the world is incorrect. The model for an action may have a missing precondition; for example, the agent may not know that removing the lid of a paint can often requires a screwdriver. The model may have a missing effect: painting an object may get paint on the floor as well. Or the model may have a missing fluent that is simply absent from the representation altogether; for example, the model given earlier has no notion of the amount of paint in a can, of how its actions affect this amount, or of the need for the amount to be nonzero. The model may also lack provision for exogenous events such as someone knocking over the paint can. Exogenous events can also include changes in the goal, such as the addition of the requirement that the table and chair not be painted black. Without the ability to monitor and replan, an agent’s behavior is likely to be fragile if it relies on absolute correctness of its model.
The online agent has a choice of (at least) three different approaches for monitoring the environment during plan execution:
• Action monitoring: before executing an action, the agent verifies that all the preconditions still hold.
• Plan monitoring: before executing an action, the agent verifies that the remaining plan will still succeed.
• Goal monitoring: before executing an action, the agent checks to see if there is a better set of goals it could be trying to achieve.
In Figure 11.12 we see a schematic of action monitoring. The agent keeps track of both its original plan, called whole plan, and the part of the plan that has not been executed yet, which is denoted by plan.
After executing the first few steps of the plan, the agent expects to be in
state E. But the agent observes that it is actually in state O. It then needs to repair the plan by
finding some point P on the original plan that it can get back to. (It may be that P is the goal
state, G.) The agent tries to minimize the total cost of the plan: the repair part (from O to P)
plus the continuation (from P to G).
4 In 1954, a Mrs. Hodges of Alabama was hit by a meteorite that crashed through her roof. In 1992, a piece of the Mbale meteorite hit a small boy on the head; fortunately, its descent was slowed by banana leaves (Jenniskens et al., 1994). And in 2009, a German boy claimed to have been hit by a meteorite. No serious injuries resulted from any of these incidents, suggesting that the need for preplanning against such contingencies is sometimes overstated.
Figure 11.12 At first, the sequence “whole plan” is expected to get the agent from S to G. The agent executes steps of the plan until it expects to be in state E, but observes that it is actually in O. The agent then replans for the minimal repair plus continuation to reach G.

Now let’s return to the example problem of achieving a chair and table of matching color. Suppose the agent comes up with this plan:
[LookAt(Table), LookAt(Chair),
 if Color(Table, c) ∧ Color(Chair, c) then NoOp
 else [RemoveLid(Can1), LookAt(Can1),
       if Color(Table, c) ∧ Color(Can1, c) then Paint(Chair, Can1)
       else REPLAN]].
Now the agent is ready to execute the plan. The agent observes that the table and can of paint are white and the chair is black. It then executes Paint(Chair, Can1). At this point a classical planner would declare victory; the plan has been executed. But an online execution monitoring agent needs to check that the action succeeded.

Suppose the agent perceives that the chair is a mottled gray because the black paint is showing through. The agent then needs to figure out a recovery position in the plan to aim for and a repair action sequence to get there. The agent notices that the current state is identical to the precondition before the Paint(Chair, Can1) action, so the agent chooses the empty sequence for repair and makes its plan be the same [Paint] sequence that it just attempted. With this new plan in place, execution monitoring resumes, and the Paint action is retried. This behavior will loop until the chair is perceived to be completely painted. But notice that the loop is created by a process of plan-execute-replan, rather than by an explicit loop in a plan. Note also that the original plan need not cover every contingency. If the agent reaches the step marked REPLAN, it can then generate a new plan (perhaps involving Can2).
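The plan-execute-replan cycle described above can be sketched as a small loop. Everything here is an assumption made for illustration: states are sets of string literals, actions are dictionaries with `pre`, `add`, and `del` sets, `flaky_world` simulates paint that only covers on the third coat, and `trivial_replanner` stands in for a real planner.

```python
def expected_result(state, action):
    """State predicted by the agent's model of `action`."""
    return (state - action["del"]) | action["add"]

def execute_with_monitoring(state, plan, goal, execute, replan, max_steps=10):
    plan, trace = list(plan), []
    for _ in range(max_steps):
        if goal <= state:
            return state, trace
        # Action monitoring: the next step's preconditions must still hold.
        if not plan or not plan[0]["pre"] <= state:
            plan = list(replan(state, goal))
            if not plan:
                raise RuntimeError("no repair found")
        action = plan.pop(0)
        predicted = expected_result(state, action)
        state = execute(state, action)      # the world's actual response
        trace.append(action["name"])
        if state != predicted:              # the action did not do what the model said
            plan = list(replan(state, goal))
    raise RuntimeError("step budget exhausted")

PAINT = {"name": "Paint(Chair, Can1)",
         "pre": {"Open(Can1)", "Color(Can1, White)"},
         "add": {"Color(Chair, White)"},
         "del": {"Color(Chair, Black)"}}

attempts = {"n": 0}
def flaky_world(state, action):
    # Hypothetical environment: the first two coats leave the state unchanged.
    attempts["n"] += 1
    return state if attempts["n"] < 3 else expected_result(state, action)

def trivial_replanner(state, goal):
    # Stand-in planner: the empty repair followed by the same Paint step.
    return [PAINT]

start = {"Color(Table, White)", "Color(Can1, White)",
         "Color(Chair, Black)", "Open(Can1)"}
final, trace = execute_with_monitoring(
    start, [PAINT], {"Color(Chair, White)"}, flaky_world, trivial_replanner)
print(trace)
```

The repetition of Paint in `trace` comes from the monitoring loop, not from any loop construct inside the plan itself.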
Action monitoring is a simple method of execution monitoring, but it can sometimes lead
to less than intelligent behavior. For example, suppose there is no black or white paint, and
the agent constructs a plan to solve the painting problem by painting both the chair and table
red. Suppose that there is only enough red paint for the chair. With action monitoring, the agent would go ahead and paint the chair red, then notice that it is out of paint and cannot
paint the table, at which point it would replan a repair—perhaps painting both chair and table
green. A plan-monitoring agent can detect failure whenever the current state is such that the
remaining plan no longer works. Thus, it would not waste time painting the chair red.
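One way to picture the check a plan-monitoring agent performs is to regress the goal backward through the remaining plan, collecting every precondition that no earlier remaining step supplies, and then to test that set against the current state. The sketch below assumes positive literals only and ignores delete effects during regression; the fluents `HavePortion1` and `HavePortion2` are a hypothetical way of modeling "enough red paint for one object".

```python
def remaining_plan_preconditions(plan, goal):
    """Regress the goal through the plan, last step first. The result is the
    set of literals that must hold now for the remaining plan to succeed.
    Simplification: delete effects are not checked against needed literals."""
    needed = set(goal)
    for action in reversed(plan):
        needed = (needed - action["add"]) | action["pre"]
    return needed

paint_chair = {"name": "Paint(Chair, Red)", "pre": {"HavePortion1"},
               "add": {"Red(Chair)"}, "del": {"HavePortion1"}}
paint_table = {"name": "Paint(Table, Red)", "pre": {"HavePortion2"},
               "add": {"Red(Table)"}, "del": {"HavePortion2"}}
plan = [paint_chair, paint_table]
goal = {"Red(Chair)", "Red(Table)"}

state = {"HavePortion1"}            # only enough red paint for one object

needed = remaining_plan_preconditions(plan, goal)
action_monitor_ok = plan[0]["pre"] <= state   # looks only at the next step
plan_monitor_ok = needed <= state             # looks at the whole remaining plan
print(action_monitor_ok, plan_monitor_ok)
```

Action monitoring would let the first Paint go ahead, whereas the plan-level check fails before any paint is used.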
Plan monitoring achieves this by checking the preconditions for success of the entire remaining plan, that is, the preconditions of each step in the plan, except those preconditions that are achieved by another step in the remaining plan. Plan monitoring cuts off execution of a doomed plan as soon as possible, rather than continuing until the failure actually occurs.⁵ Plan monitoring also allows for serendipity (accidental success). If someone comes along and paints the table red at the same time that the agent is painting the chair red, then the final plan preconditions are satisfied (the goal has been achieved), and the agent can go home early.

It is straightforward to modify a planning algorithm so that each action in the plan is annotated with the action’s preconditions, thus enabling action monitoring. It is slightly more complex to enable plan monitoring. Partial-order planners have the advantage that they have already built up structures that contain the relations necessary for plan monitoring. Augmenting state-space planners with the necessary annotations can be done by careful bookkeeping as the goal fluents are regressed through the plan.

Now that we have described a method for monitoring and replanning, we need to ask, “Does it work?” This is a surprisingly tricky question. If we mean, “Can we guarantee that the agent will always achieve the goal?” then the answer is no, because the agent could inadvertently arrive at a dead end from which there is no repair. For example, the vacuum agent might have a faulty model of itself and not know that its batteries can run out. Once they do, it cannot repair any plans. If we rule out dead ends (that is, assume that there exists a plan to reach the goal from any state in the environment) and assume that the environment is really nondeterministic, in the sense that such a plan always has some chance of success on any given execution attempt, then the agent will eventually reach the goal.
Trouble occurs when a seemingly nondeterministic action is not actually random, but rather depends on some precondition that the agent does not know about. For example, sometimes a paint can may be empty, so painting from that can has no effect. No amount of retrying is going to change this.⁶ One solution is to choose randomly from among the set of possible repair plans, rather than to try the same one each time. In this case, the repair plan of opening another can might work. A better approach is to learn a better model. Every prediction failure is an opportunity for learning; an agent should be able to modify its model of the world to accord with its percepts. From then on, the replanner will be able to come up with a repair that gets at the root problem, rather than relying on luck to choose a good repair.
11.6 Time, Schedules, and Resources

Classical planning talks about what to do, in what order, but does not talk about time: how long an action takes and when it occurs. For example, in the airport domain we could produce a plan saying what planes go where, carrying what, but could not specify departure and arrival times. This is the subject matter of scheduling.

The real world also imposes resource constraints: an airline has a limited number of staff, and staff who are on one flight cannot be on another at the same time. This section introduces techniques for planning and scheduling problems with resource constraints.
5 Plan monitoring means that finally, after 374 pages, we have an agent that is smarter than a dung beetle (see page 41). A plan-monitoring agent would notice that the dung ball was missing from its grasp and would replan to get another ball and plug its hole.
6 Futile repetition of a plan repair is exactly the behavior exhibited by the sphex wasp (page 41).
Jobs({AddEngine1 < AddWheels1 < Inspect1},
     {AddEngine2 < AddWheels2 < Inspect2})
Resources(EngineHoists(1), WheelStations(1), Inspectors(2), LugNuts(500))
Action(AddEngine1, DURATION:30,
    USE:EngineHoists(1))
Action(AddEngine2, DURATION:60,
    USE:EngineHoists(1))
Action(AddWheels1, DURATION:30,
    CONSUME:LugNuts(20), USE:WheelStations(1))
Action(AddWheels2, DURATION:15,
    CONSUME:LugNuts(20), USE:WheelStations(1))
Action(Inspect_i, DURATION:10,
    USE:Inspectors(1))
Figure 11.13 A job-shop scheduling problem for assembling two cars, with resource constraints. The notation A < B means that action A must precede action B.

The approach we take is “plan first, schedule later”: divide the overall problem into a planning phase in which actions are selected, with some ordering constraints, to meet the goals of the problem, and a later scheduling phase, in which temporal information is added to the plan to ensure that it meets resource and deadline constraints. This approach is common in real-world manufacturing and logistical settings, where the planning phase is sometimes automated, and sometimes performed by human experts.

11.6.1 Representing temporal and resource constraints
A typical job-shop scheduling problem (see Section 6.1.2) consists of a set of jobs, each of which has a collection of actions with ordering constraints among them. Each action has a duration and a set of resource constraints required by the action. A constraint specifies a type of resource (e.g., bolts, wrenches, or pilots), the number of that resource required, and whether that resource is consumable (e.g., the bolts are no longer available for use) or reusable (e.g., a pilot is occupied during a flight but is available again when the flight is over). Actions can also produce resources (e.g., manufacturing and resupply actions).

A solution to a job-shop scheduling problem specifies the start times for each action and must satisfy all the temporal ordering constraints and resource constraints. As with search and planning problems, solutions can be evaluated according to a cost function; this can be quite complicated, with nonlinear resource costs, time-dependent delay costs, and so on. For simplicity, we assume that the cost function is just the total duration of the plan, which is called the makespan.

Figure 11.13 shows a simple example: a problem involving the assembly of two cars. The problem consists of two jobs, each of the form [AddEngine, AddWheels, Inspect]. Then the Resources statement declares that there are four types of resources, and gives the number of each type available at the start: 1 engine hoist, 1 wheel station, 2 inspectors, and 500 lug nuts. The action schemas give the duration and resource needs of each action. The lug nuts are consumed as wheels are added to the car, whereas the other resources are “borrowed” at the start of an action and released at the action’s end.
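For readers who prefer to see the problem of Figure 11.13 as data, here is one possible encoding. The `Action` class, its field names, and the helper `total_consumption` are assumptions of this sketch rather than notation from the text; the numbers are those given in the figure.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    duration: int                                # minutes
    use: dict = field(default_factory=dict)      # reusable: held, then released
    consume: dict = field(default_factory=dict)  # consumable: used up for good

RESOURCES = {"EngineHoists": 1, "WheelStations": 1, "Inspectors": 2, "LugNuts": 500}

# Each job lists its actions in the required order.
JOBS = [
    [Action("AddEngine1", 30, use={"EngineHoists": 1}),
     Action("AddWheels1", 30, use={"WheelStations": 1}, consume={"LugNuts": 20}),
     Action("Inspect1", 10, use={"Inspectors": 1})],
    [Action("AddEngine2", 60, use={"EngineHoists": 1}),
     Action("AddWheels2", 15, use={"WheelStations": 1}, consume={"LugNuts": 20}),
     Action("Inspect2", 10, use={"Inspectors": 1})],
]

def total_consumption(jobs):
    """Sum the consumable resources needed by all actions in all jobs."""
    totals = {}
    for job in jobs:
        for action in job:
            for resource, amount in action.consume.items():
                totals[resource] = totals.get(resource, 0) + amount
    return totals

consumed = total_consumption(JOBS)
enough = all(RESOURCES[r] >= n for r, n in consumed.items())
job_durations = [sum(a.duration for a in job) for job in JOBS]
print(consumed, enough, job_durations)
```

Keeping `use` and `consume` separate mirrors the distinction between reusable and consumable resources: only the latter reduce the pool permanently.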
The representation of resources as numerical quantities, such as Inspectors(2), rather than as named entities, such as Inspector(I1) and Inspector(I2), is an example of a technique called aggregation: grouping individual objects into quantities when the objects are all indistinguishable. In our assembly problem, it does not matter which inspector inspects the car, so there is no need to make the distinction. Aggregation is essential for reducing complexity. Consider what happens when a proposed schedule has 10 concurrent Inspect actions but only 9 inspectors are available. With inspectors represented as quantities, a failure is detected immediately and the algorithm backtracks to try another schedule. With inspectors represented as individuals, the algorithm would try all 9! ways of assigning inspectors to actions before noticing that none of them work.

11.6.2 Solving scheduling problems
We begin by considering just the temporal scheduling problem, ignoring resource constraints. To minimize makespan (plan duration), we must find the earliest start times for all the actions consistent with the ordering constraints supplied with the problem. It is helpful to view these ordering constraints as a directed graph relating the actions, as shown in Figure 11.14. We can apply the critical path method (CPM) to this graph to determine the possible start and end times of each action. A path through a graph representing a partial-order plan is a linearly ordered sequence of actions beginning with Start and ending with Finish. (For example, there are two paths in the partial-order plan in Figure 11.14.)

The critical path is that path whose total duration is longest; the path is “critical” because it determines the duration of the entire plan: shortening other paths doesn’t shorten the plan as a whole, but delaying the start of any action on the critical path slows down the whole plan. Actions that are off the critical path have a window of time in which they can be executed. The window is specified in terms of an earliest possible start time, ES, and a latest possible start time, LS. The quantity LS − ES is known as the slack of an action. We can see in Figure 11.14 that the whole plan will take 85 minutes, that each action in the top job has 15 minutes of slack, and that each action on the critical path has no slack (by definition). Together the ES and LS times for all the actions constitute a schedule for the problem.
The following formulas define ES and LS and constitute a dynamic-programming algorithm to compute them. A and B are actions, and A < B means that A precedes B:

$$
\begin{aligned}
ES(\mathit{Start}) &= 0 \\
ES(B) &= \max_{A < B} \; ES(A) + \mathit{Duration}(A) \\
LS(\mathit{Finish}) &= ES(\mathit{Finish}) \\
LS(A) &= \min_{B > A} \; LS(B) - \mathit{Duration}(A).
\end{aligned}
$$
The idea is that we start by assigning ES(Start) to be 0. Then, as soon as we get an action
B such that all the actions that come immediately before B have ES values assigned, we
set ES(B) to be the maximum of the earliest finish times of those immediately preceding
actions, where the earliest finish time of an action is defined as the earliest start time plus
the duration. This process repeats until every action has been assigned an ES value. The LS values are computed in a similar manner, working backward from the Finish action.
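The forward and backward passes just described can be sketched in a few lines. The function name `critical_path`, the edge-list encoding of the ordering constraints, and the zero-duration Start and Finish nodes are choices made for this example; the durations are those of Figure 11.13, and the sketch assumes Finish is the only action without successors.

```python
def critical_path(durations, precedes):
    """durations: action -> minutes; precedes: (A, B) pairs with A before B.
    Returns (ES, LS) dictionaries."""
    preds = {a: [] for a in durations}
    succs = {a: [] for a in durations}
    for a, b in precedes:
        preds[b].append(a)
        succs[a].append(b)
    # Topological order (Kahn's algorithm).
    order, waiting = [], {a: len(preds[a]) for a in durations}
    ready = [a for a in durations if waiting[a] == 0]
    while ready:
        a = ready.pop()
        order.append(a)
        for b in succs[a]:
            waiting[b] -= 1
            if waiting[b] == 0:
                ready.append(b)
    # Forward pass: ES(B) = max over predecessors A of ES(A) + Duration(A).
    es = {}
    for b in order:
        es[b] = max((es[a] + durations[a] for a in preds[b]), default=0)
    # Backward pass: LS(Finish) = ES(Finish); LS(A) = min LS(B) - Duration(A).
    ls = {}
    for a in reversed(order):
        if succs[a]:
            ls[a] = min(ls[b] for b in succs[a]) - durations[a]
        else:
            ls[a] = es[a]
    return es, ls

durations = {"Start": 0, "AddEngine1": 30, "AddWheels1": 30, "Inspect1": 10,
             "AddEngine2": 60, "AddWheels2": 15, "Inspect2": 10, "Finish": 0}
precedes = [("Start", "AddEngine1"), ("AddEngine1", "AddWheels1"),
            ("AddWheels1", "Inspect1"), ("Inspect1", "Finish"),
            ("Start", "AddEngine2"), ("AddEngine2", "AddWheels2"),
            ("AddWheels2", "Inspect2"), ("Inspect2", "Finish")]

es, ls = critical_path(durations, precedes)
slack = {a: ls[a] - es[a] for a in durations}
print(es["Finish"], slack)
```

On this data the makespan `es["Finish"]` is 85 minutes, the three actions of the first job each have 15 minutes of slack, and the second job has none, in agreement with the values quoted for Figure 11.14.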
The complexity of the critical path algorithm is just O(Nb), where N is the number of actions and b is the maximum branching factor into or out of an action. (To see this, note that the LS and ES computations are done once for each action, and each computation iterates over at most b other actions.) Therefore, finding a minimum-duration schedule, given a partial ordering on the actions and no resource constraints, is quite easy.

Figure 11.14 Top: a representation of the temporal constraints for the job-shop scheduling problem of Figure 11.13. The duration of each action is given at the bottom of each rectangle. In solving the problem, we compute the earliest and latest start times as the pair [ES, LS], displayed in the upper left. The difference between these two numbers is the slack of an action; actions with zero slack are on the critical path, shown with bold arrows. Bottom: the same solution shown as a timeline. Gray rectangles represent time intervals during which an action may be executed, provided that the ordering constraints are respected. The unoccupied portion of a gray rectangle indicates the slack.
Mathematically speaking, critical-path problems are easy to solve because they are defined as a conjunction of linear inequalities on the start and end times. When we introduce resource constraints, the resulting constraints on start and end times become more complicated. For example, the AddEngine actions, which begin at the same time in Figure 11.14, require the same EngineHoist and so cannot overlap. The “cannot overlap” constraint is a disjunction of two linear inequalities, one for each possible ordering. The introduction of disjunctions turns out to make scheduling with resource constraints NP-hard.
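For the single engine hoist, the disjunction has just two branches, so both can be tried by adding one extra ordering edge and recomputing the longest path. This is a sketch under simplifying assumptions: the function `makespan` and the edge-list encoding are illustrative, and the wheel-station and inspector limits are not checked here (on this data they happen not to bind in either branch).

```python
from functools import lru_cache

DURATION = {"AddEngine1": 30, "AddWheels1": 30, "Inspect1": 10,
            "AddEngine2": 60, "AddWheels2": 15, "Inspect2": 10}
ORDERING = [("AddEngine1", "AddWheels1"), ("AddWheels1", "Inspect1"),
            ("AddEngine2", "AddWheels2"), ("AddWheels2", "Inspect2")]

def makespan(precedes):
    """Length of the longest path through the precedence graph."""
    preds = {a: [] for a in DURATION}
    for a, b in precedes:
        preds[b].append(a)

    @lru_cache(maxsize=None)
    def earliest_start(b):
        return max((earliest_start(a) + DURATION[a] for a in preds[b]), default=0)

    return max(earliest_start(a) + DURATION[a] for a in DURATION)

# The "cannot overlap" constraint on the hoist, with S(x) the start time of x:
#   S(AddEngine1) + 30 <= S(AddEngine2)   OR   S(AddEngine2) + 60 <= S(AddEngine1)
options = {
    "AddEngine1 first": ORDERING + [("AddEngine1", "AddEngine2")],
    "AddEngine2 first": ORDERING + [("AddEngine2", "AddEngine1")],
}
results = {label: makespan(edges) for label, edges in options.items()}
print(results)
```

With k pairwise conflicts there are up to 2^k such branches to consider, which gives some intuition for why the disjunctions make the general problem hard.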
Figure 11.15 shows the solution with the fastest completion time, 115 minutes. This is 30 minutes longer than the 85 minutes required for a schedule without resource constraints. Notice that there is no time at which both inspectors are required, so we can immediately move one of our two inspectors to a more productive position.

Figure 11.15 A solution to the job-shop scheduling problem from Figure 11.13, taking into account resource constraints. The left-hand margin lists the three reusable resources, and actions are shown aligned horizontally with the resources they use. There are two possible schedules, depending on which assembly uses the engine hoist first; we’ve shown the shortest-duration solution, which takes 115 minutes.

There is a long history of work on optimal scheduling. A challenge problem posed in 1963 (to find the optimal schedule for a problem involving just 10 machines and 10 jobs of 100 actions each) went unsolved for 23 years (Lawler et al., 1993). Many approaches have been tried, including branch-and-bound, simulated annealing, tabu search, and constraint satisfaction. One popular approach is the minimum slack heuristic: on each iteration, schedule for the earliest possible start whichever unscheduled action has all its predecessors scheduled and has the least slack; then update the ES and LS times for each affected action and repeat. This greedy heuristic resembles the minimum-remaining-values (MRV) heuristic in constraint satisfaction. It often works well in practice, but for our assembly problem it yields a 130-minute solution, not the 115-minute solution of Figure 11.15.
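A minimal sketch of the minimum slack heuristic on the assembly problem follows. Several details are assumptions of this example rather than part of the text: each action uses one unit of a single reusable resource, lug nuts are ignored because 40 of 500 can never run out, and ES and LS are simply recomputed from scratch after every scheduling decision.

```python
DUR = {"AddEngine1": 30, "AddWheels1": 30, "Inspect1": 10,
       "AddEngine2": 60, "AddWheels2": 15, "Inspect2": 10}
PRED = {"AddEngine1": [], "AddWheels1": ["AddEngine1"], "Inspect1": ["AddWheels1"],
        "AddEngine2": [], "AddWheels2": ["AddEngine2"], "Inspect2": ["AddWheels2"]}
USE = {"AddEngine1": "EngineHoists", "AddEngine2": "EngineHoists",
       "AddWheels1": "WheelStations", "AddWheels2": "WheelStations",
       "Inspect1": "Inspectors", "Inspect2": "Inspectors"}
CAPACITY = {"EngineHoists": 1, "WheelStations": 1, "Inspectors": 2}
ORDER = list(DUR)   # the insertion order above is already topological

def earliest_feasible(action, lower_bound, start):
    """Earliest time >= lower_bound at which `action` fits alongside the
    already scheduled users of its resource."""
    users = [(start[a], start[a] + DUR[a]) for a in start if USE[a] == USE[action]]
    candidates = sorted({lower_bound} | {e for _, e in users if e > lower_bound})
    for t in candidates:
        # Usage can only rise at t itself or where another user starts.
        points = [t] + [s for s, _ in users if t < s < t + DUR[action]]
        if all(sum(s <= p < e for s, e in users) < CAPACITY[USE[action]]
               for p in points):
            return t

def min_slack_schedule():
    start = {}
    while len(start) < len(DUR):
        es = {}
        for a in ORDER:                      # forward pass
            if a in start:
                es[a] = start[a]
            else:
                free_at = max((es[p] + DUR[p] for p in PRED[a]), default=0)
                es[a] = earliest_feasible(a, free_at, start)
        horizon = max(es[a] + DUR[a] for a in DUR)
        ls = {}
        for a in reversed(ORDER):            # backward pass
            succs = [b for b in DUR if a in PRED[b]]
            ls[a] = min((ls[b] for b in succs), default=horizon) - DUR[a]
        ready = [a for a in DUR
                 if a not in start and all(p in start for p in PRED[a])]
        chosen = min(ready, key=lambda a: ls[a] - es[a])
        start[chosen] = es[chosen]
    return start

schedule = min_slack_schedule()
total = max(schedule[a] + DUR[a] for a in DUR)
print(schedule, total)
```

The first decision is the telling one: AddEngine2 has zero slack and AddEngine1 has 15 minutes, so the greedy rule gives the hoist to AddEngine2, which pushes the first car's assembly back and leads to the 130-minute schedule.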
Up to this point, we have assumed that the set of actions and ordering constraints is fixed. Under these assumptions, every scheduling problem can be solved by a nonoverlapping sequence that avoids all resource conflicts, provided that each action is feasible by itself. However, if a scheduling problem is proving very difficult, it may not be a good idea to solve it this way; it may be better to reconsider the actions and constraints, in case that leads to a much easier scheduling problem. Thus, it makes sense to integrate planning and scheduling by taking into account durations and overlaps during the construction of a plan. Several of the planning algorithms in Section 11.2 can be augmented to handle this information.
11.7
Analysis of Planning Approaches
Planning combines the two major areas of AI we have covered so far: search and logic. A planner can be seen either as a program that searches for a solution or as one that (constructively) proves the existence of a solution. The cross-fertilization of ideas from the two areas has allowed planners to scale up from toy problems where the number of actions and states was limited to around a dozen, to real-world industrial applications with millions of states and thousands of actions.

Planning is foremost an exercise in controlling combinatorial explosion. If there are n propositions in a domain, then there are 2^n states. Against such pessimism, the identification of independent subproblems can be a powerful weapon. In the best case (full decomposability of the problem) we get an exponential speedup. Decomposability is destroyed, however, by negative interactions between actions. SATPLAN can encode logical relations between subproblems. Forward search addresses the problem heuristically by trying to find patterns (subsets of propositions) that cover the independent subproblems. Since this approach is heuristic, it can work even when the subproblems are not completely independent.

Unfortunately, we do not yet have a clear understanding of which techniques work best on which kinds of problems. Quite possibly, new techniques will emerge, perhaps providing a synthesis of highly expressive first-order and hierarchical representations with the highly efficient factored and propositional representations that dominate today. We are seeing examples of portfolio planning systems, where a collection of algorithms are available to apply to any given problem. This can be done selectively (the system classifies each new problem to choose the best algorithm for it), or in parallel (all the algorithms run concurrently, each on a different CPU), or by interleaving the algorithms according to a schedule.
Summary

In this chapter, we described the PDDL representation for both classical and extended planning problems, and presented several algorithmic approaches for finding solutions. The points to remember:
• Planning systems are problem-solving algorithms that operate on explicit factored representations of states and actions. These representations make possible the derivation of effective domain-independent heuristics and the development of powerful and flexible algorithms for solving problems.
• PDDL, the Planning Domain Definition Language, describes the initial and goal states as conjunctions of literals, and actions in terms of their preconditions and effects. Extensions represent time, resources, percepts, contingent plans, and hierarchical plans.
• State-space search can operate in the forward direction (progression) or the backward direction (regression). Effective heuristics can be derived by subgoal independence assumptions and by various relaxations of the planning problem.
• Other approaches include encoding a planning problem as a Boolean satisfiability problem or as a constraint satisfaction problem; and explicitly searching through the space of partially ordered plans.
• Hierarchical task network (HTN) planning allows the agent to take advice from the domain designer in the form of high-level actions (HLAs) that can be implemented in various ways by lower-level action sequences. The effects of HLAs can be defined with angelic semantics, allowing provably correct high-level plans to be derived without consideration of lower-level implementations. HTN methods can create the very large plans required by many real-world applications.
• Contingent plans allow the agent to sense the world during execution to decide what branch of the plan to follow. In some cases, sensorless or conformant planning can be used to construct a plan that works without the need for perception. Both conformant and contingent plans can be constructed by search in the space of belief states. Efficient representation or computation of belief states is a key problem.
• An online planning agent uses execution monitoring and splices in repairs as needed to recover from unexpected situations, which can be due to nondeterministic actions, exogenous events, or incorrect models of the environment.
• Many actions consume resources, such as money, gas, or raw materials. It is convenient to treat these resources as numeric measures in a pool rather than try to reason about, say, each individual coin and bill in the world. Time is one of the most important resources. It can be handled by specialized scheduling algorithms, or scheduling can be integrated with planning.
• This chapter extends classical planning to cover nondeterministic environments (where outcomes of actions are uncertain), but it is not the last word on planning. Chapter 17 describes techniques for stochastic environments (in which outcomes of actions have probabilities associated with them): Markov decision processes, partially observable Markov decision processes, and game theory. In Chapter 22 we show that reinforcement learning allows an agent to learn how to behave from past successes and failures.

Bibliographical and Historical Notes
AI planning arose from investigations into state-space search, theorem proving, and control theory. STRIPS (Fikes and Nilsson, 1971, 1993), the first major planning system, was designed as the planner for the Shakey robot at SRI. The first version of the program ran on a computer with only 192 KB of memory. Its overall control structure was modeled on GPS, the General Problem Solver (Newell and Simon, 1961), a state-space search system that used means-ends analysis.

The STRIPS representation language evolved into the Action Description Language, or ADL (Pednault, 1986), and then the Planning Domain Definition Language, or PDDL (Ghallab et al., 1998), which has been used for the International Planning Competition since 1998. The most recent version is PDDL 3.1 (Kovacs, 2011).
Planners in the early 1970s decomposed problems by computing a subplan for each subgoal and then stringing the subplans together in some order. This approach, called linear planning by Sacerdoti (1975), was soon discovered to be incomplete. It cannot solve some very simple problems, such as the Sussman anomaly (see Exercise 11.SUSS), found by Allen Brown during experimentation with the HACKER system (Sussman, 1975). A complete planner must allow for interleaving of actions from different subplans within a single sequence. Warren’s (1974) WARPLAN system achieved that, and demonstrated how the logic programming language Prolog can produce concise programs; WARPLAN is only 100 lines of code.
Partial-order planning dominated the next 20 years of research, with theoretical work describing the detection of conflicts (Tate, 1975a) and the protection of achieved conditions (Sussman, 1975), and implementations including NOAH (Sacerdoti, 1977) and NONLIN (Tate, 1977). That led to formal models (Chapman, 1987; McAllester and Rosenblitt, 1991) that allowed for theoretical analysis of various algorithms and planning problems, and to a widely distributed system, UCPOP (Penberthy and Weld, 1992).

Drew McDermott suspected that the emphasis on partial-order planning was crowding out other techniques that should perhaps be reconsidered now that computers had 100 times the memory of Shakey’s day. His UNPOP (McDermott, 1996) was a state-space planning program employing the ignore-delete-list heuristic. HSP, the Heuristic Search Planner (Bonet and Geffner, 1999; Haslum, 2006) made state-space search practical for large planning problems. The FF or Fast Forward planner (Hoffmann, 2001; Hoffmann and Nebel, 2001; Hoffmann, 2005) and the Fast Downward variant (Helmert, 2006) won international planning competitions in the 2000s.
Bidirectional search (see Section 3.4.5) has also been known to suffer from a lack of heuristics, but some success has been obtained by using backward search to create a perimeter around the goal, and then refining a heuristic to search forward towards that perimeter (Torralba et al., 2016). The SYMBA* bidirectional search planner (Torralba et al., 2016) won the 2016 competition.
Researchers turned to PDDL and the planning paradigm so that they could use domain-independent heuristics. Hoffmann (2005) analyzes the search space of the ignore-delete-list heuristic. Edelkamp (2009) and Haslum et al. (2007) describe how to construct pattern databases for planning heuristics. Felner et al. (2004) show encouraging results using pattern databases for sliding-tile puzzles, which can be thought of as a planning domain, but Hoffmann et al. (2006) show some limitations of abstraction for classical planning problems. Rintanen (2012) discusses planning-specific variable-selection heuristics for SAT solving.

Helmert et al. (2011) describe the Fast Downward Stone Soup (FDSS) system, a portfolio planner that, as in the fable of stone soup, invites us to throw in as many planning algorithms as possible. The system maintains a set of training problems, and for each problem and each algorithm records the run time and resulting plan cost of the problem’s solution. Then when faced with a new problem, it uses the past experience to decide which algorithm(s) to try, with what time limits, and takes the solution with minimal cost. FDSS was a winner in the 2018 International Planning Competition (Seipp and Röger, 2018). Seipp et al. (2015) describe a machine learning approach to automatically learn a good portfolio, given a new problem. Vallati et al. (2015) give an overview of portfolio planning. The idea of algorithm portfolios for combinatorial search problems goes back to Gomes and Selman (2001).

Sistla and Godefroid (2004) cover symmetry reduction, and Godefroid (1990) covers heuristics for partial ordering. Richter and Helmert (2009) demonstrate the efficiency gains of forward pruning using preferred actions.
Blum and Furst (1997) revitalized the field of planning with their Graphplan system, which was orders of magnitude faster than the partial-order planners of the time. Bryce and Kambhampati (2007) give an overview of planning graphs. The use of situation calculus for planning was introduced by John McCarthy (1963) and refined by Ray Reiter (2001).
Kautz et al. (1996) investigated various ways to propositionalize action schemas, finding that the most compact forms did not necessarily lead to the fastest solution times. A systematic analysis was carried out by Ernst et al. (1997), who also developed an automatic “compiler” for generating propositional representations from PDDL problems. The BLACKBOX planner, which combines ideas from Graphplan and SATPLAN, was developed by Kautz and Selman (1998). Planners based on constraint satisfaction include CPLAN (van Beek and Chen, 1999) and GP-CSP (Do and Kambhampati, 2003).
There has also been interest in the representation of a plan as a binary decision diagram (BDD), a compact data structure for Boolean expressions widely studied in the hardware verification community (Clarke and Grumberg, 1987; McMillan, 1993). There are techniques for proving properties of binary decision diagrams, including the property of being a solution to a planning problem. Cimatti et al. (1998) present a planner based on this approach. Other representations have also been used, such as integer programming (Vossen et al., 2001).
There are some interesting comparisons of the various approaches to planning. Helmert (2001) analyzes several classes of planning problems, and shows that constraint-based approaches such as Graphplan and SATPLAN are best for NP-hard domains, while search-based approaches do better in domains where feasible solutions can be found without backtracking. Graphplan and SATPLAN have trouble in domains with many objects because that means they must create many actions. In some cases the problem can be delayed or avoided by generating the propositionalized actions dynamically, only as needed, rather than instantiating them all before the search begins.
Chapter 11 Automated Planning
The first mechanism for hierarchical planning was a facility in the STRIPS program for learning macrops, “macro-operators” consisting of a sequence of primitive steps (Fikes et al., 1972). The ABSTRIPS system (Sacerdoti, 1974) introduced the idea of an abstraction hierarchy, whereby planning at higher levels was permitted to ignore lower-level preconditions of actions in order to derive the general structure of a working plan. Austin Tate’s Ph.D. thesis (1975b) and work by Earl Sacerdoti (1977) developed the basic ideas of HTN planning. Erol, Hendler, and Nau (1994, 1996) present a complete hierarchical decomposition planner as well as a range of complexity results for pure HTN planners. Our presentation of HLAs and angelic semantics is due to Marthi et al. (2007, 2008).
One of the goals of hierarchical planning has been the reuse of previous planning experience in the form of generalized plans. The technique of explanation-based learning has been used as a means of generalizing previously computed plans in systems such as SOAR (Laird et al., 1986) and PRODIGY (Carbonell et al., 1989). An alternative approach is to store previously computed plans in their original form and then reuse them to solve new, similar problems by analogy to the original problem. This is the approach taken by the field called case-based planning (Carbonell, 1983; Alterman, 1988). Kambhampati (1994) argues that case-based planning should be analyzed as a form of refinement planning and provides a formal foundation for case-based partial-order planning.
Early planners lacked conditionals and loops, but some could use coercion to form conformant plans. Sacerdoti’s NOAH solved the “keys and boxes” problem (in which the planner knows little about the initial state) using coercion. Mason (1993) argued that sensing often can and should be dispensed with in robotic planning, and described a sensorless plan that can move a tool into a specific position on a table by a sequence of tilting actions, regardless of the initial position.
Goldman and Boddy (1996) introduced the term conformant planning, noting that sensorless plans are often effective even if the agent has sensors. The first moderately efficient conformant planner was Smith and Weld's (1998) Conformant Graphplan (CGP). Ferraris and Giunchiglia (2000) and Rintanen (1999) independently developed SATPLAN-based conformant planners. Bonet and Geffner (2000) describe a conformant planner based on heuristic search in the space of belief states, drawing on ideas first developed in the 1960s for partially observable Markov decision processes, or POMDPs (see Chapter 17).
Currently, there are three main approaches to conformant planning. The first two use heuristic search in belief-state space: HSCP (Bertoli et al., 2001a) uses binary decision diagrams (BDDs) to represent belief states, whereas Hoffmann and Brafman (2006) adopt the lazy approach of computing precondition and goal tests on demand using a SAT solver. The third approach, championed primarily by Jussi Rintanen (2007), formulates the entire sensorless planning problem as a quantified Boolean formula (QBF) and solves it using a general-purpose QBF solver. Current conformant planners are five orders of magnitude faster than CGP. The winner of the 2006 conformant-planning track at the International Planning Competition was T0 (Palacios and Geffner, 2007), which uses heuristic search in belief-state space while keeping the belief-state representation simple by defining derived literals that cover conditional effects. Bryce and Kambhampati (2007) discuss how a planning graph can be generalized to generate good heuristics for conformant and contingent planning.
The contingent-planning approach described in the chapter is based on Hoffmann and Brafman (2005), and was influenced by the efficient search algorithms for cyclic AND-OR graphs developed by Jimenez and Torras (2000) and Hansen and Zilberstein (2001).
The problem of contingent planning received more attention after the publication of Drew McDermott’s (1978a) influential article, Planning and Acting. Bertoli et al. (2001b) describe MBP (Model-Based Planner), which uses binary decision diagrams to do conformant and contingent planning. Some authors use “conditional planning” and “contingent planning” as synonyms; others make the distinction that “conditional” refers to actions with nondeterministic effects, and “contingent” means using sensing to overcome partial observability.
In retrospect, it is now possible to see how the major classical planning algorithms led to extended versions for uncertain domains. Fast-forward heuristic search through state space led to forward search in belief space (Bonet and Geffner, 2000; Hoffmann and Brafman, 2005); SATPLAN led to stochastic SATPLAN (Majercik and Littman, 2003) and to planning with quantified Boolean logic (Rintanen, 2007); partial order planning led to UWL (Etzioni et al., 1992) and CNLP (Peot and Smith, 1992); Graphplan led to Sensory Graphplan or SGP (Weld et al., 1998).
The first online planner with execution monitoring was PLANEX (Fikes et al., 1972), which worked with the STRIPS planner to control the robot Shakey. SIPE (System for Interactive Planning and Execution monitoring) (Wilkins, 1988) was the first planner to deal systematically with the problem of replanning. It has been used in demonstration projects in several domains, including planning operations on the flight deck of an aircraft carrier, job-shop scheduling for an Australian beer factory, and planning the construction of multistory buildings (Kartam and Levitt, 1990).
In the mid-1980s, pessimism about the slow run times of planning systems led to the proposal of reflex agents called reactive planning systems (Brooks, 1986; Agre and Chapman, 1987). “Universal plans” (Schoppers, 1989) were developed as a lookup-table method for reactive planning, but turned out to be a rediscovery of the idea of policies that had long been used in Markov decision processes (see Chapter 17). Koenig (2001) surveys online planning techniques, under the name Agent-Centered Search.
Planning with time constraints was first dealt with by DEVISER (Vere, 1983). The representation of time in plans was addressed by Allen (1984) and by Dean et al. (1990) in the FORBIN system. NONLIN+ (Tate and Whiter, 1984) and SIPE (Wilkins, 1990) could reason about the allocation of limited resources to various plan steps. O-PLAN (Bell and Tate, 1985) has been applied to resource problems such as software procurement planning at Price Waterhouse and back-axle assembly planning at Jaguar Cars.
The two planners SAPA (Do and Kambhampati, 2001) and T4 (Haslum and Geffner, 2001) both used forward state-space search with sophisticated heuristics to handle actions with durations and resources. An alternative is to use very expressive action languages, but guide them by human-written, domain-specific heuristics, as is done by ASPEN (Fukunaga et al., 1997), HSTS (Jonsson et al., 2000), and IxTeT (Ghallab and Laruelle, 1994).
A number of hybrid planning-and-scheduling systems have been deployed: ISIS (Fox et al., 1982; Fox, 1990) has been used for job-shop scheduling at Westinghouse, GARI (Descotte and Latombe, 1985) planned the machining and construction of mechanical parts, FORBIN was used for factory control, and NONLIN+ was used for naval logistics planning.
We chose to present planning and scheduling as two separate problems; Cushing et al. (2007) show that this can lead to incompleteness on certain problems.
There is a long history of scheduling in aerospace. T-SCHED (Drabble, 1990) was used to schedule mission-command sequences for the UOSAT-II satellite. OPTIMUM-AIV (Aarup et al., 1994) and PLAN-ERS1 (Fuchs et al., 1990), both based on O-PLAN, were used for spacecraft assembly and observation planning, respectively, at the European Space Agency. SPIKE (Johnston and Adorf, 1992) was used for observation planning at NASA for the Hubble Space Telescope, while the Space Shuttle Ground Processing Scheduling System (Deale et al., 1994) does job-shop scheduling of up to 16,000 worker-shifts. Remote Agent (Muscettola et al., 1998) became the first autonomous planner-scheduler to control a spacecraft, when it flew onboard the Deep Space One probe in 1999. Space applications have driven the development of algorithms for resource allocation; see Laborie (2003) and Muscettola (2002). The literature on scheduling is presented in a classic survey article (Lawler et al., 1993), a book (Pinedo, 2008), and an edited handbook (Blazewicz et al., 2007).
The computational complexity of planning has been analyzed by several authors (Bylander, 1994; Ghallab et al., 2004; Rintanen, 2016). There are two main tasks: PlanSAT is the question of whether there exists any plan that solves a planning problem. Bounded PlanSAT asks whether there is a solution of length k or less; this can be used to find an optimal plan. Both are decidable for classical planning (because the number of states is finite). But if we add function symbols to the language, then the number of states becomes infinite, and PlanSAT becomes only semidecidable. For propositionalized problems both are in the complexity class PSPACE, a class that is larger (and hence more difficult) than NP and refers to problems that can be solved by a deterministic Turing machine with a polynomial amount of space. These theoretical results are discouraging, but in practice, the problems we want to solve tend to be not so bad. The true advantage of the classical planning formalism is that it has facilitated the development of very accurate domain-independent heuristics; other approaches have not been as fruitful.
Readings in Planning (Allen et al., 1990) is a comprehensive anthology of early work in the field. Weld (1994, 1999) provides two excellent surveys of planning algorithms of the 1990s. It is interesting to see the change in the five years between the two surveys: the first concentrates on partial-order planning, and the second introduces Graphplan and SATPLAN. Automated Planning and Acting (Ghallab et al., 2016) is an excellent textbook on all aspects of the field. LaValle’s text Planning Algorithms (2006) covers both classical and stochastic planning, with extensive coverage of robot motion planning.
Planning research has been central to AI since its inception, and papers on planning are a staple of mainstream AI journals and conferences. There are also specialized conferences such as the International Conference on Automated Planning and Scheduling and the International Workshop on Planning and Scheduling for Space.
12 QUANTIFYING UNCERTAINTY
In which we see how to tame uncertainty with numeric degrees of belief.
12.1 Acting under Uncertainty
Agents in the real world need to handle uncertainty, whether due to partial observability, nondeterminism, or adversaries. An agent may never know for sure what state it is in now or where it will end up after a sequence of actions.
We have seen problem-solving and logical agents handle uncertainty by keeping track of a belief state (a representation of the set of all possible world states that it might be in) and generating a contingency plan that handles every possible eventuality that its sensors may report during execution. This approach works on simple problems, but it has drawbacks:
• The agent must consider every possible explanation for its sensor observations, no matter how unlikely. This leads to a large belief-state full of unlikely possibilities.
• A correct contingent plan that handles every eventuality can grow arbitrarily large and must consider arbitrarily unlikely contingencies.
• Sometimes there is no plan that is guaranteed to achieve the goal, yet the agent must act. It must have some way to compare the merits of plans that are not guaranteed.
Suppose, for example, that an automated taxi has the goal of delivering a passenger to the airport on time. The taxi forms a plan, $A_{90}$, that involves leaving home 90 minutes before the flight departs and driving at a reasonable speed. Even though the airport is only 5 miles away, a logical agent will not be able to conclude with absolute certainty that “Plan $A_{90}$ will get us to the airport in time.” Instead, it reaches the weaker conclusion “Plan $A_{90}$ will get us to the airport in time, as long as the car doesn’t break down, and I don’t get into an accident, and the road isn’t closed, and no meteorite hits the car, and ... .” None of these conditions can be deduced for sure, so we can’t infer that the plan succeeds. This is the logical qualification problem (page 241), for which we so far have seen no real solution.
Nonetheless, in some sense $A_{90}$ is in fact the right thing to do. What do we mean by this? As we discussed in Chapter 2, we mean that out of all the plans that could be executed, $A_{90}$ is expected to maximize the agent’s performance measure (where the expectation is relative to the agent’s knowledge about the environment). The performance measure includes getting to the airport in time for the flight, avoiding a long, unproductive wait at the airport, and avoiding speeding tickets along the way. The agent’s knowledge cannot guarantee any of these outcomes for $A_{90}$, but it can provide some degree of belief that they will be achieved. Other plans, such as $A_{180}$, might increase the agent’s belief that it will get to the airport on time, but also increase the likelihood of a long, boring wait. The right thing to do, the rational decision, therefore depends on both the relative importance of various goals and
=28,224 entries—still a manageable number.
If we add the possibility of dirt in each of the 42 squares, the number of states is multiplied by $2^{42}$ and the transition matrix has more than $10^{29}$ entries, which is no longer a manageable number. In general, if the state is composed of $n$ discrete variables with at most $d$ values each, the corresponding HMM transition matrix will have size $O(d^{2n})$ and the per-update computation time will also be $O(d^{2n})$.
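The growth described above is easy to check with a few lines of arithmetic. The sketch below assumes that the 28,224 figure arises from 42 squares times 4 headings (168 atomic states, and 168 squared is 28,224); that decomposition and the function names are illustrative assumptions, not taken from the text.

```python
def transition_matrix_entries(num_states: int) -> int:
    """Entries in a dense num_states x num_states HMM transition matrix."""
    return num_states * num_states


def factored_state_count(n: int, d: int) -> int:
    """Atomic states needed for n discrete variables with d values each."""
    return d ** n


# Assumption: 42 squares x 4 headings = 168 states, which reproduces 28,224.
base_states = 42 * 4
print(transition_matrix_entries(base_states))

# One dirt bit per square multiplies the number of states by 2**42.
dirty_states = base_states * factored_state_count(42, 2)
print(transition_matrix_entries(dirty_states) > 10**29)

# In general the matrix has (d**n)**2 = d**(2n) entries.
print(transition_matrix_entries(factored_state_count(3, 2)) == 2 ** (2 * 3))
```

Python integers have arbitrary precision, so the comparison with 10**29 is exact rather than a floating-point estimate.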
For these reasons, although HMMs have many uses in areas ranging from speech recognition to molecular biology, they are fundamentally limited in their ability to represent complex processes. In the terminology introduced in Chapter 2, HMMs are an atomic representation: states of the world have no internal structure and are simply labeled by integers. Section 14.5 shows how to use dynamic Bayesian networks, a factored representation, to model domains with many state variables. The next section shows how to handle domains with continuous state variables, which of course lead to an infinite state space.
14.4 Kalman Filters
Imagine watching a small bird flying through dense jungle foliage at dusk: you glimpse
brief, intermittent flashes of motion; you try hard to guess where the bird is and where it will
appear next so that you don’t lose it. Or imagine that you are a World War II radar operator
peering at a faint, wandering blip that appears once every 10 seconds on the screen. Or, going
back further still, imagine you are Kepler trying to reconstruct the motions of the planets
from a collection of highly inaccurate angular observations taken at irregular and imprecisely measured intervals.
In all these cases, you are doing filtering: estimating state variables (here, the position and velocity of a moving object) from noisy observations over time. If the variables were discrete, we could model the system with a hidden Markov model. This section examines methods for handling continuous variables, using an algorithm called Kalman filtering, after one of its inventors, Rudolf Kalman.
The bird’s flight might be specified by six continuous variables at each time point; three for position $(X_t, Y_t, Z_t)$ and three for velocity $(\dot{X}_t, \dot{Y}_t, \dot{Z}_t)$. We will need suitable conditional densities to represent the transition and sensor models; as in Chapter 13, we will use linear-Gaussian distributions. This means that the next state $\mathbf{X}_{t+1}$ must be a linear function of the current state $\mathbf{X}_t$, plus some Gaussian noise, a condition that turns out to be quite reasonable in practice. Consider, for example, the X-coordinate of the bird, ignoring the other coordinates for now. Let the time interval between observations be $\Delta$, and assume constant velocity during the interval; then the position update is given by $X_{t+\Delta} = X_t + \dot{X}\,\Delta$. Adding Gaussian noise (to account for wind variation, etc.), we obtain a linear-Gaussian transition model:
$$P(X_{t+\Delta} = x_{t+\Delta} \mid X_t = x_t, \dot{X}_t = \dot{x}_t) = N(x_{t+\Delta};\; x_t + \dot{x}_t\,\Delta,\; \sigma^2).$$
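As a small illustration of this transition model, the following sketch draws the next X-coordinate from a Gaussian whose mean is the constant-velocity prediction. The function name and the numeric values are hypothetical; only the form of the model (mean $x_t + \dot{x}_t\,\Delta$, standard deviation $\sigma$) comes from the text.

```python
import random


def sample_next_position(x_t, x_dot_t, delta, sigma, rng=random):
    """Draw x_{t+delta} from N(x_t + x_dot_t * delta, sigma**2)."""
    mean = x_t + x_dot_t * delta  # constant-velocity prediction
    return rng.gauss(mean, sigma)  # add Gaussian noise (wind variation, etc.)


# Hypothetical values: position 10.0, velocity 2.0, interval 0.5, noise 0.3.
rng = random.Random(0)
samples = [sample_next_position(10.0, 2.0, 0.5, 0.3, rng) for _ in range(10000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)  # should be close to the predicted mean 10.0 + 2.0 * 0.5
```

With the noise set to zero the function returns the deterministic position update, which is a quick sanity check on the mean.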
The Bayesian network structure for a system with position vector $\mathbf{X}_t$ and velocity $\dot{\mathbf{X}}_t$ is shown in Figure 14.9. Note that this is a very specific form of linear-Gaussian model: the general form will be described later in this section and covers a vast array of applications beyond the simple motion examples of the first paragraph. The reader might wish to consult Appendix A for some of the mathematical properties of Gaussian distributions; for our immediate purposes, the most important is that a multivariate Gaussian distribution for $d$ variables is specified by a $d$-element mean $\boldsymbol{\mu}$ and a $d \times d$ covariance matrix $\boldsymbol{\Sigma}$.
14.4.1 Updating Gaussian distributions
In Chapter 13 on page 423, we alluded to a key property of the linear-Gaussian family of distributions: it remains closed under Bayesian updating. (That is, given any evidence, the posterior is still in the linear-Gaussian family.) Here we make this claim precise in the context of filtering in a temporal probability model. The required properties correspond to the two-step filtering calculation in Equation (14.5):
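1. If the current distribution $P(\mathbf{X}_t \mid \mathbf{e}_{1:t})$ is Gaussian and the transition model $P(\mathbf{X}_{t+1} \mid \mathbf{x}_t)$ is linear-Gaussian, then the one-step predicted distribution given by
$$P(\mathbf{X}_{t+1} \mid \mathbf{e}_{1:t}) = \int_{\mathbf{x}_t} P(\mathbf{X}_{t+1} \mid \mathbf{x}_t)\, P(\mathbf{x}_t \mid \mathbf{e}_{1:t})\, d\mathbf{x}_t \tag{14.17}$$
is also a Gaussian distribution.
2. If the prediction $P(\mathbf{X}_{t+1} \mid \mathbf{e}_{1:t})$ is Gaussian and the sensor model $P(\mathbf{e}_{t+1} \mid \mathbf{X}_{t+1})$ is linear-Gaussian, then, after conditioning on the new evidence, the updated distribution
$$P(\mathbf{X}_{t+1} \mid \mathbf{e}_{1:t+1}) = \alpha\, P(\mathbf{e}_{t+1} \mid \mathbf{X}_{t+1})\, P(\mathbf{X}_{t+1} \mid \mathbf{e}_{1:t}) \tag{14.18}$$
is also a Gaussian distribution.
Chapter 14 Probabilistic Reasoning over Time
Figure 14.9 Bayesian network structure for a linear dynamical system with position $\mathbf{X}_t$, velocity $\dot{\mathbf{X}}_t$, and position measurement $\mathbf{Z}_t$.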
Thus, the FORWARD operator for Kalman filtering takes a Gaussian forward message $\mathbf{f}_{1:t}$, specified by a mean $\boldsymbol{\mu}_t$ and covariance $\boldsymbol{\Sigma}_t$, and produces a new multivariate Gaussian forward message $\mathbf{f}_{1:t+1}$, specified by a mean $\boldsymbol{\mu}_{t+1}$ and covariance $\boldsymbol{\Sigma}_{t+1}$. So if we start with a Gaussian prior $\mathbf{f}_{1:0} = P(\mathbf{X}_0) = N(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$, filtering with a linear-Gaussian model produces a Gaussian state distribution for all time.
This seems to be a nice, elegant result, but why is it so important? The reason is that except for a few special cases such as this, filtering with continuous or hybrid (discrete and continuous) networks generates state distributions whose representation grows without bound over time. This statement is not easy to prove in general, but Exercise 14.KFSW shows what happens for a simple example.
14.4.2 A simple one-dimensional example
We have said that the FORWARD operator for the Kalman filter maps a Gaussian into a new
Gaussian. This translates into computing a new mean and covariance from the previous mean
and covariance. Deriving the update rule in the general (multivariate) case requires rather a
lot of linear algebra, so we will stick to a very simple univariate case for now, and later give
the results for the general case. Even for the univariate case, the calculations are somewhat
tedious, but we feel that they are worth seeing because the usefulness of the Kalman filter is
tied so intimately to the mathematical properties of Gaussian distributions.
The temporal model we consider describes a random walk of a single continuous state variable $X_t$ with a noisy observation $Z_t$. An example might be the “consumer confidence” index, which can be modeled as undergoing a random Gaussian-distributed change each month and is measured by a random consumer survey that also introduces Gaussian sampling noise. The prior distribution is assumed to be Gaussian with variance $\sigma_0^2$:
$$P(x_0) = \alpha\, e^{-\frac{1}{2}\left(\frac{(x_0-\mu_0)^2}{\sigma_0^2}\right)}.$$
(For simplicity, we use the same symbol $\alpha$ for all normalizing constants in this section.) The transition model adds a Gaussian perturbation of constant variance $\sigma_x^2$ to the current state:
$$P(x_{t+1} \mid x_t) = \alpha\, e^{-\frac{1}{2}\left(\frac{(x_{t+1}-x_t)^2}{\sigma_x^2}\right)}.$$
The sensor model assumes Gaussian noise with variance $\sigma_z^2$:
$$P(z_t \mid x_t) = \alpha\, e^{-\frac{1}{2}\left(\frac{(z_t-x_t)^2}{\sigma_z^2}\right)}.$$
Now, given the prior $P(X_0)$, the one-step predicted distribution comes from Equation (14.17):
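$$P(x_1) = \int_{-\infty}^{\infty} P(x_1 \mid x_0)\, P(x_0)\, dx_0 = \alpha \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(\frac{(x_1-x_0)^2}{\sigma_x^2}\right)} e^{-\frac{1}{2}\left(\frac{(x_0-\mu_0)^2}{\sigma_0^2}\right)} dx_0 = \alpha \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(\frac{\sigma_0^2 (x_1-x_0)^2 + \sigma_x^2 (x_0-\mu_0)^2}{\sigma_0^2 \sigma_x^2}\right)} dx_0 .$$
This integral looks rather complicated. The key to progress is to notice that the exponent is the sum of two expressions that are quadratic in $x_0$ and hence is itself a quadratic in $x_0$. A simple trick known as completing the square allows the rewriting of any quadratic $a x_0^2 + b x_0 + c$ as the sum of a squared term $a\left(x_0 - \frac{-b}{2a}\right)^2$ and a residual term $c - \frac{b^2}{4a}$ that is independent of $x_0$. In this case, we have $a = (\sigma_0^2 + \sigma_x^2)/(\sigma_0^2 \sigma_x^2)$, $b = -2(\sigma_0^2 x_1 + \sigma_x^2 \mu_0)/(\sigma_0^2 \sigma_x^2)$, and $c = (\sigma_0^2 x_1^2 + \sigma_x^2 \mu_0^2)/(\sigma_0^2 \sigma_x^2)$. The residual term can be taken outside the integral, giving us
$$P(x_1) = \alpha\, e^{-\frac{1}{2}\left(c - \frac{b^2}{4a}\right)} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(a\left(x_0 - \frac{-b}{2a}\right)^2\right)} dx_0 .$$
Now the integral is just the integral of a Gaussian over its full range, which is simply 1. Thus, we are left with only the residual term from the quadratic. Plugging back in the expressions for $a$, $b$, and $c$ and simplifying, we obtain
$$P(x_1) = \alpha\, e^{-\frac{1}{2}\left(\frac{(x_1-\mu_0)^2}{\sigma_0^2+\sigma_x^2}\right)}.$$
That is, the one-step predicted distribution is a Gaussian with the same mean $\mu_0$ and a variance equal to the sum of the original variance $\sigma_0^2$ and the transition variance $\sigma_x^2$.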
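The result of the prediction step can be checked numerically. The sketch below approximates the prediction integral with the trapezoid rule and compares it with a Gaussian of mean $\mu_0$ and variance $\sigma_0^2 + \sigma_x^2$. The integration limits, step count, and function names are assumptions chosen for illustration; the parameter values are those quoted for Figure 14.10.

```python
import math


def gaussian_pdf(x, mean, var):
    """Normalized Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def predicted_density(x1, mu0, var0, var_x, lo=-60.0, hi=60.0, steps=12000):
    """Trapezoid-rule estimate of the integral of P(x1 | x0) P(x0) over x0."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x0 = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * gaussian_pdf(x1, x0, var_x) * gaussian_pdf(x0, mu0, var0)
    return total * h


mu0, var0, var_x = 0.0, 1.5 ** 2, 2.0 ** 2
for x1 in (-3.0, 0.0, 2.5):
    numeric = predicted_density(x1, mu0, var0, var_x)
    closed_form = gaussian_pdf(x1, mu0, var0 + var_x)
    print(x1, numeric, closed_form)  # the two columns should agree closely
```

Because both factors here are normalized densities, the constant $\alpha$ never has to be computed separately.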
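To complete the update step, we need to condition on the observation at the first time step, namely, $z_1$. From Equation (14.18), this is given by
$$P(x_1 \mid z_1) = \alpha\, P(z_1 \mid x_1)\, P(x_1) = \alpha\, e^{-\frac{1}{2}\left(\frac{(z_1-x_1)^2}{\sigma_z^2}\right)} e^{-\frac{1}{2}\left(\frac{(x_1-\mu_0)^2}{\sigma_0^2+\sigma_x^2}\right)}.$$
Once again, we combine the exponents and complete the square (Exercise 14.KALM), obtaining the following expression for the posterior:
$$P(x_1 \mid z_1) = \alpha\, e^{-\frac{1}{2}\left(\frac{\left(x_1 - \frac{(\sigma_0^2+\sigma_x^2)\, z_1 + \sigma_z^2\, \mu_0}{\sigma_0^2+\sigma_x^2+\sigma_z^2}\right)^2}{(\sigma_0^2+\sigma_x^2)\,\sigma_z^2 / (\sigma_0^2+\sigma_x^2+\sigma_z^2)}\right)}. \tag{14.19}$$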
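Thus, after one update cycle, we have a new Gaussian distribution for the state variable. From the Gaussian formula in Equation (14.19), we see that the new mean and standard deviation can be calculated from the old mean and standard deviation as follows:
$$\mu_{t+1} = \frac{(\sigma_t^2+\sigma_x^2)\, z_{t+1} + \sigma_z^2\, \mu_t}{\sigma_t^2+\sigma_x^2+\sigma_z^2} \qquad\text{and}\qquad \sigma_{t+1}^2 = \frac{(\sigma_t^2+\sigma_x^2)\,\sigma_z^2}{\sigma_t^2+\sigma_x^2+\sigma_z^2}. \tag{14.20}$$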
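The one-dimensional update of Equation (14.20) is short enough to write out directly. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and the numbers are the ones quoted for Figure 14.10 (prior mean 0.0 and standard deviation 1.5, transition noise 2.0, sensor noise 1.0), with 2.5 taken as the first observation.

```python
def kalman_update_1d(mu, var, z, var_x, var_z):
    """One filtering cycle for the random-walk model, per Equation (14.20)."""
    predicted_var = var + var_x  # prediction step: add the transition variance
    denom = predicted_var + var_z
    new_mu = (predicted_var * z + var_z * mu) / denom
    new_var = predicted_var * var_z / denom
    return new_mu, new_var


# First update cycle with the Figure 14.10 parameters (z = 2.5 is assumed).
mu1, var1 = kalman_update_1d(mu=0.0, var=1.5 ** 2, z=2.5, var_x=2.0 ** 2, var_z=1.0 ** 2)
print(mu1, var1)

# The variance recursion never uses z, so it can be iterated on its own.
var = 1.5 ** 2
variances = []
for _ in range(10):
    _, var = kalman_update_1d(0.0, var, 0.0, 2.0 ** 2, 1.0 ** 2)
    variances.append(var)
print(variances)
```

The new mean lands between the old mean and the observation, and the printed variances settle to a constant after a few iterations.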
Figure 14.10 Stages in the Kalman filter update cycle for a random walk with a prior given by $\mu_0 = 0.0$ and $\sigma_0 = 1.5$, transition noise given by $\sigma_x = 2.0$, sensor noise given by $\sigma_z = 1.0$, and a first observation $z_1 = 2.5$ (marked on the x-axis). Notice how the prediction $P(x_1)$ is flattened out, relative to $P(x_0)$, by the transition noise. Notice also that the mean of the posterior distribution $P(x_1 \mid z_1)$ is slightly to the left of the observation $z_1$ because the mean is a weighted average of the prediction and the observation. (Plot axes: x position, horizontal; $P(x)$, vertical.)
Figure 14.10 shows one update cycle of the Kalman filter in the one-dimensional case for particular values of the transition and sensor models. Equation (14.20) plays exactly the same role as the general filtering equation (14.5) or the HMM filtering equation (14.12). Because of the special nature of Gaussian distributions, however, the equations have some interesting additional properties. First, we can interpret the calculation for the new mean $\mu_{t+1}$ as a weighted mean of the new observation $z_{t+1}$ and the old mean $\mu_t$. If the observation is unreliable, then $\sigma_z^2$ is large and we pay more attention to the old mean; if the old mean is unreliable ($\sigma_t^2$ is large) or the process is highly unpredictable ($\sigma_x^2$ is large), then we pay more attention to the observation. Second, notice that the update for the variance $\sigma_{t+1}^2$ is independent of the observation. We can therefore compute in advance what the sequence of variance values will be. Third, the sequence of variance values converges quickly to a fixed value that depends only on $\sigma_x^2$ and $\sigma_z^2$, thereby substantially simplifying the subsequent calculations. (See Exercise 14.VARL)
14.4.3 The general case
The preceding derivation illustrates the key property of Gaussian distributions that allows Kalman filtering to work: the fact that the exponent is a quadratic form. This is true not just for the univariate case; the full multivariate Gaussian distribution has the form
$$N(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \alpha\, e^{-\frac{1}{2}\left((\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)}.$$
Multiplying out the terms in the exponent, we see that the exponent is also a quadratic function of the values $x_i$ in $\mathbf{x}$. Thus, filtering preserves the Gaussian nature of the state distribution.
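Let us first define the general temporal model used with Kalman filtering. Both the transition model and the sensor model are required to be a linear transformation with additive Gaussian noise. Thus, we have
$$P(\mathbf{x}_{t+1} \mid \mathbf{x}_t) = N(\mathbf{x}_{t+1}; \mathbf{F}\mathbf{x}_t, \boldsymbol{\Sigma}_x) \qquad P(\mathbf{z}_t \mid \mathbf{x}_t) = N(\mathbf{z}_t; \mathbf{H}\mathbf{x}_t, \boldsymbol{\Sigma}_z), \tag{14.21}$$
where $\mathbf{F}$ and $\boldsymbol{\Sigma}_x$ are matrices describing the linear transition model and transition noise covariance, and $\mathbf{H}$ and $\boldsymbol{\Sigma}_z$ are the corresponding matrices for the sensor model. Now the update equations for the mean and covariance, in their full, hairy horribleness, are
$$\boldsymbol{\mu}_{t+1} = \mathbf{F}\boldsymbol{\mu}_t + \mathbf{K}_{t+1}(\mathbf{z}_{t+1} - \mathbf{H}\mathbf{F}\boldsymbol{\mu}_t) \qquad \boldsymbol{\Sigma}_{t+1} = (\mathbf{I} - \mathbf{K}_{t+1}\mathbf{H})(\mathbf{F}\boldsymbol{\Sigma}_t\mathbf{F}^\top + \boldsymbol{\Sigma}_x), \tag{14.22}$$
where $\mathbf{K}_{t+1} = (\mathbf{F}\boldsymbol{\Sigma}_t\mathbf{F}^\top + \boldsymbol{\Sigma}_x)\mathbf{H}^\top (\mathbf{H}(\mathbf{F}\boldsymbol{\Sigma}_t\mathbf{F}^\top + \boldsymbol{\Sigma}_x)\mathbf{H}^\top + \boldsymbol{\Sigma}_z)^{-1}$ is the Kalman gain matrix. Believe it or not, these equations make some intuitive sense. For example, consider the update for the mean state estimate $\boldsymbol{\mu}$. The term $\mathbf{F}\boldsymbol{\mu}_t$ is the predicted state at $t+1$, so $\mathbf{H}\mathbf{F}\boldsymbol{\mu}_t$ is the predicted observation. Therefore, the term $\mathbf{z}_{t+1} - \mathbf{H}\mathbf{F}\boldsymbol{\mu}_t$ represents the error in the predicted observation. This is multiplied by $\mathbf{K}_{t+1}$ to correct the predicted state; hence, $\mathbf{K}_{t+1}$ is a measure of how seriously to take the new observation relative to the prediction. As in Equation (14.20), we also have the property that the variance update is independent of the observations. The sequence of values for $\boldsymbol{\Sigma}_t$ and $\mathbf{K}_t$ can therefore be computed offline, and the actual calculations required during online tracking are quite modest.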
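The update equations in (14.22) translate almost line for line into array code. The following is a minimal sketch using NumPy, not a production filter: the function name, the two-variable state (position and velocity of a single coordinate), and all numeric values are assumptions made for illustration.

```python
import numpy as np


def kalman_update(mu, Sigma, z, F, Sigma_x, H, Sigma_z):
    """One filtering cycle following Equation (14.22)."""
    P = F @ Sigma @ F.T + Sigma_x  # predicted covariance
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + Sigma_z)  # Kalman gain matrix
    mu_pred = F @ mu  # predicted state
    mu_new = mu_pred + K @ (z - H @ mu_pred)  # correct by the observation error
    Sigma_new = (np.eye(len(mu)) - K @ H) @ P
    return mu_new, Sigma_new, K


# Hypothetical model: state (position, velocity), only position is observed.
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Sigma_x = 0.1 * np.eye(2)
Sigma_z = np.array([[1.0]])

mu, Sigma = np.zeros(2), 10.0 * np.eye(2)
for z in (1.1, 1.9, 3.2, 3.9):
    mu, Sigma, K = kalman_update(mu, Sigma, np.array([z]), F, Sigma_x, H, Sigma_z)
print(mu)
print(Sigma)
```

Since `Sigma_new` and `K` are computed without reference to `z`, the same function can be run ahead of time with dummy observations to tabulate the covariance and gain sequences.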
To illustrate these equations at work, we have applied them to the problem of tracking an object moving on the X-Y plane. The state variables are $\mathbf{X} = (X, Y, \dot{X}, \dot{Y})^\top$, so $\mathbf{F}$, $\boldsymbol{\Sigma}_x$, $\mathbf{H}$, and $\boldsymbol{\Sigma}_z$ are $4 \times 4$ matrices. Figure 14.11(a) shows the true trajectory, a series of noisy observations, and the trajectory estimated by Kalman filtering, along with the covariances indicated by the one-standard-deviation contours. The filtering process does a good job of tracking the actual motion, and, as expected, the variance quickly reaches a fixed point.
We can also derive equations for smoothing as well as filtering with linear-Gaussian models. The smoothing results are shown in Figure 14.11(b). Notice how the variance in the position estimate is sharply reduced, except at the ends of the trajectory (why?), and that the estimated trajectory is much smoother.
14.4.4 Applicability of Kalman filtering
The Kalman filter and its elaborations are used in a vast array of applications. The “classical” application is in radar tracking of aircraft and missiles. Related applications include acoustic tracking of submarines and ground vehicles and visual tracking of vehicles and people. In a slightly more esoteric vein, Kalman filters are used to reconstruct particle trajectories from bubble-chamber photographs and ocean currents from satellite surface measurements. The range of application is much larger than just the tracking of motion: any system characterized by continuous state variables and noisy measurements will do. Such systems include pulp mills, chemical plants, nuclear reactors, plant ecosystems, and national economies.
Figure 14.11 (a) Results of Kalman filtering for an object moving on the X-Y plane, showing the true trajectory (left to right), a series of noisy observations, and the trajectory estimated by Kalman filtering. Variance in the position estimate is indicated by the ovals. (b) The results of Kalman smoothing for the same observation sequence.
The fact that Kalman filtering can be applied to a system does not mean that the results will be valid or useful. The assumptions made (linear-Gaussian transition and sensor models) are very strong. The extended Kalman filter (EKF) attempts to overcome nonlinearities in the system being modeled. A system is nonlinear if the transition model cannot be described as a matrix multiplication of the state vector, as in Equation (14.21). The EKF works by modeling the system as locally linear in $\mathbf{x}_t$ in the region of $\mathbf{x}_t = \boldsymbol{\mu}_t$, the mean of the current state distribution. This works well for smooth, well-behaved systems and allows the tracker to maintain and update a Gaussian state distribution that is a reasonable approximation to the true posterior. A detailed example is given in Chapter 26.
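The local-linearization idea can be illustrated in one dimension. The sketch below assumes a scalar state, a hypothetical nonlinear transition function `f` with derivative `df`, and a direct noisy observation of the state; it is an illustration of the principle only, not the example developed in Chapter 26.

```python
import math


def ekf_update_1d(mu, var, z, f, df, var_x, var_z):
    """One scalar EKF cycle: linearize f around the current mean mu."""
    slope = df(mu)  # local linear approximation of the transition
    pred_mu = f(mu)  # propagate the mean through the nonlinear model
    pred_var = slope * slope * var + var_x  # propagate variance through the slope
    gain = pred_var / (pred_var + var_z)  # scalar Kalman gain
    new_mu = pred_mu + gain * (z - pred_mu)
    new_var = (1.0 - gain) * pred_var
    return new_mu, new_var


def f(x):
    """Hypothetical smooth nonlinear transition model."""
    return x + 0.5 * math.sin(x)


def df(x):
    """Derivative of f, used for the linearization."""
    return 1.0 + 0.5 * math.cos(x)


mu, var = ekf_update_1d(0.0, 1.0, 0.4, f, df, var_x=0.1, var_z=0.5)
print(mu, var)
```

When `f` is the identity and `df` is constantly 1, the function reduces to the ordinary one-dimensional Kalman update for a random walk.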
What does it mean for a system to be “unsmooth” or “poorly behaved”? Technically, it means that there is significant nonlinearity in system response within the region that is “close” (according to the covariance $\boldsymbol{\Sigma}_t$) to the current mean $\boldsymbol{\mu}_t$. To understand this idea in nontechnical terms, consider the example of trying to track a bird as it flies through the jungle. The bird appears to be heading at high speed straight for a tree trunk. The Kalman filter, whether regular or extended, can make only a Gaussian prediction of the location of the bird, and the mean of this Gaussian will be centered on the trunk, as shown in Figure 14.12(a). A reasonable model of the bird, on the other hand, would predict evasive action to one side or the other, as shown in Figure 14.12(b). Such a model is highly nonlinear, because the bird’s decision varies sharply depending on its precise location relative to the trunk.
To handle examples like these, we clearly need a more expressive language for representing the behavior of the system being modeled. Within the control theory community, for which problems such as evasive maneuvering by aircraft raise the same kinds of difficulties, the standard solution is the switching Kalman filter. In this approach, multiple Kalman filters run in parallel, each using a different model of the system: for example, one for straight flight, one for sharp left turns, and one for sharp right turns. A weighted sum of predictions is used, where the weight depends on how well each filter fits the current data. We will see in the next section that this is simply a special case of the general dynamic Bayesian network model, obtained by adding a discrete “maneuver” state variable to the network shown in Figure 14.9. Switching Kalman filters are discussed further in Exercise 14.KFSW.
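The weighting step can be sketched as follows. This is a deliberately reduced illustration, not a full switching Kalman filter: it assumes each filter has already produced a scalar prediction, scores each prediction by a Gaussian likelihood of the newest observation, and blends the predictions. The names `gaussian_pdf` and `blend_predictions`, the observation variance, and the numbers in the example are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def blend_predictions(predictions, observation, obs_var, prior_weights):
    """Reweight each model by how well it fits the observation, then blend."""
    likelihoods = [gaussian_pdf(observation, p, obs_var) for p in predictions]
    unnormalized = [w * l for w, l in zip(prior_weights, likelihoods)]
    total = sum(unnormalized)
    weights = [u / total for u in unnormalized]
    blended = sum(w * p for w, p in zip(weights, predictions))
    return weights, blended

# Lateral position predicted by three assumed models: straight, left turn, right turn.
predictions = [0.0, -2.0, 2.0]
weights, blended = blend_predictions(
    predictions, observation=1.8, obs_var=0.25, prior_weights=[1 / 3, 1 / 3, 1 / 3]
)
```

With an observation near the right-turn prediction, almost all of the weight moves to that model, so the blended estimate is no longer centered on the obstacle.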
Figure 14.12 A bird flying toward a tree (top views). (a) A Kalman filter will predict the location of the bird using a single Gaussian centered on the obstacle. (b) A more realistic model allows for the bird’s evasive action, predicting that it will fly to one side or the other.

14.5 Dynamic Bayesian Networks

Dynamic Bayesian networks, or DBNs, extend the semantics of standard Bayesian networks to handle temporal probability models of the kind described in Section 14.1. We have already seen examples of DBNs: the umbrella network in Figure 14.2 and the Kalman filter network in Figure 14.9. In general, each slice of a DBN can have any number of state variables $\mathbf{X}_t$ and evidence variables $\mathbf{E}_t$. For simplicity, we assume that the variables, their links, and their conditional distributions are exactly replicated from slice to slice and that the DBN represents a first-order Markov process, so that each variable can have parents only in its own slice or the immediately preceding slice. In this way, the DBN corresponds to a Bayesian network with infinitely many variables.
It should be clear that every hidden Markov model can be represented as a DBN with a single state variable and a single evidence variable. It is also the case that every discrete-variable DBN can be represented as an HMM; as explained in Section 14.3, we can combine all the state variables in the DBN into a single state variable whose values are all possible tuples of values of the individual state variables. Now, if every HMM is a DBN and every DBN can be translated into an HMM, what's the difference? The difference is that, by decomposing the state of a complex system into its constituent variables, we can take advantage of sparseness in the temporal probability model.
To see what this means in practice, remember that in Section 14.3 we said that an HMM representation for a temporal process with $n$ discrete variables, each with up to $d$ values, needs a transition matrix of size $O(d^{2n})$. The DBN representation, on the other hand, has size $O(nd^k)$ if the number of parents of each variable is bounded by $k$. In other words, the DBN representation is linear rather than exponential in the number of variables. For the vacuum robot with 42 possibly dirty locations, the number of probabilities required is reduced from $5 \times 10^{29}$ to a few thousand.
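The gap between the two representations is easy to check numerically. The sketch below simply evaluates the two growth formulas for a hypothetical process with 20 Boolean variables and at most 3 parents each; the function names and the example sizes are illustrative assumptions, and constant factors are ignored just as in the $O(\cdot)$ expressions.

```python
def hmm_transition_entries(n, d):
    """Entries in the transition matrix over d**n joint states: (d**n) squared."""
    return d ** (2 * n)

def dbn_cpt_rows(n, d, k):
    """CPT rows in one DBN slice: n variables, each with at most d**k parent settings."""
    return n * d ** k

hmm_size = hmm_transition_entries(n=20, d=2)
dbn_size = dbn_cpt_rows(n=20, d=2, k=3)
```

Adding one more Boolean variable multiplies the HMM figure by four but adds only a constant number of rows to the DBN figure.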
We have already explained that every Kalman filter model can be represented in a DBN with continuous variables and linear-Gaussian conditional distributions (Figure 14.9). It should be clear from the discussion at the end of the preceding section that not every DBN can be represented by a Kalman filter model. In a Kalman filter, the current state distribution is always a single multivariate Gaussian distribution, that is, a single “bump” in a particular location. DBNs, on the other hand, can model arbitrary distributions.

For many real-world applications, this flexibility is essential. Consider, for example, the current location of my keys. They might be in my pocket, on the bedside table, on the kitchen counter, dangling from the front door, or locked in the car. A single Gaussian bump that included all these places would have to allocate significant probability to the keys being in midair above the front garden. Aspects of the real world such as purposive agents, obstacles, and pockets introduce “nonlinearities” that require combinations of discrete and continuous variables in order to get reasonable models.

Figure 14.13 Left: Specification of the prior, transition model, and sensor model for the umbrella DBN. Subsequent slices are copies of slice 1. The CPTs shown are $P(R_0) = 0.7$; $P(R_1 \mid R_0)$, which is 0.7 when $R_0$ is true and 0.3 when it is false; and $P(U_1 \mid R_1)$, which is 0.9 when $R_1$ is true and 0.2 when it is false. Right: A simple DBN for robot motion in the XY plane.
14.5.1 Constructing DBNs

To construct a DBN, one must specify three kinds of information: the prior distribution over the state variables, $\mathbf{P}(\mathbf{X}_0)$; the transition model $\mathbf{P}(\mathbf{X}_{t+1} \mid \mathbf{X}_t)$; and the sensor model $\mathbf{P}(\mathbf{E}_t \mid \mathbf{X}_t)$. To specify the transition and sensor models, one must also specify the topology of the connections between successive slices and between the state and evidence variables. Because the transition and sensor models are assumed to be time-homogeneous (the same for all $t$), it is most convenient simply to specify them for the first slice. For example, the complete DBN specification for the umbrella world is given by the three-node network shown in Figure 14.13(a). From this specification, the complete DBN with an unbounded number of time slices can be constructed as needed by copying the first slice.

Let us now consider a more interesting example: monitoring a battery-powered robot moving in the XY plane, as introduced at the end of Section 14.1. First, we need state variables, which will include both $\mathbf{X}_t = (X_t, Y_t)$ for position and $\dot{\mathbf{X}}_t = (\dot{X}_t, \dot{Y}_t)$ for velocity. We assume some method of measuring position, perhaps a fixed camera or onboard GPS (Global Positioning System), yielding measurements $\mathbf{Z}_t$. The position at the next time step depends on the current position and velocity, as in the standard Kalman filter model. The velocity at the next step depends on the current velocity and the state of the battery. We add $Battery_t$ to represent the actual battery charge level, which has as parents the previous battery level and the velocity, and we add $BMeter_t$, which measures the battery charge level. This gives us the basic model shown in Figure 14.13(b).

It is worth looking in more depth at the nature of the sensor model for $BMeter_t$. Let us suppose, for simplicity, that both $Battery_t$ and $BMeter_t$ can take on discrete values 0 through 5. (Exercise 14.BATT asks you to relate this discrete model to a corresponding continuous model.) If the meter is always accurate, then the CPT $\mathbf{P}(BMeter_t \mid Battery_t)$ should have probabilities of 1.0 “along the diagonal” and probabilities of 0.0 elsewhere. In reality, noise always creeps into measurements. For continuous measurements, a Gaussian distribution with a small variance might be used.⁷ For our discrete variables, we can approximate a Gaussian using a distribution in which the probability of error drops off in the appropriate way, so that the probability of a large error is very small. We use the term Gaussian error model to cover both the continuous and discrete versions.
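One way to build such a discrete Gaussian error model is to evaluate an unnormalized Gaussian at each possible meter reading and normalize each row of the CPT. The following sketch does this for the six battery levels; the function name `gaussian_error_cpt` and the spread `sigma = 0.5` are assumptions chosen for illustration, not values given in the text.

```python
import math

LEVELS = range(6)  # Battery and BMeter both take the values 0..5

def gaussian_error_cpt(sigma=0.5):
    """Return cpt[b][m] = P(BMeter = m | Battery = b) for a discretized Gaussian."""
    cpt = []
    for b in LEVELS:
        # Unnormalized Gaussian weight for each possible reading m.
        weights = [math.exp(-0.5 * ((m - b) / sigma) ** 2) for m in LEVELS]
        total = sum(weights)
        cpt.append([w / total for w in weights])
    return cpt

cpt = gaussian_error_cpt()
```

Most of the mass in each row sits on the diagonal, small errors get modest probability, and an error of five levels is assigned a vanishingly small probability.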
Anyone with hands-on experience of robotics, computerized process control, or other forms of automatic sensing will readily testify to the fact that small amounts of measurement noise are often the least of one’s problems. Real sensors fail. When a sensor fails, it does not necessarily send a signal saying, “Oh, by the way, the data I'm about to send you is a load of nonsense.” Instead, it simply sends the nonsense. The simplest kind of failure is called a transient failure, where the sensor occasionally decides to send some nonsense. For example, the battery level sensor might have a habit of sending a reading of 0 when someone bumps the robot, even if the battery is fully charged.
Let’s see what happens when a transient failure occurs with a Gaussian error model that doesn’t accommodate such failures. Suppose, for example, that the robot is sitting quietly and observes 20 consecutive battery readings of 5. Then the battery meter has a temporary seizure and the next reading is $BMeter_{21} = 0$. What will the simple Gaussian error model lead us to believe about $Battery_{21}$? According to Bayes’ rule, the answer depends on both the sensor model $\mathbf{P}(BMeter_{21} = 0 \mid Battery_{21})$ and the prediction $\mathbf{P}(Battery_{21} \mid BMeter_{1:20})$. If the probability of a large sensor error is significantly less than the probability of a transition to $Battery_{21} = 0$, even if the latter is very unlikely, then the posterior distribution will assign a high probability to the battery’s being empty.

A second reading of 0 at $t = 22$ will make this conclusion almost certain. If the transient failure then disappears and the reading returns to 5 from $t = 23$ onwards, the estimate for the battery level will quickly return to 5. (This does not mean the algorithm thinks the battery magically recharged itself, which may be physically impossible; instead, the algorithm now believes that the battery was never low and the extremely unlikely hypothesis that the battery meter had two consecutive huge errors must be the right explanation.) This course of events is illustrated in the upper curve of Figure 14.14(a), which shows the expected value (see Appendix A) of $Battery_t$ over time, using a discrete Gaussian error model.

Despite the recovery, there is a time ($t = 22$) when the robot is convinced that its battery is empty; presumably, then, it should send out a mayday signal and shut down. Alas, its oversimplified sensor model has led it astray. The moral of the story is simple: for the system to handle sensor failure properly, the sensor model must include the possibility of failure.

⁷ Strictly speaking, a Gaussian distribution is problematic because it assigns nonzero probability to large negative charge levels. The beta distribution is a better choice for a variable whose range is restricted.
Figure 14.14 (a) Upper curve: trajectory of the expected value of $Battery_t$ for an observation sequence consisting of all 5s except for 0s at $t = 21$ and $t = 22$, using a simple Gaussian error model. Lower curve: trajectory when the observation remains at 0 from $t = 21$ onwards. (b) The same experiment run with the transient failure model. The transient failure is handled well, but the persistent failure results in excessive pessimism about the battery charge. (In both panels the horizontal axis is the time step $t$, from 15 to 30, and the curves are labeled $E(Battery_t \mid \ldots 5555005555 \ldots)$ and $E(Battery_t \mid \ldots 5555000000 \ldots)$.)

The simplest kind of failure model for a sensor allows a certain probability that the sensor will return some completely incorrect value, regardless of the true state of the world. For example, if the battery meter fails by returning 0, we might say that

$P(BMeter_t = 0 \mid Battery_t = 5) = 0.03$,

which is presumably much larger than the probability assigned by the simple Gaussian error model. Let's call this the transient failure model. How does it help when we are faced with a reading of 0? Provided that the predicted probability of an empty battery, according to the readings so far, is much less than 0.03, then the best explanation of the observation $BMeter_{21} = 0$ is that the sensor has temporarily failed. Intuitively, we can think of the belief about the battery level as having a certain amount of “inertia” that helps to overcome temporary blips in the meter reading. The upper curve in Figure 14.14(b) shows that the transient failure model can handle transient failures without a catastrophic change in beliefs.
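The effect of the failure probability on a single update can be checked with Bayes' rule. The sketch below compares the posterior probability of an empty battery after one reading of 0 under a plain discretized Gaussian sensor and under the same sensor mixed with a 0.03 chance of reading 0 regardless of the true level. The predicted distribution (an empty battery with probability $10^{-4}$), the spread `sigma`, and all function names are illustrative assumptions.

```python
import math

LEVELS = range(6)  # Battery and BMeter both take the values 0..5

def gaussian_row(b, sigma=0.5):
    """Discretized Gaussian P(BMeter = m | Battery = b) over m = 0..5."""
    weights = [math.exp(-0.5 * ((m - b) / sigma) ** 2) for m in LEVELS]
    total = sum(weights)
    return [w / total for w in weights]

def sensor_prob(m, b, p_fail):
    """With probability p_fail the meter reads 0 whatever the battery level is."""
    return (1.0 - p_fail) * gaussian_row(b)[m] + (p_fail if m == 0 else 0.0)

def posterior(prediction, reading, p_fail):
    """Bayes update of the predicted battery distribution on one meter reading."""
    unnormalized = [prediction[b] * sensor_prob(reading, b, p_fail) for b in LEVELS]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Assumed prediction after many readings of 5: almost certainly full, tiny chance of empty.
prediction = [1e-4, 0.0, 0.0, 0.0, 0.0, 1.0 - 1e-4]

p_empty_gaussian = posterior(prediction, reading=0, p_fail=0.0)[0]
p_empty_transient = posterior(prediction, reading=0, p_fail=0.03)[0]
```

Under these assumptions the Gaussian model concludes that the battery is almost certainly empty, whereas the transient failure model keeps that probability well below one percent, because 0.03 is much larger than the predicted probability of an empty battery.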
So much for temporary blips. What about a persistent sensor failure? Sadly, failures of this kind are all too common. If the sensor returns 20 readings of 5 followed by 20 readings of 0, then the transient sensor failure model described in the preceding paragraph will result in the robot gradually coming to believe that its battery is empty when in fact it may be that the meter has failed. The lower curve in Figure 14.14(b) shows the belief “trajectory” for this case. By $t = 25$ (five readings of 0) the robot is convinced that its battery is empty. Obviously, we would prefer the robot to believe that its battery meter is broken, if indeed this is the more likely event.

Unsurprisingly, to handle persistent failure, we need a persistent failure model that describes how the sensor behaves under normal conditions and after failure. To do this, we need to augment the state of the system with an additional variable, say, $BMBroken$, that describes the status of the battery meter. The persistence of failure must be modeled by an arc linking $BMBroken_0$ to $BMBroken_1$. This persistence arc has a CPT that gives a small probability of failure in any given time step, say, 0.001, but specifies that the sensor stays broken once it breaks. When the sensor is OK, the sensor model for $BMeter$ is identical to the transient failure model; when the sensor is broken, it says $BMeter$ is always 0, regardless of the actual battery charge. The persistent failure model for the battery sensor is shown in Figure 14.15(a). Its performance on the two data sequences (temporary blip and persistent failure) is shown in Figure 14.15(b).

Figure 14.15 (a) A DBN fragment showing the sensor status variable required for modeling persistent failure of the battery sensor. The CPT for the persistence arc gives $P(B_1) = 1.000$ when $B_0$ is true and $0.001$ when $B_0$ is false. (b) Upper curves: trajectories of the expected value of $Battery_t$ for the “transient failure” and “permanent failure” observation sequences. Lower curves: probability trajectories for $BMBroken$ given the two observation sequences.
There are several things to notice about these curves. First, in the case of the temporary blip, the probability that the sensor is broken rises significantly after the second 0 reading, but immediately drops back to zero once a 5 is observed. Second, in the case of persistent failure, the probability that the sensor is broken rises quickly to almost 1 and stays there. Finally, once the sensor is known to be broken, the robot can only assume that its battery discharges at the “normal” rate. This is shown by the gradually descending level of $E(Battery_t \mid \ldots)$.
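The qualitative behavior of these curves can be reproduced with a small exact filter over the joint state (battery level, meter broken). This is a sketch under stated assumptions rather than the model used to draw the figure: the failure probabilities 0.001 and 0.03 come from the text, but the battery transition model (a 0.01 chance of dropping one level per step), the Gaussian spread, and all names are hypothetical.

```python
import math

LEVELS = range(6)     # Battery and BMeter take the values 0..5
P_BREAK = 0.001       # per-step probability that a working meter breaks (from the text)
P_TRANSIENT = 0.03    # transient failure probability (from the text)
P_DRAIN = 0.01        # assumed chance that the battery drops one level per step

def gaussian_row(b, sigma=0.5):
    weights = [math.exp(-0.5 * ((m - b) / sigma) ** 2) for m in LEVELS]
    total = sum(weights)
    return [w / total for w in weights]

def sensor(m, b, broken):
    """P(BMeter = m | Battery = b, BMBroken = broken)."""
    if broken:
        return 1.0 if m == 0 else 0.0  # a broken meter always reads 0
    return (1.0 - P_TRANSIENT) * gaussian_row(b)[m] + (P_TRANSIENT if m == 0 else 0.0)

def filter_step(belief, reading):
    """One filtering update; belief maps (battery, broken) to a probability."""
    predicted = {state: 0.0 for state in belief}
    for (b, broken), p in belief.items():
        p_broken_next = 1.0 if broken else P_BREAK  # persistence arc
        for b2, pb in ((b, 1.0 - P_DRAIN), (max(b - 1, 0), P_DRAIN)):
            predicted[(b2, True)] += p * pb * p_broken_next
            predicted[(b2, False)] += p * pb * (1.0 - p_broken_next)
    unnormalized = {s: p * sensor(reading, s[0], s[1]) for s, p in predicted.items()}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}

def broken_trajectory(readings):
    """P(BMBroken_t | readings up to t) for each t, starting full and working."""
    belief = {(b, k): 0.0 for b in LEVELS for k in (False, True)}
    belief[(5, False)] = 1.0
    trajectory = []
    for reading in readings:
        belief = filter_step(belief, reading)
        trajectory.append(sum(p for (b, k), p in belief.items() if k))
    return trajectory

blip = broken_trajectory([5] * 20 + [0, 0] + [5] * 8)
stuck = broken_trajectory([5] * 20 + [0] * 10)
```

In this sketch the probability of a broken meter climbs after the second 0, falls to zero as soon as a 5 is seen (a broken meter cannot produce a 5), and approaches 1 when the zeros persist.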
So far, we have merely scratched the surface of the problem of representing complex processes. The variety of transition models is huge, encompassing topics as disparate as modeling the human endocrine system and modeling multiple vehicles driving on a freeway. Sensor modeling is also a vast subfield in itself. But dynamic Bayesian networks can model even subtle phenomena, such as sensor drift, sudden decalibration, and the effects of exogenous conditions (such as weather) on sensor readings.

14.5.2 Exact inference in DBNs
Having sketched some ideas for representing complex processes as DBNs, we now turn to the question of inference. In a sense, this question has already been answered: dynamic Bayesian networks are Bayesian networks, and we already have algorithms for inference in Bayesian networks. Given a sequence of observations, one can construct the full Bayesian network representation of a DBN by replicating slices until the network is large enough to accommodate the observations, as in Figure 14.16. This technique is called unrolling. (Technically, the DBN is equivalent to the semi-infinite network obtained by unrolling forever. Slices added beyond the last observation have no effect on inferences within the observation period and can be omitted.) Once the DBN is unrolled, one can use any of the inference algorithms (variable elimination, clustering methods, and so on) described in Chapter 13.

Figure 14.16 Unrolling a dynamic Bayesian network: slices are replicated to accommodate the observation sequence $Umbrella_{1:3}$. Further slices have no effect on inferences within the observation period. (Each slice repeats the CPTs of the umbrella DBN: $P(R_0) = 0.7$; $P(R_{t+1} \mid R_t)$ is 0.7 when $R_t$ is true and 0.3 when it is false; $P(U_t \mid R_t)$ is 0.9 when $R_t$ is true and 0.2 when it is false.)
Unfortunately, a naive application of unrolling would not be particularly efficient. If we want to perform filtering or smoothing with a long sequence of observations $\mathbf{e}_{1:t}$, the unrolled network would require $O(t)$ space and would thus grow without bound as more observations were added. Moreover, if we simply run the inference algorithm anew each time an observation is added, the inference time per update will also increase as $O(t)$.

Looking back to Section 14.2.1, we see that constant time and space per filtering update can be achieved if the computation can be done recursively. Essentially, the filtering update in Equation (14.5) works by summing out the state variables of the previous time step to get the distribution for the new time step. Summing out variables is exactly what the variable elimination (Figure 13.13) algorithm does, and it turns out that running variable elimination with the variables in temporal order exactly mimics the operation of the recursive filtering update in Equation (14.5). The modified algorithm keeps at most two slices in memory at any one time: starting with slice 0, we add slice 1, then sum out slice 0, then add slice 2, then sum out slice 1, and so on. In this way, we can achieve constant space and time per filtering update. (The same performance can be achieved by suitable modifications to the clustering algorithm.) Exercise 14.DBNE asks you to verify this fact for the umbrella network.
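For the umbrella network, the two-slice procedure reduces to a few lines. The sketch below uses the transition and sensor probabilities of the umbrella DBN (0.7/0.3 and 0.9/0.2) and keeps only the current belief over $Rain$ in memory; the function name `filter_umbrella` and the `prior_rain` parameter are illustrative, with the default prior 0.7 taken from the figure residue and therefore best treated as an assumption.

```python
def filter_umbrella(observations, prior_rain=0.7):
    """Return P(Rain_t = true | u_1..u_t) for each t, holding only one slice of belief."""
    p_rain_given = {True: 0.7, False: 0.3}       # P(R_t = true | R_{t-1})
    p_umbrella_given = {True: 0.9, False: 0.2}   # P(U_t = true | R_t)
    belief = {True: prior_rain, False: 1.0 - prior_rain}
    history = []
    for umbrella in observations:
        # Add the next slice, then sum out the previous slice.
        predicted = {}
        for r1 in (True, False):
            predicted[r1] = sum(
                belief[r0] * (p_rain_given[r0] if r1 else 1.0 - p_rain_given[r0])
                for r0 in (True, False)
            )
        # Weight by the evidence in the new slice and normalize.
        unnormalized = {
            r: predicted[r] * (p_umbrella_given[r] if umbrella else 1.0 - p_umbrella_given[r])
            for r in (True, False)
        }
        total = sum(unnormalized.values())
        belief = {r: v / total for r, v in unnormalized.items()}
        history.append(belief[True])
    return history

two_days = filter_umbrella([True, True], prior_rain=0.5)
```

The work per observation is the same no matter how long the sequence is, because the loop never revisits earlier slices.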
So much for the good news; now for the bad news: It turns out that the “constant” for the per-update time and space complexity is, in almost all cases, exponential in the number of state variables. What happens is that, as the variable elimination proceeds, the factors grow to include all the state variables (or, more precisely, all those state variables that have parents in the previous time slice). The maximum factor size is $O(d^{n+k})$ and the total update cost per step is $O(nd^{n+k})$, where $d$ is the domain size of the variables and $k$ is the maximum number of parents of any state variable.

Of course, this is much less than the cost of HMM updating, which is $O(d^{2n})$, but it is still infeasible for large numbers of variables. This grim fact means that even though we can use DBNs to represent very complex temporal processes with many sparsely connected variables, we cannot reason efficiently and exactly about those processes. The DBN model itself, which represents the prior joint distribution over all the variables, is factorable into its constituent CPTs, but the posterior joint distribution conditioned on an observation sequence (that is, the forward message) is generally not factorable. The problem is intractable in general, so we must fall back on approximate methods.
14.5.3 Approximate inference in DBNs
Section 13.4 described two approximation algorithms: likelihood weighting (Figure 13.18) and Markov chain Monte Carlo (MCMC, Figure 13.20). Of the two, the former is most easily adapted to the DBN context. (An MCMC filtering algorithm is described briefly in the notes at the end of this chapter.) We will see, however, that several improvements are required over the standard likelihood weighting algorithm before a practical method emerges.

Recall that likelihood weighting works by sampling the nonevidence nodes of the network in topological order, weighting each sample by the likelihood it accords to the observed evidence variables. As with the exact algorithms, we could apply likelihood weighting directly to an unrolled DBN, but this would suffer from the same problems of increasing time and space requirements per update as the observation sequence grows. The problem is that the standard algorithm runs each sample in turn, all the way through the network.

Instead, we can simply run all N samples together through the DBN, one slice at a time. The modified algorithm fits the general pattern of filtering algorithms, with the set of N samples as the forward message. The first key innovation, then, is to use the samples themselves as an approximate representation of the current state distribution.
This meets the require
For a single customer $C_1$ recommending a single book $B_1$, the Bayes net might look like the one shown in Figure 15.2(a). (Just as in Section 9.1, expressions with parentheses such as $Honest(C_1)$ are just fancy symbols, in this case fancy names for random variables.) With two customers and two books, the Bayes net looks like the one in Figure 15.2(b). For larger numbers of books and customers, it is quite impractical to specify a Bayes net by hand.

¹ The name relational probability model was given by Pfeffer (2000) to a slightly different representation, but the underlying ideas are the same.
² A game theorist would advise a dishonest customer to avoid detection by occasionally recommending a good book from a competitor. See Chapter 18.
Fortunately, the network has a lot of repeated structure. Each Recommendation(c, b) variable has as its parents the variables Honest(c), Kindness(c), and Quality(b). Moreover, the conditional probability tables (CPTs) for all the Recommendation(c, b) variables are identical, as are those for all the Honest(c) variables, and so on. The situation seems tailor-made for a first-order language. We would like to say something like

Recommendation(c, b) ~ RecCPT(Honest(c), Kindness(c), Quality(b))

which means that a customer’s recommendation for a book depends probabilistically on the customer’s honesty and kindness and the book’s quality according to a fixed CPT.
Like first-order logic, RPMs have constant, function, and predicate symbols. We will also assume a type signature for each function, that is, a specification of the type of each argument and the function’s value. (If the type of each object is known, many spurious possible worlds are eliminated by this mechanism; for example, we need not worry about the kindness of each book, books recommending customers, and so on.) For the book-recommendation domain, the types are Customer and Book, and the type signatures for the functions and predicates are as follows:

Honest : Customer → {true, false}
Kindness : Customer → {1, 2, 3, 4, 5}
Quality : Book → {1, 2, 3, 4, 5}
Recommendation : Customer × Book → {1, 2, 3, 4, 5}

The constant symbols will be whatever customer and book names appear in the retailer’s data set. In the example given in Figure 15.2(b), these were $C_1$, $C_2$ and $B_1$, $B_2$.

Given the constants and their types, together with the functions and their type signatures, the basic random variables of the RPM are obtained by instantiating each function with each possible combination of objects. For the book recommendation model, the basic random variables include $Honest(C_1)$, $Quality(B_2)$, $Recommendation(C_1, B_2)$, and so on. These are exactly the variables appearing in Figure 15.2(b). Because each type has only finitely many instances (thanks to the domain closure assumption), the number of basic random variables is also finite.
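Instantiating each function with every well-typed combination of constants is a Cartesian product over the argument types. The sketch below does this for two customers and two books; the dictionary layout and the name `basic_random_variables` are illustrative assumptions, while the type signatures follow those listed above.

```python
from itertools import product

# Constants of each type, as in the two-customer, two-book example.
constants = {"Customer": ["C1", "C2"], "Book": ["B1", "B2"]}

# Type signature of each function: (argument types, range of values).
signatures = {
    "Honest": (("Customer",), [True, False]),
    "Kindness": (("Customer",), [1, 2, 3, 4, 5]),
    "Quality": (("Book",), [1, 2, 3, 4, 5]),
    "Recommendation": (("Customer", "Book"), [1, 2, 3, 4, 5]),
}

def basic_random_variables(signatures, constants):
    """Map each (function name, argument tuple) to the range of that random variable."""
    variables = {}
    for name, (arg_types, value_range) in signatures.items():
        for args in product(*(constants[t] for t in arg_types)):
            variables[(name, args)] = value_range
    return variables

variables = basic_random_variables(signatures, constants)
```

Because the arguments are drawn only from constants of the declared type, ill-typed variables such as the kindness of a book are never generated.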
To complete the RPM, we have to write the dependencies that govern these random variables. There is one dependency statement for each function, where each argument of the function is a logical variable (i.e., a variable that ranges over objects, as in first-order logic). For example, the following dependency states that, for every customer c, the prior probability of honesty is 0.99 true and 0.01 false:

Honest(c) ~ ⟨0.99, 0.01⟩

Similarly, we can state prior probabilities for the kindness value of each customer and the quality of each book, each on the 1-5 scale:

Kindness(c) ~ ⟨0.1, 0.1, 0.2, 0.3, 0.3⟩
Quality(b) ~ ⟨0.05, 0.2, 0.4, 0.2, 0.15⟩

Finally, we need the dependency for recommendations: for any customer c and book b, the score depends on the honesty and kindness of the customer and the quality of the book:

Recommendation(c, b) ~ RecCPT(Honest(c), Kindness(c), Quality(b))

where RecCPT is a separately defined conditional probability table with 2 × 5 × 5 = 50 rows, each with 5 entries. For the purposes of illustration, we’ll assume that an honest recommendation for a book of quality q from a person of kindness k is uniformly distributed in the range $[\lfloor (q+k)/2 \rfloor, \lceil (q+k)/2 \rceil]$.

The semantics of the RPM can be obtained by instantiating these dependencies for all known constants, giving a Bayesian network (as in Figure 15.2(b)) that defines a joint distribution over the RPM’s random variables.³ The set of possible worlds is the Cartesian product of the ranges of all the basic random variables, and, as with Bayesian networks, the probability for each possible world is the product of the relevant conditional probabilities from the model. With C customers and B books, there are C Honest variables, C Kindness variables, B Quality variables, and BC Recommendation variables, leading to $2^C 5^{C+B+BC}$ possible worlds. With ten million books and a billion customers, that’s about $10^{7 \times 10^{15}}$ worlds. Thanks to the expressive power of RPMs, the complete probability model still has fewer than 300 parameters, most of them in the RecCPT table.

We can refine the model by asserting a context-specific independence (see page 420) to reflect the fact that dishonest customers ignore quality when giving a recommendation; moreover, kindness plays no role in their decisions. Thus, Recommendation(c, b) is independent of Kindness(c) and Quality(b) when Honest(c) = false:

Recommendation(c, b) ~ if Honest(c) then HonestRecCPT(Kindness(c), Quality(b))
                       else ⟨0.4, 0.1, 0.0, 0.1, 0.4⟩.

This kind of dependency may look like an ordinary if-then-else statement in a programming language, but there is a key difference: the inference engine doesn’t necessarily know the value of the conditional test because Honest(c) is a random variable.
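Because the test is a random variable, the engine cannot simply take one branch; when honesty is unobserved it must combine both branches, weighted by the probability of each. The sketch below shows this for the recommendation score. It assumes that an honest recommendation is uniform over the integers from $\lfloor (q+k)/2 \rfloor$ to $\lceil (q+k)/2 \rceil$ (a reading of the illustrative range in the text that should be treated as an assumption), and the function names are hypothetical.

```python
import math

DISHONEST_DIST = [0.4, 0.1, 0.0, 0.1, 0.4]  # scores 1..5, as in the dependency statement

def honest_rec_dist(kindness, quality):
    """Assumed HonestRecCPT: uniform over floor((q+k)/2) .. ceil((q+k)/2)."""
    low = math.floor((quality + kindness) / 2)
    high = math.ceil((quality + kindness) / 2)
    support = range(low, high + 1)
    return [1.0 / len(support) if score in support else 0.0 for score in range(1, 6)]

def recommendation_dist(p_honest, kindness, quality):
    """P(Recommendation | Kindness, Quality) with the unobserved Honest summed out."""
    honest = honest_rec_dist(kindness, quality)
    return [p_honest * h + (1.0 - p_honest) * d for h, d in zip(honest, DISHONEST_DIST)]

mixed = recommendation_dist(p_honest=0.99, kindness=3, quality=5)
```

When `p_honest` is 0 the result is exactly the dishonest distribution, with no dependence on kindness or quality, which is the context-specific independence asserted above.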
We can elaborate this model in endless ways to make it more realistic. For example, suppose that an honest customer who is a fan of a book’s author always gives the book a 5, regardless of quality:

Recommendation(c, b) ~ if Honest(c) then
                           if Fan(c, Author(b)) then Exactly(5)
                           else HonestRecCPT(Kindness(c), Quality(b))
                       else ⟨0.4, 0.1, 0.0, 0.1, 0.4⟩

Again, the conditional test Fan(c, Author(b)) is unknown, but if a customer gives only 5s to a particular author’s books and is not otherwise especially kind, then the posterior probability that the customer is a fan of that author will be high. Furthermore, the posterior distribution will tend to discount the customer’s 5s in evaluating the quality of that author’s books.
In this example, we implicitly assumed that the value of Author(b) is known for every b, but this may not be the case. How can the system reason about whether, say, $C_1$ is a fan of $Author(B_2)$ when $Author(B_2)$ is unknown? The answer is that the system may have to reason about all possible authors. Suppose (to keep things simple) that there are just two authors, $A_1$ and $A_2$. Then $Author(B_2)$ is a random variable with two possible values, $A_1$ and $A_2$, and it is a parent of $Recommendation(C_1, B_2)$. The variables $Fan(C_1, A_1)$ and $Fan(C_1, A_2)$ are parents too. The conditional distribution for $Recommendation(C_1, B_2)$ is then essentially a multiplexer in which the $Author(B_2)$ parent acts as a selector to choose which of $Fan(C_1, A_1)$ and $Fan(C_1, A_2)$ actually gets to influence the recommendation. A fragment of the equivalent Bayes net is shown in Figure 15.3. Uncertainty in the value of $Author(B_2)$, which affects the dependency structure of the network, is an instance of relational uncertainty.

³ Some technical conditions are required for an RPM to define a proper distribution. First, the dependencies must be acyclic; otherwise the resulting Bayesian network will have cycles. Second, the dependencies must (usually) be well-founded: there can be no infinite ancestor chains, such as might arise from recursive dependencies. See Exercise 15.HAMD for an exception to this rule.

Figure 15.3 Fragment of the equivalent Bayes net for the book recommendation RPM when $Author(B_2)$ is unknown.

In case you are wondering how the system can possibly work out who the author of $B_2$ is: consider the possibility that three other customers are fans of $A_1$ (and have no other favorite authors in common) and all three have given $B_2$ a 5, even though most other customers find it quite dismal. In that case, it is extremely likely that $A_1$ is the author of $B_2$. The emergence of sophisticated reasoning like this from an RPM model of just a few lines is an intriguing example of how probabilistic influences spread through the web of interconnections among objects in the model. As more dependencies and more objects are added, the picture conveyed by the posterior distribution often becomes clearer and clearer.
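The three-fans argument can be made concrete with a direct application of Bayes' rule. The sketch below makes several simplifying assumptions that are not in the text: the three customers are known to be honest, their fan relationships are known, a fan always gives a 5, a non-fan gives this dismal book a 5 with probability 0.05, and the two authors are equally likely a priori. The names `author_posterior`, `C3`, `C4`, and `C5` are hypothetical.

```python
def author_posterior(prior, p_five_if_fan, p_five_otherwise, fans_of, raters):
    """Posterior over a book's unknown author, given that every rater gave it a 5."""
    unnormalized = {}
    for author, p in prior.items():
        likelihood = 1.0
        for customer in raters:
            is_fan = author in fans_of[customer]
            likelihood *= p_five_if_fan if is_fan else p_five_otherwise
        unnormalized[author] = p * likelihood
    total = sum(unnormalized.values())
    return {author: v / total for author, v in unnormalized.items()}

posterior = author_posterior(
    prior={"A1": 0.5, "A2": 0.5},
    p_five_if_fan=1.0,
    p_five_otherwise=0.05,
    fans_of={"C3": {"A1"}, "C4": {"A1"}, "C5": {"A1"}},
    raters=["C3", "C4", "C5"],
)
```

Each additional 5 from a fan of $A_1$ multiplies the odds in favor of $A_1$ by the same factor, so three such ratings are already close to decisive under these assumptions.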
15.1.2 Example: Rating player skill levels

Many competitive games have a numerical measure of players’ skill levels, sometimes called a rating. Perhaps the best-known is the Elo rating for chess players, which rates a typical beginner at around 800 and the world champion usually somewhere above 2800. Although Elo ratings have a statistical basis, they have some ad hoc elements. We can develop a Bayesian rating scheme as follows: each player i has an underlying skill level Skill(i); in each game g, i’s actual performance is Performance(i, g), which may vary from the underlying skill level; and the winner of g is the player whose performance in g is better. As an RPM, the model looks like this:

$Skill(i) \sim \mathcal{N}(\mu, \sigma^2)$
$Performance(i, g) \sim \mathcal{N}(Skill(i), \beta^2)$
$Win(i, j, g) =$ if $Game(g, i, j)$ then $(Performance(i, g) > Performance(j, g))$

where $\beta^2$ is the variance of a player’s actual performance in any specific game relative to the player’s underlying skill level. Given a set of players and games, as well as outcomes for some of the games, an RPM inference engine can compute a posterior distribution over the skill of each player and the probable outcome of any additional game that might be played.
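When both skill levels are known, the probability of a win follows directly from the model: the difference of the two performances is Gaussian with mean $Skill(i) - Skill(j)$ and variance $2\beta^2$. The sketch below evaluates that probability with the standard normal CDF; the function name and the example numbers are illustrative assumptions, and a full rating engine would also have to integrate over uncertainty in the skills.

```python
import math

def win_probability(skill_i, skill_j, beta):
    """P(Performance(i, g) > Performance(j, g)) when both skill levels are known."""
    # The performance difference has mean skill_i - skill_j and variance 2 * beta**2.
    z = (skill_i - skill_j) / (math.sqrt(2.0) * beta)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p_stronger_wins = win_probability(skill_i=30.0, skill_j=25.0, beta=4.0)
```

A larger $\beta$ pulls every such probability toward one half, reflecting games in which performance varies widely around skill.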
For team games, we’ll assume, as a first approximation, that the overall performance of team t in game g is the sum of the individual performances of the players on t:

$TeamPerformance(t, g) = \sum_{i \in t} Performance(i, g)$.

Even though the individual performances are not visible to the ratings engine, the players’ skill levels can still be estimated from the results of several games, as long as the team compositions vary across games. Microsoft’s TrueSkill™ ratings engine uses this model, along with an efficient approximate inference algorithm, to serve hundreds of millions of users every day.

This model can be elaborated in numerous ways. For example, we might assume that weaker players have higher variance in their performance; we might include the player’s role on the team; and we might consider specific kinds of performance and skill (e.g., defending and attacking) in order to improve team composition and predictive accuracy.
15.1.3 Inference in relational probability models
The most straightforward approach to inference in RPMs is simply to construct the equivalent Bayesian network, given the known constant symbols belonging to each type. With B books and C customers, the basic model given previously could be constructed with simple loops:⁴
  for b = 1 to B do
    add node Quality_b with no parents, prior ⟨0.05, 0.2, 0.4, 0.2, 0.15⟩
  for c = 1 to C do
    add node Honest_c with no parents, prior ⟨0.99, 0.01⟩
    add node Kindness_c with no parents, prior ⟨0.1, 0.1, 0.2, 0.3, 0.3⟩
    for b = 1 to B do
      add node Recommendation_{c,b} with parents Honest_c, Kindness_c, Quality_b
        and conditional distribution RecCPT(Honest_c, Kindness_c, Quality_b)
This technique is called grounding or unrolling; it is the exact analog of propositionalization for first-order logic (page 280). The obvious drawback is that the resulting Bayes net may be very large. Furthermore, if there are many candidate objects for an unknown relation or function (for example, the unknown author of B_2), then some variables in the network
may have many parents. Fortunately, it is often possible to avoid generating the entire implicit Bayes net. As we saw in the discussion of the variable elimination algorithm on page 433, every variable that is not an ancestor of a query variable or evidence variable is irrelevant to the query. Moreover, if the query is conditionally independent of some variable given the evidence, then that variable is also irrelevant. So, by chaining through the model starting from the query and evidence, we can identify just the set of variables that are relevant to the query. These are the only ones
that need to be instantiated to create a potentially tiny fragment of the implicit Bayes net.
Inference in this fragment gives the same answer as inference in the entire implicit Bayes net.
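The grounding loops and the ancestor-based relevance test can be sketched in a few lines of Python. The node-naming scheme and function names here are assumptions made for illustration. The sketch applies only the ancestor criterion (query and evidence variables plus their ancestors) and does not attempt the further pruning based on conditional independence given the evidence.

```python
def ground_network(num_books, num_customers):
    """Return {node: list of parents} for the unrolled book-recommendation model."""
    parents = {}
    for b in range(1, num_books + 1):
        parents[f"Quality_{b}"] = []
    for c in range(1, num_customers + 1):
        parents[f"Honest_{c}"] = []
        parents[f"Kindness_{c}"] = []
        for b in range(1, num_books + 1):
            parents[f"Recommendation_{c}_{b}"] = [
                f"Honest_{c}", f"Kindness_{c}", f"Quality_{b}"
            ]
    return parents

def relevant_nodes(parents, query, evidence):
    """Query and evidence variables together with all of their ancestors."""
    relevant, stack = set(), [query, *evidence]
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents[node])
    return relevant

net = ground_network(num_books=100, num_customers=1000)
fragment = relevant_nodes(
    net, "Quality_1", ["Recommendation_1_1", "Recommendation_2_1"]
)
print(len(net), "ground nodes;", len(fragment), "relevant to the query")
```

With 100 books and 1000 customers the full network has over a hundred thousand nodes, yet a query about one book given two recommendations touches only a handful of them.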
Another avenue for improving the efficiency of inference comes from the presence of repeated substructure in the unrolled Bayes net. This means that many of the factors constructed during variable elimination (and similar kinds of tables constructed by clustering algorithms) will be identical; effective caching schemes have yielded speedups of three orders of magnitude for large networks.
4 Several statistical packages would view this code as defining the RPM, rather than just constructing a Bayes net to perform inference in the RPM. This view, however, misses an important role for RPM syntax: without a syntax with clear semantics, there is no way the model structure can be learned from data.
Third, MCMC inference algorithms have some interesting properties when applied to RPMs with relational uncertainty. MCMC works by sampling complete possible worlds, so in each state the relational structure is completely known. In the example given earlier, each MCMC state would specify the value of Author(B_2), and so the other potential authors are no longer parents of the recommendation nodes for B_2. For MCMC, then, relational uncertainty causes no increase in network complexity; instead, the MCMC process includes transitions that change the relational structure, and hence the dependency structure, of the unrolled network.
Finally, it may be possible in some cases to avoid grounding the model altogether. Resolution theorem provers and logic programming systems avoid propositionalizing by instantiating the logical variables only as needed to make the inference go through; that is, they lift the inference process above the level of ground propositional sentences and make each lifted step do the work of many ground steps.
The same idea can be applied in probabilistic inference. For example, in the variable
elimination algorithm, a lifted factor can represent an entire set of ground factors that assign
probabilities to random variables in the RPM, where those random variables differ only in the
constant symbols used to construct them. The details of this method are beyond the scope of
this book, but references are given at the end of the chapter.
15.2 Open-Universe Probability Models
We argued earlier that database semantics was appropriate for situations in which we know exactly the set of relevant objects that exist and can identify them unambiguously. (In particular, all observations about an object are correctly associated with the constant symbol that names it.) In many real-world settings, however, these assumptions are simply untenable. For example, a book retailer might use an ISBN (International Standard Book Number) constant symbol to name each book, even though a given “logical” book (e.g., “Gone With the Wind”) may have several ISBNs corresponding to hardcover, paperback, large print, reissues, and so on. It would make sense to aggregate recommendations across multiple ISBNs, but the retailer may not know for sure which ISBNs are really the same book. (Note that we are not reifying the individual copies of the book, which might be necessary for used-book sales, car sales, and so on.) Worse still, each customer is identified by a login ID, but a dishonest customer may have thousands of IDs! In the computer security field, these multiple IDs are called sybils and their use to confound a reputation system is called a sybil attack.⁵ Thus, even a simple application in a relatively well-defined, online domain involves both existence uncertainty (what are the real books and customers underlying the observed data) and identity uncertainty (which logical terms really refer to the same object).
The phenomena of existence and identity uncertainty extend far beyond online booksellers. In fact they are pervasive:
• A vision system doesn’t know what exists, if anything, around the next corner, and may not know if the object it sees now is the same one it saw a few minutes ago.
• A text-understanding system does not know in advance the entities that will be featured in a text, and must reason about whether phrases such as “Mary,” “Dr. Smith,” “she,” “his cardiologist,” “his mother,” and so on refer to the same object.
• An intelligence analyst hunting for spies never knows how many spies there really are and can only guess whether various pseudonyms, phone numbers, and sightings belong to the same individual.
5 The name “Sybil” comes from a famous case of multiple personality disorder.
Indeed, a major part of human cognition seems to require learning what objects exist and being able to connect observations (which almost never come with unique IDs attached) to hypothesized objects in the world.
Thus, we need to be able to define an open universe probability model (OUPM) based on the standard semantics of first-order logic, as illustrated at the top of Figure 15.1. A language for OUPMs provides a way of easily writing such models while guaranteeing a unique, consistent probability distribution over the infinite space of possible worlds.
15.2.1 Syntax and semantics
The basic idea is to understand how ordinary Bayesian networks and RPMs manage to define a unique probability model and to transfer that insight to the first-order setting. In essence, a Bayes net generates each possible world, event by event, in the topological order defined by the network structure, where each event is an assignment of a value to a variable. An RPM extends this to entire sets of events, defined by the possible instantiations of the logical variables in a given predicate or function. OUPMs go further by allowing generative steps that add objects to the possible world under construction, where the number and type of objects may depend on the objects that are already in that world and their properties and relations. That is, the event being generated is not the assignment of a value to a variable, but the very existence of objects.
One way to do this in OUPMs is to provide number statements that specify conditional distributions over the numbers of objects of various kinds. For example, in the book-recommendation domain, we might want to distinguish between customers (real people) and their login IDs. (It’s actually login IDs that make recommendations, not customers!) Suppose (to keep things simple) the number of customers is uniform between 1 and 3 and the number of books is uniform between 2 and 4:
#Customer ~ UniformInt(1,3)
#Book ~ UniformInt(2,4).    (15.2)
We expect honest customers to have just one ID, whereas dishonest customers might have anywhere between 2 and 5 IDs:
#LoginID(Owner = c) ~ if Honest(c) then Exactly(1) else UniformInt(2,5).    (15.3)
This number statement specifies the distribution over the number of login IDs for which customer c is the Owner.
The Owner function is called an origin function because it says
where each object generated by this number statement came from.
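To make the effect of these number statements concrete, here is a minimal Python sketch that samples the objects of one possible world. Representing each object as a (type, origin, index) tuple is an assumption chosen to mirror the idea that objects record where they came from, and the 0.99 prior on honesty is the one used in the grounding example earlier in the chapter.

```python
import random

def sample_objects(rng):
    """Sample customers, books, and login IDs following (15.2) and (15.3)."""
    # Types without origin functions get an empty origin, written ().
    customers = [("Customer", (), k) for k in range(1, rng.randint(1, 3) + 1)]
    books = [("Book", (), k) for k in range(1, rng.randint(2, 4) + 1)]
    honest = {c: rng.random() < 0.99 for c in customers}
    logins = []
    for c in customers:
        n_ids = 1 if honest[c] else rng.randint(2, 5)
        # Each login ID records its origin: the customer that owns it.
        logins += [("LoginID", ("Owner", c), k) for k in range(1, n_ids + 1)]
    return customers, books, honest, logins

customers, books, honest, logins = sample_objects(random.Random(0))
print(len(customers), "customers,", len(books), "books,", len(logins), "login IDs")
```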
The example in the preceding paragraph uses a uniform distribution over the integers between 2 and 5 to specify the number of logins for a dishonest customer. This particular distribution is bounded, but in general there may not be an a priori bound on the number of objects. The most commonly used distribution over the nonnegative integers is the Poisson distribution. The Poisson has one parameter, λ, which is the expected number of objects, and a variable X sampled from Poisson(λ) has the following distribution:
P(X = k) = λ^k e^{−λ} / k!.
The variance of the Poisson is also λ, so the standard deviation is √λ. This means that for large values of λ, the distribution is narrow relative to the mean; for example, if the number of ants in a nest is modeled by a Poisson with a mean of one million, the standard deviation is only a thousand, or 0.1%. For large numbers, it often makes more sense to use the discrete log-normal distribution, which is appropriate when the log of the number of objects is normally distributed. A particularly intuitive form, which we call the order-of-magnitude distribution, uses logs to base 10: thus, a distribution OM(3,1) has a mean of 10³ and a standard deviation of one order of magnitude, i.e., the bulk of the probability mass falls between 10² and 10⁴.
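The following Python sketch checks the two numerical claims above: the relative spread of a Poisson with a mean of one million, and the fraction of order-of-magnitude samples that land within one order of magnitude of the center. Discretizing the log-normal by rounding to the nearest integer is an assumption made here for simplicity; the text does not specify the discretization.

```python
import math
import random

def poisson_pmf(k, lam):
    """P(X = k) = lam^k e^(-lam) / k!, computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_relative_std(lam):
    """Standard deviation divided by mean: sqrt(lam) / lam."""
    return math.sqrt(lam) / lam

def sample_order_of_magnitude(mean_log10, std_log10, rng):
    """Draw from OM(mean_log10, std_log10) as round(10 ** Normal(mean, std))."""
    return round(10 ** rng.gauss(mean_log10, std_log10))

print(f"relative std for one million ants: {poisson_relative_std(1e6):.4f}")

rng = random.Random(0)
draws = [sample_order_of_magnitude(3, 1, rng) for _ in range(20000)]
inside = sum(100 <= d <= 10000 for d in draws) / len(draws)
print(f"fraction of OM(3,1) draws between 10^2 and 10^4: {inside:.2f}")
```

Roughly two thirds of the draws fall between 10² and 10⁴, which is the sense in which the bulk of the mass lies within one order of magnitude of 10³.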
The formal semantics of OUPMs begins with a definition of the objects that populate possible worlds. In the standard semantics of typed first-order logic, objects are just numbered tokens with types. In OUPMs, each object is a generation history; for example, an object might be “the fourth login ID of the seventh customer.” (The reason for this slightly baroque construction will become clear shortly.) For types with no origin functions (e.g., the Customer and Book types in Equation (15.2)), the objects have an empty origin; for example, ⟨Customer, ,2⟩ refers to the second customer generated from that number statement. For number statements with origin functions (e.g., Equation (15.3)), each object records its origin; for example, the object ⟨LoginID, ⟨Owner, ⟨Customer, ,2⟩⟩, 3⟩ is the third login belonging to the second customer.
The number variables of an OUPM specify how many objects there are of each type with each possible origin in each possible world; thus #LoginID_⟨Owner,⟨Customer, ,2⟩⟩(ω) = 4 means that in world ω, customer 2 owns 4 login IDs. As in relational probability models, the basic random variables determine the values of predicates and functions for all tuples of objects; thus, Honest_⟨Customer, ,2⟩(ω) = true means that in world ω, customer 2 is honest. A possible world is defined by the values of all the number variables and basic random variables. A world may be generated from the model by sampling in topological order; Figure 15.4 shows an example. The probability of a world so constructed is the product of the probabilities for all the sampled values; in this case, 1.2672 × 10⁻¹¹. Now it becomes clear why each object contains its origin: this property ensures that every world can be constructed by exactly one generation sequence. If this were not the case, the probability of a world would be an unwieldy combinatorial sum over all possible generation sequences that create it.
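The product quoted for the world in Figure 15.4 can be checked directly. The sketch below multiplies the probabilities listed in that figure; the short variable labels (c1, b1, and so on) are abbreviations introduced here, and the two uniform choices are written as 1/3 rather than the rounded 0.3333.

```python
import math

# (variable, value, probability) triples read off Figure 15.4, in topological order.
sampled = [
    ("#Customer", 2, 1 / 3), ("#Book", 3, 1 / 3),
    ("Honest(c1)", True, 0.99), ("Honest(c2)", False, 0.01),
    ("Kindness(c1)", 4, 0.3), ("Kindness(c2)", 1, 0.1),
    ("Quality(b1)", 1, 0.05), ("Quality(b2)", 3, 0.4), ("Quality(b3)", 5, 0.15),
    ("#LoginID(Owner=c1)", 1, 1.0), ("#LoginID(Owner=c2)", 2, 0.25),
]
# Nine recommendation variables: three from the honest customer's single login
# (probability 0.5 each) and six from the dishonest customer's two logins (0.4 each).
recommendation_probs = [0.5] * 3 + [0.4] * 6

world_probability = (
    math.prod(p for _, _, p in sampled) * math.prod(recommendation_probs)
)
print(f"{world_probability:.4e}")
```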
Open-universe models may have infinitely many random variables, so the full theory involves nontrivial measure-theoretic considerations. For example, number statements with Poisson or order-of-magnitude distributions allow for unbounded numbers of objects, leading to unbounded numbers of random variables for the properties and relations of those objects. Moreover, OUPMs can have recursive dependencies and infinite types (integers, strings, etc.). Finally, well-formedness disallows cyclic dependencies and infinitely receding ancestor chains; these conditions are undecidable in general, but certain syntactic sufficient conditions can be checked easily.
Variable                                                              Value   Probability
#Customer                                                             2       0.3333
#Book                                                                 3       0.3333
Honest_⟨Customer, ,1⟩                                                 true    0.99
Honest_⟨Customer, ,2⟩                                                 false   0.01
Kindness_⟨Customer, ,1⟩                                               4       0.3
Kindness_⟨Customer, ,2⟩                                               1       0.1
Quality_⟨Book, ,1⟩                                                    1       0.05
Quality_⟨Book, ,2⟩                                                    3       0.4
Quality_⟨Book, ,3⟩                                                    5       0.15
#LoginID_⟨Owner,⟨Customer, ,1⟩⟩                                       1       1.0
#LoginID_⟨Owner,⟨Customer, ,2⟩⟩                                       2       0.25
Recommendation_⟨LoginID,⟨Owner,⟨Customer, ,1⟩⟩,1⟩,⟨Book, ,1⟩          2       0.5
Recommendation_⟨LoginID,⟨Owner,⟨Customer, ,1⟩⟩,1⟩,⟨Book, ,2⟩          4       0.5
Recommendation_⟨LoginID,⟨Owner,⟨Customer, ,1⟩⟩,1⟩,⟨Book, ,3⟩          5       0.5
Recommendation_⟨LoginID,⟨Owner,⟨Customer, ,2⟩⟩,1⟩,⟨Book, ,1⟩          5       0.4
Recommendation_⟨LoginID,⟨Owner,⟨Customer, ,2⟩⟩,1⟩,⟨Book, ,2⟩          5       0.4
Recommendation_⟨LoginID,⟨Owner,⟨Customer, ,2⟩⟩,1⟩,⟨Book, ,3⟩          1       0.4
Recommendation_⟨LoginID,⟨Owner,⟨Customer, ,2⟩⟩,2⟩,⟨Book, ,1⟩          5       0.4
Recommendation_⟨LoginID,⟨Owner,⟨Customer, ,2⟩⟩,2⟩,⟨Book, ,2⟩          5       0.4
Recommendation_⟨LoginID,⟨Owner,⟨Customer, ,2⟩⟩,2⟩,⟨Book, ,3⟩          1       0.4
Figure 15.4 One particular world for the book recommendation OUPM. The number variables and basic random variables are shown in topological order, along with their chosen values and the probabilities for those values.
15.2.2 Inference in open-universe probability models
Because of the potentially huge and sometimes unbounded size of the implicit Bayes net that corresponds to a typical OUPM, unrolling it fully and performing exact inference is quite impractical. Instead, we must consider approximate inference algorithms such as MCMC (see Section 13.4.2).
Roughly speaking, an MCMC algorithm for an OUPM is exploring the space of possible worlds defined by sets of objects and relations among them, as illustrated in Figure 15.1(top). A move between adjacent states in this space can not only alter relations and functions but also add or subtract objects and change the interpretations of constant symbols. Even though
each possible world may be huge, the probability computations required for each step (whether in Gibbs sampling or Metropolis–Hastings) are entirely local and in most cases take constant time. This is because the probability ratio between neighboring worlds depends on a subgraph of constant size around the variables whose values are changed. Moreover, a logical query can be evaluated incrementally in each world visited, usually in constant time per world, rather than being recomputed from scratch.
Some special consideration needs to be given to the fact that a typical OUPM may have possible worlds of infinite size. As an example, consider the multitarget tracking model in Figure 15.9: the function X(a,t), denoting the state of aircraft a at time t, corresponds to an infinite sequence of variables for an unbounded number of aircraft at each step. For this reason, MCMC for OUPMs samples not completely specified possible worlds but partial worlds, each corresponding to a disjoint set of complete worlds. A partial world is a minimal self-supporting instantiation⁶ of a subset of the relevant variables (that is, ancestors of the evidence and query variables). For example, variables X(a,t) for values of t greater than the last observation time (or the query time, whichever is greater) are irrelevant, so the algorithm can consider just a finite prefix of the infinite sequence.
15.2.3 Examples
The standard “use case” for an OUPM has three elements: the model, the evidence (the known facts in a given scenario), and the query, which may be any expression, possibly with free logical variables. The answer is a posterior joint probability for each possible set of substitutions for the free variables, given the evidence, according to the model.⁷ Every model includes type declarations, type signatures for the predicates and functions, one or more number statements for each type, and one dependency statement for each predicate and function. (In the examples below, declarations and signatures are omitted where the meaning is clear.) As in RPMs, dependency statements use an if-then-else syntax to handle context-specific dependencies.
Citation matching
Millions of academic research papers and technical reports are to be found online in the form of pdf files. Such papers usually contain a section near the end called “References” or “Bibliography,” in which citations—strings of characters—are provided to inform the reader
of related work. These strings can be located and “scraped” from the pdf files with the aim of
creating a database-like representation that relates papers and researchers by authorship and
citation links. Systems such as CiteSeer and Google Scholar present such a representation to their users; behind the scenes, algorithms operate to find papers, scrape the citation strings, and identify the actual papers to which the citation strings refer. This is a difficult task because these strings contain no object identifiers and include errors of syntax, spelling, punctuation,
and content. To illustrate this, here are two relatively benign examples:
1. [Lashkari et al 94] Collaborative Interface Agents, Yezdi Lashkari, Max Metral, and Pattie Maes, Proceedings of the Twelfth National Conference on Articial Intelligence, MIT Press, Cambridge, MA, 1994.
2. Metral M. Lashkari, Y. and P. Maes. Collaborative interface agents. In Conference of the American Association for Artificial Intelligence, Seattle, WA, August 1994.
The key question is one of identity: are these citations of the same paper or different papers? Asked this question, even experts disagree or are unwilling to decide, indicating that reasoning under uncertainty is going to be an important part of solving this problem.⁸ Ad hoc approaches, such as methods based on a textual similarity metric, often fail miserably. For example, in 2002, CiteSeer reported over 120 distinct books written by Russell and Norvig.
6 A self-supporting instantiation of a set of variables is one in which the parents of every variable in the set are also in the set.
7 As with Prolog, there may be infinitely many sets of substitutions of unbounded size; designing exploratory interfaces for such answers is an interesting visualization challenge.
8 The answer is yes, they are the same paper. The “National Conference on Articial Intelligence” (notice how the “fi” is missing, thanks to an error in scraping the ligature character) is another name for the AAAI conference; the conference took place in Seattle whereas the proceedings publisher is in Cambridge.
type Researcher, Paper, Citation
random String Name(Researcher)
random String Title(Paper)
random Paper CitedPaper(Citation)
random String Text(Citation)
random Boolean Professor(Researcher)
origin Researcher Author(Paper)
#Researcher ~ OM(3,1)
Name(r) ~ NamePrior()
Professor(r) ~ Boolean(0.2)
#Paper(Author = r) ~ if Professor(r) then OM(1.5,0.5) else OM(1,0.5)
Title(p) ~ PaperTitlePrior()
CitedPaper(c) ~ UniformChoice({Paper p})
Text(c) ~ HMMGrammar(Name(Author(CitedPaper(c))), Title(CitedPaper(c)))
Figure 15.5 An OUPM for citation information extraction. For simplicity the model assumes one author per paper and omits details of the grammar and error models.
In order to solve the problem using a probabilistic approach, we need a generative model for the domain. That is, we ask how these citation strings come to be in the world. The process begins with researchers, who have names. (We don’t need to worry about how the researchers came into existence; we just need to express our uncertainty about how many there are.) These researchers write some papers, which have titles; people cite the papers, combining the authors’ names and the paper’s title (with errors) into the text of the citation according to some grammar. The basic elements of this model are shown in Figure 15.5, covering the case where papers have just one author.⁹
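The generative story can be mimicked with a toy forward sampler in Python. Everything specific below is a stand-in chosen for illustration: the short name and title lists replace NamePrior and PaperTitlePrior, small uniform counts replace the order-of-magnitude number statements, and random character deletion replaces the HMM grammar and error model. Only the overall structure (researchers, then papers with an author origin, then citations of uniformly chosen papers) follows the model.

```python
import random

NAMES = ["Y. Lashkari", "M. Metral", "P. Maes"]                  # stand-in for NamePrior
TITLES = ["Collaborative interface agents", "Learning agents"]   # stand-in for PaperTitlePrior

def corrupt(text, rng, error_rate=0.05):
    """Toy error model: drop each character independently with probability error_rate."""
    return "".join(ch for ch in text if rng.random() >= error_rate)

def generate_citations(num_researchers, num_citations, rng):
    """Sample researchers, their papers, and noisy citation strings."""
    researchers = [
        {"name": rng.choice(NAMES), "professor": rng.random() < 0.2}
        for _ in range(num_researchers)
    ]
    papers = []
    for r in researchers:
        # Professors write more papers on average (assumed small counts).
        count = rng.randint(1, 3) if r["professor"] else rng.randint(0, 2)
        papers += [{"author": r, "title": rng.choice(TITLES)} for _ in range(count)]
    citations = []
    for _ in range(num_citations if papers else 0):
        cited = rng.choice(papers)  # CitedPaper(c) ~ UniformChoice({Paper p})
        text = corrupt(f'{cited["author"]["name"]}. {cited["title"]}.', rng)
        citations.append({"cited": cited, "text": text})
    return papers, citations

papers, citations = generate_citations(5, 10, random.Random(1))
for c in citations[:3]:
    print(c["text"])
```

Inference runs this story in reverse: given only the noisy strings, it must recover which papers and researchers exist and which citation refers to which paper.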
Given just citation strings as evidence, probabilistic inference on this model to pick out the most likely explanation for the data produces an error rate 2 to 3 times lower than CiteSeer’s (Pasula et al., 2003). The inference process also exhibits a form of collective, knowledge-driven disambiguation: the more citations for a given paper, the more accurately each of them is parsed, because the parses have to agree on the facts about the paper.
9 The multiauthor case has the same overall structure but is a bit more complicated. The parts of the model not shown (the NamePrior, PaperTitlePrior, and HMMGrammar) are traditional probability models. For example, the NamePrior is a mixture of a categorical distribution over actual names and a letter trigram model (see Section 23.1) to cover names not previously seen, both trained from data in the U.S. Census database.
Nuclear treaty monitoring
Verifying the Comprehensive Nuclear-Test-Ban Treaty requires finding all seismic events on Earth above a minimum magnitude. The UN CTBTO maintains a network of sensors, the International Monitoring System (IMS); its automated processing software, based on 100 years of seismology research, has a detection failure rate of about 30%. The NET-VISA system (Arora et al., 2013), based on an OUPM, significantly reduces detection failures.
#SeismicEvents ~ Poisson(T * λ_e)
Time(e) ~ UniformReal(0, T)
Earthquake(e) ~ Boolean(0.999)
Location(e) ~ if Earthquake(e) then SpatialPrior() else UniformEarth()
Depth(e) ~ if Earthquake(e) then UniformReal(0, 700) else Exactly(0)
Magnitude(e) ~ Exponential(log(10))
Detected(e,p,s) ~ Logistic(weights(s,p), Magnitude(e), Depth(e), Dist(e,s))
#Detections(site = s) ~ Poisson(T * λ_f(s))
#Detections(event = e, phase = p, station = s) = if Detected(e,p,s) then 1 else 0
OnsetTime(a,s) if (event(a) = null) then ~ UniformReal(0, T)
    else = Time(event(a)) + GeoTT(Dist(event(a),s), Depth(event(a)), phase(a)) + Laplace(μ_t(s), σ_t(s))
Amplitude(a,s) if (event(a) = null) then ~ NoiseAmpModel(s)
    else = AmpModel(Magnitude(event(a)), Dist(event(a),s), Depth(event(a)), phase(a))
Azimuth(a,s) if (event(a) = null) then ~ UniformReal(0, 360)
    else = GeoAzimuth(Location(event(a)), Depth(event(a)), phase(a), Site(s)) + Laplace(0, σ_a(s))
Slowness(a,s) if (event(a) = null) then ~ UniformReal(0, 20)
    else = GeoSlowness(Location(event(a)), Depth(event(a)), phase(a), Site(s)) + Laplace(0, σ_s(s))
ObservedPhase(a,s) ~ CategoricalPhaseModel(phase(a))
Figure 15.6 A simplified version of the NET-VISA model (see text).
The NET-VISA model (Figure 15.6) expresses the relevant geophysics directly. It describes distributions over the number of events in a given time interval (most of which are
naturally occurring) as well as over their time, magnitude, depth, and location. The locations of natural events are distributed according to a spatial prior that is trained (like other parts of the model) from historical data; man-made events are, by the treaty rules, assumed to occur uniformly over the surface of the Earth. At every station s, each phase (seismic wave type) p from an event e produces either 0 or 1 detections (above-threshold signals); the detection probability depends on the event magnitude and depth and its distance from the station. “False alarm” detections also occur according to a station-specific rate parameter. The measured arrival time, amplitude, and other properties of a detection d from a real event depend on the properties of the originating event and its distance from the station.
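The event-level part of this model is easy to sketch as a forward sampler. The Python code below covers only the number, time, type, depth, and magnitude of events, plus a logistic detection probability; locations and the per-detection measurements are omitted. The weight vector is a hypothetical stand-in for weights(s,p), and reading Exponential(log(10)) as an exponential with rate ln 10 (so that the chance of exceeding a given magnitude falls by a factor of ten per unit) is an assumption about the notation.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_events(duration, rate, rng):
    """Sample the seismic events occurring in a time interval of length duration."""
    events = []
    for _ in range(sample_poisson(duration * rate, rng)):
        earthquake = rng.random() < 0.999
        events.append({
            "time": rng.uniform(0.0, duration),
            "earthquake": earthquake,
            "depth": rng.uniform(0.0, 700.0) if earthquake else 0.0,
            "magnitude": rng.expovariate(math.log(10)),
        })
    return events

def detection_probability(magnitude, depth, distance, weights):
    """Logistic detection model with an assumed (bias, magnitude, depth, distance) weight vector."""
    w0, w_mag, w_depth, w_dist = weights
    score = w0 + w_mag * magnitude + w_depth * depth + w_dist * distance
    return 1.0 / (1.0 + math.exp(-score))

rng = random.Random(0)
events = sample_events(duration=10.0, rate=5.0, rng=rng)
print(len(events), "events sampled")
```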
Once trained, the model runs continuously. The evidence consists of detections (90% of which are false alarms) extracted from raw IMS waveform data, and the query typically asks for the most likely event history, or bulletin, given the data. Results so far are encouraging; for example, in 2009 the UN’s SEL3 automated bulletin missed 27.4% of the 27294 events in the magnitude range 3–4 while NET-VISA missed 11.1%. Moreover, comparisons with dense regional networks show that NET-VISA finds up to 50% more real events than the final bulletins produced by the UN’s expert seismic analysts. NET-VISA also tends to associate more detections with a given event, leading to more accurate location estimates (see Figure 15.7). As of January 1, 2018, NET-VISA has been deployed as part of the CTBTO monitoring pipeline.
Despite superficial differences, the two examples are structurally similar: there are unknown objects (papers, earthquakes) that generate percepts according to some physical process (citation, seismic propagation). The percepts are ambiguous as to their origin, but when multiple percepts are hypothesized to have originated with the same unknown object, that object’s properties can be inferred more accurately.
Figure 15.7 (a) Top: Example of seismic waveform recorded at Alice Springs, Australia. Bottom: the waveform after processing to detect the arrival times of seismic waves. Blue lines are the automatically detected arrivals; red lines are the true arrivals. (b) Location estimates for the DPRK nuclear test of February 12, 2013: UN CTBTO Late Event Bulletin (green triangle at top left); NET-VISA (blue square in center). The entrance to the underground test facility (small “x”) is 0.75 km from NET-VISA’s estimate. Contours show NET-VISA’s posterior location distribution. Courtesy of CTBTO Preparatory Commission.
The same structure and reasoning patterns hold for areas such as database deduplication
and natural language understanding. In some cases, inferring an object’s existence involves
grouping percepts together—a process that resembles the clustering task in machine learning.
In other cases, an object may generate no percepts at all and still have its existence inferred— as happened, for example, when observations of Uranus led to the discovery of Neptune. The
existence of the unobserved object follows from its effects on the behavior and properties of
observed objects.
15.3 Keeping Track of a Complex World
Chapter 14 considered the problem of keeping track of the state of the world, but covered only the case of atomic representations (HMMs)
and factored representations (DBNs and
Kalman filters). This makes sense for worlds with a single object—perhaps a single patient
in the intensive care unit or a single bird flying through the forest. In this section, we see what
happens when two or more objects generate the observations. What makes this case different
from plain old state estimation is that there is now the possibility of uncertainty about which
object generated which observation. This is the identity uncertainty problem of Section 15.2 (page 507), now viewed in a temporal context. In the control theory literature, this is the data association problem, that is, the problem of associating observation data with the objects that generated them. Although we could view this as yet another example of open-universe probabilistic modeling, it is important enough in practice to deserve its own section.
[Figure: tracking diagram with labels “track initiation,” “track termination,” “detection failure,” and “false alarm.”]
0; a positive affine transformation.³ This fact was noted in Chapter 5 (page 167) for two-player games of chance; here, we see that it applies to all kinds of decision scenarios.
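A small Python check illustrates why a positive affine transformation leaves behavior unchanged. It assumes the transformation has the usual form U′(S) = aU(S) + b with a > 0; the outcomes, utilities, and lotteries are invented for the example. Because expected utility is linear, every action’s expected utility is rescaled by the same a and shifted by the same b, so the ranking of actions is preserved.

```python
def expected_utility(lottery, utility):
    """lottery is a list of (probability, outcome) pairs."""
    return sum(p * utility[outcome] for p, outcome in lottery)

def best_action(actions, utility):
    """Return the action whose lottery has the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(actions[a], utility))

utility = {"win": 10.0, "draw": 4.0, "lose": 0.0}
actions = {
    "safe": [(1.0, "draw")],
    "risky": [(0.5, "win"), (0.5, "lose")],
}

a, b = 3.0, -7.0  # any a > 0 and any b
transformed = {state: a * u + b for state, u in utility.items()}

print(best_action(actions, utility), best_action(actions, transformed))
```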
As in game-playing, in a deterministic environment an agent needs only a preference ranking on states; the numbers don’t matter. This is called a value function or ordinal utility function.
It is important to remember that the existence of a utility function that describes an agent’s
preference behavior does not necessarily mean that the agent is explicitly maximizing that utility function in its own deliberations. As we showed in Chapter 2, rational behavior can be
generated in any number of ways. A rational agent might be implemented with a table lookup
(if the number of possible states is small enough).
By observing a rational agent’s behavior, an observer can learn about the utility function
that represents what the agent is actually trying to achieve (even if the agent doesn’t know it). We return to this point in Section 16.7.
16.3 Utility Functions
Utility functions map from lotteries to real numbers. We know they must obey the axioms
of orderability, transitivity, continuity, substitutability, monotonicity, and decomposability. Is
that all we can say about utility functions? Strictly speaking, that is it: an agent can have any preferences it likes. For example, an agent might prefer to have a prime number of dollars in its bank account; in which case, if it had $16 it would give away $3. This might be unusual,
but we can’t call it irrational. An agent might prefer a dented 1973 Ford Pinto to a shiny new
Mercedes. The agent might prefer prime numbers of dollars only when it owns the Pinto, but when it owns the Mercedes, it might prefer more dollars to fewer. Fortunately, the preferences
of real agents are usually more systematic and thus easier to deal with.
3 In this sense, utilities resemble temperatures: a temperature in Fahrenheit is 1.8 times the Celsius temperature plus 32, but converting from one to the other doesn’t make you hotter or colder.
Utility assessment and utility scales
If we want to build a decision-theoretic system that helps a human make decisions or acts on his or her behalf, we must first work out what the human’s utility function is. This process, often called preference elicitation, involves presenting choices to the human and using the observed preferences to pin down the underlying utility function.
Equation (16.2) says that there is no absolute scale for utilities, but it is helpful, nonetheless, to establish some scale on which utilities can be recorded and compared for any particular problem. A scale can be established by fixing the utilities of any two particular outcomes, just as we fix a temperature scale by fixing the freezing point and boiling point of water. Typically, we fix the utility of a “best possible prize” at U(S) = u_⊤ and a “worst possible catastrophe” at U(S) = u_⊥. (Both of these should be finite.) Normalized utilities use a scale with u_⊥ = 0 and u_⊤ = 1. With such a scale, an England fan might assign a utility of 1 to England winning the World Cup and a utility of 0 to England failing to qualify.
Given a utility scale between u_⊤ and u_⊥, we can assess the utility of any particular prize S by asking the agent to choose between S and a standard lottery [p, u_⊤; (1 − p), u_⊥]. The probability p is adjusted until the agent is indifferent between S and the standard lottery. Assuming normalized utilities, the utility of S is given by p. Once this is done for each prize, the utilities for all lotteries involving those prizes are determined. Suppose, for example, we want to know how much our England fan values the outcome of England reaching the semifinal and then losing. We compare that outcome to a standard lottery with probability p of winning the trophy and probability 1 − p of an ignominious failure to qualify. If there is indifference at p = 0.3, then 0.3 is the value of reaching the semifinal and then losing.
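The adjustment of p until indifference can be pictured as a bisection search. In the Python sketch below, the agent is simulated by a callback that reports whether it prefers the sure prize to the standard lottery at a given p; the callback, the hidden utility of 0.3, and the tolerance are assumptions for illustration, and a real elicitation procedure would have to cope with noisy and inconsistent human answers.

```python
def elicit_utility(prefers_prize, tolerance=1e-6):
    """Find the indifference probability p by bisection.

    prefers_prize(p) must return True when the agent prefers the sure prize S
    to the standard lottery that gives the best outcome with probability p
    and the worst outcome with probability 1 - p.
    """
    low, high = 0.0, 1.0
    while high - low > tolerance:
        p = (low + high) / 2.0
        if prefers_prize(p):
            low = p   # lottery not yet attractive enough: raise p
        else:
            high = p  # lottery preferred: lower p
    return (low + high) / 2.0

# Simulated agent whose hidden normalized utility for the prize is 0.3.
hidden_utility = 0.3
estimate = elicit_utility(lambda p: hidden_utility > p)
print(f"elicited utility: {estimate:.4f}")
```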
In medical, transportation, environmental and other decision problems, people’s lives are at stake. (Yes, there are things more important than England’s fortunes in the World Cup.) In such cases, u_⊥ is the value assigned to immediate death (or in the really worst cases, many deaths). Although nobody feels comfortable with putting a value on human life, it is a fact that tradeoffs on matters of life and death are made all the time. Aircraft are given a complete overhaul at intervals, rather than after every trip. Cars are manufactured in a way that trades off costs against accident survival rates. We tolerate a level of air pollution that kills four million people a year.
Paradoxically, a refusal to put a monetary value on life can mean that life is undervalued. Ross Shachter describes a government agency that commissioned a study on removing asbestos from schools. The decision analysts performing the study assumed a particular dollar value for the life of a school-age child, and argued that the rational choice under that assumption was to remove the asbestos. The agency, morally outraged at the idea of setting the value of a life, rejected the report out of hand. It then decided against asbestos removal, implicitly asserting a lower value for the life of a child than that assigned by the analysts.
Currently several agencies of the U.S. government, including the Environmental Protection Agency, the Food and Drug Administration, and the Department of Transportation, use the value of a statistical life to determine the costs and benefits of regulations and interventions. Typical values in 2019 are roughly $10 million.

Some attempts have been made to find out the value that people place on their own lives. One common “currency” used in medical and safety analysis is the micromort, a one in a million chance of death. If you ask people how much they would pay to avoid a risk (for example, to avoid playing Russian roulette with a million-barreled revolver) they will respond with very large numbers, perhaps tens of thousands of dollars, but their actual behavior reflects a much lower monetary value for a micromort.

For example, in the UK, driving in a car for 230 miles incurs a risk of one micromort. Over the life of your car (say, 92,000 miles) that’s 400 micromorts. People appear to be willing to pay about $12,000 more for a safer car that halves the risk of death. Thus, since halving the risk removes 200 of those micromorts, their car-buying action says they have a value of $12,000/200 = $60 per micromort. A number of studies have confirmed a figure in this range across many individuals and risk types. However, government agencies such as the U.S. Department of Transportation typically set a lower figure; they will spend only about $6 in road repairs per expected life saved. Of course, these calculations hold only for small risks. Most people won’t agree to kill themselves, even for $60 million.

Another measure is the QALY, or quality-adjusted life year. Patients are willing to accept a shorter life expectancy to avoid a disability. For example, kidney patients on average are indifferent between living two years on dialysis and one year at full health.

16.3.2 The utility of money
Utility theory has its roots in economics, and economics provides one obvious candidate for a utility measure: money (or more specifically, an agent’s total net assets). The almost universal exchangeability of money for all kinds of goods and services suggests that money plays a significant role in human utility functions.

It will usually be the case that an agent prefers more money to less, all other things being equal. We say that the agent exhibits a monotonic preference for more money. This does not mean that money behaves as a utility function, because it says nothing about preferences between lotteries involving money.
Suppose you have triumphed over the other competitors in a television game show. The host now offers you a choice: either you can take the $1,000,000 prize or you can gamble it on the flip of a coin. If the coin comes up heads, you end up with nothing, but if it comes up tails, you get $2,500,000. If you’re like most people, you would decline the gamble and pocket the million. Are you being irrational?

Assuming the coin is fair, the expected monetary value (EMV) of the gamble is ½($0) + ½($2,500,000) = $1,250,000, which is more than the original $1,000,000. But that does not necessarily mean that accepting the gamble is a better decision. Suppose we use \(S_n\) to denote the state of possessing total wealth $n, and that your current wealth is $k. Then the expected utilities of the two actions of accepting and declining the gamble are

\[ EU(\mathit{Accept}) = \tfrac{1}{2}\,U(S_k) + \tfrac{1}{2}\,U(S_{k+2{,}500{,}000}) \]
\[ EU(\mathit{Decline}) = U(S_{k+1{,}000{,}000}) \]

To determine what to do, we need to assign utilities to the outcome states. Utility is not directly proportional to monetary value, because the utility for your first million is very high (or so they say), whereas the utility for an additional million is smaller. Suppose you assign a utility of 5 to your current financial status (\(S_k\)), a 9 to the state \(S_{k+2{,}500{,}000}\), and an 8 to the state \(S_{k+1{,}000{,}000}\). Then the rational action would be to decline, because the expected utility of accepting is only 7 (less than the 8 for declining). On the other hand, a billionaire would most likely have a utility function that is locally linear over the range of a few million more, and thus would accept the gamble.
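The arithmetic of the game-show example can be checked with a few lines of code. This is a small sketch, not part of the original text; the helper name `expected_value` and the dictionary `utility` (which simply records the utilities 5, 8, and 9 assumed above, keyed by the amount gained) are illustrative choices.

```python
def expected_value(lottery, value=lambda outcome: outcome):
    """Expectation of value(outcome) over a lottery given as (probability, outcome) pairs."""
    return sum(p * value(outcome) for p, outcome in lottery)


# Outcomes are the dollars gained on top of current wealth k.
gamble = [(0.5, 0), (0.5, 2_500_000)]
sure_thing = [(1.0, 1_000_000)]

# Utilities assumed in the text for S_k, S_{k+1,000,000}, S_{k+2,500,000}.
utility = {0: 5, 1_000_000: 8, 2_500_000: 9}

emv_gamble = expected_value(gamble)                   # expected monetary value
eu_accept = expected_value(gamble, utility.get)       # expected utility of gambling
eu_decline = expected_value(sure_thing, utility.get)  # utility of the sure million

print(emv_gamble, eu_accept, eu_decline)
```

The gamble wins on expected money but loses on expected utility, which is exactly the gap between the two measures that the example is meant to expose.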
Figure 16.2 The utility of money. (a) Empirical data for Mr. Beard over a limited range. (b) A typical curve for the full range.

In a pioneering study of actual utility functions, Grayson (1960) found that the utility of money was almost exactly proportional to the logarithm of the amount. (This idea was first suggested by Bernoulli (1738); see Exercise 16.STPT.) One particular utility curve, for a certain Mr. Beard, is shown in Figure 16.2(a). The data obtained for Mr. Beard’s preferences are consistent with a utility function

\[ U(S_{k+n}) = -263.31 + 22.09 \log(n + 150{,}000) \]

for the range between n = −$150,000 and n = $800,000.
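To get a feel for what a logarithmic curve of this kind implies, the following sketch evaluates the fitted function on a made-up 50/50 lottery and inverts it to find the sure amount that has the same utility (often called a certainty equivalent). Two things here are assumptions rather than facts from the text: the logarithm is taken to be the natural log (which makes the utility of n = 0 come out close to zero), and the lottery itself is purely illustrative.

```python
import math


def beard_utility(n):
    """Fitted utility of a wealth change of n dollars (valid roughly for -150,000 < n <= 800,000).

    Assumption: 'log' is the natural logarithm; the text does not state the base.
    """
    return -263.31 + 22.09 * math.log(n + 150_000)


def beard_inverse(u):
    """Wealth change n whose fitted utility equals u (inverse of beard_utility)."""
    return math.exp((u + 263.31) / 22.09) - 150_000


# Hypothetical lottery: 50% chance of gaining nothing, 50% chance of gaining $500,000.
lottery = [(0.5, 0), (0.5, 500_000)]

emv = sum(p * n for p, n in lottery)
expected_utility = sum(p * beard_utility(n) for p, n in lottery)
certainty_equivalent = beard_inverse(expected_utility)

print(f"EMV = {emv:,.0f}, certainty equivalent = {certainty_equivalent:,.0f}")
```

Under these assumptions the sure amount that matches the lottery's utility is well below the lottery's expected monetary value, which is what a concave utility curve predicts.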
We should not assume that this is the definitive utility function for monetary value, but it is likely that most people have a utility function that is concave for positive wealth. Going into debt is bad, but preferences between different levels of debt can display a reversal of the concavity associated with positive wealth. For example, someone already $10,000,000 in debt might well accept a gamble on a fair coin with a gain of $10,000,000 for heads and a loss of $20,000,000 for tails. This yields the S-shaped curve shown in Figure 16.2(b).

If we restrict our attention to the positive part of the curves, where the slope is decreasing, then for any lottery L, the utility of being faced with that lottery is less than the utility of being handed the expected monetary value of the lottery as a sure thing:
\[ U(L) < U(S_{EMV(L)}) \]

If r > 0, then life is positively enjoyable and the agent avoids both exits. As long as the actions in (4,1), (3,2), and (3,3) are as shown, every policy is optimal, and the agent obtains infinite total reward because it never enters a terminal state. It turns out that there are nine optimal policies in all for various ranges of r; Exercise 17.THRC asks you to find them.
The introduction of uncertainty brings MDPs closer to the real world than deterministic search problems. For this reason, MDPs have been studied in several fields, including AI, operations research, economics, and control theory. Dozens of solution algorithms have been proposed, several of which we discuss in Section 17.2. First, however, we spell out in more detail the definitions of utilities, optimal policies, and models for MDPs.
17.1.1 Utilities over time
In the MDP example in Figure 17.1, the performance of the agent was measured by a sum of rewards for the transitions experienced. This choice of performance measure is not arbitrary, but it is not the only possibility for the utility function² on environment histories, which we write as \(U_h([s_0, a_0, s_1, a_1, \ldots, s_n])\).

² In this chapter we use U for the utility function (to be consistent with the rest of the book), but many works about MDPs use V (for value) instead.
−0.0274 0. It is easy to see, from the definition of utilities as discounted sums of rewards, that a similar transformation of rewards will leave the optimal policy unchanged in an MDP:

\[ R'(s,a,s') = m\,R(s,a,s') + b . \]

It turns out, however, that the additive reward decomposition of utilities leads to significantly more freedom in defining rewards. Let \(\Phi(s)\) be any function of the state s. Then, according to the shaping theorem, the following transformation leaves the optimal policy unchanged:

\[ R'(s,a,s') = R(s,a,s') + \gamma\,\Phi(s') - \Phi(s) . \tag{17.9} \]

To show that this is true, we need to prove that two MDPs, M and M′, have identical optimal policies as long as they differ only in their reward functions as specified in Equation (17.9). We start from the Bellman equation for Q, the Q-function for MDP M:

\[ Q(s,a) = \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma \max_{a'} Q(s',a')\bigr] . \]

Now let \(Q'(s,a) = Q(s,a) - \Phi(s)\) and plug it into this equation; we get

\[ Q'(s,a) + \Phi(s) = \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma \max_{a'} \bigl(Q'(s',a') + \Phi(s')\bigr)\bigr] , \]

which then simplifies to

\[ Q'(s,a) = \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma\,\Phi(s') - \Phi(s) + \gamma \max_{a'} Q'(s',a')\bigr] \]
\[ \phantom{Q'(s,a)} = \sum_{s'} P(s' \mid s,a)\,\bigl[R'(s,a,s') + \gamma \max_{a'} Q'(s',a')\bigr] . \]

In other words, \(Q'(s,a)\) satisfies the Bellman equation for MDP M′. Now we can extract the optimal policy for M′ using Equation (17.7):

\[ \pi^*_{M'}(s) = \operatorname*{argmax}_a Q'(s,a) = \operatorname*{argmax}_a \bigl(Q(s,a) - \Phi(s)\bigr) = \operatorname*{argmax}_a Q(s,a) = \pi^*_M(s) . \]
The function \(\Phi(s)\) is often called a potential, by analogy to the electrical potential (voltage) that gives rise to electric fields. The term \(\gamma\,\Phi(s') - \Phi(s)\) functions as a gradient of the potential. Thus, if \(\Phi(s)\) has higher value in states that have higher utility, the addition of \(\gamma\,\Phi(s') - \Phi(s)\) to the reward has the effect of leading the agent “uphill” in utility.

At first sight, it may seem rather counterintuitive that we can modify the reward in this way without changing the optimal policy. It helps if we remember that all policies are optimal with a reward function that is zero everywhere. This means, according to the shaping theorem, that all policies are optimal for any potential-based reward of the form \(R(s,a,s') = \gamma\,\Phi(s') - \Phi(s)\). Intuitively, this is because with such a reward it doesn’t matter which way the agent goes from A to B. (This is easiest to see when \(\gamma = 1\): along any path the sum of rewards collapses to \(\Phi(B) - \Phi(A)\), so all paths are equally good.) So adding a potential-based reward to any other reward shouldn’t change the optimal policy.
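The claim can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the construction used in the proof: it builds a random five-state, three-action MDP (all sizes, the seed, and the helper names `q_iteration`, `greedy`, and `random_distribution` are arbitrary choices), computes Q-functions for the original and the shaped rewards by repeated Bellman backups, and compares them.

```python
import random


def q_iteration(P, R, gamma, sweeps=500):
    """Approximate the optimal Q-function by repeated Bellman backups.

    P[s][a][s2] is a transition probability and R[s][a][s2] the reward.
    """
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        Q = [[sum(P[s][a][s2] * (R[s][a][s2] + gamma * max(Q[s2]))
                  for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
    return Q


def greedy(Q):
    """Index of the best action in each state."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


def random_distribution(k):
    weights = [random.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


random.seed(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = [[random_distribution(n_states) for _ in range(n_actions)]
     for _ in range(n_states)]
R = [[[random.uniform(-1, 1) for _ in range(n_states)]
      for _ in range(n_actions)] for _ in range(n_states)]
phi = [random.uniform(-5, 5) for _ in range(n_states)]  # arbitrary potential

# Shaped reward from Equation (17.9): R' = R + gamma * phi(s') - phi(s).
R_shaped = [[[R[s][a][s2] + gamma * phi[s2] - phi[s] for s2 in range(n_states)]
             for a in range(n_actions)] for s in range(n_states)]

Q = q_iteration(P, R, gamma)
Q_shaped = q_iteration(P, R_shaped, gamma)

max_gap = max(abs(Q_shaped[s][a] - (Q[s][a] - phi[s]))
              for s in range(n_states) for a in range(n_actions))
print(greedy(Q) == greedy(Q_shaped), max_gap)
```

If the theorem holds, the two greedy policies coincide and the shaped Q-values differ from the original ones by exactly \(\Phi(s)\), up to floating-point error.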
The flexibility afforded by the shaping theorem means that we can actually help out the agent by making the immediate reward more directly reflect what the agent should do. In fact, if we set \(\Phi(s) = U(s)\), then the greedy policy \(\pi_G\) with respect to the modified reward R′ is also an optimal policy:

\[ \pi_G(s) = \operatorname*{argmax}_a \sum_{s'} P(s' \mid s,a)\,R'(s,a,s') \]
\[ \phantom{\pi_G(s)} = \operatorname*{argmax}_a \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma\,\Phi(s') - \Phi(s)\bigr] \]
\[ \phantom{\pi_G(s)} = \operatorname*{argmax}_a \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma\,U(s') - U(s)\bigr] \]
\[ \phantom{\pi_G(s)} = \operatorname*{argmax}_a \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma\,U(s')\bigr] \qquad \text{(by Equation (17.4))}. \]

Of course, in order to set \(\Phi(s) = U(s)\), we would need to know U(s); so there is no free lunch, but there is still considerable value in defining a reward function that is helpful to the extent possible. This is precisely what animal trainers do when they provide a small treat to the animal for each step in the target sequence.
17.1.4 Representing MDPs
The simplest way to represent \(P(s' \mid s,a)\) and \(R(s,a,s')\) is with big, three-dimensional tables of size \(|S|^2|A|\). This is fine for small problems such as the 4 × 3 world, for which the tables have \(11^2 \times 4 = 484\) entries each. In some cases, the tables are sparse (most entries are zero because each state s can transition to only a bounded number of states s′), which means the tables are of size \(O(|S||A|)\). For larger problems, even sparse tables are far too big.
Just as in Chapter 16, where Bayesian networks were extended with action and utility nodes to create decision networks, we can represent MDPs by extending dynamic Bayesian networks (DBNs, see Chapter 14) with decision, reward, and utility nodes to create dynamic decision networks, or DDNs. DDNs are factored representations in the terminology of Chapter 2; they typically have an exponential complexity advantage over atomic representations and can model quite substantial real-world problems.

Figure 17.4 A dynamic decision network for a mobile robot with state variables for battery level, charging status, location, and velocity, and action variables for the left and right wheel motors and for charging.
Figure 17.4, which is based on the DBN in Figure 14.13(b) (page 486), shows some elements of a slightly realistic model for a mobile robot that can charge itself. The state \(S_t\) is decomposed into four state variables:
• \(\mathbf{X}_t\) consists of the two-dimensional location on a grid plus the orientation;
• \(\dot{\mathbf{X}}_t\) is the rate of change of \(\mathbf{X}_t\);
• \(\mathit{Charging}_t\) is true when the robot is plugged in to a power source;
• \(\mathit{Battery}_t\) is the battery level, which we model as an integer in the range 0,

The state space for the MDP is the Cartesian product of the ranges of these four variables. The action is now a set \(\mathbf{A}_t\) of action variables, consisting of Plug/Unplug, which has three values (plug, unplug, and noop); LeftWheel for the power sent to the left wheel; and RightWheel for the power sent to the right wheel. The set of actions for the MDP is the Cartesian product of the ranges of these three variables. Notice that each action variable affects only a subset of the state variables.

The overall transition model is the conditional distribution \(P(\mathbf{X}_{t+1} \mid \mathbf{X}_t, \mathbf{A}_t)\), which can be computed as a product of conditional probabilities from the DDN. The reward here is a single variable that depends only on the location X (for, say, arriving at a destination) and Charging, as the robot has to pay for electricity used; in this particular model, the reward doesn’t depend on the action or the outcome state.
The network in Figure 17.4 has been projected three steps into the future. Notice that the network includes nodes for the rewards for t, t + 1, and t + 2, but the utility for t + 3. This is because the agent must maximize the (discounted) sum of all future rewards, and \(U(\mathbf{X}_{t+3})\) stands in for all rewards from t + 3 onwards. If a heuristic approximation to U is available, it can be included in the MDP representation in this way and used in lieu of further expansion. This approach is closely related to the use of bounded-depth search and heuristic evaluation functions for games in Chapter 5.

Figure 17.5 (a) The game of Tetris. The T-shaped piece at the top center can be dropped in any orientation and in any horizontal position. If a row is completed, that row disappears and the rows above it move down, and the agent receives one point. The next piece (here, the L-shaped piece at top right) becomes the current piece, and a new next piece appears, chosen at random from the seven piece types. The game ends if the board fills up to the top. (b) The DDN for the Tetris MDP.
Another interesting and well-studied MDP is the game of Tetris (Figure 17.5(a)). The state variables for the game are the CurrentPiece, the NextPiece, and a bit-vector-valued variable Filled with one bit for each of the 10 × 20 board locations. Thus, the state space has \(7 \times 7 \times 2^{200} \approx 10^{62}\) states. The DDN for Tetris is shown in Figure 17.5(b). Note that \(\mathit{Filled}_{t+1}\) is a deterministic function of \(\mathit{Filled}_t\) and \(A_t\). It turns out that every policy for Tetris is proper (reaches a terminal state): eventually the board fills despite one’s best efforts to empty it.
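The size of the Tetris state space is easy to confirm with exact integer arithmetic. The variable names below are illustrative only.

```python
import math

piece_types = 7        # possible values of CurrentPiece and of NextPiece
board_cells = 10 * 20  # one bit of Filled per board location

n_states = piece_types * piece_types * 2 ** board_cells
order_of_magnitude = math.log10(n_states)

print(f"about 10^{order_of_magnitude:.1f} states")
```

An atomic table with one entry per state is clearly out of reach at this scale, which is the motivation for the factored DDN representation.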
17.2 Algorithms for MDPs
In this section, we present four different algorithms for solving MDPs. The first three, value iteration, policy iteration, and linear programming, generate exact solutions offline. The fourth is a family of online approximate algorithms that includes Monte Carlo planning.
17.2.1 Value Iteration
The Bellman equation (Equation (17.5)) is the basis of the value iteration algorithm for solving MDPs. If there are n possible states, then there are n Bellman equations, one for each state. The n equations contain n unknowns: the utilities of the states. So we would like to solve these simultaneous equations to find the utilities. There is one problem: the equations are nonlinear, because the “max” operator is not a linear operator. Whereas systems of linear equations can be solved quickly using linear algebra techniques, systems of nonlinear equations are more problematic. One thing to try is an iterative approach. We start with arbitrary initial values for the utilities, calculate the right-hand side of the equation, and plug it into the left-hand side, thereby updating the utility of each state from the utilities of its neighbors. We repeat this until we reach an equilibrium.

function VALUE-ITERATION(mdp, ε) returns a utility function
    inputs: mdp, an MDP with states S, actions A(s), transition model P(s′ | s, a),
                rewards R(s, a, s′), discount γ
            ε, the maximum error allowed in the utility of any state
    local variables: U, U′, vectors of utilities for states in S, initially zero
                     δ, the maximum relative change in the utility of any state
    repeat
        U ← U′; δ ← 0
        for each state s in S do
            U′[s] ← max_{a ∈ A(s)} Q-VALUE(mdp, s, a, U)
            if |U′[s] − U[s]| > δ then δ ← |U′[s] − U[s]|
    until δ < ε(1 − γ)/γ
    return U

Figure 17.6 The value iteration algorithm for calculating utilities of states. The termination condition is from Equation (17.12).
Let \(U_i(s)\) be the utility value for state s at the ith iteration. The iteration step, called a Bellman update, looks like this:

\[ U_{i+1}(s) \leftarrow \max_{a \in A(s)} \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma\,U_i(s')\bigr] , \tag{17.10} \]

where the update is assumed to be applied simultaneously to all the states at each iteration. If we apply the Bellman update infinitely often, we are guaranteed to reach an equilibrium (see “convergence of value iteration” below), in which case the final utility values must be solutions to the Bellman equations. In fact, they are also the unique solutions, and the corresponding policy (obtained using Equation (17.4)) is optimal. The detailed algorithm, including a termination condition when the utilities are “close enough,” is shown in Figure 17.6.
Notice that we make use of the Q-VALUE function defined on page 569.

We can apply value iteration to the 4 × 3 world in Figure 17.1(a). Starting with initial values of zero, the utilities evolve as shown in Figure 17.7(a). Notice how the states at different distances from (4,3) accumulate negative reward until a path is found to (4,3), whereupon the utilities start to increase. We can think of the value iteration algorithm as propagating information through the state space by means of local updates.
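A compact Python rendering of the algorithm may help make the update concrete. It is a sketch under stated assumptions rather than a transcription of the book's code: the MDP is represented as a nested dictionary of (probability, next state, reward) triples, terminal states are modeled as states with no actions, the stopping test uses `<=` so that it can also fire when the change is exactly zero, and the three-state `toy_mdp` is invented for illustration (it is not the 4 × 3 world).

```python
def q_value(mdp, s, a, U, gamma):
    """Expected utility of taking action a in state s, given utility estimates U."""
    return sum(p * (r + gamma * U[s2]) for p, s2, r in mdp[s][a])


def value_iteration(mdp, gamma, epsilon):
    """mdp[s][a] is a list of (probability, next_state, reward) triples.

    A state with no actions is treated as terminal and keeps utility 0.
    """
    U_next = {s: 0.0 for s in mdp}
    while True:
        U = dict(U_next)          # simultaneous update: read old values only
        delta = 0.0
        for s, actions in mdp.items():
            if not actions:
                continue
            U_next[s] = max(q_value(mdp, s, a, U, gamma) for a in actions)
            delta = max(delta, abs(U_next[s] - U[s]))
        if delta <= epsilon * (1 - gamma) / gamma:
            return U


def best_policy(mdp, U, gamma):
    """Greedy one-step look-ahead policy with respect to U."""
    return {s: max(actions, key=lambda a: q_value(mdp, s, a, U, gamma))
            for s, actions in mdp.items() if actions}


# Hypothetical three-state MDP: from "a", "go" usually reaches "b" (which pays 10
# on exit) but sometimes slips back to "a"; "quit" exits at once for a reward of 1.
toy_mdp = {
    "a": {"go": [(0.8, "b", 0.0), (0.2, "a", 0.0)],
          "quit": [(1.0, "end", 1.0)]},
    "b": {"go": [(1.0, "end", 10.0)]},
    "end": {},
}

U = value_iteration(toy_mdp, gamma=0.5, epsilon=1e-6)
policy = best_policy(toy_mdp, U, gamma=0.5)
print(U, policy)
```

For this toy problem the fixed point can be found by hand: U(b) = 10 and U(a) satisfies U(a) = 0.8 · 0.5 · 10 + 0.2 · 0.5 · U(a), so U(a) = 40/9, which beats the reward of 1 for quitting.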
Figure 17.7 (a) Graph showing the evolution of the utilities of selected states using value iteration. (b) The number of value iterations required to guarantee an error of at most \(\epsilon = c \cdot R_{\max}\), for different values of c, as a function of the discount factor \(\gamma\).

Convergence of value iteration
We said that value iteration eventually converges to a unique set of solutions of the Bellman equations. In this section, we explain why this happens. We introduce some useful mathematical ideas along the way, and we obtain some methods for assessing the error in the utility function returned when the algorithm is terminated early; this is useful because it means that we don’t have to run forever. This section is quite technical.

The basic concept used in showing that value iteration converges is the notion of a contraction. Roughly speaking, a contraction is a function of one argument that, when applied to two different inputs in turn, produces two output values that are “closer together,” by at least some constant factor, than the original inputs. For example, the function “divide by two” is a contraction, because, after we divide any two numbers by two, their difference is halved. Notice that the “divide by two” function has a fixed point, namely zero, that is unchanged by the application of the function. From this example, we can discern two important properties of contractions:
• A contraction has only one fixed point; if there were two fixed points they would not get closer together when the function was applied, so it would not be a contraction.
• When the function is applied to any argument, the value must get closer to the fixed point (because the fixed point does not move), so repeated application of a contraction always reaches the fixed point in the limit.

Now, suppose we view the Bellman update (Equation (17.10)) as an operator B that is applied simultaneously to update the utility of every state. Then the Bellman equation becomes \(U = BU\) and the Bellman update equation can be written as \(U_{i+1} \leftarrow B\,U_i\).
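The two properties of contractions can be seen directly in the “divide by two” example. The snippet below is a minimal illustration; the starting values are arbitrary.

```python
def halve(x):
    """The 'divide by two' contraction; its only fixed point is 0."""
    return x / 2


# Property 1: applying the function brings any two inputs closer together.
x, y = 10.0, -6.0
gap_before = abs(x - y)
gap_after = abs(halve(x) - halve(y))

# Property 2: repeated application approaches the fixed point.
z = 10.0
for _ in range(60):
    z = halve(z)

print(gap_before, gap_after, z)
```

The same picture applies when the inputs are whole utility vectors and the function is the operator B, with distance measured by a suitable norm rather than by an absolute difference.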
0.5 and Go otherwise.

Once we have utilities \(\alpha_p(s)\) for all the conditional plans p of depth 1 in each physical state s, we can compute the utilities for conditional plans of depth 2 by considering each possible first action, each possible subsequent percept, and then each way of choosing a depth-1 plan to execute for each percept:

[Stay; if Percept = A then Stay else Stay]
[Stay; if Percept = A then Stay else Go]
[Go; if Percept = A then Stay else Stay]

There are eight distinct depth-2 plans in all, and their utilities are shown in Figure 17.15(b). Notice that four of the plans, shown as dashed lines, are suboptimal across the entire belief space; we say these plans are dominated, and they need not be considered further. There are four undominated plans, each of which is optimal in a specific region, as shown in Figure 17.15(c). The regions partition the belief-state space.

We repeat the process for depth 3, and so on. In general, let p be a depth-d conditional plan whose initial action is a and whose depth-(d − 1) subplan for percept e is p.e; then

\[ \alpha_p(s) = \sum_{s'} P(s' \mid s,a)\,\Bigl[R(s,a,s') + \gamma \sum_{e} P(e \mid s')\,\alpha_{p.e}(s')\Bigr] . \tag{17.18} \]

This recursion naturally gives us a value iteration algorithm, which is given in Figure 17.16. The structure of the algorithm and its error analysis are similar to those of the basic value iteration algorithm in Figure 17.6 on page 573; the main difference is that instead of computing one utility number for each state, POMDP-VALUE-ITERATION maintains a collection of undominated plans with their utility hyperplanes.
The algorithm’s complexity depends primarily on how many plans get generated. Given |A| actions and |E| possible observations, there are \(|A|^{O(|E|^{d-1})}\) distinct depth-d plans. Even for the lowly two-state world with d = 8, that’s \(2^{255}\) plans. The elimination of dominated plans is essential for reducing this doubly exponential growth: the number of undominated plans with d = 8 is just 144. The utility function for these 144 plans is shown in Figure 17.15(d). Notice that the intermediate belief states have lower value than state A and state B, because in the intermediate states the agent lacks the information needed to choose a good action. This is why information has value in the sense defined in Section 16.6 and optimal policies in POMDPs often include information-gathering actions.
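The plan counts quoted above follow from the tree structure of a conditional plan: one action at the root, then one subplan of depth d − 1 for each possible percept. The recursive count below is a small sketch; the function name `count_conditional_plans` is an illustrative choice, and depth 0 is taken to be the single empty plan.

```python
def count_conditional_plans(num_actions, num_percepts, depth):
    """Number of distinct conditional plans of the given depth.

    A depth-d plan picks one action, then one depth-(d-1) plan per percept,
    so count(d) = num_actions * count(d - 1) ** num_percepts, with count(0) = 1.
    """
    count = 1  # the empty plan
    for _ in range(depth):
        count = num_actions * count ** num_percepts
    return count


# Two-state world: two actions (Stay, Go) and two percepts (A, B).
for d in (1, 2, 3, 8):
    print(d, count_conditional_plans(2, 2, d))
```

With two actions and two percepts this gives 2 plans at depth 1, 8 at depth 2, and \(2^{255}\) at depth 8, matching the figures in the text and making the doubly exponential growth visible.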
function POMDP-VALUE-ITERATION(pomdp, ε) returns a utility function
    inputs: pomdp, a POMDP with states S, actions A(s), transition model P(s′ | s, a),
                sensor model P(e | s), rewards R(s), discount γ
            ε, the maximum error allowed in the utility of any state
    local variables: U, U′, sets of plans p with associated utility vectors α_p

    U′ ← a set containing just the empty plan [ ], with α_[ ](s) = R(s)
    repeat
        U ← U′
        U′ ← the set of all plans consisting of an action and, for each possible next percept,
                a plan in U with utility vectors computed according to Equation (17.18)
        U′ ← REMOVE-DOMINATED-PLANS(U′)
    until MAX-DIFFERENCE(U, U′) < ε(1 − γ)/γ
    return U

Figure 17.16 A high-level sketch of the value iteration algorithm for POMDPs. The REMOVE-DOMINATED-PLANS step and MAX-DIFFERENCE test are typically implemented as linear programs.

Given such a utility function, an executable policy can be extracted by looking at which hyperplane is optimal at any given belief state b and executing the first action of the corresponding plan. In Figure 17.15(d), the corresponding optimal policy is still the same as for depth-1 plans: Stay when b(B) > 0.5 and Go otherwise.

In practice, the value iteration algorithm in Figure 17.16 is hopelessly inefficient for larger problems; even the 4 × 3 POMDP is too hard. The main reason is that given n undominated conditional plans at level d, the algorithm constructs \(|A| \cdot n^{|E|}\) conditional plans at level d + 1 before eliminating the dominated ones. With the four-bit sensor, |E| is 16, and n can be in the hundreds, so this is hopeless.

Since this algorithm was developed in the 1970s, there have been several advances, including more efficient forms of value iteration and various kinds of policy iteration algorithms. Some of these are discussed in the notes at the end of the chapter. For general POMDPs, however, finding optimal policies is very difficult (PSPACE-hard, in fact; that is, very hard indeed). The next section describes a different, approximate method for solving POMDPs, one based on lookahead search.

17.5.2 Online algorithms for POMDPs
The basic design for an online POMDP agent is straightforward: it starts with some prior belief state; it chooses an action based on some deliberation process centered on its current
belief state; after acting, it receives an observation and updates its belief state using a filtering algorithm; and the process repeats.
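The belief-update step of this loop can be sketched for a small discrete model. The filtering rule used here is the standard one, b′(s′) ∝ P(e | s′) Σ_s P(s′ | s, a) b(s); the dictionary layout, the function name `update_belief`, and the two-state numbers (actions succeed with probability 0.9, the sensor is right with probability 0.6) are assumptions made for illustration.

```python
def update_belief(belief, action, percept, transition, sensor):
    """One step of exact filtering for a discrete POMDP.

    belief[s] is the current probability of state s,
    transition[s][action][s2] is P(s2 | s, action),
    sensor[s2][percept] is P(percept | s2).
    Assumes the percept has nonzero probability under the predicted belief.
    """
    states = list(belief)
    unnormalized = {
        s2: sensor[s2][percept]
            * sum(transition[s][action][s2] * belief[s] for s in states)
        for s2 in states
    }
    total = sum(unnormalized.values())
    return {s: weight / total for s, weight in unnormalized.items()}


# Hypothetical two-state model with states A and B.
transition = {
    "A": {"Stay": {"A": 0.9, "B": 0.1}, "Go": {"A": 0.1, "B": 0.9}},
    "B": {"Stay": {"A": 0.1, "B": 0.9}, "Go": {"A": 0.9, "B": 0.1}},
}
sensor = {"A": {"A": 0.6, "B": 0.4}, "B": {"A": 0.4, "B": 0.6}}

belief = {"A": 0.5, "B": 0.5}
belief = update_belief(belief, "Stay", "B", transition, sensor)
print(belief)
```

Starting from a uniform belief, staying put and then perceiving B shifts the belief to 0.6 in favor of B; an online agent would feed this updated belief into its deliberation step and repeat.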
One obvious choice for the deliberation process is the expectimax algorithm from Section 17.2.4, except with belief states rather than physical states as the decision nodes in the tree. The chance nodes in the POMDP tree have branches labeled by possible observations and leading to the next belief state, with transition probabilities given by Equation (17.17). A fragment of the belief-state expectimax tree for the 4 × 3 POMDP is shown in Figure 17.17.
Figure 17.17 Part of an expectimax tree for the 4 × 3 POMDP with a uniform initial belief state. The belief states are depicted with shading proportional to the probability of being in each location.

The time complexity of an exhaustive search to depth d is \(O(|A|^d \cdot |E|^d)\), where |A| is the number of available actions and |E| is the number of possible percepts. (Notice that this is far less than the number of possible depth-d conditional plans generated by value iteration.) As in the observable case, sampling at the chance nodes is a good way to cut down the branching factor without losing too much accuracy in the final decision. Thus, the complexity of approximate online decision making in POMDPs may not be drastically worse than that in MDPs.
For very large state spaces, exact filtering is infeasible, so the agent will need to run an approximate filtering algorithm such as particle filtering (see page 492). Then the belief states in the expectimax tree become collections of particles rather than exact probability distributions. For problems with long horizons, we may also need to run the kind of long-range playouts used in the UCT algorithm (Figure 5.11). The combination of particle filtering and UCT applied to POMDPs goes under the name of partially observable Monte Carlo planning or POMCP. With a DDN representation for the model, the POMCP algorithm is, at least in principle, applicable to very large and realistic POMDPs. Details of the algorithm are explored in Exercise 17.PoMC. POMCP is capable of generating competent behavior in the 4 × 3 POMDP. A short (and somewhat fortunate) example is shown in Figure 17.18.

POMDP agents based on dynamic decision networks and online decision making have a number of advantages compared with other, simpler agent designs presented in earlier chapters. In particular, they handle partially observable, stochastic environments and can easily revise their “plans” to handle unexpected evidence. With appropriate sensor models, they can handle sensor failure and can plan to gather information. They exhibit “graceful degradation” under time pressure and in complex environments, using various approximation techniques.
So what is missing? The principal obstacle to real-world deployment of such agents is the inability to generate successful behavior over long timescales. Random or near-random playouts have no hope of gaining any positive reward on, say, the task of laying the table for dinner, which might take tens of millions of motor-control actions. It seems necessary to borrow some of the hierarchical planning ideas described in Section 11.4. At the time of writing, there are not yet satisfactory and efficient ways to apply these ideas in stochastic, partially observable environments.

Figure 17.18 A sequence of percepts, belief states, and actions in the 4 × 3 POMDP with a wall-sensing error of \(\epsilon = 0.2\). Notice how the early Left moves are safe (they are very unlikely to fall into (4,2)) and coerce the agent’s location into a small number of possible locations. After moving Up, the agent thinks it is probably in (3,3), but possibly in (1,3). Fortunately, moving Right is a good idea in both cases, so it moves Right, finds out that it had been in (1,3) and is now in (2,3), and then continues moving Right and reaches the goal.
Summary
This chapter shows how to use knowledge about the world to make decisions even when the outcomes of an action are uncertain and the rewards for acting might not be reaped until many actions have passed. The main points are as follows:
• Sequential decision problems in stochastic environments, also called Markov decision processes, or MDPs, are defined by a transition model specifying the probabilistic outcomes of actions and a reward function specifying the reward in each state.
• The utility of a state sequence is the sum of all the rewards over the sequence, possibly discounted over time. The solution of an MDP is a policy that associates a decision with every state that the agent might reach. An optimal policy maximizes the utility of the state sequences encountered when it is executed.
• The utility of a state is the expected sum of rewards when an optimal policy is executed from that state. The value iteration algorithm iteratively solves a set of equations relating the utility of each state to those of its neighbors.
• Policy iteration alternates between calculating the utilities of states under the current policy and improving the current policy with respect to the current utilities.
• Partially observable MDPs, or POMDPs, are much more difficult to solve than are MDPs. They can be solved by conversion to an MDP in the continuous space of belief states; both value iteration and policy iteration algorithms have been devised. Optimal behavior in POMDPs includes information gathering to reduce uncertainty and therefore make better decisions in the future.
• A decision-theoretic agent can be constructed for POMDP environments. The agent uses a dynamic decision network to represent the transition and sensor models, to update its belief state, and to project forward possible action sequences.

We shall return to MDPs and POMDPs in Chapter 22, which covers reinforcement learning methods that allow an agent to improve its behavior from experience.
Bibliographical and Historical Notes
Richard Bellman developed the ideas underlying the modern approach to sequential decision problems while working at the RAND Corporation beginning in 1949. According to his autobiography (Bellman, 1984), he coined the term “dynamic programming” to hide from a research-phobic Secretary of Defense, Charles Wilson, the fact that his group was doing mathematics. (This cannot be strictly true, because his first paper using the term (Bellman, 1952) appeared before Wilson became Secretary of Defense in 1953.) Bellman’s book, Dynamic Programming (1957), gave the new field a solid foundation and introduced the value iteration algorithm.
Shapley (1953b) actually described the value iteration algorithm independently of Bellman, but his results were not widely appreciated in the operations research community, perhaps because they were presented in the more general context of Markov games. Although the original formulations included discounting, its analysis in terms of stationary preferences was suggested by Koopmans (1972). The shaping theorem is due to Ng et al. (1999). Ron Howard’s Ph.D. thesis (1960) introduced policy iteration and the idea of average reward for solving infinite-horizon problems. Several additional results were introduced by Bellman and Dreyfus (1962). The use of contraction mappings in analyzing dynamic programming algorithms is due to Denardo (1967). Modified policy iteration is due to van Nunen (1976) and Puterman and Shin (1978). Asynchronous policy iteration was analyzed by Williams and Baird (1993), who also proved the policy loss bound in Equation (17.13). The general family of prioritized sweeping algorithms aims to speed up convergence to optimal policies by heuristically ordering the value and policy update calculations (Moore and Atkeson, 1993; Andre et al., 1998; Wingate and Seppi, 2005).
The formulation of MDP-solving as a linear program is due to de Ghellinck (1960), Manne (1960), and D’Epenoux (1963). Although linear programming has traditionally been considered inferior to dynamic programming as an exact solution method for MDPs, de Farias and Roy (2003) show that it is possible to use linear programming and a linear representation of the utility function to obtain provably good approximate solutions to very large MDPs. Papadimitriou and Tsitsiklis (1987) and Littman et al. (1995) provide general results on the computational complexity of MDPs. Yinyu Ye (2011) analyzes the relationship between policy iteration and the simplex method for linear programming and proves that for fixed γ, the runtime of policy iteration is polynomial in the number of states and actions.
Seminal work by Sutton (1988) and Watkins (1989) on reinforcement learning methods for solving MDPs played a significant role in introducing MDPs into the AI community. (Earlier work by Werbos (1977) contained many similar ideas, but was not taken up to the same extent.) AI researchers have pushed MDPs in the direction of more expressive representations that can accommodate much larger problems than the traditional atomic representations based on transition matrices.
The basic ideas for an agent architecture using dynamic decision networks were proposed by Dean and Kanazawa (1989a). Tatman and Shachter (1990) showed how to apply dynamic programming algorithms to DDN models. Several authors made the connection between MDPs and AI planning problems, developing probabilistic forms of the compact STRIPS representation for transition models (Wellman, 1990b; Koenig, 1991). The book Planning and Control by Dean and Wellman (1991) explores the connection in great depth.
Later work on factored MDPs (Boutilier et al., 2000; Koller and Parr, 2000; Guestrin et al., 2003b) uses structured representations of the value function as well as the transition model, with provable improvements in complexity. Relational MDPs (Boutilier et al., 2001; Guestrin et al., 2003a) go one step further, using structured representations to handle domains with many related objects. Open-universe MDPs and POMDPs (Srivastava et al., 2014b) also allow for uncertainty over the existence and identity of objects and actions.
Many authors have developed approximate online algorithms for decision making in MDPs, often borrowing explicitly from earlier AI approaches to real-time search and game-playing (Werbos, 1992; Dean et al., 1993; Tash and Russell, 1994). The work of Barto et al. (1995) on RTDP (real-time dynamic programming) provided a general framework for understanding such algorithms and their connection to reinforcement learning and heuristic search. The analysis of depth-bounded expectimax with sampling at chance nodes is due to Kearns et al. (2002). The UCT algorithm described in the chapter is due to Kocsis and Szepesvari (2006) and borrows from earlier work on random playouts for estimating the values of states (Abramson, 1990; Brügmann, 1993; Chang et al., 2005).
Bandit problems were introduced by Thompson (1933) but came to prominence after World War II through the work of Herbert Robbins (1952). Bradt et al. (1956) proved the first results concerning stopping rules for one-armed bandits, which led eventually to the breakthrough results of John Gittins (Gittins and Jones, 1974; Gittins, 1989). Katehakis and Veinott (1987) suggested the restart MDP as a method of computing Gittins indices. The text by Berry and Fristedt (1985) covers many variations on the basic problem, while the pellucid online text by Ferguson (2001) connects bandit problems with stopping problems. Lai and Robbins (1985) initiated the study of the asymptotic regret of optimal bandit policies. The UCB heuristic was introduced and analyzed by Auer et al. (2002). Bandit superprocesses (BSPs) were first studied by Nash (1973) but have remained largely unknown in AI. Hadfield-Menell and Russell (2015) describe an efficient branch-and-bound algorithm capable of solving relatively large BSPs. Selection problems were introduced by Bechhofer (1954). Hay et al. (2012) developed a formal framework for metareasoning problems, showing that simple instances mapped to selection rather than bandit problems. They also proved the satisfying result that expected computation cost of the optimal computational strategy is never higher than the expected gain in decision quality, although there are cases where the optimal policy may, with some probability, keep computing long past the point where any possible gain has been used up.
The observation that a partially observable MDP can be transformed into a regular MDP over belief states is due to Åström (1965) and Aoki (1965). The first complete algorithm for the exact solution of POMDPs (essentially the value iteration algorithm presented in this chapter) was proposed by Edward Sondik (1971) in his Ph.D. thesis. (A later journal paper by Smallwood and Sondik (1973) contains some errors, but is more accessible.) Lovejoy (1991) surveyed the first twenty-five years of POMDP research, reaching somewhat pessimistic conclusions about the feasibility of solving large problems.
The first significant contribution within AI was the Witness algorithm (Cassandra et al., 1994; Kaelbling et al., 1998), an improved version of POMDP value iteration. Other algorithms soon followed, including an approach due to Hansen (1998) that constructs a policy incrementally in the form of a finite-state automaton whose states define the possible belief states of the agent.
More recent work in AI has focused on point-based value iteration methods that, at each iteration, generate conditional plans and α-vectors for a finite set of belief states rather than for the entire belief space. Lovejoy (1991) proposed such an algorithm for a fixed grid of points, an approach taken also by Bonet (2002). An influential paper by Pineau et al. (2003) suggested generating reachable points by simulating trajectories in a somewhat greedy fashion; Spaan and Vlassis (2005) observe that one need generate plans for only a small, randomly selected subset of points to improve on the plans from the previous iteration for all points in the set. Shani et al. (2013) survey these and other developments in point-based algorithms, which have led to good solutions for problems with thousands of states. Because POMDPs are PSPACE-hard (Papadimitriou and Tsitsiklis, 1987), further progress on offline solution methods may require taking advantage of various kinds of structure in value functions arising from a factored representation of the model.
The online approach for POMDPs (using lookahead search to select an action for the current belief state) was first examined by Satia and Lave (1973). The use of sampling at chance nodes was explored analytically by Kearns et al. (2000) and Ng and Jordan (2000). The POMCP algorithm is due to Silver and Veness (2011).
With the development of reasonably effective approximation algorithms for POMDPs, their use as models for real-world problems has increased, particularly in education (Rafferty et al., 2016), dialog systems (Young et al., 2013), robotics (Hsiao et al., 2007; Huynh and Roy, 2009), and self-driving cars (Forbes et al., 1995; Bai et al., 2015). An important large-scale application is the Airborne Collision Avoidance System X (ACAS X), which keeps airplanes and drones from colliding midair. The system uses POMDPs with neural networks to do function approximation. ACAS X significantly improves safety compared to the legacy TCAS system, which was built in the 1970s using expert system technology (Kochenderfer, 2015; Julian et al., 2018).
Complex decision making has also been studied by economists and psychologists. They find that decision makers are not always rational, and may not be operating exactly as described by the models in this chapter. For example, when given a choice, a majority of people prefer $100 today over a guarantee of $200 in two years, but those same people prefer $200 in eight years over $100 in six years. One way to interpret this result is that people are not using additive exponentially discounted rewards; perhaps they are using hyperbolic rewards (the hyperbolic function dips more steeply in the near term than does the exponential decay function). This and other possible interpretations are discussed by Rubinstein (2003).
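The preference reversal can be reproduced numerically. The sketch below compares the two choices under an exponential discount γ^t and under a hyperbolic discount of the assumed form 1/(1 + kt); the functional form, the steepness k = 1, and all function names are illustrative assumptions rather than anything specified in the text.

```python
def exponential(t, gamma=0.9):
    """Exponential discount factor gamma**t for a reward t years away."""
    return gamma ** t

def hyperbolic(t, k=1.0):
    """Hyperbolic discount factor 1 / (1 + k*t); k is an assumed steepness."""
    return 1.0 / (1.0 + k * t)

def prefers_sooner(discount, amount_soon, t_soon, amount_late, t_late):
    """True if the discounted sooner reward beats the discounted later one."""
    return amount_soon * discount(t_soon) > amount_late * discount(t_late)

# The two choices from the text: $100 versus $200 two years later,
# asked once with no delay and once with a six-year delay.
hyper_now = prefers_sooner(hyperbolic, 100, 0, 200, 2)    # 100 vs 200/3
hyper_later = prefers_sooner(hyperbolic, 100, 6, 200, 8)  # 100/7 vs 200/9
exp_now = prefers_sooner(exponential, 100, 0, 200, 2)
exp_later = prefers_sooner(exponential, 100, 6, 200, 8)
```

Under the hyperbolic discount the agent takes $100 today but waits for $200 in the delayed version of the choice. Under exponential discounting both comparisons differ only by the common factor γ^6, so the two answers agree for any γ.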
The texts by Bertsekas (1987) and Puterman (1994) provide rigorous introductions to sequential decision problems and dynamic programming. Bertsekas and Tsitsiklis (1996) include coverage of reinforcement learning. Sutton and Barto (2018) cover similar ground but in a more accessible style. Sigaud and Buffet (2010), Mausam and Kolobov (2012) and Kochenderfer (2015) cover sequential decision making from an AI perspective. Krishnamurthy (2016) provides thorough coverage of POMDPs.
CHAPTER 18
MULTIAGENT DECISION MAKING
In which we examine what to do when more than one agent inhabits the environment.
18.1 Properties of Multiagent Environments
So far, we have largely assumed that only one agent has been doing the sensing, planning, and acting. But this represents a huge simplifying assumption, which fails to capture many real-world AI settings. In this chapter, therefore, we will consider the issues that arise when an agent must make decisions in environments that contain multiple actors. Such environments are called multiagent systems, and agents in such a system face a multiagent planning problem. However, as we will see, the precise nature of the multiagent planning problem (and the techniques that are appropriate for solving it) will depend on the relationships among the agents in the environment.
18.1.1 One decision maker
The first possibility is that while the environment contains multiple actors, it contains only one decision maker. In such a case, the decision maker develops plans for the other agents, and tells them what to do. The assumption that agents will simply do what they are told is called the benevolent agent assumption. However, even in this setting, plans involving multiple actors will require actors to synchronize their actions. Actors A and B will have to act at the same time for joint actions (such as singing a duet), at different times for mutually exclusive actions (such as recharging batteries when there is only one plug), and sequentially when one establishes a precondition for the other (such as A washing the dishes and then B drying them).
One special case is where we have a single decision maker with multiple effectors that can operate concurrently, for example, a human who can walk and talk at the same time. Such an agent needs to do multieffector planning to manage each effector while handling positive and negative interactions among the effectors. When the effectors are physically decoupled into detached units, as in a fleet of delivery robots in a factory, multieffector planning becomes multibody planning.
A multibody problem is still a “standard” single-agent problem as long as the relevant sensor information collected by each body can be pooled (either centrally or within each body) to form a common estimate of the world state that then informs the execution of the overall plan; in this case, the multiple bodies can be thought of as acting as a single body. When communication constraints make this impossible, we have what is sometimes
called a decentralized planning problem; this is perhaps a misnomer, because the planning [...]
[...] −3.
• Now suppose we change the rules to force O to reveal his strategy first, followed by E. Then the minimax value of this game is $U_{O,E}$, and because this game favors E we know that U is at most $U_{O,E}$. With pure strategies, the value is +2 (see Figure 18.2(b)), so we know $U \le +2$.
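The pure-strategy bounds can be checked with a few lines of Python. The payoff matrix below is an assumption: it is the usual two-finger Morra matrix (payoffs to E), which does not appear in this passage, and the variable names are chosen only for illustration.

```python
# Payoffs to E, indexed as payoff[e_move][o_move]; moves are "one" or "two".
# Assumed matrix (two-finger Morra); it is not shown in this passage.
payoff = {
    "one": {"one": 2, "two": -3},
    "two": {"one": -3, "two": 4},
}
moves = ["one", "two"]

# E reveals first: E maximizes, knowing that O will then minimize.
lower_bound = max(min(payoff[e][o] for o in moves) for e in moves)

# O reveals first: O minimizes, knowing that E will then maximize.
upper_bound = min(max(payoff[e][o] for e in moves) for o in moves)

print(lower_bound, upper_bound)  # -3 2
```

Under this assumed matrix, forcing E to move first gives −3 and forcing O to move first gives +2, matching the pure-strategy values quoted in the text.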
Combining these two arguments, we see that the true utility U of the solution to the original game must satisfy
$U_{E,O} \le U \le U_{O,E}$ [...]
[...] $v(C \cup D) \ge v(C) + v(D)$ for all $C, D \subseteq N$.
If a game is superadditive, then the grand coalition receives a value that is at least as high as the total received by any other coalition structure. However, as we will see shortly, superadditive games do not always end up with a grand coalition, for much the same reason that the players do not always arrive at a collectively desirable Pareto-optimal outcome in the prisoner’s dilemma.
18.3.2 Strategy in cooperative games
The basic assumption in cooperative game theory is that players will make strategic decisions about who they will cooperate with. Intuitively, players will not desire to work with unproductive players; they will naturally seek out players that collectively yield a high coalitional value. But these sought-after players will be doing their own strategic reasoning. Before we can describe this reasoning, we need some further definitions.
An imputation for a cooperative game $(N, v)$ is a payoff vector $x = (x_1, \ldots, x_n)$ that satisfies the following two conditions:
$$\sum_{i=1}^{n} x_i = v(N)$$
$$x_i \ge v(\{i\}) \quad \text{for all } i \in N.$$
The first condition says that an imputation must distribute the total value of the grand coalition; the second condition, known as individual rationality, says that each player is at least as well off as if it had worked alone.
Given an imputation $x = (x_1, \ldots, x_n)$ and a coalition $C \subseteq N$, we define $x(C)$ to be the sum $\sum_{i \in C} x_i$, that is, the total amount disbursed to $C$ by the imputation $x$.
Next, we define the core of a game $(N, v)$ as the set of all imputations $x$ that satisfy the condition $x(C) \ge v(C)$ for every possible coalition $C \subseteq N$. Thus, if an imputation $x$ is not in the core, then there exists some coalition $C \subseteq N$ such that $v(C) > x(C)$. The players in $C$ would refuse to join the grand coalition because they would be better off sticking with $C$.
The core of a game therefore consists of all the possible payoff vectors that no coalition could object to on the grounds that they could do better by not joining the grand coalition. Thus, if the core is empty, then the grand coalition cannot form, because no matter how the grand coalition divided its payoff, some smaller coalition would refuse to join. The main computational questions around the core relate to whether or not it is empty, and whether a particular payoff distribution is in the core.
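The definition of the core naturally leads to a system of linear inequalities, as follows (the unknowns are variables $x_1, \ldots, x_n$, and the values $v(C)$ are constants):
$$\begin{aligned}
x_i &\ge v(\{i\}) && \text{for all } i \in N \\
\sum_{i \in N} x_i &= v(N) \\
\sum_{i \in C} x_i &\ge v(C) && \text{for all } C \subseteq N
\end{aligned}$$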
Any solution to these inequalities will define an imputation in the core. We can formulate the inequalities as a linear program by using a dummy objective function (for example, maximizing $\sum_{i \in N} x_i$), which will allow us to compute imputations in time polynomial in the number of inequalities. The difficulty is that this gives an exponential number of inequalities (one for each of the $2^n$ possible coalitions). Thus, this approach yields an algorithm for checking nonemptiness of the core that runs in exponential time. Whether we can do better than this depends on the game being studied: for many classes of cooperative game, the problem of checking nonemptiness of the core is co-NP-complete. We give an example below.
Before proceeding, let’s see an example of a superadditive game with an empty core. The game has three players N = {1,2,3}, and has a characteristic function defined as follows:
$$v(C) = \begin{cases} 1 & \text{if } |C| \ge 2 \\ 0 & \text{otherwise.} \end{cases}$$
Now consider any imputation $(x_1, x_2, x_3)$ for this game. Since $v(N) = 1$, it must be the case that at least one player i has $x_i > 0$, and the other two get a total payoff less than 1. Those two could benefit by forming a coalition without player i and sharing the value 1 among themselves. But since this holds for all imputations, the core must be empty.
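The argument can be illustrated by brute force: for a given imputation, enumerate every coalition and look for one that would do better on its own. The sketch below is a minimal illustration; the function names and the three sample imputations are assumptions, and checking samples is of course not a proof that the core is empty.

```python
from itertools import combinations

def coalitions(players):
    """Yield every nonempty coalition of the given players as a frozenset."""
    for size in range(1, len(players) + 1):
        for members in combinations(players, size):
            yield frozenset(members)

def blocking_coalition(players, v, x, tol=1e-9):
    """Return a coalition C with v(C) > x(C), or None if there is none.

    x maps each player to a payoff and is assumed to be an imputation.
    """
    for C in coalitions(players):
        if v(C) > sum(x[i] for i in C) + tol:
            return C
    return None

players = [1, 2, 3]

def majority(C):
    """The game from the text: value 1 for any coalition of two or more."""
    return 1 if len(C) >= 2 else 0

# A few sample imputations (each sums to v(N) = 1); every one is blocked.
samples = [
    {1: 1 / 3, 2: 1 / 3, 3: 1 / 3},
    {1: 0.5, 2: 0.5, 3: 0.0},
    {1: 1.0, 2: 0.0, 3: 0.0},
]
blockers = [blocking_coalition(players, majority, x) for x in samples]
```

Each sample is blocked by a pair of players whose combined payoff is below 1, which is exactly the pair that the argument in the text predicts will defect.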
The core formalizes the idea of the grand coalition being stable, in the sense that no coalition can profitably defect from it. However, the core may contain imputations that are unreasonable, in the sense that one or more players might feel they were unfair. Suppose N = {1,2}, and we have a characteristic function v defined as follows:
v({1}) = v({2}) = 5
v({1,2}) = 20.
Here, cooperation yields a surplus of 10 over what players could obtain working in isolation, and so intuitively, cooperation will make sense in this scenario. Now, it is easy to see that the imputation (6, 14) is in the core of this game: neither player can deviate to obtain a higher utility. But from the point of view of player 1, this might appear unreasonable, because it gives 9/10 of the surplus to player 2. Thus, the notion of the core tells us when a grand coalition can form, but it does not tell us how to distribute the payoff.
The Shapley value is an elegant proposal for how to divide the v(N) value among the players, given that the grand coalition N formed. Formulated by Nobel laureate Lloyd Shapley in the early 1950s, the Shapley value is intended to be a fair distribution scheme. What does fair mean? It would be unfair to distribute v(N) based on the eye color of players, or their gender, or skin color. Students often suggest that the value v(N) should be divided equally, which seems like it might be fair, until we consider that this would give the same reward to players that contribute a lot and players that contribute nothing. Shapley’s insight was to suggest that the only fair way to divide the value v(N) was to do so according to how much each player contributed to creating the value v(N).
First we need to define the notion of a player’s marginal contribution. The marginal contribution that a player i makes to a coalition C is the value that i would add (or remove), should i join the coalition C. Formally, the marginal contribution that player i makes to C is denoted by $mc_i(C)$:
$$mc_i(C) = v(C \cup \{i\}) - v(C).$$
Now, a first attempt to define a payoff division scheme in line with Shapley’s suggestion that players should be rewarded according to their contribution would be to pay each player i the value that they would add to the coalition containing all other players:
$$mc_i(N - \{i\}).$$
The problem is that this implicitly assumes that player i is the last player to enter the coalition.
So, Shapley suggested, we need to consider all possible ways that the grand coalition could form, that is, all possible orderings of the players N, and consider the value that i adds to the preceding players in the ordering. Then, a player should be rewarded by being paid the average marginal contribution that player i makes, over all possible orderings of the players, to the set of players preceding i in the ordering.
We let P denote all possible permutations (that is, orderings) of the players N, and denote members of P by $p, p', \ldots$ etc. Where $p \in P$ and $i \in N$, we denote by $p_i$ the set of players that precede i in the ordering p. Then the Shapley value for a game G is the imputation $\phi(G) = (\phi_1(G), \ldots, \phi_n(G))$ defined as follows:
$$\phi_i(G) = \frac{1}{n!} \sum_{p \in P} mc_i(p_i). \tag{18.1}$$
This should convince you that the Shapley value is a reasonable proposal. But the remarkable fact is that it is the unique solution to a set of axioms that characterizes a “fair” payoff distribution scheme. We’ll need some more definitions before defining the axioms.
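The definition of the Shapley value translates directly into code that enumerates every ordering of the players. The sketch below is a straightforward (exponential-time) illustration applied to the two-player game given earlier, with v({1}) = v({2}) = 5 and v({1,2}) = 20; the function and variable names are assumptions for the example.

```python
from itertools import permutations
from math import factorial

def shapley_values(players, v):
    """Average each player's marginal contribution over all orderings.

    v takes a frozenset of players and returns the coalition's value.
    """
    phi = {i: 0.0 for i in players}
    for ordering in permutations(players):
        preceding = frozenset()
        for i in ordering:
            # mc_i(p_i) = v(p_i ∪ {i}) - v(p_i)
            phi[i] += v(preceding | {i}) - v(preceding)
            preceding = preceding | {i}
    n_orderings = factorial(len(players))
    return {i: total / n_orderings for i, total in phi.items()}

# The two-player game from the text; the empty coalition is worth 0.
values = {
    frozenset(): 0,
    frozenset({1}): 5,
    frozenset({2}): 5,
    frozenset({1, 2}): 20,
}

def v(C):
    return values[C]

phi = shapley_values([1, 2], v)
print(phi)  # {1: 10.0, 2: 10.0}
```

In contrast with the core imputation (6, 14), the Shapley value splits the surplus of 10 equally, giving each player 10.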
We define a dummy player as a player i that never adds any value to a coalition, that is, $mc_i(C) = 0$ for all $C \subseteq N - \{i\}$. We will say that two players i and j are symmetric players if they always make identical contributions to coalitions, that is, $mc_i(C) = mc_j(C)$ for all $C \subseteq N - \{i, j\}$. Finally, where $G = (N, v)$ and $G' = (N, v')$ are games with the same set of players, the game $G + G'$ is the game with the same player set, and a characteristic function $v''$ defined by $v''(C) = v(C) + v'(C)$.
Given these definitions, we can define the fairness axioms satisfied by the Shapley value:
• Efficiency: $\sum_{i \in N} \phi_i(G) = v(N)$. (All the value should be distributed.)
• Dummy Player: If i is a dummy player in G then $\phi_i(G) = 0$. (Players who never contribute anything should never receive anything.)
• Symmetry: If i and j are symmetric in G then $\phi_i(G) = \phi_j(G)$. (Players who make identical contributions should receive identical payoffs.)
• Additivity: The value is additive over games: For all games $G = (N, v)$ and $G' = (N, v')$, and for all players $i \in N$, we have $\phi_i(G + G') = \phi_i(G) + \phi_i(G')$.
The additivity axiom is admittedly rather technical. If we accept it as a requirement, however,
we can establish the following key property: the Shapley value is the only way to distribute
coalitional value so as to satisfy these fairness axioms.
18.3.3 Computation in cooperative games
From a theoretical point of view, we now have a satisfactory solution. But from a computational point of view, we need to know how to compactly represent cooperative games, and how to efficiently compute solution concepts such as the core and the Shapley value.
The obvious representation for a characteristic function would be a table listing the value v(C) for all $2^n$ coalitions. This is infeasible for large n. A number of approaches to compactly representing cooperative games have been developed, which can be distinguished by whether or not they are complete. A complete representation scheme is one that is capable of representing any cooperative game. The drawback with complete representation schemes is that there will always be some games that cannot be represented compactly. An alternative is to use a representation scheme that is guaranteed to be compact, but which is not complete.
Marginal contribution nets
We now describe one representation scheme, called marginal contribution nets (MC-nets). We will use a slightly simplified version to facilitate presentation, and the simplification makes it incomplete; the full version of MC-nets is a complete representation.
The idea behind marginal contribution nets is to represent the characteristic function of a game $(N, v)$ as a set of coalition-value rules, of the form $(C_i, x_i)$, where $C_i \subseteq N$ is a coalition and $x_i$ is a number. To compute the value of a coalition C, we simply sum the values of all rules $(C_i, x_i)$ such that $C_i \subseteq C$. Thus, given a set of rules $R = \{(C_1, x_1), \ldots, (C_k, x_k)\}$, the corresponding characteristic function is:
$$v(C) = \sum \{ x_i \mid (C_i, x_i) \in R \text{ and } C_i \subseteq C \}.$$
Suppose we have a rule set R containing the following three rules:
$$\{(\{1,2\}, 5), \; (\{2\}, 2), \; (\{3\}, 4)\}.$$
Then, for example, we have:
• $v(\{1\}) = 0$ (because no rules apply),
• $v(\{3\}) = 4$ (third rule),
• $v(\{1,3\}) = 4$ (third rule),
• $v(\{2,3\}) = 6$ (second and third rules), and
• $v(\{1,2,3\}) = 11$ (first, second, and third rules).
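The rule-summing computation is short enough to write out. The following sketch encodes the three rules from the example as (coalition, value) pairs and evaluates the same coalitions; the function name `mc_net_value` and the list-of-pairs encoding are assumptions for illustration.

```python
# Rules from the example: (coalition, value) pairs.
rules = [
    (frozenset({1, 2}), 5),
    (frozenset({2}), 2),
    (frozenset({3}), 4),
]

def mc_net_value(rules, coalition):
    """Sum the values of every rule whose coalition is contained in `coalition`."""
    coalition = frozenset(coalition)
    return sum(x for C, x in rules if C.issubset(coalition))

# Prints 0, 4, 4, 6, 11 for the five coalitions, in order.
for coalition in [{1}, {3}, {1, 3}, {2, 3}, {1, 2, 3}]:
    print(sorted(coalition), mc_net_value(rules, coalition))
```

Evaluating one coalition takes time linear in the number of rules, so the full table of $2^n$ values never has to be stored.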
With this representation we can compute the Shapley value in polynomial time. The key insight is that each rule can be understood as defining a game on its own, in which the players are symmetric. By appealing to Shapley’s axioms of additivity and symmetry, therefore, the Shapley value $\phi_i(R)$ of player i in the game associated with the rule set R is then simply:
$$\phi_i(R) = \sum_{(C, x) \in R} \begin{cases} \dfrac{x}{|C|} & \text{if } i \in C \\ 0 & \text{otherwise.} \end{cases}$$
The version of marginal contribution nets that we have presented here is not a complete representation scheme: there are games whose characteristic function cannot be represented using rule sets of the form described above. A richer type of marginal contribution networks allows for rules of the form $(\phi, x)$, where $\phi$ is a propositional logic formula over the players N: a coalition C satisfies the condition $\phi$ if it corresponds to a satisfying assignment for $\phi$.
This scheme is a complete representation: in the worst case, we need a rule for every possible coalition. Moreover, the Shapley value can be computed in polynomial time with this scheme; the details are more involved than for the simple rules described above, although the basic principle is the same; see the notes at the end of the chapter for references.
Figure 18.7 The coalition structure graph for N = {1,2,3,4}. Level 1 has coalition structures containing a single coalition; level 2 has coalition structures containing two coalitions, and so on.
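For the simple rule form (a coalition paired with a number), the polynomial-time Shapley computation amounts to splitting each rule's value equally among the members of its coalition and summing over rules. A minimal sketch, using the three-rule example from earlier in this section; the function name and data layout are assumptions, and the richer formula-based rules are not handled.

```python
def mc_net_shapley(rules, players):
    """Shapley values for simple MC-net rules.

    Each rule (C, x) pays x / |C| to every member of C and nothing to others.
    """
    phi = {i: 0.0 for i in players}
    for C, x in rules:
        for i in C:
            phi[i] += x / len(C)
    return phi

rules = [(frozenset({1, 2}), 5), (frozenset({2}), 2), (frozenset({3}), 4)]
phi = mc_net_shapley(rules, [1, 2, 3])
print(phi)  # {1: 2.5, 2: 4.5, 3: 4.0}
```

The payoffs sum to 11, the value of the grand coalition under these rules, as the efficiency axiom requires.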
Coalition structures for maximum social welfare
We obtain a different perspective on cooperative games if we assume that the agents share a common purpose. For example, if we think of the agents as being workers in a company, then the strategic considerations relating to coalition formation that are addressed by the core, for example, are not relevant. Instead, we might want to organize the workforce (the agents) into teams so as to maximize their overall productivity. More generally, the task is to find a coalition structure that maximizes the social welfare of the system, defined as the sum of the values of the individual coalitions. We write the social welfare of a coalition structure CS as sw(CS), with the following definition:
$$sw(CS) = \sum_{C \in CS} v(C).$$
Then a socially optimal coalition structure CS* with respect to G maximizes this quantity.
Finding a socially optimal coalition structure is a very natural computational problem, which has been studied beyond the multiagent systems community: it is sometimes called the set partitioning problem. Unfortunately, the problem is NP-hard, because the number of possible coalition structures grows exponentially in the number of players.
Finding the optimal coalition structure by naive exhaustive search is therefore infeasible in general. An influential approach to optimal coalition structure formation is based on the idea of searching a subspace of the coalition structure graph. The idea is best explained with reference to an example.
Suppose we have a game with four agents, N = {1,2,3,4}. There are fifteen possible coalition structures for this set of agents. We can organize these into a coalition structure graph as shown in Figure 18.7, where the nodes at level ℓ of the graph correspond to all the coalition structures with exactly ℓ coalitions. An upward edge in the graph represents the division of a coalition in the lower node into two separate coalitions in the upper node.
For example, there is an edge from {{1},{2,3,4}} to {{1},{2},{3,4}} because this latter coalition structure is obtained from the former by dividing the coalition {2,3,4} into the coalitions {2} and {3,4}.
The optimal coalition structure CS* lies somewhere within the coalition structure graph, and so to find this, it seems we would have to evaluate every node in the graph. But consider the bottom two rows of the graph, levels 1 and 2. Every possible coalition (excluding the empty coalition) appears in these two levels. (Of course, not every possible coalition structure appears in these two levels.) Now, suppose we restrict our search for a possible coalition structure to just these two levels; we go no higher in the graph. Let CS′ be the best coalition structure that we find in these two levels, and let CS* be the best coalition structure overall. Let C* be a coalition with the highest value of all possible coalitions:
$$C^* \in \operatorname*{argmax}_{C \subseteq N} v(C).$$
The value of the best coalition structure we find in the first two levels of the coalition structure graph must be at least as much as the value of the best possible coalition: $sw(CS') \ge v(C^*)$. This is because every possible coalition appears in at least one coalition structure in the first two levels of the graph. So assume the worst case, that is, $sw(CS') = v(C^*)$.
Compare the value of sw(CS′) to sw(CS*). Since $v(C^*)$ is the highest possible value of any coalition, and a coalition structure over n agents contains at most n coalitions (n = 4 in the case of Figure 18.7), the highest possible value of $sw(CS^*)$ would be $n\,v(C^*) = n\,sw(CS')$. In other words, in the worst possible case, the value of the best coalition structure we find in the first two levels of the graph would be $1/n$ the value of the best, where n is the number of agents. Thus, although searching the first two levels of the graph is not guaranteed to give us the optimal coalition structure, it is guaranteed to give us one that is no worse than $1/n$ of the optimal. In practice it will often be much better than that.
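The bound can be checked on a small instance by enumerating all fifteen coalition structures for four agents. In the sketch below the characteristic function is hypothetical (a coalition's value depends only on its size, and all values are nonnegative), and the function names are assumptions made for the example.

```python
def partitions(items):
    """Yield every partition of `items` as a list of frozensets."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` in a coalition of its own...
        yield [frozenset({first})] + smaller
        # ...or add it to one of the existing coalitions.
        for k, block in enumerate(smaller):
            yield smaller[:k] + [block | {first}] + smaller[k + 1:]

def social_welfare(structure, v):
    """sw(CS): the sum of the values of the coalitions in the structure."""
    return sum(v(C) for C in structure)

agents = [1, 2, 3, 4]

def v(C):
    """Hypothetical values by coalition size (illustration only)."""
    return {1: 3, 2: 5, 3: 6, 4: 7}[len(C)]

structures = list(partitions(agents))
best_overall = max(social_welfare(cs, v) for cs in structures)
# Levels 1 and 2 of the graph: structures with one or two coalitions.
best_low = max(social_welfare(cs, v) for cs in structures if len(cs) in (1, 2))
```

For these values the optimum is the structure of four singletons, with social welfare 12, while the best structure in levels 1 and 2 is worth 10: well within the guaranteed factor of 1/n = 1/4.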
18.4 Making Collective Decisions
We will now turn from agent design to mechanism design: the problem of designing the right game for a collection of agents to play. Formally, a mechanism consists of
1. A language for describing the set of allowable strategies that agents may adopt.
2. A distinguished agent, called the center, that collects reports of strategy choices from the agents in the game. (For example, the auctioneer is the center in an auction.)
3. An outcome rule, known to all agents, that the center uses to determine the payoffs to each agent, given their strategy choices.
This section discusses some of the most important mechanisms.
18.4.1 Allocating tasks with the contract net
The contract net protocol is probably the oldest and most important multiagent problem-solving technique studied in AI. It is a high-level protocol for task sharing. As the name suggests, the contract net was inspired by the way that companies make use of contracts.
The overall contract net protocol has four main phases (see Figure 18.8). The process starts with an agent identifying the need for cooperative action with respect to some task. The need might arise because the agent does not have the capability to carry out the task in isolation, or because a cooperative solution might in some way be better (faster, more efficient, more accurate).
Figure 18.8 The contract net task allocation protocol. (Panels: problem recognition, task announcement, bidding, awarding.)
The agent advertises the task to other agents in the net with a task announcement message, and then acts as the manager of that task for its duration. The task announcement message must include sufficient information for recipients to judge whether or not they are willing and able to bid for the task. The precise information included in a task announcement will depend on the application area. It might be some code that needs to be executed; or it might be a logical specification of a goal to be achieved. The task announcement might also include other information that might be required by recipients, such as deadlines, quality-of-service requirements, and so on.
When an agent receives a task announcement, it must evaluate it with respect to its own capabilities and preferences. In particular, each agent must determine, first, whether it has the capability to carry out the task, and second, whether or not it desires to do so. On this basis, it may then submit a bid for the task. A bid will typically indicate the capabilities of the bidder that are relevant to the advertised task, and any terms and conditions under which the task will be carried out.
In general, a manager may receive multiple bids in response to a single task announcement. Based on the information in the bids, the manager selects the most appropriate agent
(or agents) to execute the task. Successful agents are notified through an award message, and
become contractors for the task, taking responsibility for the task until it is completed.
The main computational tasks required to implement the contract net protocol can be summarized as follows:
• Task announcement processing. On receipt of a task announcement, an agent decides if it wishes to bid for the advertised task.
• Bid processing. On receiving multiple bids, the manager must decide which agent to award the task to, and then award the task.
• Award processing. Successful bidders (contractors) must attempt to carry out the task, which may mean generating new subtasks, which are advertised via further task announcements.
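The three processing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Contractor` and `Bid` classes, the idea that a bid carries a single quoted cost, and the rule that the manager awards the task to the lowest quote are all hypothetical choices, not part of the protocol as described above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Bid:
    bidder: str
    cost: float  # assumed bid content: the bidder's quoted cost for the task


@dataclass
class Contractor:
    name: str
    skills: set
    quote: Callable[[str], float]  # hypothetical pricing rule for a task

    def process_announcement(self, task: str) -> Optional[Bid]:
        # Task announcement processing: bid only if capable of the task.
        # (Willingness is assumed here; a real agent would also check preferences.)
        if task in self.skills:
            return Bid(self.name, self.quote(task))
        return None


def run_contract_net(task: str, contractors: list) -> Optional[str]:
    # Task announcement: the manager advertises the task to every agent in the net.
    bids = []
    for contractor in contractors:
        bid = contractor.process_announcement(task)
        if bid is not None:
            bids.append(bid)
    if not bids:
        return None  # nobody bid, so the task stays with the manager
    # Bid processing: assumed selection rule, the lowest quoted cost wins.
    winner = min(bids, key=lambda b: b.cost)
    # Awarding: the winner becomes the contractor for the task.
    return winner.bidder


agents = [
    Contractor("a1", {"mill"}, lambda task: 12.0),
    Contractor("a2", {"mill", "lathe"}, lambda task: 9.0),
    Contractor("a3", {"lathe"}, lambda task: 7.0),
]
awarded = run_contract_net("mill", agents)
```

In this run only `a1` and `a2` are able to bid for the milling task, and `a2` is awarded it because its quote is lower. Award processing (generating and announcing subtasks) would be a recursive call to `run_contract_net` from inside the contractor.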
Despite (or perhaps because of) its simplicity, the contract net is probably the most widely implemented and best-studied framework for cooperative problem solving. It is naturally applicable in many settings; a variation of it is enacted every time you request a car with Uber, for example.
18.4.2 Allocating scarce resources with auctions
One of the most important problems in multiagent systems is that of allocating scarce resources; but we may as well simply say “allocating resources,” since in practice most useful resources are scarce in some sense. The auction is the most important mechanism for allocating resources. The simplest setting for an auction is where there is a single resource and there are multiple possible bidders. Each bidder i has a utility value v_i for the item.
In some cases, each bidder has a private value for the item. For example, a tacky sweater might be attractive to one bidder and valueless to another.
In other cases, such as auctioning drilling rights for an oil tract, the item has a common value: the tract will produce some amount of money, X, and all bidders value a dollar equally, but there is uncertainty as to what the actual value of X is. Different bidders have different information, and hence different estimates of the item’s true value. In either case, bidders end up with their own v_i. Given v_i, each bidder gets a chance, at the appropriate time or times in the auction, to make a bid b_i. The highest bid, b_max, wins the item, but the price paid need not be b_max; that’s part of the mechanism design.
The best-known auction mechanism is the ascending-bid auction,³ or English auction, in which the center starts by asking for a minimum (or reserve) bid b_min. If some bidder is willing to pay that amount, the center then asks for b_min + d, for some increment d, and continues up from there. The auction ends when nobody is willing to bid anymore; then the last bidder wins the item, paying the price bid.
How do we know if this is a good mechanism? One goal is to maximize expected revenue for the seller. Another goal is to maximize a notion of global utility. These goals overlap to
some extent, because one aspect of maximizing global utility is to ensure that the winner of the auction is the agent who values the item the most (and thus is willing to pay the most). We say an auction is efficient if the goods go to the agent who values them most. The ascending-bid auction is usually both efficient and revenue maximizing, but if the reserve price is set too high, the bidder who values it most may not bid, and if the reserve is set too low, the seller may get less revenue.
3 The word “auction” comes from the Latin augeo, to increase.
Probably the most important things that an auction mechanism can do are to encourage a sufficient number of bidders to enter the game and to discourage them from engaging in collusion. Collusion is an unfair or illegal agreement by two or more bidders to manipulate prices. It can happen in secret backroom deals or tacitly, within the rules of the mechanism. For example, in 1999, Germany auctioned ten blocks of cellphone spectrum with a simultaneous auction (bids were taken on all ten blocks at the same time), using the rule that any bid must be a minimum of a 10% raise over the previous bid on a block. There were only two credible bidders, and the first, Mannesman, entered the bid of 20 million deutschmark on blocks 1-5 and 18.18 million on blocks 6-10. Why 18.18M? One of T-Mobile’s managers said they “interpreted Mannesman’s first bid as an offer.” Both parties could compute that a 10% raise on 18.18M is 19.99M; thus Mannesman’s bid was interpreted as saying “we can each get half the blocks for 20M; let’s not spoil it by bidding the prices up higher.” And in fact T-Mobile bid 20M on blocks 6-10 and that was the end of the bidding.
The German government got less than they expected, because the two competitors were able to use the bidding mechanism to come to a tacit agreement on how not to compete. From the government’s point of view, a better result could have been obtained by any of these changes to the mechanism: a higher reserve price; a sealed-bid first-price auction, so that the competitors could not communicate through their bids; or incentives to bring in a third bidder. Perhaps the 10% rule was an error in mechanism design, because it facilitated the precise signaling from Mannesman to T-Mobile.
In general, both the seller and the global utility function benefit if there are more bidders, although global utility can suffer if you count the cost of wasted time of bidders that have no chance of winning. One way to encourage more bidders is to make the mechanism easier for them. After all, if it requires too much research or computation on the part of the bidders, they may decide to take their money elsewhere. So it is desirable that the bidders have a dominant strategy. Recall that “dominant” means that the strategy works against all other strategies, which in turn means that an agent can adopt it without regard for the other strategies. An agent with a dominant strategy can just bid, without wasting time contemplating other agents’ possible strategies. A mechanism by which agents have a dominant strategy is called a strategy-proof mechanism. If, as is usually the case, that strategy involves the bidders revealing their true value, v_i, then it is called a truth-revealing, or truthful, auction; the term incentive compatible is also used. The revelation principle states that any mechanism can be transformed into an equivalent truth-revealing mechanism, so part of mechanism design is finding these equivalent mechanisms.
It turns out that the ascending-bid auction has most of the desirable properties. The bidder with the highest value v_i gets the goods at a price of b_o + d, where b_o is the highest bid among all the other agents and d is the auctioneer’s increment.⁴ Bidders have a simple dominant strategy: keep bidding as long as the current cost is below your v_i. The mechanism is not quite truth-revealing, because the winning bidder reveals only that his v_i > b_o + d; we have a lower bound on v_i but not an exact amount.
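The following is a minimal simulation of the ascending-bid auction, assuming every bidder follows the simple strategy just described (accept the asking price whenever it is below their own value). The function name, the integer prices, and the tie-break (the lowest-indexed willing bidder speaks first) are assumptions made for illustration; the increment must be positive.

```python
def english_auction(values, reserve, increment):
    """Simulate an ascending-bid auction.

    Returns (winner_index, price_paid), or (None, None) if no bidder
    is willing to meet the reserve price.
    """
    asking = reserve
    leader, paid = None, None
    while True:
        # Bidders other than the current leader who would accept the asking price.
        takers = [i for i, v in enumerate(values) if i != leader and asking < v]
        if not takers:
            return leader, paid  # nobody raises: the last bidder wins
        leader, paid = takers[0], asking  # assumed tie-break: lowest index
        asking += increment


winner, price = english_auction([100, 70, 40], reserve=10, increment=5)
close_call = english_auction([68, 66], reserve=60, increment=5)
```

With values 100, 70, and 40, the highest-value bidder wins and pays 70, which is the highest competing bid (65) plus the increment. The second call shows how a coarse increment can let a lower-value bidder win: the bidder with value 66 holds the bid at 65, and the bidder with value 68 will not accept an asking price of 70.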
A disadvantage (from the point of view of the seller) of the ascending-bid auction is that it can discourage competition. Suppose that in a bid for cellphone spectrum there is one advantaged company that everyone agrees would be able to leverage existing customers and infrastructure, and thus can make a larger profit than anyone else. Potential competitors can see that they have no chance in an ascending-bid auction, because the advantaged company can always bid higher. Thus, the competitors may not enter at all, and the advantaged company ends up winning at the reserve price.
4 There is actually a small chance that the agent with highest v_i fails to get the goods, in the case in which b_o < v_i < b_o + d. The chance of this can be made arbitrarily small by decreasing the increment d.
Another negative property of the English auction is its high communication costs. Either the auction takes place in one room or all bidders have to have high-speed, secure communication lines; in either case they have to have time to go through several rounds of bidding.
An alternative mechanism, which requires much less communication, is the sealed-bid auction. Each bidder makes a single bid and communicates it to the auctioneer, without the other bidders seeing it. With this mechanism, there is no longer a simple dominant strategy. If your value is v_i and you believe that the maximum of all the other agents’ bids will be b_o, then you should bid b_o + ε, for some small ε, if that is less than v_i. Thus, your bid depends on your estimation of the other agents’ bids, requiring you to do more work. Also, note that the agent with the highest v_i might not win the auction. This is offset by the fact that the auction is more competitive, reducing the bias toward an advantaged bidder.
A small change in the mechanism for sealed-bid auctions leads to the sealed-bid second-price auction, also known as a Vickrey auction.⁵ In such auctions, the winner pays the price of the second-highest bid, b_o, rather than paying his own bid. This simple modification completely eliminates the complex deliberations required for standard (or first-price) sealed-bid auctions, because the dominant strategy is now simply to bid v_i; the mechanism is truth-revealing. Note that the utility of agent i in terms of his bid b_i, his value v_i, and the best bid among the other agents, b_o, is
\[
u_i =
\begin{cases}
(v_i - b_o) & \text{if } b_i > b_o \\
0 & \text{otherwise.}
\end{cases}
\]
To see that b_i = v_i is a dominant strategy, note that when (v_i − b_o) is positive, any bid that wins the auction is optimal, and bidding v_i in particular wins the auction. On the other hand, when (v_i − b_o) is negative, any bid that loses the auction is optimal, and bidding v_i in particular loses the auction. So bidding v_i is optimal for all possible values of b_o, and in fact, v_i is the only bid that has this property. Because of its simplicity and the minimal computation requirements for both seller and bidders, the Vickrey auction is widely used in distributed AI systems.
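The dominance argument can be checked by brute force on a small grid of integer bids. This is a sketch, not a proof: the function names and the grid of values 0 to 20 are assumptions chosen for illustration, and the utility function is the second-price rule stated above.

```python
def vickrey_utility(bid, value, best_other_bid):
    """Utility in a sealed-bid second-price auction: the winner pays b_o."""
    return value - best_other_bid if bid > best_other_bid else 0


def truthful_is_optimal(value, grid):
    """True if bidding `value` is a best response to every rival bid in `grid`."""
    for b_o in grid:
        best = max(vickrey_utility(b, value, b_o) for b in grid)
        if vickrey_utility(value, value, b_o) != best:
            return False
    return True


grid = range(0, 21)
always_optimal = truthful_is_optimal(10, grid)

# Shading the bid can only hurt: with value 10 and a rival bid of 9,
# a bid of 8 loses an auction that a truthful bid would have won.
shaded = vickrey_utility(8, 10, 9)
truthful = vickrey_utility(10, 10, 9)
```

Here `always_optimal` is `True`, while the shaded bid earns 0 against a truthful payoff of 1.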
Internet search engines conduct several trillion auctions each year to sell advertisements along with their search results, and online auction sites handle $100 billion a year in goods, all using variants of the Vickrey auction. Note that the expected value to the seller is b_o, which is the same expected return as the limit of the English auction as the increment d goes to zero. This is actually a very general result: the revenue equivalence theorem states that, with a few minor caveats, any auction mechanism in which bidders have values v_i known only to themselves (but know the probability distribution from which those values are sampled), will yield the same expected revenue. This principle means that the various mechanisms are not competing on the basis of revenue generation, but rather on other qualities.
Although the second-price auction is truth-revealing, it turns out that auctioning n goods with an n+1 price auction is not truth-revealing. Many Internet search engines use a mechanism where they auction n slots for ads on a page. The highest bidder wins the top spot, the second highest gets the second spot, and so on. Each winner pays the price bid by the next-lower bidder, with the understanding that payment is made only if the searcher actually clicks on the ad. The top slots are considered more valuable because they are more likely to be noticed and clicked on.
5 Named after William Vickrey (1914-1996), who won the 1996 Nobel Prize in economics for this work and died of a heart attack three days later.
Imagine that three bidders, b_1, b_2, and b_3, have valuations for a click of v_1 = 200, v_2 = 180, and v_3 = 100, and that n = 2 slots are available; and it is known that the top spot is clicked on 5% of the time and the bottom spot 2%. If all bidders bid truthfully, then b_1 wins the top slot and pays 180, and has an expected return of (200 − 180) × 0.05 = 1. The second slot goes to b_2. But b_1 can see that if she were to bid anything in the range 101-179, she would concede the top slot to b_2, win the second slot, and yield an expected return of (200 − 100) × 0.02 = 2. Thus, b_1 can double her expected return by bidding less than her true value in this case. In general, bidders in this n+1 price auction must spend a lot of energy analyzing the bids of others to determine their best strategy; there is no simple dominant strategy.
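The arithmetic of this example can be reproduced with a short function. It is a sketch under the rules stated above (slots assigned in order of bid, each winner pays the next-lower bid per click); the function name, the zero-based bidder indices, and the sample shaded bid of 150 are assumptions for illustration.

```python
def slot_auction_return(bids, values, click_rates, bidder):
    """Expected return per search for `bidder` when each slot winner
    pays the next-lower bid for every click."""
    order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    rank = order.index(bidder)
    if rank >= len(click_rates):
        return 0.0  # bidder did not win a slot
    # Price per click is the bid just below this bidder's (0 if there is none).
    price = bids[order[rank + 1]] if rank + 1 < len(order) else 0.0
    return (values[bidder] - price) * click_rates[rank]


values = [200, 180, 100]      # true per-click values of the three bidders
click_rates = [0.05, 0.02]    # top slot, bottom slot

truthful_return = slot_auction_return([200, 180, 100], values, click_rates, 0)
shaded_return = slot_auction_return([150, 180, 100], values, click_rates, 0)
```

Bidding truthfully gives the first bidder an expected return of about 1, while shading the bid to 150 (anywhere between 101 and 179 works) gives about 2.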
Aggarwal et al. (2006) show that there is a unique truthful auction mechanism for this multislot problem, in which the winner of slot j pays the price for slot j just for those additional clicks that are available at slot j and not at slot j + 1. The winner pays the price for the lower slot for the remaining clicks. In our example, b_1 would bid 200 truthfully, and would pay 180 for the additional .05 − .02 = .03 clicks in the top slot, but would pay only the cost of the bottom slot, 100, for the remaining .02 clicks. Thus, the total return to b_1 would be (200 − 180) × .03 + (200 − 100) × .02 = 2.6.
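The same payment rule can be written as a loop over slots. This is a sketch that only reproduces the worked example: it assumes all bidders bid truthfully, that `values` is sorted from highest to lowest, and that a slot's price per click is the bid of the next bidder down; the function name is hypothetical.

```python
def laddered_return(values, click_rates, slot):
    """Expected return for the truthful winner of `slot` (0 = top).

    The winner pays the next-lower bid only for the clicks gained over
    the slot below, and the lower slots' prices for the remaining clicks.
    """
    total = 0.0
    for j in range(slot, len(click_rates)):
        lower_rate = click_rates[j + 1] if j + 1 < len(click_rates) else 0.0
        extra_clicks = click_rates[j] - lower_rate
        price = values[j + 1] if j + 1 < len(values) else 0.0
        total += (values[slot] - price) * extra_clicks
    return total


values = [200, 180, 100]
click_rates = [0.05, 0.02]
top_return = laddered_return(values, click_rates, slot=0)
second_return = laddered_return(values, click_rates, slot=1)
```

For the top bidder this gives (200 − 180) × 0.03 + (200 − 100) × 0.02, about 2.6, which is more than the 2 she could obtain by conceding the top slot under the n+1 price rule.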
Another example of where auctions can come into play within AI is when a collection of agents are deciding whether to cooperate on a joint plan. Hunsberger and Grosz (2000) show that this can be accomplished efficiently with an auction in which the agents bid for roles in the joint plan.
Common goods
Now let’s consider another type of game, in which countries set their policy for controlling air pollution. Each country has a choice: they can reduce pollution at a cost of −10 points for implementing the necessary changes, or they can continue to pollute, which gives them a net utility of −5 (in added health costs, etc.) and also contributes −1 points to every other country (because the air is shared across countries). Clearly, the dominant strategy for each country is “continue to pollute,” but if there are 100 countries and each follows this policy, then each country gets a total utility of −104 (−5 from its own pollution plus −1 from each of the other 99 countries), whereas if every country reduced pollution, they would each have a utility of −10. This situation is called the tragedy of the commons: if nobody has to pay for using a common resource, then it may be exploited in a way that leads to a lower total utility for all agents. It is similar to the prisoner’s dilemma: there is another solution to the game that is better for all parties, but there appears to be no way for rational agents to arrive at that solution under the current game.
One approach for dealing with the tragedy of the commons is to change the mechanism to one that charges each agent for using the commons. More generally, we need to ensure that all externalities (effects on global utility that are not recognized in the individual agents’ transactions) are made explicit.
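The payoffs in the pollution game are easy to tabulate, treating every cost as a negative utility. The function name and the way the game is parameterized (own choice plus the number of other polluters) are assumptions made for this sketch.

```python
def country_utility(pollutes, other_polluters):
    """Utility of one country: -5 if it pollutes, -10 if it pays to reduce,
    plus -1 for every other country that pollutes."""
    own = -5 if pollutes else -10
    return own - other_polluters


n = 100
everyone_pollutes = country_utility(True, n - 1)
everyone_reduces = country_utility(False, 0)

# "Continue to pollute" is dominant: it is better whatever the others do.
pollute_dominates = all(
    country_utility(True, k) > country_utility(False, k) for k in range(n)
)
```

Each country gets −104 when all pollute and −10 when all reduce, yet polluting is still the better individual choice for every possible number of other polluters, which is exactly the tension described above.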
Setting the prices correctly is the difficult part. In the limit, this approach amounts to creating a mechanism in which each agent is effectively required to maximize global utility, but can do so by making a local decision. For this example, a carbon tax would be an example of a mechanism that charges for use of the commons in a way that, if implemented well, maximizes global utility.
It turns out there is a mechanism design, known as the Vickrey-Clarke-Groves or VCG mechanism, which has two favorable properties. First, it is utility maximizing: that is, it maximizes the global utility, which is the sum of the utilities for all parties, Σ_i v_i. Second, the mechanism is truth-revealing: the dominant strategy for all agents is to reveal their true value. There is no need for them to engage in complicated strategic bidding calculations.
We will give an example using the problem of allocating some common goods. Suppose a city decides it wants to install some free wireless Internet transceivers. However, the number of transceivers available is less than the number of neighborhoods that want them. The city wants to maximize global utility, but if it says to each neighborhood council “How much do you value a free transceiver (and by the way we will give them to the parties that value them the most)?” then each neighborhood will have an incentive to report a very high value. The VCG mechanism discourages this ploy and gives them an incentive to report their true value. It works as follows:
1. The center asks each agent to report its value for an item, v_i.
2. The center allocates the goods to a set of winners, W, to maximize Σ_{i∈W} v_i.
3. The center calculates for each winning agent how much of a loss their individual presence in the game has caused to the losers (who each got 0 utility, but could have got v_j if they were a winner).
4. Each winning agent then pays to the center a tax equal to this loss.
For example, suppose there are 3 transceivers available and 5 bidders, who bid 100, 50, 40, 20, and 10. Thus the set of 3 winners, W, are the ones who bid 100, 50, and 40 and the global utility from allocating these goods is 190. For each winner, it is the case that had they not been in the game, the bid of 20 would have been a winner. Thus, each winner pays a tax of 20 to the center.
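The four steps can be sketched for the special case in the example: identical items, at most one per bidder. The function name and the representation of reports as a plain list are assumptions; the tax is computed as the total reported value the other agents would obtain without the winner, minus what they obtain with the winner present.

```python
def vcg_identical_items(reports, n_items):
    """VCG for n identical items, one per agent.

    Returns (winners, taxes): the indices of the winning agents and a
    dict mapping each winner to the tax it pays to the center.
    """
    # Step 2: give the items to the agents with the highest reported values.
    order = sorted(range(len(reports)), key=lambda i: reports[i], reverse=True)
    winners = order[:n_items]
    taxes = {}
    for w in winners:
        # Step 3: loss caused to the others by w's presence in the game.
        others = [i for i in order if i != w]
        value_without_w = sum(reports[i] for i in others[:n_items])
        value_with_w = sum(reports[i] for i in winners if i != w)
        # Step 4: the winner pays that loss as a tax.
        taxes[w] = value_without_w - value_with_w
    return winners, taxes


reports = [100, 50, 40, 20, 10]
winners, taxes = vcg_identical_items(reports, n_items=3)
global_utility = sum(reports[i] for i in winners)
```

On the transceiver example the winners are the agents who reported 100, 50, and 40, the global utility is 190, and each winner pays a tax of 20, the value of the bidder it displaced.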
All winners should be happy because they pay a tax that is less than their value, and all losers are as happy as they can be, because they value the goods less than the required tax. That’s why the mechanism is truth-revealing. In this example, the crucial value is 20; it would be irrational to bid above 20 if your true value was actually below 20, and vice versa. Since the crucial value could be anything (depending on the other bidders), that means that it is always irrational to bid anything other than your true value.
The VCG mechanism is very general, and can be applied to all sorts of games, not just auctions, with a slight generalization of the mechanism described above. For example, in a combinatorial auction there are multiple different items available and each bidder can place multiple bids, each on a subset of the items. For example, in bidding on plots of land, one bidder might want either plot X or plot Y but not both; another might want any three adjacent plots, and so on. The VCG mechanism can be used to find the optimal outcome, although with 2^N subsets of N goods to contend with, the computation of the optimal outcome is NP-complete. With a few caveats the VCG mechanism is unique: every other optimal mechanism is essentially equivalent.
18.4.3 Voting
The next class of mechanisms that we look at are voting procedures, of the type that are used for political decision making in democratic societies. The study of voting procedures derives from the domain of social choice theory.
The basic setting is as follows. As usual, we have a set N = {1, ..., n} of agents, who in this section will be the voters. These voters want to make decisions with respect to a set Ω = {ω_1, ω_2, ...} of possible outcomes. In a political election, each element of Ω could stand for a different candidate winning the election.
Each voter will have preferences over Ω. These are usually expressed not as quantitative utilities but rather as qualitative comparisons: we write ω ≻_i ω' to mean that outcome ω is ranked above outcome ω' by agent i. In an election with three candidates, agent i might have ω_2 ≻_i ω_3 ≻_i ω_1.
The fundamental problem of social choice theory is to combine these preferences, using a social welfare function, to come up with a social preference order: a ranking of the candidates, from most preferred down to least preferred. In some cases, we are only interested in a social outcome: the most preferred outcome by the group as a whole. We will write ω ≻* ω' to mean that ω is ranked above ω' in the social preference order.
A simpler setting is where we are not concerned with obtaining an entire ordering of
candidates, but simply want to choose a set of winners.
A social choice function takes as
input a preference order for each voter, and produces as output a set of winners.
Democratic societies want a social outcome that reflects the preferences of the voters. Unfortunately, this is not always straightforward. Consider Condorcet’s Paradox, a famous example posed by the Marquis de Condorcet (1743-1794). Suppose we have three outcomes, Ω = {ω_a, ω_b, ω_c}, and three voters, N = {1, 2, 3}, with preferences as follows.
\[
\begin{aligned}
\omega_a &\succ_1 \omega_b \succ_1 \omega_c \\
\omega_c &\succ_2 \omega_a \succ_2 \omega_b \\
\omega_b &\succ_3 \omega_c \succ_3 \omega_a
\end{aligned}
\tag{18.2}
\]
Now, suppose we have to choose one of the three candidates on the basis of these preferences. The paradox is that:
• 2/3 of the voters prefer ω_c over ω_a.
• 2/3 of the voters prefer ω_a over ω_b.
• 2/3 of the voters prefer ω_b over ω_c.
So, for each possible winner, we can point to another candidate who would be preferred by at least 2/3 of the electorate. It is obvious that in a democracy we cannot hope to make every
voter happy. This demonstrates that there are scenarios in which no matter which outcome we choose, a majority of voters will prefer a different outcome. A natural question is whether there is any “good” social choice procedure that really reflects the preferences of voters. To answer this, we need to be precise about what we mean when we say that a rule is “good.” We will list some properties we would like a good social welfare function to satisfy:
• The Pareto Condition: The Pareto condition simply says that if every voter ranks ω_i above ω_j, then ω_i ≻* ω_j.
• The Condorcet Winner Condition: An outcome is said to be a Condorcet winner if, for each other outcome, a majority of voters prefer it to that outcome. To put it another way, a Condorcet winner is a candidate that would beat every other candidate in a pairwise election. The Condorcet winner condition says that if ω_i is a Condorcet winner, then ω_i should be ranked first.
• Independence of Irrelevant Alternatives (IIA): Suppose there are a number of candidates, including ω_i and ω_j, and voter preferences are such that ω_i ≻* ω_j. Now, suppose one voter changed their preferences in some way, but not about the relative ranking of ω_i and ω_j. The IIA condition says that ω_i ≻* ω_j should not change.
• No Dictatorships: It should not be the case that the social welfare function simply outputs one voter’s preferences and ignores all other voters.
These four conditions seem reasonable, but a fundamental theorem of social choice theory called Arrow’s theorem (due to Kenneth Arrow) tells us that it is impossible to satisfy all four conditions (for cases where there are at least three outcomes). That means that for any social choice mechanism we might care to pick, there will be some situations (perhaps unusual or pathological) that lead to controversial outcomes. However, it does not mean that democratic decision making is hopeless in most cases.
We have not yet seen any actual voting procedures, so let’s now look at some.
• With just two candidates, simple majority vote (the standard method in the US and UK) is the favored mechanism. We ask each voter which of the two candidates they prefer, and the one with the most votes is the winner.
• With more than two outcomes, plurality voting is a common system. We ask each voter for their top choice, and select the candidate(s) (more than one in the case of ties) who get the most votes, even if nobody gets a majority. While it is common, plurality voting has been criticized for delivering unpopular outcomes. A key problem is that it only takes into account the top-ranked candidate in each voter’s preferences.
• The Borda count (after Jean-Charles de Borda, a contemporary and rival of Condorcet) is a voting procedure that takes into account all the information in a voter’s preference ordering. Suppose we have k candidates. Then for each voter i, we take their preference ordering ≻_i, and give a score of k to the top-ranked candidate, a score of k − 1 to the second-ranked candidate, and so on down to the least-favored candidate in i’s ordering. The total score for each candidate is their Borda count, and to obtain the social outcome ≻*, outcomes are ordered by their Borda count, highest to lowest. One practical problem with this system is that it asks voters to express preferences on all the candidates, and some voters may only care about a subset of candidates.
• In approval voting, voters submit a subset of the candidates that they approve of. The winner(s) are those who are approved by the most voters. This system is often used when the task is to choose multiple winners.
• In instant runoff voting, voters rank all the candidates, and if a candidate has a majority of first-place votes, they are declared the winner. If not, the candidate with the fewest first-place votes is eliminated. That candidate is removed from all the preference rankings (so those voters who had the eliminated candidate as their first choice now have another candidate as their new first choice) and the process is repeated. Eventually, some candidate will have a majority of first-place votes (unless there is a tie).
• In true majority rule voting, the winner is the candidate who beats every other candidate in pairwise comparisons. Voters are asked for a full preference ranking of all candidates. We say that ω beats ω' if more voters have ω ≻ ω' than have ω' ≻ ω. This system has the nice property that the majority always agrees on the winner, but it has the bad property that not every election will be decided: in the Condorcet paradox, for example, no candidate wins a majority.
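Several of these procedures are short enough to sketch directly. The code below assumes each voter's preference order is a tuple listing candidates from most to least preferred, and that a profile is a list of such tuples; the function names are hypothetical. It implements plurality, the Borda count (score k for the top candidate), and the pairwise "beats" test used by true majority rule, and then applies them to the Condorcet paradox profile.

```python
from collections import Counter


def plurality(profile):
    """Set of candidates with the most first-place votes."""
    tally = Counter(ranking[0] for ranking in profile)
    top = max(tally.values())
    return {c for c, votes in tally.items() if votes == top}


def borda(profile):
    """Borda count per candidate: k points for first place, k - 1 for second, ..."""
    k = len(profile[0])
    scores = Counter()
    for ranking in profile:
        for position, candidate in enumerate(ranking):
            scores[candidate] += k - position
    return dict(scores)


def beats(profile, x, y):
    """True if more voters rank x above y than rank y above x."""
    x_over_y = sum(r.index(x) < r.index(y) for r in profile)
    return x_over_y > len(profile) - x_over_y


def condorcet_winner(profile):
    """The candidate who beats every other in pairwise comparison, or None."""
    candidates = profile[0]
    for c in candidates:
        if all(beats(profile, c, d) for d in candidates if d != c):
            return c
    return None


# The three voters of the Condorcet paradox.
paradox = [("a", "b", "c"), ("c", "a", "b"), ("b", "c", "a")]
# A profile with a clear winner, for contrast.
clear = [("a", "b", "c"), ("a", "c", "b"), ("b", "a", "c")]
```

On `paradox`, every candidate is beaten by some other candidate, so `condorcet_winner` returns `None`, and both plurality and the Borda count end in a three-way tie. On `clear`, candidate `a` wins under all three procedures.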
Strategic manipulation
Besides Arrow’s Theorem, another important negative result in the area of social choice theory is the Gibbard-Satterthwaite Theorem. This result relates to the circumstances under which a voter can benefit from misrepresenting their preferences.
Recall that a social choice function takes as input a preference order for each voter, and gives as output a set of winning candidates. Each voter has, of course, their own true preferences, but there is nothing in the definition of a social choice function that requires voters to report their preferences truthfully; they can declare whatever preferences they like.
In some cases, it can make sense for a voter to misrepresent their preferences. For example, in plurality voting, voters who think their preferred candidate has no chance of winning may vote for their second choice instead. That means plurality voting is a game in which voters have to think strategically (about the other voters) to maximize their expected utility.
This raises an interesting question: can we design a voting mechanism that is immune to such manipulation, that is, a mechanism that is truth-revealing? The Gibbard-Satterthwaite Theorem tells us that we cannot: Any social choice function that satisfies the Pareto condition for a domain with more than two outcomes is either manipulable or a dictatorship. That is, for any “reasonable” social choice procedure, there will be some circumstances under which a voter can in principle benefit by misrepresenting their preferences. However, it does not tell us how such manipulation might be done; and it does not tell us that such manipulation is likely in practice.
18.4.4 Bargaining
Bargaining, or negotiation, is another mechanism that is used frequently in everyday life. It has been studied in game theory since the 1950s and more recently has become a task for automated agents. Bargaining is used when agents need to reach agreement on a matter of common interest. The agents make offers (also called proposals or deals) to each other under specific protocols, and either accept or reject each offer.
Bargaining with the alternating offers protocol
One influential bargaining protocol is the alternating offers bargaining model. For simplicity we’ll again assume just two agents. Bargaining takes place in a sequence of rounds. A_1 begins, at round 0, by making an offer. If A_2 accepts the offer, then the offer is implemented. If A_2 rejects the offer, then negotiation moves to the next round. This time A_2 makes an offer and A_1 chooses to accept or reject it, and so on. If the negotiation never terminates (because agents reject every offer) then we define the outcome to be the conflict deal. A convenient simplifying assumption is that both agents prefer to reach an outcome (any outcome) in finite time rather than being stuck in the infinitely time-consuming conflict deal.
We will use the scenario of dividing a pie to illustrate alternating offers. The idea is that there is some resource (the “pie”) whose value is 1, which can be divided into two parts, one part for each agent. Thus an offer in this scenario is a pair (x, 1 − x), where x is the amount
of the pie that A_1 gets, and 1 − x is the amount that A_2 gets. The space of possible deals (the negotiation set) is thus:
\[
\{(x, 1 - x) : 0 \le x \le 1\}.
\]
[...] Agent A_1 can take this fact into account by offering (1 − γ_2, γ_2), an offer that A_2 may as well accept because A_2 can do no better than γ_2 at this point in time. (If you are worried about what happens with ties, just make the offer be (1 − (γ_2 + ε), γ_2 + ε) for some small value of ε.)
So, the two strategies of A_1 offering (1 − γ_2, γ_2), and A_2 accepting that offer are in Nash equilibrium. Patient players (those with a larger γ_2) will be able to obtain larger pieces of the pie under this protocol: in this setting, patience truly is a virtue.
Now consider the general case, where there are no bounds on the number of rounds. As
in the 1-round case, A_1 can craft a proposal that A_2 should accept, because it gives A_2 the maximal achievable amount, given the discount factors. It turns out that A_1 will get
\[
\frac{1 - \gamma_2}{1 - \gamma_1 \gamma_2}
\]
and A_2 will get the remainder.
Negotiation in task-oriented domains
In this section, we consider negotiation for task-oriented domains. In such a domain, a set of tasks must be carried out, and each task is initially assigned to a set of agents. The agents may be able to benefit by negotiating on who will carry out which tasks. For example, suppose some tasks are done on a lathe machine and others on a milling machine, and that any agent using a machine must incur a significant setup cost. Then it would make sense for one agent to offer another “I have to set up on the milling machine anyway; how about if I do all your milling tasks, and you do all my lathe tasks?”
Unlike the bargaining scenario, we start with an initial allocation, so if the agents fail to agree on any offers, they perform the tasks T_i^0 that they were originally allocated.
To keep things simple, we will again assume just two agents. Let 7 be the set of all tasks
and let (7}, 7) denote the initial allocation of tasks to the two agents at time 0. Each task in T must be assigned to exactly one agent.
We assume we have a cost function ¢, which
for every set of tasks 7" gives a positive real number ¢(7”) indicating the cost to any agent
of carrying out the tasks 7”. (Assume the cost depends only on the tasks, not on the agent
carrying out the task.) The cost function is monotonic—adding more tasks never reduces the cost—and the cost of doing nothing is zero: ¢({}) =
0. As an example, suppose the cost of
setting up the milling machine is 10 and each milling task costs 1, then the cost of a set of two milling tasks would be 12, and the cost of a set of five would be 15.
An offer of the form $(T_1, T_2)$ means that agent $i$ is committed to performing the set of tasks $T_i$, at cost $c(T_i)$. The utility to agent $i$ is the amount they have to gain from accepting the offer: the difference between the cost of doing the originally assigned set of tasks and the cost of doing this new set of tasks:
$$U_i((T_1, T_2)) = c(T_i^0) - c(T_i).$$
An offer $(T_1, T_2)$ is individually rational if $U_i((T_1, T_2)) \ge 0$ for both agents. If a deal is not individually rational, then at least one agent can do better by simply performing the tasks it was originally allocated.
The negotiation set for task-oriented domains (assuming rational agents) is the set of offers that are both individually rational and Pareto optimal. There is no sense making an individually irrational offer that will be refused, nor in making an offer when there is a better offer that improves one agent’s utility without hurting anyone else.
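The definitions of cost, utility, individual rationality, and Pareto optimality can be made concrete with a small brute-force sketch. Everything specific here is an assumption made for illustration rather than part of the text: the task names (`m*` for milling, `l*` for lathe), a lathe setup cost equal to the milling setup cost of 10, and the helper names `cost`, `utility`, `all_deals`, and `negotiation_set`.

```python
from itertools import combinations

SETUP_COST = 10  # assumed setup cost per machine type used (same for both machines)
TASK_COST = 1    # cost of each individual task

def cost(tasks):
    """Monotonic cost c(T'): one setup per machine type used, plus 1 per task."""
    machines = {t[0] for t in tasks}  # first letter names the machine
    return SETUP_COST * len(machines) + TASK_COST * len(tasks)

def utility(i, deal, initial):
    """U_i(deal) = c(T_i^0) - c(T_i): what agent i saves relative to its initial tasks."""
    return cost(initial[i]) - cost(deal[i])

def all_deals(tasks):
    """Every assignment of each task to exactly one of the two agents."""
    ordered = sorted(tasks)
    for r in range(len(ordered) + 1):
        for subset in combinations(ordered, r):
            t1 = frozenset(subset)
            yield (t1, frozenset(ordered) - t1)

def negotiation_set(initial):
    """Deals that are individually rational and Pareto optimal."""
    deals = list(all_deals(initial[0] | initial[1]))
    utils = {d: (utility(0, d, initial), utility(1, d, initial)) for d in deals}

    def dominated(d):
        # Some other deal is at least as good for both agents and differs,
        # hence strictly better for at least one of them.
        u = utils[d]
        return any(v[0] >= u[0] and v[1] >= u[1] and v != u for v in utils.values())

    return [d for d in deals if min(utils[d]) >= 0 and not dominated(d)]

# Each agent starts with one milling task and one lathe task.
INITIAL = (frozenset({"m1", "l1"}), frozenset({"m2", "l2"}))
NEGOTIATION_SET = negotiation_set(INITIAL)
```

In this toy instance the surviving deals are the ones in which one agent takes all the milling tasks and the other takes all the lathe tasks, which matches the informal offer quoted earlier. The enumeration visits all $2^{|T|}$ allocations, so it is only practical for very small task sets.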
Chapter 18 Multiagent Decision Making
The monotonic concession protocol
The negotiation protocol we consider for task-oriented domains is known as the monotonic concession protocol. The rules of this protocol are as follows.
• Negotiation proceeds in a series of rounds.
• On the first round, both agents simultaneously propose a deal, $D_i = (T_1, T_2)$, from the negotiation set. (This is different from the alternating offers we saw before.)
• An agreement is reached if the two agents propose deals $D_1$ and $D_2$, respectively, such that either (i) $U_1(D_2) \ge U_1(D_1)$ or (ii) $U_2(D_1) \ge U_2(D_2)$, that is, if one of the agents finds that the deal proposed by the other is at least as good as the proposal it made. If agreement is reached, then the rule for determining the agreement deal is as follows: If each agent’s offer matches or exceeds that of the other agent, then one of the proposals is selected at random. If only one proposal exceeds or matches the other’s proposal, then this is the agreement deal.
• If no agreement is reached, then negotiation proceeds to another round of simultaneous proposals. In round $t+1$, each agent must either repeat the proposal from the previous round or make a concession: a proposal that is more preferred by the other agent (i.e., has higher utility).
• If neither agent makes a concession, then negotiation terminates, and both agents implement the conflict deal, carrying out the tasks they were originally assigned.
Since the set of possible deals is finite, the agents cannot negotiate indefinitely: either the agents will reach agreement, or a round will occur in which neither agent concedes. However, the protocol does not guarantee that agreement will be reached quickly: since the number of possible deals is $O(2^{|T|})$, it is conceivable that negotiation will continue for a number of rounds exponential in the number of tasks to be allocated.

The Zeuthen strategy
So far, we have said nothing about how negotiation participants might or should behave when using the monotonic concession protocol for task-oriented domains. One possible strategy is the Zeuthen strategy.
The idea of the Zeuthen strategy is to measure an agent’s willingness to risk conflict. Intuitively, an agent will be more willing to risk conflict if the difference in utility between its current proposal and the conflict deal is low. In this case, the agent has little to lose if negotiation fails and the conflict deal is implemented, and so is more willing to risk conflict, and less willing to concede. In contrast, if the difference between the agent’s current proposal and the conflict deal is high, then the agent has more to lose from conflict and is therefore less willing to risk conflict, and thus more willing to concede. Agent $i$’s willingness to risk conflict at round $t$, denoted $\mathit{risk}_i^t$, is measured as follows:
$$\mathit{risk}_i^t = \frac{\text{utility } i \text{ loses by conceding and accepting } j\text{'s offer}}{\text{utility } i \text{ loses by not conceding and causing conflict}}.$$
Until an agreement is reached, the value of $\mathit{risk}_i^t$ will be a value between 0 and 1. Higher values of $\mathit{risk}_i^t$ (nearer to 1) indicate that $i$ has less to lose from conflict, and so is more willing to risk conflict.
The Zeuthen strategy says that each agent’s first proposal should be a deal in the negotiation set that maximizes its own utility (there may be more than one). After that, the agent who should concede on round $t$ of negotiation should be the one with the smaller value of risk: the one with the most to lose from conflict if neither concedes.
The next question to answer is how much should be conceded? The answer provided by the Zeuthen strategy is, “Just enough to change the balance of risk to the other agent.” That is, an agent should make the smallest concession that will make the other agent concede on the next round.
There is one final refinement to the Zeuthen strategy. Suppose that at some point both agents have equal risk. Then, according to the strategy, both should concede. But, knowing this, one agent could potentially “defect” by not conceding, and so benefit. To avoid the possibility of both conceding at this point, we extend the strategy by having the agents “flip a coin” to decide who should concede if ever an equal risk situation is reached.
With this strategy, agreement will be Pareto optimal and individually rational. However, since the space of possible deals is exponential in the number of tasks, following this strategy may require $O(2^{|T|})$ computations of the cost function at each negotiation step. Finally, the Zeuthen strategy (with the coin flipping rule) is in Nash equilibrium.
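One round of the monotonic concession protocol under the Zeuthen strategy might be sketched as follows. This is a minimal illustration, not a full negotiator. It assumes the conflict deal has utility 0 (as in task-oriented domains, where utility is measured relative to the initial allocation), so risk reduces to $(U_i(D_i) - U_i(D_j)) / U_i(D_i)$. The function names and the utility tables `U1` and `U2` are hypothetical.

```python
import random

def risk(u_own_proposal, u_other_proposal):
    """Willingness to risk conflict: (loss from conceding) / (loss from conflict).

    Assumes the conflict deal is worth 0 to the agent. By convention (an
    assumption here), risk is 1 when the agent's own proposal is worth 0,
    because the agent then has nothing to lose from conflict.
    """
    if u_own_proposal == 0:
        return 1.0
    return (u_own_proposal - u_other_proposal) / u_own_proposal

def agreement_reached(u1, u2, d1, d2):
    """Agreement if either agent finds the other's proposal at least as good as its own."""
    return u1[d2] >= u1[d1] or u2[d1] >= u2[d2]

def who_concedes(u1, u2, d1, d2, rng=random):
    """Return 1 or 2: the agent with the smaller risk concedes; ties are broken by coin flip."""
    r1 = risk(u1[d1], u1[d2])
    r2 = risk(u2[d2], u2[d1])
    if r1 < r2:
        return 1
    if r2 < r1:
        return 2
    return rng.choice([1, 2])

# Hypothetical utilities: agent 1 proposes D1, agent 2 proposes D2.
U1 = {"D1": 8, "D2": 2}
U2 = {"D1": 1, "D2": 6}
```

With these numbers neither proposal is acceptable to the other side, agent 1 has risk 6/8 and agent 2 has risk 5/6, so agent 1 (who has more to lose from conflict) is the one who should concede.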
Summary
• Multiagent planning is necessary when there are other agents in the environment with which to cooperate or compete. Joint plans can be constructed, but must be augmented with some form of coordination if two agents are to agree on which joint plan to execute.
• Game theory describes rational behavior for agents in situations in which multiple agents interact. Game theory is to multiagent decision making as decision theory is to single-agent decision making.
• Solution concepts in game theory are intended to characterize rational outcomes of a game: outcomes that might occur if every agent acted rationally.
• Noncooperative game theory assumes that agents must make their decisions independently. Nash equilibrium is the most important solution concept in noncooperative game theory. A Nash equilibrium is a strategy profile in which no agent has an incentive to deviate from its specified strategy. We have techniques for dealing with repeated games and sequential games.
• Cooperative game theory considers settings in which agents can make binding agreements to form coalitions in order to cooperate. Solution concepts in cooperative game theory attempt to formulate which coalitions are stable (the core) and how to fairly divide the value that a coalition obtains (the Shapley value).
• Specialized techniques are available for certain important classes of multiagent decision: the contract net for task sharing; auctions for efficiently allocating scarce resources; bargaining for reaching agreements on matters of common interest; and voting procedures for aggregating preferences.
Bibliographical and Historical Notes
It is a curiosity of the field that researchers in AI did not begin to seriously consider the issues surrounding interacting agents until the 1980s, and the multiagent systems field did not really become established as a distinctive subdiscipline of AI until a decade later. Nevertheless, ideas that hint at multiagent systems were present in the 1970s. For example, in his highly influential Society of Mind theory, Marvin Minsky (1986, 2007) proposed that human minds are constructed from an ensemble of agents. Doug Lenat had similar ideas in a framework he called BEINGS (Lenat, 1975). In the 1970s, building on his PhD work on the PLANNER system, Carl Hewitt proposed a model of computation as interacting agents called the actor model, which has become established as one of the fundamental models in concurrent computation (Hewitt, 1977; Agha, 1986).
The prehistory of the multiagent systems field is thoroughly documented in a collection of papers entitled Readings in Distributed Artificial Intelligence (Bond and Gasser, 1988). The collection is prefaced with a detailed statement of the key research challenges in multiagent systems, which remains remarkably relevant today, more than thirty years after it was written. Early research on multiagent systems tended to assume that all agents in a system were acting with common interest, with a single designer. This is now recognized as a special case of the more general multiagent setting; the special case is known as cooperative distributed problem solving. A key system of this time was the Distributed Vehicle Monitoring Testbed (DVMT), developed under the supervision of Victor Lesser at the University of Massachusetts (Lesser and Corkill, 1988). The DVMT modeled a scenario in which a collection of geographically distributed acoustic sensor agents cooperate to track the movement of vehicles.
The contemporary era of multiagent systems research began in the late 1980s, when it was widely realized that agents with differing preferences are the norm in AI and society. From this point on, game theory began to be established as the main methodology for studying such agents.
Multiagent planning has leaped in popularity in recent years, although it does have a long history. Konolige (1982) formalizes multiagent planning in first-order logic, while Pednault (1986) gives a STRIPS-style description. The notion of joint intention, which is essential if agents are to execute a joint plan, comes from work on communicative acts (Cohen and Perrault, 1979; Cohen and Levesque, 1990; Cohen et al., 1990). Boutilier and Brafman (2001) show how to adapt partial-order planning to a multiactor setting. Brafman and Domshlak (2008) devise a multiactor planning algorithm whose complexity grows only linearly with the number of actors, provided that the degree of coupling (measured partly by the tree width of the graph of interactions among agents) is bounded.
Multiagent planning is hardest when there are adversarial agents.
As Jean-Paul Sartre
(1960) said, “In a football match, everything is complicated by the presence of the other team.” General Dwight D. Eisenhower said, “In preparing for battle I have always found that plans are useless, but planning is indispensable,” meaning that it is important to have a conditional plan or policy, and not to expect an unconditional plan to succeed.
The topic of distributed and multiagent reinforcement learning (RL) was not covered in this chapter but is of great current interest. In distributed RL, the aim is to devise methods by which multiple, coordinated agents learn to optimize a common utility function. For example, can we devise methods whereby separate subagents for robot navigation and robot obstacle avoidance could cooperatively achieve a combined control system that is globally optimal? Some basic results in this direction have been obtained (Guestrin et al., 2002; Russell and Zimdars, 2003). The basic idea is that each subagent learns its own Q-function (a kind of utility function; see Section 22.3.3) from its own stream of rewards. For example, a robot navigation component can receive rewards for making progress towards the goal, while the obstacle-avoidance component receives negative rewards for every collision. Each global decision maximizes the sum of Q-functions and the whole process converges to globally optimal solutions.
The roots of game theory can be traced back to proposals made in the 17th century by
Christiaan Huygens and Gottfried Leibniz to study competitive and cooperative human interactions scientifically and mathematically. Throughout the 19th century, several leading economists created simple mathematical examples to analyze particular examples of competitive situations.
The first formal results in game theory are due to Zermelo (1913) (who had, the year before, suggested a form of minimax search for games, albeit an incorrect one). Émile Borel (1921) introduced the notion of a mixed strategy. John von Neumann (1928) proved that every two-person, zero-sum game has a maximin equilibrium in mixed strategies and a well-defined value. Von Neumann’s collaboration with the economist Oskar Morgenstern led to the publication in 1944 of the Theory of Games and Economic Behavior, the defining book for game theory. Publication of the book was delayed by the wartime paper shortage until a member of the Rockefeller family personally subsidized its publication.
In 1950, at the age of 21, John Nash published his ideas concerning equilibria in general (non-zero-sum) games. His definition of an equilibrium solution, although anticipated in the work of Cournot (1838), became known as Nash equilibrium. After a long delay because of the schizophrenia he suffered from 1959 onward, Nash was awarded the Nobel Memorial Prize in Economics (along with Reinhard Selten and John Harsanyi) in 1994. The Bayes-Nash equilibrium is described by Harsanyi (1967) and discussed by Kadane and Larkey (1982). Some issues in the use of game theory for agent control are covered by Binmore (1982). Aumann and Brandenburger (1995) show how different equilibria can be reached depending on the knowledge each player has.
The prisoner’s dilemma was invented as a classroom exercise by Albert W. Tucker in 1950 (based on an example by Merrill Flood and Melvin Dresher) and is covered extensively by Axelrod (1985) and Poundstone (1993). Repeated games were introduced by Luce and Raiffa (1957), and Abreu and Rubinstein (1988) discuss the use of finite state machines for repeated games (technically, Moore machines). The text by Mailath and Samuelson (2006) concentrates on repeated games.
Games of partial information in extensive form were introduced by Kuhn (1953). The sequence form for partial-information games was invented by Romanovskii (1962) and independently by Koller et al. (1996); the paper by Koller and Pfeffer (1997) provides a readable introduction to the field and describes a system for representing and solving sequential games.
The use of abstraction to reduce a game tree to a size that can be solved with Koller’s technique was introduced by Billings et al. (2003). Subsequently, improved methods for equilibrium-finding enabled solution of abstractions with $10^{12}$ states (Gilpin et al., 2008; Zinkevich et al., 2008). Bowling et al. (2008) show how to use importance sampling to get a better estimate of the value of a strategy. Waugh et al. (2009) found that the abstraction approach is vulnerable to making systematic errors in approximating the equilibrium solution: it works for some games but not others. Brown and Sandholm (2019) showed that, at least in the case of multiplayer Texas hold ’em poker, these vulnerabilities can be overcome by sufficient computing power. They used a 64-core server running for 8 days to compute a baseline strategy for their Pluribus program. With that strategy they were able to defeat human champion opponents.
Game theory and MDPs are combined in the theory of Markov games, also called stochastic games (Littman, 1994; Hu and Wellman, 1998). Shapley (1953b) actually described the value iteration algorithm independently of Bellman, but his results were not widely appreciated, perhaps because they were presented in the context of Markov games. Evolutionary game theory (Smith, 1982; Weibull, 1995) looks at strategy drift over time: if your opponent’s strategy is changing, how should you react?
Textbooks on game theory from an economics point of view include those by Myerson (1991), Fudenberg and Tirole (1991), Osborne (2004), and Osborne and Rubinstein (1994). From an AI perspective we have Nisan et al. (2007) and Leyton-Brown and Shoham (2008). See (Sandholm, 1999) for a useful survey of multiagent decision making.
Multiagent RL is distinguished from distributed RL by the presence of agents who cannot coordinate their actions (except by explicit communicative acts) and who may not share the same utility function. Thus, multiagent RL deals with sequential game-theoretic problems or Markov games, as defined in Chapter 17. What causes problems is the fact that, while an agent is learning to defeat its opponent’s policy, the opponent is changing its policy to defeat the agent. Thus, the environment is nonstationary (see page 444).
Littman (1994) noted this difficulty when introducing the first RL algorithms for zero-sum Markov games. Hu and Wellman (2003) present a Q-learning algorithm for general-sum games that converges when the Nash equilibrium is unique; when there are multiple equilibria, the notion of convergence is not so easy to define (Shoham et al., 2004).
Assistance games were introduced under the heading of cooperative inverse reinforcement learning by Hadfield-Menell et al. (2017a). Malik et al. (2018) introduced an efficient POMDP solver designed specifically for assistance games. Assistance games are related to principal-agent games in economics, in which a principal (e.g., an employer) and an agent (e.g., an employee) need to find a mutually beneficial arrangement despite having widely different preferences. The primary differences are that (1) the robot has no preferences of its own, and (2) the robot is uncertain about the human preferences it needs to optimize.
Cooperative games were first studied by von Neumann and Morgenstern (1944). The notion of the core was introduced by Donald Gillies (1959), and the Shapley value by Lloyd Shapley (1953a). A good introduction to the mathematics of cooperative games is Peleg and Sudhölter (2002). Simple games in general are discussed in detail by Taylor and Zwicker (1999). For an introduction to the computational aspects of cooperative game theory, see Chalkiadakis et al. (2011).
Many compact representation schemes for cooperative games have been developed over the past three decades, starting with the work of Deng and Papadimitriou (1994). The most influential of these schemes is the marginal contribution networks model, which was introduced by Ieong and Shoham (2005). The approach to coalition formation that we describe was developed by Sandholm et al. (1999); Rahwan et al. (2015) survey the state of the art.
The contract net protocol was introduced by Reid Smith for his PhD work at Stanford University in the late 1970s (Smith, 1980). The protocol seems to be so natural that it is regularly reinvented to the present day. The economic foundations of the protocol were studied by Sandholm (1993).
Auctions and mechanism design have been mainstream topics in computer science and AI for several decades: see Nisan (2007) for a mainstream computer science perspective, Krishna (2002) for an introduction to the theory of auctions, and Cramton et al. (2006) for a collection of articles on computational aspects of auctions.
The 2007 Nobel Memorial Prize in Economics went to Hurwicz, Maskin, and Myerson
“for having laid the foundations of mechanism design theory” (Hurwicz, 1973). The tragedy of the commons, a motivating problem for the field, was analyzed by William Lloyd (1833) but named and brought to public attention by Garrett Hardin (1968). Ronald Coase presented
a theorem that if resources are subject to private ownership and if transaction costs are low
enough, then the resources will be managed efficiently (Coase, 1960). He points out that, in practice, transaction costs are high, so this theorem does not apply, and we should look to other solutions beyond privatization and the marketplace. Elinor Ostrom’s Governing the Commons (1990) described solutions for the problem based on placing management control over the resources into the hands of the local people who have the most knowledge of the situation. Both Coase and Ostrom won the Nobel Prize in economics for their work.
The revelation principle is due to Myerson (1986), and the revenue equivalence theorem was developed independently by Myerson (1981) and Riley and Samuelson (1981). Two economists, Milgrom (1997) and Klemperer (2002), write about the multibillion-dollar spectrum auctions they were involved in.
Mechanism design is used in multiagent planning (Hunsberger and Grosz, 2000; Stone et al., 2009) and scheduling (Rassenti et al., 1982). Varian (1995) gives a brief overview with connections to the computer science literature, and Rosenschein and Zlotkin (1994) present a book-length treatment with applications to distributed AI. Related work on distributed AI goes under several names, including collective intelligence (Tumer and Wolpert, 2000; Segaran, 2007) and market-based control (Clearwater, 1996). Since 2001 there has been an annual Trading Agents Competition (TAC), in which agents try to make the best profit on a series of auctions (Wellman et al., 2001; Arunachalam and Sadeh, 2005).
The social choice literature is enormous, and spans the gulf from philosophical considerations on the nature of democracy through to highly technical analyses of specific voting procedures. Campbell and Kelly (2002) provide a good starting point for this literature. The Handbook of Computational Social Choice provides a range of articles surveying research topics and methods in this field (Brandt et al., 2016). Arrow’s theorem lists desired properties of a voting system and proves that it is impossible to achieve all of them (Arrow, 1951). Dasgupta and Maskin (2008) show that majority rule (not plurality rule, and not ranked choice voting) is the most robust voting system. The computational complexity of manipulating elections was first studied by Bartholdi et al. (1989).
We have barely skimmed the surface of work on negotiation in multiagent planning. Durfee and Lesser (1989) discuss how tasks can be shared out among agents by negotiation. Kraus et al. (1991) describe a system for playing Diplomacy, a board game requiring negotiation, coalition formation, and dishonesty. Stone (2000) shows how agents can cooperate as teammates in the competitive, dynamic, partially observable environment of robotic soccer. In a later article, Stone (2003) analyzes two competitive multiagent environments (RoboCup, a robotic soccer competition, and TAC, the auction-based Trading Agents Competition) and finds that the computational intractability of our current theoretically well-founded approaches has led to many multiagent systems being designed by ad hoc methods. Sarit Kraus has developed a number of agents that can negotiate with humans and other agents; see Kraus (2001) for a survey. The monotonic concession protocol for automated negotiation was proposed by Jeffrey S. Rosenschein and his students (Rosenschein and Zlotkin, 1994). The alternating offers protocol was developed by Rubinstein (1982).
Books on multiagent systems include those by Weiss (2000a), Young (2004), Vlassis (2008), Shoham and Leyton-Brown (2009), and Wooldridge (2009). The primary conference for multiagent systems is the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS); there is also a journal by the same name. The ACM Conference on Electronic Commerce (EC) also publishes many relevant papers, particularly in the area of auction algorithms. The principal journal for game theory is Games and Economic Behavior.
19
LEARNING FROM EXAMPLES
In which we describe agents that can improve their behavior through diligent study of past experiences and predictions about the future.
An agent is learning if it improves its performance after making observations about the world. Learning can range from the trivial, such as jotting down a shopping list, to the profound, as when Albert Einstein inferred a new theory of the universe. When the agent is a computer, we call it machine learning: a computer observes some data, builds a model based on the data, and uses the model as both a hypothesis about the world and a piece of software that can solve problems.
Why would we want a machine to learn? Why not just program it the right way to begin with? There are two main reasons. First, the designers cannot anticipate all possible future situations. For example, a robot designed to navigate mazes must learn the layout of each new maze it encounters; a program for predicting stock market prices must learn to adapt when conditions change from boom to bust. Second, sometimes the designers have no idea how to program a solution themselves. Most people are good at recognizing the faces of family members, but they do it subconsciously, so even the best programmers don’t know how to program a computer to accomplish that task, except by using machine learning algorithms.
In this chapter, we interleave a discussion of various model classes, namely decision trees (Section 19.3), linear models (Section 19.6), nonparametric models such as nearest neighbors (Section 19.7), and ensemble models such as random forests (Section 19.8), with practical advice on building machine learning systems (Section 19.9), and discussion of the theory of machine learning (Sections 19.1 to 19.5).
19.1
Forms of Learning
Any component of an agent program can be improved by machine learning. The improvements, and the techniques used to make them, depend on these factors:
• Which component is to be improved.
• What prior knowledge the agent has, which influences the model it builds.
• What data and feedback on that data is available.
Chapter 2 described several agent designs. The components of these agents include:
1. A direct mapping from conditions on the current state to actions.
2. A means to infer relevant properties of the world from the percept sequence.
3. Information about the way the world evolves and about the results of possible actions the agent can take.
4. Utility information indicating the desirability of world states.
5. Action-value information indicating the desirability of actions.
6. Goals that describe the most desirable states.
7. A problem generator, critic, and learning element that enable the system to improve.
Each of these components can be learned. Consider a self-driving car agent that learns by observing a human driver. Every time the driver brakes, the agent might learn a condition-action rule for when to brake (component 1). By seeing many camera images that it is told contain buses, it can learn to recognize them (component 2). By trying actions and observing the results (for example, braking hard on a wet road) it can learn the effects of its actions (component 3). Then, when it receives complaints from passengers who have been thoroughly shaken up during the trip, it can learn a useful component of its overall utility function (component 4).
The technology of machine learning has become a standard part of software engineering.
Any time you are building a software system, even if you don’t think of it as an AI agent,
components of the system can potentially be improved with machine learning. For example,
software to analyze images of galaxies under gravitational lensing was speeded up by a factor
of 10 million with a machine-learned model (Hezaveh et al., 2017), and energy use for cooling data centers was reduced by 40% with another machine-learned model (Gao, 2014). Turing Award winner David Patterson and Google AI head Jeff Dean declared the dawn of a “Golden
Age” for computer architecture due to machine learning (Dean et al., 2018).
We have seen several examples of models for agent components: atomic, factored, and relational models based on logic or probability, and so on. Learning algorithms have been devised for all of these.
This chapter assumes little prior knowledge on the part of the agent: it starts from scratch and learns from the data. In Section 21.7.2 we consider transfer learning, in which knowledge from one domain is transferred to a new domain, so that learning can proceed faster with less data. We do assume, however, that the designer of the system chooses a model framework that can lead to effective learning.
Going from a specific set of observations to a general rule is called induction; from the
observations that the sun rose every day in the past, we induce that the sun will come up tomorrow.
This differs from the deduction we studied in Chapter 7 because the inductive
conclusions may be incorrect, whereas deductive conclusions are guaranteed to be correct if
the premises are correct. This chapter concentrates on problems where the input is a factored representation—a vector of attribute values.
It is also possible for the input to be any kind of data structure,
including atomic and relational.
When the output is one of a finite set of values (such as sunny/cloudy/rainy or true/false), the learning problem is called classification. When it is a number (such as tomorrow’s temperature, measured either as an integer or a real number), the learning problem has the (admittedly obscure¹) name regression.
¹ A better name would have been function approximation or numeric prediction. But in 1886 Francis Galton wrote an influential article on the concept of regression to the mean (e.g., the children of tall parents are likely to be taller than average, but not as tall as the parents). Galton showed plots with what he called “regression lines,” and readers came to associate the word “regression” with the statistical technique of function approximation rather than with the topic of regression to the mean.
There are three types of feedback that can accompany the inputs, and that determine the three main types of learning:
• In supervised learning the agent observes input-output pairs and learns a function that maps from input to output. For example, the inputs could be camera images, each one accompanied by an output saying “bus” or “pedestrian,” etc. An output like this is called a label. The agent learns a function that, when given a new image, predicts the appropriate label. In the case of braking actions (component 1 above), an input is the current state (speed and direction of the car, road condition), and an output is the distance it took to stop. In this case a set of output values can be obtained by the agent from its own percepts (after the fact); the environment is the teacher, and the agent learns a function that maps states to stopping distance.
• In unsupervised learning the agent learns patterns in the input without any explicit feedback. The most common unsupervised learning task is clustering: detecting potentially useful clusters of input examples. For example, when shown millions of images taken from the Internet, a computer vision system can identify a large cluster of similar images which an English speaker would call “cats.”
• In reinforcement learning the agent learns from a series of reinforcements: rewards and punishments. For example, at the end of a chess game the agent is told that it has won (a reward) or lost (a punishment). It is up to the agent to decide which of the actions prior to the reinforcement were most responsible for it, and to alter its actions to aim towards more rewards in the future.
19.2 Supervised Learning
More formally, the task of supervised learning is this: Given a training set of N example input-output pairs

\[ (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N), \]

where each pair was generated by an unknown function \(y = f(x)\), discover a function \(h\) that approximates the true function \(f\).

The function \(h\) is called a hypothesis about the world. It is drawn from a hypothesis space \(\mathcal{H}\) of possible functions. For example, the hypothesis space might be the set of polynomials of degree 3; or the set of JavaScript functions; or the set of 3-SAT Boolean logic formulas.

With alternative vocabulary, we can say that \(h\) is a model of the data, drawn from a model class \(\mathcal{H}\), or we can say a function drawn from a function class. We call the output \(y_i\) the ground truth: the true answer we are asking our model to predict.

How do we choose a hypothesis space? We might have some prior knowledge about the process that generated the data. If not, we can perform exploratory data analysis: examining the data with statistical tests and visualizations (histograms, scatter plots, box plots) to get a feel for the data, and some insight into what hypothesis space might be appropriate. Or we can just try multiple hypothesis spaces and evaluate which one works best.

How do we choose a good hypothesis from within the hypothesis space? We could hope for a consistent hypothesis: an \(h\) such that each \(x_i\) in the training set has \(h(x_i) = y_i\). With continuous-valued outputs we can't expect an exact match to the ground truth; instead we look for a best-fit function for which each \(h(x_i)\) is close to \(y_i\) (in a way that we will formalize in Section 19.4.2).

Figure 19.1 Finding hypotheses to fit data. Top row: four plots of best-fit functions from four different hypothesis spaces trained on data set 1. Bottom row: the same four functions, but trained on a slightly different data set (sampled from the same f(x) function). [Panels: Linear, Sinusoidal, Piecewise linear, Degree-12 polynomial.]
The true measure of a hypothesis is not how it does on the training set, but rather how well it handles inputs it has not yet seen. We can evaluate that with a second sample of \((x_i, y_i)\) pairs called a test set. We say that \(h\) generalizes well if it accurately predicts the outputs of the test set.

Figure 19.1 shows that the function \(h\) that a learning algorithm discovers depends on the hypothesis space \(\mathcal{H}\) it considers and on the training set it is given. Each of the four plots in the top row has the same training set of 13 data points in the (x, y) plane. The four plots in the bottom row have a second set of 13 data points; both sets are representative of the same unknown function f(x). Each column shows the best-fit hypothesis \(h\) from a different hypothesis space:
• Column 1: Straight lines; functions of the form \(h(x) = w_1 x + w_0\). There is no line that would be a consistent hypothesis for the data points.
• Column 2: Sinusoidal functions of the form \(h(x) = w_1 x + \sin(w_0 x)\). This choice is not quite consistent, but fits both data sets very well.
• Column 3: Piecewise-linear functions where each line segment connects the dots from one data point to the next. These functions are always consistent.
• Column 4: Degree-12 polynomials, \(h(x) = \sum_{i=0}^{12} w_i x^i\). These are consistent: we can always get a degree-12 polynomial to perfectly fit 13 distinct points. But just because the hypothesis is consistent does not mean it is a good guess.

One way to analyze hypothesis spaces is by the bias they impose (regardless of the training data set) and the variance they produce (from one training set to another).

By bias we mean (loosely) the tendency of a predictive hypothesis to deviate from the expected value when averaged over different training sets. Bias often results from restrictions imposed by the hypothesis space. For example, the hypothesis space of linear functions induces a strong bias: it only allows functions consisting of straight lines. If there are any patterns in the data other than the overall slope of a line, a linear function will not be able to represent those patterns. We say that a hypothesis is underfitting when it fails to find a pattern in the data. On the other hand, the piecewise linear function has low bias; the shape of the function is driven by the data.
By variance we mean the amount of change in the hypothesis due to fluctuation in the training data. The two rows of Figure 19.1 represent data sets that were each sampled from the same f(x) function. The data sets turned out to be slightly different. For the first three columns, the small difference in the data set translates into a small difference in the hypothesis. We call that low variance. But the degree-12 polynomials in the fourth column have high variance: look how different the two functions are at both ends of the x-axis. Clearly, at least one of these polynomials must be a poor approximation to the true f(x). We say a function is overfitting the data when it pays too much attention to the particular data set it is trained on, causing it to perform poorly on unseen data.

Often there is a bias-variance tradeoff: a choice between more complex, low-bias hypotheses that fit the training data well and simpler, low-variance hypotheses that may generalize better.
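The contrast between a high-bias, low-variance hypothesis space and a low-bias, high-variance one can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `true_f`, the 13 evenly spaced inputs, and the alternating ±0.1 offsets (a deterministic stand-in for noise) are all hypothetical and are not the data behind Figure 19.1. The sketch fits a least-squares line and a degree-12 interpolating polynomial to two slightly different data sets and compares how much each hypothesis changes near the left end of the x-axis.

```python
def fit_line(xs, ys):
    """Least-squares straight line h(x) = w1*x + w0 (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    w0 = my - w1 * mx
    return lambda x: w1 * x + w0


def fit_interpolating_poly(xs, ys):
    """Degree-(n-1) polynomial through n distinct points (Lagrange form)."""
    def h(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for k, xk in enumerate(xs):
                if k != i:
                    term *= (x - xk) / (xi - xk)
            total += term
        return total
    return h


def true_f(x):
    """Hypothetical 'unknown' function that generated both data sets."""
    return 0.5 * x + 2.0


# Two slightly different data sets of 13 points each (assumed, deterministic).
xs = [float(i) for i in range(13)]
ys1 = [true_f(x) + 0.1 * (-1) ** i for i, x in enumerate(xs)]
ys2 = [true_f(x) - 0.1 * (-1) ** i for i, x in enumerate(xs)]

line1, line2 = fit_line(xs, ys1), fit_line(xs, ys2)
poly1, poly2 = fit_interpolating_poly(xs, ys1), fit_interpolating_poly(xs, ys2)

# Training error: the polynomial is consistent, the line is not.
train_err_line = max(abs(line1(x) - y) for x, y in zip(xs, ys1))
train_err_poly = max(abs(poly1(x) - y) for x, y in zip(xs, ys1))

# Variance: how much does each hypothesis move between the two data sets?
probe = 0.5  # a point near the left end of the x-axis
line_change = abs(line1(probe) - line2(probe))
poly_change = abs(poly1(probe) - poly2(probe))

print(f"training error  line={train_err_line:.3f}  poly={train_err_poly:.3f}")
print(f"change at x=0.5 line={line_change:.3f}  poly={poly_change:.3f}")
```

In this sketch the polynomial matches every training point exactly, yet a shift of 0.2 in the outputs moves its prediction near the end of the range by far more than 0.2, while the line barely moves.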
Albert Einstein said in 1933, "the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience." In other words, Einstein recommends choosing the simplest hypothesis that matches the data. This principle can be traced further back to the 14th-century English philosopher William of Ockham.² His principle that "plurality [of entities] should not be posited without necessity" is called Ockham's razor because it is used to "shave off" dubious explanations.
2 The name is often misspelled as "Occam."

Defining simplicity is not easy. It seems clear that a polynomial with only two parameters is simpler than one with thirteen parameters. We will make this intuition more precise in Section 19.3.4. However, in Chapter 21 we will see that deep neural network models can often generalize quite well, even though they are very complex; some of them have billions of parameters. So the number of parameters by itself is not a good measure of a model's fitness. Perhaps we should be aiming for "appropriateness," not "simplicity" in a model class. We will consider this issue in Section 19.4.1.

Which hypothesis is best in Figure 19.1? We can't be certain. If we knew the data represented, say, the number of hits to a Web site that grows from day to day, but also cycles depending on the time of day, then we might favor the sinusoidal function. If we knew the data was definitely not cyclic but had high noise, that would favor the linear function.

In some cases, an analyst is willing to say not just that a hypothesis is possible or impossible, but rather how probable it is. Supervised learning can be done by choosing the hypothesis \(h^*\) that is most probable given the data:

\[ h^* = \operatorname*{argmax}_{h \in \mathcal{H}} P(h \mid \mathit{data}). \]

By Bayes' rule this is equivalent to

\[ h^* = \operatorname*{argmax}_{h \in \mathcal{H}} P(\mathit{data} \mid h)\, P(h). \]

Then we can say that the prior probability \(P(h)\) is high for a smooth degree-1 or -2 polynomial and lower for a degree-12 polynomial with large, sharp spikes. We allow unusual-looking functions when the data say we really need them, but we discourage them by giving them a low prior probability.

Why not let \(\mathcal{H}\) be the class of all computer programs, or all Turing machines? The problem is that there is a tradeoff between the expressiveness of a hypothesis space and the computational complexity of finding a good hypothesis within that space. For example, fitting a straight line to data is an easy computation; fitting high-degree polynomials is somewhat harder; and fitting Turing machines is undecidable. A second reason to prefer simple hypothesis spaces is that presumably we will want to use \(h\) after we have learned it, and computing \(h(x)\) when \(h\) is a linear function is guaranteed to be fast, while computing an arbitrary Turing machine program is not even guaranteed to terminate.

For these reasons, most work on learning has focused on simple representations. In recent years there has been great interest in deep learning (Chapter 21), where representations are not simple but where the \(h(x)\) computation still takes only a bounded number of steps to compute with appropriate hardware.

We will see that the expressiveness-complexity tradeoff is not simple: it is often the case, as we saw with first-order logic in Chapter 8, that an expressive language makes it possible for a simple hypothesis to fit the data, whereas restricting the expressiveness of the language
means that any consistent hypothesis must be complex.

19.2.1 Example problem: Restaurant waiting

We will describe a sample supervised learning problem in detail: the problem of deciding whether to wait for a table at a restaurant. This problem will be used throughout the chapter to demonstrate different model classes. For this problem the output, y, is a Boolean variable that we will call WillWait; it is true for examples where we do wait for a table. The input, x, is a vector of ten attribute values, each of which has discrete values:
1. Alternate: whether there is a suitable alternative restaurant nearby.
2. Bar: whether the restaurant has a comfortable bar area to wait in.
3. Fri/Sat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry right now.
5. Patrons: how many people are in the restaurant (values are None, Some, and Full).
6. Price: the restaurant's price range ($, $$, $$$).
7. Raining: whether it is raining outside.
8. Reservation: whether we made a reservation.
9. Type: the kind of restaurant (French, Italian, Thai, or burger).
10. WaitEstimate: host's wait estimate: 0-10, 10-30, 30-60, or >60 minutes.

A set of 12 examples, taken from the experience of one of us (SR), is shown in Figure 19.2. Note how skimpy these data are: there are \(2^6 \times 3^2 \times 4^2 = 9{,}216\) possible combinations of values for the input attributes, but we are given the correct output for only 12 of them; each of the other 9,204 could be either true or false; we don't know. This is the essence of induction: we need to make our best guess at these missing 9,204 output values, given only the evidence of the 12 examples.
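The count of 9,216 follows directly from the sizes of the ten attribute domains listed above. A small check in Python, where the dictionary name `domain_sizes` is just an illustrative choice:

```python
from math import prod

# Number of possible values for each of the ten input attributes.
domain_sizes = {
    "Alternate": 2, "Bar": 2, "Fri/Sat": 2, "Hungry": 2,
    "Patrons": 3, "Price": 3,
    "Raining": 2, "Reservation": 2,
    "Type": 4, "WaitEstimate": 4,
}

total_combinations = prod(domain_sizes.values())  # 2^6 * 3^2 * 4^2
unseen = total_combinations - 12                  # inputs with no known output

print(total_combinations, unseen)
```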
Figure 19.2 Examples for the restaurant domain.

19.3 Learning Decision Trees
A decision tree is a representation of a function that maps a vector of attribute values to a single output value, a "decision." A decision tree reaches its decision by performing a sequence of tests, starting at the root and following the appropriate branch until a leaf is reached. Each internal node in the tree corresponds to a test of the value of one of the input attributes, the branches from the node are labeled with the possible values of the attribute, and the leaf nodes specify what value is to be returned by the function.

In general, the input and output values can be discrete or continuous, but for now we will consider only inputs consisting of discrete values and outputs that are either true (a positive example) or false (a negative example). We call this Boolean classification. We will use j to index the examples (\(\mathbf{x}_j\) is the input vector for the jth example and \(y_j\) is the output), and \(x_{j,i}\) for the ith attribute of the jth example.

The tree representing the decision function that SR uses for the restaurant problem is shown in Figure 19.3. Following the branches, we see that an example with Patrons = Full and WaitEstimate = 0-10 will be classified as positive (i.e., yes, we will wait for a table).

19.3.1 Expressiveness of decision trees
A Boolean decision tree is equivalent to a logical statement of the form:
Output
A +A; is hard to represent
with a decision tree because the decision boundary is a diagonal line, and all decision tree tests divide the space up into rectangular, axis-aligned boxes. We would have to stack a lot of boxes to closely approximate the diagonal line. In other words, decision trees are good for some kinds of functions and bad for others.

Is there any kind of representation that is efficient for all kinds of functions? Unfortunately, the answer is no: there are just too many functions to be able to represent them all with a small number of bits. Even just considering Boolean functions with n Boolean attributes, the truth table will have \(2^n\) rows, and each row can output true or false, so there are \(2^{2^n}\) different functions. With 20 attributes there are \(2^{1{,}048{,}576} \approx 10^{300{,}000}\) functions, so if we limit ourselves to a million-bit representation, we can't represent all these functions.
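The counting argument can be checked numerically. The sketch below (function names are illustrative) computes the exact count for small n and, for n = 20, the base-10 logarithm of the count, since the number itself has hundreds of thousands of digits. The exponent comes out a little above 300,000, in line with the rounded figure in the text.

```python
import math


def count_boolean_functions(n):
    """Exact number of Boolean functions of n Boolean attributes: 2^(2^n)."""
    return 2 ** (2 ** n)


def log10_count(n):
    """log10 of 2^(2^n), computed without building the huge integer."""
    return (2 ** n) * math.log10(2)


rows_20 = 2 ** 20                # truth-table rows for 20 attributes
exponent_20 = log10_count(20)    # the count is about 10 ** exponent_20

# A million-bit representation can name at most 2^1,000,000 distinct functions,
# which is fewer than 2^(2^20) because 2^20 > 1,000,000.
too_many = rows_20 > 1_000_000

print(count_boolean_functions(2), rows_20, round(exponent_20), too_many)
```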
19.3.2 Learning decision trees from examples

We want to find a tree that is consistent with the examples in Figure 19.2 and is as small as possible. Unfortunately, it is intractable to find a guaranteed smallest consistent tree. But with some simple heuristics, we can efficiently find one that is close to the smallest. The LEARN-DECISION-TREE algorithm adopts a greedy divide-and-conquer strategy: always test the most important attribute first, then recursively solve the smaller subproblems that are defined by the possible results of the test. By "most important attribute," we mean the one that makes the most difference to the classification of an example. That way, we hope to get to the correct classification with a small number of tests, meaning that all paths in the tree will be short and the tree as a whole will be shallow.
Figure 19.4(a) shows that Type is a poor attribute, because it leaves us with four possible outcomes, each of which has the same number of positive as negative examples. On the other hand, in (b) we see that Patrons is a fairly important attribute, because if the value is None or Some, then we are left with example sets for which we can answer definitively (No and Yes, respectively). If the value is Full, we are left with a mixed set of examples.

Figure 19.3 A decision tree for deciding whether to wait for a table.

Figure 19.4 Splitting the examples by testing on attributes. At each node we show the positive (light boxes) and negative (dark boxes) examples remaining. (a) Splitting on Type brings us no nearer to distinguishing between positive and negative examples. (b) Splitting on Patrons does a good job of separating positive and negative examples. After splitting on Patrons, Hungry is a fairly good second test.

There are four cases to consider for these recursive subproblems:
1. If the remaining examples are all positive (or all negative), then we are done: we can answer Yes or No. Figure 19.4(b) shows examples of this happening in the None and Some branches.
2. If there are some positive and some negative examples, then choose the best attribute to split them. Figure 19.4(b) shows Hungry being used to split the remaining examples.
3. If there are no examples left, it means that no example has been observed for this combination of attribute values, and we return the most common output value from the set of examples that were used in constructing the node's parent.
4. If there are no attributes left, but both positive and negative examples, it means that these examples have exactly the same description, but different classifications. This can happen because there is an error or noise in the data; because the domain is nondeterministic; or because we can't observe an attribute that would distinguish the examples. The best we can do is return the most common output value of the remaining examples.

The LEARN-DECISION-TREE algorithm is shown in Figure 19.5. Note that the set of examples is an input to the algorithm, but nowhere do the examples appear in the tree returned by the algorithm. A tree consists of tests on attributes in the interior nodes, values of attributes on the branches, and output values on the leaf nodes. The details of the IMPORTANCE function are given in Section 19.3.3. The output of the learning algorithm on our sample training set is shown in Figure 19.6. The tree is clearly different from the original tree shown in Figure 19.3.
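The four cases above map onto a short recursive procedure. What follows is a minimal Python sketch, not the book's implementation: the nested-dictionary tree representation, the `domains` argument, and the six-row data set are hypothetical, and `importance` is assumed here to be information gain (entropy reduction), since the actual IMPORTANCE function is specified in Section 19.3.3.

```python
import math
from collections import Counter

TARGET = "WillWait"  # name of the output attribute in each example dict


def plurality_value(examples):
    """Most common output value among the examples."""
    return Counter(e[TARGET] for e in examples).most_common(1)[0][0]


def entropy(examples):
    n = len(examples)
    counts = Counter(e[TARGET] for e in examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def importance(attr, examples):
    """Assumed stand-in for IMPORTANCE: information gain of splitting on attr."""
    n = len(examples)
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder


def learn_decision_tree(examples, attributes, parent_examples, domains):
    if not examples:                                  # case 3: no examples left
        return plurality_value(parent_examples)
    classes = {e[TARGET] for e in examples}
    if len(classes) == 1:                             # case 1: all same class
        return classes.pop()
    if not attributes:                                # case 4: no attributes left
        return plurality_value(examples)
    best = max(attributes, key=lambda a: importance(a, examples))  # case 2
    tree = {"test": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for v in domains[best]:
        exs = [e for e in examples if e[best] == v]
        tree["branches"][v] = learn_decision_tree(exs, remaining, examples, domains)
    return tree


def classify(tree, example):
    """Follow branches from the root until a leaf (a plain output value)."""
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["test"]]]
    return tree


# Hypothetical toy data (not the 12 examples of Figure 19.2).
domains = {"Patrons": ["None", "Some", "Full"], "Hungry": [True, False]}
examples = [
    {"Patrons": "Some", "Hungry": True, "WillWait": True},
    {"Patrons": "Some", "Hungry": False, "WillWait": True},
    {"Patrons": "None", "Hungry": False, "WillWait": False},
    {"Patrons": "None", "Hungry": True, "WillWait": False},
    {"Patrons": "Full", "Hungry": True, "WillWait": True},
    {"Patrons": "Full", "Hungry": False, "WillWait": False},
]
tree = learn_decision_tree(examples, ["Patrons", "Hungry"], examples, domains)
print(tree)
```

On this toy data the learner tests Patrons at the root and only needs Hungry in the Full branch, mirroring the pattern described for Figure 19.4(b). As the text notes, the returned tree holds only tests, branch values, and leaf outputs; the examples themselves do not appear in it.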
function LEARN-DECISION-TREE(examples, attributes, parent_examples) returns a tree
  if examples is empty then return PLURALITY-VALUE(parent_examples)
  else if all examples have the same classification then return the classification
  else if attributes is empty then return PLURALITY-VALUE(examples)
  else
    A ← argmax_{a ∈ attributes} IMPORTANCE(a, examples)
    tree ← a new decision tree with root test A
    for each value v of A do
      exs

0, while earthquakes have \(-4.9 + 1.7x_1 - x_2 < 0\). We can make the equation easier to deal with by changing it into the vector dot product form: with \(x_0 = 1\) we have \(-4.9x_0 + 1.7x_1 - x_2 = 0\), and we can define the vector of weights, \(\mathbf{w} = \langle -4.9, 1.7, -1 \rangle\), and write the classification hypothesis \(h_{\mathbf{w}}(\mathbf{x}) = 1\) if \(\mathbf{w} \cdot \mathbf{x} > 0\) and 0 otherwise.
Alternatively, we can think of \(h\) as the result of passing the linear function \(\mathbf{w} \cdot \mathbf{x}\) through a threshold function:

\[ h_{\mathbf{w}}(\mathbf{x}) = \mathit{Threshold}(\mathbf{w} \cdot \mathbf{x}) \quad \text{where } \mathit{Threshold}(z) = 1 \text{ if } z > 0 \text{ and } 0 \text{ otherwise.} \]

The threshold function is shown in Figure 19.17(a).

Now that the hypothesis \(h_{\mathbf{w}}(\mathbf{x})\) has a well-defined mathematical form, we can think about choosing the weights \(\mathbf{w}\) to minimize the loss. In Sections 19.6.1 and 19.6.3, we did this both in closed form (by setting the gradient to zero and solving for the weights) and by gradient descent in weight space. Here we cannot do either of those things because the gradient is zero almost everywhere in weight space except at those points where \(\mathbf{w} \cdot \mathbf{x} = 0\), and at those points the gradient is undefined.

There is, however, a simple weight update rule that converges to a solution (that is, to a linear separator that classifies the data perfectly) provided the data are linearly separable. For a single example \((\mathbf{x}, y)\), we have

\[ w_i \leftarrow w_i + \alpha\,(y - h_{\mathbf{w}}(\mathbf{x})) \times x_i \tag{19.8} \]

which is essentially identical to Equation (19.6), the update rule for linear regression! This rule is called the perceptron learning rule, for reasons that will become clear in Chapter 21. Because we are considering a 0/1 classification problem, however, the behavior is somewhat different. Both the true value \(y\) and the hypothesis output \(h_{\mathbf{w}}(\mathbf{x})\) are either 0 or 1, so there are three possibilities:
• If the output is correct (i.e., \(y = h_{\mathbf{w}}(\mathbf{x})\)) then the weights are not changed.
• If \(y\) is 1 but \(h_{\mathbf{w}}(\mathbf{x})\) is 0, then \(w_i\) is increased when the corresponding input \(x_i\) is positive and decreased when \(x_i\) is negative. This makes sense, because we want to make \(\mathbf{w} \cdot \mathbf{x}\) bigger so that \(h_{\mathbf{w}}(\mathbf{x})\) outputs a 1.
• If \(y\) is 0 but \(h_{\mathbf{w}}(\mathbf{x})\) is 1, then \(w_i\) is decreased when the corresponding input \(x_i\) is positive and increased when \(x_i\) is negative. This makes sense, because we want to make \(\mathbf{w} \cdot \mathbf{x}\) smaller so that \(h_{\mathbf{w}}(\mathbf{x})\) outputs a 0.

Typically the learning rule is applied one example at a time, choosing examples at random (as in stochastic gradient descent). Figure 19.16(a) shows a training curve for this learning rule applied to the earthquake/explosion data shown in Figure 19.15(a). A training curve measures the classifier performance on a fixed training set as the learning process proceeds one update at a time on that training set. The curve shows the update rule converging to a zero-error linear separator. The "convergence" process isn't exactly pretty, but it always works. This particular run takes 657 steps to converge, for a data set with 63 examples, so each example is presented roughly 10 times on average. Typically, the variation across runs is large.
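The update rule in Equation (19.8) is short enough to sketch directly. The Python below is an illustration under stated assumptions: the eight training points are hypothetical (they are not the earthquake/explosion data), each input carries a dummy first component \(x_0 = 1\), the learning rate `alpha` defaults to 1, and examples are visited in a freshly shuffled order on each pass rather than drawn independently at random.

```python
import random


def threshold(z):
    return 1 if z > 0 else 0


def predict(w, x):
    """h_w(x) = Threshold(w . x)."""
    return threshold(sum(wi * xi for wi, xi in zip(w, x)))


def perceptron_train(data, alpha=1.0, max_epochs=1000, seed=0):
    """Apply w_i <- w_i + alpha * (y - h_w(x)) * x_i until a pass has no errors."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    updates = 0
    for _ in range(max_epochs):
        order = data[:]
        rng.shuffle(order)
        mistakes = 0
        for x, y in order:
            err = y - predict(w, x)      # 0 if correct, +1 or -1 otherwise
            if err != 0:
                w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
                mistakes += 1
        updates += mistakes
        if mistakes == 0:                # zero-error linear separator found
            break
    return w, updates


# Hypothetical linearly separable data: (x0=1, x1, x2) -> class 0 or 1.
data = [
    ((1.0, 2.0, 3.0), 1), ((1.0, 3.0, 3.0), 1),
    ((1.0, 4.0, 2.0), 1), ((1.0, 3.0, 4.0), 1),
    ((1.0, 0.0, 0.0), 0), ((1.0, 1.0, 0.0), 0),
    ((1.0, 0.0, 1.0), 0), ((1.0, 1.0, 1.0), 0),
]
w, updates = perceptron_train(data)
print("weights:", w, "updates:", updates)
```

Because the weights change only on misclassified examples, the three cases in the list above correspond to `err` being 0, +1, and -1 respectively.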
[Figure: training curves plotted against the number of weight updates.]
The solution for \(\theta\) is the same as before. The solution for \(\theta_1\), the probability that a cherry candy has a red wrapper, is the observed fraction of cherry candies with red wrappers, and similarly for \(\theta_2\).

These results are very comforting, and it is easy to see that they can be extended to any Bayesian network whose conditional probabilities are represented as tables. The most important point is that with complete data, the maximum-likelihood parameter learning problem for a Bayesian network decomposes into separate learning problems, one for each parameter. (See Exercise 20.NORX for the nontabulated case, where each parameter affects several conditional probabilities.) The second point is that the parameter values for a variable, given its parents, are just the observed frequencies of the variable values for each setting of the parent values. As before, we must be careful to avoid zeroes when the data set is small.
20.2.2 Naive Bayes models

Probably the most common Bayesian network model used in machine learning is the naive Bayes model first introduced on page 402. In this model, the "class" variable \(C\) (which is to be predicted) is the root and the "attribute" variables \(X_i\) are the leaves. The model is "naive" because it assumes that the attributes are conditionally independent of each other, given the class. (The model in Figure 20.2(b) is a naive Bayes model with class Flavor and just one attribute, Wrapper.) In the case of Boolean variables, the parameters are

\[ \theta = P(C = \mathit{true}), \quad \theta_{i1} = P(X_i = \mathit{true} \mid C = \mathit{true}), \quad \theta_{i2} = P(X_i = \mathit{true} \mid C = \mathit{false}). \]

The maximum-likelihood parameter values are found in exactly the same way as in Figure 20.2(b). Once the model has been trained in this way, it can be used to classify new examples for which the class variable \(C\) is unobserved. With observed attribute values \(x_1, \ldots, x_n\), the probability of each class is given by

\[ \mathbf{P}(C \mid x_1, \ldots, x_n) = \alpha\, \mathbf{P}(C) \prod_i \mathbf{P}(x_i \mid C). \]

A deterministic prediction can be obtained by choosing the most likely class. Figure 20.3 shows the learning curve for this method when it is applied to the restaurant problem from Chapter 19. The method learns fairly well but not as well as decision tree learning; this is presumably because the true hypothesis, which is a decision tree, is not representable exactly using a naive Bayes model. Naive Bayes learning turns out to do surprisingly well in a wide range of applications; the boosted version (Exercise 20.BNBX) is one of the most effective general-purpose learning algorithms. Naive Bayes learning scales well to very large problems: with n Boolean attributes, there are just \(2n + 1\) parameters, and no search is required to find \(h_{\mathrm{ML}}\), the maximum-likelihood naive Bayes hypothesis. Finally, naive Bayes learning systems deal well with noisy or missing data and can give probabilistic predictions when appropriate. Their primary drawback is the fact that the conditional independence assumption is seldom accurate; as noted on page 403, the assumption leads to overconfident probabilities that are often very close to 0 or 1, especially with large numbers of attributes.
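For Boolean attributes, both training and classification reduce to counting and multiplying. The following Python sketch uses a hypothetical eight-example data set with two attributes; the function names are illustrative. It estimates the \(2n + 1\) parameters as observed frequencies and then applies the product formula above, with the normalization \(\alpha\) computed explicitly. No smoothing is applied, so the data are chosen so that both classes occur; as the text warns, real data sets need care to avoid zero counts.

```python
def train_naive_bayes(examples):
    """Maximum-likelihood parameters from (attribute_tuple, class) pairs.

    Returns (theta, theta_i1, theta_i2):
      theta       = P(C = true)
      theta_i1[i] = P(X_i = true | C = true)
      theta_i2[i] = P(X_i = true | C = false)
    """
    n = len(examples)
    k = len(examples[0][0])
    n_true = sum(1 for _, c in examples if c)
    n_false = n - n_true
    theta = n_true / n
    theta_i1 = [sum(1 for x, c in examples if c and x[i]) / n_true
                for i in range(k)]
    theta_i2 = [sum(1 for x, c in examples if not c and x[i]) / n_false
                for i in range(k)]
    return theta, theta_i1, theta_i2


def posterior_true(model, x):
    """P(C = true | x_1..x_n) = alpha * P(C) * prod_i P(x_i | C)."""
    theta, theta_i1, theta_i2 = model
    p_true, p_false = theta, 1.0 - theta
    for i, xi in enumerate(x):
        p_true *= theta_i1[i] if xi else 1.0 - theta_i1[i]
        p_false *= theta_i2[i] if xi else 1.0 - theta_i2[i]
    return p_true / (p_true + p_false)   # dividing by the sum applies alpha


# Hypothetical training data: two Boolean attributes and a Boolean class.
examples = [
    ((True, True), True), ((True, False), True),
    ((True, True), True), ((False, True), True),
    ((False, False), False), ((False, True), False),
    ((True, False), False), ((False, False), False),
]
model = train_naive_bayes(examples)
num_parameters = 1 + len(model[1]) + len(model[2])   # 2n + 1
p = posterior_true(model, (True, True))
print(model, num_parameters, p)
```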
20.2.3 Generative and discriminative models

We can distinguish two kinds of machine learning models used for classifiers: generative and discriminative. A generative model models the probability distribution of each class. For example, the naive Bayes text classifier from Section 12.6.1 creates a separate model for each possible category of text: one for sports, one for weather, and so on. Each model includes the prior probability of the category (for example, \(P(\mathit{Category} = \mathit{weather})\)) as well as the conditional probability \(\mathbf{P}(\mathit{Inputs} \mid \mathit{Category} = \mathit{weather})\). From these we can compute the joint probability \(\mathbf{P}(\mathit{Inputs}, \mathit{Category} = \mathit{weather})\), and we can generate a random selection of words that is representative of texts in the weather category.

Figure 20.3 The learning curve for naive Bayes learning applied to the restaurant problem from Chapter 19; the learning curve for decision tree learning is shown for comparison. [Axes: proportion correct on test set versus training set size.]

A discriminative model directly learns the decision boundary between classes. That is, it learns \(\mathbf{P}(\mathit{Category} \mid \mathit{Inputs})\). Given example inputs, a discriminative model will come up with an output category, but you cannot use a discriminative model to, say, generate random words that are representative of a category. Logistic regression, decision trees, and support vector machines are all discriminative models.

Since discriminative models put all their emphasis on defining the decision boundary (that is, actually doing the classification task they were asked to do), they tend to perform better in the limit, with an arbitrary amount of training data. However, with limited data, in some cases a generative model performs better. Ng and Jordan (2002) compare the generative naive Bayes classifier to the discriminative logistic regression classifier on 15 (small) data sets, and find that with the maximum amount of data, the discriminative model does better on 9 out of 15 data sets, but with only a small amount of data, the generative model does better on 14 out of 15 data sets.
20.2.4 Maximum-likelihood parameter learning: Continuous models

Continuous probability models such as the linear-Gaussian model were shown on page 422. Because continuous variables are ubiquitous in real-world applications, it is important to know how to learn the parameters of continuous models from data. The principles for maximum-likelihood learning are identical in the continuous and discrete cases.

Let us begin with a very simple case: learning the parameters of a Gaussian density function on a single variable. That is, we assume the data are generated as follows:

\[ P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \]

The parameters of this model are the mean \(\mu\) and the standard deviation \(\sigma\). (Notice that the normalizing "constant" depends on \(\sigma\), so we cannot ignore it.) Let the observed values be \(x_1, \ldots, x_N\). Then the log likelihood is

\[ L = \sum_{j=1}^{N} \log \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_j-\mu)^2}{2\sigma^2}} = N\left(-\log\sqrt{2\pi} - \log\sigma\right) - \sum_{j=1}^{N} \frac{(x_j-\mu)^2}{2\sigma^2}. \]

Setting the derivatives to zero as usual, we obtain

\[
\begin{aligned}
\frac{\partial L}{\partial \mu} &= \frac{1}{\sigma^2}\sum_{j=1}^{N}(x_j-\mu) = 0 &&\Rightarrow\quad \mu = \frac{\sum_j x_j}{N} \\
\frac{\partial L}{\partial \sigma} &= -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{j=1}^{N}(x_j-\mu)^2 = 0 &&\Rightarrow\quad \sigma = \sqrt{\frac{\sum_j (x_j-\mu)^2}{N}}.
\end{aligned}
\tag{20.4}
\]

That is, the maximum-likelihood value of the mean is the sample average and the maximum-likelihood value of the standard deviation is the square root of the sample variance. Again, these are comforting results that confirm "commonsense" practice.

Figure 20.4 (a) A linear-Gaussian model described as \(y = \theta_1 x + \theta_2\) plus Gaussian noise with fixed variance. (b) A set of 50 data points generated from this model and the best-fit line.

Now consider a linear-Gaussian model with one continuous parent \(X\) and a continuous child \(Y\). As explained on page 422, \(Y\) has a Gaussian distribution whose mean depends linearly on the value of \(X\) and whose standard deviation is fixed. To learn the conditional distribution \(P(Y \mid X)\), we can maximize the conditional likelihood

\[ P(y \mid x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y-(\theta_1 x+\theta_2))^2}{2\sigma^2}}. \tag{20.5} \]

Here, the parameters are \(\theta_1\), \(\theta_2\), and \(\sigma\). The data are a collection of \((x_j, y_j)\) pairs, as illustrated in Figure 20.4. Using the usual methods (Exercise 20.LINR), we can find the maximum-likelihood values of the parameters. The point here is different. If we consider just the parameters \(\theta_1\) and \(\theta_2\) that define the linear relationship between x and y, it becomes clear that maximizing the log likelihood with respect to these parameters is the same as minimizing the numerator \((y - (\theta_1 x + \theta_2))^2\) in the exponent of Equation (20.5). This is the \(L_2\) loss, the squared error between the actual value \(y\) and the prediction \(\theta_1 x + \theta_2\).

This is the quantity minimized by the standard linear regression procedure described in Section 19.6. Now we can understand why: minimizing the sum of squared errors gives the maximum-likelihood straight-line model, provided that the data are generated with Gaussian noise of fixed variance.
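Both results can be checked numerically. The sketch below is an illustration only; the data values and function names are hypothetical. It computes the maximum-likelihood mean and standard deviation for a small sample, then confirms that nudging either parameter lowers the log likelihood. It does the same for a least-squares line under the conditional likelihood of Equation (20.5) with \(\sigma\) held fixed.

```python
import math


def gaussian_mle(xs):
    """Sample average and square root of the (divide-by-N) sample variance."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma


def log_likelihood(xs, mu, sigma):
    """L = N(-log sqrt(2 pi) - log sigma) - sum_j (x_j - mu)^2 / (2 sigma^2)."""
    n = len(xs)
    return (n * (-math.log(math.sqrt(2 * math.pi)) - math.log(sigma))
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))


def least_squares_line(points):
    """Closed-form least-squares fit of y = theta1 * x + theta2."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    theta1 = (sum((x - mx) * (y - my) for x, y in points)
              / sum((x - mx) ** 2 for x, _ in points))
    return theta1, my - theta1 * mx


def conditional_log_likelihood(points, theta1, theta2, sigma):
    """Log of the product of P(y_j | x_j): a Gaussian on the residuals."""
    residuals = [y - (theta1 * x + theta2) for x, y in points]
    return log_likelihood(residuals, 0.0, sigma)


# Hypothetical single-variable sample.
xs = [2.1, 1.9, 2.4, 2.0, 1.6]
mu_hat, sigma_hat = gaussian_mle(xs)
best = log_likelihood(xs, mu_hat, sigma_hat)

# Hypothetical (x, y) pairs for the linear-Gaussian case.
points = [(0.0, 0.1), (0.25, 0.35), (0.5, 0.4), (0.75, 0.8), (1.0, 0.9)]
t1, t2 = least_squares_line(points)
best_cond = conditional_log_likelihood(points, t1, t2, sigma=0.1)

print(mu_hat, sigma_hat, best, (t1, t2), best_cond)
```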
Figure 20.5 Examples of the Beta(a, b) distribution for different values of (a, b). [Horizontal axis in both panels: parameter \(\theta\).]

20.2.5 Bayesian parameter learning

Maximum-likelihood learning gives rise to simple procedures, but it has serious deficiencies with small data sets. For example, after seeing one cherry candy, the maximum-likelihood hypothesis is that the bag is 100% cherry (i.e., \(\theta = 1.0\)). Unless one's hypothesis prior is that bags must be either all cherry or all lime, this is not a reasonable conclusion. It is more likely that the bag is a mixture of lime and cherry. The Bayesian approach to parameter learning starts with a hypothesis prior and updates the distribution as data arrive.

The candy example in Figure 20.2(a) has one parameter, \(\theta\): the probability that a randomly selected piece of candy is cherry-flavored. In the Bayesian view, \(\theta\) is the (unknown) value of a random variable \(\Theta\) that defines the hypothesis space; the hypothesis prior is the prior distribution \(\mathbf{P}(\Theta)\). Thus, \(P(\Theta = \theta)\) is the prior probability that the bag has a fraction \(\theta\) of cherry candies.

If the parameter \(\theta\) can be any value between 0 and 1, then \(\mathbf{P}(\Theta)\) is a continuous probability density function (see Section A.3). If we don't know anything about the possible values of \(\theta\) we can use the uniform density function \(P(\theta) = \mathit{Uniform}(\theta; 0, 1)\), which says all values are equally likely.

A more flexible family of probability density functions is known as the beta distributions. Each beta distribution is defined by two hyperparameters³ \(a\) and \(b\) such that

\[ \mathrm{Beta}(\theta; a, b) = \alpha\, \theta^{a-1} (1-\theta)^{b-1}, \tag{20.6} \]

for \(\theta\) in the range [0, 1]. The normalization constant \(\alpha\), which makes the distribution integrate to 1, depends on \(a\) and \(b\). Figure 20.5 shows what the distribution looks like for various values of \(a\) and \(b\). The mean value of the beta distribution is \(a/(a+b)\), so larger values of \(a\) suggest a belief that \(\Theta\) is closer to 1 than to 0. Larger values of \(a + b\) make the distribution more peaked, suggesting greater certainty about the value of \(\Theta\). It turns out that the uniform density function is the same as Beta(1, 1): the mean is 1/2, and the distribution is flat.
3 They are called hyperparameters because they parameterize a distribution over \(\theta\), which is itself a parameter.
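The properties of the beta family stated above are easy to check numerically. The Python sketch below is illustrative only. It assumes the standard closed form for the normalization constant, \(\alpha = \Gamma(a+b) / (\Gamma(a)\,\Gamma(b))\), which the text does not spell out, and it uses a simple midpoint rule for the integrals; the function names and the example hyperparameter values are choices made for this sketch.

```python
import math


def beta_pdf(theta, a, b):
    """Beta(theta; a, b) = alpha * theta^(a-1) * (1-theta)^(b-1).

    Assumes the standard normalization alpha = Gamma(a+b) / (Gamma(a) Gamma(b)).
    """
    alpha = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return alpha * theta ** (a - 1) * (1 - theta) ** (b - 1)


def integrate(f, n=10000):
    """Midpoint-rule integral of f over [0, 1]."""
    h = 1.0 / n
    return sum(f((k + 0.5) * h) for k in range(n)) * h


def beta_mean(a, b):
    return integrate(lambda t: t * beta_pdf(t, a, b))


def beta_std(a, b):
    m = beta_mean(a, b)
    return math.sqrt(integrate(lambda t: (t - m) ** 2 * beta_pdf(t, a, b)))


# Beta(1, 1) is flat: the density is 1 everywhere on [0, 1].
flat_value = beta_pdf(0.3, 1, 1)

# Same proportions a : b = 3 : 1, increasing a + b.
total_mass = integrate(lambda t: beta_pdf(t, 3, 1))
mean_small, mean_large = beta_mean(3, 1), beta_mean(30, 10)
std_small, std_large = beta_std(3, 1), beta_std(30, 10)

print(flat_value, total_mass, mean_small, mean_large, std_small, std_large)
```

With the ratio a : b held at 3 : 1, the mean stays at \(a/(a+b) = 0.75\) while the spread shrinks as \(a + b\) grows, which is the "more peaked" behavior described above.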
Section20.2
Learning with Complete Data
731
Figure 20.6 A Bayesian network that corresponds to a Bayesian learning process. Posterior distributions for the parameter variables ©, ©;, and @, can be inferred from their prior distributions and the evidence in Flavor; and Wrapper;. Besides its flexibility, the beta family has another wonderful property: if Beta(a, b), then, after a data point is observed, the posterior distribution for © distribution. In other words, Beta is closed under update. The beta family conjugate prior for the family of distributions for a Boolean variable.* Let’s works. Suppose we observe a cherry candy: then we have
© has a prior is also a beta is called the see how this Conjugate prior
P(0Dy=cherry) = a P(Dy=cherry0)P(6)
= o 0Beta(f;a,b) = o’ 60°'(19)"" = ' 0°(10)""" = o/ Beta(:a+1,b).
Thus, after seeing a cherry candy, we simply increment the $a$ parameter to get the posterior; similarly, after seeing a lime candy, we increment the $b$ parameter. Thus, we can view the $a$ and $b$ hyperparameters as virtual counts, in the sense that a prior Beta(a, b) behaves exactly as if we had started out with a uniform prior Beta(1,1) and seen $a-1$ actual cherry candies and $b-1$ actual lime candies.
By examining a sequence of beta distributions for increasing values of $a$ and $b$, keeping the proportions fixed, we can see vividly how the posterior distribution over the parameter $\Theta$ changes as data arrive. For example, suppose the actual bag of candy is 75% cherry. Figure 20.5(b) shows the sequence Beta(3,1), Beta(6,2), Beta(30,10). Clearly, the distribution is converging to a narrow peak around the true value of $\Theta$. For large data sets, then, Bayesian learning (at least in this case) converges to the same answer as maximum-likelihood learning.
4 Other conjugate priors include the Dirichlet family for the parameters of a discrete multivalued distribution and the Normal-Wishart family for the parameters of a Gaussian distribution. See Bernardo and Smith (1994).
Now let us consider a more complicated case. The network in Figure 20.2(b) has three parameters, $\theta$, $\theta_1$, and $\theta_2$, where $\theta_1$ is the probability of a red wrapper on a cherry candy and $\theta_2$ is the probability of a red wrapper on a lime candy. The Bayesian hypothesis prior must cover all three parameters; that is, we need to specify $P(\Theta, \Theta_1, \Theta_2)$. Usually, we assume parameter independence:
$$P(\Theta, \Theta_1, \Theta_2) = P(\Theta)\, P(\Theta_1)\, P(\Theta_2).$$
With this assumption, each parameter can have its own beta distribution that is updated separately as data arrive. Figure 20.6 shows how we can incorporate the hypothesis prior and any data into a Bayesian network, in which we have a node for each parameter variable. The nodes $\Theta$, $\Theta_1$, $\Theta_2$ have no parents. For the $i$th observation of a wrapper and corresponding flavor of a piece of candy, we add nodes $Wrapper_i$ and $Flavor_i$. $Flavor_i$ is dependent on the flavor parameter $\Theta$:
$$P(Flavor_i = cherry \mid \Theta = \theta) = \theta.$$
$Wrapper_i$ is dependent on $\Theta_1$ and $\Theta_2$:
$$\begin{aligned}
P(Wrapper_i = red \mid Flavor_i = cherry, \Theta_1 = \theta_1) &= \theta_1 \\
P(Wrapper_i = red \mid Flavor_i = lime, \Theta_2 = \theta_2) &= \theta_2.
\end{aligned}$$
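The conjugate update and the parameter-independence assumption described in this section can be sketched in a few lines of Python. This is only an illustration: the class name `Beta`, the function `learn`, and the observation encoding as `(flavor, wrapper)` string pairs are assumptions made here, not notation from the text. The variance formula is the standard one for a beta distribution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    """Beta(a, b) distribution over the parameter of a Boolean variable."""
    a: float
    b: float

    def update(self, positive: bool) -> "Beta":
        # Conjugate update: a positive observation increments a,
        # a negative one increments b (the "virtual counts" view).
        return Beta(self.a + 1, self.b) if positive else Beta(self.a, self.b + 1)

    @property
    def mean(self) -> float:
        return self.a / (self.a + self.b)

    @property
    def variance(self) -> float:
        # Standard beta variance; it shrinks as a + b grows.
        n = self.a + self.b
        return self.a * self.b / (n * n * (n + 1))


def learn(data, prior=Beta(1, 1)):
    """Update independent beta posteriors for theta, theta1, theta2.

    data is a list of (flavor, wrapper) pairs (hypothetical encoding).
    """
    theta = theta1 = theta2 = prior
    for flavor, wrapper in data:
        theta = theta.update(flavor == "cherry")
        if flavor == "cherry":
            theta1 = theta1.update(wrapper == "red")   # red wrapper | cherry
        else:
            theta2 = theta2.update(wrapper == "red")   # red wrapper | lime
    return theta, theta1, theta2


observations = [("cherry", "red"), ("cherry", "red"), ("cherry", "green"), ("lime", "green")]
theta, theta1, theta2 = learn(observations)
```

Starting from the uniform prior Beta(1,1), three cherries and one lime give Beta(4,2) for the flavor parameter, and each wrapper parameter is touched only by the observations of its own flavor.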
Now, the entire Bayesian learning process for the original Bayes net in Figure 20.2(b) can be formulated as an inference problem in the derived Bayes net shown in Figure 20.6, where the data and parameters become nodes. Once we have added all the new evidence nodes, we can then query the parameter variables (in this case, $\Theta$, $\Theta_1$, $\Theta_2$). Under this formulation there is just one learning algorithm: the inference algorithm for Bayesian networks. Of course, the nature of these networks is somewhat different from those of Chapter 13 because of the potentially huge number of evidence variables representing the training set and the prevalence of continuous-valued parameter variables. Exact inference may be impossible except in very simple cases such as the naive Bayes model. Practitioners typically use approximate inference methods such as MCMC (Section 13.4.2); many statistical software packages incorporate efficient implementations of MCMC for this purpose.
20.2.6
Bayesian linear regression
Here we illustrate how to apply a Bayesian approach to a standard statistical task: linear regression. The conventional approach was described in Section 19.6 as minimizing the sum of squared errors and reinterpreted in Section 20.2.4 as maximizing likelihood assuming a Gaussian error model. These produce a single best hypothesis: a straight line with specific values for the slope and intercept and a fixed variance for the prediction error at any given point. There is no measure of how confident one should be in the slope and intercept values. Furthermore, if one is predicting a value for an unseen data point far from the observed data points, it seems to make no sense to assume a prediction error that is the same as the prediction error for a data point right next to an observed data point. It would seem more sensible for the prediction error to be larger, the farther the data point is from the observed data, because a small change in the slope will cause a large change in the predicted value for a distant point.
The Bayesian approach fixes both of these problems. The general idea, as in the preceding section, is to place a prior on the model parameters (here, the coefficients of the linear model and the noise variance) and then to compute the parameter posterior given the data. For multivariate data and unknown noise model, this leads to rather a lot of linear algebra, so we
focus on a simple case: univariate data, a model that is constrained to go through the origin, and known noise: a normal distribution with variance $\sigma^2$. Then we have just one parameter $\theta$ and the model is
$$P(y \mid x, \theta) = \mathcal{N}(y; \theta x, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{(y-\theta x)^2}{\sigma^2}\right)}. \qquad (20.7)$$
As the log likelihood is quadratic in $\theta$, the appropriate form for a conjugate prior on $\theta$ is also a Gaussian. This ensures that the posterior for $\theta$ will also be Gaussian. We'll assume a mean $\theta_0$ and variance $\sigma_0^2$ for the prior, so that
$$P(\theta) = \mathcal{N}(\theta; \theta_0, \sigma_0^2).$$

We don't necessarily expect the top halves of photos to look like bottom halves, so there is a scale beyond which spatial invariance no longer holds.
Local spatial invariance can be achieved by constraining the $l$ weights connecting a local region to a unit in the hidden layer to be the same for each hidden unit. (That is, for hidden units $i$ and $j$, the weights $w_{1,i}, \ldots, w_{l,i}$ are the same as $w_{1,j}, \ldots, w_{l,j}$.) This makes the hidden units into feature detectors that detect the same feature wherever it appears in the image. Typically, we want the first hidden layer to detect many kinds of features, not just one; so for each local image region we might have $d$ hidden units with $d$ distinct sets of weights. This means that there are $dl$ weights in all, a number that is not only far smaller than $n^2$, but is actually independent of $n$, the image size. Thus, by injecting some prior knowledge (namely, knowledge of adjacency and spatial invariance) we can develop models that have far fewer parameters and can learn much more quickly.
A convolutional neural network (CNN) is one that contains spatially local connections, at least in the early layers, and has patterns of weights that are replicated across the units in each layer. A pattern of weights that is replicated across multiple local regions is called a kernel and the process of applying the kernel to the pixels of the image (or to spatially organized units in a subsequent layer) is called convolution.⁴
Kernels and convolutions are easiest to illustrate in one dimension rather than two or more, so we will assume an input vector $x$ of size $n$, corresponding to $n$ pixels in a one-dimensional image, and a vector kernel $k$ of size $l$. (For simplicity we will assume that $l$ is an odd number.) All the ideas carry over straightforwardly to higher-dimensional cases. We write the convolution operation using the $*$ symbol, for example: $z = x * k$. The operation is defined as follows:
$$z_i = \sum_{j=1}^{l} k_j\, x_{j+i-(l+1)/2}. \qquad (21.8)$$
3 Similar ideas can be applied to process time-series data sources such as audio waveforms. These typically exhibit temporal invariance: a word sounds the same no matter what time of day it is uttered. Recurrent neural networks (Section 21.6) automatically exhibit temporal invariance.
4 In the terminology of signal processing, we would call this operation a cross-correlation, not a convolution. But "convolution" is used within the field of neural networks.
Figure 21.4 An example of a one-dimensional convolution operation with a kernel of size $l = 3$ and a stride $s = 2$. The peak response is centered on the darker (lower intensity) input pixel. The results would usually be fed through a nonlinear activation function (not shown) before going to the next hidden layer.
In other words, for each output position $i$, we take the dot product between the kernel $k$ and a snippet of $x$ centered on $x_i$ with width $l$.
The process is illustrated in Figure 21.4 for a kernel vector [+1,−1,+1], which detects a darker point in the 1D image. (The 2D version might detect a darker line.) Notice that in this example the pixels on which the kernels are centered are separated by a distance of 2 pixels; we say the kernel is applied with a stride $s = 2$. Notice that the output layer has fewer pixels: because of the stride, the number of pixels is reduced from $n$ to roughly $n/s$. (In two dimensions, the number of pixels would be roughly $n/(s_x s_y)$, where $s_x$ and $s_y$ are the strides in the $x$ and $y$ directions in the image.) We say "roughly" because of what happens at the edge of the image: in Figure 21.4 the convolution stops at the edges of the image, but one can also pad the input with extra pixels (either zeroes or copies of the outer pixels) so that the kernel can be applied exactly $n/s$ times. For small kernels, we typically use $s = 1$, so the output has the same dimensions as the image (see Figure 21.5).
The operation of applying a kernel across an image can be implemented in the obvious way by a program with suitable nested loops; but it can also be formulated as a single matrix operation, just like the application of the weight matrix in Equation (21.1). For example, the convolution illustrated in Figure 21.4 can be viewed as the following matrix multiplication:
$$\begin{pmatrix}
+1 & -1 & +1 & 0 & 0 & 0 & 0 \\
0 & 0 & +1 & -1 & +1 & 0 & 0 \\
0 & 0 & 0 & 0 & +1 & -1 & +1
\end{pmatrix}
\begin{pmatrix} 5 \\ 6 \\ 6 \\ 2 \\ 5 \\ 6 \\ 5 \end{pmatrix}
=
\begin{pmatrix} 5 \\ 9 \\ 4 \end{pmatrix}. \qquad (21.9)$$
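Both views of the operation, the nested-loop form and the single matrix multiplication, can be sketched in plain Python. The function names are illustrative, and the input pixel values are an assumption chosen so that the output is the hidden layer [5, 9, 4] that the text attributes to Figure 21.4. No padding is applied, so the kernel is only placed where it fits entirely inside the input.

```python
def conv1d(x, k, stride=1):
    """Slide kernel k along x, taking a dot product at every stride-th position.

    Each window starts at index `start`; its center is start + (len(k) - 1) // 2.
    """
    l = len(k)
    return [sum(k[j] * x[start + j] for j in range(l))
            for start in range(0, len(x) - l + 1, stride)]


def conv_matrix(n, k, stride=1):
    """Build the equivalent weight matrix: one shifted copy of k per row."""
    l = len(k)
    rows = []
    for start in range(0, n - l + 1, stride):
        row = [0] * n
        row[start:start + l] = k
        rows.append(row)
    return rows


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


x = [5, 6, 6, 2, 5, 6, 5]      # assumed pixel values; the 2 is the darker pixel
k = [+1, -1, +1]

z = conv1d(x, k, stride=2)     # [5, 9, 4]
W = conv_matrix(len(x), k, stride=2)
z_from_matrix = matvec(W, x)   # same result via matrix multiplication
```

The matrix has only `len(k)` nonzero entries per row, which is why practical implementations do not materialize it.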
In this weight matrix, the kernel appears in each row, shifted according to the stride relative to the previous row. One wouldn't necessarily construct the weight matrix explicitly (it is mostly zeroes, after all) but the fact that convolution is a linear matrix operation serves as a reminder that gradient descent can be applied easily and effectively to CNNs, just as it can to plain vanilla neural networks.
Figure 21.5 The first two layers of a CNN for a 1D image with a kernel size $l = 3$ and a stride $s = 1$. Padding is added at the left and right ends in order to keep the hidden layers the same size as the input. Shown in red is the receptive field of a unit in the second hidden layer. Generally speaking, the deeper the unit, the larger the receptive field.
As mentioned earlier, there will be $d$ kernels, not just one; so, with a stride of 1, the
output will be $d$ times larger. This means that a two-dimensional input array becomes a three-dimensional array of hidden units, where the third dimension is of size $d$. It is important to organize the hidden layer this way, so that all the kernel outputs from a particular image location stay associated with that location. Unlike the spatial dimensions of the image, however, this additional "kernel dimension" does not have any adjacency properties, so it does not make sense to run convolutions along it.
CNNs were inspired originally by models of the visual cortex proposed in neuroscience. In those models, the receptive field of a neuron is the portion of the sensory input that can affect that neuron's activation. In a CNN, the receptive field of a unit in the first hidden layer is small: just the size of the kernel, i.e., $l$ pixels. In the deeper layers of the network, it can be much larger. Figure 21.5 illustrates this for a unit in the second hidden layer, whose receptive field contains five pixels. When the stride is 1, as in the figure, a node in the $m$th hidden layer will have a receptive field of size $(l-1)m + 1$; so the growth is linear in $m$. (In a 2D image, each dimension of the receptive field grows linearly with $m$, so the area grows quadratically.) When the stride is larger than 1, each pixel in layer $m$ represents $s$ pixels in layer $m-1$; therefore, the receptive field grows as $O(l s^m)$, that is, exponentially with depth.
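Both growth rates can be checked with a short recurrence. The sketch below assumes the same kernel size in every layer and no padding effects at the image border; the function name and the `jump` bookkeeping variable are illustrative choices, not notation from the text.

```python
def receptive_field(kernel_size, strides):
    """Width, in input pixels, of the receptive field after len(strides) layers.

    `jump` is the spacing, measured in input pixels, between adjacent units
    of the current layer; each new layer widens the field by
    (kernel_size - 1) * jump and multiplies the spacing by its stride.
    """
    size, jump = 1, 1
    for s in strides:
        size += (kernel_size - 1) * jump
        jump *= s
    return size


linear = [receptive_field(3, [1] * m) for m in range(1, 5)]       # [3, 5, 7, 9]
exponential = [receptive_field(3, [2] * m) for m in range(1, 5)]  # [3, 7, 15, 31]
```

With stride 1 the result matches $(l-1)m + 1$, and with stride 2 the width roughly doubles with each extra layer.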
The same effect occurs with pooling layers, which we discuss next.
21.3.1
Pooling and downsampling
A pooling layer in a neural network summarizes a set of adjacent units from the preceding layer with a single value. Pooling works just like a convolution layer, with a kernel size $l$ and stride $s$, but the operation that is applied is fixed rather than learned. Typically, no activation function is associated with the pooling layer. There are two common forms of pooling:
* Average-pooling computes the average value of its $l$ inputs. This is identical to convolution with a uniform kernel vector $k = [1/l, \ldots, 1/l]$. If we set $l = s$, the effect is to coarsen the resolution of the image (to downsample it) by a factor of $s$. An object that occupied, say, $10s$ pixels would now occupy only 10 pixels after pooling. The same learned classifier that would be able to recognize the object at a size of 10 pixels in the original image would now be able to recognize that object in the pooled image, even if it was too big to recognize in the original image. In other words, average-pooling facilitates multiscale recognition. It also reduces the number of weights required in subsequent layers, leading to lower computational cost and possibly faster learning.
* Max-pooling computes the maximum value of its $l$ inputs. It can also be used purely for downsampling, but it has a somewhat different semantics. Suppose we applied max-pooling to the hidden layer [5,9,4] in Figure 21.4: the result would be a 9, indicating that somewhere in the input image there is a darker dot that is detected by the kernel. In other words, max-pooling acts as a kind of logical disjunction, saying that a feature exists somewhere in the unit's receptive field.
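The two forms of pooling differ only in the fixed operation applied to each window, as the following minimal sketch shows. The helper names and the eight-pixel example image are assumptions made for illustration; the hidden layer [5, 9, 4] is the one quoted above.

```python
def pool1d(x, size, stride, op):
    """Apply the fixed operation `op` to each window of `size` adjacent units."""
    return [op(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]


def average(window):
    return sum(window) / len(window)


# Max-pooling over the whole hidden layer: "the feature occurs somewhere".
hidden = [5, 9, 4]
max_pooled = pool1d(hidden, size=3, stride=3, op=max)          # [9]

# Average-pooling with size == stride downsamples by that factor.
image = [2, 2, 4, 4, 6, 6, 8, 8]                               # hypothetical pixels
downsampled = pool1d(image, size=2, stride=2, op=average)      # [2.0, 4.0, 6.0, 8.0]
```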
If the goal is to classify the image into one of $c$ categories, then the final layer of the network will be a softmax with $c$ output units. The early layers of the CNN are image-sized, so somewhere in between there must be significant reductions in layer size. Convolution layers and pooling layers with stride larger than 1 all serve to reduce the layer size. It's also possible to reduce the layer size simply by having a fully connected layer with fewer units than the preceding layer. CNNs often have one or two such layers preceding the final softmax layer.
21.3.2
Tensor operations in CNNs
We saw in Equations (21.1) and (21.3) that the use of vector and matrix notation can be helpful in keeping mathematical derivations simple and elegant and providing concise descriptions of computation graphs. Vectors and matrices are one-dimensional and two-dimensional special cases of tensors, which (in deep learning terminology) are simply multidimensional arrays of any dimension.⁵
For CNNs, tensors are a way of keeping track of the "shape" of the data as it progresses through the layers of the network. This is important because the whole notion of convolution depends on the idea of adjacency: adjacent data elements are assumed to be semantically related, so it makes sense to apply operators to local regions of the data. Moreover, with suitable language primitives for constructing tensors and applying operators, the layers themselves can be described concisely as maps from tensor inputs to tensor outputs.
A final reason for describing CNNs in terms of tensor operations is computational efficiency: given a description of a network as a sequence of tensor operations, a deep learning software package can generate compiled code that is highly optimized for the underlying computational substrate. Deep learning workloads are often run on GPUs (graphics processing units) or TPUs (tensor processing units), which make available a high degree of parallelism. For example, one of Google's third-generation TPU pods has throughput equivalent to about ten million laptops. Taking advantage of these capabilities is essential if one is training a large CNN on a large database of images. Thus, it is common to process not one image at a time but many images in parallel; as we will see in Section 21.4, this also aligns nicely with the way that the stochastic gradient descent algorithm calculates gradients with respect to a minibatch of training examples.
5 The proper mathematical definition of tensors requires that certain invariances hold under a change of basis.
Let us put all this together in the form of an example. Suppose we are training on 256 × 256 RGB images with a minibatch size of 64. The input in this case will be a four-dimensional tensor of size 256 × 256 × 3 × 64. Then we apply 96 kernels of size 5 × 5 × 3 with a stride of 2 in both $x$ and $y$ directions in the image. This gives an output tensor of size 128 × 128 × 96 × 64. Such a tensor is often called a feature map, since it shows how each feature extracted by a kernel appears across the entire image; in this case it is composed of 96 channels, where each channel carries information from one feature. Notice that unlike the input tensor, this feature map no longer has dedicated color channels; nonetheless, the color information may still be present in the various feature channels if the learning algorithm finds color to be useful for the final predictions of the network.
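The shape arithmetic in this example can be reproduced with the usual output-size formula for a strided convolution. The padding of 2 pixels on each side is an assumption: the text does not state the padding, and 2 is simply the value that yields exactly 128 for a 5 × 5 kernel with stride 2. The function names and the (height, width, channels, batch) axis order are likewise illustrative.

```python
def conv_output_size(n, kernel, stride, padding=0):
    """Number of kernel positions along one spatial dimension of length n."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_output_shape(input_shape, num_kernels, kernel, stride, padding):
    """Shape of the feature map for an input laid out as (H, W, channels, batch)."""
    height, width, _channels, batch = input_shape
    return (conv_output_size(height, kernel, stride, padding),
            conv_output_size(width, kernel, stride, padding),
            num_kernels,
            batch)


input_shape = (256, 256, 3, 64)
feature_map = conv_output_shape(input_shape, num_kernels=96,
                                kernel=5, stride=2, padding=2)   # (128, 128, 96, 64)

# Each kernel spans all 3 input channels, so the layer has 96 * 5 * 5 * 3
# weights (biases not counted), independent of the image size.
num_weights = 96 * 5 * 5 * 3                                     # 7200
```

Without padding the same layer would produce 126 × 126 outputs per channel, which is the "roughly n/s" case.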
21.3.3
Residual networks
Residual networks are a popular and successful approach to building very deep networks that avoid the problem of vanishing gradients.
Typical deep models use layers that learn a new representation at layer $i$ by completely replacing the representation at layer $i-1$. Using the matrix-vector notation that we introduced in Equation (21.3), with $z^{(i)}$ being the values of the units in layer $i$, we have
$$z^{(i)} = f(z^{(i-1)}) = g^{(i)}(W^{(i)} z^{(i-1)}).$$
Because each layer completely replaces the representation from the preceding layer, all of the layers must learn to do something useful. Each layer must, at the very least, preserve the task-relevant information contained in the preceding layer. If we set $W^{(i)} = 0$ for any layer $i$, the entire network ceases to function. If we also set $W^{(i-1)} = 0$, the network would not even be able to learn: layer $i$ would not learn because it would observe no variation in its input from layer $i-1$, and layer $i-1$ would not learn because the backpropagated gradient from layer $i$ would always be zero. Of course, these are extreme examples, but they illustrate the need for layers to serve as conduits for the signals passing through the network.
The key idea of residual networks is that a layer should perturb the representation from the previous layer rather than replace it entirely. If the learned perturbation is small, the next layer is close to being a copy of the previous layer. This is achieved by the following equation for layer $i$ in terms of layer $i-1$:
$$z^{(i)} = g_r^{(i)}\left(z^{(i-1)} + f(z^{(i-1)})\right), \qquad (21.10)$$
where $g_r$ denotes the activation functions for the residual layer. Here we think of $f$ as the residual, perturbing the default behavior of passing layer $i-1$ through to layer $i$. The function used to compute the residual is typically a neural network with one nonlinear layer combined with one linear layer:
$$f(z) = V g(Wz),$$
where $W$ and $V$ are learned weight matrices with the usual bias weights added.
Residual networks make it possible to learn significantly deeper networks reliably. Consider what happens if we set $V = 0$ for a particular layer in order to disable that layer. Then the residual $f$ disappears and Equation (21.10) simplifies to
$$z^{(i)} = g_r(z^{(i-1)}).$$
Now suppose that $g_r$ consists of ReLU activation functions and that $z^{(i-1)}$ also applies a ReLU function to its inputs: $z^{(i-1)} = \mathrm{ReLU}(in^{(i-1)})$. In that case we have
$$z^{(i)} = g_r(z^{(i-1)}) = \mathrm{ReLU}(z^{(i-1)}) = \mathrm{ReLU}(\mathrm{ReLU}(in^{(i-1)})) = \mathrm{ReLU}(in^{(i-1)}) = z^{(i-1)},$$
where the penultimate step follows because ReLU(ReLU(x)) = ReLU(x). In other words, in residual nets with ReLU activations, a layer with zero weights simply passes its inputs through with no change. The rest of the network functions just as if the layer had never existed. Whereas traditional networks must learn to propagate information and are subject to catastrophic failure of information propagation for bad choices of the parameters, residual networks propagate information by default.
Residual networks are often used with convolutional layers in vision applications, but they are in fact a general-purpose tool that makes deep networks more robust and allows researchers to experiment more freely with complex and heterogeneous network designs. At the time of writing, it is not uncommon to see residual networks with hundreds of layers. The design of such networks is evolving rapidly, so any additional specifics we might provide would probably be outdated before reaching printed form. Readers desiring to know the best architectures for specific applications should consult recent research publications.
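The pass-through behavior of a disabled residual layer can be checked numerically. The following is a minimal sketch under stated assumptions: vectors and matrices are plain Python lists, bias weights are omitted, ReLU is used for both $g$ and $g_r$, and the weight values are arbitrary numbers chosen for illustration.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]


def plain_layer(z_prev, W):
    """Conventional layer: z = g(W z_prev), replacing the previous representation."""
    return relu(matvec(W, z_prev))


def residual_layer(z_prev, W, V):
    """Residual layer: z = g_r(z_prev + f(z_prev)) with f(z) = V g(W z)."""
    f = matvec(V, relu(matvec(W, z_prev)))
    return relu([a + b for a, b in zip(z_prev, f)])


z_prev = relu([0.5, -1.0, 2.0])        # output of an earlier ReLU layer
W = [[0.2, -0.4, 0.1],
     [0.7, 0.3, -0.5],
     [-0.6, 0.9, 0.8]]                 # arbitrary illustrative weights
zero = [[0.0] * 3 for _ in range(3)]

passed_through = residual_layer(z_prev, W, zero)   # V = 0 disables the layer
blocked = plain_layer(z_prev, zero)                # W = 0 destroys the signal
```

With $V = 0$ the residual layer returns its input unchanged, while the plain layer with zero weights outputs all zeroes.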
21.4
Learning Algorithms
Training a neural network consists of modifying the network’s parameters so as to minimize
the loss function on the training set. In principle, any kind of optimization algorithm could
be used. In practice, modern neural networks are almost always trained with some variant of
stochastic gradient descent (SGD).
We covered standard gradient descent and its stochastic version in Section 19.6.2. Here, the goal is to minimize the loss $L(w)$, where $w$ represents all of the parameters of the network. Each update step in the gradient descent process looks like this:
$$w \leftarrow w - \alpha \nabla_w L(w),$$
where $\alpha$ is the learning rate. For standard gradient descent, the loss $L$ is defined with respect to the entire training set. For SGD, it is defined with respect to a minibatch of $m$ examples chosen randomly at each step.
As noted in Section 4.2, the literature on optimization methods for high-dimensional continuous spaces includes innumerable enhancements to basic gradient descent. We will not cover all of them here, but it is worth mentioning a few important considerations that are particularly relevant to training neural networks:
* For most networks that solve real-world problems, both the dimensionality of $w$ and the size of the training set are very large. These considerations militate strongly in favor of using SGD with a relatively small minibatch size $m$: stochasticity helps the algorithm escape small local minima in the high-dimensional weight space (as in simulated annealing; see page 114); and the small minibatch size ensures that the computational cost of each weight update step is a small constant, independent of the training set size.
* Because the gradient contribution of each training example in the SGD minibatch can be computed independently, the minibatch size is often chosen so as to take maximum advantage of hardware parallelism in GPUs or TPUs.
* To improve convergence, it is usually a good idea to use a learning rate that decreases over time. Choosing the right schedule is usually a matter of trial and error.
* Near a local or global minimum of the loss function with respect to the entire training set, the gradients estimated from small minibatches will often have high variance and may point in entirely the wrong direction, making convergence difficult. One solution is to increase the minibatch size as training proceeds; another is to incorporate the idea of momentum, which keeps a running average of the gradients of past minibatches in order to compensate for small minibatch sizes.
* Care must be taken to mitigate numerical instabilities that may arise due to overflow, underflow, and rounding error. These are particularly problematic with the use of exponentials in softmax, sigmoid, and tanh activation functions, and with the iterated computations in very deep networks and recurrent networks (Section 21.6) that lead to vanishing and exploding activations and gradients.
Figure 21.6 Illustration of the backpropagation of gradient information in an arbitrary computation graph. The forward computation of the output of the network proceeds from left to right, while the backpropagation of gradients proceeds from right to left.
Overall, the process of learning the weights of the network is usually one that exhibits diminishing returns. We run until it is no longer practical to decrease the test error by running longer. Usually this does not mean we have reached a global or even a local minimum of the loss function. Instead, it means we would have to make an impractically large number of very small steps to continue reducing the cost, or that additional steps would only cause overfitting, or that estimates of the gradient are too inaccurate to make further progress.
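Several of the considerations above (small random minibatches, a decaying learning rate, and a running average of past gradients) can be combined in a few lines. The sketch below is illustrative only: it fits a two-parameter linear model rather than a neural network, and the function name, the hyperparameter values, the 1/(1 + t/decay) schedule, and the noise-free synthetic data are all assumptions made here.

```python
import random


def sgd_momentum(data, steps=2000, batch_size=10, lr0=0.5,
                 decay=1000.0, mu=0.9, seed=0):
    """Minibatch SGD with momentum for the model y = w*x + b, squared-error loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0          # parameters
    vw, vb = 0.0, 0.0        # running averages of past minibatch gradients
    for t in range(steps):
        batch = rng.sample(data, batch_size)          # random minibatch
        # Gradient of the mean squared error over the minibatch.
        gw = sum(2 * (w * x + b - y) * x for x, y in batch) / batch_size
        gb = sum(2 * (w * x + b - y) for x, y in batch) / batch_size
        # Momentum: exponentially weighted running average of gradients.
        vw = mu * vw + (1 - mu) * gw
        vb = mu * vb + (1 - mu) * gb
        lr = lr0 / (1 + t / decay)                    # learning rate decreases over time
        w -= lr * vw
        b -= lr * vb
    return w, b


# Synthetic data from the line y = 2x + 1 (hypothetical example).
data = [(i / 100, 2 * (i / 100) + 1) for i in range(100)]
w, b = sgd_momentum(data)
```

The cost of each step depends only on `batch_size`, not on the size of `data`, which is the property that makes SGD practical for large training sets.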
21.4.1
Computing gradients in computation graphs
On page 755, we derived the gradient of the loss function with respect to the weights in a specific (and very simple) network. We observed that the gradient could be computed by backpropagating error information from the output layer of the network to the hidden layers. We also said that this result holds in general for any feedforward computation graph. Here, we explain how this works.
Figure 21.6 shows a generic node in a computation graph. (The node $h$ has in-degree and out-degree 2, but nothing in the analysis depends on this.) During the forward pass, the node computes some arbitrary function $h$ from its inputs, which come from nodes $f$ and $g$. In turn, $h$ feeds its value to nodes $j$ and $k$.
The backpropagation process passes messages back along each link in the network. At each node, the incoming messages are collected and new messages are calculated to pass back to the next layer. As the figure shows, the messages are all partial derivatives of the loss $L$. For example, the backward message $\partial L/\partial h_j$ is the partial derivative of $L$ with respect to $j$'s first input, which is the forward message from $h$ to $j$. Now, $h$ affects $L$ through both $j$ and $k$, so we have
$$\frac{\partial L}{\partial h} = \frac{\partial L}{\partial h_j} + \frac{\partial L}{\partial h_k}. \qquad (21.11)$$
With this equation, the node $h$ can compute the derivative of $L$ with respect to $h$ by summing the incoming messages from $j$ and $k$. Now, to compute the outgoing messages $\partial L/\partial f_h$ and $\partial L/\partial g_h$, we use the following equations:
$$\frac{\partial L}{\partial f_h} = \frac{\partial L}{\partial h}\,\frac{\partial h}{\partial f_h} \quad\text{and}\quad \frac{\partial L}{\partial g_h} = \frac{\partial L}{\partial h}\,\frac{\partial h}{\partial g_h}. \qquad (21.12)$$
In Equation (21.12), $\partial L/\partial h$ was already computed by Equation (21.11), and $\partial h/\partial f_h$ and $\partial h/\partial g_h$ are just the derivatives of $h$ with respect to its first and second arguments, respectively. For example, if $h$ is a multiplication node (that is, $h(f, g) = f \cdot g$) then $\partial h/\partial f_h = g$ and $\partial h/\partial g_h = f$. Software packages for deep learning typically come with a library of node types (addition, multiplication, sigmoid, and so on), each of which knows how to compute its own derivatives as needed for Equation (21.12).
The backpropagation process begins with the output nodes, where each initial message $\partial L/\partial \hat{y}_j$ is calculated directly from the expression for $L$ in terms of the predicted value $\hat{y}$ and the true value $y$ from the training data.
At each internal node, the incoming backward messages are summed according to Equation (21.11) and the outgoing messages are generated from Equation (21.12). The process terminates at each node in the computation graph that represents a weight $w$ (e.g., the light mauve ovals in Figure 21.3(b)). At that point, the sum of the incoming messages to $w$ is $\partial L/\partial w$, precisely the gradient we need to update $w$. Exercise 21.BPRE asks you to apply this process to the simple network in Figure 21.3 in order to rederive the gradient expressions in Equations (21.4) and (21.5).
Weight-sharing, as used in convolutional networks (Section 21.3) and recurrent networks (Section 21.6), is handled simply by treating each shared weight as a single node with multiple outgoing arcs in the computation graph. During backpropagation, this results in multiple incoming gradient messages. By Equation (21.11), this means that the gradient for the shared weight is the sum of the gradient contributions from each place it is used in the network.
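The message-passing scheme, including the summing of messages at a shared weight, can be illustrated with a very small reverse-mode differentiation sketch. Everything here is a toy under stated assumptions: the `Node` class, the two node types `add` and `mul`, and the example graph (a single weight used twice, with a squared-error loss) are invented for illustration and do not reproduce any particular deep learning package.

```python
class Node:
    """A value in the computation graph plus links to the nodes it was computed from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = list(parents)   # pairs (parent node, d(self)/d(parent))
        self.grad = 0.0                # accumulates dL/d(self)


def add(f, g):
    return Node(f.value + g.value, [(f, 1.0), (g, 1.0)])


def mul(f, g):
    # For h = f * g: dh/df = g and dh/dg = f.
    return Node(f.value * g.value, [(f, g.value), (g, f.value)])


def backpropagate(output):
    """Send messages from the output back to every node that feeds into it."""
    order, seen = [], set()

    def visit(node):                   # depth-first topological ordering
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0                  # dL/dL
    for node in reversed(order):       # consumers are processed before producers
        for parent, local in node.parents:
            # Outgoing message (21.12), summed at the receiver as in (21.11).
            parent.grad += node.grad * local


# Shared weight w used in two places: y = w*x1 + w*x2, loss L = (y - target)^2.
w, x1, x2 = Node(3.0), Node(2.0), Node(5.0)
y = add(mul(w, x1), mul(w, x2))
error = add(y, Node(-20.0))            # y - target, with target = 20
loss = mul(error, error)
backpropagate(loss)
```

Here `w.grad` is the sum of two contributions, one from each multiplication node that uses `w`, matching the treatment of shared weights described above.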
It is clear from this description of the backpropagation process that its computational
cost is linear in the number of nodes in the computation graph, just like the cost of the forward computation. Furthermore, because the node types are typically fixed when the network
is designed, all of the gradient computations can be prepared in symbolic form in advance
and compiled into very efficient code for each node in the graph.
Note also that the messages in Figure 21.6 need not be scalars: they could equally be vectors, matrices, or higher-dimensional tensors, so that the gradient computations can be mapped onto GPUs or TPUs to benefit from parallelism.
One drawback of backpropagation is that it requires storing most of the intermediate
values that were computed during forward propagation in order to calculate gradients in the backward pass. This means that the total memory cost of training the network is proportional to the number of units in the entire network.
Thus, even if the network itself is represented
only implicitly by propagation code with lots of loops, rather than explicitly by a data structure, all of the intermediate results of that propagation code have to be stored explicitly.
21.4.2
Batch normalization
Batch normalization is a commonly used technique that improves the rate of convergence of SGD by rescaling the values generated at the internal layers of the network from the examples within each minibatch. Although the reasons for its effectiveness are not well understood at the time of writing, we include it because it confers significant benefits in practice. To some extent, batch normalization seems to have effects similar to those of the residual network.
Consider a node $z$ somewhere in the network: the values of $z$ for the $m$ examples in a minibatch are $z_1, \ldots, z_m$. Batch normalization replaces each $z_i$ with a new quantity $\hat{z}_i$:
$$\hat{z}_i = \gamma\, \frac{z_i - \mu}{\sqrt{\epsilon + \sigma^2}} + \beta,$$
where $\mu$ is the mean value of $z$ across the minibatch, $\sigma$ is the standard deviation of $z_1, \ldots, z_m$, $\epsilon$ is a small constant added to prevent division by zero, and $\gamma$ and $\beta$ are learned parameters.
Batch normalization standardizes the mean and variance of the values, as determined by the values of $\beta$ and $\gamma$. This makes it much simpler to train a deep network. Without batch normalization, information can get lost if a layer's weights are too small, and the standard deviation at that layer decays to near zero. Batch normalization prevents this from happening. It also reduces the need for careful initialization of all the weights in the network to make sure that the nodes in each layer are in the right operating region to allow information to propagate.
With batch normalization, we usually include $\beta$ and $\gamma$, which may be node-specific or layer-specific, among the parameters of the network, so that they are included in the learning process. After training, $\beta$ and $\gamma$ are fixed at their learned values.
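As a minimal sketch of the rescaling step for a single node, the following function normalizes one minibatch of scalar values. It assumes the usual form of the transformation (subtract the minibatch mean, divide by the square root of the variance plus a small constant, then scale and shift); the function name, the default value of `eps`, and the example numbers are illustrative choices, and the learned parameters are simply passed in as fixed arguments.

```python
import math


def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize the values of one node across a minibatch, then scale and shift."""
    m = len(z)
    mu = sum(z) / m                               # minibatch mean
    var = sum((v - mu) ** 2 for v in z) / m       # minibatch variance (sigma squared)
    return [gamma * (v - mu) / math.sqrt(eps + var) + beta for v in z]


z = [1.0, 2.0, 3.0, 4.0]                          # hypothetical node values
z_hat = batch_norm(z, gamma=2.0, beta=3.0)

mean_hat = sum(z_hat) / len(z_hat)
std_hat = math.sqrt(sum((v - mean_hat) ** 2 for v in z_hat) / len(z_hat))
```

After the transformation the minibatch has mean close to `beta` and standard deviation close to `gamma`, whatever the scale of the original values.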
21.5
Generalization
So far we have described how to fit a neural network to its training set, but in machine learning the goal is to generalize to new data that has not been seen previously, as measured by performance on a test set. In this section, we focus on three approaches to improving generalization performance: choosing the right network architecture, penalizing large weights, and randomly perturbing the values passing through the network during training.
21.5.1
Choosing a network architecture
A great deal of effort in deep learning research has gone into finding network architectures that generalize well. Indeed, for each particular kind of data (images, speech, text, video, and so on) a good deal of the progress in performance has come from exploring different kinds of network architectures and varying the number of layers, their connectivity, and the types of node in each layer.⁶
Some neural network architectures are explicitly designed to generalize well on particular
types of data: convolutional networks encode the idea that the same feature extractor is useful at all locations across a spatial grid, and recurrent networks encode the idea that the same
update rule is useful at all points in a stream of sequential data.
To the extent that these
assumptions are valid, we expect convolutional architectures to generalize well on images and recurrent networks to generalize well on text and audio signals.
6 Noting that much of this incremental, exploratory work is carried out by graduate students, some have called the process graduate student descent (GSD).
[Figure 21.7 plot: test-set error (vertical axis) against number of weights (horizontal axis), with one curve for a 3-layer network and one for an 11-layer network.]
Figure 21.7 Test-set error as a function of layer width (as measured by total number of weights) for three-layer and eleven-layer convolutional networks. The data come from early versions of Google’s system for transcribing addresses in photos taken by Street View cars (Goodfellow et al., 2014).
One of the most important empirical findings in the field of deep learning is that when
comparing two networks with similar numbers of weights, the deeper network usually gives better generalization performance.
Figure 21.7 shows this effect for at least one real-world application: recognizing house numbers. The results show that for any fixed number of parameters, an eleven-layer network gives much lower test-set error than a three-layer network.
Deep learning systems perform well on some but not all tasks. For tasks with high-dimensional inputs (images, video, speech signals, etc.) they perform better than any other pure machine learning approaches. Most of the algorithms described in Chapter 19 can handle high-dimensional input only if it is preprocessed using manually designed features to reduce the dimensionality. This preprocessing approach, which prevailed prior to 2010, has not yielded performance comparable to that achieved by deep learning systems.
Clearly, deep learning models are capturing some important aspects of these tasks. In particular, their success implies that the tasks can be solved by parallel programs with a relatively small number of steps ($10$ to $10^3$ rather than, say, $10^7$). This is perhaps not surprising, because these tasks are typically solved by the brain in less than a second, which is time enough for only a few tens of sequential neuron firings. Moreover, by examining the internal-layer representations learned by deep convolutional networks for vision tasks, we find evidence that the processing steps seem to involve extracting a sequence of increasingly abstract representations of the scene, beginning with tiny edges, dots, and corner features and ending with entire objects and arrangements of multiple objects.
On the other hand, because they are simple circuits, deep learning models lack the compositional and quantificational expressive power that we see in first-order logic (Chapter 8) and context-free grammars (Chapter 23).
Although deep learning models generalize well in many cases, they may also produce unintuitive errors. They tend to produce input-output mappings that are discontinuous, so that a small change to an input can cause a large change in the output. For example, it may be possible to alter just a few pixels in an image of a dog and cause the network to classify the dog as an ostrich or a school bus, even though the altered image still looks exactly like a dog. An altered image of this kind is called an adversarial example.
In low-dimensional spaces it is hard to find adversarial examples. But for an image with
a million pixel values, it is often the case that even though most of the pixels contribute to
the image being classified in the middle of the “dog” region of the space, there are a few dimensions where the pixel value is near the boundary to another category. An adversary
with the ability to reverse engineer the network can find the smallest vector difference that
would move the image over the border.
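The geometry of the "smallest vector difference" can be sketched for the simplest possible case. The example below assumes a linear classifier with known weights (a deliberate simplification, not the deep network discussed above), and the weights `w`, the input `x`, and the `overshoot` factor are all hypothetical values chosen for illustration. For a linear decision boundary, the shortest move across the border is along the weight vector.

```python
def score(w, b, x):
    """Linear classifier score; positive means class A, negative means class B."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def minimal_perturbation(w, b, x, overshoot=1.01):
    """Smallest (Euclidean) change to x that crosses the boundary w.x + b = 0.

    The step is taken along w; overshoot > 1 pushes the point slightly past
    the boundary rather than exactly onto it.
    """
    s = score(w, b, x)
    norm_sq = sum(wi * wi for wi in w)
    return [-overshoot * s * wi / norm_sq for wi in w]

n = 10_000                                              # number of "pixels"
w = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]     # hypothetical weights
b = 0.0
x = [0.51 if i % 2 == 0 else 0.49 for i in range(n)]    # hypothetical input

delta = minimal_perturbation(w, b, x)
x_adv = [xi + di for xi, di in zip(x, delta)]
largest_change = max(abs(d) for d in delta)
```

With ten thousand dimensions, a change of about 0.01 per pixel is enough to flip the sign of the score, even though the original score was far from zero. This is one way to see why adversarial examples are easier to find in high-dimensional input spaces.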
When adversarial examples were first discovered, they set off two worldwide scrambles: one to find learning algorithms and network architectures that would not be susceptible to adversarial attack, and another to create ever-more-effective adversarial attacks against all kinds of learning systems. So far the attackers seem to be ahead. In fact, whereas it was assumed initially that one would need access to the internals of the trained network in order to construct an adversarial example specifically for that network, it has turned out that one can construct robust adversarial examples that fool multiple networks with different architectures, hyperparameters, and training sets. These findings suggest that deep learning models recognize objects in ways that are quite different from the human visual system.
21.5.2 Neural architecture search
Unfortunately, we don’t yet have a clear set of guidelines to help you choose the best network architecture for a particular problem. Success in deploying a deep learning solution requires experience and good judgment.
From the earliest days of neural network research, attempts have been made to automate the process of architecture selection. We can think of this as a case of hyperparameter tuning (Section 19.4.4), where the hyperparameters determine the depth, width, connectivity, and other attributes of the network. However, there are so many choices to be made that simple approaches like grid search can’t cover all possibilities in a reasonable amount of time.
Therefore, it is common to use neural architecture search to explore the state space of possible network architectures. Many of the search techniques and learning techniques we covered earlier in the book have been applied to neural architecture search.
Evolutionary algorithms have been popular because it is sensible to do both recombination (joining parts of two networks together) and mutation (adding or removing a layer or changing a parameter value). Hill climbing can also be used with these same mutation operations.
Some researchers have framed the problem as reinforcement learning, and some
as Bayesian optimization.
Another possibility is to treat the architectural possibilities as a
continuous differentiable space and use gradient descent to find a locally optimal solution.
For all these search techniques, a major challenge is estimating the value of a candidate
network.
The straightforward way to evaluate an architecture is to train it on a training set for
multiple batches and then evaluate its accuracy on a validation set. But with large networks that could take many GPU-days.
Therefore, there have been many attempts to speed up this estimation process by eliminating or at least reducing the expensive training process. We can train on a smaller data set. We can train for a small number of batches and predict how the network would improve with more batches.
We can use a reduced version of the network architecture that we hope
retains the properties of the full version. We can train one big network and then search for subgraphs of the network that perform better; this search can be fast because the subgraphs
share parameters and don’t have to be retrained. Another approach is to learn a heuristic evaluation function (as was done for A* search).
That is, start by choosing a few hundred network architectures and train and evaluate them.
That gives us a data set of (network, score) pairs. Then learn a mapping from the features of a network to a predicted
score. From that point on we can generate a large number of candidate
networks and quickly estimate their value. After a search through the space of networks, the best one(s) can be fully evaluated with a complete training procedure.
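The search loop itself can be sketched with hill climbing over the mutation operations mentioned earlier. Everything specific here is an assumption for illustration: an architecture is reduced to a `(depth, width)` pair, and `proxy_score` is a hypothetical stand-in for a cheap value estimate (a short training run or a learned evaluation function), not a real measurement.

```python
import random

def proxy_score(arch):
    """Hypothetical cheap estimate of an architecture's value.

    A real system would train briefly or consult a learned predictor; this
    synthetic function simply peaks at depth 8 and width 64.
    """
    depth, width = arch
    return -((depth - 8) ** 2) - ((width - 64) / 16) ** 2

def mutate(arch, rng):
    """Apply one mutation: add/remove a layer, or change the layer width."""
    depth, width = arch
    if rng.random() < 0.5:
        depth = max(1, depth + rng.choice([-1, 1]))
    else:
        width = max(8, width + rng.choice([-8, 8]))
    return (depth, width)

def hill_climb(start, steps=500, seed=0):
    """Keep a single current architecture; accept a mutation only if it scores higher."""
    rng = random.Random(seed)
    best, best_score = start, proxy_score(start)
    for _ in range(steps):
        candidate = mutate(best, rng)
        candidate_score = proxy_score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

best_arch, best_value = hill_climb((2, 16))
```

In practice only the final candidate (or a handful of finalists) would be given a complete training run, as described above.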
21.5.3 Weight decay
In Section 19.4.3 we saw that regularization (limiting the complexity of a model) can aid generalization. This is true for deep learning models as well. In the context of neural networks we usually call this approach weight decay.
Weight decay consists of adding a penalty $\lambda \sum_{i,j} W_{i,j}^2$ to the loss function used to train the neural network, where $\lambda$ is a hyperparameter controlling the strength of the penalty and the sum is usually taken over all of the weights in the network. Using $\lambda = 0$ is equivalent to not using weight decay, while using larger values of $\lambda$ encourages the weights to become small. It is common to use weight decay with $\lambda$ near $10^{-4}$.
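A minimal sketch of how the penalty enters training, assuming plain gradient descent and weights stored as a list of rows. The function names and the learning rate are choices made for this example only.

```python
def penalized_loss(data_loss, weights, lam):
    """Data loss plus the weight decay penalty: lam times the sum of squared weights."""
    return data_loss + lam * sum(w * w for row in weights for w in row)

def sgd_step(weights, grads, lr, lam):
    """One gradient step on the penalized loss.

    The penalty contributes 2 * lam * w to each weight's gradient, so every
    weight is pulled toward zero in proportion to its size.
    """
    return [
        [w - lr * (g + 2.0 * lam * w) for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]

weights = [[1.0, 2.0], [3.0, 0.0]]
loss_no_decay = penalized_loss(1.0, weights, lam=0.0)     # penalty switched off
loss_with_decay = penalized_loss(1.0, weights, lam=0.1)   # adds 0.1 * 14

# With a zero data gradient, the update simply shrinks the weight.
shrunk = sgd_step([[1.0]], [[0.0]], lr=0.1, lam=0.5)
```

The shrinking behavior in the last line is where the name "weight decay" comes from.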
Choosing a specific network architecture can be seen as an absolute constraint on the
hypothesis space: a function is either representable within that architecture or it is not. Loss
function penalty terms such as weight decay offer a softer constraint: functions represented
with large weights are in the function family, but the training set must provide more evidence in favor of these functions than is required to choose a function with small weights. It is not straightforward to interpret the effect of weight decay in a neural network. In
networks with sigmoid activation functions, it is hypothesized that weight decay helps to keep the activations near the linear part of the sigmoid, avoiding the flat operating region
that leads to vanishing gradients. With ReLU activation functions, weight decay seems to be beneficial, but the explanation that makes sense for sigmoids no longer applies because the ReLU’s output is either linear or zero.
Moreover, with residual connections, weight decay
encourages the network to have small differences between consecutive layers rather than
small absolute weight values.
Despite these differences in the behavior of weight decay
across many architectures, weight decay is still widely useful.
One explanation for the beneficial effect of weight decay is that it implements a form of maximum a posteriori (MAP) learning (see page 723). Letting X and y stand for the inputs and outputs across the entire training set, the maximum a posteriori hypothesis $h_{MAP}$ satisfies
$$h_{MAP} = \operatorname*{argmax}_{W} P(y \mid X, W)\,P(W) = \operatorname*{argmin}_{W} \left[ -\log P(y \mid X, W) - \log P(W) \right].$$
The first term is the usual cross-entropy loss; the second term prefers weights that are likely under a prior distribution. This aligns exactly with a regularized loss function if we set
$$\log P(W) = -\lambda \sum_{i,j} W_{i,j}^2,$$
which means that P(W) is a zero-mean Gaussian prior.
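To see why the prior is Gaussian, it may help to run the correspondence in the other direction. The derivation below assumes independent zero-mean Gaussian priors with a common variance $\sigma^2$ on every weight; the symbol $\sigma$ and the additive constant $c$ are introduced here for illustration.

```latex
\begin{aligned}
P(W) &= \prod_{i,j} \frac{1}{\sqrt{2\pi\sigma^2}}
        \exp\!\left(-\frac{W_{i,j}^2}{2\sigma^2}\right) \\
\log P(W) &= -\frac{1}{2\sigma^2} \sum_{i,j} W_{i,j}^2 + c,
\qquad c = -\tfrac{1}{2} \sum_{i,j} \log\!\left(2\pi\sigma^2\right).
\end{aligned}
```

Because $c$ does not depend on $W$, it has no effect on the argmin, and matching terms gives $\lambda = 1/(2\sigma^2)$. Under this reading, a larger $\lambda$ corresponds to a narrower prior around zero.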
21.5.4 Dropout
Another way that we can intervene to reduce the test-set error of a network, at the cost of making it harder to fit the training set, is to use dropout. At each step of training, dropout applies one step of backpropagation learning to a new version of the network that is created by deactivating a randomly chosen subset of the units. This is a rough and very low-cost approximation to training a large ensemble of different networks (see Section 19.8).
More specifically, let us suppose we are using stochastic gradient descent with minibatch
size m.
For each minibatch, the dropout algorithm applies the following process to every
node in the network: with probability p, the unit output is multiplied by a factor of 1/p;
otherwise, the unit output is fixed at zero. Dropout is typically applied to units in the hidden
layers with p=0.5; for input units, a value of p=0.8 turns out to be most effective. This process
produces a thinned network with about half as many units as the original, to which
backpropagation is applied with the minibatch of m training examples. The process repeats in the usual way until training is complete. At test time, the model is run with no dropout.
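The per-unit rule described above can be sketched in a few lines. The function name `dropout` and the toy layer of constant outputs are assumptions for this example; the rule itself (keep with probability p and scale by 1/p, otherwise output zero) is the one stated in the text.

```python
import random

def dropout(outputs, p, rng):
    """Training-time dropout: keep each unit with probability p, scaled by 1/p.

    Dropped units output zero. The 1/p scaling keeps the expected output of
    each unit unchanged, so no rescaling is needed at test time.
    """
    return [o / p if rng.random() < p else 0.0 for o in outputs]

rng = random.Random(0)
hidden = [1.0] * 10_000                     # toy hidden-layer outputs
thinned = dropout(hidden, p=0.5, rng=rng)

kept_fraction = sum(1 for t in thinned if t != 0.0) / len(thinned)
mean_output = sum(thinned) / len(thinned)
```

About half of the units survive, and the average output stays close to its original value of 1, which is why the model can be run without dropout at test time.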
We can think of dropout from several perspectives:
• By introducing noise at training time, the model is forced to become robust to noise.
• As noted above, dropout approximates the creation of a large ensemble of thinned networks. This claim can be verified analytically for linear models, and appears to hold experimentally for deep learning models.
• Hidden units trained with dropout must learn not only to be useful hidden units; they must also learn to be compatible with many other possible sets of other hidden units that may or may not be included in the full model. This is similar to the selection processes that guide the evolution of genes: each gene must not only be effective in its own function, but must work well with other genes, whose identity in future organisms may vary considerably.
• Dropout applied to later layers in a deep network forces the final decision to be made robustly by paying attention to all of the abstract features of the example rather than focusing on just one and ignoring the others. For example, a classifier for animal images might be able to achieve high performance on the training set just by looking at the animal’s nose, but would presumably fail on a test case where the nose was obscured or damaged. With dropout, there will be training cases where the internal “nose unit” is zeroed out, causing the learning process to find additional identifying features. Notice that trying to achieve the same degree of robustness by adding noise to the input data would be difficult: there is no easy way to know in advance that the network is going to focus on noses, and no easy way to delete noses automatically from each image.
Altogether, dropout forces the model to learn multiple, robust explanations for each input.
This causes the model to generalize well, but also makes it more difficult to fit the training
set—it is usually necessary to use a larger model and to train it for more iterations.
21.6
Recurrent Neural Networks
Recurrent neural networks (RNNs) are distinct from feedforward networks in that they allow cycles in the computation graph. In all the cases we will consider, each cycle has a delay, so that units may take as input a value computed from their own output at an earlier step in the computation. (Without the delay, a cyclic circuit may reach an inconsistent state.) This allows the RNN to have internal state, or memory: inputs received at earlier time steps affect the RNN’s response to the current input.
Figure 21.8 (a) Schematic diagram of a basic RNN where the hidden layer z has recurrent connections; the Δ symbol indicates a delay. (b) The same network unrolled over three time steps to create a feedforward network. Note that the weights are shared across all time steps.
RNNs can also be used to perform more general computations (after all, ordinary computers are just Boolean circuits with memory) and to model real neural systems, many of which contain cyclic connections. Here we focus on the use of RNNs to analyze sequential data, where we assume that a new input vector $x_t$ arrives at each time step.
As tools for analyzing sequential data, RNNs can be compared to the hidden Markov models, dynamic Bayesian networks, and Kalman filters described in Chapter 14. (The reader may find it helpful to refer back to that chapter before proceeding.) Like those models, RNNs make a Markov assumption (see page 463): the hidden state $z_t$ of the network suffices to capture the information from all previous inputs. Furthermore, suppose we describe the RNN’s update process for the hidden state by the equation $z_t = f_w(z_{t-1}, x_t)$ for some parameterized function $f_w$. Once trained, this function represents a time-homogeneous process (page 463): effectively a universally quantified assertion that the dynamics represented by $f_w$ hold for all time steps. Thus, RNNs add expressive power compared to feedforward networks, just as convolutional networks do, and just as dynamic Bayes nets add expressive power compared to regular Bayes nets. Indeed, if you tried to use a feedforward network to analyze sequential data, the fixed size of the input layer would force the network to examine only a finite-length window of data, in which case the network would fail to detect long-distance dependencies.
21.6.1 Training a basic RNN
The basic model we will consider has an input layer x, a hidden layer z with recurrent connections, and an output layer y, as shown in Figure 21.8(a). We assume that both x and y are observed in the training data at each time step. The equations defining the model refer to the values of the variables indexed by time step t:
$$\begin{aligned}
z_t &= f_w(z_{t-1}, x_t) = g_z(W_{z,z} z_{t-1} + W_{x,z} x_t) = g_z(in_{z,t}) \\
\hat{y}_t &= g_y(W_{z,y} z_t) = g_y(in_{y,t}),
\end{aligned} \tag{21.13}$$
where $g_z$ and $g_y$ denote the activation functions for the hidden and output layers, respectively. As usual, we assume an extra dummy input fixed at +1 for each unit as well as bias weights associated with those inputs.
Given a sequence of input vectors $x_1, \ldots, x_T$ and observed outputs $y_1, \ldots, y_T$, we can turn this model into a feedforward network by “unrolling” it for T steps, as shown in Figure 21.8(b). Notice that the weight matrices $W_{x,z}$, $W_{z,z}$, and $W_{z,y}$ are shared across all time steps.
In the unrolled network, it is easy to see that we can calculate gradients to train the
weights in the usual way; the only difference is that the sharing of weights across layers makes the gradient computation a little more complicated.
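Unrolling can be sketched as an ordinary loop in which the same weights are reused at every step. The example assumes a scalar RNN with a tanh hidden unit, an identity output unit, and no bias weights; these choices, and the names `rnn_forward`, `w_zz`, `w_xz`, and `w_zy`, are simplifications made here for illustration.

```python
import math

def rnn_forward(xs, w_zz, w_xz, w_zy, z0=0.0):
    """Run a one-unit RNN over a sequence, one loop iteration per unrolled layer.

    The same three weights are applied at every time step (weight sharing).
    Hidden activation: tanh. Output activation: identity (assumed).
    """
    z, zs, y_hats = z0, [], []
    for x in xs:
        z = math.tanh(w_zz * z + w_xz * x)
        zs.append(z)
        y_hats.append(w_zy * z)
    return zs, y_hats

# A single input pulse followed by zeros.
pulse = [1.0, 0.0, 0.0]
zs, y_hats = rnn_forward(pulse, w_zz=0.9, w_xz=1.0, w_zy=2.0)

# Without the recurrent connection, the pulse is forgotten immediately.
zs_no_memory, y_no_memory = rnn_forward(pulse, w_zz=0.0, w_xz=1.0, w_zy=2.0)
```

With the recurrent weight in place, the outputs at the second and third steps are nonzero even though the inputs there are zero, which is the memory effect the hidden state provides.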
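To keep the equations simple, we will show the gradient calculation for an RNN with just one input unit, one hidden unit, and one output unit. For this case, making the bias weights explicit, we have $z_t = g_z(w_{z,z} z_{t-1} + w_{x,z} x_t + w_{0,z})$ and $\hat{y}_t = g_y(w_{z,y} z_t + w_{0,y})$. As in Equations (21.4) and (21.5), we will assume a squared-error loss L, in this case summed over the time steps. The derivations for the input-layer and output-layer weights $w_{x,z}$ and $w_{z,y}$ are essentially identical to Equation (21.4), so we leave them as an exercise. For the hidden-layer weight $w_{z,z}$, the first few steps also follow the same pattern as Equation (21.4):
$$\begin{aligned}
\frac{\partial L}{\partial w_{z,z}}
&= \frac{\partial}{\partial w_{z,z}} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2
 = \sum_{t=1}^{T} -2 (y_t - \hat{y}_t) \frac{\partial \hat{y}_t}{\partial w_{z,z}} \\
&= \sum_{t=1}^{T} -2 (y_t - \hat{y}_t) \frac{\partial}{\partial w_{z,z}} g_y(in_{y,t})
 = \sum_{t=1}^{T} -2 (y_t - \hat{y}_t)\, g_y'(in_{y,t}) \frac{\partial}{\partial w_{z,z}} in_{y,t} \\
&= \sum_{t=1}^{T} -2 (y_t - \hat{y}_t)\, g_y'(in_{y,t}) \frac{\partial}{\partial w_{z,z}} \left( w_{z,y} z_t + w_{0,y} \right) \\
&= \sum_{t=1}^{T} -2 (y_t - \hat{y}_t)\, g_y'(in_{y,t})\, w_{z,y} \frac{\partial z_t}{\partial w_{z,z}}.
\end{aligned} \tag{21.14}$$
Now the gradient for the hidden unit $z_t$ can be obtained from the previous time step as follows:
$$\begin{aligned}
\frac{\partial z_t}{\partial w_{z,z}}
&= \frac{\partial}{\partial w_{z,z}} g_z(in_{z,t})
 = g_z'(in_{z,t}) \frac{\partial}{\partial w_{z,z}} in_{z,t}
 = g_z'(in_{z,t}) \frac{\partial}{\partial w_{z,z}} \left( w_{z,z} z_{t-1} + w_{x,z} x_t + w_{0,z} \right) \\
&= g_z'(in_{z,t}) \left( z_{t-1} + w_{z,z} \frac{\partial z_{t-1}}{\partial w_{z,z}} \right),
\end{aligned} \tag{21.15}$$
where the last line uses the rule for derivatives of products: $\partial(uv)/\partial x = v\,\partial u/\partial x + u\,\partial v/\partial x$.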
Looking at Equation (21.15), we notice two things. First, the gradient expression is recursive: the contribution to the gradient from time step t is calculated using the contribution from time step t − 1. If we order the calculations in the right way, the total run time for computing the gradient will be linear in the size of the network. This algorithm is called backpropagation through time, and is usually handled automatically by deep learning software systems. Second, if we iterate the recursive calculation, we see that gradients at T will include terms proportional to $w_{z,z} \prod_{t=1}^{T} g_z'(in_{z,t})$. For sigmoids, tanhs, and ReLUs, $g' \le 1$, so our simple RNN will certainly suffer from the vanishing gradient problem (see page 756) if $w_{z,z} < 1$. On the other hand, if $w_{z,z} > 1$, we may experience the exploding gradient problem. (For the general case, these outcomes depend on the first eigenvalue of the weight matrix $W_{z,z}$.) The next section describes a more elaborate RNN design intended to mitigate this issue.
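The recursion can be checked numerically for the one-unit RNN. The sketch below assumes a tanh hidden unit and an identity output unit (so the output derivative is 1), and it evaluates the recursion in forward order, carrying the derivative of the hidden state along with the state itself; deep learning software would normally obtain the same number with a backward pass. All parameter values and data are made up for the check.

```python
import math

def forward(params, xs):
    """Forward pass of a one-unit RNN with explicit bias weights."""
    w_zz, w_xz, w_0z, w_zy, w_0y = params
    z, zs, y_hats = 0.0, [], []
    for x in xs:
        z = math.tanh(w_zz * z + w_xz * x + w_0z)
        zs.append(z)
        y_hats.append(w_zy * z + w_0y)        # identity output activation
    return zs, y_hats

def loss(params, xs, ys):
    """Squared-error loss summed over the time steps."""
    _, y_hats = forward(params, xs)
    return sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))

def grad_w_zz(params, xs, ys):
    """Gradient of the loss with respect to the recurrent weight w_zz."""
    w_zz, _, _, w_zy, _ = params
    zs, y_hats = forward(params, xs)
    total, dz_dw, z_prev = 0.0, 0.0, 0.0
    for z, y, yh in zip(zs, ys, y_hats):
        # Recursive step: tanh'(in) = 1 - z^2, times (z_{t-1} + w_zz * dz_{t-1}/dw).
        dz_dw = (1.0 - z * z) * (z_prev + w_zz * dz_dw)
        # Accumulate -2 (y_t - yhat_t) * w_zy * dz_t/dw.
        total += -2.0 * (y - yh) * w_zy * dz_dw
        z_prev = z
    return total

params = (0.5, 0.8, 0.1, 1.2, -0.3)      # hypothetical weights
xs = [1.0, -0.5, 0.3, 0.7]               # hypothetical inputs
ys = [0.2, 0.1, -0.4, 0.5]               # hypothetical targets

analytic = grad_w_zz(params, xs, ys)

# Central finite difference on w_zz as an independent check.
eps = 1e-6
plus = (params[0] + eps,) + params[1:]
minus = (params[0] - eps,) + params[1:]
numeric = (loss(plus, xs, ys) - loss(minus, xs, ys)) / (2 * eps)
```

Each pass through the loop multiplies the carried derivative by the recurrent weight and an activation derivative of at most 1, which is the repeated product behind the vanishing and exploding gradient problems.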
21.6.2 Long short-term memory RNNs
Several specialized RNN architectures have been designed with the goal of enabling information to be preserved over many time steps. One of the most popular is the long short-term memory or LSTM. The long-term memory component of an LSTM, called the memory cell and denoted by c, is essentially copied from time step to time step. (In contrast, the basic RNN multiplies its memory by a weight matrix at every time step, as shown in Equation (21.13).) New information enters the memory by adding updates; in this way, the gradient expressions do not accumulate multiplicatively over time. LSTMs also include gating units, which are vectors that control the flow of information in the LSTM via elementwise multiplication of the corresponding information vector:
• The forget gate f determines if each element of the memory cell is remembered (copied to the next time step) or forgotten (reset to zero).
• The input gate i determines if each element of the memory cell is updated additively by new information from the input vector at the current time step.
• The output gate o determines if each element of the memory cell is transferred to the short-term memory z, which plays a similar role to the hidden state in basic RNNs.
Whereas the word “gate” in circuit design usually connotes a Boolean function, gates in LSTMs are soft: for example, elements of the memory cell vector will be partially forgotten if the corresponding elements of the forget-gate vector are small but not zero. The values for the gating units are always in the range [0, 1] and are obtained as the outputs of a sigmoid function applied to the current input and the previous hidden state. In detail, the update equations for the LSTM are as follows:
$$\begin{aligned}
f_t &= \sigma(W_{x,f} x_t + W_{z,f} z_{t-1}) \\
i_t &= \sigma(W_{x,i} x_t + W_{z,i} z_{t-1}) \\
o_t &= \sigma(W_{x,o} x_t + W_{z,o} z_{t-1}) \\
c_t &= c_{t-1} \odot f_t + i_t \odot \tanh(W_{x,c} x_t + W_{z,c} z_{t-1}) \\
z_t &= \tanh(c_t) \odot o_t,
\end{aligned}$$
where the subscripts on the various weight matrices W indicate the origin and destination of the corresponding links. The $\odot$ symbol denotes elementwise multiplication.
LSTMs were among the first practically usable forms of RNN. They have demonstrated excellent performance on a wide range of tasks including speech recognition and handwriting recognition. Their use in natural language processing is discussed in Chapter 24.
21.7 Unsupervised Learning and Transfer Learning
The deep learning systems we have discussed so far are based on supervised learning, which requires each training example to be labeled with a value for the target function. Although such systems can reach a high level of test-set accuracy (as shown by the ImageNet competition results, for example), they often require far more labeled data than a human would for the same task. For example, a child needs to see only one picture of a giraffe, rather than thousands, in order to be able to recognize giraffes reliably in a wide range of settings and views. Clearly, something is missing in our deep learning story; indeed, it may be the case that our current approach to supervised deep learning renders some tasks completely unattainable because the requirements for labeled data would exceed what the human race (or the universe) can supply. Moreover, even in cases where the task is feasible, labeling large data sets usually requires scarce and expensive human labor.
For these reasons, there is intense interest in several learning paradigms that reduce the dependence on labeled data. As we saw in Chapter 19, these paradigms include unsupervised learning, transfer learning, and semisupervised learning. Unsupervised learning algorithms learn solely from unlabeled inputs x, which are often more abundantly available than labeled examples. Unsupervised learning algorithms typically produce generative models, which can produce realistic text, images, audio, and video, rather than simply predicting labels for such data. Transfer learning algorithms require some labeled examples but are able to improve their performance further by studying labeled examples for different tasks, thus making it possible to draw on more existing sources of data. Semisupervised learning algorithms require some labeled examples but are able to improve their performance further by also studying unlabeled examples. This section covers deep learning approaches to unsupervised and transfer learning; while semisupervised learning is also an active area of research in the deep learning community, the techniques developed so far have not proven broadly effective in practice, so we do not cover them.
21.7.1 Unsupervised learning
Supervised learning algorithms all have essentially the same goal: given a training set of inputs x and corresponding outputs y = f(x), learn a function h that approximates f well. Unsupervised learning algorithms, on the other hand, take a training set of unlabeled examples x. Here we describe two things that such an algorithm might try to do. The first is to learn new representations, for example, new features of images that make it easier to identify the objects in an image. The second is to learn a generative model, typically in the form of a probability distribution from which new samples can be generated. (The algorithms for learning Bayes nets in Chapter 20 fall in this category.) Many algorithms are capable of both representation learning and generative modeling.
Suppose we learn a joint model $P_W(x, z)$, where z is a set of latent, unobserved variables that represent the content of the data x in some way. In keeping with the spirit of the chapter, we do not predefine the meanings of the z variables; the model is free to learn to associate z with x however it chooses. For example, a model trained on images of handwritten digits might choose to use one direction in z space to represent the thickness of pen strokes, another to represent ink color, another to represent background color, and so on. With images of faces, the learning algorithm might choose one direction to represent gender and another to capture the presence or absence of glasses, as illustrated in Figure 21.9. A learned probability model $P_W(x, z)$ achieves both representation learning (it has constructed meaningful z vectors from the raw x vectors) and generative modeling: if we integrate z out of $P_W(x, z)$ we obtain $P_W(x)$.
Probabilistic PCA: A simple generative model
There have been many proposals for the form that $P_W(x, z)$ might take. One of the simplest is the probabilistic principal components analysis (PPCA) model.⁷ In a PPCA model, z is chosen from a zero-mean, spherical Gaussian, then x is generated from z by applying a weight matrix W and adding spherical Gaussian noise:
$$\begin{aligned}
P(z) &= N(z; 0, I) \\
P_W(x \mid z) &= N(x; Wz, \sigma^2 I).
\end{aligned}$$
The weights W (and optionally the noise parameter $\sigma^2$) can be learned by maximizing the likelihood of the data, given by
$$P_W(x) = \int P_W(x, z)\,dz = N(x; 0, WW^\top + \sigma^2 I). \tag{21.16}$$
Figure 21.9 A demonstration of how a generative model has learned to use different directions in z space to represent different aspects of faces. We can actually perform arithmetic in z space. The images here are all generated from the learned model and show what happens when we decode different points in z space. We start with the coordinates for the concept of “man with glasses,” subtract off the coordinates for “man,” add the coordinates for “woman,” and obtain the coordinates for “woman with glasses.” Images reproduced with permission from (Radford et al., 2015).
The maximization with respect to W can be done by gradient methods or by an efficient iterative EM algorithm (see Section 20.3). Once W has been learned, new data samples can be generated directly from $P_W(x)$ using Equation (21.16). Moreover, new observations x that have very low probability according to Equation (21.16) can be flagged as potential anomalies.
With PPCA, we usually assume that the dimensionality of z is much less than the dimensionality of x, so that the model learns to explain the data as well as possible in terms of a small number of features. These features can be extracted for use in standard classifiers by computing $\hat{z}$, the expectation of $P_W(z \mid x)$.
Generating data from a probabilistic PCA model is straightforward: first sample z from its fixed Gaussian prior, then sample x from a Gaussian with mean Wz. As we will see shortly, many other generative models resemble this process, but use complicated mappings defined by deep models rather than linear mappings from z-space to x-space.
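The two-step generation procedure can be sketched directly. The weight matrix `W`, the noise level `sigma`, and the sample count below are hypothetical values chosen so that the sample covariance can be compared by eye with $WW^\top + \sigma^2 I$; the helper names are likewise assumptions for this example.

```python
import random

def sample_ppca(W, sigma, rng):
    """Draw one (z, x) pair from a PPCA model.

    W is a list of rows, one per dimension of x; each row has one entry per
    dimension of z. sigma is the standard deviation of the spherical noise.
    """
    k = len(W[0])
    z = [rng.gauss(0.0, 1.0) for _ in range(k)]                    # z ~ N(0, I)
    mean = [sum(w * zj for w, zj in zip(row, z)) for row in W]     # W z
    x = [m + rng.gauss(0.0, sigma) for m in mean]                  # x ~ N(Wz, sigma^2 I)
    return z, x

def covariance(samples, a, b):
    """Sample covariance between coordinates a and b of the generated x vectors."""
    n = len(samples)
    mean_a = sum(s[a] for s in samples) / n
    mean_b = sum(s[b] for s in samples) / n
    return sum((s[a] - mean_a) * (s[b] - mean_b) for s in samples) / n

rng = random.Random(0)
W = [[2.0], [1.0], [0.0]]       # 3-dimensional x, 1-dimensional z (hypothetical)
sigma = 0.5
xs = [sample_ppca(W, sigma, rng)[1] for _ in range(20_000)]

var_x0 = covariance(xs, 0, 0)   # model value: 2*2 + 0.25 = 4.25
cov_x01 = covariance(xs, 0, 1)  # model value: 2*1 = 2.0
var_x2 = covariance(xs, 2, 2)   # model value: 0.25 (noise only)
```

The third coordinate receives no contribution from z, so its variance is pure noise. A deep generative model replaces the matrix product with a nonlinear mapping, but the sample-z-then-sample-x structure is the same.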
7 Standard PCA involves fitting a multivariate Gaussian to the raw input data and then selecting out the longest axes—the principal components—of that ellipsoidal distribution.
Autoencoders
Many unsupervised deep learning algorithms are based on the idea of an autoencoder. An autoencoder is a model containing two parts: an encoder that maps from x to a representation $\hat{z}$ and a decoder that maps from a representation $\hat{z}$ to observed data x. In general, the encoder is just a parameterized function f and the decoder is just a parameterized function g. The model is trained so that $x \approx g(f(x))$, so that the encoding process is roughly inverted by the decoding process. The functions f and g can be simple linear models parameterized by a single matrix or they can be represented by a deep neural network.
A very simple autoencoder is the linear autoencoder, where both f and g are linear with a shared weight matrix W:
$$\hat{z} = f(x) = Wx, \qquad x = g(\hat{z}) = W^\top \hat{z}.$$
One way to train this model is to minimize the squared error $\sum_j \|x_j - g(f(x_j))\|^2$ so that $x \approx g(f(x))$. The idea is to train W so that a low-dimensional $\hat{z}$ will retain as much information as possible to reconstruct the high-dimensional data x. This linear autoencoder turns out to be closely connected to classical principal components analysis (PCA). When z is m-dimensional, the matrix W should learn to span the m principal components of the data: in other words, the set of m orthogonal directions in which the data has highest variance, or equivalently the m eigenvectors of the data covariance matrix that have the largest eigenvalues, exactly as in PCA.
The PCA model is a simple generative model that corresponds to a simple linear autoencoder. The correspondence suggests that there may be a way to capture more complex kinds of generative models using more complex kinds of autoencoders. The variational autoencoder (VAE) provides one way to do this.
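Returning briefly to the linear autoencoder, its connection to PCA can be sketched on toy data. The example assumes two-dimensional inputs, a one-dimensional code, a tied weight vector `w` (encode with the dot product, decode by scaling `w`), and plain gradient descent on the mean squared reconstruction error. The data, learning rate, and step count are all choices made for this illustration.

```python
import math
import random

def train_linear_autoencoder(data, steps=500, lr=0.01):
    """Fit a tied-weight linear autoencoder with a 1-dimensional code.

    Encoder: z = w . x.  Decoder: reconstruction = w * z.
    Minimizes the mean squared reconstruction error by gradient descent.
    """
    w = [1.0, 0.0]                       # arbitrary starting point
    n = len(data)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x in data:
            z = w[0] * x[0] + w[1] * x[1]                 # encode
            r = [x[0] - w[0] * z, x[1] - w[1] * z]        # reconstruction residual
            rw = r[0] * w[0] + r[1] * w[1]
            for k in range(2):
                # d/dw_k of |x - w (w.x)|^2 is -2 (z r_k + (r.w) x_k)
                grad[k] += -2.0 * (z * r[k] + rw * x[k]) / n
        w = [w[k] - lr * grad[k] for k in range(2)]
    return w

# Toy data: most of the variance lies along the diagonal direction (1, 1).
rng = random.Random(0)
data = []
for _ in range(200):
    t = rng.gauss(0.0, 2.0)
    data.append((t / math.sqrt(2) + rng.gauss(0.0, 0.1),
                 t / math.sqrt(2) + rng.gauss(0.0, 0.1)))

w = train_linear_autoencoder(data)
norm = math.sqrt(w[0] ** 2 + w[1] ** 2)
alignment = abs(w[0] + w[1]) / (math.sqrt(2) * norm)      # |cosine| with (1, 1)
```

After training, `w` has roughly unit length and points along the direction of highest variance, which is the first principal component of this data set.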
Variational methods were introduced briefly on page 458 as a way to approximate the posterior distribution in complex probability models, where summing or integrating out a large number of hidden variables is intractable. The idea is to use a variational posterior Q(z), drawn from a computationally tractable family of distributions, as an approximation to the true posterior. For example, we might choose Q from the family of Gaussian distributions with a diagonal covariance matrix. Within the chosen family of tractable distributions, Q is optimized to be as close as possible to the true posterior distribution $P(z \mid x)$.
For our purposes, the notion of “as close as possible” is defined by the KL divergence, which we mentioned on page 758. This is given by
$$D_{KL}(Q(z) \,\|\, P(z \mid x)) = \int Q(z) \log \frac{Q(z)}{P(z \mid x)}\,dz,$$
which is an average (with respect to Q) of the log ratio between Q and P. It is easy to see that $D_{KL}(Q(z) \,\|\, P(z \mid x)) \ge 0$, with equality when Q and P coincide. We can then define the variational lower bound $\mathcal{L}$ (sometimes called the evidence lower bound, or ELBO) on the log likelihood of the data:
$$\mathcal{L}(x, Q) = \log P(x) - D_{KL}(Q(z) \,\|\, P(z \mid x)). \tag{21.17}$$
We can see that $\mathcal{L}$ is a lower bound for $\log P$ because the KL divergence is nonnegative. Variational learning maximizes $\mathcal{L}$ with respect to parameters w rather than maximizing $\log P(x)$, in the hope that the solution found, $w^*$, is close to maximizing $\log P(x)$ as well.
As written, £ does not yet seem to be any easier to maximize than log P. Fortunately, we
can rewrite Equation (21.17) to reveal improved computational tractability:
£ = logP(x)— / Q(z)logp?z(T))()dz = 7/Q(z)logQ(z)dz+/Q(z)logP(x)P(z\x)dz = H(Q)+FzglogP(z,x)
where H(Q) is the entropy of the Q distribution. For some variational families $\mathcal{Q}$ (such as Gaussian distributions), H(Q) can be evaluated analytically. Moreover, the expectation, $E_{\mathbf{z} \sim Q} \log P(\mathbf{z}, \mathbf{x})$, admits an efficient unbiased estimate via samples of z from Q. For each sample, P(z, x) can usually be evaluated efficiently; for example, if P is a Bayes net, P(z, x) is just a product of conditional probabilities because z and x comprise all the variables.

Variational autoencoders provide a means of performing variational learning in the deep learning setting. Variational learning involves maximizing $\mathcal{L}$ with respect to the parameters of both P and Q. For a variational autoencoder, the decoder g(z) is interpreted as defining log P(x | z). For example, the output of the decoder might define the mean of a conditional Gaussian. Similarly, the output of the encoder f(x) is interpreted as defining the parameters of Q; for example, Q might be a Gaussian with mean f(x). Training the variational autoencoder then consists of maximizing $\mathcal{L}$ with respect to the parameters of both the encoder f and the decoder g, which can themselves be arbitrarily complicated deep networks.
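As a rough illustration of the decomposition of the lower bound into an entropy term plus an expected log joint probability, the following sketch estimates the bound by sampling. The toy model is an assumption made purely for illustration (it is not from the text): a scalar latent z ~ N(0, 1) and observation x | z ~ N(z, 1), for which the exact evidence is x ~ N(0, 2) and the exact posterior is N(x/2, 1/2). The function names are hypothetical.

```python
import numpy as np

def gaussian_entropy(sigma):
    # Differential entropy of a 1-D Gaussian N(mu, sigma^2); independent of mu.
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

def log_normal_pdf(v, mean, var):
    # Log density of N(mean, var) evaluated at v.
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def log_joint(z, x):
    # Assumed toy model: z ~ N(0, 1), x | z ~ N(z, 1).
    return log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)

def elbo_estimate(x, mu, sigma, n_samples=200_000, seed=0):
    # H(Q) is analytic for a Gaussian Q; the expectation of log P(z, x)
    # is estimated by averaging over samples z drawn from Q.
    rng = np.random.default_rng(seed)
    z = rng.normal(mu, sigma, size=n_samples)
    return gaussian_entropy(sigma) + np.mean(log_joint(z, x))

x = 1.5
log_evidence = log_normal_pdf(x, 0.0, 2.0)                       # exact log P(x)
elbo_loose = elbo_estimate(x, mu=0.0, sigma=1.0)                 # Q far from the posterior
elbo_tight = elbo_estimate(x, mu=x / 2.0, sigma=np.sqrt(0.5))    # Q equal to the posterior
```

Under these assumptions, the estimate with a poorly matched Q falls well below log P(x), while the estimate with Q set to the true posterior matches log P(x) up to sampling noise, which is the behavior the bound predicts.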
Deep autoregressive models

An autoregressive model (or AR model) is one in which each element $x_i$ of the data vector x is predicted based on other elements of the vector. Such a model has no latent variables. If x is of fixed size, an AR model can be thought of as a fully observable and possibly fully connected Bayes net. This means that calculating the likelihood of a given data vector according to an AR model is trivial; the same holds for predicting the value of a single missing variable given all the others, and for sampling a data vector from the model.
The most common application of autoregressive models is in the analysis of time series data, where an AR model of order k predicts $x_t$ given $x_{t-k}, \ldots, x_{t-1}$. In the terminology of Chapter 14, an AR model is a non-hidden Markov model. In the terminology of Chapter 23, an n-gram model of letter or word sequences is an AR model of order n − 1.

In classical AR models, where the variables are real-valued, the conditional distribution $P(x_t \mid x_{t-k}, \ldots, x_{t-1})$ is a linear-Gaussian model with fixed variance whose mean is a weighted linear combination of $x_{t-k}, \ldots, x_{t-1}$; in other words, a standard linear regression model. The maximum likelihood solution is given by the Yule-Walker equations, which are closely related to the normal equations on page 680.
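To make the "standard linear regression" view concrete, here is a minimal sketch that simulates a classical AR(2) series and recovers its weights by ordinary least squares on lagged values. Note that this uses a generic least-squares solve rather than the Yule-Walker equations themselves; the coefficient values, series length, and function names are illustrative assumptions.

```python
import numpy as np

def simulate_ar(coeffs, n, noise_std=1.0, seed=0):
    """Sample a series with x_t = sum_j coeffs[j] * x_{t-1-j} + Gaussian noise."""
    rng = np.random.default_rng(seed)
    k = len(coeffs)
    x = np.zeros(n + k)                      # first k entries are zero-valued history
    for t in range(k, n + k):
        past = x[t - k:t][::-1]              # x_{t-1}, x_{t-2}, ..., x_{t-k}
        x[t] = np.dot(coeffs, past) + rng.normal(0.0, noise_std)
    return x[k:]

def fit_ar_least_squares(x, k):
    """Estimate AR(k) weights by regressing x_t on its k predecessors."""
    # Column j holds the lag-(j+1) values x_{t-1-j} for t = k, ..., len(x) - 1.
    lagged = np.column_stack([x[k - 1 - j:len(x) - 1 - j] for j in range(k)])
    targets = x[k:]
    weights, *_ = np.linalg.lstsq(lagged, targets, rcond=None)
    return weights

true_coeffs = np.array([0.6, -0.3])          # assumed values giving a stationary AR(2)
series = simulate_ar(true_coeffs, n=20_000)
estimated = fit_ar_least_squares(series, k=2)
```

With a series of this length, the estimated weights should land close to the coefficients used to generate the data.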
A deep autoregressive model is one in which the linear-Gaussian model is replaced by an arbitrary deep network with a suitable output layer depending on whether $x_t$ is discrete or continuous. Recent applications of this autoregressive approach include DeepMind’s WaveNet model for speech generation (van den Oord et al., 2016a). WaveNet is trained on raw acoustic signals, sampled 16,000 times per second, and implements a nonlinear AR model of order 4800 with a multilayer convolutional structure. In tests it proves to be substantially more realistic than previous state-of-the-art speech generation systems.
Generative adversarial networks

A generative adversarial network (GAN) is actually a pair of networks that combine to form a generative system. One of the networks, the generator, maps values from z to x in order to produce samples from the distribution $P_w(\mathbf{x})$. A typical scheme samples z from a unit Gaussian of moderate dimension and then passes it through a deep network $h_w$ to obtain x. The other network, the discriminator, is a classifier trained to classify inputs x as real (drawn from the training set) or fake (created by the generator). GANs are a kind of implicit model in the sense that samples can be generated but their probabilities are not readily available; in a Bayes net, on the other hand, the probability of a sample is just the product of the conditional probabilities along the sample generation path.
The generator is closely related to the decoder from the variational autoencoder framework. The challenge in implicit modeling is to design a loss function that makes it possible to train the model using samples from the distribution, rather than maximizing the likelihood assigned to training examples from the data set.
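The text does not commit to a particular loss. As one possible illustration, the sketch below uses the binary cross-entropy losses commonly associated with the original GAN formulation, together with the "non-saturating" generator loss; both are assumptions here, and many variants exist. The inputs are hypothetical discriminator outputs, interpreted as the estimated probability that a sample is real. Only samples pass through the losses; no likelihood under the generator is ever evaluated.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy: real inputs should score near 1, fakes near 0."""
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating loss: low when the discriminator scores fakes near 1."""
    return -np.mean(np.log(d_fake + eps))

# Hypothetical discriminator outputs D(x) in (0, 1).
confident = discriminator_loss(np.array([0.9, 0.95]), np.array([0.1, 0.05]))
# A discriminator that can only guess outputs 0.5 for everything.
fooled = discriminator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

When the discriminator outputs 0.5 on every input, its loss equals 2 log 2, which corresponds to random guessing.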
Both the generator and the discriminator are trained simultaneously, with the generator learning to fool the discriminator and the discriminator learning to accurately separate real from fake data. The competition between generator and discriminator can be described in the language of game theory (see Chapter 18). The idea is that in the equilibrium state of the game, the generator should reproduce the training distribution perfectly, such that the discriminator cannot perform better than random guessing. GANs have worked particularly well for image generation tasks. For example, GANs can create photorealistic, high-resolution images of people who have never existed (Karras et al., 2017).

Unsupervised translation
Translation tasks, broadly construed, consist of transforming an input x that has rich structure into an output y that also has rich structure.
In this context, “rich structure” means that the
data are multidimensional and have interesting statistical dependencies among the various
dimensions. Images and natural language sentences have a rich structure, but a single number, such as a class ID, does not. Transforming a sentence from English to French or converting a photo of a night scene into an equivalent photo taken during the daytime are both examples
of translation tasks.
Supervised translation consists of gathering many (x, y) pairs and training the model to map each x to the corresponding y. For example, machine translation systems are often trained on pairs of sentences that have been translated by professional human translators. For other kinds of translation, supervised training data may not be available. For example, consider a photo of a night scene containing many moving cars and pedestrians. It is presumably not feasible to find all of the cars and pedestrians and return them to their original positions in the nighttime photo in order to retake the same photo in the daytime. To overcome this difficulty, it is possible to use unsupervised translation techniques that are capable of training on many examples of x and many separate examples of y but no corresponding (x, y) pairs.

These approaches are generally based on GANs; for example, one can train a GAN generator to produce a realistic example of y when conditioned on x, and another GAN generator to perform the reverse mapping. The GAN training framework makes it possible to train a generator to generate any one of many possible samples that the discriminator accepts as a
realistic example of y given x, without any need for a specific paired y as is traditionally needed in supervised learning. More detail on unsupervised translation for images is given in Section 25.7.5.

21.7.2 Transfer learning and multitask learning
In transfer learning, experience with one learning task helps an agent learn better on another task. For example, a person who has already learned to play tennis will typically find it easier to learn related sports such as racquetball and squash; a pilot who has learned to fly one type of commercial passenger airplane will very quickly learn to fly another type; a student who has already learned algebra finds it easier to learn calculus.
We do not yet know the mechanisms of human transfer learning.
For neural networks,
learning consists of adjusting weights, so the most plausible approach for transfer learning is
to copy over the weights learned for task A to a network that will be trained for task B. The
weights are then updated by gradient descent in the usual way using data for task B. It may be a good idea to use a smaller learning rate in task B, depending on how similar the tasks are and how much data was used in task A.
Notice that this approach requires human expertise in selecting the tasks: for example,
weights learned during algebra training may not be very useful in a network intended for racquetball.
Also, the notion of copying weights requires a simple mapping between the
input spaces for the two tasks and essentially identical network architectures.
One reason for the popularity of transfer learning is the availability of high-quality pretrained models. For example, you could download a pretrained visual object recognition model such as the ResNet-50 model trained on the COCO data set, thereby saving yourself weeks of work. From there you can modify the model parameters by supplying additional images and object labels for your specific task.
Suppose you want to classify types of unicycles. You have only a few hundred pictures
of different unicycles, but the COCO data set has over 3,000 images in each of the categories
of bicycles, motorcycles, and skateboards. This means that a model pretrained on COCO
already has experience with wheels and roads and other relevant features that will be helpful
in interpreting the unicycle images.
Often you will want to freeze the first few layers of the pretrained model; these layers serve as feature detectors that will be useful for your new model. Your new data set will be allowed to modify the parameters of the higher levels only; these are the layers that identify problem-specific features and do classification. However, sometimes the difference between sensors means that even the lowest-level layers need to be retrained.
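The idea of copying weights and freezing the early layers can be sketched with a tiny two-layer network in NumPy. Everything here is a stand-in chosen for illustration: the "pretrained" first-layer weights are random rather than actually learned on a task A, the task B data are synthetic, and the learning rate is an arbitrary small value. Only the top layer is updated by gradient descent; the frozen layer acts as a fixed feature detector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights "learned on task A" (random stand-ins here).
W1_pretrained = rng.normal(size=(4, 8))      # feature-detector layer, to be frozen
W2 = rng.normal(size=(8, 1)) * 0.1           # task-specific layer, trainable

# Small synthetic data set for task B (binary classification).
X = rng.normal(size=(200, 4))
y = (X[:, :1] - X[:, 1:2] > 0).astype(float)

def forward(X, W1, W2):
    h = np.maximum(0.0, X @ W1)              # ReLU features from the frozen layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2)))      # sigmoid output probability
    return h, p

def cross_entropy(p, y, eps=1e-12):
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

W1 = W1_pretrained.copy()                    # copy the weights over from task A
initial_loss = cross_entropy(forward(X, W1, W2)[1], y)

lr = 0.05                                    # deliberately small learning rate for task B
for _ in range(300):
    h, p = forward(X, W1, W2)
    grad_W2 = h.T @ (p - y) / len(X)         # gradient for the top layer only
    W2 -= lr * grad_W2                       # W1 receives no update: it is frozen

final_loss = cross_entropy(forward(X, W1, W2)[1], y)
```

Unfreezing a layer would amount to computing its gradient as well and applying the same kind of update, which is what is needed when even the lowest-level features must be retrained.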
As another example, for those building a natural language system, it is now common to start with a pretrained model such as the RoBERTa model (see Section 24.6), which already “knows” a great deal about the vocabulary and syntax of everyday language. The next step is to fine-tune the model in two ways. First, by giving it examples of the specialized vocabulary used in the desired domain; perhaps a medical domain (where it will learn about “myocardial infarction”) or perhaps a financial domain (where it will learn about “fiduciary responsibility”). Second, by training the model on the task it is to perform. If it is to do question answering, train it on question/answer pairs.

One very important kind of transfer learning involves transfer between simulations and the real world. For example, the controller for a self-driving car can be trained on billions
of miles of simulated driving, which would be impossible in the real world. Then, when the controller is transitioned to the real vehicle, it adapts quickly to the new environment.

Multitask learning is a form of transfer learning in which we simultaneously train a model on multiple objectives. For example, rather than training a natural language system on part-of-speech tagging and then transferring the learned weights to a new task such as document classification, we train one system simultaneously on part-of-speech tagging, document classification, language detection, word prediction, sentence difficulty modeling, plagiarism detection, sentence entailment, and question answering. The idea is that to solve any one of these tasks, a model might be able to take advantage of superficial features of the data. But to solve all eight at once with a common representation layer, the model is more likely to create a common representation that reflects real natural language usage and content.
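A minimal sketch of the multitask setup, under heavy simplifying assumptions: two synthetic regression tasks stand in for the language tasks listed above, the shared representation is a single linear layer, and each task has its own linear output head. The key point is that the shared layer is updated with gradients from the sum of both task losses. All data, dimensions, and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 6))
# Two hypothetical tasks that depend on the same underlying feature of the input.
signal = X[:, 0] + X[:, 1]
y_a = 2.0 * signal                           # task A targets
y_b = -1.0 * signal + 0.5                    # task B targets

W_shared = rng.normal(size=(6, 3)) * 0.1     # common representation layer
head_a = rng.normal(size=3) * 0.1            # task A output weights
head_b = rng.normal(size=3) * 0.1            # task B output weights
bias_b = 0.0

def joint_loss():
    h = X @ W_shared
    return np.mean((h @ head_a - y_a) ** 2) + np.mean((h @ head_b + bias_b - y_b) ** 2)

initial_loss = joint_loss()
lr = 0.01
for _ in range(2000):
    h = X @ W_shared                         # shared features, shape (256, 3)
    err_a = h @ head_a - y_a                 # residuals for task A
    err_b = h @ head_b + bias_b - y_b        # residuals for task B
    # Gradients of the summed mean-squared errors.
    g_head_a = 2.0 * h.T @ err_a / len(X)
    g_head_b = 2.0 * h.T @ err_b / len(X)
    g_bias_b = 2.0 * np.mean(err_b)
    # The shared layer receives gradient contributions from both tasks.
    g_shared = 2.0 * X.T @ (np.outer(err_a, head_a) + np.outer(err_b, head_b)) / len(X)
    W_shared -= lr * g_shared
    head_a -= lr * g_head_a
    head_b -= lr * g_head_b
    bias_b -= lr * g_bias_b

final_loss = joint_loss()
```

Because both heads read from the same features, any representation that serves one task must also serve the other, which is the pressure toward a common representation that the text describes.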
21.8 Applications

Deep learning has been applied successfully to many important problem areas in AI. For in-depth explanations, we refer the reader to the relevant chapters: Chapter 22 for the use of deep learning in reinforcement learning systems, Chapter 24 for natural language processing, Chapter 25 (particularly Section 25.4) for computer vision, and Chapter 26 for robotics.

21.8.1 Vision
We begin with computer vision, which is the application area that has arguably had the biggest impact on deep learning, and vice versa. Although deep convolutional networks had been in use since the 1990s for tasks such as handwriting recognition, and neural networks had begun to surpass generative probability models for speech recognition by around 2010, it was the success of the AlexNet deep learning system in the 2012 ImageNet competition that propelled deep learning into the limelight. The ImageNet competition was a supervised learning task with 1,200,000 images in 1,000 different categories, and systems were evaluated on the “top-5” score: how often the correct category appears in the top five predictions. AlexNet achieved an error rate of 15.3%, whereas the next best system had an error rate of more than 25%. AlexNet had five convolutional layers interspersed with max-pooling layers, followed by three fully connected layers. It used ReLU activation functions and took advantage of GPUs to speed up the process of training 60 million weights.
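The top-5 score mentioned above is simple to compute from a matrix of per-category scores. The sketch below is a small illustration with made-up scores over 8 categories rather than ImageNet's 1,000; the function name and data are hypothetical.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k highest scores."""
    # argpartition places the indices of the k largest scores in the last k columns.
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = np.any(top_k == labels[:, None], axis=1)
    return 1.0 - np.mean(hits)

# Hypothetical scores for 3 images over 8 categories.
scores = np.array([
    [0.1, 0.9, 0.2, 0.3, 0.4, 0.5, 0.0, 0.6],   # true label 1 is ranked first
    [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],   # true label 4 is ranked fifth
    [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],   # true label 7 is ranked last
])
labels = np.array([1, 4, 7])
top5_error = top_k_error(scores, labels, k=5)
top1_error = top_k_error(scores, labels, k=1)
```

In this toy case the top-5 error is one in three while the top-1 error is two in three, which shows how much more forgiving the top-5 criterion is.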
Since 2012, with improvements in network design, training methods, and computing resources, the top-5 error rate has been reduced to less than 2%, well below the error rate of a trained human (around 5%). CNNs have been applied to a wide range of vision tasks, from self-driving cars to grading cucumbers.⁸ Driving, which is covered in Section 25.7.6 and in several sections of Chapter 26, is among the most demanding of vision tasks: not only must the algorithm detect, localize, track, and recognize pigeons, paper bags, and pedestrians, but it has to do it in real time with near-perfect accuracy.

⁸ The widely known tale of the Japanese cucumber farmer who built his own cucumber-sorting robot using TensorFlow is, it turns out, mostly mythical. The algorithm was developed by the farmer’s son, who worked previously as a software engineer at Toyota, and its low accuracy (about 70%) meant that the cucumbers still had to be sorted by hand (Zeeberg, 2017).
21.8.2 Natural language processing
Deep learning has also had a huge impact on natural language processing (NLP) applications, such as machine translation and speech recognition. Some advantages of deep learning for these applications include the possibility of end-to-end learning, the automatic generation of internal representations for the meanings of words, and the interchangeability of learned encoders and decoders.

End-to-end learning refers to the construction of entire systems as a single, learned function f. For example, an f for machine translation might take as input an English sentence $S_E$ and produce an equivalent Japanese sentence $S_J = f(S_E)$. Such an f can be learned from training data in the form of human-translated pairs of sentences (or even pairs of texts, where the alignment of corresponding sentences or phrases is part of the problem to be solved). A more classical pipeline approach might first parse $S_E$, then extract its meaning, then re-express the meaning in Japanese as $S_J$, then post-edit $S_J$ using a language model for Japanese. This pipeline approach has two major drawbacks:
first, errors are compounded at each stage; and
second, humans have to determine what constitutes a “parse tree” and a “meaning representation,” but there is no easily accessible ground truth for these notions, and our theoretical
ideas about them are almost certainly incomplete.
At our present stage of understanding, then, the classical pipeline approach (which, at least naively, seems to correspond to how a human translator works) is outperformed by the end-to-end method made possible by deep learning. For example, Wu et al. (2016b) showed that end-to-end translation using deep learning reduced translation errors by 60% relative to a previous pipeline-based system. As of 2020, machine translation systems are approaching human performance for language pairs such as French and English for which very large paired data sets are available, and they are usable for other language pairs covering the majority of Earth’s population. There is even some evidence that networks trained on multiple languages do in fact learn an internal meaning representation: for example, after learning to translate Portuguese to English and English to Spanish, it is possible to translate Portuguese directly into Spanish without any Portuguese/Spanish sentence pairs in the training set.
One of the most significant findings to emerge from the application of deep learning to language tasks is that a great deal of mileage comes from re-representing individual words as vectors in a high-dimensional space, so-called word embeddings (see Section 24.1). The vectors are usually extracted from the weights of the first hidden layer of a network trained on large quantities of text, and they capture the statistics of the lexical contexts in which words are used. Because words with similar meanings are used in similar contexts, they end up close to each other in the vector space. This allows the network to generalize effectively across categories of words, without the need for humans to predefine those categories. For example, a sentence beginning “John bought a watermelon and two pounds of ...” is likely to continue with “apples” or “bananas” but not with “thorium” or “geography.” Such a prediction is much easier to make if “apples” and “bananas” have similar representations in the internal layer.

21.8.3 Reinforcement learning
In reinforcement learning (RL), a decision-making agent learns from a sequence of reward
signals that provide some indication of the quality of its behavior. The goal is to optimize the
sum of future rewards. This can be done in several ways: in the terminology of Chapter 17,
the agent can learn a value function, a Q-function, a policy, and so on. From the point of view of deep learning, all these are functions that can be represented by computation graphs. For example, a value function in Go takes a board position as input and returns an estimate of how advantageous the position is for the agent. While the methods of training in RL differ from those of supervised learning, the ability of multilayer computation graphs to represent complex functions over large input spaces has proved to be very useful. The resulting field of research is called deep reinforcement learning.

In the 1950s, Arthur Samuel experimented with multilayer representations of value functions in his work on reinforcement learning for checkers, but he found that in practice a linear function approximator worked best. (This may have been a consequence of working with a computer roughly 100 billion times less powerful than a modern tensor processing unit.) The first major successful demonstration of deep RL was DeepMind’s Atari-playing agent, DQN (Mnih et al., 2013). Different copies of this agent were trained to play each of several different Atari video games, and demonstrated skills such as shooting alien spaceships, bouncing balls with paddles, and driving simulated racing cars. In each case, the agent learned a Q-function from raw image data with the reward signal being the game score. Subsequent work has produced deep RL systems that play at a superhuman level on the majority of the 57 different Atari games. DeepMind’s AlphaGo system also used deep RL to defeat the best human players at the game of Go (see Chapter 5).
Despite its impressive successes, deep RL still faces significant obstacles: it is often difficult to get good performance, and the trained system may behave very unpredictably if the environment differs even a little from the training data (Irpan, 2018).
Compared to
other applications of deep learning, deep RL is rarely applied in commercial settings. It is, nonetheless, a very active area of research.
Summary

This chapter described methods for learning functions represented by deep computational graphs. The main points were:

• Neural networks represent complex nonlinear functions with a network of parameterized linear-threshold units.
• The backpropagation algorithm implements a gradient descent in parameter space to minimize the loss function.
• Deep learning works well for visual object recognition, speech recognition, natural language processing, and reinforcement learning in complex environments.
• Convolutional networks are particularly well suited for image processing and other tasks where the data have a grid topology.
• Recurrent networks are effective for sequence-processing tasks including language modeling and machine translation.
Bibliographical and Historical Notes
The literature on neural networks is vast. Cowan and Sharp (1988b, 1988a) survey the early history, beginning with the work of McCulloch and Pitts (1943). (As mentioned in Chapter 1, John McCarthy has pointed to the work of Nicolas Rashevsky (1936, 1938) as the earliest mathematical model of neural learning.) Norbert Wiener, a pioneer of cybernetics and control theory (Wiener, 1948), worked with McCulloch and Pitts and influenced a number of young researchers, including Marvin Minsky, who may have been the first to develop a working neural network in hardware, in 1951 (see Minsky and Papert, 1988, pp. ix–xx). Alan Turing (1948) wrote a research report titled Intelligent Machinery that begins with the sentence “I propose to investigate the question as to whether it is possible for machinery to show intelligent behaviour” and goes on to describe a recurrent neural network architecture he called “B-type unorganized machines” and an approach to training them. Unfortunately, the report went unpublished until 1969, and was all but ignored until recently.
The perceptron, a one-layer neural network with a hard-threshold activation function, was popularized by Frank Rosenblatt (1957). After a demonstration in July 1958, the New York Times described it as “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” Rosenblatt (1960) later proved the perceptron convergence theorem, although it had been foreshadowed by purely mathematical work outside the context of neural networks (Agmon, 1954; Motzkin and Schoenberg, 1954). Some early work was also done on multilayer networks, including Gamba perceptrons (Gamba et al., 1961) and madalines (Widrow, 1962). Learning Machines (Nilsson, 1965) covers much of this early work and more.
The subsequent demise of early perceptron research efforts was hastened (or, the authors later claimed, merely explained) by the book Perceptrons (Minsky and Papert, 1969), which lamented the field’s lack of mathematical rigor. The book pointed out that single-layer perceptrons could represent only linearly separable concepts and noted the lack of effective learning algorithms for multilayer networks. These limitations were already well known (Hawkins, 1961) and had been acknowledged by Rosenblatt himself (Rosenblatt, 1962).

The papers collected by Hinton and Anderson (1981), based on a conference in San Diego in 1979, can be regarded as marking a renaissance of connectionism. The two-volume “PDP” (Parallel Distributed Processing) anthology (Rumelhart and McClelland, 1986) helped to spread the gospel, so to speak, particularly in the psychology and cognitive science communities. The most important development of this period was the backpropagation algorithm for training multilayer networks.
The backpropagation algorithm was discovered independently several times in different contexts (Kelley, 1960; Bryson, 1962; Dreyfus, 1962; Bryson and Ho, 1969; Werbos, 1974; Parker, 1985) and Stuart Dreyfus (1990) calls it the “Kelley-Bryson gradient procedure.” Although Werbos had applied it to neural networks, this idea did not become widely known until a paper by David Rumelhart, Geoff Hinton, and Ron Williams (1986) appeared in Nature giving a nonmathematical presentation of the algorithm. Mathematical respectability was enhanced by papers showing that multilayer feedforward networks are (subject to technical conditions) universal function approximators (Cybenko, 1988, 1989). The late 1980s and early 1990s saw a huge growth in neural network research: the number of papers mushroomed by a factor of 200 between 1980–84 and 1990–94.
In the late 1990s and early 2000s, interest in neural networks waned as other techniques such as Bayes nets, ensemble methods, and kernel machines came to the fore. Interest in deep models was sparked when Geoff Hinton’s research on deep Bayesian networks (generative models with category variables at the root and evidence variables at the leaves) began to bear fruit, outperforming kernel machines on small benchmark data sets (Hinton et al., 2006). Interest in deep learning exploded when Krizhevsky et al. (2013) used deep convolutional networks to win the ImageNet competition (Russakovsky et al., 2015).
Commentators often cite the availability of “big data” and the processing power of GPUs as the main contributing factors in the emergence of deep learning. Architectural improvements were also important, including the adoption of the ReLU activation function instead of the logistic sigmoid (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011) and later the development of residual networks (He et al., 2016).
On the algorithmic side, the use of stochastic gradient descent (SGD) with small batches was essential in allowing neural networks to scale to large data sets (Bottou and Bousquet, 2008). Batch normalization (Ioffe and Szegedy, 2015) also helped in making the training process faster and more reliable and has spawned several additional normalization techniques (Ba et al., 2016; Wu and He, 2018; Miyato et al., 2018). Several papers have studied the empirical behavior of SGD on large networks and large data sets (Dauphin et al., 2015; Choromanska et al., 2014; Goodfellow et al., 2015b). On the theoretical side, some progress has been made on explaining the observation that SGD applied to overparameterized networks often reaches a global minimum with a training error of zero, although so far the theorems to this effect assume a network with layers far wider than would ever occur in practice (Allen-Zhu et al., 2018; Du et al., 2018). Such networks have more than enough capacity to function as lookup tables for the training data.
The last piece of the puzzle, at least for vision applications, was the use of convolutional
networks. These had their origins in the descriptions of the mammalian visual system by neurophysiologists David Hubel and Torsten Wiesel (Hubel and Wiesel, 1959, 1962, 1968).
They described “simple cells” in the visual system of a cat that resemble edge detectors,
as well as “complex cells” that are invariant to some transformations such as small spatial
translations. In modern convolutional networks, the output of a convolution is analogous to a
simple cell while the output of a pooling layer is analogous to a complex cell. The work of Hubel and Wiesel inspired many of the early connectionist models of vision (Marr and Poggio, 1976). The neocognitron (Fukushima, 1980; Fukushima and Miyake,
1982), designed as a model of the visual cortex, was essentially a convolutional network in terms of model architecture, although an effective training algorithm for such networks had
to wait until Yann LeCun and collaborators showed how to apply backpropagation (LeCun
et al., 1995). One of the early commercial successes of neural networks was handwritten digit recognition using convolutional networks (LeCun et al., 1995).
Recurrent neural networks (RNNs) were commonly proposed as models of brain function
in the 1970s, but no effective learning algorithms were associated with these proposals. The
method of backpropagation through time appears in the PhD thesis of Paul Werbos (1974),
and his later review paper (Werbos, 1990) gives several additional references to rediscoveries of the method in the 1980s. One of the most influential early works on RNNs was due to Jeff Elman (1990), building on an RNN architecture suggested by Michael Jordan (1986).
Williams and Zipser (1989) present an algorithm for online learning in RNNs. Bengio et al. (1994) analyzed the problem of vanishing gradients in recurrent networks. The long short-term memory (LSTM) architecture (Hochreiter, 1991; Hochreiter and Schmidhuber, 1997; Gers et al., 2000) was proposed as a way of avoiding this problem. More recently, effective RNN designs have been derived automatically (Jozefowicz et al., 2015; Zoph and Le, 2016).
Many methods have been tried for improving generalization in neural networks. Weight decay was suggested by Hinton (1987) and analyzed mathematically by Krogh and Hertz
(1992). The dropout method is due to Srivastava et al. (2014a). Szegedy et al. (2013) introduced the idea of adversarial examples, spawning a huge literature.
Poole et al. (2017) showed that deep networks (but not shallow ones) can disentangle complex functions into flat manifolds in the space of hidden units. Rolnick and Tegmark
(2018) showed that the number of units required to approximate a certain class of polynomials
of n variables grows exponentially for shallow networks but only linearly for deep networks.

White et al. (2019) showed that their BANANAS system could do neural architecture search (NAS) by predicting the accuracy of a network to within 1% after training on just 200 random sample architectures. Zoph and Le (2016) use reinforcement learning to search the space of neural network architectures. Real et al. (2018) use an evolutionary algorithm to do model selection, Liu et al. (2017) use evolutionary algorithms on hierarchical representations, and Jaderberg et al. (2017) describe population-based training. Liu et al. (2019) relax the space of architectures to a continuous differentiable space and use gradient descent to find a locally optimal solution. Pham et al. (2018) describe the ENAS (Efficient Neural Architecture Search) system, which searches for optimal subgraphs of a larger graph. It is fast because it does not need to retrain parameters. The idea of searching for a subgraph goes back to the “optimal brain damage” algorithm of LeCun et al. (1990).
Despite this impressive array of approaches, there are critics who feel the field has not yet
matured. Yu et al. (2019) show that in some cases these NAS algorithms are no more efficient
than random architecture selection. For a survey of recent results in neural architecture search,
see Elsken et al. (2018).
Unsupervised learning constitutes a large subfield within statistics, mostly under the heading of density estimation. Silverman (1986) and Murphy (2012) are good sources for classical and modern techniques in this area. Principal components analysis (PCA) dates back to Pearson (1901); the name comes from independent work by Hotelling (1933). The probabilistic PCA model (Tipping and Bishop, 1999) adds a generative model for the principal components themselves. The variational autoencoder is due to Kingma and Welling (2013) and Rezende et al. (2014); Jordan et al. (1999) provide an introduction to variational methods for inference in graphical models.

For autoregressive models, the classic text is by Box et al. (2016). The Yule-Walker equations for fitting AR models were developed independently by Yule (1927) and Walker (1931). Autoregressive models with nonlinear dependencies were developed by several authors (Frey, 1998; Bengio and Bengio, 2001; Larochelle and Murray, 2011). The autoregressive WaveNet model (van den Oord et al., 2016a) was based on earlier work on autoregressive image generation (van den Oord et al., 2016b). Generative adversarial networks, or GANs, were first proposed by Goodfellow et al. (2015a), and have found many applications in AI. Some theoretical understanding of their properties is emerging, leading to improved GAN models and algorithms (Li and Malik, 2018b, 2018a; Zhu et al., 2019). Part of that understanding involves protecting against adversarial attacks (Carlini et al., 2019).
Several branches of research into neural networks have been popular in the past but are not actively explored today. Hopfield networks (Hopfield, 1982) have symmetric connections between each pair of nodes and can learn to store patterns in an associative memory, so that an entire pattern can be retrieved by indexing into the memory using a fragment of the pattern. Hopfield networks are deterministic; they were later generalized to stochastic Boltzmann machines (Hinton and Sejnowski, 1983, 1986). Boltzmann machines are possibly the earliest example of a deep generative model. The difficulty of inference in Boltzmann machines led to advances in both Monte Carlo techniques and variational techniques (see Section 13.4).
Research on neural networks for AI has also been intertwined to some extent with research into biological neural networks. The two topics coincided in the 1940s, and ideas for convolutional networks and reinforcement learning can be traced to studies of biological systems; but at present, new ideas in deep learning tend to be based on purely computational or statistical concerns. The field of computational neuroscience aims to build computational models that capture important and specific properties of actual biological systems. Overviews are given by Dayan and Abbott (2001) and Trappenberg (2010).
For modern neural nets and deep learning, the leading textbooks are those by Goodfellow et al. (2016) and Charniak (2018). There are also many hands-on guides associated with the various open-source software packages for deep learning. Three of the leaders of the field (Yann LeCun, Yoshua Bengio, and Geoff Hinton) introduced the key ideas to non-AI researchers in an influential Nature article (2015). The three were recipients of the 2018 Turing Award. Schmidhuber (2015) provides a general overview, and Deng et al. (2014) focus on signal processing tasks.
The primary publication venues for deep learning research are the conference on Neural Information Processing Systems (NeurIPS), the International Conference on Machine Learning (ICML), and the International Conference on Learning Representations (ICLR). The main journals are Machine Learning, the Journal of Machine Learning Research, and Neural Computation. Increasingly, because of the fast pace of research, papers appear first on arXiv.org and are often described in the research blogs of the major research centers.
CHAPTER 22
REINFORCEMENT LEARNING
In which we see how experiencing rewards and punishments can teach an agent how to maximize rewards in the future.
With supervised learning, an agent learns by passively observing example input/output pairs provided by a “teacher.” In this chapter, we will see how agents can actively learn from their own experience, without a teacher, by considering their own ultimate success or failure.
22.1
Learning from Rewards
Consider the problem of learning to play chess. Let’s imagine treating this as a supervised learning problem using the methods of Chapters 19–21. The chess-playing agent function takes as input a board position and returns a move, so we train this function by supplying examples of chess positions, each labeled with the correct move. Now, it so happens that we have available databases of several million grandmaster games, each a sequence of positions and moves. The moves made by the winner are, with few exceptions, assumed to be good, if not always perfect. Thus, we have a promising training set. The problem is that there are relatively few examples (about $10^8$) compared to the space of all possible chess positions (about $10^{40}$). In a new game, one soon encounters positions that are significantly different from those in the database, and the trained agent function is likely to fail miserably, not least because it has no idea of what its moves are supposed to achieve (checkmate) or even what effect the moves have on the positions of the pieces. And of course chess is a tiny part of the real world. For more realistic problems, we would need much vaster grandmaster databases, and they simply don’t exist.¹
An alternative is reinforcement learning (RL), in which an agent interacts with the world and periodically receives rewards (or, in the terminology of psychology, reinforcements) that reflect how well it is doing. For example, in chess the reward is 1 for winning, 0 for losing, and 1/2 for a draw. We have already seen the concept of rewards in Chapter 17 for Markov decision processes (MDPs). Indeed, the goal is the same in reinforcement learning: maximize the expected sum of rewards. Reinforcement learning differs from “just solving an MDP” because the agent is not given the MDP as a problem to solve; the agent is in the MDP. It may not know the transition model or the reward function, and it has to act in order to learn more. Imagine playing a new game whose rules you don’t know; after a hundred or so moves, the referee tells you “You lose.” That is reinforcement learning in a nutshell.
From our point of view as designers of AI systems, providing a reward signal to the agent is usually much easier than providing labeled examples of how to behave. First, the reward function is often (as we saw for chess) very concise and easy to specify: it requires only a few lines of code to tell the chess agent if it has won or lost the game or to tell the car-racing agent that it has won or lost the race or has crashed. Second, we don’t have to be experts, capable of supplying the correct action in any situation, as would be the case if we tried to apply supervised learning.
¹ As Yann LeCun and Alyosha Efros have pointed out, “the AI revolution will not be supervised.”
It turns out, however, that a little bit of expertise can go a long way in reinforcement learning. The two examples in the preceding paragraph (the win/loss rewards for chess and racing) are what we call sparse rewards, because in the vast majority of states the agent is given no informative reward signal at all. In games such as tennis and cricket, we can easily supply additional rewards for each point won or for each run scored. In car racing, we could reward the agent for making progress around the track in the right direction. When learning to crawl, any forward motion is an achievement. These intermediate rewards make learning much easier.
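To make the contrast between sparse and intermediate rewards concrete, here is a minimal Python sketch. The function names, the result labels, the progress bonus of 0.1, and the crash penalty of -1.0 are all assumptions chosen for illustration; only the chess values (1, 0, and 1/2) come from the text.

```python
def chess_reward(result):
    """Sparse reward for chess: informative only when the game is over.

    `result` is "win", "loss", "draw", or None while the game is still in
    progress (these labels are assumptions made for this sketch).
    """
    terminal_rewards = {"win": 1.0, "loss": 0.0, "draw": 0.5}
    # Every nonterminal position yields 0.0, so most states carry no signal.
    return terminal_rewards.get(result, 0.0)


def racing_reward(prev_progress, progress, crashed=False, bonus=0.1):
    """Intermediate reward for car racing: pay for forward progress.

    `prev_progress` and `progress` are distances traveled around the track
    in the right direction; `bonus` and the crash penalty are arbitrary.
    """
    if crashed:
        return -1.0
    return bonus * (progress - prev_progress)
```

The first function gives the learner feedback only at the end of a game, whereas the second gives a small signal on every step, which is the kind of extra guidance the paragraph above describes.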
As long as we can provide the correct reward signal to the agent, reinforcement learning provides a very general way to build AI systems. This is particularly true for simulated environments, where there is no shortage of opportunities to gain experience. The addition of deep learning as a tool within RL systems has also made new applications possible, including learning to play Atari video games from raw visual input (Mnih et al., 2013), controlling robots (Levine et al., 2016), and playing poker (Brown and Sandholm, 2017).
Literally hundreds of different reinforcement learning algorithms have been devised, and many of them can employ as tools a wide range of learning methods from Chapters 19–21. In this chapter, we cover the basic ideas and give some sense of the variety of approaches through a few examples. We categorize the approaches as follows:
• Model-based reinforcement learning: In these approaches the agent uses a transition model of the environment to help interpret the reward signals and to make decisions about how to act. The model may be initially unknown, in which case the agent learns the model from observing the effects of its actions, or it may already be known; for example, a chess program may know the rules of chess even if it does not know how to choose good moves. In partially observable environments, the transition model is also useful for state estimation (see Chapter 14). Model-based reinforcement learning systems often learn a utility function $U(s)$, defined (as in Chapter 17) in terms of the sum of rewards from state $s$ onward.²
• Model-free reinforcement learning: In these approaches the agent neither knows nor learns a transition model for the environment. Instead, it learns a more direct representation of how to behave. This comes in one of two varieties:
  ◦ Action-utility learning: We introduced action-utility functions in Chapter 17. The most common form of action-utility learning is Q-learning, where the agent learns a Q-function, or quality-function, $Q(s,a)$, denoting the sum of rewards from state $s$ onward if action $a$ is taken. Given a Q-function, the agent can choose what to do in $s$ by finding the action with the highest Q-value.
  ◦ Policy search: The agent learns a policy $\pi(s)$ that maps directly from states to actions. In the terminology of Chapter 2, this is a reflex agent.
² In the RL literature, which draws more on operations research than economics, utility functions are often called value functions and denoted $V(s)$.
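As a small illustration of how a Q-function is used to act, the following Python sketch picks the action with the highest Q-value in a state. The table layout (a dictionary keyed by state-action pairs), the default of 0.0 for unseen pairs, and the numbers in the example are assumptions of the sketch, not values from the text.

```python
def greedy_action(Q, state, actions):
    """Return the action with the highest Q-value in `state`.

    Q maps (state, action) pairs to estimated sums of rewards; pairs that
    have not been seen yet default to 0.0 (an assumption of this sketch).
    """
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


# Hypothetical Q-values for state (1,1) of a grid world.
Q = {((1, 1), "Up"): 0.71, ((1, 1), "Left"): 0.66, ((1, 1), "Right"): 0.62}
best = greedy_action(Q, (1, 1), ["Up", "Down", "Left", "Right"])
```

Note that no transition model appears anywhere in this computation, which is what makes the approach model-free.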
Figure 22.1 (a) The optimal policies for the stochastic environment with $R(s,a,s') = -0.04$ for transitions between nonterminal states. There are two policies because in state (3,1) both Left and Up are optimal. We saw this before in Figure 17.2. (b) The utilities of the states in the 4 × 3 world, given policy $\pi$.
We begin in Section 22.2 with passive reinforcement learning, where the agent’s policy $\pi$ is fixed and the task is to learn the utilities of states (or of state-action pairs); this could also involve learning a model of the environment. (An understanding of Markov decision processes, as described in Chapter 17, is essential for this section.) Section 22.3 covers active reinforcement learning, where the agent must also figure out what to do. The principal issue is exploration: an agent must experience as much as possible of its environment in order to learn how to behave in it. Section 22.4 discusses how an agent can use inductive learning (including deep learning methods) to learn much faster from its experiences. We also discuss other approaches that can help scale up RL to solve real problems, including providing intermediate pseudorewards to guide the learner and organizing behavior into a hierarchy of actions. Section 22.5 covers methods for policy search. In Section 22.6, we explore apprenticeship learning: training a learning agent using demonstrations rather than reward signals. Finally, Section 22.7 reports on applications of reinforcement learning.
22.2
Passive Reinforcement Learning
We start with the simple case of a fully observable environment with a small number of actions and states, in which an agent already has a fixed policy $\pi(s)$ that determines its actions. The agent is trying to learn the utility function $U^\pi(s)$, the expected total discounted reward if policy $\pi$ is executed beginning in state $s$. We call this a passive learning agent.
The passive learning task is similar to the policy evaluation task, part of the policy iteration algorithm described in Section 17.2.2. The difference is that the passive learning agent does not know the transition model $P(s' \mid s,a)$, which specifies the probability of reaching state $s'$ from state $s$ after doing action $a$; nor does it know the reward function $R(s,a,s')$, which specifies the reward for each transition.
We will use as our example the 4 × 3 world introduced in Chapter 17. Figure 22.1 shows the optimal policies for that world and the corresponding utilities. The agent executes a set of trials in the environment using its policy $\pi$. In each trial, the agent starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal states, (4,2) or (4,3). Its percepts supply both the current state and the reward received for the transition that just occurred to reach that state. Typical trials might look like this:
P20
0203
B e
6
a 1>‘““(12 (4,2)
Note that each transition is annotated with both the action taken and the reward received at the next state. The object is to use the information about rewards to learn the expected utility $U^\pi(s)$ associated with each nonterminal state $s$. The utility is defined to be the expected sum of (discounted) rewards obtained if policy $\pi$ is followed. As in Equation (17.2) on page 567, we write
$$U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(S_t, \pi(S_t), S_{t+1})\right], \tag{22.1}$$
where $R(S_t, \pi(S_t), S_{t+1})$ is the reward received when action $\pi(S_t)$ is taken in state $S_t$ and reaches state $S_{t+1}$. Note that $S_t$ is a random variable denoting the state reached at time $t$ when executing policy $\pi$, starting from state $S_0 = s$. We will include a discount factor $\gamma$ in all of our equations, but for the 4 × 3 world we will set $\gamma = 1$, which means no discounting.
22.2.1
Direct utility estimation
The idea of direct utility estimation is that the utility of a state is defined as the expected total reward from that state onward (called the expected reward-to-go), and that each trial provides a sample of this quantity for each state visited. For example, the first of the three trials shown earlier provides a sample total reward of 0.76 for state (1,1), two samples of 0.80 and 0.88 for (1,2), two samples of 0.84 and 0.92 for (1,3), and so on. Thus, at the end of each sequence, the algorithm calculates the observed reward-to-go for each state and updates the estimated utility for that state accordingly, just by keeping a running average for each state in a table. In the limit of infinitely many trials, the sample average will converge to the true expectation in Equation (22.1).
This means that we have reduced reinforcement learning to a standard supervised learning problem in which each example is a (state, reward-to-go) pair. We have a lot of powerful algorithms for supervised learning, so this approach seems promising, but it ignores an important constraint: The utility of a state is determined by the reward and the expected utility of the successor states. More specifically, the utility values obey the Bellman equations for a fixed policy (see also Equation (17.14)):
$$U_i(s) = \sum_{s'} P(s' \mid s, \pi_i(s))\,\left[R(s, \pi_i(s), s') + \gamma\, U_i(s')\right] \tag{22.2}$$
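The running-average scheme of direct utility estimation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the trial representation (a list of visited states plus a list of per-transition rewards) and the function name are invented here, and the single trial below is reconstructed so that it reproduces the sample values quoted in the text (0.76 for (1,1), 0.80 and 0.88 for (1,2), 0.84 and 0.92 for (1,3)).

```python
from collections import defaultdict


def direct_utility_estimation(trials, gamma=1.0):
    """Average the observed reward-to-go of every visit to every state.

    Each trial is a pair (states, rewards): states[t] is the state at
    time t and rewards[t] is the reward for the transition out of it,
    so len(states) == len(rewards) + 1.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for states, rewards in trials:
        reward_to_go = 0.0
        # Walk backward so each reward-to-go reuses the one after it.
        for t in range(len(rewards) - 1, -1, -1):
            reward_to_go = rewards[t] + gamma * reward_to_go
            totals[states[t]] += reward_to_go
            counts[states[t]] += 1
    return {s: totals[s] / counts[s] for s in totals}


# One trial of seven transitions: six at -0.04, then +1 on reaching (4,3).
trial = (
    [(1, 1), (1, 2), (1, 3), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)],
    [-0.04] * 6 + [1.0],
)
U = direct_utility_estimation([trial])
```

After this one trial the table holds the averages of the samples for each state, for example (0.84 + 0.92)/2 = 0.88 for (1,3). Each state is treated in isolation: nothing in the update relates the estimate for one state to the estimates for its successors.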
By ignoring the connections between states, direct utility estimation misses opportunities for learning. For example, the second of the three trials given earlier reaches the state (3,2), which has not previously been visited. The next transition reaches (3,3), which is known from the first trial to have a high utility. The Bellman equation suggests immediately that (3,2) is also likely to have a high utility, because it leads to (3,3), but direct utility estimation learns nothing until the end of the trial. More broadly, we can view direct utility estimation as searching for $U$ in a hypothesis space that is much larger than it needs to be, in that it includes many functions that violate the Bellman equations. For this reason, the algorithm often converges very slowly.
22.2.2
Adaptive dynamic programming
An adaptive dynamic programming (or ADP) agent takes advantage of the constraints among the utilities of states by learning the transition model that connects them and solving the corresponding Markov decision process using dynamic programming. For a passive learning agent, this means plugging the learned transition model $P(s' \mid s, \pi(s))$ and the observed rewards $R(s, \pi(s), s')$ into Equation (22.2) to calculate the utilities of the states. As we remarked in our discussion of policy iteration in Chapter 17, these Bellman equations are linear when the policy $\pi$ is fixed, so they can be solved using any linear algebra package. Alternatively, we can adopt the approach of modified policy iteration (see page 578), using a simplified value iteration process to update the utility estimates after each change to the learned model. Because the model usually changes only slightly with each observation, the value iteration process can use the previous utility estimates as initial values and typically converge very quickly.
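The two ingredients of a passive ADP learner, a transition model estimated from outcome counts and repeated Bellman updates for the fixed policy, can be sketched as follows. This is an illustrative sketch, not the agent program of Figure 22.2: the data layout, the function names, the fixed number of sweeps, and the tiny two-state example are all assumptions.

```python
def estimate_transition_model(counts):
    """Normalize outcome counts N[(s, a)][s2] into probabilities."""
    P = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        P[(s, a)] = {s2: n / total for s2, n in outcomes.items()}
    return P


def policy_evaluation(policy, P, R, U=None, gamma=1.0, sweeps=50):
    """Apply the fixed-policy Bellman update repeatedly.

    policy maps states to actions, P[(s, a)] maps successor states to
    probabilities, and R[(s, a, s2)] is the observed reward. Passing the
    previous table as U gives the warm start described in the text.
    States without an entry (for example terminal states) count as 0.
    """
    U = dict(U or {})
    for _ in range(sweeps):
        for s, a in policy.items():
            if (s, a) not in P:
                continue  # no observations yet for this state
            U[s] = sum(
                p * (R[(s, a, s2)] + gamma * U.get(s2, 0.0))
                for s2, p in P[(s, a)].items()
            )
    return U


# Hypothetical two-state chain: "a" sometimes stays put, "b" always ends.
counts = {("a", "go"): {"a": 1, "b": 4}, ("b", "go"): {"end": 3}}
P = estimate_transition_model(counts)
R = {("a", "go", "a"): -0.04, ("a", "go", "b"): -0.04, ("b", "go", "end"): 1.0}
U = policy_evaluation({"a": "go", "b": "go"}, P, R)
```

Here the estimates settle at 1.0 for "b" and 0.95 for "a", the solution of the two linear Bellman equations for this model. A linear solver would give the same answer directly; the iterative form is shown because it needs only the standard library.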
Learning the transition model is easy, because the environment is fully observable. This means that we have a supervised learning task where the input for each training example is a state-action pair, $(s,a)$, and the output is the resulting state, $s'$. The transition model $P(s' \mid s,a)$ is represented as a table and it is estimated directly from the counts that are accumulated in $N_{s'|s,a}$. The counts record how often state $s'$ is reached when executing $a$ in $s$. For example, in the three trials given on page 792, Right is executed four times in (3,3) and the resulting state is (3,2) twice and (4,3) twice, so $P((3,2) \mid (3,3), \mathit{Right})$ and $P((4,3) \mid (3,3), \mathit{Right})$ are both estimated to be 1/2.
The full agent program for a passive ADP agent is shown in Figure 22.2. Its performance on the 4 × 3 world is shown in Figure 22.3. In terms of how quickly its value estimates improve, the ADP agent is limited only by its ability to learn the transition model. In this sense, it provides a standard against which to measure any other reinforcement learning algorithms. It is, however, intractable for large state spaces. In backgammon, for example, it would involve solving roughly $10^{20}$ equations in $10^{20}$ unknowns.
22.2.3
Temporal-difference learning
Solving the underlying MDP as in the preceding section is not the only way to bring the Bellman equations to bear on the learning problem. Another way is to use the observed transitions to adjust the utilities of the observed states so that they agree with the constraint equations. Consider, for example, the transition from (1,3) to (2,3) in the second trial on page 792. Suppose that as a result of the first trial, the utility estimates are $U^\pi(1,3) = 0.88$ and $U^\pi(2,3) = 0.96$. Now, if this transition from (1,3) to (2,3) occurred all the time, we would expect the utilities to obey the equation
$$U^\pi(1,3) = -0.04 + U^\pi(2,3),$$
so $U^\pi(1,3)$ would be 0.92. Thus, its current estimate of 0.88 might be a little low and should be increased. More generally, when a transition occurs from state $s$ to state $s'$ via action $\pi(s)$,
function PASSIVE-ADP-LEARNER(percept) returns an action
  inputs: percept, a percept indicating the current state s′ and reward signal r
  persistent: π, a fixed policy
              mdp, an MDP with model P, rewards R, actions A, discount γ
              U, a table of utilities for states, initially empty
              N_{s′|s,a}, a table of outcome count vectors indexed by state and action, initially zero
              s, a, the previous state and action, initially null
  if s′ is new then U[s′] ← 0
  if s is not null then
      increment N_{s′|s,a}[s, a][s′]
      R[s, a, s′] ← r
      add a to A[s]
      P(· | s, a) ←
Figure 24.8 Beam search with beam size of b=2. The score of each word is the log-probability generated by the target RNN softmax, and the score of each hypothesis is the sum of the word scores. At timestep 3, the highest scoring hypothesis La entrada can only generate low-probability continuations, so it “falls off the beam.”
using a greedy decoder to translate into Spanish the English sentence we saw before, The front door is red. The correct translation is “La puerta de entrada es roja,” literally “The door of entry is red.” Suppose the target RNN correctly generates the first word La for The. Next, a greedy decoder might propose entrada for front. But this is an error: Spanish word order should put the noun puerta before the modifier. Greedy decoding is fast (it only considers one choice at each timestep and can do so quickly), but the model has no mechanism to correct mistakes.
We could try to improve the attention mechanism so that it always attends to the right word and guesses correctly every time. But for many sentences it is infeasible to guess correctly all the words at the start of the sentence until you have seen what’s at the end.
A better approach is to search for an optimal decoding (or at least a good one) using one of the search algorithms from Chapter 3. A common choice is a beam search (see Section 4.1.3). In the context of MT decoding, beam search typically keeps the top $k$ hypotheses at each stage, extending each by one word using the top $k$ choices of words, then chooses the best $k$ of the resulting $k^2$ new hypotheses. When all hypotheses in the beam generate the special end token, the algorithm outputs the highest scoring hypothesis.
A visualization of beam search is given in Figure 24.8. As deep learning models become more accurate, we can usually afford to use a smaller beam size. Current state-of-the-art neural MT models use a beam size of 4 to 8, whereas the older generation of statistical MT models would use a beam size of 100 or more.
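The decoding procedure described above can be sketched in Python as follows. This is a toy sketch under stated assumptions: the model is replaced by a hand-written table of log-probabilities (the numbers are invented and are not those of Figure 24.8), the end token is written "[end]", and the function and variable names are hypothetical.

```python
def beam_search(next_log_probs, k, end_token="[end]", max_len=10):
    """Keep the k best hypotheses, scoring each by summed log-probability.

    next_log_probs(words) returns a dict from candidate next words to
    their log-probabilities given the words generated so far.
    """
    beam = [(0.0, [])]  # (score, words)
    for _ in range(max_len):
        candidates = []
        for score, words in beam:
            if words and words[-1] == end_token:
                candidates.append((score, words))  # already finished
                continue
            options = sorted(
                next_log_probs(words).items(), key=lambda kv: kv[1], reverse=True
            )[:k]
            for word, log_prob in options:
                candidates.append((score + log_prob, words + [word]))
        # Keep the best k of the (at most) k * k extended hypotheses.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(words[-1] == end_token for _, words in beam):
            break
    return max(beam, key=lambda c: c[0])[1]


# Toy "model": a table from prefixes to next-word log-probabilities.
TABLE = {
    (): {"La": -0.1},
    ("La",): {"entrada": -0.5, "puerta": -0.9},
    ("La", "entrada"): {"[end]": -3.0},
    ("La", "puerta"): {"de": -0.2},
    ("La", "puerta", "de"): {"entrada": -0.1},
    ("La", "puerta", "de", "entrada"): {"[end]": -0.1},
}


def toy_model(words):
    return TABLE[tuple(words)]


greedy = beam_search(toy_model, k=1)
beamed = beam_search(toy_model, k=2)
```

With k=1 the search is greedy and commits to entrada after La, while k=2 keeps the puerta hypothesis alive long enough for it to overtake, mirroring the way a locally attractive hypothesis can fall off the beam.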
24.4
The Transformer Architecture
The influential article “Attention is all you need” (Vaswani et al., 2018) introduced the transformer architecture, which uses a self-attention mechanism that can model long-distance context without a sequential dependency.
24.4.1
Self-attention
Previously, in sequence-to-sequence models, attention was applied from the target RNN to the source RNN. Self-attention extends this mechanism so that each sequence of hidden states also attends to itself: the source to the source, and the target to the target. This allows the model to additionally capture long-distance (and nearby) context within each sequence.
The most straightforward way of applying self-attention is where the attention matrix is directly formed by the dot product of the input vectors. However, this is problematic. The dot product between a vector and itself will always be high, so each hidden state will be biased towards attending to itself. The transformer solves this by first projecting the input into three different representations using three different weight matrices:
• The query vector $q_i = W_q x_i$ is the one being attended from, like the target in the standard attention mechanism.
• The key vector $k_i = W_k x_i$ is the one being attended to, like the source in the basic attention mechanism.
• The value vector $v_i = W_v x_i$ is the context that is being generated.
In the standard attention mechanism, the key and value networks are identical, but intuitively it makes sense for these to be separate representations. The encoding results of the $i$th word, $c_i$, can be calculated by applying an attention mechanism to the projected vectors:
$$r_{ij} = (q_i \cdot k_j)/\sqrt{d}$$
$$a_{ij} = e^{r_{ij}} \Big/ \sum_k e^{r_{ik}}$$
$$c_i = \sum_j a_{ij} \cdot v_j$$
where $d$ is the dimension of $k$ and $q$. Note that $i$ and $j$ are indexes in the same sentence, since we are encoding the context using self-attention. In each transformer layer, self-attention uses the hidden vectors from the previous layer, which initially is the embedding layer.
There are several details worth mentioning here. First of all, the self-attention mechanism is asymmetric, as $r_{ij}$ is different from $r_{ji}$. Second, the scale factor $\sqrt{d}$ was added to improve numerical stability. Third, the encoding for all words in a sentence can be calculated simultaneously, as the above equations can be expressed using matrix operations that can be computed efficiently in parallel on modern specialized hardware.
The choice of which context to use is completely learned from training examples, not prespecified. The context-based summarization, $c_i$, is a sum over all previous positions in the sentence. In theory, any information from the sentence could appear in $c_i$, but in practice, sometimes important information gets lost, because it is essentially averaged out over the whole sentence. One way to address that is called multiheaded attention. We divide the sentence up into $m$ equal pieces and apply the attention model to each of the $m$ pieces. Each piece has its own set of weights. Then the results are concatenated together to form $c_i$. By concatenating rather than summing, we make it easier for an important subpiece to stand out.
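The query, key, and value projections and the three attention equations can be put together in a short, dependency-free Python sketch. It covers a single attention head over one sentence; the helper names and the toy inputs are assumptions, and plain lists are used instead of the parallel matrix operations a real implementation would rely on.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X, Wq, Wk, Wv):
    """Return one context vector c_i for every input vector x_i in X."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    d = len(K[0])
    contexts = []
    for q in Q:
        r = [dot(q, k) / math.sqrt(d) for k in K]  # r_ij
        a = softmax(r)                             # a_ij
        c = [sum(a_j * v[m] for a_j, v in zip(a, V)) for m in range(len(V[0]))]
        contexts.append(c)                         # c_i = sum_j a_ij * v_j
    return contexts


# Toy example: two 2-dimensional inputs and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
C = self_attention([[1.0, 0.0], [0.0, 1.0]], identity, identity, identity)
```

With identity projections each vector still attends mostly to itself, which is the bias that the learned matrices $W_q$, $W_k$, and $W_v$ are meant to counteract.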
24.4.2
From self-attention to transformer
Self-attention is only one component of the transformer model. Each transformer layer consists of several sublayers. At each transformer layer, self-attention is applied first. The output of the attention module is fed through feedforward layers, where the same feedforward weight matrices are applied independently at each position. A nonlinear activation function, typically ReLU, is applied after the first feedforward layer. In order to address the potential vanishing gradient problem, two residual connections are added into the transformer layer. A single-layer transformer is shown in Figure 24.9. In practice, transformer models usually have six or more layers. As with the other models that we’ve learned about, the output of layer $i$ is used as the input to layer $i+1$.
Chapter 24 Deep Learning for Natural Language Processing
Figure 24.9 A single-layer transformer consists of self-attention, a feedforward network, and residual connections.
Figure 24.10 Using the transformer architecture for POS tagging.
The transformer architecture does not explicitly capture the order of words in the sequence, since context is modeled only through self-attention, which is agnostic to word order. To capture the ordering of the words, the transformer uses a technique called positional embedding. If our input sequence has a maximum length of $n$, then we learn $n$ new embedding vectors, one for each word position. The input to the first transformer layer is the sum of the word embedding at position $t$ plus the positional embedding corresponding to position $t$.
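The summation of word and positional embeddings amounts to an elementwise addition per position, as in the following Python sketch. The table layouts, the function name, and the toy numbers are assumptions; in a trained model both tables would be learned.

```python
def first_layer_inputs(token_ids, word_embeddings, positional_embeddings):
    """Add the positional embedding for position t to the word embedding at t.

    word_embeddings maps a token id to its vector; positional_embeddings
    is a list with one vector per position, up to the maximum length n.
    """
    if len(token_ids) > len(positional_embeddings):
        raise ValueError("sequence is longer than the maximum length n")
    return [
        [w + p for w, p in zip(word_embeddings[tok], positional_embeddings[t])]
        for t, tok in enumerate(token_ids)
    ]


# Toy tables: two token ids, maximum length n = 2, dimension 2.
word_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}
positional_embeddings = [[0.5, 0.5], [0.25, 0.75]]
inputs = first_layer_inputs([1, 1], word_embeddings, positional_embeddings)
```

The same token appears twice in the example yet receives two different input vectors, which is how word order becomes visible to the otherwise order-agnostic self-attention.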
Figure 24.10 illustrates the transformer architecture for POS tagging, applied to the same sentence used in Figure 24.3. At the bottom, the word embedding and the positional embeddings are summed to form the input for a three-layer transformer. The transformer produces one vector per word, as in RNN-based POS tagging. Each vector is fed into a final output layer and softmax layer to produce a probability distribution over the tags.
In this section, we have actually only told half the transformer story: the model we described here is called the transformer encoder. It is useful for text classification tasks. The full transformer architecture was originally designed as a sequence-to-sequence model for machine translation. Therefore, in addition to the encoder, it also includes a transformer decoder. The encoder and decoder are nearly identical, except that the decoder uses a version of self-attention where each word can only attend to the words before it, since text is generated left-to-right. The decoder also has a second attention module in each transformer layer that attends to the output of the transformer encoder.
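The decoder's restricted form of self-attention can be illustrated by masking the attention scores before normalizing them. In this Python sketch, position $i$ is allowed to attend to positions $j \le i$; including the word itself is an assumption of the sketch (otherwise the first position would have nothing to attend to), as are the function name and the input layout.

```python
import math


def causal_attention_weights(scores):
    """Normalize scores[i][j] so that position i ignores every j > i.

    scores is a square list of lists of unnormalized attention scores;
    the result has the same shape, with zeros above the diagonal.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # only the current and earlier positions
        m = max(visible)
        exps = [math.exp(r - m) for r in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + hidden)
    return weights


W = causal_attention_weights([[0.0, 0.0, 0.0]] * 3)
```

With equal scores, the first position puts all of its weight on itself, the second splits its weight evenly over two positions, and the third over three, so no weight ever flows from a later word to an earlier one.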
24.5
Pretraining and Transfer Learning
Getting enough data to build a robust model can be a challenge. In computer vision (see Chapter 25), that challenge was addressed by assembling large collections of images (such as ImageNet) and hand-labeling them. For natural language, it is more common to work with text that is unlabeled. The difference is in part due to the difficulty of labeling: an unskilled worker can easily label an image as “cat” or “sunset,” but it requires extensive training to annotate a sentence with part-of-speech tags or parse trees. The difference is also due to the abundance of text: the Internet adds over 100 billion words of text each day, including digitized books, curated resources such as Wikipedia, and uncurated social media posts.
Projects such as Common Crawl provide easy access to this data. Any running text can be used to build n-gram or word embedding models, and some text comes with structure that can be helpful for a variety of tasks; for example, there are many FAQ sites with question-answer pairs that can be used to train a question-answering system. Similarly, many Web sites publish side-by-side translations of texts, which can be used to train machine translation systems. Some text even comes with labels of a sort, such as review sites where users annotate their text reviews with a 5-star rating system.
We would prefer not to have to go to the trouble of creating a new data set every time we want a new NLP model. In this section, we introduce the idea of pretraining: a form of transfer learning (see Section 21.7.2) in which we use a large amount of shared general-domain language data to train an initial version of an NLP model. From there, we can use a smaller amount of domain-specific data (perhaps including some labeled data) to refine the model. The refined model can learn the vocabulary, idioms, syntactic structures, and other linguistic phenomena that are specific to the new domain.
24.5.1
Pretrained word embeddings
In Section 24.1, we briefly introduced word embeddings. We saw how similar words like banana and apple end up with similar vectors, and we saw that we can solve analogy problems with vector subtraction. This indicates that the word embeddings are capturing substantial information about the words.
In this section we will dive into the details of how word embeddings are created using an entirely unsupervised process over a large corpus of text. That is in contrast to the embeddings from Section 24.1, which were built during the process of supervised part-of-speech tagging, and thus required POS tags that come from expensive hand annotation.
We will concentrate on one specific model for word embeddings, the GloVe (Global Vectors) model. The model starts by gathering counts of how many times each word appears within a window of another word, similar to the skip-gram model. First choose a window size (perhaps 5 words) and let $X_{ij}$ be the number of times that words $i$ and $j$ co-occur within a window, and let $X_i$ be the number of times word $i$ co-occurs with any other word. Let $P_{ij} = X_{ij}/X_i$ be the probability that word $j$ appears in the context of word $i$. As before, let $E_i$ be the word embedding for word $i$.
Part of the intuition of the GloVe model is that the relationship between two words can best be captured by comparing them both to other words. Consider the words ice and steam. Now consider the ratio of their probabilities of co-occurrence with another word, $w$, that is:
$$P_{w,\mathit{ice}} / P_{w,\mathit{steam}}$$
When $w$ is the word solid the ratio will be high (meaning solid applies more to ice) and when $w$ is the word gas it will be low (meaning gas applies more to steam). And when $w$ is a non-content word like the, a word like water that is equally relevant to both, or an equally irrelevant word like fashion, the ratio will be close to 1.
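The counting step that GloVe starts from can be sketched in Python as follows. This covers only the co-occurrence statistics $X_{ij}$, $X_i$, and $P_{ij}$, not the training of the embeddings; the function names, the symmetric window, and the eight-word toy corpus are assumptions made for illustration.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window=5):
    """Count X[i][j]: how often word j occurs within `window` words of word i."""
    X = defaultdict(lambda: defaultdict(int))
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for other in range(lo, hi):
            if other != pos:
                X[word][tokens[other]] += 1
    return X


def cooccurrence_prob(X, i, j):
    """P_ij = X_ij / X_i, the probability that word j appears in the context of word i."""
    total = sum(X[i].values())  # X_i
    return X[i].get(j, 0) / total if total else 0.0


tokens = ["solid", "ice", "cold", "gas", "steam", "hot", "solid", "ice"]
X = cooccurrence_counts(tokens, window=1)
p_solid_ice = cooccurrence_prob(X, "solid", "ice")
p_solid_steam = cooccurrence_prob(X, "solid", "steam")
```

On this tiny corpus solid co-occurs with ice but never with steam, so the two probabilities already differ sharply. A ratio of such probabilities only becomes meaningful on a corpus large enough that both counts are nonzero.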
The GloVe model starts with this intuition and goes through some mathematical reasoning (Pennington et al., 2014) that converts ratios of probabilities into vector differences and dot products, eventually arriving at the constraint
$$E_i \cdot E'_j = \log(P_{ij}).$$
In other words, the dot product of two word vectors is equal to the log probability of their co-occurrence. That makes intuitive sense: two nearly-orthogonal vectors have a dot product close to 0, and two nearly-identical normalized vectors have a dot product close to 1. There is a technical complication wherein the GloVe model creates two word embedding vectors for each word, $E_i$ and $E'_i$; computing the two and then adding them together at the end helps limit overfitting.
Training a model like GloVe is typically much less expensive than training a standard neural network: a new model can be trained from billions of words of text in a few hours using a standard desktop CPU.
It is possible to train word embeddings on a specific domain, and recover knowledge in that domain. For example, Tshitoyan et al. (2019) used 3.3 million scientific abstracts on the subject of material science to train a word embedding model. They found that, just as we saw that a generic word embedding model can answer “Athens is to Greece as Oslo is to what?” with “Norway,” their material science model can answer “NiFe is to ferromagnetic as IrMn is to what?” with “antiferromagnetic.”
Their model does not rely solely on co-occurrence of words; it seems to be capturing more complex scientific knowledge. When asked what chemical compounds can be classified as “thermoelectric” or “topological insulator,” their model is able to answer correctly.
For example, CsAgGa$_2$Se$_4$ never appears near “thermoelectric” in the corpus, but it does appear near “chalcogenide,” “band gap,” and “optoelectric,” which are all clues enabling it to be classified as similar to “thermoelectric.” Furthermore, when trained only on abstracts up to the year 2008 and asked to pick compounds that are “thermoelectric” but have not yet appeared in abstracts, three of the model’s top five picks were discovered to be thermoelectric in papers published between 2009 and 2019.
24.5.2 Pretrained contextual representations
Word embeddings are better representations than atomic word tokens, but there is an important issue with polysemous words. For example, the word rose can refer to a flower or the past tense of rise. Thus, we expect to find at least two entirely distinct clusters of word contexts for rose: one similar to flower names such as dahlia, and one similar to upsurge.
No
single embedding vector can capture both of these simultaneously. Rose is a clear example of a word with (at least) two distinct meanings, but other words have subtle shades of meaning that depend on context, such as the word need in you need to see this movie versus humans
need oxygen to survive. And some idiomatic phrases like break the bank are better analyzed as a whole rather than as component words.
Therefore, instead of just learning a word-to-embedding table, we want to train a model to
generate contextual representations of each word in a sentence. A contextual representation
maps both a word and the surrounding context of words into a word embedding vector. In
other words, if we feed this model the word rose and the context the gardener planted a rose bush, it should produce a contextual embedding that is similar (but not necessarily identical)
to the representation we get with the context the cabbage rose had an unusual fragrance, and very different from the representation of rose in the context the river rose five feet.
Figure 24.11 shows a recurrent network that creates contextual word embeddings (the boxes that are unlabeled in the figure). We assume we have already built a collection of non-contextual word embeddings. We feed in one word at a time, and ask the model to predict the next word. So for example in the figure at the point where we have reached the word “car,” the RNN node at that time step will receive two inputs: the non-contextual word embedding for “car” and the context, which encodes information from the previous words “The red.” The RNN node will then output a contextual representation for “car.” The network as a whole then outputs a prediction for the next word, “is.” We then update the network’s weights to minimize the error between the prediction and the actual next word.
This model is similar to the one for POS tagging in Figure 24.5, with two important differences. First, this model is unidirectional (left-to-right), whereas the POS model is bidirectional. Second, instead of predicting the POS tags for the current word, this model predicts the next word using the prior context. Once the model is built, we can use it to retrieve representations for words and pass them on to some other task: we need not continue to predict the next word. Note that computing a contextual representation always requires two inputs, the current word and the context.
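To make the "two inputs" point concrete, here is a minimal sketch of a recurrent step, assuming tiny hand-picked weights and 2-dimensional embeddings (`EMBED`, `W_X`, `W_H` and their values are arbitrary and untrained, and the next-word prediction layer is omitted). It only illustrates that the same word yields different output vectors in different contexts, not that the vectors are meaningful:

```python
import math

# Hypothetical non-contextual embeddings; the values are arbitrary.
EMBED = {
    "the": [0.1, 0.0],
    "gardener": [0.9, 0.2],
    "river": [-0.7, 0.5],
    "planted": [0.4, 0.3],
    "a": [0.0, 0.1],
    "rose": [0.5, 0.5],
}
W_X = [[0.6, -0.2], [0.1, 0.8]]   # input-to-hidden weights (arbitrary)
W_H = [[0.5, 0.3], [-0.4, 0.5]]   # hidden-to-hidden weights (arbitrary)

def matvec(M, v):
    """Multiply a small matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def contextual_representations(words):
    """Return one RNN output per word: h_t = tanh(W_X x_t + W_H h_{t-1})."""
    h = [0.0, 0.0]                 # empty context before the first word
    outputs = []
    for w in words:
        a = matvec(W_X, EMBED[w])  # contribution of the current word
        b = matvec(W_H, h)         # contribution of the context so far
        h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
        outputs.append(h)
    return outputs

flower = contextual_representations(["the", "gardener", "planted", "a", "rose"])[-1]
verb = contextual_representations(["the", "river", "rose"])[-1]
print(flower, verb)
```

Both calls end on the same table entry for rose, yet the two output vectors differ because the hidden state carried in from the earlier words differs.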
24.5.3 Masked language models
A weakness of standard language models such as n-gram models is that the contextualization of each word is based only on the previous words of the sentence. Predictions are made from left to right. But sometimes context from later in a sentence (for example, feet in the phrase rose five feet) helps to clarify earlier words.
Chapter 24 Deep Learning for Natural Language Processing
[Figure 24.11: diagram of the recurrent network described in the text. Word embeddings feed a chain of RNN nodes; each RNN output is a contextual representation, which a feedforward layer uses to predict the next word.]
Figure 26.11 A simple triangular robot that can translate, and needs to avoid a rectangular
obstacle. On the left is the workspace, on the right is the configuration space.
Once we have a path, the task of executing a sequence of actions to follow the path is called trajectory tracking control. A trajectory is a path that has a time associated with each point on the path. A path just says “go from A to B to C, etc.” and a trajectory says “start at A, take 1 second to get to B, and another 1.5 seconds to get to C, etc.”
26.5.1 Configuration space
Imagine a simple robot, R, in the shape of a right triangle as shown by the lavender triangle in the lower left corner of Figure 26.11. The robot needs to plan a path that avoids a rectangular obstacle, O. The physical space that a robot moves about in is called the workspace. This particular robot can move in any direction in the x-y plane, but cannot rotate. The figure shows five other possible positions of the robot with dashed outlines; these are each as close to the obstacle as the robot can get.
The body of the robot could be represented as a set of (x,y) points (or (x,y,z) points for a three-dimensional robot), as could the obstacle. With this representation, avoiding the obstacle means that no point on the robot overlaps any point on the obstacle. Motion planning would require calculations on sets of points, which can be complicated and time-consuming.
We can simplify the calculations by using a representation scheme in which all the points that comprise the robot are represented as a single point in an abstract multidimensional space, which we call the configuration space, or C-space. The idea is that the set of points that comprise the robot can be computed if we know (1) the basic measurements of the robot (for our triangle robot, the length of the three sides will do) and (2) the current pose of the robot, that is, its position and orientation.
For our simple triangular robot, two dimensions suffice for the C-space: if we know the (x,y) coordinates of a specific point on the robot (we’ll use the right-angle vertex) then we can calculate where every other point of the triangle is (because we know the size and shape of the triangle and because the triangle cannot rotate). In the lower-left corner of Figure 26.11, the lavender triangle can be represented by the configuration (0,0).
If we change the rules so that the robot can rotate, then we will need three dimensions, $(x,y,\theta)$, to be able to calculate where every point is. Here $\theta$ is the robot’s angle of rotation in the plane. If the robot also had the ability to stretch itself, growing uniformly by a scaling factor s, then the C-space would have four dimensions, $(x,y,\theta,s)$.
Chapter 26 Robotics
For now we’ll stick with the simple two-dimensional C-space of the non-rotating triangle robot. The next task is to figure out where the points in the obstacle are in C-space. Consider the five dashed-line triangles on the left of Figure 26.11 and notice where the right-angle vertex is on each of these. Then imagine all the ways that the triangle could slide about. Obviously, the right-angle vertex can’t go inside the obstacle, and neither can it get any closer than it is on any of the five dashed-line triangles. So you can see that the area where the right-angle vertex can’t go (the C-space obstacle) is the five-sided polygon on the right of Figure 26.11 labeled $C_{obs}$.
In everyday language we speak of there being multiple obstacles for the robot: a table, a chair, some walls. But the math notation is a bit easier if we think of all of these as combined into one “obstacle” that happens to have disconnected components. In general, the C-space obstacle is the set of all points $q$ in $C$ such that, if the robot were placed in that configuration, its workspace geometry would intersect the workspace obstacle.
Let the obstacles in the workspace be the set of points $O$, and let the set of all points on the robot in configuration $q$ be $A(q)$. Then the C-space obstacle is defined as
$C_{obs} = \{q : q \in C \text{ and } A(q) \cap O \neq \{\}\}$
and the free space is $C_{free} = C - C_{obs}$.
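The definition can be turned into a membership test for the translating triangle. The sketch below is one possible way to do it, assuming made-up dimensions for the robot and obstacle and using the separating-axis test for convex polygons (a technique chosen here for illustration; the text does not prescribe one). Shapes that merely touch are counted as intersecting:

```python
ROBOT = [(0, 0), (1, 0), (0, 1)]             # right triangle; right-angle vertex at q
OBSTACLE = [(2, 1), (4, 1), (4, 2), (2, 2)]  # hypothetical rectangle O

def translate(poly, q):
    """A(q) for a robot that can only translate: shift every vertex by q."""
    return [(x + q[0], y + q[1]) for x, y in poly]

def edge_normals(poly):
    """Yield a vector perpendicular to each edge of the polygon."""
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        yield (y1 - y2, x2 - x1)

def project(poly, axis):
    """Interval covered by the polygon when projected onto the axis."""
    dots = [x * axis[0] + y * axis[1] for x, y in poly]
    return min(dots), max(dots)

def convex_polygons_intersect(P, Q):
    """Separating-axis test: convex P and Q are disjoint iff some edge
    normal separates their projections."""
    for poly in (P, Q):
        for axis in edge_normals(poly):
            pmin, pmax = project(P, axis)
            qmin, qmax = project(Q, axis)
            if pmax < qmin or qmax < pmin:
                return False
    return True

def in_c_obs(q):
    """True if configuration q lies in C_obs, i.e. A(q) intersects O."""
    return convex_polygons_intersect(translate(ROBOT, q), OBSTACLE)

print(in_c_obs((0, 0)), in_c_obs((1.6, 0.6)), in_c_obs((2.5, 1.2)))
```

The configuration (1.6, 0.6) is worth noting: the right-angle vertex itself is outside the rectangle, but the hypotenuse overlaps the rectangle's corner, so the configuration still belongs to the C-space obstacle.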
The C-space becomes more interesting for robots with moving parts. Consider the two-link arm from Figure 26.12(a). It is bolted to a table so the base does not move, but the arm has two joints that move independently; we call these degrees of freedom (DOF). Moving the joints alters the (x,y) coordinates of the elbow, the gripper, and every point on the arm. The arm’s configuration space is two-dimensional: $(\theta_{shou}, \theta_{elb})$, where $\theta_{shou}$ is the angle of the shoulder joint, and $\theta_{elb}$ is the angle of the elbow joint. Knowing the configuration for our two-link arm means we can determine where each point on the arm is through simple trigonometry.
In general, the forward kinematics mapping is a function
$\phi_b : C \rightarrow W$
that takes in a configuration and outputs the location of a particular point b on the robot when the robot is in that configuration. A particularly useful forward kinematics mapping is that for the robot’s end effector, $\phi_{EE}$. The set of all points on the robot in a particular configuration $q$ is denoted by $A(q) \subset W$:
$A(q) = \bigcup_b \{\phi_b(q)\}$.
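For the two-link arm the "simple trigonometry" can be written out directly. This is a minimal sketch under stated assumptions: unit link lengths `L1` and `L2`, the shoulder at the origin, and the elbow angle measured relative to the upper arm (the figure does not fix these conventions). `arm_points` approximates $A(q)$ by a finite sample of body points.

```python
import math

L1, L2 = 1.0, 1.0   # hypothetical link lengths

def phi_elbow(q):
    """Forward kinematics for the elbow point."""
    th_shou, _ = q
    return (L1 * math.cos(th_shou), L1 * math.sin(th_shou))

def phi_end_effector(q):
    """Forward kinematics for the end effector (gripper)."""
    th_shou, th_elb = q
    ex, ey = phi_elbow(q)
    return (ex + L2 * math.cos(th_shou + th_elb),
            ey + L2 * math.sin(th_shou + th_elb))

def arm_points(q, n=10):
    """Finite sample of A(q): n+1 points along each of the two links."""
    ex, ey = phi_elbow(q)
    gx, gy = phi_end_effector(q)
    upper = [(ex * i / n, ey * i / n) for i in range(n + 1)]
    fore = [(ex + (gx - ex) * i / n, ey + (gy - ey) * i / n)
            for i in range(n + 1)]
    return upper + fore

print(phi_end_effector((0.0, math.pi / 2)))   # approximately (1, 1)
```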
The inverse problem, of mapping a desired location for a point on the robot to the configuration(s) the robot needs to be in for that to happen, is known as inverse kinematics:
$IK_b : x \in W \mapsto \{q \in C \text{ s.t. } \phi_b(q) = x\}$.
Sometimes the inverse kinematics mapping might take not just a position, but also a desired orientation as input. When we want a manipulator to grasp an object, for instance, we can compute a desired position and orientation for its gripper, and use inverse kinematics to determine a goal configuration for the robot. Then a planner needs to find a way to get the robot from its current configuration to the goal configuration without intersecting obstacles.
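For a planar two-link arm the inverse kinematics set can be computed in closed form with the law of cosines. The sketch below assumes the same conventions as before (unit link lengths, shoulder at the origin, elbow angle relative to the upper arm), ignores joint limits and obstacles, and handles position only, not orientation:

```python
import math

L1, L2 = 1.0, 1.0   # hypothetical link lengths

def phi_end_effector(q):
    """Forward kinematics for the end effector, used to check the result."""
    th_shou, th_elb = q
    return (L1 * math.cos(th_shou) + L2 * math.cos(th_shou + th_elb),
            L1 * math.sin(th_shou) + L2 * math.sin(th_shou + th_elb))

def inverse_kinematics(x, y):
    """Return the list of configurations q with phi_EE(q) = (x, y)."""
    c2 = (x * x + y * y - L1 * L1 - L2 * L2) / (2 * L1 * L2)
    if abs(c2) > 1:
        return []                      # target is out of reach
    solutions = []
    # The two elbow angles (elbow bent one way or the other); the set
    # collapses them into one when the arm is fully stretched.
    for th_elb in sorted({math.acos(c2), -math.acos(c2)}):
        th_shou = math.atan2(y, x) - math.atan2(
            L2 * math.sin(th_elb), L1 + L2 * math.cos(th_elb))
        solutions.append((th_shou, th_elb))
    return solutions

for q in inverse_kinematics(1.0, 1.0):
    print(q, phi_end_effector(q))
```

The target (1, 1) has two solutions, which is why the definition above returns a set of configurations rather than a single one.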
Workspace obstacles are often depicted as simple geometric forms—especially in robotics textbooks, which tend to focus on polygonal obstacles. But how do the obstacles look in configuration space?
Section 26.5 Planning and Control
Figure 26.12 (a) Workspace representation of a robot arm with two degrees of freedom. The workspace is a box with a flat obstacle hanging from the ceiling. (b) Configuration space of the same robot. Only white regions in the space are configurations that are free of collisions. The dot in this diagram corresponds to the configuration of the robot shown on the left.
For the two-link arm, simple obstacles in the workspace, like a vertical line, have very complex C-space counterparts, as shown in Figure 26.12(b). The different shadings of the occupied space correspond to the different objects in the robot’s workspace: the dark region surrounding the entire free space corresponds to configurations in which the robot collides with itself. It is easy to see that extreme values of the shoulder or elbow angles cause such a violation. The two oval-shaped regions on both sides of the robot correspond to the table on which the robot is mounted. The third oval region corresponds to the left wall.
Finally, the most interesting object in configuration space is the vertical obstacle that hangs from the ceiling and impedes the robot’s motions. This object has a funny shape in configuration space: it is highly nonlinear and at places even concave. With a little bit of imagination the reader will recognize the shape of the gripper at the upper left end.
We encourage the reader to pause for a moment and study this diagram. The shape of this obstacle in C-space is not at all obvious! The dot inside Figure 26.12(b) marks the configuration of the robot in Figure 26.12(a). Figure 26.13 depicts three additional configurations, both in workspace and in configuration space. In configuration conf1, the gripper is grasping the vertical obstacle.
We see that even if the robot’s workspace is represented by flat polygons, the shape of the free space can be very complicated. In practice, therefore, one usually probes a configuration space instead of constructing it explicitly. A planner may generate a configuration and then test to see if it is in free space by applying the robot kinematics and then checking for collisions in workspace coordinates.
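Probing a single configuration might look like the following sketch. It assumes unit link lengths, an axis-aligned rectangular obstacle with made-up coordinates, and an approximate check that samples points along the links (so a thin obstacle could slip between samples); none of these choices come from the figure.

```python
import math

L1, L2 = 1.0, 1.0                    # hypothetical link lengths
OBSTACLE = (0.9, 1.0, 1.1, 2.5)      # xmin, ymin, xmax, ymax (made up)

def arm_points(q, n=20):
    """Apply the kinematics: sample points along both links in configuration q."""
    th_shou, th_elb = q
    ex, ey = L1 * math.cos(th_shou), L1 * math.sin(th_shou)
    gx = ex + L2 * math.cos(th_shou + th_elb)
    gy = ey + L2 * math.sin(th_shou + th_elb)
    upper = [(ex * i / n, ey * i / n) for i in range(n + 1)]
    fore = [(ex + (gx - ex) * i / n, ey + (gy - ey) * i / n)
            for i in range(n + 1)]
    return upper + fore

def in_collision(q):
    """Check the sampled body points against the obstacle in workspace coordinates."""
    xmin, ymin, xmax, ymax = OBSTACLE
    return any(xmin <= x <= xmax and ymin <= y <= ymax
               for x, y in arm_points(q))

print(in_collision((0.0, 0.0)), in_collision((math.pi / 4, 0.0)))
```

The planner never needs the shape of the C-space obstacle; it only asks this yes/no question about individual configurations.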
Figure 26.13 Three robot configurations, shown in workspace and configuration space.
26.5.2 Motion planning
The motion planning problem is that of finding a plan that takes a robot from one configuration to another without colliding with an obstacle. It is a basic building block for movement and manipulation. In Section 26.5.4 we will discuss how to do this under complicated dynamics, like steering a car that may drift off the path if you take a curve too fast. For now, we will focus on the simple motion planning problem of finding a geometric path that is collision free. Motion planning is a quintessentially continuous-state search problem, but it is often possible to discretize the space and apply the search algorithms from Chapter 3.
The motion planning problem is sometimes referred to as the piano mover’s problem. It gets its name from a mover’s struggles with getting a large, irregular-shaped piano from one room to another without hitting anything. We are given:
• a workspace world $W$ in either $\mathbb{R}^2$ for the plane or $\mathbb{R}^3$ for three dimensions,
• an obstacle region $O \subset W$,
• a robot with a configuration space $C$ and set of points $A(q)$ for $q \in C$,
• a starting configuration $q_s \in C$, and
• a goal configuration $q_g \in C$.
The obstacle region induces a C-space obstacle $C_{obs}$ and its corresponding free space $C_{free}$ defined as in the previous section. We need to find a continuous path through free space. We will use a parameterized curve, $\tau(t)$, to represent the path, where $\tau(0) = q_s$ and $\tau(1) = q_g$, and $\tau(t)$ for every $t$ between 0 and 1 is some point in $C_{free}$. That is, $t$ parameterizes how far we are along the path, from start to goal. Note that $t$ acts somewhat like time in that as $t$ increases the distance along the path increases, but $t$ is always a point on the interval [0, 1] and is not measured in seconds.
Figure 26.14 A visibility graph. Lines connect every pair of vertices that can “see” each other (lines that don’t go through an obstacle). The shortest path must lie upon these lines.
The motion planning problem can be made more complex in various ways: defining the goal as a set of possible configurations rather than a single configuration; defining the goal in the workspace rather than the C-space; defining a cost function (e.g., path length) to be minimized; satisfying constraints (e.g., if the path involves carrying a cup of coffee, making sure that the cup is always oriented upright so the coffee does not spill).
The spaces of motion planning: Let’s take a step back and make sure we understand the spaces involved in motion planning. First, there is the workspace or world $W$. Points in $W$ are points in the everyday three-dimensional world. Next, we have the space of configurations, $C$. Points $q$ in $C$ are d-dimensional, with d the robot’s number of degrees of freedom, and map to sets of points $A(q)$ in $W$. Finally, there is the space of paths. The space of paths is a space of functions. Each point in this space maps to an entire curve through C-space. This space is ∞-dimensional! Intuitively, we need d dimensions for each configuration along the path, and there are as many configurations on a path as there are points in the number line interval [0, 1]. Now let’s consider some ways of solving the motion planning problem.
Visibility graphs
For the simplified case of two-dimensional configuration spaces and polygonal C-space obstacles, visibility graphs are a convenient way to solve the motion planning problem with a guaranteed shortest-path solution. Let $V_{obs} \subset C$ be the set of vertices of the polygons making up $C_{obs}$, and let $V = V_{obs} \cup \{q_s, q_g\}$.
We construct a graph $G = (V,E)$ on the vertex set $V$ with edges $e_{ij} \in E$ connecting a vertex $v_i$ to another vertex $v_j$ if the line connecting the two vertices is collision-free, that is, if $\{\lambda v_i + (1-\lambda) v_j : \lambda \in [0,1]\} \cap C_{obs} = \{\}$. When this happens, we say the two vertices “can see each other,” which is where “visibility” graphs got their name.
To solve the motion planning problem, all we need to do is run a discrete graph search (e.g., best-first search) on the graph $G$ with starting state $q_s$ and goal $q_g$. In Figure 26.14 we see a visibility graph and an optimal three-step solution. An optimal search on visibility graphs will always give us the optimal path (if one exists), or report failure if no path exists.
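The construction can be sketched in a few dozen lines. The code below is an illustration under simplifying assumptions, not a robust geometry library: obstacles are convex polygons listed counterclockwise, sliding along an obstacle edge counts as collision-free, degenerate cases (a segment passing exactly through a third vertex) are not handled, and uniform-cost search stands in for the best-first search mentioned above.

```python
import heapq
import math

def orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counterclockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def properly_cross(p1, p2, p3, p4):
    """True if segments p1p2 and p3p4 cross at a point interior to both."""
    d1, d2 = orient(p3, p4, p1), orient(p3, p4, p2)
    d3, d4 = orient(p1, p2, p3), orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0

def strictly_inside(pt, poly):
    """True if pt is in the interior of a convex, counterclockwise polygon."""
    n = len(poly)
    return all(orient(poly[i], poly[(i + 1) % n], pt) > 0 for i in range(n))

def visible(u, v, polygons):
    """Two vertices can see each other if the segment uv enters no polygon."""
    mid = ((u[0] + v[0]) / 2, (u[1] + v[1]) / 2)
    for poly in polygons:
        n = len(poly)
        if any(properly_cross(u, v, poly[i], poly[(i + 1) % n])
               for i in range(n)):
            return False
        if strictly_inside(mid, poly):   # catches diagonals of one polygon
            return False
    return True

def visibility_graph(start, goal, polygons):
    """Adjacency lists of G = (V, E), with V the obstacle vertices plus start and goal."""
    V = [start, goal] + [v for poly in polygons for v in poly]
    E = {v: [] for v in V}
    for i, u in enumerate(V):
        for v in V[i + 1:]:
            if visible(u, v, polygons):
                E[u].append(v)
                E[v].append(u)
    return E

def shortest_path(start, goal, polygons):
    """Uniform-cost search on the visibility graph; returns (path, length)."""
    E = visibility_graph(start, goal, polygons)
    frontier = [(0.0, start, [start])]
    expanded = set()
    while frontier:
        cost, u, path = heapq.heappop(frontier)
        if u == goal:
            return path, cost
        if u in expanded:
            continue
        expanded.add(u)
        for v in E[u]:
            if v not in expanded:
                heapq.heappush(frontier, (cost + math.dist(u, v), v, path + [v]))
    return None, math.inf

square = [(1, -1), (3, -1), (3, 1), (1, 1)]    # hypothetical C-space obstacle
path, length = shortest_path((0, 0), (4, 0), [square])
print(path, length)
```

In this example the straight line from start to goal is blocked, so the search returns a three-step path that rounds two corners of the square and runs along one of its edges, which illustrates the obstacle-hugging behavior of visibility-graph solutions.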
Voronoi diagrams
Visibility graphs encourage paths that run immediately adjacent to an obstacle: if you had to walk around a table to get to the door, the shortest path would be to stick as close to the table as possible. However, if motion or sensing is nondeterministic, that would put you at risk of bumping into the table. One way to address this is to pretend that the robot’s body is a bit larger than it actually is, providing a buffer zone. Another way is to accept that path length is not the only metric we want to optimize. Section 26.8.2 shows how to learn a good metric from human examples of behavior.
Figure 26.15 A Voronoi diagram showing the set of points (black lines) equidistant to two or more obstacles in configuration space.
A third way is to use a different technique, one that puts paths as far away from obstacles as possible rather than hugging close to them. A Voronoi diagram is a representation that allows us to do just that. To get an idea for what a Voronoi diagram does, consider a space where the obstacles are, say, a dozen small points scattered about a plane. Now surround each of the obstacle points with a region consisting of all the points in the plane that are closer to that obstacle point than to any other obstacle point. Thus, the regions partition the plane. The Voronoi diagram consists of the set of regions, and the Voronoi graph consists of the edges and vertices of the regions.
When obstacles are areas, not points, everything stays pretty much the same. Each region still contains all the points that are closer to one obstacle than to any other, where distance is measured to the closest point on an obstacle. The boundaries between regions still correspond to points that are equidistant between two obstacles, but now the boundary may be a curve rather than a straight line. Computing these boundaries can be prohibitively expensive in high-dimensional spaces.
To solve the motion planning problem, we connect the start point $q_s$ to the closest point on the Voronoi graph via a straight line, and the same for the goal point $q_g$. We then use discrete graph search to find the shortest path on the graph. For problems like navigating through corridors indoors, this gives a nice path that goes down the middle of the corridor. However, in outdoor settings it can come up with inefficient paths, for example suggesting an unnecessary 100-meter detour to stick to the middle of a wide-open 200-meter space.
Figure 26.16 (a) Value function and path found for a discrete grid cell approximation of the configuration space. (b) The same path visualized in workspace coordinates. Notice how the robot bends its elbow to avoid a collision with the vertical obstacle.
Cell decomposition
An alternative approach to motion planning is to discretize the C-space. Cell decomposition methods decompose the free space into a finite number of contiguous regions, called cells. These cells are designed so that the path-planning problem within a single cell can be solved by simple means (e.g., moving along a straight line). The path-planning problem then becomes a discrete graph search problem (as with visibility graphs and Voronoi graphs) to find a path through a sequence of cells.
The simplest cell decomposition consists of a regularly spaced grid. Figure 26.16(a) shows a square grid decomposition of the space and a solution path that is optimal for this grid size. Grayscale shading indicates the value of each free-space grid cell, that is, the cost of the shortest path from that cell to the goal. (These values can be computed by a deterministic form of the VALUE-ITERATION algorithm given in Figure 17.6 on page 573.) Figure 26.16(b) shows the corresponding workspace trajectory for the arm. Of course, we could also use the A* algorithm to find a shortest path.
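The grid values can be computed with a few sweeps over the cells. The following is a small sketch, assuming a hand-made 3×4 grid, unit step costs, and 4-connected moves (the figure's grid and connectivity are not specified here); it is a simplified stand-in for the deterministic value iteration mentioned above, followed by greedy descent on the values to read off a path.

```python
import math

GRID = [          # '.' is a free cell, '#' an occupied cell (made-up layout)
    "..#.",
    "..#.",
    "....",
]

def neighbors(cell, V):
    """4-connected free neighbors of a cell."""
    r, c = cell
    return [n for n in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
            if n in V]

def value_iteration(grid, goal):
    """Cost of the shortest path from every free cell to the goal."""
    V = {(r, c): math.inf
         for r, row in enumerate(grid)
         for c, ch in enumerate(row) if ch != "#"}
    V[goal] = 0
    changed = True
    while changed:                     # sweep until no value improves
        changed = False
        for cell in V:
            if cell == goal:
                continue
            best = min((V[n] + 1 for n in neighbors(cell, V)),
                       default=math.inf)
            if best < V[cell]:
                V[cell] = best
                changed = True
    return V

def extract_path(V, start):
    """Follow decreasing values from start to the goal; None if unreachable."""
    if V[start] == math.inf:
        return None
    path = [start]
    while V[path[-1]] > 0:
        path.append(min(neighbors(path[-1], V), key=V.get))
    return path

V = value_iteration(GRID, goal=(2, 3))
print(V[(0, 0)], extract_path(V, (0, 0)))
```

Because the value table covers every free cell, the same table yields a path from any start cell, which is one practical difference from running A* for a single start and goal.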
This grid decomposition has the advantage that it is simple to implement, but it suffers from three limitations. First, it is workable only for low-dimensional configuration spaces, because the number of grid cells increases exponentially with d, the number of dimensions. (Sounds familiar? This is the curse of dimensionality.) Second, paths through discretized state space will not always be smooth. We see in Figure 26.16(a) that the diagonal parts of the path are jagged and hence very difficult for the robot to follow accurately. The robot can attempt to smooth out the solution path, but this is far from straightforward.
Third, there is the problem of what to do with cells that are “mixed,” that is, neither entirely within free space nor entirely within occupied space. A solution path that includes such a cell may not be a real solution, because there may be no way to safely cross the cell. This would make the path planner unsound. On the other hand, if we insist that only completely free cells may be used, the planner will be incomplete, because it might be the case that the only paths to the goal go through mixed cells: it might be that a corridor is actually wide enough for the robot to pass, but the corridor is covered only by mixed cells.
The first approach to this problem is further subdivision of the mixed cells, perhaps using cells of half the original size. This can be continued recursively until a path is found that lies entirely within free cells. This method works well and is complete if there is a way to decide if a given cell is a mixed cell, which is easy only if the configuration space boundaries have relatively simple mathematical descriptions.
It is important to note that cell decomposition does not necessarily require explicitly representing the obstacle space $C_{obs}$. We can decide to include a cell or not by using a collision checker. This is a crucial notion to motion planning. A collision checker is a function $\gamma(q)$ that maps to 1 if the configuration collides with an obstacle, and 0 otherwise. It is much easier to check whether a specific configuration is in collision than to explicitly construct the entire obstacle space $C_{obs}$.
Examining the solution path shown in Figure 26.16(a), we can see an additional difficulty that will have to be resolved. The path contains arbitrarily sharp corners, but a physical robot has momentum and cannot change direction instantaneously. This problem can be solved by storing, for each grid cell, the exact continuous state (position and velocity) that was attained when the cell was reached in the search. Assume further that when propagating information to nearby grid cells, we use this continuous state as a basis, and apply the continuous robot motion model for jumping to nearby cells. So we don’t make an instantaneous 90° turn; we make a rounded turn governed by the laws of motion. We can now guarantee that the resulting trajectory is smooth and can indeed be executed by the robot. One algorithm that implements this is hybrid A*.
Randomized motion planning
Randomized motion planning does graph search on a random decomposition of the configuration space, rather than a regular cell decomposition. The key idea is to sample a random set of points and to create edges between them if there is a very simple way to get from one to the other (e.g., via a straight line) without colliding; then we can search on this graph.
A probabilistic roadmap (PRM) algorithm is one way to leverage this idea. We assume access to a collision checker $\gamma$ (defined above), and to a simple planner $B(q_1, q_2)$ that returns a path from $q_1$ to $q_2$ (or failure) but does so quickly. This simple planner is not going to be complete: it might return failure even if a solution actually exists. Its job is to quickly try to connect $q_1$ and $q_2$ and let the main algorithm know if it succeeds. We will use it to define whether an edge exists between two vertices.
The algorithm starts by sampling M milestones (points in $C_{free}$) in addition to the points $q_s$ and $q_g$. It uses rejection sampling, where configurations are sampled randomly and collision-checked using $\gamma$ until a total of M milestones are found. Next, the algorithm uses the simple planner to try to connect pairs of milestones. If the simple planner returns success, then an edge between the pair is added to the graph; otherwise, the graph remains as is. We try to connect each milestone either to its k nearest neighbors (we call this k-PRM), or to all milestones in a sphere of a radius r. Finally, the algorithm searches for a path on this graph from $q_s$ to $q_g$. If no path is found, then M more milestones are sampled, added to the graph, and the process is repeated.
Figure 26.17 The probabilistic roadmap (PRM) algorithm. Top left: the start and goal configurations. Top right: sample M collision-free milestones (here M = 5). Bottom left: connect each milestone to its k nearest neighbors (here k = 3). Bottom right: find the shortest path from the start to the goal on the resulting graph.
Figure 26.17 shows a roadmap with the path found between two configurations. PRMs are not complete, but they are what is called probabilistically complete: they will eventually find a path, if one exists. Intuitively, this is because they keep sampling more milestones. PRMs work well even in high-dimensional configuration spaces.
PRMs are also popular for multi-query planning, in which we have multiple motion planning problems within the same C-space. Often, once the robot reaches a goal, it is called upon to reach another goal in the same workspace. PRMs are really useful, because the robot can dedicate time up front to constructing a roadmap, and amortize the use of that roadmap over multiple queries.
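The k-PRM loop described above can be sketched as follows. The setting is an assumption made for illustration: a point robot in the unit square with one disc obstacle, a straight-line simple planner that checks sampled points along the segment, and uniform-cost search on the roadmap. For brevity the roadmap is rebuilt from scratch each round rather than extended incrementally.

```python
import heapq
import math
import random

OBSTACLE_CENTER, OBSTACLE_RADIUS = (0.5, 0.5), 0.2   # hypothetical disc obstacle

def gamma(q):
    """Collision checker: 1 if configuration q collides, 0 otherwise."""
    return 1 if math.dist(q, OBSTACLE_CENTER) <= OBSTACLE_RADIUS else 0

def simple_planner(q1, q2, step=0.01):
    """Straight-line planner: succeeds if sampled points on the segment are free."""
    n = max(1, int(math.dist(q1, q2) / step))
    return all(
        gamma((q1[0] + (q2[0] - q1[0]) * i / n,
               q1[1] + (q2[1] - q1[1]) * i / n)) == 0
        for i in range(n + 1))

def sample_milestones(M, rng):
    """Rejection sampling: keep drawing until M collision-free points are found."""
    milestones = []
    while len(milestones) < M:
        q = (rng.random(), rng.random())
        if gamma(q) == 0:
            milestones.append(q)
    return milestones

def build_roadmap(nodes, k):
    """Try to connect each node to its k nearest neighbors."""
    edges = {q: set() for q in nodes}
    for q in nodes:
        nearest = sorted((p for p in nodes if p != q),
                         key=lambda p: math.dist(p, q))[:k]
        for p in nearest:
            if simple_planner(q, p):
                edges[q].add(p)
                edges[p].add(q)
    return edges

def graph_search(edges, start, goal):
    """Uniform-cost search on the roadmap; returns a list of nodes or None."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, q, path = heapq.heappop(frontier)
        if q == goal:
            return path
        if q in visited:
            continue
        visited.add(q)
        for p in edges[q]:
            if p not in visited:
                heapq.heappush(frontier, (cost + math.dist(q, p), p, path + [p]))
    return None

def prm(start, goal, M=50, k=6, max_rounds=10, seed=0):
    """Sample M milestones per round until the roadmap connects start and goal."""
    rng = random.Random(seed)
    nodes = [start, goal]
    for _ in range(max_rounds):
        nodes += sample_milestones(M, rng)
        path = graph_search(build_roadmap(nodes, k), start, goal)
        if path is not None:
            return path
    return None

path = prm((0.1, 0.1), (0.9, 0.9))
print(len(path), "nodes on the roadmap path")
```

For multi-query use, the result of `build_roadmap` would be kept and only the new start and goal connected to it for each query.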
Rapidly-exploring random trees
An extension of PRMs called rapidly exploring random trees (RRTs) is popular for single-query planning. We incrementally build two trees, one with $q_s$ as the root and one with $q_g$ as the root. Random milestones are chosen, and an attempt is made to connect each new milestone to the existing trees. If a milestone connects both trees, that means a solution has been found, as in Figure 26.18. If not, the algorithm finds the closest point in each tree and adds to the tree a new edge that extends from the point by a distance $\delta$ towards the milestone. This tends to grow the tree towards previously unexplored sections of the space.
Figure 26.18 The bidirectional RRT algorithm constructs two trees (one from the start, the other from the goal) by incrementally connecting each sample to the closest node in each tree, if the connection is possible. When a sample connects to both trees, that means we have found a solution path.
Figure 26.19 Snapshots of a trajectory produced by an RRT and postprocessed with shortcutting. Courtesy of Anca Dragan.
Roboticists love RRTs for their ease of use. However, RRT solutions are typically nonoptimal and lack smoothness. Therefore, RRTs are often followed by a postprocessing step. The most common one is “shortcutting,” in which we randomly select one of the vertices on the solution path and try to remove it by connecting its neighbors to each other (via the simple planner). We do this repeatedly for as many steps as we have compute time for. Even then, the trajectories might look a little unnatural due to the random positions of the milestones that were selected, as shown in Figure 26.19.
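Tree growth and shortcutting can be sketched together. Note the simplifications, which are choices made for this illustration: a single tree grown from the start with occasional sampling of the goal itself (goal bias) replaces the two-tree version described above, the robot is a point in the unit square with one disc obstacle, and the simple planner is a sampled straight-line check.

```python
import math
import random

OBSTACLE_CENTER, OBSTACLE_RADIUS = (0.5, 0.5), 0.2   # hypothetical disc obstacle

def gamma(q):
    """Collision checker: 1 if q collides, 0 otherwise."""
    return 1 if math.dist(q, OBSTACLE_CENTER) <= OBSTACLE_RADIUS else 0

def simple_planner(q1, q2, step=0.01):
    """Straight-line planner based on sampled collision checks."""
    n = max(1, int(math.dist(q1, q2) / step))
    return all(
        gamma((q1[0] + (q2[0] - q1[0]) * i / n,
               q1[1] + (q2[1] - q1[1]) * i / n)) == 0
        for i in range(n + 1))

def steer(q_near, q_rand, delta):
    """Move from q_near towards q_rand by at most delta."""
    d = math.dist(q_near, q_rand)
    if d <= delta:
        return q_rand
    return (q_near[0] + delta * (q_rand[0] - q_near[0]) / d,
            q_near[1] + delta * (q_rand[1] - q_near[1]) / d)

def rrt(start, goal, delta=0.05, iters=5000, goal_bias=0.1, seed=0):
    """Single-tree RRT; returns a path from start to goal or None."""
    rng = random.Random(seed)
    parent = {start: None}
    for _ in range(iters):
        q_rand = goal if rng.random() < goal_bias else (rng.random(), rng.random())
        q_near = min(parent, key=lambda q: math.dist(q, q_rand))
        q_new = steer(q_near, q_rand, delta)
        if q_new in parent or not simple_planner(q_near, q_new):
            continue
        parent[q_new] = q_near
        if q_new == goal:              # walk back up the tree
            path, q = [], goal
            while q is not None:
                path.append(q)
                q = parent[q]
            return path[::-1]
    return None

def shortcut(path, attempts=200, seed=0):
    """Repeatedly pick a random interior vertex and drop it if its
    neighbors can be connected directly."""
    rng = random.Random(seed)
    path = list(path)
    for _ in range(attempts):
        if len(path) < 3:
            break
        i = rng.randrange(1, len(path) - 1)
        if simple_planner(path[i - 1], path[i + 1]):
            del path[i]
    return path

raw = rrt((0.1, 0.1), (0.9, 0.9))
short = shortcut(raw)
print(len(raw), "vertices before shortcutting,", len(short), "after")
```

Removing a vertex can never lengthen the path (triangle inequality), so shortcutting only ever shortens the raw tree path, though the result is generally still not the optimal path.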
RRT* is a modification to RRT that makes the algorithm asymptotically optimal: the
solution converges to the optimal solution as more and more milestones are sampled. The key idea is to pick the nearest neighbor based on a notion of cost to come rather than distance
from the milestone only, and to rewire the tree, swapping parents of older vertices if it is
cheaper to reach them via the new milestone.
Trajectory optimization for kinematic planning
Randomized sampling algorithms tend to first construct a complex but feasible path and then optimize it. Trajectory optimization does the opposite: it starts with a simple but infeasible path, and then works to push it out of collision. The goal is to find a path that optimizes a cost function¹ over paths. That is, we want to minimize the cost function $J(\tau)$, where $\tau(0) = q_s$ and $\tau(1) = q_g$.
$J$ is called a functional because it is a function over functions. The argument to $J$ is $\tau$, which is itself a function: $\tau(t)$ takes as input a point in the [0, 1] interval and maps it to a configuration. A standard cost functional trades off between two important aspects of the robot’s motion: collision avoidance and efficiency,
$J = J_{obs} + \lambda J_{eff}$
where the efficiency $J_{eff}$ measures the length of the path and may also measure smoothness. A convenient way to define efficiency is with a quadratic: it integrates the squared first derivative of $\tau$ (we will see in a bit why this does in fact incentivize short paths):
$J_{eff} = \int_0^1 \frac{1}{2} \|\dot{\tau}(s)\|^2 \, ds$.
For the obstacle term, assume we can compute the distance $d(x)$ from any point $x \in W$ to the nearest obstacle edge. This distance is positive outside of obstacles, 0 at the edge, and negative inside. This is called a signed distance field. We can now define a cost field in the workspace, call it $c$, that has high cost inside of obstacles, and a small cost right outside. With this cost, we can make points in the workspace really hate being inside obstacles, and dislike being right next to them (avoiding the visibility graph problem of their always hanging out by the edges of obstacles). Of course, our robot is not a point in the workspace, so we have some more work to do: we need to consider all points b on the robot’s body:
$J_{obs} = \int_0^1 \int_b c(\phi_b(\tau(s))) \left\| \frac{d}{ds} \phi_b(\tau(s)) \right\| \, db \, ds$.
This is called a path integral: it does not just integrate $c$ along the way for each body point, but it multiplies by the derivative to make the cost invariant to retiming of the path. Imagine a robot sweeping through the cost field, accumulating cost as it moves. Regardless of how fast or slow the arm moves through the field, it must accumulate the exact same cost.
The simplest way to solve the optimization problem above and find a path is gradient descent. If you are wondering how to take gradients of functionals with respect to functions, something called the calculus of variations is here to help. It is especially easy for functionals of the form
$J[\tau] = \int_0^1 F(s, \tau(s), \dot{\tau}(s)) \, ds$
which are integrals of functions that depend just on the parameter s, the value of the function at s, and the derivative of the function at s. In such a case, the Euler-Lagrange equation says that the gradient is
$\nabla_\tau J(s) = \frac{\partial F}{\partial \tau(s)}(s) - \frac{d}{ds} \frac{\partial F}{\partial \dot{\tau}(s)}(s)$.
If we look closely at $J_{\text{eff}}$ and $J_{\text{obs}}$, they both follow this pattern. In particular, for $J_{\text{eff}}$ we have $F(s, \tau(s), \dot{\tau}(s)) = \frac{1}{2}\|\dot{\tau}(s)\|^2$. To get a bit more comfortable with this, let’s compute the gradient for $J_{\text{eff}}$ only. We see that $F$ does not have a direct dependence on $\tau(s)$, so the first term in the formula is 0. We are left with
$$\nabla_\tau J(s) = 0 - \frac{d}{ds} \dot{\tau}(s)$$
since the partial of $F$ with respect to $\dot{\tau}(s)$ is $\dot{\tau}(s)$.

1 Roboticists like to minimize a cost function J, whereas in other parts of AI we try to maximize a utility function U or a reward R.

Figure 26.20 Trajectory optimization for motion planning. Two point obstacles with circular bands of decreasing cost around them. The optimizer starts with the straight line trajectory, and lets the obstacles bend the line away from collisions, finding the minimum path through the cost field.
Notice how we made things easier for ourselves when defining $J_{\text{eff}}$: it’s a nice quadratic of the derivative (and we even put a $\frac{1}{2}$ in front so that the 2 nicely cancels out). In practice, you will see this trick happen a lot for optimization: the art is not just in choosing how to optimize the cost function, but also in choosing a cost function that will play nicely with how you will optimize it. Simplifying our gradient, we get
$$\nabla_\tau J(s) = -\ddot{\tau}(s).$$
Now, since $J$ is a quadratic, setting this gradient to 0 gives us the solution for $\tau$ if we didn’t have to deal with obstacles. Integrating once, we get that the first derivative needs to be constant; integrating again we get that $\tau(s) = a \cdot s + b$, with $a$ and $b$ determined by the endpoint constraints for $\tau(0)$ and $\tau(1)$. The optimal path with respect to $J_{\text{eff}}$ is thus the straight line from start to goal! It is indeed the most efficient way to go from one to the other if there are no obstacles to worry about.
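The straight-line result can also be seen numerically. The sketch below discretizes a path into waypoints, keeps the endpoints fixed, and runs gradient descent on a finite-difference version of $J_{\text{eff}}$ alone. The waypoint count, the step size, and the initial detour are assumed values chosen only for illustration.

```python
import math

def efficiency_gradient(path):
    """Finite-difference analogue of -tau''(s) at each interior waypoint
    (up to a constant factor that is absorbed into the step size)."""
    grad = []
    for i in range(1, len(path) - 1):
        grad.append(tuple(2 * path[i][d] - path[i - 1][d] - path[i + 1][d]
                          for d in range(len(path[i]))))
    return grad

def descend(path, step=0.25, iterations=2000):
    """Gradient descent on the interior waypoints; the endpoints never move."""
    path = list(path)
    for _ in range(iterations):
        grad = efficiency_gradient(path)          # computed from the current path
        for i, g in enumerate(grad, start=1):
            path[i] = tuple(p - step * gd for p, gd in zip(path[i], g))
    return path

# An initial path from (0, 0) to (1, 0) that takes a detour through y > 0.
n = 10
initial = [(i / n, math.sin(math.pi * i / n)) for i in range(n + 1)]
final = descend(initial)
```

After the descent, the waypoints in `final` lie (to within numerical precision) on the straight segment between the two endpoints, evenly spaced, matching $\tau(s) = a \cdot s + b$.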
Of course, the addition of $J_{\text{obs}}$ is what makes things difficult, and we will spare you deriving its gradient here. The robot would typically initialize its path to be a straight line, which would plow right through some obstacles. It would then calculate the gradient of the cost about the current path, and the gradient would serve to push the path away from the obstacles (Figure 26.20). Keep in mind that gradient descent will only find a locally optimal solution, just like hill climbing. Methods such as simulated annealing (Section 4.1.2) can be used for exploration, to make it more likely that the local optimum is a good one.

26.5.3 Trajectory tracking control
We have covered how to plan motions, but not how to actually move: to apply current to motors, to produce torque, to move the robot. This is the realm of control theory, a field of increasing importance in AI. There are two main questions to deal with: how do we turn a mathematical description of a path into a sequence of actions in the real world (open-loop control), and how do we make sure that we are staying on track (closed-loop control)?

Figure 26.21 The task of reaching to grasp a bottle solved with a trajectory optimizer. Left: the initial trajectory, plotted for the end effector. Middle: the final trajectory after optimization. Right: the goal configuration. Courtesy of Anca Dragan. See Ratliff et al. (2009).
From configurations to torques for open-loop tracking: Our path $\tau(t)$ gives us configurations. The robot starts at rest at $q_s = \tau(0)$. From there the robot’s motors will turn currents into torques, leading to motion. But what torques should the robot aim for, such that it ends up at $q_g = \tau(1)$?
This is where the idea of a dynamics model (or transition model) comes in. We can give the robot a function $f$ that computes the effects torques have on the configuration. Remember $F = ma$ from physics? Well, there is something like that for torques too, in the form $u = f^{-1}(q, \dot{q}, \ddot{q})$, with $u$ a torque, $\dot{q}$ a velocity, and $\ddot{q}$ an acceleration.² If the robot is at configuration $q$ and velocity $\dot{q}$, and applied torque $u$, that would lead to acceleration $\ddot{q} = f(q, \dot{q}, u)$. The tuple $(q, \dot{q})$ is a dynamic state, because it includes velocity, whereas $q$ is the kinematic state and is not sufficient for computing exactly what torque to apply. $f$ is a deterministic dynamics model in the MDP over dynamic states with torques as actions. $f^{-1}$ is the inverse dynamics, telling us what torque to apply if we want a particular acceleration, which leads to a change in velocity and thus a change in dynamic state.
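To make $f$ and $f^{-1}$ concrete, here is a minimal sketch for a single-link arm modeled as a damped pendulum, with $q$ measured from the straight-down position. The pendulum model and every parameter value are assumptions for illustration; a real arm’s dynamics also involve inertia matrices and Coriolis and centrifugal terms.

```python
import math

# Assumed parameters of a hypothetical single-link arm (not from the text).
MASS = 1.0       # kg, treated as a point mass at the end of the link
LENGTH = 0.5     # m
DAMPING = 0.1    # N*m*s/rad, viscous friction at the joint
GRAVITY = 9.81   # m/s^2

def f(q, q_dot, u):
    """Dynamics model: acceleration produced by torque u in dynamic state (q, q_dot)."""
    inertia = MASS * LENGTH ** 2
    gravity_torque = MASS * GRAVITY * LENGTH * math.sin(q)
    return (u - DAMPING * q_dot - gravity_torque) / inertia

def f_inverse(q, q_dot, q_ddot):
    """Inverse dynamics: torque needed to get acceleration q_ddot in state (q, q_dot)."""
    inertia = MASS * LENGTH ** 2
    gravity_torque = MASS * GRAVITY * LENGTH * math.sin(q)
    return inertia * q_ddot + DAMPING * q_dot + gravity_torque

# Torque needed to hold the link still in the horizontal position.
holding_torque = f_inverse(math.pi / 2, 0.0, 0.0)
```

Note that `f_inverse` needs the velocity `q_dot` as well as the configuration `q`, which is the sense in which the kinematic state alone is not sufficient for choosing a torque.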
Now, naively, we could think of $t \in [0, 1]$ as “time” on a scale from 0 to 1 and select our torque using inverse dynamics:
$$u(t) = f^{-1}(\tau(t), \dot{\tau}(t), \ddot{\tau}(t)) \qquad (26.2)$$
assuming that the robot starts at $(\tau(0), \dot{\tau}(0))$. In reality though, things are not that easy.
The path $\tau$ was created as a sequence of points, without taking velocities and accelerations into account. As such, the path may not satisfy $\dot{\tau}(0) = 0$ (the robot starts at 0 velocity), or even that $\tau$ is differentiable (let alone twice differentiable). Further, the meaning of the endpoint “1” is unclear: how many seconds does that map to?

In practice, before we even think of tracking a reference path, we usually retime it, that is, transform it into a trajectory $\xi(t)$ that maps the interval $[0, T]$ for some time duration $T$ into points in the configuration space $C$. (The symbol $\xi$ is the Greek letter xi.) Retiming is trickier than you might think, but there are approximate ways to do it, for instance by picking a maximum velocity and acceleration, and using a profile that accelerates to that maximum velocity, stays there as long as it can, and then decelerates back to 0. Assuming we can do this, Equation (26.2) above can be rewritten as
$$u(t) = f^{-1}(\xi(t), \dot{\xi}(t), \ddot{\xi}(t)) \qquad (26.3)$$

2 We omit the details of $f^{-1}$ here, but they involve mass, inertia, gravity, and Coriolis and centrifugal forces.

Figure 26.22 Robot arm control using (a) proportional control with gain factor 1.0, (b) proportional control with gain factor 0.1, and (c) PD (proportional derivative) control with gain factors 0.3 for the proportional component and 0.8 for the differential component. In all cases the robot arm tries to follow the smooth line path, but in (a) and (b) deviates substantially from the path.
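The accelerate, cruise, decelerate profile mentioned above is often called a trapezoidal velocity profile. The sketch below is one approximate way to retime a path with it, assuming that $\tau$ is parameterized proportionally to arc length and that a single speed and acceleration limit applies along the whole path. The function names and the example numbers are hypothetical.

```python
import math

def trapezoidal_profile(length, v_max, a_max):
    """Return (T, progress), where progress(t) in [0, 1] is the fraction of the
    path covered at time t under a trapezoidal speed profile."""
    t_acc = v_max / a_max                  # time needed to reach v_max
    d_acc = 0.5 * a_max * t_acc ** 2       # distance covered while accelerating
    if 2 * d_acc > length:                 # path too short to ever reach v_max
        t_acc = math.sqrt(length / a_max)
        d_acc = 0.5 * length
        v_peak, t_cruise = a_max * t_acc, 0.0
    else:
        v_peak, t_cruise = v_max, (length - 2 * d_acc) / v_max
    T = 2 * t_acc + t_cruise

    def progress(t):
        t = min(max(t, 0.0), T)
        if t < t_acc:                      # accelerating
            d = 0.5 * a_max * t * t
        elif t < t_acc + t_cruise:         # cruising at the peak speed
            d = d_acc + v_peak * (t - t_acc)
        else:                              # decelerating back to rest
            t_left = T - t
            d = length - 0.5 * a_max * t_left * t_left
        return d / length

    return T, progress

def retime(tau, length, v_max, a_max):
    """Turn a path tau: [0, 1] -> C into a trajectory xi: [0, T] -> C."""
    T, progress = trapezoidal_profile(length, v_max, a_max)
    return T, (lambda t: tau(progress(t)))

# Example: a 2 m straight path, limits of 1 m/s and 2 m/s^2 (assumed values).
T, xi = retime(lambda s: (2.0 * s, 0.0), 2.0, 1.0, 2.0)
```

With these example numbers the trajectory lasts `T = 2.5` seconds, starts and ends at rest, and reaches the midpoint of the path at `t = 1.25`.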
Even with the change from $\tau$ to $\xi$, an actual trajectory, the equation of applying torques from above (called a control law) has a problem in practice. Thinking back to the reinforcement learning section, you might guess what it is. The equation works great in the situation where $f$ is exact, but pesky reality gets in the way as usual: in real systems, we can’t measure masses and inertias exactly, and $f$ might not properly account for physical phenomena like stiction in the motors (the friction that tends to prevent stationary surfaces from being set in motion, that is, to make them stick). So, when the robot arm starts applying those torques but $f$ is wrong, the errors accumulate and you deviate further and further from the reference path.
Rather than just letting those errors accumulate, a robot can use a control process that looks at where it thinks it is, compares that to where it wanted to be, and applies a torque to minimize the error.

A controller that provides force in negative proportion to the observed error is known as a proportional controller or P controller for short. The equation for the force is:
$$u(t) = K_P(\xi(t) - q_t)$$
where $q_t$ is the current configuration, and $K_P$ is a constant representing the gain factor of the controller. $K_P$ regulates how strongly the controller corrects for deviations between the actual state $q_t$ and the desired state $\xi(t)$.
Figure 26.22(a) illustrates what can go wrong with proportional control. Whenever a deviation occurs, whether due to noise or to constraints on the forces the robot can apply, the robot provides an opposing force whose magnitude is proportional to this deviation. Intuitively, this might appear plausible, since deviations should be compensated by a counterforce to keep the robot on track. However, as Figure 26.22(a) illustrates, a proportional controller can cause the robot to apply too much force, overshooting the desired path and zigzagging back and forth. This is the result of the natural inertia of the robot: once driven back to its reference position the robot has a velocity that can’t instantaneously be stopped.
In Figure 26.22(a), the parameter $K_P = 1$. At first glance, one might think that choosing a smaller value for $K_P$ would remedy the problem, giving the robot a gentler approach to the desired path. Unfortunately, this is not the case. Figure 26.22(b) shows a trajectory for $K_P = 0.1$, still exhibiting oscillatory behavior. The lower value of the gain parameter helps, but does not solve the problem. In fact, in the absence of friction, the P controller is essentially a spring law; so it will oscillate indefinitely around a fixed target location.
There are a number of controllers that are superior to the simple proportional control law. A controller is said to be stable if small perturbations lead to a bounded error between the robot and the reference signal. It is said to be strictly stable if it is able to return to and then stay on its reference path upon such perturbations. Our P controller appears to be stable but not strictly stable, since it fails to stay anywhere near its reference trajectory.

The simplest controller that achieves strict stability in our domain is a PD controller. The letter ‘P’ stands again for proportional, and ‘D’ stands for derivative. PD controllers are described by the following equation:
$$u(t) = K_P(\xi(t) - q_t) + K_D(\dot{\xi}(t) - \dot{q}_t) \qquad (26.4)$$
As this equation suggests, PD controllers extend P controllers by a differential component, which adds to the value of $u(t)$ a term that is proportional to the first derivative of the error $\xi(t) - q_t$ over time. What is the effect of such a term? In general, a derivative term dampens the system that is being controlled. To see this, consider a situation where the error is changing rapidly over time, as is the case for our P controller above. The derivative of this error will then counteract the proportional term, which will reduce the overall response to the perturbation. However, if the same error persists and does not change, the derivative will vanish and the proportional term dominates the choice of control.

Figure 26.22(c) shows the result of applying this PD controller to our robot arm, using as gain parameters $K_P = 0.3$ and $K_D = 0.8$. Clearly, the resulting path is much smoother, and does not exhibit any obvious oscillations.
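The difference between the two control laws shows up even in a very small simulation. The sketch below is not the robot arm of Figure 26.22; it assumes a frictionless unit mass that starts one unit away from a fixed target $\xi(t) = 0$, and it reuses the gains $K_P = 0.3$ and $K_D = 0.8$ quoted above. The integration scheme and step size are also assumptions.

```python
def simulate(kp, kd, q0=1.0, dt=0.01, steps=10000):
    """Frictionless unit mass (q_ddot = u) regulated toward the target xi = 0."""
    q, q_dot = q0, 0.0
    history = []
    for _ in range(steps):
        u = kp * (0.0 - q) + kd * (0.0 - q_dot)   # P term plus D term
        q_dot += u * dt                            # semi-implicit Euler step
        q += q_dot * dt
        history.append(q)
    return history

def tail_amplitude(history, fraction=0.2):
    """Largest |error| over the last part of the run."""
    tail = history[int(len(history) * (1 - fraction)):]
    return max(abs(q) for q in tail)

p_only = tail_amplitude(simulate(kp=0.3, kd=0.0))   # P controller
pd = tail_amplitude(simulate(kp=0.3, kd=0.8))       # PD controller
```

In this toy setting the P controller is still swinging with nearly its initial amplitude at the end of the run (the spring law), while the PD controller has settled on the target.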
PD controllers do have failure modes, however. In particular, PD controllers may fail to regulate an error down to zero, even in the absence of external perturbations. Often such a situation is the result of a systematic external force that is not part of the model. For example, an autonomous car driving on a banked surface may find itself systematically pulled to one side. Wear and tear in robot arms causes similar systematic errors. In such situations, an overproportional feedback is required to drive the error closer to zero. The solution to this problem lies in adding a third term to the control law, based on the integrated error over time:
$$u(t) = K_P(\xi(t) - q_t) + K_I \int_0^t (\xi(s) - q_s) \, ds + K_D(\dot{\xi}(t) - \dot{q}_t) \qquad (26.5)$$
Here $K_I$ is a third gain parameter. The term $\int_0^t (\xi(s) - q_s) \, ds$ calculates the integral of the error over time. The effect of this term is that long-lasting deviations between the reference signal and the actual state are corrected. Integral terms, then, ensure that a controller does not exhibit systematic long-term error, although they do pose a danger of oscillatory behavior.

A controller with all three terms is called a PID controller (for proportional integral derivative). PID controllers are widely used in industry, for a variety of control problems. Think of the three terms as follows. Proportional: try harder the farther away you are from the path; derivative: try even harder if the error is increasing; integral: try harder if you haven’t made progress for a long time.
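A small simulation illustrates why the integral term matters. The sketch below assumes a unit mass that should stay at $\xi(t) = 0$ while a constant unmodeled force pushes on it, loosely analogous to the banked road; the disturbance size, the value $K_I = 0.1$, and the integration scheme are assumptions chosen for illustration.

```python
def final_error(kp, ki, kd, disturbance=-0.15, dt=0.01, steps=20000):
    """Unit mass held at the reference xi = 0 against a constant unmodeled force.
    Returns the remaining error xi - q at the end of the run."""
    q, q_dot, integral = 0.0, 0.0, 0.0
    for _ in range(steps):
        error = 0.0 - q
        integral += error * dt                       # running integral of the error
        u = kp * error + ki * integral + kd * (0.0 - q_dot)
        q_dot += (u + disturbance) * dt              # the controller never sees `disturbance`
        q += q_dot * dt
    return 0.0 - q

pd_error = final_error(kp=0.3, ki=0.0, kd=0.8)    # PD: settles with an offset
pid_error = final_error(kp=0.3, ki=0.1, kd=0.8)   # PID: offset is integrated away
```

Under these assumptions the PD controller comes to rest where the proportional term exactly balances the disturbance, leaving an error of `0.15 / 0.3 = 0.5`, while the PID controller drives the error to essentially zero.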
Computed torque control
A middle ground between open-loop control based on inverse dynamics and closed-loop PID control is called computed torque control. We compute the torque our model thinks we will need, but compensate for model inaccuracy with proportional error terms:
$$u(t) = \underbrace{f^{-1}(\xi(t), \dot{\xi}(t), \ddot{\xi}(t))}_{\text{feedforward}} + \underbrace{m(\xi(t)) \left( K_P(\xi(t) - q_t) + K_D(\dot{\xi}(t) - \dot{q}_t) \right)}_{\text{feedback}} \qquad (26.6)$$
The first term is called the feedforward component because it looks forward to where the
robot needs to go and computes what torque might be required. The second is the feedback
component because it feeds the current error in the dynamic state back into the control law.
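As a rough illustration of Equation (26.6), the sketch below tracks a reference on a single pendulum-like joint whose true mass differs from the mass in the controller’s model. Everything here is an assumption for illustration: the pendulum dynamics, the parameter values, the reference trajectory, and the gains. For a single link the inertia is a constant scalar, so the configuration dependence of the inertia term does not show up.

```python
import math

G = 9.81
# True arm versus the (deliberately wrong) model used by the controller.
TRUE = {"mass": 1.2, "length": 1.0, "damping": 0.1}
MODEL = {"mass": 1.0, "length": 1.0, "damping": 0.1}

def inertia(p):
    return p["mass"] * p["length"] ** 2

def forward(p, q, q_dot, u):
    """Acceleration of the joint under torque u."""
    gravity = p["mass"] * G * p["length"] * math.sin(q)
    return (u - p["damping"] * q_dot - gravity) / inertia(p)

def inverse(p, q, q_dot, q_ddot):
    """Torque that would produce acceleration q_ddot according to parameters p."""
    gravity = p["mass"] * G * p["length"] * math.sin(q)
    return inertia(p) * q_ddot + p["damping"] * q_dot + gravity

def reference(t):
    """Assumed reference xi(t) with its first two derivatives; starts at rest."""
    return 0.5 * (1 - math.cos(t)), 0.5 * math.sin(t), 0.5 * math.cos(t)

def max_tracking_error(kp, kd, dt=0.001, steps=20000):
    q, q_dot, worst = 0.0, 0.0, 0.0
    for k in range(steps):
        xi, xi_dot, xi_ddot = reference(k * dt)
        feedforward = inverse(MODEL, xi, xi_dot, xi_ddot)
        feedback = inertia(MODEL) * (kp * (xi - q) + kd * (xi_dot - q_dot))
        q_dot += forward(TRUE, q, q_dot, feedforward + feedback) * dt
        q += q_dot * dt
        worst = max(worst, abs(xi - q))
    return worst

open_loop_error = max_tracking_error(kp=0.0, kd=0.0)       # feedforward only
computed_torque_error = max_tracking_error(kp=100.0, kd=20.0)
```

With the feedback gains set to zero this reduces to the open-loop law of Equation (26.3), and the model error shows up directly as tracking error; adding the feedback component keeps the joint much closer to the reference.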
In Equation (26.6), $m(q)$ is the inertia matrix at configuration $q$; unlike normal PD control, the gains change with the configuration of the system.

Plans versus policies
Let’s take a step back and make sure we understand the analogy between what happened so far in this chapter and what we learned in the search, MDP, and reinforcement learning chapters.
With motion in robotics, we are really considering an underlying MDP where the states are
dynamic states (configuration and velocity), and the actions are control inputs, usually in the form of torques. If you take another look at our control laws above, they are policies, not plans—they tell the robot what action to take from any state it might reach. However,
they are usually far from optimal policies. Because the dynamic state is continuous and high dimensional (as is the action space), optimal policies are computationally difficult to extract. Instead, what we did here is to break up the problem. We come up with a plan first, in a simplified state and action space: we use only the kinematic state, and assume that states are reachable from one another without paying attention to the underlying dynamics. This is motion planning, and it gives us the reference path. If we knew the dynamics perfectly, we
could turn this into a plan for the original state and action space with Equation (26.3).
But because our dynamics model is typically erroneous, we turn it instead into a policy
that tries to follow the plan—getting back to it when it drifts away. When doing this, we introduce suboptimality in two ways:
first by planning without considering dynamics, and
second by assuming that if we deviate from the plan, the optimal thing to do is to return to the original plan. In what follows, we describe techniques that compute policies directly over the dynamic state, avoiding the separation altogether.
26.5.4 Optimal control
Rather than using a planner to create a kinematic path, and only worrying about the dynamics
of the system after the fact, here we discuss how we might be able to do it all at once. We'll
take the trajectory optimization problem for kinematic paths, and turn it into true trajectory
optimization with dynamics: we will optimize directly over the actions, taking the dynamics (or transitions) into account. This brings us much closer to what we’ve seen in the search and MDP chapters.
If we
know the system’s dynamics, then we can find a sequence of actions to execute, as we did in
Chapter 3. If we’re not sure, then we might want a policy, as in Chapter 17.
In this section, we are looking more directly at the underlying MDP the robot works in. We’re switching from the familiar discrete MDPs to continuous ones. We will denote our dynamic state of the world by $x$, as is common practice; it is the equivalent of $s$ in discrete MDPs. Let $x_s$ and $x_g$ be the starting and goal states.

We want to find a sequence of actions that, when executed by the robot, result in state-action pairs with low cumulative cost. The actions are torques, which we denote with $u(t)$ for $t$ starting at 0 and ending at $T$. Formally, we want to find the sequence of torques $u$ that minimize a cumulative cost $J$:
$$\min_u \int_0^T J(x(t), u(t)) \, dt \qquad (26.7)$$
subject to the constraints
$$\forall t, \ \dot{x}(t) = f(x(t), u(t))$$
$$x(0) = x_s, \quad x(T) = x_g.$$
How is this connected to motion planning and trajectory tracking control? Well, imagine we take the notion of efficiency and clearance away from the obstacles and put it into the cost function $J$, just as we did before in trajectory optimization over kinematic state. The dynamic state is the configuration and velocity, and torques $u$ change it via the dynamics $f$ from open-loop trajectory tracking. The difference is that now we’re thinking about the configurations and the torques at the same time. Sometimes, we might want to treat collision avoidance as a hard constraint as well, something we’ve also mentioned before when we looked at trajectory optimization for the kinematic state only.
To solve this optimization problem, we can take gradients of $J$, not with respect to the sequence $\tau$ of configurations anymore, but directly with respect to the controls $u$. It is sometimes helpful to include the state sequence $x$ as a decision variable too, and use the dynamics constraints to ensure that $x$ and $u$ are consistent. There are various trajectory optimization techniques using this approach; two of them go by the names multiple shooting and direct collocation. None of these techniques will find the global optimal solution, but in practice they can effectively make humanoid robots walk and make autonomous cars drive.

Magic happens when in the problem above, $J$ is quadratic and $f$ is linear in $x$ and $u$. We want to minimize
$$\min \int_0^\infty x^T Q x + u^T R u \, dt$$
subject to
$$\forall t, \ \dot{x}(t) = Ax(t) + Bu(t).$$
We can optimize over an infinite horizon rather than a finite one, and we obtain a policy from any state rather than just a sequence of controls. $Q$ and $R$ need to be positive definite matrices for this to work. This gives us the linear quadratic regulator (LQR). With LQR, the optimal value function (called cost to go) is quadratic, and the optimal policy is linear. The policy looks like $u = -Kx$, where finding the matrix $K$ requires solving an algebraic Riccati equation: no local optimization, no value iteration, no policy iteration are needed! Because of the ease of finding the optimal policy, LQR finds many uses in practice despite the fact that real problems seldom actually have quadratic costs and linear dynamics. A really useful method is called iterative LQR (ILQR), which works by starting with a solution and then iteratively computing a linear approximation of the dynamics and a quadratic approximation of the cost around it, then solving the resulting LQR system to arrive at a new solution. Variants of LQR are also often used for trajectory tracking.
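In the scalar case the algebraic Riccati equation can be solved by hand, which makes for a compact sanity check. The sketch below assumes a one-dimensional system $\dot{x} = ax + bu$ with cost $\int_0^\infty q x^2 + r u^2 \, dt$; the function names and the example numbers are hypothetical, and a matrix-valued problem would need a numerical Riccati solver (for example `scipy.linalg.solve_continuous_are`) instead of the closed form used here.

```python
import math

def scalar_lqr(a, b, q, r):
    """Scalar LQR: the Riccati equation 2*a*p - (b*p)**2 / r + q = 0 has a
    single positive root p, and the optimal policy is u = -k * x."""
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    k = b * p / r
    return k, p                     # cost to go from state x is p * x**2

def simulated_cost(a, b, q, r, k, x0=1.0, dt=0.001, steps=10000):
    """Accumulate q*x^2 + r*u^2 along the closed-loop trajectory (forward Euler)."""
    x, cost = x0, 0.0
    for _ in range(steps):
        u = -k * x
        cost += (q * x * x + r * u * u) * dt
        x += (a * x + b * u) * dt
    return cost

# An unstable system (a > 0) that the policy has to stabilize.
k, p = scalar_lqr(a=1.0, b=1.0, q=3.0, r=1.0)
cost = simulated_cost(1.0, 1.0, 3.0, 1.0, k)
```

For these numbers the gain and the cost-to-go coefficient are both 3, the closed-loop dynamics become $\dot{x} = -2x$, and the simulated cost from $x_0 = 1$ matches the quadratic cost to go $p x_0^2$ up to discretization error.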
26.6 Planning Uncertain Movements
In robotics, uncertainty arises from partial observability of the environment and from the stochastic (or unmodeled) effects of the robot’s actions. Errors can also arise from the use
of approximation algorithms
such as particle filtering, which does not give the robot an exact
belief state even if the environment is modeled perfectly.
The majority of today’s robots use deterministic algorithms for decision making, such as the path-planning algorithms of the previous section, or the search algorithms that were introduced in Chapter 3. These deterministic algorithms are adapted in two ways: first, they deal with the continuous state space by turning it into a discrete space (for example with visibility graphs or cell decomposition). Second, they deal with uncertainty in the current state by choosing the most likely state from the probability distribution produced by the state estimation algorithm. That approach makes the computation faster and makes a better fit for the deterministic search algorithms. In this section we discuss methods for dealing with uncertainty that are analogous to the more complex search algorithms covered in Chapter 4.
First, instead of deterministic plans, uncertainty calls for policies. We already discussed how trajectory tracking control turns a plan into a policy to compensate for errors in dynamics. Sometimes though, if the most likely hypothesis changes enough, tracking the plan designed for a different hypothesis is too suboptimal. This is where online replanning comes in: we can recompute a new plan based on the new belief. Many robots today use a technique called model predictive control (MPC), where they plan for a shorter time horizon, but replan at every time step. (MPC is therefore closely related to real-time search and game-playing algorithms.) This effectively results in a policy: at every step, we run a planner and take the first action in the plan; if new information comes along, or we end up not where we expected, that’s OK, because we are going to replan anyway and that will tell us what to do next.

Second, uncertainty calls for information gathering actions. When we consider only the information we have and make a plan based on it (this is called separating estimation from
control), we are effectively solving (approximately) a new MDP at every step, corresponding
to our current belief about where we are or how the world works. But in reality, uncertainty is
better captured by the POMDP framework: there is something we don’t directly observe, be it the robot’s location or configuration, the location of objects in the world, or the parameters
of the dynamics model itself—for example, where exactly is the center of mass of link two on this arm?
What we lose when we don’t solve the POMDP
is the ability to reason about future
information the robot will get: in MDPs we only plan with what we know, not with what we might eventually know. Remember the value of information? Well, robots that plan using
their current belief as if they will never find out anything more fail to account for the value of
information. They will never take actions that seem suboptimal right now according to what
they know, but that will actually result in a lot of information and enable the robot to do well.
What does such an action look like for a navigation robot?
The robot could get close
to a landmark to get a better estimate of where it is, even if that landmark is out of the way
according to what it currently knows. This action is optimal only if the robot considers the
new observations it will get, as opposed to looking only at the information it already has.
To get around this, robotics techniques sometimes define information gathering actions explicitly, such as moving a hand until it touches a surface (called guarded movements), and make sure the robot does that before coming up with a plan for reaching its actual goal. Each guarded motion consists of (1) a motion command and (2) a termination condition, which is a predicate on the robot’s sensor values saying when to stop.

Figure 26.23 A two-dimensional environment, velocity uncertainty cone, and envelope of possible robot motions. The intended velocity is $v$, but with uncertainty the actual velocity could be anywhere in $C_v$, resulting in a final configuration somewhere in the motion envelope, which means we wouldn’t know if we hit the hole or not.

Sometimes, the goal itself could be reached via a sequence of guarded moves guaranteed to succeed regardless of uncertainty. As an example, Figure 26.23 shows a two-dimensional
configuration space with a narrow vertical hole. It could be the configuration space for insertion of a rectangular peg into a hole or a car key into the ignition. The motion commands are constant velocities. The termination conditions are contact with a surface. To model uncertainty in control, we assume that instead of moving in the commanded direction, the robot’s actual motion lies in the cone $C_v$ about it.
The figure shows what would happen if the robot attempted to move straight down from
the initial configuration. Because of the uncertainty in velocity, the robot could move anywhere in the conical envelope, possibly going into the hole, but more likely landing to one side of it. Because the robot would not then know which side of the hole it was on, it would
not know which way to move.
A more sensible strategy is shown in Figures 26.24 and 26.25. In Figure 26.24, the robot deliberately moves to one side of the hole. The motion command is shown in the figure, and the termination test is contact with any surface. In Figure 26.25, a motion command is
given that causes the robot to slide along the surface and into the hole. Because all possible
velocities in the motion envelope are to the right, the robot will slide to the right whenever it
is in contact with a horizontal surface.
It will slide down the right-hand vertical edge of the hole when it touches it, because
all possible velocities are down relative to a vertical surface.
It will keep moving until it
reaches the bottom of the hole, because that is its termination condition.
In spite of the
control uncertainty, all possible trajectories of the robot terminate in contact with the bottom of the hole—that is, unless surface irregularities cause the robot to stick in one place.
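The two-step strategy can be sketched as a pair of guarded moves, each a motion command plus a termination predicate. The simulation below is a toy under stated assumptions: the hole dimensions, the cone half-angle, the step size, and the simple “clamp and slide” contact model are all made up for illustration, and surfaces are perfectly smooth, so the robot never sticks.

```python
import math
import random

# Assumed geometry: a flat surface at y = 0 with a hole between x = -0.1 and x = 0.1.
HOLE_LEFT, HOLE_RIGHT, HOLE_DEPTH = -0.1, 0.1, 1.0
CONE_HALF_ANGLE = math.radians(20)   # assumed control uncertainty
STEP = 0.01                          # distance moved per simulation step

def move(pos, angle):
    """Advance one step in direction `angle`, sliding along any surface that is hit.
    Returns the new position and whether the robot is touching a surface."""
    x = pos[0] + STEP * math.cos(angle)
    y = pos[1] + STEP * math.sin(angle)
    contact = False
    if y < 0:
        if HOLE_LEFT <= pos[0] <= HOLE_RIGHT:       # inside the hole: walls and bottom
            clamped_x = min(max(x, HOLE_LEFT), HOLE_RIGHT)
            clamped_y = max(y, -HOLE_DEPTH)
            contact = (clamped_x != x) or (clamped_y != y)
            x, y = clamped_x, clamped_y
        else:                                        # resting on the horizontal surface
            y, contact = 0.0, True
    return (x, y), contact

def guarded_move(pos, commanded_angle, terminate, rng, max_steps=100000):
    """Repeat the motion command, perturbed within the cone, until `terminate` holds."""
    for _ in range(max_steps):
        angle = commanded_angle + rng.uniform(-CONE_HALF_ANGLE, CONE_HALF_ANGLE)
        pos, contact = move(pos, angle)
        if terminate(pos, contact):
            return pos
    raise RuntimeError("termination condition never satisfied")

def insert_peg(seed):
    rng = random.Random(seed)
    start = (0.0, 1.0)                               # directly above the hole
    # 1) Move down and to the left until any surface is touched.
    landed = guarded_move(start, math.radians(-135), lambda p, c: c, rng)
    # 2) Move down and to the right until the bottom of the hole is reached.
    return guarded_move(landed, math.radians(-45),
                        lambda p, c: p[1] <= -HOLE_DEPTH, rng)
```

Because every direction in the second cone points both rightward and downward, `insert_peg` ends at the bottom of the hole for any random seed in this toy model, even though no individual step goes exactly where it was commanded.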
Other techniques beyond guarded movements change the cost function to incentivize actions we know will lead to information, like the coastal navigation heuristic, which requires the robot to stay near known landmarks. More generally, techniques can incorporate the expected information gain (reduction of entropy of the belief) as a term in the cost function, leading to the robot explicitly reasoning about how much information each action might bring when deciding what to do. While more difficult computationally, such approaches have the advantage that the robot invents its own information gathering actions rather than relying on human-provided heuristics and scripted strategies that often lack flexibility.

Figure 26.24 The first motion command and the resulting envelope of possible robot motions. No matter what actual motion ensues, we know the final configuration will be to the left of the hole.

Figure 26.25 The second motion command and the envelope of possible motions. Even with error, we will eventually get into the hole.

26.7 Reinforcement Learning in Robotics
Thus far we have considered tasks in which the robot has access to the dynamics model of
the world. In many tasks, it is very difficult to write down such a model, which puts us in the
domain of reinforcement learning (RL).
One challenge of RL in robotics is the continuous nature of the state and action spaces, which we handle either through discretization, or, more commonly, through function approximation. Policies or value functions are represented as combinations of known useful features, or as deep neural networks. Neural nets can map from raw inputs directly to outputs, and thus largely avoid the need for feature engineering, but they do require more data.

A bigger challenge is that robots operate in the real world. We have seen how reinforcement learning can be used to learn to play chess or Go by playing simulated games. But when a real robot moves in the real world, we have to make sure that its actions are safe (things break!), and we have to accept that progress will be slower than in a simulation because the world refuses to move faster than one second per second. Much of what is interesting about using reinforcement learning in robotics boils down to how we might reduce the real-world sample complexity: the number of interactions with the physical world that the robot needs before it has learned how to do the task.

Figure 26.26 Training a robust policy. (a) Multiple simulations are run of a robot hand manipulating objects, with different randomized parameters for physics and lighting. Courtesy of Wojciech Zaremba. (b) The real-world environment, with a single robot hand in the center of a cage, surrounded by cameras and range finders. (c) Simulation and real-world training yields multiple different policies for grasping objects; here a pinch grasp and a quadpod grasp. Courtesy of OpenAI. See Andrychowicz et al. (2018a).
26.7.1 Exploiting models
A natural way to avoid the need for many real-world samples is to use as much knowledge of the world’s dynamics as possible. For instance, we might not know exactly what the coefficient of friction or the mass of an object is, but we might have equations that describe the dynamics as a function of these parameters.

In such a case, model-based reinforcement learning (Chapter 22) is appealing, where the robot can alternate between fitting the dynamics parameters and computing a better policy. Even if the equations are incorrect because they fail to model every detail of physics, researchers have experimented with learning an error term, in addition to the parameters, that can compensate for the inaccuracy of the physical model. Or, we can abandon the equations and instead fit locally linear models of the world that each approximate the dynamics in a region of the state space, an approach that has been successful in getting robots to master complex dynamic tasks like juggling.

A model of the world can also be useful in reducing the sample complexity of model-free reinforcement learning methods by doing sim-to-real transfer: transferring policies that work in simulation to the real world. The idea is to use the model as a simulator for a policy search (Section 22.5). To learn a policy that transfers well, we can add noise to the model during training, thereby making the policy more robust. Or, we can train policies that will work with a variety of models by sampling different parameters in the simulations, sometimes referred to as domain randomization. An example is in Figure 26.26, where a dexterous manipulation task is trained in simulation by varying visual attributes, as well as physical attributes like friction or damping.
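The idea of domain randomization can be sketched in a few lines, with a great deal of simplification. Below, the “policy” is just a pair of PD gains picked from a small candidate list rather than a learned neural network, and the “simulator” is a one-dimensional point mass whose mass and friction are resampled for every rollout. The parameter ranges, the candidate gains, and the cost are all assumptions for illustration.

```python
import random

def rollout_cost(kp, kd, mass, friction, dt=0.01, steps=1000):
    """Simulate a 1-D point mass driven toward the origin by a PD policy and
    return the integrated squared distance from the origin."""
    q, q_dot, cost = 1.0, 0.0, 0.0
    for _ in range(steps):
        u = -kp * q - kd * q_dot
        q_ddot = (u - friction * q_dot) / mass
        q_dot += q_ddot * dt
        q += q_dot * dt
        cost += q * q * dt
    return cost

def randomized_cost(kp, kd, rng, samples=50):
    """Average cost over simulators with randomly sampled physical parameters."""
    total = 0.0
    for _ in range(samples):
        mass = rng.uniform(0.5, 2.0)         # assumed range of plausible masses
        friction = rng.uniform(0.0, 0.5)     # assumed range of friction coefficients
        total += rollout_cost(kp, kd, mass, friction)
    return total / samples

def best_gains(candidates, seed=0):
    """Pick the candidate policy with the lowest average cost across simulators."""
    scored = []
    for kp, kd in candidates:
        rng = random.Random(seed)            # same sampled simulators for every candidate
        scored.append((randomized_cost(kp, kd, rng), kp, kd))
    return min(scored)

CANDIDATES = [(1.0, 0.0), (1.0, 2.0), (4.0, 4.0)]
cost, kp, kd = best_gains(CANDIDATES)
```

The selected gains are the ones that do well on average across the whole range of sampled masses and frictions, rather than the ones tuned to a single nominal model, which is the property we hope carries over to the real system.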
Finally, hybrid approaches that borrow ideas from both model-based and model-free algorithms are meant to give us the best of both. The hybrid approach originated with the Dyna architecture, where the idea was to iterate between acting and improving the policy, but the policy improvement would come in two complementary ways: 1) the standard model-free way of using the experience to directly update the policy, and 2) the model-based way of using the experience to fit a model, then plan with it to generate a policy.

More recent techniques have experimented with fitting local models, planning with them to generate actions, and using these actions as supervision to fit a policy, then iterating to get better and better models around the areas that the policy needs. This has been successfully applied in end-to-end learning, where the policy takes pixels as input and directly outputs torques as actions; it enabled the first demonstration of deep RL on physical robots.

Models can also be exploited for the purpose of ensuring safe exploration. Learning slowly but safely may be better than learning quickly but crashing and burning half way through. So arguably, more important than reducing real-world samples is reducing real-world samples in dangerous states: we don’t want robots falling off cliffs, and we don’t want them breaking our favorite mugs or, even worse, colliding with objects and people. An approximate model, with uncertainty associated to it (for example by considering a range of values for its parameters), can guide exploration and impose constraints on the actions that the robot is allowed to take in order to avoid these dangerous states. This is an active area of research in robotics and control.
26.7.2 Exploiting other information
Models are useful, but there is more we can do to further reduce sample complexity. When setting up a reinforcement learning problem, we have to select the state and action spaces, the representation of the policy or value function, and the reward function we’re using. These decisions have a large impact on how easy or how hard we are making the problem.
One approach is to use higher-level motion primitives instead of low-level actions like torque commands. A motion primitive is a parameterized skill that the robot has. For example, a robotic soccer player might have the skill of “pass the ball to the player at (x,y).” All the policy needs to do is to figure out how to combine them and set their parameters, instead of reinventing them. This approach often learns much faster than low-level approaches, but does restrict the space of possible behaviors that the robot can learn.
Another way to reduce the number of real-world samples required for learning is to reuse information from previous learning episodes on other tasks, rather than starting from scratch.
This falls under the umbrella of meta-learning or transfer learning.
Finally, people are a great source of information. In the next section, we talk about how
to interact with people, and part of it is how to use their actions to guide the robot’s learning.
26.8 Humans and Robots
Thus far, we’ve focused on a robot planning and learning how to act in isolation. This is useful for some robots, like the rovers we send out to explore distant planets on our behalf.
But, for the most part, we do not build robots to work in isolation. We build them to help us, and to work in human environments, around and with us.
This raises two complementary challenges. First is optimizing reward when there are
people acting in the same environment as the robot. We call this the coordination problem
(see Section 18.1). When the robot’s reward depends on not just its own actions, but also the
actions that people take, the robot has to choose its actions in a way that meshes well with
theirs. When the human and the robot are on the same team, this turns into collaboration.
Second is the challenge of optimizing for what people actually want. If a robot is to
help people, its reward function needs to incentivize the actions that people want the robot to
execute. Figuring out the right reward function (or policy) for the robot is itself an interaction problem. We will explore these two challenges in turn.
26.8.1 Coordination
Let’s assume for now, as we have been, that the robot has access to a clearly defined reward
function. But, instead of needing to optimize it in isolation, now the robot needs to optimize
it around a human who is also acting. For example, as an autonomous car merges on the
highway, it needs to negotiate the maneuver with the human driver coming in the target lane—
should it accelerate and merge in front, or slow down and merge behind? Later, as it pulls up to
a stop sign, preparing to take a right, it has to watch out for the cyclist in the bicycle lane, and for the pedestrian about to step onto the crosswalk.
Or, consider a mobile robot in a hallway.
Someone heading straight toward the robot
steps slightly to the right, indicating which side of the robot they want to pass on. The robot has to respond, clarifying its intentions.
Humans as approximately rational agents
One way to formulate coordination with a human is to model it as a game between the robot and the human (Section 18.2). With this approach, we explicitly make the assumption that people are agents incentivized by objectives. This does not automatically mean that they are perfectly rational agents (i.e., find optimal solutions in the game), but it does mean that the robot can structure the way it reasons about the human via the notion of possible objectives that the human might have. In this game:
• the state of the environment captures the configurations of both the robot and human agents; call it $x = (x_R, x_H)$;
• each agent can take actions, $u_R$ and $u_H$ respectively;
• each agent has an objective that can be represented as a cost, $J_R$ and $J_H$: each agent wants to get to its goal safely and efficiently;
• and, as in any game, each objective depends on the state and on the actions of both agents: $J_R(x, u_R, u_H)$ and $J_H(x, u_H, u_R)$. Think of the car-pedestrian interaction: the car should stop if the pedestrian crosses, and should go forward if the pedestrian waits.
Three important aspects complicate this game. First is that the human and the robot don’t
necessarily know each other’s objectives. This makes it an incomplete information game.
Second is that the state and action spaces are continuous, as they’ve been throughout this chapter. We learned in Chapter 5 how to do tree search to tackle discrete games, but how do we tackle continuous spaces?
Third, even though at the high level the game model makes sense (humans do move, and they do have objectives), a human’s behavior might not always be well characterized as a solution to the game. The game comes with a computational challenge not only for the robot, but for us humans too. It requires thinking about what the robot will do in response to what the person does, which depends on what the robot thinks the person will do, and pretty soon we get to “what do you think I think you think I think”: it’s turtles all the way down! Humans can’t deal with all of that, and exhibit certain suboptimalities. This means that the robot should account for these suboptimalities.
So, then, what is an autonomous car to do when the coordination problem is this hard?
We will do something similar to what we’ve done before in this chapter. For motion planning and control, we took an MDP and broke it up into planning a trajectory and then tracking it with a controller. Here too, we will take the game, and break it up into making predictions about human actions, and deciding what the robot should do given these predictions.
Predicting human action
Predicting human actions is hard because they depend on the robot’s actions and vice versa. One trick that robots use is to pretend the person is ignoring the robot. The robot assumes people are noisily optimal with respect to their objective, which is unknown to the robot and is modeled as no longer dependent on the robot’s actions: $J_H(x, u_H)$. In particular, the higher the value of an action for the objective (the lower the cost to go), the more likely the human is to take it. The robot can create a model for $P(u_H \mid x, J_H)$, for instance using the softmax function from page 811:
$$P(u_H \mid x, J_H) \propto e^{-Q(x, u_H; J_H)} \qquad (26.8)$$
with $Q(x, u_H; J_H)$ the Q-value function corresponding to $J_H$ (the negative sign is there because in robotics we like to minimize cost, not maximize reward). Note that the robot does not assume perfectly optimal actions, nor does it assume that the actions are chosen based on reasoning about the robot at all.
Armed with this model, the robot uses the human’s ongoing actions as evidence about $J_H$. If we have an observation model for how human actions depend on the human’s objective, each human action can be incorporated to update the robot’s belief over what objective the person has:
$$b'(J_H) \propto b(J_H)\, P(u_H \mid x, J_H). \qquad (26.9)$$
An example is in Figure 26.27: the robot is tracking a human’s location and as the human moves, the robot updates its belief over human goals. As the human heads toward the
windows, the robot increases the probability that the goal is to look out the window, and
decreases the probability that the goal is going to the kitchen, which is in the other direction.
This is how the human’s past actions end up informing the robot about what the human will do in the future. Having a belief about the human’s goal helps the robot anticipate what next actions the human will take. The heatmap in the figure shows the robot’s future
predictions: red is most probable; blue least probable.
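Equations (26.8) and (26.9) can be sketched in a few lines for a person walking on a grid. The goal locations, the four-action set, and the choice of "one step plus straight-line distance" as a stand-in for the Q-value are assumptions made for this example only; a real system would compute the cost to go with a planner.

```python
import math

GOALS = {"window": (0.0, 5.0), "kitchen": (5.0, 0.0)}   # hypothetical goal locations
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def q_value(x, u, goal):
    """Stand-in cost to go: one step plus straight-line distance from the next cell."""
    nxt = (x[0] + ACTIONS[u][0], x[1] + ACTIONS[u][1])
    return 1.0 + math.dist(nxt, goal)

def action_likelihood(x, goal):
    """Equation (26.8): P(u | x, J) proportional to exp(-Q(x, u; J))."""
    weights = {u: math.exp(-q_value(x, u, goal)) for u in ACTIONS}
    z = sum(weights.values())
    return {u: w / z for u, w in weights.items()}

def update_belief(belief, x, u):
    """Equation (26.9): b'(J) proportional to b(J) * P(u | x, J)."""
    post = {g: b * action_likelihood(x, GOALS[g])[u] for g, b in belief.items()}
    z = sum(post.values())
    return {g: p / z for g, p in post.items()}
```

Starting from a uniform belief at position (2, 2), observing a single "up" step shifts most of the probability mass to the window goal, because that step is a much better move toward the window than toward the kitchen.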
Figure 26.27 Making predictions by assuming that people are noisily rational given their goal: the robot uses the past actions to update a belief over what goal the person is heading to, and then uses the belief to make predictions about future actions. (a) The map of a room. (b) Predictions after seeing a small part of the person’s trajectory (white path). (c) Predictions after seeing more human actions: the robot now knows that the person is not heading to the hallway on the left, because the path taken so far would be a poor path if that were the person’s goal. Images courtesy of Brian D. Ziebart. See Ziebart et al. (2009).
The same can happen in driving. We might not know how much another driver values efficiency, but if we see them accelerate as someone is trying to merge in front of them, we now know a bit more about them. And once we know that, we can better anticipate what they will do in the future: the same driver is likely to come closer behind us, or weave through traffic to get ahead.
Once the robot can make predictions about human future actions, it has reduced its problem to solving an MDP. The human actions complicate the transition function, but as long as the robot can anticipate what action the person will take from any future state, the robot can calculate $P(x' \mid x, u_R)$: it can compute $P(u_H \mid x)$ from $P(u_H \mid x, J_H)$ by marginalizing over $J_H$, and combine it with $P(x' \mid x, u_R, u_H)$, the transition (dynamics) function for how the world updates based on both the robot’s and the human’s actions. In Section 26.5 we focused on how to solve this in continuous state and action spaces for deterministic dynamics, and in Section 26.6 we discussed doing it with stochastic dynamics and uncertainty.
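The two marginalization steps can be written out directly for finite sets of objectives and actions. In the sketch below, `likelihood` and `dynamics` are hypothetical stand-ins for $P(u_H \mid x, J_H)$ and $P(x' \mid x, u_R, u_H)$, and the pedestrian numbers are invented for illustration.

```python
def human_action_marginal(x, belief, likelihood):
    """P(u_H | x) = sum over J of b(J) * P(u_H | x, J)."""
    marginal = {}
    for J, b in belief.items():
        for u_H, p in likelihood(x, J).items():
            marginal[u_H] = marginal.get(u_H, 0.0) + b * p
    return marginal

def robot_transition(x, u_R, belief, likelihood, dynamics):
    """P(x' | x, u_R) = sum over u_H of P(u_H | x) * P(x' | x, u_R, u_H)."""
    result = {}
    for u_H, p_u in human_action_marginal(x, belief, likelihood).items():
        for x_next, p_x in dynamics(x, u_R, u_H).items():
            result[x_next] = result.get(x_next, 0.0) + p_u * p_x
    return result

# Toy car-pedestrian example; all numbers are made up.
def likelihood(x, J):
    table = {"hurried": {"cross": 0.9, "wait": 0.1},
             "cautious": {"cross": 0.2, "wait": 0.8}}
    return table[J]

def dynamics(x, u_R, u_H):
    car, pedestrian = x
    return {(car + u_R, pedestrian + (1 if u_H == "cross" else 0)): 1.0}

belief = {"hurried": 0.5, "cautious": 0.5}
next_states = robot_transition((0, 0), 1, belief, likelihood, dynamics)
```

The result is an ordinary MDP transition distribution over next states, so any of the planners discussed earlier can consume it without knowing that a human model sits underneath.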
Splitting prediction from action makes it easier for the robot to handle interaction, but sacrifices performance, much as splitting estimation from motion did, or splitting planning from control.
A robot with this split no longer understands that its actions can influence what people end up doing. In contrast, the robot in Figure 26.27 anticipates where people will go and then optimizes for reaching its own goal and avoiding collisions with them. In Figure 26.28, we have an autonomous car merging on the highway. If it just planned in reaction to other cars, it might have to wait a long time while other cars occupy its target lane. In contrast, a car that reasons about prediction and action jointly knows that different actions it could take will
result in different reactions from the human. If it starts to assert itself, the other cars are likely to slow down a bit and make room. Roboticists are working towards coordinated interactions like this so robots can work better with humans.
Figure 26.28 (a) Left: An autonomous car (middle lane) predicts that the human driver (left lane) wants to keep going forward, and plans a trajectory that slows down and merges behind. Right: The car accounts for the influence its actions can have on human actions, and realizes it can merge in front and rely on the human driver to slow down. (b) That same algorithm produces an unusual strategy at an intersection: the car realizes that it can make it more likely for the person (bottom) to proceed faster through the intersection by starting to inch backwards. Images courtesy of Anca Dragan. See Sadigh et al. (2016).
Human predictions about the robot
Incomplete information is often two-sided: the robot does not know the human’s objective and the human, in turn, does not know the robot’s objective, so people need to be making predictions about robots. As robot designers, we are not in charge of how the human makes predictions; we can only control what the robot does. However, the robot can act in a way to make it easier for the human to make correct predictions. The robot can assume that the human is using something roughly analogous to Equation (26.8) to estimate the robot’s objective $J_R$, and thus the robot will act so that its true objective can be easily inferred.
A special case of the game is when the human and the robot are on the same team, working toward the same goal or objective: $J_H = J_R$. Imagine getting a personal home robot that is helping you make dinner or clean up; these are examples of collaboration.
We can now define a joint agent whose actions are tuples of human-robot actions, $(u_H, u_R)$, and who optimizes for $J_H(x, u_H, u_R) = J_R(x, u_R, u_H)$, and we’re solving a regular planning problem. We compute the optimal plan or policy for the joint agent, and voila, we now know what the robot and human should do.
This would work really well if people were perfectly optimal. The robot would do its part
of the joint plan, the human theirs. Unfortunately, in practice, people don’t seem to follow the perfectly laid out joint-agent plan; they have a mind of their own! We’ve already learned one way to handle this though, back in Section 26.6. We called it model predictive control
(MPC): the idea was to come up with a plan, execute the first action, and then replan. That
way, the robot always adapts its plan to what the human is actually doing.
Let’s work through an example. Suppose you and the robot are in your kitchen, and have
decided to make waffles. You are slightly closer to the fridge, so the optimal joint plan would
have you grab the eggs and milk from the fridge, while the robot fetches the flour from the cabinet. The robot knows this because it can measure quite precisely where everyone is. But
suppose you start heading for the flour cabinet. You are going against the optimal joint plan. Rather than sticking to it and stubbornly also going for the flour, the MPC robot recalculates
the optimal plan, and now that you are close enough to the flour it is best for the robot to grab the waffle iron instead.
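The replanning behavior in this example can be sketched with a deliberately simplified kitchen: a one-dimensional layout with only two items, so that when the human deviates the robot switches to the fridge rather than to a waffle iron. The layout, the total-travel cost, and the robot speed are all assumptions for illustration.

```python
from itertools import permutations

ITEMS = {"fridge": 0.0, "flour": 10.0}   # hypothetical positions along a 1-D kitchen

def joint_plan(human_pos, robot_pos):
    """Joint-agent plan: give each agent one item so that total travel is minimal."""
    def cost(assignment):
        human_item, robot_item = assignment
        return abs(human_pos - ITEMS[human_item]) + abs(robot_pos - ITEMS[robot_item])
    human_item, robot_item = min(permutations(ITEMS), key=cost)
    return human_item, robot_item

def mpc_step(human_pos, robot_pos, speed=0.5):
    """Replan from the current state, then execute only the robot's first action."""
    _, robot_item = joint_plan(human_pos, robot_pos)
    delta = ITEMS[robot_item] - robot_pos
    return robot_item, robot_pos + max(-speed, min(speed, delta))

def simulate(human_path, robot_pos):
    """Run MPC against an observed human path; returns the robot's target at each step."""
    targets = []
    for human_pos in human_path:
        target, robot_pos = mpc_step(human_pos, robot_pos)
        targets.append(target)
    return targets
```

With the human observed at positions 4, 5, 6, 7, 8 (walking toward the flour) and the robot starting at 5.25, the robot heads for the flour for three steps and then turns around for the fridge once the human has overtaken it. The plan is never committed to for more than one step.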
If we know that people might deviate from optimality, we can account for it ahead of time.
In our example, the robot can try to anticipate that you are going for the flour the moment you take your first step (say, using the prediction technique above). Even if it is still technically
optimal for you to turn around and head for the fridge, the robot should not assume that’s
what is going to happen. Instead, the robot can compute a plan in which you keep doing what you seem to want.
Humans as black box agents
We don’t have to treat people as objective-driven, intentional agents to get robots to coordinate with us. An alternative model is that the human is merely some agent whose policy $\pi_H$ “messes” with the environment dynamics. The robot does not know $\pi_H$, but can model the problem as needing to act in an MDP with unknown dynamics. We have seen this before: for general agents in Chapter 22, and for robots in particular in Section 26.7.
The robot can fit a policy model $\pi_H$ to human data, and use it to compute an optimal policy for itself. Due to scarcity of data, this has been mostly used so far at the task level. For instance, robots have learned through interaction what actions people tend to take (in response to the robot’s own actions) for the task of placing and drilling screws in an industrial assembly task.
Then there is also the model-free reinforcement learning alternative: the robot can start with some initial policy or value function, and keep improving it over time via trial and error.
26.8.2 Learning to do what humans want
Another way interaction with humans comes into robotics is in $J_R$ itself, the robot’s cost or reward function. The framework of rational agents and the associated algorithms reduce the problem of generating good behavior to specifying a good reward function. But for robots, as for many other AI agents, getting the cost right is still difficult.
Take autonomous cars: we want them to reach the destination, to be safe, to drive comfortably for their passengers, to obey traffic laws, etc. A designer of such a system needs to trade off these different components of the cost function. The designer’s task is hard because robots are built to help end users, and not every end user is the same. We all have different preferences for how aggressively we want our car to drive, etc.
Below, we explore two alternatives for trying to get robot behavior to match what we
actually want the robot to do.
The first is to learn a cost function from human input.
The
second is to bypass the cost function and imitate human demonstrations of the task.
Preference learning: Learning cost functions
Imagine that an end user is showing a robot how to do a task. For instance, they are driving the car in the way they would like it to be driven by the robot. Can you think of a way for the robot to use these actions (we call them “demonstrations”) to figure out what cost function it should optimize?
Figure 26.29 Left: A mobile robot is shown a demonstration that stays on the dirt road. Middle: The robot infers the desired cost function, and uses it in a new scene, knowing to
put lower cost on the road there. Right: The robot plans a path for the new scene that also
stays on the road, reproducing the preferences behind the demonstration. Images courtesy of Nathan Ratliff and James A. Bagnell. See Ratliff et al. (2006).
We have actually already seen the answer to this back in Section 26.8.1. There, the setup
was a little different: we had another person taking actions in the same space as the robot, and the robot needed to predict what the person would do. But one technique we went over for making these predictions was to assume that people act to noisily optimize some cost function
$J_H$, and we can use their ongoing actions as evidence about what cost function that is. We
can do the same here, except not for the purpose of predicting human behavior in the future, but rather acquiring the cost function the robot itself should optimize.
If the person drives
defensively, the cost function that will explain their actions will put a lot of weight on safety
and less so on efficiency. The robot can adopt this cost function as its own and optimize it when driving the car itself.
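A small sketch of this inference, assuming only two candidate cost functions and two candidate driving styles described by hand-picked features. The feature values, the weights, and the names `TRAJECTORIES` and `COSTS` are invented for illustration; the demonstrator is modeled as noisily optimal in the same softmax sense as Equation (26.8).

```python
import math

# Hypothetical features of two candidate driving styles: (risk, travel time).
TRAJECTORIES = {"keep_distance": (0.1, 3.0), "tailgate": (1.0, 1.0)}
# Hypothetical candidate cost functions, as weights on (risk, travel time).
COSTS = {"defensive": (3.0, 1.0), "aggressive": (0.5, 2.0)}

def cost(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def demo_likelihood(choice, weights):
    """Noisily optimal demonstrator: P(choice | J) proportional to exp(-cost)."""
    z = sum(math.exp(-cost(weights, f)) for f in TRAJECTORIES.values())
    return math.exp(-cost(weights, TRAJECTORIES[choice])) / z

def infer_cost(demonstrations):
    """Posterior over candidate cost functions, starting from a uniform prior."""
    post = {name: 1.0 / len(COSTS) for name in COSTS}
    for choice in demonstrations:
        post = {n: p * demo_likelihood(choice, COSTS[n]) for n, p in post.items()}
        z = sum(post.values())
        post = {n: p / z for n, p in post.items()}
    return post
```

After a couple of "keep_distance" demonstrations the posterior concentrates on the defensive weights, and the robot could adopt `max(post, key=post.get)` as its own cost function. Practical methods search over continuous weight spaces rather than a two-element set.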
Roboticists have experimented with different algorithms for making this cost inference
computationally tractable. In Figure 26.29, we see an example of teaching a robot to prefer staying on the road to going over the grassy
terrain. Traditionally in such methods, the cost
function has been represented as a combination of handcrafted features, but recent work has also studied how to represent it using a deep neural network, without feature engineering.
There are other ways for a person to provide input. A person could use language rather
than demonstration to instruct the robot.
A person could act as a critic, watching the robot
perform a task one way (or two ways) and then saying how well the task was done (or which way was better), or giving advice on how to improve.
Learning policies directly via imitation
An alternative is to bypass cost functions and learn the desired robot policy directly. In our car example, the human’s demonstrations make for a convenient data set of states labeled by the action the robot should take at each state: $D = \{(x_i, u_i)\}$. The robot can run supervised learning to fit a policy $\pi : x \mapsto u$, and execute that policy. This is called imitation learning or behavioral cloning.
A challenge with this approach is in generalization to new states. The robot does not
know why the actions in its database have been marked as optimal. It has no causal rule; all
it can do is run a supervised learning algorithm to try to learn a policy that will generalize to unknown states. However, there is no guarantee that the generalization will be correct.
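In its simplest form, behavioral cloning is just regression from states to actions. The sketch below assumes a scalar state (lateral offset from the lane center), a scalar action (steering command), and a linear policy class fitted by ordinary least squares; the demonstration data are made up.

```python
def fit_linear_policy(demos):
    """Ordinary least squares for u = w * x + b from (state, action) pairs."""
    n = len(demos)
    mean_x = sum(x for x, _ in demos) / n
    mean_u = sum(u for _, u in demos) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in demos)
    cov_xu = sum((x - mean_x) * (u - mean_u) for x, u in demos)
    w = cov_xu / var_x
    b = mean_u - w * mean_x
    return lambda x: w * x + b

# Hypothetical demonstrations: lateral offset x, steering command u = -2x.
demos = [(x, -2.0 * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
policy = fit_linear_policy(demos)
```

The fitted policy reproduces the demonstrator inside the demonstrated range of offsets. For an offset of, say, 10, it simply extrapolates the same line, which is exactly the kind of generalization that comes with no guarantee.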
Figure 26.30 A human teacher pushes the robot down to teach it to stay closer to the table. The robot appropriately updates its understanding of the desired cost function and starts optimizing it. Courtesy of Anca Dragan. See Bajcsy et al. (2017).
Figure 26.31 A programming interface that involves placing specially designed blocks in the robot’s workspace to select objects and specify high-level actions. Images courtesy of Maya Cakmak. See Sefidgar et al. (2017).
The ALVINN autonomous car project used this approach, and found that even when starting from a state in $D$, $\pi$ will make small errors, which will take the car off the demonstrated trajectory. There, $\pi$ will make a larger error, which will take the car even further off the desired course.
We can address this at training time if we interleave collecting labels and learning: start
with a demonstration, learn a policy, then roll out that policy and ask the human for what action to take at every state along the way, then repeat. The robot then learns how to correct its mistakes as it deviates from the human’s desired actions.
Alternatively, we can address it by leveraging reinforcement learning. The robot can fit a
dynamics model based on the demonstrations, and then use optimal control (Section 26.5.4)
to generate a policy that optimizes for staying close to the demonstration.
A version of this has been used to perform very challenging maneuvers at an expert level in a small radio-controlled helicopter (see Figure 22.9(b)).
The DAGGER (Data Aggregation) system starts with a human expert demonstration. From that it learns a policy, $\pi_1$, and uses the policy to generate a data set $D$. Then from $D$ it generates a new policy $\pi_2$ that best imitates the original human data. This repeats, and on the $n$th iteration it uses $\pi_n$ to generate more data, to be added to $D$, which is then used to create $\pi_{n+1}$. In other words, at each iteration the system gathers new data under the current policy and trains the next policy using all the data gathered so far.
Related recent techniques use adversarial training: they alternate between training a classifier to distinguish between the robot’s learned policy and the human’s demonstrations, and training a new robot policy via reinforcement learning to fool the classifier. These advances enable the robot to handle states that are near demonstrations, but generalization to far-off states or to new dynamics is a work in progress.
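The data-aggregation loop described above for DAGGER can be sketched on a toy lane-keeping problem. Everything concrete here is an assumption for illustration: the saturating `expert`, the scalar dynamics in `rollout`, and the linear policy class in `fit`. The sketch also omits refinements of the published algorithm, such as mixing expert and learner actions during rollouts.

```python
def expert(x):
    """Hypothetical expert: steer toward the lane center, with saturated commands."""
    return max(-1.0, min(1.0, -2.0 * x))

def fit(data):
    """Least-squares fit of a linear policy u = w * x through the origin."""
    sxx = sum(x * x for x, _ in data)
    sxu = sum(x * u for x, u in data)
    w = sxu / sxx
    return lambda x: w * x

def rollout(policy, x0=1.0, steps=10, dt=0.1):
    """States visited when running `policy` on toy dynamics x' = x + dt * u."""
    xs, x = [], x0
    for _ in range(steps):
        xs.append(x)
        x = x + dt * policy(x)
    return xs

def dagger(iterations=3):
    data = [(x, expert(x)) for x in rollout(expert)]   # initial expert demonstration
    policy = fit(data)
    for _ in range(iterations):
        visited = rollout(policy)                      # gather states under current policy
        data += [(x, expert(x)) for x in visited]      # expert labels the visited states
        policy = fit(data)                             # retrain on all data so far
    return policy, data
```

Because the learner's linear policy cannot represent the saturating expert exactly, its rollouts visit slightly different states than the expert's did. Labeling those states is what lets later iterations correct the learner where it actually ends up.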
Teaching interfaces and the correspondence problem. So far, we have imagined the case of an autonomous car or an autonomous helicopter, for which human demonstrations use the same actions that the robot can take itself: accelerating, braking, and steering. But what happens if we do this for tasks like cleaning up the kitchen table? We have two choices here: either the person demonstrates using their own body while the robot watches, or the person physically guides the robot’s effectors.
The first approach is appealing because it comes naturally to end users. Unfortunately,
it suffers from the correspondence problem: how to map human actions onto robot actions.
People have different kinematics and dynamics than robots. Not only does that make it difficult to translate or retarget human motion onto robot motion (e.g., retargeting a five-finger human grasp to a two-finger robot grasp), but often the high-level strategy a person might use is not appropriate for the robot.
The second approach, where the human teacher moves the robot’s effectors into the right positions, is called kinesthetic teaching. It is not easy for humans to teach this way, especially to teach robots with multiple joints. The teacher needs to coordinate all the degrees of freedom while guiding the arm through the task. Researchers have thus investigated alternatives, like demonstrating keyframes as opposed to continuous trajectories, as well as the use of visual programming to enable end users to program primitives for a task rather than demonstrate from scratch (Figure 26.31). Sometimes both approaches are combined.
26.9 Alternative Robotic Frameworks
Thus far, we have taken a view of robotics based on the notion of defining or learning a reward function, and having the robot optimize that reward function (be it via planning or learning), sometimes in coordination or collaboration with humans. This is a deliberative view of robotics, to be contrasted with a reactive view.
26.9.1 Reactive controllers
In some cases, it is easier to set up a good policy for a robot than to model the world and plan.
Then, instead of a rational agent, we have a reflex agent.
For example, picture a legged robot that attempts to lift a leg over an obstacle. We could give this robot a rule that says lift the leg a small height $h$ and move it forward, and if the leg encounters an obstacle, move it back and start again at a higher height. You could say that $h$ is modeling an aspect of the world, but we can also think of $h$ as an auxiliary variable of the robot controller, devoid of direct physical meaning.
One such example is the six-legged (hexapod) robot, shown in Figure 26.32(a), designed for walking through rough terrain. The robot’s sensors are inadequate to obtain accurate models of the terrain for path planning. Moreover, even if we added high-precision cameras and rangefinders, the 12 degrees of freedom (two for each leg) would render the resulting path planning problem computationally difficult.
Figure 26.32 (a) Genghis, a hexapod robot. (Image courtesy of Rodney A. Brooks.) (b) An augmented finite state machine (AFSM) that controls one leg. The AFSM reacts to sensor feedback: if a leg is stuck during the forward swinging phase, it will be lifted increasingly higher.
[Labels in Figure 26.32(b): lift up; set down; push backward; retract, lift higher.]
It is possible, nonetheless, to specify a controller directly without an explicit environmental model. (We have already seen this with the PD controller, which was able to keep a complex robot arm on target without an explicit model of the robot dynamics.)
For the hexapod robot we first choose a gait, or pattern of movement of the limbs. One statically stable gait is to first move the right front, right rear, and left center legs forward (keeping the other three fixed), and then move the other three. This gait works well on flat terrain. On rugged terrain, obstacles may prevent a leg from swinging forward. This problem can be overcome by a remarkably simple control rule: when a leg’s forward motion is blocked, simply retract it, lift it higher, and try again. The resulting controller is shown in Figure 26.32(b) as a simple finite state machine; it constitutes a reflex agent with state, where the internal state is represented by the index of the current machine state ($s_1$ through $s_4$).
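The retract-and-lift-higher rule is easy to simulate as a four-state machine. The state names, the numeric lift heights, and the `obstacle_height` test that stands in for the "stuck" sensor are assumptions for this sketch; the mapping of names to $s_1$ through $s_4$ in the figure is not reproduced here.

```python
def leg_controller(obstacle_height, base_lift=1.0, lift_step=0.5, max_cycles=20):
    """Simulate one leg cycle; returns the trace of visited states and the final lift height."""
    state, lift, trace = "lift_up", base_lift, []
    for _ in range(max_cycles):
        trace.append(state)
        if state == "lift_up":
            state = "swing_forward"
        elif state == "swing_forward":
            if lift <= obstacle_height:   # sensor feedback: the leg is stuck
                lift += lift_step         # retract, lift higher, and try again
                state = "lift_up"
            else:
                state = "set_down"
        elif state == "set_down":
            state = "push_backward"
        else:                             # push_backward completes the cycle
            break
    return trace, lift
```

For an obstacle of height 1.7, the leg attempts the forward swing three times, at lift heights 1.0, 1.5, and 2.0, before setting down. The controller holds no model of the terrain; its only memory is the current state and the lift height.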
26.9.2 Subsumption architectures
The subsumption architecture (Brooks, 1986) is a framework for assembling reactive controllers out of finite state machines. Nodes in these machines may contain tests for certain sensor variables, in which case the execution trace of a finite state machine is conditioned on the outcome of such a test. Arcs can be tagged with messages that will be generated when traversing them, and that are sent to the robot’s motors or to other finite state machines.
Additionally, finite state machines possess internal timers (clocks) that control the time it
takes to traverse an arc. The resulting machines are called augmented finite state machines
(AFSMs), where the augmentation refers to the use of clocks.
An example of a simple AFSM is the four-state machine we just talked about, shown in Figure 26.32(b). This AFSM implements a cyclic controller, whose execution mostly does not rely on environmental feedback. The forward swing phase, however, does rely on sensor feedback. If the leg is stuck, meaning that it has failed to execute the forward swing, the robot retracts the leg, lifts it up a little higher, and attempts to execute the forward swing once again. Thus, the controller is able to react to contingencies arising from the interplay of the robot and its environment.
The subsumption architecture offers additional primitives for synchronizing AFSMs, and for combining output values of multiple, possibly conflicting AFSMs. In this way, it enables the programmer to compose increasingly complex controllers in a bottom-up fashion. In our example, we might begin with AFSMs for individual legs, followed by an AFSM for coordinating multiple legs. On top of this, we might implement higher-level behaviors such as collision avoidance, which might involve backing up and turning.
The idea of composing robot controllers from AFSMs is quite intriguing. Imagine how difficult it would be to generate the same behavior with any of the configuration-space path planning algorithms described in the previous section. First, we would need an accurate model of the terrain. The configuration space of a robot with six legs, each of which is driven
by two independent motors, totals 18 dimensions (12 dimensions for the configuration of the
legs, and six for the location and orientation of the robot relative to its environment).
Even if our computers were fast enough to find paths in such high-dimensional spaces, we would have to worry about nasty effects such as the robot sliding down a slope. Because of such stochastic effects, a single path through configuration space would almost certainly be too brittle, and even a PID controller might not be able to cope with such contingencies. In other words, generating motion behavior deliberately is simply too complex a problem in some cases for present-day robot motion planning algorithms.
Unfortunately, the subsumption architecture has its own problems. First, the AFSMs
are driven by raw sensor input, an arrangement that works if the sensor data is reliable and contains all necessary information for decision making, but fails if sensor data has to be integrated in nontrivial ways over time. Subsumption-style controllers have therefore mostly been applied to simple tasks, such as following a wall or moving toward visible light sources.
Second, the lack of deliberation makes it difficult to change the robot’s goals. A robot with a subsumption architecture usually does just one task, and it has no notion of how to modify its controls to accommodate different goals (just like the dung beetle on page 41).
Third, in many real-world problems, the policy we want is often too complex to encode explicitly. Think about the example from Figure 26.28, of an autonomous car needing to negotiate a lane change with a human driver. We might start off with a simple policy that goes into the target lane. But when we test the car, we find out that not every driver in the target lane will slow down to let the car in. We might then add a bit more complexity: make the car nudge towards the target lane, wait for a response from the driver in that lane, and then either proceed or retreat back. But then we test the car, and realize that the nudging needs to happen at a different speed depending on the speed of the vehicle in the target lane, on whether there is another vehicle in front in the target lane, on whether there is a vehicle behind the car in the initial lane, and so on. The number of conditions that we need to consider to determine the right course of action can be very large, even for such a deceptively simple maneuver. This in turn presents scalability challenges for subsumption-style architectures.
All that said, robotics is a complex problem with many approaches: deliberative, reactive, or a mixture thereof; based on physics, cognitive models, data, or a mixture thereof. The right approach is still a subject for debate, scientific inquiry, and engineering prowess.
Figure 26.33 (a) A patient with a brain-machine interface controlling a robot arm to grab a drink. Image courtesy of Brown University. (b) Roomba, the robot vacuum cleaner. Photo by HANDOUT/KRT/Newscom.
26.10 Application Domains
Robotic technology is already permeating our world, and has the potential to improve our independence, health, and productivity. Here are some example applications.
Home care: Robots have started to enter the home to care for older adults and people with motor impairments, assisting them with activities of daily living and enabling them to live more independently. These include wheelchairs and wheelchair-mounted arms like the Kinova arm from Figure 26.1(b). Even though they start off as being operated by a human directly, these robots are gaining more and more autonomy. On the horizon are robots operated by brain-machine interfaces, which have been shown to enable people with quadriplegia to use a robot arm to grasp objects and even feed themselves (Figure 26.33(a)). Related to these are prosthetic limbs that intelligently respond to our actions, and exoskeletons that give us superhuman strength or enable people who can’t control their muscles from the waist down to walk again.
Personal robots are meant to assist us with daily tasks like cleaning and organizing, freeing up our time. Although manipulation still has a way to go before it can operate seamlessly in messy, unstructured human environments, navigation has made some headway. In particular, many homes already enjoy a mobile robot vacuum cleaner like the one in Figure 26.33(b).
Health care: Robots assist and augment surgeons, enabling more precise, minimally invasive, safer procedures with better patient outcomes. The Da Vinci surgical robot from Figure 26.34(a) is now widely deployed at hospitals in the U.S.
Services: Mobile robots help out in office buildings, hotels, and hospitals. Savioke has put robots in hotels delivering products like towels or toothpaste to your room. The Helpmate and TUG robots carry food and medicine in hospitals (Figure 26.34(b)), while Diligent Robotics’ Moxi robot helps out nurses with back-end logistical responsibilities. CoBot roams the halls of Carnegie Mellon University, ready to guide you to someone’s office. We can also use telepresence robots like the Beam to attend meetings and conferences remotely, or check in on our grandparents.
Figure 26.34 (a) Surgical robot in the operating room. Photo by Patrick Landmann/Science Source. (b) Hospital delivery robot. Photo by Wired.
Figure 26.35 (a) Autonomous car BOSS, which won the DARPA Urban Challenge. Photo by Tangi Quemener/AFP/Getty Images/Newscom. Courtesy of Sebastian Thrun. (b) Aerial view showing the perception and predictions of the Waymo autonomous car (white vehicle with green track). Other vehicles (blue boxes) and pedestrians (orange boxes) are shown with anticipated trajectories. Road/sidewalk boundaries are in yellow. Photo courtesy of Waymo.
Autonomous cars: Some of us are occasionally distracted while driving, by cell phone calls, texts, or other distractions. The sad result: more than a million people die every year in traffic accidents. Further, many of us spend a lot of time driving and would like to recapture some of that time. All this has led to a massive ongoing effort to deploy autonomous cars.
Prototypes have existed since the 1980s, but progress was stimulated by the 2005 DARPA Grand Challenge, an autonomous vehicle race over 200 challenging kilometers of unrehearsed desert terrain. Stanford’s Stanley vehicle completed the course in less than seven hours, winning a $2 million prize and a place in the National Museum of American History.
Figure 26.36 (a) A robot mapping an abandoned coal mine. (b) A 3D map of the mine acquired by the robot. Courtesy of Sebastian Thrun.
Figure 26.35(a) depicts BOSS, which in 2007 won the DARPA Urban Challenge, a complicated road race on city streets where robots faced other robots and had to obey traffic rules.
In 2009, Google started an autonomous driving project (featuring many of the researchers who had worked on Stanley and BOSS), which has now spun off as Waymo. In 2018 Waymo started driverless testing (with nobody in the driver seat) in the suburbs of Phoenix, Arizona. In the meantime, other autonomous driving companies and ridesharing companies are working on developing their own technology, while car manufacturers have been selling cars with more and more assistive intelligence, such as Tesla’s driver assist, which is meant for highway driving. Other companies are targeting non-highway driving applications including college campuses and retirement communities. Still other companies are focused on non-passenger applications such as trucking, grocery delivery, and valet parking.
Entertainment: Disney has been using robots (under the name animatronics) in their parks since 1963. Originally, these robots were restricted to hand-designed, open-loop, unvarying motion (and speech), but since 2009 a version called autonomatronics can generate autonomous actions. Robots also take the form of intelligent toys for children; for example, Anki’s Cozmo plays games with children and may pound the table with frustration when it loses. Finally, quadrotors like Skydio’s R1 from Figure 26.2(b) act as personal photographers and videographers, following us around to take action shots as we ski or bike.
Exploration and hazardous environments: Robots have gone where no human has gone before, including the surface of Mars. Robotic arms assist astronauts in deploying and retrieving satellites and in building the International Space Station. Robots also help explore under the sea. They are routinely used to acquire maps of sunken ships. Figure 26.36 shows a robot mapping an abandoned coal mine, along with a 3D model of the mine acquired using range sensors. In 1996, a team of researchers released a legged robot into the crater of an active volcano to acquire data for climate research. Robots are becoming very effective tools for gathering information in domains that are difficult (or dangerous) for people to access.
Robots have assisted people in cleaning up nuclear waste, most notably in Three Mile Island, Chernobyl, and Fukushima. Robots were present after the collapse of the World Trade Center, where they entered structures deemed too dangerous for human search and rescue crews. Here too, these robots are initially deployed via teleoperation, and as technology advances they are becoming more and more autonomous, with a human operator in charge but not having to specify every single command.
Industry: The majority of robots today are deployed in factories, automating tasks that
are difficult, dangerous, or dull for humans. (The majority of factory robots are in automobile
factories.) Automating these tasks is a positive in terms of efficiently producing what society needs. At the same time, it also means displacing some human workers from their jobs. This
has important policy and economics implications—the need for retraining and education, the need for a fair division of resources, etc. These topics are discussed further in Section 27.3.5.
Summary
Robotics is about physically embodied agents, which can change the state of the physical world. In this chapter, we have learned the following:
• The most common types of robots are manipulators (robot arms) and mobile robots. They have sensors for perceiving the world and actuators that produce motion, which then affects the world via effectors.
• The general robotics problem involves stochasticity (which can be handled by MDPs), partial observability (which can be handled by POMDPs), and acting with and around other agents (which can be handled with game theory). The problem is made even harder by the fact that most robots work in continuous and high-dimensional state and action spaces. They also operate in the real world, which refuses to run faster than real time and in which failures lead to real things being damaged, with no “undo” capability. Ideally, the robot would solve the entire problem in one go: observations in the form of raw sensor feeds go in, and actions in the form of torques or currents to the motors come out. In practice though, this is too daunting, and roboticists typically decouple different aspects of the problem and treat them independently.
• We typically separate perception (estimation) from action (motion generation). Perception in robotics involves computer vision to recognize the surroundings through cameras, but also localization and mapping.
• Robotic perception concerns itself with estimating decision-relevant quantities from sensor data. To do so, we need an internal representation and a method for updating this internal representation over time.
• Probabilistic filtering algorithms such as particle filters and Kalman filters are useful for robot perception. These techniques maintain the belief state, a posterior distribution over state variables.
• For generating motion, we use configuration spaces, where a point specifies everything we need to know to locate every body point on the robot. For instance, for a robot arm with two joints, a configuration consists of the two joint angles.
• We typically decouple the motion generation problem into motion planning, concerned with producing a plan, and trajectory tracking control, concerned with producing a policy for control inputs (actuator commands) that results in executing the plan.
• Motion planning can be solved via graph search using cell decomposition; using randomized motion planning algorithms, which sample milestones in the continuous configuration space; or using trajectory optimization, which can iteratively push a straight-line path out of collision by leveraging a signed distance field.
• A path found by a search algorithm can be executed using the path as the reference trajectory for a PID controller, which constantly corrects for errors between where the robot is and where it is supposed to be, or via computed torque control, which adds a feedforward term that makes use of inverse dynamics to compute roughly what torque to send to make progress along the trajectory.
• Optimal control unites motion planning and trajectory tracking by computing an optimal trajectory directly over control inputs. This is especially easy when we have quadratic costs and linear dynamics, resulting in a linear quadratic regulator (LQR). Popular methods make use of this by linearizing the dynamics and computing second-order approximations of the cost (ILQR).
• Planning under uncertainty unites perception and action by online replanning (such as model predictive control) and information gathering actions that aid perception.
• Reinforcement learning is applied in robotics, with techniques striving to reduce the required number of interactions with the real world. Such techniques tend to exploit models, be it estimating models and using them to plan, or training policies that are robust with respect to different possible model parameters.
• Interaction with humans requires the ability to coordinate the robot’s actions with theirs, which can be formulated as a game. We usually decompose the solution into prediction, in which we use the person’s ongoing actions to estimate what they will do in the future, and action, in which we use the predictions to compute the optimal motion for the robot.
• Helping humans also requires the ability to learn or infer what they want. Robots can approach this by learning the desired cost function they should optimize from human input, such as demonstrations, corrections, or instruction in natural language. Alternatively, robots can imitate human behavior, and use reinforcement learning to help tackle the challenge of generalization to new states.
Bibliographical and Historical Notes
The word robot was popularized by Czech playwright Karel Čapek in his 1920 play R.U.R. (Rossum’s Universal Robots). The robots, which were grown chemically rather than constructed mechanically, end up resenting their masters and decide to take over. It appears that it was Čapek’s brother, Josef, who first combined the Czech words “robota” (obligatory work) and “robotnik” (serf) to yield “robot” in his 1917 short story Opilec (Glanc, 1978). The term robotics was invented for a science fiction story (Asimov, 1950).
The idea of an autonomous machine predates the word “robot” by thousands of years. In 7th century BCE Greek mythology, a robot named Talos was built by Hephaistos, the Greek god of metallurgy, to protect the island of Crete. The legend is that the sorceress Medea defeated Talos by promising him immortality but then draining his life fluid. Thus, this is the first example of a robot making a mistake in the process of changing its objective function. In 322 BCE, Aristotle anticipated technological unemployment, speculating “If every tool, when ordered, or even of its own accord, could do the work that befits it ... then there would be no need either of apprentices for the master workers or of slaves for the lords.”
In the 3rd century BCE an actual humanoid robot called the Servant of Philon could pour wine or water into a cup; a series of valves cut off the flow at the right time. Wonderful automata were built in the 18th century (Jacques Vaucanson’s mechanical duck from 1738 being one early example), but the complex behaviors they exhibited were entirely fixed in advance. Possibly the earliest example of a programmable robot-like device was the Jacquard loom (1805), described on page 15.
Grey Walter’s “turtle,” built in 1948, could be considered the first autonomous mobile robot, although its control system was not programmable. The “Hopkins Beast,” built in 1960 at Johns Hopkins University, was much more sophisticated; it had sonar and photocell sensors, pattern-recognition hardware, and could recognize the cover plate of a standard AC power outlet. It was capable of searching for outlets, plugging itself in, and then recharging its batteries! Still, the Beast had a limited repertoire of skills.
The first general-purpose mobile robot was “Shakey,” developed at what was then the Stanford Research Institute (now SRI) in the late 1960s (Fikes and Nilsson, 1971; Nilsson, 1984). Shakey was the first robot to integrate perception, planning, and execution, and much subsequent research in AI was influenced by this remarkable achievement. Shakey appears on the cover of this book with project leader Charlie Rosen (1917–2002). Other influential projects include the Stanford Cart and the CMU Rover (Moravec, 1983). Cox and Wilfong (1990) describe classic work on autonomous vehicles.
The first commercial robot was an arm called UNIMATE, for universal automation, developed by Joseph Engelberger and George Devol in their company, Unimation. In 1961, the first UNIMATE robot was sold to General Motors for use in manufacturing TV picture tubes. 1961 was also the year when Devol obtained the first U.S. patent on a robot. In 1973, Toyota and Nissan started using an updated version of UNIMATE for auto body spot welding. This initiated a major revolution in automobile manufacturing that took place mostly in Japan and the U.S., and that is still ongoing. Unimation followed up in 1978 with the development of the Puma robot (Programmable Universal Machine for Assembly), which was the de facto standard for robotic manipulation for the two decades that followed. About 500,000 robots are sold each year, with half of those going to the automotive industry.
In manipulation, the first major effort at creating a hand-eye machine was Heinrich Ernst’s MH-1, described in his MIT Ph.D. thesis (Ernst, 1961). The Machine Intelligence project at Edinburgh also demonstrated an impressive early system for vision-based assembly called FREDDY (Michie, 1972).
Research on mobile robotics has been stimulated by several important competitions. AAAI’s annual mobile robot competition began in 1992. The first competition winner was CARMEL (Congdon et al., 1992). Progress has been steady and impressive: in recent competitions robots entered the conference complex, found their way to the registration desk, registered for the conference, and even gave a short talk.
The RoboCup competition, launched in 1995 by Kitano and colleagues (1997), aims to “develop a team of fully autonomous humanoid robots that can win against the human world champion team in soccer” by 2050. Some competitions use wheeled robots, some humanoid robots, and some software simulations. Stone (2016) describes recent innovations in RoboCup.
The DARPA Grand Challenge, organized by DARPA in 2004 and 2005, required autonomous vehicles to travel more than 200 kilometers through the desert in less than ten hours (Buehler et al., 2006). In the original event in 2004, no robot traveled more than eight miles, leading many to believe the prize would never be claimed. In 2005, Stanford’s robot Stanley won the competition in just under seven hours (Thrun, 2006). DARPA then organized the Urban Challenge, a competition in which robots had to navigate 60 miles in an urban environment with other traffic. Carnegie Mellon University’s robot BOSS took first place and claimed the $2 million prize (Urmson and Whittaker, 2008). Early pioneers in the development of robotic cars included Dickmanns and Zapp (1987) and Pomerleau (1993).
The field of robotic mapping has evolved from two distinct origins. The first thread began with work by Smith and Cheeseman (1986), who applied Kalman filters to the simultaneous localization and mapping (SLAM) problem. This algorithm was first implemented by Moutarlier and Chatila (1989) and later extended by Leonard and Durrant-Whyte (1992); see Dissanayake et al. (2001) for an overview of early Kalman filter variations. The second thread began with the development of the occupancy grid representation for probabilistic mapping, which specifies the probability that each (x,y) location is occupied by an obstacle (Moravec and Elfes, 1985).
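To make the occupancy grid idea concrete, here is a minimal Python sketch of a grid that stores one occupancy probability per (x, y) cell. It assumes the commonly used log-odds form of the Bayesian update, which is not necessarily the update rule of the original paper; the class name OccupancyGrid and the inverse-sensor-model values (0.7 and 0.3) are hypothetical choices made for illustration.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


class OccupancyGrid:
    """Hypothetical minimal grid storing log-odds of occupancy per (x, y) cell."""

    def __init__(self, width, height, prior=0.5):
        self.prior_log_odds = logit(prior)
        self.log_odds = [[self.prior_log_odds] * width for _ in range(height)]

    def update(self, x, y, p_occupied_given_reading):
        # Bayesian update in log-odds form. The inverse sensor model value
        # p_occupied_given_reading is an assumed input, not derived here.
        self.log_odds[y][x] += logit(p_occupied_given_reading) - self.prior_log_odds

    def probability(self, x, y):
        # Convert the stored log-odds back to a probability of occupancy.
        return 1.0 / (1.0 + math.exp(-self.log_odds[y][x]))


grid = OccupancyGrid(width=3, height=2)
grid.update(1, 0, 0.7)  # two readings suggesting cell (1, 0) is occupied
grid.update(1, 0, 0.7)
grid.update(2, 1, 0.3)  # one reading suggesting cell (2, 1) is free
```

Each cell is updated independently of its neighbors, which is the simplifying assumption that keeps this kind of map cheap to maintain; repeated consistent readings push a cell's probability toward 0 or 1, while unobserved cells stay at the prior.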
Kuipers and Levitt (1988) were among the first to propose topological rather than metric mapping, motivated by models of human spatial cognition. A seminal paper by Lu and Milios (1997) recognized the sparseness of the simultaneous localization and mapping problem, which gave rise to the development of nonlinear optimization techniques by Konolige (2004) and Montemerlo and Thrun (2004), as well as hierarchical methods by Bosse et al. (2004). Shatkay and Kaelbling (1997) and Thrun et al. (1998) introduced the EM algorithm into the field of robotic mapping for data association. An overview of probabilistic mapping methods can be found in (Thrun et al., 2005).
Early mobile robot localization techniques are surveyed by Borenstein et al. (1996). Although Kalman filtering was well known as a localization method in control theory for decades, the general probabilistic formulation of the localization problem did not appear in the AI literature until much later, through the work of Tom Dean and colleagues (Dean et al., 1990) and of Simmons and Koenig (1995). The latter work introduced the term Markov localization. The first real-world application of this technique was by Burgard et al. (1999), through a series of robots that were deployed in museums. Monte Carlo localization based on particle filters was developed by Fox et al. (1999) and is now widely used. The Rao-Blackwellized particle filter combines particle filtering for robot localization with exact filtering for map building (Murphy and Russell, 2001; Montemerlo et al., 2002).
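As a rough illustration of Monte Carlo localization, the following Python sketch runs a particle filter for a robot moving along a one-dimensional corridor. Everything specific here is an assumption made for the example rather than a description of any published system: the corridor length, the noise levels, the sensor (a noisy reading of the distance to a wall at position 0), and the function names monte_carlo_localization and simulate.

```python
import math
import random


def monte_carlo_localization(controls, measurements, n_particles=500,
                             corridor_length=10.0, motion_noise=0.1,
                             sensor_noise=0.5, seed=0):
    """Toy 1D particle filter; returns the posterior-mean position estimate."""
    rng = random.Random(seed)
    # Start with no knowledge: particles spread uniformly along the corridor.
    particles = [rng.uniform(0.0, corridor_length) for _ in range(n_particles)]
    for u, z in zip(controls, measurements):
        # Prediction: apply the commanded motion plus sampled motion noise.
        particles = [p + u + rng.gauss(0.0, motion_noise) for p in particles]
        # Correction: weight each particle by the Gaussian likelihood of z.
        # The tiny constant keeps the total weight strictly positive.
        weights = [math.exp(-0.5 * ((z - p) / sensor_noise) ** 2) + 1e-300
                   for p in particles]
        # Resampling: draw a new particle set in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return sum(particles) / n_particles


def simulate(true_start=2.0, step=0.3, n_steps=20, sensor_noise=0.5, seed=1):
    """Generate controls and noisy range-to-wall readings for the toy corridor."""
    rng = random.Random(seed)
    x, controls, measurements = true_start, [], []
    for _ in range(n_steps):
        x += step
        controls.append(step)
        measurements.append(x + rng.gauss(0.0, sensor_noise))
    return x, controls, measurements


true_x, controls, measurements = simulate()
estimate = monte_carlo_localization(controls, measurements)
```

The particle set plays the role of the belief state: it starts spread over the whole corridor and concentrates around the true position as measurements arrive. A real system would use a map-based sensor model and a pose with more dimensions, but the predict, weight, and resample loop has the same shape.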
A great deal of early work on motion planning focused on geometric algorithms for deterministic and fully observable motion planning problems. The PSPACE-hardness of robot motion planning was shown in a seminal paper by Reif (1979). The configuration space representation is due to Lozano-Perez (1983). A series of papers by Schwartz and Sharir on what they called piano movers problems (Schwartz et al., 1987) was highly influential.
Recursive cell decomposition for configuration space planning was originated in the work of Brooks and Lozano-Perez (1985) and improved significantly by Zhu and Latombe (1991). The earliest skeletonization algorithms were based on Voronoi diagrams (Rowat, 1979) and visibility graphs (Wesley and Lozano-Perez, 1979). Guibas et al. (1992) developed efficient techniques for calculating Voronoi diagrams incrementally, and Choset (1996) generalized Voronoi diagrams to broader motion planning problems. John Canny (1988) established the first singly exponential algorithm for motion planning.
The seminal text by Latombe (1991) covers a variety of approaches to motion planning, as do the texts by Choset et al. (2005) and LaValle (2006). Kavraki et al. (1996) developed the theory of probabilistic roadmaps. Kuffner and LaValle (2000) developed rapidly exploring random trees (RRTs).
Involving optimization in geometric motion planning began with elastic bands (Quinlan and Khatib, 1993), which refine paths when the configuration-space obstacles change. Ratliff et al. (2009) formulated the idea as the solution to an optimal control problem, allowing the initial trajectory to start in collision, and deforming it by mapping workspace obstacle gradients via the Jacobian into the configuration space. Schulman et al. (2013) proposed a practical second-order alternative.
The control of robots as dynamical systems, whether for manipulation or navigation, has generated a vast literature. While this chapter explained the basics of trajectory tracking control and optimal control, it left out entire subfields, including adaptive control, robust control, and Lyapunov analysis. Rather than assuming everything about the system is known a priori, adaptive control aims to adapt the dynamics parameters and/or the control law online. Robust control, on the other hand, aims to design controllers that perform well in spite of uncertainty and external disturbances.
Lyapunov analysis was originally developed in the 1890s for the stability analysis of general nonlinear systems, but it was not until the early 1930s that control theorists realized its true potential. With the development of optimization methods, Lyapunov analysis was extended to control barrier functions, which lend themselves nicely to modern optimization tools. These methods are widely used in modern robotics for real-time controller design and safety analysis.
Crucial works in robotic control include a trilogy on impedance control by Hogan (1985) and a general study of robot dynamics by Featherstone (1987). Dean and Wellman (1991) were among the first to try to tie together control theory and AI planning systems. Three classic textbooks on the mathematics of robot manipulation are due to Paul (1981), Craig (1989), and Yoshikawa (1990). Control for manipulation is covered by Murray (2017).
The area of grasping is also important in robotics; the problem of determining a stable grasp is quite difficult (Mason and Salisbury, 1985). Competent grasping requires touch sensing, or haptic feedback, to determine contact forces and detect slip (Fearing and Hollerbach, 1985). Understanding how to grasp the wide variety of objects in the world is a daunting task. Bousmalis et al. (2017) describe a system that combines real-world experimentation with simulations guided by sim-to-real transfer to produce robust grasping.
Potential-field control, which attempts to solve the motion planning and control problems simultaneously, was developed for robotics by Khatib (1986). In mobile robotics, this idea was viewed as a practical solution to the collision avoidance problem, and was later extended into an algorithm called vector field histograms by Borenstein (1991).
ILQR is currently widely used at the intersection of motion planning and control and is due to Li and Todorov (2004). It is a variant of the much older differential dynamic programming technique (Jacobson and Mayne, 1970).
Fine-motion planning with limited sensing was investigated by Lozano-Perez et al. (1984) and Canny and Reif (1987). Landmark-based navigation (Lazanas and Latombe, 1992) uses many of the same ideas in the mobile robot arena. Navigation functions, the robotics version of a control policy for deterministic MDPs, were introduced by Koditschek (1987). Key work applying POMDP methods (Section 17.4) to motion planning under uncertainty in robotics is due to Pineau et al. (2003) and Roy et al. (2005).
Reinforcement learning in robotics took off with the seminal work by Bagnell and Schneider (2001) and Ng et al. (2003), who developed the paradigm in the context of autonomous helicopter control. Kober et al. (2013) offers an overview of how reinforcement learning changes when applied to the robotics problem. Many of the techniques implemented on physical systems build approximate dynamics models, dating back to locally weighted linear models due to Atkeson et al. (1997). But policy gradients played their role as well, enabling (simplified) humanoid robots to walk (Tedrake et al., 2004), or a robot arm to hit a baseball (Peters and Schaal, 2008).
Levine et al. (2016) demonstrated the first deep reinforcement learning application on a real robot. At the same time, model-free RL in simulation was being extended to continuous domains (Schulman et al., 2015a; Heess et al., 2016; Lillicrap et al., 2015). Other work scaled up physical data collection massively to showcase the learning of grasps and dynamics models (Pinto and Gupta, 2016; Agrawal et al., 2017; Levine et al., 2018). Transfer from simulation to reality or sim-to-real (Sadeghi and Levine, 2016; Andrychowicz et al., 2018a), meta-learning (Finn et al., 2017), and sample-efficient model-free reinforcement learning (Andrychowicz et al., 2018b) are active areas of research.
Early methods for predicting human actions made use of filtering approaches (Madhavan and Schlenoff, 2003), but seminal work by Ziebart et al. (2009) proposed prediction by modeling people as approximately rational agents. Sadigh et al. (2016) captured how these predictions should actually depend on what the robot decides to do, building toward a game-theoretic setting. For collaborative settings, Sisbot et al. (2007) pioneered the idea of accounting for what people want in the robot’s cost function. Nikolaidis and Shah (2013) decomposed collaboration into learning how the human will act, but also learning how the human wants the robot to act, both achievable from demonstrations. For learning from demonstration see Argall et al. (2009). Akgun et al. (2012) and Sefidgar et al. (2017) studied teaching by end users rather than by experts.
Tellex et al. (2011) showed how robots can infer what people want from natural language instructions. Finally, not only do robots need to infer what people want and plan on doing, but people too need to make the same inferences about robots. Dragan et al. (2013) incorporated a model of the human’s inferences into robot motion planning.
The field of human-robot interaction is much broader than what we covered in this chapter, which focused primarily on the planning and learning aspects. Thomaz et al. (2016) provides a survey of interaction more broadly from a computational perspective. Ross et al. (2011) describe the DAGGER system.
The topic of software architectures for robots engenders much religious debate. The good old-fashioned AI candidate (the three-layer architecture) dates back to the design of Shakey and is reviewed by Gat (1998). The subsumption architecture is due to Brooks (1986), although similar ideas were developed independently by Braitenberg, whose book, Vehicles (1984), describes a series of simple robots based on the behavioral approach.
The success of Brooks’s six-legged walking robot was followed by many other projects. Connell, in his Ph.D. thesis (1989), developed an entirely reactive mobile robot that was capable of retrieving objects. Extensions of the paradigm to multirobot systems can be found in work by Parker (1996) and Mataric (1997). GRL (Horswill, 2000) and COLBERT (Konolige, 1997) abstract the ideas of concurrent behavior-based robotics into general robot control languages. Arkin (1998) surveys some of the most popular approaches in this field.
Two early textbooks, by Dudek and Jenkin (2000) and by Murphy (2000), cover robotics generally. More recent overviews are due to Bekey (2008) and Lynch and Park (2017). An excellent book on robot manipulation addresses advanced topics such as compliant motion (Mason, 2001). Robot motion planning is covered in Choset et al. (2005) and LaValle (2006). Thrun et al. (2005) introduces probabilistic robotics. The Handbook of Robotics (Siciliano and Khatib, 2016) is a massive, comprehensive overview of all of robotics.
The premier conference for robotics is Robotics: Science and Systems Conference, followed by the IEEE International Conference on Robotics and Automation. Human-Robot Interaction is the premier venue for interaction. Leading robotics journals include IEEE Robotics and Automation, the International Journal of Robotics Research, and Robotics and Autonomous Systems.
CHAPTER 27
PHILOSOPHY, ETHICS, AND SAFETY OF AI
In which we consider the big questions around the meaning of AI, how we can ethically develop and apply it, and how we can keep it safe.
Philosophers have been asking big questions for a long time: How do minds work? Is it possible for machines to act intelligently in the way that people do? Would such machines have real, conscious minds?
To these, we add new ones: What are the ethical implications of intelligent machines in day-to-day use? Should machines be allowed to decide to kill humans? Can algorithms be fair and unbiased? What will humans do if machines can do all kinds of work? And how do we control machines that may become more intelligent than us?
27.1 The Limits of AI
In 1980, philosopher John Searle introduced a distinction between weak AI (the idea that machines could act as if they were intelligent) and strong AI (the assertion that machines that do so are actually consciously thinking, not just simulating thinking). Over time the definition of strong AI shifted to refer to what is also called “human-level AI” or “general AI”: programs that can solve an arbitrarily wide variety of tasks, including novel ones, and do so as well as a human.
Critics of weak AI who objected to the very possibility of intelligent behavior in machines now appear as shortsighted as Simon Newcomb, who in October 1903 wrote “aerial flight is one of the great class of problems with which man can never cope,” just two months before the Wright brothers’ flight at Kitty Hawk. The rapid progress of recent years does not, however, prove that there can be no limits to what AI can achieve. Alan Turing (1950), the first person to define AI, was also the first to raise possible objections to AI, foreseeing almost all the ones subsequently raised by others.
27.1.1 The argument from informality
Turing’s “argument from informality of behavior” says that human behavior is far too com
plex to be captured by any formal set of rules—humans must be using some informal guide
lines that (the argument claims) could never be captured in a formal set of rules and thus
could never be codified in a computer program. A key proponent of this view was Hubert Dreyfus, who produced a series of influential critiques of artificial intelligence:
What Computers
Can’t Do (1972), the sequel What
weak Al
Strong Al
982
Chapter 27 Philosophy, Ethics, and Safety of AT Computers Still Can’t Do (1992), and, with his brother Stuart, Mind Over Machine (1986).
Good OldFashioned AT'(GOFAI)
Similarly, philosopher Kenneth Sayre (1993) said “Artificial intelligence pursued within the cult of computationalism stands not even a ghost of a chance of producing durable results.” The technology they criticize came to be called Good Old-Fashioned AI (GOFAI). GOFAI corresponds to the simplest logical agent design described in Chapter 7, and we saw there that it is indeed difficult to capture every contingency of appropriate behavior in a set of necessary and sufficient logical rules; we called that the qualification problem. But as we saw in Chapter 12, probabilistic reasoning systems are more appropriate for open-ended domains, and as we saw in Chapter 21, deep learning systems do well on a variety of “informal” tasks. Thus, the critique is not addressed against computers per se, but rather against one particular style of programming them with logical rules—a style that was popular in the 1980s but has been eclipsed by new approaches.
One of Dreyfus’s strongest arguments is for situated agents rather than disembodied logical inference engines. An agent whose understanding of “dog” comes only from a limited set of logical sentences such as “Dog(x) ⇒ Mammal(x)” is at a disadvantage compared to an agent that has watched dogs run, has played fetch with them, and has been licked by one. As philosopher Andy Clark (1998) says, “Biological brains are first and foremost the control systems for biological bodies. Biological bodies move and act in rich real-world surroundings.” According to Clark, we are “good at frisbee, bad at logic.”
The embodied cognition approach claims that it makes no sense to consider the brain separately: cognition takes place within a body, which is embedded in an environment. We need to study the system as a whole; the brain’s functioning exploits regularities in its environment, including the rest of its body. Under the embodied cognition approach, robotics, vision, and other sensors become central, not peripheral.
Overall, Dreyfus saw areas where AI did not have complete answers and said that AI is therefore impossible; we now see many of these same areas undergoing continued research and development leading to increased capability, not impossibility.
27.1.2
The argument from disability
The “argument from disability” makes the claim that “a machine can never do X.” As examples of X, Turing lists the following:
Be kind, resourceful, beautiful, friendly, have initiative, have a sense of humor, tell right from wrong, make mistakes, fall in love, enjoy strawberries and cream, make someone fall in love with it, learn from experience, use words properly, be the subject of its own thought, have as much diversity of behavior as man, do something really new.
In retrospect, some of these are rather easy—we’re all familiar with computers that “make mistakes.” Computers with metareasoning capabilities (Chapter 5) can examine their own computations, thus being the subject of their own reasoning. A century-old technology has the proven ability to “make someone fall in love with it”—the teddy bear. Computer chess expert David Levy predicts that by 2050 people will routinely fall in love with humanoid robots.
As for a robot falling in love, that is a common theme in fiction,¹ but there has been only limited academic speculation on the subject (Kim et al., 2007). Computers have done things that are “really new,” making significant discoveries in astronomy, mathematics, chemistry, mineralogy, biology, computer science, and other fields, and creating new forms of art through style transfer (Gatys et al., 2016). Overall, programs exceed human performance in some tasks and lag behind on others. The one thing that it is clear they can’t do is be exactly human.
¹ For example, the ballet Coppélia (1870), the novel Do Androids Dream of Electric Sheep? (1968), and the movies AI (2001), Wall-E (2008), and Her (2013).
Section 27.1 The Limits of AI
27.1.3
The mathematical objection
Turing (1936) and Gödel (1931) proved that certain mathematical questions are in principle unanswerable by particular formal systems. Gödel’s incompleteness theorem (see Section 9.5) is the most famous example of this. Briefly, for any formal axiomatic framework F powerful enough to do arithmetic, it is possible to construct a so-called Gödel sentence G(F) with the following properties:
• G(F) is a sentence of F, but cannot be proved within F.
• If F is consistent, then G(F) is true.
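The two properties can also be written compactly in symbols. This is only a sketch in standard proof-theoretic notation that the text itself does not introduce: here ℒ(F) is assumed to denote the language of F, ⊢ provability within F, Con(F) the statement that F is consistent, and ℕ ⊨ truth in the standard model of arithmetic.

```latex
\[
  G(F) \in \mathcal{L}(F), \qquad
  F \nvdash G(F), \qquad
  \mathrm{Con}(F) \;\Longrightarrow\; \mathbb{N} \models G(F).
\]
```

The middle claim is normally read under the same consistency assumption as the last one, since an inconsistent system proves every sentence of its language.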
Philosophers such as J. R. Lucas (1961) have claimed that this theorem shows that machines are mentally inferior to humans, because machines are formal systems that are limited by the incompleteness theorem—they cannot establish the truth of their own Gödel sentence—while humans have no such limitation. This has caused a lot of controversy, spawning a vast literature, including two books by the mathematician/physicist Sir Roger Penrose (1989, 1994). Penrose repeats Lucas’s claim with some fresh twists, such as the hypothesis that humans are different because their brains operate by quantum gravity—a theory that makes multiple false predictions about brain physiology.
We will examine three of the problems with Lucas’s claim. First, an agent should not be ashamed that it cannot establish the truth of some sentence while other agents can. Consider the following sentence:
Lucas cannot consistently assert that this sentence is true.
If Lucas asserted this sentence, then he would be contradicting himself, so therefore Lucas cannot consistently assert it, and hence it is true. We have thus demonstrated that there is a true sentence that Lucas cannot consistently assert while other people (and machines) can. But that does not make us think any less of Lucas.
Second, Gödel’s incompleteness theorem and related results apply to mathematics, not to computers. No entity—human or machine—can prove things that are impossible to prove. Lucas and Penrose falsely assume that humans can somehow get around these limits, as when Lucas (1976) says “we must assume our own consistency, if thought is to be possible at all.” But this is an unwarranted assumption: humans are notoriously inconsistent. This is certainly true for everyday reasoning, but it is also true for careful mathematical thought. A famous example is the four-color map problem. Alfred Kempe (1879) published a proof that was widely accepted for 11 years until Percy Heawood (1890) pointed out a flaw.
Third, Gödel’s incompleteness theorem technically applies only to formal systems that are powerful enough to do arithmetic. This includes Turing machines, and Lucas’s claim is in part based on the assertion that computers are equivalent to Turing machines. This is not quite true. Turing machines are infinite, whereas computers (and brains) are finite, and any computer can therefore be described as a (very large) system in propositional logic, which is not subject to Gödel’s incompleteness theorem. Lucas assumes that humans can “change their minds” while computers cannot, but that is also false—a computer can retract a conclusion after new evidence or further deliberation; it can upgrade its hardware; and it can change its decision-making processes with machine learning or software rewriting.
27.1.4
Measuring AI
Alan Turing, in his famous paper “Computing Machinery and Intelligence” (1950), suggested that instead of asking whether machines can think, we should ask whether machines can pass a behavioral test, which has come to be called the Turing test. The test requires a program to have a conversation (via typed messages) with an interrogator for five minutes. The interrogator then has to guess if the conversation is with a program or a person; the program passes the test if it fools the interrogator 30% of the time. To Turing, the key point was not the exact details of the test, but instead the idea of measuring intelligence by performance on some kind of open-ended behavioral task, rather than by philosophical speculation.
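The pass criterion described above is simple enough to state in code. What follows is a minimal sketch, not a standard benchmark implementation: the function names, the encoding of each interrogator's guess as a boolean, and the reading of “30% of the time” as “in at least 30% of the conversations” are all assumptions made for illustration.

```python
from fractions import Fraction

# Assumed reading of the criterion: the program passes if it is taken
# for a person in at least 30% of the conversations.
PASS_THRESHOLD = Fraction(3, 10)


def fooled_fraction(judged_human):
    """Return the fraction of conversations in which the program was taken for a person.

    judged_human: sequence of booleans, one per conversation; True means the
    interrogator guessed "person" (a hypothetical encoding).
    """
    if not judged_human:
        raise ValueError("need at least one conversation")
    return Fraction(sum(bool(v) for v in judged_human), len(judged_human))


def passes_turing_test(judged_human, threshold=PASS_THRESHOLD):
    """Apply the threshold to a batch of interrogator verdicts."""
    return fooled_fraction(judged_human) >= threshold


if __name__ == "__main__":
    # Made-up verdicts from ten five-minute conversations.
    verdicts = [True, False, False, True, False, False, True, False, False, False]
    print(fooled_fraction(verdicts), passes_turing_test(verdicts))
```

Exact fractions are used so that a result sitting right on the 30% boundary is compared without floating-point rounding.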
Nevertheless, Turing conjectured that by the year 2000 a computer with a storage of a billion units could pass the test, but here we are on the other side of 2000, and we still can’t agree whether any program has passed. Many people have been fooled when they didn’t know they might be chatting with a computer. The ELIZA program and Internet chatbots such as MGONZ (Humphrys, 2008) and NATACHATA (Jonathan et al., 2009) fool their correspondents repeatedly, and the chatbot CYBERLOVER has attracted the attention of law enforcement because of its penchant for tricking fellow chatters into divulging enough personal information that their identity can be stolen.
In 2014, a chatbot called Eugene Goostman fooled 33% of the untrained amateur judges in a Turing test. The program claimed to be a boy from Ukraine with limited command of English; this helped explain its grammatical errors. Perhaps the Turing test is really a test of human gullibility. So far no well-trained judge has been fooled (Aaronson, 2014).
Turing test competitions have led to better chatbots, but have not been a focus of research within the AI community. Instead, AI researchers who crave competition are more likely to concentrate on playing chess or Go or StarCraft II, or taking an 8th grade science exam, or identifying objects in images. In many of these competitions, programs have reached or surpassed human-level performance, but that doesn’t mean the programs are human-like outside the specific task. The point is to improve basic science and technology and to provide useful tools, not to fool judges.
27.2
Can Machines Really Think?
Some philosophers claim that a machine that acts intelligently would not be actually thinking, but would be only a simulation of thinking. But most AI researchers are not concerned with the distinction, and the computer scientist Edsger Dijkstra (1984) said that “The question of whether Machines Can Think ... is about as relevant as the question of whether Submarines Can Swim.” The American Heritage Dictionary’s first definition of swim is “To move through water by means of the limbs, fins, or tail,” and most people agree that submarines, being limbless, cannot swim. The dictionary also defines fly as “To move through the air by means of wings or winglike parts,” and most people agree that airplanes, having winglike parts, can fly. However, neither the questions nor the answers have any relevance to the design or capabilities of airplanes and submarines; rather they are about word usage in English. (The fact that ships do swim (“plavat”) in Russian amplifies this point.) English speakers have not yet settled on a precise definition for the word “think”—does it require “a brain” or just “brainlike parts”?
Again, the issue was addressed by Turing. He notes that we never have any direct evidence about the internal mental states of other humans—a kind of mental solipsism. Nevertheless, Turing says, “Instead of arguing continually over this point, it is usual to have the polite convention that everyone thinks.” Turing argues that we would also extend the polite convention to machines, if only we had experience with ones that act intelligently. However, now that we do have some experience, it seems that our willingness to ascribe sentience depends at least as much on humanoid appearance and voice as on pure intelligence.
27.2.1
The Chinese room
The philosopher John Searle rejects the polite convention. His famous Chinese room argument (Searle, 1990) goes as follows: Imagine a human, who understands only English, inside a room that contains a rule book, written in English, and various stacks of paper. Pieces of paper containing indecipherable symbols are slipped under the door to the room. The human follows the instructions in the rule book, finding symbols in the stacks, writing symbols on new pieces of paper, rearranging the stacks, and so on. Eventually, the instructions will cause one or more symbols to be transcribed onto a piece of paper that is passed back to the outside world. From the outside, we see a system that is taking input in the form of Chinese sentences and generating fluent, intelligent Chinese responses.
Searle then argues: it is given that the human does not understand Chinese. The rule book and the stacks of paper, being just pieces of paper, do not understand Chinese. Therefore, there is no understanding of Chinese. And Searle says that the Chinese room is doing the same thing that a computer would do, so therefore computers generate no understanding.
Searle (1980) is a proponent of biological naturalism, according to which mental states are high-level emergent features that are caused by low-level physical processes in the neurons, and it is the (unspecified) properties of the neurons that matter: according to Searle’s biases, neurons have “it” and transistors do not. There have been many refutations of Searle’s argument, but no consensus. His argument could equally well be used (perhaps by robots) to argue that a human cannot have true understanding; after all, a human is made out of cells, the cells do not understand, therefore there is no understanding. In fact, that is the plot of Terry Bisson’s (1990) science fiction story They’re Made Out of Meat, in which alien robots explore Earth and can’t believe that hunks of meat could possibly be sentient. How they can be remains a mystery.
27.2.2
Consciousness and qualia
Running through all the debates about strong AI is the issue of consciousness: awareness of the outside world, and of the self, and the subjective experience of living. The technical term for the intrinsic nature of experiences is qualia (from the Latin word meaning, roughly, “of what kind”). The big question is whether machines can have qualia. In the movie 2001, when astronaut David Bowman is disconnecting the “cognitive circuits” of the HAL 9000 computer, it says “I’m afraid, Dave. Dave, my mind is going. I can feel it.” Does HAL actually have feelings (and deserve sympathy)? Or is the reply just an algorithmic response, no different from “Error 404: not found”?
There is a similar question for animals: pet owners are certain that their dog or cat has consciousness, but not all scientists agree. Crickets change their behavior based on temperature, but few people would say that crickets experience the feeling of being warm or cold.
One reason that the problem of consciousness is hard is that it remains ill-defined, even after centuries of debate. But help may be on the way. Recently philosophers have teamed with neuroscientists under the auspices of the Templeton Foundation to start a series of experiments that could resolve some of the issues. Advocates of two leading theories of consciousness (global workspace theory and integrated information theory) have agreed that the experiments could confirm one theory over the other—a rarity in philosophy.
Alan Turing (1950) concedes that the question of consciousness is a difficult one, but denies that it has much relevance to the practice of AI: “I do not wish to give the impression that I think there is no mystery about consciousness ... But I do not think these mysteries necessarily need to be solved before we can answer the question with which we are concerned in this paper.” We agree with Turing—we are interested in creating programs that behave intelligently. Individual aspects of consciousness—awareness, self-awareness, attention—can be programmed and can be part of an intelligent machine. The additional project of making a machine conscious in exactly the way humans are is not one that we are equipped to take on. We do agree that behaving intelligently will require some degree of awareness, which will differ from task to task, and that tasks involving interaction with humans will require a model of human subjective experience.
In the matter of modeling experience, humans have a clear advantage over machines, because they can use their own subjective apparatus to appreciate the subjective experience of others. For example, if you want to know what it’s like when someone hits their thumb with a hammer, you can hit your thumb with a hammer. Machines have no such capability—although unlike humans, they can run each other’s code.
27.3
The Ethics of AI
Given that AI is a powerful technology, we have a moral obligation to use it well, to promote the positive aspects and avoid or mitigate the negative ones.
The positive aspects are many. For example, AI can save lives through improved medical diagnosis, new medical discoveries, better prediction of extreme weather events, and safer driving with driver assistance and (eventually) self-driving technologies. There are also many opportunities to improve lives. Microsoft’s AI for Humanitarian Action program applies AI to recovering from natural disasters, addressing the needs of children, protecting refugees, and promoting human rights. Google’s AI for Social Good program supports work on rainforest protection, human rights jurisprudence, pollution monitoring, measurement of fossil fuel emissions, crisis counseling, news fact checking, suicide prevention, recycling, and other issues. The University of Chicago’s Center for Data Science for Social Good applies machine learning to problems in criminal justice, economic development, education, public health, energy, and environment.
AI applications in crop management and food production help feed the world. Optimization of business processes using machine learning will make businesses more productive, increasing wealth and providing more employment. Automation can replace the tedious and dangerous tasks that many workers face, and free them to concentrate on more interesting aspects. People with disabilities will benefit from AI-based assistance in seeing, hearing, and mobility. Machine translation already allows people from different cultures to communicate. Software-based AI solutions have near zero marginal cost of production, and so have the potential to democratize access to advanced technology (even as other aspects of software have the potential to centralize power).
Despite these many positive aspects, we shouldn’t ignore the negatives. Many new technologies have had unintended negative side effects: nuclear fission brought Chernobyl and
the th