Figure 10. “Backing up” the static evaluations of proposed moves in a game tree. From the vertex at the left, representing the present position in a board game, radiate three branches, representing the player's proposed moves. Each of these might be countered by a variety of opponent moves, and so on. According to some program, a finite tree is generated. Then the worth to the player of each terminal board position is estimated (see text). If the opponent has the same values, he will choose to minimize the score, while the player will always try to maximize. The heavy lines show how this minimaxing process backs up until a choice is determined for the present position.
The full tree for chess has the order of 10^120 branches, beyond the reach of any man or computer. There is a fundamental heuristic exchange between the effectiveness of the evaluation function and the extent of the tree. A very weak evaluation (e.g., one which just compares the players' values of pieces) would yield a devastating game if the machine could explore all continuations out to, say, 20 levels. But exploring only 6 levels, roughly within the range of our presently largest computers, would probably not give a brilliant game; less exhaustive strategies, perhaps along the lines of Newell, Shaw, and Simon (1958b), would be more profitable.
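The backing-up process of Fig. 10 can be sketched in a few lines of code. This is an illustrative toy, not any program from the text: the game tree, the depth bound, and the static evaluation (here just the leaf scores themselves) are all assumptions.

```python
# Sketch of the "backing up" of Fig. 10: depth-limited minimax over a game
# tree.  The tree, the static evaluation, and the depth bound are invented
# for illustration.

def minimax(position, depth, maximizing, moves, evaluate):
    """Back up static evaluations from a depth-limited tree."""
    children = moves(position)
    if depth == 0 or not children:
        return evaluate(position)          # static evaluation at the frontier
    values = (minimax(c, depth - 1, not maximizing, moves, evaluate)
              for c in children)
    return max(values) if maximizing else min(values)

# Toy game: a position is a list of child positions; leaves carry scores.
tree = [[3, 12], [2, 4], [14, 1]]
score = minimax(tree, 2, True,
                moves=lambda p: p if isinstance(p, list) else [],
                evaluate=lambda p: p)
print(score)   # -> 3: the player picks the branch whose minimized value is largest
```

The heavy lines of the figure correspond to the `max`/`min` choices made at each level as the recursion unwinds.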
substitute for the absolute "win, lose, or draw” criterion some other "static” way of evaluating nonterminal positions.18 (See Fig. 10.) Perhaps the simplest scheme is to use a weighted sum of some selected set of "property” functions of the positions—mobility, advancement, center control, and the like. This is done in Samuel's program, and in most of its predecessors. Associated with this is a multiple-simultaneous-optimizer method for discovering a good coefficient assignment (using the correlation technique noted in Sec. IIIA). But the source of reinforcement signals in
18 In some problems the backing-up process can be handled in closed analytic form so that one may be able to use such methods as Bellman's "Dynamic Programming" (1957). Freimer (1960) gives some examples for which limited "lookahead" doesn't work.
Samuel (1959a) is novel. One cannot afford to play out one or more entire games for each single learning step. Samuel measures instead, for each move, the difference between what the evaluation function yields directly for a position and what it predicts on the basis of an extensive continuation exploration, i.e., backing up. The sign of this error, "Delta," is used for reinforcement; thus the system may learn something at each move.
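Samuel's Delta scheme might be sketched as follows, assuming the weighted-sum evaluation of property functions described above. The property names, the learning rate, and the details of the update rule are invented for illustration; only the use of the sign of Delta follows the text.

```python
# Hedged sketch of Samuel-style learning: the evaluation is a weighted sum of
# position "properties," and the sign of Delta -- the backed-up value minus
# the direct static value -- nudges each coefficient.  Property names,
# weights, and the learning rate are invented.

weights = {"mobility": 1.0, "advancement": 1.0, "center": 1.0}

def static_value(props):
    return sum(weights[k] * props[k] for k in weights)

def learn_step(props, backed_up_value, rate=0.1):
    delta = backed_up_value - static_value(props)
    sign = (delta > 0) - (delta < 0)          # only the sign of Delta is used
    for k in weights:                          # credit each property in
        weights[k] += rate * sign * props[k]   # proportion to its contribution

position = {"mobility": 2.0, "advancement": 1.0, "center": 0.0}
learn_step(position, backed_up_value=5.0)   # lookahead says we undervalued it
print(weights)
```

Because no game need be played out, such a step can be taken at every move rather than once per game.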
D. The Basic Credit-assignment Problem for Complex Reinforcement Learning Systems

In playing a complex game such as chess or checkers, or in writing a computer program, one has a definite success criterion: the game is won or lost. But in the course of play, each ultimate success (or failure) is associated with a vast number of internal decisions. If the run is successful, how can we assign credit for the success among the multitude of decisions? As Newell noted,
It is extremely doubtful whether there is enough information in "win, lose, or draw" when referred to the whole play of the game to permit any learning at all over available time scales. . . . For learning to take place, each play of the game must yield much more information. This is . . . achieved by breaking the problem into components. The unit of success is the goal. If a goal is achieved, its subgoals are reinforced; if not they are inhibited. (Actually, what is reinforced is the transformation rule that provided the subgoal.) This also is true of the other kinds of structure: every tactic that is created provides information about the success or failure of tactic search rules; every opponent's action provides information about success or failure of likelihood inferences; and so on. The amount of information relevant to learning increases directly with the number of mechanisms in the chess-playing machine.20

We are in complete agreement with Newell on this approach to the problem.21
It is my impression that many workers in the area of "self-organizing" systems and "random neural nets" do not feel the urgency of this problem. Suppose that one million decisions are involved in a complex task (such as winning a chess game). Could we assign to each decision element one-millionth of the credit for the completed task? In certain special situations we can do just this, e.g., in the machines of Rosenblatt (1958), Roberts (1960), and Farley and Clark (1954), etc., where the connections being reinforced are to a sufficient degree independent. But the problem-solving ability is correspondingly weak.
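The independent-connection case can be illustrated with a minimal perceptron-style trainer in the spirit of the machines cited: a single shared error signal, with each connection corrected on its own. The task (learning AND) and all constants are assumptions, not details of any of the cited machines.

```python
# Minimal sketch of reinforcement over independent connections: each weight
# is corrected on its own, using only the shared error signal and its own
# input.  The AND task, learning rate, and epoch count are illustrative.

def train_and_gate(epochs=20, rate=0.5):
    w = [0.0, 0.0]
    bias = 0.0
    samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + bias > 0 else 0
            err = target - out              # one global error signal...
            w[0] += rate * err * x1         # ...but each connection takes
            w[1] += rate * err * x2         # its share independently
            bias += rate * err
    return w, bias

w, bias = train_and_gate()
gate = lambda x1, x2: 1 if w[0] * x1 + w[1] * x2 + bias > 0 else 0
print([gate(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
```

This succeeds precisely because the connections' contributions are summed on one level; the per-connection apportionment of credit breaks down once decisions are arranged in hierarchies.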
19 It should be noted that Samuel (1959a) also describes a rather successful checker-playing program based on recording and retrieving information about positions encountered in the past, a less abstract way of exploiting past experience. Samuel's work is notable in the variety of experiments that were performed, with and without various heuristics. This gives an unusual opportunity to really find out how different heuristic methods compare. More workers should choose (other things being equal) problems for which such variations are practicable.
20 See p. 108 of Newell (1955).
21 See also the discussion in Samuel (p. 22, 1959a) on assigning credit for a change in "Delta."
For more complex problems, with decisions in hierarchies (rather than summed on the same level) and with increments small enough to assure probable convergence, the running times would become fantastic. For complex problems we will have to define “success” in some rich local sense. Some of the difficulty may be evaded by using carefully graded "training sequences” as described in the following section.
FRIEDBERG'S PROGRAM-WRITING PROGRAM An important example of comparative failure in this credit-assignment matter is provided by the program of Friedberg (1958, 1959) to solve program-writing problems. The problem here is to write programs for a (simulated) very simple digital computer. A simple problem is assigned, e.g., "compute the AND of two bits in storage and put the result in an assigned location." A generating device produces a random (64-instruction) program. The program is run and its success or failure is noted. The success information is used to reinforce individual instructions (in fixed locations) so that each success tends to increase the chance that the instructions of successful programs will appear in later trials. (We lack space for details of how this is done.) Thus the program tries to find "good" instructions, more or less independently, for each location in program memory. The machine did learn to solve some extremely simple problems. But it took of the order of 1000 times longer than pure chance would expect. In part II of Friedberg et al. (1959), this failure is discussed, and attributed in part to what we called (Sec. IC) the "Mesa phenomena." In changing just one instruction at a time the machine had not taken large enough steps in its search through program space.
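A toy version of the per-location reinforcement scheme might look like this. The instruction set, program length, scoring rule, and all constants are invented; the sketch shows only the mechanism of reinforcing, independently, the individual instructions of successful programs.

```python
import random

# Toy re-creation of Friedberg-style per-location instruction reinforcement.
# Everything here -- the five-instruction repertoire, the 4-instruction
# programs standing in for Friedberg's 64, and the reward of +1 -- is
# invented; only the mechanism follows the text.

random.seed(0)
OPS = ["AND", "OR", "NOT_A", "COPY_A", "COPY_B"]
LENGTH = 4
instr_weights = [{op: 1.0 for op in OPS} for _ in range(LENGTH)]

def run(program, a, b):
    acc = 0
    for op in program:                       # last write to acc is the output
        if op == "AND":      acc = a & b
        elif op == "OR":     acc = a | b
        elif op == "NOT_A":  acc = 1 - a
        elif op == "COPY_A": acc = a
        elif op == "COPY_B": acc = b
    return acc

def trial():
    program = [random.choices(OPS, weights=[w[op] for op in OPS])[0]
               for w in instr_weights]
    success = all(run(program, a, b) == (a & b)
                  for a in (0, 1) for b in (0, 1))
    if success:                              # reinforce each instruction of a
        for loc, op in enumerate(program):   # successful program independently
            instr_weights[loc][op] += 1.0
    return success

wins = sum(trial() for _ in range(2000))
print(wins)
```

Note that only the last instruction actually determines success here, yet every instruction of a winning program is reinforced equally: the scheme cannot tell which locations mattered, which is the credit-assignment difficulty at issue.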
The second paper goes on to discuss a sequence of modifications in the program generator and its reinforcement operators. With these, and with some "priming" (starting the machine off on the right track with some useful instructions), the system came to be only a little worse than chance. Friedberg et al. (1959) conclude that with these improvements "the generally superior performance of those machines with a success-number reinforcement mechanism over those without does serve to indicate that such a mechanism can provide a basis for constructing a learning machine." I disagree with this conclusion. It seems to me that each of the "improvements" can be interpreted as serving only to increase the step size of the search, that is, the randomness of the mechanism; this helps to avoid the Mesa phenomenon and thus approach chance behavior. But it certainly does not show that the "learning mechanism" is working; one would want at least to see some better-than-chance results before arguing this point. The trouble, it seems, is with credit-assignment. The credit for a working program can be assigned only to functional groups of instructions, e.g., subroutines, and as these operate in hierarchies we should not expect individual instruction reinforcement to work well.22 It seems surprising that it was not recognized in Friedberg et al. (1959) that the doubts raised earlier were probably justified! In the last section of Friedberg et al. (1959) we see some real success obtained by breaking the problem into parts and solving them sequentially. (This successful demonstration using division into subproblems does not use any reinforcement mechanism at all.) Some experiments of similar nature are reported in Kilburn, Grimsdale and Sumner (1959).
It is my conviction that no scheme for learning, or for pattern recognition, can have very general utility unless there are provisions for recursive, or at least hierarchical, use of previous results. We cannot expect a learning system to come to handle very hard problems without preparing it with a reasonably graded sequence of problems of growing difficulty. The first problem must be one which can be solved in reasonable time with the initial resources. The next must be capable of solution in reasonable time by using reasonably simple and accessible combinations of methods developed in the first, and so on. The only alternatives to this use of an adequate “training sequence” are (1) advanced resources, given initially, or (2) the fantastic exploratory processes found perhaps only in the history of organic evolution.23 And even there, if we accept the general view of Darlington (1958) who emphasizes the heuristic aspects of genetic systems, we must have developed early (in, e.g., the phenomena of meiosis and crossing-over) quite highly specialized mechanisms providing for the segregation of groupings related to solutions of subproblems. Recently, much effort has been devoted to the construction of training sequences in connection with programming “teaching machines.” Naturally, the psychological literature abounds with theories of how complex behavior is built up from simpler. In our own area, perhaps the work of Solomonoff (1957), while overly cryptic, shows the most thorough consideration of this dependency on training sequences.
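The economy of a graded training sequence can be made concrete with a toy search problem, wholly invented for illustration: blind search for an n-bit string costs on the order of 2^n trials, while a sequence of prefix problems of growing length, each seeded with the previous stage's solution, costs only a few trials per stage.

```python
import random

# Illustrative sketch of a graded "training sequence."  Guessing a 24-bit
# string by blind search needs ~2**24 trials; a sequence of problems of
# growing length, each building on the previous solution, stays cheap.
# The task and constants are invented for illustration.

random.seed(1)
TARGET = [random.randint(0, 1) for _ in range(24)]

def solved(guess):
    return guess == TARGET[:len(guess)]

def graded_search():
    solution, trials = [], 0
    for n in range(1, len(TARGET) + 1):      # problems of growing difficulty
        while True:                          # each stage extends the previous
            trials += 1                      # solution by one bit: a tiny
            candidate = solution + [random.randint(0, 1)]
            if solved(candidate):            # search, solvable in reasonable
                solution = candidate         # time with current resources
                break
    return solution, trials

solution, trials = graded_search()
print(trials)            # typically a few dozen trials in all
```

Each stage is "capable of solution in reasonable time" only because the results of the earlier stages are preserved and reused, which is the point of the training-sequence argument.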
22 See the introduction to Friedberg (1958) for a thoughtful discussion of the plausibility of the scheme.
23 It should, however, be possible to construct learning mechanisms which can select for themselves reasonably good training sequences (from an always complex environment) by prearranging a relatively slow development (or "maturation") of the system's facilities. This might be done by prearranging that the sequence of goals attempted by the primary Trainer match reasonably well, at each stage, the complexity of performance mechanically available to the pattern-recognition and other parts of the system. One might be able to do much of this by simply limiting the depth of hierarchical activity, perhaps only later permitting limited recursive activity.
IV. Problem-solving and Planning
The solution, by machine, of really complex problems will require a variety of administrative facilities. During the course of solving a problem, one becomes involved with a large assembly of interrelated subproblems. From these, at each stage, a very few must be chosen for investigation. This decision must be based on (1) estimates of relative difficulties and (2) estimates of centrality of the different candidates for attention. Following subproblem selection (for which several heuristic methods are proposed), one must choose methods appropriate to the selected problems. But for really difficult problems, even these step-by-step heuristics for reducing search will fail, and the machine must have resources for analyzing the problem structure in the large, in short, for "planning." A number of schemes for planning are discussed, among them the use of models: analogous, semantic, and abstract. Certain abstract models, "Character-Algebras," can be constructed by the machine itself, on the basis of experience or analysis. For concreteness, the discussion begins with a description of a simple but significant system (LT) which encounters some of these problems.

A. The "Logic Theory" Program of Newell, Shaw and Simon

It is not surprising that the testing grounds for early work on mechanical problem-solving have usually been areas of mathematics, or games, in which the rules are defined with absolute clarity. The "Logic Theory" machine of Newell and Simon (1956a, 1957a), called "LT" below, was a first attempt to prove theorems in logic, by frankly heuristic methods. Although the program was not by human standards a brilliant success (and did not surpass its designers), it stands as a landmark both in heuristic programming and also in the development of modern automatic programming.
The problem domain here is that of discovering proofs in the Russell-Whitehead system for the propositional calculus. That system is given as a set of (five) axioms and (three) rules of inference; the latter specify how certain transformations can be applied to produce new theorems from old theorems and axioms.
The LT program is centered around the idea of "working backward" to find a proof. Given a theorem T to be proved, LT searches among the axioms and previously established theorems for one from which T can be deduced by a single application of one of three simple "Methods" (which