STEPS TOWARD by Marvin Minsky Introduction A visitor to our planet might be puzzled about the role of con: puters in our technology. On the one hand, he would read and hear all about wonderful “mechanical brains” baffling their creators with prodigious intellectual performance. And he (or it) would be warned that these machines must be restrained, lest they overwhelm us by might, persuasion, or even by the revelation of truths too terrible to be borne. On the other hand, our visitor would find the machines being denounced, on all sides, for their slavish obedience, unimaginative literal interpretations, and incapacity for innovation or initiative; in short, for their inhuman dullness. Our visitor might remain puzzled if he set out to find, and judge for himself, these monsters. For he would find only a few machines (mostly “general-purpose” computers, programmed for the moment to behave according to some specification) doing things that might claim any real intellectual status. Some would be proving mathematical theorems of rather undistinguished character. A few machines might be playing certain games, occasionally defeating their designers. Some might be distinguishing between hand-printed letters. Is this enough to justify so much interest, iet alone deep concern? I believe that it is; that we are on the threshold of an era that will be strongly influenced, and quite possibly doninated, hy intelligent problem-solving machines. But our purpose is not to guess about what the future may bring; it is only to try to describe and explain what seem now to be our first steps toward the construction of "artificial intelligence.” 406 Along with the development of general-purpose computers, the past few years have seen an increase in effort toward the discovery and mechanization of problem-solving processes. Quite a number of papers have appeared describing theories or actual computer programs concerned with game playing, theorem proving, pattern recognition, and other domains which would seem to require some intelligence. The literature does not include any general discussion of the outstanding problems of this field. In this article, an attempt will be made to separate out, analyze, and find the relations between some of these problems. Analysis will be supported with enough examples from the literature to serve the introductory function of a review article, but there remains much relevant work not described here. This report is highly compressed, and therefore, cannot begin to discuss all these matters in the available space. There is, of course, no generally accepted theory of "intelligence"; the analysis is our own and may be controversial. We regret that we cannot give full personal acknowledgments here—suffice it to say that we have discussed these matters with almost every one of the cited authors. It is convenient to divide the problems into five main areas: Search, Pattern Recognition, Learning, Planning, and Induction; these comprise the main divisions of the report. Let us summarize, the entire argument very briefly: A computer can do, in a sense, only what it is told to do. But even when we do not know exactly how to solve a certain problem, we may program a machine to Search through some large space of solution attempts. Unfortunately, when we write a straightforward program for such a search, we usually find the resulting process to be enormously inefficient. With Pattern Recognition techniques, efficiency can be greatly improved by restricting the machine to use its methods only on the kind of attempts for which they are appropriate. And with Learning, efficiency is furthe improved by directing Search in accord with earlier experiences. By ac tually analyzing the situation, using what we call Planning methods, the machine may obtain a really fundamental improvement by replacing ths. originally given Search by a much smaller, more appropriate exploration Finally, in the section on Induction, we consider some rather more global concepts of how one might obtain intelligent machine behavior. 1. The Problem of Search? If, for a given problem, we have a means for checking a proposed solution, then we can solve the problem by testing all possible answers. But this 'The adjective "heuristic," as used here and widely in the literature, mears related to improving problem-solving performance; as a noun 'it is also used in regard to any method or trick used to improve the efficiency of a problem-solving system. A always takes much too long to be of practical interest. Any device that can reduce this search may be of value. If we can detect relative improvement, then “hill-climbing” (Sec. I-B) may be feasible, but its use requires some structural knowledge of the search space. And unless this structure meets certain conditions, hill-climbing may do more harm than good. When we talk of problem-solving in what follows we will usually suppose that all the problems to be solved are initially well defined (McCarthy, 1956). By this we mean that with each problem we are given some systematic way to decide when a proposed solution is acceptable. Most of the experimental work discussed here is concerned with such well-defined problems as are met in theorem-proving, or in games with precise rules for play and scoring. In one sense all such problems are trivial. For if there exists a solution to such a problem, that solution can be found eventually by any blind exhaustive process which searches through all possibilities. And it is usually not difficult to mechanize or program such a search. But for any problem worthy of the name, the search through all possibilities will be too inefficient for practical use. And on the other hand, systems like chess, or nontrivial parts of mathematics, are too complicated for complete analysis. Without complete analysis, there must always remain some core of search, or "trial and error.” So we need to find techniques through which the results of incomplete analysis can be used to make the search more efficient. The necessity for this is simply overwhelming: a search of all the paths through the game of checkers involves some 1040 move choices (Samuel, 1959a), in chess, some 10120 (Shannon, in Newman, 1956). If we organized all the particles in our galaxy into some kind of parallel computer operating at the frequency of hard cosmic rays, the latter computation would still take impossibly long; we cannot expect improvements in “hardware” alone to solve all our problems! Certainly we must use whatever we know in advance to guide the trial generator. And we must also be able to make use of results obtained along the way.2,3 "heuristic program,” to be considered successful, must work well on a variety of problems, and may often be excused if it fails on some. We often find it worthwhile to introduce a heuristic method which happens to cause occasional failures, if there is an over-all improvement in performance. But imperfect methods are not necessarily heuristic, nor vice versa. Hence "heuristic” should not be regarded as opposite to "foolproof"; this has caused some confusion in the literature. · McCarthy (1956) has discussed the enumeration problem from a recursivefunction-theory point of view. This incomplete but suggestive paper proposes, among other things, that "the enumeration of partial recursive functions should give an early place to compositions of functions that have already appeared.” I regard this as an important notion, especially in the light of Shannon's resulta (1949) on two-terminal switching circuits—that the “average" n-variable switching A. Relative Improvement, Hill-climbing, and Heuristic Connections A problem can hardly come to interest us if we have no background of information about it. We usually have some basis, however flimsy, for detecting improvement; some trials will be judged more successful than others. Suppose, for example, that we have a comparator which selects as the better, one from any pair of trial outcomes. Now the comparator cannot, alone, serve to make a problem well defined. No goal is defined. But if the comparator-defined relation between trials is “transitive” (i.e, if A dominates B and B dominates C implies that A dominates C), then we can at least define “progress," and ask our machine, given a time limit, to do the best it can. But it is essential to observe that a comparator by itself, however shrewd, cannot alone give any improvement over exhaustive search. The comparator gives us information about partial success, to be sure. But we need also some way of using this information to direct the pattern of search in promising directions; to select new trial points which are in some sense "like,” or “similar to,” or “in the same direction as” those which have given the best previous results. To do this we need some additional structure on the search space. This structure need not bear much resemblance to the ordinary spatial notion of direction, or that of distance, but it must somehow tie together points which are heuristically related. We will call such a structure a heuristic connection. We introduce this term for informal use only—that is why our definition is itself so informal. But we need it. Many publications have been marred by the misuse, for this purpose, of precise mathematical terms, e.g., metric and topological. The term “connection,” with its variety of dictionary meanings, seems just the word to designate a relation without commitment as to the exact nature of the relation. An important and simple kind of heuristic connection is that defined when a space has coordinates (or parameters) and there is also defined a numerical “success function” E which is a reasonably smooth function of the coordinates. Here we can use local optimization or hill-climbing methods. function requires about 2"/n contacts. This disaster does not usually strike when we construct "interesting" large machines, presumably because they are based on composition of functions already found useful. One should not overlook the pioneering paper of Newell (1955), and Samuel's discussion of the minimaxing process in (1959a). . In 1952 and especially in 1956, Ashby has an excellent discussion of the search problem. (However, I am not convinced of the usefulness of his notion of "ultrastability,” which seems to be little more than the property of a machine to search until something stops it.) B. Hill-climbing Suppose that we are given a black-box machine with inputs dz, ...,do and an output E(11, · dn). We wish to maximize E by adjusting the input values. But we are not given any mathematical description of the function E; hence we cannot use differentiation or related methods. The obvious approach is to explore locally about a point, finding the direction of steepest ascent. One moves a certain distance in that direction and repeats the process until improvement ceases. If the hill is smooth this may be done, approximately, by estimating the gradient component aE/adi separately for each coordinate ti. There are more sophisticated approaches (one may use noise added to each variable, and correlate the output with each input, see Fig. 1), but this is the general idea. It is a fundamental technique, and we see it always in the background of far more complex systems. Heuristically, its great virtue is this: the sampling effort (for determining the direction of the gradient) grows, in a sense, only linearly with the number of parameters. So if we can solve, by such a method, a certain kind of problem involving many parameters, then the addition of more parameters of the same kind ought not cause an inordinate increase in difficulty. We are particularly interested in problemsolving methods which can be so extended to more difficult problems. Alas, most interesting systems which involve combinational operations usually grow exponentially more difficult as we add variables. A great variety of hill-climbing systems have been studied under the names of "adaptive” or “self-optimizing” servomechanisms. Figure 1. "Multiple simultaneous optimizers” search for a (local) maximum value of some function E(,, ..., dn) of several parameters. Each unit U. independently “jitters” its parameter 1., perhaps randomly, by adding a variation 8: (1) to a current mean value Hi. The changes in the quantities o, and E are correlated, and the result is used to (slowly) change Mr. The filters are to rnove d-c components. This simultaneous technique, really a form of coherent detection, usually has an advantage over methods dealing separately and sequentially with cach parameter. (Cf. the discussion of "informative feedback” in Wiener (1948, pp. 133ff.).} |