« PreviousContinue »
C. Troubles with Hill-climbing Obviously, the gradient-following hill-climber would be trapped if it should reach a local peak which is not a true or satisfactory optimum. It must then be forced to try larger steps or changes.
It is often supposed that this false-peak problem is the chief obstacle to machine learning by this method. This certainly can be troublesome. But for really difficult problems, it seems to us that usually the more fundamental problem lies in finding any significant peak at all. Unfortunately the known E functions for difficult problems often exhibit what we have called (Minsky and Selfridge, 1960) the “Mesa Phenomenon” in which a small change in a parameter usually leads to either no change in performance or to a large change in performance. The space is thus composed primarily of flat regions or "mesas.” Any tendency of the trial generator to make small steps then results in much aimless wandering without compensating information gains. A profitable search in such a space requires steps so large that hill-climbing is essentially ruled out. The problem-solver must find other methods; hill-climbing might still be feasible with a different heuristic connection.
Certainly, in our own intellectual behavior we rarely solve a tricky problem by a steady climb toward success. I doubt that in any one simple mechanism, e.g., hill-climbing, will we find the means to build an efficient and general problem-solving machine. Probably, an intelligent machine will require a variety of different mechanisms. These will be arranged in hierarchies, and in even more complex, perhaps recursive, structures. And perhaps what amounts to straightforward hill-climbing on one level may sometimes appear (on a lower level) as the sudden jumps of "insight.”
II. The Problem of Pattern Recognition
In order not to try all possibilities, a resourceful machine must classify problem situations into categories associated with the domains of effectiveness of the machine's different methods. These pattern-recognition methods must extract the heuristically significant features of the objects in question. The simplest methods simply match the objects against standards or prototypes. More powerful “property-list" methods subject each object to a sequence of tests, each detecting some property of heuristic importance. These properties have to be invariant under commonly encountered forms of distortion. Two important problems arise here—inventing new useful properties, and combining many properties to form a recognition system. For complex problems, such methods will have to be augmented by facilities for subdividing complex objects and describing the complex relations between their parts.
Any powerful heuristic program is bound to contain a variety of different methods and techniques. At each step of the problem-solving process the machine will have to decide what aspect of the problem to work on, and then which method to use. A choice must be made, for we usually cannot afford to try all the possibilities. In order to deal with a goal or a problem, that is, to choose an appropriate method, we have to recognize what kind of thing it is. Thus the need to choose among actions compels us to provide the machine with classification techniques, or means of evolving them. It is of overwhelming importance that the machine have classification techniques which are realistic. But “realistic” can be defined only with respect to the environments to be encountered by the machine, and with respect to the methods available to it. Distinctions which cannot be exploited are not worth recognizing. And methods are usually worthless without classification schemes which can help decide when they are applicable. A. Teleological Requirements of Classification The useful classifications are those which match the goals and methods of the machine. The objects grouped together in the classifications should have something of heuristic value in common; they should be "similar" in a useful sense; they should depend on relevant or essential features. We should not be surprised, then, to find ourselves using inverse or teleological expressions to define the classes. We really do want to have a grip on “the class of objects which can be transformed into a result of form Y,” that is, the class of objects which will satisfy some goal. One should be wary of the familiar injunction against using teleological language in science. While it is true that talking of goals in some contexts may dispose us towards certain kinds of animistic explanations, this need not be a bad thing in the field of problem-solving; it is hard to see how one can solve problems without thoughts of purposes. The real difficulty with teleological definitions is technical, not philosophical, and arises when they have to be used and not just mentioned. One obviously cannot afford to use for classification a method which actually requires waiting for some remote outcome, if one needs the classification precisely for deciding whether to try out that method. So, in practice, the ideal teleological definitions often have to be replaced by practical approximations, usually with some risk of error; that is, the definitions have to be made heuristically effective, or economically usable. This is of great importance. (We can think of “heuristic effectiveness” as contrasted to the ordinary mathematical notion of "effectiveness” which distinguishes those definitions which can be realized at all by machine, regardless of efficiency.) B. Patterns and Descriptions It is usually necessary to have ways of assigning naines—symbolic expressions—to the defined classes. The structure of the names will have a crucial influence on the mental world of the machine, for it determines what kinds of things can be conveniently thought about. There are a variety of ways to assign names. The simplest schemes use what we will call conventional (or proper) names; here, arbitrary symbols are assigned to classes. But we will also want to use complex descriptions or computed names; these are constructed for classes by processes which depend on the class definitions. To be useful, these should reflect some of the structure of the things they designate, abstracted in a manner relevant to the problem area. The notion of description merges smoothly into the more complex notion of model; as we think of it, a model is a sort of active description. It is a thing whose form reflects some of the structure of the thing represented, but which also has some of the character of a working machine.
In Sec. III we will consider “learning” systems. The behavior of those systems can be made to change in reasonable ways depending on what happened to them in the past. But by themselves, the simple learning systems are useful only in recurrent situations; they cannot cope with any significant novelty. Nontrivial performance is obtained only when learning systems are supplemented with classification or pattern-recognition methods of some inductive ability. For the variety of objects encountered in a nontrivial search is so enormous that we cannot depend on recurrence, and the mere accumulation of records of past experience can have only limited value. Pattern Recognition, by providing a heuristic connection which links the old to the new, can make learning broadly useful.
What is a “pattern"? We often use the term teleologically to mean a set of objects which can in some (useful) way be treated alike. For each problem area we must ask, “What patterns would be useful for a machine working on such problems?”
The problems of visual pattern recognition have received much attention in recent years and most of our examples are from this area. C. Prototype-derived Patterns The problem of reading printed characters is a clearcut instance of a situation in which the classification is based ultimately on a fixed set of "prototypes"-e.g., the dies from which the type font was made. The individual marks on the printed page may show the results of many distortions. Some distortions are rather systematic: change in size, position, orientation. Some are of the nature of noise: blurring, grain, low contrast, etc.
If the noise is not too severe, we may be able to manage the identification by what we call a normalization and template-mutching process. We first remove the differences related to size and position—that is, we normalize the input figure. One may do this, for example, by constructing a similar figure inscribed in a certain fixed triangle (see Fig. 2); of one may transform the figure to obtain a certain fixed center of gravity and a unit second central moment. (There is an additional problem with rotational
equivalence where it is not easy to avoid all ambiguities. One does not want to equate “6" and “9.” For that matter, one does not want to equate (0,0), or (x,x) or the o's in x, and t°, so that there may be context dependency
involved.] Once normalized, the unknown B
figure can be compared with templates for the prototypes and, by means of some measure of
matching, choose the best fitting template. Figure 2. A simple normal
Each “matching criterion” will be sensitive to ization technique. If an object is expanded uniformly, particular forms of noise and distortion, and without rotation, until it so will each normalization procedure. The intouches all three sides of a scribing or boxing method may be sensitive to triangle, the resulting figure
small specks, while the moment method will will be unique, and pattern
be especially sensitive to smearing, at least for recognition
proceed without concern about re
thin-line figures, etc. The choice of a matching lative size and position. criterion must depend on the kinds of noise
and transformations commonly encountered. Still, for many problems we may get acceptable results by using straightforward correlation methods.
When the class of equivalence transformations is very large, e.g., when local stretching and distortion are present, there will be difficulty in finding a uniform normalization method. Instead, one may have to consider a process of adjusting locally for best fit to the template. (While measuring the matching, one could jitter” the figure locally; if an improvement were found the process could be repeated using a slightly different change, etc.) There is usually no practical possibility of applying to the figure all of the admissible transformations. And to recognize the topological equivalence of pairs such as those in Fig. 3 is likely beyond any practical kind of iterative local-improvement or hill-climbing matching procedure. (Such recognitions can be mechanized, though, by methods which follow lines, detect vertices, and build up a description in the form, say, of a vertex-connection table.)
(61 Figure 3. The figures A, A and B. B' are topologically equivalent pairs. Lengths have been distorted in an arbitrary manner, but the connectivity relations between corresponding points have been preserved. In Sherman (1959) and Haller (1959) we find computer programs which can deal with such equivalences.
The template-matching scheme, with its normalization and direct comparison and matching criterion, is just too limited in conception to be of much use in more difficult problems. If the transformation set is large, normalization, or “fitting," may be impractical, especially if there is no adequate heuristic connection on the space of transformations. Furthermore, for each defined pattern, the system has to be presented with a prototype. But if one has in mind a fairly abstract class, one may simply be unable to represent its essential features with one or a very few concrete examples. How could one represent with a single prototype the class of figures which have an even number of disconnected parts? Clearly, the template system has negligible descriptive power. The property-list system frees us from some of these limitations. D. Property Lists and “Characters” We define a property to be a two-valued function which divides figures into two classes; a figure is said to have or not have the property according to whether the function's value is 1 or 0. Given a number N of distinction properties, we could define as many as 2" subclasses by their set intersections and, hence, as many as 22* patterns by combining the properties with AND's and OR’s. Thus, if we have three properties, rectilinear, connected, and cyclic, there are eight subclasses (and 256 patterns) delined by their intersections (see Fig. 4).
If the given properties are placed in a fixed order then we can represent any of these elementary regions by a vector, or string of digits. The vector so assigned to each figure will be called the Character of that figure (with respect to the sequence of properties in question). [In "Some Aspects of Heuristic Programming and Artificial Intelligence" (1959a), we use the
Figure 4. The eight regions represent all the possible configurations of values of the three properties "rectilinear," "connected," "containing a loop." Each region contains a representative figure, and its associated binary "Character” sequence.