(w_ij + 1)θ, where θ is a factor slightly less than unity. And when the joint event (F_j, E_i = 0) occurs, we decrement w_ij by replacing it with (w_ij)θ. It is not difficult to show that the expected values of the w_ij's will become proportional to the p_ij's [and, in fact, approach p_ij[θ/(1 - θ)]]. Hence, the machine tends to learn the optimal weighting on the basis of experience. (One must put in a similar mechanism for estimating the φ_j's.) The variance of the normalized weight w_ij[(1 - θ)/θ] approaches [(1 - θ)/(1 + θ)]p_ij q_ij. Thus a small value for θ means rapid learning but is associated with a large variance, hence, with low reliability. Choosing θ close to unity means slow, but reliable, learning. θ is really a sort of memory decay constant, and its choice must be determined by the noise and stability of the environment: much noise requires long averaging times, while a changing environment requires fast adaptation. The two requirements are, of course, incompatible and the decision has to be based on an economic compromise.

2. POSSIBILITIES OF USING RANDOM NETS FOR BAYES DECISIONS

The nets of Fig. 6 are very orderly in structure. Is all this structure necessary? Certainly, if there were a great many properties, each of which provided very little marginal information, some of them would not be missed. Then one might expect good results with a mere sampling of all the possible connection paths w_ij. And one might thus, in this special situation, use a random connection net. The two-layer nets here resemble those of the "Perceptron" proposal of Rosenblatt (1958). In the latter, there is an additional level of connections coming directly from randomly selected points of a "retina." Here the properties, the devices which abstract the visual input data, are simple functions which add some inputs, subtract others, and detect whether the result exceeds a threshold. Equation (1), we think, illustrates what is of value in this scheme.
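The decay rule above can be illustrated by a small simulation. This is only a sketch of the scheme as described in the text (the function name and parameters are my own): the weight is multiplied by θ at every step and incremented just before the decay whenever the joint event occurs, so its long-run mean settles at p·θ/(1 - θ), and the normalized weight w(1 - θ)/θ estimates the event probability p.

```python
import random

def train_weight(p, theta, steps=200_000, seed=0):
    """Simulate the decay-rule weight learner sketched in the text:
    on a 'joint event' (probability p) replace w by (w + 1) * theta;
    otherwise replace w by w * theta."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        if rng.random() < p:
            w = (w + 1.0) * theta
        else:
            w = w * theta
    return w

theta, p = 0.999, 0.3
w = train_weight(p, theta)
# The expected value of w approaches p * theta / (1 - theta), so the
# normalized weight w * (1 - theta) / theta is an estimate of p, with
# variance roughly [(1 - theta)/(1 + theta)] * p * (1 - p).
estimate = w * (1 - theta) / theta
```

With θ = 0.999 the effective averaging window is about a thousand events, so the estimate is close to p but, as the text notes, a smaller θ would track a changing environment faster at the cost of a noisier estimate.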
It does seem clear that a maximum-likelihood type of analysis of the output of the property functions can be handled by such nets. But these nets, with their simple, randomly generated connections, can probably never achieve recognition of such patterns as "the class of figures having two separated parts," and they cannot even achieve the effect of template recognition without size and position normalization (unless sample figures have been presented previously in essentially all sizes and positions). For the chances are extremely small of finding, by random methods, enough properties usefully correlated with patterns appreciably more abstract than those of the prototype-derived kind. And these networks can really only separate out (by weighting) information in the individual input properties; they cannot extract further information present in nonadditive form. The "Perceptron" class of machines have facilities neither for obtaining better-than-chance properties nor for assembling better-than-additive combinations of those it gets from random construction. [See also Minsky and Selfridge (1960), and Papert (1961).]

For recognizing normalized printed or hand-printed characters, single-point properties do surprisingly well (Highleyman and Kamentsky, 1960); this amounts to just "averaging" many samples. Bledsoe and Browning (1959) claim good results with point-pair properties. Roberts (1960) describes a series of experiments in this general area. Doyle (1959), without normalization but with quite sophisticated properties, obtains excellent results; his properties are already substantially size- and position-invariant. A general review of Doyle's work and other pattern-recognition experiments will be found in Selfridge and Neisser (1960). For the complex discrimination, e.g., between one and two connected objects, the property problem is very serious, especially for long wiggly objects such as are handled by Kirsch (1957).
Here some kind of recursive processing is required, and combinations of simple properties would almost certainly fail even with large nets and long training.

We should not leave the discussion of some decision net models without noting their important limitations. The hypothesis that, for given j, the events E_i represent independent events is a very strong condition indeed. Without this hypothesis we could still construct maximum-likelihood nets, but we would need an additional layer of cells to represent all of the joint events V; that is, we would need to know all the Pr(F_j | V). This gives a general (but trivial) solution, but requires 2^n cells for n properties, which is completely impractical for large systems. What is required is a system which computes some sampling of all the joint conditional probabilities, and uses these to estimate others when needed. The work of Uttley (1956, 1959) bears on this problem, but his proposed and experimental devices do not yet clearly show how to avoid exponential growth. [See also Papert (1961).]

H. Articulation and Attention: Limitations of the Property-list Method

Because of its fixed size, the property-list scheme is limited (for any given set of properties) in the detail of the distinctions it can make. Its ability to deal with a compound scene containing several objects is critically weak, and its direct extensions are unwieldy and unnatural. If a machine can recognize a chair and a table, it surely should be able to tell us that "there is a chair and a table." To an extent, we can invent properties which allow some capacity for superposition of object characters. [See also Roberts (1960), Papert (1961), and Hawkins (1958).] But there is no way to escape the information limit. [We can find nothing resembling an analysis (see (1) above) in Rosenblatt (1958) or his subsequent publications.]

Figure 7. The picture (a) is first described verbally in the text.
Then, by introducing notation for the relations "inside of," "to the left of," and "above," we construct a symbolic description. Such descriptions can be formed and manipulated by machines. By abstracting out the complex relation between the parts of the figure, we can use the same formula to describe the related pictures (b) and (c), changing only the list of primitive parts. It is up to the programmer to decide at just what level of complexity a part of a picture should be considered "primitive"; this will depend on what the description is to be used for. We could further divide the drawings into vertices, lines, and arcs. Obviously, for some applications the relations would need more metrical information, e.g., specification of lengths or angles. What is required is clearly (1) a list (of whatever length is necessary) of the primitive objects in the scene and (2) a statement about the relations among them. Thus we say of Fig. 7a, "A rectangle (1) contains two subfigures disposed horizontally. The part on the left is a rectangle (2) which contains two subfigures disposed vertically: the upper a circle (3) and the lower a triangle (4). The part on the right ... etc." Such a description entails an ability to separate or "articulate" the scene into parts. (Note that in this example the articulation is essentially recursive; the figure is first divided into two parts; then each part is described using the same machinery.) We can formalize this kind of description in an expression language whose fundamental grammatical form is a pair (R,L) whose first member R names a relation and whose second member L is an ordered list (x_1, x_2, ..., x_n) of the objects or subfigures which bear that relation to one another. We obtain the required flexibility by allowing the members of the list L to contain not only the names of "elementary" figures but also "subexpressions" of the form (R,L) designating complex subfigures.
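The (R,L) pair language maps naturally onto nested list structures in a modern language. The sketch below is a minimal rendering under assumed conventions: relation names ("contains", "right-of", "below") are illustrative stand-ins, a description is either a primitive object name or a (relation, parts) pair, and recursive descent recovers the list of primitive parts.

```python
# A description is either a primitive object name (a string) or a pair
# (relation, list-of-parts), where the parts may themselves be (R, L)
# subexpressions. Relation names here are illustrative, not the paper's.

def primitives(desc):
    """Return the ordered list of primitive objects in a description,
    by recursive descent over the (R, L) pairs."""
    if isinstance(desc, str):          # an elementary figure
        return [desc]
    relation, parts = desc
    out = []
    for part in parts:
        out.extend(primitives(part))
    return out

# A toy version of Fig. 7a: a rectangle containing two subfigures side
# by side; the left one is a rectangle holding a circle above a triangle.
scene = ("contains",
         ["rectangle",
          ("right-of",
           [("contains", ["rectangle", ("below", ["circle", "triangle"])]),
            ("contains", ["rectangle", ("right-of", ["circle", "circle"])])])])

print(primitives(scene))   # lists every primitive part, left to right
```

Changing only the primitive names in `scene` while keeping the relational skeleton is exactly the "same formula, different part list" move the text describes for pictures (b) and (c).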
Then our scene above may be described by the expression

(⊂, (□, (→, ((⊂, (□, (↓, (○, △)))), (⊂, (□, (△, (○, ○, ○))))))))

where (⊂, (x,y)) means that y is contained in x; (→, (x,y)) means that y is to the right of x; (↓, (x,y)) means that y is below x; and (△, (x,y,z)) means that y is to the right of x and z is underneath and between them. The symbols □, ○, and △ represent the indicated kinds of primitive geometric objects. This expression-pair description language may be regarded as a simple kind of "list-structure" language. Powerful computer techniques have been developed, originally by Newell, Shaw and Simon, for manipulating symbolic expressions in such languages for purposes of heuristic programming. (See the remarks at the end of Sec. IV. If some of the members of a list are themselves lists, they must be surrounded by exterior parentheses, and this accounts for the accumulation of parentheses.)

It may be desirable to construct descriptions in which the complex relation is extracted, e.g., so that we have an expression of the form F·G where F is an expression which at once denotes the composite relation between all the primitive parts listed in G. A complication arises in connection with the "binding" of variables, i.e., in specifying the manner in which the elements of G participate in the relation F. This can be handled in general by the "λ" notation (McCarthy, 1960), but here we can just use integers to order the variables. For the given example, we could describe the relational part F by an expression

⊂(1, →(⊂(2, ↓(3,4)), ⊂(5, △(6,7,8))))

in which we now use a "functional notation": "(⊂, (x,y))" is replaced by "⊂(x,y)," etc., making for better readability. To obtain the desired description, this expression has to be applied to an ordered list of primitive objects, which in this case is (□, □, ○, △, □, ○, ○, ○). This composite functional form allows us to abstract the composite relation.
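The integer-binding idea can be made concrete with a short sketch. This is my own illustration, not the paper's notation: the relational part F is a nested form whose leaves are integer placeholders, and applying F to the ordered object list G substitutes the i-th object for placeholder i (the relation names "contains", "right-of", "below", "triad" stand in for ⊂, →, ↓, △).

```python
# Apply a relational form F (with 1-indexed integer placeholders) to an
# ordered list of primitive objects G, producing a full description.

def apply_form(form, objects):
    """Replace each integer placeholder in a nested relational form
    with the corresponding member of the object list."""
    if isinstance(form, int):
        return objects[form - 1]       # the integers order the variables
    relation, args = form
    return (relation, tuple(apply_form(a, objects) for a in args))

# F mirrors the text's example: object 1 contains a right-of pair, whose
# sides are contained subfigures; "triad" is the three-place arrangement.
F = ("contains", (1, ("right-of",
        (("contains", (2, ("below", (3, 4)))),
         ("contains", (5, ("triad", (6, 7, 8))))))))

# G: the ordered list of primitive objects.
G = ["square", "square", "circle", "triangle",
     "square", "circle", "circle", "circle"]

description = apply_form(F, G)
```

Swapping in a different G while reusing F is the point of abstracting the composite relation: the same F describes Fig. 7b and 7c with new part lists.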
By changing only the object list we can obtain descriptions also of the objects in Fig. 7b and c. The important thing about such "articular" descriptions is that they can be obtained by repeated application of a fixed set of pattern-recognition techniques. Thus we can obtain arbitrarily complex descriptions from a fixed-complexity classification mechanism. The new element required in the mechanism (besides the capacity to manipulate the list structures) is the ability to articulate: to "attend fully" to a selected part of the picture and bring all one's resources to bear on that part. In efficient problem-solving programs, we will not usually complete such a description in a single operation. Instead, the depth or detail of description will be under the control of other processes. These will reach deeper, or look more carefully, only when they have to, e.g., when the presently available description is inadequate for a current goal. The author, together with L. Hodes, is working on pattern-recognition schemes using articular descriptions. By manipulating the formal descriptions we can deal with overlapping and incomplete figures, and several other problems of the "Gestalt" type. It seems likely that as machines are turned toward more difficult problem areas, passive classification systems will become less adequate, and we may have to turn toward schemes which are based more on internally generated hypotheses, perhaps "error-controlled" along the lines proposed by MacKay (1956).

Space requires us to terminate this discussion of pattern-recognition and description. Among the important works not reviewed here should be mentioned those of Bomba (1959) and Grimsdale et al.
(1959), which involve elements of description; Unger (1959) and Holland (1960) for parallel processing schemes; Hebb (1949), who is concerned with physiological description models; and the work of the Gestalt psychologists, notably Köhler (1947), who have certainly raised, if not solved, a number of important questions. Sherman (1959), Haller (1959), and others have completed programs using line-tracing operations for topological classification. The papers of Selfridge (1955, 1956) have been a major influence on work in this general area. See also Kirsch et al. (1957) for discussion of a number of interesting computer image-processing techniques, and see Minot (1959) and Stevens (1957) for reviews of the reading machine and related problems. One should also examine some biological work, e.g., Tinbergen (1951), to see instances in which some discriminations which seem, at first glance, very complicated are explained on the basis of a few apparently simple properties arranged in simple decision trees.

III. Learning Systems

In order to solve a new problem, one should first try using methods similar to those that have worked on similar problems. To implement this "basic learning heuristic" one must generalize on past experience, and one way to do this is to use success-reinforced decision models. These learning systems are shown to be averaging devices. Using devices which also learn which events are associated with reinforcement, i.e., reward, we can build more autonomous "secondary reinforcement" systems. In applying such methods to complex problems, one encounters a serious difficulty: distributing credit for success of a complex strategy among the many decisions that were involved. This problem can be managed by arranging for local reinforcement of partial goals within a hierarchy, and by grading the training sequence of problems to parallel a process of maturation of the machine's resources.
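A success-reinforced decision model of the kind summarized above can be sketched as follows. This is a hedged illustration, not a mechanism from the text: the class name, the exploration scheme, and the simulated success rates are all my own, but the update rule is the same kind of exponential averaging device the section describes, so each method's estimate tracks its observed frequency of success.

```python
import random

# An illustrative success-reinforced decision model: each method keeps a
# success estimate updated by exponential averaging, and the problem
# solver prefers methods whose estimates are highest.

class ReinforcedChooser:
    def __init__(self, methods, decay=0.95, seed=1):
        self.estimates = {m: 0.5 for m in methods}  # prior: indifferent
        self.decay = decay
        self.rng = random.Random(seed)

    def choose(self, explore=0.1):
        # Occasionally try a method at random; otherwise exploit the best.
        if self.rng.random() < explore:
            return self.rng.choice(list(self.estimates))
        return max(self.estimates, key=self.estimates.get)

    def reinforce(self, method, success):
        # Averaging device: the new estimate is a decayed mix of the old
        # estimate and the latest outcome (1 for success, 0 for failure).
        e = self.estimates[method]
        self.estimates[method] = self.decay * e + (1 - self.decay) * float(success)

chooser = ReinforcedChooser(["method_a", "method_b"])
true_rates = {"method_a": 0.8, "method_b": 0.2}   # hypothetical environment
for _ in range(2000):
    m = chooser.choose()
    chooser.reinforce(m, chooser.rng.random() < true_rates[m])
```

After training, the estimate for the more successful method dominates, so the "basic learning heuristic" (try what has worked before) falls out of the averaging. The credit-assignment difficulty the text raises appears as soon as several methods jointly produce one success: this sketch sidesteps it by reinforcing only the single method just used.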
In order to solve a new problem one uses what might be called the basic learning heuristic: first try using methods similar to those which have worked, in the past, on similar problems. We want our machines, too, to benefit from their past experience. Since we cannot expect new situations to be precisely the same as old ones, any useful learning will have to involve generalization techniques. There are too many notions associated