Page images

term characteristic for a property without restriction to 2 values.] Thus a square has the Character (1,1,1) and a circle the Character (0,1,1) for the given sequence of properties.

For many problems one can use such Characters as names for categories and as primitive elements with which to define an adequate set of patterns. Characters are more than conventional names. They are instead very rudimentary forms of description (having the form of the simplest symbolic expression-the list) whose structure provides some information about the designated classes. This is a step, albeit a small one, beyond the template method; the Characters are not simple instances of the patterns, and the properties may themselves be very abstract. Finding a good set of properties is the major concern of many heuristic programs.

E. Invariant Properties

One of the prime requirements of a good property is that it be invariant under the commonly encountered equivalence transformations. Thus for visual Pattern Recognition we would usually want the object identification to be independent of uniform changes in size and position. In their pioneering paper Pitts and McCulloch (1947) describe a general technique for forming invariant properties from noninvariant ones, assuming that the transformation space has a certain (group) structure. The idea behind their mathematical argument is this: suppose that we have a function P of figures, and suppose that for a given figure F we define [F] = {F1,F2, ... } to be the set of all figures equivalent to F under the given set of transformations; further, define P[F] to be the set {P(F1),P(F2), . . . } of values of P on those figures. Finally, define P*[F] to be AVERAGE (P[F]). Then we have a new property P* whose values are independent of the selection of F from an equivalence class defined by the transformations. We have to be sure that when different representatives are chosen from a class the collection [F] will always be the same in each case. In the case of continuous transformation spaces, there will have to be a measure or the equivalent associated with the set [F] with respect to which the operation AVERAGE is defined, say, as an integration.*

This method is proposed (Pitts and McCulloch, 1947) as a neurophysiological model for pitch-invariant hearing and size-invariant visual

'In the case studied in Pitts and McCulloch (1947) the transformation space is a group with a uniquely defined measure: the set [F] can be computed without repetitions by scanning through the application of all the transforms T. to the given figure so that the invariant property can be defined by

P*(F) = √。≤ç P(Ta(F)) dμ


where G is the group and the measure. By substituting T(F) for F in this, one can see that the result is independent of choice of ẞ since we obtain the same integral over Gp1 = G.

recognition (supplemented with visual centering mechanisms). This model is discussed also by Wiener.5 Practical application is probably limited to one-dimensional groups and analog scanning devices.

In much recent work this problem is avoided by using properties already invariant under these transformations. Thus a property might count the number of connected components in a picture-this is invariant under size and position. Or a property may count the number of vertical lines in a picture—this is invariant under size and position (but not rotation).

F. Generating Properties

The problem of generating useful properties has been discussed by Selfridge (1955); we shall summarize his approach. The machine is given, at the start, a few basic transformations A1, ., An, each of which transforms, in some significant way, each figure into another figure. A1 might, for example, remove all points not on a boundary of a solid region; A2 might leave only vertex points; A, might fill up hollow regions, etc. (see Fig. 5). Each sequence AiAi2 . . . Aik of these forms a new transformation, so that there is available an infinite variety. We provide the machine also with one or more "terminal" operations which convert a picture into a number, so that any sequence of the elementary transformations, followed by a terminal operation, defines a property. [Dineen (1955) describes how these processes were programmed in a digital computer.] We can start with a few short sequences, perhaps chosen randomly. Selfridge describes how the machine might learn new useful properties.

We now feed the machine A's and O's telling the machine each time which letter it is. Beside each sequence under the two letters, the machine builds up distribution functions from the results of applying the sequences to the image. Now, since the sequences were chosen completely randomly, it may well be that most of the sequences have very flat distribution functions; that is, they [provide] no information, 'See pp. 160ff. of Wiener (1948).

[blocks in formation]

Figure 5. An arbitrary sequence of picture transformations, followed by a numericalvalued function, can be used as a property function for pictures. A, removes all points which are not at the edge of a solid region. A leaves only vertex pointsat which an arc suddenly changes direction. The function C simply counts the number of points remaining in the picture. All remarks in the text could be generalized to apply to properties like A.A.C, which can have more than two values.

and the sequences are therefore [by definition] not significant. Let it discard these and pick some others. Sooner or later, however, some sequences will prove significant; that is, their distribution functions will peak up somewhere. What the machine does now is to build up new sequences like the significant ones. This is the important point. If it merely chose sequences at random it might take a very long while indeed to find the best sequences. But with some successful sequences, or partly successful ones, to guide it, we hope that the process will be much quicker. The crucial question remains: how do we build up sequences “like” other sequences, but not identical? As of now we think we shall merely build sequences from the transition frequencies of the significant sequences. We shall build up a matrix of transition frequencies from the significant ones, and use those as transition probabilities with which to choose new sequences.

We do not claim that this method is necessarily a very good way of choosing sequences—only that it should do better than not using at all the knowledge of what kind of sequences has worked. It has seemed to us that this is the crucial point of learning.

It would indeed be remarkable if this failed to yield properties more useful than would be obtained from completely random sequence selection. The generating problem is discussed further in Minsky (1956a). Newell, Shaw, and Simon (1960b) describe more deliberate, less statistical, techniques that might be used to discover sets of properties appropriate to a given problem area. One may think of the Selfridge proposal as a system which uses a finite-state language to describe its properties. Solomonoff (1957, 1960) proposes some techniques for discovering common features of a set of expressions, e.g., of the descriptions of those properties of already established utility; the methods can then be applied to generate new properties with the same common features. I consider the lines of attack in Selfridge (1955), Newell, Shaw and Simon (1960a), and Solomonoff (1960, 1958), although still incomplete, to be of the greatest importance.

G. Combining Properties

One cannot expect easily to find a small set of properties which will be just right for a problem area. It is usually much easier to find a large set of properties each of which provides a little useful information. Then one is faced with the problem of finding a way to combine them to make the desired distinctions. The simplest method is to choose, for each class, a typical character (a particular sequence of property values) and then to use some matching procedure, e.g., counting the numbers of agreements and disagreements, to compare an unknown with these chosen "Character

See p. 93 of Selfridge (1955).

prototypes." The linear weighting scheme described just below is a slight generalization on this. Such methods treat the properties as more or less independent evidence for and against propositions; more general procedures (about which we have yet little practical information) must account also for nonlinear relations between properties, i.e., must contain weighting terms for joint subsets of property values.


We consider a single experiment in which an object is placed in front of a property-list machine. Each property E; will have a value, 0 or 1. Suppose that there has been defined some set of "object classes" F1, and that we want to use the outcome of this experiment to decide in which of these classes the object belongs.

Assume that the situation is basically probabilistic, and that we know the probability p1; that, if the object is in class F; then the ith property E, will have value 1. Assume further that these properties are independent; that is, even given F,, knowledge of the value of E; tells us nothing more about the value of a different E in the same experiment. (This is a strong condition-see below.) Let ; be the absolute probability that an object is in class F. Finally, for this experiment define V to be the particular set of i's for which the E's are 1. Then this V represents the Character of the object. From the definition of conditional probability, we have

Pr(F;‚V) = Pr(V) · Pr(F;|V) = Pr(F;) · Pr(V|F;)

Given the Character V, we want to guess which F, has occurred (with the least chance of being wrong—the so-called maximum likelihood estimate); that is, for which j is Pr(F,V) the largest? Since in the above Pr(V) does not depend on j, we have only to calcuate for which j is

Pr(F;) · Pr(V|F;) = 6; Pr(V|F;)

the largest. Hence, by our independence hypothesis, we have to maximize

Þj · [] Pij ⋅ [] 9ij = 6; [] Pij · [] qi;


all i

iEV iЄv iEV These "maximum-likelihood" decisions can be made (Fig. 6) by a simple network device."

'At the cost of an additional network layer, we may also account for the possible cost g, that would be incurred if we were to assign to F a figure really in class F;; in this case the minimum cost decision is given by the k for which

[ocr errors]

Στικς, Π

iEV iЄv

is the least. is the complement set to V. q., is (1 — P.,).


[ocr errors]
[ocr errors][merged small][merged small][merged small][ocr errors][merged small][merged small][merged small][merged small][merged small][ocr errors][merged small][merged small][ocr errors][merged small][merged small][merged small]

Figure 6. "Net" model for maximum-likelihood decisions based on linear weightings of property values. The input data are examined by each "property filter" E.. Each E. has "0" and "1" output channels, one of which is excited by each input. These outputs are weighted by the corresponding p.,'s, as shown in the text. The resulting signals are multiplied in the F, units, each of which "collects evidence" for a particular figure class. [We could have used here log (p.;), and added at the F, units.] The final decision is made by the topmost unit D, who merely chooses that F, with the largest score. Note that the logarithm of the coefficient p/qi, in the second expression of (1) can be construed as the "weight of the evidence" of E, in favor of F,. [See also Papert (1961) and Rosenblatt (1958).]

These nets resemble the general schematic diagrams proposed in the "Pandemonium” model of Selfridge (1959) (see his fig. 3). It is proposed there that some intellectual processes might be carried out by a hierarchy of simultaneously functioning submachines suggestively called "demons.” Each unit is set to detect certain patterns in the activity of others and the output of each unit announces the degree of confidence of that unit that it sees what it is looking for. Our E; units are Selfridge's "data demons.” Our units F; are his "cognitive demons"; each collects from the abstracted data evidence for a specific proposition. The topmost "decision demon” D responds to that one in the multitude below it whose shriek is the loudest."

It is quite easy to add to this "Bayes network model" a mechanism which will enable it to learn the optimal connection weightings. Imagine that, after each event, the machine is told which F; has occurred; we could implement this by sending back a signal along the connections leading to that F, unit. Suppose that the connection for pi; (or q1;) contains a two-terminal device (or "synapse") which stores a number w¡¡. Whenever the joint event (F,, E; = 1) occurs, we modify wi; by replacing it by

See also the report in Selfridge and Neisser (1960).

« PreviousContinue »