SELF-ORGANIZING MAPS - Full report



These notes provide an introduction to unsupervised neural networks, in particular Kohonen self-organizing maps, together with some fundamental background material on statistical pattern recognition.

One question which seems to puzzle many of those who encounter unsupervised learning for the first time is how anything useful can be achieved when input information is simply poured into a black box, with no rules as to how this information should be stored and no examples of the groups into which it can be placed. If the information is sorted on the basis of how similar one input is to another, then we will have accomplished an important step in condensing the available information by developing a more compact representation. We can represent this information, and any subsequent information, in a much reduced fashion, and we will know which inputs are more likely. This black box will certainly have learned. It may permit us to perceive some order in what otherwise was a mass of unrelated information: to see the wood for the trees.

In any learning system, we need to make full use of all the available data and to impose any constraints that we feel are justified. If we know which groups the information must fall into, that certain combinations of inputs preclude others, or that certain rules underlie the production of the information, then we must use them. Often, we do not possess such additional information. Consider two examples of experiments: one designed to test a particular hypothesis, say, to determine the effects of alcohol on driving; the second to investigate any possible connection between car accidents and the driver's lifestyle. In the first experiment, we could arrange a laboratory-based experiment where volunteers took measured amounts of alcohol and then attempted some motor-skill activity (e.g., following a moving light on a computer screen by moving the mouse). We could collect the data (i.e., amount of alcohol vs. error rate on the computer test), conduct the customary statistical tests and, finally, draw our conclusions. Our hypothesis may be that the more alcohol consumed, the greater the error rate; we can confirm this on the basis of this experiment. Note that we cannot prove the relationship, only state that we are 99% certain (or whatever level we set ourselves) that the result is not due purely to chance.

The second experiment is much more open-ended (indeed, it could be argued that it is not really an experiment). Data is collected from a large number of drivers, both those that have been involved in accidents and those that have not. This data could include the driver's age, occupation, health details, drinking habits, etc. From this mass of information, we can attempt to discover any possible connections. A number of conventional statistical tools exist to support this (e.g., factor analysis). We may discover possible relationships, including one between accidents and drinking, but perhaps many others as well. There could be a number of leads that need following up. Both approaches are valid in searching for causes underlying road accidents. This second experiment can be considered an example of unsupervised learning.

The next section provides some introductory background material on statistical pattern recognition. The terms and concepts will be useful in understanding the later material on unsupervised neural networks. As the approach underlying unsupervised networks is the measurement of how similar (or different) various inputs are, we need to consider how the distances between these inputs are measured. This forms the basis of Section Three, together with a brief description of non-neural approaches to unsupervised learning. Section Four discusses the background to, and basic algorithm of, Kohonen self-organizing maps. The next section details some of the properties of these maps and introduces several useful practical points. The final section provides pointers to further information on unsupervised neural networks.


2. Statistical Pattern Recognition

2.1 Elements of Pattern Recognition

The goal of pattern recognition is to classify objects into one of a number of categories or classes. The objects of interest are generically called patterns and may be images, speech signals or entries in a database. It is these patterns, which we can loosely define as the natural structure, that give meaning to events in the external world. Patterns are manifestations of rules, and through observing patterns we are able to propose rules and so form an understanding of the underlying processes.

Supervised learning is learning in which each training input pattern is accompanied by its correct class label. The difference, or error, between the recognition system's current response and the desired one is used to modify the system so as to reduce this error.

2.2 Pattern space and vectors

Patterns are sets of measurements. For example, we could take measurements of height and weight for members of different rowing crews and produce a graph as shown in Figure 2.2.1(a). This graph represents our pattern space, that is, the two-dimensional space within which all possible rowing crew members must lie. This space is also called the measurement or observation space (as it is the space in which measurements or observations are made). There are two groupings of our experimental data or measurements, which we can (as an external teacher with our additional wider knowledge of rowing) identify as a group of rowers and a group of coxes. Given just the height and weight of an individual (i.e., the points A or B), we need to decide whether this individual is a cox or a rower; that is, we wish to classify the information into one of the two classes.
One approach would be to construct a line, based on the original data on which we trained the system (the training data), that separates the two groups. This is termed a decision line, as data points lying to one side of the line are determined to be, say, rowers, and those to the other side to be coxes. So individual 'A', represented by the data point A, is assumed to be a rower, while individual 'B' is assumed to be a cox. An alternative approach is to consider each grouping as a cluster, and to measure the distance from the points A and B to each of the cluster centers. The shortest distance determines which class the points are members of. If we take another independent measurement, for example age, then the pattern space becomes three-dimensional (Figure 2.2.1(b)), the clusters become spheres and the decision surface becomes a plane.
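As a minimal illustration of the nearest-cluster-center rule described above, the following Python sketch classifies a (height, weight) point by comparing its distance to the two cluster centers. The numerical values are invented stand-ins for the rowing-crew data of Figure 2.2.1(a), not measurements taken from the report.

    import numpy as np

    # Made-up (height in cm, weight in kg) measurements for rowers and coxes.
    rowers = np.array([[193.0, 95.0], [188.0, 90.0], [196.0, 98.0], [190.0, 92.0]])
    coxes = np.array([[160.0, 52.0], [158.0, 50.0], [165.0, 55.0], [162.0, 53.0]])

    # Summarise each cluster by its center (the mean of its members).
    centers = {"rower": rowers.mean(axis=0), "cox": coxes.mean(axis=0)}

    def classify(point):
        """Assign a (height, weight) point to the class with the nearest cluster center."""
        point = np.asarray(point, dtype=float)
        return min(centers, key=lambda label: np.linalg.norm(point - centers[label]))

    print(classify([191.0, 94.0]))  # point A -> 'rower'
    print(classify([161.0, 51.0]))  # point B -> 'cox'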
We may take many more measurements than three. For example, the idealised optical character reader (OCR) illustrated in Figure 2.2.2 possesses an array of 7*5 photodiodes, which yield 35 independent output voltages or measurements. This set of measurements is the measurement vector, X.

Figure 2.2.2: An idealised optical character reader. The simple camera produces, through the 7*5 array of photodiodes, a set of 35 voltages (ranging from zero volts when no light falls on a diode to one volt when light of maximum intensity falls on it). This set is the measurement vector, X. The individual elements of X are random variables, that is, scalar quantities lying between 0 and 1. The measurement vectors for, say, all 'B's will be similar, so each 'B' will produce a point in the 35-dimensional pattern space that is close to all other points produced by letter 'B's. Similar clusters will be formed for the other letters.
The measurement vector is normally written as a column of its N elements, where N is the number of measurements (i.e., the dimensionality of the pattern space). It is often more compact to use the transpose of this vector, i.e.

         xT = [ x1   x2   ...   xk   ...   xN ]
                 
Each element of this vector, xk, is a random variable; this means that we cannot predict the value of xk before we take the measurement. If a photodiode voltage lies between 0 and 1, then xk will lie somewhere in this range.
The pattern space, in this example, possesses 35 dimensions, and each dimension is orthogonal to all the others. Clearly, we now refer to hyperspace, hyperspheres and hypersurfaces. All the normal rules of geometry apply to these high-dimensional spaces (we just cannot visualize them!). We may hope that there are 26 separate clusters, each representing a single letter of the alphabet. In this situation, it should be relatively easy to construct hypersurfaces that separate each cluster from the remainder. This may not be so.
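The following Python sketch shows, under the assumptions above, how the 7*5 grid of photodiode voltages can be flattened into a 35-element measurement vector, i.e. a single point in the 35-dimensional pattern space. The voltage values are randomly generated stand-ins for real camera output.

    import numpy as np

    # Stand-in for the 7*5 grid of photodiode voltages produced by the camera;
    # real values would come from the hardware, here they are random numbers in [0, 1].
    rng = np.random.default_rng(0)
    voltages = rng.uniform(0.0, 1.0, size=(7, 5))

    # Flattening the grid gives the measurement vector x with N = 35 elements,
    # a single point in the 35-dimensional pattern space.
    x = voltages.reshape(-1)
    print(x.shape)  # (35,)

    # The separation between two characters is the distance between their vectors.
    y = rng.uniform(0.0, 1.0, size=35)  # another (here unrelated) measurement vector
    print(np.linalg.norm(x - y))        # their distance in the 35-dimensional space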

Consider the two-dimensional pattern space of Figure 2.2.4, in which there are three pattern classes. The first case (a) is relatively simple, as only two decision surfaces are needed to divide the pattern space into the three class regions.

The second case (b) looks more complicated, as many more decision surfaces are required to form uniquely identifiable class regions. However, if we transformed our two measurements in some way, we could form the two thicker curved decision surfaces. The last case (c) is different. Here, radically different measurements (patterns) are associated with the same class. There is no way to transform our two measurements to recover the situation where only two decision surfaces are sufficient. We are now associating patterns which bear no similarity to each other with the same pattern class. This is an example of hard learning, for which we must use supervised learning.

2.3 Features and decision spaces

As the previous section noted, it can sometimes be beneficial to transform the measurements that we make, as this can greatly simplify the task of classification. It can also be unwise to use the raw data (in the case of the OCR system, the 35 individual photodiode voltages). Working in such a high-dimensional pattern space is not only wasteful in terms of computing resources but can easily lead to difficulties in accurate pattern classification. Consider the situation illustrated in Figure 2.3.1, where an OCR system has inadvertently been taught letter 'O's that are always circular and letter 'Q's that are always elliptical. After training, the system is exposed to elliptical 'O's and circular 'Q's. As more picture elements (pixels) match for the main body of the letters, rather than for the small cross-line, the system will make the incorrect classification. However, we know that the small cross-line is the one feature that distinguishes 'Q's from 'O's.
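As a hedged illustration of this idea, the sketch below maps a raw 7*5 character image onto just two features: an overall "body" measurement and a flag for the small cross-line region. Which pixels are treated as the cross-line region is an assumption made purely for illustration, not a prescription from the report.

    import numpy as np

    def features(image):
        """Map a 7*5 character image (pixel values in [0, 1]) to two features:
        the average darkness of the main body, and whether the assumed
        cross-line region in the lower-right corner is active."""
        image = np.asarray(image, dtype=float)
        body = image[:5, :].mean()                      # coarse 'body' measurement
        cross_line = float(image[5:, 3:].max() > 0.5)   # 1.0 if the cross-line region is set
        return np.array([body, cross_line])

A classifier working on this two-element feature vector would weight the cross-line as heavily as the whole body of the letter, rather than letting it be swamped by the other 34 pixels.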

3. Unsupervised Pattern Classification

3.1 Introduction

We need to measure the similarity of the various input pattern vectors so that we can determine which cluster or pattern class each vector should be associated with. This section discusses a number of the common metrics employed to measure this similarity. It also briefly mentions non-neural methods for unsupervised pattern classification.

3.2 Distance metrics

To record the similarity, or difference, between vectors we need to measure the distance between them. In conventional Euclidean space, we can use Pythagoras's theorem (Figure 4.2.1(a)). Normally, the distance between the two-dimensional vectors x and y is given by
   d(x, y) = | x - y | = [ (x1 - y1)^2 + (x2 - y2)^2 ]^(1/2)
This can be extended to N dimensions to yield the general expression for the Euclidean distance,

   d(x, y) = [ (x1 - y1)^2 + (x2 - y2)^2 + ... + (xN - yN)^2 ]^(1/2)
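A minimal Python sketch of this general expression, assuming nothing beyond the formula itself:

    import numpy as np

    def euclidean(x, y):
        """Euclidean distance between two N-dimensional vectors."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        return np.sqrt(np.sum((x - y) ** 2))

    print(euclidean([0, 0], [3, 4]))  # 5.0 -- the familiar 3-4-5 right triangle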

We can use many different distance measures, each of which defines a different kind of metric space, that is, a space in which distance has meaning. For a space to be metric, the distance must satisfy the following three conditions:
* If the vectors x and y are different, the distance between them must be positive; if the vectors are identical, the distance must be zero.
* d(x, y) = d(y, x): the distance from x to y is the same as the distance from y to x.
* d(x, y) + d(y, z) >= d(x, z): the distance between x and z can be no greater than the sum of the distance from x to y and the distance from y to z. This is the triangle inequality. A quick numerical check of all three conditions is sketched below.
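The following sketch checks the three conditions numerically, using the Euclidean distance defined above on three arbitrary 35-dimensional vectors (the vectors are random stand-ins for real measurement vectors):

    import numpy as np

    d = lambda a, b: np.linalg.norm(a - b)   # Euclidean distance as the metric

    rng = np.random.default_rng(42)
    x, y, z = rng.uniform(size=(3, 35))      # three arbitrary 35-dimensional vectors

    assert d(x, x) == 0.0                    # identical vectors: zero distance
    assert d(x, y) > 0.0                     # distinct vectors: positive distance
    assert np.isclose(d(x, y), d(y, x))      # symmetry
    assert d(x, y) + d(y, z) >= d(x, z)      # triangle inequality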

3.3 Dimensionality Reduction

One important function of statistical pattern recognition is dimensionality reduction. The set of measurements (i.e., the pattern space) originally taken may be too large and not the most appropriate. What we require is a means of reducing the number of dimensions while at the same time minimizing any error resulting from discarding measurements. If the dimensionality of the pattern space is p, then we cannot simply keep m of these dimensions (where m < p). We require some kind of transformation that ranks its dimensions in terms of the variance of the data. Why should this be so?
Consider the scatter of data points in Figure 4.3.1. The vector along the direction of greatest scatter (i.e., variance) captures the most important aspects of a particular data class. There are several important statistical methods based on transforming the measurement space onto a different set of axes, which may be more appropriate for discriminating the various data clusters (hopefully in fewer dimensions). The general approach is called multivariate analysis, with principal component analysis being the oldest and most common. There are many pitfalls for the novice in applying such techniques. The alternative approach, based on clustering methods, is perhaps easier to comprehend. It should be noted that there is no universal technique that is guaranteed to yield the best solution for all cases. Much in statistical analysis (and neural computing) is data dependent, as illustrated in Figure 4.3.2.
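As a sketch of the principal-component idea mentioned above, the following Python function centers the data and projects it onto the m directions of greatest variance, obtained from the eigenvectors of the covariance matrix. The helper name pca_reduce and the example data are illustrative assumptions, not code from the report.

    import numpy as np

    def pca_reduce(data, m):
        """Project p-dimensional data onto its m principal components: center
        the data, then keep the m eigenvectors of the covariance matrix with
        the largest eigenvalues (the directions of greatest variance)."""
        data = np.asarray(data, dtype=float)
        centered = data - data.mean(axis=0)
        cov = np.cov(centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
        top = eigvecs[:, np.argsort(eigvals)[::-1][:m]]    # m highest-variance directions
        return centered @ top                              # n x m reduced representation

    # Example: 200 made-up points in 5 dimensions reduced to 2.
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
    print(pca_reduce(samples, 2).shape)  # (200, 2)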

4. Unsupervised Neural Networks

4.1 Introduction

At the start of Section One, we mentioned supervised learning, where the desired output response of a network is determined by a set of targets. The general form of the relationship, or mapping, between the input and output domains is established by the training data. We say that an external teacher is needed to specify these input/output pairings. Networks that are trained without such a teacher learn by evaluating the similarity between the input patterns presented to the network. For example, if the scalar product is used as a measure of the similarity between input vectors and the network's weight vectors, then an unsupervised network could adapt its weights to become more like the frequently occurring input vectors. Such networks can therefore be used to perform cluster analysis. They make use of the statistical properties of the input data, as frequently occurring patterns will have a greater influence than infrequent ones. The final trained response of the network must resemble in some way the underlying probability density function of the data.
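A small Python sketch of the scalar-product similarity idea described above; the input pattern, weight vectors and learning rate are invented for illustration:

    import numpy as np

    x = np.array([0.8, 0.6])              # a unit-length input pattern
    weights = np.array([[1.0, 0.0],       # weight vector of neuron 0
                        [0.6, 0.8]])      # weight vector of neuron 1

    # The scalar (dot) product of the input with each weight vector measures how
    # alike they are; for unit-length vectors a larger value means a smaller angle.
    similarity = weights @ x
    print(similarity)                     # [0.8, 0.96]: neuron 1 matches x more closely

    # A simple unsupervised update: nudge the best-matching weight vector towards x,
    # so the weights gradually come to resemble frequently occurring inputs.
    winner = int(np.argmax(similarity))
    weights[winner] += 0.1 * (x - weights[winner])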
                 
Unsupervised learning makes use of the redundancy present in the input data in order to produce a more compact representation. There is a sense of discovery with unsupervised learning. An unsupervised network can discover the existence of unlabelled data clusters, but it cannot give meaningful names to these clusters, nor can it associate different clusters as representatives of the same class.

4.2 Winner-take-all Network

The basic form of learning employed in unsupervised networks is called competitive learning and, as the name suggests, the neurons compete among each other to be the one that fires. All the neurons in the network are identical except that they are initialized with randomly distributed weights. As only one neuron will fire, it can be declared the winner. A simple winner-take-all network is shown in Figure 5.2.1.
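The following Python sketch gives one possible reading of such a winner-take-all (competitive) learning rule: for each presented input, the closest weight vector wins and is nudged towards that input. The network size, learning rate and clustered example data are illustrative assumptions, not the configuration of Figure 5.2.1.

    import numpy as np

    def train_winner_take_all(data, n_neurons=3, lr=0.1, epochs=20, seed=0):
        """Minimal competitive-learning sketch: for each input, the neuron whose
        weight vector is closest wins, and only the winner's weights are moved
        a little towards that input."""
        rng = np.random.default_rng(seed)
        weights = rng.uniform(data.min(), data.max(), size=(n_neurons, data.shape[1]))
        for _ in range(epochs):
            for x in rng.permutation(data):                    # present inputs in random order
                winner = np.argmin(np.linalg.norm(weights - x, axis=1))
                weights[winner] += lr * (x - weights[winner])  # only the winner learns
        return weights

    # Example: three well-separated clusters of 2-D points; after training, each
    # weight vector sits near one of the cluster centers.
    rng = np.random.default_rng(1)
    clusters = np.concatenate([rng.normal(c, 0.1, size=(50, 2)) for c in ([0, 0], [3, 0], [0, 3])])
    print(train_winner_take_all(clusters))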

6. CONCLUSION

Devised by Kohonen in the early 1980s, the SOM is now one of the most popular and widely used types of unsupervised artificial neural network.

It is built on a one- or two-dimensional lattice of neurons that captures the important features contained in an input (data) space of interest. In so doing, it provides a structural representation of the input data, with the neurons' weight vectors serving as prototypes. The SOM algorithm is neurobiologically inspired, incorporating all the mechanisms that are basic to self-organization: competition, cooperation, and self-amplification.
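As a rough sketch of how these three mechanisms fit together, the following Python function implements a minimal SOM training loop: competition selects the best-matching unit, a Gaussian neighbourhood on the lattice provides cooperation, and the weights in that neighbourhood adapt towards the input (self-amplification). All parameter values (grid size, learning rate, neighbourhood radius, decay schedules) are illustrative assumptions rather than Kohonen's original settings.

    import numpy as np

    def train_som(data, grid=(10, 10), epochs=100, lr0=0.5, sigma0=3.0, seed=0):
        """Minimal SOM training sketch: competition (best-matching unit),
        cooperation (Gaussian neighbourhood on the lattice) and adaptation
        (weights move towards the input)."""
        rng = np.random.default_rng(seed)
        rows, cols = grid
        weights = rng.uniform(data.min(), data.max(), size=(rows * cols, data.shape[1]))
        # Lattice coordinates of each neuron, used for the neighbourhood distance.
        coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
        for t in range(epochs):
            lr = lr0 * np.exp(-t / epochs)        # decaying learning rate
            sigma = sigma0 * np.exp(-t / epochs)  # shrinking neighbourhood radius
            for x in rng.permutation(data):
                bmu = np.argmin(np.linalg.norm(weights - x, axis=1))         # competition
                lattice_d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
                h = np.exp(-lattice_d2 / (2.0 * sigma ** 2))                 # cooperation
                weights += lr * h[:, None] * (x - weights)                   # adaptation
        return weights.reshape(rows, cols, -1)

After training, each neuron's weight vector acts as a prototype for a region of the input space, and neighbouring neurons on the lattice hold similar prototypes.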

The development of the self-organizing map as a neural model is motivated by a distinctive feature of the human brain. What is astonishing about Kohonen's SOM algorithm is that it is simple to implement, yet its properties are so difficult to analyze mathematically in a general setting.

7. REFERENCES
    
Books:
1. Simon Haykin, Neural Networks.
2. M. H. Hassoun, Fundamentals of Artificial Neural Networks.
3. J. M. Zurada, Introduction to Artificial Neural Networks.

Sites:
1. www.som.com
