ABSTRACT
With the dawn of the genome era computational methods in the automatic analysis of biological data have become increasingly important. The explosion in the production rate of expression level data has highlighted the need for automated techniques that help scientists analyze, understand, and cluster (sequence) the enormous amount of data that is being produced. Example of such problems are analyzing gene expression level data produced by microarray technology on the genomic scale, sequencing genes on the genomic scale, sequencing proteins and amino acids, etc. Researchers have recognised Artificial Neural Networks as a promising technique that can be applied to several problems in genome informatics and molecular sequence analysis. This seminar explains how Neural Networks have been widely employed in genome informatics research.
INTRODUCTION
Recent developments in the genomics arena have resulted in techniques that can produce large amounts of expression level data. One such technique is microarray technology, which relies on the hybridization properties of nucleic acids to monitor DNA or RNA abundance on a genomic scale. Microarrays have revolutionized the study of the genome by allowing researchers to study the expression of thousands of genes simultaneously for the first time. This is predicted to be essential to understanding the role of genes in various biological functions.
Discovering patterns of gene expression is closely related to correlating sequences of genes to specific biological functions, and thereby to understanding the role of genes in biological functions on a genomic scale. The ability to simultaneously study thousands of genes under a host of differing conditions presents an immense challenge to the fields of computational science and data mining. New computational and data mining techniques need to be developed in order to properly comprehend and interpret expression data with the above goal in mind.
ANNs are used as a solution to various problems; however, their success as an intelligent pattern recognition methodology has been most prominently advertised. The most important, and attractive, feature of ANNs is their capability of learning (generalizing) from examples, that is, of extracting knowledge from data. ANNs can do this without any prespecified rules that define intelligence or represent an expert’s knowledge. This feature makes ANNs a very popular choice for gene expression analysis and sequencing. Due to their power and flexibility, ANNs have even been used as tools for relevant variable selection, which can in turn greatly increase the expert’s knowledge and understanding of the problem.
ARTIFICIAL NEURAL NETWORKS
An
artificial neural network is a parallel computational model composed of densely
interconnected adaptive processing elements called neurons. It is an
information-processing system that has certain performance characteristics in common with biological neural networks. It resembles the brain in that
knowledge is acquired by the network through a learning process and that the
interconnection strengths known as synaptic weights are used to store the
knowledge.
A
neural network is characterised by its pattern of connections between the
neurons referred to as network architecture and its method of determining the
weights on the connections called training or learning algorithm. The weights
are adjusted on the basis of data. In other words, neural networks learn from
examples and exhibit some capability for generalisation beyond the training
data. This feature makes such computational models very appealing in
application domains where one has little or incomplete understanding of the
problem to be solved, but where training data is readily available. Neural
networks normally have great potential for parallelism, since the computations of the components are largely independent of each other.
Artificial
neural networks are viable computational models for a wide variety of problems.
Already, useful applications have been designed, built, and commercialised for
various areas in engineering, business and biology. These include pattern
classification, speech synthesis and recognition, adaptive interfaces between
human and complex physical systems, function approximation, image compression,
associative memory, clustering, forecasting and prediction, combinatorial
optimisation, nonlinear system modelling, and control. Although they may have
been inspired by neuroscience, the majority of the networks have close
relevance or counterparts to traditional statistical methods such as
non-parametric pattern classifiers, clustering algorithms, nonlinear filters, and statistical regression models.
Benefits of Neural Networks
It
is apparent that a neural network derives its computational power through,
first, its massively parallel distributed structure and, second, its ability to
learn and therefore generalise. Generalisation refers to the neural network
producing reasonable outputs for inputs not encountered during training. These
two information-processing capabilities make it possible for neural networks to
solve complex problems that are currently intractable. A complex problem of
interest is decomposed into a number of relatively simple tasks, and neural
networks are assigned a subset of tasks that match their inherent capabilities.
The use of neural networks offers the following useful properties and
capabilities.
- Nonlinearity: An artificial neuron can be linear or nonlinear. A neural network made up of an interconnection of nonlinear neurons is itself nonlinear. Nonlinearity is a highly important property, particularly if the underlying physical mechanism responsible for generation of the input signal (e.g., a speech signal) is inherently nonlinear.
- Input-Output Mapping: A popular paradigm of learning called learning with a teacher, or supervised learning, involves modification of the synaptic weights of a neural network by applying a set of labelled training samples or task examples. Each example consists of a unique input signal and a corresponding desired response. The network is presented with an example picked at random from the set, and the synaptic weights of the network are modified to minimise the difference between the desired response and the actual response of the network produced by the input signal, in accordance with an appropriate statistical criterion. Thus the network learns from the examples by constructing an input-output mapping for the problem at hand.
- Adaptivity: Neural networks have a built-in capability to adapt their synaptic weights to changes in the surrounding environment. In particular, a neural network trained to operate in a specific environment can easily be retrained to deal with minor changes in the operating environmental conditions. Moreover, when it is operating in a nonstationary environment, a neural network can be designed to change its synaptic weights in real time.
- Evidential Response: In the context of pattern classification, a neural network can be designed to provide information not only about which pattern to select, but also about the confidence in the decision made. This latter information may be used to reject ambiguous patterns, should they arise, thereby improving the classification performance of the network.
- Contextual Information: Knowledge is represented by the very structure and activation state of a neural network. Every neuron in the network is potentially affected by the global activity of all other neurons in the network. Consequently, contextual information is dealt with naturally by a neural network.
- Fault Tolerance: A neural network implemented in hardware form has the potential to be inherently fault tolerant, or capable of robust computation, in the sense that its performance degrades gracefully under adverse operating conditions. This can be attributed to the massively distributed nature of the information stored in the network. Thus, in principle, a neural network exhibits a graceful degradation in performance rather than catastrophic failure.
- VLSI Implementability: The massively parallel nature of a neural network makes it potentially fast for the computation of certain tasks. This same feature makes a neural network well suited for implementation using very-large-scale integration (VLSI) technology. One particularly beneficial virtue of VLSI is that it provides a means of capturing truly complex behaviour in a highly hierarchical fashion.
- Neurobiological Analogy: The design of a neural network is motivated by analogy with the brain, which is living proof that fault-tolerant parallel processing is not only physically possible but also fast and powerful.
NEURAL NETWORK FOUNDATIONS
Artificial
Neural networks (ANNs) belong to the adaptive class of techniques in the machine
learning arena. ANNs are used as a solution to various problems; however,
their success as an intelligent pattern recognition methodology has been most
prominently advertised. Most models of ANNs are organized in the form of a
number of processing units called artificial neurons, or simply neurons, and a
number of weighted connections referred to as artificial synapses between the
neurons. The process of building an ANN, similar to its biological inspiration,
involves a learning episode. During the learning episode, the network observes a
sequence of recorded data, and adjusts the strength of its synapses according
to a learning algorithm and based on the observed data. The process of
adjusting the synaptic strengths in order to be able to accomplish a certain
task, much like the brain, is called “learning”. Learning algorithms are
generally divided into two types, supervised and unsupervised. The supervised
algorithms require labeled training data. In other words, they require more a
priori knowledge about the training set.
MODELS OF A NEURON
A neuron is an information-processing unit that is fundamental to the operation of a neural network. Fig. 2.1a shows the model of a neuron, which forms the basis for designing artificial neural networks. The three basic elements of the neuronal model are:
1. A set of synapses or connecting links, each of which is characterized by a weight or strength of its own. Specifically, a signal xj at the input of synapse j connected to neuron k is multiplied by the synaptic weight wjk.
2. An adder for summing the input signals, weighted by the respective synapses of the neuron; the operations described here constitute a linear combiner.
3. An activation function for limiting the amplitude of the output of a neuron. The activation function is also referred to as a squashing function, as it squashes (limits) the permissible amplitude range of the output signal to some finite value.
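As an illustration of these three elements, the following minimal Python sketch computes the output of a single neuron; the sigmoid activation and the example values are illustrative assumptions, not taken from any particular system described here.

import math

def neuron_output(x, w, bias=0.0):
    """Output of a single artificial neuron (illustrative sketch)."""
    # 1. Synapses: each input signal xj is multiplied by its synaptic weight.
    # 2. Adder: the weighted inputs are summed (the linear combiner).
    v = sum(wj * xj for wj, xj in zip(w, x)) + bias
    # 3. Activation (squashing) function: a logistic sigmoid that limits
    #    the output amplitude to the range (0, 1).
    return 1.0 / (1.0 + math.exp(-v))

# Example: a neuron with three inputs
print(neuron_output([0.5, -1.0, 2.0], [0.8, 0.2, -0.5]))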
NETWORK ARCHITECTURES
The
manner in which the neurons of a neural network are structured is intimately
linked with the learning algorithm used to train the network. Network structure
can be broadly divided into three classes of network architectures.
Single-Layer Feedforward
Networks
In
a layered neural network the neurons are organised in the form of layers. The
simplest form of a layered network consists of an input layer of source nodes
that projects onto an output layer of neurons, but not vice-versa. In other
words, this network is strictly a feedforward or acyclic type.
Such
a network is called a single-layer network, with the designation “single layer”
referring to the output layer of computational nodes. The input layer of source nodes is not counted, as no computation is performed there.
Multilayer Feedforward Networks
The
second class of a feedforward neural network distinguishes itself by the
presence of one or more hidden layers, whose computation nodes are
correspondingly called hidden neurons or hidden units. The function of hidden
neurons is to intervene between the external input and the network output in
some useful manner. By adding one or more hidden layers, the network is enabled
to extract higher-order statistics.
In
a rather loose sense, the network acquires a global perspective despite its
local connections due to the extra set of synaptic connections and extra
dimension of neural interactions. The ability of hidden neurons to extract
higher-order statistics is particularly valuable when the size of the input is
large.
Recurrent Networks
A recurrent network distinguishes itself from a feedforward neural network in that it has at least one feedback loop. For example, a recurrent network may consist of a single layer of neurons with each neuron feeding its output signal back to the inputs of all the other neurons, as illustrated in the architectural graph. The presence of feedback loops has a profound impact on the learning capability of the network and its performance. Moreover, the feedback loops involve the use of particular branches composed of unit-delay elements (denoted by z^-1), which result in nonlinear dynamical behaviour, assuming that the neural network contains nonlinear units.
TRAINING OF NEURAL NETWORKS
Neural
networks are models that may be used to approximate, summarise, classify,
generalise or otherwise represent real situations. Before models can be used
they have to be trained, or made to ‘fit’, to the representative data. The
model parameters, e.g., number of layers, number of units in each layer and
weights of the connections between them, must be determined. In ordinary
statistical terms this is called regression. There are two fundamental types of
training with neural networks: supervised and unsupervised learning. For
supervised training, as in regression, data used for the training consists of
independent variables (also referred to as feature variables or predictor
variables) and dependent variables (target values). The independent variables
(input to the neural network) are used to predict the dependent variables
(output from the network). Unsupervised training does not have dependent
(target) values supplied: the network is supposed to cluster the data
automatically into meaningful sets.
The
fundamental idea behind training, for all neural networks, is to pick a set of weights, apply the inputs to the network, and gauge the network’s performance with this set of weights. If the network does not perform
well, then the weights are modified by an algorithm specific to each
architecture and the procedure is then repeated. This iterative process
continues until some pre-specified criterion has been achieved. A training pass
through all vectors of the input data is called an epoch. Iterative changes can
be made to the weights with each input vector, or changes can be made after all
input vectors have been processed. Typically, weights are iteratively modified
by epochs.
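The following Python sketch illustrates this iterative, epoch-wise idea for the simplest possible case, a single linear unit trained with a delta-rule style update; the learning rate, the stopping tolerance and the toy data are assumptions made purely for illustration.

def train(inputs, targets, epochs=100, lr=0.05, tol=1e-6):
    """Minimal supervised training loop: pick weights, gauge performance,
    adjust, and repeat for a number of epochs (passes over the data)."""
    w = [0.0] * len(inputs[0])                 # initial set of weights
    for epoch in range(epochs):                # one epoch = one pass over all input vectors
        total_error = 0.0
        for x, t in zip(inputs, targets):
            y = sum(wi * xi for wi, xi in zip(w, x))   # network response
            err = t - y                                # gauge performance
            total_error += err * err
            # modify the weights after each input vector (per-pattern update)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
        if total_error < tol:                  # pre-specified stopping criterion
            break
    return w

# Toy example: learn the mapping y = 2*x1 - x2
print(train([[1, 0], [0, 1], [1, 1], [2, 1]], [2, -1, 1, 3]))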
DESIGN ISSUES
Applications
of neural networks to genome informatics share several similar design issues
that affect system performance and integrity. The major issues include the
input-output encoding, the neural network design, the training and prediction
data compilation and system evaluation mechanism. The first step in developing
a neural network system is problem definition, in other words determining the
desired input-output mapping for the task. The design of a complete system then
involves the choices for a pre-processor and a post-processor, in addition to
the neural network model itself. Data pre-processing is an integral and crucial
part of any neural network system design. Pre-processing involves both feature
representation and input coding. The selection of appropriate feature
representation is probably one of the most significant factors determining the
performance of the final system because it determines what information is
presented to the neural networks. Feature representation is critical to
molecular sequence analysis for two reasons: (1) this is how a priori knowledge about the molecular structures and functions can be utilised to improve accuracy, and (2) it allows extraction of salient features that are characteristic of given sequences.
FEATURE REPRESENTATION
The
amino acid and protein features can be represented in different ways to
maximise information extraction. They may be represented as real-numbered
measurements in a continuous scale, such as mass or hydrophobicity, or as
vectors of distances or frequencies. They can also be conveniently organised
into classes based on these properties. This effectively reduces the original
20-letter amino acid alphabet set to an alternative alphabet set of smaller
sizes and emphasizes the various properties of the molecular residues and
maximises feature extraction. Using the hydrophobicity scale as an example, the
amino acid residues can be represented by the real-numbered scalar value. They can
also be classified with respect to their side chains as polar, nonpolar, or
amphipathic, depending on the range of the hydrophobicity in the scale, such as
in
Group 1: Hi >= (Hm + sd/2)
Group 2: (Hm - sd/2) <= Hi < (Hm + sd/2)
Group 3: Hi < (Hm - sd/2)

Hm = (1/20) ΣHi

where Hm is the mean value of the hydrophobicity over the 20 amino acids, sd is the standard deviation, and Hi is the hydrophobicity value of amino acid i in the scale.
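A small Python sketch of this grouping scheme is shown below; the Kyte-Doolittle hydrophobicity values are used purely as an illustrative scale, and any other 20-value scale could be substituted.

import statistics

# Illustrative hydrophobicity scale (Kyte-Doolittle values)
H = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
     'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
     'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
     'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2}

Hm = sum(H.values()) / 20            # mean hydrophobicity over the 20 amino acids
sd = statistics.pstdev(H.values())   # standard deviation of the scale

def hydro_group(aa):
    """Map an amino acid to group 1, 2 or 3 by its hydrophobicity."""
    h = H[aa]
    if h >= Hm + sd / 2:
        return 1
    elif h >= Hm - sd / 2:
        return 2
    return 3

print({aa: hydro_group(aa) for aa in sorted(H)})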
DATA ENCODING
Once
a desired input-output mapping task is determined, the design of a complete
system then involves the choices for a pre-processor and a post-processor, in
addition to the neural network model itself. For molecular applications, the
input sequence encoding method converts molecular sequences into input vectors
of the neural networks. Likewise, an output encoding method is used in the
post-processing step to convert the neural network output to desired forms.
Molecular
applications involve the processing of individual residues, sequence windows of
n consecutive residues, or the complete sequence string. Correspondingly, the coding may be local, involving only single or neighbouring residues in a short sequence segment, or global, involving long-range relationships in a full-length sequence or long sequence segment. The sequence encoding methods may be
categorised as direct encoding or indirect encoding. Direct encoding converts
each individual residue to a vector, whereas indirect coding provides overall
information of a complete string. Direct encoding preserves positional
information, but can only deal with fixed-length sequence windows. On the other
hand, indirect encoding disregards the ordering information, but can be used
for sequences of either fixed or variable lengths.
Direct Input Sequence Encoding
In
direct encoding, each molecular residue can be represented by its identity or
features. The most commonly used method for direct encoding involves the use of
indicator vectors. The indicator vector
is a vector of binary numbers that has only one unit turned on to indicate the
identity of the corresponding residue.
Here, a vector of 4 units with 3 zeros and a single 1 is required for a
nucleotide, so the 4 nucleotides may be represented as 1000 (A), 0100 (T), 0010
(G), 0001 (C). The spacer residue may be represented as 0000, without an
additional unit. Likewise, a vector of 20 input units is needed to represent an amino acid, or a vector of 21 units may be used to include a unit for the spacer in regions between proteins. These binary representations are dubbed BIN4, BIN20 and BIN21. The lengths of their input vectors are 4 x n, 20 x n, and 21 x n, respectively, where n is the total number of nucleotide or amino acid residues in the sequence window. An alternative to the sparse encoding scheme of BIN4 is a dense representation that uses two units for the 4 nucleotides (e.g., 00 for A, 01 for T, 10 for G, and 11 for C). A comparative study, however, showed that the BIN4 coding was better, possibly due to its unitary coding matrix with identical Hamming distance between each pair of vectors.
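A minimal sketch of the BIN4 direct encoding might look as follows; the window string and the dash used for the spacer are illustrative choices.

# Indicator vectors for the four nucleotides (BIN4); the spacer is all zeros.
BIN4 = {'A': [1, 0, 0, 0], 'T': [0, 1, 0, 0],
        'G': [0, 0, 1, 0], 'C': [0, 0, 0, 1],
        '-': [0, 0, 0, 0]}   # spacer, no additional unit needed

def encode_window(window):
    """Direct encoding of a fixed-length window into a vector of length 4 x n."""
    vec = []
    for base in window:
        vec.extend(BIN4[base])
    return vec

print(encode_window("ATG-C"))   # 5 residues -> an input vector of 20 units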
Indirect Input Sequence Encoding
The
direct sequence encoding methods preserve the order of residues along the
sequence string and encode primarily local information. They are, however, not
suitable when global sequence features or information content is more important
to the application, or when variable length sequences are to be analysed. A
commonly used indirect sequence encoding method is the n-gram hashing method,
which computes residue frequencies. Residue frequency is one of the most
effective global features for sequence determination. As shown in Figure 3.2a, alternative alphabets of various n-gram terms can be used to maximise information extraction. The final size of the input vector is M^n, where M is the alphabet size and n is the n-gram size.
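As a sketch, an n-gram frequency vector for a nucleotide sequence could be computed as below; normalising by the number of n-grams in the sequence is one simple choice among several.

from itertools import product

def ngram_vector(seq, n=2, alphabet="ACGT"):
    """Indirect encoding: frequencies of all M**n possible n-grams.
    The vector length is fixed regardless of the length of `seq`."""
    terms = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = {t: 0 for t in terms}
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        if gram in counts:
            counts[gram] += 1
    total = max(1, len(seq) - n + 1)
    return [counts[t] / total for t in terms]

print(len(ngram_vector("ATGCGCTA", n=2)))   # 4**2 = 16 input units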
Input Trimming
An
excessive number of input units not only results in poor generalisations, but
also slows network training. Reducing the number of input variables often leads to improved performance on a given data set, even though information is being
discarded. Thus one important role of pre-processing is to reduce the dimensionality
of the input data.
Output Encoding
The
choice of output encoding method is more straightforward, depending on the
neural network mapping desired. The discrimination usually involves a yes or no
answer that can be encoded using a single output unit (dense coding) or two
output units (sparse coding). The number of output units in the classifier is
usually the same as the number of classes, encoded as indicator vectors, such
as the number of secondary structure states of protein families. The value of
the output units can be used qualitatively (all-or-none, by comparing to a threshold value), or as a quantitative measure of confidence level or activity
level.
NETWORK DESIGN
While
designing a network for application in genome informatics, several factors have
to be taken into consideration. The type of network architecture and learning
algorithm chosen are closely linked to the type of application the network is
to be involved in. To optimise the neural network design, important choices
must be made for the selection of numerous parameters. Many of these are
internal parameters that need to be tuned with the help of experimental results
and experience with the specific application under study. The following
discussion focuses specifically on back-propagation design choices for the
learning rate, momentum term, activation function and termination criteria.
The
selection of a learning rate is important in finding the true global minimum of
the error distance. The convergence speed of back-propagation is directly
related to the learning parameter. Too small a learning rate will slow down
training progress, whereas too large a learning rate may simply produce
oscillations between relatively poor solutions.
A
momentum rate can be helpful in speeding convergence and avoiding local minima.
In back propagation with momentum, the weight change is in a direction that is
a combination of the current gradient and the previous gradient. Momentum
allows the net to make reasonably large weight adjustments as long as the
corrections are in the same general direction for several patterns, while using
a smaller learning rate to prevent a large response to the error from any one
training pattern. An activation function for a back propagation network should
satisfy several important characteristics. It should be continuous and
differentiable. For computational efficiency it should have a derivative that
is easy to compute.
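The momentum-augmented weight change described above can be sketched as follows; the learning rate and momentum values are typical illustrative settings, not recommendations from the text.

def momentum_update(weights, gradients, prev_deltas, lr=0.1, momentum=0.9):
    """One back-propagation weight change combining the current gradient
    with the previous weight change: delta(t) = -lr*grad + momentum*delta(t-1)."""
    new_weights, new_deltas = [], []
    for w, g, d_prev in zip(weights, gradients, prev_deltas):
        d = -lr * g + momentum * d_prev
        new_weights.append(w + d)
        new_deltas.append(d)
    return new_weights, new_deltas

# Example: two weights, their current error gradients, and the previous changes
print(momentum_update([0.5, -0.3], [0.02, -0.01], [0.0, 0.0]))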
The
back-propagation network cannot, in general, be shown to converge, nor are there well-defined stopping criteria. The training is usually stopped by a
user-determined threshold value (tolerance) for the error function, or a fixed
upper limit on the number of training iterations called epochs. The termination
can also be based on performance of a validation data set used to monitor
generalisation performance during learning and to terminate learning when there
is no more improvement.
NN ARCHITECTURES APPLIED IN GENOME INFORMATICS
Due
to their power and flexibility, ANNs have even been used as tools for relevant
variable selection, which can in turn greatly increase the expert’s knowledge
and understanding of the problem. There is a very large body of research that
has resulted in a large number of ANN architectures. However, this chapter discusses only some
of the types that have been used in sequencing.
Layered,
feed-forward, backpropagation neural networks
These
are a class of ANNs whose neurons are organized in layers. The layers are
normally fully connected, meaning that each neuron of a layer is connected to
each element of the next layer. However, self-organizing varieties also exist
in which a network starts either with a minimal number of synaptic connections
between the layers and adds to the number as training progresses (constructive),
or starts as a fully connected network and prunes connections based on the data
observed in training (destructive). Backpropagation is a learning
algorithm that, in its original version, belongs to the gradient descent optimization
methods. The combination of the backpropagation learning algorithm and feed-forward, layered networks provides the most popular type of ANN. These ANNs have been applied to virtually all pattern recognition problems, and are
typically the first networks tried on a new problem. The reason for this is the
simplicity of the algorithm, and the vast body of research that has studied
these networks. As such, in sequencing, many researchers have also used this
type of network as a first line of attack. In Methods in Enzymology: Computer Methods for Macromolecular Sequence Analysis, the author C. H. Wu mentions the development
of a system called gene classification artificial neural system
(GenCANS), which is based on a three layered, feed-forward backpropagation
network. GenCANS was designed to “classify new (unknown) sequences into
predefined (known) classes. It involves two steps, sequence encoding and
neural network classification, to map molecular sequences (input) into
gene families (output)”. The same type of network has been used to perform
rapid
searches for sequences of proteins.
Self Organizing Map
A
self-organizing map (SOM) is a type of neural network approach first proposed
by Kohonen. SOMs have been used as a divisive clustering approach in many areas
including genomics. Several groups have used SOMs to discover patterns in gene
expression data. Tamayo and colleagues use self-organizing maps to explore
patterns of gene expression generated using Affymetrix arrays, and provide the
GENECLUSTER implementation of SOMs. Tamayo and his colleagues explain the
implementation of SOM for expression level data as follows: “An SOM has
a set of nodes with a simple topology and a distance function on the nodes.
Nodes are iteratively mapped into k-dimensional “gene expression space” (in
which the ith coordinate represents the expression level in the ith
sample)”. A SOM assigns genes to a series of partitions on the basis of the
similarity of their expression vectors to reference vectors that are defined
for each partition. The summary of the basic SOM algorithm is perhaps best
described in Quackenbush’s review: “First, random vectors are constructed and
assigned to each partition. Second, a gene is picked at random and, using a
selected distance metric, the reference vector that is closest to the gene is
identified. Third, the reference vector is then adjusted so that it is more
similar to the vector of the assigned gene. The reference vectors that are nearby on the two-dimensional grid are also adjusted so that they are more similar to
the vector of the assigned gene. Fourth, steps 2 and 3 are iterated several
thousand times, decreasing the amount by which the reference vectors are
adjusted and increasing the stringency used to define closeness in each step.
As the process continues, the reference vectors converge to fixed values. Last,
the genes are mapped to the relevant partitions depending on the reference
vector to which they are most similar”.
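A minimal Python sketch of one iteration of this algorithm (steps 2 and 3 of the summary above) is given below; the Euclidean distance metric, the Gaussian neighbourhood influence and the fixed learning rate are simplifying assumptions.

import math, random

def som_step(genes, ref, grid, lr=0.1, radius=1.0):
    """One SOM iteration: pick a gene at random, find the closest reference
    vector, and adjust it (and its grid neighbours) towards the gene."""
    g = random.choice(genes)                     # step 2: pick a gene at random
    dists = [math.dist(g, r) for r in ref]       # distance to each reference vector
    best = dists.index(min(dists))               # closest reference vector
    for j, r in enumerate(ref):                  # step 3: adjust winner and neighbours
        d_grid = math.dist(grid[best], grid[j])  # distance on the 2-D grid
        if d_grid <= radius:
            influence = math.exp(-d_grid ** 2 / 2.0)
            ref[j] = [ri + lr * influence * (gi - ri) for ri, gi in zip(r, g)]
    return ref

Iterating this step several thousand times, while shrinking the learning rate and the neighbourhood radius, corresponds to step 4 of the summary.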
SOMs,
like the gene shaving approach, have the distinct advantage that they
allow a priori knowledge to be included in the clustering process. Tamayo
explains this and other advantages of the SOM approach as follows: “The SOM has
a number of features that make them particularly well suited to clustering and
analyzing gene expression patterns. They are ideally suited to exploratory data
analysis, allowing one to impose partial structure on the clusters (in contrast
to the rigid structure of hierarchical clustering, the strong prior hypotheses
used in Bayesian clustering, and the nonstructure of k-means clustering)
facilitating easy visualization and interpretation. SOMs have good
computational properties and are easy to implement, reasonably fast, and are
scalable to large data sets”.
The
most prominent disadvantage of the SOM approach is that it is difficult to know
when to stop the algorithm. If the map is allowed to grow indefinitely, the size of the SOM gradually increases to a point at which clearly different sets of
expression patterns are identified. Therefore, as with k-means clustering, the user
has to rely on some other source of information, such as PCA, to determine the
number of clusters that best represents the available data. For this reason,
Sasik and his colleagues believe that “SOM, as implemented by Tamayo, is
essentially a restricted version of k-means: Here, the k clusters are linked
by some arbitrary user-imposed topological constraints (e.g. a 3 x 2 grid), and
as such suffers from all of the problems mentioned above for k-means (and
more), except that the constraints expedites the optimization process”.
There
are many varieties of SOM, among which the self-organizing feature maps (SOFM)
should be mentioned. The growing cell
structure (GCS) is another derivative of SOFM. It is a self-organizing
and incremental (constructive) neural learning approach. The unsupervised
variety of GCS was used by Azuaje for discovery of gene expression patterns in
B-cell lymphoma. Using cDNA microarray expression data, GCS was able to
“identify normal and diffuse large B-cell lymphoma (DLBCL) patients.
Furthermore, it distinguishes patients with molecularly distinct types of DLBCL
without previous knowledge of those subclasses”.
Amino Acid Grouping Example
A self-organizing map was trained with 4 physical features of the amino acids: mass in daltons, K & D hydrophobicity, surface area and turn propensity. Mass was scaled to the unit interval. No other information was presented to the self-organising map. A 10 x 10 grid was used. After only 50 iterations, each amino acid was mapped into one of the 100 grid points; each intersection of lines in the table represents a grid point.
It can be seen that the groupings found automatically by the self-organising map resemble the exchange groups found by elaborate examinations of sequence conservation. The traditional groups are {H, R, K}, {D, E, N, Q} (these first two groups are often combined), {S, P, T, A, G}, {M, I, L, V}, {F, W, Y}, {C}. Only the F and C amino acids are out of place according to these groups. This is a remarkable achievement for a simple network, working unsupervised, with only four amino acid features as input data. This simple application illustrates the potential power of self-organising maps.
RADIAL BASIS FUNCTION
Networks based on radial basis functions have been developed to address some of the problems encountered with training MLPs: radial basis function networks are guaranteed to converge and training is much more rapid. Both are feed-forward networks with similar-looking diagrams and their applications are similar; however, the principles of action of radial basis function networks and the way they are trained are quite different from those of MLPs.
INTRODUCTION TO RBF
The
essence of the difference between the operation of radial basis function
networks and multilayer perceptrons can be seen in Fig. 4.3a. MLPs classify
data by the use of hyperplanes that divide the data space into discrete areas;
radial basis function networks cluster the data into a finite number of ellipsoid regions. Classification is then a matter of finding which ellipsoid is closest for a given test data point. The hidden units of an RBF network are not the same as those used in an MLP,
and the weights between input and hidden layer have different meanings.
Transfer functions typically used include the Gaussian function, spline
functions and various quadratic functions: they are all smooth functions, which
taper off as distance from a centre point increases.
For radial basis function networks, each hidden unit represents the centre of a cluster in the data space. Input to a hidden unit in an RBF is not the weighted sum of its inputs, but a distance measure: a measure of how far the input vector is from the centre of the basis function for that hidden unit. If x and mu_j are vectors, the Euclidean distance between them is

Dj = || x - mu_j ||

where x is an input vector and mu_j is the location vector, or centre, of the basis function for hidden node j. The hidden node then computes its output as a function of the distance between the input vector and its centre. For the Gaussian RBF the hidden unit output is

hj = exp( -Dj^2 / (2 * sigma_j^2) )

where Dj is the Euclidean distance between the input vector and the location vector for hidden unit j, hj is the output of hidden unit j, and sigma_j^2 is a measure of the size of cluster j (its variance).
So it can be seen that, computationally, an RBF has three separate actions:
1. Compute the distance from the input vector to each of the hidden units (each of which represents an ellipsoid region in data space);
2. Compute the output from each hidden unit (a function of the distance measure in step 1);
3. Finally, use the outputs of the hidden units to compute the network outputs with a simple weighted-sum function.
The only difficult part is finding values of mu_j and sigma_j for each hidden unit, and the weights between the hidden and output layers; that is, training the network. Training RBF networks is considerably faster than training multilayer perceptrons. On the other hand, once trained, the feed-forward process for MLPs is faster than for RBFs.
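These three actions can be sketched in a few lines of Python; the Gaussian hidden-unit function and the example values are illustrative assumptions.

import math

def rbf_forward(x, centres, sigmas, out_weights):
    """Forward pass of an RBF network following the three steps above."""
    # 1. Distance from the input vector to each hidden-unit centre
    D = [math.dist(x, mu) for mu in centres]
    # 2. Hidden-unit outputs: a Gaussian function of the distance
    h = [math.exp(-d ** 2 / (2 * s ** 2)) for d, s in zip(D, sigmas)]
    # 3. Network output: a simple weighted sum of the hidden outputs
    return sum(w * hj for w, hj in zip(out_weights, h))

print(rbf_forward([0.2, 0.4], [[0.0, 0.0], [1.0, 1.0]], [0.5, 0.5], [1.0, -1.0]))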
COMPARISON
BETWEEN BACK PROPAGATION, RBF AND SOM NETWORKS
The performance of several classes of neural networks, including back-propagation neural networks, radial basis function networks and self-organizing maps, is compared when they are applied to a set of genome signatures for the gene classification problem. The algorithms used were all adapted from ‘Neural Networks – A Comprehensive Foundation’ by Simon Haykin, 2nd Edition, Prentice-Hall, Inc.
DATA SETS
Genome
sequences are composed of four kinds of nucleotides: Adenine, Cytosine, Guanine
and Thymine (referred to as A, C, G, T). A segment of consecutive nucleotides is called a motif. For example, the sequence segment ‘ATC’ is a motif of length three. Given a motif length, the frequencies of the motifs
compose the profile of a gene sequence, which is also called the genomic
signature of the sequence. A sequence is divided into subsequences (windows) of
a fixed size. Genomic signatures are built from each window. It has been shown
that the “intergenomic differences” of signatures are generally higher than “intragenomic
differences”. Consequently, features of signatures from a given genome can be
learned with a properly designed neural network. In this set of experiments, from “Practice of Neural Networks for Genome Signature Analysis” by Liangyou Chen and Lois C. Boggess, a fixed window size of 100 bases was used to divide the genome sequences. A motif length of three nucleotides is used for generation of signatures for each of the 100-base sequence windows. An example signature is captured in the following table:
Motif    Frequency
AAA      10
AAC      3
AAG      5
...      ...
TTT      13
The three-base motifs consist of three characters from A, T, C, and G, so there are 4^3 = 64 different motifs possible. Each possible motif corresponds to one input feature for our networks. As a result, we need input vectors of length 64 for the networks. Each feature in the vector represents the frequency of the corresponding
motif in a 100-base sequence window. The first unit of the input vector is the
frequency for the motif AAA, the second for AAC, and so on. A subset of
nonoverlapping windows is randomly selected from known genomes as the training
data sets. The data consists of three parts: a signature ID, the input feature
vector, and the expected output. The signature ID reflects the genome that the
input vector is generated from and position in that genome. For the above
signature table, the input vector would be <10, 3, 5, …, 13>.
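A sketch of how such a signature vector could be generated for one window is given below; the helper names are hypothetical, and only the motif ordering (AAA, AAC, ..., TTT) follows the description above.

from itertools import product

MOTIFS = ["".join(p) for p in product("ACGT", repeat=3)]   # 4**3 = 64 motifs: AAA, AAC, ..., TTT

def signature(window):
    """Count the 64 three-base motifs in one 100-base sequence window."""
    counts = dict.fromkeys(MOTIFS, 0)
    for i in range(len(window) - 2):
        motif = window[i:i + 3]
        if motif in counts:
            counts[motif] += 1
    return [counts[m] for m in MOTIFS]

def windows(genome, size=100):
    """Split a genome string into non-overlapping fixed-size windows."""
    return [genome[i:i + size] for i in range(0, len(genome) - size + 1, size)]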
The
experiment involved two sets of data. The first set of data is a two-genome classification dataset, generated from the genomes of Aeropyrum pernix (Accession ID: BA000002)
and Mesorhizobium loti (Accession ID: NC_002678). This set of data is referred
to as Dataset 1. This is the basis for an experiment using back-propagation
neural networks, radial-basis function networks and self-organizing maps.
Dataset
1 includes a training dataset with 1000 signatures from Aeropyrum pernix and
1000 signatures from Mesorhizobium loti, and a test dataset with 1000
signatures from Aeropyrum pernix and 1000 signatures from Mesorhizobium loti.
The signatures of the training dataset are taken from the first non-overlapping
1000 windows of each genome, and signatures of the testing set are taken from
the following 1000 windows of each genome. If a signature is from the Aeropyrum
genome, the expected outputs are ‘1 0’. Otherwise, the outputs are ‘0 1’.
BACK-PROPAGATION NEURAL NETWORK
The
implementation of the back-propagation neural network is based on the algorithm
described in (Haykin, 1999). The update of the weights in the network is done
sequentially, that is, after presentation of each training instance. The two
thousand training instances from Dataset 1 were randomly reordered for each
epoch of training. In our experiments, the back-propagation network is the
quickest algorithm to train. After experimenting with the number of units in the hidden layer, it was determined that even a single unit in the hidden layer could on average achieve an error of 2.61%, and hidden layers with 2, 3, 5, 7, and 9 units averaged errors from 2.52% to 2.56% over multiple runs.
Radial-basis
function network
A
modified version of the radial-basis function network algorithm by (Haykin,
1999) was used: Self-organized selection for clustering centers was used to
identify radial-basis centers automatically. The first phase of training was to
randomly select a number of input vectors from the training dataset as initial
centers and use the k-means clustering algorithm to adjust the centers. The
second phase was a supervised learning stage to compute the linear weights of
the output layer. The radial-basis function centered at ti is defined as

G(||x - ti||) = exp( -||x - ti||^2 / (dmin/α)^2 )

where dmin is the minimum distance between the centers and α is a scalar to adjust dmin. The linear weights of the output layer were trained as single-layer perceptrons.
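The radial-basis activation used in this experiment could be sketched as follows; the centre selection (k-means) and the output-layer perceptron training are omitted, and the default α value is only illustrative.

import math

def rbf_activations(x, centres, alpha=1.2):
    """Hidden-layer activations G(||x - ti||) = exp(-||x - ti||^2 / (dmin/alpha)^2),
    where dmin is the minimum distance between any two centres."""
    dmin = min(math.dist(a, b)
               for i, a in enumerate(centres)
               for b in centres[i + 1:])
    width = dmin / alpha
    return [math.exp(-math.dist(x, t) ** 2 / width ** 2) for t in centres]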
Results from varying α on Dataset 1 for 200 centers:

α            0.5     0.7     0.8     0.9     1.0     1.1
Error rate   8.31%   7.54%   7.47%   7.69%   8.16%   8.26%

Test results for 1000 centers, averaged over five runs:

α            0.5     1.0     1.1     1.2     1.3     1.4     1.6
Error rate   10.13%  6.24%   6.25%   6.01%   6.04%   6.18%   7.74%
Self
organizing map
The
self-organizing map network is again based on an adaptation of (Haykin, 1999). With a simple majority voting method, the classification of output units in the map can be decided. The training data and the desired outputs are passed through the network. Each unit in the output map keeps a count of the classes of genomes
that activate it. After training, the output unit is labeled as belonging to
the class which has most frequently activated it, and this labeling is used for
classification purposes during testing.
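A sketch of this majority-vote labelling step is shown below; winner_fn, which maps an input signature to the index of its winning map unit, is a hypothetical helper standing in for the trained SOM.

from collections import Counter, defaultdict

def label_units(signatures, classes, winner_fn):
    """Label each output unit with the genome class that most often activates it."""
    votes = defaultdict(Counter)
    for x, c in zip(signatures, classes):
        votes[winner_fn(x)][c] += 1            # each unit counts the classes activating it
    return {unit: counter.most_common(1)[0][0] for unit, counter in votes.items()}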
The
network was trained with Dataset 1. We found a minimum test error rate of 3.55%
with a 4x4 map.
Test results of self-organizing map, averaged over five runs (minimum testing errors):

Map size         2x2    3x3     4x4     5x5     6x6     7x7     10x10
Average errors   80     75.4    71      83.6    81.8    82.8    79.6
Error rate       4%     3.77%   3.55%   4.18%   4.09%   4.14%   3.98%
APPLICATIONS
The
major topics of genome informatics research include gene recognition,
functional analysis, structural determination and family classification, for
the identification of genes, and the understanding of function, structure, and
evolution of gene products.
NUCLEIC ACID SEQUENCE ANALYSIS
Some
of the neural network applications for nucleic acid sequence analysis are
summarised in Table 5.1, with an overview of their neural network designs and
input-output encoding features. The applications can be divided into three
sections. The coding region recognition
and gene identification problem is tackled by two complementary approaches,
gene search by signal and search by content. NN provide an attractive model in
which sequence features for both signals and content can be combined and
weighted to improve predictive accuracy.
The
identification and analysis of other signals, binding sites or regulatory
sites, such as promoters, ribosome-binding sites, and transcription initiation and termination sites, is also important for the study of gene
regulation and expression. NN allow the incorporation of both positive and
negative examples as well as he detection of higher – order and long-range
correlation, and are not based on the assumption of positional independence. As
a result they are favoured in comparison to other methods in many studies.
Other
applications of neural networks are sequence classification and feature
extraction. Detecting significant sequence features and understanding biological
rules that govern gene structure and gene regulation are important problems
that can be addressed by NN. Both supervised and unsupervised NN can be used as
effective feature detectors. Features can be detected from within clusters of a
self-organising map or from hidden units of a back-propagation multilayer
perceptron.
PROTEIN STRUCTURE PREDICTION
Protein secondary structure prediction is often the first step toward understanding and predicting tertiary structure, because secondary structure elements constitute the building blocks of the folding units. NN have also been applied to protein
tertiary structure prediction, such as the prediction of the backbones or side-chain packing, and to
structural class prediction.
PROTEIN SEQUENCE ANALYSIS
Sequence
similarity database searching and protein sequence analysis constitute one of
the most important computational approaches to understanding protein structure
and function. Although most computational methods used for nucleic acid
sequence analysis are also applicable to protein sequence studies, how to
capture the enriched features of amino acid alphabets poses a special challenge
for protein analysis.
Conclusion
Artificial
neural networks thus have paved the way for automatic analysis of biological
data. The simultaneous analysis of millions of genes at a time has driven the
new field of computational biology—Genome informatics.
Application
of artificial neural networks in Genome informatics has great significance in
this arena. The artificial neural network attains its knowledge through the
process of learning, which is similar to its inspiration---the Human brain.
But, like all clouds with a silver lining, genomic engineering can be misused too, which can be a real threat to mankind. Let us hope that man puts his brain to the welfare of his fellow beings and not against them.