ABSTRACT
With the dawn of the genome era computational methods in the automatic analysis of biological data have become increasingly important. The explosion in the production rate of expression level data has highlighted the need for automated techniques that help scientists analyze, understand, and cluster (sequence) the enormous amount of data that is being produced. Example of such problems are analyzing gene expression level data produced by microarray technology on the genomic scale, sequencing genes on the genomic scale, sequencing proteins and amino acids, etc. Researchers have recognised Artificial Neural Networks as a promising technique that can be applied to several problems in genome informatics and molecular sequence analysis. This seminar explains how Neural Networks have been widely employed in genome informatics research.
INTRODUCTION
Recent developments in the genomics arena have resulted in techniques that can produce large amounts of expression level data. One such technique is microarray technology, which relies on the hybridization properties of nucleic acids to monitor DNA or RNA abundance on a genomic scale. Microarrays have revolutionized the study of the genome by allowing researchers to study the expression of thousands of genes simultaneously for the first time. This is predicted to be essential to understanding the role of genes in various biological functions.
Discovering patterns of gene expression is closely related to correlating sequences of genes to specific biological functions, and thereby to understanding the role of genes in biological functions on a genomic scale. The ability to simultaneously study thousands of genes under a host of differing conditions presents an immense challenge to the fields of computational science and data mining. New computational and data mining techniques need to be developed in order to properly comprehend and interpret expression data with the above goal in mind.
ANNs are used as a solution to various problems; however, their success as an intelligent pattern recognition methodology has been most prominently advertised. The most important, and attractive, feature of ANNs is their capability of learning (generalizing) from examples, that is, of extracting knowledge from data. ANNs can do this without any prespecified rules that define intelligence or represent an expert’s knowledge. This feature makes ANNs a very popular choice for gene expression analysis and sequencing. Due to their power and flexibility, ANNs have even been used as tools for relevant variable selection, which can in turn greatly increase the expert’s knowledge and understanding of the problem.
ARTIFICIAL NEURAL NETWORKS
An
artificial neural network is a parallel computational model composed of densely
interconnected adaptive processing elements called neurons. It is an
information-processing system that has certain performance characteristics in common with biological neural networks. It resembles the brain in that
knowledge is acquired by the network through a learning process and that the
interconnection strengths known as synaptic weights are used to store the
knowledge.
A
neural network is characterised by its pattern of connections between the
neurons referred to as network architecture and its method of determining the
weights on the connections called training or learning algorithm. The weights
are adjusted on the basis of data. In other words, neural networks learn from
examples and exhibit some capability for generalisation beyond the training
data. This feature makes such computational models very appealing in
application domains where one has little or incomplete understanding of the
problem to be solved, but where training data is readily available. Neural
networks normally have great potential for parallelism, since the computations of the components are largely independent of each other.
Artificial
neural networks are viable computational models for a wide variety of problems.
Already, useful applications have been designed, built, and commercialised for
various areas in engineering, business and biology. These include pattern
classification, speech synthesis and recognition, adaptive interfaces between
human and complex physical systems, function approximation, image compression,
associative memory, clustering, forecasting and prediction, combinatorial
optimisation, nonlinear system modelling, and control. Although they may have
been inspired by neuroscience, the majority of the networks have close
relevance or counterparts to traditional statistical methods such as
non-parametric pattern classifiers, clustering algorithms, nonlinear filters, and statistical regression models.
Benefits of Neural Networks
It
is apparent that a neural network derives its computational power through,
first, its massively parallel distributed structure and, second, its ability to
learn and therefore generalise. Generalisation refers to the neural network
producing reasonable outputs for inputs not encountered during training. These
two information-processing capabilities make it possible for neural networks to
solve complex problems that are currently intractable. A complex problem of
interest is decomposed into a number of relatively simple tasks, and neural
networks are assigned a subset of tasks that match their inherent capabilities.
The use of neural networks offers the following useful properties and
capabilities.
- Nonlinearity: An artificial neuron can be linear or nonlinear. A neural network made up of an interconnection of nonlinear neurons is itself nonlinear. Nonlinearity is a highly important property, particularly if the underlying physical mechanism responsible for generation of the input signal (e.g., a speech signal) is inherently nonlinear.
- Input-Output Mapping: A popular paradigm of learning called learning with a teacher, or supervised learning, involves modification of the synaptic weights of a neural network by applying a set of labelled training samples or task examples. Each example consists of a unique input signal and a corresponding desired response. The network is presented with an example picked at random from the set, and the synaptic weights of the network are modified to minimise the difference between the desired response and the actual response of the network produced by the input signal, in accordance with an appropriate statistical criterion. Thus the network learns from the examples by constructing an input-output mapping for the problem at hand.
- Adaptivity: Neural networks have a built-in capability to adapt their synaptic weights to changes in the surrounding environment. In particular, a neural network trained to operate in a specific environment can easily be retrained to deal with minor changes in the operating environmental conditions. Moreover, when it is operating in a nonstationary environment, a neural network can be designed to change its synaptic weights in real time.
- Evidential Response: In the context of pattern classification, a neural network can be designed to provide information not only about which pattern to select, but also about the confidence in the decision made. This latter information may be used to reject ambiguous patterns, should they arise, thereby improving the classification performance of the network.
- Contextual Information: Knowledge is represented by the very structure and activation state of a neural network. Every neuron in the network is potentially affected by the global activity of all other neurons in the network. Consequently, contextual information is dealt with naturally by a neural network.
- Fault Tolerance: A neural network implemented in hardware form has the potential to be inherently fault tolerant, or capable of robust computation, in the sense that its performance degrades gracefully under adverse operating conditions. This can be attributed to the massively distributed nature of the information stored in the network. Thus, in principle, a neural network exhibits a graceful degradation in performance rather than catastrophic failure.
- VLSI Implementability: The massively parallel nature of a neural network makes it potentially fast for the computation of certain tasks. This same feature makes a neural network well suited for implementation using very-large-scale integration (VLSI) technology. One particularly beneficial virtue of VLSI is that it provides a means of capturing truly complex behaviour in a highly hierarchical fashion.
- Neurobiological Analogy: The design of a neural network is motivated by analogy with the brain, which is living proof that fault-tolerant parallel processing is not only physically possible but also fast and powerful.
NEURAL NETWORK FOUNDATIONS
Artificial
Neural networks (ANNs) belong to the adaptive class of techniques in the machine
learning arena. ANNs are used as a solution to various problems; however,
their success as an intelligent pattern recognition methodology has been most
prominently advertised. Most models of ANNs are organized in the form of a
number of processing units called artificial neurons, or simply neurons, and a
number of weighted connections referred to as artificial synapses between the
neurons. The process of building an ANN, similar to its biological inspiration,
involves a learning episode. During the learning episode, the network observes a
sequence of recorded data, and adjusts the strength of its synapses according
to a learning algorithm and based on the observed data. The process of
adjusting the synaptic strengths in order to be able to accomplish a certain
task, much like the brain, is called “learning”. Learning algorithms are
generally divided into two types, supervised and unsupervised. The supervised
algorithms require labeled training data. In other words, they require more a
priori knowledge about the training set.
MODELS OF A NEURON
A neuron is an information-processing unit that is fundamental to the operation of a neural network. Fig. 2.1a shows the model of a neuron, which forms the basis for designing artificial neural networks. The three basic elements of the neuronal model are:
1. A set of synapses or connecting links, each of which is characterized by a weight or strength of its own. Specifically, a signal xj at the input of synapse j connected to neuron k is multiplied by the synaptic weight wjk.
2. An adder for summing the input signals, weighted by the respective synapses of the neuron; the operations described here constitute a linear combiner.
3. An activation function for limiting the amplitude of the output of a neuron. The activation function is also referred to as a squashing function, as it squashes (limits) the permissible amplitude range of the output signal to some finite value.
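As an illustration of these three elements, the following minimal Python sketch computes the output of a single neuron; the sigmoid activation and the example values are illustrative assumptions, not taken from any particular system described here.

import math

def neuron_output(x, w, bias=0.0):
    """Output of a single artificial neuron (illustrative sketch)."""
    # 1. Synapses: each input signal xj is multiplied by its synaptic weight.
    # 2. Adder: the weighted inputs are summed (the linear combiner).
    v = sum(wj * xj for wj, xj in zip(w, x)) + bias
    # 3. Activation (squashing) function: a logistic sigmoid that limits
    #    the output amplitude to the range (0, 1).
    return 1.0 / (1.0 + math.exp(-v))

# Example: a neuron with three inputs
print(neuron_output([0.5, -1.0, 2.0], [0.8, 0.2, -0.5]))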
NETWORK ARCHITECTURES
The
manner in which the neurons of a neural network are structured is intimately
linked with the learning algorithm used to train the network. Network structure
can be broadly divided into three classes of network architectures.
Single-Layer Feedforward
Networks
In
a layered neural network the neurons are organised in the form of layers. The
simplest form of a layered network consists of an input layer of source nodes
that projects onto an output layer of neurons, but not vice-versa. In other
words, this network is strictly a feedforward or acyclic type.
Such
a network is called a single-layer network, with the designation “single layer”
referring to the output layer of computational nodes. The input layer of source nodes is not counted, as no computation is performed there.
Multilayer Feedforward Networks
The
second class of a feedforward neural network distinguishes itself by the
presence of one or more hidden layers, whose computation nodes are
correspondingly called hidden neurons or hidden units. The function of hidden
neurons is to intervene between the external input and the network output in
some useful manner. By adding one or more hidden layers, the network is enabled
to extract higher-order statistics.
In
a rather loose sense, the network acquires a global perspective despite its
local connections due to the extra set of synaptic connections and extra
dimension of neural interactions. The ability of hidden neurons to extract
higher-order statistics is particularly valuable when the size of the input is
large.
Recurrent Networks
A recurrent network distinguishes itself from a feedforward neural network in that it has at least one feedback loop. For example, a recurrent network may consist of a single layer of neurons with each neuron feeding its output signal back to the inputs of all the other neurons, as illustrated in the architectural graph. The presence of feedback loops has a profound impact on the learning capability of the network and its performance. Moreover, the feedback loops involve the use of particular branches composed of unit-delay elements (denoted by z^-1), which result in nonlinear dynamical behaviour, assuming that the neural network contains nonlinear units.
TRAINING OF NEURAL NETWORKS
Neural
networks are models that may be used to approximate, summarise, classify,
generalise or otherwise represent real situations. Before models can be used
they have to be trained, or made to ‘fit’, to the representative data. The
model parameters, e.g., number of layers, number of units in each layer and
weights of the connections between them, must be determined. In ordinary
statistical terms this is called regression. There are two fundamental types of
training with neural networks: supervised and unsupervised learning. For
supervised training, as in regression, data used for the training consists of
independent variables (also referred to as feature variables or predictor
variables) and dependent variables (target values). The independent variables
(input to the neural network) are used to predict the dependent variables
(output from the network). Unsupervised training does not have dependent
(target) values supplied: the network is supposed to cluster the data
automatically into meaningful sets.
The
fundamental idea behind training, for all neural networks, is to pick a set of weights, apply the inputs to the network, and gauge the network’s performance with this set of weights. If the network does not perform
well, then the weights are modified by an algorithm specific to each
architecture and the procedure is then repeated. This iterative process
continues until some pre-specified criterion has been achieved. A training pass
through all vectors of the input data is called an epoch. Iterative changes can
be made to the weights with each input vector, or changes can be made after all
input vectors have been processed. Typically, weights are iteratively modified
by epochs.
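The following Python sketch illustrates this iterative, epoch-wise idea for the simplest possible case, a single linear unit trained with a delta-rule style update; the learning rate, the stopping tolerance and the toy data are assumptions made purely for illustration.

def train(inputs, targets, epochs=100, lr=0.05, tol=1e-6):
    """Minimal supervised training loop: pick weights, gauge performance,
    adjust, and repeat for a number of epochs (passes over the data)."""
    w = [0.0] * len(inputs[0])                 # initial set of weights
    for epoch in range(epochs):                # one epoch = one pass over all input vectors
        total_error = 0.0
        for x, t in zip(inputs, targets):
            y = sum(wi * xi for wi, xi in zip(w, x))   # network response
            err = t - y                                # gauge performance
            total_error += err * err
            # modify the weights after each input vector (per-pattern update)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
        if total_error < tol:                  # pre-specified stopping criterion
            break
    return w

# Toy example: learn the mapping y = 2*x1 - x2
print(train([[1, 0], [0, 1], [1, 1], [2, 1]], [2, -1, 1, 3]))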
DESIGN ISSUES
Applications
of neural networks to genome informatics share several similar design issues
that affect system performance and integrity. The major issues include the
input-output encoding, the neural network design, the training and prediction
data compilation and system evaluation mechanism. The first step in developing
a neural network system is problem definition, in other words determining the
desired input-output mapping for the task. The design of a complete system then
involves the choices for a pre-processor and a post-processor, in addition to
the neural network model itself. Data pre-processing is an integral and crucial
part of any neural network system design. Pre-processing involves both feature
representation and input coding. The selection of appropriate feature
representation is probably one of the most significant factors determining the
performance of the final system because it determines what information is
presented to the neural networks. Feature representation is critical to
molecular sequence analysis for two reasons: (1) this is how a priori knowledge about the molecular structures and functions can be utilised to improve accuracy, and (2) it allows extraction of salient features that are characteristic of given sequences.
FEATURE REPRESENTATION
The
amino acid and protein features can be represented in different ways to
maximise information extraction. They may be represented as real-numbered
measurements in a continuous scale, such as mass or hydrophobicity, or as
vectors of distances or frequencies. They can also be conveniently organised
into classes based on these properties. This effectively reduces the original
20-letter amino acid alphabet set to an alternative alphabet set of smaller
sizes and emphasizes the various properties of the molecular residues and
maximises feature extraction. Using the hydrophobicity scale as an example, the
amino acid residues can be represented by the real-numbered scalar value. They can
also be classified with respect to their side chains as polar, nonpolar, or
amphipathic, depending on the range of the hydrophobicity in the scale, such as
in
Group 1: Hi >= (Hm + sd/2)
Group 2: (Hm - sd/2) <= Hi < (Hm + sd/2)
Group 3: Hi < (Hm - sd/2)

Hm = (1/20) ΣHi

where Hm is the mean value of the hydrophobicity over the 20 amino acids, sd is the standard deviation, and Hi is the hydrophobicity value of amino acid i in the scale.
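A small Python sketch of this grouping scheme is shown below; the Kyte-Doolittle hydrophobicity values are used purely as an illustrative scale, and any other 20-value scale could be substituted.

import statistics

# Illustrative hydrophobicity scale (Kyte-Doolittle values)
H = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
     'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
     'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
     'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2}

Hm = sum(H.values()) / 20            # mean hydrophobicity over the 20 amino acids
sd = statistics.pstdev(H.values())   # standard deviation of the scale

def hydro_group(aa):
    """Map an amino acid to group 1, 2 or 3 by its hydrophobicity."""
    h = H[aa]
    if h >= Hm + sd / 2:
        return 1
    elif h >= Hm - sd / 2:
        return 2
    return 3

print({aa: hydro_group(aa) for aa in sorted(H)})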
DATA ENCODING
Once
a desired input-output mapping task is determined, the design of a complete
system then involves the choices for a pre-processor and a post-processor, in
addition to the neural network model itself. For molecular applications, the
input sequence encoding method converts molecular sequences into input vectors
of the neural networks. Likewise, an output encoding method is used in the
post-processing step to convert the neural network output to desired forms.
Molecular
applications involve the processing of individual residues, sequence windows of
n consecutive residues, or the complete sequence string. Correspondingly, the coding may be local, involving only single or neighbouring residues in a short sequence segment, or global, involving long-range relationships in a full-length sequence or long sequence segment. The sequence encoding methods may be
categorised as direct encoding or indirect encoding. Direct encoding converts
each individual residue to a vector, whereas indirect coding provides overall
information of a complete string. Direct encoding preserves positional
information, but can only deal with fixed-length sequence windows. On the other
hand, indirect encoding disregards the ordering information, but can be used
for sequences of either fixed or variable lengths.
Direct Input Sequence Encoding
In
direct encoding, each molecular residue can be represented by its identity or
features. The most commonly used method for direct encoding involves the use of
indicator vectors. The indicator vector
is a vector of binary numbers that has only one unit turned on to indicate the
identity of the corresponding residue.
Here, a vector of 4 units with 3 zeros and a single 1 is required for a
nucleotide, so the 4 nucleotides may be represented as 1000 (A), 0100 (T), 0010
(G), 0001 (C). The spacer residue may be represented as 0000, without an
additional unit. Likewise, a vector of 20 input units is needed to represent an amino acid, or a vector of 21 units may be used to include a unit for the spacer in regions between proteins. These binary representations are dubbed BIN4, BIN20 and BIN21. The lengths of their input vectors are 4 x n, 20 x n, and 21 x n, respectively, where n is the total number of nucleotide or amino acid residues in the sequence window. An alternative to the sparse encoding scheme of BIN4 is a dense representation that uses two units for the 4 nucleotides (e.g., 00 for A, 01 for T, 10 for G, and 11 for C). A comparative study, however, showed that the BIN4 coding was better, possibly due to its unitary coding matrix with identical Hamming distance between each pair of vectors.
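A minimal sketch of the BIN4 direct encoding might look as follows; the window string and the dash used for the spacer are illustrative choices.

# Indicator vectors for the four nucleotides (BIN4); the spacer is all zeros.
BIN4 = {'A': [1, 0, 0, 0], 'T': [0, 1, 0, 0],
        'G': [0, 0, 1, 0], 'C': [0, 0, 0, 1],
        '-': [0, 0, 0, 0]}   # spacer, no additional unit needed

def encode_window(window):
    """Direct encoding of a fixed-length window into a vector of length 4 x n."""
    vec = []
    for base in window:
        vec.extend(BIN4[base])
    return vec

print(encode_window("ATG-C"))   # 5 residues -> an input vector of 20 units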
Indirect Input Sequence Encoding
The
direct sequence encoding methods preserve the order of residues along the
sequence string and encode primarily local information. They are, however, not
suitable when global sequence features or information content is more important
to the application, or when variable length sequences are to be analysed. A
commonly used indirect sequence encoding method is the n-gram hashing method,
which computes residue frequencies. Residue frequency is one of the most
effective global features for sequence determination. As shown in Figure 3.2a, alternative alphabets of various n-gram terms can be used to maximise information extraction. The final size of the input vector is M^n, where M is the alphabet size and n is the n-gram size.
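As a sketch, an n-gram frequency vector for a nucleotide sequence could be computed as below; normalising by the number of n-grams in the sequence is one simple choice among several.

from itertools import product

def ngram_vector(seq, n=2, alphabet="ACGT"):
    """Indirect encoding: frequencies of all M**n possible n-grams.
    The vector length is fixed regardless of the length of `seq`."""
    terms = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = {t: 0 for t in terms}
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        if gram in counts:
            counts[gram] += 1
    total = max(1, len(seq) - n + 1)
    return [counts[t] / total for t in terms]

print(len(ngram_vector("ATGCGCTA", n=2)))   # 4**2 = 16 input units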
Input Trimming
An
excessive number of input units not only results in poor generalisations, but
also slows network training. Reducing the number of input variables often leads to improved performance on a given data set, even though information is being
discarded. Thus one important role of pre-processing is to reduce the dimensionality
of the input data.
Output Encoding
The
choice of output encoding method is more straightforward, depending on the
neural network mapping desired. The discrimination usually involves a yes or no
answer that can be encoded using a single output unit (dense coding) or two
output units (sparse coding). The number of output units in the classifier is
usually the same as the number of classes, encoded as indicator vectors, such
as the number of secondary structure states of protein families. The value of
the output units can be used qualitatively (all-or-none, by comparing to a threshold value), or as a quantitative measure of confidence level or activity
level.
NETWORK DESIGN
While
designing a network for application in genome informatics, several factors have
to be taken into consideration. The type of network architecture and learning
algorithm chosen are closely linked to the type of application the network is
to be involved in. To optimise the neural network design, important choices
must be made for the selection of numerous parameters. Many of these are
internal parameters that need to be tuned with the help of experimental results
and experience with the specific application under study. The following
discussion focuses specifically on back-propagation design choices for the
learning rate, momentum term, activation function and termination criteria.
The
selection of a learning rate is important in finding the true global minimum of
the error distance. The convergence speed of back-propagation is directly
related to the learning parameter. Too small a learning rate will slow down
training progress, whereas too large a learning rate may simply produce
oscillations between relatively poor solutions.
A
momentum rate can be helpful in speeding convergence and avoiding local minima.
In back propagation with momentum, the weight change is in a direction that is
a combination of the current gradient and the previous gradient. Momentum
allows the net to make reasonably large weight adjustments as long as the
corrections are in the same general direction for several patterns, while using
a smaller learning rate to prevent a large response to the error from any one
training pattern. An activation function for a back propagation network should
satisfy several important characteristics. It should be continuous and
differentiable. For computational efficiency it should have a derivative that
is easy to compute.
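The momentum-augmented weight change described above can be sketched as follows; the learning rate and momentum values are typical illustrative settings, not recommendations from the text.

def momentum_update(weights, gradients, prev_deltas, lr=0.1, momentum=0.9):
    """One back-propagation weight change combining the current gradient
    with the previous weight change: delta(t) = -lr*grad + momentum*delta(t-1)."""
    new_weights, new_deltas = [], []
    for w, g, d_prev in zip(weights, gradients, prev_deltas):
        d = -lr * g + momentum * d_prev
        new_weights.append(w + d)
        new_deltas.append(d)
    return new_weights, new_deltas

# Example: two weights, their current error gradients, and the previous changes
print(momentum_update([0.5, -0.3], [0.02, -0.01], [0.0, 0.0]))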
The
back-propagation network cannot, in general, be shown to converge, nor are there well-defined stopping criteria. The training is usually stopped by a
user-determined threshold value (tolerance) for the error function, or a fixed
upper limit on the number of training iterations called epochs. The termination
can also be based on performance of a validation data set used to monitor
generalisation performance during learning and to terminate learning when there
is no more improvement.
NN ARCHITECTURES APPLIED IN GENOME INFORMATICS
Due
to their power and flexibility, ANNs have even been used as tools for relevant
variable selection, which can in turn greatly increase the expert’s knowledge
and understanding of the problem. There is a very large body of research that
has resulted in a large number of ANN architectures. However, this chapter discusses only some
of the types that have been used in sequencing.
Layered,
feed-forward, backpropagation neural networks
These
are a class of ANNs whose neurons are organized in layers. The layers are
normally fully connected, meaning that each neuron of a layer is connected to
each element of the next layer. However, self-organizing varieties also exist
in which a network starts either with a minimal number of synaptic connections
between the layers and adds to the number as training progresses (constructive),
or starts as a fully connected network and prunes connections based on the data
observed in training (destructive). Backpropagation is a learning
algorithm that, in its original version, belongs to the gradient descent optimization
methods. The combination of the backpropagation learning algorithm and feed-forward, layered networks provides the most popular type of ANN. These ANNs have been applied to virtually all pattern recognition problems, and are
typically the first networks tried on a new problem. The reason for this is the
simplicity of the algorithm, and the vast body of research that has studied
these networks. As such, in sequencing, many researchers have also used this
type of network as a first line of attack. In Methods in Enzymology: Computer Methods for Macromolecular Sequence Analysis, the author C. H. Wu mentions the development
of a system called gene classification artificial neural system
(GenCANS), which is based on a three layered, feed-forward backpropagation
network. GenCANS was designed to “classify new (unknown) sequences into
predefined (known) classes. It involves two steps, sequence encoding and
neural network classification, to map molecular sequences (input) into
gene families (output)”. The same type of network has been used to perform
rapid
searches for sequences of proteins.
Self Organizing Map
A
self-organizing map (SOM) is a type of neural network approach first proposed
by Kohonen. SOMs have been used as a divisive clustering approach in many areas
including genomics. Several groups have used SOMs to discover patterns in gene
expression data. Tamayo and colleagues use self-organizing maps to explore
patterns of gene expression generated using Affymetrix arrays, and provide the
GENECLUSTER implementation of SOMs. Tamayo and his colleagues explain the
implementation of SOM for expression level data as follows: “An SOM has
a set of nodes with a simple topology and a distance function on the nodes.
Nodes are iteratively mapped into k-dimensional “gene expression space” (in
which the ith coordinate represents the expression level in the ith
sample)”. A SOM assigns genes to a series of partitions on the basis of the
similarity of their expression vectors to reference vectors that are defined
for each partition. The summary of the basic SOM algorithm is perhaps best
described in Quackenbush’s review: “First, random vectors are constructed and
assigned to each partition. Second, a gene is picked at random and, using a
selected distance metric, the reference vector that is closest to the gene is
identified. Third, the reference vector is then adjusted so that it is more
similar to the vector of the assigned gene. The reference vectors that are nearby on the two-dimensional grid are also adjusted so that they are more similar to
the vector of the assigned gene. Fourth, steps 2 and 3 are iterated several
thousand times, decreasing the amount by which the reference vectors are
adjusted and increasing the stringency used to define closeness in each step.
As the process continues, the reference vectors converge to fixed values. Last,
the genes are mapped to the relevant partitions depending on the reference
vector to which they are most similar”.
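A minimal Python sketch of one iteration of this algorithm (steps 2 and 3 of the summary above) is given below; the Euclidean distance metric, the Gaussian neighbourhood influence and the fixed learning rate are simplifying assumptions.

import math, random

def som_step(genes, ref, grid, lr=0.1, radius=1.0):
    """One SOM iteration: pick a gene at random, find the closest reference
    vector, and adjust it (and its grid neighbours) towards the gene."""
    g = random.choice(genes)                     # step 2: pick a gene at random
    dists = [math.dist(g, r) for r in ref]       # distance to each reference vector
    best = dists.index(min(dists))               # closest reference vector
    for j, r in enumerate(ref):                  # step 3: adjust winner and neighbours
        d_grid = math.dist(grid[best], grid[j])  # distance on the 2-D grid
        if d_grid <= radius:
            influence = math.exp(-d_grid ** 2 / 2.0)
            ref[j] = [ri + lr * influence * (gi - ri) for ri, gi in zip(r, g)]
    return ref

Iterating this step several thousand times, while shrinking the learning rate and the neighbourhood radius, corresponds to step 4 of the summary.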
SOMs,
like the gene shaving approach, have the distinct advantage that they
allow a priori knowledge to be included in the clustering process. Tamayo
explains this and other advantages of the SOM approach as follows: “The SOM has
a number of features that make them particularly well suited to clustering and
analyzing gene expression patterns. They are ideally suited to exploratory data
analysis, allowing one to impose partial structure on the clusters (in contrast
to the rigid structure of hierarchical clustering, the strong prior hypotheses
used in Bayesian clustering, and the nonstructure of k-means clustering)
facilitating easy visualization and interpretation. SOMs have good
computational properties and are easy to implement, reasonably fast, and are
scalable to large data sets”.
The
most prominent disadvantage of the SOM approach is that it is difficult to know
when to stop the algorithm. If the map is allowed to grow indefinitely, the size of the SOM gradually increases to a point at which clearly different sets of
expression patterns are identified. Therefore, as with k-means clustering, the user
has to rely on some other source of information, such as PCA, to determine the
number of clusters that best represents the available data. For this reason,
Sasik and his colleagues believe that “SOM, as implemented by Tamayo, is
essentially a restricted version of k-means: Here, the k clusters are linked
by some arbitrary user-imposed topological constraints (e.g. a 3 x 2 grid), and
as such suffers from all of the problems mentioned above for k-means (and
more), except that the constraints expedites the optimization process”.
There
are many varieties of SOM, among which the self-organizing feature maps (SOFM)
should be mentioned. The growing cell
structure (GCS) is another derivative of SOFM. It is a self-organizing
and incremental (constructive) neural learning approach. The unsupervised
variety of GCS was used by Azuaje for discovery of gene expression patterns in
B-cell lymphoma. Using cDNA microarray expression data, GCS was able to
“identify normal and diffuse large B-cell lymphoma (DLBCL) patients.
Furthermore, it distinguishes patients with molecularly distinct types of DLBCL
without previous knowledge of those subclasses”.
Amino Acid Grouping Example
A self-organizing map was trained with 4 physical features of the amino acids: mass in daltons, K & D hydrophobicity, surface area and turn propensity. Mass was scaled to the unit interval. No other information was presented to the self-organising map. A 10 x 10 grid was used. After only 50 iterations, each amino acid was mapped into one of the 100 grid points; each intersection of lines in the table represents a grid point.
It can be seen that the groupings found automatically by the self-organising map resemble the exchange groups found by elaborate examinations of sequence conservation. The traditional groups are {H, R, K}, {D, E, N, Q} (these first two groups are often combined), {S, P, T, A, G}, {M, I, L, V}, {F, W, Y}, {C}. Only the F and C amino acids are out of place according to these groups. This is a remarkable achievement for a simple network, working unsupervised, with only four amino acid features as input data. This simple application illustrates the potential power of self-organising maps.
RADIAL BASIS FUNCTION
Networks based on radial basis functions have been developed to address some of the problems encountered with training MLPs: radial basis function networks are guaranteed to converge and training is much more rapid. Both are feed-forward networks with similar-looking diagrams and their applications are similar; however, the principles of action of radial basis function networks and the way they are trained are quite different from those of MLPs.
INTRODUCTION TO RBF
The
essence of the difference between the operation of radial basis function
networks and multilayer perceptrons can be seen in Fig. 4.3a. MLPs classify
data by the use of hyperplanes that divide the data space into discrete areas;
radial basis function networks cluster the data into a finite number of ellipsoid regions. Classification is then a matter of finding which ellipsoid is closest for a given test data point. The hidden units of an RBF network are not the same as those used in an MLP,
and the weights between input and hidden layer have different meanings.
Transfer functions typically used include the Gaussian function, spline
functions and various quadratic functions: they are all smooth functions, which
taper off as distance from a centre point increases.
For radial basis function networks, each hidden unit represents the centre of a cluster in the data space. Input to a hidden unit in an RBF is not the weighted sum of its inputs, but a distance measure: a measure of how far the input vector is from the centre of the basis function for that hidden unit. If x and mu_j are vectors, the Euclidean distance between them is

Dj = || x - mu_j ||

where x is an input vector and mu_j is the location vector, or centre, of the basis function for hidden node j. The hidden node then computes its output as a function of the distance between the input vector and its centre. For the Gaussian RBF the hidden unit output is

hj = exp( -Dj^2 / (2 * sigma_j^2) )

where Dj is the Euclidean distance between the input vector and the location vector for hidden unit j, hj is the output of hidden unit j, and sigma_j^2 is a measure of the size of cluster j (its variance).
So it can be seen that, computationally, an RBF has three separate actions:
1. Compute the distance from the input vector to each of the hidden units (each of which represents an ellipsoid region in data space);
2. Compute the output from each hidden unit (a function of the distance measure in step 1);
3. Finally, use the outputs of the hidden units to compute the network outputs with a simple weighted-sum function.
The only difficult part is finding values of mu_j and sigma_j for each hidden unit, and the weights between the hidden and output layers; that is, training the network. Training RBF networks is considerably faster than training multilayer perceptrons. On the other hand, once trained, the feed-forward process for MLPs is faster than for RBFs.
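These three actions can be sketched in a few lines of Python; the Gaussian hidden-unit function and the example values are illustrative assumptions.

import math

def rbf_forward(x, centres, sigmas, out_weights):
    """Forward pass of an RBF network following the three steps above."""
    # 1. Distance from the input vector to each hidden-unit centre
    D = [math.dist(x, mu) for mu in centres]
    # 2. Hidden-unit outputs: a Gaussian function of the distance
    h = [math.exp(-d ** 2 / (2 * s ** 2)) for d, s in zip(D, sigmas)]
    # 3. Network output: a simple weighted sum of the hidden outputs
    return sum(w * hj for w, hj in zip(out_weights, h))

print(rbf_forward([0.2, 0.4], [[0.0, 0.0], [1.0, 1.0]], [0.5, 0.5], [1.0, -1.0]))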
COMPARISON
BETWEEN BACK PROPAGATION, RBF AND SOM NETWORKS
The performance of several classes of neural networks, including back-propagation neural networks, radial basis function networks and self-organizing maps, is compared when they are applied to a set of genome signatures for the gene classification problem. The algorithms used were all adapted from ‘Neural Networks – A Comprehensive Foundation’ by Simon Haykin, 2nd Edition, Prentice-Hall, Inc.
DATA SETS
Genome
sequences are composed of four kinds of nucleotides: Adenine, Cytosine, Guanine
and Thymine (referred to as A, C, G, T). A segment of consecutive nucleotides is called a motif. For example, the sequence segment ‘ATC’ is a motif of length three. Given a motif length, the frequencies of the motifs
compose the profile of a gene sequence, which is also called the genomic
signature of the sequence. A sequence is divided into subsequences (windows) of
a fixed size. Genomic signatures are built from each window. It has been shown
that the “intergenomic differences” of signatures are generally higher than “intragenomic
differences”. Consequently, features of signatures from a given genome can be
learned with a properly designed neural network. In this set of experiments, from “Practice of Neural Networks for Genome Signature Analysis” by Liangyou Chen and Lois C. Boggess, a fixed window size of 100 bases was used to divide the genome sequences. A motif length of three nucleotides is used for generation of signatures for each of the 100-base sequence windows. An example signature is captured in the following table:
Motif    Frequency
AAA      10
AAC      3
AAG      5
...      ...
TTT      13
The three-base motifs consist of three characters from A, T, C, and G, so there are 4^3 = 64 different motifs possible. Each possible motif corresponds to one input feature for our networks. As a result, we need input vectors of length 64 for the networks. Each feature in the vector represents the frequency of the corresponding
motif in a 100-base sequence window. The first unit of the input vector is the
frequency for the motif AAA, the second for AAC, and so on. A subset of
nonoverlapping windows is randomly selected from known genomes as the training
data sets. The data consists of three parts: a signature ID, the input feature
vector, and the expected output. The signature ID reflects the genome that the
input vector is generated from and position in that genome. For the above
signature table, the input vector would be <10, 3, 5, …, 13>.
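A sketch of how such a signature vector could be generated for one window is given below; the helper names are hypothetical, and only the motif ordering (AAA, AAC, ..., TTT) follows the description above.

from itertools import product

MOTIFS = ["".join(p) for p in product("ACGT", repeat=3)]   # 4**3 = 64 motifs: AAA, AAC, ..., TTT

def signature(window):
    """Count the 64 three-base motifs in one 100-base sequence window."""
    counts = dict.fromkeys(MOTIFS, 0)
    for i in range(len(window) - 2):
        motif = window[i:i + 3]
        if motif in counts:
            counts[motif] += 1
    return [counts[m] for m in MOTIFS]

def windows(genome, size=100):
    """Split a genome string into non-overlapping fixed-size windows."""
    return [genome[i:i + size] for i in range(0, len(genome) - size + 1, size)]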
The
experiment involved two sets of data. The first set of data is a two-genome classification dataset, generated from the genomes of Aeropyrum pernix (Accession ID: BA000002)
and Mesorhizobium loti (Accession ID: NC_002678). This set of data is referred
to as Dataset 1. This is the basis for an experiment using back-propagation
neural networks, radial-basis function networks and self-organizing maps.
Dataset
1 includes a training dataset with 1000 signatures from Aeropyrum pernix and
1000 signatures from Mesorhizobium loti, and a test dataset with 1000
signatures from Aeropyrum pernix and 1000 signatures from Mesorhizobium loti.
The signatures of the training dataset are taken from the first non-overlapping
1000 windows of each genome, and signatures of the testing set are taken from
the following 1000 windows of each genome. If a signature is from the Aeropyrum
genome, the expected outputs are ‘1 0’. Otherwise, the outputs are ‘0 1’.
BACK-PROPAGATION NEURAL NETWORK
The
implementation of the back-propagation neural network is based on the algorithm
described in (Haykin, 1999). The update of the weights in the network is done
sequentially, that is, after presentation of each training instance. The two
thousand training instances from Dataset 1 were randomly reordered for each
epoch of training. In our experiments, the back-propagation network is the
quickest algorithm to train. After experimenting with the number of units in the hidden layer, it was determined that even a single unit in the hidden layer could on average achieve an error of 2.61%, and hidden layers with 2, 3, 5, 7, and 9 units averaged errors from 2.52% to 2.56% over multiple runs.
Radial-basis
function network
A
modified version of the radial-basis function network algorithm by (Haykin,
1999) was used: Self-organized selection for clustering centers was used to
identify radial-basis centers automatically. The first phase of training was to
randomly select a number of input vectors from the training dataset as initial
centers and use the k-means clustering algorithm to adjust the centers. The
second phase was a supervised learning stage to compute the linear weights of
the output layer. The radial-basis function centered at ti is defined as

G(||x - ti||) = exp( -||x - ti||^2 / (dmin/α)^2 )

where dmin is the minimum distance between the centers and α is a scalar to adjust dmin. The linear weights of the output layer were trained as single-layer perceptrons.
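The radial-basis activation used in this experiment could be sketched as follows; the centre selection (k-means) and the output-layer perceptron training are omitted, and the default α value is only illustrative.

import math

def rbf_activations(x, centres, alpha=1.2):
    """Hidden-layer activations G(||x - ti||) = exp(-||x - ti||^2 / (dmin/alpha)^2),
    where dmin is the minimum distance between any two centres."""
    dmin = min(math.dist(a, b)
               for i, a in enumerate(centres)
               for b in centres[i + 1:])
    width = dmin / alpha
    return [math.exp(-math.dist(x, t) ** 2 / width ** 2) for t in centres]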
Results from varying α on Dataset 1 for 200 centers:

α            0.5     0.7     0.8     0.9     1.0     1.1
Error rate   8.31%   7.54%   7.47%   7.69%   8.16%   8.26%

Test results for 1000 centers, averaged over five runs:

α            0.5     1.0     1.1     1.2     1.3     1.4     1.6
Error rate   10.13%  6.24%   6.25%   6.01%   6.04%   6.18%   7.74%
Self
organizing map
The
self-organizing map network is again based on an adaptation of (Haykin, 1999). With a simple majority voting method, the classification of output units in the map can be decided. The training data and the desired outputs are passed through the network. Each unit in the output map keeps a count of the classes of genomes
that activate it. After training, the output unit is labeled as belonging to
the class which has most frequently activated it, and this labeling is used for
classification purposes during testing.
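A sketch of this majority-vote labelling step is shown below; winner_fn, which maps an input signature to the index of its winning map unit, is a hypothetical helper standing in for the trained SOM.

from collections import Counter, defaultdict

def label_units(signatures, classes, winner_fn):
    """Label each output unit with the genome class that most often activates it."""
    votes = defaultdict(Counter)
    for x, c in zip(signatures, classes):
        votes[winner_fn(x)][c] += 1            # each unit counts the classes activating it
    return {unit: counter.most_common(1)[0][0] for unit, counter in votes.items()}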
The
network was trained with Dataset 1. We found a minimum test error rate of 3.55%
with a 4x4 map.
Test results of self-organizing map, averaged over five runs (minimum testing errors):

Map size         2x2    3x3     4x4     5x5     6x6     7x7     10x10
Average errors   80     75.4    71      83.6    81.8    82.8    79.6
Error rate       4%     3.77%   3.55%   4.18%   4.09%   4.14%   3.98%
APPLICATIONS
The
major topics of genome informatics research include gene recognition,
functional analysis, structural determination and family classification, for
the identification of genes, and the understanding of function, structure, and
evolution of gene products.
NUCLEIC ACID SEQUENCE ANALYSIS
Some
of the neural network applications for nucleic acid sequence analysis are
summarised in Table 5.1, with an overview of their neural network designs and
input-output encoding features. The applications can be divided into three
sections. The coding region recognition
and gene identification problem is tackled by two complementary approaches,
gene search by signal and search by content. NN provide an attractive model in
which sequence features for both signals and content can be combined and
weighted to improve predictive accuracy.
The
identification and analysis of other signals, binding sites or regulatory
sites, such as promoters, ribosome-binding sites, and transcription initiation and termination sites, is also important for the study of gene
regulation and expression. NN allow the incorporation of both positive and
negative examples as well as he detection of higher – order and long-range
correlation, and are not based on the assumption of positional independence. As
a result they are favoured in comparison to other methods in many studies.
Other
applications of neural networks are sequence classification and feature
extraction. Detecting significant sequence features and understanding biological
rules that govern gene structure and gene regulation are important problems
that can be addressed by NN. Both supervised and unsupervised NN can be used as
effective feature detectors. Features can be detected from within clusters of a
self-organising map or from hidden units of a back-propagation multilayer
perceptron.
PROTEIN STRUCTURE PREDICTION
Protein secondary structure prediction is often the first step toward understanding and predicting tertiary structure, because secondary structure elements constitute the building blocks of the folding units. NN have also been applied to protein
tertiary structure prediction, such as the prediction of the backbones or side-chain packing, and to
structural class prediction.
PROTEIN SEQUENCE ANALYSIS
Sequence
similarity database searching and protein sequence analysis constitute one of
the most important computational approaches to understanding protein structure
and function. Although most computational methods used for nucleic acid
sequence analysis are also applicable to protein sequence studies, how to
capture the enriched features of amino acid alphabets poses a special challenge
for protein analysis.
Conclusion
Artificial
neural networks thus have paved the way for automatic analysis of biological
data. The simultaneous analysis of millions of genes at a time has driven the
new field of computational biology—Genome informatics.
Application
of artificial neural networks in Genome informatics has great significance in
this arena. The artificial neural network attains its knowledge through the
process of learning, which is similar to its inspiration---the Human brain.
But, like all clouds with a silver lining, genomic engineering can be misused too, which can be a real threat to mankind. Let us hope that man puts his brain to the welfare of his fellow beings and not against them.