VOICE MORPHING - Full report

Voice morphing means the transition of one speech signal into another. Speech morphing is analogous to image morphing: in image morphing, the in-between images all show one face smoothly changing its shape and texture until it turns into the target face. It is this feature that a speech morph should possess. One speech signal should smoothly change into another, keeping the shared characteristics of the starting and ending signals but smoothly changing the other properties. The major properties of concern in a speech signal are its pitch and envelope information. These two reside in a convolved form in a speech signal, so an efficient method for extracting each of them is necessary. We have adopted an uncomplicated approach, namely cepstral analysis, to do so. Pitch and formant information in each signal is extracted using the cepstral approach. The processing needed to obtain the morphed speech signal includes cross-fading of envelope information, Dynamic Time Warping to match the major signal features (pitch), and signal re-estimation to convert the morphed speech signal back into an acoustic waveform.

This report has been subdivided into seven chapters. The second chapter gives a concise overview of the various processes involved in this project. The third chapter presents a thorough analysis of the procedure used to accomplish morphing, together with the necessary theory. Processes like preprocessing, cepstral analysis, dynamic time warping and signal re-estimation are described with the necessary diagrams. The fourth chapter gives a deeper insight into the actual morphing process. The conversion of the morphed signal into an acoustic waveform is dealt with in detail in the fifth chapter. Chapter six summarizes the whole morphing process with the help of a block diagram. Chapter seven lists the conclusions that have been drawn from this project.

An Introspection of the Morphing Process
We undertook this work because it sounded challenging and interesting. We were eager to know whether speech morphing would be feasible using the cepstral approach. Processes like cepstral analysis and the re-estimation of the morphed speech signal into an acoustic waveform involve much intricacy and challenge. The project also digs deep into the basics of digital signal processing, or rather speech processing, and covers a lot of ground in that field.

Speech morphing can be achieved by transforming the signal’s representation from the acoustic waveform obtained by sampling the analog signal, with which many people are familiar, to another representation. To prepare the signal for the transformation, it is split into a number of 'frames', which are sections of the waveform. The transformation is then applied to each frame of the signal. This provides another way of viewing the signal information. The new representation (said to be in the frequency domain) describes the average energy present in each frequency band.
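The framing and transformation steps can be sketched as follows. This is only a minimal illustration using the Python standard library: the function names, the Hann window, the frame length of 64 samples and the hop of 32 samples are all assumptions made for the example, not values taken from this report.

```python
import cmath
import math


def split_into_frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames and apply a Hann window to each."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames


def magnitude_spectrum(frame):
    """Naive O(N^2) DFT magnitude for bins 0..N/2 (enough for a real frame)."""
    size = len(frame)
    spectrum = []
    for k in range(size // 2 + 1):
        acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / size)
                  for n in range(size))
        spectrum.append(abs(acc))
    return spectrum


# Toy input: a 500 Hz sine sampled at 8000 Hz (both values are illustrative).
sample_rate = 8000
signal = [math.sin(2 * math.pi * 500 * t / sample_rate) for t in range(256)]

frames = split_into_frames(signal, frame_len=64, hop=32)
spectrogram = [magnitude_spectrum(frame) for frame in frames]

# Each bin spans sample_rate / frame_len = 125 Hz, so 500 Hz lands in bin 4.
peak_bin = max(range(len(spectrogram[0])), key=lambda k: spectrogram[0][k])
```

A practical implementation would use a fast Fourier transform rather than the direct sum shown here; the direct form is kept only because it makes the per-frame transformation easy to read.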

Further analysis enables two pieces of information to be obtained: pitch information and the overall envelope of the sound. A key element in the morphing is the manipulation of the pitch information. If two signals with different pitches were simply cross-faded, it is highly likely that two separate sounds would be heard. This occurs because the signal would have two distinct pitches, causing the auditory system to perceive two different objects. A successful morph must exhibit a smoothly changing pitch throughout. The pitch information of each sound is compared to provide the best match between the two signals' pitches. To perform this match, the signals are stretched and compressed so that important sections of each signal match in time. The interpolation of the two sounds can then be performed, which creates the intermediate sounds in the morph. The final stage is then to convert the frames back into a normal waveform.
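The stretching and compressing used to match the two pitch tracks can be illustrated with a textbook Dynamic Time Warping routine. The sketch below assumes each sound has already been reduced to one pitch value per frame and uses the absolute pitch difference as the local cost; the function name and the sample values are hypothetical.

```python
def dtw_path(contour_a, contour_b):
    """Return (total_cost, path), where path pairs frame indices of a and b."""
    n, m = len(contour_a), len(contour_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(contour_a[i - 1] - contour_b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],  # advance both
                                     cost[i - 1][j],      # stretch b
                                     cost[i][j - 1])      # stretch a
    # Walk back from the end to recover which frames were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Two illustrative pitch contours in Hz with different timing.
total, path = dtw_path([100, 100, 120, 140, 140], [100, 120, 120, 140])
```

Each pair in `path` names one frame from each sound that should be interpolated together, so a frame that appears in several pairs has effectively been stretched in time.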

However, after the morphing has been performed, the legacy of the earlier analysis becomes apparent. The conversion of the sound to a representation in which the pitch and spectral envelope can be separated loses some information. Therefore, this information has to be re-estimated for the morphed sound. This process obtains an acoustic waveform, which can then be stored or listened to.

Conclusions and Future scope
The approach we have adopted separates each sound into two forms: a spectral envelope representation and a pitch and voicing representation. These can then be independently modified. The pitch peaks are obtained from the pitch spectrograms to create a pitch contour for each sound. Dynamic Time Warping of these contours aligns the sounds with respect to their pitches. At each corresponding frame, the pitch, voicing and envelope information are separately morphed to produce a final morphed frame. These frames are then converted back into a time domain waveform using the signal re-estimation algorithm.
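As an illustration of the per-frame morphing step, the sketch below cross-fades the envelope and the pitch of frames that have already been aligned. The linear ramp of the morph factor from 0 to 1 is an assumption made for the example, voicing is left out for brevity, and the function name is hypothetical.

```python
def morph_frames(envelopes_a, envelopes_b, pitches_a, pitches_b):
    """Cross-fade aligned frames; returns a list of (envelope, pitch) pairs."""
    n_frames = len(envelopes_a)
    morphed = []
    for i in range(n_frames):
        # alpha is 0 at the first frame (all sound A) and 1 at the last (all sound B).
        alpha = i / (n_frames - 1) if n_frames > 1 else 0.0
        envelope = [(1 - alpha) * ea + alpha * eb
                    for ea, eb in zip(envelopes_a[i], envelopes_b[i])]
        pitch = (1 - alpha) * pitches_a[i] + alpha * pitches_b[i]
        morphed.append((envelope, pitch))
    return morphed


# Three aligned frames with two envelope bins each (illustrative numbers only).
result = morph_frames(
    [[0.0, 0.0]] * 3, [[2.0, 4.0]] * 3,
    [100.0, 100.0, 100.0], [200.0, 200.0, 200.0],
)
```

Because the envelope and the pitch are interpolated separately, the middle frame carries a single intermediate pitch rather than two competing ones.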

In order to reduce the number of cepstral slices to be processed, the window size and window shift were increased. However, the size of the window was still within the range needed to achieve the desired balance between frequency and time resolution. The quality of the morph is heavily influenced by the number of iterations used to re-estimate the sound. In the re-estimation section, the algorithm was tested by re-estimating a sound from an unprocessed magnitude DFT. In other words, no information was removed, intentionally or not, by further manipulation. This meant that if a large number of iterations were used then an almost perfect signal could be obtained. In speech morphing, a large amount of manipulation of the signal takes place and some loss of quality is inevitable. Therefore, fewer iterations were required before the sound converged to a point at which further iterations made negligible difference.
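The effect of the iteration count can be demonstrated with a small magnitude-only re-estimation loop. The report does not name its algorithm, so the sketch below assumes an iteration in the style of Griffin and Lim: keep the known magnitudes, invert, re-analyse, and adopt the newly estimated phase. The window, hop size, test signal and all function names are illustrative assumptions.

```python
import cmath
import math
import random


def dft(x):
    size = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / size) for n in range(size))
            for k in range(size)]


def idft_real(spec):
    size = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * n / size)
                for k in range(size)).real / size for n in range(size)]


def stft(x, window, hop):
    size = len(window)
    return [dft([x[s + n] * window[n] for n in range(size)])
            for s in range(0, len(x) - size + 1, hop)]


def istft(frames, window, hop, length):
    """Weighted overlap-add: the least-squares inverse of stft() above."""
    size = len(window)
    out = [0.0] * length
    norm = [0.0] * length
    for i, spec in enumerate(frames):
        segment = idft_real(spec)
        for n in range(size):
            out[i * hop + n] += segment[n] * window[n]
            norm[i * hop + n] += window[n] ** 2
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


def reestimate(magnitudes, window, hop, length, iterations, seed=0):
    """Rebuild a waveform from magnitudes only, starting from random phase."""
    rng = random.Random(seed)
    spec = [[m * cmath.exp(1j * rng.uniform(-math.pi, math.pi)) for m in row]
            for row in magnitudes]
    for _ in range(iterations):
        estimate = stft(istft(spec, window, hop, length), window, hop)
        # Keep the estimated phase, restore the known magnitude.
        spec = [[m * e / abs(e) if abs(e) > 1e-12 else m
                 for m, e in zip(m_row, e_row)]
                for m_row, e_row in zip(magnitudes, estimate)]
    return istft(spec, window, hop, length)


def spectral_error(signal, magnitudes, window, hop):
    """Squared difference between the signal's magnitudes and the target ones."""
    estimate = stft(signal, window, hop)
    return sum((abs(e) - m) ** 2
               for e_row, m_row in zip(estimate, magnitudes)
               for e, m in zip(e_row, m_row))


window = [0.5 - 0.5 * math.cos(2 * math.pi * n / 32) for n in range(32)]
target = [math.sin(2 * math.pi * n / 16) for n in range(128)]
magnitudes = [[abs(v) for v in row] for row in stft(target, window, 8)]

rough = reestimate(magnitudes, window, 8, len(target), iterations=0)
better = reestimate(magnitudes, window, 8, len(target), iterations=10)
```

Comparing `spectral_error(rough, ...)` with `spectral_error(better, ...)` shows how the mismatch shrinks as iterations are added. The direct DFT used here is slow by design; it is chosen for readability, not speed.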

The pitch contour extraction process is performed in a rather naïve manner. In order to smoothly morph the pitch information, the pitches of each signal need to be matched. To facilitate this, a pitch estimate for the entire signal, known as a pitch contour, is found. Dynamic Time Warping is then used to find the best match between the two pitch contours. In this work, the pitch contour is found from the cepstral domain. The position of the peak in each slice is found and these positions build up a pitch contour. Although the results are satisfactory, this method does not take into account two possibilities: the pitch may be absent or difficult to find in both frames, or one frame may have a pitch while the other does not. Unlike visual morphing, speech morphing can separate different aspects of the sound into independent dimensions. Those dimensions are time, pitch and voicing, and spectral envelope.
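A cepstral peak picker that also reports when no clear pitch is present might look like the sketch below. The search range of 50 to 400 Hz and the peak threshold of 0.1 are assumed values chosen for the example, and the function name is hypothetical; the report itself does not describe a voicing check.

```python
import cmath
import math


def cepstral_pitch(frame, sample_rate, f_min=50.0, f_max=400.0, threshold=0.1):
    """Return the pitch in Hz, or None when no clear cepstral peak exists."""
    size = len(frame)
    # Log magnitude spectrum; the small constant avoids log(0).
    log_mag = []
    for k in range(size):
        acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / size)
                  for n in range(size))
        log_mag.append(math.log(abs(acc) + 1e-10))

    # Only quefrencies that correspond to plausible pitch periods are searched.
    q_min = int(sample_rate / f_max)
    q_max = min(int(sample_rate / f_min), size // 2)
    best_q, best_value = None, threshold
    for q in range(q_min, q_max + 1):
        # Real cepstrum: inverse DFT of the (real, symmetric) log magnitude.
        value = sum(log_mag[k] * math.cos(2 * math.pi * k * q / size)
                    for k in range(size)) / size
        if value > best_value:
            best_q, best_value = q, value
    return None if best_q is None else sample_rate / best_q


# Toy "voiced" frame: an impulse plus a half-amplitude echo 40 samples later,
# which at 8000 Hz corresponds to a 200 Hz period.
voiced_frame = [0.0] * 256
voiced_frame[0], voiced_frame[40] = 1.0, 0.5

# Toy frame with no periodicity at all: a single click.
click_frame = [0.0] * 256
click_frame[0] = 1.0

voiced = cepstral_pitch(voiced_frame, 8000)
unvoiced = cepstral_pitch(click_frame, 8000)
```

Returning `None` for a frame without a clear peak lets the later matching stage treat "no pitch" as its own case instead of trusting an arbitrary peak position.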

There are a number of areas in which further work should be carried out in order to improve the technique described here and extend the field of speech morphing in general. The time required to generate a morph is dominated by the signal re-estimation process. Even a small number of iterations (for example, 2) takes a significant amount of time, even when re-estimating signals of approximately one second duration. Although speech morphing involves an inevitable loss of quality due to manipulation, so that fewer iterations are required, an improved re-estimation algorithm is still needed.

A number of the processes, such as the matching and signal re-estimation, are very unrefined and inefficient but do produce satisfactory morphs. Concentrating on the issues described above, both in further work and in extensions to the speech morphing principle, ought to produce systems which create extremely convincing and satisfying speech morphs.

In this project, only one type of morphing has been discussed: that in which the final morph has the same duration as the longest signal. We also discuss only the case of speech morphing, but the work can be extended to include audio sounds as well. However, according to the eventual use of the morph, a number of other types could be produced. The longest signal could be compressed so that the morph has the same duration as the shortest signal (the reverse of the approach described here). If one signal is significantly longer than the other, two possibilities arise:
1.    If the longer signal is the 'target' - the sound one wishes to morph to - then the morph would be performed between the start signal and the target's corresponding section (of equal duration) with the remainder of the target's signal unaffected.
2.    If the longer signal is the start signal then the morph would be performed over the duration of the shorter signal and the remainder of the start signal would be removed.

Further extension to this work to provide the above functionality would create a powerful and flexible morphing tool. Such a tool would allow the user to specify the points at which a morph was to start and finish, the properties of the morph, and also the matching function. With the increased user interaction in the process, a Graphical User Interface could be designed and integrated to make the package more 'user-friendly'. Such an improvement would provide immediate visual feedback (which is lacking in the current implementation) and possibly step by step guidance. Finally, this work has used spectrograms as the pitch and voicing and spectral envelope representations. Although these are effective, further work ought to concentrate on new representations which enable further separation of information. For example, a new representation might allow the separation of the pitch and voicing.

Pitch is not the only time-varying property which can be used to morph between two sounds. If the underlying rhythm of a sound is important then this ought to be used as the matching function between the two sounds. A better approach still may be to combine two or more matching functions in order to achieve a more pleasing morph. The algorithm presented in this project is prone to excessive stretching of the time axis in order to achieve a match between the two pitch contours. The use of a combined rhythm and pitch matching function could limit this unwanted warping.

Further, the weighting of each component in the matching function could be determined according to requirements allowing heavily rhythm-biased matches or heavily pitch-biased matches.
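One way such a weighted matching function could be written is sketched below. It is purely illustrative: the report does not define a rhythm feature, so `rhythm_a` and `rhythm_b` stand for any per-frame rhythm measure, and both features are assumed to be scaled to the range 0 to 1 beforehand so that the weight is meaningful.

```python
def combined_cost(pitch_a, pitch_b, rhythm_a, rhythm_b, pitch_weight=0.5):
    """Local cost between one frame of each sound.

    pitch_* and rhythm_* are assumed to be pre-normalised to 0..1 so that
    neither term dominates purely because of its units.
    """
    if not 0.0 <= pitch_weight <= 1.0:
        raise ValueError("pitch_weight must lie between 0 and 1")
    pitch_term = abs(pitch_a - pitch_b)
    rhythm_term = abs(rhythm_a - rhythm_b)
    return pitch_weight * pitch_term + (1.0 - pitch_weight) * rhythm_term


# The same pair of frames scored under three different biases.
pitch_biased = combined_cost(0.25, 0.75, 1.0, 0.0, pitch_weight=1.0)
rhythm_biased = combined_cost(0.25, 0.75, 1.0, 0.0, pitch_weight=0.0)
balanced = combined_cost(0.25, 0.75, 1.0, 0.0, pitch_weight=0.5)
```

Used as the local cost inside a time-warping routine, a weight near 1 gives a heavily pitch-biased match and a weight near 0 gives a heavily rhythm-biased one.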

The speech morphing concept can be extended to include audio sounds in general. This area offers many possible applications including sound synthesis. For example, there are two major methods for synthesizing musical notes. One is to digitally model the sound's physical source and provide a number of parameters in order to produce a synthetic note of the desired pitch. Another is to take two notes which bound the desired note and use the principles of speech morphing to manufacture a note which contains the shared characteristics of the bounding notes but whose other properties have been altered to form a new note. The use of pitch manipulation within the algorithm also has an interesting potential use. In the interests of security, it is sometimes necessary for people to disguise the identity of their voice. An interesting way of doing this is to alter the pitch of the sound in real time using sophisticated methods.
