Automated Voice Quality Testing for VoIP Quality of Service Solutions

Automated sound signals quality estimation

Sevana Oy, 2009
http://www.sevana.fi
1. INTRODUCTION
Sound signal quality estimation is becoming increasingly important with the spread of mobile communications, synthetic telephony systems, VoIP, and various portable sound recording and reproducing devices. This naturally creates a need for a method that provides objective estimation (i.e. independent of the judgement of any particular person) and that can be automated. Such a method is valuable both for comparing competing commercial products and for optimising the parameters of proprietary products.
One of the main parameters of systems that compress, transfer and reproduce sound information is the quality of the restored, received or reproduced sound.
Quantitative measurement of sound quality has specific features, because the final receiver of a sound signal is always a human, and a human is also the source of most sound signals. Sound signal quality is therefore determined not only by the technical characteristics of the sound processing and transfer systems, but also by individual peculiarities of speech perception and production, which vary over time and from individual to individual.

2. REVIEW OF QUALITY ESTIMATION METHODS
Methods of measuring speech quality are divided into subjective and objective ones. Subjective methods include a person's hearing as a component of the measuring system. Objective methods, in contrast, exclude human hearing from the measurement process.
The most widespread subjective measure of speech quality is MOS (Mean Opinion Score), an estimation on a five-point scale.
This estimation is obtained by processing the scores that groups of listeners give to sequences of sound signals reproduced by various audio systems. Each listener rates each signal, and the results are then averaged.
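The averaging step can be sketched as follows. This is only an illustration: the function name and the sample ratings are hypothetical, and real listening tests follow a much stricter protocol than a plain mean.

```python
from statistics import mean

def mean_opinion_score(ratings_by_listener):
    """Average five-point ratings per signal, then over all signals.

    ratings_by_listener[k][i] is the score (1..5) that listener k gave to signal i.
    """
    n_signals = len(ratings_by_listener[0])
    per_signal = [
        mean(listener[i] for listener in ratings_by_listener)
        for i in range(n_signals)
    ]
    return per_signal, mean(per_signal)

# Hypothetical scores: three listeners, three signals.
ratings = [
    [4, 3, 5],
    [5, 3, 4],
    [4, 2, 4],
]
per_signal, overall = mean_opinion_score(ratings)
print(per_signal, overall)
```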
Organizing and carrying out subjective estimation is difficult, time-consuming and expensive. Research has therefore been conducted to find objective methods that give fast, automated estimations corresponding well to the results of subjective tests.
There are various automatic estimation methods; some of them are described below [1]:
AI (Articulation Index). The whole frequency range of the speech signal is divided into 20 bands, and the signal-to-noise ratio is determined within each band. The band widths are chosen so that every band contributes equally to speech perception. The articulation index is taken to be the weighted sum of the band values.
The disadvantage of the articulation index is that, although it is oriented towards speech signals, it does not take into account the properties of hearing and speech production.
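A minimal sketch of such a weighted sum is given below. The text above does not say how the band values are normalised, so the clipping of each band's signal-to-noise ratio to a 0 to 30 dB range and the equal default weights are assumptions made for illustration only.

```python
def articulation_index(snr_db_per_band, weights=None, dynamic_range_db=30.0):
    """Weighted sum of normalised per-band signal-to-noise ratios.

    Assumption (not from the text): each band SNR is clipped to
    [0, dynamic_range_db] and divided by dynamic_range_db, so that
    every band contributes a value between 0 and 1.
    """
    n = len(snr_db_per_band)
    if weights is None:
        weights = [1.0 / n] * n  # equally contributing bands
    total = 0.0
    for snr, w in zip(snr_db_per_band, weights):
        clipped = min(max(snr, 0.0), dynamic_range_db)
        total += w * clipped / dynamic_range_db
    return total

# 20 bands, each with a 15 dB signal-to-noise ratio.
print(articulation_index([15.0] * 20))
```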
SII (Speech Intelligibility Index) is an evolution of the AI method and is defined in the American standard ANSI S3.5-1997. It provides 4 measuring procedures on different band groups: 21 critical bands, 18 one-third-octave bands, 17 critical bands of equal contribution, and 6 octave bands. The signal-to-noise ratio is calculated within every band, and the total SII coefficient, ranging from 0 to 1, is computed.
The speech intelligibility index, however, takes into account only the properties of hearing, not those of speech production.
STI (Speech Transmission Index). A speech signal can be approximately regarded as a broadband signal modulated by a low-frequency signal. The articulation rate determines the modulation frequency. When the modulation depth decreases, the speech signal becomes more like noise and its intelligibility decreases. Accordingly, the loss of intelligibility can be estimated from the decrease in modulation depth.
The whole speech range is divided into 7 octave bands. The input is an octave-band noise signal whose intensity distribution matches that of speech. The modulating frequencies vary from 0.63 to 12.5 Hz at one-third-octave intervals (14 frequencies in all).
The STI measuring method is specified in the international standard IEC 268-16.
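The step from modulation depth to an index value is not spelled out above. The sketch below uses the conversion commonly described for STI: each measured modulation depth m is turned into an apparent signal-to-noise ratio, limited to the range from -15 to +15 dB, and scaled to 0..1. The equal band weights are a simplifying assumption; the standard defines its own weighting.

```python
import math

def transmission_index(m):
    """Map one modulation depth m (0..1) to a transmission index (0..1)."""
    m = min(max(m, 1e-6), 1.0 - 1e-6)          # avoid log of 0 or division by 0
    snr_db = 10.0 * math.log10(m / (1.0 - m))  # apparent signal-to-noise ratio
    snr_db = min(max(snr_db, -15.0), 15.0)     # limit to the +/-15 dB range
    return (snr_db + 15.0) / 30.0

def sti_from_modulation(m_matrix, band_weights=None):
    """m_matrix[b][f] is the modulation depth in band b at modulation frequency f."""
    n_bands = len(m_matrix)
    if band_weights is None:
        band_weights = [1.0 / n_bands] * n_bands  # assumption: equal band weights
    sti = 0.0
    for band, w in zip(m_matrix, band_weights):
        band_index = sum(transmission_index(m) for m in band) / len(band)
        sti += w * band_index
    return sti

# 7 octave bands x 14 modulation frequencies, all with modulation depth 0.5.
print(sti_from_modulation([[0.5] * 14 for _ in range(7)]))
```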
RASTI/STIPA (Rapid Speech Transmission Index). The STI method requires many measurements and calculations. A simplified method was therefore developed, which measures in only 2 bands with 5 modulation frequencies and so reduces the number of measurements and calculations. For good intelligibility, RASTI values must be at least 0.6.
Both the speech transmission index (STI) and the rapid speech transmission index (RASTI) imitate the speech production process by means of a noise model, but this way of accounting for the properties of speech production and hearing is far from optimal.
C50 (clarity factor) characterizes the clearness and clarity of sound. It is computed as the ratio of near echo to far echo, based on the fact that echo reduces signal intelligibility. The ratios are calculated in several frequency bands, treating near echo (arriving within 33 ms) as useful signal and far echo (arriving after more than 33 ms) as disturbing signal.
The clarity factor takes into account only one of the possible kinds of distortion, so it should be applied only as one of several approaches to speech quality estimation.
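The near/far ratio can be sketched for a single band as an energy ratio over an impulse response. Expressing the ratio in decibels and measuring energy as the sum of squared samples are assumptions of this sketch; the boundary time is passed in explicitly and the function name is hypothetical.

```python
import math

def clarity_db(impulse_response, sample_rate_hz, boundary_s):
    """Ratio (in dB) of energy arriving before boundary_s to energy arriving after it."""
    split = int(round(boundary_s * sample_rate_hz))
    early = sum(x * x for x in impulse_response[:split])  # near echo: useful signal
    late = sum(x * x for x in impulse_response[split:])   # far echo: disturbing signal
    if late == 0.0:
        return math.inf
    if early == 0.0:
        return -math.inf
    return 10.0 * math.log10(early / late)

# Toy impulse response at 1 kHz: a direct sound and one weak late reflection.
h = [1.0] + [0.0] * 32 + [0.1]
print(clarity_db(h, sample_rate_hz=1000, boundary_s=0.033))
```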

ITU P.862 PESQ (Perceptual Evaluation of Speech Quality). PESQ is an objective measurement method that predicts the results of subjective listening tests on telephony systems. PESQ uses a sensory model to compare the original, unprocessed signal with the degraded signal from the network or network element. The resulting quality score is similar to the subjective “Mean Opinion Score” (MOS) measured using panel tests according to ITU-T P.800. The PESQ scores are calibrated using a large database of subjective tests. The method takes into account coding distortions, errors, packet loss, delay and variable delay, and filtering in analogue network components.
PESQ is one of the most popular tools, but it has a number of disadvantages. It requires test signals to be speech-like, because many systems are optimized for speech and respond in an unrepresentative way to non-speech signals (e.g. tones, noise, ITU-T P.50). The PESQ test signal is chosen by the tester, so vendor estimations may differ from end-customer estimations. The approach equalizes signal levels, which is theoretically questionable, because speech at different volumes may have different spectra. PESQ also cannot catch the significant quality loss that occurs when the voice is equalized so that it has far less low-frequency and high-frequency energy than the original voice file.
The need to develop new methods and improve existing ones stems from the desire to bring objective and subjective quality estimations closer together and to make explicit use of our knowledge about hearing and speech production in such systems.
Whether an arbitrary or a specialized signal is used as the source signal depends on the purpose of the estimation (speech intelligibility evaluation, sound reproduction quality, quality estimation of speech transmitted through communication channels, etc.); choosing it to suit the purpose increases the objectivity of the estimation.

3. GENERAL SCHEME OF THE SYSTEM

Figure 1 shows the general scheme of the quality estimation system for sound signals.

Fig.1. General scheme of the quality estimation system for sound signals

The test signal generator forms a sound signal according to one of the sound flow models. This can be either a specialized set of sound signals or a signal obtained at the output of the statistical speech model. (The signal models are considered in detail later.) The generator's signal can either be saved for later use or passed on directly for processing and estimation. The bank of signals stores sound data produced by the signal generator or obtained from external sources.
Accordingly, the input of the estimation block is either the generator signal directly or a signal from the bank. The test signal is fed both to the synchronizer and to the device under test, which can be, for example, a vocoder or a communication channel. The output signal of the device under test is also fed to the synchronizer.
The synchronizer aligns the initial signal and the processed signal in time. The synchronized signals are passed in chunks to the analytical module, which determines the degree of similarity of the signals and issues the quality estimation as a measure of similarity between the initial and the processed signals.
Let us consider the functioning of the system modules in detail.
3.1. Generator of test signals

The generator of test signals consists of a generator of noise signals and a simplified statistical speech model. Both generators simulate the process of “speaking”, but their approaches to simulating speech production differ. The statistical model forms the sound flow on the basis of human speech patterns, whereas the generator of noise signals is based on knowledge about sound perception and speech production.
3.2. Generator of noise signals
The generator of noise signals uses a speech flow model like the one used in the STI method. The idea is that a speech signal can be approximately regarded as a broadband signal modulated by a low-frequency signal. The articulation rate determines the modulation frequency, which varies from 0.63 to 13.44 Hz.

The noise signal used for modulation is obtained from white noise by cutting out the critical bands of hearing or of speech production. In the first case the generated signal allows estimating the quality of sound signals in general; in the second, the quality of speech signals in particular. The critical bands are considered in detail in the description of the analytical module.
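The noise model can be illustrated with a short sketch that modulates white noise with a single low-frequency sinusoid. The restriction of the noise to critical bands is omitted, and the use of sinusoidal amplitude modulation, the function name and its parameters are assumptions of this example rather than the generator actually used.

```python
import math
import random

def modulated_noise(duration_s, sample_rate_hz, mod_freq_hz, mod_depth=1.0, seed=0):
    """White Gaussian noise whose amplitude follows 1 + mod_depth * cos(2*pi*f*t)."""
    rng = random.Random(seed)  # fixed seed so the test signal is reproducible
    n = int(duration_s * sample_rate_hz)
    samples = []
    for i in range(n):
        t = i / sample_rate_hz
        envelope = 1.0 + mod_depth * math.cos(2.0 * math.pi * mod_freq_hz * t)
        samples.append(envelope * rng.gauss(0.0, 1.0))
    return samples

# One second of noise at 8 kHz, modulated at 4 Hz (inside the 0.63..13.44 Hz range).
signal = modulated_noise(1.0, 8000, 4.0)
print(len(signal))
```

A lower mod_depth gives a signal closer to stationary noise, which corresponds to the loss of intelligibility described for the STI method.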
3.3. Statistical speech model
Language consists of sounds. Every individual produces a unique set of sounds. However, one can distinguish standard speakers (SS), who produce averaged kinds of sounds. Standard speakers are subdivided according to age, gender, region, social status, education, occupation, etc.
For every standard speaker one should determine the sound frequencies, the probabilities of sounds following each other, intonation contours, vocabularies, and the physical properties of individual sounds. Based on these data one can simulate a natural speech flow.
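One simple way to use the probabilities of sounds following each other is a first-order Markov chain, sketched below. The transition table is a made-up placeholder, not data from the model, and a real speech base would also carry intonation and physical sound parameters.

```python
import random

def generate_sound_sequence(transitions, start, length, seed=0):
    """Draw a sequence of sounds; transitions[a][b] is P(next sound is b | current is a)."""
    rng = random.Random(seed)
    sequence = [start]
    while len(sequence) < length:
        following = transitions[sequence[-1]]
        sounds = list(following)
        weights = [following[s] for s in sounds]
        sequence.append(rng.choices(sounds, weights=weights, k=1)[0])
    return sequence

# Hypothetical transition probabilities for three sounds of one standard speaker.
transitions = {
    "a": {"t": 0.7, "s": 0.3},
    "t": {"a": 0.9, "s": 0.1},
    "s": {"a": 0.6, "t": 0.4},
}
print(generate_sound_sequence(transitions, "a", 10))
```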
The system should also include statistical information about the population structure, which makes it possible to generate speech flows with the features characteristic of the population of a region or of a whole country.
Broadly speaking, the statistical model contains statistical data about the population structure, speech bases of standard speakers, speech signal processing facilities (synthesis algorithms), means of determining the parameters of speech sounds, and algorithms for generating sounds and standard speaker distributions.

The interface block provides interaction with the outside world (or the user) and also synchronizes the functions of the other blocks of the statistical model.

The speaker choice block generates a sample of standard speakers (or a sequence of standard speaker indexes). Depending on the command, either a representative sample of standard speakers or a sample from a single standard speaker can be generated. The sample is representative in the sense that the distribution of speech parameters in it corresponds to the distribution of speech parameters in the population described in the model.
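The two modes of the speaker choice block can be sketched as weighted random sampling of speaker indexes. The population shares below are invented for the example, and the function name and parameters are hypothetical.

```python
import random

def choose_speakers(population_share, sample_size, single_speaker=None, seed=0):
    """Return a sequence of standard speaker indexes.

    population_share[i] is the share of the population represented by speaker i.
    If single_speaker is given, the sample consists of that speaker only.
    """
    if single_speaker is not None:
        return [single_speaker] * sample_size
    rng = random.Random(seed)
    indexes = list(range(len(population_share)))
    # Representative sample: each index is drawn in proportion to its share.
    return rng.choices(indexes, weights=population_share, k=sample_size)

# Hypothetical shares of three standard speakers in the described population.
sample = choose_speakers([0.5, 0.3, 0.2], sample_size=1000)
print([sample.count(i) / len(sample) for i in range(3)])
```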

The sequence of standard speaker indexes is saved in the speaker choice block for further use.

The sound choice block forms the prosodic (the descriptions of sounds). Depending on the command, the prosodic is composed either for a representative sample of sounds, for a specified sequence of sounds, or for one specified sound.

The prosodic is saved in the prosodic buffer for later use.

The speech flow block transforms the descriptions of sounds into samples of the speech signal.

The standard speaker description block stores the descriptions of standard speakers and, on request, returns the necessary parts of the descriptions, the number of speakers, and the list of speakers.
3.4. Signals synchronizer
The synchronizer aligns the initial and processed signals in the time domain. Its input receives signal segments (pDATA) whose duration equals one VAD (Voice Activity Detection) frame; the VAD activity flags for the segments are specified in the pDATA segments themselves.
Any sound signal can be separated into active and inactive phases. The former correspond to active sound processes, the latter to low-level background noise. The elementary way to separate the two phases is by signal energy level. However, this approach is not accurate enough. In our approach the VAD algorithm from recommendation G.723 (part of the vocoder) is used for this purpose.
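For comparison, the elementary energy-based separation mentioned above can be sketched in a few lines. This is not the G.723 VAD algorithm; the frame length and threshold are arbitrary values chosen for the example.

```python
def energy_vad(samples, frame_len, threshold):
    """Mark each full frame as active (True) if its mean energy exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags

# One silent frame followed by one loud frame; 240 samples is 30 ms at 8 kHz.
signal = [0.0] * 240 + [0.5] * 240
print(energy_vad(signal, frame_len=240, threshold=0.01))
```

A fixed threshold of this kind fails when the background noise level changes, which is one reason the energy criterion alone is not accurate enough.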
After filtration, the state flags and signal frames enter the synchronizer blocks, which assemble the active signal fragments and the pauses. The blocks use common data: the buffer of the active reference (etalon) signal (EBuffer1), the buffer of the active signal under test (TBuffer1), the buffer of the reference signal pause (EBuffer0), the buffer of the pause of the signal under test (TBuffer0), and the readiness flags of the active-signal and pause buffers (dReady[0..1]). There is also a counter of synchronization errors (dErrorCounter).
The output of the synchronizer is a pair of buffers with active signals or a pair of buffers with pauses. Either synchronizer block can trigger the output of a pair of synchronized buffers.
The synchronized buffers and the activity flag are the input of the analytical module.
3.5. Analytical module
The analytical module compares the matched pairs of active-phase fragments and of inactive-phase fragments separately, which gives a more accurate estimation.

The integral spectrum is determined for each fragment using the discrete cosine transform (DCT). The spectrum integration is calculated according to a proprietary formula.
In the spectrum calculation, adjacent windows overlap by N/2 samples (N being the window length), and the well-known Hamming or Blackman-Harris window function is applied to every window.
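The windowing scheme can be sketched as below: frames of N samples taken every N/2 samples, a Hamming window, and a DCT of each frame. The DCT-II variant without normalisation is an assumption, and the proprietary spectrum integration step is not reproduced.

```python
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def dct2(x):
    """Unnormalised DCT-II, computed directly (O(n^2), adequate for short frames)."""
    n = len(x)
    return [
        sum(x[j] * math.cos(math.pi * (j + 0.5) * k / n) for j in range(n))
        for k in range(n)
    ]

def windowed_dct_frames(samples, n):
    """Split samples into frames of n samples overlapping by n/2 and transform each."""
    window = hamming(n)
    hop = n // 2
    frames = []
    for start in range(0, len(samples) - n + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(n)]
        frames.append(dct2(frame))
    return frames

# 32 samples with N = 16 give frames starting at samples 0, 8 and 16.
frames = windowed_dct_frames([math.sin(0.3 * i) for i in range(32)], 16)
print(len(frames), len(frames[0]))
```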
The spectrum energy levels in the bands are determined for all sets of bands. Several groups of critical bands are known [2-6], defined by different authors on the basis of different models of sound perception and speech production.

The band boundaries (initial and final indexes) and the band energy values are determined according to a set of proprietary formulas.

The initial quality estimation value is taken as 100%. It is then decreased in proportion to the difference between the band energies. Quality estimation values are determined for every set of bands. The overall quality estimation over all bands is calculated according to proprietary formulas.
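Since the actual formulas are proprietary, the following is only a hypothetical illustration of the general idea: start from 100% and subtract a penalty proportional to the relative difference of band energies. The band edges, the relative-difference measure and the averaging over bands are all assumptions of this sketch.

```python
def band_energies(spectrum, band_edges):
    """Sum of squared spectrum coefficients inside each (first, last_exclusive) band."""
    return [sum(c * c for c in spectrum[lo:hi]) for lo, hi in band_edges]

def quality_percent(ref_energies, test_energies):
    """100% minus the mean relative difference of band energies (hypothetical measure)."""
    losses = []
    for e_ref, e_test in zip(ref_energies, test_energies):
        denom = max(e_ref, e_test)
        losses.append(0.0 if denom == 0.0 else abs(e_ref - e_test) / denom)
    return 100.0 * (1.0 - sum(losses) / len(losses))

# Two bands; the signal under test lost half of the energy in the first band.
reference = band_energies([1.0, 1.0, 2.0, 2.0], [(0, 2), (2, 4)])
degraded = band_energies([1.0, 0.0, 2.0, 2.0], [(0, 2), (2, 4)])
print(quality_percent(reference, degraded))
```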

To determine sound (D) and word (W) intelligibility, Pokrovskiy's formulas [4] may be used.
To convert the quality loss coefficient into the sound intelligibility value, a correspondence table is used.

To determine values at intermediate points, interpolation (for example, the Lagrange interpolation polynomial) is used.
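Lagrange interpolation between table nodes can be sketched as follows. The node values in the example are placeholders chosen so that the result is easy to check; they are not taken from the correspondence table.

```python
def lagrange_interpolate(xs, ys, x):
    """Value at x of the Lagrange polynomial passing through the points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Placeholder nodes lying on y = x^2; the polynomial reproduces the curve between them.
nodes_x = [0.0, 1.0, 2.0]
nodes_y = [0.0, 1.0, 4.0]
print(lagrange_interpolate(nodes_x, nodes_y, 1.5))
```

With many table nodes it is usually safer to interpolate over a few neighbouring nodes only, because a single high-degree polynomial can oscillate between the nodes.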

Quality estimations can be translated into MOS values in a similar way.

4. IMPLEMENTATION & CONCLUSIONS
The algorithms described have been implemented for voice quality estimation and for comparison of external initial signals with signals under test.
Arbitrary signals recorded with a sampling frequency of 8 kHz and a sample size of 16 bits can be used as external signals. The signal under test is assumed to be obtained from an initial signal as a result of some transformation (for example, compression/restoration, transmission through communication channels, or filtration). In addition, a recording of a phonetically representative text read aloud by several speakers of different ages and both genders can be used as an initial external signal.
The internal initial signals (i.e. signals to which the user of the program has no access) are the signals generated according to the noise model (the generator is described above) and the signals generated on the basis of the statistical model.
The internal signals are fed into the sound data compression/restoration system, implemented for example as a DLL with a specified interface. The signal processed by the methods contained in the DLL is considered the signal under test and is subjected to the quality estimation procedure described earlier.

The presented method of sound signal quality estimation has a number of advantages over known methods of quality measurement, namely:
• it is universal, since it allows judging the quality of signals from various sources processed in different ways;
• the quality estimation can be optimized depending on the purpose:
o for speed (for example, a rough estimation can be obtained quickly);
o for signal type (using different bands for speech signals and for sound signals in general);
• the resulting estimations correlate well with MOS values;
• quality estimations obtained for speech signals can be translated into values of various kinds of intelligibility.

Test results, i.e. quality estimations of several standard voice codecs obtained on various test signals using the suggested method and the described implementation, correspond very closely to the known MOS values for these codecs [6].

5. TRENDS OF DEVELOPMENT

Given the structure of the suggested quality estimation system for sound signals, the system can develop in the following directions:
• improvement of the test signal model. The noise model can be supplemented with a set of multiband modulated noise signals; the data and algorithms of the statistical speech model can be enriched; the number of prepared test signals (such as recordings of phonetically representative text, PhRT) can be enlarged;
• development of more advanced synchronization algorithms based, for example, on the coincidence of maxima in the signal energy spectra;
• modernization of the acoustic model to take into account masking effects and the fact that pure tones and band noise affect hearing somewhat differently;
• modernization of the signal comparison scheme. The current distance measure is not accurate enough for strongly differing signals. For greater universality of the system, it is desirable to use correlation analysis methods for comparison;
• to solve a number of practical problems, the system needs the ability to work with multichannel (stereo, quadro, etc.) signals and to produce immediate quality estimations;
• fully correct translation of the objective estimations into MOS values requires further experimental research.
REFERENCES
1. Aldoshina I., “Bases of psychoacoustics”, The sound producer, 2002, 5, 8
2. Sekunov N., “Processing of a sound on PC”, bhv, Saint-Petersburg, 2001
3. Sapozhkov M.A., “Speech signal in cybernetics and communications”, Svyazizdat, Moscow, 1963
4. Pokrovskiy N.B., “Calculation and measurement of speech legibility”, Svyazizdat, Moscow, 1962
5. Sorokin V.N., “Speech synthesis”, Nauka, Moscow, 1992
6. http://www.sevana.fi/audio_speech_codecs_quality_analysis.php

Sevana Oy (Ltd) is a privately owned software company founded in 2003 in Finland with offices in Russia and Estonia. The company creates software in computer and telecommunications technologies. In 2008 it created a new technology for automated voice and audio quality analysis. In Q1 2009 it announced a new algorithm-based engine for associations for market basket analysis.

Sevana Oy http://www.sevana.fi
