Influence of the properties of audio recordings on speaker verification systems in a forensic context: a review of the state of the art
DOI: https://doi.org/10.22335/rlct.v11i3.982

Keywords: voice comparison methods, forensic acoustics, coding, speaker verification, signal-to-noise ratio

Abstract
Speaker verification (SV) procedures in the forensic field must be reliable. However, their performance is affected by intrinsic properties of the audio recordings. It is therefore important to analyze the impact of these properties on the SV methods used in forensic practice, so that more reliable procedures can be carried out in forensic proceedings. This article bases its analysis on studies reported in the state of the art, which show that the performance of the verification process depends on properties such as the type of coding, the audio length, the noise content, and the presence of saturation and transients; the degree to which each property degrades performance depends on the verification method used. Although other factors also affect performance, this review addresses only those mentioned above. The review reveals a lack of reports on the degree of degradation, especially for methods other than the automatic method. In addition, little information was found on the influence of dynamic-range saturation and transients, which makes it difficult to establish their effect.
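Two of the recording properties discussed above, saturation of the dynamic range (clipping) and noise content (measured as signal-to-noise ratio), can be quantified with simple signal-level measures. The sketch below is an illustrative example, not taken from the article or from any of the methods it reviews: it builds a toy signal, applies clipping and additive noise, and reports the fraction of clipped samples and the resulting SNR. The function names and the 0.99 clipping threshold are assumptions chosen for illustration.

```python
import numpy as np

def clipped_fraction(x, threshold=0.99):
    """Fraction of samples at or beyond `threshold` times the
    signal's own peak amplitude (a simple clipping indicator)."""
    return float(np.mean(np.abs(x) >= threshold * np.max(np.abs(x))))

def snr_db(clean, noisy):
    """Signal-to-noise ratio in dB, given the clean reference."""
    noise = noisy - clean
    return 10.0 * np.log10(np.sum(clean**2) / np.sum(noise**2))

# Toy example: a sinusoidal "voice" segment, then clip it and add noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000, endpoint=False)          # 1 s at 8 kHz
clean = 0.8 * np.sin(2 * np.pi * 150 * t)            # 150 Hz tone
clipped = np.clip(1.5 * clean, -1.0, 1.0)            # saturate the dynamic range
noisy = clean + 0.05 * rng.standard_normal(t.size)   # additive white noise

print(f"clipped fraction: {clipped_fraction(clipped):.2f}")
print(f"SNR: {snr_db(clean, noisy):.1f} dB")
```

In a forensic workflow, measures of this kind would be computed on the questioned recording before verification, to flag material whose coding, clipping, or noise level falls outside the conditions for which a given SV method has been validated.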
License
Copyright (c) 2019 Revista Logos Ciencia & Tecnología
This work is licensed under a Creative Commons Attribution 4.0 International License.
This journal provides free and immediate access to its content under a Creative Commons Attribution 4.0 license (https://creativecommons.org/licenses/by/4.0/legalcode#languages), on the principle that making research freely available to the public supports greater global knowledge exchange. This means that the authors transfer copyright to the journal so that the material can be copied and distributed by any means, provided the authors are credited and the articles are not used commercially or modified in any way.