Influencia de las propiedades de los registros de audio en sistemas de verificación de hablantes en el contexto forense: una revisión del estado del arte

Franklin Alexander Sepúlveda Sepúlveda

doi:10.22335/rlct.v11i3.982

Autores

Franklin Alexander Sepúlveda Sepúlveda Universidad Industrial de Santander https://orcid.org/0000-0002-9643-5193

DOI:

https://doi.org/10.22335/rlct.v11i3.982

Palavras-chave:

métodos de comparação de vozes, acústica forense, codificação, verificação de falantes, relação sinal-ruído

Resumo

O procedimento de verificação de falantes (VF) no campo forense deve ser confiável. Contudo, seu desempenho é afetado por propriedades intrínsecas dos registros de áudio. Neste sentido, é importante analisar a afetação sobre os métodos de VF encontrados no campo forense, com o fim de estar capacitado para realizar procedimentos mais confiáveis nas diligências forenses. Neste artigo, a análise é feita com base em trabalhos reportados no estado da arte, a partir do qual se constata que o desempenho do processo de verificação depende de propriedades tais como tipo de codificação, longitude de áudio, conteúdo de ruído, presença de saturações e transitórios; onde o grau de afetação destas propriedades depende do método de verificação utilizado. Ainda que existam outros elementos que afetam o desempenho, neste trabalho abordam-se os previamente mencionados. Segundo a revisão realizada, percebe-se uma falência de relatos sobre o grau de afetação em caso de métodos diferentes ao método automático, especialmente. Além disso, no que se refere à influência da saturação do rango dinâmico e de transitórios encontrou-se pouca informação reportada, o que dificulta estabelecer a influência das mesmas.

Downloads

Os dados de download ainda não estão disponíveis.

Biografia do Autor

Franklin Alexander Sepúlveda Sepúlveda, Universidad Industrial de Santander

Profesor Asociado, Escuela de Ingenierías Eléctrica, Electrónica y de Telecomunicaciones.

Referências

Add-Decker, M. et al. (1999). Pétition pour l’arrêt des expertises vocales, tant qu’elles n’auront pas été validées scientifiquement: Pétition du GFCP de la SFA. Association Francophone de la Communication Parl’ee. Descargado febrero 14, 2018, de http://www.afcp-parole.org/doc/petition.pdf.

Amino, K., & Arai, T. (2009). Speaker-dependent characteristics of the nasals. Forensic Science International, 185(1), 21-28. doi: https://doi.org/10.1016/j.forsciint.2008.11.018.

Arantes, P., & Eriksson, A. (2014). Temporal stability of long-term measures of fundamental frequency. doi: 10.13140/2.1.4619.0089.

Ávila, F. R., & Biscainho, L. W. P. (2012). Bayesian restoration of audio signals degraded by impulsive noise modeled as individual pulses. IEEE Transactions on Audio, Speech, and Language Processing, 20(9), 2470-2481. doi: 10.1109/TASL.2012.2203811.

Barinov, A. (2010). Voice samples recording and speech quality assessment for forensic and automatic speaker identification. En Audio engineering society, convention paper (vol. 129th Convention, pp. 366-373). https://speechpro.com/files/en/media/publications/voice_samples_recording_for_forensic_speaker_identification.pdf.

Bie, F., Wang, D., Wang, J., & Zheng, T. F. (2015). Detection and reconstruction of clipped speech for speaker recognition. Speech Communication, 72, 218-231. doi: https://doi.org/10.1016/j.specom.2015.06.008.

Bonastre, J.-F., Bimbot, F., Böe, L.-J., Campbell, J. P., Reynolds, D. A., & Magrin-Chagnolleau, I. (2003). Person authentication by voice: A need for caution. En Interspeech. ISCA. https://www.isca-speech.org/archive/eurospeech_2003/e03_0033.html.

Campbell, W. M., Campbell, J. P., Gleason, T. P., Reynolds, D. A., & Shen, W. (2007). Speaker verification using support vector machines and high-level features. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 2085-2094. doi: 10.1109/TASL.2007.902874.

Campbell, W. M., Sturim, D. E., & Reynolds, D. A. (2006). Support vector machines using gmm supervectors for speaker verification. IEEE Signal Processing Letters, 13(5), 308-311. doi: 10.1109/LSP.2006.870086.

Castaldo, F., Colibro, D., Dalmasso, E., Laface, P., & Vair, C. (2007). Compensation of nuisance factors for speaker and language recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 1969-1978. doi: 10.1109/TASL.2007.901823.

Cheveigné, A., & Kawahara, H. (2002). YIN, A fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917-1930. doi: 10.1121/1.1458024.

Cicres, J. (2011). Los sonidos fricativos sordos y sus implicaciones forenses. Estudios Filológicos, 48, 33-48.

Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P. (2011). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788-798. doi: 10.1109/TASL.2010.2064307.

Eaton, J., & Naylor, P. A. (2013). Detection of clipping in coded speech signals. En 21st european signal processing conference (eusipco 2013) (pp. 1-5). https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6811469.

Farrús, M., Hernando, J., & Ejarque, P. (2007). Jitter and shimmer measurements for speaker recognition. En 8th annual conference of the international speech communication association, interspeech (pp. 778-781). Antwerp (Belgium). http://dx.doi.org/10.15332/iteckne.v14i2.1767.

Fazel, A., & Chakrabartty, S. (2011). An overview of statistical pattern recognition techniques for speaker verification. IEEE Circuits and Systems Magazine, 11(2), 62-81. doi: 10.1109/MCAS.2011.941080.

Garimella, S., & Hermansky, H. (2013). Factor analysis of auto-associative neural networks with application in speaker verification. IEEE Transactions on Neural Networks and Learning Systems, 24(4), 522-528. doi: 10.1109/TNNLS.2012.2236652.

González-Rátiva, M. C., & Mejía-Escobar, J. A. (2011). Frecuencia fonética del español de Colombia. Forma. Func., 24(2), 69-102.

González-Rodríguez, J., Drygajlo, A., Ramos-Castro, D., García-Gomar, M., & Ortega-García, J. (2006). Robust estimation, interpretation and assessment of likelihood ratios in forensic speaker recognition. Computer Speech & Language, 20(2), 331-355. (Odyssey 2004: The speaker and Language Recognition Workshop). doi: https://doi.org/10.1016/j.csl.2005.08.005.

Gruber, J., & Poza, F. (1995). Voicegram identification evidence. Lawyers Cooperative Pub.

Hansen, J. H. L., & Hasan, T. (2015). Speaker recognition by machines and humans: A tutorial review. IEEE Signal Processing Magazine, 32(6), 74-99. doi: 10.1109/MSP.2015.2462851.

Hasan, T., & Hansen, J. H. L. (2011). A study on universal background model training in speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(7), 1890-1899. doi: 10.1109/TASL.2010.2102753.

Hasan, T., & Hansen, J. H. L. (2013). Acoustic factor analysis for robust speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 21(4), 842-853. doi: 10.1109/TASL.2012.2226161.

Hasan, T., Saeidi, R., Hansen, J. H. L., & van Leeuwen, D. A. (2013). Duration mismatch compensation for i-vector based speaker recognition systems. IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, pp. 7663-7667. doi: 10.1109/ICASSP.2013.6639154.

Hautamäki, R. G., Kinnunen, T., Hautamäki, V., & Laukkanen, A.-M. (2014). Comparison of human listeners and speaker verification systems using voice mimicry data. http://cs.uef.fi/~villeh/mimicry_odyssey2014.pdf.

Hautamäki, V., Cheng, Y.-C., Rajan, P., & Lee, C.-H. (2013). Minimax i-vector extractor for short duration speaker verification. Proceedings of the Annual Conference of the International Speech Communication Association, Interspeech, 3708-3712.

Hollien, H., Didla, G., Harnsberger, J. D., & Hollien, K. A. (2016). The case for aural perceptual speaker identiﬁcation. Forensic Science International, http://dx.doi.org/10.1016/j.forsciint.2016.08.007.

Ireland, D., Knuepffer, C., & Mcbride, S. (2015). Adaptive multi-rate compression effects on vowel analysis. Frontiers in bioengineering and biotechnology, 3, 118. doi: 10.3389/fbioe.2015.00118.

Jameel, A. S. M. M., Fattah, S. A., Goswami, R., Zhu, W. P., & Ahmad, M. O. (2017). Noise robust formant frequency estimation method based on spectral model of repeated autocorrelation of speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(6), 1357-1370. doi: 10.1109/ TASLP.2016.2625423.

Jarina, R., Polacký, J., Počta, P., & Chmulík, M. (2017). Automatic speaker verification on narrowband and wideband lossy coded clean speech. IET Biometrics, 6(4), 276-281. doi: 10.1049/iet-bmt.2016.0119.

Kanagasundaram, A. (2014). Speaker verification using i-vector features (tesis doctoral). Speech and Audio Research Laboratory, Queensland University of Technology. https://eprints.qut.edu.au/77834/1/Ahilan_Kanagasundaram_Thesis.pdf.

Kenny, P., Mihoubi, M., & Dumouchel, P. (2003). New map estimators for speaker recognition. En Interspeech. https://www.crim.ca/perso/patrick.kenny/eurospeech2003.pdf.

Kenny, P., Ouellet, P., Dehak, N., Gupta, V., & Dumouchel, P. (2008). A study of interspeaker variability in speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 16(5), 980-988. doi: 10.1109/TASL.2008.925147.

Kinnunen, T., & Li, H. (2010). An overview of text-independent speaker recognition: From features to super-vectors. Speech Communication, 52(1), 12-40. https://doi.org/10.1016/j.specom.2009.08.009.

Leung, K., Mak, M., Siu, M., & Kung, S. (2006). Adaptive articulatory feature-based conditional pronunciation modeling for speaker verification. Speech Communication, 48(1), 71-84. https://doi.org/10.1016/j.specom.2005.05.013.

Li, L., Chen, Y., Shi, Y., Tang, Z., & Wang, D. (2017). Deep speaker feature learning for text-independent speaker verification. En Interspeech (pp. 1542-1546). doi: 10.21437/Interspeech.2017-452.

Li, N., & Mak, M.-W. (2015). Snr-invariant PLDA modeling for robust speaker verification. En Interspeech. https://www.isca-speech.org/archive/interspeech_2015/papers/i15_2317.pdf.

Magrin-Chagnolleau, I., Durou, G., & Bimbot, F. (2002). Application of time-frequency principal component analysis to text-independent speaker identification. IEEE Transactions on Speech and Audio Processing, 10(6), 371-378. doi: 10.1109/TSA.2002.800557.

Mandasari, M. I., McLaren, M., & van Leeuwen, D. A. (2012). The effect of noise on modern auto-matic speaker recognition systems. En IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4249-4252). doi: 10.1109/ICASSP.2012.6288857.

Manikandan, M. S., Yadav, A. K., & Ghosh, D. (2017). Elimination of impulsive disturbances from archive audio signals using sparse representation in mixed dictionaries. En Tencon ieee region 10 conference (pp. 2531-2535). doi: 10.1109/TENCON.2017.8228288.

Moreno-Daniel, A. (2004). Speaker verification using coded speech. En A. Martínez Trinidad (Ed.), Progress in pattern recognition, image analysis and applications. CIARP (vol. 3287, pp. 366-373). doi: 10.1007/978-3-540-30463-0_45.

Morrison, G. S. (2009a). Forensic voice comparison using likelihood ratios based on polynomial curves fitted to the formant trajectories of Australian English /ai/. doi: 10.1558/ijsll.v15i2.249.

Morrison, G. S. (2009b). Likelihood-ratio forensic voice comparison using parametric representations of the formant trajectories of diphthongs. The Journal of the Acoustical Society of America, 125(4), 2387-2397. doi: 10.1121/1.3081384.

Morrison, G. S. (2010). Expert evidence. En (cap. Forensic voice comparison). Thomson Reuters. http://expert-evidence.forensic-voice-comparison.net/.

Mustafá, K., & Bruce, I. C. (2006). Robust formant tracking for continuous speech with speaker variability. IEEE Transactions on Audio, Speech, and Language Processing, 14(2), 435-444. doi: 10.1109/ TSA.2005.855840.

Nakatani, T., & Irino, T. (2004). Robust and accurate fundamental frequency estimation based on dominant harmonic components. Journal of the Acoustical Society of America, 116(6), 3690-3700. doi: 10.1121/1.1787522.

Nielsen, A. S., & Stern, K. R. (1985). Identification of known voices as a function of familiarity and narrow band coding. Journal of the Acoustical Society of America, 77, 658. https://doi.org/10.1121/1.391884.

Nongpiur, R. C. (2008). Impulse noise removal in speech using wavelets. En IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1593-1596). doi: 10.1109/ICASSP.2008.4517929.

Poddar, A., Sahidullah, M., & Saha, G. (2018). Speaker verification with short utterances: A review of challenges, trends and opportunities. IET Biometrics, 7(2), 91-101.

doi: 10.1049/iet-bmt.2017.0065.

Poddar, A., Sahidullah, M., & Saha, G. (2015). Performance comparison of speaker recognition systems in presence of duration variability. Annual IEEE India Conference (INDICON) (pp. 1-6). New Delhi.

doi: 10.1109/INDICON.2015.7443464.

Polacky, J., Jarina, R., & Chmulik, M. (2016). Assessment of automatic speaker verification on lossy transcoded speech. En 4th International Conference on Biometrics and Forensics (IWBF) (pp. 1-6). doi: 10.1109/IWBF.2016.7449679.

Reynolds, D. A. (1997). Comparison of background normalization methods for text-independent speaker verification. En Proc. of 5th European Conf. on Speech Communication and Technology (EuroSpeech) (vol. 2, pp. 963-966). https://pdfs.semanticscholar.org/f5ad/e2e149b2d4bc0a4c679207b2bf858692af7a.pdf.

Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted gaussian mixture models. Digital Signal Processing, 10(1), 19-41.

Romito, L., & Galatà, V. (2004). Towards a protocol in speaker recognition analysis. Forensic Science International, 146, S107-11. doi: 10.1006/dspr.1999.0361.

Rosas, C., & Sommerhoff, J. (2009). Efectos acústicos de las variaciones fonopragmáticas y ambientales. Estudios Filológicos, 44, 195-210. http://dx.doi.org/10.4067/S0071-17132009000100012.

Rose, P. (2002). Forensic speaker identification (F. S. Series, Ed.). Taylor & Francis.

Saedi, R. et al. (2013). I4U submission to NIST-SRE 2012: A large-scale collaborative effort for noise-robust speaker verification. En Interspeech. https://www.isca-speech.org/archive/archive_papers/interspeech_2013/i13_1986.pdf.

Sarkar, A., Driss, M., Bousquet, P.-M., & Bonastre, J.-F. (2012). Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification. Proceedings of the Annual Conference of the International Speech Communication Association. Interspeech.

Schmidt-Nielsen, A., & Crystal, T. H. (2000). Speaker verification by human listeners: Experiments comparing human and machine performance using the nist 1998 speaker evaluation data. Digital Signal Processing, 10(1), 249-266. doi: https://doi.org/10.1006/dspr.1999.0356.

Snyder, D., García-Romero, D., Povey, D., & Khudanpur, S. (2017). Deep neural network embeddings for text-independent speaker verification. En Interspeech (pp. 999-1003). doi: 10.21437/Interspeech.2017-620.

Tosi, O. (1979). Voice identification: Theory and legal applications. University Park Press: Baltimore, Maryland.

Univaso, P., Ale, J. M., & Gurlekian, J. A. (2015). Data mining applied to forensic speaker identification. En IEEE Latin America Transactions, 13(4), 1098-1111. doi: 10.1109/TLA.2015.7106363.

Univaso, P. (2017). Forensic speaker identification: A tutorial. En IEEE Latin America Transactions, 15(9), pp. 1754-1770.

doi: 10.1109/TLA.2017.8015083.

Van Lancker, D., Kreiman, J., & Emmorey, K. (1985). Familiar voice recognition: Patterns and parameters. Part I. Recognition of backward voices. Journal of Phonetics, 13, 19-38.

Wan, H., Ma, X., & Li, X. (2018). Variational bayesian learning for removal of sparse impulsive noise from speech signals. Digital Signal Processing, 73, 106-116. doi: https://doi.org/10.1016/j.dsp.2017.11.007.

Zheng, N., Lee, T., & Ching, P. C. (2007). Integration of complementary acoustic features for speaker recognition. IEEE Signal Processing Letters, 14(3), 181-184. doi: 10.1109/LSP.2006.884031.