Influence of audio recording properties on speaker verification systems in the forensic context: a state-of-the-art review

Franklin Alexander Sepúlveda Sepúlveda

doi:10.22335/rlct.v11i3.982

Authors

Franklin Alexander Sepúlveda Sepúlveda, Universidad Industrial de Santander, https://orcid.org/0000-0002-9643-5193

Keywords:

voice comparison methods, forensic acoustics, coding, speaker verification, signal-to-noise ratio

Abstract

Speaker verification (SV) procedures in the forensic field must be reliable. However, their performance is affected by intrinsic properties of the audio recordings. It is therefore important to analyze how these properties affect the SV methods used in forensic practice, so that more reliable procedures can be carried out in forensic proceedings. In this article, the analysis is based on works reported in the state of the art. It finds that verification performance depends on properties such as coding type, audio length, noise content, and the presence of saturation and transients, and that the degree to which each property affects performance depends on the verification method used. Although other factors also affect performance, this work addresses only those listed above. The review reveals a lack of reports on the degree of impact, particularly for methods other than the automatic method. In addition, little reported information was found on the influence of dynamic-range saturation and of transients, which makes it difficult to establish their influence.

Author biography

Franklin Alexander Sepúlveda Sepúlveda, Universidad Industrial de Santander

Associate Professor, School of Electrical, Electronic and Telecommunications Engineering.
