23rd Interspeech 2022: Incheon, Korea
- Hanseok Ko, John H. L. Hansen:
23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, September 18-22, 2022. ISCA 2022
Speech Synthesis: Toward end-to-end synthesis
- Hyunjae Cho, Wonbin Jung, Junhyeok Lee, Sang Hoon Woo:
SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech. 1-5
- Hanbin Bae, Young-Sun Joo:
Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch. 6-10
- Martin Lenglet, Olivier Perrotin, Gérard Bailly:
Speaking Rate Control of end-to-end TTS Models by Direct Manipulation of the Encoder's Output Embeddings. 11-15
- Yooncheol Ju, Ilhwan Kim, Hongsun Yang, Ji-Hoon Kim, Byeongyeol Kim, Soumi Maiti, Shinji Watanabe:
TriniTTS: Pitch-controllable End-to-end TTS without External Aligner. 16-20
- Dan Lim, Sunghee Jung, Eesung Kim:
JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech. 21-25
Technology for Disordered Speech
- Rosanna Turrisi, Leonardo Badino:
Interpretable dysarthric speaker adaptation based on optimal-transport. 26-30
- Zhengjun Yue, Erfan Loweimi, Heidi Christensen, Jon Barker, Zoran Cvetkovic:
Dysarthric Speech Recognition From Raw Waveform with Parametric CNNs. 31-35
- Luke Prananta, Bence Mark Halpern, Siyuan Feng, Odette Scharenborg:
The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition. 36-40
- Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda:
Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition. 41-45
- Chitralekha Bhat, Ashish Panda, Helmer Strik:
Improved ASR Performance for Dysarthric Speech Using Two-stage Data Augmentation. 46-50
- Abner Hernandez, Paula Andrea Pérez-Toro, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas K. Maier, Seung Hee Yang:
Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition. 51-55
Neural Network Training Methods for ASR I
- Mun-Hak Lee, Joon-Hyuk Chang, Sang-Eon Lee, Ju-Seok Seong, Chanhee Park, Haeyoung Kwon:
Regularizing Transformer-based Acoustic Models by Penalizing Attention Weights. 56-60
- David M. Chan, Shalini Ghosh:
Content-Context Factorized Representations for Automated Speech Recognition. 61-65
- Georgios Karakasidis, Tamás Grósz, Mikko Kurimo:
Comparison and Analysis of New Curriculum Criteria for End-to-End ASR. 66-70
- Deepak Baby, Pasquale D'Alterio, Valentin Mendelev:
Incremental learning for RNN-Transducer based speech recognition models. 71-75
- Andrew Hard, Kurt Partridge, Neng Chen, Sean Augenstein, Aishanee Shah, Hyun Jin Park, Alex Park, Sara Ng, Jessica Nguyen, Ignacio López-Moreno, Rajiv Mathews, Françoise Beaufays:
Production federated keyword spotting via distillation, filtering, and joint federated-centralized training. 76-80
Acoustic Phonetics and Prosody
- Jieun Song, Hae-Sung Jeon, Jieun Kiaer:
Use of prosodic and lexical cues for disambiguating wh-words in Korean. 81-85
- Vinicius Ribeiro, Yves Laprie:
Autoencoder-Based Tongue Shape Estimation During Continuous Speech. 86-90
- Giuseppe Magistro, Claudia Crocco:
Phonetic erosion and information structure in function words: the case of mia. 91-95
- Miran Oh, Yoon-Jeong Lee:
Dynamic Vertical Larynx Actions Under Prosodic Focus. 96-100
- Leah Bradshaw, Eleanor Chodroff, Lena A. Jäger, Volker Dellwo:
Fundamental Frequency Variability over Time in Telephone Interactions. 101-105
Spoken Machine Translation
- Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, Marta R. Costa-jussà:
SHAS: Approaching optimal Segmentation for End-to-End Speech Translation. 106-110
- Jinming Zhao, Hao Yang, Gholamreza Haffari, Ehsan Shareghi:
M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation. 111-115
- Mohd Abbas Zaidi, Beomseok Lee, Sangha Kim, Chanwoo Kim:
Cross-Modal Decision Regularization for Simultaneous Speech Translation. 116-120
- Ryo Fukuda, Katsuhito Sudoh, Satoshi Nakamura:
Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation. 121-125
- Kirandevraj R, Vinod Kumar Kurmi, Vinay P. Namboodiri, C. V. Jawahar:
Generalized Keyword Spotting using ASR embeddings. 126-130
(Multimodal) Speech Emotion Recognition I
- Youngdo Ahn, Sung Joo Lee, Jong Won Shin:
Multi-Corpus Speech Emotion Recognition for Unseen Corpus Using Corpus-Wise Weights in Classification Loss. 131-135
- Junghun Kim, Yoojin An, Jihie Kim:
Improving Speech Emotion Recognition Through Focus and Calibration Attention Mechanisms. 136-140
- Joosung Lee:
The Emotion is Not One-hot Encoding: Learning with Grayscale Label for Emotion Recognition in Conversation. 141-145
- Andreas Triantafyllopoulos, Johannes Wagner, Hagen Wierstorf, Maximilian Schmitt, Uwe Reichel, Florian Eyben, Felix Burkhardt, Björn W. Schuller:
Probing speech emotion recognition transformers for linguistic knowledge. 146-150
- Navin Raj Prabhu, Guillaume Carbajal, Nale Lehmann-Willenbrock, Timo Gerkmann:
End-To-End Label Uncertainty Modeling for Speech-based Arousal Recognition Using Bayesian Neural Networks. 151-155
- Matthew Perez, Mimansa Jaiswal, Minxue Niu, Cristina Gorrostieta, Matthew Roddy, Kye Taylor, Reza Lotfian, John Kane, Emily Mower Provost:
Mind the gap: On the value of silence representations to lexical-based speech emotion recognition. 156-160
- Huang-Cheng Chou, Chi-Chun Lee, Carlos Busso:
Exploiting Co-occurrence Frequency of Emotions in Perceptual Evaluations To Train A Speech Emotion Classifier. 161-165
- Hira Dhamyal, Bhiksha Raj, Rita Singh:
Positional Encoding for Capturing Modality Specific Cadence for Emotion Detection. 166-170
Dereverberation, Noise Reduction, and Speaker Extraction
- Tuan Vu Ho, Maori Kobayashi, Masato Akagi:
Speak Like a Professional: Increasing Speech Intelligibility by Mimicking Professional Announcer Voice with Voice Conversion. 171-175
- Tuan Vu Ho, Quoc Huy Nguyen, Masato Akagi, Masashi Unoki:
Vector-quantized Variational Autoencoder for Phase-aware Speech Enhancement. 176-180
- Minseung Kim, Hyungchan Song, Sein Cheong, Jong Won Shin:
iDeepMMSE: An improved deep learning approach to MMSE speech and noise power spectrum estimation for speech enhancement. 181-185
- Kuo-Hsuan Hung, Szu-Wei Fu, Huan-Hsin Tseng, Hsin-Tien Chiang, Yu Tsao, Chii-Wann Lin:
Boosting Self-Supervised Embeddings for Speech Enhancement. 186-190
- Seorim Hwang, Youngcheol Park, Sungwook Park:
Monoaural Speech Enhancement Using a Nested U-Net with Two-Level Skip Connections. 191-195
- Hannah Muckenhirn, Aleksandr Safin, Hakan Erdogan, Felix de Chaumont Quitry, Marco Tagliasacchi, Scott Wisdom, John R. Hershey:
CycleGAN-based Unpaired Speech Dereverberation. 196-200
- Ashutosh Pandey, DeLiang Wang:
Attentive Training: A New Training Framework for Talker-independent Speaker Extraction. 201-205
- Tyler Vuong, Richard M. Stern:
Improved Modulation-Domain Loss for Neural-Network-based Speech Enhancement. 206-210
- Chiang-Jen Peng, Yun-Ju Chan, Yih-Liang Shen, Cheng Yu, Yu Tsao, Tai-Shih Chi:
Perceptual Characteristics Based Multi-objective Model for Speech Enhancement. 211-215
- Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolíková, Hiroshi Sato, Tomohiro Nakatani:
Listen only to me! How well can target speech extraction handle false alarms? 216-220
- Hao Shi, Longbiao Wang, Sheng Li, Jianwu Dang, Tatsuya Kawahara:
Monaural Speech Enhancement Based on Spectrogram Decomposition for Convolutional Neural Network-sensitive Feature Extraction. 221-225
- Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann:
Neural Network-augmented Kalman Filtering for Robust Online Speech Dereverberation in Noisy Reverberant Environments. 226-230
Source Separation II
- Nicolás Schmidt, Jordi Pons, Marius Miron:
PodcastMix: A dataset for separating music and speech in podcasts. 231-235
- Kohei Saijo, Robin Scheibler:
Independence-based Joint Dereverberation and Separation with Neural Source Model. 236-240
- Kohei Saijo, Robin Scheibler:
Spatial Loss for Unsupervised Multi-channel Source Separation. 241-245
- Samuel Bellows, Timothy W. Leishman:
Effect of Head Orientation on Speech Directivity. 246-250
- Kohei Saijo, Tetsuji Ogawa:
Unsupervised Training of Sequential Neural Beamformer Using Coarsely-separated and Non-separated Signals. 251-255
- Marvin Borsdorf, Kevin Scheck, Haizhou Li, Tanja Schultz:
Blind Language Separation: Disentangling Multilingual Cocktail Party Voices by Language. 256-260
- Mateusz Guzik, Konrad Kowalczyk:
NTF of Spectral and Spatial Features for Tracking and Separation of Moving Sound Sources in Spherical Harmonic Domain. 261-265
- Jack Deadman, Jon Barker:
Modelling Turn-taking in Multispeaker Parties for Realistic Data Simulation. 266-270
- Christoph Böddeker, Tobias Cord-Landwehr, Thilo von Neumann, Reinhold Haeb-Umbach:
An Initialization Scheme for Meeting Separation with Spatial Mixture Models. 271-275
- Seongkyu Mun, Dhananjaya Gowda, Jihwan Lee, Changwoo Han, Dokyun Lee, Chanwoo Kim:
Prototypical speaker-interference loss for target voice separation using non-parallel audio samples. 276-280
Embedding and Network Architecture for Speaker Recognition
- Pierre-Michel Bousquet, Mickael Rouvier, Jean-François Bonastre:
Reliability criterion based on learning-phase entropy for speaker recognition with neural network. 281-285
- Bei Liu, Zhengyang Chen, Yanmin Qian:
Attentive Feature Fusion for Robust Speaker Verification. 286-290
- Bei Liu, Zhengyang Chen, Yanmin Qian:
Dual Path Embedding Learning for Speaker Verification with Triplet Attention. 291-295
- Bei Liu, Zhengyang Chen, Shuai Wang, Haoyu Wang, Bing Han, Yanmin Qian:
DF-ResNet: Boosting Speaker Verification Performance with Depth-First Design. 296-300
- Ruida Li, Shuo Fang, Chenguang Ma, Liang Li:
Adaptive Rectangle Loss for Speaker Verification. 301-305
- Yang Zhang, Zhiqiang Lv, Haibin Wu, Shanshan Zhang, Pengfei Hu, Zhiyong Wu, Hung-yi Lee, Helen Meng:
MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification. 306-310
- Leying Zhang, Zhengyang Chen, Yanmin Qian:
Enroll-Aware Attentive Statistics Pooling for Target Speaker Verification. 311-315
- Yusheng Tian, Jingyu Li, Tan Lee:
Transport-Oriented Feature Aggregation for Speaker Embedding Learning. 316-320
- Mufan Sang, John H. L. Hansen:
Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning. 321-325
- Linjun Cai, Yuhong Yang, Xufeng Chen, Weiping Tu, Hongyang Chen:
CS-CTCSCONV1D: Small footprint speaker verification with channel split time-channel-time separable 1-dimensional convolution. 326-330
- Pengqi Li, Lantian Li, Askar Hamdulla, Dong Wang:
Reliable Visualization for Deep Speaker Recognition. 331-335
- Zhiyuan Peng, Xuanji He, Ke Ding, Tan Lee, Guanglu Wan:
Unifying Cosine and PLDA Back-ends for Speaker Verification. 336-340
- Yuheng Wei, Junzhao Du, Hui Liu, Qian Wang:
CTFALite: Lightweight Channel-specific Temporal and Frequency Attention Mechanism for Enhancing the Speaker Embedding Extractor. 341-345
Speech Representation II
- Weidong Chen, Xiaofen Xing, Xiangmin Xu, Jianxin Pang, Lan Du:
SpeechFormer: A Hierarchical Efficient Framework Incorporating the Characteristics of Speech. 346-350
- David Feinberg:
VoiceLab: Software for Fully Reproducible Automated Voice Analysis. 351-355
- Joel Shor, Subhashini Venugopalan:
TRILLsson: Distilled Universal Paralinguistic Speech Representations. 356-360
- Nan Li, Meng Ge, Longbiao Wang, Masashi Unoki, Sheng Li, Jianwu Dang:
Global Signal-to-noise Ratio Estimation Based on Multi-subband Processing Using Convolutional Neural Network. 361-365
- Mostafa Sadeghi, Paul Magron:
A Sparsity-promoting Dictionary Model for Variational Autoencoders. 366-370
- Yan Zhao, Jincen Wang, Ru Ye, Yuan Zong, Wenming Zheng, Li Zhao:
Deep Transductive Transfer Regression Network for Cross-Corpus Speech Emotion Recognition. 371-375
- John H. L. Hansen, Zhenyu Wang:
Audio Anti-spoofing Using Simple Attention Module and Joint Optimization Based on Additive Angular Margin Loss and Meta-learning. 376-380
- Boris Bergsma, Minhao Yang, Milos Cernak:
PEAF: Learnable Power Efficient Analog Acoustic Features for Audio Recognition. 381-385
- Gasser Elbanna, Alice Biryukov, Neil Scheidwasser-Clow, Lara Orlandic, Pablo Mainar, Mikolaj Kegler, Pierre Beckmann, Milos Cernak:
Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load. 386-390
- Shijun Wang, Hamed Hemati, Jón Guðnason, Damian Borth:
Generative Data Augmentation Guided by Triplet Loss for Speech Emotion Recognition. 391-395
- Sarthak Yadav, Neil Zeghidour:
Learning neural audio features without supervision. 396-400
- Yixuan Zhang, Heming Wang, DeLiang Wang:
Densely-connected Convolutional Recurrent Network for Fundamental Frequency Estimation in Noisy Speech. 401-405
- Abu Zaher Md Faridee, Hannes Gamper:
Predicting label distribution improves non-intrusive speech quality estimation. 406-410
- Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka:
Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models. 411-415
- Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza:
Dataset Pruning for Resource-constrained Spoofed Audio Detection. 416-420
Speech Synthesis: Linguistic Processing, Paradigms and Other Topics II
- Jaesung Tae, Hyeongju Kim, Taesu Kim:
EdiTTS: Score-based Editing for Controllable Text-to-Speech. 421-425
- Jie Chen, Changhe Song, Deyi Tuo, Xixin Wu, Shiyin Kang, Zhiyong Wu, Helen Meng:
Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information. 426-430
- Zalan Borsos, Matthew Sharifi, Marco Tagliasacchi:
SpeechPainter: Text-conditioned Speech Inpainting. 431-435
- Song Zhang, Ken Zheng, Xiaoxu Zhu, Baoxiang Li:
A polyphone BERT for Polyphone Disambiguation in Mandarin Chinese. 436-440
- Mutian He, Jingzhou Yang, Lei He, Frank K. Soong:
Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge. 441-445
- Jian Zhu, Cong Zhang, David Jurgens:
ByT5 model for massively multilingual grapheme-to-phoneme conversion. 446-450
- Puneet Mathur, Franck Dernoncourt, Quan Hung Tran, Jiuxiang Gu, Ani Nenkova, Vlad I. Morariu, Rajiv Jain, Dinesh Manocha:
DocLayoutTTS: Dataset and Baselines for Layout-informed Document-level Neural Speech Synthesis. 451-455
- Guangyan Zhang, Kaitao Song, Xu Tan, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang, Wei Zhou, Tao Qin, Tan Lee, Sheng Zhao:
Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech. 456-460
- Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson:
Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition. 461-465
- Tho Nguyen Duc Tran, The Chuong Chu, Vu Hoang, Trung Huu Bui, Steven Hung Quoc Truong:
An Efficient and High Fidelity Vietnamese Streaming End-to-End Speech Synthesis. 466-470
- Cassia Valentini-Botinhao, Manuel Sam Ribeiro, Oliver Watts, Korin Richmond, Gustav Eje Henter:
Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks. 471-475
- Zikai Chen, Lin Wu, Junjie Pan, Xiang Yin:
An Automatic Soundtracking System for Text-to-Speech Audiobooks. 476-480
- Daxin Tan, Guangyan Zhang, Tan Lee:
Environment Aware Text-to-Speech Synthesis. 481-485
- Artem Ploujnikov, Mirco Ravanelli:
SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation. 486-490
- Evelina Bakhturina, Yang Zhang, Boris Ginsburg:
Shallow Fusion of Weighted Finite-State Transducer and Language Model for Text Normalization. 491-495
- Yogesh Virkar, Marcello Federico, Robert Enyedi, Roberto Barra-Chicote:
Prosodic alignment for off-screen automatic dubbing. 496-500
- Qibing Bai, Tom Ko, Yu Zhang:
A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis. 501-505
- Hirokazu Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka:
CAUSE: Crossmodal Action Unit Sequence Estimation from Speech. 506-510
- Binu Nisal Abeysinghe, Jesin James, Catherine I. Watson, Felix Marattukalam:
Visualising Model Training via Vowel Space for Text-To-Speech Systems. 511-515
Other Topics in Speech Recognition
- Aaqib Saeed:
Binary Early-Exit Network for Adaptive Inference on Low-Resource Devices. 516-520
- Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka:
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings. 521-525
- Naoki Makishima, Satoshi Suzuki, Atsushi Ando, Ryo Masumura:
Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data. 526-530
- Yi-Kai Zhang, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan:
Audio-Visual Generalized Few-Shot Learning with Prototype-Based Co-Adaptation. 531-535
- Junteng Jia, Jay Mahadeokar, Weiyi Zheng, Yuan Shangguan, Ozlem Kalinli, Frank Seide:
Federated Domain Adaptation for ASR with Full Self-Supervision. 536-540
- Longfei Yang, Wenqing Wei, Sheng Li, Jiyi Li, Takahiro Shinozaki:
Augmented Adversarial Self-Supervised Learning for Early-Stage Alzheimer's Speech Detection. 541-545
- Zvi Kons, Hagai Aronowitz, Edmilson da Silva Morais, Matheus Damasceno, Hong-Kwang Kuo, Samuel Thomas, George Saon:
Extending RNN-T-based speech recognition systems with emotion and language classification. 546-549
- Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg:
Thutmose Tagger: Single-pass neural model for Inverse Text Normalization. 550-554
- Yeonjin Cho, Sara Ng, Trang Tran, Mari Ostendorf:
Leveraging Prosody for Punctuation Prediction of Spontaneous Speech. 555-559
- Fan Yu, Zhihao Du, Shiliang Zhang, Yuxiao Lin, Lei Xie:
A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings. 560-564
Audio Deep PLC (Packet Loss Concealment) Challenge
- Yuansheng Guan, Guochen Yu, Andong Li, Chengshi Zheng, Jie Wang:
TMGAN-PLC: Audio Packet Loss Concealment using Temporal Memory Generative Adversarial Network. 565-569
- Jean-Marc Valin, Ahmed Mustafa, Christopher Montgomery, Timothy B. Terriberry, Michael Klingbeil, Paris Smaragdis, Arvindh Krishnaswamy:
Real-Time Packet Loss Concealment With Mixed Generative and Predictive Model. 570-574
- Baiyun Liu, Qi Song, Mingxue Yang, Wuwen Yuan, Tianbao Wang:
PLCNet: Real-time Packet Loss Concealment with Semi-supervised Generative Adversarial Network. 575-579
- Lorenz Diener, Sten Sootla, Solomiya Branets, Ando Saabas, Robert Aichner, Ross Cutler:
INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge. 580-584
- Nan Li, Xiguang Zheng, Chen Zhang, Liang Guo, Bing Yu:
End-to-End Multi-Loss Training for Low Delay Packet Loss Concealment. 585-589
Robust Speaker Recognition
- Ju-ho Kim, Jungwoo Heo, Hye-jin Shim, Ha-Jin Yu:
Extended U-Net for Speaker Verification in Noisy Environments. 590-594
- Seunghan Yang, Debasmit Das, Janghoon Cho, Hyoungwoo Park, Sungrack Yun:
Domain Agnostic Few-shot Learning for Speaker Verification. 595-599
- Qiongqiong Wang, Kong Aik Lee, Tianchi Liu:
Scoring of Large-Margin Embeddings for Speaker Verification: Cosine or PLDA? 600-604
- Themos Stafylakis, Ladislav Mosner, Oldrich Plchot, Johan Rohdin, Anna Silnova, Lukás Burget, Jan Cernocký:
Training speaker embedding extractors using multi-speaker audio with unknown speaker boundaries. 605-609
- Chau Luu, Steve Renals, Peter Bell:
Investigating the contribution of speaker attributes to speaker separability using disentangled speaker representations. 610-614
- Saurabh Kataria, Jesús Villalba, Laureano Moro-Velázquez, Najim Dehak:
Joint domain adaptation and speech bandwidth extension using time-domain GANs for speaker verification. 615-619
Speech Production
- Tsukasa Yoshinaga, Kikuo Maekawa, Akiyoshi Iida:
Variability in Production of Non-Sibilant Fricative [ç] in /hi/. 620-624
- Sathvik Udupa, Aravind Illa, Prasanta Kumar Ghosh:
Streaming model for Acoustic to Articulatory Inversion with transformer networks. 625-629
- Tsiky Rakotomalala, Pierre Baraduc, Pascal Perrier:
Trajectories predicted by optimal speech motor control using LSTM networks. 630-634
- Daniel R. van Niekerk, Anqi Xu, Branislav Gerazov, Paul Konstantin Krug, Peter Birkholz, Yi Xu:
Exploration strategies for articulatory synthesis of complex syllable onsets. 635-639
- Yoonjeong Lee, Jody Kreiman:
Linguistic versus biological factors governing acoustic voice variation. 640-643
- Takayuki Nagamine:
Acquisition of allophonic variation in second language speech: An acoustic and articulatory study of English laterals by Japanese speakers. 644-648
Speech Quality Assessment
- Pranay Manocha, Anurag Kumar, Buye Xu, Anjali Menon, Israel Dejene Gebru, Vamsi Krishna Ithapu, Paul Calamia:
SAQAM: Spatial Audio Quality Assessment Metric. 649-653
- Pranay Manocha, Anurag Kumar:
Speech Quality Assessment through MOS using Non-Matching References. 654-658
- Hideki Kawahara, Kohei Yatabe, Ken-Ichi Sakakibara, Tatsuya Kitamura, Hideki Banno, Masanori Morise:
An objective test tool for pitch extractors' response attributes. 659-663
- Kai Li, Sheng Li, Xugang Lu, Masato Akagi, Meng Liu, Lin Zhang, Chang Zeng, Longbiao Wang, Jianwu Dang, Masashi Unoki:
Data Augmentation Using McAdams-Coefficient-Based Speaker Anonymization for Fake Audio Detection. 664-668
- Salah Zaiem, Titouan Parcollet, Slim Essid:
Automatic Data Augmentation Selection and Parametrization in Contrastive Self-Supervised Speech Representation Learning. 669-673
- Deebha Mumtaz, Ajit Jena, Vinit Jakhetiya, Karan Nathwani, Sharath Chandra Guntuku:
Transformer-based quality assessment model for generalized user-generated multimedia audio content. 674-678
Language Modeling and Lexical Modeling for ASR
- Christophe Van Gysel, Mirko Hannemann, Ernest Pusateri, Youssef Oualil, Ilya Oparin:
Space-Efficient Representation of Entity-centric Query Language Models. 679-683
- Saket Dingliwal, Ashish Shenoy, Sravan Bodapati, Ankur Gandhe, Ravi Teja Gadde, Katrin Kirchhoff:
Domain Prompts: Towards memory and compute efficient domain adaptation of ASR systems. 684-688
- W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor D. Strohman, Shankar Kumar:
Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition. 689-693
- Theresa Breiner, Swaroop Ramaswamy, Ehsan Variani, Shefali Garg, Rajiv Mathews, Khe Chai Sim, Kilol Gupta, Mingqing Chen, Lara McConnaughey:
UserLibri: A Dataset for ASR Personalization Using Only Text. 694-698
- Chin-Yueh Chien, Kuan-Yu Chen:
A BERT-based Language Modeling Framework. 699-703
Challenges and Opportunities for Signal Processing and Machine Learning for Multiple Smart Devices
- Yoshiki Masuyama, Kouei Yamaoka, Nobutaka Ono:
Joint Optimization of Sampling Rate Offsets Based on Entire Signal Relationship Among Distributed Microphones. 704-708
- Gregory Ciccarelli, Jarred Barber, Arun Nair, Israel Cohen, Tao Zhang:
Challenges and Opportunities in Multi-device Speech Processing. 709-713
- Ameya Agaskar:
Practical Over-the-air Perceptual Acoustic Watermarking. 714-718
- Timm Koppelmann, Luca Becker, Alexandru Nelus, Rene Glitza, Lea Schönherr, Rainer Martin:
Clustering-based Wake Word Detection in Privacy-aware Acoustic Sensor Networks. 719-723
- Francesco Nespoli, Daniel Barreda, Patrick A. Naylor:
Relative Acoustic Features for Distance Estimation in Smart-Homes. 724-728
- Ashutosh Pandey, Buye Xu, Anurag Kumar, Jacob Donley, Paul Calamia, DeLiang Wang:
Time-domain Ad-hoc Array Speech Enhancement Using a Triple-path Network. 729-733
Speech Processing & Measurement
- Arne-Lukas Fietkau, Simon Stone, Peter Birkholz:
Relationship between the acoustic time intervals and tongue movements of German diphthongs. 734-738
- Sanae Matsui, Kyoji Iwamoto, Reiko Mazuka:
Development of allophonic realization until adolescence: A production study of the affricate-fricative variation of /z/ among Japanese children. 739-743
- Chung Soo Ahn, L. L. Chamara Kasun, Sunil Sivadas, Jagath C. Rajapakse:
Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition. 744-748
- Louise Coppieters de Gibson, Philip N. Garner:
Low-Level Physiological Implications of End-to-End Learning for Speech Recognition. 749-753
- Carolina Lins Machado, Volker Dellwo, Lei He:
Idiosyncratic lingual articulation of American English /æ/ and /ɑ/ using network analysis. 754-758
- Teruki Toya, Wenyu Zhu, Maori Kobayashi, Kenichi Nakamura, Masashi Unoki:
Method for improving the word intelligibility of presented speech using bone-conduction headphones. 759-763
- Debasish Ray Mohapatra, Mario Fleischer, Victor Zappi, Peter Birkholz, Sidney S. Fels:
Three-dimensional finite-difference time-domain acoustic analysis of simplified vocal tract shapes. 764-768
- Dorina De Jong, Aldo Pastore, Noël Nguyen, Alessandro D'Ausilio:
Speech imitation skills predict automatic phonetic convergence: a GMM-UBM study on L2. 769-773
- Marc-Antoine Georges, Jean-Luc Schwartz, Thomas Hueber:
Self-supervised speech unit discovery from articulatory and acoustic features using VQ-VAE. 774-778
- Peter Wu, Shinji Watanabe, Louis Goldstein, Alan W. Black, Gopala Krishna Anumanchipalli:
Deep Speech Synthesis from Articulatory Representations. 779-783
- Monica Ashokumar, Jean-Luc Schwartz, Takayuki Ito:
Orofacial somatosensory inputs in speech perceptual training modulate speech production. 784-787
Speech Synthesis: Acoustic Modeling and Neural Waveform Generation I
- Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Sunghwan Ahn, Joun Yeop Lee, Nam Soo Kim:
Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus. 788-792
- Takaaki Saeki, Kentaro Tachibana, Ryuichi Yamamoto:
DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning. 793-797
- Kentaro Mitsui, Kei Sawada:
MSR-NV: Neural Vocoder Using Multiple Sampling Rates. 798-802
- Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani:
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping. 803-807
- Sangjun Park, Kihyun Choo, Joohyung Lee, Anton V. Porov, Konstantin Osipov, June Sig Sung:
Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge. 808-812
- Jae-Sung Bae, Jinhyeok Yang, Taejun Bak, Young-Sun Joo:
Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech. 813-817
- Krishna Subramani, Jean-Marc Valin, Umut Isik, Paris Smaragdis, Arvindh Krishnaswamy:
End-to-end LPCNet: A Neural Vocoder With Fully-Differentiable LPC Estimation. 818-822
- Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman:
EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models. 823-827
- Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis:
Fine-grained Noise Control for Multispeaker Speech Synthesis. 828-832
- Hubert Siuzdak, Piotr Dura, Pol van Rijn, Nori Jacoby:
WavThruVec: Latent speech representation as intermediate features for neural speech synthesis. 833-837
- Ivan Vovk, Tasnima Sadekova, Vladimir Gogoryan, Vadim Popov, Mikhail A. Kudinov, Jiansheng Wei:
Fast Grad-TTS: Towards Efficient Diffusion-Based Speech Generation on CPU. 838-842
- Alexander H. Liu, Cheng-I Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James R. Glass:
Simple and Effective Unsupervised Speech Synthesis. 843-847
- Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda:
Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation Generation. 848-852
Show and Tell I
- Taejin Park, Nithin Rao Koluguri, Fei Jia, Jagadeesh Balam, Boris Ginsburg:
NeMo Open Source Speaker Diarization System. 853-854
- Baihan Lin:
Voice2Alliance: Automatic Speaker Diarization and Quality Assurance of Conversational Alignment. 855-856
- Rishabh Kumar, Devaraja Adiga, Mayank Kothyari, Jatin Dalal, Ganesh Ramakrishnan, Preethi Jyothi:
VAgyojaka: An Annotating and Post-Editing Tool for Automatic Speech Recognition. 857-858
- Alzahra Badi, Chungho Park, Min-Seok Keum, Miguel Alba, Youngsuk Ryu, Jeongmin Bae:
SKYE: More than a conversational AI. 859-860
Spatial Audio
- Hokuto Munakata, Ryu Takeda, Kazunori Komatani:
Training Data Generation with DOA-based Selecting and Remixing for Unsupervised Training of Deep Separation Models. 861-865
- Hangting Chen, Yi Yang, Feng Dang, Pengyuan Zhang:
Beam-Guided TasNet: An Iterative Speech Separation Framework with Multi-Channel Output. 866-870
- Feifei Xiong, Pengyu Wang, Zhongfu Ye, Jinwei Feng:
Joint Estimation of Direction-of-Arrival and Distance for Arrays with Directional Sensors based on Sparse Bayesian Learning. 871-875
- Ho-Hsiang Wu, Magdalena Fuentes, Prem Seetharaman, Juan Pablo Bello:
How to Listen? Rethinking Visual Sound Localization. 876-880
- Zhiheng Ouyang, Miao Wang, Wei-Ping Zhu:
Small Footprint Neural Networks for Acoustic Direction of Arrival Estimation. 881-885
- Xiaoyu Wang, Xiangyu Kong, Xiulian Peng, Yan Lu:
Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation. 886-890
- Haoran Yin, Meng Ge, Yanjie Fu, Gaoyan Zhang, Longbiao Wang, Lei Zhang, Lin Qiu, Jianwu Dang:
MIMO-DoAnet: Multi-channel Input and Multiple Outputs DoA Network with Unknown Number of Sound Sources. 891-895
- Yanjie Fu, Meng Ge, Haoran Yin, Xinyuan Qian, Longbiao Wang, Gaoyan Zhang, Jianwu Dang:
Iterative Sound Source Localization for Unknown Number of Sources. 896-900
- Katharine Patterson, Kevin W. Wilson, Scott Wisdom, John R. Hershey:
Distance-Based Sound Separation. 901-905
- Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang:
VCSE: Time-Domain Visual-Contextual Speaker Extraction Network. 906-910
- Ali Aroudi, Stefan Uhlich, Marc Ferras Font:
TRUNet: Transformer-Recurrent-U Network for Multi-channel Reverberant Sound Source Separation. 911-915
Single-channel Speech Enhancement II
- Xiaofeng Ge, Jiangyu Han, Yanhua Long, Haixin Guan:
PercepNet+: A Phase and SNR Aware PercepNet for Real-Time Speech Enhancement. 916-920
- Zhuangqi Chen, Pingjian Zhang:
Lightweight Full-band and Sub-band Fusion Network for Real Time Speech Enhancement. 921-925
- Jiaming Cheng, Ruiyu Liang, Yue Xie, Li Zhao, Björn W. Schuller, Jie Jia, Yiyuan Peng:
Cross-Layer Similarity Knowledge Distillation for Speech Enhancement. 926-930
- Feifei Xiong, Weiguang Chen, Pengyu Wang, Xiaofei Li, Jinwei Feng:
Spectro-Temporal SubNet for Real-Time Monaural Speech Denoising and Dereverberation. 931-935
- Ruizhe Cao, Sherif Abdulatif, Bin Yang:
CMGAN: Conformer-based Metric GAN for Speech Enhancement. 936-940
- Zeyuan Wei, Li Hao, Xueliang Zhang:
Model Compression by Iterative Pruning with Knowledge Distillation and Its Application to Speech Enhancement. 941-945
- Chenhui Zhang, Xiang Pan:
Single-channel speech enhancement using Graph Fourier Transform. 946-950
- Zilu Guo, Xu Xu, Zhongfu Ye:
Joint Optimization of the Module and Sign of the Spectral Real Part Based on CRN for Speech Denoising. 951-955
- Hao Zhang, Ashutosh Pandey, DeLiang Wang:
Attentive Recurrent Network for Low-Latency Active Noise Control. 956-960
- Jen-Hung Huang, Chung-Hsien Wu:
Memory-Efficient Multi-Step Speech Enhancement with Neural ODE. 961-965
- Xinmeng Xu, Yang Wang, Jie Jia, Binbin Chen, Jianjun Hao:
GLD-Net: Improving Monaural Speech Enhancement by Learning Global and Local Dependency Features with GLD Block. 966-970
- Xinmeng Xu, Yang Wang, Jie Jia, Binbin Chen, Dejun Li:
Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention. 971-975
- Jun Chen, Wei Rao, Zilin Wang, Zhiyong Wu, Yannan Wang, Tao Yu, Shidong Shang, Helen Meng:
Speech Enhancement with Fullband-Subband Cross-Attention Network. 976-980
- Cheng Yu, Szu-Wei Fu, Tsun-An Hsieh, Yu Tsao, Mirco Ravanelli:
OSSEM: one-shot speaker adaptive speech enhancement using meta learning. 981-985
- Wenbin Jiang, Tao Liu, Kai Yu:
Efficient Speech Enhancement with Neural Homomorphic Synthesis. 986-990
- Manthan Thakker, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang:
Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation. 991-995
- Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura:
Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations. 996-1000
Novel Models and Training Methods for ASR II
- Haaris Mehmood, Agnieszka Dobrowolska, Karthikeyan Saravanan, Mete Ozay:
FedNST: Federated Noisy Student Training for Automatic Speech Recognition. 1001-1005 - Li Fu, Xiaoxiao Li, Runyu Wang, Lu Fan, Zhengchen Zhang, Meng Chen, Youzheng Wu, Xiaodong He:
SCaLa: Supervised Contrastive Learning for End-to-End Speech Recognition. 1006-1010 - Yukun Liu, Ta Li, Pengyuan Zhang, Yonghong Yan:
NAS-SCAE: Searching Compact Attention-based Encoders For End-to-end Automatic Speech Recognition. 1011-1015 - Kun Wei, Yike Zhang, Sining Sun, Lei Xie, Long Ma:
Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR. 1016-1020 - Guodong Ma, Pengfei Hu, Nurmemet Yolwas, Shen Huang, Hao Huang:
PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition. 1021-1025 - Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno:
Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition. 1026-1030 - Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach:
Improving Rare Word Recognition with LM-aware MWER Training. 1031-1035 - Mohammad Zeineldeen, Jingjing Xu, Christoph Lüscher, Ralf Schlüter, Hermann Ney:
Improving the Training Recipe for a Robust Conformer-based Hybrid Model. 1036-1040 - Aleksandr Laptev, Somshubra Majumdar, Boris Ginsburg:
CTC Variations Through New WFST Topologies. 1041-1045 - Martin Sustek, Samik Sadhu, Hynek Hermansky:
Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition. 1046-1050 - Chenfeng Miao, Kun Zou, Ziyang Zhuang, Tao Wei, Jun Ma, Shaojun Wang, Jing Xiao:
Towards Efficiently Learning Monotonic Alignments for Attention-based End-to-End Speech Recognition. 1051-1055 - Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker:
On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training. 1056-1060 - Selen Hande Kabil, Hervé Bourlard:
From Undercomplete to Sparse Overcomplete Autoencoders to Improve LF-MMI based Speech Recognition. 1061-1065 - Tomohiro Tanaka, Ryo Masumura, Hiroshi Sato, Mana Ihori, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya:
Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks. 1066-1070 - Takashi Maekaku, Yuya Fujita, Yifan Peng, Shinji Watanabe:
Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR. 1071-1075
Spoken Dialogue Systems and Multimodality
- Naokazu Uchida, Takeshi Homma, Makoto Iwayama, Yasuhiro Sogawa:
Reducing Offensive Replies in Open Domain Dialogue Systems. 1076-1080 - Ting-Wei Wu, Biing-Hwang Juang:
Induce Spoken Dialog Intents via Deep Unsupervised Context Contrastive Clustering. 1081-1085 - Fumio Nihei, Ryo Ishii, Yukiko I. Nakano, Kyosuke Nishida, Ryo Masumura, Atsushi Fukayama, Takao Nakamura:
Dialogue Acts Aided Important Utterance Detection Based on Multiparty and Multimodal Information. 1086-1090 - Dhanush Bekal, Sundararajan Srinivasan, Srikanth Ronanki, Sravan Bodapati, Katrin Kirchhoff:
Contextual Acoustic Barge-In Classification for Spoken Dialog Systems. 1091-1095 - Peilin Zhou, Dading Chong, Helin Wang, Qingcheng Zeng:
Calibrate and Refine! A Novel and Agile Framework for ASR Error Robust Intent Detection. 1096-1100 - Lingyun Feng, Jianwei Yu, Yan Wang, Songxiang Liu, Deng Cai, Haitao Zheng:
ASR-Robust Natural Language Understanding on ASR-GLUE dataset. 1101-1105 - Mai Hoang Dao, Thinh Hung Truong, Dat Quoc Nguyen:
From Disfluency Detection to Intent Detection and Slot Filling. 1106-1110 - Hengshun Zhou, Jun Du, Gongzhen Zou, Zhaoxu Nian, Chin-Hui Lee, Sabato Marco Siniscalchi, Shinji Watanabe, Odette Scharenborg, Jingdong Chen, Shifu Xiong, Jianqing Gao:
Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis. 1111-1115 - Christina Sartzetaki, Georgios Paraskevopoulos, Alexandros Potamianos:
Extending Compositional Attention Networks for Social Reasoning in Videos. 1116-1120 - Shiquan Wang, Yuke Si, Xiao Wei, Longbiao Wang, Zhiqiang Zhuang, Xiaowang Zhang, Jianwu Dang:
TopicKS: Topic-driven Knowledge Selection for Knowledge-grounded Dialogue Generation. 1121-1125 - Andreas Liesenfeld, Mark Dingemanse:
Bottom-up discovery of structure and variation in response tokens ('backchannels') across diverse languages. 1126-1130 - Yi Zhu, Zexun Wang, Hang Liu, Peiying Wang, Mingchao Feng, Meng Chen, Xiaodong He:
Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding. 1131-1135 - Keiko Ochi, Nobutaka Ono, Keiho Owada, Miho Kuroda, Shigeki Sagayama, Hidenori Yamasue:
Use of Nods Less Synchronized with Turn-Taking and Prosody During Conversations in Adults with Autism. 1136-1140
Show and Tell I (VR)
- Denis Ivanko, Dmitry Ryumin, Alexey M. Kashevnik, Alexandr Axyonov, Andrey Kitenko, Igor Lashkov, Alexey Karpov:
DAVIS: Driver's Audio-Visual Speech recognition. 1141-1142
Speech Emotion Recognition I
- Einari Vaaras, Manu Airaksinen, Okko Räsänen:
Analysis of Self-Supervised Learning and Dimensionality Reduction Methods in Clustering-Based Active Learning for Speech Emotion Recognition. 1143-1147 - Chun-Yu Chen, Yun-Shao Lin, Chi-Chun Lee:
Emotion-Shift Aware CRF for Decoding Emotion Sequence in Conversation. 1148-1152 - Bo-Hao Su, Chi-Chun Lee:
Vaccinating SER to Neutralize Adversarial Attacks with Self-Supervised Augmentation Strategy. 1153-1157 - Jack Parry, Eric DeMattos, Anita Klementiev, Axel Ind, Daniela Morse-Kopp, Georgia Clarke, Dimitri Palaz:
Speech Emotion Recognition in the Wild using Multi-task and Adversarial Learning. 1158-1162 - Ashishkumar Prabhakar Gudmalwar, Biplove Basel, Anirban Dutta, Ch V. Rama Rao:
The Magnitude and Phase based Speech Representation Learning using Autoencoder for Classifying Speech Emotions using Deep Canonical Correlation Analysis. 1163-1167 - Lucas Goncalves, Carlos Busso:
Improving Speech Emotion Recognition Using Self-Supervised Learning with Domain-Specific Audiovisual Tasks. 1168-1172
Single-channel Speech Enhancement I
- Yuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani:
SNRi Target Training for Joint Speech Enhancement and Recognition. 1173-1177 - Yutaro Sanada, Takumi Nakagawa, Yuichiro Wada, Kosaku Takanashi, Yuhui Zhang, Kiichi Tokuyama, Takafumi Kanamori, Tomonori Yamada:
Deep Self-Supervised Learning of Speech Denoising from Noisy Speeches. 1178-1182 - Chi-Chang Lee, Cheng-Hung Hu, Yu-Chen Lin, Chu-Song Chen, Hsin-Min Wang, Yu Tsao:
NASTAR: Noise Adaptive Speech Enhancement with Target-Conditional Resampling. 1183-1187 - Ivan Shchekotov, Pavel K. Andreev, Oleg Ivanov, Aibek Alanov, Dmitry P. Vetrov:
FFC-SE: Fast Fourier Convolution for Speech Enhancement. 1188-1192 - Or Tal, Moshe Mandel, Felix Kreuk, Yossi Adi:
A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement. 1193-1197 - Wooseok Shin, Hyun Joon Park, Jin Sob Kim, Byung Hoon Lee, Sung Won Han:
Multi-View Attention Transfer for Efficient Speech Enhancement. 1198-1202
Speech Synthesis: New Applications
- Nabarun Goswami, Tatsuya Harada:
SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate. 1203-1207 - Talia Ben Simon, Felix Kreuk, Faten Awwad, Jacob T. Cohen, Joseph Keshet:
Correcting Mispronunciations in Speech using Spectrogram Inpainting. 1208-1212 - Jason Fong, Daniel Lyth, Gustav Eje Henter, Hao Tang, Simon King:
Speech Audio Corrector: using speech from non-target speakers for one-off correction of mispronunciations in grapheme-input text-to-speech. 1213-1217 - Wen-Chin Huang, Dejan Markovic, Alexander Richard, Israel Dejene Gebru, Anjali Menon:
End-to-End Binaural Speech Synthesis. 1218-1222 - Julia Koch, Florian Lux, Nadja Schauffler, Toni Bernhart, Felix Dieterle, Jonas Kuhn, Sandra Richter, Gabriel Viehhauser, Ngoc Thang Vu:
PoeticTTS - Controllable Poetry Reading for Literary Studies. 1223-1227 - Paul Konstantin Krug, Peter Birkholz, Branislav Gerazov, Daniel Rudolph van Niekerk, Anqi Xu, Yi Xu:
Articulatory Synthesis for Data Augmentation in Phoneme Recognition. 1228-1232
Spoken Language Understanding I
- Jihyun Lee, Gary Geunbae Lee:
SF-DST: Few-Shot Self-Feeding Reading Comprehension Dialogue State Tracking with Auxiliary Task. 1233-1237 - Oralie Cattan, Sahar Ghannay, Christophe Servan, Sophie Rosset:
Benchmarking Transformers-based models on French Spoken Language Understanding tasks. 1238-1242 - Seong-Hwan Heo, WonKee Lee, Jong-Hyeok Lee:
mcBERT: Momentum Contrastive Learning with BERT for Zero-Shot Slot Filling. 1243-1247 - Pu Wang, Hugo Van hamme:
Bottleneck Low-rank Transformers for Low-resource Spoken Language Understanding. 1248-1252 - Anirudh Raju, Milind Rao, Gautam Tiwari, Pranav Dheram, Bryan Anderson, Zhe Zhang, Chul Lee, Bach Bui, Ariya Rastrow:
On joint training with interfaces for spoken language understanding. 1253-1257 - Vineet Garg, Ognjen Rudovic, Pranay Dighe, Ahmed Hussen Abdelaziz, Erik Marchi, Saurabh Adya, Chandra Dhir, Ahmed H. Tewfik:
Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models. 1258-1262
Inclusive and Fair Speech Technologies I
- Perez Ogayo, Graham Neubig, Alan W. Black:
Building African Voices. 1263-1267 - Pranav Dheram, Murugesan Ramakrishnan, Anirudh Raju, I-Fan Chen, Brian King, Katherine Powell, Melissa Saboowala, Karan Shetty, Andreas Stolcke:
Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities. 1268-1272 - May Pik Yu Chan, June Choe, Aini Li, Yiran Chen, Xin Gao, Nicole R. Holliday:
Training and typological bias in ASR performance for world Englishes. 1273-1277
Inclusive and Fair Speech Technologies II
- Marcely Zanon Boito, Laurent Besacier, Natalia A. Tomashenko, Yannick Estève:
A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems. 1278-1282 - Alexander Johnson, Kevin Everson, Vijay Ravi, Anissa Gladney, Mari Ostendorf, Abeer Alwan:
Automatic Dialect Density Estimation for African American English. 1283-1287 - Kunnar Kukk, Tanel Alumäe:
Improving Language Identification of Accented Speech. 1288-1292 - Wiebke Toussaint, Lauriane Gorce, Aaron Yi Ding:
Design Guidelines for Inclusive Speaker Verification Evaluation Datasets. 1293-1297 - Viet Anh Trinh, Pegah Ghahremani, Brian John King, Jasha Droppo, Andreas Stolcke, Roland Maas:
Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation. 1298-1302
Phonetics I
- Takuya Kunihara, Chuanbo Zhu, Nobuaki Minematsu, Noriko Nakanishi:
Gradual Improvements Observed in Learners' Perception and Production of L2 Sounds Through Continuing Shadowing Practices on a Daily Basis. 1303-1307 - Christin Kirchhübel, Georgina Brown:
Spoofed speech from the perspective of a forensic phonetician. 1308-1312 - Hae-Sung Jeon, Stephen Nichols:
Investigating Prosodic Variation in British English Varieties using ProPer. 1313-1317 - Hyun Kyung Hwang, Manami Hirayama, Takaomi Kato:
Perceived prominence and downstep in Japanese. 1318-1321 - Andrea Alicehajic, Silke Hamann:
The discrimination of [zi]-[dʑi] by Japanese listeners and the prospective phonologization of /zi/. 1322-1326 - Ingo Langheinrich, Simon Stone, Xinyu Zhang, Peter Birkholz:
Glottal inverse filtering based on articulatory synthesis and deep learning. 1327-1331 - Bogdan Ludusan, Marin Schröer, Petra Wagner:
Investigating phonetic convergence of laughter in conversation. 1332-1336 - Véronique Delvaux, Audrey Lavallée, Fanny Degouis, Xavier Saloppe, Jean-Louis Nandrino, Thierry Pham:
Telling self-defining memories: An acoustic study of natural emotional speech productions. 1337-1341 - Laura Spinu, Ioana Vasilescu, Lori Lamel, Jason Lilley:
Voicing neutralization in Romanian fricatives across different speech styles. 1342-1346 - Sishi Liao, Phil Hoole, Conceição Cunha, Esther Kunay, Aletheia Cui, Lia Saki Bucar Shigemori, Felicitas Kleber, Dirk Voit, Jens Frahm, Jonathan Harrington:
Nasal Coda Loss in the Chengdu Dialect of Mandarin: Evidence from RT-MRI. 1347-1351 - Philipp Buech, Simon Roessig, Lena Pagel, Doris Mücke, Anne Hermes:
ema2wav: doing articulation by Praat. 1352-1356
Multi-, Cross-lingual and Other Topics in ASR I
- Lars Rumberg, Christopher Gebauer, Hanna Ehlert, Ulrike Lüdtke, Jörn Ostermann:
Improving Phonetic Transcriptions of Children's Speech by Pronunciation Modelling with Constrained CTC-Decoding. 1357-1361 - Soky Kak, Sheng Li, Masato Mimura, Chenhui Chu, Tatsuya Kawahara:
Leveraging Simultaneous Translation for Enhancing Transcription of Low-resource Language via Cross Attention Mechanism. 1362-1366 - Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol:
KSC2: An Industrial-Scale Open-Source Kazakh Speech Corpus. 1367-1371 - Tünde Szalay, Mostafa Ali Shahin, Beena Ahmed, Kirrie J. Ballard:
Knowledge of accent differences can be used to predict speech recognition. 1372-1376 - Maximilian Karl Scharf, Sabine Hochmuth, Lena L. N. Wong, Birger Kollmeier, Anna Warzybok:
Lombard Effect for Bilingual Speakers in Cantonese and English: importance of spectro-temporal features. 1377-1381 - Martin Flechl, Shou-Chun Yin, Junho Park, Peter Skala:
End-to-end speech recognition modeling from de-identified data. 1382-1386 - Aditya Yadavalli, Mirishkar Sai Ganesh, Anil Kumar Vuppala:
Multi-Task End-to-End Model for Telugu Dialect and Speech Recognition. 1387-1391 - Jiamin Xie, John H. L. Hansen:
DEFORMER: Coupling Deformed Localized Patterns with Global Context for Robust End-to-end Speech Recognition. 1392-1396
Zero, low-resource and multi-modal speech recognition I
- Yuna Lee, Seung Jun Baek:
Keyword Spotting with Synthetic Data using Heterogeneous Knowledge Distillation. 1397-1401 - Maureen de Seyssel, Marvin Lavechin, Yossi Adi, Emmanuel Dupoux, Guillaume Wisniewski:
Probing phoneme, language and speaker information in unsupervised speech representations. 1402-1406 - Andrei Bîrladeanu, Helen Minnis, Alessandro Vinciarelli:
Automatic Detection of Reactive Attachment Disorder Through Turn-Taking Analysis in Clinical Child-Caregiver Sessions. 1407-1410 - Eesung Kim, Jae-Jin Jeon, Hyeji Seo, Hoon Kim:
Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning. 1411-1415 - Tyler Miller, David Harwath:
Exploring Few-Shot Fine-Tuning Strategies for Models of Visually Grounded Speech. 1416-1420 - Dongseong Hwang, Khe Chai Sim, Zhouyuan Huo, Trevor Strohman:
Pseudo Label Is Better Than Human Label. 1421-1425 - Werner van der Merwe, Herman Kamper, Johan Adam du Preez:
A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery. 1426-1430
Speaker Embedding and Diarization
- Siqi Zheng, Hongbin Suo, Qian Chen:
PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker Verification. 1431-1435 - Xiaoyi Qin, Na Li, Chao Weng, Dan Su, Ming Li:
Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings. 1436-1440 - Weiqing Wang, Ming Li, Qingjian Lin:
Online Target Speaker Voice Activity Detection for Speaker Diarization. 1441-1445 - Niko Brümmer, Albert Swart, Ladislav Mošner, Anna Silnova, Oldřich Plchot, Themos Stafylakis, Lukáš Burget:
Probabilistic Spherical Discriminant Analysis: An Alternative to PLDA for length-normalized embeddings. 1446-1450 - Bin Gu:
Deep speaker embedding with frame-constrained training strategy for speaker verification. 1451-1455 - Yifan Chen, Yifan Guo, Qingxuan Li, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan:
Interrelate Training and Searching: A Unified Online Clustering Framework for Speaker Diarization. 1456-1460 - Mao-Kui He, Jun Du, Chin-Hui Lee:
End-to-End Audio-Visual Neural Speaker Diarization. 1461-1465 - Yanyan Yue, Jun Du, Mao-Kui He, Yu Ting Yeung, Renyu Wang:
Online Speaker Diarization with Core Samples Selection. 1466-1470 - Chenyu Yang, Yu Wang:
Robust End-to-end Speaker Diarization with Generic Neural Clustering. 1471-1475 - Tao Liu, Shuai Fan, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Yanmin Qian, Kai Yu:
MSDWild: Multi-modal Speaker Diarization Dataset in the Wild. 1476-1480 - Md. Iftekhar Tanveer, Diego Casabuena, Jussi Karlgren, Rosie Jones:
Unsupervised Speaker Diarization that is Agnostic to Language, Overlap-Aware, and Tuning Free. 1481-1485 - Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Böddeker, Reinhold Haeb-Umbach:
Utterance-by-utterance overlap-aware neural diarization with Graph-PIT. 1486-1490 - Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Feng Tong, Lin Li, Qingyang Hong:
Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting. 1491-1495
Acoustic Event Detection and Classification
- Yunhao Liang, Yanhua Long, Yijie Li, Jiaen Liang:
Selective Pseudo-labeling and Class-wise Discriminative Fusion for Sound Event Detection. 1496-1500 - Peng Liu, Songbin Li, Jigang Tang:
An End-to-End Macaque Voiceprint Verification Method Based on Channel Fusion Mechanism. 1501-1505 - Liang Xu, Jing Wang, Lizhong Wang, Sijun Bi, Jianqian Zhang, Qiuyue Ma:
Human Sound Classification based on Feature Fusion Method with Air and Bone Conducted Signal. 1506-1510 - Dongchao Yang, Helin Wang, Zhongjie Ye, Yuexian Zou, Wenwu Wang:
RaDur: A Reference-aware and Duration-robust Network for Target Sound Detection. 1511-1515 - Achyut Mani Tripathi, Konark Paul:
Temporal Self Attention-Based Residual Network for Environmental Sound Classification. 1516-1520 - Juncheng Li, Shuhui Qu, Po-Yao Huang, Florian Metze:
AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification. 1521-1525 - Helin Wang, Dongchao Yang, Chao Weng, Jianwei Yu, Yuexian Zou:
Improving Target Sound Extraction with Timestamp Information. 1526-1530 - Ying Hu, Xiujuan Zhu, Yunlong Li, Hao Huang, Liang He:
A Multi-grained based Attention Network for Semi-supervised Sound Event Detection. 1531-1535 - Sangwook Park, Sandeep Reddy Kothinti, Mounya Elhilali:
Temporal coding with magnitude-phase regularization for sound event detection. 1536-1540 - Nian Shao, Erfan Loweimi, Xiaofei Li:
RCT: Random consistency training for semi-supervised sound event detection. 1541-1545 - Yifei Xin, Dongchao Yang, Yuexian Zou:
Audio Pyramid Transformer with Domain Adaption for Weakly Supervised Sound Event Detection and Audio Classification. 1546-1550 - Yu Wang, Mark Cartwright, Juan Pablo Bello:
Active Few-Shot Learning for Sound Event Detection. 1551-1555 - Tong Ye, Shijing Si, Jianzong Wang, Ning Cheng, Jing Xiao:
Uncertainty Calibration for Deep Audio Classifiers. 1556-1560 - Yuanbo Hou, Dick Botteldooren:
Event-related data conditioning for acoustic event classification. 1561-1565
Speech Synthesis: Acoustic Modeling and Neural Waveform Generation II
- Haohan Guo, Hui Lu, Xixin Wu, Helen Meng:
A Multi-Scale Time-Frequency Spectrogram Discriminator for GAN-based Non-Autoregressive TTS. 1566-1570 - Dacheng Yin, Chuanxin Tang, Yanqing Liu, Xiaoqiang Wang, Zhiyuan Zhao, Yucheng Zhao, Zhiwei Xiong, Sheng Zhao, Chong Luo:
RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion. 1571-1575 - Manh Luong, Viet-Anh Tran:
FlowVocoder: A small Footprint Neural Vocoder based Normalizing Flow for Speech Synthesis. 1576-1580 - Yanqing Liu, Ruiqing Xue, Lei He, Xu Tan, Sheng Zhao:
DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders. 1581-1585 - Xin Yuan, Robin Feng, Mingming Ye, Cheng Tuo, Minghang Zhang:
AdaVocoder: Adaptive Vocoder for Custom Voice. 1586-1590 - Shengyuan Xu, Wenxiao Zhao, Jing Guo:
RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses. 1591-1595 - Chenpeng Du, Yiwei Guo, Xie Chen, Kai Yu:
VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature. 1596-1600 - Mengnan He, Tingwei Guo, Zhenxing Lu, Ruixiong Zhang, Caixia Gong:
Improving GAN-based vocoder for fast and high-quality speech synthesis. 1601-1605 - Yuanhao Yi, Lei He, Shifeng Pan, Xi Wang, Yuchao Zhang:
SoftSpeech: Unsupervised Duration Model in FastSpeech 2. 1606-1610 - Haohan Guo, Feng-Long Xie, Frank K. Soong, Xixin Wu, Helen Meng:
A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS. 1611-1615 - Yuhan Li, Ying Shen, Dongqing Wang, Lin Zhang:
SiD-WaveFlow: A Low-Resource Vocoder Independent of Prior Knowledge. 1616-1620 - Takeru Gorai, Daisuke Saito, Nobuaki Minematsu:
Text-to-speech synthesis using spectral modeling based on non-negative autoencoder. 1621-1625 - Hiroki Kanagawa, Yusuke Ijima, Hiroyuki Toda:
Joint Modeling of Multi-Sample and Subband Signals for Fast Neural Vocoding on CPU. 1626-1630 - Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki:
MISRNet: Lightweight Neural Vocoder Using Multi-Input Single Shared Residual Blocks. 1631-1635 - Chenfeng Miao, Ting Chen, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao:
A compact transformer-based GAN vocoder. 1636-1640 - Hideyuki Tachibana, Muneyoshi Inahara, Mocho Go, Yotaro Katayama, Yotaro Watanabe:
Diffusion Generative Vocoder for Fullband Speech Synthesis Based on Weak Third-order SDE Solver. 1641-1645
ASR: Architecture and Search
- Ehsan Variani, Michael Riley, David Rybach, Cyril Allauzen, Tongzhou Chen, Bhuvana Ramabhadran:
On Adaptive Weight Interpolation of the Hybrid Autoregressive Transducer. 1646-1650 - Ting-Wei Wu, I-Fan Chen, Ankur Gandhe:
Learning to rank with BERT-based confidence models in ASR rescoring. 1651-1655 - Jiatong Shi, George Saon, David Haws, Shinji Watanabe, Brian Kingsbury:
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States. 1656-1660 - Binbin Zhang, Di Wu, Zhendong Peng, Xingchen Song, Zhuoyuan Yao, Hang Lv, Lei Xie, Chao Yang, Fuping Pan, Jianwei Niu:
WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit. 1661-1665 - Yufei Liu, Rao Ma, Haihua Xu, Yi He, Zejun Ma, Weibin Zhang:
Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR. 1666-1670 - Zehan Li, Haoran Miao, Keqi Deng, Gaofeng Cheng, Sanli Tian, Ta Li, Yonghong Yan:
Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies. 1671-1675 - Ye Bai, Jie Li, Wenjing Han, Hao Ni, Kaituo Xu, Zhuo Zhang, Cheng Yi, Xiaorui Wang:
Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition. 1676-1680 - Zhanheng Yang, Sining Sun, Jin Li, Xiaoming Zhang, Xiong Wang, Long Ma, Lei Xie:
CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework based on Cascaded Transducer-Transformer. 1681-1685 - Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, Haizhou Li:
LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT. 1686-1690 - Jash Rathod, Nauman Dawalatabad, Shatrughan Singh, Dhananjaya Gowda:
Multi-stage Progressive Compression of Conformer Transducer for On-device Speech Recognition. 1691-1695 - Weiran Wang, Ke Hu, Tara N. Sainath:
Streaming Align-Refine for Non-autoregressive Deliberation. 1696-1700 - Rongmei Lin, Yonghui Xiao, Tien-Ju Yang, Ding Zhao, Li Xiong, Giovanni Motta, Françoise Beaufays:
Federated Pruning: Improving Neural Network Efficiency with Federated Learning. 1701-1705 - Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman:
A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes. 1706-1710 - Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov:
4-bit Conformer with Native Quantization Aware Training for Speech Recognition. 1711-1715 - Qiang Xu, Tongtong Song, Longbiao Wang, Hao Shi, Yuqin Lin, Yongjie Lv, Meng Ge, Qiang Yu, Jianwu Dang:
Self-Distillation Based on High-level Information Supervision for Compressing End-to-End ASR Model. 1716-1720
Spoken Language Processing II
- Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobu Morioka:
Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation. 1721-1725 - Linh The Nguyen, Nguyen Luong Tran, Long Doan, Manh Luong, Dat Quoc Nguyen:
A High-Quality and Large-Scale Dataset for English-Vietnamese Speech Translation. 1726-1730 - Qian Wang, Chen Wang, Jiajun Zhang:
Investigating Parameter Sharing in Multilingual Speech Translation. 1731-1735 - Zehui Yang, Yifan Chen, Lei Luo, Runyan Yang, Lingxuan Ye, Gaofeng Cheng, Ji Xu, Yaohui Jin, Qingqing Zhang, Pengyuan Zhang, Lei Xie, Yonghong Yan:
Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset. 1736-1740