ABSTRACT

Speech recognition and language understanding are among the most critical tasks in working with language models (LMs). At present, various end-to-end learning models perform speech recognition using unidirectional or bidirectional language models. Despite their theoretical advantages over conventional approaches, these models have not delivered the expected accuracy gains. BERT (Bidirectional Encoder Representations from Transformers), a recently proposed pre-trained language representation model from Google's AI team, consists of a multi-layer bidirectional Transformer encoder and, when pre-trained on a huge corpus, achieves much better accuracy than purely unidirectional or shallowly bidirectional approaches. In natural language processing (NLP), such models support both language understanding (LU) and language generation (LG). In this study, we design a model that extracts text from speech based on classification ranking and then uses BERT to analyze the context and semantics of the top-ranked candidate sentences. BERT reads each sentence bidirectionally, interpreting every word from both its left and right context, and assigns a relevance score based on the meaning of the entire sentence and the words surrounding each token. We observe that using a pre-trained model reduces processing time while improving the accuracy and turnaround time of an end-to-end speech recognition system.
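
The core step described above, scoring top-ranked transcription candidates with BERT, can be illustrated with a minimal sketch. It assumes the HuggingFace transformers library and the public bert-base-uncased checkpoint (both our choices for illustration, not specified by this work), and uses masked-LM pseudo-log-likelihood, one standard way to turn BERT into a sentence scorer; the exact scoring function used in the study may differ.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Assumed model choice; any pre-trained BERT checkpoint would work.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Score a sentence by masking each token in turn and summing the
    log-probability BERT assigns to the original token, so every word
    is judged from both its left and right context."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    # Skip the special [CLS] (first) and [SEP] (last) tokens.
    for i in range(1, len(ids) - 1):
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[ids[i]].item()
    return total

# Hypothetical top-ranked hypotheses from an upstream speech recognizer.
hypotheses = ["recognize speech", "wreck a nice beach"]
best = max(hypotheses, key=pseudo_log_likelihood)
print(best)
```

In this sketch the recognizer's ranked candidates are re-scored by the pre-trained LM, and the hypothesis that is most plausible as a whole sentence is selected.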