The 18th AI Frontier Student Forum: Frontier Techniques in Speech Recognition

In recent years, intelligent speech technology has entered a period of rapid growth. Speech recognition, an important branch of the speech field, has attracted wide attention; how to improve acoustic modeling and how to perform end-to-end joint optimization are central topics in the area. The event was hosted by the AI Frontier Student Forum.

References

Exploring Neural Transducers for End-to-End Speech Recognition

Query-by-example keyword spotting using long short-term memory networks

Lattice Indexing for Spoken Term Detection

Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer

语音关键词检测方法综述 (A Survey of Spoken Keyword Detection Methods)

PPT

Abstract

With the advent of smart speakers, voice assistants, and similar applications, ordinary people can now talk to machines much as in science fiction. Spoken keyword detection is a key technology for human-machine voice interaction and is widely used in smart devices and spoken-document retrieval systems. It comes in two flavors: keyword spotting, used for device wake-up and device control, and spoken term detection, used for retrieval in spoken documents. Although the names are similar, the two differ in both functional focus and technical approach. This talk surveys the main methods and recent advances in spoken keyword detection.

An Introduction to Spoken Keyword Detection

A small subfield of speech processing: a survey on keyword spotting (also known as keyword search or spoken term detection).

The mainstream topics are speech recognition and speech enhancement; keyword detection is becoming increasingly important.

The task: detect a keyword within a stretch of continuous speech. (A kind of anomaly detection?)

Examples:

Smart voice devices: wake-word detection.

  • Keywords are fixed by the manufacturer
  • Low memory
  • Low computation
  • Low power consumption

Speech retrieval, i.e. keyword search (detect the keyword in long-form audio and use it to extract the relevant information segment).

  • Queries vary from search to search
  • Hits must be located within long recordings
  • Out-of-vocabulary (OOV) queries: new knowledge tends to arrive as new words, which is thorny

HMM-Based Keyword Detection

Exploring Neural Transducers for End-to-End Speech Recognition

In this work, we perform an empirical comparison among the CTC, RNN-Transducer, and attention-based Seq2Seq models for end-to-end speech recognition. We show that, without any language model, Seq2Seq and RNN-Transducer models both outperform the best reported CTC models with a language model, on the popular Hub5’00 benchmark. On our internal diverse dataset, these trends continue - RNN Transducer models rescored with a language model after beam search outperform our best CTC models. These results simplify the speech recognition pipeline so that decoding can now be expressed purely as neural network operations. We also study how the choice of encoder architecture affects the performance of the three models - when all encoder layers are forward only, and when encoders downsample the input representation aggressively.

Filler models: a frame-level sequence-labeling problem. Non-keyword speech is called the "garbage" part, and keyword and non-keyword segments are modeled separately.

In 1989, one HMM was built per keyword (the filler-model approach).

The observation probabilities can be modeled with GMMs or DNNs.

HMM based:

A 2017 HMM schematic: each phoneme gets its own acoustic model; the decoding graph is hand-designed.

Somewhat reminiscent of the KMP string-matching algorithm; a Turing-style automaton?

DNN based:

Split the continuous speech into windows and classify each window with a neural network, predicting the probability that the segment is the keyword.

Smoothing of the frame posteriors is usually needed as well.

No HMM-style decoding (dynamic programming) is required; see the sketch below.
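
The following is a minimal NumPy sketch of this windowed scheme, assuming a model that already emits per-frame keyword posteriors; the window sizes and the geometric-mean combination are illustrative choices loosely following the 2014 DNN-based KWS recipe, not a fixed standard.

```python
import numpy as np

def smooth_posteriors(post, w_smooth=30):
    """Moving-average smoothing of per-frame posteriors.

    post: (T, K) float array, one row of K keyword-class posteriors per frame.
    """
    smoothed = np.zeros_like(post)
    for t in range(len(post)):
        lo = max(0, t - w_smooth + 1)
        smoothed[t] = post[lo:t + 1].mean(axis=0)
    return smoothed

def confidence(smoothed, w_max=100):
    """Per-frame detection confidence over a sliding window: combine each
    keyword class's best smoothed posterior via a geometric mean."""
    T, K = smoothed.shape
    conf = np.zeros(T)
    for t in range(T):
        lo = max(0, t - w_max + 1)
        conf[t] = np.prod(smoothed[lo:t + 1].max(axis=0)) ** (1.0 / K)
    return conf  # declare a hit wherever conf crosses a tuned threshold
```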

Difficulty: suitable existing corpora are hard to find.

2014

Query-by-Example Keyword Detection (QbE)

Query-by-example keyword spotting using long short-term memory networks

We present a novel approach to query-by-example keyword spotting (KWS) using a long short-term memory (LSTM) recurrent neural network-based feature extractor. In our approach, we represent each keyword using a fixed-length feature vector obtained by running the keyword audio through a word-based LSTM acoustic model. We use the activations prior to the softmax layer of the LSTM as our keyword-vector. At runtime, we detect the keyword by extracting the same feature vector from a sliding window and computing a simple similarity score between this test vector and the keyword vector. With clean speech, we achieve 86% relative false rejection rate reduction at 0.5% false alarm rate when compared to a competitive phoneme posteriorgram with dynamic time warping KWS system, while the reduction in the presence of babble noise is 67%. Our system has a small memory footprint, low computational cost, and high precision, making it suitable for on-device applications.

The core operation is matching.

The keyword is stored as a pattern (template).

Keywords can be personalized by the user.

DTW based (dynamic time warping, a dynamic-programming algorithm):

Originally developed for speech recognition.

DTW computes the similarity between two time series (a minimal sketch follows the list below):

  • Warp the two sequences onto a common time axis
  • Compute the alignment by dynamic programming (cf. longest common subsequence?)
    • Normalize at each step
    • The path must be monotonic and continuous

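A minimal sketch of the DTW recursion itself; the Euclidean frame distance and the path-length normalization are illustrative choices.

```python
import numpy as np

def dtw(x, y):
    """DTW cost between feature sequences x (n, d) and y (m, d).

    The recursion enforces a monotonic, continuous warping path and
    normalizes the accumulated cost by the path-length bound n + m.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            # allowed moves: diagonal match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m] / (n + m)
```
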
Classic DTW papers: 1975, 1978.

Using DTW for KWS requires some modifications:

  • Segment-based (two variants): slide a window over the audio, as in computer vision
  • Segmentation-free (two variants): first greedily find a promising window, then match inside it (might this get stuck in a local optimum?)

Feature representations:

  • MFCC, FBANK
  • Posteriorgrams (from GMMs or DNNs)
  • DNN embeddings, autoencoders

Drawbacks of DTW:

  • Polynomial running time (quadratic in sequence length)
  • Accuracy may suffer

Embedding based (neural networks):

Encode the keyword and the test audio as fixed-length vectors and compute their similarity.

Chen (2015): pretrain a word classifier (softmax output) and reuse it as a feature extractor (productized as Snowboy).

Siamese (twin) networks, convolutional, weakly supervised (2016).
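
A minimal sketch of embedding-based QbE detection. Here `embed` stands in for whatever fixed-length extractor is used (e.g., the pre-softmax LSTM activations of the paper above); the window, hop, and threshold values are hypothetical.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def detect(keyword_vec, frames, embed, win=80, hop=10, threshold=0.85):
    """Slide a window over the test audio features (T, d), embed each
    window, and report where it is similar enough to the keyword vector."""
    hits = []
    for start in range(0, len(frames) - win + 1, hop):
        test_vec = embed(frames[start:start + win])
        score = cosine(keyword_vec, test_vec)
        if score >= threshold:
            hits.append((start, score))
    return hits
```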

Keyword Detection Based on Large-Vocabulary Speech Recognition (ASR)

Lattice Indexing for Spoken Term Detection

This paper considers the problem of constructing an efficient inverted index for the spoken term detection (STD) task. More specifically, we construct a deterministic weighted finite-state transducer storing soft-hits in the form of (utterance ID, start time, end time, posterior score) quadruplets. We propose a generalized factor transducer structure which retains the time information necessary for performing STD. The required information is embedded into the path weights of the factor transducer without disrupting the inherent optimality. We also describe how to index all substrings seen in a collection of raw automatic speech recognition lattices using the proposed structure. Our STD indexing/search implementation is built upon the OpenFst Library and is designed to scale well to large problems. Experiments on Turkish and English data sets corroborate our claims.

LVCSR-based methods.

Core idea: speech recognition plus text indexing.

Challenges:

  • ASR accuracy: how to cope with recognition errors
    • Index sub-optimal hypotheses too (the keyword may appear only on an errorful path)
  • Techniques for locating the keyword in time
    • Lattices: keep the recognizer's best and near-best hypotheses

WFST: weighted finite-state transducer. A WFST is a directed graph:

  • Nodes: three kinds of states (initial, ordinary, final).
  • Arcs: each carries an input label, an output label, and a weight.

Effect: a WFST maps a string to a sequence (somewhat like a Bayesian network). It can also be used to represent a single string.

Composition: if T1 maps A to B and T2 maps B to C, then T1 ∘ T2 maps A to C.

Union: merge the initial states.

Together these implement the pronunciation-to-word-sequence mapping.
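
A toy sketch of epsilon-free composition in the tropical semiring, just to make the "T1: A to B; T2: B to C" rule concrete; real toolkits such as OpenFst also handle epsilon labels, determinization, and weight pushing.

```python
from collections import defaultdict

# A toy WFST: arcs[state] = list of (in_label, out_label, weight, next_state).
# Weights are tropical-semiring costs, so they add along a path.

def compose(arcs1, arcs2):
    """Pair up states of T1 and T2, keeping an arc whenever T1's output
    label matches T2's input label (epsilon-free composition)."""
    arcs = defaultdict(list)
    for s1, a1 in arcs1.items():
        for s2, a2 in arcs2.items():
            for (i1, o1, w1, n1) in a1:
                for (i2, o2, w2, n2) in a2:
                    if o1 == i2:
                        arcs[(s1, s2)].append((i1, o2, w1 + w2, (n1, n2)))
    return dict(arcs)

T1 = {0: [("a", "b", 0.5, 1)]}   # T1 maps 'a' to 'b'
T2 = {0: [("b", "c", 0.3, 1)]}   # T2 maps 'b' to 'c'
print(compose(T1, T2))           # {(0, 0): [('a', 'c', 0.8, (1, 1))]}
```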

Lattice: a complex WFST network that serves as the recognizer's representation of its hypotheses (the object we index).

Factor automata: v is a factor (substring) of u if u = xvy.

  • Accepts all substrings

Timed factor transducer (TFT), 2011:

  • A matching automaton for query keywords
  • Preserves time intervals
  • Preserves the posterior scores (a probabilistic model)

Building the lattice index: convert each lattice into a TFT, take their union, and optimize. This shrinks the index while keeping query posteriors and timing information.

The OOV (out-of-vocabulary) problem: even more severe in keyword search.

  • Back off to finer units: phonemes, characters
    • Perhaps knowledge graphs could help here?
  • Proxy words (Chen, 2013)
    • Replace the keyword with in-vocabulary words of similar pronunciation
    • Encode the OOV keyword as a WFST
      • Apply shortest-path algorithms
      • FST speedups? (parallel computation); an advantage for similarity evaluation
    • Expand the query into K in-vocabulary proxy words

Advances

Model compression (Amazon, 2017):

TDNN architecture: layers share sub-sampled computations, greatly reducing the amount of computation.

Matrix factorization with SVD (for compression); see the sketch below.
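
A minimal NumPy sketch of SVD-based weight compression; the matrix size and rank are illustrative.

```python
import numpy as np

def svd_compress(W, rank):
    """Replace W (m, n) by the low-rank product U_r @ V_r, shrinking a
    matrix-vector product from m*n to rank*(m + n) multiplies."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into U
    V_r = Vt[:rank, :]
    return U_r, V_r

W = np.random.randn(512, 512)
U_r, V_r = svd_compress(W, rank=64)        # ~4x fewer parameters here
rel_err = np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W)
```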

Computing similarity between heterogeneous patterns (Audhkhasi, 2017):

Classify the text and the acoustics separately, then compare the two classification results against each other (model fusion).

Similarity-image classification for query-by-example KWS:

Drop the sliding window and classify the whole similarity image directly with a CNN (a pattern-recognition view).

  • Similar in spirit to audio fingerprinting, but the image's height and width differ greatly (one dimension is quite limited)
  • And speech tends to produce warped, slanted match lines rather than straight ones

Streaming Seq2Seq models for KWS (He Y., 2017):

  • A blank symbol resolves the input/output length mismatch
  • Uncertain phonemes are mapped to blank
  • Vanilla CTC does not model relationships between labels
    • Use an RNN-Transducer for joint-network prediction
  • Add a keyword encoder and attention (something like an auxiliary memory module), as sketched after this list
    • Affine transform + inner product + softmax + weighted sum
    • In industry there is typically no pre-stored corpus for custom keywords, so HMM systems remain mainstream
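
A minimal PyTorch sketch of that affine + inner-product + softmax + weighted-sum recipe; the tensor names and shapes are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def keyword_attention(enc, kw, W, b):
    """Attend from acoustic encoder states to keyword encoder states.

    enc: (T, d) acoustic encoder outputs
    kw:  (U, d) keyword encoder outputs (the external 'memory')
    W, b: affine parameters of shapes (d, d) and (d,)
    """
    q = enc @ W + b                        # affine transform
    scores = q @ kw.T                      # inner products, (T, U)
    weights = torch.softmax(scores, dim=-1)
    return weights @ kw                    # weighted sum, (T, d)
```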

Summary

Keyword spotting: mostly about meeting tight resource constraints and managing complexity.

Spoken term detection: the OOV problem.

Hashing? Hard to compose and to parallelize. (Similarity hashing?)

Acoustic Modeling with the RNN-Transducer

PPT

Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer

We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We explore various model architectures and demonstrate how the model can be improved further if additional text or pronunciation data are available. The model consists of an ‘encoder’, which is initialized from a connectionist temporal classification-based (CTC) acoustic model, and a ‘decoder’ which is partially initialized from a recurrent neural network language model trained on text data alone. The entire neural network is trained with the RNN-T loss and directly outputs the recognized transcript as a sequence of graphemes, thus performing end-to-end speech recognition. We find that performance can be improved further through the use of sub-word units (‘wordpieces’) which capture longer context and significantly reduce substitution errors. The best RNN-T system, a twelve-layer LSTM encoder with a two-layer LSTM decoder trained with 30,000 word pieces as output targets, achieves a word error rate of 8.5% on voice-search and 5.2% on voice-dictation tasks and is comparable to a state-of-the-art baseline at 8.3% on voice-search and 5.4% on voice-dictation.

Hard to reproduce?

Abstract

Acoustic models based on connectionist temporal classification (CTC) no longer require forced alignment between the training audio and its transcript, and already provide a first form of end-to-end acoustic modeling. But CTC has two serious bottlenecks: it lacks language-modeling capability and cannot be jointly optimized with a language model, and it cannot model dependencies among its own outputs. The RNN-Transducer improves on CTC in exactly these respects, yielding a model that is jointly optimized end to end, has language-modeling capability, and readily supports online speech recognition, which makes it well suited to speech tasks and worth the community's attention.

The CTC Model and Its Shortcomings

Connectionist temporal classification (CTC).

Previously everything was hybrid HMM + deep learning training, which requires frame-level labels; extremely tedious.

The HMM's sequence-modeling role can be handled by an RNN instead.

Each frame's prediction is either blank or the corresponding phoneme.

Summing over all alignments, gradients are back-propagated (dynamic programming: the forward-backward algorithm).

Set up the transition recursion for the predictions:

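For reference, the standard CTC forward recursion (Graves et al., 2006): with per-frame softmax outputs $y^t_k$ and the blank-augmented label sequence $\mathbf{l}'$,

$$
\alpha_t(s) =
\begin{cases}
\bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\bigr)\, y^t_{l'_s}, & \text{if } l'_s = \mathrm{blank} \text{ or } l'_s = l'_{s-2},\\
\bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\bigr)\, y^t_{l'_s}, & \text{otherwise,}
\end{cases}
$$

and the loss is $-\ln p(\mathbf{l}\mid\mathbf{x}) = -\ln\bigl(\alpha_T(|\mathbf{l}'|) + \alpha_T(|\mathbf{l}'| - 1)\bigr)$.
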
Loss function: a log-probability loss that prefers higher-probability paths (a usage sketch follows the list below).

Frame-level training gives accurate "plateaus" of posteriors, while CTC produces spiky predictions:

  • The blank symbol (which incurs little loss) lets the model accumulate information before emitting an output
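
As a concrete anchor, a minimal usage sketch of PyTorch's built-in `nn.CTCLoss`; all sizes here are illustrative, and the forward-backward DP runs inside the loss.

```python
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 20, 12     # frames, batch, classes (0 = blank), target length
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)          # (T, N, C), as CTCLoss expects
targets = torch.randint(1, C, (N, S))           # labels 1..C-1; 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                # gradients flow back into the acoustic model
```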

Advantages of CTC:

  • No forced alignment required
  • Faster decoding

The RNN-Transducer Model

CTC does not model the dependencies between labels, so it is not fully end to end, and it requires the input sequence to be longer than the output (not an issue for speech recognition).

A jointly trained model: a prediction network (language model) and an encoder (acoustic model) are combined to emit the output labels.

Joint network: fuses the two models (see the sketch below).

A Manhattan-grid model: every feasible alignment path is a monotone staircase walk through the output lattice.

Training uses the forward-backward algorithm to compute the loss function.
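
A minimal PyTorch sketch of a common RNN-T joint-network parameterization; the dimension names are illustrative, and this is one standard choice rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class JointNet(nn.Module):
    """Combine encoder state f_t and prediction-network state g_u into
    log P(k | t, u) over the vocabulary (including blank)."""

    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)

    def forward(self, f, g):
        # f: (B, T, enc_dim), g: (B, U, pred_dim). Broadcasting builds a
        # (B, T, U, joint_dim) grid -- this third-order tensor is exactly
        # why RNN-T training is compute- and memory-hungry.
        h = torch.tanh(self.enc_proj(f).unsqueeze(2) +
                       self.pred_proj(g).unsqueeze(1))
        return self.out(h).log_softmax(dim=-1)     # (B, T, U, vocab)
```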

Comparing four model architectures:

  • CTC
  • RNN-Transducer (the third-order tensor is computationally heavy, and memory blows up easily)
  • Attention-based
  • RNN-Transducer with attention

In practice, CTC's estimates of start and end times are still unsatisfactory, so its applicability remains limited.

RNN-Transducer:

  • Solves CTC's conditional-independence problem
  • Integrates the two models
  • A streaming-friendly architecture

Improvements to the RNN-Transducer Model

High computational cost, and some implausible alignment paths exist.

Recurrent Neural Aligner:

  • Diagonalizes the alignment paths (one output per input frame)
  • An expected-loss training objective…
  • Little performance gain

Multi-stage training of a wordpiece RNN-T (multi-stage pretraining):

  • The resulting models are noticeably better

Summary:

  • RNN-T performs well overall (even without a large-scale language model)
  • Training is very difficult; pretraining/preprocessing is essential
  • Well suited to online (streaming) decoding

What about a high-performance framework like BERT?