orsdanilo

Python Packages and Virtual Environments

2020-08-03T00:00:00+00:00

There are several tools to keep Python projects organized and running without compatibility issues. In fact, it can get confusing, as some are similar and seem to do the same thing. This article highlights the main ones, hopefully giving good directions for Python newcomers and perhaps clarifying some points Python developers may have overlooked. We start with brief clarifications on the concepts of packages and virtual environments and the default tools for Python, then present the conda distribution, and conclude with the similarities and differences between conda and pip/venv.

Python Packages

It is a good practice to refactor code so as to make it more reusable. A Python file containing variables, classes and/or functions can be distributed and reused as a module. Suppose we write the file foo.py:

def bar(x, y):
  return x + y

This file is a module. The Python doc on modules says:

A module is a file containing Python definitions and statements. The file name is the module name with the suffix .py appended.

Running Python or a Python file from the same folder, the module can be imported in the following manner:

>>> import foo
>>> foo.bar(1,2)
3

Or alternatively:

>>> from foo import bar
>>> bar(1,2)
3

It is also possible to create shorter aliases for the imports, binding the alias to the module contents. Some famous ones are:

>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt

We can import all public names from a module using:

>>> from <module> import *

However, this is not recommended, as it makes it unclear where each name comes from and can silently overwrite names that were already defined (namespace collision).
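To make the collision risk concrete, here is a minimal sketch that uses only the standard library. The choice of math and pow is mine, picked because math.pow happens to share its name with a built-in; any module exporting a name you already use would behave the same way.

```python
# The built-in pow works on integers and returns an integer.
before = pow(2, 3)

# The wildcard import copies every public name from math into this
# namespace, including math.pow, which silently replaces the built-in.
from math import *

after = pow(2, 3)

print(type(before).__name__, before)
print(type(after).__name__, after)
```

The second call now returns a float, and nothing in the code warned us about the change. Importing the module under an explicit name (import math) avoids the surprise.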

As the project grows, these modules should be organized in a directory structure, which is then called an import package. To a certain extent, we can think of packages as file system directories and modules as files. More on packages and imports can be found on this page.

As presented in the docs, a regular package is typically implemented as a directory containing an __init__.py file. When it is imported, this __init__.py file is implicitly executed, and the objects it defines are bound to names in the package’s namespace. __init__.py can be an empty file in the simplest case, or contain initialization code for the package.
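The behavior of __init__.py is easier to see in action. The sketch below builds a throwaway package in a temporary directory and imports it; the package name mypkg and the greeting variable are hypothetical, chosen only for this illustration, while foo.py reuses the bar function from earlier.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk: mypkg/__init__.py and mypkg/foo.py.
# "mypkg" is a hypothetical name used only for this illustration.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "mypkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text(
    "from .foo import bar\ngreeting = 'package loaded'\n"
)
(pkg_dir / "foo.py").write_text("def bar(x, y):\n    return x + y\n")

# Make the parent directory importable, then import the package.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
mypkg = importlib.import_module("mypkg")

# Names defined in __init__.py are bound in the package namespace.
print(mypkg.greeting)
print(mypkg.bar(1, 2))

# The submodule remains reachable under its full dotted name.
print(mypkg.foo.bar(1, 2))
```

Because __init__.py re-exports bar, users of the package can call mypkg.bar without knowing which file defines it.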

Import packages usually contain multiple files and are not so convenient for distribution. Enter source distribution packages, or sdists, which are compressed archives (.tar.gz files) containing one or more packages or modules, as is explained here. Python also created the wheel package format (.whl), that allows shipping libraries with compiled artifacts, making installation faster.

Distributions can be found in repositories, called Package Indexes, which contain a web interface to automate package discovery and consumption. The default one for the Python community is the Python Package Index (PyPI). All Python developers can consume and distribute their distributions by making the .tar.gz and the .whl files available in this repository. Instructions on how to package and upload a distribution to PyPI can be found here.

pip

Pip is the recommended installer for Python packages. Packages can be easily installed from PyPI with this command:

$ pip install <package>

Here are some other useful commands to upgrade, uninstall and list installed packages, respectively:

$ pip install --upgrade <package>
$ pip uninstall <package>
$ pip list

Virtual Environments

Sometimes different Python applications with different dependencies and specific package version requirements need to be executed on the same system. If they were all in the same Python installation, compatibility issues would arise. A solution for this kind of problem is the use of virtual environments.

Quoting the Python documentation, a virtual environment is:

A self-contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages.

With different virtual environments, the applications can run independently from one another, without conflicting requirements.

venv

Since version 3.5 of Python, the recommended tool to create virtual environments is venv, a built-in Python 3 module. The process is the following: navigate to the folder where the environment should be placed and enter the command:

$ python3 -m venv <env_name>

Where <env_name> is the name of the new environment. The environment uses the same Python version as the interpreter that ran the command, that is, whichever version python3 resolves to on the computer.

To activate the virtual environment on Linux or macOS, run:

$ source <env_name>/bin/activate

Or on Windows:

$ <env_name>\Scripts\activate.bat

After running this activation script, (<env_name>) will be prefixed to the command prompt, showing the current environment, like so:

(<env_name>) $

To exit the current environment, type:

(<env_name>) $ deactivate
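The same venv module can also be driven from Python itself, which helps to see what an environment consists of. This is a minimal sketch: the directory name demo-env is arbitrary, and with_pip=False is my choice to keep the example fast and offline (the command-line form bootstraps pip by default).

```python
import sys
import tempfile
import venv
from pathlib import Path

# Create an environment in a temporary directory. with_pip=False keeps
# the sketch fast and offline; "demo-env" is an arbitrary name.
env_dir = Path(tempfile.mkdtemp()) / "demo-env"
venv.create(env_dir, with_pip=False)

# Every environment carries a pyvenv.cfg recording the base interpreter.
config = (env_dir / "pyvenv.cfg").read_text()
print(config)

# Inside an activated environment, sys.prefix points at the environment
# while sys.base_prefix still points at the base installation.
in_virtual_env = sys.prefix != sys.base_prefix
print("running inside a virtual environment:", in_virtual_env)
```

The last check is a handy way for a script to tell whether it is running inside a virtual environment.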

Though it is common practice to create the virtual environment inside the project directory, it is not a good idea to keep it in the project repository, because its contents are OS-specific and the folder can be large. The best alternative is to persist the list of required packages and their versions to a file with the command below:

(<env_name>) $ pip freeze > requirements.txt

And then anyone who needs to run the project can obtain the packages in the following manner:

(<env_name>) $ pip install -r requirements.txt
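To get a feeling for what ends up in requirements.txt, the sketch below lists the installed distributions in the same name==version form, using importlib.metadata from the standard library (Python 3.8+). It is only an approximation of pip freeze, which also handles editable installs and packages installed from URLs; the function name frozen_requirements is mine.

```python
from importlib import metadata


def frozen_requirements():
    """Return sorted 'name==version' lines for installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs that lack a name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


for line in frozen_requirements():
    print(line)
```

Running it inside a fresh environment and again after installing a package shows how the list grows with the package and its dependencies.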

Anaconda

Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing and data science. It comes with Python, approximately 250 scientific packages installed out-of-the-box and Anaconda Navigator, a Graphical User Interface (GUI).

If you’re not new to Python or programming in general and are somewhat comfortable with the Command Line Interface (CLI), Miniconda is usually the better choice, as it installs only Python, conda (more on that below) and, on Windows, Anaconda Prompt.

conda

Conda is an open source package management system and environment management system. Like pip, it can install Python packages, though it is language agnostic, and can deal with other programming languages and their dependencies as well. Like venv, it can create and manage virtual environments, and it can also manage different Python versions, which would require another tool otherwise, such as pyenv or virtualenv.

Similar to venv, a virtual environment can be created with a simple command. Here the environment folder will be created in Anaconda’s default folder.

$ conda create --name <env_name>

Alternatively, the environment can be created in another directory, e.g. the project folder. We only need to provide the flag --prefix alongside the path for the new environment folder:

$ conda create --prefix <path>

python=<version> or <package> can be appended to the conda create command, creating an environment with that Python version or with that package and its dependencies, respectively.

A list of the existing environments can be obtained by typing either one of these:

$ conda info --envs
$ conda env list

Packages can be installed from the conda repository by executing the following:

(<env_name>) $ conda install <package>

As with pip, packages and their dependencies can be exported to a file, only here it is a YAML file:

(<env_name>) $ conda env export > environment.yml

And then the file can be used to recreate the environment:

$ conda env create -f environment.yml

You can find more commands for conda in this cheat sheet.

Here is a comparison table between conda and pip taken from the Anaconda Blog:

	conda	pip
Manages	Binaries	Wheel or source
Can require compilers	No	Yes
Package types	Any	Python-only
Creates environment	Yes, built-in	No, requires virtualenv or venv
Dependency checks	Yes	No
Package sources	Anaconda repo and cloud	PyPI

It should be noted that it is not such a fair comparison, as conda aims to do more than pip, containing more features and not being restricted to Python.

An important additional piece of information is the number of packages available in each tool: while the Anaconda Repository features around two thousand packages, PyPI boasts more than 250 thousand projects from the Python community. Therefore, it is not uncommon to find with pip a library that’s not available with conda.

Conclusion

All in all, conda is a great tool for doing data science work. Gathering different functionalities all in one tool certainly makes things a bit more fluid. If you’re doing pure Python work, pip+venv will probably do the job just fine. For general data science projects with varied scientific libraries and dependencies, it may be a good idea to use conda. Since the two package managers aren’t exactly interchangeable, the recommended approach is to use an isolated conda environment, installing everything possible with conda and falling back to pip only when needed.

In this article I tried to gather some information on the available tools for Python development. I recommend that you check the numerous references along this page, do your research and experiments, and then decide what works best for you. I hope it was helpful!

Decoders for Automatic Speech Recognition

2020-07-21T00:00:00+00:00

Automatic Speech Recognition (ASR) systems convert speech from a recorded audio signal to text. An ASR system aims to infer the original words given an observable signal, most commonly following a probabilistic approach [1].

Figure 1: Diagram of an ASR system.

The input audio is split into overlapping frames of 25ms shifted by 10ms, so that within this tiny time window, the speech signal is considered to be stationary, allowing for the analysis featured here. Before decoding, these are fed to a feature extractor, which reduces signal dimensionality and extracts the most relevant information, that is, the linguistic message.
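The framing step can be sketched in a few lines of plain Python. The 16 kHz sampling rate is an assumption for the example (the frame and shift durations come from the text), and the function name frame_signal is mine; real front ends typically also apply a window function to each frame, which is omitted here.

```python
def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split a sequence of samples into overlapping fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of silence at an assumed 16 kHz sampling rate.
signal = [0.0] * 16000
frames = frame_signal(signal, sample_rate=16000)
print(len(frames), len(frames[0]))
```

At 16 kHz each frame holds 400 samples and consecutive frames start 160 samples apart, so neighbouring frames share 240 samples.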

Decoding is the process of calculating which sequence of words is most likely to match the acoustic signal represented by the feature vectors [1] of a given utterance (a continuous piece of speech beginning and ending with a clear pause). Decoding is executed based on three sources of information:

Acoustic model: An ensemble of Hidden Markov Models representing words or phonemes;
Language model: A list of word sequence probabilities;
Lexicon: A dictionary of words and their respective phonemes.

During decoding, we essentially use these to try to predict the word sequence $\boldsymbol{\hat{W}}$ that best matches the acoustic observation $\boldsymbol{X}=X_1 X_2\ldots X_n$, obtained using Bayes' rule:

\[\begin{equation*} \begin{split} \boldsymbol{\hat{W}} &= \arg\max_{\boldsymbol{W}}P(\boldsymbol{W}\mid \boldsymbol{X})\\ &= \arg\max_{\boldsymbol{W}}\frac{P(\boldsymbol{X}\mid \boldsymbol{W})P(\boldsymbol{W})}{P(\boldsymbol{X})}\\ &= \arg\max_{\boldsymbol{W}} (\underbrace{P(\boldsymbol{X} \mid \boldsymbol{W})}_{\substack{\text{Acoustic} \\ \text{model}}} \underbrace{P(\boldsymbol{W})}_{\substack{\text{Language} \\ \text{model}}}) \end{split} \end{equation*}\]

The denominator $P(\boldsymbol{X})$ does not depend on $\boldsymbol{W}$, so it can be dropped from the maximization. The acoustic analysis is performed by a Deep Neural Network, which is trained to provide the likelihood $P(\boldsymbol{X} \mid \boldsymbol{W})$ of obtaining this observation given a candidate word sequence. The prior probability $P(\boldsymbol{W})$ of a word sequence is obtained from a language model, containing the allowed and most frequent occurrences for a given language. Therefore, the prior probability and the acoustic likelihood are combined in order to output the best prediction with the available information.
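As a toy illustration of this decision rule, the sketch below scores three candidate transcriptions and picks the one maximizing the product of acoustic and language model scores. All the numbers are made up for the example; a real decoder searches a far larger space and never enumerates candidates like this.

```python
import math

# Hypothetical scores for three candidate transcriptions of one utterance.
# acoustic: P(X | W), language: P(W). The numbers are made up.
candidates = {
    "recognize speech": {"acoustic": 0.020, "language": 0.0100},
    "wreck a nice beach": {"acoustic": 0.025, "language": 0.0001},
    "recognise peach": {"acoustic": 0.010, "language": 0.0005},
}


def log_score(scores):
    # Work in the log domain: products become sums and underflow is avoided.
    return math.log(scores["acoustic"]) + math.log(scores["language"])


best = max(candidates, key=lambda w: log_score(candidates[w]))
print(best)
```

Note that the second candidate has the highest acoustic score but loses because the language model considers it very unlikely.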

One solution for the search of the best word sequence involves the use of Weighted Finite State Transducers (WFSTs). I wrote about it in this post, if you feel up for the challenge.

References

[1] Gruhn, R. E., Minker, W., & Nakamura, S. (2011). Statistical Pronunciation Modeling for Non-Native Speech Processing. Springer Berlin Heidelberg.

ASR with Weighted Finite State Transducers

2020-07-21T00:00:00+00:00

In a previous post, the subject of decoders for Automatic Speech Recognition was introduced, presenting the task of searching for the best word sequence given an audio observation. One solution for that involves the use of Weighted Finite State Transducers (WFSTs). This post is theory heavy, so don’t worry if you don’t understand all of it on first try (I certainly didn’t).

WFSTs are static graphs encoding the whole word-sequence search space. They contain many levels of information, going from the acoustic model to phones in context, to single phonemes, to words, and finally, words in sequence.

A finite-state transducer is a finite automaton whose state transitions are labeled with both input and output symbols. Therefore, a path through the transducer encodes a mapping from an input to an output string. They provide a common and natural representation for major components of speech recognition systems, including Hidden Markov Models (HMMs), context-dependency models, pronunciation dictionaries, statistical grammars and word or phone lattices [1]. The decoder goes through the WFST creating a lattice, a graph structure that contains the most probable - or least costly - paths for a given utterance (a continuous piece of speech beginning and ending with a clear pause), taking into account both acoustic and language costs.

A transducer $T=(\mathcal{A}, \mathcal{B}, \mathcal{Q}, I, F, E, \lambda, \rho)$ over a semiring $\mathbb{K}$ is characterized by:

$\mathcal{A}$: finite input alphabet;
$\mathcal{B}$: finite output alphabet;
$\mathcal{Q}$: finite set of states;
$I$: set of initial states;
$F$: set of final states;
$E$: finite set of transitions;
$\lambda$: initial state weight assignment;
$\rho$: final state weight assignment.

The set of transitions is $E\subseteq\bar{E}=\mathcal{Q}\times(\mathcal{A}\cup\{\epsilon\})\times(\mathcal{B}\cup\{\epsilon\})\times\mathbb{K}\times\mathcal{Q}$. This means it is a subset of all possible transitions from one state to another (or to itself) with an input label, an output label and a weight. $\epsilon$ denotes an empty label, that is, no input or no output.

The operations are executed on specific semirings. A semiring is a set $\mathbb{K}$ equipped with two binary operators, addition ($\oplus$) and multiplication ($\otimes$), with identity elements $\bar{0}$ and $\bar{1}$, respectively. One simple example of a semiring is the set of natural numbers (including zero) under ordinary addition and multiplication [2]. Two commonly used semirings in ASR are the probability semiring and the log semiring, shown in Table 1.

SEMIRING	SET	$\oplus$	$\otimes$	$\bar{0}$	$\bar{1}$
Probability	$\mathbb{R}_+$	$+$	$\times$	0	1
Log	$\mathbb{R} \cup \{-\infty,+\infty\}$	$\oplus_{log}$	$+$	$+\infty$	0

Table 1: Probability and log semirings. $\oplus_{log}$ is defined by: $x \oplus_{log} y=-\log(e^{-x}+e^{-y})$.
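The link between the two semirings in Table 1 becomes clearer with a quick numerical check. In the sketch below, weights in the log semiring are negative log probabilities (costs); the function names log_add and log_times are mine, and the probabilities 0.5 and 0.25 are arbitrary.

```python
import math


def log_add(x, y):
    """Log-semiring addition: -log(e^-x + e^-y), with +inf as the zero."""
    if x == math.inf:
        return y
    if y == math.inf:
        return x
    # Factor out the smaller cost for numerical stability.
    low, high = min(x, y), max(x, y)
    return low - math.log1p(math.exp(low - high))


def log_times(x, y):
    """Log-semiring multiplication is ordinary addition of costs."""
    return x + y


p, q = 0.5, 0.25
x, y = -math.log(p), -math.log(q)

# Adding and multiplying costs mirrors adding and multiplying probabilities.
print(math.exp(-log_add(x, y)))    # approximately p + q
print(math.exp(-log_times(x, y)))  # approximately p * q
```

Working with costs instead of raw probabilities avoids numerical underflow when many small probabilities are multiplied along a path.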

There is a set of operations we can perform on transducers. The main ones are the following:

Composition: Combining graphs, therefore different levels of representation;
Determinization: Making an automaton deterministic, that is, each state should have at most one transition with any given input label, which should not be the empty ($\epsilon$) symbol;
Minimization: Reducing the size of a deterministic automaton.
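The determinism condition quoted above is simple to check mechanically. The following is a minimal sketch with a hypothetical arc representation (source state, input label, target state) and an assumed spelling for the empty label; it tests only the transition condition, not other requirements such as a unique initial state, and it does not perform determinization itself.

```python
EPS = "<eps>"  # assumed spelling of the empty label


def is_deterministic(arcs):
    """arcs: iterable of (source, input_label, target) transitions."""
    seen = set()
    for source, label, _target in arcs:
        if label == EPS:
            return False  # epsilon inputs are not allowed
        if (source, label) in seen:
            return False  # two arcs leave the same state on one label
        seen.add((source, label))
    return True


nondeterministic = [(0, "a", 1), (0, "a", 2), (1, "b", 3), (2, "c", 3)]
deterministic = [(0, "a", 1), (1, "b", 2), (1, "c", 2)]
print(is_deterministic(nondeterministic), is_deterministic(deterministic))
```

In the first automaton, state 0 has two outgoing arcs labeled "a", so after reading "a" the automaton could be in either state 1 or state 2.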

A widely used toolkit for ASR is Kaldi [3]. A usual decoder in Kaldi contains a set of four WFSTs which are combined (or composed, as explained below): HMM-level, Context Dependency, Pronunciation Lexicon and Grammar. The mapping from input to output across transducers is shown in Figure 1.

Figure 1: Mapping of input and output across transducers.

Figure 2 is the representation of a grammar with full sentences. Arcs show legal word strings and their probabilities. As input and output labels are the same, we call this automaton an acceptor (it ends at a final state if the sequence is accepted by the WFST). The graph in Figure 3 corresponds to a pronunciation lexicon, showing different possibilities of pronunciation and mapping a phone string to a word string in the lexicon. Since it transduces phones to words, it is called a transducer. Here are some Weighted finite-state transducer examples:

Figure 2: Example grammar FST.

Figure 3: Example pronunciation lexicon FST.

Essentially, the grammar graph shows us the allowed sequences of words along with the probability of their occurrence. It can be hand crafted through the use of regular expressions, rules or even direct build, but it can most notably be learned from data, creating stochastic $n$-gram models from text corpora.

The lexicon graph, on the other hand, presents the allowed pronunciations for the words. It is the Kleene closure of the union of individual word pronunciations, that is, the set of pronunciations for all word sequences. This means that there are loops in this automaton. For matters of efficiency, output word labels should be placed on the initial transition of words.

Below is a visual example of graph growth with composition of H, C, L and G transducers.

Figure 4: Example $L\circ G$.

Figure 5: Example $C\circ L\circ G$.

Figure 6: Example $H\circ C\circ L\circ G$.

The operation of composition essentially consists of checking transitions in both input transducers and adding an arc to the output graph whenever the output label of the first transducer matches the input label of the second one. This creates a mapping between the input of $T_1$ and the output of $T_2$. A visual example from [1] is shown in Figure 7. It works like function composition, where the composition $(f\circ g)(x)$ of function $f$ with function $g$ is equal to $f(g(x))$.

Figure 7: Example of a basic transducer composition [1]. The upper left transducer is $T_1$, the upper right one is $T_2$, and the graph below them is the result.
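The matching rule described above can be sketched in plain Python. This is a simplified illustration, not the algorithm used by Kaldi or OpenFst: it assumes there are no epsilon labels, that weights are costs combined by addition (as in the log semiring), and it uses a hypothetical dictionary layout for transducers. The two small transducers are made up and are not the ones from Figure 7.

```python
from collections import deque


def compose(t1, t2):
    """Compose two epsilon-free transducers whose weights are costs."""
    start = (t1["start"], t2["start"])
    arcs, finals = {}, {}
    queue, visited = deque([start]), {start}
    while queue:
        q1, q2 = state = queue.popleft()
        arcs[state] = []
        # A pair state is final only if both components are final.
        if q1 in t1["finals"] and q2 in t2["finals"]:
            finals[state] = t1["finals"][q1] + t2["finals"][q2]
        for in1, out1, w1, n1 in t1["arcs"].get(q1, []):
            for in2, out2, w2, n2 in t2["arcs"].get(q2, []):
                if out1 != in2:
                    continue  # labels must match to pair the arcs
                nxt = (n1, n2)
                arcs[state].append((in1, out2, w1 + w2, nxt))
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}


# T1 maps a->x then b->y; T2 maps x->X and y->Y.
# Arcs are (input, output, cost, next_state).
t1 = {"start": 0, "finals": {2: 0.0},
      "arcs": {0: [("a", "x", 0.5, 1)], 1: [("b", "y", 1.0, 2)]}}
t2 = {"start": 0, "finals": {1: 0.5},
      "arcs": {0: [("x", "X", 0.25, 0), ("y", "Y", 0.75, 1)]}}

result = compose(t1, t2)
for state, out in result["arcs"].items():
    print(state, out)
```

States of the result are pairs of states from the two inputs, which is why composed graphs can grow quickly: in the worst case the number of states is the product of the two input sizes.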

During decoding, the decoder examines the word-sequence search space beginning with the start states and evaluates transition costs in order to build a lattice, a representation of the alternative $n$-best word sequences.

Below is an illustration of the decoding operation. For each time frame, visited states are saved in the form of tokens, with links to other states. Notice how not all states and transitions make their way into the lattice (and those that are there can still be removed via pruning).

Figure 8: Example FST examined during decoding.

Figure 9: Example lattice created from decoding.

Kaldi creates a data structure corresponding to a full state-level lattice, so for every arc of $HCLG$ traversed, a separate arc is created in the lattice, with acoustic and graph cost stored separately, as illustrated in Figure 9. This structure is pruned periodically, and the final pruned graph then has its epsilons removed and is determinized [4].
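To show how the separately stored costs are used, here is a minimal sketch that finds the cheapest path through a tiny acyclic lattice. The arc layout, the word labels, the costs and the acoustic_scale parameter are all hypothetical choices for this example; this is not Kaldi's lattice data structure or API.

```python
def best_path(arcs, start, final, acoustic_scale=1.0):
    """Return (cost, words) of the cheapest path through an acyclic lattice.

    arcs: list of (source, target, word, acoustic_cost, graph_cost), with
    states numbered so that source < target (a topological order).
    """
    best = {start: (0.0, [])}
    for source, target, word, acoustic, graph in sorted(arcs, key=lambda a: a[0]):
        if source not in best:
            continue  # state not reachable from the start
        cost, words = best[source]
        candidate = cost + acoustic_scale * acoustic + graph
        if target not in best or candidate < best[target][0]:
            best[target] = (candidate, words + [word])
    return best[final]


# Made-up lattice with two alternatives at each step.
lattice = [
    (0, 1, "the", 1.0, 0.5),
    (0, 1, "a", 1.5, 0.25),
    (1, 2, "cat", 2.0, 1.0),
    (1, 2, "cap", 1.5, 2.0),
]
cost, words = best_path(lattice, start=0, final=2)
print(cost, " ".join(words))
```

Keeping the two costs apart makes it possible to rescore the same lattice later, for example with a different weighting between acoustic and language information.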

More on the theory of weighted finite state transducers can be found in [1]. For more on Kaldi, check out their documentation page.

References

[1] Mohri, M., Pereira, F., & Riley, M. (2008). Speech recognition with weighted finite-state transducers. In Springer Handbook of Speech Processing (pp. 559–584). Springer.

[2] Akian, M., Gaubert, S., & Guterman, A. (2009). Linear independence over tropical semirings and beyond. Contemporary Mathematics, 495, 1.

[3] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., & Vesely, K. (2011, December). The Kaldi Speech Recognition Toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.

[4] Povey, D., Hannemann, M., Boulianne, G., Burget, L., Ghoshal, A., Janda, M., Karafiát, M., Kombrink, S., Motlíček, P., Qian, Y., et al. (2012). Generating exact lattices in the WFST framework. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4213–4216.

Dynamic WFST-based Decoder for Automatic Speech Recognition

2020-07-20T00:00:00+00:00

This post is the executive summary of my final project for graduate school in France. The work was conducted from April to September 2018 in the Speech and Sound Group (SSG) at Sony, in Germany, under the supervision of Marc Ferras-Font. The project used mainly Kaldi [1], an open-source toolkit for Automatic Speech Recognition. Click here for the French version.

Automatic Speech Recognition (ASR) systems convert speech from a recorded audio signal to text. Such systems aim to infer the original words given an observable signal, most commonly following a probabilistic approach. We call decoding the process of calculating which sequence of words is most likely to match the acoustic signal [2].

One extensively used way of searching for the best word sequence involves the use of Weighted Finite State Transducers (WFSTs), static graphs encoding the whole word-sequence search space. They contain many levels of information, and define what is allowed (and more probable) in the language. During decoding, the decoder examines them in order to predict the best sequence(s), that is, the most probable or least costly path(s).

Figure 1: Example grammar WFST. Transitions contain input and output labels and a probability (or a cost) of making such transition. A path from a start to an end state transduces one input sequence of labels to an output one.

The internship consisted of investigating and implementing ways to allow dynamic changes to the decoding graphs and/or the decoding process of a WFST-based decoder. The main goals were to create class-based decoders and to use dynamic approaches for the construction of cascades of WFSTs.

We managed to implement functional class-based decoders: numbers were kept in a separate graph, and this graph was substituted into the main graph wherever a quantity was expected. This makes modifications easier, since out-of-vocabulary words (OOVs) can be more easily inserted into their respective class graphs due to the reduced size of the latter. This technique can be extended to other classes, such as names of people and cities, which are often unknown to the system.

The WFST examined during decoding is a combination (composition) of different graphs with multiple levels of information. These graphs are usually combined before the decoding operation, but this significantly increases the WFST size. We managed to implement this combination during decoding, so the search space is constructed on-the-fly. This significantly reduces memory usage, both in on-disk graph storage and in RAM during decoding.
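The idea of building the search space on-the-fly can be illustrated with a small sketch. It is not the implementation from the project nor Kaldi's or OpenFst's API: the class name LazyComposition and the dictionary layout are hypothetical, epsilon labels are assumed absent, and weights are costs that add up. The point is only that arcs of a composed state are computed when the decoder first asks for them.

```python
class LazyComposition:
    """Expose T1 o T2 without building it: arcs are computed on demand."""

    def __init__(self, arcs1, arcs2, start1=0, start2=0):
        self.arcs1, self.arcs2 = arcs1, arcs2
        self.start = (start1, start2)
        self.cache = {}  # holds only the states expanded so far

    def arcs(self, state):
        if state not in self.cache:
            q1, q2 = state
            # Pair arcs whose labels match: output of T1, input of T2.
            self.cache[state] = [
                (in1, out2, w1 + w2, (n1, n2))
                for in1, out1, w1, n1 in self.arcs1.get(q1, [])
                for in2, out2, w2, n2 in self.arcs2.get(q2, [])
                if out1 == in2
            ]
        return self.cache[state]


# Arcs are (input, output, cost, next_state); the graphs are made up.
arcs1 = {0: [("a", "x", 0.5, 1)], 1: [("b", "y", 1.0, 2)]}
arcs2 = {0: [("x", "X", 0.25, 0), ("y", "Y", 0.75, 1)]}

lazy = LazyComposition(arcs1, arcs2)
first = lazy.arcs(lazy.start)
print(first, len(lazy.cache))
```

Only the states actually visited are ever materialized, which saves memory, at the price of doing the matching work during decoding instead of ahead of time.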

These two features were combined in a single decoder that includes on-the-fly class replacement and on-the-fly composition. Table 1 presents the performance of the different decoders that were implemented over the course of this internship. Word Error Rate (WER) expresses the error of the predictions (lower is better) and Real Time Factor (RTF) shows the speed (lower is better, 1.0 means real time). Static is the original case; Static Replace adds the replacement of class graphs in the main transducer; Lazy implements the on-the-fly combination of the different levels of representation; and Lazy Replace uses both techniques in the same decoder.

Type	WER	RTF
Static	10.0%	0.37
Static Replace	11.2%	0.39
Lazy	13.4%	1.28
Lazy Replace	14.4%	1.48

Table 1: Scores obtained for lazy composition experiments.
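For readers unfamiliar with the two metrics, here is a minimal sketch of how they are usually computed. WER is taken as the word-level edit distance divided by the number of reference words, and RTF as processing time divided by audio duration. The function names, the example sentences and the timings are mine and are not taken from the experiments; the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """Values below 1.0 mean decoding runs faster than real time."""
    return processing_seconds / audio_seconds


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(real_time_factor(3.7, 10.0))
```

In the first call one of six reference words is missing from the hypothesis, and in the second a 10-second recording decoded in 3.7 seconds is comfortably faster than real time.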

Granted, the performance of the dynamic decoder is worse than the original non-lazy, non-replaced graph, both in terms of accuracy and speed. Nevertheless, this impact on performance was expected; the original graph, though impractical in terms of size, is further optimized, since it’s a single piece. Our implemented graph can also be further optimized so as to get as close a performance as possible to the original, ordinary graph.

References

[1] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., & Vesely, K. (2011, December). The Kaldi Speech Recognition Toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.

[2] Gruhn, R. E., Minker, W., & Nakamura, S. (2011). Statistical Pronunciation Modeling for Non-Native Speech Processing. Springer Berlin Heidelberg.