# TensorFlow 1: DeepSpeech Conversion

This example illustrates how the new coremltools conversion API automatically handles flexible shapes, along with related capabilities.

We demonstrate it using an Automatic Speech Recognition task. In this task, the input is a speech audio file and the output is its text transcription. There are many approaches to Automatic Speech Recognition; the system used in this example consists of 3 stages: pre- and post-processing stages, with a neural network model in the middle to do the heavy lifting.

Pre-processing extracts Mel-frequency cepstral coefficients (MFCCs) from the raw audio file. The MFCCs are fed to the neural network model, which returns a character-level time series of probability distributions. A CTC decoder then post-processes these distributions to produce the final transcription.

The pre- and post-processing stages employ standard techniques which can be easily implemented. Therefore, our focus is on converting the neural network model.
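To give a feel for the post-processing idea, here is a minimal sketch of greedy (best-path) CTC decoding. It is not the `postprocessing` helper used in this demo, which is built from code in the DeepSpeech repository; the function name, the toy alphabet, and the assumption that the blank symbol is the last class are all illustrative.

```python
def greedy_ctc_decode(probs, alphabet, blank_index=None):
    """Greedy (best-path) CTC decoding.

    probs: list of per-time-step probability lists, one entry per
           character plus one entry for the blank symbol.
    alphabet: maps class index i to the character alphabet[i].
    """
    if blank_index is None:
        blank_index = len(alphabet)  # assumption: blank is the final class

    # 1. Pick the most likely class at every time step.
    best_path = [max(range(len(step)), key=step.__getitem__) for step in probs]

    # 2. Collapse consecutive repeats and 3. drop blanks.
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank_index:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


alphabet = "ab"          # toy alphabet; class index 2 is the blank
probs = [
    [0.8, 0.1, 0.1],     # a
    [0.7, 0.2, 0.1],     # a (repeat, collapsed)
    [0.1, 0.1, 0.8],     # blank
    [0.6, 0.3, 0.1],     # a (new symbol after the blank)
    [0.1, 0.8, 0.1],     # b
]
print(greedy_ctc_decode(probs, alphabet))  # aab
```

A production decoder typically replaces step 1 with a beam search, optionally guided by a language model, but the collapse-and-drop-blanks logic stays the same.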

We use a pre-trained TensorFlow model called DeepSpeech. At a high level, this model stacks an LSTM and a few dense layers on top of each other. Such an architecture is quite common for `seq2seq` models.

To run this demo on your system, download the following assets:

- Processing and inspection utilities (demo_utils.py)
- Sample audio file (audio_sample_16bit_mono_16khz.wav)
- Alphabet configuration file (alphabet.txt)
- Language model scorer (kenlm.scorer)
- Pre-trained weights (deepspeech-0.7.1-checkpoint)
- Script to export TensorFlow 1 model (DeepSpeech.py)

First, we install the `deepspeech` package using `pip`.

```
pip install deepspeech
```

Let us run the following script, downloaded from the DeepSpeech repository, to export the TensorFlow 1 model.

```
python DeepSpeech.py --export_dir /tmp --checkpoint_dir ./deepspeech-0.7.1-checkpoint --alphabet_config_path=alphabet.txt --scorer_path=kenlm.scorer >/dev/null 2>&1
```

After the model is exported, we can inspect the outputs of the TensorFlow graph.

```
tf_model = "/tmp/output_graph.pb"
from demo_utils import inspect_tf_outputs
inspect_tf_outputs(tf_model)
```

We find that there are 4 outputs: `['mfccs', 'logits', 'new_state_c', 'new_state_h']`. The first one, called `mfccs`, represents the output of the pre-processing stage. This means that the exported TensorFlow graph contains not just the DeepSpeech model but also the pre-processing sub-graph.

We can strip off this pre-processing component by providing the remaining 3 output names to the unified converter function.

```
outputs = ["logits", "new_state_c", "new_state_h"]
```

Let us convert this model to Core ML and try it out on an audio sample.

```
import coremltools as ct
mlmodel = ct.convert(tf_model, outputs=outputs)
```

After the model is converted, we load and preprocess an audio file. For the full pipeline to work in this demo, pre- and post-processing functions have already been constructed using code from the DeepSpeech repository.

```
audiofile = "./audio_sample_16bit_mono_16khz.wav"
from demo_utils import preprocessing, postprocessing
mfccs = preprocessing(audiofile)
print(mfccs.shape)
```

Preprocessing transforms the audio file into a tensor of shape `(1, 636, 19, 26)`. The shape can be read as 1 audio file, preprocessed into 636 sequences, each of width 19 and containing 26 coefficients. The number of sequences changes with the length of the audio. For this 12-second audio file, we have 636 sequences.

Let’s inspect the input shapes that the model expects.

```
from demo_utils import inspect_inputs
inspect_inputs(mlmodel, tf_model)
```

We find that the model input named `input_node` has shape `(1, 16, 19, 26)`, which matches the shape of the preprocessed tensor in every dimension except the sequence dimension. Since the converted model can only process 16 sequences at a time, we write a loop that breaks the input features into chunks and feeds each segment to the model one by one.

```
import numpy as np

start = 0
step = 16  # The static model processes 16 sequences per call
max_time_steps = mfccs.shape[1]
logits_sequence = []

input_dict = {}
input_dict["input_lengths"] = np.array([step]).astype(np.float32)
input_dict["previous_state_c"] = np.zeros([1, 2048]).astype(np.float32)  # Initializing cell state
input_dict["previous_state_h"] = np.zeros([1, 2048]).astype(np.float32)  # Initializing hidden state

print("Transcription: \n")

# Trailing frames that do not fill a whole chunk are not processed
while (start + step) < max_time_steps:
    input_dict["input_node"] = mfccs[:, start:(start + step), :, :]

    # Evaluation
    preds = mlmodel.predict(input_dict)
    start += step
    logits_sequence.append(preds["logits"])

    # Updating states
    input_dict["previous_state_c"] = preds["new_state_c"]
    input_dict["previous_state_h"] = preds["new_state_h"]

    # Decoding the logits collected so far
    probs = np.concatenate(logits_sequence)
    transcription = postprocessing(probs)
    print(transcription[0][1], end="\r", flush=True)
```

In short, we break the preprocessed features into slices of size 16 and run prediction on each slice inside a loop, carrying the LSTM cell and hidden states from one slice to the next.

On running the above code snippet, we find that the transcription matches the contents of the audio file.
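To see exactly which frames the chunking loop feeds to the model, the following sketch reproduces only its index arithmetic, without any model. The helper name `chunk_bounds` is made up for illustration; the loop condition is the one used in the demo.

```python
def chunk_bounds(total_steps, step):
    """Return (start, end) index pairs using the demo's loop condition."""
    bounds = []
    start = 0
    while (start + step) < total_steps:
        bounds.append((start, start + step))
        start += step
    return bounds


bounds = chunk_bounds(total_steps=636, step=16)
covered = bounds[-1][1] if bounds else 0
print(len(bounds))      # 39 full chunks are fed to the model
print(636 - covered)    # 12 trailing frames are never fed
```

With 636 sequences and a step of 16, the last 12 sequences are left out. Note also that, because the comparison is strict, an input whose length is an exact multiple of 16 would skip its final full chunk; changing `<` to `<=` would include it. Whether the dropped tail matters depends on the audio, so treat this as an observation about the loop rather than a required fix.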

It is also possible to run the prediction on the entire pre-processed feature in one go, but that requires a dynamic TensorFlow model.

Let us re-run the same script from the DeepSpeech repository to obtain a dynamic graph.

```
python DeepSpeech.py --n_steps -1 --export_dir /tmp --checkpoint_dir ./deepspeech-0.7.1-checkpoint --alphabet_config_path=alphabet.txt --scorer_path=kenlm.scorer >/dev/null 2>&1
```

This time, we provide an additional flag, `n_steps`, which corresponds to the sequence length and defaults to 16. Setting it to -1 means that the sequence length can take any positive value.

Let us convert the newly exported dynamic TensorFlow model.

```
mlmodel = ct.convert(tf_model, outputs=outputs)
```

After the model is converted, we can inspect how this model is different from the previous static one.

```
inspect_inputs(mlmodel, tf_model)
```

We find that the shape of input `input_node` is now `(1, None, 19, 26)`, which means that this Core ML model can work on inputs of arbitrary sequence length.
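As a small illustration of how to read such a shape, the sketch below treats `None` as a wildcard and checks a concrete tensor shape against both the static and the dynamic description. The helper `shape_matches` is hypothetical and is not part of coremltools.

```python
def shape_matches(concrete_shape, spec):
    """True if concrete_shape satisfies spec, where None means 'any size'."""
    if len(concrete_shape) != len(spec):
        return False
    return all(s is None or s == c for c, s in zip(concrete_shape, spec))


static_spec = (1, 16, 19, 26)
dynamic_spec = (1, None, 19, 26)

print(shape_matches((1, 636, 19, 26), static_spec))   # False
print(shape_matches((1, 636, 19, 26), dynamic_spec))  # True
print(shape_matches((1, 16, 19, 26), dynamic_spec))   # True
```

The full `(1, 636, 19, 26)` feature tensor is rejected by the static description but accepted by the dynamic one, which is why the chunking loop is no longer needed.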

The difference lies not just in the shapes. Under the hood, this dynamic Core ML model is much more complicated than the previous static one. It contains many dynamic operations, such as get shape and dynamic reshape.

However, the experience of converting it was exactly the same. The converter handled it with just the same amount of ease as before.

Let us validate the transcription accuracy on the same audio file.

```
input_dict = {}
input_dict["input_node"] = mfccs
input_dict["input_lengths"] = np.array([mfccs.shape[1]]).astype(np.float32)
input_dict["previous_state_c"] = np.zeros([1, 2048]).astype(np.float32) # Initializing cell state
input_dict["previous_state_h"] = np.zeros([1, 2048]).astype(np.float32) # Initializing hidden state
```

This time, we didn't need the loop and could directly feed the entire input feature to the model.

```
probs = mlmodel.predict(input_dict)["logits"]
transcription = postprocessing(probs)
print(transcription[0][1])
```

And, we get the same transcription with this dynamic Core ML model.

To summarize, we worked with two variants of the DeepSpeech model. From the static TF graph, the converter produced a Core ML model with inputs of fixed shape. From the dynamic variant, we obtained a Core ML model that accepts inputs of any sequence length. The converter handled both cases transparently, without any change to the conversion call.

With the Core ML TF converter, it is also possible to start with a dynamic TF graph and obtain a static Core ML model. To do so, provide a type description object containing the name and shape of the input to the conversion API, as shown below.

```
# Named input_type to avoid shadowing Python's built-in input()
input_type = ct.TensorType(name="input_node", shape=(1, 16, 19, 26))
mlmodel = ct.convert(tf_model, outputs=outputs, inputs=[input_type])
```

Under the hood, type and value inference propagates this shape information and removes all the unnecessary dynamic operations. Therefore, static models are likely to be more performant, while dynamic ones are more flexible.

The choice between the static and dynamic variants depends on the requirements of the application.
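The effect of shape propagation can be pictured with a toy example. The sketch below is purely conceptual and does not reflect how coremltools is implemented: it infers the output shape of a hypothetical operation that flattens the last two axes of the input, and reports whether the result is fully known at conversion time or still needs a shape computed at run time.

```python
def infer_flatten_shape(input_shape):
    """Toy shape inference for flattening the last two axes of a 4-D input.

    Returns (output_shape, needs_runtime_shape), where None marks an
    axis whose size is unknown until run time.
    """
    batch, time, width, coeffs = input_shape
    output_shape = (batch, time, width * coeffs)
    needs_runtime_shape = any(dim is None for dim in input_shape)
    return output_shape, needs_runtime_shape


print(infer_flatten_shape((1, 16, 19, 26)))    # ((1, 16, 494), False)
print(infer_flatten_shape((1, None, 19, 26)))  # ((1, None, 494), True)
```

When every input dimension is known, the output shape is a constant and no run-time shape operation is needed; with an unknown sequence length, the shape has to be resolved while the model runs.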

# TensorFlow 2: Convert the DistilBERT transformer model to Core ML

This example demonstrates how to convert the DistilBERT model from Hugging Face.

- Add the import statements.
  **Note:** This was tested with `transformers==2.10.0`.

```
import numpy as np
import coremltools as ct
import tensorflow as tf
from transformers import DistilBertTokenizer, TFDistilBertForMaskedLM
```

- Load the DistilBERT model and tokenizer.
  **Note:** This example uses the `TFDistilBertForMaskedLM` variant.

```
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
distilbert_model = TFDistilBertForMaskedLM.from_pretrained('distilbert-base-cased')
```

- Describe and set the input layer, then build the TensorFlow model:

```
max_seq_length = 10
input_shape = (1, max_seq_length) #(batch_size, maximum_sequence_length)
input_layer = tf.keras.layers.Input(shape=input_shape[1:], dtype=tf.int32, name='input')
prediction_model = distilbert_model(input_layer)
tf_model = tf.keras.models.Model(inputs=input_layer, outputs=prediction_model)
```

- Convert the model to the Core ML format:

```
mlmodel = ct.convert(tf_model)
```

- Create the input using `tokenizer`:

```
# Fill the input with zeros to adhere to input_shape
input_values = np.zeros(input_shape)
# Store the tokens from our sample sentence into the input
input_values[0,:8] = np.array(tokenizer.encode("Hello, my dog is cute")).astype(np.int32)
```

- Use the ML Model for prediction:

```
mlmodel.predict({'input': input_values})  # 'input' is the name given to the Keras input layer above
```
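The zero-filling step above hard-codes the number of tokens in the sample sentence. A small helper can make the same idea work for sentences of any length. The sketch below is illustrative: `pad_token_ids` is a made-up name, and the token ids are placeholders standing in for the output of `tokenizer.encode(...)`.

```python
def pad_token_ids(token_ids, max_seq_length, pad_id=0):
    """Truncate or right-pad token_ids so the result has max_seq_length items."""
    clipped = list(token_ids)[:max_seq_length]
    return clipped + [pad_id] * (max_seq_length - len(clipped))


# Hypothetical token ids standing in for tokenizer.encode(...) output.
example_ids = [101, 7, 8, 9, 102]
print(pad_token_ids(example_ids, 10))  # [101, 7, 8, 9, 102, 0, 0, 0, 0, 0]
print(pad_token_ids(example_ids, 3))   # [101, 7, 8]
```

The padded list can then be wrapped in a NumPy array of shape `(1, max_seq_length)` before being passed to `predict`. Note that plain truncation also cuts off any trailing special token, so inputs longer than `max_seq_length` may need more careful handling.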

# TensorFlow 2: Convert the TF Hub BERT transformer model to Core ML

This example demonstrates how to convert the BERT model from TensorFlow Hub.

- Add the import statements:

```
import numpy as np
import tensorflow as tf
import tensorflow_hub as tf_hub
import coremltools as ct
```

- Describe and set the input layer:

```
max_seq_length = 384
input_shape = (1, max_seq_length)
input_words = tf.keras.layers.Input(
    shape=input_shape[1:], dtype=tf.int32, name='input_words')
input_masks = tf.keras.layers.Input(
    shape=input_shape[1:], dtype=tf.int32, name='input_masks')
segment_ids = tf.keras.layers.Input(
    shape=input_shape[1:], dtype=tf.int32, name='segment_ids')
```

- Build the TensorFlow model:

```
bert_layer = tf_hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
    trainable=False)
pooled_output, sequence_output = bert_layer(
    [input_words, input_masks, segment_ids])
tf_model = tf.keras.models.Model(
    inputs=[input_words, input_masks, segment_ids],
    outputs=[pooled_output, sequence_output])
```

- Convert the model to the Core ML format:

```
mlmodel = ct.convert(tf_model, source='tensorflow')
```
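The model built above takes three inputs of length `max_seq_length`. The sketch below shows one conventional way to prepare them for a single sentence, assuming the usual BERT layout: padded token ids, a mask with 1 for real tokens and 0 for padding, and segment ids that are all 0. The helper name and the token ids are hypothetical; real ids would come from a tokenizer that matches the TF Hub model's vocabulary.

```python
def build_bert_inputs(token_ids, max_seq_length, pad_id=0):
    """Return (input_words, input_masks, segment_ids), each of shape 1 x max_seq_length."""
    ids = list(token_ids)[:max_seq_length]
    padding = max_seq_length - len(ids)
    input_words = ids + [pad_id] * padding
    input_masks = [1] * len(ids) + [0] * padding   # 1 = real token, 0 = padding
    segment_ids = [0] * max_seq_length             # single-sentence input
    return [input_words], [input_masks], [segment_ids]


# Hypothetical token ids; a short length keeps the printed output readable.
words, masks, segments = build_bert_inputs([101, 7, 8, 102], max_seq_length=8)
print(words)     # [[101, 7, 8, 102, 0, 0, 0, 0]]
print(masks)     # [[1, 1, 1, 1, 0, 0, 0, 0]]
print(segments)  # [[0, 0, 0, 0, 0, 0, 0, 0]]
```

For prediction, each list would be converted to a NumPy array and passed in a dictionary keyed by the input layer names `input_words`, `input_masks`, and `segment_ids`.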
