
UM2611
User manual

Artificial Intelligence (AI) and computer vision function pack for STM32H7 microcontrollers

Introduction

FP-AI-VISION1 is a function pack (FP) demonstrating the capability of microcontrollers in the STM32H7 Series to execute a Convolutional Neural Network (CNN) efficiently for computer vision tasks. FP-AI-VISION1 contains everything needed to build a CNN-based computer vision application on STM32H7 microcontrollers.

This user manual describes the content of the FP-AI-VISION1 function pack and details the different steps to be carried out in order to build a CNN-based computer vision application on STM32H7 microcontrollers.

UM2611 - Rev 2 - December 2019
For further information contact your local STMicroelectronics sales office.
www.st.com


1 General information

The FP-AI-VISION1 function pack runs on the STM32H7 microcontrollers based on the Arm® Cortex®-M7 processor.

Note: Arm is a registered trademark of Arm Limited (or its subsidiaries) in the US and/or elsewhere.

1.1 FP-AI-VISION1 function pack feature overview

• Runs on the STM32H747I-DISCO board connected with the STM32F4DIS-CAM camera daughterboard
• Includes complete application firmware for camera capture, frame image preprocessing, inference execution and output result display
• Includes an image classification application example based on a CNN for performing food classification on color (RGB) frame images
• Includes examples of integration of both floating-point and 8-bit quantized C models
• Supports several configurations for data memory placement in order to meet application requirements
• Includes test and validation firmware in order to test, debug and validate the embedded application
• Includes capture firmware enabling dataset collection
• Includes support for file handling (on top of FatFS) on an external microSD™ card

1.2 Software architecture

The top-level architecture of the FP-AI-VISION1 function pack usage is shown in Figure 1.


Figure 1. FP-AI-VISION1 architecture

[Figure: layered architecture. Application level: food recognition applications (float/quantized). Middleware level: STM32_AI (Neural Network runtime library), STM32_Image (image processing library), STM32_Fs (FatFS abstraction), FatFS (light FAT file system). Driver level: hardware abstraction layer (HAL) and board support package (BSP). Hardware components: STM32, LCD, camera sensor. Development boards: STM32H747I-DISCO and STM32F4DIS-CAM.]

1.3 Terms and definitions

Table 1 presents the definitions of the acronyms that are relevant for a better contextual understanding of this document.

Table 1. List of acronyms

Acronym Definition

API Application programming interface

BSP Board support package

CNN Convolutional Neural Network

DMA Direct memory access

FAT File allocation table

FatFS Light generic FAT file system

FP Function pack

FPS Frame per second

HAL Hardware abstraction layer


LCD Liquid crystal display

MCU Microcontroller unit

microSD™ Micro secure digital

MIPS Million instructions per second

NN Neural Network

RAM Random access memory

QVGA Quarter VGA

SRAM Static random access memory

VGA Video graphics array resolution

1.4 Overview of available documents and references

Table 2 lists the complementary references for using FP-AI-VISION1.

Table 2. References

ID Description

[1] User manual: Getting started with X-CUBE-AI Expansion Package for Artificial Intelligence (AI) (UM2526)

[2] Reference manual: STM32H745/755 and STM32H747/757 advanced Arm®-based 32-bit MCUs (RM0399)

[3] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications: https://arxiv.org/pdf/1704.04861.pdf

[4] The Food-101 Data Set: https://www.vision.ee.ethz.ch/datasets_extra/food-101/

[5] STM32CubeProgrammer software for programming STM32 products (STM32CubeProg)

[6] Keras - The Python Deep Learning library: https://keras.io/

[7] STM32Cube initialization code generator (STM32CubeMX)


2 Building a CNN-based computer vision application on STM32H7

Figure 2 illustrates the different steps to obtain a CNN-based computer vision application running on the STM32H7 microcontrollers.

Figure 2. CNN-based computer vision application build flow

[Figure: build flow. The FLOAT model (FoodReco_MobileNet_Derivative_Float.h5) is either converted directly by STM32Cube.AI into 32-bit floating-point C code, or first processed by the quantization tool (.h5 + .json, producing the QUANTIZED model FoodReco_MobileNet_Derivative_Quantized.h5/json) and then converted into 8-bit integer C code. In both cases, STM32Cube.AI generates network.c/h and network_data.c/h together with the Neural Network runtime library. These are integrated with the main framework (main.c/h, fp_vision_app.c/h), the image preprocessing library (img_preprocess.c/h), other libraries, and the STM32H7 drivers to build the computer vision application on STM32H7.]

Starting from a floating-point CNN model (designed and trained using a framework such as Keras), the user generates optimized C code (using the STM32Cube.AI tool, [1]) and integrates it into a computer vision framework (provided as part of FP-AI-VISION1) in order to build a computer vision application on STM32H7.

Note: For users who have selected a dual-core MCU such as the STM32H747 for their application but run it on the Cortex®-M7 core only: STM32CubeMX does not support the addition of packages such as STM32Cube.AI (X-CUBE-AI) to the project. As a consequence, when using STM32CubeMX along with STM32Cube.AI, a single-core MCU such as the STM32H743 must be selected in order to generate the Neural Network code for the Cortex®-M7 core.


The user has the possibility to select one of two options for generating the C code:
• Either generating the floating-point C code directly from the CNN model in floating-point
• Or quantizing the floating-point CNN model to obtain an 8-bit model, and subsequently generating the corresponding quantized C code

For most CNN models, the second option reduces the memory footprint (Flash and RAM) as well as the inference time. The impact on the final output accuracy depends on the CNN model as well as on the quantization process (mainly the test dataset and the quantization algorithm).

As part of the FP-AI-VISION1 function pack, an image classification application example (food recognition) is provided, including the following material:
• Floating-point Keras model (.h5 file) for the food recognition application
• Generated C code in both floating-point and 8-bit quantized formats
• Example of computer vision application integration based on C code generated by STM32Cube.AI (X-CUBE-AI)

2.1 Integration of the generated code

From a float or quantized model, the user must use the STM32Cube.AI tool (X-CUBE-AI) to generate the corresponding optimized C code.

When using the GUI version of STM32Cube.AI (X-CUBE-AI) with the user's own .ioc file, the following set of files is generated in the output directory:
• Src\network.c and Inc\network.h: contain the description of the CNN topology
• Src\network_data.c and Inc\network_data.h: contain the weights and biases of the CNN

Note: For the network, the user must keep the default name, which is "network". Otherwise, the user must rename all the functions and macros contained in the ai_interface.c and ai_interface.h files. The purpose of the ai_interface.c and ai_interface.h files is to provide an abstraction interface to the NN API.

From that point, the user must copy and replace the above generated .c and .h files respectively into the following directories (app_name is either Float_Model or Quantized_Model):
• \Projects\STM32H747I-DISCO\Applications\FoodReco_MobileNetDerivative\app_name\CM7\Src
• \Projects\STM32H747I-DISCO\Applications\FoodReco_MobileNetDerivative\app_name\CM7\Inc

An alternative solution is to use the CLI (command-line interface) version of STM32Cube.AI, so that the generated files are directly copied into the Src and Inc directories contained in the output directory provided on the command line. This solution does not require any manual copy/paste operation.

The application parameters are configured in the fp_vision_app.c and fp_vision_app.h files, where they can be easily adapted to the user's needs.
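For reference, the sketch below illustrates the kind of call sequence that the ai_interface.c/ai_interface.h abstraction wraps around the generated code, assuming the default network name "network". The macro and function names follow the X-CUBE-AI v5 application template and may differ between tool versions, so the generated network.h and network_data.h files remain the reference (the input buffer type, for instance, depends on whether the float or the quantized C model is used).

```c
#include "network.h"        /* generated by STM32Cube.AI (X-CUBE-AI) */
#include "network_data.h"

static ai_handle network = AI_HANDLE_NULL;

/* Activation (working) buffer and I/O buffers, sized with the generated macros. */
AI_ALIGNED(4) static ai_u8    activations[AI_NETWORK_DATA_ACTIVATIONS_SIZE];
AI_ALIGNED(4) static ai_float nn_input[AI_NETWORK_IN_1_SIZE];
AI_ALIGNED(4) static ai_float nn_output[AI_NETWORK_OUT_1_SIZE];

/* Create and initialize the network instance (typically done once at startup). */
static int app_network_init(void)
{
    const ai_network_params params = AI_NETWORK_PARAMS_INIT(
        AI_NETWORK_DATA_WEIGHTS(ai_network_data_weights_get()),
        AI_NETWORK_DATA_ACTIVATIONS(activations));

    ai_error err = ai_network_create(&network, AI_NETWORK_DATA_CONFIG);
    if (err.type != AI_ERROR_NONE)
        return -1;
    return ai_network_init(network, &params) ? 0 : -1;
}

/* Run one inference: nn_input in, nn_output (class probabilities) out. */
static int app_network_run(void)
{
    ai_buffer inputs[AI_NETWORK_IN_NUM]   = AI_NETWORK_IN;
    ai_buffer outputs[AI_NETWORK_OUT_NUM] = AI_NETWORK_OUT;

    inputs[0].data  = AI_HANDLE_PTR(nn_input);
    outputs[0].data = AI_HANDLE_PTR(nn_output);

    return (ai_network_run(network, inputs, outputs) == 1) ? 0 : -1;
}
```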


3 Package content

3.1 CNN model

The FP-AI-VISION1 function pack demonstrates a food-recognition CNN recognizing 18 types of food. This food-recognition CNN is a derivative of the MobileNet model (refer to [3]).

MobileNet is an efficient model architecture [3] suitable for mobile and embedded vision applications. This model architecture was proposed by Google®. The MobileNet model architecture includes two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to select a right-sized model for the application based on the constraints of the problem.

The food recognition model used in this FP has been built by adjusting these hyper-parameters for an optimal tradeoff between accuracy, computational cost and memory footprint, considering the STM32H747 target constraints.

The food-recognition CNN model has been trained on a custom database of 18 types of food:
• Apple pie
• Beer
• Caesar salad
• Cappuccino
• Cheesecake
• Chicken wings
• Chocolate cake
• Coke™
• Cupcake
• Donut
• French fries
• Hamburger
• Hot dog
• Lasagna
• Pizza
• Risotto
• Spaghetti bolognese
• Steak

The food-recognition CNN expects a color image of size 224 × 224 pixels as input, each pixel being coded on three bytes (RGB888).

The FP-AI-VISION1 function pack includes two examples based on the food recognition application: one example implementing the floating-point version of the generated code, and one example implementing the quantized version of the generated code.


3.2 Software

3.2.1 Folder organization

Figure 3 shows the folder organization of the FP-AI-VISION1 function pack.

Figure 3. FP-AI-VISION1 folder tree

[Figure: folder tree of the FP-AI-VISION1 package. The main folders are described below.]

Driver
Contains all the BSP and STM32H7 HAL source code.


Middleware
Contains four sub-folders:
• STM32_AI
The lib folder contains the Neural Network runtime libraries generated by STM32Cube.AI (X-CUBE-AI) for each IDE: EWARM, SW4STM32, and Keil®. The Inc folder contains the include files required by the runtime libraries. These two folders do not need to be replaced when converting a new Neural Network, unless a new version of the X-CUBE-AI code generator is used.
• STM32_Image
Contains a library of functions for image processing. These functions are used to preprocess the input frame image captured by the camera. The purpose of this preprocessing is to generate the adequate data (such as size, format, and others) to be input to the Neural Network during the inference.
• Third_Party/FatFS
Third-party middleware providing support for the FAT file system.
• STM32_Fs
Contains a library of functions for handling image files using FatFS on a microSD™ card.

Projects/STM32H747I-DISCO/Applications
Contains the projects and source code for the applications provided in the FP-AI-VISION1 FP. These applications are running on the STM32H747 (refer to [2]), which is a dual-core microcontroller based on the Cortex®-M7 and Cortex®-M4 processors. The application code is running only on the Cortex®-M7 core.

Projects/STM32H747I-DISCO/Applications/FoodReco_MobileNetDerivative
This folder contains one sub-folder per application example. In the present FP, there are two application examples, both based on the food recognition application:
• One demonstrating the integration of the float C model (32-bit float C code)
• One demonstrating the integration of the quantized C model (8-bit integer C code)

Each sub-folder is composed as follows:
• Binary
Contains the binaries for the applications:
– STM32H747I-DISCO_xxx_Model_yyy_zzz_Va.b.c.bin
Binaries generated from the source files contained in the xxx_Model/CM7 folder.
– STM32H747I-DISCO_xxx_Optimized_Model_yyy_zzz_Va.b.c.bin
Binaries generated from sources that are not released as part of the FP-AI-VISION1 function pack since they are generated from a specific version of the food recognition CNN model. This specific version of the model is further optimized for a better tradeoff between accuracy and embedded constraints such as memory footprint and MIPS. Contact STMicroelectronics for information about this specific version.
In the binary file names:
◦ xxx refers to the model type, which is either Float or Quantized
◦ yyy refers to the data placement scheme in the volatile memory, which is either Ext, Split, IntMem or IntFps
◦ zzz refers to the memory location from where the weight-and-bias table is accessed during inference, which is either IntFlash, QspiFlash or ExtSdram
◦ a.b.c refers to the version number of the FP release


• CM7
Contains the source code of the food recognition application example that is executed on the Cortex®-M7 core. There are two types of files:
– Files that are generated by the STM32Cube.AI tool (X-CUBE-AI):
◦ network.c and network.h: contain the description of the CNN topology
◦ network_data.c and network_data.h: contain the weights and biases of the CNN
– Files that make up the framework of the application:
◦ main.c and main.h
◦ fp_vision_app.c and fp_vision_app.h
◦ ai_interface.c and ai_interface.h: provide an abstraction of the NN API
• CM4
This folder is empty since all the code of the food recognition application is running on the Cortex®-M7 core.
• Common
Contains the source code that is common to the Cortex®-M7 and Cortex®-M4 cores.
• EWARM
Contains the IAR™ workspace and project files for the application example. It also contains the startup files for both cores.
• MDK-ARM
Contains the Keil® MDK-ARM workspace and project files for the application example. It also contains the startup files for both cores.
• SW4STM32
Contains the SW4STM32 workspace and project files for the application example. It also contains the startup files for both cores.

Note: For the EWARM, MDK-ARM and SW4STM32 sub-folders, each application project may contain several configurations. Each configuration corresponds to:
• A specific data placement in the volatile memory (RAM)
• A specific placement of the weight-and-bias table in the non-volatile memory (Flash)

Utilities/AI_resources/Food-Recognition
This sub-folder contains:
• The original trained model (file FoodReco_MobileNet_Derivative_Float.h5) for the food recognition CNN used in the application examples. This model is used to generate:
– Either directly the floating-point C code via STM32Cube.AI (X-CUBE-AI)
– Or the 8-bit quantized model via the quantization process, and then subsequently the integer C code via STM32Cube.AI (X-CUBE-AI)
• The files required for the quantization process (refer to Section 3.2.2 Quantization process):
– config_file_foodreco_nn.json: file containing the configuration parameters for the quantization operation
– test_set_generation_foodreco_nn.py: file containing the function used to prepare the test vectors used in the quantization process
• The quantized model generated by the quantization tool (files FoodReco_MobileNet_Derivative_Quantized.json and FoodReco_MobileNet_Derivative_Quantized.h5)
• The re-training script (refer to Section 3.2.3 Training script): FoodDetection.py along with a Jupyter™ notebook (FoodDetection.ipynb)
• A script (create_dataset.py) to convert a dataset of images in the format expected by the piece of firmware performing the validation on board (refer to Onboard Validation in Section 3.2.8 Embedded validation and testing)


3.2.2 Quantization process

The quantization process consists in quantizing the parameters (weights and biases) as well as the activations of a NN in order to obtain a quantized model whose parameters and activations are represented on 8-bit integers. Quantizing a model reduces the memory footprint because weights, biases, and activations are stored on 8 bits instead of 32 bits in a float model. It also reduces the inference execution time thanks to the optimized DSP unit of the Cortex®-M7 core.

Several quantization schemes are supported by the STM32Cube.AI tool:
• Fixed-point Qm,n
• Integer arithmetic (signed and unsigned)

Refer to the STM32Cube.AI tool (X-CUBE-AI) documentation in [1] for more information on the different quantization schemes and how to run the quantization process.

Note:
• Two datasets are required for the quantization operation. It is up to the user to provide his own datasets.
• The impact of the quantization on the accuracy of the final output depends on the CNN model (that is, its topology), but also on the quantization process: the test dataset and the quantization algorithm have a significant impact on the final accuracy.
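As a purely illustrative example of what representing a value on an 8-bit integer means, the sketch below shows an asymmetric affine quantization with a scale and a zero-point. It is not the exact arithmetic of the STM32Cube.AI schemes listed above (fixed-point Qm,n, signed/unsigned integer), which are documented in [1]; the type and function names below are generic.

```c
#include <math.h>
#include <stdint.h>

/* Illustration only: asymmetric 8-bit affine quantization with a scale and a
 * zero-point. The actual schemes used by STM32Cube.AI are described in [1]. */
typedef struct {
    float   scale;      /* real value represented by one quantization step   */
    uint8_t zero_point; /* quantized value that maps to the real value 0.0   */
} quant_params_t;

static uint8_t quantize(float x, quant_params_t p)
{
    float q = roundf(x / p.scale) + (float)p.zero_point;
    if (q < 0.0f)   q = 0.0f;     /* saturate to the uint8 range */
    if (q > 255.0f) q = 255.0f;
    return (uint8_t)q;
}

static float dequantize(uint8_t q, quant_params_t p)
{
    return p.scale * ((float)q - (float)p.zero_point);
}
```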


3.2.3 Training script

A training script is provided in file Utilities/AI_ressources/Food-Recognition/FoodDetection.ipynb as an example showing how to train the MobileNet derivative model used in the function pack. As the dataset used to train the model provided in the function pack is not publicly available, the training script relies on a subset of the Food-101 dataset (refer to [4]). This publicly available dataset contains images of 101 food categories with 1000 images per category.

In order to keep the training process short, the script uses only 50 images per food category, and limits the training of the model to 20 epochs. To achieve a training on the whole dataset, the variable max_imgs_per_class in section "Prepare the test and train datasets" must be set to np.inf.

Note: The use of a GPU is recommended for the complete training on the whole dataset. The Jupyter™ notebook is also available as a plain Python™ script in the Utilities/AI_ressources/Food-Recognition/FoodDetection.py file.

3.2.4 Memory requirements

When integrating a C model generated by the STM32Cube.AI (X-CUBE-AI) tool, the following memory requirements must be considered:
• Volatile (RAM) memory requirement: memory space is required to allocate:
– The inference working buffer (called the activation buffer in this document). This buffer is used during inference to store the temporary results of the intermediate layers within the Neural Network.
– The inference input buffer (called the nn_input buffer in this document), which is used to hold the input data of the Neural Network.
• Non-volatile (Flash) memory requirement: memory space is required to store the table containing the weights and biases of the network model.

On top of the above-listed memory requirements, some more requirements come into play when integrating the C model for a computer vision application:
• Volatile (RAM) memory requirement: memory space is required in order to allocate the various buffers that are used across the execution of the image pipeline (camera capture, frame pre-processing).


3.2.4.1 Application execution flow and volatile (RAM) data memory requirement

In the context of a computer vision application, the integration requires several data buffers, as illustrated in Figure 4. Table 3 shows the different data buffers required during the execution flow of the application.

Figure 4. Data buffers during execution flow

[Figure: execution flow and associated data buffers. After the initialization phase, the DMA copies the DCMI data register into the camera_frame buffer (frame capture). Once the capture is completed, the DMA2D copies the camera_frame buffer into the rgb24_image buffer with color format conversion (RGB565 to RGB888). The capture of the next frame starts either at this point or after completion of the NN inference, depending on the memory allocation scheme (FPS- or memory-optimized). The image is then rescaled (VGA/QVGA to 224 x 224) from the rgb24_image buffer into the rgb24_scaled_image buffer, and the pixel value scaling writes the nn_input buffer. Finally, the Neural Network inference reads the nn_input buffer, uses the activation (working) buffer, and writes the nn_output buffer. Legend: R: read operation, W: write operation; hardware operations (DMA, DMA2D) versus software operations; peripheral register memory versus RAM data memory.]


The application executes the following operations in sequence:
1. Copy the frame acquired by the camera (via the DMA engine, from the DCMI data register) into the camera_frame buffer.
2. Copy (via the DMA2D engine) with pixel color format conversion (RGB565 to RGB888, which fits the three input channels expected by the food recognition CNN) from the camera_frame buffer into the rgb24_image buffer. The content of the camera_frame buffer is also copied into the frame buffer of the LCD (not shown in Table 3 and always located in the external SDRAM) via the DMA2D engine, with a conversion from the RGB565 to the ARGB8888 format.
3. At this point, depending on the memory allocation scheme chosen, it is possible to launch the capture of the subsequent frame.
4. Rescale the image contained in the rgb24_image buffer into the rgb24_scaled_image buffer to match the expected food recognition CNN input tensor dimensions: height/width = 224/224.
5. Scale the value of each pixel contained in the rgb24_scaled_image buffer into the nn_input buffer. The purpose is to fit the pixel value range expected by the food recognition NN: in the present case, the model is trained with pixel values normalized between 0 and 1 (a minimal sketch of this scaling is given after the note below).
6. Run the inference of the food recognition NN: the nn_input buffer as well as the activation buffer are provided as input to the NN. The results of the classification are stored into the nn_output buffer.
7. Analyze the nn_output buffer content and display the results on the LCD display.

The activation buffer is the working buffer used by the NN during the inference. Its size depends on the NN model used.

Note: The activation buffer size is provided by the STM32Cube.AI tool (X-CUBE-AI) when analyzing the NN (refer to Table 3 for an example).

The nn_output buffer is where the classification output results are stored. In the food recognition NN provided in the FP-AI-VISION1 function pack, the size of this buffer is 18 × 4 = 72 bytes: 18 corresponds to the number of output classes and 4 corresponds to the fact that the probability of each output class is provided as a float value (single precision, coded on 4 bytes).
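The two sketches below illustrate, for the float C model, the pixel value scaling of step 5 (each RGB888 byte normalized between 0 and 1) and the analysis of the nn_output buffer in step 7 (selection of the most probable of the 18 classes). Buffer and function names mirror the description above but are illustrative, not the actual FP-AI-VISION1 sources.

```c
#include <stdint.h>

#define NN_INPUT_W    224
#define NN_INPUT_H    224
#define NN_CHANNELS     3
#define NN_NB_CLASSES  18

/* Step 5 (float C model): scale each RGB888 pixel byte of rgb24_scaled_image
 * into a float normalized between 0 and 1, as expected by the food recognition NN. */
static void pixel_value_scaling(const uint8_t *rgb24_scaled_image, float *nn_input)
{
    for (uint32_t i = 0; i < NN_INPUT_W * NN_INPUT_H * NN_CHANNELS; i++) {
        nn_input[i] = (float)rgb24_scaled_image[i] / 255.0f;
    }
}

/* Step 7: the nn_output buffer holds one float probability per class
 * (18 x 4 = 72 bytes); the displayed result is the class with the highest one. */
static int best_class(const float *nn_output)
{
    int best = 0;
    for (int i = 1; i < NN_NB_CLASSES; i++) {
        if (nn_output[i] > nn_output[best])
            best = i;
    }
    return best;
}
```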


Table 3 details the amount of data RAM required by the food recognition application when integrating the quantized C model or the float C model.

Table 3. RAM buffers

• camera_frame: 16 bits per pixel (RGB565)
– VGA capture: 614.4 Kbytes (640 × 480 × 2), quantized and float C models
– QVGA capture: 153.6 Kbytes (320 × 240 × 2), quantized and float C models
• rgb24_image: 24 bits per pixel (RGB888)
– VGA capture: 921.6 Kbytes (640 × 480 × 3), quantized and float C models
– QVGA capture: 230.4 Kbytes (320 × 240 × 3), quantized and float C models
• rgb24_scaled_image: 24 bits per pixel (RGB888)
– 150.5 Kbytes (224 × 224 × 3), quantized and float C models, for any capture resolution
• nn_input(1): 24 bits per pixel (RGB888)
– Quantized C model: 150.5 Kbytes (224 × 224 × 3)
– Float C model: 602.1 Kbytes (224 × 224 × 3 × 4)
• activation(1)
– Quantized C model: ~115 Kbytes
– Float C model: ~500 Kbytes
• nn_output
– 72 bytes (18 × 4), quantized and float C models
• Total
– Quantized C model: ~1.98 Mbytes (VGA capture) / ~810 Kbytes (QVGA capture)
– Float C model: ~2.8 Mbytes (VGA capture) / ~1.6 Mbytes (QVGA capture)

1. When generating the C code with the STM32Cube.AI tool: if the "allocate input in activation" option is selected (either via the --allocate-inputs option in the CLI or via the "Use activation buffer for input buffer" checkbox in the advanced settings of the GUI), STM32Cube.AI overlays the nn_input buffer with the activation buffer, thus resulting in a bigger activation size. However, in this case, the nn_input buffer is no longer required. In the end, the overall amount of memory required is reduced. In the given quantized C model, when "allocate input in activation" is selected, the size of the generated activation buffer is ~167.5 Kbytes, which is bigger than 115 Kbytes but lower than 265.5 Kbytes (115 + 150.5).


3.2.4.2 STM32H747 internal SRAM

Table 4 represents the internal SRAM memory map of the STM32H747XIH6 microcontroller:

Table 4. STM32H747XIH6 SRAM memory map

Memory Address range Size (Kbyte)

DTCM RAM 0x2000 0000 ‑ 0x2001 FFFF 128

Reserved 0x2002 0000 ‑ 0x23FF FFFF -

AXI SRAM 0x2400 0000 ‑ 0x2407 FFFF 512

Reserved 0x2408 0000 ‑ 0x2FFF FFFF -

SRAM1 0x3000 0000 ‑ 0x3001 FFFF 128

SRAM2 0x3002 0000 ‑ 0x3003 FFFF 128

SRAM3 0x3004 0000 ‑ 0x3004 7FFF 32

Reserved 0x3004 8000 ‑ 0x37FF FFFF -

SRAM4 0x3800 0000 ‑ 0x3800 FFFF 64

Reserved 0x3801 0000 ‑ 0x387F FFFF -

BCKUP SRAM 0x3880 0000 ‑ 0x3880 0FFF 4

Total - ~1 Mbyte

The STM32H747XIH6 features about 1 Mbyte of internal SRAM. However, it is important to note that:
• This 1 Mbyte is not a contiguous memory space. The largest block of SRAM is the AXI SRAM, which has a size of 512 Kbytes.
• The nn_input buffer must be placed in a contiguous area of memory. The same applies to the activation buffer. As a result, the following basic rule applies: if either the nn_input buffer or the activation buffer is larger than 512 Kbytes, the generated C model cannot execute by relying only on the internal SRAM. In this case, additional external data RAM is required.

3.2.4.3 Buffer placement in volatile (RAM) data memory

The comparison of the sizes in Table 3 and Table 4 leads to the following conclusions regarding the integration of the food recognition C models:
• The float C model implementation does not fit completely into the internal SRAM, whatever the camera resolution selected: some external RAM is required in order to run this use case.
In the context of the FP-AI-VISION1 FP, the two following memory layouts are implemented:
– Full External, which consists in placing all the buffers (listed in Table 3) in the external RAM
– Split External/Internal, which consists in placing the activation buffer (which includes the nn_input buffer if the allocate input in activation option is selected in the STM32Cube.AI tool) in the internal SRAM and the other buffers in the external RAM


• Concerning the implementation of the quantized C model:
– If the VGA camera resolution is selected:
The implementation does not fit completely into the internal SRAM: some external RAM is required in order to run this use case. In the context of the FP-AI-VISION1 FP, two memory layout schemes are implemented:
◦ Full External: consists in placing all the buffers (listed in Table 3) in the external RAM.
◦ Split External/Internal: consists in placing the activation buffer (which includes the nn_input buffer if the allocate input in activation option is selected in the STM32Cube.AI tool) in the internal SRAM and the other buffers in the external RAM.
Choosing one scheme or the other comes down to the following tradeoff:
◦ Release as much internal SRAM as possible so that it is available for other user applications possibly requiring a significant amount of internal SRAM
◦ Reduce the inference time by having the working buffer (activation buffer) allocated in a faster memory such as the internal SRAM
– If the QVGA camera resolution is selected:
The implementation fits completely into the internal SRAM and no external RAM is required. Section 3.2.4.4 Optimizing the internal SRAM memory space describes the different memory layout schemes implemented when the QVGA resolution is selected for the quantized C model.

3.2.4.4 Optimizing the internal SRAM memory space

For the use case fitting integrally in the internal SRAM, several memory allocation schemes are possible. However, with a view to optimizing the internal SRAM space, two schemes are chosen for implementing the food recognition applications.

These two implementation schemes rely on the allocate input in activation feature offered by the STM32Cube.AI tool (X-CUBE-AI) in order to optimize the space required for the activation and nn_input buffers: basically, the nn_input buffer is allocated within the activation buffer, making both buffers "overlaid". (Refer to [1] for more information on how to enable these optimizations when generating the Neural Network C code.)

First scheme: full internal, memory optimized

A single and unique physical memory space is allocated for all the buffers: camera_frame, rgb24_image, rgb24_scaled_image, nn_input, nn_output and activation buffers. In other words, these six buffers are "overlaid", which means that a unique physical memory space is used for six different purposes during the application execution flow.

The size of this memory space must be equal to the size of the largest among the six "overlaid" buffers: in our case, it is the rgb24_image buffer, which has a size of 230.4 Kbytes.

The advantage of this approach is that the required memory space is minimized.

This approach also has two main drawbacks:
• The data stored in the memory space are overwritten (and thus no longer available) as the application moves forward along the execution flow.
• The memory space is not available for a new camera capture until the NN inference is completed. As a consequence, the maximum number of frames that can be processed per second is reduced.

Conclusion: this first memory allocation scheme reduces the amount of memory required by the food recognition application from ~810 Kbytes to ~230.4 Kbytes. This is at the expense of the number of frames that can be processed per second.
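A minimal sketch of the single shared memory area that this first scheme relies on is shown below. The array name, the section name, and the aliasing are illustrative only: in the actual FP-AI-VISION1 projects, the overlay is defined by the project configuration and linker files, and the logical buffers are laid out so that each processing step does not overwrite data it still needs to read.

```c
#include <stdint.h>

/* "Full internal, memory optimized" sketch: one physical area in the AXI SRAM,
 * sized for the largest overlaid buffer (rgb24_image, 230.4 Kbytes with QVGA
 * capture), is reused by the six logical buffers along the execution flow. */
#define OVERLAY_SIZE (320 * 240 * 3)   /* rgb24_image: largest of the six buffers */

__attribute__((section(".axi_sram"), aligned(32)))   /* example section name */
static uint8_t overlay_area[OVERLAY_SIZE];

/* The logical buffers are views of the same physical area (offsets omitted here): */
static uint8_t *const camera_frame       = overlay_area;         /* RGB565 capture      */
static uint8_t *const rgb24_image        = overlay_area;         /* RGB888 image        */
static uint8_t *const rgb24_scaled_image = overlay_area;         /* 224 x 224 x 3 image */
static uint8_t *const activation_buffer  = overlay_area;         /* includes nn_input   */
static float   *const nn_output          = (void *)overlay_area; /* 18 class scores     */
```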


The whole amount of memory required is allocated in a single region, the AXI SRAM region, as illustrated in Figure 5.

Figure 5. SRAM allocation - Memory optimized scheme

[Figure: memory optimized scheme. A single SRAM memory space of 230.4 Kbytes (the rgb24_image buffer size) is allocated in the AXI SRAM region starting at address 0x2400 0000. The camera_frame, rgb24_image, rgb24_scaled_image, nn_input (allocated in activation), nn_output and activation buffers are all overlaid in this space along the application execution flow (DMA transfer from the DCMI data register, DMA2D transfer with PFC, image rescaling, pixel conversion, Neural Network inference).]

Note: By enabling the memory optimization feature in the STM32Cube.AI tool ("allocate input in activation"), only ~167.5 Kbytes of memory are required to hold both the activation and the nn_input buffers (versus 150.5 Kbytes + ~115 Kbytes = ~265.5 Kbytes if this feature is not enabled).

Second scheme: full internal, FPS optimized

• A first physical memory space is allocated only for the camera_frame buffer. The advantage of this approach is that a subsequent frame can be captured as soon as the content of the camera_frame buffer has been copied into the rgb24_image buffer, without waiting for the inference to complete. As a result, the number of frames processed per second is optimal. This is at the expense of the amount of memory space required.
• A second unique physical memory space is allocated for all the other buffers: rgb24_image, rgb24_scaled_image, nn_input, nn_output and activation buffers. In other words, these five buffers are "overlaid", which means that a unique physical memory space is used for five different purposes during the application execution flow. The size of this memory space must be equal to the size of the largest of the five "overlaid" buffers: in our case, it is the rgb24_image buffer, which has a size of 230.4 Kbytes.

Conclusion: this second memory allocation scheme reduces the amount of memory required by the food recognition application from ~810 Kbytes to ~384 Kbytes. The amount of memory required by the second scheme is higher than for the first scheme, but the second scheme optimizes the number of frames processed per second.


The whole amount of memory required is allocated in a single region, the AXI SRAM region, as illustrated in Figure 6.

Figure 6. SRAM allocation - FPS optimized scheme

[Figure: FPS optimized scheme. An SRAM memory space of 384 Kbytes (camera_frame buffer size + rgb24_image buffer size) is allocated in the AXI SRAM region starting at address 0x2400 0000. The camera_frame buffer has its own memory space, while the rgb24_image, rgb24_scaled_image, nn_input (allocated in activation), nn_output and activation buffers are overlaid in a second space along the application execution flow (DMA transfer from the DCMI data register, DMA2D transfer with PFC, image rescaling, pixel conversion, Neural Network inference).]

Note: By enabling the memory optimization feature in the STM32Cube.AI tool ("allocate input in activation"), only ~167.5 Kbytes of memory are required to hold both the activation and the nn_input buffers (versus 150.5 Kbytes + ~115 Kbytes = ~265.5 Kbytes if this feature is not enabled).

3.2.4.5 Weight and bias placement in non-volatile (Flash) memory

The STM32Cube.AI (X-CUBE-AI) code generator generates a table (defined as constant) containing the weights and biases. Therefore, by default, the weights and biases are stored in the internal Flash memory.

There are use cases where the table of weights and biases does not fit into the internal Flash memory (the size of which is 2 Mbytes, shared between read-only data and code). In such a situation, the user has the possibility to store the weight-and-bias table in the external Flash memory.

The STM32H747I-DISCO Discovery board supported in the context of this FP has a serial external Flash that is interfaced to the STM32H747 MCU via a Quad-SPI. An example illustrating how to place the weight-and-bias table in the external Flash memory is provided in this FP under the Projects/STM32H747I-DISCO/Applications/FoodReco_MobileNetDerivative/Float_Model directory (configurations STM32H747I-DISCO_FoodReco_Float_Ext_Qspi, STM32H747I-DISCO_FoodReco_Float_Split_Qspi, STM32H747I-DISCO_FoodReco_Float_Split_Sdram and STM32H747I-DISCO_FoodReco_Float_Ext_Sdram).


Follow the steps below to load the weight-and-bias table in the QSPI external Flash memory:
1. Define flag WEIGHT_QSPI to 1 and flag WEIGHT_QSPI_PROGED to 0 to define a memory placement section in the memory range corresponding to the Quad-SPI memory interface (0x9000 0000), and place the table of weights and biases into it.
In order to speed up the inference, it is possible for the user to set flag WEIGHT_EXEC_EXTRAM to 1 (as demonstrated in configurations STM32H747I-DISCO_FoodReco_Float_Split_Sdram and STM32H747I-DISCO_FoodReco_Float_Ext_Sdram) so that the weight-and-bias table gets copied from the external QSPI Flash memory into the external SDRAM memory at program initialization.
2. Generate a binary of the whole application as a .hex file (not as a .bin file). Load the .hex file using the STM32CubeProgrammer (STM32CubeProg) tool [5]. First select the external Flash loader for the STM32H747I-DISCO Discovery board.
Figure 7 and Figure 8 are snapshots of the STM32CubeProgrammer (STM32CubeProg) tool showing the sequence of steps to program the full binary of the application into the Flash memory.

Figure 7. Flash programming (1 of 2)

[Screenshot: step 1, select the external Flash loader tab; step 2, select the Flash loader for the STM32H747I-DISCO Discovery board.]


Figure 8. Flash programming (2 of 2)

[Screenshot: step 3, browse for the binary (.hex) of the full application; step 4, load the binary into the target memory.]

3. Once the table of weights and biases is loaded into the external memory, it is possible to continue the debugging through the IDE by defining flag WEIGHT_QSPI_PROGED to 1.

Defining flag WEIGHT_QSPI to 0 generates a program with the weight-and-bias table located in the internal Flash memory.

Table 5 summarizes the combinations of the compile flags.

Table 5. Compile flags

• WEIGHT_QSPI = 0 (whatever the values of WEIGHT_QSPI_PROGED and WEIGHT_EXEC_EXTRAM): generates a binary with the weight/bias table placed in the internal Flash memory
• WEIGHT_QSPI = 1, WEIGHT_QSPI_PROGED = 0, WEIGHT_EXEC_EXTRAM = 0: generates a binary with the weight/bias table placed in the external QSPI Flash memory(1)
• WEIGHT_QSPI = 1, WEIGHT_QSPI_PROGED = 0, WEIGHT_EXEC_EXTRAM = 1: generates a binary with the weight/bias table placed in the external QSPI Flash memory(1); the table is copied from the external QSPI Flash into the external SDRAM at startup
• WEIGHT_QSPI = 1, WEIGHT_QSPI_PROGED = 1, WEIGHT_EXEC_EXTRAM = 0: generates a binary without the weight/bias table (since it is already loaded)
• WEIGHT_QSPI = 1, WEIGHT_QSPI_PROGED = 1, WEIGHT_EXEC_EXTRAM = 1: generates a binary without the weight/bias table (since it is already loaded); the table is copied from the external QSPI Flash into the external SDRAM at startup

1. To be loaded using the STM32CubeProgrammer tool (STM32CubeProg).
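As an illustration of the mechanism that these flags control, the sketch below places a weight table in a section mapped to the Quad-SPI address range and optionally copies it to the external SDRAM at initialization. Only the flag names come from the FP; the section name, the SDRAM address, the table size and the helper function are hypothetical (WEIGHT_QSPI_PROGED only controls whether the table data are kept in the generated binary, so it is not shown). The real implementation is in the network_data.c file and the linker/scatter files of each configuration.

```c
#include <stdint.h>
#include <string.h>

#if WEIGHT_QSPI
/* Table placed in a section mapped to the Quad-SPI memory range (0x9000 0000). */
#define WEIGHTS_SECTION __attribute__((section(".qspi_weights")))  /* example name */
#else
/* Default: table placed in the internal Flash with the other constants. */
#define WEIGHTS_SECTION
#endif

WEIGHTS_SECTION static const uint8_t s_network_weights[1024] = { 0 }; /* size is illustrative */

#if WEIGHT_QSPI && WEIGHT_EXEC_EXTRAM
/* Copy the table from the external QSPI Flash into the external SDRAM at program
 * initialization, so that it is accessed from the SDRAM during the inference. */
static uint8_t *const weights_in_sdram = (uint8_t *)0xD0000000; /* example SDRAM address */

static const uint8_t *weights_get(void)
{
    memcpy(weights_in_sdram, s_network_weights, sizeof(s_network_weights));
    return weights_in_sdram;
}
#else
static const uint8_t *weights_get(void) { return s_network_weights; }
#endif
```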


Refer to Section 3.2.5 Execution performance and overall memory footprint for the performance when the weight-and-bias table is accessed from the external QSPI Flash and the external SDRAM versus the internal Flash.

When generating new C code from a new model, the usual way to proceed (as described in Section 2.1 Integration of the generated code) is to replace the existing network.c, network.h, network_data.c and network_data.h files by the generated ones. If placement in external memory is required, the existing network_data.c file must not be replaced. Instead, only the content of the weight-and-bias table (named s_network_weights[]) contained in the existing network_data.c file must be replaced by the content of the table contained in the generated network_data.c file.

3.2.4.6 Summary of data placement in volatile and non-volatile memory

The FP-AI-VISION1 function pack demonstrates the implementation of four different schemes for allocating the data in volatile memory:
• Full internal memory, FPS optimized:
All the buffers (listed in Table 3) are located in the internal SRAM. In order to enable the whole system to execute from the internal memory, some memory buffers are overlaid (as shown in Figure 6). Also, it is required to set the camera capture resolution to QVGA (320 × 240, pixels coded in RGB565). In this memory layout scheme, the camera_frame buffer is not overlaid, which enables a new camera capture while the inference is running, hence maximizing the number of frames per second (FPS).
• Full internal memory, memory optimized:
All the buffers (listed in Table 3) are located in the internal SRAM. In order to enable the whole system to execute from the internal memory, some memory buffers are overlaid (as shown in Figure 5). Also, it is required to set the camera capture resolution to QVGA (320 × 240, pixels coded in RGB565). In this memory layout scheme, the camera_frame buffer is overlaid as well, which minimizes the internal SRAM occupation.
• Full external memory:
All the buffers (listed in Table 3) are located in the external SDRAM.
• Split internal/external memory:
All the buffers are located in the external SDRAM, except the activation buffer (which includes the nn_input buffer if the allocate input in activation option is selected in the STM32Cube.AI tool), which is located in the internal SRAM.

The FP-AI-VISION1 function pack also demonstrates the implementation of three different ways to access (at inference time) the data stored in non-volatile memory:
• Access from the internal Flash memory
• Access from the external QSPI Flash memory
• Access from the external SDRAM: in this case, the data are stored in one of the two non-volatile memories (either the internal Flash or the external QSPI Flash) and copied into the external SDRAM at program startup


3.2.5 Execution performance and overall memory footprint

Table 6 presents the camera capture time and preprocessing time measured for different models and frame sizes.

Table 6. Measurements of frame capture and preprocessing times

Measurements(1) (ms):
• Camera frame capture time(2)(3): ~65 (VGA capture, 640 × 480) / ~35 (QVGA capture, 320 × 240)
• Pixel format conversion (using DMA2D): ~10 (VGA) / ~0.5 (QVGA)
• Image rescaling time(4) (to 224 × 224): ~6 to ~25 (VGA) / ~3 to ~20 (QVGA)
• Pixel conversion time(5): ~4 (VGA, quantized C model) / ~11 (VGA, float C model) / ~1 (QVGA, quantized C model) / - (QVGA, float C model)

1. Measurement conditions: STM32H747 at 400 MHz CPU clock, code compiled with EWARM v8.40.2 with option -O3.
2. The values depend on the lighting conditions since the exposure time may vary.
3. Values for the OV9655 module of the STM32F4DIS-CAM camera daughterboard connected with the STM32H747I-DISCO Discovery board.
4. The value varies according to the image rescaling algorithm being used ("Nearest Neighbor" or "Bilinear"). Values are different for VGA (full external memory configuration) and QVGA (full internal memory configuration) because of the number of pixels to resize from, but also because the buffers (rgb24_image and rgb24_scaled_image) are located in the external SDRAM memory versus the internal SRAM memory for the QVGA scenario. In the FP-AI-VISION1 FP, the "Nearest Neighbor" algorithm is used as the default rescaling method. Depending on the type of application (such as the Neural Network used), using the "Bilinear" algorithm as the rescaling method can bring some improvement in output accuracy, at the expense of the rescaling execution time.
5. A pre-computed look-up table is used for the pixel conversion when dealing with the quantized model. Values are different for VGA (full external memory configuration) and QVGA (full internal memory configuration) since the buffers (rgb24_scaled_image and nn_input) are located in the external SDRAM memory versus the internal SRAM memory for a QVGA scenario.
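Note 5 above mentions a pre-computed look-up table for the pixel conversion used with the quantized model. A minimal sketch of this technique is given below: the mapping from an 8-bit pixel value to the quantized NN input value is computed once, so the per-pixel work reduces to a table access. The scale and zero-point are illustrative parameters of an assumed asymmetric uint8 input quantization, not the values of the FP model.

```c
#include <math.h>
#include <stdint.h>

/* Sketch of the look-up table technique of note 5: one entry per possible
 * 8-bit pixel value, combining the normalization to [0, 1] and the input
 * quantization into a single table access. */
static uint8_t pixel_lut[256];

static void pixel_lut_init(float scale, uint8_t zero_point)
{
    for (int v = 0; v < 256; v++) {
        float q = roundf(((float)v / 255.0f) / scale) + (float)zero_point;
        if (q < 0.0f)   q = 0.0f;     /* saturate to the uint8 range */
        if (q > 255.0f) q = 255.0f;
        pixel_lut[v] = (uint8_t)q;
    }
}

static void pixel_conversion_quantized(const uint8_t *rgb24_scaled_image,
                                       uint8_t *nn_input, uint32_t n_bytes)
{
    for (uint32_t i = 0; i < n_bytes; i++) {
        nn_input[i] = pixel_lut[rgb24_scaled_image[i]];
    }
}
```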

Considering the different possible models (float and quantized), the different possible camera resolutions (VGA and QVGA), the different possible locations for data in volatile memory (full external, split, full internal FPS optimized, full internal memory optimized), and the different possible locations of the weight-and-bias table during inference execution (internal Flash, external QSPI Flash, external SDRAM), the number of possible configurations is quite high.


Table 7 summarizes the configurations that are supported in the FP-AI-VISION1 function pack.

Table 7. Configurations supported by the FP-AI-VISION1 function pack

• Full internal, memory optimized:
– Quantized C model, QVGA capture (320 × 240): weight/bias table in internal Flash: YES; external QSPI Flash: NO(2); external SDRAM: NO(2)
– Quantized C model, VGA capture (640 × 480): NO(1) for all weight/bias table locations
– Float C model, VGA or QVGA capture: NO(1) for all weight/bias table locations
• Full internal, FPS optimized:
– Same support as the full internal, memory optimized scheme
• Full external:
– Quantized C model, VGA capture: weight/bias table in internal Flash: YES; external QSPI Flash: NO(2); external SDRAM: NO(2)
– Quantized C model, QVGA capture: NO(2) for all weight/bias table locations
– Float C model, VGA capture: weight/bias table in internal Flash: YES; external QSPI Flash: YES; external SDRAM: YES
– Float C model, QVGA capture: NO(2) for all weight/bias table locations
• Split external/internal:
– Same support as the full external scheme

1. Configuration not possible.
2. Configuration possible but not supported.

The values provided in Table 8 and Table 9 below result from measurements performed using application binaries built with the EWARM IDE.


Table 8 presents the execution performance of the food recognition applications with the following conditions:
• IAR EWARM v8.40.2
• -O3 option
• Cortex®-M7 core clock frequency set to 400 MHz

Table 8. Execution performance of the food recognition application

• Float C model, VGA capture, output accuracy ~73 %:
– Full external, weight/bias table in internal Flash: NN inference time 210 ms, 4.2 FPS(1), binary file A
– Full external, external QSPI Flash: 372 ms, 2.5 FPS, binary file B
– Full external, external SDRAM: 317 ms, 2.9 FPS, binary file C
– Split, internal Flash: 191 ms, 4.6 FPS, binary file D
– Split, external QSPI Flash: 355 ms, 2.6 FPS, binary file E
– Split, external SDRAM: 290 ms, 3.2 FPS, binary file F
• Quantized C model(2), output accuracy ~72.5 %, weight/bias table in internal Flash:
– VGA capture, full external: 86 ms, 9.4 FPS, binary file G
– VGA capture, split: 82 ms, 9.9 FPS, binary file H
– QVGA capture, full internal, memory optimized: 82 ms, 8.4 FPS, binary file I
– QVGA capture, full internal, FPS optimized: 82 ms, 11.6 FPS, binary file J

1. FPS is calculated taking into account the capture, preprocessing (using the "Nearest Neighbor" image resizing algorithm), and inference times.
2. Food recognition CNN model quantized using the quantization tool (provided by the STM32Cube.AI tool v5.0.0) with the "UaUa per layer" quantization scheme.

The FPS values provided in Table 8 are maximum values. The FPS values can vary depending on the lighting conditions. Indeed, the FPS value includes the inference time, but also the preprocessing and camera capture times (refer to Table 6). The capture time depends on the lighting conditions via the exposure time.

The binary files, contained in the Binary folder, associated with the scenarios in Table 8 are listed below:
• A: STM32H747I-DISCO_Float_Model_Ext_IntFlash_V1.1.0.bin
• B: STM32H747I-DISCO_Float_Model_Ext_QspiFlash_V1.1.0.hex
• C: STM32H747I-DISCO_Float_Model_Ext_ExtSdram_V1.1.0.hex
• D: STM32H747I-DISCO_Float_Model_Split_IntFlash_V1.1.0.bin
• E: STM32H747I-DISCO_Float_Model_Split_QspiFlash_V1.1.0.hex
• F: STM32H747I-DISCO_Float_Model_Split_ExtSdram_V1.1.0.hex
• G: STM32H747I-DISCO_Quantized_Model_Ext_IntFlash_V1.1.0.bin
• H: STM32H747I-DISCO_Quantized_Model_Split_IntFlash_V1.1.0.bin
• I: STM32H747I-DISCO_Quantized_Model_IntMem_IntFlash_V1.1.0.bin
• J: STM32H747I-DISCO_Quantized_Model_IntFps_IntFlash_V1.1.0.bin


Table 9 presents the execution performance of the recognition application generated from the optimized model (not provided as part of the FP-AI-VISION1 function pack; contact STMicroelectronics for information about this specific version). The measurement conditions are:
• IAR EWARM v8.40.2
• -O3 option
• Cortex®-M7 core clock frequency set to 400 MHz

Table 9. Execution performance of the optimized food recognition application

• Float C model, VGA capture, output accuracy ~78 %:
– Full external, weight/bias table in internal Flash: NN inference time 420 ms, 2.2 FPS(1), binary file K
– Split: not possible
• Quantized C model(2), output accuracy ~77.5 %, weight/bias table in internal Flash:
– VGA capture, full external: 153 ms, 5.8 FPS, binary file L
– VGA capture, split: 146 ms, 6.1 FPS, binary file M
– QVGA capture, full internal, memory optimized: 146 ms, 5.5 FPS, binary file N
– QVGA capture, full internal, FPS optimized: 146 ms, 6.7 FPS, binary file O

1. FPS is calculated taking into account the capture, preprocessing (using the "Nearest Neighbor" image resizing algorithm), and inference times.
2. Food recognition CNN model quantized using the quantization tool (provided by the STM32Cube.AI tool v5.0.0) with the "UaUa per layer" quantization scheme.

The binary files, contained in the Binary folder, associated with the scenarios in Table 9 are listed below:
• K: STM32H747I-DISCO_Float_Optimized_Model_Ext_IntFlash_V1.1.0.bin
• L: STM32H747I-DISCO_Quantized_Optimized_Model_Ext_IntFlash_V1.1.0.bin
• M: STM32H747I-DISCO_Quantized_Optimized_Model_Split_IntFlash_V1.1.0.bin
• N: STM32H747I-DISCO_Quantized_Optimized_Model_IntMem_IntFlash_V1.1.0.bin
• O: STM32H747I-DISCO_Quantized_Optimized_Model_IntFps_IntFlash_V1.1.0.bin

The straight model (Table 8) shows better execution performance in some cases than the optimized model (Table 9). However, this is at the expense of the accuracy, which is better when using the optimized model, since in both cases the CNN working buffer (activation buffer) is located in the internal SRAM.

Full External versus Split

The Full External memory scheme slightly impacts the inference time (around 10 %) since it requires more CPU cycles to access the CNN working buffer (activation buffer, which includes the nn_input buffer if the allocate input in activation option is selected in the STM32Cube.AI tool) when it is located in the external SDRAM versus the internal SRAM. However, the L1 data cache memory of the Cortex®-M7 (16 Kbytes) limits this impact. In the Split case, the inference time is the same as the one observed in the Full Internal memory implementation.


CNN impact on the final accuracy

As mentioned previously in Section 3.2.2 Quantization process, the type of CNN has an impact on the final accuracy. This is illustrated by the Output accuracy columns in Table 8 and Table 9:
• Table 8 contains numbers pertaining to a network whose topology has been optimized so as to reduce the inference time. However, this is at the expense of the output accuracy (accuracy drop of about 10 %).
• Table 9 provides numbers pertaining to a network whose topology has been optimized so as to maximize the output accuracy. As a result, Table 9 shows accuracies that are better than the ones shown in Table 8. However, this is at the expense of the inference time and the memory space required by the network (larger activation buffer and weight-and-bias table).

3.2.6 Memory footprint

Table 10 presents the memory footprints of the application in different configurations.

Table 10. Memory footprints

• Float C model, VGA capture:
– Full external: Flash: code ~75 Kbytes, weights ~515 Kbytes; RAM: internal SRAM ~20 Kbytes, external SDRAM ~7.1 Mbytes
– Split: Flash: code ~75 Kbytes, weights ~515 Kbytes; RAM: internal SRAM ~520 Kbytes, external SDRAM ~6.6 Mbytes
• Quantized C model(1):
– VGA capture, full external: Flash: code ~85 Kbytes, weights ~135 Kbytes; RAM: internal SRAM ~20 Kbytes, external SDRAM ~6.3 Mbytes
– VGA capture, split(2): Flash: code ~85 Kbytes, weights ~135 Kbytes; RAM: internal SRAM ~190 Kbytes / ~145 Kbytes, external SDRAM ~6.025 Mbytes / ~6.175 Mbytes
– QVGA capture, full internal, memory optimized: Flash: code ~85 Kbytes, weights ~135 Kbytes; RAM: internal SRAM ~250 Kbytes, external SDRAM ~3.2 Mbytes(3)
– QVGA capture, full internal, FPS optimized: Flash: code ~85 Kbytes, weights ~135 Kbytes; RAM: internal SRAM ~405 Kbytes, external SDRAM ~3.2 Mbytes(3)

1. Food recognition CNN model quantized using the quantization tool (provided by the STM32Cube.AI tool v5.0.0) with the "UaUa per layer" quantization scheme.
2. First value when the "allocate input in activation" option is selected in the STM32Cube.AI tool for C code generation; second value otherwise.
3. Dedicated to the management of the LCD display.

The external SDRAM memory always contains at least the two following items:
• Read and write buffers for managing the display on the LCD: total size of 3 Mbytes
• The buffer used to contain the test frame (randomly generated at program startup when entering the test mode):
– When the VGA resolution is selected, the size is 614.4 Kbytes
– When the QVGA resolution is selected, the size is 153.6 Kbytes

Depending on the scheme selected for the placement of data in the volatile memory, the external SDRAM may also contain all or part of the buffers listed in Table 3.

The internal SRAM contains at least the stack and heap required by the application. Depending on the scheme selected for the placement of data in the volatile memory, the internal SRAM may also contain all or part of the buffers listed in Table 3.

The code size is different between the float and quantized models. This is due to the fact that the two models do not involve the same set of functions from the runtime library.


Table 11 shows the memory footprints of the food recognition application generated from the optimized model (not provided as part of the FP-AI-VISION1 function pack; contact STMicroelectronics for information about this specific version):

Table 11. Memory footprints for the optimized model

C model      | Camera resolution | Volatile memory layout scheme   | Flash code (byte) | Flash weight (byte) | Internal SRAM (byte) | External SDRAM (byte)
-------------|-------------------|---------------------------------|-------------------|---------------------|----------------------|----------------------
Float        | VGA               | Full external                   | ~75 K             | ~580 K              | ~20 K                | ~7.6 M
Quantized(1) | VGA               | Full external                   | ~85 K             | ~150 K              | ~20 K                | ~6.425 M
Quantized(1) | VGA               | Split(2)                        | ~85 K             | ~150 K              | ~240 K / ~270 K      | ~6.025 M / ~6.175 M
Quantized(1) | QVGA              | Full internal, Memory optimized | ~85 K             | ~150 K              | ~250 K               | ~3.2 M(3)
Quantized(1) | QVGA              | Full internal, FPS optimized    | ~85 K             | ~150 K              | ~405 K               | ~3.2 M(3)

1. Food recognition CNN model quantized using the quantization tool (provided by the STM32Cube.AI tool v5.0.0) with the "UaUa per layer" quantization scheme.
2. First value when the "allocate input in activation" option is selected in the STM32Cube.AI tool for C code generation; second value otherwise.
3. Dedicated to the management of the LCD display.

3.2.7 Two visualization modes

The main application provides two visualization modes:
• A logo view, where each food category is represented by a logo
• An image view, which displays the camera feed

By default, the application starts with the logo view activated. Press the [WakeUp] button during execution to toggle between the two modes.

Note: When pressing the [WakeUp] button, the user must wait for the message Entering/Exiting CAMERA PREVIEW mode, Please release button to show up on the screen before releasing the button.

3.2.8 Embedded validation and testing

The embedded validation and testing mode offers three complementary applications:
• Capture Frame: captures frames from the camera to a microSD™ card
• Intermediate Dumps: dumps every step of the processing pipeline to a microSD™ card for debug purposes
• Onboard Validation: evaluates the Neural Network with images stored on a microSD™ card

This section presents each of the three applications as well as the way to start the validation and testing mode.

Prerequisite

The evaluation and testing mode requires a microSD™ card to be plugged into the STM32H747I-DISCO Discovery board. It is recommended to use at least a class 10 microSD™ card to avoid slow read/write operations. Moreover, the card must be formatted with a FAT file system (handled through FatFS) and contain a single partition.
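For reference, the function pack relies on FatFS for file handling on the microSD™ card. The sketch below is a minimal, hedged example of writing a file through the standard FatFS API; the file name, buffer, and helper function are illustrative and not taken from the firmware sources.

/* Minimal FatFS write sketch (standard FatFS API); for illustration only. */
#include "ff.h"

static FATFS fs;

FRESULT write_example(const BYTE *data, UINT len)
{
  FIL file;
  UINT written;
  FRESULT res;

  res = f_mount(&fs, "", 1);                 /* mount the default drive immediately */
  if (res != FR_OK) return res;

  res = f_open(&file, "example.bin", FA_WRITE | FA_CREATE_ALWAYS);
  if (res != FR_OK) return res;

  res = f_write(&file, data, len, &written); /* write 'len' bytes to the card */
  f_close(&file);
  return res;
}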

Enabling the validation and testing mode

Keep the [WakeUp] button pressed during the welcome screen (after a reset) to enable the test menu shown in Figure 9 instead of the main application.


Figure 9. Validation and testing menu

Start the desired application by pressing the corresponding arrow key of the joystick:
• [UP]: the default application is started (Neural Network inference with images from the camera)
• [RIGHT]: the Capture Frame application is started
• [LEFT]: the Intermediate Dumps application is started
• [DOWN]: the Onboard Validation application is started

Capture Frame

The Capture Frame application grabs frames from the camera sensor of the STM32F4DIS-CAM camera daughterboard and saves them to the microSD™ card as bitmap image files (.bmp).
At application startup, a session name is generated and displayed in the top-left corner of the screen. Every file saved during the session is prefixed with SESS_<session name>. This avoids overwriting files and differentiates between resets. The screen shows the image taken from the camera and, in the top-right corner, the READY message indicates that the program is ready to write to the microSD™ card.
To capture a frame, press the [WakeUp] button. The READY message switches to BUSY, indicating that the frame is being written to the microSD™ card and preventing any other capture in the meantime. Once the write operation is finished, the READY message is displayed and capture is enabled again.
All the files are saved in the bitmap (.bmp) format and suffixed with an increasing number starting from 1 (a possible naming scheme is sketched below).
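As an illustration of this naming convention (session prefix plus increasing index), a file name could be built as follows. The helper function and exact format string are assumptions for the example; the firmware may use a different format.

/* Illustrative construction of a capture file name such as "SESS_A1B2_3.bmp".
 * The session string and format are assumptions based on the description above. */
#include <stdio.h>

void build_capture_name(char *out, size_t out_len,
                        const char *session, unsigned int index)
{
  /* Files are prefixed with SESS_<session name> and suffixed with an
   * increasing number starting from 1. */
  snprintf(out, out_len, "SESS_%s_%u.bmp", session, index);
}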

Intermediate Dumps

This application behaves like the default application described in Section 3.2.4.1 Application execution flow and volatile (RAM) data memory requirement (that is, NN inference with images from the camera). In addition, at each stage of the processing pipeline, it saves the intermediate data to the microSD™ card. As with the Capture Frame application, a session name is generated and displayed, and all files are prefixed with this session name. The files created are:
• rgb565.dat
• rgb888.bmp
• rgb888_resized.bmp
• input_tensor.dat
• nn_output.txt


The processing pipeline is presented in Figure 10.

Figure 10. Intermediate Dumps processing pipeline

(The figure shows the processing steps and their RAM data buffers: frame capture into the camera_frame buffer (RGB565), color format conversion RGB565 => RGB888 into the rgb24_image buffer, image rescaling VGA/QVGA => 224 x 224 into the rgb24_scaled_image buffer, pixel value scaling into the nn_input buffer (input tensor), and Neural Network inference producing the nn_output buffer (output tensor). Each buffer corresponds to one of the intermediate dump files listed above.)

Onboard Validation

The Onboard Validation application uses images stored on the microSD™ card as inputs for evaluating the Neural Network.
In order for the program to find the images on the microSD™ card, a directory named dataset must exist at the root of the file system. Inside this directory, there must be one directory per class, named exactly as defined by NN_OUTPUT_CLASS_LIST in the code, containing the images of that class.
In the function pack, this variable is aliased to g_food_classes, defined in the file fp_vision_app.c:

#define NN_OUTPUT_CLASS_LIST g_food_classes

const char *g_food_classes[AI_NET_OUTPUT_SIZE] = {
    "Apple Pie", "Beer", "Caesar Salad", "Cappuccino", "Cheesecake",
    "Chicken Wings", "Chocolate Cake", "Coke", "Cup Cakes", "Donuts",
    "French Fries", "Hamburger", "Hot Dog", "Lasagna", "Pizza",
    "Risotto", "Spaghetti Bolognese", "Steak"};

All images must be stored in the PPM P6 format (Portable PixMap, binary) with a maximum size of 1080 × 720 pixels each. The class is derived directly from the directory containing the images; the file names are not considered.
A helper script (named create_dataset.py) to convert any dataset of images to the PPM format is provided in the Utilities\AI_resources\Food-Recognition directory.
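For reference, a PPM P6 file starts with a short ASCII header ("P6", width, height, maximum pixel value) followed by binary RGB triplets. The sketch below is a minimal, simplified header parser given only to illustrate the format; it is not the parser used by the function pack (it ignores '#' comment lines, for instance).

/* Minimal, simplified PPM P6 header parser (illustration of the format only). */
#include <stdio.h>

int read_ppm_header(FILE *f, int *width, int *height, int *maxval)
{
  char magic[3] = {0};

  if (fscanf(f, "%2s", magic) != 1 || magic[0] != 'P' || magic[1] != '6')
    return -1;                        /* not a binary (P6) PPM file */
  if (fscanf(f, "%d %d %d", width, height, maxval) != 3)
    return -1;                        /* malformed header */
  fgetc(f);                           /* consume the single whitespace before pixel data */
  return 0;                           /* width * height binary RGB triplets follow */
}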


For the food recognition example, the structure of the microSD™ file system must be as follows:

.
|-- dataset
    |-- Apple Pie
    |   |-- apple_pie_example1.ppm
    |   |-- another_apple_pie.ppm
    |   `-- ...
    |-- Beer
    |   |-- beer.ppm
    |   |-- another_beer.ppm
    |   `-- ...
    |-- Caesar Salad
    |   |-- a_caesar_salad.ppm
    |   `-- ...
    |-- Cappuccino
    |   |-- cappuccino.ppm
    |   `-- ...
    |-- Cheesecake
    |   |-- cheesecake.ppm
    |   `-- ...
    |-- Chicken Wings
    |   |-- chicken_wing.ppm
    |   `-- ...
    |-- Chocolate Cake
    |   |-- chocolate_cake.ppm
    |   `-- ...
    |-- Coke
    |   |-- coke.ppm
    |   `-- ...
    |-- Cup Cakes
    |   |-- cup_cakes.ppm
    |   `-- ...
    |-- Donuts
    |   |-- donuts.ppm
    |   `-- ...
    |-- French Fries
    |   |-- french_fries.ppm
    |   `-- ...
    |-- Hamburger
    |   |-- hamburger.ppm
    |   `-- ...
    |-- Hot Dog
    |   |-- hot_dog.ppm
    |   `-- ...
    |-- Lasagna
    |   |-- lasagna.ppm
    |   `-- ...
    |-- Pizza
    |   |-- pizza.ppm
    |   `-- ...
    |-- Risotto
    |   |-- risotto.ppm
    |   `-- ...
    |-- Spaghetti Bolognese
    |   |-- spaghetti_bolognese.ppm
    |   `-- ...
    `-- Steak
        |-- steak.ppm
        `-- ...


At startup, the Onboard Validation application displays a summary of the information presented in this section. Press the [WakeUp] button to start the validation. The screen shown in Figure 11 is then displayed.

Figure 11. Onboard Validation summary of information

At the top of the display, the class name is presented with its index in NN_OUTPUT_CLASS_LIST.
The left side shows the current image being processed by the Neural Network, resized to match the network input size. Below the image, the display shows the NN Top-1 output, the confidence level, and the average categorical cross-entropy loss across all processed images.
The right side of the display shows the confusion matrix, which is constantly updated while images are being processed by the neural network. The rows indicate the ground-truth label, and the columns the predicted class.

Note: If an image exceeds the maximum size or if the file is unreadable, the image is ignored and the application proceeds with the next image.


When all images are processed, a message is displayed and a press on the [WakeUp] button updates the display with a classification report, as shown in Figure 12.

Figure 12. Onboard Validation classification report

The classification report shows the precision, recall, and F1-score for each class, as well as the macro average (unweighted mean of the per-class metrics) and the weighted average (support-weighted mean of the per-class metrics). A sketch of how these metrics derive from the confusion matrix is given below.
The confusion matrix, the list of misclassified files, and the classification report are saved at the root of the microSD™ card file system as:
• confusion_matrix.csv
• missclassified.txt
• classification_report.txt
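As an illustration only (this is not the function-pack implementation), the per-class precision, recall, and F1-score can be derived from a confusion matrix indexed as cm[truth][pred] as sketched below.

/* Illustrative per-class metrics from a confusion matrix cm[truth][pred],
 * stored row-major in a flat array; not the function-pack implementation. */
void class_metrics(const unsigned int *cm, unsigned int n_classes, unsigned int c,
                   float *precision, float *recall, float *f1)
{
  unsigned int tp = cm[c * n_classes + c];   /* images correctly predicted as class c */
  unsigned int pred_c = 0, true_c = 0;

  for (unsigned int i = 0; i < n_classes; i++) {
    pred_c += cm[i * n_classes + c];         /* column sum: predicted as class c          */
    true_c += cm[c * n_classes + i];         /* row sum: ground truth is class c (support) */
  }
  *precision = (pred_c != 0U) ? (float)tp / (float)pred_c : 0.0f;
  *recall    = (true_c != 0U) ? (float)tp / (float)true_c : 0.0f;
  *f1        = (*precision + *recall > 0.0f)
             ? 2.0f * (*precision) * (*recall) / (*precision + *recall) : 0.0f;
}

The macro average is then the plain mean of these per-class values, while the weighted average weights each class by its support (the row sum of the confusion matrix).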

3.2.9 Output display

By default, only the information (class name and confidence level in %) concerning the output class with the highest probability (Top-1) is displayed on the LCD after the inference.
The CNN inference time (expressed in milliseconds) as well as the number of frames processed per second (FPS) are also displayed on the LCD after the inference.
One of the three LEDs below is set after each inference (a possible mapping is sketched after this list):
• Red LED: if the output confidence is below 55 %
• Orange LED: if the output confidence ranges from 55 % to 70 %
• Green LED: if the output confidence is above 70 %
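As a simple illustration of the thresholds above, the selection logic could look like the sketch below. The BSP_LED_On() call and the LED identifiers are assumptions made for the example (they require the board BSP header) and are not quoted from the firmware.

/* Illustrative LED selection based on the Top-1 confidence thresholds above.
 * BSP_LED_On() and the LED identifiers are assumed here for the example. */
void show_confidence_led(float confidence_percent)
{
  if (confidence_percent < 55.0f)
    BSP_LED_On(LED_RED);       /* low confidence    */
  else if (confidence_percent <= 70.0f)
    BSP_LED_On(LED_ORANGE);    /* medium confidence */
  else
    BSP_LED_On(LED_GREEN);     /* high confidence   */
}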


Figure 13. Food recognition example with green LED on

3.3 Hardware setup

The FP-AI-VISION1 function pack supports the following hardware configuration: the STM32H747I-DISCO Discovery board connected to the STM32F4DIS-CAM camera daughterboard.
The STM32H747I-DISCO Discovery board features:
• One STM32H747XIH6 microcontroller with:
  – SRAM:
    ◦ 864 Kbytes of system SRAM (512 Kbytes + 288 Kbytes + 64 Kbytes)
    ◦ 128 Kbytes of DTCM RAM
  – Flash memory:
    ◦ 2 Mbytes (2 banks of 1 Mbyte each)
• One 8-bit camera connector
• One 256-Mbit external SDRAM
The STM32F4DIS-CAM camera daughterboard features:
• One OmniVision® OV9655 camera module

Note: Information about the STM32F4DIS-CAM is available from the Farnell website at https://www.farnell.com.


The STM32F4DIS-CAM camera daughterboard is connected to the STM32H747I-DISCO Discovery board through a flex cable provided with the camera daughterboard, as illustrated in Figure 14.

Figure 14. Hardware setup with STM32H747I-DISCO and STM32F4DIS-CAM



4 Testing the CNN

4.1 Testing conditions

The typical testing environment consists of a tablet device placed in front of the camera. In that case, the user must keep in mind that the following parameters impact the accuracy of the recognition:
• Ambient light and illumination, for instance neon-light flickering
• Quality of the input image displayed by the tablet: a high resolution is required
• Distance between the camera and the tablet
• Reflections, for instance when a tablet is used as the input image source

Adjusting the camera contrast by pressing the joystick [LEFT] and [RIGHT] keys is a way to improve the accuracy of the recognition.

4.2 Camera and LCD setting adjustments

The LCD brightness is set using the joystick ([UP] and [DOWN]).
The camera contrast levels are adjusted using the joystick ([LEFT] and [RIGHT]).
The LCD brightness and camera contrast levels are set back to their default values using the joystick ([SEL]).


Revision history

Table 12. Document revision history

Date         Version  Changes
17-Jul-2019  1        Initial release.
24-Dec-2019  2        - Replaced former section Test mode with Section 3.2.8 Embedded validation and testing.
                      - Replaced former section Two acquisition modes with Section 3.2.7 Two visualization modes.
                      - Added Section 3.2.3 Training script.
                      - Updated the folder tree across the entire document.
                      - Updated the entire document with respect to the possible buffer optimization offered by the allocate input in activation option in STM32Cube.AI.
                      - Updated measurement conditions and results in Section 3.2.5 Execution performance and overall memory footprint.


Contents

1 General information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2

1.1 FP-AI-VISION1 function pack feature overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Software architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Terms and definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Overview of available documents and references. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Building a CNN-based computer vision application on STM32H7 . . . . . . . . . . . . . . . . . . .5

2.1 Integration of the generated code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Package content. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7

3.1 CNN model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.2 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.2.1 Folder organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.2.2 Quantization process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2.3 Training script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.2.4 Memory requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.2.5 Execution performance and overall memory footprint . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.2.6 Memory footprint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.2.7 Two visualization modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2.8 Embedded validation and testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2.9 Output display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.3 Hardware setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4 Testing the CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .36

4.1 Testing conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.2 Camera and LCD setting adjustments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

Revision history . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .37

Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .38

List of tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .39

List of figures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .40


List of tables

Table 1.  List of acronyms . . . 3
Table 2.  References . . . 4
Table 3.  RAM buffers . . . 15
Table 4.  STM32H747XIH6 SRAM memory map . . . 16
Table 5.  Compile flags . . . 21
Table 6.  Measurements of frame capture and preprocessing times . . . 23
Table 7.  Configurations supported by the FP-AI-VISION1 function pack . . . 24
Table 8.  Execution performance of the food recognition application . . . 25
Table 9.  Execution performance of the optimized food recognition application . . . 26
Table 10. Memory footprints . . . 27
Table 11. Memory footprints for the optimized model . . . 28
Table 12. Document revision history . . . 37


List of figures

Figure 1.  FP-AI-VISION1 architecture . . . 3
Figure 2.  CNN-based computer vision application build flow . . . 5
Figure 3.  FP-AI-VISION1 folder tree . . . 8
Figure 4.  Data buffers during execution flow . . . 13
Figure 5.  SRAM allocation - Memory optimized scheme . . . 18
Figure 6.  SRAM allocation - FPS optimized scheme . . . 19
Figure 7.  Flash programming (1 of 2) . . . 20
Figure 8.  Flash programming (2 of 2) . . . 21
Figure 9.  Validation and testing menu . . . 29
Figure 10. Intermediate Dumps processing pipeline . . . 30
Figure 11. Onboard Validation summary of information . . . 32
Figure 12. Onboard Validation classification report . . . 33
Figure 13. Food recognition example with green LED on . . . 34
Figure 14. Hardware setup with STM32H747I-DISCO and STM32F4DIS-CAM . . . 35


IMPORTANT NOTICE – PLEASE READ CAREFULLY

STMicroelectronics NV and its subsidiaries (“ST”) reserve the right to make changes, corrections, enhancements, modifications, and improvements to ST products and/or to this document at any time without notice. Purchasers should obtain the latest relevant information on ST products before placing orders. ST products are sold pursuant to ST’s terms and conditions of sale in place at the time of order acknowledgement.

Purchasers are solely responsible for the choice, selection, and use of ST products and ST assumes no liability for application assistance or the design of Purchasers’ products.

No license, express or implied, to any intellectual property right is granted by ST herein.

Resale of ST products with provisions different from the information set forth herein shall void any warranty granted by ST for such product.

ST and the ST logo are trademarks of ST. For additional information about ST trademarks, please refer to www.st.com/trademarks. All other product or service names are the property of their respective owners.

Information in this document supersedes and replaces information previously supplied in any prior versions of this document.

© 2019 STMicroelectronics – All rights reserved
