STM32 Agent — STM32H7, CMSIS-NN, LwIP, and MQTT
Building an MCU Agent on STM32
The STM32H7 series, and specifically the STM32H743 with its Cortex-M7 at 480 MHz, double-precision FPU, and 1 MB of SRAM, is among the highest-performance Cortex-M targets for MCU agents. It runs STM32Cube.AI and CMSIS-NN inference on models that would not fit on lower-end MCUs, while LwIP with an MQTT client such as MQTT-C provides a production-grade networking stack.
The STM32MP157 (Cortex-A7 + Cortex-M4 heterogeneous SoC) is a step beyond the M7 class when you need Linux on the A7 side for heavier processing, with the M4 handling real-time sensor tasks and actuation.
STM32H7 series — key specs
| Part | Core | Max clock | SRAM | Flash | FPU |
|---|---|---|---|---|---|
| STM32H743/753 | Cortex-M7 | 480 MHz | 1 MB (864 KB user) | 2 MB | Double-precision |
| STM32H745/755 | Cortex-M7 + Cortex-M4 | 480 / 240 MHz | 1 MB shared | 2 MB | M7: DP, M4: SP |
| STM32H747/757 | Cortex-M7 + Cortex-M4 | 480 / 240 MHz | 1 MB shared | 2 MB | M7: DP, M4: SP |
| STM32H7B3 | Cortex-M7 | 280 MHz | 1.4 MB | 2 MB | Double-precision |
The Cortex-M7 includes a Tightly Coupled Memory (TCM) interface. On the STM32H743 this is 64 KB of ITCM (instruction) and 128 KB of DTCM (data), both accessible at full core speed without wait states. Place hot inference kernels and activation buffers here for maximum throughput.
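As a rough sketch of how a buffer can be steered into DTCM with GCC: the section name `.dtcm_bss`, the macro `DTCM_BSS`, and the 96 KB arena size are assumptions for illustration. The section name has to match an output section that your own linker script maps to the DTCM address range. On non-Arm hosts the macro expands to nothing, so the snippet still compiles for unit tests.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical section name: it must match a section that the project's
   linker script places in DTCM (0x20000000 on the STM32H743). */
#if defined(__arm__) && defined(__GNUC__)
#define DTCM_BSS __attribute__((section(".dtcm_bss")))
#else
#define DTCM_BSS /* host build: ordinary .bss */
#endif

#define ACTIVATION_BYTES (96u * 1024u) /* example size, below the 128 KB DTCM */

/* Activation arena for the inference runtime. 32-byte alignment matches the
   Cortex-M7 cache line, which helps if the buffer is later moved out of TCM. */
DTCM_BSS static uint8_t g_activations[ACTIVATION_BYTES] __attribute__((aligned(32)));

/* Returns the arena base address and reports its size through size_out. */
static uint8_t *activation_arena(size_t *size_out) {
    if (size_out != NULL) {
        *size_out = sizeof(g_activations);
    }
    return g_activations;
}
```

Whether a custom section is zeroed at boot depends on the startup code, so it is safest not to rely on zero-initialization of buffers placed this way.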
The 864 KB of user SRAM, split across several banks (AXI SRAM, SRAM1, SRAM2, SRAM3, SRAM4), leaves room for larger activation buffers than typical Cortex-M4-class devices offer. A MobileNetV1 at 0.25 width factor, for example, needs under 200 KB of activations.
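Because the user SRAM is split into banks, a buffer has to fit inside a region the linker can actually allocate. The sketch below uses the STM32H743 bank sizes (check the datasheet for your exact part) and picks the smallest single bank that can hold a buffer. The type and function names are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t size_kb;
} sram_bank_t;

/* User SRAM banks of the STM32H743, sizes in KB. */
static const sram_bank_t k_banks[] = {
    { "AXI SRAM", 512 },
    { "SRAM1", 128 },
    { "SRAM2", 128 },
    { "SRAM3", 32 },
    { "SRAM4", 64 },
};
#define NUM_BANKS (sizeof(k_banks) / sizeof(k_banks[0]))

/* Sum of all user SRAM banks in KB. */
static uint32_t total_user_sram_kb(void) {
    uint32_t total = 0;
    for (size_t i = 0; i < NUM_BANKS; ++i) {
        total += k_banks[i].size_kb;
    }
    return total;
}

/* Best fit: the smallest single bank that holds need_kb, or NULL if none does. */
static const sram_bank_t *smallest_bank_that_fits(uint32_t need_kb) {
    const sram_bank_t *best = NULL;
    for (size_t i = 0; i < NUM_BANKS; ++i) {
        if (k_banks[i].size_kb >= need_kb &&
            (best == NULL || k_banks[i].size_kb < best->size_kb)) {
            best = &k_banks[i];
        }
    }
    return best;
}
```

Treating every bank separately is conservative. A 200 KB activation buffer lands in AXI SRAM under this rule, while small DMA buffers can go to the smaller banks.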
STM32Cube.AI and CMSIS-NN
STM32Cube.AI (now STM32Cube AI Studio as of 2024) is ST’s free tool for importing trained models (ONNX, TFLite, Keras) and generating optimized C code for STM32 targets. It outputs:
- A static C library with your model’s weights and topology baked in.
- A validation report showing inference time, memory usage (activation buffer, weights), and layer-by-layer profiling.
- Integration hooks for STM32CubeMX project generation.
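CMSIS-NN is the underlying Arm-optimized kernel library for quantized layers. On the Cortex-M7, inference exploits:
- The FPU for float32 operations (float layers run on generic float kernels, since CMSIS-NN itself targets quantized data).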
- DSP extension SIMD instructions (SMLAD, SMLABT, etc.) for int8/int16 quantized convolutions.
- D-Cache and I-Cache (both 16 KB on STM32H7) for repeated kernel access.
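To make the SIMD point concrete, the dual multiply-accumulate that SMLAD performs can be modelled in portable C. This is an illustrative model, not CMSIS-NN source. The function names are made up for the example, and on a real Cortex-M7 the model function would be a single instruction emitted through a compiler intrinsic.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable model of SMLAD: acc + x.lo * y.lo + x.hi * y.hi,
   where lo/hi are the signed 16-bit halves of each 32-bit word. */
static int32_t smlad_model(uint32_t x, uint32_t y, int32_t acc) {
    int16_t x_lo = (int16_t)(uint16_t)(x & 0xFFFFu);
    int16_t x_hi = (int16_t)(uint16_t)(x >> 16);
    int16_t y_lo = (int16_t)(uint16_t)(y & 0xFFFFu);
    int16_t y_hi = (int16_t)(uint16_t)(y >> 16);
    return acc + (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}

/* Sign-extend two int8 values into one word holding two int16 lanes. */
static uint32_t pack_q7_pair(int8_t lo, int8_t hi) {
    return (uint32_t)(uint16_t)(int16_t)lo |
           ((uint32_t)(uint16_t)(int16_t)hi << 16);
}

/* int8 dot product that consumes two elements per multiply-accumulate step. */
static int32_t dot_q7(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc = smlad_model(pack_q7_pair(a[i], a[i + 1]),
                          pack_q7_pair(b[i], b[i + 1]), acc);
    }
    if (i < n) { /* odd-length tail */
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}
```

The inner loop handles two int8 products per step, which is where most of the speedup of quantized kernels over scalar code comes from.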
Typical inference times on STM32H743 @ 480 MHz for CMSIS-NN quantized models:
| Model | Operation | Time (approx.) |
|---|---|---|
| CIFAR-10 CNN (int8) | Full image inference | ~50–100 ms |
| Keyword spotting (MFCC input) | Single window | ~5–15 ms |
| Anomaly detection (small MLP) | Single feature vector | < 1 ms |
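These latencies translate directly into a cycle and CPU-load budget. The helper below is a small sketch assuming a 480 MHz core clock and one inference per sensing period. The function names and the 100 ms period in the usage note are hypothetical.

```c
#include <stdint.h>

#define CORE_HZ 480000000u /* STM32H743 at full speed */

/* Converts a duration in microseconds to core clock cycles. */
static uint64_t us_to_cycles(uint32_t us) {
    return (uint64_t)us * (CORE_HZ / 1000000u);
}

/* CPU load in tenths of a percent when one inference runs per period.
   Returns UINT32_MAX for an invalid (zero) period. */
static uint32_t inference_load_permille(uint32_t inference_us, uint32_t period_us) {
    if (period_us == 0u) {
        return UINT32_MAX;
    }
    return (uint32_t)(((uint64_t)inference_us * 1000u) / period_us);
}
```

For example, a 15 ms keyword-spotting inference repeated every 100 ms costs 7.2 million cycles per run and keeps the core busy about 15% of the time, leaving the rest for networking and sensing tasks.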
Agent skeleton for STM32H7 with FreeRTOS + MQTT
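#include <stdio.h>              /* snprintf */
#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"
#include "network_interface.h"  /* LwIP + Ethernet or Wi-Fi HAL */
#include "MQTTClient.h"         /* paho-embedded-c; MQTT-C exposes a different API */
#include "ai_model.h"           /* generated by STM32Cube.AI */

#define DEVICE_ID   "stm32h7-node-01"
#define TOPIC_EVENT "agents/" DEVICE_ID "/event"
#define TOPIC_CMD   "agents/" DEVICE_ID "/cmd"

/* WINDOW_SIZE, WINDOW_MS, FEATURE_LEN, NUM_CLASSES, CONFIDENCE_THRESHOLD and the
   helpers adc_dma_read(), compute_features(), argmax() are application-defined. */

/* Length-1 queue of float[FEATURE_LEN] items, as xQueueOverwrite() requires.
   Create it with xQueueCreate(1, sizeof(float) * FEATURE_LEN) before starting tasks. */
static QueueHandle_t xFeatureQueue;
static MQTTClient s_mqtt;

/* Pre-process raw ADC data into feature vector */
static void vSensorTask(void *pvParam) {
    static float raw_buf[WINDOW_SIZE];  /* static keeps the large buffer off the task stack */
    float feature_vec[FEATURE_LEN];
    (void)pvParam;
    for (;;) {
        adc_dma_read(raw_buf, WINDOW_SIZE);           /* DMA-backed; assumed to return a complete window */
        compute_features(raw_buf, feature_vec);
        xQueueOverwrite(xFeatureQueue, feature_vec);  /* keep only the newest vector */
        vTaskDelay(pdMS_TO_TICKS(WINDOW_MS));
    }
}

/* Agent task: inference + decision + publish */
static void vAgentTask(void *pvParam) {
    float feature_vec[FEATURE_LEN];
    ai_float out_scores[NUM_CLASSES];
    char payload[256];
    (void)pvParam;
    for (;;) {
        xQueueReceive(xFeatureQueue, feature_vec, portMAX_DELAY);
        /* Run STM32Cube.AI generated inference */
        ai_model_run(feature_vec, out_scores);
        int best_class = argmax(out_scores, NUM_CLASSES);
        if (out_scores[best_class] > CONFIDENCE_THRESHOLD) {
            int len = snprintf(payload, sizeof(payload),
                               "{\"id\":\"%s\",\"class\":%d,\"score\":%.3f}",
                               DEVICE_ID, best_class, (double)out_scores[best_class]);
            if (len > 0 && (size_t)len < sizeof(payload)) {
                /* Paho embedded C publishes an MQTTMessage, not a raw string */
                MQTTMessage msg = { .qos = QOS1, .payload = payload,
                                    .payloadlen = (size_t)len };
                MQTTPublish(&s_mqtt, TOPIC_EVENT, &msg);
            }
        }
    }
}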
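Note: ai_model_run() stands in for the entry point generated by STM32Cube.AI. The generated function is normally named after the network (ai_<name>_run()) and takes a network handle plus input and output buffer descriptors, so the actual signature depends on the tool version and the model's input/output tensor shapes.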
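The skeleton leaves argmax() and the payload formatting to the application. A minimal, host-testable sketch of both follows. The helper name format_event() is an assumption introduced here, and the JSON layout simply mirrors the one used in the skeleton.

```c
#include <stddef.h>
#include <stdio.h>

/* Index of the largest score; ties resolve to the first occurrence. */
static int argmax(const float *scores, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; ++i) {
        if (scores[i] > scores[best]) {
            best = i;
        }
    }
    return (int)best;
}

/* Formats the event JSON into buf. Returns the payload length,
   or -1 if the buffer is too small or formatting fails. */
static int format_event(char *buf, size_t cap, const char *id,
                        int cls, float score) {
    int len = snprintf(buf, cap,
                       "{\"id\":\"%s\",\"class\":%d,\"score\":%.3f}",
                       id, cls, (double)score);
    return (len > 0 && (size_t)len < cap) ? len : -1;
}
```

Checking the snprintf() return value matters on an MCU: a silently truncated payload is still valid to publish but no longer valid JSON. With newlib-nano, float formatting may also need to be enabled explicitly in the linker options.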
LwIP + MQTT stack options
| Option | Description | Best for |
|---|---|---|
| paho-embedded-c | Eclipse Paho C client ported for embedded (blocking API) | Simplest integration |
| MQTT-C | Small, async, MIT-licensed; 1-2 KB RAM overhead | Memory-tight builds |
| wolfMQTT | wolfSSL-integrated; TLS built in | Security-critical deployments |
| Azure RTOS NetX Duo (now Eclipse ThreadX NetX Duo) | Alternative TCP/IP stack used in place of LwIP; full TLS via NetX Secure | STM32 with Azure IoT integration |
For STM32H7 + Ethernet, LwIP over the on-chip MAC with an external PHY (e.g., DP83848, LAN8720) is the most common path. For Wi-Fi, an external module (ESP-AT over UART, or ATWINC1500 over SPI) is attached, since the STM32H7 itself has no RF.
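Whichever client library is chosen, what ends up on the LwIP TCP socket is a small binary frame. As an illustration only, and not a replacement for any of the libraries above, here is a sketch of an MQTT 3.1.1 PUBLISH encoder for QoS 0 and QoS 1. The function name and its error convention (returning 0) are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encodes an MQTT 3.1.1 PUBLISH packet into out.
   Returns the number of bytes written, or 0 on invalid arguments
   or insufficient capacity. packet_id is only used when qos == 1. */
static size_t mqtt_encode_publish(uint8_t *out, size_t cap, const char *topic,
                                  const uint8_t *payload, size_t payload_len,
                                  uint8_t qos, uint16_t packet_id) {
    size_t topic_len = strlen(topic);
    if (qos > 1u || topic_len == 0u || topic_len > 0xFFFFu) {
        return 0;
    }
    size_t remaining = 2u + topic_len + (qos ? 2u : 0u) + payload_len;
    if (remaining > 268435455u) { /* protocol limit for the length field */
        return 0;
    }

    /* Variable-length "remaining length": 7 bits per byte, MSB = more bytes. */
    uint8_t len_bytes[4];
    size_t n_len = 0;
    size_t r = remaining;
    do {
        uint8_t b = (uint8_t)(r % 128u);
        r /= 128u;
        if (r > 0u) {
            b |= 0x80u;
        }
        len_bytes[n_len++] = b;
    } while (r > 0u);

    if (cap < 1u + n_len + remaining) {
        return 0;
    }

    size_t pos = 0;
    out[pos++] = (uint8_t)(0x30u | (uint8_t)(qos << 1)); /* PUBLISH + QoS bits */
    memcpy(&out[pos], len_bytes, n_len);
    pos += n_len;
    out[pos++] = (uint8_t)(topic_len >> 8);
    out[pos++] = (uint8_t)(topic_len & 0xFFu);
    memcpy(&out[pos], topic, topic_len);
    pos += topic_len;
    if (qos == 1u) {
        out[pos++] = (uint8_t)(packet_id >> 8);
        out[pos++] = (uint8_t)(packet_id & 0xFFu);
    }
    if (payload_len > 0u) {
        memcpy(&out[pos], payload, payload_len);
        pos += payload_len;
    }
    return pos;
}
```

A 35-byte JSON event on a short topic therefore costs only a few bytes of protocol overhead, which is part of why MQTT suits MCU-class devices.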
STM32MP157 — when you need more
The STM32MP157 is a heterogeneous SoC:
- Dual Cortex-A7 @ 650 MHz: runs Linux (OpenSTLinux). Can host heavier inference, Python, or even a local MQTT broker.
- Cortex-M4 @ 209 MHz: runs FreeRTOS. Handles real-time sensor acquisition and actuator control.
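The two sides communicate via OpenAMP: the Linux remoteproc framework loads and controls the M4 firmware, and RPMsg over virtio carries messages between the cores. The M4 handles deterministic sensing; the A7 runs the heavier decision logic and delegates to cloud endpoints if needed.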
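Messages crossing the M4/A7 boundary are easiest to keep compatible when they use an explicit byte layout rather than a raw struct copy. The frame below is a hypothetical example: the field set, the 16-byte size, and the function names are all assumptions. On the M4 side the packed buffer would then be handed to the RPMsg endpoint, for example through OpenAMP's rpmsg_send().

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_FEATURES 4u
#define FRAME_BYTES (4u + 4u + 2u * FRAME_FEATURES) /* 16 bytes on the wire */

/* Hypothetical sensor frame exchanged between the M4 and the A7. */
typedef struct {
    uint32_t seq;          /* monotonically increasing frame counter */
    uint32_t timestamp_ms; /* M4 tick time at acquisition */
    int16_t features[FRAME_FEATURES];
} sensor_frame_t;

static void put_u32_le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

static uint32_t get_u32_le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Serializes f as little-endian bytes. Returns FRAME_BYTES, or 0 if cap is too small. */
static size_t frame_pack(const sensor_frame_t *f, uint8_t *out, size_t cap) {
    if (cap < FRAME_BYTES) {
        return 0;
    }
    put_u32_le(&out[0], f->seq);
    put_u32_le(&out[4], f->timestamp_ms);
    for (size_t i = 0; i < FRAME_FEATURES; ++i) {
        uint16_t u = (uint16_t)f->features[i];
        out[8 + 2 * i] = (uint8_t)(u & 0xFFu);
        out[9 + 2 * i] = (uint8_t)(u >> 8);
    }
    return FRAME_BYTES;
}

/* Parses a frame. Returns 0 on success, -1 if len does not match. */
static int frame_unpack(const uint8_t *in, size_t len, sensor_frame_t *f) {
    if (len != FRAME_BYTES) {
        return -1;
    }
    f->seq = get_u32_le(&in[0]);
    f->timestamp_ms = get_u32_le(&in[4]);
    for (size_t i = 0; i < FRAME_FEATURES; ++i) {
        uint16_t u = (uint16_t)(in[8 + 2 * i] | ((uint16_t)in[9 + 2 * i] << 8));
        f->features[i] = (int16_t)u;
    }
    return 0;
}
```

Serializing field by field avoids depending on struct padding or compiler settings, which can differ between the M4 firmware build and the Linux userspace build.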
Platform example: ForestHub.ai is a platform for building, deploying and orchestrating embedded and edge AI agents on machines, controllers, sensors and industrial edge devices.
FAQ
Q: Does STM32Cube.AI support ONNX models? Yes. STM32Cube AI Studio accepts TFLite flatbuffers, Keras H5/SavedModel, and ONNX models, and optimizes them for the target STM32 device during import. Quantized models are typically prepared in the training framework (for example as int8 TFLite files) before import.
Q: What is the TCM and why does it matter for inference? TCM (Tightly Coupled Memory) is SRAM accessible at zero wait states at full CPU speed. On STM32H743, 128 KB DTCM is ideal for activation buffers during inference. Placing activations in AXI SRAM adds latency from the bus matrix.
Q: Can CMSIS-NN use float32 or only quantized? CMSIS-NN's kernels target quantized data (int8 and int16) and use the DSP SIMD instructions, which makes them the faster path. Float32 models are still deployable: STM32Cube.AI can generate float code that runs on the FPU, which is appropriate when quantization accuracy loss is unacceptable.
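For readers new to quantized inference, the mapping between float values and int8 is a simple affine transform. The sketch below shows the usual scale and zero-point scheme with rounding half away from zero. The type and function names are illustrative, and real toolchains derive the parameters from calibration data.

```c
#include <stdint.h>

typedef struct {
    float scale;        /* real-value step per integer step */
    int32_t zero_point; /* integer that represents real 0.0 */
} quant_params_t;

/* real -> int8: q = clamp(round(x / scale) + zero_point, -128, 127) */
static int8_t quantize_q7(float x, quant_params_t qp) {
    float scaled = x / qp.scale;
    /* Pre-clamp so the float-to-int conversion below cannot overflow. */
    if (scaled > 512.0f) {
        scaled = 512.0f;
    } else if (scaled < -512.0f) {
        scaled = -512.0f;
    }
    int32_t rounded = (int32_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
    int32_t q = rounded + qp.zero_point;
    if (q > 127) {
        q = 127;
    } else if (q < -128) {
        q = -128;
    }
    return (int8_t)q;
}

/* int8 -> real: x = scale * (q - zero_point) */
static float dequantize_q7(int8_t q, quant_params_t qp) {
    return qp.scale * (float)((int32_t)q - qp.zero_point);
}
```

Values outside the representable range saturate at -128 or 127, which is one source of the accuracy loss mentioned above.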
Q: How do I add TLS to the MQTT connection? Use wolfMQTT with wolfSSL, or mbedTLS (included in STM32Cube packages). You need to provision the CA certificate and optionally a device certificate for mutual TLS. Certificate storage in internal Flash or an external serial Flash is common.
Q: Is the STM32H7 overkill for a simple threshold agent? Almost certainly, unless you also need the high-resolution ADC (16-bit, 3.6 MSPS), the DSP throughput, or the large SRAM for multi-channel buffering. For a simple threshold-and-alert agent, an STM32G0 or STM32L4 costs a fraction as much and uses far less power.
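For comparison, a threshold-and-alert agent of the kind that fits comfortably on a small MCU needs only a few lines of state. The sketch below adds hysteresis, a design choice made for this example so that a noisy signal near the limit does not raise repeated alerts. The type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int32_t on_threshold;  /* raise the alert at or above this value */
    int32_t off_threshold; /* clear the alert at or below this value */
    bool active;           /* true while the alert is raised */
} threshold_agent_t;

/* Feeds one sample to the agent. Returns true only on the sample that
   newly raises the alert, so the caller publishes once per excursion. */
static bool threshold_agent_update(threshold_agent_t *agent, int32_t sample) {
    if (!agent->active && sample >= agent->on_threshold) {
        agent->active = true;
        return true;
    }
    if (agent->active && sample <= agent->off_threshold) {
        agent->active = false;
    }
    return false;
}
```

The whole agent is integer-only and keeps a dozen bytes of state, which is why it does not need an FPU or a large SRAM.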