STM32 Agent — STM32H7, CMSIS-NN, LwIP, and MQTT

// last reviewed 2026-05-22 · Marcus Rüb

Building an MCU Agent on STM32

The STM32H7 series, and in particular the STM32H743 (Cortex-M7 at 480 MHz with a double-precision FPU and 1 MB of SRAM), is among the highest-performance Cortex-M targets for MCU agents. It runs STM32Cube.AI and CMSIS-NN inference on models that would overflow the memory of lower-end MCUs, while LwIP with an MQTT client such as MQTT-C provides a production-grade networking stack.

The STM32MP157 (Cortex-A7 + Cortex-M4 heterogeneous SoC) is a step beyond the M7 class when you need Linux on the A7 side for heavier processing, with the M4 handling real-time sensor tasks and actuation.

STM32H7 series — key specs

Part	Core	Max clock	SRAM	Flash	FPU
STM32H743/753	Cortex-M7	480 MHz	1 MB (864 KB user)	2 MB	Double-precision
STM32H745/755	Cortex-M7 + Cortex-M4	480 / 240 MHz	1 MB shared	2 MB	M7: DP, M4: SP
STM32H747/757	Cortex-M7 + Cortex-M4	480 / 240 MHz	1 MB shared	2 MB	M7: DP, M4: SP
STM32H7B3	Cortex-M7	280 MHz	1.4 MB	2 MB	Double-precision

The Cortex-M7 includes a Tightly Coupled Memory (TCM) interface. On the STM32H743 this provides 64 KB of ITCM (instruction) and 128 KB of DTCM (data), both accessible at full core speed without wait states. Place hot inference kernels and activation buffers here for maximum throughput.
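With GCC, one way to place a buffer in DTCM is a section attribute. The sketch below is a minimal illustration, not ST's reference layout: the section name .dtcm_bss, the IN_DTCM macro, the activation_arena() helper, and the 96 KB arena size are all assumptions. The linker script must map that section into DTCM (0x20000000 on the STM32H743), and the startup code must zero it.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the linker script defines an output section ".dtcm_bss"
 * located in DTCM. On a host build the macro expands to nothing so the
 * same file still compiles for unit tests. */
#if defined(__GNUC__) && defined(__arm__)
#define IN_DTCM __attribute__((section(".dtcm_bss"), aligned(32)))
#else
#define IN_DTCM
#endif

#define DTCM_BYTES       (128u * 1024u)  /* DTCM size on STM32H743 */
#define ACTIVATION_BYTES (96u * 1024u)   /* hypothetical model requirement */

/* Activation arena handed to the inference runtime. */
IN_DTCM static uint8_t g_activations[ACTIVATION_BYTES];

/* Fail the build, not the device, if the arena cannot fit in DTCM. */
_Static_assert(sizeof(g_activations) <= DTCM_BYTES,
               "activation arena does not fit in DTCM");

/* Returns the arena base and, optionally, its size in bytes. */
uint8_t *activation_arena(size_t *size_out)
{
    if (size_out != NULL) {
        *size_out = sizeof(g_activations);
    }
    return g_activations;
}
```

The compile-time check is the useful part of the design: the stack and other hot data usually share DTCM, so a real budget would subtract those from DTCM_BYTES.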

The 864 KB of user SRAM (spread across several banks: AXI SRAM, SRAM1, SRAM2, SRAM3, SRAM4) leaves room for larger ML model activations than typical Cortex-M4 class devices can hold. For example, a MobileNetV1 at 0.25 width factor can run in under 200 KB of activations, depending on input resolution.

STM32Cube.AI and CMSIS-NN

STM32Cube.AI (now STM32Cube AI Studio as of 2024) is ST’s free tool for importing trained models (ONNX, TFLite, Keras) and generating optimized C code for STM32 targets. It outputs:

A static C library with your model’s weights and topology baked in.
A validation report showing inference time, memory usage (activation buffer, weights), and layer-by-layer profiling.
Integration hooks for STM32CubeMX project generation.

CMSIS-NN is the underlying Arm-optimized kernel library for quantized operators. On the Cortex-M7, inference exploits:

The FPU for float32 layers (these run outside CMSIS-NN, whose kernels are integer-only).
DSP extension instructions (SMLAD, SMLABT, etc.) for int8/int16 quantized convolutions.
D-Cache and I-Cache (both 16 KB on the STM32H743) for repeated kernel access.
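To make the SIMD point concrete, the sketch below models what SMLAD computes (two signed 16-bit multiplies accumulated into a 32-bit sum in one instruction) in portable C and uses it for an int8 dot product. It is a reference model for illustration, not CMSIS-NN's kernel code: the names smlad_ref, pack16, and dot_s8 are hypothetical, and real kernels additionally handle input offsets and requantization.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (low half first). */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Portable model of SMLAD: acc + x.lo * y.lo + x.hi * y.hi,
 * treating each 16-bit half as signed. */
static int32_t smlad_ref(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t xl = (int16_t)(x & 0xFFFFu);
    int16_t xh = (int16_t)(x >> 16);
    int16_t yl = (int16_t)(y & 0xFFFFu);
    int16_t yh = (int16_t)(y >> 16);
    return acc + (int32_t)xl * yl + (int32_t)xh * yh;
}

/* int8 dot product: widen pairs to 16 bit, then one SMLAD per pair. */
int32_t dot_s8(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t acc = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc = smlad_ref(pack16(a[i], a[i + 1]),
                        pack16(b[i], b[i + 1]), acc);
    }
    if (i < n) {               /* odd tail element */
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}
```

On the target, each smlad_ref() call corresponds to a single instruction, which is why int8 and int16 paths process two multiply-accumulates per cycle where a scalar loop would do one.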

Typical inference times on STM32H743 @ 480 MHz for CMSIS-NN quantized models:

Model	Operation	Time (approx.)
CIFAR-10 CNN (int8)	Full image inference	~50–100 ms
Keyword spotting (MFCC input)	Single window	~5–15 ms
Anomaly detection (small MLP)	Single feature vector	< 1 ms
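Because these figures are approximate, it is worth measuring your own model on the target. A common approach on Cortex-M7 is the DWT cycle counter. The sketch below assumes a 480 MHz core clock; the helper names are hypothetical, while the register names in the guarded part come from the CMSIS-Core headers. Some debug configurations also require unlocking the DWT before the counter runs, which is not shown.

```c
#include <stdint.h>

#define CORE_CLOCK_HZ 480000000u   /* assumption: STM32H743 at 480 MHz */

#if defined(STM32H743xx)
#include "stm32h7xx.h"             /* CMSIS device header (target build only) */

/* Enable the free-running cycle counter. */
static inline void cycle_counter_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
}

static inline uint32_t cycle_counter_read(void)
{
    return DWT->CYCCNT;
}
#endif

/* Cycles between two counter reads. Unsigned subtraction stays correct
 * across a single wrap of the 32-bit counter (about 8.9 s at 480 MHz). */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to whole microseconds. */
uint32_t cycles_to_us(uint32_t cycles)
{
    return (uint32_t)(((uint64_t)cycles * 1000000u) / CORE_CLOCK_HZ);
}
```

On the target, read the counter immediately before and after the inference call and pass the two values to cycles_elapsed(); averaging several runs smooths out cache warm-up effects.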

Agent skeleton for STM32H7 with FreeRTOS + MQTT

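#include <stdio.h>              /* snprintf */
#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"
#include "network_interface.h"  /* LwIP + Ethernet or Wi-Fi HAL */
#include "MQTTClient.h"         /* Paho embedded C (MQTTClient-C) API */
#include "ai_model.h"           /* generated by STM32Cube.AI */

#define DEVICE_ID   "stm32h7-node-01"
#define TOPIC_EVENT "agents/" DEVICE_ID "/event"
#define TOPIC_CMD   "agents/" DEVICE_ID "/cmd"

/* Application-specific placeholders; the values below are examples only */
#define WINDOW_SIZE          256     /* raw samples per window */
#define WINDOW_MS            100     /* window period in ms */
#define FEATURE_LEN          32      /* must match the model input */
#define NUM_CLASSES          4       /* must match the model output */
#define CONFIDENCE_THRESHOLD 0.80f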

static QueueHandle_t xFeatureQueue;
static MQTTClient    s_mqtt;

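/* Pre-process raw ADC data into a feature vector */
static void vSensorTask(void *pvParam) {
    (void)pvParam;
    /* static: keeps the sample window off the task stack */
    static float raw_buf[WINDOW_SIZE];
    static float feature_vec[FEATURE_LEN];
    for (;;) {
        /* DMA-backed; assumed to block this task (not the CPU) until the window is full */
        adc_dma_read(raw_buf, WINDOW_SIZE);
        compute_features(raw_buf, feature_vec);
        /* xQueueOverwrite requires a queue of length 1; the agent always sees the newest vector */
        xQueueOverwrite(xFeatureQueue, feature_vec);
        vTaskDelay(pdMS_TO_TICKS(WINDOW_MS));
    }
}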

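/* Agent task: inference + decision + publish */
static void vAgentTask(void *pvParam) {
    (void)pvParam;
    float feature_vec[FEATURE_LEN];
    ai_float out_scores[NUM_CLASSES];
    char payload[256];

    for (;;) {
        xQueueReceive(xFeatureQueue, feature_vec, portMAX_DELAY);

        /* Run STM32Cube.AI generated inference */
        ai_model_run(feature_vec, out_scores);

        int best_class = argmax(out_scores, NUM_CLASSES);
        if (out_scores[best_class] > CONFIDENCE_THRESHOLD) {
            /* %f needs float printf support (e.g. -u _printf_float with newlib-nano) */
            int len = snprintf(payload, sizeof(payload),
                "{\"id\":\"%s\",\"class\":%d,\"score\":%.3f}",
                DEVICE_ID, best_class, (double)out_scores[best_class]);
            if (len > 0 && (size_t)len < sizeof(payload)) {
                /* Paho embedded C publishes an MQTTMessage, not a raw string */
                MQTTMessage msg = {0};
                msg.qos = QOS1;
                msg.payload = payload;
                msg.payloadlen = (size_t)len;
                if (MQTTPublish(&s_mqtt, TOPIC_EVENT, &msg) != 0) {
                    /* publish failed: reconnect or retry according to policy */
                }
            }
        }
    }
}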

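Note: ai_model_run() stands in for the entry point generated by STM32Cube.AI. The generated function is typically named after the network (for example ai_network_run()) and takes a network handle plus input and output buffer descriptors rather than bare float pointers. The actual signature depends on the tool version and the model input/output tensor shapes.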
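The skeleton calls argmax() without defining it. A minimal sketch of such a helper follows; the name and signature are taken from the call site above, and the choice to return -1 for an empty input and the first index on ties is an assumption of this example.

```c
#include <stddef.h>

/* Index of the largest score, or -1 if the input is empty.
 * Ties resolve to the lowest index. */
int argmax(const float *scores, size_t n)
{
    if (scores == NULL || n == 0) {
        return -1;
    }
    size_t best = 0;
    for (size_t i = 1; i < n; ++i) {
        if (scores[i] > scores[best]) {
            best = i;
        }
    }
    return (int)best;
}
```

With NUM_CLASSES fixed at compile time and greater than zero, the -1 case cannot occur in the agent task, so the skeleton's direct indexing with best_class stays safe.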

LwIP + MQTT stack options

Option	Description	Best for
paho-embedded-c	Eclipse Paho C client ported for embedded (blocking API)	Simplest integration
MQTT-C	Small, non-blocking (driven by periodic mqtt_sync() calls), MIT-licensed; 1–2 KB RAM overhead	Memory-tight builds
wolfMQTT	wolfSSL-integrated; TLS built in	Security-critical deployments
Azure RTOS NetX Duo	Originally Microsoft’s stack, now maintained as Eclipse ThreadX NetX Duo; used instead of LwIP, with full TLS via NetX Secure	STM32 with Azure IoT integration

For STM32H7 + Ethernet, LwIP over the on-chip Ethernet MAC with an external PHY (e.g., DP83848, LAN8720) is the most common path. For Wi-Fi, an external module (ESP-AT over UART, or ATWINC1500 over SPI) is attached, since the STM32H7 itself has no RF.

STM32MP157 — when you need more

The STM32MP157 is a heterogeneous SoC:

Cortex-A7 @ 650 MHz: runs Linux (OpenSTLinux). Can host heavier inference, Python, or even a local MQTT broker.
Cortex-M4 @ 209 MHz: runs FreeRTOS. Handles real-time sensor acquisition and actuator control.

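The two sides communicate via OpenAMP: the Linux remoteproc framework loads and controls the M4 firmware, and RPMsg over virtio carries messages between them. The M4 handles deterministic sensing; the A7 runs the heavier decision logic and delegates to cloud endpoints if needed.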
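Messages between the M4 and the A7 are plain byte buffers, so both sides need to agree on a layout. The sketch below shows one possible fixed little-endian frame for sensor samples. The frame format, sensor_frame_t, and frame_encode() are hypothetical and not part of OpenAMP; on the M4 the encoded bytes would then be handed to the OpenAMP send call (rpmsg_send() in the OpenAMP library). At 136 bytes maximum, the frame stays well below the payload limit of the commonly used 512-byte RPMsg buffers.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_MAX_SAMPLES  64u
#define FRAME_HEADER_BYTES 8u

/* Hypothetical sensor frame passed from the M4 to the A7. */
typedef struct {
    uint32_t seq;                        /* running frame counter */
    uint16_t channel;                    /* sensor channel id */
    uint16_t count;                      /* valid entries in samples[] */
    int16_t  samples[FRAME_MAX_SAMPLES];
} sensor_frame_t;

static void put_u16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xFFu);
    p[1] = (uint8_t)(v >> 8);
}

static void put_u32(uint8_t *p, uint32_t v)
{
    put_u16(p, (uint16_t)(v & 0xFFFFu));
    put_u16(p + 2, (uint16_t)(v >> 16));
}

/* Serialize a frame as little-endian bytes.
 * Returns the number of bytes written, or 0 on invalid input
 * or if the output buffer is too small. */
size_t frame_encode(const sensor_frame_t *f, uint8_t *out, size_t cap)
{
    if (f == NULL || out == NULL || f->count > FRAME_MAX_SAMPLES) {
        return 0;
    }
    size_t need = FRAME_HEADER_BYTES + 2u * (size_t)f->count;
    if (need > cap) {
        return 0;
    }
    put_u32(out, f->seq);
    put_u16(out + 4, f->channel);
    put_u16(out + 6, f->count);
    for (size_t i = 0; i < f->count; ++i) {
        put_u16(out + FRAME_HEADER_BYTES + 2u * i, (uint16_t)f->samples[i]);
    }
    return need;
}
```

Encoding byte by byte, rather than copying the struct, keeps the wire format independent of compiler padding and lets a Python or Go consumer on the Linux side parse it without sharing C headers.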

Running ready-made agents on STM32MP-class Linux: Everything above is hand-written Cortex-M firmware. One layer up, the open-source edge-agents runtime runs on the Cortex-A Linux side of MPU boards like the STM32MP25 (Cortex-A35, verified) as a 30 MB Go engine — no firmware to hand-write for the agent layer, with GPIO/UART/MQTT as first-class workflow nodes. It does not run on bare-metal Cortex-M (STM32H7/L4); that distinction is the whole point. See the edge-agents Hardware Support matrix and the Runtime Quickstart.

Platform example: ForestHub.ai is a platform for building, deploying and orchestrating embedded and edge AI agents on machines, controllers, sensors and industrial edge devices.

FAQ

Q: Does STM32Cube.AI support ONNX models? Yes. STM32Cube AI Studio accepts TFLite flatbuffers, Keras H5/SavedModel, and ONNX models. It quantizes and optimizes them for the target STM32 device during import.

Q: What is the TCM and why does it matter for inference? TCM (Tightly Coupled Memory) is SRAM accessible at zero wait states at full CPU speed. On STM32H743, 128 KB DTCM is ideal for activation buffers during inference. Placing activations in AXI SRAM adds latency from the bus matrix.

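Q: Can CMSIS-NN use float32 or only quantized? CMSIS-NN itself provides only quantized kernels (int8 and int16), which use the DSP SIMD instructions and are the fastest path. Float32 models still run on the STM32H7: STM32Cube.AI executes float layers on the FPU, which is appropriate when quantization accuracy loss is unacceptable, at the cost of larger weights and activations.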

Q: How do I add TLS to the MQTT connection? Use wolfMQTT with wolfSSL, or mbedTLS (included in STM32Cube packages). You need to provision the CA certificate and optionally a device certificate for mutual TLS. Certificate storage in internal Flash or an external serial Flash is common.

Q: Is the STM32H7 overkill for a simple threshold agent? Almost certainly, unless you also need the high-resolution ADC (16-bit, 3.6 MSPS), the DSP throughput, or the large SRAM for multi-channel buffering. For a simple threshold-and-alert agent, an STM32G0 or STM32L4 costs a fraction and uses far less power.
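For comparison, a threshold-and-alert agent needs only a few lines of state and no ML runtime, which is why it fits comfortably on a small MCU. The sketch below is an illustrative example with hysteresis so that a noisy signal near the limit does not trigger repeated alerts; the type and function names and the threshold values in the usage note are hypothetical.

```c
#include <stdbool.h>

/* Threshold agent with hysteresis: alarm above `high`,
 * clear only once the signal falls below `low`. */
typedef struct {
    float high;
    float low;
    bool  alarmed;
} threshold_agent_t;

/* Feed one sample. Returns true only on the transition into the
 * alarm state, i.e. when an alert should be published. */
bool threshold_agent_step(threshold_agent_t *a, float sample)
{
    if (!a->alarmed && sample > a->high) {
        a->alarmed = true;
        return true;
    }
    if (a->alarmed && sample < a->low) {
        a->alarmed = false;
    }
    return false;
}
```

With high = 30.0 and low = 28.0, a reading of 31 raises one alert, further readings of 32 or 29 stay silent, and a new alert can fire only after the signal has dropped below 28.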