STM32 Agent — STM32H7, CMSIS-NN, LwIP, and MQTT

// last reviewed 2026-05-22 · Marcus Rüb

Building an MCU Agent on STM32

The STM32H7 series, and specifically the STM32H743 (Cortex-M7 at 480 MHz, double-precision FPU, 1 MB of SRAM), is among the highest-performance Cortex-M targets for MCU agents. It runs STM32Cube.AI and CMSIS-NN inference on models that would overflow lower-end MCUs, while LwIP and an embedded MQTT client such as MQTT-C provide a production-grade networking stack.

The STM32MP157 (Cortex-A7 + Cortex-M4 heterogeneous SoC) is a step beyond the M7 class when you need Linux on the A7 side for heavier processing, with the M4 handling real-time sensor tasks and actuation.

STM32H7 series — key specs

| Part | Core | Max clock | SRAM | Flash | FPU |
| --- | --- | --- | --- | --- | --- |
| STM32H743/753 | Cortex-M7 | 480 MHz | 1 MB (864 KB user) | 2 MB | Double-precision |
| STM32H745/755 | Cortex-M7 + Cortex-M4 | 480 / 240 MHz | 1 MB shared | 2 MB | M7: DP, M4: SP |
| STM32H747/757 | Cortex-M7 + Cortex-M4 | 480 / 240 MHz | 1 MB shared | 2 MB | M7: DP, M4: SP |
| STM32H7B3 | Cortex-M7 | 280 MHz | 1.4 MB | 2 MB | Double-precision |

The Cortex-M7 includes a Tightly Coupled Memory (TCM) interface. On the STM32H743 this provides 64 KB of ITCM (instruction) and 128 KB of DTCM (data), both accessible at full core speed without wait states. Place hot inference kernels and activation buffers here, where they fit, for maximum throughput.
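
One way to steer a buffer into DTCM with GCC is a section attribute. The sketch below assumes the linker script defines an output section named `.dtcm_bss` that is mapped to the DTCM region; that section name, the `AGENT_DTCM` macro, the 96 KB arena size and `agent_activation_arena()` are all illustrative assumptions, not names produced by STM32Cube.AI. On a non-ARM host the macro falls back to plain alignment so the snippet still compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical section name: it must match an output section that your
 * linker script places in the DTCM region. */
#if defined(__arm__) && defined(__GNUC__)
#define AGENT_DTCM __attribute__((section(".dtcm_bss"), aligned(32)))
#else
#define AGENT_DTCM __attribute__((aligned(32)))   /* host build: alignment only */
#endif

#define ACTIVATION_ARENA_BYTES (96u * 1024u)      /* assumed size for illustration */

/* The arena must leave room for everything else that lives in DTCM. */
_Static_assert(ACTIVATION_ARENA_BYTES <= 128u * 1024u,
               "arena must fit in the 128 KB DTCM");

static uint8_t s_activation_arena[ACTIVATION_ARENA_BYTES] AGENT_DTCM;

/* Return the arena and its size so it can be handed to the inference runtime. */
static uint8_t *agent_activation_arena(size_t *out_size)
{
    assert(out_size != NULL);
    *out_size = sizeof(s_activation_arena);
    return s_activation_arena;
}
```

Whether the arena really lands in DTCM is decided by the linker script, so it is worth confirming the address in the map file after the first build.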

The 864 KB of user SRAM (spread across AXI SRAM, SRAM1, SRAM2, SRAM3 and SRAM4) leaves room for larger ML model activations than typical Cortex-M4 class devices offer. A MobileNetV1 at 0.25 width factor, for example, can run in under 200 KB of activations, depending on input resolution and quantization.
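
Because the banks sit at different addresses, a single activation buffer generally has to fit inside one bank rather than the 864 KB total. The following sketch records the STM32H743 bank sizes and checks a buffer against one bank; the helper name `activations_fit()` and the idea of a per-bank reserve are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* STM32H743 user SRAM banks in KB (TCM and backup SRAM are not included). */
enum {
    AXI_SRAM_KB  = 512,
    SRAM1_KB     = 128,
    SRAM2_KB     = 128,
    SRAM3_KB     = 32,
    SRAM4_KB     = 64,
    USER_SRAM_KB = AXI_SRAM_KB + SRAM1_KB + SRAM2_KB + SRAM3_KB + SRAM4_KB
};

_Static_assert(USER_SRAM_KB == 864, "bank sizes should add up to 864 KB");

/* True if a buffer of `bytes` fits in a bank of `bank_kb` KB once
 * `reserved_bytes` have been set aside for stacks, heap and other buffers. */
static bool activations_fit(size_t bytes, size_t bank_kb, size_t reserved_bytes)
{
    size_t bank_bytes = bank_kb * 1024u;
    assert(reserved_bytes <= bank_bytes);
    return bytes <= bank_bytes - reserved_bytes;
}
```

With these numbers a 200 KB activation buffer fits in AXI SRAM even with 64 KB reserved, but not in SRAM1 on its own.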

STM32Cube.AI and CMSIS-NN

STM32Cube.AI (now STM32Cube AI Studio as of 2024) is ST’s free tool for importing trained models (ONNX, TFLite, Keras) and generating optimized C code for STM32 targets. It outputs:

CMSIS-NN is the underlying Arm-optimized kernel library. On the Cortex-M7 it exploits the DSP-extension SIMD instructions in its quantized (int8, int16) kernels.

Typical inference times on STM32H743 @ 480 MHz for CMSIS-NN quantized models:

| Model | Operation | Time (approx.) |
| --- | --- | --- |
| CIFAR-10 CNN (int8) | Full image inference | ~50–100 ms |
| Keyword spotting (MFCC input) | Single window | ~5–15 ms |
| Anomaly detection (small MLP) | Single feature vector | < 1 ms |
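
Figures like these are best confirmed on your own hardware. A common approach on the Cortex-M7 is to read the DWT cycle counter before and after the inference call and convert the difference to time. The sketch below covers only the conversion step, with the two counter readings passed in as plain integers so that it also runs on a host; the function names are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define CORE_CLOCK_HZ 480000000u   /* STM32H743 at its maximum clock */

/* Elapsed cycles between two 32-bit counter readings. Unsigned subtraction
 * stays correct across a single counter wrap-around. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to microseconds at the given core clock. */
static uint32_t cycles_to_us(uint32_t cycles, uint32_t clock_hz)
{
    assert(clock_hz >= 1000000u);
    return (uint32_t)(((uint64_t)cycles * 1000000u) / clock_hz);
}
```

At 480 MHz a 32-bit cycle counter wraps roughly every 8.9 seconds, which is comfortably longer than any of the inference times listed above.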

Agent skeleton for STM32H7 with FreeRTOS + MQTT

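#include <stdio.h>              /* snprintf */
#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"
#include "network_interface.h"  /* LwIP + Ethernet or Wi-Fi HAL */
#include "MQTTClient.h"         /* paho-embedded-c client */
#include "ai_model.h"           /* generated by STM32Cube.AI */

#define DEVICE_ID   "stm32h7-node-01"
#define TOPIC_EVENT "agents/" DEVICE_ID "/event"
#define TOPIC_CMD   "agents/" DEVICE_ID "/cmd"

/* WINDOW_SIZE, WINDOW_MS, FEATURE_LEN, NUM_CLASSES and CONFIDENCE_THRESHOLD
 * are application-defined and must be provided for your sensor and model. */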

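/* Must be created with length 1 (item size sizeof(float) * FEATURE_LEN):
 * xQueueOverwrite() is only valid on single-item queues. */
static QueueHandle_t xFeatureQueue;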
static MQTTClient    s_mqtt;

/* Pre-process raw ADC data into feature vector */
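static void vSensorTask(void *pvParam) {
    /* static: keeps the sample window off the (small) task stack */
    static float raw_buf[WINDOW_SIZE];
    static float feature_vec[FEATURE_LEN];
    (void)pvParam;
    for (;;) {
        adc_dma_read(raw_buf, WINDOW_SIZE);   /* DMA-backed; returns once the window is full */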
        compute_features(raw_buf, feature_vec);
        xQueueOverwrite(xFeatureQueue, feature_vec);
        vTaskDelay(pdMS_TO_TICKS(WINDOW_MS));
    }
}

/* Agent task: inference + decision + publish */
static void vAgentTask(void *pvParam) {
    float feature_vec[FEATURE_LEN];
    ai_float out_scores[NUM_CLASSES];
    char payload[256];

    for (;;) {
        xQueueReceive(xFeatureQueue, feature_vec, portMAX_DELAY);

        /* Run STM32Cube.AI generated inference */
        ai_model_run(feature_vec, out_scores);

        int best_class = argmax(out_scores, NUM_CLASSES);
        if (out_scores[best_class] > CONFIDENCE_THRESHOLD) {
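            int len = snprintf(payload, sizeof(payload),
                "{\"id\":\"%s\",\"class\":%d,\"score\":%.3f}",
                DEVICE_ID, best_class, (double)out_scores[best_class]);
            if (len > 0 && (size_t)len < sizeof(payload)) {
                /* paho-embedded-c takes the payload wrapped in an MQTTMessage */
                MQTTMessage msg = {0};
                msg.qos        = QOS1;
                msg.payload    = payload;
                msg.payloadlen = (size_t)len;
                MQTTPublish(&s_mqtt, TOPIC_EVENT, &msg);
            }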
        }
    }
}

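Note: ai_model_run() stands in for the inference entry point generated by STM32Cube.AI. The generated run function typically takes a network handle plus input and output buffer descriptors, so in practice ai_model_run() is a thin wrapper you write around it. The exact names and signature depend on the tool version and the model's input/output tensor shapes.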
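
The skeleton calls argmax() without defining it. A minimal version is sketched here, together with a small formatter that builds the same JSON payload and reports truncation. Both are illustrative helpers written for this article, not part of STM32Cube.AI or any MQTT library, and format_event() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Index of the largest score; ties resolve to the lowest index. */
static int argmax(const float *scores, size_t n)
{
    assert(scores != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; ++i) {
        if (scores[i] > scores[best]) {
            best = i;
        }
    }
    return (int)best;
}

/* Format the event JSON into buf. Returns the payload length,
 * or -1 if the output did not fit in cap bytes. */
static int format_event(char *buf, size_t cap, const char *id, int cls, float score)
{
    int len = snprintf(buf, cap, "{\"id\":\"%s\",\"class\":%d,\"score\":%.3f}",
                       id, cls, (double)score);
    return (len < 0 || (size_t)len >= cap) ? -1 : len;
}
```

On newlib-nano toolchains, floating-point support in printf-style functions usually has to be enabled at link time, otherwise the score field comes out empty.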

LwIP + MQTT stack options

| Option | Description | Best for |
| --- | --- | --- |
| paho-embedded-c | Eclipse Paho C client ported for embedded (blocking API) | Simplest integration |
| MQTT-C | Small, async, BSD-licensed; 1-2 KB RAM overhead | Memory-tight builds |
| wolfMQTT | wolfSSL-integrated; TLS built in | Security-critical deployments |
| Azure RTOS NetX Duo | Originally Microsoft's stack (now Eclipse ThreadX NetX Duo); full TLS via NetX Secure | STM32 with Azure IoT integration |

For STM32H7 + Ethernet, LwIP on the built-in Ethernet MAC with an external PHY (e.g., DP83848, LAN8720) is the most common path. For Wi-Fi, an external module (ESP-AT over UART, or ATWINC1500 over SPI) is attached, because the STM32H7 itself has no RF.

STM32MP157 — when you need more

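The STM32MP157 is a heterogeneous SoC that pairs a Cortex-A7 application processor running Linux with a Cortex-M4 for real-time sensor tasks and actuation.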

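The A7 and M4 communicate through OpenAMP, which carries RPMsg messages over virtio, while the Linux remoteproc framework on the A7 loads and controls the M4 firmware. The M4 handles deterministic sensing; the A7 runs the heavier decision logic and delegates to cloud endpoints if needed.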
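
Messages crossing between the M4 and the A7 are easier to maintain with an explicit byte layout than with a raw struct copy, since the two sides are built with different toolchains. The sketch below shows one possible fixed-size sensor message with little-endian encoding. The struct, its fields and the function names are assumptions for illustration, and the actual OpenAMP send and receive calls are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SENSOR_MSG_BYTES 10u   /* 4 (timestamp) + 2 (channel) + 4 (value) */

typedef struct {
    uint32_t timestamp_ms;
    uint16_t channel;
    int32_t  value_milli;      /* reading in thousandths of a unit */
} sensor_msg_t;

static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Serialize m into out. Returns bytes written, or 0 if out is too small. */
static size_t sensor_msg_encode(const sensor_msg_t *m, uint8_t *out, size_t cap)
{
    assert(m != NULL && out != NULL);
    if (cap < SENSOR_MSG_BYTES) {
        return 0;
    }
    put_le32(out, m->timestamp_ms);
    out[4] = (uint8_t)(m->channel & 0xFFu);
    out[5] = (uint8_t)(m->channel >> 8);
    put_le32(out + 6, (uint32_t)m->value_milli);
    return SENSOR_MSG_BYTES;
}

/* Parse a received buffer. Returns 0 on success, -1 on a length mismatch. */
static int sensor_msg_decode(const uint8_t *in, size_t len, sensor_msg_t *m)
{
    assert(in != NULL && m != NULL);
    if (len != SENSOR_MSG_BYTES) {
        return -1;
    }
    m->timestamp_ms = get_le32(in);
    m->channel      = (uint16_t)(in[4] | ((uint16_t)in[5] << 8));
    m->value_milli  = (int32_t)get_le32(in + 6);
    return 0;
}
```

The same two functions can be compiled into the M4 firmware and the Linux-side application, which keeps the wire format defined in one place.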

Platform example: ForestHub.ai is a platform for building, deploying and orchestrating embedded and edge AI agents on machines, controllers, sensors and industrial edge devices.

FAQ

Q: Does STM32Cube.AI support ONNX models? Yes. STM32Cube AI Studio accepts TFLite flatbuffers, Keras H5/SavedModel, and ONNX models. It quantizes and optimizes them for the target STM32 device during import.

Q: What is the TCM and why does it matter for inference? TCM (Tightly Coupled Memory) is SRAM accessible at zero wait states at full CPU speed. On STM32H743, 128 KB DTCM is ideal for activation buffers during inference. Placing activations in AXI SRAM adds latency from the bus matrix.

Q: Can CMSIS-NN use float32 or only quantized? CMSIS-NN's kernels are quantized (int8 and int16) and use the DSP SIMD instructions, which makes them the fast path. Float32 models can still run on the STM32H7: the STM32Cube.AI runtime executes them on the FPU, which is appropriate when quantization accuracy loss is unacceptable.
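
To make the accuracy trade-off concrete, here is a sketch of the affine int8 scheme used by TFLite-style quantized models, where real_value = scale * (q - zero_point). The helper names are illustrative; in a real model the scale and zero point come from the converter, not from application code.

```c
#include <assert.h>
#include <stdint.h>

/* Affine int8 quantization: q = round(x / scale) + zero_point, clamped to int8. */
static int8_t quantize_int8(float x, float scale, int32_t zero_point)
{
    assert(scale > 0.0f);
    float scaled = x / scale;
    /* Clamp before the integer cast so very large inputs cannot overflow it. */
    if (scaled > 255.0f)  { scaled = 255.0f; }
    if (scaled < -255.0f) { scaled = -255.0f; }
    /* Round half away from zero without needing libm. */
    int32_t q = (int32_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f) + zero_point;
    if (q > 127)  { q = 127; }
    if (q < -128) { q = -128; }
    return (int8_t)q;
}

/* Inverse mapping: real_value = scale * (q - zero_point). */
static float dequantize_int8(int8_t q, float scale, int32_t zero_point)
{
    return scale * (float)((int32_t)q - zero_point);
}
```

Any input that does not land exactly on a multiple of the scale comes back slightly changed after a quantize and dequantize round trip, and that rounding error is the accuracy loss the answer above refers to.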

Q: How do I add TLS to the MQTT connection? Use wolfMQTT with wolfSSL, or mbedTLS (included in STM32Cube packages). You need to provision the CA certificate and optionally a device certificate for mutual TLS. Certificate storage in internal Flash or an external serial Flash is common.

Q: Is the STM32H7 overkill for a simple threshold agent? Almost certainly, unless you also need the high-resolution ADC (16-bit, 3.6 MSPS), the DSP throughput, or the large SRAM for multi-channel buffering. For a simple threshold-and-alert agent, an STM32G0 or STM32L4 costs a fraction and uses far less power.
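
For comparison, a threshold-and-alert agent can be as small as the sketch below, which needs no ML runtime at all. The hysteresis band and every name in it are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    float on_level;    /* alert turns on at or above this value */
    float off_level;   /* alert clears at or below this value */
    bool  active;
} threshold_agent_t;

static void threshold_agent_init(threshold_agent_t *a, float on_level, float off_level)
{
    assert(a != NULL && off_level <= on_level);
    a->on_level  = on_level;
    a->off_level = off_level;
    a->active    = false;
}

/* Feed one sample. Returns true only on the sample where the alert turns on,
 * so the caller publishes one event per excursion instead of one per sample. */
static bool threshold_agent_step(threshold_agent_t *a, float sample)
{
    assert(a != NULL);
    if (!a->active && sample >= a->on_level) {
        a->active = true;
        return true;
    }
    if (a->active && sample <= a->off_level) {
        a->active = false;
    }
    return false;
}
```

The gap between the two levels keeps a noisy signal that hovers near the limit from generating a stream of alerts.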