Behavior Cloning of MPC for 3-DOF Robotic Manipulators

Accepted Poster: IEEE ICRA 2026 Workshop on RL in the Era of IL

Theo Guegan, Wen Jie Dexter Teo - University of Waterloo

IK + MPC online control stack -> neural network policy (MLP) for real-time control

Problem

Model Predictive Control (MPC) provides strong tracking quality and stability for robotic manipulation, but it requires solving an optimization problem at every control step. This repeated solve introduces latency and runtime variability that can limit deployment in high-frequency control loops and on compute-constrained platforms. Our objective is to preserve the expert controller behavior while reducing inference time and computational load.

Expert Controller and Dataset

We generate demonstrations using a hierarchical expert: IK computes a joint-space reference and MPC outputs torques.

Step 1: sample a reachable target position.
Step 2: IK computes the desired joint angles.
Step 3: MPC computes the optimal torques.
Step 4: store (state, target, torque) tuples.
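The four steps can be sketched as a data-collection loop. This is a minimal sketch under loudly stated assumptions. It uses a hypothetical planar 3-link arm, so the target p_des is 2-D here. Damped least-squares IK stands in for the IK module. A joint-space PD law is swapped in for the MPC solve, and it is not the expert used in this work. All function names, link lengths, and gains are hypothetical.

```python
import numpy as np

# Hypothetical planar 3R arm. Link lengths (m) are illustrative assumptions,
# not the parameters of the MuJoCo model used in this work.
LINKS = np.array([0.4, 0.3, 0.2])


def forward_kinematics(q):
    """End-effector (x, y) position for joint angles q (rad)."""
    angles = np.cumsum(q)  # absolute orientation of each link
    return np.array([np.sum(LINKS * np.cos(angles)), np.sum(LINKS * np.sin(angles))])


def jacobian(q):
    """2x3 positional Jacobian: joint j moves every link from j onward."""
    angles = np.cumsum(q)
    J = np.zeros((2, 3))
    for j in range(3):
        J[0, j] = -np.sum(LINKS[j:] * np.sin(angles[j:]))
        J[1, j] = np.sum(LINKS[j:] * np.cos(angles[j:]))
    return J


def ik_dls(p_des, q0, iters=200, damping=1e-2):
    """Step 2 (stand-in): damped least-squares IK toward p_des."""
    q = np.array(q0, dtype=float)
    for _ in range(iters):
        err = p_des - forward_kinematics(q)
        J = jacobian(q)
        # dq = J^T (J J^T + lambda^2 I)^-1 err
        q = q + J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), err)
    return q


def expert_torque(q, qd, q_des, kp=50.0, kd=5.0):
    """Step 3 (stand-in): a joint-space PD law in place of the MPC solve."""
    return kp * (q_des - q) - kd * qd


def collect_dataset(n_samples, seed=0):
    """Steps 1-4: X rows are [q, qd, p_des]; Y rows are expert torques."""
    rng = np.random.default_rng(seed)
    X, Y = [], []
    for _ in range(n_samples):
        # Step 1: FK of random joint angles guarantees a reachable target
        p_des = forward_kinematics(rng.uniform(-np.pi, np.pi, size=3))
        q = rng.uniform(-np.pi, np.pi, size=3)    # current joint angles
        qd = rng.normal(0.0, 0.5, size=3)         # current joint velocities
        q_des = ik_dls(p_des, q)                  # Step 2, warm-started at q
        tau = expert_torque(q, qd, q_des)         # Step 3
        X.append(np.concatenate([q, qd, p_des]))  # Step 4: state and target
        Y.append(tau)                             # Step 4: expert torque label
    return np.array(X), np.array(Y)
```

Sampling the target through forward kinematics is one simple way to guarantee reachability. With a real MPC expert, Step 3 would replace expert_torque. The stored tuple layout would stay the same.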

Learning Setup

We formulate policy imitation as supervised regression from robot state and target to expert torque commands generated by the IK+MPC controller. The training data is collected from closed-loop expert rollouts, and models are optimized to minimize the discrepancy between predicted and expert actions. We compare static and temporal neural architectures to test whether explicit history improves control fidelity.
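Written out, the regression objective takes the form below. It assumes the discrepancy is a mean squared error over N expert tuples, since the loss is not named here. Each input is x_i = [q_i, q̇_i, p_des,i], as in Fig 1:

```latex
\theta^\star = \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N}
  \left\| \pi_\theta(x_i) - \tau^{\mathrm{MPC}}_i \right\|_2^2,
\qquad x_i = [\, q_i,\ \dot{q}_i,\ p^{\mathrm{des}}_i \,]
```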

Architecture Schema

Training uses expert supervision (IK + MPC); deployment replaces online optimization with a direct neural policy mapping from state and target to torques.

Training (behavior cloning from the expert): input x = [q, q̇, p_des] -> IK module -> q_des -> MPC solver -> τ_MPC; each pair (x, τ_MPC) is stored in the dataset.
Deployment (expert replaced): input x = [q, q̇, p_des] -> MLP policy π_θ(x) -> τ̂ -> robot control (fast, real-time torque).

Fig 1 - Expert-to-policy architecture schema.
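The two halves of the schema can be sketched with a small NumPy MLP. Training fits π_θ to (x, τ_MPC) pairs by gradient descent on a squared error. Deployment is a single forward pass. This is a sketch, not the MLP_Deep architecture from Fig 4, whose depth, width, and optimizer are not given here. The one-hidden-layer network, tanh activations, learning rate, and synthetic PD-like targets are all assumptions. The synthetic targets stand in for real expert torques.

```python
import numpy as np


class MLPPolicy:
    """Hypothetical one-hidden-layer MLP pi_theta(x) -> tau_hat with tanh units."""

    def __init__(self, n_in, n_out, n_hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 1.0 / np.sqrt(n_hidden), size=(n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def __call__(self, x):
        """Deployment: one forward pass replaces the IK + MPC solve."""
        return np.tanh(x @ self.W1 + self.b1) @ self.W2 + self.b2

    def train_step(self, X, Y, lr=0.02):
        """One full-batch gradient step on the mean squared error to expert torques."""
        n = X.shape[0]
        H = np.tanh(X @ self.W1 + self.b1)
        err = H @ self.W2 + self.b2 - Y
        loss = np.sum(err**2) / n
        d_out = 2.0 * err / n                       # dLoss / dPrediction
        d_pre = (d_out @ self.W2.T) * (1.0 - H**2)  # back through tanh (uses old W2)
        self.W2 -= lr * (H.T @ d_out)
        self.b2 -= lr * d_out.sum(axis=0)
        self.W1 -= lr * (X.T @ d_pre)
        self.b1 -= lr * d_pre.sum(axis=0)
        return loss


def synthetic_expert_data(n=512, seed=1):
    """Synthetic stand-in for (x, tau_MPC) pairs: 3 joints, 3-D target."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 9))
    q, qd, p_des = X[:, :3], X[:, 3:6], X[:, 6:]
    Y = 2.0 * (p_des - q) - 0.5 * qd  # PD-like toy target, NOT real MPC output
    return X, Y


X, Y = synthetic_expert_data()
policy = MLPPolicy(n_in=9, n_out=3)
losses = [policy.train_step(X, Y) for _ in range(1000)]  # training phase
tau_hat = policy(X[0])                                   # deployment: x -> tau_hat
```

Full-batch gradient descent keeps the sketch short. In practice, mini-batches and an adaptive optimizer are the usual choice, though which one was used here is not stated.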

Fig 2 - 3-DOF manipulator in MuJoCo.

Main Result: Replacing IK + MPC with a Neural Policy

In deployment, the learned MLP replaces the online IK+MPC optimization loop: it maps current state and target directly to torques.

Relaxed-tolerance success rate: 84.98%
Mean final tracking error: 2.9 cm
Inference latency: ~1.1 ms
Latency improvement vs. expert MPC: ~3x faster
Fig 3 - Closed-loop success rate across error thresholds.
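Fig 3's curve can be computed from per-episode final errors. The success rate at a threshold is the fraction of episodes whose final tracking error falls within it. This is a minimal sketch, assuming success is judged on final end-effector error; the exact value of the relaxed tolerance is not stated here. The error values below are made up for illustration.

```python
def success_rate(final_errors_m, threshold_m):
    """Fraction of episodes whose final end-effector error is within threshold_m."""
    return sum(e <= threshold_m for e in final_errors_m) / len(final_errors_m)


def success_curve(final_errors_m, thresholds_m):
    """Success rate at each threshold, as in a success-vs-threshold plot."""
    return [success_rate(final_errors_m, t) for t in thresholds_m]


def mean_final_error(final_errors_m):
    """Mean final tracking error over all episodes (same units as the input)."""
    return sum(final_errors_m) / len(final_errors_m)


# Hypothetical per-episode final errors in metres (not data from this work)
errors = [0.01, 0.02, 0.03, 0.05, 0.08]
print(success_curve(errors, [0.01, 0.03, 0.05, 0.10]))  # [0.2, 0.6, 0.8, 1.0]
print(mean_final_error(errors))
```

Because each threshold reuses the same rollouts, the curve is non-decreasing. A single "relaxed tolerance" success rate is just one point on it.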
Fig 4 - Solve time distribution: MPC vs MLP_Deep.
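A distribution like Fig 4's can be gathered by timing repeated controller calls. This is a minimal sketch, assuming wall-clock timing with time.perf_counter after a warm-up. The helpers measure_latency_ms and summarize are hypothetical. The controller argument would be the MPC solve or the policy forward pass.

```python
import math
import statistics
import time


def measure_latency_ms(controller, x, n_warmup=50, n_runs=1000):
    """Wall-clock latency samples (ms) for repeated calls controller(x)."""
    for _ in range(n_warmup):  # warm caches and any lazy initialisation
        controller(x)
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        controller(x)
        samples.append((time.perf_counter() - t0) * 1e3)
    return samples


def summarize(samples_ms):
    """Median, nearest-rank 99th percentile, and worst case of a latency sample."""
    s = sorted(samples_ms)
    rank = max(1, math.ceil(0.99 * len(s)))  # nearest-rank percentile definition
    return {"median": statistics.median(s), "p99": s[rank - 1], "max": s[-1]}
```

Reporting a tail statistic such as p99 alongside the median matters for control loops. The runtime variability noted in the Problem section shows up in the tail rather than the median. A speedup can be expressed as the ratio of expert to policy median latency. Which statistic underlies the ~3x figure is not stated here.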

Takeaway

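The learned policy preserves most of the expert's behavior (84.98% relaxed-tolerance success, 2.9 cm mean final tracking error) while removing the expensive online optimization. It runs at ~1.1 ms per inference, about 3x faster than the expert MPC. This demonstrates a practical path from IK + MPC to a lightweight neural controller for embedded and real-time robotic applications.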