RISE: Robust Imitation by Stitching from Experts demonstrates the following benefits.
Figure 1: Overview of the RISE approach, showing how non-expert data is leveraged to robustify imitation learning.
RISE stitches together non-expert trajectories, such as SE(2) planar pushing data, with expert data to recover from out-of-distribution (OOD) states.
By utilizing play data, RISE is able to recover from OOD states.
Standard Offline RL (IDQL): Offline RL fails to stitch disjoint trajectories together.
RISE: By leveraging non-expert data in addition to expert data, RISE is able to recover from OOD states.
RISE enables continual recovery in cases where the robot is subject to multiple physical perturbations.
Diffusion Policy: BC policies are brittle and cannot utilize non-expert data effectively.
RISE: RISE creates robust policies that are able to continually recover from multiple disturbances.
Unlike most imitation learning methods, RISE is robust to suboptimal demonstrations and can stitch together their useful segments to robustify the expert policy. Additionally, the non-expert data broadens the set of states from which the policy succeeds.
Diffusion Policy: BC learns to mimic the suboptimal demonstrations, which hinders task completion.
RISE: RISE is able to use suboptimal demos to robustify the expert and not be negatively impacted by them.
RISE generalizes beyond rigid, tabletop settings to a deformable cloth-stacking task requiring multiple steps.
Diffusion Policy: BC fails to complete the task of stacking the folded cloth.
RISE: RISE successfully stitches folding and stacking with a deformable object.
By utilizing cheaper data that can be shared across tasks, RISE amortizes the cost of collecting new data needed to learn new tasks.
Diffusion Policy: BC fails to complete the square-peg and square-hook tasks from OOD states.
RISE: By sharing square play data, RISE completes the square-peg task from a wide range of initial states.
RISE: By sharing square play data, RISE completes the square-hook task from a wide range of initial states.
Offline RL is a natural way to learn policies from non-optimal data. Reward labels for human-collected demonstrations on real robots are difficult to obtain; thus, we only assume a label indicating whether each transition belongs to the expert dataset or the non-expert dataset. We label all transitions in the expert dataset with a reward of 1 and all transitions in the non-expert dataset with a reward of 0. We build on IDQL, using the following updates:
$$ \begin{align} \mathcal{L}_V(\psi) &= \mathbb{E}_{(s, a) \sim (\mathcal{D}_{\text{E}} \cup \mathcal{D}_{\text{NE}})} \left[ L^{\tau}_2 \left(Q_\phi (s, a) - V_{\psi}(s)\right) \right] \qquad \text{(Value Learning)} \\ \mathcal{L}_Q(\phi) &= \mathbb{E}_{(s, a, s') \sim \mathcal{D}_{\text{E}}} \left[ \left( 1 + \gamma V_{\psi}(s') - Q_{\phi}(s, a)\right)^2 \right] + \mathbb{E}_{(s, a, s') \sim \mathcal{D}_{\text{NE}}} \left[ \left(\gamma V_{\psi}(s') - Q_{\phi}(s, a)\right)^2 \right] \qquad \text{(Critic Learning)} \\ \pi_B(a|s) &= \text{argmax}_{\pi} \, \mathbb{E}_{(s, a) \sim (\mathcal{D}_{\text{E}} \cup \mathcal{D}_{\text{NE}})}\left[\log \pi(a|s) \right] \qquad \text{(Behavior Policy Learning)} \\ \pi^*(a|s) &= \underset{a \in \{a_1,\dots, a_K\} \sim \pi_B(a|s)}{\text{argmax}} Q_\phi(s, a) \qquad \text{(Optimal Policy Extraction)} \end{align} $$
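For concreteness, the sketch below shows what these updates might look like in PyTorch. The network and variable names (`q_net`, `v_net`, `behavior_policy`) are illustrative placeholders, not our released implementation; target networks and other standard training details are omitted.

```python
# Minimal PyTorch sketch of the IDQL-style updates above (illustrative only).
import torch
import torch.nn.functional as F

def expectile_loss(diff, tau=0.7):
    # Asymmetric L2 loss L^tau_2 used for value learning.
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def value_loss(q_net, v_net, s, a, tau=0.7):
    # (s, a) drawn from the union of expert and non-expert data.
    with torch.no_grad():
        q = q_net(s, a)
    return expectile_loss(q - v_net(s), tau)

def q_loss(q_net, v_net, s, a, s_next, is_expert, gamma=0.99):
    # Expert transitions are labeled with reward 1, non-expert with reward 0.
    reward = is_expert.float()
    with torch.no_grad():
        target = reward + gamma * v_net(s_next)
    return F.mse_loss(q_net(s, a), target)

@torch.no_grad()
def extract_action(behavior_policy, q_net, s, num_samples=32):
    # Sample K candidate actions from the behavior policy and execute the
    # one with the highest Q-value (optimal policy extraction).
    actions = behavior_policy.sample(s, num_samples)   # (K, action_dim)
    s_rep = s.unsqueeze(0).expand(num_samples, -1)     # repeat the state K times
    q_vals = q_net(s_rep, actions).squeeze(-1)         # (K,)
    return actions[q_vals.argmax()]
```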
In practice, however, without very large datasets the probability of exact state overlap in a continuous state space tends to zero, making stitching across overlapping states unlikely. This prevents offline RL from effectively recovering back to expert states from non-expert ones. While this is challenging to solve in the most general case, we base our practical improvements on a set of empirical findings in a robotic manipulation setting, introducing a Lipschitz penalty via spectral normalization along with data augmentation to improve stitching.
Empirically, we observe that Q-value functions learned with expectile regression tend to be accurate and interpolate well within a neighborhood of the data, showing reasonable stitching behavior. The challenge instead comes from policy extraction: while the Q-function interpolates well within a neighborhood, we find that the marginal action distribution captured by the behavior policy can be overly conservative. This tends to be the main practical challenge in offline RL.
To understand this, consider the figures below, which depict a toy task where the goal is to push an object to a goal position on a table. By default (left figure), the action distribution of the policy (in our case parameterized by a diffusion model) is very narrow, and the optimal action falls outside the policy distribution (right figure). To remedy this, we introduce a Lipschitz penalty on the behavior policy during training, which effectively widens the action distribution and improves stitching. While there are several ways to enforce Lipschitz continuity, we simply regularize the policy with a spectral norm penalty. Our behavior policy objective becomes:
$$ \max_\theta \; \mathbb{E}_{(s, a) \sim (\mathcal{D}_{\text{E}} \cup \mathcal{D}_{\text{NE}})}\left[ \log \pi_\theta(a|s) \right] - \lambda\sum_{W \in \theta} \sigma_{\text{max}}(W)^2. $$
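A minimal sketch of how such a penalty could be computed in PyTorch is shown below. Here `policy` and `behavior_nll` are placeholder names for the behavior policy network and its negative log-likelihood (or diffusion) training loss, and computing the exact spectral norm (rather than, say, a power-iteration estimate) is an assumption for illustration.

```python
# Sketch of the spectral-norm penalty on the behavior policy (illustrative).
import torch

def spectral_penalty(policy, lam=1e-3):
    # Sum of squared largest singular values over all weight matrices.
    penalty = torch.zeros((), device=next(policy.parameters()).device)
    for W in policy.parameters():
        if W.dim() == 2:  # weight matrices only; skip biases
            sigma_max = torch.linalg.matrix_norm(W, ord=2)  # largest singular value
            penalty = penalty + sigma_max.pow(2)
    return lam * penalty

# One training step: minimizing NLL + penalty matches the maximization above
# and discourages sharp (high-Lipschitz) policies, which in practice widens
# the learned action distribution.
# loss = behavior_nll(policy, s, a) + spectral_penalty(policy, lam=1e-3)
# loss.backward(); optimizer.step()
```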
The effect of spectral norm regularization, shown in the figure below, is to significantly widen the action distribution.
Figure 2: Adding a spectral norm penalty to the behavior policy widens the action distribution, improving robustness.
In addition, we widen the policy distribution explicitly by adding transitions to the dataset. For a given (s, a) pair, we select nearby transitions (s', a') such that d(s, s') < T for some threshold T and distance metric d, and add the transition (s, a') to the dataset. As the distance metric, we use Euclidean distance in the feature space of a large pretrained vision model, in our case DINOv2, which has been shown to capture semantic similarity between images.
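The sketch below illustrates one way this augmentation could be implemented with an off-the-shelf DINOv2 backbone from torch.hub; the helper names and the choice of threshold are illustrative assumptions rather than our exact pipeline.

```python
# Possible implementation of the neighbor-based action relabeling (illustrative).
import torch

# Pretrained DINOv2 backbone (ViT-S/14 shown here).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def dino_features(images):
    # images: (N, 3, H, W), ImageNet-normalized, H and W divisible by 14.
    return dino(images)  # (N, feature_dim) CLS-token embeddings

@torch.no_grad()
def augment_with_neighbors(images, actions, threshold):
    # For each state s_i, add (s_i, a_j) for every neighbor s_j with
    # ||f(s_i) - f(s_j)|| < threshold in DINOv2 feature space.
    feats = dino_features(images)
    dists = torch.cdist(feats, feats)  # pairwise Euclidean distances
    aug_states, aug_actions = [], []
    for i in range(len(images)):
        neighbors = (dists[i] < threshold).nonzero(as_tuple=True)[0]
        for j in neighbors.tolist():
            if j != i:
                aug_states.append(images[i])
                aug_actions.append(actions[j])
    return aug_states, aug_actions
```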
We visualize the effect of RISE in the figures below. First, we show the expert and non-expert data for the square-peg task. Although the two datasets appear to overlap, stitching between them is challenging in practice. With naive offline RL, the policy takes transitions similar to those found in the non-expert data but is unable to stitch to complete the task. In contrast, RISE, which incorporates the Lipschitz penalty and data augmentation, successfully completes the task.
Non-expert data for the square-peg task.
Visualization of trajectories in the non-expert data, showing task completion.
Example of expert data for the square-peg task.
Visualization of trajectories in the expert data, showing task completion.
Policy trained with naive offline RL
Naive offline RL fails to stitch, resulting in task failure, while RISE successfully stitches.
Policy trained with RISE
The authors would like to acknowledge the members of the Robot Learning Lab and the Washington Embodied Intelligence and Robotics Development Lab for helpful and informative discussions throughout the process of this research. The authors would also like to thank Emma Romig for robot hardware help. This research was supported by funding from Toyota Research Institute, under the University 2.0 research program.
@article{huang2025rise,
title={Using Non-Expert Data to Robustify Imitation Learning via Offline Reinforcement Learning},
author={Kevin Huang and Rosario Scalise and Cleah Winston and Ayush Agrawal and Yunchu Zhang and Rohan Baijal and Markus Grotz and Byron Boots and Benjamin Burchfiel and Masha Itkina and Paarth Shah and Abhishek Gupta},
journal={Under Review},
url={https://arxiv.org/abs/2510.19495},
year={2025}
}