RISE: Robust Imitation by Stitching from Experts demonstrates the following benefits.
Figure 1: Overview of the RISE approach, showing how non-expert data is leveraged to robustify imitation learning.
RISE stitches together non-expert trajectories, such as SE(2) planar pushing data, with expert data to recover from out-of-distribution (OOD) states.
By utilizing play data, RISE is able to recover from OOD states.
Standard Offline RL (IDQL): Offline RL fails to stitch disjoint trajectories together.
RISE: By leveraging non-expert data in addition to expert data, RISE is able to recover from OOD states.
RISE enables continual recovery in cases where the robot is subject to multiple physical perturbations.
Diffusion Policy: BC policies are brittle and cannot utilize non-expert data effectively.
RISE: RISE creates robust policies that are able to continually recover from multiple disturbances.
Unlike most imitation learning methods, RISE is robust to suboptimal demonstrations and can stitch together their useful segments to robustify the expert policy. Additionally, the non-expert data broadens the set of states from which the policy succeeds.
Diffusion Policy: BC learns to mimic the suboptimal demonstrations, which hinders task completion.
RISE: RISE is able to use suboptimal demos to robustify the expert and not be negatively impacted by them.
RISE generalizes beyond rigid, tabletop settings to a deformable cloth-stacking task requiring multiple steps.
Diffusion Policy: BC fails to complete the task of stacking the folded cloth.
RISE: RISE successfully stitches folding and stacking with a deformable object.
By utilizing cheaper data that can be shared across tasks, RISE amortizes the cost of collecting new data needed to learn new tasks.
Diffusion Policy: BC fails to complete the square-peg and square-hook tasks from OOD states.
RISE: By sharing square play data, RISE completes the square-peg task from a wide range of initial states.
RISE: By sharing square play data, RISE completes the square-hook task from a wide range of initial states.
Offline RL is a natural way to learn policies from non-optimal data. Reward labels for human-collected demonstrations on real robots are difficult to obtain; thus, we only assume a label indicating whether each transition belongs to the expert dataset or the non-expert dataset. We label all transitions in the expert dataset with a reward of 1 and all transitions in the non-expert dataset with a reward of 0. We build on IDQL, using the following updates:
$$ \begin{align} \mathcal{L}_V(\psi) &= \mathbb{E}_{(s, a) \sim (\mathcal{D}_{\text{E}} \cup \mathcal{D}_{\text{NE}})} \left[ L^{\tau}_2 \left(Q_\phi (s, a) - V_{\psi}(s)\right) \right] \qquad \text{(Value Learning)} \\ \mathcal{L}_Q(\phi) &= \mathbb{E}_{(s, a, s') \sim \mathcal{D}_{\text{E}}} \left[ \left( 1 + \gamma V_{\psi}(s') - Q_{\phi}(s, a)\right)^2 \right] + \mathbb{E}_{(s, a, s') \sim \mathcal{D}_{\text{NE}}} \left[ \left(\gamma V_{\psi}(s') - Q_{\phi}(s, a)\right)^2 \right] \qquad \text{(Critic Learning)} \\ \pi_B(a|s) &= \text{argmax}_{\pi} \, \mathbb{E}_{(s, a) \sim (\mathcal{D}_{\text{E}} \cup \mathcal{D}_{\text{NE}})}\left[\log \pi(a|s) \right] \qquad \text{(Behavior Policy Learning)} \\ \pi^*(a|s) &= \underset{a \in \{a_1,\dots, a_K\} \sim \pi_B(a|s)}{\text{argmax}} Q_\phi(s, a) \qquad \text{(Optimal Policy Extraction)} \end{align} $$
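For concreteness, the sketch below shows what these updates might look like in PyTorch. The network and variable names (`q_net`, `v_net`, `behavior_policy`) are illustrative placeholders, not our released implementation; target networks and other standard training details are omitted.

```python
# Minimal PyTorch sketch of the IDQL-style updates above (illustrative only).
import torch
import torch.nn.functional as F

def expectile_loss(diff, tau=0.7):
    # Asymmetric L2 loss L^tau_2 used for value learning.
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def value_loss(q_net, v_net, s, a, tau=0.7):
    # (s, a) drawn from the union of expert and non-expert data.
    with torch.no_grad():
        q = q_net(s, a)
    return expectile_loss(q - v_net(s), tau)

def q_loss(q_net, v_net, s, a, s_next, is_expert, gamma=0.99):
    # Expert transitions are labeled with reward 1, non-expert with reward 0.
    reward = is_expert.float()
    with torch.no_grad():
        target = reward + gamma * v_net(s_next)
    return F.mse_loss(q_net(s, a), target)

@torch.no_grad()
def extract_action(behavior_policy, q_net, s, num_samples=32):
    # Sample K candidate actions from the behavior policy and execute the
    # one with the highest Q-value (optimal policy extraction).
    actions = behavior_policy.sample(s, num_samples)   # (K, action_dim)
    s_rep = s.unsqueeze(0).expand(num_samples, -1)     # repeat the state K times
    q_vals = q_net(s_rep, actions).squeeze(-1)         # (K,)
    return actions[q_vals.argmax()]
```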
In practice, however, without very large datasets the probability of exact state overlap in a continuous state space tends to zero, making stitching across overlapping states unlikely. This prevents offline RL from effectively recovering back to expert states from non-expert ones. While this is challenging to solve in the most general case, we base our practical improvements on a set of empirical findings in a robotic manipulation setting, introducing a Lipschitz penalty via spectral normalization along with data augmentation to improve stitching.
Empirically, we observe that Q-value functions learned with expectile regression tend to be accurate and interpolate well within a neighborhood of the data, showing reasonable stitching behavior. The challenge instead comes from policy extraction: while the Q-function interpolates well within a neighborhood, we find that the marginal action distribution captured by the behavior policy can be overly conservative. This tends to be the main practical challenge in offline RL.
To understand this, consider the figures below, which depict a toy task where the goal is to push an object to a goal position on a table. By default (left figure), the action distribution of the policy (in our case parameterized by a diffusion model) is very narrow, and the optimal action falls outside the policy distribution (right figure). To remedy this, we introduce a Lipschitz penalty on the behavior policy during training, which effectively widens the action distribution and improves stitching. While there are several ways to enforce Lipschitz continuity, we simply regularize the policy with a spectral norm penalty. Our behavior policy objective becomes:
$$ \max_\theta \; \mathbb{E}_{(s, a) \sim (\mathcal{D}_{\text{E}} \cup \mathcal{D}_{\text{NE}})}\left[ \log \pi_\theta(a|s) \right] - \lambda\sum_{W \in \theta} \sigma_{\text{max}}(W)^2. $$
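A minimal sketch of how such a penalty could be computed in PyTorch is shown below. Here `policy` and `behavior_nll` are placeholder names for the behavior policy network and its negative log-likelihood (or diffusion) training loss, and computing the exact spectral norm (rather than, say, a power-iteration estimate) is an assumption for illustration.

```python
# Sketch of the spectral-norm penalty on the behavior policy (illustrative).
import torch

def spectral_penalty(policy, lam=1e-3):
    # Sum of squared largest singular values over all weight matrices.
    penalty = torch.zeros((), device=next(policy.parameters()).device)
    for W in policy.parameters():
        if W.dim() == 2:  # weight matrices only; skip biases
            sigma_max = torch.linalg.matrix_norm(W, ord=2)  # largest singular value
            penalty = penalty + sigma_max.pow(2)
    return lam * penalty

# One training step: minimizing NLL + penalty matches the maximization above
# and discourages sharp (high-Lipschitz) policies, which in practice widens
# the learned action distribution.
# loss = behavior_nll(policy, s, a) + spectral_penalty(policy, lam=1e-3)
# loss.backward(); optimizer.step()
```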
The effect of spectral norm regularization, shown in the figure below, is to significantly widen the action distribution.
Figure 2: Adding a spectral norm penalty to the behavior policy widens the action distribution, improving robustness.
In addition, we widen the policy distribution explicitly by adding transitions to the dataset. For a given (s, a) pair, we select nearby transitions (s', a') such that d(s, s') < T for some threshold T and distance metric d, and add the transition (s, a') to the dataset. As the distance metric, we use Euclidean distance in the feature space of a large pretrained vision model, in our case DINOv2, which has been shown to capture semantic similarity between images.
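The sketch below illustrates one way this augmentation could be implemented with an off-the-shelf DINOv2 backbone from torch.hub; the helper names and the choice of threshold are illustrative assumptions rather than our exact pipeline.

```python
# Possible implementation of the neighbor-based action relabeling (illustrative).
import torch

# Pretrained DINOv2 backbone (ViT-S/14 shown here).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def dino_features(images):
    # images: (N, 3, H, W), ImageNet-normalized, H and W divisible by 14.
    return dino(images)  # (N, feature_dim) CLS-token embeddings

@torch.no_grad()
def augment_with_neighbors(images, actions, threshold):
    # For each state s_i, add (s_i, a_j) for every neighbor s_j with
    # ||f(s_i) - f(s_j)|| < threshold in DINOv2 feature space.
    feats = dino_features(images)
    dists = torch.cdist(feats, feats)  # pairwise Euclidean distances
    aug_states, aug_actions = [], []
    for i in range(len(images)):
        neighbors = (dists[i] < threshold).nonzero(as_tuple=True)[0]
        for j in neighbors.tolist():
            if j != i:
                aug_states.append(images[i])
                aug_actions.append(actions[j])
    return aug_states, aug_actions
```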
We visualize the effect of RISE in the figures below. First, we show the expert and non-expert data for the square-peg task. Although the two datasets appear to overlap, stitching between them is challenging in practice. With naive offline RL, the policy takes transitions similar to those found in the non-expert data but is unable to stitch to complete the task. In contrast, RISE, which incorporates the Lipschitz penalty and data augmentation, successfully completes the task.
Non-expert data for the square-peg task.
Visualization of trajectories in the non-expert data, showing task completion.
Example of expert data for the square-peg task.
Visualization of trajectories in the expert data, showing task completion.
Policy trained with naive offline RL
Naive offline RL fails to stitch, resulting in task failure, while RISE successfully stitches.
Policy trained with RISE
The authors would like to acknowledge the members of the Robot Learning Lab and the Washington Embodied Intelligence and Robotics Development Lab for helpful and informative discussions throughout the process of this research. The authors would also like to thank Emma Romig for robot hardware help. This research was supported by funding from Toyota Research Institute, under the University 2.0 research program.
@article{huang2025rise,
title={Using Non-Expert Data to Robustify Imitation Learning via Offline Reinforcement Learning},
author={Kevin Huang and Rosario Scalise and Cleah Winston and Ayush Agrawal and Yunchu Zhang and Rohan Baijal and Markus Grotz and Byron Boots and Benjamin Burchfiel and Masha Itkina and Paarth Shah and Abhishek Gupta},
journal={Under Review},
url={https://arxiv.org/abs/2510.19495},
year={2025}
}