Push-Grasp Policy Learning Using Equivariant Models and Grasp Score Optimization

Northeastern University
IEEE Robotics and Automation Letters (RAL)

Abstract

Goal-conditioned robotic grasping in cluttered environments remains a challenging problem due to occlusions caused by surrounding objects, which prevent direct access to the target object. A promising solution to mitigate this issue is combining pushing and grasping policies, enabling active rearrangement of the scene to facilitate target retrieval. However, existing methods often overlook the rich geometric structures inherent in such tasks, thus limiting their effectiveness in complex, heavily cluttered scenarios. To address this, we propose the Equivariant Push-Grasp Network, a novel framework for joint pushing and grasping policy learning. Our contributions are twofold: (1) leveraging $SE(2)$-equivariance to improve both pushing and grasping performance and (2) a grasp score optimization-based training strategy that simplifies the joint learning process. Experimental results show that our method improves grasp success rates by 45% in simulation and by 35% in real-world scenarios compared to strong baselines, representing a significant advancement in push-grasp policy learning.

Summary of EPG

In this work, we introduce the $\textbf{Equivariant Push-Grasp (EPG)}$ Network, a novel framework for efficient goal-conditioned push-grasp policy learning in cluttered environments. EPG leverages inherent task symmetries to improve both sample efficiency and performance. Specifically, we model the pushing and grasping policies using $SE(2)$-equivariant neural networks, embedding rotational and translational symmetry as an inductive bias. This design substantially enhances the model's generalization and data efficiency. Furthermore, we propose a self-supervised training approach that optimizes the pushing policy with a reward signal defined as the change in grasping scores before and after each push. This formulation simplifies the training procedure and naturally couples the learning of pushing and grasping.
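To make the symmetry constraint concrete: an $SE(2)$-equivariant policy network must commute with planar rotations and translations of its input, so rotating the observation rotates the predicted action maps in the same way. The sketch below checks this property for the discrete rotation group $C_4$ on a toy stand-in network; the function names, map shapes, and the mean-filter "network" are illustrative assumptions, not the actual EPG architecture.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def rotate90(x, k):
        # Rotate a (C, H, W) map by k * 90 degrees in the image plane.
        return np.rot90(x, k=k, axes=(-2, -1)).copy()

    def is_c4_equivariant(f, obs, atol=1e-5):
        # Check f(g . obs) == g . f(obs) for the four rotations in C4,
        # the discrete subgroup of SE(2) rotations typically used on pixel grids.
        out = f(obs)
        return all(
            np.allclose(f(rotate90(obs, k)), rotate90(out, k), atol=atol)
            for k in range(4)
        )

    def toy_equivariant_net(obs):
        # Stand-in "network": a 3x3 mean filter with circular padding.
        # Its isotropic kernel makes it exactly rotation- and translation-equivariant.
        return uniform_filter(obs, size=(1, 3, 3), mode="wrap")

    obs = np.random.rand(1, 64, 64)                     # hypothetical heightmap observation
    print(is_c4_equivariant(toy_equivariant_net, obs))  # True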

Workflow of EPG

The target object, specified by human instruction, is highlighted with a red mask (e.g., a banana). At each step, the push action direction is represented by an arrow. Our method iteratively predicts and executes push actions to create sufficient space for grasping the target. The final grasp pose is shown as a blue rectangle, with green blocks indicating the gripper's fingers.

Model Architecture of EPG

Our model consists of three key components: a CriticNet, a GraspNet, and a PushNet. At each time step, GraspNet and PushNet generate a grasp action and a push action with respect to the target object. CriticNet then evaluates the grasp action by assigning it a score. If the score exceeds a predefined threshold $\tau$ or the maximum number of push attempts is reached, the grasp action is executed. Otherwise, the push action is executed, and the process repeats with an updated observation.
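A minimal sketch of this decision loop is shown below. The callables (`get_observation`, `grasp_net`, `push_net`, `critic_net`, `execute_push`, `execute_grasp`) and the default values of $\tau$ and the push budget are hypothetical placeholders standing in for the paper's components, not a released API.

    def push_grasp_episode(get_observation, grasp_net, push_net, critic_net,
                           execute_push, execute_grasp, target_mask,
                           tau=0.8, max_pushes=5):
        # Push until the target's predicted grasp scores above tau or the push
        # budget is exhausted, then execute the grasp.
        obs = get_observation()
        grasp = grasp_net(obs, target_mask)           # candidate grasp for the target
        pushes = 0
        while critic_net(obs, grasp) < tau and pushes < max_pushes:
            execute_push(push_net(obs, target_mask))  # rearrange clutter around the target
            obs = get_observation()                   # re-observe the updated scene
            grasp = grasp_net(obs, target_mask)
            pushes += 1
        return execute_grasp(grasp)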

Two-Step Agent Learning

Compared to previous works that rely on complex alternating training between grasp and push networks, we propose a simple two-step training process. First, we train a universal, goal-agnostic GraspNet together with a CriticNet that evaluates predicted grasps and returns a score. Then, we use the difference in grasp scores before and after pushing, computed from the CriticNet, as a reward signal to train a goal-conditioned PushNet. This decoupled training strategy eliminates the need for alternating optimization and its scheduling-related hyperparameters, making training more stable, controllable, and efficient. For more details, please refer to the paper.
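As a rough sketch of the second step under stated assumptions: the push reward is the CriticNet score of the target's best grasp after the push minus the score before it, and PushNet can then be updated with a simple single-step regression on the executed push. The tensor shapes, the dense per-pixel value map, and the loss below are illustrative assumptions rather than the paper's exact training code.

    import torch
    import torch.nn.functional as F

    def push_reward(obs_before, obs_after, target_mask, grasp_net, critic_net):
        # Self-supervised reward: change in the target's grasp score caused by the push.
        with torch.no_grad():
            s_before = critic_net(obs_before, grasp_net(obs_before, target_mask))
            s_after = critic_net(obs_after, grasp_net(obs_after, target_mask))
        return s_after - s_before

    def push_net_update(push_net, optimizer, obs, target_mask, push_pixel, reward):
        # One gradient step: pull the executed push's predicted value toward the
        # observed reward (orientation channels omitted for brevity).
        q_map = push_net(obs, target_mask)                 # dense per-pixel push values
        q_pred = q_map[..., push_pixel[0], push_pixel[1]].squeeze()
        loss = F.smooth_l1_loss(q_pred, reward.squeeze())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()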

Experiments

We conduct experiments in both simulation and real-world environments to evaluate our method, with the setup illustrated below. The evaluation consists of three tasks:

  • Goal-Conditioned Push-Grasp in Clutter: This task assesses our framework's ability to retrieve a specific object from a cluttered scene.
  • Clutter Clearing: This task evaluates the ability to clear an entire scene without any predefined target or grasp sequence.
  • Goal-Conditioned Push-Grasp in Tight Layouts: In this task, objects are arranged in challenging geometric configurations (e.g., tight clusters, narrow gaps). This is a hard task because the robot must push strategically to create graspable space in a constrained environment.

$\textbf{Simulation Experiment Results}$

This table compares methods on the $\textbf{Goal-Conditioned Push-Grasp in Clutter}$ task. Our method achieves the best performance, significantly outperforming all baselines. Averaged across all settings with different numbers of objects, it surpasses the best baseline by 44.7% in Grasp Success Rate (GSR). The first two variations (Table I, rows 3 and 4) show that integrating our approach into existing baselines further improves their performance, highlighting the effectiveness of our design.

This table shows the results of the $\textbf{Clutter Clearing}$ task. Although this task is target-agnostic, push actions remain beneficial in cluttered environments. Since there is no specific target, we treat the object with the highest score from GraspNet as the target at each step. The results show that our method's grasping capability exceeds all baselines by a large margin, both with and without push actions.
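For reference, this target-selection rule amounts to a one-line argmax over object masks. The `grasp_quality(obs, mask)` helper below, which would return the score of the best grasp predicted for a given mask, is a hypothetical wrapper around GraspNet and CriticNet, not part of the released code.

    def select_target(obs, object_masks, grasp_quality):
        # Clutter clearing has no specified target, so treat the object whose
        # best predicted grasp currently scores highest as the target.
        return max(object_masks, key=lambda m: grasp_quality(obs, m))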

$\textbf{Real-world Experiment Results}$

This table presents the results of the real-world $\textbf{Goal-Conditioned Push-Grasp in Clutter}$ task. Our EPG significantly outperforms all baselines by at least 35% in GSR. The primary failure cases are: 1) inaccurate object masks from SAM2, which in turn degrade PushNet and CriticNet outputs; 2) imprecise grasp poses predicted by GraspNet. Despite these challenges, our method demonstrates strong overall stability.

Cases 1–8: the eight real-world tight-layout configurations.

The configuration of the real-world $\textbf{Goal-Conditioned Push-Grasp in Tight Layouts}$ task is shown above. It contains eight different cases, each with a varying number of small boxes placed in specific positions. The objective is to grasp the yellow box, which is always placed at the center of the surrounding boxes. These configurations are unseen during training and require effective strategies to solve, placing a strong demand on generalization. The results in this table indicate that, despite increasing task complexity, our method consistently outperforms the baselines while maintaining stable performance.

Video

Citation


@ARTICLE{11150764,
    author={Hu, Boce and Tian, Heng and Wang, Dian and Huang, Haojie and Zhu, Xupeng and Walters, Robin and Platt, Robert},
    journal={IEEE Robotics and Automation Letters},
    title={Push-Grasp Policy Learning Using Equivariant Models and Grasp Score Optimization},
    year={2025},
    doi={10.1109/LRA.2025.3606392}
}