Goal-conditioned robotic grasping in cluttered environments remains a challenging problem due to occlusions caused by surrounding objects, which prevent direct access to the target object. A promising solution is to combine pushing and grasping policies, enabling active rearrangement of the scene to facilitate target retrieval. However, existing methods often overlook the rich geometric structure inherent in such tasks, limiting their effectiveness in complex, heavily cluttered scenarios. To address this, we propose the Equivariant Push-Grasp Network, a novel framework for joint pushing and grasping policy learning. Our contributions are twofold: (1) leveraging $SE(2)$-equivariance to improve both pushing and grasping performance, and (2) introducing a grasp-score-optimization-based training strategy that simplifies the joint learning process. Experimental results show that our method improves grasp success rates by 45% in simulation and by 35% in real-world scenarios compared to strong baselines, representing a significant advancement in push-grasp policy learning.
In this work, we introduce the $\textbf{Equivariant Push-Grasp (EPG)}$ Network, a novel framework for efficient goal-conditioned push-grasp policy learning in cluttered environments. EPG leverages inherent task symmetries to improve both sample efficiency and performance. Specifically, we model the pushing and grasping policies with $SE(2)$-equivariant neural networks, embedding rotational and translational symmetry as an inductive bias. This design substantially enhances the model's generalization and data efficiency. Furthermore, we propose a self-supervised training approach that optimizes the pushing policy with a reward signal defined as the change in grasp score before and after each push. This formulation simplifies the training procedure and naturally couples the learning of pushing and grasping.
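Concretely, the reward for a push executed at step $t$ can be written as the improvement in the target's grasp score (the notation below is ours, introduced only for illustration):

$$r_t = s_{\text{grasp}}(o_{t+1}) - s_{\text{grasp}}(o_t),$$

where $o_t$ and $o_{t+1}$ are the observations before and after the push, and $s_{\text{grasp}}(\cdot)$ denotes the CriticNet score of the best grasp GraspNet proposes for the target object.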
The target object, specified by human instruction, is highlighted with a red mask (e.g., a banana). At each step, the push action direction is represented by an arrow. Our method iteratively predicts and executes push actions to create sufficient space for grasping the target. The final grasp pose is shown as a blue rectangle, with green blocks indicating the gripper's fingers.
Our model consists of three key components: a CriticNet, a GraspNet, and a PushNet. At each time step, GraspNet and PushNet generate a grasp action and a push action with respect to the target object. CriticNet then evaluates the grasp action by assigning it a score. If the score exceeds a predefined threshold $\tau$ or the maximum number of push attempts is reached, the grasp action is executed. Otherwise, the push action is executed, and the process repeats with an updated observation.
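The inference-time behavior is a simple loop over this push-or-grasp decision. Below is a minimal sketch of that loop; all names (`grasp_net`, `push_net`, `critic_net`, `execute_grasp`, `execute_push`) and the specific threshold and push-budget values are illustrative assumptions, not the authors' actual API.

```python
# Minimal sketch of the EPG push-then-grasp decision loop described above.
# Function and variable names are illustrative placeholders, not the paper's API.

def push_grasp_episode(env, grasp_net, push_net, critic_net, tau=0.8, max_pushes=5):
    obs = env.get_observation()             # e.g., top-down observation plus target mask
    for _ in range(max_pushes):
        grasp = grasp_net(obs)              # candidate grasp for the target object
        score = critic_net(obs, grasp)      # estimated grasp quality
        if score >= tau:                    # confident enough: grasp immediately
            return env.execute_grasp(grasp)
        push = push_net(obs)                # otherwise rearrange the clutter
        env.execute_push(push)
        obs = env.get_observation()         # re-observe the updated scene
    # Push budget exhausted: execute the current best grasp anyway.
    return env.execute_grasp(grasp_net(obs))
```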
Compared to previous works that rely on complex alternating training between the grasp and push networks, we propose a simple two-step training process. First, we train a universal, goal-agnostic GraspNet together with a CriticNet that evaluates predicted grasps and returns a score. Then, we use the difference in CriticNet grasp scores before and after pushing as a reward signal to train a goal-conditioned PushNet. This decoupled training strategy eliminates the need for alternating optimization and its scheduling-related hyperparameters, making training more stable, controllable, and efficient. For more details, please refer to the paper.
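As a rough illustration of the second stage (GraspNet and CriticNet are assumed to be already trained and frozen), a PushNet update could look like the sketch below. The REINFORCE-style policy-gradient update, the mask-based grasp selection, and every name here are our own simplifying assumptions, not the authors' implementation.

```python
# Sketch of the second training stage: PushNet is optimized with the change in
# CriticNet grasp score (after push minus before push) as a self-supervised reward.
# Assumes PyTorch-style tensors/optimizer; all names are illustrative.

def target_grasp_score(obs, target_mask, grasp_net, critic_net):
    """CriticNet score of the best goal-agnostic GraspNet proposal on the target.
    Selecting target grasps via the mask is an illustrative simplification."""
    grasps = grasp_net(obs)                               # goal-agnostic grasp proposals
    on_target = [g for g in grasps if g.on(target_mask)]  # keep proposals on the target
    return max((critic_net(obs, g) for g in on_target), default=0.0)

def push_training_step(env, push_net, grasp_net, critic_net, optimizer):
    obs, target_mask = env.get_observation(), env.get_target_mask()

    score_before = target_grasp_score(obs, target_mask, grasp_net, critic_net)
    push, log_prob = push_net.sample(obs, target_mask)    # stochastic push action
    env.execute_push(push)
    new_obs = env.get_observation()
    score_after = target_grasp_score(new_obs, target_mask, grasp_net, critic_net)

    reward = score_after - score_before     # grasp-score improvement from the push
    loss = -log_prob * reward               # REINFORCE-style surrogate loss (assumed)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```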
We conduct experiments in both simulation and real-world environments to evaluate our method, with the setup illustrated below. The evaluation consists of three tasks: goal-conditioned push-grasp in clutter, clutter clearing, and goal-conditioned push-grasp in tight layouts.
The comparison results for the $\textbf{Goal-conditioned Push-Grasp in Clutter}$ task. Our method achieves the best performance, significantly outperforming all baselines. On average, across all settings with different numbers of objects, it surpasses the best baseline by 44.7% in Grasp Success Rate (GSR). The first two variations (Table I, rows 3 and 4) show that integrating our approach into existing baselines further improves their performance, highlighting the effectiveness of our design.
This table shows the results of the $\textbf{Clutter Clearing}$ task. Although this task is target-agnostic, push actions remain beneficial in cluttered environments. Since there is no specific target, we treat the object with the highest GraspNet score as the target at each step. The results show that our method's grasping capability exceeds that of all baselines by a large margin, both with and without push actions.
This table presents the results of the real-world $\textbf{Goal-conditioned Push-Grasp in Clutter}$ task. Our EPG outperforms all baselines by at least 35% in GSR. The primary failure cases are: 1) inaccurate object masks from SAM2, which in turn degrade PushNet and CriticNet outputs; and 2) imprecise grasp poses predicted by GraspNet. Despite these challenges, our method demonstrates strong overall stability.
Cases 1–8: the eight box-layout configurations used in the real-world Goal-Conditioned Push-Grasp in Tight Layouts task.
The configuration of the real-world $\textbf{Goal-Conditioned Push-Grasp in Tight Layouts}$ task is shown above. It contains eight cases, each with a varying number of small boxes placed in specific positions. The objective is to grasp the yellow box, which is always placed at the center of the surrounding boxes. These layouts are unseen during training and require effective strategies to solve, placing a strong demand on generalization. The results in this table show that despite increasing task complexity, our method consistently outperforms the baselines while maintaining stable performance.
@ARTICLE{11150764,
  author={Hu, Boce and Tian, Heng and Wang, Dian and Huang, Haojie and Zhu, Xupeng and Walters, Robin and Platt, Robert},
  journal={IEEE Robotics and Automation Letters},
  title={Push-Grasp Policy Learning Using Equivariant Models and Grasp Score Optimization},
  year={2025},
  doi={10.1109/LRA.2025.3606392}
}