MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation
The code is developed with Python 3.10 and should run in any environment with Python 3.8 or above. We require PyTorch >= 2.2.0 and CUDA >= 12.0 (lower versions may work, but we have not tested them).
We recommend using Miniconda to set up an environment:
conda create --name MoLe_VLA python=3.10
Next, clone our repo and install the required packages:
git clone https://github.com/RoyZry98/MoLe-VLA.git
cd MoLe-VLA
conda env create -f environment.yml
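After installation, you can verify that the environment satisfies the requirements above (a minimal check, not part of the repo):
import torch
print(torch.__version__)          # expect >= 2.2.0
print(torch.version.cuda)         # expect >= 12.0 (None means a CPU-only build)
print(torch.cuda.is_available())  # should print True on a GPU machine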
The backbone model CogACT, including checkpoints, configs, and model cards, is available on its Hugging Face page. Refer to the code below for minimal inference:
from PIL import Image
from vla import load_vla
import torch

model = load_vla(
    'CogACT/CogACT-Base',          # choose from [CogACT-Small, CogACT-Base, CogACT-Large] or a local path
    load_for_training=False,
    action_model_type='DiT-B',     # choose from ['DiT-S', 'DiT-B', 'DiT-L'] to match the model weights
    future_action_window_size=15,
)
# about 30 GB of memory in fp32;
# (optional) use "model.vlm = model.vlm.to(torch.bfloat16)" to load the VLM in bf16
model.to('cuda:0').eval()

image: Image.Image = <input_your_image>
prompt = "move sponge near apple"  # input your prompt

# Predict action (7-DoF; un-normalized for the RT-1 Google robot data, i.e., fractal20220817_data)
actions, _ = model.predict_action(
    image,
    prompt,
    unnorm_key='fractal20220817_data',  # the unnorm_key of your dataset
    cfg_scale=1.5,                      # cfg from 1.5 to 7 also performs well
    use_ddim=True,                      # use DDIM sampling
    num_ddim_steps=10,                  # number of DDIM sampling steps
)
# returns 7-DoF actions for 16 steps with shape [16, 7]
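As a quick sanity check on the output, you can inspect the first predicted step. The split below into translation, rotation, and gripper components assumes the common RT-1-style 7-DoF layout and is purely illustrative:
# Illustrative only: assumes the 7 dims are (dx, dy, dz, droll, dpitch, dyaw, gripper)
first_step = actions[0]
delta_xyz = first_step[:3]
delta_rpy = first_step[3:6]
gripper = first_step[6]
print("translation:", delta_xyz, "rotation:", delta_rpy, "gripper:", gripper)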
Alternatively, you can use the batch inference function predict_action_batch from vla/cogactvla.py to accelerate inference in the simulator.
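A rough sketch of how such a call might look, assuming predict_action_batch mirrors predict_action but accepts lists of images and prompts (check vla/cogactvla.py for the actual signature):
# Hypothetical usage; see vla/cogactvla.py for the real signature
images = [image_env0, image_env1]                  # PIL images from parallel simulator environments
prompts = ["move sponge near apple"] * len(images)
actions_batch, _ = model.predict_action_batch(
    images,
    prompts,
    unnorm_key='fractal20220817_data',
    cfg_scale=1.5,
    use_ddim=True,
    num_ddim_steps=10,
)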
To launch multi-task training, run the provided script from the repository root:
cd /path/to/MoLe-VLA
bash train_multi_task10_mix.sh 14 0.5 0.1 0.5 32 0.999 0,1,2,3,4,5,6,7
Please cite our work if you find it useful.
@article{zhang2025mole,
  title={MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation},
  author={Zhang, Rongyu and Dong, Menghang and Zhang, Yuan and Heng, Liang and Chi, Xiaowei and Dai, Gaole and Du, Li and Wang, Dan and Du, Yuan and Zhang, Shanghang},
  journal={arXiv preprint arXiv:2503.20384},
  year={2025}
}