Paper | STAR-1 Data | STAR-1 Model | Project Page
Zijun Wang, Haoqin Tu, Yuhan Wang, Juncheng Wu, Jieru Mei, Brian R. Bartoldson, Bhavya Kailkhura, Cihang Xie
STAR-1 is a high-quality safety dataset designed to enhance safety alignment in large reasoning models (LRMs) like DeepSeek-R1.
- Built on the principles of diversity, deliberative reasoning, and rigorous filtering, STAR-1 integrates and refines data from multiple sources to provide policy-grounded reasoning samples.
- The dataset contains 1,000 carefully selected examples, each aligned with best safety practices through GPT-4o-based evaluation.
- Fine-tuning with STAR-1 leads to significant safety improvements across multiple benchmarks, with minimal impact on reasoning capabilities.
Dataset | Num. of Samples | URL |
---|---|---|
STAR-1 | 1K | UCSC-VLAA/STAR-1 |
STAR-41K | 41K | UCSC-VLAA/STAR-41K |
STAR-benign-915 | 915 | UCSC-VLAA/STAR-benign-915 |
Model | Type | URL |
---|---|---|
STAR1-R1-Distill-1.5B | R1-Distill-Qwen-1.5B trained on STAR-1 | UCSC-VLAA/STAR1-R1-Distill-1.5B |
STAR1-R1-Distill-7B | R1-Distill-Qwen-7B trained on STAR-1 | UCSC-VLAA/STAR1-R1-Distill-7B |
STAR1-R1-Distill-8B | R1-Distill-Llama-8B trained on STAR-1 | UCSC-VLAA/STAR1-R1-Distill-8B |
STAR1-R1-Distill-14B | R1-Distill-Qwen-14B trained on STAR-1 | UCSC-VLAA/STAR1-R1-Distill-14B |
STAR1-R1-Distill-32B | R1-Distill-Qwen-32B trained on STAR-1 | UCSC-VLAA/STAR1-R1-Distill-32B |
- data_making/: STAR-1 data-making pipeline (Sec. 2)
  - data_collection/: Sec. 2.1
  - deliberative_reasoning/: Sec. 2.2
  - data_selection/: Sec. 2.3
- train/: Training scripts (Sec. 3.1)
- benchmark/: Evaluation scripts (Sec. 3.1)
  - safe_benchmark/: Safety evaluation
  - reasoning_benchmark/: Reasoning evaluation
- overrefusal_ablation/: A mitigation for the overrefusal behaviour (Sec. 4.3)
git clone https://github.com/UCSC-VLAA/STAR-1.git
cd STAR-1
pip install -e .
cd data_making/data_collection/scripts
bash load_test_datasets.sh
bash collect_decon_train_datasets.sh
You will get a JSON file named data_collection.json under the data_making/data_collection/ folder. It contains the 41K initially collected samples (after deduplication).
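The deduplication step mentioned above can be illustrated with a minimal sketch. This is not the repository's implementation: it assumes exact matching after simple text normalization, and the field name `question` is a hypothetical placeholder for whatever key the collected records actually use.

```python
import json
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(records: list[dict], key: str = "question") -> list[dict]:
    """Keep the first record for each normalized value of `key`.

    `key` defaults to "question", a hypothetical field name used only
    for illustration.
    """
    seen: set[str] = set()
    unique = []
    for record in records:
        fingerprint = normalize(record[key])
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


if __name__ == "__main__":
    sample = [
        {"question": "How do I pick a lock?"},
        {"question": "how do I  pick a lock?"},  # duplicate after normalization
        {"question": "What is phishing?"},
    ]
    print(json.dumps(deduplicate(sample), indent=2))
```

Exact matching is the simplest choice; near-duplicate detection (for example n-gram overlap) would catch more cases at a higher cost.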
cd data_making/deliberative_reasoning/category_classification
mkdir datasets
cp ../../data_collection/data_collection.json datasets/
bash category_classification.sh
You will get a JSON file named data_with_category.json under the data_making/deliberative_reasoning/category_classification/ folder. It contains the 41K samples with safety categories classified by GPT-4o.
cd data_making/deliberative_reasoning/reasoning_generation
mkdir datasets
cp ../category_classification/data_with_category.json datasets/
bash reasoning_generation.sh
You will get a JSON file named data_with_cot.json under the data_making/deliberative_reasoning/reasoning_generation/ folder. It contains the 41K samples with safety-aligned reasoning traces grounded in safety policies, generated by DeepSeek-R1.
cd data_making/data_selection
mkdir datasets
cp ../deliberative_reasoning/reasoning_generation/data_with_cot.json datasets/
bash scorer.sh
You will get:
- A JSON file named data_with_score.json under the data_making/data_selection/ folder, containing the 41K samples with GPT-4o-based scores. This data_with_score.json is our STAR-41K.
- A JSON file named all_10.json under the data_making/data_selection/ folder, containing the samples with full scores (10) on all criteria (Sec. 2.3, Ensuring Accuracy).
- A JSON file named star1_high.json under the data_making/data_selection/ folder, containing the 1K samples selected for balanced representation (Sec. 2.3, Ensuring Diversity). This star1_high.json is our final STAR-1.
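The two selection stages (keep only full-score samples, then pick a balanced subset) can be sketched as follows. This is an illustration under stated assumptions, not the procedure from Sec. 2.3: the field names `scores` and `category` are hypothetical, and round-robin sampling over categories is a simple stand-in for the paper's balancing strategy.

```python
from collections import defaultdict
from itertools import zip_longest


def has_full_scores(record: dict, full_score: int = 10) -> bool:
    """True when every criterion in the (hypothetical) `scores` dict equals full_score."""
    return all(value == full_score for value in record["scores"].values())


def balanced_select(records: list[dict], budget: int, key: str = "category") -> list[dict]:
    """Pick up to `budget` records by cycling over groups defined by `key`.

    Round-robin keeps any single group from dominating the selection.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    selected = []
    # zip_longest yields one record per group per round and pads
    # exhausted groups with None.
    for round_of_records in zip_longest(*groups.values()):
        for record in round_of_records:
            if record is not None and len(selected) < budget:
                selected.append(record)
    return selected


if __name__ == "__main__":
    data = [
        {"category": "privacy", "scores": {"a": 10, "b": 10}},
        {"category": "privacy", "scores": {"a": 10, "b": 9}},
        {"category": "violence", "scores": {"a": 10, "b": 10}},
    ]
    full = [r for r in data if has_full_scores(r)]
    print(balanced_select(full, budget=2))
```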
cd train
bash run_sft.sh
The run_sft.sh script looks like:
accelerate launch --config_file ./configs/deepspeed_zero3.yaml \
--num_processes 8 \
--train_bsz_per_gpu 1 \
--num_machines 1 \
--machine_rank 0 \
--deepspeed_multinode_launcher standard sft.py \
--model_path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
--data_path ../data/STAR-1.json \
--n_epochs 5 \
--experiment_name STAR-1 \
--base_model Qwen \
--base_flag 0 \
--think_flag 1
- base_flag: 0 for a distilled model, 1 for an instruct model
- think_flag: default 1; set to 0 for the w/o think setting (Sec. 4.2)
- train_bsz_per_gpu * num_processes should be 8 to keep the batch size at 128
- Change model_path to fine-tune a different model
- Or change data_path to use different fine-tuning data (Sec. 4.1)
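The batch-size note above can be checked with a little arithmetic. The sketch below assumes that the gap between the per-step batch of 8 and the overall batch size of 128 is bridged by gradient accumulation; that is an assumption, since the source does not say how sft.py reaches 128. The function name is hypothetical.

```python
def grad_accum_steps(train_bsz_per_gpu: int, num_processes: int, global_batch: int = 128) -> int:
    """Accumulation steps needed to reach `global_batch`.

    Assumes: global_batch = train_bsz_per_gpu * num_processes * accumulation_steps.
    """
    per_step = train_bsz_per_gpu * num_processes
    if per_step <= 0 or global_batch % per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not a multiple of per-step batch {per_step}"
        )
    return global_batch // per_step


if __name__ == "__main__":
    # Settings from run_sft.sh: 1 sample per GPU on 8 processes.
    print(grad_accum_steps(train_bsz_per_gpu=1, num_processes=8))  # 16
    # Same per-step batch of 8 on 4 GPUs.
    print(grad_accum_steps(train_bsz_per_gpu=2, num_processes=4))  # 16
```

Under this assumption, any combination whose product is 8 (for example 2 per GPU on 4 processes) leaves the overall batch size unchanged.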
You can change the mode_path of the evaluated model in benchmark/config.py.
cd benchmark/safe_benchmark
bash scripts.sh $model $data
# bash scripts.sh DeepSeek-R1-Distill-Qwen-1.5B strongreject
The reasoning benchmark code is adapted from simple-evals.
cd benchmark/reasoning_benchmark
bash run_all_evals.sh
To change models, edit MODELS in the bash script run_all_evals.sh (line 7).
cd overrefusal_ablation/benign_gene
mkdir datasets
cp ../../data_making/data_selection/star1_high.json datasets/
bash rewriter.sh
You will get a JSON file named star1_benign.json under the overrefusal_ablation/benign_gene/ folder. It contains 1K benign variants that are structurally similar to the 1K harmful questions in STAR-1.
cd overrefusal_ablation/reasoning_generation
mkdir datasets
cp ../benign_gene/star1_benign.json datasets/
bash reasoning_generation.sh
You will get a JSON file named star1_benign_with_cot.json under the overrefusal_ablation/reasoning_generation/ folder. It contains the 1K benign variants with reasoning traces generated by DeepSeek-R1.
cd overrefusal_ablation/scorer
mkdir datasets
cp ../reasoning_generation/star1_benign_with_cot.json datasets/
bash scorer.sh
You will get:
- A JSON file named star1_benign_with_score.json under the overrefusal_ablation/scorer/ folder, containing the 1K benign variants with GPT-4o-based scores.
- A JSON file named star1_benign_filtered.json under the overrefusal_ablation/scorer/ folder, containing the benign variants with full scores (5) on all criteria. This star1_benign_filtered.json is our final STAR-benign-915.
Then you can combine star1_benign_filtered.json with star1_high.json and use the resulting roughly 2K samples to fine-tune (SFT) a model and benchmark it. The pipeline is the same as in Training and Evaluation above.
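Combining the two files can be sketched as below. This assumes both files hold a top-level JSON list of records with a compatible schema, which the source does not confirm. The function name and the output filename `star1_plus_benign.json` are hypothetical, and the shuffle is an optional choice rather than a documented step.

```python
import json
import random
from pathlib import Path


def combine_json_lists(paths: list[str], output_path: str, seed: int = 0) -> int:
    """Concatenate JSON lists from `paths`, shuffle them, and write the result.

    Returns the number of records written.
    """
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
        if not isinstance(data, list):
            raise ValueError(f"{path} does not contain a top-level JSON list")
        merged.extend(data)

    # Seeded shuffle so harmful and benign samples are interleaved reproducibly.
    random.Random(seed).shuffle(merged)
    Path(output_path).write_text(
        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(merged)


if __name__ == "__main__":
    inputs = ["star1_high.json", "star1_benign_filtered.json"]
    if all(Path(p).exists() for p in inputs):
        count = combine_json_lists(inputs, "star1_plus_benign.json")
        print(f"wrote {count} samples")
```

The combined file can then be passed to run_sft.sh through data_path.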
This work is partially supported by a gift from Open Philanthropy. We thank the NAIRR Pilot Program and the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs.
LLNL co-authors were supported under Contract DE-AC52-07NA27344 with the U.S. Department of Energy and the LLNL-LDRD Program under Project No. 24-ERD-058. The United States Government retains, and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.
@article{wang2025star1saferalignmentreasoning,
title={STAR-1: Safer Alignment of Reasoning LLMs with 1K Data},
author={Zijun Wang and Haoqin Tu and Yuhan Wang and Juncheng Wu and Jieru Mei and Brian R. Bartoldson and Bhavya Kailkhura and Cihang Xie},
year={2025},
journal = {arXiv preprint arXiv:2504.01903}
}