Model Card: a2ui_qwen3_235b_grpo_0415_v1

1. Overview

This model version is a GRPO-trained A2UI model initialized from a 235B SFT checkpoint. It uses the reward-v2 pipeline and is trained on a filtered subset of the 0409 GRPO data with a larger response budget.
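
As a point of reference for the GRPO training mentioned above, the sketch below shows one common formulation of the group-relative advantage: each sampled response's reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, and the reward values are hypothetical and chosen for illustration; this is not the reward-v2 pipeline and may differ from the implementation actually used for this model.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std.

    ASSUMPTION: population std with a small eps for stability; real
    trainers may use a different normalization.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for one prompt's group of 8 sampled responses.
rewards = [0.0, 1.0, 0.5, 1.0, 0.0, 0.25, 0.75, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to form a baseline.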

2. Model Metadata

Field	Value
model_version_id	`a2ui_qwen3_235b_grpo_0415_v1`
algorithm	`grpo`
base_model	`Qwen/Qwen3-235B-A22B-Instruct-2507`
precision	`bf16`

3. Lineage

Field	Value
parent_model_version_id	`a2ui_qwen3_235b_sft_0414_v1`

4. Training Data

Field	Value
dataset_version_id	`a2ui_data_grpo_0409_v2`

5. Training Configuration

Field	Value
epochs_or_steps	`steps=65`
batch_size	`32`
group_size	`8`
lr	`3e-5`
max_ctx	`8192`
max_resp	`16384`
lora_rank	`16`
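
The configuration above can be captured in a small typed record, which also makes the implied rollout volume easy to compute. This is a minimal sketch: the `GrpoConfig` class and its methods are hypothetical, count-valued fields are written as integers, and the rollout arithmetic assumes that `batch_size` counts prompts per step and that `group_size` responses are sampled per prompt, which this card does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrpoConfig:
    """Hypothetical container for the values in the table above."""
    steps: int = 65
    batch_size: int = 32
    group_size: int = 8
    lr: float = 3e-5
    max_ctx: int = 8192
    max_resp: int = 16384
    lora_rank: int = 16

    def rollouts_per_step(self) -> int:
        # ASSUMPTION: batch_size is the number of prompts per step and
        # each prompt gets group_size sampled responses.
        return self.batch_size * self.group_size

    def total_rollouts(self) -> int:
        return self.steps * self.rollouts_per_step()


cfg = GrpoConfig()
```

Under that assumption, `cfg.rollouts_per_step()` gives 256 sampled responses per step and `cfg.total_rollouts()` gives 16,640 over the 65 steps.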

6. Change Summary

Warm-started from the 235B SFT checkpoint (`a2ui_qwen3_235b_sft_0414_v1`), switched to the reward-v2 pipeline, and trained on a filtered subset of the 0409 GRPO data with a larger response budget.
