DPO-RM

community

Recent Activity

FlippyDora submitted a paper about 1 month ago

Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models

FlippyDora authored a paper 3 months ago

PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary

FlippyDora submitted a paper 3 months ago

PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary

models 52

DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step120-reward

2B • Updated May 5, 2025 • 1

DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step120-actor

2B • Updated May 5, 2025 • 1

DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step110-reward

2B • Updated May 5, 2025 • 1

DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step110-actor

2B • Updated May 5, 2025 • 1

DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step100-reward

2B • Updated May 5, 2025 • 1

DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step100-actor

2B • Updated May 5, 2025 • 1

DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step90-reward

2B • Updated May 5, 2025 • 3

DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step90-actor

2B • Updated May 5, 2025 • 5

DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step80-reward

2B • Updated May 5, 2025 • 1

DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step80-actor

2B • Updated May 5, 2025 • 1

datasets 0

None public yet