Offline Inverse Reinforcement Learning.
2024
Description
Investigated offline preference data selection: choosing the most informative preference pairs for reward-model tuning, using distribution entropy estimates obtained from Stein Variational Gradient Descent (SVGD).
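As background, the core SVGD step moves a set of particles toward a target density using only its score function. The sketch below is a minimal, generic SVGD update with an RBF kernel and the common median bandwidth heuristic; the target (a standard normal), particle count, and step size are illustrative assumptions, not details from the project.

```python
import numpy as np

def rbf_kernel(x, h=None):
    """RBF kernel matrix and its gradient w.r.t. the first argument."""
    diff = x[:, None, :] - x[None, :, :]          # pairwise differences
    sq = np.sum(diff ** 2, axis=-1)               # squared distances
    if h is None:
        # median heuristic for the bandwidth
        h = np.median(sq) / np.log(x.shape[0] + 1) + 1e-8
    k = np.exp(-sq / h)
    grad_k = -2.0 / h * diff * k[:, :, None]      # d k(x_j, x_i) / d x_j
    return k, grad_k

def svgd_step(x, grad_logp, step=0.1):
    """One SVGD update: attraction along the score plus kernel repulsion."""
    k, grad_k = rbf_kernel(x)
    phi = (k @ grad_logp(x) + grad_k.sum(axis=0)) / x.shape[0]
    return x + step * phi

# Toy usage: drive particles toward a standard normal (grad log p(x) = -x).
rng = np.random.default_rng(0)
particles = rng.normal(loc=5.0, scale=0.5, size=(100, 1))
for _ in range(500):
    particles = svgd_step(particles, lambda z: -z)
```

After the loop the particle cloud approximates the target, so quantities like differential entropy can be estimated from the particles (e.g. via a kernel density or nearest-neighbor estimator).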
The training loop has two parts. In the first, we used SVGD with the reparameterization trick to train an initial sampler distribution, producing a good spread of sampled points across our target distribution, a Bradley-Terry (BT) pairwise preference model. In the second, we used the entropy estimated from SVGD to compute an estimated optimal preference pair, represented by a vector, delta. We then queried our database of filtered offline preference pairs and selected M new preferences to incorporate into the BT model, quickly homing in on the subset of preferences that most accurately and tightly constrains the target model. This can be useful, for example, for selecting the most informative pairs of LLM responses from a very large candidate dataset.
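To make the selection step concrete, here is a hypothetical sketch of the BT probability and a delta-guided pair query. The scoring rule (rank candidate pairs by how closely their reward gap matches delta) and all names (`rewards`, `candidates`, `delta`, `m`) are illustrative assumptions standing in for the project's actual query logic.

```python
import numpy as np

def bt_prob(r_i, r_j):
    """Bradley-Terry probability that item i is preferred over item j."""
    return 1.0 / (1.0 + np.exp(-(r_i - r_j)))

def select_pairs(candidates, rewards, delta, m):
    """Pick the m candidate pairs whose reward gap is closest to delta.

    candidates: list of (i, j) item keys; rewards: dict of item -> score;
    delta: target reward gap derived from the entropy estimate (assumed
    scalar here for simplicity); m: number of pairs to select.
    """
    gaps = np.array([rewards[i] - rewards[j] for i, j in candidates])
    scores = np.abs(gaps - delta)       # lower = closer to the target gap
    order = np.argsort(scores)
    return [candidates[k] for k in order[:m]]

# Toy usage with three candidate response pairs.
rewards = {"a": 1.0, "b": 0.0, "c": 0.5}
candidates = [("a", "b"), ("a", "c"), ("c", "b")]
chosen = select_pairs(candidates, rewards, delta=0.5, m=2)
```

In this toy run the two pairs with gap 0.5 are chosen over the pair with gap 1.0, mirroring the idea of querying the offline database for the M pairs nearest the estimated optimum.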
This was also the impetus for creating my 2D Bradley-Terry probability visualizer.