18p
Now, I’ll explain the next step, called relabeling.
19p
As mentioned before, we update the reward function every K-th iteration, which introduces a non-stationarity problem. This means that even for the same state-action pair, the output of the reward function varies over time.
The previous study, Christiano et al., does not need to consider this issue because it uses an on-policy RL setting. However, on-policy RL generally has poor sample efficiency compared to the off-policy setting.
Unlike Christiano et al., PEBBLE keeps the off-policy setting for sample efficiency. To mitigate the non-stationarity issue, PEBBLE adds a relabeling step that relabels all experiences in the replay buffer so that past samples can be reused.
20p
This is the full algorithm of the relabeling process. Line 18 of the left figure is the corresponding part. We can see that relabeling is executed after every reward function update.
To wrap up, the relabeling process allows PEBBLE to maintain the sample-efficient off-policy setting. Specifically, the authors relabel past experiences in the replay buffer, ensuring that those experiences can be reused.
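To make the idea concrete, here is a minimal sketch of the relabeling step in Python. This is my own illustration, not the authors' code: the `ReplayBuffer` class and the `reward_model.predict(states, actions)` interface are assumptions.

```python
# Minimal sketch of the relabeling step (illustration only; the class layout
# and `reward_model.predict` interface are assumptions, not the authors' code).
import numpy as np


class ReplayBuffer:
    """Toy buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self):
        self.states, self.actions, self.rewards = [], [], []
        self.next_states, self.dones = [], []

    def add(self, s, a, r, s_next, done):
        self.states.append(s)
        self.actions.append(a)
        self.rewards.append(r)
        self.next_states.append(s_next)
        self.dones.append(done)

    def relabel(self, reward_model):
        """Overwrite every stored reward with the current learned reward.

        This makes stale off-policy samples consistent with the newest
        reward model, so they can still be reused after each reward update.
        """
        states = np.asarray(self.states)
        actions = np.asarray(self.actions)
        self.rewards = list(reward_model.predict(states, actions))
```

After every K-th reward-model update, one would call `buffer.relabel(reward_model)` once before resuming policy updates; the cost is linear in the buffer size.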
21p
The final step of PEBBLE is agent learning, which learns both the policy and the Q-function using the experiences in the replay buffer and the current reward function.
22p
The agent learning process is basically the same as the unsupervised pre-training process in that it uses the Soft Actor-Critic algorithm to learn the policy and Q-functions. The main difference is that we use the current learned reward function instead of the intrinsic reward. Since we previously relabeled the experiences in the replay buffer, we can use both new experiences and relabeled past experiences, which improves sample efficiency.
23p
This is the full algorithm of agent learning. For each agent learning step, we take an action from the policy given the current state and add the transition to the replay buffer. Then the policy and Q-functions are updated with the Soft Actor-Critic algorithm using the current reward function.
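Below is a minimal sketch of one agent-learning step, reusing the `ReplayBuffer` and `reward_model` from the previous sketch. The `sac_agent.select_action` and `sac_agent.update` methods stand in for some SAC implementation and are assumptions, not the authors' code.

```python
# Minimal sketch of one agent-learning step (illustration only; the SAC and
# reward-model interfaces are assumed, and the old 4-tuple Gym step API is used).
def agent_learning_step(env, state, sac_agent, reward_model, buffer, batch_size=256):
    # 1. Act with the current policy and store the transition, replacing the
    #    environment reward with the learned reward.
    action = sac_agent.select_action(state)
    next_state, _, done, _ = env.step(action)  # the true reward is discarded
    r_hat = float(reward_model.predict(state[None], action[None])[0])
    buffer.add(state, action, r_hat, next_state, done)

    # 2. Update the policy and Q-functions with SAC on minibatches that mix
    #    fresh transitions and relabeled past transitions.
    sac_agent.update(buffer, batch_size)
    return next_state, done
```

The key point the sketch is meant to show is that the learned reward is written into the buffer at collection time, and relabeling keeps those stored rewards consistent whenever the reward model changes.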
24p
This was the pipeline of PEBBLE.
To evaluate the method, the authors set the baseline as an integration of Christiano et al., the PPO algorithm, and ensemble-based sampling. You can think of this as a basic HiL RL method built on PPO in the on-policy setting.
The results in this slide compare PEBBLE with the baseline on locomotion tasks such as Quadruped, Cheetah, and Walker. We can find three meaningful patterns in the results.
First, PEBBLE (green) reaches the upper-bound performance of SAC with the oracle reward (pink). Second, Preference PPO (purple) is unable to reach the upper-bound performance of PPO with the oracle reward (black). Lastly, Preference PPO with unsupervised pre-training (red) is better than the original Preference PPO and comparable to Preference PPO with more feedback (blue).
Based on these observations, the authors demonstrate that PEBBLE can solve tasks without the ground-truth reward function and is more feedback-efficient than Preference PPO. They also verify the efficacy of their unsupervised pre-training in terms of feedback efficiency and asymptotic performance.
25p
The results in this slide compare PEBBLE with the baseline on robotic manipulation tasks. We can observe that PEBBLE (bluish) reaches the upper-bound performance of SAC with the oracle reward (reddish pink) and outperforms Preference PPO (purplish pink).
Based on these observations, the authors demonstrate that PEBBLE can be applied to various robotic manipulation tasks and reaches upper-bound performance with only a human preference-based reward.
26p
The authors also present ablation studies for each technique, including relabeling, unsupervised pre-training, sampling methods, and segment length.
We can see that relabeling and unsupervised pre-training increase both sample efficiency and asymptotic performance. Also, uncertainty-based sampling schemes such as ensemble-based and entropy-based sampling were superior to naive uniform sampling. Furthermore, feedback on longer segments provides a more meaningful signal than step-wise feedback.
These results support that relabeling, unsupervised pre-training, uncertainty-based sampling, and the segment-wise preference setting are all meaningful components of the PEBBLE pipeline.
27p
Since human preference allows flexible forms of guidance, the authors show that PEBBLE can be used to guide an agent toward novel behaviors such as (1) a cart agent swinging a pole, (2) a quadruped agent waving a front leg, or (3) a hopper performing a backflip.
This can be seen as an additional advantage of PEBBLE, which is based on the HiL RL setting.
28p
Finally, the authors show that HiL RL can also avoid the reward exploitation that occurs with hand-engineered rewards. Figure (a) shows that the agent trained with PEBBLE walks naturally using both legs, while Figure (b) shows that the agent trained with a hand-engineered reward walks with only one leg, which is an unintended behavior.
29p
To wrap up, PEBBLE is a sample- and feedback-efficient HiL RL method.
The paper shows that PEBBLE can be applied to various locomotion tasks and complex robotic manipulation tasks. Also, leveraging the advantage of HiL RL, PEBBLE can learn novel behaviors and avoid reward exploitation, which leads to more desirable behaviors.
30p
This is the end of the presentation of PEBBLE, and I’ll briefly answer the previously asked questions. We selected 8 questions that fit within the paper’s scope but are not explicitly addressed in it.
The first question is about the computational cost of the relabeling process. We can say it is proportional to the number of elements in the replay buffer, since every stored transition has to be relabeled.
The second question is about the computational requirements of PEBBLE compared to traditional preference-based RL methods. The unsupervised pre-training and relabeling steps are the additional costs. However, these processes reduce the required feedback, so we guess the total computational requirements are much lower than those of traditional methods.
31p
The third question is about the insufficient exploitation problem noted by LSD. Since PEBBLE was released before LSD, we guess the authors did not mention or resolve this issue.
The fourth question is about replacing the unsupervised pre-training process with LSD, which resolved the limitation of APT. Our guess is that this is possible, since the unsupervised pre-training part is a task-agnostic initialization of the policy.
32p
The fifth and sixth questions are about the relationship between the amount of feedback and performance. Within the range tested in PEBBLE, the results show that more feedback leads to better performance, which matches our intuition.
33p
The seventh question is about the difference between providing human preferences and conducting imitation learning.
The focus is slightly different: providing human preferences is for learning a “reward”, while providing demonstrations in imitation learning is for learning a “policy”.
The eighth question is about the problem of an inaccurate reward function in the early stage of learning. Our guess is that PEBBLE is relatively free from this problem compared to the original SAC, since the authors leverage several strategies such as (1) encouraging diverse exploration, (2) Soft Actor-Critic, (3) iteratively relabeling with the reward function, and (4) uncertainty-based sampling.