Paper Review

Recent Advances in Imitation Learning from Observation

자월현 2020. 11. 8.

Papers on imitation learning using state-only demonstrations, from before the rise of deep learning:
1. Movement imitation with nonlinear dynamical systems in humanoid robots.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.7189&rep=rep1&type=pdf
2. Humanoid robot learning and game playing using PC-based vision.

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.438.323&rep=rep1&type=pdf



**visual observations can only provide partial state information**
**In imitation learning, agents do not receive task reward feedback r.**

Behavior cloning requires no further interaction between the agent and the environment, but it suffers from the covariate shift problem.
IRL-based techniques iteratively alternate between using the demonstration to infer a hidden reward function and using RL to learn a policy that optimizes it.
- object manipulation: Guided cost learning: Deep inverse optimal control via policy optimization.

https://arxiv.org/abs/1603.00448
GAIL: induce an imitator state-action occupancy measure that is similar to that of the demonstrator.
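
The adversarial idea is easiest to see in code. Below is a minimal PyTorch sketch of the GAIL setup: a discriminator is trained to tell expert (s, a) pairs from imitator pairs, and its output becomes a surrogate reward for the policy's RL update. The network sizes, logistic loss form, and reward shaping here are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, s, a):
        # Logit that the (s, a) pair came from the expert.
        return self.net(torch.cat([s, a], dim=-1))

def discriminator_loss(disc, expert_s, expert_a, agent_s, agent_a):
    # Standard binary classification: expert pairs labeled 1, imitator pairs 0.
    bce = nn.BCEWithLogitsLoss()
    e, g = disc(expert_s, expert_a), disc(agent_s, agent_a)
    return bce(e, torch.ones_like(e)) + bce(g, torch.zeros_like(g))

def imitation_reward(disc, s, a):
    # Surrogate reward for the RL step: large when the imitator's (s, a)
    # occupancy looks like the demonstrator's to the discriminator.
    with torch.no_grad():
        return -torch.log(1.0 - torch.sigmoid(disc(s, a)) + 1e-8)
```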

Imitation learning from observation
===

## Perception
1. Record the expert's movements using sensors placed directly on the expert agent
- Trajectory formation for imitation with nonlinear dynamical systems.

https://ieeexplore.ieee.org/document/976259
arm-reaching movements, biped locomotion, and human gestures
- Incremental learning of gestures by imitation in a humanoid robot.

https://ieeexplore.ieee.org/document/6251697

2. Motion capture: use visual markers on the demonstrator to infer movement.
- Motion capture in robotics review.

https://ro.uow.edu.au/cgi/viewcontent.cgi?referer=https://www.google.com/&httpsredir=1&article=1645&context=engpapers
locomotion, acrobatics, martial arts
- require costly instrumentation and pre-processing:

A deep learning framework for character motion synthesis and editing. ★

http://www.ipab.inf.ed.ac.uk/cgvu/motionsynthesis.pdf

### Embodiment Mismatch

1. Learn a correspondence between the embodiments using autoencoders in a supervised fashion

(the encoded representation is invariant to embodiment-specific features; a sketch follows this list)

- Learning invariant features spaces to transfer skills with reinforcement learning ★

https://arxiv.org/abs/1703.02949

2. Unsupervised fashion, plus a small amount of human supervision

- Time-contrastive networks: self-supervised learning from video.

https://arxiv.org/abs/1704.06888
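
A minimal sketch of the supervised invariant-feature idea from item 1 above: each embodiment gets its own autoencoder, and time-aligned state pairs are pulled toward the same latent code, so the shared representation discards embodiment-specific features. The architecture, dimensions, and exact losses are assumptions for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class EmbodimentAutoencoder(nn.Module):
    def __init__(self, state_dim, latent_dim=16, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, state_dim))

def invariant_feature_loss(ae_expert, ae_agent, s_expert, s_agent):
    # s_expert / s_agent are corresponding (time-aligned) states from the
    # two embodiments, which is where the supervision comes from.
    z_e, z_a = ae_expert.encoder(s_expert), ae_agent.encoder(s_agent)
    recon = (F.mse_loss(ae_expert.decoder(z_e), s_expert) +
             F.mse_loss(ae_agent.decoder(z_a), s_agent))
    # Correspondence term: paired states should share a latent code, making
    # the encoded representation invariant to embodiment-specific features.
    return recon + F.mse_loss(z_e, z_a)
```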

 

### Viewpoint difference

1. Context translation model: translate an observation by predicting how it would appear in the target context (a sketch follows this list)

- Imitation from observation: Learning to imitate behaviors from raw video via context translation.

https://arxiv.org/abs/1707.03374

2. Train a classifier to distinguish viewpoints and maximize domain confusion in an adversarial setting during training

- Third-person imitation learning.

https://arxiv.org/abs/1703.01703
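
A rough sketch of item 1 above (context translation), with an assumed tiny encoder-decoder: the model takes an expert frame from the source viewpoint plus a context frame from the target viewpoint and predicts how the expert frame would look in the target context. All layer choices are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ContextTranslator(nn.Module):
    def __init__(self, channels=3, feat=64):
        super().__init__()
        self.enc_source = nn.Sequential(nn.Conv2d(channels, feat, 4, 2, 1), nn.ReLU())
        self.enc_target = nn.Sequential(nn.Conv2d(channels, feat, 4, 2, 1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(feat * 2, channels, 4, 2, 1), nn.Sigmoid())

    def forward(self, source_frame, target_context_frame):
        # Fuse source content with the target context and decode the frame
        # as it would appear from the target viewpoint.
        z = torch.cat([self.enc_source(source_frame),
                       self.enc_target(target_context_frame)], dim=1)
        return self.dec(z)
```

And a sketch of item 2 (third-person imitation): a gradient-reversal layer trains the shared features to maximize domain (viewpoint) confusion, while a separate head still classifies expert vs. imitator. Layer sizes and the fixed lambda are assumptions.

```python
class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing into the features.
        return -ctx.lam * grad_output, None

class ThirdPersonDiscriminator(nn.Module):
    def __init__(self, obs_dim, hidden=64, lam=1.0):
        super().__init__()
        self.lam = lam
        self.features = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.expert_head = nn.Linear(hidden, 1)   # expert vs. imitator
        self.domain_head = nn.Linear(hidden, 1)   # which viewpoint / domain

    def forward(self, obs):
        z = self.features(obs)
        expert_logit = self.expert_head(z)
        # Reversed gradients push the features to confuse the domain head.
        domain_logit = self.domain_head(GradReverse.apply(z, self.lam))
        return expert_logit, domain_logit
```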

 

## Control

### Model-based algorithms

#### Inverse dynamics models

- Grounded action transformation for robot learning in simulation.

https://www.cs.utexas.edu/~pstone/Papers/bib2html/b2hd-AAAI17-Hanna.html

1. Explore and collect data (s, a, s'), then learn a pixel-level inverse dynamics model (o, o') -> a (a sketch follows this list)

- Combining self-supervised learning and imitation for vision-based rope manipulation.

https://arxiv.org/abs/1703.02018

2. Reinforced inverse dynamics modeling (uses a sparse reward function to optimize the model)

- Ridm: reinforced inverse dynamics modeling for learning from a single observed demonstration

https://arxiv.org/abs/1906.07372
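
A minimal sketch of the inverse dynamics model from item 1 above: a small CNN takes a pair of consecutive observations (o, o') stacked along the channel axis and predicts the action that caused the transition; it is trained on the agent's own exploration data (s, a, s'). The architecture and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    def __init__(self, obs_channels=3, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(obs_channels * 2, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten())
        # LazyLinear infers its input size from the first forward pass.
        self.head = nn.LazyLinear(action_dim)

    def forward(self, o, o_next):
        # Stack the two frames along the channel axis and regress the action
        # (MSE loss for continuous actions, cross-entropy for discrete ones).
        return self.head(self.encoder(torch.cat([o, o_next], dim=1)))
```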

**These methods assume each observation transition is reachable through the application of a single action.**

- Zero-shot visual imitation: relaxes this by executing multiple actions until the agent gets close enough to the next demonstrated frame.

https://arxiv.org/abs/1804.08606
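
A hedged sketch of the "act until close to the next demonstrated frame" loop. `policy`, `env` (old Gym-style step interface), and the pixel-distance stopping check are placeholder assumptions; the paper learns a goal recognizer rather than using a fixed threshold.

```python
import numpy as np

def close_enough(obs, goal, threshold=0.05):
    # Placeholder stopping criterion: mean squared pixel distance.
    return float(np.mean((obs - goal) ** 2)) < threshold

def follow_demonstration(policy, env, demo_frames, max_steps_per_frame=10):
    # Step toward each demonstrated frame in turn, taking as many actions
    # as needed (up to a budget) before moving on to the next frame.
    obs = env.reset()
    for goal in demo_frames[1:]:
        for _ in range(max_steps_per_frame):
            if close_enough(obs, goal):
                break
            action = policy(obs, goal)           # goal-conditioned policy
            obs, _, done, _ = env.step(action)
            if done:
                return obs
    return obs
```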

- Behavior cloning from observation: learn generalized imitation policies using multiple demonstrations.

https://arxiv.org/abs/1805.01954
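
A sketch of one BCO-style update, assuming a learned inverse dynamics model `idm(s, s') -> a` (like the one sketched above) and a policy network `pi(s) -> a`; the optimizer and the continuous-action loss are placeholder choices.

```python
import torch
import torch.nn.functional as F

def bco_update(pi, idm, expert_states, expert_next_states, optimizer):
    # 1. Infer the expert's missing actions with the inverse dynamics model.
    with torch.no_grad():
        inferred_actions = idm(expert_states, expert_next_states)
    # 2. Behavior-clone: regress the policy output onto the inferred actions.
    loss = F.mse_loss(pi(expert_states), inferred_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```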

- Hybrid reinforcement learning with expert state sequences. 

https://arxiv.org/abs/1903.04110

(Assumes that both the visual demonstration and reward information are accessible; minimizes a linear combination of a behavior cloning loss and an RL loss.)

 

#### Forward dynamics model

- Imitating latent policies from observation.

https://arxiv.org/abs/1805.07914

--> First, a latent policy is learned that estimates the probability of a latent (not real) action z given the current state. Because no real actions need to be executed, this can be learned offline. Learning the latent policy uses a latent forward dynamics model, which predicts the next state and a prior over z given s. Then, while interacting with the environment, an action-remapping network is learned that maps latent actions to real actions.
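
A compact sketch of the components just described, with assumed dimensions and a discrete set of latent actions: a latent policy pi(z|s) and a latent forward model f(s, z) -> s' are trained offline from state-only demonstrations, and an action-remapping network g(s, z) -> a is learned later with environment interaction.

```python
import torch
import torch.nn as nn

class LatentPolicyModel(nn.Module):
    def __init__(self, state_dim, n_latent_actions, action_dim, hidden=64):
        super().__init__()
        self.latent_policy = nn.Sequential(       # pi(z | s), trained offline
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_latent_actions))
        self.forward_model = nn.Sequential(       # f(s, z) -> predicted next state
            nn.Linear(state_dim + n_latent_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))
        self.action_remap = nn.Sequential(        # g(s, z) -> real action, learned online
            nn.Linear(state_dim + n_latent_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def latent_action_logits(self, s):
        return self.latent_policy(s)

    def predict_next_state(self, s, z_onehot):
        return self.forward_model(torch.cat([s, z_onehot], dim=-1))

    def remap_action(self, s, z_onehot):
        return self.action_remap(torch.cat([s, z_onehot], dim=-1))
```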

 

### Model-free algorithms

#### Adversarial methods

- Learning human behaviors from motion capture by adversarial imitation. ★

https://arxiv.org/abs/1707.02201

 

 

#### Reward-engineering methods

- Internal model from observations for reward shaping.

https://arxiv.org/abs/1806.01267

 

 
