Recent Advances in Imitation Learning from Observation
Papers on imitation learning using state-only demonstrations, before the rise of deep learning:
1. Movement imitation with nonlinear dynamical systems in humanoid robots.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.7189&rep=rep1&type=pdf
2. Humanoid robot learning and game playing using pc-based vision.
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.438.323&rep=rep1&type=pdf
**Visual observations can only provide partial state information.**
**In imitation learning, agents do not receive task reward feedback r.**
Behavior cloning does not require any further interaction between the agent and the environment, but it suffers from the covariate shift problem: small action errors push the agent into states absent from the demonstrations, where its learned behavior is unreliable.
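A minimal behavior-cloning sketch in PyTorch (the network sizes and the `expert_states`/`expert_actions` tensors are placeholder assumptions, not from any particular paper): the policy is fit by pure supervised regression on demonstrated state-action pairs, which is exactly why its errors go uncorrected once the agent drifts off the demonstrated distribution.

```python
import torch
import torch.nn as nn

# Hypothetical demonstration data: N state-action pairs.
expert_states = torch.randn(1000, 8)   # (N, state_dim)
expert_actions = torch.randn(1000, 2)  # (N, action_dim)

policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(100):
    loss = nn.functional.mse_loss(policy(expert_states), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Note: no environment interaction happens above, so errors at states outside
# the demonstration distribution are never corrected (covariate shift).
```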
IRL-based techniques iteratively alternate between using the demonstration to infer a hidden reward function and using RL to improve the policy under that inferred reward.
- object manipulation: Guided cost learning: Deep inverse optimal control via policy optimization.
https://arxiv.org/abs/1603.00448
GAIL: induces an imitator state-action occupancy measure that is similar to that of the demonstrator.
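A minimal sketch of the GAIL-style adversarial objective in PyTorch (the MLP discriminator, batch format, and reward convention are illustrative assumptions, not the paper's code): a discriminator learns to separate demonstrator (s, a) pairs from imitator (s, a) pairs, and the imitator is rewarded for pairs the discriminator scores as expert-like, which pushes the two occupancy measures together.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2  # assumed sizes
disc = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                     nn.Linear(64, 1))  # logit: demonstrator (1) vs. imitator (0)
opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(expert_sa, agent_sa):
    # Both inputs: (B, state_dim + action_dim) concatenated (s, a) batches.
    loss = bce(disc(expert_sa), torch.ones(len(expert_sa), 1)) + \
           bce(disc(agent_sa), torch.zeros(len(agent_sa), 1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def imitation_reward(agent_sa):
    # One common convention: reward = log D(s, a), high when the discriminator
    # mistakes imitator pairs for demonstrator pairs; fed to an RL update.
    with torch.no_grad():
        return torch.log(torch.sigmoid(disc(agent_sa)) + 1e-8)
```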
Imitation learning from observation
===
## Perception
1. Record the expert's movements using sensors placed directly on the expert agent
- Trajectory formation for imitation with nonlinear dynamical systems.
https://ieeexplore.ieee.org/document/976259
arm-reaching movements, biped locomotion, and human gestures
- Incremental learning of gestures by imitation in a humanoid robot.
https://ieeexplore.ieee.org/document/6251697
2. Motion capture: use visual markers placed on the demonstrator to infer movement.
- Motion capture in robotics review.
https://ro.uow.edu.au/cgi/viewcontent.cgi?referer=https://www.google.com/&httpsredir=1&article=1645&context=engpapers
locomotion, acrobatics, martial arts
- These approaches require costly instrumentation and pre-processing:
A deep learning framework for character motion synthesis and editing. ★
http://www.ipab.inf.ed.ac.uk/cgvu/motionsynthesis.pdf
### Embodiment Mismatch
1. Learns a correspondence between the embodiments using autoencoders in a supervised fashion
(the encoded representation is invariant to embodiment features; see the first sketch after this list)
- Learning invariant features spaces to transfer skills with reinforcement learning ★
https://arxiv.org/abs/1703.02949
2. Learns the correspondence in an unsupervised fashion with only a small amount of human supervision (see the second sketch after this list)
- Time-contrastive networks: self-supervised learning from video.
https://arxiv.org/abs/1704.06888
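For item 1 above, a minimal sketch of the supervised correspondence idea (the state sizes, linear autoencoders, and time-aligned pairing are assumptions for illustration): each embodiment gets its own autoencoder, and an alignment loss on time-aligned states of the same skill pulls the two codes into a shared, embodiment-invariant space.

```python
import torch
import torch.nn as nn

s_dim_a, s_dim_b, z_dim = 10, 12, 16  # two embodiments' state sizes (assumed)
enc_a, enc_b = nn.Linear(s_dim_a, z_dim), nn.Linear(s_dim_b, z_dim)
dec_a, dec_b = nn.Linear(z_dim, s_dim_a), nn.Linear(z_dim, s_dim_b)

def correspondence_loss(x_a, x_b):
    # x_a, x_b: time-aligned states of the two embodiments performing the
    # same skill (this pairing is the supervision signal).
    z_a, z_b = enc_a(x_a), enc_b(x_b)
    recon = nn.functional.mse_loss(dec_a(z_a), x_a) + \
            nn.functional.mse_loss(dec_b(z_b), x_b)
    align = nn.functional.mse_loss(z_a, z_b)  # shared, embodiment-invariant code
    return recon + align
```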
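For item 2, the time-contrastive idea can be sketched as a triplet loss over video frames (the toy encoder and 64x64 frames are assumptions): frames that co-occur in time, e.g. two camera views of the same moment, are pulled together in embedding space, while temporally distant frames of the same video are pushed apart, which encourages viewpoint-invariant features.

```python
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # toy encoder

def tcn_triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor/positive: simultaneous frames (e.g. two viewpoints of one moment);
    # negative: a temporally distant frame from the same video.
    za, zp, zn = embed(anchor), embed(positive), embed(negative)
    d_pos = (za - zp).pow(2).sum(dim=1)
    d_neg = (za - zn).pow(2).sum(dim=1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

frames = torch.randn(3, 16, 3, 64, 64)  # anchor, positive, negative batches
loss = tcn_triplet_loss(frames[0], frames[1], frames[2])
```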
### Viewpoint Difference
1. Context translation model that translates an observation by predicting how it would appear in the target context; see the first sketch after this list.
- Imitation from observation: Learning to imitate behaviors from raw video via context translation.
https://arxiv.org/abs/1707.03374
2. Classifier trained to distinguish viewpoints while the feature encoder maximizes domain confusion in an adversarial setting during training; see the second sketch after this list.
- Third-person imitation learning.
https://arxiv.org/abs/1703.01703
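For item 1 above, a rough sketch of a context translation model (the flat encoder/decoder, 64x64 frames, and MSE objective are simplifying assumptions, not the paper's architecture): a source-context observation and the first frame of the target context are encoded together, and a decoder predicts how that observation would look in the target context, supervised on demonstrations recorded from both contexts.

```python
import torch
import torch.nn as nn

enc_obs = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # source obs
enc_ctx = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # target context
dec = nn.Linear(256, 3 * 64 * 64)

def translate(source_obs, target_ctx_frame):
    # Predict the source observation as it would appear in the target context.
    z = torch.cat([enc_obs(source_obs), enc_ctx(target_ctx_frame)], dim=1)
    return dec(z).view(-1, 3, 64, 64)

def translation_loss(source_obs, target_ctx_frame, target_obs):
    # Supervised on paired demonstrations seen from both contexts at training time.
    return nn.functional.mse_loss(translate(source_obs, target_ctx_frame),
                                  target_obs)
```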
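For item 2, one common way to implement adversarial domain confusion is a gradient-reversal layer (the module sizes below are placeholders, and this is a generic construction rather than the paper's exact one): the viewpoint classifier minimizes its loss as usual, while the reversed gradients make the feature encoder maximize it, leaving features from which the viewpoint cannot be recovered.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output  # flip the gradient sign going into the encoder

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU())
viewpoint_clf = nn.Linear(128, 2)  # which camera produced this frame?

def domain_confusion_loss(frames, viewpoint_labels):
    feats = encoder(frames)
    logits = viewpoint_clf(GradReverse.apply(feats))
    # The classifier minimizes this loss; through the reversed gradient the
    # encoder effectively maximizes it, i.e. learns viewpoint-confused features.
    return nn.functional.cross_entropy(logits, viewpoint_labels)
```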
## Control
### Model-based algorithms
#### Inverse dynamics models
- Grounded action transformation for robot learning in simulation.
https://www.cs.utexas.edu/~pstone/Papers/bib2html/b2hd-AAAI17-Hanna.html
1. Explore and collect (s, a, s') data, then learn a pixel-level inverse dynamics model (o, o') -> a; a sketch follows this list.
- Combining self-supervised learning and imitation for vision-based rope manipulation.
https://arxiv.org/abs/1703.02018
2. Reinforced inverse dynamics modeling (uses a sparse reward function to optimize the model)
- RIDM: reinforced inverse dynamics modeling for learning from a single observed demonstration.
https://arxiv.org/abs/1906.07372
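A minimal sketch of the pixel-level inverse dynamics model referenced in item 1 (discrete actions and 64x64 RGB frames are assumptions): the model is trained on the agent's own exploration data, where action labels are available, and can then be applied to consecutive demonstration frames to recover the expert's missing actions.

```python
import torch
import torch.nn as nn

n_actions = 4  # assumed discrete action space

# Inverse dynamics model (o, o') -> a, on channel-stacked 64x64 RGB frame pairs.
idm = nn.Sequential(
    nn.Conv2d(6, 16, kernel_size=8, stride=4), nn.ReLU(),  # 6 = two RGB frames
    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
    nn.Flatten(), nn.Linear(32 * 6 * 6, n_actions),
)
opt = torch.optim.Adam(idm.parameters(), lr=1e-3)

def train_step(obs, next_obs, actions):
    # (obs, actions, next_obs) come from the agent's own exploration, not from
    # the expert, so ground-truth action labels exist.
    logits = idm(torch.cat([obs, next_obs], dim=1))
    loss = nn.functional.cross_entropy(logits, actions)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def infer_expert_action(obs, next_obs):
    # Applied to consecutive demonstration frames to recover missing actions.
    with torch.no_grad():
        return idm(torch.cat([obs, next_obs], dim=1)).argmax(dim=1)
```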
**These methods assume that each observation transition is reachable through the application of a single action.**
- Zero-shot visual imitation: executes multiple actions until the agent gets close enough to the next demonstrated frame.
https://arxiv.org/abs/1804.08606
- Behavior cloning from observation: learns generalized imitation policies from multiple demonstrations.
https://arxiv.org/abs/1805.01954
- Hybrid reinforcement learning with expert state sequences.
https://arxiv.org/abs/1903.04110
(Assumes access to both the visual demonstrations and reward information; minimizes a linear combination of a behavior-cloning loss and an RL loss. A sketch follows.)
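A minimal sketch of the hybrid objective from the last item (the policy network, the weight `lam`, and the assumption that expert actions have already been inferred, e.g. by an inverse dynamics model, are all illustrative): one update minimizes a weighted sum of a behavior-cloning term on the expert state sequence and a vanilla policy-gradient term on the agent's own reward-labeled data.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def hybrid_update(expert_states, inferred_actions,
                  states, actions, advantages, lam=0.5):
    # BC term: match the actions inferred for the expert state sequence.
    bc = nn.functional.cross_entropy(policy(expert_states), inferred_actions)
    # RL term: vanilla policy gradient on the agent's reward-labeled rollouts.
    logp = nn.functional.log_softmax(policy(states), dim=1)
    pg = -(logp.gather(1, actions.unsqueeze(1)).squeeze(1) * advantages).mean()
    loss = lam * bc + (1.0 - lam) * pg  # linear combination of the two losses
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```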
#### Forward dynamics models
- Imitating latent policies from observation.
https://arxiv.org/abs/1805.07914
--> First, learn a latent policy that estimates the probability of a latent (not real) action z given the current state. Because no real actions are executed, this can be learned offline. Learning the latent policy relies on a latent forward dynamics model, which predicts the next state and a prior over z given s. Then, through environment interaction, learn an action-remapping network that maps latent actions to real actions. A sketch of stage 1 follows.
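A rough sketch of stage 1 under assumed vector states and a small discrete latent action set (the module shapes and loss details are illustrative, not the paper's exact formulation): a latent forward dynamics model tries to explain each observed transition with the best latent action, and the latent policy is trained to prefer latent actions that explain the transitions well.

```python
import torch
import torch.nn as nn

state_dim, n_latent = 8, 4  # assumed sizes

latent_policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                              nn.Linear(64, n_latent))        # logits for p(z | s)
latent_dynamics = nn.Linear(state_dim + n_latent, state_dim)  # (s, z) -> s'

def one_hot_z(batch, z):
    return nn.functional.one_hot(
        torch.full((batch,), z, dtype=torch.long), n_latent).float()

def offline_step(s, s_next):
    # Stage 1 (offline, state-only): predict s' from s under every latent action.
    preds = torch.stack(
        [latent_dynamics(torch.cat([s, one_hot_z(len(s), z)], dim=1))
         for z in range(n_latent)], dim=1)               # (B, n_latent, state_dim)
    err = (preds - s_next.unsqueeze(1)).pow(2).sum(-1)   # (B, n_latent)
    dyn_loss = err.min(dim=1).values.mean()              # best z explains s -> s'
    probs = torch.softmax(latent_policy(s), dim=1)
    pol_loss = (probs * err.detach()).sum(dim=1).mean()  # policy favors good z
    return dyn_loss + pol_loss

# Stage 2 (online): a small action-remapping network is then trained with a few
# environment interactions to map each latent action z to a real action.
```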
### Model-free algorithms
#### Adversarial methods
- Learning human behaviors from motion capture by adversarial imitation. ★
https://arxiv.org/abs/1707.02201
#### Reward-engineering methods
- Internal model from observations for reward shaping.
https://arxiv.org/abs/1806.01267
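A hedged sketch of the reward-shaping idea (the linear internal model and the exact reward form are assumptions for illustration, not the paper's formulation): fit a forward model to the expert's observation transitions, then shape the learner's reward with the model's prediction error, so transitions that the expert-trained model anticipates well score highly.

```python
import torch
import torch.nn as nn

state_dim = 8  # assumed
# Internal model fit on expert observation pairs (s, s'); no actions are needed.
internal_model = nn.Linear(state_dim, state_dim)
opt = torch.optim.Adam(internal_model.parameters(), lr=1e-3)

def fit_step(expert_s, expert_s_next):
    loss = nn.functional.mse_loss(internal_model(expert_s), expert_s_next)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def shaped_reward(s, s_next):
    # High reward when the agent's transition matches what the expert-trained
    # internal model predicts from s, i.e. when it looks expert-like.
    with torch.no_grad():
        return -(internal_model(s) - s_next).pow(2).sum(-1)
```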