CNNRNN
Because the image feature extraction part (CAE) and the time series learning part (RNN) are trained independently, CAE-RNN faces challenges in parameter adjustment and model training time. In addition, CAE extracts image features designed for dimensional compression of the image information, rather than features suited to robot motion generation. To address these issues, CNNRNN is introduced as a motion generation model that automatically extracts the image features essential for motion generation by training the image feature extraction part (CAE) and the time series learning part (RNN) simultaneously (end-to-end learning). This approach enables the robot to prioritize objects critical to the task and to generate motions that are more robust to background changes than CAE-RNN [1].
Files
The following programs and folders are used in CNNRNN:
- bin/train.py: This program is used to load data, train models, and save the trained models.
- bin/test.py: This program performs offline inference using test data (images and joint angles) and visualizes the results of the inference.
- bin/test_pca_cnnrnn.py: This program visualizes the internal state of the RNN using Principal Component Analysis.
- libs/fullBPTT.py: This is a backpropagation through time (BPTT) class used for time series learning.
- log: This folder is used to store the weights, learning curves, and parameter information.
- output: This folder is used to store the results of the inference.
Model
CNNRNN is a motion generation model that can learn and perform inference on multimodal time series data. It predicts the image `y_image` and joint angle `y_joint` at the next time step $t+1$ based on the image `xi`, joint angle `xv`, and the state `state` at time step $t$.
[SOURCE] CNNRNN.py
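For reference, the following is a minimal sketch of what a CNNRNN-style model can look like: a CNN encoder compresses the image into a low-dimensional feature vector, an LSTM cell updates the internal state from the image feature and the joint angles, and two decoders predict the image and joint angles at the next time step. The layer sizes and names are illustrative assumptions and do not necessarily match the actual CNNRNN.py; `rec_dim` and `feat_dim` follow the hyperparameters shown in the Training section below.

```python
import torch
import torch.nn as nn

class CNNRNNSketch(nn.Module):
    """Illustrative sketch of a CNNRNN-style model (not the actual CNNRNN.py)."""

    def __init__(self, rec_dim=50, feat_dim=10, joint_dim=8):
        super().__init__()
        # CNN encoder: (3, 128, 128) image -> feat_dim image features
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, 2, 1), nn.ReLU(),   # -> (16, 64, 64)
            nn.Conv2d(16, 32, 4, 2, 1), nn.ReLU(),  # -> (32, 32, 32)
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),  # -> (64, 16, 16)
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, feat_dim), nn.Tanh(),
        )
        # RNN: image feature + joint angles -> internal state
        self.rnn = nn.LSTMCell(feat_dim + joint_dim, rec_dim)
        # Decoders: internal state -> next joint angles / next image
        self.decoder_joint = nn.Sequential(nn.Linear(rec_dim, joint_dim), nn.Tanh())
        self.decoder_image = nn.Sequential(
            nn.Linear(rec_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, xi, xv, state=None):
        # xi: image at time t (B, 3, 128, 128), xv: joint angles at time t (B, 8)
        feat = self.encoder(xi)
        state = self.rnn(torch.cat([feat, xv], dim=-1), state)
        y_image = self.decoder_image(state[0])  # predicted image at t+1
        y_joint = self.decoder_joint(state[0])  # predicted joint angles at t+1
        return y_image, y_joint, state
```

Because the encoder, the RNN, and both decoders sit in one computation graph, the joint-angle prediction error also shapes the image features, which is the end-to-end property described above.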
Backpropagation Through Time
Backpropagation Through Time (BPTT) is used as the error backpropagation algorithm for time series learning in CNNRNN. A detailed explanation of BPTT has already been provided in SARNN; please refer to that section for more information.
[SOURCE] fullBPTT.py
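As a rough illustration of the idea (the actual fullBPTT.py class may be organized differently), a full-sequence BPTT update unrolls the model over the entire sequence, accumulates the per-step prediction errors for images and joint angles, and then performs a single backward pass through all time steps. The loss weights correspond to the `img_loss` and `joint_loss` parameters shown in the training log below; the model interface follows the sketch above.

```python
import torch.nn as nn

def bptt_train_step(model, optimizer, x_img, x_joint, img_loss=1.0, joint_loss=1.0):
    """One BPTT update for sequences x_img (B, T, C, H, W) and x_joint (B, T, D)."""
    criterion = nn.MSELoss()
    state, loss = None, 0.0
    T = x_img.shape[1]
    for t in range(T - 1):
        # predict step t+1 from step t, carrying the RNN state forward
        y_img, y_joint, state = model(x_img[:, t], x_joint[:, t], state)
        loss = loss + img_loss * criterion(y_img, x_img[:, t + 1]) \
                    + joint_loss * criterion(y_joint, x_joint[:, t + 1])
    loss = loss / (T - 1)
    optimizer.zero_grad()
    loss.backward()   # gradients flow back through every time step
    optimizer.step()
    return loss.item()
```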
Training
The main program, `train.py`, is used to train CNNRNN. When the program is executed, the trained weights (pth) and TensorBoard log files are saved in the `log` folder. For a detailed understanding of the program's functionality, please refer to the comments in the code.
$ cd eipl/zoo/cnnrnn/
$ python3 ./bin/train.py
[INFO] Set tag = 20230514_1958_07
================================
batch_size : 5
device : 0
epoch : 100000
feat_dim : 10
img_loss : 1.0
joint_loss : 1.0
log_dir : log/
lr : 0.001
model : CNNRNN
optimizer : adam
rec_dim : 50
stdev : 0.02
tag : 20230514_1958_07
vmax : 1.0
vmin : 0.0
================================
0%| | 83/100000 [05:07<99:16:42, 3.58s/it, train_loss=0.0213, test_loss=0.022
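The TensorBoard log files saved in the log folder can be inspected during or after training with the standard TensorBoard command, for example:

$ tensorboard --logdir ./log/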
Inference
To verify that CNNRNN has been trained correctly, you can use the test program `test.py`. The `filename` argument should be the path to the trained weights file, while `idx` is the index of the data you want to visualize. Additionally, `input_param` is the mixing coefficient for the inference (a rough sketch of how such a coefficient can be used appears after the command output below). More details can be found in the provided documentation.
$ cd eipl/zoo/cnnrnn/
$ python3 ./bin/test.py --filename ./log/20230514_1958_07/CNNRNN.pth --idx 4 --input_param 1.0
images shape:(187, 128, 128, 3), min=0, max=255
joints shape:(187, 8), min=-0.8595600128173828, max=1.8292399644851685
loop_ct:0, joint:[ 0.00226304 -0.7357931 -0.28175825 1.2895856 0.7252841 0.14539993
-0.0266939 0.00422328]
loop_ct:1, joint:[ 0.00307412 -0.73363686 -0.2815826 1.2874944 0.72176594 0.1542334
-0.02719587 0.00325996]
.
.
.
$ ls ./output/
CNNRNN_20230514_1958_07_4_1.0.gif
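As a sketch of what the mixing coefficient does (the actual test.py may implement this differently), the input at each step can be a blend of the real observation and the model's previous prediction, so that `input_param` 1.0 uses only the recorded sensor data, while smaller values feed the model's own predictions back in. The function and variable names below are illustrative and assume the model interface sketched earlier.

```python
import torch

def run_inference(model, images, joints, input_param=1.0):
    """images: (T, 3, 128, 128) tensor, joints: (T, 8) tensor, both normalized."""
    state, y_img, y_joint = None, None, None
    pred_images, pred_joints = [], []
    for t in range(len(images)):
        xi, xv = images[t:t + 1], joints[t:t + 1]     # keep a batch dimension
        if y_img is not None:
            # blend the current observation with the previous prediction
            xi = input_param * xi + (1.0 - input_param) * y_img
            xv = input_param * xv + (1.0 - input_param) * y_joint
        with torch.no_grad():
            y_img, y_joint, state = model(xi, xv, state)
        pred_images.append(y_img[0])
        pred_joints.append(y_joint[0])
    return torch.stack(pred_images), torch.stack(pred_joints)
```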
The figure below shows the results of inference at an untaught position. From left to right, the panels show the input image, the predicted image, and the predicted joint angles (dotted lines represent the true values). CNNRNN predicts the next time step based on the extracted image features and the robot joint angles, so the image features are expected to include information such as the color and position of the grasped object, and it is critical that the predicted image and the predicted joint angles are properly aligned. However, the experimental results indicate that while the joint angles are predicted accurately, the predicted image contains only the robot hand. Consequently, generating flexible movements based on the object position is difficult because the image features carry information about the robot hand alone.
Principal Component Analysis
The figure below illustrates the visualization of the internal state of CNNRNN using Principal Component Analysis. Each dotted line represents the temporal evolution of CNNRNN's internal state, starting from the black circle. The color of each trajectory corresponds to the object position: blue, orange, and green represent teaching positions A, C, and E, while red and purple represent untaught positions B and D. The self-organization of attractors for each teaching position suggests that well-learned movements can be generated at those positions. However, the attractors at the untaught positions are pulled toward the attractors of the teaching positions, making it impossible to generate interpolated movements. This occurs because the image features fail to capture the positional information of the grasped object.
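The principle behind test_pca_cnnrnn.py can be sketched as follows (the script itself may differ in detail): the RNN hidden states collected from several test sequences are projected onto their first two principal components, and each sequence is drawn as one dotted trajectory.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_hidden_state_pca(hidden_states, labels):
    """hidden_states: list of (T, rec_dim) arrays, one per test sequence.
    labels: one label per sequence (e.g. the object position A-E)."""
    pca = PCA(n_components=2).fit(np.concatenate(hidden_states, axis=0))
    for h, label in zip(hidden_states, labels):
        z = pca.transform(h)                        # (T, 2) trajectory
        plt.plot(z[:, 0], z[:, 1], "--", label=str(label))
        plt.scatter(z[0, 0], z[0, 1], color="k")    # black circle: initial state
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.legend()
    plt.show()
```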
Model Improvement
In CAE-RNN, generalization performance was achieved by learning different object position information through data augmentation. In contrast, CNNRNN learns image and joint angle information simultaneously, making it difficult to apply data augmentation because there are no robot joint angles corresponding to the augmented (shifted) image positions. Three potential solutions for improving the position generalization performance of CNNRNN are proposed below.
- Pre-training: Only the CAE part of the CNNRNN is extracted and pre-trained. By learning only the image information with data augmentation, the CAE can extract a variety of object position information. End-to-end learning is then performed from the pre-trained weights to map images to joint angles. However, since the CAE must be pre-trained, the required training time is the same as for CAE-RNN, so the benefit of using CNNRNN is minimal.
- Layer Normalization: CAE-RNN used Batch Normalization [2] to make CAE training stable and fast. However, Batch Normalization has the problems that learning becomes unstable when the batch size is small, and it is difficult to apply to recurrent neural networks. Therefore, generalization performance is improved by using Layer Normalization [3], which trains stably on small batches and on time series data (a minimal sketch of this swap is shown after this list). The figure below visualizes the internal state of CNNRNNLN (CNNRNN with Layer Normalization) using principal component analysis. The self-organization (alignment) of attractors for each object position allows the robot to generate correct motion even at untaught positions.
- Spatial Attention: Since CAE-RNN and CNNRNN learn motion based on image features containing many kinds of information (position, color, shape, background, lighting conditions, etc.), robustness during motion generation has been a concern. To address this issue, robustness can be improved by incorporating a spatial attention mechanism that "explicitly" extracts the spatial coordinates of important positions (the target object and the arm) from images, so that the mapping between spatial coordinates and robot joint angles is learned (a sketch of such a coordinate extraction is shown after this list). For more information on the spatial attention mechanism, see this link.
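A minimal sketch of the normalization swap mentioned under Layer Normalization is shown below; the layer sizes are illustrative assumptions (a 128x128 input), not the actual CNNRNNLN definition. Batch Normalization computes statistics across the batch, which becomes unreliable for small batches and is awkward for recurrent models, whereas Layer Normalization normalizes each sample independently.

```python
import torch.nn as nn

# Convolution block normalized across the batch (as in the CAE of CAE-RNN)
conv_bn = nn.Sequential(
    nn.Conv2d(3, 16, 4, 2, 1),
    nn.BatchNorm2d(16),            # statistics depend on the other samples in the batch
    nn.ReLU(),
)

# The same block with per-sample Layer Normalization (shape assumes a 128x128 input)
conv_ln = nn.Sequential(
    nn.Conv2d(3, 16, 4, 2, 1),
    nn.LayerNorm([16, 64, 64]),    # each sample is normalized on its own
    nn.ReLU(),
)

# The recurrent part can be treated the same way, e.g. by applying
# nn.LayerNorm(rec_dim) to the hidden state at every time step.
```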
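For the Spatial Attention item, one common way to "explicitly" extract spatial coordinates from feature maps is a spatial softmax (soft-argmax), sketched below; this is only an illustration of the idea, not the attention module of the linked model.

```python
import torch
import torch.nn.functional as F

def spatial_softmax(feature_maps):
    """Turn each of K feature maps into one (x, y) attention coordinate.
    feature_maps: (B, K, H, W) -> coordinates (B, K, 2) in the range [-1, 1]."""
    B, K, H, W = feature_maps.shape
    probs = F.softmax(feature_maps.view(B, K, H * W), dim=-1).view(B, K, H, W)
    ys = torch.linspace(-1.0, 1.0, H, device=feature_maps.device)
    xs = torch.linspace(-1.0, 1.0, W, device=feature_maps.device)
    # expected position under the attention distribution of each map
    y = (probs.sum(dim=3) * ys).sum(dim=2)   # (B, K)
    x = (probs.sum(dim=2) * xs).sum(dim=2)   # (B, K)
    return torch.stack([x, y], dim=-1)
```

Feeding such coordinates, rather than an unconstrained feature vector, into the RNN ties the learned representation to the positions of the object and the arm.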
1. Hiroshi Ito, Kenjiro Yamamoto, Hiroki Mori, Shuki Goto, and Tetsuya Ogata. Visualization of focal cues for visuomotor coordination by gradient-based methods: a recurrent neural network shifts the attention depending on task requirements. In 2020 IEEE/SICE International Symposium on System Integration (SII), 188–194. IEEE, 2020.
2. Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 448–456. PMLR, 2015.
3. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.