
Representation Learning for Modeling and Control of Flexible Manipulators

Joey Greer
Stanford University

CS 331B Final Project
[email protected]

Abstract

Traditional rigid serial manipulators have a natural representation associated with them, namely the angles of all the joints in the robot arm. A complete framework built around this representation allows engineers to write down simple kinematic and dynamic equations for the robot. These equations enable robust control of rigid robot arms. In contrast, for flexible manipulators such as catheters, steerable needles, and active cannulae, no general modeling and control framework exists. One limitation is the lack of a natural representation for these robots. In this work, we investigate learning a representation for flexible manipulators that may be useful for their modeling and control.

1. Introduction

Denavit-Hartenberg (DH) parameters [7] are a natural representation of serial rigid robots that were first proposed nearly fifty years ago. A simple and complete framework for both modeling the kinematics and dynamics of the robot and controlling its motion, based on DH parameters [9], was first published nearly thirty years ago. This framework (and similar frameworks based on DH parameters) lies at the core of most industrial robot control systems today.

Flexible robotics represents an alternative to traditional rigid robot arm design. Flexible robots are characterized by compliant structures that can withstand large strains during normal operation (Fig. 1). In contrast to their rigid counterparts, which typically have desirable properties like compliance, safety, and dexterity imbued in them through control (a challenging and ongoing area of research), soft robots have these properties ingrained in their mechanical structures. Soft robots are an increasingly popular choice for use in cluttered and unstructured environments such as search and rescue operations [12, 16], manipulation [11, 3], and human-centered tasks [1, 18] because of their inherent safety and dexterity.

Figure 1. Two examples of continuum robots applied to minimally invasive surgery. (a) Concentric tube robot [17] (image courtesy of R. Alterovitz). (b) Tendon-based continuum manipulators [20] (image courtesy of N. Simaan).

However, unlike for their rigid counterparts, no general framework exists to develop kinematic and dynamic equations relating actuation inputs to relevant task-space variables (such as the position and orientation of the instrument being controlled). Past work on modeling and controlling continuum robots has taken a physics-based approach [15, 20]. Unfortunately, these approaches are specific to the design of the robot and are derived from continuum mechanics, which involves partial differential equations that are not appropriate for real-time control.



Figure 2. (a) Traditional rigid serial manipulators have a natural representation, which is the angle of each joint. Flexible manipulators do not have a natural representation. (b) In this work, we propose to learn a representation that is useful for modeling and control from images of flexible manipulators using a convolutional neural network (ConvNet). Images courtesy of Auris Surgical and Aalto University.

Due to the aforementioned disadvantages of physics-based approaches to modeling flexible manipulators, a data-driven approach to this task is desirable. An important starting point for data-driven modeling is developing a representation for these robots. Because of the inherent flexibility of soft robots, "joint angles" are not a compact representation, as infinitely many would be needed. In this work, we seek to develop a "good" representation of flexible robots. We believe that there are several requirements for such a representation to be considered "good":

1. The representation should be compact (small dimensionality).

2. The representation should be expressive enough to accurately encode the shape of the entire robot backbone.

3. The dynamics of the representation should evolve simply, for example linearly (see the sketch after this list).
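As an illustration of the third requirement, the sketch below fits a linear latent dynamics model z_{t+1} ≈ A z_t + B u_t to a recorded sequence of codes and commands by least squares. It is a minimal sketch under assumed data, not part of this project's pipeline; the array names and shapes are hypothetical.

```python
import numpy as np

# Hypothetical recorded data: z[t] is the learned code at time t (T x 25),
# u[t] is the command input applied at time t (T x m).
T, n, m = 1000, 25, 4
z = np.random.randn(T, n)          # stand-in for encoder outputs
u = np.random.randn(T, m)          # stand-in for actuation commands

# Fit z[t+1] ~ A z[t] + B u[t] by least squares: regress [z_t, u_t] onto z_{t+1}.
X = np.hstack([z[:-1], u[:-1]])    # (T-1) x (n + m)
Y = z[1:]                          # (T-1) x n
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
A, B = W[:n].T, W[n:].T            # A: n x n, B: n x m

# The one-step prediction error indicates how "linear" the representation's dynamics are.
pred = z[:-1] @ A.T + u[:-1] @ B.T
print("mean one-step error:", np.linalg.norm(pred - Y, axis=1).mean())
```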

2. Related Work

Recently, several groups have applied advances in deep learning to the task of learning dynamic models of electrical and mechanical systems using a variety of network architectures. For example, Punjani and Abbeel [14] applied a ReLU network architecture to the task of learning helicopter dynamics. In particular, they used a training set consisting of recordings of various helicopter state parameters, such as position, pose, and angular velocities, as well as the command inputs to the system, during a multitude of demonstrations. They used a lagged-input approach, whereby several previous states and command inputs were used to predict the future state of the helicopter. Using this dataset and network model, they were able to achieve state-of-the-art prediction accuracy on the helicopter dataset.
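The lagged-input idea is simple to state in code: each training example stacks the previous k states and commands into one feature vector and regresses the next state. The sketch below only builds the lagged design matrix; the lag k and the array names are illustrative, not taken from [14].

```python
import numpy as np

def lagged_dataset(states, commands, k=4):
    """Stack the previous k states and commands to predict the next state.

    states:   T x d_s array of recorded state vectors
    commands: T x d_u array of recorded command inputs
    Returns (X, Y) where X[i] = [s_{t-k+1..t}, u_{t-k+1..t}] and Y[i] = s_{t+1}.
    """
    T = states.shape[0]
    X, Y = [], []
    for t in range(k - 1, T - 1):
        x = np.concatenate([states[t - k + 1:t + 1].ravel(),
                            commands[t - k + 1:t + 1].ravel()])
        X.append(x)
        Y.append(states[t + 1])
    return np.asarray(X), np.asarray(Y)
```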

Similarly, Haarnoja et al. [6] developed a deep Kalman filter in which a convolutional neural network (CNN) is used to learn a latent representation z_t of an image o_t. z_t is then used as an observation of a handcrafted hidden state x_t in a standard Bayesian estimation framework. The convolutional network is trained jointly with the Kalman filter, rather than separately, and the latent representation z_t is an engineering decision chosen by the user. Wahlstrom et al. [19] used a similar approach, but unlike Haarnoja et al. [6], they used a convolutional network to encode a latent representation z_t of an image o_t that itself served as the filtered state in a Bayesian estimation framework. This framework is most similar to the approach taken in this paper; however, they applied it to the relatively simple task of learning the dynamics of interaction with a ball.

Previous work has also shown promise for applying neural network based control to continuum robots.


[Figure 3 block diagram: the encoder maps a 256 × 256 × 3 input through channel widths 32, 32, 32, 64, 128, 128 down to a 25-dimensional bottleneck; the decoder mirrors this back to 256 × 256 × 3. Each encoder block is Conv 4 × 4, stride 2, BatchNorm, Leaky ReLU; each decoder block is Conv 4 × 4, stride 1, up-sample, BatchNorm, Leaky ReLU.]

Figure 3. Autoencoder architecture used for learning the snake robot representation in this work. The encoder consists of 7 layers, each comprising a 4 × 4, stride-2 convolution, batch normalization, and Leaky ReLU (α = 0.2). The decoder consists of 7 layers, each comprising a 4 × 4 convolution, up-sampling, batch normalization, and a Leaky ReLU nonlinearity (α = 0.2). The encoder and decoder each contain approximately 500,000 parameters.

For example, Braganza et al. [2] applied a neural network to the task of controlling the length of an OctArm [12] robot. However, they used a simple two-layer neural network with 15 hidden states, and they were unable to take advantage of the richness of image data. Nevertheless, they achieved promising preliminary results for trajectory tracking.

3. Approach

A major challenge in applying data-driven techniques to robotics is the difficulty of obtaining labeled datasets from a robotic platform. For this reason, we believe it is imperative that an unsupervised approach be taken to representation learning for modeling and control of robots.

To create a dataset for training, we used the Virtual Robot Experimentation Platform (VREP) [5] to simulate a continuum robot. In particular, we created a world environment consisting of a floor and a 4 × 6 m rectangular wall to confine the snake robot. A camera was placed in a fixed location above the floor, centered over the walled-off region and looking straight down. Finally, the snake robot was commanded to move in random directions. Fig. 4 shows an example image rendered by VREP. The simulation rendered two hours of video at 20 frames per second, corresponding to a dataset of 144,000 input images. In addition, the position of the head of the robot over the entire simulation was recorded in a separate file.
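A minimal PyTorch Dataset for pairing rendered frames with the recorded head positions might look like the sketch below. The file names, directory layout, and position-file format are assumptions; the paper only states that frames and head positions were logged.

```python
import csv
from pathlib import Path

import torch
from torch.utils.data import Dataset
from torchvision.io import read_image

class SnakeFrames(Dataset):
    """Pairs rendered 256x256 frames with the logged (x, y) head position."""
    def __init__(self, frame_dir, position_csv):
        self.paths = sorted(Path(frame_dir).glob("*.png"))      # hypothetical layout
        with open(position_csv) as f:
            self.positions = [(float(r[0]), float(r[1])) for r in csv.reader(f)]
        assert len(self.paths) == len(self.positions)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        img = read_image(str(self.paths[i])).float() / 255.0    # 3 x 256 x 256 in [0, 1]
        pos = torch.tensor(self.positions[i])                   # head (x, y) in metres
        return img, pos
```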

Figure 4. Example image from the dataset generated with the Virtual Robot Experimentation Platform (VREP) [5]. A model of the ACM-R5 (HiBot, Ltd.) was used as the snake robot for simulation. The robot was confined to a 4 × 6 m rectangle, and a fixed overhead camera was used to render images.


Figure 5. (a) Two example input images from the validation set (left column) and their corresponding outputs from the autoencoder (right column). Two features of note: the background is reconstructed nearly perfectly, while the robot is not. (b) An out-of-distribution input image (the Stanford logo) and its corresponding output.


Figure 6. Nearest-neighbor visualization of the feature encoding learned in this work for three random samples from our validation set. The left-most column of each row shows the query sample, and the three columns to its right show the first, second, and third nearest neighbors in feature space based on L2 distance.

Multiple approaches have been proposed for unsupervised or self-supervised representation learning for motion. For example, Finn et al. [4] investigated various architectures for synthesizing future video frames and showed that the representation learned from this task was useful for robotic control. In addition, Xue et al. [21] developed an architecture that generates probability distributions over future frames so that multiple hypotheses can be sampled.

For this work, we decided to use an autoencoder architecture with an extremely limited bottleneck dimension (N = 25) to learn a representation of the continuum robot. This decision was made because the simulation output was rendered from a fixed viewpoint. Given the autoencoder's extremely limited bottleneck dimension, we hypothesized that a well-trained autoencoder would concentrate the entirety of its bottleneck on encoding the state (position and shape) of the robot, while the background of the simulation would be represented in the decoder's weights. In other words, the autoencoder bottleneck would be a representation of the robot shape and position alone. Our autoencoder architecture is shown in Fig. 3 and is based on the context encoder architecture of Pathak et al. [13]. It contains roughly one million weights and took ten hours to train. In a more general setting, e.g. with a varying background or viewpoint, we would elect to use a future-frame synthesis task such as the one used by Xue et al. [21].
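A PyTorch sketch of the architecture in Fig. 3 is given below. The layer pattern (4 × 4 convolutions, stride-2 downsampling, BatchNorm, LeakyReLU(0.2), 25-dimensional bottleneck) follows the figure; the exact channel widths are our reading of the block diagram, and the decoder's linear re-projection, output sigmoid, and the training hyperparameters are assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Six Conv 4x4 stride-2 + BatchNorm + LeakyReLU(0.2) blocks take a 3x256x256
    image down to 128x4x4; a seventh conv collapses that grid into a 25-d code."""
    def __init__(self, code_dim=25):
        super().__init__()
        widths = [3, 32, 32, 32, 64, 128, 128]        # channel widths as read from Fig. 3
        layers = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.LeakyReLU(0.2)]
        layers += [nn.Conv2d(widths[-1], code_dim, 4), nn.Flatten()]   # 4x4 -> 1x1
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)                             # (batch, 25)

class Decoder(nn.Module):
    """Mirror of the encoder: re-project the code to a 4x4 grid, then six blocks of
    Upsample + Conv 4x4 stride 1 + BatchNorm + LeakyReLU(0.2), plus an output conv."""
    def __init__(self, code_dim=25):
        super().__init__()
        self.fc = nn.Linear(code_dim, 128 * 4 * 4)     # assumed re-projection layer
        widths = [128, 128, 64, 32, 32, 32, 16]
        layers = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Upsample(scale_factor=2),
                       nn.Conv2d(c_in, c_out, 4, stride=1, padding='same'),
                       nn.BatchNorm2d(c_out),
                       nn.LeakyReLU(0.2)]
        layers += [nn.Conv2d(widths[-1], 3, 4, padding='same'), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 128, 4, 4))   # (batch, 3, 256, 256)

# Assumed training loop: pixel-wise MSE reconstruction loss with Adam.
def train(encoder, decoder, loader, epochs=20, lr=2e-4, device="cuda"):
    encoder.to(device); decoder.to(device)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        for imgs, _ in loader:                          # head positions are unused here
            imgs = imgs.to(device)
            recon = decoder(encoder(imgs))
            loss = nn.functional.mse_loss(recon, imgs)
            opt.zero_grad(); loss.backward(); opt.step()
```

With these widths the encoder has roughly 0.5 million parameters, which is at least consistent with the figure caption.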

Fig. 5 (a) shows two representative input images from the validation set and their corresponding outputs from the trained autoencoder. From the output images, it is clear that the background of the image (floor, wall, etc.) is reconstructed nearly perfectly, while the robot itself is slightly distorted. This is evidence that the autoencoder has allocated its bottleneck representation budget almost entirely to representing the robot and not the background, as desired. This is further confirmed by Fig. 5 (b), which shows the autoencoder output on a Stanford logo input. In this case, the output still has the same background as the other examples, even though it is not present in the Stanford logo input, and the network has encoded portions of the edge of the Stanford "S" as the robot. This confirms that the background representation resides in the weights of the decoder network and not in the bottleneck.
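This kind of probe is easy to reproduce: push an arbitrary out-of-distribution image through the trained encoder and decoder and inspect what the bottleneck forces the decoder to paint. The sketch reuses the hypothetical Encoder/Decoder classes above; the checkpoint format and file paths are placeholders.

```python
import torch
from torchvision.io import read_image
from torchvision.utils import save_image

encoder, decoder = Encoder(), Decoder()
state = torch.load("autoencoder.pt", map_location="cpu")       # hypothetical checkpoint
encoder.load_state_dict(state["encoder"]); decoder.load_state_dict(state["decoder"])
encoder.eval(); decoder.eval()

# Any 256x256 image that is not a simulator frame, e.g. a logo.
probe = read_image("stanford_logo.png").float()[:3] / 255.0    # keep RGB channels only
with torch.no_grad():
    recon = decoder(encoder(probe.unsqueeze(0)))
save_image(recon, "probe_reconstruction.png")                  # the fixed background should reappear
```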


Figure 7. (a) Cartesian robot distance vs. feature-space distance for a fixed sample from the validation set. Feature-space distance and Cartesian distance are positively correlated, as expected, but the effect saturates. (b) t-SNE embedding [10] of a random subset of samples in the validation set. Color represents the Cartesian position of the sample.

4. Representation Experiments

In this section, we present several experiments that help us better understand the representation learned by the encoder.

Fig. 6 shows an experiment in which feature-space nearest neighbors were retrieved from a 2000-image subset of the validation set. Each row corresponds to a different query image, and the three columns to its right show the first, second, and third closest feature-space neighbors, moving from left to right. Qualitatively, the nearest neighbors match both the shape and the position of the robot in the query image, which suggests that our representation encodes both the position and the shape of the snake robot, as desired.
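Feature-space retrieval of this kind reduces to a pairwise L2 distance computation over the encoded validation set, as in the sketch below. The codes are assumed to have already been stacked into a numpy array; variable names are illustrative.

```python
import numpy as np

def nearest_neighbors(codes, query_idx, k=3):
    """Return the indices of the k closest codes to codes[query_idx] by L2 distance."""
    d = np.linalg.norm(codes - codes[query_idx], axis=1)   # distance to every sample
    d[query_idx] = np.inf                                   # exclude the query itself
    return np.argsort(d)[:k]

# codes: (N, 25) array of encoder outputs over a 2000-image validation subset.
codes = np.random.randn(2000, 25)             # stand-in for real encodings
print(nearest_neighbors(codes, query_idx=0))  # indices of the 3 nearest images
```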

To better understand how robot position is represented in the feature embedding, we performed several experiments. Fig. 7 (a) shows how robot Cartesian distance relates to feature-space distance. As expected, we find a positive correlation between the two quantities: robots that are close to one another in the workspace have feature encodings that are close to one another. However, this effect saturates past a certain distance.
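A plot like Fig. 7 (a) can be produced by fixing one reference sample and comparing its feature-space distance and Cartesian head-position distance to every other sample. This is a sketch under the assumption that codes and head positions are available as aligned arrays; the data below are stand-ins.

```python
import numpy as np
import matplotlib.pyplot as plt

# codes: (N, 25) encoder outputs; positions: (N, 2) head positions in metres.
codes = np.random.randn(2000, 25)             # stand-ins for real data
positions = np.random.rand(2000, 2) * [4, 6]  # points inside the 4 x 6 m room

ref = 0                                        # fixed reference sample
feat_d = np.linalg.norm(codes - codes[ref], axis=1)
cart_d = np.linalg.norm(positions - positions[ref], axis=1)

plt.scatter(cart_d, feat_d, s=4)
plt.xlabel("Cartesian distance (m)")
plt.ylabel("feature-space distance")
plt.savefig("distance_vs_feature.png")
```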


Figure 8. An experiment was performed to predict the robot's head position 0.15 seconds in the future (corresponding to three time-steps) using only the autoencoder-trained representation as input. An LSTM was trained with the encoded features as input. Note that the representation was fixed, i.e. backpropagation was not performed into the encoder.

Fig. 7 (b) shows a t-SNE embedding [10] of a random subset of feature-space encodings of validation-set images. Each sample's hue is determined by its Cartesian position via the simple mapping x + y, where x and y are the two-dimensional coordinates of the robot. It is clear that images with robots near one another map to feature-space embeddings that are near each other.
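An embedding like Fig. 7 (b) can be reproduced with scikit-learn's t-SNE, coloring each point by x + y as described above. Array names and t-SNE settings are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# codes: (N, 25) encoder outputs; positions: (N, 2) robot head positions.
codes = np.random.randn(1000, 25)              # stand-ins for real encodings
positions = np.random.rand(1000, 2) * [4, 6]

emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(codes)
hue = positions[:, 0] + positions[:, 1]        # color by x + y, as in Fig. 7 (b)

plt.scatter(emb[:, 0], emb[:, 1], c=hue, s=6, cmap="viridis")
plt.colorbar(label="x + y (m)")
plt.savefig("tsne_features.png")
```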

Finally, we performed an experiment to see how useful the learned representation is for modeling the motion of the simulated robot. In this experiment, we predicted the position of the robot's head 0.15 seconds in the future, which corresponds to three time-steps of the simulation. To do this, we trained a Long Short-Term Memory (LSTM) network [8] that used only the encoded features to predict the future position of the robot. The network consisted of a 512-hidden-unit LSTM, followed by three Dense-BatchNorm-Leaky ReLU layers of 256, 128, and 64 units, respectively. The representation was fixed and not refined during training (i.e. the weights of the encoder network were frozen). After training, we were able to achieve an average prediction error of 30 cm (Fig. 9). For reference, always predicting the center of the room would result in a prediction error of 208 cm, and predicting the same position as the current time-step would result in an average error of approximately 90 cm.

Figure 9. Results of the position prediction experiment. (a) True position over a time segment of validation data. (b) Raw LSTM predictions overlaid on the true position. (c) Low-pass filtered predictions overlaid on the true position. Average prediction error over the validation data was 30 cm in a 4 × 6 m room.
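A PyTorch sketch of this predictor is given below. The layer sizes (a 512-unit LSTM followed by 256/128/64 Dense-BatchNorm-LeakyReLU blocks) follow the description above; the input sequence length, the (x, y) output parameterization, and reading only the final LSTM output are assumptions.

```python
import torch
import torch.nn as nn

class HeadPositionPredictor(nn.Module):
    """LSTM over frozen 25-d autoencoder codes, predicting the (x, y) head
    position three time-steps (0.15 s) ahead."""
    def __init__(self, code_dim=25, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(code_dim, hidden, batch_first=True)
        layers = []
        for c_in, c_out in [(hidden, 256), (256, 128), (128, 64)]:
            layers += [nn.Linear(c_in, c_out), nn.BatchNorm1d(c_out), nn.LeakyReLU(0.2)]
        layers += [nn.Linear(64, 2)]            # predicted (x, y) in metres
        self.head = nn.Sequential(*layers)

    def forward(self, z_seq):                   # z_seq: (batch, T, 25), precomputed with
        out, _ = self.lstm(z_seq)               # the frozen encoder (no gradients into it)
        return self.head(out[:, -1])            # use the last LSTM output
```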

5. Conclusion

We investigated a data-driven approach to representing the shape and position of a flexible continuum robot using a convolutional autoencoder with an extremely limited bottleneck dimension. We showed that, because the input images had a fixed background, the autoencoder concentrated its bottleneck entirely on representing the position and shape of the robot. We also showed that this learned representation is useful for modeling the motion of the robot. Future work will include making the representation more general, both for non-fixed camera viewpoints and for varying backgrounds, using a more sophisticated unsupervised learning scheme such as future-frame prediction. In addition, we will investigate the use of these techniques on physical (non-simulated) continuum robot hardware platforms.

References

[1] A. T. Asbeck, R. J. Dyer, A. F. Larusson, and C. J. Walsh. Biologically-inspired soft exosuit. In IEEE International Conference on Rehabilitation Robotics, pages 1–8, 2013.

[2] D. Braganza, D. M. Dawson, I. D. Walker, and N. Nath. A neural network controller for continuum robots. IEEE Transactions on Robotics, 23(6):1270–1277, 2007.

[3] E. Brown, N. Rodenberg, J. Amend, A. Mozeika, E. Steltz, M. R. Zakin, H. Lipson, and H. M. Jaeger. Universal robotic gripper based on the jamming of granular material. Proceedings of the National Academy of Sciences, 107(44):18809–18814, 2010.

[4] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. arXiv preprint arXiv:1605.07157, 2016.

[5] M. Freese, S. Singh, F. Ozaki, and N. Matsuhira. Virtual robot experimentation platform V-REP: a versatile 3D robot simulator. In International Conference on Simulation, Modeling, and Programming for Autonomous Robots, pages 51–62. Springer, 2010.

[6] T. Haarnoja, A. Ajay, S. Levine, and P. Abbeel. Backprop KF: Learning discriminative deterministic state estimators. arXiv preprint arXiv:1605.07148, 2016.

[7] R. S. Hartenberg and J. Denavit. Kinematic synthesis of linkages. McGraw-Hill, 1964.



[8] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[9] O. Khatib. A unified approach for motion and force control of robot manipulators: The operational space formulation. IEEE Journal on Robotics and Automation, 3(1):43–53, 1987.

[10] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[11] A. D. Marchese, K. Komorowski, C. D. Onal, and D. Rus. Design and control of a soft and continuously deformable 2D robotic manipulation system. In IEEE International Conference on Robotics and Automation, pages 2189–2196, 2014.

[12] S. Neppalli, B. Jones, W. McMahan, V. Chitrakaran, I. Walker, M. Pritts, M. Csencsits, C. Rahn, and M. Grissom. OctArm: a soft robotic manipulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2569–2569, 2007.

[13] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. arXiv preprint arXiv:1604.07379, 2016.

[14] A. Punjani and P. Abbeel. Deep learning helicopter dynamics models. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 3223–3230. IEEE, 2015.


[15] D. C. Rucker, B. A. Jones, and R. J. Webster III. A geometrically exact model for externally loaded concentric-tube continuum robots. IEEE Transactions on Robotics, 26(5):769–780, 2010.

[16] M. T. Tolley, R. F. Shepherd, B. Mosadegh, K. C. Galloway, M. Wehner, M. Karpelson, R. J. Wood, and G. M. Whitesides. A resilient, untethered soft robot. Soft Robotics, 1(3):213–223, 2014.

[17] L. G. Torres and R. Alterovitz. Motion planning for concentric tube robots using mechanics-based models. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5153–5159. IEEE, 2011.

[18] N. G. Tsagarakis and D. G. Caldwell. Development and control of a soft-actuated exoskeleton for use in physiotherapy and training. Autonomous Robots, 15(1):21–33, 2003.

[19] N. Wahlstrom, T. B. Schon, and M. P. Deisenroth. Learning deep dynamical models from image pixels. IFAC-PapersOnLine, 48(28):1059–1064, 2015.

[20] R. J. Webster and B. A. Jones. Design and kinematic modeling of constant curvature continuum robots: A review. The International Journal of Robotics Research, 2010.

[21] T. Xue, J. Wu, K. L. Bouman, and W. T. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. arXiv preprint arXiv:1607.02586, 2016.