
Controlling Two-Wheels Self Balancing Robot Using Q-Learning
Hasan Unlu, [email protected], Project ID: 94

Problem
Implementing Q-Learning to control a two-wheeled self-balancing robot. The main goal is to replace the PID control loop with reinforcement learning.

PID Loop
The deterministic control loop is shown below. Based on the robot's vertical angle, the PID controller produces a PWM signal to drive the motors. In this design there is no linear position control; only the vertical angle is kept within the reference range.
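A minimal sketch of such a loop, assuming a 100 Hz update rate and hypothetical read_angle() and set_motor_pwm() helpers (the gains are illustrative, not the values used on this robot):

    # PID balance loop sketch; helpers and gains are assumptions for illustration.
    KP, KI, KD = 30.0, 0.5, 1.0      # illustrative gains
    DT = 0.01                        # assumed 100 Hz loop period
    REF_ANGLE = 0.0                  # upright reference, degrees

    integral, prev_error = 0.0, 0.0
    while True:
        angle = read_angle()                     # hypothetical: filtered vertical angle, degrees
        error = REF_ANGLE - angle
        integral += error * DT
        derivative = (error - prev_error) / DT
        prev_error = error
        pwm = KP * error + KI * integral + KD * derivative
        pwm = max(-100.0, min(100.0, pwm))       # clamp to the motor PWM range
        set_motor_pwm(pwm)                       # hypothetical: drive both motors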

Complementary Filter
It combines raw accelerometer and angular velocity data to estimate the angular position. The basic intuition is: when there is little or no movement, trust the accelerometer; otherwise, integrate the measured angular velocity and add it to the last angular position [1].
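A minimal sketch of this update; the blend factor ALPHA and sample period DT are assumptions, not values taken from the poster:

    import math

    ALPHA = 0.98   # weight on the gyro-integrated estimate (assumed)
    DT = 0.01      # sample period in seconds (assumed)

    def complementary_update(prev_angle, accel_y, accel_z, gyro_rate):
        """Return the new angular position estimate in degrees."""
        accel_angle = math.degrees(math.atan2(accel_y, accel_z))   # angle from the gravity vector
        gyro_angle = prev_angle + gyro_rate * DT                   # integrate the measured angular rate
        return ALPHA * gyro_angle + (1.0 - ALPHA) * accel_angle    # trust gyro for motion, accelerometer long term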

Binary PID Output
The first thing to validate before implementing Q-Learning is the action selection scheme: only backward (a=0) and forward (a=1) are used in Q-Learning. As a preliminary check, the regular PID output is converted into fixed-rate PWM motor commands. In this video, the binary PID output produces only -1 and 1 to control the robot, with 70% PWM applied per 1 or -1: https://www.youtube.com/watch?v=OYPt29tfekA
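A sketch of this mapping, with set_motor_pwm() again as a hypothetical motor-driver helper:

    PWM_MAGNITUDE = 70.0   # percent duty cycle per action, as stated above

    def binary_action(pid_output):
        """Map the continuous PID output to a = 1 (forward) or a = 0 (backward)."""
        return 1 if pid_output >= 0 else 0

    def apply_action(a):
        # Drive at a fixed 70% PWM in the direction given by the action.
        set_motor_pwm(PWM_MAGNITUDE if a == 1 else -PWM_MAGNITUDE)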

Simulation Platform
CartPole-v1 from gym.openai.com is used to simulate the system [2].
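A minimal setup sketch using the classic gym API (newer gym/gymnasium releases return additional values from reset() and step()):

    import gym

    env = gym.make("CartPole-v1")
    obs = env.reset()                         # [cart position, cart velocity, pole angle, pole angular velocity]
    obs, reward, done, info = env.step(1)     # action 1 pushes the cart right; action 0 pushes it left
    env.close()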

Robotic Platform
• Two-wheeler robot chassis from OSEEP.
• MPU-6050 InvenSense 6-axis accelerometer and gyroscope.
• Raspberry Pi 3 Model B+ compute unit.

Simulation Performance
The proposed setup using angular position and angular velocity gave the following learning curve.

Machine Learning Architectures
• Raw measurements only (angular velocity and unfiltered accelerometer).
• Calculated angular position from the complementary filter and angular velocity.

Q-Learning

π(s) = a′ = argmax_{a′} Q(s, a′)

Q(s, a) ← Q(s, a) + α [ R(s) + γ · Q(s′, a′) − Q(s, a) ]

Here α is the learning rate and γ is the discount factor. a′ is the best next action to choose, and s′ is the state reached after action a is taken [3].
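A minimal tabular sketch of these two equations; the ALPHA and GAMMA values are placeholders, not necessarily the settings used in the project:

    import numpy as np

    N_STATES, N_ACTIONS = 257, 2     # 256 rewarded states plus one terminal state, two actions
    ALPHA, GAMMA = 0.1, 0.99         # assumed learning rate and discount factor
    Q = np.zeros((N_STATES, N_ACTIONS))

    def greedy_action(s):
        """pi(s) = argmax over a' of Q(s, a')."""
        return int(np.argmax(Q[s]))

    def q_update(s, a, reward, s_next):
        """Q(s,a) <- Q(s,a) + alpha * (R(s) + gamma * Q(s', a') - Q(s,a))."""
        a_next = greedy_action(s_next)
        Q[s, a] += ALPHA * (reward + GAMMA * Q[s_next, a_next] - Q[s, a])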

Only the angular speed (θ̇) and angular position (θ) of the robot are used to generate the discretized state. 257 total states are used: states 0 to 255 are rewarded states and state 256 is the terminal (exit) state.

[8-bit state definition] = [4-bit for θ | 4-bit for θ̇]

The operating θ range is -24 to 24 degrees. The operating θ̇ range is -200 to 200 degrees/second.
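A sketch of one way to produce this encoding from the ranges above; the exact binning used in the project is not given on the poster, so the math below is an assumption:

    THETA_MIN, THETA_MAX = -24.0, 24.0             # degrees
    THETA_DOT_MIN, THETA_DOT_MAX = -200.0, 200.0   # degrees/second
    TERMINAL_STATE = 256

    def discretize(theta, theta_dot):
        """Return a state index in 0..255, or 256 when outside the operating range."""
        if not (THETA_MIN <= theta <= THETA_MAX and THETA_DOT_MIN <= theta_dot <= THETA_DOT_MAX):
            return TERMINAL_STATE
        theta_bin = min(15, int((theta - THETA_MIN) / (THETA_MAX - THETA_MIN) * 16))
        theta_dot_bin = min(15, int((theta_dot - THETA_DOT_MIN) / (THETA_DOT_MAX - THETA_DOT_MIN) * 16))
        return (theta_bin << 4) | theta_dot_bin    # [4-bit theta | 4-bit theta_dot]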

[Hardware block diagram: Raspberry Pi with its own power supply, MPU6050 IMU, and L298D motor driver with separate motor power.]

References
[1] Aircraft Stability and Control, 16.333, Lecture #15, MIT.
[2] https://gym.openai.com
[3] Watkins, C. J. C. H., & Dayan, P. (1992). Q-Learning. Machine Learning, 8(3), 279-292.

Robotic Platform Demo
The same algorithm tested in simulation is implemented on the robotic platform. The platform runs 52 episodes and then stops updating the Q table. Each episode starts when the robot is positioned within the operating θ range and stops when it reaches the terminal state. At first the robot simply keeps going backward and forward; it then quickly tries different sequences of back-to-back backward and forward actions.
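A sketch of that on-robot training loop, reusing the discretize(), greedy_action(), and q_update() sketches above; read_sensors(), apply_action(), the reward of 1 per step, and the epsilon-greedy exploration are assumptions, not details given on the poster:

    import random

    N_EPISODES = 52
    EPSILON = 0.1     # assumed exploration rate

    for episode in range(N_EPISODES):
        theta, theta_dot = read_sensors()          # hypothetical: filtered angle and angular rate
        s = discretize(theta, theta_dot)
        while s != TERMINAL_STATE:
            a = random.randrange(2) if random.random() < EPSILON else greedy_action(s)
            apply_action(a)                        # fixed 70% PWM backward or forward
            theta, theta_dot = read_sensors()
            s_next = discretize(theta, theta_dot)
            q_update(s, a, 1.0, s_next)            # assumed reward of 1 per step in range
            s = s_next
    # After the 52 episodes the Q table is frozen and actions come from greedy_action(s) only.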