OBSTACLE AVOIDING ROBOT USING AI/REINFORCEMENT LEARNING
Problem Statement: The main objective is for the robot to learn to avoid obstacles within “N” episodes and converge on the optimal action. In this case, let's assume we need our robot to learn the optimal action 'Left' (the action the reward scheme below favours).
Reinforcement Algorithm Used: Q learning
How L298N Drives Two DC Motors:
How HC-SR04 Sensor calculates distance:
Important Terms in Reinforcement Learning:
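As a quick sketch of the distance math behind the HC-SR04: the sensor reports the round-trip echo time in microseconds, and sound travels roughly 29.1 µs per centimetre, so halving the duration and dividing by 29.1 gives centimetres. The 29.1 constant and the 15 cm obstacle threshold are the ones the main code below uses; the function names here are illustrative:

```cpp
#include <cassert>

// Convert an HC-SR04 echo pulse width (microseconds) to distance in cm.
// Sound travels ~29.1 us per cm, and the pulse covers the distance twice.
float echoToDistanceCm(long durationMicros)
{
    return (durationMicros / 2.0f) / 29.1f;
}

// The sketch below flags an obstacle when the distance drops under 15 cm.
bool isObstacle(long durationMicros)
{
    return echoToDistanceCm(durationMicros) < 15.0f;
}
```

For example, a 582 µs echo corresponds to about 10 cm, which counts as an obstacle; a 1164 µs echo is about 20 cm and does not.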
1. STATE: This is the situation the robot is in. For a basic obstacle-avoiding robot there are really only 2 states: the 1st state is when there is no obstacle close to it, and the 2nd state is when there is an obstacle in front of it. (When I wrote the code I assumed 10 different states, each expecting the same action, purely to illustrate a more complex environment.)
2. ACTION: In a particular state the robot performs a particular action. There are 4 actions the robot can perform in the 2nd state: “FORWARD”, “BACKWARD”, “LEFT”, “STOP”. In the 1st state the robot could in principle perform all 4 actions, but to keep things simple I have assumed it performs only one action, “FORWARD” (it makes little sense to consider actions such as LEFT or BACKWARD when there are no obstacles nearby).
3. NEXT STATE: This is the state the robot ends up in after performing a particular “ACTION” in its current “STATE”. For the obstacle-avoiding robot, the NEXT STATE can be either a “CRASHED” state or a “SURVIVED” state. (Here the SURVIVED state is the same as the starting state the robot is in when its episode begins.)
/*AFTER PERFORMING AN ACTION THE ROBOT GOES INTO THE NEXT STATE, IN THIS CASE OF THE OBSTACLE AVOIDING ROBOT*/
int STATE = 0;
int NEXT_STATE;
NEXT_STATE = STATE + 1;
4. Q TABLE / Q MATRIX: This table is formed by the number of “STATES” (rows) and the number of “ACTIONS” (columns). In the obstacle-avoidance robot's case, this table is given by:
Here N_STATES = 10 and N_ACTIONS = 4. A "0.0" entry indicates that any of the 4 possible actions may be performed from that state. If, however, you want to eliminate a particular action in a state, just replace its "0.0" with "-1.0" in the matrix; "-1.0" indicates that the action cannot be performed in that state. Here it is assumed that we have 10 different states, with each state expecting the same action. If you want your robot to learn actions that differ per state, change the rewards in the reward matrix in the code.
5. TERMINAL STATE: This is the last state the robot can be in. For the obstacle-avoiding robot this state doesn't exist: we don't have any terminal state, and we want to keep our robot learning forever.
6. REWARD MATRIX: This table or matrix is used to give rewards to the robot for certain actions. The reward is positive or negative depending upon the quality of the action.
7. ENVIRONMENT: This can be thought of as the world the robot lives in. For example, we humans live on Earth, so Earth is our environment.
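As a concrete illustration of the "-1.0 disables an action" convention from the Q-table section above, here is a small host-side sketch of a greedy action picker that skips masked entries. The names are mine, not from the project code, and the project's own ARGMAX does not implement this mask; -1.0 is treated purely as a sentinel here:

```cpp
#include <cassert>

const int N_ACTIONS = 4;

// Greedy action selection that treats -1.0 entries as forbidden actions.
int argmaxAllowed(const float qRow[N_ACTIONS])
{
    int best = -1;
    for (int a = 0; a < N_ACTIONS; a++)
    {
        if (qRow[a] == -1.0f) continue;          // action disabled in this state
        if (best == -1 || qRow[a] > qRow[best])  // first allowed, or better
        {
            best = a;
        }
    }
    return best; // -1 if every action in the state is masked
}
```

So a row {0.0, -1.0, 0.5, 0.2} picks action 2 and never action 1, however the Q values evolve.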
Hyperparameters in Reinforcement Learning:
1. LEARNING RATE (ALPHA): The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully deterministic environments, a learning rate of ALPHA = 1.0 is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as ALPHA = 0.1 for all scenarios.
float ALPHA = 0.2;
2. DISCOUNT FACTOR (GAMMA): The discount factor determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For GAMMA = 1.0, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite. Even with a discount factor only slightly lower than 1, Q-function learning leads to the propagation of errors and instabilities when the value function is approximated with an artificial neural network. In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning.
float GAMMA = 0.9;
3. EXPLORATION RATE (EPSILON): This parameter decides to what extent the robot should explore the environment. Exploring the environment means performing random actions and analyzing the results through Q values. Usually in Q-learning (unlike SARSA) we get rid of this parameter eventually, as the robot goes on learning more and more. But in this project we are not going to get rid of Epsilon, as we don't have any terminal state. Epsilon instead decays gradually and is reset once it drops below a threshold value. This makes sure the robot keeps exploring for its whole lifetime, just like we humans do.
float EPSILON = 0.75;
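The decay-and-reset schedule described under EXPLORATION RATE can be sketched off-board like this. The 0.98 decay factor, 0.5 threshold and 0.9 reset value are taken from the project code; the function name is mine:

```cpp
#include <cassert>

// One training-episode step of the exploration schedule used in this project:
// multiply EPSILON by 0.98, and reset it to 0.90 once it falls below 0.5,
// so the robot never stops exploring entirely.
float decayEpsilon(float epsilon)
{
    epsilon *= 0.98f;
    if (epsilon < 0.5f)
    {
        epsilon = 0.90f;
    }
    return epsilon;
}
```

Starting from 0.90, the value sawtooths: it shrinks by 2% per episode for roughly thirty episodes, then snaps back to 0.90 and repeats.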
Q-LEARNING ALGORITHM:
Initialize the Q-values table, Q(s, a). I have initialized these values to 0.0.
Observe the current state, s.
Choose an action, a, for that state based on one of the action selection policies (ε-soft, ε-greedy or softmax).
PROB = RANDOM(EPSILON);
if (PROB <= EPSILON) //EXPLORE THE ACTIONS
{
ACTION = random(0,4);
FLAG = 2;
}
else //EXPLOIT THE ACTIONS FROM Q TABLE
{
ACTION = ARGMAX(Q,STATE);
FLAG = 2;
}
Take the action, and observe the reward, r, as well as the new state, s'.
Update the Q-value for the state using the observed reward and the maximum reward possible for the next state. The updating is done according to the formula and parameters described above.
Set the state to the new state, and repeat the process until a terminal state is reached.
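The steps above can be sketched as a host-side toy simulation (no hardware), reusing the project's reward row {-10, -2, -1, +10}, its state+1 transition, and its update formula. All names here are illustrative, and the toy loop treats every obstacle event as one step:

```cpp
#include <cstdlib>

const int N_STATES = 10, N_ACTIONS = 4;
float Q[N_STATES][N_ACTIONS] = {};            // Q table, all zeros to start
const int REW[N_ACTIONS] = {-10, -2, -1, 10}; // same reward row for every state

// Exploit: index of the best-known action in state s.
int greedy(int s)
{
    int best = 0;
    for (int a = 1; a < N_ACTIONS; a++)
        if (Q[s][a] > Q[s][best]) best = a;
    return best;
}

// Largest Q value in state s (plays the role of Q_MAX in the update).
float maxQ(int s)
{
    float m = Q[s][0];
    for (int a = 1; a < N_ACTIONS; a++)
        if (Q[s][a] > m) m = Q[s][a];
    return m;
}

// One whole training run: observe state, pick an action (explore/exploit),
// collect the reward, update Q, and move to the next state.
void train(int episodes, float alpha, float gamma, float epsilon)
{
    int s = 0;
    for (int i = 0; i < episodes; i++)
    {
        int next = (s + 1) % N_STATES;                   // same transition rule as the sketch
        float p = (std::rand() % 100) / 100.0f;          // stand-in for RANDOM(EPSILON)
        int a = (p <= epsilon) ? std::rand() % N_ACTIONS // explore
                               : greedy(s);              // exploit
        Q[s][a] = (1 - alpha) * Q[s][a]
                + alpha * (REW[a] + gamma * maxQ(next));
        s = next;
    }
}
```

With enough episodes (say train(20000, 0.1, 0.9, 0.75)), every state's greedy action converges to index 3, the +10 action, mirroring what the robot learns on hardware.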
To understand Q-learning better, visit this link: https://towardsdatascience.com/a-beginners-guide-to-q-learning-c3e2a30a653c
///////////////////Implementation of Q_Learning Formula/////////////////////////
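The formula implemented in the Update() function of the code below is the standard temporal-difference Q-learning update, Q_NEW = (1 - ALPHA)*Q_OLD + ALPHA*(REWARD + GAMMA*Q_MAX). A minimal, testable version of just that line (the helper name is mine):

```cpp
#include <cassert>
#include <cmath>

// One Q-learning update, exactly the formula used in the sketch's Update():
// Q_NEW = (1 - ALPHA)*Q_OLD + ALPHA*(REWARD + GAMMA*Q_MAX)
float qUpdate(float qOld, int reward, float qMaxNext, float alpha, float gamma)
{
    return (1 - alpha) * qOld + alpha * (reward + gamma * qMaxNext);
}
```

For example, starting from Q_OLD = 0 with reward +10 and Q_MAX = 0, one update with ALPHA = 0.1 gives 0.9*0 + 0.1*(10 + 0) = 1.0; repeated updates pull the value towards REWARD + GAMMA*Q_MAX.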
Working Video: Don't forget to check the working video of AI Robot :)
Code
Obstacle Avoidance Robot using Q-LEARNING
THIS IS AN ARTIFICIAL INTELLIGENCE BASED OBSTACLE AVOIDANCE ROBOT'S MAIN CODE.
/* This is an Obstacle Avoiding Robot using Reinforcement Learning/AI
Author of this Project : Varun Walimbe
Algorithm used in this project: Q learning
How Obstacle Avoiding Works?
1.The ultrasonic sensor is used to measure the distance to the obstacle using
its Echo and Trig pins.
2.When the distance is measured, if it is less than 15cm there is
an obstacle nearby; otherwise the robot is safe and continues forward.
3.If obstacle is detected then robot takes left or right turn depending
on the situation.
How AI based Obstacle Avoidance Works?(Q learning)
1.Here the 1st step from the section above remains the same. However, the
2nd step is different.
2.A list of the robot's actions is initialised first. For example,
in this case the actions of the robot are: Left, Forward, Backward, Stop.
3.When the robot comes near an obstacle it is required to perform an action.
However, note that in this case the robot doesn't know which action to take, as it is
not pre-programmed and is going to learn on its own to avoid obstacles.
4.When the robot continues to move forward ignoring the obstacle it receives a reward of -10
When the robot goes backward it receives a reward of -2
When the robot stops in front of the obstacle it receives a reward of -1
When the robot turns left as soon as the obstacle is detected it gets a reward of +10
(these values match the REWARDS matrix below)
5.In this way Robot learns on its own to avoid obstacles by Reward Mechanism.*/
//////////ROBOT'S HARDWARE PARAMETERS////////////////////
int TRIG_PIN = 7;
int ECHO_PIN = 8;
long duration; //pulseIn() RETURNS MICROSECONDS; long AVOIDS OVERFLOWING AVR's 16-BIT int
float distance;
int M1 = 13;
int M2 = 12;
int M3 = 11;
int M4 = 10;
bool Obstacle = false;
int FLAG;
/////////////////////////END/////////////////////////////
/////////////////////////////////////Q LEARNING PARAMETERS///////////////////////////////////////////
float ALPHA = 0.1; //LEARNING RATE
float GAMMA = 0.5; //DISCOUNT FACTOR
float EPSILON = 0.90; //EXPLORATION PARAMETER
int REWARD; //REWARD FOR PERFORMING AN ACTION
int EPISODES = 100;
int STATE; // CURRENT STATE OF THE ROBOT
int ACTION = 0; //ACTION PERFORMED BY THE ROBOT (0:FORWARD, 1:BACKWARD, 2:STOP, 3:LEFT)
float PROB; //USED FOR EPSILON DECAY
bool ACTION_TAKEN = false; //THIS VARIABLES TELLS US WHETHER AN ACTION IS TAKEN OR NOT
int NEXT_STATE; // NEXT STATE OF THE ROBOT
const int STATES = 10; //NUMBER OF STATES IN ENVIRONMENT
int ACTIONS[4] = {0,1,2,3}; //ACTION INDICES: 0:FORWARD, 1:BACKWARD, 2:STOP, 3:LEFT
const int NUMBER_OF_ACTIONS = 4; //TOTAL WE HAVE 4 ACTION FORWARD,BACKWARD,LEFT AND STOP
/*THIS IS THE Q MATRIX OR Q TABLE. THIS IS BASICALLY THE DIARY THAT THE ROBOT WILL LOOK INTO
BEFORE PERFORMING AN ACTION. BASED ON THE ACTION THE ROBOT WILL EARN A REWARD AND THE Q VALUE
WILL BE UPDATED IN THIS Q TABLE. HERE I HAVE CONSIDERED 10 STATES. I HAVE ASSUMED ALL STATES
ARE DIFFERENT EVEN THOUGH THEY ARE THE SAME. BASICALLY THE OBSTACLE AVOIDING ROBOT CONTAINS ONLY TWO STATES
i.e:
1:WHEN ITS AWAY FROM THE OBSTACLE
2:WHEN ITS NEAR TO THE OBSTACLE
BUT HERE, TO ILLUSTRATE A MORE COMPLEX ENVIRONMENT, I HAVE ASSUMED THERE ARE 10 DIFFERENT STATES,
EACH EXPECTING THE SAME/A DIFFERENT ACTION.*/
float Q[STATES][NUMBER_OF_ACTIONS] = {{0.0,0.0,0.0,0.0}, //MOST IMPORTANT OF ALL IS THE Q TABLE.
{0.0,0.0,0.0,0.0}, //IT IS FORMED BY STATES AS ITS ROWS
{0.0,0.0,0.0,0.0}, //AND COLLUMNS AS ITS NUMBER OF ACTIONS
{0.0,0.0,0.0,0.0}, //INITIALISED TO ZERO IN THE START
{0.0,0.0,0.0,0.0}, // THIS WILL UPDATED IN THE FUTURE.
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0},
{0.0,0.0,0.0,0.0}};
/*THIS IS A REWARD MATRIX OR REWARD TABLE. THIS IS RESPONSIBLE FOR GIVING
REWARD TO ROBOT FOR PERFORMING PARTICULAR ACTION. IT STORES THE REWARD FOR
EACH ACTION TAKEN AT STATE. THE REWARD WILL BE POSITIVE IF THE ACTION
PERFORMED IS GOOD AND NEGATIVE IF ACTION YIELDS BAD RESULTS.*/
int REWARDS[STATES][NUMBER_OF_ACTIONS] = {{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10},
{-10,-2,-1,10}};
////////////////////////////////////////////END///////////////////////////////////////////////////
////////////////Q LEARNING UPDATE PARAMETERS////////////
float Q_OLD;
float Q_NEW;
float Q_MAX;
//////////////////////////END//////////////////////////
void setup()
{
Serial.begin(9600);
pinMode(TRIG_PIN,OUTPUT);
pinMode(ECHO_PIN,INPUT);
pinMode(M1,OUTPUT);
pinMode(M2,OUTPUT);
pinMode(M3,OUTPUT);
pinMode(M4,OUTPUT);
randomSeed(analogRead(A0));
STATE = 0;
Serial.println("TRAINING WILL START IN 5 SECONDS: ");
delay(5000);
}
////////////////////////////ROBOT'S FUNCTIONS/////////////////////////////////
void Forward()
{
digitalWrite(M1,LOW);
digitalWrite(M2,HIGH);
digitalWrite(M3,LOW);
digitalWrite(M4,HIGH);
}
void Backward()
{
digitalWrite(M1,HIGH);
digitalWrite(M2,LOW);
digitalWrite(M3,HIGH);
digitalWrite(M4,LOW);
}
void Left()
{
digitalWrite(M1,HIGH);
digitalWrite(M2,LOW);
digitalWrite(M3,LOW);
digitalWrite(M4,HIGH);
}
void Right()
{
digitalWrite(M1,LOW);
digitalWrite(M2,HIGH);
digitalWrite(M3,HIGH);
digitalWrite(M4,LOW);
}
void Stop()
{
digitalWrite(M1,LOW);
digitalWrite(M2,LOW);
digitalWrite(M3,LOW);
digitalWrite(M4,LOW);
}
bool Obstacle_Avoider()
{
digitalWrite(TRIG_PIN, LOW); //ENSURE A CLEAN PULSE
delayMicroseconds(2);
digitalWrite(TRIG_PIN, HIGH); //SEND THE 10us TRIGGER PULSE
delayMicroseconds(10);
digitalWrite(TRIG_PIN, LOW);
duration = pulseIn(ECHO_PIN, HIGH);
distance = (duration/2)/29.1; //SOUND TRAVELS ~29.1us PER cm, THERE AND BACK
if(distance < 15)
{
Obstacle = true;
}
else
{
Obstacle = false;
}
delay(10);
return Obstacle;
}
////////////////////////////////////////////END////////////////////////////////////////////////
///////////////////////////////ROBOT'S Q LEARNING FUNCTIONS////////////////////////////////////
float RANDOM(float EXPLORATION_PARAMETER)
{
/*THIS FUNCTION RETURNS A RANDOM PROBABILITY
IN [0,1) WHICH DECIDES WHETHER THE ACTION TO BE
TAKEN IS RANDOM OR FROM THE Q_TABLE*/
float RANDOM_VARIABLE = random(0,100);
float PROBABILITY = RANDOM_VARIABLE/100;
return PROBABILITY;
}
float DECAY(float PARAMETER)
{
/*THIS FUNCTION IS USED TO REDUCE
EPSILON (THE EXPLORATION PARAMETER) WITH
TIME. IN THIS PROJECT EPSILON IS NEVER
FULLY REMOVED: THE MAIN LOOP RESETS IT
ONCE IT FALLS BELOW 0.5, SO THE ROBOT
KEEPS EXPLORING*/
PARAMETER = PARAMETER*0.98; //PARAMETER HERE IS THE EPSILON
return PARAMETER;
}
int GET_STATE()
{
int STATE_NUMBER;
STATE_NUMBER = random(0,10);
return STATE_NUMBER;
}
float MAX(float Q_Table[][4],int NEXT_S)
{
/*THIS FUNCTION FINDS THE BIGGEST NUMBER
IN Q_TABLE[NEXT_STATE]. THE MAIN ROLE OF
THIS FUNCTION IS TO FIND THE Q_MAX PARAMETER.
IT STARTS FROM THE FIRST ENTRY (NOT FROM 0.0)
SO ALL-NEGATIVE Q VALUES ARE HANDLED CORRECTLY*/
float MAX_VALUE = Q_Table[NEXT_S][0];
for(int b = 1; b<=3; b++)
{
if(Q_Table[NEXT_S][b] > MAX_VALUE)
{
MAX_VALUE = Q_Table[NEXT_S][b];
}
}
return MAX_VALUE;
}
int ARGMAX(float Q_Table[][4],int S)
{
/*THIS FUNCTION FINDS THE INDEX OF THE
BIGGEST Q VALUE IN Q_TABLE[STATE]*/
int MAX_INDEX = 0;
float MAX_VALUE = Q_Table[S][0]; //INITIALISE WITH THE FIRST ACTION'S Q VALUE
for(int p = 1; p<=3; p++)
{
if(Q_Table[S][p] > MAX_VALUE)
{
MAX_VALUE = Q_Table[S][p];
MAX_INDEX = p;
}
}
return MAX_INDEX;
}
void Update(float Q_TABLE[][4] , int S, int NEXT_S, int A, int ACTIONS[], int R, float LEARNING_RATE, float DISCOUNT_FACTOR)
{
/*THIS FUNCTION UPDATES THE Q TABLE AND Q VALUES. THIS UPDATE KEEPS ON HAPPENING UNTIL THE
MAIN LOOP ENDS. AT THE END OF THE EPISODES THE Q TABLE IS FILLED WITH VARIOUS VALUES. THE GREATER
THE VALUE, THE GREATER IMPORTANCE THE ACTION HAS IN THAT PARTICULAR STATE. "Q_OLD" IS THE OLD VALUE
IN THE Q MATRIX. THIS IS THE VALUE WHICH GETS UPDATED EVENTUALLY. "Q_NEW" IS THE NEW Q VALUE
CALCULATED BY THE Q LEARNING FORMULA. THE Q LEARNING FORMULA USED HERE IS BASED ON THE
BELLMAN EQUATION AND USES A TEMPORAL DIFFERENCE LEARNING APPROACH. (A MONTE CARLO APPROACH WILL
NOT WORK IN THIS CASE OF THE OBSTACLE AVOIDING ROBOT.)*/
Q_OLD = Q_TABLE[S][A];
Q_MAX = MAX(Q_TABLE, NEXT_S);
Q_NEW = (1-LEARNING_RATE)*Q_OLD + LEARNING_RATE*(R + DISCOUNT_FACTOR*Q_MAX);
Serial.print("Q VALUE : ");
Serial.println(Q_NEW);
Q_TABLE[S][A] = Q_NEW;
}
///////////////////////////////////////////////////////////END///////////////////////////////////////////////////////////////
/////////////////////////////////////////START OF MAIN LOOP/////////////////////////////////////////////////
void loop()
{
/////////////////////////////////////////TRAINING////////////////////////////////////////////
for(int I =0; I<EPISODES; I++)
{
Serial.println("START :");
ACTION_TAKEN = false;
FLAG = 0;
while(true)
{
Forward();
Obstacle = Obstacle_Avoider();
if(Obstacle == true)
{
NEXT_STATE = STATE+1;
if(NEXT_STATE == 10) //WRAP AROUND TO THE FIRST STATE
{
NEXT_STATE = 0;
}
Serial.print("STATE: ");
Serial.println(STATE);
FLAG = 1;
break;
}
}
if(FLAG ==1)
{
PROB = RANDOM(EPSILON);
if (PROB<=EPSILON) //EXPLORE THE ACTIONS
{
ACTION = random(0,4);
FLAG = 2;
}
else //EXPLOIT THE ACTIONS FROM Q TABLE
{
ACTION = ARGMAX(Q,STATE);
FLAG = 2;
}
}
if(FLAG ==2)
{
if(ACTION == 0)
{
Forward();
delay(1500);
Stop();
REWARD = REWARDS[STATE][ACTION];
}
if(ACTION == 1)
{
Backward();
delay(2500);
Stop();
REWARD = REWARDS[STATE][ACTION];
}
if(ACTION == 2)
{
Stop();
REWARD = REWARDS[STATE][ACTION];
}
if(ACTION == 3)
{
Left();
delay(2000);
Stop();
REWARD = REWARDS[STATE][ACTION];
}
ACTION_TAKEN = true;
delay(500);
}
if(ACTION_TAKEN == true)
{
Update(Q,STATE,NEXT_STATE,ACTION ,ACTIONS,REWARD,ALPHA ,GAMMA);
STATE = NEXT_STATE;
EPSILON = DECAY(EPSILON);
if(EPSILON<0.5)
{
EPSILON = 0.9; //RESET EPSILON SO THE ROBOT KEEPS EXPLORING
}
Serial.print("EPISODE ENDED : ");
Serial.println(I);
delay(7000);
}
}
/////////////////////////////////////END OF TRAINING///////////////////////////////////
//////////////////////////////////////EVALUATION//////////////////////////////////////////
/*USE THIS TO CHECK WHETHER YOUR Q VALUES ARE RIGHT OR WRONG. ONCE ALL THE
Q VALUES LOOK RIGHT/ACCURATE, COMMENT THIS SECTION OUT. */
for(int y = 0; y<=9 ; y++)
{
Serial.println("SET OF Q VALUES WILL START:");
for(int l = 0; l<=3; l++)
{
Serial.print("Q VALUE :");
Serial.println(Q[y][l]);
delay(2000);
}
delay(2000);
}
Serial.println("EVALUATION ENDED");
////////////////////////////////END OF EVALUATION/////////////////////////////////////////
////////////////////////////////////////TESTING////////////////////////////////////////////
while(true)
{
Forward();
Obstacle = Obstacle_Avoider();
if(Obstacle == true)
{
STATE = GET_STATE();
ACTION = ARGMAX(Q,STATE);
Serial.print("ACTION TAKEN: ");
Serial.println(ACTION);
if(ACTION ==0)
{
Forward();
delay(1500);
Stop();
}
if(ACTION == 1)
{
Backward();
delay(1500);
Stop();
}
if(ACTION == 2)
{
Stop();
}
if(ACTION == 3)
{
Left();
delay(2000);
Stop();
}
}
}
//////////////////////////////////////////////////END OF TESTING////////////////////////////////////////////////////////////
}
//////////////////////////////////////////////////////END OF MAIN LOOP////////////////////////////////////////////////////////