This post presents the basics of implementing a reinforcement learning algorithm called the Dynamic Bayes Decision Network. This algorithm can be used to build a robust and efficient expert system. Below, I present its characteristics and show the code of its implementation.
When I was doing an internship, I was shown a SaaS platform built on Microsoft Azure. The system allowed any user to buy computing nodes and pay a certain amount of money per minute of computing. The catch was that the company was purchasing Azure’s nodes at a fixed price per hour. The company wanted me to design a manager for buying and releasing computing nodes. The system needed to be responsive, but also efficient and resistant to Economic-DoS (E-DoS) attacks. These attacks are pretty simple: you request a huge number of nodes, then cancel your request within a minute. The target of the attack ends up buying a lot of computing nodes and paying for a full hour for nothing.
This problem seemed solvable with a reinforcement learning algorithm, but which one? We needed an algorithm that would require little training, manage the system’s load efficiently, and be resistant to E-DoS attacks.
Reinforcement learning is an important and vast field of study within artificial intelligence. Numerous algorithms have been developed for a wide variety of applications. In his book, Russell describes many learning algorithms in detail [Russell et al., 1995]. Among them are online algorithms that make use of back-and-forth interactions between the agent and its environment. The behavior resulting from these interactions is used to solve a task and to learn the most efficient way of doing it.
Reinforcement learning techniques are efficient ways to achieve good results with minimal knowledge about the problem [Tesauro, 2007]. Algorithms that solve Markov Decision Processes (MDPs), such as Q-Learning, or Partially Observable Markov Decision Processes (POMDPs) need almost no initial internal modeling of the domain. To improve their performance, they instead rely on training, acquiring rewards for the selected actions, as explained by Doshi [Doshi et al., 2008]. The rewards are then used by the learning system to improve itself and to maintain its ability to predict the distribution of possible future states ([Ross et al., 2008], [Tesauro, 2007]).
There are different types of reinforcement learning methods, and we can distinguish two opposite approaches: model-based and model-free. The distinction lies in their use of a model of the environment. In model-based methods, we assume expert knowledge of the environment’s dynamics, and that expertise is represented within the mathematical framework; dynamic programming, for instance, is a family of model-based algorithms. Model-free methods, like Q-learning and SARSA, do not assume any prior knowledge about the dynamics.
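To make the distinction concrete, here is a minimal Python sketch of the model-free side, the tabular Q-learning update. The action set, learning rate and exploration rate below are illustrative assumptions, not values from any of the projects discussed here.

```python
import random
from collections import defaultdict

# Tabular Q-learning: no model of the environment is kept, only the
# action-value estimates Q(s, a), updated from observed transitions.
ALPHA = 0.1    # learning rate (assumed value)
GAMMA = 0.9    # discount factor (assumed value)
EPSILON = 0.1  # exploration rate for epsilon-greedy selection

ACTIONS = ["walk", "run", "wait"]   # illustrative action set
Q = defaultdict(float)              # Q[(state, action)] -> value estimate

def choose_action(state):
    """Epsilon-greedy choice over the current value estimates."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

Nothing here encodes how the environment behaves; that is exactly what a model-based approach would add.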
However, learning can take a long time. As explained by Doshi [Doshi et al., 2008], Poupart [Poupart et al., 2006] and Strens [Strens, 2000], it can be interesting to combine these reinforcement techniques with a Dynamic Bayesian Decision Network (DBDN) in order to shorten the learning phase. Indeed, a DBDN allows initial expertise about the environment to be integrated into the model quite easily, in a probabilistic manner. The agent then possesses an initial robustness while still being able to learn.
To create a good load manager, I thought it would be best to use the Dynamic Bayes Decision Network. It is a very simple reinforcement learning algorithm, and it made it possible to add good prior expertise to the system with little effort. That’s what I implemented, and in fact the manager was quite good at protecting the system against E-DoS attacks while keeping it both efficient and responsive.
Since the code I implemented in C#.NET is proprietary, I cannot give it to the open source community. Nevertheless, I recently had another project where I compared the efficiency of model-free algorithms against model-based learning algorithms. I will show that code in this article.
Here’s the problem: you have a robot that needs to do some skiptracing of a target. You know the target’s bearing and the direction your robot needs to walk to get closer to it, but you don’t know the distance between the robot and the target. If the robot gets too close, runs instead of walking, or is seen by the target, you get a negative reward and the target’s proximity sensing range grows, so your robot has to escape the target.
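To fix ideas, here is one way the reward signal of that problem could be written down; every threshold and penalty below is an assumption I made for the example, not a value from the actual project.

```python
def skiptrace_reward(distance, running, seen_by_target, sensing_range):
    """Toy reward and sensing-range dynamics for the skiptracing problem.

    All numbers are illustrative assumptions.
    """
    reward = 0.0
    if distance < sensing_range:
        reward -= 10.0        # got too close to the target
    if running:
        reward -= 5.0         # running instead of walking makes noise
    if seen_by_target:
        reward -= 20.0        # the robot was spotted
    if reward < 0.0:
        sensing_range *= 1.5  # a violation makes the target more wary
    else:
        reward = 1.0 / (1.0 + distance)  # small bonus for closing the distance
    return reward, sensing_range
```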
While developing the DBDN, we had to think of an update rule that would be simple to understand and to maintain. We also wanted it to incorporate as much information as possible in order to increase the expertise of the robot. We defined five goals to be achieved by the algorithm:
- Proximity sensors must not be violated.
- Maximal distance must be respected.
- Skiptracer must not be visible.
- Skiptracer must not make noise by running.
- Skiptracer must not touch a civilian.
Given the current state and some previous ones, we first estimate the next accessible state for each possible action. This first form of expertise is implemented by remembering previous states and using them to detect the most probable next state.
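Here is a minimal sketch of that first step, assuming the states can be discretized into hashable values: the agent counts the transitions it has already observed and, for each action, takes the most frequent outcome as the most probable next state.

```python
from collections import Counter, defaultdict

# transition_counts[(state, action)] is a Counter over observed next states.
transition_counts = defaultdict(Counter)

def remember_transition(state, action, next_state):
    """Store an observed transition so it can inform later predictions."""
    transition_counts[(state, action)][next_state] += 1

def predict_next_state(state, action):
    """Return the most frequently observed next state for (state, action),
    falling back to staying in place when nothing has been observed yet."""
    counts = transition_counts[(state, action)]
    if not counts:
        return state
    return counts.most_common(1)[0][0]
```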
Second, given an estimated state, we give an individual score to every goal. For example, for the noise goal we predict a good reward if the robot chooses to run, but we decrease it if the robot is close to the target, since the target would hear the skiptracer. Once every goal has a reward prediction for every possible future state, we compute the utility value, a weighted sum over all the goals to be achieved. Finally, we select the action with the best predicted reward.
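Here is a sketch of that scoring and selection step. The state fields, goal names and thresholds are placeholders I invented for the example; the important part is the weighted sum of per-goal predictions followed by picking the action with the highest utility.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    """Made-up state fields for the example; the real state is richer."""
    distance: float
    sensing_range: float
    max_distance: float
    visible: bool
    running: bool
    touching_civilian: bool

# One reward predictor per goal; the bodies are simplified placeholders
# for the hand-written expertise described above.
GOAL_PREDICTORS = {
    "proximity":    lambda s: -10.0 if s.distance < s.sensing_range else 1.0,
    "max_distance": lambda s: -5.0 if s.distance > s.max_distance else 1.0,
    "visibility":   lambda s: -20.0 if s.visible else 1.0,
    "noise":        lambda s: -5.0 if s.running and s.distance < 2 * s.sensing_range else 1.0,
    "civilian":     lambda s: -15.0 if s.touching_civilian else 1.0,
}

# One weight per goal, uniform to start with; the update rule below
# shifts weight between goals as real rewards come in.
goal_weights = {name: 1.0 / len(GOAL_PREDICTORS) for name in GOAL_PREDICTORS}

def utility(predicted_state):
    """Weighted sum of the per-goal reward predictions for one state."""
    return sum(goal_weights[name] * predict(predicted_state)
               for name, predict in GOAL_PREDICTORS.items())

def select_action(state, actions, predict_next_state):
    """Pick the action whose predicted next state has the highest utility."""
    return max(actions, key=lambda a: utility(predict_next_state(state, a)))
```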
Once the action has been performed, a reward is given to the algorithm. We then find the goal that best predicted the reward and the one with the worst prediction, and transfer a fraction of the weight of the worst to the best. You can see this process as a kind of adaptation: it allows the algorithm to become better at predicting rewards, and therefore better at deciding which action to take.
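A minimal sketch of that weight transfer could look like this; the transfer fraction is an assumed value.

```python
TRANSFER_FRACTION = 0.1  # assumed fraction of weight moved per update

def update_goal_weights(goal_weights, goal_predictions, observed_reward):
    """Move weight from the worst-predicting goal to the best-predicting one.

    `goal_predictions` maps each goal name to the reward it predicted for
    the state that was actually reached; `observed_reward` is what the
    environment really returned.
    """
    errors = {name: abs(prediction - observed_reward)
              for name, prediction in goal_predictions.items()}
    best = min(errors, key=errors.get)    # smallest prediction error
    worst = max(errors, key=errors.get)   # largest prediction error
    if best != worst:
        transfer = TRANSFER_FRACTION * goal_weights[worst]
        goal_weights[worst] -= transfer
        goal_weights[best] += transfer
    return goal_weights
```

Because the total weight stays constant, one goal gains influence only at the expense of the current worst predictor, which keeps the behaviour from drifting too quickly.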
I invite you to contact me if you want more information about the algorithm.
Now here’s the code.
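At its core it is a decision-and-update loop; below is a condensed sketch of one cycle, reusing the helper functions sketched above. The env object and its step() method are placeholders for the real environment interface, not the actual project code.

```python
def agent_step(env, state, actions):
    """One full cycle of the goal-weighted agent sketched above.

    `env` is a placeholder object whose step(action) method returns the
    real next state and the reward; it stands in for the actual simulator.
    """
    # 1. Predict the most probable next state for every action.
    predictions = {a: predict_next_state(state, a) for a in actions}

    # 2. Score each prediction with the weighted per-goal utility and
    #    pick the most promising action.
    action = max(actions, key=lambda a: utility(predictions[a]))

    # 3. Act, then observe the real outcome and the reward.
    next_state, reward = env.step(action)
    remember_transition(state, action, next_state)

    # 4. Shift weight from the worst-predicting goal to the best one.
    goal_predictions = {name: predict(predictions[action])
                        for name, predict in GOAL_PREDICTORS.items()}
    update_goal_weights(goal_weights, goal_predictions, reward)

    return next_state
```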
And that’s it. Enjoy!