Void’s Vault

Knowledge source for efficiency.

Dynamic Bayes Decision Network for Building Robust Expert Learning Systems

This post presents the basics of implementing a reinforcement learning algorithm called the Dynamic Bayes Decision Network (DBDN). This algorithm can be used to build a robust and efficient expert system. In this post, I present its characteristics and show the code of its implementation.

When I was doing an internship, I was introduced to a SaaS platform built on Microsoft Azure. The system allowed any user to buy computing nodes and pay a certain amount of money per minute of computing. The catch was that the company was purchasing Azure’s nodes at a fixed price per hour. The company wanted me to design a manager for buying and releasing computing nodes. The system needed to be responsive, but also efficient and resistant to Economic-DoS (E-DoS) attacks. These attacks are pretty simple: you request a huge number of nodes, then cancel your request within a minute. The target of the attack ends up buying a lot of computing nodes and paying for a full hour of them for nothing.

This problem seemed solvable with a reinforcement learning algorithm, but which one? We needed an algorithm that would require little training, that could efficiently manage the system’s load, and that would be resistant to E-DoS attacks.

Reinforcement learning is an important and vast field of study in artificial intelligence. Numerous algorithms have been developed for a wide variety of applications. In his book, Russell describes many learning algorithms in detail [Russell et al., 1995]. Among them are online algorithms that make use of back-and-forth interactions between the agent and its environment. The behavior resulting from these interactions is used to solve a task and to learn the most efficient way of doing it.

Reinforcement learning techniques are efficient ways to achieve good results with minimal knowledge about the problem [Tesauro, 2007]. Algorithms that solve Markov Decision Processes (MDPs), such as Q-Learning, or Partially Observable Markov Decision Processes (POMDPs) need almost no initial internal modeling of the domain. To improve their performance, they instead rely on training with reward acquisition for the selected actions, as explained by Doshi [Doshi et al., 2008]. The rewards are then used by the learning system to improve itself and to maintain its ability to predict the distribution of possible future states ([Ross et al., 2008], [Tesauro, 2007]).
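
To make the idea of reward-driven improvement concrete, here is a minimal tabular Q-Learning update in MATLAB. It is only an illustrative sketch: the state and action indices, the learning rate and the discount factor are placeholders, not something taken from the projects described in this post.

% Minimal tabular Q-Learning update (illustrative sketch).
% Q is an nStates-by-nActions value table; s, a, r, sNext describe one
% interaction with the environment.
alpha = 0.1;                      % learning rate (assumed)
gamma = 0.9;                      % discount factor (assumed)
Q = zeros(10, 4);                 % hypothetical 10 states, 4 actions
s = 1; a = 2; r = -1; sNext = 3;  % one hypothetical transition
Q(s, a) = Q(s, a) + alpha * (r + gamma * max(Q(sNext, :)) - Q(s, a));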

There are different types of reinforcement learning methods. We can distinguish two opposite approaches: model-based and model-free. The distinction between them is in their use of a model of the environment. In model-based methods, we assume expert knowledge of the environment’s dynamics, and that expertise is represented within the mathematical framework; dynamic programming, for instance, is a family of model-based algorithms. On the other hand, model-free methods, like Q-Learning and SARSA, do not assume any prior knowledge about the dynamics.
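
For reference, the only difference between the two model-free algorithms just mentioned is the target they bootstrap from. Here is a minimal, self-contained sketch with placeholder values (not taken from the projects in this post):

% Q-Learning bootstraps from the greedy next action, SARSA from the next
% action actually chosen by the policy.
alpha = 0.1; gamma = 0.9;          % assumed learning rate and discount factor
Q = zeros(10, 4);                  % hypothetical 10 states, 4 actions
s = 1; a = 2; r = -1;              % one hypothetical transition
sNext = 3; aNext = 4;              % next state and the next action chosen by the policy
qLearningTarget = r + gamma * max(Q(sNext, :));      % off-policy target
sarsaTarget     = r + gamma * Q(sNext, aNext);       % on-policy target
Q(s, a) = Q(s, a) + alpha * (sarsaTarget - Q(s, a)); % SARSA update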

However, learning can take a long time. As explained by Doshi [Doshi et al., 2008], Poupart [Poupart et al., 2006] and Strens [Strens, 2000], it can be interesting to apply these reinforcement techniques with a Dynamic Bayesian Decision Network (DBDN) in order to limit the duration of learning. Indeed, a DBDN allows initial environment expertise to be integrated into the model quite easily, in a probabilistic manner. The agent then possesses an initial robustness while still being able to learn.
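
In the implementation shown later in this post, that initial expertise takes two forms: hand-written transition rules in estimateBelief, and the initial weights given to the goals. The listing does not include the initial weight values, so the ones below are only an assumption of how such a prior could be encoded:

% Hypothetical initialization of the goal-weight vector used by the DBDN code
% below: [proximity; distance; visibility; noise; civilians_ahead].
% Giving more weight to the proximity and visibility goals encodes a prior
% belief that violating them is more costly. The values are illustrative.
rUtility = [0.3; 0.15; 0.3; 0.1; 0.15];   % weights sum to 1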

In order to create a good load manager, I thought it would be best to use the Dynamic Bayes Decision Network. It is a very simple reinforcement learning algorithm, and with it, it was possible to add good prior expertise to the system with little effort. That’s what I implemented, and in fact, the manager was quite good at protecting the system against E-DoS attacks while keeping it both efficient and responsive.

Since the code I implemented in C#.NET is proprietary, I cannot share it with the open source community. Nevertheless, I recently had another project in which I compared the efficiency of model-free versus model-based learning algorithms. I will show that code in this article.

Here’s the problem: you have a robot that needs to do some skiptracing of a target. You know the target’s bearing and the direction your robot needs to walk to get closer to the target. You don’t know the distance between the robot and the target. If the robot gets too close, if it runs instead of walking, or if it is seen by the target, you get a negative reward and the target’s proximity sensing range gets bigger, so your robot has to escape the target.
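
The code later in this post represents that situation as a small belief matrix: three timesteps, with the newest belief in the last column. Here is a hypothetical example of what it could contain (all numbers are made up):

% Hypothetical belief history: one column per timestep, newest in the last
% column. Rows (top to bottom): direction_toward_target, target_orientation,
% isTooClose, isTooFar, isSeen, isNoisy, civilians_ahead.
rBeliefs = [ 0.2  0.1  0.0 ;
             1.0  1.2  1.5 ;
             0    0    1   ;
             0    0    0   ;
             0    1    1   ;
             0    0    0   ;
             0    0    0   ];

% Matching percept history: [v; w] of the target at each timestep.
rPercepts = [ 1.0  1.0  0.0 ;
              0.0  0.1  0.2 ];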

While developing the DBDN, we had to think of an update rule that would be simple to understand and to maintain. We also wanted it to incorporate as much information as possible to increase the robot’s expertise. We defined five goals to be achieved by the algorithm (a hypothetical reward consistent with these goals is sketched just after the list).

  1. Proximity sensors must not be violated.
  2. Maximal distance must be respected.
  3. Skiptracer must not be visible.
  4. Skiptracer must not make noise by running.
  5. Skiptracer must not touch a civilian.
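
The simulator that issues the rewards is not part of the listing below. A reward consistent with these five goals could simply penalize each violation; the sketch below is an assumption, kept in [-1, 0] so that it stays commensurate with the per-goal scores computed later by estimateGoalsAchievement.

% Hypothetical environment-side reward for the five goals above (assumption;
% the original simulator is not shown in this post). Each argument is 0 or 1.
function r = computeReward(isTooClose, isTooFar, isSeen, isNoisy, touchedCivilian)
    violations = [isTooClose, isTooFar, isSeen, isNoisy, touchedCivilian];
    r = -mean(violations);   % 0 when no goal is violated, -1 when all are
end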

Given the current state and some previous states, we first estimate the next accessible state for each possible action. This first form of expertise is implemented by remembering previous states and using them to predict the most probable next state.
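
For instance, one of the rules of this kind in the estimateBelief function shown later looks at the two previous timesteps of the belief: if the robot was recently too close and the chosen action still heads roughly toward the target, the estimated chance of being too close again is raised.

% Excerpt from estimateBelief below: history-based estimation of isTooClose.
if (rBeliefs(3,1) == 1 || rBeliefs(3,2) == 1) && abs(relativeActionOrientation) <= pi/4
    % have been too close in one of the two previous timesteps...
    % might get too close again
    isTooClose = 0.4;
end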

Second, given an estimated state, we give an individual score to every goal. For example, considering the noise goal, we will predict a good reward if the robot chooses to run, but we decrease it if the robot is close to the target, since the target would hear the skiptracer. Once every goal has a reward prediction for every possible future state, we compute the utility value, a weighted sum over all the goals to be achieved. Finally, we select the action with the best expected reward.
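
As a small worked example of this step (the numbers are made up, and the weights are the hypothetical ones from earlier):

% Worked example of action scoring. Per-goal reward predictions for one
% candidate action, in the order [proximity; distance; visibility; noise;
% civilians_ahead], and the current goal weights:
goalScores = [0; -0.4; -0.1; 0; 0];
rUtility   = [0.3; 0.15; 0.3; 0.1; 0.15];
utility    = dot(goalScores, rUtility);   % = -0.09
% evaluateActions computes this value for each of its 17 candidate actions,
% and getBestAction keeps the action with the highest one.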

Once the action has been performed, a reward is given to the algorithm. We then find the goal that best predicted the reward and the ones with the worst predictions, and we transfer a fraction of the weight from the worst to the best. You can see this process as a kind of adaptation: it allows the algorithm to become better at predicting rewards, and therefore better able to decide which action to take.
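
Continuing the same made-up example, here is how that adaptation step would play out:

% Worked example of the weight-transfer step (made-up numbers).
rUtility   = [0.3; 0.15; 0.3; 0.1; 0.15];   % current goal weights
goalScores = [0; -0.4; -0.1; 0; 0];         % predictions made before acting
rReward    = -0.4;                          % reward actually received
% The distance goal (score -0.4) predicted the reward exactly, so updateUtility
% below takes a small fraction of the weight of each of the other four goals,
% proportional to how far off their predictions were, and gives it to the
% distance goal.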

I invite you to contact me if you want more information about the algorithm.

Now here’s the code.

function [rAction, rUtility] = DBDN(rBeliefs, rPercepts, rReward, rUtility, lastAction)
%   This function is the main loop for the DBDN algorithm
%   
%   Inputs:
%   rBeliefs - matrix containing 3 timesteps of the belief. The 3rd column
%       is the current belief. Each belief is of the form:
%       [direction_toward_target; target_orientation; isTooClose; isTooFar; isSeen; isNoisy; civilians_ahead]
%
%   rPercepts - matrix containing 3 timesteps of the percepts. The 3rd column
%       is the current percept, of the form [v; w] of the target
%
%   rReward - reward received for the last action
%
%   rUtility - utility weights: [proximity; distance; visibility; noise; civilians_ahead]
%
%   lastAction - last action taken by the robot: [speed; orientation]

    rUtility = updateUtility(rBeliefs, rReward, rUtility, rPercepts, lastAction);
    rActions = evaluateActions(rBeliefs,rPercepts, rUtility);
    rAction = getBestAction(rActions);
end

function [rUtility] = updateUtility(rBeliefs, rReward, rUtility, rPercepts, lastAction)
% This function computes a separate expected reward for every goal using the
% current beliefs. The goals whose predictions are the farthest from the
% received reward have their weight reduced, and the difference is given to
% the goal with the best prediction.
estimatedBelief = estimateBelief(rBeliefs, rPercepts, lastAction);
goalScores = estimateGoalsAchievement(estimatedBelief);

%get the farthest score and the closest score
[~,i] = sort(goalScores-rReward, 'descend');

diffs = abs(goalScores-goalScores(i(5)));
total = sum(diffs);

if total > 0
    %transfer up to 10% of the weight from the four worst predictors to the best
    for j = 1:1:4
        diff = ((diffs(i(j))/total)/10)*(rUtility(i(j))-0.05);
        rUtility(i(j)) = rUtility(i(j)) - diff;
        rUtility(i(5)) = rUtility(i(5)) + diff;
    end
end
end

function [rActions] = evaluateActions(rBeliefs,rPercepts, rUtility)
% This function estimates the reward that could be obtained for each action
% that leads to a new estimated belief.

% List the actions and give a utility to each of them
rActions = [0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 ; ...
            0 0 pi/4 pi/2 3*pi/4 pi -pi/4 -pi/2 -3*pi/4 0 pi/4 pi/2 3*pi/4 pi -pi/4 -pi/2 -3*pi/4 ];

for i = 1:size(rActions,2)
    estimatedBelief = estimateBelief(rBeliefs, rPercepts, rActions(:,i));
    goalScores = estimateGoalsAchievement(estimatedBelief);

    %Save the score estimation for the current action
    rActions(3,i) = dot(goalScores, rUtility);
end
end

function [estimatedBelief] = estimateBelief(rBeliefs, rPercepts, rAction)
%This function estimates the future belief given the previous belief, the
%current percepts and the current action chosen by the robot.

% Assumption: the helper quantities below are not part of the published
% listing, so they are defined here in a simple way to keep the function
% self-contained. Action orientations are taken to be relative to the
% direction toward the target, relativeOrientation is the target's bearing
% from the current belief, and isTurningLeft is inferred from the target's
% angular velocity percept.
direction_toward_target = rBeliefs(1,3);
target_orientation = rBeliefs(2,3);
relativeOrientation = target_orientation;
relativeActionOrientation = rAction(2);
isTurningLeft = rPercepts(2,3) > 0;

%isSeen
isSeen = 0;
if rBeliefs(5,3) == 1
    if rAction(1) == 0
        isSeen = 0.7;
    elseif relativeOrientation >= pi/2
        if relativeActionOrientation > -pi/4 || relativeActionOrientation <= -3*pi/4
            isSeen = 0.4;
        else
            isSeen = 0.1;
        end
    elseif relativeOrientation <= -pi/2
        if relativeActionOrientation < pi/4 || relativeActionOrientation >= 3*pi/4
            isSeen = 0.4;
        else
            isSeen = 0.1;
        end
    end
else
    if relativeOrientation > pi/4 && isTurningLeft && relativeActionOrientation > 0
        isSeen = 0.4;
    elseif relativeOrientation < -pi/4 && isTurningLeft==0 && relativeActionOrientation < 0
        isSeen = 0.4;
    end
end


%isTooClose
isTooClose = 0;
if rBeliefs(3,3) == 1 %too close
    if(rAction(1) == 0)
        isTooClose = 1;
    elseif abs(relativeActionOrientation) < pi/2
        isTooClose = 0.7;
    end
else %not too close
    if (rBeliefs(3,1) == 1 || rBeliefs(3,2) == 1) && abs(relativeActionOrientation) <= pi/4
        % have been too close... might get too close again
        isTooClose = 0.4;
    end
end


%isTooFar
isTooFar = 0;
if rBeliefs(4,3) == 1 %too far  
    if rBeliefs(4,1) == 1 && rBeliefs(4,2) == 1 && (rAction(1) < 2 || abs(relativeActionOrientation) > 0)
        %you were too far for too long, need to run to catch up
        isTooFar = 0.7;
    elseif (rAction(1) < 1 || abs(relativeActionOrientation) > 0)
        %not walking|running toward far target
        isTooFar = 0.1;
    end
elseif rBeliefs(3,3) == 0 %in the good zone
    if rAction(1) == 0
        isTooFar = 0.5;
    elseif abs(relativeActionOrientation) > pi/2
        isTooFar = 0.6;
    elseif abs(relativeActionOrientation) > pi/4
        isTooFar = 0.4;
    end
end


%noise
isNoisy = 0;
if rAction(1) == 2
    if  rBeliefs(4,3) == 1 %too far
        isNoisy = 0.5;
    elseif rBeliefs(3,3) == 1 % too close
        isNoisy = 1;
    end
end

%civilian
civilians_ahead = 0;
if  rBeliefs(7,3) == 1 %there is a civilian ahead
    if relativeActionOrientation == 0
        civilians_ahead = 1;

    elseif relativeActionOrientation == pi/4
        if relativeOrientation >= 0
            civilians_ahead = 0.7;
        else
            civilians_ahead = 0.4;
        end

    elseif relativeActionOrientation == -pi/4
        if relativeOrientation > 0
            civilians_ahead = 0.4;
        else
            civilians_ahead = 0.7;
        end

    elseif abs(sign(relativeActionOrientation) - sign(relativeOrientation)) == 0
        civilians_ahead = 0.2;
    end
end

estimatedBelief = [direction_toward_target;target_orientation;isTooClose;isTooFar;isSeen;isNoisy; civilians_ahead];
end

function [goalScores] = estimateGoalsAchievement(belief)
% This function returns the goalScores matrix, which contains estimated
% scores for every goal that must be achieved by the robot

%proximity goal estimation
proximity = -1 * belief(3);

%distance goal estimation
distance = -1 * belief(4);

%visibility goal estimation
visibility = -1 * belief(5);

%noise goal estimation
noise = -1 * belief(6);

%civilian goal estimation
civilians_ahead = -1 * belief(7);

goalScores = [proximity; distance; visibility; noise; civilians_ahead];
end

function [rAction] = getBestAction(rActions)
% return the best action from the list of evaluated actions
    [~,i] = max(rActions(3,:));
    rAction = rActions(1:2, i);
end
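
The listing above has no driver, so here is a hypothetical loop showing how the functions could be called. The environment step, the initial beliefs and the reward are placeholders; the original project's simulator is not included.

% Hypothetical driver loop for the DBDN functions above.
rBeliefs   = zeros(7, 3);                     % 3 timesteps of 7-element beliefs
rPercepts  = zeros(2, 3);                     % 3 timesteps of [v; w] percepts
rUtility   = [0.3; 0.15; 0.3; 0.1; 0.15];     % initial goal weights (assumed)
lastAction = [1; 0];                          % walk straight ahead
rReward    = 0;                               % no penalty yet

for t = 1:100
    [rAction, rUtility] = DBDN(rBeliefs, rPercepts, rReward, rUtility, lastAction);
    % ... apply rAction in your simulator, observe newBelief, newPercept and
    % rReward, then shift the histories so the newest column stays third:
    % rBeliefs  = [rBeliefs(:,2:3),  newBelief];
    % rPercepts = [rPercepts(:,2:3), newPercept];
    lastAction = rAction;
end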

And that’s it. Enjoy!