content/tutorial-deep-reinforcement-learning-with-pong-from-pixels.md
+32 −32
@@ -77,31 +77,31 @@ You will train your Pong agent through an "on-policy" method using policy gradie
**1.** First, you should install OpenAI Gym (using `pip install gym[atari]` - this package is currently not available on conda), and import NumPy, Gym and the necessary modules:
-```{code-cell}
+```python
import numpy as np
import gym
```
Gym can monitor and save the output using the `Monitor` wrapper:
-```{code-cell}
+```python
from gym import wrappers
from gym.wrappers import Monitor
```
**2.** Instantiate a Gym environment for the game of Pong:
-```{code-cell}
+```python
env = gym.make("Pong-v0")
```
**3.** Let's review which actions are available in the `Pong-v0` environment:
-```{code-cell}
+```python
print(env.action_space)
```
-```{code-cell}
+```python
print(env.get_action_meanings())
```
@@ -111,7 +111,7 @@ For simplicity, your policy network will have one output — a (log) probability
**4.** Gym can save videos of the agent's learning in an MP4 format — wrap `Monitor()` around the environment by running the following:
-```{code-cell}
+```python
env = Monitor(env, "./video", force=True)
```
@@ -127,7 +127,7 @@ Pong screen frames are 210x160 pixels over 3 color dimensions (red, green and bl
**1.** Check Pong's observations:
-```{code-cell}
+```python
print(env.observation_space)
```
@@ -143,7 +143,7 @@ In Gym, the agent's actions and observations can be part of the `Box` (n-dimensi
(You can refer to the OpenAI Gym core [API](https://github.com/openai/gym/blob/master/gym/core.py) for more information about Gym's core classes and methods.)
-```{code-cell}
+```python
import matplotlib.pyplot as plt
env.seed(42)
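# The rest of this cell is elided from the diff hunk; a hedged sketch of how
# the first observation is typically rendered with Matplotlib (exact calls are
# an assumption, not the file's verbatim content):
observation = env.reset()
plt.imshow(observation)
plt.show()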
@@ -157,7 +157,7 @@ To feed the observations into the policy (neural) network, you need to convert t
**3.** Set up a helper function for frame (observation) preprocessing:
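A hedged sketch of what such a preprocessing helper can look like — the name `frame_preprocessing` and the exact crop/color values are assumptions following Andrej Karpathy's Pong example, since the actual cell is elided from this hunk:

```python
def frame_preprocessing(I):
    """Crop, downsample, and binarize a 210x160x3 frame into an 80x80 array."""
    I = I[35:195]  # Crop out the scoreboard and the bottom border.
    I = I[::2, ::2, 0]  # Downsample by a factor of 2; keep one color channel.
    I[I == 144] = 0  # Erase the background (type 1).
    I[I == 109] = 0  # Erase the background (type 2).
    I[I != 0] = 1  # Set the remaining elements (paddles and ball) to 1.
    return I.astype(float)
```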
@@ -193,42 +193,42 @@ Next, you will define the policy as a simple feedforward network that uses a gam
Start by creating a random number generator instance for the experiment
(seeded for reproducibility):
-```{code-cell}
+```python
rng = np.random.default_rng(seed=12288743)
```
Then:
- Set the input (observation) dimensionality - your preprocessed screen frames:
-```{code-cell}
+```python
D = 80 * 80
```
- Set the number of hidden layer neurons.
-```{code-cell}
+```python
H = 200
```
- Instantiate your policy (neural) network model as an empty dictionary.
-```{code-cell}
+```python
model = {}
```
In a neural network, _weights_ are important adjustable parameters that the network fine-tunes by forward and backward propagating the data.
**2.** Using a technique called [Xavier initialization](https://www.deeplearning.ai/ai-notes/initialization/#IV), set up the network model's initial weights with NumPy's [`Generator.standard_normal()`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.standard_normal.html) that returns random numbers over a standard Normal distribution, as well as [`np.sqrt()`](https://numpy.org/doc/stable/reference/generated/numpy.sqrt.html?highlight=numpy.sqrt#numpy.sqrt):
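A hedged sketch of what this initialization can look like, using the `rng`, `D`, and `H` values defined above (the actual cell is elided from this hunk):

```python
# Xavier-style scaling: divide standard-normal weights by the square root of
# the layer's input dimension.
model["W1"] = rng.standard_normal(size=(H, D)) / np.sqrt(D)  # Input-to-hidden weights.
model["W2"] = rng.standard_normal(size=H) / np.sqrt(H)  # Hidden-to-output weights.
```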
**3.** Your policy network starts by randomly initializing the weights and feeds the input data (frames) forward from the input layer through a hidden layer to the output layer. This process is called the _forward pass_ or _forward propagation_, and is outlined in the function `policy_forward()`:
-```{code-cell}
+```python
def policy_forward(x, model):
    # Matrix-multiply the weights by the input in the one and only hidden layer.
    h = np.dot(model["W1"], x)
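    # The remaining lines of this forward pass are elided from this hunk; a
    # hedged sketch of how it typically continues (ReLU, then a sigmoid output):
    h[h < 0] = 0  # Apply the ReLU non-linearity to the hidden activations.
    logit = np.dot(model["W2"], h)  # Project the hidden state to a single logit.
    p = sigmoid(logit)  # Probability of moving the paddle up.
    return p, h  # Return the probability and the hidden state for backprop.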
@@ -251,7 +251,7 @@ Note that there are two _activation functions_ for determining non-linear relati
**4.** Define the sigmoid function separately with NumPy's [`np.exp()`](https://numpy.org/doc/stable/reference/generated/numpy.exp.html?highlight=numpy.exp#numpy.exp) for computing exponentials:
-```{code-cell}
+```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
```
@@ -262,7 +262,7 @@ During learning in your deep RL algorithm, you use the action log probabilities
**1.** Let's define the backward pass function (`policy_backward()`) with the help of NumPy's modules for array multiplication — [`np.dot()`](https://numpy.org/doc/stable/reference/generated/numpy.dot.html?highlight=numpy.dot#numpy.dot) (matrix multiplication), [`np.outer()`](https://numpy.org/doc/stable/reference/generated/numpy.outer.html) (outer product computation), and [`np.ravel()`](https://numpy.org/doc/stable/reference/generated/numpy.ravel.html) (to flatten arrays into 1D arrays):
-```{code-cell}
+```python
def policy_backward(eph, epdlogp, model):
    dW2 = np.dot(eph.T, epdlogp).ravel()
    dh = np.outer(epdlogp, model["W2"])
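    # The rest of the backward pass is elided from this hunk; a hedged sketch,
    # assuming the stacked episode observations are available as `epx`
    # (collected in the training loop):
    dh[eph <= 0] = 0  # Backpropagate through the ReLU non-linearity.
    dW1 = np.dot(dh.T, epx)  # Gradient for the input-to-hidden weights.
    return {"W1": dW1, "W2": dW2}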
@@ -276,7 +276,7 @@ Using the intermediate hidden "states" of the network (`eph`) and the gradients
**2.** When applying backpropagation during agent training, you will need to save several variables for each episode. Let's instantiate empty lists to store them:
-```{code-cell}
+```python
# All preprocessed observations for the episode.
xs = []
# All hidden "states" (from the network) for the episode.
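hs = []
# The remaining buffers in this cell are elided from the hunk; a hedged guess
# is that they hold the action log-probability gradients and the rewards:
dlogps = []
drs = []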
@@ -292,21 +292,21 @@ You will reset these variables manually at the end of each episode during traini
**3.** Next, to perform gradient ascent when optimizing the agent's policy, it is common to use deep learning _optimizers_ (you're performing optimization with gradients). In this example, you'll use [RMSProp](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#RMSProp) — an adaptive optimization [method](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf). Let's set a discounting factor — a decay rate — for the optimizer:
-```{code-cell}
+```python
decay_rate = 0.99
```
**4.** You will also need to store the gradients (with the help of NumPy's [`np.zeros_like()`](https://numpy.org/doc/stable/reference/generated/numpy.zeros_like.html)) for the optimization step during training:
- First, save the update buffers that add up gradients over a batch:
-```{code-cell}
+```python
grad_buffer = {k: np.zeros_like(v) for k, v in model.items()}
```
- Second, store the RMSProp memory for the optimizer for gradient ascent:
-```{code-cell}
+```python
rmsprop_cache = {k: np.zeros_like(v) for k, v in model.items()}
```
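As a rough, hedged sketch of how these two buffers are typically consumed at the end of a batch — the actual update happens later in the training loop, and `learning_rate` is only defined in the training setup below:

```python
for k, v in model.items():
    g = grad_buffer[k]  # Gradient summed over the batch.
    # Exponentially decaying average of squared gradients (the RMSProp memory).
    rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
    # Gradient-ascent step with an adaptive, per-parameter step size.
    model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
    grad_buffer[k] = np.zeros_like(v)  # Reset the batch buffer.
```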
@@ -316,7 +316,7 @@ In this section, you will set up a function for computing discounted rewards (`d
To provide more weight to shorter-term rewards over longer-term ones, you will use a _discount factor_ (gamma) that is often a floating-point number between 0.9 and 0.99.
-```{code-cell}
+```python
gamma = 0.99
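# The rest of this cell is elided from the hunk; a hedged sketch of the
# discounted-rewards helper it builds up to (name and signature assumed):
def discount_rewards(r, gamma):
    discounted_r = np.zeros_like(r)
    running_add = 0
    # Walk backwards through the episode's rewards.
    for t in reversed(range(r.size)):
        if r[t] != 0:
            # Pong returns a non-zero reward only when a point is scored,
            # so reset the running sum at each game boundary.
            running_add = 0
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r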
@@ -363,48 +363,48 @@ You can stop the training at any time or/and check saved MP4 videos of saved pla
**1.** For demo purposes, let's limit the number of episodes for training to 3. If you are using hardware acceleration (CPUs and GPUs), you can increase the number to 1,000 or beyond. For comparison, Andrej Karpathy's original experiment took about 8,000 episodes.
-```{code-cell}
+```python
max_episodes = 3
```
**2.** Set the batch size and the learning rate values:
- The _batch size_ dictates how often (in episodes) the model performs a parameter update. It is the number of times your agent can collect the state-action trajectories. At the end of the collection, you can perform the maximization of action-probability multiples.
- The [_learning rate_](https://en.wikipedia.org/wiki/Learning_rate) helps limit the magnitude of weight updates to prevent them from overcorrecting.
-```{code-cell}
+```python
batch_size = 3
learning_rate = 1e-4
```
**3.** Set the game rendering default variable for Gym's `render` method (it is used to display the observation and is optional but can be useful during debugging):
-```{code-cell}
+```python
render = False
```
**4.** Set the agent's initial (random) observation by calling `reset()`:
-```{code-cell}
+```python
observation = env.reset()
```
**5.** Initialize the previous observation:
-```{code-cell}
+```python
prev_x = None
```
**6.** Initialize the reward variables and the episode count:
-```{code-cell}
+```python
running_reward = None
reward_sum = 0
episode_number = 0
```
**7.** To simulate motion between the frames, set the single input frame (`x`) for the policy network as the difference between the current and previous preprocessed frames:
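A hedged sketch of that frame-differencing step, assuming the preprocessing helper from step 3 is named `frame_preprocessing` and returns an 80x80 array (the actual cell is elided from this hunk):

```python
cur_x = frame_preprocessing(observation).ravel()  # Flatten to a 6400-element vector.
x = cur_x - prev_x if prev_x is not None else np.zeros(D)  # Difference of consecutive frames.
prev_x = cur_x  # Remember the current frame for the next step.
```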
**8.** Finally, start the training loop, using the functions you have predefined:
-```{code-cell}
+```python
:tags: [output_scroll]
while episode_number < max_episodes:
@@ -546,7 +546,7 @@ A few notes:
- If you have previously run an experiment and want to repeat it, your `Monitor` instance may still be running, which may throw an error the next time you try to train the agent. Therefore, you should first shut down `Monitor` by uncommenting and running `env.close()` in the cell below:
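A minimal sketch of that cleanup cell, kept commented out as the note above describes:

```python
# Uncomment to shut down the Monitor wrapper and finalize the saved videos:
# env.close()
```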