
Commit d1ee781

Merge pull request #144 from rossbar/retire-rl-notebook
2 parents: 47fb909 + 1a848f2

File tree: 5 files changed, +33 -46 lines

.circleci/config.yml
.github/workflows/conda.yml
content/tutorial-deep-reinforcement-learning-with-pong-from-pixels.md
environment.yml
requirements.txt

.circleci/config.yml (-4)

@@ -11,10 +11,6 @@ jobs:
     steps:
       - checkout
 
-      - run:
-          name: Install deps for building atari-py
-          command: sudo apt-get update && sudo apt-get install -y cmake ffmpeg
-
       - run:
           name: Install Python dependencies
           command: |

.github/workflows/conda.yml (+1 -2)

@@ -14,8 +14,7 @@ jobs:
 
     strategy:
      matrix:
-       # NOTE: Gym/atari deps need to be solved for this to work on windows
-       os: [ubuntu, macos] #, windows]
+       os: [ubuntu, macos, windows]
 
    defaults:
      run:

content/tutorial-deep-reinforcement-learning-with-pong-from-pixels.md (+32 -32)

@@ -77,31 +77,31 @@ You will train your Pong agent through an "on-policy" method using policy gradie
 
 **1.** First, you should install OpenAI Gym (using `pip install gym[atari]` - this package is currently not available on conda), and import NumPy, Gym and the necessary modules:
 
-```{code-cell}
+```python
 import numpy as np
 import gym
 ```
 
 Gym can monitor and save the output using the `Monitor` wrapper:
 
-```{code-cell}
+```python
 from gym import wrappers
 from gym.wrappers import Monitor
 ```
 
 **2.** Instantiate a Gym environment for the game of Pong:
 
-```{code-cell}
+```python
 env = gym.make("Pong-v0")
 ```
 
 **3.** Let's review which actions are available in the `Pong-v0` environment:
 
-```{code-cell}
+```python
 print(env.action_space)
 ```
 
-```{code-cell}
+```python
 print(env.get_action_meanings())
 ```
 
@@ -111,7 +111,7 @@ For simplicity, your policy network will have one output — a (log) probability
 
 **4.** Gym can save videos of the agent's learning in an MP4 format — wrap `Monitor()` around the environment by running the following:
 
-```{code-cell}
+```python
 env = Monitor(env, "./video", force=True)
 ```
 
@@ -127,7 +127,7 @@ Pong screen frames are 210x160 pixels over 3 color dimensions (red, green and bl
 
 **1.** Check the Pong's observations:
 
-```{code-cell}
+```python
 print(env.observation_space)
 ```
 
@@ -143,7 +143,7 @@ In Gym, the agent's actions and observations can be part of the `Box` (n-dimensi
 
 (You can refer to the OpenAI Gym core [API](https://github.com/openai/gym/blob/master/gym/core.py) for more information about Gym's core classes and methods.)
 
-```{code-cell}
+```python
 import matplotlib.pyplot as plt
 
 env.seed(42)
@@ -157,7 +157,7 @@ To feed the observations into the policy (neural) network, you need to convert t
 
 **3.** Set up a helper function for frame (observation) preprocessing:
 
-```{code-cell}
+```python
 def frame_preprocessing(observation_frame):
     # Crop the frame.
     observation_frame = observation_frame[35:195]
@@ -173,7 +173,7 @@ def frame_preprocessing(observation_frame):
 
 **4.** Preprocess the random frame from earlier to test the function — the input for the policy network is an 80x80 1D image:
 
-```{code-cell}
+```python
 preprocessed_random_frame = frame_preprocessing(random_frame)
 plt.imshow(preprocessed_random_frame, cmap="gray")
 print(preprocessed_random_frame.shape)
@@ -193,42 +193,42 @@ Next, you will define the policy as a simple feedforward network that uses a gam
 Start by creating a random number generator instance for the experiment
 (seeded for reproducibility):
 
-```{code-cell}
+```python
 rng = np.random.default_rng(seed=12288743)
 ```
 
 Then:
 
 - Set the input (observation) dimensionality - your preprocessed screen frames:
 
-```{code-cell}
+```python
 D = 80 * 80
 ```
 
 - Set the number of hidden layer neurons.
 
-```{code-cell}
+```python
 H = 200
 ```
 
 - Instantiate your policy (neural) network model as an empty dictionary.
 
-```{code-cell}
+```python
 model = {}
 ```
 
 In a neural network, _weights_ are important adjustable parameters that the network fine-tunes by forward and backward propagating the data.
 
 **2.** Using a technique called [Xavier initialization](https://www.deeplearning.ai/ai-notes/initialization/#IV), set up the network model's initial weights with NumPy's [`Generator.standard_normal()`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.standard_normal.html) that returns random numbers over a standard Normal distribution, as well as [`np.sqrt()`](https://numpy.org/doc/stable/reference/generated/numpy.sqrt.html?highlight=numpy.sqrt#numpy.sqrt):
 
-```{code-cell}
+```python
 model["W1"] = rng.standard_normal(size=(H, D)) / np.sqrt(D)
 model["W2"] = rng.standard_normal(size=H) / np.sqrt(H)
 ```
 
 **3.** Your policy network starts by randomly initializing the weights and feeds the input data (frames) forward from the input layer through a hidden layer to the output layers. This process is called the _forward pass_ or _forward propagation_, and is outlined in the function `policy_forward()`:
 
-```{code-cell}
+```python
 def policy_forward(x, model):
     # Matrix-multiply the weights by the input in the one and only hidden layer.
     h = np.dot(model["W1"], x)
@@ -251,7 +251,7 @@ Note that there are two _activation functions_ for determining non-linear relati
 
 **4.** Define the sigmoid function separately with NumPy's [`np.exp()`](https://numpy.org/doc/stable/reference/generated/numpy.exp.html?highlight=numpy.exp#numpy.exp) for computing exponentials:
 
-```{code-cell}
+```python
 def sigmoid(x):
     return 1.0 / (1.0 + np.exp(-x))
 ```
@@ -262,7 +262,7 @@ During learning in your deep RL algorithm, you use the action log probabilities 
 
 **1.** Let's define the backward pass function (`policy_backward()`) with the help of NumPy's modules for array multiplication — [`np.dot()`](https://numpy.org/doc/stable/reference/generated/numpy.dot.html?highlight=numpy.dot#numpy.dot) (matrix multiplication), [`np.outer()`](https://numpy.org/doc/stable/reference/generated/numpy.outer.html) (outer product computation), and [`np.ravel()`](https://numpy.org/doc/stable/reference/generated/numpy.ravel.html) (to flatten arrays into 1D arrays):
 
-```{code-cell}
+```python
 def policy_backward(eph, epdlogp, model):
     dW2 = np.dot(eph.T, epdlogp).ravel()
     dh = np.outer(epdlogp, model["W2"])
@@ -276,7 +276,7 @@ Using the intermediate hidden "states" of the network (`eph`) and the gradients 
 
 **2.** When applying backpropagation during agent training, you will need to save several variables for each episode. Let's instantiate empty lists to store them:
 
-```{code-cell}
+```python
 # All preprocessed observations for the episode.
 xs = []
 # All hidden "states" (from the network) for the episode.
@@ -292,21 +292,21 @@ You will reset these variables manually at the end of each episode during traini
 
 **3.** Next, to perform a gradient ascent when optimizing the agent's policy, it is common to use deep learning _optimizers_ (you're performing optimization with gradients). In this example, you'll use [RMSProp](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#RMSProp) — an adaptive optimization [method](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf). Let's set a discounting factor — a decay rate — for the optimizer:
 
-```{code-cell}
+```python
 decay_rate = 0.99
 ```
 
 **4.** You will also need to store the gradients (with the help of NumPy's [`np.zeros_like()`](https://numpy.org/doc/stable/reference/generated/numpy.zeros_like.html)) for the optimization step during training:
 
 - First, save the update buffers that add up gradients over a batch:
 
-```{code-cell}
+```python
 grad_buffer = {k: np.zeros_like(v) for k, v in model.items()}
 ```
 
 - Second, store the RMSProp memory for the optimizer for gradient ascent:
 
-```{code-cell}
+```python
 rmsprop_cache = {k: np.zeros_like(v) for k, v in model.items()}
 ```
 
@@ -316,7 +316,7 @@ In this section, you will set up a function for computing discounted rewards (`d
 
 To provide more weight to shorter-term rewards over longer-term ones, you will use a _discount factor_ (gamma) that is often a floating-point number between 0.9 and 0.99.
 
-```{code-cell}
+```python
 gamma = 0.99
 
 
@@ -363,48 +363,48 @@ You can stop the training at any time or/and check saved MP4 videos of saved pla
 
 **1.** For demo purposes, let's limit the number of episodes for training to 3. If you are using hardware acceleration (CPUs and GPUs), you can increase the number to 1,000 or beyond. For comparison, Andrej Karpathy's original experiment took about 8,000 episodes.
 
-```{code-cell}
+```python
 max_episodes = 3
 ```
 
 **2.** Set the batch size and the learning rate values:
 - The _batch size_ dictates how often (in episodes) the model performs a parameter update. It is the number of times your agent can collect the state-action trajectories. At the end of the collection, you can perform the maximization of action-probability multiples.
 - The [_learning rate_](https://en.wikipedia.org/wiki/Learning_rate) helps limit the magnitude of weight updates to prevent them from overcorrecting.
 
-```{code-cell}
+```python
 batch_size = 3
 learning_rate = 1e-4
 ```
 
 **3.** Set the game rendering default variable for Gym's `render` method (it is used to display the observation and is optional but can be useful during debugging):
 
-```{code-cell}
+```python
 render = False
 ```
 
 **4.** Set the agent's initial (random) observation by calling `reset()`:
 
-```{code-cell}
+```python
 observation = env.reset()
 ```
 
 **5.** Initialize the previous observation:
 
-```{code-cell}
+```python
 prev_x = None
 ```
 
 **6.** Initialize the reward variables and the episode count:
 
-```{code-cell}
+```python
 running_reward = None
 reward_sum = 0
 episode_number = 0
 ```
 
 **7.** To simulate motion between the frames, set the single input frame (`x`) for the policy network as the difference between the current and previous preprocessed frames:
 
-```{code-cell}
+```python
 def update_input(prev_x, cur_x, D):
     if prev_x is not None:
         x = cur_x - prev_x
@@ -415,7 +415,7 @@ def update_input(prev_x, cur_x, D):
 
 **8.** Finally, start the training loop, using the functions you have predefined:
 
-```{code-cell}
+```python
 :tags: [output_scroll]
 
 while episode_number < max_episodes:
@@ -546,7 +546,7 @@ A few notes:
 
 - If you have previously run an experiment and want to repeat it, your `Monitor` instance may still be running, which may throw an error the next time you try to train the agent. Therefore, you should first shut down `Monitor` by calling `env.close()` by uncommenting and running the cell below:
 
-```{code-cell}
+```python
 # env.close()
 ```

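Every one of the 32 changed line pairs in this file follows the same pattern: a MyST `{code-cell}` directive is replaced by a plain `python` fence. Below is a minimal before/after sketch reusing the first cell from the diff above; the note that MyST-NB executes `{code-cell}` blocks when the site is built, while a plain fence is only syntax-highlighted, reflects the usual behavior of those tools and is added here for context rather than quoted from the commit.

````md
<!-- Before: a MyST-NB code cell, executed during the site build -->
```{code-cell}
import numpy as np
import gym
```

<!-- After: a static code listing, highlighted but not executed -->
```python
import numpy as np
import gym
```
````
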
environment.yml (-5)

@@ -8,14 +8,9 @@ dependencies:
   - matplotlib
   - pandas
   - statsmodels
-  - pip
   - imageio
-  - pooch
-  - ffmpeg # For gym/atari
   # For building the site
   - sphinx<5
   - myst-nb
   - sphinx-book-theme
   - sphinx-copybutton
-  - pip:
-    - gym[atari]==0.19

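For reference, a sketch of how the `dependencies:` block covered by this hunk reads after the change, pieced together only from the lines visible above; the file's first seven lines sit outside the hunk and are not reproduced here.

```yaml
dependencies:
  # Lines 1-7 of the file are not shown in the diff and are omitted here.
  - matplotlib
  - pandas
  - statsmodels
  - imageio
  # For building the site
  - sphinx<5
  - myst-nb
  - sphinx-book-theme
  - sphinx-copybutton
```
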
requirements.txt (-3)

@@ -5,8 +5,5 @@ matplotlib
 pandas
 statsmodels
 imageio
-gym==0.18.3
-atari-py==0.2.5
-pooch==1.5.1
 # For supporting .md-based notebooks
 jupytext
