|
42 | 42 | "<br>" |
43 | 43 | ] |
44 | 44 | }, |
45 | | - { |
46 | | - "cell_type": "code", |
47 | | - "execution_count": 1, |
48 | | - "metadata": {}, |
49 | | - "outputs": [ |
50 | | - { |
51 | | - "name": "stderr", |
52 | | - "output_type": "stream", |
53 | | - "text": [ |
54 | | - "/Users/Dora/anaconda3/envs/CSS2/lib/python3.11/site-packages/tpot/builtins/__init__.py:36: UserWarning: Warning: optional dependency `torch` is not available. - skipping import of NN models.\n", |
55 | | - " warnings.warn(\"Warning: optional dependency `torch` is not available. - skipping import of NN models.\")\n" |
56 | | - ] |
57 | | - } |
58 | | - ], |
59 | | - "source": [ |
60 | | - "import pandas as pd\n", |
61 | | - "import numpy as np\n", |
62 | | - "from sklearn.preprocessing import LabelBinarizer\n", |
63 | | - "from tpot import TPOTRegressor\n", |
64 | | - "from tpot import TPOTClassifier\n", |
65 | | - "from sklearn.model_selection import train_test_split" |
66 | | - ] |
67 | | - }, |
68 | 45 | { |
69 | 46 | "cell_type": "markdown", |
70 | 47 | "metadata": {}, |
|
73 | 50 | "\n", |
74 | 51 | "Moreover, testing out many different models, along with many different combinations of parameters, could be extremely time consuming and impractical. \n", |
75 | 52 | "\n", |
76 | | - "[TPOT](http://epistasislab.github.io/tpot/) is a tool that automates the model selection and hyperparameter tuning process using [genetic programming](https://en.wikipedia.org/wiki/Genetic_programming). Genetic Programming is a strategy for moving from a population of poorly fit models to a population of well-fit models. TPOT also determines what preprocessing, if any, is necessary, such as PCA or standard scaling. It then exports this model to a file with the scikit-learn code written for you. \n", |
| 53 | + "[TPOT](http://epistasislab.github.io/tpot/) is a tool that automates the model selection and hyperparameter tuning process using [genetic programming](https://en.wikipedia.org/wiki/Genetic_programming). Genetic Programming is a strategy for moving from a population of poorly fit models to a population of well-fit models. The intuition behind genetic programming is that it leverages the theory of [natural selection](https://en.wikipedia.org/wiki/Natural_selection) to more quickly find the optimal model fit. A helpful metaphor is the following: \n", |
77 | 54 | "\n", |
78 | | - "Although it is in your best interest to learn as much about the theory behind machine learning as possible, tools like TPOT can theoretically do the work for you. \n", |
| 55 | + "Imagine you’re trying to build the best paper airplane ever. You make a bunch of paper airplanes (these are like \"programs\" or \"models\" in our case). Then you test them to see which one flies the farthest (this is called \"fitness\"). The best ones are saved, and you use them to create new airplanes by mixing their designs or making small changes (this is like \"mutation\" and \"crossover\" in genetics). You keep repeating this process (making, testing, and improving planes) until you have an airplane that flies super far. This is kind of how genetic programming works, except instead of paper airplanes, it’s creating computer programs to solve problems.\n", |
| 56 | + "\n", |
| 57 | + "TPOT also determines what preprocessing, if any, is necessary, such as PCA or standard scaling. It then exports this model to a file with the scikit-learn code written for you. Although it is in your best interest to learn as much about the theory behind machine learning as possible, tools like TPOT can theoretically do the work for you. \n", |
79 | 58 | "\n", |
80 | 59 | "TPOT can be used for both classification and regression. First let's install tpot:" |
81 | 60 | ] |
82 | 61 | }, |
83 | 62 | { |
84 | 63 | "cell_type": "code", |
85 | | - "execution_count": 2, |
| 64 | + "execution_count": null, |
86 | 65 | "metadata": {}, |
87 | 66 | "outputs": [], |
88 | 67 | "source": [ |
|
92 | 71 | }, |
93 | 72 | { |
94 | 73 | "cell_type": "code", |
95 | | - "execution_count": 3, |
| 74 | + "execution_count": 2, |
96 | 75 | "metadata": {}, |
97 | | - "outputs": [], |
| 76 | + "outputs": [ |
| 77 | + { |
| 78 | + "name": "stderr", |
| 79 | + "output_type": "stream", |
| 80 | + "text": [ |
| 81 | + "/opt/anaconda3/envs/CSS/lib/python3.12/site-packages/tpot/builtins/__init__.py:36: UserWarning: Warning: optional dependency `torch` is not available. - skipping import of NN models.\n", |
| 82 | + " warnings.warn(\"Warning: optional dependency `torch` is not available. - skipping import of NN models.\")\n" |
| 83 | + ] |
| 84 | + } |
| 85 | + ], |
98 | 86 | "source": [ |
99 | 87 | "# import libraries\n", |
100 | 88 | "import pandas as pd\n", |
|
121 | 109 | }, |
122 | 110 | { |
123 | 111 | "cell_type": "code", |
124 | | - "execution_count": 4, |
| 112 | + "execution_count": 3, |
125 | 113 | "metadata": {}, |
126 | 114 | "outputs": [], |
127 | 115 | "source": [ |
|
169 | 157 | }, |
170 | 158 | { |
171 | 159 | "cell_type": "code", |
172 | | - "execution_count": 5, |
| 160 | + "execution_count": 4, |
173 | 161 | "metadata": {}, |
174 | 162 | "outputs": [], |
175 | 163 | "source": [ |
|
189 | 177 | "- **Generations**: The number of iterations that TPOT will go through to search for the best algorithm\n", |
190 | 178 | "- **Population_Size**: The number of possible solutions that TPOT will evaluate\n", |
191 | 179 | "\n", |
192 | | - "By default, TPOT uses 100 generations and 100 population size. The number of configurations it searches through is defined by generations * population_size, so by default it will search 10,000 different models. The more models you let it search through, the better your ultimate prediction will be. Here we initialize the model with just 2 generations and 2 population:" |
| 180 | + "By default, TPOT uses 100 generations and a population size of 100. Note the nod to genetics in the parameter names (*generations* and *population_size*). The number of configurations it searches through is defined by generations * population_size, so by default it will search 10,000 different models. The more models you let it search through, the better your ultimate prediction will be. Here we initialize the model with just 2 generations and a population size of 2:" |
193 | 181 | ] |
194 | 182 | }, |
195 | 183 | { |
196 | 184 | "cell_type": "code", |
197 | | - "execution_count": 6, |
| 185 | + "execution_count": 5, |
198 | 186 | "metadata": {}, |
199 | 187 | "outputs": [ |
200 | 188 | { |
201 | 189 | "name": "stdout", |
202 | 190 | "output_type": "stream", |
203 | 191 | "text": [ |
204 | | - "0.8347868812185235\n" |
| 192 | + "0.85837120746837\n" |
205 | 193 | ] |
206 | 194 | } |
207 | 195 | ], |
|
213 | 201 | "# specify TPOT\n", |
214 | 202 | "# ----------\n", |
215 | 203 | "tpot = TPOTClassifier(generations=2, # set the number of iterations \n", |
216 | | - " population_size=2) # set number of models\n", |
| 204 | + " population_size=2, # set number of models\n", |
| 205 | + " random_state = 1) # set random seed\n", |
217 | 206 | "\n", |
218 | 207 | "# fit to training data\n", |
219 | 208 | "# ----------\n", |
|
238 | 227 | }, |
239 | 228 | { |
240 | 229 | "cell_type": "code", |
241 | | - "execution_count": 7, |
| 230 | + "execution_count": 6, |
242 | 231 | "metadata": {}, |
243 | 232 | "outputs": [], |
244 | 233 | "source": [ |
245 | 234 | "# Mac users:\n", |
246 | 235 | "# ----------\n", |
247 | | - "#!cat tpot_iris_pipeline.py\n", |
| 236 | + "#!cat tpot_census_pipeline.py\n", |
248 | 237 | "\n", |
249 | 238 | "# Windows users:\n", |
250 | 239 | "# ----------\n", |
|
267 | 256 | }, |
268 | 257 | { |
269 | 258 | "cell_type": "code", |
270 | | - "execution_count": 8, |
| 259 | + "execution_count": 7, |
271 | 260 | "metadata": {}, |
272 | 261 | "outputs": [], |
273 | 262 | "source": [ |
|
290 | 279 | }, |
291 | 280 | { |
292 | 281 | "cell_type": "code", |
293 | | - "execution_count": 9, |
| 282 | + "execution_count": 8, |
294 | 283 | "metadata": {}, |
295 | 284 | "outputs": [], |
296 | 285 | "source": [ |
|
312 | 301 | }, |
313 | 302 | { |
314 | 303 | "cell_type": "code", |
315 | | - "execution_count": 10, |
| 304 | + "execution_count": 9, |
316 | 305 | "metadata": {}, |
317 | 306 | "outputs": [ |
318 | 307 | { |
|
464 | 453 | "4 0.226957 0.229270 0.436957 0.186900 82 1518 1600 " |
465 | 454 | ] |
466 | 455 | }, |
467 | | - "execution_count": 10, |
| 456 | + "execution_count": 9, |
468 | 457 | "metadata": {}, |
469 | 458 | "output_type": "execute_result" |
470 | 459 | } |
|
482 | 471 | }, |
483 | 472 | { |
484 | 473 | "cell_type": "code", |
485 | | - "execution_count": 11, |
| 474 | + "execution_count": 10, |
486 | 475 | "metadata": {}, |
487 | 476 | "outputs": [ |
| 477 | + { |
| 478 | + "name": "stderr", |
| 479 | + "output_type": "stream", |
| 480 | + "text": [ |
| 481 | + "/var/folders/sm/vmwg2qqj01xd1c_lk1nq88sm0000gn/T/ipykernel_8574/1846636362.py:17: FutureWarning: Series.ravel is deprecated. The underlying array is already 1D, so ravel is not necessary. Use `to_numpy()` for conversion to a numpy array instead.\n", |
| 482 | + " y_bike_train.ravel())\n" |
| 483 | + ] |
| 484 | + }, |
488 | 485 | { |
489 | 486 | "name": "stdout", |
490 | 487 | "output_type": "stream", |
491 | 488 | "text": [ |
492 | | - "0.9077297591798615\n" |
| 489 | + "0.8255022864905334\n" |
| 490 | + ] |
| 491 | + }, |
| 492 | + { |
| 493 | + "name": "stderr", |
| 494 | + "output_type": "stream", |
| 495 | + "text": [ |
| 496 | + "/var/folders/sm/vmwg2qqj01xd1c_lk1nq88sm0000gn/T/ipykernel_8574/1846636362.py:22: FutureWarning: Series.ravel is deprecated. The underlying array is already 1D, so ravel is not necessary. Use `to_numpy()` for conversion to a numpy array instead.\n", |
| 497 | + " y_bike_test.ravel()))\n" |
493 | 498 | ] |
494 | 499 | } |
495 | 500 | ], |
|
502 | 507 | "# ----------\n", |
503 | 508 | "tpot = TPOTRegressor(generations=2, # set the number of iterations\n", |
504 | 509 | " population_size=2, # set number of models\n", |
505 | | - " scoring='r2') # set scoring to r2\n", |
| 510 | + " scoring='r2', # set scoring to r2\n", |
| 511 | + " random_state = 2) # set random seed\n", |
506 | 512 | "\n", |
507 | 513 | "\n", |
508 | 514 | "\n", |
|
523 | 529 | }, |
524 | 530 | { |
525 | 531 | "cell_type": "code", |
526 | | - "execution_count": 12, |
| 532 | + "execution_count": 11, |
527 | 533 | "metadata": {}, |
528 | 534 | "outputs": [], |
529 | 535 | "source": [ |
|
552 | 558 | }, |
553 | 559 | { |
554 | 560 | "cell_type": "code", |
555 | | - "execution_count": 13, |
| 561 | + "execution_count": 12, |
556 | 562 | "metadata": {}, |
557 | 563 | "outputs": [ |
558 | 564 | { |
559 | 565 | "name": "stdout", |
560 | 566 | "output_type": "stream", |
561 | 567 | "text": [ |
562 | | - "0.7080045095828637\n" |
| 568 | + "0.7069160997732427\n" |
563 | 569 | ] |
564 | 570 | } |
565 | 571 | ], |
|
570 | 576 | "\n", |
571 | 577 | "# specify TPOT\n", |
572 | 578 | "# ----------\n", |
573 | | - "tpot = TPOTClassifier(generations=5, # set the number of iterations\n", |
574 | | - " population_size=5, # set number of models\n", |
575 | | - " scoring = 'f1') # set scoring to f1\n", |
| 579 | + "tpot = TPOTClassifier(generations=5, # play with the number of iterations\n", |
| 580 | + " population_size=5, # play with the number of models\n", |
| 581 | + " scoring = 'f1', # set scoring to f1\n", |
| 582 | + " random_state = 3) # set random seed\n", |
576 | 583 | "\n", |
577 | 584 | "# fit to training data\n", |
578 | 585 | "# ----------\n", |
|
608 | 615 | "name": "python", |
609 | 616 | "nbconvert_exporter": "python", |
610 | 617 | "pygments_lexer": "ipython3", |
611 | | - "version": "3.11.5" |
| 618 | + "version": "3.12.4" |
612 | 619 | }, |
613 | 620 | "toc": { |
614 | 621 | "base_numbering": 1, |
|
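The paper-airplane metaphor added in the markdown cell above can be sketched as a tiny evolutionary loop in plain Python. This is a toy illustration of the select-mutate-repeat cycle, not TPOT's actual pipeline search (TPOT's "individuals" are whole scikit-learn pipelines and its fitness is a cross-validated score); the fitness function and all names here are made up for the example:

```python
import random

random.seed(0)

# Toy problem: find the x that maximizes fitness(x) = -(x - 3)**2.
# Each "airplane" is just a number; the best possible design is x = 3.
def fitness(x):
    return -(x - 3) ** 2

# Start with a random population of designs (population_size = 10).
population = [random.uniform(-10, 10) for _ in range(10)]

for generation in range(20):  # generations = 20
    # Test every design and keep the best half (selection).
    population.sort(key=fitness, reverse=True)
    survivors = population[: len(population) // 2]
    # Refill the population with slightly altered copies of the
    # survivors (mutation), keeping the survivors themselves (elitism).
    children = [x + random.gauss(0, 0.5) for x in survivors]
    population = survivors + children

best = max(population, key=fitness)
```

After a handful of generations the best design converges near the optimum at 3, which is the same dynamic that carries TPOT from poorly fit to well-fit models; real genetic programming also recombines pairs of survivors (crossover), which this sketch omits.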