Commit 2e99c20

Merge pull request #129 from dlab-berkeley/kz
TPOT updates
2 parents a9c1707 + 803960c commit 2e99c20

5 files changed: +80 -95 lines changed

4 Unsupervised Machine Learning and TPOT/4-2 TPOT/4-2 TPOT.ipynb

Lines changed: 57 additions & 50 deletions
@@ -42,29 +42,6 @@
 "<br>"
 ]
 },
-{
-"cell_type": "code",
-"execution_count": 1,
-"metadata": {},
-"outputs": [
-{
-"name": "stderr",
-"output_type": "stream",
-"text": [
-"/Users/Dora/anaconda3/envs/CSS2/lib/python3.11/site-packages/tpot/builtins/__init__.py:36: UserWarning: Warning: optional dependency `torch` is not available. - skipping import of NN models.\n",
-" warnings.warn(\"Warning: optional dependency `torch` is not available. - skipping import of NN models.\")\n"
-]
-}
-],
-"source": [
-"import pandas as pd\n",
-"import numpy as np\n",
-"from sklearn.preprocessing import LabelBinarizer\n",
-"from tpot import TPOTRegressor\n",
-"from tpot import TPOTClassifier\n",
-"from sklearn.model_selection import train_test_split"
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -73,16 +50,18 @@
 "\n",
 "Moreover, testing out many different models, along with many different combinations of parameters, could be extremely time consuming and impractical. \n",
 "\n",
-"[TPOT](http://epistasislab.github.io/tpot/) is a tool that automates the model selection and hyperparameter tuning process using [genetic programming](https://en.wikipedia.org/wiki/Genetic_programming). Genetic Programming is a strategy for moving from a population of poorly fit models to a population of well-fit models. TPOT also determines what preprocessing, if any, is necessary, such as PCA or standard scaling. It then exports this model to a file with the scikit-learn code written for you. \n",
+"[TPOT](http://epistasislab.github.io/tpot/) is a tool that automates the model selection and hyperparameter tuning process using [genetic programming](https://en.wikipedia.org/wiki/Genetic_programming). Genetic Programming is a strategy for moving from a population of poorly fit models to a population of well-fit models. The intuition behind genetic programming is that it leverages the theory of [natural selection](https://en.wikipedia.org/wiki/Natural_selection) to more quickly find the optimal model fit. A helpful metaphor for explaining this could be the following: \n",
 "\n",
-"Although it is in your best interest to learn as much about the theory behind machine learning as possible, tools like TPOT can theoretically do the work for you. \n",
+"Imagine you’re trying to build the best paper airplane ever. You make a bunch of paper airplanes (these are like \"programs\" or \"models\" in our case). Then you test them to see which one flies the farthest (this is called \"fitness\"). The best ones are saved, and you use them to create new airplanes by mixing their designs or making small changes (this is like \"mutation\" and \"crossover\" in genetics). You keep repeating this process (making, testing, and improving planes) until you have an airplane that flies super far. This is kind of how genetic programming works, except instead of paper airplanes, it’s creating computer programs to solve problems.\n",
+"\n",
+"TPOT also determines what preprocessing, if any, is necessary, such as PCA or standard scaling. It then exports this model to a file with the scikit-learn code written for you. Although it is in your best interest to learn as much about the theory behind machine learning as possible, tools like TPOT can theoretically do the work for you. \n",
 "\n",
 "TPOT can be used for both classification and regression. First let's install tpot:"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 2,
+"execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -92,9 +71,18 @@
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": 2,
 "metadata": {},
-"outputs": [],
+"outputs": [
+{
+"name": "stderr",
+"output_type": "stream",
+"text": [
+"/opt/anaconda3/envs/CSS/lib/python3.12/site-packages/tpot/builtins/__init__.py:36: UserWarning: Warning: optional dependency `torch` is not available. - skipping import of NN models.\n",
+" warnings.warn(\"Warning: optional dependency `torch` is not available. - skipping import of NN models.\")\n"
+]
+}
+],
 "source": [
 "# import libraries\n",
 "import pandas as pd\n",
@@ -121,7 +109,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 4,
+"execution_count": 3,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -169,7 +157,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 5,
+"execution_count": 4,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -189,19 +177,19 @@
 "- **Generations**: The number of iterations that TPOT will go through to search for the best algorithm\n",
 "- **Population_Size**: The number of possible solutions that TPOT will evaluate\n",
 "\n",
-"By default, TPOT uses 100 generations and 100 population size. The number of configurations it searches through is defined by generations * population_size, so by default it will search 10,000 different models. The more models you let it search through, the better your ultimate prediction will be. Here we initialize the model with just 2 generations and 2 population:"
+"By default, TPOT uses 100 generations and 100 population size. Note the nod to genetics with the parameter names (*generations* and *population_size*). The number of configurations it searches through is defined by generations * population_size, so by default it will search 10,000 different models. The more models you let it search through, the better your ultimate prediction will be. Here we initialize the model with just 2 generations and 2 population:"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 6,
+"execution_count": 5,
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"0.8347868812185235\n"
+"0.85837120746837\n"
 ]
 }
 ],
@@ -213,7 +201,8 @@
 "# specify TPOT\n",
 "# ----------\n",
 "tpot = TPOTClassifier(generations=2, # set the number of iterations \n",
-" population_size=2) # set number of models\n",
+" population_size=2, # set number of models\n",
+" random_state = 1) # set random seed\n",
 "\n",
 "# fit to training data\n",
 "# ----------\n",
@@ -238,13 +227,13 @@
 },
 {
 "cell_type": "code",
-"execution_count": 7,
+"execution_count": 6,
 "metadata": {},
 "outputs": [],
 "source": [
 "# Mac users:\n",
 "# ----------\n",
-"#!cat tpot_iris_pipeline.py\n",
+"#!cat tpot_census_pipeline.py\n",
 "\n",
 "# Windows users:\n",
 "# ----------\n",
@@ -267,7 +256,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 8,
+"execution_count": 7,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -290,7 +279,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 9,
+"execution_count": 8,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -312,7 +301,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 10,
+"execution_count": 9,
 "metadata": {},
 "outputs": [
 {
@@ -464,7 +453,7 @@
 "4 0.226957 0.229270 0.436957 0.186900 82 1518 1600 "
 ]
 },
-"execution_count": 10,
+"execution_count": 9,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -482,14 +471,30 @@
 },
 {
 "cell_type": "code",
-"execution_count": 11,
+"execution_count": 10,
 "metadata": {},
 "outputs": [
+{
+"name": "stderr",
+"output_type": "stream",
+"text": [
+"/var/folders/sm/vmwg2qqj01xd1c_lk1nq88sm0000gn/T/ipykernel_8574/1846636362.py:17: FutureWarning: Series.ravel is deprecated. The underlying array is already 1D, so ravel is not necessary. Use `to_numpy()` for conversion to a numpy array instead.\n",
+" y_bike_train.ravel())\n"
+]
+},
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"0.9077297591798615\n"
+"0.8255022864905334\n"
+]
+},
+{
+"name": "stderr",
+"output_type": "stream",
+"text": [
+"/var/folders/sm/vmwg2qqj01xd1c_lk1nq88sm0000gn/T/ipykernel_8574/1846636362.py:22: FutureWarning: Series.ravel is deprecated. The underlying array is already 1D, so ravel is not necessary. Use `to_numpy()` for conversion to a numpy array instead.\n",
+" y_bike_test.ravel()))\n"
 ]
 }
 ],
@@ -502,7 +507,8 @@
 "# ----------\n",
 "tpot = TPOTRegressor(generations=2, # set the number of iterations\n",
 " population_size=2, # set number of models\n",
-" scoring='r2') # set scoring to r2\n",
+" scoring='r2', # set scoring to r2\n",
+" random_state = 2) # set random seed\n",
 "\n",
 "\n",
 "\n",
@@ -523,7 +529,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 12,
+"execution_count": 11,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -552,14 +558,14 @@
 },
 {
 "cell_type": "code",
-"execution_count": 13,
+"execution_count": 12,
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"0.7080045095828637\n"
+"0.7069160997732427\n"
 ]
 }
 ],
@@ -570,9 +576,10 @@
 "\n",
 "# specify TPOT\n",
 "# ----------\n",
-"tpot = TPOTClassifier(generations=5, # set the number of iterations\n",
-" population_size=5, # set number of models\n",
-" scoring = 'f1') # set scoring to f1\n",
+"tpot = TPOTClassifier(generations=5, # play with the number of iterations\n",
+" population_size=5, # play with the number of models\n",
+" scoring = 'f1', # set scoring to f1\n",
+" random_state = 3) # set random seed\n",
 "\n",
 "# fit to training data\n",
 "# ----------\n",
@@ -608,7 +615,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.11.5"
+"version": "3.12.4"
 },
 "toc": {
 "base_numbering": 1,
Lines changed: 10 additions & 8 deletions
@@ -1,22 +1,24 @@
 import numpy as np
 import pandas as pd
-from sklearn.ensemble import RandomForestRegressor
+from sklearn.ensemble import AdaBoostRegressor
 from sklearn.model_selection import train_test_split
-from sklearn.pipeline import make_pipeline, make_union
-from sklearn.tree import DecisionTreeRegressor
-from tpot.builtins import StackingEstimator
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import MaxAbsScaler
+from tpot.export_utils import set_param_recursive
 
 # NOTE: Make sure that the outcome column is labeled 'target' in the data file
 tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
 features = tpot_data.drop('target', axis=1)
 training_features, testing_features, training_target, testing_target = \
-            train_test_split(features, tpot_data['target'], random_state=None)
+            train_test_split(features, tpot_data['target'], random_state=2)
 
-# Average CV score on the training set was: 0.7887576294742148
+# Average CV score on the training set was: 0.8072309267479042
 exported_pipeline = make_pipeline(
-    StackingEstimator(estimator=DecisionTreeRegressor(max_depth=8, min_samples_leaf=4, min_samples_split=10)),
-    RandomForestRegressor(bootstrap=True, max_features=0.2, min_samples_leaf=17, min_samples_split=17, n_estimators=100)
+    MaxAbsScaler(),
+    AdaBoostRegressor(learning_rate=0.5, loss="square", n_estimators=100)
 )
+# Fix random state for all the steps in exported pipeline
+set_param_recursive(exported_pipeline.steps, 'random_state', 2)
 
 exported_pipeline.fit(training_features, training_target)
 results = exported_pipeline.predict(testing_features)

4 Unsupervised Machine Learning and TPOT/4-2 TPOT/tpot_census_pipeline.py

Lines changed: 6 additions & 3 deletions
@@ -7,10 +7,13 @@
 tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
 features = tpot_data.drop('target', axis=1)
 training_features, testing_features, training_target, testing_target = \
-            train_test_split(features, tpot_data['target'], random_state=None)
+            train_test_split(features, tpot_data['target'], random_state=1)
 
-# Average CV score on the training set was: 0.8330876330876331
-exported_pipeline = XGBClassifier(learning_rate=0.01, max_depth=7, min_child_weight=11, n_estimators=100, n_jobs=1, subsample=0.1, verbosity=0)
+# Average CV score on the training set was: 0.8623259623259623
+exported_pipeline = XGBClassifier(learning_rate=1.0, max_depth=4, min_child_weight=7, n_estimators=100, n_jobs=1, subsample=1.0, verbosity=0)
+# Fix random state in exported estimator
+if hasattr(exported_pipeline, 'random_state'):
+    setattr(exported_pipeline, 'random_state', 1)
 
 exported_pipeline.fit(training_features, training_target)
 results = exported_pipeline.predict(testing_features)
Lines changed: 7 additions & 12 deletions
@@ -1,24 +1,19 @@
 import numpy as np
 import pandas as pd
-from sklearn.ensemble import GradientBoostingClassifier
 from sklearn.model_selection import train_test_split
-from sklearn.pipeline import make_pipeline, make_union
-from sklearn.preprocessing import MinMaxScaler
-from sklearn.svm import LinearSVC
-from tpot.builtins import StackingEstimator
+from xgboost import XGBClassifier
 
 # NOTE: Make sure that the outcome column is labeled 'target' in the data file
 tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
 features = tpot_data.drop('target', axis=1)
 training_features, testing_features, training_target, testing_target = \
-            train_test_split(features, tpot_data['target'], random_state=None)
+            train_test_split(features, tpot_data['target'], random_state=3)
 
-# Average CV score on the training set was: 0.7022602797118238
-exported_pipeline = make_pipeline(
-    MinMaxScaler(),
-    StackingEstimator(estimator=LinearSVC(C=0.0001, dual=True, loss="hinge", penalty="l2", tol=0.0001)),
-    GradientBoostingClassifier(learning_rate=1.0, max_depth=3, max_features=0.15000000000000002, min_samples_leaf=9, min_samples_split=5, n_estimators=100, subsample=0.8500000000000001)
-)
+# Average CV score on the training set was: 0.7125312155588841
+exported_pipeline = XGBClassifier(learning_rate=0.1, max_depth=6, min_child_weight=2, n_estimators=100, n_jobs=1, subsample=0.75, verbosity=0)
+# Fix random state in exported estimator
+if hasattr(exported_pipeline, 'random_state'):
+    setattr(exported_pipeline, 'random_state', 3)
 
 exported_pipeline.fit(training_features, training_target)
 results = exported_pipeline.predict(testing_features)

4 Unsupervised Machine Learning and TPOT/4-2 TPOT/tpot_iris_pipeline.py

Lines changed: 0 additions & 22 deletions
This file was deleted.
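
Note on the notebook changes: the paper-airplane metaphor and the new `random_state` arguments can be illustrated with a toy evolutionary search. The sketch below is not TPOT's implementation and does not evolve scikit-learn pipelines; it is a minimal genetic algorithm over plain numbers, and the names `fitness` and `evolve`, the target value 3.0, and the mutation scale are all hypothetical choices made for illustration. It shows the select, crossover, mutate loop and why fixing a seed makes repeated runs return the same result.

```python
import random


def fitness(x):
    # Hypothetical "flight distance": the best possible design sits at x = 3.0.
    return -(x - 3.0) ** 2


def evolve(generations=20, population_size=10, seed=1):
    # A seeded generator plays the role that random_state plays in the diff:
    # the same seed yields the same sequence of candidates.
    rng = random.Random(seed)
    population = [rng.uniform(-10, 10) for _ in range(population_size)]

    for _ in range(generations):
        # Selection: keep the better half of the current designs.
        population.sort(key=fitness, reverse=True)
        parents = population[: max(2, population_size // 2)]

        # Refill the population with children of two random parents.
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            child = (a + b) / 2          # crossover: mix two designs
            child += rng.gauss(0, 0.5)   # mutation: small random change
            children.append(child)

        population = parents + children

    return max(population, key=fitness)


best = evolve(generations=20, population_size=10, seed=1)
print(best, fitness(best))
```

Because the parents are carried over unchanged, the best design is never lost between generations, so more generations can only match or improve the best fitness for a given seed. This guarantee belongs to this toy loop; it is not a claim about TPOT's internals.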
