Merge pull request #129 from dlab-berkeley/kz

kaseyzapatka · web-flow · commit 2e99c2053c63 · 2024-12-05T12:59:34.000-08:00
TPOT updates
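
The notebook text added in this change explains genetic programming through the paper-airplane metaphor and the `generations` / `population_size` parameters. Below is a minimal sketch of that loop, for illustration only. It is not TPOT's implementation: it evolves a single number rather than scikit-learn pipelines, and `evolve`, `toy_fitness`, and the "keep the fitter half" selection rule are hypothetical names and choices made up for this example.

```python
import random


def toy_fitness(design):
    """Hypothetical fitness: designs closer to 3.0 'fly farther'."""
    return -(design - 3.0) ** 2


def evolve(fitness, generations=2, population_size=2, seed=1):
    """Toy evolutionary search over one float 'design' parameter.

    Returns the best design found and the number of candidate
    designs that were generated along the way.
    """
    rng = random.Random(seed)  # fixed seed, playing the role of random_state
    population = [rng.uniform(-10.0, 10.0) for _ in range(population_size)]
    candidates = population_size

    for _ in range(generations):
        # Selection: keep the fitter half (at least one parent).
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:max(1, population_size // 2)]

        offspring = []
        for _ in range(population_size):
            a, b = rng.choice(parents), rng.choice(parents)
            child = (a + b) / 2.0          # crossover: blend two parents
            child += rng.gauss(0.0, 1.0)   # mutation: small random change
            offspring.append(child)
        candidates += len(offspring)

        # Survivors: best designs among the parents and their offspring.
        pool = sorted(parents + offspring, key=fitness, reverse=True)
        population = pool[:population_size]

    return max(population, key=fitness), candidates


best, n_candidates = evolve(toy_fitness, generations=5, population_size=10, seed=1)
print(best, n_candidates)
```

Because the parents are carried into the survivor pool, the best design never gets worse from one generation to the next, and reusing the same `seed` reproduces the same run, which is the motivation for the `random_state` arguments added to the notebook cells.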
diff --git a/4 Unsupervised Machine Learning and TPOT/4-2 TPOT/4-2 TPOT.ipynb b/4 Unsupervised Machine Learning and TPOT/4-2 TPOT/4-2 TPOT.ipynb
@@ -42,29 +42,6 @@
     "<br>"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "/Users/Dora/anaconda3/envs/CSS2/lib/python3.11/site-packages/tpot/builtins/__init__.py:36: UserWarning: Warning: optional dependency `torch` is not available. - skipping import of NN models.\n",
-      "  warnings.warn(\"Warning: optional dependency `torch` is not available. - skipping import of NN models.\")\n"
-     ]
-    }
-   ],
-   "source": [
-    "import pandas as pd\n",
-    "import numpy as np\n",
-    "from sklearn.preprocessing import LabelBinarizer\n",
-    "from tpot import TPOTRegressor\n",
-    "from tpot import TPOTClassifier\n",
-    "from sklearn.model_selection import train_test_split"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -73,16 +50,18 @@
     "\n",
     "Moreover, testing out many different models, along with many different combinations of parameters, could be extremely time consuming and impractical. \n",
     "\n",
-    "[TPOT](http://epistasislab.github.io/tpot/) is a tool that automates the model selection and hyperparameter tuning process using [genetic programming](https://en.wikipedia.org/wiki/Genetic_programming). Genetic Programming is a strategy for moving from a population of poorly fit models to a population of well-fit models. TPOT also determines what preprocessing, if any, is necessary, such as PCA or standard scaling. It then exports this model to a file with the scikit-learn code written for you. \n",
+    "[TPOT](http://epistasislab.github.io/tpot/) is a tool that automates the model selection and hyperparameter tuning process using [genetic programming](https://en.wikipedia.org/wiki/Genetic_programming). Genetic programming is a strategy for moving from a population of poorly fit models to a population of well-fit models. The intuition is that it borrows from the theory of [natural selection](https://en.wikipedia.org/wiki/Natural_selection) to find a well-fitting model more quickly than an exhaustive search would. A helpful metaphor is the following: \n",
     "\n",
-    "Although it is in your best interest to learn as much about the theory behind machine learning as possible, tools like TPOT can theoretically do the work for you. \n",
+    "Imagine you’re trying to build the best paper airplane ever. You make a bunch of paper airplanes (these are like \"programs\" or \"models\" in our case). Then you test them to see which one flies the farthest (this is called \"fitness\"). The best ones are saved, and you use them to create new airplanes by mixing their designs or making small changes (this is like \"crossover\" and \"mutation\" in genetics). You keep repeating this process (making, testing, and improving planes) until you have an airplane that flies super far. This is kind of how genetic programming works, except instead of paper airplanes, it’s creating computer programs to solve problems.\n",
+    "\n",
+    "TPOT also determines what preprocessing, if any, is necessary, such as PCA or standard scaling. It then exports this model to a file with the scikit-learn code written for you. Although it is in your best interest to learn as much about the theory behind machine learning as possible, tools like TPOT can theoretically do the work for you. \n",
     "\n",
     "TPOT can be used for both classification and regression. First let's install tpot:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -92,9 +71,18 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 2,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/opt/anaconda3/envs/CSS/lib/python3.12/site-packages/tpot/builtins/__init__.py:36: UserWarning: Warning: optional dependency `torch` is not available. - skipping import of NN models.\n",
+      "  warnings.warn(\"Warning: optional dependency `torch` is not available. - skipping import of NN models.\")\n"
+     ]
+    }
+   ],
    "source": [
     "# import libraries\n",
     "import pandas as pd\n",
@@ -121,7 +109,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 3,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -169,7 +157,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 4,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -189,19 +177,19 @@
     "- **Generations**: The number of iterations that TPOT will go through to search for the best algorithm\n",
     "- **Population_Size**: The number of possible solutions that TPOT will evaluate\n",
     "\n",
-    "By default, TPOT uses 100 generations and 100 population size. The number of configurations it searches through is defined by generations * population_size, so by default it will search 10,000 different models. The more models you let it search through, the better your ultimate prediction will be. Here we initialize the model with just 2 generations and 2 population:"
+    "By default, TPOT uses 100 generations and a population size of 100. Note the nod to genetics in the parameter names (*generations* and *population_size*). The number of configurations it searches through is roughly generations * population_size (TPOT's documentation gives the exact count as population_size + generations * offspring_size, where offspring_size defaults to population_size), so by default it will search about 10,000 different models. The more models you let it search through, the better your ultimate prediction is likely to be, at the cost of a longer run time. Here we initialize the model with just 2 generations and a population size of 2:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 5,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "0.8347868812185235\n"
+      "0.85837120746837\n"
      ]
     }
    ],
@@ -213,7 +201,8 @@
     "# specify TPOT\n",
     "# ----------\n",
     "tpot = TPOTClassifier(generations=2,      # set the number of iterations \n",
-    "                      population_size=2)  # set number of models\n",
+    "                      population_size=2,  # set number of models\n",
+    "                      random_state=1)     # set random seed\n",
     "\n",
     "# fit to training data\n",
     "# ----------\n",
@@ -238,13 +227,13 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 6,
    "metadata": {},
    "outputs": [],
    "source": [
     "# Mac users:\n",
     "# ----------\n",
-    "#!cat tpot_iris_pipeline.py\n",
+    "#!cat tpot_census_pipeline.py\n",
     "\n",
     "# Windows  users:\n",
     "# ----------\n",
@@ -267,7 +256,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 7,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -290,7 +279,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 8,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -312,7 +301,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 9,
    "metadata": {},
    "outputs": [
     {
@@ -464,7 +453,7 @@
        "4  0.226957  0.229270  0.436957   0.186900      82        1518  1600  "
       ]
      },
-     "execution_count": 10,
+     "execution_count": 9,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -482,14 +471,30 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 10,
    "metadata": {},
    "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/var/folders/sm/vmwg2qqj01xd1c_lk1nq88sm0000gn/T/ipykernel_8574/1846636362.py:17: FutureWarning: Series.ravel is deprecated. The underlying array is already 1D, so ravel is not necessary.  Use `to_numpy()` for conversion to a numpy array instead.\n",
+      "  y_bike_train.ravel())\n"
+     ]
+    },
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "0.9077297591798615\n"
+      "0.8255022864905334\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/var/folders/sm/vmwg2qqj01xd1c_lk1nq88sm0000gn/T/ipykernel_8574/1846636362.py:22: FutureWarning: Series.ravel is deprecated. The underlying array is already 1D, so ravel is not necessary.  Use `to_numpy()` for conversion to a numpy array instead.\n",
+      "  y_bike_test.ravel()))\n"
      ]
     }
    ],
@@ -502,7 +507,8 @@
     "# ----------\n",
     "tpot = TPOTRegressor(generations=2,        # set the number of iterations\n",
     "                     population_size=2,    # set number of models\n",
-    "                     scoring='r2')         # set scoring to r2\n",
+    "                     scoring='r2',         # set scoring to r2\n",
+    "                     random_state=2)       # set random seed\n",
     "\n",
     "\n",
     "\n",
@@ -523,7 +529,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 11,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -552,14 +558,14 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": 12,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "0.7080045095828637\n"
+      "0.7069160997732427\n"
      ]
     }
    ],
@@ -570,9 +576,10 @@
     "\n",
     "# specify TPOT\n",
     "# ----------\n",
-    "tpot = TPOTClassifier(generations=5,             # set the number of iterations\n",
-    "                      population_size=5,         # set number of models\n",
-    "                      scoring = 'f1')            # set scoring to f1\n",
+    "tpot = TPOTClassifier(generations=5,             # play with the number of iterations\n",
+    "                      population_size=5,         # play with the number of models\n",
+    "                      scoring='f1',              # set scoring to f1\n",
+    "                      random_state=3)            # set random seed\n",
     "\n",
     "# fit to training data\n",
     "# ----------\n",
@@ -608,7 +615,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.11.5"
+   "version": "3.12.4"
   },
   "toc": {
    "base_numbering": 1,
diff --git a/4 Unsupervised Machine Learning and TPOT/4-2 TPOT/tpot_bike_pipeline.py b/4 Unsupervised Machine Learning and TPOT/4-2 TPOT/tpot_bike_pipeline.py
@@ -1,22 +1,24 @@
 import numpy as np
 import pandas as pd
-from sklearn.ensemble import RandomForestRegressor
+from sklearn.ensemble import AdaBoostRegressor
 from sklearn.model_selection import train_test_split
-from sklearn.pipeline import make_pipeline, make_union
-from sklearn.tree import DecisionTreeRegressor
-from tpot.builtins import StackingEstimator
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import MaxAbsScaler
+from tpot.export_utils import set_param_recursive
 
 # NOTE: Make sure that the outcome column is labeled 'target' in the data file
 tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
 features = tpot_data.drop('target', axis=1)
 training_features, testing_features, training_target, testing_target = \
-            train_test_split(features, tpot_data['target'], random_state=None)
+            train_test_split(features, tpot_data['target'], random_state=2)
 
-# Average CV score on the training set was: 0.7887576294742148
+# Average CV score on the training set was: 0.8072309267479042
 exported_pipeline = make_pipeline(
-    StackingEstimator(estimator=DecisionTreeRegressor(max_depth=8, min_samples_leaf=4, min_samples_split=10)),
-    RandomForestRegressor(bootstrap=True, max_features=0.2, min_samples_leaf=17, min_samples_split=17, n_estimators=100)
+    MaxAbsScaler(),
+    AdaBoostRegressor(learning_rate=0.5, loss="square", n_estimators=100)
 )
+# Fix random state for all the steps in exported pipeline
+set_param_recursive(exported_pipeline.steps, 'random_state', 2)
 
 exported_pipeline.fit(training_features, training_target)
 results = exported_pipeline.predict(testing_features)
diff --git a/4 Unsupervised Machine Learning and TPOT/4-2 TPOT/tpot_census_pipeline.py b/4 Unsupervised Machine Learning and TPOT/4-2 TPOT/tpot_census_pipeline.py
@@ -7,10 +7,13 @@
 tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
 features = tpot_data.drop('target', axis=1)
 training_features, testing_features, training_target, testing_target = \
-            train_test_split(features, tpot_data['target'], random_state=None)
+            train_test_split(features, tpot_data['target'], random_state=1)
 
-# Average CV score on the training set was: 0.8330876330876331
-exported_pipeline = XGBClassifier(learning_rate=0.01, max_depth=7, min_child_weight=11, n_estimators=100, n_jobs=1, subsample=0.1, verbosity=0)
+# Average CV score on the training set was: 0.8623259623259623
+exported_pipeline = XGBClassifier(learning_rate=1.0, max_depth=4, min_child_weight=7, n_estimators=100, n_jobs=1, subsample=1.0, verbosity=0)
+# Fix random state in exported estimator
+if hasattr(exported_pipeline, 'random_state'):
+    setattr(exported_pipeline, 'random_state', 1)
 
 exported_pipeline.fit(training_features, training_target)
 results = exported_pipeline.predict(testing_features)
diff --git a/4 Unsupervised Machine Learning and TPOT/4-2 TPOT/tpot_census_pipeline_new_params.py b/4 Unsupervised Machine Learning and TPOT/4-2 TPOT/tpot_census_pipeline_new_params.py
@@ -1,24 +1,19 @@
 import numpy as np
 import pandas as pd
-from sklearn.ensemble import GradientBoostingClassifier
 from sklearn.model_selection import train_test_split
-from sklearn.pipeline import make_pipeline, make_union
-from sklearn.preprocessing import MinMaxScaler
-from sklearn.svm import LinearSVC
-from tpot.builtins import StackingEstimator
+from xgboost import XGBClassifier
 
 # NOTE: Make sure that the outcome column is labeled 'target' in the data file
 tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
 features = tpot_data.drop('target', axis=1)
 training_features, testing_features, training_target, testing_target = \
-            train_test_split(features, tpot_data['target'], random_state=None)
+            train_test_split(features, tpot_data['target'], random_state=3)
 
-# Average CV score on the training set was: 0.7022602797118238
-exported_pipeline = make_pipeline(
-    MinMaxScaler(),
-    StackingEstimator(estimator=LinearSVC(C=0.0001, dual=True, loss="hinge", penalty="l2", tol=0.0001)),
-    GradientBoostingClassifier(learning_rate=1.0, max_depth=3, max_features=0.15000000000000002, min_samples_leaf=9, min_samples_split=5, n_estimators=100, subsample=0.8500000000000001)
-)
+# Average CV score on the training set was: 0.7125312155588841
+exported_pipeline = XGBClassifier(learning_rate=0.1, max_depth=6, min_child_weight=2, n_estimators=100, n_jobs=1, subsample=0.75, verbosity=0)
+# Fix random state in exported estimator
+if hasattr(exported_pipeline, 'random_state'):
+    setattr(exported_pipeline, 'random_state', 3)
 
 exported_pipeline.fit(training_features, training_target)
 results = exported_pipeline.predict(testing_features)
diff --git a/4 Unsupervised Machine Learning and TPOT/4-2 TPOT/tpot_iris_pipeline.py b/4 Unsupervised Machine Learning and TPOT/4-2 TPOT/tpot_iris_pipeline.py