{ "cells": [ { "cell_type": "markdown", "id": "65a280c6-168a-4b92-92b3-e402e2104e99", "metadata": {}, "source": [ "# Data Preparation and Cleaning\n", "\n", "This notebook showcases a few general methods for cleaning up your training data. Cleaner, more consistent data generally yields a more accurate model. What counts as \"good data\" depends on your application, but some general guidelines include:\n", "- Keeping your data formatting consistent (binary boolean values, integers vs. floats, case sensitivity)\n", "- Removing duplicates\n", "- Filtering out non-applicable outliers\n", "- etc.\n", "***" ] }, { "cell_type": "code", "execution_count": 3, "id": "160c85aa-26e5-4d4c-a502-7eddc8734ac6", "metadata": {}, "outputs": [], "source": [ "##### Import packages\n", "\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 4, "id": "bd9e76ff-2ab0-4f6e-89cf-a9bbd92870b0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|     | R | G | B | colour |
|---|---|---|---|---|
| 0 | 81.0 | 42.0 | 173.0 | PURPLE |
| 1 | 222.0 | 9.0 | 73.0 | RED |
| 2 | 59.0 | 188.0 | 227.0 | BLUE |
| 3 | 14.0 | 158.0 | 29.0 | GREEN |
| 4 | 222.0 | 222.0 | 82.0 | YELLOW |
| ... | ... | ... | ... | ... |
| 504 | 139.0 | 140.0 | 122.0 | Grey |
| 505 | 189.0 | 236.0 | 182.0 | Light Green |
| 506 | 198.0 | 166.0 | 100.0 | Brown |
| 507 | 59.0 | 131.0 | 189.0 | Blue |
| 508 | 130.0 | 137.0 | 143.0 | Grey |

509 rows × 4 columns

|     | R | G | B | colour |
|---|---|---|---|---|
| 0 | 81.0 | 42.0 | 173.0 | purple |
| 1 | 222.0 | 9.0 | 73.0 | red |
| 2 | 59.0 | 188.0 | 227.0 | blue |
| 3 | 14.0 | 158.0 | 29.0 | green |
| 4 | 222.0 | 222.0 | 82.0 | yellow |
| ... | ... | ... | ... | ... |
| 504 | 139.0 | 140.0 | 122.0 | grey |
| 505 | 189.0 | 236.0 | 182.0 | light green |
| 506 | 198.0 | 166.0 | 100.0 | brown |
| 507 | 59.0 | 131.0 | 189.0 | blue |
| 508 | 130.0 | 137.0 | 143.0 | grey |

509 rows × 4 columns
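
The `colour` labels in the output above are all lowercase (for example `purple` rather than `PURPLE`), which is consistent with normalizing label case. A minimal sketch of that kind of step, combined with duplicate removal, might look like the following. The sample rows and the names `raw` and `cleaned` are made up for illustration, and the notebook's own code may differ.

```python
import pandas as pd

# Hypothetical sample rows; the real notebook loads its own dataset.
raw = pd.DataFrame({
    "R": [81.0, 59.0, 59.0, 59.0],
    "G": [42.0, 188.0, 131.0, 131.0],
    "B": [173.0, 227.0, 189.0, 189.0],
    "colour": ["PURPLE", "BLUE", "Blue", " blue "],
})

cleaned = raw.copy()

# Normalize labels: trim surrounding whitespace and lowercase everything,
# so "Blue", " blue " and "BLUE" all become the same class label.
cleaned["colour"] = cleaned["colour"].str.strip().str.lower()

# Drop rows that are exact duplicates across all columns, then renumber.
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

print(cleaned)
```

Normalizing the labels before calling `drop_duplicates` matters here: rows that differ only in label case or stray whitespace are only recognized as duplicates once the labels match exactly.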

|     | R | G | B | colour |
|---|---|---|---|---|
| 221 | 153.000 | 151.980 | 80.070 | moss green |
| 222 | 171.105 | 109.905 | 60.945 | browny orange |
| 223 | 255.000 | 186.915 | 33.915 | mustard yellow |
| 224 | 95.115 | 255.000 | 33.915 | flashy green |
| 225 | 32.895 | 83.895 | 13.005 | lime green |
| 226 | 58.905 | 82.110 | 49.980 | navy |
| 227 | 173.910 | 225.930 | 236.895 | light blue |
| 228 | 204.000 | 176.970 | 236.895 | light purple |
| 229 | 235.110 | 4.080 | 186.915 | flashy pink |
| 230 | 54.060 | 31.110 | 48.960 | dark purple |
| 231 | 134.895 | 236.895 | 213.945 | cyan |
| 232 | 255.000 | 150.960 | 88.995 | light orange |
| 233 | 255.000 | 183.090 | 198.900 | light pink |
| 234 | 236.895 | 87.975 | 121.125 | watermelon pink |
| 235 | 168.045 | 32.895 | 60.945 | maroon |
| 236 | 172.890 | 250.920 | 82.110 | lime green |
| 237 | 255.000 | 88.995 | 235.110 | magenta |
| 238 | 31.110 | 166.005 | 82.110 | dark green |
| 239 | 31.110 | 26.010 | 147.900 | dark blue |
| 240 | 241.995 | 185.895 | 0.000 | yellow |
| 241 | 241.995 | 13.005 | 37.995 | red |
| 242 | 83.895 | 198.900 | 255.000 | light blue |
| 243 | 255.000 | 232.050 | 194.055 | light tan |
| 244 | 47.940 | 45.900 | 41.055 | black |
| 245 | 255.000 | 255.000 | 160.905 | light yellow |
| 246 | 255.000 | 255.000 | 255.000 | white |
| 247 | 251.940 | 249.900 | 245.055 | white |
| 248 | 160.905 | 166.005 | 162.945 | grey |
| 249 | 45.900 | 98.940 | 29.070 | conifer green |
| 250 | 37.995 | 6.885 | 172.890 | deep purple |
| 251 | 186.150 | 0.000 | 0.000 | blood red |
| 252 | 93.075 | 198.900 | 255.000 | sky blue |
| 253 | 205.020 | 237.915 | 255.000 | ice blue |
| 254 | 255.000 | 122.400 | 0.000 | cheeto orange |
| 255 | 0.000 | 58.905 | 109.905 | marine blue |
| 256 | 92.055 | 103.020 | 111.945 | dark gray |
| 257 | 222.105 | 208.080 | 175.950 | sand color |
| 258 | 64.005 | 52.020 | 46.920 | dark brown |
| 259 | 224.910 | 186.915 | 255.000 | lavender |
| 260 | 255.000 | 0.000 | 158.100 | fuschiar |
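
The fractional channel values in these rows (for example `171.105`) are consistent with values that were originally on a 0-1 scale and then multiplied by 255, although the notebook does not state this, so treat it as an assumption. Under that assumption, a rough sketch of bringing mixed-scale rows onto a single 0-255 scale could look like this. The sample data and the names `df`, `rgb_columns`, and `unit_scaled` are hypothetical.

```python
import pandas as pd

# Hypothetical sample: the first row is on a 0-1 scale, the second on 0-255.
df = pd.DataFrame({
    "R": [0.6, 81.0],
    "G": [0.5, 42.0],
    "B": [0.2, 173.0],
    "colour": ["moss green", "purple"],
})
rgb_columns = ["R", "G", "B"]

# Heuristic (an assumption): treat a row as 0-1 scaled when all three
# channels are at most 1.0. Very dark 0-255 colours such as (0, 0, 0) or
# (1, 1, 1) would also match, so check flagged rows by hand on real data.
unit_scaled = (df[rgb_columns] <= 1.0).all(axis=1)

# Rescale only the flagged rows; every other row is left unchanged.
df.loc[unit_scaled, rgb_columns] = df.loc[unit_scaled, rgb_columns] * 255

print(df)
```

Keeping every row on one scale matters for distance-based models in particular, since a row on a 0-1 scale would otherwise sit next to near-black colours regardless of its actual hue.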
KNeighborsClassifier(n_neighbors=3)
DecisionTreeClassifier(max_depth=7)
RandomForestClassifier(max_depth=5, n_estimators=500)
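
The three estimator outputs above show the hyperparameters used for each classifier. As a hedged sketch of how such models might be fitted on cleaned RGB data, the example below reuses those hyperparameters on a tiny made-up training set. The data, the `models` dictionary, and the `sample` row are assumptions for illustration and are not taken from the notebook.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical cleaned training data: RGB features and lowercase labels.
train = pd.DataFrame({
    "R": [222.0, 241.0, 59.0, 31.0, 14.0, 31.0],
    "G": [9.0, 13.0, 188.0, 26.0, 158.0, 166.0],
    "B": [73.0, 38.0, 227.0, 148.0, 29.0, 82.0],
    "colour": ["red", "red", "blue", "blue", "green", "green"],
})
X = train[["R", "G", "B"]]
y = train["colour"]

# Same estimators and hyperparameters as shown in the outputs above.
models = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "tree": DecisionTreeClassifier(max_depth=7),
    "forest": RandomForestClassifier(max_depth=5, n_estimators=500),
}

# Fit each model on the same features and labels.
for model in models.values():
    model.fit(X, y)

# Predict the label of one unseen colour with every model.
sample = pd.DataFrame({"R": [230.0], "G": [20.0], "B": [40.0]})
predictions = {name: model.predict(sample)[0] for name, model in models.items()}
print(predictions)
```

On a real dataset, holding out a test split (for example with `sklearn.model_selection.train_test_split`) gives a fairer comparison between the three models than scoring them on the training rows.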