Our goal in this section of the lesson is to train decision trees to classify unseen data.

To train a decision tree is to construct it. And we construct the tree from a dataset. But what dataset will we use? Luckily, scikit-learn has some built in!

# Exploring datasets

There is a built-in dataset about iris flowers: it suffices to import the `load_iris` function from the `sklearn.datasets` module, and call it to get the dataset.

In [None]:
from sklearn.datasets import load_iris
dataset = load_iris()

In [None]:
# But what's a dataset? Let's have a look.
# Simply evaluate this cell.
dataset

Remember from last time how we interactively got documentation on any variable?

__Question:__ Find out what this `dataset` variable is.

Hint: look at the last character from the first line of this cell.

In [None]:
# Your exploration here:


__Question:__ what do we write to get the names of the features out of the dataset?

In [None]:
# Your code here:


So this dataset has 4 features, each a physical measurement of the flower in centimeters.

The `data` and `target` entries in the dataset really are the data, whereas the `target_names` and `feature_names` entries are *metadata*; they tell us about the interpretation of the data.

The `data` and `target` entries are related. Let's see how.

In [None]:
# The first entry of dataset['data'] is some measurements about a flower,
# and the first element of dataset['target'] tells us which kind of flower that is.
dataset['data'][0], dataset['target'][0]

So the data point `[5.1, 3.5, 1.4, 0.2]` is labelled `0`, which represents a flower species.
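To make the pairing concrete, here is a small sketch that walks through the first few rows of `data` together with their entries in `target`. The variable name `pairs` is ours, chosen only for this illustration.

```python
from sklearn.datasets import load_iris

dataset = load_iris()

# Row i of dataset['data'] goes with entry i of dataset['target'].
# zip walks through both at once; here we only look at the first 3 rows.
pairs = list(zip(dataset['data'][:3], dataset['target'][:3]))

for measurements, label in pairs:
    print(measurements, '->', label)
```

Each printed line shows one flower's four measurements followed by its label.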

## Visual intuition for the dataset

Let's quickly get some intuition for what this dataset looks like by plotting it. Jupyter notebooks have built-in integration with the Python plotting library matplotlib. All this comes pre-installed with Anaconda.

Don't pay too much attention to the code here; we just want to get a quick visualization.

In [None]:
# Just evaluate this cell. Dig into the code on your own later.
import matplotlib.pyplot as plt
%matplotlib inline
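# Pair each point's (sepal length, petal length) with its label.
both = list(zip(((x[0], x[2]) for x in dataset['data']), dataset['target']))
# Draw each label with its own marker; sorted() makes the label order explicit.
for i, mark in zip(sorted(set(dataset['target'])), '^ov+'):
    X, Y = zip(*[p for p, d in both if d == i])
    plt.scatter(X, Y, marker=mark)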
plt.show()

This scatter plot uses the first feature for the X coordinate and the third feature for the Y coordinate; those are the sepal length and petal length, respectively. The different flower types are represented as different symbols in the plot.

### Questions

1. How many different _labels_ are there in this dataset?
2. How many _features_ does this plot consider? How many _features_ are there in total?
3. Using code, find out what the names of the flowers are in this dataset. Hint: you need to look up something in the `dataset` dictionary.
4. Using code, find out how many points there are in this dataset. Hint: use the `len` function.

In [None]:
# Your solution to question 3 here:


In [None]:
# Your solution to question 4 here:


# Decision trees

We are using decision trees as *classifiers*. In this notebook, you'll explore using the `DecisionTreeClassifier` from scikit-learn to construct decision trees.

In [None]:
from sklearn import tree
from sklearn.model_selection import train_test_split

We import the `tree` module from the `sklearn` library. This module contains the `DecisionTreeClassifier`.

We can create a new classifier by calling `DecisionTreeClassifier`.

In [None]:
classifier = tree.DecisionTreeClassifier()

We use the `.` operator to reach inside the module: since `DecisionTreeClassifier` lives inside the `tree` module, we tell Python to look inside `tree` (which we imported) and call `DecisionTreeClassifier`. Strictly speaking, `DecisionTreeClassifier` is a *class* rather than a function; calling it creates a new, untrained classifier object.
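As a side note, the same class can also be imported by name. The sketch below is only an illustration that both spellings reach the same thing; the variable names `same_class` and `another_classifier` are ours.

```python
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Both spellings refer to the very same class object.
same_class = tree.DecisionTreeClassifier is DecisionTreeClassifier
print(same_class)

# Calling the class creates a fresh, untrained classifier.
another_classifier = DecisionTreeClassifier()
print(type(another_classifier).__name__)
```

Which style to use is a matter of taste; this notebook sticks with `tree.DecisionTreeClassifier`.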

Before proceeding to train our decision tree model, we need to do a train test split, so that we can set aside some of the data for use as an "exam" for our model later.
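To get a feel for what a split does, here is a minimal sketch on made-up toy data (not the iris dataset). The names `toy_X` and `toy_y`, and the choices `test_size=0.5` and `random_state=0`, are assumptions made only for this illustration; the exercise below asks for a different proportion.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 points with one feature each, and 8 labels.
toy_X = [[0], [1], [2], [3], [4], [5], [6], [7]]
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]

# test_size=0.5 sets aside half of the points for testing.
# random_state fixes the shuffle so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.5, random_state=0
)

print(len(X_train), len(X_test))  # 4 4
```

Note that the function returns four pieces: the training and testing parts of the features, then the training and testing parts of the labels.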

__Question:__ Use the `train_test_split` function imported above to set aside 25% of the data for use in validation later.
If you don't know how to use the function, remember that you can use `train_test_split?` to see the documentation. Also, don't be shy about using ChatGPT!

In [None]:
# Your answer:


Now we want to train our classifier on the 75% chunk of data obtained from `train_test_split`.

We do this using the `fit` method of `classifier`. The `fit` method takes two inputs:
1. a list of X values (feature measurements)
2. a list of Y values (labels)

These two lists must have the same length: one label for each data point!
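Here is a small sketch of `fit` on made-up toy data, so it does not give away the exercise below. The names `toy_X`, `toy_y` and `toy_classifier` are hypothetical, and the call to `predict` is included only to show that the trained tree can then label a new point.

```python
from sklearn import tree

# Hypothetical toy data: 4 points with one feature each, and 4 labels.
toy_X = [[0.0], [1.0], [2.0], [3.0]]
toy_y = [0, 0, 1, 1]

toy_classifier = tree.DecisionTreeClassifier()
toy_classifier.fit(toy_X, toy_y)

# Ask the trained tree to label a point it has not seen.
prediction = toy_classifier.predict([[2.5]])
print(prediction)

# If the two lists have different lengths, fit refuses to train.
try:
    tree.DecisionTreeClassifier().fit(toy_X, [0, 1])
    mismatch_rejected = False
except ValueError as error:
    mismatch_rejected = True
    print("fit refused:", error)
```

The second call fails because there are 4 data points but only 2 labels.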

__Question:__ Use the `fit` method to train the model. What X and Y values do we use? Hint: use the training portion produced by your `train_test_split` call, not the full `dataset` entries.

In [None]:
# Your answer:


We can visualize the decision tree that was created. Again, don't pay too much attention to this code; we could teach a separate course just about data visualization! **This code might not run correctly on your computer;** see the text below.

In [None]:
import graphviz # the library we need to visualize trees and graphs

# Define the visualization code as a function so we can reuse it later!
def visualize_tree(clf):
    dot_data = tree.export_graphviz(clf, feature_names=dataset['feature_names'], filled=True)
    return graphviz.Source(dot_data)

visualize_tree(classifier)

This code *might* not work on your computer. (Possibly graphviz isn't installed; it's somewhat tricky to install.)
In this case, you can download the visualization [here](https://computing-workshop.com/lessons/W19/ml-2/graphviz/tree.pdf).
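If graphviz is unavailable, one possible fallback is scikit-learn's plain-text rendering, `tree.export_text`, which needs no extra installation. The sketch below is self-contained, so it trains its own tree on the *whole* iris dataset; that is an assumption made for illustration, and the tree may differ from the one you trained on the 75% chunk. The names `demo_classifier` and `text_view` are ours.

```python
from sklearn import tree
from sklearn.datasets import load_iris

dataset = load_iris()

# Illustration only: this tree is trained on all of the data.
demo_classifier = tree.DecisionTreeClassifier(random_state=0)
demo_classifier.fit(dataset['data'], dataset['target'])

# export_text returns the tree as an indented string, one line per node.
text_view = tree.export_text(
    demo_classifier, feature_names=list(dataset['feature_names'])
)
print(text_view)
```

The text view has no colors, so the color-related questions below still need the graphical version or the PDF.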

__Questions:__
1. What is the interpretation of the colors in the tree?
2. In each node of the tree, what is the relationship between the `samples = N` line and the `value = [...]` line?
3. Notice that in all the teal and purple nodes, the first count in `value` is always zero. Why is that?
4. What is the meaning of the `value` in each node?
5. Invent three data points. Choose feature values so that the first point falls into the orange leaf; the second, into the teal leaf; and the third, into the most purple leaf. Recall that leaves are nodes with no children.
   For example, a point with `petal width = 2.0` would follow the False branch of the root node. Choose the other parameters so that the point arrives at the desired classification.
6. With the data set aside by the train_test_split, validate the model: figure out how to import and use the `accuracy_score` function from sklearn. Try googling and using the sklearn documentation at first, and give ChatGPT a try too!
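As a hint for question 6, here is what `accuracy_score` computes on two short, made-up lists of labels. The lists `true_labels` and `predicted_labels` are hypothetical; for the exercise you will need the real labels you set aside and your model's predictions for them.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: the correct answers, and a model's guesses.
true_labels = [0, 1, 2, 2]
predicted_labels = [0, 1, 1, 2]

# accuracy_score returns the fraction of guesses that match the truth.
score = accuracy_score(true_labels, predicted_labels)
print(score)  # 0.75
```

Three of the four guesses match, so the score is 0.75.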