# TODO: Mention in the report that with every level of the tree the data gets smaller and smaller.
# NEXT STEPS
# TODO: Create an infographic and host it on a web page.
# TODO: Gather live data from news articles (can try using NLTK & urllib).
# TODO: Use Natural Language Processing to automate some of the data cleaning/integration.
###################################################################################################################
# Online Retail Analysis - ID3 CLASSIFICATION
# NOTE: Concepts are explained with examples from the Play Tennis data set, which can be found below.
# That data set is used because it is very small and easy to follow.
#
# 1) RESOURCES
#    ID3 TUTORIALS:
#    1) https://sefiks.com/2017/11/20/a-step-by-step-id3-decision-tree-example/
#    2) https://medium.com/coinmonks/what-is-entropy-and-why-information-gain-is-matter-4e85d46d2f01
#
#    DECISION TREE TUTORIAL: https://www.lucidchart.com/pages/decision-tree
#    ENTROPY (MORE DETAILS): https://en.wikipedia.org/wiki/Entropy_(information_theory)
#
# 2) DATA SETS
#    TEST DATA SET: This data set can be found by navigating to the PLAY TENNIS DATA SET region in this file.
#    It is part of the ID3 file because it is useful to have an example of how the ID3 code works with
#    a data set, and it provides an opportunity to better understand what the code is doing.
#    To see ID3 applied to a small data set, just call the test_run_algorithm() function at the
#    end of the file.
#
# 3) ALGORITHM OVERVIEW
#    ID3 generates a decision tree from a given data set. It works by evaluating each attribute
#    in the data set to place the nodes in an order that will return an accurate result.
#
# 4) USES
#    A) Classifying labelled data, typically for NLP, approving loans and credit cards, etc.
#    B) A non-standard use of this algorithm is to fill in a missing value in the data set
#       during the pre-processing stage.
###################################################################################################################
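# Worked example (hedged; these are the standard figures for the play-tennis data below): the target
# column has 9 "Yes" and 5 "No" answers, so its entropy is -(9/14)*log2(9/14) - (5/14)*log2(5/14) ≈ 0.940
# bits. The information gains then come out to roughly: outlook ≈ 0.246, humidity ≈ 0.151, wind ≈ 0.048,
# temperature ≈ 0.029, so outlook becomes the root of the tree.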
import math
import copy

# region PERFORMANCE IMPROVEMENTS (for Python 3.8)
"""
Applied: (TO DOCUMENT)
TODO:
1) Replace every dict.keys() used for iteration or membership tests with the dict itself.
   (In Python 2 dict.keys() builds a list of keys in memory; in Python 3 it returns a view,
   but iterating over the dict directly still avoids the extra call. Further information below.)
   https://stackoverflow.com/questions/4730993/python-key-in-dict-keys-performance-for-large-dictionaries
"""
# endregion
# region PLAY TENNIS DATA SET
DATASET_BY_ATTRIB_DICT = {"outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                                      "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
                          "temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                                          "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
                          "humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                                       "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
                          "wind": ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                                   "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]}

# Answer as to whether or not it is a good time to play tennis.
TARGET_ATTRIB_LIST = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

# CONSTANT VARIABLES
# TODO: Make these immutable (Python has no true const; typing.Final from 3.8 is the closest equivalent).
TARGET_ATTRIB_NAME = "play tennis"
TRAIN_DATA_SIZE = len(TARGET_ATTRIB_LIST)
# endregion


# Represents a tree node and links to derived nodes.
class Node:
    # Sentinel so each instance gets its own fresh list (a mutable default argument
    # would be shared across every Node), while still allowing an explicit None to
    # be passed in to mark an endpoint/leaf node.
    _UNSET = object()

    def __init__(self, node_name, derived_nodes=_UNSET):
        self.node_name = node_name
        self.derived_nodes = [] if derived_nodes is Node._UNSET else derived_nodes
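

# A minimal usage sketch (hedged; _demo_branch is an illustrative helper, not used by the
# tree itself): an attribute node holds outcome nodes, and a completed outcome ends in a
# leaf created with derived_nodes=None, which is how classify() detects the end of a branch.
def _demo_branch():
    wind_node = Node("wind")                            # attribute node
    weak_node = Node("Weak")                            # outcome node
    weak_node.derived_nodes.append(Node("Yes", None))   # endpoint/leaf: the answer
    wind_node.derived_nodes.append(weak_node)
    return wind_node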


class ID3DecisionTree:
    def __init__(self):
        self.root_node = None

        # Keeps track of all the nodes at the ends of the branches that are available to link to.
        # In this way, no code needs to be run to find the next available space for a new node.
        # The node at index 0 is always the one to add to first; once a new node is linked to it,
        # it gets popped off and the new node gets appended to the end of this list.
        self.active_branch_nodes = []

        # TODO: Merge this list with active_branch_nodes into dictionary format like so:
        # {attrib1: [outcome1, outcome2], attrib2: [outcome1, outcome2, outcome3]}
        self.linked_attributes = []

        # IMPORTANT NOTE:
        # Key to understanding how the ID3DecisionTree class works is understanding the
        # dataset_occurrence_dict structure, as it is what most calculations use. This structure
        # contains only the data from the dataset required to construct the tree; any repetition
        # of attribute data has been removed to reduce load.
        # 'dataset_occurrence_dict' is an unordered dictionary whose structure itself gives more
        # information about the dataset: every attribute of the data set is a key, which contains
        # a dictionary of its outcomes/possible values, and each outcome holds a dictionary showing
        # the distribution of the outcomes for the selected target attribute.
        # Example of the dictionary structure below.
        """ Example structure: (where 'AN'-attribute name; 'ON'-outcome name; 'TON'-target outcome name)
        dataset_occurrence_dict = {"AN 1": {"ON 1": {"TON 1": 1, "TON 2": 2},
                                            "ON 2": {"TON 1": 0, "TON 2": 1},
                                            "ON 3": {"TON 1": 0, "TON 2": 1}
                                            },
                                   "AN 2": {"ON 1": {"TON 1": 4, "TON 2": 0},
                                            "ON 2": {"TON 1": 1, "TON 2": 0}
                                            }
                                   }

        The example above reads as follows: for attribute 1 (AN1) there are 3 outcomes - ON1, ON2
        and ON3 - and the target has 2 possible outcomes, TON1 and TON2, whose occurrences are
        tracked for each possible outcome of each attribute. For AN1/ON1 there is 1 occurrence of
        TON1 and 2 occurrences of TON2. For AN1/ON2 there are 0 occurrences of TON1 and 1 of TON2,
        therefore the answer for that branch is TON2; the same holds for AN1/ON3 - answer TON2.
        Summing all the occurrences of TON1 and TON2 for attribute 1 (AN1) gives the number of
        entries in the given data set.
        """
        self.dataset_occurrence_dict = {}
    # region BUILD TREE UTILITIES
    def generate_occurrences(self, dataset_by_attrib_dict, target_list):
        """ Construct the dataset distribution/occurrence dictionary - "dataset_occurrence_dict".
        PARAMETERS
        :param (dict) dataset_by_attrib_dict
        :param (list) target_list """
        # Every attribute column must be as long as the target column.
        assert all(len(col) == len(target_list) for col in dataset_by_attrib_dict.values()), \
            "All attribute lists must have the same length as the target list."
        # Start fresh so attributes left over from a previous (larger) chunk do not
        # leak into the calculations for the current sub-dataset.
        self.dataset_occurrence_dict = {}

        # Update the dictionary with each attribute
        for attrib_name in dataset_by_attrib_dict:
            # STEP 1: ADD the current attribute to the 'dataset_occurrence_dict' structure
            self.dataset_occurrence_dict.update({attrib_name: {}})

            # STEP 2: Fetch lists containing only the unique values from attribute_list and target_list.
            attribute_list = dataset_by_attrib_dict[attrib_name]
            unique_attrib_outcomes = list(set(attribute_list))
            unique_answers = list(set(target_list))

            # For each unique outcome of the current attribute
            for attrib_outcome in unique_attrib_outcomes:
                # 2.1) Update the dictionary to store the next attribute outcome
                self.dataset_occurrence_dict[attrib_name].update({attrib_outcome: {}})
                # 2.2) For the current outcome, add a zeroed counter for every target outcome
                for outcome in unique_answers:
                    self.dataset_occurrence_dict[attrib_name][attrib_outcome].update({outcome: 0})

            # STEP 3: Go through the dataset and count the target outcome occurrences
            # for each attribute outcome.
            for attrib_value, target_value in zip(attribute_list, target_list):
                self.dataset_occurrence_dict[attrib_name][attrib_value][target_value] += 1
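
    # For instance (hedged, but easy to verify by hand against the data set above), after
    # generate_occurrences(DATASET_BY_ATTRIB_DICT, TARGET_ATTRIB_LIST) the "wind" entry of
    # dataset_occurrence_dict holds:
    #   {"wind": {"Weak": {"Yes": 6, "No": 2}, "Strong": {"Yes": 3, "No": 3}}}
    # i.e. of the 8 "Weak" rows, 6 answer "Yes", while the 6 "Strong" rows split evenly.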
- """ After a node is added to the tree the "dataset_occurrence_dict" dictionary should be updated.
- PARAMETERS
- :param (list) attrib_list - the raw attrib data from the dataset.
- :param (list) target_list - the raw target data from the dataset. """
- def get_next_branch_occurrences(self, dataset_by_attrib_dict, target_list):
- # This is the outcome to update the dataset_occurrence_dict by
- # A completely separate dictionary from the original, this dictionary will only hold a subdictionary
- # of the original
- subdict = copy.deepcopy(dataset_by_attrib_dict)
- subtar = copy.deepcopy(target_list)
- indices_to_remove = []
- attrib_to_remove = None
- # Looking through every possible attribute in the dictionary
- for attrib_key in subdict:
- attrib_found = False
- # Count through each list of outcomes for the given attribute.
- for count in range(len(subdict[attrib_key])):
- # If the active outcome name is equal to the current outcome value in the list
- if dataset_by_attrib_dict[attrib_key][count] == self.active_branch_nodes[0].node_name:
- attrib_found = True
- # According to the algorithm, the attribute containing the currently active outcome
- # should be removed
- if attrib_key in subdict:
- attrib_to_remove = attrib_key
- else:
- indices_to_remove.append(count)
- # print(subdict[attrib_key][count])
- # subdict[attrib_key].pop(count)
- # TODO: assert that there is only one 0 in the list otherwise it is trying to remove the wrong values
- if attrib_found:
- break
- # Processing the subdict data
- #print("Subdict: ", subdict)
- del subdict[attrib_to_remove]
- for attrib in subdict:
- #print("Discarding data in ", attrib)
- complete_list = subdict[attrib]
- sublist = [value for index, value in enumerate(complete_list) if index not in indices_to_remove]
- subdict[attrib] = sublist
- #print("After processing the data: ", subdict)
- # Processing the subtar data
- #print("Discarding data in target list")
- #print("Target data before processing: ", subtar)
- # print(indices_to_remove)
- subtar = [value for index, value in enumerate(subtar) if index not in indices_to_remove]
- #print("Target data after processing: ", subtar)
- # TODO: Call this function recursively on each branch, pass in the shrinked dictionary
- # TODO: test the base case thoroughly
- # TODO: Build a new dataset_by_attrib_dict for the current outcome
- # TODO: REMOVE outlook from the dataset dict when all its outcomes have children nodes assigned
- # (How to know if an attribute is complete???)
- return subdict, subtar
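
    # For example (hedged): if the active outcome is "Sunny", the attribute "outlook" is dropped,
    # only the five Sunny rows (indices 0, 1, 7, 8 and 10) survive, and the method returns:
    #   subdict = {"temperature": ["Hot", "Hot", "Mild", "Cool", "Mild"],
    #              "humidity":    ["High", "High", "High", "Normal", "Normal"],
    #              "wind":        ["Weak", "Strong", "Weak", "Weak", "Strong"]}
    #   subtar  = ["No", "No", "No", "Yes", "Yes"]
    # Note how humidity now splits the answers perfectly, which is why it wins the next round.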
- """ Checks if a branch is complete, i.e. the target outcome was found.
- PARAMETERS
- :param (dict) target_val_dist_for_attrib
- :returns (list) comp_branches - contains all the target outcomes reached for the given attribute."""
- def track_target_outcomes(self, target_val_dist_for_attrib):
- comp_branches = []
- # Looks through each attribute outcome
- for attrib_outcome_key in target_val_dist_for_attrib.keys():
- # Tracks how many non-zero occurrences of a target outcome there are for this attribute outcome.
- non_zero_outcome_count = 0
- # This variable is set to the target outcome if the branch outcome is (100%) certain.
- branch_answer = None
- # Checks what the distribution of target outcomes is for the current attribute outcome.
- # Ex: question - how sdo people drive based on the terrain, if the terrain is flat do they drive slow
- # or fast, and what is it if the terrain is steep.
- # Target outcomes - fast and slow; attrib outcomes - flat and steep.
- # Distribution dictionary looks like this ->{'fast': {'slow': 0, 'fast': 1}, 'steep':{'slow': 2, 'fast': 1}}
- for target_outcome_key in target_val_dist_for_attrib[attrib_outcome_key].keys():
- # Fetch the number of occurrences for each target outcome for the current attribute
- """"Another Example: if the target is can_buy_computer(possible values/outcomes: Yes or No) and the current
- attribute is age (possible values/outcomes: <=30, 31..40 and >40) this will return how many of the entries
- where age is <=30 are no, then how many of the entries where age is <=30 are yes, then how many
- of the entries where age is 31..40 are yes and so on, until all cases are looked at. """
- outcome_occurrences = target_val_dist_for_attrib[attrib_outcome_key][target_outcome_key]
- # Check if the answer is certain and end the branch, i.e. count how many branches have
- # certain target outcome
- if outcome_occurrences > 0:
- non_zero_outcome_count += 1
- if non_zero_outcome_count == 1:
- branch_answer = target_outcome_key
- if non_zero_outcome_count == 0:
- print("INVALID RESULT!")
- elif non_zero_outcome_count == 1:
- print("THE ANSWER FOR <<", attrib_outcome_key, ">> is <<", branch_answer, ">>")
- comp_branches.append({attrib_outcome_key: branch_answer})
- elif non_zero_outcome_count > 1:
- print("THE BRANCH <<", attrib_outcome_key, ">> IS STILL ACTIVE!")
- return comp_branches

    # Counts the occurrences of each value for a given attribute column.
    def count_value_occ(self, unique_values, attrib_data):
        return {value: attrib_data.count(value) for value in unique_values}

    def calc_entropy(self, attrib_uv_count, overall):
        entropy = 0
        for key in attrib_uv_count:
            # Only values that actually occur contribute; a count of 0 is skipped
            # (in the entropy formula the limit of 0 * log2(0) is 0).
            if attrib_uv_count[key] != 0:
                fraction = attrib_uv_count[key] / overall
                entropy += fraction * math.log2(fraction)
        return abs(entropy)
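
    # Worked example (hedged): for the play-tennis target column, 9 "Yes" and 5 "No" gives
    #   calc_entropy({"Yes": 9, "No": 5}, 14)
    #   = |(9/14)*log2(9/14) + (5/14)*log2(5/14)| ≈ 0.940 bits.
    # A pure split (14/0) would give 0 bits; a perfectly even split (7/7) would give 1 bit.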

    def calc_attrib_entropy(self, attrib_occurrences):
        # Maps each attribute outcome to the entropy of its target distribution.
        entropy_per_outcome = {}
        for attrib_val_key, attrib_val in attrib_occurrences.items():
            # The total number of rows with this attribute outcome.
            overall = sum(attrib_val.values())
            print("CALC TARGET ENTROPY FOR EACH ATTRIB OUTCOME: ", attrib_val)
            entropy_per_outcome[attrib_val_key] = self.calc_entropy(attrib_val, overall)
        print("Entropy per outcome: ", entropy_per_outcome)
        return entropy_per_outcome

    # WEIGHTED AVERAGE ENTROPY for the children
    def calc_entropy_weigh_avg(self, target_val_dist_attrib, overall, attrib_entropy):
        weighted_entropy_avg = 0
        for key in target_val_dist_attrib:
            # Weight each outcome's entropy by the fraction of rows it covers.
            curr_value = sum(target_val_dist_attrib[key].values())
            weighted_entropy_avg += curr_value / overall * attrib_entropy[key]
        return weighted_entropy_avg

    def calc_info_gain(self, target_entropy, target_dist_for_attrib, data_size):
        # CALCULATE ENTROPY OF the attribute. The number of rows is passed in rather than using
        # TRAIN_DATA_SIZE, because deeper levels of the tree only see a filtered sub-dataset.
        attrib_entropy = self.calc_attrib_entropy(target_dist_for_attrib)
        weighted_avg_e = self.calc_entropy_weigh_avg(target_dist_for_attrib, data_size, attrib_entropy)
        attrib_info_gain = target_entropy - weighted_avg_e
        return attrib_info_gain
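
    # Worked example (hedged; standard figures for this data set): for "outlook",
    #   Sunny (2 Yes / 3 No) and Rain (3 Yes / 2 No) each have entropy ≈ 0.971,
    #   Overcast (4 Yes / 0 No) has entropy 0, so the weighted average is
    #   (5/14)*0.971 + (4/14)*0 + (5/14)*0.971 ≈ 0.694, and the gain is 0.940 - 0.694 ≈ 0.246.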

    # IMPORTANT NOTE: An attribute node should always be made together with its outcomes,
    # never an outcome alone, as that is not how this function was set up.
    # :param (str) name - should always be the name of an attribute.
    def build_node(self, name, completed_branches):
        attrib_node = Node(name)
        derived_nodes = []
        completed_outcomes = [list(branch.keys())[0] for branch in completed_branches]

        # Outcomes whose branches are completed are never registered in the active_branch_nodes
        # list; once all outcome branches for this attribute are completed, the attribute itself
        # is complete.
        for outcome_name in self.dataset_occurrence_dict[name]:
            new_outcome_node = Node(outcome_name)
            # Check if the branch for the current outcome is complete (target answer is 100% certain).
            for branch in completed_branches:
                if outcome_name in branch and len(new_outcome_node.derived_nodes) == 0:
                    # Formally end the branch with an endpoint/leaf node.
                    endpoint_node = Node(branch[outcome_name], None)
                    new_outcome_node.derived_nodes.append(endpoint_node)

            # The temp_outcome node is created so that the outcome node stored in the tree and the
            # outcome node stored in the active_branch_nodes list are the same object. This is
            # important because nothing is ever appended directly onto the tree, only onto a
            # reference of the active branch of the tree. This allows appending at any depth of
            # the tree without needing any traversal to find the next available node.
            temp_outcome = copy.deepcopy(new_outcome_node)
            derived_nodes.append(temp_outcome)
            # If the branch is still active/available to add to,
            # add the new node to the active branch list.
            if outcome_name not in completed_outcomes:
                self.active_branch_nodes.append(temp_outcome)

        attrib_node.derived_nodes = derived_nodes
        return attrib_node
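
    # For the play-tennis data the first call builds (hedged sketch):
    #   "outlook"
    #     ├─ "Sunny"    (still active, later receives a humidity sub-tree)
    #     ├─ "Overcast" -> leaf "Yes" (all 4 Overcast rows answer "Yes")
    #     └─ "Rain"     (still active, later receives a wind sub-tree)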

    # IMPORTANT NOTE: active_branch_nodes is only updated when the build_node function is called,
    # therefore the link will not be appropriate unless the node was created through build_node.
    def link_node(self, new_node):
        if self.root_node is None:
            self.root_node = new_node
        else:
            # Add the new node to the tree.
            # Index 0 is hard-coded as the active node index because index 0 is always
            # the next available node to link to.
            self.active_branch_nodes[0].derived_nodes.append(new_node)
            # Update the available nodes: the node at index 0 is now taken, so pop it off.
            self.active_branch_nodes.pop(0)

    # Builds a part of the tree (an attribute node with its derived/outcome nodes set up) and links it in.
    def build_tree_chunk(self, dataset_by_attrib_dict, target_attrib_list):
        self.generate_occurrences(dataset_by_attrib_dict, target_attrib_list)

        # TARGET ATTRIBUTE CALCULATIONS - required for the info_gain of the other attributes.
        # The chunk size is the length of the current target list, not TRAIN_DATA_SIZE,
        # because deeper chunks only see a filtered sub-dataset.
        chunk_size = len(target_attrib_list)
        target_uv_data = list(set(target_attrib_list))  # TODO: POSSIBLE EFFICIENCY DECREASE
        target_uv_count = self.count_value_occ(target_uv_data, target_attrib_list)
        target_entropy = self.calc_entropy(target_uv_count, chunk_size)

        # Evaluate each attribute (calculate its entropy and info gain, and check for completed
        # branches), then keep the attribute with the highest info gain as the next node.
        next_node_data = {"name": None, "info gain": 0, "completed": None}
        for attrib_name in self.dataset_occurrence_dict:
            print("\n", "-" * 50)
            # ATTRIBUTE CALCULATIONS
            print("attrib_name: ", attrib_name)
            # A data structure representing the target attribute's value distribution
            # with regard to the current attribute.
            target_dist_for_attrib = self.dataset_occurrence_dict[attrib_name]
            # Check if any of the branches are completed.
            completed_branches = self.track_target_outcomes(target_dist_for_attrib)
            print("COMPLETED BRANCHES: ", completed_branches)

            attrib_info_gain = self.calc_info_gain(target_entropy, target_dist_for_attrib, chunk_size)
            if next_node_data["info gain"] < attrib_info_gain:
                next_node_data["name"] = attrib_name
                next_node_data["info gain"] = attrib_info_gain
                next_node_data["completed"] = completed_branches

        print("------> The next new node is: ", next_node_data["name"], "\n\n")
        new_node = self.build_node(next_node_data["name"], next_node_data["completed"])
        self.link_node(new_node)
    # endregion

    def build_tree(self, dataset_by_attrib_dict, target_attrib_list):
        self.build_tree_chunk(dataset_by_attrib_dict, target_attrib_list)
        print("\n\n")
        # Keep extending the branch at the front of the queue until every branch ends in an answer.
        while len(self.active_branch_nodes) != 0:
            print(">>>>>>>>>>>>>>>>>>> Current active node: ", self.active_branch_nodes[0].node_name)
            sub_attrib_dict, sub_tar_list = self.get_next_branch_occurrences(dataset_by_attrib_dict,
                                                                             target_attrib_list)
            self.build_tree_chunk(sub_attrib_dict, sub_tar_list)
            print("\n\n>>>>>>>>>>>>>>>>>>> List of active nodes: ", self.active_branch_nodes)
        print("\n\n", "<" * 5, "THE TREE IS COMPLETE!", ">" * 5, "\n\n")

    def visualise_tree(self):
        # Prints the tree recursively; the base case is a node with no derived nodes.
        # branch_track starts as "" and grows by one "\t" per level of recursion,
        # so the printout reads as a hierarchy.
        self._print_subtree(self.root_node, "")

    def _print_subtree(self, node, branch_track):
        if node is None:
            return
        print(branch_track + node.node_name)
        if node.derived_nodes:
            for derived_node in node.derived_nodes:
                self._print_subtree(derived_node, branch_track + "\t")

    # This function runs classification on one entry and returns the answer.
    # Should only be called after the tree model has been built.
    def classify(self, entry_index, dataset_by_attrib_dict):
        assert self.root_node is not None, "The tree must be built before classify() is called."
        answer = None
        current_node = self.root_node
        # Leaf nodes are created with derived_nodes=None, so the loop stops once an answer is reached.
        while current_node.derived_nodes is not None:
            print("\n <<< TRAVERSING TREE >>> ")
            print("Current Attrib: ", current_node.node_name)
            # Ask the tree which attribute/column to look at next.
            column_name = current_node.node_name
            # Fetch the value for the given entry (entry_index) from the column identified by the tree.
            current_outcome_name = dataset_by_attrib_dict[column_name][entry_index]
            print("\tCurrent outcome name: ", current_outcome_name)
            # Follow the derived node matching that value down to the next level.
            for outcome_node in current_node.derived_nodes:
                if outcome_node.node_name == current_outcome_name:
                    current_node = outcome_node.derived_nodes[0]
            answer = current_node.node_name
        print(" <<< FOUND VALUE >>> ")
        print(" The answer is: ", answer)
        return answer
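
    # For instance (hedged): entry 0 has outlook="Sunny" and humidity="High", so the traversal goes
    # root "outlook" -> branch "Sunny" -> node "humidity" -> branch "High" -> leaf "No",
    # which matches TARGET_ATTRIB_LIST[0] == "No".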


def test_run_algorithm():
    print(" " * 10, " << ID3 CLASSIFICATION ALGORITHM >> ", " " * 10)
    tree = ID3DecisionTree()
    tree.build_tree(DATASET_BY_ATTRIB_DICT, TARGET_ATTRIB_LIST)

    # APPLY CLASSIFICATION
    # The index of the entry in the dataset to classify.
    entry_index = 0
    tree.classify(entry_index, DATASET_BY_ATTRIB_DICT)


if __name__ == "__main__":
    test_run_algorithm()
- """
- # Remove the completed branches
- for branch in completed_branches:
- for key in branch.keys():
- target_val_dist_for_grade.pop(key)
- print("After removing completed branches: ", target_val_dist_for_grade)
- """

# region Build Decision Tree
"""
What is "Training Data"?
Building the tree is done with training data, which already has the answer to whatever question
is being asked. The laptop-purchase example given with the data on the slides is training data
because every entry already contains the answer.
"""
- """
- Apply information gain function to each attribute calculate_gain(attr_out)
- Should that be applied to the target as well? No
- Example:
- - G(train_data, O) = 0.246
- - G(train_data, H) = 0.151
- - G(train_data, W) = 0.048
- Once the root node is known, look at how many unique values are there.
- If there are 4 possible values and they are not numbers,
- for example "Sunny", "Rainy", etc. there should be 4 nodes.
- """

# region Apply Classification
"""
What is "Test Data"?
Test data is a new entry that we want to classify.
For example: a bank may use an already trained ID3 model to check whether you should get a credit card.
It will have different attributes, like the number of times you have gone bankrupt, your current
net worth, whether you are a student, your credit score, etc.
The target attribute will then be EligibleForCreditCard (True or False).
"""
# Use the built decision tree to look through a row of data from the data set. This is done using test data.
# (How should classification error be evaluated?)
"""
Steps:
1. Find which attribute to look through first. (To start with, ask the tree which attribute is the root node.)
    1.1 (When building the tree, make sure the attributes have the exact same names as the node data.)
    1.2 Search through all possible attributes.
    1.3 Check if the attribute name == the node name.

2. Find the attribute value for the current row.
    2.1 Ask the data set which value is given for this attribute.
    2.2 Find which of the child nodes in the tree is equivalent to the given value.

Repeat these steps recursively until an answer is found (see the sketch below).
"""
# endregion