ID3_classification.py
# TODO mention in the report that with every level of the tree the data gets smaller and smaller.
# NEXT STEPS
# TODO: Create an infographic and host it on a web page.
# TODO: Gather live data from news articles (can try using NLTK & urllib).
# TODO: Use Natural Language Processing to automate some of the data cleaning/integration.
###################################################################################################################
# Online Retail Analysis - ID3 CLASSIFICATION
# NOTE! Concepts are explained with examples from the Play Tennis data set, which can be found below.
# That data set is used because it is very small and easy to follow.
#
# 1) RESOURCES
# ID3 TUTORIALS:
#   1) https://sefiks.com/2017/11/20/a-step-by-step-id3-decision-tree-example/
#   2) https://medium.com/coinmonks/what-is-entropy-and-why-information-gain-is-matter-4e85d46d2f01
#
# DECISION TREE TUTORIAL: https://www.lucidchart.com/pages/decision-tree
# ENTROPY (MORE DETAILS): https://en.wikipedia.org/wiki/Entropy_(information_theory)
#
# 2) DATA SETS
# TEST DATA SET: This data set can be found by navigating to the PLAY TENNIS DATA SET region in this file.
# It is part of the ID3 file because having an example of how the ID3 code works on a small data set
# makes it easier to understand what the code is doing.
# To see ID3 applied to a small data set, just call the test_run_algorithm() function at the end of the file.
#
# 3) ALGORITHM OVERVIEW
# ID3 generates a decision tree from a given data set. It works by evaluating each attribute
# in the data set to place the nodes in an order that will return an accurate result.
#
# 4) USES
# A) Classifying labelled data, generally to do with NLP, approving loans and credit cards, etc.
# B) A non-standard use of this algorithm is to fill in a missing value in the data set
#    during the pre-processing stage.
###################################################################################################################
import math
import copy
# region PERFORMANCE IMPROVEMENTS (for Python 3.8)
"""
Applied: (TO DOCUMENT)
TODO:
1) Remove every dict.keys() call used for membership tests and replace it with the dict itself.
   (In Python 2, dict.keys() builds a list of keys in memory; in Python 3 it returns a view,
   but `key in dict` is still the idiomatic and slightly faster form. Further information below.)
   https://stackoverflow.com/questions/4730993/python-key-in-dict-keys-performance-for-large-dictionaries
"""
# endregion
# region PLAY TENNIS DATA SET
DATASET_BY_ATTRIB_DICT = {"outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                                      "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
                          "temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                                          "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
                          "humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                                       "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
                          "wind": ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                                   "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]}
# Answer as to whether or not it is a good time to play tennis.
TARGET_ATTRIB_LIST = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
# CONSTANT VARIABLES  # TODO: Optimise these variables by making them immutable (specifying they are const with Python)
TARGET_ATTRIB_NAME = "play tennis"
TRAIN_DATA_SIZE = len(TARGET_ATTRIB_LIST)
# endregion

# Represents a tree node and links to derived (child) nodes.
class Node:
    def __init__(self, node_name, derived_nodes=()):
        self.node_name = node_name
        # A fresh list per instance avoids the shared mutable-default pitfall;
        # passing None explicitly marks an endpoint (leaf) node.
        self.derived_nodes = list(derived_nodes) if derived_nodes is not None else None

class ID3DecisionTree:
    def __init__(self):
        self.root_node = None
        # Keeps track of all the nodes at the end of the branches that are available to link to.
        # In this way, no code needs to be run to find the next available space for a new node.
        # The node at index 0 is always the one to add to first; once the new node is linked to it, it gets popped off
        # and the new node gets appended to the end of this list.
        self.active_branch_nodes = []
        # TODO: Merge this list with active_branch_nodes into dictionary format like so:
        # {attrib1: [outcome1, outcome2], attrib2: [outcome1, outcome2, outcome3]}
        self.linked_attributes = []
        # IMPORTANT NOTE:
        # Key to understanding how the ID3DecisionTree class works is understanding the dataset_occurrence_dict
        # structure, as that is what is used for most calculations. This structure contains only the data from the
        # dataset required to construct the tree. Any repetition of attribute data has been removed to reduce load.
        # The 'dataset_occurrence_dict' structure is an unordered dictionary, where the structure itself gives more
        # information about the dataset. For example, every attribute of the data set is a key, which contains
        # a dictionary of its outcomes/possible values, and for each outcome, there is a dictionary showing the
        # distribution of the outcomes for the selected target attribute.
        # Example of the dictionary structure below.
        """ Example structure: (where 'AN'-attribute name; 'ON'-outcome name; 'TON'-target outcome name)
        dataset_occurrence_dict = {"AN 1": {"ON 1": {"TON 1": 1, "TON 2": 2},
                                            "ON 2": {"TON 1": 0, "TON 2": 1},
                                            "ON 3": {"TON 1": 0, "TON 2": 1}
                                            },
                                   "AN 2": {"ON 1": {"TON 1": 4, "TON 2": 0},
                                            "ON 2": {"TON 1": 1, "TON 2": 0}
                                            }
                                   }
        The example above can be read as: for attribute 1 (AN1) there are 3 outcomes - ON1, ON2, ON3.
        The target has 2 possible outcomes, TON1 and TON2. Those values are tracked/accounted for,
        for each possible outcome of each attribute. For AN1, ON1 there is 1 occurrence of TON1 and 2 occurrences of
        TON2. For AN1, ON2 there are 0 occurrences of TON1 and 1 occurrence of TON2, therefore the answer for this
        branch is TON2. Same for AN1, ON3 - answer TON2. If all the occurrences of TON1 and TON2 for attribute 1 (AN1)
        are summed, we get the number of entries in the given data set.
        """
        self.dataset_occurrence_dict = {}
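        # For reference (a small hand-worked illustration, not used by the code): with the Play Tennis
        # data above, generate_occurrences should produce an entry like
        #   "outlook": {"Sunny": {"Yes": 2, "No": 3}, "Overcast": {"Yes": 4, "No": 0}, "Rain": {"Yes": 3, "No": 2}}
        # (the key order may differ, since sets are used to collect the unique values).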
    # region BUILD TREE UTILITIES
    """ Construct dataset distribution/occurrence dictionary - "dataset_occurrence_dict".
    PARAMETERS
    :param (dict) dataset_by_attrib_dict
    :param (list) target_list """
    def generate_occurrences(self, dataset_by_attrib_dict, target_list):
        # TODO: assert that all attribute lists have the same length
        # Update the dictionary with each attribute.
        for attrib_name in dataset_by_attrib_dict.keys():
            # STEP 1: Add the current attribute to the 'dataset_occurrence_dict' structure.
            self.dataset_occurrence_dict.update({attrib_name: {}})
            # STEP 2: Fetch lists containing only the unique values from attribute_list and target_list.
            attribute_list = dataset_by_attrib_dict[attrib_name]
            unique_attrib_outcomes = list(set(attribute_list))
            unique_answers = list(set(target_list))
            # For each unique outcome of the current attribute
            for attrib_outcome in unique_attrib_outcomes:
                # 2.1) Update the dictionary to store the next attribute outcome.
                self.dataset_occurrence_dict[attrib_name].update({attrib_outcome: {}})
                # print(self.dataset_occurrence_dict)
                # 2.2) For the current attribute outcome, add a zeroed counter for each target outcome.
                for outcome in unique_answers:
                    self.dataset_occurrence_dict[attrib_name][attrib_outcome].update({outcome: 0})
                # print(self.dataset_occurrence_dict)
            # STEP 3: Go through the dataset and count the target outcome occurrences for each attribute outcome.
            for itter in range(len(attribute_list)):
                # 3.1) Fetch the current attribute outcome and the current target outcome from the dataset.
                curr_attrib_occ = attribute_list[itter]
                curr_target_occ = target_list[itter]
                # 3.2) Increase the count for the current target outcome in the current attribute outcome by 1.
                self.dataset_occurrence_dict[attrib_name][curr_attrib_occ][curr_target_occ] += 1
  135. """ After a node is added to the tree the "dataset_occurrence_dict" dictionary should be updated.
  136. PARAMETERS
  137. :param (list) attrib_list - the raw attrib data from the dataset.
  138. :param (list) target_list - the raw target data from the dataset. """
  139. def get_next_branch_occurrences(self, dataset_by_attrib_dict, target_list):
  140. # This is the outcome to update the dataset_occurrence_dict by
  141. # A completely separate dictionary from the original, this dictionary will only hold a subdictionary
  142. # of the original
  143. subdict = copy.deepcopy(dataset_by_attrib_dict)
  144. subtar = copy.deepcopy(target_list)
  145. indices_to_remove = []
  146. attrib_to_remove = None
  147. # Looking through every possible attribute in the dictionary
  148. for attrib_key in subdict:
  149. attrib_found = False
  150. # Count through each list of outcomes for the given attribute.
  151. for count in range(len(subdict[attrib_key])):
  152. # If the active outcome name is equal to the current outcome value in the list
  153. if dataset_by_attrib_dict[attrib_key][count] == self.active_branch_nodes[0].node_name:
  154. attrib_found = True
  155. # According to the algorithm, the attribute containing the currently active outcome
  156. # should be removed
  157. if attrib_key in subdict:
  158. attrib_to_remove = attrib_key
  159. else:
  160. indices_to_remove.append(count)
  161. # print(subdict[attrib_key][count])
  162. # subdict[attrib_key].pop(count)
  163. # TODO: assert that there is only one 0 in the list otherwise it is trying to remove the wrong values
  164. if attrib_found:
  165. break
  166. # Processing the subdict data
  167. #print("Subdict: ", subdict)
  168. del subdict[attrib_to_remove]
  169. for attrib in subdict:
  170. #print("Discarding data in ", attrib)
  171. complete_list = subdict[attrib]
  172. sublist = [value for index, value in enumerate(complete_list) if index not in indices_to_remove]
  173. subdict[attrib] = sublist
  174. #print("After processing the data: ", subdict)
  175. # Processing the subtar data
  176. #print("Discarding data in target list")
  177. #print("Target data before processing: ", subtar)
  178. # print(indices_to_remove)
  179. subtar = [value for index, value in enumerate(subtar) if index not in indices_to_remove]
  180. #print("Target data after processing: ", subtar)
  181. # TODO: Call this function recursively on each branch, pass in the shrinked dictionary
  182. # TODO: test the base case thoroughly
  183. # TODO: Build a new dataset_by_attrib_dict for the current outcome
  184. # TODO: REMOVE outlook from the dataset dict when all its outcomes have children nodes assigned
  185. # (How to know if an attribute is complete???)
  186. return subdict, subtar
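    # For reference (a hypothetical walk-through): when the active outcome is "Sunny", the method above
    # returns a sub-dictionary that keeps only the rows where outlook == "Sunny" (5 of the 14 entries)
    # and drops the "outlook" column, together with the matching 5-entry target list.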
  187. """ Checks if a branch is complete, i.e. the target outcome was found.
  188. PARAMETERS
  189. :param (dict) target_val_dist_for_attrib
  190. :returns (list) comp_branches - contains all the target outcomes reached for the given attribute."""
  191. def track_target_outcomes(self, target_val_dist_for_attrib):
  192. comp_branches = []
  193. # Looks through each attribute outcome
  194. for attrib_outcome_key in target_val_dist_for_attrib.keys():
  195. # Tracks how many non-zero occurrences of a target outcome there are for this attribute outcome.
  196. non_zero_outcome_count = 0
  197. # This variable is set to the target outcome if the branch outcome is (100%) certain.
  198. branch_answer = None
  199. # Checks what the distribution of target outcomes is for the current attribute outcome.
  200. # Ex: question - how sdo people drive based on the terrain, if the terrain is flat do they drive slow
  201. # or fast, and what is it if the terrain is steep.
  202. # Target outcomes - fast and slow; attrib outcomes - flat and steep.
  203. # Distribution dictionary looks like this ->{'fast': {'slow': 0, 'fast': 1}, 'steep':{'slow': 2, 'fast': 1}}
  204. for target_outcome_key in target_val_dist_for_attrib[attrib_outcome_key].keys():
  205. # Fetch the number of occurrences for each target outcome for the current attribute
  206. """"Another Example: if the target is can_buy_computer(possible values/outcomes: Yes or No) and the current
  207. attribute is age (possible values/outcomes: <=30, 31..40 and >40) this will return how many of the entries
  208. where age is <=30 are no, then how many of the entries where age is <=30 are yes, then how many
  209. of the entries where age is 31..40 are yes and so on, until all cases are looked at. """
  210. outcome_occurrences = target_val_dist_for_attrib[attrib_outcome_key][target_outcome_key]
  211. # Check if the answer is certain and end the branch, i.e. count how many branches have
  212. # certain target outcome
  213. if outcome_occurrences > 0:
  214. non_zero_outcome_count += 1
  215. if non_zero_outcome_count == 1:
  216. branch_answer = target_outcome_key
  217. if non_zero_outcome_count == 0:
  218. print("INVALID RESULT!")
  219. elif non_zero_outcome_count == 1:
  220. print("THE ANSWER FOR <<", attrib_outcome_key, ">> is <<", branch_answer, ">>")
  221. comp_branches.append({attrib_outcome_key: branch_answer})
  222. elif non_zero_outcome_count > 1:
  223. print("THE BRANCH <<", attrib_outcome_key, ">> IS STILL ACTIVE!")
  224. return comp_branches
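    # For reference: with the full Play Tennis data, the "outlook" distribution yields [{"Overcast": "Yes"}],
    # because "Overcast" occurs with 4 "Yes" and 0 "No" answers, while the "Sunny" and "Rain" branches
    # remain active (both target outcomes occur for them).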
    # Counts the occurrences of each value for a given attribute.
    def count_value_occ(self, unique_values, attrib_data):
        attrib_val_occ = {}
        # Build the dictionary with a count for each unique value.
        for u_value in unique_values:
            attrib_val_occ[u_value] = attrib_data.count(u_value)
        return attrib_val_occ
    def calc_entropy(self, attrib_uv_count, overall):
        entropy = 0
        # print("UV: ", attrib_uv_count)
        for key in attrib_uv_count.keys():
            # If there is some occurrence of the value, add its entropy term;
            # otherwise ignore it (when there are 0 occurrences of the value).
            if attrib_uv_count[key] != 0:
                fraction = attrib_uv_count[key] / overall
                target_attrib_calc = fraction * math.log2(fraction)
                entropy += target_attrib_calc
        return abs(entropy)
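    # Worked example for calc_entropy (for reference only): the full Play Tennis target has 9 "Yes"
    # and 5 "No" answers, so
    #   H(target) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) ≈ 0.940
    # which is the target entropy used when picking the root node.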
    def calc_attrib_entropy(self, attrib_occurrences):
        entropy_list = {}
        for attrib_val_key in attrib_occurrences.keys():
            attrib_val = attrib_occurrences[attrib_val_key]
            overall = 0
            for target_values in attrib_val.values():
                overall += target_values
            print("CALC TARGET ENTROPY FOR EACH ATTRIB OUTCOME: ", attrib_val)
            attrib_entropy = self.calc_entropy(attrib_val, overall)
            entropy_list.update({attrib_val_key: attrib_entropy})
        print("Entropy list: ", entropy_list)
        return entropy_list
    # WEIGHTED AVERAGE ENTROPY for the children
    def calc_entropy_weigh_avg(self, target_val_dist_attrib, overall, attrib_entropy):
        weighted_entropy_avg = 0
        for key in target_val_dist_attrib.keys():
            curr_value = 0
            for value in target_val_dist_attrib[key].values():
                curr_value += value
            weighted_entropy_avg += curr_value / overall * attrib_entropy[key]
            # overall += curr_value
        return weighted_entropy_avg
    def calc_info_gain(self, target_entropy, target_dist_for_attrib):
        # CALCULATE ENTROPY OF THE ATTRIBUTE
        attrib_entropy = self.calc_attrib_entropy(target_dist_for_attrib)
        # print("Attrib Entropy: ", attrib_entropy)
        # Use the number of entries covered by this distribution (the size of the data subset currently being
        # split) rather than TRAIN_DATA_SIZE, so the weighting stays correct for sub-branches deeper in the tree.
        data_size = sum(sum(target_counts.values()) for target_counts in target_dist_for_attrib.values())
        weighted_avg_e = self.calc_entropy_weigh_avg(target_dist_for_attrib, data_size, attrib_entropy)
        # print("Attrib Weighted AVG: ", weighted_avg_e)
        attrib_info_gain = target_entropy - weighted_avg_e
        return attrib_info_gain
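    # For reference, with the full Play Tennis data the gains come out to roughly
    #   G(outlook) ≈ 0.246, G(humidity) ≈ 0.151, G(wind) ≈ 0.048, G(temperature) ≈ 0.029,
    # so "outlook" is chosen as the root node (see also the RESOURCES tutorials at the top of the file).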
    # IMPORTANT NOTE: An attribute node should always be made together with its outcomes, never an outcome alone,
    # as that is not how this function was set up.
    # :param (str) name - should always be the name of an attribute.
    def build_node(self, name, completed_branches):
        attrib_node = Node(name)
        derived_nodes = []
        completed_outcomes = []
        for branch in completed_branches:
            completed_outcomes.append(list(branch.keys())[0])
        # If all outcome branches for this attribute are completed, then the attribute is complete and its outcomes
        # should be popped off the active_branch_nodes list.
        # print(">>>>>>>>>> CHECK COMPLETE ATTRIB: ", completed_branches)
        # print(self.dataset_occurrence_dict[name].keys())
        for outcome_name in self.dataset_occurrence_dict[name]:
            new_outcome_node = Node(outcome_name)
            # print("STATUS: NEW OUTCOME NODE CREATED")
            # Check if the branch for the current outcome is complete (the target answer is 100% certain).
            for branch in completed_branches:
                if outcome_name in branch:
                    # print("FOUND OUTCOME <<", outcome_name, ">> in ", branch)
                    if len(new_outcome_node.derived_nodes) == 0:
                        # Formally end the node.
                        endpoint_node = Node(branch[outcome_name], None)
                        new_outcome_node.derived_nodes.append(endpoint_node)
                        # print("STATUS: NEW OUTCOME ENDPOINT NODE CREATED & LINKED")
            # The temp_outcome node is created so that the outcome node stored in the tree and the outcome node stored
            # in the active_branch_nodes list are the same. This is important because I never append directly onto the
            # tree but to a reference of the active branch of the tree. This allows appending to any depth of the tree
            # without needing to do any traversal to find the next available node.
            temp_outcome = copy.deepcopy(new_outcome_node)
            derived_nodes.append(temp_outcome)
            # If the branch is still active/available to add to
            if outcome_name not in completed_outcomes:
                # Add the new node to the active branch list.
                self.active_branch_nodes.append(temp_outcome)
            """print("!!!!!!!!!! Completed Nodes:", acc_completed)
            acc_completed[name]["completed"] = True
            all_outcomes_list = list(self.dataset_occurrence_dict[name].keys())
            for outcome in all_outcomes_list:
                if outcome in acc_completed[name]["outcomes"]:
                    print(">>>>>>>>>> ", outcome, " TRUE")
                else:
                    print(">>>>>>>>>> ", outcome, " FALSE")
                    acc_completed[name]["completed"] = False
            print(all_outcomes_list)"""
            new_outcome_node.derived_nodes.clear()
            # print("STATUS: NEW NODE CREATED")
        attrib_node.derived_nodes = derived_nodes
        return attrib_node
    # IMPORTANT NOTE: active_branch_nodes is only updated when the build_node function is called, therefore
    # the link will not be appropriate unless the node was created through the build_node function.
    def link_node(self, new_node):
        """
        print(" <<< CHECKING IF THE TREE SEGMENT IS BUILT RIGHT! >>> ")
        # TEMP
        print("ATTRIBUTE/PARENT NODE: ", new_node.node_name)
        print("DERIVED NODES LIST: ", new_node.derived_nodes)
        print("FOR EACH NODE IN DERIVED NODES.")
        for node in new_node.derived_nodes:
            print("\t OUTCOME NODE FOR ATTRIB: ", node.node_name)
            for other in node.derived_nodes:
                print("\t\t TARGET OUTCOME REACHED: ", other.node_name)"""
        if self.root_node is None:
            self.root_node = new_node
        else:
            # Add the new node to the tree.
            # Index 0 is hard-coded as the active node index because index 0 is always the next available node to link to.
            self.active_branch_nodes[0].derived_nodes.append(new_node)
            # Update the available nodes!
            # The node at index 0 is now taken, so that node should be popped off.
            self.active_branch_nodes.pop(0)
    # Builds a part of the tree (an attribute node with set-up derived/outcome nodes) and links it to the tree.
    def build_tree_chunk(self, dataset_by_attrib_dict, target_attrib_list):
        self.generate_occurrences(dataset_by_attrib_dict, target_attrib_list)
        # print("Main DICTIONARY", self.dataset_occurrence_dict)
        # TARGET ATTRIBUTE CALCULATIONS - Required for the calculation of info_gain for the rest of the attributes.
        target_uv_data = list(set(target_attrib_list))  # TODO: POSSIBLE EFFICIENCY DECREASE
        target_uv_count = self.count_value_occ(target_uv_data, target_attrib_list)
        # print("Target Unique Value Count: ", target_uv_count)
        # The entropy is calculated over the data passed into this chunk (which shrinks with every level of the
        # tree), not over the full training set.
        target_entropy = self.calc_entropy(target_uv_count, len(target_attrib_list))
        # print("TARGET ENTROPY: ", target_entropy)
        # Build each node (calc its entropy and info_gain, and assign each attribute's outcomes as children),
        # store the node in the node list and sort the nodes by info_gain to build the tree with them.
        next_node_data = {"name": None, "info gain": 0, "completed": None}
        for attrib_name in self.dataset_occurrence_dict.keys():
            print("\n", "-" * 50)
            # ATTRIBUTE CALCULATIONS
            print("attrib_name: ", attrib_name)
            # Contains a data structure representing the target attribute's value distribution
            # with regard to another attribute.
            target_dist_for_attrib = self.dataset_occurrence_dict[attrib_name]
            # print("Target occurrences: ", target_dist_for_attrib)
            # Check if any of the branches are completed.
            completed_branches = self.track_target_outcomes(target_dist_for_attrib)
            print("COMPLETED BRANCHES: ", completed_branches)
            attrib_info_gain = self.calc_info_gain(target_entropy, target_dist_for_attrib)
            # print("The INFO GAIN for <<", attrib_name, ">> is ", attrib_info_gain)
            if next_node_data["info gain"] < attrib_info_gain:
                next_node_data["name"] = attrib_name
                next_node_data["info gain"] = attrib_info_gain
                next_node_data["completed"] = completed_branches
        print("------> The next new node is: ", next_node_data["name"], "\n\n")
        new_node = self.build_node(next_node_data["name"], next_node_data["completed"])
        self.link_node(new_node)
    # endregion
    def build_tree(self, dataset_by_attrib_dict, target_attrib_list):
        self.build_tree_chunk(dataset_by_attrib_dict, target_attrib_list)
        print("\n\n")
        while len(self.active_branch_nodes) != 0:
            print(">>>>>>>>>>>>>>>>>>> Current active node: ", self.active_branch_nodes[0].node_name)
            # self.linked_attrib_names
            sub_attrib_dict, sub_tar_list = self.get_next_branch_occurrences(dataset_by_attrib_dict, target_attrib_list)
            self.build_tree_chunk(sub_attrib_dict, sub_tar_list)
            print("\n\n>>>>>>>>>>>>>>>>>>> List of active nodes: ", self.active_branch_nodes)
        print("\n\n", "<" * 5, "THE TREE IS COMPLETE!", ">" * 5, "\n\n")

    def visualise_tree(self):
        # Only prints the root for now; see the recursive sketch below for a full traversal.
        current_node = self.root_node
        if current_node is not None:
            print(current_node.node_name)
        # TODO do this recursively, base case -> len(node.derived_nodes) == 0
        # EXTRA TODO pass in a variable called branch_track that will start off as "",
        # each time a recursion is spawned add a "\t", that way the print will have a sort of hierarchy.
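    # A minimal recursive sketch of the traversal described in the TODOs above (an assumed helper,
    # not called anywhere else in this file): prints every node, indenting one "\t" per tree level.
    def print_subtree(self, node, branch_track=""):
        if node is None:
            return
        print(branch_track + str(node.node_name))
        # Base case: endpoint/leaf nodes have no derived nodes to recurse into.
        if not node.derived_nodes:
            return
        for derived_node in node.derived_nodes:
            self.print_subtree(derived_node, branch_track + "\t")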
    # This function runs classification on one entry and returns the answer.
    # It should only be called after the tree model has been built.
    def classify(self, entry_index, dataset_by_attrib_dict):
        answer = None
        # TODO: assert that root node is not None
        current_node = self.root_node
        while current_node.derived_nodes is not None:
            print("\n <<< TRAVERSING TREE >>> ")
            print("Current Attrib: ", current_node.node_name)
            # Ask the tree which attribute/column to look for first.
            column_name = current_node.node_name
            # Fetch the value for the given entry (entry_index) from the column identified by the tree.
            current_outcome_name = dataset_by_attrib_dict[column_name][entry_index]
            print("\tCurrent outcome name: ", current_outcome_name)
            # Get that node from the derived nodes list.
            for outcome_node in current_node.derived_nodes:
                if outcome_node.node_name == current_outcome_name:
                    # print("\n <<< TRAVERSING TREE >>> ")
                    # print("FOUND VALUE FOR ENTRY <<", entry_index, ">> -> <<", outcome_node.node_name, ">>")
                    current_node = outcome_node.derived_nodes[0]
                    # print("Current Attrib: ", current_node.node_name)
            answer = current_node.node_name
        print(" <<< FOUND VALUE >>> ")
        print(" The answer is: ", answer)
        return answer

def test_run_algorithm():
    print(" " * 10, " << ID3 CLASSIFICATION ALGORITHM >> ", " " * 10)
    tree = ID3DecisionTree()
    tree.build_tree(DATASET_BY_ATTRIB_DICT, TARGET_ATTRIB_LIST)
    # APPLY CLASSIFICATION
    # The index of the entry in the dataset.
    entry_index = 0
    tree.classify(entry_index, DATASET_BY_ATTRIB_DICT)


test_run_algorithm()
  432. """
  433. # Remove the completed branches
  434. for branch in completed_branches:
  435. for key in branch.keys():
  436. target_val_dist_for_grade.pop(key)
  437. print("After removing completed branches: ", target_val_dist_for_grade)
  438. """
  439. # region Build Decision Tree
  440. # endregion
  441. """
  442. What is "Training Data"?
  443. Building the tree is done with training data which already has the answer to whatever question is being asked.
  444. the example given with the data on the slides that asks if someone can buy a laptop is training data
  445. because it already knows the answer.
  446. """
  447. """
  448. Apply information gain function to each attribute calculate_gain(attr_out)
  449. Should that be applied to the target as well? No
  450. Example:
  451. - G(train_data, O) = 0.246
  452. - G(train_data, H) = 0.151
  453. - G(train_data, W) = 0.048
  454. Once the root node is known, look at how many unique values are there.
  455. If there are 4 possible values and they are not numbers,
  456. for example "Sunny", "Rainy", etc. there should be 4 nodes.
  457. """
# region Apply Classification
"""
What is "Test Data"?
Test data is when we get a new entry and we want to classify it.
For example: a bank may use an already trained ID3 model to check whether you should get a credit card or not.
It will have different attributes such as: the number of times you have gone bankrupt; your current net worth;
whether you are a student; your credit score; etc.
The target attribute will then be EligibleForCreditCard (True or False).
"""
# Use the built decision tree to look through a row of data from the data set. This is done using test data.
# (How to evaluate whether the classification has an error?)
"""
Steps: (a usage sketch follows below)
1. Find which attribute to look through first (to start with, ask the tree which attribute is the root node).
   1.1 (When building the tree, make sure the attributes have the exact same names as the Node data.)
   1.2 Search through all possible attributes.
   1.3 Check if the attribute name == the node name.
2. Find the attribute value for the current row.
   2.1 Ask the data set which value is given for this attribute.
   2.2 Find which of the children nodes in the tree is equivalent to the given value.
Repeat these steps recursively until an answer is found.
"""
# endregion