Xkcd Data Visualization

Similar comics are clustered together!

Feature Distribution Menu

{{num_features}} Features

Selected Features:

Comics containing more than one of the selected features:


Feature Distribution

TF-IDF of Top Words in Selected Comics

TF-IDF to LSA to t-SNE Comic Relations

Pick Comic 1-{{num_comics}}

Picked Comic Info

221: Random Number

Alt Text

Process

  1. scrape each comic's transcript, title, and alt text from explainxkcd.com
  2. clean the data by removing domain-specific stop words (e.g. character names), then lemmatize and stem the words (e.g. “chocolates”, “chocolatey”, and “choco” all count as the root word “chocolate”)
  3. represent each of the {{num_comics}} comics as a {{num_features}}-dimensional text vector of term frequency × inverse document frequency (tf-idf) scores
  4. reduce the effects of synonymy and polysemy, and shrink the feature space from 7000 unique words to 50 features, by performing latent semantic analysis (LSA) with truncated SVD
  5. create a 2D embedding of document relations with t-SNE, so that similar comics are located closer together
  6. build an interactive, reactive data analysis web application with D3.js, Bootstrap, and Flask
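
The tf-idf step (step 3) can be sketched in plain Python as follows. This is a minimal illustration and not the application's actual code: the three toy documents, the helper names `tfidf_vectors` and `cosine`, and the smoothed idf formula are all assumptions made for the example. It stops before LSA and t-SNE (steps 4 and 5), which would normally be done with a numerical library.

```python
import math
from collections import Counter

# Hypothetical stand-ins for cleaned comic transcripts (not real xkcd data).
docs = [
    "random number generator dice roll",
    "dice roll random chance",
    "romance heart love letter",
]


def tfidf_vectors(documents):
    """Return (vocabulary, one tf-idf vector per document)."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed idf (an assumed variant): rarer words get larger weights.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty documents
        vectors.append([counts[w] / total * idf[w] for w in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocab, vectors = tfidf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])    # documents sharing several words
sim_unrelated = cosine(vectors[0], vectors[2])  # documents sharing no words
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

Documents that share vocabulary score a higher cosine similarity than documents that share none, which is the property the later LSA and t-SNE steps rely on to place similar comics near each other.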

Background Information on Xkcd Comics

xkcd is "A webcomic of romance, sarcasm, math, and language" (the xkcd slogan). The comics are licensed under a Creative Commons Attribution-NonCommercial 2.5 License, and their transcripts are available on www.explainxkcd.com (a wiki for xkcd comics). This web application uses the first {{num_comics}} comics as its data source.