Feature Distribution Menu
{{num_features}} Features
Selected Features:
comics containing multiple of the selected features:
Feature Distribution
TFIDF of Top Words in Selected Comics
TFIDF to LSA to TSNE Comic Relations
Pick Comic 1-{{num_comics}}
Picked Comic Info
221: Random Number
Alt Text
Process
- scrape comics transcript, title, and alt-text from explainxkcd.com
- clean data by removing domain-specific stop-words (e.g. character names), lemminize and stem words (e.g. “chocolates”, “chocolatey”, “choco” all count as the root word, “chocolate”)
- represent each of the {{num_comics}} comic as a {{num_features}}-dim text vector of term-frequency x inverse document frequency (tf-idf) scores
- reduce the effects of synonymy and polysemy, and reduce the feature space from 7000 unique words to 50 feature by perform latent semantic analysis with truncated svd
- create 2d embedding of document relations with t-sne that shows similar comics located closer together
- build interactive, reactive data analysis web application with d3.js, bootstrap, and flask
Background Information on Xkcd Comics
xkcd comics are "A webcomic of romance, sarcasm, math, and language" (xkcd slogan). These comics are licensed under a Creative Commons Attribution-NonCommercial 2.5 License, and their transcripts are available on www.explainxkcd.com (xkcd comic's wiki). This web application uses the first {{num_comics}} comics as data source.