A natural language processing and data visualization project.
Hosted at: https://xkcd-data.herokuapp.com/
This single-page web application lets users interact with xkcd comics clustered by similarity. In the course of building this project, I learned how to clean data, use different natural language analysis techniques, build an interactive and reactive data visualization, and host a web application.
Feature Distribution
Users can select multiple features and see their full distribution over the t-SNE plot (orange dots). This lets users see which comics contribute to a feature's total sum (blue bar in the TF-IDF bar chart). When multiple features are selected, comics that contain more than one of the selected features are highlighted (black dots).
Brushable Scatterplot of Comic Relations and TFIDF Values of Top 30 Words in Values Selected by Brush
Users can click and drag to select and zoom on the scatterplot. On a brush event, the bar chart is populated, for the top 30 words, with the TF-IDF values of the comic picked by click, the summed TF-IDF values of the comics selected by the brush, and the total TF-IDF values.
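The summing step behind the bar chart can be sketched roughly as follows. This is a minimal illustration, not the code in application.py: the function name `top_terms`, the toy vocabulary, and the matrix values are all hypothetical, and it only assumes the TF-IDF values are held in a scipy sparse matrix with one row per comic.

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_terms(tfidf, vocab, rows, k=30):
    """Sum TF-IDF over the selected rows and return the k highest-scoring terms."""
    # Row slicing and column sums stay cheap on a CSR matrix.
    totals = np.asarray(tfidf[rows].sum(axis=0)).ravel()
    order = np.argsort(totals)[::-1][:k]
    # Drop terms that do not occur in any selected comic.
    return [(vocab[i], float(totals[i])) for i in order if totals[i] > 0]


# Hypothetical toy data: 3 comics (rows) by 4 terms (columns).
vocab = ["math", "robot", "romance", "sarcasm"]
tfidf = csr_matrix(np.array([
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.25, 0.0, 0.0, 0.125],
]))

brushed = [0, 2]  # row indices of the comics inside the brush
print(top_terms(tfidf, vocab, brushed, k=2))
```

Passing a single row index list such as `[1]` gives the values for one clicked comic, and passing every row index gives the totals across all comics.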
Selected Comic Panel
See what comic you have clicked on in order to visually compare comics.
xkcd is "A webcomic of romance, sarcasm, math, and language" (xkcd slogan). The comics are licensed under a Creative Commons Attribution-NonCommercial 2.5 License, and their transcripts are available on www.explainxkcd.com (the xkcd wiki). This web application uses the first 2283 comics as its data source.
Languages and libraries:
BeautifulSoup for scraping text from the HTML of www.explainxkcd.com (the xkcd wiki)
nltk's SnowballStemmer and spaCy's lemmatizer for text cleaning, to increase coherence and reduce the number of dimensions
sklearn's TfidfVectorizer for feature extraction
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is computed by multiplying two metrics: the term frequency (how many times a word appears in a document) and the inverse document frequency of the word across the set of documents. This downweights words that are common to most documents, since those very frequent terms would otherwise overshadow the rarer yet more interesting terms.
sklearn's TruncatedSVD for reducing the feature space and performing latent semantic analysis
sklearn's TSNE for building document relations expressed as coordinates in the 2D plane
TSNE (t-distributed Stochastic Neighbor Embedding) is a tool to visualize high-dimensional data in a 2D plane, where similar comics turn into neighboring points.
D3.js for the front end, data visualization, and event behavior (click, hover, zoom, etc.)
Flask as the web server; since the TF-IDF values are stored in a scipy sparse matrix, summing and slicing the arrays is efficient and fast when done in Python.
Bootstrap for building a responsive web layout.
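As a rough sketch of how the feature-extraction steps in the list above fit together, the snippet below chains TfidfVectorizer, TruncatedSVD, and TSNE on a handful of made-up sentences. The transcripts, the `embed` helper, and every parameter value (3 SVD components, perplexity 3, random init, seed 0) are assumptions chosen so the toy example runs; they are not the settings used for the 2283 comics, and the stemming and lemmatization steps are left out.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Hypothetical stand-ins for cleaned comic transcripts.
transcripts = [
    "the robot learns math and language",
    "a romance about math and graphs",
    "sarcasm about the velociraptor attack",
    "velociraptor attack during the math lecture",
    "the computer compiles code slowly",
    "code review with sarcasm and romance",
    "graphs of language use over time",
    "a robot writes code about graphs",
]


def embed(docs, n_topics=3, perplexity=3.0, seed=0):
    """Return (sparse TF-IDF matrix, SVD-reduced matrix, 2D t-SNE coordinates)."""
    # 1. Sparse document-term matrix weighted by TF-IDF.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    # 2. Latent semantic analysis: project onto a few dense components.
    reduced = TruncatedSVD(n_components=n_topics, random_state=seed).fit_transform(tfidf)
    # 3. 2D layout in which similar documents land near each other.
    #    perplexity must be smaller than the number of documents.
    coords = TSNE(
        n_components=2, perplexity=perplexity, init="random", random_state=seed
    ).fit_transform(reduced)
    return tfidf, reduced, coords


tfidf, reduced, coords = embed(transcripts)
print(tfidf.shape, reduced.shape, coords.shape)
```

The `coords` array has one (x, y) row per document, which is the shape a front-end scatterplot would need; the sparse `tfidf` matrix is kept alongside it so per-word values can still be summed for any selection.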