A collection of tools for archiving and analyzing the internet.
All scripts in this tool set are designed to be used with Python 3. Python 2 is not supported.
Note: Due to its popularity, warc-extractor has been moved to a package of its own. It is also available to download directly through pip:
python3 -m pip install warc-extractor
This should install warc-extractor as a command-line tool. It can be run as follows:
warc-extractor --help
Json-extractor.py is a short script designed to extract a condensed CSV file from a collection of line-separated JSON files. This script is designed for use with the data output of http://github.com/edsu/twarc and all of the scrapers in this project.
Json-extractor.py is a program that searches JSON files line by line and extracts specified elements. The script expects each line in a file to be a valid JSON object. As this program was primarily created to scan twarc output, all examples below assume twitter data.
Note: If a .json file as a whole is a single valid JSON document, the entire file is treated as one JSON entry. The sole exception is when the root object is a list, in which case the parser treats each entry in the list as a separate object.
Note: The parser currently handles newlines inside a JSON object poorly. It assumes that all unnecessary whitespace in a .json file has been removed.
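The reading rules above can be sketched roughly as follows. This is not the code from json-extractor.py: the function name `iter_json_objects` is hypothetical, and the choice to try the whole file before falling back to line-by-line parsing is an assumption made for illustration.

```python
import json


def iter_json_objects(text):
    """Yield the JSON entries found in the text of one .json file.

    Hypothetical helper illustrating the rules described above.
    """
    try:
        # Case 1: the whole file is a single valid JSON document.
        root = json.loads(text)
    except json.JSONDecodeError:
        # Case 2: line-separated JSON, one object per non-empty line.
        for line in text.splitlines():
            line = line.strip()
            if line:
                yield json.loads(line)
        return
    if isinstance(root, list):
        # A root list is treated as a sequence of separate entries.
        yield from root
    else:
        yield root
```

A line that contains only part of an object still raises `json.JSONDecodeError` in the fallback branch, which is consistent with the whitespace caveat above.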
python3 json-extractor.py text
This will search every .json file in the current directory and create a file 'output.csv' containing the text of every tweet it finds. The arguments are the data elements that the extractor will extract. For example:
python3 json-extractor.py text id created_at
This command will create a CSV file with three columns: the text of the tweet first, the id of the tweet second, and the created_at time stamp third. These labels match the JSON labels in the .json file exactly.
If an element is nested within another object, separate the object and the attribute with the character ":".
user:screen_name
This returns the "screen_name" attribute inside the "user" object.
entities:hashtags:text
This returns the "text" attribute inside the hashtags object, which is itself inside the entities object. The hashtags object is a list. When the interpreter reaches a list, it copies the line for each entry found, so a single tweet with two hashtags will output two lines in the CSV if hashtags are to be recorded.
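As an illustration of the colon-separated paths and the row duplication described above, here is a minimal sketch. The names `resolve` and `rows_for`, and the empty-string placeholder for missing values, are assumptions for this example and are not taken from json-extractor.py.

```python
from itertools import product


def resolve(obj, path):
    """Return every value reached by following a colon-separated path."""
    values = [obj]
    for key in path.split(":"):
        found = []
        for value in values:
            # A list fans out: the key is looked up in each of its entries.
            candidates = value if isinstance(value, list) else [value]
            for item in candidates:
                if isinstance(item, dict) and key in item:
                    found.append(item[key])
        values = found
    return values


def rows_for(obj, paths, missing=""):
    """Build CSV rows for one object, one row per combination of list entries."""
    columns = [resolve(obj, path) or [missing] for path in paths]
    return [list(row) for row in product(*columns)]


tweet = {
    "text": "hello",
    "user": {"screen_name": "example_user"},
    "entities": {"hashtags": [{"text": "one"}, {"text": "two"}]},
}
rows = rows_for(tweet, ["user:screen_name", "entities:hashtags:text", "text"])
# rows holds two lines, one per hashtag, each repeating the screen name and text
```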
-h
-string
-path
-id
-NA
-compress
-output
-dialect
-start
-end
-hashtag
To grab the screen name, hashtags, and text from all Rob Ford tweets in the folder /path/to/folder:
python3 json-extractor.py -path /path/to/folder -string Rob user:screen_name entities:hashtags:text text
To grab all tweet text from January 1, 2014 in the current folder:
python3 json-extractor.py -start 01:01:2014 -end 01:02:2014 text
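The date window in the example above can be pictured with the following sketch. It assumes that -start and -end use a MM:DD:YYYY layout, that the end date is exclusive (inferred from the example, which selects only January 1), and that created_at uses the classic Twitter layout such as "Wed Jan 01 12:00:00 +0000 2014". The function names are hypothetical.

```python
from datetime import datetime

# Assumed layout of Twitter's created_at field.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_bound(value):
    """Parse a bound such as '01:01:2014' (assumed MM:DD:YYYY) into a date."""
    return datetime.strptime(value, "%m:%d:%Y").date()


def in_window(created_at, start, end):
    """Return True when the tweet date falls in [start, end)."""
    day = datetime.strptime(created_at, TWITTER_FORMAT).date()
    return parse_bound(start) <= day < parse_bound(end)
```

The `%a` and `%b` directives depend on the locale, so this sketch expects English day and month names.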
Imageboard-scraper.py is a simple script designed to interact with image boards based on the 4chan API. Running the program collects all posts made since the script was last run. If the script has not been run before, it collects all current posts.
python3 imageboard-scraper.py trv
Running the script with a single argument will download all posts in the associated board. The above command will download all posts in the 4chan /trv/ (travel) board.
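For orientation, the sketch below shows how the public 4chan JSON API addresses a board and how "posts since the last run" could be selected. How imageboard-scraper.py actually stores its state is not described here, so `new_posts` and its `last_seen_no` parameter are hypothetical. No network request is made in this example.

```python
API_ROOT = "https://a.4cdn.org"


def thread_list_url(board):
    """URL of the JSON list of threads for a board such as 'trv'."""
    return f"{API_ROOT}/{board}/threads.json"


def thread_url(board, thread_no):
    """URL of the JSON document holding all posts of one thread."""
    return f"{API_ROOT}/{board}/thread/{thread_no}.json"


def new_posts(posts, last_seen_no):
    """Keep posts whose number is above the highest one seen previously.

    Passing 0 (no earlier run) keeps every post.
    """
    return [post for post in posts if post["no"] > last_seen_no]
```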
-h
-output
-image
-url
Warc-extractor.py is a tool designed to filter and extract files from warc archive files. The script serves three different purposes.
python3 warc-extractor.py
Running the program without any arguments scans all of the warc files in the current directory and outputs some basic information about those files.
Warc-extractor.py accepts an unlimited number of filter options. A filter option controls which warc entries the script scans.
python3 warc-extractor.py warc-type:request
In the above example the script will output basic information about all of the warc entries where the warc header 'warc-type' is set to request (case insensitive). Substrings are allowed in the second part so 'warc-type:requ' would be equivalent while 'warc-type:re' would return both 'request' and 'response' entries.
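The matching rule described above (case-insensitive, substring allowed in the second part) can be sketched like this. The function name `matches` and the plain-dict representation of the warc headers are assumptions for illustration, not the script's internals.

```python
def matches(headers, filter_text):
    """Return True when the named header contains the wanted substring."""
    name, _, wanted = filter_text.partition(":")
    # Compare header names and values without regard to case.
    lowered = {key.lower(): value.lower() for key, value in headers.items()}
    value = lowered.get(name.lower())
    return value is not None and wanted.lower() in value


request = {"WARC-Type": "request"}
response = {"WARC-Type": "response"}
# 'warc-type:requ' selects only the request; 'warc-type:re' selects both.
```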
Many warc entries also contain HTTP headers which can also be accessed by filter.
python3 warc-extractor.py http:content-type:pdf
The above command finds all warc entries that contain PDFs. Specifically, it filters out any warc entry that does not contain an HTTP header 'content-type' containing the string 'pdf'. (Note: supplying any HTTP filter implicitly filters out any warc entry that does not contain an HTTP request or response.)
There is also some information found in an HTTP object's version line. This information can be accessed via some special operators: error, command, path, status, version. The most important of these is error.
python3 warc-extractor.py http:error:200
The above command would filter out any HTTP responses that did not return error code 200, and would implicitly remove HTTP requests, which do not contain error codes.
Additionally, negative searches are also allowed.
python3 warc-extractor.py \!http:content-type:pdf
The above command would return all warc entries that do not contain PDFs. (Note: the '\' character is required because '!' is a reserved character in bash.)
Once you have verified that the script is grabbing only the warc entries that are required, the contents of the found warc entries can be dumped in two different ways.
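The filter syntax described so far (an optional leading '!', an optional 'http:' prefix, then a header name and a substring) could be parsed along these lines. The entry layout, a dict with a "warc" header dict and an optional "http" header dict using lowercase names, is a hypothetical stand-in for whatever the script uses internally.

```python
def parse_filter(text):
    """Split a filter into (negated, scope, header name, wanted substring)."""
    negated = text.startswith("!")
    if negated:
        text = text[1:]
    scope = "warc"
    if text.lower().startswith("http:"):
        scope = "http"
        text = text[len("http:"):]
    name, _, wanted = text.partition(":")
    return negated, scope, name.lower(), wanted.lower()


def keep(entry, text):
    """Return True when the entry passes the filter."""
    negated, scope, name, wanted = parse_filter(text)
    headers = entry.get(scope)
    if headers is None:
        # An HTTP filter drops entries that carry no HTTP request or response.
        return False
    value = headers.get(name)
    hit = value is not None and wanted in value.lower()
    return hit != negated
```

By the time the script sees the argument, bash has already removed the backslash, so the filter text starts with a plain '!'.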
python3 warc-extractor.py some:filter -dump warc
The above command would create a new warc file containing only the filtered entries.
python3 warc-extractor.py some:filter -dump content
The above command would attempt to extract the contents of the filtered entries. (Note: the -dump flag implicitly adds "warc-type:response" and "content-type:application/http" to the filters, because warc entries that do not match these filters do not contain file-like objects.)
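To show what "extracting the contents" of a response entry involves, the sketch below splits a raw HTTP response payload into its status line, headers, and body. It is a simplified illustration, not the script's implementation: the name `split_http_response` is hypothetical, and chunked transfer encoding and compressed bodies are deliberately ignored.

```python
def split_http_response(payload):
    """Split raw HTTP response bytes into (status line, headers, body)."""
    head, separator, body = payload.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no blank line between HTTP headers and body")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lowercase.
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body
```

The returned body is the part that would be written to disk as the extracted file.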
-h
-string
-path
-output_path
-output
-dump
-silence
-error
To create a warc file containing all HTTP responses that are not file-like objects:
python3 warc-extractor.py -dump warc warc-type:response \!content-type:application/http
To dump all PDFs from a warc file to disk:
python3 warc-extractor.py -dump content http:content-type:pdf
To dump everything a warc file contains to disk:
python3 warc-extractor.py -dump content http:error:200
Warc files are complicated and huge. Creating a single script that can properly handle all of the many strange and wonderful objects that might be hidden in a warc file is a large undertaking. Because of this, bugs are inevitable.
The script contains an -error option designed to make dealing with problematic warc entries a bit easier. If the -error flag is supplied, the script will do its best to skip all entries that cause errors and then write all problematic entries to a new warc file, 'error.warc'. Should the script fail, please try running it again with the -error flag and then upload the resulting 'error.warc' file along with the bug report.
There are many possible problems a warc file could contain that are not limited to specific entries. In these situations the -error flag will not prevent the error and will not create the error.warc file. In these cases please still file a bug report. However, the problem is unlikely to be fixed unless I can get access to the warc file that caused it.
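The skip-and-collect behaviour described for -error follows a common pattern, sketched below under stated assumptions. The function `process_all` and its `handle` callback are hypothetical; the real script writes the failing entries to 'error.warc' rather than returning them.

```python
def process_all(entries, handle):
    """Process each entry, setting aside the ones that raise an error."""
    results, failed = [], []
    for entry in entries:
        try:
            results.append(handle(entry))
        except Exception:
            # Keep the problematic entry so it can be attached to a bug report.
            failed.append(entry)
    return results, failed
```

A failure that occurs while the entries themselves are being read, before `handle` is ever called, is not caught by this loop, which mirrors the limitation described above.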
One final note: this script was programmed and tested on a Linux platform. In theory it should work on any platform that Python 3 works on; however, I make no guarantees. Help on this issue would be greatly appreciated.