crau is the way (most) Brazilians pronounce "crawl". It's the easiest command-line tool for archiving the Web and playing back archives: all you need is a list of URLs.
pip install crau
Archive a list of URLs by passing them on the command line:
crau archive myarchive.warc.gz http://example.com/page-1 http://example.org/page-2 ... http://example.net/page-N
or passing a text file (one URL per line):
echo "http://example.com/page-1" > urls.txt
echo "http://example.org/page-2" >> urls.txt
echo "http://example.net/page-N" >> urls.txt
crau archive myarchive.warc.gz -i urls.txt
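If the URLs come from a messier source, the file can be cleaned up before it is handed to crau. The following is only a sketch: `raw-urls.txt` is a hypothetical input name (not something crau defines), the cleanup uses plain `grep` and `sort` rather than any crau feature, and the final call is skipped when crau is not installed. Note that `sort -u` reorders the URLs.

```shell
#!/bin/sh
# Hypothetical input file: the name raw-urls.txt is illustrative only.
cat > raw-urls.txt <<'EOF'
# pages to archive
http://example.com/page-1

http://example.org/page-2
http://example.com/page-1
EOF

# Drop comment lines and blank lines, then remove duplicates.
grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' raw-urls.txt | sort -u > urls.txt

# Archive only if crau is available on this machine.
if command -v crau >/dev/null 2>&1; then
    crau archive myarchive.warc.gz -i urls.txt
fi
```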
Run crau archive --help for more options.
List archived URLs in a WARC file:
crau list myarchive.warc.gz
Extract a file from an archive:
crau extract myarchive.warc.gz https://example.com/page.html extracted-page.html
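To extract several pages in one go, the command above can be wrapped in a small loop. This is a rough sketch, not a crau feature: `url_to_filename` and `extract_all` are hypothetical helper names, and the loop assumes a text file with one archived URL per line (such as the `urls.txt` used for archiving).

```shell
#!/bin/sh
# Illustrative helper (not part of crau): turn a URL into a flat,
# filesystem-safe file name by dropping the scheme and replacing
# every other unsafe character with an underscore.
url_to_filename() {
    printf '%s' "$1" | sed -e 's|^[a-zA-Z]*://||' -e 's|[^A-Za-z0-9._-]|_|g'
}

# Illustrative helper: extract every URL listed in a file.
# $1 = WARC archive, $2 = text file with one URL per line (assumption).
extract_all() {
    archive=$1
    list=$2
    while IFS= read -r url; do
        # Skip empty lines.
        [ -n "$url" ] || continue
        crau extract "$archive" "$url" "$(url_to_filename "$url")"
    done < "$list"
}
```

After sourcing these functions, `extract_all myarchive.warc.gz urls.txt` would write one file per URL into the current directory.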
Run a server on localhost:8080 to play your archive:
crau play myarchive.warc.gz
There are other archiving tools, of course. The motivation to start this project was a lack of easy, fast and robust software to archive URLs - I just wanted to execute one command without thinking and get a WARC file. Depending on your problem, crau may not be the best answer - check out more archiving tools in awesome-web-archiving.
One of those alternatives makes it easy to use archiving services such as archive.is from the command line, but when it archives pages itself it calls wget to do the job.
Clone the repository:
git clone https://github.com/turicas/crau.git
Install development dependencies (you may want to create a virtualenv):
cd crau && pip install -r requirements-development.txt
Install an editable version of the package:
pip install -e .
Change whatever you want, commit to a separate branch, and then open a pull request on GitHub.