# crau: Easy-to-use Web Archiver

*crau* is the way (most) Brazilians pronounce *crawl*. It's the easiest command-line tool for archiving the Web and playing archives: you just need a list of URLs.

## Installation

`pip install crau`

## Running

### Archiving

Archive a list of URLs by passing them via command-line:

```bash
crau archive myarchive.warc.gz http://example.com/page-1 http://example.org/page-2 ... http://example.net/page-N
```

or by passing a text file (one URL per line):

```bash
echo "http://example.com/page-1" > urls.txt
echo "http://example.org/page-2" >> urls.txt
echo "http://example.net/page-N" >> urls.txt
crau archive myarchive.warc.gz -i urls.txt
```

Run `crau archive --help` for more options.

### Extracting data from an archive

List archived URLs in a WARC file:

```bash
crau list myarchive.warc.gz
```

Extract a file from an archive:

```bash
crau extract myarchive.warc.gz https://example.com/page.html extracted-page.html
```

### Playing the archived data on your Web browser

Run a server on [localhost:8080](http://localhost:8080) to play your archive:

```bash
crau play myarchive.warc.gz
```

## Why not X?

There are other archiving tools, of course. The motivation to start this project was the lack of easy, fast and robust software to archive URLs - I just wanted to execute one command without thinking and get a WARC file. Depending on your problem, crau may not be the best answer - check out more archiving tools in [awesome-web-archiving](https://github.com/iipc/awesome-web-archiving#acquisition).

### Why not [GNU Wget](https://www.gnu.org/software/wget/)?

- Lacks parallel downloading;
- Some versions just crash with a segmentation fault, depending on the website;
- Lots of options make the task of archiving difficult;
- There's no easy way to extend its behavior.

### Why not [Wpull](https://wpull.readthedocs.io/en/master/)?

- Lots of options make the task of archiving difficult;
- Easier to extend than wget, but still difficult compared to crau (since crau uses [scrapy](https://scrapy.org/)).

### Why not [crawl](https://git.autistici.org/ale/crawl)?

- Lacks some features and it's difficult to contribute to (the [GitLab instance where it's hosted](https://git.autistici.org/ale/crawl) doesn't allow registration);
- Has some bugs regarding the collection of page dependencies (like static assets referenced inside a CSS file);
- Has a bug where it enters a loop: if a static asset returns an HTML page instead of the expected file, it ignores the depth limit and keeps trying to get that page's dependencies - if any of those dependencies has the same problem, it keeps going to infinite depth.

### Why not [archivenow](https://github.com/oduwsdl/archivenow)?

This tool makes it easy to archive pages using services such as [archive.is](https://archive.is) via the command-line and can also archive locally, but when doing so it calls wget to do the job.

## Contributing

Clone the repository:

```bash
git clone https://github.com/turicas/crau.git
```

Install development dependencies (you may want to create a virtualenv):

```bash
cd crau && pip install -r requirements-development.txt
```

Install an editable version of the package:

```bash
pip install -e .
```

Modify everything you want to, commit to another branch and then create a pull request on GitHub (a minimal example workflow is sketched below).
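
A minimal sketch of that workflow using standard git commands; the branch name and commit message below are placeholders, not project conventions:

```bash
# create a working branch (name is just an example)
git checkout -b my-feature

# ...edit the code...

# stage and commit your changes
git add -A
git commit -m "Describe your change here"

# push the branch to your fork, then open a pull request on GitHub
git push origin my-feature
```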