Web Scraping

 13th August 2019 at 5:40pm

Consider the tools below when you need to download an entire website to your computer for offline use.

Mirroring an entire site

HTTrack

HTTrack provides a simple, easy-to-use GUI; it successfully mirrored the hyperlinked HTML version of the Vim documentation.
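HTTrack also ships a command-line binary for the same job. A minimal sketch, assuming httrack is on your PATH and using vimhelp.org as an illustrative target URL:

$ httrack "https://vimhelp.org/" -O ./vim-docs "+*.vimhelp.org/*" -v

Here -O sets the output directory, the +*.vimhelp.org/* filter keeps the crawl on that domain, and -v enables verbose output.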

wget

The wget command is the general-purpose solution. Usage:

$ wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains website.org \
     --no-parent \
         http://www.website.org/tutorials/html/
  • --recursive: download the entire Web site.
  • --domains website.org: don't follow links outside website.org.
  • --no-parent: don't follow links outside the directory tutorials/html/.
  • --page-requisites: get all the elements that compose the page (images, CSS and so on).
  • --html-extension: save files with the .html extension.
  • --convert-links: convert links so that they work locally, off-line.
  • --restrict-file-names=windows: modify filenames so that they will work in Windows as well.
  • --no-clobber: don't overwrite any existing files (used in case the download is interrupted and resumed).
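On newer wget builds, --html-extension is spelled --adjust-extension (the old name remains as an alias), and the --mirror shorthand (equivalent to -r -N -l inf --no-remove-listing) gives a more compact version of the same recipe; a sketch against the same example site:

$ wget --mirror --page-requisites --adjust-extension --convert-links \
     --no-parent http://www.website.org/tutorials/html/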

Saving a single web page

webpage2html

webpage2html (GitHub | PyPI) is a tool that saves a web page as a single HTML file (not MHTML). It works by inlining CSS/JavaScript files into the HTML and converting images into base64 data URIs. After installation, run the webpage2html command directly to convert a page:

$ webpage2html "http://www.google.com" > google.html
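To see the image-inlining idea in isolation, here is a minimal shell sketch (logo.png is a hypothetical local file; -w0 assumes GNU coreutils base64) that builds the kind of data URI webpage2html embeds in place of an image URL:

$ printf '<img src="data:image/png;base64,%s">' "$(base64 -w0 logo.png)" > inline.html

Opening inline.html in a browser shows the image with no external file dependency.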

Chrome print to PDF

Chrome can save a web page as a PDF; the entry point is the print dialog invoked by Ctrl-P (pick "Save as PDF" as the destination). The upside is that it is quick and convenient; the downside is that the page's format information is lost entirely.
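The same conversion can be scripted with Chrome's headless mode; a minimal sketch (the binary may be named google-chrome, chromium, or chrome depending on the platform):

$ google-chrome --headless --disable-gpu --print-to-pdf=page.pdf "http://www.website.org/"

This writes page.pdf to the current directory without opening a browser window.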