Shell crawler: fetching every page of an online document

Date: 2024-12-27 Author: l7m23

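Online tutorials are usually laid out like an assembly line: each page has Previous and Next buttons. A shell crawler can therefore read the next-page link from each page and, together with wget, download the entire tutorial.

Take the PostgreSQL Chinese site (postgres.cn) as an example. The sample code follows.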

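#!/bin/bash
# Crawl the PostgreSQL 9.6 docs by following each page's "Next" link.
start_URL="http://www.postgres.cn/docs/9.6/preface.html"
end_URL="http://www.postgres.cn/docs/9.6/bookindex.html"
URL=$start_URL

# Note: the loop stops on reaching end_URL, so that page itself is not fetched.
while [ "$URL" != "$end_URL" ]; do
    # Fetch the page once for parsing and save a copy under psql/.
    curl -s "$URL" > tmp.txt
    wget "$URL" -P psql
    # Line number of the first ACCESSKEY="N" (the attribute of the "Next" link).
    grep -n 'ACCESSKEY="N"' tmp.txt > tmp2.txt
    cut -f1 -d":" tmp2.txt | head -n 1 > tmp3.txt
    LINE=$(cat tmp3.txt)
    # Stop if no "Next" link was found, rather than looping forever.
    [ -z "$LINE" ] && break
    # The HREF attribute sits on the line just above ACCESSKEY="N".
    LINE=$((LINE - 1))
    sed -n "${LINE}p" tmp.txt > tmp4.txt
    # Strip HREF=" and the closing quote, leaving only the relative path.
    sed -i 's/HREF="//g' tmp4.txt
    sed -i 's/"//g' tmp4.txt
    sURL=$(cat tmp4.txt)
    cat tmp4.txt >> allurl.txt
    FULLURL="http://www.postgres.cn/docs/9.6/$sURL"
    URL=$FULLURL
done

rm -f tmp.txt tmp2.txt tmp3.txt tmp4.txt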
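
The link-extraction step in the script above goes through four temporary files. As a rough sketch, the same step could be written as a single awk filter. The function name next_link is made up for illustration and is not part of the original script. It assumes the same page layout the script relies on, where the HREF attribute sits on the line directly before the first ACCESSKEY="N" line.

```shell
#!/bin/bash
# next_link (hypothetical helper): read an HTML page on stdin and print the
# relative "Next" link. Assumes HREF="..." is on the line directly before
# the first line containing ACCESSKEY="N".
next_link() {
    awk '
        /ACCESSKEY="N"/ {
            line = prev
            sub(/^.*HREF="/, "", line)   # drop everything up to the opening quote
            sub(/".*$/, "", line)        # drop the closing quote and anything after
            print line
            exit                         # only the first match is needed
        }
        { prev = $0 }                    # remember this line for the next record
    '
}
```

Inside the loop it might be used as sURL=$(curl -s "$URL" | next_link), which would replace tmp2.txt through tmp4.txt. If the layout assumption does not hold for a page, the function prints an empty or unrelated line, so the caller should still check the result before continuing.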

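Notes:

1. URL: the address of the HTML page to download

2. sURL: the relative path of the next HTML page

3. FULLURL: the complete URL formed by joining sURL with the base URL

4. tmp.txt: holds the page data fetched by curl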
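
The script hard-codes the base address when it builds FULLURL. A minimal sketch of deriving it from the current page instead is shown here. The helper name join_url is hypothetical. It only handles plain relative file names such as preface.html, and does not handle "../", absolute paths, or full URLs.

```shell
#!/bin/bash
# join_url (hypothetical helper): resolve a relative link against the page
# it was found on.
#   $1: URL of the current page
#   $2: relative link extracted from that page
# The body runs in a subshell so the rel variable does not leak out.
join_url() (
    rel=$(printf '%s' "$2" | tr -d '\r')   # drop stray carriage returns
    printf '%s/%s\n' "${1%/*}" "$rel"      # ${1%/*} removes the file name part
)
```

With this, the assignment in the loop could become FULLURL=$(join_url "$URL" "$sURL"), and the same script would work for another documentation tree by changing only start_URL and end_URL.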
