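Writing a crawler for a simple website is like assembly-line work: copy some code, tweak it, and if nothing breaks, ship it. Quick and painless. So here is another low-nutrition filler post: a fairly simple crawler, suitable for practice and learning. It collects the works published on the site, including each work's title, text, and images, and downloads the images with multiple threads.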
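The multi-threaded image download mentioned above can be sketched with a bounded thread pool. This is only an illustration, not the post's own code, which starts one thread per image: `download_all`, `download_one`, and `max_workers` are hypothetical names, and `download_one` stands for whatever function fetches and saves a single image.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(img_urls, download_one, max_workers=8):
    """Run download_one(url) for every URL on a bounded pool of threads.

    Returns (succeeded, failed): two lists of URLs.
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so failures can be reported.
        futures = {pool.submit(download_one, url): url for url in img_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()  # re-raises any exception from the worker thread
                succeeded.append(url)
            except Exception as exc:
                print(f">> {url} failed: {exc}")
                failed.append(url)
    return succeeded, failed
```

Capping the number of workers keeps a page with many images from opening dozens of connections at once, and collecting the failed URLs makes it easy to retry only those.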
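Because the site is hosted overseas, requests are retried up to three times on timeout, and a detail page that raises an error is skipped instead of stopping the run. The target is a Japanese site; crawl it as hard as you like. If you are new to Python crawlers and looking for a site to practice on, try crawling it and downloading the data.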
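The two mechanisms described above, retry on failure and skip on error, can be sketched independently of any HTTP library. This is a minimal sketch under assumptions: `fetch_with_retry`, `crawl_details`, `fetch`, `handle`, and `base_delay` are hypothetical names, and the growing delay is one possible choice rather than the post's exact timing.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url); retry up to `retries` times after the first failure.

    Returns the result of fetch(url), or None if every attempt failed.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # narrow this to the client's own errors in real use
            if attempt == retries:
                print(f">> giving up on {url}: {exc}")
                return None
            delay = base_delay * (attempt + 1)  # 2s, 4s, 6s with the defaults
            print(f">> request failed, retry {attempt + 1}/{retries} in {delay:.0f}s")
            sleep(delay)


def crawl_details(urls, handle):
    """Run handle(url) for each URL; log and skip the ones that raise."""
    skipped = []
    for url in urls:
        try:
            handle(url)
        except Exception as exc:
            print(f">> skipping {url}: {exc}")
            skipped.append(url)
    return skipped
```

With `requests`, the `fetch` argument could be something like `lambda u: requests.get(u, timeout=6)`, and the broad `except Exception` would normally be narrowed to `requests.exceptions.RequestException`.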
Code that handles the key nodes of a detail page:

    tree = etree.HTML(html)
    h1 = tree.xpath('//h1[@class="entry-title"]/text()')[0]
    pattern = r'[\\/:*?"<>|]'
    h1 = re.sub(pattern, "_", h1)  # replace characters not allowed in file names with underscores
    print(h1)
    path = f'{h1}/'
    os.makedirs(path, exist_ok=True)
    print(f">> Created output folder {h1}")
    ptexts = tree.xpath('//div[@class="main-text"]/p/text()')
    ptext = '\n'.join(ptexts)
    print(ptext)
    with open(f'{path}{h1}.txt', 'w', encoding='utf-8') as f:
        f.write(f'{h1}\n{ptext}')
    print(f">> Saved {h1}.txt")
    imgs = tree.xpath('//div[@class="slider-for"]/div[@class="sp-slide"]/img/@src')

An earlier version is linked at the end of the article; compare the two and see what has changed. The complete source code is attached, for reference and study only.

    # -*- coding: UTF-8 -*-
    # @WeChat official account: eryeji
    # https://www.nendo.jp/jp/works/

    import requests
    from lxml import etree
    import time
    import random
    import re
    import threading
    import os


    def get_ua():
        """Return a randomly chosen User-Agent string."""
        ua_list = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
            'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
            'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
            'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
        ]
        ua = random.choice(ua_list)
        return ua


    def get_hrefs():
        """Fetch the works list page and crawl every detail page linked from it."""
        url = "https://www.nendo.jp/jp/works/"
        headers = {"User-Agent": get_ua()}
        response = requests.get(url=url, headers=headers, timeout=6)
        print(response.status_code)
        html = response.content.decode('utf-8')
        # print(html)
        tree = etree.HTML(html)
        hrefs = tree.xpath('//div[@class="entry-content"]/a/@href')
        print(len(hrefs))
        print(hrefs)
        for href in hrefs:
            # skip a detail page that fails instead of aborting the whole run
            try:
                get_detail(href)
            except Exception as e:
                print(f">> Failed to crawl {href}, skipped: {e}")
            time.sleep(3)


    def get_detail(href):
        """Save the title, text, and images of one detail page."""
        headers = {"User-Agent": get_ua()}
        response = requests.get(url=href, headers=headers, timeout=6)
        print(response.status_code)
        html = response.content.decode('utf-8')
        # print(html)
        tree = etree.HTML(html)
        h1 = tree.xpath('//h1[@class="entry-title"]/text()')[0]
        pattern = r'[\\/:*?"<>|]'
        h1 = re.sub(pattern, "_", h1)  # replace characters not allowed in file names with underscores
        print(h1)
        path = f'{h1}/'
        os.makedirs(path, exist_ok=True)
        print(f">> Created output folder {h1}")
        ptexts = tree.xpath('//div[@class="main-text"]/p/text()')
        ptext = '\n'.join(ptexts)
        print(ptext)
        with open(f'{path}{h1}.txt', 'w', encoding='utf-8') as f:
            f.write(f'{h1}\n{ptext}')
        print(f">> Saved {h1}.txt")
        imgs = tree.xpath('//div[@class="slider-for"]/div[@class="sp-slide"]/img/@src')
        print(len(imgs))
        print(imgs)
        down_imgs(path, imgs)


    # up to 3 retries after the first attempt
    def get_resp(url):
        i = 0
        while i < 4:
            try:
                headers = {"User-Agent": get_ua()}
                response = requests.get(url, headers=headers, timeout=10)
                print(response.status_code)
                return response
            except requests.exceptions.RequestException:
                i += 1
                if i < 4:
                    print(f">> Request failed, retry {i} in {i * 2}s")
                    time.sleep(i * 2)
        return None  # every attempt failed


    def down_imgs(path, imgs):
        """Download all images of one work, one thread per image."""
        threadings = []
        for img in imgs:
            t = threading.Thread(target=get_img, args=(path, img))
            threadings.append(t)
            t.start()
        for x in threadings:
            x.join()
        print("Multi-threaded image download finished!")


    # download one image
    def get_img(path, img_url):
        img_name = img_url.split('/')[-1]
        r = get_resp(img_url)
        if r is None:
            print(f">> {img_name} failed after all retries, skipped")
            return
        time.sleep(1)
        with open(f'{path}{img_name}', 'wb') as f:
            f.write(r.content)
        print(f">> Downloaded {img_name}")


    def main():
        get_hrefs()


    if __name__ == '__main__':
        main()

Earlier version: "Python crawler: a super-simple image crawler demo for the nendo official site"
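The title clean-up step in `get_detail` can be isolated into small helpers, which also guards against a page where the XPath query returns nothing (indexing `[0]` on an empty list raises `IndexError`). This is a minimal sketch: `safe_name`, `first_or_none`, and the `"untitled"` fallback are hypothetical names and choices, not part of the original source.

```python
import re

# Characters that Windows forbids in file and folder names.
_INVALID = re.compile(r'[\\/:*?"<>|]')


def safe_name(title, fallback="untitled"):
    """Turn a page title into a string usable as a folder or file name."""
    # Replace forbidden characters, then drop surrounding whitespace and trailing dots.
    cleaned = _INVALID.sub("_", title).strip().rstrip(".")
    return cleaned or fallback


def first_or_none(nodes):
    """Return the first XPath result, or None when the list is empty."""
    return nodes[0] if nodes else None
```

For example, `first_or_none(tree.xpath('//h1[@class="entry-title"]/text()'))` would give `None` for a page without a title, which the caller can then skip or pass through `safe_name("")` to get the fallback name.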
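·················END·················
Hello, I'm Er Daye (二大爺): a migrant worker from an old revolutionary base area who moved to the city for work, and a not-so-early, non-professional webmaster. I like Python, writing, reading, and English, and dabble at an amateur level in programming, self-publishing, and SEO.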