
Foshan company website building, one-stop and all-inclusive (crawling literature with a crawler): scraping paper data with a crawler

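Writing a crawler for a simple website is like assembly-line work: copy, tweak, and if nothing breaks, ship it. Direct and low effort. So here is another filler post with little nutritional value: a fairly simple crawler that is good for practice and learning. It collects the work entries on the site, including each title, body text, and images, and the images are downloaded with multiple threads.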
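As a rough illustration of the multi-threaded image download mentioned above, here is a minimal sketch that uses a bounded thread pool from the standard library. The names `download_all` and `download_one` are hypothetical and do not come from the original source, which starts one `threading.Thread` per image; the pool is a swapped-in alternative that caps how many downloads run at once.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, download_one, max_workers=4):
    """Run download_one(url) for every URL on a bounded pool of threads.

    download_one is a caller-supplied (hypothetical) function that fetches
    and saves a single image and raises an exception on failure.
    Returns two lists of URLs: (succeeded, failed).
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so failures can be reported.
        futures = {pool.submit(download_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(url)
            except Exception as exc:
                print(f">> {url} failed: {exc}")
                failed.append(url)
    return succeeded, failed
```

Capping `max_workers` keeps the number of simultaneous requests small, which is gentler on the target site than one thread per image when a page has many images.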

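Because the target is an overseas site, requests that time out are retried up to three times, and detail pages that raise an error are skipped rather than stopping the run, which makes this a good exercise for newcomers. It is a Japanese site, and the author's take is that you can crawl it as much as you like. If you are new to Python crawlers and looking for a practice site, try scraping and downloading its data.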
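The retry-then-skip idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's code: `call_with_retries` and `process_all` are hypothetical helper names, and the fetch and handler functions are passed in by the caller so the pattern can be tried without network access.

```python
import time


def call_with_retries(func, *args, retries=3, base_delay=2.0,
                      exceptions=(Exception,)):
    """Call func(*args); on failure, retry up to `retries` more times.

    Waits base_delay * attempt seconds before each retry and re-raises
    the last exception once the retries are used up.
    """
    attempt = 0
    while True:
        try:
            return func(*args)
        except exceptions:
            attempt += 1
            if attempt > retries:
                raise
            time.sleep(base_delay * attempt)


def process_all(items, handler):
    """Run handler(item) for each item; log and skip items that raise.

    Returns the list of items that were skipped.
    """
    skipped = []
    for item in items:
        try:
            handler(item)
        except Exception as exc:
            print(f">> skipping {item}: {exc}")
            skipped.append(item)
    return skipped
```

With `requests`, one might pass `exceptions=(requests.exceptions.RequestException,)` so that only network errors trigger a retry, and use `process_all` to loop over detail-page URLs so a single broken page does not end the whole crawl.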

Code that handles the key nodes of the detail page:

    tree = etree.HTML(html)
    h1 = tree.xpath('//h1[@class="entry-title"]/text()')[0]
    # Replace characters that Windows forbids in file names with underscores
    pattern = r'[\\/:*?"<>|]'
    h1 = re.sub(pattern, "_", h1)
    print(h1)
    path = f'{h1}/'
    os.makedirs(path, exist_ok=True)
    print(f">> Created output folder {h1}")
    ptexts = tree.xpath('//div[@class="main-text"]/p/text()')
    ptext = ''.join(ptexts)
    print(ptext)
    with open(f'{path}{h1}.txt', 'w', encoding='utf-8') as f:
        f.write(f'{h1}\n{ptext}')
    print(f">> Saved {h1}.txt")
    imgs = tree.xpath('//div[@class="slider-for"]/div[@class="sp-slide"]/img/@src')

At the end of the article I have attached an earlier version I wrote, so you can see whether anything has changed. The complete source code follows, for reference and learning only.

    # -*- coding: UTF-8 -*-
    # @WeChat official account: eryeji
    # https://www.nendo.jp/jp/works/

    import requests
    from lxml import etree
    import time
    import random
    import re
    import threading
    import os


    def get_ua():
        # Pick a random User-Agent for each request
        ua_list = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
            'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
            'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
            'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
        ]
        ua = random.choice(ua_list)
        return ua


    def get_hrefs():
        # Collect the detail-page links from the works index
        url = "https://www.nendo.jp/jp/works/"
        headers = {"User-Agent": get_ua()}
        response = requests.get(url=url, headers=headers, timeout=6)
        print(response.status_code)
        html = response.content.decode('utf-8')
        # print(html)
        tree = etree.HTML(html)
        hrefs = tree.xpath('//div[@class="entry-content"]/a/@href')
        print(len(hrefs))
        print(hrefs)
        for href in hrefs:
            get_detail(href)
            time.sleep(3)


    def get_detail(href):
        # Save one work: title, body text, and images
        headers = {"User-Agent": get_ua()}
        response = requests.get(url=href, headers=headers, timeout=6)
        print(response.status_code)
        html = response.content.decode('utf-8')
        # print(html)
        tree = etree.HTML(html)
        h1 = tree.xpath('//h1[@class="entry-title"]/text()')[0]
        # Replace characters that Windows forbids in file names with underscores
        pattern = r'[\\/:*?"<>|]'
        h1 = re.sub(pattern, "_", h1)
        print(h1)
        path = f'{h1}/'
        os.makedirs(path, exist_ok=True)
        print(f">> Created output folder {h1}")
        ptexts = tree.xpath('//div[@class="main-text"]/p/text()')
        ptext = ''.join(ptexts)
        print(ptext)
        with open(f'{path}{h1}.txt', 'w', encoding='utf-8') as f:
            f.write(f'{h1}\n{ptext}')
        print(f">> Saved {h1}.txt")
        imgs = tree.xpath('//div[@class="slider-for"]/div[@class="sp-slide"]/img/@src')
        print(len(imgs))
        print(imgs)
        down_imgs(path, imgs)


    # Up to 3 retries after the first attempt
    def get_resp(url):
        i = 0
        while i < 4:
            try:
                headers = {"User-Agent": get_ua()}
                response = requests.get(url, headers=headers, timeout=10)
                print(response.status_code)
                return response
            except requests.exceptions.RequestException:
                i += 1
                if i < 4:
                    print(f">> Request failed, retry {i} in {i * 2}s")
                    time.sleep(i * 2)
        # Every attempt failed
        return None


    def down_imgs(path, imgs):
        # One thread per image; wait for all of them to finish
        threadings = []
        for img in imgs:
            t = threading.Thread(target=get_img, args=(path, img))
            threadings.append(t)
            t.start()
        for x in threadings:
            x.join()
        print("Multi-threaded image download finished!")


    # Download one image
    def get_img(path, img_url):
        img_name = img_url.split('/')[-1]
        r = get_resp(img_url)
        if r is None:
            print(f">> {img_name} could not be downloaded, skipping")
            return
        time.sleep(1)
        with open(f'{path}{img_name}', 'wb') as f:
            f.write(r.content)
        print(f">> {img_name} downloaded")


    def main():
        get_hrefs()


    if __name__ == '__main__':
        main()

Earlier version: Python crawler, a super simple image crawler demo for the works on the nendo official site

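·················END·················

Hello, I'm Er Daye (二大爺), a migrant worker from an old revolutionary base area who came to the city for work, and a non-early, non-professional webmaster. I like Python, writing, reading, and English, and dabble in amateur programming, self-media, SEO . . .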

