
All IT eBooks Multithreaded Crawling - Preface

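Anyone who enjoys writing crawlers has at least a touch of the collector's itch. When we find good pictures, good books, or anything else that can be stored on a computer, we want to scrape all of it in bulk. Then we leave it there. Yes, just leave it there... and slowly forget about it.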

All IT eBooks Multithreaded Crawling - Crawler Analysis

Open http://www.allitebooks.com/ and you will find a small, very clean site that looks easy to crawl at first glance.

Click through to any book and the download link is displayed just as plainly. That is a small thrill: sites this clean and ad-free are rare these days.

All IT eBooks Multithreaded Crawling - Writing the Code

This time I am using a new module, requests-html. Its author previously developed requests, which you should already know well. Thread coordination is handled with queue.
Install the requests-html module:

            
              pip install requests-html

As for how to use this module, just search for its name in a search engine; there are plenty of articles. If you have followed the series this far, it will be easy for you.

Let's write the core part.

            
from requests_html import HTMLSession
from queue import Queue
import requests
import random
import threading
import time

# Exit flags polled by the worker threads
CARWL_EXIT = False
DOWN_EXIT = False

#####
# other code
####
if __name__ == '__main__':

    page_queue = Queue(5)
    for i in range(1,6):
        page_queue.put(i)  # store the page numbers in page_queue

    # crawl results
    data_queue = Queue()

    # keep track of the crawl threads
    thread_crawl = []
    # start 5 threads at a time
    craw_list = ["Crawl thread 1","Crawl thread 2","Crawl thread 3","Crawl thread 4","Crawl thread 5"]

    for thread_name in craw_list:
        c_thread = ThreadCrawl(thread_name,page_queue,data_queue)
        c_thread.start()
        thread_crawl.append(c_thread)

    # wait until every page number has been taken from the queue
    while not page_queue.empty():
        time.sleep(0.1)  # sleep briefly instead of spinning the CPU

    # page_queue is empty: tell the crawl threads to leave their loop
    CARWL_EXIT = True
    for thread in thread_crawl:
        thread.join()
        print("Crawl thread finished")

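That is the thread setup for crawling the book detail pages. I started 5 threads and crawled only 5 pages. If you need more, just change the following (keep the Queue size at least as large as the number of pages, because the queue is filled before any thread starts and put() blocks once it is full):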

            
    page_queue = Queue(5)
    for i in range(1,6):
        page_queue.put(i)  # store the page numbers in page_queue
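
The page range can also be wrapped in a small helper so that the queue size and the range never drift apart. This is only a minimal sketch: `build_page_queue` is a hypothetical name that does not appear in the original script.

```python
from queue import Queue


def build_page_queue(first_page, last_page):
    """Return a Queue holding every page number from first_page to last_page."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # Size the queue to fit the range exactly, so put() never blocks here.
    pages = Queue(last_page - first_page + 1)
    for page in range(first_page, last_page + 1):
        pages.put(page)
    return pages


page_queue = build_page_queue(1, 20)  # pages 1 through 20
print(page_queue.qsize())
```

With a helper like this, crawling more pages means changing only the two arguments.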

Next, let's finish the ThreadCrawl class.

            
session = HTMLSession()

# User-Agent strings. Later I plan to host them on a server and fetch them remotely.
# The full list has many entries; look in the source code for the rest.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20"
]
# Thread class that collects the book download links
class ThreadCrawl(threading.Thread):
    # constructor
    def __init__(self,thread_name,page_queue,data_queue):

        super(ThreadCrawl,self).__init__()
        self.thread_name = thread_name
        self.page_queue = page_queue
        self.data_queue = data_queue
        self.page_url = "http://www.allitebooks.com/page/{}"   # URL template

    def run(self):
        print(self.thread_name+" started *********")

        while not CARWL_EXIT:
            try:
                # non-blocking get: raises an exception once the queue is empty
                page = self.page_queue.get(block=False)
                page_url = self.page_url.format(page)   # build the list-page URL
                self.get_list(page_url)   # parse the links on that page

            except Exception as e:
                # an empty queue or a failed request ends this thread
                print(e)
                break

    # collect every book link on the current list page
    def get_list(self,url):
        try:
            response = session.get(url)
        except Exception as e:
            print(e)
            raise e

        all_link = response.html.find('.entry-title>a') # all book detail links on the page

        for link in all_link:
            self.get_book_url(link.attrs['href'])   # visit each book's detail page

    # extract the download link from a book detail page
    def get_book_url(self,url):
        try:
            response = session.get(url)

        except Exception as e:
            print(e)
            raise e

        download_url = response.html.find('.download-links a', first=True)

        if download_url is not None: # only continue if a download link exists
            link = download_url.attrs['href']
            self.data_queue.put(link)   # store the download URL in data_queue for the download step
            print("Found {}".format(link))

The key point of the code above is that each book's download link is stored in data_queue. The download threads take that data as their input.
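
To make the hand-off through data_queue concrete, here is a minimal producer/consumer sketch that runs without any network access. Note that it is not the article's approach: it swaps the global exit flags for sentinel values and blocking `get()` calls, which is one common alternative. The names `STOP`, `producer`, `consumer` and `run_pipeline`, and the example.com links, are all assumptions made for illustration.

```python
import threading
from queue import Queue

STOP = object()  # sentinel: tells a consumer there is no more work


def producer(pages, data_queue):
    # Stand-in for ThreadCrawl: emits one fake link per page.
    for page in pages:
        data_queue.put("http://example.com/book-{}.pdf".format(page))


def consumer(data_queue, results, lock):
    # Stand-in for ThreadDown: blocks on get() instead of polling a flag.
    while True:
        link = data_queue.get()
        if link is STOP:
            break
        with lock:
            results.append(link)


def run_pipeline(pages, consumer_count=4):
    data_queue = Queue()
    results, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=consumer, args=(data_queue, results, lock))
        for _ in range(consumer_count)
    ]
    for worker in workers:
        worker.start()
    producer(pages, data_queue)
    for _ in workers:
        data_queue.put(STOP)  # one sentinel per consumer
    for worker in workers:
        worker.join()
    return sorted(results)


links = run_pipeline(range(1, 6))
print(len(links))
```

Because each consumer blocks until an item arrives, no thread spins while the queue is temporarily empty, and shutdown does not depend on when a flag happens to be read.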

Now let's write the class and methods that download the books.

I start 4 threads here; the approach is very similar to the one above.

            
import time  # used by the wait loops below

class ThreadDown(threading.Thread):
    def __init__(self, thread_name, data_queue):
        super(ThreadDown, self).__init__()
        self.thread_name = thread_name
        self.data_queue = data_queue

    def run(self):
        print(self.thread_name + ' started ************')
        while not DOWN_EXIT:
            try:
                book_link = self.data_queue.get(block=False)
                self.download(book_link)
            except Exception as e:
                # queue is empty (or a download failed): pause, then try again
                time.sleep(0.1)

    def download(self,url):
        # pick a random browser User-Agent
        headers = {"User-Agent":random.choice(USER_AGENTS)}
        # take the file name from the last URL segment
        filename = url.split('/')[-1]
        # only download links that point to a pdf or epub
        if '.pdf' in url or '.epub' in url:
            file = 'book/'+filename  # the path is hard-coded; create a book folder in the working directory first
            with open(file,'wb') as f:  # write the file in binary mode
                print("Downloading {}".format(filename))
                response = requests.get(url,stream=True,headers=headers)
                # read the file size from the response headers
                total_length = response.headers.get("content-length")
                # if the size is unknown, write the whole body at once
                if total_length is None:
                    f.write(response.content)
                else:
                    for data in response.iter_content(chunk_size=4096):
                        f.write(data)
                # no explicit close needed: the with block closes the file

                print("{} downloaded".format(filename))

if __name__ == '__main__':

    # the rest of this block is shown above
    thread_image = []
    image_list = ['Download thread 1', 'Download thread 2', 'Download thread 3', 'Download thread 4']
    for thread_name in image_list:
        d_thread = ThreadDown(thread_name, data_queue)
        d_thread.start()
        thread_image.append(d_thread)

    # wait until every download link has been taken from the queue
    while not data_queue.empty():
        time.sleep(0.1)

    DOWN_EXIT = True
    for thread in thread_image:
        thread.join()
        print("Download thread finished")
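
The download method builds its target path with `url.split('/')[-1]` and expects a book folder to exist already. As a hedged alternative, the sketch below derives the file name from the parsed URL path and creates the folder on demand. `local_path_for` is a hypothetical helper, and its extension check is slightly stricter than the substring test in the code above.

```python
import os
from urllib.parse import unquote, urlsplit


def local_path_for(url, directory="book"):
    """Map a download URL to a path inside `directory`, creating the folder if needed."""
    # urlsplit drops the query string; unquote turns %20 back into a space.
    name = os.path.basename(unquote(urlsplit(url).path))
    if not name.lower().endswith((".pdf", ".epub")):
        return None  # not a book file, so there is nothing to save
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, name)
```

Calling `local_path_for(url)` before `open(...)` would remove the need to create the folder by hand; a return value of `None` means the link should be skipped.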

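Once you have put all of the code above together, you should be able to crawl the books quite quickly. Of course, the books are all in English; whether you can read them once they are downloaded... that I cannot say.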


Python Crawler Introduction [13]: All IT eBooks Multithreaded Crawling
