龙行博客

Scraping beauty photos with Python 3

龙行    Python    2018-11-13

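This is a Python 2 script with a few changes so that it runs under Python 3. Rename the attachment so it has a .py extension and run it. At the pageid prompt, enter a number like the one circled in red in the figure below; each number identifies one album, and the script requests https://www.nvshens.com/g/<number>/1.html, 2.html and so on, for at most 30 pages.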
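The album number can also be read straight from an album URL. The following is a minimal sketch, assuming album URLs have the form https://www.nvshens.com/g/<number>/ (the pattern the script builds for its requests). The helper names `album_id_from_url` and `page_urls` are hypothetical and do not appear in the original script.

```python
import re

BASE = 'https://www.nvshens.com'  # site root used by the original script


def album_id_from_url(url):
    """Return the digits that follow '/g/' in an album URL, or None if absent."""
    match = re.search(r'/g/([0-9]+)', url)
    return match.group(1) if match else None


def page_urls(album_id, max_pages=30):
    """Build the page URLs the script requests: /g/<id>/1.html up to /g/<id>/<max_pages>.html."""
    return ['{}/g/{}/{}.html'.format(BASE, album_id, i)
            for i in range(1, max_pages + 1)]
```

For a made-up album 12345, `page_urls(album_id_from_url('https://www.nvshens.com/g/12345/'))` lists the 30 candidate page URLs. The script itself stops earlier, as soon as a page title reports that the page was not found.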

import urllib.parse
import urllib.request
import lxml.html  # tree.cssselect() also needs the third-party cssselect package
import time
import os
import re
 
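def serchIndex(name):
    # Percent-encode the name so that non-ASCII (e.g. Chinese) names form a valid URL
    url='https://www.nvshens.com/girl/search.aspx?name='+urllib.parse.quote(name)
    print(url)
    html = urllib.request.urlopen(url).read().decode('UTF-8')
    return html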
 
def selectOne(html):
    # Take the first search result and fetch that person's album listing page
    tree = lxml.html.fromstring(html)
    one = tree.cssselect('#DataList1 > tr > td:nth-child(1) > li > div > a')[0]
    href = one.get('href')
    url = 'https://www.nvshens.com'+href+'album/'
    print(url)
    html = urllib.request.urlopen(url).read().decode('UTF-8')
    return html
 
def findPageTotal(html):
    # Collect the unique album links from the album listing page
    tree = lxml.html.fromstring(html)
    lis = tree.cssselect('#photo_list > ul > li')
    hrefs = []  # renamed from "list" to avoid shadowing the built-in
    for li in lis:
        url = li.cssselect('div.igalleryli_div > a')
        href = url[0].get('href')
        hrefs.append(href)
    findimage_urls = set(hrefs)
    print(findimage_urls)
    return findimage_urls
 
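def dowmloadImage(image_url,filename):
    # Retry with a different User-Agent on each attempt; the retry limit is
    # len(image_url), i.e. the number of characters in the URL
    for i in range(len(image_url)):
        try:
            req = urllib.request.Request(image_url)
            req.add_header('User-Agent','chrome 4{}'.format(i))
            image_data = urllib.request.urlopen(req).read()
        except (urllib.request.HTTPError, urllib.request.URLError):
            time.sleep(0.1)
            continue
        # Write the image and close the file handle promptly
        with open(filename,'wb') as f:
            f.write(image_data)
        break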
 
def mkdirByGallery(path):
    # Strip leading and trailing whitespace
    path = path.strip()
    path = 'E:\\py\\photo\\'+path
    # os.mkdir(path) fails when a parent directory is missing,
    # whereas os.makedirs(path) creates the missing parents too.
    isExists = os.path.exists(path)
    if not isExists:
        os.makedirs(path)
    return path
 
# Disabled (note the !=): search by name, then download every album of the first match
if __name__ != '__main__':
        name = str(input("name:"))
        html = serchIndex(name)
        html = selectOne(html)
        pages = findPageTotal(html)
        img_id = 1
        for page in pages:
            path = re.search(r'[0-9]+',page).group()
            path = mkdirByGallery(path)
            for i in range(1,31):
                url='https://www.nvshens.com'+page+str(i)+'.html'
                html = urllib.request.urlopen(url).read().decode('UTF-8')
                tree = lxml.html.fromstring(html)
                title = tree.cssselect('head > title')[0].text
                # "该页面未找到" ("page not found") in the title means we are past the last page
                if "该页面未找到" in title:
                    break
                imgs = tree.cssselect('#hgallery > img')
                # Deduplicate while keeping page order, so file numbering is stable
                image_urls = []
                for img in imgs:
                    src = img.get('src')
                    if src not in image_urls:
                        image_urls.append(src)
                image_id = 0
                for image_url in image_urls:
                    dowmloadImage(image_url,path+'\\'+'2018-{}-{}-{}.jpg'.format(img_id,i,image_id))
                    image_id += 1
            img_id += 1
 
# Active mode: download a single album by its numeric ID
if __name__ == '__main__':
    page = str(input("pageid:"))
    path = mkdirByGallery(page)
    for i in range(1,31):
        url = 'https://www.nvshens.com/g/' + page+'/' + str(i) + '.html'
        print(url)
        html = urllib.request.urlopen(url).read().decode('UTF-8')
        tree = lxml.html.fromstring(html)
        title = tree.cssselect('head > title')[0].text
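        # "该页面未找到" ("page not found") in the title means we are past the last page
        if "该页面未找到" in title:
            break
        imgs = tree.cssselect('#hgallery > img')
        # Deduplicate while keeping page order, so file numbering is stable
        image_urls = []
        for img in imgs:
            src = img.get('src')
            if src not in image_urls:
                image_urls.append(src)
        image_id = 0
        for image_url in image_urls:
            dowmloadImage(image_url, path+'\\'+'2018-{}-{}.jpg'.format(i,image_id))
            image_id += 1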
 
# Disabled (note the !=): list the album links found on one gallery index page
if __name__ != '__main__':
    url = 'https://www.nvshens.com/gallery/meitui/'
    print(url)
    html = urllib.request.urlopen(url).read().decode('UTF-8')
    tree = lxml.html.fromstring(html)
    lis = tree.cssselect('#listdiv > ul > li')
    hrefs = []  # renamed from "list" to avoid shadowing the built-in
    for li in lis:
        url = li.cssselect('div.galleryli_div > a')
        href = url[0].get('href')
        hrefs.append(href)
    findimage_urls = set(hrefs)
    print(findimage_urls)
    print(len(findimage_urls))
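The script saves everything under a hard-coded Windows directory (E:\py\photo\) and joins paths with literal backslashes. The following is a minimal sketch of a portable variant, assuming the caller supplies the base directory. The names `gallery_dir`, `image_path` and `base_dir` are hypothetical and do not appear in the original script.

```python
import os


def gallery_dir(base_dir, gallery_id):
    """Create (if needed) and return the folder for one album under base_dir."""
    path = os.path.join(base_dir, gallery_id.strip())
    # exist_ok=True makes the separate os.path.exists() check unnecessary
    os.makedirs(path, exist_ok=True)
    return path


def image_path(directory, page, index):
    """Build a file name in the script's '2018-<page>-<index>.jpg' scheme."""
    return os.path.join(directory, '2018-{}-{}.jpg'.format(page, index))
```

With these helpers, the download call would become something like `dowmloadImage(image_url, image_path(path, i, image_id))`, which also works on Linux and macOS because `os.path.join` picks the right separator.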
py3


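  • Copyright notice: unless marked as a repost, this article is original to this site. You are free to repost it, but please credit 《龙行博客》.
  • Article URL: https://www.liaotaoo.cn/69.html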