Python Web Crawling (파이썬 크롤링) :: 네이버 웹툰 이미지 크롤링하기

업데이트: May 23, 2021

Naver Webtoon Crawling

⭐ 파이썬으로 네이버 웹툰 이미지 크롤링하기

😃 우측 상단의 아이콘을 누르시면 직접 실습해보실 수 있습니다 :)

colab 실습 화면

✔ 10개 회차 크롤링

Python Web Crawling에 필요한 라이브러리를 설치하고 Import한다.

필요한 Python 패키지 설치

  pip install requests bs4 pytest-shutil bboxes cvlib

# Import

from bs4 import BeautifulSoup
import requests
import os

웹툰 페이지의 html 소스를 가져온 후, wrt_nm class의 parent tag의 text를 webtoonName array에 저장한다.

F12를 누르면 개발자모드에 진입하실 수 있습니다😊

# Webtoon Url : ex) 바른연애길잡이 
url = "https://comic.naver.com/webtoon/list.nhn?titleId=703852&page=1" 

# 크롤링 우회
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'}
html = requests.get(url, headers = headers)
result = BeautifulSoup(html.content, "html.parser")

webtoonName = result.find("span", {"class", "wrt_nm"}).parent.get_text().strip().split('\n')
# webtoonName = ['바른연애 길잡이', '\t\t\t\t\t\t\t남수']

현재 directory의 하위 폴더에 이미지를 저장할 folder를 만든다.

cwd = os.getcwd()
files = os.listdir(cwd)
# 현재 directory 위치
print(cwd, end='\n\n')

# 크롤링한 이미지를 저장할 폴더를 만듦
if os.path.isdir(os.path.join(cwd,  webtoonName[0])) == False: 
    os.mkdir(webtoonName[0])
   
print(webtoonName[0] + " folder created successfully!")
os.chdir(os.path.join(cwd,  webtoonName[0])) 

/home/jihye/Home/blog/_posts/ipynb_folder

바른연애 길잡이 folder created successfully!

title class의 td tag 찾은 후, 최근 10회차에 대한 웹툰 이미지를 크롤링한다.

title = result.findAll("td", {"class", "title"})

for t in title:
    
    # 회차별로 directory를 만든 후 해당 directory로 이동
    os.mkdir((t.text).strip()) 
    os.chdir(os.getcwd() + "//" + (t.text).strip()) 

    # 각 회차별 html 소스 가져오기
    url ="https://comic.naver.com" + t.a['href']
    html2 = requests.get(url, headers = headers) 
    result2 = BeautifulSoup(html2.content, "html.parser") 

    # webtoon image 가져오기
    webtoonImg = result2.find("div", {"class", "wt_viewer"}).findAll("img")
    num = 1 #image_name
    
    for i in webtoonImg:
        saveName = os.getcwd() + "//" + str(num) + ".jpg"
        with open(saveName, "wb") as file:
            src = requests.get(i['src'], headers = headers) 
            file.write(src.content) #
        num += 1

    os.chdir("..")

    # 한 회차 이미지 저장 완료!
    print((t.text).strip() + "   saved completely!") 

 saved completely!
 saved completely!
 saved completely!
 saved completely!
 saved completely!
 saved completely!
 saved completely!
 saved completely!
 saved completely!
 saved completely!

크롤링 결과

✔ 전체 회차 크롤링

위의 주피터 노트북을 활용하면 10개의 회차에 대해 크롤링할 수 있다. 이번에는 웹툰의 더 많은 회차에 대해 크롤링 해본다.

from bs4 import BeautifulSoup
import requests
import os

# 크롤링하고 싶은 페이지의 개수
page_num = 18

for i in range(page_num):

    # Webtoon Url : ex) 바른연애길잡이 
    url = "https://comic.naver.com/webtoon/list.nhn?titleId=703846&page={0}".format(i) 

    # 크롤링 우회
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'}
    html = requests.get(url, headers = headers)
    result = BeautifulSoup(html.content, "html.parser")

    webtoonName = result.find("span", {"class", "wrt_nm"}).parent.get_text().strip().split('\n')

    cwd = os.getcwd()
    files = os.listdir(cwd)
    print(cwd)

    if os.path.isdir(os.path.join(cwd,  webtoonName[0])) == False: 
        os.mkdir(webtoonName[0])
    

    print(webtoonName[0] + "page {0} folder created successfully!".format(i))
    os.chdir(os.path.join(cwd,  webtoonName[0])) 
    
    title = result.findAll("td", {"class", "title"})

    for t in title:
        
        #웹툰 디렉토리 안에 회차별로 디렉토리 만들기
        if os.path.isdir(os.path.join(cwd,  webtoonName[0], (t.text).strip())): 
            break
        
        os.mkdir((t.text).strip()) 

        #회차별 디렉토리로 이동
        os.chdir(os.getcwd() + "//" + (t.text).strip()) 
        print(os.getcwd())
        #각 회차별 url
        url ="https://comic.naver.com" + t.a['href']
        #헤더 우회해서 링크 가져오기
        html2 = requests.get(url, headers = headers) 
        result2 = BeautifulSoup(html2.content, "html.parser") 

        # webtoon image 찾기
        webtoonImg = result2.find("div", {"class", "wt_viewer"}).findAll("img")
        num = 1 #image_name
        
        for i in webtoonImg:
            saveName = os.getcwd() + "//" + str(num) + ".jpg"
            with open(saveName, "wb") as file:
                src = requests.get(i['src'], headers = headers) 
                file.write(src.content) #
            num += 1

        os.chdir("..")

        # 한 회차 이미지 저장 완료!
        print((t.text).strip() + "   saved completely!") 
        
    os.chdir("..")

크롤링 성공 !!

reference

https://github.com/wngus4296/20_1_pythonStudy

Twitter Facebook LinkedIn

Jihye Back

Python Web Crawling (파이썬 크롤링) :: 네이버 웹툰 이미지 크롤링하기

Naver Webtoon Crawling

✔ 10개 회차 크롤링

✔ 전체 회차 크롤링

공유하기

댓글남기기

참고

[Paper Review] RAFT: Adapting Language Model to Domain Specific RAG 논문 리뷰

[4] RAG (Retriever Augumented Generation)

[3-2] A Survey of Large Language Models - Adaptation of LLMs

[3-1] A Survey of Large Language Models - 다양한 LLMs부터, Data, Architecture, Training 까지