Crawling (Beautiful Soup library) - Reading Naver Webtoons

Python 2023. 2. 1. 16:22

from urllib.request import urlopen
from bs4 import BeautifulSoup

myurl = 'http://comic.naver.com/webtoon/weekday'

# Send a request to this page and store the returned data in a variable.
response = urlopen(myurl)

# <class 'http.client.HTTPResponse'>
print(type(response))

# Parse the data with BeautifulSoup().
soup = BeautifulSoup(response, 'html.parser')

# Prints the Beautiful Soup object with readable indentation.
# print(soup.prettify())

title = soup.find("title").string
print(title)
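
Some sites reject requests that arrive with the default Python user agent, and a bare urlopen() call has no timeout. The following is a minimal sketch of a more defensive fetch, not part of the original post: build_request() and fetch_html() are hypothetical helper names, and the 'Mozilla/5.0' user agent string and the 10 second timeout are assumed values.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent='Mozilla/5.0'):
    """Return a Request carrying an explicit User-Agent header (assumed value)."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, timeout=10):
    """Download the page body as bytes and close the connection afterwards."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

The bytes returned by fetch_html(myurl) can be passed to BeautifulSoup(..., 'html.parser') in the same way as the response object above.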

Saving thumbnail images (+ creating folders)
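
The script in this section creates one folder per weekday, checking os.path.exists() before each os.mkdir() call. As a possible alternative, os.makedirs() with exist_ok=True creates missing parent folders and does not raise when the folder is already there. This is only a sketch: make_weekday_folders() is a hypothetical helper, and the demo writes into a temporary directory rather than the c:\imsi\ folder used by the post.

```python
import os
import tempfile


def make_weekday_folders(base_dir, names):
    """Create base_dir/<name> for every name and return the created paths."""
    created = []
    for name in names:
        path = os.path.join(base_dir, name)
        # exist_ok=True makes the call safe to repeat on an existing folder
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


if __name__ == '__main__':
    # Demo in a throwaway directory; the folder names here are placeholders.
    with tempfile.TemporaryDirectory() as base:
        for folder in make_weekday_folders(base, ['mon', 'tue', 'wed']):
            print(folder, os.path.isdir(folder))
```

Because the call is idempotent, no try/except FileExistsError is needed around the folder setup with this approach.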

import os

from urllib.request import urlopen
from bs4 import BeautifulSoup
from pandas import DataFrame


myparser = 'html.parser'
myurl = 'https://comic.naver.com/webtoon/weekday'
response = urlopen(myurl)
soup = BeautifulSoup(response, myparser)

# print(result) # copy the output into the cartoon.html file
# print(type(result))

weekday_dict = {'mon':'Monday', 'tue':'Tuesday', 'wed':'Wednesday',
                'thu':'Thursday', 'fri':'Friday', 'sat':'Saturday', 'sun':'Sunday'}
myfolder = 'c:\\imsi\\'

# saveFile() downloads an image found on the web page and saves it to the local machine.
# It is defined before the main loop so that it exists when the loop calls it.
def saveFile(mysrc, myweekday, mytitle):
    image_file = urlopen(mysrc)
    filename = myfolder + myweekday + '\\' + mytitle + '.jpg'

    # print(filename)
    with open(filename, mode='wb') as myfile:  # the file is closed automatically
        myfile.write(image_file.read())  # saved as an image file


try:
    if not os.path.exists(myfolder):  # create the temporary folder
        os.mkdir(myfolder)

    for mydir in weekday_dict.values():
        mypath = myfolder + mydir
        if not os.path.exists(mypath):  # create one folder per weekday
            os.mkdir(mypath)

    mylist = []  # list that stores the scraped data

    mytarget = soup.find_all('div', attrs={'class': 'thumb'})
    print('Total number of webtoons: %d' % (len(mytarget)))

    for abcd in mytarget:
        myhref = abcd.find('a').attrs['href']
        myhref = myhref.replace('/webtoon/list.nhn?', '')
        result = myhref.split('&')
        mytitleid = result[0].split('=')[1]
        myweekday = result[1].split('=')[1]
        myweekday = weekday_dict[myweekday]
        # print(mytitleid + '/' + myweekday)

        imgtag = abcd.find('img')
        # print(imgtag)

        mysrc = imgtag.attrs['src']
        mytitle = imgtag.attrs['title'].strip()
        # '?' and ':' are not allowed in Windows file names
        mytitle = mytitle.replace('?', '').replace(':', '')

        # print(mytitle + '/' + mysrc)

        mytuple = tuple([mytitleid, myweekday, mytitle, mysrc])
        mylist.append(mytuple)

        # save the thumbnail image
        saveFile(mysrc, myweekday, mytitle)

    print(mylist)
    myframe = DataFrame(mylist, columns=['Title ID', 'Weekday', 'Title', 'Link'])
    filename = 'cartoon.csv'
    myframe.to_csv(filename, encoding='utf-8', index=False)
    print('Saved to ' + filename + '.')
except FileExistsError as err:
    pass  # ignore the error and move on
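
The script above pulls the title number and weekday out of each link with chained replace() and split() calls, and strips only '?' and ':' from titles. A slightly more robust sketch using the standard library follows. It is not the author's method: parse_webtoon_href() and safe_filename() are hypothetical helpers, the query parameter names titleId and weekday are an assumption about the link format, and the sample href uses a made-up title number.

```python
import re
from urllib.parse import urlparse, parse_qs


def parse_webtoon_href(href):
    """Return (title id, weekday code) from a link's query string.

    Assumes the parameters are named 'titleId' and 'weekday'.
    """
    query = parse_qs(urlparse(href).query)
    return query['titleId'][0], query['weekday'][0]


def safe_filename(title):
    """Remove characters that Windows does not allow in file names."""
    return re.sub(r'[\\/:*?"<>|]', '', title).strip()


if __name__ == '__main__':
    # Hypothetical link in the format the script expects.
    print(parse_webtoon_href('/webtoon/list.nhn?titleId=12345&weekday=mon'))
    print(safe_filename(' What? Who: me '))
```

Parsing by parameter name keeps working if the order of the query parameters changes, which the positional split('&') approach does not.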
