HTML 파싱을 이용한 크롤링 | Howard's nest

0%

HTML 파싱을 이용한 크롤링

Posted on 2020-02-21 Edited on 2020-05-05 In TIL.. , Crawling Disqus:

HTML 데이터 파싱을 이용한 크롤링

웹페이지 분석 : URL 찾기
요청 -> 응답 : HTML(str) 가져오기
HTML(str) -> BeautifulSoup 객체에서 css-selector를 통해 내용을 가져옴 -> 데이터프레임으로 변환

크롤링 과정

BeautifulSoup라는 클래스의 매서드 활용 response안에 있는 애들을 BeautifulSoup으로 HTML로 파싱

# requests.get 매서드로 요철
response = requests.get(url)

# response.content로 내용 선택, html.parser로 HTML로 파싱
dom = BeautifulSoup(response.content, "html.parser")

`bs4.BeautifulSoupr`

select 매서드 : 여러개의 element 객체를 리스트로 가져옴 select_one 매서드 : 하나의 element 객체를 가져옴

# 크롬 개발자 도구에서 copy selector했을때 id가 나올때 까지 찾아 복사함
# 리스트안의 딕셔너리 형태로 만들어줌{"컬럼명":내용}
datas = []
for element in elements:
    datas.append({
        "title": element.select_one('.tit_g').text.strip().replace("\n",""), #tit_g이름의 클래스 안에 텍스트를 잡음
        "link" : element.select_one('a').get("href"), #href속성의 값을 가져옴
    })
article_df = pd.DataFrame(datas)