[AWS] Scraping (2. Analyzing HTML with Beautiful Soup)

Prerequisites

You have completed the steps in the article below and can fetch web content:
amegaeru.hatenablog.jp

Hands-on!

１．Confirm that Beautiful Soup is installed
１－１．Run the following in a SageMaker notebook and confirm that beautifulsoup4 appears in the list of installed packages

pip list
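
As an alternative to scanning the full pip list output by eye, the check can be sketched in Python with the standard library. This is a minimal sketch, not part of the original steps; the helper name installed_version is hypothetical and introduced only for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "beautifulsoup4" is the distribution name; the import name is "bs4"
version = installed_version("beautifulsoup4")
if version is None:
    print("beautifulsoup4 is not installed")
else:
    print(f"beautifulsoup4 {version} is installed")
```

If the package turns out to be missing, installing it with pip from the notebook should be enough.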

２．Analyze the HTML
２－１．Run the following and confirm that the HTML is parsed into tags

import requests
from bs4 import BeautifulSoup

load_url = "https://xxx.co.jp"
html = requests.get(load_url)
soup = BeautifulSoup(html.content, "html.parser")

print(soup)

Note: the raw output is hard to make sense of as-is...
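
One way to make the parsed output easier to read is BeautifulSoup's prettify() method, which prints one tag per line with indentation. The sketch below uses a small inline HTML string (a made-up sample, not the real page) so it runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML standing in for the fetched page
sample_html = "<html><head><title>Sample</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the document as an indented string, one tag per line
pretty = soup.prettify()
print(pretty)
```

With the real page, the same call would be print(soup.prettify()) on the soup object built from html.content.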

３．Extract a tag
３－１．Run the following and confirm that the text of the title tag is printed

import requests
from bs4 import BeautifulSoup

load_url = "https://xxx.co.jp"
html = requests.get(load_url)
soup = BeautifulSoup(html.content, "html.parser")

print(soup.find("title").text)
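
Note that soup.find() returns None when no matching tag exists, so accessing .text directly raises an AttributeError on pages without a title. A minimal sketch of a guarded version follows; the function name page_title and the sample HTML strings are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def page_title(html_text):
    """Return the stripped <title> text, or None if the page has no title tag."""
    soup = BeautifulSoup(html_text, "html.parser")
    title_tag = soup.find("title")
    if title_tag is None:
        return None
    return title_tag.get_text(strip=True)


# Hypothetical sample documents: one with a title, one without
with_title = page_title("<html><head><title> Sample page </title></head></html>")
without_title = page_title("<html><body><p>No title here</p></body></html>")

print(with_title)
print(without_title)
```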

４．Get every element with a given tag
４－１．Run the following

import requests
from bs4 import BeautifulSoup

load_url = "https://yahoo.co.jp"
html = requests.get(load_url)
soup = BeautifulSoup(html.content, "html.parser")

for element in soup.find_all("li"):
    print(element.text)
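
Text inside li elements often carries surrounding whitespace or nested tags. As a rough sketch, get_text(strip=True) can be used to collect clean strings into a list; the HTML below is a made-up sample, not the structure of any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample list standing in for a real page
sample_html = """
<ul>
  <li> News </li>
  <li>Weather</li>
  <li><a href="/sports">Sports</a></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag; get_text(strip=True) trims whitespace
items = [li.get_text(strip=True) for li in soup.find_all("li")]
print(items)
```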

５．Get specific elements by id or class attribute
５－１．Run the following

import requests
from bs4 import BeautifulSoup

load_url = "https://yahoo.co.jp"
html = requests.get(load_url)
soup = BeautifulSoup(html.content, "html.parser")

for element in soup.find_all(id="Message"):
    print(element.text)
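
The step above searches by id only. Searching by class works similarly, except that the keyword argument is class_ (with a trailing underscore, because class is a reserved word in Python). The sketch below uses invented id and class names on a sample document; real pages will use different ones.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML; the id and class names are made up for illustration
sample_html = """
<div id="Message">Welcome</div>
<p class="topic">First topic</p>
<p class="topic highlight">Second topic</p>
<p class="other">Unrelated</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# An id should be unique in a page, so find() is usually enough
message = soup.find(id="Message")

# class_ matches any element whose class list contains "topic"
topics = soup.find_all(class_="topic")

# The same class search written as a CSS selector
topics_css = soup.select("p.topic")

print(message.get_text())
for topic in topics:
    print(topic.get_text())
```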

６．Get a list of links
６－１．Run the following

import requests
from bs4 import BeautifulSoup

load_url = "https://xxx.co.jp/"
html = requests.get(load_url)
soup = BeautifulSoup(html.content, "html.parser")

# Search the whole document for anchor tags ("topic" was never defined in this snippet)
for element in soup.find_all("a"):
    print(element.text)
    url = element.get("href")
    print(url)
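
The href values returned this way are often relative paths, and some a tags have no href at all. A minimal sketch of resolving them into absolute URLs with urllib.parse.urljoin from the standard library is shown below; the sample HTML is invented, and base_url reuses the placeholder URL from the step above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://xxx.co.jp/"  # placeholder URL from the article

# Hypothetical sample links standing in for a real page
sample_html = """
<a href="/news/1">News</a>
<a href="about.html">About</a>
<a href="https://example.com/">External</a>
<a>No href</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

links = []
for a in soup.find_all("a"):
    href = a.get("href")  # None when the attribute is missing
    if href is None:
        continue
    # urljoin resolves relative paths against base_url and keeps absolute URLs
    links.append((a.get_text(strip=True), urljoin(base_url, href)))

for text, url in links:
    print(text, url)
```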

あめがえるのITブログ (Amegaeru's IT Blog)

A blog about not overdoing it: just the right amount of effort.
