diff --git a/LiteralA-Grupo-9.ipynb b/LiteralA-Grupo-9.ipynb
new file mode 100644
index 0000000..7b60bb3
--- /dev/null
+++ b/LiteralA-Grupo-9.ipynb
@@ -0,0 +1,1005 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": [],
+ "authorship_tag": "ABX9TyMEdnFiwn2ZS6PxZp5kBQVs"
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Group 9 Practice - Following the recommendations, we chose a different website from the one in the example and ran the activity on it.**\n",
+ "**http://books.toscrape.com**\n",
+ "\n",
+ "1.- This section installs the packages we will use\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "ZfReJrK77xQs"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "id": "2_M3acf55VPn"
+ },
+ "outputs": [],
+ "source": [
+ "%pip -q install requests beautifulsoup4 lxml"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Section for importing the packages\n",
+ "1. requests lets me make the HTTP request.\n",
+ "2. BeautifulSoup helps me convert the HTML into a tree that I can traverse.\n",
+ "3. time is used to pause the scraping so it does not overload the site.\n",
+ "4. urljoin is used to build complete links when the site uses relative links."
+ ],
+ "metadata": {
+ "id": "CRnShOTg7SI3"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import time, math, re\n",
+ "import requests\n",
+ "from urllib.parse import urljoin\n",
+ "from bs4 import BeautifulSoup"
+ ],
+ "metadata": {
+ "id": "3luOLpTy5mgF"
+ },
+ "execution_count": 3,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Here I learned how to download the HTML of a web page.\n",
+ "I use requests.get(URL) to request the page from the server, and then .text gives me the HTML as text."
+ ],
+ "metadata": {
+ "id": "TxOgH6o07-BE"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "BASE = \"http://books.toscrape.com/\"\n",
+ "headers = {\"User-Agent\": \"Mozilla/5.0\"}\n",
+ "\n",
+ "resp = requests.get(BASE, headers=headers, timeout=20)\n",
+ "print(\"status:\", resp.status_code, \"| bytes:\", len(resp.text))\n",
+ "src = resp.text"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "nC0gXWjL5pgo",
+ "outputId": "15202c11-aeed-4d20-d865-9a85cadac16a"
+ },
+ "execution_count": 4,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "status: 200 | bytes: 51294\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The downloaded HTML is just plain text. With BeautifulSoup we turn it into an object called soup, which lets us navigate easily through tags, attributes, and classes."
+ ],
+ "metadata": {
+ "id": "3oPfd-Us8Gjr"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "soup = BeautifulSoup(src, \"lxml\")\n",
+ "print(soup.title.text) # page title"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "yFCagT3Y5sNI",
+ "outputId": "15990207-ea86-44f7-a877-5ff5e80f9564"
+ },
+ "execution_count": 5,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "\n",
+ " All products | Books to Scrape - Sandbox\n",
+ "\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "One of the most important things I learned is how to search for tags within the HTML: with .select(\"article.product_pod\") I find all the book articles on the page.\n",
+ "Each article holds the title, price, rating, and link inside different tags."
+ ],
+ "metadata": {
+ "id": "H0DhjDyY8TDJ"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# By tag + class\n",
+ "pods = soup.select(\"article.product_pod\")\n",
+ "len(pods), pods[0]"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "a5Bq1_ZW5u_U",
+ "outputId": "6015a237-fc28-4db9-c705-ec199a0c0123"
+ },
+ "execution_count": 6,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(20,\n",
+ " £51.77 \n",
+ " \n",
+ " \n",
+ " In stock\n",
+ " \n",
+ " A Light in the ...
\n",
+ "
£51.77
,\n", + "£53.74
,\n", + "£50.10
]"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 8
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "soup.select(\"article.product_pod p.star-rating\")[:3]"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "CX_bQFp-54Sv",
+ "outputId": "1f363fef-feef-4151-eeab-87f3895cfd44"
+ },
+ "execution_count": 9,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "[,\n",
+ " ,\n",
+ " ]"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 9
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Once I had found the tags, I learned to extract the information they contain:\n",
+ "\n",
+ "tag.get_text(strip=True) returns the text inside a tag.\n",
+ "\n",
+ "tag[\"attribute\"] accesses a specific attribute, such as an href link.\n",
+ "\n",
+ "I also learned that some HTML classes can themselves carry values (e.g. the rating)."
+ ],
+ "metadata": {
+ "id": "45sRp5m38j-g"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "pod = pods[0]\n",
+ "\n",
+ "# the title is in the 'title' attribute of the <a> tag\n",
+ "title = pod.select_one(\"h3 a\")[\"title\"]\n",
+ "\n",
+ "# price in p.price_color (e.g. '£51.77')\n",
+ "price = pod.select_one(\"p.price_color\").get_text(strip=True)\n",
+ "\n",
+ "# rating: second class of p.star-rating (e.g. ['star-rating','Three'] → 'Three')\n",
+ "rating = pod.select_one(\"p.star-rating\")[\"class\"][1]\n",
+ "\n",
+ "# relative link → absolute\n",
+ "href = pod.select_one(\"h3 a\")[\"href\"]\n",
+ "url = urljoin(BASE, href)\n",
+ "\n",
+ "title, price, rating, url"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "S1lkc9jx58WY",
+ "outputId": "700b04b5-139f-4e3d-fd87-f38b19223210"
+ },
+ "execution_count": 10,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "('A Light in the Attic',\n",
+ " '£51.77',\n",
+ " 'Three',\n",
+ " 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 10
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Challenge 1 – Extract all the books on the page**"
+ ],
+ "metadata": {
+ "id": "otWYDGnX9HCp"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Here I practiced writing a for loop to go through all the books on the page and store them as tuples (title, price, rating, url)."
+ ],
+ "metadata": {
+ "id": "RgUc7NM69Lnb"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "books = []\n",
+ "for pod in soup.select(\"article.product_pod\"):\n",
+ "    title = pod.select_one(\"h3 a\")[\"title\"]\n",
+ "    price = pod.select_one(\"p.price_color\").get_text(strip=True)\n",
+ "    rating = pod.select_one(\"p.star-rating\")[\"class\"][1]\n",
+ "    href = pod.select_one(\"h3 a\")[\"href\"]\n",
+ "    url = urljoin(BASE, href)\n",
+ "    books.append((title, price, rating, url))\n",
+ "\n",
+ "len(books), books[:3]"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "fSA05cSK6IDa",
+ "outputId": "94d3ad1f-bf28-4505-ad2d-0654d4729325"
+ },
+ "execution_count": 11,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(20,\n",
+ " [('A Light in the Attic',\n",
+ " '£51.77',\n",
+ " 'Three',\n",
+ " 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'),\n",
+ " ('Tipping the Velvet',\n",
+ " '£53.74',\n",
+ " 'One',\n",
+ " 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'),\n",
+ " ('Soumission',\n",
+ " '£50.10',\n",
+ " 'One',\n",
+ " 'http://books.toscrape.com/catalogue/soumission_998/index.html')])"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 11
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Challenge 2 – The get_books function**"
+ ],
+ "metadata": {
+ "id": "vXZgWawx9Q1m"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Next I learned to modularize the code, that is, to put it inside a function so it can be reused on different pages."
+ ],
+ "metadata": {
+ "id": "MZgWkSZ69Vuy"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def get_books(page_url: str):\n",
+ "    \"\"\"Scrape a BooksToScrape listing page and return [(title, price, rating, url), ...].\"\"\"\n",
+ "    r = requests.get(page_url, headers=headers, timeout=20)\n",
+ "    s = BeautifulSoup(r.text, \"lxml\")\n",
+ "    out = []\n",
+ "    for pod in s.select(\"article.product_pod\"):\n",
+ "        title = pod.select_one(\"h3 a\")[\"title\"]\n",
+ "        price = pod.select_one(\"p.price_color\").get_text(strip=True)\n",
+ "        rating = pod.select_one(\"p.star-rating\")[\"class\"][1]\n",
+ "        href = pod.select_one(\"h3 a\")[\"href\"]\n",
+ "        url = urljoin(page_url, href)\n",
+ "        out.append((title, price, rating, url))\n",
+ "    return out\n",
+ "\n",
+ "test = get_books(BASE)\n",
+ "len(test), test[0]"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "XCt4Mx1M6M2Z",
+ "outputId": "64956bf6-9794-4e3a-a4f3-916092a07f0f"
+ },
+ "execution_count": 12,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(20,\n",
+ " ('A Light in the Attic',\n",
+ " '£51.77',\n",
+ " 'Three',\n",
+ " 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'))"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 12
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Challenge 3 – Follow pagination (all pages)**"
+ ],
+ "metadata": {
+ "id": "d5WGO3fT9epn"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Here I learned that reading only the first page is not enough. Many sites span several pages, and we have to find the Next button to continue.\n",
+ "The trick is:\n",
+ "\n",
+ "1. Find li.next a in the HTML.\n",
+ "\n",
+ "2. Use urljoin to build the new link.\n",
+ "\n",
+ "3. Repeat until there are no more pages."
+ ],
+ "metadata": {
+ "id": "sKyY5CEi9hOT"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def get_all_books(start_url: str, delay=0.5):\n",
+ "    books = []\n",
+ "    url = start_url\n",
+ "    while True:\n",
+ "        r = requests.get(url, headers=headers, timeout=20)\n",
+ "        s = BeautifulSoup(r.text, \"lxml\")\n",
+ "        # collect the books on the current page\n",
+ "        for pod in s.select(\"article.product_pod\"):\n",
+ "            title = pod.select_one(\"h3 a\")[\"title\"]\n",
+ "            price = pod.select_one(\"p.price_color\").get_text(strip=True)\n",
+ "            rating = pod.select_one(\"p.star-rating\")[\"class\"][1]\n",
+ "            href = pod.select_one(\"h3 a\")[\"href\"]\n",
+ "            absurl = urljoin(url, href)\n",
+ "            books.append((title, price, rating, absurl))\n",
+ "        # is there a next page?\n",
+ "        nxt = s.select_one(\"li.next a\")\n",
+ "        if not nxt:\n",
+ "            break\n",
+ "        url = urljoin(url, nxt[\"href\"])\n",
+ "        time.sleep(delay)\n",
+ "    return books\n",
+ "\n",
+ "all_books = get_all_books(BASE) # ~1000 books in total\n",
+ "len(all_books), all_books[:3]"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "F3ELLvlX6S_T",
+ "outputId": "40057320-502f-4575-8474-34733cb7982a"
+ },
+ "execution_count": 13,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(1000,\n",
+ " [('A Light in the Attic',\n",
+ " '£51.77',\n",
+ " 'Three',\n",
+ " 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'),\n",
+ " ('Tipping the Velvet',\n",
+ " '£53.74',\n",
+ " 'One',\n",
+ " 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'),\n",
+ " ('Soumission',\n",
+ " '£50.10',\n",
+ " 'One',\n",
+ " 'http://books.toscrape.com/catalogue/soumission_998/index.html')])"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 13
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Challenge 4 – Extract by category**"
+ ],
+ "metadata": {
+ "id": "9wc8lFeI9vaP"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Finally, I learned that the scraping can also be organized by category, since the site has a sidebar menu with links for each book genre."
+ ],
+ "metadata": {
+ "id": "NpN4lATX9xvm"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# 1) get the category URLs\n",
+ "r = requests.get(BASE, headers=headers, timeout=20)\n",
+ "s = BeautifulSoup(r.text, \"lxml\")\n",
+ "cat_links = [(a.get_text(strip=True), urljoin(BASE, a[\"href\"]))\n",
+ "             for a in s.select(\"ul.nav-list a\") if a.get(\"href\")]\n",
+ "cat_links[:5]"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "G7HUFgw26eRf",
+ "outputId": "504eff9d-7729-4b39-fc7e-4010502a277b"
+ },
+ "execution_count": 14,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "[('Books', 'http://books.toscrape.com/catalogue/category/books_1/index.html'),\n",
+ " ('Travel',\n",
+ " 'http://books.toscrape.com/catalogue/category/books/travel_2/index.html'),\n",
+ " ('Mystery',\n",
+ " 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html'),\n",
+ " ('Historical Fiction',\n",
+ " 'http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html'),\n",
+ " ('Sequential Art',\n",
+ " 'http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html')]"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 14
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# 2) dictionary: category -> first N books\n",
+ "def get_books_by_category(limit_per_cat=30, delay=0.5):\n",
+ "    res = {}\n",
+ "    for cat_name, cat_url in cat_links:\n",
+ "        books = get_all_books(cat_url, delay=delay)\n",
+ "        res[cat_name] = books[:limit_per_cat]\n",
+ "        time.sleep(delay)\n",
+ "    return res\n",
+ "\n",
+ "buckets = get_books_by_category(limit_per_cat=10, delay=0.2)\n",
+ "list(buckets.keys())[:5], len(buckets)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "PqCCdUjL6jKs",
+ "outputId": "d8874e3c-26bc-4444-d89d-a78ea2a88d01"
+ },
+ "execution_count": 15,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(['Books', 'Travel', 'Mystery', 'Historical Fiction', 'Sequential Art'], 51)"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 15
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import pandas as pd\n",
+ "df = pd.DataFrame(all_books, columns=[\"title\",\"price\",\"rating\",\"url\"])\n",
+ "df.head()\n",
+ "\n",
+ "# df.to_csv(\"books.csv\", index=False)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 206
+ },
+ "id": "8lJoA4gb6uXN",
+ "outputId": "8d796f3b-fdcd-4a4a-d32c-6145bfc16512"
+ },
+ "execution_count": 16,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " title price rating \\\n",
+ "0 A Light in the Attic £51.77 Three \n",
+ "1 Tipping the Velvet £53.74 One \n",
+ "2 Soumission £50.10 One \n",
+ "3 Sharp Objects £47.82 Four \n",
+ "4 Sapiens: A Brief History of Humankind £54.23 Five \n",
+ "\n",
+ " url \n",
+ "0 http://books.toscrape.com/catalogue/a-light-in... \n",
+ "1 http://books.toscrape.com/catalogue/tipping-th... \n",
+ "2 http://books.toscrape.com/catalogue/soumission... \n",
+ "3 http://books.toscrape.com/catalogue/sharp-obje... \n",
+ "4 http://books.toscrape.com/catalogue/sapiens-a-... "
+ ],
+ "text/html": [
| | title | price | rating | url |
|---|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | Three | http://books.toscrape.com/catalogue/a-light-in... |
| 1 | Tipping the Velvet | £53.74 | One | http://books.toscrape.com/catalogue/tipping-th... |
| 2 | Soumission | £50.10 | One | http://books.toscrape.com/catalogue/soumission... |
| 3 | Sharp Objects | £47.82 | Four | http://books.toscrape.com/catalogue/sharp-obje... |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | Five | http://books.toscrape.com/catalogue/sapiens-a-... |