From 894bc572037d72431ebf5d75c062c0b0fb20dadf Mon Sep 17 00:00:00 2001 From: carlosmolina2007-gif Date: Mon, 25 Aug 2025 20:27:31 -0500 Subject: [PATCH] =?UTF-8?q?Se=20cre=C3=B3=20con=20Colab?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- LiteralA-Grupo-9.ipynb | 1005 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1005 insertions(+) create mode 100644 LiteralA-Grupo-9.ipynb diff --git a/LiteralA-Grupo-9.ipynb b/LiteralA-Grupo-9.ipynb new file mode 100644 index 0000000..7b60bb3 --- /dev/null +++ b/LiteralA-Grupo-9.ipynb @@ -0,0 +1,1005 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "authorship_tag": "ABX9TyMEdnFiwn2ZS6PxZp5kBQVs" + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "**Group 9 Practice - Following the recommendations, we chose a website other than the one in the example and ran the activity on it.**\n", + "**http://books.toscrape.com**\n", + "\n", + "1.- This section installs the packages we will use\n", + "\n" + ], + "metadata": { + "id": "ZfReJrK77xQs" + } + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "2_M3acf55VPn" + }, + "outputs": [], + "source": [ + "%pip -q install requests beautifulsoup4 lxml" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Section for importing the packages\n", + "1. requests lets me make the HTTP request.\n", + "2. BeautifulSoup helps me turn the HTML into a tree that I can traverse.\n", + "3. time is used to pause the scraping so we don't overload the site.\n", + "4. urljoin builds full links when the page contains relative links."
+ ], + "metadata": { + "id": "CRnShOTg7SI3" + } + }, + { + "cell_type": "code", + "source": [ + "import time, math, re\n", + "import requests\n", + "from urllib.parse import urljoin\n", + "from bs4 import BeautifulSoup" + ], + "metadata": { + "id": "3luOLpTy5mgF" + }, + "execution_count": 3, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Here I learned how to download the HTML of a web page.\n", + "I use requests.get(URL) to ask the server for the page, and then .text gives me the HTML as a string." + ], + "metadata": { + "id": "TxOgH6o07-BE" + } + }, + { + "cell_type": "code", + "source": [ + "BASE = \"http://books.toscrape.com/\"\n", + "headers = {\"User-Agent\": \"Mozilla/5.0\"}\n", + "\n", + "resp = requests.get(BASE, headers=headers, timeout=20)\n", + "print(\"status:\", resp.status_code, \"| bytes:\", len(resp.text))\n", + "src = resp.text" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "nC0gXWjL5pgo", + "outputId": "15202c11-aeed-4d20-d865-9a85cadac16a" + }, + "execution_count": 4, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "status: 200 | bytes: 51294\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "The downloaded HTML is just plain text. With BeautifulSoup we turn it into an object called soup, which lets us navigate tags, attributes, and classes easily."
+ ], + "metadata": { + "id": "3oPfd-Us8Gjr" + } + }, + { + "cell_type": "code", + "source": [ + "soup = BeautifulSoup(src, \"lxml\")\n", + "print(soup.title.text) # page title" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "yFCagT3Y5sNI", + "outputId": "15990207-ea86-44f7-a877-5ff5e80f9564" + }, + "execution_count": 5, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + " All products | Books to Scrape - Sandbox\n", + "\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "One of the most important things I learned is how to find tags inside the HTML: with .select(\"article.product_pod\") I get every book article on the page.\n", + "Each article holds the title, price, rating, and link in different tags." + ], + "metadata": { + "id": "H0DhjDyY8TDJ" + } + }, + { + "cell_type": "code", + "source": [ + "# By tag + class\n", + "pods = soup.select(\"article.product_pod\")\n", + "len(pods), pods[0]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "a5Bq1_ZW5u_U", + "outputId": "6015a237-fc28-4db9-c705-ec199a0c0123" + }, + "execution_count": 6, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(20,\n", + "
\n", + "
\n", + " \"A\n", + "
\n", + "

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "

\n", + "

A Light in the ...

\n", + "
\n", + "

£51.77

\n", + "

\n", + " \n", + " \n", + " In stock\n", + " \n", + "

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
)" + ] + }, + "metadata": {}, + "execution_count": 6 + } + ] + }, + { + "cell_type": "code", + "source": [ + "links = pods[0].select(\"h3 a\")\n", + "links[0]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "UcMSwMIo5yNC", + "outputId": "54e1f594-2ad0-400e-d7d0-fccc030b8549" + }, + "execution_count": 7, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "A Light in the ..." + ] + }, + "metadata": {}, + "execution_count": 7 + } + ] + }, + { + "cell_type": "code", + "source": [ + "soup.select(\"article.product_pod p.price_color\")[:3]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "d0svdSkp51qq", + "outputId": "3d750e01-b228-41eb-da78-550f44240533" + }, + "execution_count": 8, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[

£51.77

,\n", + "

£53.74

,\n", + "

£50.10

]" + ] + }, + "metadata": {}, + "execution_count": 8 + } + ] + }, + { + "cell_type": "code", + "source": [ + "soup.select(\"article.product_pod p.star-rating\")[:3]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "CX_bQFp-54Sv", + "outputId": "1f363fef-feef-4151-eeab-87f3895cfd44" + }, + "execution_count": 9, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "

,\n", + "

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "

,\n", + "

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "

]" + ] + }, + "metadata": {}, + "execution_count": 9 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Once I had found the tags, I learned how to extract the information they contain:\n", + "\n", + "tag.get_text(strip=True) returns the text inside a tag.\n", + "\n", + "tag[\"attribute\"] accesses a specific attribute, such as the href of a link.\n", + "\n", + "I also learned that some HTML classes carry values themselves (e.g. the rating)." + ], + "metadata": { + "id": "45sRp5m38j-g" + } + }, + { + "cell_type": "code", + "source": [ + "pod = pods[0]\n", + "\n", + "# the title is in the 'title' attribute of the <a>\n", + "title = pod.select_one(\"h3 a\")[\"title\"]\n", + "\n", + "# price in p.price_color (e.g. '£51.77')\n", + "price = pod.select_one(\"p.price_color\").get_text(strip=True)\n", + "\n", + "# rating: second class of p.star-rating (e.g. ['star-rating','Three'] → 'Three')\n", + "rating = pod.select_one(\"p.star-rating\")[\"class\"][1]\n", + "\n", + "# relative link → absolute\n", + "href = pod.select_one(\"h3 a\")[\"href\"]\n", + "url = urljoin(BASE, href)\n", + "\n", + "title, price, rating, url" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "S1lkc9jx58WY", + "outputId": "700b04b5-139f-4e3d-fd87-f38b19223210" + }, + "execution_count": 10, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "('A Light in the Attic',\n", + " '£51.77',\n", + " 'Three',\n", + " 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')" + ] + }, + "metadata": {}, + "execution_count": 10 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "**Challenge 1 – Extract every book on the page**" + ], + "metadata": { + "id": "otWYDGnX9HCp" + } + }, + { + "cell_type": "markdown", + "source": [ + "Here I practiced writing a for loop that walks through every book on the page and stores each one as a tuple (title, price, rating, url)."
+ ], + "metadata": { + "id": "RgUc7NM69Lnb" + } + }, + { + "cell_type": "code", + "source": [ + "books = []\n", + "for pod in soup.select(\"article.product_pod\"):\n", + "    title = pod.select_one(\"h3 a\")[\"title\"]\n", + "    price = pod.select_one(\"p.price_color\").get_text(strip=True)\n", + "    rating = pod.select_one(\"p.star-rating\")[\"class\"][1]\n", + "    href = pod.select_one(\"h3 a\")[\"href\"]\n", + "    url = urljoin(BASE, href)\n", + "    books.append((title, price, rating, url))\n", + "\n", + "len(books), books[:3]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "fSA05cSK6IDa", + "outputId": "94d3ad1f-bf28-4505-ad2d-0654d4729325" + }, + "execution_count": 11, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(20,\n", + " [('A Light in the Attic',\n", + " '£51.77',\n", + " 'Three',\n", + " 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'),\n", + " ('Tipping the Velvet',\n", + " '£53.74',\n", + " 'One',\n", + " 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'),\n", + " ('Soumission',\n", + " '£50.10',\n", + " 'One',\n", + " 'http://books.toscrape.com/catalogue/soumission_998/index.html')])" + ] + }, + "metadata": {}, + "execution_count": 11 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "**Challenge 2 – The get_books function**" + ], + "metadata": { + "id": "vXZgWawx9Q1m" + } + }, + { + "cell_type": "markdown", + "source": [ + "Next I learned to modularize the code, that is, to put it inside a function so it can be reused on different pages."
+ ], + "metadata": { + "id": "MZgWkSZ69Vuy" + } + }, + { + "cell_type": "code", + "source": [ + "def get_books(page_url: str):\n", + "    \"\"\"Scrape one BooksToScrape listing page and return [(title, price, rating, url), ...].\"\"\"\n", + "    r = requests.get(page_url, headers=headers, timeout=20)\n", + "    s = BeautifulSoup(r.text, \"lxml\")\n", + "    out = []\n", + "    for pod in s.select(\"article.product_pod\"):\n", + "        title = pod.select_one(\"h3 a\")[\"title\"]\n", + "        price = pod.select_one(\"p.price_color\").get_text(strip=True)\n", + "        rating = pod.select_one(\"p.star-rating\")[\"class\"][1]\n", + "        href = pod.select_one(\"h3 a\")[\"href\"]\n", + "        url = urljoin(page_url, href)\n", + "        out.append((title, price, rating, url))\n", + "    return out\n", + "\n", + "test = get_books(BASE)\n", + "len(test), test[0]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "XCt4Mx1M6M2Z", + "outputId": "64956bf6-9794-4e3a-a4f3-916092a07f0f" + }, + "execution_count": 12, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(20,\n", + " ('A Light in the Attic',\n", + " '£51.77',\n", + " 'Three',\n", + " 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'))" + ] + }, + "metadata": {}, + "execution_count": 12 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "**Challenge 3 – Follow pagination (all pages)**" + ], + "metadata": { + "id": "d5WGO3fT9epn" + } + }, + { + "cell_type": "markdown", + "source": [ + "Here I learned that reading only the first page is not enough. Many sites spread their content over several pages, so we have to look for the Next button to keep going.\n", + "The trick is:\n", + "\n", + "1. Find li.next a in the HTML.\n", + "\n", + "2. Use urljoin to build the new link.\n", + "\n", + "3. Repeat until there are no more pages."
+ ], + "metadata": { + "id": "sKyY5CEi9hOT" + } + }, + { + "cell_type": "code", + "source": [ + "def get_all_books(start_url: str, delay=0.5):\n", + "    books = []\n", + "    url = start_url\n", + "    while True:\n", + "        r = requests.get(url, headers=headers, timeout=20)\n", + "        s = BeautifulSoup(r.text, \"lxml\")\n", + "        # collect the books on the current page\n", + "        for pod in s.select(\"article.product_pod\"):\n", + "            title = pod.select_one(\"h3 a\")[\"title\"]\n", + "            price = pod.select_one(\"p.price_color\").get_text(strip=True)\n", + "            rating = pod.select_one(\"p.star-rating\")[\"class\"][1]\n", + "            href = pod.select_one(\"h3 a\")[\"href\"]\n", + "            absurl = urljoin(url, href)\n", + "            books.append((title, price, rating, absurl))\n", + "        # is there a next page?\n", + "        nxt = s.select_one(\"li.next a\")\n", + "        if not nxt:\n", + "            break\n", + "        url = urljoin(url, nxt[\"href\"])\n", + "        time.sleep(delay)\n", + "    return books\n", + "\n", + "all_books = get_all_books(BASE) # ~1000 books in total\n", + "len(all_books), all_books[:3]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "F3ELLvlX6S_T", + "outputId": "40057320-502f-4575-8474-34733cb7982a" + }, + "execution_count": 13, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(1000,\n", + " [('A Light in the Attic',\n", + " '£51.77',\n", + " 'Three',\n", + " 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'),\n", + " ('Tipping the Velvet',\n", + " '£53.74',\n", + " 'One',\n", + " 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'),\n", + " ('Soumission',\n", + " '£50.10',\n", + " 'One',\n", + " 'http://books.toscrape.com/catalogue/soumission_998/index.html')])" + ] + }, + "metadata": {}, + "execution_count": 13 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "**Challenge 4 – Extract by category**" + ], + "metadata": { + "id": "9wc8lFeI9vaP" + } + }, + { + "cell_type": "markdown", + "source": [
+ "Finally, I learned that the scraping can also be organized by category, since the site has a sidebar menu with a link for each book genre." + ], + "metadata": { + "id": "NpN4lATX9xvm" + } + }, + { + "cell_type": "code", + "source": [ + "# 1) get the category URLs\n", + "r = requests.get(BASE, headers=headers, timeout=20)\n", + "s = BeautifulSoup(r.text, \"lxml\")\n", + "cat_links = [(a.get_text(strip=True), urljoin(BASE, a[\"href\"]))\n", + "             for a in s.select(\"ul.nav-list a\") if a.get(\"href\")]\n", + "cat_links[:5]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "G7HUFgw26eRf", + "outputId": "504eff9d-7729-4b39-fc7e-4010502a277b" + }, + "execution_count": 14, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[('Books', 'http://books.toscrape.com/catalogue/category/books_1/index.html'),\n", + " ('Travel',\n", + " 'http://books.toscrape.com/catalogue/category/books/travel_2/index.html'),\n", + " ('Mystery',\n", + " 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html'),\n", + " ('Historical Fiction',\n", + " 'http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html'),\n", + " ('Sequential Art',\n", + " 'http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html')]" + ] + }, + "metadata": {}, + "execution_count": 14 + } + ] + }, + { + "cell_type": "code", + "source": [ + "# 2) dictionary: category -> first N books\n", + "def get_books_by_category(limit_per_cat=30, delay=0.5):\n", + "    res = {}\n", + "    for cat_name, cat_url in cat_links:\n", + "        books = get_all_books(cat_url, delay=delay)\n", + "        res[cat_name] = books[:limit_per_cat]\n", + "        time.sleep(delay)\n", + "    return res\n", + "\n", + "buckets = get_books_by_category(limit_per_cat=10, delay=0.2)\n", + "list(buckets.keys())[:5], len(buckets)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": 
"PqCCdUjL6jKs", + "outputId": "d8874e3c-26bc-4444-d89d-a78ea2a88d01" + }, + "execution_count": 15, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(['Books', 'Travel', 'Mystery', 'Historical Fiction', 'Sequential Art'], 51)" + ] + }, + "metadata": {}, + "execution_count": 15 + } + ] + }, + { + "cell_type": "code", + "source": [ + "import pandas as pd\n", + "df = pd.DataFrame(all_books, columns=[\"title\",\"price\",\"rating\",\"url\"])\n", + "df.head()\n", + "\n", + "# df.to_csv(\"books.csv\", index=False)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "8lJoA4gb6uXN", + "outputId": "8d796f3b-fdcd-4a4a-d32c-6145bfc16512" + }, + "execution_count": 16, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " title price rating \\\n", + "0 A Light in the Attic £51.77 Three \n", + "1 Tipping the Velvet £53.74 One \n", + "2 Soumission £50.10 One \n", + "3 Sharp Objects £47.82 Four \n", + "4 Sapiens: A Brief History of Humankind £54.23 Five \n", + "\n", + " url \n", + "0 http://books.toscrape.com/catalogue/a-light-in... \n", + "1 http://books.toscrape.com/catalogue/tipping-th... \n", + "2 http://books.toscrape.com/catalogue/soumission... \n", + "3 http://books.toscrape.com/catalogue/sharp-obje... \n", + "4 http://books.toscrape.com/catalogue/sapiens-a-... " + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"# df\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Tipping the Velvet\",\n \"Sapiens: A Brief History of Humankind\",\n \"Soumission\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"price\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"\\u00a353.74\",\n \"\\u00a354.23\",\n \"\\u00a350.10\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rating\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"One\",\n \"Five\",\n \"Three\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"url\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html\",\n \"http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html\",\n \"http://books.toscrape.com/catalogue/soumission_998/index.html\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 16 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Conclusion: For this practice I chose a different page to try other scenarios, and I learned how to go from a web page in plain HTML to a Python structure that I can traverse and analyze. With requests I fetched the content, with BeautifulSoup I turned it into something navigable, and then I learned to find tags, extract attributes, follow pagination, and modularize my code into functions." + ], + "metadata": { + "id": "i3cPlfNb97OG" + } + } + ] +} \ No newline at end of file