This repository handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive crawl and saves them to the httparchive dataset in BigQuery.
The pipelines run in the Dataform service in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. Each triggered pipeline run uses the code in the main branch.
Tag: crawl_complete
- Crawl dataset httparchive.crawl.*
  Consumers:
  - public dataset and BQ Sharing Listing
- Blink Features Report httparchive.blink_features.usage
  Consumers:
Tag: crux_ready
- httparchive.reports.cwv_tech_* and httparchive.reports.tech_*
  Consumers:
- crawl-complete PubSub subscription
  Tags: ["crawl_complete"]
- bq-poller-crux-ready Scheduler
  Tags: ["crux_ready"]
To unify how workflows are triggered, we use a Cloud Run function that can be invoked in several ways (e.g. by listening to PubSub messages), performs intermediate checks, and triggers the matching Dataform workflow execution configuration.
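The dispatch step described above can be sketched roughly as follows. This is a minimal illustration, not the code of the actual service: the `TRIGGERS` table, the `decodePubSubMessage` and `buildInvocation` helpers, the `isReady` flag, and the `name` field of the message payload are all hypothetical names introduced here. The sketch assumes the trigger names and tags listed above, the standard Pub/Sub push envelope (base64 payload in `message.data`), and a Dataform workflow invocation that selects actions through `invocationConfig.includedTags`.

```javascript
// Hypothetical trigger table, based on the trigger list above.
const TRIGGERS = {
  'crawl-complete': { tags: ['crawl_complete'] },
  'bq-poller-crux-ready': { tags: ['crux_ready'] }
}

// Decode the JSON payload of a Pub/Sub push envelope.
// Push deliveries carry the payload base64-encoded in `message.data`.
// Returns null when the envelope has no data.
function decodePubSubMessage (envelope) {
  const data = envelope && envelope.message && envelope.message.data
  if (!data) return null
  return JSON.parse(Buffer.from(data, 'base64').toString('utf8'))
}

// Build the request body for a Dataform workflow invocation.
// `isReady` stands in for an intermediate check (assumption), such as
// the result of polling BigQuery before a crux_ready run.
function buildInvocation (triggerName, compilationResult, isReady = true) {
  const trigger = TRIGGERS[triggerName]
  if (!trigger) throw new Error(`Unknown trigger: ${triggerName}`)
  if (!isReady) return null // check failed: skip this run
  return {
    compilationResult,
    invocationConfig: { includedTags: trigger.tags }
  }
}
```

A caller would decode the incoming request, look up the trigger, and send the returned body to the Dataform API only when it is not null. Keeping the trigger table and the readiness check as plain data and pure functions makes the routing easy to test without touching GCP.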
```mermaid
graph TB;
    subgraph Cloud Run
        dataform-service[dataform-service service]
        bigquery-export[bigquery-export job]
    end
    subgraph PubSub
        crawl-complete[crawl-complete topic]
        dataform-service-crawl-complete[dataform-service-crawl-complete subscription]
        crawl-complete --> dataform-service-crawl-complete
    end
    dataform-service-crawl-complete --> dataform-service
    subgraph Cloud_Scheduler
        bq-poller-crux-ready[bq-poller-crux-ready Poller Scheduler Job]
        bq-poller-crux-ready --> dataform-service
    end
    subgraph Dataform
        dataform[Dataform Repository]
        dataform_release_config[dataform Release Configuration]
        dataform_workflow[dataform Workflow Execution]
    end
    dataform-service --> dataform[Dataform Repository]
    dataform --> dataform_release_config
    dataform_release_config --> dataform_workflow
    subgraph BigQuery
        bq_jobs[BigQuery jobs]
        bq_datasets[BigQuery table updates]
        bq_jobs --> bq_datasets
    end
    dataform_workflow --> bq_jobs
    bq_jobs --> bigquery-export
    subgraph Monitoring
        cloud_run_logs[Cloud Run logs]
        dataform_logs[Dataform logs]
        bq_logs[BigQuery logs]
        alerting_policies[Alerting Policies]
        slack_notifications[Slack notifications]
        cloud_run_logs --> alerting_policies
        dataform_logs --> alerting_policies
        bq_logs --> alerting_policies
        alerting_policies --> slack_notifications
    end
    dataform-service --> cloud_run_logs
    dataform_workflow --> dataform_logs
    bq_jobs --> bq_logs
    bigquery-export --> cloud_run_logs
```
- Install dependencies: npm install
- Available Scripts:
  - npm run format: Format code using Standard.js, fix Markdown issues, and format Terraform files
  - npm run lint: Run linting checks on JavaScript and Markdown files, and compile Dataform configs
  - make tf_apply: Apply Terraform configurations

This repository uses:
- Standard.js for JavaScript code style
- Markdownlint for Markdown file formatting
- Dataform's built-in compiler for SQL validation