Python library for interfacing with the Mozilla Data Collective REST API.
Install the package using pip:
pip install datacollective
Get your API key from the Mozilla Data Collective dashboard.
Set up your environment:
If you have cloned the repository, you can run the following command:
# Copy the example environment file
cp .env.example .env
Otherwise, copy and paste the following into a file called .env in your current working directory.
MDC_API_KEY=<MDC_API_KEY> # change to your MDC API Key
MDC_API_URL=https://datacollective.mozillafoundation.org/api # change to MDC API URL endpoint
MDC_DOWNLOAD_PATH=~/.mozdata/datasets # change to where you want to download datasets
Configure your API key by editing .env:
# Required: Your MDC API key
MDC_API_KEY=your-api-key-here
# Optional: Download path for datasets (defaults to ~/.mozdata/datasets)
MDC_DOWNLOAD_PATH=~/.mozdata/datasets
Start using the library:
from datacollective import DataCollective

# Initialize the client
client = DataCollective()

# Download a dataset
client.get_dataset('mdc-dataset-id')
The client loads configuration from environment variables or .env files:
MDC_API_KEY - Your Mozilla Data Collective API key (required)
MDC_API_URL - API endpoint (defaults to production)
MDC_DOWNLOAD_PATH - Where to download datasets (defaults to ~/.mozdata/datasets)
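The required/optional split described above can be illustrated with a small standard-library sketch. This is not the library's own loading code: `resolve_config` is a hypothetical helper, and treating the URL from this README's example as the production default is an assumption.

```python
import os
from pathlib import Path

# Assumed defaults, taken from the example values shown in this README.
DEFAULT_API_URL = "https://datacollective.mozillafoundation.org/api"
DEFAULT_DOWNLOAD_PATH = "~/.mozdata/datasets"


def resolve_config(env=None):
    """Hypothetical helper: resolve MDC settings from a mapping of variables.

    Falls back to os.environ when no mapping is given.
    """
    env = os.environ if env is None else env

    # MDC_API_KEY is the only required setting.
    api_key = env.get("MDC_API_KEY", "").strip()
    if not api_key:
        raise ValueError("MDC_API_KEY is required but not set")

    return {
        "api_key": api_key,
        # Optional settings fall back to the assumed defaults.
        "api_url": env.get("MDC_API_URL", DEFAULT_API_URL).rstrip("/"),
        # Expand "~" so the path is usable on disk.
        "download_path": Path(
            env.get("MDC_DOWNLOAD_PATH", DEFAULT_DOWNLOAD_PATH)
        ).expanduser(),
    }
```

Calling `resolve_config({"MDC_API_KEY": "your-api-key-here"})` returns the key together with the two defaults, while an empty mapping raises `ValueError`. A check like this can be useful for failing fast before any download is attempted.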
Create a .env file in your project root:
# MDC API Configuration
MDC_API_KEY=your-api-key-here
MDC_API_URL=https://datacollective.mozillafoundation.org/api
MDC_DOWNLOAD_PATH=~/.mozdata/datasets
Note: Never commit .env files to version control, as they contain sensitive information.
from datacollective import DataCollective
# Initialize client (loads from .env automatically)
client = DataCollective()
# Verify your configuration
print(f"API URL: {client.api_url}")
print(f"Download path: {client.download_path}")
# Download a dataset
dataset = client.get_dataset('your-dataset-id')
Note: At present, loading a dataset into memory with load_dataset only works with Mozilla Common Voice datasets.
from datacollective import DataCollective
client = DataCollective()
dataset = client.load_dataset("<dataset-id>") # Load dataset into memory
df = dataset.to_pandas() # Convert to pandas for queryable form
dataset.splits # A list of all splits available in the dataset
You can use different environment configurations:
# Production environment (default, uses .env)
client = DataCollective()
# Development environment (uses .env.development)
client = DataCollective(environment='development')
# Staging environment (uses .env.staging)
client = DataCollective(environment='staging')
This project is released under MPL (Mozilla Public License) 2.0.