Datasets Reference Guide

This guide lists every dataset used in the CCAI9012 starter kits, plus similar public datasets for extended research and projects.

Datasets Used in Starter Kits

| Dataset | Category | Module | Description | Direct Link | Size | License |
|---|---|---|---|---|---|---|
| Building Profile & Road Network | Computer Vision | Module 1 | Image pairs for GAN training - building profiles and corresponding road networks | GitHub - GANmapper | ~500MB | MIT |
| Yelp Open Dataset | Text Analysis | Module 2 | Business reviews, user data, and check-ins for sentiment analysis | Yelp Dataset | ~10GB | Academic Use |
| Inside Airbnb Dataset | Text Analysis | Module 2 | Airbnb listings and reviews data for accommodation analysis | Inside Airbnb | Varies by city | CC0 1.0 |
| Energy Action Plans | Document Analysis | Module 2 | PDF documents containing energy action plans | CCHRC | ~100MB | Public Domain |
| Google Street View Imagery | Computer Vision | Module 3 & 4 | Street-level imagery for urban analysis and perception scoring | Google Maps API | API-based | Commercial |
| Webcam Data | Computer Vision | Module 4 | Real-time webcam feeds for pedestrian behavior analysis | Skyline Webcams | Streaming | Varies |
| California Housing Prices | Machine Learning | Module 4 | Housing price data for regression analysis | Scikit-learn | ~1MB | BSD |
| German Credit Dataset | Bias Detection | Module 5 | Credit approval data for fairness analysis | AIF360 | ~100KB | Public |
| COMPAS Dataset | Bias Detection | Module 5 | Criminal risk assessment data for bias auditing | Kaggle - COMPAS | ~50KB | Public |

Additional Public Datasets

Urban Planning & Real Estate

| Dataset | Description | Direct Link | Size | License |
|---|---|---|---|---|
| NYC Property Sales | Real estate transactions in New York City | NYC OpenData | ~500MB | Public |
| London Housing Data | UK housing prices and features | Kaggle - London Housing | ~50MB | CC0 1.0 |
| Zillow Home Value Data | US housing market data | Kaggle - Zillow | ~2GB | Public |
| OpenStreetMap Building Data | Global building footprints and attributes | OSM Buildings | Varies | ODbL |
| Microsoft Building Footprints | Global building footprints from satellite imagery | GitHub - MS Buildings | ~100GB | ODbL |

Review & Text Data

| Dataset | Description | Direct Link | Size | License |
|---|---|---|---|---|
| Amazon Product Reviews | Multi-domain product reviews for sentiment analysis | Kaggle - Amazon Reviews | ~3GB | Academic |
| TripAdvisor Hotel Reviews | Hotel reviews with ratings and locations | Kaggle - TripAdvisor | ~100MB | CC0 1.0 |
| Twitter Sentiment Analysis | Tweet data with sentiment labels | Kaggle - Twitter Sentiment | ~200MB | Academic |

Computer Vision & Street Imagery

| Dataset | Description | Direct Link | Size | License |
|---|---|---|---|---|
| Mapillary Street View | Global street-level imagery with semantic segmentation | Mapillary | ~50GB | Commercial |
| Cityscapes Dataset | Urban street scenes with semantic annotations | Cityscapes | ~50GB | Academic |
| ADE20K Dataset | Scene parsing dataset with indoor/outdoor scenes | MIT ADE20K | ~3GB | BSD |
| COCO Dataset | Object detection and segmentation | COCO | ~25GB | CC BY 4.0 |

Pedestrian & Traffic Data

| Dataset | Description | Direct Link | Size | License |
|---|---|---|---|---|
| MOT Challenge | Multi-object tracking in pedestrian scenarios | MOT Challenge | ~10GB | Academic |
| US Highway Traffic Data | Federal Highway Administration traffic monitoring and statistics | FHWA Policy Information | ~1GB | Public |
| NYC Taxi Trip Data | Taxi trip records for mobility analysis | NYC TLC | ~10GB/month | Public |
| Bike Share Data | Global bike sharing system data | Kaggle - Bike Share | ~100MB | Public |

Fairness & Bias Detection

| Dataset | Description | Direct Link | Size | License |
|---|---|---|---|---|
| Adult Income Dataset | Census data for income prediction bias analysis | UCI Adult | ~5MB | Public |
| Bank Marketing Dataset | Marketing campaign data for fairness analysis | UCI Bank | ~5MB | Public |
| ProPublica COMPAS | Criminal risk assessment analysis | ProPublica | ~1MB | Public |
| Fair Lending Dataset | Mortgage lending data for discrimination analysis | FFIEC HMDA | ~1GB | Public |
| Chicago Police Data | Police incident reports for bias analysis | Chicago Data Portal | ~2GB | Public |

Government & Policy Documents

| Dataset | Description | Direct Link | Size | License |
|---|---|---|---|---|
| EU Law Documents | European Union legal texts and policies | EUR-Lex | Varies | Public |
| UN Documents | United Nations reports and resolutions | UN Documentation | Varies | Public |

Environmental & Climate Data

| Dataset | Description | Direct Link | Size | License |
|---|---|---|---|---|
| NASA Climate Data | Global climate and weather observations | NASA Earthdata | ~100TB | Public |
| EPA Air Quality Data | US air pollution measurements | EPA AQS | ~10GB | Public |
| Copernicus Climate Data Store | European climate reanalysis data, forecasts, and observations | CDS Climate Copernicus | Varies | Free Registration |
| OpenWeatherMap | Weather data and forecasts | OpenWeather API | API-based | Commercial |
| Sentinel Satellite Data | European satellite imagery for environmental monitoring | Copernicus Hub | ~100TB | Free |

Dataset Usage Guidelines

Before Using Any Dataset:

  1. Check License Requirements: Ensure you comply with each dataset’s license terms
  2. Verify Data Quality: Examine data completeness and potential biases
  3. Consider Privacy: Be aware of personal information and anonymization needs
  4. Cite Properly: Always provide proper attribution when using datasets
  5. Update Regularly: Check for newer versions or updates to datasets
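
The data quality check in step 2 can be partly automated before any modeling starts. The following is a minimal sketch using only the Python standard library, assuming the dataset is available as CSV. The sample records, the column names (`age`, `sex`, `credit_amount`, `approved`), and the helper names `missing_rates` and `group_shares` are hypothetical and do not come from any dataset listed in this guide.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded dataset file.
# Replace io.StringIO(SAMPLE_CSV) with open("your_file.csv", newline="")
# to run the same checks on real data.
SAMPLE_CSV = """age,sex,credit_amount,approved
35,male,1200,1
,female,800,0
52,male,,1
29,female,1500,1
41,male,2300,0
"""


def missing_rates(rows, columns):
    """Return the fraction of empty values for each column (completeness)."""
    total = len(rows)
    if total == 0:
        return {}
    return {
        col: sum(1 for row in rows if not (row[col] or "").strip()) / total
        for col in columns
    }


def group_shares(rows, column):
    """Return each category's share of the rows (a rough balance check)."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
columns = reader.fieldnames

rates = missing_rates(rows, columns)
shares = group_shares(rows, "sex")

for col, rate in rates.items():
    print(f"{col}: {rate:.0%} missing")
for value, share in shares.items():
    print(f"sex={value}: {share:.0%} of rows")
```

High missing rates or a heavily skewed group share do not prove a dataset is unusable, but they are a signal to read its documentation more closely before drawing conclusions from it.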

Technical Considerations:

Additional Resources:


Last updated: November 2024

For questions about dataset usage or suggestions for additions, please submit an issue.