Datasets Reference Guide
This guide lists every dataset used in the CCAI9012 starter kits, along with similar public datasets for extended research and projects.
Datasets Used in Starter Kits
| Dataset | Category | Module | Description | Direct Link | Size | License |
|---|---|---|---|---|---|---|
| Building Profile & Road Network | Computer Vision | Module 1 | Image pairs for GAN training - building profiles and corresponding road networks | GitHub - GANmapper | ~500MB | MIT |
| Yelp Open Dataset | Text Analysis | Module 2 | Business reviews, user data, and check-ins for sentiment analysis | Yelp Dataset | ~10GB | Academic Use |
| Inside Airbnb Dataset | Text Analysis | Module 2 | Airbnb listings and reviews data for accommodation analysis | Inside Airbnb | Varies by city | CC0 1.0 |
| Energy Action Plans | Document Analysis | Module 2 | PDF documents containing energy action plans | CCHRC | ~100MB | Public Domain |
| Google Street View Imagery | Computer Vision | Module 3 & 4 | Street-level imagery for urban analysis and perception scoring | Google Maps API | API-based | Commercial |
| Webcam Data | Computer Vision | Module 4 | Real-time webcam feeds for pedestrian behavior analysis | Skyline Webcams | Streaming | Varies |
| California Housing Prices | Machine Learning | Module 4 | Housing price data for regression analysis | Scikit-learn | ~1MB | BSD |
| German Credit Dataset | Bias Detection | Module 5 | Credit approval data for fairness analysis | AIF360 | ~100KB | Public |
| COMPAS Dataset | Bias Detection | Module 5 | Criminal risk assessment data for bias auditing | Kaggle - COMPAS | ~50KB | Public |
Additional Public Datasets
Urban Planning & Real Estate
| Dataset | Description | Direct Link | Size | License |
|---|---|---|---|---|
| NYC Property Sales | Real estate transactions in New York City | NYC OpenData | ~500MB | Public |
| London Housing Data | UK housing prices and features | Kaggle - London Housing | ~50MB | CC0 1.0 |
| Zillow Home Value Data | US housing market data | Kaggle - Zillow | ~2GB | Public |
| OpenStreetMap Building Data | Global building footprints and attributes | OSM Buildings | Varies | ODbL |
| Microsoft Building Footprints | Global building footprints from satellite imagery | GitHub - MS Buildings | ~100GB | ODbL |
Review & Text Data
| Dataset | Description | Direct Link | Size | License |
|---|---|---|---|---|
| Amazon Product Reviews | Multi-domain product reviews for sentiment analysis | Kaggle - Amazon Reviews | ~3GB | Academic |
| TripAdvisor Hotel Reviews | Hotel reviews with ratings and locations | Kaggle - TripAdvisor | ~100MB | CC0 1.0 |
| Twitter Sentiment Analysis | Tweet data with sentiment labels | Kaggle - Twitter Sentiment | ~200MB | Academic |
Computer Vision & Street Imagery
| Dataset | Description | Direct Link | Size | License |
|---|---|---|---|---|
| Mapillary Street View | Global street-level imagery with semantic segmentation | Mapillary | ~50GB | Commercial |
| Cityscapes Dataset | Urban street scenes with semantic annotations | Cityscapes | ~50GB | Academic |
| ADE20K Dataset | Scene parsing dataset with indoor/outdoor scenes | MIT ADE20K | ~3GB | BSD |
| COCO Dataset | Object detection and segmentation | COCO | ~25GB | CC BY 4.0 |
Pedestrian & Traffic Data
| Dataset | Description | Direct Link | Size | License |
|---|---|---|---|---|
| MOT Challenge | Multi-object tracking in pedestrian scenarios | MOT Challenge | ~10GB | Academic |
| US Highway Traffic Data | Federal Highway Administration traffic monitoring and statistics | FHWA Policy Information | ~1GB | Public |
| NYC Taxi Trip Data | Taxi trip records for mobility analysis | NYC TLC | ~10GB/month | Public |
| Bike Share Data | Global bike sharing system data | Kaggle - Bike Share | ~100MB | Public |
Fairness & Bias Detection
| Dataset | Description | Direct Link | Size | License |
|---|---|---|---|---|
| Adult Income Dataset | Census data for income prediction bias analysis | UCI Adult | ~5MB | Public |
| Bank Marketing Dataset | Marketing campaign data for fairness analysis | UCI Bank | ~5MB | Public |
| ProPublica COMPAS | Criminal risk assessment analysis | ProPublica | ~1MB | Public |
| Fair Lending Dataset | Mortgage lending data for discrimination analysis | FFIEC HMDA | ~1GB | Public |
| Chicago Police Data | Police incident reports for bias analysis | Chicago Data Portal | ~2GB | Public |
Government & Policy Documents
| Dataset | Description | Direct Link | Size | License |
|---|---|---|---|---|
| EU Law Documents | European Union legal texts and policies | EUR-Lex | Varies | Public |
| UN Documents | United Nations reports and resolutions | UN Documentation | Varies | Public |
Environmental & Climate Data
| Dataset | Description | Direct Link | Size | License |
|---|---|---|---|---|
| NASA Climate Data | Global climate and weather observations | NASA Earthdata | ~100TB | Public |
| EPA Air Quality Data | US air pollution measurements | EPA AQS | ~10GB | Public |
| Copernicus Climate Data Store | European climate reanalysis data, forecasts, and observations | CDS Climate Copernicus | Varies | Free Registration |
| OpenWeatherMap | Weather data and forecasts | OpenWeather API | API-based | Commercial |
| Sentinel Satellite Data | European satellite imagery for environmental monitoring | Copernicus Hub | ~100TB | Free |
Dataset Usage Guidelines
Before Using Any Dataset:
- Check License Requirements: Ensure you comply with each dataset’s license terms
- Verify Data Quality: Examine data completeness and potential biases
- Consider Privacy: Be aware of personal information and anonymization needs
- Cite Properly: Always provide proper attribution when using datasets
- Update Regularly: Check for newer versions or updates to datasets
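
The "Verify Data Quality" step can start with a simple completeness check. The following is a minimal sketch, not a prescribed workflow: it uses only the Python standard library, the function name `missing_fraction_by_column` and the list of missing-value tokens are assumptions made for illustration, and the sample rows are invented rather than taken from any dataset listed above.

```python
import csv
import io


def missing_fraction_by_column(csv_file, missing_tokens=("", "NA", "NaN", "null")):
    """Return {column: fraction of rows whose value counts as missing}.

    `missing_tokens` is an assumed convention; adjust it to the dataset at hand.
    """
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    missing_counts = {name: 0 for name in columns}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in columns:
            value = row.get(name)
            # Rows shorter than the header yield None for the absent fields.
            if value is None or value.strip() in missing_tokens:
                missing_counts[name] += 1
    if row_count == 0:
        return {name: 0.0 for name in columns}
    return {name: count / row_count for name, count in missing_counts.items()}


# Invented sample data; for a real file, pass open("your_file.csv", newline="").
sample = io.StringIO(
    "price,bedrooms,neighbourhood\n"
    "120,2,Central\n"
    ",1,Wan Chai\n"
    "95,NA,\n"
    "200,3,Central\n"
)
report = missing_fraction_by_column(sample)
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} missing")
```

Columns with a high missing fraction are worth inspecting before modeling, since dropping or imputing them can itself introduce bias.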
Technical Considerations:
- Storage: Large datasets may require cloud storage solutions
- Processing: Consider computational requirements for big datasets
- APIs: Some datasets require API keys and have rate limits
- Preprocessing: Plan for data cleaning and transformation steps
- Ethics: Consider the ethical implications of your analysis
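
For the API-based sources, one way to stay under a rate limit is to space requests on the client side. The sketch below is illustrative only: `RateLimiter`, `fetch_all`, and `fake_fetch` are hypothetical names, the limit of 20 calls per second is an arbitrary assumption, and no real API is contacted. Each provider documents its own limits, so check those before choosing a value.

```python
import time


class RateLimiter:
    """Space successive calls at least 1 / max_calls_per_second seconds apart."""

    def __init__(self, max_calls_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_calls_per_second
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def fetch_all(item_ids, fetch_one, limiter):
    """Call fetch_one for each id, pausing between calls as the limiter requires."""
    results = []
    for item_id in item_ids:
        limiter.wait()
        results.append(fetch_one(item_id))
    return results


def fake_fetch(item_id):
    # Placeholder standing in for a real API request (hypothetical).
    return {"id": item_id}


limiter = RateLimiter(max_calls_per_second=20)  # assumed limit, not a documented one
records = fetch_all(["a", "b", "c"], fake_fetch, limiter)
print(records)
```

Client-side spacing does not replace handling the provider's own throttling responses; it only reduces how often they occur.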
Additional Resources:
Last updated: November 2024
For questions about dataset usage or suggestions for additions, please submit an issue.