Part III: Datasets & Risks
Building Blocks & Guardrails of AI
A 2-Class Module on Critical Principles for Training and Using AI Models
What This Module Is
Prior introductions to AI focus on outputs & process.
This module focuses on training inputs & concerns.
We look inside AI training & implementation to understand:
- Why data matters
- Why we need so much data
- Why ethics & risk mitigation measures are so important
AI systems are garbage in, garbage out: poor training data produces a poor model.
Once you see the training process & potential risks, you can begin to reason about:
- How we evaluate and control data quality
- How we balance model size and dataset size
- How we should use AI critically
The Arc
| Class | Title | What Happens | What You Learn |
|---|---|---|---|
| 1 | Datasets of AI | We dive into the training process of AI models | Scaling laws of AI, training data processing, and evaluating data quality |
| 2 | Risks of AI | We examine the multifaceted impacts and risks of AI | Ethics and risks of AI in business, socioeconomics, the environment, and, most importantly, for human beings |
What We’re Asking You to Do
In this module, you are asked to think critically and systematically.
Beyond the user → output view and the system → mechanism → behaviour view,
also think about input → model → output to fill in the missing piece of the puzzle.
Your task is to:
- See the building blocks of AI.
  Understand that how an AI system functions depends largely on the quantity and quality of its training data.
- Locate the mechanism.
  Explore how the quality and quantity of training data affect a model's predictive power.
- Reflect on the risks.
  Critique how AI models may create concerns and threats to society, the environment, and humanity.
Learning Outcomes
By the end of this module, you will be able to:
- Understand why training data matters to all AI systems
- Identify criteria for training data quality and quantity
- Communicate the AI model training process through clear diagrams and structured explanations
- Analyze how training inputs change AI model behavior
- Reflect on how datasets, ethics, and risks interact
How You’ll Be Assessed
This module is assessed through A1.3 — Case Study 3: The Data Matters, an individual case study focused on explaining why the dataset matters.
You will analyze a pre-trained GenAI/LLM system and show:
- Where its boundaries lie for domain-specific tasks (e.g., generating a realistic nighttime street view image)
- Which failures the market-ready model struggles with
- How a specific training dataset or prompt engineering might improve the model's performance
Full requirements and deliverables are provided in the A1.3 assignment-specific outline link.
This should be read alongside the general case study rubric link.
Resources
Reading Material: see link.
Datasets of AI:
- Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models (arXiv:2001.08361). arXiv. https://doi.org/10.48550/arXiv.2001.08361
- Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., & Gal, Y. (2024). AI models collapse when trained on recursively generated data. Nature, 631(8022), 755–759. https://doi.org/10.1038/s41586-024-07566-y
Risks of AI:
- Kosmyna, N., Hauptmann, E., Yuan, Y. T., Situ, J., Liao, X.-H., Beresnitzky, A. V., Braunstein, I., & Maes, P. (2025). Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task (arXiv:2506.08872). arXiv. https://doi.org/10.48550/arXiv.2506.08872
The One-Liner
“The anatomy of data forms the building blocks of all AI models: garbage data in, garbage model out.”