Part III: Datasets & Risks

Building Blocks & Guardrails of AI

A 2-Class Module on Critical Principles for Training and Using AI Models

What This Module Is

Prior introductions to AI focus on outputs & process.
This module focuses on training inputs & concerns.

We look inside AI training & implementation to understand:

Why data matters
Why we need so much data
Why ethics & risk mitigation measures are so important

AI systems follow the principle of garbage in, garbage out: a model can only be as good as the data it is trained on.

Once you see the training process & potential risks, you can begin to reason about:

How we evaluate and control data quality
How we balance model and dataset size
How we should use AI critically

The Arc

Class	Title	What Happens	What You Learn
1	Datasets of AI	We dive into the training process of AI models	Scaling laws of AI, training data processing, and evaluating data quality
2	Risks of AI	We examine the multifaceted impacts and risks of AI	Ethics and risks of AI for business, the economy and society, the environment, and, most importantly, human beings

What We’re Asking You to Do

In this module, you are asked to think critically and systematically.

Besides thinking from user → output and from system → mechanism → behavior,
also think about input → model → output to fill in the missing piece of the puzzle.

Your task is to:

See the building blocks of AI.
Understand that how an AI system functions depends largely on the quantity and quality of its training data.
Locate the mechanism.
Explore how the quality and quantity of training data affect model predictive power.
Reflect on the risks.
Critique how AI models may create concerns and threats for society, the environment, and humanity.
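
To make the link between data quantity and predictive power concrete, here is a minimal Python sketch of a power-law relationship between dataset size and loss, the general form reported in the Kaplan et al. (2020) reading. The function name `loss_from_data` and the constants `d_c` and `alpha_d` are illustrative assumptions, not values taken from the paper; only the shape of the curve matters here.

```python
def loss_from_data(num_tokens: float, d_c: float = 5.0e13, alpha_d: float = 0.1) -> float:
    """Toy power law: loss = (d_c / num_tokens) ** alpha_d.

    d_c and alpha_d are placeholder constants chosen for illustration only.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return (d_c / num_tokens) ** alpha_d


if __name__ == "__main__":
    # Each tenfold increase in data lowers the loss by the same constant factor.
    for tokens in (1e9, 1e10, 1e11, 1e12):
        print(f"{tokens:.0e} tokens -> loss {loss_from_data(tokens):.3f}")
```

Under this assumed form, every tenfold increase in data multiplies the loss by the same factor, so each further improvement costs far more data than the last. The sketch says nothing about data quality, which the module treats as a separate concern.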

Learning Outcomes

By the end of this module, you will be able to:

Understand why training data matters to all AI systems
Identify criteria for training data quality and quantity
Communicate the AI model training process through clear diagrams and structured explanations
Analyze how training inputs change AI model behavior
Reflect on how datasets, ethics, and risks interact

How You’ll Be Assessed

This module is assessed through A1.3 — Case Study 3: The Data Matters, an individual case study focused on explaining why the dataset matters.

You will analyze a pre-trained GenAI/LLM system and show:

Where its limits lie for domain-specific tasks (e.g., generating a realistic nighttime street view image)
Which failures the market-ready model struggles with
How a specific training dataset or prompt engineering might improve the model's performance

Full requirements and deliverables are provided in the A1.3 assignment-specific outline link.
This should be read alongside the general case study rubric link.

Resources

Reading Material: see link.

Datasets of AI:

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models (arXiv:2001.08361). arXiv. https://doi.org/10.48550/arXiv.2001.08361
Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., & Gal, Y. (2024). AI models collapse when trained on recursively generated data. Nature, 631(8022), 755–759. https://doi.org/10.1038/s41586-024-07566-y

Risks of AI:

Kosmyna, N., Hauptmann, E., Yuan, Y. T., Situ, J., Liao, X.-H., Beresnitzky, A. V., Braunstein, I., & Maes, P. (2025). Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task (arXiv:2506.08872). arXiv. https://doi.org/10.48550/arXiv.2506.08872

The One-Liner

“Data forms the building blocks of all AI models: garbage data in, garbage model out.”