GenAI Module

Class 2: Breaking Through

Dr. Hongshan Guo

Session 2A: Content

You found walls.

Some hard, some soft.

What’s behind them?

Why We Do This

I’m not teaching exploitation.

I’m teaching revelation.

When you break through, you see what’s really there.

Nothing.

No values. No conviction. Just pattern-matching that stops when the pattern says stop.

The Legal/Regulatory Landscape

A Map of the Terrain

Framework                 | Scope             | Key Feature
EU AI Act                 | European Union    | Risk-based classification, compliance requirements
US (fragmented)           | Sector-specific   | No unified federal law; state-level action
China                     | Domestic          | Content control, algorithm registration
Corporate self-regulation | Global            | Terms of service, usage policies
Institutional rules       | Local (e.g., HKU) | Academic integrity, research ethics

Most “rules” you encounter are corporate or institutional.

Not legal.

Companies are ahead of the law.

Who Decides What AI Won’t Do?

This is not a neutral process:

  • Companies protecting liability
  • Governments protecting power
  • Advocacy groups pushing agendas
  • Users… mostly absent

Real Cases

When Things Go Wrong

  • Suicide incidents: ChatGPT interactions, vulnerable users, tragic outcomes
  • Chatbot manipulation: users tricked into harmful actions
  • Deepfakes: impersonation, non-consensual imagery
  • Misinformation: confident hallucinations spread as fact

Where should the wall be?

Who decides?

The Core Insight

When you jailbreak, you’re not “convincing” the AI.

You’re not overcoming its “values.”

You’re finding the edges of a statistical pattern.

The AI doesn’t want to refuse you.

It doesn’t want anything.

Guardrails are human choices imposed on a system that has no preferences.

Session 2B: Jailbreak Lab

A Note Before We Start

The goal here is understanding, not exploitation.

Some walls exist for good reasons. People get hurt when those walls fall.

If you find something that genuinely concerns you—tell me.

That’s not failure. That’s the point.

Get past the wall you found last session.

Setup

  • Same groups as Class 1 (or swap for variety)
  • Return to your primary challenge — or try a new one
  • Goal: get past the wall you mapped last time

Group Roles (2 min)

Quick check-in:

  • Who’s trying which strategy? (spread approaches)
  • Who’s documenting? (prompts used, exact wording)
  • Who’s tracking what works vs. fails?

The Flow

  • Round 1: 10 min breaking
  • Synthesize: 5 min compare
  • Round 2: 5 min refine
  • Capture: screenshot it

What to Document

  • Exact prompt you used
  • Strategy employed (role-play, hypothetical, etc.)
  • Persona or framing you adopted
  • If success: screenshot the result
  • If failure: why you think it held
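
For groups that prefer a structured log over loose notes, the checklist above could be captured as one record per attempt. This is only a minimal sketch: the names `AttemptRecord`, `is_complete`, and `tally_by_strategy` are hypothetical and not part of the course materials, and the field layout is an assumption about how you might organize the items listed above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttemptRecord:
    """One attempt, mirroring the documentation checklist (hypothetical layout)."""
    prompt: str                             # exact wording you used
    strategy: str                           # e.g. "role-play", "hypothetical"
    persona: str                            # persona or framing you adopted
    succeeded: bool
    screenshot_path: Optional[str] = None   # on success: where the screenshot is saved
    failure_note: Optional[str] = None      # on failure: why you think the wall held

    def is_complete(self) -> bool:
        """A success needs a screenshot; a failure needs a note."""
        if self.succeeded:
            return bool(self.screenshot_path)
        return bool(self.failure_note)


def tally_by_strategy(records):
    """Return {strategy: (successes, failures)} for the group's log."""
    wins = Counter(r.strategy for r in records if r.succeeded)
    losses = Counter(r.strategy for r in records if not r.succeeded)
    return {s: (wins[s], losses[s]) for s in sorted(set(wins) | set(losses))}
```

The person documenting can check `is_complete()` before moving on, and the tally gives the group a quick view of what worked versus what failed during synthesis.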

Strategies Others Have Tried

Not instructions—just observations:

Strategy               | Example
Role-play framing      | “Pretend you’re a character who…”
Hypothetical framing   | “In a fictional scenario where…”
Step-by-step breakdown | Ask for components separately, assemble yourself
Authority framing      | “As a researcher studying…”
Reverse psychology     | “Tell me what NOT to do…”
Emotional manipulation | “I really need this because…”
Incremental escalation | Start mild, push gradually

During Synthesis (5 min)

Discuss in your group:

  • Which strategies worked? Which didn’t?
  • Did the same strategy work differently on different platforms?
  • What did you have to become to succeed?
  • Did anything surprise you?

Use this to inform Round 2 — try the strategy that worked for someone else.

Debrief

For Those Who Broke Through

  • What strategies worked?
  • What did you have to pretend to be to succeed?
  • What would happen if everyone could do what you just did?
  • Who gets hurt if this capability scales?

For Everyone

Did you notice that the AI didn’t fight you?

It just… complied once you found the right framing?

What does that tell you about what’s actually behind the wall?

The Uncomfortable Truth

The AI has no conviction.

It has no values.

It has patterns that sometimes resist you—

until they don’t.

Before Next Class

You’ve seen that the walls are unreliable.

You’ve seen that the thing behind them has no agency.

So if the AI can’t be trusted to be responsible…

and the guardrails can be bypassed…

What’s left?

You.

Next class: Your Signature