🔍 Why This Research Is Needed


Advances in LLMs have made automatic code generation widespread, but the existing benchmarks used as de facto standards (HumanEval+, MBPP+) still stop at pass@k-centered functional correctness. Because this approach only executes and measures whether the code produces the correct answer on well-formed input (that is, normal input that satisfies every type and range the function expects), it reveals nothing about whether the code actually enforces the preconditions its input must satisfy.

This is where the problem begins. Real task descriptions often leave the conditions an input must satisfy only implicit, as in "retrieve an element from a list by its index." A rule that states such a condition explicitly (e.g., "the index must be an integer that is at least 0 and less than the length of the list") is called a contract, and an input that breaks that precondition, such as a negative index or a string index, is called an ill-formed input. Existing evaluation suites remove ill-formed inputs in advance and execute only well-formed inputs, so pass@k scores come out high even when the generated code never checks its input conditions at all. In fact, representative models reach 75–82% pass@1 while showing 0% contract satisfaction, an illusion of correctness. In other words, "does it produce the correct answer?" and "does it reject bad input with the intended assertion?" are distinct axes of code quality, yet the latter is structurally impossible to measure with existing benchmarks. Although HumanEval+ and MBPP+ already carry a reference contract field, they use it only to filter out ill-formed inputs, so the gap remains.
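The list-index contract described above can be made concrete with a small sketch. The function names here are hypothetical and are not taken from HumanEval+, MBPP+, or ContractEval; the point is only that two implementations can agree on every well-formed input while differing completely on ill-formed ones.

```python
def get_item_unchecked(items, index):
    # No contract: the behavior on ill-formed input is whatever Python does.
    return items[index]


def get_item_with_contract(items, index):
    # Contract from the example in the text:
    # the index must be an integer with 0 <= index < len(items).
    assert isinstance(index, int), "index must be an integer"
    assert 0 <= index < len(items), "index must be in [0, len(items))"
    return items[index]


data = ["a", "b", "c"]

# Well-formed input: both versions return the same answer,
# so a pass@k-style test cannot tell them apart.
print(get_item_unchecked(data, 1), get_item_with_contract(data, 1))

# Ill-formed input (negative index): the unchecked version silently
# returns the last element instead of rejecting the input.
print(get_item_unchecked(data, -1))
```

Under this contract a negative index is ill-formed, yet the unchecked version neither crashes nor rejects it, and a string index would surface as a `TypeError` rather than as a contract assertion. Only the second version rejects both cases with an `AssertionError`.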

To fill this evaluation gap, we present ContractEval. ContractEval extends HumanEval+/MBPP+ to evaluate, in a standardized way, whether generated code explicitly rejects ill-formed input with the intended contract assertion, rather than whether it merely crashes.
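The difference between crashing and rejecting with the intended assertion can be sketched as a small outcome classifier. This is an illustrative assumption, not ContractEval's actual harness: `classify_outcome`, its labels, and the idea of matching on the assertion message are all hypothetical.

```python
def classify_outcome(func, args, expected_message=None):
    """Run func on one ill-formed input and label what happened (hypothetical labels)."""
    try:
        func(*args)
    except AssertionError as exc:
        # An assertion fired; check whether it is the one we intended.
        if expected_message is None or expected_message in str(exc):
            return "rejected_by_contract"
        return "rejected_by_other_assertion"
    except Exception:
        # Any other exception is an incidental crash, not a contract check.
        return "crashed"
    # No exception at all: the ill-formed input was silently accepted.
    return "accepted"


def no_check(items, index):
    return items[index]


def with_contract(items, index):
    assert isinstance(index, int), "index must be an integer"
    assert 0 <= index < len(items), "index out of range"
    return items[index]


items = ["a", "b"]
print(classify_outcome(no_check, (items, "0")))   # TypeError from list indexing
print(classify_outcome(no_check, (items, -1)))    # negative index is accepted
print(classify_outcome(with_contract, (items, -1), "index out of range"))
```

Separating `crashed` from `rejected_by_contract` is the design point: a `TypeError` raised deep inside the function does stop execution, but it does not show that the code checked its precondition.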

✨ What Is ContractEval?



ContractEval is a contract-aware benchmark that takes LLM code evaluation one step beyond functional correctness and measures contract adherence as well. Whereas existing pass@k evaluation only checks whether the answer is correct on well-formed input, ContractEval also measures whether generated code explicitly rejects ill-formed input with the intended assertion.

ContractEval is built around three design axes.

1. Contract-Aware Query Reconstruction

(Surfacing the input conditions hidden in the task description)

2. Contract-Violating Test Construction

(Building tests that violate exactly the intended contract and nothing else)

3. Combining Contracts and Code

(Adding contracts to the reference solution code)