🔍 Why This Research Is Needed


LLM์˜ ๋ฐœ์ „์œผ๋กœ ์ž๋™ ์ฝ”๋“œ ์ƒ์„ฑ๊ณผ ํ…Œ์ŠคํŠธ ์ผ€์ด์Šค ์ƒ์„ฑ์ด ํ™œ๋ฐœํ•ด์กŒ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ํ‘œ์ค€์ฒ˜๋Ÿผ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ๊ธฐ์กด ๋ฒค์น˜๋งˆํฌ(HumanEval+, MBPP+)๋Š” ์—ฌ์ „ํžˆ pass@k ์ค‘์‹ฌ์˜ functional correctness์— ์ดˆ์ ์„ ๋‘๊ณ  ์žˆ์–ด, ์ƒ์„ฑ ์ฝ”๋“œ๊ฐ€ well-formed input์—์„œ ์ •๋‹ต์„ ๋‚ด๋Š”์ง€์— ๋Œ€ํ•œ ์‹ ํ˜ธ๋งŒ ๊ฐ•ํ•˜๊ฒŒ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด ์‹ค์ œ ์†Œํ”„ํŠธ์›จ์–ด์—์„œ โ€œ์ •ํ™•ํ•œ ํ‰๊ฐ€โ€๋Š” ์ •๋‹ต ์ถœ๋ ฅ๋ฟ ์•„๋‹ˆ๋ผ, ์ž…๋ ฅ ์กฐ๊ฑด(input conditions)โ€”contracts๋กœ ์ฃผ์–ด์ง€๋Š” input validity constraints๊ณผ ์˜ˆ์™ธ ์ฒ˜๋ฆฌ ๊ทœ์น™โ€”์„ ์œ„๋ฐ˜ํ•˜๋Š” ill-formed input์„ ์˜๋„๋Œ€๋กœ rejectํ•˜๋Š”์ง€๊นŒ์ง€ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ์ด ๋Šฅ๋ ฅ์€ ๊ธฐ์กด pass@k ํ‰๊ฐ€์—์„œ๋Š” ๊ฑฐ์˜ ๊ด€์ฐฐ๋˜์ง€ ์•Š์•„, LLM์˜ contract-awareness๊ฐ€ ์‚ฌ์‹ค์ƒ ๊ฐ„๊ณผ๋ฉ๋‹ˆ๋‹ค.

To close this evaluation gap in a measurable way, we present ContractEval. ContractEval adds contract-violating tests (CVTs) to the existing benchmarks (HumanEval+, MBPP+) and evaluates, in a standardized way, whether generated code explicitly rejects ill-formed input through the intended assertion rather than simply failing.
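The difference between "simply failing" and "explicitly rejecting through the intended assertion" can be illustrated with a minimal Python sketch. Everything here is hypothetical and is not taken from ContractEval itself: the task (`mean_of_positive`), the helper `classify_cvt`, and the three outcome labels are assumptions made for illustration. The sketch also checks only that some `AssertionError` was raised, not which contract triggered it.

```python
from typing import Any, Callable, Tuple


def mean_of_positive(values: list) -> float:
    """Hypothetical task: average a non-empty list of positive numbers."""
    # Contracts: input validity constraints written as assertions.
    assert isinstance(values, list), "invalid input: values must be a list"
    assert len(values) > 0, "invalid input: values must be non-empty"
    assert all(isinstance(v, (int, float)) and v > 0 for v in values), (
        "invalid input: values must be positive numbers"
    )
    return sum(values) / len(values)


def mean_without_contracts(values: list) -> float:
    """Same task, correct on well-formed input, but with no contract checks."""
    return sum(values) / len(values)


def classify_cvt(func: Callable[..., Any], args: Tuple[Any, ...]) -> str:
    """Run one contract-violating test (CVT) and label the outcome.

    The labels are illustrative, not ContractEval's own terminology.
    """
    try:
        func(*args)
    except AssertionError:
        return "rejected"  # explicit rejection through an assertion
    except Exception:
        return "crashed"   # failed, but not through the contract
    return "accepted"      # ill-formed input was silently processed


if __name__ == "__main__":
    cvts = [([],), ([-1, 3],)]  # each input violates one contract
    for func in (mean_of_positive, mean_without_contracts):
        for args in cvts:
            print(func.__name__, args, classify_cvt(func, args))
```

Both functions return the same result on well-formed input such as `[1, 2, 3]`, so a pass@k-style check on valid inputs alone cannot tell them apart. Only the contract-violating inputs separate them. Note that Python skips `assert` statements when run with `-O`, so this sketch assumes assertions are enabled.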

✨ What Is ContractEval?


ContractEval is a contract-aware benchmark that extends LLM code evaluation one step beyond functional correctness, so that contract adherence is measured and strengthened alongside it.

ContractEval์€ 4๊ฐ€์ง€ ๊ตฌ์„ฑ์š”์†Œ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์Šต๋‹ˆ๋‹ค.

Whereas existing pass@k evaluation focuses mainly on whether the answer is correct on well-formed input, ContractEval measures, for ill-formed input, not merely whether the model's code crashes but whether it performs the intended rejection.

To make this evaluation possible, ContractEval is built around the following three core design axes.

1. Contract-Violating Test Case Generation (CVT Generation)