Chapter 4 Dynamic Programming
§1 Introduction
1.1 Development and scope of dynamic programming
Dynamic programming is a branch of operations research and a mathematical method for optimizing decision processes. In the early 1950s, R. E. Bellman and others, while studying the optimization of multistep decision processes, proposed the well-known principle of optimality, which converts a multistage process into a series of single-stage problems that are solved one at a time. This created a new method for optimizing such processes: dynamic programming. Bellman's classic book Dynamic Programming, published in 1957, was the first work in the field.
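Since its introduction, dynamic programming has been widely applied in economic management, production scheduling, engineering, and optimal control. For problems such as shortest routes, inventory management, resource allocation, equipment replacement, sequencing, and loading, dynamic programming is often more convenient than other methods.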
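Although dynamic programming is mainly used to optimize dynamic processes whose stages are divided by time, some static programming problems that do not involve time (such as linear and nonlinear programming) can also be solved conveniently by dynamic programming, provided a time factor is introduced artificially so that the problem can be viewed as a multistage decision process.
It should be noted that dynamic programming is a method for solving a certain class of problems and a way of looking at a problem, not a particular algorithm (in the sense that linear programming is an algorithm). Unlike linear programming, it therefore has no standard mathematical formulation and no well-defined set of rules; each problem must be analyzed and handled on its own terms. When studying it, in addition to understanding the basic concepts and methods correctly, one should build models with imagination and solve them with creative technique.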
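Example 1 Shortest-route problem
Figure 1 shows a route network; the number on each link is the distance (or cost) between its two endpoints. Find a route from A to G with the shortest distance (or lowest cost).
Figure 1 Shortest-route problem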
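Example 2 Production planning problem
A factory makes a product at a cost of 1 (thousand yuan) per unit (thousand pieces), with a fixed cost of 3 (thousand yuan) each time production is started. The factory's maximum production capacity is 6 (thousand pieces) per quarter. According to a survey, market demand for the product in the first, second, third, and fourth quarters is 2, 3, 2, and 4 (thousand pieces), respectively. If the factory produced the whole year's demand in the first and second quarters, it would lower cost by paying the fixed cost fewer times, but products that go to market only in the third and fourth quarters incur a storage charge of 0.5 (thousand yuan) per thousand pieces per quarter. It is further required that there be no inventory of the product at the beginning or at the end of the year. Draw up a production plan, that is, set the output of each quarter, so that the total annual cost (production cost plus storage cost) is minimized.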
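Before any dynamic programming machinery is introduced, the cost structure of Example 2 can be checked by brute force. The following Python sketch enumerates every production plan and keeps the cheapest feasible one. It assumes that output is planned in whole thousands of pieces and that the storage charge applies to the stock left at the end of each quarter; neither convention is stated explicitly in the example, and the names `plan_cost` and `best_plan_by_enumeration` are illustrative only.

```python
from itertools import product

DEMAND = [2, 3, 2, 4]   # quarterly demand (thousand pieces), from Example 2
UNIT_COST = 1.0         # thousand yuan per thousand pieces
SETUP_COST = 3.0        # thousand yuan each time production is started
CAPACITY = 6            # thousand pieces per quarter
HOLDING_COST = 0.5      # thousand yuan per thousand pieces per quarter


def plan_cost(plan):
    """Return the annual cost of a plan, or None if the plan is infeasible."""
    inventory, total = 0, 0.0
    for produced, demand in zip(plan, DEMAND):
        inventory += produced - demand
        if inventory < 0:                    # demand must be met on time
            return None
        if produced > 0:
            total += SETUP_COST + UNIT_COST * produced
        total += HOLDING_COST * inventory    # assumed: charged on end-of-quarter stock
    return total if inventory == 0 else None  # no stock allowed at year end


def best_plan_by_enumeration():
    """Try every integer plan within capacity and return (cost, plan) of the cheapest."""
    feasible = []
    for plan in product(range(CAPACITY + 1), repeat=len(DEMAND)):
        cost = plan_cost(plan)
        if cost is not None:
            feasible.append((cost, plan))
    return min(feasible)


if __name__ == "__main__":
    print(best_plan_by_enumeration())
```

Under these assumptions the search returns a cost of 20.5 (thousand yuan) for the plan (5, 0, 6, 0). Enumeration is workable here only because there are four quarters and seven output levels per quarter; the number of plans grows exponentially with the number of stages.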
1.2 Classification of decision processes
Depending on whether the time variable of the process is discrete or continuous, decision processes are divided into discrete-time decision processes and continuous-time decision processes. Depending on whether the evolution of the process is deterministic or random, they are divided into deterministic decision processes and stochastic decision processes. The most widely applied are deterministic multistage decision processes.
§2 Basic concepts, basic equations, and computational methods
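2.1 Basic concepts and basic equations of dynamic programming
The dynamic programming model of a multistage decision optimization problem usually contains the following elements.
2.1.1 Stage
A stage (step) is a natural division of the whole process. Stages are usually divided according to temporal or spatial order, so that the optimization problem can be solved stage by stage. The stage variable is generally denoted $k = 1, 2, \ldots, n$. In Example 1, departing from $A$ is $k = 1$, departing from $B_i$ ($i = 1, 2$) is $k = 2$, and so on until departing from $F_i$ ($i = 1, 2$) is $k = 6$, for $n = 6$ stages in all. In Example 2 the first, second, third, and fourth quarters give $k = 1, 2, 3, 4$, four stages in all.
2.1.2 State
A state describes the natural condition of the process at the beginning of each stage. It should capture the characteristics of the process and have no aftereffect: once the state variable of a stage is given, the evolution of the process after that stage is independent of the states of the earlier stages. The state is usually also required to be directly or indirectly observable.
A variable describing the state is called a state variable, and the range of values it may take is called the set of admissible states. Let $x_k$ denote the state variable of stage $k$; it may be a number or a vector. Let $X_k$ denote the set of admissible states of stage $k$. In Example 1, $x_2$ may be $B_1$ or $B_2$; if $B_i$ is identified with $i$ ($i = 1, 2$), then $x_2 = 1$ or $2$, and $X_2 = \{1, 2\}$.
A decision process with $n$ stages has $n + 1$ state variables; $x_{n+1}$ denotes the result of the evolution of $x_n$. In Example 1, $x_7$ is $G$, or, if $G$ is defined as 1, $x_7 = 1$.
Depending on how the process evolves, state variables may be discrete or continuous. For computational convenience, continuous variables are sometimes discretized; for analytical convenience, discrete variables are sometimes treated as continuous.
A state variable is referred to simply as a state.
2.1.3 Decision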
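Once the state of a stage is determined, various choices can be made, each leading to some state of the next stage. Such a choice is called a decision; in optimal control problems it is also called a control.
A variable describing the decision is called a decision variable, and the range of values it may take is called the set of admissible decisions. Let $u_k(x_k)$ denote the decision variable at stage $k$ in state $x_k$; it is a function of $x_k$. Let $U_k(x_k)$ denote the set of admissible decisions for $x_k$. In Example 1, $u_2(B_1)$ may be $C_1$, $C_2$, or $C_3$, which can be written $u_2(1) = 1, 2, 3$, with $U_2(1) = \{1, 2, 3\}$.
A decision variable is referred to simply as a decision.
2.1.4 Policy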
A sequence of decisions is called a policy. The policy of the whole process starting from the initial state $x_1$ is written $p_{1n}(x_1)$, that is,
$$p_{1n}(x_1) = \{u_1(x_1), u_2(x_2), \ldots, u_n(x_n)\}.$$
The policy of the tail subprocess from the state $x_k$ of stage $k$ to the terminal state is written $p_{kn}(x_k)$, that is,
$$p_{kn}(x_k) = \{u_k(x_k), \ldots, u_n(x_n)\}, \quad k = 1, 2, \ldots, n-1.$$
Similarly, the policy of the subprocess from stage $k$ to stage $j$ is written
$$p_{kj}(x_k) = \{u_k(x_k), \ldots, u_j(x_j)\}.$$
The policies available for selection are confined to a certain range, called the set of admissible policies and denoted $P_{1n}(x_1)$, $P_{kn}(x_k)$, $P_{kj}(x_k)$.
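2.1.5 State transition equation
In a deterministic process, once the state and the decision of a stage are known, the state of the next stage is completely determined. This law of evolution is expressed by the state transition equation, written
$$x_{k+1} = T_k(x_k, u_k), \quad k = 1, 2, \ldots, n. \tag{1}$$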
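To make equation (1) concrete, here is a small Python sketch for Example 2. It assumes, as one natural modelling choice rather than something fixed by the text so far, that the state $x_k$ is the inventory at the start of quarter $k$ and the decision $u_k$ is the output of that quarter, so that $T_k(x_k, u_k) = x_k + u_k - d_k$ with $d_k$ the demand of quarter $k$. The function names are hypothetical.

```python
def transition(state, decision, demand):
    """Assumed T_k for Example 2: next inventory = inventory + output - demand."""
    return state + decision - demand


def rollout(initial_state, decisions, demands):
    """Apply x_{k+1} = T_k(x_k, u_k) for k = 1..n and return [x_1, ..., x_{n+1}]."""
    states = [initial_state]
    for decision, demand in zip(decisions, demands):
        states.append(transition(states[-1], decision, demand))
    return states


if __name__ == "__main__":
    demands = [2, 3, 2, 4]
    # Producing exactly the demand each quarter keeps the inventory at zero.
    print(rollout(0, [2, 3, 2, 4], demands))
    # Producing ahead of demand builds up stock that is used in later quarters.
    print(rollout(0, [5, 0, 6, 0], demands))
```

A sequence of $n$ decisions yields $n + 1$ states, which matches the count of state variables for an $n$-stage process.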
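In Example 1 the state transition equation is $x_{k+1} = u_k(x_k)$.
2.1.6 Objective function and optimal value function
The objective function is a quantitative index that measures how good the process is. It is a numerical function defined on the whole process and on every tail subprocess, denoted $V_{k,n}(x_k, u_k, x_{k+1}, \ldots, x_{n+1})$, $k = 1, 2, \ldots, n$. The objective function should be separable, that is, $V_{k,n}$ can be expressed as a function of $x_k$, $u_k$, and $V_{k+1,n}$, written
$$V_{k,n}(x_k, u_k, x_{k+1}, \ldots, x_{n+1}) = \varphi_k\bigl(x_k, u_k, V_{k+1,n}(x_{k+1}, u_{k+1}, \ldots, x_{n+1})\bigr),$$
and the function $\varphi_k$ must be strictly monotone in the variable $V_{k+1,n}$.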
The stage index of the process at stage $j$ depends on the state $x_j$ and the decision $u_j$, and is denoted $v_j(x_j, u_j)$. The objective function is composed of the $v_j$ ($j = 1, 2, \ldots, n$). Common forms are:
the sum of the stage indices,
$$V_{k,n}(x_k, u_k, x_{k+1}, \ldots, x_{n+1}) = \sum_{j=k}^{n} v_j(x_j, u_j),$$
the product of the stage indices,
$$V_{k,n}(x_k, u_k, x_{k+1}, \ldots, x_{n+1}) = \prod_{j=k}^{n} v_j(x_j, u_j),$$
the maximum (or minimum) of the stage indices,
$$V_{k,n}(x_k, u_k, x_{k+1}, \ldots, x_{n+1}) = \max_{k \le j \le n}\,(\min)\; v_j(x_j, u_j).$$
Under these forms, the objective function of the subprocess from stage $k$ to stage $j$ is $V_{k,j}(x_k, u_k, \ldots, x_{j+1})$.
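All three forms can be built up from the last stage backward, by combining the current stage index with the value already obtained for the remaining subprocess. The Python sketch below illustrates this for a fixed list of stage indices; the numbers are made up for illustration, and `fold_backward` is a hypothetical helper, not notation from the text.

```python
import math
import operator


def fold_backward(stage_values, combine, terminal):
    """Build V_{k,n} from the last stage backward: V_{k,n} = combine(v_k, V_{k+1,n})."""
    value = terminal
    for v in reversed(stage_values):
        value = combine(v, value)
    return value


# Illustrative stage indices v_j(x_j, u_j) along one fixed path (not from the text).
stage_values = [2.0, 3.0, 0.5, 4.0]

total = fold_backward(stage_values, operator.add, 0.0)     # sum form, starts from 0
product = fold_backward(stage_values, operator.mul, 1.0)   # product form, starts from 1
largest = fold_backward(stage_values, max, -math.inf)      # maximum form

if __name__ == "__main__":
    print(total, product, largest)
```

The starting value is the neutral element of the combining operation: 0 for the sum, 1 for the product, and minus infinity for the maximum.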
By the state transition equation, the objective function $V_{k,n}$ can also be expressed as a function of the state $x_k$ and the policy $p_{kn}$, namely $V_{k,n}(x_k, p_{kn})$. For a given $x_k$, the optimal value of $V_{k,n}$ over $p_{kn}$ is called the optimal value function, denoted $f_k(x_k)$:
$$f_k(x_k) = \operatorname*{opt}_{p_{kn} \in P_{kn}(x_k)} V_{k,n}(x_k, p_{kn}),$$
where opt is taken as max or min depending on the problem.
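2.1.7 Optimal policy and optimal trajectory
The policy for which the objective function $V_{k,n}$ attains its optimal value is the optimal policy of the tail subprocess starting from stage $k$, written $p_{kn}^* = \{u_k^*, \ldots, u_n^*\}$. The policy $p_{1n}^*$ is the optimal policy of the whole process, called simply the optimal policy. Starting from the initial state $x_1$ ($= x_1^*$) and evolving according to $p_{1n}^*$ and the state transition equation, the process passes through the state sequence $\{x_1^*, x_2^*, \ldots, x_{n+1}^*\}$, which is called the optimal trajectory.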
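2.1.8 Recursive equation
The following equation is called the recursive equation:
$$\begin{cases} f_{n+1}(x_{n+1}) = 0 \text{ or } 1, \\ f_k(x_k) = \operatorname*{opt}\limits_{u_k \in U_k(x_k)} \bigl\{ v_k(x_k, u_k) \otimes f_{k+1}(x_{k+1}) \bigr\}, \quad k = n, \ldots, 1. \end{cases}$$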
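As a rough illustration of this recursion, the Python sketch below applies it to Example 2 with opt taken as min and the combining operation taken as addition, so that the terminal value is 0. Several modelling choices are assumptions rather than part of the text above: the state is the inventory at the start of a quarter, the decision is that quarter's output in whole thousands of pieces, the storage charge falls on stock carried into the next quarter, and stages are indexed from 0 in the code. All function names are hypothetical.

```python
from functools import lru_cache

DEMAND = (2, 3, 2, 4)   # quarterly demand (thousand pieces), from Example 2
UNIT_COST, SETUP_COST, HOLDING_COST, CAPACITY = 1.0, 3.0, 0.5, 6
N = len(DEMAND)


def stage_cost(k, x, u):
    """v_k(x_k, u_k): production cost plus storage on stock carried to the next stage."""
    production = SETUP_COST + UNIT_COST * u if u > 0 else 0.0
    return production + HOLDING_COST * (x + u - DEMAND[k])


def admissible_decisions(k, x):
    """U_k(x_k): cover this quarter's demand, respect capacity, never exceed remaining demand."""
    remaining = sum(DEMAND[k:]) - x
    low = max(0, DEMAND[k] - x)
    high = min(CAPACITY, remaining)
    return range(low, high + 1)


@lru_cache(maxsize=None)
def f(k, x):
    """Return (f_k(x_k), minimizing u_k); (inf, None) if no admissible decision exists."""
    if k == N:
        return 0.0, None   # terminal condition f_{n+1}(x_{n+1}) = 0 for the sum form
    best = (float("inf"), None)
    for u in admissible_decisions(k, x):
        value = stage_cost(k, x, u) + f(k + 1, x + u - DEMAND[k])[0]
        if value < best[0]:
            best = (value, u)
    return best


def optimal_plan():
    """Follow the minimizing decisions forward from zero initial inventory."""
    plan, x = [], 0
    for k in range(N):
        u = f(k, x)[1]
        plan.append(u)
        x += u - DEMAND[k]
    return f(0, 0)[0], plan


if __name__ == "__main__":
    print(optimal_plan())
```

Under these assumptions the recursion gives a minimum cost of 20.5 (thousand yuan) with outputs 5, 0, 6, 0 in the four quarters. Because the admissible set never lets cumulative output exceed cumulative demand for the year, the year-end inventory is forced to zero and no separate terminal penalty is needed.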