把清大課程總表丟進資料庫

  一直不知道學校到底有沒有把比較好處理的課程「總」表放在網路上,但是又不能去問(去年電商專題的EZCheat讓部分教授們對課務組和計中施壓,不會把那樣的資料給我們),最後只好自己從校務資訊系統的查詢網頁著手了,先跳過表單檢查步驟強制用空字串當關鍵字以一次查到所有課程,然後用PHP去做parsing和塞進資料庫。
  雖然講起來沒什麼,但這東西花了我整個下午,(說起來還頗對不起英翔學長的....答應的BuzzShare的進度已經落後很多了)中途還冒出一些目前還不知道怎麼解決的問題,比方說因為我幾乎不會用正規表示法,只好改用投機的方式去把 <input> 改成 <input/>。(這件事告訴我們修過【正規語言】跟擅長使用正規表示法是兩回事)
  總之,程式碼如下,已經改成只要把網頁給他就可以的function了:function nthuCourseParse( $webpage ) {
$data = file_get_contents( $webpage );
$data = mb_convert_encoding( $data, 'UTF-8', 'BIG-5' );

// Preserve only <table />
$tableStart = strpos( $data, '<table' );
$tableEnd = strpos( $data, '</table>' );
$data = substr( $data, $tableStart, $tableEnd - $tableStart + 8 );

/*
Make it to match XML format
The wierd ');"></td>"' is used for handling the button.
Note that   is not allowed in XML until other DTD declaration.
*/
$replace = array(
'<BR>' => '<BR/>',
'<br>' => '<br/>',
' ' => ' ',
');"></td>' => ');" /></td>',
'&' => '&'
);
foreach( $replace as $from => $to )
$data = str_replace( $from, $to, $data );

$parser = xml_parser_create();
xml_parser_set_option( $parser, XML_OPTION_CASE_FOLDING, 0 );
xml_parser_set_option( $parser, XML_OPTION_SKIP_WHITE, 1 );
if( !xml_parse_into_struct( $parser, $data, $values, $tags ) )
printf(
'Line %d, Col %d: %s',
xml_get_current_line_number( $parser ),
xml_get_current_column_number( $parser ),
xml_error_string( xml_get_error_code( $parser ) )
);
xml_parser_free( $parser );

/*$colName = array( '科號', '科目', '學分', '時間', '教室/容量',
'教師', '人限', '備註', '人數', '對象', '擋修', '大綱' );*/
/*
Parse into another structure in which each element is a course.
Note that empty string is also parsed.
The initial value 4 of the for-loop is based on the structure of <table />,
as the increasing-step 4 is to skip the other row of <table />.
*/
$tr = $tags['tr'];
for( $c = 0, $i = 4; $i+1 < count( $tr ); $i+=4, $c++ ) {
$tdi = 0; // index of td
for( $j = $tr[$i]+1; $j < $tr[$i+1]; $j++ ) { // Trace each <td /> for data
$re = $values[ $j ];
$arr[$c][$tdi][] .= trim( $re['value'] );
if( $re['tag'] == 'td' )
if( in_array( $re['type'], array( 'close', 'complete' ) ) )
$tdi++;
}
}
return $arr;
}

這樣就會回傳一個比較容易懂的結構了,至少比較方便直接呼叫到想要的東西,可以有如是的用法:foreach( $arr as $r => $course ) {
$db[$r]['sn'] = $course[0][1];
$db[$r]['chinese'] = $course[1][0];
$db[$r]['english'] = $course[1][2];
$db[$r]['credit'] = $course[2][1];
$db[$r]['period'] = $course[3][1];
list( $db[$r]['location'] ) = explode( '/', $course[4][1] );
$db[$r]['teacher']['chinese'] = $course[5][1];
$db[$r]['teacher']['english'] = $course[5][3];
}

其實也可以用數字當index直接呼叫所需的資料啦,不過先做過這件事的話就比較可以知道自己再寫什麼:foreach( $db as $i => $course ) {
$teacher = $db->getOne( sprintf( "SELECT sn FROM teacher WHERE name = '%s'", $course['teacher']['chinese'] ) );
$db->exec( "INSERT INTO course ( sn, name, teacher ) VALUES ( '%s', '%s', %d )", $course['sn'], $course['chinese'], $teacher ) );
}

沒有留言: