
HTML Form Decoder

Sat Apr 10 2004, 1:26 PM

Function:

Works as a filter for HTML pages. It extracts forms into a line-by-line format that is easy to parse with tools like grep, cut, sed, or awk. If the page was fetched by curl with the "-i" option (which includes the HTTP response headers in the output), cookies are extracted as well. Useful for scripts that have to request a page, fill in the form fields found there, and submit the form back.

Examples

http://www.google.com/
FULLCOOKIE:PREF=ID=05db53036ce1112c:TM=1081596654:LM=1081596654:S=3kyARetFXE5FOnzf; expires=Sun, 17-Jan-2038 19:14:07 GMT; path=/; domain=.google.com
COOKIE:PREF=ID=05db53036ce1112c:TM=1081596654:LM=1081596654:S=3kyARetFXE5FOnzf
FORM:1:f|METHOD:|ACTION:/search
FORM:1:f|INPUT:hidden|NAME:hl|VALUE:en
FORM:1:f|INPUT:hidden|NAME:ie|VALUE:ISO-8859-1
FORM:1:f|INPUT:|NAME:q|VALUE:
FORM:1:f|INPUT:submit|NAME:btnG|VALUE:Google Search
FORM:1:f|INPUT:submit|NAME:btnI|VALUE:I'm Feeling Lucky

FULLCOOKIE is the verbatim content of the Set-Cookie: header. COOKIE is only the cookie itself (the name=value pair), stripped of the expiry, path, and domain attributes.
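
The relation between the two cookie lines can be sketched in Python. This is only an illustration of the idea, not the logic used in handleform.c; the function name cookie_pair is made up for this example.

```python
def cookie_pair(set_cookie_value: str) -> str:
    """Return only the name=value part of a Set-Cookie header value.

    Hypothetical helper: attributes such as expires, path and domain
    follow the first ';' and are dropped here.
    """
    return set_cookie_value.split(";", 1)[0].strip()


# Header value taken from the google.com example above.
full = ("PREF=ID=05db53036ce1112c:TM=1081596654:LM=1081596654:"
        "S=3kyARetFXE5FOnzf; expires=Sun, 17-Jan-2038 19:14:07 GMT; "
        "path=/; domain=.google.com")

print("FULLCOOKIE:" + full)
print("COOKIE:" + cookie_pair(full))
```

The COOKIE value is what a script would typically send back in a Cookie: request header.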

http://slashdot.org/
FORM:1:|METHOD:POST|ACTION:/users.pl
FORM:1:|INPUT:TEXT|NAME:unickname|VALUE:
FORM:1:|INPUT:HIDDEN|NAME:returnto|VALUE:/
FORM:1:|INPUT:HIDDEN|NAME:op|VALUE:userlogin
FORM:1:|INPUT:PASSWORD|NAME:upasswd|VALUE:
FORM:1:|INPUT:CHECKBOX|NAME:login_temp|VALUE:yes
FORM:1:|INPUT:SUBMIT|NAME:userlogin|VALUE:Log in

FORM:2:|METHOD:|ACTION://slashdot.org/pollBooth.pl
FORM:2:|INPUT:hidden|NAME:qid|VALUE:1089
FORM:2:|INPUT:radio|NAME:aid|VALUE:1
FORM:2:|INPUT:radio|NAME:aid|VALUE:2
FORM:2:|INPUT:radio|NAME:aid|VALUE:3
FORM:2:|INPUT:radio|NAME:aid|VALUE:4
FORM:2:|INPUT:radio|NAME:aid|VALUE:5
FORM:2:|INPUT:radio|NAME:aid|VALUE:6
FORM:2:|INPUT:radio|NAME:aid|VALUE:7
FORM:2:|INPUT:radio|NAME:aid|VALUE:8
FORM:2:|INPUT:submit|NAME:|VALUE:Vote

FORM:3:|METHOD:get|ACTION:http://freshmeat.net/search/
FORM:3:|INPUT:hidden|NAME:link|VALUE:freshmeat.net
FORM:3:|INPUT:text|NAME:q|VALUE:

FORM:4:|METHOD:GET|ACTION://slashdot.org/search.pl
FORM:4:|INPUT:TEXT|NAME:query|VALUE:
FORM:4:|INPUT:SUBMIT|NAME:|VALUE:Search

If the page contains more than one form, the forms are separated by an empty line.
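
Since an empty line separates the forms, a consuming script can group the FORM lines per form. The following Python sketch assumes the decoder output has already been captured into a string; split_forms is a hypothetical name, and lines that do not start with FORM: (such as the cookie lines) are simply ignored.

```python
def split_forms(output: str) -> list[list[str]]:
    """Group FORM lines into one list per form, using blank lines as separators."""
    forms, current = [], []
    for line in output.splitlines():
        if line.startswith("FORM:"):
            current.append(line)
        elif not line.strip() and current:
            # A blank line closes the form collected so far.
            forms.append(current)
            current = []
    if current:
        forms.append(current)
    return forms


# Two of the slashdot.org forms from the example above.
sample = """FORM:3:|METHOD:get|ACTION:http://freshmeat.net/search/
FORM:3:|INPUT:hidden|NAME:link|VALUE:freshmeat.net
FORM:3:|INPUT:text|NAME:q|VALUE:

FORM:4:|METHOD:GET|ACTION://slashdot.org/search.pl
FORM:4:|INPUT:TEXT|NAME:query|VALUE:
FORM:4:|INPUT:SUBMIT|NAME:|VALUE:Search
"""

forms = split_forms(sample)
print(len(forms))
```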

Output

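The form counter is incremented for every FORM tag. The form name, if present, is extracted from the tag; if the form has no name, the name part stays empty (FORM:1: in the slashdot.org example). The FORM line and all subsequent form elements are then labeled as FORM:number:name.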

Pipes (|) are used as field separators, for easy use with cut or with e.g. the PHP function explode(). They have only a small chance of clashing with the content of the variables.
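
As a rough Python counterpart to cut or explode(), one output line can be taken apart as shown here. The function name parse_line and the dictionary key FORMNAME are made up for this sketch. It assumes that no value contains a pipe, which is the small risk mentioned above.

```python
def parse_line(line: str) -> dict[str, str]:
    """Split one FORM line into a dictionary of its fields."""
    label, *fields = line.split("|")
    # The label is FORM:number:name; the name may be empty.
    _, number, name = label.split(":", 2)
    record = {"FORM": number, "FORMNAME": name}
    for field in fields:
        # Split on the first colon only, so URLs in ACTION stay intact.
        key, _, value = field.partition(":")
        record[key] = value
    return record


print(parse_line("FORM:1:f|INPUT:submit|NAME:btnI|VALUE:I'm Feeling Lucky"))
print(parse_line("FORM:2:|METHOD:|ACTION://slashdot.org/pollBooth.pl"))
```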

For the FORM tag, the METHOD (usually GET/POST) is extracted, together with the ACTION specified (the URL the query will be submitted to).
TODO: ENCTYPE support and support for submitting files.
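
The extracted METHOD may be empty and the ACTION may be relative, as in the examples above, so a script has to resolve both before submitting. A minimal Python sketch, assuming the script knows the URL the page was fetched from; form_target is a hypothetical helper, not part of the tool. An empty METHOD is treated as GET, which is the HTML default.

```python
from urllib.parse import urljoin


def form_target(page_url: str, method: str, action: str) -> tuple[str, str]:
    """Return (HTTP method, absolute URL) for a decoded FORM line."""
    # An empty METHOD field means the tag had no method attribute: default to GET.
    http_method = method.upper() or "GET"
    # urljoin resolves /path and //host/path forms against the page URL.
    return http_method, urljoin(page_url, action)


print(form_target("http://www.google.com/", "", "/search"))
print(form_target("http://slashdot.org/", "POST", "/users.pl"))
print(form_target("http://slashdot.org/", "", "//slashdot.org/pollBooth.pl"))
```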

The classical INPUT tags are simple; the TYPE (printed in the INPUT: field), NAME, and VALUE are extracted.
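
To show how the INPUT lines can be turned back into a submission, here is a Python sketch that builds a URL-encoded query string. The name build_query and the overrides argument are made up for this example. As a simplification, submit, radio, and checkbox inputs are only sent when the caller supplies a value for them, and unnamed inputs are never sent.

```python
from urllib.parse import urlencode

# Input types that are not sent unless the caller picks a value (simplification).
SKIPPED = {"submit", "radio", "checkbox"}


def build_query(input_lines, overrides):
    """Build a query string from decoded INPUT lines plus caller-supplied values."""
    pairs, used = [], set()
    for line in input_lines:
        fields = {}
        for field in line.split("|")[1:]:
            key, _, value = field.partition(":")
            fields[key] = value
        name = fields.get("NAME", "")
        if "INPUT" not in fields or not name:
            continue
        if name in overrides:
            if name not in used:  # send an overridden name only once
                pairs.append((name, overrides[name]))
                used.add(name)
        elif fields["INPUT"].lower() not in SKIPPED:
            pairs.append((name, fields.get("VALUE", "")))
    return urlencode(pairs)


google = [
    "FORM:1:f|INPUT:hidden|NAME:hl|VALUE:en",
    "FORM:1:f|INPUT:hidden|NAME:ie|VALUE:ISO-8859-1",
    "FORM:1:f|INPUT:|NAME:q|VALUE:",
    "FORM:1:f|INPUT:submit|NAME:btnG|VALUE:Google Search",
]
print(build_query(google, {"q": "curl"}))  # hl=en&ie=ISO-8859-1&q=curl
```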

For SELECT and OPTION tags, the name of the field is extracted from the SELECT tag and used for the subsequent OPTIONs. If an OPTION is marked SELECTED, that is appended to its line.

TEXTAREA behaves in a way equivalent to INPUT.

Files:

handleform.c - source code; quick and dirty but does the job.
