While searching the internet, I found that many people have problems with form-based authentication when using the Apache Nutch crawler. Every post I found ended without a solution, so I am offering one option here.
By default, Nutch uses the protocol-http plugin to retrieve pages. The protocol-httpclient plugin supports several HTTP authentication schemes out of the box and (still :() uses HttpClient v3.x. To use this plugin, you need to update the plugin.includes property in conf/nutch-site.xml. Credentials for specific hosts are read from the conf/httpclient-auth.xml file.
A good option is to define an XML format for storing the form credentials. For example, I used:
<credentials username="myUsn" password="myPass">
  <formscope loginPage="httpMethodPageUrl" className="si.zitnik.pathToClassName" port="portNum" />
</credentials>
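Reading this element back is simple with the JDK's DOM parser. The following is only a minimal sketch, not the code from my plugin: the class name FormCredentials and its parse method are hypothetical, and it assumes a single <credentials> root element with one nested <formscope>, exactly as in the example above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical holder for the values read from the form-credentials XML. */
final class FormCredentials {
    final String username;
    final String password;
    final String loginPage;
    final String className;
    final String port; // kept as a string, since the example value is a placeholder

    FormCredentials(String username, String password,
                    String loginPage, String className, String port) {
        this.username = username;
        this.password = password;
        this.loginPage = loginPage;
        this.className = className;
        this.port = port;
    }

    /** Parses one <credentials> root element with a nested <formscope> element. */
    static FormCredentials parse(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Element credentials = doc.getDocumentElement();
        Element scope =
                (Element) credentials.getElementsByTagName("formscope").item(0);
        if (scope == null) {
            throw new IllegalArgumentException("missing <formscope> element");
        }

        return new FormCredentials(
                credentials.getAttribute("username"),
                credentials.getAttribute("password"),
                scope.getAttribute("loginPage"),
                scope.getAttribute("className"),
                scope.getAttribute("port"));
    }
}
```

The className attribute can then be used to load the login class by reflection, so that different sites can use different login logic.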
The last step is to write your own login class that accepts the previously read parameters and authenticates against the login page. The easiest way is to write it directly inside the protocol-httpclient plugin. (If you want to put it somewhere else, you will need to modify the dependencies in the plugin's build XML files.)
After that, enjoy crawling!
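What the login class has to send depends entirely on the target site, so I can only give a rough sketch of one piece: building the URL-encoded body of the login POST. It uses only the JDK rather than HttpClient 3.x calls, so it can be tried outside the plugin. The class name FormLoginSketch and the form field names "username" and "password" are assumptions; check the HTML of your login form for the real field names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that prepares the body of a form-login POST request. */
final class FormLoginSketch {

    /** Collects the login fields; the field names here are assumptions. */
    static Map<String, String> loginFields(String username, String password) {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps insertion order
        fields.put("username", username);
        fields.put("password", password);
        return fields;
    }

    /** Encodes the fields as application/x-www-form-urlencoded. */
    static String buildFormBody(Map<String, String> fields)
            throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&'); // separator between key=value pairs
            }
            body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(field.getValue(), "UTF-8"));
        }
        return body.toString();
    }
}
```

For example, buildFormBody(loginFields("myUsn", "my Pass&1")) returns username=myUsn&password=my+Pass%261. Inside the plugin, the request should be sent through the plugin's own HttpClient instance, so that the session cookie set by the login response is reused for the pages fetched afterwards.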