{"id":590,"date":"2011-12-20T18:08:38","date_gmt":"2011-12-20T16:08:38","guid":{"rendered":"https:\/\/blog.zitnik.si\/?p=590"},"modified":"2011-12-20T18:10:09","modified_gmt":"2011-12-20T16:10:09","slug":"apache-nutch-1-4-form-authentication-solved","status":"publish","type":"post","link":"https:\/\/blog.zitnik.si\/?p=590","title":{"rendered":"Apache Nutch 1.4 &#8211; Form authentication [SOLVED]"},"content":{"rendered":"<p><a href=\"https:\/\/blog.zitnik.si\/wp-content\/uploads\/2011\/12\/nutch_logo_tm.gif\"><img loading=\"lazy\" decoding=\"async\" class=\"alignleft size-full wp-image-591\" title=\"nutch_logo_tm\" src=\"https:\/\/blog.zitnik.si\/wp-content\/uploads\/2011\/12\/nutch_logo_tm.gif\" alt=\"\" width=\"121\" height=\"48\" \/><\/a>While searching the internet, I found many people having problems with form-based authentication in the Apache Nutch crawler. All the posts I found ended without a solution, so here is one option.<\/p>\n<p>By default, Nutch uses the <strong>protocol-http<\/strong> plugin to retrieve pages. The <strong>protocol-httpclient<\/strong> plugin supports several HTTP authentication schemes out of the box and, sadly, still uses HttpClient v3.x. To use this plugin, update the <strong>plugin.includes<\/strong> property in <strong>conf\/nutch-site.xml<\/strong>. Credentials for specific hosts are read from the <strong>conf\/httpclient-auth.xml<\/strong> file.<\/p>\n<p>A good option is to define an XML format for storing form credentials. For example, I used:<\/p>\n<pre>&lt;credentials username=\"myUsn\" password=\"myPass\"&gt;\r\n      &lt;formscope loginPage=\"httpMethodPageUrl\"\r\n                 className=\"si.zitnik.pathToClassName\"\r\n                 port=\"portNum\" \/&gt;\r\n&lt;\/credentials&gt;<\/pre>\n<p>Then I edited the <strong>setCredentials<\/strong> method of the <strong>Http<\/strong> class in the <strong>protocol-httpclient<\/strong> plugin to read the new type of credentials. 
In the <strong>resolveCredentials<\/strong> method, I instantiate the class named by className and call its login function (design whatever abstract classes\/interfaces you prefer to make the procedure as generic as possible). Inside the plugin, HttpClient uses the BROWSER_COMPATIBILITY cookie policy, so no further changes are needed.<\/p>\n<p>The last step is to write your own login class that accepts the previously read parameters and authenticates against the page. The easiest way is to write it directly inside the protocol-httpclient plugin. (If you want to put it somewhere else, you will need to modify the dependencies in the plugin&#8217;s build XML files.)<\/p>\n<p>After that, enjoy crawling!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While searching the internet, I found many people having problems with form-based authentication in the Apache Nutch crawler. All the posts I found ended without a solution, so here is one option. 
By default, Nutch uses protocol-http plugin to retrieve pages.&#8230;<\/p>\n<div class=\"more-link-wrapper\"><a class=\"more-link\" href=\"https:\/\/blog.zitnik.si\/?p=590\">Continue reading<span class=\"screen-reader-text\">Apache Nutch 1.4 &#8211; Form authentication [SOLVED]<\/span><\/a><\/div>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13],"tags":[],"class_list":["post-590","post","type-post","status-publish","format-standard","hentry","category-computer-engineering","entry"],"_links":{"self":[{"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=\/wp\/v2\/posts\/590","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=590"}],"version-history":[{"count":4,"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=\/wp\/v2\/posts\/590\/revisions"}],"predecessor-version":[{"id":793,"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=\/wp\/v2\/posts\/590\/revisions\/793"}],"wp:attachment":[{"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=590"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=590"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=590"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}