While searching the internet, I found that many people have problems with form-based authentication when using the Apache Nutch crawler. All the posts I found ended without a solution, so here is one option.
By default, Nutch uses the protocol-http plugin to retrieve pages. The protocol-httpclient plugin supports several HTTP authentication schemes out of the box and (still :() uses HttpClient v3.x. To use this plugin, you need to update the plugin.includes property in conf/nutch-site.xml. Credentials for specific hosts are read from the conf/httpclient-auth.xml file.
A good option is to define an XML element for storing form credentials. For example, I used:
<credentials username="myUsn" password="myPass">
  <formscope loginPage="httpMethodPageUrl" className="si.zitnik.pathToClassName" port="portNum" />
</credentials>
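A minimal sketch of how such an element could be read with the JDK's DOM parser. FormCredentials and its parse method are hypothetical names used only for illustration, not part of Nutch, and the sketch assumes that port holds a number (or is left empty) rather than the placeholder shown above:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical holder for the values read from one <credentials> element.
class FormCredentials {
    final String username;
    final String password;
    final String loginPage;
    final String className;
    final int port; // -1 when the port attribute is missing or empty

    FormCredentials(String username, String password,
                    String loginPage, String className, int port) {
        this.username = username;
        this.password = password;
        this.loginPage = loginPage;
        this.className = className;
        this.port = port;
    }

    // Returns the first <credentials> element that has a <formscope> child,
    // or null when the document contains none.
    static FormCredentials parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList all = doc.getElementsByTagName("credentials");
            for (int i = 0; i < all.getLength(); i++) {
                Element cred = (Element) all.item(i);
                NodeList scopes = cred.getElementsByTagName("formscope");
                if (scopes.getLength() == 0) {
                    continue; // not a form-based entry
                }
                Element scope = (Element) scopes.item(0);
                String port = scope.getAttribute("port");
                return new FormCredentials(
                        cred.getAttribute("username"),
                        cred.getAttribute("password"),
                        scope.getAttribute("loginPage"),
                        scope.getAttribute("className"),
                        port.isEmpty() ? -1 : Integer.parseInt(port));
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot read form credentials", e);
        }
    }
}
```

Inside the plugin you would read the same attributes from the DOM that setCredentials already loads, instead of parsing a string.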
Then I edited the setCredentials method inside the Http class of the protocol-httpclient plugin to read the new type of credentials. In the resolveCredentials method I instantiate the class given by className and call its login function (build your preferred set of abstract classes/interfaces to make the procedure as generic as possible). The plugin's HttpClient uses the BROWSER_COMPATIBILITY cookie policy, so no further changes are needed.
The last step is to write your own login class that accepts the previously read parameters and authenticates to the page. The easiest way is to write it directly inside the protocol-httpclient plugin. (If you want to put it somewhere else, you will need to modify the dependencies in the plugin's build XML files.)
After that, enjoy crawling!
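One possible shape for that generic part, as a sketch only. The FormLogin interface, the AlwaysOkLogin stand-in and the FormLoginLoader helper are hypothetical names, not Nutch classes; the only real APIs used are the JDK reflection calls:

```java
// Hypothetical contract that every login class implements.
interface FormLogin {
    // Returns true when the site accepted the credentials.
    boolean login(String loginPage, String username, String password);
}

// Stand-in implementation so the sketch runs without a network connection.
class AlwaysOkLogin implements FormLogin {
    public boolean login(String loginPage, String username, String password) {
        return username != null && !username.isEmpty();
    }
}

class FormLoginLoader {
    // Instantiates the class named by the className attribute via its
    // no-argument constructor and checks that it implements FormLogin.
    static FormLogin load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return cls.asSubclass(FormLogin.class)
                      .getDeclaredConstructor()
                      .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException(
                    "Cannot load login class " + className, e);
        }
    }
}
```

With something like this in place, resolveCredentials would only need to call FormLoginLoader.load(className).login(loginPage, username, password), and each site could get its own implementation without further changes to the plugin.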
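As a rough illustration of what such a login class does, here is a sketch that posts a form and collects the session cookies. It uses the JDK's HttpURLConnection as a stand-in so that it is self-contained; inside the plugin you would send the request through the plugin's own HttpClient 3.x instance so that the cookies end up in its state. SimpleFormLogin and the form field names passed to it are assumptions and depend on the site:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SimpleFormLogin {
    // Builds an application/x-www-form-urlencoded body from the given fields.
    static String encodeForm(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Posts the login form and returns the Set-Cookie headers of the response
    // (an empty list when the server sets no cookies).
    static List<String> login(String loginPage,
                              String userField, String username,
                              String passField, String password) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(userField, username);
        fields.put(passField, password);
        byte[] body = encodeForm(fields).getBytes(StandardCharsets.UTF_8);

        HttpURLConnection conn = (HttpURLConnection) new URL(loginPage).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // Many login pages answer with a redirect; read the cookies from that
        // first response instead of following it.
        conn.setInstanceFollowRedirects(false);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        conn.getResponseCode(); // forces the request to be sent
        List<String> cookies = conn.getHeaderFields().get("Set-Cookie");
        return cookies == null ? Collections.<String>emptyList() : cookies;
    }
}
```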
Nice solution 🙂
I’m a beginner in Java and want to try your solution for Nutch.
Why is loginPage="httpMethodPageUrl" needed, and could you publish an example of the login class and/or some of your changes?
Thank you 🙂
loginPage is needed to log in to a site.
For example, you crawl pages such as http://myservice.com/folder1/main.html, but to log in to the site you need to make a POST request: http://myservice.com/login.php?username=denis&pass=wtfispass.
loginPage is therefore "http://myservice.com/login.php", and className is the class that builds and sends the POST request.
You can also hardcode the URL into the class, in which case loginPage is not necessary.
Br,
Slavko
Hi,
Thank you for your solution. It’s really useful.
I was just wondering how you sent the HttpResponse back, since that is what the getResponse() function in Http.java returns.
I am trying to solve form POST authentication.
I wrote a class with HttpClient 4.1, but that was not useful, so I wrote this:
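import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.commons.codec.binary.Base64; // Apache Commons Codec

URL url = new URL("http://somewebsite/docs/DOC-2264");
// HTTP Basic credentials are the Base64 encoding of "user:password"
String authStr = "usrname:password";
String encodedAuthStr = Base64.encodeBase64String(authStr.getBytes());
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("POST");
connection.setDoOutput(true);
connection.setRequestProperty("Authorization", "Basic " + encodedAuthStr);
// Reading the input stream sends the request
InputStream content = connection.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(content));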
Hi!
For the authentications I needed, I did everything in the login class, just to get the HTTP session, and that was all.