While searching the internet, I found that many people have problems with form-based authentication when using the Apache Nutch crawler. Every post I found ended without a solution, so here is one option.
By default, Nutch uses the protocol-http plugin to retrieve pages. The protocol-httpclient plugin supports several HTTP authentication schemes out of the box and (still :() uses HttpClient v3.x. To use this plugin, you need to update the plugin.includes property in conf/nutch-site.xml. Credentials for specific hosts are read from the conf/httpclient-auth.xml file.
A good option is to define XML for storing the form credentials. For example, I used:
<credentials username="myUsn" password="myPass">
  <formscope loginPage="httpMethodPageUrl" className="si.zitnik.pathToClassName" port="portNum" />
</credentials>
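Reading those attributes back could look roughly like the sketch below. It uses the JDK's DOM parser on its own rather than the plugin's existing parsing code, and the names FormCredentialsReader and FormCredentials are hypothetical, chosen only for illustration. The port is kept as a string because the snippet above shows only a placeholder.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical reader for the <credentials>/<formscope> snippet; not part of Nutch. */
class FormCredentialsReader {

    /** Plain holder for the values of one <credentials> element. */
    static final class FormCredentials {
        final String username;
        final String password;
        final String loginPage;
        final String className;
        final String port;

        FormCredentials(String username, String password,
                        String loginPage, String className, String port) {
            this.username = username;
            this.password = password;
            this.loginPage = loginPage;
            this.className = className;
            this.port = port;
        }
    }

    /** Parses the first <credentials> element that has a <formscope> child. */
    static FormCredentials parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // getElementsByTagName also matches the root element itself.
        Element credentials = (Element) doc.getElementsByTagName("credentials").item(0);
        if (credentials == null) {
            throw new IllegalArgumentException("no <credentials> element found");
        }
        Element scope = (Element) credentials.getElementsByTagName("formscope").item(0);
        if (scope == null) {
            throw new IllegalArgumentException("no <formscope> element found");
        }
        return new FormCredentials(
                credentials.getAttribute("username"),
                credentials.getAttribute("password"),
                scope.getAttribute("loginPage"),
                scope.getAttribute("className"),
                scope.getAttribute("port"));
    }
}
```

Calling FormCredentialsReader.parse(...) with the contents of the file gives you the username, password, login URL and class name in one object, which can then be handed to the login class.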
The last step is to write your own login class that accepts the previously read parameters and authenticates against the page. The easiest way is to write it directly inside the protocol-httpclient plugin. (If you want to put it somewhere else, you will need to modify the dependencies in the plugin's build XML files.)
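As a rough idea of what such a login class might do, here is a minimal sketch that posts the credentials as a form and collects the cookies the server sets in response. It is not the class I used: it relies only on the JDK's HttpURLConnection instead of HttpClient 3.x so that it stays self-contained, the class name FormLogin is hypothetical, and the form field names "username" and "pass" are assumptions that depend entirely on the target site's login form.

```java
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical form-login helper; not part of Nutch or protocol-httpclient. */
class FormLogin {
    private final String loginPage;
    private final String username;
    private final String password;

    FormLogin(String loginPage, String username, String password) {
        this.loginPage = loginPage;
        this.username = username;
        this.password = password;
    }

    /** Encodes the fields as an application/x-www-form-urlencoded body. */
    static String encodeForm(Map<String, String> fields) throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(field.getValue(), "UTF-8"));
        }
        return body.toString();
    }

    /** Posts the credentials and returns the Set-Cookie values of the response. */
    List<String> login() throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", username); // assumed field name, site-specific
        fields.put("pass", password);     // assumed field name, site-specific
        byte[] body = encodeForm(fields).getBytes(StandardCharsets.UTF_8);

        HttpURLConnection conn = (HttpURLConnection) new URL(loginPage).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // Many sites set the session cookie on a redirect response, so do not follow it.
        conn.setInstanceFollowRedirects(false);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        conn.getResponseCode(); // sends the request and waits for the response

        List<String> cookies = new ArrayList<>();
        for (Map.Entry<String, List<String>> header : conn.getHeaderFields().entrySet()) {
            // The status line is stored under a null key; equalsIgnoreCase handles that.
            if ("Set-Cookie".equalsIgnoreCase(header.getKey())) {
                cookies.addAll(header.getValue());
            }
        }
        return cookies;
    }
}
```

Inside the plugin you would do the same thing with the HttpClient instance the plugin already uses, so that the session is shared with the later page fetches.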
After that, enjoy crawling!
Nice solution 🙂
I'm a beginner in Java and want to try your solution for Nutch.
Why is loginPage="httpMethodPageUrl" needed, and could you publish an example of the login class and/or the changes you made?
Thank you 🙂
loginPage is needed to log in to a site.
For example, you crawl pages at http://myservice.com/folder1/main.html, but to log in to the site you need to make a POST request to http://myservice.com/login.php?username=denis&pass=wtfispass.
loginPage is therefore "http://myservice.com/login.php", and className is the class that builds and sends the POST request.
You can also hardcode the URL in the class, in which case loginPage is not necessary.
Thank you for your solution. It’s really useful.
I was just wondering how you sent the HttpResponse back, since that is what the getResponse() function in Http.java returns.
I am trying to solve form POST authentication.
I wrote a class using HttpClient 4.1, but that was not useful. So I wrote this:
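import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.commons.codec.binary.Base64;

// Request the page with an HTTP Basic Authorization header.
URL url = new URL("http://somewebsite/docs/DOC-2264");
String authStr = "usrname:password";
String encodedAuthStr = Base64.encodeBase64String(authStr.getBytes());
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestProperty("Authorization", "Basic " + encodedAuthStr);
InputStream content = connection.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(content));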
For the authentication I needed, I did everything in the login class, just to obtain the HTTP session, and that was all.
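To illustrate what "obtaining the session" can mean in practice, here is a small sketch that assumes the site tracks the session with cookies. It turns the Set-Cookie values from a login response into a single Cookie header value that later requests can send. The class name SessionCookies is hypothetical, and how the header gets attached to Nutch's own requests is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for reusing a cookie-based session; not part of Nutch. */
class SessionCookies {

    /** Builds a Cookie request header value from Set-Cookie response values. */
    static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            // Keep only "name=value"; attributes such as Path or HttpOnly
            // come after the first ';' and are not sent back to the server.
            int end = value.indexOf(';');
            String pair = (end >= 0 ? value.substring(0, end) : value).trim();
            if (!pair.isEmpty()) {
                pairs.add(pair);
            }
        }
        return String.join("; ", pairs);
    }
}
```

The returned string can be set as the "Cookie" request property on subsequent connections, so the server treats them as part of the logged-in session.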