Today we started at 9 o’clock at Gdansk University. I attended the IWSSA workshop (International Workshop on System/Software Architectures).
There were some lectures on IS architectures and multi-threading, but the most important talks for me, being closest to my research topic, were:
- An Architecture for Efficient Web Crawling (Inma Hernandez, Carlos R. Rivero, David Ruiz and Rafael Corchuelo, University of Seville, Spain)
They introduce the term “Virtual Integration”: being able to work with a website as if it were a database. The idea is to first use search engines to find hub pages and crawl them, then keep only the relevant pages, extract information from them, and present structured data to the user.
A hub is a page that has many links to relevant pages but also many links to irrelevant ones.
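The pipeline described above (find hubs, crawl them, keep only relevant pages, extract structured records) can be pictured roughly as follows. This is only a minimal sketch of the idea, not the authors' system: the in-memory `FAKE_WEB`, the function names, and the relevance rule are all assumptions made up for illustration.

```python
# Hypothetical in-memory "web": page name -> (outgoing links, page text).
FAKE_WEB = {
    "hub": (["p1", "p2", "about"], ""),
    "p1": ([], "title: Book A; price: 10"),
    "p2": ([], "title: Book B; price: 12"),
    "about": ([], "Contact us"),
}


def find_hubs(query):
    """Stand-in for a search-engine lookup; always returns the single fake hub."""
    return ["hub"]


def is_relevant(url):
    """Stand-in relevance filter; here, relevant pages are named p<number>."""
    return url.startswith("p") and url[1:].isdigit()


def extract(text):
    """Turn 'key: value; key: value' text into a dict record."""
    return dict(field.split(": ", 1) for field in text.split("; "))


def virtual_integration(query):
    """Hubs -> links -> relevant pages -> structured records."""
    records = []
    for hub in find_hubs(query):
        links, _ = FAKE_WEB[hub]
        for url in links:
            if is_relevant(url):
                _, text = FAKE_WEB[url]
                records.append(extract(text))
    return records


records = virtual_integration("books")
```

With this toy data, the "about" page is filtered out and the two book pages come back as dictionaries, which is the "website as a database" view in miniature.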
Requirements for crawling:
- Retrieval of relevant pages
- Deep Web Access
- Efficiency (not to DoS the servers)
- Unsupervised operation (to scale well: provide only a link to the main page and then work autonomously)
Crawling types: plain crawling, focused crawling, and classifier-based crawling, which is the most efficient.
In their work they try to validate the page based on the URL.
“Patricia tree”: a tree representing the website structure
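To get a feeling for how a page could be classified from its URL with a prefix tree, here is a small sketch. It is a plain, uncompressed trie over URL path tokens rather than a true Patricia tree, and it is not the authors' implementation; the example URLs, the labels, and the rule that replaces numeric tokens with a wildcard are my own assumptions.

```python
from urllib.parse import urlparse


def tokenize(url):
    """Split a URL path into tokens, replacing all-digit tokens with a wildcard."""
    path = urlparse(url).path
    return ["<num>" if part.isdigit() else part for part in path.split("/") if part]


class UrlTrie:
    """Prefix tree over URL path tokens; nodes may carry a page-type label."""

    def __init__(self):
        self.children = {}
        self.label = None

    def add(self, url, label):
        """Insert a labelled example URL into the tree."""
        node = self
        for token in tokenize(url):
            node = node.children.setdefault(token, UrlTrie())
        node.label = label

    def classify(self, url):
        """Return the label of the deepest labelled prefix matching the URL, or None."""
        node, best = self, None
        for token in tokenize(url):
            node = node.children.get(token)
            if node is None:
                break
            if node.label is not None:
                best = node.label
        return best


# Hypothetical training URLs for an imaginary shop.
trie = UrlTrie()
trie.add("http://shop.example/product/123", "Product")
trie.add("http://shop.example/review/456", "Review")
trie.add("http://shop.example/author", "Author")
```

A new URL such as `http://shop.example/product/999` then follows the same path through the tree as the training example and gets the same label without the page ever being downloaded.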
Evaluation: they used the top 41 sites on the Internet and 4 academic sites (DBLP, Google Scholar, MS Academic Search and ?). They collected only tiny fractions of the whole sites (100 hubs). They tried to discover the types of pages, e.g. Products, Reviews and Authors for Amazon. On all sites they achieved an F-score of 95 ± 3%.
They have an online demo: CALA Demo.
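As a reminder of what the F-score measures, here is a small sketch. I did not note which variant the authors reported, so this assumes the balanced F1 (harmonic mean of precision and recall); the page sets below are invented for illustration and have nothing to do with their data.

```python
def f1_score(predicted, actual):
    """Balanced F-score (F1) for a set of predicted items against the true set."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of predictions that are correct
    recall = true_positives / len(actual)        # share of true items that were found
    return 2 * precision * recall / (precision + recall)


# Made-up example: pages a classifier labelled as "Product" vs. the real product pages.
predicted_products = {"/product/1", "/product/2", "/product/3", "/review/9"}
actual_products = {"/product/1", "/product/2", "/product/3", "/product/4"}
score = f1_score(predicted_products, actual_products)
```

Here three of the four predictions are correct and three of the four real product pages are found, so precision and recall are both 0.75 and so is the F1.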
- A Reference Architecture to Devise Web Information Extractors (Hassan A. Sleiman and Rafael Corchuelo, University of Seville, Spain)
The idea is that pages are encoded into HTML.
Problems for IE:
- Lots of techniques
- None universally applicable
- No development support tools
- No validation consensus
- No reference architecture
-> They propose a reference architecture. I believe it is nothing new. The presenter said it is different, and that GATE, UIMA or CoreNLP are not useful because they work on semi-structured data.
Of course, I used the time after the last session to prepare tomorrow’s presentation (I surely follow Bill Gates’ words: “To be a good professional engineer, always start to study late for exams cause it teaches you how to manage time and tackle emergencies.”). If you postpone making your presentation, you may get some new ideas 🙂
In the evening we went to the gala dinner at the Sheraton hotel in downtown Gdansk. After dinner we also “touched” the Baltic Sea :).