(LP)², an adaptive algorithm for information extraction from Web-related texts

Authors

Fabio Ciravegna

Publication date

2001

Conference

In Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining

Description

Abstract: (LP)² is an algorithm for adaptive Information Extraction from Web-related text that induces symbolic rules by learning from a corpus tagged with SGML tags. Induction is performed by bottom-up generalisation of examples in a training corpus. Training is performed in two steps: first a set of tagging rules is learned; then additional rules are induced to correct mistakes and imprecision in tagging. Shallow NLP is used to generalise rules beyond the flat word structure. Generalisation gives better coverage on unseen texts, as it limits data sparseness and overfitting in the training phase. In experiments on publicly available corpora, the algorithm outperforms every other algorithm presented in the literature and tested on the same corpora. Experiments also show a significant gain from using NLP in terms of (1) effectiveness, (2) reduction of training time, and (3) training corpus size. In this paper we present the machine learning algorithm for rule induction. In particular, we focus on the NLP-based generalisation and the strategy for pruning both the search space and the final rule set.
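The bottom-up generalisation mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the toy lexicon `LEXICAL_CLASS`, the function names, the exhaustive enumeration of relaxations, and the "cover the most positives, match no negatives" selection criterion are all assumptions made for illustration. The abstract does not specify the actual search strategy or pruning.

```python
from itertools import product

# Hypothetical toy lexicon standing in for shallow NLP (an assumption,
# not taken from the paper).
LEXICAL_CLASS = {"4": "DIGIT", "5": "DIGIT", "pm": "TIMEID", "am": "TIMEID"}


def generalisations(window):
    """Yield every relaxation of a seed window, most specific first.

    Each position keeps its word, is lifted to its lexical class (when
    the toy lexicon knows one), or becomes the wildcard "*".
    """
    options = []
    for word in window:
        choices = [("word", word)]
        if word in LEXICAL_CLASS:
            choices.append(("class", LEXICAL_CLASS[word]))
        choices.append(("any", "*"))
        options.append(choices)
    yield from product(*options)


def matches(pattern, window):
    """Return True if every constraint in the pattern accepts its word."""
    if len(pattern) != len(window):
        return False
    for (kind, value), word in zip(pattern, window):
        if kind == "word" and word != value:
            return False
        if kind == "class" and LEXICAL_CLASS.get(word) != value:
            return False
    return True


def best_rule(seed, positives, negatives):
    """Pick the relaxation covering the most positives and no negatives.

    Ties keep the earlier (more specific) pattern.
    """
    best, best_cover = None, -1
    for pattern in generalisations(seed):
        if any(matches(pattern, n) for n in negatives):
            continue
        cover = sum(matches(pattern, p) for p in positives)
        if cover > best_cover:
            best, best_cover = pattern, cover
    return best, best_cover


# Made-up word windows preceding a hypothetical tag insertion point.
seed = ["at", "4", "pm"]
positives = [["at", "4", "pm"], ["at", "5", "pm"], ["at", "5", "am"]]
negatives = [["at", "the", "seminar"], ["in", "4", "rooms"]]

rule, cover = best_rule(seed, positives, negatives)
print(rule, cover)
```

In this toy data, the purely word-based seed matches only one positive example, while lifting the last two positions to lexical classes lets a single rule match all three without matching a negative. That is the sense in which generalising beyond the flat word structure can reduce data sparseness.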

Total citations

Cited by 147

Citations per year: 2001: 1, 2002: 8, 2003: 11, 2004: 16, 2005: 10, 2006: 8, 2007: 15, 2008: 11, 2009: 10, 2010: 6, 2011: 8, 2012: 7, 2013: 4, 2014: 4, 2015: 5, 2016: 5, 2017: 3, 2018: 4, 2019: 3, 2020: 3, 2021: 1, 2022: 2
