By Sunita Sarawagi, Indian Institute of Technology, CSE, India, sunita@iitb.ac.in
The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. The field of information extraction has its genesis in the natural language processing community where the primary impetus came from competitions centered around the recognition of named entities like people names and organization from news articles. As society became more data oriented with easy online access to both structured and unstructured data, new applications of structure extraction came around. Now, there is interest in converting our personal desktops to structured databases, the knowledge in scientific publications to structured records, and harnessing the Internet for structured fact finding queries. Consequently, there are many different communities of researchers bringing in techniques from machine learning, databases, information retrieval, and computational linguistics for various aspects of the information extraction problem.
This review is a survey of information extraction research of over two decades from these diverse communities. We create a taxonomy of the field along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced. We elaborate on rule-based and statistical methods for entity and relationship extraction. In each case we highlight the different kinds of models for capturing the diversity of clues driving the recognition process and the algorithms for training and efficiently deploying the models. We survey techniques for optimizing the various steps in an information extraction pipeline, adapting to dynamic data, integrating with existing entities and handling uncertainty in the extraction process.
Information Extraction deals with the automatic extraction of information from unstructured sources. This field has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. The text surveys over two decades of information extraction research from various communities such as computational linguistics, machine learning, databases and information retrieval.
Information Extraction provides a taxonomy of the field along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced. It elaborates on rule-based and statistical methods for entity and relationship extraction. In each case it highlights the different kinds of models for capturing the diversity of clues driving the recognition process and the algorithms for training and efficiently deploying the models. It surveys techniques for optimizing the various steps in an information extraction pipeline, adapting to dynamic data, integrating with existing entities and handling uncertainty in the extraction process.
Information Extraction is an ideal reference for anyone with an interest in the fundamental concepts of this technology. It is also an invaluable resource for those researching, designing or deploying models for extraction.