In our case, we need to identify paraphrases or sets of sentences having the same semantic content. There exist different approaches in this field that introduce similarity measures between sentences and clustering algorithms.

The analysis can be carried out on different levels: comparison of surface markers using n-grams [30]; concept matching [9] based on ontology structure [15, 31]; syntactic analysis and similarity measures between trees [32, 33]; extraction based on multilingual paraphrase corpora [2, 24].

In our approach, the semantic annotation is an element of the retrieved results and can be integrated in a clustering algorithm. Our hypothesis is that paraphrase sentences are annotated either by the same categories or by categories that are situated very close to each other in the structure of the ontology.

We can therefore use new similarity measures that combine term vector similarity and annotation vector similarity. We can then define the distance $D(s_1, s_2)$ between two sentences $s_1$ and $s_2$ as the product of the Euclidean distances between their term vectors and between their annotation vectors. This distance accounts for the hierarchy and the relations between semantic categories in the ontology in the following manner: if a sentence is annotated by a child category, it is also implicitly annotated by all its parent categories.

The development of tools for the automatic processing of scientific publications on a large scale has become possible thanks to two advances.
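The combined sentence distance described above, with annotations propagated to parent categories, might be sketched as follows. This is only an illustration under stated assumptions: the toy ontology `PARENT`, the category names and the binary encoding of annotation vectors are hypothetical and are not taken from the system itself.

```python
import math

# Hypothetical toy ontology: child category -> parent category (None for roots).
PARENT = {
    "result": "information",
    "method": "information",
    "information": None,
    "opinion": None,
}
CATEGORIES = sorted(PARENT)  # fixed order of the annotation vector components


def with_ancestors(categories):
    """Add every parent category implied by the given categories."""
    closed = set()
    for category in categories:
        while category is not None:
            closed.add(category)
            category = PARENT[category]
    return closed


def annotation_vector(categories):
    """Binary vector over CATEGORIES, including implicit parent categories."""
    closed = with_ancestors(categories)
    return [1.0 if c in closed else 0.0 for c in CATEGORIES]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance(terms1, annotations1, terms2, annotations2):
    """Product of the term-vector distance and the annotation-vector distance."""
    term_distance = euclidean(terms1, terms2)
    annotation_distance = euclidean(
        annotation_vector(annotations1), annotation_vector(annotations2)
    )
    return term_distance * annotation_distance
```

With this encoding, two sentences annotated by sibling categories ("result" and "method") share the implicit parent "information" and are therefore closer in annotation space than two sentences annotated by unrelated categories ("result" and "opinion"). Note that a product of distances is zero whenever either factor is zero, which is a property of the definition rather than of this sketch.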

Secondly, different approaches have been developed for the text mining of scientific publications, using either statistical methods based on machine learning techniques or knowledge-based methods relying on linguistic models. Our approach relies on a linguistic ontology that organizes the surface markers used for the annotation. A text segment in a document can carry more than one semantic annotation. The structure of the table depends on the nature of the information and cannot be defined in advance.

In fact, the types of data and the values that will come to enrich the document metadata are not necessarily known in advance and depend on the semantic annotation tools as well as on the level of analysis and the user. Historically, the development of different document representation structures has been conditioned by the idea that the text is an ordered hierarchy of content objects (OHCO). This term was first introduced by [11] in order to propose data exchange formats ensuring portability, data integrity and the possibility of multiple visualizations at different levels of representation.


XML trees are often used for full text document representation, but they are less suited to managing multiple relations and links between different objects.

The OAI-PMH protocol is invoked over HTTP and uses six different verbs that permit identifying the repository, listing the repository's metadata prefixes, listing the repository's sets, listing the records corresponding to a given period of time and a metadata prefix, listing the record identifiers, and getting a record given a specific identifier.
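The six verbs listed above correspond to the OAI-PMH requests `Identify`, `ListMetadataFormats`, `ListSets`, `ListRecords`, `ListIdentifiers` and `GetRecord`. The following minimal sketch only builds request URLs and performs no network access; the base URL `https://repository.example.org/oai` and the helper name `oai_request_url` are hypothetical.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_request_url(base_url, verb, **arguments):
    """Build the URL of an OAI-PMH request (hypothetical helper).

    Because "from" is a reserved word in Python, callers pass it as "from_";
    the trailing underscore is removed before the URL is built.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for name, value in arguments.items():
        params[name.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


# Hypothetical repository address, used for illustration only.
BASE_URL = "https://repository.example.org/oai"

identify_url = oai_request_url(BASE_URL, "Identify")
records_url = oai_request_url(
    BASE_URL,
    "ListRecords",
    metadataPrefix="oai_dc",
    from_="2012-01-01",
    until="2012-12-31",
)
```

Here `records_url` asks for the Dublin Core records of a given period, which is the kind of request a harvester would typically repeat at regular intervals.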

The aim of Dublin Core (DC) is to provide a common core of descriptors and a structure sufficient to facilitate the exchange and search of resources across different communities and across the description formats proper to each discipline. DC defines 15 elements, all of which are optional and repeatable.
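As an illustration of the "optional and multivalued" nature of the 15 elements, a DC description can be held as a mapping from element names to lists of values. The sketch below is an assumption about one convenient in-memory representation, not the format used by the system; the sample record and the helper `check_dc_record` are hypothetical.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def check_dc_record(record):
    """Return a list of problems; an empty list means the record is acceptable.

    Every element is optional, so a missing element is never a problem.
    Every element may carry several values, so each value must be a list.
    """
    problems = []
    for name, values in record.items():
        if name not in DC_ELEMENTS:
            problems.append(f"unknown element: {name}")
        elif not isinstance(values, list):
            problems.append(f"element {name} should hold a list of values")
    return problems


# Hypothetical record: two creators, no publisher, no date.
sample_record = {
    "title": ["An example thesis"],
    "creator": ["First Author", "Second Author"],
    "language": ["fr"],
}
```

Calling `check_dc_record(sample_record)` returns an empty list, whereas a record using a field outside the 15 elements would be reported.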

Although they provide only weak structural expressiveness, these 15 elements are considered to be part of the OAI-PMH [21]. Conceptually, DC offers an implementation and interoperability within the semantic web, which uses the RDF format.

The differences between these two models are studied in [7, 38]. We have chosen to use MongoDB, as document stores support more complex data than key-value stores. This model permits querying collections of documents with loosely defined fields. In a relational database, each record in a table needs the same number of fields, while in a document-oriented database the documents in a collection can have different fields.

So, a document-oriented database stores, retrieves, and manages semi-structured data. Document databases use object notations such as JSON [10] (JavaScript Object Notation), which is a 'lightweight, text-based, language-independent data interchange format'. It was designed to be lightweight, traversable and efficient. Version 1. The BSON format offers the ability to embed various formats such as PDF, which is a considerable advantage with regard to our requirements specification.
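The contrast between fixed relational records and documents with loosely defined fields can be illustrated without a database server. In the sketch below, a plain Python list stands in for a MongoDB collection and the `find` helper is a hypothetical stand-in for a query; it is not the MongoDB driver API, and the field names are assumptions.

```python
import json

# Two documents of the same collection with different fields:
# only the first one carries semantic annotations.
collection = [
    {
        "_id": 1,
        "title": "Paper A",
        "creator": ["First Author"],
        "annotations": [{"sentence": 12, "category": "result"}],
    },
    {
        "_id": 2,
        "title": "Thesis B",
        "language": "fr",
    },
]


def find(documents, **criteria):
    """Return the documents whose fields equal all the given criteria.

    A document that lacks one of the fields simply does not match.
    """
    return [
        doc
        for doc in documents
        if all(field in doc and doc[field] == value for field, value in criteria.items())
    ]


def having_field(documents, field):
    """Return the documents that define the given field at all."""
    return [doc for doc in documents if field in doc]


# The whole collection serializes to JSON text and back without loss.
as_json = json.dumps(collection)
restored = json.loads(as_json)
```

New annotation types can be added to a single document as extra fields without changing the others, which is the property that motivates the choice of a document store here.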

The implementation of different servers is based on the virtualization of operating systems. The second virtual machine under Windows 7 runs a professional solution used for the conversion of PDF into text which is the input of the semantic processing tools. The web interface is developed using AJAX to provide dynamic functionalities, i.

Various graphical representations of information enhance the understanding of the source document. The user interface proposes document descriptions that resemble bibliographic record references but that contain, together with the traditional metadata fields, other information derived from the semantic annotation, which contributes to the construction of a better representation of the source document (see Figure 3). This approach also achieves a certain optimization of the services of the providers of scientific publications and remains compatible with existing business models on the web.

If the metadata harvested by the OAI protocol gives access to full text publications, the latter can be exploited in order to enrich the metadata. The automatic semantic analysis of the text can therefore provide new metadata descriptors that give information on specific parts of the publication's content, which are extracted and categorized automatically. For example, our interface proposes semantically enriched bibliographies (see Figure 4) that list citations together with text extracts containing the corresponding reference key in the text.
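The idea of an enriched bibliography, in which each reference is listed with the sentences that cite it, can be sketched as follows. The sketch assumes numeric reference keys in square brackets (such as `[3, 39]`) and text that is already segmented into sentences; the regular expression and the function name `citation_contexts` are hypothetical and do not describe the annotation engine itself.

```python
import re
from collections import defaultdict

# Assumed citation style: one or more numeric keys in square brackets,
# for example "[12]" or "[ 3 , 39 ]".
CITATION = re.compile(r"\[\s*(\d+(?:\s*,\s*\d+)*)\s*\]")


def citation_contexts(sentences):
    """Map each reference key to the sentences in which it is cited."""
    contexts = defaultdict(list)
    for sentence in sentences:
        for match in CITATION.finditer(sentence):
            for key in match.group(1).split(","):
                number = int(key)  # int() ignores surrounding whitespace
                if sentence not in contexts[number]:
                    contexts[number].append(sentence)
    return dict(contexts)


sample_sentences = [
    "This orientation was considered in previous work [ 3 , 39 ].",
    "This sentence contains no citation.",
    "A related technique is described in [3].",
]
sample_contexts = citation_contexts(sample_sentences)
```

In `sample_contexts`, reference 3 is associated with two sentences and reference 39 with one; sentences without a reference key are left out.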

The system permits processing articles and PhD theses and visualizing the distribution of the annotations. Figure 5 shows the distribution of bibliographic citations in Natural Language Processing PhD theses [4]. On the horizontal axis we have the values corresponding to the text segments, starting with the first sentence of the document and going to the end.

On the vertical axis we have the different semantic categories. The purpose of this approach is to enable the development of new information retrieval applications providing rapid access to specific types of information that are difficult to identify by a simple keyword search. Their evaluation of the precision shows some promising results and underscores the interest of semantic annotation for information retrieval.
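The layout of Figure 5, with text position on the horizontal axis and semantic categories on the vertical axis, can be approximated by counting annotations per category in successive portions of the document. The sketch below is an assumption about one way to prepare such data; the input format (pairs of sentence index and category), the binning and the sample values are hypothetical.

```python
def annotation_distribution(annotations, sentence_count, bin_count):
    """Count annotations per category in equal-width bins of the text.

    annotations    -- iterable of (sentence_index, category), indices from 0
    sentence_count -- total number of sentences in the document
    bin_count      -- number of horizontal bins (portions of the text)
    Returns a dict mapping each category to a list of bin_count counts.
    """
    distribution = {}
    for sentence_index, category in annotations:
        if not 0 <= sentence_index < sentence_count:
            raise ValueError(f"sentence index out of range: {sentence_index}")
        row = distribution.setdefault(category, [0] * bin_count)
        # Integer arithmetic keeps the last sentence inside the last bin.
        row[sentence_index * bin_count // sentence_count] += 1
    return distribution


# Hypothetical annotations of a ten-sentence document.
sample_annotations = [(0, "result"), (5, "method"), (9, "result")]
sample_distribution = annotation_distribution(
    sample_annotations, sentence_count=10, bin_count=2
)
```

Each row of the result corresponds to one line of the vertical axis, and each position in a row to one portion of the text from beginning to end.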

The aim is to accelerate access to information and knowledge acquisition through categorized text syntheses generated dynamically according to the user's need. Our evaluation consists of observing the performance of the annotation method on a corpus of documents.

The publications we considered are relatively long; each document contains an average of 1, sentences. Among the , sentences only about 8, contain bibliographic references, and of them 1, were annotated by the system. Table 2 presents some statistics resulting from the corpora that were processed.

This shows that only a small number of the sentences in the documents are relevant. Table 3 presents the distribution of the annotation categories. Twenty-two percent of the sentences that contain bibliographic references were categorized by the annotation. This is the case when the author cites other work without explicitly expressing the motivations for the citation.

This also means that the annotation permits identifying a small number of the citation sentences that carry specific semantic relations. This approach is used for the construction of an IR system that extracts highly relevant segments from large corpora. Table 4 shows the distribution of the annotations of the semantic category of "Information". We can see a majority of results and citations in the corpus.

This work relies on several different disciplines: computer science, information science and NLP. It is being performed at a time when each of these disciplines is reaching a maturity that permits elaborating new approaches for semantic text processing.

This work represents an important first step because it provides a fully automated processing chain for harvesting, conversion and semantic annotation. The relations established between metadata and semantically annotated textual elements in scientific publications give qualitative information to improve measures for mapping science. This approach emphasizes the organization of linguistic markers that are used as resources by the semantic annotation engine. The first results are encouraging, but it is still necessary to address some of the limitations of this approach related to the working hypotheses.

Moreover, an evaluation on a bigger corpus is necessary in order to confirm the reliability of the tool for citation analysis. Finally, the construction of multilingual resources is also under development, primarily for English, which will allow us to process larger corpora. The approach that is proposed in this article can open a new discussion on bibliometric indicators. At this early stage of the implementation, we can start by studying the citation contexts and discuss the relevance of this approach for further applications.

Moreover, by exploiting the semantic annotations of citations in relation to document metadata, we can provide better access to the document content from an information retrieval perspective. In fact, the annotations identify some of the textual segments as relevant according to specific points of view.

This orientation was considered in previous work [3, 39]. However, OAI-PMH, by providing access to full text documents, allows us to envisage new perspectives for the mapping of science.

This work is supported by the Francophonie University Agency, which is a global network of institutions of higher education and research, by Paris-Sorbonne University, and by the publisher John Libbey Eurotext, who allow us to annotate their scientific journals.

Atanassova, M. Bertin, and J. Bannard and C. Paraphrasing with bilingual parallel corpora. Association for Computational Linguistics, Categorizations and annotations of citation in research evaluation. Bibliosemantic: a linguistic and data-processing technique by contextual exploration. PhD thesis, University of Paris-Sorbonne, Bertin, I.

Atanassova, and J. Extraction of author's definitions using indexed reference identification. Association for Computational Linguistics. Bertin, J. Djioua, and Y. Automatic annotation in text for bibliometrics use. Chubin and S. Content analysis of references: Adjunct or alternative to citation counting? Corley and R. Measuring the semantic similarity of texts. Technical report, Internet Engineering Task Force, DeRose, D. Durand, E. Mylonas, and A. What is text, really? Makkaoui, and J. Towards automatic thematic sheets based on discursive categories in biomedical literature.

Association for Computing Machinery.