It is difficult to integrate data from different sources (databases or XML dumps) because they often use different schemas and different content standards to represent their data.
In a typical example, The Hague may be represented in one XML file in one way:
<object>
<inventoryNumber>1</inventoryNumber>
<place>Den Haag, NL</place>
</object>
and in another file in a different way:
<item>
<inv>1</inv>
<location>
<city>The Hague</city>
<country>Netherlands</country>
</location>
</item>
In a Semantic Web application these representations should be unified by converting both XML files to RDF and semantically enriching the data.
In our example, this means that both the XML tag place and the XML subtree location are converted to RDF properties and declared as subproperties of a standardized property, customerLocation. A query for the standardized property customerLocation then returns the values of both properties.
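The schema integration step can be illustrated with a minimal Python sketch. It is not AnnoCultor code and does not use its API: the prefixes a:, b: and std:, the subject naming and the helper functions are all hypothetical, and triples are kept as plain tuples rather than in a real RDF store. It only shows how declaring both source properties as subproperties of customerLocation lets one query find both records.

```python
import xml.etree.ElementTree as ET

SUBPROPERTY_OF = "rdfs:subPropertyOf"

# The two source records from the example above.
SOURCE_A = (
    "<object><inventoryNumber>1</inventoryNumber>"
    "<place>Den Haag, NL</place></object>"
)
SOURCE_B = (
    "<item><inv>1</inv><location><city>The Hague</city>"
    "<country>Netherlands</country></location></item>"
)


def convert_a(xml_text):
    """Convert the first format; the a: prefix is a hypothetical namespace."""
    root = ET.fromstring(xml_text)
    subject = "a:object/" + root.findtext("inventoryNumber")
    return [(subject, "a:place", root.findtext("place"))]


def convert_b(xml_text):
    """Convert the second format; the location subtree becomes one value."""
    root = ET.fromstring(xml_text)
    subject = "b:item/" + root.findtext("inv")
    location = root.find("location")
    value = location.findtext("city") + ", " + location.findtext("country")
    return [(subject, "b:location", value)]


# Schema triples: both source properties specialize the standardized one.
SCHEMA = [
    ("a:place", SUBPROPERTY_OF, "std:customerLocation"),
    ("b:location", SUBPROPERTY_OF, "std:customerLocation"),
]


def query(triples, schema, prop):
    """Return (subject, value) pairs for prop and its direct subproperties."""
    props = {prop} | {
        s for s, p, o in schema if p == SUBPROPERTY_OF and o == prop
    }
    return [(s, o) for s, p, o in triples if p in props]


triples = convert_a(SOURCE_A) + convert_b(SOURCE_B)
results = query(triples, SCHEMA, "std:customerLocation")
for subject, value in results:
    print(subject, value)
```

The query here follows only direct subproperty links; an RDF store with RDFS inference would also follow chains of subproperties.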
Next, we analyze the values of these tags to establish that they refer to the same city, The Hague. This allows us to look the city up in an external vocabulary, e.g. Geonames, and replace the literal strings with the unambiguous identifier http://www.geonames.org/2747373. As a result, we can show both locations on a map or pull the city's Chinese name from Geonames.
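The enrichment step can be sketched in the same spirit. The lookup table below is a hypothetical stand-in for a real vocabulary lookup: it makes no call to Geonames, and the normalization rule (keep the text before the first comma, lowercased) is an assumption chosen only to fit the two example values.

```python
GEONAMES_URI = "http://www.geonames.org/2747373"

# Hypothetical local vocabulary: normalized label -> identifier.
VOCABULARY = {
    "den haag": GEONAMES_URI,
    "the hague": GEONAMES_URI,
}


def normalize(literal):
    """Keep the part before the first comma, trimmed and lowercased."""
    return literal.split(",")[0].strip().lower()


def enrich(literal):
    """Return the vocabulary identifier, or the literal if it is unknown."""
    return VOCABULARY.get(normalize(literal), literal)


enriched = [enrich(value) for value in ("Den Haag, NL", "The Hague")]
print(enriched)
```

Values that are not found in the vocabulary are returned unchanged, so an unmatched place keeps its original literal instead of being dropped.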
In the literature, the first step is often called "schema integration" and the second "data enrichment".
AnnoCultor is a tool that helps programmers write converters that turn XML files into RDF, covering both schema integration and data enrichment. It is suitable for converting product catalogs, museum collections, vocabularies and thesauri of terms, and other databases or XML files.
We used AnnoCultor to convert collections of:
We also used AnnoCultor to convert thesauri of:
During a conversion, your engineers will face technical problems. It is quite likely that we have already faced the same problems in our own work and developed reusable solutions for them. With AnnoCultor, a conversion task therefore shifts from developing specific solutions to assembling available ones, which saves time and improves quality.
Please refer to our analysis of the effort needed to convert a dataset, presented in the paper Porting Cultural Repositories to the Semantic Web by Borys Omelayenko and listed in the publications section.
AnnoCultor 2.x requires knowledge of XML, because converters are written as XML documents. Java knowledge is not compulsory, although basic Java is sometimes needed to write code snippets that customize and fine-tune converters.
AnnoCultor is distributed under the Apache license.