
Approaches to the Integration of Distributed and Heterogeneous Data Resources

Ahmet Sayar
Indiana University, Computer Science Department



Motivation
Integrating data from multiple data sources.
Distributed querying of, and transactions on, data.
Defining and adopting models for data, metadata and their storage.
Accessing the data seamlessly.
Transparency, support for heterogeneity, extensibility and scalability.

Outline
Data Integration Approaches

Application Specific Solutions

Application-Integration Framework
ASIS (Application Specific Information System)

Database Federation
Ogsa-DAI (Ogsa-Data Access and Integration)
Compare ASIS with Ogsa-DAI

Digital Libraries
SRB (Storage Resource Broker)
Sompel's Digital Library Approach
Compare ASIS with SRB and Sompel's DL

Application Specific Solutions


The most common means of data integration.
Expensive in terms of time and skills.
Developing and using them requires deep system knowledge.
Better results for special-purpose applications.
Fragile: changes to the underlying sources may easily break the application.
Hard to extend: a new data source requires new code to be written.


Application-Integration Framework
Also called a component-based framework, such as CORBA or Filters with common interfaces.
Does not necessarily address data integration issues.
Based on a common data model (such as CML and GML).
With adaptors, if a source changes, its adaptor may have to change, but the application may never see the change.
Adding a new source is easy: a new adaptor may need to be written, or a suitable adaptor may already exist online.
No need for detailed system knowledge.
Example: ASIS - OGC GIS Application Integration Framework.

ASIS (1)
Enables inter-service communication through well-defined service interfaces, message formats and capabilities metadata.
Data model is ASL (Application Specific Language).
Metadata model is the capability document.
Data and metadata have common predefined schemas.
Components are Filter Services:
Web Services with common service interfaces defined in WSDL.
Information/data services enabling distributed access, querying and transformation through their predictable input/output interfaces.
Chainable, locatable, and capable of updating their metadata manually or dynamically.

ASIS (2)
Data and data storage model:
Any data can be integrated into the system after being transformed to ASL.
Heterogeneity is handled at the end-Filters with adaptors.
ASL is a community-accepted application specific language:
GML (Geography Markup Language) in GIS applications.
CML (Chemical Markup Language) in Chemistry applications.

Filters' common service interfaces:
getCapabilities, getData, getFeatureInfo.

Requests to the Filters' interfaces:
getCapabilitiesReq, getDataReq, getFeatureInfoReq.

Expected return types are defined in the Filter's capability metadata.
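
The three common interfaces can be pictured as an abstract class that every Filter implements. The sketch below is only an illustration: the slides define the interfaces in WSDL, so the Python signatures, the FaultEndFilter class, its sample row and the GML-like output are all assumptions.

```python
from abc import ABC, abstractmethod


class Filter(ABC):
    """Hypothetical Python rendering of the three common Filter interfaces."""

    @abstractmethod
    def getCapabilities(self) -> dict:
        """Return capability metadata describing what this Filter serves."""

    @abstractmethod
    def getData(self, layer: str) -> str:
        """Return the requested layer serialized in the common data model."""

    @abstractmethod
    def getFeatureInfo(self, layer: str, feature_id: str) -> dict:
        """Return the attributes of a single feature."""


class FaultEndFilter(Filter):
    """End-Filter whose adaptor turns native rows into an ASL-like string."""

    def __init__(self):
        # Native, source-specific representation (made up for illustration).
        self._rows = {"f1": {"name": "Example Fault", "kind": "vector"}}

    def getCapabilities(self):
        return {"service": "FaultEndFilter", "layers": ["Fault"]}

    def getData(self, layer):
        if layer != "Fault":
            raise KeyError(layer)
        # The adaptor step: native rows become elements of the common model.
        items = "".join(
            f'<Fault id="{fid}"><name>{row["name"]}</name></Fault>'
            for fid, row in self._rows.items()
        )
        return f"<FeatureCollection>{items}</FeatureCollection>"

    def getFeatureInfo(self, layer, feature_id):
        if layer != "Fault":
            raise KeyError(layer)
        return dict(self._rows[feature_id])


end_filter = FaultEndFilter()
print(end_filter.getCapabilities())
print(end_filter.getData("Fault"))
```

Because callers only see the three methods, the native representation behind the adaptor can change without affecting them.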



ASIS (3)
Metadata and metadata storage model:
Data integration is done through the Filters' capability metadata.
Metadata is stored in the Filter's local file system as a flat file.
Capability:
Inspired by the OGC WMS capability specification.
Looks like the Dublin Core format.
A capability-like structure is also used in Gannon's approach (XPOLA) for Grid services security issues.
Describes dynamic Web/Grid resources.
Updated manually or dynamically.
Consists of descriptor, service and provider metadata.
Inter-service communication is achieved without a third party.
Enables chains of Filters.

ASIS (4)
Data Access and Filter Chaining

[Figure: chain of Filters F1, F2, F3 and F4 exchanging the Earth, Fault and State Boundary layers]

Each Filter is capable of acting as both a server and a client.
Capability integration is done through the getCapabilities service interface.
Requests for the common service interfaces are created in accordance with predefined XML schemas.

Filter Name | Initial Data Provided | Data Provided After Chaining
F1 | None | Earth, Fault and State Boundary
F2 | Earth (raster) | Earth and Fault
F3 | Fault (vector) | Fault
F4 | State Boundary (vector) | State Boundary
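
The before/after table can be reproduced with a small sketch of capability aggregation. The topology used here (F1 chains to F2 and F4, F2 chains to F3) is an assumption inferred from the table, and ChainFilter is a hypothetical class, not part of ASIS.

```python
class ChainFilter:
    """Hypothetical Filter: a server to its callers, a client of upstream Filters."""

    def __init__(self, name, own_layers=(), upstream=()):
        self.name = name
        self.own_layers = list(own_layers)
        self.upstream = list(upstream)

    def getCapabilities(self):
        # Advertise own layers first, then whatever each upstream Filter advertises.
        layers = list(self.own_layers)
        for upstream_filter in self.upstream:
            for layer in upstream_filter.getCapabilities():
                if layer not in layers:
                    layers.append(layer)
        return layers


f3 = ChainFilter("F3", ["Fault"])
f4 = ChainFilter("F4", ["State Boundary"])
f2 = ChainFilter("F2", ["Earth"], upstream=[f3])
f1 = ChainFilter("F1", upstream=[f2, f4])

for f in (f1, f2, f3, f4):
    print(f.name, f.getCapabilities())
```

No central catalog is involved: each Filter learns what it can offer by calling getCapabilities on its neighbours.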




Database Federation
Middleware consisting of a database management system.
Uniform access to a number of heterogeneous data sources.
Provides a query language used to combine, contrast, analyze and manipulate the data.
Data integration is done through database integration.
Combines data from multiple sources in a single SQL statement.
Query re-creation.
Example: Ogsa-DAI (Open Grid Service Architecture Data Access and Integration).
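
The idea of combining several sources in a single SQL statement can be tried out with Python's built-in sqlite3 module. This is only a stand-in: both attached databases use the same engine, so it shows the single-statement join but not true heterogeneity, and the schema names and rows are made up.

```python
import sqlite3

# One connection plays the role of the federation middleware.
conn = sqlite3.connect(":memory:")
# Each ATTACH creates a separate in-memory database standing in for a remote source.
conn.execute("ATTACH DATABASE ':memory:' AS geo")
conn.execute("ATTACH DATABASE ':memory:' AS seismic")

conn.execute("CREATE TABLE geo.states (code TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE seismic.faults (fault TEXT, state_code TEXT)")
conn.executemany(
    "INSERT INTO geo.states VALUES (?, ?)",
    [("S1", "State One"), ("S2", "State Two")],
)
conn.executemany(
    "INSERT INTO seismic.faults VALUES (?, ?)",
    [("Fault A", "S1"), ("Fault B", "S1"), ("Fault C", "S2")],
)

# A single SQL statement that spans both sources.
rows = conn.execute(
    """
    SELECT s.name, COUNT(f.fault)
    FROM geo.states AS s
    JOIN seismic.faults AS f ON f.state_code = s.code
    GROUP BY s.name
    ORDER BY s.name
    """
).fetchall()
print(rows)
```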

Ogsa-DAI (1)
Provides a common Java API for accessing and integrating data resources, such as relational and XML databases and files, in a Grid environment.
Specifically designed for the OGSA architecture.
Supports SQL queries on relational resources and XPath statements on XML collections.
Provides data pipelining (similar to Filter chaining) via an XML document called the perform document.
Allows developers to easily add or extend functionality within Ogsa-DAI through the activity document.

Ogsa-DAI (2)
Data and storage model:
Any data stored in XML or relational databases, or in files.
No common data model.
Data is provided through GDSs (Grid Data Services).
Uses Ogsa-DQP (Distributed Query Processor) to coordinate access to multiple data services.
The enactment engine is the core of Ogsa-DAI; it orchestrates the running of the perform document.
Information in the perform document includes:
The list of activities, with their XML schemas and implementation classes.
The list of role mappers and their details.
Information about the data resource.
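
To make the role of an enactment engine concrete, here is a toy version that walks an XML document and runs one function per activity, passing named streams between them. It is a sketch only: the element names, attributes and the two activities are invented for illustration and do not follow the real Ogsa-DAI perform document schema or its Java API.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are illustrative only.
PERFORM = """
<perform>
  <sqlQuery name="q" output="rows">
    <expression>SELECT name FROM faults ORDER BY name</expression>
  </sqlQuery>
  <toUpper name="u" input="rows" output="result"/>
</perform>
"""


def run_sql_query(node, streams, conn):
    # Run the SQL expression and publish the first column as a named stream.
    query = node.findtext("expression")
    streams[node.get("output")] = [row[0] for row in conn.execute(query)]


def run_to_upper(node, streams, conn):
    # Consume one stream and publish a transformed one (the pipelining step).
    streams[node.get("output")] = [v.upper() for v in streams[node.get("input")]]


# Registry mapping activity element names to implementations.
ACTIVITIES = {"sqlQuery": run_sql_query, "toUpper": run_to_upper}


def enact(document, conn):
    """Run every activity in document order and return all named streams."""
    streams = {}
    for node in ET.fromstring(document):
        ACTIVITIES[node.tag](node, streams, conn)
    return streams


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE faults (name TEXT)")
conn.executemany("INSERT INTO faults VALUES (?)", [("beta",), ("alpha",)])
print(enact(PERFORM, conn)["result"])
```

Adding a new activity in this sketch means registering one more function, which mirrors the extensibility the slides attribute to activities.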

Ogsa-DAI (3)
Metadata storage model:
Metadata is kept in the Metadata Catalog Service (MCS).
MCS enables attribute-based querying.
Metadata describes the datasets; the data itself can be anything (binary, text, ...).
Data integration is done through an XML-based activity file that mixes activities (such as SQL queries) and metadata.

Simple data access scenario:
A client first contacts a DAISGR to locate the GDSFs.
It accesses suitable GDSFs directly to find out more about their properties and the data resources they represent.
It asks a GDSF to instantiate a GDS.
It accesses the resource by sending the GDS a GDS-Perform document.

Ogsa-DAI (4)
Metadata model:
No common schema for metadata comparable to the capability document.
Defines metadata for the datasets:
No XML schema.
Stored in database tables as attributes.

Defines metadata for the database system to enable querying and defining activities:
Schema in XML (mcsActivity.xsd schema file).
Kept as an XML file in the file system (mcsActivity.xml).

ASIS vs. Ogsa-DAI


Ogsa-DAI does not define metadata and data in an XML schema; metadata is mixed with the database schema. ASIS has predefined data and metadata models.
Ogsa-DAI accepts any data and relies on a predefined database schema to enable querying and accessing it.
ASIS's data integration is on demand and based on capability federation, whereas Ogsa-DAI's data integration is coded in XML-structured perform and activity documents.
Ogsa-DAI has a central metadata approach (MCS); ASIS has a distributed metadata approach.
Both systems are based on Web Services.
Ogsa-DAI uses GridFTP, and ASIS uses NaradaBrokering, to address performance issues in data transfers.


Digital Libraries
Main focus is publishing and discovering digital objects.
Digital objects: a file, a URL, an SQL command string, or any string of bits.
Collects data from multiple different data sources.
Slightly different from the other data integration approaches:
Provides data curation services, such as publishing data to and removing data from the data sources.

Example: SRB (Storage Resource Broker) and Sompel's Digital Library Approach.

SRB (1)
A federated client-server system.
Each server manages/brokers a set of resources.
An implementation architecture for:
Data grids.
Digital libraries.

Storage resources include digital libraries, MSS, UniTree and file systems.
SRB consists of three components:
MCAT services, SRB servers that access the storage repositories, and SRB clients.

Mediates access to distributed heterogeneous resources.
Uses MCAT (Metadata Catalog Service) to facilitate brokering and attribute-based querying.
Integrates data and metadata.

SRB (2)

Data and storage model:
Uniform storage interface.
Resource-specific drivers map each storage system to the uniform interface.
Storage resources are registered within SRB as physical resources.
Logical storage resources (LSR) enable replication; an LSR consists of one or more physical resources.
The client API refers to LSRs.
Collections are created through LSRs.

Metadata storage model (MCAT):
Serves both core metadata and domain-dependent metadata.
Core metadata follows a standardized schema like Dublin Core.
Stores metadata about data, collections, users, resources and methods.
Attribute-based access, querying and updating of the metadata catalog.
Implemented as a relational database (Oracle, DB2 or Sybase).
Abstraction and replica information for data.
Global user name space and authentication.
Authorization through ACLs and tickets.
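
The relationship between logical resources, replicas and attribute-based querying can be sketched with a small relational catalog. The table layout, resource names and helper functions below are assumptions made for illustration; the slides do not describe the actual MCAT schema.

```python
import sqlite3

# Hypothetical catalog layout; the real MCAT schema is not described in the slides.
catalog = sqlite3.connect(":memory:")
catalog.executescript(
    """
    CREATE TABLE logical_resource (lsr TEXT, physical TEXT);
    CREATE TABLE replica (logical_name TEXT, physical TEXT);
    CREATE TABLE attribute (logical_name TEXT, attr TEXT, value TEXT);
    """
)
# One logical resource backed by two physical resources.
catalog.executemany(
    "INSERT INTO logical_resource VALUES (?, ?)",
    [("lsr1", "unix-disk"), ("lsr1", "tape-archive")],
)


def put(logical_name, lsr, attrs):
    """Register one replica per physical resource behind the LSR, plus metadata."""
    physicals = [
        p for (p,) in catalog.execute(
            "SELECT physical FROM logical_resource WHERE lsr = ? ORDER BY physical",
            (lsr,),
        )
    ]
    catalog.executemany(
        "INSERT INTO replica VALUES (?, ?)", [(logical_name, p) for p in physicals]
    )
    catalog.executemany(
        "INSERT INTO attribute VALUES (?, ?, ?)",
        [(logical_name, k, v) for k, v in attrs.items()],
    )
    return physicals


def find(attr, value):
    """Attribute-based query: returns logical names, no physical location needed."""
    return [
        n for (n,) in catalog.execute(
            "SELECT logical_name FROM attribute "
            "WHERE attr = ? AND value = ? ORDER BY logical_name",
            (attr, value),
        )
    ]


print(put("myFile", "lsr1", {"author": "bob"}))
print(find("author", "bob"))
```

The client only names the logical resource and the attributes; where the replicas physically live stays inside the catalog.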

SRB (3)
Metadata and metadata exchange model:
MAPS (Metadata Attribute Presentation Structure).
Independent of the internal representation of the attributes inside the catalog.
Provides a uniform interface specification that can be used between user applications and the MCAT catalog, in both directions.
Structures that form MAPS:
MAPS_Query_Struct, MAPS_Result_Struct, MAPS_Update_Struct and MAPS_Definition_Struct.

Mappings from MAPS to other models and exchange formats; a Dublin Core mapping is under implementation.

SRB (4)
Simple data access scenario:
The SRB server spawns an SRB agent to authenticate the user/application by comparing its credentials with information stored in MCAT.
The agent finds the location of the data in MCAT.
The agent checks the user request against the permissions stored in MCAT.
The SRB agent returns the result of the request to the user.
The SRB agent communicates with the user through a port specific to this client session.

SRB server chaining scenario (integrated SRBs):
The first three steps are the same as in the simple data access case.
The SRB agent contacts a remote SRB agent via the remote SRB server.
The second SRB agent returns a pointer to the data item to the first SRB agent, which passes it on to the user.
The SRB client interacts with the data item directly.
In the federated SRB scheme, one SRB server acts as a client to another.
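
The two scenarios can be walked through in a few lines of Python. This is a sketch under stated assumptions: SrbServer, the dictionary-based catalog and the pointer tuple are hypothetical simplifications, and the agent logic is folded into the server class.

```python
class SrbServer:
    """Hypothetical stand-in for an SRB server with its agent logic inlined."""

    def __init__(self, name, mcat, local_data, peers=None):
        self.name = name
        self.mcat = mcat              # shared catalog: users, locations, permissions
        self.local_data = local_data  # items this server can reach directly
        self.peers = peers or {}      # other SRB servers, keyed by name

    def get(self, user, password, item):
        # Step 1: authenticate the user against information stored in MCAT.
        if self.mcat["users"].get(user) != password:
            raise PermissionError("authentication failed")
        # Step 2: find the location of the item in MCAT.
        location = self.mcat["locations"][item]
        # Step 3: check the request against the permissions stored in MCAT.
        if user not in self.mcat["acl"].get(item, set()):
            raise PermissionError("not authorized")
        return self._pointer(item, location)

    def _pointer(self, item, location):
        if location == self.name:
            return (self.name, self.local_data[item])
        # Chaining: this server acts as a client of the remote server.
        return self.peers[location]._pointer(item, location)


mcat = {
    "users": {"bob": "secret"},
    "locations": {"myFile": "srb2"},
    "acl": {"myFile": {"bob"}},
}
srb2 = SrbServer("srb2", mcat, {"myFile": "/vault/myFile"})
srb1 = SrbServer("srb1", mcat, {}, peers={"srb2": srb2})

# The client talks to srb1 only; the pointer it gets back refers to srb2.
print(srb1.get("bob", "secret", "myFile"))
```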

ASIS vs. SRB


SRB does not define metadata in an XML structure (as ASIS does).
SRB accepts any data, but ASIS uses ASL.
SRB keeps the metadata in catalog services (MCAT); ASIS uses XML-structured capability metadata.
SRB has a central metadata handling approach; ASIS has a distributed one.
ASIS's data integration is based on metadata federation; SRB's data integration is based on SRB server federation.
Instead of Filters, SRB uses SRB servers and agents for accessing data resources.

Sompel's DL (1)
Scholarly communication as a network-based workflow.
Where ASIS has Filters and ASL, Sompel defines repositories and digital objects, respectively.
A repository is a networked system that provides services pertaining to a collection of digital objects.
Repositories have common service interfaces:
Obtain, Harvest and Put.

Two classes of participants:
Data providers (DP) and service providers (SP).

SPs collect metadata from DPs (via the three service interfaces), then normalize and cluster it to deal with duplicates.
DPs offer some type of search mechanism for their own repositories.

Sompel's DL (2)
Data and storage model:
Data is the abstraction of the digital objects.
Digital object = digital data + key metadata.
Surrogate = serialization of a digital object.
Surrogates:
Carry information for the value chains and the service information used at repository service interfaces.
Are in XML/RDF format.
Are composed of dataStream and/or Entity elements.
Define a chained object by keymetadataID or providerInfo.

Different storage types: book repositories, teaching object repositories, dataset repositories, etc.
Repositories are active nodes.
Repositories enable the use and re-use of materials in many contexts.

Sompel's DL (3)

Metadata model:
Surrogates are essentially metadata records for objects.
Based on the Dublin Core format with domain-specific extensions.
Dublin Core has 15 standard elements to describe resources. For more details see http://dublincore.org

Chaining for integrating data:
The application/user does not need a workflow engine or script to create or run the chain (as in ASIS).
The chain (which they call the value chain) is hidden in the surrogates.
Surrogates are updated through the common interfaces (Put, Obtain and Harvest) of the resources.
The chain is defined in the Entity element of the surrogate document, with the Lineage sub-element.

Sample chaining scenario:
A paper might reference some papers, and these papers might reference other papers.
The value chain does not stop.
Papers gain different (value-added) metadata through the value chain.
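
The paper-references-paper scenario amounts to following Lineage links from one surrogate to the next. The sketch below assumes a made-up surrogate layout: only the names Entity, Lineage and keymetadataID come from the slides, while the ref attribute, the surrogate wrapper element and the in-memory repository are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical surrogates: only the names Entity, Lineage and keymetadataID
# come from the slides; the layout and the "ref" attribute are assumed.
SURROGATES = {
    "paper-A": (
        '<surrogate><Entity keymetadataID="paper-A">'
        '<Lineage ref="paper-B"/><Lineage ref="paper-C"/>'
        "</Entity></surrogate>"
    ),
    "paper-B": (
        '<surrogate><Entity keymetadataID="paper-B">'
        '<Lineage ref="paper-C"/>'
        "</Entity></surrogate>"
    ),
    "paper-C": '<surrogate><Entity keymetadataID="paper-C"/></surrogate>',
}


def obtain(object_id):
    """Stand-in for the Obtain interface of a repository."""
    return ET.fromstring(SURROGATES[object_id])


def value_chain(object_id, seen=None):
    """Depth-first walk over Lineage references, visiting each object once."""
    seen = [] if seen is None else seen
    if object_id in seen:
        return seen
    seen.append(object_id)
    for lineage in obtain(object_id).iter("Lineage"):
        value_chain(lineage.get("ref"), seen)
    return seen


print(value_chain("paper-A"))
```

No workflow script drives the traversal; the chain is recovered entirely from what the surrogates themselves record.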

ASIS vs. Sompel's Approach

Where ASIS has Filters and ASL, Sompel defines repositories and digital objects, respectively:
DPs correspond to end-Filters, and SPs correspond to Filters in ASIS.

ASIS does not have publishing or putting service interfaces:
Obtain corresponds to getData in ASIS.
Harvest corresponds to getCapabilities in ASIS.

Both have distributed metadata approaches for data integration:
ASIS: direct communication between Filters by using the getCapabilities interface.
Sompel's DL: direct communication between repositories and services by using the Harvest interface.

Sompel's DL uses Dublin Core for the representation of resources; ASIS uses its own schema.
ASIS uses ASL for the representation of data; Sompel's approach does not have a common data model.

Summary
Application-Integration Framework (ASIS)
Easy to add new sources.
Using online Filters that provide the required adaptors.
Peer-to-peer chain of Filters.
No central metadata catalog server.
Distributed capability exchange and aggregation.
SOA.

Re-usable components (Filters) for different applications in a predefined domain.
Implications of Filter services:
Scalable and fault-tolerant.
Load balancing and caching.

Dynamically updating capability metadata.

THANKS !


APPENDIX


Capability in Grid Services Security


XPOLA
The infrastructure is built on a peer-to-peer chain-of-trust model.
No central administrators.
WS-Security compliant.
Extensible.
PKI and SAML based.
Dynamic and reusable (manually or automatically generated).
Composed of two parts:
Policy document (SAML, lifetime information, binding information, etc.).
Provider's signature.

Existing grid security solutions for fine-grained authorization did not address general Web/Grid services in compliance with the Web Services security specifications.
Because they rely on central administrators, other approaches do not address dynamic services.

Sample Capabilities File (highly simplified), GIS Domain

<?xml version='1.0' encoding="UTF-8" standalone="no" ?>
<!DOCTYPE WMT_MS_Capabilities SYSTEM "http://toro.ucs.indiana.edu:8086/xml/capabilities.dtd">
<Capabilities version="1.1.1" updateSequence="0">
  <Service>
    <Name>CGL_Mapping</Name>
    <Title>CGL_Mapping WMS</Title>
    <OnlineResource xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />
    <ContactInformation> .. </ContactInformation>
  </Service>
  <Capability>
    <Request>
      <GetCapabilities>
        <Format>WMS_XML</Format>
        <DCPType><HTTP><Get>
          <OnlineResource xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />
        </Get></HTTP></DCPType>
      </GetCapabilities>
      <GetMap>
        <Format>image/GIF</Format>
        <Format>image/PNG</Format>
        <DCPType><HTTP><Get>
          <OnlineResource xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />
        </Get></HTTP></DCPType>
      </GetMap>
    </Request>
    <Layer>
      <Name>California:Faults</Name>
      <Title>California:Faults</Title>
      <SRS>EPSG:4326</SRS>
      <LatLonBoundingBox minx="-180" miny="-82" maxx="180" maxy="82" />
    </Layer>
  </Capability>
</Capabilities>
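
A client that receives such a capabilities document would typically pull out the service name, the supported operations and the advertised layers. The sketch below does this with Python's standard ElementTree on a trimmed copy of the sample; the trimming (no DOCTYPE, no OnlineResource elements) and the variable names are choices made for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample capabilities file from the slide.
CAPABILITIES = """
<Capabilities version="1.1.1" updateSequence="0">
  <Service><Name>CGL_Mapping</Name></Service>
  <Capability>
    <Request>
      <GetCapabilities><Format>WMS_XML</Format></GetCapabilities>
      <GetMap><Format>image/GIF</Format><Format>image/PNG</Format></GetMap>
    </Request>
    <Layer>
      <Name>California:Faults</Name>
      <SRS>EPSG:4326</SRS>
      <LatLonBoundingBox minx="-180" miny="-82" maxx="180" maxy="82"/>
    </Layer>
  </Capability>
</Capabilities>
"""

root = ET.fromstring(CAPABILITIES)
service = root.findtext("Service/Name")

# Map each advertised operation to the formats it can return.
operations = {
    op.tag: [fmt.text for fmt in op.findall("Format")]
    for op in root.find("Capability/Request")
}

# Collect (name, SRS, bounding box) for every layer.
layers = [
    (
        layer.findtext("Name"),
        layer.findtext("SRS"),
        {k: float(v) for k, v in layer.find("LatLonBoundingBox").attrib.items()},
    )
    for layer in root.iter("Layer")
]

print(service)
print(operations)
print(layers)
```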

Dublin Core
Addresses the challenge of resource description and discovery.
A language for making a particular class of statements about resources.
There are two namespaces: the Dublin Core element set (dc) and Dublin Core qualifiers (dcq, e.g. dcq:iso8601).
Some elements of the Dublin Core metadata element set:
Title (dc:title), subject, description, creator, publisher, type, format, source, language, rights.

Using DC in RDF: specifications for DC in RDF are work in progress.
Statement form: Resource has (verb) property (dc:creator) X (e.g. Ahmet).
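
A minimal Dublin Core description can be written as XML elements in the dc namespace. The sketch below uses the standard element-set namespace URI (http://purl.org/dc/elements/1.1/); the record wrapper element and the chosen values are assumptions for illustration, not a prescribed serialization.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC)

# "record" is a hypothetical wrapper; Dublin Core itself only defines the elements.
record = ET.Element("record")
fields = [
    ("title", "Approaches to the Integration of Distributed and Heterogeneous Data Resources"),
    ("creator", "Ahmet Sayar"),
    ("language", "en"),
]
for name, value in fields:
    ET.SubElement(record, f"{{{DC}}}{name}").text = value

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Each element reads as one statement of the form "resource has property value", for example dc:creator with the value Ahmet Sayar.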

Sample Dublin Core

Source: http://www.ils.unc.edu/mrc/jcdl2006/slides/kunze.pdf

Open Archives Initiative (OAI)

OAI
Deals with the e-print server world.
Need to develop services that permit searching across papers housed at multiple repositories.
Repositories also need capabilities to automatically identify and copy papers that have been deposited in other repositories.
Defines an interface that permits e-print servers to expose the metadata for the papers they hold.
Service providers with similar metadata standards need to harvest this metadata.
Service providers act as a federation of repositories by indexing documents, so that multiple collections can be searched as though they form a single collection.

OAI-PMH
For the variety of communities engaged in publishing content on the Web.
Any networked server can employ the protocol to enable service providers to collect its metadata.
HTTP-based request-response transactions.
Service Providers:
Harvest metadata from Data Providers using the OAI protocol and use the returned metadata as a basis for building value-added services.

Data Providers (repositories):
Adopt the OAI technical framework as a means of exposing metadata about their content.
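
Since OAI-PMH is a plain HTTP request-response protocol, a harvest request is just a URL with a verb and a few arguments. The sketch below builds a ListRecords request using the protocol's standard argument names (verb, metadataPrefix, from, set, resumptionToken); the base URL and the helper function are hypothetical, and no network call is made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://repository.example.org/oai"  # hypothetical data provider


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     set_spec=None, resumption_token=None):
    """Build the URL a service provider would request to harvest records."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces all the others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


first_page = list_records_url(BASE_URL, from_date="2006-01-01")
next_page = list_records_url(BASE_URL, resumption_token="token-1")
print(first_page)
print(next_page)
```

The oai_dc prefix asks for unqualified Dublin Core, the one metadata format every data provider is required to support.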

Comments on OAI
OAI-PMH is ultimately only as useful as the metadata it transports.
The tendency of implementers to apply almost exclusively the lowest common denominator, unqualified Dublin Core, makes it difficult to implement more advanced search interface features.
Content providers should prefer more expressive metadata schemas, such as MARC or qualified DC, and find ways to augment human-generated descriptive metadata.

Sompel's Digital Library Approach

Sompel's Approach: Hierarchy Steps

Source: http://msc.mellon.org/Meetings/Interop/lagoze_data_model.pdf

Sompel's DL Data Model

Source: http://msc.mellon.org/Meetings/Interop/lagoze_data_model.pdf

Ogsa-DAI

Ogsa-DAI Figure
Source: http://www.globus.org/grid_software/data/dai.php

Perform Document
Source: http://www.ogsadai.org.uk/documentation/ogsadai-wsi-2.2/doc/interaction/Perform.html

MCS
MCS is a design for a Metadata Catalog Service that provides a mechanism for storing and accessing descriptive metadata attributes.
Requirements: store domain-independent attributes and user-defined attributes; query with a set of attributes; query with a logical name; authentication, authorization and auditing.
Allows users to discover data sets based on the values of descriptive attributes, rather than requiring them to know specific names or physical locations of data items.

MCAT vs. MCS


MCAT can be used only with SRB; MCS can be used only in the OGSA architecture.
MCAT stores both physical and logical addresses; MCS stores logical metadata attributes and handles that can be resolved by data location or data access services.
Both can be extended to serve application-specific metadata, but neither has a generalized way of doing so.

SRB


CLIENT
Example interaction with SRB using Scommands:

Sinit
Start interaction with SRB.

Spwd
Display the current position within the SRB repository.

Smeta -i -I UDSMD0=author -I UDSMD1=bob myFile
Add metadata describing the author of the file.

Smeta -i -I UDSMD0=author -I UDSMD1=arthur
Search for files with the author metadata set to arthur.

Sget myFile
Copy myFile from SRB to local storage.

Sreplicate -S anotherResource myFile
Create a replica of myFile on anotherResource.

Srm myFile
Remove myFile (and all replicas) from SRB.

Sexit
End interaction with SRB.
