DBTA Workshop on Semantic Data Processing
Over the last few years, not only has the sheer volume of data that many applications have to handle grown significantly, but so has its heterogeneity. Hence, data integration has become more and more important for dealing with the heterogeneity of different data sources. A key to data integration is the use of semantic technology, which focuses on the meaning of and relationships between data and information. While traditional information technology predefines meanings and relationships in data formats and application program code at design time, semantic technology encodes meanings separately from data and content files, and separately from application program code (i.e., as an abstraction layer above traditional information technology). In addition, as data and information are inherently afflicted with uncertainty, semantic fusion (e.g., of multi-source sensory data) is a way to reduce uncertainty in the reasoning process.
This workshop aims at bringing together researchers and practitioners actively working on aspects of Data Integration and Semantic Data Processing in the context of Big Data to foster discussions on novel scientific trends and recent developments from leading-edge industry and academic institutions.
After a workshop entitled Big Data, Cloud Data Management, and NoSQL (with a focus on the 'Volume' of Big Data), which was held in October 2012, this workshop will be dedicated to 'Variety' in Big Data. For summer 2014, a third workshop in this mini-series, with a focus on 'Velocity' in Big Data (i.e., data streams), is planned.
Please note: The workshop is for SI/DBTA members only.
Date: February 7, 2014, 10:00 – 16:30, Welcome Coffee at 9:30
Place: Sky Lounge, Stade de Suisse, Berne
- Marc Lieber, Principal Consultant at Trivadis AG, Basel: Introduction to ontologies; Technological challenges; Combining relational databases and ontologies.
- Prof. Abraham Bernstein, Ph.D., University of Zürich: Large-Scale Graph Processing with Signal/Collect.
- Dr. Vincent Lenders, Research Program Manager at armasuisse S&T, Thun: Big Data Challenges and Solutions in Open Source Intelligence.
- Prof. Dr. Philippe Cudre-Mauroux, University of Fribourg: Entity-Centric Data Management.
- Dr. Albert Blarer, Principal Consultant at Trivadis AG, Zürich: Handling uncertainty of data and information.
- Dr. Can Türker, Functional Genomics Center Zurich (FGCZ): Big Issues in (Life Sciences) Research Management Systems.
- Marc Schöni, Microsoft Schweiz; Meinrad Weiss, Senior Technology Manager, Trivadis AG: The Microsoft Big Data architecture approach.
- End of workshop
Prof. Dr. Heiko Schuldt, University of Basel
Dr. Martin Wunderli, CTO, Trivadis AG
- Marc Lieber: Introduction to ontologies; Technological challenges; Combining relational databases and ontologies.
An ontology is a shared conceptualization of knowledge in a particular domain; it consists of a collection of classes, properties, and instances. “Semantic technologies” refers to a broad spectrum of techniques for finding signals in large or complex ontology-oriented datasets. In this talk I will present an overview of the world of graph databases and focus especially on RDF triple stores for storing W3C Semantic Web-based ontologies. Some providers, such as Oracle, offer an interesting approach for storing such triples in a relational database or in a NoSQL database and combining ontology-oriented data with relationally stored data. I will also present some practical use cases implemented in the pharmaceutical and banking industries.
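To make the triple-store idea concrete, the following is a minimal, self-contained sketch of the underlying data model: facts as (subject, predicate, object) triples, queried by pattern matching with wildcards. The example triples and prefixes are invented for illustration; this is a toy, not any vendor's implementation.

```python
# Toy RDF-style triple store: a set of (subject, predicate, object)
# triples, queried by patterns where None acts as a wildcard.
# All identifiers here (ex:Aspirin, etc.) are hypothetical examples.

triples = {
    ("ex:Aspirin", "rdf:type", "ex:Drug"),
    ("ex:Aspirin", "ex:treats", "ex:Pain"),
    ("ex:Drug", "rdfs:subClassOf", "ex:Substance"),
}

def match(pattern, store):
    """Return all triples matching a pattern; None matches anything."""
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# All facts about ex:Aspirin (two triples in this example):
print(match(("ex:Aspirin", None, None), triples))
```

A real triple store adds indexing over the three positions and a query language (SPARQL), but the pattern-matching core shown here is the essence of what such systems evaluate.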
- Prof. Abraham Bernstein, Ph.D.: Large-Scale Graph Processing with Signal/Collect.
Graphs are probably the most versatile data structure used in computers. Just consider the plethora of information produced daily by people (on Facebook, Twitter, by email, in newspapers, on Mechanical Turk, etc.), by sensors (e.g., environmental data, cellular information), or by computers (e.g., financial data on stock exchanges, where 60% of all trades are algorithmic, or Semantic Web data). Most of these data sources readily map to typed graphs, where vertices either represent entities (e.g., people) or large data items such as texts, images, or sound/movie snippets, and the edges represent relations between them (e.g., friends). The growing availability of these graphs enables opportunities to discover new knowledge by interlinking and analyzing previously unconnected data sets. Support for processing large graphs is, however, rather limited. Some use relational databases or MapReduce – good solutions for many problems. However, the underlying models of these approaches do not map nicely to typed graphs and require programmers to shoehorn their problem into those formats.
In this talk I will present the Signal/Collect programming model for distributed graph processing. I will demonstrate that this abstraction can capture the essence of many algorithms on graphs in a concise way by giving Signal/Collect adaptations of various relevant algorithms. I will also show that Signal/Collect algorithms can process web-scale graphs in minutes. Finally, I will present TripleRush, a high-performance, distributed graph-store that was built on top of Signal/Collect.
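As a rough intuition for the vertex-centric abstraction the talk describes (the actual Signal/Collect framework is a distributed Scala system; this Python toy only mimics its signal and collect phases), here is single-source shortest paths expressed in that style. The graph and weights are invented for illustration.

```python
# Signal/Collect-style iteration, sketched in plain Python:
# each vertex *signals* a value along its outgoing edges, then
# *collects* incoming signals to update its own state, until no
# state changes. State here: best-known distance from source "a".

import math

# Directed weighted graph: vertex -> [(neighbor, edge_weight), ...]
edges = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}

state = {v: math.inf for v in edges}
state["a"] = 0  # distance of the source to itself

changed = True
while changed:
    changed = False
    inbox = {v: [] for v in edges}
    for v, out in edges.items():          # signal phase
        for target, weight in out:
            inbox[target].append(state[v] + weight)
    for v, signals in inbox.items():      # collect phase
        best = min([state[v]] + signals)
        if best < state[v]:
            state[v] = best
            changed = True

print(state)  # shortest distances from "a"
```

The appeal of the model is that the per-vertex signal and collect functions are all the programmer writes; distribution, scheduling, and convergence detection are the framework's job.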
Bio: Abraham Bernstein is a full professor of informatics at the University of Zurich, Switzerland. His current research focuses on various aspects of the semantic web, knowledge discovery, and crowd computing / collective intelligence. His work is based on both social science (organizational psychology/sociology/economics) and technical (computer science, artificial intelligence) foundations. Mr. Bernstein holds a Ph.D. from MIT and a Diploma in Computer Science (comparable to an M.Sc.) from the Swiss Federal Institute of Technology in Zurich (ETH). He is on the editorial board of the International Journal on Semantic Web and Information Systems, the Informatik Spektrum by Springer, the Journal of Web Semantics, the ACM Transactions on Intelligent Interactive Systems, and the ACM Transactions on Internet Technology.
- Dr. Vincent Lenders: Big Data Challenges and Solutions in Open Source Intelligence.
Open-source intelligence (OSINT) is intelligence derived from public information. OSINT is currently experiencing a huge boom in the industry. While traditional OSINT sources such as news reports, journals or TV are well known, the rise of online social media has led to a new situation in which hundreds of millions of people around the world are actively contributing to openly available content. Recent use cases have demonstrated the collective power of this user-generated content with promising new applications in investment, marketing, defense and security. However, the challenge in a hyper-connected world where the volume of data and information from heterogeneous sources continues to grow exponentially is how to find the weak signals and interconnections in this ocean of superfluous noise. Innovative and scalable methods are therefore needed to retrieve, store and find useful patterns in petabytes of unstructured data efficiently. This talk will survey recent challenges in OSINT and potential solutions to turn large amounts of unstructured data into actionable intelligence.
- Prof. Dr. Philippe Cudre-Mauroux: Entity-Centric Data Management.
Until recently, structured (e.g., relational) and unstructured (e.g., textual) data were managed very differently: Structured data was queried declaratively using languages such as SQL, while unstructured data was searched using boolean queries over inverted indices. Today, we witness the rapid emergence of entity-centric techniques to bridge the gap between different types of content and manage both unstructured and structured data more effectively. I will start this talk by giving a few examples of entity-centric data management. I will then describe two recent systems that were built in my lab and revolve around entity-centric data management techniques: ZenCrowd, a socio-technical platform that automatically connects HTML pages to structured entities, and Diplodocus[RDF], a scalable and efficient back-end to manage semantic graphs of entities.
Bio: Philippe Cudre-Mauroux is a Swiss-NSF Professor and the director of the eXascale Infolab (http://exascale.info/) at the University of Fribourg in Switzerland. Previously, he was a postdoctoral associate working in the Database Systems group at MIT. He received his Ph.D. from the Swiss Federal Institute of Technology EPFL, where he won both the Doctorate Award and the EPFL Press Mention in 2007. Before joining the University of Fribourg, he worked on distributed information and media management for HP, IBM Research, and Microsoft Research. His research interests are in large-scale data management infrastructures for non-relational data.
- Dr. Can Türker: Big Issues in (Life Sciences) Research Management Systems.
Core facilities, such as the Functional Genomics Center Zurich (FGCZ), are research enablers and supporters. As such, they have to transparently manage the entire life cycle of a research project, from its submission to its completion and publication. This goes far beyond just storing the big research data produced by the researchers within these projects. On the one hand, basic project management issues such as project submission, reviewing, member management, charging and invoicing, etc. must be supported. On the other hand, a framework is required for capturing quality-controlled scientific annotation of analytical data, integrating different analytical technologies and data analysis tools, and supporting user communication as well as integrative, access-controlled data search and usage. This talk discusses research management issues and presents B-Fabric, a system developed and running at the FGCZ that tackles these issues.
- Marc Schöni, Meinrad Weiss: The Microsoft Big Data architecture approach.
Microsoft has been doing Big Data since long before it was a mega-trend in the market: at Bing, over 100 petabytes of data are analyzed to deliver high-quality search results. More broadly, Microsoft provides a range of solutions to help customers address big data challenges. The family of data warehouse solutions from Microsoft® SQL Server®, SQL Server® Fast Track Data Warehouse, and SQL Server® Parallel Data Warehouse offers a robust and scalable platform for storing and analyzing data in a traditional data warehouse. Parallel Data Warehouse (PDW) offers customers enterprise-class performance thanks to its massively parallel processing architecture and can handle massive data volumes of over 1,000 TB. In addition to the traditional capabilities mentioned above, Microsoft is embracing Apache Hadoop™ as part of an end-to-end roadmap to deliver on the vision of providing business insights to all users by activating new types of data of any size. Microsoft is working hard to broaden the accessibility and usage of Hadoop for users, developers, and IT professionals. End users can use the Hive ODBC Driver or the Hive Add-in for Excel to analyze data from Hadoop using familiar tools such as Microsoft Excel and award-winning BI clients such as Power Query, PowerPivot, Power View, and Power Map for Excel. The session will give an overview of the different pieces of the Microsoft approach and show an end-to-end example of how they work together.