Data Modeling for Big Data

by Jinbao Zhu, Principal Software Engineer, and Allen Wang, Manager, Software Engineering, CA Technologies

About the authors

Jinbao Zhu is a Principal Software Engineer at CA Technologies in Beijing, where he is responsible for designing solutions and developing key features for the data modeling product line. He has more than 11 years' experience in developing and deploying enterprise-level software solutions focused on data processing and applications. Before joining CA Technologies, Jinbao worked in areas such as WebGIS architecture and solutions, the MS Exchange API, natural language processing, and parallel message analysis. Jinbao volunteers with a big data group researching Hadoop clusters and subprojects, and is developing several POC prototypes.

In the Internet era, the volume of data we deal with has grown to terabytes and petabytes. As data volume keeps growing, the types of data generated by applications grow richer as well. As a result, traditional relational databases are challenged to capture, store, search, share, analyze, and visualize data. Many IT companies attempt to manage big data challenges with a NoSQL ("not only SQL") database, such as Cassandra or HBase, and may employ a distributed computing system such as Hadoop. NoSQL databases are typically key-value stores that are non-relational, distributed, horizontally scalable, and schema-free.

So, do we still need data modeling today? Traditional data modeling focuses on resolving the complexity of relationships among schema-enabled data, but these considerations do not apply to non-relational, schema-less databases. The old ways of data modeling no longer apply, and we need a new methodology to manage big data for maximum business value.

Big Data Model

The big data model is an abstract layer used to manage the data stored in physical devices. Today we have large volumes of data in different formats stored across global devices. The big data model provides a visual way to manage these data resources and creates a fundamental data architecture, so that more applications can optimize data reuse and reduce computing costs. The following diagram illustrates the general architecture of big data:

Figure 1. General architecture of big data.

The diagram shows three model layers. The physical data layer is the data we have in a big data system; it can include different data types such as video, audio, logs, business tables, and so on. The data modeling layer is the abstract data model we build to manage the physical data. The computing modeling layer is the application layer we build to retrieve information for business value.

With these three layers, we build data models that separate physical data from data use: an application accesses data through the data model instead of accessing the physical data directly. This makes applications flexible and data manageable. To construct a big data model, we must first create data blocks based on data storage, data type, relationships, read-write requirements, and so on. Further, we must have modeling applications maintain these models, so that the data models always display and store the latest data.

Hybrid Data Modeling

Though NoSQL databases evolved to resolve specific issues in managing big data, there are good reasons for SQL's enduring popularity, and the current trend is to provide both SQL and NoSQL features in the same database. Some traditional RDBMS vendors, such as Microsoft and Oracle, provide NoSQL features in recent releases, such as the columnar storage support in Microsoft SQL Server 2012 and Teradata Warehouse 14. From the NoSQL side, projects such as Hadoop's Hive subproject and Cassandra's CQL query language have grown quickly to provide an SQL-like façade over a NoSQL database.
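To make the idea of an SQL-like façade concrete, here is a minimal sketch of the pattern. It is not Hive or CQL; all class and method names are hypothetical. A tiny query layer accepts a restricted SELECT-style call and translates it into scans over a schema-free key-value store:

```python
# Minimal sketch of an SQL-like facade over a key-value store.
# Hypothetical illustration of the pattern behind Hive/CQL, not their APIs.

class KeyValueStore:
    """A schema-free store: row key -> dict of column -> value."""
    def __init__(self):
        self.rows = {}

    def put(self, key, columns):
        self.rows.setdefault(key, {}).update(columns)

    def scan(self):
        return self.rows.items()

class SqlFacade:
    """Accepts a restricted SELECT ... WHERE col = value query."""
    def __init__(self, store):
        self.store = store

    def select(self, columns, where=None):
        result = []
        for key, row in self.store.scan():
            if where:
                col, value = where
                if row.get(col) != value:
                    continue
            # Project only the requested columns; missing columns -> None.
            result.append({c: row.get(c) for c in columns})
        return result

store = KeyValueStore()
store.put("u1", {"name": "Ann", "city": "Beijing"})
store.put("u2", {"name": "Bob", "city": "Paris"})
facade = SqlFacade(store)
print(facade.select(["name"], where=("city", "Beijing")))  # [{'name': 'Ann'}]
```

The point of the pattern is that the store itself stays schema-free; the façade imposes a query vocabulary on top without constraining how rows are physically stored.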
Google, the leader in big data, is also developing a hybrid distributed database, Megastore, that can run SQL-specific features on top of a NoSQL storage layer. That is not surprising: both the quantity and variety of data are growing rapidly, and we need many models and tools to process it. In the past we tried to cram data of all types into an RDBMS; now we must keep some of it out.

As big data drives changes in the collection and processing of data, we need to reconsider existing data and storage models to allow for the increasing importance of implementation issues such as the performance cost of join operations and the pressure on storage space as data grows to exceed the capacity of hardware storage. There are new options to resolve these issues: we can migrate big data into a NoSQL database, possibly running a hybrid system as a tradeoff, taking advantage of the low risks and costs of these promising new options.

Remodeling Tools and Techniques

In the age of big data, popular data modeling tools such as CA ERwin® Data Modeler continue to help us analyze and understand our data architectures by applying hybrid data modeling concepts. Instead of creating a purely relational data model, we can now embed NoSQL submodels within a relational data model. In general, data size and performance bottlenecks are the factors that help us decide which data goes to the NoSQL system. Furthermore, we can redesign models: for example, we can denormalize to reduce the dependency on relationships, aggregating attributes from different entities into a single entity in ER diagrams. Since we need to present the hierarchy of the original data in the documentation, we need a good data representation format to assist in understanding the original data relationships. The attribute-value orientation of JSON is well-suited to representing the key-value structure of many NoSQL stores, though XML still has its place.
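As a sketch of the denormalization step described above, the following example aggregates attributes from two related entities into one JSON document, trading duplication for join-free reads. The customer and order records are fabricated for illustration:

```python
import json

# Hypothetical relational rows: two entities joined by customer_id.
customers = {1: {"customer_id": 1, "name": "Ann", "city": "Beijing"}}
orders = [
    {"order_id": 101, "customer_id": 1, "amount": 25.0},
    {"order_id": 102, "customer_id": 1, "amount": 40.0},
]

def denormalize(customer, all_orders):
    """Embed a customer's orders inside the customer document,
    so a key-value store can serve reads without a join."""
    doc = dict(customer)
    doc["orders"] = [
        {"order_id": o["order_id"], "amount": o["amount"]}
        for o in all_orders
        if o["customer_id"] == customer["customer_id"]
    ]
    return doc

doc = denormalize(customers[1], orders)
print(json.dumps(doc, indent=2))
```

The resulting document repeats customer attributes that a normalized schema would store once; that duplication is the cost of join-free access.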
About the authors (continued)

Allen Wang is a development manager on CA ERwin at CA Technologies in Beijing. He has worked for CA Technologies for over six years and has experience with information governance and data modeling. Allen leads the CA ERwin team in the China Technology Center. He is interested in Agile methodology, data modeling, and big data.

With this transformation in representation, modeling tools like CA ERwin Data Modeler can analyze schema changes and generate a code template to bridge the current RDBMS and a target NoSQL system. This assists with data migration from an RDBMS to a NoSQL database, and gives the whole picture of data deployed across various data systems. The following diagram illustrates the data migration design:

Data Migration

NoSQL data. Data with the following characteristics is well-suited to a NoSQL system:
1. Rapidly growing data volume
2. Columnar growth of data
3. Document and tuple data
4. Hierarchical and graph data

RDBMS data. Data with the following characteristics might be better suited to a traditional RDBMS:
5. OLTP (online transaction processing) requirements
6. ACID (atomicity, consistency, isolation, durability) requirements
7. Complex data relationships
8. Complex query requirements

Data Consumption

It often takes more effort to migrate data consumers to new storage in a new system than to migrate the data itself. Once we have used a hybrid model to migrate data, applications designed to access the data through relational methodology must be redesigned to adapt to the new data architecture. Hopefully, future hybrid databases will provide built-in migration tools to help leverage the potential of NoSQL. For the present, remodeling data is critical before an existing system is migrated to NoSQL.
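The suitability criteria above can be folded into a simple triage sketch. This is only an illustration of how the two lists might drive a migration decision; the characteristic names and the scoring rule are hypothetical, not taken from any tool:

```python
# Hypothetical triage helper encoding the suitability lists above.
# Each characteristic is a flag describing a data set.

NOSQL_TRAITS = {"rapid_volume_growth", "columnar_growth",
                "document_or_tuple", "hierarchical_or_graph"}
RDBMS_TRAITS = {"oltp_required", "acid_required",
                "complex_relationships", "complex_queries"}

def suggest_store(traits):
    """Return 'nosql', 'rdbms', or 'hybrid' based on which
    characteristic list the data set matches."""
    nosql_score = len(traits & NOSQL_TRAITS)
    rdbms_score = len(traits & RDBMS_TRAITS)
    if nosql_score and rdbms_score:
        return "hybrid"   # traits from both lists: consider a hybrid system
    if nosql_score > rdbms_score:
        return "nosql"
    return "rdbms"

print(suggest_store({"rapid_volume_growth", "hierarchical_or_graph"}))  # nosql
print(suggest_store({"oltp_required", "rapid_volume_growth"}))          # hybrid
```

In practice the decision also weighs data size and performance bottlenecks, as noted earlier; the sketch only shows how the characteristic lists partition the space.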
Figure 2. Data migration design.

In addition, modeling tools must be evaluated to verify that they meet the new requirements.

Physical Data-aware Modeling

For the past 30 years, 'structure first, collect later' has been the data modeling approach. With this approach, we determine the data structure before any data goes into the data system; in other words, the data structure definition is determined by how the data is used. With big data, this traditional approach is no longer applicable, as data doesn't necessarily have fixed schemas or formats. As a result, we need a new modeling methodology to suit big data characteristics and deployment. I propose a new approach for managing the variety of big data.

Dynamic Schema Definition

Even with big unstructured or semi-structured data, we still need to define schemas, because data relationships can be more complex than before. Hiding data relationship logic in a program is not a good way to manage data complexity. Because big data uses the 'structure later' approach, in most cases we can only know the data schema after the data has been created. The proposed methodology is to define a schema on existing data after it has been collected and stored, and to use that schema to retrieve the data at runtime during processing. In the definition, the schema function should be as atomic and isolated as possible. To achieve this, we must determine the scope of data the schema function applies to, and the versions of data the schema function can work with. This is called a "dynamic schema," to distinguish it from the traditional fixed schema defined before data is collected. The following is an example of a schema function showing how to fetch BirthDate and PostCode from an identity number.

Figure 3

Data Block

A data block is physical data defined with metadata. We must know the actual location of the data, so that we can model it in the modeling layer and so that the computing program knows where to retrieve it. Include the following fields in the metadata definition:
- Region servers: the physical location of data storage
- Data access path: the path a program can use to access the data
- Data format of the saved data
- Other attributes, added as needed when the data is used

As a general rule for managing similar types of data, place data of identical format and schema functions in the same data block. The data in a data block should be stored in physically close locations. For example, data that might be used at the same time should be stored in the same rack, or at least in the same region.

Data Version

In a real-time production scenario, data may have different formats at different times in its lifecycle. Data models should manage these versions of data, so that schema functions can ensure compatible data updates. Schema functions should be version-aware, so that historic data can be retrieved with the appropriate legacy schema function.

Data Relationship

Data relationships among multiple data blocks still exist in practice, although most of the data in a big data system is designed to be relationship-less (that is, without joins) to facilitate scalability. However, denormalization may lead to data duplication in storage, and to problems in a distributed system if the data has to be updated frequently. We need to balance duplicating data for scalability and query performance against practical requirements for update performance. Relationships are defined on schema functions in data blocks; the relationships are just data links. Optional proxy applications can exist to maintain consistency when data changes, such as when data blocks are deleted or the data format changes.
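A minimal sketch pulling these ideas together: a data block carrying location metadata plus version-aware schema functions. The metadata fields follow the list above; the identity-number layout (first 6 digits as a post code, digits 7 to 14 as a YYYYMMDD birth date) is an assumption standing in for the schema-function figure, which is not reproduced here:

```python
# Sketch: a data block with metadata and version-aware schema functions.
# The identity-number layout (post code in digits 1-6, birth date in
# digits 7-14) is an assumed example, not a real specification.

class DataBlock:
    def __init__(self, region_servers, access_path, data_format):
        # Metadata fields from the data block definition above.
        self.metadata = {
            "region_servers": region_servers,
            "access_path": access_path,
            "data_format": data_format,
        }
        # version -> {field_name: schema function}
        self.schemas = {}

    def register_schema(self, version, field, fn):
        """Attach an atomic, isolated schema function for one field."""
        self.schemas.setdefault(version, {})[field] = fn

    def read(self, record, version, field):
        """Apply the schema function matching the record's version,
        so historic data is read with its legacy schema."""
        return self.schemas[version][field](record)

block = DataBlock(region_servers=["rack1-server1"],
                  access_path="/data/identity",   # hypothetical path
                  data_format="fixed-width-text")

# Version 1: 18-character identity numbers.
block.register_schema(1, "PostCode", lambda rec: rec[0:6])
block.register_schema(1, "BirthDate", lambda rec: rec[6:14])

record = "100080199001023456"  # fabricated identity number
print(block.read(record, 1, "BirthDate"))  # 19900102
print(block.read(record, 1, "PostCode"))   # 100080
```

If a later data version changed the record format, a second set of schema functions would be registered under version 2, and historic records would still be read through the version 1 functions.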
The important benefit is that computing tasks can fetch data using the data relationship definitions across a set of data blocks. This can also be proactively designed in the modeling layer.

Data Computing Modeling

On top of the physical data model, we normally need to create data flows and computing tasks for business requirements. With physical-data modeling, it is possible to create a computing model that presents the logical path of computing over the data. This helps computing tasks be well designed and enables more efficient data reuse.

Hadoop provides a new distributed data processing model, and its HBase database provides an impressive solution for data replication, backup, scalability, and so on. Hadoop also provides the Map/Reduce computing framework to retrieve value from data stored in a distributed system. Map/Reduce is a framework for parallel processing in which mappers divide a problem into smaller sub-problems and feed reducers that process the sub-problems and produce the final answer.

As a business grows, computing requirements become more and more complex. For example, some ordinary computing tasks might need several Map/Reduce subtasks to meet specific business requirements. We need a better design for the dataflow of complex computing. Reuse of existing Mappers and Reducers is also a potential requirement for effective production environments.

Computing Unit

A Mapper and Reducer pair is the minimum unit for computing the final output from the input. A good design is a simple one, because it aids debugging and provides a more stable environment. The key point is how to make the Mapper and Reducer pair more independent of the physical data, so that we can better reuse these computing units to accelerate development and reduce cost.
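As an illustration of such a computing unit, here is a minimal in-process Map/Reduce sketch in Python. The real framework runs distributed jobs; this only mirrors the mapper, shuffle, and reducer shape. Note that the mapper and reducer know nothing about where the records physically live, which is what makes the pair reusable:

```python
from collections import defaultdict

# Mapper: emit (key, value) pairs from one input record.
def mapper(record):
    for word in record.split():
        yield (word, 1)

# Reducer: combine all values collected for one key.
def reducer(key, values):
    return (key, sum(values))

def run_map_reduce(records, mapper, reducer):
    """Tiny in-process driver: map every record, group by key
    (the 'shuffle' step), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

logs = ["error disk full", "warn disk slow", "error net down"]
print(run_map_reduce(logs, mapper, reducer))
# {'disk': 2, 'down': 1, 'error': 2, 'full': 1, 'net': 1, 'slow': 1, 'warn': 1}
```

Because the driver takes any mapper/reducer pair, the same units can be recombined in a workflow, with the output of one stage serving as the input of the next.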
Computing Flow

As the following diagram shows, rather than writing bigger Mappers and Reducers, we can combine existing Mappers and Reducers in a workflow: the output of the first Map/Reduce is the input of the second Map/Reduce. The whole workflow can be visualized and managed based on the data model introduced above.

Figure 4

Data computing modeling helps manage the computing algorithms applied to data stored in raw format. When new tools are developed, the data model will assist in carrying the computing task forward, and help monitor the progress of data processing. Meanwhile, we can embed the data model into a big data system like Hadoop: once data is moved or imported, Hadoop can synchronize with the data model first.

Integration with Hive

Hive creates separate data systems for better querying and computing. The tables in Hive have well-defined schemas. We can reverse engineer these schemas, import them into specific data blocks, and integrate Hive data and external raw data in Hadoop's HBase database within the same data model. This provides a useful basic architecture when Hive has to extract data from raw data and run computing tasks on both data sources during data mining.

Conclusion

The complexity of a relational database limits the scalability of data storage, but makes it very easy to query data through an SQL engine. Big data NoSQL systems have the opposite characteristics: unlimited scalability with more limited query capabilities. The challenge of big data is querying data easily. Creating data models on physical data and computing paths helps manage raw data. The future will bring more hybrid systems combining the attributes of both approaches. Meanwhile, the dynamic schema model proposed in this paper and systems such as Hive offer help in the challenge of managing big data.

NOTICES

Copyright © 2012 CA. All rights reserved. All trademarks, trade names, service marks and logos referenced herein belong to their respective companies. The information in this publication could include typographical errors or technical inaccuracies, and CA, Inc. ("CA") and the authors assume no responsibility for its accuracy or completeness. The statements and opinions expressed in this publication are those of the authors and are not necessarily those of CA. Certain information in this publication may outline CA's general product direction. However, CA may make modifications to any CA product, software program, service, method or procedure described in this publication at any time without notice, and the development, release and timing of any features or functionality described in this publication remain at CA's sole discretion. CA will support only the referenced products in accordance with (i) the documentation and specifications provided with the referenced product, and (ii) CA's then-current maintenance and support policy for the referenced product.
Notwithstanding anything in this publication to the contrary, this publication shall not: (i) constitute product documentation or specifications under any existing or future written license agreement or services agreement relating to any CA software product, or be subject to any warranty set forth in any such written agreement; (ii) serve to affect the rights and/or obligations of CA or its licensees under any existing or future written license agreement or services agreement relating to any CA software product; or (iii) serve to amend any product documentation or specifications for any CA software product. Any reference in this publication to third-party products and websites is provided for convenience only and shall not serve as the authors' or CA's endorsement of such products or websites. Your use of such products, websites, any information regarding such products or any materials provided with such products or on such websites shall be at your own risk. To the extent permitted by applicable law, the content of this publication is provided "AS IS" without warranty of any kind, including, without limitation, any implied warranties of merchantability, fitness for a particular purpose, or non-infringement. In no event will the authors or CA be liable for any loss or damage, direct or indirect, arising from or related to the use of this publication, including, without limitation, lost profits, lost investment, business interruption, goodwill or lost data, even if expressly advised in advance of the possibility of such damages. Neither the content of this publication nor any software product or service referenced herein serves as a substitute for your compliance with any laws (including but not limited to any act, statute, regulation, rule, directive, standard, policy, administrative order, executive order, and so on (collectively, "Laws")) referenced herein or otherwise or any contract obligations with any third parties.
You should consult with competent legal counsel regarding any such Laws or contract obligations.