Data Virtualization - marketing slides versus reality
Data virtualization has recently become a hot topic, promising faster and more flexible operations in the Business Intelligence world.
After seeing the vendors' PowerPoint marketing materials, one might get the impression that acquiring a DV server makes it possible to join pretty much every data source known in an organization, forget about performance, and expect immediate results on reports and dashboards. Furthermore, business users might from now on forget about the need to maintain a whole bunch of BI, DW and ETL consultants, because they are promised the ability to self-serve. And of course that problematic data warehouse database and ETL server can be shut down and buried...
Is it really so?
Data virtualization weaknesses and challenges
Data virtualization can bring value to an organization; however, there are many challenges and potential disadvantages when it is used inappropriately:
- Having a data virtualization platform does not solve any business-specific problem. It might help identify where a problem lies by splitting complex queries into intermediate views, but a 'single version of the truth' still requires close cooperation between managers, business analysts and IT.
- When a query is submitted, the DV platform tries to guess the optimal way to fetch and join the data. In many (if not most) cases it is simply impossible to perform this operation in a timely and reliable fashion. Making the source data efficiently accessible still requires a well-thought-out IT strategy and architecture.
- Data virtualization is not a magic tool that can join one million records from one database with five million records from another when network traffic is involved. It uses standard JDBC, ODBC or native connectors, and in all cases the data needs to be sent across the network in order to be joined, grouped or matched.
- Data virtualization can be very powerful when the data is queried for one product or one customer, or is otherwise heavily filtered, preferably with small amounts of underlying data. This looks great in sales presentations and during POCs; however, most real-world scenarios need to process millions of underlying rows. This will take time, in fact as much time as with any other ETL tool or reporting platform.
- A good approach is to install a DV platform only when we are really sure that it resolves a specific business problem and is a good fit for it. It is a very common scenario to end up trying to make a DV tool act as an ETL tool.
- Data virtualization software is hard for business users to use. Business analysts without a background in SQL, databases and IT in general are very unlikely to run successful self-service queries. It might be a perfect solution for data scientists, though.
- When performance, stability, availability, 100% data consistency and correctness are a must, use the traditional DW approach.
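The network cost described in the list above can be illustrated with a small sketch. It is not how any particular DV product works: two in-memory SQLite databases stand in for two remote sources, and the table names, row counts and the `federated_total` helper are all assumptions made up for the illustration. The point is only to count how many rows have to leave the sources before the join can happen, with and without a filter pushed down to them.

```python
import sqlite3

def make_source(table, ddl, rows):
    """Create an in-memory SQLite database standing in for one remote source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    placeholders = ",".join("?" * len(rows[0]))
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return conn

# Hypothetical source A: 1,000 customers.
customers = make_source(
    "customers",
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)",
    [(i, f"customer-{i}") for i in range(1000)],
)

# Hypothetical source B: 5,000 orders, five per customer.
orders = make_source(
    "orders",
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)",
    [(i, i % 1000, 10.0) for i in range(5000)],
)

def federated_total(customer_id=None):
    """Join both sources in the 'virtualization layer' (this process).

    Returns (total amount per customer name, rows fetched from the sources).
    When customer_id is given, the filter is pushed down to both sources,
    so only the matching rows are transferred.
    """
    if customer_id is None:
        cust_rows = customers.execute("SELECT id, name FROM customers").fetchall()
        order_rows = orders.execute(
            "SELECT customer_id, amount FROM orders"
        ).fetchall()
    else:
        cust_rows = customers.execute(
            "SELECT id, name FROM customers WHERE id = ?", (customer_id,)
        ).fetchall()
        order_rows = orders.execute(
            "SELECT customer_id, amount FROM orders WHERE customer_id = ?",
            (customer_id,),
        ).fetchall()

    transferred = len(cust_rows) + len(order_rows)

    # The join itself happens here, after the rows have been fetched.
    names = dict(cust_rows)
    totals = {}
    for cid, amount in order_rows:
        name = names[cid]
        totals[name] = totals.get(name, 0.0) + amount
    return totals, transferred

full_totals, full_rows = federated_total()
one_totals, one_rows = federated_total(customer_id=42)
print(full_rows, one_rows)  # 6000 6
```

The heavily filtered query moves 6 rows, the unfiltered one moves all 6,000. With millions of rows behind real network connections, the unfiltered case is the one that takes as long as a conventional ETL job.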
Data virtualization strengths and benefits
In which scenarios is data virtualization useful and a good choice?
- Bi-modal Logical Data Warehouse. In this scenario a logical data warehouse (LDW) is created with a data virtualization tool and complements the traditional, controlled, centralized enterprise data warehouse (EDW). The LDW might use the EDW data where necessary and combine it with data from source systems, intermediate spreadsheets, etc.
- Right-time processing (real-time or near real-time) and microbatching (small portions of the data processed constantly) instead of one big daily load.
- When the key is to start fast and deliver some business results very quickly. In this scenario failure is acceptable to some degree and failing fast is good. Priceless for POCs. A project can be delivered in weeks instead of months or years.
- When a project or a task is focused on data exploration, some areas are unforeseen and experimenting might be involved.
- A need to read and process unstructured and semi-structured data, including Excel, XML, JSON and web data (especially in small amounts), as well as data from web services, document databases, etc.
- A data virtualization platform has built-in patterns for accessing the data (SQL for databases, MDX for OLAP, REST and SOAP for web services, events in message queues, web site crawling). This makes it transparent to the end users where the data comes from.
- In a typical data virtualization scenario the data doesn't need to be in relational form; hierarchical is fine.
- It supports an Agile Business Intelligence approach and is easily extensible.
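The idea of source-specific access patterns hidden behind one view, and of hierarchical data being acceptable as-is, can be sketched in a few lines. This is only an illustration under stated assumptions: the `incidents` table, the `TEAM_JSON` payload and all function names are hypothetical, an in-memory SQLite database stands in for a relational source, and a JSON string stands in for what a web service might return.

```python
import json
import sqlite3

# Hypothetical relational source: an in-memory SQLite table of incidents.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE incidents (id INTEGER, owner TEXT, status TEXT)")
db.executemany(
    "INSERT INTO incidents VALUES (?, ?, ?)",
    [(1, "alice", "open"), (2, "bob", "closed"), (3, "alice", "open")],
)

# Hypothetical hierarchical source: JSON as a web service might return it.
TEAM_JSON = json.dumps({
    "teams": [
        {"name": "network", "members": [{"login": "alice"}]},
        {"name": "storage", "members": [{"login": "bob"}]},
    ]
})

def read_incidents():
    """Relational adapter: yield each row as a plain dict."""
    cur = db.execute("SELECT id, owner, status FROM incidents ORDER BY id")
    cols = [c[0] for c in cur.description]
    for row in cur:
        yield dict(zip(cols, row))

def read_team_members():
    """Hierarchical adapter: flatten nested JSON into one dict per member."""
    for team in json.loads(TEAM_JSON)["teams"]:
        for member in team["members"]:
            yield {"login": member["login"], "team": team["name"]}

def open_incidents_by_team():
    """Virtual view: the caller never sees which source a field came from."""
    team_of = {m["login"]: m["team"] for m in read_team_members()}
    counts = {}
    for inc in read_incidents():
        if inc["status"] == "open":
            team = team_of.get(inc["owner"], "unassigned")
            counts[team] = counts.get(team, 0) + 1
    return counts

print(open_incidents_by_team())  # {'network': 2}
```

Each adapter speaks its source's own access pattern, while the consumer only calls `open_incidents_by_team()`. Adding another source means adding another adapter, which is the sense in which such a layer is easy to extend.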
Data virtualization real-life use cases
- Single view of entity - for example Customer analytics, Single view of the Customer (customer 360)
- Social media
- Real-time operational views (for example summary of incidents, ranking, matrix of incident assignment, project status) with minimum delay allowed
- Self-service portal integration
- Agile Business Intelligence - operational decision support systems, such as inventory control, risk management
- Real-time analytics and reporting
- Performance dashboards
- Virtual data marts
- EDW prototyping and competitive BI
- Data services - Information as a Service (IaaS), Information feeds, Logical data abstraction, Virtual Data Layer, Virtual MDM
- Cloud integration - including reading Web data, two-way web automation, SaaS and cloud with no API, external watch