How to Get Started With a Big Data Proof of Concept
One of the biggest struggles businesses have when implementing a Big Data infrastructure is with the proof of concept (POC) phase. Initiating a Big Data POC can often seem intimidating: you know it can enable you to gain valuable business insights, but you’re not sure where to start. Getting started with Big Data doesn’t have to be an overwhelming process, though. Following these initial key steps will start you and your business in the right direction.
Step 1: Determine a Clear and Attainable Goal
Before you begin hiring employees and procuring hardware and software, think about the problem you are trying to solve with Big Data. Work cross-functionally (with lines of business, data scientists, IT administrators, etc.) to apply the SMART principles—specific, measurable, attainable, relevant, time-bound—to help you set a clear goal. Ask these questions:
- Who? What? When? Where? Why? How?
- What criteria will we use to measure our progress?
- Is the goal achievable or realistic?
- Is the goal relevant to the business?
- When will this goal be accomplished?
For example, a retail company may set the goal of increasing the lifetime value of its customers. As written, that goal illustrates a common mistake businesses make when setting up a POC: it is broad and has no clear definition of time. As a result, many companies spend months in the POC phase because they cannot tell when their goal has been achieved.
It is important to ensure you set a measurable and attainable goal. One such goal for the retail company, then, could be to increase the lifetime value of its customers by 2 percent (realistic and achievable) within one year (time-bound) by increasing basket size both in stores and online.
Step 2: Perform a Data Assessment
Once you have defined your goal, perform a data assessment, which involves the following steps:
- Determine what data you are currently collecting, such as purchasing history.
- Decide what data you need to start collecting to answer the questions you want to address. Continuing with the retail example above, one of the ways to know if a company has successfully increased the basket size of its customers may be to match the in-store customers with their online accounts.
- Understand the quality of your data so you know its limitations. For example, if the retailer can't match its in-store customers with their online accounts to determine the lifetime value of the customer, the company might consider changing its goal. It could, for example, break the original goal into two more realistic goals:
1) To increase the lifetime value of its known customers by 5 percent
2) To encourage 10 percent of the customers to self-identify by offering relevant deals
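The matching step in the assessment above can be sketched in a few lines of code. This is a minimal illustration, not a production pipeline: the field names (loyalty_id, email) and the idea of joining on email address are assumptions for the example, since real matching keys vary by retailer.

```python
# Hypothetical sketch: match in-store purchase records to online accounts
# by email address. Field names and data are illustrative only.

in_store = [
    {"loyalty_id": "L001", "email": "ana@example.com", "basket_total": 54.20},
    {"loyalty_id": "L002", "email": None,              "basket_total": 18.75},
]
online = [
    {"account_id": "A9", "email": "ana@example.com", "basket_total": 31.00},
]

# Index online accounts by email for quick lookup.
online_by_email = {o["email"]: o for o in online if o["email"]}

matched, unmatched = [], []
for purchase in in_store:
    account = online_by_email.get(purchase["email"]) if purchase["email"] else None
    (matched if account else unmatched).append(purchase)

# The match rate is one concrete measure of the data-quality limitation
# discussed above: a low rate suggests the goal needs to be split.
match_rate = len(matched) / len(in_store)
print(f"Matched {len(matched)} of {len(in_store)} in-store customers ({match_rate:.0%})")
```

A low match rate like this is exactly the signal that would justify splitting the original goal into the two sub-goals listed above.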
Step 3: Deploy Your Big Data Initiative
There are three main components, or layers, to a Big Data deployment: infrastructure, data management, and application. Many companies spend a lot of time—too much time—on the first two components. But the key to deploying a Big Data POC is to get through the infrastructure and data management layers as quickly as possible, by virtualizing, so you can reach the third and most important layer: the application. After all, gaining valuable insight from your data is what you're ultimately after.

Infrastructure – This is a combination of hardware (compute, storage, networking) and virtualization software. It is in your best interest to speak with your IT organization before investing in any hardware, as they may be able to provide you with virtual machines to run your Big Data workloads. Big Data, more than any other application, requires the cooperation of lines of business and IT. You may find it helpful to refer to the virtualize high-performance server guide for more information before speaking with IT.
If, in talking with IT, you find that you do need new hardware, work with them to select servers you can virtualize, such as rack-mounted servers. These are popular for Big Data applications because they employ direct-attached storage (DAS), which, as many companies have already discovered, is an economical way to provide the data-transfer bandwidth that Big Data workloads need.
Data Management – With Big Data comes, of course, a wealth of data—both structured and unstructured—to manage and analyze. Hadoop is a platform that can handle both types of data; think of it as a distributed framework for storing and processing Big Data across many machines. There are currently a number of Hadoop distributions available, including Cloudera, Hortonworks, MapR, IBM, and Pivotal.
Changing distributions is not a simple task, however, so doing some research beforehand will be beneficial. Important factors to consider include performance, speed, ease of use, cost, training, and support. You may find that different business units within your organization have already standardized their data management on different distributions and different versions. That's fine, because virtualizing Big Data allows you to run several vendors' products and versions on the same group of hardware. Refer to the deploying virtualized Hadoop systems guide for more information.
Application – Once your data management plan and infrastructure are in place and you have collected your data, it's time to bring value to the business by analyzing that data. This is the time to bring your data scientists, analysts, and engineers into the picture to determine the proper algorithms for drawing insights from all of the data.
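To make the application layer concrete, here is a minimal sketch of one metric a data team might compute for the retail goal discussed earlier: average basket size per customer, an input to tracking lifetime value. The transaction data is invented for illustration; a real POC would read it from the Hadoop cluster rather than an in-memory list.

```python
# Hypothetical application-layer sketch: average basket size per customer.
# Transactions are illustrative; in practice they would come from HDFS.

from collections import defaultdict

transactions = [
    ("cust_1", 40.00),
    ("cust_1", 60.00),
    ("cust_2", 25.00),
]

totals = defaultdict(float)   # total spend per customer
counts = defaultdict(int)     # number of baskets per customer

for customer, amount in transactions:
    totals[customer] += amount
    counts[customer] += 1

avg_basket = {c: totals[c] / counts[c] for c in totals}
print(avg_basket)  # {'cust_1': 50.0, 'cust_2': 25.0}
```

Tracking a simple metric like this over the POC period is how the team would know whether the time-bound SMART goal (for example, a 2 percent increase in basket size within one year) is being met.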
Get Started Today
Setting up a Big Data POC is within your reach. Set a SMART goal, collect the right data, and set up your distribution on virtualized infrastructure. Infrastructure performance is less of a concern during the POC phase than it will be in production; the point is to get started as quickly as possible and to focus on gleaning insights that meet the needs of the business. And to glean those insights, you'll need a data team. Create a center of excellence, a team of data analysts and scientists who can make sense of the data and help your business get the most out of it.
For more helpful tips on getting started, explore the ways in which VMware empowers Big Data.