Is Your Data Ready for Machine Learning?

Artificial intelligence is changing the world more rapidly than anyone could have predicted a few years ago. The explosion in available data, coupled with low-cost computing power and dramatic advances in AI capabilities, will enable organizations to optimize their operations, personalize their products, and anticipate future demand.

Yet, a recent survey shows that most companies still aren’t fully prepared to take advantage of this technology.


In partnership with Dell Technologies and Intel, Forbes Insights surveyed more than 700 top executives about their plans for AI and machine learning. While three out of four CxOs say AI is a core component of their digital transformation plans, fewer than 25 percent have implemented it anywhere within their organization.

Just 11 percent have executed an enterprise-wide data strategy, and a scant 2 percent say they have a solid data governance process in place. In fact, while most organizations have more data than they know what to do with, much of it remains siloed, unstructured or otherwise ill-prepared for use in machine learning models.

Without the right data, AI initiatives will fail.

How Much Data Is Enough?

Organizations should start their AI journey by figuring out the questions they want to answer and the predictive capabilities they’d like to develop, says Josh Simons, senior director and chief technologist for high performance computing at VMware. That will determine the data they should collect.

But the amount and types of data organizations will need also depend on whether they’re using supervised or unsupervised machine learning models.


Supervised learning trains the model to look for specific results. It’s what allows Amazon Alexa to understand what you’re saying or your iPhone to unlock when it sees your face. Supervised learning requires a significant volume of labeled data, but it can allow you to build powerful predictive models.

Unsupervised learning involves analyzing pools of raw data to detect patterns and identify anomalies, such as combing through computer security logs to flag potential cyberattacks. The amount of data you need depends on what you want the model to do.
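To make the log-combing idea concrete, here is a minimal sketch of unlabeled anomaly flagging. It is an illustration rather than anything the people quoted here describe: the hourly failed-login counts are made up, the function name flag_anomalies is hypothetical, and the scoring rule (a modified z-score built on the median, with the conventional 3.5 cutoff) is just one simple choice among many.

```python
from statistics import median

def flag_anomalies(values, threshold=3.5):
    """Return indexes of values that sit far from the median.

    Uses the modified z-score: 0.6745 * |x - median| / MAD, where MAD is
    the median absolute deviation. No labels are needed; the data itself
    defines what "normal" looks like.
    """
    med = median(values)
    deviations = [abs(v - med) for v in values]
    mad = median(deviations)
    if mad == 0:
        # Every typical value is identical, so anything different stands out.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, d in enumerate(deviations) if 0.6745 * d / mad > threshold]

# Hypothetical failed-login counts per hour pulled from a security log.
failed_logins = [4, 5, 3, 6, 4, 5, 97, 4, 5, 3]
suspicious_hours = flag_anomalies(failed_logins)
print(suspicious_hours)
```

Here only the hour with 97 failed logins is flagged. Nobody told the code what an attack looks like; it simply noticed that one hour looked nothing like the rest.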

Some enterprises will start with an unsupervised model to identify patterns in data, and then use those to structure the data for use with a supervised one, says Cambron Carter, director of engineering for GumGum, a computer vision company that builds AI solutions for the advertising, medical, and sports industries.

“It’s as if I dumped a bag of marbles onto a table and told you to sort them, without telling you anything else,” Carter says. “You could sort them by size, color, design or whatever. But you’re going to impose some kind of structure on those marbles.”

If you’re training a robotic arm to identify parts passing by on an assembly line, you can start with a set of a few thousand labeled images, or even fewer depending on the task, says Carter. If you’re dealing with more complex tasks—like diagnosing cavities on dental x-rays or identifying logos on Formula One race cars as they zoom past—you’ll need to start with a significantly larger set of labeled data.

But volume alone isn’t enough; the data also needs to represent what you’d encounter in real-world scenarios, he adds.

“Let’s say I’m trying to train a model to recognize ten different animals and I’ve collected a million images to do so,” he says. “If 900,000 of them are tigers, the system is going to learn that if it predicts ‘tiger,’ it will be right 90 percent of the time.” The model may appear to be highly accurate, but it will offer poor predictive value in the real world.
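The arithmetic behind Carter's tiger example can be checked in a few lines. This is a sketch under assumptions: the data set is scaled down to 10,000 labels, the nine other animal names are invented, and balanced accuracy (the average of per-class recall) is offered here as one common way to expose the problem, not as a metric Carter names.

```python
from collections import Counter

def constant_guess_scores(labels):
    """Score a 'model' that always predicts the most common label."""
    counts = Counter(labels)
    majority, majority_count = counts.most_common(1)[0]
    # Plain accuracy: share of examples that happen to be the majority class.
    accuracy = majority_count / len(labels)
    # Per-class recall: 1.0 for the class it always guesses, 0.0 for the rest.
    per_class_recall = [1.0 if cls == majority else 0.0 for cls in counts]
    balanced = sum(per_class_recall) / len(per_class_recall)
    return majority, accuracy, balanced

# Hypothetical, scaled-down version of the example: 90 percent tigers.
other_animals = ["lion", "bear", "wolf", "fox", "deer", "owl", "seal", "crow", "hare"]
labels = ["tiger"] * 9000 + [other_animals[i % 9] for i in range(1000)]

majority, accuracy, balanced = constant_guess_scores(labels)
print(f"always guess {majority}: accuracy {accuracy:.0%}, balanced accuracy {balanced:.0%}")
```

Always guessing "tiger" scores 90 percent on plain accuracy but only 10 percent once each of the ten animals counts equally, which is the gap between looking accurate and being useful.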

You’ll also need three distinct pools of data: one to train your model, another to validate that it’s accurate and a third set of data to test it before putting it into production, says Yiwen Huang, CEO of, an automated machine learning platform.
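A minimal sketch of carving one data set into those three pools follows. The 70/15/15 proportions, the fixed random seed and the function name three_way_split are illustrative assumptions, not figures from Huang.

```python
import random

def three_way_split(records, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle once, then cut into train, validation and test pools."""
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is held out
    return train, validation, test

# Hypothetical data set of 1,000 record IDs.
train, validation, test = three_way_split(range(1000))
print(len(train), len(validation), len(test))
```

The point of the design is that no record lands in more than one pool, so the test set stays unseen until the final check before production.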

Getting to the Goldilocks data set is tricky. Start with too little data, and you risk overfitting—creating a model that works well with your training set but poorly when it encounters new data. Use an insufficiently diverse mix of data, and you could be biasing the results in a particular direction.

“The unfortunate answer is there’s a lot of trial and error,” says Carter. “It depends on the scope, what you’re trying to learn, and how many variables you’re dealing with. There’s an artisanal component to it as well as an experiential one.”

Is Your Data Clean?

Once you have the data locked in, you need to prep it: removing duplicates, ensuring fields are formatted consistently and so on. It’s not a task you want to leave to highly paid, hard-to-find analytics experts, yet that’s something many companies do, says Simons.
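As a rough illustration of that prep work, the sketch below normalizes two fields and drops duplicates. The record layout (a name and an email), the choice of email as the duplicate key and the function names are all hypothetical; real data sets will need their own rules.

```python
def normalize(record):
    """Put each field into one consistent format."""
    return {
        "name": " ".join(record["name"].split()).title(),  # collapse spaces, fix case
        "email": record["email"].strip().lower(),
    }

def clean(records):
    """Normalize every record, keeping the first one seen per email address."""
    seen = set()
    cleaned = []
    for record in records:
        normalized = normalize(record)
        if normalized["email"] in seen:
            continue  # same person entered twice with different formatting
        seen.add(normalized["email"])
        cleaned.append(normalized)
    return cleaned

# Hypothetical raw customer rows with inconsistent formatting.
raw = [
    {"name": "  ada   lovelace", "email": "Ada@Example.com "},
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "alan turing", "email": "alan@example.com"},
]
customers = clean(raw)
print(customers)
```

Note that the two Ada Lovelace rows only register as duplicates after normalization, which is why formatting has to be fixed before duplicates are removed.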

“The data science community complains a lot about how much of their time is spent collecting data and getting it into a format they can actually feed into these algorithms,” he says. 

It can be a hugely complicated task, and if you don’t do it right, it’s garbage-in garbage-out.

If you’ve got a large pool of unstructured data, such as collections of random images, you’ll need to assign labels that help the machine learning model understand what it’s looking at. That may require bringing in subject matter experts to fuel the learning process by manually labeling images, such as doctors identifying which x-rays indicate the presence or absence of tumors.

The process can be challenging, expensive and time-consuming, Simons adds. Businesses will need to decide whether it makes more sense to build their own data sets or acquire pre-labeled ones.

In some cases, organizations may want to mix structured and unstructured data to identify common threads, such as combining records from a CRM database with text comments from user forums.

“There’s a lot of potentially valuable unmined data in those forums,” he says. “You can apply sentiment analysis and use it as additional signal alongside your more structured business system data.”

How Do You Get Started?

Building out a machine learning platform can be time- and resource-intensive. Data scientists are in high demand, and the return on investment is far from guaranteed. But there are a few ways organizations can ease into it.

“The best thing to do is start small,” says Huang. 

Identify a use case that has value, is practical, and has readily available data. The best way to understand whether you have the right data set is to try it out. But it can be really expensive to do that.

Huang says companies can upload their data sets to the platform, which builds the optimal model for them using one of 20 commonly used learning algorithms. In roughly an hour, he says, they’ll be able to find out if their data will yield a high-quality model.

There are a growing number of AI startups that can take your data and build a rudimentary model for you, adds Simons. Organizations can also get a head start by taking algorithms that are already trained on data similar to theirs, and customizing them—an approach known as transfer learning.

Then, there are the huge AI-as-a-service platforms such as Amazon Rekognition, Clarifai and Google Cloud Vision API. But those tend to be limited in functionality and expensive at scale.

“It will soon be table stakes for every company to have a chat bot or some kind of image recognition,” says Simons.

What will really provide business value is differentiated machine learning, where organizations are applying these techniques to their own data and solving their own problems.

And if you haven’t already started down this road, you’re behind the curve.