When AI Sees Only Part of the Data – What Happens Then?

By SQream

11.24.2021

(This is part two of a two-part series describing how future AI outcomes might be hindered by today’s database bottlenecks.)

In part one of this series, we discussed McKinsey Institute’s report on the different paths organizations took on their way to integrating AI into their business. The bottom line of that report was that companies whose AI initiatives delivered a high positive impact on operating income approached their AI journey in similar ways. They understood that this is a long-term journey which will include some failures along the way; they focused on one or two areas of the business rather than trying everything at once; and, maybe most importantly, they invested the needed time and made all the relevant data accessible to the AI and ML engines during the training phases.

That last point – the need to make *all* the relevant data accessible, and to understand that you can’t rush the training process if you want the AI to deliver on its promise – is probably the biggest challenge for organizations trying to put AI to work in their business.

In practice, today’s data analytics infrastructure can’t take *all* the data, analyze it, and provide insights in time for telcos to act before the predicted events actually occur. Predictive analytics promises to help telcos discover early signs of customer churn, so customers can be addressed before they finalize their decision to leave, and an AI engine promises to suggest up-sell and cross-sell opportunities for specific customers. Yet telcos still face a trade-off: insights are time-sensitive and provide no value if they arrive too late, so the only way to get them in time is to reduce the amount of data being analyzed by choosing specific customer segments, time frames of usage, and so on.

Time is Essential in AI / ML Training 

Let’s assume that only 20% of the data can be analyzed if we need the insights to arrive ASAP. That would mean we’re at least getting value from 20% of our assets within a timeframe that still provides potential value, while choosing to disregard the other 80% of our data assets because we just can’t deal with them right now.

In the present, this seems like a hard but somewhat fruitful compromise – at least we’re getting something. However, if we look into the future, this kind of mindset takes us onto a very problematic path.

If we recall the McKinsey research and one of its major insights, we remember that leading companies focused on specific domains and allowed the time needed for their AI and ML engines to be trained on as much data within these domains as possible. This made these AI engines “experts,” so these companies could trust them with decisions, which translated into a positive impact on revenues.

Companies that did not spend enough time training their AI and ML engines, and did not focus on specific domains, didn’t reach this level of trust in their AI, which translated into less impact on revenues and more frustration.

How Much Data is Enough?

Deciding to analyze just 20% of the data because you simply can’t handle more is not “focusing on specific domains.” It’s more like “doing research on a non-representative, non-random sample, then deciding that the results you got are representative of the entire database.” In the present, these companies might argue that since they are limited to analyzing just 20% of the data, their insights will be translated into decisions targeting only the same 20% of customers who were included in the analyzed data.
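To make the sampling problem concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the customer base is synthetic, the `tenure_months` and `churned` fields are invented names, and the assumption that newer customers churn more often was chosen only for illustration. It compares the churn rate measured on all the data with the rate measured on a hand-picked 20% segment and on a random 20% sample.

```python
import random


def churn_rate(customers):
    """Fraction of customers flagged as churned."""
    return sum(c["churned"] for c in customers) / len(customers)


def make_customers(n=10_000, seed=0):
    """Build a synthetic customer base (hypothetical data, not real telco records)."""
    rng = random.Random(seed)
    customers = []
    for _ in range(n):
        tenure_months = rng.randint(1, 60)
        # Assumed for illustration only: customers in their first year churn more often.
        p_churn = 0.40 if tenure_months <= 12 else 0.05
        customers.append(
            {"tenure_months": tenure_months, "churned": rng.random() < p_churn}
        )
    return customers


customers = make_customers()
one_fifth = len(customers) // 5

# Rate measured on all the data.
full_rate = churn_rate(customers)

# A convenience segment: the 20% of customers with the longest tenure.
by_tenure = sorted(customers, key=lambda c: c["tenure_months"], reverse=True)
segment_rate = churn_rate(by_tenure[:one_fifth])

# A random 20% sample, drawn with a fixed seed so the run is repeatable.
random_rate = churn_rate(random.Random(1).sample(customers, one_fifth))

print(f"all data:          {full_rate:.3f}")
print(f"long-tenure 20%:   {segment_rate:.3f}")
print(f"random 20% sample: {random_rate:.3f}")
```

Under these assumptions, the long-tenure segment contains no first-year customers at all, so it understates churn for the customer base as a whole, while the random sample stays much closer to the full-data rate. An engine trained only on that segment would have the same blind spot.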

But if these companies are planning to gradually integrate AI and ML into their operations, feeding just 20% of the data to these infant engines would create machines which are expected to make decisions on all the present and future data, but have only ever seen 20% of it. It is like limiting a child to interacting with just 20% of the objects a child normally interacts with during their early years, then setting them free into the world and watching them decide that you can eat a pencil because it looks similar to a breadstick (which they’ve met), not realizing it is more similar to a wooden straw (which they haven’t met).

The Takeaway

So how do we prevent these AI and ML engines from eating pencils? If we made all the data accessible for analysis, we’d get the insights we need and the training we want, but we’d get them too late to be relevant. Speed is the major issue here (along with data silos, data prep and ingestion issues, but those are for another post). Speed is also the path to the solution: we need to find a way to analyze high data volumes at low latency. This means speeding up the process of querying growing data, while not compromising on accuracy.

To learn how SQream helps accelerate analytics for AI / ML models, read the datasheet.