Since October of last year I have had the opportunity to work with a startup working on automated machine learning, and I thought I would share some thoughts on the experience and on what one might want to consider at the start of a journey with a “data scientist in a box”.
I’ll start by saying that machine learning and ‘artificial intelligence’ have almost forced themselves into my work several times in the past eighteen months, all in slightly different ways.
The first brush was back in June 2018 when one of the developers I was working with wanted to demonstrate to me a scoring model for loan applications, based on the analysis of other transactional data that indicated which loans had previously been granted. The model came with no explanation and no details other than the fact that it allowed you to stitch together a transactional dataset, which it assessed using a naïve Bayes algorithm. We had a run at showing this to a wider audience but the appetite for examination seemed low, and I suspect that in the end the real reason was that we didn’t have real data and only had a conceptual problem to be solved.
The second go was about six months later when another colleague in the same team came up with a way to classify data sets, and in fact developed a flexible training engine and data tagging approach for determining whether certain columns in data sets were likely to be names, addresses, phone numbers or email addresses. At face value you would think this to be something simple, but in reality it is of course only as good as the training data. In this instance we could easily confuse the system and the data tagging with things like social security numbers that looked like phone numbers, postcodes that were simply numbers and ultimately could be anything, and so on. Names were only as good as the locality from which the names training data was sourced, and cities, towns, streets and provinces all proved to mostly work OK but almost always needed region-specific training data. At any rate, this method of classifying contact data for the most part met the rough objectives of the task at hand and so we soldiered on.
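To make the ambiguity concrete, here is a minimal sketch of column tagging. It is not my colleague’s trained engine; it swaps in a handful of hand-written regular expressions, and every pattern, tag name and threshold below is an assumption made purely for illustration. It only shows why a column of digits can plausibly carry more than one tag.

```python
import re

# Hypothetical, deliberately loose patterns. A real tagger would be trained
# on region-specific data rather than hard-coded like this.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?[\d\s().-]{7,15}"),
    "ssn": re.compile(r"\d{3}-?\d{2}-?\d{4}"),
    "postcode": re.compile(r"\d{4,5}"),
}


def tag_scores(values):
    """Return, per tag, the share of values that fully match its pattern.

    Assumes `values` is a non-empty list of strings.
    """
    return {
        tag: sum(bool(p.fullmatch(v.strip())) for v in values) / len(values)
        for tag, p in PATTERNS.items()
    }


def candidate_tags(values, min_share=0.8):
    """List every tag that matches at least `min_share` of the column."""
    scores = tag_scores(values)
    return sorted(tag for tag, share in scores.items() if share >= min_share)


# A column of social security numbers also satisfies the loose phone pattern.
ssn_like_column = ["123-45-6789", "987-65-4321"]
ambiguous = candidate_tags(ssn_like_column)
```

With these assumed patterns, the SSN-like column comes back tagged as both a phone number and a social security number, which is the kind of confusion described above.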
A few months later I was called over to a developer’s desk and asked for my opinion on a side project that one of the senior developers and architects had been working on. The objective was ambitious but impressive. The solution had been built in response to three problems in the field. The first problem to be solved was decoding why certain records were deemed to be related to one another when with the naked eye they seemed to not be, or vice versa. While this piece didn’t involve any ML per se, the second part of the solution did, in that it self-configured thousands of combinations of alternative fuzzy matching criteria to determine an optimal set of duplicate record matching rules.
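The self-configuring part can be pictured as a search over candidate matching rules. The sketch below is an assumption about the general shape of such a search, not the developer’s actual implementation: the labelled pairs, field names and thresholds are all hypothetical, and string similarity comes from Python’s standard `difflib.SequenceMatcher` rather than whatever the real solution used.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical labelled pairs: (record_a, record_b, is_duplicate).
PAIRS = [
    ({"name": "Jon Smith", "city": "Leeds"}, {"name": "John Smith", "city": "Leeds"}, True),
    ({"name": "Jon Smith", "city": "Leeds"}, {"name": "Joan Smyth", "city": "York"}, False),
    ({"name": "Ann Lee", "city": "Bath"}, {"name": "Anne Lee", "city": "Bath"}, True),
    ({"name": "Ann Lee", "city": "Bath"}, {"name": "Alan Reed", "city": "Bath"}, False),
]


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(rec_a, rec_b, fields, threshold):
    """A rule matches when every chosen field is at least `threshold` similar."""
    return all(similarity(rec_a[f], rec_b[f]) >= threshold for f in fields)


def f1_score(fields, threshold, pairs):
    """Score one candidate rule against the labelled pairs."""
    tp = fp = fn = 0
    for rec_a, rec_b, label in pairs:
        predicted = is_match(rec_a, rec_b, fields, threshold)
        tp += predicted and label
        fp += predicted and not label
        fn += (not predicted) and label
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_rule(pairs, field_names, thresholds):
    """Try every field combination and threshold; keep the best-scoring rule."""
    candidates = [
        (fields, threshold)
        for size in range(1, len(field_names) + 1)
        for fields in combinations(field_names, size)
        for threshold in thresholds
    ]
    return max(candidates, key=lambda rule: f1_score(rule[0], rule[1], pairs))


fields, threshold = best_rule(PAIRS, ["name", "city"], [0.7, 0.8, 0.9])
```

The winning rule is itself the explanation: it names the fields and the threshold that decided each match, which is what makes this style of search readable to a consultant or analyst. With thousands of combinations, an exhaustive loop like this would likely need pruning or sampling.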
This was more impressive and also easy to grasp – almost self-explanatory. It would serve as a great utility for a consultant, a data analyst or a relative layperson looking for an explanation of how potential duplicate records were determined to have a relationship. This was especially important because it could immediately provide value to field services personnel and clients. In addition, the developer had cunningly introduced a manual matching option that allowed a user to evaluate two records and decide, through visual assessment, whether they could potentially be considered related to one another.
In some respects what was produced was exactly the way that I like to see products produced. The field describes the problem; the product management organization translates that into more elaborate stories and looks for parallels in other markets, across other business areas and for ubiquity. Once those initial requirements have been gathered it is then up to engineering and development to come up with a prototype that works toward solving the issue.
The more experienced the developer, of course, the more comprehensive the result may be and the more mature the initial iteration may be. Product is then in a position to pitch the concept back at the field, to clients and to a select audience to get their perspective on the solution and how well it solves the previously articulated problem.
The challenge comes when you have a less tightly honed intent, a less specific message and a more general problem to solve and this comes now to the latest aspect of machine learning and artificial intelligence that I picked up.
One of the elements of dealing with data validation and data preparation is the last mile of action that you have in mind for that data. If your intent is as simple as, “let’s evaluate our data sources, clean them up and make them suitable for online transaction processing”, then that’s a very specific mission. You need to know what you want to evaluate and what benchmark you wish to evaluate it against, and then have some sort of remediation plan so that the data supports the use case for which it is intended, for example supporting customer calls into a call centre. The only area where you might consider artificial intelligence and machine learning in this instance is in determining matches against the baseline, but then the question is whether you simply need a Boolean decision or whether some sort of stack ranking is relevant at all. It could be argued either way, depending on the application.
When you’re preparing data for a decision that goes beyond data quality, though, the mission is perhaps a little different. Effectively your goal may be to skim the cream of opportunities off the top of a pile of contacts, leads, opportunities or accounts. As such, you want to use some combination of traits within the data set to identify the factors that influence a better (or worse) outcome. Here, linear regression analysis for scoring may be sufficient. The devil, of course, lies in the details, and unless you’re intimately familiar with the data and the proposition that you’re trying to resolve, you have to do a lot of trial-and-error experimentation and validation. For statisticians and data scientists this is all very obvious and, you could say, a natural part of the work that they do. Effectively the challenge here is feature selection: a way of reducing complexity in the model that you will ultimately apply to the scoring.
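As a small worked example of regression-based scoring, the sketch below fits a single-feature least-squares line and uses it to rank leads. The feature (“prior purchases”), the historical figures and the lead names are all invented for illustration; a real scoring model would use several traits and proper validation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rank_leads(leads, slope, intercept):
    """Order lead names from highest to lowest predicted outcome."""
    return sorted(leads, key=lambda name: slope * leads[name] + intercept, reverse=True)


# Hypothetical history: prior purchases versus eventual deal value.
prior_purchases = [0, 1, 2, 3, 4]
deal_value = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(prior_purchases, deal_value)

# Hypothetical open leads, keyed by name, with their prior purchase counts.
open_leads = {"lead_a": 1, "lead_b": 5, "lead_c": 3}
ranked = rank_leads(open_leads, slope, intercept)
```

Sorting by predicted value is what turns the regression into the “cream off the top” ranking; the hard part in practice is deciding which traits deserve to be in the model at all.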
The journey I am on right now with a technology partner focuses on ways to optimise the features so that only the most necessary ones need to be considered. This, in turn, makes the model potentially simpler and faster to execute, particularly at scale. So while the regression analysis still needs to be done, determining what matters, what has significance and what should be retained versus discarded in the model design is all being factored into the model building in an automated way. This doesn’t necessarily apply to all kinds of AI and ML work, but for this specific objective it is perhaps more than adequate, and it doesn’t require a data scientist to start delivering a rapid yield.
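I can’t show the partner’s method here, so the following is only a rough illustration of what “automated” feature selection can mean in its simplest form: a correlation filter that keeps the columns whose relationship with the outcome clears a cut-off. The column names, the data and the 0.5 cut-off are assumptions, and a filter like this is a stand-in for, not a description of, the real approach.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(columns, outcome, min_abs_corr=0.5):
    """Keep columns whose absolute correlation with the outcome meets the cut-off."""
    kept = {}
    for name, values in columns.items():
        r = pearson(values, outcome)
        if abs(r) >= min_abs_corr:
            kept[name] = r
    return kept


# Hypothetical candidate features for six accounts.
COLUMNS = {
    "prior_purchases": [0, 1, 2, 3, 4, 5],
    "days_since_contact": [30, 25, 20, 15, 10, 5],
    "row_id": [3, 1, 6, 2, 5, 4],         # arbitrary identifier, weak signal
    "constant_flag": [1, 1, 1, 1, 1, 1],  # carries no information
}
OUTCOME = [1, 3, 5, 7, 9, 11]

kept = select_features(COLUMNS, OUTCOME)
```

A filter like this looks at each column in isolation, so it will happily keep two columns that carry the same information, as the two retained here do. Handling that kind of redundancy is part of what a more capable automated approach would need to address.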