Enterprise AI: How to Build an AI Dataset
Finding and acquiring the data needed to build an enterprise dataset is perhaps the most critical task for organizations looking to build their own AI models.
Even with practice, things can easily go wrong, said Waseem Ali, CEO of consultancy Rockborne. “Everything always starts with data, and if your data is bad, the model won’t be good.”
Instead, he suggests, the challenge for businesses is often not to conquer the world with their first project, but to pilot something that will allow them to go further.
Examine the specific business needs and requirements for a data or digital project: ask what problems need to be solved and which “hunches” need to be tested, but avoid drilling down into the “big picture impact” first.
As Johannes Maunz, head of AI at industrial IoT specialist Hexagon, explains, the way to acquire data for a specific use case is to start from first principles.
“No single deep learning model will solve all use cases,” Maunz said. “Compare the status quo to where you need to improve. What available data do you need to capture? Do it in a small or limited way, just for that one use case.”
Hexagon’s approach generally centers on its own sensors, which capture data about buildings’ walls, windows, doors and so on. By rendering that content in a browser, Hexagon can understand the data and its standards, formats and consistency.
Start by thinking about what data and datasets the business already has, or can use, that meet the requirements. This often requires close collaboration with legal and privacy teams, even in an internal industrial environment; Maunz advises making sure the data specified for use does not contain any private personal information. The business can then build and train the model it wants to use, assuming the costs and feasibility stack up.
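As a minimal sketch of that kind of screening (the column names, patterns and thresholds are invented, and pandas is assumed as the tooling), the snippet below flags columns whose sampled values look like email addresses or phone numbers, so they can be reviewed before training:

```python
import re
import pandas as pd

# Hypothetical patterns for two common kinds of personal data.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def flag_pii_columns(df: pd.DataFrame, sample_size: int = 100,
                     threshold: float = 0.2) -> list[str]:
    """Return columns where a sampled share of values matches a PII pattern."""
    flagged = []
    for col in df.select_dtypes(include="object").columns:
        sample = df[col].dropna().astype(str).head(sample_size)
        if sample.empty:
            continue
        for pattern in PII_PATTERNS.values():
            if sample.str.contains(pattern).mean() >= threshold:
                flagged.append(col)
                break
    return flagged

df = pd.DataFrame({
    "sensor_id": ["A1", "B2"],
    "contact": ["ops@example.com", "+44 20 7946 0958"],
})
print(flag_pii_columns(df))  # ['contact']
```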
Next, you need transparency into decision points, as well as signals for assessing factors such as usability, feasibility and business effectiveness, or data on likely performance versus competitors.
For data the company does not currently hold, obtaining it may require some negotiation with partners or customers.
“Frankly, people are very open to it – but there always has to be a contract,” Maunz said. “Only then can we start what we normally call data activities. Sometimes it makes sense to have more data than is needed so that the business can downsample.”
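A minimal sketch of the downsampling Maunz mentions, assuming an over-collected, labeled tabular dataset held in pandas: gather more than is needed, then sample each class down to a budget.

```python
import pandas as pd

def downsample(df: pd.DataFrame, label_col: str, per_class: int,
               seed: int = 42) -> pd.DataFrame:
    """Keep at most `per_class` rows per label, sampling without replacement."""
    parts = [
        group.sample(n=min(len(group), per_class), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)

# Hypothetical over-collected dataset: 1,000 'ok' readings, 60 'fault' readings.
raw = pd.DataFrame({
    "reading": range(1060),
    "label": ["ok"] * 1000 + ["fault"] * 60,
})
balanced = downsample(raw, "label", per_class=100)
print(balanced["label"].value_counts())  # ok 100, fault 60
```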
Data quality and simplicity are critical
Emile Naus, partner at supply chain consulting firm BearingPoint, stressed the importance of data quality in AI/ML. Keep things as simple as possible, he advised, because complexity makes correct decisions harder and hurts results; then there are biases and intellectual property to consider. “Internal data isn’t perfect, but at least you can get a sense of how good it is,” Naus added.
He noted that complex multi-dimensional line fitting driven by AI/ML can deliver better results than easy-to-use 2D or even 3D line fitting (optimizing production, solving “recipes”, minimizing waste and so on), provided companies have “free” access to the data they need.
“Like all models, AI models are used to build another model, and models can always be wrong, so data governance is key. The pieces you don’t have may actually be more important, and you have to work out the completeness and accuracy of the data.”
Andy Crisp, senior vice president of data and analytics at Dun & Bradstreet (D&B), recommends using customer insights and key data elements to establish data quality standards and tolerances, along with measurement and monitoring.
“For example, the data [customers want, or get from us] might also inform their models, and we’re doing about 46 billion data quality calculations, taking our data and then maybe calculating it again against those standards, and then publishing data quality observations every month,” Crisp said.
Under such standards, specific attributes must perform well enough to be passed on to the next team. That team adopts the standards and tolerances, along with the resulting measurements and observation points, and works with the data management function to acquire, organize and maintain the data.
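As an illustrative sketch only (the attributes, tolerances and completeness metric below are invented, not D&B’s actual checks), attribute-level measurements can be compared against agreed tolerances and published as observations:

```python
import pandas as pd

# Hypothetical standards: each attribute must meet a completeness tolerance
# before it is passed on to the next team.
TOLERANCES = {"company_name": 0.99, "revenue": 0.90, "employee_count": 0.85}

def quality_observations(df: pd.DataFrame) -> pd.DataFrame:
    """Measure completeness per attribute and compare it against its tolerance."""
    rows = []
    for attr, tolerance in TOLERANCES.items():
        completeness = df[attr].notna().mean()
        rows.append({
            "attribute": attr,
            "completeness": round(completeness, 3),
            "tolerance": tolerance,
            "passes": completeness >= tolerance,
        })
    return pd.DataFrame(rows)

records = pd.DataFrame({
    "company_name": ["Acme", "Globex", None],
    "revenue": [1.2e6, None, 3.4e6],
    "employee_count": [None, 50, 120],
})
print(quality_observations(records))
```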
“There’s no other way to do it than to take the time to do it and develop your understanding. It’s like starting by cutting one piece of wood and then checking the lengths so you don’t end up cutting 50 completely wrong.”
Enterprises need to “know what good looks like” to improve data performance and insights, and then bring the two together. Rigor in the problem statement narrows the search for the required datasets. Meticulous annotation and metadata make controlled datasets manageable, enabling a genuinely scientific approach that identifies biases and helps minimize them.
Beware of bold statements that lump multiple factors together, and make sure to “test until you break”: this is one area where IT organizations do not want to “move fast and break things”. All data used must meet the standards and be continuously checked and remediated.
“Measure and monitor, remediate and improve,” Crisp said, noting that Dun & Bradstreet’s quality engineering team is made up of about 70 team members around the world. “High-quality engineering capabilities will help reduce things like hallucinations.”
Greg Hanson, vice president for Northern Europe, the Middle East and Africa at Informatica, agrees that setting goals is critical: it helps companies decide how best to spend their time cataloging information, integrating it, and training AI on the data needed for the desired results.
Even an organization’s own data is often dispersed and hidden across different locations, clouds, or on-premises environments.
“Catalog all your data assets, understand where that data resides, and also consider using AI to speed up data management,” Hanson said.
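A minimal sketch of what a catalog entry might record, with invented fields and locations, shows how dispersed assets become discoverable:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One asset in a minimal data catalog: what it is and where it lives."""
    name: str
    location: str          # e.g. cloud bucket, database, on-premises share
    owner: str
    contains_pii: bool = False
    tags: list[str] = field(default_factory=list)

catalog = [
    CatalogEntry("sensor_readings", "s3://plant-data/raw/", "ops", tags=["iot"]),
    CatalogEntry("crm_contacts", "postgres://crm/contacts", "sales",
                 contains_pii=True),
]

# A catalog makes scattered data queryable: find everything containing PII.
print([e.name for e in catalog if e.contains_pii])  # ['crm_contacts']
```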
Ensure data is governed before it is captured
All data quality rules need to be applied before the AI engine ingests the data, with proper governance and compliance in place. If the enterprise doesn’t measure, quantify and fix, it will only make the wrong decisions faster, Hanson added: “Remember: garbage in, garbage out.”
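As a hedged illustration of rules applied before ingestion (the rules and fields are invented), an incoming batch can be split into a clean set to ingest and a reject set to remediate:

```python
import pandas as pd

# Invented quality rules, applied before ingestion: each returns True for
# rows that are acceptable.
RULES = {
    "revenue_non_negative": lambda df: df["revenue"] >= 0,
    "country_code_known": lambda df: df["country"].isin({"GB", "DE", "US"}),
}

def gate(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split a batch into clean rows (ingest) and rejects (remediate)."""
    ok = pd.Series(True, index=df.index)
    for rule in RULES.values():
        ok &= rule(df)
    return df[ok], df[~ok]

batch = pd.DataFrame({
    "revenue": [100.0, -5.0, 42.0],
    "country": ["GB", "US", "XX"],
})
clean, rejected = gate(batch)
print(len(clean), "ingested;", len(rejected), "sent for remediation")  # 1; 2
```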
Tendü Yogurtçu, CTO of data suite vendor Precisely, said that depending on their size and industry, companies can consider forming a steering or cross-functional committee to help define best practices and processes for all related AI initiatives. Such a committee can also accelerate progress by identifying common use cases or patterns across teams, which will themselves keep changing as organizations learn from pilots and production deployments.
Data governance frameworks may need to expand to cover a variety of AI models. That being said, potential AI use cases abound.
“Take the insurance industry as an example. To model risk and pricing accuracy, insurance companies need detailed information about wildfire and flood risks, parcel topography, exact location of buildings within the parcel, distance to fire hydrants, and distance to potential danger points like gas stations,” Yogurtçu explained.
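A hypothetical record of the parcel-level attributes Yogurtçu lists (the field names and example values are invented) shows the kind of structured features such a risk and pricing model would consume:

```python
from dataclasses import dataclass

@dataclass
class ParcelRiskRecord:
    """Hypothetical property-level inputs to an insurance risk/pricing model."""
    parcel_id: str
    wildfire_risk: float             # 0-1 risk score
    flood_risk: float                # 0-1 risk score
    slope_degrees: float             # parcel topography
    building_lat: float              # exact building location within the parcel
    building_lon: float
    distance_to_hydrant_m: float
    distance_to_gas_station_m: float # nearby danger points

record = ParcelRiskRecord(
    parcel_id="P-1042",
    wildfire_risk=0.12,
    flood_risk=0.43,
    slope_degrees=4.5,
    building_lat=34.05,
    building_lon=-118.24,
    distance_to_hydrant_m=180.0,
    distance_to_gas_station_m=950.0,
)
print(record.flood_risk)  # 0.43
```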
However, Richard Fayers, senior data and analytics leader at consultancy Slalom, warned that building AI models, especially generative AI, can be expensive.
“Maybe there are areas where companies can collaborate — like law or medicine, where we start to see value is when you augment generative AI with your data — and you can do it in a variety of ways.”
For example, in the construction sector, users can supplement large language models with their own datasets and documents for querying. Similarly, a ticket search platform can intelligently handle natural language criteria that do not map one-to-one onto its metadata and tags.
“For example, you could use a ticketing platform to discover ‘kid-friendly weekend shows,’ and that type of search is currently quite difficult,” Fayers said.
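A toy sketch of that kind of search, with invented listings and a simple bag-of-words similarity standing in for the learned embeddings a production system would use, shows how a natural-language query can rank listings without exact tag matches:

```python
import math
import re
from collections import Counter

# Invented event listings with free-text descriptions.
EVENTS = {
    "The Gruffalo Live": "family puppet show for kids on Saturday afternoon",
    "Late Night Jazz": "intimate adults-only jazz session on Friday at midnight",
    "Science Museum Day": "hands-on weekend activities for children",
}

def bow(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use learned embeddings."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

query = bow("kid friendly weekend shows for children")
ranked = sorted(EVENTS, key=lambda name: cosine(query, bow(EVENTS[name])),
                reverse=True)
print(ranked[0])  # Science Museum Day
```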
Even building datasets and engineering prompts for something like ChatGPT still requires a focus on data quality and governance to support a more “conversational” approach, Fayers said, adding that prompt engineering will become an essential skill in high demand.