Is Unstructured Data Actually Structured?

Matillion recently held the first anniversary, and first live, edition of its Deep Dish Data webinar, bringing industry experts together to share their experiences, challenges, and strategies. The conversation covered a range of topics, from data engineering fundamentals to generative AI. The panel consisted of Mark Balkenende, VP of Product Marketing, and Molly Sanbo, Director of Product Marketing at Matillion, alongside Joe Reis, Co-Author of ‘The Fundamentals of Data Engineering’ and practicing Data Engineer and Architect, as well as Mike Galvin, CEO and Co-Founder of OneSix.

The panel addressed the following questions: 

Is data modeling still a fundamental of data?

Have the fundamentals of data engineering been put to one side while AI takes center stage? People ask whether your data is ready, but how often do you find that data is actually ready for a particular use case or for consumption? Joe states, “Most data sets are gnarly, but we still have to focus on establishing the foundation and getting the fundamentals in place. For example, for AI to really work, we need to take some steps back and ensure the basic needs of our data are met by using techniques such as data modeling.” It’s interesting that fundamentals like data modeling were considered so important 10, 20, or 25 years ago, while today we have the luxury of modern data platforms like Databricks and Snowflake that perform well regardless of code quality.

There are a lot of foundational principles that apply regardless of technology. When you model data, you are ultimately modeling the business. To reduce and eliminate errors further down the line, you need to start structuring your data and modeling it for the business.
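To make that concrete, a simple dimensional model expresses the business explicitly in code. The sketch below is purely illustrative (the entities, fields, and values are invented, not from the webinar) and uses Python dataclasses to show how an explicit model catches a broken reference early rather than letting it surface later in a report:

```python
from dataclasses import dataclass
from datetime import date

# Illustrative star-schema-style model: dimensions describe the business,
# and the fact table records measurable events that reference them by key.

@dataclass(frozen=True)
class Customer:  # dimension
    customer_id: int
    name: str
    region: str

@dataclass(frozen=True)
class Product:  # dimension
    product_id: int
    name: str
    category: str

@dataclass(frozen=True)
class Sale:  # fact
    customer_id: int
    product_id: int
    sold_on: date
    amount: float

customers = {1: Customer(1, "Acme Ltd", "EMEA")}
products = {10: Product(10, "Starter Plan", "Subscriptions")}
sale = Sale(customer_id=1, product_id=10, sold_on=date(2024, 6, 1), amount=99.0)

# Because the model is explicit, a dangling reference is caught at load time.
assert sale.customer_id in customers and sale.product_id in products
```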

Is unstructured data actually structured? 

As new technologies make their mark, so does unstructured data. Some Matillion customers are all about unstructured data: how do we bring in our PDFs, video files, and images, and how do we merge them with our structured and semi-structured data? Other customers don’t yet know what data sets they have or what their data could make possible. So, what do we do with all this unstructured data? There are at least two main categories of users of unstructured data:

  • Advanced Users: Teams applying sophisticated technologies like computer vision to specific use cases involving unstructured data.
  • Basic Users: Most enterprises in this category require simple functionalities like viewing invoices through Power BI. These users prioritize storing data in a single location for easier governance and standardized security.

Additionally, some advanced users are conducting real-time analysis of unstructured data, which is a valuable capability.

Some Matillion customer examples: 

  1. Customer Support Tickets: You’re getting data in many forms, whether call transcripts, free text, or survey data. How do you run sentiment analysis on that? How do you use methodologies like RAG (retrieval-augmented generation) to cross-reference your commonly used answers? How do you automate that process without hiring more people? How do you automate alerts, and do you want to push notifications out to customers? 
  2. Data Classification: Using generative AI and bringing in different data sources. For example, a huge mass of data comes in from different sources. What’s my claim data? What’s my PII data that shouldn’t be in my Snowflake instance? 
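The RAG-style cross-referencing in the support-ticket example can be sketched minimally. The version below stands in for a real embedding model with plain bag-of-words cosine similarity, and the knowledge base of questions and answers is entirely invented:

```python
import math
from collections import Counter

# Hypothetical knowledge base of commonly used support answers.
KNOWN_ANSWERS = {
    "How do I reset my password?": "Use the 'Forgot password' link on the login page.",
    "Where can I download my invoice?": "Invoices are under Account > Billing.",
    "How do I cancel my subscription?": "Contact support or use Account > Plan.",
}

def _vectorize(text: str) -> Counter:
    """Naive bag-of-words vector; a real system would use embeddings."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_answer(ticket: str) -> str:
    """Return the canned answer whose question is most similar to the ticket."""
    q = _vectorize(ticket)
    best = max(KNOWN_ANSWERS, key=lambda known: _cosine(q, _vectorize(known)))
    return KNOWN_ANSWERS[best]

print(retrieve_answer("password reset not working, how do I reset it?"))
```

In a full RAG pipeline, the retrieved answer would then be passed to an LLM as context rather than returned verbatim.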


New technologies are making sentiment analysis and data classification much easier, but this ease can also introduce challenges, especially regarding security and compliance. However, the effectiveness and safety of AI applications still heavily depend on the quality and control of the underlying data sets. You can achieve impressive results with the right data model, but you risk making serious mistakes without it.
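On the classification side, a first-pass PII screen can be as simple as pattern matching before data ever lands in a warehouse. The patterns below are illustrative only and nowhere near production-grade coverage; real PII detection needs far broader, context-aware validation:

```python
import re

# Illustrative patterns only; invented example record, not real data.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def classify_record(text: str) -> set[str]:
    """Return the PII categories detected in a free-text record."""
    return {label for label, pattern in PII_PATTERNS.items() if pattern.search(text)}

record = "Claim filed by jane.doe@example.com, SSN 123-45-6789, call 555-867-5309."
print(classify_record(record))
```

Records flagged this way could be quarantined or masked before loading, rather than discovered in the warehouse after the fact.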

Ultimately, it all comes down to the underlying data model. Who would deploy a large language model on a corporate data set without ensuring proper data governance? If we don’t address these fundamentals, we're essentially in the same position as before.

The concept of 'know your data' includes assessing if your data is AI-ready, but ensuring your people are ready is equally important. Data literacy is crucial. No matter what technology is implemented, the technology won't be effective if employees don't understand the data and its specific implications for the business. It is essential to teach people the business context of the data and how it should be used. Engineering the data is one part, but if the people who need to leverage it don't fully understand it, the value of the data is lost. This is a significant issue.

Some additional questions from the webinar included: 

Do you find sentiment analysis and classification easier with generative AI than with traditional Natural Language Processing methods from ten years ago?

Answer: Joe states, “I feel like you can do more with semantic similarity than you used to be able to, so I’m curious whether that extends to sentiment; I’m curious to see what the research shows.” Molly mentions, “One thing we are noticing with customers at Matillion is that, with the boom of generative AI and talk of use cases such as churn prediction and sentiment analysis, a lot of customers have already been using machine learning for this but are now able to do it in a different, more automated way. Some of the use cases aren’t drastically new - the solutions are just more accessible.” Mike finishes by saying, “We are still scratching the surface with all this,” and gives a practical example of using AI to help gas station employees, who tend to be temporary staff, quickly access procedural information from manuals. The overall sentiment is that while significant improvements exist, there is still much untapped potential in integrating various data types for comprehensive business insights.
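For context on the "traditional NLP" baseline the question refers to, pre-LLM sentiment analysis often amounted to lexicon lookups. A toy sketch, with invented word lists far smaller than any real lexicon:

```python
# Tiny lexicon-based scorer in the spirit of pre-LLM sentiment analysis.
# The word lists are illustrative assumptions, not a real sentiment lexicon.
POSITIVE = {"great", "love", "helpful", "fast", "excellent"}
NEGATIVE = {"slow", "broken", "terrible", "confusing", "bug"}

def sentiment(text: str) -> str:
    """Score text by counting positive vs negative lexicon hits."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("The new connector is great, support was fast and helpful!"))  # positive
print(sentiment("The upgrade is slow and the docs are confusing."))  # negative
```

An LLM replaces the hand-built lexicon with learned semantics, which is largely why the same use case now needs far less bespoke engineering.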

Do you believe that fine-tuning LLMs significantly helps with compliance efforts such as GDPR / HIPAA? 

Answer: Mike mentions, “Very few people are thinking about doing that yet; it is so early.” Mark states, “You need to ensure that whatever you are doing with that type of data happens in a data stack that is very private, with a model that you control, that you own, and to which you have full access.” On the fine-tuning aspect, Joe mentions, “It depends on the training: are you doing this in house or using a third party? Doing it in-house, you have to weigh the price of fine-tuning against using something like RAG, as fine-tuning each time can be expensive.” 

Unstructured data always needs structure added to it to be analyzed. Do you agree with that, or can unstructured data be used to solve problems without adding structure?

Answer: Joe mentions that even unstructured data, like images and videos, has an inherent structure, and that integrating different data types is both a major challenge and an opportunity. Whether structure needs to be added depends on the use case. Mark adds that identifiers are needed to link data back to specific entities for actionable insights, although for pattern identification, identifiers may not be essential. Joe rounds out by noting that data was traditionally tagged manually for search, but algorithms can now automatically identify and label data, simplifying processes like image recognition. Though still evolving, advancements in handling unstructured data show great promise.
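As a small illustration of the point that "unstructured" text often carries inherent structure, a free-text note (invented here, along with its format) can be parsed into an identified, linkable record:

```python
import re

# Illustrative: a "free-text" support note actually carries recoverable
# structure (an identifier plus attributes) once you parse it.
note = "Ticket #4821 from acct A-77: app crashes on export, severity high."

pattern = re.compile(
    r"Ticket #(?P<ticket_id>\d+) from acct (?P<account>[\w-]+): "
    r"(?P<summary>.+), severity (?P<severity>\w+)\."
)
record = pattern.match(note).groupdict()
print(record)
# {'ticket_id': '4821', 'account': 'A-77',
#  'summary': 'app crashes on export', 'severity': 'high'}
```

The `account` field is exactly the kind of identifier Mark describes: it links the note back to a specific entity so the insight becomes actionable.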

Is AI mature enough to incorporate all the traditional ways of data governance and systems of truth, or are we still far off that goal?

Answer: Mike states that significant technological advancements over the past five years have allowed data governance teams to automate many previously manual processes. Modern tools can automatically tag, classify, assign ownership, and build workflows within reporting architectures and databases, centralizing governance. This shift has reduced reliance on individual knowledge and spreadsheets. While several tools now assist with this automation, integrating traditional data governance methods with AI is still developing and remains a complex, ongoing challenge.

Are organizations incorporating ROI analytics into their POCs with generative AI?

Answer: Molly addresses this by noting that implementing automated responses in customer support has shown a positive ROI. At Matillion, for instance, 20% of support tickets now receive fully automated responses, 45% are mostly automated with some human intervention, and there has been an overall 40% time saving. Tracking ROI through metrics like volume efficiency and time to answer is crucial. The implementation has also led to qualitative benefits, such as improved documentation and better resource utilization. Although still in the early stages, the organization is seeing significant, measurable, and qualitative benefits from incorporating ROI analytics into its processes.
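Using the percentages quoted above, a back-of-the-envelope time-savings calculation might look like the following; the ticket volume and handling time are assumed values, not figures from the webinar:

```python
# Back-of-the-envelope ROI sketch using the percentages quoted above.
monthly_tickets = 1_000      # assumption: ticket volume per month
minutes_per_ticket = 15      # assumption: average manual handling time

fully_automated = 0.20       # reported: fraction needing no human touch
overall_time_savings = 0.40  # reported: overall reduction in handling time

baseline_hours = monthly_tickets * minutes_per_ticket / 60
saved_hours = baseline_hours * overall_time_savings
print(f"{monthly_tickets * fully_automated:.0f} tickets fully automated")
print(f"Baseline: {baseline_hours:.0f} h/month, saved: {saved_hours:.0f} h/month")
# 200 tickets fully automated
# Baseline: 250 h/month, saved: 100 h/month
```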

Key takeaways 

Joe: This is an exciting time for data and AI, presenting a huge opportunity for data professionals to achieve their goals. We may be entering a new golden age of data, but we must build solid foundations to realize its full potential.

Molly: With the focus on AI, now is the time to return to basics. Engage with the business to identify key problems and explore new possibilities. The goal is to provide real value to the business with data.

Mike: Start small with proof of concepts (POCs) and leverage experienced partners to navigate challenges. Continuous education is crucial as the field evolves rapidly. Stay informed to keep up with changes.

Mark: The possibilities with new technology are vast, but it's essential to prioritize data literacy and governance. Avoid past mistakes by ensuring proper policies, governance, and security are in place from the start.

You can listen to the whole webinar here, or if you’d like to see Matillion’s AI capabilities in action, book a demo today.