Making use of Unstructured Data: Part I

AI Use Cases in Banking

May 07, 2025

Introduction

A white paper by IDC on “What Every Executive Needs to Know About Unstructured Data” points out that “In 2022, 90% of the data generated by organizations was unstructured, and only 10% was structured. That year, Organizations globally generated 57,280 exabytes of unstructured data….” Among its findings, the white paper also mentions that “Fifty-five percent (of respondents of the survey) say that less than half of all unstructured data is shared among employees or systems.”

Bankers encounter unstructured data throughout their day-to-day operations. It could be in the form of Emails, Company Annual Reports, Audit Reports (Internal/External), Contract Documents, Bank Guarantee/Letter of Credit Documents, Collateral Documents, Call transcripts from call centers, Photos/Images (such as from project site visits) and data from Media/Social Media.

For the most part, such unstructured data is either sparsely used or discarded, depriving organizations of valuable input for effective decision-making.

In a series of posts on Unstructured Data, I will explore alternative approaches for making effective use of Unstructured Data for the benefit of commercial banks.

In this post, I have tried to explain the basic concepts around the usage of unstructured data in banks.

What is Unstructured Data?

Wikipedia defines Unstructured Data as follows: “Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.”

MongoDB defines Unstructured Data as “Unstructured data is information that is not arranged according to a preset data model or schema, and therefore cannot be stored in a traditional relational database or RDBMS.”

In short, Unstructured Data refers to any data NOT held in a structured form (such as tables in an RDBMS). Broadly, such data can be either Semi-Structured or fully Unstructured.

Let’s look at some of the key differences among the data types.

Structured Data Vs Semi-Structured Data Vs Unstructured Data
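
To make the distinction concrete, here is a minimal Python sketch of the same loan request held in three forms. The table name, field names, customer ID and email wording are all made up for illustration, and the regular expression is only a stand-in for the much heavier extraction work that real free text needs.

```python
import json
import re
import sqlite3

# Structured: fixed schema, directly queryable with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (customer_id TEXT, amount REAL, currency TEXT)")
conn.execute("INSERT INTO loans VALUES ('C-1001', 250000.0, 'USD')")
structured_amount = conn.execute(
    "SELECT amount FROM loans WHERE customer_id = 'C-1001'"
).fetchone()[0]

# Semi-structured: self-describing keys, but the shape can vary per record.
api_log = (
    '{"customer_id": "C-1001", "event": "loan_request", '
    '"details": {"amount": 250000, "currency": "USD"}}'
)
semi_structured_amount = json.loads(api_log)["details"]["amount"]

# Unstructured: free text; the amount has to be inferred from the wording.
email_body = (
    "Dear team, following our call I would like to request a term loan "
    "of USD 250,000 for the new warehouse. Regards, A. Customer"
)
match = re.search(r"USD\s*([\d,]+)", email_body)
unstructured_amount = float(match.group(1).replace(",", "")) if match else None

print(structured_amount, semi_structured_amount, unstructured_amount)
```

The first two lookups rely on a known schema or known keys. The third works only because this particular email happens to phrase the amount as "USD 250,000"; a differently worded email would slip through, which is the core difficulty with unstructured data.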

Types of Unstructured Data

Examples (in a banking context)

Documents:
- PDFs: Contracts, Annual/Audit reports, or LC/BG Documents
- Word Documents: Ad-hoc internal/customer reports
Text-Based:
- Emails: From Customers, Internal Departments, Regulators, etc. - Free-form text with varying formats and content.
Multimedia:
- Images: Photos or satellite imagery of project sites
- Videos: Drone footage of projects etc.
- Audio: Voice recordings/call center logs of customer interactions
Web Content:
- Blogs: Articles with varying structures and embedded media
- Media Mentions: News, Tweets, Comments, or captions on media
- Customer Reviews: Unstructured feedback on sites
- Webpages: Regulatory Websites - HTML content with text, images, and links

Illustrative Usage of Unstructured Data

Unstructured data can be used in innovative ways to improve process efficiency and decision-making in banks. Some illustrative uses:

- Reading a document and identifying risky/non-standard clauses
- Sentiment analysis of commentaries/opinions in Annual Reports or Audit Reports
- Comparing two documents and ensuring similarity/completeness
- Automated triggering of customized responses based on customer emails
- Triggering workflows based on negative news in the media
- Creating chatbots on lengthy contracts/LC/BG documents for easy information retrieval
- Automated scrutiny of call records to understand the intent of a risky customer
- Analyzing satellite imagery/drone footage/real-time images to determine the extent of project completion
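
As a rough illustration of one item from this list, comparing two documents for similarity/completeness, the sketch below matches each clause of a standard template against the clauses of a received document using Python's built-in difflib. The clause texts, the function name compare_clauses and the 0.85 threshold are my own assumptions for the example; a production system would more likely use semantic (embedding-based) matching rather than character-level similarity.

```python
import difflib

def compare_clauses(standard, received, threshold=0.85):
    """For each standard clause, find its closest received clause and flag gaps."""
    report = []
    for clause in standard:
        best_ratio = 0.0
        for candidate in received:
            matcher = difflib.SequenceMatcher(None, clause.lower(), candidate.lower())
            best_ratio = max(best_ratio, matcher.ratio())
        status = "ok" if best_ratio >= threshold else "missing_or_changed"
        report.append((clause, status))
    return report

# Hypothetical clauses from a bank guarantee template and a received draft.
standard_clauses = [
    "The guarantee expires on 31 December 2025.",
    "This guarantee is governed by the laws of England.",
]
received_clauses = [
    "The guarantee expires on 31 December 2025.",
    "Claims must be presented in writing at the issuing branch.",
]

report = compare_clauses(standard_clauses, received_clauses)
for clause, status in report:
    print(f"{status}: {clause}")
```

Here the expiry clause is found unchanged, while the governing-law clause has no close counterpart in the received draft and is flagged for a human to review.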

Challenges in Processing Unstructured Data

Processing unstructured data in the banking industry presents unique challenges due to the sector's reliance on vast amounts of diverse, sensitive, and complex data, coupled with stringent regulatory requirements and the need for real-time insights. Below are some of the key challenges:

1. Data Volume and Variety

Challenge: Banks generate and collect massive amounts of unstructured data from sources like customer emails, call center recordings, social media interactions, loan applications (e.g., PDFs), and transaction notes. The sheer volume and diversity (text, audio, images, videos) overwhelm traditional data processing systems.

For Example: A bank may receive thousands of customer documents daily in PDF format, with varying layouts, handwritten notes, or scanned pages, making automated extraction difficult.

Impact: Manual processing is time-consuming, and scaling automated systems to handle diverse formats requires significant infrastructure investment.

2. Lack of Standardization

Challenge: Unstructured data lacks a consistent format or schema, unlike structured data in databases (e.g., transaction records). This makes it hard to extract relevant information systematically.

For Example: Customer complaints via emails or social media posts on platforms like X vary in tone, language, and structure, complicating sentiment analysis or issue categorization.

Impact: Inconsistent data formats hinder integration with existing banking systems, slowing down processes like fraud detection or customer service.

3. Data Quality and Noise

Challenge: Unstructured data often contains irrelevant or ambiguous information (noise), such as informal language, typos, or incomplete documents, which reduces the accuracy of analysis.

For Example: A customer’s loan application might include handwritten notes or incomplete fields, or a call recording might have background noise that garbles its transcript, making it hard to extract key details like income or credit history.

Impact: Poor data quality can lead to errors in credit risk assessments or customer profiling, affecting decision-making.

4. Complex Processing Requirements

Challenge: Extracting meaningful insights from unstructured data requires advanced technologies like natural language processing (NLP), optical character recognition (OCR), or image recognition, which are computationally intensive and require specialized expertise.

For Example: Analyzing scanned KYC (Know Your Customer) documents (e.g., passports, utility bills) involves OCR to extract text and NLP to validate information, but recognition errors can lead to compliance issues.

Impact: Banks must invest in AI tools and skilled data scientists, increasing operational costs.
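
One inexpensive safeguard against the recognition errors mentioned in the KYC example is to validate OCR output against expected formats before it reaches downstream systems. The sketch below assumes the OCR step has already produced a dictionary of fields. The field names, the passport number format (one letter followed by seven digits) and the date format are hypothetical, since real formats vary by country and document type.

```python
import re
from datetime import datetime

# Hypothetical format: one capital letter followed by seven digits.
PASSPORT_PATTERN = re.compile(r"[A-Z]\d{7}")

def validate_kyc_fields(fields):
    """Return the names of OCR-extracted fields that should go to manual review."""
    needs_review = []
    if not PASSPORT_PATTERN.fullmatch(fields.get("passport_no", "")):
        needs_review.append("passport_no")
    try:
        datetime.strptime(fields.get("date_of_birth", ""), "%d/%m/%Y")
    except ValueError:
        needs_review.append("date_of_birth")
    return needs_review

clean = {"passport_no": "K1234567", "date_of_birth": "14/02/1985"}
# Simulated OCR confusions: 'G' read for '6', letter 'O' read for digit '0'.
noisy = {"passport_no": "K12345G7", "date_of_birth": "14/O2/1985"}

print(validate_kyc_fields(clean))  # []
print(validate_kyc_fields(noisy))  # ['passport_no', 'date_of_birth']
```

Format checks like these do not confirm that a value is correct, only that it is plausible, so they complement rather than replace human review.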

5. Regulatory Compliance and Data Privacy

Challenge: Banking is heavily regulated, with data protection laws like GDPR and CCPA, along with AML (Anti-Money Laundering) rules, requiring secure handling, storage, and processing of customer data. Unstructured data, which often contains sensitive information (e.g., PII in emails), is harder to anonymize or audit.

For Example: A customer’s email containing account details or a recorded call discussing financial disputes must be processed to extract insights while ensuring compliance with data protection laws.

Impact: Non-compliance risks hefty fines, reputational damage, and legal consequences, necessitating robust governance frameworks.
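
To give a feel for what anonymizing free text involves, here is a minimal sketch that masks email addresses and account-like numbers in a customer email before it is passed on for analysis. The two patterns, the placeholder labels and the sample email are assumptions for illustration only; pattern matching alone will miss names, addresses and many local formats, so it should not be read as a compliant redaction method.

```python
import re

# Illustrative patterns only; real rules are locale- and bank-specific.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ACCOUNT": re.compile(r"\b\d{10,16}\b"),
}

def redact(text):
    """Replace each PII match with a label and count replacements for auditing."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, replaced = pattern.subn(f"[{label}]", text)
        counts[label] = replaced
    return text, counts

email = (
    "Please reverse the charge on account 1234567890123. "
    "You can reach me at jane.doe@example.com."
)
redacted, counts = redact(email)
print(redacted)
print(counts)
```

Keeping a count of what was masked, as done here, gives a simple audit trail without storing the sensitive values themselves.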

6. Real-Time Processing Needs

Challenge: Banking operations like fraud detection, customer service, or credit approvals often require real-time or near-real-time analysis, but processing unstructured data (e.g., transaction notes or social media signals) is time-intensive.

For Example: Detecting fraud by analyzing unstructured transaction descriptions or customer behavior on social media requires rapid processing, but current NLP models may struggle with latency.

Impact: Delays in processing can lead to missed fraud alerts or poor customer experiences, affecting trust and revenue.

7. Integration with Legacy Systems

Challenge: Many banks rely on legacy systems designed for structured data, which are incompatible with unstructured data processing tools. Integrating modern AI solutions with these systems is complex and costly.

For Example: A bank’s core banking system may not support direct analysis of unstructured loan contracts or customer chat logs, requiring middleware or system overhauls.

Impact: Integration challenges slow digital transformation and limit the ability to leverage unstructured data for competitive advantage.

8. Cost and Resource Intensity

Challenge: Processing unstructured data requires significant investment in storage (e.g., cloud solutions like AWS S3), AI tools (e.g., NLP platforms), and skilled personnel, which can strain budgets.

For Example: Implementing a system to analyze call center recordings for customer sentiment involves licensing NLP software, training models, and hiring data engineers.

Impact: High costs may deter smaller banks from adopting advanced unstructured data analytics, widening the gap with larger competitors.

9. Accuracy and Contextual Understanding

Challenge: Unstructured data often requires contextual interpretation, which AI models may struggle with, especially in nuanced banking scenarios involving slang, regional dialects, or domain-specific jargon.

For Example: An NLP model analyzing customer emails might misinterpret sarcastic complaints or banking-specific terms like “overdraft protection,” leading to incorrect sentiment analysis.

Impact: Inaccurate insights can result in misguided strategies, such as targeting the wrong customers for loan offers or missing critical risk signals.
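
The sarcasm and jargon problem is easy to reproduce with a deliberately naive word-counting sentiment scorer. The word lists and sample messages below are invented for the example, and real NLP models are far more capable than this, but the failure mode is the same in kind: without context, the surface words point the wrong way.

```python
import re

# Tiny hand-made lexicon, purely illustrative.
POSITIVE = {"great", "thanks", "love", "helpful", "wonderful"}
NEGATIVE = {"fraud", "complaint", "terrible", "unacceptable", "overdraft"}

def lexicon_score(text):
    """Count positive words minus negative words, ignoring all context."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives

sarcastic = "Great, another fee. Thanks a lot for the wonderful service."
neutral_query = "How do I enable overdraft protection on my account?"

print(lexicon_score(sarcastic))      # 3
print(lexicon_score(neutral_query))  # -1
```

The sarcastic complaint scores as strongly positive, while a routine product question scores as negative simply because it mentions "overdraft".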

Conclusion

In banking, structured, semi-structured, and unstructured data are pivotal.

Structured data, like transaction records, fuels core operations with its organized format. Semi-structured data, such as API logs, enables digital banking and fraud detection with adaptable structures.

Unstructured data, encompassing emails, call recordings, and social media posts, dominates in volume, offering deep customer insights but presenting significant processing and compliance hurdles due to its lack of predefined format.

Advanced AI, NLP, and cloud solutions are crucial for extracting value from unstructured data, enabling banks to personalize services, detect fraud, and monitor brand sentiment. Integrating these data types ensures banks remain agile and competitive in a data-driven landscape.

In the next part of the series, I will discuss how some of these unstructured data types could be processed and used for the benefit of the banking industry.

References

UNTAPPED VALUE: What Every Executive Needs to Know About Unstructured Data by IDC [https://images.g2crowd.com/uploads/attachment/file/1350731/IDC-Unstructured-Data-White-Paper.pdf]
Structured vs. unstructured data: What's the difference by IBM [https://www.ibm.com/think/topics/structured-vs-unstructured-data]

Fintrails
