Introdcution
A white paper by IDC on “What Every Executive Needs to Know About Unstructured Data” points out that “In 2022, 90% of the data generated by organizations was unstructured, and only 10% was structured. That year, Organizations globally generated 57,280 exabytes of unstructured data….” . As part of the findings, the whitepaper also mentions “Fifty-five percent (of respondents of the survey) say that less than half of all unstructured data is shared among employees or systems.”
Bankers often encounter such scenarios involving unstructured data in their day-to-day operations. These documents could be in the form of Emails, Company Annual Reports, Audit Reports (Internal/External), Contract Documents, Bank Guarantee/Letter of Credit Documents, Collateral Documents, Call transcripts from call centers, Photos/Images (such as from project site visits) and data from Media/Social Media.
Predominantly, such unstructured data is either sparsely used or discarded, depriving organizations of valuable input for effective decision-making.
In a series of posts on Unstructured Data, I will explore alternative approaches for making effective use of Unstructured Data for the benefit of commercial banks.
In this post, I have tried to explain the basic concepts around the usage of unstructured data in banks.
What is Unstructured Data?
Wiki defines Unstructured Data as “Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.”
MongoDB defines Unstructured Data as “Unstructured data is information that is not arranged according to a preset data model or schema, and therefore cannot be stored in a traditional relational database or RDBMS.”
In short, Unstructured Data refers to any data NOT in a structured form (such as RDBMS). Broadly, such data can be either Semi-Structured or fully Unstructured.
Let’s look at some of the key differences among the data types.
Structured Data Vs Semi-Structured Data Vs Unstructured Data
Types of Unstructured Data
Examples (in a banking context)
Documents:
PDFs: Contracts, Annual/Audit reports, or LC/BG Documents
Word Documents: Ad-hoc internal/customer reports
Text-Based:
Emails: From Customers, Internal Departments, Regulators, etc. - Free-form text with varying formats and content.
Multimedia:
Images: Photos or satellite imagery of project sites
Videos: Drone footage of projects etc.
Audio: Voice recordings/call center logs of customer interactions
Web Content:
Blogs: Articles with varying structures and embedded media
Media Mentions: News , Tweets, Comments, or captions on media
Customer Reviews: Unstructured feedback on sites
Webpages: Regulatory Websites - HTML content with text, images, and links
Illustrative Usage of Unstructured Data
Unstructured data can be very innovatively used to improve process efficiency and decision making in banks; Illustratively:
Reading a document and identifying risky/non-standard clauses
Sentiment analysis of commentaries/opinions in Annual Reports or Audit Reports
Comparing two documents and ensuring similarity/completeness
Automated triggering of customized responses based on customer emails
Triggering workflows based on negative news in the media
Creating chatbots on lengthy contracts/LC/BG documents for easy information retrieval
Automated scrutiny of call records to understand the intent of a risky customer
Analyzing satellite imagery/drone footage/real-time images to determine the extent of project completion
Challenges in Processing Unstructured Data
Processing unstructured data in the banking industry presents unique challenges due to the sector's reliance on vast amounts of diverse, sensitive, and complex data, coupled with stringent regulatory requirements and the need for real-time insights. Below are some of the key challenges:
1. Data Volume and Variety
Challenge: Banks generate and collect massive amounts of unstructured data from sources like customer emails, call center recordings, social media interactions, loan applications (e.g., PDFs), and transaction notes. The sheer volume and diversity (text, audio, images, videos) overwhelm traditional data processing systems.
For Example: A bank may receive thousands of customer documents daily in PDF format, each with varying layouts, handwritten notes, or scanned documents, making automated extraction difficult.
Impact: Manual processing is time-consuming, and scaling automated systems to handle diverse formats requires significant infrastructure investment.
2. Lack of Standardization
Challenge: Unstructured data lacks a consistent format or schema, unlike structured data in databases (e.g., transaction records). This makes it hard to extract relevant information systematically.
For Example: Customer complaints via emails or social media posts on platforms like X vary in tone, language, and structure, complicating sentiment analysis or issue categorization.
Impact: Inconsistent data formats hinder integration with existing banking systems, slowing down processes like fraud detection or customer service.
3. Data Quality and Noise
Challenge: Unstructured data often contains irrelevant or ambiguous information (noise), such as informal language, typos, or incomplete documents, which reduces the accuracy of analysis.
For Example: A customer’s loan application might include handwritten notes or incomplete fields, or a call transcript might have background noise, making it hard to extract key details like income or credit history.
Impact: Poor data quality can lead to errors in credit risk assessments or customer profiling, affecting decision-making.
4. Complex Processing Requirements
Challenge: Extracting meaningful insights from unstructured data requires advanced technologies like natural language processing (NLP), optical character recognition (OCR), or image recognition, which are computationally intensive and require specialized expertise.
For Example: Analyzing scanned KYC (Know Your Customer) documents (e.g., passports, utility bills) involves OCR to extract text and NLP to validate information, but recognition errors can lead to compliance issues.
Impact: Banks must invest in AI tools and skilled data scientists, increasing operational costs.
5. Regulatory Compliance and Data Privacy
Challenge: Banking is heavily regulated, with laws like GDPR, CCPA, or AML (Anti-Money Laundering) requiring secure handling, storage, and processing of customer data. Unstructured data, often containing sensitive information (e.g., PII in emails), is harder to anonymize or audit.
For Example: A customer’s email containing account details or a recorded call discussing financial disputes must be processed to extract insights while ensuring compliance with data protection laws.
Impact: Non-compliance risks hefty fines, reputational damage, and legal consequences, necessitating robust governance frameworks.
6. Real-Time Processing Needs
Challenge: Banking operations like fraud detection, customer service, or credit approvals often require real-time or near-real-time analysis, but processing unstructured data (e.g., transaction notes or social media signals) is time-intensive.
Banking Example: Detecting fraud by analyzing unstructured transaction descriptions or customer behavior on social media requires rapid processing, but current NLP models may struggle with latency.
Impact: Delays in processing can lead to missed fraud alerts or poor customer experiences, affecting trust and revenue.
7. Integration with Legacy Systems
Challenge: Many banks rely on legacy systems designed for structured data, which are incompatible with unstructured data processing tools. Integrating modern AI solutions with these systems is complex and costly.
For Example: A bank’s core banking system may not support direct analysis of unstructured loan contracts or customer chat logs, requiring middleware or system overhauls.
Impact: Integration challenges slow digital transformation and limit the ability to leverage unstructured data for competitive advantage.
8. Cost and Resource Intensity
Challenge: Processing unstructured data requires significant investment in storage (e.g., cloud solutions like AWS S3), AI tools (e.g., NLP platforms), and skilled personnel, which can strain budgets.
For Example: Implementing a system to analyze call center recordings for customer sentiment involves licensing NLP software, training models, and hiring data engineers.
Impact: High costs may deter smaller banks from adopting advanced unstructured data analytics, widening the gap with larger competitors.
9. Accuracy and Contextual Understanding
Challenge: Unstructured data often requires contextual interpretation, which AI models may struggle with, especially in nuanced banking scenarios involving slang, regional dialects, or domain-specific jargon.
For Example: An NLP model analyzing customer emails might misinterpret sarcastic complaints or banking-specific terms like “overdraft protection,” leading to incorrect sentiment analysis.
Impact: Inaccurate insights can result in misguided strategies, such as targeting the wrong customers for loan offers or missing critical risk signals.
Conclusion
In banking, structured, semi-structured, and unstructured data are pivotal.
Structured data, like transaction records, fuels core operations with its organized format. Semi-structured data, such as API logs, enables digital banking and fraud detection with adaptable structures.
Unstructured data, encompassing emails, call recordings, and social media posts, dominates in volume, offering deep customer insights but presenting significant processing and compliance hurdles due to its lack of predefined format.
Advanced AI, NLP, and cloud solutions are crucial for extracting value from unstructured data, enabling banks to personalize services, detect fraud, and monitor brand sentiment. Integrating these data types ensures banks remain agile and competitive in a data-driven landscape.
In the subsequent part of the series, I will discuss more about how some of the unstructured data types could be processed and used for the benefit of the banking industry.
References
UNTAPPED VALUE: What Every Executive Needs to Know About Unstructured Data by IDC [https://images.g2crowd.com/uploads/attachment/file/1350731/IDC-Unstructured-Data-White-Paper.pdf]
Structured vs. unstructured data: What's the difference by IBM [https://www.ibm.com/think/topics/structured-vs-unstructured-data]


