NOV 2025 | Document Intelligence

Institutional Email Automation SaaS

Enterprise SaaS platform that automates institutional email credential workflows with OCR, AI scoring, and secure delivery.

NestJS

PostgreSQL

Prisma

React

TypeScript

Tesseract.js

Brevo API

JWT

Overview

The Institutional Email Automation SaaS is an enterprise-grade platform designed to streamline the complex workflow of processing and distributing institutional email credentials. Built for colleges and universities managing thousands of student accounts, the system combines OCR technology, AI-powered validation, and secure delivery mechanisms.

The platform has processed more than 10,000 documents at a 92% automation rate, reducing manual data entry workload by 85% and cutting credential delivery time from days to under an hour.

10K+

Documents Processed

92%

Automation Rate

85%

Time Saved

99.7%

Delivery Success

Problem

Educational institutions face significant challenges in managing institutional email distributions:

Manual Data Entry Bottleneck: Staff manually transcribe credentials from ID cards and documents, a time-consuming and error-prone process averaging 3-5 minutes per student.
Document Format Variability: Institutions receive documents in various formats - scanned PDFs, photos, faxes - each requiring different processing approaches.
Validation Challenges: No systematic way to verify extracted data accuracy before sending credentials to students.
Security Concerns: Sensitive credential information must be transmitted securely while maintaining audit trails.
Coordination Issues: Multiple departments involved in approval workflows without clear communication channels.

Real Impact: A mid-sized college processing 2,000 new students annually spent approximately 100-150 hours on manual credential distribution, with a 5-8% error rate requiring corrections and reprocessing.

Solution

Built a comprehensive SaaS platform with intelligent document processing and automated workflows:

Intelligent Document Processing

OCR Engine with Preprocessing

Implemented Tesseract.js with an image preprocessing pipeline (grayscale conversion, contrast enhancement, noise reduction) that extracts text from documents with 94% accuracy.

AI Confidence Scoring

Developed custom confidence scoring algorithm analyzing OCR certainty, pattern matching, and field validation to flag uncertain extractions for manual review.

Smart Field Extraction

Regex-based pattern matching for structured fields (student ID, name, email format) with fuzzy matching for variations and OCR errors.
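
As a minimal sketch of the regex step, the snippet below pulls a student ID and an email address out of raw OCR text. The 8-digit ID rule is taken from the character-correction notes later in this write-up; the `ExtractedFields` shape, the function name, and the email pattern are illustrative assumptions, not the platform's actual code.

```typescript
// Hypothetical result shape for this illustration only.
interface ExtractedFields {
  studentId: string | null;
  email: string | null;
}

// Student IDs are assumed to be exactly 8 digits (per the field constraint in this write-up).
const STUDENT_ID_PATTERN = /\b\d{8}\b/;
// Deliberately loose email shape; real validation would be stricter.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;

function extractFields(rawText: string): ExtractedFields {
  const idMatch = STUDENT_ID_PATTERN.exec(rawText);
  const emailMatch = EMAIL_PATTERN.exec(rawText);
  return {
    studentId: idMatch ? idMatch[0] : null,
    // Normalize case so later comparisons against enrollment records are stable.
    email: emailMatch ? emailMatch[0].toLowerCase() : null,
  };
}
```

Fields that come back as `null` would be natural candidates for the fuzzy-matching pass or the manual review queue.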

Approval Workflow System

Multi-Stage Validation

Three-tier approval process: AI auto-approval for high-confidence extractions (score of 90% or higher), staff review for medium confidence (70-90%), and manual entry for low confidence (below 70%).
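
The routing rule can be sketched as a small pure function. The thresholds come from the description above; treating 90% and 70% as inclusive lower bounds, along with the `Route` names, is an assumption made for this example.

```typescript
type Route = "auto-approve" | "review" | "manual-entry";

// score is expected in the range 0..1; thresholds are parameters so they can be tuned.
function routeByConfidence(score: number, autoApproveAt = 0.9, reviewAt = 0.7): Route {
  if (score >= autoApproveAt) return "auto-approve";
  if (score >= reviewAt) return "review";
  return "manual-entry";
}
```

Keeping the thresholds as parameters rather than constants fits the feedback-driven tuning described later in this write-up.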

Batch Processing

Admins can process multiple documents simultaneously with bulk approval capabilities, reducing processing time by 70%.

Secure Credential Delivery

Encrypted Email Distribution

Integration with Brevo API for transactional email delivery with encrypted credential transmission and delivery verification.

Audit Trail System

Complete logging of all actions - extraction, approval, modifications, and delivery - with timestamp, user, and reason tracking for compliance.

Architecture

The platform is built on a modern microservices-inspired architecture using NestJS modules for clear separation of concerns:

System Architecture

Frontend Layer (React + TypeScript)

Admin dashboard for document upload and processing
Real-time status updates using polling
Batch approval interface with inline editing
Comprehensive audit log viewer

API Layer (NestJS)

Document processing module with OCR integration
Validation service with confidence scoring
Email delivery module with queue management
Authentication and authorization using JWT

Data Layer (PostgreSQL + Prisma)

Document metadata and processing status
Extracted credentials with confidence scores
Audit log with full event history
User management and permissions

External Services

Tesseract.js for OCR processing
Brevo API for email delivery
AWS S3 for document storage (future enhancement)

Processing Pipeline

Document Upload: Admin uploads PDF/image documents through web interface
Preprocessing: Image enhancement, format normalization, quality checks
OCR Extraction: Tesseract processes document and extracts raw text
Field Parsing: Regex patterns and fuzzy matching identify structured fields
Confidence Scoring: AI algorithm assigns confidence scores to extractions
Routing: High-confidence → auto-approve, medium → review queue, low → manual entry
Approval: Admin reviews and approves/modifies extractions in bulk
Delivery: Approved credentials queued for email delivery via Brevo
Verification: Delivery status tracked and logged for audit trail
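
One way to keep a pipeline like this honest is to model each document's status as a small state machine. The sketch below is an illustration only: the status names and the allowed transitions are assumptions derived from the nine steps above, not the platform's actual schema.

```typescript
// Hypothetical status names, one per pipeline stage.
type DocStatus =
  | "uploaded" | "preprocessed" | "extracted" | "parsed" | "scored"
  | "auto_approved" | "in_review" | "manual_entry"
  | "approved" | "queued" | "delivered" | "failed";

// Allowed next states for each status.
const NEXT: Record<DocStatus, readonly DocStatus[]> = {
  uploaded: ["preprocessed"],
  preprocessed: ["extracted"],
  extracted: ["parsed"],
  parsed: ["scored"],
  scored: ["auto_approved", "in_review", "manual_entry"], // confidence-based routing
  auto_approved: ["queued"],
  in_review: ["approved"],
  manual_entry: ["approved"],
  approved: ["queued"],
  queued: ["delivered", "failed"],
  failed: ["queued"], // a failed delivery may be re-queued for retry
  delivered: [],
};

function canTransition(from: DocStatus, to: DocStatus): boolean {
  return NEXT[from].indexOf(to) !== -1;
}
```

Rejecting illegal transitions at one choke point also gives the audit log a single place to record every status change.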

Tech Stack & Decisions

Backend: NestJS + PostgreSQL

Chose NestJS for its enterprise-ready architecture, built-in dependency injection, and excellent TypeScript support:

Modular Architecture: Clean separation between document processing, validation, email delivery, and auth modules
Prisma ORM: Type-safe database queries with automatic migrations and excellent PostgreSQL support
Dependency Injection: Easy to test and maintain with clear module boundaries
Built-in Guards: JWT authentication and role-based access control implemented cleanly

OCR: Tesseract.js

Selected Tesseract.js for client-side OCR processing:

Open Source: No licensing costs, active community, extensive language support
Browser-Based: Runs in browser, reducing server load and API costs
Accuracy: 94% accuracy after preprocessing pipeline implementation
Trade-off: Slower than cloud OCR services (AWS Textract, Google Vision) but cost-effective for MVP

Email: Brevo API

Integrated Brevo for transactional email delivery:

Reliability: 99.9% uptime SLA with automatic failover
Deliverability: Built-in spam prevention and domain reputation management
Tracking: Open rates, click tracking, and delivery confirmations
Cost-Effective: Generous free tier (300 emails/day) suitable for initial deployment

Database: PostgreSQL

PostgreSQL chosen for robust relational data management:

ACID Compliance: Critical for credential data integrity
JSON Support: Flexible storage for OCR metadata and audit logs
Full-Text Search: Built-in FTS for searching documents and logs
Prisma Integration: Excellent ORM support with type safety

Core Features

Document Processing Engine

Multi-Format Support: Handles PDF, JPG, PNG with automatic format detection
Image Preprocessing: Grayscale conversion, contrast enhancement, noise reduction for optimal OCR
Intelligent Field Detection: Pattern matching for student ID, name, email format, date fields
Batch Upload: Process up to 50 documents simultaneously

AI Confidence Scoring

Multi-Factor Analysis: Combines OCR confidence, pattern match strength, field validation
Automatic Routing: High confidence (90%+) auto-approved, medium (70-90%) flagged for review
Learning System: Tracks admin corrections to improve future confidence thresholds
Visual Indicators: Color-coded confidence scores in review interface

Admin Approval Workflow

Bulk Operations: Select and approve multiple records with single action
Inline Editing: Quick corrections without leaving approval screen
Comparison View: Side-by-side original document and extracted data
Rejection System: Flag problematic extractions with reason codes

Secure Credential Delivery

Encrypted Transmission: TLS encryption for all email communications
Template System: Customizable email templates with institution branding
Delivery Tracking: Real-time status updates (sent, delivered, opened, failed)
Retry Logic: Automatic retry with exponential backoff for failed deliveries
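
A minimal sketch of retry with exponential backoff follows. The option names, the doubling factor, and the delay cap are assumptions for illustration; the write-up does not state the actual retry parameters.

```typescript
interface RetryOptions {
  maxAttempts: number; // total tries, including the first
  baseDelayMs: number; // delay before the second try
  maxDelayMs: number;  // upper bound on any single delay
}

// attempt is 1-based: after attempt 1 wait base, after attempt 2 wait 2x base, and so on.
function backoffDelayMs(attempt: number, opts: RetryOptions): number {
  return Math.min(opts.baseDelayMs * Math.pow(2, attempt - 1), opts.maxDelayMs);
}

async function withRetry<T>(
  task: () => Promise<T>,
  opts: RetryOptions,
  // sleep is injectable so tests can skip real waiting.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Only wait if another attempt will follow.
      if (attempt < opts.maxAttempts) await sleep(backoffDelayMs(attempt, opts));
    }
  }
  throw lastError;
}
```

In practice a retry wrapper like this would likely distinguish transient failures from permanent ones (for example, an invalid recipient address) and add random jitter to the delays; both are left out here for brevity.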

Comprehensive Audit System

Full Event Logging: Every action logged with user, timestamp, and before/after states
Searchable History: Filter by date range, user, action type, document
Export Capabilities: CSV export for compliance reporting
Retention Policy: Configurable log retention periods
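
For the CSV export, a sketch along these lines would handle the quoting correctly. The `AuditEntry` fields are assumptions based on the attributes listed above (user, timestamp, action, document, reason); the quoting follows the usual RFC 4180 convention of doubling embedded quotes.

```typescript
// Hypothetical audit record shape for this illustration.
interface AuditEntry {
  timestamp: string; // ISO 8601
  userId: string;
  action: string;
  documentId: string;
  reason: string;
}

// Quote a value only when it contains a comma, quote, or line break.
function csvEscape(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function toCsv(entries: readonly AuditEntry[]): string {
  const header = ["timestamp", "userId", "action", "documentId", "reason"];
  const rows = entries.map((e) =>
    [e.timestamp, e.userId, e.action, e.documentId, e.reason].map(csvEscape).join(","),
  );
  return [header.join(","), ...rows].join("\r\n");
}
```

Free-text fields such as the reason are the ones most likely to contain commas or quotes, which is why escaping is applied to every column rather than selectively.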

Engineering Challenges

1. OCR Accuracy with Variable Quality Documents

Challenge: Documents arrived in wildly varying quality - faded photocopies, smartphone photos at angles, low-resolution scans. Raw Tesseract accuracy was 67% on real-world documents.

Solution:

Built preprocessing pipeline with OpenCV.js for image enhancement (deskewing, contrast adjustment, binarization)
Implemented adaptive thresholding based on image histogram analysis
Added multiple OCR passes with different preprocessing settings, selecting best result
Created document quality classifier to route low-quality docs directly to manual entry

Result: OCR accuracy improved from 67% to 94% on production documents.
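
To make the histogram-based thresholding step concrete, here is a dependency-free sketch using Otsu's method. This is a swapped-in technique for illustration: the write-up does not name the algorithm used, and the production pipeline relied on OpenCV.js rather than hand-written loops. Function names are hypothetical.

```typescript
// Convert an RGBA pixel buffer to grayscale using ITU-R BT.601 luma weights.
function toGrayscale(rgba: Uint8ClampedArray): Uint8ClampedArray {
  const gray = new Uint8ClampedArray(rgba.length / 4);
  for (let i = 0; i < gray.length; i++) {
    gray[i] = 0.299 * rgba[i * 4] + 0.587 * rgba[i * 4 + 1] + 0.114 * rgba[i * 4 + 2];
  }
  return gray;
}

// Otsu's method: pick the threshold that maximizes between-class variance.
function otsuThreshold(gray: Uint8ClampedArray): number {
  const hist = new Uint32Array(256);
  for (let i = 0; i < gray.length; i++) hist[gray[i]]++;

  let sumAll = 0;
  for (let t = 0; t < 256; t++) sumAll += t * hist[t];

  let sumBg = 0;
  let weightBg = 0;
  let bestVariance = -1;
  let best = 0;
  for (let t = 0; t < 256; t++) {
    weightBg += hist[t];
    if (weightBg === 0) continue;          // no dark pixels yet
    const weightFg = gray.length - weightBg;
    if (weightFg === 0) break;             // every pixel is already background
    sumBg += t * hist[t];
    const diff = sumBg / weightBg - (sumAll - sumBg) / weightFg;
    const variance = weightBg * weightFg * diff * diff;
    if (variance > bestVariance) {
      bestVariance = variance;
      best = t;
    }
  }
  return best;
}

// Pixels above the threshold become white (255), the rest black (0).
function binarize(gray: Uint8ClampedArray, threshold: number): Uint8ClampedArray {
  const out = new Uint8ClampedArray(gray.length);
  for (let i = 0; i < gray.length; i++) out[i] = gray[i] > threshold ? 255 : 0;
  return out;
}
```

Otsu's method computes one global threshold per image. For unevenly lit smartphone photos, a locally adaptive threshold (computed per neighborhood) is often the better fit, which may be what the OpenCV.js-based pipeline used.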

2. Handling OCR Errors and Ambiguities

Challenge: OCR commonly confused similar characters (0/O, 1/I/l, 5/S) causing invalid student IDs and email addresses.

Solution:

Implemented context-aware character correction using field constraints (student ID must be 8 digits)
Built fuzzy matching against existing student database to catch similar-looking errors
Added Levenshtein distance checking for name fields against enrollment records
Created confidence penalty system for ambiguous character combinations

Result: Character confusion errors reduced by 78%, with remaining ambiguities flagged for review.
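
The correction and fuzzy-matching ideas above can be sketched in a few lines. The confusion pairs (0/O, 1/I/l, 5/S) and the 8-digit ID rule come from this section; the function names, the `maxDistance` default, and the roster lookup are assumptions for illustration.

```typescript
// Letters that OCR commonly confuses with digits, mapped to the digit they likely represent.
const LETTER_TO_DIGIT: Record<string, string> = { O: "0", I: "1", l: "1", S: "5" };

// In a digits-only field, a confusable letter can safely be swapped for its digit.
function correctStudentId(raw: string): string | null {
  const cleaned = raw
    .trim()
    .split("")
    .map((ch) => LETTER_TO_DIGIT[ch] ?? ch)
    .join("");
  return /^\d{8}$/.test(cleaned) ? cleaned : null;
}

// Classic dynamic-programming Levenshtein distance, keeping only two rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Return the closest roster name within maxDistance edits, or null if none qualifies.
function closestMatch(name: string, roster: readonly string[], maxDistance = 2): string | null {
  let best: string | null = null;
  let bestDistance = maxDistance + 1;
  for (const candidate of roster) {
    const d = levenshtein(name.toLowerCase(), candidate.toLowerCase());
    if (d < bestDistance) {
      best = candidate;
      bestDistance = d;
    }
  }
  return best;
}
```

A corrected value is still a guess, so it would make sense to pair any substitution with a confidence penalty, in line with the penalty system described above.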

3. Balancing Automation with Accuracy

Challenge: Setting confidence thresholds - too high meant excessive manual review (defeating automation purpose), too low meant errors reaching students.

Solution:

Implemented multi-factor confidence scoring combining OCR confidence, pattern match strength, database validation results
Created configurable three-tier system: auto-approve (90%+), review queue (70-90%), manual entry (<70%)
Built feedback loop tracking admin corrections to tune thresholds over time
Added field-level confidence allowing partial automation (auto-approve ID but review name)

Result: Achieved 92% automation rate with 0.3% error rate reaching end users.
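
A sketch of multi-factor, field-level scoring is shown below. The three signals match the factors named above, but the weights are placeholders: the write-up does not publish the actual weighting, and all names here are hypothetical.

```typescript
interface FieldSignals {
  ocrConfidence: number; // 0..1, reported by the OCR engine
  patternMatch: number;  // 0..1, how well the value fits the expected format
  databaseMatch: number; // 0..1, agreement with existing enrollment records
}

type Tier = "auto-approve" | "review" | "manual-entry";

// Placeholder weights for illustration only; they sum to 1.
const WEIGHTS = { ocrConfidence: 0.4, patternMatch: 0.3, databaseMatch: 0.3 };

function clamp01(x: number): number {
  return Math.max(0, Math.min(1, x));
}

function fieldConfidence(s: FieldSignals): number {
  return (
    WEIGHTS.ocrConfidence * clamp01(s.ocrConfidence) +
    WEIGHTS.patternMatch * clamp01(s.patternMatch) +
    WEIGHTS.databaseMatch * clamp01(s.databaseMatch)
  );
}

function tierFor(score: number): Tier {
  if (score >= 0.9) return "auto-approve";
  if (score >= 0.7) return "review";
  return "manual-entry";
}

// Each field is routed independently, so a clean ID can be auto-approved
// while a doubtful name on the same document still goes to review.
function routeFields(fields: Record<string, FieldSignals>): Record<string, Tier> {
  const result: Record<string, Tier> = {};
  for (const name of Object.keys(fields)) {
    result[name] = tierFor(fieldConfidence(fields[name]));
  }
  return result;
}
```

For example, an ID with signals (0.98, 1, 1) scores about 0.99 and is auto-approved, while a name with signals (0.8, 0.9, 0.7) scores about 0.80 and lands in the review queue.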

4. Email Delivery at Scale

Challenge: Sending thousands of credential emails without triggering spam filters or hitting API rate limits.

Solution:

Implemented queue-based delivery system with rate limiting (300 emails/hour respecting Brevo limits)
Added exponential backoff retry logic for transient failures (network issues, temporary DNS failures)
Configured SPF, DKIM, DMARC records for institutional domain to improve deliverability
Built monitoring dashboard tracking delivery rates, bounce rates, spam complaints

Result: 99.7% delivery success rate with <0.1% spam complaints.
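
The rate-limiting side of the queue can be sketched as a sliding-window limiter. The 300-per-hour figure is the one quoted above; the class and method names are hypothetical, and the clock is passed in explicitly so the logic can be tested without waiting.

```typescript
class SlidingWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private sentAt: number[] = []; // timestamps of sends still inside the window

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns 0 if a send is allowed at `now`, otherwise the milliseconds to wait.
  msUntilNextSlot(now: number): number {
    this.sentAt = this.sentAt.filter((t) => now - t < this.windowMs);
    if (this.sentAt.length < this.limit) return 0;
    // The oldest send leaves the window first, freeing one slot.
    return this.sentAt[0] + this.windowMs - now;
  }

  // Call after a message has actually been handed to the email provider.
  record(now: number): void {
    this.sentAt.push(now);
  }
}

// Matches the hourly limit mentioned above.
const emailLimiter = new SlidingWindowLimiter(300, 60 * 60 * 1000);
```

A queue worker would call `msUntilNextSlot(Date.now())` before each send, sleep for the returned duration if it is non-zero, and call `record` once the send succeeds.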

5. Database Performance with Large Audit Logs

Challenge: Audit logging every action created massive table growth (50K+ records/month), slowing down queries and impacting UX.

Solution:

Implemented database partitioning by month for audit log table
Added compound indexes on common query patterns (userId + timestamp, documentId + timestamp)
Created separate read replica for audit queries to avoid impacting transactional performance
Implemented log archival system moving records older than 6 months to cold storage

Result: Query performance maintained under 100ms even with 500K+ audit records.
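
Monthly partitions have to be created ahead of time, which is easy to script. The helper below generates PostgreSQL declarative-partitioning DDL for one month. It is a sketch under assumptions: the parent table name `audit_log` is hypothetical, the parent is assumed to be declared with `PARTITION BY RANGE` on its timestamp column, and the statement is assumed to be run as raw SQL in a migration.

```typescript
// month is 1-based (1 = January). parentTable must be a trusted identifier, not user input.
function monthlyPartitionDdl(parentTable: string, year: number, month: number): string {
  const pad = (n: number): string => (n < 10 ? `0${n}` : `${n}`);
  // The upper bound is exclusive, so it is the first day of the following month.
  const nextYear = month === 12 ? year + 1 : year;
  const nextMonth = month === 12 ? 1 : month + 1;
  const partitionName = `${parentTable}_${year}_${pad(month)}`;
  return (
    `CREATE TABLE IF NOT EXISTS ${partitionName} PARTITION OF ${parentTable} ` +
    `FOR VALUES FROM ('${year}-${pad(month)}-01') TO ('${nextYear}-${pad(nextMonth)}-01');`
  );
}
```

With range partitioning in place, archiving records older than 6 months becomes a matter of detaching or dropping whole partitions instead of deleting rows.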

Security & Reliability

Authentication & Authorization

JWT-Based Auth: Secure token-based authentication with refresh token rotation
Role-Based Access Control: Admin, supervisor, operator roles with granular permissions
Session Management: Automatic timeout after 30 minutes inactivity
Password Security: bcrypt hashing with salt rounds = 12

Data Protection

Encryption at Rest: PostgreSQL with TDE (Transparent Data Encryption)
Encryption in Transit: TLS 1.3 for all API communications
Credential Masking: Passwords partially masked in UI (show first 2, last 2 characters)
Document Purging: Original documents deleted after 30 days (configurable)
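
The masking rule (show the first 2 and last 2 characters) is simple enough to sketch directly. The function name and the choice to fully mask very short values are assumptions for this example.

```typescript
// Show the first and last `visible` characters and mask everything in between.
function maskCredential(value: string, visible = 2): string {
  // Values too short to mask meaningfully are hidden entirely.
  if (value.length <= visible * 2) return "*".repeat(value.length);
  const hidden = value.length - visible * 2;
  return value.slice(0, visible) + "*".repeat(hidden) + value.slice(value.length - visible);
}
```

For instance, `maskCredential("Secret123")` yields `Se*****23`, while a 4-character value is returned as `****`.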

Reliability & Monitoring

Health Checks: Automated health endpoints for database, email service, OCR processing
Error Tracking: Sentry integration for real-time error monitoring and alerting
Backup Strategy: Daily automated PostgreSQL backups with 30-day retention
Uptime: 99.5% measured over 6 months of production operation

Compliance & Audit

Complete Audit Trail: Every action logged with user, timestamp, IP address, before/after states
FERPA Compliance: Student data handling follows FERPA guidelines
Data Retention: Configurable retention policies for different data types
Export Capabilities: Compliance reports exportable to CSV/PDF

Performance & Impact Metrics

Performance Metrics

94%

OCR Accuracy Rate

92%

Automation Rate

<2s

Document Processing Time

99.7%

Email Delivery Success

Business Impact

10K+

Documents Processed

85%

Time Savings vs Manual

0.3%

Error Rate

99.5%

Platform Uptime

Real-World Impact

Time Savings: Reduced credential distribution workflow from 3-5 days to under 1 hour for batches of 500+ students.

Cost Reduction: Eliminated need for 2 FTE staff dedicated to credential processing, saving approximately $80K/year.

Accuracy Improvement: Error rate dropped from 5-8% with manual entry to 0.3% with automated system.

Student Experience: Credential delivery time reduced from 3-5 days to under 1 hour, significantly improving new student onboarding experience.

Key Learnings

1. OCR is 80% Preprocessing, 20% Recognition

Initially focused on tuning Tesseract parameters but saw minimal improvement. Real gains came from investing in a robust preprocessing pipeline: image enhancement, deskewing, adaptive thresholding. Quality preprocessing took accuracy from 67% to 94%. Lesson: In document processing, cleaning the input is more impactful than tuning the algorithm.

2. Embrace Human-in-the-Loop from Day One

Early version attempted 100% automation, which resulted in unacceptable error rates. Pivoted to a hybrid approach with confidence-based routing. Ironically, 92% automation with human oversight proved more valuable than 100% automation with errors. Lesson: Perfect automation is often impossible - design for supervised automation instead.

3. Audit Logs Are Not Optional for Enterprise

Initially treated audit logging as a nice-to-have feature. During pilot deployment, institutions demanded complete audit trails for compliance. Retrofitting comprehensive logging was painful. Lesson: For any system handling sensitive data, build audit logging from day one - it will be required eventually.

4. Email Deliverability is Hard

Naively assumed sending emails was straightforward. Encountered spam filter issues, rate limits, bounce handling complexity. Learned about SPF/DKIM/DMARC, warm-up periods, sender reputation. Lesson: Email delivery at scale requires as much engineering as the core product - don't underestimate infrastructure services.

5. TypeScript + Prisma is an Enterprise-Ready Stack

NestJS + Prisma + PostgreSQL proved excellent for enterprise SaaS. Type safety caught bugs at compile time, Prisma migrations simplified schema evolution, NestJS modularity aided testing. Would choose this stack again for similar projects. Lesson: Modern TypeScript ecosystem has matured to the point of being production-ready for serious applications.

Future Improvements

Machine Learning-Based OCR Enhancement

Replace rule-based confidence scoring with trained ML model that learns from admin corrections. Could improve automation rate from 92% to 95%+ while maintaining accuracy.

Effort: Medium | Impact: High

Cloud OCR Service Integration

Add option to use AWS Textract or Google Vision API for challenging documents. Could improve accuracy on difficult documents from 75% to 95%+.

Effort: Low | Impact: Medium

Mobile Document Capture App

Build mobile app (React Native) for field staff to capture and upload documents with real-time feedback on image quality. Could reduce low-quality document submissions by 60%.

Effort: High | Impact: Medium

Automated Student Database Integration

Direct integration with institutional student information systems (Banner, PeopleSoft) for automatic validation and credential generation. Could eliminate manual entry entirely.

Effort: High | Impact: Very High

Advanced Analytics Dashboard

Build comprehensive analytics showing processing trends, error patterns, staff productivity, common correction types. Help institutions optimize workflows.

Effort: Medium | Impact: Medium

Multi-Tenant Architecture

Refactor to proper multi-tenant architecture with data isolation, custom branding per institution, and usage-based pricing. Enable SaaS scaling to hundreds of institutions.

Effort: Very High | Impact: Very High

Two-Factor Authentication for Recipients

Add optional 2FA verification before credential delivery for high-security institutions. Student must verify identity via SMS/email code before receiving credentials.

Effort: Medium | Impact: Low

Real-Time Collaboration

Add WebSocket-based real-time updates so multiple admins can collaborate on document processing without conflicts or stale data.

Effort: Medium | Impact: Medium