It is best practice to have a separate database for each microservice based on its purpose. In this post we will understand how to analyse the purpose based on a scenario and choose the right database.
In-Short
CaveatWisdom
Caveat: We can easily run into cost overruns if we do not choose the right database and design it properly based on the purpose of our application.
Wisdom:
- Understand the access patterns (Queries) which you make on our database.
- Understand how your database storage scales, will it be in terra bytes or petabytes.
- Analyse what is most important for your application among Consistency, Availability and Partition Tolerance.
- Choose Purpose-Built databases on AWS cloud based on Application Purpose.
In-Detail
Scenario
Let us consider we are developing a loan processing application for a Bank. The Loan could be an Auto Loan, Home Loan, Personal Loan or any other loan.
Requirements of Scenario
- Customer visits loan application portal, reviews all the loan products and interest rates and applies for a loan. The portal should give a smooth experience to the customer without any latencies. Once Application is submitted it should be acknowledged immediately and should be queued for processing.
- Bank Expects huge volume of loan applications across regions with its marketing efforts, loan application data should be stored in a scalable database and as it is very important data it should be replicated in multiple regions for high availability.
- While processing the loan, credit worthiness of the customer has to be analysed.
- Customer Profiles and Loan Documents with content management system should be stored in a secured scalable database.
- Based on credit worthiness of customer loan documents should be sent for final manual approval.
- Loan account for the customer should be created and loan transactions data should be maintained in it. We also need to do ad hoc queries on the transactions data in relation to floating interest rates and repayment schedules for generating different statements and reports.
- Loan Application data should be sent to data warehouse for marketing and customer analytics. In the same data warehouse data from other sources and products will be ingested to improve marketing strategies and showcase relevant products to customers.
- Immutable records of loan application events should be maintained for regulatory and compliance purposes and also these records should be securely shared with the insurer of the asset created by the customer with the loan.
Note: The architecture is simplified for purpose of discussion, real production scenario architecture could be much more complex.
Architecture Brief
Customer facing loan application portal is a static website hosted on Amazon S3 with the CloudFront integration. API Calls are made to API Gateway which transfers the request data to Lambda functions for processing. Fanout mechanism with a combination of SNS and SQS is adopted to process and ingest data to multiple databases parallelly. Process workflow including manual approval is handled by AWS Step Functions. SNS is used for internal notifications and SES is used to intimate the status of the loan by email to the customer.
Command Query Responsibility Segregation (CQRS) pattern is adopted in the above architecture with separate lambda functions for ingesting data and processing the data.
Analysing Purpose and Choosing the Database
- When handling customer queries, especially at the time of acquiring a new customer or selling a new product to an existing customer, the response time should be very less. To give a very good experience to the customer all general frequent query responses, session state and the data which afford to be stale should be cached. Amazon ElastiCache for Redis is a managed distributed in-memory data store built for this kind of purpose. It gives a high-performance microsecond latency caching solution. It comes with multi-AZ capability for high availability.
- Loan application data will be mostly key value pairs like loan amount, loan type, customer id, etc. As per the requirement 2, huge volume of loan data has to stored and retrieved for procession and at the same time it should be replicated to multiple regions for very high availability. Amazon DynamoDB is a key-value store database which can give single-digit millisecond latency even at peta-byte scale. It has an inherent capability to replicate the data to other regions with Global Tables and enabling DynamoDB streams. So, this is suitable for storing Loan application data and triggering Loan Processing Lambda function with DynamoDB streams.
- As per the requirement 3 of the scenario, creditworthiness of the customer has to be analysed before arriving at a decision of sanctioning loan amount to the customer. It is assumed that bank collects data of customers from various sources and also maintain the data of its relationships with its existing customers. Creditworthiness is calculated on many factors especially the history of the relationship the customer had with bank with various products like savings account, credit cards, income and repayment history. When it comes to querying the relationships and analysing the data, we need a graph database. Amazon Neptune is a fully managed graph database service that work with highly connected datasets. It can scale to handle billions of relationships and lets you query them with milliseconds latency. It stores data items as vertices of the graph, and the relationships between them as edges. Loan application data can be ingested to Amazon Neptune and creditworthiness can be analysed.
- As per requirement 4, customer profiles and loan documents should be maintained with a content management system. Loan documents contain critical legal information which could be changing based on various products. The documents can differ based on the law of different states in which the bank operates. To address these requirements schema of the database should be dynamic. We may need to query and process these documents with-in milliseconds. We need to have a No-SQL Document Database for content management system of loan documents which scale. Amazon DocumentDB (with MongoDB Compatibility) is a fully managed database service which can supports both instance-based clusters which can scale up-to 128tb and also Elastic Clusters which can scale even to petabytes of data. We can put loan documents with dynamic schema as JSON documents in DocumentDB. We can use MongoDB drivers for developing our application with the DocumentDB. Additionally signed and scanned copies of the documents can be maintained in an S3 bucket with a reference of the document in the DocumentDB.
- As per the requirement 6, Loan Account should be opened for the customer where transactions data should be maintained. Here we need to maintain the integrity of the transactions data with a fixed schema and Online Transaction Processing (OLTP ). SQL is more suitable for doing ad hoc queries for generating statements. A Relational Database is more suitable for this purpose. Amazon RDS which supports six SQL database engines (Aurora, MySQL, PostgreSQL, MariaDB, Oracle and MS SQL Server) is managed service for relational databases. Amazon RDS manages backups, software patching, automatic failure detection, and recovery which are tedious manual tasks if we maintain the database ourselves. We can focus on our application development instead of maintaining these tasks. If we are comfortable with MySQL and PostgreSQL we can choose Amazon Aurora based on version compatibility. Aurora gives more throughput than standard MySQL and PostgreSQL engines as it uses Clustered Volume storage which is native to cloud.
- As per requirement 7, loan application data has to be stored for marketing and customer analytics. The data warehouse also stores data from multiple sources, which could run into petabytes. The data could be analysed with Machine Learning algorithms which can help in targeted marketing. Amazon Redshift is a fully managed, petabyte-scale data warehouse service with a group of nodes called cluster. Amazon Redshift service manages provisioning capacity, monitoring and backing up the cluster, and applying patches and upgrades to the Amazon Redshift engine which can run one or more databases. We can run SQL commands for analysis on the database. Amazon Redshift supports SQL client tools connecting through Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC). Amazon Redshift also give serverless options where we need not to provision any clusters, it automatically provisions data warehouse capacity and scales the underlying resources. With serverless we pay only when the data warehouse. We can use Amazon Redshift ML to train and deploy machine learning models with SQL. We can use Amazon SageMaker to train the model with data in Amazon RedShift for customer analytics.
- As per the requirement 8, immutable records of loan application events should be maintained for regulatory and compliance purposes. We need to have ledge database to maintain immutable records and also securely share the date with other stakeholders in block chain applications. The insurer who is insuring the asset which is created out of the loan take by the customer may need this data for insuring purpose. With Amazon Quantum Ledger Database (Amazon QLDB) we can maintain all the activities with respect to the loan in an immutable, and cryptographically verifiable transaction log owned by the bank. We can track the history of credits and debits in loan transactions and also verify the data lineage of an insurance claim on the asset. Amazon QLDB is a fully managed service and we pay only for what we use.
In this post I have discussed how to choose a purpose-built database based on our application. I will be discussing designing and implementation of these databases in my future posts.