NoSql Database Cluster Configuration for High Performance

This post focuses on current commodity hardware, its optimal settings for various types of use cases, and how your choice of NoSql database can be configured to take full advantage of it for high performance.

Hardware Performance

Disk Drive Performance

For disk drives, the most time-consuming operation is the seek. Per Wikipedia, the seek time of a 7200 RPM drive is about 4 ms, so in the worst case a drive can read about 250 randomly located blocks per second. A primary-key BTree index lookup needs at least 4-5 disk seeks (a table with over 100K records needs a BTree at least 4-5 levels deep), so realistically a single drive delivers on the order of 50 random record lookups per second.

For sequential reads/writes (append-only databases or Kafka-like use cases), a 7200 RPM disk drive can do about 100 MB per second, or about 100,000 records per second for a 1 KB record size.

SSD Performance

High-performance SSDs today can do about 100,000 IOPS at roughly 4 KB block size. For sequential access, SSD performance is about the same as that of disk drives.

It may therefore be a good idea to use disk drives for append-only files such as redo logs, or for append-style workloads like Kafka.

Object Serialization/deserialization Performance

Based on performance analysis published on this blog, a single Intel Core i3-3240 (Ivy Bridge) 3.40 GHz core can do about 250,000 serializations/deserializations per second of a 1,000-byte payload using ProtoBufs. This should scale roughly linearly with more CPUs and cores, since the operation is mostly CPU bound.
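
As a rough illustration, here is the shape of such a microbenchmark. It assumes a generated Protocol Buffers message class named `Record` with a single `bytes payload` field; the class name and field are hypothetical (generated from your own .proto file), and the loop is intentionally simplistic, with no JIT warmup handling.

```java
import com.google.protobuf.ByteString;

// Rough serialization round-trip benchmark. `Record` is a hypothetical protoc-generated
// message class with a `bytes payload = 1;` field.
public class SerializationBench {
    public static void main(String[] args) throws Exception {
        byte[] payload = new byte[1000];           // 1,000-byte payload
        int iterations = 1_000_000;

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            byte[] wire = Record.newBuilder()
                    .setPayload(ByteString.copyFrom(payload))
                    .build()
                    .toByteArray();                // serialize
            Record back = Record.parseFrom(wire);  // deserialize
            if (back.getPayload().size() != payload.length) throw new AssertionError();
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%.0f round trips/sec%n", iterations / secs);
    }
}
```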

Typical hardware configuration

For cost and performance, we will use a typical Hadoop-recommended configuration as our baseline.

Number of CPUs/Cores

Based on price, power consumption, use case, and rack size, try to get as many CPUs/cores as possible per machine.

It is always better to scale up to a certain limit before thinking about scaling out. Machines with 8, 16, or even 32 cores are possible in today's commodity hardware configurations.

To put this in perspective, typical in-memory hash map implementations like Redis can do ~100,000 get/set operations per second per CPU. So most NoSql workloads are going to be memory and disk bound (not CPU bound), since the throughput of a disk-based NoSql database is never going to reach 100,000 queries per second per CPU.

Physical Memory

Try to get as much physical memory as possible within the constraints of price, size, and power consumption. Configurations of 72 GB, 128 GB, 256 GB, or 512 GB are readily available; the more the better.

Disk Drives and SSDs

SSDs beat disk drives in every respect except capacity. Commodity hardware can have 14 disk drives of 2-4 TB each, which amounts to roughly 28 TB to 56 TB of disk storage per machine.

Cost and physical slot constraints keep SSDs from reaching that kind of capacity per machine. Typical SSD storage ranges from 1 TB up to about 10 TB.

Choosing hard drives or SSDs

From a pure capacity perspective, hard drives make sense: you would need to fill up roughly 50 TB before even considering sharding. But performance is a big bottleneck with hard drives. With 14 drives you get about 14*50 = 700 random record lookups per second.

SSDs can do about 100,000 IOPS, or about 20,000 record lookups per second (one fifth, since a BTree lookup needs ~5 I/Os). With 2-10 SSDs per machine, one can expect up to about 20,000*10 = 200,000 record QPS.

You can also choose a hybrid approach: based on the use case, some collections could live on SSDs and some on hard drives. For example, if you are building an e-commerce site, the use case usually behaves such that 80-90% of queries hit hot records already in memory; there will be lots of queries for an iPhone or other popular items, or for user preferences, which stay cached. In this case it may be OK to use hard drives. For append-only use cases such as redo logs or Kafka, use disk drives.

RAID

For random access, use RAID 0 for performance. Most NoSql databases provide their own redundancy, so redundant RAID configurations may not be required. But if you are using disk drives and your data size will stay well below the available capacity, you may consider a redundant RAID configuration; at least then a single disk failure will not trigger node rebalancing.

For append-only use cases such as redo logs and Kafka, don't use RAID, as these workloads rely heavily on the sequential access speed of the individual disks.

Payloads

OLTP

OLTP use cases mostly operate on a single record. From a scalability, extensibility, and maintenance point of view it is easier to keep schemas properly normalized. De-normalized schemas create maintenance issues, since some future use cases may be hard to implement if the schema is not properly normalized. The classic example is a teacher changing her class timings: with a normalized schema only one record in one table needs to change, but with a de-normalized schema the records of all students taking that class might have to be updated.

For OLTP workloads it is better to choose a NoSql database that supports relational-style schemas and joins.

A new wave of NewSql databases is picking up here, focused on the mantra of “functionality of SQL/relational databases with the scale of NoSql”.

Data warehousing / Analytics

The use cases here are mostly inserting records into fact tables and joining them with dimension tables. The key point is that dimension tables change very rarely; they are sometimes called slowly changing dimensions. For example, Walmart adding support for a new product or opening a new store.

Due to their insert-only nature, fact tables can grow without bound, so proper sharding of the data is an absolute must. Also, because of the append/insert-only nature, it is OK to de-normalize the schema to a certain extent. Schema de-normalization does not mean joins are not required: in these workloads you will still need join support for joining with dimension tables. For example, to count the number of products sold in Walmart store A from the cereal section on the bottom rack, three dimensions are used: store, section, and rack. The rack and section dimensions may be hierarchical and may not be easily mapped to some metadata that would avoid joins. In this case it is best to use joins (and we are better off using a database that supports them) rather than name-value or document databases that don't.

The bottom line is that data warehousing payloads need a de-normalized schema on a shardable, horizontally scalable database, but it is not a bad idea to select one of the newer NewSql databases that are highly scalable while supporting joins and other relational constructs. Google's Datastore, for example, is highly scalable and also supports some basic SQL-like queries and joins.

Real time analytics

In this case, data from the data warehouse is fed back into OLTP databases (or other databases that can be queried in real time) so that applications can use it to serve users better, for example sites providing recommendations based on a user's current activity.

This can be a huge amount of mostly read-only data (it might change once a day, via inserts, updates, or deletes) fed from the data warehouse and queried by the real-time use case. For example, say a user is watching a movie and is offered recommendations based on it. The recommendation engine uses both real-time OLTP data (the user's current activity) and data-warehouse data in its algorithms to come up with the best possible recommendations for that user.

NoSQL Databases

Due to the scale and size of the data, any NoSql database you select should support sharding and replication.

Most of the current wave of top NoSql databases are document or name/value stores and are not designed to take full advantage of current commodity hardware.

MongoDB

Use RAID 0 for collection and index files.

MongoDB's architecture almost forces the use of SSDs, which limits the maximum storage capacity per node. That in turn forces sharding even when other resources such as CPU or memory have not reached their peak capacity.

Due to its memory-mapped files and table-level locking for writes, MongoDB cannot use the commodity hardware stack to its full capacity. Sharding then causes hardware, operational, and maintenance costs to skyrocket, along with a fragile cluster structure, since even sharding cannot completely solve the underlying architectural issues.

On top of that, the loose (document) schema model gets in the way of future enhancements: the need for atomicity forces schema de-normalization, which increases document size, hurts performance, and brings the other issues that come with a de-normalized schema.

Bad queries can further degrade cluster performance, since infrequently used records can clog up physical memory due to the memory-mapped files.

Cassandra

Cassandra reads and writes can be sequential (the architecture documentation is not clear on whether reads are sequential). SSTables are written to disk sequentially, and they are periodically read back sequentially to merge them into one SSTable.

Even though reads and writes are sequential, over time disk fragmentation can occur as the file count grows. The drives then lose the speed advantage of sequential access, which shows up in performance and throughput.

Based on this, we can assume that a Cassandra node may not need RAID and should not need SSDs.

Couchbase

Couchbase is a document database like MongoDB, so the de-normalization issues discussed for MongoDB also apply to Couchbase.

Couchbase is a classic example of an append-only database, where every insert/update/delete is appended to the data file and compaction keeps compacting the data files for every bucket.

If the data size is bigger than physical memory, random-access reads may be required to fetch records.

Since disk drives and SSDs perform about the same for sequential access, writes can use either. For reads, SSDs will outperform hard drives due to the random-access nature of reads.

So based on your use case and data size, you may have to choose between SSDs and disk drives.

  • If physical memory is about the same as the data size per node, use disk drives. I would say even if the data size is larger by up to 50%, use disk drives.
  • If your use case is such that 80-90% of queries hit hot records in memory, use disk drives.
  • In any other case use SSDs, but note that you might have to start sharding earlier than you wanted.

Kafka

Even though Kafka is not a NoSql database, it shows up in most NoSql cluster deployments.

Its payload is mostly append-only, disk-drive-friendly, and not very CPU hungry. It usually benefits from as many disk drives as possible, with no need for RAID.


Database BTree+ Index

The BTree data structure is a generalization of a self-balancing binary tree in which a node can contain more than one child. This generalization is useful for storing the tree in disk files: a node can be mapped to a disk block and easily read from and written to disk.

BTree+ is a further generalization of BTree in which index keys and records are stored only in leaf nodes; branch nodes contain only pointers to other branch and leaf nodes. This improves throughput, since branch nodes do not need to be locked except during rebalancing: data is inserted or deleted in leaf nodes, so only leaf nodes need to be locked during CRUD operations.

In addition, BTree+ leaf nodes can contain pointers to the previous and next leaf nodes for index range traversals.
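
To make the node layout concrete, here is a minimal sketch of the branch and leaf node types described above. The class and field names are illustrative only, not WonderDB's actual classes, and real nodes are serialized into fixed-size disk blocks rather than held as Java lists.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative BTree+ node layout; names are hypothetical, not WonderDB's actual API.
abstract class Node {
    long blockAddress;      // disk block this node maps to
    byte[] maxKey;          // copy of the largest key reachable from this node
}

class BranchNode extends Node {
    // Ordered by each child's max key; sized to fit the block
    // (roughly 150-200 entries per 2048-byte block).
    List<byte[]> childMaxKeys = new ArrayList<>();
    List<Long> childBlockPointers = new ArrayList<>();
}

class LeafNode extends Node {
    // Ordered (key, record pointer) pairs plus sibling links for range scans.
    List<byte[]> keys = new ArrayList<>();
    List<Long> recordPointers = new ArrayList<>();
    long prevLeaf = -1;     // previous leaf block, -1 if none
    long nextLeaf = -1;     // next leaf block, -1 if none
}
```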

Node Types

Every node contains an ordered list of items, in ascending or descending order depending on how the index was created.

The descriptions and sizes of the various node types given here are based on the WonderDB implementation; they may differ slightly in other implementations.

Branch

Branch nodes contain an ordered list of pointers to next-level nodes, ordered by the max key of each child. For the default block size of 2048 bytes, a branch node stores pointers to about 150 next-level branch or leaf nodes. It also contains a copy of its maximum key value (the last key of the rightmost leaf node it ultimately points to). This key is used during tree traversal to get to the proper leaf node during query processing.

For example, say branch node B contains pointers to two leaf nodes X and Y, where the max key of X is 100 and the max key of Y is 200. Node B will store the two pointers in the order X, Y, and the max key value of B will be 200.

Root

The root node is a special case of a branch or leaf node and is the entry point into the data structure. If all keys fit in one node, the root is a leaf node; otherwise the root is usually a branch node.

Leaf

Leaf nodes contain an ordered list of index keys (or record key data), plus pointers to the next and previous leaf nodes. The current implementation needs the pointer to the previous node, but we are working on an enhancement to remove that dependency.

Depending on the key size, a leaf contains as many keys as fit in the leaf block. The default block size is 2048 bytes, but it can be changed when the index is created by providing the STORAGE keyword.

The max key value of a leaf is the last key in the node.

For example, say a leaf node contains keys 100, 200, and 300 in that order; its max key value is then 300.

Operations

Insertion

Insertion starts at the root node and traverses down to the leaf node (and the position within its ordered list of keys) where the item can be inserted.

Since the items in every node (branch and leaf) are ordered, binary search is used within each node to locate where the new item belongs. This step is applied recursively from the root down to the leaf, and the item is inserted into the leaf.
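
As a small illustration of that lookup step, the following sketch finds the insertion position within one node's ordered key list using binary search (the real code works on serialized keys and custom comparators; this uses plain long keys for clarity).

```java
import java.util.Collections;
import java.util.List;

// Finds where `key` belongs in a node's ordered key list using binary search.
final class NodeSearch {
    static int insertionIndex(List<Long> orderedKeys, long key) {
        int pos = Collections.binarySearch(orderedKeys, key);
        // binarySearch returns (-(insertionPoint) - 1) when the key is absent.
        return pos >= 0 ? pos : -(pos + 1);
    }

    public static void main(String[] args) {
        List<Long> keys = List.of(100L, 200L, 300L);
        System.out.println(insertionIndex(keys, 250L));  // prints 2: insert before 300
    }
}
```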

Insertion balancing

If the newly added item grows the leaf node beyond the block size (default 2048 bytes), the node is split into two leaf nodes and the traversal moves upward to insert a pointer to the new leaf into the parent branch node. If, after inserting this new pointer, the branch node itself needs to be split because of its size, the process continues recursively upward, possibly all the way to the root. Since rebalancing changes the overall structure of the BTree+, the whole tree needs to be locked during this operation. A leaf split is sketched below.
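
A leaf split itself is conceptually simple: move the upper half of the keys into a new leaf and hand the new leaf back so its pointer (and max key) can be inserted into the parent. The sketch below shows just that step with plain long keys; it is illustrative, not WonderDB's actual split code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative leaf split: the upper half of an overflowing key list moves to a new leaf.
// The caller links the sibling pointers and inserts the new leaf into the parent branch node.
final class LeafSplit {
    static List<List<Long>> split(List<Long> overflowingKeys) {
        int mid = overflowingKeys.size() / 2;
        List<Long> left = new ArrayList<>(overflowingKeys.subList(0, mid));
        List<Long> right = new ArrayList<>(overflowingKeys.subList(mid, overflowingKeys.size()));
        return List.of(left, right);
    }

    public static void main(String[] args) {
        System.out.println(split(new ArrayList<>(List.of(10L, 20L, 30L, 40L, 50L))));
        // prints [[10, 20], [30, 40, 50]]
    }
}
```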

Deletion

Similar to insertion, deletion starts at the root node and traverses down to the leaf where the item to be deleted is located, then removes that item from the leaf node.

Delete rebalancing

If the leaf node becomes empty after the deletion, the tree again needs rebalancing to remove the leaf's pointer from its parent node, recursively up to the root node if required. During this time the whole BTree+ needs to be locked.

Update

Fortunately, a dedicated update operation is not required: an update can be performed as a delete followed by an insert, which are two separate operations on the BTree+.

Tree locking scenarios

Read

The following steps are performed (a minimal locking sketch follows the list):

  1. Acquire a read lock on the tree.
  2. Traverse down to the leaf node.
  3. Acquire a read lock on the leaf node.
  4. Release the tree read lock acquired in step 1.
  5. Return an iterator so that the range (or PK lookup) query can walk through the range. The iterator properly unlocks and locks leaf nodes while traversing from node to node.
  6. At the end, it is the query processor's (caller's) responsibility to release the read lock on the leaf node.
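
A minimal sketch of this read path using the JDK's ReentrantReadWriteLock (the lock classes are real JDK APIs; the tree and leaf structure and method names are illustrative, not WonderDB's actual code):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative read path: the tree-level read lock is held only while locating the leaf,
// then a leaf-level read lock is handed off to the caller (steps 1-6 above).
class BTreeReadPath {
    private final ReentrantReadWriteLock treeLock = new ReentrantReadWriteLock();

    static class Leaf {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        // keys, record pointers, sibling links ...
    }

    Leaf findLeafForRead(byte[] key) {
        treeLock.readLock().lock();              // step 1
        try {
            Leaf leaf = traverseToLeaf(key);     // step 2
            leaf.lock.readLock().lock();         // step 3
            return leaf;                         // caller must unlock it (step 6)
        } finally {
            treeLock.readLock().unlock();        // step 4
        }
    }

    private Leaf traverseToLeaf(byte[] key) {
        // binary-search each branch node from the root down (omitted)
        return new Leaf();
    }
}
```
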
Insert

The following steps are performed (a sketch of the lock upgrade on a split follows the list):

  1. Acquire a read lock on the tree.
  2. Traverse down to the leaf node.
  3. Acquire a write lock on the leaf node.
  4. Try inserting the item. If the leaf needs to split because of the size increase, then:
    1. Release the write lock on the leaf (step 3).
    2. Release the read lock on the tree (step 1).
    3. Acquire a write lock on the tree, since a rebalancing operation might be performed.
    4. Traverse down to the leaf node again.
    5. Acquire a write lock on the leaf node. In theory this is not required, since there is a write lock on the tree itself and no other threads can be in the tree at this time.
    6. If a rebalance is required, rebalance the tree.
    7. Release the write lock on the tree.
  5. Insert the item.
  6. Release the write lock on the leaf.
  7. Release the read lock on the tree.
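
And a corresponding sketch of the insert path, showing the lock upgrade when a leaf must split. Again, the structure and helper names are hypothetical; the point is only the ordering of lock acquisition and release in the steps above.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative insert path with the lock upgrade on a split (steps 1-7 above).
class BTreeInsertPath {
    private final ReentrantReadWriteLock treeLock = new ReentrantReadWriteLock();

    void insert(byte[] key, long recordPointer) {
        treeLock.readLock().lock();                       // step 1
        Leaf leaf = traverseToLeaf(key);                  // step 2
        leaf.lock.writeLock().lock();                     // step 3

        if (!fits(leaf, key)) {                           // step 4: leaf would overflow
            leaf.lock.writeLock().unlock();               //   4.1 release leaf write lock
            treeLock.readLock().unlock();                 //   4.2 release tree read lock
            treeLock.writeLock().lock();                  //   4.3 exclusive tree lock for rebalancing
            try {
                leaf = traverseToLeaf(key);               //   4.4 find the leaf again
                splitAndRebalance(leaf);                  //   4.6 split leaf, push pointer upward
                insertIntoLeaf(leaf, key, recordPointer);
            } finally {
                treeLock.writeLock().unlock();            //   4.7
            }
            return;
        }

        insertIntoLeaf(leaf, key, recordPointer);         // step 5
        leaf.lock.writeLock().unlock();                   // step 6
        treeLock.readLock().unlock();                     // step 7
    }

    // Placeholders; the real code operates on serialized disk blocks.
    static class Leaf { final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); }
    private Leaf traverseToLeaf(byte[] key) { return new Leaf(); }
    private boolean fits(Leaf leaf, byte[] key) { return true; }
    private void splitAndRebalance(Leaf leaf) { }
    private void insertIntoLeaf(Leaf leaf, byte[] key, long ptr) { }
}
```
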
Delete
  1. Acquire a read lock on the tree.
  2. Traverse down to the leaf node.
  3. Acquire a write lock on the leaf node.
  4. Remove the item. If the tree needs to rebalance:
    1. Release the write lock on the leaf.
    2. Release the read lock on the tree.
    3. Acquire a write lock on the tree.
    4. Traverse down to the leaf.
    5. Remove the item.
    6. Rebalance if required.
    7. Release the write lock on the tree.
  5. Release the lock on the leaf.
  6. Release the lock on the tree.

Node levels in WonderDB

WonderDB is a transactional NoSql database implemented in Java, and it supports a BTree+ index. The numbers provided here reflect some internals of the WonderDB implementation.

Branch nodes store a disk pointer to each next-level node (about 10 bytes), so for the default 2048-byte block a branch node stores about 200 items.

A leaf node, on the other hand, stores the actual key value plus a pointer to the record contents (about 10 bytes for the pointer). For a small key, each entry needs on the order of 20 bytes, so a leaf block stores about 100 keys.

Based on the above assumptions:

A 2-level tree stores 100*200 = 20,000 items in 200+1 = 201 blocks, i.e. 201*2048 bytes ≈ 400 KB of disk space.

A 3-level tree stores 100*200*200 = 4,000,000 (4M) items; the size of the tree is about (200*200)*2048 bytes ≈ 80 MB.

A 4-level tree stores 100*200*200*200 = 800,000,000 (800M) items; the disk space is about (200*200*200)*2048 bytes ≈ 16 GB.
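
These sizes are easy to reproduce with a quick back-of-the-envelope calculation, assuming the ~100 keys per leaf and ~200 pointers per branch block used above:

```java
// Back-of-the-envelope BTree+ sizing with ~100 keys per leaf and ~200 pointers per branch block.
public class BTreeSizing {
    public static void main(String[] args) {
        final long keysPerLeaf = 100, fanout = 200, blockSize = 2048;
        for (int levels = 2; levels <= 4; levels++) {
            long leaves = (long) Math.pow(fanout, levels - 1);     // leaf blocks at the bottom level
            long items = keysPerLeaf * leaves;
            long branches = 0;
            for (int l = 0; l < levels - 1; l++) {
                branches += (long) Math.pow(fanout, l);            // 1 root + 200 + 40,000 ...
            }
            long bytes = (leaves + branches) * blockSize;
            System.out.printf("%d levels: %,d items, ~%.1f MB on disk%n",
                    levels, items, bytes / 1e6);
        }
    }
}
```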

From the above calculations you can easily see why the whole BTree+ can stay in physical memory if we assume 50+ GB of physical memory; machines with 50+ GB are considered commodity hardware nowadays.

This calculation is very important when choosing between a BTree+ and a hash index if range queries are not required. If configured properly, a hash index can perform 2-3 times faster than a BTree+, which is a huge improvement.


Database Hash Index

A hash index is based on the hash map data structure, which has properties such as rehashing and load factor; you can read more about it on Wikipedia.

Here we focus on the usage and design of this data structure in the context of databases, where its size can be bigger than physical memory and it must be serialized to disk. The following are points to consider when using it in a database.

Rehash – increasing/decreasing the number of buckets

Hash maps need to lock the whole data structure during a rehash. This may not be a very expensive operation for in-memory hash map implementations, but it can be almost impossible in the context of a database, since a rehash would have to lock the whole table.

Most databases instead suggest rebuilding the index when performance starts degrading over time because of the number of items in the index relative to the number of buckets, i.e. the load factor crossing a certain threshold.

Generic implementations of hashCode() and equals() methods

It may not be easy to provide hashCode() and equals() implementations specific to your index key class. Databases usually use their own generic implementations, which may not be very efficient for your key type. In WonderDB we are working on a feature to register custom data types.

Load factor – calculating the bucket count

By default, Java's hash map implementation sets the load factor to 0.75, allowing the hash map to grow to a certain size before it is rehashed. For example, if the initial capacity of the hash map is 100, it will be rehashed when the 75th item is stored into it.

We need different load-factor considerations for a hash index. Let's take an example of how to calculate the load factor for hash indexes.

As shown in the figure above, index entries are stored in a disk block, called a leaf node in WonderDB. Say the disk block size is 2048 bytes and you want to store a long in the index (8 bytes). Also assume the index stores a pointer to the actual table record of, say, 12 bytes; then each index item needs 20 bytes on disk, so one disk block (2048 bytes) stores about 100 index items.

In this case, for optimal use of disk space it is better to target about 100 items per bucket instead of roughly one item per bucket (as in the Java hash map implementation). The equivalent load-factor setting for the hash index is therefore 0.75/100 = 0.0075. Put another way: to store 750 items, a Java hash map needs 1000 buckets to keep a 0.75 load factor, whereas the hash index needs only 7.5 (say 8) buckets for the same 750 items, since we assume 100 items per bucket.
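
A quick sketch of that bucket-count estimate, using the 20-byte entry and 2048-byte block assumptions from the example above (the expected item count is hypothetical):

```java
// Bucket-count estimate for a disk-backed hash index, per the example above.
public class HashIndexSizing {
    public static void main(String[] args) {
        final int blockSize = 2048;       // bytes per bucket (one leaf block)
        final int entrySize = 8 + 12;     // 8-byte long key + 12-byte record pointer
        final int itemsPerBucket = blockSize / entrySize;              // ~100 items per bucket

        long expectedItems = 1_000_000L;  // hypothetical expected index size
        long buckets = (expectedItems + itemsPerBucket - 1) / itemsPerBucket;
        System.out.printf("%d items/bucket -> %,d buckets for %,d items%n",
                itemsPerBucket, buckets, expectedItems);
    }
}
```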

BTree+ Vs Hash Index

The advantage of a hash map over a tree data structure is that its access time is constant, O(1): it takes roughly one comparison to get to the item, assuming the map is properly organized. For a tree, the access time is O(log n): it takes about 10 comparisons to get to an item in a tree containing 1024 items.

So for 1M items the tree takes about 20 comparisons and for 1B items about 30, whereas a hash map still takes about one comparison. The problem is that we probably won't be able to store billions of items in an in-memory hash map due to physical memory constraints.

So for storing billions of items in a hash index we also need to take the load factor into account. Say the load factor is 0.0075 for long keys; in other words, if a bucket should contain 100 items for optimal disk usage, then we already need about log2(100) ≈ 7 comparisons within the bucket to get to the item. So the access time is no longer a single comparison but about 7.

So to see a clear benefit, we need enough items that the BTree+ would do at least 3x those 7 comparisons, i.e. about 21 comparisons, which corresponds to roughly 1M items.

The point here is: unless you are expecting millions of items in the index, don't even consider a hash index.

Performance

WonderDB supports both types of indexes and lets the user choose which one to create.

We have seen hash index performance go up to 2x in a case where the BTree+ needed a multi-level structure.

We see about 60K queries per second with the hash index for 32,000 items with a key size of 1,500 bytes (we selected a key size of 1,500 specifically to force a deep, multi-level BTree+), whereas for the BTree+ we see close to 35-40K queries per second.

Here is the setup for the test.

Btree+

It needed 32,000 leaf nodes to store 32,000 items, since a node can store only one 1,500-byte item in a 2048-byte block.

It needed 32,000/150 ≈ 214 branch blocks; branch blocks store pointers to next-level branch or leaf blocks and can hold 150 items per block.

The next level of branch blocks needed 214/150 ≈ 2 blocks.

And the root node pointed to those 2 next-level blocks.

So our tree structure had 4 levels:

– 2 items in the root node

– 2 branch blocks pointing to 214 blocks at the next level

– 214 branch blocks pointing to 32,000 leaf blocks

– and 32,000 leaf blocks.

A lookup needed 1 compare (at the root) + 8 compares (at the next level containing 214 pointers) + 15 compares (to locate the right leaf among 32,000) + 1 compare to find the item in the leaf node = 25 compares.

Hash Index

We used 32,000 buckets to make the hash index fully optimized, so that it takes essentially one compare, O(1), to get to an item.

With this setup we saw 60K queries per second for the hash index and 40K queries per second for the BTree+, a 50% improvement.

 


What is better? Unsafe, ByteBuffer or Direct ByteBuffer?

Summary

Unsafe memory access is about 10-12% faster than ByteBuffer and direct ByteBuffer. Since Unsafe memory lives outside the JVM heap, the application needs to manage its allocation and deallocation.

There are also a couple of Java runtime options to consider, which might help boost ByteBuffer performance.

The conclusion is: consider all three options listed below and see which is better for your specific use case. We are working on open-sourcing a wrapper library that wraps an Unsafe buffer as a Java ByteBuffer so the two can be used interchangeably in code.

In WonderDB, we are going to support all three options. Since most of our cache memory is allocated at startup and destroyed at shutdown, we have no real issue with the application having to manage memory itself in the Unsafe case.

Our Use Case

WonderDB is a transactional NoSql database built on relational database architecture and design concepts. Our caches are fully disk backed, and we manage the life cycle of data buffers by paging them in and out of disk files. We currently build our caching engine on top of direct ByteBuffer and are now evaluating other approaches for performance improvements.

Different ways to store bytes in memory

We are considering the following options:

  • ByteBuffer
  • Direct ByteBuffer
  • Unsafe

Most cache access in WonderDB consists of reading and writing serialized byte arrays.

I found a detailed performance analysis of all three methods on Alexey's blog. Unsafe read/write performance for byte arrays is about 10-12% better than ByteBuffer or direct ByteBuffer.
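
For reference, here is a minimal sketch of the three allocation styles for holding a serialized byte array. The ByteBuffer calls are standard JDK APIs; the Unsafe instance is obtained via the usual reflection hack and requires access to sun.misc.Unsafe. This is a simplified illustration, not WonderDB's cache code.

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

// Three ways to hold a serialized record off a cache (simplified illustration).
public class BufferOptions {
    public static void main(String[] args) throws Exception {
        byte[] record = new byte[1024];

        // 1. Heap ByteBuffer: lives on the JVM heap, can benefit from -XX:+UseLargePages.
        ByteBuffer heap = ByteBuffer.allocate(record.length);
        heap.put(record).flip();

        // 2. Direct ByteBuffer: off-heap, freed when the buffer object is garbage collected.
        ByteBuffer direct = ByteBuffer.allocateDirect(record.length);
        direct.put(record).flip();

        // 3. Unsafe: raw off-heap memory; the application must free it explicitly.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);
        long address = unsafe.allocateMemory(record.length);
        unsafe.copyMemory(record, Unsafe.ARRAY_BYTE_BASE_OFFSET, null, address, record.length);
        byte first = unsafe.getByte(address);   // read one byte back
        unsafe.freeMemory(address);             // manual deallocation

        System.out.println(heap.remaining() + " " + direct.remaining() + " " + first);
    }
}
```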

Even though Unsafe memory access is faster, we believe there are a couple of JVM runtime options that should help the ByteBuffer option.

JVM runtime options to consider

-XX:+UseLargePages

Most major operating systems support large memory page settings. This setting improves performance for the following reasons:

  1. increased performance through increased Translation Lookaside Buffer (TLB) hits
  2. pages are locked in memory and never swapped out, which guarantees the whole JVM heap remains in RAM; the same guarantee cannot be given for direct ByteBuffer memory
  3. contiguous pages are pre-allocated and cannot be used for anything other than System V shared memory, for example the JVM heap
  4. less bookkeeping work for the kernel in that part of virtual memory due to the larger page size

Only the heap ByteBuffer can take advantage of this JVM setting. It does not help direct ByteBuffer or Unsafe buffers, since they are allocated off heap and outside the JVM's control.

More information about large memory pages is available on the Red Hat site.

Please note that large memory pages need to be enabled on the machine before using this option. See the Oracle site to understand how to enable large pages on different operating systems, including Linux, Windows, and Solaris.

The way to use this option is to set the large-page pool a little bigger than the Java heap size you are considering. This way the whole Java heap will be pinned in memory and the OS won't page its contents in and out.

-XX:+UseCompressedOops Java option

This option enables pointer compression in a 64-bit JVM to reduce heap size.


Why Memory-mapped files in NoSql are not scalable

Background

Most operating systems allow mapping a section of a file into physical memory. Changes to this memory buffer are then automatically reflected in the file.

Since the operating system manages the life cycle of the mapped buffer, a NoSql database using memory-mapped files can focus on its business logic of supporting CRUD operations and leave low-level memory/buffer management to the underlying operating system.

For example, inserts can simply be appended at the end of the file. Updates might be processed in place if the new record is no bigger than the space allocated for it in the file; otherwise the record is relocated to the end of the file, which changes its record pointer, forces updates to the index entries for that record, and leaves the old record area in the file as garbage.
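
As a concrete illustration, here is how such a memory-mapped data file is typically used from Java with MappedByteBuffer (a real JDK API). The record layout is deliberately simplified; a real engine adds headers, free-space tracking, and index maintenance.

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Appends a length-prefixed record to a memory-mapped data file;
// the OS pages the mapped buffer in and out of physical memory.
public class MappedDataFile {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(Path.of("data.db"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {

            // Map 64 MB of the file; the file grows to this size if needed.
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_WRITE, 0, 64L * 1024 * 1024);

            byte[] record = "record-payload".getBytes();
            int offset = map.position();     // append position (a real engine persists this)
            map.putInt(record.length);       // length prefix
            map.put(record);                 // record bytes land in the OS page cache

            map.force();                     // optionally flush dirty pages to disk
            System.out.println("wrote " + record.length + " bytes at offset " + offset);
        }
    }
}
```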

Updates may create garbage (free areas) in the file, and deletes certainly will. Most NoSql databases clean up this space by compaction, which can be very expensive.

Issues

Heavy write (update and delete) use cases

As you can see, write-heavy use cases with growing record sizes will definitely be affected, because updates may need to relocate records, forcing index changes and eventually expensive compaction.

Deletes cause similar issues by creating garbage in the file, which eventually requires very expensive compaction.

Shard rebalancing

Shard rebalancing forces the database to scan at least 50% of the records in the collection, assuming an index is present on the shard key; otherwise it has to scan 100% of the records to check whether each one needs to be migrated to the new shard. These reads mess up the page cache, since infrequently used records get pulled in. That in turn hurts the performance of currently executing queries, because the records they need might no longer be in the OS page cache.

Shard rebalancing will by definition delete about 50% of the records from one node to move them to the new node, almost always triggering an expensive compaction operation.

One bad query causing a full table scan can completely ruin the performance of the whole database

As discussed above for shard rebalancing, a full-table-scan query can completely ruin the OS page cache, affecting the performance of the entire database.

Conclusion

You have to be very careful if your deployment has the following characteristics:

  1. Database size greater than physical memory size.
  2. Sharded cluster
  3. Update/Delete heavy use cases.

If your database size is going to be greater than the physical memory size, a memory-mapped-file architecture can become a serious performance problem.

If you have to go for sharding to overcome this issue, you are basically pushing the burden onto the operational maintenance of shards, leaving aside the under-utilization of CPU, memory, and disk resources on individual nodes and the costs of power and rack space. On top of that, a rebalancing operation will always be followed by a costly compaction operation.

Current commodity hardware can have tens of CPU cores, tens to hundreds of GB of physical memory, and tens of TB of disk space. That is a lot of compute and storage power, and it should be used to its full capacity before going for sharding and adding a new node.

Sharded clusters and update/delete-heavy use cases are going to force very expensive compaction operations, which will affect the overall performance of the database.


Time for a major leap in NoSql architecture

The current state of NoSql: where are we heading?

There are too many options to choose from

There are at least three types of NoSql databases available today, document-based, name-value pair, and graph databases, with more than 20 vendors each focusing on one of these NoSql technologies.

NoSql vendors successfully created a space for NoSql by stressing two points:

  • CAP theorem and
  • “one-size does not fit all” or “right tool for the job”

At the time, the industry desperately needed this promise of a flexible and scalable database architecture.

NoSql taught us that compromising “strong consistency” (for eventual consistency) in exchange for availability was not a sin but a need of the time. This was a great contribution of the NoSql movement to our industry: we have come to the point where we are no longer scared away by non-transactional databases.

Stressing the second point was necessary in the early days of NoSql, but because of it the number of NoSql databases has mushroomed to an unmanageable count, all claiming to be the “right tool for the job”, further complicating the database evaluation process for most organizations.

The problem is that there are too many “right tools for the job” and no single database is a 100% fit.

Getting locked in to specific NoSql database

Once the painful evaluation process is finished, organizations tend to keep using the same technology for other use cases that are not really a great fit for that NoSql database. This is where product managers and developers start scrambling to tweak use cases to fit the selected NoSql database.

What's wrong with choosing one of the document NoSqls for my document-storing needs?

This is a very valid question. The problem is that most document databases focus on a specific document type, mostly JSON or XML. But is that the right strategy?

Remember the days of XML? Everything seemed to converge toward XML, from SOA to XML databases.

Now it's JSON's turn: JSON everywhere. But as a technology evaluator, should I be locked into JSON databases? Will we still be talking about JSON in a few years?

Name-Value databases

What's wrong with selecting one of the name-value pair databases available right now if my use case is only to store name-value pairs in some clustered persistent storage?

This is again a very valid question.

Most name-value databases are offshoots of Google's BigTable white paper. These databases provide extreme scalability and availability in exchange for simplicity of data structures.

So if you have a simple data structure (name-value) to store and a very large data set, one that can grow to thousands of nodes in a cluster, these databases are definitely a possibility.

But you have to be aware of the complexity that even a simple change in query requirements might create.

Architectural compromises and limitations in some of the NoSql databases

We have been using relational databases for so long that we unconsciously assume all databases support basic database features like record locking, buffer management, logging, etc.

But it turns out quite a few NoSql databases don't implement these basic features, which can bite you as the data set starts growing.

We have already started hearing rumblings of teams moving away from NoSql back to MySQL or other relational database systems.

I just hope these basic features will be implemented very soon in those databases.

What’s next for NoSql

Relational databases are here to stay and will always be the comparison point in any technical evaluation. We have also come a long way in understanding the need for scalability and availability.

NoSql cannot get away with its NoSql-ness forever. For the query language, it has to offer a better, next-generation SQL that applies the 80-20 rule: support the 80% most common SQL constructs, ignore the rest, and add new ones for specific cases like document, name-value, or graph interactions.

Architecturally, it has to provide the basic relational database building blocks to be successful.

It should provide a basic database architecture shell in which anything can be stored. Imagine XML, JSON, name-value, graph, and relational columns in a single database!

In short, the future NoSql database is going to be highly extensible, providing different schemes and algorithms for storing, querying, and indexing. The user can choose to store any format in the database, with the engine making sense of the data format during indexing and select calls.

Imagine a simple use case starting with relational columns. As the use case changes, new columns store XML, JSON, or some custom format, and other new columns store name-value pairs, all without compromising the NoSql principles of “scalability, availability, and flexibility”.
