The terichdb schema for MongoDB

Author: rockeet  Published: December 15, 2015  Category: Uncategorized

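Although MongoDB is a schemaless document database (it does not require a schema), the records in one table usually share the same structure. We need to extract that shared structure into a schema so that the database can perform better.

terichdb data has the following types: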

DataType	Aliases	Description
Any		Not supported; the real type is stored as the first byte of the data
Nested		Not supported
Uint08	uint8,byte,ubyte
Sint08	int8,sbyte
Uint16
Sint16	int16
Uint32
Sint32	int32
Uint64
Sint64	int64
Uint128		Storage only, no indexing
Sint128	int128	Storage only, no indexing
Float32	float
Float64	double
Float128		Storage only, no indexing
Decimal128		Storage only, no indexing
Uuid	guid	16-byte (128-bit) binary
Fixed		Fixed-length binary
VarSint		Not supported
VarUint		Not supported
StrZero		Zero-terminated string
TwoStrZero		Special; currently used only for the BSON RegEx type
Binary		Prefixed by its length in bytes (var_uint)
CarBin		Cardinal Binary, prefixed by a uint32 length
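
To make the alias column concrete, here is a minimal Python sketch that resolves a type name or alias to its canonical spelling. The helper name `canonical_type` is hypothetical and not part of terichdb; the notes at the end of this post say that "type" names are case-insensitive, and treating the aliases as case-insensitive too is an assumption made here.

```python
# Canonical terichdb type names and the aliases listed in the table.
# Types without an alias (Uint16, Uint32, Fixed, ...) are omitted for brevity.
ALIASES = {
    "Uint08": ["uint8", "byte", "ubyte"],
    "Sint08": ["int8", "sbyte"],
    "Sint16": ["int16"],
    "Sint32": ["int32"],
    "Sint64": ["int64"],
    "Sint128": ["int128"],
    "Float32": ["float"],
    "Float64": ["double"],
    "Uuid": ["guid"],
}

# Lowercase lookup: canonical names and aliases both map to the canonical name.
_LOOKUP = {}
for _canonical, _aliases in ALIASES.items():
    _LOOKUP[_canonical.lower()] = _canonical
    for _alias in _aliases:
        _LOOKUP[_alias.lower()] = _canonical


def canonical_type(name):
    """Resolve a type name or alias (hypothetical helper, not a terichdb API)."""
    try:
        return _LOOKUP[name.lower()]
    except KeyError:
        raise ValueError(f"unknown type name: {name!r}") from None


print(canonical_type("byte"))    # Uint08
print(canonical_type("DOUBLE"))  # Float64
```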

After BSON data is fetched from MongoDB, each field of a BSON document is converted to a terichdb type according to its BSON type, as follows:

MongoDB data type	terichdb data type
Object	Nested
Bool	Uint08
NumberInt	Sint32
NumberLong, bsonTimestamp, mongo::Date	Sint64
NumberDouble	Float64
Decimal128	Decimal128
Oid	Fixed (fixed-length binary)
Code, String, Symbol	StrZero (zero-terminated string)
RegEx	TwoStrZero (special; currently used only for the BSON RegEx type)
Array, BinData, CodeWScope	CarBin (Cardinal Binary, prefixed by a uint32 length)
Undefined, jstNULL	Binary of length 0
MaxKey, MinKey	No corresponding type
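
The conversion table can be read as a plain lookup. The sketch below restates it as a Python dictionary; the names `BSON_TO_TERICHDB` and `terichdb_type` are illustrative assumptions, not terichdb code, and the key "Date" stands for the `mongo::Date` entry of the table.

```python
# BSON type name -> terichdb type name, copied from the conversion table.
BSON_TO_TERICHDB = {
    "Object": "Nested",
    "Bool": "Uint08",
    "NumberInt": "Sint32",
    "NumberLong": "Sint64",
    "bsonTimestamp": "Sint64",
    "Date": "Sint64",          # mongo::Date in the table
    "NumberDouble": "Float64",
    "Decimal128": "Decimal128",
    "Oid": "Fixed",            # fixed-length binary, 12 bytes for an Oid
    "Code": "StrZero",
    "String": "StrZero",
    "Symbol": "StrZero",
    "RegEx": "TwoStrZero",
    "Array": "CarBin",
    "BinData": "CarBin",
    "CodeWScope": "CarBin",
    "Undefined": "Binary",     # stored as a Binary of length 0
    "jstNULL": "Binary",       # stored as a Binary of length 0
}


def terichdb_type(bson_type):
    """Look up the terichdb type for a BSON type name (hypothetical helper)."""
    if bson_type in ("MaxKey", "MinKey"):
        # The table lists no corresponding terichdb type for these two.
        raise ValueError(f"{bson_type} has no corresponding terichdb type")
    return BSON_TO_TERICHDB[bson_type]


print(terichdb_type("NumberInt"))  # Sint32
print(terichdb_type("Array"))      # CarBin
```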

Here is a sample terichdb schema in JSON (tablename=headline):

{
	"RowSchema": {
		"columns" : {
			"_id"    : { "type": "fixed"  , "uType": 16, "length": 12 },
			"ts"     : { "type": "sint32" , "uType": 16 },
			"type"   : { "type": "StrZero", "uType":  2 },
			"id"     : { "type": "StrZero", "uType":  2 },
			"manual" : { "type": "uint08" , "uType":  8 },
			"list"   : { "type": "nested" , "uType":  4 },
			"$$"     : { "type": "CarBin" ,
			   	"comment1": "bin data of the schema-less fields, must be the last field",
			   	"comment2": "schema-less fields may be many optional fields"
		   	}
		}
	},
	"TableIndex" : [
		{ "fields": "_id", "ordered" : true, "unique" : true },
		{ "fields": "+ts,-type", "ordered" : true },
		{ "fields": "id" , "ordered" : true }
	],
	"NestedSchemaSet": {
		"1": {
			"columns": {
				"date"   : { "type": "sint64" , "uType": 9 },
				"count"  : { "type": "sint32" , "uType": 16 },
				"title"  : { "type": "StrZero", "uType": 2 },
				"fromid" : { "type": "StrZero", "uType": 2 },
				"docid"  : { "type": "StrZero", "uType": 2 },
				"source" : { "type": "StrZero", "uType": 2 },
				"image"  : { "type": "StrZero", "uType": 2 },
				"tag"    : { "type": "StrZero", "uType": 2 },
				"is_news": { "type": "sint32" , "uType": 16 },
				"dtype"  : { "type": "sint32" , "uType": 16 },
				"$$"     : { "type": "carbin" }
			}
		}
	}
}

Another sample (tablename=news):

{
	"RowSchema": {
		"columns" : {
			"_id"    : { "type": "StrZero", "uType":  2 },
			"url"    : { "type": "StrZero", "uType":  2 },
			"fdts"   : { "type": "sint32" , "uType": 16 },
			"fi"     : { "type": "sint32" , "uType": 16 },
			"fts"    : { "type": "sint32" , "uType": 16 },
			"level"  : { "type": "uint08" , "uType": 16 },
			"score"  : { "type": "float32", "uType":  1 },
			"stat"   : { "type": "sint32" , "uType": 16 },
			"status" : { "type": "sint32" , "uType": 16 },
			"ts"     : { "type": "sint32" , "uType": 16 },
			"type"   : { "type": "sint32" , "uType": 16 },
			"il"     : { "type": "nested" , "uType":  3,
				"nested": {
					"columns": {
						"txt": { "type": "strzero", "uType": 2 },
						"url": { "type": "strzero", "uType": 2 }
					}
				}
			},
			"$$"     : { "type": "CarBin" , "uType":  3 }
		}
	},
	"TableIndex" : [
		{ "fields": "_id", "ordered" : true, "unique" : true },
		{ "fields": "ts" , "ordered" : true, "unique" : true }
	]
}
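
Both samples end their column list with a "$$" column of type CarBin, a convention spelled out in the notes that follow. As a rough illustration, the Python sketch below loads a shortened copy of this schema and checks that convention. The function `check_catch_all` is hypothetical, and treating "$$" as optional is an assumption; the post only says where it must sit and what type it must have.

```python
import json


def check_catch_all(columns):
    """Raise ValueError if "$$" is present but is not the last column or not CarBin."""
    names = list(columns)  # json.loads keeps the key order of the document
    if "$$" not in names:
        return  # assumption: a schema without "$$" is acceptable
    if names[-1] != "$$":
        raise ValueError('"$$" must be the last column')
    # Type names are case-insensitive, so compare in lowercase.
    if columns["$$"]["type"].lower() != "carbin":
        raise ValueError('"$$" must have type CarBin')


# A shortened copy of the news sample, for illustration only.
SCHEMA_TEXT = """
{
  "RowSchema": {
    "columns": {
      "_id": { "type": "StrZero", "uType": 2 },
      "ts":  { "type": "sint32",  "uType": 16 },
      "$$":  { "type": "CarBin",  "uType": 3 }
    }
  }
}
"""

schema = json.loads(SCHEMA_TEXT)
check_catch_all(schema["RowSchema"]["columns"])  # passes silently
```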

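Notes:

"mongoType" is the integer code of the corresponding MongoDB BSON type. The samples above appear to carry it in the "uType" key, whose values match BSON type codes (for example, 2 for string and 16 for 32-bit integer). Because terichdb has to interface with several databases, mongoType is designed specifically for MongoDB (there is also a mysqlType).
"type" names are case-insensitive: "strzero" and "StrZero" are the same type, a string terminated by the character '\0' (a C string).
A MongoDB Oid is a binary string with a fixed length of 12 bytes; it converts to the terichdb type Fixed with length=12.
In TableIndex:
1. fields names the indexed fields. It can describe a compound index over several fields, and the order of the fields matters.
2. A "+" or "-" before a field name selects the sort direction on that field: "+" means ascending and "-" means descending. Ascending is the default, so "+" may be omitted, but "-" may not.
3. An index can be ordered. Every MongoDB index is ordered, but terichdb also supports indexes with "ordered": false, which are hash indexes.
4. An index can be unique. The default is non-unique, meaning index keys may repeat.
A field named "$$" must be the last field of a schema, and its type must be carbin. It holds all the "other" fields: when it is hard to extract a fixed schema for a table, you can extract only the common fields and put everything else into this one free-form binary field.
NestedSchemaSet: when a field in a MongoDB document is an Object or an Array, it is really a nested "sub-document" that can have its own internal structure. One table can have several such NestedSchemas, each with a schema id; in the sample, "1" is the schema id. Schema ids must be integers, numbered from 1.
1. When the nested field is an Array (tablename=headline), set uType=4; nothing else is required.
2. When the nested field is an Object and a common schema can be extracted, write nested directly inside the column definition; no NestedSchemaSet is needed. Sample 2 (tablename=news) does this.
3. When the nested field is an Object but no common schema can be extracted, a NestedSchemaSet is required (there is no sample for this case); terichdb then decodes the field according to the schemaId in the NestedSchemaSet.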
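
The TableIndex rules can be sketched in a few lines of Python. This is only an illustration of the "fields" syntax described in the notes, not terichdb's parser: `parse_index_fields` and `describe_index` are hypothetical names, and treating a missing "ordered" key as true is an assumption, since both samples always spell it out.

```python
def parse_index_fields(fields):
    """Split a "fields" string such as "+ts,-type" into (name, direction) pairs."""
    pairs = []
    for item in fields.split(","):
        item = item.strip()
        if item.startswith("-"):
            pairs.append((item[1:], "desc"))  # "-" means descending
        elif item.startswith("+"):
            pairs.append((item[1:], "asc"))   # "+" means ascending
        else:
            pairs.append((item, "asc"))       # no sign: ascending by default
    return pairs


def describe_index(index):
    """Summarize one TableIndex entry (hypothetical helper)."""
    return {
        "fields": parse_index_fields(index["fields"]),
        # Assumption: a missing "ordered" key is treated as true.
        "kind": "ordered" if index.get("ordered", True) else "hash",
        # Indexes are non-unique unless "unique" is set.
        "unique": index.get("unique", False),
    }


print(parse_index_fields("+ts,-type"))
# [('ts', 'asc'), ('type', 'desc')]
print(describe_index({"fields": "_id", "ordered": True, "unique": True}))
# {'fields': [('_id', 'asc')], 'kind': 'ordered', 'unique': True}
```

Field order is preserved in the returned list because, as noted above, the order of fields in a compound index matters.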
