MongoDB建立index,可以设置为unique.若在之前有重复的键值,那么我们必须要去除掉重复的.

若我们使用的是原生的mongo且版本小于3.0,则unique为true外,外加参数dropDups为true即可.但除此之外,在mongo3.0+,tokumx,我们只能手工搞定.

一段如下的脚本可以帮助mongo去除带重复键的doc. 在第一行改为想要去重并加unique的collection name,第二行改为一个不存在,临时的collection name,以便存储过程数据.等脚本结束后第二行的collection会被删掉. 第5,13行需要指定field name. 比如我想要让sid这个字段值为唯一.

var coll="coll_to_remove"
var tmp_coll = "sid_dups"

db.getCollection(coll).mapReduce(
    function() { emit(this.sid,1); },  
    function(k,vals) { return Array.sum(vals); }, 
    {out:tmp_coll}
)

db.getCollection(tmp_coll).find({"value":{$gt:1}}).forEach(
    function(obj) { 
        var cur = db.getCollection(coll).find(
            {sid:obj._id},
            {_id:1}
        ); 
        var skip = true;
        while(cur.hasNext()) {
            var doc = cur.next();
            if (skip) {
                skip = false;
                continue
            }
            db.getCollection(coll).remove(
                {_id: doc._id}); 
        }
    }
)

db.getCollection(coll).ensureIndex({"sid":1},{unique:true})

db.getCollection(tmp_coll).drop()

Ref:

  • http://stackoverflow.com/questions/13190370/how-to-remove-duplicates-based-on-a-key-in-mongodb
  • http://glassonionblog.wordpress.com/2012/03/19/mongodb-how-to-check-for-duplicates-in-a-collection/
  • http://stackoverflow.com/questions/8405331/how-to-remove-duplicate-record-in-mongodb-by-mapreduce
Categories: Code

Yu

Ideals are like the stars: we never reach them, but like the mariners of the sea, we chart our course by them.

Leave a Reply

Your email address will not be published. Required fields are marked *