Betting the Farm on MongoDB

php中文网

发布时间：2016-06-07 16:34:37

1891人浏览过

来源于php中文网

原创

This is a guest post by Jon Dokulil, VP of Engineering at Hudl. Hudls CTO, Brian Kaiser, will be speaking at MongoDB World about migrating from SQL Server to MongoDB Hudl helps coaches win. We give sports teams from peewee to the pros onli

This is a guest post by Jon Dokulil, VP of Engineering at Hudl. Hudl’s CTO, Brian Kaiser, will be speaking at MongoDB World about migrating from SQL Server to MongoDB

Hudl helps coaches win. We give sports teams from peewee to the pros online tools to make working with and analyzing video easy. Today we store well over 600 million video clips in MongoDB spread across seven shards. Our clips dataset has grown to over 350GB of data with over 70GB of indexes. From our first year of a dozen beta high schools we’ve grown to service the video needs of over 50,000 sports teams worldwide.

Why MongoDB

When we began hacking away on Hudl we chose SQL Server as our database. Our backend is written primarily in C#, so it was a natural choice. After a few years and solid company growth we realized SQL Server was quickly becoming a bottleneck. Because we run in EC2, vertically scaling our DB was not a great option. That’s when we began to look at NoSQL seriously and specifically MongoDB. We wanted something that was fast, flexible and developer-friendly.

After comparing a few alternative NoSQL databases and running our own benchmarks, we settled on MongoDB. Then came the task of moving our existing data from SQL Server to MongoDB. Video clips were not only our biggest dataset, it was also our most frequently-accessed data. During our busy season we average 75 clip views per second but peak at over 800 per second. We wanted to migrate the dataset with zero downtime and zero data loss. We also wanted to have fail-safes ready during each step of the process so we could recover immediately from any unanticipated problems during the migration.

In this post we’ll take a look at our schema design choices, our migration plan and the performance we’ve seen with MongoDB.

Schema Design

In SQL Server we normalized our data model. Pulling together data from multiple tables is SQL’s bread-and-butter. In the NoSQL world joins are not an option and we knew that simply moving the SQL tables directly over to MongoDB and doing joins in code was a bad idea. So, we looked at how our application interacted with SQL and created an optimized schema in MongoDB.

Before I get into the schema we chose, I’ll try to provide context to Hudl’s product. Below is a screenshot of our ‘Library’ page. This is where coaches spend much of their time reviewing and analyzing video.

You see above a video playing and a kind of spreadsheet underneath. The video represents one angle of one clip (many of our teams film two or three angles each game). The spreadsheet contains rows of clips and columns of breakdown data. The breakdown data gives context to what happened in the clip. For example, the second clip was a defensive play from the 30 yard line. It was first and ten and was a run play to the left. This breakdown data is incredibly important for coaches to spot patterns and trends in their opponents play (as well as make sure they don’t have an obvious patterns that could be used against them).

When we translated this schema to MongoDB we wanted to optimize for the most-common operations. Watching video clips and editing clip metadata are our two highest frequency operations. To maximize performance we made a few important decisions.

We chose to encapsulate an entire clip per document. Watching a clip would involve a single document lookup. Because MongoDB stores each document contiguously on disk, it would minimize the number of disk seeks when fetching a clip not in memory, which means faster clip loads.
We denormalized our column names to speed up both writes and reads. Writes are faster because we no longer have to lookup or track Column IDs. A write operation is as simple as:
```
db.clips.update({teamId:205, _id:123}, 
{$set: {'data.PLAY TYPE':'Pass'}}) 
```
Reads are also faster because we no longer have to join on the ClipDataColumn table to get the column names. This comes at a cost of greater storage and memory requirements as we store the same column names in multiple documents. Despite that, we felt the performance benefits were worth the cost.

One of the most important considerations when designing a schema in MongoDB is choosing a shard key. Have a good shard key is critical for effective horizontal scaling. Data is stored in shards (each shard is a replica set) and we can add new shards easily as our dataset grows. Replica sets don’t need to know about each other, they are only concerned with their own data. The MongoDB Router (mongos) is the piece that sees the whole picture. It knows which shard houses each document.

When you perform a query against a sharded collection, the shard key is not required. However, there is a cost penalty for not providing the shard key. The key is used to know which shard contains the answer to your query. Without it, the query has to be sent to all shards in your cluster. To illustrate this, I’ve got a four shard cluster. The shard key is TeamId (the property is named ‘t’), and you can see that clips belonging to teams 1-100 live on Shard 1, 101-200 live on Shard 2, etc. Given the query to find clip ‘123’, only Shard 3 will respond with results, but Shards 1, 2 and 4 must also process and execute the query. This is known as a scatter/gather query. In low volume this is ok, but you won’t see the benefits of horizontal scalability if every query has to be sent to all shards. Only when the shard key is provided can the query be sent directly to Shard 3. This is known as a targeted query.

VISBOOM

AI虚拟试衣间，时尚照相馆。

下载

For our Clips collection, we chose TeamId as our shard key. We looked at a few different possible shard keys:

We considered sharding by clipId (_id) but decided against it because we let coaches organize clips into playlists (similar to a song playlist in iTunes or Spotify). While queries to all clips in a playlist are less common than grabbing an individual clip, they are common enough that we wanted it to use a targeted query.
We also considered sharding by the playlist Id, but we wanted the ability for clips to be a part of multiple playlists. The shard key, once set, is immutable. Clips can be added or removed from playlists at any time.
We finally settled on TeamId. TeamId is easily available to us when making the vast majority of our queries to the Clips collection. Only for a few infrequent operations would we need to use scatter/gather queries.

The Transition

As I mentioned, we needed to transition from SQL Server to MongoDB with zero downtime. In case anything went wrong, we needed fallbacks and fail-safes along the way. Our approach was two-fold. In the background we ran a process that ‘fork-lifted’ data from SQL Server to MongoDB. While that ran in the background, we created a multiplexed DAO (data access object, our db abstraction layer) that would only read from SQL but would write to both SQL and MongoDB. That allowed us to batch-move all clips without having to worry about stale data. Once the two databases were completely synced up, we switched over to perform all reads from MongoDB. We continued to dual-write so we could easily switch back to SQL Server if problems arose. After we felt confident in our MongoDB solution, we pulled the plug on SQL Server.

In step one we took a look at how we read and wrote clip data. That let us design an optimal MongoDB schema. We then refactored our existing database abstraction layer to use data-structures that matched the MongoDB schema. This gave us a chance to prove out the schema ahead of time.

Next we began sending write operations to both SQL and MongoDB. This was an important step because it allowed our data fork-lifting process work through all clips one after another while protecting us from data corruption.

The data fork-lifting process took about a week to complete. The time was due to both the large size of the dataset and our own throttling logic. We throttled the rate of data migration to minimize the impact on normal operations. We didn’t want coaches to feel any pain during this migration.

After the data fork-lift was complete we began the process of reading from MongoDB. We built in the ability to progressively send more and more read traffic to MongoDB. That allowed us to gain confidence in our code and the MongoDB cluster without having to switch all-at-once. After a while with dual writes but all MongoDB reads, we turned off dual writes and dropped the tables in SQL Server. It was both a scary moment (sure, we had backups… but still!) and very satisfying. Our SQL database size was reduced by over 80GB. Of that total amount, 20GB was index data, which means our memory footprint was also greatly reduced.

Performance

We have been thrilled with the performance of MongoDB. MongoDB exceeded our average performance goal of 100ms and, just as important, is consistently performant. While it’s good to keep an eye on average times, it’s more important to watch the 90th and 99th percentile performance metrics. With MongoDB our average clip load time is around 18ms and our 99th percentile times are typically at or under 100ms.

Clip load times during the same time period during season

Conclusion

Our transition from SQL Server to MongoDB started with our largest and most critical dataset. After having gone through it, we are very happy with the performance and scalability of MongoDB and appreciate how developer-friendly it is to work with. Moving from a relational to a NoSQL database naturally has a learning curve. Now that we are over it we feel very good about our ability to scale well into the future. Perhaps most telling of all, most new feature development at Hudl is done in MongoDB. We feel MongoDB lets us focus more on writing features to help coaches win and less time crafting database scripts.

原文地址：Betting the Farm on MongoDB, 感谢原作者分享。

The used command is not allowed with this MySQL version - 如何解决MySQL报错：该MySQL版本不允许使用该命令

The MySQL server is running with the --skip-grant-tables option - 如何解决MySQL报错：MySQL服务器正在使用--skip-grant-tables选项运行

The total number of locks exceeds the lock table size - 如何解决MySQL报错：锁数量超过了锁表大小

mysqladmin - MySQL 服务器管理程序

无法启动mysql 1067

相关专题

Python 自然语言处理（NLP）基础与实战

本专题系统讲解 Python 在自然语言处理（NLP）领域的基础方法与实战应用，涵盖文本预处理（分词、去停用词）、词性标注、命名实体识别、关键词提取、情感分析，以及常用 NLP 库（NLTK、spaCy）的核心用法。通过真实文本案例，帮助学习者掌握使用 Python 进行文本分析与语言数据处理的完整流程，适用于内容分析、舆情监测与智能文本应用场景。

2026.01.27

拼多多赚钱的5种方法拼多多赚钱的5种方法

在拼多多上赚钱主要可以通过无货源模式一件代发、精细化运营特色店铺、参与官方高流量活动、利用拼团机制社交裂变，以及成为多多进宝推广员这5种方法实现。核心策略在于通过低成本、高效率的供应链管理与营销，利用平台社交电商红利实现盈利。

109

2026.01.26

edge浏览器怎样设置主页 edge浏览器自定义设置教程

在Edge浏览器中设置主页，请依次点击右上角“...”图标 > 设置 > 开始、主页和新建标签页。在“Microsoft Edge 启动时”选择“打开以下页面”，点击“添加新页面”并输入网址。若要使用主页按钮，需在“外观”设置中开启“显示主页按钮”并设定网址。

2026.01.26

苹果官方查询网站苹果手机正品激活查询入口

苹果官方查询网站主要通过 checkcoverage.apple.com/cn/zh/ 进行，可用于查询序列号（SN）对应的保修状态、激活日期及技术支持服务。此外，查找丢失设备请使用 iCloud.com/find，购买信息与物流可访问 Apple (中国大陆) 订单状态页面。

136

2026.01.26

npd人格什么意思 npd人格有什么特征

NPD（Narcissistic Personality Disorder）即自恋型人格障碍，是一种心理健康问题，特点是极度夸大自我重要性、需要过度赞美与关注，同时极度缺乏共情能力，背后常掩藏着低自尊和不安全感，影响人际关系、工作和生活，通常在青少年时期开始显现，需由专业人士诊断。

2026.01.26

windows安全中心怎么关闭 windows安全中心怎么执行操作

关闭Windows安全中心（Windows Defender）可通过系统设置暂时关闭，或使用组策略/注册表永久关闭。最简单的方法是：进入设置 > 隐私和安全性 > Windows安全中心 > 病毒和威胁防护 > 管理设置，将实时保护等选项关闭。

2026.01.26

2026年春运抢票攻略大全春运抢票攻略教你三招手【技巧】

铁路12306提供起售时间查询、起售提醒、购票预填、候补购票及误购限时免费退票五项服务，并强调官方渠道唯一性与信息安全。

122

2026.01.26

个人所得税税率表2026 个人所得税率最新税率表

以工资薪金所得为例，应纳税额 = 应纳税所得额 × 税率 - 速算扣除数。应纳税所得额 = 月度收入 - 5000 元 - 专项扣除 - 专项附加扣除 - 依法确定的其他扣除。假设某员工月工资 10000 元，专项扣除 1000 元，专项附加扣除 2000 元，当月应纳税所得额为 10000 - 5000 - 1000 - 2000 = 2000 元，对应税率为 3%，速算扣除数为 0，则当月应纳税额为 2000×3% = 60 元。

2026.01.26