How to Price/Cost an ML or Data Project - Part 1
At Oxygen AI, we've tackled data projects (data science, data engineering, big data, data scraping, etc.) across a wide variety of industries, using various technologies and methods. This has given us insight into how much effort it takes to complete a project, and what types of costs and risks are involved beyond those of a conventional project.
As a result, we have developed a reliable intuition for project costing and pricing. This has been invaluable in building successful customer relationships and in sustaining a profitable practice.
When sharing the pricing of a potential project with a client or customer, it is crucial to be upfront about the risks involved as well as the cost components. In some situations, it's best to decline the project altogether.
This is the first of a few articles on pricing and costing data projects. Before we get into real numbers and financial estimates, I'll start with high-level notions that may be familiar to program and project managers.
Mature vs. Custom Applications
It has to be said that machine learning and deep learning are advancing rapidly. Many platforms (on AWS, GCP, and others) offer turnkey solutions for image recognition, natural language understanding, and other applications. Rather than engage a team of data scientists, it can be sufficient to use these platforms, and their pricing is transparent in most cases. However, for many users, these solutions are too inflexible or opaque for a custom need.
The same can be said for data scraping and data acquisition. Many turnkey platforms exist but may be too limited in scope or functionality for some users.
Thus, my focus will be on projects that stem from a custom need.
Dynamic Data vs Static Data
Another segmentation of data projects comes from the data source. Some applications draw on data with a relatively stable structure, whose relevant properties don't change over time. Accounting data comes to mind, where inputs to systems have to adhere to strict and well-established rules.
Other data follows no strict structure and has properties that change over time. A common example of dynamic data is web marketing data. People's preferences change over time, so a site that garners a flood of clicks today may only generate a trickle a month from now.
So, my other focus will be on projects that draw from relatively dynamic and unstructured data.
Data Projects are Different
Since data projects involve code and DevOps, there is a tendency to assume they have the same cost/risk profiles as web development projects. But there are a few key differences, depending on the specific project. Such differences include:
- Uncertainty of Success: This truth is rarely spoken, but projects that involve machine learning models have a high risk of failure compared to web development projects. Anyone can throw data at an algorithm to train a model and get an output. But can the resulting model's performance meet the demands of the end user? That is not a foregone conclusion. There are best practices for getting there, but applying them can sometimes feel more like data art than data science.
- Indirect Path to Success: Related to the last point, data projects are very iterative. In a conventional project, participants move sequentially through established steps. For data projects, such linearity is a luxury. Data practitioners move back and forth between phases until performance thresholds are achieved.
- Moving Targets: Related to dynamic data, a scraping script or machine learning model has to be robust enough to withstand unpredictable changes in the data sources or the data distributions. Either code must be written to deal with this uncertainty, or a proper level of support must be planned after a solution is delivered.
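The "moving targets" point can be made concrete with a small sketch: validate every scraped record against an expected schema, and fail loudly when too many records break, since a high failure rate usually signals that the source changed rather than a few bad rows. The field names and threshold below are hypothetical, not from any real project.

```python
# Hypothetical schema for records scraped from a product page.
EXPECTED_FIELDS = {"title": str, "price": float, "url": str}

def validate_record(record):
    """Return a list of problems; an empty list means the record looks sane."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"bad type for {field}: {type(record[field]).__name__}")
    return problems

def filter_records(records, max_failure_rate=0.2):
    """Keep valid records, but abort if too many fail validation --
    a high failure rate suggests the source structure changed."""
    valid = [r for r in records if not validate_record(r)]
    failure_rate = 1 - len(valid) / len(records) if records else 0.0
    if failure_rate > max_failure_rate:
        raise RuntimeError(
            f"{failure_rate:.0%} of records failed validation; "
            "source structure may have changed"
        )
    return valid
```

Code like this doesn't remove the uncertainty, but it turns a silent data-quality problem into a visible support event that can be planned and billed for.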
The factors above can have a significant impact on the project cost and schedule. Because of the non-linear and iterative nature of data projects, they can sometimes defy being boxed into traditional Gantt charts. So in practice, even if a project is presented linearly to a customer, internally the uncertainties have to be accounted for in some way. Such uncertainties impact:
- Labor hours: Ideally a task might take 5 hours, but one or two extra iterations could stretch this. Experience can serve as a guide to how many iterations are likely, and thus how many labor hours to expect.
- Infrastructure costs: CPU/GPU/TPU time is also a critical factor if cloud machines are being rented. The more iterations, the higher the bills.
- Increased support: After a project is delivered, what support is needed to update scripts, retrain models, and otherwise mitigate the impact of dynamic source data?
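One way to account for the first two uncertainties internally is a simple expected-value model: weight the plausible iteration counts by their probabilities, then apply labor and compute rates. All of the figures below (probabilities, rates, per-iteration costs) are illustrative placeholders, not real pricing guidance.

```python
def expected_cost(iteration_probs, hours_per_iteration,
                  hourly_rate, compute_per_iteration):
    """Estimate cost of an iterative task.

    iteration_probs: {iteration_count: probability}, e.g. drawn from
        experience with similar projects.
    """
    # Expected number of iterations across the distribution.
    exp_iters = sum(n * p for n, p in iteration_probs.items())
    labor = exp_iters * hours_per_iteration * hourly_rate
    compute = exp_iters * compute_per_iteration
    return {"expected_iterations": exp_iters,
            "labor": labor,
            "compute": compute,
            "total": labor + compute}

# A nominally 5-hour task that usually needs one or two extra passes:
estimate = expected_cost(
    iteration_probs={1: 0.3, 2: 0.5, 3: 0.2},
    hours_per_iteration=5,
    hourly_rate=100,
    compute_per_iteration=40,
)
```

The same structure extends naturally to the third factor: post-delivery support can be modeled as a recurring probability of rework per month rather than a one-off line item.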