Data Farming

This post is a follow-on from one of my earlier blog posts titled “Efficient Experimental Design”. If you haven’t read it, you can find it here. In the earlier post, I discussed the benefits of efficient experimental design and outlined some simple examples. In this post, I’ll be discussing the concept of data farming and the benefits of taking a data farming approach when designing an experiment.

Introduction

Data farming is a descriptive metaphor that captures the notion of generating data with the intention of maximizing the information gained from a simulation model. Data mining may be a more familiar term when thinking of “big data”. Susan Sanchez uses a very interesting metaphor to compare these two concepts.

If you think of miners in the real world, they search for valuable ore buried in the earth but have no control over what is there or where it lies. As they work, they gain information about the geology of the earth and can use this to improve future efforts. Data mining follows the same idea.
Now think about real-world farmers, who nurture the land to maximize their overall yield. They manipulate the land using farming techniques in order to increase their overall gain, and experiments can then be conducted to assess whether these techniques are effective in achieving this goal. Data farmers follow a similar process: they manipulate simulation models to maximize their overall (information) yield and "grow" their data in this way to uncover useful characteristics of a model.

The term “data farming” is also used in non-simulation contexts, again as a metaphor for a way of dealing with big data. In an industrial setting, data farming has been described as a means for “enhancing data on hand and determining the most relevant data that need to be collected” (Sanchez 2020). It is also used in healthcare, where it has been said that the goal of data farmers should be “to examine how best to use the tools available in our electronic systems to increase the volume of actionable data that are readily available”.

So data farming is used in a variety of contexts and can be described in many different ways, but all descriptions allude to the same goal: to “grow” the available data so that it provides more useful insights. In this post, I will mainly discuss data farming in a simulation context, as a natural follow-on from my simulation experiment design post.

A bit more detail…

The basic experimental design concepts discussed in my earlier post can be very useful, but greater insight can be gained when, rather than restricting the experiment to small designs, a large-scale data farming approach is taken in which space-filling designs such as Latin hypercubes (LH) are used from the outset.

A space-filling design, in simple terms, is one that places points throughout the experimental region with as few gaps as possible. A LH achieves this, although the inner workings of a LH are beyond the scope of this post; the important point is that the LH, along with many other designs, has good space-filling properties. If you’d like to learn more about designs such as the LH, you can find descriptions of many of them in the papers by Susan Sanchez referenced at the end of this post.
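As a concrete illustration, here is a minimal sketch of generating a LH design using SciPy’s quasi-Monte Carlo module. The number of factors, their ranges and the number of design points are purely illustrative assumptions, not recommendations.

```python
# A minimal sketch of a Latin Hypercube design using scipy.stats.qmc
# (available in SciPy >= 1.7). Factor names and ranges are illustrative.
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=3, seed=42)   # 3 input factors
unit_sample = sampler.random(n=50)           # 50 design points in [0, 1)^3

# Scale the unit hypercube to the experimental region of each factor,
# e.g. arrival rate, number of servers, mean service time.
l_bounds = [0.5, 1, 2.0]
u_bounds = [5.0, 10, 8.0]
design = qmc.scale(unit_sample, l_bounds, u_bounds)

print(design[:5])  # first five design points, one row per simulation run
```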

In a simulation optimization (SO) experiment, the method looks for some global optimum under a set of conditions using techniques such as stochastic gradient descent or stochastic trust-region methods. Choices then have to be made about a number of factors, such as the number of samples, the direction and size of each step, and the number of repetitions of the procedure. One potential drawback in this context is the computational cost: SO methods can take a very long time to converge to an optimum, especially if superfluous input factors are included.
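To make those choices concrete, below is a toy sketch of one common SO approach, a finite-difference stochastic gradient descent loop on a noisy response. The “simulation” here is just a stand-in quadratic, and the step size, perturbation and replication counts are arbitrary illustrative values rather than tuned settings.

```python
# Toy stochastic gradient descent on a noisy simulation response,
# illustrating the choices above: samples per gradient estimate,
# step size, and number of iterations.
import numpy as np

rng = np.random.default_rng(0)

def noisy_simulation(x):
    # Stand-in for a real simulation output: quadratic plus noise.
    return (x - 2.0) ** 2 + rng.normal(scale=0.5)

x, step, h = 0.0, 0.1, 0.2          # start point, step size, FD perturbation
for _ in range(200):                # number of iterations of the procedure
    # Average several replications to reduce noise in the gradient estimate.
    grad = np.mean([(noisy_simulation(x + h) - noisy_simulation(x - h)) / (2 * h)
                    for _ in range(5)])
    x -= step * grad                # move against the estimated gradient

print(f"estimated optimum near x = {x:.2f}")
```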

Rather than trying to optimize the solution directly, a data farming approach fits a metamodel to the response surface and uses this to guide the optimization. The resulting dataset has then been “grown” to capture the behaviour of the response(s) over the range of factors of interest. Metamodels can indicate which inputs are the key drivers of a simulation and allow us to ignore superfluous factors. They can also show whether non-linear or interaction effects exist and may reveal other characteristics of the response surface, while large space-filling designs also provide diagnostics such as lack-of-fit assessments. These models can further be used to identify undesirable solutions and the reasons for their poor performance, which aids understanding of the robustness of the system.
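As an illustration of the metamodelling idea, here is a minimal sketch that fits a second-order regression metamodel to a set of design points and responses and inspects the coefficients for key drivers. The data, factor names and use of scikit-learn are illustrative assumptions; in practice the design and responses would come from your own simulation runs.

```python
# A minimal sketch of fitting a regression metamodel to data "grown" over
# a space-filling design. The design and responses below are placeholders
# standing in for real simulation runs.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
design = rng.uniform(0, 1, size=(50, 3))                  # 50 runs, 3 factors
response = 4 * design[:, 0] + 2 * design[:, 0] * design[:, 1] \
           + rng.normal(scale=0.1, size=50)               # stand-in output

# Second-order terms let the metamodel capture curvature and interactions.
poly = PolynomialFeatures(degree=2, include_bias=False)
X = poly.fit_transform(design)
metamodel = LinearRegression().fit(X, response)

# Large coefficients point to key drivers; near-zero ones suggest
# superfluous factors that could be dropped from further experiments.
for name, coef in zip(poly.get_feature_names_out(["x1", "x2", "x3"]),
                      metamodel.coef_):
    print(f"{name:10s} {coef:+.2f}")
```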

Clearly, one of the major benefits here is that we not only get a desirable solution, we also gain further information about the general behaviour of the simulation model and better understand why that particular solution works well. Another benefit of having large amounts of data from a designed data farming experiment is that it limits the chance of spurious findings, which can cause problems when working with observational big data.

There are a number of ways in which a simulation analyst can incorporate data farming concepts into a study to gain extra insight from the time-consuming task of building and validating a simulation model. All of these design methods require some extra effort in advance of the experiment to put together the “computational nuts and bolts needed to automate the data farming process”, and timeliness is, of course, important in this context. Efficient DOE is a necessity once we decide to explore more than a handful of factors, but if factor levels must be changed manually, the analyst’s time becomes more of a concern than computation time. However, setting up a data farming environment from the outset can be very worthwhile, allowing us to automate the run-generation process and grow data without the worry of input errors.

The “nuts and bolts” of a data farming process, in summary, involve the following steps (a minimal sketch of the resulting loop follows the list):

  • Identify all input requirements for the model.
  • Choose a suitable design for the system and appropriate range of variation for the factors.
  • For each design point, modify the base design factor values to the current design point; execute the model with these settings; extract suitable output measures (if needed); collate the design point specification with the output measures and append to prior run results.
  • Repeat previous step for the desired number of replications.
  • Use statistical tools to analyze output.
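Below is a minimal sketch of this run-generation loop in Python. The `run_simulation` function, the factor names and the CSV output are hypothetical placeholders for whatever model, inputs and storage you actually use.

```python
# A minimal sketch of the data farming run-generation loop described above.
import csv

def run_simulation(arrival_rate, num_servers, replication):
    # Placeholder: call the real simulation model here and
    # return the output measures of interest.
    return {"mean_wait": arrival_rate / num_servers + 0.01 * replication}

design = [                       # one dict per design point (e.g. from a LH)
    {"arrival_rate": 1.0, "num_servers": 2},
    {"arrival_rate": 4.0, "num_servers": 5},
]
n_replications = 3

with open("farmed_runs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["arrival_rate", "num_servers",
                                           "replication", "mean_wait"])
    writer.writeheader()
    for point in design:
        for rep in range(n_replications):
            outputs = run_simulation(**point, replication=rep)
            # Collate the design point with its outputs and append to the results.
            writer.writerow({**point, "replication": rep, **outputs})
```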

Conclusion

The “nuts and bolts” of setting up a data farming experiment may require some additional time and effort, but this effort is reduced when modelling platforms are built with data farming in mind. I have taken the view of a simulation analyst here, but both the simulation and data analytics communities stand to benefit from gaining additional insight and growing understanding in a timely manner.

I hope you enjoyed this post and, as usual, feel free to leave a comment or contact me through the contact form here if you’d like to discuss. If you’d like to read further into DOE or data farming, some references are below. Thanks for reading!

Work Smarter, Not Harder: A Tutorial on Designing and Conducting Simulation Experiments – S. M. Sanchez & H. Wan

Data Farming: Methods for the Present, Opportunities for the Future – S. M. Sanchez

Better Big Data via Data Farming Experiments – S. M. Sanchez & P. J. Sanchez