When the inputs include epistemic uncertainty (uncertainty arising from incomplete knowledge, such as imprecise measurements), most of these computational problems are known to be NP-hard in general, but we have developed an array of work-around solutions that make practical calculations highly scalable.
Characterizing uncertainty on the dataset
Dubito provides multiple avenues for assessing and characterizing uncertainty in inputs, including:
- Significant Digit Conventions
- Natural Language Approximators
- Poisson Count Model
- Shlyakhter inflation
- User Specification
- Validation Correction
These methods ascribe estimated uncertainties to data values even if the data as provided lack any specification or statement about their imprecision. The methods can be combined.
Extra features
Dubito has built-in features that detect data problems and calculation errors that would otherwise render conclusions specious. For instance, it automatically checks that measurement dimensions balance and that units, if present, conform and can be compared or combined in mathematical operations.
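A minimal sketch of how such a dimension-and-units check might operate (hypothetical Python, not Dubito's actual engine): quantities carry unit exponents, multiplication combines them, and addition refuses incompatible dimensions.

```python
# Hypothetical illustration of automatic dimension checking; the class
# name and representation are assumptions, not Dubito's API.
class Quantity:
    def __init__(self, value, dims):
        self.value = value
        self.dims = dims          # e.g. {"m": 1, "s": -1} means m/s

    def __add__(self, other):
        # Only commensurable quantities may be added or compared.
        if self.dims != other.dims:
            raise ValueError(f"dimension mismatch: {self.dims} vs {other.dims}")
        return Quantity(self.value + other.value, self.dims)

    def __mul__(self, other):
        # Multiplication adds unit exponents; zeroed units are dropped.
        dims = dict(self.dims)
        for unit, power in other.dims.items():
            dims[unit] = dims.get(unit, 0) + power
            if dims[unit] == 0:
                del dims[unit]
        return Quantity(self.value * other.value, dims)

speed = Quantity(3.0, {"m": 1, "s": -1})
time = Quantity(2.0, {"s": 1})
dist = speed * time               # seconds cancel, leaving metres

try:
    speed + time                  # incompatible: flagged, not silently computed
except ValueError as e:
    print("caught:", e)
```

A real checker would also track unit prefixes and conversions, but the refusal to combine incommensurable quantities is the essential safeguard.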
Dubito also provides protection against overfitting, a grave yet very common methodological error that is rarely noticed in practice. Overfitting typically means statistical predictions are far less reliable than they appear. The software automatically considers the effects of model uncertainty, which arises from doubt about whether the analyst initially described the form of the model correctly and completely. Dubito takes these and other matters into consideration as it expresses the reliability of output calculations and inferences, ensuring that it does not overstate its conclusions.
Interpretability
Dubito expresses its output in understandable terms. It employs a variety of schemes for communicating uncertainty in computed results, chosen to remain clear despite the wide array of biases and misconceptions that psychometric research has documented in human cognition. It can use natural-language expressions to clearly communicate analytical results to data scientists and less-technical decision makers.
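One way such natural-language reporting could work is a graded vocabulary, loosely inspired by calibrated likelihood scales such as the IPCC's. The thresholds and the `describe` helper below are illustrative assumptions, not Dubito's actual vocabulary.

```python
# Hypothetical mapping from a computed probability interval to a
# natural-language qualifier; thresholds are illustrative, not Dubito's.
PHRASES = [
    (0.99, "almost certain"),
    (0.90, "very likely"),
    (0.66, "likely"),
    (0.33, "about as likely as not"),
    (0.10, "unlikely"),
    (0.00, "very unlikely"),
]

def describe(p_lo, p_hi):
    """Render a probability interval [p_lo, p_hi] as hedged English."""
    def phrase(p):
        # Return the first (highest) threshold the probability meets.
        for threshold, text in PHRASES:
            if p >= threshold:
                return text
        return PHRASES[-1][1]
    lo, hi = phrase(p_lo), phrase(p_hi)
    return lo if lo == hi else f"somewhere between {lo} and {hi}"

print(describe(0.70, 0.80))  # both ends fall in the same band: "likely"
print(describe(0.40, 0.95))  # a wide interval spans several bands
```

When the interval straddles several bands, the hedged phrasing itself signals that the analysis cannot narrow the answer further.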
Dubito is modular in design and ready for multiple environments. It can be deployed on the web or as a totally secured, behind-the-firewall solution.
Summary
Dubito is a library of predictive algorithms for data scientists, applicable to data sets ranging in size from very small to very large samples. These algorithms are specially designed to account for sensor imprecision, missing data, censored data, uncertain biases, and other epistemic uncertainties that conventional methods neglect. Dubito offers analytical solutions for highly heterogeneous data that include both quantitative and qualitative information.
The problem it solves
Dubito is based on the Quiet Doubt philosophy, which holds that uncertainty analysis is too important to be left in the hands of analysts, who sometimes neglect it or underestimate how critical the consideration of imprecision and uncertainty can be.
Traditional risk assessment strategies are not always up to the challenge, and this deficiency can be serious. For instance, Monte Carlo simulation is very widely used, but it is computationally slow and it does not always yield a complete picture of the uncertainty. As one analyst ruefully noted, “Random sampling is terrible at finding worst-case scenarios, although terrorists are pretty good at it.”
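The quoted weakness is easy to demonstrate. The sketch below (hypothetical code, not Dubito's) compares plain Monte Carlo sampling against the guaranteed interval-arithmetic bound for the product of two uncertain inputs: random sampling almost never lands near the worst-case corner.

```python
import random

# Two inputs known only to lie within intervals.
x_lo, x_hi = 0.0, 1.0
y_lo, y_hi = 0.0, 1.0

# Interval arithmetic: for a product, the extremes occur at endpoint
# combinations, so the worst case is found by checking four corners.
candidates = [a * b for a in (x_lo, x_hi) for b in (y_lo, y_hi)]
true_max = max(candidates)  # attained only at the corner (1, 1)

# Monte Carlo: uniform sampling rarely approaches that corner, so the
# sampled maximum understates the true worst case.
random.seed(0)
mc_max = max(random.uniform(x_lo, x_hi) * random.uniform(y_lo, y_hi)
             for _ in range(1000))

print(f"interval bound: {true_max:.3f}, Monte Carlo best: {mc_max:.3f}")
```

In higher dimensions the gap widens dramatically, which is exactly the "worst-case scenario" blindness the quotation complains about.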
Tech specs
The predictive algorithms in Dubito make predictions from available data, detect anomalies in data sets in real time, forecast future trends, and serve a host of other uses in risk analysis, strategic and tactical planning, forensic analyses, and vulnerability studies. The methods are also useful in system design and scenario gaming. The list of predictive algorithms currently being implemented in Dubito includes:
Software Description
Philosophy
The philosophy behind Dubito: Quiet Doubt
All data is uncertain, although it is easy to neglect this uncertainty. Quiet Doubt acknowledges and tracks this uncertainty through calculations without need for expertise in uncertainty analysis.
Quiet Doubt
Quiet Doubt arises from the idea that uncertainty analysis is too important to be left in the hands of analysts, who often lack the time, skill or resources to undertake the proper accounting of uncertainties. Existing programs and software tools for calculating with uncertainty, even those intended to be simple to use, require analysts to learn about theories or methods of uncertainty propagation to use them effectively. Software should conduct uncertainty quantification and propagation automatically without the user even knowing it is happening. Quiet Doubt is a software feature that facilitates the spread of routine uncertainty analysis by making it the responsibility of the software infrastructure rather than a working concern of analysts. It should be, to the maximum extent possible, an automated process that happens behind the scenes, much as spell checking and correction occurs quietly and unobtrusively as documents are originally typed. The software should interject with warnings only when reducing the implied precision of outputs no longer suffices to indicate the uncertainty of the resulting calculations. Quiet Doubt involves a wide variety of strategies for automating uncertainty quantification, propagation and reporting.
Significant digits
For instance, in order to assess uncertainty about inputs without requiring a user to characterize it explicitly, the software can recognize significant digits in all inputs as uncertainty-encoding conventions. Thus, as data values and parameters are entered into software, they are assumed to carry at least as much epistemic uncertainty as would be implied by the missing digits of their decimal representations. For instance, if a user enters (or the software reads from a file) the number 23.45, the software will interpret this input as the interval [23.445, 23.455]. Likewise, an entered value of 1200 is interpreted as the interval [1150, 1250]. These minimal uncertainties should be propagated through any calculations the software makes. Well-designed software will simultaneously recognize mathematical constants such as 3.14159, unit-conversion fractions, and the exponent 2 in a square as precise mathematical values and not apply the significant-digit interpretation to estimate uncertainty about them.
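A minimal sketch of the significant-digit convention just described (illustrative Python, not Dubito's actual parser):

```python
def sig_digit_interval(text):
    """Interpret a decimal string as the interval implied by its
    significant digits -- a sketch of the convention described above,
    not Dubito's actual input parser."""
    if "." in text:
        # Half a unit in the last written decimal place.
        half = 0.5 * 10 ** -(len(text) - text.index(".") - 1)
    else:
        # Trailing zeros of a bare integer are treated as non-significant.
        stripped = len(text) - len(text.rstrip("0"))
        half = 0.5 * 10 ** stripped
    value = float(text)
    return (value - half, value + half)

print(sig_digit_interval("23.45"))  # roughly [23.445, 23.455]
print(sig_digit_interval("1200"))   # [1150, 1250]
```

A production parser would also handle scientific notation and explicit tolerances, but the core idea is just this mapping from written precision to an interval.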
Hedge words
Quiet Doubt also understands linguistic hedges known as approximators (e.g., about, around, almost, up to, около, 左右, حدود), which are often used in natural languages to express the uncertainty attending numerical values. The implications of these approximators for the magnitude of the uncertainties have been quantitatively studied for English expressions. Research to quantitatively characterize the implications of approximators in other languages is relatively straightforward to conduct using online tools such as Amazon Mechanical Turk and games with a purpose in the sense of von Ahn.
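A hedged expression like "about 100" can be widened into an interval by a lookup of relative half-widths. The numbers below are illustrative placeholders, not the empirically calibrated values from the English-language studies mentioned above.

```python
# Hypothetical table of relative half-widths per approximator
# (illustrative values only, not empirically calibrated).
HEDGE_WIDTH = {"about": 0.05, "around": 0.10, "almost": 0.10}

def hedge_interval(word, value):
    """Turn a hedged expression such as 'about 100' into an interval."""
    width = HEDGE_WIDTH[word]
    if word == "almost":
        # 'almost' is one-sided: the quantity sits just below the value.
        return (value * (1 - width), value)
    return (value * (1 - width), value * (1 + width))

print(hedge_interval("about", 100))   # symmetric band around 100
print(hedge_interval("almost", 50))   # one-sided band below 50
```

Note that some approximators ("almost", "up to") are directional, so a real implementation must record sidedness as well as width.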
Other techniques, appropriate for particular situations, are available for estimating input uncertainties when they are not fully specified explicitly by the analyst. These methods ascribe estimated uncertainties to data values even when the data as provided lack any specification or statement about their imprecision, and even when the numbers are specified with many apparently significant digits.
Robust uncertainty analysis
After uncertainties about the inputs are characterized, Quiet Doubt automatically applies robust uncertainty propagation algorithms to all the calculations that underlie the analyses requested by a user. The intent of Quiet Doubt is to invisibly conduct appropriate ancillary uncertainty analyses. The software then modifies calculation outputs, reducing the number of digits in decimal numbers to reflect the reliability of each value in the face of the uncertainty analysis. When reducing the implied precision of outputs is no longer sufficient to represent the actual uncertainty associated with the resulting calculations, warning or error messages should appear in addition to, or in place of, the computed numbers. The point of Quiet Doubt is to be as unobtrusive as possible while ensuring that the software outputs do not mislead the user.
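The precision-reduction step can be sketched concretely: propagate intervals through a calculation, then report the midpoint with only as many decimals as the interval's width justifies. The `mul` and `report` helpers below are illustrative assumptions, not Dubito's actual propagation or formatting rules.

```python
import math

def mul(a, b):
    """Interval multiplication: the extremes lie at endpoint combinations."""
    products = [x * y for x in a for y in b]
    return (min(products), max(products))

def report(iv):
    """Format an interval's midpoint keeping only the decimals its
    half-width justifies -- a sketch of precision reduction, not
    Dubito's actual output rule."""
    lo, hi = iv
    half = (hi - lo) / 2
    if half <= 0:
        return repr((lo + hi) / 2)
    # Drop any decimal place finer than the interval's half-width.
    decimals = max(0, math.floor(-math.log10(2 * half)))
    return f"{(lo + hi) / 2:.{decimals}f}"

# A length known to one uncertain decimal times an imprecise width:
area = mul((9.9, 10.1), (2.0, 2.2))
print(report(area))  # the tenths digit is no longer reliable, so: 21
```

When the interval grows so wide that even the leading digit is unreliable, no rounding can express that honestly, which is the point at which the warning messages described above must take over.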
Intermediate strategies are also available when requested by users. For instance, brief textual summaries or graphical depictions that explain the trustworthiness of results can accompany them. Human perception of risks and uncertainties is well known to be affected by many cognitive biases. Quiet Doubt employs a variety of schemes for expressing uncertainty in computed results in formats that are most easily understood despite the wide array of biases and misconceptions that psychometric research has documented in human cognition. In particular, it can use natural-language expressions to clearly communicate analytical results to data scientists and less-technical decision makers.
Extensive checking
Quiet Doubt automatically makes a variety of other checks that help to ensure the integrity of the analyses. If an analysis makes an assumption about the underlying statistical distribution of a variable, the software checks that the available data plausibly follow that distribution and do not contradict the assumption. The software can also detect a variety of other data problems and calculation errors. It automatically checks that measurement dimensions balance and that units, if present, conform and can be compared or combined in the mathematical operations the analyst requests. As part of its automated uncertainty analysis, Quiet Doubt also provides protection against model overfitting, a grave yet very common methodological error that is rarely noticed in practice. The software can likewise automatically consider the effects of model uncertainty and other matters as it expresses the reliability of output calculations and inferences, ensuring that it does not overstate the conclusions.
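The distribution check mentioned above can be given a concrete flavor with the exponential case. The function below is a deliberately crude plausibility test (an illustrative assumption, not one of Dubito's actual checks): an exponential distribution is nonnegative and has its mean equal to its standard deviation, so data violating either property contradict the assumption.

```python
import statistics

def check_exponential_assumption(data, tol=0.25):
    """Crude plausibility check (illustrative only): exponential data
    must be nonnegative and have mean approximately equal to the
    standard deviation, within a relative tolerance."""
    if min(data) < 0:
        return False
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return abs(sd - mean) <= tol * mean

# Skewed, spread-out positives are consistent with the assumption:
print(check_exponential_assumption([0.1, 0.5, 1.2, 2.0, 4.1]))   # True

# Tightly clustered values contradict it (sd far below mean):
print(check_exponential_assumption([9.8, 10.1, 10.0, 9.9, 10.2]))  # False
```

A production system would use a proper goodness-of-fit test, but even a cheap screen like this catches the flat contradictions between assumed distribution and observed data that the text warns about.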