Navigating the development challenges in creating complex data systems

Dittmer, Sören; Roberts, Michael; Gilbey, Julian; Biguri, Ander; Preller, Jacobus; Rudd, James H. F.; Aston, John A. D.; Schönlieb, Carola-Bibiane

doi:10.1038/s42256-023-00665-x

Perspective
Published: 01 June 2023

Nature Machine Intelligence volume 5, pages 681–686 (2023)

Abstract

Data science systems (DSSs) are a fundamental tool in many areas of research and are now being developed by people with a myriad of backgrounds. This is coupled with a crisis in the reproducibility of such DSSs, despite the wide availability of powerful tools for data science and machine learning over the past decade. We believe that perverse incentives and a lack of widespread software engineering skills are among the many causes of this crisis and analyse why software engineering and building large complex systems is, in general, hard. Based on these insights, we identify how software engineering addresses those difficulties and how one might apply and generalize software engineering methods to make DSSs more fit for purpose. We advocate two key development philosophies: one should incrementally grow—not plan then build—DSSs, and one should use two types of feedback loop during development—one that tests the code’s correctness and another that evaluates the code’s efficacy.
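
The two feedback loops named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the toy threshold classifier, the synthetic data, and the names `fit_threshold`, `predict`, `check_correctness` and `evaluate_efficacy` are all hypothetical and chosen only for illustration. The first loop checks that the code does what it was written to do; the second checks whether what it does is useful, here by comparing held-out accuracy with a majority-class baseline.

```python
import random
from statistics import mean


def fit_threshold(xs, ys):
    """Toy model (hypothetical): threshold at the midpoint of the two class means."""
    mean0 = mean([x for x, y in zip(xs, ys) if y == 0])
    mean1 = mean([x for x, y in zip(xs, ys) if y == 1])
    return (mean0 + mean1) / 2


def predict(threshold, xs):
    """Label a point 1 if it lies above the threshold, else 0."""
    return [1 if x > threshold else 0 for x in xs]


def check_correctness():
    """Feedback loop 1: does the code compute what we intended?

    A fast, deterministic check on a hand-computable input.
    """
    threshold = fit_threshold([0.0, 2.0], [0, 1])
    assert threshold == 1.0
    assert predict(threshold, [0.5, 1.5]) == [0, 1]


def evaluate_efficacy(seed=0, n=1000):
    """Feedback loop 2: is the (correct) code any good at the task?

    Uses synthetic data (an assumption of this sketch) and returns
    held-out accuracy together with a majority-class baseline.
    """
    rng = random.Random(seed)
    ys = [rng.randint(0, 1) for _ in range(n)]
    xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

    split = n // 2
    threshold = fit_threshold(xs[:split], ys[:split])
    test_xs, test_ys = xs[split:], ys[split:]

    preds = predict(threshold, test_xs)
    accuracy = sum(p == y for p, y in zip(preds, test_ys)) / len(test_ys)
    positive_rate = sum(test_ys) / len(test_ys)
    baseline = max(positive_rate, 1 - positive_rate)
    return accuracy, baseline


if __name__ == "__main__":
    check_correctness()
    accuracy, baseline = evaluate_efficacy()
    print(f"held-out accuracy={accuracy:.3f}, majority baseline={baseline:.3f}")
```

Keeping the two checks separate means each can fail on its own: code can pass `check_correctness` and still fail to beat the baseline, and a good-looking accuracy says nothing about whether the implementation is correct.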

**Fig. 1: Consequences of a code that is or is not correct and is or is not efficacious.**

**Fig. 2: Visualization of bad and good software architectures.**

**Fig. 3: Visualization of Agile development.**

**Fig. 4: Illustration of how the usefulness of feedback loops depends on their alignment θ and cycle time t.**

**Fig. 5: When growing a DSS, you must be able to support the cherry on top as early as possible.**

Acknowledgements

We are grateful to the EU/EFPIA Innovative Medicines Initiative project DRAGON (101005122; S.D. and M.R., AIX-COVNET, C.-B.S.), Trinity Challenge BloodCounts! project (M.R., J.G. and C.-B.S.), EPSRC Cambridge Mathematics of Information in Healthcare Hub EP/T017961/1 (M.R., J.H.F.R., J.A.D.A. and C.-B.S.), Cantab Capital Institute for the Mathematics of Information (C.-B.S.), the European Research Council for Horizon 2020 grant no. 777826 (C.-B.S.), the Alan Turing Institute (C.-B.S.), the Wellcome Trust (J.H.F.R.), Cancer Research UK Cambridge Centre (C9685/A25177; C.-B.S.), the British Heart Foundation (J.H.F.R.), NIHR Cambridge Biomedical Research Centre (J.H.F.R.), HEFCE (J.H.F.R.), Leverhulme Trust project on ‘Breaking the non-convexity barrier’ (C.-B.S.), the Philip Leverhulme Prize (C.-B.S.), EPSRC grants EP/S026045/1 and EP/T003553/1 (C.-B.S.) and the Wellcome Innovator Award RG98755 (C.-B.S.). We are also grateful to Intel for financial support, I. Selby for creative input, and J.-C. Lohmann, S. Griffith, J. Tang and F. Zhang for comments and discussions.

Author information

These authors contributed equally: Sören Dittmer, Michael Roberts.

Authors and Affiliations

Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
Sören Dittmer, Michael Roberts, Julian Gilbey, Ander Biguri, Anna Breger, Jan Stanczuk & Carola-Bibiane Schönlieb
ZeTeM, University of Bremen, Bremen, Germany
Sören Dittmer
Department of Medicine, University of Cambridge, Cambridge, UK
Michael Roberts & James H. F. Rudd
Addenbrooke’s Hospital, Cambridge University Hospitals NHS Trust, Cambridge, UK
Effrossyni Gkrania-Klotsas, Judith Babar & Jacobus Preller
Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge, UK
John A. D. Aston
Department of Radiology, University of Cambridge, Cambridge, UK
Ian Selby, Jonathan R. Weir-McCall, Lorena Escudero Sánchez & Evis Sala
Faculty of Mathematics, University of Vienna, Vienna, Austria
Anna Breger
Department of Mathematics, University of Manchester, Manchester, UK
Matthew Thorpe
Royal Papworth Hospital, Cambridge, UK
Jonathan R. Weir-McCall
Language Technology Laboratory, University of Cambridge, Cambridge, UK
Anna Korhonen
Population Health and Genomics, School of Medicine, University of Dundee, Dundee, UK
Emily Jefferson
Department of Biomedical Imaging and Image-guided Therapy, Medical University of Vienna, Vienna, Austria
Georg Langs & Helmut Prosch
National Heart and Lung Institute, Imperial College London, London, UK
Guang Yang
Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
Jing Tang & Tolou Shadbahr
Data Science & Artificial Intelligence, AstraZeneca, Cambridge, UK
Philip Teare & Mishal Patel
Clinical Pharmacology & Safety Sciences, AstraZeneca, Cambridge, UK
Mishal Patel
contextflow GmbH, Wien, Austria
Marcel Wassin & Markus Holzer
Institute of Astronomy, University of Cambridge, Cambridge, UK
Nicholas Walton
Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
Pietro Lió

Consortia

AIX-COVNET Collaboration

Michael Roberts, Sören Dittmer, Ian Selby, Anna Breger, Matthew Thorpe, Julian Gilbey, Jonathan R. Weir-McCall, Effrossyni Gkrania-Klotsas, Anna Korhonen, Emily Jefferson, Georg Langs, Guang Yang, Helmut Prosch, Jacobus Preller, Jan Stanczuk, Jing Tang, Judith Babar, Lorena Escudero Sánchez, Philip Teare, Mishal Patel, Marcel Wassin, Markus Holzer, Nicholas Walton, Pietro Lió, Tolou Shadbahr, James H. F. Rudd, John A. D. Aston, Evis Sala & Carola-Bibiane Schönlieb

Corresponding authors

Correspondence to Sören Dittmer or Michael Roberts.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Ben MacArthur and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dittmer, S., Roberts, M., Gilbey, J. et al. Navigating the development challenges in creating complex data systems. Nat Mach Intell 5, 681–686 (2023). https://doi.org/10.1038/s42256-023-00665-x

Received: 11 July 2022
Accepted: 25 April 2023
Published: 01 June 2023
Issue Date: July 2023
DOI: https://doi.org/10.1038/s42256-023-00665-x