Abstract
Write-optimized data structures (WODS), offer the potential to keep up with cyberstream event rates and give sub-second query response for key items like IP addresses. These data structures organize logs as the events are observed. To work in a real-world environment and not fill up the disk, WODS must efficiently expire older events. As the basis for our research into organizing security monitoring data, we implemented a tool, called Diventi, to index IP addresses in connection logs using RocksDB (a write-optimized LSM tree). We extended Diventi to automatically expire data as part of the data structures’ normal operations. We guarantee that Diventi always tracks the N most recent events and tracks no more than \(N+k\) events for a parameter \(k < N\), while ensuring the index is opportunistically pruned. To test Diventi at scale in a controlled environment, we used anonymized traces of IP communications collected at SuperComputing 2019. We synthetically extended the 2.4 billion connection events to 100 billion events. We tested Diventi vs. Elasticsearch, a common log indexing tool. In our test environment, Elasticsearch saw an ingestion rate of at best 37,000 events/s while Diventi sustained ingestion rates greater than 171,000 events/s. Our query response times were as much as 100 times faster, typically answering queries in under 80 ms. Furthermore, we saw no noticeable degradation in Diventi from expiration. We have deployed Diventi for many months where it has performed well and supported new security analysis capabilities.
Similar content being viewed by others
Data availability
The anonymized traces from SuperComputing and source code for our IP address indexing tool and our test result data will be made publicly available with this publication.
Code availability
Our code and associated tools will be made publicly available with this publication.
Notes
Traditionally, each level in an LSM tree is implemented as a B-tree. However, modern systems maintain sorted runs instead and store metadata information in main-memory for each level (fence pointers that store information for every disk page of every run).
In the analysis section, we show in detail how the value of k derives from the values of several other parameters, but in practice k should be \( < N/3\).
\(e_{\text {hist}}\) This is slightly oversimplifying: actually \(I_{\text {compact}}\) and \(I_{\text {update}}\) both depend on ingestion rate, whereas \(e_{\text {hist}}\) actually depends on the rate of event creations, as determined by their timestamps. In a practical system, these two rates will be the same, as events should be ingested at the same rate as they are created. However, in the case of our test setup, we are ingesting a pre-generated backlog of data as fast as possible, so the ingestion rate is significantly higher than the event-creation rate.
References
Bender, M.A., Farach-Colton, M., Fineman, J.T., Fogel, Y.R., Kuszmaul, B.C., Nelson, J.: Cache-oblivious streaming B-trees. In: Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 81–92. ACM (2007)
Bender, M.A., Farach-Colton, M., Jannen, W., Johnson, R., Kuszmaul, B.C., Porter, D.E., Yuan, J., Zhan, Y.: An introduction to B\(^{\epsilon}\)-trees and write-optimization. Login USENIX Mag. 40(5), 22–28 (2015)
Berry, J.W., Porter, A.M.: Stateful streaming in distributed memory supercomputers. In: Slides from an Invited Talk at the Chesapeake Large-Scale Analytics Conference (CLSAC), October 2016. Slides archived at OSTI (2016). https://www.osti.gov/servlets/purl/1406959. Accessed 3 April 2018
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Brodal, G.S., Fagerberg, R.: Lower bounds for external memory dictionaries. In: Proceedings of the Fourteenth Annual ACM–SIAM Symposium on Discrete Algorithms, pp. 546–554. Society for Industrial and Applied Mathematics (2003)
Brodal, G.S., Demaine, E.D., Fineman, J.T., Iacono, J., Langerman, S., Munro, J.I.: Cache-oblivious dynamic dictionaries with update/query tradeoffs. In: Proceedings of the Twenty-First Annual ACM–SIAM Symposium on Discrete Algorithms, pp. 1448–1456. SIAM (2010)
Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26(2), 4 (2008)
Claise, B., Bryant, S.: Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of IP Traffic Flow Information. Technical Report, RFC 5101, January (2008)
Dean, J., Ghemawat, S.: LevelDB: a fast and lightweight key/value database library by Google (2021). http://code.google.com/p/leveldb/
DHS Cybersecurity and Infrastructure Security Agency (CISA): Einstein (2021). https://www.cisa.gov/einstein
Facebook: RocksDB (2021). https://github.com/facebook/rocksdb
Lakshman, A., Malik, P.: Cassandra: a decentralized structured storage system. ACM SIGOPS Oper. Syst. Rev. 44(2), 35–40 (2010)
Matsunobu, Y., Dong, S., Lee, H.: MyRocks: LSM-tree database storage engine serving Facebook’s social graph. Proc. VLDB Endow. 13(12), 3217–3230 (2020)
NetFlow, Cisco Systems Inc: Netflow Services Solutions Guide. Cisco System, Inc. (2007). http://www.cisco.com/en/US/docs/ios/solutions_docs/netflow/nfwhite.html
O’Neil, P., Cheng, E., Gawlick, D., O’Neil, E.: The log-structured merge-tree (LSM-tree). Acta Inform. 33(4), 351–385 (1996)
Pandey, P., Singh, S., Bender, M.A., Berry, J.W., Farach-Colton, M., Johnson, R., Kroeger, T.M., Phillips, C.A.: Timely reporting of heavy hitters using external memory. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 1431–1446 (2020)
Paxson, V.: Bro: a system for detecting network intruders in real-time. Comput. Netw. 31(23), 2435–2463 (1999)
Sarkar, S., Papon, T.I., Staratzis, D., Athanassoulis, M.: Lethe: a tunable delete-aware LSM engine. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 893–908 (2020)
Acknowledgements
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under Contract DE-NA0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.
Funding
This research was funded by Sandia National Labs LDRD 218336 and 222383. We also gratefully acknowledge support from NSF Grants CCF-2118832, CCF-2106827, CCF-1725543, CSR-1763680, CCF-1716252, CNS-1938709.
Author information
Authors and Affiliations
Contributions
All authors contributed to the concepts, design, analysis and experimentation planning. JV and DRD lead the implementation and testing and were supervised by EDT and TMK. The initial conception and inspiration for this work was a team effort as was the writing and revisions.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Vorobyeva, J., Delayo, D.R., Bender, M.A. et al. Using advanced data structures to enable responsive security monitoring. Cluster Comput 25, 2893–2914 (2022). https://doi.org/10.1007/s10586-021-03463-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10586-021-03463-5