Towards a semi-automatic classifier of malware through tweets for early warning threat detection
Published 2024-05-15
Keywords
- Malware
- Classification
- NLP
- Twitter
- Text Mining
Copyright (c) 2024 Claudia Lanza, Lorenzo Lodi
This work is licensed under a Creative Commons Attribution 4.0 International License.
Funding data
- Ministero dell'Università e della Ricerca
Grant numbers PON "Ricerca e Innovazione" 2014-2020 Asse IV, Azione IV.4, Azione IV.6, avviso DM 1062 del 10.08.2021, RTD-A a regime di tempo pieno, codice identificativo 1062_R10_INNOVAZIONE, settore concorsuale 11/A4, settore scientifico disciplinare M-STO/08.
Abstract
This paper presents a method for developing a malware ontology by detecting malware instances on Twitter. The ontology serves as a semi-automatic classifier fed by data extracted from tweets. The automatic part of the methodology relies on a pattern-based approach to detect trigger expressions that lead to new information about malware, whilst the manual part covers the evaluation of the results by domain experts, who also validate the reliability of the semantic relationships within the ontology. We present preliminary results from applying the methodology to tweets extracted from the MalwareBazaar database, showing how analysing the document collection through Natural Language Processing (NLP) tasks can support knowledge retrieval and document classification for building an early warning system for detected malware. The results reported in this paper date from 2023 and therefore refer to Twitter, the previous version of the social network now called X.
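As a rough illustration of what a pattern-based detection of trigger expressions might look like, the following minimal Python sketch scans a tweet for a few surface patterns and returns candidate relations for later expert review. The patterns, the relation names (`is_a`, `variant_of`, `targets`), and the `Candidate` type are hypothetical and are not taken from the paper; the authors' actual patterns and tooling may differ substantially.

```python
import re
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A candidate semantic relation awaiting validation by a domain expert."""
    relation: str
    subject: str
    obj: str


# Hypothetical trigger expressions (assumptions for illustration only).
# Each pattern names the two terms it links via the groups "subj" and "obj".
PATTERNS = [
    ("is_a", re.compile(r"\b(?P<subj>\w+) is an? (?P<obj>\w+)", re.IGNORECASE)),
    ("variant_of", re.compile(r"\b(?P<subj>\w+) is a variant of (?P<obj>\w+)", re.IGNORECASE)),
    ("targets", re.compile(r"\b(?P<subj>\w+) (?:targets|targeting) (?P<obj>\w+)", re.IGNORECASE)),
]


def extract_candidates(tweet: str) -> List[Candidate]:
    """Return every (relation, subject, object) candidate matched in one tweet."""
    found = []
    for relation, pattern in PATTERNS:
        for match in pattern.finditer(tweet):
            found.append(Candidate(relation, match.group("subj"), match.group("obj")))
    return found


if __name__ == "__main__":
    for text in ["Emotet is a trojan", "Qakbot targeting banks", "hello world"]:
        print(text, "->", extract_candidates(text))
```

In a semi-automatic workflow of the kind the abstract describes, the output of such a step would only be a list of candidates; deciding which of them enter the ontology remains the manual task of the domain experts.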