Tomáš Koctúr: If you are thinking about deploying machine learning, collect quality data as early as possible

Tomáš Koctúr works as a Data Science Engineer in the largest IT company in the east of Slovakia. According to him, artificial intelligence is one of the top areas of IT and in an interview he gave us an insight into the work of the data science team at Deutsche Telekom IT Solutions Slovakia.

What projects are you working on in Košice?

Most of our work is in the field of artificial intelligence called Natural Language Processing (NLP). What sets this area of machine learning apart is that we process text in unstructured form. We work on almost every type of task that NLP covers, and we focus on it because these are the tasks the telco industry mainly brings. Among the projects, I will mention semantic information retrieval, text data clustering, prediction and classification based on text input, chatbots and voicebots. Outside of NLP, we are partly involved in computer vision projects, and we are extensively involved in automatic speech recognition and text-to-speech. Recently, we have also been focusing on anomaly detection in our systems to complement, and perhaps later completely replace, standard monitoring tools for systems, applications or infrastructure.

Do you have a team for data science at Deutsche Telekom IT Solutions Slovakia?

Yes, three years ago we started building a team dedicated to this area. At the moment there are 8 data science engineers. We also have developers and automation engineers in the team. Even at this size we are probably the biggest data science team in the region, and we are still growing.

How long does it take to train the models?

It depends on the task and on the data. A simple model can be trained in 5 minutes. If we’re talking about models for automatic speech recognition, it takes a few weeks. In NLP, it can be months. Powerful graphics cards are used for training, which speeds up the training of models thanks to parallel data processing. But even a graphics card is no guarantee that a model will be trained quickly. For example, the currently largest NLP language model with 175 billion parameters would take 355 years to train on a single graphics card, according to its creators. Therefore, large numbers of graphics cards or dedicated ASIC circuits are used. Currently, the dominant trend is to make language models bigger, which I don’t entirely agree with, so the time required to train models is expected to increase.
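To make the arithmetic behind the 355-year figure concrete, here is a minimal sketch. It assumes perfect linear scaling across graphics cards, which real training runs do not achieve because of communication and synchronization overhead, so the results are optimistic lower bounds. The function name and the card counts are hypothetical and chosen only for illustration.

```python
def ideal_training_days(single_gpu_years: float, num_gpus: int) -> float:
    """Wall-clock days needed if the work split perfectly across num_gpus cards.

    Assumes ideal linear scaling, an optimistic simplification.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return single_gpu_years * 365.0 / num_gpus


# 355 single-card years is the figure quoted in the interview.
for gpus in (1, 100, 1_000, 10_000):
    days = ideal_training_days(355, gpus)
    print(f"{gpus:>6} cards: {days:,.1f} days")
```

Even under this idealized assumption, a thousand cards would still need roughly four months, which is why such models are trained on very large clusters.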

How many graphics cards do you use to train the models in your team?

Anywhere from a few to dozens, depending on the task. However, it should be noted that even in a corporate group like ours we cannot afford to train extremely large language models of the kind I mentioned in the previous answer. Firstly, training such a model would take a long time and cost a lot of money; secondly, we don’t even have enough data for it. It is important to realize that the bigger the model is, the more data it needs to be trained well. In practice, unless we are talking about Google, Microsoft or Facebook, people take large language models that have already been trained and only retrain them, that is, adapt them to the task at hand. In this case, the retraining “only” takes a few days or weeks.

What problems do you encounter in your work?

I think we face the same problems as other data scientists. A data scientist never has enough data or enough computing power. On the other hand, even with enough of both, a typical data scientist would make the model bigger and therefore use a larger amount of data and computing power. And to improve the model further, they would again need more data and computing power; it’s a vicious circle. We also often encounter poor data quality. If the data are bad, the trained model will be bad too. The other option is to clean the data, and nobody likes to do that. Therefore, my advice to anyone considering deploying a machine learning based solution in any domain is: collect the data, make sure the data quality is good, and start as early as possible. Of course, GDPR needs to be complied with.

Is it possible to deploy machine learning in business sectors where we don’t have collected data?

Yes, and it happens to us quite often. One option is to start collecting the necessary data when defining the task, and once we have collected enough data, we can train the model. Another option is to use transfer learning, where we take a model trained on another task and retrain it on the new task with a small amount of relevant data. A further option is data mining on the Internet. The Internet is large enough and often contains data that would be useful for training; we just need to get them. However, creativity and ingenuity are needed here, because even if the data are publicly available, the owner often protects them from automated collection.
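The transfer learning idea can be sketched in a few lines. This is a toy illustration under loud assumptions, not the team's method: `pretrained_features` is a hypothetical, hand-written stand-in for a model already trained on another task, and only a small classification head is trained on four invented example sentences. In a real project the frozen part would be a large pretrained network.

```python
import math

NEGATIVE = {"slow", "broken", "outage", "down"}
POSITIVE = {"thanks", "great", "fast", "works"}


def pretrained_features(text):
    """Hypothetical stand-in for a frozen, already trained model."""
    words = text.lower().split()
    n = max(len(words), 1)
    return [
        sum(w in NEGATIVE for w in words) / n,  # share of complaint-like words
        sum(w in POSITIVE for w in words) / n,  # share of praise-like words
        1.0,                                    # constant bias feature
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, text):
    """Probability that the text is a complaint, according to the new head."""
    feats = pretrained_features(text)
    return sigmoid(sum(w * f for w, f in zip(weights, feats)))


def train_head(samples, epochs=500, lr=1.0):
    """Train only the new head (logistic regression); the base stays frozen."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for text, label in samples:
            feats = pretrained_features(text)
            error = predict(weights, text) - label
            # Gradient step on the log loss for a single example.
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
    return weights


# Tiny, invented dataset for the new task: 1 = complaint, 0 = not a complaint.
small_dataset = [
    ("the internet is slow and broken", 1),
    ("network outage again", 1),
    ("thanks it works great", 0),
    ("fast connection thanks", 0),
]

head = train_head(small_dataset)
print(predict(head, "outage"), predict(head, "thanks"))
```

The design point is that only three head weights are learned from the new data, which is why a small labelled set can be enough once a suitable pretrained model exists.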

What do you think is the biggest benefit of AI?

First of all, we can scale work, and we don’t have to make decisions that we can teach a computer to make for us. Before, a human had to do a trivial task; today you train a model that can make several thousand decisions in a short time. We also have to realise that humans are not perfect and do not perform any task with 100% accuracy. ML models are not 100% accurate either, but in some machine learning tasks, models perform better than humans. These claims are, of course, confirmed by studies, but also by practice. For example, the most well-known electric autonomous cars have a lower accident rate in autonomous mode than human-controlled cars. But these cars also crash and, unfortunately, those crashes are not the fault of the people on board, which is why people are afraid of this technology. However, it is important to remember that it is statistically safer than if a human were driving.

In what areas do you see the representation of AI in the future?

We have to remember that AI has been around for decades. Some AI technologies are already so established that the public doesn’t even perceive them as artificial intelligence. For example, we have been using optical character recognition (OCR, or text recognition) on scans and photos for about 20 years now, and machine learning is used for that. Nowadays, we will more often see deployments in cognitive tasks such as speech recognition and synthesis, image and video recognition and processing, understanding unstructured text and generating meaningful text, and of course, combinations of these. Essentially, AI can be deployed for any task, and we see that a lot these days. Also, the area of Reinforcement Learning is developing fast and is mainly used in robotics and control (e.g. autonomous driving).

Some of the public perceive AI as a threat – is this feeling justified?

Partly because of the problem with the acceptance of data science that we have in our society, I became an ambassador for Deutsche Telekom IT Solutions Slovakia on this topic. People imagine AI as a terminator that will kill us all, and employees are often worried about losing their jobs. I don’t want to scare people, but even though current artificial intelligence is represented by machine learning, which is really just a very large mathematical formula, it may be partly true. AI can replace humans in a variety of job tasks that are trivial or repetitive. If a job task is done by AI, it is natural that a human will perform another, more complex or creative task. Even the first industrial revolution did not replace humans; it only replaced certain jobs, and the development of humanity created new jobs that did not exist before. To avoid being replaced by a machine, a person must keep developing. And to ensure that artificial intelligence does not kill us, regulation is needed at national and supranational level, for example so that critical and life-threatening applications of artificial intelligence are regulated appropriately.

What kind of regulation do you have in mind?

This is where I am trying to be involved as a member of the Commission for Ethics and Regulation of Artificial Intelligence, which was set up by the Ministry of Investment, Regional Development and Informatisation of the Slovak Republic. Development is unstoppable, and the question of ethics has been and will continue to be present. The problem of ethics and human rights in AI has already arisen in several countries, so it is essential to learn from others’ mistakes and to help adjust the regulations for this sector so that the ethical aspect is preserved, but also so that development in this area is not unnecessarily blocked. Our Commission assists the Ministry in commenting on these issues, whether within the EU, UNESCO or the OECD, as these regulations are mostly made at supranational level.

Edita Fabian Hilgartová
