Building the Machine Learning Infrastructure

Making intelligent and accurate predictions is the core objective of machine learning and artificial intelligence applications. To achieve that objective, a machine learning or artificial intelligence application needs clean, well-organized information in a robust ecosystem architecture.

Machine learning (ML) is the process by which a computer system makes predictions based on samples of past observations. There are various ML methods. In one common approach, the ML algorithm is trained on a labeled or unlabeled training data set to produce a model. New input data is then fed to the model, which makes a prediction. The prediction is evaluated for accuracy; if the accuracy is acceptable, the model is deployed. If the accuracy is not acceptable, the algorithm is retrained with an augmented training data set. This is only a high-level example, as in practice many more factors and steps are involved.
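The train / predict / evaluate / retrain loop described above can be sketched in a few lines of plain Python. The data, the trivial "mean per label" model, and the accuracy threshold below are illustrative assumptions, not part of any specific ML library:

```python
# Minimal sketch of the train / predict / evaluate / retrain loop.
# The model here is deliberately trivial: it learns the mean feature
# value for each label and predicts the nearest one.

def train(samples):
    """'Train' by computing the mean feature value per label."""
    grouped = {}
    for feature, label in samples:
        grouped.setdefault(label, []).append(feature)
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}

def predict(model, feature):
    """Predict the label whose learned mean is closest to the input."""
    return min(model, key=lambda label: abs(model[label] - feature))

def evaluate(model, holdout):
    """Fraction of holdout observations predicted correctly."""
    correct = sum(predict(model, f) == y for f, y in holdout)
    return correct / len(holdout)

# Labeled past observations: (feature value, label) -- illustrative data.
training = [(1.0, "low"), (1.2, "low"), (8.9, "high"), (9.3, "high")]
holdout = [(0.8, "low"), (9.0, "high")]

model = train(training)
accuracy = evaluate(model, holdout)

# Deploy only if accuracy is acceptable; otherwise augment the
# training set and train again, as the text describes.
if accuracy < 0.9:
    training += [(1.1, "low"), (9.1, "high")]  # augmented data (illustrative)
    model = train(training)
```

A real pipeline would swap in an actual learning algorithm and a held-out test set, but the control flow (train, predict, evaluate, retrain on failure) is the same.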

Machine Learning Example

Artificial intelligence (AI) takes machine learning to a more dynamic level, producing a feedback loop in which an algorithm can learn from its own experience. In many cases an intelligent agent perceives an environment, detects changes in that environment, and reacts to those changes based on the information and rules it has been taught.

Every AI program is dependent on information to make predictions and decisions. That information needs to be structured in the appropriate context to make informed decisions.

An example of appropriate context comes from an application of a robotic vacuum cleaner [1] that navigated a room on its own and was measured on whether it was doing a "good job." The metric chosen focused on "picking up the dirt": the volume of dirt vacuumed and the amount of time spent collecting it. Pursuing this objective, the vacuum learned that bumping into an object dislodged dirt, and that the most dirt was collected next to furniture and other objects. It therefore learned to bump objects harder to dislodge additional dirt, going as far as knocking over a plant, dumping its soil on the floor, and then collecting it. It consumed more energy, which in turn cost more, not to mention causing a mess, but it did a "good job" by the metric against which it was measured. Its behavior was based on the context of the information to which it had access.

Continuing with this approach increased expenses and decreased benefits.

The solution was to change perspective to a new metric, "clean the room and keep it clean." The application then learned to expend energy only in the areas that needed vacuuming, reducing the cost of energy consumed by the device. Accomplishing this new mission required additional sensors, which at first sight would seem to increase cost, but the energy savings paid that cost back with each run, producing significant value. The device now operated in terms of efficiency.
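The effect of the metric change can be made concrete with a small sketch. The state fields and weights below are hypothetical, chosen only to illustrate how the reward function shapes which behavior an agent prefers:

```python
# Illustrative sketch: the same two behaviors scored under two metrics.
# All fields and weights are hypothetical, for illustration only.

def reward_dirt_collected(state):
    # Original metric: reward raw dirt volume, ignoring energy and side effects.
    return state["dirt_collected"]

def reward_clean_room(state):
    # Revised metric: reward cleanliness, penalize energy use and mess made,
    # so creating dirt by knocking things over scores worse, not better.
    return (10 * state["fraction_clean"]
            - state["energy_used"]
            - 100 * state["mess_created"])

# Two candidate behaviors the agent could learn.
bump_the_plant = {"dirt_collected": 50, "fraction_clean": 0.6,
                  "energy_used": 30, "mess_created": 1}
spot_clean = {"dirt_collected": 5, "fraction_clean": 0.95,
              "energy_used": 8, "mess_created": 0}

# Under the original metric the destructive behavior scores higher;
# under the revised metric, efficient spot-cleaning wins.
```

The agent's "good job" is whatever the metric says it is, which is why choosing the right context and objective matters more than the learning algorithm itself.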

For AI, machine learning, and any type of analytics, the better the information is modeled, structured, and organized for fast retrieval, the more effective and efficient the processing will be.

Conversely, the more complex the model or structure, the more complex the processing.


AI and ML algorithms that search for patterns in unstructured or non-relational data still need structure. Even schema-less data must be wrangled into meaningful structures. AI and ML algorithms are most effective when the enterprise architecture enables efficient access and retrieval of information for specific contexts. The ingestion framework for an enterprise ecosystem architecture therefore needs to consider the information and data needed for machine learning and analytics. The landed data should serve as a single usage point from which data can be consumed across multiple applications and platforms: in other words, land once, use many.

Kylo is an open source solution for data ingestion and data lake management. It employs NiFi templates to build an ingestion pipeline with cleansing, wrangling, and governance, transforming data into the meaningful structures needed for machine learning and analytics.

Kylo Workflow

Kylo provides an ingestion framework that is a key component of any machine learning infrastructure. It leverages NiFi and Spark and is flexible enough to incorporate other technologies. The framework includes a wrangling component that facilitates the transformation of data into the meaningful structures that ML and AI rely on to make enhanced predictions. Data lineage is also captured in the framework to enforce governance. The framework accelerates the development process and the iterations critical to constantly improving model accuracy.
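The kind of cleansing and standardization such a pipeline performs can be sketched in plain Python. This is a generic illustration, not Kylo's API: Kylo drives these transformations through NiFi and Spark, and the column names, raw data, and normalization rules below are hypothetical:

```python
# Generic sketch of a cleansing/standardization step in an ingestion
# pipeline: trim whitespace, normalize casing, canonicalize country
# codes and dates, and reject rows that fail a basic governance check.
import csv
import io

# Hypothetical raw feed with inconsistent casing, spacing, and formats.
RAW = """customer,signup_date,country
 Alice ,2021-03-04,us
BOB,03/05/2021,US
,2021-03-06,USA
"""

def standardize(row):
    name = row["customer"].strip().title()
    country = row["country"].strip().upper()
    # Map aliases to one canonical code (illustrative mapping).
    country = {"USA": "US"}.get(country, country)
    date = row["signup_date"].strip()
    # Normalize MM/DD/YYYY into ISO 8601 (YYYY-MM-DD).
    if "/" in date:
        m, d, y = date.split("/")
        date = f"{y}-{m}-{d}"
    return {"customer": name, "signup_date": date, "country": country}

def ingest(raw_csv):
    reader = csv.DictReader(io.StringIO(raw_csv), skipinitialspace=True)
    rows = [standardize(r) for r in reader]
    # Governance-style validation: drop rows missing a required field.
    return [r for r in rows if r["customer"]]
```

After ingestion the feed is uniform ("Alice"/"Bob", ISO dates, one country code) and invalid rows are quarantined, which is exactly the "meaningful structure" downstream ML and analytics depend on.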

Boosting business outcomes with the best ML and AI applications truly relies on a robust machine learning infrastructure and a well-thought-out ecosystem architecture. Kylo is a Teradata-sponsored open source project under the Apache 2.0 license that provides an extensible framework for the machine learning infrastructure. Teradata also provides an ecosystem architecture consulting service that harnesses the vast experience of technology professionals in combining the right mix of technologies and data platforms into an efficient digital ecosystem.


[1] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Upper Saddle River, New Jersey: Pearson Higher Education, 1995, pp. 46-61.


Pat Alvarado is a Teradata Certified Master providing technical consultation on analytic ecosystem architecture, workload distribution, and multi-genre analytics across multiple platform and analytics technologies.

Pat started his career as a hardware engineer building test instrumentation for mil-spec components and, later, point-of-sale systems for fine dining restaurants. After developing firmware for his microcontroller hardware designs, Pat moved into software engineering, developing data management applications with open source GNU software on distributed UNIX servers and disk-less workstations running the Berkeley Software Distribution (BSD), which departed from proprietary AT&T UNIX and became known as FreeBSD.

Pat joined Teradata in 1989 providing technical education to hardware engineers on the DBC/1012 architecture and was part of the team building out the parallel software development environment on ClearCase.

Presently, Pat provides consulting and thought leadership on relational database management systems (RDBMS), document and NoSQL database systems, Hadoop distributed file systems (HDFS), exploratory analytics platforms, and more, both on-premises and in the cloud, via SQL, MapReduce, SQL-MR, Java, Python, and other open source languages and architectures for structured, unstructured, and evolving schemas. Pat manages technical consultants in the development and implementation of data analytics and MapReduce extensions in Java, C++, R, and other languages, as well as in the development of relational and dimensional data models and universal modeling to bridge relational and non-relational schemas in support of business processes.

He also serves as a Japan liaison, establishing relationships between U.S. and Japan organizations through business processes, leveraging cultural approaches to manufacturing such as continuous process flow, workload leveling (heijunka), and continuous improvement (kaizen).
