Big Data
A modern, scalable approach to Business Intelligence
Processes, technologies and tools that help organisations collect, organise, analyse and present data to support business decisions: this is the traditional definition of B.I. When applied to today’s reality, however, it cannot ignore technical and technological complexities such as:
- Processing of data in large quantities and produced at high frequency, often in real time.
- Design of storage and processing architectures able to grow gradually over time in proportion to requirements, safeguarding investments in hardware and software.
- The need to integrate information organised in very heterogeneous formats, ranging, for example, from text files to relational databases to video streams.
These, and many others (data quality, governance, AI, etc.), are the aspects that characterise that sector now generally known as Big Data.
Humanativa boasts pioneering experience in this field, thanks to which it is able to provide expertise covering every aspect in terms of methodology, technology, design and implementation.
Data Platform
Choosing an organisational model
Many of the features and principles underlying the new B.I. require not only adequate technical tools, but above all changes at the organisational level, the creation of competencies and the definition of responsibilities. For instance, it is necessary to establish which corporate figures/units are responsible for:
- Defining governance policies
- Monitoring compliance with governance criteria (e.g. auditing)
- Defining quality criteria
- Monitoring compliance with quality criteria (e.g. data stewardship)
Two technical/organisational approaches are currently considered the main options for the proper design and management of a data platform: Data Fabric and Data Mesh.
Humanativa offers its expertise in this area to guide customers in making the most suitable choices for their context.
Data Fabric
The Forrester Wave publication of Q4 2016 illustrates the fundamental properties of this architecture, which is centralised from both a technical and an organisational point of view; these characteristics make it suitable for small, medium and large companies with a typically pyramid-shaped organisation chart.
Humanativa applies its Big Data expertise to support customers, whether Large Enterprises or SMEs, in building data fabrics both in the cloud and on-premises, based on latest-generation commercial and open source products.
Data Mesh
In 2018, Zhamak Dehghani, an expert in emerging technologies at Thoughtworks, formulated the new Data Mesh paradigm for data platforms. Since it provides for decentralised platform management, it is most applicable to medium and large organisations characterised by multiple organisational units with a high degree of independence.
This paradigm requires the support of advanced enabling technologies, such as data virtualisation, query federation, identity federation and data product lifecycle management, which the major cloud players have only recently begun to make available. Humanativa’s skills can guide the customer to full use of them.
From Data Warehouse to Data Lakehouse
The architectural foundations of a data platform
Data Warehouse
Humanativa’s experience in the field of data integration / B.I. has its roots in the period before the advent of Big Data, when large data warehouses were built on consolidated, mainly relational, DBMS technologies. These were characterised by monolithic, poorly scalable infrastructures and strong constraints in terms of data modelling but, at the same time, were very robust (being transactional) and easily usable by end users and data scientists thanks to standard query languages (SQL).
Data Lake
With the explosive growth of Big Data, traditional technologies could no longer support the characteristics of the new, large flows of incoming information: unstructured, heterogeneous in nature, and in such volumes that they could no longer be handled with monolithic architectures and structured logical modelling paradigms.
Humanativa has followed, from the outset, the birth and evolution of the new distributed, scalable and unstructured storage technologies, starting with the now well-known Hadoop HDFS and its rich ecosystem of products such as Hive, HBase and Spark.
It was at this stage that the concept of the Data Lake was born: a scalable, unstructured data repository oriented towards Big Data.
Data Lakehouse
The Data Lake paradigm, thanks to its scalability and versatility, soon became a standard in the Big Data sphere. At the same time, however, its limitations in terms of usability became apparent: lower robustness due to the lack of atomic update operations, the impossibility of modifying information already written (write-once/read-many paradigm), the need to transform unstructured information a posteriori into structured data marts queryable via SQL, and so on.
A new approach, the Data Lakehouse, is therefore now emerging as an architectural standard, combining the benefits of its predecessors in a single solution. Humanativa’s expertise in this context is well aligned with new supporting technologies such as Apache Iceberg and Delta Lake.
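As an illustration of the difference, here is a minimal sketch, assuming a Spark environment with the Delta Lake extensions available (the table path and sample data are purely illustrative): unlike a write-once/read-many data lake, a lakehouse table format allows records to be updated in place with ACID guarantees.

```python
from pyspark.sql import SparkSession

# Delta Lake is assumed to be installed on the cluster; these are its standard Spark settings
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write a small, illustrative batch of events as a Delta table
events = spark.createDataFrame([(1, "open"), (2, "click")], ["event_id", "action"])
events.write.format("delta").mode("overwrite").save("/data/lakehouse/events")

# Update a record in place: not possible on a plain write-once/read-many data lake
spark.sql("UPDATE delta.`/data/lakehouse/events` SET action = 'view' WHERE event_id = 1")
```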
Data Pipeline
Architectures for data acquisition in the Big Data context
Lambda Architecture
Humanativa has gained extensive experience in the design and implementation of data pipelines in the Big Data field, based on the consolidated Lambda architecture, which makes it possible to tap into both batch and real-time data sources, using open source technologies such as Kafka and Spark, including their commercial serverless cloud equivalents, as well as data integration products such as Talend, IBM DataStage, Power BI, etc.
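By way of example, a minimal sketch of the two Lambda layers in PySpark (paths, topic and broker names are illustrative assumptions, and the Kafka source requires the spark-sql-kafka connector): a batch layer periodically recomputes complete views from the master dataset, while a speed layer maintains a low-latency view over the same events.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Batch layer: periodic, complete recomputation from the historical master dataset
batch_views = (
    spark.read.parquet("/data/master/events")          # illustrative path
    .groupBy("user_id")
    .agg(count("*").alias("total_events"))
)
batch_views.write.mode("overwrite").parquet("/data/serving/batch_views")

# Speed layer: incremental, low-latency view over the same stream of events
speed_views = (
    spark.readStream
    .format("kafka")                                    # needs the spark-sql-kafka package
    .option("kafka.bootstrap.servers", "broker:9092")   # illustrative broker
    .option("subscribe", "events")
    .load()
    .groupBy(col("key").cast("string").alias("user_id"))
    .agg(count("*").alias("recent_events"))
)

# The serving layer would then merge batch and speed views at query time
query = (
    speed_views.writeStream
    .outputMode("complete")
    .format("memory")
    .queryName("speed_views")
    .start()
)
```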
Kappa Architecture
Thanks to its mastery of streaming technologies such as Kafka, Spark Streaming, Flink, etc., which can also be used in serverless mode on the major Cloud platforms, Humanativa is able to effectively support the customer in the realisation of data pipelines more oriented towards streaming, based on the so-called Kappa paradigm, which emphasises the role of real-time processing in both data acquisition and use.
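A minimal Kappa-style sketch in PySpark Structured Streaming (broker, topic and paths are illustrative assumptions): every dataset is treated as a stream, and reprocessing is performed by replaying the Kafka log from the earliest offset rather than by running a separate batch layer.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()

events = (
    spark.readStream
    .format("kafka")                                     # needs the spark-sql-kafka package
    .option("kafka.bootstrap.servers", "broker:9092")    # illustrative broker
    .option("subscribe", "events")
    .option("startingOffsets", "earliest")               # replaying from the start = reprocessing
    .load()
    .selectExpr("CAST(key AS STRING) AS user_id",
                "CAST(value AS STRING) AS payload",
                "timestamp")
)

# Continuously materialise the stream into serving storage (could equally be a lakehouse table)
(
    events.writeStream
    .format("parquet")
    .option("path", "/data/serving/events")              # illustrative path
    .option("checkpointLocation", "/data/checkpoints/events")
    .outputMode("append")
    .start()
)
```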
Vendors and Technologies for Data Platforms
The business intelligence sector, particularly in its application to the Big Data context, is still expanding rapidly, both in terms of market and of technology. Humanativa endeavours to stay constantly up to date with the offering of products and services, in both the commercial and the open source spheres.
Consultants certified on Microsoft Azure, Amazon AWS, Google Cloud Platform, Cloudera Data Platform and Databricks are able to support the customer in all phases of the implementation of data platforms, both in the cloud and on-premises, following an approach tailored to their specific technical, security and business requirements.
Thanks to its long-standing, in-depth experience with open source Big Data technologies, Humanativa is able to operate at any level of the architectural stack with products such as Apache Hadoop, Ozone, Iceberg, Delta Lake, Spark, Kafka, Trino, Ranger, Atlas, etc.
On the (visual) data integration side, Humanativa offers expertise in widely used commercial products such as Talend, Microsoft Power BI, IBM DataStage and Informatica, as well as open source products such as Apache NiFi. On the front-end side, its know-how ranges from established products such as Tableau and Qlik to emerging open source products such as Apache Superset and Metabase.
Advanced applications
Thanks to its in-depth knowledge of programming languages such as Scala and Python, as well as of data integration and data science frameworks such as Apache Spark, TensorFlow, Keras, Pandas and scikit-learn, Humanativa is able to build highly performant data pipelines, both in the ingestion phases and in the application of machine learning models.
For many Clients with particularly critical performance requirements, Humanativa has implemented complex Spark extract, load and transform (ELT) processes based entirely on dynamic Scala code, configurable in a user-friendly manner yet extremely performant, robust and versatile in its possible applications.
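The sketch below is not Humanativa’s Scala implementation, but an illustrative PySpark example of the general configuration-driven ELT idea, in which sources, transformations and targets are described in a configuration rather than hard-coded (all names and paths are hypothetical).

```python
from pyspark.sql import SparkSession

# Hypothetical configuration; in a real pipeline it would come from a file or a catalog
config = {
    "source": {"format": "parquet", "path": "/data/raw/orders"},
    "transform_sql": "SELECT customer_id, SUM(amount) AS total FROM src GROUP BY customer_id",
    "target": {"format": "parquet", "path": "/data/curated/orders_by_customer"},
}

spark = SparkSession.builder.appName("configurable-elt").getOrCreate()

# Extract and load the raw data as-is, then transform it inside the platform (ELT)
src = spark.read.format(config["source"]["format"]).load(config["source"]["path"])
src.createOrReplaceTempView("src")

result = spark.sql(config["transform_sql"])
(
    result.write.mode("overwrite")
    .format(config["target"]["format"])
    .save(config["target"]["path"])
)
```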
Apache Zeppelin
Humanativa generally adopts Apache Zeppelin as a user interface for prototyping Spark applications, for analysis and, in particular contexts, for scheduling processes.
Apache Airflow
Humanativa has chosen Apache Airflow as its open source product of choice for process orchestration. Major cloud vendors also offer Airflow in their marketplaces or provide a version already integrated into their environment (e.g. Google Cloud Composer).
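As a reference, a minimal Airflow DAG sketch (assuming a recent Airflow 2.x; the DAG name, task names and the called functions are illustrative): a daily ingestion step followed by a transformation step.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    # Placeholder for an ingestion step (e.g. pulling files or reading from Kafka)
    print("ingesting raw data")

def transform():
    # Placeholder for a transformation step (e.g. submitting a Spark job)
    print("transforming data")

with DAG(
    dag_id="example_data_pipeline",   # hypothetical DAG
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # 'schedule' is the Airflow 2.4+ parameter name
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    ingest_task >> transform_task     # transform runs only after ingestion succeeds
```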
Jupyter
Over the years, Jupyter has established itself as a de facto standard in the data science sector, as a notebook interface for Python code for data analysis and the building of machine learning models.
Humanativa offers support in this area both at the architectural level, for the deployment of the product, and in its day-to-day use, thanks to the expertise of its data scientists.
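As an indication of the kind of work typically done in a notebook, a minimal sketch with pandas and scikit-learn (the dataset and column names are purely illustrative):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative dataset: whether a customer churned, given two simple features
df = pd.DataFrame({
    "monthly_spend": [20.0, 55.0, 12.5, 80.0, 33.0, 95.0],
    "support_tickets": [0, 3, 1, 5, 2, 6],
    "churned": [0, 1, 0, 1, 0, 1],
})

X_train, X_test, y_train, y_test = train_test_split(
    df[["monthly_spend", "support_tickets"]], df["churned"],
    test_size=0.33, random_state=42,
)

model = LogisticRegression().fit(X_train, y_train)
print("accuracy on the held-out rows:", model.score(X_test, y_test))
```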
Our Open Source Data Fabric Solution
Designed with the aim of lowering recurring licence costs on the one hand and, on the other, of using state-of-the-art technologies actively supported by the open source community, the Big Data architectural solution developed by Humanativa consists of a family of products distributed in containerised form, both on-premises and in the cloud, which together functionally cover all the requirements of a modern data fabric:
- Data Lakehouse
- Data ingestion / data processing
- Data analytics / business intelligence
- Data governance / data quality / security
- Monitoring / auditing