Get the latest scoop or just follow and discuss our latest studies.
Nutch’s HtmlParser parses the whole page and returns the parsed text, outlinks and additional metadata. Some of this output, such as the outlinks, is really useful, but that’s basically it. The problem is that the parsed text is too general for precise data extraction. Fortunately the HtmlParser provides [...]
Release management and continuous integration processes are a real reflection of a team’s maturity and experience. The software development process itself requires a considerable amount of coordination and communication to make sure that every piece of the puzzle fits perfectly, so that the product can be delivered to the market or the customer.
After a short introduction to Big Data analytics on Hadoop in the first article of this series, this one introduces the vendor market and the spectrum of Big Data analytics platforms. Completely commercial enterprise solutions provided by SAP, Oracle, IBM, Microsoft and other big [...]
The need to easily access, store and analyze high-volume unstructured data, commonly known as Big Data, was initially triggered by Facebook, Amazon, Yahoo and Google, who began implementing Big Data models to satisfy their interaction data requirements, which relational databases, traditional ETL and BI could no longer handle. The [...]
To have a functional testing tool that lets us cover test cases for a variety of technologies, we have extended JMeter. Below you can find some extensions created by our company. The source code of the Apache JMeter components is available on GitHub. Samplers [...]
Preface: The amount of data stored in databases and files is growing every day, which creates a need to build cheaper, maintainable and scalable environments capable of storing large amounts of data (“Big Data”). Conventional RDBMS systems have become too expensive and are not scalable enough for today’s needs, so it is [...]
Engineers in charge of Test Automation (hereafter “TA”) face different challenges in their projects, depending on team size, geographic location of the teams, project complexity, technologies, the methodology/lifecycle used, etc. In this whitepaper, we focus on understanding and dealing with TA challenges common in Agile environments. The fundamental difference between Agile [...]
The Problem: One of the fundamental questions in any business is how to spend time. Striving to systematically save time on every single step in a software development process is often overlooked. This paper discusses how automated software checks provide valuable insight into the state of a product, which saves time [...]
Software development includes functional testing no matter which methodology you are using: waterfall, agile or any other. With the rapid growth of software development expectations and new technologies like Hadoop, we need a tool that can fulfill most of our functional testing needs. Apache JMeter is one of these tools, [...]
A paper on identifying and skipping processed data: an effort to minimize wasted cloud resources in Hadoop when processing data from HDFS. Problem: The main problem that I’m trying to describe and resolve in this paper is identifying and skipping processed data on Hadoop. In turn this helped us [...]
At its core, HBase is a distributed, persistent, sparse, column-oriented, multidimensional data repository. Multiple dimensions of data are supported by having multiple versions of each HBase table cell. This enables unique identification of each cell by three keys: row key, column and version. The following is a simplified view of how [...]
Recent blog posts
- Developing Cross-Platform Mobile Applications Using PhoneGap
- Alternating Nutch flow to store fetched data to CSV
- Precise data extraction with Apache Nutch
- Apache Nutch Overview
- Bad vs Good Search Experience
- Regression Suite
- Apache Solr – Slow queries and frequent terms
- Amazon Elastic MapReduce – Part 2 (Amazon S3 Input Format)
- HBase backup, anyone?
- Hadoop powered BI and Analytics – Part 2